In [1]:
import pandas as pd

#### Technique: Variable-Length, Character-Separated Strings (string splitting)

This researcher wanted to allow strings of any length in her filenames, so that names like "Joe" and "Josephine" would both fit naturally. Her filename convention was this:

`<Subject>_<Date>_<SessionCondition>_<SessionNum>.<FileExtension>`

Because the underscore ("_") is the separator between the variables (and the dot "." always separates the file extension), the code for extracting those variables is quite simple: it just "splits" the string along those separators.

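| Code | Description |
| :-- | :-- |
| **Indexing by Separated Value** (i.e. "Splitting and Indexing" a String) |  |
| `values = "hello_world".split('_')` | Split on the underscore; returns the list `['hello', 'world']` |
| `hello = "hello_world".split('_')[0]` | Split, then take the first piece |
| `world = "hello world".split(' ')[1]` | Split on a space, then take the second piece |
| `hello, world = "hello world".split(' ')` | Split and unpack both pieces into separate variables |
| `basename, extension = "filename.txt".split('.')` | Separate the base name from the file extension |
| `hello, *rest = "hello dog cat bunny cow".split(' ')` | Keep the first piece and collect all remaining pieces in the list `rest` |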
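
The patterns in the table can be tried out together. This is only a small sketch; the variable names are illustrative:

```python
# split() returns a list of the pieces found between the separators
values = "hello_world".split('_')
print(values)        # ['hello', 'world']

# Index into the list to pick out a single piece
hello = "hello_world".split('_')[0]
world = "hello world".split(' ')[1]

# Unpack directly when the number of pieces is known in advance
basename, extension = "filename.txt".split('.')

# Star-unpacking collects all the remaining pieces into a list
first, *rest = "hello dog cat bunny cow".split(' ')
print(first, rest)   # hello ['dog', 'cat', 'bunny', 'cow']
```

Note that unpacking raises a `ValueError` if the number of pieces does not match the number of variables, which can be a useful early warning that a filename does not follow the convention.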

**Exercises**

**Example**: The filename convention here is `<Subject>_<Date>_<Group>_<SessionNum>.<FileExtension>`.  Extract the date from this filename into its own variable:

In [2]:
fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
data = base.split('_')
date = data[1]
date

'20241008'

Extract the group from this filename into its own variable:

In [3]:
fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
data = base.split('_')
group = data[2]
group

'control'

Extract all the data from this filename into a dictionary:

In [4]:
fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
subject, date, group, num = base.split('_')
data = {"Subject": subject, "Date": date, "Group": group, "SessionNum": num}
data

{'Subject': 'Arthur',
 'Date': '20241008',
 'Group': 'control',
 'SessionNum': '1'}

Using a for-loop, extract the data from the filenames below into a session metadata table (feel free to copy-paste and adjust the solution from the previous section!). Include the original filename in its own column, to make finding the file later simpler:

In [5]:
fnames = ["Arthur_20241008_control_1.txt", "Josephine_20241009_control_1.txt", "Arthur_20241010_treatment_2.txt", "Joseph_20241011_control_2.txt"]
fnames

['Arthur_20241008_control_1.txt',
 'Josephine_20241009_control_1.txt',
 'Arthur_20241010_treatment_2.txt',
 'Joseph_20241011_control_2.txt']

In [6]:
all_sessions = []
for fname in fnames:
    base, ext = fname.split('.')
    data = base.split('_')
    session = {
        "Subject": data[0],
        "Date": data[1],
        "Group": data[2],
        "SessionNum": int(data[3]),
        "Filename": fname,
    }
    all_sessions.append(session)

df = pd.DataFrame(all_sessions)
df

|  | Subject | Date | Group | SessionNum | Filename |
| :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | Arthur | 20241008 | control | 1 | Arthur_20241008_control_1.txt |
| 1 | Josephine | 20241009 | control | 1 | Josephine_20241009_control_1.txt |
| 2 | Arthur | 20241010 | treatment | 2 | Arthur_20241010_treatment_2.txt |
| 3 | Joseph | 20241011 | control | 2 | Joseph_20241011_control_2.txt |
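
One caveat with `fname.split('.')`: unpacking into two variables fails if the filename contains more than one dot. The sketch below is a slightly more defensive variant, assuming that only the last dot marks the extension and that dates are always written as `YYYYMMDD`. The helper name `parse_session` is hypothetical and not part of the exercise:

```python
from datetime import datetime

def parse_session(fname):
    """Parse '<Subject>_<Date>_<Group>_<SessionNum>.<ext>' into a dict (hypothetical helper)."""
    # rsplit with maxsplit=1 splits only at the last dot
    base, ext = fname.rsplit('.', 1)
    subject, date, group, num = base.split('_')
    return {
        "Subject": subject,
        # Assumes the date is always written as YYYYMMDD
        "Date": datetime.strptime(date, "%Y%m%d").date(),
        "Group": group,
        "SessionNum": int(num),
        "Filename": fname,
    }

session = parse_session("J.Smith_20241008_control_1.txt")
print(session["Subject"], session["Date"])   # J.Smith 2024-10-08
```

Converting the date and session number while parsing means that later sorting and filtering work on real dates and integers rather than on strings.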


## Self-Describing Metadata: Getting Key-Values Directly from a String

### Searching the String for Patterns using index()

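Sometimes you want to find a piece of information in a string by relying on there being specific text right before the data you want. The `index()` method returns the position at which that text starts, and slicing from there extracts the data.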

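| Code | Description |
| :-- | :-- |
| `idx = "JoeSess1".index("Sess")` | Find the position where "Sess" starts (here, `3`) |
| `sessNum = "JoeSess1"[idx+4 : idx+5]` | Slice out the single character right after "Sess" (here, `'1'`); the 4 is the length of "Sess" |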
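
A short sketch of how `index()` behaves, including what happens when the marker text is missing. Using `len(marker)` instead of a hard-coded 4 is just one way to avoid counting characters by hand:

```python
text = "JoeSess1"
marker = "Sess"

# index() returns the position where the marker starts
idx = text.index(marker)

# Everything after the marker, to the end of the string
sess_num = text[idx + len(marker):]
print(idx, sess_num)            # 3 1

# index() raises a ValueError when the marker is not in the string...
try:
    text.index("Trial")
    found = True
except ValueError:
    found = False

# ...whereas find() returns -1 instead of raising
missing = text.find("Trial")
print(found, missing)           # False -1
```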


The following filenames use a different naming convention:

`<SessionID>_<BrainRegion>-d1=<ImageHeightInPixels>,d2=<ImageWidthInPixels>.<FileExtension>`

Using `index()` to find the `d1=` section of this filename, extract the image height:

In [20]:
fname = "242_CA1-d1=720,d2=1080.tif"
start_idx = fname.index("d1=") + len("d1=")
end_idx = fname.index(",")
height = int(fname[start_idx:end_idx])
height

720

Using `index()` to find the `d2=` section of this filename, extract the image width:

In [25]:
fname = "2045_CA3-d1=1080,d2=720.tif"
start_idx = fname.index("d2=") + len("d2=")
end_idx = fname.index(".")
width = int(fname[start_idx:end_idx])
width

720

Using `index()` to find the `_` and `-` separators in this filename, extract the brain region:

In [26]:
fname = "24_DG-d1=720,d2=720.tif"
start_idx = fname.index("_") + len("_")
end_idx = fname.index('-')
brain_region = fname[start_idx:end_idx]
brain_region

'DG'

Extract all the data from the following filenames in a loop to build a session table. Include the original filename in its own column, to make finding the file later simpler:

In [46]:
fnames = ["242_CA1-d1=720,d2=1080.tif", "2045_CA3-d1=1080,d2=720.tif", "24_DG-d1=720,d2=720.tif", "52313_CA1-d1=720,d2=720.tif", "4_DG-d1=1080,d2=1080.tif"]
fnames

['242_CA1-d1=720,d2=1080.tif',
 '2045_CA3-d1=1080,d2=720.tif',
 '24_DG-d1=720,d2=720.tif',
 '52313_CA1-d1=720,d2=720.tif',
 '4_DG-d1=1080,d2=1080.tif']

In [47]:
sessions = []

for fname in fnames:

    # Session ID
    start_idx = 0
    end_idx = fname.index('_')
    session_id = fname[start_idx:end_idx]
    
    # Height
    start_idx = fname.index("d1=") + len("d1=")
    end_idx = fname.index(",")
    height = int(fname[start_idx:end_idx])

    # Width
    start_idx = fname.index("d2=") + len("d2=")
    end_idx = fname.index(".")
    width = int(fname[start_idx:end_idx])

    # Brain Region
    start_idx = fname.index("_") + len("_")
    end_idx = fname.index('-')
    brain_region = fname[start_idx:end_idx]

    session = {"SessionID": session_id, "Height": height, "Width": width, "BrainRegion": brain_region, "Filename": fname}
    sessions.append(session)

df = pd.DataFrame(sessions)
df

|  | SessionID | Height | Width | BrainRegion | Filename |
| :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 242 | 720 | 1080 | CA1 | 242_CA1-d1=720,d2=1080.tif |
| 1 | 2045 | 1080 | 720 | CA3 | 2045_CA3-d1=1080,d2=720.tif |
| 2 | 24 | 720 | 720 | DG | 24_DG-d1=720,d2=720.tif |
| 3 | 52313 | 720 | 720 | CA1 | 52313_CA1-d1=720,d2=720.tif |
| 4 | 4 | 1080 | 1080 | DG | 4_DG-d1=1080,d2=1080.tif |
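
The loop above repeats the same "find the start marker, find the end marker, slice" steps for each field. One possible way to reduce the repetition is a small helper; `text_between` is a hypothetical name introduced here for illustration, not something the exercise requires:

```python
def text_between(text, start, end):
    """Return the part of `text` after the first `start` and before the next `end`."""
    start_idx = text.index(start) + len(start)
    # Search for `end` only after the start marker, so an earlier
    # occurrence of `end` elsewhere in the string is ignored
    end_idx = text.index(end, start_idx)
    return text[start_idx:end_idx]

fname = "242_CA1-d1=720,d2=1080.tif"
height = int(text_between(fname, "d1=", ","))
width = int(text_between(fname, "d2=", "."))
brain_region = text_between(fname, "_", "-")
print(height, width, brain_region)   # 720 1080 CA1
```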


### Variable-Length Data on Variable Keys: Using a Double-Separator to Store Keys Directly in the Filename

**Example**: Extract all the data from the filename:

In [50]:
fname = "sess=232_subj=Bill_grp=Control.txt"
base, ext = fname.split('.')
data = {}
for item in base.split('_'):
    key, value = item.split('=')
    data[key] = value

data

{'sess': '232', 'subj': 'Bill', 'grp': 'Control'}

Extract all the data from the filename:

In [52]:
fname = "sessId-11_height-720_width-1028_region-DG.tif"
base, ext = fname.split('.')
data = {}
for item in base.split('_'):
    key, value = item.split('-')
    data[key] = value

data

{'sessId': '11', 'height': '720', 'width': '1028', 'region': 'DG'}

Extract all the data from the following filenames in a loop to build a session table. Include the original filename in its own column, to make finding the file later simpler:

In [58]:
fnames = ["sessId-11_height-720_width-1028_region-DG.tif", "sessId-13_height-720_width-720.tif", "height-720_width-1028_region-DG_sessId-110.tif", "height-720_width-1028_region-DG_sessId-110_quality-bad.tif"]
fnames

['sessId-11_height-720_width-1028_region-DG.tif',
 'sessId-13_height-720_width-720.tif',
 'height-720_width-1028_region-DG_sessId-110.tif',
 'height-720_width-1028_region-DG_sessId-110_quality-bad.tif']

In [62]:
sessions = []

for fname in fnames:
    base, ext = fname.split('.')
    session = {}
    session["filename"] = fname
    for item in base.split('_'):
        key, value = item.split('-')
        session[key] = value
    
    sessions.append(session)

df = pd.DataFrame(sessions)
df

|  | filename | sessId | height | width | region | quality |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | sessId-11_height-720_width-1028_region-DG.tif | 11 | 720 | 1028 | DG | NaN |
| 1 | sessId-13_height-720_width-720.tif | 13 | 720 | 720 | NaN | NaN |
| 2 | height-720_width-1028_region-DG_sessId-110.tif | 110 | 720 | 1028 | DG | NaN |
| 3 | height-720_width-1028_region-DG_sessId-110_qua... | 110 | 720 | 1028 | DG | bad |
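
With key-value filenames, every extracted value is still a string (for example `'720'`, not `720`). The sketch below shows one way to build the dictionary in a single step and convert selected fields to integers. The list `int_fields` is an assumption about which keys hold numbers; it is not stated anywhere in the filenames themselves:

```python
fname = "height-720_width-1028_region-DG_sessId-110_quality-bad.tif"
base, ext = fname.rsplit('.', 1)

# maxsplit=1 splits each item only at its first '-',
# so a value that itself contains '-' stays in one piece
session = dict(item.split('-', 1) for item in base.split('_'))

# Assumed list of keys whose values should be integers
int_fields = ["sessId", "height", "width"]
for key in int_fields:
    if key in session:          # keys may be missing in some filenames
        session[key] = int(session[key])

print(session)
# {'height': 720, 'width': 1028, 'region': 'DG', 'sessId': 110, 'quality': 'bad'}
```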


#### (Extra Demo) Making Data Model Contracts Explicit With Schemas

The file naming conventions we've looked at so far all rely on some extra explanation (sometimes called a "contract" or a "schema") shared between the filename and the code that analyzes it, so that the code can make sense of the filename.

Python provides some tools for making schemas explicit, as a data model.  Here, we'll look at the standard library's named tuple feature:


In [7]:
from collections import namedtuple

MetadataModel = namedtuple("MetadataModel", "subject date group sess_num")

fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
data_tuple = MetadataModel(*base.split('_'))
data_tuple


MetadataModel(subject='Arthur', date='20241008', group='control', sess_num='1')

Named tuples can be converted to dictionaries using the `_asdict()` method.

In [8]:
data_dict = data_tuple._asdict()
data_dict

{'subject': 'Arthur', 'date': '20241008', 'group': 'control', 'sess_num': '1'}
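
As a possible extension, the standard library's `typing.NamedTuple` lets the schema also record the expected type of each field. The sketch below is one way this could look; the class name `SessionMetadata` and the `from_filename` method are hypothetical and chosen only for illustration:

```python
from typing import NamedTuple

class SessionMetadata(NamedTuple):
    """Schema for '<Subject>_<Date>_<Group>_<SessionNum>.<ext>' filenames."""
    subject: str
    date: str
    group: str
    sess_num: int

    @classmethod
    def from_filename(cls, fname):
        # Split off the extension at the last dot, then split the fields
        base, ext = fname.rsplit('.', 1)
        subject, date, group, sess_num = base.split('_')
        # Type annotations are not enforced at runtime, so convert explicitly
        return cls(subject, date, group, int(sess_num))

meta = SessionMetadata.from_filename("Arthur_20241008_control_1.txt")
print(meta)
# SessionMetadata(subject='Arthur', date='20241008', group='control', sess_num=1)
```

Keeping the parsing code next to the field definitions puts the whole filename "contract" in one place, and `meta._asdict()` still works for building a table.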