In [1]:
import pandas as pd

# Parsing Metadata from Filenames: Working with Strings, Lists, and Dicts

## Key-Value Mappings: Dictionaries

| Code | Description |
| :-- | :-- |
| data = {} | Make an empty Dict | 
| data = {'a': 3, 'b': 5} | Make a Dict with two items: "a" and "b" |
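| data['a'] | Get the value stored under the key "a" (here, 3) |
| data['c'] = 7 | Add a new item "c" with the value 7 (or overwrite it if "c" already exists) |
| list(data.keys()) | Get a list of all the keys in the Dict |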




**Exercises**

The `image` dict describes how researcher Tom's recording is formatted:

In [5]:
image = {'height': 1920, 'width': 1080, 'format': 'RGB', 'order': 'F'}
image

{'height': 1920, 'width': 1080, 'format': 'RGB', 'order': 'F'}

Write the code to print out the width of the image, by accessing the `"width"` key:

In [6]:
image['width']

1080

What is the height of the image?

In [7]:
image['height']

1920

How are the pixel data in the image formatted?

In [8]:
image['format']

'RGB'

What happens if you use the same approach to find out which key has the value `1080`?  What does this tell you about what key-value maps like dictionaries are designed for?

In [14]:
image[1080]

KeyError: 1080
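
Dictionaries are built for looking up a value from its key, not the other way around. If the key for a given value is really needed, one option is to check every item in turn. A minimal sketch, reusing the `image` dict from above (the name `matching_keys` is only illustrative):

```python
image = {'height': 1920, 'width': 1080, 'format': 'RGB', 'order': 'F'}

# Look at every key-value pair and keep the keys whose value matches
matching_keys = [key for key, value in image.items() if value == 1080]
print(matching_keys)  # ['width']

# The `in` operator checks the keys, not the values
print(1080 in image)           # False
print(1080 in image.values())  # True
```

The result is a list because several keys can share the same value.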

Make a dictionary: reorganize the code below to tell Python that the three variables belong together, by putting them into a dictionary called `session`.

In [10]:
subject = "Josie"
date = "2023-07-23"
group = "control"

session = {'subject': subject, 'date': date, 'group': group}
session

{'subject': 'Josie', 'date': '2023-07-23', 'group': 'control'}

Check that the dictionary is constructed properly by getting the subject from it. It should show "Josie".

In [12]:
session['subject']

'Josie'

Merge two dictionaries: combine `default_session` and `today_vars` into a new dictionary called `session1` using the `|` operator. Where both dictionaries have the same key, the value from the right-hand dictionary wins:

In [17]:
default_session = {'subject': 'Ken', 'experimenter': 'Barbie', 'time': '09:00', 'notes': 'Nothing new.'}
today_vars = {'subject':  'Allan', 'notes': 'Did a good job.'}



In [18]:

session1 = default_session | today_vars
session1

{'subject': 'Allan',
 'experimenter': 'Barbie',
 'time': '09:00',
 'notes': 'Did a good job.'}
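
The `|` operator for dictionaries is only available from Python 3.9 onward. On older versions, a roughly equivalent sketch uses dictionary unpacking, which also builds a new dictionary in which later entries win:

```python
default_session = {'subject': 'Ken', 'experimenter': 'Barbie', 'time': '09:00', 'notes': 'Nothing new.'}
today_vars = {'subject': 'Allan', 'notes': 'Did a good job.'}

# Unpack both dicts into a new one; on duplicate keys, the later dict wins
session1 = {**default_session, **today_vars}
print(session1)

# The original dictionaries are left unchanged
print(default_session['subject'])  # Ken
```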

## Extracting Metadata from Strings

| Code | Description |
| :--- | :--- |
| **Indexing by Position (i.e. "Slicing" a String)** |   |
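| bonn = "BonnKölnAachen"[:4] | Get the first four characters: "Bonn" |
| köln = "BonnKölnAachen"[4:8] | Get the characters from position 4 up to (but not including) position 8: "Köln" |
| aach = "BonnKölnAachen"[8:] | Get everything from position 8 to the end: "Aachen" |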

**Exercises**

This researcher had a rule for her filenames: she would store session metadata in **fixed-length** strings, with information always in the same order:
  - **Subject Name**: 6 Characters
  - **Date**: 8 Characters
  - **Treatment Group**: 7 Characters
  - **Session Number**: 5 Characters ("sess" and then the number)

That way, when she later needed the information, she could extract it from the filename just by slicing it!

What subject's data is in this file?

In [22]:
fname = "Arthur20241008controlsess1.txt"   # Filename convention: Subject, Date, Group, Session
fname[:6]

'Arthur'

What group is this subject in?

In [24]:
fname = "Arthur20241008controlsess1.txt"   # Filename convention: Subject, Date, Group, Session
fname[14:21]

'control'

What Session number was this?  Turn it from a string into an int with the `int()` function.

In [None]:
fname = "Arthur20241008controlsess1.txt"   # Filename convention: Subject, Date, Group, Session
int(fname[25])  # the session number is the single character right after "sess"

Extract all four metadata variables from the following file and put them into their own variables. Note that the subject has fewer than 6 characters in their name, so the name is padded with underscores. After slicing the data, you can remove the underscore characters by replacing them with empty strings, using the `replace()` method on strings (e.g. `"name__".replace('_', '')`):

In [42]:
fname = "Joe___20241009experimsess1.txt"  # Filename convention: Subject, Date, Group, Session
subject, date, group, sess = fname[:6].replace('_', ''), fname[6:14], fname[14:21], int(fname[25])
subject, date, group, sess

('Joe', '20241009', 'experim', 1)

Make a dictionary with the keys "Subject", "Date", "Group", and "SessionNum" with the data from this filename:

In [37]:
fname = "Arthur20241008controlsess1.txt"   # Filename convention: Subject, Date, Group, Session
session = {
    "Subject": fname[:6], 
    "Date": fname[6:14], 
    "Group": fname[14:21],
    "SessionNum": int(fname[25]),
}
session

{'Subject': 'Arthur', 'Date': '20241008', 'Group': 'control', 'SessionNum': 1}
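
The slice positions used above all follow from the field widths in the naming rule. As a sketch of an alternative, the widths can be written down once and the positions computed in a loop (the `fields` list and the key name `"Session"` are illustrative choices, not part of the original convention):

```python
# Field names and widths, in filename order
fields = [("Subject", 6), ("Date", 8), ("Group", 7), ("Session", 5)]

fname = "Arthur20241008controlsess1.txt"

session = {}
start = 0
for name, width in fields:
    session[name] = fname[start:start + width]
    start += width  # the next field begins where this one ended

print(session)
# {'Subject': 'Arthur', 'Date': '20241008', 'Group': 'control', 'Session': 'sess1'}
```

If the convention changes, only the `fields` list needs to be updated, rather than every slice.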

Building a table of metadata usually has the following steps, which can be done in a loop:

1. Extract data into a dictionary
2. Append the dictionary into a list of dictionaries
3. Change the list of dictionaries into a data frame (the table)

**Example**: Fill in the missing data extraction code for the filenames below to make a session table.  Include the original filename in its own column, to make finding the file later simpler:

In [34]:
fnames = ["a2.txt", "b3.txt"]

In [36]:
import pandas as pd

all_sessions = []
for fname in fnames:
    session = {
        "Letter": fname[0],
        "Number": int(fname[1]),
        "Filename": fname,
    }
    all_sessions.append(session)

df = pd.DataFrame(all_sessions)
df

| | Letter | Number | Filename |
| :-- | :-- | :-- | :-- |
| 0 | a | 2 | a2.txt |
| 1 | b | 3 | b3.txt |



**Exercise**: Fill in the missing data extraction code for the filenames below to make a session table. Include the original filename in its own column, to make finding the file later simpler:


In [40]:
fnames = ["Arthur20241008controlsess1.txt", "Joseph20241009controlsess1.txt", "Arthur20241010treatmesess2.txt", "Joseph20241011controlsess2.txt"]
fnames

['Arthur20241008controlsess1.txt',
 'Joseph20241009controlsess1.txt',
 'Arthur20241010treatmesess2.txt',
 'Joseph20241011controlsess2.txt']

In [41]:
all_sessions = []
for fname in fnames:
    session = {
        "Subject": fname[0:6],
        "Date": fname[6:14],
        "Group": fname[14:21],
        "SessionNum": int(fname[25:26]),
        'Filename': fname,
    }
    all_sessions.append(session)

df = pd.DataFrame(all_sessions)
df

| | Subject | Date | Group | SessionNum | Filename |
| :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | Arthur | 20241008 | control | 1 | Arthur20241008controlsess1.txt |
| 1 | Joseph | 20241009 | control | 1 | Joseph20241009controlsess1.txt |
| 2 | Arthur | 20241010 | treatme | 2 | Arthur20241010treatmesess2.txt |
| 3 | Joseph | 20241011 | control | 2 | Joseph20241011controlsess2.txt |


#### Technique: Variable-Length, Character-Separated Strings (string splitting)

This researcher wanted to use any-length strings for her data, so that names like "Joe" and "Josephine" could all be put in the filename more naturally.  So, her filename convention was this: 

`<Subject>_<Date>_<SessionCondition>_<SessionNum>.<FileExtension>`

By settling on the underscore ("_") as the separator between the variables (the dot "." is always used as the separator for the file extension), the code for extracting those variables is quite simple: it just involves "splitting" the string along those separators.

| Code | Description | 
| :-- | :-- |
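| **Indexing by Separated Value** (i.e. "Splitting and Indexing" a String) |  |
| values = "hello_world".split('_') | Split at each underscore into a list: ['hello', 'world'] |
| hello = "hello_world".split('_')[0] | Split, then take the first item: "hello" |
| world = "hello world".split(' ')[1] | Split at the space, then take the second item: "world" |
| hello, world = "hello world".split(' ') | Split and unpack both items into their own variables |
| basename, extension = "filename.txt".split('.') | Split the base name from the file extension |
| hello, *rest = "hello dog cat bunny cow".split(' ') | Put the first item into `hello` and collect all remaining items in the list `rest` |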
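
Unpacking `fname.split('.')` into exactly two variables only works when the filename contains a single dot. For names with more dots, a hedged sketch of two alternatives is splitting only at the last dot with `rsplit()`, or using the standard library's `pathlib`. The filename here is a made-up example:

```python
from pathlib import Path

fname = "Arthur_20241008_control_1.backup.txt"  # hypothetical name with two dots

# rsplit with maxsplit=1 splits only at the last dot
base, ext = fname.rsplit('.', 1)
print(base, ext)  # Arthur_20241008_control_1.backup txt

# pathlib gives the same split; note that suffix keeps the dot
path = Path(fname)
print(path.stem, path.suffix)  # Arthur_20241008_control_1.backup .txt
```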

**Exercises**

**Example**: The filename convention here is `<Subject>_<Date>_<Group>_<SessionNum>.<FileExtension>`.  Extract the date from this filename into its own variable:

In [2]:
fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
data = base.split('_')
date = data[1]
date

'20241008'

Extract the Group from this filename into its own variable:

In [3]:
fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
data = base.split('_')
group = data[2]
group

'control'

Extract all the data from this filename into a dictionary:

In [4]:
fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
subject, date, group, num = base.split('_')  # unpack into the same names used in the dict below
data = {"Subject": subject, "Date": date, "Group": group, "SessionNum": num}
data

{'Subject': 'Arthur',
 'Date': '20241008',
 'Group': 'control',
 'SessionNum': '1'}

Use the filenames below to extract data into a session metadata table in a for-loop (feel free to copy-paste and adjust the solution from the previous section!) Include the original filename in its own column, to make finding the file later simpler:

In [5]:
fnames = ["Arthur_20241008_control_1.txt", "Josephine_20241009_control_1.txt", "Arthur_20241010_treatment_2.txt", "Joseph_20241011_control_2.txt"]
fnames

['Arthur_20241008_control_1.txt',
 'Josephine_20241009_control_1.txt',
 'Arthur_20241010_treatment_2.txt',
 'Joseph_20241011_control_2.txt']

In [6]:
all_sessions = []
for fname in fnames:
    base, ext = fname.split('.')
    data = base.split('_')
    session = {
        "Subject": data[0],
        "Date": data[1],
        "Group": data[2],
        "SessionNum": int(data[3]),
        "Filename": fname,
    }
    all_sessions.append(session)

df = pd.DataFrame(all_sessions)
df

| | Subject | Date | Group | SessionNum | Filename |
| :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | Arthur | 20241008 | control | 1 | Arthur_20241008_control_1.txt |
| 1 | Josephine | 20241009 | control | 1 | Josephine_20241009_control_1.txt |
| 2 | Arthur | 20241010 | treatment | 2 | Arthur_20241010_treatment_2.txt |
| 3 | Joseph | 20241011 | control | 2 | Joseph_20241011_control_2.txt |


## Self-Describing Metadata: Getting Key-Values Directly from a String

### Searching the String for Patterns using index()

Sometimes you want to find certain information in a string by relying on there being specific text right before the data you want.

| Code | Description |
| :-- | :-- |
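| idx = "JoeSess1".index("Sess") | Find the position where "Sess" starts (here, 3) |
| sessNum = "JoeSess1"[idx+4 : idx+5] | Slice out the single character right after "Sess": "1" |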


The following filenames use a different naming convention:

`<SessionID>_<BrainRegion>-d1=<ImageHeightInPixels>,d2=<ImageWidthInPixels>.<FileExtension>`

Using the index to find the `d1=` section from this filename, extract the image height:

In [20]:
fname = "242_CA1-d1=720,d2=1080.tif"
start_idx = fname.index("d1=") + len("d1=")
end_idx = fname.index(",")
height = int(fname[start_idx:end_idx])
height

720

Using the index to find the `d2=` section from this filename, extract the image width:

In [25]:
fname = "2045_CA3-d1=1080,d2=720.tif"
start_idx = fname.index("d2=") + len("d2=")
end_idx = fname.index(".")
width = int(fname[start_idx:end_idx])
width

720

Using the index to find the `_` section from this filename, extract the brain region:

In [26]:
fname = "24_DG-d1=720,d2=720.tif"
start_idx = fname.index("_") + len("_")
end_idx = fname.index('-')
brain_region = fname[start_idx:end_idx]
brain_region

'DG'

Extract all the data from the following filenames in a loop to build a session table. Include the original filename in its own column, to make finding the file later simpler:

In [46]:
fnames = ["242_CA1-d1=720,d2=1080.tif", "2045_CA3-d1=1080,d2=720.tif", "24_DG-d1=720,d2=720.tif", "52313_CA1-d1=720,d2=720.tif", "4_DG-d1=1080,d2=1080.tif"]
fnames

['242_CA1-d1=720,d2=1080.tif',
 '2045_CA3-d1=1080,d2=720.tif',
 '24_DG-d1=720,d2=720.tif',
 '52313_CA1-d1=720,d2=720.tif',
 '4_DG-d1=1080,d2=1080.tif']

In [47]:
sessions = []

for fname in fnames:

    # Session ID
    start_idx = 0
    end_idx = fname.index('_')
    session_id = fname[start_idx:end_idx]
    
    # Height
    start_idx = fname.index("d1=") + len("d1=")
    end_idx = fname.index(",")
    height = int(fname[start_idx:end_idx])

    # Width
    start_idx = fname.index("d2=") + len("d2=")
    end_idx = fname.index(".")
    width = int(fname[start_idx:end_idx])

    # Brain Region
    start_idx = fname.index("_") + len("_")
    end_idx = fname.index('-')
    brain_region = fname[start_idx:end_idx]

    session = {"SessionID": session_id, "Height": height, "Width": width, "BrainRegion": brain_region, "Filename": fname}
    sessions.append(session)

df = pd.DataFrame(sessions)
df

| | SessionID | Height | Width | BrainRegion | Filename |
| :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 242 | 720 | 1080 | CA1 | 242_CA1-d1=720,d2=1080.tif |
| 1 | 2045 | 1080 | 720 | CA3 | 2045_CA3-d1=1080,d2=720.tif |
| 2 | 24 | 720 | 720 | DG | 24_DG-d1=720,d2=720.tif |
| 3 | 52313 | 720 | 720 | CA1 | 52313_CA1-d1=720,d2=720.tif |
| 4 | 4 | 1080 | 1080 | DG | 4_DG-d1=1080,d2=1080.tif |
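
The loop repeats the same "find the marker, then slice" pattern several times. One way to reduce the repetition is a small helper function. This is only a sketch, and the name `text_between` is hypothetical rather than a built-in:

```python
def text_between(text, before, after):
    """Return the part of `text` between the markers `before` and `after`."""
    start = text.index(before) + len(before)
    end = text.index(after, start)  # only search after the first marker
    return text[start:end]

fname = "242_CA1-d1=720,d2=1080.tif"
height = int(text_between(fname, "d1=", ","))
width = int(text_between(fname, "d2=", "."))
brain_region = text_between(fname, "_", "-")
print(height, width, brain_region)  # 720 1080 CA1

# index() raises a ValueError if the text is missing; find() returns -1 instead
print(fname.find("d3="))  # -1
```

Passing `start` as the second argument to `index()` makes sure the end marker is looked for after the start marker, which matters when the end character could also appear earlier in the filename.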


### Variable-Length Data on Variable Keys: Using a Double-Separator to Store Keys Directly in the Filename

**Example**: Extract all the data from the filename:

In [50]:
fname = "sess=232_subj=Bill_grp=Control.txt"
base, ext = fname.split('.')
data = {}
for item in base.split('_'):
    key, value = item.split('=')
    data[key] = value

data

{'sess': '232', 'subj': 'Bill', 'grp': 'Control'}

Extract all the data from the filename:

In [52]:
fname = "sessId-11_height-720_width-1028_region-DG.tif"
base, ext = fname.split('.')
data = {}
for item in base.split('_'):
    key, value = item.split('-')
    data[key] = value

data

{'sessId': '11', 'height': '720', 'width': '1028', 'region': 'DG'}

Extract all the data from the following filenames in a loop to build a session table. Include the original filename in its own column, to make finding the file later simpler:

In [58]:
fnames = ["sessId-11_height-720_width-1028_region-DG.tif", "sessId-13_height-720_width-720.tif", "height-720_width-1028_region-DG_sessId-110.tif", "height-720_width-1028_region-DG_sessId-110_quality-bad.tif"]
fnames

['sessId-11_height-720_width-1028_region-DG.tif',
 'sessId-13_height-720_width-720.tif',
 'height-720_width-1028_region-DG_sessId-110.tif',
 'height-720_width-1028_region-DG_sessId-110_quality-bad.tif']

In [62]:
sessions = []

for fname in fnames:
    base, ext = fname.split('.')
    session = {}
    session["filename"] = fname
    for item in base.split('_'):
        key, value = item.split('-')
        session[key] = value
    
    sessions.append(session)

df = pd.DataFrame(sessions)
df

| | filename | sessId | height | width | region | quality |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | sessId-11_height-720_width-1028_region-DG.tif | 11 | 720 | 1028 | DG | NaN |
| 1 | sessId-13_height-720_width-720.tif | 13 | 720 | 720 | NaN | NaN |
| 2 | height-720_width-1028_region-DG_sessId-110.tif | 110 | 720 | 1028 | DG | NaN |
| 3 | height-720_width-1028_region-DG_sessId-110_qua... | 110 | 720 | 1028 | DG | bad |
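
Every value extracted this way is still a string, including the numbers. A possible refinement, sketched here under the assumption that only whole numbers need converting, is to turn all-digit values into `int` while parsing:

```python
fname = "sessId-11_height-720_width-1028_region-DG.tif"
base, ext = fname.rsplit('.', 1)

session = {"filename": fname}
for item in base.split('_'):
    key, value = item.split('-', 1)  # split only at the first dash
    # Store whole numbers as int, everything else as text
    session[key] = int(value) if value.isdecimal() else value

print(session)
# {'filename': 'sessId-11_height-720_width-1028_region-DG.tif', 'sessId': 11, 'height': 720, 'width': 1028, 'region': 'DG'}
```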


#### (Extra Demo) Making Data Model Contracts Explicit With Schemas

The file naming conventions we've looked at so far have all relied on some extra explanation (sometimes called a "contract" or a "schema") shared between the filename and the code that analyzes it, so that the code knows how to interpret each part of the name.

Python provides some tools for making schemas explicit, as a data model.  Here, we'll look at the built-in Named Tuple feature:


In [7]:
from collections import namedtuple

MetadataModel = namedtuple("MetadataModel", "subject date group sess_num")

fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
data_tuple = MetadataModel(*base.split('_'))
data_tuple


MetadataModel(subject='Arthur', date='20241008', group='control', sess_num='1')

Named tuples can be converted to dictionaries using the `_asdict()` method.

In [8]:
data_dict = data_tuple._asdict()
data_dict

{'subject': 'Arthur', 'date': '20241008', 'group': 'control', 'sess_num': '1'}
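
A schema can also record what type each field should have. As a sketch of one option, the standard library's `typing.NamedTuple` allows type hints on each field (the class name `SessionMetadata` is made up for this example). The hints are not enforced at runtime, so the conversion to `int` is still done by hand:

```python
from typing import NamedTuple

class SessionMetadata(NamedTuple):  # hypothetical name for this sketch
    subject: str
    date: str
    group: str
    sess_num: int

fname = "Arthur_20241008_control_1.txt"
base, ext = fname.split('.')
subject, date, group, num = base.split('_')

# Type hints document the schema but do not convert values, so convert explicitly
meta = SessionMetadata(subject, date, group, int(num))
print(meta)
# SessionMetadata(subject='Arthur', date='20241008', group='control', sess_num=1)
print(meta.sess_num + 1)  # 2
```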