# Metadata expansion

This notebook expands the metadata `.csv` file with explicit columns for the date, time, and session of each vocalization, so that recordings from the same session can be grouped for pre-processing and similar tasks.

The file names already encode all of this information. According to the [original ReCANVo documentation article](https://www.nature.com/articles/s41597-023-02405-7), they are formatted as
$${\rm{YYMMDD}}\_{\rm{HHMM}}\_{\rm{SH}}\_{\rm{SM}}\_{\rm{SS}}{\rm{.ss}}\,-\,{\rm{EH}}\_{\rm{EM}}\_{\rm{ES}}{\rm{.ss}}$$
where each block carries information as follows:
- $\rm{YYMMDD}$: date of the session;
- $\rm{HHMM}$: time of the session;
- $\rm{SH}$, $\rm{SM}$, $\rm{SS.ss}$: hour, minute, and second (with decimals) of the start of the specific recording within the session *relative to the session's starting time*;
- $\rm{EH}$, $\rm{EM}$, $\rm{ES.ss}$: hour, minute, and second (with decimals) of the end of the specific recording within the session *relative to the session's starting time*.

The goal of this notebook is to expand all this information as columns in a pandas dataframe and export it as a new `.csv` file.

In [1]:
import pandas as pd
import datetime
from os import listdir
import re

In [2]:
original_file_path = 'directory_w_train_test.csv'
assert original_file_path in listdir(), 'Original metadata file missing. Please upload it.'

df = pd.read_csv(original_file_path)

In [3]:
df.head()

Unnamed: 0,Filename,Participant,Label,is_test
0,200126_2142_00-13-04.06--00-13-04.324.wav,P01,dysregulation-sick,0
1,200126_2142_00-06-41.54--00-06-42.47.wav,P01,dysregulation-sick,0
2,200126_2142_00-11-35.94--00-11-37.08.wav,P01,dysregulation-sick,0
3,200126_2142_00-12-11.66--00-12-15.31.wav,P01,dysregulation-sick,0
4,200126_2142_00-00-24.55--00-00-24.95.wav,P01,dysregulation-sick,1


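As it turns out, some file names do not strictly adhere to the format described above: they may contain the label right after the session time, or other alphabetic characters at the end of the name. The sample rows above also separate the fields of each relative time with hyphens and split the start time from the end time with a double dash, rather than the underscores and single dash shown in the formula. What is more, not all file names are consistent in the number of decimals used in $\rm{SS.ss}$ and $\rm{ES.ss}$.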
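
One way to get a feel for how many names deviate is to test each one against a strict pattern. The sketch below is only an illustration: it assumes the hyphen-separated layout seen in the sample rows, exactly two decimals in the seconds, and a `.wav` extension. `STRICT_PATTERN` and `is_strict` are hypothetical names, not part of the dataset's tooling.

```python
import re

# Hypothetical strict layout, inferred from the sample rows:
# YYMMDD_HHMM_SH-SM-SS.ss--EH-EM-ES.ss.wav with exactly two decimals.
STRICT_PATTERN = re.compile(
    r"\d{6}_\d{4}_"                    # session date and time
    r"\d{2}-\d{2}-\d{2}\.\d{2}--"      # relative start time
    r"\d{2}-\d{2}-\d{2}\.\d{2}\.wav"   # relative end time and extension
)


def is_strict(filename: str) -> bool:
    """Return True if the whole file name matches the assumed strict layout."""
    return STRICT_PATTERN.fullmatch(filename) is not None


# Two names taken from the sample rows; the second has three decimals
# in the end time, so it does not match the strict pattern.
names = [
    "200126_2142_00-06-41.54--00-06-42.47.wav",
    "200126_2142_00-13-04.06--00-13-04.324.wav",
]
flags = [is_strict(name) for name in names]
```

Applied to the full table, something like `df.Filename.apply(is_strict)` would give a boolean column that can be used to inspect the irregular names before parsing them.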

In [4]:
def split_filename(filename: str) -> dict[str, float]:
  data = {}
  data['Y'] = int('20'+filename[:2])  # YY
  data['M'] = int(filename[2:4])      # MM
  data['D'] = int(filename[4:6])      # DD
  i = filename.find('_') + 1          # Date is separated from time by an underscore
  filename = filename[i:]             # Skip ahead
  data['h'] = int(filename[:2])       # HH
  data['m'] = int(filename[2:4])      # MM
  i = filename.find('_') + 1
  filename = filename[i:]             # Skip ahead
  data['sh'] = int(filename[:2])      # SH
  data['sm'] = int(filename[3:5])     # SM
  i = filename.find('--')             # Double dash marks the end of start time
  data['ss'] = float(filename[6:i])   # SS.ss
  filename = filename[i+2:]           # Skip ahead
  data['eh'] = int(filename[:2])      # EH
  data['em'] = int(filename[3:5])     # EM
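  # ES.ss: digits with an optional decimal part; trailing characters (extension, stray letters) are ignored
  data['es'] = float(re.search(r'\d+(?:\.\d+)?', filename[6:]).group())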
  return data

In [5]:
df['split_data'] = df.Filename.apply(split_filename)

In [6]:
df['session_datetime'] = df.split_data.apply(lambda x: datetime.datetime(
    year=x['Y'],
    month=x['M'],
    day=x['D'],
    hour=x['h'],
    minute=x['m'],
    ))
df['relative_start'] = df.split_data.apply(lambda x: datetime.timedelta(hours=x['sh'], minutes=x['sm'], seconds=x['ss']))
df['relative_end'] = df.split_data.apply(lambda x: datetime.timedelta(hours=x['eh'], minutes=x['em'], seconds=x['es']))

In [7]:
df.head()

Unnamed: 0,Filename,Participant,Label,is_test,split_data,session_datetime,relative_start,relative_end
0,200126_2142_00-13-04.06--00-13-04.324.wav,P01,dysregulation-sick,0,"{'Y': 2020, 'M': 1, 'D': 26, 'h': 21, 'm': 42,...",2020-01-26 21:42:00,0 days 00:13:04.060000,0 days 00:13:04.324000
1,200126_2142_00-06-41.54--00-06-42.47.wav,P01,dysregulation-sick,0,"{'Y': 2020, 'M': 1, 'D': 26, 'h': 21, 'm': 42,...",2020-01-26 21:42:00,0 days 00:06:41.540000,0 days 00:06:42.470000
2,200126_2142_00-11-35.94--00-11-37.08.wav,P01,dysregulation-sick,0,"{'Y': 2020, 'M': 1, 'D': 26, 'h': 21, 'm': 42,...",2020-01-26 21:42:00,0 days 00:11:35.940000,0 days 00:11:37.080000
3,200126_2142_00-12-11.66--00-12-15.31.wav,P01,dysregulation-sick,0,"{'Y': 2020, 'M': 1, 'D': 26, 'h': 21, 'm': 42,...",2020-01-26 21:42:00,0 days 00:12:11.660000,0 days 00:12:15.310000
4,200126_2142_00-00-24.55--00-00-24.95.wav,P01,dysregulation-sick,1,"{'Y': 2020, 'M': 1, 'D': 26, 'h': 21, 'm': 42,...",2020-01-26 21:42:00,0 days 00:00:24.550000,0 days 00:00:24.950000
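
With these columns in place, grouping by session and deriving absolute times becomes straightforward. The following is a minimal sketch on a small stand-in table rather than the real data: `demo`, `absolute_start`, `duration`, and `per_session` are hypothetical names, and the third row is invented purely to provide a second session.

```python
import pandas as pd

# Stand-in for the expanded dataframe; the first two rows mirror the sample
# output, the third row is made up for illustration.
demo = pd.DataFrame({
    "Participant": ["P01", "P01", "P01"],
    "session_datetime": pd.to_datetime(
        ["2020-01-26 21:42", "2020-01-26 21:42", "2020-01-27 09:00"]
    ),
    "relative_start": pd.to_timedelta(["00:13:04.06", "00:06:41.54", "00:00:10"]),
    "relative_end": pd.to_timedelta(["00:13:04.324", "00:06:42.47", "00:00:12"]),
})

# Absolute start time = session start + offset within the session.
demo["absolute_start"] = demo["session_datetime"] + demo["relative_start"]

# Length of each recording.
demo["duration"] = demo["relative_end"] - demo["relative_start"]

# Number of recordings per participant and session.
per_session = demo.groupby(["Participant", "session_datetime"]).size()
```

Grouping on both `Participant` and `session_datetime` is a cautious choice here, in case two participants happen to have sessions starting at the same minute.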


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7077 entries, 0 to 7076
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype          
---  ------            --------------  -----          
 0   Filename          7077 non-null   object         
 1   Participant       7077 non-null   object         
 2   Label             7077 non-null   object         
 3   is_test           7077 non-null   int64          
 4   split_data        7077 non-null   object         
 5   session_datetime  7077 non-null   datetime64[ns] 
 6   relative_start    7077 non-null   timedelta64[ns]
 7   relative_end      7077 non-null   timedelta64[ns]
dtypes: datetime64[ns](1), int64(1), object(4), timedelta64[ns](2)
memory usage: 442.4+ KB


In [9]:
df.drop(['split_data'], axis=1).to_csv('directory_w_train_test_timestamps.csv')
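
When the exported file is read back, the datetime and timedelta columns come in as plain strings unless they are converted. The sketch below shows one possible way to restore them; it uses an in-memory stand-in (`csv_text`, a hypothetical name) with one row copied from the sample output instead of the real file.

```python
import io

import pandas as pd

# In-memory stand-in for 'directory_w_train_test_timestamps.csv',
# using the first sample row; the leading empty header is the saved index.
csv_text = (
    ",Filename,Participant,Label,is_test,session_datetime,relative_start,relative_end\n"
    "0,200126_2142_00-13-04.06--00-13-04.324.wav,P01,dysregulation-sick,0,"
    "2020-01-26 21:42:00,0 days 00:13:04.060000,0 days 00:13:04.324000\n"
)

reloaded = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,                        # first column holds the saved index
    parse_dates=["session_datetime"],   # restore the datetime dtype
)

# Timedeltas are stored as text such as '0 days 00:13:04.060000';
# convert them back explicitly.
for col in ("relative_start", "relative_end"):
    reloaded[col] = pd.to_timedelta(reloaded[col])
```

For the actual file, the `io.StringIO(csv_text)` argument would be replaced by the path to the exported `.csv`.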