### Video Dataframe creation
This notebook loads the csv files containing the information that OpenFace extracted from the videos. Each csv file is loaded into a pandas DataFrame, to which we add columns describing the start of the TSST speech component. The dataframe is then stored in the `interim` folder as a `.hdf` file, for use in the later sampling and feature computation steps.

In [1]:
import os
import csv
import pandas as pd
import numpy as np

After importing the required modules, we store the project directory (the parent of the current working directory) in a variable called `project_dir`, and the folder that holds the data files in `data_dir`. The folder containing the raw OpenFace output, `data\raw\Video_features`, is stored in `processed_dir`; despite its name, this is the folder the csv files are read from, not the folder the resulting dataframes are saved to. The names of all csv files in that folder are collected in `raw_feature_files`.

In [2]:
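# The notebook runs one level below the project root (Windows-style paths).
project_dir = os.getcwd().split('\\')[:-1]
project_dir = '\\'.join(project_dir)
data_dir = project_dir + '\\data'
# Folder with the raw OpenFace csv output (the input to this notebook).
processed_dir = data_dir + '\\raw\\Video_features'
raw_feature_files = [file for file in os.listdir(processed_dir) if file.endswith('csv')]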
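
The cell above builds paths by splitting and joining on backslashes, which only works on Windows. As a minimal sketch of a portable alternative, the same folders could be derived with `pathlib`. The helper names `build_paths` and `list_csv_files` are hypothetical and not part of the original notebook; the folder layout is taken from the cell above.

```python
from pathlib import Path


def build_paths(notebook_dir):
    """Return (project_dir, data_dir, processed_dir) for a notebook folder.

    Hypothetical helper: assumes the notebook folder sits directly below
    the project root, as the original cell does.
    """
    project_dir = Path(notebook_dir).resolve().parent
    data_dir = project_dir / "data"
    processed_dir = data_dir / "raw" / "Video_features"
    return project_dir, data_dir, processed_dir


def list_csv_files(folder):
    """Return the sorted names of the csv files in `folder` (empty if it is missing)."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.iterdir() if p.suffix == ".csv")


project_dir, data_dir, processed_dir = build_paths(Path.cwd())
raw_feature_files = list_csv_files(processed_dir)
```

Sorting the file names is a choice made in this sketch so that participants are processed in a reproducible order; `os.listdir` makes no such guarantee.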

We open the file `start_sentence.csv`, which contains the moment at which the TSST speech component starts for each participant. We store these moments in a dict called `timings`, keyed by the integer participant number (`pp`), so that the start of the TSST speech component in a participant's video, in seconds, can be looked up as follows:

`start = timings[pp]`

In [3]:
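timings_path = project_dir + '\\data\\information\\start_sentence.csv'
timings = {}
# Semicolon-delimited file; 'utf-8-sig' drops a leading byte-order mark if present.
with open(timings_path, encoding='utf-8-sig', newline='') as f:
    reader = csv.DictReader(f, delimiter=';')
    for row in reader:
        timings[int(row['pp'])] = int(row['start'])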
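
To make the expected file layout concrete, here is a small self-contained sketch that parses an in-memory stand-in for `start_sentence.csv`. The two rows are invented for illustration and are not real participant timings; `load_timings` is a hypothetical helper name. The sample bytes begin with a byte-order mark to show why the file is opened with `utf-8-sig`.

```python
import csv
import io


def load_timings(handle):
    """Map participant number -> start of the speech component (seconds)."""
    reader = csv.DictReader(handle, delimiter=";")
    return {int(row["pp"]): int(row["start"]) for row in reader}


# Made-up sample content with a leading byte-order mark (BOM).
raw = "\ufeffpp;start\n2;95\n10;120\n".encode("utf-8")

# 'utf-8-sig' strips the BOM, so the first header is read as 'pp'.
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="") as handle:
    timings = load_timings(handle)

start = timings[10]  # keys are integers, not strings
```

Without `utf-8-sig`, the first column name would still carry the BOM character and `row["pp"]` would raise a `KeyError`.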

We then load each csv file into a dataframe, one by one, and add the participant number (`pp`). Each row corresponds to a single frame in the video and contains the information that OpenFace extracted for that frame. We add a column called `started`, which is `1` if the participant has already started the TSST speech component and `0` if not. We also add how far each frame (row) is from the frame in which the TSST speech component started: `frames_away_start` gives this offset in frames, and `t_from_start` gives it in seconds (computed from the frame offset at 25 frames per second). Each dataframe is then saved in the interim folder as a `.hdf` file.

In [4]:
for csv_file in raw_feature_files:  # every csv file found in processed_dir
    new_data = pd.read_csv(processed_dir + '\\' + csv_file)
    pp = int(csv_file[4:7])  # participant number, taken from characters 4-6 of the file name
    new_data['pp'] = pp
    new_data['started'] = 0
    # Strip stray spaces from the column names so they can be accessed by name.
    new_data.columns = [col.strip(' ') for col in new_data.columns]

    new_data.loc[(new_data.timestamp >= timings[pp]), 'started'] = 1
    frame_at_start = new_data.loc[new_data.started==1, 'frame'].values[0]
    # Use this column together with the pp column to connect to physiological data
    new_data['frames_away_start'] = new_data['frame'] - frame_at_start
    new_data['t_from_start'] = new_data['frames_away_start']/25  # assumes 25 frames per second
    new_data = new_data.astype('float')  # astype returns a new dataframe, so keep the result

    new_data.to_hdf(data_dir+ '\\interim\\video\\' + f'{pp}.hdf', key=f'pp{pp}', mode='w')
    print(f'pp{pp} done')
print('All done!')

pp2 done
pp3 done
pp4 done
pp5 done
pp10 done
pp11 done
pp12 done
pp13 done
pp14 done
pp17 done
pp18 done
pp19 done
pp62 done
pp68 done
pp69 done
pp70 done
pp71 done
pp76 done
pp77 done
pp79 done
pp82 done
pp84 done
pp86 done
pp90 done
pp91 done
pp92 done
pp93 done
pp94 done
pp95 done
pp96 done
pp97 done
pp98 done
pp99 done
pp100 done
All done!
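
The derived columns can be checked on a toy dataframe. The sketch below repeats the logic of the loop above on five invented frames; `add_start_columns` is a hypothetical helper name, and the frame rate of 25 frames per second is the value the notebook divides by, not something read from the video files.

```python
import pandas as pd

FPS = 25  # frame rate assumed by the notebook's division by 25


def add_start_columns(frames, start_seconds, fps=FPS):
    """Add `started`, `frames_away_start` and `t_from_start` to a copy of `frames`."""
    out = frames.copy()
    # 1 from the first frame at or after the start of the speech component.
    out["started"] = (out["timestamp"] >= start_seconds).astype(int)
    # Frame number of the first frame flagged as started.
    frame_at_start = out.loc[out["started"] == 1, "frame"].iloc[0]
    out["frames_away_start"] = out["frame"] - frame_at_start
    out["t_from_start"] = out["frames_away_start"] / fps
    return out


# Invented data: five frames, 0.04 s apart.
toy = pd.DataFrame({
    "frame": [1, 2, 3, 4, 5],
    "timestamp": [0.00, 0.04, 0.08, 0.12, 0.16],
})
result = add_start_columns(toy, start_seconds=0.08)
print(result)
```

Frames before the start get negative offsets, the start frame gets `0`, and later frames get positive offsets. If no frame reaches `start_seconds` (for example because the video ends before the speech component starts), the `.iloc[0]` lookup raises an `IndexError`, just as `.values[0]` does in the loop above.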
