# Video Dataframe creation
This notebook loads the csv files that contain the information OpenFace extracted from the videos. Each csv file is loaded into a pandas DataFrame, to which we add columns describing the start of the TSST speech component. The DataFrame is then stored in the `interim` folder as a `.feather` file, for use in the sampling and feature computation steps later on.

#### Requirements
To run this notebook, you first need to run the `raw-video-extraction.py` script. The resulting `csv` files need to be stored in `data\raw\Video_features`.

You also need the timing csv file, which gives the second at which each participant starts the TSST speech component. This file needs to be stored in `data\information\` and be called `start_sentence.csv`.

In [1]:
import os
import csv
import pandas as pd
import numpy as np

In [2]:
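Mitchel = True # If True, use the hard-coded data directory set in the next cell; otherwise use <project_dir>\data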

After loading the required modules, we store the project directory (the parent of the working directory) in a variable called `project_dir`, and the folder where the data files are located in a variable called `data_dir`. The folder that contains the csv files exported by OpenFace is stored in `processed_dir`, and the names of all csv files in that folder are collected in `raw_feature_files`.

In [3]:
# Get the project dir (the parent of the current working directory)
project_dir = os.getcwd().split('\\')[:-1]
project_dir = '\\'.join(project_dir)
# Get the data dir
if Mitchel:
    data_dir = 'C:\\Users\\mitch\\OneDrive - UGent\\UGent\\Projects\\7. tDCS_Stress_WM_deSmet\\data'
    data_dir = 'Z:\\ghep_lab\\2020_DeSmetKappen_tDCS_Stress_WM_VIDEO\\Data' # Overrides the local path above; this is the path that is actually used
else:
    data_dir = project_dir + '\\data'
# Get the dir that contains the video feature csv files exported by OpenFace
processed_dir = data_dir + '\\raw\\Video_features'
# Get all the csv files in that dir
raw_feature_files = [file for file in os.listdir(processed_dir) if file.endswith('.csv')]
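
The string concatenation above only works with Windows-style separators. As a minimal sketch, the same folder layout could be expressed with `pathlib`. The root folder and the names `project_root`, `raw_video_dir`, `interim_video_dir` and `timing_file` are hypothetical and not part of the notebook; `PureWindowsPath` is used here only so the example prints the same result on any operating system (a real notebook would more likely use `Path`).

```python
from pathlib import PureWindowsPath

# Hypothetical project root, used purely for illustration.
project_root = PureWindowsPath(r"C:\projects\tsst_video")

# Same folder layout as in the notebook, built with the `/` operator.
data_dir = project_root / "data"
raw_video_dir = data_dir / "raw" / "Video_features"      # csv files exported by OpenFace
interim_video_dir = data_dir / "interim" / "video"       # where the .feather files go
timing_file = data_dir / "information" / "start_sentence.csv"

print(raw_video_dir)
print(timing_file.name)
```

Building paths this way avoids the manual `'\\'` joins and keeps each folder level visible as a separate component.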

We open the file `start_sentence.csv`, which contains the starting moment of the TSST speech component for each participant. We store this in a dict called `timings`, keyed by the integer participant id (`pp`), so that the second in the video at which a participant starts the TSST speech component (`start`) can be retrieved as follows:

`start = timings[pp]`

In [4]:
# Open the file that contains the starting points of the TSST speech component in the videos
timings = {}
with open(data_dir + '\\information\\start_sentence.csv', encoding='utf-8-sig') as timing_file:
    reader = csv.DictReader(timing_file, delimiter=';')
    for row in reader:
        timings[int(row['pp'])] = int(row['start']) # Store this information in a dict for later use
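
To illustrate the format this cell expects, here is a small self-contained sketch that parses an in-memory file instead of the real `start_sentence.csv`. The sample rows and the helper name `load_timings` are made up for illustration; only the `;` delimiter and the column names `pp` and `start` are taken from the cell above.

```python
import csv
import io

# Made-up sample content in the expected layout:
# semicolon-delimited, with the columns `pp` and `start`.
sample_csv = "pp;start\n2;95\n3;102\n"

def load_timings(handle):
    """Return {participant id: start second} from an open csv file handle."""
    reader = csv.DictReader(handle, delimiter=';')
    return {int(row['pp']): int(row['start']) for row in reader}

timings_example = load_timings(io.StringIO(sample_csv))
start = timings_example[2]  # keys are integers, not strings
print(start)
```

Note that the lookup uses the integer `2`, not the string `'2'`, because the keys are converted with `int()` when the dict is built.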

We then load each csv file into a DataFrame, one by one, and add the participant id to it. Each row corresponds to a single frame in the video and contains the information OpenFace extracted for that frame. We add a column called `started`, which is `1` if the participant has already started the TSST speech component and `0` if not. We also add how far each frame (row) is from the frame in which the TSST speech component started, which can be found in the columns `t_from_start` (seconds) and `frames_away_start` (frames). Each DataFrame is then saved in the interim folder as a `.feather` file.
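
To make the derived columns concrete, here is a minimal sketch on a toy DataFrame rather than a real OpenFace export. The frame values and the start second of `0.1` are made up, and the constant name `FPS` is hypothetical; the value 25 is the divisor the notebook uses to convert frames to seconds.

```python
import pandas as pd

FPS = 25  # divisor used by the notebook to convert frames to seconds

# Toy stand-in for an OpenFace export: 6 frames (values are made up).
toy = pd.DataFrame({'frame': [1, 2, 3, 4, 5, 6]})
toy['timestamp'] = (toy['frame'] - 1) / FPS  # 0.00, 0.04, ..., 0.20

start_second = 0.1  # hypothetical start of the speech component

# 1 from the first frame at or after the start, 0 before it.
toy['started'] = (toy['timestamp'] >= start_second).astype(int)

# Distance of each row to the first started frame, in frames and in seconds.
frame_at_start = toy.loc[toy['started'] == 1, 'frame'].iloc[0]
toy['frames_away_start'] = toy['frame'] - frame_at_start
toy['t_from_start'] = toy['frames_away_start'] / FPS

print(toy)
```

Rows before the start get negative values in `frames_away_start` and `t_from_start`, and the first started frame gets exactly `0`.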

In [9]:
for csv_file in raw_feature_files: # For each csv file
    pp = int(csv_file[4:7]) # Get the pp id from the file name
    print(f'processing {pp}')
    # if pp <= 97: continue # Uncomment to skip participants that were already processed

    new_data = pd.read_csv(processed_dir + '\\' + csv_file) # Open the file as a DataFrame
    new_data.columns = [col.strip(' ') for col in new_data.columns] # Remove the blank spaces from the column names (this is the result of how OpenFace exports the csv files)
    new_data['pp'] = pp # Add the pp id to the DataFrame (this has to happen after the file is loaded)

    # Add a column `started` which is 0 for the rows where the pp did not yet start the TSST speech, and 1 if the pp did start
    new_data['started'] = 0
    new_data.loc[(new_data.timestamp >= timings[pp]), 'started'] = 1 # The timing information (when did the pp start) is retrieved from the timings dict using the pp id

    ## Add the columns `frames_away_start` & `t_from_start` to denote how far each row is from the frame in which the pp started the TSST speech component
    ## We need these columns, together with the pp id column, to synchronise the physiological and video DataFrames
    frame_at_start = new_data.loc[new_data.started==1, 'frame'].values[0] # First find the exact frame in which the participant started
    new_data.loc[:, 'frames_away_start'] = new_data.loc[:, 'frame'] - frame_at_start # Compute how far away each frame is from this starting frame by subtracting it from the column `frame`
    new_data['t_from_start'] = new_data['frames_away_start']/25 # Add the column `t_from_start` by dividing `frames_away_start` by 25, which denotes how many seconds each frame is from the starting point

    new_data = new_data.astype('float') # Set the entire DataFrame as float for memory reasons; astype returns a new DataFrame, so the result has to be assigned

    # Reset the index before storing the DataFrame in a feather file for later use
    new_data.reset_index(inplace=True)
    new_data.to_feather(f'{data_dir}\\interim\\video\\{pp}.feather')
    print(f'pp{pp} done')
print('All done!')

processing 2
processing 3
processing 4
processing 5
processing 10
processing 11
processing 12
processing 13
processing 14
processing 17
processing 18
processing 19
processing 21
processing 43
processing 62
processing 68
processing 69
processing 70
processing 71
processing 76
processing 77
processing 82
processing 84
processing 86
processing 90
processing 91
processing 92
processing 93
processing 94
processing 95
processing 96
processing 97
processing 98
pp98 done
processing 99
pp99 done
processing 100
pp100 done
processing 101
pp101 done
processing 102
pp102 done
processing 103
pp103 done
processing 104
pp104 done
processing 105
pp105 done
processing 106
pp106 done
processing 107
pp107 done
processing 109
pp109 done
processing 110
pp110 done
processing 113
pp113 done
processing 114
pp114 done
processing 115
pp115 done
processing 117
pp117 done
processing 118
pp118 done
processing 119
pp119 done
processing 120
pp120 done
processing 121
pp121 done
processing 122
pp122 done
processing 123

In [8]:
csv_file

'ppt_002.csv'
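
The loop extracts the participant id from file names like the one shown above by slicing characters 4 to 6 (`csv_file[4:7]`). As a hedged alternative, the sketch below parses the id with a regular expression and fails loudly on unexpected names. The helper `parse_pp_id` is hypothetical, and the pattern `ppt_<digits>.csv` is an assumption based on this single example file name.

```python
import re

def parse_pp_id(file_name):
    """Extract the participant id from a name like 'ppt_002.csv' (hypothetical helper)."""
    match = re.fullmatch(r'ppt_(\d+)\.csv', file_name)
    if match is None:
        raise ValueError(f'unexpected file name: {file_name!r}')
    return int(match.group(1))

print(parse_pp_id('ppt_002.csv'))
```

For file names that follow the assumed pattern, this returns the same id as the slicing approach, but it does not depend on the id being exactly three digits long.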