### Physiological Dataframe creation
This notebook creates the physiological signal dataframes for each participant, for both the BIOPAC (`.acq`) and the AMSDATA (`.txt`) files. To both dataframes we add information about the start (moment) of the TSST speech component. Each dataframe is stored in the `interim` folder as a `.hdf` file, for use in the sampling and computation steps later on.

In [1]:
import os
import bioread
import pandas as pd
import numpy as np
import csv

After loading the required modules, we store the parent of the current working directory in a variable called `project_dir`. The folder where the data files are located is stored in `data_dir`. We then list the names of all the acq files in `acq_files` and of all the txt files (amsdata) in `txt_files`.

In [2]:
project_dir = os.getcwd().split('\\')[:-1]
project_dir = '\\'.join(project_dir)
data_dir = project_dir + '\\data\\raw\\Physiological'
acq_files = [file for file in os.listdir(data_dir) if file.endswith('acq')]
txt_files = [file for file in os.listdir(data_dir) if file.endswith('SCL.txt')]
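The cell above builds paths by splitting and joining on backslashes, which only works on Windows. A minimal sketch of the same lookup with `pathlib` follows, assuming the same folder layout; the helper name `find_raw_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def find_raw_files(data_dir):
    """Return (acq_files, txt_files) as sorted lists of file names."""
    data_dir = Path(data_dir)
    names = sorted(p.name for p in data_dir.iterdir() if p.is_file())
    acq_files = [n for n in names if n.endswith('acq')]
    txt_files = [n for n in names if n.endswith('SCL.txt')]
    return acq_files, txt_files


# Same layout as in the notebook: <project>/data/raw/Physiological,
# where the project folder is the parent of the working directory.
project_dir = Path.cwd().parent
data_dir = project_dir / 'data' / 'raw' / 'Physiological'

# Only list files when the folder exists, so the sketch runs anywhere.
if data_dir.is_dir():
    acq_files, txt_files = find_raw_files(data_dir)
```

Sorting the names makes the processing order reproducible, since `os.listdir` and `Path.iterdir` return entries in arbitrary order.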

The acq files are loaded one by one using `bioread.read_file`. The start moment is computed using the third channel (`digital_input`). We then select a subset of both the raw EDA and the raw ECG signal, from 3 minutes prior to the start of the TSST speech component to 20 minutes after this start. The timestamps are expressed in seconds, and from them we compute the new column `t_from_start` (seconds), which reflects how far each sample (row) is removed from the sample in which the TSST component started. Each dataframe is then saved in the interim folder as a `.hdf` file.

In [3]:
for file in acq_files:
    
    bio = bioread.read_file(data_dir + '\\' + file)
    eda = bio.channels[0].data
    ecg = bio.channels[1].data
    timestamps = bio.channels[0].time_index

    digital_input = np.copy(bio.channels[2].data)
    starts = np.where(np.diff(digital_input>4))[0]
    start = starts[4]
    eda = eda[start-(2000*60*3):start+(2000*60*20)]
    ecg = ecg[start-(2000*60*3):start+(2000*60*20)]
    timestamps = timestamps[start-(2000*60*3):start+(2000*60*20)]
    digital_input = digital_input[start-(2000*60*3):start+(2000*60*20)]
    data = pd.DataFrame({'timestamp': timestamps,'raw_EDA': eda, 'raw_ECG': ecg, 'digital_input': digital_input})
    data['t_from_start'] = data['timestamp'] - (start/bio.channels[0].samples_per_second)
    # astype returns a new dataframe, so the result has to be assigned back
    data = data.astype('float')
    
    pp = file[2:-6]
    data.to_hdf(f'{project_dir}\\data\\interim\\physiological\\{pp}.hdf', key=f'pp{pp}', mode='w')
    print(f'pp{pp} done')
print('All done!')

pp100 done
pp10 done
pp11 done
pp12 done
pp13 done
pp14 done
pp17 done
pp18 done
pp19 done
pp2 done
pp3 done
pp4 done
pp5 done
pp62 done
pp68 done
pp69 done
pp70 done
pp94 done
pp99 done
All done!
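The start detection in the loop above relies on thresholding the marker channel and taking `np.diff` of the resulting boolean array, which flags every sample where the signal crosses the threshold. The sketch below illustrates this on a synthetic signal. The helper names `find_edges` and `window_bounds` are hypothetical, and the clamping of the window is an added safeguard that the notebook does not apply: without it, a start earlier than 3 minutes into the recording would yield a negative slice index, which NumPy interprets as counting from the end of the array.

```python
import numpy as np


def find_edges(digital_input, threshold=4):
    """Indices i where the signal crosses the threshold between i and i+1."""
    above = np.asarray(digital_input) > threshold
    # For boolean arrays np.diff marks positions where the value changes.
    return np.where(np.diff(above))[0]


def window_bounds(start, fs, n_samples, before_s=180, after_s=1200):
    """Slice bounds around `start`, clamped to the length of the recording."""
    lo = max(start - int(before_s * fs), 0)
    hi = min(start + int(after_s * fs), n_samples)
    return lo, hi


# Synthetic marker channel: 0 V baseline with two 5 V pulses.
signal = np.zeros(20)
signal[3:6] = 5
signal[10:14] = 5

edges = find_edges(signal)  # rising and falling edges, in order
lo, hi = window_bounds(start=100, fs=10, n_samples=5000)
```

Each pulse contributes two edges (one rising, one falling). The notebook selects `starts[4]`, the fifth edge, as the start of the speech component; which marker event this corresponds to depends on the recording protocol.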


We open the file `start_TSST.csv`, which contains the starting moment of the TSST speech component for each participant in the amsdata measurements. We store these in a dict called `timings`, keyed by the participant number as an integer, so that you can get the moment in the video of a certain participant (`pp`) at which the TSST component starts, in seconds (`start`), as follows:

`start = timings[pp]`


In [4]:
# Use a context manager so the file is closed after reading
with open(project_dir + '\\data\\information\\start_TSST.csv', encoding='utf-8-sig') as f:
    reader = csv.DictReader(f, delimiter=';')
    timings = {int(row['pp']): int(row['start']) for row in reader}
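To make the lookup concrete, here is a small sketch that parses an in-memory stand-in for `start_TSST.csv`. The sample rows are invented for illustration; only the column names `pp` and `start` and the `;` delimiter are taken from the cell above.

```python
import csv
import io

# Hypothetical file contents; the real start_TSST.csv is not shown here.
sample = "pp;start\n71;512\n76;498\n"

reader = csv.DictReader(io.StringIO(sample), delimiter=';')
timings = {int(row['pp']): int(row['start']) for row in reader}

start = timings[71]        # keys are integers, not strings
missing = timings.get(99)  # None when a participant has no entry
```

Because the keys are converted with `int`, a lookup with a string such as `timings['71']` would raise a `KeyError`. The `utf-8-sig` encoding used when opening the real file strips a leading byte-order mark, which otherwise ends up inside the first column name.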

Next, we load the txt files that have been exported by the AMSDATA software. These txt files only contain the raw EDA signal. We add this raw signal to a dataframe, recompute the timestamps so that they reflect seconds instead of milliseconds, and then compute the new column `t_from_start`. We once again take the subset of the data between 3 minutes prior to the start and 20 minutes after the start. Each dataframe is also stored as a `.hdf` file in the interim folder.

**NOTE:** The amsdata is sampled at 10 Hz while the acq data is sampled at 2000 Hz. This data should perhaps be resampled so that both possess the same frequency.

In [5]:
for file in txt_files:
    pp = int(file[2:-10])
    start = timings[pp]
    data = pd.read_csv(f'{data_dir}\\{file}', sep=' ', skiprows=3, names=['timestamp', 'raw_EDA'])
    data['timestamp'] = data['timestamp']/1000
    data['t_from_start'] =  data['timestamp'] - start
    data = data.loc[(data.t_from_start>-180) & (data.t_from_start<1200),:]
    # astype returns a new dataframe, so the result has to be assigned back
    data = data.astype('float')
    
    data.to_hdf(f'{project_dir}\\data\\interim\\physiological\\{pp}.hdf', key=f'pp{pp}', mode='w')
    print(f'pp{pp} done')
    
print('All done!')

pp71 done
pp76 done
pp77 done
pp79 done
pp82 done
pp84 done
pp86 done
pp90 done
pp91 done
pp95 done
pp96 done
pp97 done
pp98 done
All done!
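Regarding the note on sampling rates: one simple way to bring the 2000 Hz BIOPAC signal down to the 10 Hz of the amsdata is to average consecutive blocks of 200 samples. The sketch below is not part of the notebook and uses a synthetic signal. The function name `block_mean_downsample` is hypothetical. Block averaging acts only as a crude low-pass filter, so a dedicated decimation routine such as `scipy.signal.decimate` may be preferable for the actual analysis.

```python
import numpy as np


def block_mean_downsample(x, factor):
    """Average consecutive blocks of `factor` samples (trailing remainder dropped)."""
    x = np.asarray(x, dtype=float)
    n_blocks = len(x) // factor
    return x[:n_blocks * factor].reshape(n_blocks, factor).mean(axis=1)


fs_acq, fs_ams = 2000, 10
factor = fs_acq // fs_ams  # 200 input samples per output sample

# Synthetic 3-second ramp at 2000 Hz standing in for raw_EDA.
eda_2000hz = np.linspace(0.0, 1.0, 3 * fs_acq)
eda_10hz = block_mean_downsample(eda_2000hz, factor)
```

The timestamps and `t_from_start` would need the same treatment (for example, the mean or first value of each block) so that the rows stay aligned with the downsampled signal.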
