# Physiological Dataframe creation
This notebook creates the physiological signal dataframes for each participant, for both the ACQ and the AMSDATA files. To both dataframes we add information about the start (moment) of the TSST speech component. Each dataframe is stored in the `interim` folder as a `.feather` file, for use in the later sampling and target computation steps.
#### Requirements
To be able to run this notebook one needs to have collected the physiological data files for all the participants. These files need to be saved in `data\raw\Physiological`.

This notebook handles both the ACQ and AMSDATA files. Since it is not possible to open the AMSDATA files from Python, one needs to export the signals to `.txt` files using the AMSDATA software. This will result in two `.txt` files per participant, one for ECG and another one for SCL.

Since the AMSDATA files do not contain timing information for the components, we also need to collect the timing information for each participant from the session information Excel file. This timing information is stored in a `.csv` file in `data\information`.

In [1]:
import os
import bioread
import pandas as pd
import numpy as np
import csv

In [2]:
Mitchel = False

After loading in the required modules, we store the working directory in a variable called `project_dir`. We then store the folder where the data files are located in a variable called `physData_dir`. We create a variable that contains the specific names of all the acq files, `acq_files`, as well as a variable for all the txt files (amsdata), `txt_files`.

In [3]:
project_dir = os.getcwd().split('\\')[:-1] 
project_dir = '\\'.join(project_dir) # Get the project dir
# Get the data dir
if Mitchel:
    data_dir = 'C:\\Users\\mitch\\OneDrive - UGent\\UGent\\Projects\\7. tDCS_Stress_WM_deSmet\\data'
    data_dir = 'Z:\\ghep_lab\\2020_DeSmetKappen_tDCS_Stress_WM_VIDEO\\Data'
else: 
    data_dir = project_dir + '\\data'
physData_dir = data_dir + '\\raw\\Physiological'
acq_files = [file for file in os.listdir(physData_dir) if file.endswith('acq')] # Find all the acq files in the data dir
txt_files = [file[:-7] for file in os.listdir(physData_dir) if file.endswith('SCL.txt')] # Find all the SCL text files in the dir
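
The path handling above relies on Windows-style `\\` separators. As a minimal sketch of a platform-independent alternative, `pathlib` can do the same file discovery. The helper name `list_recordings` is hypothetical and not part of the notebook, and the glob patterns assume the file names end in `.acq` and `SCL.txt`.

```python
from pathlib import Path

def list_recordings(phys_dir):
    """Return (acq_files, txt_prefixes) found in phys_dir (hypothetical helper)."""
    phys_dir = Path(phys_dir)
    # All AcqKnowledge recordings in the folder
    acq_files = sorted(p.name for p in phys_dir.glob('*.acq'))
    # Strip the trailing 'SCL.txt' (7 characters) to keep the prefix shared with the ECG file
    txt_prefixes = sorted(p.name[:-7] for p in phys_dir.glob('*SCL.txt'))
    return acq_files, txt_prefixes
```

Calling `list_recordings(physData_dir)` would then give lists comparable to `acq_files` and `txt_files`, only sorted.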

## All Components
Below we select the parts of the physiological signal that were recorded during the active components of the experiment: the baseline, TSST preparation, TSST speech, TSST math and recovery components. The remaining physiological signal, between the components, is removed. We also convert all the data to `float` to reduce the amount of memory needed.
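
A small sketch of the `float` conversion on a toy DataFrame (the column values are made up). It illustrates that `DataFrame.astype` returns a new DataFrame rather than changing the existing one, so the result has to be assigned back for the conversion to take effect.

```python
import pandas as pd

df = pd.DataFrame({'timestamp': [0, 1, 2], 'pp': [7, 7, 7]})  # integer columns

df.astype('float')               # returns a new DataFrame; df itself is unchanged
unchanged_dtype = df['pp'].dtype

df = df.astype('float')          # assigning the result makes the conversion stick
converted_dtype = df['pp'].dtype
```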
### .acq Files
We can extract the different components from the `.acq` files using the digital input channel that is present in this data. The components are denoted by an active electronic signal of 5V in the digital input channel, while the time between components is denoted by an inactive signal. We can therefore find the starting point of each component by finding where the difference between consecutive points in the digital input channel is positive, indicating an increase from 0V to 5V. The stopping points are found in the same way, where the difference is negative. We can then loop through the starting points corresponding to our desired components (in this case `[0,3,4,5,6]`) and extract the EDA and ECG signal as well as the corresponding timestamps.
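
The edge detection described above can be sketched on a synthetic digital channel (the values below are made up). Unlike the real recordings, which start with a stop, this toy signal starts low, so starts and stops pair up directly.

```python
import numpy as np

# Synthetic digital channel: 0 V between components, 5 V during a component
digital = np.array([0, 0, 5, 5, 5, 0, 0, 5, 5, 0], dtype=float)

diff = np.diff(digital)
starts = np.where(diff > 0)[0]   # index of the last 0 V sample before each rise
stops = np.where(diff < 0)[0]    # index of the last 5 V sample before each fall

# Shift by one so each slice covers exactly the active (5 V) samples
segments = [digital[s + 1:e + 1] for s, e in zip(starts, stops)]
```

Note that `np.diff` reports the index just before the change, which is why the slices are shifted by one sample here. At high sampling rates a one-sample offset is usually negligible.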

In [4]:
# acq_files

In [5]:
for file in acq_files: # For each acq files in the dir
    print(f'processing: {file}')
    pp = int(file[2:-6]) # Get the pp id
    bio = bioread.read_file(physData_dir + '\\' + file) # Open the physiological file
    
    # Get the EDA, ECG, Time index and Digital input signal(which corresponds to the triggers from the psychopy script)
    eda_org = bio.channels[0].data 
    ecg_org = bio.channels[1].data
    timestamps_org = bio.channels[0].time_index    
    digital_input_org = np.copy(bio.channels[2].data)
    
    # Using the Digital input signal, get the start and stop points of the different components in the study
    starts = np.where(np.diff(digital_input_org)>0)[0]
    stops = np.where(np.diff(digital_input_org)<0)[0]
    
    # Create empty arrays to store the signals
    eda = np.array([])  
    ecg = np.array([])
    timestamps = np.array([])
    digital_input = np.array([])
    
    active_components = [0,3,4,5,6] # Desired components (Active components in the study)
    
    for i in active_components:  # For each component
        start = starts[i] # Get starting point
        stop = stops[i+1] # Get stop point (+1 because the digital signal starts with a stop)
        
        # Add the EDA, ECG, Time index and Digital input signal between start and stop point to the arrays
        eda = np.append(eda, eda_org[start:stop])
        ecg = np.append(ecg, ecg_org[start:stop])
        timestamps = np.append(timestamps, timestamps_org[start:stop])
        digital_input = np.append(digital_input, digital_input_org[start:stop])
    
    data = pd.DataFrame({'timestamp': timestamps,'raw_EDA': eda, 'raw_ECG': ecg, 'digital_input': digital_input}) # Create a DataFrame with the signals stored in the array
    
    freq=round(1/(data.timestamp.values[1] - data.timestamp.values[0])) # Compute the frequency of the signals using the following formula: freq=1/(t_2 - t_1)
    
    # Note: `start` still holds the value from the last loop iteration (component 6), so t_from_start is relative to the start of that component
    data['t_from_start'] = data['timestamp'] - (start/freq) # Compute the time between each row in the DataFrame and the starting point
    data['pp'] =  pp # Store the PP_id in the DataFrame
    data = data.astype('float') # Set the entire dataframe as float for memory reasons; astype returns a new DataFrame, so the result is assigned back

    # Reset the index before storing the DataFrame in an feather file for later use
    data.reset_index(inplace=True) 
    data.to_feather(f'{data_dir}\\interim\\physiological_all\\{pp}.feather')
    
    print(f'pp{pp} done')
print('All done!')

processing: pp100s2.acq
pp100 done
processing: pp101s2.acq
pp101 done
processing: pp102s2.acq
pp102 done
processing: pp103s2.acq
pp103 done
processing: pp104s2.acq
pp104 done
processing: pp105s2.acq
pp105 done
processing: pp106s2.acq
pp106 done
processing: pp107s2.acq
pp107 done
processing: pp109s2.acq
pp109 done
processing: pp10s2.acq
pp10 done
processing: pp110s2.acq
pp110 done
processing: pp113s2.acq
pp113 done
processing: pp114s2.acq
pp114 done
processing: pp115s2.acq
pp115 done
processing: pp117s2.acq
pp117 done
processing: pp118s2.acq
pp118 done
processing: pp119s2.acq
pp119 done
processing: pp11s2.acq
pp11 done
processing: pp120s2.acq
pp120 done
processing: pp121s2.acq
pp121 done
processing: pp122s2.acq
pp122 done
processing: pp123s2.acq
pp123 done
processing: pp125s2.acq
pp125 done
processing: pp126s2.acq
pp126 done
processing: pp127s2.acq
pp127 done
processing: pp128s2.acq
pp128 done
processing: pp129s2.acq
pp129 done
processing: pp12s2.acq
pp12 done
processing: pp130s2.acq
pp

### .amsdata files
Unlike the `.acq` files, the `.amsdata` files (which are exported and stored as `.txt` files) do not mark the different components of the experiment along the physiological signal. Instead, the experimenters in the study have recorded the starting points of each component in an Excel file. These starting points are saved in the `amsdata_start.xlsx` file. Because the `.amsdata` signal is saved in a `.txt` file, we can load the raw EDA signal into a pandas dataframe, which allows for easier extraction of the active components by using the `.loc` method of the pandas DataFrame object instead of a for loop.
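
As a minimal sketch of how the two exported `.txt` files combine into one dataframe, the example below reads two in-memory stand-ins and merges them on the timestamp. The file contents are hypothetical; only the three header lines and the space-separated layout are taken from the reading code in this notebook.

```python
import io
import pandas as pd

# Hypothetical exports: three header lines, then "<time in ms> <value>" per row
scl_txt = "header\nheader\nheader\n0 1.10\n1 1.20\n2 1.30\n"
ecg_txt = "header\nheader\nheader\n0 0.01\n1 0.02\n3 0.04\n"

eda = pd.read_csv(io.StringIO(scl_txt), sep=' ', skiprows=3, names=['timestamp', 'raw_EDA'])
ecg = pd.read_csv(io.StringIO(ecg_txt), sep=' ', skiprows=3, names=['timestamp', 'raw_ECG'])

# An outer merge keeps every timestamp from either file; missing values become NaN
merged = ecg.merge(eda, on='timestamp', how='outer')
merged = merged.sort_values('timestamp').reset_index(drop=True)
merged['timestamp'] = merged['timestamp'] / 1000  # milliseconds -> seconds
```

Timestamps that occur in only one of the two files end up with `NaN` in the other column, which is worth keeping in mind in later processing steps.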

In [6]:
timings = pd.read_excel(f'{data_dir}\\information\\amsdata_start.xlsx') # Open the file that contains the starting and stop points in a DataFrame

for file in txt_files: # for each SCL text file
    pp = int(file[2:-3])  # Get the pp id
    start_stops = timings.loc[timings.Participant==pp].values[0][1:] # Subset the DataFrame to only the start and stop point for the current PP, and collect those values
    
    # Open the EDA and ECG file and merge these two based on their time index
    eda_data = pd.read_csv(f'{physData_dir}\\{file}SCL.txt', sep=' ', skiprows=3, names=['timestamp', 'raw_EDA'])
    ecg_data = pd.read_csv(f'{physData_dir}\\{file}ECG.txt', sep=' ', skiprows=3, names=['timestamp', 'raw_ECG'])
    data = ecg_data.merge(eda_data, on='timestamp', how='outer')
    
    
    data['timestamp'] = data['timestamp']/1000 # Divide the timestamp column by 1000, since the timestamps in the txt files are in ms but the starting points from the excel file are in seconds
    
    freq=round(1/(data.timestamp.values[1] - data.timestamp.values[0])) # Compute the frequency of the signals using the following formula: freq=1/(t_2 - t_1)
    
    data['pp'] =  pp # Store the PP_id in the DataFrame
    
    ## Very hard to read code below: Selects the rows from the DataFrame containing the EDA and ECG signal that are between the starting and stop points of the active components
    ## This code is faster than looping and appending to arrays. See the comments below for the logic
    data = data.loc[((data.timestamp>start_stops[0]) & (data.timestamp<start_stops[1])) # Get each row: (After 1st timepoint in start_stop) AND (before 2nd timepoint in start_stop) OR 
                    |((data.timestamp>start_stops[2]) & (data.timestamp<start_stops[3])) # (After 3rd timepoint in start_stop) AND (before 4th timepoint in start_stop) OR
                    |((data.timestamp>start_stops[4]) & (data.timestamp<start_stops[5])) # And so on...
                    |((data.timestamp>start_stops[6]) & (data.timestamp<start_stops[7]))
                    |((data.timestamp>start_stops[8]) & (data.timestamp<start_stops[9])),:]
    
    data = data.astype('float') # Set the entire dataframe as float for memory reasons; astype returns a new DataFrame, so the result is assigned back
    
    # Reset the index before storing the DataFrame in an feather file for later use
    data.reset_index(inplace=True)
    data.to_feather(f'{data_dir}\\interim\\physiological_all\\{pp}.feather')
    print(f'pp{pp} done')
    
print('All done!')

pp71 done
pp76 done
pp77 done
pp82 done
pp84 done
pp86 done
pp90 done
pp91 done
pp92 done
pp93 done
pp95 done
pp96 done
pp97 done
pp98 done
All done!
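
The chained `.loc` condition in the cell above can also be written as a short loop over (start, stop) pairs. This is only a sketch of an alternative with the same strict comparisons; the helper name `select_intervals` is hypothetical and the demo data is made up.

```python
import numpy as np
import pandas as pd

def select_intervals(data, bounds):
    """Keep rows whose timestamp lies strictly inside any (start, stop) pair.

    `bounds` is a flat sequence: start_0, stop_0, start_1, stop_1, ...
    """
    mask = np.zeros(len(data), dtype=bool)
    for start, stop in zip(bounds[0::2], bounds[1::2]):
        # OR the rows of this interval into the overall mask
        mask |= ((data['timestamp'] > start) & (data['timestamp'] < stop)).to_numpy()
    return data.loc[mask]

demo = pd.DataFrame({'timestamp': np.arange(10.0)})
selected = select_intervals(demo, [1, 4, 6, 9])
```

The loop runs once per component, not once per row, so the comparisons themselves stay vectorised.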


## TSST component
Below we extract the TSST component of the experiment. From both the `.acq` files and the `.amsdata` files we extract the physiological signal from the TSST speech and TSST math components, as well as the signal in between both components. Because the camera is started a bit before the start of the TSST speech component and the recording lasts around 15 to 20 minutes, it is easier to extract the physiological signal from 3 minutes before the start of the TSST speech component until 20 minutes after it. This is once again stored in a `.feather` file.

For this component we also add a `t_from_start` column, which indicates for each timepoint in the signal how far (in seconds) it is from the start of the TSST component. We can use this later on to synchronise the signal with the video data.
### .acq files
Once again, we can select the components using the digital input channel present in the `.acq` files. The starting point of the TSST speech component is indicated by the 5th time the digital input was activated. For the `.acq` files, the code below selects all the EDA and ECG signal from 3 minutes before this starting point until 3 minutes after the end of the TSST math component, with the padding hard-coded for a sampling rate of 2000 Hz.
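
A minimal sketch of padding a window around two sample indices, with the sampling rate estimated from the timestamps instead of being hard-coded. The helper name `window_around` and the synthetic 2 Hz recording are assumptions for illustration only.

```python
import numpy as np

def window_around(timestamps, start_idx, stop_idx, pad_seconds):
    """Return a slice from pad_seconds before start_idx to pad_seconds after stop_idx."""
    # Estimate the sampling rate from the timestamps
    freq = round(1 / np.median(np.diff(timestamps)))
    pad = int(freq * pad_seconds)
    lo = max(start_idx - pad, 0)                # a negative index would wrap around to the end
    hi = min(stop_idx + pad, len(timestamps))
    return slice(lo, hi), freq

timestamps = np.arange(0, 10, 0.5)   # synthetic 2 Hz recording, 20 samples
window, freq = window_around(timestamps, start_idx=6, stop_idx=12, pad_seconds=2)
t_from_start = timestamps[window] - timestamps[6]   # seconds relative to the start sample
```

Clamping the lower bound matters when a recording starts less than the padding time before the component: a negative slice start would otherwise count from the end of the array.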

In [7]:
for file in acq_files: # For each acq files in the dir
    pp = int(file[2:-6]) # Get the pp id
    bio = bioread.read_file(physData_dir + '\\' + file) # Open the physiological file

    # Get the EDA, ECG, Time index and Digital input signal(which corresponds to the triggers from the psychopy script)
    eda = bio.channels[0].data
    ecg = bio.channels[1].data
    timestamps = bio.channels[0].time_index
    digital_input = np.copy(bio.channels[2].data)
    
    # Using the Digital input signal, get the start and stop points of the different components in the study
    starts = np.where(np.diff(digital_input)>0)[0]
    stops = np.where(np.diff(digital_input)<0)[0]
    
    # Select the starting point of the TSST Speech component and the stopping point of the TSST Math component
    start = starts[4]
    stop = stops[6]
    
    # Subset the EDA, ECG, time index and digital input signal, from 3 minutes before the start of the speech component to 3 minutes after the end of the math component (assumes a 2000 Hz sampling rate)
    eda = eda[start-(2000*60*3):stop+(2000*60*3)]
    ecg = ecg[start-(2000*60*3):stop+(2000*60*3)]
    timestamps = timestamps[start-(2000*60*3):stop+(2000*60*3)]
    digital_input = digital_input[start-(2000*60*3):stop+(2000*60*3)]
    
    
    data = pd.DataFrame({'timestamp': timestamps,'raw_EDA': eda, 'raw_ECG': ecg, 'digital_input': digital_input}) # Create a DataFrame with the signals stored in the array
    
    freq=round(1/(data.timestamp.values[1] - data.timestamp.values[0])) # Compute the frequency of the signals using the following formula: freq=1/(t_2 - t_1)
    
    data['t_from_start'] = data['timestamp'] - (start/freq) # Compute the time between each row in the DataFrame and the starting point
    data['pp'] =  pp # Store the PP_id in the DataFrame
    data = data.astype('float') # Set the entire dataframe as float for memory reasons; astype returns a new DataFrame, so the result is assigned back
    
    # Reset the index before storing the DataFrame in an feather file for later use
    data.reset_index(inplace=True)
    data.to_feather(f'{data_dir}\\interim\\physiological\\{pp}.feather')
    print(f'pp{pp} done')
print('All done!')

pp100 done
pp101 done
pp102 done
pp103 done
pp104 done
pp105 done
pp106 done
pp107 done
pp109 done
pp10 done
pp110 done
pp113 done
pp114 done
pp115 done
pp117 done
pp118 done
pp119 done
pp11 done
pp120 done
pp121 done
pp122 done
pp123 done
pp125 done
pp126 done
pp127 done
pp128 done
pp129 done
pp12 done
pp130 done
pp133 done
pp134 done
pp135 done
pp136 done
pp138 done
pp139 done
pp13 done
pp143 done
pp14 done
pp17 done
pp18 done
pp19 done
pp21 done
pp2 done
pp3 done
pp43 done
pp4 done
pp5 done
pp62 done
pp68 done
pp69 done
pp70 done
pp94 done
pp99 done
All done!


### .amsdata files
As said before, the `.amsdata` files (which are exported and stored as `.txt` files) do not mark the different components of the experiment along the physiological signal. Instead, the experimenters in the study have recorded the starting points of each component in an Excel file. The starting point of the TSST speech component is stored in the `start_TSST.csv` file. We open this file and store its contents in a dict called `timings`, from which the moment (in seconds) at which the TSST component starts for a given participant (`pp`) can be retrieved as follows: `start = timings[pp]`.

Once again we can store the raw EDA signal in a pandas dataframe, which allows for easier extraction of the relevant rows by using the `.loc` method of the pandas DataFrame object instead of a for loop. We take the subset of the data between 3 minutes before the start and 20 minutes after the start. This subset is also stored in a `.feather` file in the interim folder.

In [8]:
# Open the file that contains the starting point of the TSST speech component for each participant
timings = {}
with open(project_dir + '\\data\\information\\start_TSST.csv', encoding='utf-8-sig') as f: # The with-block closes the file after reading
    reader = csv.DictReader(f, delimiter=';')
    for row in reader:
        timings[int(row['pp'])] = float(row['start'])
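
To illustrate what the cell above does, here is a sketch that parses an in-memory stand-in for the timing file. The participant numbers and start times are made up. The column names `pp` and `start`, the `;` delimiter and the `utf-8-sig` encoding are the ones used in this notebook.

```python
import csv
import io

# Hypothetical file content: semicolon-separated, with a UTF-8 byte-order mark
raw = '\ufeffpp;start\n71;312.5\n76;298.0\n'.encode('utf-8')

# 'utf-8-sig' strips the byte-order mark, so the first header is 'pp' and not '\ufeffpp'
text = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig', newline='')
with text:
    reader = csv.DictReader(text, delimiter=';')
    timings = {int(row['pp']): float(row['start']) for row in reader}

start = timings[71]   # lookup uses the integer participant id as key
```

Because the keys are integers, the lookup is `timings[pp]` with the numeric participant id, not the string `'pp'`.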

In [9]:
for file in txt_files: # for each SCL text file
    pp = int(file[2:-3]) # Get the pp id
    start = timings[pp] # Get the starting point of the TSST speech component for this participant
    
    # Open the EDA and ECG file and merge these two based on their time index
    eda_data = pd.read_csv(f'{physData_dir}\\{file}SCL.txt', sep=' ', skiprows=3, names=['timestamp', 'raw_EDA'])
    ecg_data = pd.read_csv(f'{physData_dir}\\{file}ECG.txt', sep=' ', skiprows=3, names=['timestamp', 'raw_ECG'])
    data = ecg_data.merge(eda_data, on='timestamp', how='outer')
    
    data['timestamp'] = data['timestamp']/1000 # Divide the timestamp column by 1000, since the timestamps in the txt files are in ms but the starting points from the timing file are in seconds

    freq=round(1/(data.timestamp.values[1] - data.timestamp.values[0])) # Compute the frequency of the signals using the following formula: freq=1/(t_2 - t_1)
    
    data['t_from_start'] =  data['timestamp'] - start # Compute the time between each row in the DataFrame and the starting point
    data['pp'] =  pp # Store the PP_id in the DataFrame
    
    data = data.loc[(data.t_from_start>-180) & (data.t_from_start<1200),:] # Subset the rows 3 minutes prior to the TSST Speech starting point and 20 minutes after this starting point
    
    data = data.astype('float') # Set the entire dataframe as float for memory reasons; astype returns a new DataFrame, so the result is assigned back
    
    # Reset the index before storing the DataFrame in an feather file for later use
    data.reset_index(inplace=True)
    data.to_feather(f'{data_dir}\\interim\\physiological\\{pp}.feather')
    print(f'pp{pp} done')
    
print('All done!')

pp71 done
pp76 done
pp77 done
pp82 done
pp84 done
pp86 done
pp90 done
pp91 done
pp92 done
pp93 done
pp95 done
pp96 done
pp97 done
pp98 done
All done!


## Baseline component

Finally we want to extract the baseline component of the experiment, since we can use this information for standardisation. This is very similar to the extraction of the TSST components, except that we now only have to extract one component. The cells below correspond to the previous ones, with one minor change: we keep only the rows between the starting and end point of the baseline component (for the `.amsdata` files, the 5 minutes after the starting point).
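
How the baseline is used for standardisation is not shown in this notebook. As one possible illustration only, the sketch below z-scores a signal with the mean and standard deviation of a baseline segment. The helper name `standardise_with_baseline` and the sample values are hypothetical, and the later steps may use a different scheme.

```python
import numpy as np

def standardise_with_baseline(signal, baseline):
    """Z-score `signal` using the mean and (population) standard deviation of `baseline`."""
    mu = np.mean(baseline)
    sigma = np.std(baseline)
    return (np.asarray(signal, dtype=float) - mu) / sigma

baseline = np.array([1.0, 2.0, 3.0])   # synthetic baseline samples
task = np.array([2.0, 4.0])            # synthetic task samples
z = standardise_with_baseline(task, baseline)
```

A value of 0 then means "equal to the participant's baseline mean", which makes signals comparable across participants.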

In [10]:
for file in acq_files: # For each acq files in the dir
    pp = int(file[2:-6]) # Get the pp id
    bio = bioread.read_file(physData_dir + '\\' + file) # Open the physiological file
    
    # Get the EDA, ECG, Time index and Digital input signal(which corresponds to the triggers from the psychopy script)
    eda = bio.channels[0].data
    ecg = bio.channels[1].data
    timestamps = bio.channels[0].time_index
    digital_input = np.copy(bio.channels[2].data)
    
    # Using the Digital input signal, get the start and stop points of the different components in the study
    starts = np.where(np.diff(digital_input)>0)[0]
    stops = np.where(np.diff(digital_input)<0)[0]
    
    # Select the start and stop point of the baseline component
    start = starts[0]
    stop = stops[1]
    
    # Subset the EDA, ECG, time index and digital input signal
    eda = eda[start:stop]
    ecg = ecg[start:stop]
    timestamps = timestamps[start:stop]
    digital_input = digital_input[start:stop]

    data = pd.DataFrame({'timestamp': timestamps,'raw_EDA': eda, 'raw_ECG': ecg, 'digital_input': digital_input}) # Create a DataFrame with the signals stored in the array
    
    freq=round(1/(data.timestamp.values[1] - data.timestamp.values[0])) # Compute the frequency of the signals using the following formula: freq=1/(t_2 - t_1)
    
    data['t_from_start'] = data['timestamp'] - (start/freq) # Compute the time between each row in the DataFrame and the starting point
    data['pp'] =  pp # Store the PP_id in the DataFrame
    data = data.astype('float') # Set the entire dataframe as float for memory reasons; astype returns a new DataFrame, so the result is assigned back
    
    # Reset the index before storing the DataFrame in an feather file for later use
    data.reset_index(inplace=True)
    data.to_feather(f'{data_dir}\\interim\\physiological_baseline\\{pp}.feather')
    print(f'pp{pp} done')
print('All done!')

pp100 done
pp101 done
pp102 done
pp103 done
pp104 done
pp105 done
pp106 done
pp107 done
pp109 done
pp10 done
pp110 done
pp113 done
pp114 done
pp115 done
pp117 done
pp118 done
pp119 done
pp11 done
pp120 done
pp121 done
pp122 done
pp123 done
pp125 done
pp126 done
pp127 done
pp128 done
pp129 done
pp12 done
pp130 done
pp133 done
pp134 done
pp135 done
pp136 done
pp138 done
pp139 done
pp13 done
pp143 done
pp14 done
pp17 done
pp18 done
pp19 done
pp21 done
pp2 done
pp3 done
pp43 done
pp4 done
pp5 done
pp62 done
pp68 done
pp69 done
pp70 done
pp94 done
pp99 done
All done!


In [11]:
# Open the file that contains the starting point of the baseline component for each participant
timings = {}
with open(project_dir + '\\data\\information\\start_Baseline.csv', encoding='utf-8-sig') as f: # The with-block closes the file after reading
    reader = csv.DictReader(f, delimiter=';')
    for row in reader:
        timings[int(row['pp'])] = float(row['start'])

In [12]:
for file in txt_files: # for each SCL text file
    pp = int(file[2:-3]) # Get the pp id
    start = timings[pp] # Get the starting point of the baseline component for this participant
    
    # Open the EDA and ECG file and merge these two based on their time index
    eda_data = pd.read_csv(f'{physData_dir}\\{file}SCL.txt', sep=' ', skiprows=3, names=['timestamp', 'raw_EDA'])
    ecg_data = pd.read_csv(f'{physData_dir}\\{file}ECG.txt', sep=' ', skiprows=3, names=['timestamp', 'raw_ECG'])
    data = ecg_data.merge(eda_data, on='timestamp', how='outer')
    
    data['timestamp'] = data['timestamp']/1000 # Divide the timestamp column by 1000, since the timestamps in the txt files are in ms but the starting points from the timing file are in seconds
    
    freq=round(1/(data.timestamp.values[1] - data.timestamp.values[0])) # Compute the frequency of the signals using the following formula: freq=1/(t_2 - t_1)
    
    data['t_from_start'] =  data['timestamp'] - start # Compute the time between each row in the DataFrame and the starting point
    data['pp'] =  pp # Store the PP_id in the DataFrame
    
    data = data.loc[(data.t_from_start>0) & (data.t_from_start<300),:] # Subset the rows from the baseline starting point till 5 minutes after this starting point
    data = data.astype('float') # Set the entire dataframe as float for memory reasons; astype returns a new DataFrame, so the result is assigned back
    
    # Reset the index before storing the DataFrame in an feather file for later use
    data.reset_index(inplace=True)
    data.to_feather(f'{data_dir}\\interim\\physiological_baseline\\{pp}.feather')
    print(f'pp{pp} done')
print('All done!')

pp71 done
pp76 done
pp77 done
pp82 done
pp84 done
pp86 done
pp90 done
pp91 done
pp92 done
pp93 done
pp95 done
pp96 done
pp97 done
pp98 done
All done!
