# Processing the windows
This notebook focuses on processing the sampled windows so that they can be used as training and test data in the modelling part. At this point it only involves checking how many participants are still included in the windows DataFrames and adding age and gender information to all windows. More processing steps could be added to this notebook.
#### Requirements
Before running this notebook, make sure that you have run the `4-ak-window-sampling` notebook. This notebook processes the windows DataFrames that the previous notebook created and stored in `data\processed`.
You also need to have stored the demographics file `Demogr tDCS WM stress.csv` in the `data\raw` directory.

In [1]:
import pandas as pd
import numpy as np
import os

After loading the required modules, we store the project directory (the parent of the current working directory) in a variable called `project_dir`. We then store the folder where the data files are located in `data_dir` and the processed subdirectory in `processed_dir`. Finally, `processed_files` holds the names of the HDF files in that subdirectory that do not yet carry the `processed_` prefix.

In [2]:
project_dir = os.getcwd().split('\\')[:-1]
project_dir = '\\'.join(project_dir) # Get the project dir 
data_dir = project_dir + '\\data' # Get the data dir
processed_dir = data_dir + '\\processed' # Get the processed subdir
processed_files = [file for file in os.listdir(processed_dir) if file.endswith('hdf') and not file.startswith('processed_')] # Get the processed files
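The cell above builds its paths with Windows-style backslashes. As a minimal sketch, assuming the same folder layout, the same lookup could be written with `pathlib` so that it does not depend on the path separator. The helper name `list_window_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Assumed layout, as in the cell above: the notebook lives one level below the project root.
project_dir = Path.cwd().parent
data_dir = project_dir / 'data'
processed_dir = data_dir / 'processed'


def list_window_files(directory):
    """Return the names of the .hdf files that do not start with 'processed_'."""
    return sorted(
        path.name
        for path in Path(directory).glob('*.hdf')
        if not path.name.startswith('processed_')
    )
```

Calling `list_window_files(processed_dir)` would then give a list comparable to `processed_files`, sorted by name.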

## Processing
Below we execute some simple processing steps. If necessary, more can be added.
### Checking how many participants were still in the data sample
We noticed that we did not extract data from all participants. The first cell shows how many participants have windows in each of the DataFrames.

In [3]:
for file in processed_files:
    window = pd.read_hdf(f'{processed_dir}\\{file}')
    print(f'In file {file} there is data from {len(window.pp.unique())} participants')

In file predicted_data_LR.hdf there is data from 10 participants
In file predicted_data_RF_120_108.hdf there is data from 10 participants
In file predicted_data_RF_120_84.hdf there is data from 10 participants
In file predicted_data_RF_180_125.hdf there is data from 10 participants
In file predicted_data_RF_180_162.hdf there is data from 10 participants
In file window_120_step_108.hdf there is data from 60 participants
In file window_120_step_120.hdf there is data from 60 participants
In file window_120_step_84.hdf there is data from 60 participants
In file window_180_step_125.hdf there is data from 60 participants
In file window_180_step_162.hdf there is data from 60 participants
In file window_180_step_180.hdf there is data from 60 participants
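The counts above say how many participants are present, but not which ones are missing. A minimal sketch of how the missing ids could be listed, using a toy DataFrame rather than the real windows: the helper `missing_participants` and the assumed id range 1 to 5 are hypothetical, and the ids are cast to integers on the assumption that `pp` may be stored as strings.

```python
import pandas as pd


def missing_participants(window, expected_ids, id_col='pp'):
    """Return the expected participant ids that have no rows in `window`."""
    present = set(window[id_col].astype('int64'))
    return sorted(set(expected_ids) - present)


# Toy stand-in for a windows DataFrame (illustrative values only)
toy_window = pd.DataFrame({'pp': ['1', '1', '3', '4'], 'value': [0.2, 0.4, 0.1, 0.9]})

n_present = toy_window['pp'].nunique()  # number of distinct participants
missing = missing_participants(toy_window, expected_ids=range(1, 6))
print(f'{n_present} participants present, missing ids: {missing}')
```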


### Gender & Age
In the next two cells we add the age and gender of the participant to each processed window. We get the age and gender from the demographics file `Demogr tDCS WM stress.csv`. The first cell also applies a quick fix, since some rows did not separate properly when the file was read. In the second cell we loop over the DataFrames, adding the gender and age.

In [4]:
demogr = pd.read_csv(data_dir + '\\raw\\Demogr tDCS WM stress.csv', sep=',') # Reading in the demographics DataFrame

## Handling the rows that did not separate properly
i = demogr[demogr.id.str.split(',').str.len()>2].index ## Finding the rows that did not separate by checking whether their id still splits into more than two parts
demogr.loc[i, 'geboortedatum_patient'] = demogr.loc[i, 'id'].str.split(',').str[1] # Extracting the birth date, and adding this to the column geboortedatum_patient
demogr.loc[i, 'geslacht'] = demogr.loc[i, 'id'].str.split(',').str[2] # Extracting the gender, and adding this to the column geslacht
demogr.loc[i, 'id'] = demogr.loc[i, 'id'].str.split(',').str[0] # Extracting the id, and adding this to the column id

demogr = demogr[['id', 'geboortedatum_patient', 'geslacht']] # Dropping all the other columns from the DataFrame
demogr['geboortedatum_patient'] = pd.to_datetime(demogr['geboortedatum_patient']) # Setting the birth date column as a datetime object
now = pd.Timestamp('now') # Getting the current time
demogr['leeftijd'] = (now - demogr['geboortedatum_patient']).astype('<m8[Y]') # Subtracting the birth dates from the current time to get the age of the participant in years.
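The two steps in this cell can be tried on toy data. The sketch below repairs a row whose fields ended up comma-joined in `id`, and then computes the age in completed years by comparing calendar dates. That calendar comparison is an alternative to the `astype('<m8[Y]')` cast, which more recent pandas versions may not accept. The toy rows, the fixed reference date and the helper `age_in_years` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data: the second row ended up as one comma-joined string in `id`.
toy = pd.DataFrame({
    'id': ['1', '2,1990-05-17,M'],
    'geboortedatum_patient': ['1985-03-02', None],
    'geslacht': ['V', None],
})

parts = toy['id'].str.split(',')   # list of fields per row
broken = parts.str.len() > 2       # rows that still contain extra fields
toy.loc[broken, 'geboortedatum_patient'] = parts[broken].str[1]
toy.loc[broken, 'geslacht'] = parts[broken].str[2]
toy.loc[broken, 'id'] = parts[broken].str[0]
toy['geboortedatum_patient'] = pd.to_datetime(toy['geboortedatum_patient'])


def age_in_years(birth_dates, reference):
    """Completed years between each birth date and the reference date."""
    birthday_passed = (birth_dates.dt.month < reference.month) | (
        (birth_dates.dt.month == reference.month)
        & (birth_dates.dt.day <= reference.day)
    )
    # Subtract one year for everyone whose birthday is still to come this year
    return reference.year - birth_dates.dt.year - (~birthday_passed).astype(int)


reference = pd.Timestamp('2024-05-16')  # fixed date, chosen only for illustration
toy['leeftijd'] = age_in_years(toy['geboortedatum_patient'], reference)
```

Using a fixed reference date instead of `pd.Timestamp('now')` keeps the ages the same between runs.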

In [5]:
demogr.id = demogr.id.str.split(',').str[0].astype('int64') # Setting the id of the participant as int type (necessary to merge on pp id)

for file in processed_files: # For each windows DataFrame
    window = pd.read_hdf(f'{processed_dir}\\{file}') # Read in the windows DataFrame
    window.pp = window.pp.astype('int64') # Set pp id as int type
    df = window.merge(demogr[['id', 'geslacht', 'leeftijd']], how='left', left_on='pp', right_on='id', validate='many_to_one') # Adding the age and gender of each participant (merging on pp_id)
    df.to_hdf(f'{processed_dir}\\processed_{file}', key='data') # Saving the DataFrame in the processed dir with the prefix "processed_"

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['target', 'geslacht'], dtype='object')]

  pytables.to_hdf(
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['geslacht'], dtype='object')]

  pytables.to_hdf(
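With a left merge, windows from participants who do not appear in the demographics file are kept, but their `geslacht` and `leeftijd` are left empty. A minimal sketch, on toy data, of how such rows could be detected with the `indicator` option of `merge`; the toy frames and their values are illustrative and not taken from the real data.

```python
import pandas as pd

# Toy windows: participant 4 has no entry in the toy demographics
windows = pd.DataFrame({'pp': [1, 1, 2, 4], 'feature': [0.1, 0.2, 0.3, 0.4]})
demo = pd.DataFrame({'id': [1, 2, 3], 'geslacht': ['V', 'M', 'V'], 'leeftijd': [34, 29, 41]})

merged = windows.merge(
    demo,
    how='left',               # keep every window, matched or not
    left_on='pp',
    right_on='id',
    validate='many_to_one',   # raises if an id occurs more than once in demo
    indicator=True,           # adds a `_merge` column: 'both' or 'left_only'
)

# Participants whose windows found no demographics row
unmatched = [int(pp) for pp in sorted(merged.loc[merged['_merge'] == 'left_only', 'pp'].unique())]
print(f'Participants without demographics: {unmatched}')
```

The `validate='many_to_one'` argument is the same check the cell above uses: it guards against duplicated participant ids in the demographics, which would otherwise silently duplicate windows.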
