Imports.
Note that this notebook is very similar to the HRV segmentation program.
The main difference is that the files are organized slightly differently.

In [3]:
import os
import re
import warnings
import pandas as pd
import numpy as np
from preprocess_modules import utilities_hrv as hrvutils
from preprocess_modules import utilities_e4 as e4utils

File paths and regex for reading in participant E4 data folders.

In [11]:
e4_dir = r"P:\Spironolactone\E4"
main_dir = r"P:\Spironolactone\main_qualtrics"
participant_folders = os.listdir(e4_dir)
# use a regex pattern to search for folders starting with p, followed by a zero and two digits in the range 0-9 (e.g. p012)
# note that it doesn't matter whether p is lower or upper case in the folder name due to including f.lower()
participant_folders = [f for f in participant_folders if re.search(r"^p0[0-9][0-9]", f.lower())]
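
To make the folder filter concrete, here is a small sketch of what the pattern accepts and rejects. The folder names are made up for illustration and are not taken from the actual E4 directory.

```python
import re

# Hypothetical folder names, for illustration only.
folders = ["P001_session", "p012", "p123", "notes", "xp001", "P01"]

# "^p0[0-9][0-9]": a leading "p", a literal "0", then two digits.
pattern = re.compile(r"^p0[0-9]{2}")

# Lower-casing the name first makes the match case-insensitive.
matches = [f for f in folders if pattern.search(f.lower())]
print(matches)  # ['P001_session', 'p012']
```

Folders such as `p123` are rejected because the character after `p` must be a zero, so the pattern only covers participant numbers up to 099.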

Make an output directory.

In [5]:
output_dir = os.path.join(e4_dir,"processed_e4_files")
try:
    os.makedirs(output_dir)
except FileExistsError:
    # only raised if the directory already exists; other errors (e.g. permissions) are not hidden
    warnings.warn("Directory already exists. Files may be overwritten. Manual check advised.")
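
As an alternative, the existence check can be made explicit instead of relying on an exception. This is a minimal sketch: `make_output_dir` is a hypothetical helper that is not part of the preprocessing modules, and it is demonstrated in a temporary directory rather than on the project drive.

```python
import os
import tempfile
import warnings

def make_output_dir(path):
    """Create path if needed; warn (rather than fail) if it already exists."""
    if os.path.isdir(path):
        warnings.warn("Directory already exists. Files may be overwritten.")
        return False
    os.makedirs(path)
    return True

# Demonstrate in a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "processed_e4_files")
    created_first = make_output_dir(target)       # directory is created
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        created_second = make_output_dir(target)  # directory exists, warning issued
```

If no warning is needed at all, `os.makedirs(path, exist_ok=True)` does the same job in one line.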



Read in the Qualtrics file with the interval time stamps and clean it up. This step is basically the same as in the HRV preprocessing file.

In [13]:
col_list =  ["Status","DQ-1","Firstbeat_on_time","baseline start","baseline end","Q645","Q646","FILM-START","Q648","Q649"]
new_names = ["Response_type","Participant_number","Firstbeat_start","RT1_start","RT1_end","RT2_start","RT2_end","Film_start","RT3_start","RT3_end"]
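# usecols keeps the column order of the file, not of col_list, so reorder explicitly before renaming
qualtrics_df = pd.read_csv(os.path.join(main_dir,"main_dat.csv"),usecols = col_list,skiprows = [1,2])[col_list]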
qualtrics_df.columns = new_names

qualtrics_df = hrvutils.remove_invalid_records(qualtrics_df,"Participant_number",exclude_pnums = [1])
duplicates = hrvutils.flag_duplicate_participants(qualtrics_df,"Participant_number")
qualtrics_df = hrvutils.remove_duplicate_participants(qualtrics_df,"Participant_number")
qualtrics_df = hrvutils.convert_time_cols(qualtrics_df)
qualtrics_df = hrvutils.add_end_time(qualtrics_df,"Film_start",15)
rt_time_cols = [f for f in qualtrics_df.columns if any(k in f for k in ["start","end"])]
for rt_time in rt_time_cols[1:]:
    qualtrics_df = hrvutils.make_rel_time_cols(qualtrics_df,rt_time_cols[0],rt_time)
qualtrics_df = hrvutils.convert_to_secs(qualtrics_df)

The following participants have duplicate records:
[4.]
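
The helper functions in `utilities_hrv` are not shown here, so the following is only a sketch of the general idea behind relative time columns: each time stamp is expressed as seconds elapsed since a reference time. The time stamp format, the example times and the reading of the `15` in `add_end_time` as minutes are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"  # assumed time stamp format, for illustration only

def seconds_since(reference, timestamp, fmt=FMT):
    """Seconds elapsed between two clock times on the same day."""
    delta = datetime.strptime(timestamp, fmt) - datetime.strptime(reference, fmt)
    return int(delta.total_seconds())

recording_start = "09:58:00"  # hypothetical recording start
film_start = "10:30:00"       # hypothetical film start
# Assumption: the film interval ends 15 minutes after it starts.
film_end = (datetime.strptime(film_start, FMT) + timedelta(minutes=15)).strftime(FMT)

film_start_s = seconds_since(recording_start, film_start)
film_end_s = seconds_since(recording_start, film_end)
print(film_start_s, film_end_s)  # 1920 2820
```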


We now read in the skin conductance (EDA) files.
This does the usual checks, i.e. that a file exists for each participant, and flags duplicates, missing files and short recordings.
Short recordings are identified using a fairly arbitrary threshold. Below I flag everything with a duration of less than 4 hours, but you can adapt that as you see fit.
As in the HRV version, this cuts the EDA file for each participant into sections for RT1, RT2, film and RT3.
At the start I check whether there is an EDA file for each participant and whether the participant has a record in the Qualtrics file. Sometimes there are discrepancies, depending on how up to date the respective data sources are.
I also catch type/value errors for the start/end times of intervals. I had issues with NaN values before I realised that they were caused by an out-of-date Qualtrics file. You could take the try/except block out, but it's nice to have it there in case a Qualtrics time stamp is missing for whatever reason.
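
The threshold arithmetic and the cutting step can be sketched as follows. `min_samples` and `slice_by_seconds` are hypothetical helpers written for this illustration; they are not the functions in `utilities_e4`, whose implementation may differ. The only thing taken from the notebook is the sampling rate of 4 samples per second.

```python
SAMPLING_RATE_HZ = 4  # samples per second, as used in this notebook

def min_samples(hours, rate_hz=SAMPLING_RATE_HZ):
    """Number of samples expected in a recording of the given length."""
    return hours * 60 * 60 * rate_hz

def slice_by_seconds(samples, start_s, stop_s, rate_hz=SAMPLING_RATE_HZ):
    """Return the samples recorded between start_s and stop_s (in seconds)."""
    return samples[int(start_s * rate_hz):int(stop_s * rate_hz)]

recording = list(range(40))  # hypothetical 10-second recording at 4 Hz
segment = slice_by_seconds(recording, 2, 5)

print(min_samples(4))  # 57600
print(len(segment))    # 12
print(segment[0])      # 8
```

A 4-hour session therefore corresponds to 57600 rows, and an interval from second 2 to second 5 covers 12 samples starting at row 8.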

In [41]:
start_intervals = qualtrics_df.filter(like = "start_interval", axis = 1).columns.sort_values()
end_intervals = qualtrics_df.filter(like = "end_interval", axis = 1).columns.sort_values()
intervals = list(zip(start_intervals, end_intervals))
qualtrics_pnums = qualtrics_df.Participant_number.values

missing_eda = []
pnums = []
eda_dat = []
below_min = []
missing_sec = []
duplicates = e4utils.flag_duplicates(participant_folders)
# somewhat arbitrary: if the length of the EDA recording indicates that the session lasted <4 hours, flag it.
# the formula for calculating min_session_length is: hours*minutes_per_hour*seconds_per_minute*sampling_rate (4 samples per second)
min_session_length = 4*60*60*4


for folder in participant_folders:
    pnum = e4utils.get_participant_num(folder)
    if pnum not in qualtrics_pnums:
        print(f" Participant {pnum} not in qualtrics file. Skipping.")
        continue
    if pnum in duplicates:
        print(f"More than one file exists for participant {pnum}. Skipping.")
        continue
    try:
        eda_df = pd.read_csv(os.path.join(e4_dir,folder,"EDA.csv"),header = None,names = ["EDA"])
    except FileNotFoundError:
        print(f"No E4 file found for participant {pnum}. Manual check advised.")
        missing_eda.append(pnum)
        continue
    if eda_df.shape[0]<min_session_length:
        print(f"Recording for participant {pnum} seems short. Manual check advised.")
        below_min.append(pnum)
        continue
    for start, stop in intervals:
        start_val = qualtrics_df.loc[qualtrics_df.Participant_number == pnum,start]
        stop_val = qualtrics_df.loc[qualtrics_df.Participant_number == pnum,stop]
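        try:
            # int() raises if the value is missing (nan) or if no matching row was found
            [int(val) for val in [start_val, stop_val]]
        except (ValueError, TypeError):
            print(f"At least one of start_val, stop_val not int. Skipping participant {pnum}.")
            # note: this only skips the current interval; the remaining intervals are still processed
            continue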
        interval_df = e4utils.get_eda_intervals(eda_df,start_val,stop_val,4)
        if interval_df.empty:
            interval_name = start.split("_")[0]
            warnings.warn(f"Participant {pnum} has no valid data for {interval_name} interval.\nManual check advised.")
            missing_sec.append([pnum,interval_name])
            continue
        # save to file
        interval_df.to_csv(os.path.join(output_dir, "_".join([start.split("_")[0],str(int(pnum)),"eda.csv"])),index = False)


 Participant 1 not in qualtrics file. Skipping.
 Participant 1 not in qualtrics file. Skipping.
More than one file exists for participant 10. Skipping.
More than one file exists for participant 10. Skipping.
Recording for participant 12 seems short. Manual check advised.
Recording for participant 14 seems short. Manual check advised.


Manual check advised.


Recording for participant 18 seems short. Manual check advised.
More than one file exists for participant 20. Skipping.
More than one file exists for participant 20. Skipping.
More than one file exists for participant 27. Skipping.
More than one file exists for participant 27. Skipping.
At least one of start_val, stop_val not int. Skipping participant 39.
 Participant 42 not in qualtrics file. Skipping.
 Participant 43 not in qualtrics file. Skipping.
 Participant 44 not in qualtrics file. Skipping.
 Participant 45 not in qualtrics file. Skipping.
 Participant 45 not in qualtrics file. Skipping.
 Participant 46 not in qualtrics file. Skipping.
 Participant 47 not in qualtrics file. Skipping.
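
The start/end validation inside the loop relies on `int()` raising an error for missing values. It can also be written as an explicit check, as in this minimal sketch. `to_whole_seconds` is a hypothetical helper and the candidate values are made up for illustration.

```python
import math

def to_whole_seconds(value):
    """Return value as an int, or None if it is missing or not numeric."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(number):
        # covers nan as well as +/- infinity
        return None
    return int(number)

candidates = [120.0, float("nan"), None, "n/a", "300"]
cleaned = [to_whole_seconds(v) for v in candidates]
print(cleaned)  # [120, None, None, None, 300]
```

Returning `None` instead of raising makes it easy to collect the affected participants and intervals in a list for a manual check afterwards.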
