# Database of Remote Affective Physiological Signals and Continuous Ratings Collected in Virtual Reality (`DRAP`)

Notebook containing the postprocessing stages for the DRAP dataset.

It takes the individual files with events (.json) and physiological responses (.csv) of each participant and produces a single file `Dataset_DRAP_full_postprocessed.csv` that synchronizes:

1) Amplitude of the physiological responses.
2) Affect ratings.
3) Start and end of the intervention stages Resting and Video for three types of video content:
    - VideoNegative
    - VideoPositive
    - VideoNeutral

In [3]:
# Add files to sys.path
from pathlib import Path
import sys,os
this_path = None
try:    # Works when run as a .py script
    this_path = os.path.dirname(os.path.abspath(__file__)) + "/"
except NameError: # `__file__` is not defined inside a notebook (.ipynb)
    this_path = str(Path().absolute())+"/" 
print("File Path:", this_path)

# Add the level up to the file path so it recognizes the scripts inside `drap`
sys.path.append(os.path.join(this_path, ".."))

File Path: e:\dsv\dev\git_repos\DRAP\notebooks/


In [4]:
# Import classes
import drap.preprocessing       # Generate dataset index, load files, and plots.

# Shortcut for general variable constants
import drap.utils.enums

 # Utils for generation of files and paths
from drap.utils import files_handler

# Import data science libs
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
#matplotlib.rcParams['text.usetex'] = True

---
## Setup

Global variables and functions for file management

In [5]:
### General configuration

# Path to the participants' folder w.r.t this notebook's filepath
DATASET_ROOT_FOLDER = "../data/"

# Used to generate the path of temporary subfolders
DATASET_NAME = "DRAP"

In [6]:
# Functions to generate filepaths

# MAIN FOLDERS FOR OUTPUT FILES
ROOT = this_path + ""   # Root folder for all the files w.r.t this file
TEMP_FOLDER = ROOT+"temp/"  # Main folder for temp files with intermediate calculations
RESULTS_FOLDER = ROOT+"results/"    # Folder to recreate plots and results from analyses

# Generates paths for the temporary files created from this script
def gen_path_temp(filename, extension, subfolders=""):
    # Generates full paths for TEMP FILES just by specifying a name
    return files_handler.generate_complete_path(filename, \
                                        main_folder=TEMP_FOLDER, \
                                        subfolders=DATASET_NAME+"/"+subfolders, \
                                        file_extension=extension)

---
# Dataset index

The class `drap.preprocessing.Manager` contains the scripts to generate an index of the dataset, which facilitates access to the data per participant, event, experimental segment, or physiological variable.

In [7]:
# The preprocessing manager analyzes the original data folder
# to create an index and facilitate preprocessing.
data_loader = drap.preprocessing.Manager(DATASET_ROOT_FOLDER, 
                                    index_files_path = TEMP_FOLDER, # None,
                                    force_index_regeneration=False, 
                                    verbose = False,
                                    )

Index already exists: Loading from  e:\dsv\dev\git_repos\DRAP\notebooks/temp/drap_index/drap_tree_index.json
Participant 0 with folder id: 101 was part of protocol: v1
Participant 1 with folder id: 216 was part of protocol: v1
Participant 2 with folder id: 219 was part of protocol: v1
Participant 3 with folder id: 222 was part of protocol: v1
Participant 4 with folder id: 247 was part of protocol: v1
Participant 5 with folder id: 248 was part of protocol: v1
Participant 6 with folder id: 268 was part of protocol: v1
Participant 7 with folder id: 270 was part of protocol: v1
Participant 8 with folder id: 278 was part of protocol: v1
Participant 9 with folder id: 290 was part of protocol: v1
Participant 10 with folder id: 293 was part of protocol: v1
Participant 11 with folder id: 299 was part of protocol: v1
Participant 12 with folder id: 307 was part of protocol: v1
Participant 13 with folder id: 308 was part of protocol: v1
Participant 14 with folder id: 309 was part of protocol: v1
...

In [8]:
# The attribute `summary` gives an overview of the original files.
# Note that it does not account for synchronization with the 
# actual events in the experiment
data_loader.summary

Unnamed: 0,index_id,participant_id,protocol,Segment,Events_N,Events_duration,Emotions_N,Emotions_duration,Emotions_Valence_avg,Emotions_Arousal_avg
0,0,101,v1,video_1,76,175.555,1,0.000,3.000000,5.000000
1,0,101,v1,video_2,30,420.134,164,415.966,6.390244,5.737805
2,0,101,v1,video_3,26,420.161,121,416.545,2.677686,5.933884
3,0,101,v1,video_4,26,420.217,131,406.863,5.618321,4.862595
4,0,101,v1,video_5,4,120.121,15,101.889,3.733333,1.000000
...,...,...,...,...,...,...,...,...,...,...
190,38,384,v1,video_1,28,170.901,3,0.928,3.333333,6.333333
191,38,384,v1,video_2,26,420.101,265,420.711,4.033962,4.550943
192,38,384,v1,video_3,26,420.024,330,417.415,3.154545,4.884848
193,38,384,v1,video_4,30,420.069,336,420.292,5.800595,5.488095


In [9]:
# The attribute `index` maps each participant index to the filepaths
# of that participant's events and physiological data. Here we show
# the entry for the participant with index 0. Indices follow the order
# in which the files are found in the main folder `data/`
data_loader.index[0]

{'participant_id': '101',
 'protocol': 'v1',
 'events': 'e:\\dsv\\dev\\git_repos\\DRAP\\notebooks/temp/drap_index/participant_101\\compiled_experimental_events.csv',
 'segments': 'e:\\dsv\\dev\\git_repos\\DRAP\\notebooks/temp/drap_index/participant_101\\compiled_protocol_segment.csv',
 'emotions': 'e:\\dsv\\dev\\git_repos\\DRAP\\notebooks/temp/drap_index/participant_101\\compiled_emotion_ratings.csv',
 'data': {'fast_movement': '',
  'slow_movement': '',
  'video_1': '../data/participant_101\\video_1.csv',
  'video_2': '../data/participant_101\\video_2.csv',
  'video_3': '../data/participant_101\\video_3.csv',
  'video_4': '../data/participant_101\\video_4.csv',
  'video_5': '../data/participant_101\\video_5.csv'}}

In [10]:
# The attribute `events` contains all events
# not related to affective states
data_loader.events[0]

Unnamed: 0_level_0,Session,Event
Time,Unnamed: 1_level_1,Unnamed: 2_level_1
1.623248e+09,video_1,Start of signal check. Started data recording....
1.623248e+09,video_1,Signal check finished. Fit state: VeryGood val...
1.623248e+09,video_1,Cinema scene started
1.623248e+09,video_1,Finger lifted
1.623248e+09,video_1,Finger back on touchpad
...,...,...
1.623249e+09,video_4,Video category finished
1.623249e+09,video_5,Playing rest video
1.623249e+09,video_5,Finished playing rest video
1.623249e+09,video_5,Finished playing all videos


In [11]:
# The attribute `emotions` contains the events
# related to self-reported affective ratings.
# The RawX and RawY correspond to the original
# values measured with the joystick controller
data_loader.emotions[0]

Unnamed: 0_level_0,Session,Valence,Arousal,RawX,RawY
Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1.623248e+09,video_1,3,5,94,124
1.623248e+09,video_2,5,5,128,122
1.623248e+09,video_2,6,5,149,127
1.623248e+09,video_2,7,5,170,127
1.623248e+09,video_2,8,5,193,126
...,...,...,...,...,...
1.623249e+09,video_5,5,1,129,7
1.623249e+09,video_5,4,1,127,9
1.623249e+09,video_5,5,1,128,9
1.623249e+09,video_5,4,1,127,11


In [12]:
# The attribute `segments` is a subset of the data stored in
# `events`. This one contains the start and ending point
# of each of the experimental segments, and the specific VideoId
# shown during that experimental segment.
data_loader.segments[0]

Unnamed: 0_level_0,Session,Segment,VideoId,Trigger
Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1.623248e+09,video_2,VideoPositive,-1,Start
1.623248e+09,video_2,VideoPositive,-1,End
1.623248e+09,video_2,VideoPositive,51,Start
1.623248e+09,video_2,VideoPositive,49,Start
1.623248e+09,video_2,VideoPositive,51,End
...,...,...,...,...
1.623249e+09,video_4,VideoNeutral,23,End
1.623249e+09,video_4,VideoNeutral,21,Start
1.623249e+09,video_4,VideoNeutral,21,End
1.623249e+09,video_5,video_5,-1,Start


## Loading physiological data

Below, we show how to access and visualize specific physiological data from the files.

Note that the variable names of the facial EMG channels follow the electrode placement of the Emteq sensor, as shown below:

![EmteqMaskSensors](https://www.frontiersin.org/files/Articles/781218/frvir-03-781218-HTML-r2/image_m/frvir-03-781218-g005.jpg)
*Image taken from paper: emteqPRO—Fully Integrated Biometric Sensing Array for Non-Invasive Biomedical Research in Virtual Reality [DOI](https://doi.org/10.3389/frvir.2022.781218)*

In [13]:
# Loading all variables for a specific participant and a specific session segment
PARTICIPANT_ID_IN_INDEX = 0
SESSION_SEGMENT_NAME = str(drap.utils.enums.SessionSegment.video1) # Or you can type directly the string from session segment `video_1`

# Obtain dataframes with data and metadata
data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_ID_IN_INDEX,
                                                        session_segment=SESSION_SEGMENT_NAME,
                                                        normalize_data_units=True)

Loading from:  ../data/../data/participant_101\video_1.csv


In [14]:
# Physiological data
data

Unnamed: 0_level_0,Frame,Faceplate/FaceState,Faceplate/FitState,Emg/ContactStates[RightFrontalis],Emg/Contact[RightFrontalis],Emg/Raw[RightFrontalis],Emg/RawLift[RightFrontalis],Emg/Filtered[RightFrontalis],Emg/Amplitude[RightFrontalis],Emg/ContactStates[RightZygomaticus],...,Accelerometer/Raw.x,Accelerometer/Raw.y,Accelerometer/Raw.z,Magnetometer/Raw.x,Magnetometer/Raw.y,Magnetometer/Raw.z,Gyroscope/Raw.x,Gyroscope/Raw.y,Gyroscope/Raw.z,Pressure/Raw
Time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.623248e+09,1,1,9,255,1833,393708,0,563,1450,255,...,874,31,-417,-485,-336,805,-151,20,16,0
1.623248e+09,2,1,5,255,1833,393708,0,-754,1450,0,...,874,31,-417,-485,-336,805,-151,20,16,0
1.623248e+09,3,1,5,255,1833,393708,0,1650,1450,0,...,874,31,-417,-485,-336,805,-151,20,16,0
1.623248e+09,4,1,5,255,1833,393708,0,-634,1450,0,...,874,31,-417,-485,-336,805,-151,20,16,0
1.623248e+09,5,1,5,255,1833,393708,0,1343,1450,0,...,874,31,-417,-485,-336,805,-151,20,16,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1.623248e+09,171797,1,9,255,2906,413625,0,-38,27,255,...,866,-3,-428,-485,-360,780,-30,90,-21,0
1.623248e+09,171798,1,9,255,2906,413781,0,-72,27,255,...,872,-1,-430,-485,-360,780,-29,119,-2,0
1.623248e+09,171799,1,9,255,2887,413781,0,25,27,255,...,872,-1,-430,-485,-360,780,-29,119,-2,0
1.623248e+09,171800,1,9,255,2887,413781,0,10,27,255,...,872,-1,-430,-485,-360,780,-29,119,-2,0


In [15]:
# Corresponding metadata
metadata

Unnamed: 0,metadata,value,more
0,#Format/Version,CSV1.1.0,
1,#File/Normalised,NO,
2,#File/Generator,dab2csv,v0.7.0-60-ga2621eb
3,#File/Source,E:\emteq-data-science\VR\python\data_preproces...,
4,#Protocol/Version,MASK.1.2,
5,#Firmware/Build.buildTag,v0.4.4-0-g2e9aa13,
6,#Device/Version.serialId,DAB006G+TM5OVMgICA4FEQQ/w,
7,#Device/Version.hardware,9.3.6-Mobile,
8,#Protocol/Log.message,AT21CS01 DeviceAddress=0,
9,#Faceplate/Version.serial,FLXoHHLagAAgdw,


In [16]:
metadata.reset_index().dtypes

index        int64
metadata    object
value       object
more        object
dtype: object

In [17]:
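def get_metadata_value(df, key):
    """
    Return the `value` entries of the metadata rows whose `metadata`
    field equals `key` (as a pandas Series), or None if there is no match.
    """
    try:
        val = df[ df["metadata"]==key ]["value"]
        return None if (val.size == 0) else val
    except KeyError: # The dataframe lacks the expected columns
        return None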

In [31]:
get_metadata_value(metadata, '#Time/Seconds.referenceOffset')
get_metadata_value(metadata, '#Emg/Properties.rawToVoltageDivisor')
get_metadata_value(metadata, '#Emg/Properties.contactToImpedanceDivisor')
get_metadata_value(metadata, '#Accelerometer/Properties.rawDivisor')
get_metadata_value(metadata, '#Magnetometer/Properties.rawDivisor')
get_metadata_value(metadata, '#Gyroscope/Properties.rawDivisor')


32    16
Name: value, dtype: object
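
As the output above shows, `get_metadata_value` returns a pandas Series with `object` dtype, so the value still has to be extracted and cast before it can be used as a divisor. A minimal sketch of one way to do that on a toy table follows; `get_metadata_scalar` is a hypothetical helper that is not part of `drap`, and the toy values are invented for illustration.

```python
import pandas as pd

# Toy metadata table with the same columns as `metadata` above.
# The values are invented for illustration only.
toy_metadata = pd.DataFrame({
    "metadata": ["#Format/Version", "#Gyroscope/Properties.rawDivisor"],
    "value": ["CSV1.1.0", "16"],
    "more": [None, None],
})

def get_metadata_scalar(df, key, cast=float):
    """Hypothetical helper: first matching value cast with `cast`, or None."""
    matches = df.loc[df["metadata"] == key, "value"]
    if matches.empty:
        return None
    return cast(matches.iloc[0])

gyro_divisor = get_metadata_scalar(toy_metadata, "#Gyroscope/Properties.rawDivisor")
missing = get_metadata_scalar(toy_metadata, "#Not/Present")
print(gyro_divisor, missing)
```

Returning a plain number (or `None`) avoids carrying a one-element Series into later arithmetic.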

In [19]:
# for line in metadata:
    #     if line.find('Frame#') == -1:
    #         data=data.replace("{}".format(line),'', 1)
    #     if line.find('#Time/Seconds.referenceOffset') != -1:
    #         time_offset = float(line.split(',')[1])
    #     if line.find('#Emg/Properties.rawToVoltageDivisor') != -1:
    #         emg_divisor = float(line.split(',')[1])
    #     if line.find('#Emg/Properties.contactToImpedanceDivisor') != -1:
    #         impedance_divisor = float(line.split(',')[1])
    #     if line.find('#Imu/Properties.accelerationDivisor') != -1:
    #         acceleration_divisor = float(line.split(',')[1])
    #     if line.find('#Imu/Properties.magnetometerDivisor') != -1:
    #         magnetometer_divisor = float(line.split(',')[1])
    #     if line.find('#Imu/Properties.gyroscopeDivisor') != -1:
    #         gyroscope_divisor = float(line.split(',')[1])
    #     #New labels for IMU    
    #     if line.find('#Accelerometer/Properties.rawDivisor') != -1:
    #         acceleration_divisor = float(line.split(',')[1])
    #     if line.find('#Magnetometer/Properties.rawDivisor') != -1:
    #         magnetometer_divisor = float(line.split(',')[1])
    #     if line.find('#Gyroscope/Properties.rawDivisor') != -1:
    #         gyroscope_divisor = float(line.split(',')[1])

#### ---- DO NOT CHECK BELOW THIS LINE!

In [20]:
# Reload the data without passing `normalize_data_units` (its default value is used)
data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_ID_IN_INDEX,
                                                        session_segment=SESSION_SEGMENT_NAME)

Loading from:  ../data/../data/participant_101\video_1.csv


In [21]:
# Example: load all variable types for a single facial EMG muscle (Center Corrugator)
COLNAMES_CENTER_CORRUGATOR=drap.GetColnamesFromEmgMuscle(drap.EmgMuscles.CenterCorrugator)
data, metadata = data_loader.load_data_from_participant(participant_idx = 0, session_segment="video_1", columns = COLNAMES_CENTER_CORRUGATOR)
data.head()

AttributeError: module 'drap' has no attribute 'GetColnamesFromEmgMuscle'

In [None]:
# Get the column names for all variables of interest in the rest of the analysis
VARS_OF_INTEREST = DRAP.GetColnamesBasicsNonEmg() + DRAP.GetColnamesFromEmgVariableType(DRAP.EmgVars.Amplitude)
VARS_OF_INTEREST

In [None]:
data, metadata = data_loader.load_data_from_participant(participant_idx = 0, session_segment="video_1", columns = VARS_OF_INTEREST)
data.head()

In [None]:
metadata

In [None]:
data.describe()

In [None]:
# Exploratory visualization of all data samples
data.iloc[2000:].plot.line(subplots=True, figsize=(20,1*data.shape[1]), sharex=True)

## Iterating over all participants

In [None]:
# Total participants
participants_ids = data_loader.index.keys()
participants_ids

In [None]:
# Total sessions
experimental_stages_ids = [ str(session) for session in data_loader.SessionSegment ]
experimental_stages_ids

In [None]:
# ### Test to know how long it would take to load all the dataset of interest
# import time
# for participant in participants_ids:
#     for exp_stage in experimental_stages_ids:
#         t0 = time.time()
#         data, metadata = data_loader.load_data_from_participant(participant_idx = participant, session_part = exp_stage, columns=VARS_OF_INTEREST)
#         print(f"\t>> Loading time: {time.time()-t0} s")

# print("\n\n=======\nFinished loading all relevant data!") 

# ### It took around 4.5mins just loading all the datasets

In [None]:
# Extract the sequence order of the videos

for SYNC_TIME_USER_ID in participants_ids:
    MASK_QUERY = ( 
                # data_loader.events[SYNC_TIME_USER_ID].Event.str.startswith( "Playing category number: " ) | \
                    # data_loader.events[SYNC_TIME_USER_ID].Event.str.startswith( "Video category finished" ) | \
                        # data_loader.events[SYNC_TIME_USER_ID].Event.str.startswith( "Playing rest video" ) | \
                            # data_loader.events[SYNC_TIME_USER_ID].Event.str.startswith( "Finished playing rest video" ) \
                    data_loader.events[SYNC_TIME_USER_ID].Event.str.startswith( "Playing" ) \
                    )
    
    # Event sequence
    EVENT_TEXT_SEQUENCE = "Category sequence:"
    keys_containing_sync_event = data_loader.events[SYNC_TIME_USER_ID].Event.str.startswith(EVENT_TEXT_SEQUENCE)
    cat_sequence = data_loader.events[SYNC_TIME_USER_ID][ keys_containing_sync_event ].iloc[0] # Choose first event
    video_seq = cat_sequence.Event.split(":")[1].split(",")

    print(f"Participant: {SYNC_TIME_USER_ID}, Events: {data_loader.events[SYNC_TIME_USER_ID][MASK_QUERY].shape}, Seq: {video_seq}")
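
The parsing step in the loop above can be tried in isolation on a toy event log. This is only a sketch: the exact text that follows `Category sequence:` in the real logs is an assumption here, and the timestamps are invented. It splits on the first colon only and strips whitespace, so each item comes out clean.

```python
import pandas as pd

# Toy event log; the text after "Category sequence:" is an assumed format.
toy_events = pd.DataFrame(
    {
        "Session": ["video_1", "video_1", "video_2"],
        "Event": [
            "Cinema scene started",
            "Category sequence: 2, 0, 1",
            "Playing category number: 2",
        ],
    },
    index=pd.Index([10.0, 11.0, 12.0], name="Time"),
)

prefix = "Category sequence:"
is_sequence = toy_events["Event"].str.startswith(prefix)
first_match = toy_events.loc[is_sequence, "Event"].iloc[0]  # first matching event

# Split on the first colon only, then strip whitespace around each item.
video_seq = [item.strip() for item in first_match.split(":", 1)[1].split(",")]
print(video_seq)
```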


## Testing pipeline to merge physiological and continuous affective ratings

The merging pipeline takes a participant index and the experimental stage to process (*VideoNegative, VideoNeutral, VideoPositive*):
1. Identify the timestamps for the resting stage $[r_{t0},r_{t1}]$ and the stage watching the video $[v_{t0},v_{t1}]$.
2. Find the VideoID of the content being watched at each moment (facilitates filtering per video, if desired).
3. Merge the physiological and emotional data with corresponding timestamps.
4. Resample the dataframes at 50 Hz.
5. Save the merged dataset in a CSV file.
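
Steps 1, 3 and 4 can be sketched end to end on synthetic data, as a rough illustration rather than the notebook's actual implementation. The signal, the ratings, the column name `Emg/Amplitude[Toy]` and the window limits are all invented; steps 2 and 5 are left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Invented physiological signal: 100 Hz for 2 s, indexed by time in seconds.
t = np.arange(200) / 100.0
physio = pd.DataFrame({"Emg/Amplitude[Toy]": np.sin(2 * np.pi * t)},
                      index=pd.Index(t, name="Time"))

# Invented sparse affect ratings.
ratings = pd.DataFrame({"Valence": [5, 6, 7], "Arousal": [5, 5, 4]},
                       index=pd.Index([0.5, 1.0, 1.5], name="Time"))

# Step 1: keep the samples inside the video window [v_t0, v_t1).
v_t0, v_t1 = 0.205, 1.795
video = physio[(physio.index >= v_t0) & (physio.index < v_t1)]

# Step 3: attach to every sample the most recent rating at or before it.
merged = pd.merge_asof(video, ratings, left_index=True, right_index=True)

# Step 4: resample to 50 Hz (one row every 20 ms), starting at the first sample.
merged.index = pd.to_datetime(merged.index, unit="s")
resampled = merged.resample("20ms", origin="start").ffill().bfill()
resampled.index = (resampled.index - resampled.index[0]).total_seconds()

print(resampled.head())
```

Samples before the first rating have no earlier rating to inherit, which is why the back-fill after the forward-fill is needed.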

In [None]:
## Participant and Video stage to process
PARTICIPANT_IDX = 7
EXPERIMENTAL_STAGE_NAME = "VideoPositive"

Identify timestamps dividing resting and video stages within segment

In [None]:
def calculate_info_from_segment(df_segments, segment_name_to_filter):
    """
    Processes a dataframe of segments timestamps and returns a tuple with:
        - rest_tstamp_start
        - rest_tstamp_end
        - video_tstamp_start
        - video_tstamp_end
        - video_filename
    """

    # Filter the segment corresponding to the intended video
    df_segments = df_segments[ df_segments.Segment == segment_name_to_filter]

    # Find the beginning and end of the RESTING (VideoId == -1)
    rest_start = df_segments[ (df_segments.Trigger=="Start") & (df_segments.VideoId == -1)].index.min()
    rest_end = df_segments[ (df_segments.Trigger=="End") & (df_segments.VideoId == -1)].index.max()

    # The segment watching the VIDEO (VideoId != -1)
    video_start = df_segments[ (df_segments.Trigger=="Start") & (df_segments.VideoId != -1)].index.min()
    video_end = df_segments[ (df_segments.Trigger=="End") & (df_segments.VideoId != -1)].index.max()

    # Which file should be loaded to access the required data
    video_filename = df_segments.Session.iloc[0]

    # Correct the few cases where the video starts a few milliseconds before resting ends
    if rest_end > video_start:
        video_start = rest_end
    
    return (rest_start, rest_end, video_start, video_end, video_filename)
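
To make the boundary logic concrete, the same selection rules can be applied to a small hand-made segment table. This sketch restates the rules inline instead of calling the function above, and every timestamp in it is invented; it includes one case where the first video starts slightly before resting ends.

```python
import pandas as pd

# Toy table with the same columns as `data_loader.segments[...]`; invented times.
toy_segments = pd.DataFrame(
    {
        "Session": ["video_2"] * 6,
        "Segment": ["VideoPositive"] * 6,
        "VideoId": [-1, 51, -1, 51, 49, 49],
        "Trigger": ["Start", "Start", "End", "End", "Start", "End"],
    },
    index=pd.Index([0.0, 119.9, 120.0, 180.0, 180.0, 240.0], name="Time"),
)

is_rest = toy_segments["VideoId"] == -1   # VideoId == -1 marks the resting stage
starts = toy_segments["Trigger"] == "Start"
ends = toy_segments["Trigger"] == "End"

rest_start = toy_segments[starts & is_rest].index.min()
rest_end = toy_segments[ends & is_rest].index.max()
video_start = toy_segments[starts & ~is_rest].index.min()
video_end = toy_segments[ends & ~is_rest].index.max()

# The first video starts 0.1 s before resting ends, so clamp its start.
video_start = max(video_start, rest_end)
print(rest_start, rest_end, video_start, video_end)
```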

In [None]:
r_t0, r_t1, v_t0, v_t1, video_filename = calculate_info_from_segment(data_loader.segments[PARTICIPANT_IDX], EXPERIMENTAL_STAGE_NAME)
print(f"Participant: \t\t{PARTICIPANT_IDX} \nRest duration: \t\t{r_t1-r_t0}s, \nVideos duration: \t{v_t1-v_t0} \nVideoName: \t\t{video_filename} \nResting was first: \t{r_t0 < v_t0}")

In [None]:
# Load video corresponding to desired Experimental stage
data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_IDX, session_segment = video_filename, columns = VARS_OF_INTEREST)

Filter experimental stages

In [None]:
# Filter experimental stages
data_rest = data[ (data.index >= r_t0) & (data.index < r_t1) ]
data_video = data[ (data.index >= v_t0) & (data.index < v_t1) ]

print(data_rest.shape, data_video.shape)
print("Duration stage REST: ", r_t1-r_t0)
print("Duration stage VIDEO: ", v_t1-v_t0)

In [None]:
# data_video.plot.line(subplots=True, figsize=(15,1*data.shape[1]), sharex=True)

Load emotional responses within video range

In [None]:
# Emotions reported during the video stage, restricted to the corresponding session file
Q = (data_loader.emotions[PARTICIPANT_IDX].index >= v_t0) & \
        (data_loader.emotions[PARTICIPANT_IDX].index < v_t1 ) & \
        (data_loader.emotions[PARTICIPANT_IDX].Session == video_filename)
        
data_emotions = data_loader.emotions[PARTICIPANT_IDX][ Q ].drop("Session", axis=1)
data_emotions.head()

Find the VideoId per timestamp

In [None]:
def calculate_video_id_end_timestamps(df_segments, segment_name_to_filter):
    """
    Returns a dataframe with the timestamp where a given VideoID *finishes*.
        - df_segments = DataFrame with segments
    """
    # EVENT_TEXT_SEQUENCE = "Finished playing video number:" # It will return when the event finished!
    # keys_containing_sync_event = df_events.Event.str.startswith(EVENT_TEXT_SEQUENCE)
    # videos_seq = df_events[ keys_containing_sync_event ] # Choose all video numbers
    # videos_ending = videos_seq.Event.str.split(":")
    # video_id_end_timestamp = videos_ending.apply((lambda x: int(x[1])))
    # video_id_end_timestamp = pd.DataFrame({"VideoID":video_id_end_timestamp})

    # Filter the segment corresponding to the intended video
    df_segments = df_segments[ df_segments.Segment == segment_name_to_filter]

    # Find the end of each video stage
    video_id_end_timestamp = df_segments[ (df_segments.Trigger=="End") ]
    video_id_end_timestamp = video_id_end_timestamp[ ["VideoId"] ]
    
    return video_id_end_timestamp

In [None]:
video_id_end_timestamp = calculate_video_id_end_timestamps(data_loader.segments[PARTICIPANT_IDX], EXPERIMENTAL_STAGE_NAME)
video_id_end_timestamp

In [None]:
video_id_end_timestamp.index

In [None]:
print(r_t0, r_t1)
print(v_t0, v_t1)

In [None]:
data_rest = pd.merge_asof(data_rest, video_id_end_timestamp, left_index=True, right_index=True, direction="forward")
data_rest["VideoId"].value_counts()

In [None]:
data_video = pd.merge_asof(data_video, video_id_end_timestamp, left_index=True, right_index=True, direction="forward")
data_video["VideoId"].value_counts()
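
The effect of `direction="forward"` is easiest to see on a tiny example: each sample receives the `VideoId` of the next end timestamp at or after it, and samples after the last end timestamp get no label. The values below are invented.

```python
import pandas as pd

# Five toy samples and two toy "video ended" timestamps.
samples = pd.DataFrame({"signal": [0.1, 0.2, 0.3, 0.4, 0.5]},
                       index=pd.Index([1.0, 2.0, 3.0, 4.0, 5.0], name="Time"))
video_ends = pd.DataFrame({"VideoId": [51, 49]},
                          index=pd.Index([2.5, 4.5], name="Time"))

# Each sample is matched with the first end timestamp that is >= its own time.
labelled = pd.merge_asof(samples, video_ends,
                         left_index=True, right_index=True,
                         direction="forward")
print(labelled)
```

Here the samples at 1.0 s and 2.0 s are labelled 51, those at 3.0 s and 4.0 s are labelled 49, and the sample at 5.0 s is left as NaN because no video ends after it.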

In [None]:
# The index entries shown earlier store the folder id under the key 'participant_id'
data_rest.insert(0, "OriginalParticipantID", data_loader.index[PARTICIPANT_IDX]['participant_id'])
data_video.insert(0, "OriginalParticipantID", data_loader.index[PARTICIPANT_IDX]['participant_id'])
data_video

Merging physiology with affective ratings

In [None]:
# Merge physio with subjective emotions
data_merged = pd.merge_asof(data_video, data_emotions, left_index=True, right_index=True)

In [None]:
print("RANGE VIDEO: \t\t", data_video.index.min(), "---", data_video.index.max(), " \tLength:", data_video.index.max()-data_video.index.min())
print("RANGE EMOTIONS: \t", data_emotions.index.min(), "---", data_emotions.index.max(), " \tLength:", data_emotions.index.max()-data_emotions.index.min())

In [None]:
# data_merged.plot.line(subplots=True, figsize=(15,1*data.shape[1]), sharex=True)

Resample data at 50Hz

In [None]:
def calculate_resampled_dataframe(df, sampling_frequency_hz = 50):
    # Work on a copy so the caller's dataframe keeps its original float index
    df = df.copy()
    df.index = pd.to_datetime(df.index, unit="s")
    _FS = sampling_frequency_hz
    # Resampling period as a Timedelta (20 ms for 50 Hz), anchored at the first sample
    df_resampled = df.resample(pd.Timedelta(seconds=1/_FS), origin='start').ffill()
    # The valence, arousal, rawX, rawY will contain null values before the first value is captured. Fill with first value.
    df_resampled = df_resampled.bfill()
    # Shift the index so that the data starts at 0 seconds
    df_resampled.index -= df_resampled.index[0]
    # Transform from timedelta to float seconds
    df_resampled.index = df_resampled.index.total_seconds()
    return df_resampled
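
One design point worth checking: when the input is sampled faster than the target rate, `.resample(...).ffill()` keeps only the value present at each tick, whereas an aggregation such as `.mean()` summarizes every sample within the bin. The sketch below contrasts the two on an invented 1 kHz ramp. The `.mean()` variant is an alternative shown for comparison, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Invented 1 kHz ramp lasting 100 ms; integer milliseconds keep timestamps exact.
df = pd.DataFrame({"x": np.arange(100, dtype=float)},
                  index=pd.to_datetime(np.arange(100), unit="ms"))

# Sample-and-hold: the value found at each 20 ms tick (the notebook's approach).
held = df.resample("20ms", origin="start").ffill()

# Alternative: the mean of all samples inside each 20 ms bin.
averaged = df.resample("20ms", origin="start").mean()

print(held["x"].tolist())
print(averaged["x"].tolist())
```

Which of the two is preferable depends on the signal; for an already smoothed amplitude envelope the difference may be small.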

In [None]:
# Resample to 50Hz
FS = 50
data_total = calculate_resampled_dataframe(data_merged, FS)
data_total.head()

In [None]:
# Missing values
data_total.isnull().sum()

In [None]:
data_total.plot.line(subplots=True, figsize=(15,1*data_total.shape[1]), sharex=True)#[0].figure.savefig(gen_path_temp(f"fig_test_data", extension=".png"))

Testing the merging process with fake data

In [None]:
# Merge two fake dataframes with MultiIndex
data2 = data_total.copy(deep=True)
data3 = data_total.copy(deep=True)

# Adding index with participant and video category
data_total = pd.concat({(PARTICIPANT_IDX,EXPERIMENTAL_STAGE_NAME):data_total}, names = ["Participant","VideoCategory"])
data_total.head()
# Toy data to prove how to merge fake "VideoPositive" data vertically
data2 = pd.concat({(PARTICIPANT_IDX,"VideoPositive"):data2}, names = ["Participant","VideoCategory"])
data_total = pd.concat([data_total, data2])
# Toy data to combine fake "VideoNeutral" vertically
data3 = pd.concat({(PARTICIPANT_IDX,"VideoNeutral"):data3}, names = ["Participant","VideoCategory"])
data_total = pd.concat([data_total, data3])
data_total
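
Once the participant and stage are stored as index levels, a single block can be pulled back out by its key. A small sketch with invented values, reusing the level names from the cell above:

```python
import pandas as pd

# Toy block of resampled data (invented values), indexed by time in seconds.
part = pd.DataFrame({"Valence": [5, 6]}, index=pd.Index([0.0, 0.02], name="Time"))

# Passing a dict with tuple keys to pd.concat prepends the key as index levels.
video = pd.concat({(7, "VideoPositive"): part}, names=["Participant", "VideoCategory"])
rest = pd.concat({(7, "Resting_VideoPositive"): part}, names=["Participant", "VideoCategory"])

# Stack the blocks vertically and sort so that lookups by key are efficient.
combined = pd.concat([video, rest]).sort_index()
print(list(combined.index.names))

# Select one (participant, stage) block; the remaining index is Time.
subset = combined.loc[(7, "VideoPositive")]
print(subset)
```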

In [None]:
# Save data from a single participant and single experimental session
data_total.to_csv(gen_path_temp("example_df",extension=".csv") )

## Create dataset merging all data

Load all segments for all participants and store the resting and video parts in a single large CSV.

In [None]:
# Where the compiled dataset will be stored
DATASET_POSTPROCESSED_FILENAME = gen_path_temp("Dataset_DRAP_full_postprocessed", extension=".csv")

# If True, besides the compiled .csv dataset, also generate a folder with individual
# datasets per participant and plots that show the saved data
SAVE_SINGLE_FILES_AND_PLOTS = False

# Segments of interest. The timestamps that determine each stage will be found and used to segment the physiological data per participant.
VIDEO_SEGMENT_NAMES = ["VideoNegative", "VideoPositive", "VideoNeutral"]    # Key of experimental stages
PREFIX_RESTING_STAGE = "Resting_"
FS = 50 # Sampling frequency

# Load or create dataframe with statistics of initial dataset
dataset_postprocessed_final = None

### INPUTS / OUTPUTS
"""EDIT CUSTOM FILENAMES"""
input_files = [DATASET_POSTPROCESSED_FILENAME]

# Try to load files maximum two times
RELOAD_TRIES = 2
for tries in range(RELOAD_TRIES):
    try:
        ### LOAD FILE
        print(f"Trying {tries+1}/{RELOAD_TRIES} to load files: {input_files}")
        
        ### CUSTOM SECTION TO READ FILES
        """EDIT CUSTOM READ"""
        dataset_postprocessed_final = pd.read_csv(input_files[0])#, index_col=[0,1,2])
        print(f"File {input_files[0]} was successfully loaded")

    except Exception as e:
        ### CREATE FILE
        print(f"File not found. Creating again! {e}")

        ### CUSTOM SECTION TO CREATE FILES 
        """EDIT CUSTOM WRITE"""

        for PARTICIPANT_IDX in participants_ids:
            for EXPERIMENTAL_STAGE_NAME in VIDEO_SEGMENT_NAMES:

                ## Extract segments for specific video type
                r_t0, r_t1, v_t0, v_t1, video_filename = calculate_info_from_segment(data_loader.segments[PARTICIPANT_IDX], EXPERIMENTAL_STAGE_NAME)
                print(f"\n\nParticipant: \t\t{PARTICIPANT_IDX} \nRest range: \t\t{r_t1-r_t0}s, \nVideos range: \t{v_t1-v_t0} \nVideo filename: \t\t{video_filename} \nResting was first: \t{r_t0 < v_t0}")

                # Load corresponding data and metadata
                data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_IDX, session_segment = video_filename, columns = VARS_OF_INTEREST)
                
                # Detect the ending timestamp of each VideoID to be added to the datasets
                video_id_end_timestamp = calculate_video_id_end_timestamps(data_loader.segments[PARTICIPANT_IDX], EXPERIMENTAL_STAGE_NAME)
                
                """ PROCESSING DATA FROM RESTING STAGES """
                # Filter experimental stages
                data_rest = data[ (data.index >= r_t0) & (data.index < r_t1) ]
                # Combine the videoId with the data from the segment
                data_rest = pd.merge_asof(data_rest,  video_id_end_timestamp, left_index=True, right_index=True, direction="forward")

                # Emotions reported during the resting stage, restricted to the corresponding session file
                Q = (data_loader.emotions[PARTICIPANT_IDX].index >= r_t0) & \
                        (data_loader.emotions[PARTICIPANT_IDX].index < r_t1 ) & \
                        (data_loader.emotions[PARTICIPANT_IDX].Session == video_filename)
                data_emotions_rest = data_loader.emotions[PARTICIPANT_IDX][ Q ].drop("Session", axis=1)
                # Merge data and emotions in a single dataframe per time
                data_rest = pd.merge_asof(data_rest, data_emotions_rest, left_index=True, right_index=True)

                # Resampling data
                data_rest_resampled = calculate_resampled_dataframe(data_rest, FS)

                """ PROCESSING DATA FROM VIDEO STAGES """
                # Filter experimental stages
                data_video = data[ (data.index >= v_t0) & (data.index < v_t1) ]
                # Combine the videoId with the data from the segment
                data_video = pd.merge_asof(data_video, video_id_end_timestamp, left_index=True, right_index=True, direction="forward")
                
                # Emotions reported during the video stage, restricted to the corresponding session file
                Q = (data_loader.emotions[PARTICIPANT_IDX].index >= v_t0) & \
                        (data_loader.emotions[PARTICIPANT_IDX].index < v_t1 ) & \
                        (data_loader.emotions[PARTICIPANT_IDX].Session == video_filename)
                data_emotions = data_loader.emotions[PARTICIPANT_IDX][ Q ].drop("Session", axis=1)
                
                # Merge data and emotions in a single dataframe per time
                data_video = pd.merge_asof(data_video, data_emotions, left_index=True, right_index=True)
                # Resample dataset to constant period
                data_video_resampled = calculate_resampled_dataframe(data_video, FS)

                """ COMBINING DATASET IN A SINGLE ONE """
                print(f"Actual duration stage VIDEO: {data_video_resampled.index.max()} \tSHORT?:{data_video_resampled.index.max()<295}")
                print(f"Actual duration stage REST: {data_rest_resampled.index.max()} \tSHORT?:{data_rest_resampled.index.max()<115}")
                print(f"Total missing vals:{data_video_resampled.isnull().sum().sum()}")

                # Add a column with the original participant ID corresponding to the original dataset
                folder_id = data_loader.index[PARTICIPANT_IDX]['participant_id'] # Key under which the index stores the folder id
                data_rest_resampled.insert(0, "OriginalParticipantID", folder_id)
                data_video_resampled.insert(0, "OriginalParticipantID", folder_id)

                ### SAVING FILES
                if SAVE_SINGLE_FILES_AND_PLOTS:
                    # Save resting and video data
                    data_to_plot = {
                                        "video_data":data_video_resampled,
                                        "resting_data":data_rest_resampled
                                    }
                    for _df_name,_df in data_to_plot.items():
                        # Output paths for the plot and the .csv of this stage
                        save_path_plot = gen_path_temp(f"per_participant/_plots/{folder_id}/{EXPERIMENTAL_STAGE_NAME}_{_df_name}", extension=".png")
                        save_path_csv = gen_path_temp(f"per_participant/{folder_id}/{EXPERIMENTAL_STAGE_NAME}_{_df_name}", extension=".csv")
                        _df.plot.line(subplots=True, figsize=(15,1*_df.shape[1]), sharex=True)[0].figure.savefig(save_path_plot); plt.close()
                        _df.to_csv(save_path_csv)

                ### Generating multiindex to create a single .csv with all the data
                COLNAMES_MULTIINDEX = ["Participant","Stage"]
                data_video_resampled = pd.concat({(PARTICIPANT_IDX,EXPERIMENTAL_STAGE_NAME):data_video_resampled}, names = COLNAMES_MULTIINDEX)
                data_rest_resampled = pd.concat({(PARTICIPANT_IDX, PREFIX_RESTING_STAGE + EXPERIMENTAL_STAGE_NAME):data_rest_resampled}, names = COLNAMES_MULTIINDEX)

                # Final concatenation of resting and video stages
                data_compiled = pd.concat([data_video_resampled.copy(deep=True), data_rest_resampled.copy(deep=True)])

                # Generate final DF
                if(dataset_postprocessed_final is None):
                    dataset_postprocessed_final = data_compiled.copy(deep=True)
                    # dataset_postprocessed_final = pd.concat([dataset_postprocessed_final, data_rest_resampled.copy(deep=True)])
                else:
                    dataset_postprocessed_final = pd.concat([dataset_postprocessed_final, data_compiled.copy(deep=True)])
                    # dataset_postprocessed_final = pd.concat([dataset_postprocessed_final, data_rest_resampled.copy(deep=True)])

        # Saving .csv
        dataset_postprocessed_final.to_csv( DATASET_POSTPROCESSED_FILENAME )
        print("\n\n End")


        ### ---- CONTROL RETRIES
        if tries+1 < RELOAD_TRIES:
            continue
        else:
            raise
    
    # Finish iteration
    break
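
The load-or-create pattern used in the cell above can also be written as a small function that checks for the cached file first. This is a sketch under stated assumptions: `load_or_create` and `build_toy_dataset` are hypothetical names, and the toy builder stands in for the nested participant/stage loop. Reading with `index_col=[0, 1, 2]` (left commented out in the cell above) restores the three index levels instead of returning them as ordinary columns.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_create(csv_path, build_fn):
    """Hypothetical helper: load a cached CSV if present, else build and save it."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        # Restore the (Participant, Stage, Time) index levels from the first columns.
        return pd.read_csv(csv_path, index_col=[0, 1, 2])
    df = build_fn()
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path)
    return df

def build_toy_dataset():
    # Stand-in for the nested loop; values are invented.
    part = pd.DataFrame({"Valence": [5.0, 6.0]},
                        index=pd.Index([0.0, 0.02], name="Time"))
    return pd.concat({(0, "VideoPositive"): part}, names=["Participant", "Stage"])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cache" / "toy.csv"
    created = load_or_create(path, build_toy_dataset)  # builds and saves the file
    loaded = load_or_create(path, build_toy_dataset)   # reads the cached file

print(loaded)
```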

In [None]:
dataset_postprocessed_final.head()

In [None]:
dataset_postprocessed_final.shape

In [None]:
print(">> FINISHED WITHOUT ERRORS!!")