# `AVDOS-VR` - Virtual Reality Affective Video Database with Physiological Signals

Notebook containing the postprocessing stages for the AVDOSVR dataset.

It takes the individual files containing events (.json) and physiological responses (.csv) of each participant and produces a single file `Dataset_AVDOSVR_full_postprocessed.csv` that synchronizes physiological responses with affect ratings, grouped per affect segment and experimental stage (rest, video).

In [None]:
# Add files to sys.path
from pathlib import Path
import sys,os
this_path = None
try:    # WORKS WITH .py
    this_path = str(os.path.dirname(os.path.abspath(__file__))) + "/"
except NameError: # WORKS WITH .ipynb (`__file__` is not defined in notebooks)
    this_path = str(Path().absolute())+"/"
print("File Path:", this_path)

# Add the parent folder to sys.path so that the scripts inside `avdosvr` can be imported
sys.path.append(os.path.join(this_path, ".."))

In [None]:
# Import classes
import avdosvr.preprocessing       # Generate dataset index, load files, and plots.

# Shortcut for general variable constants
import avdosvr.utils.enums as avdosEnums
# Utils for generation of files and paths
from avdosvr.utils import files_handler
from avdosvr.analysis.dataframe_functions import resample_dataframe

# Import data science libs
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
#matplotlib.rcParams['text.usetex'] = True

---
## Setup

Global variables and functions for file management

In [None]:
### General configuration

# Path to the participants' folder w.r.t this notebook's filepath
DATASET_ROOT_FOLDER = "../data/"

# Used to generate the path of temporary subfolders
NOTEBOOK_NAME = "1_preprocess"

In [None]:
# Functions to generate filepaths

# MAIN FOLDERS FOR OUTPUT FILES
ROOT = this_path + ""   # Root folder for all the files w.r.t this file
TEMP_FOLDER = ROOT+"temp/"  # Main folder for temp files with intermediate calculations
RESULTS_FOLDER = ROOT+"results/"    # Folder to recreate plots and results from analyses

# Generates paths for files created from this script
def gen_path_temp(filename, extension, subfolders=""):
    # Generates full paths for TEMP FILES just by specifying a name
    return files_handler.generate_complete_path(filename,
                                        main_folder=TEMP_FOLDER,
                                        subfolders=NOTEBOOK_NAME+"/"+subfolders,
                                        file_extension=extension)
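
For readers without access to `files_handler`, the sketch below mimics what a helper of this kind is expected to do: join the main folder, subfolders, filename, and extension, and create any missing folder. The function `build_path` is hypothetical, and creating the folders is an assumption; it is not the actual implementation in `avdosvr`.

```python
import os
import tempfile

def build_path(filename, main_folder, subfolders="", file_extension=""):
    """Hypothetical stand-in: join the parts and make sure the folder exists."""
    folder = os.path.join(main_folder, subfolders)
    os.makedirs(folder, exist_ok=True)   # create missing intermediate folders
    return os.path.join(folder, filename + file_extension)

# Use a throwaway directory so nothing is written next to the notebook
demo_root = tempfile.mkdtemp()
demo_path = build_path("example_figure", demo_root, "1_preprocess/", ".png")
print(demo_path)
```

Keeping all temporary outputs under one subfolder per notebook makes it easy to delete intermediate files without touching the original dataset.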


---
## Creating a dataset index

The class `avdosvr.preprocessing.Manager()` contains the scripts to generate an index of the dataset, which facilitates access to the data per participant, event, experimental segment, or physiological variable.

In [None]:
# The preprocessing manager analyzes the original data folder
# to create an index and facilitate preprocessing.
data_loader = avdosvr.preprocessing.Manager(DATASET_ROOT_FOLDER, 
                                    index_files_path = TEMP_FOLDER, # None,
                                    force_index_regeneration=True, 
                                    verbose = True,
                                    )

The summary dataframe produces the following columns:
- `index_id`: Identifier of a participant in the index, based on the order of the folders in the original dataset.
- `participant_id`: Participant's identifier according to original folder name.
- `protocol`: *v1* or *v2* depending on which type of remote experiment the participant did (see paper for details).
- `Segment`: *video_1* to *video_5* identifying the filename of the experimental segment.
- `NonAffectiveEvents_N`: Number of events **not** related to self-reported affective ratings.
- `NonAffectiveEvents_duration`: The difference between the first and the last non-affective event (in seconds).
- `AffectiveRatings_N`: Number of events related to self-reported affective ratings.
- `AffectiveRatings_duration`: The difference between the first and the last affective event (in seconds). Segments without ratings appear as *NaN* in the dataframe.
- `AffectiveRatings_Valence_avg`: Average *Valence* ratings throughout a specific experimental *Segment*.
- `AffectiveRatings_Arousal_avg`: Average *Arousal* ratings throughout a specific experimental *Segment*.
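
As an illustration of how the count, duration, and average columns relate to the underlying events, here is a minimal sketch on made-up data. The table `toy_events` and its column names are assumptions for the example; the actual computation inside `Manager` is not shown in this notebook.

```python
import pandas as pd

# Toy event log: timestamps in seconds, two segments (all values are made up)
toy_events = pd.DataFrame({
    "Segment":   ["video_1", "video_1", "video_1", "video_2", "video_2"],
    "Timestamp": [0.0, 60.0, 120.0, 10.0, 310.0],
    "Valence":   [5.0, 6.0, 7.0, 2.0, 4.0],
})

# One row per segment: number of events, time between the first and
# last event, and the average rating within the segment
toy_summary = toy_events.groupby("Segment").agg(
    N=("Timestamp", "count"),
    duration=("Timestamp", lambda t: t.max() - t.min()),
    Valence_avg=("Valence", "mean"),
)
print(toy_summary)
```

Note that the duration defined this way only spans the logged events, so it can be shorter than the actual length of the segment.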

In [None]:
# The attribute `summary` presents an overview of the original files.
# Note that it does not consider synchronization with the
# real events in the experiment
data_loader.summary

In [None]:
# Some rows contain NaN values; they belong to the resting `video_1` (i.e., ratings were not collected during the training portion of the video_1 segment)
data_loader.summary[ data_loader.summary.isna().any(axis=1) ]

In [None]:
# The attribute `index` contains the filepaths of the events
# and physiological data for each participant; here, the participant
# with index 0. The index follows the order in which
# the files are found in the main folder `data/`
data_loader.index[0]

In [None]:
# The attribute `events` contains all events
# not related to affective states
data_loader.events[0]

In [None]:
# The attribute `emotions` contains the events
# related to self-reported affective ratings.
# The RawX and RawY correspond to the original
# values measured with the joystick controller
data_loader.emotions[0]

In [None]:
# The attribute `segments` is a subset of the data stored in
# `events`. This one contains the start and ending point
# of each of the experimental segments, and the specific VideoId
# shown during that experimental segment.
data_loader.segments[0]

## Loading physiological data

Below, we show how to access and visualize specific physiological data from the files.

Note that the variable names to indicate the facial EMG are based on the placement provided by the Emteq sensor, as shown below:

![EmteqMaskSensors](https://www.frontiersin.org/files/Articles/781218/frvir-03-781218-HTML-r2/image_m/frvir-03-781218-g005.jpg)
*Image taken from paper: emteqPRO—Fully Integrated Biometric Sensing Array for Non-Invasive Biomedical Research in Virtual Reality [DOI](https://doi.org/10.3389/frvir.2022.781218)*

In [None]:
# Loading all variables for a specific participant and a specific session segment
PARTICIPANT_IDX = 0
SESSION_SEGMENT_NAME = str(avdosEnums.SessionSegment.video1) # Or you can type directly the string from session segment `video_1`

# Obtain dataframes with data and metadata
data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_IDX,
                                                        session_segment=SESSION_SEGMENT_NAME)

In [None]:
data

In [None]:
metadata

In [None]:
data.describe()

### Normalization

The command above loads the raw data exactly as stored in the file. The only transformation is the conversion of `Time` to Unix timestamps, so that it matches the timestamps in the event files.

The parameter `normalize_data_units` loads the data with normalized units, as follows:
- The EMG variables with type `Raw/`, `Filtered/`, and `Amplitude/` are normalized with the value in `#Emg/Properties.rawToVoltageDivisor` to produce signal in **volts**.
- The EMG variables with type `Contact/` are normalized with the value in `#Emg/Properties.contactToImpedanceDivisor` to produce data in **ohms**.
- The `Accelerometer/` variables are normalized with `#Accelerometer/Properties.rawDivisor` to produce data in $m/s^2$.
- The `Magnetometer/` variables are normalized with `#Magnetometer/Properties.rawDivisor` to produce data in $\mu T$.
- The `Gyroscope/` variables are normalized with `#Gyroscope/Properties.rawDivisor` to produce data in $^\circ/s$.
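
The normalization described above amounts to dividing each group of columns by the divisor stored in the metadata. The sketch below shows the idea on made-up data; the column names and divisor values are hypothetical and are not read from a real file.

```python
import pandas as pd

# Made-up raw readings; real column names and values come from the dataset files
toy_raw = pd.DataFrame({
    "Emg/Raw[Example]":    [1000, 2000, -500],
    "Accelerometer/Raw.x": [981, 0, -981],
})
# Hypothetical divisors; the real ones are stored in the file metadata
toy_divisors = {"Emg/": 1_000_000.0, "Accelerometer/": 100.0}

toy_normalized = toy_raw.astype(float)
for prefix, divisor in toy_divisors.items():
    # Scale every column that belongs to this signal type
    cols = [c for c in toy_normalized.columns if c.startswith(prefix)]
    toy_normalized[cols] = toy_normalized[cols] / divisor

print(toy_normalized)
```

Comparing `describe()` before and after normalization is a quick sanity check that every channel ended up in a plausible physical range.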

In [None]:
# Obtain normalized data with parameter `normalize_data_units`
data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_IDX,
                                                        session_segment=SESSION_SEGMENT_NAME,
                                                        normalize_data_units=True)

In [None]:
# Note how the range of the values changed to the correct units
# compared to the non-normalized data
data.describe()

### Selecting subsets of physiological variables

The parameter `columns` loads only a subset of variables from the physiological signals. This is especially useful because the EMG variables `Raw/`, `Filtered/`, and `Amplitude/` are redundant.

To make it easier to access subsets of variables, the configuration file defines the following groups of data channels:

- Physical faceplate: `COLNAMES_FACEPLATE`,
- Facial EMG: `COLNAMES_EMG_RAW`, `COLNAMES_EMG_FILTERED`, `COLNAMES_EMG_AMPLITUDE`, 
- Contact states: `COLNAMES_EMG_CONTACT`, `COLNAMES_EMG_CONTACT_STATES`, `COLNAMES_NON_EMG_BASIC`
- Heart-related: `COLNAMES_HR`, `COLNAMES_PPG`
- Movement-related: `COLNAMES_ACCELEROMETER`, `COLNAMES_MAGNETOMETER`, `COLNAMES_GYROSCOPE`

The variable `COLNAMES_RECOMMENDED` contains a suggested set of columns useful for analysis, but you can always select columns by providing a list with the custom names that you would like to load.

The function `avdosvr.preprocessing.GetColnamesFromEmgMuscle()` returns all variables for a specific muscle type.
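
To make the selection logic concrete, the following sketch filters a list of column names by muscle and by signal type. The column names and both helper functions are hypothetical; they only illustrate the kind of filtering that such column groups provide.

```python
# Hypothetical column names, only to illustrate the selection logic
toy_columns = [
    "Emg/Raw[CenterCorrugator]", "Emg/Filtered[CenterCorrugator]",
    "Emg/Amplitude[CenterCorrugator]", "Emg/Amplitude[LeftZygomaticus]",
    "HeartRate/Average",
]

def columns_for_muscle(columns, muscle):
    """Return every column whose name mentions the given muscle."""
    return [c for c in columns if f"[{muscle}]" in c]

def columns_with_prefix(columns, prefix):
    """Return every column that starts with the given signal-type prefix."""
    return [c for c in columns if c.startswith(prefix)]

selected = (columns_for_muscle(toy_columns, "CenterCorrugator")
            + columns_with_prefix(toy_columns, "Emg/Amplitude"))
selected = list(dict.fromkeys(selected))   # drop duplicates, keep order
print(selected)
```

Concatenating two groups can list the same column twice (here, the amplitude of the center corrugator), so de-duplicating while preserving order is a cheap safeguard.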

In [None]:
# For example: Creating a list of columns to extract all variables from center corrugator,
# the amplitude of all sensors, and heart rate.
COLS_OF_INTEREST = avdosvr.preprocessing.GetColnamesFromEmgMuscle(avdosEnums.EmgMuscles.CenterCorrugator) +\
                avdosvr.preprocessing.COLNAMES_EMG_AMPLITUDE +\
                avdosvr.preprocessing.COLNAMES_HR
print(COLS_OF_INTEREST)

In [None]:
# Obtain normalized data restricted to the columns of interest with parameter `columns`
data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_IDX,
                                                        session_segment=SESSION_SEGMENT_NAME,
                                                        normalize_data_units=True,
                                                        columns = COLS_OF_INTEREST)

In [None]:
data.columns

## Plotting

The resulting `data` is a pandas dataframe and can be plotted as such. See the [official pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/plotting.html).

In [None]:
# Visualize each channel separately
data.plot.line(figsize=(15,1*data.shape[1]), subplots=True, sharex=True)

## Iterating over all participants

It is often useful to compile data from all participants. However, the resulting dataframe may be too large to fit in memory.
We suggest designing a preprocessing stage for each participant that generates a smaller dataset, and then joining the results for the complete analysis.
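
A minimal sketch of this reduce-then-join strategy is shown next. The loader `toy_load` is a stand-in that returns random numbers instead of reading files, and the chosen summary statistics are only an example.

```python
import numpy as np
import pandas as pd

def toy_load(participant, segment, n=1000):
    """Stand-in for a loader: returns random data instead of reading files."""
    rng = np.random.default_rng(seed=participant)
    return pd.DataFrame({"signal": rng.normal(size=n)})

toy_rows = []
for participant in [0, 1, 2]:
    for segment in ["video_1", "video_2"]:
        df = toy_load(participant, segment)
        # Keep only a small summary of each file before loading the next one
        toy_rows.append({"participant": participant, "segment": segment,
                         "mean": df["signal"].mean(), "std": df["signal"].std()})

# One row per (participant, segment) pair
toy_features = pd.DataFrame(toy_rows)
print(toy_features)
```

Only one full-resolution file is held in memory at a time, so peak memory stays close to the size of the largest single file.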

Below, we show two examples of how to iterate over the participants' data.

In [None]:
# Total participants
participants_ids = data_loader.summary["index_id"].unique()
participants_ids

In [None]:
# Total sessions
experiment_segment_names = data_loader.summary["Segment"].unique()
experiment_segment_names

In [None]:
# Iterate over all participants and segments
for participant in participants_ids:
    for exp_segment in experiment_segment_names:
        print(f"\t>> Participant {participant} and segment {exp_segment}")
print("\n\n=======\nFinished iterating all relevant data!")

*Uncomment the block below if you want to iterate over the whole dataset. Note that the block takes around 15 min to finish.*

In [None]:
# ### Testing how long it would take to iterate over the whole dataset
# ### without including normalization
# import time
# for participant in participants_ids:
#     for exp_segment in experiment_segment_names:
#         t0 = time.time()
#         data, metadata = data_loader.load_data_from_participant(participant_idx = participant, session_segment = exp_segment)
#         print(f"\t>> Loading time: {time.time()-t0} s")
# print("\n\n=======\nFinished loading all relevant data!") 
# ### It takes around 15 mins just loading all the raw dataset one by one

## Loading specific experimental stage

The order of the experimental stages (*Negative, Neutral, Positive*) was randomized. For some participants the file `video_1` corresponds to the videos with `Neutral` affective induction, whereas the same file for another participant may represent the `Positive` stage.

The function `calculate_info_from_segment()` returns the timestamps and the filename needed to access the physiological data per experimental segment, so you do not need to know in advance which file to load.
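
Conceptually, this is a lookup from an affect label to the file and time boundaries recorded for one participant. The sketch below illustrates that lookup on a made-up table; the table layout, the column names, and the function `toy_info_from_segment` are hypothetical and do not reflect the internals of `avdosvr`.

```python
import pandas as pd

# Made-up segment table for one participant; column names are hypothetical
toy_segments = pd.DataFrame({
    "file":     ["video_1", "video_2", "video_3"],
    "affect":   ["Neutral", "Positive", "Negative"],
    "rest_t0":  [0.0, 500.0, 1000.0],
    "rest_t1":  [120.0, 620.0, 1120.0],
    "video_t0": [120.0, 620.0, 1120.0],
    "video_t1": [420.0, 920.0, 1420.0],
})

def toy_info_from_segment(segments, affect):
    """Return (r_t0, r_t1, v_t0, v_t1, file) for the requested affect label."""
    row = segments.loc[segments["affect"] == affect].iloc[0]
    return (row["rest_t0"], row["rest_t1"],
            row["video_t0"], row["video_t1"], row["file"])

r0, r1, v0, v1, fname = toy_info_from_segment(toy_segments, "Positive")
print(fname, r1 - r0, v1 - v0)
```

Because the lookup is keyed by the affect label, the same call works for every participant regardless of the randomized order.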

In this example, we load the affective segment `Positive` for the participant `PARTICIPANT_IDX` and process the columns recommended for a comprehensive analysis.

In [None]:
PARTICIPANT_IDX = 0
AFFECTIVE_SEGMENT = str(avdosEnums.AffectSegments.VideosPositive) # or "Positive"
COLNAMES_PHYSIO = avdosvr.preprocessing.COLNAMES_RECOMMENDED

In [None]:
r_t0, r_t1, v_t0, v_t1, video_filename = data_loader.calculate_info_from_segment(PARTICIPANT_IDX, AFFECTIVE_SEGMENT)

print(f"\n\
Rest duration: \t\t{r_t1-r_t0}s \n\
Videos duration: \t{v_t1-v_t0}s \n\
Video Name: \t\t{video_filename} \n\
Resting was first?: \t{r_t0 < v_t0}"
)

In [None]:
# Load `video_filename` extracted from the desired experimental segment above
data, metadata = data_loader.load_data_from_participant(participant_idx = PARTICIPANT_IDX, 
                                                        session_segment = video_filename,
                                                        normalize_data_units = True,
                                                        columns = COLNAMES_PHYSIO)

In [None]:
# Filter data between stages
data_rest = data[ (data.index >= r_t0) & (data.index < r_t1) ]
data_video = data[ (data.index >= v_t0) & (data.index < v_t1) ]

In [None]:
# This loop is to verify that the data loaded from each participant has
#  the desired length.
for pid in participants_ids:
    video_sequence = data_loader.obtain_order_experimental_segments(pid)
    print(f"Participant {pid} (ID:{data_loader.index[pid]['participant_id']}) had the experimental sequence: {video_sequence}")
    
    for affect_segment in video_sequence:
        # Extract the starting and final timestamps for the resting stage and video stage in the affect segment
        r_t0, r_t1, v_t0, v_t1, video_filename = data_loader.calculate_info_from_segment(pid, affect_segment)

        duration_rest = r_t1-r_t0
        duration_video = v_t1-v_t0

        # Show a warning if the data is shorter than expected
        if(duration_rest < 115):
            print(f"Short data in REST stage!!: Participant {pid} and segment {affect_segment} ({video_filename})")
        if(duration_video < 295):
            print(f"Short data in VIDEO stage!!: Participant {pid} and segment {affect_segment} ({video_filename})")

## **Pipeline to combine physiology and affective annotations**

Finally, we provide an example to generate a subset of the dataset with the following preprocessing stages:

1. Identify the timestamps for the resting stage $[r_{t0},r_{t1}]$ and the stage watching the video $[v_{t0},v_{t1}]$
2. Resample the dataframes at 50Hz
3. Find the affective ratings and videoID of the content being watched at each moment (facilitates filtering per video, if desired)
4. Merge the physiological and emotional data with corresponding timestamps.
5. Merge data from all participants in a CSV file

First, we present the pipeline for a single participant, and then **merge** all datasets in a single exported `.csv` file.

## 1. Load data from a specific stage.
The function `load_data_from_affect_segment()` wraps the steps shown in the previous code cells: it returns the individual rest and video data for a given affect segment.

In [None]:
data_rest, data_video = data_loader.load_data_from_affect_segment(PARTICIPANT_IDX, AFFECTIVE_SEGMENT, columns=COLNAMES_PHYSIO)

In [None]:
print(data_rest.shape, data_video.shape)
print("Duration stage REST: ", data_rest.index[-1] - data_rest.index[0])
print("Duration stage VIDEO: ", data_video.index[-1] - data_video.index[0])

In [None]:
data_video.plot.line(subplots=True, figsize=(15,1*data_video.shape[1]), sharex=True)

## 2. Resample data to 50Hz

To extract features from the time series, it is common to resample all dataframes to the same sampling frequency. The function `avdosvr.analysis.dataframe_functions.resample_dataframe()` performs this resampling so that the combined dataset has a uniform time base.
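
The sketch below shows one common way to resample an irregularly sampled signal with plain pandas: bin the samples into 20 ms windows, average each window, and interpolate any empty window. It is an illustration on synthetic data and an assumption about the general technique; it is not the implementation of `resample_dataframe()`.

```python
import numpy as np
import pandas as pd

# Toy signal sampled irregularly at roughly 1 kHz, indexed by time in seconds
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 2.0, size=2000))
toy = pd.DataFrame({"signal": np.sin(2 * np.pi * t)}, index=t)

FS_TARGET = 50                                     # target frequency in Hz
toy.index = pd.to_timedelta(toy.index, unit="s")   # resample() needs a time-like index
toy_50hz = toy.resample(f"{1000 // FS_TARGET}ms").mean().interpolate()
toy_50hz.index = toy_50hz.index.total_seconds()    # back to seconds

print(toy_50hz.head())
```

Averaging within each window acts as a simple low-pass filter before downsampling, which reduces aliasing compared with picking a single sample per window.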

In [None]:
# Define sampling frequency
FS = 50

In [None]:
# Apply resampling dataframe at the defined sampling frequency.
data_rest_resampled = resample_dataframe(data_rest, FS, keep_original_timestamps=True)
data_video_resampled = resample_dataframe(data_video, FS, keep_original_timestamps=True)
data_video_resampled.head()

In [None]:
# Check if there are missing values
data_video_resampled.isnull().sum()

In [None]:
# Plot a subset of columns (accelerometer)
df_plot = data_video_resampled[ avdosvr.preprocessing.COLNAMES_ACCELEROMETER ]
df_plot.plot.line(subplots=True, figsize=(8,2*df_plot.shape[1]), sharex=True)

In [None]:
data_video_resampled.plot.line(subplots=True, figsize=(15,1*data_video_resampled.shape[1]), sharex=True)

In [None]:
## Save the image to a file
path_to_save = gen_path_temp("example_figure", extension=".png")
data_video_resampled.plot.line(subplots=True, figsize=(15,1*data_video_resampled.shape[1]), sharex=True)[0].figure.savefig(path_to_save)
print(path_to_save)
plt.close()

## 3. Find affect states and Video IDs

Load the subjective affective ratings corresponding to the file of a specific `affect segment`. The function `load_emotions_from_affect_segment()` obtains the individual rest and video affect ratings for a given affect stage. The returned values are:

- `Valence`: Affective valence rating. Range 1-9
- `Arousal`: Affective arousal rating. Range 1-9
- `RawX`: Raw input x-axis from joystick used to report valence. Range 0-255
- `RawY`: Raw input y-axis from joystick used to report arousal. Range 0-255

In [None]:
emotions_rest, emotions_video = data_loader.load_emotions_from_affect_segment(PARTICIPANT_IDX, AFFECTIVE_SEGMENT)
emotions_video

The function `calculate_video_id_end_timestamps()` provides the end time of each video within the video stage, given a user and a specific affect segment. The possible values of the column `VideoId` are:
- `NaN`: Last timestamp of data not corresponding to an experimental segment (e.g., measuring from the Emteq mask without having started the experiment)
- `-1`: Last timestamp of the `resting video` from the specific affect segment.
- `[int]`: Integer denoting the last timestamp of the user watching the corresponding `VideoId`

In [None]:
# Find the ending timestamp of each VideoId. `VideoId=-1` corresponds to a resting stage
video_id_end_timestamp = data_loader.calculate_video_id_end_timestamps(PARTICIPANT_IDX, AFFECTIVE_SEGMENT)
video_id_end_timestamp

## 4. Merging the physiological data, affect ratings, and video ids.

The sampling frequency of the physiological `data` does not match the frequency and timestamps of the `emotions`. Thus, they need to be aligned before merging.

The `VideoId` corresponding to each physiological data sample can be obtained using the function `pd.merge_asof()`.
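
The following toy example shows why the two merges use different directions. All values are made up: `direction="forward"` assigns each sample the next video end timestamp (the video still playing), while the default backward direction assigns the most recent rating.

```python
import pandas as pd

# Toy physiological samples indexed by time (seconds)
toy_physio = pd.DataFrame({"signal": [0.1, 0.2, 0.3, 0.4, 0.5]},
                          index=[0.0, 1.0, 2.0, 3.0, 4.0])
# End timestamp of each video: rest (-1) ends at t=1.5, video 7 ends at t=4.0
toy_video_end = pd.DataFrame({"VideoId": [-1, 7]}, index=[1.5, 4.0])
# Sparse ratings reported at t=0.5 and t=2.5
toy_ratings = pd.DataFrame({"Valence": [5, 8]}, index=[0.5, 2.5])

# Forward: each sample takes the first end timestamp at or after it
toy_merged = pd.merge_asof(toy_physio, toy_video_end,
                           left_index=True, right_index=True,
                           direction="forward")
# Backward (default): each sample takes the last rating at or before it
toy_merged = pd.merge_asof(toy_merged, toy_ratings,
                           left_index=True, right_index=True)
print(toy_merged)
```

Samples recorded before the first rating have no earlier match, so their `Valence` is `NaN`; both inputs must be sorted by their index for `merge_asof` to work.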

In [None]:
# Merge the physiological data with the video ids, then with the emotions
data_rest_merged = pd.merge_asof(data_rest_resampled, video_id_end_timestamp, left_index=True, right_index=True, direction="forward")
data_rest_merged.insert(0, "OriginalParticipantID", data_loader.index[PARTICIPANT_IDX]['participant_id'])
data_rest_merged = pd.merge_asof(data_rest_merged, emotions_rest, left_index=True, right_index=True)
data_rest_merged

In [None]:
# Merge physio with affective ratings
data_video_merged = pd.merge_asof(data_video_resampled, video_id_end_timestamp, left_index=True, right_index=True, direction="forward")
data_video_merged.insert(0, "OriginalParticipantID", data_loader.index[PARTICIPANT_IDX]['participant_id'])
data_video_merged = pd.merge_asof(data_video_merged, emotions_video, left_index=True, right_index=True)
data_video_merged

In [None]:
data_compiled = data_loader.generate_merged_synchronized_dataframe(PARTICIPANT_IDX,
                                                                AFFECTIVE_SEGMENT, 
                                                                avdosvr.preprocessing.COLNAMES_RECOMMENDED,
                                                                sampling_frequency_hz=FS,
                                                                set_timestamps_to_zero=True)

In [None]:
data_compiled.loc[0,"Positive"]

## 5. Merging all participants' data

Finally, we store the postprocessed dataframes from all participants in a single CSV file. This file can be handled directly in Python because it is much smaller than the original dataset: the sampling frequency was reduced from ~1 kHz to 50 Hz.

In this case, we combine the raw dataset keeping two columns `["Participant","AffectSegment"]` to identify the individual files. However, you may add a feature extraction step to produce a more comprehensive dataset.
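
As a small illustration of the combining step, the sketch below tags each per-file dataframe with its identifiers and concatenates them once. The frames and their values are made up; how `generate_merged_synchronized_dataframe()` labels its output internally is not shown here.

```python
import pandas as pd

# Tiny frames standing in for per-participant, per-segment results
toy_frames = {
    (0, "Positive"): pd.DataFrame({"signal": [0.1, 0.2]}),
    (0, "Negative"): pd.DataFrame({"signal": [0.3, 0.4]}),
    (1, "Positive"): pd.DataFrame({"signal": [0.5, 0.6]}),
}

toy_parts = []
for (participant, segment), frame in toy_frames.items():
    # Add the identifying columns before combining
    toy_parts.append(frame.assign(Participant=participant, AffectSegment=segment))

# A single concat over the list avoids re-copying a growing dataframe
toy_dataset = pd.concat(toy_parts, ignore_index=True)
print(toy_dataset)
```

Collecting the parts in a list and concatenating once is usually faster than calling `pd.concat` inside the loop, although the incremental approach makes it easier to save partial results.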

In [None]:
# Total participants
participants_ids = data_loader.summary["index_id"].unique()
participants_ids

In [None]:
# Total affect segments
affect_segments = [ str(x) for x in avdosEnums.AffectSegments]
affect_segments

In [None]:
# Columns to process per participant
data_columns_to_analyze = avdosvr.preprocessing.COLNAMES_RECOMMENDED
data_columns_to_analyze

In [None]:
# Define sampling frequency for resampling
SAMPLING_FREQ_HZ = 50

The same process shown above is incorporated in the function `generate_merged_synchronized_dataframe()`.

⛔⛔ **NOTE: The execution of the next cell may take between 30-60min because it goes through the whole dataset to generate a postprocessed version** ⛔⛔

In [None]:
# Load all segments for all participants and store the resting and video parts in a single large CSV.

DATASET_POSTPROCESSED_FILENAME = gen_path_temp("Dataset_AVDOSVR_postprocessed", extension=".csv")

output_filename = DATASET_POSTPROCESSED_FILENAME

# Variable to store the final dataset
dataset_postprocessed_final = None
# Check if file already exists
if (os.path.isfile(output_filename)):
    dataset_postprocessed_final = pd.read_csv(output_filename)
    print(f"File loaded from path!")
# Otherwise generate it
else:
    print(f"Generating file!")
    for participant in participants_ids:
        for aff_segment in affect_segments:
            print(f"\n\nAnalyzing participant {participant} segment {aff_segment}")

            # Final concatenation of resting and video stages
            data_compiled = data_loader.generate_merged_synchronized_dataframe(participant,
                                                                                 aff_segment, 
                                                                                 data_columns_to_analyze,
                                                                                 sampling_frequency_hz=SAMPLING_FREQ_HZ,
                                                                                 set_timestamps_to_zero=True)

            # Generate final DF
            if(dataset_postprocessed_final is None):
                dataset_postprocessed_final = data_compiled.copy(deep=True)
            else:
                dataset_postprocessed_final = pd.concat([dataset_postprocessed_final, data_compiled.copy(deep=True)])

        # Saving .csv after each participant so that partial results are kept
        dataset_postprocessed_final.to_csv( output_filename )
    print("\n\n End")

In [None]:
dataset_postprocessed_final.head()

In [None]:
dataset_postprocessed_final.shape

In [None]:
print(">> FINISHED WITHOUT ERRORS!!")