## Sampling of the dataset
This notebook samples the dataset by running a sliding window with a fixed step size over the video and physio dataframes of each subject. For each sample we compute a set of features and targets from the raw information in that sample.

In [2]:
import os
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from targetComputation import compute_EDA_Targets, compute_ECG_Targets
import features as ft

In [3]:
project_dir = os.getcwd().split('\\')[:-1]
project_dir = '\\'.join(project_dir)
data_dir = project_dir + '\\data'
video_dir = data_dir+'\\interim\\video'
physio_dir = data_dir+'\\interim\\physiological'
video_files = [file for file in os.listdir(video_dir)]
physio_files = [file for file in os.listdir(physio_dir)]
pps = list(set([file[:-4] for file in video_files]).intersection(set([file[:-4] for file in physio_files])))
pps = sorted([int(pp) for pp in pps])
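
The cell above keeps only participants that have both a video and a physio file, using `file[:-4]` to strip the four-character `.hdf` suffix. As a minimal sketch of the same intersection, the helper below uses `Path.stem` so that it does not depend on the suffix length. The function name `common_participants` and the file names are made up for illustration and are not part of the notebook.

```python
from pathlib import Path

def common_participants(video_files, physio_files):
    """Return the sorted integer ids that occur in both lists of file names."""
    video_ids = {Path(name).stem for name in video_files}
    physio_ids = {Path(name).stem for name in physio_files}
    # Set intersection keeps only participants with both modalities.
    return sorted(int(pp) for pp in video_ids & physio_ids)

# Hypothetical file names, for illustration only.
example_ids = common_participants(
    ['1.hdf', '2.hdf', '10.hdf'],
    ['2.hdf', '10.hdf', '11.hdf'],
)
print(example_ids)
```

Converting to `int` before sorting matters: sorting the string ids would place `'10'` before `'2'`.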

In [4]:
print(f'Collecting data for {len(pps)} participants.')

Collecting data for 66 participants.


### Start and end point of samples
After loading the required modules and storing the locations of the data files in variables, we implement the functions that perform the sampling. The first is `get_start_end`, which finds the start and end points of every sample based on a video signal. Given a desired window size and step size, it computes the start and end points of all the samples that can be drawn from the signal.

In [5]:
def get_start_end(data, window_size, step_size):
    """Get the start and end timestamps of all windows in a video signal, for a given window size and step size.
    Returns a list of (start, end) tuples, expressed in the units of `t_from_start`.
    """

    points = []
    start = data.t_from_start.values[0]
    end = start + window_size
    points.append((start, end))

    while end+step_size < data.t_from_start.values[-1]:
        start += step_size
        end += step_size
        points.append((start, end))
    
    return points
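
To make the behaviour of the loop concrete, here is a small worked example on a toy signal. The function is copied from the cell above so that the snippet runs on its own; the toy dataframe and the window and step values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def get_start_end(data, window_size, step_size):
    """Copy of the notebook function, repeated so this example is self-contained."""
    points = []
    start = data.t_from_start.values[0]
    end = start + window_size
    points.append((start, end))
    while end + step_size < data.t_from_start.values[-1]:
        start += step_size
        end += step_size
        points.append((start, end))
    return points

# Toy signal with timestamps 0, 1, ..., 10.
toy = pd.DataFrame({'t_from_start': np.arange(0, 11)})
points = [(int(s), int(e)) for s, e in get_start_end(toy, window_size=4, step_size=3)]
print(points)
```

Two properties are visible here. The first window is always appended, even if the signal is shorter than `window_size`. And because the loop condition uses a strict `<`, a window that would end exactly on the last timestamp (here `(6, 10)`) is not generated.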

### Getting one window
The function `get_window` extracts one window/sample from a dataframe, based on a given start and end point. Combined with the `get_start_end` function, it lets us gather all the possible windows/samples from a dataframe.

In [6]:
def get_window(data, start, end):
    """Return window between a certain start and endpoint."""
    return data.loc[(data.t_from_start >= start) & (data.t_from_start <= end), :]
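
Both comparisons in `get_window` are inclusive. The sketch below, on a made-up dataframe, shows the consequence: when one window ends where the next one starts, the row at that boundary timestamp belongs to both windows.

```python
import pandas as pd

def get_window(data, start, end):
    """Copy of the notebook function, repeated so this example is self-contained."""
    return data.loc[(data.t_from_start >= start) & (data.t_from_start <= end), :]

# Toy signal, for illustration only.
toy = pd.DataFrame({'t_from_start': [0, 1, 2, 3, 4, 5, 6]})
first = get_window(toy, 0, 3)
second = get_window(toy, 3, 6)

# Timestamps that occur in both windows.
shared = sorted(int(t) for t in set(first.t_from_start) & set(second.t_from_start))
print(list(first.t_from_start), list(second.t_from_start), shared)
```

At typical video frame rates a single shared row is unlikely to matter, but it is worth knowing when the step size equals the window size and the windows are assumed to be disjoint.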

### Processing a window
Below we implement two functions, one for processing a physiological window and one for processing a video window. The functions `process_physio_window` and `process_video_window` take a window/sample (a subset of a dataframe obtained through `get_window`) and compute various features or targets, which are returned in a dict.

To compute additional or different features/targets, add them to these functions.

In [7]:
def process_physio_window(window):
    """Process a physiological window. Returns a dict with the computed target variables and the first and last timestamp of the window."""
    processed_window = {}
    
    processed_window['t_start_physio'] = window.t_from_start.values[0]
    processed_window['t_end_physio'] = window.t_from_start.values[-1]
    
    processed_window = {**processed_window, **compute_EDA_Targets(window)}
    processed_window = {**processed_window, **compute_ECG_Targets(window)}
    
    
    return processed_window

In [8]:
def process_video_window(window):
    """Process a video window. Returns a dict with the computed features and the first and last timestamp of the window."""
    processed_window = {}
    
    processed_window['proportion_Success'] = window.success.sum()/len(window.success)
    processed_window['t_start_video'] = window.t_from_start.values[0]
    processed_window['t_end_video'] = window.t_from_start.values[-1]    
    processed_window = {**processed_window, **ft.compute_mean_AUs(window)}
    processed_window = {**processed_window, **ft.compute_std_AUs(window)}
    processed_window = {**processed_window, **ft.compute_arousal(window)}
    processed_window = {**processed_window, **ft.compute_emotions(window)}
    processed_window = {**processed_window, **ft.compute_head_motion(window)}
    processed_window = {**processed_window, **ft.compute_PD_features(window)}
    processed_window['blink_rate'] = ft.compute_blink_rate(window)
    processed_window['per_EC'] = ft.compute_percentage_EC(window)
    
    return processed_window
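
Both processing functions build their result by merging dicts with `{**a, **b}`, so adding a feature comes down to writing a function that returns a dict and merging it in. The sketch below illustrates that pattern on a toy window. The function `compute_mean_confidence` is hypothetical and is not part of the `features` module.

```python
import pandas as pd

def compute_mean_confidence(window):
    """Hypothetical feature function: returns a dict so it can be merged like the others."""
    return {'mean_confidence': float(window.confidence.mean())}

# Toy window with two frames, for illustration only.
window = pd.DataFrame({'confidence': [0.5, 1.0], 'success': [1, 0]})

processed_window = {}
processed_window['proportion_Success'] = float(window.success.sum() / len(window.success))
# Merge the new feature into the existing dict, as the notebook does for each feature group.
processed_window = {**processed_window, **compute_mean_confidence(window)}
print(processed_window)
```

When two dicts share a key, the value from the right-hand dict wins, so feature functions should use distinct key names.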

In [9]:
def check_video_window(window):
    """Check the video window against the defined rules. If any rule is broken, return False and discard the entire window; otherwise return True.
    Rejected windows are counted in the global variable `removed`."""
    global removed
    if (window.confidence >= 0.8).sum()/len(window.confidence) < 0.95:
        removed += 1
        return False
    else:
        return True
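
The rule above keeps a window only if at least 95% of its frames have a confidence of 0.8 or higher. As a sketch, the same rule can be written as a function without side effects, which makes it easy to try on toy data. The name `window_is_usable` and its keyword arguments are assumptions for illustration; the notebook itself uses `check_video_window` with the global counter.

```python
import pandas as pd

def window_is_usable(window, min_confidence=0.8, min_fraction=0.95):
    """Return True if enough frames in the window reach the confidence threshold."""
    fraction_ok = (window.confidence >= min_confidence).sum() / len(window.confidence)
    return bool(fraction_ok >= min_fraction)

# Toy windows of ten frames each, for illustration only.
good = pd.DataFrame({'confidence': [0.9] * 10})          # 100% of frames are confident
bad = pd.DataFrame({'confidence': [0.9] * 9 + [0.5]})    # only 90% are confident

print(window_is_usable(good), window_is_usable(bad))
```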

### Sampling and processing one participant
Below we implement the `sample_pp` function, which samples and processes the data of one participant. Given the participant's physio and video dataframes, it applies the functions defined above to sample and process them. It collects the processed variables of each accepted window in a dict and returns the list of these dicts for later use.

In [10]:
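def sample_pp(physio_data, video_data, window_size, step_size):
    """Sample the data of one specific participant. Make sure that both the physiological and the video data exist for the participant.
    Note: `pp` is not a parameter; it is read from the enclosing (global) scope, so the calling loop must set it."""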
    points = get_start_end(video_data, window_size, step_size)
    processed_windows = []
    i = 1
    for point in points:
        start, end = point
        video_window = get_window(video_data, start, end)
        if check_video_window(video_window):                
            physio_window = get_window(physio_data, start, end)
            
            processed_physio_window = process_physio_window(physio_window)
            processed_video_window = process_video_window(video_window)

            processed_window = {**processed_video_window, **processed_physio_window}
            processed_window['start'] = start
            processed_window['end'] = end
            processed_window['pp'] = pp
            processed_window['pp_window'] = i
            
            processed_windows.append(processed_window)
        i += 1
    return processed_windows
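
One detail of `sample_pp` is easy to miss: the counter `i` advances for every candidate window, including the ones rejected by `check_video_window`. The simplified sketch below mimics only that numbering logic. The function `number_windows` and the `keep` flags are stand-ins invented for illustration.

```python
def number_windows(points, keep):
    """Mimic the counter in sample_pp: i advances for every candidate window, kept or not."""
    rows = []
    i = 1
    for (start, end), is_kept in zip(points, keep):
        if is_kept:
            rows.append({'start': start, 'end': end, 'pp_window': i})
        i += 1
    return rows

# Three candidate windows, of which the second is rejected (illustrative values).
rows = number_windows([(0, 4), (3, 7), (6, 10)], keep=[True, False, True])
print([row['pp_window'] for row in rows])
```

As a result, `pp_window` identifies the position of a window within the participant's recording, and gaps in the numbering indicate removed windows.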

### Loading in the participant data

In [11]:
pp_physio = {}
pp_video = {}
for pp in tqdm(pps, desc='pp', leave=False):
    pp_physio[pp] = pd.read_hdf(f'{physio_dir}\\{pp}.hdf', f'pp{pp}')
    df = pd.read_hdf(f'{video_dir}\\{pp}.hdf', f'pp{pp}')
    cols = [col for col in df.columns if col not in ['frame', 'face_id', 'timestamp', 'confidence', 'success', 'started', 'pp', 't_from_start', 'frames_away_start ']]
    df.loc[df.confidence<0.8, cols] = np.nan
    pp_video[pp] = df

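
The loading cell replaces the feature values of low-confidence frames with `NaN` while leaving the bookkeeping columns untouched. The sketch below shows that masking step on a toy dataframe. The column `AU01_r` is a hypothetical feature column used only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frames, for illustration only.
frames = pd.DataFrame({
    'confidence': [0.95, 0.40, 0.85],
    'success': [1, 0, 1],
    'AU01_r': [0.2, 1.5, 0.3],  # hypothetical feature column
})

# Columns that describe the frame itself and must not be masked.
meta_cols = ['confidence', 'success']
feature_cols = [col for col in frames.columns if col not in meta_cols]

# Blank out the features of frames below the confidence threshold.
frames.loc[frames.confidence < 0.8, feature_cols] = np.nan
print(frames)
```

pandas reductions such as `mean` and `std` skip `NaN` by default, so masked frames drop out of such statistics. Whether that applies to every function in `features` depends on how those functions are written.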

### Processing all the participants
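Below we sample all participants and store their processed variables in a dataframe, which is the dataframe used in the modelling step. You can adjust `window_sizes` and `step_sizes` in the cell below. Each step size is given as a fraction of the window size, so `1` yields non-overlapping windows and `0.7` yields consecutive windows that overlap by 30%. One dataset is written to `data\processed` per combination, with the window and step size encoded in the file name.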

In [11]:
pd.set_option('mode.chained_assignment',None)
window_sizes = [60*2, 60*3]
step_sizes = [1, 0.9, 0.7]

for window_size in tqdm(window_sizes, desc='windows'):
    for step_size in tqdm(step_sizes, desc='steps', leave=False):
        df = []
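        # step_size is a fraction of the window size; convert it to an absolute step.
        # int() truncates, so 0.7 * 180 becomes 125 (not 126) because of floating-point rounding.
        step_size *= window_size
        step_size = int(step_size)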
        removed = 0
        for pp in tqdm(pps, desc='pp', leave=False):
            df += sample_pp(pp_physio[pp], pp_video[pp], window_size, step_size)
        df = pd.DataFrame(df)
        df.pp = df['pp'].astype('str')
        df['window'] = df.index
        print(f'Finished window size: {window_size} step size: {step_size}. Sampled a total of {len(df.pp)} windows. Removed total of {removed} windows, ~{int(((removed)/(removed+len(df.pp)))*100)} percent of all possible windows.')
        df.to_hdf(f"{data_dir}\\processed\\window_{window_size}_step_{step_size}.hdf", "data", mode="w")

Finished window size: 120 step size: 120. Sampled a total of 301 windows. Removed total of 221 windows, ~42 percent of all possible windows.


Finished window size: 120 step size: 108. Sampled a total of 320 windows. Removed total of 260 windows, ~44 percent of all possible windows.


Finished window size: 120 step size: 84. Sampled a total of 428 windows. Removed total of 312 windows, ~42 percent of all possible windows.


Finished window size: 180 step size: 180. Sampled a total of 178 windows. Removed total of 153 windows, ~46 percent of all possible windows.


Finished window size: 180 step size: 162. Sampled a total of 180 windows. Removed total of 198 windows, ~52 percent of all possible windows.


Finished window size: 180 step size: 125. Sampled a total of 254 windows. Removed total of 217 windows, ~46 percent of all possible windows.
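
The step sizes in the output above follow from multiplying each fraction by the window size and truncating with `int()`. The short sketch below reproduces that arithmetic and also shows what `round()` would give instead. The variable names `step_fractions`, `steps`, and `rounded` are chosen for illustration.

```python
window_sizes = [60 * 2, 60 * 3]
step_fractions = [1, 0.9, 0.7]

# Absolute step per window size, computed the same way as in the loop above.
steps = {
    window_size: [int(fraction * window_size) for fraction in step_fractions]
    for window_size in window_sizes
}
print(steps)

# Alternative: rounding to the nearest integer instead of truncating.
rounded = [round(fraction * 180) for fraction in step_fractions]
print(rounded)
```

The step of 125 for the 180 window comes from `0.7 * 180` evaluating to slightly less than 126 in floating point, which `int()` then truncates. Using `round()` would give 126, at the cost of changing the file names of the stored datasets.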
