## This notebook downsamples the OpenFace extracted-feature dataframes

1. Downsamples each dataframe to a specified ```sampling_size```
2. Preprocesses the downsampled data
    - Excludes the datapoints in ```failure videos``` that occur before the ```failureOccurrence_timestamp```
3. Merges all ```videos``` of all ```participants``` that were downsampled to a particular ```sampling_size``` and preprocessed (as per step 2)

In [1]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

### Specify paths

- ```downsampling_directory``` - path to the directory where the downsampled dataframes are stored for each ```participant```
    - the directories are created following the convention ```/{frequency}fps_downsampling/```
- ```participant_features_directory``` - path where the original OpenFace extracted features of the participants are stored locally
- ```participant_database``` - list of all the participants
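
As a minimal sketch of the naming convention above, the helper below derives ```frequency``` from ```sampling_size``` and builds the matching directory path. The helper name ```build_downsampling_directory``` is hypothetical and not part of the notebook; the base path is the one used by the downsampling cell.

```python
import os

# Base path taken from the downsampling cell of this notebook.
DOWNSAMPLED_BASE = '../data/openFace_features_data/downsampled_feature_data'


def build_downsampling_directory(sampling_size, base=DOWNSAMPLED_BASE):
    """Return (frequency, directory) for a window length given in seconds.

    Hypothetical helper, for illustration only.
    """
    # frequency = datapoints retained per second, e.g. 0.5 s windows -> 2 fps
    frequency = int(1 / sampling_size)
    return frequency, os.path.join(base, f'{frequency}fps_downsampling')


frequency, directory = build_downsampling_directory(0.5)
print(frequency, directory)
```

With ```sampling_size = 0.5``` the folder is ```2fps_downsampling```. Note that ```int()``` truncates, so a window length whose reciprocal is not a whole number would be rounded down.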

In [2]:
# List the OpenFace feature folders, one per participant
files_to_ignore = ['.DS_Store']
participant_features_directory = '../data/openFace_features_data/csv_features_clean/'
participant_database = os.listdir(participant_features_directory)
participant_database = [participant_folder for participant_folder in participant_database if participant_folder not in files_to_ignore]

### Downsampling


- ```required_features``` - list of all the features that are to be considered and downsampled
- ```final_features``` - list of ```metadata columns + required_features```
- ```participant_folder``` - list of the extracted-feature files, one per ```response_video``` of the ```participant```
- ```csv_file``` - OpenFace extracted-feature file for one ```response_video``` of a ```participant```
- ```target_class``` - integer label of the class the ```response_video``` belongs to (see ```class_types```)
- ```required_df``` - a dataframe that contains all the ```final_features``` from the original ```df``` dataframe


- ```sampling_size``` - length of each downsampling window in seconds; one datapoint is retained per window
- ```frequency = 1 / sampling_size``` - number of datapoints retained per second
- ```last_timestamp``` - time (in seconds) of the final extracted frame of the participant's ```response_video```
- ```valid_timestamps``` - list of window end times, spaced ```sampling_size``` seconds apart, from 0 to ```last_timestamp```
- ```downsampled_df``` - final ```df``` that contains downsampled data with ```final_features``` values
- ```(from_time, time]``` - time range of the datapoints based on the original_df that is to be considered for the window
- ```dataPoints_within_window``` - dataPoints within the timerange to be considered for downsampling to a single datapoint
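
The windowing described above can also be expressed with a ```groupby```. The following is a minimal sketch under stated assumptions, not the notebook's implementation: the function name ```downsample_windows``` and the toy data are hypothetical. Unlike the loop in the downsampling cell, it aggregates the rows at timestamp 0 as well, and it always keeps the final partial window, whereas the loop rounds ```last_timestamp``` to the nearest window boundary.

```python
import numpy as np
import pandas as pd


def downsample_windows(df, sampling_size=0.5,
                       metadata=('participant_id', 'response_video', 'class')):
    """Collapse all rows in each (from_time, time] window into one row.

    Hypothetical helper, for illustration only.
    """
    # End time of the window each row falls into; np.ceil sends a timestamp
    # lying exactly on a boundary to the window that ends at that boundary.
    window_end = np.ceil(df['timestamp'] / sampling_size) * sampling_size

    # One aggregation per column: first for metadata, max for binary AU##_c
    # columns, mean for every other feature column.
    aggregations = {}
    for column in df.columns:
        if column == 'timestamp':
            continue
        if column in metadata:
            aggregations[column] = 'first'
        elif column.endswith('_c'):
            aggregations[column] = 'max'
        else:
            aggregations[column] = 'mean'

    # Replace each timestamp by its window end, then aggregate per window.
    binned = df.assign(timestamp=window_end)
    downsampled = binned.groupby('timestamp', as_index=False).agg(aggregations)
    return downsampled[list(df.columns)]


# Toy data (made-up values): five frames of one response video.
frames = pd.DataFrame({
    'participant_id': ['p1'] * 5,
    'response_video': ['fh1'] * 5,
    'class': [1] * 5,
    'timestamp': [0.0, 0.2, 0.4, 0.6, 1.0],
    'AU01_r': [0.0, 1.0, 3.0, 2.0, 4.0],
    'AU01_c': [0, 0, 1, 0, 0],
})
result = downsample_windows(frames, sampling_size=0.5)
print(result)
```

In the toy data, the frames at 0.2 s and 0.4 s fall into the window (0, 0.5] and the frames at 0.6 s and 1.0 s into (0.5, 1.0], so the five frames reduce to three rows.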

In [3]:
import warnings

# Suppress all warnings while this cell runs
warnings.filterwarnings('ignore')

required_features = ['timestamp', 'gaze_0_x','gaze_0_y','gaze_0_z','gaze_1_x','gaze_1_y','gaze_1_z','gaze_angle_x','gaze_angle_y','pose_Tx', 'pose_Ty', 'pose_Tz','pose_Rx', 'pose_Ry', 'pose_Rz','AU01_r','AU02_r','AU04_r','AU05_r','AU06_r','AU07_r','AU09_r','AU10_r','AU12_r','AU14_r','AU15_r','AU17_r','AU20_r','AU23_r','AU25_r','AU26_r','AU45_r','AU01_c','AU02_c','AU04_c','AU05_c','AU06_c','AU07_c','AU09_c','AU10_c','AU12_c','AU14_c','AU15_c','AU17_c','AU20_c','AU23_c','AU25_c','AU26_c','AU28_c','AU45_c']
# Some initial information regarding the participant and their corresponding responseVideo information to be included
final_features = ['participant_id', 'response_video', 'class']

class_types = {
    'ch': 0,
    'cr': 0,
    'fh': 1,
    'fr': 2
}

with tqdm(total=len(participant_database)) as pbar:
    for participant in sorted(participant_database):

        participant_folder_path = f'{participant_features_directory}{participant}/'
        participant_folder = os.listdir(participant_folder_path)
        for csv_file in sorted(participant_folder):

            if csv_file in files_to_ignore:
                continue

            # Identify the class to which the csv_file belongs
            responseVideo_name = csv_file.split('_')[0]
            if responseVideo_name[:2] not in class_types:
                # Skip files with an unknown prefix rather than reuse a stale target_class
                continue
            target_class = class_types[responseVideo_name[:2]]

            # Read the features csv file
            csv_file_path = f'{participant_folder_path}{csv_file}'
            df = pd.read_csv(csv_file_path)

            # Create a new dataframe - 'required_df' that retains all the feature columns that are required from the original dataframe - 'df'
            required_df = df[required_features].copy()

            # Add the metadata columns (participant, class, response video)
            required_df['participant_id'] = participant
            required_df['class'] = target_class
            required_df['response_video'] = responseVideo_name

            # Reorganize the column order
            required_df = required_df[final_features + required_features]

            # Downsampling

            # Obtain the last_timestamp for every response_video data
            last_timestamp = required_df['timestamp'].values[-1] # NOTE: this is in seconds

            # round it to the nearest Xth second (frequency = 1/sampling_size)
            sampling_size = 0.5
            frequency = int(1 / sampling_size)
            last_timestamp = round(last_timestamp * frequency) / frequency
            # Now we get a list of timestamps once every 'sampling_size' seconds, until the last timestamp
            valid_timestamps = np.arange(0, last_timestamp + sampling_size, sampling_size)

            # A new dataframe to store the downsampled datapoints from the 'required_df'
            downsampled_df = pd.DataFrame()

            for time in valid_timestamps:
                if time == valid_timestamps[0]:
                    # First timestamp: keep the rows at or before it as they are
                    downsampled_df = pd.concat([downsampled_df, required_df[required_df['timestamp'] <= time]])
                    from_time = time
                else:
                    # Retrieve the rows in the range (from_time, time]; copy so the assignments below do not touch required_df
                    dataPoints_within_window = required_df[(required_df['timestamp'] > from_time) & (required_df['timestamp'] <= time)].copy()

                    # Check if there exists any datapoints within the range specified: (from_time, time]
                    if len(dataPoints_within_window) > 0:

                        # Now that you have datapoints within the specified window range: (from_time, time]
                        # Calculate some statistics on the feature columns present in the dataframe
                        # i.e: mean(), max(), etc.. on the respective feature columns

                        # NOTE: here we reduce the range of dataPoints to a single dataPoint after calculating the statistics
                        for column in dataPoints_within_window.columns:

                            # for the binary classification columns (AU##_c), take the max over the window,
                            # i.e. 1 if the action unit was present in any frame
                            if column.endswith('_c'):
                                dataPoints_within_window[column] = dataPoints_within_window[column].max()
                            # for all other feature columns, take the mean over the window
                            elif column not in ['participant_id', 'class', 'response_video', 'timestamp']:
                                dataPoints_within_window[column] = dataPoints_within_window[column].mean()
                            elif column == 'timestamp':
                                dataPoints_within_window[column] = time

                        # Every row in the window now holds the same aggregated values,
                        # so append the first one to the final downsampled_df
                        downsampled_df = pd.concat([downsampled_df, dataPoints_within_window.iloc[[0]]])
                    else:
                        print(f'No datapoints within the specified range! ({from_time}, {time}]')

                    # Update the range of the time :- to shift the window
                    from_time = time
                # print(downsampled_df.shape)
            
            # Make a new folder under data/openFace_features_data/downsampled_feature_data,
            # named {frequency}fps_downsampling (2fps_downsampling for sampling_size = 0.5)
            downsampling_directory = f'../data/openFace_features_data/downsampled_feature_data/{frequency}fps_downsampling/'

            # Create the participant directory (and any missing parent directories) if it does not exist
            os.makedirs(f'{downsampling_directory}{participant}/', exist_ok=True)

            # Save the downsampled features file, e.g. ch1_2fps.xlsx, in the participant's directory
            downsampled_df.to_excel(f'{downsampling_directory}{participant}/{responseVideo_name}_{frequency}fps.xlsx', index = False)
        pbar.update(1)
        pbar.set_description(f'Participant ID = {participant}')

# After you're done with the code, you can reset the warning filters
warnings.resetwarnings()

Participant ID = 9214: 100%|████████████████████| 29/29 [01:58<00:00,  4.10s/it]


#### Preprocessing failure data

Here, we preprocess all the failure dataframes by removing the datapoints recorded before the task failure occurs.

- For both ```human failure (fh)``` and ```robot failure (fr)``` ```response_video``` dataframes, we use the ```failureOccurrence``` timestamp to remove every row of the downsampled ```df``` where ```row['timestamp'] <= failureOccurrence_time```. Control videos (```ch```, ```cr```) are copied unchanged.

The ```failure_timestamps``` information can be found in ```New_Stimulus_Dataset_Information.xlsx```.
```failure_timestamps``` is of type ```dict```, where ```dict.key``` is the failure stimulus video name and ```dict.value``` is the timestamp of the failure occurrence (in seconds).
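
To make the filtering rule concrete, here is a minimal sketch on made-up data. The helper name ```drop_pre_failure_rows``` and the toy values are hypothetical; the comparison itself (keep rows whose ```timestamp``` is strictly greater than the failure time) is the one used in the preprocessing cell.

```python
import pandas as pd


def drop_pre_failure_rows(df, video_name, failure_timestamps):
    """Keep only the rows recorded after the failure occurrence.

    Hypothetical helper; videos without an annotated failure are returned unchanged.
    """
    failure_time = failure_timestamps.get(video_name)
    if failure_time is None:
        return df.copy()
    return df[df['timestamp'] > failure_time].reset_index(drop=True)


# Toy downsampled dataframe with 0.5 s windows (values are made up).
toy_df = pd.DataFrame({'timestamp': [2.5, 3.0, 3.5, 4.0],
                       'AU12_r': [0.1, 0.2, 0.9, 1.1]})
toy_failures = {'fh1': 3.3}

kept = drop_pre_failure_rows(toy_df, 'fh1', toy_failures)
unchanged = drop_pre_failure_rows(toy_df, 'ch1', toy_failures)
print(kept['timestamp'].tolist())
```

Because each downsampled row carries the end time of its window, the row stamped 3.5 covers the window (3.0, 3.5] and therefore still contains the frames around the failure at 3.3 s.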

In [4]:
# failure_timestamps: timestamps of failure occurrences (in seconds), annotated manually
failure_timestamps = {'fh1': 3.3, 'fh4': 3.4, 'fh5': 2.8, 'fh6': 3.3, 'fh7': 5.0, 'fh8': 6.0, 'fh9': 0.8, 'fh10': 2.9, 'fh2': 6.3, 'fh3': 5.3, 'fr1': 7.8, 'fr2': 4.8, 'fr7': 6.0, 'fr10': 4.7, 'fr3': 5.5, 'fr4': 8.0, 'fr5': 8.0, 'fr6': 17.0, 'fr8': 2.3, 'fr9': 6.0}

- ```downsampled_participant_directory``` - directory where the downsampled data is stored
- ```downsampled_participant_database``` - A list containing all participants whose data has been downsampled

In [5]:
# Reuse the directory written by the downsampling cell above
downsampled_participant_directory = downsampling_directory
downsampled_participant_database = os.listdir(downsampled_participant_directory)
downsampled_participant_database = [participant_folder for participant_folder in downsampled_participant_database if participant_folder not in files_to_ignore]

#### Create a new downsampling_preprocessed directory
- This directory stores the extracted features of all ```participants```, with the datapoints where ```row['timestamp'] <= failureOccurrence_time``` removed from the ```failure response``` dataframes

- ```preprocessed_downsampled_participant_directory``` - directory where the downsampled and preprocessed data will be stored as ```{participant}/{responseVideo_name}_{frequency}fps_preprocessed.xlsx```

In [6]:
preprocessed_downsampled_participant_directory = f'../data/openFace_features_data/downsampled_feature_data/{frequency}fps_downsampling_preprocessed/'
# Create the directory (and any missing parent directories) if it does not exist
os.makedirs(preprocessed_downsampled_participant_directory, exist_ok=True)

In [7]:
with tqdm(total=len(downsampled_participant_database)) as pbar:
    for participant in sorted(downsampled_participant_database):
        participant_folder_path = f'{downsampled_participant_directory}{participant}/'
        participant_folder = os.listdir(participant_folder_path)

        os.makedirs(f'{preprocessed_downsampled_participant_directory}{participant}/', exist_ok=True)

        for csv_file in sorted(participant_folder):
            if csv_file in files_to_ignore:
                continue
            responseVideo_name = csv_file.split('_')[0]
            preprocessed_downsampled_df = pd.read_excel(f'{participant_folder_path}{csv_file}')
            # Failure videos (fh, fr): keep only the rows after the annotated failure occurrence.
            # Control videos (ch, cr) are saved unchanged.
            if class_types[responseVideo_name[:2]] != 0:
                preprocessed_downsampled_df = preprocessed_downsampled_df[preprocessed_downsampled_df['timestamp'] > failure_timestamps[responseVideo_name]]
            preprocessed_downsampled_df.to_excel(f'{preprocessed_downsampled_participant_directory}{participant}/{responseVideo_name}_{frequency}fps_preprocessed.xlsx', index = False)
        pbar.update(1)
        pbar.set_description(f'Participant ID = {participant}')

Participant ID = 9214: 100%|████████████████████| 29/29 [00:20<00:00,  1.41it/s]


#### Merge

- Here we merge all ```responseVideos``` of all ```participants``` into a single dataframe, saved as ```allParticipants_{frequency}fps_downsampled_preprocessed.xlsx```

In [8]:
preprocessed_downsampled_participant_database = os.listdir(preprocessed_downsampled_participant_directory)
preprocessed_downsampled_participant_database = [participant_folder for participant_folder in preprocessed_downsampled_participant_database if participant_folder not in files_to_ignore]

In [9]:
participant_dfs = []

with tqdm(total=len(preprocessed_downsampled_participant_database)) as pbar:
    for participant in sorted(preprocessed_downsampled_participant_database):
        participant_folder_path = f'{preprocessed_downsampled_participant_directory}{participant}/'
        participant_folder = os.listdir(participant_folder_path)

        for csv_file in sorted(participant_folder):
            if csv_file in files_to_ignore:
                continue
            participant_dfs.append(pd.read_excel(f'{participant_folder_path}{csv_file}'))
        pbar.update(1)
        pbar.set_description(f'Participant ID = {participant}')

# Concatenate once at the end instead of growing the dataframe inside the loop
allParticipants_df = pd.concat(participant_dfs)
allParticipants_df.to_excel(f'../data/allParticipants_{frequency}fps_downsampled_preprocessed.xlsx', index = False)

Participant ID = 9214: 100%|████████████████████| 29/29 [00:09<00:00,  3.00it/s]
