## EDA

This notebook outlines some EDA done on ``data/normalized_gaze_data.csv``, which is the result of running the functions on TalEDA.ipynb and running LennoxEDA.ipynb. From Mongo DB:
- Cleaned by converting string to their appropriate types, rename emailed column to userId for consistency between DFs and drop the _id column since it's wrong
- Converted the formData and windowDimensions columns in the survey df from a dict to being their own individual columns
- Cut extra start times if the duration exceeds 15 seconds since the videos are only 15 seconds long
- Convert start and end times to a meaningful duration and drop the start/end time columns and the windowDimensions
- Merged the user df with the survey df
- Split the entire dataframe to have one row per timestamp with a value of if it's hazardous or not which will be the label
- One-hot encoded the following features: ['noDetectionReason', 'country', 'state', 'city', 'ethnicity', 'gender']
- Dropped rows with missing data
- Convert video data to 0.5s splits and replace hazard binary data by majority vote per video per time bin
- Normalized gaze data based on screen size 


In [139]:
import pandas as pd
import os
from pathlib import Path
import numpy as np

In [140]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


In [141]:
data_csv_dir = "../data/normalized_gaze_data.csv"
output_aggregate_csv_filename = "../data/aggregate_gaze_data_by_video.csv"

In [142]:
df = pd.read_csv(data_csv_dir)

In [143]:
df.head()

Unnamed: 0.1,Unnamed: 0,userId,videoId,hazardDetected,detectionConfidence,hazardSeverity,width,height,duration,licenseAge,age,visuallyImpaired,x,y,time,hazard,attentionFactors_construction,attentionFactors_environment,attentionFactors_motion,attentionFactors_other,attentionFactors_pedestrian,attentionFactors_proximity,attentionFactors_velocity,noDetectionReason_nohazards,noDetectionReason_subtlehazards,noDetectionReason_uncertain,country_ar,country_fr,country_tn,country_us,state_california,state_florida,state_massachusetts,state_north carolina,state_oregon,state_south carolina,state_washington,city_boca raton,city_boston,city_chapel hill,city_charlotte,city_coconut creek,city_delray beach,city_durham,city_los angeles,city_miami,city_olympia,city_puyallup,city_raleigh,city_san diego,city_seattle,city_tega cay,city_west linn,city_woodland hills,ethnicity_asian,ethnicity_black or african american,ethnicity_hispanic or latino,ethnicity_middle eastern or north african,ethnicity_multiracial,ethnicity_prefer not to say,ethnicity_white,gender_female,gender_male,gender_prefer-not-to-say,original_x,original_y,original_width,original_height,display_width,display_height,x_offset,y_offset,normalized_to_width,normalized_to_height
0,0,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,357.536879,602.202312,0.0,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,500.496763,499.953378,1470,797,1062.666667,797,203.666667,0.0,1280,960
1,1,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,358.021284,599.200181,0.197,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,500.898921,497.460984,1470,797,1062.666667,797,203.666667,0.0,1280,960
2,2,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,374.752952,598.149672,0.396,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,514.78969,496.588843,1470,797,1062.666667,797,203.666667,0.0,1280,960
3,3,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,421.96597,576.436479,0.595,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,553.986331,478.562368,1470,797,1062.666667,797,203.666667,0.0,1280,960
4,4,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,484.691224,556.259314,0.794,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,606.06136,461.811118,1470,797,1062.666667,797,203.666667,0.0,1280,960


In [144]:
# renamed the Unnamed : 0 column to index and duration to gazeDuration so it's more appropriate
df.rename(columns={'Unnamed: 0': 'index', 'duration': 'gazeDuration'}, inplace=True)

In [145]:
df.head()

Unnamed: 0,index,userId,videoId,hazardDetected,detectionConfidence,hazardSeverity,width,height,gazeDuration,licenseAge,age,visuallyImpaired,x,y,time,hazard,attentionFactors_construction,attentionFactors_environment,attentionFactors_motion,attentionFactors_other,attentionFactors_pedestrian,attentionFactors_proximity,attentionFactors_velocity,noDetectionReason_nohazards,noDetectionReason_subtlehazards,noDetectionReason_uncertain,country_ar,country_fr,country_tn,country_us,state_california,state_florida,state_massachusetts,state_north carolina,state_oregon,state_south carolina,state_washington,city_boca raton,city_boston,city_chapel hill,city_charlotte,city_coconut creek,city_delray beach,city_durham,city_los angeles,city_miami,city_olympia,city_puyallup,city_raleigh,city_san diego,city_seattle,city_tega cay,city_west linn,city_woodland hills,ethnicity_asian,ethnicity_black or african american,ethnicity_hispanic or latino,ethnicity_middle eastern or north african,ethnicity_multiracial,ethnicity_prefer not to say,ethnicity_white,gender_female,gender_male,gender_prefer-not-to-say,original_x,original_y,original_width,original_height,display_width,display_height,x_offset,y_offset,normalized_to_width,normalized_to_height
0,0,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,357.536879,602.202312,0.0,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,500.496763,499.953378,1470,797,1062.666667,797,203.666667,0.0,1280,960
1,1,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,358.021284,599.200181,0.197,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,500.898921,497.460984,1470,797,1062.666667,797,203.666667,0.0,1280,960
2,2,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,374.752952,598.149672,0.396,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,514.78969,496.588843,1470,797,1062.666667,797,203.666667,0.0,1280,960
3,3,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,421.96597,576.436479,0.595,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,553.986331,478.562368,1470,797,1062.666667,797,203.666667,0.0,1280,960
4,4,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,24.0,True,484.691224,556.259314,0.794,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False,606.06136,461.811118,1470,797,1062.666667,797,203.666667,0.0,1280,960


### Data Structure

In [146]:
def print_df_structure(df):
    #Data Structure
    print(f"Data Structure for the normalized gaze data")
    print("-"*10)
    print(f"Dimensions: {df.shape}")
    print(f"Data Types:\n{df.dtypes}")
    print(f"Missing Values:\n{df.isnull().sum()}")
    print(f"Unique observations:\n{df.nunique()}")
    print('\n')

print_df_structure(df)

Data Structure for the normalized gaze data
----------
Dimensions: (40653, 74)
Data Types:
index                                          int64
userId                                        object
videoId                                       object
hazardDetected                                  bool
detectionConfidence                            int64
hazardSeverity                                 int64
width                                          int64
height                                         int64
gazeDuration                                 float64
licenseAge                                   float64
age                                          float64
visuallyImpaired                                bool
x                                            float64
y                                            float64
time                                         float64
hazard                                          bool
attentionFactors_construction                   bool
attentio

There is a total of 40653 samples across 346 different videos for 30 different participants

In [147]:
def print_descriptive_stats(df):
    print(f'\nDescriptive Statistics for aggregate stats before one hot-encode')
    print('-'*15)
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    print('Central Tendency Measures:')
    print(df[numeric_columns].describe().loc[['mean', '50%']])
    print('\nDispersion Measures:')
    print(df[numeric_columns].describe().loc[['std', 'min', 'max']])

    #Check for distribution normality (skewness and kurtosis)
    print('\nDistribution Measures:')
    print('-'*15)
    print('\nSkew:')
    print(df[numeric_columns].skew())
    print('\nKurtosis:')
    print('-'*15)
    print(df[numeric_columns].kurtosis())
    print('\n')

print_descriptive_stats(df)


Descriptive Statistics for aggregate stats before one hot-encode
---------------
Central Tendency Measures:
             index  detectionConfidence  hazardSeverity   width  height  \
mean  20414.442993             4.224239        0.738642  1280.0   960.0   
50%   20339.000000             5.000000        0.000000  1280.0   960.0   

      gazeDuration  licenseAge        age           x           y      time  \
mean     15.748633   15.874855  33.971343  639.320070  568.304652  6.613715   
50%      15.361000   16.000000  27.000000  616.593673  577.502262  6.521000   

      original_x  original_y  original_width  original_height  display_width  \
mean  790.906433  490.319465      1580.46828       831.894768    1109.193024   
50%   748.099366  492.653961      1512.00000       832.000000    1109.333333   

      display_height    x_offset  y_offset  normalized_to_width  \
mean      831.894768  235.637628       0.0               1280.0   
50%       832.000000  217.000000       0.0          

### Dropping unnecessary columns + Convert booleans to 0/1

- Since using the videos have all been recorded in Durham, NC, using the country, state, and city could introduce bias or could result in the model utilizing those features when they do not carry any predictive power (+ there is an imbalance there). So we will drop those features. 

- We will also drop userId and index since we don't really care about that 

- We will drop the original_y, original_x, original_width, original_height, normalized_to_width, normalized_to_height, as well as display_width, display_height and width and height. The intuition here is that after normalizing all the gaze data to fit into one standard screen size (1280, 960), we may not really need the raw gaze data and any of the display data since we are dealing with the same standard screen size

In [148]:
# Create noDetection_no_to_subtle_hazard feature
def create_no_detection_feature(df):
    df["noDetection_no_to_subtle_hazard"] = df["noDetectionReason_nohazards"] + df["noDetectionReason_subtlehazards"]
    return df

df = create_no_detection_feature(df)

In [149]:
def remove_features_and_convert_booleans(df):
    # Drop demographic and unnecessary columns
    columns_to_drop = [
        "userId", "index",

        # demographics features
        "ethnicity_asian", "ethnicity_black or african american", 
        "ethnicity_hispanic or latino", "ethnicity_middle eastern or north african", 
        "ethnicity_multiracial", "ethnicity_prefer not to say", "ethnicity_white", 
        "gender_female", "gender_male", "gender_prefer-not-to-say", "visuallyImpaired",


        # features relating to location
        "country_ar", "country_fr", "country_tn", "country_us",
        "state_california", "state_florida", "state_massachusetts", "state_north carolina", 
        "state_oregon", "state_south carolina", "state_washington",
        "city_boca raton", "city_boston", "city_chapel hill", "city_charlotte", 
        "city_coconut creek", "city_delray beach", "city_durham", "city_los angeles", 
        "city_miami", "city_olympia", "city_puyallup", "city_raleigh", "city_san diego", 
        "city_seattle", "city_tega cay", "city_west linn", "city_woodland hills",

        # features relating to raw gaze data (before standardization)
        'original_y', 'original_x', 'original_width', 'original_height', 'normalized_to_width', 'normalized_to_height',
        'display_width', 'display_height', 'width', 'height', 'x_offset', 'y_offset',

        # Survey related stuff
        'noDetectionReason_nohazards', 'noDetectionReason_subtlehazards'


    ]

    df.drop(columns=columns_to_drop, axis=1, inplace=True)

    # Convert boolean columns to integers (1 for True, 0 for False)
    df = df.astype({col: int for col in df.select_dtypes(include=["bool"]).columns})
    
    return df

df_clean = remove_features_and_convert_booleans(df)

In [150]:
df_clean.columns

Index(['videoId', 'hazardDetected', 'detectionConfidence', 'hazardSeverity',
       'gazeDuration', 'licenseAge', 'age', 'x', 'y', 'time', 'hazard',
       'attentionFactors_construction', 'attentionFactors_environment',
       'attentionFactors_motion', 'attentionFactors_other',
       'attentionFactors_pedestrian', 'attentionFactors_proximity',
       'attentionFactors_velocity', 'noDetectionReason_uncertain',
       'noDetection_no_to_subtle_hazard'],
      dtype='object')

In [151]:
print(print_df_structure(df_clean))

Data Structure for the normalized gaze data
----------
Dimensions: (40653, 20)
Data Types:
videoId                             object
hazardDetected                       int64
detectionConfidence                  int64
hazardSeverity                       int64
gazeDuration                       float64
licenseAge                         float64
age                                float64
x                                  float64
y                                  float64
time                               float64
hazard                               int64
attentionFactors_construction        int64
attentionFactors_environment         int64
attentionFactors_motion              int64
attentionFactors_other               int64
attentionFactors_pedestrian          int64
attentionFactors_proximity           int64
attentionFactors_velocity            int64
noDetectionReason_uncertain          int64
noDetection_no_to_subtle_hazard      int64
dtype: object
Missing Values:
videoId            

In [152]:
df_clean.columns

Index(['videoId', 'hazardDetected', 'detectionConfidence', 'hazardSeverity',
       'gazeDuration', 'licenseAge', 'age', 'x', 'y', 'time', 'hazard',
       'attentionFactors_construction', 'attentionFactors_environment',
       'attentionFactors_motion', 'attentionFactors_other',
       'attentionFactors_pedestrian', 'attentionFactors_proximity',
       'attentionFactors_velocity', 'noDetectionReason_uncertain',
       'noDetection_no_to_subtle_hazard'],
      dtype='object')

In [153]:
print_descriptive_stats(df_clean)


Descriptive Statistics for aggregate stats before one hot-encode
---------------
Central Tendency Measures:
      hazardDetected  detectionConfidence  hazardSeverity  gazeDuration  \
mean        0.274838             4.224239        0.738642     15.748633   
50%         0.000000             5.000000        0.000000     15.361000   

      licenseAge        age           x           y      time    hazard  \
mean   15.874855  33.971343  639.320070  568.304652  6.613715  0.253315   
50%    16.000000  27.000000  616.593673  577.502262  6.521000  0.000000   

      attentionFactors_construction  attentionFactors_environment  \
mean                       0.004452                      0.030625   
50%                        0.000000                      0.000000   

      attentionFactors_motion  attentionFactors_other  \
mean                 0.071016                0.043687   
50%                  0.000000                0.000000   

      attentionFactors_pedestrian  attentionFactors_proximi

aggergate the date by videoId where the gazes are all added up in one list 

In [154]:
df_clean.head()

Unnamed: 0,videoId,hazardDetected,detectionConfidence,hazardSeverity,gazeDuration,licenseAge,age,x,y,time,hazard,attentionFactors_construction,attentionFactors_environment,attentionFactors_motion,attentionFactors_other,attentionFactors_pedestrian,attentionFactors_proximity,attentionFactors_velocity,noDetectionReason_uncertain,noDetection_no_to_subtle_hazard
0,video219,0,5,0,15.755,17.0,24.0,357.536879,602.202312,0.0,0,0,0,0,0,0,0,0,0,1
1,video219,0,5,0,15.755,17.0,24.0,358.021284,599.200181,0.197,0,0,0,0,0,0,0,0,0,1
2,video219,0,5,0,15.755,17.0,24.0,374.752952,598.149672,0.396,0,0,0,0,0,0,0,0,0,1
3,video219,0,5,0,15.755,17.0,24.0,421.96597,576.436479,0.595,0,0,0,0,0,0,0,0,0,1
4,video219,0,5,0,15.755,17.0,24.0,484.691224,556.259314,0.794,0,0,0,0,0,0,0,0,0,1


In [155]:
df_clean[df_clean['videoId']==1].nunique()

videoId                            0
hazardDetected                     0
detectionConfidence                0
hazardSeverity                     0
gazeDuration                       0
licenseAge                         0
age                                0
x                                  0
y                                  0
time                               0
hazard                             0
attentionFactors_construction      0
attentionFactors_environment       0
attentionFactors_motion            0
attentionFactors_other             0
attentionFactors_pedestrian        0
attentionFactors_proximity         0
attentionFactors_velocity          0
noDetectionReason_uncertain        0
noDetection_no_to_subtle_hazard    0
dtype: int64

In [156]:
df_clean[df_clean['videoId']=='video219'].nunique()

videoId                              1
hazardDetected                       1
detectionConfidence                  1
hazardSeverity                       1
gazeDuration                         3
licenseAge                           3
age                                  3
x                                  340
y                                  386
time                               395
hazard                               1
attentionFactors_construction        1
attentionFactors_environment         1
attentionFactors_motion              1
attentionFactors_other               1
attentionFactors_pedestrian          1
attentionFactors_proximity           1
attentionFactors_velocity            1
noDetectionReason_uncertain          1
noDetection_no_to_subtle_hazard      1
dtype: int64

### Aggregating data by videoId

Intuition behind aggregation: 
1) the x feature should become a list of all the x values reported for that videoId, same for y, where x[i] and y[i] correspond to the x,y of the eye gaze data for observation i before aggregation
2) for the features attentionFactors_*, take the max value for each feature 
3) take the max value for the feature noDetection_no_to_subtle_hazard
4) take the mean of detectionConfidence when aggregating 
5) take the mean of the age feature and the licenseAge feature. 
6) For the feature hazardFeature, I want your help aggregating it into a new feature called weightedHazardSeverity, which is the weighted aggregation of hazardFeature. For example, let's say there are 15 observations for the videoId that we are aggregating, where 10 users reported a hazardSeverity of 3, while 5 users reported a hazardSeverity of 5.Then the  weightedHazardSeverity = (10/15) * 3 + (5/15)*5
7) take the min of the gazeDuration. The intuition is that we want to capture the fastest response time with the hope that it transfers to the model
8) we will include time as a list as well


In [157]:
df_clean.columns

Index(['videoId', 'hazardDetected', 'detectionConfidence', 'hazardSeverity',
       'gazeDuration', 'licenseAge', 'age', 'x', 'y', 'time', 'hazard',
       'attentionFactors_construction', 'attentionFactors_environment',
       'attentionFactors_motion', 'attentionFactors_other',
       'attentionFactors_pedestrian', 'attentionFactors_proximity',
       'attentionFactors_velocity', 'noDetectionReason_uncertain',
       'noDetection_no_to_subtle_hazard'],
      dtype='object')

In [158]:
def aggregate_by_videoId(df):
    # Define aggregation functions
    aggregation_rules = {
        'x': lambda x: list(x),
        'y': lambda y: list(y),
        'time': lambda t: list(t),
        'detectionConfidence': 'mean',
        'gazeDuration': 'min',
        'age': 'mean',
        'hazardDetected': 'max',
        'licenseAge': 'mean',
        'noDetection_no_to_subtle_hazard': 'max',
    }

    # Handle attentionFactors_* columns by taking max
    attention_factor_cols = [col for col in df.columns if col.startswith("attentionFactors_")]
    for col in attention_factor_cols:
        aggregation_rules[col] = 'max'

    # Function to compute weighted hazard severity
    def compute_weighted_hazard_severity(hazard_values):
        value_counts = hazard_values.value_counts(normalize=True)  # Relative frequencies
        return sum(value_counts.index * value_counts.values)  # Weighted sum

    aggregation_rules['hazardSeverity'] = compute_weighted_hazard_severity

    # Perform aggregation
    aggregated_df = df.groupby('videoId').agg(aggregation_rules).reset_index()
    
    # Ensure x and y are sorted based on time
    for idx, row in aggregated_df.iterrows():
        sorted_indices = sorted(range(len(row['time'])), key=lambda i: row['time'][i])
        aggregated_df.at[idx, 'x'] = [row['x'][i] for i in sorted_indices]
        aggregated_df.at[idx, 'y'] = [row['y'][i] for i in sorted_indices]

    aggregated_df.rename(columns={'hazardSeverity': 'weightedHazardSeverity',
                                  'detectionConfidence': 'meanDetectionConfidence', 
                                  'age': 'meanAge',
                                  'licenseAge': 'meanLicenseAge',
                                  'gazeDuration': 'minGazeDuration'}, inplace=True)
    return aggregated_df

aggregated_df = aggregate_by_videoId(df_clean)


In [159]:
aggregated_df.shape

(346, 18)

In [160]:
aggregated_df.columns

Index(['videoId', 'x', 'y', 'time', 'meanDetectionConfidence',
       'minGazeDuration', 'meanAge', 'hazardDetected', 'meanLicenseAge',
       'noDetection_no_to_subtle_hazard', 'attentionFactors_construction',
       'attentionFactors_environment', 'attentionFactors_motion',
       'attentionFactors_other', 'attentionFactors_pedestrian',
       'attentionFactors_proximity', 'attentionFactors_velocity',
       'weightedHazardSeverity'],
      dtype='object')

### More Feature Engineering

Intuition:
- mean_x and mean_y: mean of all elements in x and y lists
- numGazes: ength of the x list
- variance_x and variance_y:  the variance for the x and y lists
- gazeVariance: computing the standard deviation of the Euclidean distances from the mean gaze point (captures how spread out the gaze points are in a way that considers both x and y dimensions.)
- gazeSpread: If we consider (x, y) as a 2D distribution, then one way to measure spread is computing the covariance matrix determinant (which represents spread in 2D). This feature will help us capture how gaze points vary independently along the x and y axes. Diagonal values of the covariance matrix (Var(X) and Var(Y)) provide insight into the horizontal and vertical spread

In [161]:
def engineer_aggregate_features(df):
    aggregated_df = df.copy()
    # Compute mean x, mean y
    aggregated_df['mean_x'] = aggregated_df['x'].apply(lambda lst: np.mean(lst))
    aggregated_df['mean_y'] = aggregated_df['y'].apply(lambda lst: np.mean(lst))

    # Compute numGazes
    aggregated_df['numGazes'] = aggregated_df['x'].apply(len)

    # Compute variance of x and y
    aggregated_df['variance_x'] = aggregated_df['x'].apply(lambda lst: np.var(lst, ddof=1))
    aggregated_df['variance_y'] = aggregated_df['y'].apply(lambda lst: np.var(lst, ddof=1))

    # Compute gazeVariance
    def compute_gaze_variance(x_list, y_list):
        coords = np.column_stack((x_list, y_list))  # Create (x,y) coordinate pairs
        return np.var(np.linalg.norm(coords - np.mean(coords, axis=0), axis=1), ddof=1)

    def compute_spread_feature(x_list, y_list):
        coords = np.column_stack((x_list, y_list))  # Stack x and y as coordinate pairs
        cov_matrix = np.cov(coords, rowvar=False)  # Compute covariance matrix
        return np.trace(cov_matrix)  # Sum of diagonal elements (Var(X) + Var(Y))

    aggregated_df['gazeVariance'] = aggregated_df.apply(lambda row: compute_gaze_variance(row['x'], row['y']), axis=1)
    aggregated_df['spreadFeature'] = aggregated_df.apply(lambda row: compute_spread_feature(row['x'], row['y']), axis=1)
    return aggregated_df

aggregated_df = engineer_aggregate_features(aggregated_df)

In [168]:
aggregated_df.dtypes

videoId                             object
x                                   object
y                                   object
time                                object
meanDetectionConfidence            float64
minGazeDuration                    float64
meanAge                            float64
hazardDetected                       int64
meanLicenseAge                     float64
noDetection_no_to_subtle_hazard      int64
attentionFactors_construction        int64
attentionFactors_environment         int64
attentionFactors_motion              int64
attentionFactors_other               int64
attentionFactors_pedestrian          int64
attentionFactors_proximity           int64
attentionFactors_velocity            int64
weightedHazardSeverity             float64
mean_x                             float64
mean_y                             float64
numGazes                             int64
variance_x                         float64
variance_y                         float64
gazeVarianc

In [162]:
aggregated_df.head()

Unnamed: 0,videoId,x,y,time,meanDetectionConfidence,minGazeDuration,meanAge,hazardDetected,meanLicenseAge,noDetection_no_to_subtle_hazard,attentionFactors_construction,attentionFactors_environment,attentionFactors_motion,attentionFactors_other,attentionFactors_pedestrian,attentionFactors_proximity,attentionFactors_velocity,weightedHazardSeverity,mean_x,mean_y,numGazes,variance_x,variance_y,gazeVariance,spreadFeature
0,video10,"[287.5052660194201, 287.8400607017086, 288.314...","[657.2601094440882, 653.0818622335628, 645.946...","[0.0, 0.223, 0.449, 0.676, 0.893, 1.116, 1.336...",1.0,17.463,24.0,0,17.0,0,0,0,0,0,0,0,0,0.0,632.326565,583.023732,51,116213.811644,99000.150842,32277.646349,215213.962485
1,video101,"[2.4660697939106684, 254.65919275355293, 255.3...","[308.1097732720022, 495.76360359063335, 496.08...","[0.0, 1.082, 2.278, 3.249, 4.185, 5.14, 6.048,...",4.583333,15.055,25.333333,1,16.583333,1,0,0,0,1,0,0,0,2.916667,243.210563,395.830774,36,30378.65655,12056.183646,9750.694929,42434.840196
2,video102,"[88.87054908746231, 91.25963781909893, 104.370...","[418.0696212670434, 418.2258640885737, 422.498...","[0.0, 0.325, 0.59, 1.078, 2.647, 2.897, 3.117,...",5.0,16.136,28.0,0,16.0,1,0,0,0,0,0,0,0,0.0,317.293711,461.948617,36,32650.206842,6866.134849,11380.96608,39516.341691
3,video103,"[385.5081347365811, 374.4397553782947, 373.864...","[550.3778906909721, 576.9898561753142, 577.001...","[0.0, 0.162, 0.337, 0.498, 0.66, 0.82, 0.982, ...",5.0,15.696,45.903084,0,14.061674,1,0,0,0,0,0,0,0,0.0,775.360027,440.275154,227,77664.98891,54908.525112,23249.303134,132573.514021
4,video104,"[355.23072968969666, 254.7782384728764, 374.10...","[627.4960365848065, 495.8789291312279, 576.599...","[0.0, 0.07, 0.141, 0.211, 0.28, 0.349, 0.424, ...",4.395498,15.039,27.672026,0,15.189711,1,0,0,0,0,0,0,0,0.0,508.894477,535.532313,311,32720.955902,51493.181917,21346.106881,84214.137819


In [163]:
aggregated_df.columns

Index(['videoId', 'x', 'y', 'time', 'meanDetectionConfidence',
       'minGazeDuration', 'meanAge', 'hazardDetected', 'meanLicenseAge',
       'noDetection_no_to_subtle_hazard', 'attentionFactors_construction',
       'attentionFactors_environment', 'attentionFactors_motion',
       'attentionFactors_other', 'attentionFactors_pedestrian',
       'attentionFactors_proximity', 'attentionFactors_velocity',
       'weightedHazardSeverity', 'mean_x', 'mean_y', 'numGazes', 'variance_x',
       'variance_y', 'gazeVariance', 'spreadFeature'],
      dtype='object')

In [164]:
print_descriptive_stats(aggregated_df[['videoId', 'meanDetectionConfidence', 'minGazeDuration',
       'meanAge', 'meanLicenseAge', 'noDetection_no_to_subtle_hazard',
       'attentionFactors_construction', 'attentionFactors_environment',
       'attentionFactors_motion', 'attentionFactors_other',
       'attentionFactors_pedestrian', 'attentionFactors_proximity',
       'attentionFactors_velocity', 'weightedHazardSeverity', 'mean_x',
       'mean_y', 'numGazes', 'variance_x', 'variance_y', 'gazeVariance',
       'spreadFeature']])


Descriptive Statistics for aggregate stats before one hot-encode
---------------
Central Tendency Measures:
      meanDetectionConfidence  minGazeDuration    meanAge  meanLicenseAge  \
mean                 4.245000        15.834991  34.228910       15.959193   
50%                  4.661205        15.429000  29.379244       16.109119   

      noDetection_no_to_subtle_hazard  attentionFactors_construction  \
mean                         0.734104                       0.014451   
50%                          1.000000                       0.000000   

      attentionFactors_environment  attentionFactors_motion  \
mean                      0.049133                 0.112717   
50%                       0.000000                 0.000000   

      attentionFactors_other  attentionFactors_pedestrian  \
mean                0.092486                     0.075145   
50%                 0.000000                     0.000000   

      attentionFactors_proximity  attentionFactors_velocity  \
mean 

In [169]:
aggregated_df.columns

Index(['videoId', 'x', 'y', 'time', 'meanDetectionConfidence',
       'minGazeDuration', 'meanAge', 'hazardDetected', 'meanLicenseAge',
       'noDetection_no_to_subtle_hazard', 'attentionFactors_construction',
       'attentionFactors_environment', 'attentionFactors_motion',
       'attentionFactors_other', 'attentionFactors_pedestrian',
       'attentionFactors_proximity', 'attentionFactors_velocity',
       'weightedHazardSeverity', 'mean_x', 'mean_y', 'numGazes', 'variance_x',
       'variance_y', 'gazeVariance', 'spreadFeature'],
      dtype='object')

In [165]:
aggregated_df.to_csv(output_aggregate_csv_filename, index=False)