## EDA

This notebook outlines exploratory data analysis (EDA) on ``data/normalized_gaze_data.csv``, which is produced by running the functions in TalEDA.ipynb and running LennoxEDA.ipynb. Starting from the MongoDB export, the data was processed as follows:
- Cleaned the data by converting strings to their appropriate types, renaming the emailed column to userId for consistency between DataFrames, and dropping the _id column since it is wrong
- Expanded the formData and windowDimensions columns in the survey df from dicts into individual columns
- Trimmed extra start times when the duration exceeded 15 seconds, since the videos are only 15 seconds long
- Converted start and end times to a meaningful duration, then dropped the start/end time columns and windowDimensions
- Merged the user df with the survey df
- Split the entire dataframe into one row per timestamp, with a value indicating whether that timestamp is hazardous or not; this value is the label
- One-hot encoded the following features: ['noDetectionReason', 'country', 'state', 'city', 'ethnicity', 'gender']
- Dropped rows with missing data
- Converted video data to 0.5 s splits and replaced the binary hazard data with a majority vote per video per time bin
- Normalized gaze data based on screen size
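
The 0.5 s binning with majority vote described in the list above might look roughly like the sketch below. The column names `time` and `hazard`, the helper `majority_vote_bins`, and the tie-breaking rule (a tie counts as not hazardous) are assumptions made for illustration; the actual implementation lives in the upstream notebooks and may differ.

```python
import pandas as pd


def majority_vote_bins(df: pd.DataFrame, bin_size: float = 0.5) -> pd.DataFrame:
    """Label each (video, time bin) by the majority of its hazard votes.

    Assumes hypothetical columns: `videoId`, `time` (seconds), `hazard` (bool).
    """
    out = df.copy()
    # Bin index 0 covers [0, 0.5) s, 1 covers [0.5, 1.0) s, and so on.
    out["time_bin"] = (out["time"] // bin_size).astype(int)
    # The mean of a boolean column is the share of True votes in the group.
    share = out.groupby(["videoId", "time_bin"])["hazard"].mean()
    # Assumption: a strict majority is required, so ties become False.
    return (share > 0.5).rename("hazard").reset_index()


# Toy data: three votes in the first bin, two in the second.
votes = pd.DataFrame(
    {
        "videoId": ["video219"] * 5,
        "time": [0.1, 0.2, 0.4, 0.6, 0.9],
        "hazard": [True, True, False, False, False],
    }
)
binned = majority_vote_bins(votes)
print(binned)
```

Grouping by `videoId` and `time_bin` only (not by user) pools every participant's votes for the same half-second of the same video, which is one reading of "per video per time bin".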


In [1]:
import pandas as pd
import os
from pathlib import Path

In [2]:
data_csv_dir = "../data/normalized_gaze_data.csv"

In [6]:
df = pd.read_csv(data_csv_dir)

In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,userId,videoId,hazardDetected,detectionConfidence,hazardSeverity,width,height,duration,licenseAge,...,original_x,original_y,original_width,original_height,display_width,display_height,x_offset,y_offset,normalized_to_width,normalized_to_height
0,0,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,500.496763,499.953378,1470,797,1062.666667,797,203.666667,0.0,1280,960
1,1,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,500.898921,497.460984,1470,797,1062.666667,797,203.666667,0.0,1280,960
2,2,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,514.78969,496.588843,1470,797,1062.666667,797,203.666667,0.0,1280,960
3,3,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,553.986331,478.562368,1470,797,1062.666667,797,203.666667,0.0,1280,960
4,4,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,606.06136,461.811118,1470,797,1062.666667,797,203.666667,0.0,1280,960
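
Columns named `Unnamed: 0` usually appear when a DataFrame is written with `to_csv` (which saves the index by default) and then read back without `index_col`. That this is what happened upstream is an assumption; the toy frame `demo` below is hypothetical and only illustrates the mechanism and two ways to avoid the extra column.

```python
import io

import pandas as pd

demo = pd.DataFrame({"userId": ["a", "b"], "videoId": ["video1", "video2"]})

# to_csv writes the index as an unlabeled first column by default.
buf = io.StringIO()
demo.to_csv(buf)
buf.seek(0)
reloaded = pd.read_csv(buf)
print(list(reloaded.columns))  # ['Unnamed: 0', 'userId', 'videoId']

# Option 1: treat the first column as the index when reading.
buf.seek(0)
with_index = pd.read_csv(buf, index_col=0)

# Option 2: drop leftover columns whose name starts with "Unnamed".
cleaned = reloaded.loc[:, ~reloaded.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))  # ['userId', 'videoId']
```

Writing with `to_csv(..., index=False)` in the upstream notebooks would prevent the column from being created in the first place.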


In [None]:
# Rename the 'Unnamed: 0' column to 'index', which is a more descriptive name
df.rename(columns={'Unnamed: 0': 'index'}, inplace=True)

In [9]:
df.head()

Unnamed: 0,index,userId,videoId,hazardDetected,detectionConfidence,hazardSeverity,width,height,duration,licenseAge,...,original_x,original_y,original_width,original_height,display_width,display_height,x_offset,y_offset,normalized_to_width,normalized_to_height
0,0,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,500.496763,499.953378,1470,797,1062.666667,797,203.666667,0.0,1280,960
1,1,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,500.898921,497.460984,1470,797,1062.666667,797,203.666667,0.0,1280,960
2,2,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,514.78969,496.588843,1470,797,1062.666667,797,203.666667,0.0,1280,960
3,3,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,553.986331,478.562368,1470,797,1062.666667,797,203.666667,0.0,1280,960
4,4,jonahmulcrone@gmail.com,video219,False,5,0,1280,960,15.755,17.0,...,606.06136,461.811118,1470,797,1062.666667,797,203.666667,0.0,1280,960
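
The geometry columns in the preview (`original_width`, `display_width`, `x_offset`, `normalized_to_width`, and their height counterparts) are consistent with an aspect-preserving fit of a 1280×960 frame inside the browser window, centered with equal margins. The sketch below reproduces those values under that assumption. The function name `letterbox_normalize` and the output columns `x_norm` and `y_norm` are hypothetical, and the upstream notebooks may compute the normalization differently.

```python
import numpy as np
import pandas as pd


def letterbox_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Map window-space gaze points into a fixed target resolution.

    Assumption: the video is scaled uniformly to fit the window and centered.
    """
    out = df.copy()
    # Largest uniform scale at which the target frame still fits the window.
    scale = np.minimum(
        out["original_width"] / out["normalized_to_width"],
        out["original_height"] / out["normalized_to_height"],
    )
    out["display_width"] = out["normalized_to_width"] * scale
    out["display_height"] = out["normalized_to_height"] * scale
    # Margins on each side when the frame is centered in the window.
    out["x_offset"] = (out["original_width"] - out["display_width"]) / 2
    out["y_offset"] = (out["original_height"] - out["display_height"]) / 2
    # Shift into frame coordinates, then rescale to the target resolution.
    out["x_norm"] = (out["original_x"] - out["x_offset"]) / scale
    out["y_norm"] = (out["original_y"] - out["y_offset"]) / scale
    return out


# Input values copied from the first row of the preview above.
row = pd.DataFrame(
    {
        "original_x": [500.496763],
        "original_y": [499.953378],
        "original_width": [1470],
        "original_height": [797],
        "normalized_to_width": [1280],
        "normalized_to_height": [960],
    }
)
result = letterbox_normalize(row)
print(result[["display_width", "display_height", "x_offset", "y_offset"]])
```

For a 1470×797 window this gives a display width of about 1062.67 and a horizontal offset of about 203.67, matching the `display_width` and `x_offset` values shown in the table.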
