# Preprocessing (feature engineering)

I do all feature engineering before splitting the data because the two steps are independent in my pipeline, and doing the engineering first keeps the pipeline simple.

The purpose of this notebook is to:

1. Fill missing values in `heat index`, `wind chill`, and `class` (*explained in EDA*)
2. Create a speed feature
3. Create race-specific combinations of speed and time
4. Create weather combinations
5. Add `month`, `day`, and `year` features

## Import necessary packages and data

In [14]:
import pandas as pd

df = pd.read_csv("../data/compiled.csv")
df.head()

| Unnamed: 0 | Name | Grade | Section | Class | School | Race | Date | Place | Time (sec) | Speed Rating | ... | Wind Speed | Precipitation | Dew Point | Humidity | Wind Chill | Wind Gust | Heat Index | Visibility | Year | Distance (mi) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allyson Tierney | 12.0 | 01- | C | Albertus Magnus | Section 1-Class C | 2014-11-01 00:00:00 | 25.0 | 1325 | 70.0 | ... | 14.3 | 0.0 | 33.4 | 63.18 | 38.7 | 23.0 |  | 9.9 | 2014 | 3.106 |
| 1 | Andrea Nardone | 11.0 | 01- | C | Albertus Magnus | Section 1-Class C | 2014-11-01 00:00:00 | 3.0 | 1172 | 121.03 | ... | 14.3 | 0.0 | 33.4 | 63.18 | 38.7 | 23.0 |  | 9.9 | 2014 | 3.106 |
| 2 | Emily Auld | 11.0 | 01- | C | Albertus Magnus | Section 1-Class C | 2014-11-01 00:00:00 | 15.0 | 1253 | 94.19 | ... | 14.3 | 0.0 | 33.4 | 63.18 | 38.7 | 23.0 |  | 9.9 | 2014 | 3.106 |
| 3 | Ruth Segall | 10.0 | 01- | C | Ardsley | Section 1-Class C | 2014-11-01 00:00:00 | 37.0 | 1384 | 50.54 | ... | 14.3 | 0.0 | 33.4 | 63.18 | 38.7 | 23.0 |  | 9.9 | 2014 | 3.106 |
| 4 | Selena Colon | 12.0 | 01- | C | Ardsley | Section 1-Class C | 2014-11-01 00:00:00 | 8.0 | 1198 | 112.62 | ... | 14.3 | 0.0 | 33.4 | 63.18 | 38.7 | 23.0 |  | 9.9 | 2014 | 3.106 |


## Fill (some) Missing Values

In [15]:
# Explained in EDA. TL;DR: heat index is not reported below 80F and wind chill is not reported
# above 50F or with winds below 3 mph, so missing values fall back to the air temperature
df['Heat Index'] = df['Heat Index'].fillna(df['Temperature'])
df['Wind Chill'] = df['Wind Chill'].fillna(df['Temperature'])
df['Class'] = df['Class'].fillna("CITY") # missing values for class are all NYC schools
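
The fill logic above can be sketched on a few made-up rows. This is only an illustration: the values are hypothetical, and the column names simply mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy rows with hypothetical values; column names mirror the notebook's
toy = pd.DataFrame({
    "Temperature": [45.0, 72.0, 88.0],
    "Heat Index": [np.nan, np.nan, 93.5],
    "Wind Chill": [38.7, np.nan, np.nan],
    "Class": ["C", None, "A"],
})

# fillna with a Series aligns on the index, so each gap takes that row's own temperature
toy["Heat Index"] = toy["Heat Index"].fillna(toy["Temperature"])
toy["Wind Chill"] = toy["Wind Chill"].fillna(toy["Temperature"])

# A scalar fill value is applied to every missing entry
toy["Class"] = toy["Class"].fillna("CITY")

print(toy)
```

Rows that already have a heat index or wind chill keep their reported value; only the gaps are replaced.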

## Feature Engineering 
Speed, Race-Specific Speed & Time, Weather, Date
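
As a minimal sketch of the three simplest feature types (speed, one weather product, and date parts), here is a single made-up row. The `Temperature` value is hypothetical, and only one of the weather products is shown.

```python
import pandas as pd

# One toy row; Temperature is a hypothetical value chosen for illustration
toy = pd.DataFrame({
    "Date": ["2014-11-01 00:00:00"],
    "Time (sec)": [1325],
    "Distance (mi)": [3.106],
    "Temperature": [50.0],
    "Humidity": [63.18],
})

# Dates arrive as strings from the CSV, so convert before using the .dt accessor
toy["Date"] = pd.to_datetime(toy["Date"])

# Speed is distance over time, in miles per second
toy["Speed (mi/sec)"] = toy["Distance (mi)"] / toy["Time (sec)"]

# A weather combination is a plain element-wise product of two columns
toy["Temp_Humidity"] = toy["Temperature"] * toy["Humidity"]

# Date parts come from the .dt accessor
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Day"] = toy["Date"].dt.day

print(toy.T)
```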

In [16]:
# Keep the original column index so the new features can be listed at the end
before_cols = df.columns

# Finishing time divided by finishing place, and name and school joined into one runner key
df['Time-Place'] = df["Time (sec)"]/df["Place"]
df['Name-School'] = df["Name"] + "-" + df["School"]

df['Date'] = pd.to_datetime(df['Date'])

# Speed over the course, in miles per second
df["Speed (mi/sec)"] = df["Distance (mi)"]/df["Time (sec)"]

# Mean finishing time for each race (a race is one Date/Race pair)
average_time = df.groupby(['Date', 'Race'])['Time (sec)'].mean().reset_index()
average_time = average_time.rename(columns={'Time (sec)': 'Average_Time'})

# Attach the race mean to every runner and take the difference from it
df = pd.merge(df, average_time, on=['Date', 'Race'], how='left')
df['Time_Difference_From_Avg'] = df['Time (sec)'] - df['Average_Time']

# Same calculation, but per class within each race
average_time = df.groupby(['Date', 'Class', 'Race'])['Time (sec)'].mean().reset_index()
average_time = average_time.rename(columns={'Time (sec)': 'Average_Time_Class'})

df = pd.merge(df, average_time, on=['Date', 'Race', 'Class'], how='left')
df['Time_Difference_From_Avg_Class'] = df['Time (sec)'] - df['Average_Time_Class']

# Preview the new columns (only rendered when this is the last expression in a cell)
df[['Date', 'Grade', 'Section', 'Class', 'Time (sec)', "Speed (mi/sec)", 'Time_Difference_From_Avg', 'Time_Difference_From_Avg_Class']].head()

# Find the first place time for each race
first_place_time = df.groupby(['Date', 'Race'])['Time (sec)'].min().reset_index()
first_place_time = first_place_time.rename(columns={'Time (sec)': 'First_Place_Time'})

# Merge this information with the original dataframe
df = pd.merge(df, first_place_time, on=['Date', 'Race'], how='left')

# Calculate the time difference from the first place
df['Time_Difference_First_Place'] = df['Time (sec)'] - df['First_Place_Time']

# Similarly, for each class within each race:
first_place_time_class = df.groupby(['Date', 'Class', 'Race'])['Time (sec)'].min().reset_index()
first_place_time_class = first_place_time_class.rename(columns={'Time (sec)': 'First_Place_Time_Class'})

# Merge this information with the original dataframe
df = pd.merge(df, first_place_time_class, on=['Date', 'Race', 'Class'], how='left')

# Calculate the time difference from the first place in the class
df['Time_Difference_First_Place_Class'] = df['Time (sec)'] - df['First_Place_Time_Class']

# Display the data
df[['Place', 'Time_Difference_From_Avg', 'Time_Difference_From_Avg_Class', 'Time_Difference_First_Place', 'Time_Difference_First_Place_Class']].head()

# Mean speed for each race, and each runner's difference from it
average_speed = df.groupby(['Date', 'Race'])['Speed (mi/sec)'].mean().reset_index()
average_speed = average_speed.rename(columns={'Speed (mi/sec)': 'Average_Speed'})

df = pd.merge(df, average_speed, on=['Date', 'Race'], how='left')
df['Speed_Difference_From_Avg'] = df['Speed (mi/sec)'] - df['Average_Speed']

# Weather combinations: element-wise products of pairs of weather columns
df['Temp_Humidity'] = df['Temperature'] * df['Humidity']
df['WindSpeed_WindChill'] = df['Wind Speed'] * df['Wind Chill']
df['Temp_WindSpeed'] = df['Temperature'] * df['Wind Speed']
df['Humidity_WindSpeed'] = df['Humidity'] * df['Wind Speed']
df['HeatIndex_Humidity'] = df['Heat Index'] * df['Humidity']
df['DewPoint_Temperature'] = df['Dew Point'] * df['Temperature']
df['DewPoint_Humidity'] = df['Dew Point'] * df['Humidity']
df['DewPoint_WindSpeed'] = df['Dew Point'] * df['Wind Speed']

# Date parts ('Year' already exists in the input and is overwritten with the value derived from 'Date')
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day

print(f"Columns before ({len(before_cols)}): {list(before_cols)}")
print(f"Columns after  ({len(df.columns)}): {list(df.columns)}")

Columns before (27): ['Name', 'Grade', 'Section', 'Class', 'School', 'Race', 'Date', 'Place', 'Time (sec)', 'Speed Rating', 'SR', 'Gender', 'Race Section', 'Latitude', 'Longitude', 'Temperature', 'Cloud Coverage', 'Wind Speed', 'Precipitation', 'Dew Point', 'Humidity', 'Wind Chill', 'Wind Gust', 'Heat Index', 'Visibility', 'Year', 'Distance (mi)']
Columns after  (50): ['Name', 'Grade', 'Section', 'Class', 'School', 'Race', 'Date', 'Place', 'Time (sec)', 'Speed Rating', 'SR', 'Gender', 'Race Section', 'Latitude', 'Longitude', 'Temperature', 'Cloud Coverage', 'Wind Speed', 'Precipitation', 'Dew Point', 'Humidity', 'Wind Chill', 'Wind Gust', 'Heat Index', 'Visibility', 'Year', 'Distance (mi)', 'Time-Place', 'Name-School', 'Speed (mi/sec)', 'Average_Time', 'Time_Difference_From_Avg', 'Average_Time_Class', 'Time_Difference_From_Avg_Class', 'First_Place_Time', 'Time_Difference_First_Place', 'First_Place_Time_Class', 'Time_Difference_First_Place_Class', 'Average_Speed', 'Speed_Difference_From
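
The per-race averages and first-place times above are built with `groupby` followed by `merge`. An alternative that may be worth considering is `groupby(...).transform`, which returns a result aligned to the original rows and so skips the merge. The sketch below is not the code this notebook runs; it uses a toy frame with made-up times and only the `Date`/`Race` grouping.

```python
import pandas as pd

# Toy data: two races with hypothetical finishing times
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-11-01"] * 3 + ["2014-11-08"] * 2),
    "Race": ["Section 1-Class C"] * 3 + ["Section 2-Class B"] * 2,
    "Time (sec)": [1172, 1253, 1325, 1100, 1200],
})

# transform broadcasts each group's statistic back onto that group's rows
grouped = toy.groupby(["Date", "Race"])["Time (sec)"]
toy["Average_Time"] = grouped.transform("mean")
toy["First_Place_Time"] = grouped.transform("min")

toy["Time_Difference_From_Avg"] = toy["Time (sec)"] - toy["Average_Time"]
toy["Time_Difference_First_Place"] = toy["Time (sec)"] - toy["First_Place_Time"]

print(toy)
```

Because no merge is involved, the row count and row order cannot change, which removes one thing to check after each step.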

## Export

In [17]:
df.to_csv("../data/preprocessed.csv", index=False)
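
One thing to keep in mind when reading the exported file back: CSV does not store dtypes, so `Date` comes back as text unless it is parsed again. The sketch below shows this on a toy frame, using an in-memory buffer in place of `../data/preprocessed.csv`; how the downstream notebook actually loads the file is an assumption here.

```python
import io

import pandas as pd

# Toy frame standing in for the preprocessed data
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-11-01"]),
    "Time (sec)": [1325],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read returns Date as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Date"])

print(plain.dtypes)
print(parsed.dtypes)
```

Writing with `index=False`, as above, also keeps the row index out of the file, so no extra index column appears on reload.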