# Feature Engineering I

**This notebook acquires and tidies the raw play-by-play data for the 2015/16 through 2019/20 seasons (inclusive), with the last season held out as the test set. The tidied data is used for the baseline models; the additional features built in the Feature Engineering II section will require the full raw data.**

We ran the file "nhl_data_downloader" twice: once to save our training dataset (2015/16 to 2018/19 seasons) in the folder "nhl_data_train", and once to set aside all of the 2019/20 data as our final test set in the folder "nhl_data_test".

Here, we will work with the file "tidy_data.py" for the first part of feature engineering.


In [None]:
import pandas as pd
import numpy as np
import json
import os
import seaborn as sns
import matplotlib.pyplot as plt

# For the new NHL API
from tidy_data import *

# For the old NHL API
# from tidy_data_old_api import *

## Tidying the data

Using our training dataset, we created a tidied dataset for each SHOT/GOAL event, with the following columns:

- 'DistanceToGoal' (distance of the shot from the net, in feet)
- 'ShootingAngle' (angle from which the shot was taken, in degrees)
- 'isGoal' (0 or 1)
- 'isEmptyNet' (0 or 1; NaNs are treated as 0)
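
The NaN handling for 'isEmptyNet' can be illustrated with a minimal sketch. The toy values below are made up, and whether `tidy_data.py` performs this step itself is not shown in this notebook, so treat this as one possible way to apply the assumption rather than the actual implementation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tidied shot data (values are made up)
shots = pd.DataFrame({
    "isGoal": [0, 1, 1, 0],
    "isEmptyNet": [0.0, np.nan, 1.0, np.nan],
})

# Treat a missing empty-net flag as "goalie in net" (0), then store as integers
shots["isEmptyNet"] = shots["isEmptyNet"].fillna(0).astype(int)

print(shots["isEmptyNet"].tolist())
```

Casting to `int` after filling keeps the column a clean 0/1 indicator for the models.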


We approximated the net as a single point (i.e. we didn't consider the width of the net when computing the distance or angle). 

Reference for the shot angle: http://hockeyanalytics.com/Research_files/SQ-RS0910-Krzywicki.pdf
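
To make the point-net approximation concrete, here is a rough sketch of how a distance and an angle could be derived from shot coordinates. The helper `distance_and_angle`, the column names `x` and `y`, and the convention that every shot attacks a net at (89, 0) are assumptions for illustration; the actual `add_distance` and `add_angle` functions in `tidy_data.py` may use different conventions (for example an unsigned angle).

```python
import numpy as np
import pandas as pd

# Assumption: coordinates are in feet and the attacked net sits at (89, 0).
# 89 ft is where the goal line falls on a standard 200 ft NHL rink.
NET_X, NET_Y = 89.0, 0.0

def distance_and_angle(x, y):
    """Return (distance in feet, signed angle in degrees) to a point-sized net."""
    dx = NET_X - np.asarray(x, dtype=float)
    dy = np.asarray(y, dtype=float) - NET_Y
    distance = np.hypot(dx, dy)
    # 0 degrees is straight in front of the net; the sign follows the y side
    angle = np.degrees(np.arctan2(dy, dx))
    return distance, angle

# Hypothetical shot locations
shots = pd.DataFrame({"x": [79.0, 89.0, 59.0], "y": [0.0, 10.0, 30.0]})
shots["DistanceToGoal"], shots["ShootingAngle"] = distance_and_angle(shots["x"], shots["y"])
print(shots)
```

Because the net is a single point, a shot taken from the goal line itself gets an angle of 90 degrees regardless of how close it is to the post.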


In [None]:
# RUN THIS CELL ONLY IF YOU DON'T ALREADY HAVE THE RAW DATA IN A TIDIED FORMAT IN A CSV FILE

# folder of the raw training dataset
folder_train = 'nhl_data_train'

# run the "tidy_data.py" code to get a clean df of the raw data (this takes a lot of time to run)
run_tidy_data(folder_train)

In [None]:
# Load the dataset
df = pd.read_csv("nhl_data_train.csv")

# keep only shots and goals (copy so later column assignments don't act on a filtered view)
df = df[df['Event'].isin(['SHOT', 'GOAL'])].copy()

# add distance and angle columns
df = add_distance(df)
df = add_angle(df)

df

In [None]:
# keep only the selected columns
df = df[['isEmptyNet', 'isGoal', 'DistanceToGoal', 'ShootingAngle']]

# save to a csv file for the baseline model
df.to_csv('baseline_model_data.csv')
df

## Visualizing the data

Let's create several plots to get a better idea of how shooting angle and distance relate to the likelihood of a shot being converted into a goal.

In [None]:
# Separate the DataFrame into two subsets: one for goals and one for no-goals
goals_df = df[df['isGoal'] == 1]
no_goals_df = df[df['isGoal'] == 0]

# Set up the figure
plt.figure(figsize=(12, 6))

# Histogram for goals
sns.histplot(data=goals_df, x='DistanceToGoal', bins=20, color='red', label='Goals', alpha=1)

# Histogram for no-goals
sns.histplot(data=no_goals_df, x='DistanceToGoal', bins=20, color='blue', label='No Goals', alpha=0.4)

plt.xlabel('Distance from the Net (in feet)')
plt.ylabel('Shot Count')
plt.legend()
plt.title('Histogram of Shot Count by Distance (Goals and No Goals)')

plt.show()

In [None]:
# Separate the DataFrame into two subsets: one for goals and one for no-goals
goals_df = df[df['isGoal'] == 1]
no_goals_df = df[df['isGoal'] == 0]

# Set up the figure
plt.figure(figsize=(12, 6))

# Histogram for goals
sns.histplot(data=goals_df, x='ShootingAngle', bins=20, color='red', label='Goals', alpha=1)

# Histogram for no-goals
sns.histplot(data=no_goals_df, x='ShootingAngle', bins=20, color='blue', label='No Goals', alpha=0.4)

plt.xlabel('Shooting Angle (in degrees)')
plt.ylabel('Shot Count')
plt.legend()
plt.title('Histogram of Shot Count by Shooting Angle (Goals and No Goals)')

plt.show()

In [None]:
# Joint plot (jointplot creates its own figure, so size it with `height`)
g = sns.jointplot(data=df, x='DistanceToGoal', y='ShootingAngle', kind='hist', bins=20, cmap='viridis', cbar=True, height=8)

# Label the joint axes and title the whole figure
g.set_axis_labels('Distance to Goal (in feet)', 'Shooting Angle (in degrees)')
g.figure.suptitle('2D Histogram of Distance vs. Shooting Angle')

plt.show()

In [None]:
# Group the data by distance
distance_grouped = df.groupby('DistanceToGoal')

# Calculate the number of goals and no-goals at each distance
goals_count = distance_grouped['isGoal'].sum()
no_goals_count = distance_grouped.size().subtract(goals_count, fill_value=0)

# Compute the goal rate (#goals / (#no_goals + #goals))
goal_rate = goals_count / (no_goals_count + goals_count)

# Create a new DataFrame with the distance and goal rate
goal_rate_df = pd.DataFrame({'distance': goal_rate.index, 'goal_rate': goal_rate.values})

# Reset the index for a cleaner DataFrame
goal_rate_df.reset_index(drop=True, inplace=True)

goal_rate_df
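
Grouping on exact distance values can leave many groups with only a handful of shots, which makes the per-group rate noisy. One possible alternative, sketched here on synthetic data, is to bin the distances first. The 5 ft bin width, the `distance_bin` column, and the generated data are all arbitrary choices for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic shots (made-up data): closer shots score more often
rng = np.random.default_rng(0)
distance = rng.uniform(0, 100, size=5000)
is_goal = (rng.random(5000) < 0.3 * np.exp(-distance / 30)).astype(int)
shots = pd.DataFrame({"DistanceToGoal": distance, "isGoal": is_goal})

# Bin distances into 5 ft intervals (the bin width is an arbitrary choice)
bins = np.arange(0, 105, 5)
shots["distance_bin"] = pd.cut(shots["DistanceToGoal"], bins=bins, include_lowest=True)

# The mean of a 0/1 column equals #goals / #shots within each bin
binned = (
    shots.groupby("distance_bin", observed=True)["isGoal"]
    .agg(goal_rate="mean", shots="count")
    .reset_index()
)
print(binned.head())
```

Keeping the per-bin shot count next to the rate makes it easy to spot bins where the estimate rests on very few shots.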

In [None]:
# Line plot
plt.figure(figsize=(10, 6))
plt.plot(goal_rate_df['distance'], goal_rate_df['goal_rate'], alpha=0.5)
plt.xlabel('Distance from the Net (in feet)')
plt.ylabel('Goal Rate')
plt.title('Goal Rate vs. Distance')
plt.show()

In [None]:
# Group the data by angle
angle_grouped = df.groupby('ShootingAngle')

# Calculate the number of goals and no-goals at each angle
goals_count = angle_grouped['isGoal'].sum()
no_goals_count = angle_grouped.size().subtract(goals_count, fill_value=0)

# Compute the goal rate (#goals / (#no_goals + #goals))
angle_goal_rate = goals_count / (no_goals_count + goals_count)

# Create a new DataFrame with the angle and goal rate
angle_goal_rate_df = pd.DataFrame({'angle': angle_goal_rate.index, 'goal_rate': angle_goal_rate.values})

# Reset the index for a cleaner DataFrame
angle_goal_rate_df.reset_index(drop=True, inplace=True)

angle_goal_rate_df

In [None]:
# Line plot
plt.figure(figsize=(10, 6))
plt.plot(angle_goal_rate_df['angle'], angle_goal_rate_df['goal_rate'], alpha=0.5)
plt.xlabel('Shooting Angle (in degrees)')
plt.ylabel('Goal Rate')
plt.title('Goal Rate vs. Shooting Angle')
plt.show()

In [None]:
# Filter the DataFrame to include only goals
goal_df = df[df['isGoal'] == 1]

# Create two subsets: empty net goals and non-empty net goals
empty_net_goals = goal_df[goal_df['isEmptyNet'] == 1]
non_empty_net_goals = goal_df[goal_df['isEmptyNet'] == 0]

print(goal_df['isEmptyNet'].value_counts())

plt.figure(figsize=(10, 6))

# Histogram for empty-net goals
sns.histplot(empty_net_goals['DistanceToGoal'], bins=20, alpha=0.5, label='Empty Net Goals', color='red')

# Histogram for non-empty-net goals
sns.histplot(non_empty_net_goals['DistanceToGoal'], bins=20, alpha=0.5, label='Non-Empty Net Goals', color='blue')


plt.xlabel('Distance from the Net (in feet)')
plt.ylabel('Number of Goals')
plt.title('Histogram of Goals by Distance (Empty Net vs. Non-Empty Net)')
plt.legend()

plt.show()