# Pucks to the Net 
## An Analysis of Shooting and Goal Scoring in the NHL from 2000 - 2020

### Team Members:
- Alec Lunn
- Dylan Medina
- Matthew Zirpoli

## Introduction

In Canada, hockey is more than just a casual sport; it is one of the backbones of Canadian culture and identity. As the birthplace of hockey, Canada has a deep connection to the game itself. The National Hockey League (NHL) features several Canadian teams, including the Calgary Flames, near and dear to us at UofC, and showcases the top tier of professional hockey talent from around the world. This, in turn, offers a vast warehouse of data for sports analytics and applications of data science.

Understanding the dynamics of NHL games through data analysis provides key insights about player performance, team strategies, winning statistics and in-game decision-making. We aim to identify the key factors that influence successful teams in the league, as well as uncover potential areas for improvement for players and teams. In this project, we specifically home in on shots and goals in our analysis of NHL game data acquired from Kaggle.

## Guiding Questions

- Where on the ice do shots have the best percentage of scoring goals/Where are you most likely to score a goal? What types of shots are most likely to score?
    
    In hockey, scoring goals on the other team's net is the key to winning games. The location on the ice where shots are taken plays a significant role in a team's scoring chances. By analyzing shot location data, this question aims to identify the zones where players are most likely to score and with what type of shot. These insights help teams and coaching staffs design offensive plays, position players during power plays, and create defensive strategies to limit opposing teams' scoring chances.

- How does the score differential affect the team's decision-making and strategy during the game?
    
    Hockey, especially at the professional level, is one of the most dynamic professional sports. The score differential between teams plays an important role in shaping a team's approach and strategy for the rest of the game. Leading teams often adopt a defensive approach (maintaining possession, blocking shots, and minimizing turnovers), while trailing teams become more aggressive: pulling goalies, increasing shot volume, and taking risks to create scoring chances. As a result, we seek to analyze how teams adjust their strategies with respect to score differential.

- What are the key differences in all-star calibre shooting versus league average shooting?
    
    This question explores how All-Star players excel in shot location, accuracy and selection compared to league-average players. Do they consistently shoot from high-probability areas, or succeed from tougher angles? By comparing shot location, shooting percentage and shot type between All-Stars and league-average professionals, we can find patterns that distinguish the best players.

- Have shot patterns changed over the years (i.e., do shots succeed from different areas, do players favour different shot types/locations)?
    
    As the game of hockey is constantly evolving, this question examines how player tendencies and strategies have shifted over time. How much have shot locations, types and success rates changed? How much have advancements in tactics, technology and goaltending influenced scoring patterns over the league's progression? This is important for fans and professionals who want to see what direction the game is moving in.


## Dataset

For this project, we will use a dataset of [NHL Game Data](https://www.kaggle.com/datasets/martinellis/nhl-game-data/data) available on Kaggle. This dataset includes NHL game data from the 2000-2001 season until the 2019-2020 season. The collected data is stored in a tabular format in a series of .csv files representing a point-in-time snapshot of a relational database.

For our project, we are particularly interested in the game_plays table, which stores game plays like shots, goals, penalties, hits and other statistic-driving events. The table also gives the x, y coordinates on the ice where these plays happened, as well as some game conditions like the current score, time remaining, and power play/penalty kill conditions when the play happened. In particular, we plan to pair shot and goal events with their locations to analyze shot quality and chance of success.

We will also use reference tables like game_plays_players and player_info, which map plays to players' biographical information, in order to differentiate between high-level players and others. High-level players will be identified by their inclusion in the All-Star Game. Data on players' All-Star status will be sourced from [hockey-reference.com](https://www.hockey-reference.com/allstar/). The game table will also be of use for its season column in later time-series analysis.


## Analysis

### Data Cleaning

Here we focus on importing our dataset and addressing any missing or duplicate data. Our data is stored in multiple .csv files, so we will address them one by one in the following sections. Then we will save them to pickle files for use in other notebooks.

In [None]:
from pathlib import Path
import pandas as pd

DATA_PATH = Path("../data/")
PICKLE_PATH = Path("../pickled_data/")

#### Play Data
This set stores information about individual plays or events within a game. There are different event types corresponding to the moments of statistical interest within a hockey game. Some event types are associated with x, y coordinates giving location on the ice.

In [None]:
game_plays = pd.read_csv(DATA_PATH / "game_plays.csv")
game_plays.info()

In [None]:
null_plays = game_plays.isna().sum()
null_plays

There is some significant missing data at a glance; however, we know that this is by design. Certain events are missing team info because they are "neutral" and don't pertain to a particular team. X, Y coords are not recorded for every event type, and secondaryType is only recorded for certain event types, such as shots and goals.
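
One way to check that missing values really follow event type is to compute the share of nulls per column within each event. The sketch below uses a small invented table (`toy_plays` is a stand-in, not the real game_plays data), so only the pattern carries over.

```python
import numpy as np
import pandas as pd

# Toy stand-in for game_plays; every value here is invented for illustration.
toy_plays = pd.DataFrame({
    "event": ["Shot", "Shot", "Goal", "Period Start", "Period Start"],
    "x": [60.0, -72.0, 80.0, np.nan, np.nan],
    "secondaryType": ["Wrist Shot", "Slap Shot", "Snap Shot", None, None],
})

# Share of missing values per column, split by event type.
# A share of 1.0 means the column is never recorded for that event.
missing_share = (
    toy_plays[["x", "secondaryType"]]
    .isna()
    .groupby(toy_plays["event"])
    .mean()
)
print(missing_share)
```

If a column is missing by design, its share should sit at or near 1.0 for the affected events and near 0.0 elsewhere.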

In [None]:
game_plays['event'].value_counts()

Example of an event type without certain data by design:

In [None]:
game_plays[game_plays['event'] == 'Shootout Complete']

In [None]:
duplicate_plays = game_plays.groupby(['play_id'])['game_id'].count() > 1
duplicate_plays.value_counts()

There is some duplication based on the play_id. We should attempt to drop duplicates with the same play_id, in the same game, happening at the same exact time.
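
As a minimal sketch of this key-based approach, the toy example below (invented rows, with `toy` and `key` as illustrative names) lists every member of a duplicated group with `duplicated(keep=False)` before dropping the extras.

```python
import pandas as pd

# Toy rows: the first two share the same play_id, game_id and dateTime.
toy = pd.DataFrame({
    "play_id": ["g1_1", "g1_1", "g1_2", "g2_1"],
    "game_id": [1, 1, 1, 2],
    "dateTime": ["t0", "t0", "t1", "t0"],
    "event": ["Shot", "Shot", "Goal", "Shot"],
})
key = ["play_id", "game_id", "dateTime"]

# keep=False flags every row in a duplicated group, which makes manual review easier
dupes = toy[toy.duplicated(subset=key, keep=False)]

# keep="first" retains one row per key combination
deduped = toy.drop_duplicates(subset=key, keep="first")
print(len(dupes), len(deduped))
```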

In [None]:
# remove rows where primary key columns are duplicated, keep the first
subset_columns = ['play_id', 'game_id', 'dateTime']
game_plays = game_plays.drop_duplicates(subset=subset_columns, keep="first")

In [None]:
# inspect to verify de-duplication
duplicate_plays = game_plays.groupby(subset_columns)['game_id'].count() > 1
duplicate_plays.value_counts()

With no more duplicates, we can take a look at the data we're most interested in: shots and goals.

In [None]:
shots = game_plays[game_plays['event'] == "Shot"]
shots.head()

In [None]:
shots['secondaryType'].value_counts()


In [None]:
shots['secondaryType'].isna().sum()

There are very few shots with a missing shot type, so we can fill them with the mode, 'Wrist Shot'.
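
Rather than hard-coding the fill value, the mode can also be computed from the column. This is a sketch on an invented Series (`toy_types` is hypothetical), not the cell we ran.

```python
import numpy as np
import pandas as pd

# Toy shot-type column with one missing value.
toy_types = pd.Series(["Wrist Shot", "Slap Shot", "Wrist Shot", np.nan, "Snap Shot"])

# mode() ignores NaN by default; take the first entry in case of ties
most_common = toy_types.mode().iloc[0]
filled = toy_types.fillna(most_common)
print(most_common, filled.isna().sum())
```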

In [None]:
# fill missing secondaryType with 'Wrist Shot' (the mode of this categorical column)
shot_mask = game_plays['event'] == 'Shot'
game_plays.loc[shot_mask, 'secondaryType'] = game_plays.loc[shot_mask, 'secondaryType'].fillna('Wrist Shot')

In [None]:
# confirm NaN rows are filled
game_plays[shot_mask]['secondaryType'].isna().sum()

With the missing values for secondaryType populated, this data is ready to go!

In [None]:
# save to pickle, for easy use in other notebooks
game_plays.to_pickle(PICKLE_PATH / "game_plays")

#### Game Data
This table gives data about the game itself. Each game has an ID, the season it was played in, the teams that participated, the score and outcome, and venue data

In [None]:
games = pd.read_csv(DATA_PATH / "game.csv")
games.info()

In [None]:
games.shape

In [None]:
games['season'].unique()

We have no null data in the games set, so let's check for duplicates.

In [None]:
# look for duplication in primary keys by count
duplicate_games = games.groupby(['game_id'])['season'].count() > 1
duplicate_games.value_counts()

It looks like there is some duplication. We can say for sure that no two games should have the same id, season, venue, and teams involved. If all those factors are the same, the game row is certainly a duplicate.

In [None]:
# remove rows where there is duplication in primary keys
subset_columns = ['game_id', 'season', 'venue', 'away_team_id', 'home_team_id']
games = games.drop_duplicates(subset=subset_columns, keep="first")

In [None]:
games.shape

We have gotten rid of our duplication, so this data is ready to go to the pickle directory.

In [None]:
# Save to pickle for ease of use in other notebooks
# (PICKLE_PATH is a pathlib.Path, so join with "/" rather than "+")
games.to_pickle(PICKLE_PATH / "games")

#### Game Teams stats
This table gives aggregate statistics and other team data for each game. Goals, shots, hits, penalties, faceoff win %, etc.

In [None]:
game_teams = pd.read_csv(DATA_PATH / "game_teams_stats.csv")
game_teams.info()

In [None]:
game_teams.shape

There is some null data in faceoff win %, giveaways, takeaways, and hits. Hits were not always a tracked stat and only began to be measured at some point during the timeline of this data, so we expect some missing values. For the others, the data isn't in areas we anticipate investigating heavily, so we will forgo imputing any values for now.

Let's move on to inspecting for duplication.

In [None]:
# look for duplication in primary keys by count
team_duplicates = game_teams.groupby(['game_id', 'team_id'])['won'].count() > 1
team_duplicates.value_counts()

No game should have more than one entry for the same team id. For each game there should be exactly two entries, one for each team.
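
The "two entries per game" expectation can be turned into an explicit check. The sketch below runs on invented rows (`toy_teams` is a stand-in for game_teams), where one game carries a repeated team row.

```python
import pandas as pd

# Toy stand-in for game_teams: game 2 has a repeated row for team 30.
toy_teams = pd.DataFrame({
    "game_id": [1, 1, 2, 2, 2],
    "team_id": [10, 20, 10, 30, 30],
})

deduped = toy_teams.drop_duplicates(subset=["game_id", "team_id"], keep="first")

# After de-duplication each game should have exactly two rows, one per team
rows_per_game = deduped.groupby("game_id").size()
print(rows_per_game)
```

Any game with a count other than two would point to a missing team row or a duplicate that the key did not catch.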

In [None]:
# remove rows where there is duplication in primary keys
subset_columns = ['game_id', 'team_id']
game_teams = game_teams.drop_duplicates(subset=subset_columns, keep="first")

In [None]:
game_teams.shape

The duplicates are dropped, and the resulting row count is mathematically consistent with our games set (two team rows per game).

In [None]:
game_teams.to_pickle(PICKLE_PATH / "game_teams")

#### Game Plays Players
This table is an intermediate mapping table for matching game_plays to player_info. It contains a play_id, game_id, and player_id along with playerType

In [None]:
game_plays_players = pd.read_csv(DATA_PATH / "game_plays_players.csv")
game_plays_players.info()

Let's inspect what information is stored in the playerType field

In [None]:
game_plays_players['playerType'].value_counts()

It looks like the playerType is informative to the type of event happening in the play. Let's check for duplicate values.

In [None]:
# look for duplication in primary keys by count
duplicate_play_players = game_plays_players.groupby(['play_id', 'game_id', 'player_id'])['playerType'].count() > 1
duplicate_play_players.value_counts()

In [None]:
# remove duplicate rows
subset_columns = ['play_id', 'game_id', 'player_id']
game_plays_players = game_plays_players.drop_duplicates(subset=subset_columns, keep="first")

In [None]:
# confirm de-duplication
game_plays_players.shape

In [None]:
# save to pickle directory
game_plays_players.to_pickle(PICKLE_PATH / 'game_plays_players')

#### Player Info 
This table contains biographical information about each player, including first and last name, nationality, birth city, position, birthday.

In [None]:
player_info = pd.read_csv(DATA_PATH / "player_info.csv")
player_info.info()

There are no null values in the fields of interest for our group. First name, last name, birthday (for age), and position are the only data points we plan to use.

In [None]:
duplicate_players = player_info.groupby(['player_id', 'firstName', 'lastName'])['nationality'].count() > 1
duplicate_players.value_counts()

There are no duplicate players in the set.

### Added feature: allStarSeasons

We are going to add a feature for each player that contains the years in which they participated in the All-Star Game. This can be used to differentiate top-quality players at their position from others. Because a player's skill level can vary across different phases of their career, we will store the years rather than a binary allStar feature.
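
To illustrate why a list of seasons is more flexible than a single flag, here is a minimal sketch with invented players. The helper `was_all_star` and the integer season values are assumptions for illustration only; the real season format depends on the all star data file.

```python
import numpy as np
import pandas as pd

# Toy players: player 1 has two All-Star seasons, player 2 has none (NaN).
toy_players = pd.DataFrame({
    "player_id": [1, 2],
    "allStarSeasons": [[2015, 2017], np.nan],
})

def was_all_star(seasons, season):
    """True if `season` is in the player's list; NaN (no selections) counts as False."""
    return isinstance(seasons, list) and season in seasons

# Season-specific flags can be derived on demand from the list column
toy_players["allStar2017"] = toy_players["allStarSeasons"].apply(was_all_star, season=2017)
toy_players["allStar2016"] = toy_players["allStarSeasons"].apply(was_all_star, season=2016)
print(toy_players)
```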

First, we read in the all star data, containing season and player names for the all star team rosters

In [None]:
all_stars = pd.read_csv(DATA_PATH / 'all_star_data.csv')
all_stars.shape

We need a column to merge on between the all star data and the player_info table. We can use full name (first + last).

In [None]:
player_info['fullName'] = player_info['firstName'] + ' ' + player_info['lastName']

We can merge on the names, then inspect for any rows where we do not find a match. These will be due to non-matching names between the datasets.

In [None]:
merged = pd.merge(all_stars, player_info, left_on='Player', right_on='fullName', how='left')
merged[merged['player_id'].isna()]

We iterated until there were no more missing rows. Most were conflicting spellings, use of nicknames, or use of special characters in names. We corrected these by editing the all star data csv to conform to the data in the player_info table
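
We fixed the mismatches by hand, but some of them (accents, casing, stray whitespace) could plausibly be handled in code instead. The sketch below is an alternative under that assumption, not the procedure we followed; `normalize_name`, `name_key`, and the sample names are all invented for illustration.

```python
import unicodedata
import pandas as pd

def normalize_name(name):
    """Lowercase, trim, and strip accents so that accented and plain spellings match."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.strip().lower()

# Invented sample rows standing in for the all star data and player_info
toy_all_stars = pd.DataFrame({"Player": ["Tomáš Example", "Joe Sample"], "Season": [2011, 2012]})
toy_players = pd.DataFrame({"player_id": [1], "fullName": ["Tomas Example"]})

toy_all_stars["name_key"] = toy_all_stars["Player"].map(normalize_name)
toy_players["name_key"] = toy_players["fullName"].map(normalize_name)

# indicator=True adds a _merge column that flags rows with no match
merged_toy = toy_all_stars.merge(toy_players, on="name_key", how="left", indicator=True)
unmatched = merged_toy.loc[merged_toy["_merge"] == "left_only", "Player"].tolist()
print(unmatched)
```

Nicknames and genuinely different spellings would still need manual correction, so a review of the unmatched list remains necessary.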

Now we can merge the data to create the allStarSeasons column, which will contain a list of the seasons in which the player was on an all star team roster.

In [None]:
# build the allStarSeasons column (one list of seasons per player), then inspect it
player_seasons = merged.groupby('player_id')['Season'].apply(list).reset_index()
player_info = pd.merge(player_info, player_seasons, on='player_id', how='left').rename(columns={'Season': 'allStarSeasons'})
player_info

In [None]:
# save to pickle
player_info.to_pickle(PICKLE_PATH / "player_info")

### Baseline Analysis

Now that we have cleaned our data and removed duplicates, we can take a look at some basic features of the dataset and answer some basic questions about shots and goals. What types of shots are the most common? What types of shots are the most successful? Where on the ice do shots and goals typically occur?

In [None]:
# import needed analysis and visualization libraries
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Path where our pickle files are stored
PICKLE_PATH = Path("../pickled_data/")

# Read in the plays data
game_plays = pd.read_pickle(PICKLE_PATH / 'game_plays')

Next we inspected the different columns of our dataset. This was an iterative process, but we'll leave a few cells that highlight some of our features of interest.

In [None]:
game_plays.dtypes

In [None]:
game_plays['event'].value_counts()

In [None]:
game_plays[game_plays['event'] == 'Shot']['secondaryType'].unique()

#### Shots and Goals by Type

We can filter our data down to a DataFrame for shots (rows where the 'event' column is 'Shot') and a DataFrame for goals (rows where the 'event' column is 'Goal'). We include the 'secondaryType' column as it stores the shot type. Shot types are broken down into:

- Wrist shot
- Slap shot
- Snap shot
- Backhand 
- Deflected
- Tip In
- Wrap-around 

In [None]:
# Shots by shot type
shots = game_plays[game_plays['event'] == 'Shot']
shot_count = shots.groupby('secondaryType')['play_id'].count().sort_values(ascending=False)

In [None]:
# Goals by shot type
goals = game_plays[game_plays['event'] == 'Goal']
goal_count = goals.groupby('secondaryType')['play_id'].count().sort_values(ascending=False)

We use a stacked bar chart to show shot and goal counts by type.

In [None]:
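# create a stacked bar with shots and goals separated by type
shot_types = goal_count.index
# align both Series to the same shot-type order so each bar matches its label
# (the two counts are sorted independently, so their index order can differ)
counts = {
    "shots": shot_count.reindex(shot_types),
    "goals": goal_count
}

fig, ax = plt.subplots()
bottom = np.zeros(len(shot_types))
for name, count in counts.items():
    ax.bar(shot_types, count.to_numpy(), label=name, bottom=bottom)
    bottom += count.to_numpy()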
    
ax.set_title('Shots and Goals by Type')
ax.legend(loc="upper right")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()

![Shot and Goals by Type Stacked Bar Chart](../reports/figures/shot_goal_stacked_bar.png)

We see that the most common shot by far is the wrist shot, followed by snap and slap shots. This makes sense as these are the "easiest" shots to take. Backhand shots require space and some movement across the ice to be able to lift the puck, and players are typically more confident on the forehand. Tip-ins, deflections, and wrap-arounds require coordination and opportunity, whereas the more common types can be attempted at any time with an open lane. Since wrist shots are so much more common, this plot obscures how successful each shot type is.

We can get a better look at how successful shot types are at producing goals by introducing accuracy. Here we define accuracy as the number of goals scored from a particular shot type divided by the number of shots of that type taken. We then plot the accuracy in a bar chart where each bar represents a shot type.
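
The division relies on pandas aligning the two Series by shot type rather than by position. A small sketch with invented counts (`toy_shots` and `toy_goals` are not real totals) shows that the order of the inputs does not matter:

```python
import pandas as pd

# Invented counts, deliberately listed in different orders
toy_shots = pd.Series({"Wrist Shot": 100, "Slap Shot": 40, "Tip-In": 10})
toy_goals = pd.Series({"Tip-In": 2, "Wrist Shot": 10, "Slap Shot": 2})

# Series division matches entries by index label before dividing
toy_accuracy = (toy_goals / toy_shots).sort_values(ascending=False)
print(toy_accuracy)
```

In this toy data the rarest shot type has the highest accuracy, which is the kind of pattern raw counts alone would hide.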

In [None]:
# Plot accuracy (goals/shots) for shot types
accuracy = goal_count / shot_count
accuracy.sort_values(ascending=False, inplace=True)

fig, ax = plt.subplots()
ax.bar(accuracy.index, accuracy, color="purple")

ax.set_title("Accuracy by Shot Type")
plt.xticks(rotation=45)
plt.show()

![Accuracy by Shot Type Bar Chart](../reports/figures/shot_accuracy_bar.png)

We see that deflected shots and tip-ins are the most successful, but are some of the least attempted shots. Deflections are when a player in front of the net has the puck hit or "deflect" off their stick or body and then travel into the net. Tip-ins are when a player receives a pass or rebound and shoots it directly or "tips" it into the net. Since both these shot types require coordination between teammates, the opportunity may not arise as frequently. Similarly, many wrist, snap, or slap shots may be attempts at deflections that do not succeed as intended.

Another reason for the high frequency of less accurate shots is that they are "easier" to take. The time and space required for a snap shot or wrist shot is minimal compared to a backhand, where the player must extend their stick away from their body to hold the puck on the back of the blade, where they may have less control.

#### Correcting coordinates

Our (x,y) coordinates are provided on a plane where center ice is the origin (0, 0), the x coordinate is distance in feet along the length of the ice, and the y coordinate is distance in feet along the width of the ice. Play occurs on both "sides" of the plane, so plays that occur on the left side of the ice have negative x values, while those that occur on the right side have positive x values. Since our analysis and visualizations inspect play as a whole, and not the performance of individual teams, we can produce more focused distributions and visualizations by concentrating on one side of the ice.

To accomplish this, we transform or "correct" the x and y coordinates so that they all occupy one side. For the x coordinate, we simply take the absolute value, so that all plays appear to occur on the right side of the ice. For the y coordinate, we must "flip" the coordinate over the x axis to account for the shift in perspective on the left vs. the right. For example, on the left side of the ice a player facing goal has positive y coordinates to their right-hand side, and negative coordinates to their left-hand side. The right side of the ice is the exact opposite. We flip the coordinate by multiplying by the sign of the x coordinate.
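
A quick sanity check of this transformation on two invented points: one shot from the shooter's right on the left side of the ice, and one from the shooter's right on the right side. After correction both should land on the same spot.

```python
import numpy as np
import pandas as pd

# (-80, 10): left side of the ice, puck on the shooter's right
# ( 80, -10): right side of the ice, puck on the shooter's right
toy = pd.DataFrame({"x": [-80.0, 80.0], "y": [10.0, -10.0]})

toy["xC"] = toy["x"].abs()
toy["yC"] = toy["y"] * np.sign(toy["x"])
print(toy)
```

One edge case worth keeping in mind: `np.sign` returns 0 when x is exactly 0, so a play recorded on the center line gets a corrected y of 0.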

In [None]:
# Add count columns for counting goals and shots, to be used in density based visualizations later on.
game_plays['goal'] = (game_plays['event'] == "Goal").astype(int)
game_plays['shot'] = (game_plays['event'] == "Shot").astype(int)

# Correct the x and y coordinates to occupy only one side of the ice
game_plays['xC'] = np.abs(game_plays['x'])
game_plays['yC'] = game_plays['y'] * np.sign(game_plays['x'])

# filter for goals and missing coordinates
goals = game_plays.loc[
    ~game_plays["xC"].isna() &
    ~game_plays["yC"].isna() &
    (game_plays['event'].isin(["Goal"]))
]
# keep only columns we need for performance in drawing density plots
# .copy() gives an independent DataFrame so later column assignments don't warn
goals = goals[['xC', 'yC', 'goal', 'secondaryType']].copy()

# repeat for shots
# filter for shots and missing coordinates
shots = game_plays.loc[
    ~game_plays["xC"].isna() &
    ~game_plays["yC"].isna() &
    (game_plays['event'].isin(["Shot"]))
]
# keep only columns we need for performance in drawing density plots
# .copy() gives an independent DataFrame so later column assignments don't warn
shots = shots[['xC', 'yC', 'shot', 'secondaryType']].copy()

Since we are given the (x,y) coordinates of our events, we can also calculate the distance from the goal, and the angle of the shooter to the goal using some geometry.

The distance can be found simply using the Euclidean (Pythagorean) distance. We can approximate the location of the goal as the center of its opening at (89, 0). Then we can use the following formula to calculate the distance:

$$D = \sqrt {(x - 89)^2 + (y - 0)^2}$$ 

We can measure the angle of the shot as the angle of intersection with an imaginary line down the center of the ice. We can also think of this as the angle a goaltender would have to apply to be completely "square" to the shot. This angle can be calculated using the formula: 

$$\theta = \tan^{-1}\left(\frac{y}{89 - x}\right) \cdot \frac{180}{\pi}$$
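
As a worked example of both formulas, consider a shot taken from (72, 10). The helper name `shot_distance_and_angle` is ours, introduced only for this sketch.

```python
import math

GOAL_X = 89.0  # x coordinate of the center of the goal opening, per the text above

def shot_distance_and_angle(x, y):
    """Return (distance in feet, angle in degrees) for a shot from (x, y).

    Note: plain Python raises ZeroDivisionError when x equals GOAL_X.
    """
    dx = GOAL_X - x
    distance = math.hypot(dx, y)              # sqrt(dx^2 + y^2)
    angle = math.degrees(math.atan(y / dx))   # angle off the center line
    return distance, angle

distance, angle = shot_distance_and_angle(72, 10)
print(round(distance, 1), round(angle, 1))  # about 19.7 ft and 30.5 degrees
```

Here the shot is 17 ft short of the goal line and 10 ft off center, so the distance is $\sqrt{17^2 + 10^2} = \sqrt{389}$ and the angle is $\tan^{-1}(10/17)$.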

Distance and angle (theta) are illustrated through the plot below:

In [None]:
# Center of goal is (89, 0)
from hockey_rink import NHLRink # this package allows annotation of the rink 

fig, ax = plt.subplots(1, 1, figsize=(14, 8))
nhl = NHLRink()
nhl.draw(ax = ax, display_range = "ozone")

nhl.scatter(x=72, y=10, color='blue', s=150)
nhl.arrow(x=72, y=10, x2=89, y2=0, facecolor = "red", length_includes_head=True, head_width=1)
nhl.arrow(x=25, y=0, x2=110, y2=0, facecolor="black", head_width=None, linestyle='--', linewidth=0.1, alpha=0.8)
nhl.text(x=70, y=12, s="shot")
nhl.text(x=81, y=1, s="θ")
nhl.text(x=27, y=1, s="center line")
nhl.text(x=77, y=4, s="distance", rotation=-30)

![Illustration of Shot Distance and Angle](../reports/figures/shot_distance_angle_diagram.png)

Let's apply our distance and angle formulas to the shots DataFrame to produce a distance column and an angle column for each shot in the dataset.

In [None]:
# Calculate shot distance column (vectorized Euclidean distance to the goal at (89, 0))
shots['distance'] = np.sqrt((shots['xC'] - 89)**2 + shots['yC']**2)
# Calculate shot angle column, converted from radians to degrees
shots['angle'] = np.arctan(shots['yC'] / (89 - shots['xC'])) * (180 / np.pi)

#### Shot Distance and Angle

We then plot the distributions of shot distance alongside their Kernel Density Estimation. This shows us how distance from goal when a shot is taken is distributed.

In [None]:
sns.histplot(shots['distance'], bins=20, stat="density", label="Histogram")
sns.kdeplot(shots['distance'], label='KDE') 

plt.xlabel("Distance from Goal (ft)")
plt.ylabel("Frequency/Probability Density")
plt.title("Distribution of Shot Distance")
plt.legend()
plt.show()

![Distribution of Shot Distance](../reports/figures/shot_distance_histogram.png)

We see very few shots taken from in or near the crease (0-5 ft). Shots are most frequent 5-15 ft from goal, then fairly evenly distributed along the length of the ice until the frequency drops off steeply near the end of the offensive zone. Some shots are taken from outside the offensive zone. This makes sense: some of these will be shots at an empty net, while others will be players "dumping" the puck into the offensive zone, hoping that the goalie will freeze the puck or deflect it into a corner where their team can retake possession.

In [None]:
sns.histplot(shots['angle'], bins=20, stat="density", label="Histogram")
sns.kdeplot(shots['angle'], label='KDE') 

plt.xlabel("Shot Angle (degrees)")
plt.ylabel("Frequency/Probability Density")
plt.title("Distribution of Shot Angle from Center")
plt.legend()
plt.show()

![Shot Angle Distribution](../reports/figures/shot_angle_histogram.png)

Here, we see that shot angle has a fairly symmetrical distribution: shots are common from the middle of the ice, but most common at a 20-30 degree angle off center. This is common offensive positioning for the wingers, as well as for defensemen playing offense, who each stand slightly left or right of center. Another possible source of these angles is net-front plays, where one player passes across the crease to another in order to catch the goalie sliding from one side of the net to the other and thus have a larger area to shoot at.


#### Goal Distance and Angle

In [None]:
# Calculate goal distance column (vectorized Euclidean distance to the goal at (89, 0))
goals['distance'] = np.sqrt((goals['xC'] - 89)**2 + goals['yC']**2)
# Calculate goal angle column, converted from radians to degrees
goals['angle'] = np.arctan(goals['yC'] / (89 - goals['xC'])) * (180 / np.pi)

In [None]:
sns.histplot(goals['distance'], bins=20, stat="density", label="Histogram")
sns.kdeplot(goals['distance'], label='KDE') 

plt.xlabel("Distance from Goal (ft)")
plt.ylabel("Frequency/Probability Density")
plt.title("Distribution of Shot Distance for Goals")
plt.legend()
plt.show()

![Shot Distance Distribution for Goals](../reports/figures/goal_distance_histogram.png)

We see that goals occur much more frequently close to the net. We see the same low rate in tight to the goal crease (0-5 ft), but nearly all goals are scored from 5-20 ft before the frequency drops off steeply. The likelihood of scoring from past the faceoff circles is very low, though we still see a high frequency of shots from there. This, again, is likely due to the "ease" of the shots. Defensively, teams protect the middle of the ice, prioritizing stopping forwards from occupying the center of the ice and locations close to the net. This leaves the walls and the top of the zone, where defensemen on offense are positioned, more open. It also leaves farther shots "open" to be attempted, in the hope of generating deflections or rebounds for teammates to retrieve.


In [None]:
sns.histplot(goals['angle'], bins=20, stat="density", label="Histogram")
sns.kdeplot(goals['angle'], label='KDE') 

plt.xlabel("Shot Angle (degrees)")
plt.ylabel("Frequency/Probability Density")
plt.title("Distribution of Shot Angle for Goals")
plt.legend()
plt.show()

![Shot Angle Distribution for Goals](../reports/figures/shot_angle_goal_histogram.png)

We see that shot angle for goals is distributed much closer to center ice than for shots in general. Shots like deflections, tip-ins, backhands, and wrap-arounds will *always* occur from these shallow angles, and those have a high success rate. This also speaks to "play-driving" shots, or the mentality of "pucks to the net". Players shoot from all over to drive play, hoping for rebounds, tips, deflections or goalie mistakes. A shot is often just an effective way to get the puck near the goal, which is the most common area to score from. This may explain the marked difference we see in these distributions.


#### Shot Location Density Visualization

The distributions of distance and angle effectively explain shot patterns and scoring patterns. However, we seek a more accessible visualization that gives better context to the game we are discussing. In the following visualizations, we use `sportypy`, an open source visualization library for sports. The library provides methods to plot playing surfaces for different sports; in this case we are only interested in hockey. The library also wraps `matplotlib` functions for displaying visualizations overlaid on the playing surface. Let's experiment with different density visualizations, like `hexbin` and `heatmap`, to display where shots and goals occur most often on the ice.

##### Hexbins
We first show the density for shots and goals with a hexbin, where each bin is a hexagon with a defined binsize and all data occurring within the area of that hexagon contributes to the density for that area. High density is shown in red, low density is shown in blue, and middling density is shown on a gradient between the two.

In [None]:
from sportypy.surfaces.hockey import NHLRink

nhl = NHLRink()
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
nhl.draw(ax, display_range="ozone", rotation=270)
hex = nhl.hexbin(
    goals["xC"],
    goals["yC"],
    values = goals["goal"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    binsize = (8, 12),
    zorder=30,
    alpha=0.75
)

ax.set_title("Goal Location Density", fontsize=24)
plt.colorbar(hex, ax=ax)

![Goal Location Density Hexbin](../reports/figures/goal_location_hexbin.png)

In [None]:
nhl = NHLRink()
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
nhl.draw(ax, display_range="ozone", rotation=270)
hex = nhl.hexbin(
    shots["xC"],
    shots["yC"],
    values = shots["shot"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    binsize = (8, 12),
    zorder=5,
    alpha=0.75
)

ax.set_title("Shot Locations", fontsize=24)
plt.colorbar(hex, ax=ax)

![Shot Location Density Hexbin](../reports/figures/shot_location_hexbin.png)

##### Heatmaps

We will repeat the same density visualization approach with a heatmap.

In [None]:
nhl = NHLRink()
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
nhl.draw(ax, display_range="ozone", rotation=270)
heat = nhl.heatmap(
    shots["xC"],
    shots["yC"],
    values = shots["shot"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    plot_xlim=(25, 89),
    binsize=3, 
    alpha=0.85
)

ax.set_title("Shot Locations", fontsize=24)
plt.colorbar(heat, ax=ax)

![Shot Location Density Heatmap](../reports/figures/shot_location_heatmap.png)

In [None]:
nhl = NHLRink()
fig, ax = plt.subplots(1, 1, figsize=(14, 8))
nhl.draw(ax, display_range="ozone", rotation=270)
heat = nhl.heatmap(
    goals["xC"],
    goals["yC"],
    values = goals["goal"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    plot_xlim=(25, 89),
    binsize=3,
    alpha=0.85
)

ax.set_title("Goal Locations", fontsize=24)
plt.colorbar(heat, ax=ax)

![Goal Location Density Heatmap](../reports/figures/goal_location_heatmap.png)

Comparing the two density plot types, we prefer the heatmap. The heatmap provides higher definition with its smaller bin size and handles areas where data is sparse more consistently, whereas the hexbin may leave empty white spaces inconsistently along the boards or in other areas where there is little data. Although the hexbin is a bit more visually pleasing, that comes at the cost of detail.
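
The bin-size trade-off can be seen without any rink drawing. The sketch below uses `np.histogram2d` on synthetic, uniformly scattered points (not our shot data) as a simple square-bin analogue of a heatmap; the extents assume a half rink spanning x from 25 to 89 and y from -42.5 to 42.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (x, y) points; these are random, not real shot locations
x = rng.uniform(25, 89, size=200)
y = rng.uniform(-42.5, 42.5, size=200)
extent = [[25, 89], [-42.5, 42.5]]

# Same points, counted on a coarse grid and on a fine grid
coarse, _, _ = np.histogram2d(x, y, bins=4, range=extent)
fine, _, _ = np.histogram2d(x, y, bins=32, range=extent)

coarse_empty = int((coarse == 0).sum())
fine_empty = int((fine == 0).sum())
print(coarse_empty, fine_empty)
```

Smaller bins resolve more detail but leave more cells empty where data is thin, so the choice of bin size and how empty cells are drawn both shape how a density plot reads.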


##### Locations by Shot type

Let's repeat this visualization for each shot type in our shots and goals dataframes to see where each particular shot type occurs most, and where they are most often producing goals.

In [None]:
# Get shot types as a list to iterate
shot_types = sorted(list(shots['secondaryType'].unique()))

# prepare the visualization
nhl = NHLRink()
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(20, 12),  
                         gridspec_kw={'hspace': 0.4, 'wspace': 0.3})

# Only 7 subplots are needed, remove two from the bottom right
fig.delaxes(axes[2,2])
fig.delaxes(axes[2,1])
axes = axes.flatten()[:7]

for i, ax in enumerate(axes):
    # view the shot type we are focused on this loop, axes and shot_types are of same length so we can use i to access
    filtered_shots = shots[shots['secondaryType'] == shot_types[i]]
    
    # display offensive zone where goal is on bottom of image
    nhl.draw(ax, display_range="ozone", rotation=270) 
    
    # plot the heatmap
    heat = nhl.heatmap(
    filtered_shots["xC"],
    filtered_shots["yC"],
    values = filtered_shots["shot"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    plot_xlim=(25, 89),
    binsize=3, 
    alpha=0.85
    )
    plt.colorbar(heat, ax=ax) # add a colorbar
    ax.set_title(f"{shot_types[i]} Locations") # title the axes
    
fig.suptitle("Shot Location Density by Type", fontsize=24)
plt.show()

![Shot Location Density Heatmaps by Type](../reports/figures/shot_density_by_shot_type.png)

We can repeat the same technique for goals.

In [None]:
# prepare the visualization
nhl = NHLRink()
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(20, 12),  
                         gridspec_kw={'hspace': 0.4, 'wspace': 0.3})

# Only 7 subplots are needed, remove two from the bottom right
fig.delaxes(axes[2,2])
fig.delaxes(axes[2,1])
axes = axes.flatten()[:7]

for i, ax in enumerate(axes):
    # view the shot type we are focused on this loop, axes and shot_types are of same length so we can use i to access
    filtered_goals = goals[goals['secondaryType'] == shot_types[i]]
    
    # display offensive zone where goal is on bottom of image
    nhl.draw(ax, display_range="ozone", rotation=270) 
    
    # plot the heatmap
    heat = nhl.heatmap(
    filtered_goals["xC"],
    filtered_goals["yC"],
    values = filtered_goals["goal"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    plot_xlim=(25, 89),
    binsize=3, 
    alpha=0.85
    )
    plt.colorbar(heat, ax=ax) # add a colorbar
    ax.set_title(f"{shot_types[i]} Locations") # title the axes
    
fig.suptitle("Goal Location Density by Shot Type", fontsize=24)
plt.show()

![Goal Location Density Heatmap by Shot Type](../reports/figures/goal_density_by_shot_type.png)

Based on the density plots and histograms above, combined with our domain knowledge of the sport, we observe just about what we expect. Shots are spread more evenly throughout the ice, whereas goals are concentrated close to the net and in the center of the ice. By shot type, shots come from where we expect: slap shots are most frequently taken by defensemen on offense near the blue line, or "the point". Wrist shots are concentrated in close and between the faceoff circles, or "the slot". Snap shots occur all over, as they are a versatile hybrid shot. The other shot types only occur near the front of the net. Compared to the distribution of shots, goals sit much nearer the net and the center of the ice, as we observed.
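One way to put numbers on these visual impressions is to summarize distance from the net by shot type. The sketch below uses a tiny made-up table whose column names mirror our dataframes; the net position (x = 89 ft, y = 0) is an assumption taken from the rink dimensions, and none of the values are real results.

```python
import numpy as np
import pandas as pd

# Toy events; column names mirror the notebook, the values are invented
events = pd.DataFrame({
    "secondaryType": ["Slap Shot", "Slap Shot", "Wrist Shot", "Wrist Shot", "Tip-In", "Tip-In"],
    "xC": [35.0, 40.0, 65.0, 75.0, 84.0, 86.0],
    "yC": [20.0, -15.0, 5.0, -8.0, 1.0, -2.0],
    "goal": [0, 0, 0, 1, 1, 0],
})

# Distance to the net, assuming the goal line is at x = 89 ft and the net is centred at y = 0
events["distance"] = np.hypot(89 - events["xC"], events["yC"])

# Attempts, goals and median distance for each shot type
summary = events.groupby("secondaryType").agg(
    attempts=("goal", "size"),
    goals=("goal", "sum"),
    median_distance=("distance", "median"),
)
print(summary.sort_values("median_distance"))
```

Applied to the real shots dataframe, a table like this would sit alongside the heatmaps rather than replace them.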

### All Stars

This section will focus on the shot/play comparison of All-Star calibre players to league-average players.

For the comparison of all-star scoring against league-average scoring, we use all-star data acquired from Hockey Reference that names all the all-stars from 2011-2025. From this data we find the 10 most frequently selected all-stars and feature them in our analysis. The code below cleans the initial all-star data and filters it so that we can merge it with our player_info table to acquire the player_ids of our featured players.

In [None]:
player_data = pd.read_pickle("pickled_data/player_info")
plays_data = pd.read_pickle("pickled_data/game_plays")
plays_players_data = pd.read_pickle("pickled_data/game_plays_players")
player_info = pd.read_pickle("pickled_data/player_info")

allstar_data = pd.read_csv("data/allstars.csv")

allstar_data = allstar_data.iloc[:, [0]]
allstar_data.columns = ['fullName']

allstar_data[['firstName', 'lastName']] = allstar_data['fullName'].str.rsplit(n=1, expand=True)

# Take the 11 most frequent names: one of them is a goalie, which leaves 10 skaters
top_10_players = allstar_data['fullName'].value_counts().head(11).index.tolist()

filtered_allstar_df = allstar_data[allstar_data['fullName'].isin(top_10_players)]

filtered_allstar_df = filtered_allstar_df.drop_duplicates(subset='fullName')

display(filtered_allstar_df)

merged_allstar = filtered_allstar_df.merge(player_info, on='fullName', how='left')

display(merged_allstar)

The code above gives us a list of 11 players that we will focus on for our analysis (we took 11 rather than 10 because one of the players is a goalie).

We have merged the filtered all-stars dataframe with the player info dataframe, acquiring the player_ids needed to look up the game plays.
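Because the merge is keyed on names, it can be worth checking that every featured player actually found a match. The following is a small sketch on invented data, not our real tables; it uses the `indicator=True` option of `DataFrame.merge` to flag names with no matching row.

```python
import pandas as pd

# Toy stand-ins for allstars.csv and player_info; names and ids are invented
allstar_data = pd.DataFrame({"fullName": ["A One", "B Two", "A One", "C Three", "B Two", "A One"]})
player_info = pd.DataFrame({"player_id": [1, 2], "fullName": ["A One", "B Two"]})

# Count appearances and keep the three most frequent names
top_names = allstar_data["fullName"].value_counts().head(3).index.tolist()
featured = pd.DataFrame({"fullName": top_names})

# indicator=True adds a _merge column showing which names found a match
merged = featured.merge(player_info, on="fullName", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "fullName"].tolist()
print(merged)
print("Unmatched names:", unmatched)
```

Any name listed as unmatched would end up with a missing player_id, for example because of a spelling difference between the two sources.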

We can now filter the game plays for each player to build their individual shooting visualizations.

In [None]:
# Map each featured player's last name to their player_id (replaces the exec-generated variables)
allstar_ids = dict(zip(merged_allstar['lastName_x'], merged_allstar['player_id']))

def plays_for(last_name):
    '''Return every play involving the given player, joined with the play details.'''
    player_plays = plays_players_data[plays_players_data['player_id'] == allstar_ids[last_name]]
    return pd.merge(player_plays, plays_data, on='play_id')

ovie2 = plays_for('Ovechkin')
kane2 = plays_for('Kane')
weber2 = plays_for('Weber')
mackinnon2 = plays_for('MacKinnon')
giroux2 = plays_for('Giroux')
letang2 = plays_for('Letang')
karlsson2 = plays_for('Karlsson')
stamkos2 = plays_for('Stamkos')
tavares2 = plays_for('Tavares')
burns2 = plays_for('Burns')



Above, we looked up each all-star's player_id, pulled all the plays recorded under that id, and kept them separated by player. This keeps the visualizations readable, as a single graph with all ten players would be too crowded.

Now, we have to filter the data further by looking at shooting and scoring, as we are primarily focused on the offensive side of play.

In [None]:
ovie2 = ovie2[(ovie2['playerType'] == 'Shooter') | (ovie2['playerType'] == 'Scorer')]

letang2 = letang2[(letang2['playerType'] == 'Shooter') | (letang2['playerType'] == 'Scorer')]

kane2 = kane2[(kane2['playerType'] == 'Shooter') | (kane2['playerType'] == 'Scorer')]

weber2 = weber2[(weber2['playerType'] == 'Shooter') | (weber2['playerType'] == 'Scorer')]

mackinnon2 = mackinnon2[(mackinnon2['playerType'] == 'Shooter') | (mackinnon2['playerType'] == 'Scorer')]

tavares2 = tavares2[(tavares2['playerType'] == 'Shooter') | (tavares2['playerType'] == 'Scorer')]

giroux2 = giroux2[(giroux2['playerType'] == 'Shooter') | (giroux2['playerType'] == 'Scorer')]

burns2 = burns2[(burns2['playerType'] == 'Shooter') | (burns2['playerType'] == 'Scorer')]

stamkos2 = stamkos2[(stamkos2['playerType'] == 'Shooter') | (stamkos2['playerType'] == 'Scorer')]

karlsson2 = karlsson2[(karlsson2['playerType'] == 'Shooter') | (karlsson2['playerType'] == 'Scorer')]

ovie3 = ovie2.drop_duplicates()

letang3 = letang2.drop_duplicates()

kane3 = kane2.drop_duplicates()

weber3 = weber2.drop_duplicates()

mackinnon3 = mackinnon2.drop_duplicates()

tavares3 = tavares2.drop_duplicates()

giroux3 = giroux2.drop_duplicates()

burns3 = burns2.drop_duplicates()

stamkos3 = stamkos2.drop_duplicates()

karlsson3 = karlsson2.drop_duplicates()

ovieshots = ovie3[ovie3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

letangshots = letang3[letang3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

kaneshots = kane3[kane3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

webershots = weber3[weber3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

mackinnonshots = mackinnon3[mackinnon3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

tavaresshots = tavares3[tavares3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

girouxshots = giroux3[giroux3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

burnsshots = burns3[burns3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

stamkosshots = stamkos3[stamkos3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

karlssonshots = karlsson3[karlsson3['event'].isin(['Shot', 'Goal'])][['event', 'secondaryType', 'st_x', 'st_y', 'period']]

We now have the shot data for each of the top 10 players, including the type of shot and where on the ice they were when they took the shot or scored the goal. We can use this data to visualize their shot attempts in scatter plots.
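The per-player steps above (keep shooter/scorer rows, join the play details, drop duplicates, select the shot columns) follow the same pattern for every player, so they can also be expressed as one helper applied in a loop. This is only a sketch on toy data; `shots_for`, the `player_ids` mapping and the sample values are hypothetical and not part of our notebook.

```python
import pandas as pd

# Toy versions of the two play tables; values are invented for illustration
plays_players = pd.DataFrame({
    "play_id": ["p1", "p2", "p3", "p4"],
    "player_id": [10, 10, 20, 20],
    "playerType": ["Shooter", "Scorer", "Assist", "Shooter"],
})
plays = pd.DataFrame({
    "play_id": ["p1", "p2", "p3", "p4"],
    "event": ["Shot", "Goal", "Goal", "Shot"],
    "secondaryType": ["Wrist Shot", "Slap Shot", "Wrist Shot", "Snap Shot"],
    "st_x": [60, 70, 80, 55],
    "st_y": [5, -10, 0, 12],
    "period": [1, 2, 3, 1],
})
player_ids = {"PlayerA": 10, "PlayerB": 20}  # hypothetical name-to-id mapping

def shots_for(player_id):
    """Shots and goals where the player was the shooter or the scorer."""
    own = plays_players[
        (plays_players["player_id"] == player_id)
        & plays_players["playerType"].isin(["Shooter", "Scorer"])
    ]
    merged = own.merge(plays, on="play_id").drop_duplicates()
    cols = ["event", "secondaryType", "st_x", "st_y", "period"]
    return merged.loc[merged["event"].isin(["Shot", "Goal"]), cols]

# One dataframe per player, keyed by name
shots_by_player = {name: shots_for(pid) for name, pid in player_ids.items()}
print({name: len(df) for name, df in shots_by_player.items()})
```

A dictionary keyed by player name has the same shape as the `players` dictionary we build for plotting.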

Below is the code to draw an ice rink using Matplotlib so that we can easily visualize the goal and shot data.

In [None]:
from matplotlib.patches import Arc, Wedge
def draw_rink(ax):
    '''
    Displays a hockey rink with NHL dimensions at current axes
    Parameter ax: Current axes
    Returns nothing
    '''

    # Draw the center ice line
    ax.axvline(0, color='red', linestyle='-', linewidth=2, alpha=0.5)
    
    # Draw the blue lines at +/- 25 feet from center ice
    ax.axvline(-25, color='blue', linestyle='--', linewidth=2, alpha=0.5)
    ax.axvline(25, color='blue', linestyle='--', linewidth=2, alpha=0.5)

    # Draw blue faceoff circle at center ice (15 ft radius) with blue dot at the center
    center_circle = plt.Circle((0, 0), 15, edgecolor='blue', facecolor='none', alpha = 0.5, lw=2)
    ax.add_patch(center_circle)
    ax.plot(0, 0, marker='o', color='blue', alpha = 0.5, markersize=6)  
    
    # Draw the 4 red faceoff circles with red dots at center (31 feet from end boards and 20.5 feet either side of the rink's center line)
    faceoff_positions = [(69, 20.5), (-69, 20.5), (69, -20.5), (-69, -20.5)]
    for x, y in faceoff_positions:
        faceoff_circle = plt.Circle((x, y), 15, edgecolor='red', facecolor='none', alpha = 0.5, lw=2)
        ax.add_patch(faceoff_circle)
        ax.plot(x, y, marker='o', color='red', alpha = 0.5, markersize=6)  
    
    # Draw goal lines for net (6 ft)
    ax.plot([-89, -89], [-3, 3], color='red', lw=1)  # Left goal line
    ax.plot([89, 89], [-3, 3], color='red', lw=1)    # Right goal line

    # Draw back of goals as arcs (6 ft wide and 4 ft deep)
    left_goal= Arc((-89, 0), width=6, height=8, angle=90, theta1=360, theta2=180, color='red', lw=1)
    right_goal = Arc((89, 0), width=6, height=8, angle=90, theta1=180, theta2=360, color='red', lw=1)

    ax.add_patch(left_goal)
    ax.add_patch(right_goal)

    # Draw the goal creases using Wedges and fill with low opacity (6 ft radius)
    # facecolor is used instead of color so that the blue edgecolor is not overridden
    left_goal_crease = Wedge((-89, 0), r=6, theta1=270, theta2=90, facecolor='skyblue', alpha=0.2, edgecolor='blue', lw=2)
    right_goal_crease = Wedge((89, 0), r=6, theta1=90, theta2=270, facecolor='skyblue', alpha=0.2, edgecolor='blue', lw=2)

    ax.add_patch(left_goal_crease)
    ax.add_patch(right_goal_crease)

    # Set the rink bounds (200 ft by 85 ft)
    ax.set_xlim(-100, 100)        
    ax.set_ylim(-42.5, 42.5)

    # Treat x and y units equally so circles are drawn correctly
    ax.set_aspect('equal')

In [None]:
players = {
    "Letang": letangshots, "Kane": kaneshots, "Weber": webershots, "MacKinnon": mackinnonshots,
    "Tavares": tavaresshots, "Giroux": girouxshots, "Burns": burnsshots, "Stamkos": stamkosshots, "Karlsson": karlssonshots, "Ovechkin": ovieshots
}

fig, axes = plt.subplots(3, 4, figsize=(24,16))

axes = axes.flatten()

shot_colors = {
    "Wrist Shot": "blue",
    "Slap Shot": "red",
    "Snap Shot": "green",
    "Backhand": "purple",
    "Tip-In": "orange",
    "Deflected": "cyan",
    "Wrap-around": "pink",
    "Unknown": "gray"
}

for i, (player, df) in enumerate(players.items()):
    ax = axes[i]
    filtered_df = df[df['event'].isin(['Goal'])]
    draw_rink(ax)
    unique_shot_types = filtered_df['secondaryType'].unique() 

    for shot_type in unique_shot_types:
        subset = filtered_df[filtered_df['secondaryType'] == shot_type]
        color = shot_colors.get(shot_type, "black")  # Default to black if unknown shot type
        ax.scatter(subset['st_x'], subset['st_y'], label=shot_type, s=80, alpha=0.7, c=color, edgecolors="black")

    ax.set_title(f"{player} Goal Locations", fontsize=18)
    ax.set_xlabel("X Coordinate", fontsize=14)
    ax.set_ylabel("Y Coordinate", fontsize=14)
    ax.legend(fontsize=12, loc="upper left")
    ax.set_xlim(filtered_df['st_x'].min() - 5, filtered_df['st_x'].max() + 5)
    ax.set_ylim(filtered_df['st_y'].min() - 5, filtered_df['st_y'].max() + 5)

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])


![All Star Goal Location Scatter by Player, Shot type](../reports/figures/all_star_goal_locations.png)

Looking at the goal graphs, we can infer that each player tended to favor certain types of shots for their goals, mainly specializing in one or two signature shots. For example, the Shea Weber graph shows that he favored slap shots over any other type of shot for the majority of the goals he scored. The locations are also interesting, as some players tended to stick to one location for their goals while others scored from all over the place. Next, we will look more closely at the goal locations for each player using a heatmap.

In [None]:
fig, axes = plt.subplots(4, 3, figsize=(24,16))

axes = axes.flatten()

for i, (player, df) in enumerate(players.items()):

    ax = axes[i]

    draw_rink(ax)

    filtered_df = df[df['event'].isin(['Goal'])]

    sns.kdeplot(
        x ='st_x', y = 'st_y', data=filtered_df, thresh=0.2, fill=True, cmap='coolwarm', ax=ax)
    
    ax.set_title(f"{player} Goal Heatmap", fontsize=22)  # Bigger title font; only goals are plotted here
    ax.set_xlabel("X Coordinate", fontsize=18)  # Bigger axis labels
    ax.set_ylabel("Y Coordinate", fontsize=18)

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])


plt.show()

![Shot + Goal Heatmap Contours for All Stars](../reports/figures/all_star_shot_goal_heatmap.png)

The heatmaps above show how all-stars tend to favor certain locations on the ice for their goals. For example, Alexander Ovechkin primarily scored from the left circle, and Shea Weber also scored most of his goals from the bottom of the left circle, in two primary locations.

In [None]:
players = {
    "Letang": letangshots, "Kane": kaneshots, "Weber": webershots, "MacKinnon": mackinnonshots,
    "Tavares": tavaresshots, "Giroux": girouxshots, "Burns": burnsshots, "Stamkos": stamkosshots, "Karlsson": karlssonshots, "Ovechkin": ovieshots
}



for player, df in players.items():
    most_common_shot = df['secondaryType'].value_counts().idxmax()
    shot_count = df['secondaryType'].value_counts().max()
    print(f"{player}: {most_common_shot} ({shot_count} times)")

We see that every sampled all-star uses the wrist shot the most. One reason is that the wrist shot is already the most common shot taken by all players. 

Another important comparison metric is the distribution of goals scored over the course of a game. We separate the goal totals by period to see whether all-stars are showing up at the right times to win games. Below, we use a bar graph to show the goal-scoring distribution by period.

In [None]:
goal_data = pd.concat(players.values(), ignore_index=True)

goal_data = goal_data[goal_data['event'] == 'Goal']

goal_counts = goal_data['period'].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(12,6))
goal_counts.plot(kind='bar', color='steelblue', ax=ax)

ax.set_title('Goal Scoring By Period for All-Star Players', fontsize=16)
ax.set_ylabel('Total Goals')
ax.set_xlabel('Period')
plt.show()

![Goal Scoring by Period for All Stars](../reports/figures/all_star_goal_scoring_by_period.png)

The bar graph above shows that all-stars score the most goals in the second and third periods. This matches what we would initially expect, as above-average players see more ice time in the later periods of the game, especially if their team is losing. We should also look at shots taken by period, as this is correlated with the all-stars' playing time in each period.

In [None]:
shot_data = pd.concat(players.values(), ignore_index=True)

shot_data = shot_data[shot_data['event'] == 'Shot']

shot_counts = shot_data['period'].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(12,6))
shot_counts.plot(kind='bar', color='steelblue', ax=ax)

ax.set_title('Shots Taken By Period for All-Star Players', fontsize=16)
ax.set_ylabel('Total Shots')
ax.set_xlabel('Period')
plt.show()

![Shots Taken by Period for All Stars](../reports/figures/all_star_shots_taken_by_period.png)
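Raw totals per period depend on how many players and games are in each group, so when two groups of different size are compared it can help to convert the counts into shares of each group's total. Here is a minimal sketch with invented counts; the numbers are not taken from our data.

```python
import pandas as pd

# Invented goal counts per period for two groups of very different size
all_star_goals = pd.Series({1: 300, 2: 350, 3: 340, 4: 10}, name="all_star")
league_avg_goals = pd.Series({1: 40, 2: 35, 3: 30}, name="league_average")

# Divide by each group's total so the values are comparable across groups;
# a period with no goals in one group becomes a share of 0
shares = pd.concat(
    [all_star_goals / all_star_goals.sum(), league_avg_goals / league_avg_goals.sum()],
    axis=1,
).fillna(0)
print(shares.round(3))
```

The same table can be passed to `shares.plot(kind='bar')` to draw both groups side by side.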

Finally, we want to analyze the shot-to-goal ratio for all-stars. This ratio tells us how many goals are scored relative to shots taken, an important metric for evaluating all-star performance. It is also important to see the total number of shots from our all-star selection, as this plays a big role in the quality of the shots that actually turn into goals.

In [None]:
shot_goal_data = pd.concat(players.values(), ignore_index=True)

total_shots = shot_goal_data[shot_goal_data['event'] == 'Shot'].shape[0]
total_goals = shot_goal_data[shot_goal_data['event'] == 'Goal'].shape[0]

shot_goal_ratio = total_goals / total_shots

display(round(shot_goal_ratio,3) * 100)
print("Total Shots:", total_shots)

`15.2`
`Total Shots: 21943`

This result is interesting because the all-stars actually have a lower shot-to-goal ratio than the league-average players. One possible reason is that the all-stars' total shot count is much higher than that of the sampled league-average players, so the smaller sample is more prone to bias.
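The ratio above divides goals by the rows whose event is 'Shot', which, as the filters in this section imply, do not include the goals themselves. A shooting percentage in the usual sense divides goals by all shots on net, goals included. The sketch below computes both on toy data so the difference is visible; `scoring_rates` is a hypothetical helper, not part of our notebook.

```python
import pandas as pd

def scoring_rates(events: pd.Series) -> dict:
    """Return both ratio definitions for a Series of 'Shot' / 'Goal' event labels."""
    goals = int((events == "Goal").sum())
    non_goal_shots = int((events == "Shot").sum())
    attempts = goals + non_goal_shots
    return {
        # Definition used in the cell above: goals divided by non-goal shots
        "goals_per_non_goal_shot": goals / non_goal_shots if non_goal_shots else float("nan"),
        # Goals divided by all shots on net (goals included)
        "shooting_pct": goals / attempts if attempts else float("nan"),
    }

# Toy sample: 85 non-goal shots and 15 goals
toy = pd.Series(["Shot"] * 85 + ["Goal"] * 15)
rates = scoring_rates(toy)
print(rates)
```

Whichever definition is chosen, it needs to be the same for both groups for the comparison to be meaningful.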

The last visualization we want to produce is a K-Means clustering of goal positions on the ice, showing where the key scoring areas are compared with the league average. We use the sklearn library to cluster our goals.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import pandas as pd

num_clusters = 3  #We want to see the top 3 cluster areas for goals
all_star_shot_data = pd.concat(players.values(), ignore_index=True)
all_star_shot_locations = all_star_shot_data[all_star_shot_data['event'].isin(['Goal'])][['st_x', 'st_y']].dropna().copy()
# Apply K-Means Clustering
kmeans_all_star = KMeans(n_clusters=num_clusters, random_state=1) # Keep random_state fixed for reproducibility
kmeans_all_star.fit(all_star_shot_locations)
# Assign cluster labels back to the original DataFrame (only to non-null rows)
all_star_shot_data.loc[all_star_shot_locations.index, 'Cluster'] = kmeans_all_star.labels_

# Create visual for All-Star shot clustering

fig, ax = plt.subplots(figsize=(12, 8))
draw_rink(ax)

# Scatter plot of All-Star shot clusters
sns.scatterplot(x=all_star_shot_data['st_x'], y=all_star_shot_data['st_y'], 
                hue=all_star_shot_data['Cluster'], palette='viridis', ax=ax)

# Mark cluster centroids
ax.scatter(kmeans_all_star.cluster_centers_[:, 0], kmeans_all_star.cluster_centers_[:, 1], 
           color='red', marker='X', s=200, label='Centroids')

# Set plot title and labels
ax.set_title("Goal Clustering for All-Star Player Sample", fontsize=18)
ax.set_xlabel("X Coordinate", fontsize=14)
ax.set_ylabel("Y Coordinate", fontsize=14)
ax.legend()

# Show plot
plt.tight_layout()
plt.show()

![Goal Clustering for All Star Players](../reports/figures/all_star_goal_clustering.png)

The clustering visual is very different from the league-average one: the third cluster is actually in the players' own end of the ice. The goals in this cluster are all empty-net goals from the end of the game. This makes sense because, in the last minutes of a game, a leading team normally puts out its best players to defend and try to score into the empty net while the opponent skates 6-on-5.
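For readers who want to see the clustering step in isolation, here is a small sketch on synthetic coordinates. The three blobs are invented to loosely resemble a slot cluster, a point cluster and a far-end cluster; `n_init=10` is set explicitly only so the result does not depend on the library's default.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three synthetic blobs of goal locations (all values invented)
centers = np.array([[80.0, 0.0], [40.0, 15.0], [-60.0, 0.0]])
points = np.vstack([c + rng.normal(scale=3.0, size=(50, 2)) for c in centers])
goals = pd.DataFrame(points, columns=["st_x", "st_y"])
goals.loc[5, "st_x"] = np.nan  # a row with a missing coordinate

# K-Means cannot handle missing values, so fit on complete rows only
coords = goals[["st_x", "st_y"]].dropna()
model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(coords)

# Assign labels through the index so the dropped row stays unlabeled (NaN)
goals.loc[coords.index, "Cluster"] = model.labels_
print(goals["Cluster"].value_counts(dropna=False))
print(model.cluster_centers_.round(1))
```

Writing the labels back through `coords.index` keeps them aligned with the right rows even when some rows were dropped before fitting.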

In [None]:
player_data = pd.read_pickle("pickled_data/player_info")
plays_data = pd.read_pickle("pickled_data/game_plays")
plays_players_data = pd.read_pickle("pickled_data/game_plays_players")
player_info = pd.read_pickle("pickled_data/player_info")

player_info['birthDate'] = pd.to_datetime(player_info['birthDate'], errors='coerce')

merged_df = plays_data.merge(plays_players_data, on='play_id', how='left')
merged_df = merged_df.merge(player_info, left_on='player_id', right_on='player_id', how='left')
current_year = 2020
merged_df = merged_df[(current_year - merged_df['birthDate'].dt.year >= 20) & (current_year - merged_df['birthDate'].dt.year <= 30)]

Above, we restricted our dataset to players aged between 20 and 30 (as of 2020), as the average career in the NHL is around 6 years. Now, we have to split players into their offensive and defensive positions, as well as define what the scoring and defensive plays are for each group.

In [None]:
scoring_plays = ['Goal', 'Assist']
defensive_plays = ['Goal', 'Assist']

offensive_positions = ['C', 'LW', 'RW']
defensive_positions = ['D']

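# Keep only the scoring/defensive events first
offensive_df = merged_df[merged_df['event'].isin(scoring_plays)]
defensive_df = merged_df[merged_df['event'].isin(defensive_plays)]

# Then narrow each group to its positions (filtering merged_df again here would discard the event filter)
offensive_df = offensive_df[offensive_df['primaryPosition'].isin(offensive_positions)]
defensive_df = defensive_df[defensive_df['primaryPosition'].isin(defensive_positions)]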

offensive_stats = offensive_df.groupby('player_id').size().reset_index(name='offensive_events')
defensive_stats = defensive_df.groupby('player_id').size().reset_index(name='defensive_events')

offensive_stats = offensive_stats.merge(player_info, on='player_id', how='left')
defensive_stats = defensive_stats.merge(player_info, on='player_id', how='left')

low_offensive, high_offensive = offensive_stats['offensive_events'].quantile([0.15, 0.65])
low_defensive, high_defensive = defensive_stats['defensive_events'].quantile([0.15, 0.65])

offensive_stats = offensive_stats[(offensive_stats['offensive_events'] >= low_offensive) & (offensive_stats['offensive_events'] <= high_offensive)]
defensive_stats = defensive_stats[(defensive_stats['defensive_events'] >= low_defensive) & (defensive_stats['defensive_events'] <= high_defensive)]

Above, we filtered the data by position and by game event so that each group contains only its scoring or defensive plays. We also grouped the data by player_id and filtered out outliers with a fairly aggressive range, as production varies widely among NHL players. All-stars would heavily skew the average point production to the right, while under-performing players who have barely played any NHL games would pull the data toward 0 points. We accounted for that by cutting the bottom 15% and the top 35%, thereby defining a league-average player. Now, we can model the distribution in histograms for both offense and defense.
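As a small illustration of the trimming step, the sketch below applies the same 15th and 65th percentile cut-offs to a made-up column of event counts with a long right tail. The numbers are invented and only show how the cut behaves.

```python
import pandas as pd

# Invented event counts per player, with a long right tail
stats = pd.DataFrame({
    "player_id": range(1, 11),
    "events": [1, 3, 10, 20, 25, 30, 40, 80, 200, 600],
})

# 15th and 65th percentiles of the event counts
low, high = stats["events"].quantile([0.15, 0.65])

# between() is inclusive on both ends, matching the >= and <= filters above
trimmed = stats[stats["events"].between(low, high)]
print(low, high)
print(trimmed)
```

Both the very low counts and the large values in the tail fall outside the band and are dropped.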

In [None]:
plt.figure(figsize=(10,5))
sns.histplot(offensive_stats['offensive_events'], bins = 30, kde = True)
plt.title("Distribution of Offensive Production")
plt.xlabel("Number of Offensive Events")
plt.ylabel("Frequency")
plt.show()

plt.figure(figsize=(10,5))
sns.histplot(defensive_stats['defensive_events'], bins = 30, kde = True)
plt.title("Distribution of Defensive Production")
plt.xlabel("Number of Defensive Events")
plt.ylabel("Frequency")
plt.show()

offensive_mean = offensive_stats['offensive_events'].mean()
offensive_std = offensive_stats['offensive_events'].std()
league_avg_offensive = offensive_stats[(offensive_stats['offensive_events'] <= offensive_mean + offensive_std) & (offensive_stats['offensive_events'] >= offensive_mean)]

defensive_mean = defensive_stats['defensive_events'].mean()
defensive_std = defensive_stats['defensive_events'].std()
league_avg_defensive = defensive_stats[(defensive_stats['defensive_events'] <= defensive_mean + defensive_std) & (defensive_stats['defensive_events'] >= defensive_mean)]

random_offensive_sample = league_avg_offensive.sample(n=min(10, len(league_avg_offensive)), random_state=42)
random_defensive_sample = league_avg_defensive.sample(n=min(10,len(league_avg_defensive)), random_state=42)

display('Random Defensive Player Sample:', random_defensive_sample)
display("Random Offensive Player Sample:", random_offensive_sample)


![Offensive Production Distribution](../reports/figures/offensive_production_distribution.png)

![Defensive Production Distribution](../reports/figures/defensive_production_distribution.png)

Above, we modeled the distribution of points for both the offense and defense groups. We then took the mean and standard deviation of each group and kept the players whose point production lies between the mean and 1 standard deviation above it. This captures true league-average players producing around the mean; had we also included players down to 1 standard deviation below the mean, the cutoff would be too close to 0 points and the pool of players would be far too large to properly represent the league average. Finally, using pandas, we took a random sample of 10 players from each group for the shot/goal visualizations and metrics.
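The band-and-sample step can be sketched on a toy table as follows. The values are invented; the only point is to show which rows fall between the mean and one standard deviation above it, and how a fixed `random_state` makes the sample repeatable.

```python
import pandas as pd

# Invented event counts for eight players
stats = pd.DataFrame({
    "player_id": range(1, 9),
    "events": [10, 12, 15, 18, 20, 22, 25, 38],
})

mean = stats["events"].mean()
std = stats["events"].std()  # sample standard deviation (ddof=1), the pandas default

# Keep players between the mean and one standard deviation above it
band = stats[stats["events"].between(mean, mean + std)]

# Sample without exceeding the number of available players
sample = band.sample(n=min(3, len(band)), random_state=42)
print(round(mean, 2), round(std, 2), len(band), sorted(sample["player_id"]))
```

Capping `n` at `len(band)` mirrors the `min(10, len(...))` guard in our sampling code, which avoids an error when the band holds fewer players than requested.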

Now we can finally look at some goal visualizations comparing the league-average sample to the all-stars. First, we will look at the types of goals and the goal locations for each player. We loop through our list of sampled players and create a subplot with the corresponding goal graph for each.

In [None]:
sampled_offplayer_ids = random_offensive_sample['player_id'].tolist()
filtered_offplays = merged_df[(merged_df['player_id'].isin(sampled_offplayer_ids)) & (merged_df['event'].isin(scoring_plays))]


fig, axes = plt.subplots(2, 5, figsize=(24,12))
axes = axes.flatten()

for i, (player_id) in enumerate(sampled_offplayer_ids):
    ax = axes[i]
    player_shots_goals = filtered_offplays[filtered_offplays['player_id'] == player_id]

    draw_rink(ax)

    sns.scatterplot(x=player_shots_goals['st_x'], y=player_shots_goals['st_y'], hue=player_shots_goals['secondaryType'], s=80, alpha=0.7, ax=ax)
    ax.legend(loc='upper left')
    
    player_name = player_shots_goals['fullName'].iloc[0]  # first row; iloc[1] fails when a player has a single row
    ax.set_title(f"{player_name} Goal Chart", fontsize=14)
    

plt.show()

![League Average Offensemen Goal Scatter](../reports/figures/league_average_offenseman_goal_scatter.png)

We also have to look at the goal heatmaps, as these are key to see how scoring locations change between league-average and all-star players. 

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(24,16))

axes = axes.flatten()

for i, (player_id) in enumerate(sampled_offplayer_ids):
    ax = axes[i]
    player_shots_goals = filtered_offplays[filtered_offplays['player_id'] == player_id]

    draw_rink(ax)

    sns.kdeplot(x='st_x', y='st_y', data=player_shots_goals, thresh=0.2, fill=True, cmap='coolwarm', ax=ax)
    player_name = player_shots_goals['fullName'].iloc[0]  # first row; iloc[1] fails when a player has a single row
    ax.set_title(f"{player_name} Goal Chart", fontsize=14)

plt.show()

![League Average Offensemen Goal Heatmaps](../reports/figures/league_average_offenseman_goal_heatmap.png)

In [None]:
sampled_defplayer_ids = random_defensive_sample['player_id'].tolist()
filtered_defplays = merged_df[(merged_df['player_id'].isin(sampled_defplayer_ids)) & (merged_df['event'].isin(defensive_plays))]


fig, axes = plt.subplots(2, 5, figsize=(24,12))
axes = axes.flatten()

for i, (player_id) in enumerate(sampled_defplayer_ids):
    ax = axes[i]
    player_def_goals = filtered_defplays[filtered_defplays['player_id'] == player_id]

    draw_rink(ax)

    sns.scatterplot(x=player_def_goals['st_x'], y=player_def_goals['st_y'], hue=player_def_goals['secondaryType'], s=80, alpha=0.7, ax=ax)
    ax.legend(loc='upper left')
    
    player_name = player_def_goals['fullName'].iloc[0]  # first row; iloc[1] fails when a player has a single row
    ax.set_title(f"{player_name} Defensive Goal Chart", fontsize=14)
    

plt.show()

![League Average Defensemen Goal Scatter](../reports/figures/league_average_defenseman_goal_scatter.png)

In [None]:
fig, axes = plt.subplots(2, 5, figsize=(24,16))

axes = axes.flatten()

for i, (player_id) in enumerate(sampled_defplayer_ids):
    ax = axes[i]
    player_def_goals = filtered_defplays[filtered_defplays['player_id'] == player_id]

    draw_rink(ax)

    sns.kdeplot(x='st_x', y='st_y', data=player_def_goals, thresh=0.2, fill=True, cmap='coolwarm', ax=ax)
    player_name = player_def_goals['fullName'].iloc[0]  # first row; iloc[1] fails when a player has a single row
    ax.set_title(f"{player_name} Shot & Goal Heatmap", fontsize=14)

plt.show()

![League Average Defensemen Goal Heatmaps](../reports/figures/league_average_defenseman_goal_heatmap.png)

As we did for the all-stars, we should also look at goal scoring by period for league-average players, which shows the distribution of goals as the game goes on. We filter the plays data to goals and group by period to get the total goal count per period for our sample. Here we look solely at the offensive players, as they do most of the goal scoring.

In [None]:
goal_scoring_by_period = filtered_offplays[filtered_offplays['event'] == 'Goal'].groupby('period').size().reset_index(name='Goal Count')
plt.figure(figsize=(12,6))
sns.barplot(x=goal_scoring_by_period['period'], y=goal_scoring_by_period['Goal Count'])
plt.title("Goal Scoring By Period for League-Average Players")
plt.xlabel("Period")
plt.ylabel("Total Goals")
plt.show()

![Goal Scoring by Period for League Average Players](../reports/figures/league_average_goals_by_period.png)

Goal scoring for league-average players actually falls as the game goes on, peaking in the first period. This is the opposite of the all-star group, who score the most goals in the second and third periods.

In [None]:
scoring_plays = ['Goal', 'Assist', 'Shot']
filtered_offplays = merged_df[(merged_df['player_id'].isin(sampled_offplayer_ids)) & (merged_df['event'].isin(scoring_plays))]
shots_taken_by_period = filtered_offplays[filtered_offplays['event'] == 'Shot'].groupby('period').size().reset_index(name='Shot Count')



plt.figure(figsize=(12,6))
sns.barplot(x=shots_taken_by_period['period'], y=shots_taken_by_period['Shot Count'])
plt.title("Shots Taken By Period for League-Average Players")
plt.xlabel("Period")
plt.ylabel("Shot Count")
plt.show()

![Shots Taken by Period for League Average Players](../reports/figures/league_average_shots_taken_by_period.png)

Finally, as with the all-stars, we conclude by looking at the shot-to-goal ratio. This gives us a good comparison with the all-star players in terms of how many goals they are getting, as well as the total number of shots compared to the best players in the league.

In [None]:
scoring_plays = ['Goal', 'Shot']
filtered_offplays = merged_df[(merged_df['player_id'].isin(sampled_offplayer_ids)) & (merged_df['event'].isin(scoring_plays))]
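total_goals = (filtered_offplays['event'] == 'Goal').sum()
# Note: .count() counts every row of the boolean mask, so this total covers shots and goals combined
total_shots = (filtered_offplays['event'] == 'Shot').count()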
total_shot_goal_ratio = total_goals / total_shots
display(total_shot_goal_ratio)
print("Total Shots:", total_shots)

`np.float64(0.18775995246583482)`
`Total Shots: 1683`

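The league-average shot-to-goal ratio is actually higher than that of the all-star group, but the total number of shots is much lower than for the all-stars, which may skew the percentage. The two figures are also not computed identically: `.count()` here counts every shot and goal row, so this ratio is goals divided by shots plus goals, whereas the all-star ratio divides goals by non-goal shots only.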


Lastly, we will use K-Means clustering to find the top 3 groups in the goal data. This will help visualize the difference between the all-star and league-average goal locations. We will use sklearn to fit K-Means on the goal data.

In [None]:
from sklearn.cluster import KMeans
num_clusters = 3 
scoring_plays = ['Goal', 'Shot']
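# .copy() avoids pandas' SettingWithCopyWarning when the Cluster column is added below
filtered_offplays = merged_df[(merged_df['player_id'].isin(sampled_offplayer_ids)) & (merged_df['event'].isin(scoring_plays))].copy()
shot_locations = filtered_offplays[['st_x', 'st_y']].dropna()
kmeans = KMeans(n_clusters=num_clusters, random_state=1)
kmeans.fit(shot_locations)
# Assign labels by index so that rows dropped for missing coordinates stay unlabeled
filtered_offplays.loc[shot_locations.index, 'Cluster'] = kmeans.labels_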

fig, ax = plt.subplots(figsize=(10, 6))
draw_rink(ax)
sns.scatterplot(x=filtered_offplays['st_x'], y=filtered_offplays['st_y'], hue=filtered_offplays['Cluster'], palette='viridis', ax=ax)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color='red', marker='X', s=200, label='Centroids')
plt.title("Goal Clustering for League-Average Sampled Players")
plt.xlabel("X Coordinate")
plt.ylabel("Y Coordinate")
plt.legend(loc='upper left')
plt.show()


![Goal Clustering for League Average Players](../reports/figures/league_average_goal_clustering.png)

Unlike the all-stars, the league-average group scores all of its goals in the offensive zone. The centroid locations also differ between the two groups, with all of the league-average centroids shifted closer to the goal. This is because a lot of league-average scoring comes from tipped shots, redirections, and one-timers from close to the goal.
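
For readers unfamiliar with what the centroids represent, the following is a minimal pure-Python sketch of the basic K-Means (Lloyd's) iteration. It is an illustration only: sklearn's `KMeans` uses a smarter initialization and other refinements, the helper name `lloyd_kmeans` is hypothetical, and the coordinates are made up rather than taken from the dataset.

```python
import math

def lloyd_kmeans(points, centroids, iterations=20):
    """Assign each point to its nearest centroid, then move each centroid to the
    mean of its assigned points; repeat until the centroids stop moving."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Made-up (st_x, st_y) locations: three near the net, two at each point
points = [(80, 0), (82, 3), (78, -2), (55, 20), (57, 22), (55, -20), (58, -19)]
labels, centroids = lloyd_kmeans(points, centroids=[(80, 0), (55, 20), (55, -20)])
print(labels)
print(centroids)
```

Each centroid is simply the average location of the points assigned to it, which is why centroids closer to the net indicate scoring chances concentrated near the crease.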

### Game Situation Analysis

This section focuses on analysing NHL shot types across different game scenarios to answer questions such as "Do NHL players use different shots when they are losing the game?". To begin this analysis, we first import the necessary packages and load the cleaned data.

In [None]:
# Importing all necessary packages
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Arc, Wedge
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

DATA_PATH = Path("../data/")
PICKLE_PATH = Path("../pickled_data/")

In [None]:
# Loading the necessary pickled file after cleaning
game_plays = pd.read_pickle(PICKLE_PATH / 'game_plays')

In [None]:
# Ensure dateTime is in datetime format
game_plays['dateTime'] = pd.to_datetime(game_plays['dateTime'])

# Filter Calgary Flames (team_id_for == 20) plays from April 2018
game_plays2018 = game_plays[(game_plays['dateTime'].dt.year == 2018) & (game_plays['dateTime'].dt.month == 4) & (game_plays['team_id_for'] == 20)]


To analyze where players were when each shot occurred, we created a function that draws a hockey rink with regulation NHL dimensions. This example visualizes every shot type and goal for Calgary Flames players during the month of April 2018.


In [None]:
# Calgary Flames Shots and Goals from April 2018 (3 games)

# Function for drawing key rink features. Dimensions from https://www.hockeymanitoba.ca/wp-content/uploads/2013/03/Rink-Marking-Diagrams.pdf
def draw_rink(ax):
    '''
    Displays a hockey rink with NHL dimensions at current axes
    Parameter ax: Current axes
    Returns nothing
    '''

    # Draw the center ice line
    ax.axvline(0, color='red', linestyle='-', linewidth=2, alpha=0.5)
    
    # Draw the blue lines at +/- 25 feet from center ice
    ax.axvline(-25, color='blue', linestyle='--', linewidth=2, alpha=0.5)
    ax.axvline(25, color='blue', linestyle='--', linewidth=2, alpha=0.5)

    # Draw blue faceoff circle at center ice (15 ft radius) with blue dot at the center
    center_circle = plt.Circle((0, 0), 15, edgecolor='blue', facecolor='none', alpha = 0.5, lw=2)
    ax.add_patch(center_circle)
    ax.plot(0, 0, marker='o', color='blue', alpha = 0.5, markersize=6)  
    
    # Draw the 4 red faceoff circles with red dots at center (31 feet from end boards and 20.5 feet from side boards)
    # With the side boards at y = +/-42.5, 20.5 feet from the boards is y = +/-22
    faceoff_positions = [(69, 22), (-69, 22), (69, -22), (-69, -22)]
    for x, y in faceoff_positions:
        faceoff_circle = plt.Circle((x, y), 15, edgecolor='red', facecolor='none', alpha = 0.5, lw=2)
        ax.add_patch(faceoff_circle)
        ax.plot(x, y, marker='o', color='red', alpha = 0.5, markersize=6)  
    
    # Draw goal lines for net (6 ft)
    ax.plot([-89, -89], [-3, 3], color='red', lw=1)  # Left goal line
    ax.plot([89, 89], [-3, 3], color='red', lw=1)    # Right goal line

    # Draw back of goals as arcs (6 ft wide and 4 ft deep)
    left_goal= Arc((-89, 0), width=6, height=8, angle=90, theta1=360, theta2=180, color='red', lw=1)
    right_goal = Arc((89, 0), width=6, height=8, angle=90, theta1=180, theta2=360, color='red', lw=1)

    ax.add_patch(left_goal)
    ax.add_patch(right_goal)

    # Draw the goal creases using Wedges and fill with low opacity (6 ft radius)
    # Use facecolor rather than color so that edgecolor is not overridden
    left_goal_crease = Wedge((-89, 0), r=6, theta1=270, theta2=90, facecolor='skyblue', alpha=0.2, edgecolor='blue', lw=2)
    right_goal_crease = Wedge((89, 0), r=6, theta1=90, theta2=270, facecolor='skyblue', alpha=0.2, edgecolor='blue', lw=2)

    ax.add_patch(left_goal_crease)
    ax.add_patch(right_goal_crease)

    # Set the rink bounds (200 ft by 85 ft)
    ax.set_xlim(-100, 100)        
    ax.set_ylim(-42.5, 42.5)

    # Treat x and y units equally so circles are drawn correctly
    ax.set_aspect('equal')

# Set the size of the plot
plt.figure(figsize=(10, 6))

# Get the current axes
ax = plt.gca()  

# Filter shots and goals
shots = game_plays2018[game_plays2018['event'] == 'Shot']
goals = game_plays2018[game_plays2018['event'] == 'Goal']

# Give each shot a distinctive color
shot_colors = {
    'Wrist Shot': 'teal',
    'Slap Shot': 'gold',
    'Backhand': 'orchid',
    'Tip-In': 'darkorange',
    'Wrap-around': 'lightgreen',
    'Snap Shot': 'deepskyblue'
}

# Plot shots based on their type
for shot_type, color in shot_colors.items():
    show_shot = shots[(shots['secondaryType'] == shot_type)]
    ax.scatter(show_shot['st_x'], show_shot['st_y'], color=color, marker='h', s=10, label=f'{shot_type} (Shot)', alpha=0.8)

# Plot goals based on their shot type
for shot_type, color in shot_colors.items():
    show_goal = goals[goals['secondaryType'] == shot_type]
    ax.scatter(show_goal['st_x'], show_goal['st_y'], color=color, marker='x', s=20, label=f'{shot_type} (Goal)', alpha=1)

# Draw the rink at current axes using the custom function
draw_rink(ax)

# Labels and legend
plt.title("Hockey Rink Visualization: Shots and Goals")
plt.xlabel("X Position (Feet)")
plt.ylabel("Y Position (Feet)")
plt.legend(title='Shot and Goal Types', loc='upper right', bbox_to_anchor=(1.26, 1))
plt.show()

![Shots and Goals on Hockey Rink Viz](../reports/figures/game_sit_hockey_rink_shots_goals.png)

To analyze which types of shots players use, and which are the most successful, when a team is winning vs. when a team is losing, we need to extract some data from the game_plays dataframe. First we add 'team_winning' and 'team_losing' columns to the dataframe, and then we filter the shots and count them by shot type.

In [None]:
# Visualizing shot types and goals when a team is winning and losing

# team_id_for has to be compared with team ids, not with goal counts.
# Assumption: the pickled games table keeps the Kaggle columns
# 'game_id', 'home_team_id' and 'away_team_id'.
games = pd.read_pickle(PICKLE_PATH / 'games')
team_ids = games.drop_duplicates('game_id').set_index('game_id')
is_home = game_plays['team_id_for'] == game_plays['game_id'].map(team_ids['home_team_id'])
is_away = game_plays['team_id_for'] == game_plays['game_id'].map(team_ids['away_team_id'])

# Create a column for when a team was winning
game_plays['team_winning'] = (is_home & (game_plays['goals_home'] > game_plays['goals_away'])) | \
                             (is_away & (game_plays['goals_away'] > game_plays['goals_home']))

# Create a column for when a team is losing
game_plays['team_losing'] = (is_home & (game_plays['goals_home'] < game_plays['goals_away'])) | \
                            (is_away & (game_plays['goals_away'] < game_plays['goals_home']))

# Filter shots and goals when losing
losing_shots = game_plays[(game_plays['team_losing']) & (game_plays['event'] == 'Shot')]
losing_goals = game_plays[(game_plays['team_losing']) & (game_plays['event'] == 'Goal')]

# Filter shots and goals when winning
winning_shots = game_plays[(game_plays['team_winning']) & (game_plays['event'] == 'Shot')]
winning_goals = game_plays[(game_plays['team_winning']) & (game_plays['event'] == 'Goal')]

# Count occurrences by shot type for both winning and losing scenarios
shot_types = sorted(set(losing_shots['secondaryType']).union(set(winning_shots['secondaryType'])))
losing_shot_counts = [losing_shots['secondaryType'].value_counts().get(shot_type, 0) for shot_type in shot_types]
losing_goal_counts = [losing_goals['secondaryType'].value_counts().get(shot_type, 0) for shot_type in shot_types]
winning_shot_counts = [winning_shots['secondaryType'].value_counts().get(shot_type, 0) for shot_type in shot_types]
winning_goal_counts = [winning_goals['secondaryType'].value_counts().get(shot_type, 0) for shot_type in shot_types]


# Calculate total shots and goals for percentage calculation
total_losing_shots = sum(losing_shot_counts)
total_losing_goals = sum(losing_goal_counts)
total_winning_shots = sum(winning_shot_counts)
total_winning_goals = sum(winning_goal_counts)
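
As a small illustration of the winning/losing labelling, the sketch below classifies single plays from the shooting team's point of view. It is a hedged example rather than the notebook's code: `score_state` is a hypothetical helper, and it assumes the home team's id is available for each play (for example from the games table).

```python
def score_state(team_id_for, home_team_id, goals_home, goals_away):
    """Return 'winning', 'losing' or 'tied' from the point of view of team_id_for."""
    if team_id_for == home_team_id:
        diff = goals_home - goals_away
    else:
        diff = goals_away - goals_home
    if diff > 0:
        return "winning"
    if diff < 0:
        return "losing"
    return "tied"

# Made-up plays: (team_id_for, home_team_id, goals_home, goals_away)
plays = [(20, 20, 0, 0), (23, 20, 1, 0), (20, 20, 2, 1), (23, 20, 2, 2)]
states = [score_state(*play) for play in plays]
print(states)
```

Keeping a separate "tied" state makes it explicit that shots taken at an even score fall into neither the winning nor the losing group.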

First we will look strictly at shots that were saved, for both winning and losing teams, using a grouped bar chart annotated with each shot type's percentage share. The counts cover every shot in the game_plays dataframe.

In [None]:
# Create bar positions
x = np.arange(len(shot_types))
bar_width = 0.35  # Width of each bar

# Calculate percentages
losing_shot_percentages = [(count / total_losing_shots) * 100 for count in losing_shot_counts]
winning_shot_percentages = [(count / total_winning_shots) * 100 for count in winning_shot_counts]

# Set size of plot
plt.figure(figsize=(12, 8))

# Plot bars for losing and winning shots
losing_bars = plt.bar(x - bar_width/2, losing_shot_counts, bar_width, color='blue', label='Losing Situation - Shots', alpha=0.7)
winning_bars = plt.bar(x + bar_width/2, winning_shot_counts, bar_width, color='green', label='Winning Situation - Shots', alpha=0.7)

# Add percentage annotations
for i in range(len(shot_types)):
    # Losing shot percentage
    if losing_shot_counts[i] > 0:
        plt.text(x[i] - bar_width/2, losing_shot_counts[i] + 1, f"{losing_shot_percentages[i]:.1f}%", 
                 ha='center', va='bottom', fontsize=9, color='black')
    # Winning shot percentage
    if winning_shot_counts[i] > 0:
        plt.text(x[i] + bar_width/2, winning_shot_counts[i] + 1, f"{winning_shot_percentages[i]:.1f}%", 
                 ha='center', va='bottom', fontsize=9, color='black')

# Add labels and title
plt.xlabel('Shot Type')
plt.ylabel('Number of Shots')
plt.title('Distribution of Shot Types (Winning vs. Losing Situations)')
plt.xticks(x, shot_types, rotation=45, ha='right')
plt.legend()
plt.tight_layout()  
plt.show()

![Distribution of Shot Types in Winning v Losing Situations](../reports/figures/distribution_of_shot_types_situations.png)

From this visualization, we can see that players choose similar shots regardless of whether they are winning or losing. The greatest disparities are in wrist shots and snap shots. Winning teams seem to choose wrist shots more often, suggesting they are using the "Pucks to the Net" strategy. Winning teams generally finish the game with more shots on net. Additionally, once a team has the lead, a high shot volume keeps pressure on the defending team and keeps the puck in the opponent's defensive zone, where the trailing team has no chance of scoring. Losing teams tend to use snap shots more often than winning teams. This may be because a struggling team feels the need to set up a more dangerous scoring chance to ensure a goal, or to catch a hot goalie off guard with a quick shot.
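
The disparities described above can also be expressed as percentage-point gaps between the two situations. The following is a minimal sketch with made-up shot lists; `shot_type_shares` and `share_gaps` are hypothetical helpers and the numbers are not results from the dataset.

```python
from collections import Counter

def shot_type_shares(shot_types):
    """Percentage share of each shot type in a list of shot-type labels."""
    counts = Counter(shot_types)
    total = sum(counts.values())
    return {shot_type: 100 * count / total for shot_type, count in counts.items()}

def share_gaps(winning, losing):
    """Winning share minus losing share, in percentage points, per shot type."""
    win, lose = shot_type_shares(winning), shot_type_shares(losing)
    return {t: win.get(t, 0.0) - lose.get(t, 0.0) for t in sorted(set(win) | set(lose))}

# Made-up samples, for illustration only
winning = ["Wrist Shot"] * 6 + ["Snap Shot"] * 2 + ["Slap Shot"] * 2
losing = ["Wrist Shot"] * 5 + ["Snap Shot"] * 3 + ["Slap Shot"] * 2
gaps = share_gaps(winning, losing)
print(gaps)
```

A positive gap means the shot type makes up a larger share of shots for winning teams, and a negative gap means it is more common for losing teams.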

In [None]:
# Create bar positions
x = np.arange(len(shot_types))
bar_width = 0.35  

# Calculate percentages
losing_goal_percentages = [(count / total_losing_goals) * 100 for count in losing_goal_counts]
winning_goal_percentages = [(count / total_winning_goals) * 100 for count in winning_goal_counts]

# Set size of plot
plt.figure(figsize=(12, 8))

# Plot bars for losing and winning goals
losing_goal_bars = plt.bar(x - bar_width/2, losing_goal_counts, bar_width, color='red', label='Goals (Losing)', alpha=0.7)
winning_goal_bars = plt.bar(x + bar_width/2, winning_goal_counts, bar_width, color='orange', label='Goals (Winning)', alpha=0.7)

# Add percentage annotations
for i in range(len(shot_types)):
    # Annotations for losing goals
    if losing_goal_counts[i] > 0:
        plt.text(x[i] - bar_width/2, losing_goal_counts[i] + 1, f"{losing_goal_percentages[i]:.1f}%", 
                 ha='center', va='bottom', fontsize=9, color='black')
    # Annotations for winning goals
    if winning_goal_counts[i] > 0:
        plt.text(x[i] + bar_width/2, winning_goal_counts[i] + 1, f"{winning_goal_percentages[i]:.1f}%", 
                 ha='center', va='bottom', fontsize=9, color='black')

# Add labels and title
plt.xlabel('Shot Type')
plt.ylabel('Number of Goals')
plt.title('Distribution of Goals by Shot Type (Winning vs. Losing Situations)')
plt.xticks(x, shot_types, rotation=45, ha='right')
plt.legend()
plt.tight_layout()  
plt.show()

![Distribution of Goals by Shot Type Winning v Losing Situations](../reports/figures/distribution_of_goals_situation.png)

From this visualization, we can see that the high volume of wrist shots from winning teams accounts for almost half of their goals. This shows how effective the "Pucks to the Net" strategy is, not only for killing time but also for increasing the goal differential. While losing teams do have some success with the snap shot, they don't have enough shot volume to win the game.

In [None]:
# Winning vs. Losing Goals October 2018
ax = plt.gca()

# Draw the rink at current axes using the custom function
draw_rink(ax)   

# Plot winning and losing team shots and goals (filtered using boolean indexing)
ax.scatter(losing_goals[(losing_goals['dateTime'].dt.year == 2018) & (losing_goals['dateTime'].dt.month == 10)]['x'], 
           losing_goals[(losing_goals['dateTime'].dt.year == 2018) & (losing_goals['dateTime'].dt.month == 10)]['y'], 
           color='orange', marker='*', s=40, label='Losing Team Goals', alpha=0.8)

ax.scatter(winning_goals[(winning_goals['dateTime'].dt.year == 2018) & (winning_goals['dateTime'].dt.month == 10)]['x'], 
           winning_goals[(winning_goals['dateTime'].dt.year == 2018) & (winning_goals['dateTime'].dt.month == 10)]['y'], 
           color='green', marker='+', s=40, label='Winning Team Goals', alpha=0.8)

# Labels and legend
plt.title("Positional Analysis: Goals for Winning and Losing Teams")
plt.xlabel("X Position (Feet)")
plt.ylabel("Y Position (Feet)")
plt.legend(title='', loc='upper right', bbox_to_anchor=(1.4, 1))
plt.show()

![Positional Analysis: Goals for Winning and Losing Teams (Oct18)](../reports/figures/goal_locations_winning_losing_oct18.png)

Here we can see the exact location the player was shooting from when a goal was scored. This visualization only includes goals from October 2018 (the start of the season), in order not to overpopulate the rink with markers. It is in line with our hypothesis that winning teams are getting pucks to the net no matter where they are on the ice. They also have a lot of success just outside the crease, where they can block the goalie's vision and get a tip on the puck right in front of the net, leaving the goalie no chance to react. This strategy makes it nearly impossible for even the best goaltenders to stop the puck. In contrast, losing teams score more goals from the slot area or just off to the side of the net. This provides further evidence that losing teams are trying to get in position to set up that perfect snap shot or one-timer.

In [None]:
# Winning vs. Losing Goals April 2018
ax = plt.gca()

# Draw the rink at current axes using the custom function
draw_rink(ax)   

# Plot winning and losing team goals (filtered using boolean indexing)
ax.scatter(losing_goals[(losing_goals['dateTime'].dt.year == 2018) & (losing_goals['dateTime'].dt.month == 4)]['x'], 
           losing_goals[(losing_goals['dateTime'].dt.year == 2018) & (losing_goals['dateTime'].dt.month == 4)]['y'], 
           color='orange', marker='*', s=40, label='Losing Team Goals', alpha=0.8)

ax.scatter(winning_goals[(winning_goals['dateTime'].dt.year == 2018) & (winning_goals['dateTime'].dt.month == 4)]['x'], 
           winning_goals[(winning_goals['dateTime'].dt.year == 2018) & (winning_goals['dateTime'].dt.month == 4)]['y'], 
           color='green', marker='+', s=40, label='Winning Team Goals', alpha=0.8)

# Labels and legend
plt.title("Positional Analysis: Goals for Winning and Losing Teams")
plt.xlabel("X Position (Feet)")
plt.ylabel("Y Position (Feet)")
plt.legend(title='', loc='upper right', bbox_to_anchor=(1.26, 1))
plt.show()

![Positional Analysis: Goals for Winning and Losing Teams (Apr18)](../reports/figures/goal_locations_winning_losing_apr18.png)

The purpose of this visualization is to look at goals from winning and losing teams in April, which is the last month of the season. Interestingly, we see the losing teams take more chances from further away from the net. This may be a result of teams becoming desperate to win in order to make the playoffs, as well as teams that won't make the playoffs worrying less about that perfect shot when nothing is on the line. Either way, the losing teams still seem to struggle with the important net-front presence that takes away the goalie's eyes and creates tip-in goals and rebounds.

In [None]:
# Filter for shots before any goals have been scored
shots_no_goals_against = game_plays[
    (game_plays['event'] == 'Shot') & 
    (game_plays['goals_home'] == 0) & 
    (game_plays['goals_away'] == 0) & 
    (game_plays['dateTime'].dt.year == 2018)
]

# Shots after a goal has been scored: at least one team has scored (not necessarily both)
shots_goals_against = game_plays[
    (game_plays['event'] == 'Shot') & 
    ((game_plays['goals_home'] != 0) | (game_plays['goals_away'] != 0)) & 
    (game_plays['dateTime'].dt.year == 2018) 
]

# Count the number of shots by type, using one shared ordering of shot types
# so that overlaid bars at the same x position refer to the same shot type
shot_types = sorted(set(shots_no_goals_against['secondaryType'].dropna()) | set(shots_goals_against['secondaryType'].dropna()))
shot_types_against = shot_types
shot_counts = [len(shots_no_goals_against[shots_no_goals_against['secondaryType'] == shot_type]) for shot_type in shot_types]
shot_counts_against = [len(shots_goals_against[shots_goals_against['secondaryType'] == shot_type]) for shot_type in shot_types_against]

# Calculate total shots and percentages
total_shots = sum(shot_counts)
shot_percentages = [(count / total_shots) * 100 if total_shots > 0 else 0 for count in shot_counts]
total_shots_against = sum(shot_counts_against)
shot_percentages_against = [(count / total_shots_against) * 100 if total_shots_against > 0 else 0 for count in shot_counts_against]

# Create bar positions
x = np.arange(len(shot_types))
x_against = np.arange(len(shot_types_against))
bar_width = 0.5  #

# Set size of plot
plt.figure(figsize=(12, 8))

# Plot bars for shot counts
bars = plt.bar(x, shot_counts, bar_width, color='red', alpha=0.3, label='Shots Before Any Goals Scored')
# Plot bars for shot counts after a goal has been scored
bars_against = plt.bar(x_against, shot_counts_against, bar_width, color='purple', alpha=0.2, label='Shots After a Goal Has Been Scored')


# Add percentage annotations
for i in range(len(shot_types)):
    if shot_counts[i] > 0:
        plt.text(x[i], shot_counts[i] + 1, f"{shot_percentages[i]:.1f}%", 
                 ha='center', va='bottom', fontsize=9, color='black')
# Add percentage annotations
for i in range(len(shot_types_against)):
    if shot_counts_against[i] > 0:
        plt.text(x_against[i], shot_counts_against[i] + 1, f"{shot_percentages_against[i]:.1f}%", 
                 ha='center', va='bottom', fontsize=9, color='black')
        
# Add labels and title
plt.xlabel('Shot Type')
plt.ylabel('Number of Shots')
plt.title('Distribution of Shot Types Before Any Goals Have Been Scored')
plt.xticks(x, shot_types, rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()


![Distribution of Shot Types Before a Goal has been scored](../reports/figures/distribution_shot_types_scoreless_games.png)

The next game situation we will look at is shots before and after a goal has been scored. "Shutouts" are one of the main statistics for goaltenders, and they will battle to the last second of the game to secure one. It is well known in the hockey community that goalies are very superstitious, and even mentioning the possibility of a shutout before a goal has been scored is taboo. Using an overlaid bar plot to contrast shots before a goal has been scored with shots after a goal has been scored, we can get an idea of how players find success in breaking the shutout. The shot with the greatest disparity in use before and after a goal has been scored is the slap shot. From this data, it seems that players tend to take more slap shots before a goal has been scored. Slap shots are a good way of testing a goalie or shaking up one that is refusing to let anything in. When a goaltender is really dialed in, sometimes a 90 mile per hour rubber puck to the face is the only way to break their concentration. Players will selflessly throw their bodies in front of these shots to try to prevent this from happening.
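
The before/after split can be sketched as a simple classification on the running score. This is an illustrative example with made-up rows: `score_phase` is a hypothetical helper, and it treats "after a goal has been scored" as meaning that at least one team has scored.

```python
from collections import Counter

def score_phase(goals_home, goals_away):
    """Return 'scoreless' while the game is 0-0, otherwise 'after first goal'."""
    if goals_home == 0 and goals_away == 0:
        return "scoreless"
    return "after first goal"

# Made-up (goals_home, goals_away, shot type) rows, for illustration only
shots = [
    (0, 0, "Slap Shot"),
    (0, 0, "Wrist Shot"),
    (1, 0, "Wrist Shot"),
    (1, 1, "Snap Shot"),
    (0, 2, "Slap Shot"),
]
counts = Counter((score_phase(h, a), shot_type) for h, a, shot_type in shots)
for (phase, shot_type), n in sorted(counts.items()):
    print(phase, shot_type, n)
```

Note that a 1-0 game counts as "after first goal" here even though only one team has scored.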

In [None]:
# Filter shots when goals_home == 0 and goals_away == 0 during June 2018
shots_no_goals_june_2018 = game_plays[
    (game_plays['event'] == 'Shot') &
    (game_plays['goals_home'] == 0) & 
    (game_plays['goals_away'] == 0) &
    (game_plays['dateTime'].dt.year == 2018) & 
    (game_plays['dateTime'].dt.month == 6)
]

# Set up the plot
ax = plt.gca()

# Draw the rink at current axes using the custom function
draw_rink(ax)

# Scatter plot for shots when no goals have been scored
ax.scatter(shots_no_goals_june_2018['x'], 
           shots_no_goals_june_2018['y'], 
           color='purple', marker='x', s=30, label='Shots (No Goals Yet)', alpha=0.7)

# Labels and legend
plt.title("Positional Analysis: Shots Before Any Goals (June 2018)")
plt.xlabel("X Position (Feet)")
plt.ylabel("Y Position (Feet)")
plt.legend(title='', loc='upper right', bbox_to_anchor=(1.4, 1))
plt.show()


![Positional Analysis: Shot before goal scored (Jun 18)](../reports/figures/shot_locations_before_goals_jun18.png)

In this positional analysis, we look at shots before a goal has been scored during the month of June 2018. Only the best teams play in June, since this is when the final rounds of the playoffs happen. There is definitely a wide variety of shooting positions here, but we can see a lot of shot volume coming from the blue line where the majority of slap shots are taken.

In [None]:
# Filter for shootout goals when the game ended 0-0
shootout_goals0 = game_plays[
    (game_plays['event'] == 'Goal') &
    (game_plays['goals_home'] == 0) &
    (game_plays['goals_away'] == 0) &
    (game_plays['period'] == 5)
].copy()  

# Filter for all shootout goals
shootout_goals = game_plays[
    (game_plays['event'] == 'Goal') &
    (game_plays['period'] == 5) &
    (game_plays['dateTime'].dt.year == 2018)
].copy()  

# Ensure secondaryType column is cleaned: fill NaNs first, then convert to string
# (calling astype(str) first would turn NaN into the string 'nan', leaving fillna nothing to fill)
shootout_goals.loc[:, 'secondaryType'] = shootout_goals['secondaryType'].fillna('').astype(str)
shootout_goals0.loc[:, 'secondaryType'] = shootout_goals0['secondaryType'].fillna('').astype(str)

# Get unique shot types across both datasets
goal_types = sorted(set(shootout_goals['secondaryType'].dropna()).union(set(shootout_goals0['secondaryType'].dropna())))

# Count occurrences by shot type
shootout_goal_counts = [shootout_goals['secondaryType'].value_counts().get(goal_type, 0) for goal_type in goal_types]
shootout_goal0_counts = [shootout_goals0['secondaryType'].value_counts().get(goal_type, 0) for goal_type in goal_types]

# Calculate percentages
total_shootout_goals = sum(shootout_goal_counts)
total_shootout0_goals = sum(shootout_goal0_counts)

shootout_goal_percentages = [(count / total_shootout_goals * 100) if total_shootout_goals > 0 else 0 for count in shootout_goal_counts]
shootout_goal0_percentages = [(count / total_shootout0_goals * 100) if total_shootout0_goals > 0 else 0 for count in shootout_goal0_counts]

# Create bar positions
x = np.arange(len(goal_types))
bar_width = 0.5  

# Set up the figure
plt.figure(figsize=(12, 8))

# Plot bars for both shootout goals (stacked on top of each other)
bars1 = plt.bar(x, shootout_goal_counts, bar_width, color='purple', alpha=0.7, label='All Shootout Goals (2018)')
bars2 = plt.bar(x, shootout_goal0_counts, bar_width, color='red', alpha=0.7, label='Shootout Goals (Game Ended 0-0)', bottom=shootout_goal_counts)

# Add percentage annotations
for i in range(len(goal_types)):
    if shootout_goal_counts[i] > 0:
        plt.text(x[i], shootout_goal_counts[i] / 2, f"{shootout_goal_percentages[i]:.1f}%", 
                 ha='center', va='center', fontsize=9, color='white', fontweight='bold')
    if shootout_goal0_counts[i] > 0:
        plt.text(x[i], shootout_goal_counts[i] + (shootout_goal0_counts[i] / 2), f"{shootout_goal0_percentages[i]:.1f}%", 
                 ha='center', va='center', fontsize=9, color='white', fontweight='bold')

# Labels and title
plt.xlabel('Shot Type')
plt.ylabel('Number of Goals')
plt.title('Stacked Distribution of Shootout Goals by Shot Type')
plt.xticks(x, goal_types, rotation=45, ha='right')
plt.legend()
plt.tight_layout()
plt.show()

![Stacked Distribution of Shootout Goals by Type](../reports/figures/stacked_distribution_shootout_goals_by_type.png)

The final game situation we will look at is the shootout. During the regular season, when a game ends in a tie and 5 minutes of 3-on-3 overtime doesn't result in a goal, teams get a chance to send their best players in one-on-one against the goaltender. Using a stacked bar plot, we contrast all shootout goals from 2018 with shootout goals in games that ended 0-0, where the goalies were performing really well. Not surprisingly, wrist shots account for the majority of successful shootout goals in both situations. This is because many good players will elect to skate in, pick their spot, and try to hit it with a wrist shot. Snap shots are also fairly common in both situations, as they come off the stick fast and can surprise a goalie. The largest disparity between these two situations is the backhand shot, which accounts for 30% of the shootout goals in games that ended 0-0, as opposed to 18.8% of all shootout goals. In a shootout, backhand shots are generally preceded by some tricky stickhandling to get the goalie moving. It makes sense that in games where the goalies were perfect, it takes some trickery to finally get the puck in the net.

In [None]:
# Filter for shootout goals when no goals have been scored yet
goals_no_goals_against = game_plays[
    (game_plays['event'] == 'Goal') & 
    (game_plays['goals_home'] == 0) & 
    (game_plays['goals_away'] == 0)
]

# Set up the plot
ax = plt.gca()

# Draw the rink at current axes using the custom function
draw_rink(ax)

# Scatter plot for goals when no goals have been scored
ax.scatter(goals_no_goals_against['x'], 
           goals_no_goals_against['y'], 
           color='orange', marker='+', s=40, label='Player Coordinates', alpha=0.8)

# Labels and legend
plt.title("Positional Analysis: Shootout Goals in 0-0 Games")
plt.xlabel("X Position (Feet)")
plt.ylabel("Y Position (Feet)")
plt.legend(title='', loc='upper right', bbox_to_anchor=(1.4, 1))
plt.show()


![Positional Analysis: Shootout Goals in Scoreless Games](../reports/figures/shootout_goal_locations_scoreless_games.png)

Finally, we visualize where players typically shoot from in shootouts, specifically in games that ended 0-0. It comes as no surprise that when a player has the option of shooting from wherever they choose, it is going to be right in front of the net and generally as close as possible.

### Time Series Analysis

This section will explore shot and goal data over time. First, we'll inspect the shot and goal distributions throughout the time within games, i.e. how the tendency to shoot and score is distributed through minutes 0-59 of a regulation time NHL game. Next, we will inspect how shot and goal trends have changed over seasons. Has scoring increased over time? Do the types of shots show any trends in play style? Do shot/goal locations show any trends?

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sportypy.surfaces.hockey import NHLRink

PICKLE_PATH = Path("../pickled_data/")

game_plays = pd.read_pickle(PICKLE_PATH / "game_plays")
games = pd.read_pickle(PICKLE_PATH / "games")

In [None]:
# Filter out shots and goals
shots = game_plays[game_plays['event'] == 'Shot']
goals = game_plays[game_plays['event'] == 'Goal']

#### Time in Game

For our first analysis, we inspect how shots and goals are distributed over the course of a game. To visualize this effectively, we need a column that holds the total time elapsed in the game. We can calculate it from two existing columns: 'period', the number of the current period (regulation has three 20-minute periods), and 'periodTime', the number of seconds elapsed in the current period. The base number of seconds for a period is $(P - 1) \times 1200$, where $P$ is the period and 1200 is the number of seconds in a 20-minute period. So period 1 occupies 0-1200 secs (base is 0), period 2 occupies 1200-2400 secs (base is 1200), and period 3 occupies 2400-3600 secs (base is 2400). Adding the seconds elapsed in the current period to this base gives the total game time elapsed.

In [None]:
def calculate_total_time(df):
    """Calculates total elapsed game time in seconds, given period and time_elapsed (seconds).

    Args:
        df: Pandas DataFrame with 'period' and 'periodTime' (in seconds) columns.

    Returns:
        Pandas DataFrame with a new 'total_time' column (in seconds).
        Returns the original DataFrame if there's an error.
    """
    df = df.copy() # create a copy of the DataFrame
    
    try:
        
        df['time_elapsed'] = pd.to_numeric(df['periodTime'], errors='coerce')  # Convert to numeric, handle errors

        # Calculate total seconds for each period:
        df['period_seconds'] = (df['period'] - 1) * 1200  # 1200 seconds per period (20 minutes)

        # Calculate total time:
        df['total_time'] = df['period_seconds'] + df['time_elapsed']

        df = df.drop(columns=['period_seconds'])  # Clean up unnecessary column

        return df
    
    except (TypeError, ValueError) as e:
        print(f"Error during time calculation: {e}")
        return df

# add the total time column to the shots dataframe
shots = calculate_total_time(shots)

Next we can plot a heatmap showing the frequency of shots in different minutes of the game. For viewing simplicity, we constrain our results to regular time (0-60 min), and convert seconds to minutes. 

In [None]:
# Heatmap for shot frequency over time of game
# Create bins for time intervals 
bin_size = 60  # Seconds per bin 
bins = range(0, 3601, bin_size)  # Create the bin edges 0 - 60 minutes
shots['time_bin'] = pd.cut(shots['total_time'], bins=bins, right=False, labels=False) # Assign each shot to a time bin

# Aggregate shot counts by time bin:
shot_counts = shots.groupby('time_bin').size().reset_index(name='shot_count')

# Convert time_bin back to seconds for plotting:
shot_counts['time_bin_seconds'] = shot_counts['time_bin'] * bin_size

# Create the heatmap:
plt.figure(figsize=(10, 6)) 

# Convert seconds to minutes for the x-axis labels 
shot_counts['time_bin_minutes'] = shot_counts['time_bin_seconds'] / 60

sns.heatmap(shot_counts.pivot_table(index=None, columns='time_bin_minutes', values='shot_count', aggfunc='sum'), cmap="YlGnBu", linewidths=.5)  # Use pivot_table for heatmap
plt.xlabel("Game Time (minutes)")
plt.ylabel("Shot Count")
plt.title("Shot Frequency Heatmap (Regulation Time)")

plt.xticks(rotation=45) 
plt.tight_layout() 
plt.show()


![Shot Frequency Heatmap (Regulation Time)](../reports/figures/shot_frequency_heatmap_game_time.png)

The heatmap shows that shots are most likely to occur toward the end of periods. This makes sense, as teams may take on more risk to score when they feel the other team will not have enough time to transition to the other end of the ice and score. Additionally, we see that the second period tends to have more shots overall than the others. In hockey, this is known as the period of the "long change", where each team's bench is closer to the offensive zone than the defensive zone. This means that teams can more easily make substitutions while on offense than on defense. If a team possesses the puck for an extended time in the offensive zone, the opponent is more likely to be stuck with tired defenders on the ice, which results in even more shots.
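
The per-period comparison can be checked by summing the one-minute bins into 20-minute periods. A minimal sketch, assuming a list of 60 per-minute counts in game order; `shots_per_period` is a hypothetical helper and the counts below are made up for illustration.

```python
def shots_per_period(minute_counts, minutes_per_period=20):
    """Sum consecutive per-minute counts into per-period totals."""
    return [
        sum(minute_counts[start:start + minutes_per_period])
        for start in range(0, len(minute_counts), minutes_per_period)
    ]

# Made-up per-minute shot counts for a 60-minute regulation game
minute_counts = [10] * 20 + [12] * 20 + [11] * 20
period_totals = shots_per_period(minute_counts)
print(period_totals)
```

With the notebook's data, the same idea could be applied to the per-minute shot counts once they are sorted by time bin.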



In [None]:
# Histogram of shot count frequency by game time elapsed
plt.figure(figsize=(10, 6))
plt.bar(shot_counts['time_bin_minutes'], shot_counts['shot_count'], width=bin_size/60)  # Use bin_size to determine bar width
plt.axhline(shot_counts["shot_count"].mean(), color='black', linestyle='--')
plt.xlabel("Game Time (minutes)")
plt.ylabel("Shot Count")
plt.title("Shot Frequency Histogram (Regulation Time)")
plt.xticks(rotation=45)  # Rotate x-axis labels
plt.ylim(8000, 13500)
plt.legend(['Average', 'Shot Count'])
plt.tight_layout()
plt.show()

![Shot Frequency Histogram (Regulation Time)](../reports/figures/shot_frequency_histogram.png)

Above, we visualize the same data using a histogram instead. This better captures the variation, and restricting the y-axis lets us zoom in on the range of interest to highlight that variation.

We can repeat our histogram plot for goals, as well.
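
The goal plot relies on a `goal_counts` table whose construction is not shown in this section. The following is a minimal sketch of how it could be built, assuming it mirrors `shot_counts` and that goal rows carry the same `total_time` column in seconds. The helper name `count_events_by_minute` and the toy data are hypothetical, for illustration only.

```python
import pandas as pd


def count_events_by_minute(events: pd.DataFrame, count_name: str,
                           bin_size: int = 60) -> pd.DataFrame:
    """Count events per time bin over regulation time (0-3600 seconds)."""
    bins = list(range(0, 3601, bin_size))  # bin edges, left-closed
    time_bin = pd.cut(events["total_time"], bins=bins, right=False, labels=False)
    counts = (
        events.assign(time_bin=time_bin)
        .dropna(subset=["time_bin"])  # events outside regulation fall in no bin
        .groupby("time_bin")
        .size()
        .reset_index(name=count_name)
    )
    # Express the start of each bin in minutes for plotting
    counts["time_bin_minutes"] = counts["time_bin"] * bin_size / 60
    return counts


# Made-up goal times in seconds; the last one is in overtime and is dropped
toy_goals = pd.DataFrame({"total_time": [30, 45, 3590, 3599, 3700]})
toy_goal_counts = count_events_by_minute(toy_goals, "goal_count")
print(toy_goal_counts)
```

With the real data, the same helper could be applied to the rows of `game_plays` where `event` is `'Goal'`.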

In [None]:
# Histogram of goal count frequency by game time elapsed
fig, ax = plt.subplots(figsize=(10,6))
ax.bar(goal_counts['time_bin_minutes'], goal_counts['goal_count'], width=bin_size/60, color="tab:orange")  # Use bin_size to determine bar width
plt.axhline(goal_counts["goal_count"].mean(), color='black', linestyle='--')
ax.annotate("empty net goals", xy=(58, 5150), xytext=(45, 5000), va="top", ha="left", arrowprops=dict(facecolor='black', shrink=0.05))
plt.xlabel("Game Time (minutes)")
plt.ylabel("Goal Count")
plt.title("Goal Frequency Histogram (Regular Time)")
plt.xticks(rotation=45)  # Rotate x-axis labels if needed
plt.legend(['Average', 'Goal Count'])
plt.tight_layout()
plt.show()

![Goal Frequency Histogram (Regular Time)](../reports/figures/goal_frequency_histogram.png)

The distribution of goals is heavily skewed toward the final minutes of the game. This is most likely due to the number of empty net goals scored at the end of a game. When a team is only down by 1 or 2 goals, it behooves them to pull their goaltender from the net, allowing them to place an extra skater on the ice to play offense and attempt to even up the score. There is no penalty for losing by more goals, but there is a benefit to tying the game and proceeding to overtime, or potentially winning the game in regulation. The goals scored into those empty nets are what concentrate scoring at the end of the 3rd period.

Aside from the spike in goals at the end of the game, we see similar trends to shots, where there are more goals scored overall in the second period, and goal scoring is more common at the ends of periods.
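
To put a number on the per-period comparison, the per-minute counts can be rolled up into periods. This is a small sketch on made-up counts, assuming 20-minute periods and a `time_bin_minutes` column like the one used for the plots above; the values are illustrative, not results from the dataset.

```python
import pandas as pd

# Made-up per-minute counts standing in for shot_counts or goal_counts
toy_counts = pd.DataFrame({
    "time_bin_minutes": [0.0, 19.0, 20.0, 39.0, 40.0, 59.0],
    "goal_count": [3, 4, 5, 6, 4, 9],
})

# Minutes 0-19 belong to period 1, 20-39 to period 2, 40-59 to period 3
toy_counts["period"] = (toy_counts["time_bin_minutes"] // 20 + 1).astype(int)

per_period = toy_counts.groupby("period")["goal_count"].sum()
print(per_period)
```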

#### Variation through Seasons

To add the season data, we must first merge in the season column from the games dataframe.
To keep things simple and clean, we relabel each season by the year in which it ends. This also makes filtering easier (e.g. with 'season' > 2010).

In [None]:
game_plays = pd.merge(game_plays, games[['game_id', 'season']], how="left", on="game_id")
season_replacements = {
    20002001: 2001, 
    20012002: 2002,
    20022003: 2003, 
    20032004: 2004, 
    20052006: 2006,
    20062007: 2007,
    20072008: 2008, 
    20082009: 2009, 
    20092010: 2010,
    20102011: 2011,
    20112012: 2012,
    20122013: 2013, 
    20132014: 2014,
    20142015: 2015,
    20152016: 2016,
    20162017: 2017, 
    20172018: 2018,
    20182019: 2019,
    20192020: 2020
}

game_plays['season'] = game_plays['season'].replace(season_replacements)
shots = game_plays.loc[game_plays['event'] == 'Shot']
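
As an aside, the same relabelling can be done arithmetically rather than with a lookup table. The sketch below assumes the `season` column holds integers in the eight-digit form shown above; the toy series is illustrative.

```python
import pandas as pd

# Season codes in the same eight-digit format as the games table
toy_seasons = pd.Series([20002001, 20032004, 20052006, 20192020])

# The last four digits are the year in which the season ends
season_end_year = toy_seasons % 10000
print(season_end_year.tolist())  # [2001, 2004, 2006, 2020]
```

If the column were stored as strings instead, slicing the last four characters and converting to integers would achieve the same result.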

#### Shot trends

Now that we have our season data, we can take a look at trends in shot totals. Below, we filter to seasons after 2010. Earlier seasons have relatively few shots recorded, with only the goal counts being reliable, so they would appear as outliers.

In [None]:
# filter to years after 2010 due to missing data
total_shots_after_2010 = shots[shots['season'] > 2010]

# create a function for trendline
df = total_shots_after_2010.groupby('season')['play_id'].count().reset_index()
z = np.polyfit(df['season'], df['play_id'], 1)  # 1 for linear, 2 for quadratic, etc.
p = np.poly1d(z)

In [None]:
# create plot with shots data and trendline
ax = total_shots_after_2010.groupby('season')['play_id'].count().plot(kind="line")
ax.annotate("Lockout shortened season", xy=(2013, 43250), xytext=(2014, 43250), arrowprops=dict(arrowstyle='->'))
plt.plot(df['season'], p(df['season']))  # evaluate the trendline once per season rather than once per shot
plt.xlabel('Season')
plt.ylabel('Total Shots')
plt.title('Total Shots per Season (2011-2020)')
plt.legend(['Total Shots', 'Trendline'])
plt.tight_layout()
plt.show()

![Total Shots per Season (2011-2020)](../reports/figures/shots_by_season_trend.png)

We have one outlier season, 2013, in which only a little over half the schedule (48 of the usual 82 games per team) was played due to a lockout. Otherwise, shot totals trend upward, which suggests an increased pace of play in more recent years. Next we can take a look at the difference in counts by shot type for each season.
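
One way to keep a shortened season from distorting the trend is to compare shots per game rather than season totals. This is a minimal sketch on made-up tables, assuming the games table has a `game_id` and `season` for each game and that each shot row carries both columns as well.

```python
import pandas as pd

# Made-up stand-ins for the games and shots tables
toy_games = pd.DataFrame({
    "game_id": [1, 2, 3, 4, 5],
    "season": [2012, 2012, 2012, 2013, 2013],
})
toy_shots = pd.DataFrame({
    "game_id": [1, 1, 2, 3, 3, 3, 4, 4, 5, 5],
    "season": [2012, 2012, 2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
})

# nunique guards against duplicated game rows
games_per_season = toy_games.groupby("season")["game_id"].nunique()
shots_per_season = toy_shots.groupby("season").size()

# Both Series are indexed by season, so the division aligns on it
shots_per_game = shots_per_season / games_per_season
print(shots_per_game)
```

In this toy example the season totals differ (6 shots against 4) while the per-game rate is identical, which is the distinction a shortened season would otherwise hide.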

In [None]:
# create a plot of shot type counts for each season
ax = total_shots_after_2010.groupby(['season', 'secondaryType'])['play_id'].count().unstack().plot(kind="line", figsize=(16, 8))
ax.annotate("Lockout shortened season", xy=(2013, 25000), xytext=(2012.4, 40000), arrowprops=dict(arrowstyle='->'))
plt.title('Shots by Type, Season (2011 - 2020)')
plt.xlabel('Season')
plt.ylabel('Count of Shots')
plt.legend(title="Shot type")
plt.tight_layout()
plt.show()

![Shots by Type, Season (2011-2020)](../reports/figures/shots_by_type_season.png)

We see that slap shots have decreased while wrist shots have increased. Snap shots show a small upward trend, followed by a decline in recent years. It may be more illuminating to look at shot types as a percentage of total shots rather than as raw counts, since the adjusted scale should better highlight shifts in share between shot types.

In [None]:
# Group by season and shot type, count occurrences
grouped_shots = total_shots_after_2010.groupby(['season', 'secondaryType']).size().unstack(fill_value=0)

# Calculate percentages within each year
grouped_shots = grouped_shots.div(grouped_shots.sum(axis=1), axis=0) * 100

ax = grouped_shots.plot(kind="line")
plt.title('Percentage of Shots by Shot type, Season (2011 - 2020)')
plt.xlabel('Season')
plt.ylabel('Percentage of Shots Taken')
plt.legend(title="Shot type", bbox_to_anchor=(1.0, 0.6), loc="right")
plt.tight_layout()
plt.show()

![Percentage of Shots by Shot Type (2011-2020)](../reports/figures/percentage_of_shots_by_shot_type_season.png)

This plot better shows that wrist shots have markedly increased, taking over share from snap and slap shots. Other shot types remain mostly flat.

#### Goal Trends

Next, we can repeat a similar analysis for goals. Note we don't have to constrain the seasons for raw goal counts, but we will for shot types, as that stat was not reliably tracked until 2011.

In [None]:
# filter to goals
goals = game_plays.loc[game_plays['event'] == 'Goal']

# create a function to plot the trendline
df = goals.groupby('season')['play_id'].count().reset_index()
z = np.polyfit(df['season'], df['play_id'], 1)  # 1 for linear, 2 for quadratic, etc.
p = np.poly1d(z)

# Create a plot with goals data and trendline
ax = goals.groupby('season')['play_id'].count().plot(kind="line") # plot goals data
ax.annotate("Lockout shortened season", xy=(2013, 4450), xytext=(2004, 4750), arrowprops=dict(arrowstyle='->'))
plt.plot(df['season'], p(df['season']))  # evaluate the trendline once per season rather than once per goal
plt.xlabel('Season')
plt.ylabel('Count of Goals')
plt.title("Goals per Season (2001 - 2020)")
plt.xticks(np.arange(2000, 2021, 2))
plt.legend(['Total Goals', 'Trendline'])
plt.tight_layout()
plt.show()

![Goals per Season (2001 - 2020)](../reports/figures/goals_by_season_trend.png)

As expected from the general upward trend we saw in shots, goals are also increasing modestly over time. This tells us that games are moving toward higher scores in recent years, and likely a higher pace of play.

In [None]:
goals_after_2010 = goals[goals['season'] > 2010]
ax = goals_after_2010.groupby(['season', 'secondaryType'])['play_id'].count().unstack().plot(kind="line", figsize=(16, 8))
ax.annotate("Lockout shortened season", xy=(2013, 2450), xytext=(2012.4, 4000), arrowprops=dict(arrowstyle='->'))
plt.title('Goals by Shot type, Season (2011 - 2020)')
plt.xlabel('Season')
plt.ylabel('Count of Goals')
plt.legend(title="Shot type")
plt.tight_layout()
plt.show()

![Goals by Shot type, Season (2011 - 2020)](../reports/figures/goals_by_shot_type_season.png)

In [None]:
goals_after_2010 = goals[goals['season'] > 2010]

# Group by season and shot type, count occurrences
grouped_goals = goals_after_2010.groupby(['season', 'secondaryType']).size().unstack(fill_value=0)

# Calculate percentages within each year
grouped_goals = grouped_goals.div(grouped_goals.sum(axis=1), axis=0) * 100

ax = grouped_goals.plot(kind="line")
plt.title('Percentage of Goals by Shot type, Season (2011 - 2020)')
plt.xlabel('Season')
plt.ylabel('Percentage of Goals Scored')
plt.legend(title="Shot type", bbox_to_anchor=(1.0, 0.6), loc="right")
plt.tight_layout()
plt.show()

![Percentage of Goals by Shot Type, Season (2011 - 2020)](../reports/figures/percentage_of_goals_by_shot_type_season.png)

As with shots, we see similar trends in the shot types that produce goals, although they are far less pronounced. We see more goals coming from wrist shots and fewer coming from slap shots. 

#### Location Density

We can repeat the x and y coordinate correction we performed in the baseline analysis to set ourselves up to plot the location density for each season. For these plots, we will constrain ourselves to seasons after 2010, as we have already observed that this is when data collection became more reliable.

In [None]:
# Add count columns for counting goals and shots, to be used in density based visualizations later on.
game_plays['goal'] = (game_plays['event'] == "Goal").astype(int)
game_plays['shot'] = (game_plays['event'] == "Shot").astype(int)

# Correct the x and y coordinates to occupy only one side of the ice
game_plays['xC'] = np.abs(game_plays['x'])
game_plays['yC'] = game_plays['y'] * np.sign(game_plays['x'])

# filter for goals and missing coordinates, season after 2010
goals = game_plays.loc[
    ~game_plays["xC"].isna() &
    ~game_plays["yC"].isna() &
    (game_plays['event'].isin(["Goal"])) &
    (game_plays['season'] > 2010)
]
# keep only columns we need for performance in drawing density plots
goals = goals[['xC', 'yC', 'goal','season', 'secondaryType']]

# repeat for shots
# filter for shots and missing coordinates, season after 2010
shots = game_plays.loc[
    ~game_plays["xC"].isna() &
    ~game_plays["yC"].isna() &
    (game_plays['event'].isin(["Shot"])) &
    (game_plays['season'] > 2010)
]
# keep only columns we need for performance in drawing density plots
shots = shots[['xC', 'yC', 'shot', 'season', 'secondaryType']]
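
To make the coordinate correction concrete, here is a small sketch on made-up points showing what the two transformations do. Events recorded at negative `x` are rotated 180 degrees about center ice, so that all events end up on the same half of the rink.

```python
import numpy as np
import pandas as pd

# Made-up event locations on both halves of the ice
toy = pd.DataFrame({"x": [60.0, -60.0, -80.0], "y": [10.0, 10.0, -5.0]})

# Same correction as in the cell above
toy["xC"] = np.abs(toy["x"])
toy["yC"] = toy["y"] * np.sign(toy["x"])
print(toy)
```

The first point is unchanged, while (-60, 10) maps to (60, -10) and (-80, -5) maps to (80, 5). Note that an event recorded exactly at `x = 0` would get `yC = 0`, since `np.sign(0)` is 0.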

In [None]:
seasons = sorted(list(shots['season'].unique()))

In [None]:
nhl = NHLRink()
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(20, 12),  
                         gridspec_kw={'hspace': 0.4, 'wspace': 0.3})

# Only 10 subplots are needed, remove two from the bottom right
fig.delaxes(axes[3,2])
fig.delaxes(axes[3,1])
axes = axes.flatten()[:10]

for i, ax in enumerate(axes):
    # select the shots for the season plotted in this loop; axes and seasons have the same length, so i indexes both
    filtered_shots = shots[shots['season'] == seasons[i]]
    
    # display offensive zone where goal is on bottom of image
    nhl.draw(ax, display_range="ozone", rotation=270) 
    
    # plot the heatmap
    heat = nhl.heatmap(
    filtered_shots["xC"],
    filtered_shots["yC"],
    values = filtered_shots["shot"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    plot_xlim=(25, 89),
    binsize=3, 
    alpha=0.85
    )
    plt.colorbar(heat, ax=ax) # add a colorbar
    ax.set_title(f"{seasons[i]} Shot Locations") # title the axes
    
fig.suptitle("Shot Location Density by Season", fontsize=24)
plt.show()

![Shot Location Density by Season (2011 - 2020)](../reports/figures/shot_density_by_season.png)

In [None]:
nhl = NHLRink()
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(20, 12),  
                         gridspec_kw={'hspace': 0.4, 'wspace': 0.3})

# Only 10 subplots are needed, remove two from the bottom right
fig.delaxes(axes[3,2])
fig.delaxes(axes[3,1])
axes = axes.flatten()[:10]

for i, ax in enumerate(axes):
    # select the goals for the season plotted in this loop; axes and seasons have the same length, so i indexes both
    filtered_goals = goals[goals['season'] == seasons[i]]
    
    # display offensive zone where goal is on bottom of image
    nhl.draw(ax, display_range="ozone", rotation=270) 
    
    # plot the heatmap
    heat = nhl.heatmap(
    filtered_goals["xC"],
    filtered_goals["yC"],
    values = filtered_goals["goal"],
    cmap = "coolwarm",
    ax = ax,
    plot_range="ozone",
    plot_xlim=(25, 89),
    binsize=3, 
    alpha=0.85
    )
    plt.colorbar(heat, ax=ax) # add a colorbar
    ax.set_title(f"{seasons[i]} Goal Locations") # title the axes
    
fig.suptitle("Goal Location Density by Season", fontsize=24)
plt.show()

![Goal Location Density by Season (2011 - 2020)](../reports/figures/goal_density_by_season.png)

We don't see much difference in shot and goal locations between seasons. They fluctuate very slightly, typically with respect to distance from the goal, but show no strong trend in any one direction, such as moving closer to or farther from the goal.
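
A simple numeric check of this visual impression would be the mean shot distance from the net in each season. The sketch below uses made-up coordinates and assumes the goal line sits at `x = 89`, which matches the upper `plot_xlim` used for the heatmaps. The constant name `NET_X` is hypothetical.

```python
import numpy as np
import pandas as pd

NET_X = 89  # assumed x-coordinate of the goal line

# Made-up corrected shot coordinates for two seasons
toy_shots = pd.DataFrame({
    "season": [2011, 2011, 2012, 2012],
    "xC": [89.0, 86.0, 85.0, 89.0],
    "yC": [5.0, 4.0, 3.0, 0.0],
})

# Straight-line distance from each shot to the center of the goal line
toy_shots["distance"] = np.hypot(NET_X - toy_shots["xC"], toy_shots["yC"])

mean_distance = toy_shots.groupby("season")["distance"].mean()
print(mean_distance)
```

A flat series of seasonal means on the real data would support the reading that shot locations have not drifted over time.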

## Conclusion

By studying this NHL dataset and diving deep into shots and goals, we learned a lot about strategy and scoring at the highest level of hockey. Players prefer wrist shots by far; they account for about 50% of all shots taken. In spite of this preference, deflections and tip-ins are the shot types with the highest success rate. The farther a player is from the goal, and the farther from the center of the ice in the y-direction, the less likely they are to score. While players shoot from all over the offensive zone, they score the most from nearest the net. This raises the question of why players don't just exclusively shoot from there.

We also inspected the data over time for trends. Looking within games, we found that more shots are taken and more goals are scored in the second period. This is possibly explained by the "long change", meaning that defensive players are farther from their bench when making substitutions, and thus get stuck defending for long periods of time without a break. We also found that more goals are scored at the ends of periods, with the end of the third period standing out by a wide margin. This is most likely due to the empty net goals that are scored when one team is down and pulls their goalie in favor of playing 6 v 5, hoping to even the score. When studying trends between seasons, we found that more shots are being taken and more goals are being scored, suggesting a higher pace of play. We also note a shift toward even more wrist shots and fewer slap shots. We do not, however, note any key trends over time in where shots are taken from.



### Future Work
Using the analysis we have performed here, we anticipate there would be several applications for the knowledge developed. Visualizations like the shot charts produced for certain players and teams could be produced for coaches and inform their team strategy. With the insights about success of shots, we could also build a prediction model that tells us the likelihood of scoring a goal, given shot type, location, and player parameters. This might be useful in a coaching or gambling context, where simulations are performed to predict scores. Coaches could tune parameters to predict the effect on their chances of winning.

## References

1. Ellis, M., 2019. NHL Game Data [Online]. Available at: https://www.kaggle.com/datasets/martinellis/nhl-game-data/data 

2. Hockey Reference, 2024. NHL All-Star Game History & Statistics [Online]. Available at: https://www.hockey-reference.com/allstar/ 

3. Python Software Foundation, 2023. Python (Version 3.11). Available at: https://www.python.org

4. Harris, C.R. et al., 2020. Array programming with NumPy. Nature, 585, pp.357–362. Available at: https://doi.org/10.1038/s41586-020-2649-2 

5. McKinney, W. & others, 2010. Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference, pp. 51–56. Available at: https://conference.scipy.org/proceedings/2010/mckinney.html 

6. Hunter, J.D., 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), pp.90–95. Available at: https://doi.org/10.1109/MCSE.2007.55 

7. Waskom, M.L., 2021. Seaborn: statistical data visualization. Journal of Open Source Software, 6(60), pp.3021. Available at: https://doi.org/10.21105/joss.03021 