# "NHL Player Analysis"
> "From time to time, I see things on Instagram and they inspire me to take a look at the stats."

- toc: false
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter, sports, NHL, data, analytics, hockey]
- hide: false
- search_exclude: true
- author: Parth Dholakia

In [2]:
#hide
import pandas as pd
import altair as alt
from altair import datum
import numpy as np

import os
import glob
from matplotlib import pyplot as plt
from matplotlib import dates as mpl_dates
from datetime import datetime , timedelta

In [3]:
#hide
#Pick up player base statistics into a dataframe

NHL2021BaseURL = 'https://www.hockey-reference.com/leagues/NHL_2021_skaters.html'

NHL2021BaseStats = pd.read_html(NHL2021BaseURL)[0]

NHL2021BaseStats = NHL2021BaseStats.droplevel(0, axis =1)

NHL2021BaseStats.drop_duplicates(inplace = True)

NHL2021BaseStats.head()

Unnamed: 0,Rk,Player,Age,Tm,Pos,GP,G,A,PTS,+/-,...,SH,S,S%,TOI,ATOI,BLK,HIT,FOW,FOL,FO%
0,1,Noel Acciari,29,FLA,C,10,0,0,0,0,...,0,11,0.0,149,14:54,16,26,46,56,45.1
1,2,Andrew Agozzino,30,ANA,LW,1,0,0,0,0,...,0,1,0.0,16,16:00,6,3,10,8,55.6
2,3,Sebastian Aho,23,CAR,F,10,2,7,9,0,...,0,21,9.5,200,19:57,4,4,67,78,46.2
3,4,Mark Alt,29,LAK,D,2,0,0,0,0,...,0,3,0.0,27,13:35,1,2,0,0,
4,5,Michael Amadio,24,LAK,C,11,0,2,2,-3,...,0,11,0.0,143,12:58,6,1,42,40,51.2


In [4]:
#hide
filter1 = NHL2021BaseStats["Player"].isin(["Auston Matthews",
                      "John Tavares",
                      "Connor McDavid",
                      "Leon Draisaitl",
                      "Patrick Kane",
                      "Drew Doughty",
                      "Anze Kopitar",
                      "Artemi Panarin",
                      "Jacob Trouba"
                     ])

FilteredBasePlayers = NHL2021BaseStats[filter1].reset_index()

FilteredBasePlayers = FilteredBasePlayers.drop(columns = ["index"])

colstoconvert = ["G",
                 "A",
                 "+/-",
                 "PS"
                ]

FilteredBasePlayers[colstoconvert] = FilteredBasePlayers[colstoconvert].apply(pd.to_numeric, errors = "coerce")

NHL2021BaseStats[colstoconvert] = NHL2021BaseStats[colstoconvert].apply(pd.to_numeric, errors = "coerce")

I saw a post on Instagram from an account (@talkinghockey) that highlighted the following combinations of NHL players that are among the highest paid:


After seeing this, I wanted to look at the actual numbers for these players to see which team is getting the most bang for its buck. A few things I want to look at:
* Overall points per player group
* Impact by their Time On Ice
* Exploring hockey's advanced stats (Corsi and Fenwick)
* Team impact (how has their team performed since this player group started playing together?)
* Relative impact by contract value

In [5]:
#hide

NHL2021GoalsAverage = np.mean(NHL2021BaseStats["G"])
NHL2021AssistsAverage = np.mean(NHL2021BaseStats["A"])
NHL2021PlusMinusAverage = np.mean(NHL2021BaseStats["+/-"])
NHL2021PointSharesAverage = np.mean(NHL2021BaseStats["PS"])

In [6]:
#hide
print(NHL2021PlusMinusAverage)

-0.06275033377837116


In [7]:
#hide
print(NHL2021PointSharesAverage)

0.4627503337783712


In [8]:
#hide
path = '/Users/parthdholakia/Code/first-python-notebook'

os.chdir(path)

FilteredBasePlayers.to_csv("FilteredBasePlayers.csv", index=False, encoding='utf-8-sig')

In [9]:
#hide
ChartDF = pd.read_csv('FilteredBasePlayers.csv')

ChartDF.head()

Unnamed: 0,Rk,Player,Age,Tm,Pos,GP,G,A,PTS,+/-,...,SH.1,S,S%,TOI,ATOI,BLK,HIT,FOW,FOL,FO%
0,160,Drew Doughty,31,LAK,D,12,3,5,8,-5,...,0,21,14.3,323,26:55,23,15,0,0,
1,163,Leon Draisaitl,25,EDM,C,15,8,17,25,11,...,0,43,18.6,333,22:10,7,11,173,122,58.6
2,343,Patrick Kane,32,CHI,RW,14,6,11,17,0,...,0,52,11.5,308,21:58,7,3,0,0,
3,368,Anze Kopitar,33,LAK,C,12,3,12,15,0,...,0,28,10.7,265,22:05,8,13,137,116,54.2
4,449,Auston Matthews,23,TOR,C,12,11,4,15,6,...,0,52,21.2,264,21:58,11,19,116,101,53.5


The first thing we'll look at is how these players compare on goals and assists in 2021. You can interact with the chart in the following ways:

* Click a player to pin their name (shift-click to select several)
* Drag to pan and scroll to zoom
* Hover over a player to see their name

The colors represent each player's team, which makes it easy to tell the player groups apart.

In [10]:
#hide_input

hover = alt.selection_single(
    on='mouseover',  # select on mouseover
    nearest=True,    # select nearest point to mouse cursor
    empty='none'     # empty selection should match nothing
)

click = alt.selection_multi(
    empty='none' # empty selection matches no points
)

# scatter plot encodings shared by all marks
plot = alt.Chart(ChartDF).mark_circle().encode(
    x=alt.X('G:Q', axis=alt.Axis(title='Goals')),
    y=alt.Y('A:Q', axis=alt.Axis(title='Assists')),
    color= alt.Color("Tm", legend = alt.Legend(title = "Team")),
    tooltip=['G:Q', 'A:Q']
).properties(
    title = "2021 Goals and Assists Comparison"
) 
    
# shared base for new layers
base = plot.transform_filter(
    hover | click # filter to points in either selection
)

# layer scatter plot points, halo annotations, and title labels
alt.layer(
    plot.add_selection(hover).add_selection(click).add_selection(alt.selection_interval(bind='scales')),
    base.mark_point(size=100, stroke='firebrick', strokeWidth=1),
    base.mark_text(dx=4, dy=-8, align='right', stroke='white', strokeWidth=2).encode(text='Player:N'),
    base.mark_text(dx=4, dy=-8, align='right').encode(text='Player:N'),
    data=ChartDF
).properties(
    width=600,
    height=550
)

What are we seeing here? A few things quickly stand out from Goals and Assists alone:
* Edmonton's duo is outpacing everyone, putting up 17 G and 35 A, and looks like it will be off the chart soon
* It's slightly unfair to Patrick Kane, who is currently solely responsible for representing Chicago here. Given Jonathan Toews' absence, we may not see a fair representation of that duo. Despite that, Patrick Kane is still putting up 17 PTS
* We have two defensemen in this group, Drew Doughty and Jacob Trouba, who won't contribute to the offense in the same way as the rest of the group



In [11]:
#hide_input

hover = alt.selection_single(
    on='mouseover',  # select on mouseover
    nearest=True,    # select nearest point to mouse cursor
    empty='none'     # empty selection should match nothing
)

click = alt.selection_multi(
    empty='none' # empty selection matches no points
)

# scatter plot encodings shared by all marks
plot = alt.Chart(ChartDF).mark_circle().encode(
    x=alt.X('+/-:Q', axis=alt.Axis(title='Plus Minus')),
    y=alt.Y('PS:Q', axis=alt.Axis(title='Point Shares')),
    tooltip=['+/-:Q', 'PS:Q'],
    color= alt.Color("Tm", legend = alt.Legend(title = "Team"))
).properties(
    title = "2021 Plus Minus and Point Shares Comparison"
) 
    
# shared base for new layers
base = plot.transform_filter(
    hover | click # filter to points in either selection
)

# layer scatter plot points, halo annotations, and title labels
alt.layer(
    plot.add_selection(hover).add_selection(click).add_selection(alt.selection_interval(bind='scales')),
    base.mark_point(size=100, stroke='firebrick', strokeWidth=1),
    base.mark_text(dx=4, dy=-8, align='right', stroke='white', strokeWidth=2).encode(text='Player:N'),
    base.mark_text(dx=4, dy=-8, align='right').encode(text='Player:N'),
    data=ChartDF
).properties(
    width=600,
    height=550
)

Before we get into insights from the chart above, let's quickly review what Plus Minus and Point Shares are. Both are statistics that try to attribute team success to an individual player.

**Plus Minus**

Plus/minus is calculated by subtracting the goals a player's team allows while the player is on the ice (at even strength or on the power play) from the goals the team scores while the player is on the ice (at even strength or short-handed). A positive plus minus means the team has outscored its opponents in those situations while the player was on the ice.
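
As a rough illustration of that definition, the sketch below tallies plus/minus from a list of on-ice goal events. The event format and the `plus_minus` function are assumptions made up for this example; the `+/-` column used in this post comes directly from hockey-reference.

```python
# Each event is (scored_by_players_team, situation), where situation is the
# manpower state of the player's own team when the goal was scored:
# "EV" (even strength), "PP" (power play) or "SH" (short-handed).
# This event format is hypothetical and only used for illustration.

def plus_minus(events):
    total = 0
    for scored_by_team, situation in events:
        if scored_by_team and situation in ("EV", "SH"):
            total += 1  # goal for at even strength or short-handed
        elif not scored_by_team and situation in ("EV", "PP"):
            total -= 1  # goal against at even strength or while on the power play
    return total

events = [
    (True, "EV"),   # counts as +1
    (True, "PP"),   # power-play goal for: not counted
    (False, "EV"),  # counts as -1
    (True, "SH"),   # counts as +1
    (False, "SH"),  # goal against while short-handed: not counted
    (False, "PP"),  # goal against while on the power play: counts as -1
    (True, "EV"),   # counts as +1
]

print(plus_minus(events))  # 1
```

Power-play goals for and short-handed goals against are skipped, which is why a player who spends a lot of time on special teams can have a plus minus that looks different from their point totals.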

**Point Shares**

Point shares is another attempt to attribute team success to individual players, by calculating offensive and defensive point shares for each player.

Offensive point shares take into account player goals created (Goals and Assists) compared to the team goals. 

Defensive point shares take into account player time on ice compared to team total time on ice, player plus minus and team goals against. 

You can find additional documentation on the calculation for this here: https://www.hockey-reference.com/about/point_shares.html


**Insights**

Let's be very clear here: we're looking at an elite group of players. The 2021 NHL average for plus minus is roughly 0 and the average for point shares is about 0.46, so this group sits well above the league average on point shares.

As we look at Plus Minus and Point Shares, similar insights stand out:
* The Edmonton players contribute the most points to their team, by a long shot. However, Connor McDavid is near even on plus minus
* Both of the LA Kings players are 0 or below on plus minus, likely a reflection of LA's defensive struggles so far this season
* Auston Matthews is clearly outpacing John Tavares in both plus minus and point shares

In [12]:
#hide
#Pick up player advanced statistics into a dataframe

NHL2021AdvancedURL = 'https://www.hockey-reference.com/leagues/NHL_2021_skaters-advanced.html'

NHL2021AdvancedStats = pd.read_html(NHL2021AdvancedURL)[0]

NHL2021AdvancedStats = NHL2021AdvancedStats.droplevel(0, axis =1)

NHL2021AdvancedStats.drop_duplicates(inplace = True)

NHL2021AdvancedStats.head()

Unnamed: 0,Rk,Player,Age,Tm,Pos,GP,CF,CA,CF%,CF% rel,...,PDO,oZS%,dZS%,TOI/60,TOI(EV),TK,GV,E+/-,SAtt.,Thru%
0,1,Noel Acciari,29,FLA,C,10,65,114,36.3,-19.3,...,102.2,26.6,73.4,14:54,11:42,4,3,0.7,17,64.7
1,2,Andrew Agozzino,30,ANA,LW,1,13,19,40.6,-18.1,...,100.0,50.0,50.0,16:00,15:33,0,1,0.6,3,33.3
2,3,Sebastian Aho,23,CAR,F,10,169,116,59.3,3.5,...,98.1,57.4,42.6,19:57,15:35,6,6,1.9,38,55.3
3,4,Mark Alt,29,LAK,D,2,15,21,41.7,2.3,...,100.0,60.0,40.0,13:35,12:09,0,0,-0.8,4,75.0
4,5,Michael Amadio,24,LAK,C,11,94,121,43.7,-2.5,...,97.0,54.5,45.5,12:58,10:22,0,6,-1.6,20,55.0


In [13]:
#hide
filter2 = NHL2021AdvancedStats["Player"].isin(["Auston Matthews",
                      "John Tavares",
                      "Connor McDavid",
                      "Leon Draisaitl",
                      "Patrick Kane",
                      "Drew Doughty",
                      "Anze Kopitar",
                      "Artemi Panarin",
                      "Jacob Trouba"
                     ])

FilteredAdvancedPlayers = NHL2021AdvancedStats[filter2].reset_index()

FilteredAdvancedPlayers = FilteredAdvancedPlayers.drop(columns = ["index"])

FilteredAdvancedPlayers.head()

Unnamed: 0,Rk,Player,Age,Tm,Pos,GP,CF,CA,CF%,CF% rel,...,PDO,oZS%,dZS%,TOI/60,TOI(EV),TK,GV,E+/-,SAtt.,Thru%
0,160,Drew Doughty,31,LAK,D,12,180,197,47.7,2.7,...,96.3,52.3,47.7,26:55,19:56,2,10,-2.2,49,42.9
1,163,Leon Draisaitl,25,EDM,C,15,224,239,48.4,0.4,...,109.5,48.3,51.7,22:10,15:57,10,14,-1.8,73,58.9
2,343,Patrick Kane,32,CHI,RW,14,191,239,44.4,-4.1,...,101.9,57.0,43.0,21:58,17:44,9,6,1.0,89,58.4
3,368,Anze Kopitar,33,LAK,C,12,165,164,50.2,6.0,...,99.9,50.5,49.5,22:05,15:47,2,10,-1.9,60,46.7
4,449,Auston Matthews,23,TOR,C,12,199,175,53.2,3.4,...,104.7,55.6,44.4,21:58,18:21,12,7,4.4,85,61.2


In [17]:
#hide

advancedcolstoconvert = ["CF%",
                 "FF%"
                ]

NHL2021AdvancedStats[advancedcolstoconvert] = NHL2021AdvancedStats[advancedcolstoconvert].apply(pd.to_numeric, errors = "coerce")


NHL2021CorsiAverage = np.mean(NHL2021AdvancedStats["CF%"])
NHL2021FenwickAverage = np.mean(NHL2021AdvancedStats["FF%"])



In [18]:
#hide
print(NHL2021CorsiAverage)

49.33238866396761


In [19]:
#hide
print(NHL2021FenwickAverage)

49.191093117408904


In [14]:
#hide
path = '/Users/parthdholakia/Code/first-python-notebook'

os.chdir(path)

FilteredAdvancedPlayers.to_csv("FilteredAdvancedPlayers.csv", index=False, encoding='utf-8-sig')

In [15]:
#hide
AdvancedChartDF = pd.read_csv('FilteredAdvancedPlayers.csv')

AdvancedChartDF.head()


Unnamed: 0,Rk,Player,Age,Tm,Pos,GP,CF,CA,CF%,CF% rel,...,PDO,oZS%,dZS%,TOI/60,TOI(EV),TK,GV,E+/-,SAtt.,Thru%
0,160,Drew Doughty,31,LAK,D,12,180,197,47.7,2.7,...,96.3,52.3,47.7,26:55,19:56,2,10,-2.2,49,42.9
1,163,Leon Draisaitl,25,EDM,C,15,224,239,48.4,0.4,...,109.5,48.3,51.7,22:10,15:57,10,14,-1.8,73,58.9
2,343,Patrick Kane,32,CHI,RW,14,191,239,44.4,-4.1,...,101.9,57.0,43.0,21:58,17:44,9,6,1.0,89,58.4
3,368,Anze Kopitar,33,LAK,C,12,165,164,50.2,6.0,...,99.9,50.5,49.5,22:05,15:47,2,10,-1.9,60,46.7
4,449,Auston Matthews,23,TOR,C,12,199,175,53.2,3.4,...,104.7,55.6,44.4,21:58,18:21,12,7,4.4,85,61.2


In [16]:
#hide_input

hover = alt.selection_single(
    on='mouseover',  # select on mouseover
    nearest=True,    # select nearest point to mouse cursor
    empty='none'     # empty selection should match nothing
)

click = alt.selection_multi(
    empty='none' # empty selection matches no points
)

# scatter plot encodings shared by all marks
plot = alt.Chart(AdvancedChartDF).mark_circle().encode(
    x=alt.X('CF%:Q', axis=alt.Axis(title='Corsi Percentage')),
    y=alt.Y('FF%:Q', axis=alt.Axis(title='Fenwick Percentage')),
    tooltip=['CF%:Q', 'FF%:Q'],
    color= alt.Color("Tm", legend = alt.Legend(title = "Team"))
).properties(
    title = "2021 Corsi and Fenwick Comparison"
) 
    
# shared base for new layers
base = plot.transform_filter(
    hover | click # filter to points in either selection
)

# layer scatter plot points, halo annotations, and title labels
alt.layer(
    plot.add_selection(hover).add_selection(click).add_selection(alt.selection_interval(bind='scales')),
    base.mark_point(size=100, stroke='firebrick', strokeWidth=1),
    base.mark_text(dx=4, dy=-8, align='right', stroke='white', strokeWidth=2).encode(text='Player:N'),
    base.mark_text(dx=4, dy=-8, align='right').encode(text='Player:N'),
    data=AdvancedChartDF
).properties(
    width=600,
    height=550
)

Corsi and Fenwick are similar to plus minus, but they count shot attempts instead of goals. Effectively, a high Corsi or Fenwick percentage indicates that your team is shooting the puck more than the other team while you're on the ice. A 50% score indicates that both teams are shooting the puck at the same rate.

The difference between Corsi and Fenwick is that Corsi takes into account blocked shots, while Fenwick does not. 
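
A minimal sketch of how the two percentages relate, using made-up shot-attempt counts. The `pct_for` helper and the `team` / `opponent` numbers are illustrative assumptions; the `CF%` and `FF%` values charted in this post come straight from hockey-reference.

```python
def pct_for(attempts_for, attempts_against):
    """Share of shot attempts taken by the player's team, as a percentage."""
    return 100 * attempts_for / (attempts_for + attempts_against)

# Hypothetical on-ice shot attempts while one player is on the ice, split by outcome
team = {"on_goal": 30, "missed": 12, "blocked": 18}
opponent = {"on_goal": 25, "missed": 10, "blocked": 5}

# Corsi counts every attempt, including the ones that were blocked
corsi_for = sum(team.values())
corsi_against = sum(opponent.values())
cf_pct = pct_for(corsi_for, corsi_against)

# Fenwick leaves blocked attempts out
fenwick_for = team["on_goal"] + team["missed"]
fenwick_against = opponent["on_goal"] + opponent["missed"]
ff_pct = pct_for(fenwick_for, fenwick_against)

print(round(cf_pct, 1), round(ff_pct, 1))  # 60.0 54.5

# Cross-check with a row from the hockey-reference advanced stats:
# Noel Acciari has CF = 65 and CA = 114, listed as CF% = 36.3
acciari_cf_pct = pct_for(65, 114)
print(round(acciari_cf_pct, 1))  # 36.3
```

In this made-up case the player's team has many of its attempts blocked, so the Fenwick percentage comes out lower than the Corsi percentage even though both are above 50%.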

The two players who are clearly below break-even on both scales are Drew Doughty and Patrick Kane. They both fall under the NHL average for each metric, which is about 49%.

Meanwhile, Connor McDavid is clearly a huge asset to his team, with about a 58% score on both scales. Both Maple Leafs players scored quite close to each other, at over 52%.


In [20]:
#hide

#Pick up 2021 salaries

NHL2021SalaryURL = 'https://www.hockey-reference.com/friv/current_nhl_salaries.cgi'

NHL2021Salaries = pd.read_html(NHL2021SalaryURL)[0]

NHL2021Salaries.drop_duplicates(inplace = True)

NHL2021Salaries.head()

Unnamed: 0,Player,Tm,Salary,Cap Hit
0,Auston Matthews,TOR,15900000.0,11640250.0
1,Mitch Marner,TOR,15000000.0,10903000.0
2,Connor McDavid,EDM,14000000.0,12500000.0
3,Artemi Panarin,NYR,13000000.0,11642857.0
4,Sergei Bobrovsky,FLA,12000000.0,10000000.0


In [53]:
#hide
BaseStatsSalary = ChartDF.merge(NHL2021Salaries, on = "Player")

In [54]:
#hide

BaseStatsSalary["Salary Per Point Share"] = (BaseStatsSalary["Salary"] / BaseStatsSalary["PS"]).round(1)

# Row order is left as-is here; the bar charts below sort on this column themselves
BaseStatsSalary.head()

Unnamed: 0,Rk,Player,Age,Tm_x,Pos,GP,G,A,PTS,+/-,...,ATOI,BLK,HIT,FOW,FOL,FO%,Tm_y,Salary,Cap Hit,Salary Per Point Share
0,160,Drew Doughty,31,LAK,D,12,3,5,8,-5,...,26:55,23,15,0,0,,LAK,10000000.0,11000000.0,8333333.0
1,163,Leon Draisaitl,25,EDM,C,15,8,17,25,11,...,22:10,7,11,173,122,58.6,EDM,9000000.0,8500000.0,2812500.0
2,343,Patrick Kane,32,CHI,RW,14,6,11,17,0,...,21:58,7,3,0,0,,CHI,7000000.0,10500000.0,3684211.0
3,368,Anze Kopitar,33,LAK,C,12,3,12,15,0,...,22:05,8,13,137,116,54.2,LAK,8000000.0,10000000.0,5714286.0
4,449,Auston Matthews,23,TOR,C,12,11,4,15,6,...,21:58,11,19,116,101,53.5,TOR,15900000.0,11640250.0,6625000.0


In [96]:
#hide_input

# Keep only the columns needed for the salary table
JustPlayerSalary = BaseStatsSalary[["Player", "Tm_x", "Salary"]].copy()

JustPlayerSalary["Salary"] = pd.to_numeric(JustPlayerSalary["Salary"], errors = "coerce")

# Sort on the numeric salary before formatting; sorting the formatted strings
# would order them alphabetically ("$10,000,000" ahead of "$7,000,000")
JustPlayerSalary = JustPlayerSalary.sort_values(by = ["Salary"], kind = "stable").reset_index(drop = True)

def format_salary(x):
    return "${:,.0f}".format(x)

JustPlayerSalary["Salary"] = JustPlayerSalary["Salary"].apply(format_salary)

JustPlayerSalary


Unnamed: 0,Player,Tm_x,Salary
0,Patrick Kane,CHI,"$7,000,000"
1,Anze Kopitar,LAK,"$8,000,000"
2,Leon Draisaitl,EDM,"$9,000,000"
3,Drew Doughty,LAK,"$10,000,000"
4,Jacob Trouba,NYR,"$10,000,000"
5,John Tavares,TOR,"$12,000,000"
6,Artemi Panarin,NYR,"$13,000,000"
7,Connor McDavid,EDM,"$14,000,000"
8,Auston Matthews,TOR,"$15,900,000"


In [98]:
#hide_input

alt.Chart(BaseStatsSalary).mark_bar().encode(
    x=alt.X('Player', sort=alt.EncodingSortField(field="Salary Per Point Share", order='descending')),
    y=alt.Y('Salary Per Point Share')
).properties(
    width=500,
    height=500
)

In [56]:
#hide
aggregation_functions = {"Tm_x" : "first",
                         "Salary": "sum",
                         "Salary Per Point Share": "sum",
                         "G": "sum",
                         "A": "sum"}

BaseStatsSalary_aggregated = BaseStatsSalary.groupby(BaseStatsSalary["Tm_x"]).aggregate(aggregation_functions)

BaseStatsSalary_aggregated.head()

Tm_x,Tm_x,Salary,Salary Per Point Share,G,A
CHI,CHI,7000000.0,3684211.0,6,11
EDM,EDM,23000000.0,7479167.0,17,35
LAK,LAK,18000000.0,14047620.0,6,17
NYR,NYR,23000000.0,21507940.0,5,12
TOR,TOR,27900000.0,15196430.0,16,11


In [97]:
#hide_input

alt.Chart(BaseStatsSalary_aggregated).mark_bar().encode(
    x=alt.X('Tm_x', sort=alt.EncodingSortField(field="Salary Per Point Share", order='descending') , axis=alt.Axis(title='Team')),
    y=alt.Y('Salary Per Point Share')
).properties(
    width=500,
    height=500
)
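
The team chart above adds up each player's individual salary per point share. Another way to look at a player group, sketched here in plain Python with made-up numbers, is to pool the group's salary and point shares first and then divide. The team codes, salaries and point shares below are hypothetical, and this is not how the chart in this post was built.

```python
from collections import defaultdict

# Hypothetical (team, salary, point shares) rows; not real player data
players = [
    ("AAA", 10_000_000, 2.0),
    ("AAA", 8_000_000, 1.0),
    ("BBB", 7_000_000, 2.0),
]

# Pool salary and point shares per team before dividing
totals = defaultdict(lambda: {"salary": 0.0, "ps": 0.0})
for team, salary, ps in players:
    totals[team]["salary"] += salary
    totals[team]["ps"] += ps

pooled = {team: t["salary"] / t["ps"] for team, t in totals.items()}

for team, cost in pooled.items():
    print(team, "${:,.0f}".format(cost))
```

For the made-up team AAA, summing the individual ratios would give 5,000,000 + 8,000,000 = 13,000,000, while the pooled figure is 6,000,000. The two approaches can rank groups differently, and pooling avoids favoring a group that has only one player in the data, as Chicago does here.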

# Summary

We took a look at a few different statistics to measure player performance and impact relative to current salary.