In [1]:
import pandas as pd

DATA = '/kaggle/input/puck-elo-archive-1917now/NHL_ELO_Odyssey.csv'
df = pd.read_csv(filepath_or_buffer=DATA, parse_dates=['date'])
df.shape

(66118, 24)

In [2]:
df.columns

Index(['season', 'date', 'playoff', 'neutral', 'status', 'ot', 'home_team',
       'away_team', 'home_team_abbr', 'away_team_abbr',
       'home_team_pregame_rating', 'away_team_pregame_rating',
       'home_team_winprob', 'away_team_winprob', 'overtime_prob',
       'home_team_expected_points', 'away_team_expected_points',
       'home_team_score', 'away_team_score', 'home_team_postgame_rating',
       'away_team_postgame_rating', 'game_quality_rating',
       'game_importance_rating', 'game_overall_rating'],
      dtype='object')

We want to plot each team's Elo rating over time, so let's gather the team and rating data into a single long-format DataFrame with one row per team per game.

In [3]:
home_df = df[['season', 'date', 'home_team', 'home_team_postgame_rating']].rename(columns={'home_team': 'team', 'home_team_postgame_rating': 'rating'})
away_df = df[['season', 'date', 'away_team', 'away_team_postgame_rating']].rename(columns={'away_team': 'team', 'away_team_postgame_rating': 'rating'})
elo_df = pd.concat(axis='index', objs=[home_df, away_df]).sort_values(by=['date'])
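The reshape in the cell above can be checked on a tiny made-up table. This is only a sketch: the team names and ratings are invented, and the `to_long` helper is a hypothetical name, not part of the notebook. It stacks the home and away columns the same way and confirms that every game yields two rows.

```python
import pandas as pd

# Toy stand-in for the real game table; teams and values are made up.
games = pd.DataFrame({
    'season': [2023, 2023],
    'date': pd.to_datetime(['2022-10-07', '2022-10-08']),
    'home_team': ['A', 'B'],
    'away_team': ['B', 'A'],
    'home_team_postgame_rating': [1510.0, 1495.0],
    'away_team_postgame_rating': [1490.0, 1505.0],
})


def to_long(games: pd.DataFrame) -> pd.DataFrame:
    """Stack home and away columns into one team/rating row per team per game."""
    parts = []
    for side in ('home', 'away'):
        team_col = f'{side}_team'
        rating_col = f'{side}_team_postgame_rating'
        part = games[['season', 'date', team_col, rating_col]]
        parts.append(part.rename(columns={team_col: 'team', rating_col: 'rating'}))
    # reset_index gives unique row labels; a plain concat repeats the original ones.
    return pd.concat(parts, axis='index').sort_values(by='date').reset_index(drop=True)


long_df = to_long(games)
print(long_df)
```

Resetting the index is a design choice of the sketch rather than something the notebook does; it only matters for label-based lookups later on.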

How many teams do we have in this dataset, and how many games did each team play?

In [4]:
from plotly import express
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)

# pandas >= 2.0 names the value_counts column 'count'
team_counts = elo_df['team'].value_counts().to_frame().reset_index()
express.bar(data_frame=team_counts, x='team', y='count').show(renderer='iframe_connected')


We have a lot of variance in the number of games each team has played. Let's try plotting and see how our plot looks.

We have a couple of problems to solve. One is that we have forty teams, so we need a palette with forty colors if we don't want to reuse any. The other is that at least one team, the Ottawa Senators, ceased operations and was later replaced by a team with the same name, and it doesn't make sense to draw a line that suggests continuity between the two. So let's start with a scatter plot.

In [5]:
from plotly import colors

palette = colors.sample_colorscale('HSV', 40)
express.scatter(data_frame=elo_df, x='date', y='rating', color='team', height=900,
                color_discrete_sequence=palette).show(renderer='iframe_connected')


What do we see? Well, the scatter plot sidesteps our Ottawa Senators problem, since no line joins the original franchise to the modern one. And because the palette is cyclical, both the oldest and the newest teams get reddish colors, which is arguably preferable to giving several of the older teams nearly identical colors.
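If we did want lines for the full history, one option would be to split each team name into separate eras wherever its schedule has a very long gap. The sketch below uses made-up data; the two-year `MAX_GAP` threshold and the `era` and `team_era` column names are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up ratings for one team name with a multi-decade gap in its history.
toy = pd.DataFrame({
    'team': ['X', 'X', 'X', 'X'],
    'date': pd.to_datetime(['1930-01-01', '1930-01-05', '1992-10-08', '1992-10-10']),
    'rating': [1480.0, 1485.0, 1400.0, 1398.0],
}).sort_values(by=['team', 'date']).reset_index(drop=True)

# Hypothetical threshold: a gap longer than this starts a new era.
MAX_GAP = pd.Timedelta(days=730)

# Time since the same team's previous game (NaT for each team's first game).
gap = toy.groupby('team')['date'].diff()

# Count the long gaps seen so far for each team: 0 for the first era, 1 for the next.
toy['era'] = (gap > MAX_GAP).astype(int).groupby(toy['team']).cumsum()
toy['team_era'] = toy['team'] + ' (' + toy['era'].astype(str) + ')'
print(toy)
```

Passing a column like `team_era` as `line_group` to `express.line` should then draw a separate line per era while `color='team'` keeps one color per name.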

Our data also suggests that the 1977-78 Montreal Canadiens were the best team of all time and that the 1975 Washington Capitals were the worst.
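Reading extremes off a dense scatter plot is error-prone, so a claim like this can be confirmed by looking up the rows that hold the highest and lowest ratings. A minimal sketch on made-up data follows; on the real `elo_df` the index would need to be reset first, because concatenating the home and away frames repeats row labels.

```python
import pandas as pd

# Made-up ratings; the real elo_df has the same team/date/rating columns.
toy = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['1978-04-01', '1979-04-01', '1975-03-01', '1976-03-01']),
    'rating': [1700.0, 1650.0, 1300.0, 1350.0],
}).reset_index(drop=True)  # unique labels so .loc returns exactly one row

best = toy.loc[toy['rating'].idxmax()]    # row with the highest rating
worst = toy.loc[toy['rating'].idxmin()]   # row with the lowest rating
print(best['team'], best['date'].date(), best['rating'])
print(worst['team'], worst['date'].date(), worst['rating'])
```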

Can our data tell us the highest Elo rating each team has reached?

In [6]:
elo_df[['team', 'rating']].groupby(by='team').max().reset_index().sort_values(ascending=False, by='rating').head(n=10)

                   team       rating
15   Montreal Canadiens  1715.979078
10    Detroit Red Wings  1661.199234
2         Boston Bruins  1659.146958
21   New York Islanders  1656.094704
4        Calgary Flames  1647.983237
11      Edmonton Oilers  1644.384855
25  Philadelphia Flyers  1638.661924
7    Colorado Avalanche  1633.950631
34  Tampa Bay Lightning  1629.655342
27  Pittsburgh Penguins  1621.853006


This gives us a sense of which teams have been the most dominant when they were at their best. We can do a similar analysis to see which teams have been the best on average over time.

In [7]:
elo_df[['team', 'rating']].groupby(by='team').mean().reset_index().sort_values(ascending=False, by='rating').head(n=10)

                    team       rating
37  Vegas Golden Knights  1543.359173
15    Montreal Canadiens  1538.329685
2          Boston Bruins  1527.531767
25   Philadelphia Flyers  1523.310394
7     Colorado Avalanche  1511.936955
10     Detroit Red Wings  1509.882833
4         Calgary Flames  1507.876158
3         Buffalo Sabres  1506.104804
11       Edmonton Oilers  1505.590172
22      New York Rangers  1505.506706


The two lists share seven of their ten teams. That's probably not surprising.
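The overlap between two top-N lists can be computed rather than counted by eye. This is a sketch on made-up ratings; `top_teams` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

# Made-up ratings: A has the highest peak, but B and C are better on average.
toy = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B', 'C', 'C'],
    'rating': [1600.0, 1300.0, 1550.0, 1540.0, 1460.0, 1450.0],
})


def top_teams(frame: pd.DataFrame, how: str, n: int) -> set:
    """Return the n teams ranked highest by the given aggregate ('max', 'mean', ...)."""
    return set(frame.groupby('team')['rating'].agg(how).nlargest(n).index)


# Teams that appear in both the top-2-by-peak and the top-2-by-average lists.
shared = top_teams(toy, 'max', 2) & top_teams(toy, 'mean', 2)
print(shared)
```

On the real data the same call with `n=10` and `elo_df` in place of `toy` would list the teams that appear in both tables.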

Let's just look at the last few years. We don't have the Ottawa Senators problem any more, so we can make a line plot. But because the X axis is the calendar date and the season doesn't run year round, we get an odd-looking graph: each team's line crosses a long gap every summer, and the ratings jump toward the mean during the offseason. The reason isn't entirely clear from the data alone, although Elo models often regress ratings toward the mean between seasons, which would produce this pattern.

In [8]:
recent_df = elo_df[elo_df['season'] > 2015].sort_values(by=['date', 'team']).reset_index()
palette = colors.sample_colorscale('HSV', 32)
express.line(data_frame=recent_df, x='date', y='rating', color='team', height=800,
             color_discrete_sequence=palette).show(renderer='iframe_connected')
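One way to remove the offseason gaps from a plot like this would be to use each team's running game count as the X axis instead of the calendar date. A minimal sketch on made-up data follows; the `game_number` column name is an assumption.

```python
import pandas as pd

# Made-up ratings spanning an offseason (April to October).
toy = pd.DataFrame({
    'team': ['A', 'B', 'A', 'B', 'A'],
    'date': pd.to_datetime(['2022-04-28', '2022-04-29', '2022-10-12',
                            '2022-10-13', '2022-10-15']),
    'rating': [1560.0, 1440.0, 1545.0, 1455.0, 1550.0],
}).sort_values(by=['team', 'date'])

# Number each team's games in date order; the months-long gap becomes a single step.
toy['game_number'] = toy.groupby('team').cumcount() + 1
print(toy)
```

The trade-off is that teams no longer line up on the calendar, so a given X position can mean different dates for different teams.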


Maybe Elo ratings make more sense within a season than they do across seasons.

In [9]:
season_df = elo_df[elo_df['season'] == 2023].sort_values(by=['date', 'team']).reset_index()
palette = colors.sample_colorscale('HSV', 32)
express.scatter(data_frame=season_df, x='date', y='rating', color='team', height=800,
                color_discrete_sequence=palette).show(renderer='iframe_connected')


What do we see? First, the 2023 season starts in 2022. Second, two things stand out along the time axis: a couple of early games between the Sharks and the Predators, and the post-season, which goes on for two solid months. Finally, the highest-rated teams don't always win titles: the Bruins exit in an early round of the playoffs, while the Golden Knights, who start the year in the middle of the Elo pack, end up winning the Stanley Cup. Go figure.
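A comparison of where teams start and finish a season can be made explicit by ranking each team's first and last rating. The sketch below uses made-up data for a single season. Note that `start` here is the rating after a team's first game, since `elo_df` holds only postgame ratings.

```python
import pandas as pd

# Made-up single-season ratings: A starts mid-pack and finishes on top.
toy = pd.DataFrame({
    'team': ['A', 'B', 'C', 'A', 'B', 'C'],
    'date': pd.to_datetime(['2022-10-07', '2022-10-07', '2022-10-08',
                            '2023-06-13', '2023-04-30', '2023-04-14']),
    'rating': [1520.0, 1600.0, 1450.0, 1610.0, 1590.0, 1440.0],
}).sort_values(by='date')

# First and last rating per team; rows are in date order within each group.
summary = toy.groupby('team')['rating'].agg(start='first', end='last')

# Rank 1 is the highest rating.
summary['start_rank'] = summary['start'].rank(ascending=False).astype(int)
summary['end_rank'] = summary['end'].rank(ascending=False).astype(int)
print(summary)
```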