Let's load our data. We will look only at the latest data file.

In [1]:
import pandas as pd

DATA = '/kaggle/input/puck-elo-archive-1917now/NHL_ELO_Odyssey_latest.csv'
df = pd.read_csv(filepath_or_buffer=DATA, parse_dates=['date'])
df.head()

Unnamed: 0,season,date,playoff,neutral,status,ot,home_team,away_team,home_team_abbr,away_team_abbr,...,overtime_prob,home_team_expected_points,away_team_expected_points,home_team_score,away_team_score,home_team_postgame_rating,away_team_postgame_rating,game_quality_rating,game_importance_rating,game_overall_rating
0,2023,2022-10-07,0,1,post,,Nashville Predators,San Jose Sharks,NSH,SJS,...,0.234345,1.258314,0.97603,4,1,1510.044255,1446.464381,15,41,28
1,2023,2022-10-08,0,1,post,,San Jose Sharks,Nashville Predators,SJS,NSH,...,0.233053,0.956634,1.276419,2,3,1444.546428,1511.962208,15,43,29
2,2023,2022-10-11,0,0,post,,New York Rangers,Tampa Bay Lightning,NYR,TBL,...,0.238935,1.193826,1.045109,3,1,1554.002969,1567.724755,97,24,60
3,2023,2022-10-11,0,0,post,,Los Angeles Kings,Vegas Golden Knights,LAK,VEG,...,0.240107,1.1774,1.062707,3,4,1498.528556,1531.054886,60,77,68
4,2023,2022-10-12,0,0,post,,Carolina Hurricanes,Columbus Blue Jackets,CAR,CBJ,...,0.221824,1.430048,0.791776,4,1,1557.526926,1471.437081,58,36,47


How much data do we have?

In [2]:
df.shape

(1400, 24)

Is that a lot of data? How many seasons of data do we have?

In [3]:
df['season'].nunique()

1

We have only one season of data, with one row per game. Let's take a look at the distributions of the ratings and win probabilities.

In [4]:
from plotly import express
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)
express.histogram(data_frame=df, x='home_team_pregame_rating', nbins=100).show(renderer='iframe_connected', )

Home team pregame ratings are bimodal. What do away team ratings look like?

In [5]:
express.histogram(data_frame=df, x='away_team_pregame_rating', nbins=100).show(renderer='iframe_connected', )

Away team pregame ratings are also bimodal. Let's see what the two features look like together.

In [6]:
express.scatter(data_frame=df, x='home_team_pregame_rating', color='game_overall_rating',
                y='away_team_pregame_rating').show(renderer='iframe_connected',)

Not surprisingly, games between highly rated teams tend to be highly rated games.

Let's look at score frequencies. 

In [7]:
score_counts = (df.groupby(by=['home_team_score', 'away_team_score'])
                  .size().reset_index(name='games'))
express.scatter(data_frame=score_counts, x='home_team_score', y='away_team_score',
                size='games').show(renderer='iframe_connected',)

Not surprisingly, one-goal games are relatively common, with 3-2 and 2-3 games being the most common. Games where one team scores ten or more goals almost never happen.
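To read the most common scorelines off as numbers rather than marker sizes, the same counts can be sorted into a table. The sketch below is only illustrative: the helper name `count_scorelines` and the `sample` frame are not part of the notebook, and `sample` holds just the five scores shown by `df.head()` so that the snippet runs on its own. Passing the full `df` instead would give the counts behind the plot.

```python
import pandas as pd

def count_scorelines(games: pd.DataFrame) -> pd.DataFrame:
    """Count games per (home score, away score) pair, most frequent first."""
    counts = (games.groupby(['home_team_score', 'away_team_score'])
                   .size()
                   .reset_index(name='games')
                   .sort_values('games', ascending=False, ignore_index=True))
    # Absolute goal margin, so one-goal games are easy to filter
    counts['margin'] = (counts['home_team_score'] - counts['away_team_score']).abs()
    return counts

# Illustrative input: the five scores displayed by df.head() above
sample = pd.DataFrame({
    'home_team_score': [4, 2, 3, 3, 4],
    'away_team_score': [1, 3, 1, 4, 1],
})
counts = count_scorelines(sample)
print(counts)
```

Filtering with `counts[counts['margin'] == 1]` would then isolate the one-goal games.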

Let's look at the relationship between home team expected points, which is a float, and home team score, which is an integer.

In [8]:
express.scatter(data_frame=df, x='home_team_expected_points', color='game_overall_rating',
                y='home_team_score').show(renderer='iframe_connected',)

Not surprisingly, games with high overall ratings tend to be relatively low-scoring games.

In [9]:
express.scatter(data_frame=df, x='home_team_expected_points', color='game_overall_rating',
                y='away_team_expected_points').show(renderer='iframe_connected',)

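Perhaps surprisingly, the sum of the expected points is always about 2.2, and games with high overall ratings tend to be expected to be close. The sum exceeds 2 because a game decided in overtime awards three points in total (two to the winner, one to the loser); in the rows shown by df.head(), the two expected-point columns add up to 2 + overtime_prob.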
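One way to see where the total of about 2.2 comes from is to compare it with 2 plus the overtime probability. The sketch below does this for the five rows shown by `df.head()`. The `sample` frame and the helper `expected_points_gap` are illustrative names, not part of the notebook, and whether the relationship holds for every row of the full `df` is an assumption that would need the same check on the whole frame.

```python
import pandas as pd

# Illustrative input: values copied from the df.head() output above
sample = pd.DataFrame({
    'overtime_prob': [0.234345, 0.233053, 0.238935, 0.240107, 0.221824],
    'home_team_expected_points': [1.258314, 0.956634, 1.193826, 1.1774, 1.430048],
    'away_team_expected_points': [0.97603, 1.276419, 1.045109, 1.062707, 0.791776],
})

def expected_points_gap(games: pd.DataFrame) -> pd.Series:
    """Difference between total expected points and 2 + overtime probability."""
    total = games['home_team_expected_points'] + games['away_team_expected_points']
    return total - (2 + games['overtime_prob'])

gap = expected_points_gap(sample)
# Largest absolute difference; values near zero support the relationship
print(gap.abs().max())
```

On these rows the largest difference is on the order of the rounding in the displayed values.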

In [10]:
express.scatter(data_frame=df, x='game_quality_rating', color='game_overall_rating',
                y='game_importance_rating').show(renderer='iframe_connected',)

Because the overall rating is the average of the game quality rating and the game importance rating (in the rows shown by df.head(), for example, quality 15 and importance 41 give an overall rating of 28), we see a simple gradient along the line y=x. It's not surprising that there are a lot of games that have a low importance rating; are we surprised that there are a lot of games that are high-quality, low-importance?
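How the overall rating relates to its two components can be checked directly by comparing it with their mean. The sketch below uses the five rows shown by `df.head()`. The `sample` frame and the helper `overall_minus_mean` are illustrative names, not part of the notebook, and the exact rounding rule behind the published rating is not known from the data alone.

```python
import pandas as pd

# Illustrative input: values copied from the df.head() output above
sample = pd.DataFrame({
    'game_quality_rating': [15, 15, 97, 60, 58],
    'game_importance_rating': [41, 43, 24, 77, 36],
    'game_overall_rating': [28, 29, 60, 68, 47],
})

def overall_minus_mean(games: pd.DataFrame) -> pd.Series:
    """Overall rating minus the mean of the quality and importance ratings."""
    mean = (games['game_quality_rating'] + games['game_importance_rating']) / 2
    return games['game_overall_rating'] - mean

diff = overall_minus_mean(sample)
print(diff.tolist())  # [0.0, 0.0, -0.5, -0.5, 0.0]
```

The differences never exceed half a point in magnitude, which is consistent with the overall rating being the mean of the two components stored as an integer.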

In [11]:
express.density_contour(data_frame=df, x='game_quality_rating', histfunc='count',
                y='game_importance_rating').show(renderer='iframe_connected',)

We have a lot of low-quality, low-importance games, fewer high-quality, low-importance games, essentially no low-quality, high-importance games, and some high-quality, high-importance games. The last group is probably due to the presence of playoff and championship games in our dataset.
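The playoff explanation can be tested by grouping on the `playoff` column and comparing the mean ratings of each group. The following is a minimal sketch: the helper `ratings_by_playoff` and the `sample` frame are illustrative names, not part of the notebook. `sample` contains only the five rows shown by `df.head()`, all of which are regular-season games, so it produces a single group; the full `df` would show a second row only if playoff games are actually present.

```python
import pandas as pd

def ratings_by_playoff(games: pd.DataFrame) -> pd.DataFrame:
    """Number of games and mean quality/importance rating per playoff flag."""
    return games.groupby('playoff').agg(
        games=('game_quality_rating', 'size'),
        mean_quality=('game_quality_rating', 'mean'),
        mean_importance=('game_importance_rating', 'mean'),
    )

# Illustrative input: values copied from the df.head() output above
sample = pd.DataFrame({
    'playoff': [0, 0, 0, 0, 0],
    'game_quality_rating': [15, 15, 97, 60, 58],
    'game_importance_rating': [41, 43, 24, 77, 36],
})
summary = ratings_by_playoff(sample)
print(summary)
```

If the hypothesis is right, the `playoff == 1` row of `ratings_by_playoff(df)` should show clearly higher mean quality and importance than the `playoff == 0` row.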

In [12]:
df[['game_quality_rating', 'game_importance_rating', 'game_overall_rating']].corr()

Unnamed: 0,game_quality_rating,game_importance_rating,game_overall_rating
game_quality_rating,1.0,0.359439,0.844041
game_importance_rating,0.359439,1.0,0.803748
game_overall_rating,0.844041,0.803748,1.0


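If we look at the Pearson correlations, we see that game quality and game importance are only weakly correlated with each other (about 0.36), while each is strongly correlated with the overall rating (about 0.84 for quality and 0.80 for importance). This suggests that the overall rating draws on both components to a similar degree, with game quality contributing slightly more.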