# Visualizing final scores in the NHL: 1917 to today

I start as I usually do: by aimlessly gathering all the data I can possibly get my hands on.

## Use NHL.com API calls to gather data

My approach to data gathering goes something like this:
1. Muck around on [nhl.com/stats](http://www.nhl.com/stats/) until I find something of interest
2. Use 'Inspect' (Ctrl+Shift+I on Chrome/Windows) to view calls to the NHL REST API ([check this out](http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/) for more on this)
3. Fiddle with the URL of the call till I get what I want

In this case, I wanted the results of every game ever played in the NHL. The API seemed to truncate the data at 50k rows, so I ended up splitting the calls to the API into 5-year chunks (larger chunks would probably have worked too).
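As a rough sketch of the chunking idea, the helper below yields consecutive (start, end) year pairs, so that each pair can drive one API call. The name `year_windows` and its parameters are hypothetical and not part of the original notebook.

```python
def year_windows(first_year, last_year, step=5):
    """Yield (start, end) year pairs covering first_year..last_year.

    Hypothetical helper for illustration; `step` is the chunk size in years.
    """
    start = first_year
    while start < last_year:
        # Cap the final window so it never runs past last_year
        end = min(start + step, last_year)
        yield start, end
        start = end


windows = list(year_windows(1917, 1930))
print(windows)
```

The notebook cell below differs slightly: it overwrites the last boundary with the current year, so its final chunk can be longer than 5 years.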

In [1]:
import json
import requests
import pandas as pd
import numpy as np
import datetime

# Get current year
now = datetime.datetime.now().year

# Create range of years to get data, starting with 1917, leading to now.
# The REST API call doesn't seem to return more than 50k lines at a time.
# So, I've split up the year range into 5 year sections (could be 10)
years = np.arange(1917,now,5)

# Ensure the final year in the range is the current year
if years[-1] != now: years[-1] = now 

# Create empty data frame
df = pd.DataFrame()
    
# For each year span, generate URL and get data
for i in range(len(years)-1):
    
    # Create URL
    URL = ("http://www.nhl.com/stats/rest"
           "/team?"
           "isAggregate=false"
           "&reportType=basic"
           "&isGame=true"
           "&reportName=teamsummary"
           "&cayenneExp=gameDate%3E=%22"
           +str(years[i])+
           "-08-01%22%20and%20gameDate%3C=%22"
           +str(years[i+1])+
           "-08-01%22%20and%20gameTypeId=2")
    
    # Get data as JSON dict from URL
    rawDict = requests.get(URL).json()
    # Convert raw data dictionary to pandas data frame
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, pd.DataFrame.from_dict(rawDict['data'])])

# Write complete data frame to CSV (not required, just for posterity)
df.to_csv('NHL_Game_Summaries_1917_'+str(int(now))+'.csv')

# Print columns of data frame for future reference
print(df.columns)

print(df.head())


Index(['faceoffWinPctg', 'faceoffsLost', 'faceoffsWon', 'gameDate', 'gameId',
       'gameLocationCode', 'gamesPlayed', 'goalsAgainst', 'goalsFor', 'losses',
       'opponentTeamAbbrev', 'otLosses', 'penaltyKillPctg', 'points',
       'ppGoalsAgainst', 'ppGoalsFor', 'ppOpportunities', 'ppPctg',
       'shNumTimes', 'shootoutGamesLost', 'shootoutGamesWon', 'shotsAgainst',
       'shotsFor', 'teamAbbrev', 'teamFullName', 'teamId', 'ties', 'wins'],
      dtype='object')
      faceoffWinPctg faceoffsLost faceoffsWon              gameDate  \
0                  0            0           0  1917-12-20T01:00:00Z
1                  0            0           0  1919-02-19T01:00:00Z
2                  0            0           0  1920-03-09T01:00:00Z
3                  0            0           0  1920-03-07T01:00:00Z
4                  0            0           0  1921-01-30T01:00:00Z

## Forming the dataset

Easy peasy. Now, let's fiddle with the data.

Here, I start to think about what I want to see out of the data. Right off the bat, I see that each game played has two rows corresponding to it (one for the home team and one for the away team). I only need one side of each game to visualize the data, so I've decided to keep the winning side and the first side of any ties.
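To make the "one row per game" rule concrete, here is a minimal sketch of the same idea spelled out without pandas. The toy rows and team pairings are made up for illustration, and `one_row_per_game` is a hypothetical helper rather than anything from the notebook.

```python
# Toy rows in the shape of the API output: two rows per game, one per team.
# The values are invented for illustration only.
rows = [
    {"gameId": 1, "teamAbbrev": "MTL", "wins": 1, "ties": 0},
    {"gameId": 1, "teamAbbrev": "TOR", "wins": 0, "ties": 0},
    {"gameId": 2, "teamAbbrev": "BOS", "wins": 0, "ties": 1},
    {"gameId": 2, "teamAbbrev": "NYR", "wins": 0, "ties": 1},
]


def one_row_per_game(rows):
    """Keep the winner's row, or the first row seen for a tied game."""
    kept = {}
    for row in rows:
        if row["wins"] == 1:
            kept[row["gameId"]] = row            # the winner takes the slot
        elif row["ties"] == 1:
            kept.setdefault(row["gameId"], row)  # first side of a tie only
    return list(kept.values())


result = one_row_per_game(rows)
print([(r["gameId"], r["teamAbbrev"]) for r in result])
```

Losing rows are never stored, so every game ends up with exactly one row.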

In [2]:
# Creating a look up of teamId vs teamAbbrev (thought this would be useful...not so far)
teamIdLookup = set(zip(df['teamId'],df['teamAbbrev']))

# Removing all 'loss' sides of games and one side of any ties
res = pd.concat([df[df.wins==1],df[df.ties==1].drop_duplicates(subset = ['gameId','ties'], keep='first')])

# Filtering out unnecessary columns
res = res.loc[:,['gameId','gamesPlayed','teamAbbrev','opponentTeamAbbrev',
                 'goalsFor','goalsAgainst','gameLocationCode',
                 'wins', 'ties', 'shootoutGamesWon']].reset_index(drop=True)

# Create 'year' column
res['year'] = [int(str(x)[0:4]) for x in res.loc[:,'gameId']]

# Create 'seasonId' column
res['seasonId'] = [str(int(x))+'-'+str(int(x)+1) for x in res.year]

# Dropping any games with goal totals that aren't finite
res = res[np.isfinite(res.goalsFor)]

# Create 'homeScore' column which contains the score of the home team for each game
res['homeScore'] = [int(res.goalsFor[i] + (1 if res.shootoutGamesWon[i] == 1 else 0))
                    if res.gameLocationCode[i] == 'H' else int(res.goalsAgainst[i]) for i in res.index]

# Create 'roadScore' column which contains the score of the away team for each game
res['roadScore'] = [int(res.goalsFor[i] + (1 if res.shootoutGamesWon[i] == 1 else 0))
                    if res.gameLocationCode[i] == 'R' else int(res.goalsAgainst[i]) for i in res.index]


res.head(5)

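| | gameId | gamesPlayed | teamAbbrev | opponentTeamAbbrev | goalsFor | goalsAgainst | gameLocationCode | wins | ties | shootoutGamesWon | year | seasonId | homeScore | roadScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1919020044 | 1 | TSP | QBD | 11.0 | 2.0 | H | 1 | 0 | 0 | 1919 | 1919-1920 | 11 | 2 |
| 1 | 1920020025 | 1 | HAM | MTL | 6.0 | 5.0 | H | 1 | 0 | 0 | 1920 | 1920-1921 | 6 | 5 |
| 2 | 1921020012 | 1 | HAM | MTL | 4.0 | 3.0 | H | 1 | 0 | 0 | 1921 | 1921-1922 | 4 | 3 |
| 3 | 1920020003 | 1 | TSP | MTL | 5.0 | 4.0 | H | 1 | 0 | 0 | 1920 | 1920-1921 | 5 | 4 |
| 4 | 1920020021 | 1 | MTL | SEN | 5.0 | 3.0 | H | 1 | 0 | 0 | 1920 | 1920-1921 | 5 | 3 |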
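The score logic in the cell above can be sketched as a small standalone function. `final_scores` is a hypothetical name. Following the notebook code, it adds one goal for a shootout win, which assumes `goalsFor` does not already include the deciding shootout goal. It also assumes the location code is always 'H' or 'R'.

```python
def final_scores(goals_for, goals_against, location, shootout_won):
    """Return (home_score, road_score) from the kept team's row.

    Assumes `location` is 'H' or 'R' and that a shootout win is worth
    one extra goal on top of goals_for, as in the notebook code.
    """
    team_score = int(goals_for) + (1 if shootout_won == 1 else 0)
    opp_score = int(goals_against)
    if location == "H":
        return team_score, opp_score
    return opp_score, team_score


# First row of the table above: home win, no shootout
print(final_scores(11.0, 2.0, "H", 0))
# Made-up example: road team wins a 3-3 game in the shootout
print(final_scores(3.0, 3.0, "R", 1))
```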


## Plotly magic

I chose Plotly for the interactivity and HTML embedding it offers over other plotting packages. It also gives me the flexibility of configuring the plot in JavaScript (should I ever get around to learning a whole lot more JS).

I'm not going to wade into the depths of the Plotly code below - I would suggest following the [tutorials and examples here](https://plot.ly/python/getting-started/) to learn to configure plots and all their features.

In [3]:
import plotly.graph_objs as go
# Only `plot` is used below, so the unused imports are dropped
from plotly.offline import plot


def plotly_heatmap(res):
    hSet = range(0,max(res.homeScore)+1)
    rSet = range(0,max(res.roadScore)+1)
    z = np.zeros((len(rSet),len(hSet)))
    t = []

    c = res.groupby(['roadScore','homeScore']).gamesPlayed.count()

    for j, r in enumerate(rSet):
        for i, h in enumerate(hSet):
            # Count for this (road, home) score pair; 0 if it never occurred
            z[j][i] = c.get((r, h), 0)
            hov = ("Home: "+str(h)+
                   "<br>Away: "+str(r)+
                   "<br>Count: "+str(int(z[j][i])))
            t.append(hov)

    t = [t[i:i+len(hSet)] for i in range(0, len(t), len(hSet))]


    data = [{
            'z': z,
            'type': 'heatmap',
            'colorscale': [
                [0, 'rgb(255, 255, 255)'],
                [0.0001, 'rgb(230,250,255)'],
                [0.01, 'rgb(150,200,255)'],
                [0.5, 'rgb(150,0,75)'],
                [0.8, 'rgb(120,0,75)'],
                [1., 'rgb(50, 0, 0)']],
            'colorbar': {
                'tick0': 0,
                'tickmode': 'array',
                'tickvals': [0, 500, 1000, 1500, 2000, 2500, 3000, 3500]},
            'hoverinfo':'text',
            'showscale': False,
            'text': t
            },
            go.Histogram(y = res.roadScore,
                         xaxis = 'x2',
                         marker = dict(color = 'rgba(0,0,1,.1)'),
                         hoverinfo = 'text', 
                         text = list(res.groupby('roadScore').gamesPlayed.count())), 
            go.Histogram(x = res.homeScore,
                         yaxis = 'y2',
                         marker = dict(color = 'rgba(0,0,1,.1)'),
                         hoverinfo = 'text', 
                         text = list(res.groupby('homeScore').gamesPlayed.count()))
            ]

    axesColor = 'rgb(200,200,200)'

    layout = go.Layout(
        #title='<b>NHL SCORE DISTRIBUTION</b>',
        titlefont = dict(size = 50, 
                         color = axesColor),
        xaxis = dict(tickvals = list(hSet),
                     domain = [0,.8], 
                     nticks=len(hSet)+1,
                     fixedrange = True,
                     side = 'top',
                     ticklen = 0,
                     tickfont = dict(color = axesColor, size = 15),
                     title='',
                     titlefont=dict(size=18,color=axesColor)),
        yaxis = dict(tickvals = list(rSet),
                     domain = [0.2,1],
                     autorange = 'reversed', 
                     fixedrange = True,
                     nticks=len(rSet)+1,
                     ticklen = 0,
                     tickfont = dict(color = axesColor, size = 15)),
        xaxis2 = dict(zeroline = False,
                      domain = [0.8,1],
                      fixedrange = True,
                      scaleratio = 10,
                      showgrid = False,
                      tickfont = dict(color = axesColor, size = 8),
                      showticklabels=False),
        yaxis2 = dict(zeroline = False,
                      domain = [0,.2],
                      autorange = 'reversed', 
                      fixedrange = True,
                      scaleratio = 10,
                      showgrid = False,
                      tickfont = dict(color = axesColor, size = 8),
                      showticklabels=False),
        annotations = [dict(x=0, y=1.11, xref = 'paper', yref = 'paper',
                            showarrow = False,
                            text = '<b>home</b>', 
                            font = dict(size=40, color = axesColor)),
                       dict(x=-0.1, y=1, xref = 'paper', yref = 'paper',
                            showarrow = False,
                            text = '<b>away</b>',
                            textangle = -90,
                            font = dict(size=40, color = axesColor))],
        showlegend = False,
        hovermode = 'closest',
#         autosize = True,
        height = 750,
        width = 800,
        margin = dict(r=0, b=0),
        )

    fig = go.Figure(data = data, layout = layout)
    
    return fig



In [4]:
fig = plotly_heatmap(res)

plot(fig, filename='scores-heatmap.html')

'file://c:\\users\\madha\\Projects\\NHL Scores\\scores-heatmap.html'
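For readers skipping the Plotly details, the core of the heatmap is just a grid of counts indexed by road and home score. Here is a minimal sketch of that grid in plain Python, using made-up scores. It illustrates the idea and is not the notebook's implementation, which uses a pandas `groupby`.

```python
from collections import Counter

# Toy (home, road) final scores, invented for illustration
games = [(3, 2), (3, 2), (1, 0), (2, 4)]

counts = Counter(games)
max_home = max(h for h, _ in games)
max_road = max(r for _, r in games)

# z[road][home] holds the number of games with that final score:
# rows are road goals and columns are home goals, as in the heatmap
z = [[counts.get((h, r), 0) for h in range(max_home + 1)]
     for r in range(max_road + 1)]

for row in z:
    print(row)
```

Score pairs that never occurred stay at zero, which is what the white end of the colorscale is for.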

In [17]:
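# Sum the numeric columns per season (gamesPlayed then equals the number of games)
res2 = res.groupby(['year','seasonId']).sum(numeric_only=True)
res2.reset_index(inplace=True)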
res2['avgHomeGPG'] = res2.homeScore/res2.gamesPlayed
res2['avgAwayGPG'] = res2.roadScore/res2.gamesPlayed
res2['avgHomeAwayDiff'] = res2.avgHomeGPG - res2.avgAwayGPG

traceHome = go.Bar(x = res2.seasonId, 
                   y = res2.avgHomeGPG, 
                   name = 'Home GpG', 
                   marker = dict(color = 'rgba(150,0,115,.5)')
                  )
traceAway = go.Bar(x = res2.seasonId, 
                       y = res2.avgAwayGPG,
                       name = 'Away GpG', 
                       marker = dict(color = 'rgba(0,150,200,.5)')
                      )
traceDiff = go.Scatter(x = res2.seasonId, 
                       y = res2.avgHomeAwayDiff, 
                       name = 'GpG Diff', 
                       marker = dict(color = 'rgba(0,0,0,1)')
                      )

data = [traceHome, traceAway, traceDiff]

fig = go.Figure(data = data)

plot(fig, filename='GPG-barplot.html')

'file://c:\\users\\madha\\Projects\\NHL Scores\\GPG-barplot.html'