# Predicting Hockey Contracts - The Complete Guide

by Luke Kerwin, Eric Wu, Brian Ellis, and Griffin Jordan

In [1]:
import requests
import json
import pandas as pd
import numpy as np
from datetime import datetime
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')

# Step 1: Getting the Data

As a group, we set out to predict NHL (hockey) contracts for players in the 2022-2023 season, which we chose because it is the most recent completed season. To do that, we needed data on the players' contracts and on their performance (goals, assists, points, and so on). We gathered this data from the following sources:

- [CapFriendly](https://www.capfriendly.com/) - a website that tracks NHL contracts
- [Hockey Reference](https://www.hockey-reference.com/) - a website that tracks NHL player performance statistics

Below is the code we used to scrape the data from these websites. We used the [requests](https://docs.python-requests.org/en/master/) library to make the HTTP requests and the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to parse the responses.

### CapFriendly

In [2]:
import calendar

# We need to get the last 10 seasons of contracts for each player
years = range(2012, 2023)

# Iterate through each season, month by month
contracts = []
for year in years:
    for month in range(1, 13):
        print(f'Getting contract data for {year}-{month}...', end='\r')
        # Get the data from capfriendly. Dates in the URL are MMDDYYYY, so the month
        # is zero-padded and the range ends on the month's actual last day.
        last_day = calendar.monthrange(year, month)[1]
        url = f'https://www.capfriendly.com/signings/all/all/all/1-15/0-15000000/{month:02d}01{year}-{month:02d}{last_day}{year}'
        r = requests.get(url)
        table = pd.read_html(r.text)[0]
        # Add the data to our list
        contracts.append(table)

contracts = pd.concat(contracts)
contracts.to_csv('data/contracts.csv', index=False)
print('\nDone!')
contracts.head().style.hide(axis='index')

Getting contract data for 2022-12...
Done!


PLAYER,PLAYER.1,AGE,POS,TEAM,DATE,TYPE,EXTENSION,STRUCTURE,LENGTH,VALUE,CAP HIT
Mark Pysyk,Mark Pysyk,31,RD,CGY,"Dec. 2, 2023",Stnd (UFA),,2-way,1,"$775,000","$775,000"
Samuel Montembeault,Samuel Montembeault,27,G,MTL,"Dec. 1, 2023",Stnd (UFA),✔,1-way,3,"$9,450,000","$3,150,000"
Jordan Gustafson,Jordan Gustafson,19,C,VGK,"Nov. 29, 2023",ELC,,2-way,3,"$2,572,500","$857,500"
Patrick Kane,Patrick Kane,34,RW,DET,"Nov. 28, 2023",Stnd (UFA),,1-way,1,"$2,750,000","$2,750,000"
Justin Bailey,Justin Bailey,27,RW,SJS,"Nov. 27, 2023",Stnd (UFA),,2-way,1,"$775,000","$775,000"



### Hockey Reference

To predict each contract, we will use the player's statistics from the three seasons before the contract was awarded.

In [12]:
stats_seasons = range(2009, 2023)

stats = []
for season in stats_seasons:
    print(f'Getting stats for {season}...', end='\r')
    req = requests.get(f'https://www.hockey-reference.com/leagues/NHL_{season}_skaters.html')
    table = pd.read_html(req.text)[0]
    table['season'] = season
    stats.append(table)

Getting stats for 2022...

In [13]:
stats_ = pd.concat(stats)
# Remove the multi-index
stats_.columns = ['Rk', 'Player', 'Age', 'Tm', 'Pos', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM',
       'PS', 'EV', 'PP', 'SH', 'GW', 'EV', 'PP', 'SH', 'S', 'S%', 'TOI',
       'ATOI', 'BLK', 'HIT', 'FOW', 'FOL', 'FO%', 'SEASON']
stats_.to_csv('data/stats.csv', index=False)
stats_.head().style.hide(axis='index')

Rk,Player,Age,Tm,Pos,GP,G,A,PTS,+/-,PIM,PS,EV,PP,SH,GW,EV.1,PP.1,SH.1,S,S%,TOI,ATOI,BLK,HIT,FOW,FOL,FO%,SEASON
1,Justin Abdelkader,21,DET,LW,2,0,0,0,0,0,0.0,0,0,0,0,0,0,0,2,0.0,19,9:18,0,3,4,3,57.1,2009
2,Craig Adams,31,TOT,RW,45,2,5,7,-3,22,0.1,1,1,0,0,5,0,0,47,4.3,391,8:41,20,67,8,13,38.1,2009
2,Craig Adams,31,CHI,RW,36,2,4,6,-3,22,0.1,1,1,0,0,4,0,0,38,5.3,314,8:43,16,53,6,10,37.5,2009
2,Craig Adams,31,PIT,RW,9,0,1,1,0,0,0.0,0,0,0,0,1,0,0,9,0.0,77,8:34,4,14,2,3,40.0,2009
3,Maxim Afinogenov,29,BUF,RW,48,6,14,20,-7,20,1.4,6,0,0,0,7,7,0,93,6.5,605,12:36,11,20,0,3,0.0,2009


# Step 2: Cleaning and Preparing the Data for Analysis

Now that we have the data, we need to clean it and prepare it for analysis. As noted above, we want to predict each contract from the player's three previous seasons, so we will merge the performance data onto the contract data but keep only those three seasons for each contract.
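
The three-season window described above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the helper name `previous_seasons`, the made-up rows, the use of a plain mean as the aggregate, and the rule that a contract dated year Y draws on seasons Y-2 through Y are all hypothetical choices, not necessarily what the final pipeline does.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (all values are made up)
contracts = pd.DataFrame({
    'PLAYER': ['Player A', 'Player B'],
    'DATE': [2021, 2022],
    'CAP_HIT': [3500000, 750000],
})
stats = pd.DataFrame({
    'PLAYER': ['Player A'] * 4 + ['Player B'] * 2,
    'SEASON': [2018, 2019, 2020, 2021, 2021, 2022],
    'PTS': [40, 30, 20, 10, 8, 12],
})

def previous_seasons(contracts, stats, window=3):
    """Average PTS over the `window` seasons ending in the contract year."""
    # Pair every contract with every season the same player has on record
    merged = contracts.merge(stats, on='PLAYER', how='inner')
    # Assumption: a contract dated year Y uses seasons Y-window+1 .. Y
    in_window = (merged['SEASON'] <= merged['DATE']) & (merged['SEASON'] > merged['DATE'] - window)
    merged = merged[in_window]
    # One row per contract, with the windowed average as the feature
    return merged.groupby(['PLAYER', 'DATE', 'CAP_HIT'], as_index=False)['PTS'].mean()

features = previous_seasons(contracts, stats)
print(features)
```

Player A's 2018 season falls outside the window and is ignored, while Player B simply contributes the two seasons that exist.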


### Cleaning Contract Data

In [31]:
# Remove PLAYER.1
cleaned_contracts = contracts.drop(columns=['PLAYER.1'])

# Data formatting: keep the signing year and turn the money strings into numbers
cleaned_contracts['DATE'] = cleaned_contracts['DATE'].str[-4:].astype(int)
# regex=False treats '$' as a literal character rather than a regex anchor
cleaned_contracts['VALUE'] = cleaned_contracts['VALUE'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float).round(2)
cleaned_contracts['CAP HIT'] = cleaned_contracts['CAP HIT'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float).round(2)
cleaned_contracts['EXTENSION'] = cleaned_contracts['EXTENSION'].str.replace('✔', '1').fillna(0).astype(int)

# Removing data we don't want: keep one-way UFA contracts that are not extensions
cleaned_contracts = cleaned_contracts[cleaned_contracts['TYPE'].isin(['Stnd (UFA)','35+ (UFA)'])]
cleaned_contracts = cleaned_contracts[cleaned_contracts['STRUCTURE']=='1-way']
cleaned_contracts = cleaned_contracts[cleaned_contracts['EXTENSION']==0].reset_index(drop=True)
# Join key: player name without spaces or apostrophes, followed by the signing year
cleaned_contracts['id'] = cleaned_contracts['PLAYER'].str.replace(' ', '', regex=False).str.replace("'", '', regex=False) + cleaned_contracts['DATE'].astype(str)
cleaned_contracts = cleaned_contracts.drop_duplicates().reset_index(drop=True)
cleaned_contracts['VALUE'] = cleaned_contracts['VALUE'].astype(int)
cleaned_contracts['CAP HIT'] = cleaned_contracts['CAP HIT'].astype(int)
cleaned_contracts

Unnamed: 0,PLAYER,AGE,POS,TEAM,DATE,TYPE,EXTENSION,STRUCTURE,LENGTH,VALUE,CAP HIT,id
0,Patrick Kane,34,RW,DET,2023,Stnd (UFA),0,1-way,1,2750000,2750000,PatrickKane2023
1,Danton Heinen,27,"LW, RW",BOS,2023,Stnd (UFA),0,1-way,1,775000,775000,DantonHeinen2023
2,Jonah Gadjovich,24,"LW, RW",FLA,2023,Stnd (UFA),0,1-way,1,810000,810000,JonahGadjovich2023
3,Noah Gregor,24,"LW, RW",TOR,2023,Stnd (UFA),0,1-way,1,775000,775000,NoahGregor2023
4,Austin Watson,31,"RW, LW",TBL,2023,Stnd (UFA),0,1-way,1,776665,776665,AustinWatson2023
5,Carlo Colaiacovo,30,LD,STL,2013,Stnd (UFA),0,1-way,1,750000,750000,CarloColaiacovo2013
6,Ilya Bryzgalov,33,G,EDM,2013,Stnd (UFA),0,1-way,1,2000000,2000000,IlyaBryzgalov2013
7,Ilya Bryzgalov,34,G,ANA,2014,Stnd (UFA),0,1-way,1,2880000,2880000,IlyaBryzgalov2014
8,Martin Brodeur,42,G,STL,2014,35+ (UFA),0,1-way,1,700000,700000,MartinBrodeur2014
9,Dainius Zubrus,37,LW,SJS,2015,35+ (UFA),0,1-way,1,600000,600000,DainiusZubrus2015


### Cleaning Statistics Data

In [30]:
stats_cleaned = stats_.copy()
stats_cleaned.columns = ['Rk', 'Player', 'Age', 'Tm', 'Pos', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM',
'PS', 'EV', 'PP', 'SH', 'GW', 'EV', 'PP', 'SH', 'S', 'S%', 'TOI',
'ATOI', 'BLK', 'HIT', 'FOW', 'FOL', 'FO%', 'SEASON']
stats_cleaned = stats_cleaned[['Player', 'Age', 'Tm', 'Pos', 'GP', 'G', 'A', 'PTS', '+/-', 'PIM',
    'PS', 'EV', 'PP', 'SH', 'GW', 'EV', 'PP', 'SH', 'S', 'S%', 'TOI',
    'ATOI', 'BLK', 'HIT', 'FOW', 'FOL', 'FO%', 'SEASON']]
stats_cleaned = stats_cleaned[stats_cleaned['Player']!='Player'].reset_index(drop=True)

# Players traded mid-season have one row per team plus a 'TOT' (total) row;
# keep only the total so each player has a single row per season
new = []
for player in stats_cleaned['Player'].unique():
    data = stats_cleaned[stats_cleaned['Player']==player]
    for season in data['SEASON'].unique():
        data_season = data[data['SEASON']==season]
        if len(data_season) > 1:
            data_season = data_season[data_season['Tm']=='TOT']
        new.append(data_season)

stats_cleaned = pd.concat(new).reset_index(drop=True)
# Collapse positions into three groups: defense (D), center (C), and wing (W)
stats_cleaned['Pos'] = np.where(stats_cleaned['Pos'].isin(['D','LD','RD']), 'D', np.where(stats_cleaned['Pos'].isin(['C']), 'C', 'W'))
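# The labels 'EV', 'PP' and 'SH' are duplicated, so selecting them twice above
# returned every matching column each time. The first set gets descriptive names
# (EVSH and PPSH hold the short-handed goals and assists); the second set keeps
# the old labels and is dropped.
stats_cleaned.columns = ['PLAYER', 'AGE', 'TEAM', 'POS', 'GP', 'G', 'A', 'PTS', 'PLUSMINUS', 'PIM', 'PS',
    'EVG', 'EVA', 'PPG', 'PPA', 'EVSH', 'PPSH', 'GWG', 'EV', 'EV', 'PP', 'PP', 'SH',
    'SH', 'S', 'S%', 'TOI', 'ATOI', 'BLK', 'HIT', 'FOW', 'FOL', 'FO%',
    'SEASON']
stats_cleaned = stats_cleaned.drop(columns=['EV', 'PP', 'SH'])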
stats_cleaned['S%'] = stats_cleaned['S%'].str.replace('%', '').astype(float)/100
stats_cleaned['TOI'] = stats_cleaned['TOI'].astype(int) * 60
stats_cleaned['ATOI'] = stats_cleaned['ATOI'].str.split(':').apply(lambda x: int(x[0])*60 + int(x[1]))
stats_cleaned['FO%'] = stats_cleaned['FO%'].str.replace('%', '').astype(float)/100
stats_cleaned = stats_cleaned[stats_cleaned['GP'].astype(int)>=20].reset_index(drop=True)

# Convert every column that holds numbers to float; everything else becomes text
for col in stats_cleaned.columns:
    try:
        stats_cleaned[col] = stats_cleaned[col].astype(float)
    except (ValueError, TypeError):
        stats_cleaned[col] = stats_cleaned[col].astype(str)

# Build the join key the same way as in the contracts table: name without spaces
# or apostrophes, followed by the season as an integer. Note that Series.replace
# only swaps whole values, so .str.replace is needed to strip characters.
stats_cleaned['id'] = (stats_cleaned['PLAYER'].str.replace(' ', '', regex=False).str.replace("'", '', regex=False)
                       + stats_cleaned['SEASON'].astype(int).astype(str))
stats_cleaned = stats_cleaned.drop_duplicates(subset=['id']).reset_index(drop=True)
stats_cleaned['FO%'] = stats_cleaned['FO%'].fillna(0)
# Round all possible columns
for col in stats_cleaned.columns:
    try:
        stats_cleaned[col] = stats_cleaned[col].astype(float).round(2)
    except (ValueError, TypeError):
        pass  # text columns such as PLAYER and id are left untouched
stats_cleaned.columns = [col.replace('%', '_') for col in stats_cleaned.columns]
stats_cleaned

Unnamed: 0,PLAYER,AGE,TEAM,POS,GP,G,A,PTS,PLUSMINUS,PIM,...,S_,TOI,ATOI,BLK,HIT,FOW,FOL,FO_,SEASON,id
0,Justin Abdelkader,22.0,DET,W,50.0,3.0,3.0,6.0,-11.0,35.0,...,0.04,31800.0,635.0,20.0,152.0,148.0,170.0,0.46,2010.0,JustinAbdelkader2010
1,Justin Abdelkader,23.0,DET,W,74.0,7.0,12.0,19.0,15.0,61.0,...,0.05,54600.0,738.0,39.0,188.0,227.0,203.0,0.53,2011.0,JustinAbdelkader2011
2,Justin Abdelkader,24.0,DET,W,81.0,8.0,14.0,22.0,4.0,62.0,...,0.07,59820.0,739.0,42.0,148.0,239.0,213.0,0.53,2012.0,JustinAbdelkader2012
3,Justin Abdelkader,25.0,DET,W,48.0,10.0,3.0,13.0,6.0,34.0,...,0.10,42660.0,889.0,13.0,120.0,65.0,60.0,0.52,2013.0,JustinAbdelkader2013
4,Justin Abdelkader,26.0,DET,W,70.0,10.0,18.0,28.0,2.0,31.0,...,0.07,64200.0,917.0,31.0,172.0,23.0,32.0,0.42,2014.0,JustinAbdelkader2014
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9388,Jordan Spence,20.0,LAK,D,24.0,2.0,6.0,8.0,0.0,2.0,...,0.05,28440.0,1185.0,29.0,32.0,0.0,0.0,0.00,2022.0,JordanSpence2022
9389,Philip Tomasino,20.0,NSH,C,76.0,11.0,21.0,32.0,2.0,10.0,...,0.10,52620.0,692.0,28.0,49.0,21.0,33.0,0.39,2022.0,PhilipTomasino2022
9390,Alexey Toropchenko,22.0,STL,W,28.0,2.0,0.0,2.0,-9.0,15.0,...,0.05,18000.0,643.0,21.0,69.0,1.0,2.0,0.33,2022.0,AlexeyToropchenko2022
9391,Jasper Weatherby,24.0,SJS,C,50.0,5.0,6.0,11.0,-14.0,18.0,...,0.11,33720.0,674.0,24.0,52.0,173.0,168.0,0.51,2022.0,JasperWeatherby2022
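
Both cleaned tables carry an `id` column made of the player's name and a year. For those keys to line up in a later merge, the same normalization has to be applied on both sides, and a small shared helper is one way to guarantee that. The sketch below is only an illustration: the function name `make_id` is hypothetical, and the set of characters it strips (spaces and apostrophes) is taken from the contract cleaning above.

```python
import re

def make_id(player, year):
    """Build a join key such as 'PatrickKane2023' from a name and a year."""
    # Drop spaces and apostrophes from the player's name
    name = re.sub(r"[ ']", '', str(player))
    # Go through float and int so a season stored as 2010.0 becomes '2010'
    return f'{name}{int(float(year))}'

print(make_id('Patrick Kane', 2023))
print(make_id("Ryan O'Reilly", 2018.0))
```

Applying one function to both DataFrames removes the risk that the two cells drift apart, for example when one side stores the year as a float.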


# Step 3: Creating SQL Database

In [27]:
from sqlalchemy import create_engine

# Create the database
engine = create_engine('sqlite:///data/nhl.db', echo=False)

# Save the dataframes to the database, replacing the tables if this cell is re-run
cleaned_contracts.to_sql('contracts', con=engine, if_exists='replace')
stats_cleaned.to_sql('stats', con=engine, if_exists='replace')

9393
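
`to_sql` writes the DataFrame's column labels as they are, so a label such as `CAP HIT` keeps its space in the database, and the unnamed DataFrame index is stored as a column called `index`. A quick way to confirm what was actually written is to list a table's columns. The sketch below uses Python's built-in `sqlite3` module against an in-memory stand-in for `data/nhl.db`; the helper name `column_names` and the trimmed-down table are assumptions for illustration.

```python
import sqlite3

# In-memory stand-in for data/nhl.db with a trimmed-down contracts table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE contracts ("index" INTEGER, PLAYER TEXT, "CAP HIT" INTEGER)')

def column_names(conn, table):
    """Return the column names of `table` in definition order."""
    # PRAGMA table_info yields one row per column; field 1 is the column name
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]

cols = column_names(conn, 'contracts')
print(cols)
conn.close()
```

Any code that maps onto these tables later has to use the exact names reported here, including the space.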

# Step 4: Building the API

In [None]:
from flask import Flask

app = Flask(__name__)

app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///data/nhl.db'
app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False

from flask_sqlalchemy import SQLAlchemy
db = SQLAlchemy(app)

class Contract(db.Model):
    __tablename__ = 'contracts'
    index = db.Column(db.Integer, primary_key=True)
    PLAYER = db.Column(db.String)
    TEAM = db.Column(db.String)
    TYPE = db.Column(db.String)
    DATE = db.Column(db.Integer)
    VALUE = db.Column(db.Float)
    LENGTH = db.Column(db.Integer)
    CAP_HIT = db.Column('CAP HIT', db.Float)  # the database column name contains a space
    EXTENSION = db.Column(db.Integer)

    def __repr__(self):
        return '<Contract %r>' % self.PLAYER
    
class Stat(db.Model):
    __tablename__ = 'stats'
    index = db.Column(db.Integer, primary_key=True)
    PLAYER = db.Column(db.String)
    AGE = db.Column(db.Float)
    TEAM = db.Column(db.String)
    POS = db.Column(db.String)
    GP = db.Column(db.Float)
    G = db.Column(db.Float)
    A = db.Column(db.Float)
    PTS = db.Column(db.Float)
    PLUSMINUS = db.Column(db.Float)
    PIM = db.Column(db.Float)
    PS = db.Column(db.Float)
    EVG = db.Column(db.Float)
    EVA = db.Column(db.Float)
    PPG = db.Column(db.Float)
    PPA = db.Column(db.Float)
    EVSH = db.Column(db.Float)
    PPSH = db.Column(db.Float)
    GWG = db.Column(db.Float)
    S = db.Column(db.Float)
    S_ = db.Column(db.Float)
    TOI = db.Column(db.Float)
    ATOI = db.Column(db.Float)
    BLK = db.Column(db.Float)
    HIT = db.Column(db.Float)
    FOW = db.Column(db.Float)
    FOL = db.Column(db.Float)
    FO_ = db.Column(db.Float)
    SEASON = db.Column(db.Float)

    def __repr__(self):
        return '<Stat %r>' % self.PLAYER

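from flask import jsonify

def to_dict(row):
    # Keep only the mapped columns: __dict__ also holds SQLAlchemy's internal
    # _sa_instance_state object, which jsonify cannot serialize
    return {key: value for key, value in row.__dict__.items() if not key.startswith('_')}

@app.route('/api/contracts')
def get_contracts():
    contracts = Contract.query.all()
    return jsonify([to_dict(contract) for contract in contracts])

@app.route('/api/stats')
def get_stats():
    stats = Stat.query.all()
    return jsonify([to_dict(stat) for stat in stats])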

In [4]:
import requests

r = requests.get('https://www.capfriendly.com/ajax/signings/all/all/all/1-15/0-15000000/12012022-12312022?p=1')

In [34]:
import calendar
from io import StringIO
import pandas as pd

columns = ['PLAYER', 'PLAYER.1', 'AGE', 'POS', 'TEAM', 'DATE', 'TYPE',
           'EXTENSION', 'STRUCTURE', 'LENGTH', 'VALUE', 'CAP HIT']

contracts = []
for year in range(2012, 2023):
    for month in range(1, 13):
        # Date range for the whole month as MMDDYYYY-MMDDYYYY (handles leap years)
        last_day = calendar.monthrange(year, month)[1]
        string = f'{month:02d}01{year}-{month:02d}{last_day}{year}'
        length = 50
        pc = 1
        # Pages are assumed to hold 50 rows, so keep paging until a shorter one comes back
        while length == 50:
            url = f'https://www.capfriendly.com/ajax/signings/all/all/all/1-15/0-15000000/{string}?p={pc}'
            data = requests.get(url).json()['data']
            html = data['html'] if data is not None else ''
            if html == '':
                break  # nothing signed this month, or no further pages
            # The response contains bare table rows, so wrap them in a table before parsing
            soup = BeautifulSoup(f'<table>{html}</table>', 'html.parser')
            df = pd.read_html(StringIO(str(soup.find('table'))))[0]
            df.columns = columns
            contracts.append(df.drop(columns=['PLAYER.1']))
            length = len(df)
            pc += 1

# Concatenate once at the end: months without signings add nothing to the list,
# which avoids "ValueError: No objects to concatenate" on an empty month
contracts = pd.concat(contracts, ignore_index=True)

In [3]:
import requests

request = requests.get('http://127.0.0.1:5000/api/ml/data?season=2021')


In [5]:
# Parse the response body once and reuse it
payload = request.json()
contracts = payload['contracts']
stats = payload['stats']

In [12]:
from ml import ContractPredictor

cp = ContractPredictor(contracts, stats)
cp.dataset.head()

--- 2021 ---


Unnamed: 0,AGE,CAP_HIT,DATE,PLAYER,POS,A,BLK,EVA,EVG,EVSH,...,HIT,PIM,PLUSMINUS,PPA,PPG,PPSH,PS,PTS,S,TOI
0,35,3500000,2021,Alexander Edler,D,0.432,2.87,0.249,0.081,0.0,...,2.119,0.995,-0.011,0.178,0.032,0.005,5.6,0.546,2.259,1429.946
1,26,4500000,2021,Alexander Wennberg,C,0.338,0.551,0.268,0.045,0.005,...,0.419,0.202,0.081,0.056,0.025,0.015,1.766667,0.414,1.061,993.636
2,30,750000,2021,Alex Chiasson,W,0.191,0.377,0.106,0.126,0.01,...,0.995,0.503,-0.015,0.07,0.075,0.015,2.6,0.402,1.402,847.236
3,27,750000,2021,Alex Galchenyuk,F,0.329,0.413,0.178,0.122,0.0,...,0.69,0.338,-0.258,0.15,0.094,0.0,3.366667,0.545,2.15,898.592
4,35,5000000,2021,Alex Goligoski,D,0.335,1.973,0.223,0.058,0.0,...,1.268,0.295,-0.134,0.103,0.027,0.009,5.433333,0.42,1.438,1342.5
