# Sportsbook Evaluation
This notebook pulls historical betting lines and game outcomes from covers.com and builds a calibration curve, which we use to evaluate how well sportsbooks set betting lines for NHL games. We will later use this data to assess the accuracy of our own model.

This notebook will accomplish the following:
1. Get a list of the dates on which NHL games were played in the 2018/2019 season
2. For every game in the 2018/2019 season, pull the betting lines and the result of the game. We use this season because it was the last full season that wasn't shortened due to COVID.
3. Determine how accurate sportsbooks were at predicting the winner of games over the whole season
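
To make the idea of a calibration curve concrete before any data is pulled, here is a minimal pure-Python sketch. It is a simplified stand-in for illustration, not the `calibration_curve` or `brier_score_loss` functions from scikit-learn that the notebook imports. The helper names `calibration_bins` and `brier_score` are hypothetical and the data is made up.

```python
def calibration_bins(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed win rate, count) per non-empty bin."""
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [probability sum, wins, games]
    for outcome, prob in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx][0] += prob
        bins[idx][1] += outcome
        bins[idx][2] += 1
    return [(p_sum / n, wins / n, n) for p_sum, wins, n in bins if n]


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Made-up data: 1 means the team the probability refers to won, 0 means it lost
y_true = [0, 1, 1, 1, 1, 0]
y_prob = [0.25, 0.25, 0.75, 0.75, 0.75, 0.75]

curve = calibration_bins(y_true, y_prob, n_bins=2)
score = brier_score(y_true, y_prob)
for mean_pred, win_rate, n in curve:
    print(f'predicted {mean_pred:.2f} -> observed {win_rate:.2f} over {n} games')
print(f'Brier score: {score:.3f}')
```

A perfectly calibrated forecaster would have the observed win rate equal to the mean predicted probability in every bin. In this toy data the 0.75 bin is well calibrated, while the 0.25 bin wins more often than predicted.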

In [4]:
from bs4 import BeautifulSoup
import datetime as dt
import matplotlib.pyplot as plt
import os
import pickle
from sklearn.calibration import calibration_curve
from sklearn.metrics import accuracy_score, brier_score_loss
import requests
import re
import sys
%matplotlib inline


## Remove Duplicates
This function will come in handy later. It takes a list and removes any duplicate values from it, keeping the first occurrence of each value in its original position. For example, [cat,dog,duck,dog] will return [cat,dog,duck].

In [3]:
def remove_duplicates(x):
    """
    takes a list and removes duplicates from that list
    ...
    Parameters
    ----------
    x: list
        list from which duplicates will be removed
    Returns
    -------
    list
        list with duplicates removed
    """
    return list(dict.fromkeys(x))
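
As a quick check of the behaviour described above, this sketch repeats the function and runs it on the example list. It relies on the fact that dictionaries preserve insertion order (Python 3.7+), so the result stays in first-seen order, which a plain `set` does not guarantee.

```python
def remove_duplicates(x):
    # dict keys are unique and keep insertion order, so duplicates drop out
    # while the first occurrence of each value stays where it was
    return list(dict.fromkeys(x))


animals = ['cat', 'dog', 'duck', 'dog']
unique_animals = remove_duplicates(animals)
print(unique_animals)
```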

## Pull NHL Betting Line Data
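We will get our betting line data from covers.com. Before we can do that, we need to know which days had games, so the function below scrapes the season schedule on hockey-reference.com and returns a list of every date an NHL regular season game was played in the season specified.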

In [5]:
def get_nhl_dates(season):
    """
    returns a list of all dates NHL games were played in the specified season
    ...
    Parameters
    ----------
    season: int
        the two years of the season that NHL dates will be retrieved for (e.g. 20192020)
    Returns
    -------
    list[dt.datetime]
        a list of datetime objects
    """
    # Determine the two years that NHL games were played for that season
    year1: int = int(str(season)[:4])
    year2: int = int(str(season)[4:])

    # Form the url and get response from hockey-reference.com
    url: str = f'https://www.hockey-reference.com/leagues/NHL_{year2}_games.html'
    resp: requests.Response = requests.get(url)
    resp.raise_for_status()  # fail early on an HTTP error rather than parsing an error page

    # Find all the days games were played for year1 and year2.
    days1: list[str] = re.findall(f'html">({year1}.*?)</a></th>', resp.text)
    days2: list[str] = re.findall(f'html">({year2}.*?)</a></th>', resp.text)
    days: list[str] = days1 + days2

    # Remove duplicates and convert strings to datetime
    days = remove_duplicates(days)
    dates: list[dt.datetime] = [dt.datetime.strptime(d, '%Y-%m-%d') for d in days]

    print(f'Number of days NHL regular season played in {season}: ', len(dates))
    return dates
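
The parsing steps inside the function can be tried offline without a network request. The sketch below applies the same regular expression to a small hand-written HTML string. That markup is hypothetical and shaped only so the pattern matches; the real hockey-reference.com page may be laid out differently.

```python
import datetime as dt
import re

# Hypothetical markup: two games on 2018-10-03, one on 2018-10-04, one on 2019-01-01
sample_html = (
    '<th><a href="/boxscores/a.html">2018-10-03</a></th>'
    '<th><a href="/boxscores/b.html">2018-10-03</a></th>'
    '<th><a href="/boxscores/c.html">2018-10-04</a></th>'
    '<th><a href="/boxscores/d.html">2019-01-01</a></th>'
)

season = 20182019
year1 = int(str(season)[:4])  # 2018
year2 = int(str(season)[4:])  # 2019

# Same pattern as in get_nhl_dates: capture link text that starts with the year
days = (re.findall(f'html">({year1}.*?)</a></th>', sample_html)
        + re.findall(f'html">({year2}.*?)</a></th>', sample_html))

# One entry per game, so collapse repeated days while keeping their order
days = list(dict.fromkeys(days))
dates = [dt.datetime.strptime(d, '%Y-%m-%d') for d in days]
print(dates)
```

Each game contributes one link per row, so a day with several games appears several times in the raw matches. That is why the duplicate removal step comes before the conversion to `datetime`.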

Let's get the dates for the 2018/2019 season.

In [None]:
dates = get_nhl_dates(20182019)