# Sportsbook Evaluation
This notebook pulls historical betting lines and game outcomes from covers.com and builds a calibration curve that we can use to evaluate how well sportsbooks set betting lines for NHL games. We will later use the same data to assess the accuracy of our own model.

This notebook will accomplish the following:
1. Get a list of the dates on which NHL games were played in the 2018/2019 season
2. For every game in the 2018/2019 season, pull the betting lines and the result of the game. We use this season because it was the last full season that wasn't shortened due to COVID.
3. Determine how accurate sportsbooks were at predicting the winner of games over the whole season

In [4]:
from bs4 import BeautifulSoup
import datetime as dt
import matplotlib.pyplot as plt
import os
import pickle
from sklearn.calibration import calibration_curve
from sklearn.metrics import accuracy_score, brier_score_loss
import requests
import re
import sys
%matplotlib inline


## Remove Duplicates
This function will come in handy later. It takes a list and removes any duplicate values from it, keeping the first occurrence of each value. For example, [cat,dog,duck,dog] will return [cat,dog,duck].

In [3]:
def remove_duplicates(x):
    """
    takes a list and removes duplicates from that list
    ...
    Parameters
    ----------
    x: list
        list from which duplicates will be removed
    Returns
    -------
    list
        list with duplicates removed
    """
    return list(dict.fromkeys(x))
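
Because `dict.fromkeys` keeps keys in insertion order (guaranteed since Python 3.7), the result preserves the order in which items first appear, which a plain `set` would not guarantee. A quick check, with the function repeated so the snippet runs on its own:

```python
def remove_duplicates(x):
    # Same one-liner as in the cell above, repeated so this snippet is self-contained
    return list(dict.fromkeys(x))

animals = ['cat', 'dog', 'duck', 'dog']
unique_animals = remove_duplicates(animals)
print(unique_animals)  # ['cat', 'dog', 'duck']
```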

## Pull NHL Betting Line Data
We will get our data from covers.com and hockey-reference.com. 

The first function returns a list of every date on which an NHL regular season game was played in the specified season. The list contains datetime objects, each representing a date on which at least one NHL regular season game was played.

In [5]:
def get_nhl_dates(season):
    """
    returns a list of all dates NHL games were played in the specified season
    ...
    Parameters
    ----------
    season: int
        the two years of the season that nhl dates will be retrieved for (ex. 20192020)
    Returns
    -------
    list[dt.datetime]
        a list of datetime objects
    """
    # Determine the two years that NHL games were played for that season
    year1: int = int(str(season)[:4])
    year2: int = int(str(season)[4:])

    # Form the url and get response from hockey-reference.com
    url: str = f'https://www.hockey-reference.com/leagues/NHL_{year2}_games.html'
    resp: requests.Response = requests.get(url)

    # Find all the days games were played for year1 and year 2.
    days1: list[str] = re.findall(f'html">({year1}.*?)</a></th>', resp.text)
    days2: list[str] = re.findall(f'html">({year2}.*?)</a></th>', resp.text)
    days: list[str] = days1 + days2

    # Remove duplicates and convert strings to datetime
    days = remove_duplicates(days)
    dates: list[dt.datetime] = [dt.datetime.strptime(d, '%Y-%m-%d') for d in days]

    print(f'Number of days NHL regular season played in {season}: ', len(dates))
    return dates
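
To see what the regular expressions and the de-duplication step do without hitting the network, here is a minimal sketch that runs the same extraction on a small HTML fragment. The fragment is hypothetical: it is shaped only to match the pattern used above, and the real hockey-reference.com markup may differ.

```python
import datetime as dt
import re

# Hypothetical markup: two games on the same 2018 date and one game in 2019
sample_html = (
    '<th><a href="/boxscores/a.html">2018-10-03</a></th>'
    '<th><a href="/boxscores/b.html">2018-10-03</a></th>'
    '<th><a href="/boxscores/c.html">2019-01-02</a></th>'
)

year1, year2 = 2018, 2019

# Same patterns as in get_nhl_dates: one row is matched per game, so dates repeat
days = (re.findall(f'html">({year1}.*?)</a></th>', sample_html)
        + re.findall(f'html">({year2}.*?)</a></th>', sample_html))

# Drop the repeated dates (order preserved), then convert to datetime
unique_days = list(dict.fromkeys(days))
sample_dates = [dt.datetime.strptime(d, '%Y-%m-%d') for d in unique_days]

print(days)         # ['2018-10-03', '2018-10-03', '2019-01-02']
print(unique_days)  # ['2018-10-03', '2019-01-02']
```

Since the schedule lists one row per game, a date appears once for every game played that day, which is why the duplicates need to be removed before counting days.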

Let's get the dates for the 2018/2019 season. We can see that NHL games were played on 223 days during that season.

In [8]:
dates = get_nhl_dates(20182019)
# print the first 20 dates
print(dates[0:20])
print(len(dates))

Number of days NHL regular season played in 20182019:  223
[datetime.datetime(2018, 10, 3, 0, 0), datetime.datetime(2018, 10, 4, 0, 0), datetime.datetime(2018, 10, 5, 0, 0), datetime.datetime(2018, 10, 6, 0, 0), datetime.datetime(2018, 10, 7, 0, 0), datetime.datetime(2018, 10, 8, 0, 0), datetime.datetime(2018, 10, 9, 0, 0), datetime.datetime(2018, 10, 10, 0, 0), datetime.datetime(2018, 10, 11, 0, 0), datetime.datetime(2018, 10, 13, 0, 0), datetime.datetime(2018, 10, 14, 0, 0), datetime.datetime(2018, 10, 15, 0, 0), datetime.datetime(2018, 10, 16, 0, 0), datetime.datetime(2018, 10, 17, 0, 0), datetime.datetime(2018, 10, 18, 0, 0), datetime.datetime(2018, 10, 19, 0, 0), datetime.datetime(2018, 10, 20, 0, 0), datetime.datetime(2018, 10, 21, 0, 0), datetime.datetime(2018, 10, 22, 0, 0), datetime.datetime(2018, 10, 23, 0, 0)]
223


Now we will write a function that scrapes covers.com for the data we need. It takes a date and returns a list of dictionaries, each representing one game played on that date. Each dictionary stores the following:

- date
- a numeric game id for that game
- home team
- away team
- home money line in American odds (ex. -110 means you need to bet 110 to win 100, while +120 means a bet of 100 wins 120)
- goals scored by the home team
- goals scored by the away team
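
American odds can be turned into the win probability they imply, which is the quantity a calibration curve compares against actual outcomes. The sketch below shows the standard conversion; `american_to_implied_prob` is a hypothetical helper name, not something defined elsewhere in this notebook. The value it returns still includes the bookmaker's margin, so the implied probabilities of both sides of a game typically sum to slightly more than 1.

```python
def american_to_implied_prob(moneyline):
    """Hypothetical helper: convert American odds to an implied win probability."""
    ml = int(moneyline)  # accepts ints or strings such as '-240'
    if ml < 0:
        # Favourite: risk |ml| to win 100
        return -ml / (-ml + 100)
    # Underdog: risk 100 to win ml
    return 100 / (ml + 100)

print(round(american_to_implied_prob(-110), 4))  # 0.5238
print(round(american_to_implied_prob(120), 4))   # 0.4545
```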

In [10]:
def nhl_games_date(date):
    """
    creates a list of game dictionaries for all games played on the date provided
    ...
    Parameters
    ----------
    date: dt.datetime
        date for which we want to retrieve games
    Returns
    -------
    list[dict]
        a list of dictionaries, one per game
    """
    games = []

    # retrieve the covers.com webpage for the date provided
    date = date.strftime('%Y-%m-%d')
    url = f'https://www.covers.com/sports/nhl/matchups?selectedDate={date}'
    resp = requests.get(url)

    # parse the page, and retrieve all the game boxes on the page
    soup = BeautifulSoup(resp.text, features='html.parser')
    scraped_games = soup.find_all('div', {'class': 'cmg_matchup_game_box'})

    # iterate through all the game boxes and retrieve the required information for each game
    for g in scraped_games:
        game_id = g['data-event-id'] # game_id
        h_abv = g['data-home-team-shortname-search'] # home_team
        a_abv = g['data-away-team-shortname-search'] # away_team
        h_ml = g['data-game-odd'] # home moneyline

        try:
            h_score = g.find('div', {'class': 'cmg_matchup_list_score_home'}).get_text(strip=True) # home score
            a_score = g.find('div', {'class': 'cmg_matchup_list_score_away'}).get_text(strip=True) # away score
        except AttributeError:  # find() returned None, so no score was found; leave as blank
            h_score = ''
            a_score = ''

        game = {'date':date, 'game_id':game_id, 'home_team':h_abv,
                'away_team':a_abv, 'home_ml':h_ml, 'home_score':h_score,
                'away_score':a_score}

        games.append(game)

    return games

Now we will use our `dates` list and the function above to pull data for every date in the list.

In [13]:
# initialize list to hold our data
games = []

# run a loop to get data for every date in dates
for date in dates:
    games += nhl_games_date(date)

# check length of games
print(f'There are {len(games)} games in our list\n')

# print an entry
print(games[0])

There are 1358 games in our list

[{'date': '2018-10-03', 'game_id': '121530', 'home_team': 'TOR', 'away_team': 'MON', 'home_ml': '-240', 'home_score': '3', 'away_score': '2'}, {'date': '2018-10-03', 'game_id': '121531', 'home_team': 'WAS', 'away_team': 'BOS', 'home_ml': '-115', 'home_score': '7', 'away_score': '0'}, {'date': '2018-10-03', 'game_id': '121532', 'home_team': 'VAN', 'away_team': 'CAL', 'home_ml': '115', 'home_score': '5', 'away_score': '2'}, {'date': '2018-10-03', 'game_id': '121533', 'home_team': 'SJ', 'away_team': 'ANA', 'home_ml': '-185', 'home_score': '2', 'away_score': '5'}, {'date': '2018-10-04', 'game_id': '121536', 'home_team': 'PIT', 'away_team': 'WAS', 'home_ml': '-165', 'home_score': '7', 'away_score': '6'}, {'date': '2018-10-04', 'game_id': '121537', 'home_team': 'CAR', 'away_team': 'NYI', 'home_ml': '-170', 'home_score': '1', 'away_score': '2'}, {'date': '2018-10-04', 'game_id': '121534', 'home_team': 'BUF', 'away_team': 'BOS', 'home_ml': '115', 'home_score': '0', '

Great! We now have a list covering every game in the 2018/2019 season, with the betting line and final score for each.
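
Note that every value in these dictionaries is a string, including the money line and the scores, so they need to be converted before any numeric work. The following is a minimal sketch of one way to do that; `parse_game` and the `home_win` key are hypothetical names introduced here for illustration, and the handling of blank values is an assumption about how missing data should be treated.

```python
def parse_game(game):
    """Hypothetical helper: return a copy of a scraped game dict with numeric fields."""
    parsed = dict(game)

    # Money line: assume a blank string means no line was posted
    parsed['home_ml'] = int(game['home_ml']) if game['home_ml'] else None

    # Scores are blank when the scraper could not find them
    has_score = game['home_score'] != '' and game['away_score'] != ''
    parsed['home_score'] = int(game['home_score']) if has_score else None
    parsed['away_score'] = int(game['away_score']) if has_score else None

    # True if the home team won, None if the result is unknown
    parsed['home_win'] = (parsed['home_score'] > parsed['away_score']) if has_score else None
    return parsed

# First game from the output above
sample = {'date': '2018-10-03', 'game_id': '121530', 'home_team': 'TOR',
          'away_team': 'MON', 'home_ml': '-240', 'home_score': '3', 'away_score': '2'}

parsed = parse_game(sample)
print(parsed['home_ml'], parsed['home_win'])  # -240 True
```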