# NFL Betting Model

This project will try to predict nfl game outcomes against the spread to try to derive a betting system based on historical nfl game data.

This dataset was found on Kaggle and contains NFL game results since 1966 with betting odds information since 1979. The dataset was created from a variety of sources include ESPN, NFL.com, and Pro Football Reference. Weather information is from NOAA data with NFLweather.com as a cross reference. Betting data was used from http://www.repole.com/sun4cast/data.html for the 1978-2013 seasons. From 2013 on, betting data was from sportsline.com

In [80]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

from IPython.display import display
pd.options.mode.chained_assignment = None
pd.set_option('display.max_rows', 1100)
pd.set_option('display.max_columns', 50)

import warnings
warnings.filterwarnings('ignore')

In [81]:
#Read in datasets
games = pd.read_csv('/Users/jaymilch/Documents/Coding Projects/Kaggle/NFL Betting/NFL Betting Datasets/spreadspoke_scores.csv')
teams = pd.read_csv('/Users/jaymilch/Documents/Coding Projects/Kaggle/NFL Betting/NFL Betting Datasets/nfl_teams.csv')

# Clean and Explore Data

The dataset only contains betting data from the year 1978 on, so we will remove all games from earlier than 1978

In [82]:
games = games[games['schedule_season'] >= 1978]

In [83]:
print(games.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10951 entries, 2268 to 13218
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   schedule_date        10951 non-null  object 
 1   schedule_season      10951 non-null  int64  
 2   schedule_week        10951 non-null  object 
 3   schedule_playoff     10951 non-null  bool   
 4   team_home            10951 non-null  object 
 5   score_home           10951 non-null  int64  
 6   score_away           10951 non-null  int64  
 7   team_away            10951 non-null  object 
 8   team_favorite_id     10728 non-null  object 
 9   spread_favorite      10728 non-null  float64
 10  over_under_line      10719 non-null  object 
 11  stadium              10951 non-null  object 
 12  stadium_neutral      10951 non-null  bool   
 13  weather_temperature  10138 non-null  float64
 14  weather_wind_mph     10121 non-null  float64
 15  weather_humidity     6522 non-nul

There are a few columns that include useless information, so we will drop them. We will only be trying to predict spread outcomes, so we will drop the 'over_under_line' column

In [84]:
results = games[['schedule_season', 'schedule_week', 'schedule_playoff', 'team_home','score_home', 'team_away','score_away', 'spread_favorite', 'team_favorite_id','weather_temperature', 'weather_wind_mph']]
results.reset_index(inplace = True, drop = True)
display(results.head())

Unnamed: 0,schedule_season,schedule_week,schedule_playoff,team_home,score_home,team_away,score_away,spread_favorite,team_favorite_id,weather_temperature,weather_wind_mph
0,1978,1,False,Tampa Bay Buccaneers,13,New York Giants,19,-2.0,TB,83.0,8.0
1,1978,1,False,Atlanta Falcons,20,Houston Oilers,14,,,77.0,10.0
2,1978,1,False,Buffalo Bills,17,Pittsburgh Steelers,28,,,66.0,12.0
3,1978,1,False,Chicago Bears,17,St. Louis Cardinals,10,,,74.0,13.0
4,1978,1,False,Cincinnati Bengals,23,Kansas City Chiefs,24,,,68.0,6.0


There are some null values in the 'spread_favorite' column, so we will have to remove all rows that have a null value in this column

In [85]:
results = results[results['spread_favorite'].notna()]
print(results.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10728 entries, 0 to 10950
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   schedule_season      10728 non-null  int64  
 1   schedule_week        10728 non-null  object 
 2   schedule_playoff     10728 non-null  bool   
 3   team_home            10728 non-null  object 
 4   score_home           10728 non-null  int64  
 5   team_away            10728 non-null  object 
 6   score_away           10728 non-null  int64  
 7   spread_favorite      10728 non-null  float64
 8   team_favorite_id     10728 non-null  object 
 9   weather_temperature  9915 non-null   float64
 10  weather_wind_mph     9898 non-null   float64
dtypes: bool(1), float64(3), int64(3), object(4)
memory usage: 932.4+ KB
None


The 'weather_temperature' and 'weather_wind_mph' columns have a good amount of null values, so we will remove them for now

In [86]:
results.drop(columns = ['weather_temperature', 'weather_wind_mph'], inplace = True)
print(results.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10728 entries, 0 to 10950
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   schedule_season   10728 non-null  int64  
 1   schedule_week     10728 non-null  object 
 2   schedule_playoff  10728 non-null  bool   
 3   team_home         10728 non-null  object 
 4   score_home        10728 non-null  int64  
 5   team_away         10728 non-null  object 
 6   score_away        10728 non-null  int64  
 7   spread_favorite   10728 non-null  float64
 8   team_favorite_id  10728 non-null  object 
dtypes: bool(1), float64(1), int64(3), object(4)
memory usage: 764.8+ KB
None


Some of the values in the 'team_favorite_id' column are 'PICK' indicating that the spread of the game is 0, there is no favorite. We can set this value equal to the home team, since it is arbitrary who we label as the "favorite", since there really is no favorite

The 'schedule_week' column is an object type because it has playoff weeks that are entered as strings. The 2021 season playoffs aren't in this dataset so we don't have to worry about that yet, but we will keep the code in here for now for when that data is added

In [87]:
results_1989 = results[results['schedule_season'] <= 1989]
results_1989.loc[results_1989['schedule_week'] == 'Wildcard', 'schedule_week'] = 17
results_1989.loc[results_1989['schedule_week'] == 'WildCard', 'schedule_week'] = 17
results_1989.loc[results_1989['schedule_week'] == 'Division', 'schedule_week'] = 18
results_1989.loc[results_1989['schedule_week'] == 'Conference', 'schedule_week'] = 19
results_1989.loc[results_1989['schedule_week'] == 'Superbowl', 'schedule_week'] = 20
results_1989.loc[results_1989['schedule_week'] == 'SuperBowl', 'schedule_week'] = 20

results_16_games = results[(results['schedule_season'] >= 1990) & (results['schedule_season'] <= 2020) & (results['schedule_season'] != 1993)]
results_16_games.loc[results_16_games['schedule_week'] == 'Wildcard', 'schedule_week'] = 18
results_16_games.loc[results_16_games['schedule_week'] == 'WildCard', 'schedule_week'] = 18
results_16_games.loc[results_16_games['schedule_week'] == 'Division', 'schedule_week'] = 19
results_16_games.loc[results_16_games['schedule_week'] == 'Conference', 'schedule_week'] = 20
results_16_games.loc[results_16_games['schedule_week'] == 'Superbowl', 'schedule_week'] = 21
results_16_games.loc[results_16_games['schedule_week'] == 'SuperBowl', 'schedule_week'] = 21

#The 1993 season and 2021 seasons had 18 weeks
results_2021 = results[(results['schedule_season'] >= 2021) | (results['schedule_season'] == 1993)]
results_2021.loc[results_2021['schedule_week'] == 'Wildcard', 'schedule_week'] = 19
results_2021.loc[results_2021['schedule_week'] == 'Division', 'schedule_week'] = 20
results_2021.loc[results_2021['schedule_week'] == 'Conference', 'schedule_week'] = 21
results_2021.loc[results_2021['schedule_week'] == 'Superbowl', 'schedule_week'] = 22

results = pd.concat([results_1989, results_16_games, results_2021])

#Convert 'schedule_week column from string to int type'
results['schedule_week'] = pd.to_numeric(results['schedule_week'])
results.sort_values(by = ['schedule_season', 'schedule_week'], inplace = True)
print(results.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10728 entries, 0 to 10950
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   schedule_season   10728 non-null  int64  
 1   schedule_week     10728 non-null  int64  
 2   schedule_playoff  10728 non-null  bool   
 3   team_home         10728 non-null  object 
 4   score_home        10728 non-null  int64  
 5   team_away         10728 non-null  object 
 6   score_away        10728 non-null  int64  
 7   spread_favorite   10728 non-null  float64
 8   team_favorite_id  10728 non-null  object 
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 764.8+ KB
None


We now need to map the 'team_home' and 'team_away' columns with the team id's, so we can use the team_favorite_id column to determine the against the spread result of each of the games.

In [88]:
#Create dictionary mapping team names with team ID's
team_mapping_dict = pd.Series(teams['team_id'].values, index = teams['team_name']).to_dict()

#Map team names to team ID's
results['team_home'] = results['team_home'].map(team_mapping_dict)
results['team_away'] = results['team_away'].map(team_mapping_dict)

In [89]:
results.reset_index(inplace = True)
results.drop(columns = 'index', inplace = True)

for row in range(len(results)):
    if results.loc[row, 'team_favorite_id'] == 'PICK':
        results.loc[row, 'team_favorite_id'] = results.loc[row, 'team_home']
    else:
        continue

In [90]:
print(results['schedule_season'].value_counts())

2021    272
2020    269
2015    267
2010    267
2005    267
2012    267
2004    267
2006    267
2019    267
2011    267
2003    267
2014    267
2018    267
2013    267
2002    267
2017    267
2009    267
2007    267
2016    267
2008    267
2000    259
1999    259
2001    259
1998    251
1995    251
1997    251
1996    251
1990    235
1993    235
1994    235
1992    235
1991    235
1983    233
1984    233
1989    233
1981    233
1988    233
1979    233
1986    233
1985    233
1980    233
1987    177
1982    141
1978     10
Name: schedule_season, dtype: int64


We can see that the 1978, 1982, and 1987 seasons are missing a lot of entries, so we will delete these from our dataframe

In [91]:
results = results[(results['schedule_season'] != 1978) & (results['schedule_season'] != 1982) & (results['schedule_season'] != 1987)]

Some weeks are labeled incorrectly, this code will fix that

In [92]:
results.loc[2479, 'schedule_week'] = 5
results.loc[2480, 'schedule_week'] = 5

# Feature Engineering

The function below takes in a team and will create an against the spread record for that given team and add that teams against the spread record to the dataframe representing the results of a given season

In [93]:
def team_ats_record(team, season):
    #Create a dataframe of a specific team to generate their Against the Spread record
    
    season_df = results[results['schedule_season'] == season]
    team_df = pd.concat([season_df[season_df['team_home'] == team], season_df[season_df['team_away'] == team]])
    team_df.sort_values(by = 'schedule_week', inplace = True)
    team_df = team_df.reset_index()

    weeks = team_df['schedule_week'].unique().tolist()

    #Create an empty list of against the spread results
    ats_results = []

    for week in weeks:
        week_df = team_df[team_df['schedule_week'] == week]
        home_team = week_df['team_home'].values
        away_team = week_df['team_away'].values
        favorite = week_df['team_favorite_id'].values
    
        if (home_team == favorite) & (home_team == team):
            ats_result = week_df['score_home'].values[0] + week_df['spread_favorite'].values[0] - week_df['score_away'].values[0]
            ats_results.append(ats_result)
        elif (away_team == favorite) & (home_team == team):
            ats_result = week_df['score_away'].values[0] + week_df['spread_favorite'].values[0] - week_df['score_home'].values[0]
            ats_results.append(ats_result)
        elif (home_team == favorite) & (away_team == team):
            ats_result = week_df['score_away'].values[0] - week_df['spread_favorite'].values[0] - week_df['score_home'].values[0]
            ats_results.append(ats_result)
        elif (away_team == favorite) & (away_team == team):
            ats_result = week_df['score_away'].values[0] + week_df['spread_favorite'].values[0] - week_df['score_home'].values[0]
            ats_results.append(ats_result)

    #Create against the spread results for current team
    ats_wins = [0]
    ats_losses = [0]

    for i in range(len(ats_results) - 1):
        if ats_results[i] > 0:
            ats_wins.append(ats_wins[i] + 1)
            ats_losses.append(ats_losses[i])
        elif ats_results[i] < 0:
            ats_wins.append(ats_wins[i])
            ats_losses.append(ats_losses[i] + 1)
        elif ats_results[i] == 0:
            ats_wins.append(ats_wins[i])
            ats_losses.append(ats_losses[i])

    #Convert against the spread results from lists to pandas Series
    #Note that ats_wins and ats_losses are the pregame values
    ats_losses_series = pd.Series(ats_losses)
    ats_wins_series = pd.Series(ats_wins)
    ats_results_series = pd.Series(ats_results)
    ats_record = ats_wins_series - ats_losses_series

    team_ats_df = pd.DataFrame([ats_wins_series, ats_losses_series, ats_results_series, ats_record]).transpose()
    team_ats_df.columns = ['ats_wins', 'ats_losses', 'ats_results', 'ats_record']

    #Create columns that are the home teams' ATS wins and losses and away team's ATS wins and losses
    team_df['home_team_ats_record'] = np.nan
    team_df['away_team_ats_record'] = np.nan

    for row in range(len(team_df)):
        if team_df.loc[row, 'team_home'] == team:
            team_df.loc[row, 'home_team_ats_record'] = team_ats_df.loc[row, 'ats_record']
        elif team_df.loc[row, 'team_away'] == team:
            team_df.loc[row, 'away_team_ats_record'] = team_ats_df.loc[row, 'ats_record']
    
    return team_df

The following function will create the average points scored and allowed per game for each team given a team dataframe for a given season

In [94]:
def team_ppg(team_df, team):

    total_score_for = 0
    total_score_against = 0
    number_of_games = 0
    team_ppg_for = [0]
    team_ppg_against = [0]

    #Create a new copy of dataframe so original dataframe isn't altered
    team_df_ppg = team_df
    
    #Create new columns that represent the average point per game scored for the given team in the team_df
    team_df_ppg['home_team_ppg_for'] = np.nan
    team_df_ppg['away_team_ppg_for'] = np.nan
    
    #Create new columns that represent the average point per game allowed for the given team in the team_df
    team_df_ppg['home_team_ppg_against'] = np.nan
    team_df_ppg['away_team_ppg_against'] = np.nan
 
    for row in range(len(team_df_ppg) - 1):
        if team_df_ppg.loc[row, 'team_home'] == team:
            total_score_for += team_df_ppg.loc[row, 'score_home']
            total_score_against += team_df_ppg.loc[row, 'score_away']
            number_of_games += 1
        elif team_df_ppg.loc[row, 'team_away'] == team:
            total_score_for += team_df_ppg.loc[row, 'score_away']
            total_score_against += team_df_ppg.loc[row, 'score_home']
            number_of_games += 1
        team_ppg_for.append(round(total_score_for/number_of_games, 1))
        team_ppg_against.append(round(total_score_against/number_of_games, 1))

    ppg_for_series = pd.Series(team_ppg_for)
    ppg_against_series = pd.Series(team_ppg_against)

    for row in range(len(team_df_ppg)):
        if team_df_ppg.loc[row, 'team_home'] == team:
            team_df_ppg.loc[row, 'home_team_ppg_for'] = ppg_for_series.loc[row]
            team_df_ppg.loc[row, 'home_team_ppg_against'] = ppg_against_series.loc[row]
        elif team_df_ppg.loc[row, 'team_away'] == team:
            team_df_ppg.loc[row, 'away_team_ppg_for'] = ppg_for_series.loc[row]
            team_df_ppg.loc[row, 'away_team_ppg_against'] = ppg_against_series.loc[row]
            
    return team_df_ppg

The following function will add the against the against the spread records, points for, points against, etc. for an entire season

In [95]:
def make_season_df(season):
    season_ats_df = pd.DataFrame()
    season_df = results[results['schedule_season'] == season]
    teams = season_df['team_home'].unique()
    
    for team in teams:
        team_df = team_ats_record(team, season)
        team_df = team_ppg(team_df, team)
        season_ats_df = season_ats_df.append(team_df)
        
    season_ats_df.sort_values(by = 'index', inplace = True)
    season_ats_df.reset_index(inplace = True)
    
    for row in range(len(season_ats_df) - 1):
        top_row_index = season_ats_df.loc[row, 'index']
        bottom_row_index = season_ats_df.loc[row+1, 'index']
        
        if top_row_index == bottom_row_index:
            home_ats_val_1 = season_ats_df.loc[row, 'home_team_ats_record']
            away_ats_val_1 = season_ats_df.loc[row, 'away_team_ats_record']
            home_ats_val_2 = season_ats_df.loc[row+1, 'home_team_ats_record']
            away_ats_val_2 = season_ats_df.loc[row+1, 'away_team_ats_record']
    
            if np.isnan(home_ats_val_1) == True:
                season_ats_df.loc[row, 'home_team_ats_record'] = home_ats_val_2
            if np.isnan(away_ats_val_1) == True:
                season_ats_df.loc[row, 'away_team_ats_record'] = away_ats_val_2
                
            home_ppg_for_1 = season_ats_df.loc[row, 'home_team_ppg_for']
            away_ppg_for_1 = season_ats_df.loc[row, 'away_team_ppg_for']
            home_ppg_for_2 = season_ats_df.loc[row+1, 'home_team_ppg_for']
            away_ppg_for_2 = season_ats_df.loc[row+1, 'away_team_ppg_for']
            
            if np.isnan(home_ppg_for_1) == True:
                season_ats_df.loc[row, 'home_team_ppg_for'] = home_ppg_for_2
            if np.isnan(away_ppg_for_1) == True:
                season_ats_df.loc[row, 'away_team_ppg_for'] = away_ppg_for_2
                
            home_ppg_against_1 = season_ats_df.loc[row, 'home_team_ppg_against']
            away_ppg_against_1 = season_ats_df.loc[row, 'away_team_ppg_against']
            home_ppg_against_2 = season_ats_df.loc[row+1, 'home_team_ppg_against']
            away_ppg_against_2 = season_ats_df.loc[row+1, 'away_team_ppg_against']
            
            if np.isnan(home_ppg_against_1) == True:
                season_ats_df.loc[row, 'home_team_ppg_against'] = home_ppg_against_2
            if np.isnan(away_ppg_for_1) == True:
                season_ats_df.loc[row, 'away_team_ppg_against'] = away_ppg_against_2
                
        else:
            continue
    
    season_ats_df.drop(columns = 'level_0', inplace = True)
    season_ats_df.dropna(inplace = True)
    
    return season_ats_df

In [96]:
#Loop through seasons to make full dataframe of all games
ats_results = pd.DataFrame()
seasons = results['schedule_season'].unique()

for season in seasons:
    season_ats_df = make_season_df(season)
    ats_results = ats_results.append(season_ats_df)

We need to create a column indicating whether or not the home team is the favorite

In [97]:
ats_results['home_team_fav'] = np.where(ats_results['team_home'] == ats_results['team_favorite_id'], 1, 0)

# Add Target Column

We are trying to predict whether or not the favorited team will cover the spread. Our target column will therefore be a binary column, with a 1 indicating that the favorited team covered the spread, and a 0 indicating that the davorited team didn't cover the spread.

Create a column that represents how much the favorited team covered (or didn't cover, represented by a negative number) the spread by

In [98]:
home_team_favorite = ats_results[ats_results['team_home'] == ats_results['team_favorite_id']]
home_team_favorite['favorite_spread_result'] = home_team_favorite['score_home'] + home_team_favorite['spread_favorite'] - home_team_favorite['score_away']
away_team_favorite = ats_results[ats_results['team_away'] == ats_results['team_favorite_id']]
away_team_favorite['favorite_spread_result'] = away_team_favorite['score_away'] + away_team_favorite['spread_favorite'] - away_team_favorite['score_home']
ats_results = pd.concat([home_team_favorite, away_team_favorite])
ats_results.sort_values(by = 'index', inplace = True)
display(ats_results.tail())

Unnamed: 0,index,schedule_season,schedule_week,schedule_playoff,team_home,score_home,team_away,score_away,spread_favorite,team_favorite_id,home_team_ats_record,away_team_ats_record,home_team_ppg_for,away_team_ppg_for,home_team_ppg_against,away_team_ppg_against,home_team_fav,favorite_spread_result
534,10723,2021,18,False,LAR,24,SF,27,-3.5,SF,-2.0,0.0,27.2,25.0,21.6,21.3,0,-0.5
536,10724,2021,18,False,MIA,33,NE,24,-6.0,NE,1.0,4.0,19.2,27.4,21.8,16.9,0,-15.0
538,10725,2021,18,False,MIN,31,CHI,17,-3.5,MIN,0.0,6.0,24.6,18.4,25.6,23.5,1,10.5
540,10726,2021,18,False,NYG,7,WAS,22,-6.0,WAS,-4.0,-1.0,15.7,19.6,24.6,26.7,0,9.0
542,10727,2021,18,False,TB,41,CAR,17,-10.5,TB,0.0,-3.0,29.4,17.9,21.0,22.7,1,13.5


Create a column that is 0 if the 'favorite_spread_result' column is <=0, and 1 if the 'favorite_spread_result' column is >0. We will do this using a lambda function

In [99]:
ats_results['favorite_spread_binary'] = ats_results['favorite_spread_result'].apply(lambda row: 0 if row <= 0 else 1)

In [100]:
ats_results.reset_index(inplace = True, drop = True)
ats_results.drop(columns = ['index'], inplace = True)
display(ats_results.tail())

Unnamed: 0,schedule_season,schedule_week,schedule_playoff,team_home,score_home,team_away,score_away,spread_favorite,team_favorite_id,home_team_ats_record,away_team_ats_record,home_team_ppg_for,away_team_ppg_for,home_team_ppg_against,away_team_ppg_against,home_team_fav,favorite_spread_result,favorite_spread_binary
10395,2021,18,False,LAR,24,SF,27,-3.5,SF,-2.0,0.0,27.2,25.0,21.6,21.3,0,-0.5,0
10396,2021,18,False,MIA,33,NE,24,-6.0,NE,1.0,4.0,19.2,27.4,21.8,16.9,0,-15.0,0
10397,2021,18,False,MIN,31,CHI,17,-3.5,MIN,0.0,6.0,24.6,18.4,25.6,23.5,1,10.5,1
10398,2021,18,False,NYG,7,WAS,22,-6.0,WAS,-4.0,-1.0,15.7,19.6,24.6,26.7,0,9.0,1
10399,2021,18,False,TB,41,CAR,17,-10.5,TB,0.0,-3.0,29.4,17.9,21.0,22.7,1,13.5,1


# Logistic Regression

We will now see how good a model can perform at predicting spreads using only the basic feature engineering we've done. We will eventually feature engineer more columns to see if that will improve performance, and will also try to do web scraping to get more advanced stats like passing and rushing yards for and against per game

The first model we'll implement will be logistic regression

In [101]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

In [102]:
#X = ats_results[['spread_favorite', 'home_team_ats_record', 'away_team_ats_record', 'home_team_ppg_for', 'away_team_ppg_for', 'home_team_ppg_against', 'away_team_ppg_against', 'home_team_fav']]
X = ats_results[['spread_favorite', 'home_team_ppg_for', 'away_team_ppg_for', 'home_team_ppg_against', 'away_team_ppg_against', 'home_team_fav']]
y = ats_results['favorite_spread_binary']

#Normalize columns using min-max normalization
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled)
X_scaled.columns = X.columns

display(X_scaled.tail())

#Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Unnamed: 0,spread_favorite,home_team_ppg_for,away_team_ppg_for,home_team_ppg_against,away_team_ppg_against,home_team_fav
10395,0.867925,0.461017,0.480769,0.366102,0.417647,0.0
10396,0.773585,0.325424,0.526923,0.369492,0.331373,0.0
10397,0.867925,0.416949,0.353846,0.433898,0.460784,1.0
10398,0.773585,0.266102,0.376923,0.416949,0.523529,0.0
10399,0.603774,0.498305,0.344231,0.355932,0.445098,1.0


In [103]:
#Implement Logistic Regression
lr = LogisticRegression(random_state = 0)
lr.fit(X_train, y_train)
test_predictions = lr.predict(X_test)
train_predictions = lr.predict(X_train)
test_matches = (y_test == test_predictions)
train_matches = (y_train == train_predictions)
test_accuracy = test_matches.sum()/len(y_test)
train_accuracy = train_matches.sum()/len(y_train)
print('test_accuracy', test_accuracy)
print('train_accuracy', train_accuracy)

test_accuracy 0.5288461538461539
train_accuracy 0.5289663461538462


The accuracy of the model isn't great, but this 53% accuracy score is a result from trying to predict every game. We won't have our model predict every game. We'll have our model only select games in which the probability is above a certain threshold

In [104]:
probabilities = lr.predict_proba(X_test)
probabilities_df = pd.DataFrame(probabilities).rename(columns = {0:'probability_favorite_not_cover', 1:'probability_favorite_cover'})
probabilities_df.sort_values(by = 'probability_favorite_not_cover', inplace = True)
display(probabilities_df)

Unnamed: 0,probability_favorite_not_cover,probability_favorite_cover
1348,0.482203,0.517797
692,0.484546,0.515454
1460,0.487200,0.512800
592,0.488363,0.511637
1618,0.492237,0.507763
...,...,...
1365,0.563426,0.436574
48,0.563438,0.436562
1442,0.563526,0.436474
648,0.564815,0.435185


The logistic regression is predicting that the favorite will not cover the spread almost all of the time. This is clearly an incorrect result. To explore more we will look at precision and recall scores.

In [105]:
precision = precision_score(y_test, test_predictions)
recall = recall_score(y_test, test_predictions)

print('Precision:', precision)
print('Recall:', recall)

Precision: 0.4
Recall: 0.006141248720573183


# Random Forest Classifier

The logistic regression model had a very small recall for some reason. It could be issues with the data, or it could be that logisitc regression isn't great at fitting the data. We will try a random forest classifier to see if we can get better results.

In [106]:
clf = RandomForestClassifier(n_estimators = 100, random_state = 0, min_samples_leaf = 100)
clf.fit(X_train, y_train)

RandomForestClassifier(min_samples_leaf=100, random_state=0)

In [107]:
forest_predictions_test = clf.predict(X_test)
forest_predictions_train = clf.predict(X_train)
forest_test_matches = (y_test == forest_predictions_test)
forest_train_matches = (y_train == forest_predictions_train)
forest_test_accuracy = forest_test_matches.sum()/len(y_test)
forest_train_accuracy = forest_train_matches.sum()/len(y_train)
print('Test Accuracy: ', forest_test_accuracy)
print('Train Accuracy: ', forest_train_accuracy)

Test Accuracy:  0.5274038461538462
Train Accuracy:  0.5848557692307692


# Predicting Scores

For some reason our algorithm is prediction that the favorite won't cover the spread in almost every game in the test set. This is a reasonably common error when the test set doesn't have an even distribution (for example if 98% of the test set was labelled negative, it would be common for a classification algorithm to predict a 0 for all data points). This is not the case in our situation, where 46% of the games resulted in the favorite covering the spread. We will now try to use a regression algorithm to predict the scores of the favorite and underdog, then determine whether or not it is likely that the favorite will cover the spread

We need to create different X_train, X_test, y_train, y_test since we have different target columns

In [108]:
X = ats_results[['schedule_week', 'spread_favorite', 'home_team_ats_record', 'away_team_ats_record', 'home_team_ppg_for', 'away_team_ppg_for', 'home_team_ppg_against','away_team_ppg_against', 'home_team_fav']]

#y1 will be the predicted home team score, y2 will be the predicted away team score
y_home = ats_results['score_home']
y_away = ats_results['score_away']

X_train, X_test, y_train_home, y_test_home, y_train_away, y_test_away = train_test_split(X, y_home, y_away, test_size = 0.2, random_state = 0)

In [109]:
#Implement Random Forest
rfr = RandomForestRegressor(n_estimators = 100, random_state = 0, min_samples_leaf = 100)

#Home team prediction
rfr.fit(X_train, y_train_home)
test_predictions_home = rfr.predict(X_test)
train_predictions_home = rfr.predict(X_train)

#Away team prediction
rfr.fit(X_train, y_train_away)
test_predictions_away = rfr.predict(X_test)
train_predictions_away = rfr.predict(X_train)

In [110]:
#Add predictions to overall dataframe
X_train['score_home_prediction'] = np.nan
X_train['score_away_prediction'] = np.nan
X_test['score_home_prediction'] = np.nan
X_test['score_away_prediction'] = np.nan
X_train['train/test'] = 'train'
X_test['train/test'] = 'test'

X_train.loc[:, 'score_home_prediction'] = train_predictions_home
X_train.loc[:, 'score_away_prediction'] = train_predictions_away
X_test.loc[:, 'score_home_prediction'] = test_predictions_home
X_test.loc[:, 'score_away_prediction'] = test_predictions_away

ats_predictions = pd.concat([X_train, X_test], axis = 0)
ats_predictions.sort_index(inplace = True)

In [121]:
#Add columns from ats_results to ats_predictions dataframe
ats_predictions['schedule_season'] = ats_results['schedule_season']
ats_predictions['schedule_playoff'] = ats_results['schedule_playoff']
ats_predictions['team_home'] = ats_results['team_home']
ats_predictions['team_away'] = ats_results['team_away']
ats_predictions['team_favorite_id'] = ats_results['team_favorite_id']
ats_predictions['favorite_cover?'] = ats_results['favorite_spread_binary']
ats_predictions['score_home'] = ats_results['score_home']
ats_predictions['score_away'] = ats_results['score_away']
ats_predictions['favorite_spread_result'] = ats_results['favorite_spread_result']

#Reorder columns
cols = ['train/test', 'schedule_season', 'schedule_week', 'schedule_playoff', 'team_home', 'team_away', 'team_favorite_id', 'score_home', 'score_away', 'spread_favorite', 'home_team_ats_record', 'away_team_ats_record', 'home_team_ppg_for', 'away_team_ppg_for', 'home_team_ppg_against', 'away_team_ppg_against', 'home_team_fav', 'favorite_spread_result', 'score_home_prediction', 'score_away_prediction', 'favorite_cover?']
ats_predictions = ats_predictions[cols]

In [122]:
#Separate ats_predictions back into train and test sets
ats_predictions_train = ats_predictions[ats_predictions['train/test'] == 'train']
ats_predictions_test = ats_predictions[ats_predictions['train/test'] == 'test']

In [123]:
#Create column that predicts whether or not the favorite will cover based on the predicted scores of the home and away team for train dataframe
ats_predictions_train.reset_index(inplace = True)
spread_results = []

for game in range(len(ats_predictions_train)):
    away_team = ats_predictions_train.loc[game, 'team_away']
    home_team = ats_predictions_train.loc[game, 'team_home']
    favorite = ats_predictions_train.loc[game, 'team_favorite_id']
    
    if (favorite == away_team):
        spread_result = ats_predictions_train.loc[game, 'score_away_prediction'] + ats_predictions_train.loc[game, 'spread_favorite'] - ats_predictions_train.loc[game, 'score_home_prediction']
    elif (favorite == home_team):
        spread_result = ats_predictions_train.loc[game, 'score_home_prediction'] + ats_predictions_train.loc[game, 'spread_favorite'] - ats_predictions_train.loc[game, 'score_away_prediction']

    spread_results.append(spread_result)
        
ats_predictions_train['favorite_cover_prediction'] = pd.Series(spread_results)

In [124]:
#Create column that predicts whether or not the favorite will cover based on the predicted scores of the home and away team for test dataframe
ats_predictions_test.reset_index(inplace = True)
spread_results = []

for game in range(len(ats_predictions_test)):
    away_team = ats_predictions_test.loc[game, 'team_away']
    home_team = ats_predictions_test.loc[game, 'team_home']
    favorite = ats_predictions_test.loc[game, 'team_favorite_id']
    
    if (favorite == away_team):
        spread_result = ats_predictions_test.loc[game, 'score_away_prediction'] + ats_predictions_test.loc[game, 'spread_favorite'] - ats_predictions_test.loc[game, 'score_home_prediction']
    elif (favorite == home_team):
        spread_result = ats_predictions_test.loc[game, 'score_home_prediction'] + ats_predictions_test.loc[game, 'spread_favorite'] - ats_predictions_test.loc[game, 'score_away_prediction']

    spread_results.append(spread_result)
        
ats_predictions_test['favorite_cover_prediction'] = pd.Series(spread_results)

In [125]:
#Create a predicted column that is binary
ats_predictions_train['favorite_cover_prediction_binary'] = np.where(ats_predictions_train['favorite_cover_prediction'] > 0, 1, 0)
ats_predictions_test['favorite_cover_prediction_binary'] = np.where(ats_predictions_test['favorite_cover_prediction'] > 0, 1, 0)

display(ats_predictions_train.tail(20))
display(ats_predictions_test.tail(20))

Unnamed: 0,index,train/test,schedule_season,schedule_week,schedule_playoff,team_home,team_away,team_favorite_id,score_home,score_away,spread_favorite,home_team_ats_record,away_team_ats_record,home_team_ppg_for,away_team_ppg_for,home_team_ppg_against,away_team_ppg_against,home_team_fav,favorite_spread_result,score_home_prediction,score_away_prediction,favorite_cover?,favorite_cover_prediction,favorite_cover_prediction_binary
8300,10373,train,2021,17,False,GB,MIN,GB,37,10,-13.0,5.0,1.0,25.5,25.6,21.6,24.8,1,14.0,28.506137,17.566621,1,-2.060484,0
8301,10374,train,2021,17,False,IND,LVR,IND,20,23,-8.5,7.0,-1.0,28.0,21.1,21.1,25.8,1,-11.5,29.189153,18.002759,0,2.686393,1
8302,10375,train,2021,17,False,LAC,DEN,LAC,34,13,-7.5,1.0,-3.0,27.2,19.9,27.4,17.3,1,13.5,26.340324,18.310562,1,0.529762,1
8303,10376,train,2021,17,False,NE,JAX,NE,50,10,-17.5,3.0,-1.0,25.9,14.5,17.3,26.4,1,22.5,28.014455,13.157708,1,-2.643253,0
8304,10377,train,2021,17,False,NO,CAR,NO,18,10,-7.0,1.0,-2.0,21.1,18.5,20.3,23.0,1,1.0,24.037106,16.78315,1,0.253956,1
8305,10378,train,2021,17,False,NYJ,TB,TB,24,28,-14.0,2.0,1.0,18.4,29.5,29.9,20.8,0,-10.0,18.411037,28.238648,0,-4.172389,0
8306,10381,train,2021,17,False,TEN,MIA,TEN,34,3,-3.0,-3.0,2.0,23.8,20.3,21.7,21.0,1,28.0,23.852406,20.327949,1,0.524457,1
8307,10383,train,2021,17,False,PIT,CLE,PIT,26,14,-2.0,-7.0,-1.0,20.1,20.9,24.7,21.9,1,10.0,20.651356,20.588518,1,-1.937162,0
8308,10384,train,2021,18,False,DEN,KC,KC,24,28,-11.5,-4.0,0.0,19.4,28.2,18.4,21.2,0,-7.5,18.639313,25.485469,0,-4.653844,0
8309,10385,train,2021,18,False,PHI,DAL,DAL,26,51,-6.0,5.0,8.0,26.1,29.9,20.9,20.8,0,19.0,22.470352,27.632567,1,-0.837785,0


Unnamed: 0,index,train/test,schedule_season,schedule_week,schedule_playoff,team_home,team_away,team_favorite_id,score_home,score_away,spread_favorite,home_team_ats_record,away_team_ats_record,home_team_ppg_for,away_team_ppg_for,home_team_ppg_against,away_team_ppg_against,home_team_fav,favorite_spread_result,score_home_prediction,score_away_prediction,favorite_cover?,favorite_cover_prediction,favorite_cover_prediction_binary
2060,10300,test,2021,12,False,HOU,NYJ,HOU,14,21,-3.0,-4.0,-3.0,15.0,17.8,27.1,32.0,1,-10.0,22.226178,19.133322,0,0.092856,1
2061,10304,test,2021,12,False,NE,TEN,NE,36,13,-7.0,3.0,-1.0,27.3,26.5,16.1,23.1,1,16.0,27.446431,19.454344,1,0.992086,1
2062,10307,test,2021,12,False,WAS,SEA,SEA,17,15,-1.5,0.0,0.0,21.2,19.4,26.7,20.9,0,-3.5,21.762778,20.038531,0,-3.224247,0
2063,10309,test,2021,13,False,ATL,TB,TB,17,30,-10.5,3.0,-1.0,18.1,31.5,27.5,23.0,0,2.5,18.508879,27.888496,1,-1.120383,0
2064,10318,test,2021,13,False,NYJ,PHI,PHI,18,33,-5.0,-2.0,4.0,18.1,25.3,30.4,22.8,0,10.0,20.174883,27.129105,1,1.954222,1
2065,10329,test,2021,14,False,KC,LVR,KC,48,9,-9.5,-2.0,0.0,25.2,22.8,21.6,26.0,1,29.5,28.550763,17.689096,1,1.361668,1
2066,10341,test,2021,15,False,DEN,CIN,DEN,10,15,-3.0,-1.0,2.0,21.2,27.2,17.5,22.5,1,-8.0,23.61637,20.876525,0,-0.260154,0
2067,10348,test,2021,15,False,TB,NO,TB,0,9,-11.5,1.0,-1.0,31.5,23.4,22.8,21.9,1,-20.5,30.853425,19.014692,0,0.338733,1
2068,10352,test,2021,16,False,TEN,SF,SF,20,17,-3.5,-2.0,0.0,24.1,25.7,22.1,22.4,0,-6.5,22.928974,24.563532,0,-1.865441,0
2069,10357,test,2021,16,False,CIN,BAL,CIN,41,21,-7.5,3.0,-4.0,26.4,23.9,21.6,22.5,1,12.5,27.383732,20.135,1,-0.251268,0


In [126]:
train_matches = (ats_predictions_train['favorite_cover?'] == ats_predictions_train['favorite_cover_prediction_binary'])
train_accuracy = train_matches.sum()/len(train_matches)

test_matches = (ats_predictions_test['favorite_cover?'] == ats_predictions_test['favorite_cover_prediction_binary'])
test_accuracy = test_matches.sum()/len(test_matches)

print('Train Accuracy: ', train_accuracy)
print('Test Accuracy: ', test_accuracy)

Train Accuracy:  0.5590144230769231
Test Accuracy:  0.5235576923076923


# Selecting the Games We Would Bet On

Our model is slightly overfitting, and the test set accuracy isn't good enough to rationalize betting on every single game with our model's prediction. Games that have a 'favorite_cover_prediction' that is close to 0 indicate that we likely won't have much of an "edge" on that game and we shouldn't be betting on it given that the odds will be -110 and we'll have negative EV. We only want to be betting on games that have the largest differential between the actual spread and our predicted outcome of the game.

To begin, we'll only bet on the games in which the 'favorite_cover_prediction' column is 1 standard deviation above or below the mean of that column.

In [127]:
standard_dev_test = ats_predictions_test['favorite_cover_prediction'].std()
mean_test = ats_predictions_test['favorite_cover_prediction'].mean()

filtered_ats_predictions_test = ats_predictions_test[(ats_predictions_test['favorite_cover_prediction'] > standard_dev_test) | (ats_predictions_test['favorite_cover_prediction'] < -standard_dev_test)]

filtered_test_df_matches = (filtered_ats_predictions_test['favorite_cover?'] == filtered_ats_predictions_test['favorite_cover_prediction_binary'])
filtered_test_df_accuracy = filtered_test_df_matches.sum()/len(filtered_test_df_matches)

print('Accuracy: ', filtered_test_df_accuracy)
print("Percent of Games We're Betting On: ", len(filtered_ats_predictions_test)/len(ats_predictions_test))

Accuracy:  0.5803571428571429
Percent of Games We're Betting On:  0.2692307692307692


This is giving us much better results. At a 58% success clip, our betting model would be profitable, assuming that we bet the same amount on each game. Now we will look to fine tune the model to increase the success rate even more.

# Momentum Score

Create a Momentum Score Function. The formula for momentum score will be based on how well a team has performed the past 3-4 weeks. The score won't refresh at the end of each season and will instead roll continuously.

A very negative momentum score indicates that the team has performed very poorly the past few weeks and is likely to be undervalued. A very positive momentum score indicates that the team has performed very well the past few weeks and is likely to be overvalued. The momentum score will take into account win-loss record, average margin of win/loss, and average against the spread margin over the past 3-4 games.

We will experiment with different formulae to see which gives us the strongest predictive value. We will determine how many weeks we should take into account for our momentum score.

The function will accept one argument, the team that we are generating momentum scores for. The function will create a dataframe ("team_df") that contains one team's historical results. We will add a "momentum_score" column to this team_df dataframe, and then use a for loop to concatenate all of the team_df dataframes into a new "momentum_ats_predictions" dataframe. This "momentum_ats_predictions" dataframe will have duplicated games, so we will need to delete the duplicates as we did in the "make_season_df" function

In [130]:
def momentum_score_generator(team):
    team_df = ats_predictions[(ats_predictions['team_home'] == team) | (ats_predictions['team_away'] == team)]
    display(team_df.tail(4))

In [131]:
last_5 = momentum_score_generator('PIT')
weighted_average_result_margin = 0.1*

Unnamed: 0,train/test,schedule_season,schedule_week,schedule_playoff,team_home,team_away,team_favorite_id,score_home,score_away,spread_favorite,home_team_ats_record,away_team_ats_record,home_team_ppg_for,away_team_ppg_for,home_team_ppg_against,away_team_ppg_against,home_team_fav,favorite_spread_result,score_home_prediction,score_away_prediction,favorite_cover?
10346,train,2021,15,False,PIT,TEN,PIT,19,13,-1.0,-7.0,-1.0,20.9,24.9,24.8,22.3,1,5.0,20.578327,22.009619,1
10360,train,2021,16,False,KC,PIT,KC,36,10,-10.0,0.0,-6.0,27.5,20.8,21.1,23.9,1,16.0,28.029213,17.196565,1
10383,train,2021,17,False,PIT,CLE,PIT,26,14,-2.0,-7.0,-1.0,20.1,20.9,24.7,21.9,1,10.0,20.651356,20.588518,1
10388,train,2021,18,False,BAL,PIT,BAL,13,16,-3.0,-6.0,-6.0,23.4,20.4,23.5,24.1,1,-6.0,23.936996,21.211704,0
