# Predicting the Scoring Margin of NBA Games Using Machine Learning

There were free, publicly available datasets of NBA box scores, but many had flaws or were missing information. However, several Python libraries allowed us to bypass this issue. We chose nba_api, mostly because of its comprehensive documentation. nba_api sources its data from the official NBA website, nba.com. The site limits a data request to about 30,000 games, and those 30,000 span several basketball leagues, the WNBA and Summer League being the most notable. We wanted only NBA teams' games, so we used nba_api's static team list to filter for them.

* Importing the libraries used throughout the notebook.
* Calling the nba_api game finder to get as many games as we could from nba.com.
* Creating a new dataframe with the same features as the one returned by the nba_api method.
* Filtering the returned rows so that the new DF only includes NBA teams.
* Displaying the head of the new 'games' DF that will be used throughout the notebook.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale 
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV
from sklearn.metrics import mean_squared_error


from nba_api.stats.endpoints import leaguegamefinder
from nba_api.stats.static import teams

# Gets the box score of every single game (NBA + WNBA + others: nba.com doesn't separate the leagues)
# from 2014-2021 into a dataframe.
all_games_finder = leaguegamefinder.LeagueGameFinder()
all_games = all_games_finder.get_data_frames()[0]

# Collects every game containing an NBA team from the all_games df, one team at a time.
# DataFrame.append was removed in pandas 2.0, so the per-team slices are combined with pd.concat.
nba_teams = teams.get_teams()
team_frames = []
for team in nba_teams:
    temp_id = team['id']
    team_frames.append(all_games[all_games['TEAM_ID'] == temp_id])
games = pd.concat(team_frames, ignore_index=True)

pd.set_option('display.max_columns', None)
games.head(60)

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS
0,22021,1610612737,ATL,Atlanta Hawks,22100359,2021-12-06,ATL @ MIN,W,240,121,40,90,0.444,25,49,0.51,16,19,0.842,11,38,49,31,5,8,10,20,11.0
1,22021,1610612737,ATL,Atlanta Hawks,22100350,2021-12-05,ATL vs. CHA,L,240,127,48,93,0.516,17,37,0.459,14,18,0.778,10,35,45,29,4,3,10,19,-3.0
2,22021,1610612737,ATL,Atlanta Hawks,22100335,2021-12-03,ATL vs. PHI,L,240,96,31,76,0.408,10,28,0.357,24,30,0.8,11,36,47,20,5,3,14,20,-2.0
3,22021,1610612737,ATL,Atlanta Hawks,22100319,2021-12-01,ATL @ IND,W,240,114,44,86,0.512,16,33,0.485,10,12,0.833,7,34,41,24,4,8,13,15,3.0
4,22021,1610612737,ATL,Atlanta Hawks,22100293,2021-11-27,ATL vs. NYK,L,240,90,33,93,0.355,9,37,0.243,15,20,0.75,13,39,52,18,8,6,6,17,-9.0
5,22021,1610612737,ATL,Atlanta Hawks,22100285,2021-11-26,ATL @ MEM,W,239,132,52,89,0.584,13,27,0.481,15,21,0.714,9,40,49,33,8,5,12,15,32.0
6,22021,1610612737,ATL,Atlanta Hawks,22100277,2021-11-24,ATL @ SAS,W,239,124,45,88,0.511,12,26,0.462,22,24,0.917,8,36,44,26,10,5,9,11,18.0
7,22021,1610612737,ATL,Atlanta Hawks,22100255,2021-11-22,ATL vs. OKC,W,239,113,42,87,0.483,14,34,0.412,15,16,0.938,8,36,44,25,6,6,7,16,12.0
8,22021,1610612737,ATL,Atlanta Hawks,22100242,2021-11-20,ATL vs. CHA,W,241,115,43,82,0.524,12,34,0.353,17,21,0.81,8,38,46,24,6,6,12,22,10.0
9,22021,1610612737,ATL,Atlanta Hawks,22100215,2021-11-17,ATL vs. BOS,W,240,110,41,81,0.506,13,37,0.351,15,18,0.833,6,34,40,28,9,4,11,17,11.0


The games dataframe obtained from the prior cell needed to be cleaned and prepped for model use.

* We first had to find and drop any games that contained null data.
* Furthermore, because each game is split across two rows (one for each participating team's stats) that can be far apart in the DF, we dropped any game that was missing one of its two rows, so that the model would not be trained on incomplete games.
* Afterwards, we sorted the entire DF by 'GAME_DATE', with the aim of bringing the two rows of each game together (still on separate rows).
* Next, we needed to add the data from both teams in a game onto each team's corresponding row (because the model is trained on a team's most recent x games).
* Finally, after the data indexing and merging, we arrived at a DF more closely resembling its final form.

In [2]:
# Dropping any game (two rows in DF) that has any NaN values or is missing either team's stats
games.isna()
games.dropna(inplace=True)

games = games[games.duplicated(subset = ['GAME_ID'], keep=False)]

# Merging games together (previously separated in the DF by team: each team's stats from the game were kept in separate rows)
games = games.sort_values(by=['GAME_DATE'])
games = games.reset_index(drop=True)

# Team A and B each have a row for their stats in a given matchup; we need to add both stats to the end of their respective rows
# Team A dataframe
tempA = games[games.index % 2 == 0]
tempA2 = games[games.index % 2 == 1]

tempA2 = tempA2.add_prefix('OPP_')

tempA = tempA.reset_index(drop=True)
tempA2 = tempA2.reset_index(drop=True)

a_temp = tempA.join(tempA2)

# Team B dataframe
tempB = games[games.index % 2 == 0]
tempB2 = games[games.index % 2 == 1]

tempB = tempB.add_prefix('OPP_')

tempB = tempB.reset_index(drop=True)
tempB2 = tempB2.reset_index(drop=True)

b_temp = tempB2.join(tempB)

# Adding both teams to main dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
games = pd.concat([a_temp, b_temp])

# Resorting main dataframe
games = games.sort_values(by=['GAME_DATE'])
games = games.reset_index(drop=True)

# Sending data to CSV
games.to_csv('games.csv', index = False)

# Print Head
pd.set_option('display.max_columns', None)
games.head()

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS,OPP_SEASON_ID,OPP_TEAM_ID,OPP_TEAM_ABBREVIATION,OPP_TEAM_NAME,OPP_GAME_ID,OPP_GAME_DATE,OPP_MATCHUP,OPP_WL,OPP_MIN,OPP_PTS,OPP_FGM,OPP_FGA,OPP_FG_PCT,OPP_FG3M,OPP_FG3A,OPP_FG3_PCT,OPP_FTM,OPP_FTA,OPP_FT_PCT,OPP_OREB,OPP_DREB,OPP_REB,OPP_AST,OPP_STL,OPP_BLK,OPP_TOV,OPP_PF,OPP_PLUS_MINUS
0,22015,1610612763,MEM,Memphis Grizzlies,1421500020,2015-07-09,MEM vs. OKC,W,198,87,28,61,0.459,1,7,0.143,30,33,0.909,12,25,37,13,9,2,19,21,6.0,22015,1610612762,UTA,Utah Jazz,1621500006,2015-07-09,UTA vs. PHI,W,211,84,27,74,0.365,2,12,0.167,28,35,0.8,10,23,33,14,13,2,12,24,6.0
1,22015,1610612760,OKC,Oklahoma City Thunder,1421500020,2015-07-09,OKC @ MEM,L,200,81,33,75,0.44,5,20,0.25,10,19,0.526,15,15,30,15,11,5,15,33,-6.0,22015,1610612759,SAS,San Antonio Spurs,1621500005,2015-07-09,SAS vs. BOS,L,201,71,26,65,0.4,5,19,0.263,14,19,0.737,11,27,38,11,7,4,16,21,-14.0
2,22015,1610612738,BOS,Boston Celtics,1621500005,2015-07-09,BOS @ SAS,W,202,85,29,65,0.446,12,20,0.6,15,17,0.882,5,29,34,21,7,2,10,12,14.0,22015,1610612755,PHI,Philadelphia 76ers,1621500006,2015-07-09,PHI @ UTA,L,209,78,27,61,0.443,5,21,0.238,19,27,0.704,8,33,41,8,5,6,25,31,-6.0
3,22015,1610612762,UTA,Utah Jazz,1621500006,2015-07-09,UTA vs. PHI,W,211,84,27,74,0.365,2,12,0.167,28,35,0.8,10,23,33,14,13,2,12,24,6.0,22015,1610612763,MEM,Memphis Grizzlies,1421500020,2015-07-09,MEM vs. OKC,W,198,87,28,61,0.459,1,7,0.143,30,33,0.909,12,25,37,13,9,2,19,21,6.0
4,22015,1610612755,PHI,Philadelphia 76ers,1621500006,2015-07-09,PHI @ UTA,L,209,78,27,61,0.443,5,21,0.238,19,27,0.704,8,33,41,8,5,6,25,31,-6.0,22015,1610612738,BOS,Boston Celtics,1621500005,2015-07-09,BOS @ SAS,W,202,85,29,65,0.446,12,20,0.6,15,17,0.882,5,29,34,21,7,2,10,12,14.0
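
Pairing rows by index parity, as in the cell above, relies on the two rows of each game being adjacent after the sort. Sorting on GAME_DATE alone does not guarantee this when several games share a date, and in the head shown above the first row's GAME_ID (1421500020) differs from its OPP_GAME_ID (1621500006). A minimal sketch of one alternative, using a made-up toy frame with only a few of the real columns, joins the table to itself on GAME_ID so that each row picks up the other team from the same game:

```python
import pandas as pd

# Toy data (values invented): two games on the same date, rows interleaved.
toy = pd.DataFrame({
    'GAME_ID': ['A', 'B', 'A', 'B'],
    'GAME_DATE': ['2015-07-09'] * 4,
    'TEAM_ID': [1, 2, 3, 4],
    'PTS': [87, 84, 81, 78],
})

# Copy of the table with OPP_ prefixes, keeping GAME_ID as the join key.
opp = toy.add_prefix('OPP_').rename(columns={'OPP_GAME_ID': 'GAME_ID'})

# Self-merge on GAME_ID: every row is matched with every row of the same game.
paired = toy.merge(opp, on='GAME_ID')

# Drop the matches where a team was paired with itself.
paired = paired[paired['TEAM_ID'] != paired['OPP_TEAM_ID']].reset_index(drop=True)

print(paired[['GAME_ID', 'TEAM_ID', 'PTS', 'OPP_TEAM_ID', 'OPP_PTS']])
```

Because the join key is the game itself, the result does not depend on row order, and a game with only one row simply produces no pairs.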


Because the games had already been merged, we no longer needed them sorted together; we did, however, need each team's games grouped together and ordered by game date. We accomplished this here with a basic for loop essentially identical to the one from the first cell.

In [3]:
# Sorting the DF by teams
# (per-team slices are combined with pd.concat; DataFrame.append was removed in pandas 2.0)
nba_teams = teams.get_teams()
team_frames = []
for team in nba_teams:
    temp_id = team['id']
    team_frames.append(games[games['TEAM_ID'] == temp_id])

games = pd.concat(team_frames, ignore_index=True)

# Print Head
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
games.head(100)

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS,OPP_SEASON_ID,OPP_TEAM_ID,OPP_TEAM_ABBREVIATION,OPP_TEAM_NAME,OPP_GAME_ID,OPP_GAME_DATE,OPP_MATCHUP,OPP_WL,OPP_MIN,OPP_PTS,OPP_FGM,OPP_FGA,OPP_FG_PCT,OPP_FG3M,OPP_FG3A,OPP_FG3_PCT,OPP_FTM,OPP_FTA,OPP_FT_PCT,OPP_OREB,OPP_DREB,OPP_REB,OPP_AST,OPP_STL,OPP_BLK,OPP_TOV,OPP_PF,OPP_PLUS_MINUS
0,22015,1610612737,ATL,Atlanta Hawks,1521500003,2015-07-10,ATL @ DEN,L,201,71,29,70,0.414,6,22,0.273,7,8,0.875,9,28,37,13,8,4,22,29,-15.0,22015,1610612761,TOR,Toronto Raptors,1521500002,2015-07-10,TOR vs. SAC,W,200,90,35,72,0.486,4,18,0.222,16,25,0.64,9,28,37,16,13,5,14,25,22.0
1,22015,1610612737,ATL,Atlanta Hawks,1521500020,2015-07-12,ATL vs. GSW,W,200,71,28,63,0.444,6,22,0.273,9,22,0.409,2,33,35,12,12,12,20,33,1.0,22015,1610612743,DEN,Denver Nuggets,1521500019,2015-07-12,DEN vs. SAC,W,199,98,38,81,0.469,12,29,0.414,10,14,0.714,15,28,43,22,14,3,15,24,22.0
2,22015,1610612737,ATL,Atlanta Hawks,1521500039,2015-07-15,ATL vs. MIA,W,200,75,28,65,0.431,3,14,0.214,16,18,0.889,12,32,44,17,7,7,20,28,11.0,22015,1610612749,MIL,Milwaukee Bucks,1521500037,2015-07-15,MIL @ HOU,W,201,97,36,66,0.545,9,15,0.6,16,27,0.593,11,27,38,23,5,9,25,28,4.0
3,22015,1610612737,ATL,Atlanta Hawks,1521500050,2015-07-16,ATL @ DEN,W,199,82,31,75,0.413,10,22,0.455,10,14,0.714,8,29,37,19,15,2,17,22,9.0,22015,1610612757,POR,Portland Trail Blazers,1521500052,2015-07-16,POR @ BOS,L,201,85,34,73,0.466,7,16,0.438,10,13,0.769,9,31,40,16,3,3,14,22,-6.0
4,22015,1610612737,ATL,Atlanta Hawks,1521500062,2015-07-18,ATL vs. DAL,W,200,91,31,64,0.484,8,21,0.381,21,27,0.778,9,31,40,17,6,5,17,15,8.0,22015,1610612756,PHX,Phoenix Suns,1521500063,2015-07-18,PHX vs. CHI,W,201,91,33,79,0.418,2,10,0.2,23,30,0.767,7,26,33,12,10,4,8,18,7.0
5,22015,1610612737,ATL,Atlanta Hawks,1521500065,2015-07-19,ATL @ SAS,L,199,68,23,71,0.324,5,26,0.192,17,23,0.739,14,27,41,12,9,7,18,24,-7.0,22015,1610612756,PHX,Phoenix Suns,1521500066,2015-07-19,PHX @ NOP,W,199,93,37,72,0.514,11,27,0.407,8,10,0.8,6,30,36,20,8,2,17,21,6.0
6,12015,1610612737,ATL,Atlanta Hawks,11500017,2015-10-07,ATL @ CLE,W,238,98,33,75,0.44,7,24,0.292,25,30,0.833,9,37,46,17,8,2,17,19,2.0,12015,1610612745,HOU,Houston Rockets,11500019,2015-10-07,HOU vs. DAL,W,240,109,40,93,0.43,12,37,0.324,17,25,0.68,18,45,63,24,6,5,12,23,27.0
7,12015,1610612737,ATL,Atlanta Hawks,11500032,2015-10-09,ATL @ NOP,W,240,103,33,74,0.446,11,26,0.423,26,36,0.722,5,43,48,21,8,5,16,23,10.0,12015,1610612752,NYK,New York Knicks,11500031,2015-10-09,NYK @ WAS,W,241,115,42,84,0.5,8,22,0.364,23,25,0.92,11,42,53,25,3,6,18,25,11.0
8,12015,1610612737,ATL,Atlanta Hawks,11500060,2015-10-14,ATL vs. SAS,W,240,100,32,78,0.41,11,31,0.355,25,29,0.862,3,45,48,24,10,4,16,26,14.0,12015,1610612761,TOR,Toronto Raptors,11500058,2015-10-14,TOR @ MIN,L,241,87,31,68,0.456,9,24,0.375,16,24,0.667,6,26,32,17,13,2,18,25,-2.0
9,12015,1610612737,ATL,Atlanta Hawks,11500068,2015-10-16,ATL @ DAL,W,241,91,31,84,0.369,9,28,0.321,20,24,0.833,9,40,49,18,9,7,15,14,7.0,12015,1610612760,OKC,Oklahoma City Thunder,11500065,2015-10-16,OKC @ MEM,L,240,78,29,79,0.367,7,36,0.194,13,18,0.722,10,31,41,19,9,8,18,24,-16.0


The next cell removes the basic categorical columns, although we stored some of the more important features in arrays for potential later use.

In [4]:
# Storing some columns for future use
game_ids = games['GAME_ID'].values
team_ids = games['TEAM_ID'].values
minutes = games['MIN'].values
abrv = games['TEAM_ABBREVIATION'].values
opp_abrv = games['OPP_TEAM_ABBREVIATION'].values
spread = games['PLUS_MINUS'].values

# Dropping Non-essential categorical data
games = games.drop(columns=['SEASON_ID', 'OPP_SEASON_ID', 'OPP_TEAM_ID','GAME_ID', 'OPP_GAME_ID', 'TEAM_ABBREVIATION', 'OPP_TEAM_ABBREVIATION', 'TEAM_NAME', 'OPP_TEAM_NAME', 'MATCHUP', 'OPP_MATCHUP', 'WL', 'OPP_WL', 'GAME_DATE', 'OPP_GAME_DATE', 'MIN', 'OPP_MIN']) 

# Adding minutes back as a single column
games['MIN'] = minutes

# Print Head
pd.set_option('display.max_columns', None)
games.head()

Unnamed: 0,TEAM_ID,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS,OPP_PTS,OPP_FGM,OPP_FGA,OPP_FG_PCT,OPP_FG3M,OPP_FG3A,OPP_FG3_PCT,OPP_FTM,OPP_FTA,OPP_FT_PCT,OPP_OREB,OPP_DREB,OPP_REB,OPP_AST,OPP_STL,OPP_BLK,OPP_TOV,OPP_PF,OPP_PLUS_MINUS,MIN
0,1610612737,71,29,70,0.414,6,22,0.273,7,8,0.875,9,28,37,13,8,4,22,29,-15.0,90,35,72,0.486,4,18,0.222,16,25,0.64,9,28,37,16,13,5,14,25,22.0,201
1,1610612737,71,28,63,0.444,6,22,0.273,9,22,0.409,2,33,35,12,12,12,20,33,1.0,98,38,81,0.469,12,29,0.414,10,14,0.714,15,28,43,22,14,3,15,24,22.0,200
2,1610612737,75,28,65,0.431,3,14,0.214,16,18,0.889,12,32,44,17,7,7,20,28,11.0,97,36,66,0.545,9,15,0.6,16,27,0.593,11,27,38,23,5,9,25,28,4.0,200
3,1610612737,82,31,75,0.413,10,22,0.455,10,14,0.714,8,29,37,19,15,2,17,22,9.0,85,34,73,0.466,7,16,0.438,10,13,0.769,9,31,40,16,3,3,14,22,-6.0,199
4,1610612737,91,31,64,0.484,8,21,0.381,21,27,0.778,9,31,40,17,6,5,17,15,8.0,91,33,79,0.418,2,10,0.2,23,30,0.767,7,26,33,12,10,4,8,18,7.0,200


This rather long cell is likely the most basic in the entire Notebook, while at the same time being one of the most important. It was here we decided to add all the advanced features that were not included in the initial box scores. We found the calculations on various semi-famous basketball sites, most notably Basketball Reference, which included an entire glossary of basketball stats and how they were calculated. 

In [5]:
# Adding Advanced Stats to enhance model performance, formulas were gathered from various sources. 

# Effective Field Goal Percentage
games['EFG%'] = (games['FGM'] + (.5 * games['FG3M'])) / games['FGA']
games['OPP_EFG%'] = (games['OPP_FGM'] + (.5 * games['OPP_FG3M'])) / games['OPP_FGA']

# Block Percentage
games['BLK%'] = (games['BLK'] / (games['OPP_FGA']-games['OPP_FG3A']))
games['OPP_BLK%'] = (games['OPP_BLK'] / (games['FGA']-games['FG3A']))

# Turnover Percentage
games['TOV%'] = games['TOV'] / (games['FGA'] + 0.44 * games['FTA'] + games['TOV'])
games['OPP_TOV%'] = games['OPP_TOV'] / (games['OPP_FGA'] + 0.44 * games['OPP_FTA'] + games['OPP_TOV'])

#Offensive Rebound Percentage
games['ORB%'] = games['OREB'] / (games['OREB'] + games['OPP_DREB'])
games['OPP_ORB%'] = games['OPP_OREB'] / (games['OPP_OREB'] + games['DREB'])

# Defensive Rebound Percentage
games['DREB%'] = games['DREB'] / (games['OPP_OREB'] + games['DREB'])
games['OPP_DREB%'] = games['OPP_DREB'] / (games['OREB'] + games['OPP_DREB'])

# Possessions 
games['POSS'] = 0.96*((games['FGA']) + games['TOV'] + 0.44 * games['FTA'] - games['OREB'])
games['OPP_POSS'] = 0.96*((games['OPP_FGA']) + games['OPP_TOV'] + 0.44 * games['OPP_FTA'] - games['OPP_OREB'])

# Steals Percentage
games['STL%'] = (games['STL'] / games['OPP_POSS'])
games['OPP_STL%'] = (games['OPP_STL'] / games['POSS'])

# Free Throw Rate
games['FTR'] = games['FTM'] / games['FGA']
games['OPP_FTR'] = games['OPP_FTM'] / games['OPP_FGA']

# True Shooting (Requires True Shooting Attempts)
tsa = games['FGA'] + 0.44 * games['FTA']
OPP_tsa = games['OPP_FGA'] + 0.44 * games['OPP_FTA']
games['TS'] = games['PTS'] / (2 * tsa)
games['OPP_TS'] = games['OPP_PTS'] / (2 * OPP_tsa)

# Assist Rate
games['ASTR'] = games['AST'] / (games['FGA'] + (.44 * games['FTA']) + games['AST'] + games['TOV'])
games['OPP_ASTR'] = games['OPP_AST'] / (games['OPP_FGA'] + (.44 * games['OPP_FTA']) + games['OPP_AST'] + games['OPP_TOV'])

# Total Rebound Percentage
games['TRB%'] = (games['REB'] * (games['REB'] / 5)) / (games['MIN'] * (games['REB'] + games['OPP_REB']))
games['OPP_TRB%'] = (games['OPP_REB'] * (games['OPP_REB'] / 5)) / (games['MIN'] * (games['OPP_REB'] + games['REB']))

# PACE
games['PACE'] = 48 * (games['POSS'] + games['OPP_POSS']) / (2 * (games['MIN'] / 5))
games['OPP_PACE'] = 48 * (games['OPP_POSS'] + games['POSS']) / (2 * (games['MIN'] / 5))

# Offensive Rating
games['ORTG'] = (games['PTS'] / games['POSS'])
games['OPP_ORTG'] = (games['OPP_PTS'] / games['OPP_POSS'])

# Defensive Rating
games['DRTG'] = (games['OPP_PTS'] / games['POSS'])
games['OPP_DRTG'] = (games['PTS'] / games['OPP_POSS'])

pd.set_option('display.max_columns', None)
games.head()

Unnamed: 0,TEAM_ID,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,PLUS_MINUS,OPP_PTS,OPP_FGM,OPP_FGA,OPP_FG_PCT,OPP_FG3M,OPP_FG3A,OPP_FG3_PCT,OPP_FTM,OPP_FTA,OPP_FT_PCT,OPP_OREB,OPP_DREB,OPP_REB,OPP_AST,OPP_STL,OPP_BLK,OPP_TOV,OPP_PF,OPP_PLUS_MINUS,MIN,EFG%,OPP_EFG%,BLK%,OPP_BLK%,TOV%,OPP_TOV%,ORB%,OPP_ORB%,DREB%,OPP_DREB%,POSS,OPP_POSS,STL%,OPP_STL%,FTR,OPP_FTR,TS,OPP_TS,ASTR,OPP_ASTR,TRB%,OPP_TRB%,PACE,OPP_PACE,ORTG,OPP_ORTG,DRTG,OPP_DRTG
0,1610612737,71,29,70,0.414,6,22,0.273,7,8,0.875,9,28,37,13,8,4,22,29,-15.0,90,35,72,0.486,4,18,0.222,16,25,0.64,9,28,37,16,13,5,14,25,22.0,201,0.457143,0.513889,0.074074,0.104167,0.230318,0.14433,0.243243,0.243243,0.756757,0.756757,83.0592,84.48,0.094697,0.156515,0.1,0.222222,0.482862,0.542169,0.119794,0.141593,0.018408,0.018408,100.023403,100.023403,0.854812,1.065341,1.083564,0.840436
1,1610612737,71,28,63,0.444,6,22,0.273,9,22,0.409,2,33,35,12,12,12,20,33,1.0,98,38,81,0.469,12,29,0.414,10,14,0.714,15,28,43,22,14,3,15,24,22.0,200,0.492063,0.54321,0.230769,0.073171,0.215796,0.146829,0.066667,0.3125,0.6875,0.933333,87.0528,83.6736,0.143414,0.160822,0.142857,0.123457,0.488442,0.562184,0.114635,0.177191,0.015705,0.023705,102.43584,102.43584,0.815597,1.171218,1.125754,0.848535
2,1610612737,75,28,65,0.431,3,14,0.214,16,18,0.889,12,32,44,17,7,7,20,28,11.0,97,36,66,0.545,9,15,0.6,16,27,0.593,11,27,38,23,5,9,25,28,4.0,200,0.453846,0.613636,0.137255,0.176471,0.215239,0.243002,0.307692,0.255814,0.744186,0.692308,77.6832,88.2048,0.079361,0.064364,0.246154,0.242424,0.514262,0.622753,0.154658,0.182714,0.02361,0.01761,99.5328,99.5328,0.96546,1.099713,1.248661,0.850294
3,1610612737,82,31,75,0.413,10,22,0.455,10,14,0.714,8,29,37,19,15,2,17,22,9.0,85,34,73,0.466,7,16,0.438,10,13,0.769,9,31,40,16,3,3,14,22,-6.0,199,0.48,0.513699,0.035088,0.056604,0.173187,0.150992,0.205128,0.236842,0.763158,0.794872,86.5536,80.3712,0.186634,0.034661,0.133333,0.136986,0.505175,0.539888,0.162171,0.147167,0.017869,0.020884,100.658171,100.658171,0.94739,1.057593,0.98205,1.020266
4,1610612737,91,31,64,0.484,8,21,0.381,21,27,0.778,9,31,40,17,6,5,17,15,8.0,91,33,79,0.418,2,10,0.2,23,30,0.767,7,26,33,12,10,4,8,18,7.0,200,0.546875,0.43038,0.072464,0.093023,0.183032,0.07984,0.257143,0.184211,0.815789,0.742857,80.5248,89.472,0.06706,0.124185,0.328125,0.291139,0.599631,0.493492,0.154714,0.106952,0.021918,0.014918,101.99808,101.99808,1.130087,1.017078,1.130087,1.017078
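
As a quick sanity check on the formulas above, the sketch below recomputes a few of them in plain Python for the first row of the table (FGM 29, FG3M 6, FGA 70, FTA 8, PTS 71, REB 37, OPP_REB 37, MIN 201). The helper names are ours and are not part of the notebook. For comparison it also computes a plain team rebound share, REB / (REB + OPP_REB). As far as we can tell, the Basketball Reference glossary formula for TRB% uses team minutes in the place where the cell above uses REB / 5, so the TRB% column is on a different scale from the usual percentage.

```python
def efg_pct(fgm, fg3m, fga):
    """Effective field goal percentage: three-pointers count 1.5x."""
    return (fgm + 0.5 * fg3m) / fga


def true_shooting(pts, fga, fta):
    """True shooting: points per two true shooting attempts."""
    return pts / (2 * (fga + 0.44 * fta))


def rebound_share(reb, opp_reb):
    """Share of all rebounds in the game collected by the team."""
    return reb / (reb + opp_reb)


# First row of the table above.
efg = efg_pct(fgm=29, fg3m=6, fga=70)
ts = true_shooting(pts=71, fga=70, fta=8)
share = rebound_share(reb=37, opp_reb=37)

# TRB% exactly as written in the cell, for comparison with the table.
trb_as_in_cell = (37 * (37 / 5)) / (201 * (37 + 37))

print(round(efg, 6), round(ts, 6), share, round(trb_as_in_cell, 6))
```

The first, second, and last values agree with the EFG%, TS, and TRB% columns of that row, while the rebound share for two teams with 37 rebounds each is 0.5.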


After making a halfhearted attempt at an actual model to make sure our data was sound, we quickly realized we had a major problem. Our model scores were ludicrously high, in the upper 99% range. After searching for the error for days, it became apparent that our approach had a fundamental flaw: we were attempting to train and test a model to predict point spreads based on box scores, but we had given the model a given game's spread AND its box score. This had two major flaws: 1) An actual model attempting to predict future games would never have a game's box score, as the game had not yet been played. 2) The box score (along with the added features) fully explained the point spread (sometimes referred to as 'plus/minus'); these features are highly collinear with it. So we needed a different approach, one that would make some sense in the real world, and to that end we decided on rolling averages of a team's X most recent games.
* We needed a new DF to calculate the rolling averages, and we also wanted to rename the target feature to be more cohesive with our adopted vernacular.
* We inserted game_id and spread so they would go to their appropriate games before games were dropped, in order to not mess up the DF and list indexing. 
* We then calculated the rolling averages for a team's most recent 5 games. Meaning: Game 5 would be an average of Games 1-5; Game 6 would be an average of Games 2-6, and so on. This required us to drop the first 4 games for every team, because they did not have the required 5 games of averages.  
* As before, we checked for and dropped games with null values, and did the same for games that had no 'sister' row.
* We then replaced the old spread and game id lists with the new ordered values for the remaining games.
* Finally, we replaced the old Games DF with the rolling averages and printed out the head to make sure the DF was ready for normalization.
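
The window described above includes the game being predicted: the row for Game 5 averages Games 1-5, so it still contains that game's own box score. If the aim is to use only information available before tip-off, one option is to shift each team's series by one game before averaging, so that the features for Game 6 come from Games 1-5. A minimal sketch on a toy frame (the point totals are made up, and PTS_PREV5 is a hypothetical column name):

```python
import pandas as pd

# Toy data: two teams with seven games each, already in date order per team.
toy = pd.DataFrame({
    'TEAM_ID': [1] * 7 + [2] * 7,
    'PTS': [100, 102, 98, 110, 90, 105, 95,
            80, 85, 90, 95, 100, 105, 110],
})

# Mean of each team's previous 5 games; shift(1) excludes the current game.
toy['PTS_PREV5'] = (
    toy.groupby('TEAM_ID')['PTS']
       .transform(lambda s: s.shift(1).rolling(5).mean())
)

print(toy)
```

With this variant a team's first five games have no features and would be dropped, rather than the first four.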


In [6]:
# Getting rolling average of recent x games
games_averages = games.copy() #Sets a copy to be used when we do rolling averages
games_averages = games_averages.drop(columns = ['PLUS_MINUS', 'OPP_PLUS_MINUS'])
games_averages = games_averages.groupby('TEAM_ID').rolling(5).mean().reset_index(drop=True)

# Re-attaching game ids and spreads so they stay aligned with their rows
games_averages.insert(0, 'GAME_ID', game_ids)
games_averages.insert(1, 'SPREAD', spread)

# Dropping games that are missing a sister row, then any rows that became null
games_averages = games_averages[games_averages.duplicated(subset = ['GAME_ID'], keep=False)]
games_averages = games_averages.dropna()

game_ids = games_averages['GAME_ID'].values
spread = games_averages['SPREAD'].values

games_averages = games_averages.drop(columns= ['GAME_ID', 'SPREAD', 'TEAM_ID'])
games_averages.reset_index(drop=True)
games = games_averages

# Print Head
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
games_averages.head(25)

Unnamed: 0,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,OPP_PTS,OPP_FGM,OPP_FGA,OPP_FG_PCT,OPP_FG3M,OPP_FG3A,OPP_FG3_PCT,OPP_FTM,OPP_FTA,OPP_FT_PCT,OPP_OREB,OPP_DREB,OPP_REB,OPP_AST,OPP_STL,OPP_BLK,OPP_TOV,OPP_PF,MIN,EFG%,OPP_EFG%,BLK%,OPP_BLK%,TOV%,OPP_TOV%,ORB%,OPP_ORB%,DREB%,OPP_DREB%,POSS,OPP_POSS,STL%,OPP_STL%,FTR,OPP_FTR,TS,OPP_TS,ASTR,OPP_ASTR,TRB%,OPP_TRB%,PACE,OPP_PACE,ORTG,OPP_ORTG,DRTG,OPP_DRTG
4,78.0,29.4,67.4,0.44,6.6,20.2,0.32,12.6,17.8,0.73,8.0,30.6,38.6,15.6,9.6,6.0,19.2,25.4,92.2,35.2,74.2,0.48,6.8,17.6,0.37,15.0,21.8,0.7,10.2,28.0,38.2,17.8,9.0,4.8,15.2,23.4,200.0,0.49,0.52,0.11,0.1,0.2,0.15,0.22,0.25,0.75,0.78,82.97,85.24,0.11,0.11,0.19,0.2,0.52,0.55,0.14,0.15,0.02,0.02,100.93,100.93,0.94,1.08,1.11,0.92
5,77.4,28.2,67.6,0.42,6.4,21.0,0.3,14.6,20.8,0.71,9.0,30.4,39.4,15.4,9.8,6.6,18.4,24.4,92.8,35.6,74.2,0.48,8.2,19.4,0.41,13.4,18.8,0.73,9.6,28.4,38.0,18.6,8.0,4.2,15.8,22.6,199.6,0.47,0.54,0.13,0.09,0.19,0.16,0.23,0.23,0.77,0.77,82.71,85.13,0.12,0.1,0.22,0.18,0.51,0.57,0.14,0.16,0.02,0.02,100.9,100.9,0.94,1.09,1.12,0.91
6,82.8,29.2,70.0,0.42,6.6,21.4,0.31,17.8,22.4,0.79,10.4,31.2,41.6,16.4,9.0,4.6,17.8,21.6,95.0,36.0,76.6,0.47,8.2,21.0,0.39,14.8,21.0,0.72,10.2,31.8,42.0,19.0,6.4,4.6,15.2,22.4,207.2,0.47,0.53,0.09,0.09,0.18,0.15,0.25,0.24,0.76,0.75,83.77,87.21,0.11,0.08,0.26,0.19,0.52,0.56,0.14,0.16,0.02,0.02,99.21,99.21,0.99,1.09,1.14,0.95
7,88.4,30.2,71.8,0.42,8.2,23.8,0.35,19.8,26.0,0.76,9.0,33.4,42.4,17.2,9.2,4.2,17.0,20.6,98.6,37.2,80.2,0.47,8.0,22.4,0.35,16.2,20.6,0.79,10.2,34.8,45.0,19.4,6.0,4.0,13.8,21.8,215.2,0.48,0.52,0.08,0.08,0.17,0.14,0.21,0.23,0.77,0.79,87.59,89.15,0.11,0.07,0.28,0.2,0.53,0.55,0.15,0.16,0.02,0.02,98.78,98.78,1.01,1.1,1.12,0.99
8,92.0,30.4,72.4,0.42,8.4,25.6,0.33,22.8,29.0,0.79,8.0,36.6,44.6,18.2,8.2,4.6,16.8,21.4,99.0,36.6,79.2,0.46,8.4,24.0,0.33,17.4,22.8,0.77,9.6,33.8,43.4,19.6,8.0,3.8,14.6,22.4,223.4,0.48,0.52,0.09,0.08,0.17,0.14,0.19,0.2,0.8,0.81,90.2,90.46,0.09,0.09,0.31,0.22,0.54,0.56,0.15,0.16,0.02,0.02,97.3,97.3,1.02,1.09,1.1,1.01
9,92.0,30.4,76.4,0.4,8.6,27.0,0.32,22.6,28.4,0.8,8.0,38.4,46.4,18.4,8.8,5.0,16.4,21.2,96.4,35.8,79.2,0.45,9.4,29.2,0.33,15.4,20.4,0.76,10.2,34.8,45.0,21.0,7.8,4.6,16.6,23.6,231.6,0.45,0.51,0.11,0.09,0.16,0.16,0.18,0.21,0.79,0.82,93.4,90.79,0.1,0.08,0.3,0.19,0.52,0.55,0.15,0.17,0.02,0.02,95.59,95.59,0.98,1.06,1.04,1.01
10,96.8,32.4,77.2,0.42,8.8,26.2,0.33,23.2,28.2,0.83,6.8,40.2,47.0,20.6,8.2,5.4,17.2,21.2,91.4,33.6,78.6,0.43,8.0,27.2,0.3,16.2,22.2,0.72,11.2,34.4,45.6,21.0,7.6,5.0,17.6,24.2,239.8,0.48,0.48,0.11,0.1,0.16,0.17,0.16,0.22,0.78,0.84,96.01,90.98,0.09,0.08,0.3,0.21,0.54,0.51,0.16,0.16,0.02,0.02,93.57,93.57,1.01,1.0,0.95,1.07
11,93.4,32.0,78.6,0.41,9.0,26.6,0.34,20.4,25.6,0.79,6.8,41.2,48.0,21.4,8.2,7.0,16.8,21.8,91.6,33.2,78.6,0.42,7.2,24.6,0.3,18.0,23.8,0.75,9.6,33.2,42.8,21.0,8.0,5.4,17.0,23.4,240.2,0.47,0.47,0.13,0.1,0.16,0.16,0.17,0.19,0.81,0.83,95.87,92.61,0.09,0.08,0.26,0.23,0.52,0.51,0.17,0.16,0.02,0.02,94.16,94.16,0.97,0.98,0.96,1.02
12,90.2,32.0,79.8,0.4,8.2,25.2,0.33,18.0,21.8,0.82,7.8,38.2,46.0,20.6,8.2,6.8,17.8,20.2,91.6,34.0,79.0,0.43,8.8,27.6,0.31,14.8,20.4,0.74,9.4,31.0,40.4,22.4,9.2,5.0,16.2,22.6,240.2,0.45,0.48,0.13,0.09,0.17,0.16,0.2,0.2,0.8,0.8,95.42,90.98,0.09,0.1,0.23,0.19,0.51,0.52,0.16,0.18,0.02,0.02,93.12,93.12,0.94,1.0,0.96,1.0
13,89.0,33.0,80.6,0.41,7.6,24.4,0.31,15.4,19.0,0.8,8.6,35.8,44.4,20.2,8.0,6.8,17.6,20.0,93.6,35.2,82.8,0.42,8.4,26.6,0.31,14.8,20.2,0.74,9.6,33.8,43.4,21.6,7.8,6.6,15.2,22.0,240.0,0.46,0.47,0.12,0.12,0.17,0.14,0.21,0.21,0.79,0.79,94.04,93.4,0.09,0.08,0.19,0.18,0.5,0.51,0.16,0.17,0.02,0.02,93.72,93.72,0.95,1.0,1.0,0.96


We normalized the data using z-scores, which almost doubled model accuracy.

In [7]:
# Z score normalization
realcols = list(games.columns.values)
for col in realcols:
   mean = games[col].mean()
   std = games[col].std()
   games[col] = (games[col] - mean)/std

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
games.head(25)

Unnamed: 0,PTS,FGM,FGA,FG_PCT,FG3M,FG3A,FG3_PCT,FTM,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,PF,OPP_PTS,OPP_FGM,OPP_FGA,OPP_FG_PCT,OPP_FG3M,OPP_FG3A,OPP_FG3_PCT,OPP_FTM,OPP_FTA,OPP_FT_PCT,OPP_OREB,OPP_DREB,OPP_REB,OPP_AST,OPP_STL,OPP_BLK,OPP_TOV,OPP_PF,MIN,EFG%,OPP_EFG%,BLK%,OPP_BLK%,TOV%,OPP_TOV%,ORB%,OPP_ORB%,DREB%,OPP_DREB%,POSS,OPP_POSS,STL%,OPP_STL%,FTR,OPP_FTR,TS,OPP_TS,ASTR,OPP_ASTR,TRB%,OPP_TRB%,PACE,OPP_PACE,ORTG,OPP_ORTG,DRTG,OPP_DRTG
4,-3.22,-2.81,-3.76,-0.64,-1.52,-1.56,-0.79,-1.51,-1.25,-0.6,-1.07,-1.07,-1.48,-2.16,1.19,0.87,2.43,1.95,-1.78,-1.26,-2.59,0.78,-1.7,-2.44,0.5,-0.85,-0.27,-1.45,0.08,-2.17,-1.82,-1.78,0.92,-0.06,0.69,1.3,-4.78,-0.88,0.15,0.83,0.51,3.96,1.5,-0.31,0.57,-0.57,0.31,-2.86,-2.63,1.99,1.82,-0.37,-0.05,-1.08,-0.08,-1.55,-1.24,0.42,0.26,1.5,1.5,-2.39,-0.45,0.01,-2.45
5,-3.28,-3.15,-3.72,-1.25,-1.59,-1.43,-1.16,-0.88,-0.49,-1.11,-0.55,-1.13,-1.26,-2.22,1.32,1.33,2.07,1.55,-1.7,-1.13,-2.59,0.99,-1.09,-2.09,1.37,-1.42,-1.13,-0.77,-0.28,-2.02,-1.88,-1.51,0.18,-0.59,0.99,0.94,-4.82,-1.42,0.61,1.47,-0.01,3.47,1.92,0.1,0.2,-0.2,-0.1,-2.92,-2.66,2.14,1.02,0.32,-0.65,-1.46,0.35,-1.67,-0.83,0.74,0.11,1.49,1.49,-2.46,-0.32,0.16,-2.53
6,-2.68,-2.87,-3.24,-1.28,-1.52,-1.37,-1.08,0.13,-0.09,0.47,0.17,-0.87,-0.65,-1.93,0.8,-0.2,1.8,0.45,-1.43,-1.01,-2.07,0.7,-1.09,-1.78,0.95,-0.92,-0.5,-0.92,0.08,-0.77,-0.62,-1.38,-1.0,-0.23,0.69,0.86,-3.91,-1.45,0.32,-0.07,0.21,2.92,1.44,0.64,0.29,-0.29,-0.64,-2.69,-2.15,1.44,-0.27,1.26,-0.33,-1.06,0.1,-1.42,-0.9,0.8,0.79,0.96,0.96,-1.76,-0.36,0.3,-2.05
7,-2.04,-2.58,-2.88,-1.17,-0.92,-0.99,-0.11,0.75,0.82,-0.15,-0.55,-0.16,-0.43,-1.7,0.93,-0.51,1.44,0.06,-0.98,-0.63,-1.29,0.36,-1.18,-1.5,-0.16,-0.43,-0.62,0.45,0.08,0.33,0.33,-1.24,-1.29,-0.76,-0.01,0.59,-2.94,-1.09,-0.08,-0.51,-0.24,2.26,0.52,-0.45,-0.02,0.02,0.45,-1.84,-1.67,1.46,-0.72,1.79,-0.16,-0.71,-0.01,-1.31,-0.92,0.38,1.34,0.83,0.83,-1.48,-0.12,0.14,-1.55
8,-1.64,-2.52,-2.76,-1.19,-0.85,-0.71,-0.57,1.7,1.57,0.4,-1.07,0.88,0.17,-1.41,0.29,-0.2,1.35,0.37,-0.93,-0.82,-1.51,0.29,-1.0,-1.19,-0.46,-0.0,0.01,0.03,-0.28,-0.03,-0.18,-1.17,0.18,-0.94,0.39,0.86,-1.95,-1.08,-0.03,-0.07,-0.36,2.03,0.91,-1.0,-0.72,0.72,1.0,-1.26,-1.34,0.6,0.57,2.71,0.38,-0.44,0.08,-1.1,-0.89,0.94,0.33,0.36,0.36,-1.32,-0.3,-0.15,-1.23
9,-1.64,-2.52,-1.96,-1.98,-0.77,-0.48,-0.85,1.64,1.42,0.6,-1.07,1.46,0.67,-1.35,0.68,0.1,1.17,0.29,-1.26,-1.07,-1.51,-0.09,-0.56,-0.17,-0.49,-0.71,-0.67,-0.16,0.08,0.33,0.33,-0.7,0.03,-0.23,1.39,1.38,-0.96,-1.77,-0.14,0.64,0.07,1.56,1.92,-1.18,-0.63,0.63,1.18,-0.55,-1.26,0.98,0.15,2.27,-0.31,-1.14,-0.21,-1.24,-0.34,0.96,0.39,-0.18,-0.18,-1.85,-0.81,-1.02,-1.27
10,-1.1,-1.95,-1.79,-1.19,-0.7,-0.61,-0.48,1.83,1.37,1.23,-1.69,2.05,0.84,-0.71,0.29,0.41,1.53,0.29,-1.88,-1.76,-1.64,-1.12,-1.18,-0.56,-1.3,-0.43,-0.16,-0.87,0.67,0.19,0.51,-0.7,-0.12,0.12,1.89,1.65,0.04,-1.1,-1.25,0.78,0.34,1.81,2.35,-1.7,-0.32,0.32,1.7,0.03,-1.21,0.54,-0.17,2.4,0.03,-0.41,-1.32,-0.54,-0.44,0.73,0.23,-0.82,-0.82,-1.45,-1.78,-2.14,-0.59
11,-1.48,-2.07,-1.51,-1.61,-0.62,-0.55,-0.4,0.94,0.72,0.54,-1.69,2.37,1.12,-0.48,0.29,1.63,1.35,0.53,-1.85,-1.88,-1.64,-1.27,-1.53,-1.07,-1.26,0.21,0.3,-0.41,-0.28,-0.26,-0.37,-0.7,0.18,0.48,1.59,1.3,0.08,-1.43,-1.51,1.64,0.58,1.62,2.06,-1.58,-1.12,1.12,1.58,-0.0,-0.81,0.46,0.13,1.42,0.56,-1.02,-1.4,-0.26,-0.45,1.21,-0.82,-0.63,-0.63,-1.96,-2.04,-2.09,-1.21
12,-1.84,-2.07,-1.27,-1.83,-0.92,-0.77,-0.66,0.19,-0.24,0.92,-1.17,1.4,0.56,-0.71,0.29,1.48,1.8,-0.1,-1.85,-1.63,-1.55,-1.01,-0.82,-0.48,-0.94,-0.92,-0.67,-0.6,-0.4,-1.07,-1.12,-0.22,1.06,0.12,1.19,0.94,0.08,-1.78,-1.03,1.65,0.06,2.06,1.8,-0.84,-0.76,0.76,0.84,-0.1,-1.21,0.55,1.0,0.54,-0.49,-1.47,-1.16,-0.56,0.2,0.87,-1.36,-0.96,-0.96,-2.37,-1.71,-2.01,-1.42
13,-1.98,-1.78,-1.11,-1.56,-1.14,-0.9,-0.93,-0.63,-0.95,0.69,-0.76,0.62,0.12,-0.83,0.16,1.48,1.71,-0.18,-1.6,-1.26,-0.72,-1.24,-1.0,-0.68,-0.97,-0.92,-0.73,-0.48,-0.28,-0.03,-0.18,-0.49,0.03,1.54,0.69,0.68,0.06,-1.68,-1.37,1.4,1.27,2.0,1.05,-0.59,-0.43,0.43,0.59,-0.41,-0.62,0.26,0.11,-0.32,-0.77,-1.59,-1.51,-0.64,-0.32,0.1,-0.31,-0.77,-0.77,-2.34,-1.77,-1.55,-1.91


A basic train/test split to prepare for modeling.

In [8]:
# Test/Train splitting
from sklearn.metrics import r2_score

spread = spread.astype(int)
y = spread
x = games.values
print(x.shape)
print(y.shape)

xtrain, xtest, ytrain, ytest =  train_test_split(x,y, test_size = 0.2, random_state=1234)

(17828, 65)
(17828,)
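
The normalization cell computes each column's mean and standard deviation over all games before the split, so the test rows contribute to the statistics used to scale the training rows. A common alternative, sketched here on random stand-in data rather than the real features, is to split first and fit scikit-learn's StandardScaler on the training portion only. Note that StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' std() defaults to ddof=1, so the scaled values would differ very slightly from those of the loop above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data: 200 rows, 3 features (not the real rolling averages).
rng = np.random.default_rng(0)
x_demo = rng.normal(loc=100, scale=10, size=(200, 3))
y_demo = rng.normal(size=200)

xtr, xte, ytr, yte = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1234
)

# Fit the scaler on training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(xtr)
xtr_scaled = scaler.transform(xtr)
xte_scaled = scaler.transform(xte)

print(xtr_scaled.mean(axis=0).round(6), xtr_scaled.std(axis=0).round(6))
```

The scaled test features will not have exactly zero mean and unit variance, which is expected: they are measured against the training distribution.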


Ridge Model
Ridge regression is a method for estimating coefficients in scenarios where the independent variables are highly correlated. For this model, RidgeCV chooses the regularization parameter alpha by cross-validation from the following candidates: 0.001, 0.01, 0.1, and 1.

In [9]:
# Ridge Model
ridge = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1])
ridge.fit(xtrain, ytrain)

predict = ridge.predict(xtest)

#stats for the model
pd.set_option('display.max_rows', None)
#print(pd.Series(ridge.coef_, index = games.columns[0:66])) 
print(predict)
print(ytest)
mse = mean_squared_error(ytest, predict) 
print("Test mean squared error (MSE): {:.2f}".format(mse))
print("Score:", ridge.score(xtest,ytest))

[-3.06661976 -5.66207813 -2.60020579 ...  2.04388607  4.0358702
 -3.24511857]
[-10  -5   5 ...  20   1  16]
Test mean squared error (MSE): 152.79
Score: 0.20756056860415362
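
To put these numbers in context: the value printed as "Score" is the coefficient of determination R^2, and the square root of the MSE gives the typical error in points. Because R^2 = 1 - MSE / Var(y_test), the two reported values also imply the error of a baseline that always predicts the mean margin of the test set. The arithmetic below uses only the figures printed above:

```python
import math

mse = 152.79                  # reported test MSE
r2 = 0.20756056860415362      # reported score (R^2)

rmse = math.sqrt(mse)

# R^2 = 1 - MSE / Var(y_test)  =>  Var(y_test) = MSE / (1 - R^2)
baseline_mse = mse / (1 - r2)
baseline_rmse = math.sqrt(baseline_mse)

print(f"Model RMSE: {rmse:.2f} points")
print(f"Mean-only baseline RMSE: {baseline_rmse:.2f} points")
```

In other words, the model's typical miss is a little over 12 points per game, compared with just under 14 for the mean-only baseline.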


In [10]:
# Getting rid of the sklearn convergence warnings.
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

Lasso Regression
Lasso regression is a type of linear regression that uses shrinkage: an L1 penalty shrinks the coefficient estimates toward zero and can set some of them exactly to zero, which acts as a form of feature selection. This makes it a reasonable choice for models showing multicollinearity. For this model, the parameter cv (cross-validation) is set to 5, meaning 5-fold cross-validation is used to select the penalty strength.


In [11]:
# LASSO
lasso = LassoCV(cv=5, random_state=0)
lasso.fit(xtrain, ytrain)

predict3 = lasso.predict(xtest)

#stats for the model
pd.set_option('display.max_rows', None)
#print(pd.Series(lasso.coef_, index = games.columns[0:66])) 
mse = mean_squared_error(ytest, predict3) 
print("Test mean squared error (MSE): {:.2f}".format(mse))
print("Score:", lasso.score(xtest,ytest))

Test mean squared error (MSE): 152.15
Score: 0.21086422287096718


Elastic Net Regularization
Elastic net regularization is a method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. Here alpha is set to 0.001 and the mixing parameter l1_ratio is left at its default.

In [12]:
# ELASTIC
from sklearn.linear_model import ElasticNet
elastic = ElasticNet(alpha=0.001)
elastic.fit(xtrain, ytrain)

y_pred_elastic = elastic.predict(xtest)
mse = mean_squared_error(ytest, y_pred_elastic) 
print("Test mean squared error (MSE): {:.2f}".format(mse))
print("Score:", elastic.score(xtest,ytest))

Test mean squared error (MSE): 152.42
Score: 0.20945288612312063


SVM
Support Vector Machines are supervised learning models used for classification and regression analysis. In this project, we used one for regression: a Support Vector Regression (SVR) model with default hyperparameters.

In [16]:
# Support Vector Machine
from sklearn.svm import SVR

# Support Vector Regression with default hyperparameters
svr = SVR()

# fit regressor to training set
svr.fit(xtrain, ytrain)

# make predictions on test set
y_pred = svr.predict(xtest)

# Calculate MSE
print('Model MSE: {0:0.4f}'.format(mean_squared_error(ytest, y_pred)))

# compute and print the score (for SVR, score() returns R^2 rather than classification accuracy)
print('Model accuracy score with default hyperparameters: {0:0.4f}'.format(svr.score(xtest, ytest)))

Model MSE: 157.1923
Model accuracy score with default hyperparameters: 0.1847
runtime: 25.06702160835266
