# **NBA Game Predictor**

#### **Jonah Silverstein**

# Summary

#### Description:
This machine learning project develops a classification model that predicts the results of past NBA games. The model trains on the earlier games in our dataset, using each team's previous game stats, rolling averages of those stats, and opponent stats, and is then evaluated on later games. It also uses time series cross-validation to ensure that only games played before a given game are used to predict it.

#### Details:
 - Predicting NBA game results is a binary classification problem, as each team either wins or loses.
 - This project will analyze a dataset of statistics from past NBA games, and train a model to predict the results of games from the more recent past (this model will not predict the result of future games).
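To make the "only past games" idea concrete, here is a minimal sketch using scikit-learn's `TimeSeriesSplit` on ten made-up, date-sorted games. The `games` array is hypothetical and stands in for rows of our dataset; the point is only that every fold trains on indices that come before its test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten hypothetical games, already sorted from oldest to newest.
games = np.arange(10)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(games))

for train_idx, test_idx in folds:
    # Every training index precedes every test index, so no future game leaks in.
    print("train:", train_idx, "test:", test_idx)
```

A regular shuffled split would mix later games into the training set, which is exactly what a time-ordered split avoids.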

# Procedure

## 1. Importing Libraries

In [1]:
import pandas as pd

# data preprocessing and model evaluation
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score

# classification algorithms
from sklearn.linear_model import LogisticRegression

# feature engineering and hyperparameter tuning
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import TimeSeriesSplit

import warnings
warnings.filterwarnings("ignore")

## 2. Data Exploration

In [2]:
df = pd.read_csv("nba_games.csv", index_col=0)
df

Unnamed: 0,mp,mp.1,fg,fga,fg%,3p,3pa,3p%,ft,fta,...,tov%_max_opp,usg%_max_opp,ortg_max_opp,drtg_max_opp,team_opp,total_opp,home_opp,season,date,won
0,240.0,240.0,39.0,81.0,0.481,6.0,20.0,0.300,14.0,18.0,...,22.8,29.0,178.0,111.0,DAL,95,1,2016,2015-12-09,True
1,240.0,240.0,36.0,100.0,0.360,7.0,31.0,0.226,16.0,19.0,...,50.0,32.6,152.0,111.0,ATL,98,0,2016,2015-12-09,False
2,240.0,240.0,37.0,85.0,0.435,8.0,19.0,0.421,17.0,23.0,...,20.0,30.9,148.0,116.0,SAS,107,1,2018,2017-10-18,False
3,240.0,240.0,41.0,89.0,0.461,8.0,21.0,0.381,17.0,19.0,...,28.6,30.9,138.0,118.0,MIN,99,0,2018,2017-10-18,True
4,240.0,240.0,27.0,86.0,0.314,6.0,26.0,0.231,15.0,20.0,...,16.8,30.9,157.0,90.0,MEM,92,1,2021,2021-04-30,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17767,240.0,240.0,35.0,81.0,0.432,11.0,26.0,0.423,27.0,36.0,...,34.2,33.7,160.0,118.0,OKC,92,0,2019,2018-10-19,True
17768,240.0,240.0,37.0,74.0,0.500,13.0,25.0,0.520,26.0,37.0,...,25.0,30.0,139.0,129.0,ORL,108,1,2017,2016-12-14,True
17769,240.0,240.0,42.0,89.0,0.472,14.0,33.0,0.424,10.0,20.0,...,25.6,29.9,175.0,126.0,LAC,113,0,2017,2016-12-14,False
17770,240.0,240.0,41.0,85.0,0.482,9.0,26.0,0.346,26.0,30.0,...,27.7,27.1,150.0,126.0,MIA,106,1,2020,2020-09-19,True


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17772 entries, 0 to 17771
Columns: 150 entries, mp to won
dtypes: bool(1), float64(140), int64(6), object(3)
memory usage: 20.4+ MB


In [4]:
nulls = df.isnull().sum()
nulls = nulls[nulls > 0]

print(f"Duplicate Rows: {df.duplicated().sum()}")
print(f"Null Values:\n{nulls}")

Duplicate Rows: 0
Null Values:
+/-             17772
mp_max          17772
mp_max.1        17772
+/-_opp         17772
mp_max_opp      17772
mp_max_opp.1    17772
dtype: int64


## 3. Data Preprocessing

In [5]:
# sorting dataframe by game date
df = df.sort_values("date")
df = df.reset_index(drop=True)

In [6]:
# removing unnecessary indexing columns
del df["mp.1"]
del df["mp_opp.1"]
del df["index_opp"]

In [7]:
# removing columns filled with null values
non_nulls = df.columns[~df.columns.isin(nulls.index)]
df = df[non_nulls]

In [8]:
def add_target(group):
    """
    This function takes in a group argument, which will be one team's season from the overall dataframe, and adds a new column to the group dataframe. 
    This column will contain the result of the team's next game, essentially making it a "won_next" column. This column will be used to make 
    predictions later.
    """
    group = group.copy()
    group["target"] = group["won"].shift(-1)
    return group

In [9]:
# adding the new column to our dataframe
# grouping by team and season ensures that our predictions are relevant
df = df.groupby(["team", "season"], group_keys=False).apply(add_target)

In [10]:
# checking if our grouping worked
df[df["team"] == "BOS"]

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,usg%_max_opp,ortg_max_opp,drtg_max_opp,team_opp,total_opp,home_opp,season,date,won,target
29,240.0,39.0,85.0,0.459,8.0,24.0,0.333,26.0,27.0,0.963,...,29.2,120.0,123.0,PHI,95,0,2016,2015-10-28,True,False
42,240.0,32.0,85.0,0.376,7.0,26.0,0.269,32.0,41.0,0.780,...,29.8,231.0,101.0,TOR,113,0,2016,2015-10-30,False,False
79,240.0,35.0,98.0,0.357,6.0,29.0,0.207,11.0,14.0,0.786,...,30.1,126.0,101.0,SAS,95,0,2016,2015-11-01,False,False
126,240.0,35.0,83.0,0.422,10.0,27.0,0.370,18.0,22.0,0.818,...,29.6,145.0,110.0,IND,100,1,2016,2015-11-04,False,True
149,240.0,44.0,97.0,0.454,12.0,30.0,0.400,18.0,23.0,0.783,...,35.1,171.0,119.0,WAS,98,0,2016,2015-11-06,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17763,240.0,30.0,80.0,0.375,15.0,37.0,0.405,13.0,17.0,0.765,...,36.4,263.0,103.0,GSW,107,1,2022,2022-06-05,False,True
17765,240.0,43.0,89.0,0.483,13.0,35.0,0.371,17.0,24.0,0.708,...,32.7,164.0,135.0,GSW,100,0,2022,2022-06-08,True,False
17767,240.0,34.0,85.0,0.400,15.0,38.0,0.395,14.0,19.0,0.737,...,36.3,133.0,112.0,GSW,107,0,2022,2022-06-10,False,False
17769,240.0,31.0,75.0,0.413,11.0,32.0,0.344,21.0,31.0,0.677,...,36.2,222.0,107.0,GSW,104,1,2022,2022-06-13,False,False


In [11]:
# checking that wins and losses are balanced in 'won'
df["won"].value_counts()

won
False    8886
True     8886
Name: count, dtype: int64

In [12]:
# encoding our target values
le = LabelEncoder()
df["target"] = le.fit_transform(df["target"]) # 0 = false, 1 = true, 2 = NaN
print(f"Encoded Values: {df['target'].unique()}")  # single quotes inside the f-string work on Python versions before 3.12

Encoded Values: [0 1 2]


In [13]:
# checking our target values
df["target"].value_counts()

target
1    8782
0    8780
2     210
Name: count, dtype: int64

There are 210 null values because the last game of each team's season has no game following it, so there is no result to predict. The count works out as 30 teams * 7 seasons * 1 final game per team per season = 210 null values.
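The same effect can be reproduced on a tiny example. This is a minimal sketch with a made-up schedule (the `toy` dataframe is hypothetical): shifting results up by one game within each team-season leaves exactly one missing target per group.

```python
import pandas as pd

# Hypothetical mini-schedule: two teams, two seasons, three games each.
toy = pd.DataFrame({
    "team":   ["BOS"] * 6 + ["GSW"] * 6,
    "season": [2016, 2016, 2016, 2017, 2017, 2017] * 2,
    "won":    [True, False, True, False, False, True,
               False, True, True, True, False, False],
})

# Same idea as add_target: shift each team-season's results up by one game.
toy["target"] = toy.groupby(["team", "season"])["won"].shift(-1)

missing = int(toy["target"].isna().sum())
groups = toy.groupby(["team", "season"]).ngroups
print(f"missing targets: {missing}, team-season groups: {groups}")
```

Here there are 2 teams * 2 seasons = 4 groups, so 4 targets are missing, just as 30 teams * 7 seasons gives 210 in our dataset.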

In [14]:
# scaling the features to the [0, 1] range, which helps Logistic Regression converge
scaler = MinMaxScaler()

columns_to_remove = ["season", "date", "won", "target", "team", "team_opp"]
columns_for_model = df.columns[~df.columns.isin(columns_to_remove)]

df[columns_for_model] = scaler.fit_transform(df[columns_for_model])

In [15]:
df

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,usg%_max_opp,ortg_max_opp,drtg_max_opp,team_opp,total_opp,home_opp,season,date,won,target
0,0.0,0.363636,0.338235,0.366029,0.206897,0.212121,0.395487,0.418605,0.412698,0.654609,...,0.277279,0.554502,0.317647,GSW,0.451923,1.0,2016,2015-10-27,False,0
1,0.0,0.431818,0.500000,0.322967,0.310345,0.378788,0.368171,0.209302,0.253968,0.519253,...,0.160462,0.345972,0.317647,CHI,0.317308,1.0,2016,2015-10-27,False,1
2,0.0,0.409091,0.397059,0.373206,0.241379,0.227273,0.437055,0.348837,0.349206,0.645274,...,0.088575,0.232227,0.329412,CLE,0.298077,0.0,2016,2015-10-27,True,1
3,0.0,0.500000,0.529412,0.377990,0.310345,0.393939,0.356295,0.441860,0.333333,0.893816,...,0.215661,0.530806,0.505882,NOP,0.298077,0.0,2016,2015-10-27,True,1
4,0.0,0.409091,0.323529,0.435407,0.275862,0.348485,0.351544,0.255814,0.222222,0.766628,...,0.019255,0.203791,0.317647,DET,0.403846,0.0,2016,2015-10-27,False,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17767,0.0,0.340909,0.367647,0.313397,0.517241,0.515152,0.469121,0.302326,0.285714,0.693116,...,0.182285,0.208531,0.411765,GSW,0.413462,0.0,2022,2022-06-10,False,0
17768,0.0,0.500000,0.411765,0.471292,0.310345,0.545455,0.267221,0.279070,0.222222,0.844807,...,0.928113,1.000000,0.411765,BOS,0.288462,0.0,2022,2022-06-13,True,1
17769,0.0,0.272727,0.220588,0.344498,0.379310,0.424242,0.408551,0.465116,0.476190,0.623104,...,0.181001,0.630332,0.352941,GSW,0.384615,1.0,2022,2022-06-13,False,0
17770,0.0,0.340909,0.294118,0.373206,0.379310,0.363636,0.466746,0.232558,0.174603,0.903151,...,0.120668,0.459716,0.400000,GSW,0.375000,0.0,2022,2022-06-16,False,2


## 4. Initial Modeling

In [16]:
home_teams = df[df["home"] == 1]
home_wins = home_teams[home_teams["won"] == 1]["won"].sum()

# calculating percentage of home teams that win
home_win_pct = home_wins / home_teams.shape[0]
print(f"Home Team Winning Percentage: {home_win_pct}")

Home Team Winning Percentage: 0.5716857978843124


If we predicted that the home team would win every NBA game, we'd end up with an accuracy score of 57.17%. This represents the baseline accuracy that the model we're about to build needs to beat.
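As a minimal sketch of how such a baseline is scored, the snippet below applies the "home team always wins" rule to six made-up games (the `toy_games` dataframe is hypothetical) and measures it with `accuracy_score`, the same metric we use for the model.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical results for six games, one row per home team.
toy_games = pd.DataFrame({
    "home": [1, 1, 1, 1, 1, 1],
    "won":  [True, True, False, True, False, True],
})

# The naive rule: always predict a home win.
always_home = [True] * len(toy_games)
baseline = accuracy_score(toy_games["won"], always_home)
print(f"Baseline accuracy: {baseline:.3f}")
```

The baseline accuracy is simply the share of home teams that won, which is why the home winning percentage above doubles as our baseline.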

In [17]:
def test_model(data, model, predictors, start=2):
    """
    This function trains and tests the model. This is done season by season, where all seasons preceding the selected season are used as training 
    data, and the selected season itself is used as testing data. Additionally, this function makes predictions, and concatenates the results of its
    predictions to a dataframe.
    """
    all_predictions = []
    seasons = sorted(data["season"].unique())

    # start is initialized as 2, meaning the first two seasons will not have predictions made on them (not enough training data preceding them)
    for i in range(start, len(seasons)):
        season = seasons[i]
        train = data[data["season"] < season]
        test = data[data["season"] == season]

        model.fit(train[predictors], train["target"])
        
        preds = model.predict(test[predictors])
        preds = pd.Series(preds, index=test.index)
        
        combined = pd.concat([test["target"], preds], axis=1)
        combined.columns = ["actual", "prediction"]
        
        all_predictions.append(combined)
    return pd.concat(all_predictions)

In [18]:
# creating our initial predictors, which are essentially just every column that isn't filled with strings or objects
predictors = list(columns_for_model)

### Picking a Classification Algorithm

For this project, I've decided to use the Logistic Regression classifier. This is because of its strength in dealing with high dimensions, its compatibility with SequentialFeatureSelector, and its ease of use. Hopefully, this algorithm will prove to be receptive to feature engineering and accurate in more fine-tuned modeling.

In [19]:
# creating the model
model = LogisticRegression(max_iter=10000)

In [20]:
# first iteration of the model
predictions_1 = test_model(df, model, predictors)
score_1 = accuracy_score(predictions_1["actual"], predictions_1["prediction"])

print(f"Initial Accuracy: {score_1}")

Initial Accuracy: 0.5274716498961827


At 52.75%, this accuracy is only slightly better than a coin flip, and it falls below our 57.17% baseline. This model could definitely benefit from some feature engineering.

## 5. Feature Engineering

### Filtering Features

The first change we can make to our model is filtering our features. Right now, our model uses around 140 features, many of which could be superfluous or redundant. By using SequentialFeatureSelector, we can make sure our model only uses the most useful features in the future.
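Before running the selector on the real data, here is a minimal sketch of what it does, using synthetic data in which only the first column carries signal. The array shapes, the random seed, and `n_features_to_select=2` are assumptions chosen to keep the example fast; they are not the settings used in this project.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)

# Synthetic data: 200 time-ordered rows, 6 features; only column 0 drives the label.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",             # start from no features and add one at a time
    cv=TimeSeriesSplit(n_splits=3),  # score candidates on time-ordered folds
)
selector.fit(X, y)

mask = selector.get_support()        # boolean mask over the 6 columns
print(mask)
```

Forward selection adds, at each step, the feature that most improves the cross-validated score, so the informative column should be picked first and the mask should have exactly two `True` entries.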

In [21]:
# creating our feature selector and time-ordered data split 
split = TimeSeriesSplit(n_splits=3)
select = SequentialFeatureSelector(model, n_features_to_select=40, cv=split, n_jobs=-1)

In [22]:
select.fit(df[columns_for_model], df["target"])

In [23]:
# finding the best features
new_predictors = list(columns_for_model[select.get_support()])
new_predictors

['3p%',
 'ft%',
 'orb',
 'pf',
 'drb%',
 'trb%',
 'usg%',
 'fg%_max',
 'ft_max',
 'fta_max',
 'pf_max',
 '+/-_max',
 'ts%_max',
 'drb%_max',
 'trb%_max',
 'tov%_max',
 'usg%_max',
 'drtg_max',
 'home',
 'fg_opp',
 'ft%_opp',
 'blk_opp',
 'ftr_opp',
 'orb%_opp',
 'trb%_opp',
 'blk%_opp',
 'usg%_opp',
 'fg_max_opp',
 '3p%_max_opp',
 'ast_max_opp',
 'stl_max_opp',
 'blk_max_opp',
 'pts_max_opp',
 '3par_max_opp',
 'drb%_max_opp',
 'ast%_max_opp',
 'stl%_max_opp',
 'blk%_max_opp',
 'tov%_max_opp',
 'home_opp']

In [24]:
# second iteration of the model
predictions_2 = test_model(df, model, new_predictors)
score_2 = accuracy_score(predictions_2["actual"], predictions_2["prediction"])

print(f"New Accuracy: {score_2}")

New Accuracy: 0.5375339402651333


By selecting the 40 most useful features, we increased our accuracy marginally, from 52.75% to 53.75%. Now our model is slightly more effective, but considerably more efficient.

### Finding Rolling Averages

The next change to make to our model is giving it rolling averages. Rolling averages can provide a lot of info about a team's performance, such as whether they're on a hot streak or slumping. With this kind of data, our model has the potential to be more accurate.
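As a minimal sketch of the idea, the snippet below computes a rolling average of points for one made-up team (the `toy` dataframe and the `pts_3` column are hypothetical). It uses a 3-game window instead of 10 so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical points scored by one team over six games, oldest first.
toy = pd.DataFrame({
    "team": ["BOS"] * 6,
    "season": [2016] * 6,
    "pts": [100, 110, 90, 120, 105, 95],
})

window = 3  # the notebook uses 10; 3 keeps the example short

# Mean of the current game and the two before it, computed within each team-season.
toy["pts_3"] = (
    toy.groupby(["team", "season"])["pts"]
       .transform(lambda s: s.rolling(window).mean())
)
print(toy)
```

The first `window - 1` rows of each group have no rolling average, which is why the early games of every team's season end up as missing values.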

In [25]:
# creating a dataframe to hold our rolling averages
df_rolling_avg = df[list(columns_for_model) + ['won', 'team', 'season']].copy()
rolling_sum = 10

# encoding 'won' so that it can be averaged
df_rolling_avg["won"] = le.fit_transform(df_rolling_avg["won"]) # 0 = false, 1 = true ('won' has no NaN values)

In [26]:
def find_rolling_averages(team):
    """
    This function finds the columns with numeric values in our dataframe, then returns a new dataframe holding the rolling averages of the values
    of the columns.
    """
    numeric_cols = team.select_dtypes(include='number').columns
    rolling = team[numeric_cols].rolling(rolling_sum).mean()
    return rolling

In [27]:
# adding the rolling averages to the dataframe
df_rolling_avg = df_rolling_avg.groupby(["team", "season"], group_keys=False).apply(find_rolling_averages)
df_rolling_avg

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,stl%_max_opp,blk%_max_opp,tov%_max_opp,usg%_max_opp,ortg_max_opp,drtg_max_opp,total_opp,home_opp,won,season
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17767,0.0,0.381818,0.292647,0.428230,0.468966,0.477273,0.448100,0.434884,0.373016,0.764177,...,0.0570,0.1113,0.471908,0.170603,0.431754,0.522353,0.348077,0.5,0.6,2022.0
17768,0.0,0.502273,0.364706,0.517703,0.455172,0.481818,0.440736,0.320930,0.282540,0.757993,...,0.0716,0.1171,0.374109,0.321566,0.642654,0.564706,0.392308,0.4,0.7,2022.0
17769,0.0,0.354545,0.279412,0.404545,0.437931,0.465152,0.429572,0.434884,0.385714,0.736639,...,0.0591,0.1113,0.483229,0.174711,0.438863,0.483529,0.350000,0.5,0.5,2022.0
17770,0.0,0.354545,0.294118,0.389952,0.434483,0.459091,0.431710,0.406977,0.357143,0.754142,...,0.0572,0.1111,0.483229,0.172144,0.460190,0.472941,0.344231,0.5,0.5,2022.0


In [28]:
# renaming the rolling averages columns
rolling_cols = [f"{col}_{rolling_sum}" for col in df_rolling_avg.columns]
df_rolling_avg.columns = rolling_cols
df_rolling_avg

Unnamed: 0,mp_10,fg_10,fga_10,fg%_10,3p_10,3pa_10,3p%_10,ft_10,fta_10,ft%_10,...,stl%_max_opp_10,blk%_max_opp_10,tov%_max_opp_10,usg%_max_opp_10,ortg_max_opp_10,drtg_max_opp_10,total_opp_10,home_opp_10,won_10,season_10
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17767,0.0,0.381818,0.292647,0.428230,0.468966,0.477273,0.448100,0.434884,0.373016,0.764177,...,0.0570,0.1113,0.471908,0.170603,0.431754,0.522353,0.348077,0.5,0.6,2022.0
17768,0.0,0.502273,0.364706,0.517703,0.455172,0.481818,0.440736,0.320930,0.282540,0.757993,...,0.0716,0.1171,0.374109,0.321566,0.642654,0.564706,0.392308,0.4,0.7,2022.0
17769,0.0,0.354545,0.279412,0.404545,0.437931,0.465152,0.429572,0.434884,0.385714,0.736639,...,0.0591,0.1113,0.483229,0.174711,0.438863,0.483529,0.350000,0.5,0.5,2022.0
17770,0.0,0.354545,0.294118,0.389952,0.434483,0.459091,0.431710,0.406977,0.357143,0.754142,...,0.0572,0.1111,0.483229,0.172144,0.460190,0.472941,0.344231,0.5,0.5,2022.0


In [29]:
# adding the rolling averages dataframe to our primary dataframe
df = pd.concat([df, df_rolling_avg], axis=1)
df = df.dropna()
df

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,stl%_max_opp_10,blk%_max_opp_10,tov%_max_opp_10,usg%_max_opp_10,ortg_max_opp_10,drtg_max_opp_10,total_opp_10,home_opp_10,won_10,season_10
243,0.0,0.522727,0.382353,0.523923,0.344828,0.333333,0.457245,0.255814,0.238095,0.708285,...,0.0628,0.0679,0.413522,0.124134,0.361611,0.449412,0.347115,0.4,0.8,2016.0
251,0.0,0.659091,0.426471,0.645933,0.620690,0.515152,0.562945,0.325581,0.238095,0.927655,...,0.0613,0.0772,0.469497,0.219641,0.394787,0.531765,0.324038,0.5,1.0,2016.0
252,0.0,0.386364,0.382353,0.358852,0.206897,0.181818,0.445368,0.511628,0.412698,0.827305,...,0.0625,0.1145,0.437841,0.138126,0.507109,0.360000,0.351923,0.6,0.4,2016.0
253,0.0,0.500000,0.382353,0.497608,0.344828,0.318182,0.475059,0.325581,0.349206,0.593932,...,0.0646,0.0759,0.512159,0.133633,0.277251,0.388235,0.308654,0.4,0.6,2016.0
256,0.0,0.318182,0.132353,0.500000,0.275862,0.272727,0.432304,0.581395,0.444444,0.879813,...,0.0741,0.0982,0.313312,0.179974,0.500000,0.471765,0.380769,0.5,0.4,2016.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17767,0.0,0.340909,0.367647,0.313397,0.517241,0.515152,0.469121,0.302326,0.285714,0.693116,...,0.0570,0.1113,0.471908,0.170603,0.431754,0.522353,0.348077,0.5,0.6,2022.0
17768,0.0,0.500000,0.411765,0.471292,0.310345,0.545455,0.267221,0.279070,0.222222,0.844807,...,0.0716,0.1171,0.374109,0.321566,0.642654,0.564706,0.392308,0.4,0.7,2022.0
17769,0.0,0.272727,0.220588,0.344498,0.379310,0.424242,0.408551,0.465116,0.476190,0.623104,...,0.0591,0.1113,0.483229,0.174711,0.438863,0.483529,0.350000,0.5,0.5,2022.0
17770,0.0,0.340909,0.294118,0.373206,0.379310,0.363636,0.466746,0.232558,0.174603,0.903151,...,0.0572,0.1111,0.483229,0.172144,0.460190,0.472941,0.344231,0.5,0.5,2022.0


In [30]:
# third iteration of the model
predictions_3 = test_model(df, model, new_predictors + rolling_cols)
score_3 = accuracy_score(predictions_3["actual"], predictions_3["prediction"])

print(f"New Accuracy: {score_3}")

New Accuracy: 0.573218761188686


By adding the 10-game rolling averages for our numerical features, we raised our accuracy to 57.32%, just above our 57.17% baseline. Still, we can take this up a notch.

In [31]:
select.fit(df[rolling_cols], df["target"])

In [32]:
# finding the best rolling average features
new_rolling_cols = [col for i, col in enumerate(rolling_cols) if select.get_support()[i]]
new_rolling_cols

['3p%_10',
 'ft%_10',
 'orb_10',
 'drb_10',
 'trb_10',
 'pf_10',
 'efg%_10',
 'orb%_10',
 'blk%_10',
 'usg%_10',
 'ortg_10',
 '3p%_max_10',
 'ft_max_10',
 'fta_max_10',
 'stl_max_10',
 'ftr_max_10',
 'stl%_max_10',
 'ortg_max_10',
 'fg%_opp_10',
 '3p%_opp_10',
 'ft%_opp_10',
 'stl_opp_10',
 'drb%_opp_10',
 'stl%_opp_10',
 'usg%_opp_10',
 'fg_max_opp_10',
 'fga_max_opp_10',
 'fg%_max_opp_10',
 '3p_max_opp_10',
 'drb_max_opp_10',
 'ast_max_opp_10',
 '+/-_max_opp_10',
 'efg%_max_opp_10',
 'ftr_max_opp_10',
 'orb%_max_opp_10',
 'trb%_max_opp_10',
 'stl%_max_opp_10',
 'drtg_max_opp_10',
 'won_10',
 'season_10']

In [33]:
# fourth iteration of the model
predictions_4 = test_model(df, model, new_predictors + new_rolling_cols)
score_4 = accuracy_score(predictions_4["actual"], predictions_4["prediction"])

print(f"New Accuracy: {score_4}")

New Accuracy: 0.5788578589330469


When evaluated on both the top 40 features and the top 40 rolling average features, our model achieved its best accuracy yet (57.89%), moving a little further past our baseline. Let's see if there are any more changes we can make to improve it.

### Adding Next Game Details

Another change we can make to our model is giving it information about a team's next game. In the real world, we have a schedule of games readily available, and for any given game, we know which team is at home and which team is performing better at the time. We can simulate this in our model by merging team games with the rolling averages of their opponents.
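Here is a minimal sketch of that shift-then-merge pattern on a made-up four-row schedule. The `toy` dataframe and its values are hypothetical, and `won_10` is just a stand-in for any rolling average column.

```python
import pandas as pd

# Hypothetical schedule: each game appears twice, once per team.
toy = pd.DataFrame({
    "team":     ["BOS", "GSW", "BOS", "GSW"],
    "team_opp": ["NYK", "LAL", "GSW", "BOS"],
    "date":     ["2022-01-01", "2022-01-01", "2022-01-03", "2022-01-03"],
    "won_10":   [0.7, 0.4, 0.6, 0.5],   # stand-in for a rolling average
})

# Look one game ahead within each team's schedule.
by_team = toy.groupby("team")
toy["team_opp_next"] = by_team["team_opp"].shift(-1)
toy["date_next"] = by_team["date"].shift(-1)

# Pair each row with the row of its next opponent that points at the same upcoming game.
opp = toy[["team", "team_opp_next", "date_next", "won_10"]]
merged = toy.merge(
    opp,
    left_on=["team", "date_next"],
    right_on=["team_opp_next", "date_next"],
)
print(merged[["team_x", "date_next", "team_y", "won_10_x", "won_10_y"]])
```

Columns ending in `_x` describe the team itself and columns ending in `_y` describe its next opponent. Rows with no next game have nothing to match, so the merge drops them.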

In [34]:
def shift_col(team, col_name):
    """
    This function creates a column with information about a team's next game.
    """
    next_col = team[col_name].shift(-1)
    return next_col

def add_col(df, col_name):
    """
    This function adds the column to the dataframe.
    """
    return df.groupby(["team", "season"], group_keys=False).apply(lambda x: shift_col(x, col_name))

In [35]:
# copying our dataframe (prevents slicing errors)
df = df.copy()

# adding the next game columns
df["home_next"] = add_col(df, "home")
df["team_opp_next"] = add_col(df, "team_opp")
df["date_next"] = add_col(df, "date")

df

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,usg%_max_opp_10,ortg_max_opp_10,drtg_max_opp_10,total_opp_10,home_opp_10,won_10,season_10,home_next,team_opp_next,date_next
243,0.0,0.522727,0.382353,0.523923,0.344828,0.333333,0.457245,0.255814,0.238095,0.708285,...,0.124134,0.361611,0.449412,0.347115,0.4,0.8,2016.0,0.0,BOS,2015-11-13
251,0.0,0.659091,0.426471,0.645933,0.620690,0.515152,0.562945,0.325581,0.238095,0.927655,...,0.219641,0.394787,0.531765,0.324038,0.5,1.0,2016.0,1.0,BRK,2015-11-14
252,0.0,0.386364,0.382353,0.358852,0.206897,0.181818,0.445368,0.511628,0.412698,0.827305,...,0.138126,0.507109,0.360000,0.351923,0.6,0.4,2016.0,0.0,MIN,2015-11-15
253,0.0,0.500000,0.382353,0.497608,0.344828,0.318182,0.475059,0.325581,0.349206,0.593932,...,0.133633,0.277251,0.388235,0.308654,0.4,0.6,2016.0,0.0,CHI,2015-11-16
256,0.0,0.318182,0.132353,0.500000,0.275862,0.272727,0.432304,0.581395,0.444444,0.879813,...,0.179974,0.500000,0.471765,0.380769,0.5,0.4,2016.0,0.0,CHO,2015-11-15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17767,0.0,0.340909,0.367647,0.313397,0.517241,0.515152,0.469121,0.302326,0.285714,0.693116,...,0.170603,0.431754,0.522353,0.348077,0.5,0.6,2022.0,0.0,GSW,2022-06-13
17768,0.0,0.500000,0.411765,0.471292,0.310345,0.545455,0.267221,0.279070,0.222222,0.844807,...,0.321566,0.642654,0.564706,0.392308,0.4,0.7,2022.0,0.0,BOS,2022-06-16
17769,0.0,0.272727,0.220588,0.344498,0.379310,0.424242,0.408551,0.465116,0.476190,0.623104,...,0.174711,0.438863,0.483529,0.350000,0.5,0.5,2022.0,1.0,GSW,2022-06-16
17770,0.0,0.340909,0.294118,0.373206,0.379310,0.363636,0.466746,0.232558,0.174603,0.903151,...,0.172144,0.460190,0.472941,0.344231,0.5,0.5,2022.0,,,


In [36]:
# merging our primary dataframe with the rolling averages of the next game's opponent
merged = df.merge(df[rolling_cols + ["team_opp_next", "date_next", "team"]], left_on=["team", "date_next"], right_on=["team_opp_next", "date_next"])
merged

Unnamed: 0,mp,fg,fga,fg%,3p,3pa,3p%,ft,fta,ft%,...,tov%_max_opp_10_y,usg%_max_opp_10_y,ortg_max_opp_10_y,drtg_max_opp_10_y,total_opp_10_y,home_opp_10_y,won_10_y,season_10_y,team_opp_next_y,team_y
0,0.00,0.477273,0.500000,0.375598,0.379310,0.348485,0.483373,0.441860,0.396825,0.730455,...,0.380294,0.273427,0.270616,0.478824,0.308654,0.6,0.7,2016.0,SAC,TOR
1,0.00,0.340909,0.250000,0.413876,0.310345,0.257576,0.509501,0.511628,0.412698,0.827305,...,0.437212,0.124904,0.404739,0.408235,0.428846,0.2,0.3,2016.0,TOR,SAC
2,0.50,0.409091,0.455882,0.330144,0.482759,0.515152,0.437055,0.372093,0.412698,0.568261,...,0.504403,0.153273,0.344076,0.384706,0.319231,0.7,0.5,2016.0,CLE,DET
3,0.25,0.545455,0.544118,0.416268,0.413793,0.454545,0.419240,0.186047,0.142857,0.883314,...,0.467505,0.276508,0.352607,0.482353,0.316346,0.7,0.6,2016.0,GSW,TOR
4,0.00,0.340909,0.558824,0.186603,0.206897,0.469697,0.203088,0.139535,0.111111,0.854142,...,0.413732,0.156739,0.470142,0.391765,0.436538,0.6,0.1,2016.0,DEN,NOP
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15587,0.00,0.545455,0.426471,0.511962,0.448276,0.469697,0.440618,0.372093,0.365079,0.659277,...,0.457128,0.235173,0.562085,0.552941,0.429808,0.4,0.6,2022.0,BOS,GSW
15588,0.00,0.477273,0.455882,0.409091,0.517241,0.590909,0.414489,0.255814,0.222222,0.766628,...,0.471908,0.170603,0.431754,0.522353,0.348077,0.5,0.6,2022.0,GSW,BOS
15589,0.00,0.340909,0.367647,0.313397,0.517241,0.515152,0.469121,0.302326,0.285714,0.693116,...,0.431761,0.242875,0.567773,0.575294,0.394231,0.4,0.7,2022.0,BOS,GSW
15590,0.00,0.500000,0.411765,0.471292,0.310345,0.545455,0.267221,0.279070,0.222222,0.844807,...,0.483229,0.174711,0.438863,0.483529,0.350000,0.5,0.5,2022.0,GSW,BOS


In [37]:
# validating our merge
merged[["team_x", "team_opp", "date", "team_opp_next_x", "date_next", "team_y", "team_opp_next_y"]][merged["team_x"] == "BOS"]

Unnamed: 0,team_x,team_opp,date,team_opp_next_x,date_next,team_y,team_opp_next_y
31,BOS,HOU,2015-11-16,DAL,2015-11-18,DAL,BOS
54,BOS,DAL,2015-11-18,BRK,2015-11-20,BRK,BOS
74,BOS,BRK,2015-11-20,BRK,2015-11-22,BRK,BOS
106,BOS,BRK,2015-11-22,ATL,2015-11-24,ATL,BOS
134,BOS,ATL,2015-11-24,PHI,2015-11-25,PHI,BOS
...,...,...,...,...,...,...,...
15582,BOS,GSW,2022-06-02,GSW,2022-06-05,GSW,BOS
15585,BOS,GSW,2022-06-05,GSW,2022-06-08,GSW,BOS
15587,BOS,GSW,2022-06-08,GSW,2022-06-10,GSW,BOS
15589,BOS,GSW,2022-06-10,GSW,2022-06-13,GSW,BOS


In [38]:
# creating our newest predictors, which are again mostly every column that isn't filled with strings or objects
columns_to_remove_2 = list(merged.columns[merged.dtypes == "object"]) + columns_to_remove
columns_for_model_2 = merged.columns[~merged.columns.isin(columns_to_remove_2)]

In [39]:
# fifth iteration of the model
predictions_5 = test_model(merged, model, columns_for_model_2)
score_5 = accuracy_score(predictions_5["actual"], predictions_5["prediction"])

print(f"New Accuracy: {score_5}")

New Accuracy: 0.6294675419401896


This was very successful! We added opponent data (including 10-game rolling averages) to our dataframe, which raised our accuracy from 57.89% to 62.95%. Maybe by tuning the predictors a little further, we can get even more accurate.

In [40]:
# removing predictors that were used in a previous model (reducing redundancy and dimensionality)
merged_predictors = [feature for feature in columns_for_model_2 if feature not in columns_for_model]

In [41]:
# sixth iteration of the model
predictions_6 = test_model(merged, model, merged_predictors)
score_6 = accuracy_score(predictions_6["actual"], predictions_6["prediction"])

print(f"New Accuracy: {score_6}")

New Accuracy: 0.6337527352297593


Here, we removed the original numerical features from our predictors list, since they are a touch redundant with the rolling averages of the same data. It made a slight difference, as our accuracy rose marginally, from 62.95% to 63.38%.

## 6. Model Comparison

In [46]:
columns = ["DataFrame Features", "Predictor Features", "Accuracy"]
rows = ["baseline"] + [f"Model {i}" for i in range(1, 7)]

scores = pd.DataFrame(columns=columns, index=rows)

scores["DataFrame Features"] = ["none", "original", "original", "original, team rolling averages", "original, team rolling averages",
                       "original, team/opponent rolling averages", "original, team/opponent rolling averages"]
scores["Predictor Features"] = ["home team always wins", "all original", "40 best original", "40 best original, all team rolling average", 
                      "40 best original, 40 best team rolling average", "all original, all team/opponent rolling average",
                      "all team/opponent rolling average"]
scores["Accuracy"] = [home_win_pct] + [score_1, score_2, score_3, score_4, score_5, score_6]

scores

Unnamed: 0,DataFrame Features,Predictor Features,Accuracy
baseline,none,home team always wins,0.571686
Model 1,original,all original,0.527472
Model 2,original,40 best original,0.537534
Model 3,"original, team rolling averages","40 best original, all team rolling average",0.573219
Model 4,"original, team rolling averages","40 best original, 40 best team rolling average",0.578858
Model 5,"original, team/opponent rolling averages","all original, all team/opponent rolling average",0.629468
Model 6,"original, team/opponent rolling averages",all team/opponent rolling average,0.633753


After rigorous feature engineering--filtering to find the best features, adding rolling averages, and evaluating opponent features--we can conclude that the final iteration of our model, which has a 63.38% accuracy score, is the most accurate.
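To put that number in context, a quick calculation shows how far the final model sits above the baseline. This is a small sketch that copies the two accuracies from the table above; the variable names are my own.

```python
# Accuracies copied from the comparison table above.
baseline = 0.571686
final_model = 0.633753

# Difference expressed in percentage points.
gain_points = (final_model - baseline) * 100
print(f"Gain over baseline: {gain_points:.2f} percentage points")
```

So the final model beats the "home team always wins" rule by roughly 6.2 percentage points.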

# Conclusion

In this project, we aimed to predict the results of NBA games between 2016 and 2022. This was done with thorough data preprocessing and feature engineering. Below are my thoughts and conclusions from the overall process.

 - **Data Exploration:** Through a brief examination, I gained insights into the characteristics and quality of our dataset.
 - **Data Preprocessing:** Various transformations, such as removing null values and encoding categorical variables, were performed to prepare the dataset for modeling.
 - **Initial Modeling:** I constructed a logistic regression classification model with an initial accuracy of 52.75%. I chose logistic regression as my algorithm due to its simplicity and its capability in dealing with high dimensions.
 - **Feature Engineering:** After filtering through hundreds of features, calculating rolling averages, and giving our model information on opposing teams, we managed to increase our accuracy to 63.38%, a respectable score.

In conclusion, this project was demanding, sometimes frustrating, but ultimately very instructive. I gained hands-on experience with supervised learning principles, feature engineering strategies, and the general workflow of a machine learning project. In the future, I hope to actually figure out hyperparameter tuning, and to build bigger and better models.