# NBA-2023-Win-Lose-Predictor

## Overview

This project centers on a model that predicts the outcomes of the NBA's next season. 205 Sports Solutions was founded to give NBA teams a data-driven view of their future, so that they know what to expect and can set the right long-term plan for success. To build our model we used data from [ESPN](https://www.espn.com/nba/stats/player/_/season/2001/seasontype/2/table/offensive/sort/avgPoints/dir/desc), [nba.com](https://www.nba.com/stats/teams/boxscores/?Season=2021-22&SeasonType=Regular%20Season) and [Basketball-Reference.com](https://www.basketball-reference.com/teams/) covering the 2001 to 2022 NBA seasons, and analyzed 1,821 games from that span. We used player rankings and experience from pre-season and regular-season games to predict the points each team scores in a post-season game, and from those points the winner of that game.

## Business Problem

We were hired by an anonymous gambler to create a model that predicts whether a team will win or lose future NBA games. These predictions can help our stakeholders make more money betting on NBA games.

## Data Understanding

The data sets we used to create our models came from an [ESPN](https://www.espn.com/nba/stats/player/_/season/2001/seasontype/2/table/offensive/sort/avgPoints/dir/desc) database, [nba.com](https://www.nba.com/stats/teams/boxscores/?Season=2021-22&SeasonType=Regular%20Season) and [Basketball-Reference.com](https://www.basketball-reference.com/teams/). The data included 1,821 games from seasons between 2011-2022. We combined the three data sets in two steps. First we joined the player and team data sets so that each player's ranking sat next to that player. Then we located each team in the game data set and added its players' rankings.

Thus, we used the player ranking at each position to predict the points that a team will score against another team. Our model uses player ranking to predict a game's score because ranking is based on the average number of points a player scores per game. Ranking is therefore a good predictor of post-season points, since it reflects points scored in the season games that precede the post-season. Because each game appears twice, once from each team's perspective, our dataframe has 3,642 rows and 43 columns.
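
As a rough illustration of how a ranking can follow from average points per game, the sketch below ranks players within a season using pandas. The column names (`player`, `season`, `avg_pts`) and all values are hypothetical and are not the project's actual schema.

```python
import pandas as pd

# Hypothetical per-season scoring averages; the values are made up.
players = pd.DataFrame({
    "player": ["A", "B", "C", "D"],
    "season": [2021, 2021, 2021, 2021],
    "avg_pts": [27.1, 30.4, 12.0, 18.5],
})

# Rank 1 is the highest scoring average within each season.
players["rank"] = (
    players.groupby("season")["avg_pts"]
    .rank(ascending=False, method="first")
    .astype(int)
)

print(players.sort_values("rank"))
```

A lower rank number means a higher scoring average, which is the direction assumed here.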


## Modeling

### Preparation

#### Imports

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb
import numpy as np
from sklearn.metrics import mean_squared_error, explained_variance_score, max_error
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
import warnings

In [3]:
# Removing warnings
warnings.filterwarnings(action="ignore")

#### Definitions

Below is `TeamMatchUpGuesser`, which takes a fitted model and predicts the scores of two teams in their holdout games, and thus who won.

In [4]:
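def TeamMatchUpGuesser(estimator, holdout, team1, team2):
    """Print predicted points for the holdout games between team1 and team2.

    Relies on the globals pos_col and num_imp defined in the cells below.
    """
    # Select the games between the two teams, from either team's perspective
    matchup = ((holdout["team1"] == team1) & (holdout["team2"] == team2)) | ((holdout["team2"] == team1) & (holdout["team1"] == team2))
    df = holdout[matchup]
    pred_input_df = df.drop(columns=["pts", "team2", "team1", "game_date"])
    # Keep the identifying columns; copy so that adding "preds" leaves the holdout untouched
    names_df = df.drop(columns=pred_input_df.columns.tolist()).copy()
    # Add the same null indicator columns that were used in training
    for col in pos_col:
        pred_input_df[col + " is null"] = df[col].isna()
    # Impute with the imputer fitted on the training data, then cast to float
    X_imp_df = pd.DataFrame(num_imp.transform(pred_input_df)).astype(float)

    names_df["preds"] = list(estimator.predict(X_imp_df))
    print(names_df.head(2))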

#### Data set observation and cleaning

In [7]:
df = pd.read_csv("Data/win_loss_completed_df.csv")
print(list(df.columns))
df.head()

['game_date', 'pts', 'team1-1-power-forward', 'team1-1-point-guard', 'team1-1-shooting-guard', 'team1-1-small-forward', 'team1-1-center', 'team1-1-power-forward-exp', 'team1-1-point-guard-exp', 'team1-1-shooting-guard-exp', 'team1-1-small-forward-exp', 'team1-1-center-exp', 'team1-2-power-forward', 'team1-2-point-guard', 'team1-2-shooting-guard', 'team1-2-small-forward', 'team1-2-center', 'team1-2-power-forward-exp', 'team1-2-point-guard-exp', 'team1-2-shooting-guard-exp', 'team1-2-small-forward-exp', 'team1-2-center-exp', 'team2-1-power-forward', 'team2-1-point-guard', 'team2-1-shooting-guard', 'team2-1-small-forward', 'team2-1-center', 'team2-1-power-forward-exp', 'team2-1-point-guard-exp', 'team2-1-shooting-guard-exp', 'team2-1-small-forward-exp', 'team2-1-center-exp', 'team2-2-power-forward', 'team2-2-point-guard', 'team2-2-shooting-guard', 'team2-2-small-forward', 'team2-2-center', 'team2-2-power-forward-exp', 'team2-2-point-guard-exp', 'team2-2-shooting-guard-exp', 'team2-2-small

| | game_date | pts | team1-1-power-forward | team1-1-point-guard | team1-1-shooting-guard | team1-1-small-forward | team1-1-center | team1-1-power-forward-exp | team1-1-point-guard-exp | team1-1-shooting-guard-exp | ... | team2-2-shooting-guard | team2-2-small-forward | team2-2-center | team2-2-power-forward-exp | team2-2-point-guard-exp | team2-2-shooting-guard-exp | team2-2-small-forward-exp | team2-2-center-exp | team2 | team1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 103 | 68 | 5 | 157 | 73 | 57 | 6 | 12 | 3 | ... | 151 | 77 | 25 | 1 | 7 | 17 | 1 | 14 | BOS | GSW |
| 1 | 2022 | 90 | 105 | 63 | 5 | 87 | 71 | 2 | 1 | 5 | ... | 150 | 13 | 84 | 9 | 17 | R | 7 | 17 | GSW | BOS |
| 2 | 2022 | 104 | 68 | 5 | 157 | 73 | 57 | 6 | 12 | 3 | ... | 151 | 77 | 25 | 1 | 7 | 17 | 1 | 14 | BOS | GSW |
| 3 | 2022 | 94 | 105 | 63 | 5 | 87 | 71 | 2 | 1 | 5 | ... | 150 | 13 | 84 | 9 | 17 | R | 7 | 17 | GSW | BOS |
| 4 | 2022 | 107 | 68 | 5 | 157 | 73 | 57 | 6 | 12 | 3 | ... | 151 | 77 | 25 | 1 | 7 | 17 | 1 | 14 | BOS | GSW |


Creating a list of columns that excludes the points and team columns

In [4]:
cols = list(df.drop(columns=["pts", "team1", "team2"]).columns)

Turning all the "Nulls" into np.nan

In [5]:
# Replace the "Null" placeholder strings with np.nan in one vectorized step
df[cols] = df[cols].replace("Null", np.nan)

Creating the experience columns list and the non-experience columns list

In [6]:
exp_col = []
pos_col = []
for col in cols:
    if "exp" in col:
        exp_col.append(col)
    else:
        pos_col.append(col)

Turning "R" which means rookie to 0 as in they have played for 0 years

In [7]:
# Rookies ("R") have 0 years of experience
df[exp_col] = df[exp_col].replace("R", 0)

Holding back 2022 data so that we can build a bracket and compare with the real NBA post season bracket

In [8]:
hold_out_df = df[df["game_date"] == 2022]
working_df = df[df["game_date"] != 2022]
working_df = working_df.drop(columns=["game_date"])

Train-test split. We will try to predict the points of a game from both teams' perspectives.

In [9]:
X = working_df.drop(columns=["pts"])
y = working_df["pts"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.2)

Creating null indicator columns

In [10]:
for col in pos_col:
    X_train[col+" is null"] = df[col].isna()
    X_test[col+" is null"] = df[col].isna()

Replacing the nulls with the column mean, since we are working with ranking columns, and rankings are evenly spread and have an upper limit

In [11]:
num_imp = SimpleImputer(strategy='mean')
num_imp.fit(X_train.drop(columns=["team1", "team2"]))

X_train_imp = num_imp.transform(X_train.drop(columns=["team1", "team2"]))
X_test_imp = num_imp.transform(X_test.drop(columns=["team1", "team2"]))
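
The two steps above (null indicator columns plus mean imputation) can also be done in one pass with `SimpleImputer(add_indicator=True)`. The sketch below uses a toy frame with made-up column names and values. It is an alternative, not what the notebook does: `add_indicator` only adds indicator columns for features that had missing values during `fit`, whereas the loop above adds one for every column in `pos_col`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy ranking columns with missing values (made-up numbers).
X_toy = pd.DataFrame({
    "rank_a": [10.0, np.nan, 30.0],
    "rank_b": [5.0, 7.0, np.nan],
})

# add_indicator=True appends one 0/1 column per feature that had missing
# values during fit, placed after the imputed feature columns.
imp = SimpleImputer(strategy="mean", add_indicator=True)
X_toy_imp = imp.fit_transform(X_toy)

print(X_toy_imp)
```

The first two output columns are the imputed rankings and the last two flag which entries were originally missing.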

Turning all the columns into floats, currently the ranking columns are objects

In [12]:
X_train_imp_df = pd.DataFrame(X_train_imp)
X_test_imp_df = pd.DataFrame(X_test_imp)

for col in X_train_imp_df.columns:
    X_train_imp_df[col] = X_train_imp_df[col].astype(float)
    X_test_imp_df[col] = X_test_imp_df[col].astype(float)

### The Models

#### FSM

The following model was our first model, and it was not ideal. We chose a decision tree both to see how challenging the problem is and to see how well a model could do. Our decision tree is unpruned and unconstrained, so it should be close to perfectly accurate on the training data. If it is not, our later models will likely be unable to do better.

In [14]:
tree = DecisionTreeRegressor()
# fitting the data
tree.fit(X_train_imp_df, y_train)
# predictions
y_hat_train = tree.predict(X_train_imp_df) 
y_hat_test = tree.predict(X_test_imp_df) 
# printing the scores; squared=False makes mean_squared_error return the root mean squared error (RMSE)
# note: both lines report the explained variance on the test split
print("Train set: mean squared error", str(mean_squared_error(y_train, y_hat_train, squared=False)), "max error", str(max_error(y_train, y_hat_train)), "Explained variance", str(explained_variance_score(y_test, y_hat_test)))
print("Test set: mean squared error", str(mean_squared_error(y_test, y_hat_test, squared=False)), "max error", str(max_error(y_test, y_hat_test)), "Explained variance", str(explained_variance_score(y_test, y_hat_test)))


Train set: mean squared error 9.427153608164073 max error 45.5 Explained variance 0.21188258480127609
Test set: mean squared error 12.077867093119222 max error 39.5 Explained variance 0.21188258480127609


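As we can see, this model is overfit: its error is lower on the training data (RMSE of about 9.4 points) than on the test data (about 12.1 points). Even so, it does not fit the training data perfectly, which an unconstrained decision tree does whenever the features are enough to tell the training rows apart. Here they are not: in the preview above, rows 0, 2 and 4 share the same values in every column shown but have 103, 104 and 107 points.

Since this model is not 100% accurate on the training data, we will likely need more data, in particular features that distinguish games between the same rosters, if we want to get a score better than +/- 9.4 points.

The max error is fairly high on both sets, so going forward we tried to reduce the max error along with the root mean squared error.

A test-set explained variance of about 0.21 suggests that there is a lot of variation in points that this model does not account for.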
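
One likely reason the tree cannot reach zero training error is that the same matchup appears several times with different point totals, as in the first rows of the data preview. The toy sketch below, with made-up numbers, shows the effect: when identical feature rows have different targets, each leaf predicts their mean and some training error always remains.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Two distinct feature rows, each appearing twice with different targets.
X_toy = np.array([[1.0], [1.0], [2.0], [2.0]])
y_toy = np.array([100.0, 110.0, 90.0, 94.0])

tree_toy = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)
pred_toy = tree_toy.predict(X_toy)

# Each leaf predicts the mean target of its identical rows (105 and 92),
# so the training RMSE stays above zero.
rmse_toy = np.sqrt(mean_squared_error(y_toy, pred_toy))
print(pred_toy, rmse_toy)
```

Under this reading, more rows of the same kind would not remove the error. Only features that separate the repeated matchups would.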

#### XGB model

The next model is an XGBoost model.

In [15]:
param_grid = {
    "max_depth": [5],
    "learning_rate" : [.1],
    "num_parallel_tree" : [15]
}
xgb_regres = xgb.XGBRegressor(random_state=42)

In [16]:
gs1 = GridSearchCV(xgb_regres, param_grid, cv=5, scoring='neg_mean_absolute_error')
gs1.fit(X_train_imp_df, y_train)

GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, gamma=None,
                                    gpu_id=None, importance_type='gain',
                                    interaction_constraints=None,
                                    learning_rate=None, max_delta_step=None,
                                    max_depth=None, min_child_weight=None,
                                    missing=nan, monotone_constraints=None,
                                    n_estimators=100, n_jobs=None,
                                    num_parallel_tree=None, random_state=42,
                                    reg_alpha=None, reg_lambda=None,
                                    scale_pos_weight=None, subsample=None,
                                    tree_method=None, validate_parame

In [17]:
best_estimator1 = gs1.best_estimator_
y_hat_test1 = best_estimator1.predict(X_test_imp_df)
print("Test set: mean squared error", str(mean_squared_error(y_test, y_hat_test1, squared=False)), "max error", str(max_error(y_test, y_hat_test1)), "Explained variance", str(explained_variance_score(y_test, y_hat_test1)))

Test set: mean squared error 11.523063236672044 max error 38.13196563720703 Explained variance 0.2816532774455486


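As we can see, on the test set the root mean squared error dropped by about 0.55 points (12.08 to 11.52) and the max error dropped by about 1.4 points (39.5 to 38.13). The explained variance rose from about 0.21 to about 0.28, so this model is a modest improvement on every metric.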
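
For reference, the three metrics reported above can be computed by hand. The sketch below uses made-up point totals and checks the hand-computed values against scikit-learn. It takes the square root of `mean_squared_error` rather than passing `squared=False`, so that it does not depend on that argument.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, max_error, mean_squared_error

# Made-up actual and predicted point totals.
y_true = np.array([103.0, 90.0, 104.0, 94.0])
y_pred = np.array([110.0, 95.0, 100.0, 99.0])

resid = y_true - y_pred
rmse = np.sqrt(np.mean(resid ** 2))             # typical size of a miss, in points
worst = np.max(np.abs(resid))                   # single largest miss, in points
expl_var = 1 - np.var(resid) / np.var(y_true)   # share of variance captured

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))))
print(np.isclose(worst, max_error(y_true, y_pred)))
print(np.isclose(expl_var, explained_variance_score(y_true, y_pred)))
```

RMSE and max error are in points, while explained variance is unitless with a maximum of 1.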

## Results

Below are the results of our XGBoost model's predictions on the holdout set, and specifically what our predictor's output looks like for one such matchup.

In [19]:
TeamMatchUpGuesser(best_estimator1, hold_out_df, "BOS", "GSW")

   game_date  pts team2 team1       preds
0       2022  103   BOS   GSW  113.530663
1       2022   90   GSW   BOS  104.345093


Below we can see how close we got to the actual 2022 bracket. The red Xs mark the series we got wrong, with how far off we were written above them. The green boxes mark the series our model predicted correctly. Overall our model got 6 out of 15 correct, but most of the matchups we got wrong were fairly close.

Many of the series we got wrong were missed because our model does not account for players who were inactive due to injury, suspension, vaccination status, or other reasons.

![](Data/Team_bracket_2022.PNG)

## Conclusion

Although our model needs some improvements, it successfully predicted the bracket winner and came fairly close to predicting the winner of each series, and it should therefore be considered a success.


## For More Information

- [ESPN](https://www.espn.com/nba/stats/player/_/season/2001/seasontype/2/table/offensive/sort/avgPoints/dir/desc)
- [nba.com](https://www.nba.com/stats/teams/boxscores/?Season=2021-22&SeasonType=Regular%20Season)
- [Basketball-Reference.com](https://www.basketball-reference.com/teams/)
