# Betting NBA Over/Unders

We need to import the pandas package to do all of our data frame manipulation.

In [1]:
import pandas as pd


Next, we use the read_csv function from pandas to import the data that our machine learning model is based on. After that, we call .head() to print the first few rows of the data frame and confirm that everything was imported the way we want.

I personally collected this data from the team stats and box scores sections of the official NBA website. I will include a link below. I have a CSV file with over 200 games' worth of data sampled from the 2017-2021 seasons. I did not go any further back than 2017 due to a fundamental shift in how teams play basketball. Thanks to the quality work of sports data analysts, teams began to acknowledge the value of the 3-point shot and embraced it heavily. Teams like the Houston Rockets and Golden State Warriors pioneered this change. To avoid the bias introduced by this paradigm shift, I omitted the years preceding 2017 altogether.

https://www.nba.com/stats/teams/traditional/?sort=W_PCT&dir=-1

In [None]:
df = pd.read_csv("C:\\NBA_Betting\\ModelTrainingandTestingData.csv")
df.head()


Next up, I want to print a list of columns so that I can copy and paste the ones I want into another list to be my features, or explanatory/independent variables. With this many attributes in the data frame, displaying df.columns on its own does not work, because pandas abbreviates the output and not all column names are shown. Wrapping it in list() prints every name.


In [7]:
print(list(df.columns))

['TEAM', 'Opponent', 'MATCH\xa0UP', 'Home Team', 'Away Team', 'GAME\xa0DATE', 'Game_Id', 'Oppenent_Id', 'Is_Home_Team', 'HomeTeam', 'AwayTeam', 'W/L', 'Home Team Win', 'Point Total', 'MIN', 'PTS', '+/-', 'H_WIN%', 'H_PTS_PG', 'H_FGM_PG', 'H_FGA_PG', 'H_CUM_FG%', 'H_3PM_PG', 'H_3PA_PG', 'H_CUM_3P%', 'H_FTM_PG', 'H_FTA_PG', 'H_CUM_FT%', 'H_OREB_PG', 'H_DREB_PG', 'H_REB_PG', 'H_AST_PG', 'H_TOV_PG', 'H_STL_PG', 'H_BLK_PG', 'H_BLKA_PG', 'H_PF_PG', 'H_PFD_PG', 'H_+/-_PG', 'A_WIN%', 'A_PTS_PG', 'A_FGM_PG', 'A_FGA_PG', 'A_CUM_FG%', 'A_3PM_PG', 'A_3PA_PG', 'A_CUM_3P%', 'A_FTM_PG', 'A_FTA_PG', 'A_CUM_FT%', 'A_OREB_PG', 'A_DREB_PG', 'A_REB_PG', 'A_AST_PG', 'A_TOV_PG', 'A_STL_PG', 'A_BLK_PG', 'H_BLKA_PG2', 'A_PF_PG', 'A_PFD_PG', 'A_+/-_PG', 'HLW_WIN%2', 'HLW_PTS_PG3', 'HLW_FGM_PG4', 'HLW_FGA_PG5', 'HLW_CUM_FG%6', 'HLW_3PM_PG7', 'HLW_3PA_PG8', 'HLW_CUM_3P%9', 'HLW_FTM_PG10', 'HLW_FTA_PG11', 'HLW_CUM_FT%12', 'HLW_OREB_PG13', 'HLW_DREB_PG14', 'HLW_REB_PG15', 'HLW_AST_PG16', 'HLW_TOV_PG17', 'HLW_STL_P
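For anyone following along, here is a small sketch of two ways to see every column name in a wide frame. The frame below is a made-up stand-in (150 placeholder columns), not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real data: 150 columns named col_0 ... col_149.
toy = pd.DataFrame([list(range(150))], columns=[f"col_{i}" for i in range(150)])

# Converting the column Index to a plain Python list prints every name in full.
all_columns = toy.columns.tolist()
print(all_columns)

# Alternative: tell pandas not to abbreviate wide frames when displaying them.
pd.set_option("display.max_columns", None)
```

Either approach should work; the list version is handy when the goal is to copy and paste names.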


## Feature Selection

In this step I copy the attributes from the list above that I think most influence how many points will be scored in a given NBA game. Here are the features I chose, each recorded for both the home and away teams in a game:

-Season winning %

-Mean points scored per game

-Mean field goals made per game

-Mean field goals attempted per game

-Season field goal %

-The same three statistics (made, attempted, and season %) for 3-point shots and free throws

-Offensive rebounds per game

-Defensive rebounds per game

-Total rebounds per game

-Assists per game

-Turnovers per game

-Steals per game

-Blocks per game

-Blocked field goal attempts per game

-Personal fouls per game

-Personal fouls drawn per game

-Season-long average point differential per game


In [15]:
features = ['H_WIN%', 'H_PTS_PG', 'H_FGM_PG', 'H_FGA_PG', 'H_CUM_FG%', 'H_3PM_PG', 'H_3PA_PG', 'H_CUM_3P%', 'H_FTM_PG', 'H_FTA_PG', 'H_CUM_FT%', 'H_OREB_PG', 'H_DREB_PG', 'H_REB_PG', 'H_AST_PG', 'H_TOV_PG', 'H_STL_PG', 'H_BLK_PG', 'H_BLKA_PG', 'H_PF_PG', 'H_PFD_PG', 'H_+/-_PG', 'A_WIN%', 'A_PTS_PG', 'A_FGM_PG', 'A_FGA_PG', 'A_CUM_FG%', 'A_3PM_PG', 'A_3PA_PG', 'A_CUM_3P%', 'A_FTM_PG', 'A_FTA_PG', 'A_CUM_FT%', 'A_OREB_PG', 'A_DREB_PG', 'A_REB_PG', 'A_AST_PG', 'A_TOV_PG', 'A_STL_PG', 'A_BLK_PG', 'H_BLKA_PG2', 'A_PF_PG', 'A_PFD_PG', 'A_+/-_PG']
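As an alternative to copying and pasting, the same list could be built with a label-based slice. Judging by the column listing printed earlier, the 44 chosen features sit next to each other, so on the real frame something like df.loc[:, 'H_WIN%':'A_+/-_PG'] should cover them. The sketch below shows the idea on a tiny made-up frame; the values and the shortened column set are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with made-up values; only the column order matters here.
toy = pd.DataFrame(
    [[1, 230, 0.6, 110.0, 0.4, 105.0, 0.5]],
    columns=["Game_Id", "Point Total", "H_WIN%", "H_PTS_PG",
             "A_WIN%", "A_PTS_PG", "HLW_WIN%2"],
)

# Label-based slicing with .loc includes BOTH endpoints.
features_from_slice = toy.loc[:, "H_WIN%":"A_PTS_PG"].columns.tolist()
print(features_from_slice)
```

This only works when the wanted columns are contiguous and the labels are unique, so it is worth comparing the result against the hand-built list once.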

## Clean it up!

Next up we need to deal with missing values so that our algorithm can run smoothly. I chose to drop the rows that contain them because there were so few, but if you prefer you could impute them instead, for example with the mean or median of each column. Note that calling dropna() with no arguments drops a row if any column in it is missing, not just the feature columns.

In [16]:
df_without_missing_values = df.dropna()
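To make the two options mentioned above concrete, here is a minimal sketch on made-up data. The "Notes" column and all values are hypothetical; the point is the difference between dropping on every column, dropping only on the columns the model needs, and median imputation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "H_PTS_PG": [110.0, np.nan, 105.0, 112.0],
    "A_PTS_PG": [108.0, 101.0, 99.0, 115.0],
    "Notes": [None, None, "OT", None],   # hypothetical column the model never uses
    "Point Total": [232, 227, 227, 261],
})
needed = ["H_PTS_PG", "A_PTS_PG", "Point Total"]

# Plain dropna() removes a row if ANY column is missing, even an unused one.
dropped_any = toy.dropna()

# Option 1: drop a row only if a column the model needs is missing.
dropped = toy.dropna(subset=needed)

# Option 2: keep every row and fill gaps with the column median.
imputed = toy.copy()
imputed[needed] = imputed[needed].fillna(imputed[needed].median())

print(len(dropped_any), len(dropped), len(imputed))
```

With only a handful of incomplete rows, dropping is a reasonable choice; restricting the check to the needed columns simply avoids losing games over fields the model ignores.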

## Set the Target

In this model we are trying to predict how many points will be scored by the two teams combined.  Earlier, while creating the dataset that I imported as a CSV I had already created this field which was simply a sum of the home team points scored and the away team points scored. Now, we just designate this column as our target or dependent variable. In the next cell I print the first 5 rows of "target" to make sure it is what I want it to be.


In [17]:
target = df_without_missing_values['Point Total']

In [18]:
target.head()

0    232
1    227
2    227
3    261
4    250
Name: Point Total, dtype: int64

## X's and y's

Next, I define a data frame with my desired features as X and a series with my target variable as y, to fit in with the standard conventions of many ML libraries.


In [19]:
X = df_without_missing_values[features]

In [20]:
y=target

## Splitting the data for training and testing

We need to split our data into training and testing sets. This randomly separates our data into two groups whose sizes we choose. The reason is that we don't want to test our model on data it has already seen. Evaluating on training data is a form of data leakage: it hides overfitting and makes the model look better than it really is. In essence, we are "hiding" a third of our data from the model as we train it so that, when we want to test it, we have games for it to predict whose results we know but it doesn't. This way we don't have to wait for the next month's worth of NBA games to find out whether our model is any good.

The reason we have to use data that the model hasn't seen before is that unseen data is what the model is going to face in real life. If we test it on data that it has already "seen", it will perform better on that test because it learned from those records. The results of that test won't give us any idea of how the model will perform in real life.

To perform this split we import the train_test_split function from Python's scikit-learn (sklearn) library. Then we perform the split. I chose to hold out a third of the data as test data.


In [21]:
from sklearn.model_selection import train_test_split

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
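A quick way to convince yourself that the split behaves as described is to run it on made-up data and check that no game lands in both sets. Everything below (the 100 random "games" and the column names f1 and f2) is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 "games" with two features and a target.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 2)), columns=["f1", "f2"])
y_toy = pd.Series(rng.integers(180, 260, size=100), name="Point Total")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

# No game appears in both sets, and together they cover all 100 rows.
overlap = set(X_tr.index) & set(X_te.index)
print(len(X_tr), len(X_te), len(overlap))
```

Fixing random_state makes the split reproducible, so the same games end up in the test set every time the cell is run.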

## Import Packages for Model Evaluation

This is a regression model, so we need to evaluate it as such. The metric I chose is root mean square error (RMSE). RMSE is the square root of the average squared difference between the predicted values and the actual values; when the errors average out to zero, it equals the standard deviation of those errors. We want a low RMSE because it indicates that our predictions are close to the actual results, that is, close to the "line of best fit". I want to note that our line of best fit in this case will not actually be a line, because we are using a gradient boosting regressor and not a linear one.


In [23]:
from sklearn.metrics import mean_squared_error
from math import sqrt
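To see what these two imports compute together, here is RMSE worked out by hand on three made-up games and compared against the scikit-learn route. In symbols, RMSE = sqrt((1/n) * sum((predicted_i - actual_i)^2)). The numbers are invented for the example.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([232, 227, 261])   # made-up actual point totals
y_pred = np.array([230, 231, 255])   # made-up predictions

errors = y_pred - y_true                    # -2, 4, -6
rmse_by_hand = sqrt(np.mean(errors ** 2))   # sqrt((4 + 16 + 36) / 3)
rmse_sklearn = sqrt(mean_squared_error(y_true, y_pred))
print(rmse_by_hand, rmse_sklearn)
```

Because the errors are squared before averaging, one badly missed game pulls RMSE up more than several small misses do, and the result is in the same units as the target (points).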

## It's Time to Run the Model!

First up we need to import our gradient boosting model from the sklearn library.  If you want to learn more about how this works and how to use it I encourage you to explore the documentation at:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

We fit the model (let it look at the training data and learn from it) and then predict the point totals for those same training games. Then we use the trained model to make predictions on the test set. Finally, we evaluate the training and testing predictions using RMSE and print the results. Remember, the lower the better. In this case the model performs really well on the training set (RMSE of about 0.4 points) and not as well on the test set (about 7.9 points). This makes sense because the model learned from the training set and has never seen the test set before. In an ideal case, the two numbers would be low and similar, but these results are tolerable. A large gap between strong training performance and weak testing performance is an indicator of high variance, or overfitting. The opposite problem is underfitting, or high bias, where the model is too simple and doesn't learn enough from the training data.

Some causes of overfitting are too many attributes, noisy data, or too much complexity in the model. In this case, I suspect the overfitting comes from redundant attributes and the high-variance nature of NBA games. Some causes of underfitting are too little data, features that don't explain the target, or a model that is too simple. We avoid most of that in this case.



In [27]:
from sklearn.ensemble import GradientBoostingRegressor

# Fit a gradient boosting regressor with default hyperparameters.
GB = GradientBoostingRegressor()
GB.fit(X_train, y_train)

# RMSE on the games the model learned from.
GB_predict_Train = GB.predict(X_train)
RMSE1 = sqrt(mean_squared_error(y_train, GB_predict_Train))
print("RMSE (training) for GB:{0:10f}".format(RMSE1))

# RMSE on the held-out games the model has never seen.
GB_predict_Test = GB.predict(X_test)
RMSE = sqrt(mean_squared_error(y_test, GB_predict_Test))
print("RMSE (Test Data) for GB:{0:10f}".format(RMSE))

RMSE (training) for GB:  0.415978
RMSE (Test Data) for GB:  7.895790
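If the gap between training and test RMSE is a concern, one thing to try is a less complex model. The sketch below is not the model used in this notebook: it runs on synthetic data, and the hyperparameter values (n_estimators=50, max_depth=2, learning_rate=0.05) are illustrative guesses rather than tuned settings. The helper train_test_rmse is a name made up for this example.

```python
from math import sqrt

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, noisy data standing in for real games (NOT the notebook's dataset).
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 10))
y_syn = 220 + 5 * X_syn[:, 0] + rng.normal(scale=10, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.33, random_state=42
)

def train_test_rmse(model):
    """Fit the model and return (training RMSE, test RMSE)."""
    model.fit(X_tr, y_tr)
    train = sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test = sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return train, test

default_rmse = train_test_rmse(GradientBoostingRegressor(random_state=0))

# Fewer, shallower trees and a smaller learning rate: illustrative, not tuned.
simpler_rmse = train_test_rmse(
    GradientBoostingRegressor(
        n_estimators=50, max_depth=2, learning_rate=0.05, random_state=0
    )
)
print("default train/test RMSE:", default_rmse)
print("simpler train/test RMSE:", simpler_rmse)
```

The simpler model fits the training games less tightly, which narrows the gap; whether the test RMSE itself improves depends on the data, so it is worth checking rather than assuming.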


## Making the Bets

Now that we have a model that is ready to predict the final scores of NBA games, we can use it to bet the Over/Under. To do this we need a data frame with the day's NBA schedule, including the feature values for the home and away teams of each game. I create a CSV with the most up-to-date values available on the NBA site and then load it into this notebook, once again using the pandas read_csv function. I also call .head() so that I can double-check that I imported what I wanted, and to show why I am about to drop the first two columns of this data frame.



In [35]:
Bet = pd.read_csv("C:\\NBA_Betting\\328Games.csv")
Bet.head()

Unnamed: 0,HomeTeam,AwayTeam,H_WIN%,H_PTS_PG,H_FGM_PG,H_FGA_PG,H_CUM_FG%,H_3PM_PG,H_3PA_PG,H_CUM_3P%,...,A_DREB_PG,A_REB_PG,A_AST_PG,A_TOV_PG,A_STL_PG,A_BLK_PG,H_BLKA_PG2,A_PF_PG,A_PFD_PG,H_+/-_PG23
0,DEN,CHA,0.587,111.8,41.3,86.2,47.9,12.8,36.1,35.4,...,33.9,44.9,27.5,13.1,8.6,4.9,4.7,19.9,19.8,0.4
1,ORL,CLE,0.267,104.2,38.2,88.2,43.3,12.0,36.4,33.0,...,34.1,44.4,25.2,14.6,7.1,4.2,4.6,17.2,20.1,2.5
2,ATL,IND,0.5,112.9,41.2,88.2,46.8,12.6,34.0,37.0,...,33.3,44.5,25.0,14.5,6.9,5.5,5.0,20.2,19.3,-3.1
3,BOS,TOR,0.627,110.7,40.3,87.2,46.3,12.9,36.7,35.2,...,31.9,45.3,22.1,12.5,8.9,4.6,5.1,19.7,19.2,2.0
4,CHI,NYK,0.581,111.6,41.8,86.9,48.1,10.8,29.2,37.1,...,34.7,46.3,21.6,13.4,7.0,4.9,4.7,20.5,20.4,-0.5


## Make it Match

To make predictions, the data frame needs to match the format (number of columns and column titles) of the training data. To accomplish this, I need to remove the first two columns, which hold the team codes. I also need to remove any null values that could have snuck into the CSV unintentionally.


In [29]:
Bet = Bet.iloc[: , 2:]

In [30]:
Bet = Bet.dropna()
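A slightly more defensive version of this step is sketched below. It removes the team columns by name instead of by position, keeps the matchup labels for later, and reuses the training column names. That last part matters because the preview above shows at least one column title (H_+/-_PG23) that differs from the training data (A_+/-_PG), and recent scikit-learn versions compare feature names at prediction time. The frame, the shortened feature list, and most of the values here are made up for illustration.

```python
import pandas as pd

# Shortened, hypothetical feature list standing in for the full 44 features.
features_toy = ["H_WIN%", "H_PTS_PG", "A_WIN%", "A_PTS_PG"]

# Illustrative slate; values are placeholders, not real stats.
tonight = pd.DataFrame({
    "HomeTeam": ["DEN", "ORL"],
    "AwayTeam": ["CHA", "CLE"],
    "H_WIN%": [0.587, 0.267],
    "H_PTS_PG": [111.8, 104.2],
    "A_WIN%": [0.500, 0.400],
    "A_PTS_PG2": [110.0, 103.0],   # name differs from training on purpose
})

# Keep the matchup labels so each prediction can be tied back to a game.
matchups = tonight["AwayTeam"] + " @ " + tonight["HomeTeam"]

# Remove the team codes by name rather than by position.
X_tonight = tonight.drop(columns=["HomeTeam", "AwayTeam"])

# If the columns are in training order but an export renamed some of them,
# reuse the training names; the length check guards against a missing column.
assert X_tonight.shape[1] == len(features_toy)
X_tonight.columns = features_toy

print(matchups.tolist())
print(X_tonight.columns.tolist())
```

Renaming by position is only safe when the columns really are in the same order as the training features, so that is an assumption to verify against the CSV.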

## Crystal Ball

Now we call on the model to work its magic and make the predictions for tonight's games. We use the predict function, which returns an array with one prediction per game. The predictions are in the same order as the input data, so make sure you don't forget which game is which just because we dropped the columns with the team codes!



In [31]:
GB_predict_Tonight=GB.predict(Bet)

In [32]:
GB_predict_Tonight

array([230.43999468, 211.76630551, 230.43621412, 220.42287937,
       238.98007558, 223.13676021, 237.57257982, 229.23819486,
       215.54706441])

## Make it Pretty

Now, for ease of use, I like to create a new data frame from the resulting array and display it without indices.

In [38]:
Tonights_Totals= pd.DataFrame(GB_predict_Tonight,columns=['scores'])

In [43]:
Tonights_Totals.style.hide(axis='index')

scores
230.439995
211.766306
230.436214
220.422879
238.980076
223.13676
237.57258
229.238195
215.547064


## Make your Money $$

Now compare the predicted score to the Over/Under line offered by your local sportsbook! When the predicted value is greater than the line, I bet the over. When the opposite is true, I bet the under. Disclaimer: Bet legally, responsibly, and at your own risk. This model and betting strategy can become outdated and ineffective quickly and for a variety of reasons. In the meantime, this model has been effective for me: my bets have hit about 58 percent of the time, which exceeds the roughly 53 percent needed to cover the "vig" or "juice". My sample size is smaller than a whole NBA season, but large enough to have me convinced that it really works. That is not because my model is better at predicting final scores than Vegas; in fact, the opposite is true. I'll elaborate on the betting theory of why I believe this system works outside of this notebook.
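The comparison rule described above can be sketched in a few lines. The game labels and sportsbook lines below are hypothetical, the predicted totals are just rounded example numbers, and the "pass" outcome for an exact tie is an addition for completeness rather than part of the strategy as stated.

```python
import pandas as pd

slate = pd.DataFrame({
    "game": ["Game 1", "Game 2", "Game 3"],   # placeholder labels
    "predicted_total": [230.4, 211.8, 230.4],
    "line": [226.5, 214.0, 230.5],            # hypothetical sportsbook lines
})

def pick(predicted, line):
    """Return 'over' if the prediction exceeds the line, 'under' if below, else 'pass'."""
    if predicted > line:
        return "over"
    if predicted < line:
        return "under"
    return "pass"

slate["bet"] = [pick(p, l) for p, l in zip(slate["predicted_total"], slate["line"])]
slate["edge"] = (slate["predicted_total"] - slate["line"]).round(1)
print(slate.to_string(index=False))
```

The edge column is only there to show how far the prediction sits from the line; with a test RMSE of several points, a prediction that is a fraction of a point away from the line is not much of a signal.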

I would view this as a fun experiment and not a money-making endeavor. Even if you could truly expect 5 percent returns on your money from betting Over/Unders in the NBA, that ROI is less than what you could expect to make investing in the market with a diversified portfolio. You can make about 7-9 percent on your money that way, and it's a whole lot easier (source: Trust me bro).
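For anyone curious about the arithmetic behind covering the vig, here is a small sketch. It assumes the common -110 pricing on each side of a total (risk 110 to win 100), which is an assumption and not something every sportsbook or every line will match; the function names are made up for this example.

```python
def break_even_rate(american_odds=-110):
    """Win rate needed to break even when betting at negative American odds."""
    risk = abs(american_odds)      # amount staked to win 100
    return risk / (risk + 100)

def roi_per_bet(win_rate, american_odds=-110):
    """Expected profit per unit staked at the given win rate."""
    risk = abs(american_odds)
    profit_if_win = 100 / risk     # profit per 1 unit staked
    return win_rate * profit_if_win - (1 - win_rate)

print(f"break-even win rate: {break_even_rate():.4f}")
print(f"expected return per unit staked at 58%: {roi_per_bet(0.58):.4f}")
```

Under that assumption the break-even rate works out to 110/210, a bit over 52 percent. Keep in mind that a per-bet figure like this is not directly comparable to an annual investment return, and a win rate measured on a small sample can easily drift.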

Lastly, the sportsbooks are under no obligation to welcome you as a customer.  If you start taking too much of their money they can cut you off and that's just no fun.

