We import the necessary libraries and load the CSV of basic NBA team stats. The data come from https://www.basketball-reference.com/leagues/NBA_2020.html

In [1]:
import pandas as pd
import numpy as np
df_nba_stats = pd.read_csv('NBA_Team_Data.csv')

The original dataset marks playoff teams with an asterisk after the team name. To avoid confusion, the asterisks are removed here. The last row holds the league average of each statistic; it is removed in a later cell.

In [2]:
# remove the asterisks at the end of playoff team names
# (regex=False treats '*' as a literal character, not a regex quantifier)
df_nba_stats['Team'] = df_nba_stats['Team'].str.replace( '*', '', regex=False )
# show the dataframe
df_nba_stats

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1.0,Dallas Mavericks,75,242.3,41.7,90.3,0.461,15.1,41.3,0.367,...,0.779,10.5,36.4,46.9,24.7,6.1,4.8,12.7,19.5,117.0
1,2.0,Milwaukee Bucks,73,241.0,43.3,90.9,0.476,13.8,38.9,0.355,...,0.742,9.5,42.2,51.7,25.9,7.2,5.9,15.1,19.6,118.7
2,3.0,Portland Trail Blazers,74,241.0,42.2,91.2,0.463,12.9,34.1,0.377,...,0.804,10.2,35.1,45.3,20.6,6.3,6.1,12.8,21.7,115.0
3,4.0,Houston Rockets,72,241.4,40.8,90.4,0.451,15.6,45.3,0.345,...,0.791,9.8,34.5,44.3,21.6,8.7,5.2,14.7,21.8,117.8
4,5.0,Los Angeles Clippers,72,241.4,41.6,89.2,0.466,12.4,33.5,0.371,...,0.791,10.7,37.0,47.7,23.7,7.1,4.7,14.6,22.1,116.3
5,6.0,New Orleans Pelicans,72,242.1,42.6,91.6,0.465,13.6,36.9,0.37,...,0.729,11.1,35.4,46.5,26.8,7.5,5.0,16.4,21.2,115.8
6,7.0,Phoenix Suns,73,241.0,41.2,88.1,0.468,11.4,31.8,0.358,...,0.834,9.8,33.8,43.5,27.2,7.7,4.0,14.8,22.0,113.6
7,8.0,Washington Wizards,72,241.0,41.5,90.9,0.457,12.0,32.6,0.368,...,0.788,10.2,31.9,42.0,25.0,8.0,4.3,14.2,22.7,114.4
8,9.0,Memphis Grizzlies,73,240.7,42.5,90.9,0.468,10.9,31.5,0.347,...,0.763,10.3,36.2,46.5,26.9,7.9,5.5,15.2,21.2,112.6
9,10.0,Boston Celtics,72,242.1,41.3,89.6,0.461,12.6,34.5,0.364,...,0.801,10.7,35.4,46.1,23.0,8.3,5.6,13.8,21.6,113.7
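
Whether `'*'` is read as a literal character or as a regular-expression quantifier depends on the `regex` argument of `Series.str.replace`, and its default has changed between pandas versions. The following is a minimal sketch on made-up team names (not the NBA data) that passes `regex=False` explicitly and shows `str.rstrip` as an alternative that only touches the end of each name.

```python
import pandas as pd

# Made-up team names purely for illustration; a trailing '*' marks a playoff team.
teams = pd.Series(["Alpha City*", "Beta Town", "Gamma Bay*"])

# regex=False treats '*' as a plain character rather than a regex quantifier.
cleaned = teams.str.replace("*", "", regex=False)

# str.rstrip removes the marker only from the end of each name.
stripped = teams.str.rstrip("*")

print(cleaned.tolist())
print(stripped.tolist())
```

Both calls give the same result here; `rstrip` is the narrower choice if an asterisk could ever appear inside a name.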


According to data scientists who work with basketball data, the best predictors of a good team are effective field goal percentage, turnover percentage, free throw factor, offensive rebounding percentage, and defensive rebounding percentage. Three of those five (effective field goal percentage, turnover percentage, and free throw factor) are computed here as new columns in the dataframe. The last row of the dataframe is the league average of each statistic, so it is saved separately and then dropped.

In [3]:
EFG_percent = (df_nba_stats['FG'] + 0.5 * df_nba_stats['3P']) / df_nba_stats['FGA']
df_nba_stats['eFG%'] = EFG_percent

TOV_percent = df_nba_stats['TOV'] / (df_nba_stats['FGA'] + 0.44 * df_nba_stats['FTA'] + df_nba_stats['AST'] + df_nba_stats['TOV'])
df_nba_stats['TOV%'] = TOV_percent

FT_factor = df_nba_stats['FT'] / df_nba_stats['FGA']
df_nba_stats['FTF'] = FT_factor

# row 30 is the league average (rows 0-29 are the 30 teams);
# keep it as a separate Series in case it becomes useful later
league_average = df_nba_stats.iloc[30]
df_nba_stats = df_nba_stats.drop(df_nba_stats.index[[30]])
df_nba_stats

Unnamed: 0,Rk,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,TRB,AST,STL,BLK,TOV,PF,PTS,eFG%,TOV%,FTF
0,1.0,Dallas Mavericks,75,242.3,41.7,90.3,0.461,15.1,41.3,0.367,...,46.9,24.7,6.1,4.8,12.7,19.5,117.0,0.545404,0.091914,0.20598
1,2.0,Milwaukee Bucks,73,241.0,43.3,90.9,0.476,13.8,38.9,0.355,...,51.7,25.9,7.2,5.9,15.1,19.6,118.7,0.552255,0.105766,0.20132
2,3.0,Portland Trail Blazers,74,241.0,42.2,91.2,0.463,12.9,34.1,0.377,...,45.3,20.6,6.3,6.1,12.8,21.7,115.0,0.533443,0.095292,0.194079
3,4.0,Houston Rockets,72,241.4,40.8,90.4,0.451,15.6,45.3,0.345,...,44.3,21.6,8.7,5.2,14.7,21.8,117.8,0.537611,0.10638,0.227876
4,5.0,Los Angeles Clippers,72,241.4,41.6,89.2,0.466,12.4,33.5,0.371,...,47.7,23.7,7.1,4.7,14.6,22.1,116.3,0.535874,0.104982,0.233184
5,6.0,New Orleans Pelicans,72,242.1,42.6,91.6,0.465,13.6,36.9,0.37,...,46.5,26.8,7.5,5.0,16.4,21.2,115.8,0.539301,0.113029,0.186681
6,7.0,Phoenix Suns,73,241.0,41.2,88.1,0.468,11.4,31.8,0.358,...,43.5,27.2,7.7,4.0,14.8,22.0,113.6,0.53235,0.105284,0.22588
7,8.0,Washington Wizards,72,241.0,41.5,90.9,0.457,12.0,32.6,0.368,...,42.0,25.0,8.0,4.3,14.2,22.7,114.4,0.522552,0.100764,0.213421
8,9.0,Memphis Grizzlies,73,240.7,42.5,90.9,0.468,10.9,31.5,0.347,...,46.5,26.9,7.9,5.5,15.2,21.2,112.6,0.527503,0.106598,0.182618
9,10.0,Boston Celtics,72,242.1,41.3,89.6,0.461,12.6,34.5,0.364,...,46.1,23.0,8.3,5.6,13.8,21.6,113.7,0.53125,0.101019,0.207589
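
To make the three formulas easier to check by hand, here is a small sketch that writes them as plain functions. The function names are hypothetical and not part of the notebook. The Dallas inputs for eFG% are the per-game values shown in the table above; the inputs for the other two functions are made-up numbers, because the FT and FTA columns are hidden in the truncated display.

```python
def effective_fg_pct(fg, three_p, fga):
    """eFG% = (FG + 0.5 * 3P) / FGA."""
    return (fg + 0.5 * three_p) / fga


def turnover_pct(tov, fga, fta, ast):
    """TOV% as computed in this notebook, with assists in the denominator."""
    return tov / (fga + 0.44 * fta + ast + tov)


def free_throw_factor(ft, fga):
    """FTF = FT / FGA."""
    return ft / fga


# Dallas Mavericks per-game values from the table: FG 41.7, 3P 15.1, FGA 90.3.
dallas_efg = effective_fg_pct(41.7, 15.1, 90.3)
print(round(dallas_efg, 6))

# Made-up inputs, only to exercise the other two formulas.
example_tov = turnover_pct(tov=10, fga=80, fta=25, ast=20)
example_ftf = free_throw_factor(ft=18, fga=90)
print(example_tov, example_ftf)
```

The Dallas value should agree with the eFG% entry in the first row of the table (0.545404).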


To obtain the other reportedly important statistics, we load a second CSV. We drop its effective field goal percentage column because we already computed that statistic above. We also drop its turnover percentage column and the opponent's turnover percentage column: turnover percentage was already computed above, and the calculation on Basketball Reference differs slightly from the version used here (Basketball Reference does not include assists in the denominator).

In [4]:
# load in the dataframe
df_miscellaneous_nba_stats = pd.read_csv('NBA_Team_Miscellaneous_Data.csv')
# drop unnecessary columns
df_miscellaneous_nba_stats = df_miscellaneous_nba_stats.drop(['eFG%','TOV%','TOV%.1'], axis=1)
# show the updated dataframe
df_miscellaneous_nba_stats

Unnamed: 0,Rk,Team,Age,W,L,PW,PL,MOV,SOS,SRS,...,3PAr,TS%,ORB%,FT/FGA,eFG%.1,DRB%,FT/FGA.1,Arena,Attend.,Attend./G
0,1.0,Milwaukee Bucks*,29.2,56.0,17.0,57,16,10.08,-0.67,9.41,...,0.428,0.583,20.7,0.201,0.489,81.6,0.178,Fiserv Forum,549036,17711
1,2.0,Los Angeles Clippers*,27.4,49.0,23.0,50,22,6.44,0.21,6.66,...,0.375,0.577,23.5,0.233,0.506,77.6,0.206,STAPLES Center,610176,19068
2,3.0,Los Angeles Lakers*,29.5,52.0,19.0,48,23,5.79,0.49,6.28,...,0.358,0.573,24.5,0.201,0.515,78.8,0.205,STAPLES Center,588907,18997
3,4.0,Toronto Raptors*,26.6,53.0,19.0,50,22,6.24,-0.26,5.97,...,0.421,0.574,21.3,0.21,0.502,76.7,0.202,Scotiabank Arena,633456,19796
4,5.0,Boston Celtics*,25.3,48.0,24.0,50,22,6.31,-0.47,5.83,...,0.386,0.57,23.9,0.207,0.509,77.4,0.215,TD Garden,610864,19090
5,6.0,Dallas Mavericks*,26.1,43.0,32.0,49,26,4.95,-0.07,4.87,...,0.457,0.581,23.2,0.206,0.525,77.7,0.175,American Airlines Center,682096,20062
6,7.0,Houston Rockets*,29.2,44.0,28.0,42,30,2.96,0.17,3.13,...,0.501,0.578,21.0,0.228,0.529,75.6,0.197,Toyota Center,578458,18077
7,8.0,Miami Heat*,25.9,44.0,29.0,43,30,2.95,-0.35,2.59,...,0.419,0.587,20.3,0.234,0.523,79.5,0.213,AmericanAirlines Arena,629771,19680
8,9.0,Utah Jazz*,27.3,44.0,28.0,42,30,2.47,0.05,2.52,...,0.414,0.585,21.6,0.208,0.518,78.9,0.185,Vivint Smart Home Arena,567486,18306
9,10.0,Denver Nuggets*,25.6,46.0,27.0,41,32,2.11,0.24,2.35,...,0.344,0.567,24.8,0.183,0.533,76.8,0.198,Pepsi Center,633153,19186


Before removing the asterisks, we use them to build two variables. playoff_teams holds the names of the teams that made the playoffs, and non_playoff_teams holds the names of the teams that did not. Both keep the row index of the miscellaneous dataframe, so a later index-aligned column assignment yields NaN wherever a row index has no entry in playoff_teams. Finally, we save the league average row in case we need it later and drop it from the dataframe.

In [5]:
# find teams with an asterisk as the last character and store it in a variable
playoff_teams = df_miscellaneous_nba_stats['Team'][df_miscellaneous_nba_stats['Team'].str.endswith("*")]
# replace the asterisk with an empty string
playoff_teams = playoff_teams.str.replace( '*', '', regex=False )

# find the teams that do not have an asterisk as the last character and store it in a variable
non_playoff_teams = df_miscellaneous_nba_stats['Team'][~df_miscellaneous_nba_stats['Team'].str.endswith("*")]

# replace the asterisks in the dataframe
df_miscellaneous_nba_stats['Team'] = df_miscellaneous_nba_stats['Team'].str.replace( '*', '', regex=False )

# make a variable for league average stats
league_average_miscellaneous_stats = df_miscellaneous_nba_stats.iloc[30]
# drop the league average stats
df_miscellaneous_nba_stats = df_miscellaneous_nba_stats.drop(df_miscellaneous_nba_stats.index[[30]])

Here we merge the two dataframes on the Team column.

In [6]:
df_all_team_stats = pd.merge(df_nba_stats, df_miscellaneous_nba_stats, on='Team')
# show the resulting merged dataframe
df_all_team_stats

Unnamed: 0,Rk_x,Team,G,MP,FG,FGA,FG%,3P,3PA,3P%,...,3PAr,TS%,ORB%,FT/FGA,eFG%.1,DRB%,FT/FGA.1,Arena,Attend.,Attend./G
0,1.0,Dallas Mavericks,75,242.3,41.7,90.3,0.461,15.1,41.3,0.367,...,0.457,0.581,23.2,0.206,0.525,77.7,0.175,American Airlines Center,682096,20062
1,2.0,Milwaukee Bucks,73,241.0,43.3,90.9,0.476,13.8,38.9,0.355,...,0.428,0.583,20.7,0.201,0.489,81.6,0.178,Fiserv Forum,549036,17711
2,3.0,Portland Trail Blazers,74,241.0,42.2,91.2,0.463,12.9,34.1,0.377,...,0.374,0.57,22.4,0.194,0.53,75.3,0.208,Moda Center,628303,19634
3,4.0,Houston Rockets,72,241.4,40.8,90.4,0.451,15.6,45.3,0.345,...,0.501,0.578,21.0,0.228,0.529,75.6,0.197,Toyota Center,578458,18077
4,5.0,Los Angeles Clippers,72,241.4,41.6,89.2,0.466,12.4,33.5,0.371,...,0.375,0.577,23.5,0.233,0.506,77.6,0.206,STAPLES Center,610176,19068
5,6.0,New Orleans Pelicans,72,242.1,42.6,91.6,0.465,13.6,36.9,0.37,...,0.403,0.568,24.2,0.186,0.532,77.8,0.212,Smoothie King Center,528172,16505
6,7.0,Phoenix Suns,73,241.0,41.2,88.1,0.468,11.4,31.8,0.358,...,0.361,0.576,22.2,0.226,0.539,78.8,0.221,Talking Stick Resort Arena,550633,15606
7,8.0,Washington Wizards,72,241.0,41.5,90.9,0.457,12.0,32.6,0.368,...,0.358,0.562,22.2,0.213,0.558,75.3,0.231,Capital One Arena,532702,16647
8,9.0,Memphis Grizzlies,73,240.7,42.5,90.9,0.468,10.9,31.5,0.347,...,0.346,0.561,23.0,0.183,0.521,77.8,0.217,FedEx Forum,523297,15857
9,10.0,Boston Celtics,72,242.1,41.3,89.6,0.461,12.6,34.5,0.364,...,0.386,0.57,23.9,0.207,0.509,77.4,0.215,TD Garden,610864,19090
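
A merge on a text key silently drops rows whose names do not match exactly. The sketch below, on two tiny made-up tables (not the NBA files), shows one way to check for that using the `indicator` and `validate` arguments of `pd.merge`. The table and column contents are assumptions for illustration only.

```python
import pandas as pd

# Two tiny made-up tables standing in for the per-game and miscellaneous data.
per_game = pd.DataFrame({"Team": ["Alpha", "Beta", "Gamma"],
                         "PTS": [110.0, 105.5, 99.8]})
misc = pd.DataFrame({"Team": ["Beta", "Alpha", "Gama"],  # "Gama" is a deliberate typo
                     "NRtg": [1.2, 3.4, -2.0]})

# An outer merge with indicator=True records where each row came from;
# validate="one_to_one" raises if a team name is duplicated on either side.
check = pd.merge(per_game, misc, on="Team", how="outer",
                 indicator=True, validate="one_to_one")

# Rows that are not present in both tables would be lost by an inner merge.
unmatched = check.loc[check["_merge"] != "both", "Team"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every team appears in both tables, so the default inner merge loses nothing.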


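We create a column for playoff teams and replace NaN values with 0, in preparation for turning the column into a binary indicator. Note that assigning a Series to a dataframe column aligns on the row index rather than on the team name, so each name lands on the row whose index matches that team's position in the miscellaneous dataframe; this is why row 0 below reads Milwaukee Bucks even though row 0 of the merged dataframe is Dallas Mavericks.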

In [7]:
# create the new column and input the playoff team variable into the column
df_all_team_stats['Playoff Team'] = playoff_teams
# replace nan values with 0
df_all_team_stats['Playoff Team'] = df_all_team_stats['Playoff Team'].fillna(0)
# show the added column in the dataframe
df_all_team_stats['Playoff Team']

0            Milwaukee Bucks
1       Los Angeles Clippers
2         Los Angeles Lakers
3            Toronto Raptors
4             Boston Celtics
5           Dallas Mavericks
6            Houston Rockets
7                 Miami Heat
8                  Utah Jazz
9             Denver Nuggets
10     Oklahoma City Thunder
11        Philadelphia 76ers
12            Indiana Pacers
13                         0
14                         0
15    Portland Trail Blazers
16                         0
17                         0
18             Orlando Magic
19             Brooklyn Nets
20                         0
21                         0
22                         0
23                         0
24                         0
25                         0
26                         0
27                         0
28                         0
29                         0
Name: Playoff Team, dtype: object

This for loop replaces every nonzero value (a playoff team name) with 1.

In [8]:
for team in df_all_team_stats['Playoff Team']:
    if team != 0:
        df_all_team_stats['Playoff Team'] = df_all_team_stats['Playoff Team'].replace(team, 1)

# show the new column
df_all_team_stats['Playoff Team']

0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    0
14    0
15    1
16    0
17    0
18    1
19    1
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
Name: Playoff Team, dtype: int64
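
Because pandas aligns a Series on its index when it is assigned as a column, the order of the rows matters for the labels above. The following sketch, on made-up team names, contrasts index-aligned assignment with matching on the team name through `isin`. It assumes the goal is to label each row according to its own team name; all data here is hypothetical.

```python
import pandas as pd

# Made-up data: the merged table and the playoff list are ordered differently.
merged = pd.DataFrame({"Team": ["Alpha", "Beta", "Gamma", "Delta"]})
misc_team = pd.Series(["Gamma*", "Alpha*", "Delta", "Beta"])

# Playoff names keep the index of misc_team (here 0 and 1).
playoff_names = misc_team[misc_team.str.endswith("*")].str.rstrip("*")

# Index-aligned assignment: rows 0 and 1 get a name, whatever team they hold.
by_index = merged.copy()
by_index["Playoff Team"] = playoff_names
by_index["Playoff Team"] = by_index["Playoff Team"].notna().astype(int)

# Name-based matching: independent of row order.
by_name = merged.copy()
by_name["Playoff Team"] = by_name["Team"].isin(playoff_names).astype(int)

print(by_index)
print(by_name)
```

In this toy case the two approaches disagree for Beta and Gamma, which is the situation to watch for whenever the two source tables are sorted differently.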

We import the logistic regression tool and set the predictors to points per game, effective field goal percentage, turnover percentage, free throw factor, net rating, offensive rebounding percentage, and defensive rebounding percentage, selected by column position. The response is the newly created playoff team column, with 1 for a playoff team and 0 for a non-playoff team.

In [17]:
# import logistic regression tool
from sklearn.linear_model import LogisticRegression
# set the columns for predictors
pred_list = [24,25,26,27,39,44,47]
# create the predictor variable
predictors = df_all_team_stats.iloc[:,pred_list]
# create the response variable
response = df_all_team_stats.iloc[:,-1]

# make the logistic regression model
model = LogisticRegression()
model.fit( predictors, response )

LogisticRegression()
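
Selecting predictors by position works as long as the column order never changes. As a hedged alternative, the sketch below selects by label, using the column names that appear in the coefficient printouts further down in this notebook. The two-row dataframe is made up and only mimics the relevant columns.

```python
import pandas as pd

# Column labels as they appear in the coefficient printouts of this notebook.
pred_names = ["PTS", "eFG%", "TOV%", "FTF", "NRtg", "ORB%", "DRB%"]

# A made-up two-row frame that only mimics the relevant columns.
df = pd.DataFrame({
    "Team": ["Alpha", "Beta"],
    "PTS": [110.0, 104.0], "eFG%": [0.54, 0.51], "TOV%": [0.10, 0.11],
    "FTF": [0.21, 0.19], "NRtg": [3.1, -1.4], "ORB%": [22.0, 24.5],
    "DRB%": [78.0, 76.2], "Playoff Team": [1, 0],
})

predictors = df[pred_names]       # selection by label
response = df["Playoff Team"]

# get_indexer recovers the positional equivalents (-1 for a missing label).
positions = df.columns.get_indexer(pred_names)
print(positions)
```

Selecting by label raises a `KeyError` if a column is missing, whereas a stale positional index would quietly pick a different statistic.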

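This code computes the precision, recall, and F1 score of the model. Precision is the share of predicted playoff teams that really are playoff teams, recall is the share of actual playoff teams that the model identifies, and F1 is their harmonic mean. Because the predictions are made on the same 30 rows used to fit the model, these scores describe the fit rather than performance on unseen data.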

In [18]:
predictions = model.predict(predictors)
# count the true positives, false positives, and false negatives
# (the bitwise & and ~ below rely on labels and predictions being 0/1 integers)
TP = ( df_all_team_stats['Playoff Team'] & predictions ).sum()
FP = ( ~df_all_team_stats['Playoff Team'] & predictions ).sum()
FN = ( df_all_team_stats['Playoff Team'] & ~predictions ).sum()

# making precision, recall, and F1
precision = TP / ( TP + FP )
recall = TP / ( TP + FN )
F1 = 2 * precision * recall / ( precision + recall )

precision, recall, F1

(0.9375, 0.9375, 0.9375)
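
The counting above can also be written with explicit boolean masks, which avoids relying on bitwise operations over integers and lets the zero-denominator cases be handled. This is a minimal sketch with a hypothetical helper name and made-up labels, not the notebook's own function.

```python
import numpy as np


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for 0/1 labels; 0.0 where a ratio is undefined."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)     # predicted 1, actually 1
    fp = np.sum(~y_true & y_pred)    # predicted 1, actually 0
    fn = np.sum(y_true & ~y_pred)    # predicted 0, actually 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0]
result = precision_recall_f1(y_true, y_pred)
print(result)
```

scikit-learn offers `precision_score`, `recall_score`, and `f1_score` in `sklearn.metrics` for the same purpose.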

Now, we split the data into training and validation sets. Scoring on the validation set gives a glimpse of how the model performs on data it was not fitted to.

In [19]:
# randomly select 70% of the rows for the training set
# (no seed is set, so the split changes on every run)
rows_for_training = np.random.choice( df_all_team_stats.index, int(0.7 * len(df_all_team_stats)), replace=False )
# a Boolean array that shows which of the data points is included
# in the training set (True means included)
training = df_all_team_stats.index.isin( rows_for_training )
# making the dataframe for training set data points
df_training = df_all_team_stats[training]
# making the dataframe for validation set data points
df_validation = df_all_team_stats[~training]
# output the number of data points in the training and validation set, respectively
len( df_training ), len( df_validation )

(21, 9)
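
Since the split above is unseeded, every run produces a different training set. One possible way to make the split repeatable, and to keep the share of playoff teams similar in both parts, is scikit-learn's `train_test_split` with `random_state` and `stratify`. The sketch below uses a made-up stand-in for the merged dataframe; the seed value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the merged table: 30 rows, 16 labelled as playoff teams.
df = pd.DataFrame({
    "x": np.arange(30, dtype=float),
    "Playoff Team": [1] * 16 + [0] * 14,
})

# random_state fixes the shuffle; stratify keeps the class balance
# similar in the training and validation parts.
df_train, df_valid = train_test_split(
    df, test_size=9, random_state=0, stratify=df["Playoff Team"]
)

print(len(df_train), len(df_valid))
print(df_valid["Playoff Team"].value_counts().to_dict())
```

With only 9 validation rows, stratifying at least guarantees that both classes are represented when the model is scored.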

We now fit the logistic regression model on the training set only and calculate its F1 score on both the training and validation sets to assess the model.

In [20]:
# this is the same as before, but instead of all 30 data points,
# the model is fit only on the training rows
predictors = df_training.iloc[:,pred_list]
response = df_training.iloc[:,-1]
model = LogisticRegression()
model.fit(predictors, response)

# making the number of true positive, false positive, and false negative values
# for the training set and then using precision and recall to find the F1 score
predictions = model.predict(df_training.iloc[:,pred_list])
TP = (df_training['Playoff Team'] & predictions).sum()
FP = ( ~df_training['Playoff Team'] & predictions ).sum()
FN = ( df_training['Playoff Team'] & ~predictions ).sum()
precision = TP / ( TP + FP )
recall = TP / ( TP + FN )
F1_train = 2 * precision * recall / ( precision + recall )

# making the number of true positive, false positive, and false negative values
# for the validation set and then using precision and recall to find the F1 score
predictions = model.predict(df_validation.iloc[:,pred_list])
TP = ( df_validation['Playoff Team'] & predictions ).sum()
FP = ( ~df_validation['Playoff Team'] & predictions ).sum()
FN = ( df_validation['Playoff Team'] & ~predictions ).sum()
precision = TP / ( TP + FP )
recall = TP / ( TP + FN )
F1_valid = 2 * precision * recall / ( precision + recall )

# show F1 scores of the training and validation set
F1_train, F1_valid

(0.9090909090909091, 0.9090909090909091)

Rerunning those lines every time the predictors change is tedious, so the following functions wrap the steps: one fits the model and the other returns the F1 score.

In [21]:
# this function creates the model, with the dataframe given as an input
# the function is designed for the inputted dataframe to be all or some subset
# of the merged dataframe above
def fit_model_to ( training ):
    predictors = training.iloc[:,pred_list]
    response = training.iloc[:,-1]
    model = LogisticRegression()
    model.fit( predictors, response )
    return model

# this function takes two inputs: the model used in the function above
# and what subset of the merged dataframe to use
def score_model ( M, validation ):
    predictions = M.predict( validation.iloc[:,pred_list] )
    TP = ( validation['Playoff Team'] & predictions ).sum()
    FP = ( ~validation['Playoff Team'] & predictions ).sum()
    FN = ( validation['Playoff Team'] & ~predictions ).sum()
    precision = TP / ( TP + FP )
    recall = TP / ( TP + FN )
    return 2 * precision * recall / ( precision + recall )

# running the functions and outputting the F1 scores for the 
# training and validation sets
model = fit_model_to( df_training )
score_model( model, df_training ), score_model( model, df_validation )

(0.9090909090909091, 0.9090909090909091)

Now we enhance the fit_model_to function so that it also fits a second model to standardized predictors, sorts that model's coefficients by absolute value in descending order, and prints them. The function also takes the predictor columns as an input so that they can be changed.

In [23]:
def fit_model_to (training, list_pred):
    # choose predictors and fit model as before
    predictors = training.iloc[:,list_pred]
    response = training.iloc[:,-1]
    model = LogisticRegression()
    model.fit(predictors, response)
    # fit another model to standardized predictors
    # this is because the scales of the predictors differ, and this
    # standardization will ensure that all scales are the same for
    # displaying the coefficients
    standardized = (predictors - predictors.mean()) / predictors.std()
    temp_model = LogisticRegression()
    temp_model.fit(standardized, response)
    # get that model's coefficients and display them
    coeffs = pd.Series(temp_model.coef_[0], index=predictors.columns)
    # sort the coefficients by absolute value in descending order
    # (named `order` to avoid shadowing the built-in sorted())
    order = np.abs(coeffs).sort_values(ascending=False)
    coeffs = coeffs.loc[order.index]
    print(coeffs)
    # return the model fit to the actual predictors
    return model

# testing the function
model = fit_model_to(df_training, pred_list)
print(score_model(model, df_training), score_model(model, df_validation))

PTS     1.147503
NRtg    0.791438
TOV%   -0.745769
eFG%    0.732563
DRB%   -0.712137
FTF     0.177888
ORB%   -0.108098
dtype: float64
0.9090909090909091 0.9090909090909091
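
The standardization step can also be bundled with the model so that the same scaling is reused at prediction time. The following is a sketch of that idea using scikit-learn's `make_pipeline` and `StandardScaler` on synthetic data; the column names are invented. Note that `StandardScaler` divides by the population standard deviation, while pandas `.std()` uses the sample standard deviation, so coefficients will differ slightly from those printed by the function above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two predictors on very different scales,
# with the label driven entirely by the first one.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "big_scale": rng.normal(100.0, 10.0, size=60),
    "small_scale": rng.normal(0.5, 0.05, size=60),
})
y = (X["big_scale"] > 100.0).astype(int)

# The scaler is fitted inside the pipeline and reapplied when predicting.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

# Coefficients on the standardized scale, ordered by magnitude with sign kept.
coeffs = pd.Series(pipe.named_steps["logisticregression"].coef_[0], index=X.columns)
coeffs = coeffs.reindex(coeffs.abs().sort_values(ascending=False).index)
print(coeffs)
```

Because the predictors are standardized, the magnitudes are comparable across columns, which is the same motivation given in the comments of the function above.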


The negative coefficient on defensive rebounding percentage suggests that a lower defensive rebounding percentage goes with better odds of being a playoff team, which is counterintuitive. The next model removes defensive rebounding percentage for that reason.

In [None]:
# redefining the list of predictors and taking out defensive rebounding percentage
pred_list = [24, 25, 26, 27, 39, 44]

In [None]:
# fitting the new model without defensive rebounding percentage
model = fit_model_to( df_training, pred_list )
score_model( model, df_training ), score_model( model, df_validation )

PTS     1.376966
TOV%   -0.655121
NRtg    0.590992
eFG%    0.390675
ORB%   -0.315928
FTF     0.153554
dtype: float64


(0.923076923076923, 0.75)

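The training F1 score improves slightly, but the validation F1 score declines significantly. With so few data points (only 9 in the validation set), a single misclassified team moves the score noticeably, and the gap between the training and validation scores suggests there could be some overfitting going on.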

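Let's try a model built around the 5 biggest factors in predicting winning NBA teams according to data scientists (mentioned earlier). As written, the column list below selects points per game, effective field goal percentage, turnover percentage, offensive rebounding percentage, and defensive rebounding percentage; free throw factor (column 27) is not included.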

In [None]:
# new list of columns: PTS, eFG%, TOV%, ORB%, DRB%
pred_list = [24, 25, 26, 44, 47]

In [None]:
model = fit_model_to( df_training, pred_list )
score_model( model, df_training ), score_model( model, df_validation )

PTS     1.327272
eFG%    0.839295
TOV%   -0.802913
DRB%   -0.677885
ORB%   -0.195806
dtype: float64


(0.923076923076923, 0.6)

Once again, the training F1 score is slightly higher than the original model's, and the validation F1 score is even lower.

We will now go back to the original set of predictors.

In [None]:
# the original seven predictor columns
pred_list = [24, 25, 26, 27, 39, 44, 47]

As mentioned, there are only 30 teams, so the dataset is small, and because the training/validation split is random, the scores vary from one run to the next. After rerunning the notebook a few times, the original model with 7 predictor variables was the best at predicting NBA playoff teams.

In [None]:
model = fit_model_to( df_training, pred_list )
score_model( model, df_training ), score_model( model, df_validation )

PTS     1.206891
DRB%   -0.806440
NRtg    0.770574
TOV%   -0.657715
eFG%    0.546608
FTF     0.188462
ORB%   -0.132646
dtype: float64


(0.9600000000000001, 0.888888888888889)
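
Because a single random split of 30 rows gives scores that change from run to run, one way to get a steadier estimate is to average F1 over several stratified folds. The sketch below uses scikit-learn's `StratifiedKFold` and `cross_val_score` on synthetic data with the same shape of problem (30 rows, 16 positives). It is not the NBA data, and the number of folds and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 rows, 3 predictors, 16 positives.
rng = np.random.default_rng(1)
y = np.array([1] * 16 + [0] * 14)
X = rng.normal(size=(30, 3)) + y[:, None]  # positives are shifted up by 1

# Five stratified folds; every row is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(cv_model, X, y, cv=cv, scoring="f1")

print(scores)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much a single validation F1 on this amount of data can be trusted.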

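We will now save the dataframe to a pickle (.pkl) file, read it back, and score the unchanged model on it by calling the function. Since this file contains all 30 teams, including the rows used for training, the score is a check that the round trip works rather than a test on new data.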

In [None]:
import pickle
df_all_team_stats.to_pickle('Combined_Dataframe.pkl')
df_testing = pd.read_pickle('Combined_Dataframe.pkl')
score_model(model, df_testing)

0.9411764705882353
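
Since the scoring function selects predictors by column position, it matters that the pickle round trip preserves column order and dtypes. A minimal sketch of such a check, on a made-up dataframe and a temporary file with a hypothetical name, could look like this:

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the merged table.
df = pd.DataFrame({
    "Team": ["Alpha", "Beta"],
    "PTS": [110.0, 104.0],
    "Playoff Team": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.pkl")  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, index or column order differ.
pd.testing.assert_frame_equal(df, restored)
print(list(restored.columns))
```

If the assertion passes, positional selections such as `iloc[:, pred_list]` refer to the same columns before and after the round trip.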