## Advanced NBA Statistics Capstone
# 5. Training Data and Modeling

The goal of this regression model is to predict Player Impact Estimate (PIE) for the 2018-2019 season using Advanced NBA Stats from the 2017-2018 season. After data cleaning and pre-processing, we have 399 players and 77 columns (the player name, the `PIE_2018` target, and 75 explanatory variables) with which to build our regression model.

We will use the following regression models (linear models first, then tree-based ensembles):
- Statsmodels' OLS

- Scikit-Learn's Ridge, Lasso, and Elastic Net

- Scikit-Learn's RandomForest

- XGBoost's Regression

We will run a basic grid search on each of these models (when applicable) to tune hyperparameters and compare the results. We will then pick one model, weighing RMSE against interpretability, and tune its hyperparameters further to improve accuracy.

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import time

pd.set_option('display.max_columns', None)


In [2]:
df = pd.read_csv('data/pre_processed_data.csv')

In [3]:
print(df.shape)
df.head()

(399, 77)


Unnamed: 0,Player,PIE_2018,PIE_2017,AGE,MIN_2017,GP,W,L,PTS,FGM,3P%,FTM,FT%,TOV,STL,BLK,PF,FP,DD2,TD3,+/-,Height,Weight,Draft_Number,%_Box_Outs_Off,%_Box_Outs_Def,%_Team_RebWhen_Box_Out,%_Player_RebWhen_Box_Out,ContestedREB%,DeferredREB_Chances,AdjustedREB_Chance%,AVG_REBDistance,PassesMade,PassesReceived,SecondaryAST,ASTAdj,AST_ToPass%_Adj,ScreenAssists_PTS,Deflections,%_Loose_BallsRecovered_OFF,%_Loose_BallsRecovered_DEF,ChargesDrawn,Contested2PT_Shots,Contested3PT_Shots,FGM_und_5ft,FG%_und_5ft,FGM_5_9ft,FG%_5_9ft,FGM_10_14ft,FG%_10_14ft,FGM_15_19ft,FG%_15_19ft,OPP_FGM_und_5ft,OPP_FG%_und_5ft,OPP_FGM_5_9ft,OPP_FG%_5_9ft,OPP_FGM_10_14ft,OPP_FG%_10_14ft,OPP_FGM_15_19ft,OPP_FG%_15_19ft,OPP_FGM_20_24ft,OPP_FG%_20_24ft,OPP_FGM_25_29ft,OPP_FG%_25_29ft,DEFRTG,NETRTG,AST%,OREB%,DREB%,eFG%,TS%,USG%,PACE,cluster_five_1,cluster_five_2,cluster_five_3,cluster_five_4
0,Aaron Gordon,10.9,0.584828,-1.143002,1.316754,-0.09764,-0.590935,0.760899,1.269168,1.292411,0.222524,1.09119,-0.475985,0.833149,0.781209,1.171129,0.095138,1.292614,1.504622,-0.285133,-0.501247,0.53467,0.071346,-1.36829,-0.166183,0.438563,-0.267094,0.132715,0.43263,1.021697,0.737197,-0.468996,0.932913,0.500361,-0.066395,0.471447,-0.448659,0.574806,0.268502,0.123948,-0.024,0.876953,0.665584,-0.243491,1.295927,0.606304,-0.170261,-0.99162,0.797554,-0.095648,0.894197,-0.259874,1.88096,0.364954,1.249472,-0.302905,1.296342,0.075982,1.1568,0.295255,1.144227,0.330568,0.807395,-0.308855,0.316152,-0.041749,-0.104794,0.393282,0.917322,-0.205608,-0.241691,1.036662,0.19306,1.631119,-0.050125,-0.953463,-0.14304
1,Abdel Nader,6.6,-2.244676,-0.505529,-1.319105,-0.564023,0.311443,-0.937439,-1.365481,-1.510848,0.340871,-0.959423,-1.255379,-0.640302,-1.054433,-0.559061,-1.374838,-1.488761,-0.96183,-0.285133,-0.759263,-0.39247,0.476743,1.300389,0.771398,-0.583922,0.697652,-0.823635,0.133036,-1.249298,0.871701,-0.028926,-1.461553,-1.245383,-0.75404,-1.116724,0.131965,-0.928949,-0.762464,-0.133073,0.247328,-1.016576,-1.076799,-0.814219,-1.41636,-1.915182,-0.800997,-0.99162,-1.434773,-0.977402,-1.30446,-2.03908,-1.210797,-0.299905,-1.718616,-1.574813,-1.416622,0.015511,-1.213846,0.26251,-1.165006,0.786946,-1.043498,0.719301,0.156599,-0.744308,-0.620049,-0.359935,-0.369932,-1.542321,-1.676184,-0.340679,0.201373,-0.613076,-0.050125,-0.953463,-0.14304
2,Al Horford,13.4,1.046042,1.323023,1.154153,0.62918,1.1469,-0.208727,0.684132,0.809126,0.832006,0.118006,0.243228,0.833149,-0.133483,1.573456,0.095138,1.143551,1.462908,-0.285133,1.600125,0.864997,1.057697,-1.512332,-0.12997,0.406248,0.456757,0.81227,0.463107,0.625699,0.681385,-0.29099,1.3942,0.982041,0.455191,1.311874,0.625138,1.535828,0.139787,0.746304,-0.648719,-1.016576,1.903576,1.486655,0.453549,0.690697,1.589358,0.327111,1.066743,-0.007264,0.639507,0.763151,1.005593,-0.299905,0.420569,-0.704657,0.611723,-0.685157,1.032568,-0.742375,0.875849,0.477033,0.540439,-1.025906,-0.720915,0.563998,1.080923,0.480478,0.729424,0.489319,0.372852,0.096764,-0.693721,1.631119,-0.050125,-0.953463,-0.14304
3,Al-Farouq Aminu,9.7,-0.005714,0.292537,0.954962,0.466548,0.735664,0.073214,0.127425,0.031018,0.439345,-0.337394,-0.149267,0.009075,0.971442,0.784401,0.248244,0.54384,1.266268,-0.285133,1.105702,0.53467,0.071346,-0.928652,0.679795,-0.469119,0.18195,1.303723,0.094287,1.021697,0.882944,-0.513936,-0.016086,-0.315681,-0.066395,-0.202976,-0.539355,1.0997,1.149621,0.091833,0.009493,-1.016576,1.026894,1.00448,-0.220865,-0.194169,-0.170261,-0.698818,-1.434773,-1.167106,-1.30446,-1.317136,0.642683,-1.333526,1.249472,0.490362,1.124858,-0.391235,1.279619,0.493975,0.965808,0.200565,0.451618,0.19702,-0.296635,0.439992,-1.058085,0.422894,1.015865,-0.16416,-0.356793,-0.738953,-0.350094,-0.613076,-0.050125,1.048809,-0.14304
4,Alan Williams,27.4,-0.033953,-0.220725,-0.955068,-1.921817,-1.997404,-1.80809,-1.046233,-1.162613,-2.055496,0.015609,-0.71496,0.399907,0.781209,-0.559061,1.186654,-0.469268,-0.96183,-0.285133,0.136784,0.215134,1.787129,1.381993,0.329032,-0.056203,0.042134,1.10154,-0.045892,-2.43407,-0.904817,-1.406991,-0.585693,-0.592416,-1.683845,0.079267,0.889329,1.203039,-0.279968,-2.656206,3.245527,-1.016576,-1.332711,-1.176313,-0.737035,-0.372948,1.228698,0.127297,-1.434773,-2.182371,-1.30446,-2.03908,-1.210797,-2.81641,-0.899105,0.776197,-1.081783,1.014477,-1.213846,-0.639769,-1.271918,2.664846,-1.303796,-1.53454,-2.235752,0.115982,0.823915,0.236146,1.822009,-1.965761,-1.402112,0.116305,-0.688179,-0.613076,-0.050125,-0.953463,-0.14304


### Train-Test-Split for Statsmodels' OLS Regression Model

In [4]:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# Explanatory Variables are already appropriately scaled in previous notebook.

# Adding the constant (intercept) column that statsmodels' OLS requires, then splitting into training and testing data.

X = df.drop(['Player','PIE_2018'], axis=1)
X = sm.add_constant(X)

y = df.PIE_2018


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

In [5]:
#creating the OLS Regression Model

# Create the model
ols =  sm.OLS(y_train,X_train)
# Fit the model with fit() 
ols = ols.fit()

In [6]:
# evaluating the OLS model 
ols.summary()

0,1,2,3
Dep. Variable:,PIE_2018,R-squared:,0.802
Model:,OLS,Adj. R-squared:,0.711
Method:,Least Squares,F-statistic:,8.816
Date:,"Wed, 11 Nov 2020",Prob (F-statistic):,6.2e-31
Time:,16:44:03,Log-Likelihood:,-464.61
No. Observations:,239,AIC:,1081.0
Df Residuals:,163,BIC:,1345.0
Df Model:,75,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,9.4788,0.156,60.680,0.000,9.170,9.787
PIE_2017,1.3680,0.952,1.438,0.152,-0.511,3.247
AGE,-0.2349,0.191,-1.230,0.221,-0.612,0.142
MIN_2017,0.4362,2.298,0.190,0.850,-4.101,4.973
GP,2.9544,1.387,2.129,0.035,0.215,5.694
W,-2.0234,1.023,-1.977,0.050,-4.044,-0.003
L,-2.7721,0.963,-2.878,0.005,-4.674,-0.870
PTS,-2.4738,4.103,-0.603,0.547,-10.575,5.627
FGM,0.3396,3.860,0.088,0.930,-7.282,7.961

0,1,2,3
Omnibus:,23.406,Durbin-Watson:,1.983
Prob(Omnibus):,0.0,Jarque-Bera (JB):,35.128
Skew:,0.608,Prob(JB):,2.36e-08
Kurtosis:,4.432,Cond. No.,207.0


In [7]:
# making predictions with OLS

y_pred = ols.predict(X_test)

m1_scores = mean_squared_error(y_test, y_pred) ** 0.5, r2_score(y_test, y_pred)

print("RMSE: ", m1_scores[0])
print("R squared: ", m1_scores[1])

RMSE:  3.187569268735532
R squared:  -0.04229700567404726
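
The training R-squared of 0.802 in the summary above against a test R-squared of about -0.04 points to overfitting, which is plausible with 75 predictors and only 239 training rows. A negative test R-squared means the predictions have a larger squared error than simply predicting the mean of `y_test`. The sketch below uses made-up numbers (not the NBA data) and a hypothetical helper, `regression_scores`, to show how both metrics are computed and how R-squared can drop below zero.

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """Return (RMSE, R-squared) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread around the mean
    rmse = np.sqrt(ss_res / y_true.size)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Toy targets (illustrative values only, not taken from the dataset)
y_true = [8.0, 10.0, 12.0, 14.0]

# Predicting the mean everywhere: residual and total sums of squares are equal
mean_rmse, mean_r2 = regression_scores(y_true, [11.0] * 4)

# Predictions that are worse than the mean: R-squared falls below zero
bad_rmse, bad_r2 = regression_scores(y_true, [14.0, 12.0, 10.0, 8.0])

print("Mean predictor:", mean_rmse, mean_r2)
print("Poor predictor:", bad_rmse, bad_r2)
```

The same pair of numbers is what `mean_squared_error(...) ** 0.5` and `r2_score(...)` return in the cells of this notebook.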


### Resetting the Train/Test Split of the Data (the Statsmodels step added a constant column that the scikit-learn models do not need)

In [8]:
X = df.drop(['Player','PIE_2018'], axis=1)
y = df.PIE_2018
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

## Lasso Model
### (Heuristically Tuned with GridSearchCV)

In [9]:
#Lasso with Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso

params={'alpha': [25,10,4,3,2.5,2,1.5,1.0,0.8,0.5,0.4,0.375,0.35,0.325,0.3,0.275,0.25,0.225, 0.2,0.175, 0.15,0.1,0.05,0.02,0.01,0.001,0.0001]}

lasso = Lasso(max_iter=50000)
clf = GridSearchCV(lasso, params, cv=5,verbose = 1, scoring = 'neg_mean_squared_error')
clf.fit(X_train, y_train)

# GridSearchCV.fit() blocks until the search completes, so no pause is needed here.

# cell break
best_lasso_alpha = clf.best_params_['alpha']
print("Best alpha parameter for Lasso Regression: alpha =", best_lasso_alpha, "\n")


print("Below is a dataframe of the best parameters found through the grid search, sorted by lowest mean squared error.")
pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score').head()


Fitting 5 folds for each of 27 candidates, totalling 135 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:    1.8s finished


Best alpha parameter for Lasso Regression: alpha = 0.02 

Below is a dataframe of the best parameters found through the grid search, sorted by lowest mean squared error.


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
23,0.030072,0.024818,0.004024,0.000759,0.02,{'alpha': 0.02},-5.878297,-6.099248,-9.210303,-5.857888,-7.187059,-6.846559,1.278804,1
21,0.005214,0.001151,0.003479,0.000511,0.1,{'alpha': 0.1},-6.975638,-5.725217,-9.271959,-5.609877,-7.233368,-6.963212,1.324205,2
22,0.021275,0.018987,0.004509,0.00093,0.05,{'alpha': 0.05},-7.704586,-5.549478,-9.295212,-5.97019,-6.974126,-7.098718,1.332735,3
24,0.011919,0.002361,0.002375,0.000187,0.01,{'alpha': 0.01},-6.296682,-6.96327,-8.846613,-6.209175,-7.336912,-7.13053,0.954956,4
20,0.004689,0.00071,0.00283,0.000906,0.15,{'alpha': 0.15},-7.378348,-6.197987,-9.862487,-5.640329,-7.348095,-7.285449,1.452079,5


In [10]:
# Scoring Lasso Model

lasso=Lasso(alpha=best_lasso_alpha, max_iter=50000,random_state=33)

lasso.fit(X_train, y_train)
lasso_preds=lasso.predict(X_test)

m2_scores = mean_squared_error(y_test, lasso_preds) ** 0.5, r2_score(y_test, lasso_preds)

print("RMSE: ", m2_scores[0])
print("R Squared: ", m2_scores[1])

RMSE:  2.973214709052998
R Squared:  0.09317230934405285
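
One reason to include Lasso in the comparison is that its L1 penalty can shrink some coefficients to exactly zero, which acts as a form of feature selection. The sketch below illustrates this on synthetic data from `make_regression` (an assumption for illustration; it does not use the NBA data, and the alpha values are arbitrary).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 30 features, only 5 of which actually drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)

kept = {}
for alpha in (0.01, 1.0, 10.0):
    model = Lasso(alpha=alpha, max_iter=50000).fit(X_demo, y_demo)
    # Count the coefficients the L1 penalty has not driven to zero
    kept[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {kept[alpha]} of {X_demo.shape[1]} coefficients are non-zero")
```

Applied to the fitted `lasso` object in this notebook, the same count on `lasso.coef_` would show how many of the 75 explanatory variables survive at the selected alpha.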


## Ridge Model
### (Heuristically Tuned with GridSearchCV)

In [11]:
from sklearn.linear_model import Ridge


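# Note: I manually tested different numbers of cross-validation folds (cv); cv=4 gave the best mean_test_score.

params={'alpha': [1000,500,100,50,25,10,4,3,2.5,2,1.5,1.0,0.8,0.5,0.3,0.2,0.1,0.05,0.02,0.01]}
# normalize=True rescales the already-scaled features again (by their L2 norm). The parameter was
# deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version to run.
ridge = Ridge(normalize=True,random_state=33)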
clf = GridSearchCV(ridge, params, cv=4,verbose = 1, scoring = 'neg_mean_squared_error')
clf.fit(X_train, y_train)

# GridSearchCV.fit() blocks until the search completes, so no pause is needed here.

best_ridge_alpha = clf.best_params_['alpha']
print("Best alpha parameter for Ridge Regression: alpha =", best_ridge_alpha,"\n")

print("Best parameters found through the Grid Search sorted by lowest mean squared error.")
pd.DataFrame(clf.cv_results_).sort_values(by='rank_test_score').head()

Fitting 4 folds for each of 20 candidates, totalling 80 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  80 out of  80 | elapsed:    0.5s finished


Best alpha parameter for Ridge Regression: alpha = 500 

Best parameters found through the Grid Search sorted by lowest mean squared error.


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
1,0.003052,7.5e-05,0.002179,9.3e-05,500,{'alpha': 500},-10.176904,-17.082754,-13.975357,-15.579612,-14.203657,2.571445,1
2,0.00356,0.000826,0.002719,0.000566,100,{'alpha': 100},-13.886731,-16.098721,-12.830194,-14.512325,-14.331993,1.184009,2
0,0.004441,0.001841,0.002784,0.000897,1000,{'alpha': 1000},-10.67818,-17.235759,-14.15308,-15.742828,-14.452462,2.436525,3
3,0.003065,7.9e-05,0.002134,1.3e-05,50,{'alpha': 50},-36.370784,-15.254505,-11.85415,-13.574166,-19.263401,9.949852,4
4,0.003046,0.000119,0.002105,3.4e-05,25,{'alpha': 25},-133.866629,-14.181661,-10.651063,-12.365649,-42.76625,52.611642,5


In [12]:
# Scoring Ridge Model

ridge = Ridge(alpha=best_ridge_alpha, normalize=True,random_state=33)
ridge.fit(X_train, y_train)
ridge_preds = ridge.predict(X_test)

m3_scores = mean_squared_error(y_test, ridge_preds) ** 0.5, r2_score(y_test, ridge_preds)

print("RMSE: ", m3_scores[0])
print("R Squared: ", m3_scores[1])

RMSE:  3.08405627117422
R Squared:  0.024298854538979286
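
The `normalize=True` argument used for Ridge is not available in scikit-learn 1.2 and later. A minimal sketch of one possible replacement, assuming scaling is wanted inside the model, wraps `Ridge` in a `Pipeline` with `StandardScaler`. It runs on synthetic data, and because `StandardScaler` divides by the standard deviation rather than the L2 norm, its alpha values are not comparable with those in the grid above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data
X_demo, y_demo = make_regression(n_samples=300, n_features=20, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)

# Scaling is fitted inside each CV fold, so validation folds never influence it
ridge_pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge())])
search = GridSearchCV(ridge_pipe,
                      {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=4, scoring="neg_mean_squared_error")
search.fit(X_tr, y_tr)

# Pipeline.score delegates to Ridge.score, which returns R-squared
test_r2 = search.best_estimator_.score(X_te, y_te)
print("Best alpha:", search.best_params_["ridge__alpha"])
print("Test R-squared:", test_r2)
```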


## Elastic Net 
### (Heuristically Tuned with GridSearchCV)

In [13]:
from sklearn.linear_model import ElasticNet
# Elastic Net with Grid Search

parametersGrid = {"alpha": np.arange(0.0001, 1.0, 0.05),
                  "l1_ratio": np.arange(0.0001, 1.0, 0.05)}

eNet = ElasticNet(max_iter=100000, tol=0.0001, random_state=33)
grid = GridSearchCV(eNet, parametersGrid, scoring='r2', cv=3)
grid.fit(X_train, y_train)

# GridSearchCV.fit() blocks until the search completes, so no pause is needed here.

best_elastic_params = grid.best_params_
print("Best parameters for Elastic Net:", best_elastic_params, "\n")

print("Best parameters found through the Grid Search sorted by highest mean R-squared.")
pd.DataFrame(grid.cv_results_).sort_values(by='rank_test_score').head(2)

Best parameters for Elastic Net: {'alpha': 0.4001, 'l1_ratio': 0.050100000000000006} 

Best parameters found through the Grid Search sorted by highest mean R-squared.


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
161,0.004761,0.000753,0.00214,1.1e-05,0.4001,0.0501,"{'alpha': 0.4001, 'l1_ratio': 0.05010000000000...",0.481664,0.541175,0.584843,0.535894,0.042288,1
141,0.009722,0.007254,0.002703,0.00042,0.3501,0.0501,"{'alpha': 0.3501, 'l1_ratio': 0.05010000000000...",0.474794,0.543531,0.58836,0.535562,0.046704,2


In [14]:

elastic = ElasticNet(alpha = best_elastic_params['alpha'], l1_ratio=best_elastic_params['l1_ratio'],random_state=33)
elastic.fit(X_train, y_train)
e_preds = elastic.predict(X_test)

m4_scores = mean_squared_error(y_test, e_preds) ** 0.5, r2_score(y_test, e_preds)

print("RMSE: ", m4_scores[0])
print("R Squared: ", m4_scores[1])

RMSE:  2.4611613257054814
R Squared:  0.3786269805531376
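
For reference, scikit-learn documents the objective minimized by `ElasticNet` as shown below, where $n$ is the number of training samples, $w$ the coefficient vector, $\alpha$ the `alpha` argument and $\rho$ the `l1_ratio` argument (the symbols $w$ and $\rho$ are notation chosen here).

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
```

With $\rho = 1$ the penalty reduces to the Lasso penalty, and as $\rho$ approaches 0 it approaches a pure L2 (ridge-style) penalty. The grid search selected an `l1_ratio` of about 0.05, so the chosen model is mostly L2-penalized.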


## Random Forest Regression Model

In [15]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=33)
rf.fit(X_train, y_train)
rf_preds = rf.predict(X_test)

m5_scores = mean_squared_error(y_test, rf_preds) ** 0.5, r2_score(y_test, rf_preds)

print("RMSE: ", m5_scores[0])
print("R Squared: ", m5_scores[1])

RMSE:  2.1901805262352236
R Squared:  0.507924146836235


## XGBoost Regression Model

In [16]:
import xgboost as xgb

xgbr = xgb.XGBRegressor(objective ='reg:squarederror', n_estimators = 10, seed = 33) 
xgbr.fit(X_train, y_train)
xgb_preds = xgbr.predict(X_test) 

m6_scores = mean_squared_error(y_test, xgb_preds) ** 0.5, r2_score(y_test, xgb_preds)

print("RMSE: ", m6_scores[0])
print("R Squared: ", m6_scores[1])

RMSE:  2.408214723307803
R Squared:  0.4050744204942437


## Reviewing All Models and Identifying Best Model to Select for Hyperparameter Tuning

In [17]:
scores_df = pd.DataFrame(data= [m1_scores,m2_scores,m3_scores,m4_scores,m5_scores,m6_scores],
             columns=['RMSE','R-Squared'],
             index=['SmOLS','Lasso','Ridge','Elastic_Net', 'Random_Forest', 'XGBoost'])\
            .sort_values(by='RMSE').round(2)

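# map labels by model name so they stay correct regardless of the sort order
hyperparameter_source = {'SmOLS': 'Default', 'Lasso': 'Grid Search', 'Ridge': 'Grid Search',
                         'Elastic_Net': 'Grid Search', 'Random_Forest': 'Default', 'XGBoost': 'Default'}
scores_df['Hyperparameters'] = scores_df.index.map(hyperparameter_source)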

scores_df

Unnamed: 0,RMSE,R-Squared,Hyperparameters
Random_Forest,2.19,0.51,Default
XGBoost,2.41,0.41,Default
Elastic_Net,2.46,0.38,Grid Search
Lasso,2.97,0.09,Grid Search
Ridge,3.08,0.02,Grid Search
SmOLS,3.19,-0.04,Default


## Preliminary Analysis: 
### Random Forest (RMSE 2.19) and XGBoost (RMSE 2.41) score best, but tree ensembles do not produce coefficients that can be read directly. I will therefore move forward with Elastic Net (RMSE 2.46), the strongest of the linear models, which combines the L1 and L2 penalties of the lasso and ridge methods.

## Hyperparameter Tuning Elastic Net with a Randomized Grid Search

In [18]:
from sklearn.model_selection import RandomizedSearchCV

import scipy.stats as stats

elastic = ElasticNet(random_state=33, max_iter=10000)

# specify parameters and distributions to sample from
# note: stats.uniform(loc, scale) samples from [loc, loc + scale], i.e. [0.0001, 1.0001] here
param_dist = {'l1_ratio': stats.uniform(0.0001, 1),
              'alpha': stats.uniform(0.0001, 1)}

# run randomized search (default 5-fold CV scored by R-squared;
# no random_state is set, so the sampled candidates differ between runs)
n_iter_search = 20

random_search = RandomizedSearchCV(elastic, param_distributions=param_dist,
                                   n_iter = n_iter_search)

random_search.fit(X_train, y_train)
pass

In [19]:
best_elastic_random_params = random_search.best_params_
print("Best parameters for Elastic Net:", best_elastic_random_params)

pd.DataFrame(random_search.cv_results_).sort_values(by='rank_test_score').head(2)

Best parameters for Elastic Net: {'alpha': 0.308531696423748, 'l1_ratio': 0.22582013591303407}


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,param_l1_ratio,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
8,0.005024,0.000805,0.002383,3.7e-05,0.308532,0.22582,"{'alpha': 0.308531696423748, 'l1_ratio': 0.225...",0.407044,0.516866,0.431871,0.596397,0.544769,0.499389,0.070515,1
7,0.005765,0.000635,0.003548,0.001171,0.24024,0.117708,"{'alpha': 0.24023955398807462, 'l1_ratio': 0.1...",0.381232,0.504492,0.436254,0.587894,0.567547,0.495484,0.077943,2
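
Two optional refinements to the randomized search are sketched here on synthetic data rather than the NBA data: passing `random_state` makes the sampled candidates reproducible, and sampling `alpha` from `scipy.stats.loguniform` spreads trials evenly across orders of magnitude. The ranges, the number of iterations, and the helper name `run_search` are illustrative assumptions, not the settings used in this notebook.

```python
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's training data
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

param_dist = {
    "alpha": stats.loguniform(1e-4, 1.0),     # log-uniform over [0.0001, 1]
    "l1_ratio": stats.uniform(0.01, 0.99),    # uniform over [0.01, 1.0]
}

def run_search(seed):
    """Fit a randomized search whose sampled candidates are fixed by `seed`."""
    search = RandomizedSearchCV(ElasticNet(max_iter=10000),
                                param_distributions=param_dist,
                                n_iter=10, cv=5,
                                scoring="neg_mean_squared_error",
                                random_state=seed)
    return search.fit(X_demo, y_demo)

first = run_search(seed=33)
second = run_search(seed=33)

print("Best parameters:", first.best_params_)
print("Same result on rerun:", first.best_params_ == second.best_params_)
```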


### Scoring Elastic Net with Random Search Parameters

In [20]:
random_elastic = ElasticNet(alpha = best_elastic_random_params['alpha'], l1_ratio=best_elastic_random_params['l1_ratio'],random_state=33)
random_elastic.fit(X_train, y_train)
e_preds = random_elastic.predict(X_test)

random_elastic_grid_scores = mean_squared_error(y_test, e_preds) ** 0.5, r2_score(y_test, e_preds)

print("RMSE: ", random_elastic_grid_scores[0])
print("R Squared: ", random_elastic_grid_scores[1])


RMSE:  2.4332727786724844
R Squared:  0.3926293200152323


### Elastic Net with Bayesian Optimization

In [21]:
from bayes_opt import BayesianOptimization

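# Defining scoring function to pass into Bayesian Optimizer
# Caveats: the target returned below is the MSE on the held-out test set, and
# BayesianOptimization.maximize() looks for the largest target, so the search is
# steered toward high-error settings. The best parameters are therefore picked
# afterwards by sorting the evaluated points by target in ascending order.

def elastic_func(**params):
        
    random_elastic = ElasticNet(alpha = params['alpha'], l1_ratio = params['l1_ratio'], random_state=33)
    
    random_elastic.fit(X_train, y_train)
    e_preds = random_elastic.predict(X_test)

    random_elastic_grid_score = mean_squared_error(y_test, e_preds) 
    # Note: this prints the square root of the RMSE stored in the global tuple
    # `random_elastic_grid_scores` from cell 20, not this iteration's score,
    # which is why the same value is printed on every call.
    print("RMSE: ", random_elastic_grid_scores[0] ** 0.5)
    
    return random_elastic_grid_score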

In [22]:
# Defining ranges of hyperparameters for Bayesian Optimization

params = {'l1_ratio': (0.0001, 1),
              'alpha': (0.0001, 1)}

bo = BayesianOptimization(elastic_func, params, random_state=33)
bo.maximize(init_points=5, n_iter=10, acq='ucb', kappa=2)

|   iter    |  target   |   alpha   | l1_ratio  |
-------------------------------------------------
RMSE:  1.5598951178436595
|  1        |  5.89     |  0.2486   |  0.45     |
RMSE:  1.5598951178436595
|  2        |  5.534    |  0.411    |  0.2604   |
RMSE:  1.5598951178436595
|  3        |  5.161    |  0.8704   |  0.1851   |
RMSE:  1.5598951178436595
|  4        |  8.899    |  0.01976  |  0.9533   |
RMSE:  1.5598951178436595
|  5        |  5.077    |  0.6805   |  0.4866   |
RMSE:  1.5598951178436595
|  6        |  5.633    |  0.2221   |  1.0      |
RMSE:  1.5598951178436595
|  7        |  8.46     |  0.04892  |  0.207    |
RMSE:  1.5598951178436595
|  8        |  5.331    |  0.5231   |  0.2522   |
RMSE:  1.5598951178436595
|  9        |  ...
RMSE:  1.5598951178436595
|  10       |  10.19    |  0.0001   |  0.6883   |
RMSE:  1.5598951178436595
|  11       |  10.2     |  0.0001   |  0.5315   |
RMSE:  1.5598951178436595
|  12       |  9.93     |  0.002768 |  0.5999   |
RMSE:  1.5598951178436595
|  13       |  6.424    |  0.1521   |  0.7644   |
RMSE:  1.5598951178436595
|  14       |  10.2     |  0.0001   |  0.3915   |
RMSE:  1.5598951178436595
|  15       |  6.12     |  1.0      |  1.0      |


In [23]:
pd.set_option('display.max_colwidth', None)


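# bo.res holds one {'target', 'params'} record per evaluated point; sorting ascending
# puts the lowest test-set MSE first (the optimizer itself maximized the target).
elastic_bayes_df = pd.DataFrame(bo.res).sort_values(by='target')


best_elastic_bayes_params = elastic_bayes_df.iloc[0]['params']
print("Best parameters for Elastic Net with Bayesian Optimization: \n",  best_elastic_bayes_params)

elastic_bayes_df.head()

Best parameters for Elastic Net with Bayesian Optimization: 
 {'alpha': 0.6804827596505661, 'l1_ratio': 0.4866394677378728}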
 {'alpha': 0.6804827596505661, 'l1_ratio': 0.4866394677378728}


Unnamed: 0,target,params
4,5.076833,"{'alpha': 0.6804827596505661, 'l1_ratio': 0.4866394677378728}"
2,5.160852,"{'alpha': 0.8704086487781147, 'l1_ratio': 0.18512142316918878}"
7,5.330528,"{'alpha': 0.5230717290412917, 'l1_ratio': 0.2522328846061174}"
1,5.534247,"{'alpha': 0.4109997089162411, 'l1_ratio': 0.26037366091781117}"
5,5.63336,"{'alpha': 0.2220991983298697, 'l1_ratio': 1.0}"
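
The objective passed to the optimizer returns the test-set MSE, while the optimizer maximizes its target. As a hedged alternative, the sketch below scores each candidate by cross-validated negative MSE on training data only. Returning a negated error suits an optimizer that maximizes, and it keeps the test set out of the tuning loop. The function name `elastic_cv_objective` and the synthetic data are assumptions for illustration; in this notebook the function would use `X_train` and `y_train` and could be passed to `BayesianOptimization` in place of `elastic_func`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's training data
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

def elastic_cv_objective(alpha, l1_ratio):
    """Mean negative MSE over 5 folds; larger (closer to 0) is better."""
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000, random_state=33)
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_mean_squared_error")
    return scores.mean()

# A light penalty should score higher than one that shrinks everything toward zero
light = elastic_cv_objective(alpha=0.01, l1_ratio=0.5)
heavy = elastic_cv_objective(alpha=100.0, l1_ratio=0.5)

print("alpha=0.01:", light)
print("alpha=100: ", heavy)
```

With an objective of this shape, the optimizer's own maximum is the best candidate, so no re-sorting of the evaluated points is needed.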


## Scoring Elastic Net Tuned With Bayesian Optimization

In [24]:
bayes = ElasticNet(alpha = best_elastic_bayes_params['alpha'], l1_ratio=best_elastic_bayes_params['l1_ratio'], random_state=33)
bayes.fit(X_train, y_train)
y_preds = bayes.predict(X_test)

bayes_elastic_scores = mean_squared_error(y_test, y_preds) ** 0.5, r2_score(y_test, y_preds)

print("RMSE: ", bayes_elastic_scores[0])
print("R Squared: ", bayes_elastic_scores[1])


RMSE:  2.2531827498272774
R Squared:  0.4792070900626817


## Comparing Results of Elastic Net Models with Hypertuned Parameters

In [25]:
elastic_net_params_df = pd.DataFrame(data=[m4_scores, random_elastic_grid_scores, bayes_elastic_scores],
                                    columns = ['RMSE','R-Squared'],
                                    index = ['Standard Grid Search', 'Random Grid Search', 'Bayesian Optimization'])\
                                    .sort_values(by='RMSE').round(2)

elastic_net_params_df

Unnamed: 0,RMSE,R-Squared
Bayesian Optimization,2.25,0.48
Random Grid Search,2.43,0.39
Standard Grid Search,2.46,0.38


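### Conclusion: Among the Elastic Net variants, the parameters taken from the Bayesian Optimization run gave the lowest RMSE (2.25) and the highest R-Squared (0.48). Because that search scored its candidates on the test set, this figure is likely optimistic compared with the grid and randomized searches, which were tuned by cross-validation on the training set. In the next and final notebook I will review this model and analyze the results and model predictions.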
____

#### Exporting a few dataframes to review in final notebook

In [26]:
# concatenating a dataframe that contains all model results

# reformatting this dataframe to match the other dataframe of model scores
elastic_scores_df = elastic_net_params_df.copy()
# label each row from its own index so the labels follow the rows after sorting
elastic_scores_df['Hyperparameters'] = [name.replace('Standard ', '') for name in elastic_scores_df.index]
elastic_scores_df.index = ['Elastic_Net'] * len(elastic_scores_df)

# the standard grid search row is already in scores_df, so the duplicate is dropped
full_model_results_df = pd.concat([elastic_scores_df, scores_df]).drop_duplicates().sort_values(by='RMSE')

# Exporting dataframe
full_model_results_df.to_csv('data/model_scores_df.csv')

full_model_results_df

Unnamed: 0,RMSE,R-Squared,Hyperparameters
Random_Forest,2.19,0.51,Default
Elastic_Net,2.25,0.48,Bayesian Optimization
XGBoost,2.41,0.41,Default
Elastic_Net,2.43,0.39,Random Grid Search
Elastic_Net,2.46,0.38,Grid Search
Lasso,2.97,0.09,Grid Search
Ridge,3.08,0.02,Grid Search
SmOLS,3.19,-0.04,Default


### Exporting dataframe for Model Analysis in next notebook 

In [27]:
# Creating a dataframe to export for model analysis in the next notebook


# Importing unscaled dataframe and extracting just the Player names and unscaled PIE_2017

unscaled = pd.read_csv('data/clean_nba_stats_data.csv')[['Player', 'PIE_2017','MIN_2017', 'AGE']]
 
# filtering with indexes of test dataframe
test_df = df.loc[y_test.index]

# concatenating model predictions, error, and unscaled PIE_2017
test_df.insert(1,'predictions', y_preds)
test_df.insert(1,'pred_error', abs(test_df.predictions- test_df.PIE_2018))

test_df.drop(columns=['PIE_2017','MIN_2017', 'AGE'], inplace=True)
test_df = test_df.merge(unscaled,on=['Player'], how='left')

test_df= test_df[['Player','PIE_2017', 'PIE_2018', 'predictions', 'pred_error', 'AGE', 'MIN_2017']]
test_df['true_change_in_PIE'] = test_df.PIE_2018- test_df.PIE_2017

# Extracting just the Team_2018 column to merge with test_df
team_2018 = pd.read_csv('data/unscaled_dataframe_for_model_analysis.csv')[['Player', 'TEAM','Team_2018']]
test_df = test_df.merge(team_2018, on=['Player'], how='left')

# Exporting dataframe
test_df.to_csv('data/best_model_results_analysis.csv',index = False)

test_df.head()

Unnamed: 0,Player,PIE_2017,PIE_2018,predictions,pred_error,AGE,MIN_2017,true_change_in_PIE,TEAM,Team_2018
0,Jonathan Isaac,7.1,8.2,8.428404,0.228404,20,19.8,1.1,ORL,ORL
1,Taurean Prince,9.0,7.5,9.635847,2.135847,24,30.0,-1.5,ATL,ATL
2,Bojan Bogdanovic,9.1,11.1,8.830515,2.269485,29,30.8,2.0,IND,IND
3,Julius Randle,13.5,13.7,12.740179,0.959821,23,26.7,0.2,LAL,NOP
4,Devin Harris,7.6,6.5,7.797586,1.297586,35,18.9,-1.1,DEN,DAL
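
The export cell joins three files on `Player`, which silently duplicates rows if a name appears more than once on either side. A minimal sketch of one way to guard against that, using pandas' `validate` argument on toy frames (the player names and values below are hypothetical placeholders, not taken from the dataset):

```python
import pandas as pd

# Toy frames standing in for the predictions and the team lookup
preds = pd.DataFrame({"Player": ["Player A", "Player B"], "predictions": [8.4, 9.6]})
teams = pd.DataFrame({"Player": ["Player A", "Player B"], "Team_2018": ["ORL", "ATL"]})

# validate="one_to_one" raises MergeError if either side repeats a Player
merged = preds.merge(teams, on="Player", how="left", validate="one_to_one")

# Simulate a lookup table that lists one player twice
dup_teams = pd.concat([teams, teams.iloc[[0]]], ignore_index=True)
try:
    preds.merge(dup_teams, on="Player", how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged)
print("Duplicate key detected:", duplicate_detected)
```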
