# Model Tuning
-----

## Objective
This notebook aims to fit the most accurate model possible for predicting Strokes Gained. Several types of models are trained and tested, and the one that explains the most variation in the data is used to further analyze the relationship between Club Head Speed and player efficiency.

-----
#### External Libraries Import

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from pactools.grid_search import GridSearchCVProgressBar
import pickle
import warnings
warnings.filterwarnings('ignore')

#### Read Model 1 Dataframe

In [2]:
df_pga = pd.read_csv('../Data/Sets/model_one.csv')

### Prepare for Modeling

#### Feature Reduction
Using the methods from Notebook 3, related features are combined to reduce the feature count and improve the interpretability of the model.

In [3]:
# using the feature reductions from notebook 03

# combine all approaches between 100 and 200 yards
df_pga['approaches_from_100-200_yards'] = df_pga['approaches_from_100-125_yards'] + \
df_pga['approaches_from_125-150_yards'] + df_pga['approaches_from_150-175_yards'] + \
df_pga['approaches_from_175-200_yards']

# combine all scrambling percentages between 10 and 30 yards
df_pga['scrambling_from_10-30_yards'] = df_pga['scrambling_from_10-20_yards'] + \
                                        df_pga['scrambling_from_20-30_yards']

# combine all putts between 10 and 25 feet
df_pga['putting_from_-_10-25\''] = df_pga['putting_from_-_10-15\''] + \
df_pga['putting_from_-_15-20\''] + df_pga['putting_from_-_20-25\'']

# drop columns
df_pga.drop(columns = ['approaches_from_100-125_yards', 'approaches_from_125-150_yards',
                       'approaches_from_150-175_yards', 'approaches_from_175-200_yards',
                       'scrambling_from_10-20_yards', 'scrambling_from_20-30_yards',
                       'putting_from_-_10-15\'', 'putting_from_-_15-20\'', 
                       'putting_from_-_20-25\''], inplace = True)

#### Save Final Dataframe

In [4]:
df_pga.to_csv('../Data/Sets/final_model.csv', index = False)

#### Train, Test Split and Scale

In [5]:
# create list of features, remove irrelevant columns and strokes gained columns
features = [
    col for col in df_pga.columns if col not in ['date', 'finish', 
                                                  'player', 'event', 
                                                  'sg:_off-the-tee',
                                                  'sg:_approach-the-green',
                                                  'sg:_around-the-green',
                                                  'sg:_putting',
                                                  'sg:_total']]

X = df_pga[features]
y = df_pga['sg:_total']

# train, test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size = 0.3, random_state = 77)

# standardize the data using StandardScaler
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

## Modeling

#### Linear Models

In [6]:
# instantiate four linear models
lr = LinearRegression().fit(X_train_sc, y_train)
lasso = LassoCV().fit(X_train_sc, y_train)
ridge = RidgeCV().fit(X_train_sc, y_train)
en = ElasticNetCV(l1_ratio = [.1 , .5 , .7 , .9 , .95 , .99 , 1]).fit(X_train_sc, y_train)

# use a shuffled KFold with 10 splits so the cross-validation scores
# do not depend on row order; the 'Train Score' printed below is the
# mean cross-validated R-squared on the training set
kf = KFold(n_splits = 10 , shuffle = True , random_state = 77)

lr_cv = cross_val_score(lr, X_train_sc, y_train, cv=kf).mean()
lasso_cv = cross_val_score(lasso, X_train_sc, y_train, cv=kf).mean()
ridge_cv = cross_val_score(ridge, X_train_sc, y_train, cv=kf).mean()
en_cv = cross_val_score(en, X_train_sc, y_train, cv=kf).mean()

print('Linear Regression:')
print(f'Train Score: {lr_cv}.')
print(f'Test Score: {lr.score(X_test_sc, y_test)}')
print('-----------')
print('Lasso Regression:')
print(f'Train Score: {lasso_cv}.')
print(f'Test Score: {lasso.score(X_test_sc, y_test)}')
print('-----------')
print('Ridge Regression:')
print(f'Train Score: {ridge_cv}.')
print(f'Test Score: {ridge.score(X_test_sc, y_test)}')
print('-----------')
print('Elastic Net Regression:')
print(f'Train Score: {en_cv}.')
print(f'Test Score: {en.score(X_test_sc, y_test)}')

Linear Regression:
Train Score: 0.5903701376204411.
Test Score: 0.6024533231385061
-----------
Lasso Regression:
Train Score: 0.5913890667229031.
Test Score: 0.6031792503229226
-----------
Ridge Regression:
Train Score: 0.5908146545439048.
Test Score: 0.6020097899508954
-----------
Elastic Net Regression:
Train Score: 0.5912889550238862.
Test Score: 0.6026829288665122


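- The LASSO model scored highest on both cross-validation and the test set, likely because its L1 penalty can zero out some of the many features, but its margin over the other linear models is only about 0.001 in R-squared.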
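
The cross-validation scores above are computed on data that was already scaled using every training row, so each validation fold has influenced the scaler. A minimal sketch of one way to keep the scaling inside each fold, using scikit-learn's `make_pipeline`. The data is synthetic (`make_regression`) and only stands in for `X_train` / `y_train`, and the `_demo` names are illustrative, so the printed score is not comparable to the results in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for X_train / y_train (assumption, not the PGA data)
X_demo, y_demo = make_regression(n_samples=300, n_features=20,
                                 n_informative=5, noise=10.0,
                                 random_state=77)

# the scaler is re-fit on the training part of every fold,
# so the validation part never influences the scaling
pipe = make_pipeline(StandardScaler(), LassoCV())

kf_demo = KFold(n_splits=10, shuffle=True, random_state=77)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=kf_demo)
print(f'Mean CV R-squared: {cv_scores.mean():.4f}')
```

With the real data, `pipe` would be passed the unscaled `X_train` in place of `X_demo`.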

#### Random Forest Regressor

In [7]:
rf = RandomForestRegressor()
rf_params =  {
        'n_estimators' : [20, 60, 100],
        'max_depth' : [None, 2, 6, 10],
        'min_samples_split' : [2, 3, 4] 
}

gs_rf = GridSearchCVProgressBar(rf, param_grid = rf_params)
gs_rf.fit(X_train_sc, y_train)
print(f'Random Forest:')
print(f'Train Score: {gs_rf.best_score_}')
print(f'Test Score: {gs_rf.score(X_test_sc, y_test)}')

Random Forest:
Train Score: 0.544898603097128
Test Score: 0.5508078968691739
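
`GridSearchCVProgressBar` comes from `pactools` and appears to behave like scikit-learn's `GridSearchCV` with a progress bar added (an assumption worth checking against the pactools documentation). If `pactools` is not available, a sketch along these lines should run the same kind of search, with `verbose` printing progress messages instead. The data is synthetic, the grid is deliberately smaller than the one above, and the `_demo` names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for X_train_sc / y_train (assumption, not the PGA data)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=77)

# fixing random_state makes the search reproducible
rf_demo = RandomForestRegressor(random_state=77)
rf_demo_params = {
    'n_estimators': [20, 60],
    'max_depth': [None, 6],
}

# verbose=1 prints how many fits will be run
gs_demo = GridSearchCV(rf_demo, param_grid=rf_demo_params, cv=5, verbose=1)
gs_demo.fit(X_demo, y_demo)
print(f'Best params: {gs_demo.best_params_}')
print(f'Best CV R-squared: {gs_demo.best_score_}')
```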


#### Extra-Trees Regressor

In [8]:
et = ExtraTreesRegressor()
et_params = {
        'n_estimators' : [20, 60, 100],
        'max_depth' : [None, 6, 10, 14],
        'min_samples_leaf' : [1, 2], 
        'min_samples_split' : [2, 3], 
}
gs_et = GridSearchCVProgressBar(et, param_grid = et_params)
gs_et.fit(X_train_sc, y_train)
print(f'Extra-Trees:')
print(f'Train Score: {gs_et.best_score_}')
print(f'Test Score: {gs_et.score(X_test_sc, y_test)}')

Extra-Trees:
Train Score: 0.5617186270376882
Test Score: 0.5832676785033042


#### Save Best Model

In [9]:
with open('../Best_Models/final_model.pk', 'wb') as f:
    pickle.dump(lasso, f)

#### Insight: 
- All of the linear models explain just over 60% of the variation in Strokes Gained on the test set, relative to always predicting the mean. LASSO, likely because of its built-in feature selection, explains about 0.1 percentage points more than the other models.
- Of the two tree-based ensemble regressors, Extra-Trees performed better, explaining 58% of Strokes Gained variation (2 percentage points less than LASSO).
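
The scores reported throughout are R-squared values, which compare a model's squared error with the squared error of always predicting the mean of `y`: `R^2 = 1 - SS_res / SS_tot`. A small worked sketch with made-up numbers (not taken from the PGA data), checked against scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up targets and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.5, 4.5])

# squared error of the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# squared error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(f'Manual R-squared:  {r2_manual}')
print(f'sklearn r2_score:  {r2_score(y_true, y_pred)}')
```

An R-squared of 0.60 therefore means the model's squared error is 40% of what the mean-only baseline would produce.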

## Model Coefficients and Important Features

### Lasso

In [10]:
# create a dataframe of coefficients to facilitate interpretation
best_model = pd.DataFrame(lasso.coef_, columns=['coefs'])
best_model['abs_coefs'] = abs(lasso.coef_)
best_model.index = X_train.columns
best_model = best_model.sort_values('abs_coefs', ascending=False).head(10)

# create column of standard deviations
lasso_std = []
for col in best_model.index:
    lasso_std.append(df_pga[col].std())
best_model['std._dev'] = lasso_std
best_model

| feature | coefs | abs_coefs | std._dev |
| --- | --- | --- | --- |
| greens_in_regulation_percentage | 0.647663 | 0.647663 | 7.533094 |
| scrambling | 0.463073 | 0.463073 | 10.716159 |
| overall_putting_average | -0.323547 | 0.323547 | 0.074604 |
| putting_average | -0.286532 | 0.286532 | 0.078363 |
| one-putt_percentage | -0.260034 | 0.260034 | 6.335913 |
| putts_per_round | -0.226114 | 0.226114 | 1.342851 |
| birdie_or_better_conversion_percentage | 0.131641 | 0.131641 | 6.984383 |
| fairway_proximity | 0.122027 | 0.122027 | 52.124938 |
| going_for_the_green_-_hit_green_pct. | 0.109558 | 0.109558 | 18.173266 |
| club_head_speed | 0.095398 | 0.095398 | 4.235433 |


#### Takeaways:
- As expected, GIR percentage has the largest coefficient in absolute value, and it is positive. Because the features were standardized, the coefficient implies that a one standard deviation increase (about 7.53 percentage points) in a player's GIR percentage is associated with an increase of about 0.648 in total strokes gained, holding the other features fixed.
- Overall putting average has the largest negative coefficient. A one standard deviation increase (about 0.0746 putts) in a player's overall putting average is associated with a decrease of about 0.324 in total strokes gained.
- Club Head Speed has the 10th largest coefficient. A one standard deviation increase (about 4.235 mph) in a player's club head speed is associated with an increase of about 0.095 in total strokes gained.
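
Because the features were standardized before fitting, each coefficient is the change in Strokes Gained per one standard deviation of the feature. Dividing a coefficient by that standard deviation gives an approximate per-unit effect. The sketch below uses three rows copied from the table above; `per_unit_effect` is an illustrative column name, and the result is approximate because `std._dev` was computed on the full data set rather than on the training split the scaler saw.

```python
import pandas as pd

# coefficients and standard deviations copied from the table above
coef_table = pd.DataFrame(
    {
        'coefs': [0.647663, -0.323547, 0.095398],
        'std._dev': [7.533094, 0.074604, 4.235433],
    },
    index=['greens_in_regulation_percentage',
           'overall_putting_average',
           'club_head_speed'],
)

# change in strokes gained per one original unit of each feature
coef_table['per_unit_effect'] = coef_table['coefs'] / coef_table['std._dev']
print(coef_table)
```

On these numbers, one additional mph of club head speed corresponds to roughly 0.02 Strokes Gained, holding the other features fixed.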

### Extra-Trees

In [11]:
best_trees = pd.DataFrame(gs_et.best_estimator_.feature_importances_,
                          columns=['Feature Importance'])
best_trees.index = X_train.columns
best_trees.sort_values('Feature Importance', ascending=False).head(10)

| feature | Feature Importance |
| --- | --- |
| greens_in_regulation_percentage | 0.14804 |
| scrambling | 0.141461 |
| putting_average | 0.116505 |
| birdie_or_better_conversion_percentage | 0.077363 |
| overall_putting_average | 0.037523 |
| putts_per_round | 0.036976 |
| going_for_the_green_-_birdie_or_better | 0.019669 |
| putting_-_inside_10' | 0.018103 |
| scrambling_from_10-30_yards | 0.015354 |
| one-putt_percentage | 0.015186 |


#### Takeaways:
- Comparing the Extra-Trees regressor with the LASSO model, there are many similarities between predictors. GIR percentage and scrambling percentage are the strongest predictors for both models.
- Feature importance values have no sign or unit, so they can only be interpreted through their relative magnitude and ranking.
- Club Head Speed does not appear as a top 10 predictor.
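
The impurity-based importances that scikit-learn tree ensembles expose through `feature_importances_` are non-negative and normalized to sum to 1, which is why they support ranking but not a signed, per-unit reading. A minimal sketch on synthetic data (not the PGA data, with illustrative `_demo` names) showing both properties:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# synthetic stand-in data (assumption, not the PGA data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 n_informative=3, noise=5.0,
                                 random_state=77)

et_demo = ExtraTreesRegressor(n_estimators=50, random_state=77)
et_demo.fit(X_demo, y_demo)

importances = et_demo.feature_importances_
print(importances.round(3))
print(f'Sum of importances: {importances.sum():.6f}')
print(f'All non-negative: {(importances >= 0).all()}')
```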

### Insight:
- The features appearing as strong predictors are the ones that would be expected to influence Total Strokes Gained the most. Hitting more greens in regulation and scrambling successfully both translate directly into fewer strokes over a tournament.
- Because of the LASSO model's explanatory power and feature interpretability, a LASSO model will be used to explore Club Head Speed. 

## Club Head Speed
The data is split at the mean Club Head Speed and a separate model is fit to each half. Comparing the R-squared values and coefficients of the two models shows how the drivers of Strokes Gained differ between slower and faster swingers.

#### Split Data in Two Based on Club Head Speed

In [12]:
# check average Club Head Speed
print(f"The average club head speed in the data set is \
{round(df_pga['club_head_speed'].mean(), 2)} mph")

The average club head speed in the data set is 115.51 mph


In [13]:
# dataframe of events with below average club head speed
slow_data = df_pga[df_pga['club_head_speed'] < df_pga['club_head_speed'].mean()]
# save
slow_data.to_csv('../Data/Sets/slow_data.csv', index = False)

# dataframe of events with above average club head speed
fast_data = df_pga[df_pga['club_head_speed'] > df_pga['club_head_speed'].mean()]
# save
fast_data.to_csv('../Data/Sets/fast_data.csv', index = False)

#### Comparing Total Strokes Gained 

In [14]:
print(f"The average total strokes gained for golfers swinging \
below the average Club Head Speed is \
{round(slow_data['sg:_total'].mean(), 4)}.")

The average total strokes gained for golfers swinging below the average Club Head Speed is 0.9524.


In [15]:
print(f"The average total strokes gained for golfers swinging \
above the average Club Head Speed is \
{round(fast_data['sg:_total'].mean(), 4)}.")

The average total strokes gained for golfers swinging above the average Club Head Speed is 1.0891.


### Preprocess
Prepare data using same methods as before.

#### Slow Swing Speed

In [16]:
# choose X and y variables
X_slow = slow_data[features]
y_slow = slow_data['sg:_total']

# train test split
X_slowtrain, X_slowtest, y_slowtrain, y_slowtest = train_test_split(X_slow, y_slow, 
                                                        test_size = 0.3, random_state = 77)

# scale data
ss_slow = StandardScaler()
X_slowtrain_sc = ss_slow.fit_transform(X_slowtrain)
X_slowtest_sc = ss_slow.transform(X_slowtest)

#### Fast Swing Speed

In [17]:
# same process
X_fast = fast_data[features]
y_fast = fast_data['sg:_total']

X_fasttrain, X_fasttest, y_fasttrain, y_fasttest = train_test_split(X_fast, y_fast, 
                                                        test_size = 0.3, random_state = 77)

ss_fast = StandardScaler()
X_fasttrain_sc = ss_fast.fit_transform(X_fasttrain)
X_fasttest_sc = ss_fast.transform(X_fasttest)

### Fit a LASSO Regression to Both

In [18]:
slow_lasso = LassoCV()
slow_model = slow_lasso.fit(X_slowtrain_sc, y_slowtrain)
print(f'Slow model produces an R-squared of {slow_model.score(X_slowtest_sc, y_slowtest)}')

# save model to pull insight
with open('../Best_Models/slow_model.pk', 'wb') as f:
    pickle.dump(slow_model, f)

Slow model produces an R-squared of 0.5817966093811695


In [19]:
fast_lasso = LassoCV()
fast_model = fast_lasso.fit(X_fasttrain_sc, y_fasttrain)
print(f'Fast model produces an R-squared of {fast_model.score(X_fasttest_sc, y_fasttest)}')

# save model to pull insight
with open('../Best_Models/fast_model.pk', 'wb') as f:
    pickle.dump(fast_model, f)

Fast model produces an R-squared of 0.6340229852563167


### Takeaways:
- The fast model explains about 5 percentage points more variation than the slow model (R-squared of 0.634 versus 0.582). This suggests that the performance of golfers with slower swing speeds is somewhat harder to explain with these features.
- Relative to the model fit on all golfers (test R-squared of 0.603), restricting the data to golfers with above-average Club Head Speed raised the explanatory power by about 3 percentage points.

### Insight: 
- The 5 percentage point gap in explanatory power between the two models suggests that the success of golfers with slow swing speeds depends more on factors these features do not capture.
- Exploring the coefficients of the two models offers more direct insight into what separates golfers with slow swing speeds from those with fast swing speeds.
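
One way to set up that comparison is to place the two models' coefficients side by side and sort by the gap between them. The sketch below uses a hypothetical helper (`compare_coefs` is not part of this notebook) and demonstrates it on synthetic data split arbitrarily in half. With the real models, `slow_model`, `fast_model`, and `features` would be passed instead.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def compare_coefs(model_a, model_b, feature_names, labels=('slow', 'fast')):
    """Return both models' coefficients side by side, largest gap first."""
    table = pd.DataFrame(
        {labels[0]: model_a.coef_, labels[1]: model_b.coef_},
        index=feature_names,
    )
    table['difference'] = table[labels[1]] - table[labels[0]]
    # order rows by the absolute size of the gap between the two models
    order = table['difference'].abs().sort_values(ascending=False).index
    return table.loc[order]


# synthetic stand-in for the slow / fast subsets (assumption, not the PGA data)
X_demo, y_demo = make_regression(n_samples=400, n_features=6,
                                 noise=5.0, random_state=77)
names = [f'feature_{i}' for i in range(X_demo.shape[1])]
half = len(X_demo) // 2

# each half gets its own scaler and its own LassoCV, as in the notebook
model_a = LassoCV().fit(StandardScaler().fit_transform(X_demo[:half]), y_demo[:half])
model_b = LassoCV().fit(StandardScaler().fit_transform(X_demo[half:]), y_demo[half:])

coef_comparison = compare_coefs(model_a, model_b, names)
print(coef_comparison)
```

Because each subset is scaled separately, the two sets of coefficients are in units of each subset's own standard deviations, which is worth keeping in mind when reading the differences.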