# <u>Model Baseline</u>
-----

## Objective
The main goal of this notebook is to determine which of the three methods for handling missing values is most appropriate for the data. To achieve this, a basic linear regression is fit to each of the three resulting datasets, and the scores are then compared.

-----
#### External Libraries Import

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

#### Read Cleaned Dataframes 

In [2]:
# from method 1
df_pga1 = pd.read_csv('../Data/Sets/model_one.csv')
# from method 2
df_pga2 = pd.read_csv('../Data/Sets/model_two.csv')
# from method 3
df_pga3 = pd.read_csv('../Data/Sets/model_three.csv')

<br><br>
## <u>Preparation for Modeling</u>

----
#### Model 1

In [3]:
# create list of features, remove irrelevant columns and strokes gained columns
features = [
    col for col in df_pga1.columns if col not in ['date', 'finish', 
                                                  'player', 'event', 
                                                  'sg:_off-the-tee',
                                                  'sg:_approach-the-green',
                                                  'sg:_around-the-green',
                                                  'sg:_putting',
                                                  'sg:_total']]

# choose X and y variables
X1 = df_pga1[features]
y1 = df_pga1['sg:_total']

# train, test split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, 
                                                        test_size = 0.3, random_state = 77)
# scale the data (subtract mean and divide by standard deviation)
ss = StandardScaler()
X_train1_sc = ss.fit_transform(X_train1)
X_test1_sc = ss.transform(X_test1)

#### Model 2

In [4]:
# create list of features, remove irrelevant columns and strokes gained columns
features = [
    col for col in df_pga2.columns if col not in ['date', 'finish', 
                                                  'player', 'event', 
                                                  'sg:_off-the-tee',
                                                  'sg:_approach-the-green',
                                                  'sg:_around-the-green',
                                                  'sg:_putting',
                                                  'sg:_total']]

X2 = df_pga2[features]
y2 = df_pga2['sg:_total']

# train, test split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, 
                                                        test_size = 0.3, random_state = 77)

ss = StandardScaler()
X_train2_sc = ss.fit_transform(X_train2)
X_test2_sc = ss.transform(X_test2)

#### Model 3

In [5]:
# create list of features, remove irrelevant columns and strokes gained columns
features = [
    col for col in df_pga3.columns if col not in ['date', 'finish', 
                                                  'player', 'event', 
                                                  'sg:_off-the-tee',
                                                  'sg:_approach-the-green',
                                                  'sg:_around-the-green',
                                                  'sg:_putting',
                                                  'sg:_total']]

X3 = df_pga3[features]
y3 = df_pga3['sg:_total']

# train, test split
X_train3, X_test3, y_train3, y_test3 = train_test_split(X3, y3, 
                                                        test_size = 0.3, random_state = 77)

ss = StandardScaler()
X_train3_sc = ss.fit_transform(X_train3)
X_test3_sc = ss.transform(X_test3)
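
The three preparation cells above repeat the same steps: drop the excluded columns, split 70/30 with `random_state = 77`, and scale with a `StandardScaler` fit on the training rows only. As a minimal sketch, the steps could be wrapped in one function. The name `prepare`, the constant `EXCLUDED`, and the demo frame (including its `club_head_speed` column and all of its values) are hypothetical and are not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# columns excluded from the feature set, copied from the cells above
EXCLUDED = ['date', 'finish', 'player', 'event',
            'sg:_off-the-tee', 'sg:_approach-the-green',
            'sg:_around-the-green', 'sg:_putting', 'sg:_total']


def prepare(df, target='sg:_total', test_size=0.3, random_state=77):
    """Hypothetical helper: select features, split, and scale one dataframe."""
    features = [col for col in df.columns if col not in EXCLUDED]
    X, y = df[features], df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    # fit the scaler on the training rows only, then reuse it on the test rows
    ss = StandardScaler()
    X_train_sc = ss.fit_transform(X_train)
    X_test_sc = ss.transform(X_test)
    return X_train_sc, X_test_sc, y_train, y_test


# tiny synthetic frame (made-up values) used only to show the call
demo = pd.DataFrame({
    'player': list('abcdefghij'),
    'club_head_speed': [110.0, 112.5, 115.0, 118.2, 120.1,
                        109.3, 113.7, 116.4, 119.0, 121.5],
    'sg:_total': [0.1, 0.3, 0.5, 0.9, 1.2, -0.2, 0.4, 0.6, 1.0, 1.3],
})
X_tr, X_te, y_tr, y_te = prepare(demo)
print(X_tr.shape, X_te.shape)
```

Under these assumptions, a call such as `prepare(df_pga1)` would return the same four objects that the Model 1 cell builds by hand.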

<br><br>
## Comparing Models
----

In [6]:
# instantiate basic linear regression model
lr = LinearRegression()
# instantiate a 10-fold KFold splitter that shuffles the rows before splitting
kf = KFold(n_splits = 10 , shuffle = True , random_state = 77)

# cross-validate the linear regression on each of the three training sets
lr1_cv = cross_val_score(lr, X_train1_sc, y_train1, cv=kf).mean()
lr2_cv = cross_val_score(lr, X_train2_sc, y_train2, cv=kf).mean()
lr3_cv = cross_val_score(lr, X_train3_sc, y_train3, cv=kf).mean()

print(f'Model 1 produces an average R-squared of {lr1_cv}.')
print(f'Model 2 produces an average R-squared of {lr2_cv}.')
print(f'Model 3 produces an average R-squared of {lr3_cv}.')

Model 1 produces an average R-squared of 0.5894856748890387.
Model 2 produces an average R-squared of 0.5679173216443035.
Model 3 produces an average R-squared of 0.5672825566325466.


- Model 1 produces the highest average cross-validated R-squared on the training data (0.5895). Its predictions explain about 58.9% of the variance in Strokes Gained: Total, relative to a baseline that always predicts the mean.
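
The cross-validation above runs on training data that was scaled once, before the folds were created, so each validation fold has already influenced the scaler. One common alternative, sketched here on synthetic data rather than the PGA sets, is to place the scaler and the regression in a scikit-learn pipeline so that scaling is refit inside every fold. The arrays `X_demo` and `y_demo` are made-up stand-ins for an unscaled training set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for an unscaled training set (not the PGA data)
rng = np.random.default_rng(77)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=200)

# the scaler is refit on the training portion of each fold
pipe = make_pipeline(StandardScaler(), LinearRegression())
kf = KFold(n_splits=10, shuffle=True, random_state=77)

# default scoring for a regressor is R-squared
scores = cross_val_score(pipe, X_demo, y_demo, cv=kf)
print(f'Average cross-validated R-squared: {scores.mean():.4f}')
```

For a plain `LinearRegression` the scores should not change, because ordinary least squares predictions are unaffected by rescaling the features. The pattern is likely to matter more if the regularized models imported at the top of the notebook are tuned later.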

In [7]:
# instantiate three models
lr1 = LinearRegression()
lr2 = LinearRegression()
lr3 = LinearRegression()

# fit the three sets of data to their own model
lr1.fit(X_train1_sc, y_train1)
lr2.fit(X_train2_sc, y_train2)
lr3.fit(X_train3_sc, y_train3)

print(f'Model 1 produces an R-squared score \
of {round(lr1.score(X_test1_sc, y_test1), 5)} on unseen data.')

print(f'Model 2 produces an R-squared score \
of {round(lr2.score(X_test2_sc, y_test2), 5)} on unseen data.')

print(f'Model 3 produces an R-squared score \
of {round(lr3.score(X_test3_sc, y_test3), 5)} on unseen data.')

Model 1 produces an R-squared score of 0.60282 on unseen data.
Model 2 produces an R-squared score of 0.55097 on unseen data.
Model 3 produces an R-squared score of 0.56553 on unseen data.


### Insight
- Remember that the only difference between the three models is the input data. Each dataset reflects a different method of handling missing values, as applied in the second notebook.
- A test score of 0.60282 means that Model 1 explains about 60.3% of the variance in Strokes Gained: Total, relative to a baseline that always predicts the mean. In the unpredictable game of golf, this score is high enough to support meaningful interpretation.
- On the test data, Model 1 explains roughly 4 to 5 percentage points more of the variance in Strokes Gained: Total than the other two models (0.60282 versus 0.55097 and 0.56553). While this difference is not drastic, it suggests that the missing data are of little importance to the modeling process.
- For Model 1, the missing data were removed and ignored. This approach lends itself to a simpler, more interpretable model. Because of this, Model 1 will be further tuned and carried into the regression analysis of Club Head Speed.
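
The R-squared interpretation used in the bullets above follows from its definition, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{tot}$ is the squared error of always predicting the mean. The sketch below computes it by hand and compares the result with `sklearn.metrics.r2_score`. The values in `y_true` and `y_pred` are made up for illustration and do not come from the PGA data.

```python
import numpy as np
from sklearn.metrics import r2_score

# made-up values for illustration only
y_true = np.array([0.5, -1.2, 2.0, 0.3, -0.6])
y_pred = np.array([0.4, -0.9, 1.5, 0.6, -0.4])

# squared error of the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# squared error of a baseline that always predicts the mean of y_true
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A score of 0 corresponds to doing no better than the mean baseline, and a score of 1 to perfect predictions, which is why 0.60282 reads as about 60.3% of the variance explained.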