# In this notebook, I tune the regression models for position player salary using GridSearchCV in scikit-learn to get the best-performing decision tree, linear, and ridge regressions

From experimentation in the Positionplayers_regressions notebook, I found that decision tree regressors, neural networks, and ridge regressions work best on this data. Here I tune hyperparameters for the ridge and decision tree models, and fit a plain multivariate linear regression for comparison, with the goal of minimizing mean absolute error (MAE).

In [190]:
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [191]:
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import MinMaxScaler

In [192]:
from sklearn import linear_model
from sklearn.linear_model import Ridge, Lasso

In [193]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error

In [194]:
#read in position player stats
pp_train = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/MLB/pp_traindata1.csv')
pp_test = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/MLB/pp_testdata.csv')

# Final check for null values:

In [195]:
#Looks like pp_train has a few null values so we drop them
pp_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 280 entries, 0 to 279
Data columns (total 45 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          280 non-null    int64  
 1   age                 280 non-null    int64  
 2   stints              280 non-null    int64  
 3   G                   279 non-null    float64
 4   tap                 279 non-null    float64
 5   AB                  279 non-null    float64
 6   R                   279 non-null    float64
 7   H                   279 non-null    float64
 8   db                  279 non-null    float64
 9   tr                  279 non-null    float64
 10  HR                  279 non-null    float64
 11  RBI                 279 non-null    float64
 12  SB                  279 non-null    float64
 13  CS                  279 non-null    float64
 14  BB                  279 non-null    float64
 15  SO                  279 non-null    float64
 16  IBB     

In [196]:
pp_train = pp_train.dropna()
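
Before dropping rows, it can help to see exactly which columns and rows contain nulls, since `info()` only reports counts per column. The following is a minimal sketch on a made-up frame (the `toy` data is hypothetical and only stands in for `pp_train`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for pp_train (values are made up).
toy = pd.DataFrame({
    "age": [25, 31, 28],
    "G": [72.0, np.nan, 161.0],
    "HR": [12.0, np.nan, 34.0],
})

# Count missing values per column and find the rows that dropna() would remove.
null_counts = toy.isna().sum()
rows_with_nulls = toy[toy.isna().any(axis=1)]

print(null_counts[null_counts > 0])
print(rows_with_nulls.index.tolist())

# Same call the notebook uses: drop every row with at least one null.
cleaned = toy.dropna()
```

If all the nulls sit in a single row, as the counts above suggest for `pp_train`, dropping that row loses very little data.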

In [197]:
pp_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 279 entries, 0 to 279
Data columns (total 45 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          279 non-null    int64  
 1   age                 279 non-null    int64  
 2   stints              279 non-null    int64  
 3   G                   279 non-null    float64
 4   tap                 279 non-null    float64
 5   AB                  279 non-null    float64
 6   R                   279 non-null    float64
 7   H                   279 non-null    float64
 8   db                  279 non-null    float64
 9   tr                  279 non-null    float64
 10  HR                  279 non-null    float64
 11  RBI                 279 non-null    float64
 12  SB                  279 non-null    float64
 13  CS                  279 non-null    float64
 14  BB                  279 non-null    float64
 15  SO                  279 non-null    float64
 16  IBB     

In [198]:
#pp_test has no null values so we are good
pp_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 45 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          70 non-null     int64  
 1   age                 70 non-null     int64  
 2   stints              70 non-null     int64  
 3   G                   70 non-null     float64
 4   tap                 70 non-null     float64
 5   AB                  70 non-null     float64
 6   R                   70 non-null     float64
 7   H                   70 non-null     float64
 8   db                  70 non-null     float64
 9   tr                  70 non-null     float64
 10  HR                  70 non-null     float64
 11  RBI                 70 non-null     float64
 12  SB                  70 non-null     float64
 13  CS                  70 non-null     float64
 14  BB                  70 non-null     float64
 15  SO                  70 non-null     float64
 16  IBB       

In [199]:
#drop unnamed index column from both datasets
#also drop the at bats feature, since it is redundant with tap (total plate appearances), which contains very similar info
pp_train.drop(columns = ['Unnamed: 0', 'AB'], inplace = True)
pp_test.drop(columns = ['Unnamed: 0', 'AB'], inplace = True)
pp_test.head()

Unnamed: 0,age,stints,G,tap,R,H,db,tr,HR,RBI,...,C,DH,OF,P,SS,bats_B,bats_L,bats_R,throws_L,throws_R
0,32,1,72.0,251.0,19.0,48.0,12.0,0.0,12.0,41.0,...,1,0,0,0,0,0,0,1,0,1
1,31,1,161.0,680.0,97.0,155.0,25.0,1.0,34.0,97.0,...,0,0,0,0,0,0,0,1,0,1
2,25,1,75.0,248.0,15.0,39.0,9.0,1.0,0.0,15.0,...,0,0,1,0,0,0,0,1,0,1
3,24,1,77.0,288.0,28.0,67.0,11.0,4.0,6.0,26.0,...,0,0,0,0,0,0,0,1,0,1
4,23,1,30.0,80.0,4.0,18.0,6.0,0.0,0.0,4.0,...,0,0,0,0,1,1,0,0,0,1


# Reminding ourselves of the train and test data shapes (we performed an 80-20 split previously then normalized both independently)

In [200]:
print('Position Players Train:')
print('Train dataset dimensions: ', pp_train.shape, '\n')

print('Position Players Test:')
print('Test dataset dimensions: ', pp_test.shape, '\n')

Position Players Train:
Train dataset dimensions:  (279, 43) 

Position Players Test:
Test dataset dimensions:  (70, 43) 



# Normalize input features using min-max normalization

In [201]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [202]:
train_target = pp_train['salary']

train_features = pp_train.drop(columns = ['salary'])
col = train_features.columns
train_features = scaler.fit_transform(train_features)
train_features = pd.DataFrame(data=train_features, columns=col)

In [203]:
test_target = pp_test['salary']

test_features = pp_test.drop(columns = ['salary'])
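#note: this refits the scaler on the test set, so test features are scaled by the test min/max
#rather than the training min/max; scaler.transform(test_features) would reuse the training fit
test_features = scaler.fit_transform(test_features)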
test_features = pd.DataFrame(data=test_features, columns=col)
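
A common alternative to scaling the two sets independently is to learn the min and max from the training rows only and reuse them on the test rows. This is a minimal sketch of that pattern on made-up data (the `toy_*` names and values are hypothetical); it is not what produced the results in this notebook:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; column names mirror the notebook but values are made up.
toy_train = pd.DataFrame({"age": [22, 30, 38], "G": [30.0, 100.0, 160.0]})
toy_test = pd.DataFrame({"age": [26, 40], "G": [65.0, 162.0]})

toy_scaler = MinMaxScaler()

# Learn min/max from the training rows only.
toy_train_scaled = pd.DataFrame(
    toy_scaler.fit_transform(toy_train), columns=toy_train.columns
)

# Reuse the training min/max on the test rows (no refit).
toy_test_scaled = pd.DataFrame(
    toy_scaler.transform(toy_test), columns=toy_test.columns
)

print(toy_test_scaled)
```

With this approach, test values outside the training range map outside [0, 1], which is expected and keeps both sets on the same scale.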

# Use the best features we found when experimenting in the PositionPlayer_Regressions.ipynb file

In [204]:
#optimal features for multivariate linear and ridge regressions:
train_selected_X = train_features[['age', 'G', 'RBI', 'H', 'GIDP', 'PRO', 'Rookie contract', 'ops']]
#optimal decision tree features:
train_selected_dt_X = train_features[['age', 'G', 'tap', 'H', 'GIDP', 'PRO', 'Rookie contract', 'ops', 'RBI']]

train_y = train_target

test_selected_X = test_features[['age', 'G', 'RBI', 'H', 'GIDP', 'PRO', 'Rookie contract', 'ops']]
test_selected_dt_X = test_features[['age', 'G', 'tap', 'H', 'GIDP', 'PRO', 'Rookie contract', 'ops', 'RBI']]

test_y = test_target

In [205]:
print('Train Features:')
print('train_selected_X dimensions: ', train_selected_X.shape)
print('train_y dimensions:', train_y.shape, '\n')

print('Test Features:')
print('test_data_X dimensions: ', test_selected_X.shape)
print('test_data_y dimensions:', test_y.shape, '\n')

Train Features:
train_selected_X dimensions:  (279, 8)
train_y dimensions: (279,) 

Test Features:
test_data_X dimensions:  (70, 8)
test_data_y dimensions: (70,) 



# Multivariate Linear Regression
Features Used: age, games, runs batted in, hits, grounded into double play, pro, rookie contract, and ops

In [221]:
multivariate_regression = linear_model.LinearRegression()
multivariate_regression.fit(train_selected_X, train_y)

LinearRegression()

# Beta values (coefficients) and intercept for our multivariate linear regression:


In [223]:
multivariate_regression.coef_

array([18.52392221, -9.3448351 , -2.64652833, -3.67020106,  6.47866356,
       22.15176025,  0.96789912, -5.06003056])

In [224]:
y_pred = multivariate_regression.predict(test_selected_X)

In [225]:
print('Mean Squared Error: ', mean_squared_error(test_y, y_pred), '\n')
print('Mean Absolute Error: ', mean_absolute_error(test_y, y_pred), '\n')
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(test_y, y_pred)))

Mean Squared Error:  20.957202516377947 

Mean Absolute Error:  3.3962689166115063 

Root Mean Squared Error:  4.577903725110211
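
The same three metrics are printed for every model in this notebook, so they could be wrapped in a small helper. This is a sketch only; `regression_metrics` is a hypothetical name, and the example salaries are made up (assumed to be in millions, like the target here):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, and RMSE for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MSE": mse,
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),  # RMSE is the square root of MSE
    }


# Made-up salaries (in millions) just to exercise the helper.
example = regression_metrics([1.0, 5.0, 10.0], [2.0, 5.0, 7.0])
print(example)
```

In the notebook, the call would look like `regression_metrics(test_y, y_pred)` for each model's predictions.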


# Ridge Regression
Features Used: age, games, runs batted in, hits, grounded into double play, pro, rookie contract, and ops

In [206]:
from sklearn.model_selection import GridSearchCV

In [234]:
ridge_search = GridSearchCV(Ridge(), {
    'alpha': [0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.2, 0.25, 0.5, 0.75,
              1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25,
              3.5, 3.75, 4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75,
              6, 6.25, 6.5, 6.75, 7, 7.25, 7.5, 7.75, 8]
}, cv = 4, scoring = 'neg_mean_absolute_error', return_train_score = False)
ridge_search.fit(train_selected_X, train_y)
pd.DataFrame(ridge_search.cv_results_)[['param_alpha', 'params', 'mean_test_score','rank_test_score']].sort_values(by = ['rank_test_score'])[0:5]

Unnamed: 0,param_alpha,params,mean_test_score,rank_test_score
8,0.5,{'alpha': 0.5},-3.358481,1
7,0.25,{'alpha': 0.25},-3.36127,2
9,0.75,{'alpha': 0.75},-3.361375,3
6,0.2,{'alpha': 0.2},-3.362564,4
5,0.15,{'alpha': 0.15},-3.365257,5


In [235]:
#optimal alpha value for a ridge regression to predict salary for position players
ridge_search.best_params_

{'alpha': 0.5}

# Reproducing the model with the tuned alpha

In [236]:
ridge_regression = Ridge(alpha = 0.5)
ridge_regression.fit(train_selected_X, train_y)

Ridge(alpha=0.5)
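
Refitting by hand works, but with the default `refit=True`, GridSearchCV already refits the best configuration on the full training data and exposes it as `best_estimator_`. The sketch below checks this on synthetic data; the `demo_*` names, the data, and the log-spaced alpha grid are all assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training features and salary target.
X_demo, y_demo = make_regression(
    n_samples=120, n_features=8, noise=10.0, random_state=0
)

demo_search = GridSearchCV(
    Ridge(),
    {"alpha": np.logspace(-2, 1, 13)},  # 0.01 to 10 on a log scale (an assumed grid)
    cv=4,
    scoring="neg_mean_absolute_error",
)
demo_search.fit(X_demo, y_demo)

# With the default refit=True, best_estimator_ is already fit on all of X_demo.
best_ridge = demo_search.best_estimator_

# Manual refit with the winning alpha, as done in the notebook.
manual_ridge = Ridge(alpha=demo_search.best_params_["alpha"]).fit(X_demo, y_demo)

print(demo_search.best_params_)
print(np.allclose(best_ridge.coef_, manual_ridge.coef_))
```

Using `best_estimator_` directly avoids retyping the alpha value and keeps the final model in sync with the search.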

In [230]:
ridge_regression.coef_

array([11.07806031, -1.65674287,  1.83675343,  1.95171635,  4.75308528,
        3.71003721, -1.14919075, -0.14232078])

In [238]:
y_predicted_ridge = ridge_regression.predict(test_selected_X)

In [239]:
print('Mean Squared Error: ', mean_squared_error(test_y, y_predicted_ridge), '\n')
print('Mean Absolute Error: ', mean_absolute_error(test_y, y_predicted_ridge), '\n')
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(test_y, y_predicted_ridge)))

Mean Squared Error:  21.152804458732984 

Mean Absolute Error:  3.37598531734457 

Root Mean Squared Error:  4.59921780944684


# Decision Tree Regressor: use GridSearchCV to find the best max_depth and max_features
Features Used: age, games, total plate appearances, hits, grounded into double play, pro, rookie contract, ops, and runs batted in




In [213]:
from sklearn.tree import DecisionTreeRegressor

In [214]:
dt_regressor = DecisionTreeRegressor()

In [215]:
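dt_search = GridSearchCV(dt_regressor, {
    #grid search two parameters for decision tree regressor
    'max_depth': [1,2,3,4,5,6,7],
    #note: for DecisionTreeRegressor, 'auto' behaves like None; it was deprecated in scikit-learn 1.1 and dropped in later releases
    'max_features': [None,'sqrt','auto','log2']
}, cv = 5, scoring = 'neg_mean_absolute_error', return_train_score = False)
#note: this fits on train_selected_X (8 features); train_selected_dt_X (9 features, adds 'tap') is defined above but not used here
dt_search.fit(train_selected_X, train_y)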
pd.DataFrame(dt_search.cv_results_)[['param_max_depth', 'params', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])[0:5]

Unnamed: 0,param_max_depth,params,mean_test_score,rank_test_score
8,3,"{'max_depth': 3, 'max_features': None}",-3.156915,1
10,3,"{'max_depth': 3, 'max_features': 'auto'}",-3.185189,2
4,2,"{'max_depth': 2, 'max_features': None}",-3.185449,3
6,2,"{'max_depth': 2, 'max_features': 'auto'}",-3.185449,3
12,4,"{'max_depth': 4, 'max_features': None}",-3.229652,5


In [216]:
#optimal parameters for a decision tree regressor
dt_search.best_params_

{'max_depth': 3, 'max_features': None}
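
Because tree fitting can involve random tie-breaking, a fixed `random_state` makes the search and the final model repeatable, and unpacking `best_params_` avoids retyping the winning values. A minimal sketch on synthetic data follows; the seed, the `*_tree` names, and the data are assumptions, and `'auto'` is left out of the grid so the sketch also runs on recent scikit-learn releases:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data.
X_tree, y_tree = make_regression(
    n_samples=150, n_features=9, noise=15.0, random_state=1
)

tree_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),  # fixed seed (an assumption, not in the notebook)
    {"max_depth": [1, 2, 3, 4, 5, 6, 7], "max_features": [None, "sqrt", "log2"]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
tree_search.fit(X_tree, y_tree)

# Unpack the winning parameters instead of retyping them by hand.
final_tree = DecisionTreeRegressor(random_state=42, **tree_search.best_params_)
final_tree.fit(X_tree, y_tree)

print(tree_search.best_params_)
print(final_tree.get_params()["max_depth"], final_tree.get_params()["max_features"])
```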

In [217]:
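#note: best_params_ above reports max_features=None, but this cell uses 'log2'
#no random_state is set, so the fitted tree and its test scores can change between runs
dt_regressor = DecisionTreeRegressor(max_depth=3, max_features='log2')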

In [218]:
dt_regressor.fit(train_selected_X, train_y)

DecisionTreeRegressor(max_depth=3, max_features='log2')

In [219]:
y_pred_dt = dt_regressor.predict(test_selected_X)

In [220]:
print('Mean Squared Error: ', mean_squared_error(test_y, y_pred_dt), '\n')
print('Mean Absolute Error: ', mean_absolute_error(test_y, y_pred_dt), '\n')
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(test_y, y_pred_dt)))

Mean Squared Error:  18.249250004699583 

Mean Absolute Error:  2.6256970830234416 

Root Mean Squared Error:  4.271914091446548


In [241]:
position_player_tuned = pd.DataFrame(columns = ['Model Type', 'Features Used', 'Number of Features', 'MSE', 'MAE', 'RMSE'])

# Display results for tuned position player models:

In [243]:
position_player_tuned['Model Type'] = ['Multivariate Linear', 'Ridge', 'Decision Tree']
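#note: the decision tree cells above were fit on train_selected_X (8 features);
#the 9-feature list recorded below corresponds to train_selected_dt_X
position_player_tuned['Number of Features'] = [8,8,9]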
position_player_tuned['Features Used'] = [['age', 'G', 'RBI', 'H', 'GIDP', 'PRO', 'Rookie contract', 'ops'],
                                          ['age', 'G', 'RBI', 'H', 'GIDP', 'PRO', 'Rookie contract', 'ops'],
                                          ['age', 'G', 'tap', 'H', 'GIDP', 'PRO', 'Rookie contract', 'ops', 'RBI']]

In [245]:
mse = []
mse.append(mean_squared_error(test_y, y_pred))
mse.append(mean_squared_error(test_y, y_predicted_ridge))
mse.append(mean_squared_error(test_y, y_pred_dt))

mae = []
mae.append(mean_absolute_error(test_y, y_pred))
mae.append(mean_absolute_error(test_y, y_predicted_ridge))
mae.append(mean_absolute_error(test_y, y_pred_dt))

rmse = []
rmse.append(np.sqrt(mean_squared_error(test_y, y_pred)))
rmse.append(np.sqrt(mean_squared_error(test_y, y_predicted_ridge)))
rmse.append(np.sqrt(mean_squared_error(test_y, y_pred_dt)))

In [249]:
position_player_tuned['MSE'] = mse
position_player_tuned['MAE'] = mae
position_player_tuned['RMSE'] = rmse

Below are the final results for all three models after tuning, sorted by mean absolute error. The decision tree regressor outperforms the ridge and linear models.

Although the neural network regressors performed best, the decision tree regressor produced similar results. The ridge and multivariate linear regressions also performed solidly, with a mean absolute error of about 3.4 million, which is reasonable given that position player salaries in the dataset range from 140,000 to 30 million.
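
One way to put these MAE values in context is to compare them against a naive baseline that always predicts the median training salary. The sketch below uses made-up salaries in millions (the `toy_*` arrays are hypothetical); in the notebook, `train_y` and `test_y` would take their place:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up salaries in millions; the real notebook would pass train_y / test_y here.
toy_train_y = np.array([0.5, 0.6, 1.0, 3.0, 8.0, 15.0, 25.0])
toy_test_y = np.array([0.55, 2.0, 12.0])

# DummyRegressor ignores the features, so a placeholder column is enough.
baseline = DummyRegressor(strategy="median")
baseline.fit(np.zeros((len(toy_train_y), 1)), toy_train_y)
baseline_pred = baseline.predict(np.zeros((len(toy_test_y), 1)))

baseline_mae = mean_absolute_error(toy_test_y, baseline_pred)
print(baseline_mae)
```

A tuned model is only informative to the extent that its MAE falls clearly below this kind of baseline.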

In [255]:
position_player_tuned.sort_values(by = ['MAE']).reset_index(drop = True)

Unnamed: 0,Model Type,Features Used,Number of Features,MSE,MAE,RMSE
0,Decision Tree,"[age, G, tap, H, GIDP, PRO, Rookie contract, o...",9,18.24925,2.625697,4.271914
1,Ridge,"[age, G, RBI, H, GIDP, PRO, Rookie contract, ops]",8,21.152804,3.375985,4.599218
2,Multivariate Linear,"[age, G, RBI, H, GIDP, PRO, Rookie contract, ops]",8,20.957203,3.396269,4.577904
