<a href="https://colab.research.google.com/github/mnocerino23/MLB-Salary-Regressions/blob/main/Pitcher_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# In this notebook, I tune the regression models for pitcher salaries using GridSearchCV in scikit-learn to get the best-performing decision tree, linear, and ridge regressions

From experimentation in the Pitcher_Regressions notebook, I found that decision tree, multi-linear, and ridge regressions work best on this data. Here I tweak the model parameters to optimize performance and minimize MAE.

In [148]:
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [149]:
import sklearn
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import MinMaxScaler

In [150]:
from sklearn import linear_model
from sklearn.linear_model import Ridge, Lasso

In [151]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_squared_log_error

In [152]:
#read in pitcher stats
pitcher_train = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/MLB/pitcher_train.csv')
pitcher_test = pd.read_csv('/content/drive/MyDrive/Data_Science_Projects/MLB/pitcher_test.csv')

# Final check for null values:

In [153]:
#check pitcher_train for null values; any rows containing nulls are dropped in the next cell
pitcher_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 43 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     300 non-null    object 
 1   throws   300 non-null    object 
 2   age      300 non-null    int64  
 3   stints   300 non-null    int64  
 4   teamID   300 non-null    object 
 5   LG       300 non-null    object 
 6   POS1     300 non-null    object 
 7   W        300 non-null    int64  
 8   L        300 non-null    int64  
 9   CG       300 non-null    int64  
 10  ShO      300 non-null    int64  
 11  GP       300 non-null    int64  
 12  GS       300 non-null    int64  
 13  SV       300 non-null    int64  
 14  GF       300 non-null    int64  
 15  IPOuts   300 non-null    int64  
 16   IP      300 non-null    float64
 17   ERA     300 non-null    float64
 18  HA       300 non-null    int64  
 19  ER       300 non-null    int64  
 20  HRA      300 non-null    int64  
 21  BBA      300 non

In [154]:
pitcher_train = pitcher_train.dropna()

In [155]:
pitcher_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300 entries, 0 to 299
Data columns (total 43 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     300 non-null    object 
 1   throws   300 non-null    object 
 2   age      300 non-null    int64  
 3   stints   300 non-null    int64  
 4   teamID   300 non-null    object 
 5   LG       300 non-null    object 
 6   POS1     300 non-null    object 
 7   W        300 non-null    int64  
 8   L        300 non-null    int64  
 9   CG       300 non-null    int64  
 10  ShO      300 non-null    int64  
 11  GP       300 non-null    int64  
 12  GS       300 non-null    int64  
 13  SV       300 non-null    int64  
 14  GF       300 non-null    int64  
 15  IPOuts   300 non-null    int64  
 16   IP      300 non-null    float64
 17   ERA     300 non-null    float64
 18  HA       300 non-null    int64  
 19  ER       300 non-null    int64  
 20  HRA      300 non-null    int64  
 21  BBA      300 non

In [156]:
#pitcher_test has no null values so we are good
pitcher_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 43 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   name     75 non-null     object 
 1   throws   75 non-null     object 
 2   age      75 non-null     int64  
 3   stints   75 non-null     int64  
 4   teamID   75 non-null     object 
 5   LG       75 non-null     object 
 6   POS1     75 non-null     object 
 7   W        75 non-null     int64  
 8   L        75 non-null     int64  
 9   CG       75 non-null     int64  
 10  ShO      75 non-null     int64  
 11  GP       75 non-null     int64  
 12  GS       75 non-null     int64  
 13  SV       75 non-null     int64  
 14  GF       75 non-null     int64  
 15  IPOuts   75 non-null     int64  
 16   IP      75 non-null     float64
 17   ERA     75 non-null     float64
 18  HA       75 non-null     int64  
 19  ER       75 non-null     int64  
 20  HRA      75 non-null     int64  
 21  BBA      75 non-nu

In [157]:
#drop the identifier and team columns (name, stints, teamID, LG) from both datasets since they are not used as model inputs
pitcher_train.drop(columns = ['name','stints', 'teamID', 'LG'], inplace = True)
pitcher_test.drop(columns = ['name','stints', 'teamID', 'LG'], inplace = True)
pitcher_test.head()

Unnamed: 0,throws,age,POS1,W,L,CG,ShO,GP,GS,SV,...,WHIP,WRIP,HRBIB,RAVG,FIP,DERA,STUFF,GURU,salary,WAR
0,R,28,Reliever,6,4,0,0,56,0,5,...,1.3,1.54,0.07,5.31,3.69,3.6,123,44,0.390904,0.9
1,R,24,Starter,14,4,2,0,30,30,0,...,1.04,1.3,0.06,3.8,2.99,2.93,206,147,0.57,5.0
2,L,21,Reliever,2,6,0,0,19,15,0,...,1.64,3.25,0.16,7.44,6.71,6.27,6,9,0.265576,-0.6
3,R,23,Starter,11,8,0,0,33,33,0,...,0.97,1.55,0.08,2.84,3.45,3.32,171,107,0.5621,4.9
4,R,26,Reliever,4,4,0,0,27,9,0,...,1.67,1.61,0.05,5.77,4.58,4.42,31,23,0.704256,0.6


# Reminding ourselves of the train and test data shapes (we performed an 80-20 split previously then normalized both independently)

In [158]:
print('Pitchers Train:')
print('Train dataset dimensions: ', pitcher_train.shape, '\n')

print('Pitchers Test:')
print('Test dataset dimensions: ', pitcher_test.shape, '\n')

Pitchers Train:
Train dataset dimensions:  (300, 39) 

Pitchers Test:
Test dataset dimensions:  (75, 39) 



# Add a feature for being under a rookie contract. A player aged 25 or under is treated as being on a rookie contract; older players typically are not

In [159]:
#flag players aged 25 or under as being on a rookie contract (1) and everyone else as not (0)
pitcher_train['Rookie contract'] = (pitcher_train['age'] <= 25).astype(int)
pitcher_test['Rookie contract'] = (pitcher_test['age'] <= 25).astype(int)

# One-hot encode position and throwing hand:

In [160]:
dummy_position = pd.get_dummies(pitcher_train['POS1'])
dummy_throws = pd.get_dummies(pitcher_train['throws'], prefix = 'throws')
items = [dummy_position, dummy_throws]
for item in items:
  pitcher_train = pd.merge(left = pitcher_train, right = item, left_index = True, right_index = True)

pitcher_train.drop(columns = ['POS1', 'throws'], inplace = True)

In [161]:
dummy_pos = pd.get_dummies(pitcher_test['POS1'])
dummy_throw = pd.get_dummies(pitcher_test['throws'], prefix = 'throws')
items = [dummy_pos, dummy_throw]

for item in items:
  pitcher_test = pd.merge(left = pitcher_test, right = item, left_index = True, right_index = True)

pitcher_test.drop(columns = ['POS1', 'throws'], inplace = True)
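Encoding the train and test frames separately only yields matching columns when every category appears in both. The sketch below, on made-up toy frames, shows one way to guard against that by reindexing the test columns to the training columns. Note that passing `columns=` to `pd.get_dummies` prefixes the new names (`POS1_Starter` rather than `Starter`), which differs from the naming used in this notebook.

```python
import pandas as pd

# Toy frames standing in for the train/test splits (values are made up).
train = pd.DataFrame({'POS1': ['Starter', 'Reliever', 'Starter'],
                      'throws': ['R', 'L', 'R']})
test = pd.DataFrame({'POS1': ['Reliever', 'Reliever'],
                     'throws': ['R', 'R']})

# Encode both categorical columns in one call; dtype=int gives 0/1 columns.
train_enc = pd.get_dummies(train, columns=['POS1', 'throws'], dtype=int)
test_enc = pd.get_dummies(test, columns=['POS1', 'throws'], dtype=int)

# Give the test frame exactly the training columns, in the same order,
# filling categories that never appeared in the test data with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

This also removes the need for the separate merge and drop steps, since `get_dummies` replaces the encoded columns itself.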

In [162]:
#pitcher_train has two invalid entries (a ' - ' placeholder in ' HR9 ' and a comma-formatted number in ' GURU ') which are causing issues. We will drop both rows
pitcher_train.head(10)
print(pitcher_train.at[8, ' HR9 '])

pitcher_train = pitcher_train.drop(8)
pitcher_train = pitcher_train.reset_index(drop = True)

print(pitcher_train.at[31, ' GURU '])
pitcher_train = pitcher_train.drop(31)
pitcher_train = pitcher_train.reset_index(drop = True)

 -   
 1,325 
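Dropping rows by hard-coded index works for this file but would silently miss other bad entries. As a rough sketch on a made-up toy frame (the values are not from the dataset), `pd.to_numeric` with `errors='coerce'` can flag every row that contains a value that does not parse as a number:

```python
import pandas as pd

# Toy frame with the two kinds of bad entries printed above (values are made up).
df = pd.DataFrame({' HR9 ': ['0.8', ' -   ', '1.1'],
                   ' GURU ': ['38', '63', ' 1,325 ']})

# Coerce every column to numeric; anything unparseable becomes NaN.
numeric = df.apply(pd.to_numeric, errors='coerce')

# Rows containing at least one value that failed to parse.
bad_rows = numeric.isna().any(axis=1)
print(df[bad_rows])

# Keep only the clean rows and renumber the index.
clean = numeric[~bad_rows].reset_index(drop=True)
```

A value like `1,325` could instead be repaired by stripping the comma before conversion; whether to repair or drop is a judgment call that depends on what the field means.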


In [163]:
pitcher_train.head(3)

Unnamed: 0,age,W,L,CG,ShO,GP,GS,SV,GF,IPOuts,...,DERA,STUFF,GURU,salary,WAR,Rookie contract,Reliever,Starter,throws_L,throws_R
0,24,16,7,0,0,33,32,1,1,524,...,4.73,11,38,0.491965,0.6,1,0,1,0,1
1,33,6,5,0,0,73,0,2,7,199,...,3.29,187,63,9.0,1.3,0,1,0,0,1
2,21,13,4,0,0,29,29,0,0,524,...,3.43,62,91,0.411792,3.8,1,0,1,0,1


In [164]:
pitcher_test.head(3)

Unnamed: 0,age,W,L,CG,ShO,GP,GS,SV,GF,IPOuts,...,DERA,STUFF,GURU,salary,WAR,Rookie contract,Reliever,Starter,throws_L,throws_R
0,28,6,4,0,0,56,0,5,20,173,...,3.6,123,44,0.390904,0.9,0,1,0,0,1
1,24,14,4,2,0,30,30,0,0,547,...,2.93,206,147,0.57,5.0,1,0,1,0,1
2,21,2,6,0,0,19,15,0,0,243,...,6.27,6,9,0.265576,-0.6,1,1,0,1,0


# Normalize input features using min-max normalization

In [165]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [166]:
train_target = pitcher_train['salary']

train_features = pitcher_train.drop(columns = ['salary'])
col = train_features.columns
train_features = scaler.fit_transform(train_features)
train_features = pd.DataFrame(data=train_features, columns=col)

In [167]:
test_target = pitcher_test['salary']

test_features = pitcher_test.drop(columns = ['salary'])
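#note: this refits the scaler on the test set, so the test features are scaled by the test min/max rather than the training min/max
test_features = scaler.fit_transform(test_features)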
test_features = pd.DataFrame(data=test_features, columns=col)
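The cell above calls `fit_transform` on the test features, which learns a second set of minimums and maximums from the test data. A common alternative is to fit the scaler on the training data only and reuse it for the test data. The sketch below uses made-up toy arrays to show the difference in the call; switching the notebook to this approach would change the metrics reported further down.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices (made-up values), one column for brevity.
X_train = np.array([[20.0], [30.0], [40.0]])
X_test = np.array([[25.0], [45.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns min=20, max=40
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max

# Test values outside the training range land outside [0, 1].
print(X_test_scaled.ravel())
```

Each value is scaled as `(x - train_min) / (train_max - train_min)`, so 25 maps to 0.25 and 45 maps to 1.25.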

# Use the best features we found when experimenting in the Pitcher_Regressions.ipynb file

In [168]:
#optimal features I found:

train_selected_X = train_features[['age', 'HRA', ' GURU ', 'Starter', 'BFP', 'IPOuts']] 
train_y = train_target

test_selected_X = test_features[['age', 'HRA', ' GURU ', 'Starter', 'BFP','IPOuts']]
test_y = test_target

In [169]:
print('Train Features:')
print('train_selected_X dimensions: ', train_selected_X.shape)
print('train_y dimensions:', train_y.shape, '\n')

print('Test Features:')
print('test_data_X dimensions: ', test_selected_X.shape)
print('test_data_y dimensions:', test_y.shape, '\n')

Train Features:
train_selected_X dimensions:  (298, 6)
train_y dimensions: (298,) 

Test Features:
test_data_X dimensions:  (75, 6)
test_data_y dimensions: (75,) 



# Multivariate Linear Regression
Features Used: age, HRA (home runs allowed), GURU, Starter, BFP (batters faced by pitcher), IPOuts (outs pitched)

In [170]:
multivariate_regression = linear_model.LinearRegression()
multivariate_regression.fit(train_selected_X, train_y)

LinearRegression()

# Beta values (coefficients) for our multivariate linear regression:


In [171]:
multivariate_regression.coef_

array([ 12.13146549,   3.22829229,   8.49811282,   3.58536602,
        12.1979896 , -13.88932942])
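A bare coefficient array is hard to read without the feature names next to it. The sketch below fits a `LinearRegression` on made-up toy data and labels each coefficient with its column; the same pattern should apply to `multivariate_regression` and `train_selected_X.columns`, though that is not run here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made up): y = 2*a - 1*b + 3 exactly, so the fit is easy to check.
X = pd.DataFrame({'a': [0.0, 1.0, 0.0, 1.0],
                  'b': [0.0, 0.0, 1.0, 1.0]})
y = 2 * X['a'] - X['b'] + 3

model = LinearRegression().fit(X, y)

# Label each coefficient with its column name and show the intercept too.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)
print('intercept:', model.intercept_)
```

Because the inputs here are min-max scaled, each coefficient reads as the predicted change in salary when a feature moves from its minimum to its maximum, holding the others fixed.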

In [172]:
y_pred = multivariate_regression.predict(test_selected_X)

In [173]:
print('Mean Squared Error: ', mean_squared_error(test_y, y_pred), '\n')
print('Mean Absolute Error: ', mean_absolute_error(test_y, y_pred), '\n')
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(test_y, y_pred)))

Mean Squared Error:  22.794146806162665 

Mean Absolute Error:  2.9077081381337027 

Root Mean Squared Error:  4.774321606905286
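To make the three metrics concrete, here is a small worked example on made-up numbers that computes them by hand and checks them against scikit-learn. MAE and RMSE are in the same units as the target, while MSE is in squared units, and RMSE weights large misses more heavily than MAE does.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions.
y_true = np.array([1.0, 2.0, 4.0])
y_hat = np.array([2.0, 2.0, 1.0])

errors = y_true - y_hat            # [-1, 0, 3]
mae = np.mean(np.abs(errors))      # (1 + 0 + 3) / 3
mse = np.mean(errors ** 2)         # (1 + 0 + 9) / 3
rmse = np.sqrt(mse)

print(mae, mse, rmse)
print(mean_absolute_error(y_true, y_hat), mean_squared_error(y_true, y_hat))
```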


# Ridge Regression
Features Used: age, HRA, GURU, Starter, BFP, IPOuts

In [174]:
from sklearn.model_selection import GridSearchCV

In [175]:
ridge_search = GridSearchCV(Ridge(), {
    'alpha': [0.025,0.05, 0.075,0.1, 0.125,0.15,0.2, 0.25,0.5,0.75,1.0,1.25,1.5,1.75,2.0, 2.25,2.5,2.75,
              3.0,3.25,3.5,3.75,4,4.25,4.5,4.75,5,5.25,5.5,5.75,6,6.25,6.5,
              6.75,7,7.25,7.5,7.75,8]
}, cv = 5, scoring = 'neg_mean_absolute_error', return_train_score = False)
ridge_search.fit(train_selected_X, train_y)
pd.DataFrame(ridge_search.cv_results_)[['param_alpha', 'params', 'mean_test_score','rank_test_score']].sort_values(by = ['rank_test_score'])[0:5]

Unnamed: 0,param_alpha,params,mean_test_score,rank_test_score
22,4.0,{'alpha': 4},-2.78873,1
21,3.75,{'alpha': 3.75},-2.78892,2
23,4.25,{'alpha': 4.25},-2.788965,3
24,4.5,{'alpha': 4.5},-2.789406,4
20,3.5,{'alpha': 3.5},-2.789937,5


In [176]:
#optimal alpha value for a ridge regression to predict salary for pitchers
ridge_search.best_params_

{'alpha': 4}
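The alpha grid above is typed out by hand. As a sketch, an evenly spaced grid can be generated with `np.arange`, and the fitted search already holds a model refit with the best alpha in `best_estimator_` (the default `refit=True` behavior). The data here is synthetic, from `make_regression`, standing in for `train_selected_X` and `train_y`, and the grid is a simplified version of the one used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook's training features and target would go here.
X, y = make_regression(n_samples=100, n_features=6, noise=10.0, random_state=0)

# Evenly spaced alphas from 0.25 to 8.0 in steps of 0.25.
alphas = np.arange(0.25, 8.25, 0.25)

search = GridSearchCV(Ridge(), {'alpha': alphas}, cv=5,
                      scoring='neg_mean_absolute_error')
search.fit(X, y)

# With the default refit=True, best_estimator_ is already refit on all of X, y.
best_ridge = search.best_estimator_

# best_score_ is the negated MAE, so flip the sign to read it as an error.
print(search.best_params_, -search.best_score_)
```

Using `best_estimator_` avoids retyping the chosen alpha into a new `Ridge(...)` call, which keeps the final model in sync with whatever the search found.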

# Reproducing the model with the tuned hyperparameter

In [177]:
ridge_regression = Ridge(alpha = 4.0)
ridge_regression.fit(train_selected_X, train_y)

Ridge(alpha=4.0)

In [178]:
ridge_regression.coef_

array([8.78734171, 1.00332381, 2.03045098, 2.99296014, 0.79317293,
       0.8673204 ])

In [179]:
y_predicted_ridge = ridge_regression.predict(test_selected_X)

In [180]:
print('Mean Squared Error: ', mean_squared_error(test_y, y_predicted_ridge), '\n')
print('Mean Absolute Error: ', mean_absolute_error(test_y, y_predicted_ridge), '\n')
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(test_y, y_predicted_ridge)))

Mean Squared Error:  26.505084078348478 

Mean Absolute Error:  3.014859689171686 

Root Mean Squared Error:  5.148308856153492


# Decision Tree Regressor - Use GridSearchCV to find the best max_depth and max_features
Features Used: age, HRA, GURU, Starter, BFP, IPOuts




In [181]:
from sklearn.tree import DecisionTreeRegressor

In [182]:
dt_regressor = DecisionTreeRegressor()

In [183]:
dt_search = GridSearchCV(dt_regressor, {
    #grid search two parameters for decision tree regressor
    'max_depth': [1,2,3,4,5,6,7],
    'max_features': [None,'sqrt','auto','log2']
}, cv = 5, scoring = 'neg_mean_absolute_error', return_train_score = False)
dt_search.fit(train_selected_X, train_y)
pd.DataFrame(dt_search.cv_results_)[['param_max_depth', 'params', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])[0:5]

Unnamed: 0,param_max_depth,params,mean_test_score,rank_test_score
27,7,"{'max_depth': 7, 'max_features': 'log2'}",-2.866646,1
23,6,"{'max_depth': 6, 'max_features': 'log2'}",-2.923478,2
9,3,"{'max_depth': 3, 'max_features': 'sqrt'}",-2.957806,3
8,3,"{'max_depth': 3, 'max_features': None}",-2.97715,4
4,2,"{'max_depth': 2, 'max_features': None}",-2.989423,5


In [184]:
#optimal parameters for a decision tree regressor
dt_search.best_params_

{'max_depth': 7, 'max_features': 'log2'}
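When `max_features` is `'sqrt'` or `'log2'`, the tree considers a random subset of features at each split, so without a fixed `random_state` the winning parameters can change from run to run. The sketch below uses synthetic data from `make_regression` as a stand-in for the pitcher features and fixes the seed so that two searches agree. It leaves `'auto'` out of the grid because, as far as I know, newer scikit-learn releases no longer accept it for trees; for a regression tree it meant using all features, the same as `None`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the notebook's training features and target would go here.
X, y = make_regression(n_samples=100, n_features=6, noise=10.0, random_state=0)

param_grid = {'max_depth': [1, 2, 3, 4, 5, 6, 7],
              'max_features': [None, 'sqrt', 'log2']}

def run_search():
    # A fixed random_state makes the random feature subsampling repeatable.
    tree = DecisionTreeRegressor(random_state=0)
    search = GridSearchCV(tree, param_grid, cv=5,
                          scoring='neg_mean_absolute_error')
    return search.fit(X, y)

first, second = run_search(), run_search()

# Both searches should report the same winning parameters.
print(first.best_params_ == second.best_params_)
```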

In [185]:
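#note: this refit uses max_depth = 2 and max_features = 'auto' rather than the best_params_ printed above
dt_regressor = DecisionTreeRegressor(max_depth = 2, max_features='auto')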

In [186]:
dt_regressor.fit(train_selected_X, train_y)

DecisionTreeRegressor(max_depth=2, max_features='auto')

In [187]:
y_pred_dt = dt_regressor.predict(test_selected_X)

In [188]:
print('Mean Squared Error: ', mean_squared_error(test_y, y_pred_dt), '\n')
print('Mean Absolute Error: ', mean_absolute_error(test_y, y_pred_dt), '\n')
print('Root Mean Squared Error: ', np.sqrt(mean_squared_error(test_y, y_pred_dt)))

Mean Squared Error:  31.210955192092253 

Mean Absolute Error:  3.3088998296896324 

Root Mean Squared Error:  5.586676578440196


In [189]:
position_player_tuned = pd.DataFrame(columns = ['Model Type', 'Features Used', 'Number of Features', 'MSE', 'MAE', 'RMSE'])

# Display results for tuned pitcher models:

In [190]:
position_player_tuned['Model Type'] = ['Multivariate Linear', 'Ridge', 'Decision Tree']
position_player_tuned['Number of Features'] = [6,6,6]
position_player_tuned['Features Used'] = [['age', 'HRA', ' GURU ', 'Starter', 'BFP', 'IPOuts'],
                                          ['age', 'HRA', ' GURU ', 'Starter', 'BFP', 'IPOuts'],
                                          ['age', 'HRA', ' GURU ', 'Starter', 'BFP', 'IPOuts']]

In [191]:
mse = []
mse.append(mean_squared_error(test_y, y_pred))
mse.append(mean_squared_error(test_y, y_predicted_ridge))
mse.append(mean_squared_error(test_y, y_pred_dt))

mae = []
mae.append(mean_absolute_error(test_y, y_pred))
mae.append(mean_absolute_error(test_y, y_predicted_ridge))
mae.append(mean_absolute_error(test_y, y_pred_dt))

rmse = []
rmse.append(np.sqrt(mean_squared_error(test_y, y_pred)))
rmse.append(np.sqrt(mean_squared_error(test_y, y_predicted_ridge)))
rmse.append(np.sqrt(mean_squared_error(test_y, y_pred_dt)))

In [192]:
position_player_tuned['MSE'] = mse
position_player_tuned['MAE'] = mae
position_player_tuned['RMSE'] = rmse

Below, we have the final results of all three models tuned to the best performance I could get. They are sorted by mean absolute error, with the multivariate linear regression outperforming the ridge and decision tree models.

Although the neural network regressors performed best, we were still able to produce reasonable results here. The multivariate linear, ridge, and decision tree regressions reached mean absolute errors of about 2.9, 3.0, and 3.3 million respectively, which is pretty good given how widely salaries in the dataset vary.

In [193]:
position_player_tuned.sort_values(by = ['MAE']).reset_index(drop = True)

Unnamed: 0,Model Type,Features Used,Number of Features,MSE,MAE,RMSE
0,Multivariate Linear,"[age, HRA, GURU , Starter, BFP, IPOuts]",6,22.794147,2.907708,4.774322
1,Ridge,"[age, HRA, GURU , Starter, BFP, IPOuts]",6,26.505084,3.01486,5.148309
2,Decision Tree,"[age, HRA, GURU , Starter, BFP, IPOuts]",6,31.210955,3.3089,5.586677
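The results table above is assembled from three hand-built lists. A more compact approach is sketched below with a small helper, `summarize` (a hypothetical name, not part of the notebook), that loops over named predictions and returns the sorted table. The targets and predictions are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

def summarize(y_true, predictions):
    """Build one row of MSE/MAE/RMSE per named set of predictions, sorted by MAE."""
    rows = []
    for name, y_hat in predictions.items():
        mse = mean_squared_error(y_true, y_hat)
        rows.append({'Model Type': name,
                     'MSE': mse,
                     'MAE': mean_absolute_error(y_true, y_hat),
                     'RMSE': np.sqrt(mse)})
    return pd.DataFrame(rows).sort_values(by='MAE').reset_index(drop=True)

# Made-up targets and predictions for two hypothetical models.
y_true = np.array([1.0, 2.0, 3.0])
table = summarize(y_true, {'A': np.array([1.0, 2.0, 5.0]),
                           'B': np.array([1.0, 2.0, 3.5])})
print(table)
```

In this notebook the call would take `test_y` and a dictionary of the three prediction arrays, so adding a fourth model means adding one dictionary entry rather than three `append` calls.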
