
### Context

This dataset ships with the R package ISLR and is used in the accompanying book by G. James et al. (2013), "An Introduction to Statistical Learning with Applications in R", to demonstrate how ridge regression and the lasso are performed in R.
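
For reference, the two penalized least-squares objectives can be written as follows. This is a sketch in the form used by scikit-learn's `Ridge` and `Lasso` estimators, which the cells below rely on; the symbols $X$ (feature matrix), $y$ (salary vector), $w$ (coefficients), $n$ (number of rows) and $\alpha$ (penalty strength) are notation assumed here, not taken from the book.

$$
\text{Ridge:}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

$$
\text{Lasso:}\quad \min_{w}\; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Both penalties depend on the scale of the columns of $X$, which is why the features are usually rescaled before fitting.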

### Format

A data frame with 322 observations of major league players on the following 20 variables:

- AtBat: Number of times at bat in 1986
- Hits: Number of hits in 1986
- HmRun: Number of home runs in 1986
- Runs: Number of runs in 1986
- RBI: Number of runs batted in in 1986
- Walks: Number of walks in 1986
- Years: Number of years in the major leagues
- CAtBat: Number of times at bat during his career
- CHits: Number of hits during his career
- CHmRun: Number of home runs during his career
- CRuns: Number of runs during his career
- CRBI: Number of runs batted in during his career
- CWalks: Number of walks during his career
- League: A factor with levels A and N indicating player’s league at the end of 1986
- Division: A factor with levels E and W indicating player’s division at the end of 1986
- PutOuts: Number of put outs in 1986
- Assists: Number of assists in 1986
- Errors: Number of errors in 1986
- Salary: 1987 annual salary on opening day in thousands of dollars
- NewLeague: A factor with levels A and N indicating player’s league at the beginning of 1987

### Importing Libraries and Reading Data

In [2]:
import warnings
warnings.simplefilter(action='ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, ElasticNet, Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.preprocessing import RobustScaler

Hitters=pd.read_csv("../DataSets/Hitters/Hitters.csv")

### Data Understanding

In [3]:
df=Hitters.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      322 non-null    int64  
 1   Hits       322 non-null    int64  
 2   HmRun      322 non-null    int64  
 3   Runs       322 non-null    int64  
 4   RBI        322 non-null    int64  
 5   Walks      322 non-null    int64  
 6   Years      322 non-null    int64  
 7   CAtBat     322 non-null    int64  
 8   CHits      322 non-null    int64  
 9   CHmRun     322 non-null    int64  
 10  CRuns      322 non-null    int64  
 11  CRBI       322 non-null    int64  
 12  CWalks     322 non-null    int64  
 13  League     322 non-null    object 
 14  Division   322 non-null    object 
 15  PutOuts    322 non-null    int64  
 16  Assists    322 non-null    int64  
 17  Errors     322 non-null    int64  
 18  Salary     263 non-null    float64
 19  NewLeague  322 non-null    object 
dtypes: float64

In [4]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
AtBat,322.0,380.928571,153.404981,16.0,255.25,379.5,512.0,687.0
Hits,322.0,101.024845,46.454741,1.0,64.0,96.0,137.0,238.0
HmRun,322.0,10.770186,8.709037,0.0,4.0,8.0,16.0,40.0
Runs,322.0,50.909938,26.024095,0.0,30.25,48.0,69.0,130.0
RBI,322.0,48.02795,26.166895,0.0,28.0,44.0,64.75,121.0
Walks,322.0,38.742236,21.639327,0.0,22.0,35.0,53.0,105.0
Years,322.0,7.444099,4.926087,1.0,4.0,6.0,11.0,24.0
CAtBat,322.0,2648.68323,2324.20587,19.0,816.75,1928.0,3924.25,14053.0
CHits,322.0,717.571429,654.472627,4.0,209.0,508.0,1059.25,4256.0
CHmRun,322.0,69.490683,86.266061,0.0,14.0,37.5,90.0,548.0


In [5]:
df[df.isnull().any(axis=1)].head(3)

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
15,183,39,3,20,15,11,3,201,42,3,20,16,11,A,W,118,0,0,,A
18,407,104,6,57,43,65,12,5233,1478,100,643,658,653,A,W,912,88,9,,A


In [6]:
df.isnull().sum().sum()

59

### Data Pre-Processing

In [7]:
df=df.copy()

In [8]:
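# numeric_only=True (pandas >= 1.5) skips the object columns League, Division and NewLeague
df.corr(numeric_only=True)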

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,PutOuts,Assists,Errors,Salary
AtBat,1.0,0.967939,0.592198,0.91306,0.820539,0.669845,0.047372,0.235526,0.252717,0.236659,0.266534,0.244053,0.166123,0.31755,0.353824,0.352117,0.394771
Hits,0.967939,1.0,0.562158,0.922187,0.811073,0.641211,0.044767,0.227565,0.255815,0.202712,0.261787,0.232005,0.151818,0.310673,0.320455,0.310038,0.438675
HmRun,0.592198,0.562158,1.0,0.650988,0.855122,0.481014,0.116318,0.221882,0.220627,0.493227,0.262361,0.351979,0.233154,0.282923,-0.106329,0.039318,0.343028
Runs,0.91306,0.922187,0.650988,1.0,0.798206,0.732213,0.004541,0.186497,0.20483,0.227913,0.250556,0.205976,0.182168,0.279347,0.220567,0.240475,0.419859
RBI,0.820539,0.811073,0.855122,0.798206,1.0,0.615997,0.146168,0.294688,0.308201,0.441771,0.323285,0.393184,0.250914,0.343186,0.106591,0.19337,0.449457
Walks,0.669845,0.641211,0.481014,0.732213,0.615997,1.0,0.136475,0.277175,0.280671,0.332473,0.338478,0.308631,0.424507,0.299515,0.149656,0.129382,0.443867
Years,0.047372,0.044767,0.116318,0.004541,0.146168,0.136475,1.0,0.920289,0.903631,0.726872,0.882877,0.868812,0.838533,-0.004684,-0.080638,-0.16214,0.400657
CAtBat,0.235526,0.227565,0.221882,0.186497,0.294688,0.277175,0.920289,1.0,0.995063,0.798836,0.983345,0.949219,0.906501,0.062283,0.002038,-0.066922,0.526135
CHits,0.252717,0.255815,0.220627,0.20483,0.308201,0.280671,0.903631,0.995063,1.0,0.783306,0.984609,0.945141,0.890954,0.076547,-0.002523,-0.062756,0.54891
CHmRun,0.236659,0.202712,0.493227,0.227913,0.441771,0.332473,0.726872,0.798836,0.783306,1.0,0.820243,0.929484,0.799983,0.112724,-0.158511,-0.138115,0.524931


In [9]:
df['Year_lab'] = pd.cut(x=df['Years'], bins=[0, 3, 6, 10, 15, 19, 24])
df.groupby(['League','Division', 'Year_lab']).agg({'Salary':'mean'})

League,Division,Year_lab,Salary
A,E,"(0, 3]",112.5
A,E,"(3, 6]",655.568182
A,E,"(6, 10]",852.738125
A,E,"(10, 15]",816.311353
A,E,"(15, 19]",665.41675
A,E,"(19, 24]",
A,W,"(0, 3]",153.613636
A,W,"(3, 6]",401.36
A,W,"(6, 10]",633.958375
A,W,"(10, 15]",835.25


In [10]:
df['Salary'] = df.groupby(['League', 'Division', 'Year_lab'])['Salary'].transform(lambda x: x.fillna(x.mean()))
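
The group-wise imputation in this cell can be illustrated on a toy frame. This is a minimal sketch: the `toy` data, the single grouping column, and the global-mean fallback are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from Hitters.csv
toy = pd.DataFrame({
    "League": ["A", "A", "A", "N", "N", "N", "X"],
    "Salary": [100.0, np.nan, 300.0, 50.0, 70.0, np.nan, np.nan],
})

# transform() returns a Series aligned with the original index,
# so each missing value is replaced by the mean of its own group
toy["Salary_filled"] = (
    toy.groupby("League")["Salary"]
       .transform(lambda s: s.fillna(s.mean()))
)

# A group with no observed salary ("X") has a NaN mean and stays NaN;
# falling back to the overall mean is one possible way to handle it
toy["Salary_filled"] = toy["Salary_filled"].fillna(toy["Salary"].mean())

print(toy)
```

Checking `isnull().sum()` after the group-wise step, as the notebook does, shows whether such a fallback is needed at all.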

In [11]:
df.head()

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,...,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague,Year_lab
0,293,66,1,30,29,14,1,293,66,1,...,29,14,A,E,446,33,20,112.5,A,"(0, 3]"
1,315,81,7,24,38,39,14,3449,835,69,...,414,375,N,W,632,43,10,475.0,N,"(10, 15]"
2,479,130,18,66,72,76,3,1624,457,63,...,266,263,A,W,880,82,14,480.0,A,"(0, 3]"
3,496,141,20,65,78,37,11,5628,1575,225,...,838,354,N,E,200,11,3,500.0,N,"(10, 15]"
4,321,87,10,39,42,30,2,396,101,12,...,46,33,N,E,805,40,4,91.5,N,"(0, 3]"


In [12]:
df.isnull().sum()

AtBat        0
Hits         0
HmRun        0
Runs         0
RBI          0
Walks        0
Years        0
CAtBat       0
CHits        0
CHmRun       0
CRuns        0
CRBI         0
CWalks       0
League       0
Division     0
PutOuts      0
Assists      0
Errors       0
Salary       0
NewLeague    0
Year_lab     0
dtype: int64

In [13]:
df.shape

(322, 21)

In [14]:
le = LabelEncoder()
df['League'] = le.fit_transform(df['League'])
df['Division'] = le.fit_transform(df['Division'])
df['NewLeague'] = le.fit_transform(df['NewLeague'])

In [15]:
df.head()

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,...,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague,Year_lab
0,293,66,1,30,29,14,1,293,66,1,...,29,14,0,0,446,33,20,112.5,0,"(0, 3]"
1,315,81,7,24,38,39,14,3449,835,69,...,414,375,1,1,632,43,10,475.0,1,"(10, 15]"
2,479,130,18,66,72,76,3,1624,457,63,...,266,263,0,1,880,82,14,480.0,0,"(0, 3]"
3,496,141,20,65,78,37,11,5628,1575,225,...,838,354,1,0,200,11,3,500.0,1,"(10, 15]"
4,321,87,10,39,42,30,2,396,101,12,...,46,33,1,0,805,40,4,91.5,1,"(0, 3]"


In [16]:
df['Year_lab'] = le.fit_transform(df['Year_lab'])

In [17]:
df.head()

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,...,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague,Year_lab
0,293,66,1,30,29,14,1,293,66,1,...,29,14,0,0,446,33,20,112.5,0,0
1,315,81,7,24,38,39,14,3449,835,69,...,414,375,1,1,632,43,10,475.0,1,3
2,479,130,18,66,72,76,3,1624,457,63,...,266,263,0,1,880,82,14,480.0,0,0
3,496,141,20,65,78,37,11,5628,1575,225,...,838,354,1,0,200,11,3,500.0,1,3
4,321,87,10,39,42,30,2,396,101,12,...,46,33,1,0,805,40,4,91.5,1,0
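
The mapping from `Years` to the integer `Year_lab` codes shown above (for example, 14 years falls in `(10, 15]` and becomes 3) can be reproduced with `pd.cut` alone. A small sketch follows; the sample `years` values are made up for illustration, and using `labels=False` is an alternative to the `LabelEncoder` step rather than what the notebook does.

```python
import pandas as pd

years = pd.Series([1, 3, 4, 14, 24])   # hypothetical sample values
bins = [0, 3, 6, 10, 15, 19, 24]       # same edges as in the notebook

# Default behaviour: right-closed intervals stored as an ordered categorical
intervals = pd.cut(years, bins=bins)

# labels=False returns the integer index of the bin instead of the interval
codes = pd.cut(years, bins=bins, labels=False)

print(intervals.astype(str).tolist())
print(codes.tolist())
```

Because the intervals are right-closed, a player with exactly 3 years lands in the first bin, not the second.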


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 21 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      322 non-null    int64  
 1   Hits       322 non-null    int64  
 2   HmRun      322 non-null    int64  
 3   Runs       322 non-null    int64  
 4   RBI        322 non-null    int64  
 5   Walks      322 non-null    int64  
 6   Years      322 non-null    int64  
 7   CAtBat     322 non-null    int64  
 8   CHits      322 non-null    int64  
 9   CHmRun     322 non-null    int64  
 10  CRuns      322 non-null    int64  
 11  CRBI       322 non-null    int64  
 12  CWalks     322 non-null    int64  
 13  League     322 non-null    int32  
 14  Division   322 non-null    int32  
 15  PutOuts    322 non-null    int64  
 16  Assists    322 non-null    int64  
 17  Errors     322 non-null    int64  
 18  Salary     322 non-null    float64
 19  NewLeague  322 non-null    int32  
 20  Year_lab  

In [19]:
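# Year_lab is not dropped here, so it is normalized with the numeric columns
# and appears a second time when cat_df is concatenated in In [21].
df_X= df.drop(["Salary","League","Division","NewLeague"], axis=1)

# preprocessing.normalize rescales each row (player) to unit L2 norm;
# it does not standardize the columns.
scaled_cols5=preprocessing.normalize(df_X)

scaled_cols=pd.DataFrame(scaled_cols5, columns=df_X.columns)
scaled_cols.head()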

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,PutOuts,Assists,Errors,Year_lab
0,0.472401,0.106411,0.001612,0.048369,0.046756,0.022572,0.001612,0.472401,0.106411,0.001612,0.048369,0.046756,0.022572,0.719082,0.053206,0.032246,0.0
1,0.085657,0.022026,0.001903,0.006526,0.010333,0.010605,0.003807,0.937879,0.22706,0.018763,0.087289,0.112578,0.101973,0.171858,0.011693,0.002719,0.000816
2,0.237036,0.064331,0.008907,0.03266,0.03563,0.037609,0.001485,0.803645,0.226149,0.031176,0.110848,0.131631,0.130147,0.435473,0.040578,0.006928,0.0
3,0.082624,0.023488,0.003332,0.010828,0.012993,0.006163,0.001832,0.937518,0.262365,0.037481,0.137929,0.139595,0.05897,0.033316,0.001832,0.0005,0.0005
4,0.331579,0.089867,0.01033,0.040285,0.043384,0.030989,0.002066,0.40905,0.104328,0.012395,0.049582,0.047516,0.034088,0.831529,0.041318,0.004132,0.0
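
To make the effect of `preprocessing.normalize` concrete, the sketch below contrasts it with column-wise standardization on a tiny made-up matrix. `X_toy` is hypothetical, and `StandardScaler` is shown only as a common alternative, not as the method used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical 3 x 2 matrix: rows are samples, columns are features
X_toy = np.array([[3.0, 4.0],
                  [30.0, 40.0],
                  [1.0, 0.0]])

# normalize() defaults to norm="l2", axis=1: every ROW is scaled to unit length
row_normalized = normalize(X_toy)

# StandardScaler works per COLUMN: zero mean and unit variance for each feature
col_standardized = StandardScaler().fit_transform(X_toy)

print(row_normalized)
print(np.linalg.norm(row_normalized, axis=1))
print(col_standardized.mean(axis=0), col_standardized.std(axis=0))
```

The first two rows differ by a factor of ten but become identical after row normalization, which is why the scaled table above mostly reflects each player's mix of statistics rather than their magnitude.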


In [20]:
cat_df=pd.concat([df.loc[:,"League":"Division"],df.loc[:,"NewLeague":"Year_lab"]], axis=1)
cat_df.head()

,League,Division,NewLeague,Year_lab
0,0,0,0,0
1,1,1,1,3
2,0,1,0,0
3,1,0,1,3
4,1,0,1,0


In [21]:
df= pd.concat([scaled_cols,cat_df,df["Salary"]], axis=1)
df

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,...,CWalks,PutOuts,Assists,Errors,Year_lab,League,Division,NewLeague,Year_lab.1,Salary
0,0.472401,0.106411,0.001612,0.048369,0.046756,0.022572,0.001612,0.472401,0.106411,0.001612,...,0.022572,0.719082,0.053206,0.032246,0.000000,0,0,0,0,112.5
1,0.085657,0.022026,0.001903,0.006526,0.010333,0.010605,0.003807,0.937879,0.227060,0.018763,...,0.101973,0.171858,0.011693,0.002719,0.000816,1,1,1,3,475.0
2,0.237036,0.064331,0.008907,0.032660,0.035630,0.037609,0.001485,0.803645,0.226149,0.031176,...,0.130147,0.435473,0.040578,0.006928,0.000000,0,1,0,0,480.0
3,0.082624,0.023488,0.003332,0.010828,0.012993,0.006163,0.001832,0.937518,0.262365,0.037481,...,0.058970,0.033316,0.001832,0.000500,0.000500,1,0,1,3,500.0
4,0.331579,0.089867,0.010330,0.040285,0.043384,0.030989,0.002066,0.409050,0.104328,0.012395,...,0.034088,0.831529,0.041318,0.004132,0.000000,1,0,1,0,91.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
317,0.169544,0.043324,0.002388,0.022174,0.016374,0.012622,0.001706,0.922085,0.274954,0.010916,...,0.047076,0.110869,0.003070,0.001023,0.000341,1,0,1,1,700.0
318,0.083222,0.023004,0.000846,0.012855,0.008457,0.015900,0.002030,0.932185,0.255585,0.006597,...,0.148006,0.052944,0.064446,0.003383,0.000507,0,0,0,3,875.0
319,0.256903,0.068147,0.001623,0.032992,0.023256,0.028124,0.003245,0.919443,0.234188,0.003786,...,0.078964,0.020011,0.061116,0.003786,0.000541,0,1,0,1,385.0
320,0.155442,0.039064,0.002441,0.023059,0.016277,0.021160,0.002170,0.867543,0.232484,0.026314,...,0.090064,0.356458,0.035537,0.003255,0.000543,0,0,0,2,960.0


In [22]:
df.shape

(322, 22)

### Modeling

In [23]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=46)

linreg = LinearRegression()
model = linreg.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_linreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_linreg_rmse

295.09409321656557

In [24]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=46)


ridreg = Ridge()
model = ridreg.fit(X_train, y_train)
y_pred = model.predict(X_test)
df_ridreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_ridreg_rmse

313.57725335744357

In [25]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=46)


lasreg = Lasso()
model = lasreg.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_lasreg_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_lasreg_rmse

287.03573553221474

In [26]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=46)


enet = ElasticNet()
model = enet.fit(X_train,y_train)
y_pred = model.predict(X_test)
df_enet_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_enet_rmse

331.1528731117293

In [27]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=46)


alpha = [0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]
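# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older release to run as written.
ridreg_cv = RidgeCV(alphas = alpha, scoring = "neg_mean_squared_error", cv = 10, normalize = True)
ridreg_cv.fit(X_train, y_train)
ridreg_cv.alpha_

# Final model: refit with the selected alpha (without normalize=True, unlike the search above)

ridreg_tuned = Ridge(alpha = ridreg_cv.alpha_).fit(X_train,y_train)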
y_pred = ridreg_tuned.predict(X_test)
df_ridge_tuned_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_ridge_tuned_rmse

294.5139057976809
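
On scikit-learn releases that no longer accept `normalize=True`, one way to get a comparable workflow is to put a scaler and `RidgeCV` in a pipeline. The sketch below is an assumption-laden illustration: it uses synthetic data from `make_regression` instead of the Hitters frame, and `StandardScaler` is not numerically identical to the old `normalize=True` behaviour, so the selected alpha and RMSE will not match the values above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and salary target (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=46)

alphas = [0.001, 0.01, 0.1, 0.2, 0.3, 0.5, 0.8, 0.9, 1]

# The scaler is fitted on the training split only and reused at predict time,
# so the search and the final model see features on the same scale
ridge_pipe = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=alphas, scoring="neg_mean_squared_error", cv=10),
)
ridge_pipe.fit(X_tr, y_tr)

best_alpha = ridge_pipe.named_steps["ridgecv"].alpha_
rmse = np.sqrt(mean_squared_error(y_te, ridge_pipe.predict(X_te)))
print(best_alpha, rmse)
```

Keeping the scaling inside the pipeline also avoids the mismatch where alpha is tuned on rescaled features but the final model is fitted on unscaled ones.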

In [28]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=46)

alpha = [0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]
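# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call needs an older release to run as written.
lasso_cv = LassoCV(alphas = alpha, cv = 10, normalize = True)
lasso_cv.fit(X_train, y_train)
lasso_cv.alpha_

# Final model: refit with the selected alpha (without normalize=True, unlike the search above)

lasso_tuned = Lasso(alpha = lasso_cv.alpha_).fit(X_train,y_train)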
y_pred = lasso_tuned.predict(X_test)
df_lasso_tuned_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
df_lasso_tuned_rmse

295.47603675458856

In [30]:
y=df["Salary"]
X=df.drop("Salary", axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=46)

enet_params = {"l1_ratio": [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1],
               "alpha":[0.1,0.01,0.001,0.2,0.3,0.5,0.8,0.9,1]}

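# GridSearchCV clones the estimator, so it does not need to be fitted beforehand
enet_model = ElasticNet()
# Note: the search is fitted on the full X, y (test rows included),
# so the RMSE reported below for the tuned model is likely optimistic.
enet_cv = GridSearchCV(enet_model, enet_params, cv = 10).fit(X, y)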
enet_cv.best_params_

#Final Model

enet_tuned = ElasticNet(**enet_cv.best_params_).fit(X_train,y_train)
y_pred = enet_tuned.predict(X_test)
df_enet_tuned_rmse = np.sqrt(mean_squared_error(y_test,y_pred))
df_enet_tuned_rmse

283.49265451112063
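
A leakage-free variant of this search fits `GridSearchCV` on the training split only and scores by RMSE. The following is a minimal sketch on synthetic data; the reduced parameter grid, the `max_iter` value and the `make_regression` data are assumptions for illustration, so its output is not comparable to the number above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and salary target (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.20, random_state=46)

param_grid = {"l1_ratio": [0.1, 0.5, 0.9, 1],
              "alpha": [0.001, 0.01, 0.1, 1]}

# Only the training rows take part in the 10-fold search;
# the default refit=True then refits the best setting on all of X_tr
search = GridSearchCV(ElasticNet(max_iter=10_000), param_grid,
                      cv=10, scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)

# The held-out rows are touched once, for the final estimate
rmse = np.sqrt(mean_squared_error(y_te, search.predict(X_te)))
print(search.best_params_, rmse)
```

With the search restricted to `X_tr`, the test RMSE remains an honest estimate of out-of-sample error.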

In [31]:
basicresult_df = pd.DataFrame({"CONDITIONS":["df: filled with mean, normalized",],
                               "LINEAR":[df_linreg_rmse],
                               "RIDGE":[df_ridreg_rmse],
                               "RIDGE TUNED":[df_ridge_tuned_rmse],
                               "LASSO":[df_lasreg_rmse],
                               "LASSO TUNED":[df_lasso_tuned_rmse],
                               "ELASTIC NET":[df_enet_rmse],
                               "ELASTIC NET TUNED":[df_enet_tuned_rmse]
                               })

basicresult_df

,CONDITIONS,LINEAR,RIDGE,RIDGE TUNED,LASSO,LASSO TUNED,ELASTIC NET,ELASTIC NET TUNED
0,"df: filled with mean, normalized",295.094093,313.577253,294.513906,287.035736,295.476037,331.152873,283.492655
