Created: 12/9/2019  
DATA 512 Human Centered Data Science  
Peter Meleney  

# Chapter 6: Ridge Regression and the Lasso

## Overview

In this lab we will:
1. Review the Hitters data set.
1. Fit a Ridge Regression.
1. Fit a Lasso Regression.

In [4]:
#Imports for Chapter 6: Ridge Regression and the Lasso
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.linear_model import Ridge #The version of Ridge we will be implementing
from sklearn.linear_model import Lasso #The version of Lasso we will be implementing

In [5]:
#allows for matplotlib charts to plot inline in a jupyter notebook.
%matplotlib inline 

#Increases number of characters displayed per column in a dataframe to 150
pd.options.display.max_colwidth = 150

## The Hitters Data set

The Hitters data set contains 322 observations of Major League Baseball players: their offensive and defensive statistics from the 1986 season and their salary information from 1987.[1] [2]

In [6]:
data_description = pd.DataFrame([["AtBat", "Number of times at bat in 1986."],
["Hits","Number of hits in 1986."],
["HmRun", "Number of home runs in 1986."],
["Runs", "Number of runs in 1986."],  
["RBI", "Number of runs batted in in 1986."],
["Walks", "Number of walks in 1986."],
["Years", "Number of years in the major leagues."],  
["CAtBat", "Number of times at bat during his career."], 
["CHits", "Number of hits during his career."],
["CHmRun","Number of home runs during his career."],
["CRuns", "Number of runs during his career."],
["CRBI", "Number of runs batted in during his career."],  
["CWalks", "Number of walks during his career"],
["League", "A factor with levels A and N indicating the player's league at the end of 1986."],
["Division", "A factor with levels E and W indicating the player's division at the end of 1986."],  
["PutOuts", "Number of put outs in 1986"], 
["Assists", "Number of assists in 1986."],
["Errors","Number of errors in 1986."],
["Salary", "1987 annual salary on opening day in thousands of dollars"],
["NewLeague", "A factor with levels A and N indicating player's league at the beginning of 1987."]])

data_description.columns = ['Name', "Description"]
#Use the feature name as the index so descriptions can be looked up by name
data_description = data_description.set_index('Name')
data_description

Name,Description
AtBat,Number of times at bat in 1986.
Hits,Number of hits in 1986.
HmRun,Number of home runs in 1986.
Runs,Number of runs in 1986.
RBI,Number of runs batted in in 1986.
Walks,Number of walks in 1986.
Years,Number of years in the major leagues.
CAtBat,Number of times at bat during his career.
CHits,Number of hits during his career.
CHmRun,Number of home runs during his career.


## Import and Review Data

In [16]:
df = pd.read_csv('data/Hitters.csv', index_col=0)

In [17]:
df.head()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
-Andy Allanson,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
-Alan Ashby,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N
-Alvin Davis,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A
-Andre Dawson,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N
-Andres Galarraga,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 322 entries, -Andy Allanson to -Willie Wilson
Data columns (total 20 columns):
AtBat        322 non-null int64
Hits         322 non-null int64
HmRun        322 non-null int64
Runs         322 non-null int64
RBI          322 non-null int64
Walks        322 non-null int64
Years        322 non-null int64
CAtBat       322 non-null int64
CHits        322 non-null int64
CHmRun       322 non-null int64
CRuns        322 non-null int64
CRBI         322 non-null int64
CWalks       322 non-null int64
League       322 non-null object
Division     322 non-null object
PutOuts      322 non-null int64
Assists      322 non-null int64
Errors       322 non-null int64
Salary       263 non-null float64
NewLeague    322 non-null object
dtypes: float64(1), int64(16), object(3)
memory usage: 52.8+ KB


### Dropping Null Values

Notice that while the data contain 322 total rows, only 263 of them have an entry for Salary.  The other 59 rows hold NaN values in that column and cannot be used by our regression model, so we drop them.

In [23]:
df.dropna(inplace=True)
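
Before dropping rows it can help to confirm which columns actually contain missing values and how many rows will be lost. The following is a minimal sketch on a made-up three-row frame (the `toy` data is illustrative, not taken from Hitters):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Hitters: one row is missing its Salary.
toy = pd.DataFrame({
    "Hits": [66, 81, 130],
    "Salary": [np.nan, 475.0, 480.0],
})

missing_per_column = toy.isna().sum()  # count of NaN values in each column
cleaned = toy.dropna()                 # returns a new frame; toy is unchanged

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

Note that `dropna` removes a row if any column is missing. That matches the intent here only because, per `df.info()`, Salary is the one column with nulls.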

In [24]:
df.describe()

Unnamed: 0,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,PutOuts,Assists,Errors,Salary
count,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0,263.0
mean,403.642586,107.828897,11.619772,54.745247,51.486692,41.114068,7.311787,2657.543726,722.186312,69.239544,361.220532,330.418251,260.26616,290.711027,118.760456,8.593156,535.925882
std,147.307209,45.125326,8.757108,25.539816,25.882714,21.718056,4.793616,2286.582929,648.199644,82.197581,331.198571,323.367668,264.055868,279.934575,145.080577,6.606574,451.118681
min,19.0,1.0,0.0,0.0,0.0,0.0,1.0,19.0,4.0,0.0,2.0,3.0,1.0,0.0,0.0,0.0,67.5
25%,282.5,71.5,5.0,33.5,30.0,23.0,4.0,842.5,212.0,15.0,105.5,95.0,71.0,113.5,8.0,3.0,190.0
50%,413.0,103.0,9.0,52.0,47.0,37.0,6.0,1931.0,516.0,40.0,250.0,230.0,174.0,224.0,45.0,7.0,425.0
75%,526.0,141.5,18.0,73.0,71.0,57.0,10.0,3890.5,1054.0,92.5,497.5,424.5,328.5,322.5,192.0,13.0,750.0
max,687.0,238.0,40.0,130.0,121.0,105.0,24.0,14053.0,4256.0,548.0,2165.0,1659.0,1566.0,1377.0,492.0,32.0,2460.0


## Ridge Regression

We will perform ridge regression on the 263 remaining rows, but first we have a little work to do to tailor the dataset for regression.  First we drop the target variable from the predictive feature set X.  Then we create the target set y.  Finally we replace the two-level categorical variables with dummies; that is, we code one level as 1 and the other as 0.  The generalization to features with more than two levels is called "one-hot encoding" and can be done automatically by pandas.  I chose to do it manually here so that the encodings of columns **League** and **NewLeague** are consistent.

In [25]:
X = df.drop('Salary', axis = 1)
y = df['Salary']

In [26]:
#Encode each two-level factor as 0/1, using the same A/N mapping for both league columns
X['League'] = X['League'].map({'A': 1, 'N': 0})
X['Division'] = X['Division'].map({'W': 1, 'E': 0})
X['NewLeague'] = X['NewLeague'].map({'A': 1, 'N': 0})
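
For comparison, the automatic route mentioned above might look like the sketch below. It uses `pd.get_dummies` on a made-up frame (the `toy` data is illustrative); this is an alternative to the manual encoding, not what the lab itself runs.

```python
import pandas as pd

# Toy frame with the two league columns only.
toy = pd.DataFrame({
    "League": ["A", "N", "A"],
    "NewLeague": ["A", "N", "N"],
})

# drop_first=True keeps a single indicator per two-level factor by dropping
# the first level in sorted order ("A"), so the remaining column flags "N".
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Here pandas chooses the baseline level itself, so "N" ends up coded as 1. The manual mapping gives explicit control over which level is 1 in each column.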

Ridge regression has a parameter called alpha, the regularization strength, which must be a positive float.  We will consider many alphas (from 10^-2 to 10^10, by powers of 10).  This shows the difference between a weakly regularized fit, which is close to an unregularized OLS fit, and a strongly regularized Ridge regression, which approaches a constant prediction at the intercept.
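
The two extremes can be checked on synthetic data. The sketch below assumes made-up data (`X_toy`, `y_toy`) and compares a weak and a strong penalty against scikit-learn's `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data: three features with known coefficients plus a little noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + 5.0 + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
weak = Ridge(alpha=1e-2).fit(X_toy, y_toy)     # nearly unregularized
strong = Ridge(alpha=1e10).fit(X_toy, y_toy)   # penalty dominates the fit

print("OLS coefficients:   ", np.round(ols.coef_, 3))
print("alpha=1e-2:         ", np.round(weak.coef_, 3))
print("alpha=1e10:         ", np.round(strong.coef_, 6))
print("intercept vs mean y:", strong.intercept_, y_toy.mean())
```

With the strong penalty the slopes are driven toward zero, so the model predicts roughly the mean of `y_toy` for every row.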

In [68]:
coef_df = pd.DataFrame([])

for alpha in range(-2,11):
    coefs = []
    model = Ridge(alpha=10**alpha)
    model.fit(X, y)
    
    #Collect the rounded coefficients; the intercept is appended as the last row
    for coef in model.coef_:
        coefs.append(round(coef,3))
    
    coefs.append(round(model.intercept_,3))
    coef_df = pd.concat([coef_df, pd.DataFrame(coefs)], axis =1)
    
coef_df.columns = range(-2,11)
idx = list(data_description.index.drop('Salary'))
idx.append('intercept')
coef_df.index = idx

In [69]:
coef_df

Unnamed: 0,-2,-1,0,1,2,3,4,5,6,7,8,9,10
AtBat,-1.98,-1.981,-1.987,-2.024,-2.115,-2.08,-1.516,-0.238,0.351,0.237,0.045,0.006,0.001
Hits,7.501,7.502,7.514,7.578,7.653,7.19,4.818,1.441,0.33,0.098,0.016,0.002,0.0
HmRun,4.331,4.329,4.307,4.073,2.893,1.212,-0.124,-0.0,0.023,0.012,0.002,0.0,0.0
Runs,-2.376,-2.376,-2.373,-2.33,-2.092,-1.431,0.408,0.703,0.194,0.057,0.009,0.001,0.0
RBI,-1.045,-1.043,-1.03,-0.904,-0.364,0.234,0.556,0.347,0.131,0.049,0.009,0.001,0.0
Walks,6.231,6.231,6.227,6.199,6.102,5.784,4.239,1.416,0.236,0.048,0.007,0.001,0.0
Years,-3.489,-3.487,-3.468,-3.264,-2.336,-0.894,-0.043,0.001,-0.008,-0.003,-0.0,0.0,0.0
CAtBat,-0.171,-0.171,-0.171,-0.171,-0.177,-0.197,-0.264,-0.353,-0.218,0.011,0.076,0.055,0.012
CHits,0.134,0.133,0.129,0.109,0.092,0.181,0.524,0.874,0.595,0.147,0.04,0.017,0.004
CHmRun,-0.173,-0.174,-0.179,-0.203,-0.201,-0.057,0.178,0.16,0.156,0.064,0.011,0.003,0.0


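Notice how the coefficients tend to zero as the strength of the regularization increases.  The path is not monotone for every coefficient (AtBat, for example, changes sign along the way), but by alpha = 10^10 every coefficient shown is close to zero.  This is the effect of regularization: it shrinks the magnitude of the estimated coefficients to prevent overfitting.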
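
Because the ridge penalty acts on the raw coefficient values, how quickly each one shrinks depends on the scale of its feature. The fits above use the unscaled Hitters columns. A common alternative, not used in this lab, is to standardize the features first. The sketch below illustrates the difference on made-up data with `StandardScaler` and `make_pipeline`:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
small = rng.normal(scale=1.0, size=n)      # feature measured on a small scale
large = rng.normal(scale=1000.0, size=n)   # feature measured on a large scale
X_toy = np.column_stack([small, large])
# Each feature moves y by about 2 units per standard deviation.
y_toy = 2.0 * small + 0.002 * large + rng.normal(scale=0.1, size=n)

raw = Ridge(alpha=1000.0).fit(X_toy, y_toy)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1000.0)).fit(X_toy, y_toy)
scaled_coefs = scaled.named_steps["ridge"].coef_

print("raw coefficients:   ", raw.coef_)
print("scaled coefficients:", scaled_coefs)
```

On the raw data the small-scale feature is shrunk heavily while the large-scale one is barely touched. After standardizing, the same penalty treats the two features alike.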

## Lasso Regression

In [80]:
coef_df = pd.DataFrame([])

for alpha in range(-2,6):
    coefs = []
    model = Lasso(alpha=10**alpha, max_iter = 100000)
    model.fit(X, y)
    
    #Collect the rounded coefficients; the intercept is appended as the last row
    for coef in model.coef_:
        coefs.append(round(coef,3))
    
    coefs.append(round(model.intercept_,3))
    coef_df = pd.concat([coef_df, pd.DataFrame(coefs)], axis =1)
    
coef_df.columns = range(-2,6)
idx = list(data_description.index.drop('Salary'))
idx.append('intercept')
coef_df.index = idx

In [81]:
coef_df

Unnamed: 0,-2,-1,0,1,2,3,4,5
AtBat,-1.98,-1.982,-2.001,-2.013,-1.732,0.194,0.278,0.0
Hits,7.501,7.503,7.51,7.152,5.877,1.035,0.0,0.0
HmRun,4.329,4.313,4.127,1.711,0.0,0.0,0.0,0.0
Runs,-2.376,-2.369,-2.307,-1.722,-0.0,0.0,0.0,0.0
RBI,-1.044,-1.036,-0.957,-0.124,0.0,0.0,0.0,0.0
Walks,6.231,6.227,6.189,6.011,4.767,0.0,0.0,0.0
Years,-3.485,-3.446,-3.077,-0.0,-0.0,-0.0,-0.0,0.0
CAtBat,-0.171,-0.171,-0.172,-0.194,-0.226,-0.301,0.096,0.085
CHits,0.134,0.132,0.124,0.215,0.337,0.707,0.0,0.0
CHmRun,-0.173,-0.174,-0.174,0.0,0.0,0.0,0.0,0.0
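
The table shows the characteristic behavior of the lasso: as alpha grows, coefficients become exactly zero rather than merely small. A minimal sketch on synthetic data (the `X_toy`, `y_toy` values are made up, with only two informative features) contrasts this with ridge:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(200, 5))
# Only the first two features carry signal; the other three are pure noise.
y_toy = 4.0 * X_toy[:, 0] - 3.0 * X_toy[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)
ridge = Ridge(alpha=0.5).fit(X_toy, y_toy)

# Count coefficients that are exactly zero under each penalty.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))

print("lasso coefficients:", np.round(lasso.coef_, 3), "zeros:", lasso_zeros)
print("ridge coefficients:", np.round(ridge.coef_, 3), "zeros:", ridge_zeros)
```

The L1 penalty can remove a feature from the model entirely, which is why the lasso is often used for feature selection. The L2 penalty of ridge only shrinks coefficients toward zero.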


## References

[1] Hastie, Trevor. ISLR v1.2, RDocumentation https://www.rdocumentation.org/packages/ISLR/versions/1.2/topics/Hitters Last Updated October 19th, 2017. Accessed December 4th 2019.