<br>
# Ridge Regression and the Lasso
<br><br>

The GSS (General Social Survey, https://en.wikipedia.org/wiki/General_Social_Survey) has more than 5,900 features.

<br><br>

## Multiple Linear regression with All Features

### Form of multiple linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)
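
As a minimal sketch of the formula above, the prediction for a single observation is the intercept plus the dot product of the coefficients and the feature values. The numbers below are made up for illustration and are not fitted to the Hitters data.

```python
import numpy as np

# Hypothetical values for illustration only (not fitted to any data)
beta_0 = 2.0                        # intercept
beta = np.array([0.5, -1.0, 3.0])   # coefficients beta_1 .. beta_3
x = np.array([4.0, 2.0, 1.0])       # one observation x_1 .. x_3

# y = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*x_3
y_hat = beta_0 + beta @ x
print(y_hat)
```

This is the same computation that a fitted linear model performs for each row at prediction time, with `intercept_` and `coef_` in place of the made-up values.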


In [74]:
# inserted cell

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [75]:
# read the CSV file and preview the raw data (Salary contains missing values)
data = pd.read_csv('data/Hitters.csv')
print(data.head())

# drop rows with missing values and the non-numeric Player name column
data = data.dropna().drop('Player', axis=1)
print(data.head())
data.info()
data.values

              Player  AtBat  Hits  HmRun  Runs  RBI  Walks  Years  CAtBat  \
0     -Andy Allanson    293    66      1    30   29     14      1     293   
1        -Alan Ashby    315    81      7    24   38     39     14    3449   
2       -Alvin Davis    479   130     18    66   72     76      3    1624   
3      -Andre Dawson    496   141     20    65   78     37     11    5628   
4  -Andres Galarraga    321    87     10    39   42     30      2     396   

   CHits    ...      CRuns  CRBI  CWalks  League Division PutOuts  Assists  \
0     66    ...         30    29      14       A        E     446       33   
1    835    ...        321   414     375       N        W     632       43   
2    457    ...        224   266     263       A        W     880       82   
3   1575    ...        828   838     354       N        E     200       11   
4    101    ...         48    46      33       N        E     805       40   

   Errors  Salary  NewLeague  
0      20     NaN          A  
1     

array([[315, 81, 7, ..., 10, 475.0, 'N'],
       [479, 130, 18, ..., 14, 480.0, 'A'],
       [496, 141, 20, ..., 3, 500.0, 'N'],
       ..., 
       [475, 126, 3, ..., 7, 385.0, 'A'],
       [573, 144, 9, ..., 12, 960.0, 'A'],
       [631, 170, 9, ..., 3, 1000.0, 'A']], dtype=object)

In [76]:
# create a Python list of feature names
feature_cols = list(data)
print(type(feature_cols))

feature_cols.remove('Salary')
print(feature_cols)

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# print the first 5 rows
X.head()

<class 'list'>
['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat', 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'League', 'Division', 'PutOuts', 'Assists', 'Errors', 'NewLeague']


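| | AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | N |
| 2 | 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | A |
| 3 | 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | N |
| 4 | 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | N |
| 5 | 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | A |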


In [77]:
# one-hot encode the categorical variables (League, Division, NewLeague)
X=pd.get_dummies(X)
feature_cols=list(X)
print(feature_cols)

['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks', 'Years', 'CAtBat', 'CHits', 'CHmRun', 'CRuns', 'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors', 'League_A', 'League_N', 'Division_E', 'Division_W', 'NewLeague_A', 'NewLeague_N']


In [78]:
# check the type and shape of X
print(type(X))
print(X.shape)

<class 'pandas.core.frame.DataFrame'>
(263, 22)


In [79]:
# select a Series from the DataFrame
y = data['Salary']

# equivalent command that works if there are no spaces in the column name
y = data.Salary

# print the first 5 values
y.head()

1    475.0
2    480.0
3    500.0
4     91.5
5    750.0
Name: Salary, dtype: float64

In [80]:
# check the type and shape of y
print(type(y))
print(y.shape)

<class 'pandas.core.series.Series'>
(263,)


## Splitting X and y into training and testing sets

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [82]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(197, 22)
(197,)
(66, 22)
(66,)


## Multiple Linear Regression in scikit-learn

In [83]:
# import model
#from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### Interpreting model coefficients

In [84]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

128.844224908
[ -2.16123855   7.34282876  -2.50752867  -1.73702779   1.95041682
   5.08249087   4.76875394  -0.29124969   1.10059463   2.48996468
   0.24883038  -0.08433489  -0.47381542   0.30655117   0.13939092
   2.5437623  -19.28537989  19.28537989  53.70328044 -53.70328044
   1.63433958  -1.63433958]


In [85]:
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))

[('AtBat', -2.1612385540872046),
 ('Hits', 7.3428287563015555),
 ('HmRun', -2.5075286723146166),
 ('Runs', -1.737027785518094),
 ('RBI', 1.9504168201617955),
 ('Walks', 5.0824908748466084),
 ('Years', 4.7687539365942397),
 ('CAtBat', -0.29124969479297108),
 ('CHits', 1.1005946257236663),
 ('CHmRun', 2.4899646751856226),
 ('CRuns', 0.24883038096072063),
 ('CRBI', -0.084334892917219406),
 ('CWalks', -0.47381541770115998),
 ('PutOuts', 0.3065511689111029),
 ('Assists', 0.13939092129592878),
 ('Errors', 2.5437623038945025),
 ('League_A', -19.28537988734087),
 ('League_N', 19.285379887340845),
 ('Division_E', 53.703280436242366),
 ('Division_W', -53.703280436242416),
 ('NewLeague_A', 1.6343395782361423),
 ('NewLeague_N', -1.6343395782361441)]


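- Holding all other features fixed, a one-unit increase in a feature is associated with a change in predicted `Salary` equal to that feature's coefficient (for example, about 7.34 per additional hit for `Hits`).
- This is a statement of **association**, not **causation**.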
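
The coefficient pairs for `League_A`/`League_N`, `Division_E`/`Division_W`, and `NewLeague_A`/`NewLeague_N` above are equal in size and opposite in sign. That pattern is what one would expect when both indicator columns of a two-level variable are kept, because the two columns always sum to 1. A possible alternative, sketched here on a tiny made-up frame rather than the Hitters data, is to keep one indicator per variable with `drop_first=True`.

```python
import pandas as pd

# Tiny made-up frame with the same kind of two-level categorical column
toy = pd.DataFrame({"Hits": [81, 130, 141], "League": ["N", "A", "N"]})

# Default: one indicator column per level (League_A + League_N == 1 in every row)
full = pd.get_dummies(toy)

# drop_first=True keeps a single indicator for a two-level column
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With a single indicator, its coefficient reads as the difference between the two levels. The notebook itself keeps both columns, so the results shown here are unaffected.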


### Making predictions

In [86]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

We need an **evaluation metric** in order to compare our predictions with the actual values!
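
The two metrics used in this notebook can be written out from their definitions. The sketch below uses hypothetical helper names (`rmse`, `r_squared`) and made-up predictions; it is meant to show what the numbers measure, not to replace the scikit-learn functions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in units of y."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: share of variance explained relative to predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true_toy = [475.0, 480.0, 500.0, 91.5]    # first four Salary values shown earlier
y_pred_toy = [400.0, 500.0, 450.0, 150.0]   # made-up predictions, for illustration only

print(rmse(y_true_toy, y_pred_toy))
print(r_squared(y_true_toy, y_pred_toy))
```

A model that always predicts the mean of `y_true` scores an $R^2$ of 0 under this definition, and a perfect model scores 1.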

### Computing  $R^2$

In [87]:
print(linreg.score(X_test, y_test))

0.40034574192


### Computing the RMSE 

In [88]:
print(np.sqrt(mean_squared_error(y_test, y_pred)))

366.200465502


# Is it better to include all features?

In [89]:
from sklearn.linear_model import Ridge

linridge = Ridge(alpha=20.0).fit(X_train, y_train)

print('Hitters dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Hitters dataset
ridge regression linear model intercept: 129.8072255926499
ridge regression linear model coeff:
[ -2.21206977   7.42949689  -2.88216299  -1.68533396   2.1808145
   5.01377054   4.38972055  -0.28511991   1.08296061   2.51756161
   0.25680383  -0.10550198  -0.46901896   0.31083184   0.14478788
   2.670955   -11.53995546  11.53995546  43.86020908 -43.86020908
  -3.69027841   3.69027841]
R-squared score (training): 0.566
R-squared score (test): 0.396
Number of non-zero features: 22


### Making predictions

In [90]:
# make predictions on the testing set
y_pred = linridge.predict(X_test)

### Computing the RMSE 

In [91]:
print(np.sqrt(mean_squared_error(y_test, y_pred)))

367.402939586


### Ridge regression with feature normalization

In [92]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

linridge = Ridge(alpha=3.0).fit(X_train_scaled, y_train)

print('Hitters dataset')
print('ridge regression linear model intercept: {}'
     .format(linridge.intercept_))
print('ridge regression linear model coeff:\n{}'
     .format(linridge.coef_))
print('R-squared score (training): {:.3f}'
     .format(linridge.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}'
     .format(linridge.score(X_test_scaled, y_test)))
print('Number of non-zero features: {}'
     .format(np.sum(linridge.coef_ != 0)))

Hitters dataset
ridge regression linear model intercept: -12.793008384039695
ridge regression linear model coeff:
[   5.0774865   255.15088117  -19.38159624   81.80690095  179.13099038
  183.38415493   36.05652194  112.54263517  187.67343381  237.63220265
  159.80575218  305.35757756    5.49891995  333.9174685   -42.2374294
   30.77030564   -9.84591746    9.84591746   58.58832409  -58.58832409
  -10.74020369   10.74020369]
R-squared score (training): 0.488
R-squared score (test): 0.421
Number of non-zero features: 22
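
The scaling step above learns the column minima and maxima from the training set only and then applies the same transformation to the test set. A rough sketch of that idea, using hypothetical helpers (`minmax_fit`, `minmax_transform`) and made-up numbers rather than the scikit-learn implementation:

```python
import numpy as np

def minmax_fit(X_train):
    """Learn per-column minimum and range from the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # guard against constant columns
    return col_min, col_range

def minmax_transform(X, col_min, col_range):
    """Apply the training-set scaling to any data with the same columns."""
    return (np.asarray(X, dtype=float) - col_min) / col_range

train_toy = np.array([[293, 1], [315, 14], [479, 3]])   # made-up rows
test_toy = np.array([[496, 11]])

col_min, col_range = minmax_fit(train_toy)
train_scaled = minmax_transform(train_toy, col_min, col_range)
test_scaled = minmax_transform(test_toy, col_min, col_range)

print(train_scaled)
print(test_scaled)
```

Training columns land in [0, 1], while test values can fall outside that interval when they exceed the training range. Fitting the scaler on the training set alone keeps information about the test set out of the model.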


### Making predictions

In [95]:
# make predictions on the testing set
y_pred = linridge.predict(X_test_scaled)

### Computing the RMSE 

In [96]:
print(np.sqrt(mean_squared_error(y_test, y_pred)))

359.84821519


#### Ridge regression with regularization parameter: alpha

In [97]:
print('Ridge regression: effect of alpha regularization parameter\n')
for this_alpha in [0, 0.5, 1, 1.5, 2, 3, 4, 5, 10, 20, 50]:
    linridge = Ridge(alpha = this_alpha).fit(X_train_scaled, y_train)
    r2_train = linridge.score(X_train_scaled, y_train)
    r2_test = linridge.score(X_test_scaled, y_test)
    num_coeff_bigger = np.sum(abs(linridge.coef_) > 1.0)
    print('Alpha = {:.2f}\nnum abs(coeff) > 1.0: {}, \
r-squared training: {:.2f}, r-squared test: {:.2f}\n'
         .format(this_alpha, num_coeff_bigger, r2_train, r2_test))

Ridge regression: effect of alpha regularization parameter

Alpha = 0.00
num abs(coeff) > 1.0: 22, r-squared training: 0.57, r-squared test: 0.40

Alpha = 0.50
num abs(coeff) > 1.0: 22, r-squared training: 0.53, r-squared test: 0.42

Alpha = 1.00
num abs(coeff) > 1.0: 22, r-squared training: 0.51, r-squared test: 0.42

Alpha = 1.50
num abs(coeff) > 1.0: 22, r-squared training: 0.50, r-squared test: 0.42

Alpha = 2.00
num abs(coeff) > 1.0: 22, r-squared training: 0.50, r-squared test: 0.42

Alpha = 3.00
num abs(coeff) > 1.0: 22, r-squared training: 0.49, r-squared test: 0.42

Alpha = 4.00
num abs(coeff) > 1.0: 22, r-squared training: 0.48, r-squared test: 0.42

Alpha = 5.00
num abs(coeff) > 1.0: 22, r-squared training: 0.48, r-squared test: 0.42

Alpha = 10.00
num abs(coeff) > 1.0: 22, r-squared training: 0.46, r-squared test: 0.40

Alpha = 20.00
num abs(coeff) > 1.0: 22, r-squared training: 0.43, r-squared test: 0.38

Alpha = 50.00
num abs(coeff) > 1.0: 20, r-squared training: 0.35, r-

Ill-conditioned matrix detected. Result is not guaranteed to be accurate.
Reciprocal condition number: 1.5549567060582057e-17
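
The sweep above shows training $R^2$ falling as alpha grows. The textbook closed-form ridge solution makes the mechanism visible: alpha is added to the diagonal of the Gram matrix, which shrinks the coefficients. The sketch below is a simplified version on synthetic data with a hypothetical helper `ridge_closed_form`; the solver scikit-learn actually uses may differ. It also includes a complementary pair of indicator columns, which leaves the matrix singular when alpha is 0, a plausible reason for the ill-conditioned warning above.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Textbook ridge fit with an unpenalized intercept (centered X and y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])   # alpha added to the diagonal
    coef = np.linalg.solve(gram, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return intercept, coef

# Synthetic data: one numeric feature plus a complementary pair of indicators
rng = np.random.default_rng(0)
league_a = rng.integers(0, 2, size=50)
X_toy = np.column_stack([rng.normal(size=50), league_a, 1 - league_a])
y_toy = 3.0 * X_toy[:, 0] + 2.0 * league_a + rng.normal(scale=0.1, size=50)

# The centered indicator columns are exact negatives of each other,
# so the Gram matrix is rank-deficient before alpha is added
Xc_toy = X_toy - X_toy.mean(axis=0)
print(np.linalg.matrix_rank(Xc_toy.T @ Xc_toy))

for alpha in [0.5, 5.0, 50.0]:
    _, coef = ridge_closed_form(X_toy, y_toy, alpha)
    print(alpha, coef, np.linalg.norm(coef))
```

For any positive alpha the system is solvable, the coefficient norm shrinks as alpha increases, and the two indicator coefficients come out equal and opposite, matching the pattern in the fitted models above.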


### Lasso regression

In [98]:
from sklearn.linear_model import Lasso

# reuse X_train_scaled and X_test_scaled computed in the ridge section above

linlasso = Lasso(alpha=1.8, max_iter = 10000).fit(X_train_scaled, y_train)

print('Hitters dataset')
print('lasso regression linear model intercept: {}'
     .format(linlasso.intercept_))
print('lasso regression linear model coeff:\n{}'
     .format(linlasso.coef_))
print('Non-zero features: {}'
     .format(np.sum(linlasso.coef_ != 0)))
print('R-squared score (training): {:.3f}'
     .format(linlasso.score(X_train_scaled, y_train)))
print('R-squared score (test): {:.3f}\n'
     .format(linlasso.score(X_test_scaled, y_test)))
print('Features with non-zero weight (sorted by absolute magnitude):')

for e in sorted(zip(list(X), linlasso.coef_),
                key=lambda e: -abs(e[1])):
    if e[1] != 0:
        print('\t{}, {:.3f}'.format(e[0], e[1]))

Hitters dataset
lasso regression linear model intercept: -61.787463229131504
lasso regression linear model coeff:
[ -7.09206971e+01   5.53786137e+02  -0.00000000e+00   0.00000000e+00
   4.02874040e+00   1.90235004e+02  -0.00000000e+00  -0.00000000e+00
   0.00000000e+00   7.17932089e+01   0.00000000e+00   8.96141770e+02
  -0.00000000e+00   4.10360467e+02  -1.55304220e+01   0.00000000e+00
  -1.19044960e+01   0.00000000e+00   1.14722756e+02  -4.88100959e-12
  -1.69587447e+01   0.00000000e+00]
Non-zero features: 12
R-squared score (training): 0.502
R-squared score (test): 0.426

Features with non-zero weight (sorted by absolute magnitude):
	CRBI, 896.142
	Hits, 553.786
	PutOuts, 410.360
	Walks, 190.235
	Division_E, 114.723
	CHmRun, 71.793
	AtBat, -70.921
	NewLeague_A, -16.959
	Assists, -15.530
	League_A, -11.904
	RBI, 4.029
	Division_W, -0.000
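
Note that `Division_W` is listed with a weight of -0.000: its coefficient is about -4.9e-12, which passes the exact `!= 0` test even though it is zero for practical purposes. One possible way to count only meaningfully non-zero weights is to compare against a small tolerance. The threshold `1e-8` below is an assumption chosen for illustration, and the array holds a handful of values copied from the output above.

```python
import numpy as np

# A few coefficients copied from the lasso output above
coef = np.array([-7.09206971e+01, 5.53786137e+02, -0.0,
                 1.14722756e+02, -4.88100959e-12, -1.69587447e+01])

# Exact comparison counts the numerically tiny value as non-zero
exact_nonzero = int(np.sum(coef != 0))

# Tolerance-based comparison (tol is an assumed, illustrative threshold)
tol = 1e-8
effective_nonzero = int(np.sum(np.abs(coef) > tol))

print(exact_nonzero, effective_nonzero)   # prints: 5 4
```

Applied to the full coefficient vector, the same idea would report 11 features rather than 12 for this fit.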


### Making predictions

In [99]:
# make predictions on the testing set
y_pred = linlasso.predict(X_test_scaled)

### Computing the RMSE 

In [100]:
print(np.sqrt(mean_squared_error(y_test, y_pred)))

358.297774977


#### Lasso regression with regularization parameter: alpha

In [101]:
print('Lasso regression: effect of alpha regularization\n\
parameter on number of features kept in final model\n')

for alpha in [0.1, 0.5, 0.8,  1, 1.3, 1.5, 1.8, 2, 2.5, 3]:
    linlasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train_scaled, y_train)
    r2_train = linlasso.score(X_train_scaled, y_train)
    r2_test = linlasso.score(X_test_scaled, y_test)
    
    print('Alpha = {:.2f}\nFeatures kept: {}, r-squared training: {:.2f}, \
r-squared test: {:.2f}\n'
         .format(alpha, np.sum(linlasso.coef_ != 0), r2_train, r2_test))

Lasso regression: effect of alpha regularization
parameter on number of features kept in final model

Alpha = 0.10
Features kept: 20, r-squared training: 0.56, r-squared test: 0.41

Alpha = 0.50
Features kept: 20, r-squared training: 0.55, r-squared test: 0.42

Alpha = 0.80
Features kept: 17, r-squared training: 0.54, r-squared test: 0.42

Alpha = 1.00
Features kept: 15, r-squared training: 0.53, r-squared test: 0.43

Alpha = 1.30
Features kept: 13, r-squared training: 0.52, r-squared test: 0.43

Alpha = 1.50
Features kept: 14, r-squared training: 0.51, r-squared test: 0.43

Alpha = 1.80
Features kept: 12, r-squared training: 0.50, r-squared test: 0.43

Alpha = 2.00
Features kept: 12, r-squared training: 0.50, r-squared test: 0.42

Alpha = 2.50
Features kept: 11, r-squared training: 0.50, r-squared test: 0.42

Alpha = 3.00
Features kept: 11, r-squared training: 0.50, r-squared test: 0.42
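
The sweep above shows the lasso discarding more features as alpha grows, whereas ridge kept all 22. One way to see why is the soft-thresholding operator that appears in coordinate-descent solvers for the lasso: values within the threshold are set exactly to zero and the rest are shrunk toward zero. This is a simplified view with a hypothetical helper `soft_threshold` and made-up inputs, not the scikit-learn implementation.

```python
import numpy as np

def soft_threshold(z, threshold):
    """Shrink z toward zero by `threshold`; values within the threshold become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

# Made-up unpenalized coefficient values, for illustration only
z = np.array([-3.0, -0.4, 0.2, 1.5])

for threshold in [0.0, 0.5, 2.0]:
    shrunk = soft_threshold(z, threshold)
    print(threshold, shrunk, int(np.sum(shrunk != 0)))
```

A larger threshold zeroes out more entries, which mirrors the falling "Features kept" count in the output above.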

