# Regularized Regression Models

Following Multiple Linear Regression (MLR), this exercise works through three models that use regularization: Ridge, Lasso, and Elastic Net.
As in the MLR exercise, we load a dataset from scikit-learn and then build the models.

In [1]:
from sklearn import datasets
import pandas as pd
data = datasets.load_diabetes()
print(data.keys())

dict_keys(['data', 'target', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])


In [2]:
X = data.data
Y = data.target
print(X.shape)

(442, 10)


In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((309, 10), (133, 10), (309,), (133,))

In [4]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

In [5]:
mlr = LinearRegression()
lasso = Lasso(alpha=0.1)
ridge = Ridge(alpha=0.1)
ela = ElasticNet(alpha=1.0 , l1_ratio=0.5)

In [6]:
mlr.fit(X_train, Y_train)
lasso.fit(X_train, Y_train)
ridge.fit(X_train, Y_train)
ela.fit(X_train, Y_train)

ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
           max_iter=1000, normalize=False, positive=False, precompute=False,
           random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

### Checking and Analyzing the Results

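+ To check each model's performance, we use the `score` method that every estimator provides. `score` returns the R-squared (coefficient of determination) value.


+ We check performance on the training data and on the test data separately. In addition, to see the variable-selection effect of the regularized models, we count the coefficients that are not 0, which gives the number of variables each model actually uses.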
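
As a minimal sketch of what `score` reports, the R-squared value can be recomputed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The variable names below (`X_tr`, `r2_manual`, and so on) are chosen for this illustration and are not part of the original notebook; the split uses the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same data and split settings as in the notebook (names are illustrative).
X, y = datasets.load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# R^2 = 1 - SS_res / SS_tot, computed on the test set.
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# All three values should agree up to floating-point error.
print(r2_manual, r2_score(y_te, pred), model.score(X_te, y_te))
```

Because the baseline is the mean of the evaluated targets, R-squared can be slightly negative on test data when a model predicts worse than that mean.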

In [7]:
# MLR score (ordinary least squares has no alpha parameter)
mlr_train_score = mlr.score(X_train, Y_train)
mlr_test_score = mlr.score(X_test, Y_test)
mlr_coeff_used = np.sum(mlr.coef_ != 0)
print("training score:", mlr_train_score)
print("test score: ", mlr_test_score)
print("number of features used:", mlr_coeff_used)

training score: 0.5539411781927147
test score:  0.39289398450747565
number of features used: 10


In [8]:
# Ridge score
ridge_train_score=ridge.score(X_train,Y_train)
ridge_test_score=ridge.score(X_test,Y_test)
ridge_coeff_used = np.sum(ridge.coef_!=0)
print ("training score for alpha=0.1:", ridge_train_score)
print ("test score for alpha =0.1: ", ridge_test_score)
print ("number of features used: for alpha =0.1:", ridge_coeff_used)

training score for alpha=0.1: 0.5483467396437521
test score for alpha =0.1:  0.4021292749449725
number of features used: for alpha =0.1: 10


In [9]:
# Lasso score
lasso_train_score=lasso.score(X_train,Y_train)
lasso_test_score=lasso.score(X_test,Y_test)
lasso_coeff_used = np.sum(lasso.coef_!=0)
print ("training score for alpha=0.1:", lasso_train_score)
print ("test score for alpha =0.1: ", lasso_test_score)
print ("number of features used: for alpha =0.1:", lasso_coeff_used)

training score for alpha=0.1: 0.5469887155387941
test score for alpha =0.1:  0.38754999548270175
number of features used: for alpha =0.1: 7


In [10]:
# ElasticNet score
ela_train_score=ela.score(X_train,Y_train)
ela_test_score=ela.score(X_test,Y_test)
ela_coeff_used = np.sum(ela.coef_!=0)
print ("training score for alpha=1 & l1_ratio=0.5:", ela_train_score)
print ("test score for alpha=1 & l1_ratio=0.5: ", ela_test_score)
print ("number of features used: for alpha=1 & l1_ratio=0.5:", ela_coeff_used)

training score for alpha=1 & l1_ratio=0.5: 0.010751863222008828
test score for alpha=1 & l1_ratio=0.5:  0.00879606043811132
number of features used: for alpha=1 & l1_ratio=0.5: 9


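We vary Lasso's `alpha` parameter and observe how performance and the number of variables used change.  
As alpha grows, fewer variables are selected (from 9 at alpha=0.01 down to 0 at alpha=10), and the resulting loss of information lowers the model's accuracy: the scores change little up to alpha=0.1 but drop sharply from alpha=1 onward.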

In [11]:
alpha_list = [0.01, 0.05, 0.1, 1, 10]
for alpha in alpha_list:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, Y_train)
    lasso_train_score=lasso.score(X_train,Y_train)
    lasso_test_score=lasso.score(X_test,Y_test)
    lasso_coeff_used = np.sum(lasso.coef_!=0)
    print ("training score for alpha={}:".format(alpha), lasso_train_score)
    print ("test score for alpha ={}: ".format(alpha), lasso_test_score)
    print ("number of features used: for alpha ={}:".format(alpha), lasso_coeff_used)
    print("==========================================================")

training score for alpha=0.01: 0.5532544496972629
test score for alpha =0.01:  0.3878927565620651
number of features used: for alpha =0.01: 9
training score for alpha=0.05: 0.5515636960790138
test score for alpha =0.05:  0.388960344614822
number of features used: for alpha =0.05: 8
training score for alpha=0.1: 0.5469887155387941
test score for alpha =0.1:  0.38754999548270175
number of features used: for alpha =0.1: 7
training score for alpha=1: 0.4156569544849773
test score for alpha =1:  0.30577836095304156
number of features used: for alpha =1: 2
training score for alpha=10: 0.0
test score for alpha =10:  -4.088943807989409e-07
number of features used: for alpha =10: 0
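
To see which variables survive at each alpha, not only how many, the nonzero coefficients can be matched against `feature_names`. The following is a rough sketch; `selected_features` is a hypothetical helper introduced here for illustration, and the split mirrors the one used above.

```python
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

data = datasets.load_diabetes()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)

def selected_features(alpha):
    """Hypothetical helper: names of features whose Lasso coefficient is nonzero."""
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    return [name for name, coef in zip(data.feature_names, model.coef_) if coef != 0]

# Print the surviving feature names for a few alpha values.
for alpha in [0.01, 0.1, 1, 10]:
    print(alpha, selected_features(alpha))
```

The exact names printed may depend on the scikit-learn version, so they are not listed here.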


In Elastic Net, the `l1_ratio` parameter sets the relative weight of the L1 and L2 penalties. The experiment below varies `l1_ratio` with alpha fixed at 1.
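
For reference, the objective that scikit-learn documents for `ElasticNet` can be written as follows. The symbols are notation chosen for this note rather than taken from the original notebook: $n$ is the number of samples, $w$ the coefficient vector, $\alpha$ stands for `alpha`, and $\rho$ stands for `l1_ratio`.

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2
$$

With $\rho = 1$ the L2 term vanishes and the problem reduces to Lasso with the same alpha, while smaller values of $\rho$ shift weight from the L1 penalty to the L2 penalty.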

In [12]:
l1_ratio_list = [0.2, 0.4, 0.6, 0.8, 1.0]
for l1_ratio in l1_ratio_list:
    ela = ElasticNet(alpha=1, l1_ratio=l1_ratio)
    ela.fit(X_train, Y_train)
    ela_train_score=ela.score(X_train,Y_train)
    ela_test_score=ela.score(X_test,Y_test)
    ela_coeff_used = np.sum(ela.coef_!=0)
    print ("training score for l1_ratio={}:".format(l1_ratio), ela_train_score)
    print ("test score for l1_ratio ={}: ".format(l1_ratio), ela_test_score)
    print ("number of features used: for l1_ratio ={}:".format(l1_ratio), ela_coeff_used)
    print("==========================================================")

training score for l1_ratio=0.2: 0.00838070714792083
test score for l1_ratio =0.2:  0.0068681995133402785
number of features used: for l1_ratio =0.2: 10
training score for l1_ratio=0.4: 0.009701852290429303
test score for l1_ratio =0.4:  0.007942573471441339
number of features used: for l1_ratio =0.4: 9
training score for l1_ratio=0.6: 0.01231491863910883
test score for l1_ratio =0.6:  0.01006632721166345
number of features used: for l1_ratio =0.6: 9
training score for l1_ratio=0.8: 0.02015615128867887
test score for l1_ratio =0.8:  0.016479494771337033
number of features used: for l1_ratio =0.8: 7
training score for l1_ratio=1.0: 0.4156569544849773
test score for l1_ratio =1.0:  0.30577836095304156
number of features used: for l1_ratio =1.0: 2
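
The `l1_ratio=1.0` row reproduces the Lasso result for alpha=1 from the earlier experiment, which is expected because Elastic Net with a pure L1 penalty is Lasso. A small sketch of that check follows; the variable names are illustrative, and the split again assumes the same settings as above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.model_selection import train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit Lasso and Elastic Net with a pure L1 penalty at the same alpha.
lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)
enet = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_tr, y_tr)

# The two coefficient vectors should coincide.
same_coef = np.allclose(lasso.coef_, enet.coef_)
print("coefficients match:", same_coef)
print("test R^2:", lasso.score(X_te, y_te), enet.score(X_te, y_te))
```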
