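## [Assignment Focus]
Use the Lasso and Ridge models in Sklearn to train on various datasets. Make sure you understand the **data format** (type and shape) of what is passed to the model for training, and also the meaning of each model parameter.

There are many kinds of machine learning models, but training data mostly follows a fixed format. Once you understand that format, you can get started quickly whenever you apply a new model.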
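
As a minimal sketch of what "data format" means in practice, the snippet below inspects the arrays that would be handed to `fit`. It assumes `load_diabetes` as a stand-in dataset (it is not used elsewhere in this notebook); the point is only that the features form a 2-D array of shape `(n_samples, n_features)` and the target is a 1-D array of length `n_samples`.

```python
import numpy as np
from sklearn import datasets

# Assumption: load_diabetes is only an illustrative stand-in dataset.
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Features: 2-D float array, one row per sample, one column per feature.
print(type(X), X.dtype, X.shape)
# Target: 1-D float array with one value per sample.
print(type(y), y.dtype, y.shape)
# Column names, in the same order as the columns of X.
print(diabetes.feature_names)
```

Any estimator in `sklearn.linear_model` accepts arrays laid out this way, so the same check is worth running before trying a new model.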

## Practice Time
Try other datasets from sklearn datasets (boston, ...) to train your own linear regression model, and add suitable regularization to observe how training behaves.

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
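# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn version to run.
boston = datasets.load_boston()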
train_x, test_x, train_y, test_y = train_test_split(boston.data, boston.target, test_size = 0.2, random_state = 4)
reg_model = linear_model.LinearRegression()
reg_model.fit(train_x, train_y)
y_pred = reg_model.predict(test_x)

In [3]:
print(reg_model.coef_)

[-1.15966452e-01  4.71249231e-02  8.25980146e-03  3.23404531e+00
 -1.66865890e+01  3.88410651e+00 -1.08974442e-02 -1.54129540e+00
  2.93208309e-01 -1.34059383e-02 -9.06296429e-01  8.80823439e-03
 -4.57723846e-01]


In [4]:
print("Mean squared error: %.2f"
      % mean_squared_error(test_y, y_pred))

Mean squared error: 25.42


In [5]:
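## LASSO
# Fit one Lasso model per alpha and collect the coefficients, keyed by alpha.
lasso_coefs = {}
for i in np.arange(0.1, 1.1, 0.1):
    lasso_model = linear_model.Lasso(alpha = i)
    lasso_model.fit(train_x, train_y)
    # After the loop, y_pred holds the predictions of the last alpha only.
    y_pred = lasso_model.predict(test_x)
    lasso_coefs[i] = lasso_model.coef_

# One column per alpha, one row per coefficient.
df = pd.DataFrame(data = lasso_coefs)
df.head(20)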

| coefficient index | 0.1 | 0.2 | 0.30000000000000004 | 0.4 | 0.5 | 0.6 | 0.7000000000000001 | 0.8 | 0.9 | 1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.106189 | -0.10353 | -0.098554 | -0.093577 | -0.088601 | -0.083712 | -0.07913 | -0.074401 | -0.069679 | -0.06495 |
| 1 | 0.048864 | 0.048905 | 0.048701 | 0.048496 | 0.048291 | 0.04797 | 0.047255 | 0.046775 | 0.046295 | 0.045815 |
| 2 | -0.045367 | -0.02915 | -0.023124 | -0.017099 | -0.011074 | -0.004535 | -0.0 | -0.0 | -0.0 | -0.0 |
| 3 | 1.149531 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 |
| 5 | 3.823539 | 3.565228 | 3.263817 | 2.962413 | 2.661018 | 2.366282 | 2.085998 | 1.784474 | 1.482923 | 1.1814 |
| 6 | -0.020898 | -0.015763 | -0.011535 | -0.007307 | -0.003079 | -0.0 | 0.000249 | 0.003863 | 0.007477 | 0.011091 |
| 7 | -1.235906 | -1.173711 | -1.110608 | -1.047505 | -0.984403 | -0.928693 | -0.894686 | -0.84211 | -0.789533 | -0.736958 |
| 8 | 0.260089 | 0.266519 | 0.26323 | 0.259933 | 0.25664 | 0.253314 | 0.248898 | 0.243756 | 0.238645 | 0.2335 |
| 9 | -0.015171 | -0.015741 | -0.015805 | -0.015869 | -0.015933 | -0.015993 | -0.015937 | -0.015794 | -0.015653 | -0.015511 |


In [6]:
print("Mean squared error: %.2f"
      % mean_squared_error(test_y, y_pred))

Mean squared error: 28.95
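
The MSE printed above comes from `y_pred` as it stood when the loop ended, so it reflects only the last alpha. A rough sketch of how the error and the number of zeroed coefficients could be recorded for every alpha is shown below. It assumes `load_diabetes` as a stand-in dataset, and the names `lasso_results` and `n_zero` are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes stands in for the dataset used in the notebook.
X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

lasso_results = {}  # hypothetical name: alpha -> (test MSE, zeroed coefficients)
for alpha in np.arange(0.1, 1.1, 0.1):
    model = linear_model.Lasso(alpha=alpha)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    n_zero = int(np.sum(model.coef_ == 0))
    # Round the key so it reads 0.3 rather than 0.30000000000000004.
    lasso_results[round(float(alpha), 1)] = (mse, n_zero)

for alpha, (mse, n_zero) in lasso_results.items():
    print(f"alpha={alpha:.1f}  MSE={mse:.2f}  zeroed coefficients={n_zero}")
```

Keeping the metric next to each alpha makes it possible to see where extra sparsity starts to cost accuracy.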


In [7]:
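## Observation: Ridge's MSE is lower than LASSO's and very close to that of the original Linear Regression
## My guess is that the features are collinear and Ridge compensates for it
## To check this, try dropping the features whose coefficients LASSO shrank to 0 and refit (not yet sure whether to use Linear Regression or LASSO Regression)
## Drop indices 2, 3, 4 here (note that the column list below also leaves out index 0)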

train_x, test_x, train_y, test_y = train_test_split(boston.data[:, [1,5,6,7,8,9,10,11,12]], boston.target, test_size = 0.2, random_state = 4)
reg_model = linear_model.LinearRegression()
reg_model.fit(train_x, train_y)
y_pred = reg_model.predict(test_x)
print(reg_model.coef_)
print("Mean squared error: %.2f" % mean_squared_error(test_y, y_pred))

[ 0.04595046  4.1925197  -0.02361562 -1.20297264  0.22166016 -0.01641407
 -0.77336512  0.01017504 -0.51418606]
Mean squared error: 26.92
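
The collinearity guess above can also be checked more directly than by refitting. The sketch below is one possible diagnostic, not the notebook's method: it builds a small synthetic feature matrix (an assumption, so the example runs on its own) in which one column is nearly a copy of another, then looks at the pairwise correlations and the condition number of the standardized matrix.

```python
import numpy as np

# Assumption: synthetic data, used only so the sketch is self-contained.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated feature
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between columns; values near +/-1 flag collinear pairs.
corr = np.corrcoef(X, rowvar=False)

# Condition number of the standardized matrix; large values suggest collinearity.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cond = np.linalg.cond(X_std)

print(np.round(corr, 3))
print(f"condition number: {cond:.1f}")
```

Passing the notebook's `train_x` in place of the synthetic `X` would apply the same two checks to the real features.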


In [8]:
## This does not seem to be the right way to use it
lasso_model = linear_model.Lasso(alpha = 1)
lasso_model.fit(train_x, train_y)
y_pred = lasso_model.predict(test_x)
print(lasso_model.coef_)
print("Mean squared error: %.2f" % mean_squared_error(test_y, y_pred))

[ 0.04311275  1.17904988  0.0119182  -0.70235753  0.19694663 -0.01521155
 -0.69236388  0.00762594 -0.71124163]
Mean squared error: 29.36


In [9]:
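## Ridge
# train_x / test_x are still the 9-column subset selected in In [7].
ridge_coefs = {}
for i in np.arange(0.1, 1.1, 0.1):
    ridge_model = linear_model.Ridge(alpha = i)
    ridge_model.fit(train_x, train_y)
    # After the loop, y_pred holds the predictions of the last alpha only.
    y_pred = ridge_model.predict(test_x)
    ridge_coefs[i] = ridge_model.coef_

# One column per alpha, one row per coefficient.
df = pd.DataFrame(data = ridge_coefs)
df.head(10)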

| coefficient index | 0.1 | 0.2 | 0.30000000000000004 | 0.4 | 0.5 | 0.6 | 0.7000000000000001 | 0.8 | 0.9 | 1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.045966 | 0.045981 | 0.045997 | 0.046012 | 0.046028 | 0.046043 | 0.046058 | 0.046073 | 0.046089 | 0.046104 |
| 1 | 4.188784 | 4.185054 | 4.181332 | 4.177616 | 4.173908 | 4.170206 | 4.166511 | 4.162823 | 4.159141 | 4.155467 |
| 2 | -0.023588 | -0.02356 | -0.023532 | -0.023504 | -0.023476 | -0.023448 | -0.023421 | -0.023393 | -0.023365 | -0.023338 |
| 3 | -1.202938 | -1.202902 | -1.202867 | -1.202831 | -1.202795 | -1.202758 | -1.202721 | -1.202684 | -1.202647 | -1.20261 |
| 4 | 0.22174 | 0.221819 | 0.221899 | 0.221978 | 0.222057 | 0.222136 | 0.222215 | 0.222293 | 0.222372 | 0.22245 |
| 5 | -0.016418 | -0.016421 | -0.016425 | -0.016428 | -0.016432 | -0.016435 | -0.016439 | -0.016442 | -0.016446 | -0.016449 |
| 6 | -0.773482 | -0.773598 | -0.773714 | -0.77383 | -0.773945 | -0.774061 | -0.774176 | -0.77429 | -0.774405 | -0.774519 |
| 7 | 0.010173 | 0.010171 | 0.010169 | 0.010167 | 0.010165 | 0.010163 | 0.01016 | 0.010158 | 0.010156 | 0.010154 |
| 8 | -0.51445 | -0.514714 | -0.514977 | -0.51524 | -0.515502 | -0.515764 | -0.516025 | -0.516286 | -0.516546 | -0.516806 |


In [10]:
print("Mean squared error: %.2f"
      % mean_squared_error(test_y, y_pred))

Mean squared error: 26.91
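
In the Ridge table above the coefficients barely move between alpha 0.1 and 1.0, which suggests the range is too narrow to show much shrinkage. As a hedged sketch, the snippet below sweeps alpha over several orders of magnitude and tracks the L2 norm of the coefficients. It assumes `load_diabetes` as a stand-in dataset, and the grid `np.logspace(-2, 3, 6)` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split

# Assumption: load_diabetes stands in for the dataset used in the notebook.
X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Assumption: a log-spaced grid from 0.01 to 1000, chosen only for illustration.
alphas = np.logspace(-2, 3, 6)
coef_norms = []
for alpha in alphas:
    model = linear_model.Ridge(alpha=alpha).fit(X_train, y_train)
    # The L2 norm of the coefficients shrinks as the penalty grows.
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:g}  ||coef||={coef_norms[-1]:.3f}")
```

Unlike Lasso, Ridge shrinks coefficients toward zero without setting them exactly to zero, so the norm is the more informative quantity to watch here.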
