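# Capstone Exercise Model Answer: Regression, DAY 2
- Build a model that predicts usd_pledged_real for the Kaggle Kickstarter Projects dataset
    - https://www.kaggle.com/kemical/kickstarter-projects?select=ks-projects-201801.csv
- DAY 2 covers the following
    - Model validation
    - Preprocessing
    - Regularization and hyperparameter search
    - Using an SVM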

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, cross_validate, KFold, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

In [2]:
df = pd.read_csv('../data/df_regression.csv', index_col='ID')
df.head()

ID,usd_pledged_real,usd_goal_real,period,log_usd_goal,log_usd_pledged,n_words,main_category_Comics,main_category_Crafts,main_category_Dance,main_category_Design,...,currency_EUR,currency_GBP,currency_HKD,currency_JPY,currency_MXN,currency_NOK,currency_NZD,currency_SEK,currency_SGD,currency_USD
1000002330,0.0,1533.95,58,3.185811,-5.0,6,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1000003930,2421.0,30000.0,59,4.477121,3.383995,8,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1000004038,220.0,45000.0,44,4.653213,2.342423,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1000007540,1.0,5000.0,29,3.69897,4e-06,7,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1000011046,1283.0,19500.0,55,4.290035,3.108227,8,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


## Model validation
- Validate the model with the holdout method
- Use the linear regression implemented in Day 1

In [3]:
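# Drop the target and its log transform so the features do not leak the target
X = df.drop(columns=['log_usd_pledged', 'usd_pledged_real'])
y = df['usd_pledged_real']

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1234)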

In [4]:
lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)

LinearRegression()

In [5]:
y_pred = lr_reg.predict(X_test)

In [6]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'MAE: {mae:.3}')
print(f'MSE: {mse:.3}')
print(f'RMSE: {rmse:.3}')

MAE: 1.4e+04
MSE: 4.9e+09
RMSE: 7e+04
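
As a reminder of how the three metrics printed above relate to each other, here is a minimal sketch that computes them by hand and checks the result against scikit-learn. The arrays `y_true` and `y_hat` are hypothetical toy values, not taken from the Kickstarter data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy values (hypothetical, not taken from the Kickstarter data)
y_true = np.array([0.0, 100.0, 200.0, 10000.0])
y_hat = np.array([50.0, 100.0, 300.0, 4000.0])

residual = y_true - y_hat
mae_manual = np.mean(np.abs(residual))  # mean absolute error
mse_manual = np.mean(residual ** 2)     # mean squared error
rmse_manual = np.sqrt(mse_manual)       # same units as the target

# The hand-computed values agree with scikit-learn
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mse_manual, rmse_manual)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest. The holdout result above (RMSE 7e+04 against MAE 1.4e+04) is consistent with a small number of very large errors.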


- Validate the model with cross-validation
- Use the linear regression implemented in Day 1

In [7]:
lr_reg_cv = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=1234)
cv_results = cross_validate(lr_reg_cv, X_train, y_train, cv=kf, return_estimator=True,
                            scoring=('neg_mean_squared_error', 'neg_mean_absolute_error'))

In [8]:
cv_results.keys()

dict_keys(['fit_time', 'score_time', 'estimator', 'test_neg_mean_squared_error', 'test_neg_mean_absolute_error'])

In [9]:
mse = - cv_results['test_neg_mean_squared_error'].mean()
mae = - cv_results['test_neg_mean_absolute_error'].mean()
rmse = np.sqrt(mse)

print(f'MAE: {mae:.3}')
print(f'MSE: {mse:.3}')
print(f'RMSE: {rmse:.3}')

MAE: 1.44e+04
MSE: 9.54e+09
RMSE: 9.77e+04
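
The averages above hide how much the folds differ from each other. The following sketch, which uses synthetic stand-in data from `make_regression` rather than the Kickstarter dataset (an assumption for illustration), shows one way to look at per-fold errors and at the two common ways of summarizing RMSE across folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: not the Kickstarter dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1234)
cv_demo = cross_validate(LinearRegression(), X_demo, y_demo, cv=kf_demo,
                         scoring=('neg_mean_squared_error', 'neg_mean_absolute_error'))

# Scores are negated errors, so flip the sign to recover the errors
fold_mse = -cv_demo['test_neg_mean_squared_error']
fold_rmse = np.sqrt(fold_mse)

print('per-fold RMSE:        ', fold_rmse)
print('mean of per-fold RMSE:', fold_rmse.mean())
print('sqrt of mean MSE:     ', np.sqrt(fold_mse.mean()))
```

The notebook reports the square root of the mean MSE. That value is always at least as large as the mean of the per-fold RMSE values, so the two summaries should not be mixed when comparing models.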


## Preprocessing
- Standardize the continuous variables

In [10]:
std = StandardScaler()
scale_cols = ['log_usd_goal', 'period']
# Copy first to avoid SettingWithCopyWarning; fit on the training split only
X_train, X_test = X_train.copy(), X_test.copy()
X_train[scale_cols] = std.fit_transform(X_train[scale_cols])
X_test[scale_cols] = std.transform(X_test[scale_cols])


In [11]:
lr_reg = LinearRegression()
lr_reg.fit(X_train, y_train)
y_pred = lr_reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'MAE: {mae:.3}')
print(f'MSE: {mse:.3}')
print(f'RMSE: {rmse:.3}')

MAE: 1.4e+04
MSE: 4.9e+09
RMSE: 7e+04


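- For ordinary least squares, standardization does not change the predictions (the metrics above are identical to the unscaled run): the model is solved in closed form, and rescaling or shifting a feature is absorbed by its coefficient and the intercept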
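
The invariance noted above can be checked on a small synthetic example. This is a sketch under assumed data (two made-up features on very different scales), not a rerun of the notebook. It also fits `Ridge` with an arbitrary `alpha=10.0` to show that the invariance no longer holds once a penalty on the coefficients is added.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (assumed data)
X_demo = np.column_stack([rng.normal(0, 1, 300), rng.normal(50, 20, 300)])
y_demo = 3.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(0, 1, 300)

X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr = y_demo[:200]

scaler = StandardScaler().fit(X_tr)  # fit on the training part only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Ordinary least squares: predictions agree with and without scaling
pred_raw = LinearRegression().fit(X_tr, y_tr).predict(X_te)
pred_std = LinearRegression().fit(X_tr_s, y_tr).predict(X_te_s)
print('OLS identical:  ', np.allclose(pred_raw, pred_std))

# Ridge: the penalty depends on coefficient size, so scaling matters
ridge_raw = Ridge(alpha=10.0).fit(X_tr, y_tr).predict(X_te)
ridge_std = Ridge(alpha=10.0).fit(X_tr_s, y_tr).predict(X_te_s)
print('Ridge identical:', np.allclose(ridge_raw, ridge_std))
```

This is why feature scaling becomes relevant for the regularized models and the SVM used later, even though it makes no difference for plain linear regression.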

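## Regularization and hyperparameter search
- For linear regression with polynomial features up to degree 2, apply the following regularization, and search for the regularization parameter with a grid search
    - L_2 regularization (Ridge)
- The following classes are used
    - Pipeline: a class for chaining several estimators and transformers together.
    - GridSearchCV: a class for running a grid search. When it is combined with Pipeline, note that the step name and the parameter name are joined with `__` (for example `reg__alpha`)
- Note that running this can take up to about 30 minutes (the run logged below finished in about 1.2 minutes)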
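
The `__` naming convention can be explored without running a search. The sketch below builds a pipeline with the same step names as the cell that follows (`poly` and `reg`) and lists the parameter names it exposes. The values passed to `set_params` are arbitrary illustrations.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe_demo = Pipeline([('poly', PolynomialFeatures(degree=2)), ('reg', Ridge())])

# Every tunable parameter is addressable as '<step name>__<parameter name>'
param_names = sorted(pipe_demo.get_params().keys())
print([name for name in param_names if '__' in name])

# set_params uses the same naming scheme that GridSearchCV relies on
pipe_demo.set_params(reg__alpha=10.0, poly__degree=3)
print(pipe_demo.named_steps['reg'].alpha, pipe_demo.named_steps['poly'].degree)
```

Any name printed by the first statement can be used as a key in `param_grid`.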

### Ridge

In [12]:
degree = 2
poly = PolynomialFeatures(degree)

# 'reg__alpha' targets the alpha parameter of the pipeline step named 'reg'
parameters = {'reg__alpha': [1e12, 1e14, 1e16, 1e18, 1e20]}

reg_pl = Pipeline([("poly", poly), ("reg", Ridge())])

grid = GridSearchCV(reg_pl, param_grid=parameters,
                    cv=kf,
                    scoring='neg_mean_squared_error',
                    verbose=3)

grid.fit(X_train, y_train)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] reg__alpha=1000000000000.0 ......................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  reg__alpha=1000000000000.0, score=-5799362813.056, total=   2.9s
[CV] reg__alpha=1000000000000.0 ......................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.9s remaining:    0.0s
[CV]  reg__alpha=1000000000000.0, score=-16058799536.362, total=   2.8s
[CV] reg__alpha=1000000000000.0 ......................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    5.8s remaining:    0.0s
[CV]  reg__alpha=1000000000000.0, score=-10336964941.643, total=   2.8s
[CV] reg__alpha=1000000000000.0 ......................................
[CV]  reg__alpha=1000000000000.0, score=-10946439539.822, total=   2.8s
[CV] reg__alpha=1000000000000.0 ......................................
[CV]  reg__alpha=1000000000000.0, score=-5511865594.479, total=   2.8s
[CV] reg__alpha=100000000000000.0 ....................................
[CV]  reg__alpha=100000000000000.0, score=-5790802779.586, total=   2.8s
[CV] reg__alpha=100000000000000.0 ....................................
[CV]  reg__alpha=100000000000000.0, score=-15874555757.049, total=   2.8s
[CV] reg__alpha=100000000000000.0 ....................................
[CV]  reg__alpha=100000000000000.0, score=-10333101951.227, total=   2.8s
[CV] reg__alpha=100000000000000.0 ....................................
[CV]  reg__alpha=100000000000000.0, score=-10899686056.289, total=   2.8s
[CV] reg__alpha=100000000000000.0 ....................................
[CV]  reg__alpha=100000000000000.0, score=-5399743562.720, total=   2.8s
[CV] reg__alpha=1e+16 ................................................
[CV] .......... reg__alpha=1e+16, score=-5781652695.099, total=   2.8s
[CV] reg__alpha=1e+16 ................................................
[CV] ......... reg__alpha=1e+16, score=-15896556467.679, total=   2.8s
[CV] reg__alpha=1e+16 ................................................
[CV] ......... reg__alpha=1e+16, score=-10386757946.278, total=   2.8s
[CV] reg__alpha=1e+16 ................................................
[CV] ......... reg__alpha=1e+16, score=-10889947715.363, total=   2.8s
[CV] reg__alpha=1e+16 ................................................
[CV] .......... reg__alpha=1e+16, score=-5384918016.021, total=   2.8s
[CV] reg__alpha=1e+18 ................................................
[CV] .......... reg__alpha=1e+18, score=-5787400841.859, total=   2.9s
[CV] reg__alpha=1e+18 ................................................
[CV] ......... reg__alpha=1e+18, score=-15904197624.979, total=   2.9s
[CV] reg__alpha=1e+18 ................................................
[CV] ......... reg__alpha=1e+18, score=-10388882729.948, total=   2.8s
[CV] reg__alpha=1e+18 ................................................
[CV] ......... reg__alpha=1e+18, score=-10893719637.084, total=   2.8s
[CV] reg__alpha=1e+18 ................................................
[CV] .......... reg__alpha=1e+18, score=-5388457739.126, total=   2.8s
[CV] reg__alpha=1e+20 ................................................
[CV] .......... reg__alpha=1e+20, score=-5788810424.813, total=   2.8s
[CV] reg__alpha=1e+20 ................................................
[CV] .

[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed:  1.2min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=1234, shuffle=True),
             estimator=Pipeline(steps=[('poly', PolynomialFeatures()),
                                       ('reg', Ridge())]),
             param_grid={'reg__alpha': [1000000000000.0, 100000000000000.0,
                                        1e+16, 1e+18, 1e+20]},
             scoring='neg_mean_squared_error', verbose=3)

In [13]:
y_pred = grid.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'MAE: {mae:.3}')
print(f'MSE: {mse:.3}')
print(f'RMSE: {rmse:.3}')

MAE: 1.27e+04
MSE: 5e+09
RMSE: 7.07e+04


In [14]:
grid.best_estimator_

Pipeline(steps=[('poly', PolynomialFeatures()),
                ('reg', Ridge(alpha=100000000000000.0))])

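- Compared with plain linear regression, MAE improved (1.27e+04 vs. 1.4e+04), while MSE and RMSE were marginally worse (5e+09 vs. 4.9e+09 and 7.07e+04 vs. 7e+04)
- Among the candidates searched, the grid search selected a regularization parameter of 10^14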
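
Before trusting the selected value, it helps to see how far apart the candidates actually are. The sketch below shows one way to tabulate `cv_results_` after a search. It uses synthetic data and a much smaller `alpha` grid than the notebook (both are assumptions chosen so the example runs quickly).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption: not the Kickstarter dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=20.0, random_state=0)

pipe_demo = Pipeline([('poly', PolynomialFeatures(degree=2)), ('reg', Ridge())])
grid_demo = GridSearchCV(pipe_demo,
                         param_grid={'reg__alpha': [1e-2, 1e0, 1e2, 1e4]},
                         cv=KFold(n_splits=5, shuffle=True, random_state=1234),
                         scoring='neg_mean_squared_error')
grid_demo.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the fold scores, plus the rank
summary = pd.DataFrame(grid_demo.cv_results_)[
    ['param_reg__alpha', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(summary.sort_values('rank_test_score'))
print(grid_demo.best_params_)
```

If the difference in `mean_test_score` between candidates is small relative to `std_test_score`, the ranking is not very informative. In the Ridge log above, the fold scores for the same fold change only slightly from one `alpha` to the next, so the choice of 10^14 should probably be read with that in mind.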

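## Using an SVM
SVMs are better suited to small datasets. Training time grows faster than linearly with the number of samples, so for this task the data has to be subsampled, otherwise the computation time becomes impractical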
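
A minimal sketch of the subsampling step, assuming a reproducible subset is wanted. The frame `X_big`, the series `y_big` and the seed value are hypothetical stand-ins for the training split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the training features and target
X_big = pd.DataFrame(rng.normal(size=(10000, 3)), columns=['a', 'b', 'c'])
y_big = pd.Series(rng.normal(size=10000), index=X_big.index)

n_sub = 1000
# A fixed random_state makes the same rows come back on every run
X_sub = X_big.sample(n=n_sub, random_state=1234)
# Select the matching targets through the shared index
y_sub = y_big.loc[X_sub.index]

print(X_sub.shape, y_sub.shape)
```

Selecting through the index keeps features and targets aligned, provided the index values are unique.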

In [17]:
n_sample = 1000  # sample size
# No random_state is set, so a different subset is drawn on every run
y_sampled = y_train.sample(n_sample)
X_sampled = X_train.loc[y_sampled.index, :]

In [18]:
parameters = {'kernel': ['linear', 'rbf'], 'C': [1e-5, 1e-4, 1e-3]}
model = SVR()
# No scoring is given, so GridSearchCV uses SVR.score, i.e. R^2 (higher is better)
svr = GridSearchCV(model, parameters, cv=kf, verbose=3)
svr.fit(X_sampled, y_sampled)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] C=1e-05, kernel=linear ..........................................
[CV] ............. C=1e-05, kernel=linear, score=-0.079, total=   0.1s
[CV] C=1e-05, kernel=linear ..........................................
[CV] ............. C=1e-05, kernel=linear, score=-0.017, total=   0.1s
[CV] C=1e-05, kernel=linear ..........................................
[CV] ............. C=1e-05, kernel=linear, score=-0.026, total=   0.1s
[CV] C=1e-05, kernel=linear ..........................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.1s remaining:    0.0s
[CV] ............. C=1e-05, kernel=linear, score=-0.069, total=   0.1s
[CV] C=1e-05, kernel=linear ..........................................
[CV] ............. C=1e-05, kernel=linear, score=-0.072, total=   0.1s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................ C=1e-05, kernel=rbf, score=-0.078, total=   0.0s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................ C=1e-05, kernel=rbf, score=-0.016, total=   0.0s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................ C=1e-05, kernel=rbf, score=-0.026, total=   0.0s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................ C=1e-05, kernel=rbf, score=-0.069, total=   0.0s
[CV] C=1e-05, kernel=rbf .............................................
[CV] ................ C=1e-05, kernel=rbf, score=-0.072, total=   0.0s
[CV] C=0.0001, kernel=linear .........................................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  1.3min finished


GridSearchCV(cv=KFold(n_splits=5, random_state=1234, shuffle=True),
             estimator=SVR(),
             param_grid={'C': [1e-05, 0.0001, 0.001],
                         'kernel': ['linear', 'rbf']},
             verbose=3)

In [19]:
y_pred = svr.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f'MAE: {mae:.3}')
print(f'MSE: {mse:.3}')
print(f'RMSE: {rmse:.3}')

MAE: 8.67e+03
MSE: 5.08e+09
RMSE: 7.13e+04


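Compared with linear regression, MAE improved (8.67e+03 vs. 1.4e+04), but MSE and RMSE got worse (5.08e+09 vs. 4.9e+09 and 7.13e+04 vs. 7e+04)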
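
One plausible reading of this trade-off, offered as an interpretation rather than something the notebook verifies: the epsilon-insensitive loss used by SVR grows linearly with the residual, while least squares penalizes the squared residual. On a heavy-tailed target, a linear loss favors predictions near the bulk of the data, and a squared loss is pulled toward the large values. The sketch below illustrates the effect with constant predictions on a synthetic log-normal target (an assumed stand-in for the pledged amounts).

```python
import numpy as np

rng = np.random.default_rng(0)
# Heavy right-tailed synthetic target (assumption: not the real pledged amounts)
y_demo = rng.lognormal(mean=6.0, sigma=2.0, size=10000)

def mae(y, c):
    """Mean absolute error of the constant prediction c."""
    return np.mean(np.abs(y - c))

def mse(y, c):
    """Mean squared error of the constant prediction c."""
    return np.mean((y - c) ** 2)

mean_c, median_c = y_demo.mean(), np.median(y_demo)

print(f'predict mean:   MAE={mae(y_demo, mean_c):.3g}  MSE={mse(y_demo, mean_c):.3g}')
print(f'predict median: MAE={mae(y_demo, median_c):.3g}  MSE={mse(y_demo, median_c):.3g}')
```

The median gives the lower MAE and the mean gives the lower MSE, which is the same pattern as the SVR and linear regression results above. Whether this fully explains the notebook's numbers would need to be checked on the actual predictions.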