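## Generalization performance and overfitting
- Machine learning does not end once the model is trained
- Measure accuracy correctly and adopt the model with the best accuracy
- High accuracy is needed on unseen data
- Generalization performance: the ability to predict correctly on unseen data
- Overfitting: fitting the training data so closely that generalization performance drops -> the goal is to raise generalization performance while avoiding overfitting
- There are three evaluation methods: hold-out, LOOCV, and k-fold CV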
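
To make the idea of overfitting concrete, here is a minimal sketch on synthetic data (the noisy sine curve, the sample sizes, and the polynomial degrees are all assumptions chosen for illustration). It fits polynomials of increasing degree with NumPy and reports the MSE on the training data and on separate unseen data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy sine curve
def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # unseen data

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 3, 10):
    # Least-squares polynomial fit on the training data only
    coefs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (
        mse(y_train, np.polyval(coefs, x_train)),  # training error
        mse(y_test, np.polyval(coefs, x_test)),    # error on unseen data
    )

for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The training MSE can only go down as the degree grows, because each higher-degree fit contains the lower-degree ones. The MSE on unseen data typically stops improving and then gets worse, which is the gap that hold-out and cross-validation are meant to expose.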

## hold-out
- Randomly split the data into training data and test data
- Splits of 7:3 or 5:5 are common
- Drawback: not all of the data can be used for training
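
Because the split is random, a hold-out estimate depends on which rows land in the test set. The sketch below uses synthetic data (an assumption, not the tips dataset) and repeats a 7:3 split with several seeds to show how much the test MSE can move.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 50.0, size=(100, 1))
y = 1.0 + 0.1 * X[:, 0] + rng.normal(scale=1.0, size=100)

holdout_mse = []
for seed in range(5):
    # Same 7:3 ratio every time, only the random split changes
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    holdout_mse.append(mean_squared_error(y_te, model.predict(X_te)))

print(len(X_tr), len(X_te))
print(np.round(holdout_mse, 3))
```

Fixing `random_state` makes a single split reproducible, but it does not remove this dependence on the split.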

In [41]:
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = sns.load_dataset('tips')
y_col = 'tip'
X = df.drop(columns=[y_col])

# Select only the numeric (np.number) columns and keep their names for standardization later
numeric_cols = X.select_dtypes(include=np.number).columns.to_list()

# Create dummy variables
X = pd.get_dummies(X, drop_first=True)
y = df[y_col]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [42]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [43]:
X.head()

Unnamed: 0,total_bill,size,sex_Female,smoker_No,day_Fri,day_Sat,day_Sun,time_Dinner
0,16.99,2,1,1,0,0,1,1
1,10.34,3,0,1,0,0,1,1
2,21.01,3,0,1,0,0,1,1
3,23.68,2,0,1,0,0,1,1
4,24.59,4,1,1,0,0,1,1


In [44]:
print(numeric_cols)
print(X.dtypes)

['total_bill', 'size']
total_bill     float64
size             int64
sex_Female       uint8
smoker_No        uint8
day_Fri          uint8
day_Sat          uint8
day_Sun          uint8
time_Dinner      uint8
dtype: object


In [46]:
print(len(X_train))
print(len(X_test))

170
74


In [54]:
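# Standardization
# Standardize after splitting the data: fit the scaler on the training data only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = X_train.copy()
# Standardize only the numeric columns
scaler.fit(X_train[numeric_cols])  # fit() returns the scaler itself, so its result must not be assigned to the columns
X_train_scaled[numeric_cols] = scaler.transform(X_train[numeric_cols])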

X_test_scaled = X_test.copy()
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])
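
The reason for standardizing after the split is that the scaler's mean and standard deviation should come from the training data alone. The following sketch, on synthetic data (an assumption), shows the consequence: the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(loc=20.0, scale=8.0, size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training statistics

print("train mean/std:", X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print("test  mean/std:", X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```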


In [66]:
# Train the machine learning model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

In [67]:
y_pred

array([2.82249035, 2.97504474, 2.8260184 , 1.38113692, 3.15154584,
       1.72121268, 2.48332645, 3.03579004, 2.75176346, 4.52560955,
       3.1133346 , 3.14781575, 2.33198109, 2.11518372, 2.93262778,
       4.27846609, 1.83157994, 2.26626275, 2.31085596, 3.24382161,
       3.81889336, 2.85616455, 2.42949782, 2.42039736, 2.20253234,
       2.42509643, 2.81777778, 4.70274951, 3.81268552, 2.38673795,
       2.29194112, 2.20803273, 2.45503466, 1.7743294 , 2.71663745,
       2.22913684, 2.72146912, 2.01205852, 5.85346207, 3.49435578,
       2.26246168, 2.20347519, 2.50905642, 4.41646769, 1.97212663,
       2.78445294, 2.65274212, 3.01652357, 2.73423023, 3.95761528,
       3.9498931 , 2.53992971, 2.71758399, 6.35620823, 1.7434279 ,
       2.33450139, 4.23562521, 3.29319236, 2.41114285, 2.20345847,
       3.72455103, 2.29099827, 3.04008335, 3.74539008, 4.01431996,
       2.26547605, 2.66047323, 3.84238482, 2.17921165, 3.87859588,
       2.59899485, 1.94814647, 3.70801825, 2.11341037])

In [70]:
y_test

64     2.64
63     3.76
55     3.51
111    1.00
225    2.50
       ... 
90     3.00
101    3.00
75     1.25
4      3.61
109    4.00
Name: tip, Length: 74, dtype: float64

In [75]:
# Accuracy on the test data (MSE)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred) # np.mean(np.square(y_test - y_pred))

0.9550808988617148
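
As the comment in the cell above indicates, MSE is the mean of the squared differences between the true and predicted values. A small worked example with made-up numbers (an assumption for illustration) checks the manual formula against `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 2.0, 4.0])
y_hat = np.array([2.0, 7.0, 2.0, 3.0])

# Squared errors are 1, 4, 0, 1, so the mean is 6 / 4
mse_manual = np.mean(np.square(y_true - y_hat))
mse_sklearn = mean_squared_error(y_true, y_hat)

# RMSE puts the error back in the units of the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```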

## Basic machine learning workflow
- Prepare the data
- Create dummy variables
- Split into training data and test data with hold-out
- Standardize
- Train the model and evaluate it
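
The steps in this list can be sketched end to end as follows. The table here is synthetic (column names mimic the tips data, but every value and coefficient is made up), so the resulting MSE is not comparable to the one reported earlier.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Prepare the data (synthetic stand-in; all values are made up)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total_bill": rng.uniform(5.0, 50.0, size=n),
    "size": rng.integers(1, 7, size=n).astype(float),
    "time": rng.choice(["Lunch", "Dinner"], size=n),
})
df["tip"] = 0.5 + 0.12 * df["total_bill"] + 0.2 * df["size"] + rng.normal(scale=0.8, size=n)

y = df["tip"]
X = df.drop(columns=["tip"])
numeric_cols = X.select_dtypes(include=np.number).columns.to_list()

# 2. Create dummy variables for the categorical column
X = pd.get_dummies(X, drop_first=True)

# 3. Hold-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 4. Standardize the numeric columns, fitting on the training data only
scaler = StandardScaler().fit(X_train[numeric_cols])
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numeric_cols] = scaler.transform(X_train[numeric_cols])
X_test_scaled[numeric_cols] = scaler.transform(X_test[numeric_cols])

# 5. Train the model and evaluate it on the test data
model = LinearRegression().fit(X_train_scaled, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test_scaled))
print(f"test MSE: {test_mse:.3f}")
```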

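## Leave-One-Out Cross Validation
- Use a single sample as the test data and train on the rest; repeat so that every sample serves as the test data once, then average the results
- No randomness (the result is always the same)
- Almost all of the data can be used for training
- Drawback: very costly to compute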
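
A tiny example makes these properties visible. With six toy samples (an assumption for illustration), `LeaveOneOut` produces six splits, each holding out exactly one sample in order, and there is no seed to set.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(6).reshape(-1, 1)   # six toy samples
loo = LeaveOneOut()

splits = list(loo.split(X))
for train_index, test_index in splits:
    print("train:", train_index, "test:", test_index)

# One split, and therefore one model fit, per sample
n_splits = loo.get_n_splits(X)
print(n_splits)
```

The number of model fits equals the number of samples, which is where the cost comes from on larger datasets.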

In [83]:
X = df['total_bill'].values.reshape(-1, 1)
y = df['tip']

In [92]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()

In [106]:
model = LinearRegression()
mse_list = []

for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # split() yields positional indices, so index the Series with .iloc
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test data
    y_pred = model.predict(X_test)
    
    # MSE
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

In [111]:
print(f"MSE(LOOCV): {np.mean(mse_list)}")
print(f"std: {np.std(mse_list)}")

MSE(LOOCV): 1.0675673489857438
std: 2.0997944551776313


In [115]:
# cross_val_score does everything in the loop above in a single call
from sklearn.model_selection import cross_val_score
cv = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')

In [116]:
print(f"MSE(LOOCV): {-np.mean(scores)}")
print(f"std: {np.std(scores)}")

MSE(LOOCV): 1.0675673489857438
std: 2.0997944551776313
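
The minus sign in front of `np.mean(scores)` is needed because scikit-learn scorers follow a higher-is-better convention, so `neg_mean_squared_error` returns the MSE with its sign flipped. A minimal sketch on synthetic data (an assumption) shows the sign and how to convert back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(5.0, 50.0, size=(40, 1))
y = 1.0 + 0.1 * X[:, 0] + rng.normal(scale=1.0, size=40)

scores = cross_val_score(
    LinearRegression(), X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)

# Flip the sign once to get ordinary (non-negative) MSE values per split
mse_per_split = -scores
print(f"MSE(LOOCV): {mse_per_split.mean():.4f}")
print(f"std: {mse_per_split.std():.4f}")
```

The standard deviation is unaffected by the sign flip. In LOOCV each split's MSE is a single squared error, which is one reason the spread across splits tends to be large relative to the mean.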


## k-fold CV
- Split the data into k folds and perform cross-validation (k = 5 and k = 10 are common choices)
- Lower cost than LOOCV
- The most commonly used evaluation method
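
To see what "splitting into k folds" means in practice, the sketch below runs `KFold` on ten toy samples (an assumption for illustration) and counts how often each sample lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # ten toy samples
cv = KFold(n_splits=5, shuffle=True, random_state=0)

test_counts = np.zeros(len(X), dtype=int)
fold_sizes = []
for train_index, test_index in cv.split(X):
    test_counts[test_index] += 1      # track how often each sample is tested
    fold_sizes.append(len(test_index))
    print("test fold:", test_index)

print(test_counts)
print(fold_sizes)
```

Every sample is used for testing exactly once and for training in the other k - 1 rounds, with only k model fits instead of one per sample.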

In [145]:
from sklearn.model_selection import KFold
k = 5
n_repeats = 3

cv = KFold(n_splits=k, shuffle=True, random_state=0)
model = LinearRegression()
mse_list = []
for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # split() yields positional indices, so index the Series with .iloc
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test data
    y_pred = model.predict(X_test)
    
    # MSE
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

In [146]:
mse_list

[0.8213090642766285,
 1.0745842125927976,
 1.0880123892600388,
 1.3323867714930204,
 1.084763004349474]

In [147]:
print(f"MSE({k}Fold CV): {np.mean(mse_list)}")
print(f"std: {np.std(mse_list)}")

MSE(5Fold CV): 1.080211088394392
std: 0.1617010050703952


In [156]:
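# Drawback: with a bare model, standardization cannot be built into each fold
# The whole loop can be done in one line
# Note: the 15 scores shown below match the RepeatedKFold splitter (5 folds x 3 repeats) from In [159];
# with the 5-fold KFold from In [145], only the first five values would be returned
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error', n_jobs=-1)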

In [157]:
scores

array([-0.82130906, -1.07458421, -1.08801239, -1.33238677, -1.084763  ,
       -1.15878391, -1.6042084 , -1.03070862, -0.71202907, -0.84729854,
       -0.88561033, -1.52485216, -0.6332659 , -1.2003542 , -1.12141427])

In [158]:
print(f"MSE({k}Fold CV): {-np.mean(scores)}")
print(f"std: {np.std(scores)}")

MSE(5Fold CV): 1.0746387233165984
std: 0.26517178540898434


In [159]:
from sklearn.model_selection import RepeatedKFold
k = 5
n_repeats = 3

# cv = KFold(n_splits=k, shuffle=True, random_state=0)
cv = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=0)
model = LinearRegression()
mse_list = []
for train_index, test_index in cv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    # split() yields positional indices, so index the Series with .iloc
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test data
    y_pred = model.predict(X_test)
    
    # MSE
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

In [160]:
mse_list # 5 folds x 3 repeats = 15 values in total

[0.8213090642766285,
 1.0745842125927976,
 1.0880123892600388,
 1.3323867714930204,
 1.084763004349474,
 1.1587839131131423,
 1.6042084002514578,
 1.0307086207441924,
 0.7120290668798746,
 0.8472985410140899,
 0.8856103319481908,
 1.5248521639391936,
 0.6332659028150582,
 1.200354200262607,
 1.121414266809207]

In [162]:
# mse_list already holds positive MSE values, so no sign flip is needed here
print(f"MSE({k}Fold CV): {np.mean(mse_list)}")
print(f"std: {np.std(mse_list)}")

MSE(5Fold CV): 1.0746387233165984
std: 0.26517178540898434
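
Repeated k-fold runs k-fold CV several times with different shuffles, so the score array has k x n_repeats entries. The sketch below, on synthetic data (an assumption), also averages the folds within each repeat to get one CV estimate per repeat.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(5.0, 50.0, size=(100, 1))
y = 1.0 + 0.1 * X[:, 0] + rng.normal(scale=1.0, size=100)

k, n_repeats = 5, 3
cv = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")

# Splits come out repeat by repeat, so each row below is one full k-fold run
per_repeat = (-scores).reshape(n_repeats, k).mean(axis=1)

print(len(scores))
print(np.round(per_repeat, 4))
print(f"overall MSE: {-scores.mean():.4f}")
```

The spread of `per_repeat` gives a rough sense of how much a single k-fold estimate depends on the shuffle.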


# Incorporating standardization into k-fold CV

## Pipeline

### pipeline + KFoldCV

In [183]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, scoring='neg_mean_squared_error', cv=cv)

scores

array([-0.82130906, -1.07458421, -1.08801239, -1.33238677, -1.084763  ])
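
When a pipeline is passed to `cross_val_score`, the scaler is refit on the training part of every fold, so no test-fold statistics leak into preprocessing. For plain `LinearRegression` the predictions do not change under standardization, which is consistent with the scores above matching the unscaled 5-fold results. The sketch below therefore swaps in `Ridge`, a scale-sensitive model that is not used in the original notebook, and uses synthetic data (both are assumptions) to check that the pipeline reproduces a manual per-fold fit.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(5.0, 50.0, size=(100, 2))
y = 1.0 + 0.1 * X[:, 0] - 0.05 * X[:, 1] + rng.normal(scale=1.0, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Pipeline version: scaling happens inside each fold automatically
pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
pipeline_mse = -cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=cv)

# Manual version: fit the scaler on the training fold only, then score the test fold
manual_mse = []
for train_index, test_index in cv.split(X):
    scaler = StandardScaler().fit(X[train_index])
    model = Ridge(alpha=1.0).fit(scaler.transform(X[train_index]), y[train_index])
    y_pred = model.predict(scaler.transform(X[test_index]))
    manual_mse.append(mean_squared_error(y[test_index], y_pred))

print(np.round(pipeline_mse, 4))
print(np.allclose(pipeline_mse, manual_mse))
```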

In [194]:
## Without a pipeline
## Standardization + linear regression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression()
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

In [198]:
## With a pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipeline = Pipeline(steps=[('scaler', StandardScaler()), ('model', LinearRegression())])
pipeline.fit(X_train, y_train)
y_pred_p = pipeline.predict(X_test)

In [199]:
y_pred

array([2.71486884, 2.78639251, 2.90900452, 1.65836207, 2.57999564,
       1.50509707, 2.74858715, 3.30136293, 2.77208778, 4.45800284,
       3.50060744, 3.49345507, 2.35520697, 2.24587793, 2.28879213,
       4.02375199, 1.77075641, 2.3480546 , 2.83645908, 3.2778623 ,
       3.98901192, 3.05511716, 2.55240794, 2.45431834, 2.29798803,
       2.59327861, 2.16004953, 3.96244599, 3.50162921, 2.5289073 ,
       2.42264357, 2.19274606, 2.49314547, 1.99963215, 2.78639251,
       2.28572683, 2.64743224, 1.97306622, 5.85577969, 2.55036441,
       1.79425705, 2.18763723, 2.52073317, 3.96755482, 2.22135553,
       2.65151931, 2.78128368, 3.12255376, 2.66173698, 3.66409011,
       4.2567148 , 2.74552185, 3.01118119, 5.83943142, 1.89847725,
       2.14676656, 3.97572896, 3.03161652, 2.37462053, 2.21113786,
       3.70496078, 2.53299437, 3.07963956, 3.47199797, 3.99718606,
       2.5043849 , 2.60043097, 4.2720413 , 1.97306622, 3.87763935,
       2.4890584 , 1.99145802, 3.43010554, 2.37972937])

In [200]:
y_pred_p

array([2.71486884, 2.78639251, 2.90900452, 1.65836207, 2.57999564,
       1.50509707, 2.74858715, 3.30136293, 2.77208778, 4.45800284,
       3.50060744, 3.49345507, 2.35520697, 2.24587793, 2.28879213,
       4.02375199, 1.77075641, 2.3480546 , 2.83645908, 3.2778623 ,
       3.98901192, 3.05511716, 2.55240794, 2.45431834, 2.29798803,
       2.59327861, 2.16004953, 3.96244599, 3.50162921, 2.5289073 ,
       2.42264357, 2.19274606, 2.49314547, 1.99963215, 2.78639251,
       2.28572683, 2.64743224, 1.97306622, 5.85577969, 2.55036441,
       1.79425705, 2.18763723, 2.52073317, 3.96755482, 2.22135553,
       2.65151931, 2.78128368, 3.12255376, 2.66173698, 3.66409011,
       4.2567148 , 2.74552185, 3.01118119, 5.83943142, 1.89847725,
       2.14676656, 3.97572896, 3.03161652, 2.37462053, 2.21113786,
       3.70496078, 2.53299437, 3.07963956, 3.47199797, 3.99718606,
       2.5043849 , 2.60043097, 4.2720413 , 1.97306622, 3.87763935,
       2.4890584 , 1.99145802, 3.43010554, 2.37972937])
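
Rather than comparing the two printed arrays by eye, the match can be checked programmatically. This sketch repeats both approaches on synthetic data (an assumption) and compares the predictions with `np.allclose`; it also reads the fitted scaler back out of the pipeline through `named_steps`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(5.0, 50.0, size=(100, 1))
y = 1.0 + 0.1 * X[:, 0] + rng.normal(scale=1.0, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Without a pipeline: scale and fit step by step
scaler = StandardScaler()
model = LinearRegression()
model.fit(scaler.fit_transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# With a pipeline: the same steps behind a single fit/predict
pipeline = Pipeline(steps=[("scaler", StandardScaler()), ("model", LinearRegression())])
pipeline.fit(X_train, y_train)
y_pred_p = pipeline.predict(X_test)

print(np.allclose(y_pred, y_pred_p))
# The scaler inside the pipeline was fitted on the training data only
print(pipeline.named_steps["scaler"].mean_)
```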