# Chapter 4
Notes on building models.

---
The source code is quoted from: https://github.com/ghmagazine/kagglebook/tree/master/ch04

License: https://github.com/ghmagazine/kagglebook/blob/master/LICENSE

## Data preparation

In [1]:
import numpy as np
import pandas as pd

# train_x is the training data, train_y is the target variable, and test_x is the test data.
# They are held as pandas DataFrame and Series objects (they are sometimes held as numpy arrays).

train = pd.read_csv('data/sample-data/train_preprocessed.csv')
train_x = train.drop(['target'], axis=1)
train_y = train['target']
test_x = pd.read_csv('data/sample-data/test_preprocessed.csv')

# Split the training data into a training set and a validation set
from sklearn.model_selection import KFold

kf = KFold(n_splits=4, shuffle=True, random_state=71)
tr_idx, va_idx = list(kf.split(train_x))[0]
tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]
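
The split above keeps only the first of the four folds. As a rough illustration of what a shuffled 4-fold split produces, the sketch below partitions row indices in pure Python. It is not scikit-learn's implementation and will not reproduce the indices chosen by `KFold(..., random_state=71)`; the helper name `kfold_indices` and the row count of 10 are assumptions made for the example.

```python
import random


def kfold_indices(n_rows, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs for a shuffled k-fold split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_rows, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid_idx = sorted(indices[start:start + size])
        train_idx = sorted(indices[:start] + indices[start + size:])
        yield train_idx, valid_idx
        start += size


folds = list(kfold_indices(n_rows=10, n_splits=4, seed=71))
# Take the first fold only, mirroring list(kf.split(train_x))[0].
tr_idx, va_idx = folds[0]
print(len(tr_idx), len(va_idx))
```

Every row lands in exactly one validation fold, so a single fold holds out roughly a quarter of the rows and trains on the rest.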

## lightgbm

In [2]:
import lightgbm as lgb
from sklearn.metrics import log_loss

# Convert the features and the target variable into LightGBM's data structure
lgb_train = lgb.Dataset(tr_x, tr_y)
lgb_eval = lgb.Dataset(va_x, va_y)

# Hyperparameter settings
params = {'objective': 'binary', 'seed': 71, 'verbose': 0, 'metrics': 'binary_logloss'}
num_round = 300

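# Run training
# The categorical variables are specified through a parameter
# The validation data is also passed to the model so the score can be monitored as training progresses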
categorical_features = ['product', 'medical_info_b2', 'medical_info_b3']
model = lgb.train(params, lgb_train, num_boost_round=num_round,
                  categorical_feature=categorical_features,
                  valid_names=['train', 'valid'], valid_sets=[lgb_train, lgb_eval])

# Check the score on the validation data
va_pred = model.predict(va_x)
score = log_loss(va_y, va_pred)
print(f'logloss: {score:.4f}')

# Predict on the test data
pred = model.predict(test_x)

New categorical_feature is ['medical_info_b2', 'medical_info_b3', 'product']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))


[1]	train's binary_logloss: 0.454286	valid's binary_logloss: 0.4654
[2]	train's binary_logloss: 0.429417	valid's binary_logloss: 0.443487
[3]	train's binary_logloss: 0.410142	valid's binary_logloss: 0.426359
[4]	train's binary_logloss: 0.393494	valid's binary_logloss: 0.411015
[5]	train's binary_logloss: 0.379488	valid's binary_logloss: 0.398589
[6]	train's binary_logloss: 0.366857	valid's binary_logloss: 0.386944
[7]	train's binary_logloss: 0.354417	valid's binary_logloss: 0.376575
[8]	train's binary_logloss: 0.34379	valid's binary_logloss: 0.367472
[9]	train's binary_logloss: 0.334998	valid's binary_logloss: 0.359954
[10]	train's binary_logloss: 0.325439	valid's binary_logloss: 0.35231
[11]	train's binary_logloss: 0.316396	valid's binary_logloss: 0.345017
[12]	train's binary_logloss: 0.309224	valid's binary_logloss: 0.340222
[13]	train's binary_logloss: 0.301732	valid's binary_logloss: 0.333364
[14]	train's binary_logloss: 0.294708	valid's binary_logloss: 0.328792
[15]	train's binary

[189]	train's binary_logloss: 0.042121	valid's binary_logloss: 0.210389
[190]	train's binary_logloss: 0.0417657	valid's binary_logloss: 0.210134
[191]	train's binary_logloss: 0.0414265	valid's binary_logloss: 0.210087
[192]	train's binary_logloss: 0.0410369	valid's binary_logloss: 0.210163
[193]	train's binary_logloss: 0.0407076	valid's binary_logloss: 0.210132
[194]	train's binary_logloss: 0.0403809	valid's binary_logloss: 0.209968
[195]	train's binary_logloss: 0.0400728	valid's binary_logloss: 0.20993
[196]	train's binary_logloss: 0.0396733	valid's binary_logloss: 0.209878
[197]	train's binary_logloss: 0.0392792	valid's binary_logloss: 0.209418
[198]	train's binary_logloss: 0.0388855	valid's binary_logloss: 0.209483
[199]	train's binary_logloss: 0.0385772	valid's binary_logloss: 0.20944
[200]	train's binary_logloss: 0.0382351	valid's binary_logloss: 0.209168
[201]	train's binary_logloss: 0.0379234	valid's binary_logloss: 0.209001
[202]	train's binary_logloss: 0.0375595	valid's binary
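
The `binary_logloss` values in the log and the `log_loss` call measure the same quantity: the mean negative log-likelihood of the true labels under the predicted probabilities. A minimal pure-Python sketch follows; the function name `binary_log_loss`, the clipping constant, and the toy labels and probabilities are assumptions for illustration, not values from the dataset.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all rows."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for probabilities of exactly 0 or 1.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uninformative = [0.5, 0.5, 0.5, 0.5]

print(f'confident: {binary_log_loss(labels, confident):.4f}')
print(f'uninformative: {binary_log_loss(labels, uninformative):.4f}')
```

Predicting 0.5 for every row scores ln 2 (about 0.693), which is a handy reference point when reading the log.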

In [3]:
pred

array([8.30069933e-02, 1.44318904e-02, 9.98427778e-04, ...,
       9.34629531e-01, 4.14023216e-05, 3.12900280e-01])

### Version with early stopping

In [4]:
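# Run training
# The categorical variables are specified through a parameter
# The validation data is also passed to the model so the score can be monitored as training progresses
# Early stopping is requested through the lgb.early_stopping callback, because the
# early_stopping_rounds argument of lgb.train was removed in LightGBM 4.0
categorical_features = ['product', 'medical_info_b2', 'medical_info_b3']
model2 = lgb.train(params, lgb_train, num_boost_round=num_round,
                   categorical_feature=categorical_features,
                   valid_names=['train', 'valid'], valid_sets=[lgb_train, lgb_eval],
                   callbacks=[lgb.early_stopping(stopping_rounds=10)])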

# Check the score on the validation data
va_pred = model2.predict(va_x)
score = log_loss(va_y, va_pred)
print(f'logloss: {score:.4f}')

[1]	train's binary_logloss: 0.454286	valid's binary_logloss: 0.4654
Training until validation scores don't improve for 10 rounds.
[2]	train's binary_logloss: 0.429417	valid's binary_logloss: 0.443487
[3]	train's binary_logloss: 0.410142	valid's binary_logloss: 0.426359
[4]	train's binary_logloss: 0.393494	valid's binary_logloss: 0.411015
[5]	train's binary_logloss: 0.379488	valid's binary_logloss: 0.398589
[6]	train's binary_logloss: 0.366857	valid's binary_logloss: 0.386944
[7]	train's binary_logloss: 0.354417	valid's binary_logloss: 0.376575
[8]	train's binary_logloss: 0.34379	valid's binary_logloss: 0.367472
[9]	train's binary_logloss: 0.334998	valid's binary_logloss: 0.359954
[10]	train's binary_logloss: 0.325439	valid's binary_logloss: 0.35231
[11]	train's binary_logloss: 0.316396	valid's binary_logloss: 0.345017
[12]	train's binary_logloss: 0.309224	valid's binary_logloss: 0.340222
[13]	train's binary_logloss: 0.301732	valid's binary_logloss: 0.333364
[14]	train's binary_logloss:

→ Early stopping takes effect and training stops at iteration 168.
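
The stopping rule can be sketched without LightGBM: track the best validation loss seen so far and stop once it has not improved for a fixed number of rounds. The function `find_stopping_point` and the loss values below are made up for illustration. This mirrors the idea behind a patience of 10 rounds, not LightGBM's internal code.

```python
def find_stopping_point(valid_losses, patience):
    """Return (best_iteration, stopped_at), both 1-based.

    stopped_at is the round at which training halts, or the last round
    if the patience is never exhausted.
    """
    best_loss = float('inf')
    best_iteration = 0
    for i, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_iteration = i
        elif i - best_iteration >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_iteration, i
    return best_iteration, len(valid_losses)


# Hypothetical validation losses: improvement stalls after round 4.
losses = [0.47, 0.44, 0.43, 0.41, 0.42, 0.42, 0.43, 0.41, 0.44]
best_iteration, stopped_at = find_stopping_point(losses, patience=3)
print(best_iteration, stopped_at)
```

In this toy run the best round is 4 and training halts at round 7, three rounds later. Ties do not count as improvement, so the 0.41 at round 8 would not have reset the counter.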

### Checking the num_iteration argument of the predict method

#### Prediction 1: pass the early-stopped iteration count as num_iteration

In [5]:
pred_with_best_iteration = model2.predict(test_x, num_iteration=model2.best_iteration)
print(pred_with_best_iteration)

[1.17955551e-01 4.21639439e-02 5.44313703e-03 ... 8.33034274e-01
 4.28464650e-04 3.68167940e-01]


#### Prediction 2: leave num_iteration at its default

In [6]:
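# num_iteration defaults to None (100 is the default number of boosting rounds at training time, not the default of this argument).
# When it is None, predict uses best_iteration if early stopping recorded one.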
pred = model2.predict(test_x)
print(pred)

[1.17955551e-01 4.21639439e-02 5.44313703e-03 ... 8.33034274e-01
 4.28464650e-04 3.68167940e-01]


#### Verify that predictions 1 and 2 are equal

In [7]:
print(np.array_equal(pred, pred_with_best_iteration))

True


→ They are equal. The result is the same whether or not best_iteration is passed as num_iteration.
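
The equality follows from how the default is resolved: `num_iteration` defaults to `None`, and in that case `predict` falls back to the best iteration when one was recorded. The toy class below imitates only that fallback rule, assuming the documented behavior of `Booster.predict`. `ToyBooster`, its per-tree outputs, and its additive prediction are hypothetical stand-ins, not LightGBM's implementation.

```python
class ToyBooster:
    """Additive model whose prediction is the sum of per-tree outputs."""

    def __init__(self, tree_outputs, best_iteration=0):
        self.tree_outputs = tree_outputs
        # 0 means that no best iteration was recorded.
        self.best_iteration = best_iteration

    def predict(self, num_iteration=None):
        if num_iteration is None:
            # Fall back to the best iteration if early stopping recorded one.
            num_iteration = self.best_iteration
        if num_iteration <= 0:
            # No limit: use every tree.
            num_iteration = len(self.tree_outputs)
        return sum(self.tree_outputs[:num_iteration])


booster = ToyBooster([0.5, 0.25, 0.125, 0.0625], best_iteration=2)
pred_default = booster.predict()
pred_explicit = booster.predict(num_iteration=booster.best_iteration)
pred_all_trees = booster.predict(num_iteration=0)
print(pred_default == pred_explicit, pred_default == pred_all_trees)
```

Leaving the argument out and passing `best_iteration` explicitly take the same path, which is why the two prediction arrays match.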