## 6.　モデルチューニング

- LightGBMのハイパーパラメータのチューニング
- scikit-learnのモデル利用
- ニューラルネットワークの利用
- アンサンブル

In [1]:
import numpy as np
import pandas as pd
import os
import pickle
import gc
# 分布の確認
import pandas_profiling as pdp
# 可視化
import matplotlib.pyplot as plt
# 前処理
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
# モデリング
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

# matplotlibで日本語表示したい場合はこれをinstallしてインポートする
# !pip install japanize-matplotlib
# import japanize_matplotlib
# %matplotlib inline

df_train = pd.read_csv("../data/train.csv")
x_train, y_train, id_train = df_train[['Pclass', 'Fare']], df_train[[
    'Survived']], df_train[['PassengerId']]

### 6.1　LightGBMのハイパーパラメータのチューニング

##### 6.1.1　手動チューニング
1. 初期値の設定
2. 学習結果に応じた個別チューニング

##### 6.1.2　自動チューニング

optunaを用いた自動チューニングの例

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

In [3]:
!pip install optuna

Collecting optuna
  Downloading optuna-3.1.0-py3-none-any.whl (365 kB)
     ------------------------------------- 365.3/365.3 kB 11.1 MB/s eta 0:00:00
Collecting alembic>=1.5.0
  Downloading alembic-1.9.4-py3-none-any.whl (210 kB)
     ------------------------------------- 210.5/210.5 kB 12.5 MB/s eta 0:00:00
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting cmaes>=0.9.1
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
     ---------------------------------------- 78.7/78.7 kB 4.6 MB/s eta 0:00:00
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.9.4 cmaes-0.9.1 colorlog-6.7.0 optuna-3.1.0


In [4]:
import optuna

In [5]:
# 詮索しないハイパーパラメータ
params_base = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric":"auc",
    "learning_rate": "0.02",
    "n_estimators": 100000,
    "bagging_fleq": 1,
    "seed": 123,
}

def objective(trial):
    # 詮索するハイパーパラメータ
    params_tuning = {
        "num_leaves": trial.suggest_int("num_leaves", 8, 256),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 5, 200),
        "min_sum_hessian_in_leaf": trial.suggest_float("min_sum_hessian_in_leaf", 1e-5, 1e-2, log=True),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.5, 1.0),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-2, 1e2, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-2, 1e2, log=True),
    }
    params_tuning.update(params_base)

    # モデル学習・評価
    list_metrics = []
    cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=123).split(x_train, y_train))
    for nfold in np.arange(5):
        idx_tr, idx_va = cv[nfold][0], cv[nfold][1]
        x_tr, y_tr = x_train.loc[idx_tr, :], y_train.loc[idx_tr, :]
        x_va, y_va = x_train.loc[idx_va, :], y_train.loc[idx_va, :]
        model = lgb.LGBMClassifier(**params_tuning)
        model.fit(x_tr,
                  y_tr, 
                  eval_set=[(x_tr, y_tr), (x_va, y_va)],
                  early_stopping_rounds=100,
                  verbose=0,)
        y_va_pred = model.predict_proba(x_va)[:, 1]
        metric_va = accuracy_score(y_va, np.where(y_va_pred>=0.5, 1, 0))
        list_metrics.append(metric_va)

    # 評価値の計算
    metrics = np.mean(list_metrics)

    return metrics

In [None]:
# 最適化処理（詮索の実行）
sampler = optuna.samplers.TPESampler(seed=123)
study = optuna.create_study(sampler=sampler, direction="maximize")
study.optimize(objective, n_trials=30)

In [7]:
# 詮索結果の確認
trial = study.best_trial
print("acc(best)={:,.4f}".format(trial.value))
print(f"acc(best)={trial.value}")
display(trial.params)

acc(best)=0.6992
acc(best)=0.6992404745464817


{'num_leaves': 252,
 'min_data_in_leaf': 60,
 'min_sum_hessian_in_leaf': 0.009739830877756862,
 'feature_fraction': 0.8018999555097835,
 'bagging_fraction': 0.5949431124260618,
 'lambda_l1': 0.1812977929299853,
 'lambda_l2': 0.011197876549499167}

In [8]:
# ベストなハイパーパラメータの取得
params_best = trial.params
params_best.update(params_base)
display(params_best)

{'num_leaves': 252,
 'min_data_in_leaf': 60,
 'min_sum_hessian_in_leaf': 0.009739830877756862,
 'feature_fraction': 0.8018999555097835,
 'bagging_fraction': 0.5949431124260618,
 'lambda_l1': 0.1812977929299853,
 'lambda_l2': 0.011197876549499167,
 'boosting_type': 'gbdt',
 'objective': 'binary',
 'metric': 'auc',
 'learning_rate': '0.02',
 'n_estimators': 100000,
 'bagging_fleq': 1,
 'seed': 123}

### 6.2　LightGBM以外のモデル利用

- scikit-learnの各種モデル
- ニューラルネットワーク

##### 6.2.1　scikit-learnの各種モデル

基本的にはどのモデルでも同じ手順で処理できる.

1. モデル定義: importした関数を指定してモデルを定義
2. 学習: .fitで学習を実行する
3. .predictで推論処理を実行する

sklearnの各種モデルでの学習の際の注意点
- 欠損値を埋めないと学習できない
- すべて数値データにしないと学習できない
- 数値データを正規化あるいは標準化する

##### ロジスティック回帰

簡単のため説明変数は'Pclass', 'Age', 'Embarked'の３つとする

In [9]:
# ファイルの読み込み
df_train = pd.read_csv("../data/train.csv")
# データセットの作成
x_train = df_train[['Pclass', 'Age', 'Embarked']]
y_train = df_train[['Survived']]

'Age', 'Embarked'には欠損値があるので, 欠損値を保管する

In [10]:
# 欠損値補間: 数値データ
x_train['Age'] = x_train['Age'].fillna(x_train['Age'].mean())
# 欠損値補間: カテゴリ変数
x_train["Embarked"] = x_train['Embarked'].fillna(x_train['Embarked'].mode()[0])

カテゴリ変数である'Embarked'をohe-hot-encodingで数値データに変換

In [11]:
# カテゴリ変数の数値データへの変換（one-hot-encoding）
ohe = OneHotEncoder()
ohe.fit(x_train[['Embarked']])
df_embarked = pd.DataFrame(ohe.transform(x_train[['Embarked']]).toarray(), columns=[f"Embarked_{col}" for col in ohe.categories_[0]])
x_train = pd.concat([x_train, df_embarked], axis=1)
x_train = x_train.drop(columns=['Embarked'])

In [12]:
# 数値データの正規化
mms = MinMaxScaler()
mms.fit(df_train[['Pclass']])
df_train['Pclass'] = mms.transform(df_train[['Pclass']])

mms.fit(df_train[['Age']])
df_train['Age'] = mms.transform(df_train[['Age']])

In [13]:
# 学習データと検証データの分割（ホールドアウト検証）
x_tr, x_va, y_tr, y_va = train_test_split(x_train, y_train, test_size=0.2, stratify=y_train, random_state=123)
print(x_tr.shape, x_va.shape, y_tr.shape, y_va.shape)

(712, 5) (179, 5) (712, 1) (179, 1)


In [14]:
# モデル定義
from sklearn.linear_model import LogisticRegression
model_logis = LogisticRegression()

# 学習
model_logis.fit(x_tr, y_tr)

# 予測
y_va_pred = model_logis.predict(x_va)
print(f"accuracy: {accuracy_score(y_va, y_va_pred)}")
print(y_va_pred[:5])

# 確率値の取得
y_va_pred_prob = model_logis.predict_proba(x_va)
print(y_va_pred_prob[:5, :])

accuracy: 0.7262569832402235
[0 1 0 1 0]
[[0.85951356 0.14048644]
 [0.20358813 0.79641187]
 [0.85501264 0.14498736]
 [0.28727379 0.71272621]
 [0.61610234 0.38389766]]


##### SVM（サポートベクターマシン）
今回はSVMの分類モデル

In [15]:
# モデル定義
from sklearn.svm import SVC
model_svm = SVC(C=1.0, random_state=123, probability=True)

# 学習
model_svm.fit(x_tr, y_tr)

# 予測
y_va_pred = model_svm.predict(x_va)
print(f"accuracy: {accuracy_score(y_va, y_va_pred)}")
print(y_va_pred[:5])

# 確率値の取得
y_va_pred_prob = model_svm.predict_proba(x_va)
print(y_va_pred_prob[:5, :])


accuracy: 0.6368715083798883
[0 0 0 0 0]
[[0.66192458 0.33807542]
 [0.5599569  0.4400431 ]
 [0.66089667 0.33910333]
 [0.57490543 0.42509457]
 [0.58408982 0.41591018]]


##### 6.2.2　ニューラルネットワーク

ニューラルネットワークを用いて学習する際の注意点
- 欠損値を埋めないと学習できない
- すべて数値データにしないと学習できない
- 数値データを正規化あるいは標準化する

全結合層のみのニューラルネットワーク

In [16]:
!pip install tensorflow

Collecting tensorflow
  Downloading tensorflow-2.11.0-cp39-cp39-win_amd64.whl (1.9 kB)
Collecting tensorflow-intel==2.11.0
  Downloading tensorflow_intel-2.11.0-cp39-cp39-win_amd64.whl (266.3 MB)
     -------------------------------------- 266.3/266.3 MB 9.2 MB/s eta 0:00:00
Collecting gast<=0.4.0,>=0.2.1
  Downloading gast-0.4.0-py3-none-any.whl (9.8 kB)
Collecting grpcio<2.0,>=1.24.3
  Downloading grpcio-1.51.1-cp39-cp39-win_amd64.whl (3.7 MB)
     ---------------------------------------- 3.7/3.7 MB 19.8 MB/s eta 0:00:00
Collecting flatbuffers>=2.0
  Downloading flatbuffers-23.1.21-py2.py3-none-any.whl (26 kB)
Collecting absl-py>=1.0.0
  Downloading absl_py-1.4.0-py3-none-any.whl (126 kB)
     ---------------------------------------- 126.5/126.5 kB ? eta 0:00:00
Collecting opt-einsum>=2.3.2
  Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
     ---------------------------------------- 65.5/65.5 kB 3.5 MB/s eta 0:00:00
Collecting protobuf<3.20,>=3.9.2
  Downloading protobuf-3.19

In [17]:
!pip install keras



In [19]:
# tensorflowライブラリのimport
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.layers import Embedding, Flatten, Concatenate
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler
from tensorflow.keras.optimizers import Adam, SGD

In [20]:
# tensorflowの再現性のためのシード設定
def seed_everything(seed):
    import random
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    session_coef = tf.compat.v1.ConfigProto(
        intra_op_parallelism_threads=1,
        inter_op_parallelism_threads=1
    )
    sess = tf.compat.v1.Session(graph=tf.compat.v1.get_default_graph(), config=session_coef)
    tf.compat.v1.keras.backend.set_session(sess)

In [22]:
# ファイルの読み込み
df_train = pd.read_csv("../data/train.csv")

# データセット作成
x_train = df_train[["Pclass", "Age", "Embarked"]]      # 簡単のため, 説明変数は3つ
y_train = df_train[["Survived"]]

In [23]:
"""数値データの前処理"""
# 欠損値補間
x_train["Age"] = x_train["Age"].fillna(x_train["Age"].mean())
# 正規化
for col in ["Pclass", "Age"]:
    value_min = x_train[col].min()
    value_max = x_train[col].max()
    x_train[col] = (x_train[col] - value_min) / (value_max - value_min)

In [25]:
"""カテゴリ変数の前処理"""
# 欠損値補間
x_train["Embarked"] = x_train["Embarked"].fillna(x_train["Embarked"].mode()[0])
# one-hot-encoding
ohe = OneHotEncoder()
ohe.fit(x_train[["Embarked"]])
df_embarked = pd.DataFrame(ohe.transform(x_train[["Embarked"]]).toarray(), columns=[f"Embarked_{col}" for col in ohe.categories_[0]])
x_train = pd.concat([x_train.drop(columns=["Embarked"]), df_embarked], axis=1)

In [26]:
# 学習データとテストデータの分割
x_tr, x_va, y_tr, y_va = train_test_split(x_train, y_train, test_size=0.2, stratify=y_train, random_state=123)
print(x_tr.shape, x_va.shape, y_tr.shape, y_va.shape)

(712, 5) (179, 5) (712, 1) (179, 1)


In [33]:
def create_model():
    input_num = Input(shape=(5,))
    x_num = Dense(10, activation="relu")(input_num)
    x_num = BatchNormalization()(x_num)
    x_num = Dropout(0.3)(x_num)
    x_num = Dense(10, activation="relu")(x_num)
    x_num = BatchNormalization()(x_num)
    x_num = Dropout(0.2)(x_num)
    x_num = Dense(5, activation="relu")(x_num)
    x_num = BatchNormalization()(x_num)
    x_num = Dropout(0.1)(x_num)
    out = Dense(1, activation="sigmoid")(x_num)
    
    model = Model(inputs=input_num, outputs=out,)
    
    model.compile(
        optimizer="Adam",
        loss="binary_crossentropy",
        metrics=["binary_crossentropy"],
    )

    return model

model = create_model()
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 5)]               0         
                                                                 
 dense (Dense)               (None, 10)                60        
                                                                 
 batch_normalization (BatchN  (None, 10)               40        
 ormalization)                                                   
                                                                 
 dropout (Dropout)           (None, 10)                0         
                                                                 
 dense_1 (Dense)             (None, 10)                110       
                                                                 
 batch_normalization_1 (Batc  (None, 10)               40        
 hNormalization)                                             

In [36]:
# モデル学習
seed_everything(seed=123)
model = create_model()
model.fit(x=x_tr,
          y=y_tr,
          validation_data=(x_va, y_va),
          batch_size=8,
          epochs=10000,
          callbacks=[ModelCheckpoint(filepath="model_keras.h5", monitor="val_loss", mode="min", verbose=1, save_best_only=True, save_weights_only=True),
                     EarlyStopping(monitor="val_loss", mode="min", min_delta=0, patience=10, verbose=1, restore_best_weights=True), 
                     ReduceLROnPlateau(monitor="val_loss", mode="min", factor=0.1, patience=5, verbose=1),],
          verbose=1,)

Epoch 1/10000
Epoch 1: val_loss improved from inf to 0.68177, saving model to model_keras.h5
Epoch 2/10000
Epoch 2: val_loss improved from 0.68177 to 0.66743, saving model to model_keras.h5
Epoch 3/10000
Epoch 3: val_loss improved from 0.66743 to 0.65251, saving model to model_keras.h5
Epoch 4/10000
Epoch 4: val_loss improved from 0.65251 to 0.63584, saving model to model_keras.h5
Epoch 5/10000
Epoch 5: val_loss improved from 0.63584 to 0.61953, saving model to model_keras.h5
Epoch 6/10000
Epoch 6: val_loss improved from 0.61953 to 0.61322, saving model to model_keras.h5
Epoch 7/10000
Epoch 7: val_loss did not improve from 0.61322
Epoch 8/10000
Epoch 8: val_loss improved from 0.61322 to 0.61172, saving model to model_keras.h5
Epoch 9/10000
Epoch 9: val_loss improved from 0.61172 to 0.60519, saving model to model_keras.h5
Epoch 10/10000
Epoch 10: val_loss did not improve from 0.60519
Epoch 11/10000
Epoch 11: val_loss improved from 0.60519 to 0.60269, saving model to model_keras.h5
Epoch

<keras.callbacks.History at 0x1c611791400>

In [41]:
# モデルの評価
y_va_pred = model.predict(x_va, batch_size=8, verbose=1)
print(f"accuracy: {accuracy_score(y_va, np.where(y_va_pred>=0.5, 1, 0)):f}")

accuracy: 0.692737


埋め込み層ありのネットワークモデル

In [42]:
# データセットの作成
x_train = df_train[["Pclass", "Age", "Cabin"]]
y_train = df_train[["Survived"]]

In [44]:
"""数値データの前処理"""
# 欠損値補間
x_train["Age"] = x_train["Age"].fillna(x_train["Age"].mean())
# 正規化
for col in ["Pclass", "Age"]:
    value_min = x_train[col].min()
    value_max = x_train[col].max()
    x_train[col] = (x_train[col] - value_min) / (value_max - value_min)

In [45]:
"""カテゴリ変数の前処理"""
# 欠損値補間
x_train["Cabin"] = x_train["Cabin"].fillna("None")
# label-encoding
le = LabelEncoder()
le.fit(x_train[["Cabin"]])
x_train["Cabin"] = le.transform(x_train["Cabin"])

print(le.classes_)
print("count:", len(le.classes_))

['A10' 'A14' 'A16' 'A19' 'A20' 'A23' 'A24' 'A26' 'A31' 'A32' 'A34' 'A36'
 'A5' 'A6' 'A7' 'B101' 'B102' 'B18' 'B19' 'B20' 'B22' 'B28' 'B3' 'B30'
 'B35' 'B37' 'B38' 'B39' 'B4' 'B41' 'B42' 'B49' 'B5' 'B50' 'B51 B53 B55'
 'B57 B59 B63 B66' 'B58 B60' 'B69' 'B71' 'B73' 'B77' 'B78' 'B79' 'B80'
 'B82 B84' 'B86' 'B94' 'B96 B98' 'C101' 'C103' 'C104' 'C106' 'C110' 'C111'
 'C118' 'C123' 'C124' 'C125' 'C126' 'C128' 'C148' 'C2' 'C22 C26'
 'C23 C25 C27' 'C30' 'C32' 'C45' 'C46' 'C47' 'C49' 'C50' 'C52' 'C54'
 'C62 C64' 'C65' 'C68' 'C7' 'C70' 'C78' 'C82' 'C83' 'C85' 'C86' 'C87'
 'C90' 'C91' 'C92' 'C93' 'C95' 'C99' 'D' 'D10 D12' 'D11' 'D15' 'D17' 'D19'
 'D20' 'D21' 'D26' 'D28' 'D30' 'D33' 'D35' 'D36' 'D37' 'D45' 'D46' 'D47'
 'D48' 'D49' 'D50' 'D56' 'D6' 'D7' 'D9' 'E10' 'E101' 'E12' 'E121' 'E17'
 'E24' 'E25' 'E31' 'E33' 'E34' 'E36' 'E38' 'E40' 'E44' 'E46' 'E49' 'E50'
 'E58' 'E63' 'E67' 'E68' 'E77' 'E8' 'F E69' 'F G63' 'F G73' 'F2' 'F33'
 'F38' 'F4' 'G6' 'None' 'T']
count: 148


In [46]:
# 学習データと検証データの分割
x_train_num, x_train_cat = x_train[["Pclass", "Age"]], x_train[["Cabin"]]

x_num_tr, x_num_va, x_cat_tr, x_cat_va, y_tr, y_va = train_test_split(x_train_num, x_train_cat, y_train, test_size=0.2, stratify=y_train, random_state=123)
print(x_num_tr.shape, x_num_va.shape, x_cat_tr.shape, x_cat_va.shape, y_tr.shape, y_va.shape)

(712, 2) (179, 2) (712, 1) (179, 1) (712, 1) (179, 1)


In [47]:
# モデルの定義
def create_model_embedding():
    ########### num
    input_num = Input(shape=(2,))
    layer_num = Dense(10, activation="relu")(input_num)
    layer_num = BatchNormalization()(layer_num)
    layer_num = Dropout(0.2)(layer_num)
    layer_num = Dense(10, activation="relu")(layer_num)
    
    ########### cat
    input_cat = Input(shape=(1,))
    layer_cat = input_cat[:,0]
    layer_cat = Embedding(input_dim=148, output_dim=74)(layer_cat)
    layer_cat = Dropout(0.2)(layer_cat)
    layer_cat = Flatten()(layer_cat)
    
    ########### concat
    hidden_layer = Concatenate()([layer_num, layer_cat])
    hidden_layer = Dense(50, activation="relu")(hidden_layer)
    hidden_layer = BatchNormalization()(hidden_layer)
    hidden_layer = Dropout(0.1)(hidden_layer)
    hidden_layer = Dense(20, activation="relu")(hidden_layer)
    hidden_layer = BatchNormalization()(hidden_layer)
    hidden_layer = Dropout(0.1)(hidden_layer)
    output_layer = Dense(1, activation="sigmoid")(hidden_layer)
    
    model = Model(inputs=[input_num, input_cat], outputs=output_layer,)
    
    model.compile(
        optimizer="Adam",
        loss="binary_crossentropy",
        metrics=["binary_crossentropy"],
    )
    
    return model

model = create_model_embedding()
model.summary()

Model: "model_4"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_5 (InputLayer)           [(None, 2)]          0           []                               
                                                                                                  
 input_6 (InputLayer)           [(None, 1)]          0           []                               
                                                                                                  
 dense_16 (Dense)               (None, 10)           30          ['input_5[0][0]']                
                                                                                                  
 tf.__operators__.getitem (Slic  (None,)             0           ['input_6[0][0]']                
 ingOpLambda)                                                                               

In [48]:
# モデルの学習
seed_everything(seed=123)
model = create_model_embedding()
model.fit(x=[x_num_tr, x_cat_tr],
          y=y_tr,
          validation_data=([x_num_va, x_cat_va], y_va),
          batch_size=8,
          epochs=10000,
          callbacks=[ModelCheckpoint(filepath="model_keras_embedding.h5", monitor="val_loss", mode="min", verbose=1, save_best_only=True, save_weights_only=True),
                     EarlyStopping(monitor="val_loss", mode="min", min_delta=0, patience=10, verbose=1, restore_best_weights=True), 
                     ReduceLROnPlateau(monitor="val_loss", mode="min", factor=0.1, patience=5, verbose=1),],
          verbose=1,)

Epoch 1/10000
Epoch 1: val_loss improved from inf to 0.65927, saving model to model_keras_embedding.h5
Epoch 2/10000
Epoch 2: val_loss improved from 0.65927 to 0.65246, saving model to model_keras_embedding.h5
Epoch 3/10000
Epoch 3: val_loss improved from 0.65246 to 0.65015, saving model to model_keras_embedding.h5
Epoch 4/10000
Epoch 4: val_loss improved from 0.65015 to 0.63424, saving model to model_keras_embedding.h5
Epoch 5/10000
Epoch 5: val_loss improved from 0.63424 to 0.61907, saving model to model_keras_embedding.h5
Epoch 6/10000
Epoch 6: val_loss improved from 0.61907 to 0.59948, saving model to model_keras_embedding.h5
Epoch 7/10000
Epoch 7: val_loss did not improve from 0.59948
Epoch 8/10000
Epoch 8: val_loss did not improve from 0.59948
Epoch 9/10000
Epoch 9: val_loss did not improve from 0.59948
Epoch 10/10000
Epoch 10: val_loss did not improve from 0.59948
Epoch 11/10000
Epoch 11: val_loss did not improve from 0.59948

Epoch 11: ReduceLROnPlateau reducing learning rate t

<keras.callbacks.History at 0x1c615d20610>

In [50]:
# モデルの評価
y_va_pred = model.predict([x_num_va, x_cat_va], batch_size=8, verbose=1)
print(f"accuracy:{accuracy_score(y_va, np.where(y_va_pred>=0.5,1,0)):f}")

accuracy:0.703911


### 6.3　アンサンブル

- 単純平均
- 重み付き平均
- スタッキング

##### 6.3.1　単純平均

In [58]:
# アンサンブルにだけ焦点を当てるため, モデルの学習と予測値算出処理は省略
# サンプルデータの作成
np.random.seed(123)
df = pd.DataFrame({
    "true": [0]*700 + [1]*300,
    "pred1": np.arange(1000) + np.random.rand(1000)*1200,
    "pred2": np.arange(1000) + np.random.rand(1000)*1000,
    "pred3": np.arange(1000) + np.random.rand(1000)*800,
})
df["pred1"] = np.clip(df["pred1"]/df["pred1"].max(), 0, 1)
df["pred2"] = np.clip(df["pred2"]/df["pred2"].max(), 0, 1)
df["pred3"] = np.clip(df["pred3"]/df["pred3"].max(), 0, 1)

df_train, df_test = train_test_split(df, test_size=0.2, stratify=df["true"], random_state=123)
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_train.head()

Unnamed: 0,true,pred1,pred2,pred3
0,1,0.668853,0.95356,0.735708
1,0,0.566234,0.226771,0.323684
2,0,0.419504,0.492005,0.252806
3,1,0.63583,0.80238,0.941991
4,0,0.215446,0.127587,0.106397


In [63]:
# 単純平均によるアンサンブル
df_train["pred_ensemble1"] = (df_train["pred1"] + df_train["pred2"] + df_train["pred3"]) / 3
df_train.head()

Unnamed: 0,true,pred1,pred2,pred3,pred_ensamble1,pred_ensemble1
0,1,0.668853,0.95356,0.735708,0.78604,0.78604
1,0,0.566234,0.226771,0.323684,0.37223,0.37223
2,0,0.419504,0.492005,0.252806,0.388105,0.388105
3,1,0.63583,0.80238,0.941991,0.7934,0.7934
4,0,0.215446,0.127587,0.106397,0.14981,0.14981


In [65]:
# アンサンブル用の精度評価関数と精度評価
def evaluate_ensemnble(input_df, col_pred):
    print("[auc] model1:{:.4f}, model2: {:.4f}, model3: {:.4f} -> ensemble: {:.4f}".format(
        roc_auc_score(input_df["true"], input_df["pred1"]),
        roc_auc_score(input_df["true"], input_df["pred2"]),
        roc_auc_score(input_df["true"], input_df["pred3"]),
        roc_auc_score(input_df["true"], input_df[col_pred]),))
evaluate_ensemnble(df_train, col_pred="pred_ensemble1")

[auc] model1:0.8029, model2: 0.8446, model3: 0.8926 -> ensemble: 0.9383


In [66]:
# 推論時のアンサンブル処理と精度評価
df_test["pred_ensemble1"] = (df_test["pred1"] + df_test["pred2"] + df_test["pred3"]) / 3
evaluate_ensemnble(df_test, col_pred="pred_ensemble1")

[auc] model1:0.8538, model2: 0.8467, model3: 0.9212 -> ensemble: 0.9601


##### 6.3.2　重み付き平均
- 各モデルの評価値（精度）をもとに決める
- 検証データの評価値をもとに決める

In [70]:
weight = [0.3, 0.3, 0.4]
weight = weight / np.sum(weight)
print(weight)

df_train["pred_ensemble2"] = df_train["pred1"] * weight[0] + df_train["pred2"] * weight[1] + df_train["pred3"] * weight[2] 
df_train[["true", "pred1", "pred2", "pred3", "pred_ensemble2"]].head()

[0.3 0.3 0.4]


Unnamed: 0,true,pred1,pred2,pred3,pred_ensemble2
0,1,0.668853,0.95356,0.735708,0.781007
1,0,0.566234,0.226771,0.323684,0.367375
2,0,0.419504,0.492005,0.252806,0.374575
3,1,0.63583,0.80238,0.941991,0.808259
4,0,0.215446,0.127587,0.106397,0.145469


In [71]:
# アンサンブルの精度評価
evaluate_ensemnble(df_train, col_pred="pred_ensemble2")

[auc] model1:0.8029, model2: 0.8446, model3: 0.8926 -> ensemble: 0.9407


In [73]:
# 推論時のアンサンブル処理と精度評価
df_test["pred_ensemble2"] = df_test["pred1"] * weight[0] + df_test["pred2"] * weight[1] + df_test["pred3"] * weight[2] 
evaluate_ensemnble(df_test, col_pred="pred_ensemble2")

[auc] model1:0.8538, model2: 0.8467, model3: 0.9212 -> ensemble: 0.9635


##### 6.3.3　スタッキング

In [77]:
# スタッキングによるアンサンブル
from sklearn.linear_model import Lasso

x, y = df_train[["pred1", "pred2", "pred3"]], df_train[["true"]]
oof = np.zeros(len(x))
models = []

cv = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=123).split(x, y))
for nfold in np.arange(5):
    # 学習データと検証データの分離
    idx_tr, idx_va = cv[nfold][0], cv[nfold][1]
    x_tr, y_tr = x.loc[idx_tr, :], y.loc[idx_tr, :]
    x_va, y_va = x.loc[idx_va, :], y.loc[idx_va, :]
    # モデル学習
    model = Lasso(alpha=0.01)
    model.fit(x_tr, y_tr)
    models.append(model)
    # 検証データの予測値算出
    y_va_pred = model.predict(x_va)
    oof[idx_va] = y_va_pred
    
df_train["pred_ensemble3"] = oof
df_train["pred_ensemble3"] = df_train["pred_ensemble3"].clip(lower=0, upper=1)
df_train[["true", "pred1", "pred2", "pred3", "pred_ensemble3"]].head()

Unnamed: 0,true,pred1,pred2,pred3,pred_ensemble3
0,1,0.668853,0.95356,0.735708,0.685202
1,0,0.566234,0.226771,0.323684,0.053733
2,0,0.419504,0.492005,0.252806,0.074234
3,1,0.63583,0.80238,0.941991,0.769464
4,0,0.215446,0.127587,0.106397,0.0


In [78]:
evaluate_ensemnble(df_train, col_pred="pred_ensemble3")

[auc] model1:0.8029, model2: 0.8446, model3: 0.8926 -> ensemble: 0.9398


In [79]:
df_test["pred_ensemble3"] = 0
for model in models:
    df_test["pred_ensemble3"] += model.predict(df_test[["pred1", "pred2", "pred3"]]) / len(models)
df_test["pred_ensemble3"] = df_test["pred_ensemble3"].clip(lower=0, upper=1)
evaluate_ensemnble(df_test, col_pred="pred_ensemble3")

[auc] model1:0.8538, model2: 0.8467, model3: 0.9212 -> ensemble: 0.9662
