This notebook shows the hyperparameter tuning process.
Overview of the code below:


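1. Data preprocessing and oversampling
2. Importing the libraries needed for machine learning
3. Defining the classifiers (parameters at their default values)
4. Grid search for LogisticRegression
5. Grid search for DecisionTree
6. Grid search for KNeighbors
7. Grid search for SVC
8. Grid search for RandomForest
9. Grid search for AdaBoost
10. Grid search for GradientBoost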

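Conclusion: The SVC tuning did not finish even after more than 10 hours, and I gave up because it exceeded what my PC could handle. Computational cost was a problem for every classifier, and the cost of tuning in particular was too high. The possible cost factors include the number of cross-validation folds, the number of samples in the dataset, and the number of features; I judged that reducing the number of features should be the first measure.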
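
The cost of an exhaustive grid search grows with the product of the option counts of every parameter, multiplied by the number of folds. A minimal sketch of that arithmetic for two of the grids used later in this notebook; `count_fits` is a hypothetical helper, not part of scikit-learn, and the count excludes the final refit of the best candidate:

```python
from math import prod

def count_fits(param_grid, cv=5):
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    candidates = prod(len(values) for values in param_grid.values())
    return candidates, candidates * cv

knn_grid = {"n_neighbors": list(range(1, 11)), "weights": ["uniform", "distance"]}
rfc_grid = {"max_features": ["auto", "sqrt", "log2"],
            "max_depth": [3, 5, 9, None],
            "criterion": ["gini", "entropy", "log_loss"],
            "n_estimators": [50, 100, 200]}

print(count_fits(knn_grid))  # (20, 100)
print(count_fits(rfc_grid))  # (108, 540)
```

The KNeighbors figure agrees with the "Fitting 5 folds for each of 20 candidates, totalling 100 fits" message printed by that cell.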





In [1]:
# Data preprocessing and oversampling
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
df=pd.read_csv("train.csv")
df_dummy=pd.get_dummies(df[["job","marital","education",
                            "default","housing","loan",
                            "contact","month","poutcome","subscribed",
                            "age","balance","day","previous",
                            "duration","campaign","pdays"]],drop_first=True)
X,y=df_dummy.iloc[:,0:42],df_dummy.iloc[:,42].values
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)
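sc=StandardScaler()
X_train=sc.fit_transform(X_train)
# Reuse the mean and variance learned from the training split;
# refitting on the test split would scale the two sets inconsistently.
X_test=sc.transform(X_test)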
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=2)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel())
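
The grid searches below cross-validate on data that was scaled and oversampled before the folds were created, so every validation fold has already influenced the preprocessing. One way to avoid this is to place the preprocessing inside a `Pipeline`, which is then refit on each training fold. A minimal sketch on synthetic data from `make_classification` (an assumption, not `train.csv`); the step names `scale` and `clf` are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption, not train.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # refit on each training fold
    ("clf", LogisticRegression(random_state=1)),
])
# Parameters of a pipeline step are addressed as "<step>__<parameter>".
pipe_cv = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="accuracy")
pipe_cv.fit(X_demo, y_demo)
print(pipe_cv.best_params_, pipe_cv.best_score_)
```

imbalanced-learn ships its own `imblearn.pipeline.Pipeline`, which accepts samplers such as SMOTE as steps in the same way.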

In [2]:
# Import the classifiers, grid search, and other tools needed for machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [3]:
# Define the classifiers: parameters are left at their defaults because this is before tuning
log=LogisticRegression(random_state=1)
dt=DecisionTreeClassifier(random_state=1)
knn=KNeighborsClassifier()
svc=SVC(probability=True)
rfc=RandomForestClassifier(random_state=1)
ada=AdaBoostClassifier(base_estimator=dt,random_state=1)
gb=GradientBoostingClassifier(random_state=1)

In [None]:
# Grid search for LogisticRegression
# lbfgs supports only the l2 penalty, so each solver gets its own sub-grid
# to avoid failed fits for the l1 + lbfgs combination.
log_param=[{"C":[0.001,0.01,0.1,1,10,100,1000],
            "penalty":["l2"],
            "solver":["lbfgs"]},
           {"C":[0.001,0.01,0.1,1,10,100,1000],
            "penalty":["l1","l2"],
            "solver":["liblinear"]}]
log_cv=GridSearchCV(estimator=log,param_grid=log_param,cv=5,scoring="accuracy")
log_cv.fit(X_train_res,y_train_res)
print("tuned hyperparameters :(best parameters) ",log_cv.best_params_)
print("accuracy :",log_cv.best_score_)
#tuned hyperparameters :(best parameters)  {'C': 0.01, 'penalty': 'l2', 'solver': 'liblinear'}
#accuracy : 0.8482764254281623

In [None]:
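# Grid search for the decision tree
# Note: for classifiers 'auto' means the same as 'sqrt' (and was removed in scikit-learn 1.3),
# and 'log_loss' is equivalent to 'entropy', so this grid contains duplicate candidates.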
dt_param={'max_features': ['auto', 'sqrt', 'log2'],
              'max_depth' : [3,5,7,None],
              'criterion' :['gini', 'entropy',"log_loss"]}
dt_cv=GridSearchCV(estimator=dt,param_grid=dt_param,
                           cv=5,scoring="accuracy",verbose=True)
dt_cv.fit(X_train_res, y_train_res)

print("tuned hyperparameters :(best parameters) ",dt_cv.best_params_)
print("accuracy :",dt_cv.best_score_)
#tuned hyperparameters :(best parameters)  {'criterion': 'entropy', 'max_depth': None, 'max_features': 'auto'}
#accuracy : 0.9020056468076799

In [6]:
# Grid search for KNeighbors
knn_param={"n_neighbors":list(range(1,11)),
           "weights":["uniform","distance"]}
knn_cv=GridSearchCV(estimator=knn,param_grid=knn_param,
                    cv=5,scoring="accuracy",verbose=True)
knn_cv.fit(X_train_res, y_train_res)
print("tuned hyperparameters :(best parameters) ",knn_cv.best_params_)
print("accuracy :",knn_cv.best_score_)
#tuned hyperparameters :(best parameters)  {'n_neighbors': 1, 'weights': 'uniform'}
#accuracy : 0.9481639928698751

Fitting 5 folds for each of 20 candidates, totalling 100 fits
tuned hyperparameters :(best parameters)  {'n_neighbors': 6, 'weights': 'distance'}
accuracy : 0.9950898671920827
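
`best_score_` is the mean cross-validated accuracy on the oversampled training data, not a score on unseen data. A sketch of checking the refitted best model against a held-out split, on synthetic data (an assumption, not `train.csv`); in this notebook the equivalent call would be `knn_cv.score(X_test, y_test)`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo)

demo_cv = GridSearchCV(KNeighborsClassifier(),
                       {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
                       cv=5, scoring="accuracy")
demo_cv.fit(X_tr, y_tr)

cv_acc = demo_cv.best_score_          # mean accuracy over the validation folds
test_acc = demo_cv.score(X_te, y_te)  # refitted best model on rows it has never seen
print(f"CV accuracy: {cv_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```

A large gap between the two numbers would suggest that the cross-validated score is optimistic.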


In [None]:
# Grid search for SVC: abandoned because it took more than 10 hours.
svc_param={"kernel":["linear","rbf"],
           "C":[0.01,1,100]}
svc_cv=GridSearchCV(estimator=svc,param_grid=svc_param,
                    cv=5,scoring="accuracy",verbose=True)
svc_cv.fit(X_train_res, y_train_res)
print("tuned hyperparameters :(best parameters) ",svc_cv.best_params_)
print("accuracy :",svc_cv.best_score_)
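
Several settings add to the cost of this search: `probability=True` makes every fit run an extra internal 5-fold cross-validation, the fits run one after another because `n_jobs` is left at its default, and kernel SVC training time grows at least quadratically with the number of samples. A hedged sketch of a cheaper search on synthetic data (an assumption): probabilities disabled, fits run in parallel, and a stratified subsample used for tuning. The subsample size of 500 is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=1)

# Tune on a stratified subsample; the full data can still be used for the final fit.
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=500, random_state=1, stratify=y_demo)

svc_fast = SVC(probability=False)  # skip the internal calibration cross-validation
svc_fast_cv = GridSearchCV(svc_fast,
                           {"kernel": ["linear", "rbf"], "C": [0.01, 1, 100]},
                           cv=5, scoring="accuracy", n_jobs=-1)
svc_fast_cv.fit(X_sub, y_sub)
print(svc_fast_cv.best_params_, svc_fast_cv.best_score_)
```

Parameters found on a subsample are only an approximation of what the full data would select, so this trades some accuracy of the search for speed.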

In [None]:
# Grid search for RandomForest
rfc_param={'max_features':['auto','sqrt','log2'],
           'max_depth' : [3,5,9,None],
           'criterion' :['gini','entropy',"log_loss"],
           "n_estimators":[50,100,200]}

rfc_cv=GridSearchCV(estimator=rfc,param_grid=rfc_param,
                           cv=5,scoring="accuracy",verbose=True)
rfc_cv.fit(X_train_res, y_train_res)
print("tuned hyperparameters :(best parameters) ",rfc_cv.best_params_)
print("accuracy :",rfc_cv.best_score_)
#tuned hyperparameters :(best parameters)  {'criterion': 'entropy', 'max_depth': None, 'max_features': 'log2', 'n_estimators': 200}
#accuracy : 0.9475761369072904
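
Since reducing the number of features was judged the first measure against the tuning cost, one option is to rank the features with a fitted forest and keep only the more important ones. A sketch using `SelectFromModel` on synthetic data (an assumption, not `train.csv`); the `"median"` threshold, which keeps roughly the top half of the features, is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, random_state=1)

# Keep the features whose importance is at or above the median importance.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=1),
                           threshold="median")
X_reduced = selector.fit_transform(X_demo, y_demo)
print(X_demo.shape, "->", X_reduced.shape)
print("kept feature indices:", selector.get_support(indices=True))
```

The reduced matrix can then be passed to the grid searches in place of the full one.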

In [None]:
#adaboost gridsearch
ada_dt=DecisionTreeClassifier(criterion="entropy",max_depth=9,max_features="log2")
ada_grid=AdaBoostClassifier(random_state=1,base_estimator=ada_dt)
ada_param={"n_estimators":[50,100,200],
           "learning_rate":[0.01,0.1,1]}
ada_cv=GridSearchCV(estimator=ada_grid,param_grid=ada_param,
                           cv=5,scoring="accuracy",verbose=True)
ada_cv.fit(X_train_res, y_train_res)
print("tuned hyperparameters :(best parameters) ",ada_cv.best_params_)
print("accuracy :",ada_cv.best_score_)
#tuned hyperparameters :(best parameters)  {'learning_rate': 1, 'n_estimators': 200}
#accuracy : 0.9454282309645061

In [None]:
#gradientboost gridsearch
gb_param={"n_estimators":[50,100,200,500],
           "max_depth":[3,5,9],
           "learning_rate":[0.01,0.1,1]}
gb_cv=GridSearchCV(estimator=gb,param_grid=gb_param,
                           cv=5,scoring="accuracy",verbose=True)
gb_cv.fit(X_train_res, y_train_res)
print("tuned hyperparameters :(best parameters) ",gb_cv.best_params_)
print("accuracy :",gb_cv.best_score_)
#tuned hyperparameters :(best parameters)  {'learning_rate': 0.1, 'max_depth': 9, 'n_estimators': 100}
#accuracy : 0.925072565141198
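
The grid above has 4 × 3 × 3 = 36 candidates, which means 180 fits at `cv=5`, and the largest candidates train 500 trees of depth 9. One common way to cap this cost is to sample a fixed number of candidates with `RandomizedSearchCV` instead of trying all of them. A minimal sketch on synthetic data (an assumption); `n_iter=5`, `cv=3`, and the reduced value lists are illustrative choices, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

gb_dist = {"n_estimators": [20, 50, 100],
           "max_depth": [2, 3, 5],
           "learning_rate": [0.01, 0.1, 1]}
# Only n_iter of the 27 combinations are drawn, so the search costs n_iter * cv fits.
gb_rs = RandomizedSearchCV(GradientBoostingClassifier(random_state=1),
                           param_distributions=gb_dist, n_iter=5, cv=3,
                           scoring="accuracy", random_state=1)
gb_rs.fit(X_demo, y_demo)
print(len(gb_rs.cv_results_["params"]), gb_rs.best_params_)
```

The sampled search may miss the best grid point, so the budget in `n_iter` is a trade-off between run time and coverage.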