# WORK: Basic regression and classification

<h3>[Pre-Work]</h3><br>
<p>Check which datasets ship with scikit-learn, then build models and evaluate algorithms on the datasets specified below.</p>
<p>5.2. Toy datasets</p>
•http://scikit-learn.org/stable/datasets/index.html
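One quick way to see which bundled datasets are available is to list the `load_*` functions of `sklearn.datasets`. This is only a sketch: it relies on the `load_` naming convention, so a few file-loading helpers such as `load_files` also appear, and the exact list depends on the installed scikit-learn version.

```python
# List the loader functions exposed by the installed scikit-learn.
from sklearn import datasets

# Assumption: bundled datasets follow the "load_" naming convention.
loaders = sorted(name for name in dir(datasets) if name.startswith('load_'))
for name in loaders:
    print(name)
```

Each loader returns an object whose `DESCR` attribute holds the dataset description, which is what the cells below print.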



<h3>[Work-1]</h3><br>
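<p>We examine the Boston house-prices regression dataset.</p>
- Train models using linear regression (OLS) and ridge regression.
- To reduce overfitting, try two arbitrary values of α for ridge regression (e.g. alpha=1.0, alpha=10.0).
- Use the holdout method to prepare 80% training data and 20% test data.
- Compare the R2 score of each algorithm and consider which one performs best.
- Using the algorithm with the highest R2 score, predict the house price of the first test sample.<br>
※ How far is the prediction from the actual house price of the first test sample?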

Check the dataset.

In [1]:
# import the Boston house-prices dataset for regression
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older release)
import pandas as pd
from sklearn.datasets import load_boston
dataset = load_boston()

# set dataframe
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['MEDV'])

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print(y.describe())
print('----------------------------------------------------------------------------------------')
print(X.join(y).head())
print('----------------------------------------------------------------------------------------')
print(dataset.DESCR)

----------------------------------------------------------------------------------------
X shape: (506,13)
y shape: (506,1)
----------------------------------------------------------------------------------------
             MEDV
count  506.000000
mean    22.532806
std      9.197104
min      5.000000
25%     17.025000
50%     21.200000
75%     25.000000
max     50.000000
----------------------------------------------------------------------------------------
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  MEDV  
0     15.3  396.90   4.98  24.0  
1     17.8  396.90   9.14  21

Write code in the [----------] placeholders below.

In [2]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,Ridge # linear regression and ridge regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

# set dataframe
dataset = load_boston()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])
#y = pd.Series(dataset.target, name='y') # keeping y as a Series would be better

# cross-validation(holdout)
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.20, random_state=1)

# make pipelines for modeling # better to write this so that it still works with 100 or 200 pipelines
# OLS
pipe_ols =Pipeline([('scl',StandardScaler()),
                    ('est',LinearRegression())])
# Ridge
pipe_ridge_1 =Pipeline([('scl',StandardScaler()),
                       ('est',Ridge(alpha=1.0, random_state=0))])

pipe_ridge_2 =Pipeline([('scl',StandardScaler()),
                       ('est',Ridge(alpha=10.0, random_state=0))])

# build models
# (.values replaces DataFrame.as_matrix(), which was removed in pandas 1.0)
pipe_ols.fit(X_train, y_train.values.ravel())
pipe_ridge_1.fit(X_train, y_train.values.ravel())
pipe_ridge_2.fit(X_train, y_train.values.ravel())

# get R2 score
y_true = y_test.values.ravel()

# print the performance
# Write the code that prints the R2 scores here.
print('OLS  : %.6f' % r2_score(y_test,pipe_ols.predict(X_test)))
print('Ridge: %.6f' % r2_score(y_test,pipe_ridge_1.predict(X_test)))
print('Ridge: %.6f' % r2_score(y_test,pipe_ridge_2.predict(X_test)))

# OLS scores highest

# Predict the first data of test data
#test = X_test.[----------]
#print([----------])

OLS  : 0.763481
Ridge: 0.763468
Ridge: 0.761724
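The comparison above can also be written as a loop over a dictionary of estimators, so that adding another alpha takes one extra value rather than another hand-written pipeline. The sketch below is illustrative only: it uses synthetic data from `make_regression` as a stand-in (an assumption, chosen because `load_boston` is unavailable in recent scikit-learn releases), and the names `estimators`, `scores` and `coef_norms` are hypothetical. It also prints the norm of the fitted coefficients, which shrinks as alpha grows; that shrinkage is how ridge regression limits overfitting.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): same shape as the Boston table.
X, y = make_regression(n_samples=506, n_features=13, noise=20.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# One entry per estimator; trying another alpha only needs another value here.
estimators = {'OLS': LinearRegression()}
for alpha in (1.0, 10.0, 100.0):
    estimators['Ridge(alpha=%g)' % alpha] = Ridge(alpha=alpha)

scores = {}
coef_norms = {}
for name, est in estimators.items():
    pipe = Pipeline([('scl', StandardScaler()), ('est', est)])
    pipe.fit(X_train, y_train)
    scores[name] = r2_score(y_test, pipe.predict(X_test))
    # L2 norm of the fitted coefficients; a larger alpha pulls it toward zero.
    coef_norms[name] = float(np.linalg.norm(pipe.named_steps['est'].coef_))
    print('%-18s R2=%.6f  ||coef||=%.3f' % (name, scores[name], coef_norms[name]))

best_name = max(scores, key=scores.get)
print('best:', best_name)
```

Which model scores best depends on the data and the split, so the ranking printed here says nothing about the Boston results above.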




<h3>[Work-2]</h3><br>
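<p>We examine the diabetes regression dataset.</p>
- Train models using random forest and gradient boosting.
- Use the holdout method to prepare 80% training data and 20% test data.
- Compare the R2 score of each algorithm and consider which one performs best.
- Using the algorithm with the highest R2 score, predict the diabetes progression measure of the first test sample.<br>
※ How far is the prediction from the actual value of the first test sample?<br>
- Look for other algorithms that perform better.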

Check the dataset.

In [3]:
# import the diabetes dataset for regression
import pandas as pd
from sklearn.datasets import load_diabetes
dataset = load_diabetes()

# set dataframe
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print(y.describe())
print('----------------------------------------------------------------------------------------')
print(X.join(y).head())
print('----------------------------------------------------------------------------------------')
print(dataset.DESCR)

----------------------------------------------------------------------------------------
X shape: (442,10)
y shape: (442,1)
----------------------------------------------------------------------------------------
                y
count  442.000000
mean   152.133484
std     77.093005
min     25.000000
25%     87.000000
50%    140.500000
75%    211.500000
max    346.000000
----------------------------------------------------------------------------------------
        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401   
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412   
2  0.085299  0.050680  0.044451 -0.005671 -0.045599 -0.034194 -0.032356   
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038   
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142   

         s4        s5        s6      y  
0 -0.002592  0.019908 -0.017646  151.0  
1 -

Replace the [----------] placeholders below with appropriate statements.

In [8]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression,Ridge # linear regression and ridge regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor # random forest and gradient boosting
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

# set dataframe
dataset = load_diabetes()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])

# cross-validation(holdout)
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.20, random_state=1)

# make pipelines for modeling

pipe_ols =Pipeline([('scl',StandardScaler()),
                    ('est',LinearRegression())])

pipe_ridge =Pipeline([('scl',StandardScaler()),
                       ('est',Ridge(alpha=1.0, random_state=1))])

# RandomForest
pipe_rf =Pipeline([('scl',StandardScaler()),
                    ('est',RandomForestRegressor(random_state=1))])
# GradientBoosting
pipe_gbr = Pipeline([('scl',StandardScaler()),
                    ('est',GradientBoostingRegressor(random_state=1))])

# build models
# (.values replaces DataFrame.as_matrix(), which was removed in pandas 1.0)
pipe_ols.fit(X_train, y_train.values.ravel())
pipe_ridge.fit(X_train, y_train.values.ravel())
pipe_rf.fit(X_train, y_train.values.ravel())
pipe_gbr.fit(X_train, y_train.values.ravel())

# get R2 score
y_true = y_test.values.ravel()

# print the performance
# Write the code that prints the R2 scores here.
print('OLS                      : %.6f' % r2_score(y_test,pipe_ols.predict(X_test)))
print('Ridge                    : %.6f' % r2_score(y_test,pipe_ridge.predict(X_test)))
print('RandomForest             : %.6f' % r2_score(y_test,pipe_rf.predict(X_test)))
print('GradientBoostingRegressor: %.6f' % r2_score(y_test,pipe_gbr.predict(X_test)))

## Predict the first data of test data
#test = X_test.[----------]
#print([----------])

OLS                      : 0.438436
Ridge                    : 0.436251
RandomForest             : 0.200165
GradientBoostingRegressor: 0.297574
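One possible way to finish the exercise is sketched below: pick the pipeline with the highest R2 score, predict the first test row, and compare the prediction with the actual target. The names `candidates`, `fitted` and `first_row` are hypothetical, and this is not necessarily how the placeholders were meant to be filled.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

dataset = load_diabetes()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')  # a Series avoids the ravel() step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

candidates = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'RandomForest': RandomForestRegressor(random_state=1),
    'GradientBoosting': GradientBoostingRegressor(random_state=1),
}

fitted, scores = {}, {}
for name, est in candidates.items():
    fitted[name] = Pipeline([('scl', StandardScaler()), ('est', est)]).fit(X_train, y_train)
    scores[name] = r2_score(y_test, fitted[name].predict(X_test))

best_name = max(scores, key=scores.get)

# iloc[[0]] (double brackets) keeps the first test row as a 1-row DataFrame,
# which is the 2-D shape predict() expects.
first_row = X_test.iloc[[0]]
predicted = float(fitted[best_name].predict(first_row)[0])
actual = float(y_test.iloc[0])
print('best model: %s (R2=%.6f)' % (best_name, scores[best_name]))
print('predicted=%.1f actual=%.1f abs. error=%.1f' % (predicted, actual, abs(predicted - actual)))
```

To try another algorithm, add one more entry to `candidates`; the rest of the code stays unchanged.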




<h3>[Work-3]</h3><br>
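<p>We examine the breast cancer classification dataset.</p>
- Train models using logistic regression and random forest.
- Use the holdout method to split the data into 80% training data and 20% test data.
- Compare the accuracy of each algorithm.
- Using the algorithm with the highest accuracy, classify the first test sample as benign or malignant.
- Also compare the algorithms on precision, recall, and F-measure.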

Check the dataset.

In [9]:
# Load and return the breast cancer wisconsin dataset (classification).
# The breast cancer dataset is a classic and very easy binary classification dataset.
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Set dataframe
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print(y.groupby('y').size())
print('y=0 means Malignant, y=1 means Benign:')
print('----------------------------------------------------------------------------------------')
print(X.join(y).head())

----------------------------------------------------------------------------------------
X shape: (569,30)
y shape: (569,1)
----------------------------------------------------------------------------------------
y
0    212
1    357
dtype: int64
y=0 means Malignant, y=1 means Benign:
----------------------------------------------------------------------------------------
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1          

Write code in the [----------] placeholders below.

In [14]:
# import basic APIs
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.ensemble import RandomForestClassifier # random forest
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# import Sample Data to learn models
dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])

# split data for crossvalidation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# set pipelines for different algorithms
pipe_logistic = Pipeline([('scl',StandardScaler()),
                    ('est',LogisticRegression(random_state=1))])

pipe_rf = Pipeline([('scl',StandardScaler()),
                    ('est',RandomForestClassifier(random_state=1))])

# fit & evaluation
# (.values replaces DataFrame.as_matrix(), which was removed in pandas 1.0)
pipe_logistic.fit(X_train, y_train.values.ravel())
pipe_rf.fit(X_train, y_train.values.ravel())

# print the performance
print('Accuracy of Logistic regression: %.3f' % accuracy_score(y_test, pipe_logistic.predict(X_test)))
print('Accuracy of Random forest classifier: %.3f' % accuracy_score(y_test, pipe_rf.predict(X_test)))

# Predict the first data of test data
#test = X_test.[----------]
#print([----------])

Accuracy of Logistic regression: 0.982
Accuracy of Random forest classifier: 0.947
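The task list also asks for precision, recall, and F-measure, which the cell above does not compute. A minimal sketch of that comparison follows, using hypothetical names (`classifiers`, `report`). Note that scikit-learn's metric functions treat label 1 as the positive class by default, which in this dataset means benign; pass `pos_label=0` if the metrics should be read with malignant as the positive class.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

classifiers = {
    'LogisticRegression': LogisticRegression(random_state=1),
    'RandomForest': RandomForestClassifier(random_state=1),
}

fitted, report = {}, {}
for name, clf in classifiers.items():
    fitted[name] = Pipeline([('scl', StandardScaler()), ('est', clf)]).fit(X_train, y_train)
    y_pred = fitted[name].predict(X_test)
    # Default pos_label=1, i.e. "benign" is the positive class here.
    report[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
    }

# One row per model, one column per metric.
print(pd.DataFrame(report).T)

# Classify the first test sample with the most accurate model.
best_name = max(report, key=lambda n: report[n]['accuracy'])
label = int(fitted[best_name].predict(X_test.iloc[[0]])[0])
first_prediction = dataset.target_names[label]
print('first test sample -> %s (predicted by %s)' % (first_prediction, best_name))
```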




<h3>[Work-4]</h3><br>
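<p>We examine the wine classification dataset.</p>
- Train a model using the k-nearest neighbors method.<br>
　　※ k-nearest neighbors supports multiclass classification.
- Use the holdout method to split the data into 80% training data and 20% test data.
- Compute the accuracy of the algorithm with its default settings.
- Using the algorithm with its default settings, determine which wine the first test sample is.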

Check the dataset.

In [15]:
# Load and return the wine dataset (classification).
import pandas as pd
from sklearn.datasets import load_wine

dataset = load_wine()

# Set dataframe
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print(y.groupby('y').size())
print('y=0 means wine1, y=1 means wine2, y=2 means wine3:')
print('----------------------------------------------------------------------------------------')
print(X.join(y).head())

----------------------------------------------------------------------------------------
X shape: (178,13)
y shape: (178,1)
----------------------------------------------------------------------------------------
y
0    59
1    71
2    48
dtype: int64
y=0 means wine1, y=1 means wine2, y=2 means wine3:
----------------------------------------------------------------------------------------
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5

Replace the [----------] placeholders below with appropriate statements.

In [19]:
# import basic APIs
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier # k-nearest neighbors
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# import Sample Data to learn models
dataset = load_wine()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])

# split data for crossvalidation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 , random_state=1)

# set pipelines for different algorithms
pipe_knn =  Pipeline([('scl',StandardScaler(copy=True, with_mean=True, with_std=True)),
                    ('est',KNeighborsClassifier(algorithm='auto', 
                                                leaf_size=30, 
                                                metric='minkowski',
                                                metric_params=None, 
                                                n_jobs=1, 
                                                n_neighbors=5, 
                                                p=2,
                                                weights='uniform'))])
# fit & evaluation
pipe_knn.fit(X_train, y_train.values.ravel())

# print the performance
# Write the code that prints the accuracy here.
print('Accuracy of KNeighborsClassifier: %.3f' % accuracy_score(y_test, pipe_knn.predict(X_test)))

# Predict the first data of test data
#test = X_test.[----------]
#print([----------])

Accuracy of KNeighborsClassifier: 0.972
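The remaining step, classifying the first test sample, might look like the sketch below. It uses `KNeighborsClassifier()` with default arguments (five neighbors, uniform weights) and maps the predicted label to `dataset.target_names`; the variable names are hypothetical. `predict_proba` is included to show how the five neighbors voted.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

dataset = load_wine()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.Series(dataset.target, name='y')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Scaling matters for k-NN because distances are dominated by large-valued features.
pipe_knn = Pipeline([('scl', StandardScaler()), ('est', KNeighborsClassifier())])
pipe_knn.fit(X_train, y_train)
accuracy = accuracy_score(y_test, pipe_knn.predict(X_test))
print('Accuracy of KNeighborsClassifier: %.3f' % accuracy)

# Keep the first test row 2-D with iloc[[0]] so that predict() accepts it.
first_row = X_test.iloc[[0]]
pred_label = int(pipe_knn.predict(first_row)[0])
class_probs = pipe_knn.predict_proba(first_row)[0]  # share of neighbors per class
print('predicted class:', dataset.target_names[pred_label])
print('neighbor vote shares:', dict(zip(dataset.target_names, class_probs)))
print('actual class:', dataset.target_names[int(y_test.iloc[0])])
```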




Out[9]:
Pipeline(memory=None,
     steps=[('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('est', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))])

In [2]:
# Load and return the breast cancer wisconsin dataset (classification).
# The breast cancer dataset is a classic and very easy binary classification dataset.
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# Set dataframe
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = pd.DataFrame(dataset.target, columns=['y'])

# check the shape
print('----------------------------------------------------------------------------------------')
print('X shape: (%i,%i)' %X.shape)
print('y shape: (%i,%i)' %y.shape)
print('----------------------------------------------------------------------------------------')
print(y.groupby('y').size())
print('y=0 means Malignant, y=1 means Benign:')
print('----------------------------------------------------------------------------------------')
print(X.join(y).head())
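The script that follows selects a metric through `SCORE`, and its header mentions nested cross-validation with grid search. A minimal sketch of that idea is given here for orientation; it is not the script's own implementation. The parameter grid, the fold counts, and the names `X_demo`, `y_demo` and `scoring` are assumptions chosen so the example stays small and does not overwrite the notebook's `X`, `y` or `SCORE`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
scoring = 'precision'  # any name from the scoring-parameter list works here

pipe = Pipeline([('scl', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000, random_state=1))])
param_grid = {'clf__C': [0.1, 1.0, 10.0]}  # hypothetical grid

# Inner loop: hyperparameter search. Outer loop: unbiased estimate of the
# performance of "search + refit" as a whole.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid=param_grid, scoring=scoring, cv=inner_cv)
nested_scores = cross_val_score(search, X_demo, y_demo, scoring=scoring, cv=outer_cv)

print('nested CV %s: %.3f +/- %.3f' % (scoring, nested_scores.mean(), nested_scores.std()))
```

Because the outer folds never take part in choosing `C`, the mean of `nested_scores` is a less optimistic estimate than `best_score_` from a single grid search.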

# -*- coding: utf-8 -*-
# 2018/08/10 Switched to RandomizedSearch
# 2018/08/19 Ran nested CV with grid search
### Reference: https://qiita.com/msrks/items/e3e958c04a5167575c41

# SET PARAMETERS
### dm_for_model2
#file_model = 'dm_for_model2.csv'
#file_score = 'dm_for_fwd2.csv'
#ohe_cols = ['mode_category']

ohe_cols = []

### av_loan
#file_model = 'av_loan_u6lujuX_CVtuZ9i.csv'
#file_score = 'av_loan_test_Y3wMUE5_7gLdaTN.csv'
#ohe_cols = ['Gender',
#			'Married',
#			'Dependents',
#			'Education',
#			'Self_Employed',
#			'Credit_History',
#			'Property_Area'
#			]

# Choose one of the following scoring parameters.
# A clear explanation (in Japanese): https://blog.amedama.jp/entry/2017/12/18/005311
#SCORE = 'accuracy' # Accuracy = (TP+TN)/(TP+FP+FN+TN)
#SCORE = 'roc_auc'  # metric based on the false positive rate and the true positive rate
SCORE = 'precision'# fraction of predicted positives that are actually positive [exactness]; prioritize precision when FN is tolerable (use it when FP must be avoided)
#SCORE = 'recall'    # fraction of actual positives that are predicted positive [coverage]; prioritize recall when FP is tolerable (use it when FN must be avoided)
#SCORE = 'f1'       # harmonic mean of precision (PRE) and recall (REC)
#SCORE = 'f1_macro'
#SCORE = 'f1_micro'
#SCORE = 'neg_log_loss'
# Valid options are ['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 
#'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 
#'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 
#'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 
#'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']
# http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

# make dir for modeling file
import os
model_path = './model'
if not os.path.exists(model_path ):
    os.makedirs(model_path )

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning) 

import time
import csv
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler,MaxAbsScaler
from sklearn.impute import SimpleImputer # replaces sklearn.preprocessing.Imputer (removed in scikit-learn 0.22)
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier,VotingClassifier,AdaBoostClassifier,GradientBoostingClassifier
import xgboost as xgb
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix,classification_report
import joblib # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV # sklearn.grid_search was removed in scikit-learn 0.20
from scipy.stats import randint as sp_randint

# data load
#df = pd.read_csv('./data/'+ file_model + '.csv', header=0)
#df = pd.read_csv('./data/'+ file_model, header=0)
#ID = df.iloc[:,0] 
#y = df.iloc[:,1]
#X = df.iloc[:,2:]

## TEST
## print(y)
## print(X)

print('Original features of the model data (X):', X.shape)
print('Label counts of the model data')
#print(y.value_counts())

# preprocessing-1: one-hot encoding
X_ohe = pd.get_dummies(X, dummy_na=True, columns=ohe_cols)
X_ohe = X_ohe.dropna(axis=1, how='all')
X_ohe_columns = X_ohe.columns.values

# preprocessing-2: null imputation
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X_ohe)
X_ohe = pd.DataFrame(imp.transform(X_ohe), columns=X_ohe_columns)
print('Features of the model data after missing-value imputation (X_ohe):', X_ohe.shape)

##########################################################################
# An exhaustive search was run here to select the best features, but
# its result barely differed from the RFECV result, so it is omitted
#from sklearn.feature_selection import RFE
#pre_pipe = Pipeline([('pre_scl', StandardScaler()),
#                     ('pre_mms', MinMaxScaler()),
#                     ('pre_rs', RobustScaler()),
#                     ('rfe', RFE(estimator=RandomForestClassifier(random_state=1),step=0.05))
#                     ])
#N_FEATURES_OPTIONS = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
#pre_param_grid = [{'rfe__n_features_to_select': N_FEATURES_OPTIONS
#            }]
#
#pre_gs = GridSearchCV(pre_pipe, cv=3, n_jobs=1, param_grid=pre_param_grid)
#pre_gs = pre_gs.fit(X_ohe, y.as_matrix().ravel())
#
#print('Best Score:', pre_gs.best_score_)
#print('Best Params', pre_gs.best_params_)
##########################################################################

# preprocessing-3: feature selection
# References on feature selection (in Japanese): https://hayataka2049.hatenablog.jp/entry/2018/01/25/075927
#						   https://www.kaggle.com/narumin/titanic-data-science-japanese-tutorial
#						   https://qiita.com/rockhopper/items/a68ceb3248f2b3a41c89
### Using RFECV
#selector = RFECV(estimator=RandomForestClassifier(random_state=0), step=0.05, cv=3, scoring=SCORE)

# The accuracy changed when scoring was set to f1 or roc_auc
##selector = RFECV(estimator=RandomForestClassifier(random_state=0), step=0.05) ### ORIGINAL 
##selector = RFECV(estimator=LogisticRegression(), step=0.05, cv=10, scoring='average_precision') NG
##selector = RFECV(estimator=xgb.XGBClassifier(random_state=1), step=0.05, cv=10, scoring=SCORE)

#selector.fit(X_ohe, y.values.ravel())
#X_ohe_selected = selector.transform(X_ohe)
#X_ohe_selected = pd.DataFrame(X_ohe_selected, columns=X_ohe_columns[selector.support_])
#print('Features of the model data after feature selection (X_ohe_selected):',X_ohe_selected.shape)
#X_ohe_selected.head()

############################################################################################
### Using SelectFromModel
##estimator = RandomForestClassifier(class_weight='balanced', random_state=0)
estimator = RandomForestClassifier(random_state=0)
estimator.fit(X_ohe, y.values.ravel())
selector = SelectFromModel(estimator, threshold='mean', prefit=True)
X_ohe_selected = selector.transform(X_ohe)
X_ohe_selected = pd.DataFrame(X_ohe_selected, columns=X_ohe_columns[selector.get_support()])
print('Features of the model data after feature selection (X_ohe_selected):',X_ohe_selected.shape)
### X_ohe_selected.head()

#print('Importance of all parameters:', estimator.feature_importances_)
## model profile NG
#imp = pd.DataFrame(estimator.feature_importances_, columns=X_ohe_columns)
#imp.T.to_csv('./data/feature_importances.csv', index=True)

### Using RFE
#selector = RFE(estimator=RandomForestClassifier(random_state=1),n_features_to_select=10,step=0.05)
#selector = RFE(estimator=RandomForestClassifier(random_state=1),n_features_to_select=pre_gs.best_params_,step=0.05)

#selector.fit(X_ohe, y.values.ravel())

#X_ohe_selected = selector.transform(X_ohe)
#X_ohe_selected = pd.DataFrame(X_ohe_selected,columns=X_ohe_columns[selector.support_])

#############################################################################################
# preprocessing-4: preprocessing of a score data along with a model dataset
#if len(file_score)>0:
    # load score data
#    dfs = pd.read_csv('./data/'+ file_score, header=0)
#    IDs = dfs.iloc[:,[0]] 
#    Xs = dfs.iloc[:,1:]
#Xs = X.iloc[201:,]
#y = df.iloc[0:200,]
#Xs_ohe = pd.get_dummies(Xs, dummy_na=True, columns=ohe_cols)
#cols_m = pd.DataFrame(None, columns=X_ohe_columns, dtype=float)

# consistent with columns set
#Xs_exp = pd.concat([cols_m, Xs_ohe])
#Xs_exp.loc[:,list(set(X_ohe_columns)-set(Xs_ohe.columns.values))] = \
#                        Xs_exp.loc[:,list(set(X_ohe_columns)-set(Xs_ohe.columns.values))].fillna(0, axis=1)
#Xs_exp = Xs_exp.drop(list(set(Xs_ohe.columns.values)-set(X_ohe_columns)), axis=1)

# re-order the score data columns
#Xs_exp = Xs_exp.reindex(columns=X_ohe_columns)
#Xs_exp = pd.DataFrame(imp.transform(Xs_exp), columns=X_ohe_columns)

#Xs_exp_selected = Xs_exp.loc[:,X_ohe_columns[selector.support_]] # when using RFECV
#Xs_exp_selected = Xs_exp.loc[:,X_ohe_columns[selector.get_support()]]

# CLASSIFIER
#model_name = 'GBC_001'
#clf = Pipeline([('scl',StandardScaler()), ('est',GradientBoostingClassifier(random_state=1))])

lr = Pipeline([('scl', StandardScaler()),
#              ('mms', MinMaxScaler()),
#              ('mas', MaxAbsScaler()),
#              ('rs', RobustScaler()),
               ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
               ('clf', LogisticRegression(C=10, class_weight='balanced', random_state=1))])

knn = Pipeline([('scl', StandardScaler()),
#               ('mms', MinMaxScaler()),
#               ('mas', MaxAbsScaler()),
#               ('rs', RobustScaler()),
                ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                ('clf', KNeighborsClassifier(algorithm='auto', 
                                                leaf_size=30, 
                                                metric='minkowski',
                                                metric_params=None, 
                                                n_jobs=1, 
                                                n_neighbors=5, 
                                                p=2,
                                                weights='uniform'))])

svm = Pipeline([('scl', StandardScaler()),
#               ('mms', MinMaxScaler()),
#               ('mas', MaxAbsScaler()),
#               ('rs', RobustScaler()),
                ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                ('clf', SVC(kernel='rbf', C=1.0, class_weight='balanced', random_state=1))])

dc = Pipeline([('scl', StandardScaler()),
#              ('mms', MinMaxScaler()),
#              ('mas', MaxAbsScaler()),
#              ('rs', RobustScaler()),
               ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
               ('clf', DecisionTreeClassifier(class_weight='balanced', random_state=1))])

rf = Pipeline([('scl', StandardScaler()),
#              ('mms', MinMaxScaler()),
#              ('mas', MaxAbsScaler()),
#              ('rs', RobustScaler()),
               ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
               ('clf', RandomForestClassifier(class_weight='balanced', random_state=1))])

rsvc = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',SVC(C=1.0, kernel='rbf', class_weight='balanced', random_state=1))])

lsvc = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',LinearSVC(C=1.0, class_weight='balanced', random_state=1))])

gb1 = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',GradientBoostingClassifier(random_state=1))])

gb2 = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,max_depth=1, random_state=1))])

gb3 = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,max_depth=3, random_state=1))])

xgb1 = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est', xgb.XGBClassifier(random_state=1))])

xgb2 = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est', xgb.XGBClassifier(n_estimators=100,learning_rate=0.1,colsample_bytree=0.8,
                                           gamma=0,max_depth=5,min_child_weight=3, subsample=0.7,objective='binary:logistic',random_state=1))])

xgb3 = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est', xgb.XGBClassifier(n_estimators=1000,learning_rate=0.1,colsample_bytree=0.8,
                                           gamma=0,max_depth=5,min_child_weight=3, subsample=0.7,objective='binary:logistic',random_state=1))])
gnb = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',GaussianNB())])

bag = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',BaggingClassifier(DecisionTreeClassifier(class_weight='balanced', random_state=1),n_estimators = 100, max_features = 0.5,random_state=1))])

mlp = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',MLPClassifier(hidden_layer_sizes=(10,5),max_iter=100,random_state=1))])

clf1 = AdaBoostClassifier(n_estimators=10)
clf2 = ExtraTreesClassifier(n_estimators=10, n_jobs=-1, criterion='gini',max_depth=1)
clf3 = xgb.XGBClassifier(n_estimators=10, nthread=-1, max_depth = 1, seed=1)
clf4 = GradientBoostingClassifier(n_estimators=10)
clf5 = RandomForestClassifier(max_depth = 1,class_weight='balanced', random_state=1)

vot = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 ('est',VotingClassifier(estimators=[('ab', clf1), ('etc', clf2), ('xgb', clf3),('gbc', clf4)], weights=[1,1,1,1], voting='soft'))])

vot2 = Pipeline([('scl',StandardScaler()),
#                ('mms', MinMaxScaler()),
#                ('mas', MaxAbsScaler()),
#                ('rs', RobustScaler()),
                 ('reduct',PCA(n_components=None, svd_solver='auto',random_state=1)),
                 # clf5 is a RandomForestClassifier, so it is labelled 'rf' rather than 'gnb'
                 ('est',VotingClassifier(estimators=[('ab', clf1), ('etc', clf2), ('xgb', clf3),('gbc', clf4),('rf', clf5)], weights=[1,1,1,1,1], voting='soft'))])

#models = [lr, knn, svm, dc, rf, rsvc, lsvc, gb1, gb2, gb3, xgb1, xgb2, xgb3, gnb, bgc, mlp]
models = [lr, knn, svm, dc, rf, rsvc, gb1, gb2, gb3, xgb1, xgb2, xgb3, bag, mlp, vot]
model_names = ['LogisticRegression',
               'KNeighborsClassifier',
               'SVM',
               'DecisionTreeClassifier',
               'RandomForestClassifier',
               'RSVC',
#               'LinearSVC',
               'GradientBoostingClassifier',
               'GradientBoostingClassifier',
               'GradientBoostingClassifier',
               'XGBClassifier',
               'XGBClassifier',
               'XGBClassifier',
#               'GaussianNB',
               'BaggingClassifier',
               'MLPClassifier',
               'VotingClassifier',
#               'VotingClassifier'
              ]

#for loop in range(100):
#    x_train, x_test, y_train, y_test = train_test_split(X_ohe_selected,
#                                                        y.as_matrix().ravel(),
#                                                        test_size=0.2)
#with open('./data/' + 'model_score_list.csv', 'w') as f:
#writer = csv.writer(f, lineterminator='\n') # specify the newline character (\n)
#writer.writerows([[round(np.average(results), 4), model_name]]) # 2-D lists can be written as well

print('===============================================================================================================')

BEST_SCORE = 0
MODEL_RANKING = pd.DataFrame()
### https://hayataka2049.hatenablog.jp/entry/2018/03/14/112454 # on evaluation metrics
for model_name, model in zip(model_names, models):
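    model.fit(X_ohe_selected, y.values.ravel())
    # NOTE: there is no hold-out split here, so both scores below are computed on the
    # training data and are identical; the cross-validated scores are used for ranking.
    train_score = model.score(X_ohe_selected, y.values.ravel())
    test_score = model.score(X_ohe_selected, y.values.ravel())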
 
    results = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring=SCORE, cv=3)
    results1 = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring='roc_auc', cv=3)
    results2 = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring='f1_macro', cv=3) 
    results3 = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring='accuracy', cv=3) 
    results4 = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring='f1', cv=3) 
    results5 = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring='precision', cv=3) 
    results6 = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring='recall', cv=3) 
    
    #results = cross_val_score(model, X_ohe_selected, y.values.ravel(), scoring=score)
#    print(SCORE, 'score:\t', round(np.average(results), 4), '+-', round(np.std(results), 4),',\t', model_name)
#    print('===============================================================================================================')

    print(model_name, ':\t', SCORE, ':', round(np.average(results), 4), '+-', round(np.std(results), 4))    
    print('roc_auc:', round(np.average(results1), 4), '+-', round(np.std(results1), 4),',\t',
          'f1:', round(np.average(results4), 4), '+-', round(np.std(results4), 4),',\t',
          'accuracy:', round(np.average(results3), 4), '+-', round(np.std(results3), 4),',\t',
          'precision:', round(np.average(results5), 4), '+-', round(np.std(results5), 4),',\t',
          'recall:', round(np.average(results6), 4), '+-', round(np.std(results6), 4),',\t',
          'f1_macro:', round(np.average(results2), 4), '+-', round(np.std(results2), 4)
          ,'\n')
#    print(classification_report(y.as_matrix().ravel(), model.predict(X_ohe_selected)))

    MODEL_SCORE_LIST = pd.DataFrame([[model_name, SCORE, round(np.average(results), 4), model]] )
    MODEL_SCORE_LIST.to_csv('./data/' + 'model_score_list.csv', mode='a', header=False, index=False)
    
    if BEST_SCORE < np.average(results):
        BEST_SCORE = np.average(results)
        BEST_ALGORITHM = model_name
        BEST_MODEL = model
print('===============================================================================================================')
print('Best algorithm:', BEST_ALGORITHM)
print('Best', SCORE ,'score:', BEST_SCORE)
print('Best_model:', BEST_MODEL)
print('===============================================================================================================')
print('Running grid search for', BEST_ALGORITHM)

# TEST 
#BEST_ALGORITHM = 'RandomForestClassifier'
# voting ensemble with xgboost
# https://www.kaggle.com/pablonieto/eeg-analysis-voting-ensemble-python-2-7
# round(len(X_ohe_selected.columns)*0.2)
if BEST_ALGORITHM == 'LogisticRegression' :
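# 10 x 11 x 2 = 220 combinations
    # solver='liblinear' supports both the 'l1' and 'l2' penalties searched below
    GRID_EST = LogisticRegression(class_weight='balanced', solver='liblinear', random_state=1)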
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__C':[1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.],
                  'est__penalty':['l1','l2']
                 }
# http://kamonohashiperry.com/archives/209
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
# https://github.com/aarshayj/Analytics_Vidhya/blob/master/Articles/Parameter_Tuning_XGBoost_with_Example/XGBoost%20models.ipynb
elif BEST_ALGORITHM == 'XGBClassifier' :
# Original estimate: 4 x 3 x 11 x 11 x 3 x 3 x 10 x 10 x 2 = 2613600 combinations (sampled by RandomizedSearchCV)
    GRID_EST = xgb.XGBClassifier(objective='binary:logistic', random_state=1)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__learning_rate': [0.01,0.1,0.3],
                  'est__max_depth': sp_randint(1, 11),
                  'est__min_child_weight': sp_randint(1, 11),
                  'est__gamma':[0,0.5,1],
                  'est__subsample': [0.5, 0.7, 0.9], 
                  'est__colsample_bytree': [0.5, 0.7, 0.9],
                  'est__reg_alpha': sp_randint(1, 10),
                  'est__reg_lambda': sp_randint(1, 10),
                  'est__scale_pos_weight':sp_randint(1, 10), # useful for unbalanced
                  'est__max_delta_step':sp_randint(1, 10), # useful for unbalanced
                  'est__n_estimators': [1000]
                 }

elif BEST_ALGORITHM == 'GradientBoostingClassifier' :
# Original estimate: 4 x 4 x 11 x 19 x 20 x 3 x 4 = 802560 combinations (sampled by RandomizedSearchCV)
    GRID_EST = GradientBoostingClassifier(random_state=1)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__learning_rate': [1e-3, 1e-2, 1e-1, 0.5],
                  'est__max_depth': sp_randint(1, 11),
                  'est__min_samples_split': sp_randint(2, 21),
                  'est__min_samples_leaf': sp_randint(1, 21),
                  'est__subsample': [0.5, 0.7, 0.9],
                  'est__max_features': [0.1, 0.3, 0.5, 1],
                  'est__n_estimators': [100]
                 }
    
elif BEST_ALGORITHM == 'RandomForestClassifier' :
# 10 x 3 x 3 x 15 x 2 = 2700 combinations (sampled by RandomizedSearchCV)
    GRID_EST = RandomForestClassifier(class_weight='balanced',random_state=1)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__n_estimators': [100, 200, 500],
                  'est__max_features': ['auto', 'sqrt', 'log2'],
                  'est__max_depth' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                  'est__criterion' :['gini', 'entropy']
                 }
elif BEST_ALGORITHM == 'SVM' :
# 10 x 1 x 7 x 9 = 630 combinations
    GRID_EST = SVC(kernel='rbf', C=1.0,probability=True, class_weight='balanced', random_state=1)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
#                  'est__kernel': ['rbf','sigmoid','linear'], 
                  'est__kernel': ['rbf'], 
#                  'est__class_weight':['None','balanced'],
                  'est__C': [0.001, 0.01, 0.1, 1,10,100,1000],
                  'est__gamma': [1e-5,1e-4,0.001, 0.01, 0.1, 1,10,100,1000]
                 }
elif BEST_ALGORITHM == 'SVC' : 
# 10 x 1 x 7 x 9 = 630 combinations
    GRID_EST = SVC(probability=True, random_state=1)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
#                  'est__kernel': ['rbf','sigmoid','linear'], 
                  'est__kernel': ['rbf'], 
#                  'est__class_weight':['None','balanced'],
                  'est__gamma': [1e-5,1e-4,0.001, 0.01, 0.1, 1,10,100,1000],
                  'est__C': [0.001, 0.01, 0.1, 1,10,100,1000]
                 }
elif BEST_ALGORITHM == 'RSVC' :
    GRID_EST = SVC(C=1.0, kernel='rbf', random_state=1,probability=True)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__kernel': ['rbf'], 
#                  'est__class_weight':['None','balanced'],
                  'est__C': [0.001, 0.01, 0.1, 1,10,100,1000],
                  'est__gamma': [1e-5,1e-4,0.001, 0.01, 0.1, 1,10,100,1000]
                 }

elif (BEST_ALGORITHM == 'MLPClassifier'):
    GRID_EST = MLPClassifier(random_state=1)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__hidden_layer_sizes':[(100,10),(200,10),(500,10),(100,20),(200,20),(500,20)],
                  'est__max_iter':[10, 100, 1000],
                  'est__batch_size':[10,50,100],
                  'est__early_stopping':[True]
                 }
elif (BEST_ALGORITHM == 'KNeighborsClassifier'):
    # KNeighborsClassifier accepts neither class_weight nor random_state
    GRID_EST = KNeighborsClassifier()
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__n_neighbors': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
                 }
elif (BEST_ALGORITHM == 'DecisionTreeClassifier'):
    GRID_EST = DecisionTreeClassifier(class_weight='balanced', random_state=1)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__max_depth': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                  'est__criterion': ['gini', 'entropy']
                 }
elif (BEST_ALGORITHM == 'BaggingClassifier'):
    GRID_EST = BaggingClassifier(DecisionTreeClassifier(),n_estimators = 100, max_features = 0.5)
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None],
                  'est__base_estimator__max_depth' : [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
                  'est__max_samples' : [0.05, 0.1, 0.2, 0.3, 0.5]
                 }
elif (BEST_ALGORITHM == 'VotingClassifier'):
    GRID_EST = VotingClassifier(estimators=[('ab', clf1), ('etc', clf2), ('xgb', clf3),('gbc', clf4)], weights=[1,1,1,1], voting='soft')
    # Example of grid search with a voting classifier
    # https://qiita.com/yagays/items/a503117bd06bb938fdb9
    GRID_PARAM = {'pca__n_components':[round(len(X_ohe_selected.columns)*0.1),
                                       round(len(X_ohe_selected.columns)*0.2),
                                       round(len(X_ohe_selected.columns)*0.3),
                                       round(len(X_ohe_selected.columns)*0.4),
                                       round(len(X_ohe_selected.columns)*0.5),
                                       round(len(X_ohe_selected.columns)*0.6),
                                       round(len(X_ohe_selected.columns)*0.7),
                                       round(len(X_ohe_selected.columns)*0.8),
                                       round(len(X_ohe_selected.columns)*0.9),
                                       None]
                 }

GRID_PIPE = Pipeline([('scl', StandardScaler()),
#                     ('mms', MinMaxScaler()),
#                     ('mas', MaxAbsScaler()),
#                     ('rs', RobustScaler()),
                      ('pca', PCA(random_state=1)),
                      ('est', GRID_EST)])

# https://qiita.com/ragAgar/items/2f6bebdba5f9d7381310 # example of randomized search

if BEST_ALGORITHM in ('XGBClassifier', 'GradientBoostingClassifier', 'RandomForestClassifier'):
    # Large search spaces: sample candidates with RandomizedSearchCV
    gs = RandomizedSearchCV(estimator=GRID_PIPE,
                            param_distributions=GRID_PARAM,
                            cv=3,            # number of CV folds
                            n_iter=1000,     # number of sampled candidates
                            scoring=SCORE,   # metric
                            random_state=1)
else:
    # Set up GridSearchCV on the pipeline
    gs = GridSearchCV(estimator=GRID_PIPE,
                      param_grid=GRID_PARAM,
                      scoring=SCORE,
                      cv=3)

start = time.time()
gs = gs.fit(X_ohe_selected, y.values.ravel())  # .values replaces as_matrix(), which newer pandas no longer provides

# Report the best score and parameters found by the search
print('Best Score:',gs.best_score_)
print('Best Params:', gs.best_params_)
print('Best Estimator:',gs.best_estimator_)

end   = time.time()
#print(start)
#print(end)
print('GridSearch elapsed time:', round((end - start)/60, 1), 'min')

#print('===============================================================================================================')
#print('nested cross-validation score')
#start2 = time.time()
#nested_cv_scores = cross_val_score(gs,
#                            X_ohe_selected,
#                            y.values.ravel(),
#                            scoring=SCORE,
#                            cv=3
#                            )
#
#print(BEST_ALGORITHM, ':\t', SCORE, ':', round(np.average(nested_cv_scores), 4), '+-', round(np.std(nested_cv_scores), 4))
#end2   = time.time()
#print('Nested cross-validation elapsed time:', round((end2 - start2)/60, 1), 'min')

#CM = confusion_matrix(y_true, y_pred)
# NOTE: these predictions are made on the data used for fitting, so the report is optimistic
y_true = y.values.ravel()
y_pred = gs.predict(X_ohe_selected)
CM = confusion_matrix(y_true, y_pred)
print('confusion matrix:\n', CM)
print(classification_report(y_true, y_pred))

GS_SCORE_LIST = pd.DataFrame([[GRID_EST, SCORE, gs.best_score_, gs.best_params_]] )
GS_SCORE_LIST.to_csv('./data/' + 'grid_score_list.csv', mode='a', header=False, index=False)
joblib.dump(gs, './model/'+ BEST_ALGORITHM + '.pkl')

filename=BEST_ALGORITHM + '.pkl'

# Load the saved model (joblib.load accepts a path, so no file handle is left open)
loaded_model = joblib.load('./model/' + filename)

#score = pd.DataFrame(gs.predict_proba(Xs_exp_selected)[:,1], columns=['pred_score'])
#IDs.join(score).to_csv('./data/'+  model_name + '_' + file_score + '_with_pred.csv', index=False)

# Save cv_results_ to a CSV file
#pd.DataFrame(gs.cv_results_).to_csv('./data/'+  model_name + '_' + file_score + 'cv_result.csv', index=False)
#AttributeError: 'GridSearchCV' object has no attribute 'cv_results_'

##############################################################################################
# modeling
#clf.fit(X_ohe_selected, y.as_matrix().ravel())
#joblib.dump(clf, './model/'+ model_name + '.pkl')
#results = cross_val_score(clf, X_ohe_selected, y.as_matrix().ravel(), scoring=score, cv=5)
#print('cv score:', np.average(results), '+-', np.std(results))

# scoring
#if len(file_score)>0:
#    score = pd.DataFrame(clf.predict_proba(Xs_exp_selected)[:,1], columns=['pred_score'])
#    IDs.join(score).to_csv('./data/'+  model_name + '_' + file_score + '_with_pred.csv', index=False)

# model profile
#imp = pd.DataFrame([clf.named_steps['est'].feature_importances_], columns=X_ohe_columns[selector.support_])
#imp.T.to_csv('./data/'+  model_name + '_feature_importances.csv', index=True)
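
The cell above repeats the same ten `pca__n_components` candidates in every branch and estimates the grid size by hand in comments. As a rough sketch only (the helper names `pca_component_candidates` and `grid_size` are hypothetical and not part of the original notebook, and `n_features = 30` is an assumed column count), the candidate list and the number of combinations could be derived like this:

```python
def pca_component_candidates(n_features, steps=range(1, 10)):
    """Return roughly 10%..90% of n_features (rounded, deduplicated), plus None.

    Hypothetical helper: it mirrors the repeated round(len(columns) * 0.k) lists,
    but guards against 0 components and duplicate values for small n_features.
    """
    counts = sorted({max(1, round(n_features * k / 10)) for k in steps})
    return counts + [None]  # None keeps all components


def grid_size(param_grid):
    """Count the combinations of a grid whose values are all plain lists."""
    size = 1
    for values in param_grid.values():
        size *= len(values)
    return size


n_features = 30  # assumption: number of columns in the selected feature matrix

svm_grid = {
    'pca__n_components': pca_component_candidates(n_features),
    'est__kernel': ['rbf'],
    'est__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'est__gamma': [1e-5, 1e-4, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
}

print(pca_component_candidates(n_features))
print('SVM grid combinations:', grid_size(svm_grid))
```

This count only applies to grids made of lists, as used with `GridSearchCV`. Branches that use `sp_randint` distributions are sampled by `RandomizedSearchCV`, where `n_iter` rather than the grid size determines the number of fits.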

----------------------------------------------------------------------------------------
X shape: (569,30)
y shape: (569,1)
----------------------------------------------------------------------------------------
y
0    212
1    357
dtype: int64
y=0 means Malignant, y=1 means Benign:
----------------------------------------------------------------------------------------
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1          