[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/corazzon/boostcourse-ds-511/blob/master/pima-classification-baseline-04.ipynb)


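## Dataset source
* [Pima Indians Diabetes Database | Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html (note: `load_diabetes` loads a different, regression-oriented diabetes dataset, not the Pima data used here)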


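### Data description

* Pregnancies : number of pregnancies
* Glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure : diastolic blood pressure (mm Hg)
* SkinThickness : triceps skin fold thickness (mm), used to estimate body fat
* Insulin : 2-hour serum insulin (mu U / ml)
* BMI : body mass index (weight in kg / (height in m)^2)
* DiabetesPedigreeFunction : diabetes pedigree function, a score summarizing family history of diabetes
* Age : age in years
* Outcome : class variable (0 or 1); 268 of the 768 records are 1 and the rest are 0.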
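The class counts in the `Outcome` description imply a baseline worth keeping in mind when reading the accuracy scores later on: a model that always predicts 0 is already correct for roughly 65% of the full dataset. A minimal sketch of that arithmetic, using only the counts stated above (the variable names are illustrative):

```python
# Class counts taken from the Outcome description above.
n_total = 768
n_positive = 268
n_negative = n_total - n_positive

# Share of records labelled 1.
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline = n_negative / n_total

print(f"positive rate: {positive_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

The exact baseline on a particular train or test subset can differ slightly, because the split changes the class proportions.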


## Load the required libraries

In [159]:
# pandas for data analysis, numpy for numerical computation,
# seaborn and matplotlib.pyplot for visualization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## Load the dataset

In [160]:
df = pd.read_csv("../data/diabetes_feature.csv")
df.shape

(768, 16)

In [161]:
# Preview the first rows of the dataset.

df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Pregnancies_high,Age_low,Age_middle,Age_high,Insulin_nan,Insulin_log,low_glu_insulin
0,6,148,72,35,0,33.6,0.627,50,1,False,False,True,False,169.5,5.138735,False
1,1,85,66,29,0,26.6,0.351,31,0,False,False,True,False,102.5,4.639572,True
2,8,183,64,0,0,23.3,0.672,32,1,True,False,True,False,169.5,5.138735,False
3,1,89,66,23,94,28.1,0.167,21,0,False,True,False,False,94.0,4.553877,True
4,0,137,40,35,168,43.1,2.288,33,1,False,False,True,False,168.0,5.129899,False


## Build the datasets for training and prediction

In [162]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
       'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
       'low_glu_insulin'],
      dtype='object')

In [163]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
       'Insulin_nan', 'low_glu_insulin']]
X.shape

(768, 9)

In [164]:
y = df['Outcome']
y.shape

(768,)

In [165]:
# Split the data with train_test_split from scikit-learn's model_selection module.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [166]:
# Check the number of rows in the train features (X) and labels (y).

X_train.shape, y_train.shape

((614, 9), (614,))

In [167]:
# Check the number of rows in the test features (X) and labels (y).

X_test.shape, y_test.shape

((154, 9), (154,))
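The split above does not stratify on the label, so the share of positive cases can drift between the train and test sets. As an optional variant (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions close in both parts. The sketch below uses synthetic stand-in data with the same shape and class counts as the real dataset; `X_demo` and `y_demo` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 768 rows, 9 features, 268 positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(768, 9))
y_demo = np.array([1] * 268 + [0] * 500)

# stratify=y_demo keeps the positive rate nearly identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

The row counts match the ones shown above (614 for training, 154 for testing), since `test_size=0.2` of 768 rows rounds up to 154 test rows.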

# 5 Comparing several algorithms

In [168]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


estimators=[DecisionTreeClassifier(random_state=42),
           RandomForestClassifier(random_state=42),
           GradientBoostingClassifier(random_state=42)]
estimators

[DecisionTreeClassifier(random_state=42),
 RandomForestClassifier(random_state=42),
 GradientBoostingClassifier(random_state=42)]

In [169]:
result=[]
for estimator in estimators:
    result.append(estimator.__class__.__name__)

result

['DecisionTreeClassifier',
 'RandomForestClassifier',
 'GradientBoostingClassifier']

In [170]:
from sklearn.model_selection import RandomizedSearchCV

# Candidate values are drawn once with NumPy. No seed is set, so they differ between runs.
max_depth = np.random.randint(2, 20, 10)
max_features = np.random.uniform(0.3, 1.0, 10)
param_distributions = {"max_depth": max_depth, "max_features": max_features}

results = []
for estimator in estimators:
    result = []
    # n_estimators applies only to the ensemble models, not to a single decision tree.
    if estimator.__class__.__name__ != "DecisionTreeClassifier":
        param_distributions["n_estimators"] = np.random.randint(100, 200, 10)
    clf = RandomizedSearchCV(estimator=estimator,
                             param_distributions=param_distributions,
                             n_iter=100,
                             scoring="accuracy",
                             n_jobs=-1,
                             cv=5,
                             verbose=2
                             )
    clf.fit(X_train, y_train)
    result.append(estimator.__class__.__name__)
    result.append(clf.best_params_)
    # best_score_ is the mean cross-validated accuracy of the best candidate.
    result.append(clf.best_score_)
    # score() evaluates the refitted best estimator on the held-out test set.
    result.append(clf.score(X_test, y_test))
    result.append(clf.cv_results_)
    results.append(result)
results

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Fitting 5 folds for each of 100 candidates, totalling 500 fits


[['DecisionTreeClassifier',
  {'max_features': 0.7763299407779168, 'max_depth': 5},
  0.8664934026389444,
  0.8701298701298701,
  {'mean_fit_time': array([0.0032011 , 0.00383978, 0.00298867, 0.00380096, 0.0030005 ,
          0.00320091, 0.00320106, 0.00340014, 0.00320106, 0.00360055,
          0.00360026, 0.00340118, 0.0026001 , 0.00300069, 0.0028007 ,
          0.00300083, 0.00300064, 0.00260034, 0.00300021, 0.00300045,
          0.00360041, 0.00348392, 0.00280061, 0.0032012 , 0.0030004 ,
          0.00360107, 0.00260077, 0.00260148, 0.00280046, 0.00360084,
          0.00300074, 0.00360079, 0.00300097, 0.00300026, 0.0030004 ,
          0.00340037, 0.00300093, 0.00300074, 0.00320096, 0.00360041,
          0.00320048, 0.00335135, 0.00320077, 0.00300064, 0.00260057,
          0.00320106, 0.00280094, 0.0030005 , 0.00300059, 0.00340071,
          0.00340085, 0.00340056, 0.00360093, 0.00360098, 0.00280066,
          0.00260053, 0.00300069, 0.00300093, 0.00320063, 0.00320029,
          0.003
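The search above samples from ten pre-drawn values per parameter and mutates one shared dictionary inside the loop. A possible alternative, sketched below under stated assumptions, passes `scipy.stats` distributions so that each candidate is drawn fresh, gives each estimator its own parameter dictionary, and sets `random_state` so the search is repeatable. The synthetic data, the small `n_iter`, `cv`, and `n_estimators` range are choices made only to keep the sketch fast; they are not the notebook's settings.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=42)

# Distributions shared by both models.
base_params = {
    "max_depth": randint(2, 20),        # integers from 2 to 19
    "max_features": uniform(0.3, 0.7),  # floats in [0.3, 1.0] (loc=0.3, scale=0.7)
}

# Each estimator gets its own dictionary, so nothing leaks between iterations.
searches = [
    (DecisionTreeClassifier(random_state=42), dict(base_params)),
    (RandomForestClassifier(random_state=42),
     {**base_params, "n_estimators": randint(10, 30)}),
]

summary = []
for estimator, params in searches:
    search = RandomizedSearchCV(
        estimator, params, n_iter=5, scoring="accuracy", cv=3, random_state=42)
    search.fit(X_demo, y_demo)
    summary.append((type(estimator).__name__, search.best_params_, search.best_score_))

for name, best_params, best_score in summary:
    print(name, best_params, round(best_score, 4))
```

With distributions instead of fixed arrays, raising `n_iter` explores new parameter values rather than revisiting the same ten.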

In [171]:
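# Note: "train_score" holds clf.best_score_, the mean cross-validated accuracy of the best candidate.
# This assignment also replaces the dataset previously stored in df.
df = pd.DataFrame(results, columns=["estimator", "best_params", "train_score", "test_score", "cv_result"])
df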

Unnamed: 0,estimator,best_params,train_score,test_score,cv_result
0,DecisionTreeClassifier,"{'max_features': 0.7763299407779168, 'max_dept...",0.866493,0.87013,"{'mean_fit_time': [0.0032011032104492187, 0.00..."
1,RandomForestClassifier,"{'n_estimators': 109, 'max_features': 0.626796...",0.903972,0.857143,"{'mean_fit_time': [0.2204498291015625, 0.28626..."
2,GradientBoostingClassifier,"{'n_estimators': 145, 'max_features': 0.386202...",0.903958,0.863636,"{'mean_fit_time': [0.6595483303070069, 0.61593..."
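The column labelled `train_score` is filled from `clf.best_score_`, which is the mean cross-validated accuracy of the best candidate rather than the accuracy on the training data. The sketch below copies the scores from the table above into a small frame (the names `scores`, `cv_score`, and `cv_minus_test` are illustrative) and compares them with the test scores.

```python
import pandas as pd

# Scores copied from the results table above.
scores = pd.DataFrame({
    "estimator": ["DecisionTreeClassifier",
                  "RandomForestClassifier",
                  "GradientBoostingClassifier"],
    "cv_score": [0.866493, 0.903972, 0.903958],    # the notebook's "train_score" column
    "test_score": [0.870130, 0.857143, 0.863636],
})

# Positive values mean the cross-validated score was higher than the test score.
scores["cv_minus_test"] = scores["cv_score"] - scores["test_score"]

print(scores.sort_values("test_score", ascending=False))
```

In this run the two ensemble models have the higher cross-validated scores, while the single decision tree has the highest test score. With only 154 test rows, differences of about one percentage point correspond to one or two samples, so the ranking should be read with caution.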


In [176]:
# Row 1 is the RandomForestClassifier search; list its candidates from best to worst rank.
pd.DataFrame(df.loc[1, "cv_result"]).sort_values(by="rank_test_score")

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_max_features,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
70,0.202245,0.014814,0.013803,0.001167,109,0.626796,13,"{'n_estimators': 109, 'max_features': 0.626796...",0.886179,0.934959,0.861789,0.894309,0.942623,0.903972,0.030475,1
14,0.280463,0.017134,0.020405,0.003383,172,0.386202,19,"{'n_estimators': 172, 'max_features': 0.386202...",0.886179,0.943089,0.861789,0.894309,0.934426,0.903958,0.030486,2
62,0.317573,0.019573,0.016603,0.000490,150,0.931647,7,"{'n_estimators': 150, 'max_features': 0.931647...",0.878049,0.943089,0.869919,0.902439,0.926230,0.903945,0.027783,3
75,0.339076,0.014930,0.021005,0.002530,177,0.881167,7,"{'n_estimators': 177, 'max_features': 0.881166...",0.878049,0.951220,0.869919,0.894309,0.926230,0.903945,0.030505,3
39,0.348078,0.011716,0.019404,0.001020,163,0.931647,17,"{'n_estimators': 163, 'max_features': 0.931647...",0.886179,0.951220,0.861789,0.902439,0.918033,0.903932,0.030095,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19,0.229652,0.016672,0.017604,0.001855,141,0.304305,7,"{'n_estimators': 141, 'max_features': 0.304305...",0.845528,0.934959,0.869919,0.869919,0.909836,0.886032,0.032001,96
93,0.241854,0.006113,0.019004,0.003034,163,0.304305,7,"{'n_estimators': 163, 'max_features': 0.304305...",0.837398,0.934959,0.861789,0.886179,0.909836,0.886032,0.034390,96
80,0.211847,0.007334,0.016004,0.001674,137,0.319037,5,"{'n_estimators': 137, 'max_features': 0.319037...",0.845528,0.910569,0.853659,0.878049,0.926230,0.882807,0.031363,98
73,0.233653,0.016816,0.017604,0.000800,151,0.319037,5,"{'n_estimators': 151, 'max_features': 0.319037...",0.845528,0.910569,0.853659,0.878049,0.918033,0.881168,0.029189,99
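The `mean_test_score` and `std_test_score` columns summarize the five per-fold scores in `split0_test_score` through `split4_test_score`. As a quick check, the sketch below recomputes both for the top-ranked row (index 70) from the values printed in the table; the result is consistent with a population standard deviation (`ddof=0`).

```python
import numpy as np

# split0..split4 test scores of the top-ranked row (index 70) in the table above.
split_scores = np.array([0.886179, 0.934959, 0.861789, 0.894309, 0.942623])

mean_score = split_scores.mean()
std_score = split_scores.std()  # NumPy's default is the population std (ddof=0)

print(f"mean_test_score: {mean_score:.6f}")
print(f"std_test_score:  {std_score:.6f}")
```

A fold-to-fold spread of about 0.03 is larger than the gap between the top-ranked candidates, so the ordering among them is not strongly determined.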


In [172]:
# clf still refers to the last search fitted in the loop (GradientBoostingClassifier).
clf.best_params_

{'n_estimators': 145, 'max_features': 0.3862020368732978, 'max_depth': 17}

In [173]:
clf.best_score_

0.9039584166333465
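After the search, the tuned model itself is available as `best_estimator_`, refitted on the whole training set because `refit=True` is the default. The sketch below, which runs on synthetic stand-in data with a deliberately tiny search space (both are assumptions for illustration), shows that scoring through the search object and scoring the refitted model directly give the same accuracy when `scoring="accuracy"`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    {"max_depth": [2, 3, 4], "n_estimators": [10, 20]},
    n_iter=3, scoring="accuracy", cv=3, random_state=42)
search.fit(X_tr, y_tr)

# The refitted best model can be used like any other fitted estimator.
best_model = search.best_estimator_
direct = accuracy_score(y_te, best_model.predict(X_te))
via_search = search.score(X_te, y_te)

print(search.best_params_)
print(direct, via_search)
```

Keeping a reference to `best_estimator_` for each model inside the loop would avoid relying on `clf`, which only holds the last search.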