<a href="https://colab.research.google.com/github/jhk990602/datapractice/blob/main/diabetes%5BRandomSearchCV%5Dversion2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Dataset source
* [Pima Indians Diabetes Database | Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html (scikit-learn's `load_diabetes`, a separate regression dataset, not the Pima data used here)


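### Data description

* Pregnancies : number of pregnancies
* Glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure : diastolic blood pressure (mm Hg)
* SkinThickness : triceps skin fold thickness (mm), used to estimate body fat
* Insulin : 2-hour serum insulin (mu U / ml)
* BMI : body mass index (weight in kg / (height in m)^2)
* DiabetesPedigreeFunction : diabetes pedigree function
* Age : age in years
* Outcome : class variable (0 or 1); 268 of the 768 rows are 1 and the rest are 0.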
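The class counts in the `Outcome` description imply an imbalanced target, which matters when reading accuracy scores later. A minimal sketch of that arithmetic follows; the `outcome` list is a hypothetical stand-in built from the stated counts, not the real column.

```python
from collections import Counter

# Hypothetical stand-in for df['Outcome']: 268 positives and 500 negatives,
# matching the counts stated in the data description.
outcome = [1] * 268 + [0] * 500

counts = Counter(outcome)
positive_rate = counts[1] / len(outcome)

# Accuracy of a model that always predicts the majority class (0).
baseline_accuracy = max(counts.values()) / len(outcome)

print(counts[0], counts[1])          # 500 268
print(f"{positive_rate:.3f}")        # 0.349
print(f"{baseline_accuracy:.3f}")    # 0.651
```

Any classifier trained on this data should beat the roughly 0.651 majority-class baseline to be considered useful.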


## Loading the required libraries

In [1]:
# Load pandas for data analysis and numpy for numerical computation,
# plus seaborn and matplotlib.pyplot for visualization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## Loading the dataset

In [4]:
# Load the feature-engineered dataset (the path assumes Google Drive is mounted in Colab).
df = pd.read_csv('/content/drive/MyDrive/boostcourse_data/diabetes_feature.csv')
df.shape

(768, 16)

In [6]:
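# Overwrite the Insulin column with the values from diabetes_fill_insulin.csv
# (judging by the file name, insulin with its missing values filled in).
df_insulin = pd.read_csv("/content/drive/MyDrive/boostcourse_data/diabetes_fill_insulin.csv")
df['Insulin'] = df_insulin['Insulin']
df.head()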

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Pregnancies_high,Age_low,Age_middle,Age_high,Insulin_nan,Insulin_log,low_glu_insulin
0,6,148,72,35,181.94837,33.6,0.627,50,1,False,False,True,False,169.5,5.138735,False
1,1,85,66,29,56.257681,26.6,0.351,31,0,False,False,True,False,102.5,4.639572,True
2,8,183,64,0,191.945588,23.3,0.672,32,1,True,False,True,False,169.5,5.138735,False
3,1,89,66,23,94.0,28.1,0.167,21,0,False,True,False,False,94.0,4.553877,True
4,0,137,40,35,168.0,43.1,2.288,33,1,False,False,True,False,168.0,5.129899,False
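Assigning a column from another DataFrame, as in the cell above, aligns rows by index label rather than by position. The sketch below illustrates this with two hypothetical miniature frames (`df_demo`, `df_fill`) whose values are made up for illustration; it assumes both files share the same default row index, which the notebook does not verify.

```python
import pandas as pd

# Hypothetical miniature frames; the values are illustrative only.
df_demo = pd.DataFrame({"Glucose": [148, 85, 183], "Insulin": [0.0, 0.0, 0.0]})
df_fill = pd.DataFrame({"Insulin": [181.9, 56.3, 191.9]})

# Column assignment aligns on the index, so check that the row labels match
# before overwriting; mismatched labels would silently produce NaN values.
assert df_demo.index.equals(df_fill.index)
df_demo["Insulin"] = df_fill["Insulin"]

print(df_demo)
```

A check like this is cheap insurance when the two CSV files were produced by separate preprocessing steps.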


In [7]:
# Preview the dataset.

df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Pregnancies_high,Age_low,Age_middle,Age_high,Insulin_nan,Insulin_log,low_glu_insulin
0,6,148,72,35,181.94837,33.6,0.627,50,1,False,False,True,False,169.5,5.138735,False
1,1,85,66,29,56.257681,26.6,0.351,31,0,False,False,True,False,102.5,4.639572,True
2,8,183,64,0,191.945588,23.3,0.672,32,1,True,False,True,False,169.5,5.138735,False
3,1,89,66,23,94.0,28.1,0.167,21,0,False,True,False,False,94.0,4.553877,True
4,0,137,40,35,168.0,43.1,2.288,33,1,False,False,True,False,168.0,5.129899,False


## Building the datasets for training and prediction

In [8]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
       'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
       'low_glu_insulin'],
      dtype='object')

In [10]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
       'Insulin']]
X.shape

(768, 8)

In [11]:
y = df['Outcome']
y.shape

(768,)

In [12]:
# Split the data with train_test_split from scikit-learn's model_selection module.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

In [13]:
# Check the number of rows in the training features and labels.

X_train.shape, y_train.shape

((614, 8), (614,))

In [14]:
# Check the number of rows in the test features and labels.

X_test.shape, y_test.shape

((154, 8), (154,))
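The split above does not stratify on the label, so the share of positive cases can drift between the training and test sets. As a hedged alternative, the sketch below passes `stratify` to `train_test_split` on hypothetical stand-in arrays (`X_demo`, `y_demo`) that only mimic the dataset's class counts; it is not the notebook's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same class counts as the dataset
# (268 positives, 500 negatives); the single feature is just a row id.
y_demo = np.array([1] * 268 + [0] * 500)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)  # (614, 1) (154, 1)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # positive share per split
```

With 768 rows and `test_size=0.2`, the test set size is rounded up to 154, which leaves 614 rows for training, matching the shapes shown above.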

## Comparing several algorithms

In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Fix random_state so each model is reproducible for a given set of hyperparameters.
estimators = [DecisionTreeClassifier(random_state=42),
              RandomForestClassifier(random_state=42),
              GradientBoostingClassifier(random_state=42)]
estimators

[DecisionTreeClassifier(random_state=42),
 RandomForestClassifier(random_state=42),
 GradientBoostingClassifier(random_state=42)]

In [16]:
# Dry run of the loop: collect one row per estimator, starting with its class name.
results = []
for estimator in estimators:
    result = []
    result.append(estimator.__class__.__name__)
    results.append(result)
results

[['DecisionTreeClassifier'],
 ['RandomForestClassifier'],
 ['GradientBoostingClassifier']]

In [17]:
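from sklearn.model_selection import RandomizedSearchCV

# Draw 10 candidate values per hyperparameter up front; the search samples from these arrays.
# np.random is not seeded here, so the candidates differ from run to run.
max_depth = np.random.randint(2, 20, 10)
max_features = np.random.uniform(0.3, 1.0, 10)

param_distributions = {'max_depth': max_depth, 'max_features': max_features}

results = []
for estimator in estimators:
    result = []
    # Copy the shared space so n_estimators never leaks into the decision tree search,
    # regardless of the order of the estimators.
    params = dict(param_distributions)
    # Only the ensemble models accept n_estimators.
    if estimator.__class__.__name__ != 'DecisionTreeClassifier':
        params['n_estimators'] = np.random.randint(100, 1000, 10)
    clf = RandomizedSearchCV(estimator,
                             params,
                             n_iter=10,
                             scoring='accuracy',
                             n_jobs=-1,
                             cv=5,
                             verbose=2)
    clf.fit(X_train, y_train)
    result.append(estimator.__class__.__name__)
    result.append(clf.best_params_)
    result.append(clf.best_score_)            # mean cross-validated accuracy of the best candidate
    result.append(clf.score(X_test, y_test))  # accuracy on the held-out test set
    result.append(clf.cv_results_)
    results.append(result)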

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
Fitting 5 folds for each of 10 candidates, totalling 50 fits
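The search above pre-draws ten values per hyperparameter with `np.random`, so each search can only pick from those fixed arrays. `RandomizedSearchCV` also accepts `scipy.stats` distributions, which are sampled afresh for every candidate. The following is a minimal sketch of that variant, not the notebook's method: the synthetic data from `make_classification`, the smaller `n_estimators` range, and the reduced `n_iter` and `cv` are all assumptions chosen to keep the example fast.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption), with 8 features like X_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Distributions instead of fixed arrays: each candidate draws new values.
param_distributions = {
    "max_depth": randint(2, 20),        # integers from 2 to 19
    "max_features": uniform(0.3, 0.7),  # floats in [0.3, 1.0] (loc, scale)
    "n_estimators": randint(10, 50),    # kept small so the sketch runs quickly
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,
    scoring="accuracy",
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Setting `random_state` on the search itself is what makes the sampled candidates repeatable; seeding only the estimator does not.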


In [18]:
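# Note: 'train_score' holds clf.best_score_, the mean cross-validated accuracy
# of the best candidate, not accuracy on the full training set.
pd.DataFrame(results, columns=['estimator', 'best_params', 'train_score', 'test_score', 'cv_result'])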

Unnamed: 0,estimator,best_params,train_score,test_score,cv_result
0,DecisionTreeClassifier,"{'max_features': 0.726587235147559, 'max_depth...",0.752339,0.75974,"{'mean_fit_time': [0.005501365661621094, 0.007..."
1,RandomForestClassifier,"{'n_estimators': 792, 'max_features': 0.677287...",0.796455,0.753247,"{'mean_fit_time': [2.292432403564453, 3.358266..."
2,GradientBoostingClassifier,"{'n_estimators': 275, 'max_features': 0.626141...",0.793136,0.779221,"{'mean_fit_time': [1.8589829444885253, 4.07375..."
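To put the table in perspective, the test accuracies can be converted back into counts of correctly classified rows out of the 154 test rows. The sketch below copies the scores from the table above; note that its `train_score` column holds `clf.best_score_`, the mean cross-validated accuracy. The variable names are illustrative.

```python
# Scores copied from the results table: (mean CV accuracy, test accuracy).
scores = {
    "DecisionTreeClassifier": (0.752339, 0.759740),
    "RandomForestClassifier": (0.796455, 0.753247),
    "GradientBoostingClassifier": (0.793136, 0.779221),
}
n_test = 154  # rows in X_test

for name, (cv_acc, test_acc) in scores.items():
    # accuracy * number of test rows = number of correct predictions
    correct = round(test_acc * n_test)
    gap = cv_acc - test_acc
    print(f"{name}: {correct}/{n_test} correct, CV minus test = {gap:+.4f}")

best_on_test = max(scores, key=lambda name: scores[name][1])
print(best_on_test)  # GradientBoostingClassifier
```

The three models land on 117, 116, and 120 correct predictions, so the spread on the test set is only four rows. Differences this small can easily change with a different `random_state` or split.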
