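## KFold
- Split the data into a training set and a validation set, and repeat the validation over different splits.
- This helps guard against overfitting, because the model is always scored on data it was not trained on.

- A quick cross-validation test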

In [1]:
import numpy as np
from sklearn.model_selection import KFold

In [2]:
X = np.array([
    [1,2], [3,4], [1,2], [3,4]
])

y = np.array([1,2,3,4])

In [3]:
X

array([[1, 2],
       [3, 4],
       [1, 2],
       [3, 4]])

In [4]:
y

array([1, 2, 3, 4])

In [10]:
kf = KFold(n_splits=2)  # n_splits: how many folds to split the data into

print(kf.get_n_splits(X))

2


In [8]:
for train_idx, test_idx in kf.split(X):
    print('Train idx: ', train_idx)
    print('Test idx: ', test_idx)

Train idx:  [2 3]
Test idx:  [0 1]
Train idx:  [0 1]
Test idx:  [2 3]


- `kf.split(X)` splits X according to `kf` and returns, for each fold, the indices of the train part and the test part
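
As a minimal sketch of what those indices guarantee, the snippet below uses a hypothetical 10-sample array (`X_demo` and `kf_demo` are illustrative names, not part of this note). It checks that the train and test indices of a fold never overlap and that every sample ends up in a test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)

kf_demo = KFold(n_splits=5)

seen_as_test = []
for train_idx, test_idx in kf_demo.split(X_demo):
    # Train and test indices of one fold never overlap
    assert set(train_idx).isdisjoint(test_idx)
    seen_as_test.extend(test_idx.tolist())

# Across all folds, each sample index appears in a test fold exactly once
print(sorted(seen_as_test))
```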

In [9]:
for train_idx, test_idx in kf.split(X):
    print('---idx')
    print(train_idx, test_idx)
    print('---train data')
    print(X[train_idx])
    print('---validation data')
    print(X[test_idx])

---idx
[2 3] [0 1]
---train data
[[1 2]
 [3 4]]
---validation data
[[1 2]
 [3 4]]
---idx
[0 1] [2 3]
---train data
[[1 2]
 [3 4]]
---validation data
[[1 2]
 [3 4]]


- Back to the wine data

In [11]:
import pandas as pd

red_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv"
white_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv"

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine['color'] = 1.
white_wine['color'] = 0.

wine = pd.concat([red_wine, white_wine])

In [12]:
wine['taste'] = [1. if grade>5 else 0. for grade in wine['quality']]

X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(X_train, y_train)

y_pred_tr = wine_tree.predict(X_train)
y_pred_test = wine_tree.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('Test Acc : ', accuracy_score(y_test, y_pred_test))

Train Acc :  0.7294593034442948
Test Acc :  0.7161538461538461


- What if someone asks, "Can this accuracy be trusted?"
- Let's split the data into 5 folds and cross-validate

In [14]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13) # cv : cross validation


In [15]:
for train_idx, test_idx in kfold.split(X):
    print(len(train_idx), len(test_idx))

5197 1300
5197 1300
5198 1299
5198 1299
5198 1299
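
The sizes above follow from simple arithmetic: `KFold` gives the first `n_samples % n_splits` folds one extra test sample. A small sketch that reproduces them, taking the sample count 6497 (5197 + 1300) from the output above:

```python
n_samples = 6497  # 5197 + 1300, taken from the output above
n_splits = 5

base, extra = divmod(n_samples, n_splits)

# The first `extra` folds get one additional test sample
test_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
train_sizes = [n_samples - size for size in test_sizes]

for n_train, n_test in zip(train_sizes, test_sizes):
    print(n_train, n_test)
```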


In [16]:
cv_accuracy = []

for train_idx, test_idx in kfold.split(X):
    X_train = X.iloc[train_idx]
    X_test = X.iloc[test_idx]
    y_train = y.iloc[train_idx]
    y_test = y.iloc[test_idx]
    
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_test)
    cv_accuracy.append(accuracy_score(y_test, pred))
    
cv_accuracy

[0.6007692307692307,
 0.6884615384615385,
 0.7090069284064665,
 0.7628945342571208,
 0.7867590454195535]

- The accuracy ranges from about 60% to about 79%, a wider spread than expected
- So we take the mean, as below

In [17]:
np.mean(cv_accuracy)

0.709578255462782
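
The mean alone hides how far the folds disagree, so it can help to report the spread next to it. A minimal sketch using the five fold accuracies printed above (`fold_acc` is an illustrative name):

```python
import numpy as np

# Fold accuracies copied from the KFold output above
fold_acc = np.array([
    0.6007692307692307,
    0.6884615384615385,
    0.7090069284064665,
    0.7628945342571208,
    0.7867590454195535,
])

mean_acc = fold_acc.mean()
std_acc = fold_acc.std()                   # population standard deviation
spread = fold_acc.max() - fold_acc.min()   # best fold minus worst fold

print(f'mean={mean_acc:.4f}, std={std_acc:.4f}, max-min={spread:.4f}')
```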

### The problem with KFold
- For example, if the labels are 0, 1 and 2 but a training split contains only 0 and 1, the model can never predict 2
- StratifiedKFold was introduced to solve this problem
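
The situation can be reproduced with a tiny sketch. The labels below are hypothetical and deliberately sorted by class, which is the worst case for an unshuffled `KFold`: each test fold then contains exactly the class that is missing from its training part.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical labels, sorted by class (3 classes x 3 samples)
y_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
X_demo = np.zeros((9, 1))  # feature values are irrelevant here

missing = []
for train_idx, test_idx in KFold(n_splits=3).split(X_demo):
    train_classes = set(y_demo[train_idx].tolist())
    test_classes = set(y_demo[test_idx].tolist())
    # Classes that must be predicted but were never seen during training
    missing.append(test_classes - train_classes)
    print('train:', sorted(train_classes), 'test:', sorted(test_classes))

print(missing)
```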

## StratifiedKFold
- Keeps the class proportions of the target the same in every fold, which avoids the KFold problem of one class being concentrated in a single fold
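
To illustrate the idea on something small before using the wine data, here is a sketch with hypothetical imbalanced labels (8 samples of class 0, 4 of class 1). Every test fold should keep the same 2:1 ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 8 samples of class 0, 4 of class 1
y_demo = np.array([0] * 8 + [1] * 4)
X_demo = np.zeros((12, 1))  # feature values are irrelevant here

fold_counts = []
for train_idx, test_idx in StratifiedKFold(n_splits=4).split(X_demo, y_demo):
    # Count how many samples of each class landed in this test fold
    counts = np.bincount(y_demo[test_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print('test fold class counts:', counts)
```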

In [19]:
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13) 

cv_accuracy = []

for train_idx, test_idx in skfold.split(X, y):  # y must be passed so the split can preserve its class proportions
    X_train = X.iloc[train_idx]
    X_test = X.iloc[test_idx]
    y_train = y.iloc[train_idx]
    y_test = y.iloc[test_idx]
    
    wine_tree_cv.fit(X_train, y_train)
    pred = wine_tree_cv.predict(X_test)
    cv_accuracy.append(accuracy_score(y_test, pred))
    
cv_accuracy

[0.5523076923076923,
 0.6884615384615385,
 0.7143956889915319,
 0.7321016166281755,
 0.7567359507313318]

In [20]:
np.mean(cv_accuracy)

0.6888004974240539

- The mean is lower than with KFold

- A simpler way to run cross-validation

In [21]:
from sklearn.model_selection import cross_val_score

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13) 

cross_val_score(wine_tree_cv, X, y, cv=skfold)

array([0.55230769, 0.68846154, 0.71439569, 0.73210162, 0.75673595])
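
For a classifier, passing an integer as `cv` makes scikit-learn use an unshuffled `StratifiedKFold` with that many folds, so the splitter object does not always have to be built by hand. A sketch on synthetic stand-in data (the `make_classification` data is an assumption for illustration, not the wine data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the wine data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=13)
tree = DecisionTreeClassifier(max_depth=2, random_state=13)

# Integer shorthand vs. explicit splitter object
scores_int = cross_val_score(tree, X_demo, y_demo, cv=5)
scores_skf = cross_val_score(tree, X_demo, y_demo, cv=StratifiedKFold(n_splits=5))

print(scores_int)
print(np.allclose(scores_int, scores_skf))
```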

- Changing max_depth to 5

In [22]:
from sklearn.model_selection import cross_val_score

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=5, random_state=13) 

cross_val_score(wine_tree_cv, X, y, cv=skfold)

array([0.50076923, 0.62615385, 0.69745958, 0.7582756 , 0.74903772])

- Still about the same; the deeper tree does not raise the mean score

- To see the train scores as well

In [24]:
from sklearn.model_selection import cross_validate

cross_validate(wine_tree_cv, X, y, cv=skfold, return_train_score=True)

{'fit_time': array([0.01932001, 0.01943493, 0.01729107, 0.01725793, 0.01735806]),
 'score_time': array([0.00412297, 0.00211215, 0.00194097, 0.00201607, 0.00190687]),
 'test_score': array([0.50076923, 0.62615385, 0.69745958, 0.7582756 , 0.74903772]),
 'train_score': array([0.78795459, 0.78045026, 0.77568295, 0.76356291, 0.76279338])}

- Comparing test_score with train_score, the last two folds look fine, but the first three show signs of overfitting (the train score is well above the test score)
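
The comparison is easier to read as a per-fold gap between train and test score. A small sketch using the numbers printed above (the arrays are copied from that output; `gap` is an illustrative name):

```python
import numpy as np

# Scores copied from the cross_validate output above
test_score = np.array([0.50076923, 0.62615385, 0.69745958, 0.7582756, 0.74903772])
train_score = np.array([0.78795459, 0.78045026, 0.77568295, 0.76356291, 0.76279338])

# A large positive gap means the tree fits the training folds
# much better than the held-out fold
gap = train_score - test_score

for i, g in enumerate(gap):
    print(f'fold {i}: train - test = {g:.3f}')
```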

## Hyperparameter tuning

In [26]:
import pandas as pd

red_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv"
white_url = "https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv"

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

red_wine['color'] = 1.
white_wine['color'] = 0.

wine = pd.concat([red_wine, white_wine])

wine['taste'] = [1. if grade>5 else 0. for grade in wine['quality']]

X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']

### What is GridSearchCV?

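>scikit-learn can feed in the hyperparameters of a classification or regression algorithm one after another,
training and scoring each candidate, and then report the best parameters. Without GridSearchCV you would have to
train by hand to find out whether max_depth=3 or max_depth=1 gives the best score.
By defining the set of candidate values in a parameter grid and applying it, you can obtain the best parameters directly.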
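
To make concrete what would otherwise be done by hand, here is a rough sketch of the manual loop. It uses synthetic stand-in data from `make_classification` (an assumption for illustration, not the wine data), and the names `mean_scores` and `best_depth` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the wine data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=13)

mean_scores = {}
for depth in [2, 4, 7, 10]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=13)
    # 5-fold cross-validated accuracy for this candidate value
    mean_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

# Pick the candidate with the highest mean score
best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print('best max_depth:', best_depth)
```

Every additional hyperparameter would add another nested loop here, which is the bookkeeping a grid search takes over.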

### Summary of the GridSearchCV constructor

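>-estimator : a classifier, regressor, pipeline, etc. \
-param_grid : the parameters to tune, given as a dictionary that maps each parameter name to its candidate values. \
-scoring : the metric used to evaluate prediction performance; for classifiers this is usually accuracy. \
-cv : the number of folds used for cross-validation. \
-refit : defaults to True; when True, the estimator is retrained on the whole dataset with the best hyperparameters found.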
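
A minimal sketch that spells out each of these arguments explicitly, again on synthetic stand-in data (an assumption for illustration; `grid_demo` is a hypothetical name). It also shows one consequence of `refit=True`: the fitted search object can predict directly with the best model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the wine data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=13)

grid_demo = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=13),
    param_grid={'max_depth': [2, 4, 7, 10]},  # candidate values per parameter
    scoring='accuracy',                       # metric used to rank candidates
    cv=5,                                     # number of folds
    refit=True,                               # retrain the best candidate on all data
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
# Because refit=True, predict() delegates to the refitted best estimator
print(grid_demo.predict(X_demo[:5]))
```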

In [27]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {'max_depth': [2,4,7,10]}

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)

gridsearch = GridSearchCV(estimator=wine_tree, param_grid=params, cv=5)
gridsearch.fit(X, y)

- GridSearchCV results

In [30]:
import pprint

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(gridsearch.cv_results_)

{   'mean_fit_time': array([0.00951281, 0.01416521, 0.02309527, 0.03189454]),
    'mean_score_time': array([0.00222301, 0.00193624, 0.00194092, 0.00206633]),
    'mean_test_score': array([0.6888005 , 0.66356523, 0.65340854, 0.64401587]),
    'param_max_depth': masked_array(data=[2, 4, 7, 10],
             mask=[False, False, False, False],
       fill_value='?',
            dtype=object),
    'params': [   {'max_depth': 2},
                  {'max_depth': 4},
                  {'max_depth': 7},
                  {'max_depth': 10}],
    'rank_test_score': array([1, 2, 3, 4], dtype=int32),
    'split0_test_score': array([0.55230769, 0.51230769, 0.50846154, 0.51615385]),
    'split1_test_score': array([0.68846154, 0.63153846, 0.60307692, 0.60076923]),
    'split2_test_score': array([0.71439569, 0.72363356, 0.68360277, 0.66743649]),
    'split3_test_score': array([0.73210162, 0.73210162, 0.73672055, 0.71054657]),
    'split4_test_score': array([0.75673595, 0.7182448 , 0.73518091, 0.7251732

- Which model performed best?

In [31]:
gridsearch.best_estimator_

- What is the best score?

In [32]:
gridsearch.best_score_

0.6888004974240539

- Which parameters performed best?

In [34]:
gridsearch.best_params_

{'max_depth': 2}

- To apply grid search to a model built as a pipeline

In [35]:
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

estimators = [
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier())
]

pipe = Pipeline(estimators)

In [36]:
param_grid = [ {'clf__max_depth': [2,4,7,10]}]

GridSearch = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=5)
GridSearch.fit(X, y)

In [37]:
GridSearch.best_estimator_

In [38]:
GridSearch.best_score_

0.6888004974240539
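
The `clf__max_depth` key used in the grid follows scikit-learn's `<step name>__<parameter>` convention for nested parameters. A short sketch of how the available keys can be listed (`pipe_demo` is a hypothetical pipeline built only for this illustration):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical pipeline for illustration
pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', DecisionTreeClassifier(random_state=13)),
])

# Every tunable parameter of a step is exposed as '<step name>__<parameter>'
clf_params = sorted(k for k in pipe_demo.get_params() if k.startswith('clf__'))
print(clf_params)

# set_params accepts the same keys that go into param_grid
pipe_demo.set_params(clf__max_depth=4)
print(pipe_demo.named_steps['clf'].max_depth)
```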

- A tidier way to present the results

In [39]:
import pandas as pd

score_df = pd.DataFrame(GridSearch.cv_results_)
score_df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_clf__max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.010923 | 0.000336 | 0.002213 | 9.6e-05 | 2 | {'clf__max_depth': 2} | 0.552308 | 0.688462 | 0.714396 | 0.732102 | 0.756736 | 0.6888 | 0.071799 | 1 |
| 1 | 0.016147 | 0.000361 | 0.002095 | 9e-05 | 4 | {'clf__max_depth': 4} | 0.512308 | 0.631538 | 0.723634 | 0.732102 | 0.718245 | 0.663565 | 0.083905 | 2 |
| 2 | 0.024877 | 0.0005 | 0.002131 | 9.4e-05 | 7 | {'clf__max_depth': 7} | 0.514615 | 0.608462 | 0.678984 | 0.73903 | 0.735951 | 0.655408 | 0.084926 | 3 |
| 3 | 0.033393 | 0.000705 | 0.002229 | 3.6e-05 | 10 | {'clf__max_depth': 10} | 0.512308 | 0.601538 | 0.666667 | 0.710547 | 0.725173 | 0.643247 | 0.078325 | 4 |


In [41]:
score_df[['params', 'rank_test_score', 'mean_test_score', 'std_test_score']]

| | params | rank_test_score | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | {'clf__max_depth': 2} | 1 | 0.6888 | 0.071799 |
| 1 | {'clf__max_depth': 4} | 2 | 0.663565 | 0.083905 |
| 2 | {'clf__max_depth': 7} | 3 | 0.655408 | 0.084926 |
| 3 | {'clf__max_depth': 10} | 4 | 0.643247 | 0.078325 |
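
One possible further step is to sort by rank and unpack the `params` dictionary into a plain column. The sketch below rebuilds the small table from the values shown above rather than from a fitted search, so `score_demo` and `summary` are illustrative names only.

```python
import pandas as pd

# Values copied from the table above
score_demo = pd.DataFrame({
    'params': [{'clf__max_depth': d} for d in (2, 4, 7, 10)],
    'rank_test_score': [1, 2, 3, 4],
    'mean_test_score': [0.688800, 0.663565, 0.655408, 0.643247],
    'std_test_score': [0.071799, 0.083905, 0.084926, 0.078325],
})

# Best candidate first, then pull max_depth out of the params dict
summary = score_demo.sort_values('rank_test_score').reset_index(drop=True)
summary['max_depth'] = summary['params'].apply(lambda p: p['clf__max_depth'])

print(summary[['max_depth', 'mean_test_score', 'std_test_score']])
```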
