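### Cross-validation
- Overfitting: the model is excessively optimized for the training data, so its predictive performance drops sharply on general (unseen) data.
- Cross-validation is useful for estimating more reliably how well a model performs on the data I have been given.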
---
- holdout
- k-fold cross validation: validate k times, then run the final evaluation on the test data
- stratified k-fold cross validation
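
To make the difference between plain and stratified k-fold concrete, here is a minimal sketch on toy labels. `x_toy`, `y_toy`, and `positive_ratio_per_fold` are hypothetical names made up for this illustration, not part of the notebook. It prints the share of class 1 in each validation fold for both splitters.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy labels: 80 samples of class 0 followed by 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
x_toy = np.zeros((len(y_toy), 1))  # feature values do not affect the split


def positive_ratio_per_fold(splitter):
    """Return the fraction of class-1 samples in each validation fold."""
    return [float(y_toy[val_idx].mean())
            for _, val_idx in splitter.split(x_toy, y_toy)]


kfold_ratios = positive_ratio_per_fold(KFold(n_splits=5))
stratified_ratios = positive_ratio_per_fold(StratifiedKFold(n_splits=5))

print('KFold           :', kfold_ratios)
print('StratifiedKFold :', stratified_ratios)
```

On sorted labels like these, unshuffled `KFold` puts all of class 1 into the last fold, while `StratifiedKFold` keeps the 80/20 ratio in every fold.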
    

#### Implementing cross-validation

In [3]:
import numpy as np
from sklearn.model_selection import KFold

x = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)

print(kf.get_n_splits(x))
print(kf)
# split() yields positional index arrays for each fold
for train_idx, test_idx in kf.split(x):
    print('--- idx')
    print(train_idx, test_idx)
    print('--- train data')
    print(x[train_idx])
    print('--- val data')
    print(x[test_idx])

2
KFold(n_splits=2, random_state=None, shuffle=False)
--- idx
[2 3] [0 1]
--- train data
[[1 2]
 [3 4]]
--- val data
[[1 2]
 [3 4]]
--- idx
[0 1] [2 3]
--- train data
[[1 2]
 [3 4]]
--- val data
[[1 2]
 [3 4]]


Wine data

In [7]:
import pandas as pd

red_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-red.csv'
white_url = 'https://raw.githubusercontent.com/PinkWink/ML_tutorial/master/dataset/winequality-white.csv'

red_wine = pd.read_csv(red_url, sep=';')
white_wine = pd.read_csv(white_url, sep=';')

# Tag each row with its wine type: 1.0 = red, 0.0 = white
red_wine['color'] = 1.
white_wine['color'] = 0.

# Rows stay in red-then-white order and keep their original index labels
wine = pd.concat([red_wine, white_wine])

In [10]:
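# Binary target: 1.0 if quality is above 5, else 0.0
wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]

# Drop 'quality' as well, since 'taste' is derived directly from it
x = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']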

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=13)

wine_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
wine_tree.fit(x_train, y_train)

y_pred_tr = wine_tree.predict(x_train)
y_pred_test = wine_tree.predict(x_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('Test Acc : ', accuracy_score(y_test, y_pred_test))

Train Acc :  0.7294593034442948
Test Acc :  0.7161538461538461


- k-fold

In [12]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

- `KFold.split` returns index arrays, not the data itself

In [13]:
for train_idx, test_idx in kfold.split(x):
    print(len(train_idx), len(test_idx))

5197 1300
5197 1300
5198 1299
5198 1299
5198 1299
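
Because the splitter returns positions rather than index labels, the rows have to be selected with `.iloc`. The sketch below illustrates why this matters for a frame built with `pd.concat`, using two tiny hypothetical tables (`part_a`, `part_b`) in place of the red and white wine data.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-ins for the red and white tables
part_a = pd.DataFrame({'v': [10, 11, 12]})
part_b = pd.DataFrame({'v': [20, 21, 22]})
toy = pd.concat([part_a, part_b])  # index labels are now 0, 1, 2, 0, 1, 2

val_values = []
for train_idx, val_idx in KFold(n_splits=3).split(toy):
    # iloc selects by position, which is what split() returns
    val_values.append(toy.iloc[val_idx]['v'].tolist())

print(val_values)

# Label-based lookup matches every row that carries label 0 or 1
rows_by_label = len(toy.loc[[0, 1]])
rows_by_position = len(toy.iloc[[0, 1]])
print(rows_by_label, rows_by_position)
```

With duplicated labels, `.loc[[0, 1]]` picks up rows from both tables, so it would not reproduce the fold that the splitter intended.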


- Accuracy after training on each fold

In [15]:
cv_accuracy=[]

for train_idx, test_idx in kfold.split(x):
    x_train, x_test = x.iloc[train_idx], x.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    wine_tree_cv.fit(x_train, y_train)
    pred = wine_tree_cv.predict(x_test)
    cv_accuracy.append(accuracy_score(y_test, pred))

cv_accuracy

[0.6007692307692307,
 0.6884615384615385,
 0.7090069284064665,
 0.7628945342571208,
 0.7867590454195535]

- If the variance across the fold accuracies is not large, use their mean as the representative value

In [16]:
np.mean(cv_accuracy)

0.709578255462782
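
One way to check whether the spread is small enough is to look at the standard deviation and range next to the mean. This is a small sketch using the five fold accuracies printed above. What counts as "not large" is a judgment call, not something this code decides.

```python
import numpy as np

# Fold accuracies copied from the k-fold run above
cv_accuracy = [0.6007692307692307, 0.6884615384615385, 0.7090069284064665,
               0.7628945342571208, 0.7867590454195535]

mean_acc = np.mean(cv_accuracy)
std_acc = np.std(cv_accuracy)                   # population standard deviation
spread = max(cv_accuracy) - min(cv_accuracy)    # best fold minus worst fold

print(f'mean={mean_acc:.4f}  std={std_acc:.4f}  range={spread:.4f}')
```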

#### StratifiedKFold

In [17]:
from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cv_accuracy = []

for train_idx, test_idx in skfold.split(x, y):
    x_train, x_test = x.iloc[train_idx], x.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    wine_tree_cv.fit(x_train, y_train)
    pred = wine_tree_cv.predict(x_test)
    cv_accuracy.append(accuracy_score(y_test, pred))
    
cv_accuracy    

[0.5523076923076923,
 0.6884615384615385,
 0.7146153846153847,
 0.7321016166281755,
 0.7565485362095532]

- The mean accuracy is lower than with plain k-fold (0.689 vs. 0.710) -> what should we do in this case?

In [18]:
np.mean(cv_accuracy)

0.6888069536444689
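
One option that may be worth trying, offered here as an assumption rather than something the notebook establishes, is shuffling before splitting. The wine frame is red rows followed by white rows, and without `shuffle=True` the folds follow that order. The sketch below uses hypothetical toy data (`block`, `x_toy`, `y_toy`) to show which source blocks land in each validation fold with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: the first 50 rows come from one source (block 0),
# the last 50 from another (block 1), like red rows followed by white rows
block = np.array([0] * 50 + [1] * 50)
y_toy = np.tile([0, 1], 50)   # labels alternate, so each block holds both classes
x_toy = np.zeros((100, 1))


def blocks_per_fold(splitter):
    """For each validation fold, list which source blocks it contains."""
    return [sorted(set(block[val_idx].tolist()))
            for _, val_idx in splitter.split(x_toy, y_toy)]


ordered = blocks_per_fold(StratifiedKFold(n_splits=5))
shuffled = blocks_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=13))

print('no shuffle:', ordered)
print('shuffle   :', shuffled)
```

Stratification balances the class labels but not the row order, so the unshuffled first and last folds each see only one block. Whether shuffling actually improves the wine scores would have to be checked on the real data.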

- A more convenient way to run cross-validation

In [21]:
from sklearn.model_selection import cross_val_score

skfold = StratifiedKFold(n_splits=5)
wine_tree_cv = DecisionTreeClassifier(max_depth=2, random_state=13)

cross_val_score(wine_tree_cv, x, y, scoring=None, cv=skfold)

array([0.55230769, 0.68846154, 0.71461538, 0.73210162, 0.75654854])
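
As a sanity check on what `cross_val_score` does, the sketch below compares it with the manual loop on synthetic data. `make_classification` and the `x_toy`/`y_toy` names are stand-ins for this illustration, not the wine data. With `scoring=None`, the estimator's own `score` method is used, which is accuracy for a classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the wine features and labels
x_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=13)

skfold = StratifiedKFold(n_splits=5)
tree = DecisionTreeClassifier(max_depth=2, random_state=13)

# Manual loop: fit on the training part, score on the validation part
manual_scores = []
for train_idx, val_idx in skfold.split(x_toy, y_toy):
    tree.fit(x_toy[train_idx], y_toy[train_idx])
    pred = tree.predict(x_toy[val_idx])
    manual_scores.append(accuracy_score(y_toy[val_idx], pred))

# One call: the same folds, a fresh clone of the estimator per fold
helper_scores = cross_val_score(tree, x_toy, y_toy, scoring=None, cv=skfold)

print(np.round(manual_scores, 4))
print(np.round(helper_scores, 4))
```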

- A larger `max_depth` does not necessarily give better accuracy

In [22]:
wine_tree_cv = DecisionTreeClassifier(max_depth=5, random_state=13)

cross_val_score(wine_tree_cv, x, y, scoring=None, cv=skfold)

array([0.50076923, 0.62615385, 0.69769231, 0.7582756 , 0.74884438])
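
Rather than comparing two depths by hand, the same call can be looped over several candidate depths. This is a minimal sketch on synthetic data. The depth list and the `x_toy`/`y_toy` names are arbitrary choices for illustration, and the scores it prints say nothing about the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the wine features and labels
x_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=13)
skfold = StratifiedKFold(n_splits=5)

mean_score_by_depth = {}
for depth in [1, 2, 3, 5, 8]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=13)
    scores = cross_val_score(tree, x_toy, y_toy, cv=skfold)
    mean_score_by_depth[depth] = float(scores.mean())

for depth, score in mean_score_by_depth.items():
    print(f'max_depth={depth}: mean CV accuracy={score:.4f}')
```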

- To see the train scores as well

In [23]:
from sklearn.model_selection import cross_validate
cross_validate(wine_tree_cv, x, y, scoring=None, cv=skfold, return_train_score=True)

{'fit_time': array([0.01397252, 0.01291227, 0.01496029, 0.01997018, 0.01196599]),
 'score_time': array([0.0010016 , 0.00099921, 0.00299311, 0.00197196, 0.00199103]),
 'test_score': array([0.50076923, 0.62615385, 0.69769231, 0.7582756 , 0.74884438]),
 'train_score': array([0.78795459, 0.78045026, 0.77563979, 0.76356291, 0.76283901])}

-> Overfitting: the train scores (0.76 to 0.79) are higher than the test scores (0.50 to 0.76) in every fold
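
The size of the train/test gap can be read off directly from the two score arrays. The sketch below copies the values from the `cross_validate` output above and computes the gap for each fold and on average.

```python
import numpy as np

# Scores copied from the cross_validate output above (max_depth=5)
train_score = np.array([0.78795459, 0.78045026, 0.77563979, 0.76356291, 0.76283901])
test_score = np.array([0.50076923, 0.62615385, 0.69769231, 0.7582756, 0.74884438])

gap_per_fold = train_score - test_score   # positive means train beats test
mean_gap = gap_per_fold.mean()

print('gap per fold:', np.round(gap_per_fold, 4))
print(f'mean train={train_score.mean():.4f}  '
      f'mean test={test_score.mean():.4f}  gap={mean_gap:.4f}')
```

The gap is largest in the first fold and nearly vanishes in the fourth, so the mismatch is uneven across folds rather than uniform.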