## learning-AI101 : wine classification (validation, search)
- 혼자 공부하는 머신러닝과 딥러닝 : pp. 242-262
- 2024.07.22.

---------

- Split the data as follows
    - train set (60%)
    - validation set (20%; merged back into the train set once the best hyperparameters are found)
    - test set (20%)
- (1) To find the best hyperparameters, repeat the following
    - train the model (train set)
    - score (validation set)
- (2) Fit the model with the best hyperparameters on the merged train set + validation set
- (3) Compute the final score on the test set
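
Steps (1) to (3) can be sketched end to end as follows. This is a minimal sketch, not the notebook's own code: it assumes a synthetic dataset from make_classification in place of the wine data, and the list of max_depth candidates is a hypothetical example of a hyperparameter to tune.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: 3 features, binary target)
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# 8:2 split into (train + validation) and test, then 25% of the rest as validation
# -> 600 / 200 / 200 samples, i.e. 6 : 2 : 2
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=42)

# (1) fit each candidate on the train set, score it on the validation set
candidates = [2, 4, 6, 8, None]  # hypothetical max_depth values
val_scores = {}
for depth in candidates:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_tr, y_tr)
    val_scores[depth] = model.score(X_val, y_val)
best_depth = max(val_scores, key=val_scores.get)

# (2) refit the best setting on train + validation
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_trval, y_trval)

# (3) score once on the untouched test set
test_score = final_model.score(X_test, y_test)
print("best max_depth :", best_depth)
print("test score : ", test_score)
```

The test set is used only once, at the very end, so its score is not influenced by the hyperparameter choice.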

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [11]:
# Load the data

wine = pd.read_csv('https://bit.ly/wine_csv_data')
wine.head()

# features: alcohol, sugar, pH
# target: class

| | alcohol | sugar | pH | class |
|---|---|---|---|---|
| 0 | 9.4 | 1.9 | 3.51 | 0.0 |
| 1 | 9.8 | 2.6 | 3.2 | 0.0 |
| 2 | 9.8 | 2.3 | 3.26 | 0.0 |
| 3 | 9.8 | 1.9 | 3.16 | 0.0 |
| 4 | 9.4 | 1.9 | 3.51 | 0.0 |


In [12]:
# Split into data (features) and target

data = wine[['alcohol', 'sugar', 'pH']].to_numpy()
target = wine['class'].to_numpy()

In [13]:
# First split into train set and test set at 8:2

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
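
train_test_split shuffles at random, so the split differs on every run unless a seed is given. A minimal sketch, assuming toy labels with a 25:75 class ratio (not the wine data), of how random_state makes the split reproducible and how stratify keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 100 samples, 25 of class 0 and 75 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 25 + [1] * 75)

# Same random_state -> identical split on every call
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("same test rows :", np.array_equal(X_te, X_te2))
print("class-1 ratio (train, test) :", y_tr.mean(), y_te.mean())
```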

In [14]:
# Split the train set again to get roughly 6 : 2 : 2 (exactly 64 : 16 : 20, because 20% of the remaining 80% is held out)

X_train_vali_ver, X_vali, y_train_vali_ver, y_vali = train_test_split(X_train, y_train, test_size=0.2)

# Report the reduced train subset (X_train_vali_ver), not the full X_train
print (f'6 : 2 : 2 = ({X_train_vali_ver.shape, y_train_vali_ver.shape}) : ({X_vali.shape, y_vali.shape}) : ({X_test.shape, y_test.shape})')

6 : 2 : 2 = (((4157, 3), (4157,))) : (((1040, 3), (1040,))) : (((1300, 3), (1300,)))


In [15]:
# Train the model

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
dt.fit(X_train_vali_ver, y_train_vali_ver)

In [16]:
# score (validation set)

print ('train score : ', dt.score(X_train_vali_ver, y_train_vali_ver))
print ('validation test score : ', dt.score(X_vali, y_vali))

# In the output below, the train score is far above the validation score, so the model is overfitting.
# This hold-out approach is not ideal either: carving out a validation set leaves less data for training.

train score :  0.9980755352417608
validation test score :  0.8480769230769231
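
A large gap between train score and validation score is typical of an unrestricted decision tree, which keeps splitting until the training data is fit perfectly. A minimal sketch, assuming synthetic data from make_classification and an arbitrary limit of max_depth=3, that prints the gap for an unrestricted and a restricted tree so the two can be compared:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption)
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    train_score = model.score(X_tr, y_tr)
    val_score = model.score(X_val, y_val)
    print(f"{name}: depth={model.get_depth()}, train={train_score:.3f}, "
          f"validation={val_score:.3f}, gap={train_score - val_score:.3f}")
```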


-------

#### cross-validation
- Splitting off a separate validation set inevitably reduces the amount of data available for training
- Therefore, use k-fold cross-validation and train as shown below
![image.png](https://www.researchgate.net/publication/332370436/figure/fig1/AS:746775958806528@1555056671117/Diagram-of-k-fold-cross-validation-with-k-10-Image-from-Karl-Rosaen-Log.ppm)
- The figure above shows 10-fold cross-validation
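
What k-fold cross-validation does internally can be written out as an explicit loop. A minimal sketch, assuming synthetic data and 5 folds (both arbitrary choices for illustration): each fold serves once as the validation part while the model is trained on the remaining folds, and the result is compared with cross_validate using the same splitter.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train set (assumption)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

kf = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks of rows
fold_scores = []
for train_idx, val_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                     # train on k-1 folds
    fold_scores.append(model.score(X[val_idx], y[val_idx]))   # validate on the held-out fold

print("manual fold scores :", fold_scores)
print("manual validation score :", np.mean(fold_scores))

# The same procedure through cross_validate with the same splitter
cv_scores = cross_validate(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("cross_validate scores :", cv_scores["test_score"])
```

Every sample is used for validation exactly once, so no data is permanently set aside the way a fixed validation set is.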

In [21]:
# Cross-validation with cross_validate
# Pass the entire train set to cross_validate (it splits into folds internally)

from sklearn.model_selection import cross_validate # 5-fold by default

scores = cross_validate(dt, X_train, y_train) # model, X_train, y_train
print (scores) # 'test_score' holds the score of each fold

# Average the fold scores to get the final validation score
print ("validation score : ", np.mean(scores['test_score'])) # test_score is the per-fold score

{'fit_time': array([0.00790811, 0.00717521, 0.00715613, 0.00644612, 0.00670624]), 'score_time': array([0.00106597, 0.00094295, 0.00099897, 0.00078416, 0.00080585]), 'test_score': array([0.85865385, 0.84326923, 0.88931665, 0.85563041, 0.86333013])}
validation score :  0.8620400533056933


In [22]:
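# cross_validate does not shuffle by default, so pass a splitter to shuffle the train set before it is split into folds
# Use KFold for regression and StratifiedKFold for classification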

from sklearn.model_selection import StratifiedKFold

splitter = StratifiedKFold(n_splits=10, shuffle=True) # 10 folds, shuffled before splitting
scores = cross_validate(dt, X_train, y_train, cv=splitter)
print (scores)

# Average the fold scores to get the final validation score
print ("validation score : ", np.mean(scores['test_score'])) # test_score is the per-fold score

{'fit_time': array([0.00793791, 0.00817108, 0.00843167, 0.00777698, 0.00717092,
       0.00671101, 0.00633097, 0.00581193, 0.0062058 , 0.00549603]), 'score_time': array([0.0008471 , 0.00092316, 0.00093722, 0.00076699, 0.00067306,
       0.00066519, 0.00059581, 0.00057411, 0.00054407, 0.00049901]), 'test_score': array([0.87692308, 0.86346154, 0.86923077, 0.85      , 0.86346154,
       0.89423077, 0.85384615, 0.86705202, 0.87283237, 0.87283237])}
validation score :  0.8683870609159626
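
Why StratifiedKFold is preferred for classification can be seen from the class ratio inside each validation fold. A minimal sketch, assuming hypothetical sorted labels with a 25:75 class ratio (not the wine data): plain KFold without shuffling produces folds with very different class ratios, while StratifiedKFold keeps the ratio the same in every fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 25 samples of class 0 followed by 75 of class 1
y = np.array([0] * 25 + [1] * 75)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

ratios = {}
for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    # Fraction of class 1 in each validation fold
    ratios[name] = [y[val_idx].mean() for _, val_idx in splitter.split(X, y)]
    print(name, ratios[name])
```

When an integer or nothing is passed as cv, cross_validate already picks StratifiedKFold for a classifier with a binary or multiclass target and KFold otherwise; an explicit splitter is mainly needed to change the number of folds or to turn on shuffling.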


------

#### Hyperparameter tuning
![img](https://miro.medium.com/v2/resize:fit:1400/0*LqZEl9-0FRJ98Kzq.png)
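- A way to find the best hyperparameters: cross-validate the candidates within a user-specified range and return the best parameters and the best estimator
- **(1) grid search**
    - Exhaustively evaluates every parameter combination on a grid (table) built from the given ranges and step sizes
    - The best combination found is stored in best_params_, and the model trained with it in best_estimator_
    - Can take a long time, so parallel computation may be needed (n_jobs=-1 uses all CPU cores)
- **(2) random search**
    - Unlike grid search, parameters are not tied to fixed steps (each iteration draws the parameters at random from the given ranges)
    - The draws are not arbitrary: each parameter is sampled from the distribution supplied for it, for example a uniform distribution
    - n_iter sets the number of samples; everything else works as in grid search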
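
The difference between the two strategies shows up in which candidates they generate. A minimal sketch using scikit-learn's ParameterGrid and ParameterSampler helpers; the small grid and the distributions here are hypothetical values chosen for illustration, not the ones used in this notebook.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of the listed values (3 * 4 = 12 candidates)
grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 12, 22, 32]}
combos = list(ParameterGrid(grid))
print("grid candidates :", len(combos))
print(combos[:3])

# Random search: n_iter candidates, each drawn from the given distributions
dists = {
    "max_depth": randint(3, 8),                         # integers 3 to 7
    "min_impurity_decrease": uniform(0.0001, 0.0009),   # floats in [0.0001, 0.001]
}
samples = list(ParameterSampler(dists, n_iter=5, random_state=0))
print("random candidates :", len(samples))
for s in samples:
    print(s)
```

The grid grows multiplicatively with every added parameter, and each candidate is fit once per fold, whereas the cost of random search is fixed by n_iter regardless of how many parameters are searched.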

In [24]:
# grid search

from sklearn.model_selection import GridSearchCV

params = { # the search space must be a dict: parameter name -> candidate values
    'min_impurity_decrease' : np.arange(0.0001, 0.001, 0.0001), # 0.0001 up to (excluding) 0.001, step 0.0001
    'max_depth' : range(5, 20, 1), # 5 to 19, step 1
    'min_samples_split' : range(2, 100, 10) # 2 to 92, step 10
}

gs = GridSearchCV(dt, params, n_jobs=-1) # default cv (5-fold), so no explicit splitter is passed
gs.fit(X_train, y_train)

print ("best parameter : ", gs.best_params_)
print ("validation score : ", np.max(gs.cv_results_['mean_test_score'])) # best mean cross-validation score (same value as gs.best_score_)
print ("best model : ", gs.best_estimator_)

best parameter :  {'max_depth': 14, 'min_impurity_decrease': 0.0002, 'min_samples_split': 2}
validation score :  0.871084622788184
best model :  DecisionTreeClassifier(max_depth=14, min_impurity_decrease=0.0002)
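
After the search, the best estimator has already been refit on the whole train set passed to fit (refit=True is the default), so it can be scored directly on the test set as the final step. A minimal sketch, assuming synthetic data and a small hypothetical grid rather than the grid above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption)
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

params = {"max_depth": [2, 4, 6], "min_samples_split": [2, 20]}  # hypothetical grid
gs = GridSearchCV(DecisionTreeClassifier(random_state=0), params)
gs.fit(X_train, y_train)

# best_score_ is the highest mean cross-validation score in cv_results_
print("best parameter : ", gs.best_params_)
print("validation score : ", gs.best_score_)

# Final evaluation: gs.score delegates to the refit best_estimator_
test_score = gs.score(X_test, y_test)
print("test score : ", test_score)
```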


In [26]:
# Inspect the results (cv_results_)

pd.set_option('display.max_columns', None)
results = pd.DataFrame(gs.cv_results_)

np.transpose(results.head())

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| mean_fit_time | 0.004038 | 0.004595 | 0.004547 | 0.004177 | 0.00386 |
| std_fit_time | 0.000159 | 0.000402 | 0.000426 | 0.000203 | 0.00014 |
| mean_score_time | 0.001041 | 0.00128 | 0.001145 | 0.000874 | 0.000899 |
| std_score_time | 0.000179 | 0.000104 | 0.000053 | 0.000073 | 0.000131 |
| param_max_depth | 5 | 5 | 5 | 5 | 5 |
| param_min_impurity_decrease | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
| param_min_samples_split | 2 | 12 | 22 | 32 | 42 |
| params | {'max_depth': 5, 'min_impurity_decrease': 0.00... | {'max_depth': 5, 'min_impurity_decrease': 0.00... | {'max_depth': 5, 'min_impurity_decrease': 0.00... | {'max_depth': 5, 'min_impurity_decrease': 0.00... | {'max_depth': 5, 'min_impurity_decrease': 0.00... |
| split0_test_score | 0.842308 | 0.842308 | 0.842308 | 0.842308 | 0.842308 |
| split1_test_score | 0.855769 | 0.855769 | 0.855769 | 0.855769 | 0.855769 |


In [32]:
# random search

from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as st

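params = { # the search space must be a dict: parameter name -> distribution to sample from
    'min_impurity_decrease' : st.uniform(0.0001, 0.001), # uniform(loc, scale): samples from [0.0001, 0.0011]
    'max_depth' : st.randint(5, 20), # integers 5 to 19; randint has no step, a third argument would be loc (a shift)
    'min_samples_split' : st.randint(2, 100) # integers 2 to 99
}

rs = RandomizedSearchCV(dt, params, n_iter=100, n_jobs=-1) # default cv (5-fold), so no explicit splitter is passed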
rs.fit(X_train, y_train)

print ("best parameter : ", rs.best_params_)
print ("validation score : ", np.max(rs.cv_results_['mean_test_score'])) # best mean cross-validation score (same value as rs.best_score_)
print ("best model : ", rs.best_estimator_)

best parameter :  {'max_depth': 16, 'min_impurity_decrease': 0.00022008373244668158, 'min_samples_split': 20}
validation score :  0.866657473902421
best model :  DecisionTreeClassifier(max_depth=16,
                       min_impurity_decrease=0.00022008373244668158,
                       min_samples_split=20)
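
The scipy.stats distributions passed to RandomizedSearchCV do not take (start, stop, step) like range: uniform takes (loc, scale), and randint takes (low, high) plus an optional loc that shifts the whole range. A minimal sketch to check which values each call can actually produce; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

n = 10_000

# uniform(loc, scale) samples from [loc, loc + scale], not [loc, scale]
u = uniform(0.0001, 0.001).rvs(size=n, random_state=0)
print("uniform(0.0001, 0.001) :", u.min(), "to", u.max())

# randint(low, high) samples integers from low to high - 1
r2 = randint(5, 20).rvs(size=n, random_state=0)
print("randint(5, 20) :", r2.min(), "to", r2.max())

# A third positional argument is loc, which shifts the range by that amount
r3 = randint(5, 20, 1).rvs(size=n, random_state=0)
print("randint(5, 20, 1) :", r3.min(), "to", r3.max())
```

To sample from the same interval as a grid running from 0.0001 to 0.001, the scale would be 0.0009 rather than 0.001.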


In [33]:
# Inspect the results (cv_results_)

pd.set_option('display.max_columns', None)
results = pd.DataFrame(rs.cv_results_)

np.transpose(results.head())

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| mean_fit_time | 0.006205 | 0.007158 | 0.005421 | 0.006014 | 0.007531 |
| std_fit_time | 0.000874 | 0.001134 | 0.000389 | 0.00027 | 0.001029 |
| mean_score_time | 0.00167 | 0.001701 | 0.00146 | 0.001497 | 0.001431 |
| std_score_time | 0.000312 | 0.000156 | 0.000085 | 0.000127 | 0.00011 |
| param_max_depth | 15 | 13 | 12 | 7 | 8 |
| param_min_impurity_decrease | 0.000626 | 0.00085 | 0.001066 | 0.000248 | 0.000106 |
| param_min_samples_split | 93 | 23 | 99 | 75 | 71 |
| params | {'max_depth': 15, 'min_impurity_decrease': 0.0... | {'max_depth': 13, 'min_impurity_decrease': 0.0... | {'max_depth': 12, 'min_impurity_decrease': 0.0... | {'max_depth': 7, 'min_impurity_decrease': 0.00... | {'max_depth': 8, 'min_impurity_decrease': 0.00... |
| split0_test_score | 0.850962 | 0.848077 | 0.840385 | 0.840385 | 0.85 |
| split1_test_score | 0.85 | 0.858654 | 0.869231 | 0.869231 | 0.847115 |
