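Training set, validation set, and test set
  - Training set: used to fit the model
  - Validation set: used to evaluate candidate models and find the best-performing one
  - Test set: held-out data for the final check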

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
wine = load_wine()

In [5]:
x,x_test,y,y_test = train_test_split(wine.data,wine.target,test_size=0.2,random_state=42)

In [7]:
x_train,x_val,y_train,y_val =  train_test_split(x,y,test_size=0.2,random_state=42)
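
As a quick sanity check on the two-stage split, the sketch below repeats it with a distinct name for every subset and prints the resulting sizes. The names `x_val` and `y_val` are illustrative choices for the validation portion, not names taken from the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# First split: hold out 20% of the data as the final test set.
x, x_test, y, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)
# Second split: carve a validation set out of the remaining 80%.
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.2, random_state=42
)

sizes = (len(x_train), len(x_val), len(x_test))
print(sizes)
```

Giving each subset its own name keeps the test set from being overwritten by the second split. The three sizes should add up to the 178 samples in the wine dataset.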

In [8]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()

In [9]:
from sklearn.model_selection import cross_validate
# For a classifier the default splitter is StratifiedKFold with 5 folds (plain KFold is used for other estimators)
scores = cross_validate(dt,x,y)

In [10]:
scores

{'fit_time': array([0.00467396, 0.00187469, 0.00384998, 0.00158834, 0.00861359]),
 'score_time': array([0.00671768, 0.00113988, 0.00125885, 0.00104451, 0.00351024]),
 'test_score': array([0.93103448, 0.93103448, 0.89285714, 0.92857143, 0.85714286])}

In [13]:
np.mean(scores['test_score'])

0.9081280788177339

Cross-validation with an explicit splitter

In [14]:
from sklearn.model_selection import StratifiedKFold
scores = cross_validate(dt,x,y,cv = StratifiedKFold() )
scores

{'fit_time': array([0.0038991 , 0.00255251, 0.0013082 , 0.00124693, 0.00139928]),
 'score_time': array([0.00145483, 0.0007472 , 0.00069547, 0.00070143, 0.00076246]),
 'test_score': array([0.93103448, 0.93103448, 0.89285714, 0.92857143, 0.89285714])}

In [15]:
np.mean(scores['test_score'])

0.9152709359605913
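
What `cross_validate` does with a splitter can be sketched as an explicit loop over the folds. This is a minimal illustration, not the library's internal code: it assumes a fixed `random_state` on the tree (the notebook's `dt` is unseeded) so that the manual loop and the library call give the same scores, and it runs on the full wine data rather than the notebook's `x`, `y`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
dt = DecisionTreeClassifier(random_state=42)  # seeded so both runs match
cv = StratifiedKFold()  # 5 folds, no shuffling

manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    model = clone(dt)                      # fresh, unfitted copy per fold
    model.fit(X[train_idx], y[train_idx])  # fit on the other four folds
    manual_scores.append(model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_validate(dt, X, y, cv=cv)["test_score"]
print(manual_scores)
print(library_scores)
```

Each fold serves as the validation set exactly once, and the mean of the five scores is the cross-validation estimate.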

Increasing the number of validation folds (the default is 5)

In [17]:
cv =  StratifiedKFold(n_splits=10)
scores = cross_validate(dt,x,y,cv = cv )
np.mean(scores['test_score'])

0.8928571428571429
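
Because the tree in the cells above has no `random_state`, part of the difference between runs is random noise rather than an effect of the fold count. The sketch below compares 5 and 10 folds under fixed seeds; the `shuffle=True` setting and the seed values are assumptions added for the illustration, and it uses the full wine data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
dt = DecisionTreeClassifier(random_state=42)

results = {}
for n_splits in (5, 10):
    # Shuffle before splitting; the seed makes the folds reproducible.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_validate(dt, X, y, cv=cv)["test_score"]
    results[n_splits] = (scores.mean(), scores.std())
    print(n_splits, scores.mean(), scores.std())
```

Looking at the standard deviation alongside the mean helps judge whether a gap between two settings is larger than the fold-to-fold variation.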

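Grid search: hyperparameter tuning
  - Hyperparameter: a value that affects model performance but is not learned from the data, so it has to be set by the user
    - In practice, these are the parameters of the machine learning functions, i.e. the variables chosen during development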
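
To make the distinction concrete, the sketch below lists the hyperparameters of a decision tree with `get_params()` and contrasts them with quantities that only exist after fitting. The choice of `max_depth=3` is arbitrary and only for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Hyperparameters: set by the user before training.
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
hyperparams = dt.get_params()
print(hyperparams["max_depth"], hyperparams["criterion"])

# Learned structure: available only after fit().
dt.fit(X, y)
learned_depth = dt.get_depth()
n_leaves = dt.get_n_leaves()
print(learned_depth, n_leaves)
```

Every key returned by `get_params()` is a candidate for tuning, whereas the depth and leaf count of the fitted tree are outcomes of training.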

In [19]:
from sklearn.model_selection import GridSearchCV
parameters = {
  'criterion' : ["gini", "entropy", "log_loss"] ,
  'min_impurity_decrease' : [0.0001,0.0002,0.0003,0.0004,0.0005]  
}
gs =  GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid= parameters, n_jobs=-1) # n_jobs=-1 uses all available CPU cores
gs.fit(x,y)

In [20]:
gs.best_params_

{'criterion': 'gini', 'min_impurity_decrease': 0.0001}

In [22]:
gs.cv_results_

{'mean_fit_time': array([0.0044704 , 0.00339036, 0.00232086, 0.00191889, 0.00200181,
        0.0023479 , 0.00359125, 0.00286798, 0.00226121, 0.00391083,
        0.00216599, 0.00614591, 0.00220366, 0.00218806, 0.00214376]),
 'std_fit_time': array([3.34112887e-03, 2.38985906e-03, 5.48378376e-04, 8.62371041e-05,
        2.65620657e-04, 5.94879198e-04, 2.79522904e-03, 1.11528025e-03,
        1.37913919e-04, 3.68685162e-03, 7.99289744e-05, 5.45465453e-03,
        7.27895319e-05, 1.17947924e-04, 8.00550656e-05]),
 'mean_score_time': array([0.0011116 , 0.00175953, 0.00179543, 0.00085635, 0.00270314,
        0.00080094, 0.00497422, 0.00336175, 0.00500922, 0.00366812,
        0.00267529, 0.00207791, 0.00350351, 0.00199366, 0.00082436]),
 'std_score_time': array([1.95542518e-04, 1.62281037e-03, 1.14671539e-03, 4.40158150e-05,
        3.69861436e-03, 2.54639972e-05, 6.01979241e-03, 2.77115070e-03,
        5.36017550e-03, 3.42405424e-03, 2.76767290e-03, 2.46393552e-03,
        3.23987187e-03, 2.34

In [23]:
gs.cv_results_['mean_test_score']

array([0.91527094, 0.91527094, 0.91527094, 0.91527094, 0.91527094,
       0.85246305, 0.85246305, 0.85246305, 0.85246305, 0.85246305,
       0.85246305, 0.85246305, 0.85246305, 0.85246305, 0.85246305])
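
The raw `cv_results_` dictionary is hard to read, so one option is to load it into a pandas DataFrame and keep only the columns of interest. The sketch below assumes a reduced grid with two criteria and runs on the full wine data. In scikit-learn, `"entropy"` and `"log_loss"` both select the Shannon information gain, which is why those two groups of rows score identically in the output above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_impurity_decrease": [0.0001, 0.0002, 0.0003, 0.0004, 0.0005],
}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=param_grid)
gs.fit(X, y)

# One row per parameter combination, best rank first.
results = pd.DataFrame(gs.cv_results_)
columns = [
    "param_criterion",
    "param_min_impurity_decrease",
    "mean_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```

The grid has 2 x 5 = 10 combinations, so the summary has ten rows.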

In [21]:
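dt = gs.best_estimator_
# Scored on the same data the search was fitted on, so this is a training score, not a held-out estimate
dt.score(x,y)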

1.0
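
A score of 1.0 on the data used for fitting says little about generalization. A minimal sketch of a held-out check follows, assuming the same 80/20 split and parameter grid as earlier in the notebook but with `"log_loss"` left out of the grid; the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42
)

param_grid = {
    "criterion": ["gini", "entropy"],
    "min_impurity_decrease": [0.0001, 0.0002, 0.0003, 0.0004, 0.0005],
}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid=param_grid)
gs.fit(X_train, y_train)  # the test set is never seen during the search

best = gs.best_estimator_  # refit on all of X_train with the best parameters
train_acc = best.score(X_train, y_train)
test_acc = best.score(X_test, y_test)
print(train_acc, gs.best_score_, test_acc)
```

Comparing the training accuracy, the cross-validated score, and the test accuracy side by side gives a rough picture of how much the tree overfits.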

How should the parameter values fed into the grid search be chosen?

In [27]:
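# Draw candidate values from probability distributions
# uniform (real values), randint (integers)
from scipy.stats import uniform, randint
params = {
    'min_impurity_decrease' : uniform(0.0001,0.001),  # loc=0.0001, scale=0.001, i.e. the range [0.0001, 0.0011]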
    'max_depth' : randint(20,50),
    'min_samples_split' : randint(2,25),
    'min_samples_leaf' : randint(1,25)
}
from sklearn.model_selection import RandomizedSearchCV
# RandomizedSearchCV samples parameter combinations at random.
gs = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=params, n_iter=100,n_jobs=-1)
gs.fit(x,y)
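
The ranges these distributions actually cover are easy to misread: `uniform(loc, scale)` spans `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. The sketch below draws samples to check this; the sample size and seed are arbitrary.

```python
from scipy.stats import randint, uniform

real_dist = uniform(0.0001, 0.001)  # loc=0.0001, scale=0.001
int_dist = randint(20, 50)          # integers 20..49, upper bound excluded

real_samples = real_dist.rvs(size=1000, random_state=42)
int_samples = int_dist.rvs(size=1000, random_state=42)

print(real_samples.min(), real_samples.max())
print(int_samples.min(), int_samples.max())
```

So `uniform(0.0001, 0.001)` can return values up to about 0.0011, not 0.001, and `max_depth` drawn from `randint(20, 50)` never reaches 50.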

In [28]:
gs.best_params_

{'max_depth': 49,
 'min_impurity_decrease': 0.0009274089163919622,
 'min_samples_leaf': 1,
 'min_samples_split': 17}

In [29]:
dt = gs.best_estimator_
dt.score(x,y)

0.9929577464788732

In [31]:
np.max(gs.cv_results_['mean_test_score'])

0.9224137931034484
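
The randomized search above is not seeded, so its best parameters can change from run to run. The sketch below passes `random_state` to `RandomizedSearchCV` for reproducibility and shows that `best_score_` is the maximum of `mean_test_score`. The smaller `n_iter=20`, the seed values, and the use of the full wine data are assumptions made to keep the example short.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import load_wine
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
params = {
    "min_impurity_decrease": uniform(0.0001, 0.001),
    "max_depth": randint(20, 50),
    "min_samples_split": randint(2, 25),
    "min_samples_leaf": randint(1, 25),
}

gs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=params,
    n_iter=20,        # number of sampled combinations
    random_state=42,  # makes the sampled combinations reproducible
)
gs.fit(X, y)

best_from_results = np.max(gs.cv_results_["mean_test_score"])
print(gs.best_params_)
print(gs.best_score_, best_from_results)
```

Reading `gs.best_score_` directly avoids having to take the maximum of `cv_results_` by hand.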