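##### Ensemble - ExtraTrees
- Bagging ensemble ==> random samples drawn with replacement + the same base model (DecisionTree)
    * Representative algorithm: RandomForest C/R
- Pasting ensemble ==> random samples drawn without replacement + the same base model (DecisionTree)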
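    * Representative algorithm: ExtraTrees C/R (in scikit-learn the default is bootstrap=False, so each tree is trained on the full training set)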
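
The difference between the two sampling schemes can be sketched with NumPy alone. This is only a minimal illustration, not how scikit-learn builds its ensembles internally; the 10 row indices and the sample sizes are assumptions chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
row_indices = np.arange(10)  # hypothetical row indices of a tiny training set

# Bagging: sample WITH replacement, so the same row may be drawn more than once
bagging_sample = rng.choice(row_indices, size=row_indices.size, replace=True)

# Pasting: sample WITHOUT replacement, so every drawn row is unique
pasting_sample = rng.choice(row_indices, size=7, replace=False)

print("bagging:", np.sort(bagging_sample))
print("pasting:", np.sort(pasting_sample))
```

In both schemes each base model sees a different random subset, which is what makes the individual trees differ from one another.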

[Goal] Wine classification => classify each wine into one of 2 classes, 0 or 1

[1] Load modules and prepare data

In [1]:
# Load modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Data
DATA_FILE = r"C:\Hwan\ML_Work\D0903\wine.csv"  # raw string, so the backslashes are not read as escape sequences

# CSV >> DataFrame
wineDF = pd.read_csv(DATA_FILE)

In [3]:
# Data Check
wineDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alcohol  6497 non-null   float64
 1   sugar    6497 non-null   float64
 2   pH       6497 non-null   float64
 3   class    6497 non-null   float64
dtypes: float64(4)
memory usage: 203.2 KB


In [4]:
wineDF.head(2)

   alcohol  sugar    pH  class
0      9.4    1.9  3.51    0.0
1      9.8    2.6  3.20    0.0


In [5]:
# Class distribution of the target/label
wineDF["class"].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64

In [6]:
## The classes are imbalanced (4898 vs. 1599), so downsampling or oversampling is needed
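
One way the downsampling mentioned here could look is sketched below. `demoDF` is a hypothetical stand-in that only reproduces the class counts of this notebook; with the real data the same call would be applied to the DataFrame itself, and usually only to the training split so that the test set keeps its original distribution.

```python
import pandas as pd

# Hypothetical stand-in for wineDF with the same class counts as the notebook
demoDF = pd.DataFrame({"class": [1.0] * 4898 + [0.0] * 1599})

# Downsample: keep as many rows per class as the smallest class has
min_count = demoDF["class"].value_counts().min()
balancedDF = demoDF.groupby("class").sample(n=min_count, random_state=0)

print(balancedDF["class"].value_counts())
```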

In [7]:
wineDF.describe()

           alcohol        sugar           pH        class
count  6497.000000  6497.000000  6497.000000  6497.000000
mean     10.491801     5.443235     3.218501     0.753886
std       1.192712     4.757804     0.160787     0.430779
min       8.000000     0.600000     2.720000     0.000000
25%       9.500000     1.800000     3.110000     1.000000
50%      10.300000     3.000000     3.210000     1.000000
75%      11.300000     8.100000     3.320000     1.000000
max      14.900000    65.800000     4.010000     1.000000


[2] Prepare for training

In [8]:
# Split into training and test datasets
from sklearn.model_selection import train_test_split

In [9]:
# Separate the features (independent variables) from the target/label (dependent variable)
featureDF = wineDF[wineDF.columns[:-1]]
targetSR = wineDF[wineDF.columns[-1]]

print(f"featureDF : {featureDF.shape}   targetSR : {targetSR.shape}")

featureDF : (6497, 3)   targetSR : (6497,)


In [10]:
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetSR, test_size=0.2, stratify=targetSR, random_state=1)

In [11]:
print(f"X_train : {X_train.shape} y_train : {y_train.shape}")
print(f"X_test : {X_test.shape} y_test : {y_test.shape}")

X_train : (5197, 3) y_train : (5197,)
X_test : (1300, 3) y_test : (1300,)
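
The effect of `stratify` in the split above can be checked with a small sketch. The labels here are hypothetical (750 of class 1, 250 of class 0) and the feature column is only a placeholder; the point is that both splits keep roughly the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 750 of class 1 and 250 of class 0
y = np.array([1] * 750 + [0] * 250)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# With stratify=y both splits keep roughly the original share of class 1
print(f"all: {y.mean():.3f}  train: {y_tr.mean():.3f}  test: {y_te.mean():.3f}")
```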


[3] Train the model

In [12]:
# Learning type : supervised learning -> classification
# Algorithm     : ensemble -> bagging - ExtraTreesClassifier

In [13]:
from sklearn.ensemble import ExtraTreesClassifier

In [14]:
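# Create the instance => an ensemble of 100 internal trees (n_estimators defaults to 100)
#                     => random_state fixes the randomness so the result is reproducible
#                     => oob_score : validates on the samples left out of each bootstrap sample;
#                        it requires bootstrap=True, so it is not available with the defaults used here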
If_model = ExtraTreesClassifier(random_state=7)

If_model.fit(X_train, y_train)
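
Since `oob_score` only works when bootstrap sampling is turned on, a minimal sketch of an out-of-bag evaluation might look like the following. The data comes from `make_classification` as a synthetic stand-in (an assumption, not the wine data), and `oob_model` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption); the notebook uses the wine features instead
X_demo, y_demo = make_classification(n_samples=500, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=7)

# bootstrap=True is required, otherwise oob_score=True raises an error
oob_model = ExtraTreesClassifier(n_estimators=100, bootstrap=True,
                                 oob_score=True, random_state=7)
oob_model.fit(X_demo, y_demo)

print(f"oob_score_ : {oob_model.oob_score_:.3f}")
```

With the wine data the same two arguments could be added to the constructor call, after which `oob_score_` gives a validation estimate without touching the test set.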

In [15]:
# Model attributes

print(f"classes_ : {If_model.classes_}")
print(f"n_classes_ : {If_model.n_classes_}")
print()
print(f"feature_names_in_ : {If_model.feature_names_in_}")
print(f"n_features_in_ : {If_model.n_features_in_}")
print(f"feature_importances_ : {If_model.feature_importances_}")

classes_ : [0. 1.]
n_classes_ : 2

feature_names_in_ : ['alcohol' 'sugar' 'pH']
n_features_in_ : 3
feature_importances_ : [0.18992806 0.53030305 0.2797689 ]
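
The importances are easier to read when paired with their feature names. The sketch below copies the values from the output above into a pandas Series; with the fitted model, `feature_importances_` and `feature_names_in_` could be passed in directly.

```python
import pandas as pd

# Values copied from the notebook output above
feature_names = ["alcohol", "sugar", "pH"]
importances = [0.18992806, 0.53030305, 0.2797689]

# Sort so the most influential feature comes first
importanceSR = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(importanceSR)
```

Here `sugar` accounts for a little over half of the total importance, followed by `pH` and `alcohol`.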


In [16]:
# Base estimator and fitted sub-estimators

print(f"estimator_ : {If_model.estimator_}")

for est in If_model.estimators_:
    print(est)

estimator_ : ExtraTreeClassifier()
ExtraTreeClassifier(random_state=327741615)
ExtraTreeClassifier(random_state=976413892)
ExtraTreeClassifier(random_state=1202242073)
ExtraTreeClassifier(random_state=1369975286)
ExtraTreeClassifier(random_state=1882953283)
ExtraTreeClassifier(random_state=2053951699)
ExtraTreeClassifier(random_state=959775639)
ExtraTreeClassifier(random_state=1956722279)
ExtraTreeClassifier(random_state=2052949340)
ExtraTreeClassifier(random_state=1322904761)
ExtraTreeClassifier(random_state=165338510)
ExtraTreeClassifier(random_state=1133316631)
ExtraTreeClassifier(random_state=4812360)
ExtraTreeClassifier(random_state=372560217)
ExtraTreeClassifier(random_state=309457262)
ExtraTreeClassifier(random_state=1801189930)
ExtraTreeClassifier(random_state=1152936666)
ExtraTreeClassifier(random_state=68334472)
ExtraTreeClassifier(random_state=2146978983)
ExtraTreeClassifier(random_state=119248870)
ExtraTreeClassifier(random_state=769786948)
ExtraTreeClassifier(random_state=1

In [None]:
## The oob_score_ should come out high
## For now more work is needed (perhaps because it is low?)

In [None]:
# Now check the accuracy

[4] Evaluate performance

In [18]:
train_score = If_model.score(X_train, y_train)
test_score = If_model.score(X_test, y_test)

In [19]:
print(f"train_score : {train_score}   test_score : {test_score}")

## => overfitting: the train score is much higher than the test score

train_score : 0.9973061381566288   test_score : 0.8961538461538462
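
A single train/test comparison depends on one particular split, so the gap could also be estimated with cross-validation. The sketch below uses synthetic stand-in data from `make_classification` (an assumption) and reports mean train and validation accuracy over 5 folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption); the notebook uses X_train / y_train instead
X_demo, y_demo = make_classification(n_samples=500, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=7)

cv_scores = cross_validate(ExtraTreesClassifier(random_state=7), X_demo, y_demo,
                           cv=5, return_train_score=True)

train_mean = cv_scores["train_score"].mean()
valid_mean = cv_scores["test_score"].mean()
print(f"train: {train_mean:.3f}  validation: {valid_mean:.3f}  gap: {train_mean - valid_mean:.3f}")
```

A large positive gap points to overfitting, which is what limiting `max_depth` or raising `min_samples_leaf` is meant to reduce.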


[5] Tuning

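* RandomizedSearchCV: a hyperparameter optimization class  

    - Well suited to hyperparameter settings that span a wide range  
    
    - Samples hyperparameter combinations from the given ranges a specified number of times (n_iter) and evaluates each one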
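
The sampling step described above can be previewed with scikit-learn's `ParameterSampler`, which draws parameter combinations in the same spirit as RandomizedSearchCV. This is a sketch for illustration; `param_space` and `candidates` are hypothetical names, and the ranges mirror the ones used later in this notebook.

```python
from sklearn.model_selection import ParameterSampler

# Same ranges as the search space used in this notebook
param_space = {"max_depth": list(range(2, 15)),
               "min_samples_leaf": list(range(5, 16))}

# Draw 10 combinations at random instead of enumerating all 13 * 11 = 143
candidates = list(ParameterSampler(param_space, n_iter=10, random_state=0))

for candidate in candidates:
    print(candidate)
```

Only the sampled combinations are fitted and scored, which keeps the cost fixed at `n_iter` times the number of folds however wide the ranges are.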

In [20]:
# Load module
from sklearn.model_selection import RandomizedSearchCV

In [27]:
# RandomizedSearchCV hyperparameter settings
params = {"max_depth" : range(2, 15),
          "min_samples_leaf" : range(5, 16),
          }

In [28]:
rf_model = ExtraTreesClassifier(random_state=7)

In [29]:
searchCV = RandomizedSearchCV(rf_model, param_distributions = params, n_iter = 10, verbose = 4)

In [30]:
searchCV.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END ...max_depth=9, min_samples_leaf=9;, score=0.754 total time=   0.0s
[CV 2/5] END ...max_depth=9, min_samples_leaf=9;, score=0.754 total time=   0.0s
[CV 3/5] END ...max_depth=9, min_samples_leaf=9;, score=0.755 total time=   0.0s
[CV 4/5] END ...max_depth=9, min_samples_leaf=9;, score=0.754 total time=   0.0s
[CV 5/5] END ...max_depth=9, min_samples_leaf=9;, score=0.754 total time=   0.0s
[CV 1/5] END ..max_depth=10, min_samples_leaf=7;, score=0.754 total time=   0.0s
[CV 2/5] END ..max_depth=10, min_samples_leaf=7;, score=0.754 total time=   0.0s
[CV 3/5] END ..max_depth=10, min_samples_leaf=7;, score=0.755 total time=   0.0s
[CV 4/5] END ..max_depth=10, min_samples_leaf=7;, score=0.754 total time=   0.0s
[CV 5/5] END ..max_depth=10, min_samples_leaf=7;, score=0.754 total time=   0.0s
[CV 1/5] END ...max_depth=8, min_samples_leaf=6;, score=0.754 total time=   0.0s
[CV 2/5] END ...max_depth=8, min_samples_leaf=6;

In [31]:
# model parameter

print(f"[searchCV.best_score_] {searchCV.best_score_}")
print(f"[searchCV.best_params_] {searchCV.best_params_}")
print(f"[searchCV.best_estimator_] {searchCV.best_estimator_}")

cv_resultDF = pd.DataFrame(searchCV.cv_results_)
cv_resultDF


[searchCV.best_score_] 0.75389649811209
[searchCV.best_params_] {'min_samples_leaf': 9, 'max_depth': 9}
[searchCV.best_estimator_] ExtraTreesClassifier(max_depth=9, min_samples_leaf=9, random_state=7)


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_min_samples_leaf,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.090194,0.012118,0.007316,0.001382,9,9,"{'min_samples_leaf': 9, 'max_depth': 9}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
1,0.079735,0.010718,0.006603,0.00098,7,10,"{'min_samples_leaf': 7, 'max_depth': 10}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
2,0.073003,0.003374,0.006057,0.000709,6,8,"{'min_samples_leaf': 6, 'max_depth': 8}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
3,0.073424,0.004214,0.005845,0.001266,5,9,"{'min_samples_leaf': 5, 'max_depth': 9}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
4,0.09382,0.01327,0.006788,0.001675,14,6,"{'min_samples_leaf': 14, 'max_depth': 6}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
5,0.088093,0.003926,0.00552,0.000318,15,12,"{'min_samples_leaf': 15, 'max_depth': 12}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
6,0.078233,0.004308,0.005832,0.000152,6,14,"{'min_samples_leaf': 6, 'max_depth': 14}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
7,0.076123,0.005968,0.006231,0.000389,13,12,"{'min_samples_leaf': 13, 'max_depth': 12}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
8,0.071963,0.001407,0.006508,0.000784,7,14,"{'min_samples_leaf': 7, 'max_depth': 14}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
9,0.086384,0.022166,0.006648,0.002359,15,14,"{'min_samples_leaf': 15, 'max_depth': 14}",0.753846,0.753846,0.754572,0.753609,0.753609,0.753896,0.000354,1
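
Every candidate scored about 0.754, which is close to the share of class 1.0 in the full data (4898 / 6497). One possible reading is that these settings make the model predict the majority class for almost every sample. A majority-class baseline makes that comparison explicit; the sketch below rebuilds the labels from the class counts printed earlier and uses a placeholder feature column, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts printed earlier (4898 ones, 1599 zeros)
y_all = np.array([1.0] * 4898 + [0.0] * 1599)
X_placeholder = np.zeros((len(y_all), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_all)
baseline_score = baseline.score(X_placeholder, y_all)

print(f"majority-class accuracy : {baseline_score:.6f}")
```

If a tuned model does not beat this number, widening the search space (for example, allowing smaller `min_samples_leaf` values) or rebalancing the classes may be worth trying.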
