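### Ensemble - RandomForest & ExtraTrees
- Bagging ensemble ==> bootstrap samples (random sampling with replacement) + the same base model (DT)
    * Representative algorithm : RandomForestClassifier / RandomForestRegressor
- Pasting ensemble ==> random samples drawn without replacement + the same base model (DT)
    * Related algorithm : ExtraTreesClassifier / ExtraTreesRegressor (in scikit-learn these default to bootstrap=False, so each tree is trained on the full training set and the extra randomness comes from randomly drawn split thresholds)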
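
The difference between the two sampling schemes can be illustrated with the standard library alone. This is a minimal sketch: the sample size, the 80% subset ratio, and the seed are arbitrary assumptions, not values taken from this notebook.

```python
import random

random.seed(0)              # arbitrary seed, for repeatability only
n = 1000                    # hypothetical training-set size
indices = list(range(n))

# Bagging: draw n row indices WITH replacement (a bootstrap sample).
bagging_idx = random.choices(indices, k=n)

# Pasting: draw a subset WITHOUT replacement (here 80% of the rows).
pasting_idx = random.sample(indices, k=int(0.8 * n))

bagging_unique = len(set(bagging_idx)) / n
pasting_unique = len(set(pasting_idx)) / len(pasting_idx)

# With replacement, roughly 1 - 1/e of the rows are expected to appear.
print(f"bagging: unique fraction = {bagging_unique:.3f}")
# Without replacement, no row can repeat.
print(f"pasting: unique fraction = {pasting_unique:.3f}")
```

The rows that a bootstrap sample never picks are the "out-of-bag" rows that `oob_score=True` relies on later.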

[Goal] Wine classification => binary classification into 2 classes (0 and 1)

[1] Load modules and prepare the data

In [1]:
# Load modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
# Data file
DATA_FILE = '../Data/wine.csv'

# CSV >> DataFrame
WineDF = pd.read_csv(DATA_FILE)

In [6]:
WineDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alcohol  6497 non-null   float64
 1   sugar    6497 non-null   float64
 2   pH       6497 non-null   float64
 3   class    6497 non-null   float64
dtypes: float64(4)
memory usage: 203.2 KB


In [7]:
WineDF.head(2)

   alcohol  sugar    pH  class
0      9.4    1.9  3.51    0.0
1      9.8    2.6  3.20    0.0


In [8]:
# Class distribution of the target/label
WineDF['class'].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64
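
Because the classes are imbalanced, it helps to know what accuracy a trivial "always predict the majority class" model would reach. The sketch below works this out from the counts printed above; it is only arithmetic on those two numbers and does not touch the CSV file.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1.0: 4898, 0.0: 1599}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"total samples     : {total}")
print(f"majority class    : {majority_class}")
print(f"baseline accuracy : {baseline_accuracy:.4f}")
```

Any accuracy reported later is easier to judge against this baseline than against 50%.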

In [11]:
WineDF.describe()

           alcohol        sugar           pH        class
count  6497.000000  6497.000000  6497.000000  6497.000000
mean     10.491801     5.443235     3.218501     0.753886
std       1.192712     4.757804     0.160787     0.430779
min       8.000000     0.600000     2.720000     0.000000
25%       9.500000     1.800000     3.110000     1.000000
50%      10.300000     3.000000     3.210000     1.000000
75%      11.300000     8.100000     3.320000     1.000000
max      14.900000    65.800000     4.010000     1.000000


[2] Training preparation

In [12]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

In [15]:
featureDF = WineDF.iloc[:, :3]
targetSR = WineDF['class']

print(f"featureDF : {featureDF.shape}, targetSR : {targetSR.shape}")

featureDF : (6497, 3), targetSR : (6497,)


In [20]:
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetSR,
                                                    stratify=targetSR,
                                                    test_size=0.2,
                                                    random_state=1)

print(f"X_train : {X_train.shape}, y_train : {y_train.shape}")
print(f"X_test : {X_test.shape}, y_test : {y_test.shape}")

X_train : (5197, 3), y_train : (5197,)
X_test : (1300, 3), y_test : (1300,)
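
The effect of `stratify` can be checked by comparing class shares in the two splits. The sketch below uses a synthetic stand-in (random features with the same class counts as the wine data), so it runs without the CSV file; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: same class counts as the wine data, random features.
rng = np.random.default_rng(0)
y_demo = np.array([1] * 4898 + [0] * 1599)
X_demo = rng.normal(size=(y_demo.size, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=1
)

train_ratio = y_tr.mean()   # share of class 1 in the training split
test_ratio = y_te.mean()    # share of class 1 in the test split

print(f"shapes         train: {X_tr.shape}  test: {X_te.shape}")
print(f"class-1 share  train: {train_ratio:.4f}  test: {test_ratio:.4f}")
```

With stratification the two shares should be nearly identical, which matters here because class 1 makes up about three quarters of the samples.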


[3] Training

In [22]:
# Learning type : supervised learning > classification
# Algorithm     : ensemble > bagging - RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

In [36]:
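# Create the instance => builds 100 internal DT models (the default n_estimators), each trained on its own bootstrap sample
#   random_state : fixes the random sampling so the same datasets are generated on every run
#   oob_score    : uses the samples left out of each tree's bootstrap sample (out-of-bag) for validation
rf = RandomForestClassifier(random_state=7,
                            oob_score=True)

# Fit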
rf.fit(X_train, y_train)

In [27]:
# Fitted model attributes
print(f"rf.classes_ : {rf.classes_}")
print(f"rf.n_classes_ : {rf.n_classes_}")
print()
print(f"rf.feature_names_in_ : {rf.feature_names_in_}")
print(f"rf.n_features_in_ : {rf.n_features_in_}")
print(f"rf.feature_importances_ : {rf.feature_importances_}")

rf.classes_ : [0. 1.]
rf.n_classes_ : 2

rf.feature_names_in_ : ['alcohol' 'sugar' 'pH']
rf.n_features_in_ : 3
rf.feature_importances_ : [0.23614754 0.49961563 0.26423683]
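
The raw importance array is easier to read when each value is paired with its feature name and sorted. A minimal sketch on synthetic data follows; the feature names `f0`, `f1`, `f2` and the dataset are placeholders, not the wine columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-feature data; names are placeholders, not the wine columns.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=2, n_redundant=0,
    random_state=0,
)
feature_names = ["f0", "f1", "f2"]

demo_rf = RandomForestClassifier(n_estimators=50, random_state=7)
demo_rf.fit(X_demo, y_demo)

# Pair each name with its importance, sorted from most to least important.
ranked = sorted(
    zip(feature_names, demo_rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

# Impurity-based importances are normalized, so they sum to 1.
importance_sum = float(np.sum(demo_rf.feature_importances_))
print(f"sum of importances: {importance_sum:.3f}")
```

In the notebook itself the same pairing could be done with `rf.feature_names_in_` and `rf.feature_importances_`.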


In [29]:
# Base estimator and the fitted trees
print(f"estimator_       : {rf.estimator_}")
for tree in rf.estimators_:
    print(tree)

estimator_       : DecisionTreeClassifier()
DecisionTreeClassifier(max_features='sqrt', random_state=327741615)
DecisionTreeClassifier(max_features='sqrt', random_state=976413892)
DecisionTreeClassifier(max_features='sqrt', random_state=1202242073)
DecisionTreeClassifier(max_features='sqrt', random_state=1369975286)
DecisionTreeClassifier(max_features='sqrt', random_state=1882953283)
DecisionTreeClassifier(max_features='sqrt', random_state=2053951699)
DecisionTreeClassifier(max_features='sqrt', random_state=959775639)
DecisionTreeClassifier(max_features='sqrt', random_state=1956722279)
DecisionTreeClassifier(max_features='sqrt', random_state=2052949340)
DecisionTreeClassifier(max_features='sqrt', random_state=1322904761)
DecisionTreeClassifier(max_features='sqrt', random_state=165338510)
DecisionTreeClassifier(max_features='sqrt', random_state=1133316631)
DecisionTreeClassifier(max_features='sqrt', random_state=4812360)
DecisionTreeClassifier(max_features='sqrt', random_state=372560217)
...
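
Each entry in `estimators_` is an independently fitted tree, and the forest combines them by averaging their predicted class probabilities. The sketch below checks this on a small synthetic dataset; the data and the 10-tree forest are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0,
    random_state=0,
)

demo_rf = RandomForestClassifier(n_estimators=10, random_state=7)
demo_rf.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.stack(
    [tree.predict_proba(X_demo) for tree in demo_rf.estimators_]
)
manual_proba = tree_probas.mean(axis=0)

# Compare with the probabilities reported by the forest itself.
forest_proba = demo_rf.predict_proba(X_demo)
matches = bool(np.allclose(manual_proba, forest_proba))

print(f"number of trees: {len(demo_rf.estimators_)}")
print(f"manual average matches predict_proba: {matches}")
```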

In [37]:
print(f"oob_score_ : {rf.oob_score_}")

oob_score_ : 0.8949393881085241
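
The out-of-bag score is meant to approximate accuracy on unseen data without setting aside a validation split. A hedged sketch on synthetic data (not the wine set) compares it with a held-out test score; the dataset parameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise, standing in for the real dataset.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=2, n_redundant=0,
    flip_y=0.05, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=1
)

demo_rf = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=7
)
demo_rf.fit(X_tr, y_tr)

oob = demo_rf.oob_score_              # accuracy on out-of-bag samples
held_out = demo_rf.score(X_te, y_te)  # accuracy on the separate test split

print(f"oob_score_    : {oob:.4f}")
print(f"test accuracy : {held_out:.4f}")
# Per-sample class probabilities computed only from trees that did not
# see that sample during training.
print(f"oob_decision_function_ shape: {demo_rf.oob_decision_function_.shape}")
```

In the notebook, the OOB score (about 0.895) and the test score reported in the next section (about 0.902) are similarly close.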


[4] Performance evaluation

In [34]:
train_score = rf.score(X_train, y_train)
test_score = rf.score(X_test, y_test)

In [35]:
print(f"{train_score} {test_score}")

0.9973061381566288 0.9015384615384615
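
A training score near 1.0 next to a test score near 0.90 suggests the unconstrained trees are overfitting. The sketch below illustrates, on synthetic noisy data, how a leaf-size constraint tends to pull the training score down toward the test score. The dataset and the value `min_samples_leaf=10` are assumptions for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 10% flipped labels so that memorization is visible.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=1
)

default_rf = RandomForestClassifier(random_state=7).fit(X_tr, y_tr)
constrained_rf = RandomForestClassifier(
    min_samples_leaf=10, random_state=7
).fit(X_tr, y_tr)

default_train = default_rf.score(X_tr, y_tr)
default_test = default_rf.score(X_te, y_te)
constrained_train = constrained_rf.score(X_tr, y_tr)
constrained_test = constrained_rf.score(X_te, y_te)

print(f"default     train={default_train:.4f} test={default_test:.4f}")
print(f"constrained train={constrained_train:.4f} test={constrained_test:.4f}")
```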


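[5] Tuning

- RandomizedSearchCV : a hyperparameter optimization class
    * Well suited to hyperparameter spaces with wide ranges
    * Samples the specified number of hyperparameter combinations (n_iter) from the given ranges and evaluates each one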
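
For wide ranges, the search space can also be given as probability distributions rather than explicit lists. The sketch below uses `scipy.stats.randint` on a small synthetic dataset; the reduced `n_estimators`, `cv`, and `n_iter` values are assumptions chosen only to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0,
    random_state=0,
)

# randint(low, high) samples integers from low up to high - 1.
param_distributions = {
    "max_depth": randint(2, 16),
    "min_samples_leaf": randint(5, 16),
}

demo_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,
    random_state=0,    # makes the sampled combinations repeatable
)
demo_search.fit(X_demo, y_demo)

sampled = demo_search.cv_results_["params"]
for candidate in sampled:
    print(candidate)
print(f"best params: {demo_search.best_params_}")
```

Passing `random_state` to `RandomizedSearchCV` itself is what makes the sampled candidates reproducible; the `random_state` on the forest only fixes the trees.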

In [39]:
# Load module
from sklearn.model_selection import RandomizedSearchCV

In [56]:
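# Hyperparameter search space for RandomForestClassifier
# ('entropy' and 'log_loss' are equivalent criteria for classification trees in scikit-learn)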
params = {'max_depth' : range(2, 16),
          'min_samples_leaf' : range(5, 16),
          'criterion' : ['gini', 'entropy', 'log_loss']}

In [57]:
rf = RandomForestClassifier(random_state=7)

In [58]:
searchCV = RandomizedSearchCV(rf, param_distributions=params,
                              n_iter=50, verbose=4)

In [59]:
searchCV.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END criterion=entropy, max_depth=10, min_samples_leaf=15;, score=0.871 total time=   0.2s
[CV 2/5] END criterion=entropy, max_depth=10, min_samples_leaf=15;, score=0.839 total time=   0.2s
[CV 3/5] END criterion=entropy, max_depth=10, min_samples_leaf=15;, score=0.877 total time=   0.2s
[CV 4/5] END criterion=entropy, max_depth=10, min_samples_leaf=15;, score=0.882 total time=   0.2s
[CV 5/5] END criterion=entropy, max_depth=10, min_samples_leaf=15;, score=0.869 total time=   0.2s
[CV 1/5] END criterion=log_loss, max_depth=8, min_samples_leaf=6;, score=0.872 total time=   0.2s
[CV 2/5] END criterion=log_loss, max_depth=8, min_samples_leaf=6;, score=0.838 total time=   0.2s
[CV 3/5] END criterion=log_loss, max_depth=8, min_samples_leaf=6;, score=0.879 total time=   0.2s
[CV 4/5] END criterion=log_loss, max_depth=8, min_samples_leaf=6;, score=0.880 total time=   0.2s
[CV 5/5] END criterion=log_loss, max_depth=8, min_s

In [52]:
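# Search results
# NOTE: the output below lists 10 candidates over max_depth / min_samples_leaf only,
#       so it appears to come from an earlier run (execution count 52), made before
#       'criterion' was added to params and n_iter was set to 50.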
print(f"[ searchCV.best_score_ ] {searchCV.best_score_}")
print(f"[ searchCV.best_params_ ] {searchCV.best_params_}")
print(f"[ searchCV.best_estimator_ ] {searchCV.best_estimator_}")

cv_resultDF = pd.DataFrame(searchCV.cv_results_)
cv_resultDF

[ searchCV.best_score_ ] 0.8745472717850005
[ searchCV.best_params_ ] {'min_samples_leaf': 6, 'max_depth': 14}
[ searchCV.best_estimator_ ] RandomForestClassifier(max_depth=14, min_samples_leaf=6, random_state=7)


,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_min_samples_leaf,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.143105,0.007747,0.007483,6.6e-05,6,6,"{'min_samples_leaf': 6, 'max_depth': 6}",0.865385,0.832692,0.882579,0.877767,0.858518,0.863388,0.017578,8
1,0.162349,0.002539,0.009441,0.000168,10,10,"{'min_samples_leaf': 10, 'max_depth': 10}",0.873077,0.840385,0.880654,0.881617,0.871992,0.869545,0.015086,3
2,0.156626,0.007926,0.009036,0.000276,13,9,"{'min_samples_leaf': 13, 'max_depth': 9}",0.869231,0.8375,0.87488,0.882579,0.871992,0.867236,0.015522,5
3,0.114331,0.004748,0.005535,5.6e-05,5,3,"{'min_samples_leaf': 5, 'max_depth': 3}",0.804808,0.821154,0.834456,0.843118,0.824832,0.825674,0.012946,10
4,0.15436,0.003808,0.008947,8.5e-05,13,10,"{'min_samples_leaf': 13, 'max_depth': 10}",0.874038,0.830769,0.87103,0.887392,0.875842,0.867814,0.019335,4
5,0.148654,0.007552,0.008048,5.6e-05,15,7,"{'min_samples_leaf': 15, 'max_depth': 7}",0.869231,0.834615,0.872955,0.883542,0.86333,0.864735,0.016436,6
6,0.136599,0.003182,0.007364,5.7e-05,9,6,"{'min_samples_leaf': 9, 'max_depth': 6}",0.864423,0.830769,0.872955,0.880654,0.85948,0.861656,0.017059,9
7,0.156885,0.001636,0.009285,9.1e-05,10,13,"{'min_samples_leaf': 10, 'max_depth': 13}",0.875962,0.834615,0.881617,0.884504,0.87103,0.869546,0.018072,2
8,0.143837,0.00804,0.008163,0.000108,10,7,"{'min_samples_leaf': 10, 'max_depth': 7}",0.869231,0.833654,0.876805,0.87488,0.866218,0.864157,0.015719,7
9,0.163674,0.005027,0.009735,8.9e-05,6,14,"{'min_samples_leaf': 6, 'max_depth': 14}",0.883654,0.843269,0.883542,0.882579,0.879692,0.874547,0.015704,1
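
The full `cv_results_` table is wide, so it is often easier to keep a few columns and sort by rank. A minimal sketch on synthetic data follows; the small grid and the reduced forest size are assumptions to keep it quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0,
    random_state=0,
)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions={"max_depth": [3, 6, 9],
                         "min_samples_leaf": [5, 10]},
    n_iter=4, cv=3, random_state=0,
)
demo_search.fit(X_demo, y_demo)

results = pd.DataFrame(demo_search.cv_results_)

# Keep only the columns needed to compare candidates, best first.
columns = ["rank_test_score", "mean_test_score", "std_test_score", "params"]
ranked = results[columns].sort_values("rank_test_score").reset_index(drop=True)
print(ranked)
```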


In [53]:
searchCV.best_estimator_

In [51]:
searchCV.best_params_

{'min_samples_leaf': 6, 'max_depth': 14}
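
`best_score_` is a cross-validation average on the training data, so a natural last step is to score the refitted best model on the held-out test set. The sketch below shows this on synthetic data. It relies on the default `refit=True`, under which `best_estimator_` is retrained on the whole training set; the dataset and grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=2, n_redundant=0,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=1
)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=7),
    param_distributions={"max_depth": [3, 6, 9],
                         "min_samples_leaf": [5, 10]},
    n_iter=4, cv=3, random_state=0,
)
demo_search.fit(X_tr, y_tr)

# The refitted best model can be used directly...
best_model = demo_search.best_estimator_
test_via_model = best_model.score(X_te, y_te)

# ...or through the search object, which delegates to it.
test_via_search = demo_search.score(X_te, y_te)

print(f"CV best_score_ : {demo_search.best_score_:.4f}")
print(f"test accuracy  : {test_via_model:.4f}")
```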