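#### Ensembles - Random Forest & ExtraTrees
- Bagging-style ensemble ==> random samples drawn with replacement (duplicates allowed) + one base model type (DecisionTree)
    * Representative algorithm: RandomForestClassifier/Regressor
- Pasting-style ensemble ==> random samples drawn without replacement + one base model type (DecisionTree)
    * Representative algorithm: ExtraTreesClassifier/Regressor (scikit-learn defaults to `bootstrap=False`, so each tree is fit on the full training set and split thresholds are drawn at random)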
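
The difference between the two sampling schemes above can be sketched with the standard library alone. This is a minimal illustration, not how scikit-learn implements it; the helper names `bagging_sample` and `pasting_sample` are hypothetical.

```python
import random

def bagging_sample(indices, k, rng):
    """Draw k row indices WITH replacement (bootstrap); duplicates can occur."""
    return [rng.choice(indices) for _ in range(k)]

def pasting_sample(indices, k, rng):
    """Draw k row indices WITHOUT replacement; every index is unique."""
    return rng.sample(indices, k)

rng = random.Random(0)          # fixed seed so the draw is repeatable
indices = list(range(10))       # stand-in for the row numbers of a dataset

bag = bagging_sample(indices, 10, rng)
paste = pasting_sample(indices, 7, rng)

print("bagging :", sorted(bag))
print("pasting :", sorted(paste))
# Rows never drawn into the bootstrap sample are the "out-of-bag" rows.
print("out-of-bag rows:", sorted(set(indices) - set(bag)))
```

Each tree in a bagging ensemble gets its own sample like `bag`, which is why some rows are left over for out-of-bag validation.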

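- Wine classification => binary classification into classes 0 and 1
- Dataset: wine.csv
- Goal: classify wines
- Learning type: supervised learning - classification
- Algorithm: ExtraTreesClassifier (a randomized-tree ensemble closely related to random forest)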

[1] Load modules

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

[2] Prepare data

In [3]:
file_path = 'wine.csv'
wineDF = pd.read_csv(file_path)
wineDF.head(2)

| | alcohol | sugar | pH | class |
|---|---|---|---|---|
| 0 | 9.4 | 1.9 | 3.51 | 0.0 |
| 1 | 9.8 | 2.6 | 3.2 | 0.0 |


In [4]:
wineDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   alcohol  6497 non-null   float64
 1   sugar    6497 non-null   float64
 2   pH       6497 non-null   float64
 3   class    6497 non-null   float64
dtypes: float64(4)
memory usage: 203.2 KB


In [5]:
# Class distribution of the target/label
wineDF['class'].value_counts()

class
1.0    4898
0.0    1599
Name: count, dtype: int64

# Preprocessing is skipped here

[3] Prepare for training

[3-1] Set features/target

In [6]:
featureDF=wineDF[wineDF.columns[:-1]]
targetSR = wineDF[wineDF.columns[-1]]

print(f'featureDF: {featureDF.shape}, {featureDF.ndim}D')
print(f'targetSR: {targetSR.shape}, {targetSR.ndim}D')

featureDF: (6497, 3), 2D
targetSR: (6497,), 1D


[3-2] Split train/test data

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetSR, test_size=0.2, stratify=targetSR,  random_state=1)

In [9]:
print(f'X_train : {X_train.shape} y_train : {y_train.shape}')
print(f'X_test : {X_test.shape} y_test : {y_test.shape}')

X_train : (5197, 3) y_train : (5197,)
X_test : (1300, 3) y_test : (1300,)


[4] Create the model

In [10]:
from sklearn.ensemble import ExtraTreesClassifier

In [11]:
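# Create the instance => an ensemble of 100 internal tree models (n_estimators defaults to 100)
#   random_state fixes the seeds so the fitted ensemble is reproducible
#   oob_score (not set here) validates each tree on the rows left out of its sample; it requires bootstrap=True
lf_model = ExtraTreesClassifier(random_state=7)

# Fit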
lf_model.fit(X_train, y_train)

In [12]:
# Fitted model attributes
print(f'classes_ : {lf_model.classes_}')
print(f'n_classes_ : {lf_model.n_classes_}')
print(f'feature_names_in_ : {lf_model.feature_names_in_}')
print(f'n_features_in_ : {lf_model.n_features_in_}')
print(f'feature_importances_ : {lf_model.feature_importances_}')

classes_ : [0. 1.]
n_classes_ : 2
feature_names_in_ : ['alcohol' 'sugar' 'pH']
n_features_in_ : 3
feature_importances_ : [0.18992806 0.53030305 0.2797689 ]
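
The raw importance array is easier to read once it is paired with the feature names and sorted. The sketch below reuses the values printed above rather than refitting the model; the variable names `importances` and `ranked` are illustrative only.

```python
import pandas as pd

# Values copied from the feature_importances_ output above,
# paired with the names from feature_names_in_.
importances = pd.Series(
    [0.18992806, 0.53030305, 0.2797689],
    index=['alcohol', 'sugar', 'pH'],
    name='feature_importances_',
)

# Sort from most to least important.
ranked = importances.sort_values(ascending=False)
print(ranked)
```

With a fitted model in scope, the same Series could be built from `lf_model.feature_importances_` and `lf_model.feature_names_in_`.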


In [13]:
# Base estimator and the fitted member trees
print(f'estimator_  :{lf_model.estimator_}')
for tree in lf_model.estimators_:
    print(tree)

estimator_  :ExtraTreeClassifier()
ExtraTreeClassifier(random_state=327741615)
ExtraTreeClassifier(random_state=976413892)
ExtraTreeClassifier(random_state=1202242073)
ExtraTreeClassifier(random_state=1369975286)
ExtraTreeClassifier(random_state=1882953283)
ExtraTreeClassifier(random_state=2053951699)
ExtraTreeClassifier(random_state=959775639)
ExtraTreeClassifier(random_state=1956722279)
ExtraTreeClassifier(random_state=2052949340)
ExtraTreeClassifier(random_state=1322904761)
ExtraTreeClassifier(random_state=165338510)
ExtraTreeClassifier(random_state=1133316631)
ExtraTreeClassifier(random_state=4812360)
ExtraTreeClassifier(random_state=372560217)
ExtraTreeClassifier(random_state=309457262)
ExtraTreeClassifier(random_state=1801189930)
ExtraTreeClassifier(random_state=1152936666)
ExtraTreeClassifier(random_state=68334472)
ExtraTreeClassifier(random_state=2146978983)
ExtraTreeClassifier(random_state=119248870)
ExtraTreeClassifier(random_state=769786948)
ExtraTreeClassifier(random_state=

In [17]:
# # Row indices sampled for each member tree
# print(f'estimators_samples_ :{lf_model.estimators_samples_}')
# # for idx in lf_model.estimators_:
# #     print(idx)

In [16]:
# oob_score_ exists only when the model was created with bootstrap=True and oob_score=True
#print(f'oob_score_ : {lf_model.oob_score_}')
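
The `oob_score_` line is commented out because the model above was created without bootstrapping. A minimal sketch of how the attribute can be obtained follows; it uses synthetic data from `make_classification` as a stand-in for wine.csv (an assumption for the sake of a self-contained example), and the name `oob_model` is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic 3-feature binary dataset standing in for the wine data.
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    random_state=7,
)

# bootstrap=True makes each tree train on a bootstrap sample;
# oob_score=True then scores each row using only the trees that did not see it.
oob_model = ExtraTreesClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=7,
)
oob_model.fit(X, y)

print(f'oob_score_ : {oob_model.oob_score_:.3f}')
```

Setting `oob_score=True` while leaving `bootstrap=False` raises an error in scikit-learn, since no rows are left out of any tree's sample.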

[5] Evaluate performance

In [17]:
train_score = lf_model.score(X_train, y_train)
test_score = lf_model.score(X_test, y_test)

In [18]:
print(f'train_score : {train_score}    test_score : {test_score}')

train_score : 0.9973061381566288    test_score : 0.8961538461538462


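[6] Tuning

- RandomizedSearchCV: hyperparameter optimization class
    - Well suited to wide hyperparameter ranges
    - Draws the specified number of hyperparameter combinations (n_iter) from the given ranges and evaluates each one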
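
To see what "drawing a fixed number of combinations" looks like, the sampling step can be reproduced on its own with `ParameterSampler`, the scikit-learn utility for this kind of draw. This is a sketch only: the parameter space mirrors the one used in this notebook, and `n_iter=5` and `random_state=7` are arbitrary choices for the illustration.

```python
from sklearn.model_selection import ParameterSampler

# Same search space as the tuning cell in this notebook.
param_space = {
    'max_depth': list(range(2, 16)),
    'min_samples_leaf': list(range(5, 16)),
    'criterion': ['gini', 'entropy', 'log_loss'],
}

# Draw 5 of the 14 * 11 * 3 = 462 possible combinations.
candidates = list(ParameterSampler(param_space, n_iter=5, random_state=7))

for candidate in candidates:
    print(candidate)
```

RandomizedSearchCV fits and cross-validates the estimator once per drawn candidate, so `n_iter` directly controls the cost of the search.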

In [19]:
# Load module
from sklearn.model_selection import RandomizedSearchCV

In [20]:
# ExtraTreesClassifier hyperparameters
params = {'max_depth':range(2, 16),
          'min_samples_leaf' : range(5, 16),
          'criterion': ['gini', 'entropy', 'log_loss']}

In [26]:
rf_model = ExtraTreesClassifier(n_estimators=300, random_state=7)  # use 300 trees

In [27]:
searchCV = RandomizedSearchCV(rf_model, param_distributions=params, n_iter=50,
                              verbose=4)
searchCV.fit(X_train, y_train)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV 1/5] END criterion=gini, max_depth=2, min_samples_leaf=10;, score=0.754 total time=   0.2s
[CV 2/5] END criterion=gini, max_depth=2, min_samples_leaf=10;, score=0.754 total time=   0.3s
[CV 3/5] END criterion=gini, max_depth=2, min_samples_leaf=10;, score=0.755 total time=   0.2s
[CV 4/5] END criterion=gini, max_depth=2, min_samples_leaf=10;, score=0.754 total time=   0.2s
[CV 5/5] END criterion=gini, max_depth=2, min_samples_leaf=10;, score=0.754 total time=   0.2s
[CV 1/5] END criterion=gini, max_depth=15, min_samples_leaf=6;, score=0.754 total time=   0.2s
[CV 2/5] END criterion=gini, max_depth=15, min_samples_leaf=6;, score=0.754 total time=   0.2s
[CV 3/5] END criterion=gini, max_depth=15, min_samples_leaf=6;, score=0.755 total time=   0.2s
[CV 4/5] END criterion=gini, max_depth=15, min_samples_leaf=6;, score=0.754 total time=   0.2s
[CV 5/5] END criterion=gini, max_depth=15, min_samples_leaf=6;, score=0.754 total t

In [29]:
# Search results

print(f'[searchCV.best_score_ ] {searchCV.best_score_}')
print(f'[searchCV.best_params_] {searchCV.best_params_}')
print(f'[searchCV.best_estimator_] {searchCV.best_estimator_}')

cv_resultDF = pd.DataFrame(searchCV.cv_results_)

[searchCV.best_score_ ] 0.75389649811209
[searchCV.best_params_] {'min_samples_leaf': 10, 'max_depth': 2, 'criterion': 'gini'}
[searchCV.best_estimator_] ExtraTreesClassifier(max_depth=2, min_samples_leaf=10, n_estimators=300,
                     random_state=7)


In [30]:
cv_resultDF.head(3)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_min_samples_leaf | param_max_depth | param_criterion | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.261611 | 0.03767 | 0.020445 | 0.00266 | 10 | 2 | gini | {'min_samples_leaf': 10, 'max_depth': 2, 'crit... | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 1 | 0.276702 | 0.021004 | 0.02234 | 0.00693 | 6 | 15 | gini | {'min_samples_leaf': 6, 'max_depth': 15, 'crit... | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
| 2 | 0.251123 | 0.008813 | 0.025493 | 0.00605 | 14 | 13 | gini | {'min_samples_leaf': 14, 'max_depth': 13, 'cri... | 0.753846 | 0.753846 | 0.754572 | 0.753609 | 0.753609 | 0.753896 | 0.000354 | 1 |
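
The cross-validated scores above (about 0.754) are nearly identical to the share of the majority class, 4898 / 6497, which may indicate that these candidates predict class 1 for almost every sample. A quick way to check such a suspicion is a majority-class baseline. The sketch below rebuilds only the label counts reported earlier in this notebook, with a placeholder feature column, so it does not need wine.csv.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the value_counts() output: 4898 of class 1, 1599 of class 0.
y = np.array([1.0] * 4898 + [0.0] * 1599)
X = np.zeros((len(y), 1))   # placeholder features; the baseline ignores them

# Always predict the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_score = baseline.score(X, y)

print(f'majority-class accuracy : {baseline_score:.4f}')   # 0.7539
```

If a tuned model scores no better than this baseline, widening the search space (for example, allowing smaller `min_samples_leaf` values) is one thing worth trying.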
