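### Ensemble - Voting
- An ensemble trains several models in parallel: either different models, or the same model on resampled datasets
- Voting method
    - Setup: the same dataset + models built on different learning algorithms
    - Decision rule: Hard (direct vote on predicted labels), Soft (indirect vote on averaged class probabilities)

- Build a breast cancer classifier ==> 30 features, 2 target classes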
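The two decision rules can disagree on the same sample. The sketch below uses invented probabilities for three hypothetical models (none of these numbers come from the notebook) to show one such case: two models lean weakly toward class 1, while the third is confident in class 0.

```python
import numpy as np

# Hypothetical class probabilities from three models for ONE sample.
# Columns are [P(class 0), P(class 1)]; the numbers are invented for illustration.
probas = np.array([
    [0.45, 0.55],   # model A leans weakly toward class 1
    [0.40, 0.60],   # model B leans weakly toward class 1
    [0.90, 0.10],   # model C is confident in class 0
])

# Hard voting: each model casts one vote for its most likely class,
# and the class with the most votes wins.
votes = probas.argmax(axis=1)
hard_pred = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities across models, then pick the
# class with the highest mean probability.
mean_proba = probas.mean(axis=0)
soft_pred = mean_proba.argmax()

print(f"hard vote -> class {hard_pred}")
print(f"soft vote -> class {soft_pred}")
```

Here the hard vote picks class 1 (two votes to one), while the soft vote picks class 0 because model C's confidence dominates the average.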

[1] Load modules

In [46]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

[2] Prepare the data

In [47]:
dataSet=load_breast_cancer(as_frame=True)

print(dataSet.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


In [48]:
print(f'target_names : {dataSet["target_names"]}')
print(f'feature_names : {dataSet["feature_names"]}')

target_names : ['malignant' 'benign']
feature_names : ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


[3] Prepare for training

[3-1] Set the features and target

In [49]:
featureDF=dataSet['data']
targetSR=dataSet['target']

In [50]:
featureDF.head(2)

,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902


[3-2] Split into train/test data

In [51]:
from sklearn.model_selection import train_test_split

In [52]:
X_train, X_test, y_train, y_test = train_test_split(featureDF, targetSR, test_size=0.2,
                                                    stratify=targetSR, random_state=10)
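Passing `stratify=targetSR` asks the split to keep the class ratio roughly equal in both parts. As a quick sanity check, the sketch below repeats the same split with local names (`X`, `y` are stand-ins for `featureDF` and `targetSR`) and compares the share of class 1 (benign) across the full set and each split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data["data"], data["target"]   # stand-ins for featureDF / targetSR

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10
)

# The target is 0/1, so the mean is the share of class 1 (benign).
full_ratio = y.mean()
train_ratio = y_train.mean()
test_ratio = y_test.mean()

print(f"rows  : train={len(X_train)}, test={len(X_test)}")
print(f"benign: full={full_ratio:.3f}, train={train_ratio:.3f}, test={test_ratio:.3f}")
```

The three ratios should be nearly identical; without `stratify`, they can drift apart by chance on a dataset this small.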

[4] Create the models ==> ensemble with the voting method
- Same dataset for every model
- Algorithms: KNeighborsClassifier, LogisticRegression, DecisionTreeClassifier

In [53]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

In [54]:
# Create the base model instances
k_model = KNeighborsClassifier()
dt_model = DecisionTreeClassifier(random_state=10)
lr_model = LogisticRegression(solver='liblinear')

In [55]:
# Create the voting instances (one hard, one soft)
v_model = VotingClassifier(estimators=[('k_model', k_model), ('dt_model', dt_model), ('lr_model', lr_model)],
                            voting='hard')

vs_model = VotingClassifier(estimators=[('k_model', k_model), ('dt_model', dt_model), ('lr_model', lr_model)],
                            voting='soft')

## Note: earlier `fit` calls received a DataFrame; here an ndarray is passed instead

In [56]:
# Train the models
v_model.fit(X_train.values, y_train.values)   # hard: direct vote, majority of the predicted class labels
vs_model.fit(X_train.values, y_train.values)  # soft: indirect vote, average of each model's class probabilities
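To make the soft rule concrete, the sketch below rebuilds the soft-voting output by hand: it averages `predict_proba` over the fitted member models and compares the result with what `VotingClassifier` returns. It is self-contained, so it refits a local copy named `soft` (a stand-in for `vs_model`) and assumes no `weights` are set, as in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # plain ndarrays
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10
)

soft = VotingClassifier(
    estimators=[
        ("k_model", KNeighborsClassifier()),
        ("dt_model", DecisionTreeClassifier(random_state=10)),
        ("lr_model", LogisticRegression(solver="liblinear")),
    ],
    voting="soft",
).fit(X_train, y_train)

# Probabilities from each fitted member, stacked to shape (3, n_samples, 2).
member_probas = np.stack([est.predict_proba(X_test) for est in soft.estimators_])

# Unweighted soft vote: mean probability per class, then the arg-max class.
manual_proba = member_probas.mean(axis=0)
manual_pred = soft.classes_[manual_proba.argmax(axis=1)]

print(np.allclose(manual_proba, soft.predict_proba(X_test)))
print(np.array_equal(manual_pred, soft.predict(X_test)))
```

Both checks should print `True`. Note that soft voting only works when every member model exposes `predict_proba`.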

In [57]:
# Inspect the fitted model attributes
print( f' [v_model.classes_] : {v_model.classes_}')
print( f' [v_model.estimators_] : {v_model.estimators_}')
print( f' [v_model.named_estimators_] : {v_model.named_estimators_}')
print()
print( f' [v_model.n_features_in_] : {v_model.n_features_in_}')
#print( f' [v_model.feature_names_in_] : {v_model.feature_names_in_}')

 [v_model.classes_] : [0 1]
 [v_model.estimators_] : [KNeighborsClassifier(), DecisionTreeClassifier(random_state=10), LogisticRegression(solver='liblinear')]
 [v_model.named_estimators_] : {'k_model': KNeighborsClassifier(), 'dt_model': DecisionTreeClassifier(random_state=10), 'lr_model': LogisticRegression(solver='liblinear')}

 [v_model.n_features_in_] : 30


[5] Evaluate performance
- Score the models on both the train and test datasets

In [58]:
train_score = v_model.score(X_train.values, y_train.values)
test_score = v_model.score(X_test.values, y_test.values)

soft_train_score = vs_model.score(X_train.values, y_train.values)
soft_test_score = vs_model.score(X_test.values, y_test.values)

In [59]:
print(f'hard_train_score : hard_test_score = {train_score} : {test_score}')
print(f'soft_train_score : soft_test_score = {soft_train_score} : {soft_test_score}')

hard_train_score : hard_test_score = 0.9714285714285714 : 0.956140350877193
soft_train_score : soft_test_score = 0.9868131868131869 : 0.9649122807017544


In [60]:
##### ERROR MESSAGE : AttributeError: 'Flags' object has no attribute 'c_contiguous'
### => Fix: pass the data as an ndarray (e.g. X_test.values)
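One plausible reading of this error is that some code path expected NumPy's memory-layout flags (`ndarray.flags.c_contiguous`) but received a DataFrame, whose `.flags` attribute is an unrelated pandas `Flags` object. The sketch below shows only that difference on a tiny made-up frame. It does not reproduce the original failure, which appears to depend on the installed scikit-learn and pandas versions.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; the values are arbitrary.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# DataFrame.flags is a pandas Flags object, not NumPy's layout flags.
print(type(df.flags).__name__)
print(hasattr(df.flags, "c_contiguous"))

# Converting to ndarray exposes the NumPy flags object.
arr = df.to_numpy()   # same data as df.values
print(hasattr(arr.flags, "c_contiguous"))

# If a C-contiguous layout is required, request it explicitly.
c_arr = np.ascontiguousarray(arr)
print(c_arr.flags.c_contiguous)
```

Passing `X.values` or `X.to_numpy()` to `fit`, `predict`, and `score` consistently, as the notebook does, sidesteps the mismatch.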

In [64]:
from sklearn.metrics import accuracy_score

In [65]:
pred = v_model.predict(X_test.values)
print('Voting classifier accuracy: {0:.4f}'.format(accuracy_score(y_test.values, pred)))

Voting classifier accuracy: 0.9561


In [70]:
# Train, predict with, and evaluate each individual model
# (VotingClassifier fits clones, so these original instances are still unfitted here)
classifiers = [k_model, dt_model, lr_model]
for classifier in classifiers:
    classifier.fit(X_train.values, y_train.values)
    pred = classifier.predict(X_test.values)
    class_name = classifier.__class__.__name__
    print('{0} accuracy: {1:.4f}'.format(class_name, accuracy_score(y_test.values, pred)))


KNeighborsClassifier accuracy: 0.9123
DecisionTreeClassifier accuracy: 0.9474
LogisticRegression accuracy: 0.9386
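The hard-voting score reported earlier can be traced back to these individual predictions. The sketch below refits a local hard-voting ensemble (named `hard`, a stand-in for `v_model`), takes a manual majority vote over its fitted members, and checks that it matches `predict`. With three voters and two classes there are no ties, so "at least two votes for class 1" is the whole rule.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10
)

hard = VotingClassifier(
    estimators=[
        ("k_model", KNeighborsClassifier()),
        ("dt_model", DecisionTreeClassifier(random_state=10)),
        ("lr_model", LogisticRegression(solver="liblinear")),
    ],
    voting="hard",
).fit(X_train, y_train)

# Label predictions from each fitted member, stacked to shape (3, n_samples).
member_preds = np.stack([est.predict(X_test) for est in hard.estimators_])

# Labels are 0/1, so the row sum counts the votes for class 1.
manual_pred = (member_preds.sum(axis=0) >= 2).astype(int)
print(np.array_equal(manual_pred, hard.predict(X_test)))

# Per-member and ensemble accuracy on the test split.
for name, est in hard.named_estimators_.items():
    print(f"{name:8s} accuracy: {est.score(X_test, y_test):.4f}")
print(f"{'hard':8s} accuracy: {hard.score(X_test, y_test):.4f}")
```

Exact accuracies may vary slightly across library versions, so the printed numbers are not guaranteed to match the outputs shown above.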
