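### Ensemble: Voting

- A method that combines several different models to improve predictive performance
    - hard voting: each model predicts a class label, and the final prediction is decided by majority vote
    - soft voting: the models' predicted class probabilities are averaged per class, and the class with the highest average probability is chosen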
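
The difference between the two rules can be seen on a single sample. The sketch below uses made-up probabilities for three hypothetical classifiers (A, B, C are not models from this notebook) and shows a case where hard and soft voting disagree.

```python
import numpy as np

# Hypothetical class-probability outputs of three classifiers for ONE sample.
# Columns are [P(class 0), P(class 1)]; the numbers are invented for illustration.
probas = np.array([
    [0.45, 0.55],   # classifier A leans slightly toward class 1
    [0.40, 0.60],   # classifier B leans slightly toward class 1
    [0.95, 0.05],   # classifier C is very confident in class 0
])

# Hard voting: every classifier casts one vote for its most probable class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes).argmax()   # class with the most votes

# Soft voting: average the probabilities per class, then pick the largest mean.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("votes:", votes, "-> hard voting picks class", hard_winner)
print("mean probabilities:", mean_proba, "-> soft voting picks class", soft_winner)
```

Here two weak votes for class 1 win under hard voting, while the single confident classifier pulls the averaged probability toward class 0 under soft voting.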

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

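##### Wisconsin Breast Cancer Dataset

A dataset frequently used for classifying breast tumors as malignant or benign.
(The features are numeric measurements computed from digitized medical images; they describe the cell nuclei visible in each image.)

**Dataset overview**
- **Purpose**: classify whether a breast tumor is malignant or benign
- **Number of samples**: 569
- **Number of features**: 30
- **Target**: 0 (malignant) or 1 (benign)

**Data composition**
1. **Radius mean**: mean distance from the center to points on the perimeter
2. **Texture mean**: standard deviation of the gray-scale values
3. **Perimeter mean**: mean perimeter
4. **Area mean**: mean area
5. **Smoothness mean**: local variation in radius lengths
6. **Compactness mean**: compactness (perimeter² / area − 1.0)
7. **Concavity mean**: severity of the concave portions of the contour
8. **Concave points mean**: number of concave portions of the contour
9. **Symmetry mean**: symmetry
10. **Fractal dimension mean**: fractal dimension ("coastline approximation" − 1)

Only the ten mean features are listed here. The dataset also stores the standard error and the "worst" (largest) value of each measurement, which gives 30 features in total. In scikit-learn the columns are named `mean radius`, `radius error`, `worst radius`, and so on.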
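
As a quick check of how the 30 features are organized, the following sketch groups scikit-learn's column names by statistic. The `groups` dictionary is a helper introduced only for this illustration.

```python
from sklearn.datasets import load_breast_cancer

feature_names = load_breast_cancer().feature_names

# Group the 30 column names by statistic. scikit-learn names them
# "mean <measurement>", "<measurement> error", and "worst <measurement>".
groups = {"mean": [], "error": [], "worst": []}
for name in feature_names:
    for key in groups:
        if name.startswith(key + " ") or name.endswith(" " + key):
            groups[key].append(str(name))
            break

for key, names in groups.items():
    print(f"{key}: {len(names)} features, e.g. {names[:3]}")
```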

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
data.keys() # ['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module']
data



In [6]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [9]:
df.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0
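
The `target` mean of about 0.627 in the summary above means that roughly 62.7% of the samples belong to class 1 (benign). A minimal sketch for checking the class distribution directly, assuming a scikit-learn version that supports `as_frame=True` (this loading style is an alternative to the manual `DataFrame` construction used in this notebook):

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns the data as pandas objects; .frame holds the
# 30 feature columns plus the 'target' column.
bunch = load_breast_cancer(as_frame=True)
frame = bunch.frame

counts = frame["target"].value_counts().sort_index()
labels = dict(enumerate(bunch.target_names))   # maps 0 and 1 to class names

for cls, n in counts.items():
    print(f"{cls} ({labels[cls]}): {n} samples, {n / len(frame):.1%}")
```

The classes are moderately imbalanced, which is worth keeping in mind when reading accuracy scores.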


In [10]:
# Prepare the data: separate the features (X) from the target (y)
X = df.drop('target', axis=1)
y = df['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
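
Because no size is passed, `train_test_split` falls back to its default of holding out 25% of the rows for testing. The sketch below confirms the resulting split sizes and, as an optional variant that the original cell does not use, shows `stratify=y`, which keeps the class ratio nearly equal in both parts.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same call as in the notebook: test_size defaults to 0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# Optional variant (an assumption, not part of the original run):
# stratify=y preserves the benign/malignant ratio in train and test.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=0, stratify=y
)
print(f"benign share: all={y.mean():.3f}, "
      f"train={y_tr_s.mean():.3f}, test={y_te_s.mean():.3f}")
```

With 143 test rows, each misclassified sample changes test accuracy by about 0.7 percentage points, so small accuracy gaps between models should be read with caution.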

In [None]:
# hard voting

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
# The ensemble combines a KNN classifier, a logistic regression model, and a decision tree
from sklearn.ensemble import VotingClassifier

# Uncomment to suppress warning messages (they would no longer be printed):
# import warnings
# warnings.filterwarnings('ignore')

knn_clf = KNeighborsClassifier()
lr_clf = LogisticRegression()
dt_clf = DecisionTreeClassifier(random_state=0)

voting_clf = VotingClassifier(
    estimators=[
        ('knn_clf', knn_clf),
        ('lr_clf', lr_clf),
        ('dt_clf', dt_clf)
    ],  # list of (name, estimator) tuples, the same format as Pipeline steps
    voting='hard'  # voting method; the default is 'hard'
)

# Train the ensemble model
voting_clf.fit(X_train, y_train)

# Evaluate the ensemble model (train accuracy, test accuracy)
print(voting_clf.score(X_train, y_train), voting_clf.score(X_test, y_test))
# 0.9647887323943662 0.951048951048951


0.9647887323943662 0.951048951048951


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=100).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
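
The warning above comes from `LogisticRegression`: its solver hit the iteration limit (`max_iter=100`) on the unscaled features. Following the warning's own suggestion, one possible remedy is to standardize the inputs inside a pipeline. The sketch below is an assumption-laden variant rather than the configuration used in this notebook: the `StandardScaler` pipelines and `max_iter=1000` are additions, so its scores will differ from the ones printed above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale only the models that are sensitive to feature scale (KNN, logistic
# regression). The decision tree is left as-is because tree splits do not
# depend on feature scale.
scaled_voting_clf = VotingClassifier(
    estimators=[
        ("knn_clf", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("lr_clf", make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000))),
        ("dt_clf", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
scaled_voting_clf.fit(X_train, y_train)

print(scaled_voting_clf.score(X_train, y_train),
      scaled_voting_clf.score(X_test, y_test))
```

Putting the scaler inside each pipeline means it is fitted on the training data only, which avoids leaking test-set statistics into the model.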


In [None]:
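# How hard voting works: the final label is decided by majority vote
start, end = 40, 50

voting_pred = voting_clf.predict(X_test[start:end])
print(f'Ensemble predictions: {voting_pred}')

# VotingClassifier fits clones of the estimators, so knn_clf, lr_clf, and
# dt_clf themselves are still unfitted and have to be fitted here.
for classifier in [knn_clf, lr_clf, dt_clf]:
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test[start:end])
    score = classifier.score(X_test, y_test)

    class_name = classifier.__class__.__name__
    print(f'{class_name} accuracy: {score}')
    print(f'{class_name} predictions: {pred}\n')

# For the first sample the decision tree predicts 1 while the other two models
# predict 0, so the majority vote (and the ensemble prediction) is 0.

Ensemble predictions: [0 1 0 1 0 0 1 1 1 0]
KNeighborsClassifier accuracy: 0.9370629370629371
KNeighborsClassifier predictions: [0 1 0 1 0 0 1 1 1 0]

LogisticRegression accuracy: 0.9440559440559441
LogisticRegression predictions: [0 1 0 1 0 0 1 1 1 0]

DecisionTreeClassifier accuracy: 0.8811188811188811
DecisionTreeClassifier predictions: [1 1 0 1 0 0 1 1 1 0]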
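
The majority rule can also be reproduced by hand for the whole test set. The sketch below reads the fitted clones from `estimators_`, so no refitting is needed. It is self-contained and sets `max_iter=10000` on the logistic regression (an assumption made here only to avoid the convergence warning), so individual predictions may differ slightly from the run above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

voting_clf = VotingClassifier(
    estimators=[
        ("knn_clf", KNeighborsClassifier()),
        ("lr_clf", LogisticRegression(max_iter=10000)),  # assumption: raised limit
        ("dt_clf", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
).fit(X_train, y_train)

# estimators_ holds the fitted clones that the ensemble actually uses.
individual = np.array([est.predict(X_test) for est in voting_clf.estimators_])

# With three binary classifiers, class 1 wins when at least two predict it.
manual = (individual.sum(axis=0) >= 2).astype(int)

print(np.array_equal(manual, voting_clf.predict(X_test)))
```

The `sum >= 2` shortcut only works for three binary classifiers with labels 0 and 1; a general version would count votes per class.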




In [None]:
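# soft voting
voting_clf = VotingClassifier(
    estimators=[
        ('knn_clf', knn_clf),
        ('lr_clf', lr_clf),
        ('dt_clf', dt_clf)
    ],  # list of (name, estimator) tuples, the same format as Pipeline steps
    voting='soft'
)

# Train the ensemble model
voting_clf.fit(X_train, y_train)

# Evaluate the ensemble model (train accuracy, test accuracy)
print(voting_clf.score(X_train, y_train), voting_clf.score(X_test, y_test))
# 0.9859154929577465 0.9370629370629371
# Test accuracy is lower than with hard voting (0.937 vs. 0.951), while training
# accuracy is higher. On this particular split hard voting generalizes better,
# but a single split is not enough to conclude that it is better in general.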

0.9859154929577465 0.9370629370629371
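
The gap between hard and soft voting on a 143-row test set amounts to only two samples, so a comparison over several splits is more informative. A minimal sketch using 5-fold cross-validation follows. The helper `make_voting` and the `max_iter=10000` setting are assumptions introduced for this example, and no particular outcome is implied.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)


def make_voting(voting):
    """Build the same three-model ensemble with the given voting rule."""
    return VotingClassifier(
        estimators=[
            ("knn_clf", KNeighborsClassifier()),
            ("lr_clf", LogisticRegression(max_iter=10000)),  # assumption
            ("dt_clf", DecisionTreeClassifier(random_state=0)),
        ],
        voting=voting,
    )


results = {}
for voting in ("hard", "soft"):
    # For classifiers, cv=5 uses stratified folds.
    results[voting] = cross_val_score(make_voting(voting), X, y, cv=5)
    print(f"{voting}: mean={results[voting].mean():.4f}, "
          f"std={results[voting].std():.4f}")
```

If the two means differ by less than about one standard deviation, the data do not clearly favor either voting rule.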




In [None]:
# How soft voting works: the predicted probabilities of the models are averaged
start, end = 40, 50

voting_pred = voting_clf.predict_proba(X_test[start:end])
# predict_proba returns the class probabilities computed by the model
print(f'Ensemble predicted probabilities: {voting_pred}')

averages = np.zeros_like(voting_pred)
# Zero array with the same shape as voting_pred, used to accumulate probabilities

for classifier in [knn_clf, lr_clf, dt_clf]:
    classifier.fit(X_train, y_train)
    pred = classifier.predict_proba(X_test[start:end])
    score = classifier.score(X_test, y_test)

    averages += pred

    class_name = classifier.__class__.__name__
    print(f'{class_name} accuracy: {score}')
    print(f'{class_name} predicted probabilities: {pred}\n')

print('Mean of the individual predicted probabilities:', averages/3)
print(np.allclose(voting_pred, averages/3))
# Checks whether voting_pred and averages/3 match. np.allclose compares with a
# small tolerance, which is safer for floats than exact equality (np.array_equal).
# The result is True: as the theory says, soft voting's predict_proba is the
# mean of the individual models' predicted probabilities.

Ensemble predicted probabilities: [[5.70263157e-01 4.29736843e-01]
 [1.08113730e-03 9.98918863e-01]
 [9.99622506e-01 3.77494355e-04]
 [3.35757426e-04 9.99664243e-01]
 [9.00993416e-01 9.90065841e-02]
 [1.00000000e+00 1.75163138e-13]
 [7.79971341e-05 9.99922003e-01]
 [1.83004552e-02 9.81699545e-01]
 [1.14568790e-03 9.98854312e-01]
 [9.32982089e-01 6.70179112e-02]]
KNeighborsClassifier accuracy: 0.9370629370629371
KNeighborsClassifier predicted probabilities: [[0.8 0.2]
 [0.  1. ]
 [1.  0. ]
 [0.  1. ]
 [0.8 0.2]
 [1.  0. ]
 [0.  1. ]
 [0.  1. ]
 [0.  1. ]
 [0.8 0.2]]

LogisticRegression accuracy: 0.9440559440559441
LogisticRegression predicted probabilities: [[9.10789471e-01 8.92105287e-02]
 [3.24341189e-03 9.96756588e-01]
 [9.98867517e-01 1.13248306e-03]
 [1.00727228e-03 9.98992728e-01]
 [9.02980248e-01 9.70197522e-02]
 [1.00000000e+00 5.25489414e-13]
 [2.33991402e-04 9.99766009e-01]
 [5.49013655e-02 9.45098634e-01]
 [3.43706371e-03 9.96562936e-01]
 [9.98946266e-01 1.05373359e-03]]

DecisionTreeClassifier accuracy: 0.8811188811188811
DecisionTreeClassifier
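
The averaged probabilities also determine the predicted label: soft voting returns the class with the largest mean probability. The sketch below checks both steps on the full test set using the fitted clones in `estimators_`. It is self-contained and uses `max_iter=10000` for the logistic regression (an assumption made to avoid the convergence warning), so the probabilities will not match the printed ones exactly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

voting_clf = VotingClassifier(
    estimators=[
        ("knn_clf", KNeighborsClassifier()),
        ("lr_clf", LogisticRegression(max_iter=10000)),  # assumption: raised limit
        ("dt_clf", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
).fit(X_train, y_train)

# Step 1: average the class probabilities of the three fitted models.
mean_proba = np.mean(
    [est.predict_proba(X_test) for est in voting_clf.estimators_], axis=0
)

# Step 2: pick the class with the highest mean probability for each sample.
manual_labels = voting_clf.classes_[mean_proba.argmax(axis=1)]

print(np.allclose(mean_proba, voting_clf.predict_proba(X_test)))
print(np.array_equal(manual_labels, voting_clf.predict(X_test)))
```

Note that an unpruned decision tree typically outputs probabilities of exactly 0 or 1, so it can dominate the average whenever the other two models are less certain.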

