<span style="color:grey"> By Seongchan Kang </span>

<span style="color:grey"> Version : Python 3.10.1 in Window </span>

- 출처 1 : (교재) 머신러닝 교과서 with 파이썬, 사이킷런, 텐서플로 
- 출처 2 : (교재) 비즈니스 애널리틱스를 위한 데이터 마이닝 R
- 출처 3 : (URL) <span> https://hyemin-kim.github.io/2020/08/04/S-Python-sklearn4/#4-%EC%95%99%EC%83%81%EB%B8%94-ensemble-%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98 </span>

# 앙상블(Ensemble) 학습이란

여러 분류기를 하나의 분류기로 연결하여 개별 분류기보다 더 좋은 성능을 만드는 것을 목표로 하는 학습

결국 더 훌륭한 모델을 만들기 위한 방법

앙상블의 방법을 여러가지 존재, 아래에서 계속 소개


In [29]:
# 공통적으로 필요한 라이브러리
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [30]:
# 파일 불러오기
data = pd.read_csv("titanic_df.csv")
data = data.drop(labels = "Unnamed: 0", axis = 1)
data

Unnamed: 0,Sex,Embarked,ToH,Survived,Pclass,Age,Fare,Family
0,0,0,2,0,3,22,7.2500,0
1,1,1,3,1,1,38,71.2833,0
2,1,0,1,1,3,26,7.9250,0
3,1,0,3,1,1,35,53.1000,0
4,0,0,2,0,3,35,8.0500,0
...,...,...,...,...,...,...,...,...
886,0,0,4,0,2,27,13.0000,0
887,1,0,1,1,1,19,30.0000,0
888,1,0,1,0,3,21,23.4500,1
889,0,1,2,1,1,26,30.0000,0


In [31]:
# X, Y로 변수를 나누기(독립괴 종속 변수)
X = data[['Sex', 'Embarked', 'ToH', 'Pclass', 'Age', 'Fare', 'Family']]
Y = data['Survived']

# 훈련과 테스트로 변수 다시 한번더 나누기
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)

## 투표(Voting)

- 투표를 통해 최종 결과를 결정하는 방식
- 동일한 훈련 세트
- 여러가지 알고리즘을 사용
- 샘플을 뽑을 때 중복은 없음
- 싱글 모델은 튜플 형태로 정의
- 우선적으로 여러 모델을 쓰기 때문 여러 모델을 모델링하는 과정이 필요

### 투표(회귀) 실습

In [32]:
# 라이브러리 불러오기
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

from sklearn.ensemble import VotingRegressor

In [33]:
# LinearRegression #
linear_reg = LinearRegression(n_jobs=-1)
linear_reg.fit(X_train, Y_train)

print("모델 < LinearRegression >")
print("Train Set Score1 : {}".format(linear_reg.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(linear_reg.score(X_test, Y_test)))


# Ridge #
ridge = Ridge(alpha=1)
ridge.fit(X_train, Y_train)

print("\n모델 < Ridge >")
print("Train Set Score1 : {}".format(ridge.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(ridge.score(X_test, Y_test)))


# Lasso #
lasso = Lasso(alpha=0.01)
lasso.fit(X_train, Y_train)

print("\n모델 < Lasso >")
print("Train Set Score1 : {}".format(lasso.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(lasso.score(X_test, Y_test)))


# Elasticnet #
elasticnet = ElasticNet(alpha=0.5, l1_ratio=0.2)
elasticnet.fit(X_train, Y_train)

print("\n모델 < Elasticnet >")
print("Train Set Score1 : {}".format(elasticnet.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(elasticnet.score(X_test, Y_test)))


# With Standard Scaling #
standard_elasticnet = make_pipeline(
    StandardScaler(),
    ElasticNet(alpha=0.5, l1_ratio=0.2)
)

standard_elasticnet.fit(X_train, Y_train).predict(X_test)

print("\n모델 < Standard Scaling >")
print("Train Set Score1 : {}".format(standard_elasticnet.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(standard_elasticnet.score(X_test, Y_test)))


# 2-Degree Polynomial Features + Standard Scaling #
poly_elasticnet = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    ElasticNet(alpha=0.5, l1_ratio=0.2)
)

poly_elasticnet.fit(X_train, Y_train).predict(X_test)

print("\n모델 < Polynomial Features >")
print("Train Set Score1 : {}".format(standard_elasticnet.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(standard_elasticnet.score(X_test, Y_test)))

모델 < LinearRegression >
Train Set Score1 : 0.39442823557612416
Test  Set Score1 : 0.48848106411124204

모델 < Ridge >
Train Set Score1 : 0.394413017750895
Test  Set Score1 : 0.4880800001885599

모델 < Lasso >
Train Set Score1 : 0.3892011578307425
Test  Set Score1 : 0.47799584514456417

모델 < Elasticnet >
Train Set Score1 : 0.07864169600934356
Test  Set Score1 : 0.11638262260730603

모델 < Standard Scaling >
Train Set Score1 : 0.21730455249268854
Test  Set Score1 : 0.2504092252282194

모델 < Polynomial Features >
Train Set Score1 : 0.21730455249268854
Test  Set Score1 : 0.2504092252282194


In [34]:
# 보팅에 참여한 single models 지정
single_models = [
    ('linear_reg', linear_reg),
    ('ridge', ridge),
    ('lasso', lasso),
    ('elasticnet', elasticnet),
    ('standard_elasticnet', standard_elasticnet),
    ('poly_elasticnet', poly_elasticnet)
]

In [35]:
# voting regressor 만들기
voting_regressor = VotingRegressor(single_models, n_jobs=-1)

# 적합 = 학습
voting_regressor.fit(X_train, Y_train)

In [36]:
# 테스트 #
print("모델 < Ensemble_Voting >")
print("Train Set Score1 : {}".format(voting_regressor.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(voting_regressor.score(X_test, Y_test)))

## 단일의 모델보다는 결과가 좋긴함
## 하지만 몇몇의 단일 모델이 성능이 더 좋아보임 

모델 < Ensemble_Voting >
Train Set Score1 : 0.34097091704558624
Test  Set Score1 : 0.40500099123292677


### 투표(분류) 실습

In [37]:
# 라이브러리 불러오기
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

In [38]:
# 함수 선언 #
def min_max_normalize(lst):
    normalized = []
    
    for value in lst:
        normalized_num = (value - min(lst)) / (max(lst) - min(lst))
        normalized.append(normalized_num)
    
    return normalized

def z_score_normalize(lst):
    normalized = []
    for value in lst:
        normalized_num = (value - np.mean(lst)) / np.std(lst)
        normalized.append(normalized_num)
    return normalized

In [39]:
# 정규화
data["Age"] = min_max_normalize(data["Age"])
data["Fare"] = z_score_normalize(data["Fare"])

# X, Y로 변수를 나누기(독립괴 종속 변수)
X = data[['Sex', 'Embarked', 'ToH', 'Pclass', 'Age', 'Fare', 'Family']]
Y = data['Survived']

# 훈련과 테스트로 변수 다시 한번더 나누기
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)

In [40]:
# 모델링 및 적합 #

# LogisticRegression
lr_model = LogisticRegression(random_state = 5)
lr_model.fit(X_train, Y_train)

print("모델 < LogisticRegression >")
print("Train Set Score1 : {}".format(lr_model.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(lr_model.score(X_test, Y_test)))


# DecisionTreeClassifier
dTree = DecisionTreeClassifier(random_state = 5, max_depth = 3, min_samples_split = 8)
dTree.fit(X_train, Y_train)

print("\n모델 < DecisionTreeClassifier >")
print("Train Set Score1 : {}".format(dTree.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(dTree.score(X_test, Y_test)))


# KNeighborsClassifier
## 정규화를 어떻게 적용시키는게 좋을지 고민이 들어
## 잘못 건들기보다는 정규화를 안하는 방법을 선택
knn_model = KNeighborsClassifier(n_neighbors = 10, metric = 'euclidean')
knn_model.fit(X_train, Y_train)

print("\n모델 < KNeighborsClassifier >")
print("Train Set Score1 : {}".format(knn_model.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(knn_model.score(X_test, Y_test)))

모델 < LogisticRegression >
Train Set Score1 : 0.826645264847512
Test  Set Score1 : 0.8097014925373134

모델 < DecisionTreeClassifier >
Train Set Score1 : 0.8250401284109149
Test  Set Score1 : 0.8208955223880597

모델 < KNeighborsClassifier >
Train Set Score1 : 0.8426966292134831
Test  Set Score1 : 0.8097014925373134


In [None]:
models = [
    ('Logit', lr_model),
    ('DecisionTree', dTree),
    ('KNN', knn_model) 
]

# 모델링
vc = VotingClassifier(models, voting='soft')

# 적합 = 학습
vc.fit(X_train, Y_train)

In [25]:
# 테스트 #
print("모델 < VotingClassifier >")
print("Train Set Score1 : {}".format(vc.score(X_train, Y_train)))
print("Test  Set Score1 : {}".format(vc.score(X_test, Y_test)))


모델 < VotingClassifier >
Train Set Score1 : 0.85553772070626
Test  Set Score1 : 0.8097014925373134
