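# Overview
- Mitigate the underfitting of the models built in step 1.
---

* Hypothesis 1. Most columns have only two unique values (yes or no), so the data gives the models few distinct choices to learn from, which likely contributes to the underfitting.
* Hypothesis 2. Regularizing as much as possible on data that was already low in complexity likely made the underfitting worse.
---

* Countermeasure 1. Redo the preprocessing so that the data keeps as much variety as possible.
* Countermeasure 2. Tune the hyperparameters of the models that scored comparatively well in step 1.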
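
Hypothesis 1 can be checked directly by counting distinct values per column. The sketch below is a minimal illustration, not part of the original notebook: the helper name `summarize_cardinality` is hypothetical, and the small sample frame copies a few values from the first rows of the dataset so the snippet runs on its own.

```python
import pandas as pd


def summarize_cardinality(df):
    """Return the number of distinct values per column, sorted ascending."""
    return df.nunique().sort_values()


# Tiny stand-in for df_heart (values taken from the first four rows shown below)
sample = pd.DataFrame({
    "Smoking": ["Yes", "No", "Yes", "No"],
    "GenHealth": ["Very good", "Very good", "Fair", "Good"],
    "BMI": [16.60, 20.34, 26.58, 24.21],
})

cardinality = summarize_cardinality(sample)
binary_share = (cardinality == 2).mean()  # fraction of columns with exactly 2 values

print(cardinality)
print(f"share of binary columns: {binary_share:.2f}")
```

Running the same helper on the full `df_heart` would show how many of the columns are binary in practice.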

In [1]:
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df_heart = pd.read_csv('../data/heart_2020_cleaned.csv')
df_heart

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.60,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,Yes,27.41,Yes,No,No,7.0,0.0,Yes,Male,60-64,Hispanic,Yes,No,Fair,6.0,Yes,No,No
319791,No,29.84,Yes,No,No,0.0,0.0,No,Male,35-39,Hispanic,No,Yes,Very good,5.0,Yes,No,No
319792,No,24.24,No,No,No,0.0,0.0,No,Female,45-49,Hispanic,No,Yes,Good,6.0,No,No,No
319793,No,32.81,No,No,No,0.0,0.0,No,Female,25-29,Hispanic,No,No,Good,12.0,No,No,No


# Data preprocessing

- The outlier handling done earlier was judged to have reduced the variety in the data, so SleepTime, BMI, and Diabetic are re-examined here to keep as many distinct values as possible.

In [5]:
df_heart.describe()

Unnamed: 0,BMI,PhysicalHealth,MentalHealth,SleepTime
count,319795.0,319795.0,319795.0,319795.0
mean,28.325399,3.37171,3.898366,7.097075
std,6.3561,7.95085,7.955235,1.436007
min,12.02,0.0,0.0,1.0
25%,24.03,0.0,0.0,6.0
50%,27.34,0.0,0.0,7.0
75%,31.42,2.0,3.0,8.0
max,94.85,30.0,30.0,24.0


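- SleepTime and Diabetic were previously treated as categorical and merged into coarser groups, on the assumption that each category carries its own weight. To preserve complexity, this step uses the original fine-grained values instead.

- BMI
- Inspect the 10 rows with the highest BMI and the 10 rows with the lowest BMI, which the previous step had flagged as potentially problematic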

In [8]:
df_heart.sort_values(by='BMI', ascending=False).head(10)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
126896,No,94.85,No,No,No,0.0,0.0,No,Male,35-39,White,No,Yes,Excellent,7.0,No,No,No
242834,No,94.66,No,No,No,4.0,0.0,No,Female,50-54,White,No,No,Very good,6.0,No,No,No
104267,No,93.97,Yes,No,No,20.0,25.0,Yes,Female,50-54,White,No,No,Poor,6.0,No,No,No
249715,No,93.86,Yes,Yes,No,30.0,30.0,Yes,Female,65-69,Other,Yes,No,Poor,4.0,Yes,Yes,No
156093,No,92.53,Yes,No,No,7.0,0.0,Yes,Female,65-69,Black,Yes,Yes,Poor,8.0,Yes,No,No
126661,No,91.82,No,No,No,0.0,2.0,No,Female,65-69,Black,No,Yes,Very good,5.0,No,No,No
105476,No,91.55,Yes,No,No,0.0,0.0,No,Male,40-44,Other,No,Yes,Excellent,5.0,No,No,No
114087,No,91.55,No,No,No,0.0,10.0,Yes,Female,55-59,Other,No,No,Excellent,2.0,No,No,No
229007,No,88.6,No,No,No,30.0,0.0,Yes,Male,55-59,White,No,Yes,Fair,5.0,No,No,No
290183,No,88.19,No,No,No,0.0,0.0,Yes,Male,80 or older,White,No,No,Poor,8.0,No,Yes,No


- Apart from the extreme BMI values, nothing unusual stands out in the other columns of these rows

In [10]:
df_heart.sort_values(by='BMI', ascending=True).head(10)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
69662,No,12.02,Yes,No,No,0.0,30.0,No,Male,18-24,White,No,Yes,Good,8.0,No,No,No
205511,No,12.02,No,No,No,30.0,30.0,No,Female,55-59,Black,No,No,Poor,6.0,No,No,No
113373,No,12.08,Yes,No,No,0.0,0.0,Yes,Male,30-34,White,No,Yes,Good,8.0,No,No,No
51637,No,12.13,No,No,No,0.0,0.0,No,Male,60-64,White,No,Yes,Excellent,7.0,No,No,No
81754,No,12.16,No,No,No,0.0,0.0,No,Male,35-39,Black,No,Yes,Very good,8.0,No,No,No
77250,No,12.2,Yes,Yes,No,0.0,1.0,No,Male,75-79,White,No,Yes,Very good,7.0,No,No,Yes
76270,Yes,12.21,No,No,No,0.0,0.0,No,Male,60-64,White,No,Yes,Good,7.0,No,No,Yes
112289,No,12.26,No,No,No,0.0,0.0,Yes,Female,50-54,White,No,Yes,Good,9.0,Yes,No,No
253451,No,12.27,No,No,No,0.0,0.0,No,Female,50-54,White,No,Yes,Very good,8.0,No,No,No
210628,No,12.4,Yes,No,No,20.0,20.0,Yes,Female,70-74,White,No,No,Poor,10.0,No,No,No


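- The same holds for the lowest-BMI rows
Therefore, to preserve the variety in the data, these rows are not treated as outliers, and the data is passed to the models unchanged as an experiment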
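
For reference, the quartiles from the `describe()` output above can be turned into the conventional 1.5 × IQR fences. This is only a worked-arithmetic sketch: the 1.5 × IQR rule is a common convention assumed here, not a criterion the notebook itself applies.

```python
# BMI statistics copied from the describe() output above
bmi_q1, bmi_q3 = 24.03, 31.42
bmi_min, bmi_max = 12.02, 94.85

iqr = bmi_q3 - bmi_q1
lower_fence = bmi_q1 - 1.5 * iqr  # values below this are flagged by the rule
upper_fence = bmi_q3 + 1.5 * iqr  # values above this are flagged by the rule

print(f"IQR = {iqr:.2f}, fences = [{lower_fence:.3f}, {upper_fence:.3f}]")
print("min below lower fence:", bmi_min < lower_fence)
print("max above upper fence:", bmi_max > upper_fence)
```

Under that convention both extremes fall outside the fences, so keeping them is a deliberate choice made to retain variety rather than a sign that the rule found nothing.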

In [12]:
#### Label-encode the categorical columns
data_column_list = ['HeartDisease','Smoking','AlcoholDrinking','Stroke','DiffWalking','Sex','AgeCategory','Race','Diabetic','PhysicalActivity','GenHealth','Asthma','KidneyDisease','SkinCancer']
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [13]:
for i in data_column_list:
    df_heart[i] = label_encoder.fit_transform(df_heart[i])
df_heart

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,0,16.60,1,0,0,3.0,30.0,0,0,7,5,2,1,4,5.0,1,0,1
1,0,20.34,0,0,1,0.0,0.0,0,0,12,5,0,1,4,7.0,0,0,0
2,0,26.58,1,0,0,20.0,30.0,0,1,9,5,2,1,1,8.0,1,0,0
3,0,24.21,0,0,0,0.0,0.0,0,0,11,5,0,0,2,6.0,0,0,1
4,0,23.71,0,0,0,28.0,0.0,1,0,4,5,0,1,4,8.0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
319790,1,27.41,1,0,0,7.0,0.0,1,1,8,3,2,0,1,6.0,1,0,0
319791,0,29.84,1,0,0,0.0,0.0,0,1,3,3,0,1,4,5.0,1,0,0
319792,0,24.24,0,0,0,0.0,0.0,0,0,5,3,0,1,2,6.0,0,0,0
319793,0,32.81,0,0,0,0.0,0.0,0,0,1,3,0,0,2,12.0,0,0,0
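
One detail visible in the encoded table: `LabelEncoder` assigns codes in sorted label order, so `GenHealth` becomes Fair=1, Good=2, Very good=4 rather than following the health scale. The sketch below contrasts that with an explicit ordering. The list `health_order` is an assumption introduced for illustration, and whether an ordered encoding actually helps these models is untested here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

gen_health = pd.Series(["Very good", "Fair", "Good", "Poor", "Excellent"])

# LabelEncoder sorts the labels, so the codes follow alphabetical order
encoder = LabelEncoder().fit(gen_health)
alphabetical = {label: code for code, label in enumerate(encoder.classes_.tolist())}
print(alphabetical)

# Hypothetical explicit ordering from worst to best self-reported health
health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]
ordinal_codes = pd.Categorical(
    gen_health, categories=health_order, ordered=True
).codes
print(dict(zip(gen_health, ordinal_codes.tolist())))
```

The same idea would apply to `AgeCategory`, whose labels happen to sort correctly as strings in this dataset, and matters more for GaussianNB, which treats the codes as numeric values.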


# Model training
- GaussianNB
- DecisionTreeClassifier

In [14]:
# Split target and features
target = df_heart['HeartDisease']
df_features = df_heart.copy()
features = df_features.drop(columns='HeartDisease')

- The previous step showed that the target classes are imbalanced
- In this test, oversampling is applied before the data is split into train and test sets

In [16]:
# oversampling
from imblearn.over_sampling import SMOTE
overSampling = SMOTE(sampling_strategy=0.8)
feature_oversample, target_oversample =  overSampling.fit_resample(features,target)
feature_oversample.shape, target_oversample.shape

((526359, 17), (526359,))
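
The reported shape also lets us estimate how much of class 1 is synthetic. The arithmetic below is a sketch under one assumption: that SMOTE with `sampling_strategy=0.8` raises the minority class to `int(0.8 * majority)` rows. The class counts are derived from the row totals in the outputs, not read from the notebook.

```python
# Row counts reported in the notebook outputs
rows_before = 319_795   # rows in df_heart
rows_after = 526_359    # rows after SMOTE
ratio = 0.8             # sampling_strategy

# Assumption: minority is raised to int(ratio * majority) rows.
# Find the majority count that reproduces the reported total.
majority = next(
    n for n in range(rows_before // 2, rows_before + 1)
    if n + int(ratio * n) == rows_after
)
minority_before = rows_before - majority
minority_after = rows_after - majority
synthetic = rows_after - rows_before
synthetic_share = synthetic / minority_after

print(majority, minority_before, minority_after, synthetic)
print(f"synthetic share of class 1: {synthetic_share:.1%}")
```

If the assumption holds, most class 1 rows after resampling are interpolated points rather than original records, which is worth keeping in mind when reading the scores that follow.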

In [17]:
# Split into train and test sets
from sklearn.model_selection import train_test_split
features_train, features_test, target_train, target_test = train_test_split(feature_oversample, target_oversample, test_size=0.3, random_state=42)
features_train.shape, features_test.shape, target_train.shape, target_test.shape

((368451, 17), (157908, 17), (368451,), (157908,))

In [18]:
from sklearn.tree import DecisionTreeClassifier
decisionTreeClassifier = DecisionTreeClassifier()

In [19]:
from sklearn.naive_bayes import GaussianNB
gaussianNB = GaussianNB()

In [20]:
decisionTreeClassifier.fit(features_train,target_train)

In [21]:
gaussianNB.fit(features_train,target_train)

# Initial model evaluation

In [22]:
decision_test_prdict = decisionTreeClassifier.predict(features_test)
gaussian_test_prdict = gaussianNB.predict(features_test)

In [23]:
from sklearn.metrics import classification_report
print('decision tree model')
print(classification_report(target_test,decision_test_prdict))

decision tree model
              precision    recall  f1-score   support

           0       0.88      0.86      0.87     87604
           1       0.83      0.85      0.84     70304

    accuracy                           0.86    157908
   macro avg       0.86      0.86      0.86    157908
weighted avg       0.86      0.86      0.86    157908



In [24]:
from sklearn.metrics import classification_report
print('gaussian NB model')
print(classification_report(target_test,gaussian_test_prdict))

gaussian NB model
              precision    recall  f1-score   support

           0       0.73      0.74      0.73     87604
           1       0.67      0.66      0.66     70304

    accuracy                           0.70    157908
   macro avg       0.70      0.70      0.70    157908
weighted avg       0.70      0.70      0.70    157908



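# Second conclusion

- The F1 score rose sharply compared with step 1
- However, the gain most likely comes from the point at which oversampling was applied rather than from the preprocessing change. SMOTE ran before the split, so the test set contains synthetic samples interpolated from points that may sit in the training set. The preprocessing change affected only a small number of rows, so its real effect should be small
- Step 3 will retrain using the step 1 preprocessing again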