<a href="https://colab.research.google.com/github/nywook/AIB09_Discussion/blob/master/n224a_model_selection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" src="https://ds-cs-images.s3.ap-northeast-2.amazonaws.com/Codestates_Fulllogo_Color.png" width=100>

## *AIB / SECTION 2 / SPRINT 2 / NOTE 4*

# 📝 Assignment
---

# Model Selection

### 1) Continue the Kaggle competition. Tune hyperparameters with RandomizedSearchCV.

- Use [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
- Use a [scoring parameter](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) metric that is appropriate for a classification problem.
- Using [OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html) is recommended.
- Tune hyperparameters with RandomizedSearchCV, predict with the best-performing model, and submit the predictions to Kaggle.
- **(Urclass Quiz) Submit your improved score from the Kaggle Leaderboard in the assignment submission form.**
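
As a point of reference for the scoring requirement above, here is a minimal sketch of a randomized search that uses a classification scorer. The synthetic data from `make_classification`, the `demo_` names, and the parameter ranges are all assumptions for illustration; they are not the competition data or the settings used in this notebook.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the competition data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

demo_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 60),     # integers in [20, 59]
        "max_depth": [3, 5, None],
        "max_features": uniform(0.1, 0.9),   # floats in [0.1, 1.0]
    },
    n_iter=5,
    cv=3,
    scoring="f1",      # classification scorer for the positive class
    random_state=0,    # makes the sampled candidates reproducible
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print("Cross-validated F1:", demo_search.best_score_)
```

With `scoring="f1"`, `best_score_` is the mean cross-validated F1 of the selected candidate, so it can be compared directly with `f1_score` on a validation set.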

In [None]:
!pip install category_encoders

In [93]:
import pandas as pd

target = 'vacc_h1n1_f'
# target = 'vacc_seas_f'
train = pd.merge(pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/vacc_flu/train.csv'), 
                 pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/vacc_flu/train_labels.csv')[target], left_index=True, right_index=True)
test = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/vacc_flu/test.csv')

In [94]:
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train[target], random_state=2)

In [95]:
import numpy as np

def engineer(df):
    """Engineer features for one data split."""
    df = df.copy()  # work on a copy so the caller's frame is not mutated

    # Create a new feature: the sum of all behavioral columns.
    behaviorals = [col for col in df.columns if 'behavioral' in col]
    df['behaviorals'] = df[behaviorals].sum(axis=1)

    # Remove columns.
    df.drop(columns=[
        'doctor_recc_seasonal', 'education_comp',
        'opinion_seas_vacc_effective', 'opinion_seas_risk',
        'opinion_seas_sick_from_vacc', 'raceeth4_i', 'sex_i',
        'rent_own_r', 'census_region', 'census_msa', 'n_adult_r', 'state',
    ], inplace=True)

    # Drop columns with more than 40% missing values.
    dels = df.columns[df.isnull().mean() > 0.4]
    df.drop(columns=dels, inplace=True)

    '''
    # Missing-value handling (disabled)

    df['opinion_h1n1_risk'] = df['opinion_h1n1_risk'].fillna('No')
    df['opinion_h1n1_vacc_effective'] = df['opinion_h1n1_vacc_effective'].fillna('No')
    df['opinion_h1n1_sick_from_vacc'] = df['opinion_h1n1_sick_from_vacc'].fillna('No')
    df['employment_status'] = df['employment_status'].fillna('Unemployed')

    # Fill numeric columns with the mean

    df['hhs_region'] = df['hhs_region'].fillna(df['hhs_region'].mean())
    df['n_people_r'] = df['n_people_r'].fillna(df['n_people_r'].mean())
    df['behaviorals'] = df['behaviorals'].fillna(df['behaviorals'].mean())
    '''
    return df

In [96]:
train = engineer(train)
val = engineer(val)
test = engineer(test)
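
Because `engineer` measures the missing-value ratio on whichever frame it receives, each split could in principle drop a different set of columns. One way to guard against that, sketched here on toy frames, is to decide the columns on the training split and apply the same list everywhere. The helper `high_missing_columns` and the `toy_` frames are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def high_missing_columns(df, threshold=0.4):
    """Return the columns whose share of missing values exceeds `threshold`."""
    return df.columns[df.isnull().mean() > threshold].tolist()

# Toy frames (assumption): column 'b' is mostly missing in train only.
toy_train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [np.nan, np.nan, np.nan, 1.0]})
toy_val = pd.DataFrame({"a": [5, 6], "b": [2.0, 3.0]})

# Decide on the training split, then apply the same decision to every split.
to_drop = high_missing_columns(toy_train)
toy_train_clean = toy_train.drop(columns=to_drop)
toy_val_clean = toy_val.drop(columns=to_drop)

print(to_drop)
print(list(toy_train_clean.columns) == list(toy_val_clean.columns))
```

Had the threshold been evaluated on `toy_val` alone, column `b` would have been kept there, and the two splits would no longer share the same features.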

In [97]:
features = train.drop(columns=[target]).columns

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

In [98]:
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score
from category_encoders import TargetEncoder
from sklearn.ensemble import RandomForestClassifier
from category_encoders import OrdinalEncoder

In [99]:
pipe = make_pipeline(
    TargetEncoder(min_samples_leaf=1, smoothing=10), 
    SimpleImputer(strategy= 'most_frequent'), 
    RandomForestClassifier(max_depth=10, random_state=2, criterion='entropy')
)
pipe.fit(X_train, y_train)
print('Training accuracy: ', pipe.score(X_train, y_train))
print('Validation accuracy: ', pipe.score(X_val, y_val))
y_pred = pipe.predict(X_val)

print('Validation F1 score: ', f1_score(y_val, y_pred))

Training accuracy:  0.832784746315571
Validation accuracy:  0.8166291068675128
Validation F1 score:  0.5404280618311533


In [100]:
# Search for the best hyperparameter values

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

pipe = make_pipeline(
    TargetEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(random_state=2)
)

dists = {
    'targetencoder__smoothing': [2.,20.,100.,500.,1000.], # passing ints raises an error (bug)
    'targetencoder__min_samples_leaf': randint(1, 10),     
    'simpleimputer__strategy': ['mean', 'median','most_frequent'], 
    'randomforestclassifier__n_estimators': randint(50, 500), 
    'randomforestclassifier__max_depth': [5, 10, None], 
    'randomforestclassifier__max_features': uniform(0, 1) # fraction of features considered at each split
}

clf = RandomizedSearchCV(
    pipe, 
    param_distributions=dists, 
    n_iter=50, 
    cv=3, 
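    # With 0/1 labels, MAE equals 1 - accuracy, so this ranks candidates by accuracy;
    # a classification scorer such as 'accuracy' or 'f1' states that intent directly.
    scoring='neg_mean_absolute_error',  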
    verbose=1,
    n_jobs=-1
)

clf.fit(X_train, y_train);

Fitting 3 folds for each of 50 candidates, totalling 150 fits


In [101]:
print('Best estimator: ', clf.best_estimator_)
print('MAE : ', -clf.best_score_)

Best estimator:  Pipeline(steps=[('targetencoder',
                 TargetEncoder(cols=['opinion_h1n1_vacc_effective',
                                     'opinion_h1n1_risk',
                                     'opinion_h1n1_sick_from_vacc', 'agegrp',
                                     'employment_status'],
                               min_samples_leaf=4, smoothing=1000.0)),
                ('simpleimputer', SimpleImputer()),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=10,
                                        max_features=0.39831737031932235,
                                        n_estimators=318, random_state=2))])
MAE :  0.1862823592207099
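
The search above scores candidates with `neg_mean_absolute_error`. For labels and predictions that are both 0 or 1, every absolute error is either 0 or 1, so the MAE is just the misclassification rate. The small check below uses made-up arrays (an assumption, not data from this notebook) to show that relationship.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up binary labels and predictions, for illustration only.
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])

mae = mean_absolute_error(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)

# Two of the eight predictions are wrong, so MAE = 0.25 and accuracy = 0.75.
print(mae, acc, mae == 1 - acc)
```

Read this way, the MAE of about 0.1863 reported above corresponds to a cross-validated accuracy of about 0.8137. It says nothing about F1, which is the metric printed for the validation set elsewhere in this notebook.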


In [102]:
pd.DataFrame(clf.cv_results_).sort_values(by = 'rank_test_score').T

candidate,11,4,5,24,48,33,3,19,17,10,34,0,44,39,2,36,1,45,40,46,37,22,49,13,38,12,9,7,28,14,43,6,20,29,21,30,42,47,31,8,32,41,35,23,26,15,18,16,25,27
mean_fit_time,11.2971,5.24507,13.8965,8.71046,19.4765,14.0745,14.4807,13.602,5.76495,26.8002,19.4065,29.3475,1.46513,11.8278,29.2859,7.56676,6.7858,6.24185,12.5231,6.31448,3.43499,16.807,3.67019,2.7244,5.81912,7.72401,32.1983,6.35252,4.05068,3.66725,6.21601,14.7577,21.7961,25.5055,7.8895,31.3998,24.0021,21.1019,33.2471,25.3287,5.5829,30.7173,17.9405,29.7862,27.9798,5.46934,1.40769,1.27496,1.14517,3.39404
std_fit_time,0.0539014,0.0116061,0.115017,0.0187506,0.0535815,0.0735172,0.240972,0.0355605,0.0304602,0.0702849,0.112489,0.248895,0.0169821,0.0387221,0.253774,0.0509052,0.0157963,0.045521,0.070112,0.0423373,0.0460019,0.0754793,0.586838,0.0281757,0.027831,0.024182,0.0588378,0.032155,0.0218932,0.0233105,0.0161236,0.0281759,0.100511,0.0811185,0.0487567,0.00886562,0.162046,0.108751,0.116602,0.176401,0.0433458,0.0299308,0.0935906,0.068057,0.402144,0.0260753,0.00518474,0.0139787,0.00906732,0.0527309
mean_score_time,0.787803,0.330951,0.79412,0.958622,0.778348,0.927557,1.13341,0.783984,0.358265,1.18301,0.733028,1.13512,0.105639,0.652332,0.927011,0.42849,0.269863,0.686161,0.473273,0.707428,0.221916,0.879422,0.188645,0.146288,0.316065,0.417936,1.11829,0.248198,0.568088,0.329765,0.550945,1.31342,1.45293,1.66943,0.634834,1.63852,1.25985,0.973993,1.62866,1.40021,0.471065,1.13169,0.874328,1.31511,1.38384,0.666703,0.172818,0.175225,0.155339,0.548416
std_score_time,0.00381572,0.010501,0.0210367,0.0115698,0.0169872,0.0534804,0.0347443,0.00569172,0.00907041,0.0352608,0.0279344,0.0354776,0.000642458,0.0158672,0.0209305,0.00208132,0.00204135,0.000952304,0.0211416,0.0138054,0.00223795,0.0144127,0.0288865,0.011058,0.00737025,0.00960386,0.0126462,0.00218294,0.0244664,0.00457058,0.0071509,0.0184281,0.0136966,0.0292983,0.0117348,0.0155405,0.0197072,0.0255566,0.015167,0.0160836,0.00338635,0.04628,0.0411171,0.0203145,0.0316279,0.0344358,0.0030362,0.00424542,0.00906822,0.0298078
param_randomforestclassifier__max_depth,10,10,10,10,10,10,10,10,10,10,10,10,5,5,10,5,5,10,10,10,10,10,5,5,5,5,10,5,10,5,,,,,,,,,,,,,,,,5,5,5,5,5
param_randomforestclassifier__max_features,0.398317,0.496879,0.51837,0.224006,0.746792,0.487576,0.343169,0.618201,0.624962,0.696511,0.82209,0.790481,0.495004,0.519137,0.981998,0.539679,0.840238,0.171851,0.900564,0.187797,0.480386,0.662855,0.678676,0.748608,0.610248,0.592569,0.888868,0.972203,0.126904,0.27689,0.292717,0.344663,0.558728,0.579514,0.45316,0.702632,0.655041,0.752731,0.736994,0.670795,0.412234,0.997869,0.74277,0.867502,0.803383,0.154835,0.142141,0.123611,0.0836674,0.0600042
param_randomforestclassifier__n_estimators,318,123,317,364,351,365,456,285,116,489,306,482,51,461,391,289,173,306,193,305,82,339,133,77,209,276,469,155,205,210,135,305,326,385,137,385,336,258,389,328,104,281,209,301,315,438,92,94,76,341
param_simpleimputer__strategy,mean,mean,mean,most_frequent,mean,most_frequent,most_frequent,most_frequent,median,mean,mean,mean,mean,mean,mean,mean,mean,most_frequent,median,median,median,most_frequent,most_frequent,median,median,most_frequent,mean,median,mean,most_frequent,mean,most_frequent,median,median,median,mean,median,median,mean,median,median,mean,mean,mean,median,most_frequent,most_frequent,most_frequent,median,median
param_targetencoder__min_samples_leaf,4,6,3,9,9,1,1,8,4,9,8,5,6,8,1,2,7,6,8,6,5,3,5,1,5,4,3,4,7,1,5,3,3,4,3,3,2,1,5,9,8,7,7,9,7,7,4,2,2,9
param_targetencoder__smoothing,1000,2,500,500,1000,500,20,100,20,2,100,2,2,2,1000,100,2,1000,100,2,1000,100,500,2,100,1000,100,100,500,1000,1000,100,1000,500,2,500,100,500,20,500,20,20,500,2,100,1000,20,1000,2,20


In [103]:
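# Apply the best parameters (best_estimator_ has already been refit on X_train,
# so fitting again below simply repeats that step)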

pipe = clf.best_estimator_
pipe.fit(X_train, y_train)

Pipeline(steps=[('targetencoder',
                 TargetEncoder(cols=['opinion_h1n1_vacc_effective',
                                     'opinion_h1n1_risk',
                                     'opinion_h1n1_sick_from_vacc', 'agegrp',
                                     'employment_status'],
                               min_samples_leaf=4, smoothing=1000.0)),
                ('simpleimputer', SimpleImputer()),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=10,
                                        max_features=0.39831737031932235,
                                        n_estimators=318, random_state=2))])

In [104]:
print('Training accuracy: ', pipe.score(X_train, y_train))
print('Validation accuracy: ', pipe.score(X_val, y_val))

Training accuracy:  0.8424220858168016
Validation accuracy:  0.8175779860040328


In [105]:
y_pred = pipe.predict(X_val)

print('Validation F1 score: ', f1_score(y_val, y_pred))

Validation F1 score:  0.5595647193585338


In [106]:
y_test_pred = pipe.predict(X_test)
y_test_pred

array([0, 0, 0, ..., 0, 0, 0])

In [107]:
# Use the 0-based row position as the id, as the original submission.index did.
submission = pd.DataFrame({'id': np.arange(len(y_test_pred)), 'vacc_h1n1_f': y_test_pred})

submission.to_csv('sample_submission1229.csv', index=False)   # 0.5379

## 🔥 Challenge (Github - Discussion)


### 2) Tune hyperparameters with [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
- Try everything you can to improve model performance.
- Analyze and explain which hyperparameter had the largest effect on model performance.
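
One possible way to approach the analysis asked for above is to average the cross-validated score per parameter value and compare how much each parameter moves the score. The sketch below is only a starting point under stated assumptions: the data is synthetic, the grid is deliberately tiny, and the `demo_` names are made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the competition data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)

demo_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_grid={
        "max_depth": [2, 5, 10],
        "max_features": [0.3, 0.6, 1.0],
    },
    cv=3,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

results = pd.DataFrame(demo_grid.cv_results_)

# For each parameter, average the CV score over all settings of the other
# parameter, then report how far those averages spread.
for name in ["param_max_depth", "param_max_features"]:
    by_value = results.groupby(name)["mean_test_score"].mean()
    print(name, "spread:", by_value.max() - by_value.min())
    print(by_value)
```

A larger spread suggests that the parameter matters more within the grid that was searched. This ignores interactions between parameters, so it is a rough first look and not a definitive ranking.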



In [None]:
### Work on the assignment here ###