# 7. Advanced Section - Model Improvement

In this part, learners improve the model's predictions on their own so that they beat the results of the previous part. That is, the learner's Kaggle score must exceed **0.8246 (Public Score), > top 50%**. Some suggested approaches to choose from:

- Change the feature selection method to pick a better feature set

- Use a different method to handle class imbalance

- Use other machine learning models or advanced model-combination techniques

- Focus on improving the Boosting models

- Tune the model hyperparameters

- Use K-Fold techniques or split the training set to avoid overfitting
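
As an illustration of the K-Fold suggestion, here is a minimal sketch of stratified 5-fold cross-validation scored with ROC AUC. The data is synthetic (generated with `make_classification` at roughly 4% positives), and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made for this example only; they are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data: roughly 4% positives (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.96, 0.04], random_state=0,
)

# Stratification keeps the positive rate similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
demo_model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# One AUC per held-out fold; the spread hints at how stable the estimate is.
scores = cross_val_score(demo_model, X_demo, y_demo, cv=cv, scoring="roc_auc")
print("AUC per fold:", scores.round(3))
print("mean +/- std: {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```

Each fold's score is computed on rows the model did not see during fitting, which is why the mean is a more honest estimate than a score measured on the training rows.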

## Load the required packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

## Load the dataset

In [2]:
train = pd.read_csv('dataset/train.csv')
test = pd.read_csv('dataset/test.csv')

test_id = test['ID']
X_train = train.drop('ID', axis=1)
X_test = test.drop('ID', axis=1)

X_train = X_train.drop('TARGET', axis=1)
y_train = train['TARGET']

## Explore the data

In [3]:
X_train.shape, y_train.shape, X_test.shape

((76020, 369), (76020,), (75818, 369))

In [4]:
numerical_features = X_train.select_dtypes(include=[np.number]).columns
numerical_features


Index(['var3', 'var15', 'imp_ent_var16_ult1', 'imp_op_var39_comer_ult1',
       'imp_op_var39_comer_ult3', 'imp_op_var40_comer_ult1',
       'imp_op_var40_comer_ult3', 'imp_op_var40_efect_ult1',
       'imp_op_var40_efect_ult3', 'imp_op_var40_ult1',
       ...
       'saldo_medio_var29_ult3', 'saldo_medio_var33_hace2',
       'saldo_medio_var33_hace3', 'saldo_medio_var33_ult1',
       'saldo_medio_var33_ult3', 'saldo_medio_var44_hace2',
       'saldo_medio_var44_hace3', 'saldo_medio_var44_ult1',
       'saldo_medio_var44_ult3', 'var38'],
      dtype='object', length=369)

In [5]:
# np.object was removed in NumPy 1.24; the string 'object' selects the same dtype
categories_features = X_train.select_dtypes(include=['object']).columns
categories_features

Index([], dtype='object')

In [6]:
X_train.isnull().sum().sum()

0

In [7]:
X_test.isnull().sum().sum()

0

In [8]:
y_train.value_counts()

0    73012
1     3008
Name: TARGET, dtype: int64

In [9]:
ratio = y_train.value_counts()[1] / y_train.value_counts().sum()
ratio

0.0395685345961589

The exploration shows that:
- All features are numeric
- There are no missing values
- The data is imbalanced: 73012 rows have target 0 while only 3008 have target 1, about 3.96% of the rows
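
The class counts above can be turned into two numbers that are handy when dealing with imbalance. The sketch below recomputes the positive rate and the negative-to-positive ratio; the latter is the value commonly suggested for XGBoost's `scale_pos_weight` parameter, which is one alternative to resampling and is not used in this notebook.

```python
n_neg, n_pos = 73012, 3008  # class counts reported above

# Share of positive rows in the training set.
positive_rate = n_pos / (n_neg + n_pos)

# How many negatives there are for each positive.
neg_to_pos = n_neg / n_pos

print("positive rate: {:.2%}".format(positive_rate))
print("negatives per positive: {:.1f}".format(neg_to_pos))
```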

## Preprocessing Data

### Remove duplicated, near-identical, or highly correlated variables

In [10]:
from sklearn.pipeline import Pipeline

from feature_engine.selection import (
    DropConstantFeatures,
    DropDuplicateFeatures,
    SmartCorrelatedSelection,
)

pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)),
    ('duplicated', DropDuplicateFeatures()),
    ('correlation', SmartCorrelatedSelection(selection_method='variance')),
])

pipe.fit(X_train)
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

In [11]:
X_train.shape, X_test.shape

((76020, 81), (75818, 81))
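
To make the first two steps of the pipeline above more concrete, here is a rough pandas-only sketch of the idea behind dropping quasi-constant and duplicated columns. The helper names `drop_quasi_constant` and `drop_duplicated_columns` are hypothetical, and the real `feature_engine` transformers may differ in details such as how the threshold is compared.

```python
import pandas as pd

def drop_quasi_constant(df, tol=0.998):
    """Drop columns whose most frequent value covers at least `tol` of the rows."""
    keep = [
        col for col in df.columns
        if df[col].value_counts(normalize=True).iloc[0] < tol
    ]
    return df[keep]

def drop_duplicated_columns(df):
    """Drop columns whose values repeat an earlier column exactly."""
    return df.loc[:, ~df.T.duplicated()]

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "a_copy": [1, 2, 3, 4],   # exact copy of "a"
    "const": [0, 0, 0, 0],    # a single value in every row
    "b": [4, 3, 2, 1],
})

reduced = drop_duplicated_columns(drop_quasi_constant(toy))
print(reduced.columns.tolist())
```

On the toy frame only `a` and `b` survive: `const` is removed by the first step and `a_copy` by the second.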

In [12]:
# Save all features columns
selected_features = {}
#selected_features['all_features'] = X_train.columns.to_list()

## Feature selection

In [13]:
def features_selection(X_train, y_train):
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.ensemble import GradientBoostingClassifier
    from feature_engine.selection import RecursiveFeatureElimination
    from feature_engine.selection import RecursiveFeatureAddition
    #All features
    selected_features = {}
    #selected_features['all_features'] = X_train.columns.to_list()
    #print('All features: ', len(selected_features['all_features']))
    
    # ExtraTreesClassifier
    # clf = ExtraTreesClassifier(random_state=0)
    # clf.fit(X_train, y_train)
    # fs = SelectFromModel(estimator=clf , prefit=True)
    # selected_features['extra_trees_classifier'] = X_train.columns[fs.get_support()].tolist()
    
    # ExtraTreesClassifier
    sel = SelectFromModel(ExtraTreesClassifier(random_state=0))
    sel.fit(X_train, y_train)
    selected_features['extra_trees_classifier'] = X_train.columns[sel.get_support()].tolist()
    print('{} features selected by ExtraTreesClassifier'.format(len(selected_features['extra_trees_classifier'])))
    
    # RandomForestClassifier
    sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=10))
    sel.fit(X_train, y_train)
    selected_features['random_forest_classifier'] = X_train.columns[sel.get_support()].tolist()
    print('{} features selected by RandomForestClassifier'.format(len(selected_features['random_forest_classifier'])))

    # RecursiveFeatureElimination
    model = GradientBoostingClassifier(
        n_estimators=10,
        max_depth=2,
        random_state=10)
    sel = RecursiveFeatureElimination(
        variables=None,  # automatically evaluate all numerical variables
        estimator=model,  # the ML model
        scoring='roc_auc',  # the metric we want to evaluate
        threshold=0.0005,  # the maximum performance drop allowed to remove a feature
        cv=2)  # cross-validation
    sel.fit(X_train, y_train)
    selected_features['recursive_feature_elimination'] = sel.transform(X_train).columns.tolist()
    print('{} features selected by RecursiveFeatureElimination'.format(len(selected_features['recursive_feature_elimination'])))
    
    # RecursiveFeatureAddition
    model = GradientBoostingClassifier(
        n_estimators=10,
        max_depth=2,
        random_state=10)
    # Setup the RFA selector
    sel = RecursiveFeatureAddition(
        variables=None,  # automatically evaluate all numerical variables
        estimator=model,  # the ML model
        scoring='roc_auc',  # the metric we want to evaluate
        threshold=0.0005,  # the minimum performance increase needed to select a feature
        cv=2)  # cross-validation
    sel.fit(X_train, y_train)
    selected_features['recursive_feature_addition'] = sel.transform(X_train).columns.tolist()
    print('{} features selected by RecursiveFeatureAddition'.format(len(selected_features['recursive_feature_addition'])))
    return selected_features

In [14]:
selected_features = features_selection(X_train, y_train)

14 features selected by ExtraTreesClassifier
12 features selected by RandomForestClassifier
6 features selected by RecursiveFeatureElimination
4 features selected by RecursiveFeatureAddition
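
The two `SelectFromModel` selectors above are used without an explicit `threshold`. For tree ensembles, scikit-learn then keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `demo_sel` are names assumed for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 3 informative features out of 10 (assumption for the demo).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

demo_sel = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))
demo_sel.fit(X_demo, y_demo)

importances = demo_sel.estimator_.feature_importances_
mask = demo_sel.get_support()

# With no threshold given, the cut-off is the mean feature importance.
print("threshold:", round(float(demo_sel.threshold_), 4))
print("selected feature indices:", np.flatnonzero(mask).tolist())
```

Passing a value such as `threshold='median'` or a float would be one way to keep more or fewer features than the default.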


## Feature scaling

In [15]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = {}

for key, value in selected_features.items():
    print('Value: ', value)
    scaled_data[key+'_scaled'] = {'train': scaler.fit_transform(X_train[value]), 'test': scaler.transform(X_test[value])}

Value:  ['var15', 'saldo_var5', 'saldo_var30', 'saldo_var42', 'var36', 'num_var22_hace2', 'num_var22_hace3', 'num_var22_ult3', 'num_var45_hace3', 'num_var45_ult3', 'saldo_medio_var5_hace2', 'saldo_medio_var5_hace3', 'saldo_medio_var5_ult3', 'var38']
Value:  ['var15', 'saldo_var5', 'saldo_var30', 'saldo_var42', 'num_var22_hace2', 'num_var22_ult3', 'num_var45_hace3', 'num_var45_ult3', 'saldo_medio_var5_hace2', 'saldo_medio_var5_hace3', 'saldo_medio_var5_ult3', 'var38']
Value:  ['var15', 'imp_op_var39_efect_ult3', 'saldo_var30', 'num_var22_ult1', 'saldo_medio_var5_hace3', 'var38']
Value:  ['var15', 'saldo_var30', 'saldo_medio_var5_hace3', 'var38']
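
The loop above fits the scaler on the training columns and reuses the same statistics for the test columns. A small sketch of what that means numerically, on random data (the arrays `train_demo` and `test_demo` are assumptions for the illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Fit on the training rows only, then apply the same transform to the test rows.
scaler_demo = StandardScaler().fit(train_demo)
test_scaled = scaler_demo.transform(test_demo)

# Equivalent manual computation: subtract the TRAIN mean, divide by the TRAIN std.
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
print(np.allclose(test_scaled, manual))
```

Using the training statistics for the test set keeps both sets on the same scale without letting test rows influence the transform.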


## Training models

In [16]:
n_jobs = 4

### UnderSampling

In [17]:
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

from imblearn.under_sampling import (
    RandomUnderSampler,
    CondensedNearestNeighbour,
    TomekLinks,
    OneSidedSelection,
    EditedNearestNeighbours,
    RepeatedEditedNearestNeighbours,
    AllKNN,
    NeighbourhoodCleaningRule,
    NearMiss,
    InstanceHardnessThreshold
)

undersampler_dict = {
    'random': RandomUnderSampler(
        sampling_strategy='auto',
        random_state=0,
        replacement=False),

    'tomek': TomekLinks(
        sampling_strategy='auto',
        n_jobs=n_jobs),

    'enn': EditedNearestNeighbours(
        sampling_strategy='auto',
        n_neighbors=3,
        kind_sel='all',
        n_jobs=n_jobs),

    'allknn': AllKNN(
        sampling_strategy='auto',
        n_neighbors=3,
        kind_sel='all',
        n_jobs=n_jobs),
}
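
For intuition about the `'random'` entry above, here is a NumPy-only sketch of random undersampling: every minority row is kept and the majority class is cut down to the same size. The function `random_undersample` is a hypothetical helper written for this illustration, not the imbalanced-learn implementation.

```python
import numpy as np

def random_undersample(X, y, random_state=0):
    """Keep every minority row and an equally sized random subset of the other classes."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep row indices per class without replacement.
    keep_idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep_idx.sort()
    return X[keep_idx], y[keep_idx]

y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.arange(100).reshape(-1, 1)

X_res, y_res = random_undersample(X_toy, y_toy)
print(np.bincount(y_res))
```

The cleaning methods in the dictionary (Tomek links, ENN, AllKNN) behave differently: they remove only selected majority rows near the class boundary, so the classes usually stay imbalanced afterwards.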


### Oversampling

In [18]:
from sklearn.svm import SVC
from imblearn.over_sampling import (
    RandomOverSampler,
    SMOTE,
    ADASYN,
    BorderlineSMOTE,
    SVMSMOTE,
)

oversampler_dict = {
    'smote': SMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        n_jobs=n_jobs),

    'border1': BorderlineSMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        m_neighbors=10,
        kind='borderline-1',
        n_jobs=n_jobs),

    'adasyn': ADASYN(
        sampling_strategy='auto',
        random_state=0,
        n_neighbors=5,
        n_jobs=n_jobs),
}
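
The core idea behind SMOTE is interpolation: a synthetic minority point is placed somewhere on the segment between a minority row and one of its nearest minority neighbours. The sketch below is a simplified illustration of that step only; `smote_like` is a hypothetical helper and it omits what the imbalanced-learn classes do around it, such as choosing how many samples to create.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(random_state)
    # Ask for k + 1 neighbours: column 0 is normally the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neigh_idx = nn.kneighbors(X_min)
    # Pick a random base row and one of its k neighbours for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neigh_idx[base, rng.integers(1, k + 1, size=n_new)]
    # lam in [0, 1) sets the position along the segment base -> neighbour.
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[picked] - X_min[base])

rng_demo = np.random.default_rng(1)
X_minority = rng_demo.normal(size=(20, 2))
synthetic = smote_like(X_minority, n_new=30)
print(synthetic.shape)
```

Because every new point lies between two existing minority rows, the synthetic data never leaves the region already covered by the minority class.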

### Over-Under sampling

In [19]:
from imblearn.combine import SMOTEENN, SMOTETomek

combined_sampler_dict = {
    'smenn': SMOTEENN(
        sampling_strategy='auto',
        random_state=0,
        smote=SMOTE(sampling_strategy='auto', random_state=0, k_neighbors=5),
        enn=EditedNearestNeighbours(
            sampling_strategy='auto', n_neighbors=3, kind_sel='all'),
        n_jobs=n_jobs),

    'smtomek': SMOTETomek(
        sampling_strategy='auto',
        random_state=0,
        smote=SMOTE(sampling_strategy='auto', random_state=0, k_neighbors=5),
        tomek=TomekLinks(sampling_strategy='all'),
        n_jobs=n_jobs),
}

### Train the base models

In [33]:
#Import libs
from xgboost import XGBClassifier
from sklearn.ensemble import  AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_validate

In [34]:
ensemle_models = {
    'adaboost_classifier': AdaBoostClassifier(),
    'gradientboost_classifier': GradientBoostingClassifier(),
    'xgboost_classifier': XGBClassifier(missing=np.nan, max_depth=5, n_estimators=350, 
                     learning_rate=0.03, nthread=4, subsample=0.95, 
                     colsample_bytree=0.85, seed=0),
}

In [35]:

def run_model(model, X_train, y_train, resampler):
    pipeline = make_pipeline(resampler, model)
    cv_results =  cross_validate(pipeline, 
                                 X_train, 
                                 y_train, 
                                 cv=3, 
                                 scoring='roc_auc', 
                                 n_jobs=n_jobs,)
    return cv_results['test_score'].mean(), cv_results['test_score'].std()

In [36]:

results_dict = {}
std_dict = {}

max_score = 0
max_model_name = ''
max_dataset_name = ''
max_resampler_name = ''

# Evaluate the under-, over-, and combined samplers, in that order, with one loop
all_resamplers = {**undersampler_dict, **oversampler_dict, **combined_sampler_dict}

for model_name, model in ensemle_models.items():
    results_dict[model_name] = {}
    std_dict[model_name] = {}
    print("="*40)
    print('Model: ', model_name)
    print("="*40)
    for dataset_name, dataset in scaled_data.items():

        results_dict[model_name][dataset_name] = {}
        std_dict[model_name][dataset_name] = {}

        print('__Dataset: {}'.format(dataset_name))

        X_train = dataset['train']

        for resampler_name, resampler in all_resamplers.items():
            print('__Resampler: {}'.format(resampler_name))
            auc, auc_std = run_model(model, X_train, y_train, resampler)
            print('________AUC: {:.3f} +/- {:.3f}'.format(auc, auc_std))
            results_dict[model_name][dataset_name][resampler_name] = auc
            std_dict[model_name][dataset_name][resampler_name] = auc_std
            # Track the best combination seen so far
            if auc > max_score:
                max_score = auc
                max_model_name = model_name
                max_dataset_name = dataset_name
                max_resampler_name = resampler_name
            print('________Max AUC: {:.3f} (model: {}, dataset: {}, resampler: {})'.format(max_score, max_model_name, max_dataset_name, max_resampler_name))
        

Model:  adaboost_classifier
__Dataset: extra_trees_classifier_scaled
__Resampler: random
________AUC: 0.818 +/- 0.004
________Max AUC: 0.818 (model: adaboost_classifier, dataset: extra_trees_classifier_scaled, resampler: random)
__Resampler: tomek
________AUC: 0.823 +/- 0.004
________Max AUC: 0.823 (model: adaboost_classifier, dataset: extra_trees_classifier_scaled, resampler: tomek)
__Resampler: enn
________AUC: 0.823 +/- 0.006
________Max AUC: 0.823 (model: adaboost_classifier, dataset: extra_trees_classifier_scaled, resampler: enn)
__Resampler: allknn
________AUC: 0.822 +/- 0.006
________Max AUC: 0.823 (model: adaboost_classifier, dataset: extra_trees_classifier_scaled, resampler: enn)
__Resampler: smote
________AUC: 0.817 +/- 0.006
________Max AUC: 0.823 (model: adaboost_classifier, dataset: extra_trees_classifier_scaled, resampler: enn)
__Resampler: border1
________AUC: 0.815 +/- 0.004
________Max AUC: 0.823 (model: adaboost_classifier, dataset: extra_trees_classifier_scaled, resa

In [42]:
from collections.abc import MutableMapping

def flatten(d, parent_key='', sep='.'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

results_df = pd.DataFrame.from_dict(flatten(results_dict), orient='index', columns=['AUC'])

In [43]:
results_df.sort_values(by='AUC', ascending=False).head(10)

| | AUC |
|---|---|
| xgboost_classifier.extra_trees_classifier_scaled.tomek | 0.833057 |
| xgboost_classifier.extra_trees_classifier_scaled.enn | 0.832921 |
| xgboost_classifier.extra_trees_classifier_scaled.allknn | 0.832241 |
| xgboost_classifier.random_forest_classifier_scaled.tomek | 0.832198 |
| xgboost_classifier.random_forest_classifier_scaled.enn | 0.832041 |
| xgboost_classifier.recursive_feature_elimination_scaled.tomek | 0.831439 |
| xgboost_classifier.random_forest_classifier_scaled.allknn | 0.831138 |
| xgboost_classifier.recursive_feature_elimination_scaled.enn | 0.831075 |
| xgboost_classifier.recursive_feature_elimination_scaled.allknn | 0.830404 |
| gradientboost_classifier.recursive_feature_elimination_scaled.tomek | 0.830354 |
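
The dotted keys in the ranking above pack model, feature set, and resampler into one string. An alternative layout, sketched here on a made-up nested dictionary (`toy_results` and its scores are invented for the illustration), keeps each level in its own column so the results can be pivoted or filtered.

```python
import pandas as pd

# Hypothetical scores with the same nesting as results_dict: model -> dataset -> resampler.
toy_results = {
    "model_a": {"features_x": {"tomek": 0.83, "smote": 0.81}},
    "model_b": {"features_x": {"tomek": 0.82, "smote": 0.80}},
}

# One row per (model, dataset, resampler) combination.
rows = [
    {"model": m, "dataset": d, "resampler": r, "AUC": auc}
    for m, datasets in toy_results.items()
    for d, resamplers in datasets.items()
    for r, auc in resamplers.items()
]
long_df = pd.DataFrame(rows)

# Best AUC per model and resampler, and the single best row overall.
pivot = long_df.pivot_table(index="model", columns="resampler", values="AUC", aggfunc="max")
best = long_df.loc[long_df["AUC"].idxmax()]
print(pivot)
print(best.to_dict())
```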


## Evaluate, predict on the test set, and submit to Kaggle

### Try xgboost_classifier
- model: xgboost_classifier
- features selected by: extra_trees_classifier_scaled
- resampler: tomek

In [45]:
from sklearn.metrics import roc_auc_score

model = ensemle_models['xgboost_classifier']
X_train = scaled_data['extra_trees_classifier_scaled']['train']
X_test = scaled_data['extra_trees_classifier_scaled']['test']
resampler = undersampler_dict['tomek']

X_train_tomek, y_train_tomek = resampler.fit_resample(X_train, y_train)
model.fit(X_train_tomek, y_train_tomek)
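# Note: this AUC is measured on the resampled training data itself,
# so it is optimistic compared with the cross-validated 0.833 above
y_pred = model.predict_proba(X_train_tomek)
roc_auc_score(y_train_tomek, y_pred[:, 1])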

0.8803949217952097
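
A score computed on the rows used for fitting tends to overstate performance. One way to see the gap is to hold out part of the training data. The sketch below does this on synthetic data with a scikit-learn model; `X_demo`, `y_demo`, and `demo_clf` are assumptions for the example, and it is not a rerun of the notebook's XGBoost model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 4% positives (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.96, 0.04], random_state=0,
)

# Stratified split so both parts keep a similar positive rate.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0,
)

demo_clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
demo_clf.fit(X_fit, y_fit)

train_auc = roc_auc_score(y_fit, demo_clf.predict_proba(X_fit)[:, 1])
val_auc = roc_auc_score(y_val, demo_clf.predict_proba(X_val)[:, 1])
print("train AUC: {:.3f}, validation AUC: {:.3f}".format(train_auc, val_auc))
```

The validation figure is the one to compare against the cross-validated scores and the Kaggle public score.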

In [46]:
probs = model.predict_proba(X_test)

submission = pd.DataFrame({"ID":test_id, "TARGET": probs[:,1]})
submission.to_csv("adv_xgboost_classifier_submittion.csv", index=False)
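
Before uploading, a quick sanity check of the submission frame can catch simple mistakes. The helper below is a hypothetical sketch; the checks it applies (exactly the `ID` and `TARGET` columns, one row per test ID, probabilities between 0 and 1) are assumptions based on how the file is built above, not official Kaggle rules.

```python
import pandas as pd

def check_submission(df, expected_ids):
    """Return a list of problems found; an empty list means the frame looks fine."""
    if list(df.columns) != ["ID", "TARGET"]:
        return ["columns must be exactly ['ID', 'TARGET']"]
    problems = []
    if df["ID"].duplicated().any():
        problems.append("duplicated IDs")
    if set(df["ID"]) != set(expected_ids):
        problems.append("IDs do not match the test set")
    if df["TARGET"].isnull().any():
        problems.append("missing predictions")
    if not df["TARGET"].between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems

good = pd.DataFrame({"ID": [1, 2, 3], "TARGET": [0.1, 0.5, 0.9]})
bad = pd.DataFrame({"ID": [1, 1, 3], "TARGET": [0.1, 1.5, 0.9]})

print(check_submission(good, expected_ids=[1, 2, 3]))
print(check_submission(bad, expected_ids=[1, 2, 3]))
```

In the notebook this could be called as `check_submission(submission, test_id)` before writing the CSV.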

### Kaggle result

![](images/my_adv_submittion.png)