# 7. Advanced Section - Model Improvement

In this section, learners improve the model's predictions on their own so that the results beat those of the previous section. Concretely, the learner's Kaggle score must exceed **0.8246 (Public Score), > top 50%**. Some suggested approaches to choose from:

- Change the feature selection method to obtain a better feature set

- Use a different method to handle the class imbalance

- Use other machine learning models or advanced model ensembling techniques

- Focus on improving Boosting models

- Tune the model hyperparameters

- Use K-Fold techniques or split the training set further to avoid overfitting
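
As an illustration of the K-Fold suggestion, here is a minimal sketch of stratified cross-validation scored with ROC AUC. It runs on synthetic data from `make_classification` (an assumption standing in for the competition data), and the model and fold count are arbitrary choices rather than the settings used later in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data: roughly 4% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.96, 0.04], random_state=0,
)

# Stratification keeps the class ratio roughly equal in every fold,
# which matters when positives are this rare.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    AdaBoostClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring='roc_auc',
)

print('AUC per fold:', np.round(scores, 3))
print('Mean AUC: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))
```

A large spread between folds is a hint that a single train/validation split would give an unreliable estimate.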

## Load the required packages

In [53]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

## Load the dataset

In [54]:
train = pd.read_csv('dataset/train.csv')
test = pd.read_csv('dataset/test.csv')

test_id = test['ID']
X_train = train.drop('ID', axis=1)
X_test = test.drop('ID', axis=1)

X_train = X_train.drop('TARGET', axis=1)
y_train = train['TARGET']

## Data exploration

In [55]:
X_train.shape, y_train.shape, X_test.shape

((76020, 369), (76020,), (75818, 369))

In [56]:
numerical_features = X_train.select_dtypes(include=[np.number]).columns
numerical_features


Index(['var3', 'var15', 'imp_ent_var16_ult1', 'imp_op_var39_comer_ult1',
       'imp_op_var39_comer_ult3', 'imp_op_var40_comer_ult1',
       'imp_op_var40_comer_ult3', 'imp_op_var40_efect_ult1',
       'imp_op_var40_efect_ult3', 'imp_op_var40_ult1',
       ...
       'saldo_medio_var29_ult3', 'saldo_medio_var33_hace2',
       'saldo_medio_var33_hace3', 'saldo_medio_var33_ult1',
       'saldo_medio_var33_ult3', 'saldo_medio_var44_hace2',
       'saldo_medio_var44_hace3', 'saldo_medio_var44_ult1',
       'saldo_medio_var44_ult3', 'var38'],
      dtype='object', length=369)

In [57]:
# np.object was removed in NumPy 1.24; the string 'object' selects the same dtype.
categories_features = X_train.select_dtypes(include=['object']).columns
categories_features

Index([], dtype='object')

In [58]:
X_train.isnull().sum().sum()

0

In [59]:
X_test.isnull().sum().sum()

0

In [60]:
y_train.value_counts()

0    73012
1     3008
Name: TARGET, dtype: int64

In [61]:
ratio = y_train.value_counts()[1] / y_train.value_counts().sum()
ratio

0.0395685345961589

The exploration shows that:
- All features are numeric
- There are no missing values
- The data is imbalanced: 73012 samples have target 0 while only 3008 have target 1, about 3.96% of the total
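
The imbalance figures above can be reproduced with plain arithmetic. The sketch below also derives "balanced" class weights (`n_samples / (n_classes * count)`, the convention used by scikit-learn's `class_weight='balanced'`). Class weighting is an alternative to resampling and is shown only for illustration; it is not used elsewhere in this notebook.

```python
# Class counts reported by y_train.value_counts() above.
n_negative = 73012
n_positive = 3008
n_samples = n_negative + n_positive

positive_ratio = n_positive / n_samples
negatives_per_positive = n_negative / n_positive

print('Positive ratio: {:.4f}'.format(positive_ratio))
print('Negatives per positive: {:.2f}'.format(negatives_per_positive))

# "Balanced" weights: each class contributes the same total weight.
class_weight = {
    0: n_samples / (2 * n_negative),
    1: n_samples / (2 * n_positive),
}
print(class_weight)
```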

## Preprocessing Data

### Remove quasi-constant, duplicated, and highly correlated variables

In [62]:
from sklearn.pipeline import Pipeline

from feature_engine.selection import (
    DropConstantFeatures,
    DropDuplicateFeatures,
    SmartCorrelatedSelection,
)

pipe = Pipeline([
    ('constant', DropConstantFeatures(tol=0.998)),
    ('duplicated', DropDuplicateFeatures()),
    ('correlation', SmartCorrelatedSelection(selection_method='variance')),
])

pipe.fit(X_train)
X_train = pipe.transform(X_train)
X_test = pipe.transform(X_test)

In [63]:
X_train.shape, X_test.shape

((76020, 81), (75818, 81))
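
To make the first two pipeline steps more concrete, here is a rough pandas sketch of the same ideas on a toy frame. The column names and the `tol` value are made up for illustration, and the code only approximates what `DropConstantFeatures` and `DropDuplicateFeatures` do; it is not the library's implementation.

```python
import pandas as pd

# Toy frame with hypothetical column names, not the competition data.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'a_copy': [1, 2, 3, 4, 5],       # exact duplicate of 'a'
    'mostly_zero': [0, 0, 0, 0, 1],  # one value covers 80% of the rows
    'b': [5, 3, 1, 4, 2],
})

tol = 0.75  # treat a column as quasi-constant if one value covers at least this share

# Share of rows taken by the most frequent value in each column.
dominant_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])
quasi_constant = dominant_share[dominant_share >= tol].index.tolist()

# Columns whose values repeat an earlier column exactly.
duplicated = df.columns[df.T.duplicated()].tolist()

reduced = df.drop(columns=quasi_constant + duplicated)
print('Quasi-constant:', quasi_constant)
print('Duplicated:', duplicated)
print('Remaining:', reduced.columns.tolist())
```

With `tol=0.998`, as in the pipeline above, only columns where a single value covers almost every row are removed.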

In [64]:
# Save all features columns
selected_features = {}
#selected_features['all_features'] = X_train.columns.to_list()

## Feature selection

In [78]:
def features_selection(X_train, y_train):
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.ensemble import GradientBoostingClassifier
    from feature_engine.selection import RecursiveFeatureElimination
    from feature_engine.selection import RecursiveFeatureAddition
    #All features
    selected_features = {}
    #selected_features['all_features'] = X_train.columns.to_list()
    #print('All features: ', len(selected_features['all_features']))
    
    # ExtraTreesClassifier (alternative: fit the estimator first, then pass prefit=True)
    # clf = ExtraTreesClassifier(random_state=0)
    # clf.fit(X_train, y_train)
    # fs = SelectFromModel(estimator=clf, prefit=True)
    # selected_features['extra_trees_classifier'] = X_train.columns[fs.get_support()].tolist()
    
    # ExtraTreesClassifier
    sel = SelectFromModel(ExtraTreesClassifier(random_state=0))
    sel.fit(X_train, y_train)
    selected_features['extra_trees_classifier'] = X_train.columns[sel.get_support()].tolist()
    print('{} features selected by ExtraTreesClassifier'.format(len(selected_features['extra_trees_classifier'])))
    
    # RandomForestClassifier
    sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=10))
    sel.fit(X_train, y_train)
    selected_features['random_forest_classifier'] = X_train.columns[sel.get_support()].tolist()
    print('{} features selected by RandomForestClassifier'.format(len(selected_features['random_forest_classifier'])))

    # RecursiveFeatureElimination
    model = GradientBoostingClassifier(
        n_estimators=10,
        max_depth=2,
        random_state=10)
    sel = RecursiveFeatureElimination(
        variables=None,  # automatically evaluate all numerical variables
        estimator=model,  # the ML model
        scoring='roc_auc',  # the metric we want to evaluate
        threshold=0.0005,  # the maximum performance drop allowed to remove a feature
        cv=2)  # cross-validation
    sel.fit(X_train, y_train)
    selected_features['recursive_feature_elimination'] = sel.transform(X_train).columns.tolist()
    print('{} features selected by RecursiveFeatureElimination'.format(len(selected_features['recursive_feature_elimination'])))
    
    # RecursiveFeatureAddition
    model = GradientBoostingClassifier(
        n_estimators=10,
        max_depth=2,
        random_state=10)
    # Set up the RFA selector
    sel = RecursiveFeatureAddition(
        variables=None,  # automatically evaluate all numerical variables
        estimator=model,  # the ML model
        scoring='roc_auc',  # the metric we want to evaluate
        threshold=0.0005,  # the minimum performance increase needed to select a feature
        cv=2)  # cross-validation
    sel.fit(X_train, y_train)
    selected_features['recursive_feature_addition'] = sel.transform(X_train).columns.tolist()
    print('{} features selected by RecursiveFeatureAddition'.format(len(selected_features['recursive_feature_addition'])))
    return selected_features

In [79]:
selected_features = features_selection(X_train, y_train)

14 features selected by ExtraTreesClassifier
12 features selected by RandomForestClassifier
6 features selected by RecursiveFeatureElimination
4 features selected by RecursiveFeatureAddition
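
The two `SelectFromModel` selectors above are created without a `threshold`. For tree ensembles, scikit-learn then keeps the features whose importance is at least the mean importance. The sketch below checks this on synthetic data; the dataset and estimator settings are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 features, of which only a few carry signal.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4, random_state=0,
)

sel = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
sel.fit(X, y)

importances = sel.estimator_.feature_importances_
mask = sel.get_support()

print('Threshold used: {:.4f}'.format(sel.threshold_))
print('Mean importance: {:.4f}'.format(importances.mean()))
print('Selected {} of {} features'.format(mask.sum(), mask.size))
```

Passing an explicit `threshold` (for example `'median'` or `'1.25*mean'`) is one way to make the selection stricter or looser.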


## Feature scaling

In [80]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = {}

for key, value in selected_features.items():
    print('Value: ', value)
    scaled_data[key+'_scaled'] = {'train': scaler.fit_transform(X_train[value]), 'test': scaler.transform(X_test[value])}

Value:  ['var15', 'saldo_var5', 'saldo_var30', 'saldo_var42', 'var36', 'num_var22_hace2', 'num_var22_hace3', 'num_var22_ult3', 'num_var45_hace3', 'num_var45_ult3', 'saldo_medio_var5_hace2', 'saldo_medio_var5_hace3', 'saldo_medio_var5_ult3', 'var38']
Value:  ['var15', 'saldo_var5', 'saldo_var30', 'saldo_var42', 'num_var22_hace2', 'num_var22_ult3', 'num_var45_hace3', 'num_var45_ult3', 'saldo_medio_var5_hace2', 'saldo_medio_var5_hace3', 'saldo_medio_var5_ult3', 'var38']
Value:  ['var15', 'imp_op_var39_efect_ult3', 'saldo_var30', 'num_var22_ult1', 'saldo_medio_var5_hace3', 'var38']
Value:  ['var15', 'saldo_var30', 'saldo_medio_var5_hace3', 'var38']
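
The loop above fits the scaler on the training columns and reuses the learned statistics for the test columns. The following sketch verifies on random data (an assumption, not the real features) that `transform` on the test split is the same as standardizing with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_te = rng.normal(loc=10.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean_ and scale_ from the training split
X_te_scaled = scaler.transform(X_te)      # reuses the training statistics

# Manual standardization with the training mean and (population) standard deviation.
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_te_scaled, manual))
```

The test split is never used to compute the statistics, so its scaled columns will generally not have exactly zero mean.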


## Training models

In [81]:
n_jobs = None

### UnderSampling

In [82]:
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

from imblearn.under_sampling import (
    RandomUnderSampler,
    CondensedNearestNeighbour,
    TomekLinks,
    OneSidedSelection,
    EditedNearestNeighbours,
    RepeatedEditedNearestNeighbours,
    AllKNN,
    NeighbourhoodCleaningRule,
    NearMiss,
    InstanceHardnessThreshold
)

undersampler_dict = {
    'random': RandomUnderSampler(
        sampling_strategy='auto',
        random_state=0,
        replacement=False),

    'tomek': TomekLinks(
        sampling_strategy='auto',
        n_jobs=n_jobs),

    'enn': EditedNearestNeighbours(
        sampling_strategy='auto',
        n_neighbors=3,
        kind_sel='all',
        n_jobs=n_jobs),

    'allknn': AllKNN(
        sampling_strategy='auto',
        n_neighbors=3,
        kind_sel='all',
        n_jobs=n_jobs),
}


### Oversampling

In [83]:
from sklearn.svm import SVC
from imblearn.over_sampling import (
    RandomOverSampler,
    SMOTE,
    ADASYN,
    BorderlineSMOTE,
    SVMSMOTE,
)

oversampler_dict = {
    'smote': SMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        n_jobs=n_jobs),

    'border1': BorderlineSMOTE(
        sampling_strategy='auto',
        random_state=0,
        k_neighbors=5,
        m_neighbors=10,
        kind='borderline-1',
        n_jobs=n_jobs),

    'adasyn': ADASYN(
        sampling_strategy='auto',
        random_state=0,
        n_neighbors=5,
        n_jobs=n_jobs),
}

### Over-Under sampling

In [84]:
from imblearn.combine import SMOTEENN, SMOTETomek

combined_sampler_dict = {
    'smenn': SMOTEENN(
        sampling_strategy='auto',
        random_state=0,
        smote=SMOTE(sampling_strategy='auto', random_state=0, k_neighbors=5),
        enn=EditedNearestNeighbours(
            sampling_strategy='auto', n_neighbors=3, kind_sel='all'),
        n_jobs=n_jobs),

    'smtomek': SMOTETomek(
        sampling_strategy='auto',
        random_state=0,
        smote=SMOTE(sampling_strategy='auto', random_state=0, k_neighbors=5),
        tomek=TomekLinks(sampling_strategy='all'),
        n_jobs=n_jobs),
}

### Train the base models

In [89]:
#Import libs
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate

In [90]:
base_models = {
    'logistic_regression': LogisticRegression(),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=10),
    'k_nearest_neighbors': KNeighborsClassifier(n_neighbors=2),
    'decision_tree': DecisionTreeClassifier(),
    'adaboost_classifier': AdaBoostClassifier(),
    'neural_network': MLPClassifier(hidden_layer_sizes=(10, 5, ), activation='logistic', max_iter=300),
}

In [91]:

def run_model(model, X_train, y_train, resampler):
    pipeline = make_pipeline(resampler, model)
    cv_results =  cross_validate(pipeline, 
                                 X_train, 
                                 y_train, 
                                 cv=3, 
                                 scoring='roc_auc', 
                                 n_jobs=n_jobs,)
    return cv_results['test_score'].mean(), cv_results['test_score'].std()
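
`run_model` relies on the imbalanced-learn pipeline, which applies the resampler when fitting and leaves the validation fold untouched. The sketch below spells out that idea by hand with NumPy and scikit-learn only. `random_undersample` is a hypothetical helper written for this example, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def random_undersample(X, y, rng):
    """Hypothetical helper: keep all positives and an equal number of random negatives."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]


X, y = make_classification(
    n_samples=3000, n_features=8, weights=[0.95, 0.05], random_state=0,
)
rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

aucs = []
for train_idx, valid_idx in cv.split(X, y):
    # Resample the training fold only; the validation fold keeps its real class ratio.
    X_res, y_res = random_undersample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    proba = model.predict_proba(X[valid_idx])[:, 1]
    aucs.append(roc_auc_score(y[valid_idx], proba))

print('AUC: {:.3f} +/- {:.3f}'.format(np.mean(aucs), np.std(aucs)))
```

Resampling before splitting would let resampled or synthetic points influence the validation folds and tends to inflate the reported AUC.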

In [93]:

results_dict = {}
std_dict = {}

max_score = 0
max_model_name = ''
max_dataset_name = ''
max_resampler_name = ''

# Evaluate every resampler in the same order as before: under-, over-, then combined sampling.
all_resamplers = {**undersampler_dict, **oversampler_dict, **combined_sampler_dict}

for model_name, model in base_models.items():
    results_dict[model_name] = {}
    std_dict[model_name] = {}
    print("="*40)
    print('Model: ', model_name)
    print("="*40)
    for dataset_name, dataset in scaled_data.items():

        results_dict[model_name][dataset_name] = {}
        std_dict[model_name][dataset_name] = {}

        print('__Dataset: {}'.format(dataset_name))

        # Use a separate name so the global X_train DataFrame is not overwritten.
        X_scaled = dataset['train']

        for resampler_name, resampler in all_resamplers.items():
            print('__Resampler: {}'.format(resampler_name))
            auc, auc_std = run_model(model, X_scaled, y_train, resampler)
            print('________AUC: {:.3f} +/- {:.3f}'.format(auc, auc_std))
            results_dict[model_name][dataset_name][resampler_name] = auc
            std_dict[model_name][dataset_name][resampler_name] = auc_std
            # Track the best combination seen so far.
            if auc > max_score:
                max_score = auc
                max_model_name = model_name
                max_dataset_name = dataset_name
                max_resampler_name = resampler_name
            print('________Max AUC: {:.3f} (model: {}, dataset: {}, resampler: {})'.format(max_score, max_model_name, max_dataset_name, max_resampler_name))

Model:  logistic_regression
__Dataset: extra_trees_classifier_scaled
__Resampler: random
________AUC: 0.764 +/- 0.003
________Max AUC: 0.764 (model: logistic_regression, dataset: extra_trees_classifier_scaled, resampler: random)
__Resampler: tomek
________AUC: 0.763 +/- 0.001
________Max AUC: 0.764 (model: logistic_regression, dataset: extra_trees_classifier_scaled, resampler: random)
__Resampler: enn
________AUC: 0.764 +/- 0.001
________Max AUC: 0.764 (model: logistic_regression, dataset: extra_trees_classifier_scaled, resampler: random)
__Resampler: allknn
________AUC: 0.764 +/- 0.001
________Max AUC: 0.764 (model: logistic_regression, dataset: extra_trees_classifier_scaled, resampler: random)
__Resampler: smote
________AUC: 0.770 +/- 0.002
________Max AUC: 0.770 (model: logistic_regression, dataset: extra_trees_classifier_scaled, resampler: smote)
__Resampler: border1
________AUC: 0.775 +/- 0.001
________Max AUC: 0.775 (model: logistic_regression, dataset: extra_trees_classifier_sca

In [97]:
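from collections.abc import MutableMapping

def flatten_dict(d, parent_key='', sep='.'):
    # Flatten nested dicts into one level, joining the keys with `sep`.
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # collections.MutableMapping was removed in Python 3.10; use collections.abc.
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)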

flatten_result = flatten_dict(results_dict)
result_pd = pd.DataFrame.from_dict(flatten_result, orient='index')
result_pd.columns = ['AUC']
result_pd.to_csv('result.csv')
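
The flattening step turns the nested `model -> dataset -> resampler` dictionary into keys of the form `model.dataset.resampler`. The sketch below shows that layout on a toy dictionary with made-up scores (not results from this notebook) and how the best key can be split back into its three parts. `flatten` here is a stand-alone copy of the same idea.

```python
from collections.abc import MutableMapping


def flatten(d, parent_key='', sep='.'):
    """Flatten nested mappings into one dict with sep-joined keys."""
    items = {}
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.update(flatten(v, new_key, sep=sep))
        else:
            items[new_key] = v
    return items


# Made-up scores, only to show the key layout: model.dataset.resampler
toy_results = {
    'model_a': {'dataset_x': {'tomek': 0.9, 'smote': 0.8}},
    'model_b': {'dataset_x': {'tomek': 0.7}},
}

flat = flatten(toy_results)
print(flat)
# {'model_a.dataset_x.tomek': 0.9, 'model_a.dataset_x.smote': 0.8, 'model_b.dataset_x.tomek': 0.7}

best_key = max(flat, key=flat.get)
model_name, dataset_name, resampler_name = best_key.split('.')
print(model_name, dataset_name, resampler_name)
# model_a dataset_x tomek
```

Splitting on `'.'` only works as long as none of the model, dataset, or resampler names contains a dot.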

In [104]:
result_pd.sort_values(by='AUC', ascending=False).head(10)

| | AUC |
|---|---|
| adaboost_classifier.recursive_feature_elimination_scaled.tomek | 0.825461 |
| adaboost_classifier.recursive_feature_elimination_scaled.enn | 0.82515 |
| adaboost_classifier.recursive_feature_elimination_scaled.allknn | 0.824081 |
| adaboost_classifier.random_forest_classifier_scaled.enn | 0.823706 |
| adaboost_classifier.random_forest_classifier_scaled.allknn | 0.823046 |
| adaboost_classifier.extra_trees_classifier_scaled.enn | 0.822949 |
| adaboost_classifier.extra_trees_classifier_scaled.tomek | 0.822921 |
| adaboost_classifier.random_forest_classifier_scaled.tomek | 0.822868 |
| adaboost_classifier.extra_trees_classifier_scaled.allknn | 0.822271 |
| adaboost_classifier.recursive_feature_elimination_scaled.random | 0.820927 |


In [102]:
result_pd.sort_values(by='AUC', ascending=False).tail(10)

| | AUC |
|---|---|
| k_nearest_neighbors.random_forest_classifier_scaled.smenn | 0.579293 |
| decision_tree.recursive_feature_elimination_scaled.smenn | 0.578125 |
| decision_tree.extra_trees_classifier_scaled.tomek | 0.577728 |
| decision_tree.random_forest_classifier_scaled.smenn | 0.574562 |
| decision_tree.extra_trees_classifier_scaled.smenn | 0.572037 |
| decision_tree.random_forest_classifier_scaled.tomek | 0.571528 |
| k_nearest_neighbors.extra_trees_classifier_scaled.tomek | 0.565892 |
| k_nearest_neighbors.random_forest_classifier_scaled.tomek | 0.564162 |
| k_nearest_neighbors.recursive_feature_elimination_scaled.tomek | 0.563206 |
| k_nearest_neighbors.recursive_feature_addition_scaled.tomek | 0.558301 |


- Best model: adaboost_classifier
- Best feature set: recursive_feature_elimination_scaled
- Best resamplers: tomek or enn
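
A possible next step is to refit the chosen model and write predicted probabilities in an `ID`/`TARGET` layout, since the ranking metric used here is ROC AUC. The sketch below is only an outline under stated assumptions: it uses synthetic data and placeholder IDs instead of `test_id`, skips the resampling step to stay self-contained, and keeps AdaBoost's default settings.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in: the first 800 rows play the training set, the rest the test set.
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.96, 0.04], random_state=0,
)
X_fit, y_fit = X[:800], y[:800]
X_new = X[800:]
new_ids = np.arange(1, len(X_new) + 1)  # placeholder IDs, not the real test_id

model = AdaBoostClassifier(random_state=0).fit(X_fit, y_fit)

# AUC depends on the ranking, so submit probabilities rather than hard labels.
submission = pd.DataFrame({
    'ID': new_ids,
    'TARGET': model.predict_proba(X_new)[:, 1],
})

# to_csv() without a path returns the CSV text; pass a file name to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```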