# Model Selection: Random Forest vs Decision Tree

This notebook compares the performance of random forest (RF) and decision tree (DT) classifiers.

# No preprocessing + one-hot encoding + grid search

First, we compare the performance of RF and DT with no preprocessing and plain one-hot encoding.

We use the **original dataset**, without the "year" and "timestamp" features, and apply **one-hot encoding** to the categorical features.

We run a basic grid search to find the best parameters, then compare the best MCC (Matthews correlation coefficient) scores of the two models.
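
For reference, the MCC used as the scoring metric throughout this notebook can be computed directly from the four confusion-matrix counts. The snippet below is a minimal pure-Python sketch, not the scikit-learn implementation; the function name `mcc_from_labels` and the toy labels are made up for illustration, and returning 0 when a marginal count is zero is a convention chosen here.

```python
import math

def mcc_from_labels(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention chosen for this sketch: score 0 when any marginal count is zero
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy labels (hypothetical, for illustration only)
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = mcc_from_labels(y_true, y_pred)
print(round(score, 4))
```

Unlike accuracy, the score uses all four counts, which is why it is a reasonable choice for a label as imbalanced as the one in this dataset.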

Results:
> Decision Tree: 0.5098119472523316 <br> 
> Random Forest: **0.5588019423045544**


In [1]:
# Initialize the module
%reset -f
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import matthews_corrcoef, make_scorer

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

data=pd.read_csv("..\\bank-full.csv", sep=';', engine="python")

np.random.seed(12345)

# enc = OneHotEncoder(handle_unknown = 'ignore', sparse = True)
y = data['y'].replace({"yes":1,"no":0})
data.drop(columns = ['y'], inplace = True)

integer_list = []
discrete_list = []
for col in data.columns:
#     print(col, ":", data[col].dtype)    
    if data[col].dtype == "object":
        discrete_list.append(col)
    else:
        integer_list.append(col)
        data[col] = data[col].astype(float)
X = pd.get_dummies(data, columns=discrete_list)

In [3]:
# decision tree + grid search

myscorer = make_scorer(matthews_corrcoef)
parameters = {'criterion':['gini','entropy'],\
              'max_depth':[3, 4, 5, 10, 15],\
              'class_weight':['balanced', None, {0: 1, 1: 2}, {0: 1, 1: 5},{0: 1, 1: 2.125}],
              'random_state': [888]}
mygridsearch = GridSearchCV(estimator = DecisionTreeClassifier(),\
                            param_grid = parameters,\
                            scoring = myscorer,\
                            cv = KFold(n_splits=5, shuffle=True, random_state = 888))
mygridsearch.fit(X, y)
print(mygridsearch.best_params_)
print(mygridsearch.best_score_)
print(mygridsearch.cv_results_)

{'class_weight': {0: 1, 1: 2}, 'criterion': 'gini', 'max_depth': 10, 'random_state': 888}
0.5098119472523316
{'mean_fit_time': array([0.12431231, 0.15954905, 0.17810616, 0.34561911, 0.46528525,
       0.14181929, 0.1728723 , 0.19539142, 0.34318576, 0.44814696,
       0.12246718, 0.15562434, 0.17479501, 0.31853724, 0.43235998,
       0.13992481, 0.15149541, 0.20238571, 0.33012881, 0.43746076,
       0.12838612, 0.15645957, 0.20508761, 0.34499626, 0.44610162,
       0.15135589, 0.14920926, 0.17517319, 0.30784187, 0.389217  ,
       0.1128479 , 0.1490231 , 0.17394395, 0.29134784, 0.40618424,
       0.12351317, 0.15229602, 0.17830677, 0.30726352, 0.3829875 ,
       0.1185441 , 0.14719186, 0.17588415, 0.32890048, 0.43937979,
       0.13105111, 0.15995789, 0.18870816, 0.32053919, 0.40891008]), 'std_fit_time': array([0.00133722, 0.00302426, 0.00764502, 0.02307031, 0.02638922,
       0.00410458, 0.01312648, 0.01009747, 0.02389502, 0.02335148,
       0.00643217, 0.00390046, 0.00933061, 0.018026

The best score of the decision tree in this case is 0.5098119472523316.
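
Printing the whole `cv_results_` dictionary produces very long output. A compact alternative is to list only the best few candidates. The sketch below assumes a dictionary with the `rank_test_score`, `mean_test_score`, `std_test_score` and `params` keys that `GridSearchCV` exposes; the helper name `top_candidates` and the values in `fake_results` are made up for illustration.

```python
def top_candidates(cv_results, k=3):
    """Return the k best (rank, mean, std, params) rows of a cv_results_-style dict."""
    rows = zip(cv_results["rank_test_score"],
               cv_results["mean_test_score"],
               cv_results["std_test_score"],
               cv_results["params"])
    # Rank 1 is the best candidate, so sort in ascending rank order
    return sorted(rows, key=lambda row: row[0])[:k]

# Made-up values standing in for mygridsearch.cv_results_
fake_results = {
    "rank_test_score": [2, 1, 3],
    "mean_test_score": [0.49, 0.51, 0.45],
    "std_test_score": [0.01, 0.02, 0.01],
    "params": [{"max_depth": 5}, {"max_depth": 10}, {"max_depth": 15}],
}
for rank, mean, std, params in top_candidates(fake_results, k=2):
    print(f"#{rank}: {mean:.4f} +/- {std:.4f}  {params}")
```

In the notebook, the same helper could be called as `top_candidates(mygridsearch.cv_results_)` after fitting.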

In [4]:
# random forest + grid search

myscorer = make_scorer(matthews_corrcoef)
parameters = {
    'n_estimators' : [50, 100],
    'max_depth': [15],
    'criterion' : ['gini'],
    'min_samples_leaf': [10, 20, 40,],
    'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 4}],
    'random_state': [888]
}
mygridsearch = GridSearchCV(estimator = RandomForestClassifier(),\
                            param_grid = parameters,\
                            scoring = myscorer,\
                            cv = KFold(n_splits=5, shuffle=True, random_state = 888),
                            verbose = 2)
mygridsearch.fit(X, y)
print(mygridsearch.best_params_)
print(mygridsearch.best_score_)
print(mygridsearch.cv_results_)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   2.0s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.9s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   2.0s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.9s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   2.3s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=100, random_state=888; total time=   4.3s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_est

[CV] END class_weight={0: 1, 1: 3}, criterion=gini, max_depth=15, min_samples_leaf=40, n_estimators=100, random_state=888; total time=   3.3s
[CV] END class_weight={0: 1, 1: 3}, criterion=gini, max_depth=15, min_samples_leaf=40, n_estimators=100, random_state=888; total time=   3.3s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.8s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   2.0s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.9s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.9s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.9s
[CV] END cl

The best score of the random forest is 0.5578066624582119

# Add column year + one-hot encoding + grid search

Second, we **add the feature "year"**, as mentioned in the feature preprocessing part. We do **not specify a customized encoding method** here; we simply one-hot encode all the categorical features.

After that, we run the basic grid search again.

Results:
> Decision Tree: 0.5623514379422441 <br> 
> Random Forest: **0.5896692297640508**

In [5]:
# Initialize the module
%reset -f
from xgboost.sklearn import XGBClassifier

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import matthews_corrcoef, make_scorer

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline as imbpipeline

from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv("..\\bank-full-add_timestamp.csv", sep=',', engine="python")

y = data['y'].replace({"yes":1,"no":0})
data.drop(columns = ['y', 'timestamp'], inplace = True)

data['weekday'] = data['weekday'].astype(str)
data['year'] = data['year'].astype(str)
data['month'] = data['month'].astype(str)

# data.head()

number_list = []
discrete_list = []
for col in data.columns:
#     print(col, ":", data[col].dtype)
    if data[col].dtype == "object":
        discrete_list.append(col)
    else:
        number_list.append(col)
        
print(number_list)
print(discrete_list)

X = pd.get_dummies(data, columns=discrete_list)

['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
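
Casting `weekday`, `year` and `month` to strings is what moves them into `discrete_list`, so that they are one-hot encoded instead of being treated as numbers. The pure-Python sketch below imitates that step on a made-up list of month numbers; `one_hot` is a hypothetical helper, not the pandas implementation. It also shows why the resulting columns come out in lexicographic order (`month_1`, `month_10`, ...), as seen later in the printed column index.

```python
def one_hot(values, prefix):
    """One-hot encode categorical values into {column_name: [0/1, ...]}."""
    categories = sorted(set(values))  # string categories sort lexicographically
    return {f"{prefix}_{c}": [int(v == c) for v in values] for c in categories}

months = [1, 2, 10, 11, 12]               # made-up sample of month numbers
as_strings = [str(m) for m in months]     # mirrors data['month'].astype(str)
encoded = one_hot(as_strings, prefix="month")
print(list(encoded))
# ['month_1', 'month_10', 'month_11', 'month_12', 'month_2']
```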


In [6]:
# decision tree + grid search

myscorer = make_scorer(matthews_corrcoef)
parameters = {'criterion':['gini','entropy'],\
              'max_depth':[3, 4, 5, 10, 15],\
              'class_weight':['balanced', None, {0: 1, 1: 2}, {0: 1, 1: 5},{0: 1, 1: 2.125}],
              'random_state': [888]}
mygridsearch = GridSearchCV(estimator = DecisionTreeClassifier(),\
                            param_grid = parameters,\
                            scoring = myscorer,\
                            cv = KFold(n_splits=5, shuffle=True, random_state = 888),
                            verbose = 2)
mygridsearch.fit(X, y)
print(mygridsearch.best_params_)
print(mygridsearch.best_score_)
print(mygridsearch.cv_results_)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END class_weight=balanced, criterion=gini, max_depth=3, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=3, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=3, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=3, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=3, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=4, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=4, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=4, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_depth=4, random_state=888; total time=   0.1s
[CV] END class_weight=balanced, criterion=gini, max_dept

[CV] END class_weight=None, criterion=entropy, max_depth=4, random_state=888; total time=   0.1s
[CV] END class_weight=None, criterion=entropy, max_depth=5, random_state=888; total time=   0.1s
[CV] END class_weight=None, criterion=entropy, max_depth=5, random_state=888; total time=   0.1s
[CV] END class_weight=None, criterion=entropy, max_depth=5, random_state=888; total time=   0.1s
[CV] END class_weight=None, criterion=entropy, max_depth=5, random_state=888; total time=   0.1s
[CV] END class_weight=None, criterion=entropy, max_depth=5, random_state=888; total time=   0.1s
[CV] END class_weight=None, criterion=entropy, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight=None, criterion=entropy, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight=None, criterion=entropy, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight=None, criterion=entropy, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight=None

[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 5}, criterion=gini, max_depth=15, random_

[CV] END class_weight={0: 1, 1: 2.125}, criterion=entropy, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight={0: 1, 1: 2.125}, criterion=entropy, max_depth=10, random_state=888; total time=   0.2s
[CV] END class_weight={0: 1, 1: 2.125}, criterion=entropy, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 2.125}, criterion=entropy, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 2.125}, criterion=entropy, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 2.125}, criterion=entropy, max_depth=15, random_state=888; total time=   0.3s
[CV] END class_weight={0: 1, 1: 2.125}, criterion=entropy, max_depth=15, random_state=888; total time=   0.3s
{'class_weight': {0: 1, 1: 2.125}, 'criterion': 'entropy', 'max_depth': 10, 'random_state': 888}
0.5623514379422441
{'mean_fit_time': array([0.14730806, 0.18114376, 0.19815512, 0.31997013, 0.36047945,
       0.14942117, 0.17868161

The best score of the decision tree is 0.5623514379422441.
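
The "50 candidates, totalling 250 fits" line in the log follows directly from the size of the parameter grid. As a quick check, the sketch below enumerates the same decision tree grid with `itertools.product`; it only counts combinations and does not fit any model.

```python
from itertools import product

# Same grid as the decision tree search above
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 10, 15],
    "class_weight": ["balanced", None, {0: 1, 1: 2}, {0: 1, 1: 5}, {0: 1, 1: 2.125}],
    "random_state": [888],
}
names = list(param_grid)
# One dict per combination of parameter values
candidates = [dict(zip(names, combo)) for combo in product(*param_grid.values())]
n_folds = 5
print(len(candidates), len(candidates) * n_folds)
```

The count grows multiplicatively (2 × 5 × 5 × 1 here), which is worth keeping in mind before adding another parameter to the grid.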

In [7]:
# random forest + grid search

myscorer = make_scorer(matthews_corrcoef)
parameters = {
    'n_estimators' : [50, 100],
    'max_depth': [15],
    'criterion' : ['gini'],
    'min_samples_leaf': [10, 20, 40,],
    'class_weight': [{0: 1, 1: 2}, {0: 1, 1: 3}, {0: 1, 1: 4}],
    'random_state': [888]
}
mygridsearch = GridSearchCV(estimator = RandomForestClassifier(),\
                            param_grid = parameters,\
                            scoring = myscorer,\
                            cv = KFold(n_splits=5, shuffle=True, random_state = 888),
                            verbose = 2)
mygridsearch.fit(X, y)
print(mygridsearch.best_params_)
print(mygridsearch.best_score_)
print(mygridsearch.cv_results_)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.9s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   2.0s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   2.0s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.9s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   2.0s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=100, random_state=888; total time=   4.0s
[CV] END class_weight={0: 1, 1: 2}, criterion=gini, max_depth=15, min_samples_leaf=10, n_est

[CV] END class_weight={0: 1, 1: 3}, criterion=gini, max_depth=15, min_samples_leaf=40, n_estimators=100, random_state=888; total time=   3.2s
[CV] END class_weight={0: 1, 1: 3}, criterion=gini, max_depth=15, min_samples_leaf=40, n_estimators=100, random_state=888; total time=   3.3s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.7s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.7s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.7s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.7s
[CV] END class_weight={0: 1, 1: 4}, criterion=gini, max_depth=15, min_samples_leaf=10, n_estimators=50, random_state=888; total time=   1.7s
[CV] END cl

The best score of the random forest is 0.5896692297640508

# Add column year + one-hot encoding + SMOTE + grid search

In the tests above, we handled the class imbalance only by tuning "class_weight" inside the algorithm. Given the imbalanced labels, we try a resampling strategy (SMOTE) here.

Results:
> Decision Tree: 0.510029414063708 <br> 
> Random Forest: **0.5348282104486837**

Besides the score, note that training with SMOTE takes much longer than training the model alone, yet ends up with a worse score.
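
Part of the extra training time comes from the number of synthetic rows added to each training fold. The sketch below estimates that number for a float `sampling_strategy`, under the assumption that the value is the desired minority-to-majority ratio after resampling (our reading of the imbalanced-learn convention). The function name `oversampling_plan` and the class counts are illustrative, not taken from the dataset.

```python
def oversampling_plan(n_majority, n_minority, sampling_strategy):
    """Return (target minority size, synthetic rows to add) for a float ratio."""
    target_minority = int(n_majority * sampling_strategy)
    if target_minority < n_minority:
        # Over-sampling can only add rows, so a lower ratio cannot be reached
        raise ValueError("ratio is below the current minority/majority ratio")
    return target_minority, target_minority - n_minority

# Illustrative class counts for one training fold (hypothetical numbers)
n_majority, n_minority = 32000, 4200
for ratio in (0.5, 0.4, 0.3):
    target, synthetic = oversampling_plan(n_majority, n_minority, ratio)
    print(ratio, target, synthetic)
```

A higher ratio means more synthetic rows to generate and fit on, which is consistent with the longer fit times logged for `sampling_strategy=0.5` than for `0.3`.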

In [8]:
# Initialize the module
%reset -f
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import matthews_corrcoef, make_scorer

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline as imbpipeline

data=pd.read_csv("..\\bank-full.csv", sep=';', engine="python")

np.random.seed(12345)

# enc = OneHotEncoder(handle_unknown = 'ignore', sparse = True)
y = data['y'].replace({"yes":1,"no":0})
data.drop(columns = ['y'], inplace = True)

integer_list = []
discrete_list = []
for col in data.columns:
#     print(col, ":", data[col].dtype)    
    if data[col].dtype == "object":
        discrete_list.append(col)
    else:
        integer_list.append(col)
        data[col] = data[col].astype(float)
X = pd.get_dummies(data, columns=discrete_list)

In [9]:
# grid search + imbpipeline + SMOTENC + decision tree
def mcc_calculate(y_true, y_predicted):
    # sklearn lays the confusion matrix out as [[TN, FP], [FN, TP]] for labels (0, 1).
    # MCC is symmetric under swapping the two classes, so the score is unchanged.
    conf_matrix = confusion_matrix(y_true, y_predicted)
#     print(y_true.value_counts()[0], y_true.value_counts()[1])
    TN = conf_matrix[0][0]
    FP = conf_matrix[0][1]
    FN = conf_matrix[1][0]
    TP = conf_matrix[1][1]
    if TP + FP == 0  or TP + FN == 0 or TN + FP == 0 or TN + FN == 0:
#         print(conf_matrix)
        return 0
    return float(TP * TN - FP * FN ) / np.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))

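model = imbpipeline([
        # NOTE: this mask is built from the columns of `data` (before encoding),
        # while the pipeline is fit on the one-hot encoded `X`.
        ('sampling', SMOTENC(categorical_features = [data.dtypes==object])),
        ('classification', DecisionTreeClassifier())
    ])
## imbpipeline does not resample the test set.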

myscorer = make_scorer(mcc_calculate)
parameters = {
    'sampling__sampling_strategy': [0.5, 0.4, 0.3],
    'classification__criterion' :['gini'],
    'classification__class_weight':[{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 2, 1: 1}],
    'classification__random_state':[888],
    'classification__max_depth':[5, 10, 15]
}

mygridsearch = GridSearchCV(estimator = model,\
                            param_grid = parameters,\
                            scoring = myscorer,\
                            cv = KFold(n_splits=5, shuffle=True, random_state = 888),
                            verbose = 2)
mygridsearch.fit(X, y)
print(mygridsearch.best_params_)
print(mygridsearch.best_score_)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=5, classification__random_state=888, sampling__sampling_strategy=0.5; total time=  24.3s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=5, classification__random_state=888, sampling__sampling_strategy=0.5; total time=  24.7s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=5, classification__random_state=888, sampling__sampling_strategy=0.5; total time=  22.5s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=5, classification__random_state=888, sampling__sampling_strategy=0.5; total time=  22.0s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=5, classification__random_state=

[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  11.7s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  11.6s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  11.8s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  11.7s
[CV] END classification__class_weight={0: 1, 1: 2}, classification__criterion=gini, classification__max_depth=5, classification__random_state=888, sampling__sampling_strategy=0.5; total time=  22.7s
[

[CV] END classification__class_weight={0: 1, 1: 2}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.4; total time=  16.4s
[CV] END classification__class_weight={0: 1, 1: 2}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.4; total time=  16.7s
[CV] END classification__class_weight={0: 1, 1: 2}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  11.8s
[CV] END classification__class_weight={0: 1, 1: 2}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  11.8s
[CV] END classification__class_weight={0: 1, 1: 2}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  11.5s


[CV] END classification__class_weight={0: 2, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.4; total time=  16.8s
[CV] END classification__class_weight={0: 2, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.4; total time=  16.7s
[CV] END classification__class_weight={0: 2, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.4; total time=  16.0s
[CV] END classification__class_weight={0: 2, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.4; total time=  16.1s
[CV] END classification__class_weight={0: 2, 1: 1}, classification__criterion=gini, classification__max_depth=15, classification__random_state=888, sampling__sampling_strategy=0.4; total time=  16.7s


The best result of decision tree is 0.510029414063708

In [10]:
# grid search + imbpipeline + SMOTENC + random forest
def mcc_calculate(y_true, y_predicted):
    # sklearn lays the confusion matrix out as [[TN, FP], [FN, TP]] for labels (0, 1).
    # MCC is symmetric under swapping the two classes, so the score is unchanged.
    conf_matrix = confusion_matrix(y_true, y_predicted)
#     print(y_true.value_counts()[0], y_true.value_counts()[1])
    TN = conf_matrix[0][0]
    FP = conf_matrix[0][1]
    FN = conf_matrix[1][0]
    TP = conf_matrix[1][1]
    if TP + FP == 0  or TP + FN == 0 or TN + FP == 0 or TN + FN == 0:
#         print(conf_matrix)
        return 0
    return float(TP * TN - FP * FN ) / np.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))

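model = imbpipeline([
        # NOTE: this mask is built from the columns of `data` (before encoding),
        # while the pipeline is fit on the one-hot encoded `X`.
        ('sampling', SMOTENC(categorical_features = [data.dtypes==object])),
        ('classification', RandomForestClassifier())
    ])
## imbpipeline does not resample the test set.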

myscorer = make_scorer(mcc_calculate)
parameters = {
    'sampling__sampling_strategy': [0.3, 0.4, 0.5],
    'classification__criterion' :['gini'],
    'classification__class_weight':[{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 2, 1: 1}],
    'classification__random_state':[888]
}

mygridsearch = GridSearchCV(estimator = model,\
                            param_grid = parameters,\
                            scoring = myscorer,\
                            cv = KFold(n_splits=5, shuffle=True, random_state = 888),
                            verbose = 2)
mygridsearch.fit(X, y)
print(mygridsearch.best_params_)
print(mygridsearch.best_score_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  17.8s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  17.6s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  17.6s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  17.4s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classification__random_state=888, sampling__sampling_strategy=0.3; total time=  17.4s
[CV] END classification__class_weight={0: 1, 1: 1}, classification__criterion=gini, classi

The best result of random forest is 0.5348282104486837

# Generate cross term by duration + backward selection + grid search 

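In this section we use duration, which seems quite important in our model, together with the other numeric features to generate a series of new features by multiplying each numeric column with each one-hot encoded column.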

We then use the feature importances from the decision tree and the random forest to repeatedly eliminate the least important of these features.

Results:
> Decision Tree: 0.510029414063708 <br> 
> Random Forest: **0.5348282104486837**
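
The elimination loop in this section removes features in steps that shrink as the feature set gets smaller. The sketch below reproduces only that schedule, using the thresholds from the loop (20 features per round above 200, 10 above 150, otherwise 5, stopping at 10). It leaves out the branch for more than 1000 features, whose step size depends on the data, and the function names are made up for illustration.

```python
def drop_step(n_features):
    """Number of features removed in one round, given the current count."""
    if n_features > 200:
        return 20
    if n_features > 150:
        return 10
    return 5

def elimination_schedule(n_start, n_stop=10):
    """Feature counts at which the model is evaluated, from n_start downwards."""
    schedule = []
    n = n_start
    while n > n_stop:
        schedule.append(n)
        n -= drop_step(n)
    return schedule

schedule = elimination_schedule(385)
print(schedule[:3], schedule[-3:], len(schedule))
```

Starting from 385 features, the schedule passes through the counts that appear in the log (385, 365, 345, ..., 125, 120, 115, ..., 35, 30).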

In [17]:
# Initialize the module
%reset -f
from copy import deepcopy

from xgboost.sklearn import XGBClassifier

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import matthews_corrcoef, make_scorer

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline as imbpipeline

from sklearn.preprocessing import PolynomialFeatures

data = pd.read_csv("..\\bank-full-add_timestamp.csv", sep=',', engine="python")

y = data['y'].replace({"yes":1,"no":0})
data.drop(columns = ['y', 'day', 'timestamp'], inplace = True)
data['weekday'] = data['weekday'].astype(str)
data['year'] = data['year'].astype(str)
data['month'] = data['month'].astype(str)

# data.head()

number_list = []
discrete_list = []
for col in data.columns:
#     print(col, ":", data[col].dtype)
    if data[col].dtype == "object":
        discrete_list.append(col)
    else:
        number_list.append(col)
        
print(number_list)
print(discrete_list)

X = pd.get_dummies(data, columns = discrete_list)

print(X.columns)

for col in X.columns.drop(['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']):
    for num_col in ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']:
        X[num_col + "_" + col] = X[num_col] * X[col]
X["duration" + "_" + "age"] = X['age'] * X['duration']

['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
Index(['age', 'balance', 'duration', 'campaign', 'pdays', 'previous',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'default_no', 'default_yes', 'housing_no',
       'housing_yes', 'loan_no', 'loan_yes', 'contact_cellular',
       'contact_telephone', 'contact_unknown', 'month_1', 'month_10',
       'month_11', 'month_12', 'month_2', 'month_3', 'month_4', 'month_5',
       'month_6', 'month_7', 'month_8', 'month_9', 'poutcome_failure',
       'poutcome_other', 'poutcome_su

In [18]:
print(X.shape)

(45211, 385)
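
The column count can be checked by hand. The loop above keeps the 6 numeric columns and the dummy columns, adds one product per (numeric, dummy) pair, and adds the single `duration_age` column. The sketch below solves that count for the number of dummy columns implied by the reported shape; `total_columns` is a hypothetical helper written for this check.

```python
def total_columns(n_numeric, n_dummy, n_extra=1):
    """Numeric + dummy columns, one product per (numeric, dummy) pair, plus extras."""
    return n_numeric + n_dummy + n_numeric * n_dummy + n_extra

n_numeric = 6
reported_columns = 385
# Solve n_numeric + d + n_numeric * d + 1 = reported_columns for d
n_dummy = (reported_columns - n_numeric - 1) // (n_numeric + 1)
print(n_dummy, total_columns(n_numeric, n_dummy))
```

The reported 385 columns are therefore consistent with 54 one-hot columns and 324 cross terms.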


In [24]:
# iteratively drop features using decision tree feature importance

np.random.seed(888)

newX = deepcopy(X)

# use the best parameter we get in the above parts
para = {'class_weight': {0: 1, 1: 2.125}, 
        'criterion': 'entropy', 
        'max_depth': 10, 
        'random_state': 888
       }


dtree = DecisionTreeClassifier(**para)
kf = KFold(n_splits=5, shuffle = True, random_state = 888)


while newX.shape[1] > 10:
    n_features = newX.shape[1]
    print("Feature Number: {:d}".format(n_features))
    score_array = np.zeros(5)
    for ii, (train_index, test_index) in enumerate(kf.split(newX, y)):
        # KFold yields positional indices, so select rows with iloc
        X_train, X_test = newX.iloc[train_index], newX.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        dtree.fit(X_train, y_train)
        y_predict = dtree.predict(X_test)
        myscore = matthews_corrcoef(y_test, y_predict)
        score_array[ii] = myscore
        print("---number of yes in train {:d}, in test {:d}, in predict {:d}, mcc score is {:.4f}".format(
            y_train.value_counts()[1], y_test.value_counts()[1], sum(y_predict), myscore))
    print("---mean score {:.4f}".format(np.mean(score_array)))
    # Refit on the full data and rank features from most to least important
    feature_importance = dtree.fit(newX, y).feature_importances_
    feature_importance_list = sorted(zip(newX.columns, feature_importance),
                                     key=lambda item: item[1], reverse=True)
    # The number of features dropped per round shrinks as the feature set gets smaller
    if n_features > 1000:
        drop_list = [name for name, importance in feature_importance_list if importance == 0.0]
    elif n_features > 200:
        drop_list = [name for name, _ in feature_importance_list[-20:]]
    elif n_features > 150:
        drop_list = [name for name, _ in feature_importance_list[-10:]]
    else:
        drop_list = [name for name, _ in feature_importance_list[-5:]]
    newX.drop(columns = drop_list, inplace = True)
    newX.to_csv(".\\temp\\X_cross_feature_{:d}_dt.csv".format(newX.shape[1]), index = False)

Feature Number: 385
---number of yes in train 4219, in test 1070, in predict 1533, mcc score is 0.5590
---number of yes in train 4230, in test 1059, in predict 1474, mcc score is 0.5432
---number of yes in train 4251, in test 1038, in predict 1452, mcc score is 0.5370
---number of yes in train 4249, in test 1040, in predict 1495, mcc score is 0.5403
---number of yes in train 4207, in test 1082, in predict 1377, mcc score is 0.5332
---mean score 0.5425
Feature Number: 365
---number of yes in train 4219, in test 1070, in predict 1529, mcc score is 0.5609
---number of yes in train 4230, in test 1059, in predict 1479, mcc score is 0.5391
---number of yes in train 4251, in test 1038, in predict 1453, mcc score is 0.5349
---number of yes in train 4249, in test 1040, in predict 1479, mcc score is 0.5388
---number of yes in train 4207, in test 1082, in predict 1391, mcc score is 0.5341
---mean score 0.5416
Feature Number: 345
---number of yes in train 4219, in test 1070, in predict 1537, mcc s

Feature Number: 125
---number of yes in train 4219, in test 1070, in predict 1549, mcc score is 0.5650
---number of yes in train 4230, in test 1059, in predict 1471, mcc score is 0.5411
---number of yes in train 4251, in test 1038, in predict 1399, mcc score is 0.5376
---number of yes in train 4249, in test 1040, in predict 1445, mcc score is 0.5352
---number of yes in train 4207, in test 1082, in predict 1414, mcc score is 0.5420
---mean score 0.5442
Feature Number: 120
---number of yes in train 4219, in test 1070, in predict 1577, mcc score is 0.5670
---number of yes in train 4230, in test 1059, in predict 1485, mcc score is 0.5366
---number of yes in train 4251, in test 1038, in predict 1424, mcc score is 0.5396
---number of yes in train 4249, in test 1040, in predict 1445, mcc score is 0.5362
---number of yes in train 4207, in test 1082, in predict 1423, mcc score is 0.5490
---mean score 0.5457
Feature Number: 115
---number of yes in train 4219, in test 1070, in predict 1583, mcc s

Feature Number: 35
---number of yes in train 4219, in test 1070, in predict 1501, mcc score is 0.5562
---number of yes in train 4230, in test 1059, in predict 1379, mcc score is 0.5400
---number of yes in train 4251, in test 1038, in predict 1537, mcc score is 0.5455
---number of yes in train 4249, in test 1040, in predict 1471, mcc score is 0.5267
---number of yes in train 4207, in test 1082, in predict 1444, mcc score is 0.5425
---mean score 0.5422
Feature Number: 30
---number of yes in train 4219, in test 1070, in predict 1496, mcc score is 0.5584
---number of yes in train 4230, in test 1059, in predict 1364, mcc score is 0.5393
---number of yes in train 4251, in test 1038, in predict 1488, mcc score is 0.5513
---number of yes in train 4249, in test 1040, in predict 1445, mcc score is 0.5248
---number of yes in train 4207, in test 1082, in predict 1461, mcc score is 0.5436
---mean score 0.5435
Feature Number: 25
---number of yes in train 4219, in test 1070, in predict 1473, mcc scor

In [25]:
# recursively drop feature by random forest
np.random.seed(888)  # seed NumPy's global RNG

newX = X.copy()  # deep copy by default, so X itself is left untouched

# use fewer estimators and a smaller depth to speed up the model
para = {'class_weight': {0: 1, 1: 4}, 
        'criterion': 'gini', 
        'max_depth': 10, 
        'n_estimators': 100, 
        'random_state': 888
       }

rftree = RandomForestClassifier(**para)
kf = KFold(n_splits=5, shuffle = True, random_state = 888)


while newX.shape[1] > 10:
    print("Feature Number: {:d}".format(newX.shape[1]))
    score_array = np.zeros(5) 
    for (ii, (train_index, test_index)) in enumerate(kf.split(newX, y)):
        # KFold yields positional indices, so select rows with iloc
        X_train, X_test = newX.iloc[train_index], newX.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        rftree.fit(X_train, y_train)
        y_predict = rftree.predict(X_test)
        myscore = matthews_corrcoef(y_test, y_predict)
        score_array[ii] = myscore
        print("---number of yes in train {:d}, in test {:d}, in predict {:d}, mcc score is {:.4f}".format(y_train.value_counts()[1],\
                                                                                                      y_test.value_counts()[1],\
                                                                                                      sum(y_predict),\
                                                                                                      myscore))
    print("---mean score {:.4f}".format(np.mean(score_array)))
    # refit on the full data and rank the features by importance (highest first)
    feature_importance = rftree.fit(newX, y).feature_importances_
    feature_importance_list = sorted(zip(newX.columns, feature_importance),
                                     key=lambda item: item[1], reverse=True)
    # drop schedule: fewer features are removed per round as the set shrinks
    n_features = len(feature_importance_list)
    if n_features > 1000:
        drop_list = [item[0] for item in feature_importance_list if item[1] == 0.0]
    elif n_features > 200:
        drop_list = [item[0] for item in feature_importance_list[-20:]]
    elif n_features > 150:
        drop_list = [item[0] for item in feature_importance_list[-10:]]
    else:
        drop_list = [item[0] for item in feature_importance_list[-5:]]
    newX.drop(columns=drop_list, inplace=True)
    newX.to_csv(".\\temp\\X_cross_feature_{:d}_rf.csv".format(newX.shape[1]), index = False)

Feature Number: 385
---number of yes in train 4219, in test 1070, in predict 1649, mcc score is 0.5957
---number of yes in train 4230, in test 1059, in predict 1604, mcc score is 0.5916
---number of yes in train 4251, in test 1038, in predict 1634, mcc score is 0.5729
---number of yes in train 4249, in test 1040, in predict 1603, mcc score is 0.5724
---number of yes in train 4207, in test 1082, in predict 1592, mcc score is 0.5900
---mean score 0.5845
Feature Number: 365
---number of yes in train 4219, in test 1070, in predict 1617, mcc score is 0.5965
---number of yes in train 4230, in test 1059, in predict 1610, mcc score is 0.5883
---number of yes in train 4251, in test 1038, in predict 1631, mcc score is 0.5718
---number of yes in train 4249, in test 1040, in predict 1612, mcc score is 0.5747
---number of yes in train 4207, in test 1082, in predict 1592, mcc score is 0.5900
---mean score 0.5843
Feature Number: 345
---number of yes in train 4219, in test 1070, in predict 1630, mcc s

Feature Number: 125
---number of yes in train 4219, in test 1070, in predict 1611, mcc score is 0.5954
---number of yes in train 4230, in test 1059, in predict 1588, mcc score is 0.5839
---number of yes in train 4251, in test 1038, in predict 1593, mcc score is 0.5775
---number of yes in train 4249, in test 1040, in predict 1587, mcc score is 0.5718
---number of yes in train 4207, in test 1082, in predict 1565, mcc score is 0.5924
---mean score 0.5842
Feature Number: 120
---number of yes in train 4219, in test 1070, in predict 1608, mcc score is 0.5970
---number of yes in train 4230, in test 1059, in predict 1560, mcc score is 0.5919
---number of yes in train 4251, in test 1038, in predict 1601, mcc score is 0.5819
---number of yes in train 4249, in test 1040, in predict 1603, mcc score is 0.5706
---number of yes in train 4207, in test 1082, in predict 1577, mcc score is 0.5947
---mean score 0.5872
Feature Number: 115
---number of yes in train 4219, in test 1070, in predict 1624, mcc s

Feature Number: 35
---number of yes in train 4219, in test 1070, in predict 1674, mcc score is 0.5764
---number of yes in train 4230, in test 1059, in predict 1657, mcc score is 0.5573
---number of yes in train 4251, in test 1038, in predict 1695, mcc score is 0.5595
---number of yes in train 4249, in test 1040, in predict 1626, mcc score is 0.5497
---number of yes in train 4207, in test 1082, in predict 1623, mcc score is 0.5672
---mean score 0.5620
Feature Number: 30
---number of yes in train 4219, in test 1070, in predict 1667, mcc score is 0.5702
---number of yes in train 4230, in test 1059, in predict 1658, mcc score is 0.5624
---number of yes in train 4251, in test 1038, in predict 1706, mcc score is 0.5632
---number of yes in train 4249, in test 1040, in predict 1640, mcc score is 0.5446
---number of yes in train 4207, in test 1082, in predict 1632, mcc score is 0.5694
---mean score 0.5620
Feature Number: 25
---number of yes in train 4219, in test 1070, in predict 1674, mcc scor
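
Both elimination loops above follow the same drop schedule. As a minimal sketch, the schedule can be isolated to see which feature counts the loops pass through. `n_features_to_drop` and `feature_counts` are hypothetical helper names, not part of the notebook, and the sketch assumes at most 1000 features (above that, the loops drop only zero-importance features).

```python
def n_features_to_drop(n_features):
    """Number of least-important features removed in one round (hypothetical helper)."""
    if n_features > 1000:
        # the notebook handles this case differently (zero-importance features only)
        raise ValueError("schedule sketch only covers at most 1000 features")
    if n_features > 200:
        return 20
    if n_features > 150:
        return 10
    return 5


def feature_counts(start, stop=10):
    """Feature counts visited by a loop of the form `while n > stop`, final count included."""
    counts = [start]
    while counts[-1] > stop:
        counts.append(counts[-1] - n_features_to_drop(counts[-1]))
    return counts


schedule = feature_counts(385)
print(schedule[:4], "...", schedule[-3:])
```

Because each snapshot is written after the drop, the saved files correspond to every count in `schedule` except the first one.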

# Feature importance during backward selection: robustness of the two models

We compute the importance of each original feature by summing the importances of all terms that involve it. In the code below, a term is assigned to the feature named before the first underscore in its column name.
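
A minimal sketch of this aggregation on made-up numbers. The column names and importances are hypothetical (not results from this notebook), and `base_feature` is an illustrative helper that mirrors the `split('_')[0]` rule used in the cells that follow.

```python
from collections import defaultdict

# hypothetical importances of a few derived columns (not real results)
importances = {
    "duration": 0.50,
    "job_admin.": 0.10,
    "job_student": 0.15,
    "month_may": 0.25,
}


def base_feature(column_name):
    """Return the part of the column name before the first underscore."""
    return column_name.split("_")[0]


# sum the importances of all columns that share the same original feature
feature_stat = defaultdict(float)
for column, importance in importances.items():
    feature_stat[base_feature(column)] += importance

# report the original features from most to least important
for name, total in sorted(feature_stat.items(), key=lambda item: item[1], reverse=True):
    print("{:<10s}{:.6f}".format(name, total))
```

Original features with no surviving column do not appear in `feature_stat` here, whereas the cells below list them with an importance of 0.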

In [26]:
# Decision tree, feature_num = 100
feature_num = 100
columns_list = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous',\
                'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
newX = pd.read_csv(".\\temp\\X_cross_feature_{:d}_dt.csv".format(feature_num), engine = "python")

para = {'class_weight': {0: 1, 1: 2.125}, 
        'criterion': 'entropy', 
        'max_depth': 10, 
        'random_state': 888
       }
dtree = DecisionTreeClassifier(**para)

feature_importance = dtree.fit(newX,y).feature_importances_
# pair every column with its importance
feature_importance_list = list(zip(newX.columns, feature_importance))

# sum the importances of all columns derived from the same original feature
feature_stat = pd.Series(0.0, index=columns_list)
for col in columns_list:
    feature_stat[col] = np.sum([item[1] for item in feature_importance_list if item[0].split('_')[0] == col])
feature_stat.sort_values(ascending = False, inplace = True)
feature_stat



duration     0.575804
year         0.260976
age          0.095638
balance      0.043053
pdays        0.012570
campaign     0.008871
previous     0.001923
education    0.000591
month        0.000575
default      0.000000
marital      0.000000
job          0.000000
housing      0.000000
loan         0.000000
contact      0.000000
poutcome     0.000000
weekday      0.000000
dtype: float64

In [27]:
# Decision tree, feature_num = 50
feature_num = 50
columns_list = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous',\
                'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
newX = pd.read_csv(".\\temp\\X_cross_feature_{:d}_dt.csv".format(feature_num), engine = "python")

para = {'class_weight': {0: 1, 1: 2.125}, 
        'criterion': 'entropy', 
        'max_depth': 10, 
        'random_state': 888
       }
dtree = DecisionTreeClassifier(**para)

feature_importance = dtree.fit(newX,y).feature_importances_
# pair every column with its importance
feature_importance_list = list(zip(newX.columns, feature_importance))

# sum the importances of all columns derived from the same original feature
feature_stat = pd.Series(0.0, index=columns_list)
for col in columns_list:
    feature_stat[col] = np.sum([item[1] for item in feature_importance_list if item[0].split('_')[0] == col])
feature_stat.sort_values(ascending = False, inplace = True)
feature_stat



duration     0.585236
year         0.261837
age          0.091517
balance      0.050443
pdays        0.008340
campaign     0.002627
marital      0.000000
previous     0.000000
job          0.000000
weekday      0.000000
default      0.000000
housing      0.000000
loan         0.000000
contact      0.000000
month        0.000000
poutcome     0.000000
education    0.000000
dtype: float64

In [28]:
# Decision tree, feature_num = 20
feature_num = 20
columns_list = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous',\
                'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
newX = pd.read_csv(".\\temp\\X_cross_feature_{:d}_dt.csv".format(feature_num), engine = "python")

para = {'class_weight': {0: 1, 1: 2.125}, 
        'criterion': 'entropy', 
        'max_depth': 10, 
        'random_state': 888
       }
dtree = DecisionTreeClassifier(**para)

feature_importance = dtree.fit(newX,y).feature_importances_
# pair every column with its importance
feature_importance_list = list(zip(newX.columns, feature_importance))

# sum the importances of all columns derived from the same original feature
feature_stat = pd.Series(0.0, index=columns_list)
for col in columns_list:
    feature_stat[col] = np.sum([item[1] for item in feature_importance_list if item[0].split('_')[0] == col])
feature_stat.sort_values(ascending = False, inplace = True)
feature_stat



duration     0.592220
year         0.269256
age          0.090320
balance      0.032836
pdays        0.015369
marital      0.000000
campaign     0.000000
previous     0.000000
job          0.000000
weekday      0.000000
default      0.000000
housing      0.000000
loan         0.000000
contact      0.000000
month        0.000000
poutcome     0.000000
education    0.000000
dtype: float64

In [33]:
# random forest, feature_num = 100
feature_num = 100
columns_list = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous',\
                'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
newX = pd.read_csv(".\\temp\\X_cross_feature_{:d}_rf.csv".format(feature_num), engine = "python")

para = {'class_weight': {0: 1, 1: 4}, 
        'criterion': 'gini', 
        'max_depth': 10, 
        'n_estimators': 100, 
        'random_state': 888
       }
rftree = RandomForestClassifier(**para)

feature_importance = rftree.fit(newX,y).feature_importances_
# pair every column with its importance
feature_importance_list = list(zip(newX.columns, feature_importance))

# sum the importances of all columns derived from the same original feature
feature_stat = pd.Series(0.0, index=columns_list)
for col in columns_list:
    feature_stat[col] = np.sum([item[1] for item in feature_importance_list if item[0].split('_')[0] == col])
feature_stat.sort_values(ascending = False, inplace = True)
feature_stat



duration     0.586377
age          0.112398
balance      0.087167
campaign     0.078834
pdays        0.056266
year         0.051079
previous     0.011529
month        0.007397
housing      0.005259
poutcome     0.003694
weekday      0.000000
job          0.000000
marital      0.000000
default      0.000000
loan         0.000000
contact      0.000000
education    0.000000
dtype: float64

In [31]:
# random forest, feature_num = 50
feature_num = 50
columns_list = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous',\
                'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
newX = pd.read_csv(".\\temp\\X_cross_feature_{:d}_rf.csv".format(feature_num), engine = "python")

para = {'class_weight': {0: 1, 1: 4}, 
        'criterion': 'gini', 
        'max_depth': 10, 
        'n_estimators': 100, 
        'random_state': 888
       }
rftree = RandomForestClassifier(**para)

feature_importance = rftree.fit(newX,y).feature_importances_
# pair every column with its importance
feature_importance_list = list(zip(newX.columns, feature_importance))

# sum the importances of all columns derived from the same original feature
feature_stat = pd.Series(0.0, index=columns_list)
for col in columns_list:
    feature_stat[col] = np.sum([item[1] for item in feature_importance_list if item[0].split('_')[0] == col])
feature_stat.sort_values(ascending = False, inplace = True)
feature_stat

FileNotFoundError: [Errno 2] No such file or directory: '.\\temp\\X_cross_feature_50_rf.csv'
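
The error above means that no snapshot with 50 features was found in `temp` when the cell ran. A hedged sketch for listing which snapshots are actually on disk before loading one: `available_feature_counts` is a hypothetical helper, and it assumes the `X_cross_feature_<n>_<model>.csv` naming used by the elimination loops.

```python
import re
from pathlib import Path


def available_feature_counts(folder, model):
    """Return the sorted feature counts for which a snapshot CSV exists (hypothetical helper)."""
    pattern = re.compile(r"X_cross_feature_(\d+)_{}\.csv".format(re.escape(model)))
    counts = []
    for path in Path(folder).glob("X_cross_feature_*_{}.csv".format(model)):
        match = pattern.fullmatch(path.name)
        if match:
            counts.append(int(match.group(1)))
    return sorted(counts)


# list the random forest snapshots only if the folder is present
folder = Path("temp")
if folder.is_dir():
    print(available_feature_counts(folder, "rf"))
```

Picking `feature_num` from the returned list avoids requesting a snapshot that was never written.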

In [None]:
# random forest, feature_num = 20
feature_num = 20
columns_list = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous',\
                'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome', 'year', 'weekday']
newX = pd.read_csv(".\\temp\\X_cross_feature_{:d}_rf.csv".format(feature_num), engine = "python")

para = {'class_weight': {0: 1, 1: 4}, 
        'criterion': 'gini', 
        'max_depth': 10, 
        'n_estimators': 100, 
        'random_state': 888
       }
rftree = RandomForestClassifier(**para)

feature_importance = rftree.fit(newX,y).feature_importances_
# pair every column with its importance
feature_importance_list = list(zip(newX.columns, feature_importance))

# sum the importances of all columns derived from the same original feature
feature_stat = pd.Series(0.0, index=columns_list)
for col in columns_list:
    feature_stat[col] = np.sum([item[1] for item in feature_importance_list if item[0].split('_')[0] == col])
feature_stat.sort_values(ascending = False, inplace = True)
feature_stat