**In this file I present creating models with pipelines comparing results of cross-validated hyperparameters to achive best model's fitting on binned dataset. Summary.**

Imports:

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost.sklearn import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import (StandardScaler, 
                                   OrdinalEncoder, 
                                   MinMaxScaler)

from sklearn.model_selection import (train_test_split, 
                                     GridSearchCV, 
                                     StratifiedKFold, 
                                     RandomizedSearchCV)

from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline as imbpipeline
from sklearn.pipeline import Pipeline
from sklearn.metrics import (classification_report, 
                             roc_auc_score, 
                             make_scorer, 
                             recall_score, 
                             confusion_matrix, 
                             accuracy_score,
                            get_scorer_names)
from sklearn.decomposition import PCA

Loading dataset:

In [2]:
data_clean = pd.read_pickle("data/data_bins.pkl")

In [3]:
data_clean.sample(5)

Unnamed: 0,hotel,is_canceled,lead_time,stays_in_weekend_nights,stays_in_week_nights,adults,children,babies,meal,country,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status_date,arrival_date
82643,City Hotel,0,3,zero,zero,2,No_child,No_babies,BB,Hi-Freq,...,No Deposit,PRT,empty,0,Transient,94.0,0,0,2016-01-04,2016-01-03 00:00:00
28746,Resort Hotel,0,18,zero,zero,1,No_child,No_babies,BB,Freq,...,No Deposit,PRT,empty,0,Transient,57.6,1,0,2016-10-12,2016-10-11 00:00:00
71620,City Hotel,1,226,one_night,zero,2,No_child,No_babies,BB,Hi-Freq,...,Non Refund,PRT,empty,0,Transient,110.0,0,0,2016-11-25,2017-07-09 00:00:00
1411,Resort Hotel,0,25,one_night,one_night,2,No_child,No_babies,FB,Freq,...,No Deposit,250.0,empty,0,Transient,208.0,0,2,2015-09-01,2015-08-28 00:00:00
85574,City Hotel,0,1,zero,more,2,No_child,No_babies,BB,Freq,...,No Deposit,28.0,empty,0,Transient-Party,82.0,0,1,2016-03-18,2016-03-15 00:00:00


Dividing into predictor variables X and target y ("is_canceled"):

In [4]:
X = data_clean.drop("is_canceled", axis=1)
y = data_clean.is_canceled

Splitting dataset into train and test subsets with test size 30% and train 70%:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    stratify=y,
                                                    random_state=42
                                                   )

Shape after division

In [6]:
X_train.shape

(83573, 27)

In [7]:
X_test.shape

(35817, 27)

Inputting NaNs in country column with the most frequent value ()max of train subset into train and test:

In [None]:
country_input = X_train["country"][X_train.country.value_counts().max()]

In [None]:
X_train.country.fillna(country_input, inplace=True)

In [None]:
X_test.country.fillna(country_input, inplace=True)

Inputting NaNs in agent column with the most frequent value ()max of train subset into train and test:

In [None]:
agent_input = X_train["country"][X_train.agent.value_counts().max()]

In [None]:
X_train.agent.fillna(agent_input, inplace=True)

In [None]:
X_test.agent.fillna(agent_input, inplace=True)

Outlier value of column adr found in a file "Reservation_Cancelation_Prediction" now is to be replaced with mean of adr column.

In [8]:
(X_train["adr"]==5400).sum()

1

In [9]:
(X_test["adr"]==5400).sum()

0

In [10]:
if (X_train["adr"]==5400).sum() > 0:
    X_train.replace({5400.0:np.round(X_train.adr.mean(), 2)}, inplace=True) #filling inordinary adr value with mean of training set adr column
    print("Outlier observations in train subset = ", (X_train["adr"]==5400).sum())
elif (X_test["adr"]==5400).sum() > 0:
    X_test.replace({5400.0:np.round(X_train.adr.mean(), 2)}, inplace=True)
    print("Outlier observations in test subset = ", (X_test["adr"]==5400).sum())

Outlier observations in train subset =  0


Encoding categorial columns with OrdinalEncoder:

In [11]:
data_cat = data_clean.select_dtypes(["object"]).columns

In [12]:
data_label_train = X_train[data_cat]
data_label_test = X_test[data_cat]

In [13]:
ode = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ode.fit(data_label_train)
data_label_train_ode = pd.DataFrame(ode.transform(data_label_train),
                                    columns=data_cat)
data_label_test_ode = pd.DataFrame(ode.transform(data_label_test), 
                                   columns=data_cat)

In [14]:
data_label_train_ode

Unnamed: 0,hotel,stays_in_weekend_nights,stays_in_week_nights,children,babies,meal,country,market_segment,distribution_channel,reserved_room_type,assigned_room_type,deposit_type,agent,company,customer_type,reservation_status_date,arrival_date
0,0.0,2.0,1.0,0.0,0.0,0.0,1.0,4.0,2.0,0.0,0.0,0.0,288.0,323.0,2.0,400.0,562.0
1,1.0,2.0,1.0,0.0,0.0,0.0,1.0,4.0,2.0,0.0,3.0,0.0,98.0,323.0,2.0,375.0,258.0
2,1.0,2.0,2.0,0.0,0.0,0.0,1.0,3.0,2.0,3.0,4.0,0.0,316.0,92.0,3.0,886.0,770.0
3,1.0,2.0,0.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,316.0,76.0,3.0,449.0,330.0
4,1.0,2.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,4.0,4.0,0.0,316.0,323.0,2.0,714.0,597.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
83568,0.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,323.0,3.0,339.0,220.0
83569,0.0,2.0,1.0,0.0,0.0,0.0,1.0,2.0,2.0,0.0,0.0,1.0,143.0,323.0,2.0,640.0,660.0
83570,1.0,2.0,0.0,0.0,0.0,0.0,2.0,4.0,2.0,4.0,4.0,0.0,99.0,323.0,2.0,817.0,699.0
83571,0.0,2.0,1.0,0.0,0.0,0.0,1.0,2.0,2.0,0.0,0.0,1.0,193.0,323.0,2.0,304.0,231.0


Updating encoded columns:

In [15]:
X_train.drop(data_cat, axis=1, inplace=True)
X_test.drop(data_cat, axis=1, inplace=True)

Concatenating encoded features with the rest:

In [16]:
X_train = pd.concat([X_train.reset_index(drop=True), data_label_train_ode.reset_index(drop=True)], axis=1)
X_test = pd.concat([X_test.reset_index(drop=True), data_label_test_ode.reset_index(drop=True)], axis=1)

In [17]:
X_train.shape

(83573, 27)

Encoding with get_dummies:

In [18]:
X_train = pd.get_dummies(X_train, drop_first=True)

In [19]:
X_test = pd.get_dummies(X_test, drop_first=True)
X_test = X_test.reindex(columns = X_train.columns, fill_value=0)

In [20]:
X_train.shape

(83573, 27)

Initiating StandardScaler for further data scaling:

In [21]:
scaler = StandardScaler()

Initiating Principal Components with ten components reducing dimentions to ten components :

In [22]:
pca = PCA(n_components=10)

Initiating algorithm to ballance unballanced data- SMOTEENN:

In [23]:
SMOTEEN = SMOTEENN()

RandomForestClassifier algorythm with RandomizedGridSearch in pipeline, scaling reducing, ballancing:

In [24]:
stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=11)
#imbpipeline
pipeline_rf = imbpipeline(steps=[
    ['scaler', scaler],
    ['pca', pca],
    ['smote', SMOTEEN],
    ['rf', RandomForestClassifier()]])
    
param_distributions_rf = {
    'rf__n_estimators': [20, 100],
    'rf__max_depth': [10, 20],
    'rf__min_samples_split': [5, 10],
    'pca__n_components': [5, 10, 20]
}

search_rf = RandomizedSearchCV(pipeline_rf, 
                               param_distributions_rf, 
                               n_iter=10, 
                               cv=stratified_kfold, 
                               scoring='roc_auc',
                               verbose=3
                              )

search_rf.fit(X_train, y_train)
y_pred_rf = search_rf.best_estimator_.predict(X_test)
print("Random Forest:")
print(search_rf.best_params_)
print(f'Results on test: {search_rf.best_estimator_.score(X_test, y_test)}')
print(f'Results on train: {search_rf.best_estimator_.score(X_train, y_train)}')

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END pca__n_components=5, rf__max_depth=10, rf__min_samples_split=10, rf__n_estimators=100;, score=0.838 total time=  15.0s
[CV 2/5] END pca__n_components=5, rf__max_depth=10, rf__min_samples_split=10, rf__n_estimators=100;, score=0.838 total time=  14.6s
[CV 3/5] END pca__n_components=5, rf__max_depth=10, rf__min_samples_split=10, rf__n_estimators=100;, score=0.844 total time=  14.2s
[CV 4/5] END pca__n_components=5, rf__max_depth=10, rf__min_samples_split=10, rf__n_estimators=100;, score=0.846 total time=  14.3s
[CV 5/5] END pca__n_components=5, rf__max_depth=10, rf__min_samples_split=10, rf__n_estimators=100;, score=0.844 total time=  14.0s
[CV 1/5] END pca__n_components=10, rf__max_depth=20, rf__min_samples_split=5, rf__n_estimators=20;, score=0.895 total time=  16.8s
[CV 2/5] END pca__n_components=10, rf__max_depth=20, rf__min_samples_split=5, rf__n_estimators=20;, score=0.898 total time=  17.1s
[CV 3/5] END pca_

Achieving scores of classification, saving accuracy, recall and F1 score in data frame:

In [25]:
#print(get_scorer_names())

In [26]:
y_pred_rf

array([0, 1, 1, ..., 0, 0, 0])

In [27]:
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

           0       0.82      0.91      0.87     22550
           1       0.82      0.67      0.74     13267

    accuracy                           0.82     35817
   macro avg       0.82      0.79      0.80     35817
weighted avg       0.82      0.82      0.82     35817



In [28]:
B_report_rf = pd.DataFrame(classification_report(y_test, y_pred_rf, output_dict=True))

In [29]:
for i, name in enumerate(B_report_rf.columns):
  B_report_rf = B_report_rf.rename(columns={(B_report_rf.iloc[:,i].name): ('RF_'+B_report_rf.iloc[:,i].name)})


In [30]:
B_report_rf

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg
precision,0.824122,0.821382,0.823296,0.822752,0.823107
recall,0.914501,0.668275,0.823296,0.791388,0.823296
f1-score,0.866962,0.73696,0.823296,0.801961,0.818808
support,22550.0,13267.0,0.823296,35817.0,35817.0


DecisionTreeClassifier algorythm with RandomizedGridSearch in pipeline, scaling reducing, ballancing:

In [31]:
stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=13)

pipeline = imbpipeline(steps = [['scaler', scaler],
                                ['pca', pca],
                                ['smote', SMOTEEN],
                                ['dtc', DecisionTreeClassifier()]])

    
param_grid = {'dtc__max_leaf_nodes' : [5, 30], 
             'dtc__max_depth': [10, 40],
             'dtc__random_state' : [23],
             'pca__n_components': [5, 10, 20]
             }

search_dtc = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=stratified_kfold,                           
                          verbose=3,
                           #n_jobs=3
                         )

search_dtc.fit(X_train, y_train)
y_pred_dtc = search_dtc.best_estimator_.predict(X_test)
cv_score = search_dtc.best_score_
test_score = search_dtc.score(X_test, y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')
print("Decision Tree:")
print(search_rf.best_params_)
print(f'Results on test: {search_rf.best_estimator_.score(X_test, y_test)}')
print(f'Results on train: {search_rf.best_estimator_.score(X_train, y_train)}')

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV 1/5] END dtc__max_depth=10, dtc__max_leaf_nodes=5, dtc__random_state=23, pca__n_components=5;, score=0.716 total time=   2.4s
[CV 2/5] END dtc__max_depth=10, dtc__max_leaf_nodes=5, dtc__random_state=23, pca__n_components=5;, score=0.723 total time=   2.4s
[CV 3/5] END dtc__max_depth=10, dtc__max_leaf_nodes=5, dtc__random_state=23, pca__n_components=5;, score=0.733 total time=   2.4s
[CV 4/5] END dtc__max_depth=10, dtc__max_leaf_nodes=5, dtc__random_state=23, pca__n_components=5;, score=0.729 total time=   2.5s
[CV 5/5] END dtc__max_depth=10, dtc__max_leaf_nodes=5, dtc__random_state=23, pca__n_components=5;, score=0.726 total time=   2.3s
[CV 1/5] END dtc__max_depth=10, dtc__max_leaf_nodes=5, dtc__random_state=23, pca__n_components=10;, score=0.728 total time=  11.7s
[CV 2/5] END dtc__max_depth=10, dtc__max_leaf_nodes=5, dtc__random_state=23, pca__n_components=10;, score=0.725 total time=  12.9s
[CV 3/5] END dtc__max_depth

Achieving scores of classification, saving accuracy, recall and F1 score in data frame:

In [32]:
y_pred_dtc

array([1, 1, 0, ..., 0, 0, 0])

In [33]:
print(classification_report(y_test, y_pred_dtc))
B_report_dtc = pd.DataFrame(classification_report(y_test, y_pred_dtc, output_dict=True))

              precision    recall  f1-score   support

           0       0.80      0.76      0.78     22550
           1       0.62      0.67      0.65     13267

    accuracy                           0.73     35817
   macro avg       0.71      0.72      0.71     35817
weighted avg       0.73      0.73      0.73     35817



In [34]:
for i, name in enumerate(B_report_dtc.columns):
  B_report_dtc = B_report_dtc.rename(columns={(B_report_dtc.iloc[:,i].name): ('DTC_'+B_report_dtc.iloc[:,i].name)})


In [35]:
B_report_dtc

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg
precision,0.79674,0.624156,0.728202,0.710448,0.732813
recall,0.762927,0.669179,0.728202,0.716053,0.728202
f1-score,0.779467,0.645884,0.728202,0.712675,0.729986
support,22550.0,13267.0,0.728202,35817.0,35817.0


Support Vector Classifier algorythm with RandomizedGridSearch in pipeline, scaling reducing, ballancing:

In [36]:
stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=23)

pipeline_SVC = imbpipeline([('scaler', scaler),
                            ('pca', pca),
                            ('SMOTE', SMOTEEN),
                            ('SVC', SVC())])
    
params_SVC = {
              'SVC__gamma': ['auto'],# [10, 20, 50]
              'SVC__max_iter': [150, 300],
              'SVC__decision_function_shape': ['ovo'],
              'SVC__degree': [1], #, 3, 5],
              'SVC__kernel': ['rbf'],
              'SVC__random_state': [11],
              'pca__n_components': [5, 10, 20]
             }

search_SVC = GridSearchCV(pipeline_SVC,
                             params_SVC,
                             scoring='roc_auc',
                             cv=stratified_kfold,
                            verbose=3,
                            #n_jobs=3
                         )

search_SVC.fit(X_train, y_train)

cv_score = search_SVC.best_score_
test_score = search_SVC.score(X_test, y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')
print("Support Vector:")
print(search_SVC.best_params_)
print(f'Results on test: {search_SVC.best_estimator_.score(X_test, y_test)}')
print(f'Results on train: {search_SVC.best_estimator_.score(X_train, y_train)}')

Fitting 5 folds for each of 6 candidates, totalling 30 fits




[CV 1/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=5;, score=0.529 total time=   3.5s




[CV 2/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=5;, score=0.591 total time=   3.4s




[CV 3/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=5;, score=0.475 total time=   3.4s




[CV 4/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=5;, score=0.583 total time=   3.5s




[CV 5/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=5;, score=0.477 total time=   3.5s




[CV 1/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=10;, score=0.536 total time=  12.6s




[CV 2/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=10;, score=0.618 total time=  13.0s




[CV 3/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=10;, score=0.555 total time=  13.6s




[CV 4/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=10;, score=0.570 total time=  12.7s




[CV 5/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=10;, score=0.646 total time=  12.6s




[CV 1/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=20;, score=0.595 total time=  20.7s




[CV 2/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=20;, score=0.702 total time=  20.7s




[CV 3/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=20;, score=0.608 total time=  20.7s




[CV 4/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=20;, score=0.591 total time=  20.6s




[CV 5/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=150, SVC__random_state=11, pca__n_components=20;, score=0.679 total time=  20.5s




[CV 1/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=5;, score=0.476 total time=   4.5s




[CV 2/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=5;, score=0.550 total time=   4.5s




[CV 3/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=5;, score=0.460 total time=   4.5s




[CV 4/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=5;, score=0.630 total time=   4.5s




[CV 5/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=5;, score=0.461 total time=   4.6s




[CV 1/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=10;, score=0.629 total time=  14.6s




[CV 2/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=10;, score=0.564 total time=  14.2s




[CV 3/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=10;, score=0.623 total time=  15.7s




[CV 4/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=10;, score=0.555 total time=  14.2s




[CV 5/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=10;, score=0.580 total time=  13.9s




[CV 1/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=20;, score=0.706 total time=  22.5s




[CV 2/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=20;, score=0.659 total time=  22.3s




[CV 3/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=20;, score=0.696 total time=  22.3s




[CV 4/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=20;, score=0.659 total time=  22.4s




[CV 5/5] END SVC__decision_function_shape=ovo, SVC__degree=1, SVC__gamma=auto, SVC__kernel=rbf, SVC__max_iter=300, SVC__random_state=11, pca__n_components=20;, score=0.587 total time=  22.4s




Cross-validation score: 0.6615616956235008
Test score: 0.6544448748933929
Support Vector:
{'SVC__decision_function_shape': 'ovo', 'SVC__degree': 1, 'SVC__gamma': 'auto', 'SVC__kernel': 'rbf', 'SVC__max_iter': 300, 'SVC__random_state': 11, 'pca__n_components': 20}
Results on test: 0.6114694139654354
Results on train: 0.6322017876586936


Achieving scores of classification, saving accuracy, recall and F1 score in data frame:

In [37]:
y_pred_SVC_train = search_SVC.best_estimator_.predict(X_train)

In [38]:
y_pred_svc_test = search_SVC.best_estimator_.predict(X_test)

In [39]:
y_pred_SVC = search_SVC.predict(X_test)

In [40]:
search_SVC.best_params_

{'SVC__decision_function_shape': 'ovo',
 'SVC__degree': 1,
 'SVC__gamma': 'auto',
 'SVC__kernel': 'rbf',
 'SVC__max_iter': 300,
 'SVC__random_state': 11,
 'pca__n_components': 20}

In [41]:
print(classification_report(y_test, y_pred_SVC))
B_report_svc = pd.DataFrame(classification_report(y_test, y_pred_SVC, output_dict=True))

              precision    recall  f1-score   support

           0       0.73      0.61      0.67     22550
           1       0.48      0.61      0.54     13267

    accuracy                           0.61     35817
   macro avg       0.60      0.61      0.60     35817
weighted avg       0.64      0.61      0.62     35817



In [42]:
for i, name in enumerate(B_report_svc.columns):
  B_report_svc = B_report_svc.rename(columns={(B_report_svc.iloc[:,i].name): ('SVC_'+B_report_svc.iloc[:,i].name)})


In [43]:
B_report_svc

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg
precision,0.726234,0.48061,0.611469,0.603422,0.635252
recall,0.614545,0.606241,0.611469,0.610393,0.611469
f1-score,0.665738,0.536164,0.611469,0.600951,0.617742
support,22550.0,13267.0,0.611469,35817.0,35817.0


XGBClassifier algorythm with RandomizedGridSearch in pipeline, scaling reducing, ballancing:

In [44]:
stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=77)

pipeline = imbpipeline(steps=[('scaler', scaler),
                              ('pca', pca),
                              ('smote', SMOTEEN),
                              ('XGB', XGBClassifier())])

params = {
    'XGB__n_estimators': [100, 500],
    'XGB__max_depth': [5, 10],
    'XGB__learning_rate': [0.1, 0.5],
    'pca__n_components': [5, 10, 20]
    }

search_XGB = GridSearchCV(pipeline, 
                          params, 
                          scoring='roc_auc', 
                          cv=stratified_kfold, 
                          verbose=3,
                        #n_jobs=3
                         ) 

search_XGB.fit(X_train, y_train)
accuracy_score(y_test, search_XGB.predict(X_test))

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END XGB__learning_rate=0.1, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.826 total time=   5.3s
[CV 2/5] END XGB__learning_rate=0.1, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.827 total time=   5.2s
[CV 3/5] END XGB__learning_rate=0.1, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.834 total time=   5.2s
[CV 4/5] END XGB__learning_rate=0.1, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.833 total time=   5.2s
[CV 5/5] END XGB__learning_rate=0.1, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.826 total time=   5.3s
[CV 1/5] END XGB__learning_rate=0.1, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=10;, score=0.863 total time=  16.9s
[CV 2/5] END XGB__learning_rate=0.1, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=10;, score=0.872 total time=  16.5s
[CV 3/5] END XGB__l

[CV 3/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.850 total time=   5.2s
[CV 4/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.850 total time=   5.3s
[CV 5/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=5;, score=0.846 total time=   5.1s
[CV 1/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=10;, score=0.883 total time=  16.2s
[CV 2/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=10;, score=0.892 total time=  15.8s
[CV 3/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=10;, score=0.892 total time=  16.4s
[CV 4/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, pca__n_components=10;, score=0.893 total time=  15.4s
[CV 5/5] END XGB__learning_rate=0.5, XGB__max_depth=5, XGB__n_estimators=100, p

0.8316162716028701

Achieving scores of classification, saving accuracy, recall and F1 score in data frame:

In [45]:
#XGBClassifier().get_params().keys()

In [46]:
search_XGB.cv_results_["mean_test_score"]

array([0.82909896, 0.8697872 , 0.89343259, 0.84799024, 0.89466898,
       0.91383193, 0.8581246 , 0.89861315, 0.91577694, 0.86583967,
       0.90819476, 0.9210893 , 0.8466754 , 0.88936656, 0.90883384,
       0.85383082, 0.89979542, 0.91797723, 0.86168687, 0.90387911,
       0.91764326, 0.86175316, 0.90485314, 0.91993909])

In [47]:
y_pred_XGB = search_XGB.best_estimator_.predict(X_test)
test_score = search_XGB.score(X_test, y_test)
cv_score = search_XGB.best_score_

In [48]:
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')
print("XGBClassifier:")
print(search_XGB.best_params_)
print(f'Results on test: {search_XGB.best_estimator_.score(X_test, y_test)}')
print(f'Results on train: {search_XGB.best_estimator_.score(X_train, y_train)}')

Cross-validation score: 0.9210892976947438
Test score: 0.8959155144961483
XGBClassifier:
{'XGB__learning_rate': 0.1, 'XGB__max_depth': 10, 'XGB__n_estimators': 500, 'pca__n_components': 20}
Results on test: 0.8316162716028701
Results on train: 0.8827372476756847


In [49]:
print(classification_report(y_test, y_pred_XGB))
B_report_xgb = pd.DataFrame(classification_report(y_test, y_pred_XGB, output_dict=True))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87     22550
           1       0.81      0.71      0.76     13267

    accuracy                           0.83     35817
   macro avg       0.83      0.81      0.81     35817
weighted avg       0.83      0.83      0.83     35817



In [50]:
for i, name in enumerate(B_report_xgb.columns):
  B_report_xgb = B_report_xgb.rename(columns={(B_report_xgb.iloc[:,i].name): ('XGB_'+B_report_xgb.iloc[:,i].name)})


In [51]:
B_report_xgb

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg
precision,0.840781,0.812435,0.831616,0.826608,0.830281
recall,0.903681,0.709128,0.831616,0.806404,0.831616
f1-score,0.871097,0.757275,0.831616,0.814186,0.828936
support,22550.0,13267.0,0.831616,35817.0,35817.0


LogisticRegression algorythm with RandomizedGridSearch in pipeline, scaling reducing, ballancing:

In [52]:
pipeline = imbpipeline(steps = [['scaler', scaler],
                                ['pca', pca],
                                ['smote', SMOTEEN],
                                ['LR', LogisticRegression()]])

stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=13)
    
param_grid = {'LR__C':[20, 70],
             'LR__random_state': [11],
             'LR__multi_class': ['auto'],
             'LR__max_iter': [50, 100],
             'LR__solver': ['saga'],
             'LR__penalty': ['l2', 'l1'],
             'pca__n_components': [5, 10, 20]
             }
                                                                 
search_LR = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=stratified_kfold,
                           verbose=3,
                           #n_jobs=3
                        )

search_LR.fit(X_train, y_train)
cv_score = search_LR.best_score_
test_score = search_LR.score(X_test, y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.773 total time=   2.6s
[CV 2/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774 total time=   2.4s
[CV 3/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.776 total time=   2.4s
[CV 4/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.773 total time=   2.4s
[CV 5/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.773 total time=   2.4s
[CV 1/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_stat



[CV 1/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  20.7s




[CV 2/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.821 total time=  20.6s




[CV 3/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  20.9s




[CV 4/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  20.6s




[CV 5/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  20.9s
[CV 1/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.4s
[CV 2/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.773 total time=   2.4s
[CV 3/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774 total time=   2.4s
[CV 4/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.4s
[CV 5/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.773 tota



[CV 1/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  21.1s




[CV 2/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.821 total time=  20.9s




[CV 3/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  21.0s




[CV 4/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  20.9s




[CV 5/5] END LR__C=20, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  20.9s
[CV 1/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.3s
[CV 2/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.4s
[CV 3/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774 total time=   2.4s
[CV 4/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.4s
[CV 5/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774



[CV 1/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.827 total time=  22.2s




[CV 2/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.821 total time=  22.3s




[CV 3/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  24.2s




[CV 4/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  23.0s




[CV 5/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  22.4s
[CV 1/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.771 total time=   2.5s
[CV 2/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774 total time=   2.4s
[CV 3/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.775 total time=   2.4s
[CV 4/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774 total time=   2.4s
[CV 5/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.77



[CV 1/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.827 total time=  22.8s




[CV 2/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.821 total time=  22.9s




[CV 3/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  23.4s




[CV 4/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  24.5s




[CV 5/5] END LR__C=20, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  32.0s
[CV 1/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774 total time=   3.2s
[CV 2/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.774 total time=   3.1s
[CV 3/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.775 total time=   2.8s
[CV 4/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.9s
[CV 5/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.769 tot



[CV 1/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  23.6s




[CV 2/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.821 total time=  24.0s




[CV 3/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  23.9s




[CV 4/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  21.7s




[CV 5/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  22.6s
[CV 1/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.5s
[CV 2/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.775 total time=   2.6s
[CV 3/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.776 total time=   2.6s
[CV 4/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.5s
[CV 5/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.768 tota



[CV 1/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  22.0s




[CV 2/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.822 total time=  21.9s




[CV 3/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  21.9s




[CV 4/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  22.1s




[CV 5/5] END LR__C=70, LR__max_iter=50, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  22.1s
[CV 1/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.6s
[CV 2/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.773 total time=   2.5s
[CV 3/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.775 total time=   2.5s
[CV 4/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.6s
[CV 5/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772



[CV 1/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  23.1s




[CV 2/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.821 total time=  23.3s




[CV 3/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  23.0s




[CV 4/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  23.1s




[CV 5/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l2, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  23.3s
[CV 1/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.773 total time=   2.6s
[CV 2/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.775 total time=   2.5s
[CV 3/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.776 total time=   2.6s
[CV 4/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.772 total time=   2.6s
[CV 5/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=5;, score=0.77



[CV 1/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  23.8s




[CV 2/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.822 total time=  23.8s




[CV 3/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.824 total time=  23.8s




[CV 4/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.825 total time=  23.8s




[CV 5/5] END LR__C=70, LR__max_iter=100, LR__multi_class=auto, LR__penalty=l1, LR__random_state=11, LR__solver=saga, pca__n_components=20;, score=0.826 total time=  23.9s
Cross-validation score: 0.8247579737625663
Test score: 0.8173499523767105




Achieving scores of classification, saving accuracy, recall and F1 score in data frame:

In [53]:
search_LR.best_params_

{'LR__C': 20,
 'LR__max_iter': 100,
 'LR__multi_class': 'auto',
 'LR__penalty': 'l1',
 'LR__random_state': 11,
 'LR__solver': 'saga',
 'pca__n_components': 20}

In [54]:
y_pred_lr = search_LR.best_estimator_.predict(X_test)

In [55]:
test_score = search_LR.score(X_test, y_test)

In [58]:
print(classification_report(y_test, y_pred_lr))
B_report_lr = pd.DataFrame(classification_report(y_test, y_pred_lr, output_dict=True))

              precision    recall  f1-score   support

           0       0.80      0.83      0.81     22550
           1       0.69      0.64      0.67     13267

    accuracy                           0.76     35817
   macro avg       0.74      0.74      0.74     35817
weighted avg       0.76      0.76      0.76     35817



In [59]:
for i, name in enumerate(B_report_lr.columns):
  B_report_lr = B_report_lr.rename(columns={(B_report_lr.iloc[:,i].name): ('LR_'+B_report_lr.iloc[:,i].name)})


In [60]:
B_report_lr

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg
precision,0.798565,0.689714,0.760896,0.74414,0.758246
recall,0.829446,0.644381,0.760896,0.736913,0.760896
f1-score,0.813713,0.666277,0.760896,0.739995,0.759101
support,22550.0,13267.0,0.760896,35817.0,35817.0


Utilizing Multi Layer Perceptron algorythm with RandomizedGridSearch in pipeline, scaling reducing, ballancing:

In [61]:
pipeline = imbpipeline(steps = [['scaler', scaler],
                                ['pca', pca],
                                ['smote', SMOTEEN],
                                ['MLP', MLPClassifier()]])

stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=13)
    
param_grid = {'MLP__hidden_layer_sizes':[8, 16],
             'MLP__activation': ['relu'],
              'MLP__solver': ['adam'],
              'MLP__random_state': [42],
              'MLP__max_iter': [1000],
              'MLP__batch_size': [32],
              'pca__n_components': [5, 10, 20]
             }
                                                                 
search_MLP = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring='roc_auc',
                           cv=stratified_kfold,
                           verbose=3,
                           #n_jobs=3
                        )

search_MLP.fit(X_train, y_train)
cv_score = search_MLP.best_score_
test_score = search_MLP.score(X_test, y_test)
print(f'Cross-validation score: {cv_score}\nTest score: {test_score}')

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END MLP__activation=relu, MLP__batch_size=32, MLP__hidden_layer_sizes=8, MLP__max_iter=1000, MLP__random_state=42, MLP__solver=adam, pca__n_components=5;, score=0.802 total time=  23.8s
[CV 2/5] END MLP__activation=relu, MLP__batch_size=32, MLP__hidden_layer_sizes=8, MLP__max_iter=1000, MLP__random_state=42, MLP__solver=adam, pca__n_components=5;, score=0.804 total time=  20.0s
[CV 3/5] END MLP__activation=relu, MLP__batch_size=32, MLP__hidden_layer_sizes=8, MLP__max_iter=1000, MLP__random_state=42, MLP__solver=adam, pca__n_components=5;, score=0.802 total time= 1.0min
[CV 4/5] END MLP__activation=relu, MLP__batch_size=32, MLP__hidden_layer_sizes=8, MLP__max_iter=1000, MLP__random_state=42, MLP__solver=adam, pca__n_components=5;, score=0.805 total time=  18.9s
[CV 5/5] END MLP__activation=relu, MLP__batch_size=32, MLP__hidden_layer_sizes=8, MLP__max_iter=1000, MLP__random_state=42, MLP__solver=adam, pca__n_components=

Achieving scores of classification, saving accuracy, recall and F1 score in data frame:

In [64]:
y_pred_mlp = search_MLP.predict(X_test)
print(classification_report(y_test, y_pred_mlp))
B_report_mlp = pd.DataFrame(classification_report(y_test, y_pred_mlp, output_dict=True))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87     22550
           1       0.80      0.70      0.75     13267

    accuracy                           0.83     35817
   macro avg       0.82      0.80      0.81     35817
weighted avg       0.82      0.83      0.82     35817



In [65]:
for i, name in enumerate(B_report_mlp.columns):
  B_report_mlp = B_report_mlp.rename(columns={(B_report_mlp.iloc[:,i].name): ('MLP_'+B_report_mlp.iloc[:,i].name)})


In [66]:
B_report_mlp

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg
precision,0.837387,0.799829,0.825139,0.818608,0.823475
recall,0.896319,0.704153,0.825139,0.800236,0.825139
f1-score,0.865851,0.748948,0.825139,0.8074,0.822549
support,22550.0,13267.0,0.825139,35817.0,35817.0


Creating Data Frame containing all six classifiers results:

In [67]:
B_results = pd.concat([B_report_rf, 
                       B_report_dtc, 
                       B_report_svc, 
                       B_report_xgb, 
                       B_report_rl, 
                       B_report_mlp], 
                      axis=1)

In [69]:
B_results.sample(4)

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg,RF_0.1,RF_1.1,RF_accuracy.1,RF_macro avg.1,RF_weighted avg.1,...,0,1,accuracy,macro avg,weighted avg,RF_0.2,RF_1.2,RF_accuracy.2,RF_macro avg.2,RF_weighted avg.2
f1-score,0.866962,0.73696,0.823296,0.801961,0.818808,0.779467,0.645884,0.728202,0.712675,0.729986,...,0.813713,0.666277,0.760896,0.739995,0.759101,0.865851,0.748948,0.825139,0.8074,0.822549
support,22550.0,13267.0,0.823296,35817.0,35817.0,22550.0,13267.0,0.728202,35817.0,35817.0,...,22550.0,13267.0,0.760896,35817.0,35817.0,22550.0,13267.0,0.825139,35817.0,35817.0
precision,0.824122,0.821382,0.823296,0.822752,0.823107,0.79674,0.624156,0.728202,0.710448,0.732813,...,0.798565,0.689714,0.760896,0.74414,0.758246,0.837387,0.799829,0.825139,0.818608,0.823475
recall,0.914501,0.668275,0.823296,0.791388,0.823296,0.762927,0.669179,0.728202,0.716053,0.728202,...,0.829446,0.644381,0.760896,0.736913,0.760896,0.896319,0.704153,0.825139,0.800236,0.825139


Saving results in a file:

In [70]:
B_results.to_pickle("data/B_dataset_results.pkl")

Loading and presenting saved Data Frame:

In [None]:
B_results = pickle.load("data/B_dataset_results.pkl")

# Summary

In [4]:
c_list = ["RF", "DTC", "SVC", "XGB", "LR", "MLP"]

Loading results from not binned dataset from pickle file:

In [None]:
A_results = pd.read_pickle("data/A_dataset_results.pkl")

In [78]:
A_results

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg,DTC_0,DTC_1,DTC_accuracy,DTC_macro avg,DTC_weighted avg,...,LR_0,LR_1,LR_accuracy,LR_macro avg,LR_weighted avg,MLP_0,MLP_1,MLP_accuracy,MLP_macro avg,MLP_weighted avg
precision,0.798638,0.791443,0.796577,0.795041,0.795973,0.801758,0.639499,0.738448,0.720629,0.741656,...,0.830833,0.651304,0.755898,0.741069,0.764334,0.826294,0.77639,0.810006,0.801342,0.807809
recall,0.9051,0.61212,0.796577,0.75861,0.796577,0.776585,0.673626,0.738448,0.725106,0.738448,...,0.768825,0.733926,0.755898,0.751376,0.755898,0.88408,0.684103,0.810006,0.784092,0.810006
f1-score,0.848543,0.690326,0.796577,0.769435,0.789938,0.788971,0.656119,0.738448,0.722545,0.739761,...,0.798627,0.690151,0.755898,0.744389,0.758447,0.854211,0.727331,0.810006,0.790771,0.807213
support,22550.0,13267.0,0.796577,35817.0,35817.0,22550.0,13267.0,0.738448,35817.0,35817.0,...,22550.0,13267.0,0.755898,35817.0,35817.0,22550.0,13267.0,0.810006,35817.0,35817.0


**Best model before binning data**

Best f1-score:

In [51]:
predi_0s = A_results.filter(like='0')
max_f1 = predi_0s[predi_0s.values==(predi_0s.loc["f1-score",:]).max()]
max_0s = predi_0s[max_f1.idxmax(axis=1)]

predi_ones = A_results.filter(like='1')
max_f1 = predi_ones[predi_ones.values==(predi_ones.loc["f1-score",:]).max()]
max_ones = predi_ones[max_f1.idxmax(axis=1)]
print(max_0s)
print(max_ones)

                  XGB_0
precision      0.822499
recall         0.901685
f1-score       0.860274
support    22550.000000
                  XGB_1
precision      0.800198
recall         0.669255
f1-score       0.728892
support    13267.000000


Predicting hotel guests who cancel their reservation I discovered that from all six models with given hyperparameters one achieved highest f1-score witch is a harmonic mean between precision and recall. Shows how precise model managed to fit into given dataset. Precision tells how acurate model was predicting cancelation(1) in theory(on training part of the dataset). And recall metric tells how well model made it on test data. In scoretable above is shown that model predicted only 67% true cancelations right, when its training score was on level of 80%. To increase effectiveness I should consider do more experiments with hyperparameters of winning model and models with close scores to it, in this case XGBClassifier is the winner.
In situation when consider 0's(not canceled) there is overfitting, precision number is smaller than recall what tells me that model done it better on test data than on training. In future to avoid overfitting I'll use regularization methods, in the case of XGBClassifier there are 3 hyperparameters to tune: alpha: l1 regularization, gamma: minimum loss reduction, lambda: l2 regularization.

Best precision score:

In [52]:
predi_0s = A_results.filter(like='0')
max_prec = predi_0s[predi_0s.values==(predi_0s.loc["precision",:]).max()]
max_0s = predi_0s[max_prec.idxmax(axis=1)]

predi_ones = A_results.filter(like='1')
max_prec = predi_ones[predi_ones.values==(predi_ones.loc["precision",:]).max()]
max_ones = predi_ones[max_prec.idxmax(axis=1)]
print(max_0s)
print(max_ones)

                   LR_0
precision      0.830833
recall         0.768825
f1-score       0.798627
support    22550.000000
                  XGB_1
precision      0.800198
recall         0.669255
f1-score       0.728892
support    13267.000000


In [77]:
B_results

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg,RF_0.1,RF_1.1,RF_accuracy.1,RF_macro avg.1,RF_weighted avg.1,...,0,1,accuracy,macro avg,weighted avg,RF_0.2,RF_1.2,RF_accuracy.2,RF_macro avg.2,RF_weighted avg.2
precision,0.824122,0.821382,0.823296,0.822752,0.823107,0.79674,0.624156,0.728202,0.710448,0.732813,...,0.798565,0.689714,0.760896,0.74414,0.758246,0.837387,0.799829,0.825139,0.818608,0.823475
recall,0.914501,0.668275,0.823296,0.791388,0.823296,0.762927,0.669179,0.728202,0.716053,0.728202,...,0.829446,0.644381,0.760896,0.736913,0.760896,0.896319,0.704153,0.825139,0.800236,0.825139
f1-score,0.866962,0.73696,0.823296,0.801961,0.818808,0.779467,0.645884,0.728202,0.712675,0.729986,...,0.813713,0.666277,0.760896,0.739995,0.759101,0.865851,0.748948,0.825139,0.8074,0.822549
support,22550.0,13267.0,0.823296,35817.0,35817.0,22550.0,13267.0,0.728202,35817.0,35817.0,...,22550.0,13267.0,0.760896,35817.0,35817.0,22550.0,13267.0,0.825139,35817.0,35817.0


Renaming mistaken columns names:

In [41]:
names = []
k = 0
nr_col = 0
for col in c_list:
    for i in range(5):        
        c_name = str(B_results.iloc[:, i].name)
        for char in c_name:
            if char!="_":
                c_name = c_name.replace(char, '')
            else:
                break    
        names.append(col+c_name)
        #names[str(i+k)] = (col+c_name)    
    k+=5        
print(names)

B_results.columns = [names]
#B_results.rename(columns=names, inplace=True) rename not working with dict!!
        

['RF_0', 'RF_1', 'RF_accuracy', 'RF_macro avg', 'RF_weighted avg', 'DTC_0', 'DTC_1', 'DTC_accuracy', 'DTC_macro avg', 'DTC_weighted avg', 'SVC_0', 'SVC_1', 'SVC_accuracy', 'SVC_macro avg', 'SVC_weighted avg', 'XGB_0', 'XGB_1', 'XGB_accuracy', 'XGB_macro avg', 'XGB_weighted avg', 'LR_0', 'LR_1', 'LR_accuracy', 'LR_macro avg', 'LR_weighted avg', 'MLP_0', 'MLP_1', 'MLP_accuracy', 'MLP_macro avg', 'MLP_weighted avg']


Columns after renaming:

In [43]:
B_results.columns

MultiIndex([(            'RF_0',),
            (            'RF_1',),
            (     'RF_accuracy',),
            (    'RF_macro avg',),
            ( 'RF_weighted avg',),
            (           'DTC_0',),
            (           'DTC_1',),
            (    'DTC_accuracy',),
            (   'DTC_macro avg',),
            ('DTC_weighted avg',),
            (           'SVC_0',),
            (           'SVC_1',),
            (    'SVC_accuracy',),
            (   'SVC_macro avg',),
            ('SVC_weighted avg',),
            (           'XGB_0',),
            (           'XGB_1',),
            (    'XGB_accuracy',),
            (   'XGB_macro avg',),
            ('XGB_weighted avg',),
            (            'LR_0',),
            (            'LR_1',),
            (     'LR_accuracy',),
            (    'LR_macro avg',),
            ( 'LR_weighted avg',),
            (           'MLP_0',),
            (           'MLP_1',),
            (    'MLP_accuracy',),
            (   'MLP

In [46]:
B_results

Unnamed: 0,RF_0,RF_1,RF_accuracy,RF_macro avg,RF_weighted avg,DTC_0,DTC_1,DTC_accuracy,DTC_macro avg,DTC_weighted avg,...,LR_0,LR_1,LR_accuracy,LR_macro avg,LR_weighted avg,MLP_0,MLP_1,MLP_accuracy,MLP_macro avg,MLP_weighted avg
precision,0.824122,0.821382,0.823296,0.822752,0.823107,0.79674,0.624156,0.728202,0.710448,0.732813,...,0.798565,0.689714,0.760896,0.74414,0.758246,0.837387,0.799829,0.825139,0.818608,0.823475
recall,0.914501,0.668275,0.823296,0.791388,0.823296,0.762927,0.669179,0.728202,0.716053,0.728202,...,0.829446,0.644381,0.760896,0.736913,0.760896,0.896319,0.704153,0.825139,0.800236,0.825139
f1-score,0.866962,0.73696,0.823296,0.801961,0.818808,0.779467,0.645884,0.728202,0.712675,0.729986,...,0.813713,0.666277,0.760896,0.739995,0.759101,0.865851,0.748948,0.825139,0.8074,0.822549
support,22550.0,13267.0,0.823296,35817.0,35817.0,22550.0,13267.0,0.728202,35817.0,35817.0,...,22550.0,13267.0,0.760896,35817.0,35817.0,22550.0,13267.0,0.825139,35817.0,35817.0


**Best model after binning data**

Model with highest f1-score:

In [44]:
predi_0s = B_results.filter(like='0')
max_f1 = predi_0s[predi_0s.values==(predi_0s.loc["f1-score",:]).max()]
max_0s = predi_0s[max_f1.idxmax(axis=1)]

predi_ones = B_results.filter(like='1')
max_f1 = predi_ones[predi_ones.values==(predi_ones.loc["f1-score",:]).max()]
max_ones = predi_ones[max_f1.idxmax(axis=1)]
print(max_0s)
print(max_ones)

                  XGB_0
precision      0.840781
recall         0.903681
f1-score       0.871097
support    22550.000000
                  XGB_1
precision      0.812435
recall         0.709128
f1-score       0.757275
support    13267.000000


Recall shows that model achieved 71% efficiency in predicting real cancelations, when on training data it's score was 81% right predicted cancelations.
After binning data, XGBClassifier has the best f1-score again.

On 0's prediction model is overfitted, need to add gamma, alpha or lambda punishment hyperparameters to improve tuning of the algorithm.

Model with highest precision:

In [48]:
predi_0s = B_results.filter(like='0')
max_prec = predi_0s[predi_0s.values==(predi_0s.loc["precision",:]).max()]
max_0s = predi_0s[max_prec.idxmax(axis=1)]

predi_ones = B_results.filter(like='1')
max_prec = predi_ones[predi_ones.values==(predi_ones.loc["precision",:]).max()]
max_ones = predi_ones[max_prec.idxmax(axis=1)]
print(max_0s)
print(max_ones)

                  XGB_0
precision      0.840781
recall         0.903681
f1-score       0.871097
support    22550.000000
                   RF_1
precision      0.821382
recall         0.668275
f1-score       0.736960
support    13267.000000


**Final comparison**