## Neural network applications to fraud detection
## Alternative learning methods

Neural networks are among the most relevant learning methods currently available. Their widespread application is explained by theoretical robustness, flexible architecture design, and strong expected predictive accuracy.
<br>
<br>
The main objective of this study is to develop a neural network application for fraud detection and, above all, to construct and implement a strategy for hyper-parameter tuning, since this learning method requires a large set of hyper-parameters to be properly defined in order to achieve competitive performance.
<br>
<br>
Before the empirical investigation, it is necessary to review the details of neural network structure, fitting, and specification, which form the basis for experiment design and test implementation. The theoretical presentation in this notebook is therefore followed by an empirical stage of tests in which hyper-parameters are defined to improve the predictive accuracy of neural networks, after which the best specification obtained is compared against alternative learning methods.

---------------

After discussion and empirical application of neural network models, this notebook estimates additional models based on alternative learning methods (logistic regression, SVM and GBM).

---------------

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing data](#imports)<a href='#imports'></a>.
    * [Categorical features](#categorical_features)<a href='#categorical_features'></a>.
    * [Model assessment](#model_assessment)<a href='#model_assessment'></a>.
    * [Classifying features](#classif_feat)<a href='#classif_feat'></a>.
<br>
<br>
5. [Data pre-processing](#data_pre_proc)<a href='#data_pre_proc'></a>.
    * [Assessing missing values](#assessing_missing)<a href='#assessing_missing'></a>.
    * [Transforming numerical features](#num_transf)<a href='#num_transf'></a>.
    * [Transforming categorical features](#categorical_transf)<a href='#categorical_transf'></a>.
    * [Datasets structure](#datasets_structure)<a href='#datasets_structure'></a>.
<br>
<br>
6. [Logistic regression estimation](#logistic_regression)<a href='#logistic_regression'></a>.
    * [Hyper-parameters definition](#lr_params)<a href='#lr_params'></a>.
    * [Final estimation](#lr_estimation)<a href='#lr_estimation'></a>.
<br>
<br>
7. [SVM estimation](#svm)<a href='#svm'></a>.
    * [Hyper-parameters definition](#svm_params)<a href='#svm_params'></a>.
    * [Final estimation](#svm_estimation)<a href='#svm_estimation'></a>.
<br>
<br>
8. [GBM estimation](#gbm)<a href='#gbm'></a>.
    * [Hyper-parameters definition](#gbm_params)<a href='#gbm_params'></a>.
    * [Final estimation](#gbm_estimation)<a href='#gbm_estimation'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np
import json
import os

from datetime import datetime
import time

import progressbar
from time import sleep

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, LeakyReLU, PReLU
from tensorflow.keras.regularizers import l1, l2
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.callbacks import EarlyStopping, Callback
from tensorflow.keras.initializers import RandomNormal, Zeros
from tensorflow.nn import leaky_relu
from tensorflow.keras.activations import swish
from tensorflow.keras.models import load_model

from scipy.stats import uniform, norm, randint

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score, average_precision_score, auc, precision_recall_curve, brier_score_loss

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# print(__version__) # requires version >= 1.9.0

import cufflinks as cf
init_notebook_mode(connected=True)
cf.go_offline()

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import pickle

<a id='functions_classes'></a>

## Functions and classes

In [2]:
import utils
from utils import epoch_to_date, text_clean, is_velocity, balanced_sample, get_cat, permutation

In [3]:
from transformations import log_transformation, standard_scale, recreate_missings, impute_missing
from transformations import one_hot_encoding

In [4]:
import keras_nn
from keras_nn import keras_nn

<a id='settings'></a>

## Settings

In [5]:
# Declare whether to export results:
export = True

# Define a dataset id:
s = 6044

# Declare whether to apply logarithmic transformation over numerical data:
log_transform = True

# Declare whether to standardize numerical data:
standardize = True

<a id='imports'></a>

## Importing data

In [6]:
# Train data:
os.chdir('/home/matheus_rosso/Arquivo/Features/Datasets/')

df_train = pd.read_csv('new_additional_datasets/dataset_' + str(s) + '.csv',
                       dtype={'order_id': str, 'store_id': int})
df_train.drop_duplicates(['order_id', 'epoch', 'order_amount'], inplace=True)
df_train.reset_index(drop=True, inplace=True)
df_train['date'] = df_train.epoch.apply(epoch_to_date)

# Dropping original categorical features:
cat_vars = get_cat(df_train)
c_vars = [c for c in list(df_train.columns) if 'C#' in c]
na_vars = ['NA#' + c for c in cat_vars if 'NA#' + c in list(df_train.columns)]

df_train = df_train.drop(c_vars, axis=1).drop(na_vars, axis=1)

# Splitting data into train and test (explicit copies avoid chained-assignment warnings later on):
df_test = df_train[(df_train.date > datetime.strptime('2020-03-30', '%Y-%m-%d'))].copy()
df_train = df_train[(df_train.date <= datetime.strptime('2020-03-30', '%Y-%m-%d'))].copy()

# Splitting the later period into validation and test:
df_val = df_test[df_test.date < datetime.strptime('2020-05-01', '%Y-%m-%d')].copy()
df_test = df_test[df_test.date >= datetime.strptime('2020-05-01', '%Y-%m-%d')].copy()

print('\033[1mShape of df_train for store ' + str(s) + ':\033[0m ' + str(df_train.shape) + '.')
print('\033[1mShape of df_val for store ' + str(s) + ':\033[0m ' + str(df_val.shape) + '.')
print('\033[1mShape of df_test for store ' + str(s) + ':\033[0m ' + str(df_test.shape) + '.')
print('\n')

# Accessory variables:
drop_vars = ['y', 'order_amount', 'store_id', 'order_id', 'status', 'epoch', 'date', 'weight']

df_train.head(3)

Shape of df_train for store 6044: (35897, 2173).
Shape of df_val for store 6044: (20940, 2173).
Shape of df_test for store 6044: (21791, 2173).




Unnamed: 0,BILLINGLARGEAREAREPUTATION(),BILLINGSMALLAREAREPUTATION(),"BILLINGZIP(CREDITCARD,10080)","BILLINGZIP(CREDITCARD,1440)","BILLINGZIP(CREDITCARD,21600)","BILLINGZIP(CREDITCARD,360)","BILLINGZIP(CREDITCARD,43200)","BILLINGZIP(CREDITCARD,60)","BILLINGZIP(CREDITCARD,64800)","BILLINGZIP(DOCUMENT,10080)",...,ZIPFIRST3REPUTATION(),ZIPFIRST5REPUTATION(),y,order_amount,order_id,status,epoch,store_id,weight,date
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,156.75,D48D0720681E4F5D9A2767F7174B5FA6-2782006,APPROVED,1577751000000.0,6044,1.0,2019-12-30
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000992,0.0,0.0,67.96,A0EB579C0AE0452D9020C91C54565B4F-2782009,APPROVED,1577751000000.0,6044,1.0,2019-12-30
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.003344,0.0,0.0,315.72,17A1DF0F984E4B34AC512D7E9E23B7BB-2782011,APPROVED,1577751000000.0,6044,1.0,2019-12-30
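
The `epoch_to_date` helper lives in `utils` and is not shown in this notebook. As a rough sketch only, assuming `epoch` holds milliseconds and that dates are read at a fixed UTC-3 offset (an assumption suggested by the sample rows, not confirmed by the source), a chronological split along the same cut-off dates could look like the following. The names `epoch_ms_to_date`, `chronological_split`, and `LOCAL_TZ` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Assumption: timestamps are epoch milliseconds, interpreted at UTC-3.
LOCAL_TZ = timezone(timedelta(hours=-3))


def epoch_ms_to_date(epoch_ms, tz=LOCAL_TZ):
    """Convert epoch milliseconds to a naive datetime truncated to the day."""
    local = datetime.fromtimestamp(epoch_ms / 1000, tz=tz)
    return datetime(local.year, local.month, local.day)


def chronological_split(df, date_col, train_end, val_end):
    """Split rows into train (<= train_end), validation (< val_end), and test."""
    train = df[df[date_col] <= train_end].copy()
    later = df[df[date_col] > train_end]
    val = later[later[date_col] < val_end].copy()
    test = later[later[date_col] >= val_end].copy()
    return train, val, test


# Toy data (hypothetical): one order per period.
toy = pd.DataFrame({"epoch": [1577751000000, 1586000000000, 1589000000000]})
toy["date"] = toy["epoch"].apply(epoch_ms_to_date)
train, val, test = chronological_split(
    toy, "date", datetime(2020, 3, 30), datetime(2020, 5, 1)
)
```

Splitting by date rather than at random keeps validation and test data strictly later than the training period, which mirrors how a fraud model would be used in production.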


In [7]:
# Assessing missing values:
num_miss_train = df_train.isnull().sum().sum()
num_miss_val = df_val.isnull().sum().sum()
num_miss_test = df_test.isnull().sum().sum()

if num_miss_train > 0:
    print('\033[1mProblem - Number of overall missings detected (training data):\033[0m ' +
          str(df_train.isnull().sum().sum()) + '.')
    print('\n')

if num_miss_val > 0:
    print('\033[1mProblem - Number of overall missings detected (validation data):\033[0m ' +
          str(df_val.isnull().sum().sum()) + '.')
    print('\n')
    
if num_miss_test > 0:
    print('\033[1mProblem - Number of overall missings detected (test data):\033[0m ' +
          str(df_test.isnull().sum().sum()) + '.')
    print('\n')

<a id='categorical_features'></a>

### Categorical features

In [8]:
categorical_train = pd.read_csv('new_additional_datasets/categorical_features/dataset_' + str(s) + '.csv',
                      dtype={'order_id': str, 'store_id': int})
categorical_train.drop_duplicates(['order_id', 'epoch', 'order_amount'], inplace=True)

categorical_train['date'] = categorical_train.epoch.apply(epoch_to_date)

# Splitting data into train and test (explicit copies avoid chained-assignment warnings later on):
categorical_test = categorical_train[(categorical_train.date > datetime.strptime('2020-03-30', '%Y-%m-%d'))].copy()
categorical_train = categorical_train[(categorical_train.date <= datetime.strptime('2020-03-30', '%Y-%m-%d'))].copy()

# Splitting the later period into validation and test:
categorical_val = categorical_test[categorical_test.date < datetime.strptime('2020-05-01', '%Y-%m-%d')].copy()
categorical_test = categorical_test[categorical_test.date >= datetime.strptime('2020-05-01', '%Y-%m-%d')].copy()

print('\033[1mShape of categorical_train (training data):\033[0m ' + str(categorical_train.shape) + '.')
print('\033[1mNumber of orders (training data):\033[0m ' + str(categorical_train.order_id.nunique()) + '.')
print('\n')

print('\033[1mShape of categorical_val (validation data):\033[0m ' + str(categorical_val.shape) + '.')
print('\033[1mNumber of orders (validation data):\033[0m ' + str(categorical_val.order_id.nunique()) + '.')
print('\n')

print('\033[1mShape of categorical_test (test data):\033[0m ' + str(categorical_test.shape) + '.')
print('\033[1mNumber of orders (test data):\033[0m ' + str(categorical_test.order_id.nunique()) + '.')
print('\n')

categorical_train.head()

Shape of categorical_train (training data): (35897, 22).
Number of orders (training data): 35897.


Shape of categorical_val (validation data): (20940, 22).
Number of orders (validation data): 20940.


Shape of categorical_test (test data): (21791, 22).
Number of orders (test data): 21791.




Unnamed: 0,BILLINGCITY(),BILLINGSTATE(),BROWSER(),CREDITCARDBRAND(),CREDITCARDCOUNTRY(),CREDITCARDSUBTYPE(),EMAILDOMAIN(),GENDERBYNAMEPTBR(),IPGEOLOCATIONCITY(),IPGEOLOCATIONCOUNTRY(),...,SHIPPINGSTATE(),UTMSOURCELASTCLICK(),y,order_amount,order_id,status,epoch,store_id,weight,date
0,,,,VISA,BR,GOLD,hotmail.com,F,Fartura,BR,...,SP,,0.0,156.75,D48D0720681E4F5D9A2767F7174B5FA6-2782006,APPROVED,1577751000000.0,6044,1.0,2019-12-30
1,,,,MASTERCARD,BR,GOLD,gmail.com,F,São Paulo,BR,...,SP,,0.0,67.96,A0EB579C0AE0452D9020C91C54565B4F-2782009,APPROVED,1577751000000.0,6044,1.0,2019-12-30
2,,,,VISA,BR,CLASSIC,gmail.com,F,Recife,BR,...,AL,,0.0,315.72,17A1DF0F984E4B34AC512D7E9E23B7BB-2782011,APPROVED,1577751000000.0,6044,1.0,2019-12-30
3,,,,MASTERCARD,BR,PLATINUM,gmail.com,F,Guarapari,BR,...,RJ,,0.0,514.15,21CA5C8AA45B400DB55985466AEE0BCD-2782028,APPROVED,1577751000000.0,6044,1.0,2019-12-30
4,,,,ELO/DISCOVER,BR,NANJING DINERS,hotmail.com,M,Curitiba,BR,...,SC,,0.0,64.74,FFC167F3C6C742C9AD26E7E07ED72115-2782055,APPROVED,1577752000000.0,6044,1.0,2019-12-30


#### Treating missing values

In [9]:
print('\033[1mAssessing missing values in categorical data (training data):\033[0m')
print(categorical_train.drop(drop_vars, axis=1).isnull().sum().sort_values(ascending=False))

Assessing missing values in categorical data (training data):
UTMSOURCELASTCLICK()      35793
BROWSER()                 35689
BILLINGSTATE()            32920
BILLINGCITY()             32920
CREDITCARDSUBTYPE()         642
IPGEOLOCATIONCITY()         522
IPGEOLOCATIONCOUNTRY()       20
GENDERBYNAMEPTBR()           12
SHIPPINGSTATE()               0
SHIPPINGCITY()                0
SELLERID()                    0
EMAILDOMAIN()                 0
CREDITCARDCOUNTRY()           0
CREDITCARDBRAND()             0
dtype: int64


In [10]:
print('\033[1mAssessing missing values in categorical data (validation data):\033[0m')
print(categorical_val.drop(drop_vars, axis=1).isnull().sum().sort_values(ascending=False))

Assessing missing values in categorical data (validation data):
UTMSOURCELASTCLICK()      20896
BROWSER()                 20846
BILLINGSTATE()            19447
BILLINGCITY()             19447
CREDITCARDSUBTYPE()         350
IPGEOLOCATIONCITY()         274
GENDERBYNAMEPTBR()           10
IPGEOLOCATIONCOUNTRY()        5
CREDITCARDCOUNTRY()           1
SHIPPINGSTATE()               0
SHIPPINGCITY()                0
SELLERID()                    0
EMAILDOMAIN()                 0
CREDITCARDBRAND()             0
dtype: int64


In [11]:
print('\033[1mAssessing missing values in categorical data (test data):\033[0m')
print(categorical_test.drop(drop_vars, axis=1).isnull().sum().sort_values(ascending=False))

Assessing missing values in categorical data (test data):
UTMSOURCELASTCLICK()      21757
BROWSER()                 21689
BILLINGSTATE()            20084
BILLINGCITY()             20084
IPGEOLOCATIONCITY()        1927
IPGEOLOCATIONCOUNTRY()      492
CREDITCARDSUBTYPE()         455
CREDITCARDCOUNTRY()           2
GENDERBYNAMEPTBR()            1
SHIPPINGSTATE()               0
SHIPPINGCITY()                0
SELLERID()                    0
EMAILDOMAIN()                 0
CREDITCARDBRAND()             0
dtype: int64


In [12]:
# Loop over categorical features, replacing missing values with an explicit category:
for f in categorical_train.drop(drop_vars, axis=1).columns:
    # Training data:
    categorical_train[f] = categorical_train[f].fillna('NA_VALUE')
    
    # Validation data:
    categorical_val[f] = categorical_val[f].fillna('NA_VALUE')
    
    # Test data:
    categorical_test[f] = categorical_test[f].fillna('NA_VALUE')

In [13]:
# Assessing missing values:
if categorical_train.isnull().sum().sum() > 0:
    print('\033[1mProblem - Number of overall missings detected (training data):\033[0m ' +
          str(categorical_train.isnull().sum().sum()) + '.')
    print('\n')

if categorical_val.isnull().sum().sum() > 0:
    print('\033[1mProblem - Number of overall missings detected (validation data):\033[0m ' +
          str(categorical_val.isnull().sum().sum()) + '.')
    print('\n')
    
if categorical_test.isnull().sum().sum() > 0:
    print('\033[1mProblem - Number of overall missings detected (test data):\033[0m ' +
          str(categorical_test.isnull().sum().sum()) + '.')
    print('\n')

#### Treating text data

In [14]:
na_vars = [c for c in categorical_train.drop(drop_vars, axis=1) if 'NA#' in c]

# Loop over categorical features:
for f in categorical_train.drop(drop_vars, axis=1).drop(na_vars, axis=1).columns:
    # Training data:
    categorical_train[f] = categorical_train[f].apply(lambda x: text_clean(str(x)))
    
    # Validation data:
    categorical_val[f] = categorical_val[f].apply(lambda x: text_clean(str(x)))
    
    # Test data:
    categorical_test[f] = categorical_test[f].apply(lambda x: text_clean(str(x)))

categorical_train.head()

Unnamed: 0,BILLINGCITY(),BILLINGSTATE(),BROWSER(),CREDITCARDBRAND(),CREDITCARDCOUNTRY(),CREDITCARDSUBTYPE(),EMAILDOMAIN(),GENDERBYNAMEPTBR(),IPGEOLOCATIONCITY(),IPGEOLOCATIONCOUNTRY(),...,SHIPPINGSTATE(),UTMSOURCELASTCLICK(),y,order_amount,order_id,status,epoch,store_id,weight,date
0,na_value,na_value,na_value,visa,br,gold,hotmail.com,f,fartura,br,...,sp,na_value,0.0,156.75,D48D0720681E4F5D9A2767F7174B5FA6-2782006,APPROVED,1577751000000.0,6044,1.0,2019-12-30
1,na_value,na_value,na_value,mastercard,br,gold,gmail.com,f,sao_paulo,br,...,sp,na_value,0.0,67.96,A0EB579C0AE0452D9020C91C54565B4F-2782009,APPROVED,1577751000000.0,6044,1.0,2019-12-30
2,na_value,na_value,na_value,visa,br,classic,gmail.com,f,recife,br,...,al,na_value,0.0,315.72,17A1DF0F984E4B34AC512D7E9E23B7BB-2782011,APPROVED,1577751000000.0,6044,1.0,2019-12-30
3,na_value,na_value,na_value,mastercard,br,platinum,gmail.com,f,guarapari,br,...,rj,na_value,0.0,514.15,21CA5C8AA45B400DB55985466AEE0BCD-2782028,APPROVED,1577751000000.0,6044,1.0,2019-12-30
4,na_value,na_value,na_value,elo/discover,br,nanjing_diners,hotmail.com,m,curitiba,br,...,sc,na_value,0.0,64.74,FFC167F3C6C742C9AD26E7E07ED72115-2782055,APPROVED,1577752000000.0,6044,1.0,2019-12-30
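
The `text_clean` helper is imported from `utils` and its implementation is not shown. Judging only from the before and after tables (for example `São Paulo` becoming `sao_paulo`), it appears to lower-case text, strip accents, and join words with underscores. A minimal sketch under that assumption follows; `text_clean_sketch` is a hypothetical name, and the real helper may apply further rules.

```python
import unicodedata


def text_clean_sketch(value):
    """Lower-case, strip accents, and replace whitespace with underscores."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    text = unicodedata.normalize("NFKD", str(value))
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse any run of whitespace into a single underscore.
    return "_".join(text.lower().split())


examples = ["São Paulo", "NANJING DINERS", "ELO/DISCOVER", "NA_VALUE"]
cleaned = [text_clean_sketch(v) for v in examples]
```

Normalizing text this way keeps spelling variants of the same category (accents, casing, spacing) from being encoded as distinct levels.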


#### Merging all features

In [15]:
# Training data:
df_train = df_train.merge(categorical_train[[f for f in categorical_train.columns if (f not in drop_vars) |
                                             (f == 'order_id')]],
                          on='order_id', how='left')

print('\033[1mShape of df_train for store ' + str(s) + ':\033[0m ' + str(df_train.shape) + '.')
print('\n')
df_train.head()

Shape of df_train for store 6044: (35897, 2187).




Unnamed: 0,BILLINGLARGEAREAREPUTATION(),BILLINGSMALLAREAREPUTATION(),"BILLINGZIP(CREDITCARD,10080)","BILLINGZIP(CREDITCARD,1440)","BILLINGZIP(CREDITCARD,21600)","BILLINGZIP(CREDITCARD,360)","BILLINGZIP(CREDITCARD,43200)","BILLINGZIP(CREDITCARD,60)","BILLINGZIP(CREDITCARD,64800)","BILLINGZIP(DOCUMENT,10080)",...,CREDITCARDCOUNTRY(),CREDITCARDSUBTYPE(),EMAILDOMAIN(),GENDERBYNAMEPTBR(),IPGEOLOCATIONCITY(),IPGEOLOCATIONCOUNTRY(),SELLERID(),SHIPPINGCITY(),SHIPPINGSTATE(),UTMSOURCELASTCLICK()
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,gold,hotmail.com,f,fartura,br,none,fartura,sp,na_value
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,gold,gmail.com,f,sao_paulo,br,none,santos,sp,na_value
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,classic,gmail.com,f,recife,br,none,maceio,al,na_value
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,platinum,gmail.com,f,guarapari,br,none,itaperuna,rj,na_value
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,nanjing_diners,hotmail.com,m,curitiba,br,none,sao_jose,sc,na_value


In [16]:
# Validation data:
df_val = df_val.merge(categorical_val[[f for f in categorical_val.columns if (f not in drop_vars) |
                                       (f == 'order_id')]],
                      on='order_id', how='left')

print('\033[1mShape of df_val for store ' + str(s) + ':\033[0m ' + str(df_val.shape) + '.')
print('\n')
df_val.head()

Shape of df_val for store 6044: (20940, 2187).




Unnamed: 0,BILLINGLARGEAREAREPUTATION(),BILLINGSMALLAREAREPUTATION(),"BILLINGZIP(CREDITCARD,10080)","BILLINGZIP(CREDITCARD,1440)","BILLINGZIP(CREDITCARD,21600)","BILLINGZIP(CREDITCARD,360)","BILLINGZIP(CREDITCARD,43200)","BILLINGZIP(CREDITCARD,60)","BILLINGZIP(CREDITCARD,64800)","BILLINGZIP(DOCUMENT,10080)",...,CREDITCARDCOUNTRY(),CREDITCARDSUBTYPE(),EMAILDOMAIN(),GENDERBYNAMEPTBR(),IPGEOLOCATIONCITY(),IPGEOLOCATIONCOUNTRY(),SELLERID(),SHIPPINGCITY(),SHIPPINGSTATE(),UTMSOURCELASTCLICK()
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,gold,hotmail.com,f,sao_paulo,br,none,itapecerica_da_serra,sp,na_value
1,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,br,infinite,karseg.com.br,m,campinas,br,none,campinas,sp,na_value
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,gold,gmail.com,m,sao_paulo,br,none,sao_paulo,sp,na_value
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,platinum,gmail.com,f,mairinque,br,none,mairinque,sp,na_value
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,platinum,hotmail.com,f,salvador,br,none,joao_pessoa,pb,na_value


In [17]:
# Test data:
df_test = df_test.merge(categorical_test[[f for f in categorical_test.columns if (f not in drop_vars) |
                                          (f == 'order_id')]],
                        on='order_id', how='left')

print('\033[1mShape of df_test for store ' + str(s) + ':\033[0m ' + str(df_test.shape) + '.')
print('\n')
df_test.head()

Shape of df_test for store 6044: (21791, 2187).




Unnamed: 0,BILLINGLARGEAREAREPUTATION(),BILLINGSMALLAREAREPUTATION(),"BILLINGZIP(CREDITCARD,10080)","BILLINGZIP(CREDITCARD,1440)","BILLINGZIP(CREDITCARD,21600)","BILLINGZIP(CREDITCARD,360)","BILLINGZIP(CREDITCARD,43200)","BILLINGZIP(CREDITCARD,60)","BILLINGZIP(CREDITCARD,64800)","BILLINGZIP(DOCUMENT,10080)",...,CREDITCARDCOUNTRY(),CREDITCARDSUBTYPE(),EMAILDOMAIN(),GENDERBYNAMEPTBR(),IPGEOLOCATIONCITY(),IPGEOLOCATIONCOUNTRY(),SELLERID(),SHIPPINGCITY(),SHIPPINGSTATE(),UTMSOURCELASTCLICK()
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,nanjing,adp.com,m,porto_alegre,br,none,porto_alegre,rs,na_value
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,platinum,uol.com.br,f,jundiai,br,none,jundiai,sp,na_value
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,black,uol.com.br,m,itanhaem,br,none,itanhaem,sp,na_value
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,gold,gmail.com,f,santa_maria,br,none,santa_maria,rs,na_value
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,br,gold,gmail.com,m,sao_paulo,br,none,sao_paulo,sp,na_value


In [18]:
# Checking that merging did not change the number of missing values (training data):
if df_train.isnull().sum().sum() != num_miss_train:
    print('\033[1mInconsistent number of overall missing values (training data)!\033[0m')
    print('\n')

# Checking that merging did not change the number of missing values (validation data):
if df_val.isnull().sum().sum() != num_miss_val:
    print('\033[1mInconsistent number of overall missing values (validation data)!\033[0m')
    print('\n')
    
# Checking that merging did not change the number of missing values (test data):
if df_test.isnull().sum().sum() != num_miss_test:
    print('\033[1mInconsistent number of overall missing values (test data)!\033[0m')
    print('\n')

<a id='model_assessment'></a>

### Model assessment

In [19]:
os.chdir('/home/matheus_rosso/Arquivo/Materiais/Codes/neural_nets/')

In [20]:
# Logistic regression estimations:
if 'model_assessment_LR.json' not in os.listdir('Datasets'):
    model_assessment_LR = {}

else:
    with open('Datasets/model_assessment_LR.json') as json_file:
        model_assessment_LR = json.load(json_file)

In [21]:
# SVM estimations:
if 'model_assessment_SVM.json' not in os.listdir('Datasets'):
    model_assessment_SVM = {}

else:
    with open('Datasets/model_assessment_SVM.json') as json_file:
        model_assessment_SVM = json.load(json_file)

In [22]:
# GBM estimations:
if 'model_assessment_GBM.json' not in os.listdir('Datasets'):
    model_assessment_GBM = {}

else:
    with open('Datasets/model_assessment_GBM.json') as json_file:
        model_assessment_GBM = json.load(json_file)
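
The three cells above repeat the same "load if present, otherwise start empty" pattern. One way to express it once is sketched below, assuming nothing beyond the standard library. The function name `load_assessment` and the placeholder record are hypothetical, and a temporary directory stands in for the `Datasets` folder.

```python
import json
import os
import tempfile


def load_assessment(path):
    """Return the stored assessment dictionary, or an empty one if absent."""
    if not os.path.exists(path):
        return {}
    with open(path) as json_file:
        return json.load(json_file)


# Demonstration in a temporary directory (stand-in for 'Datasets'):
with tempfile.TemporaryDirectory() as tmp_dir:
    missing = load_assessment(os.path.join(tmp_dir, "model_assessment_LR.json"))

    stored_path = os.path.join(tmp_dir, "model_assessment_SVM.json")
    with open(stored_path, "w") as json_file:
        json.dump({"example_id": {"note": "placeholder"}}, json_file)
    stored = load_assessment(stored_path)
```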

<a id='classif_feat'></a>

### Classifying features

In [23]:
# Categorical features:
cat_vars = list(categorical_train.drop(drop_vars, axis=1).columns)

# Dummy variables indicating missing value status:
missing_vars = [c for c in list(df_train.drop(drop_vars, axis=1).columns) if ('NA#' in c)]

# Dropping features with no variance (categorical and missing-status columns are excluded from the check):
no_variance = [c for c in df_train.drop(drop_vars + cat_vars + missing_vars, axis=1)
               if df_train[c].var() == 0]

if len(no_variance) > 0:
    df_train.drop(no_variance, axis=1, inplace=True)
    df_val.drop(no_variance, axis=1, inplace=True)
    df_test.drop(no_variance, axis=1, inplace=True)

# Numerical features:
cont_vars = [c for c in  list(df_train.drop(drop_vars, axis=1).columns) if is_velocity(c)]

# Binary features:
binary_vars = [c for c in list(df_train.drop([c for c in df_train.columns if (c in drop_vars) |
                                             (c in cat_vars) | (c in missing_vars) | (c in cont_vars)],
                                             axis=1).columns) if set(df_train[c].unique()) == set([0,1])]

# Updating the list of numerical features:
for c in list(df_train.drop(drop_vars, axis=1).columns):
    if (c not in cat_vars) & (c not in missing_vars) & (c not in cont_vars) & (c not in binary_vars):
        cont_vars.append(c)

# Dataframe presenting the frequency of features by class:
feats_assess = pd.DataFrame(data={
    'class': ['cat_vars', 'missing_vars', 'binary_vars', 'cont_vars', 'drop_vars'],
    'frequency': [len(cat_vars), len(missing_vars), len(binary_vars), len(cont_vars), len(drop_vars)]
})
feats_assess.sort_values('frequency', ascending=False)

Unnamed: 0,class,frequency
3,cont_vars,1619
1,missing_vars,415
2,binary_vars,27
0,cat_vars,14
4,drop_vars,8
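
The classification logic above can be illustrated on a small frame. The sketch below is a simplification: it ignores the name-based `is_velocity` check and the categorical and `NA#` columns, and only shows how zero-variance, binary, and continuous columns are told apart. The function name `classify_numeric_columns` and the toy columns are hypothetical.

```python
import pandas as pd


def classify_numeric_columns(df):
    """Split numeric columns into zero-variance, binary, and continuous groups."""
    groups = {"no_variance": [], "binary": [], "continuous": []}
    for col in df.columns:
        values = set(df[col].dropna().unique())
        if df[col].var() == 0:
            # Constant columns carry no information for the models.
            groups["no_variance"].append(col)
        elif values == {0, 1}:
            groups["binary"].append(col)
        else:
            groups["continuous"].append(col)
    return groups


toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "flag": [0, 1, 1, 0],
    "amount": [10.5, 3.2, 8.0, 1.1],
})
groups = classify_numeric_columns(toy)
```

Keeping binary columns separate matters later on, since they should not be log-transformed or standardized.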


<a id='data_pre_proc'></a>

## Data pre-processing

<a id='assessing_missing'></a>

### Assessing missing values

#### Recreating missing values

In [24]:
missing_vars = [f for f in df_train.columns if 'NA#' in f]

# Loop over variables with missing values:
for f in [c for c in missing_vars if c.replace('NA#', '') not in cat_vars]:
    if f.replace('NA#', '') in df_train.columns:
        # Training data:
        df_train[f.replace('NA#', '')] = recreate_missings(df_train[f.replace('NA#', '')], df_train[f])
        
        # Validation data:
        df_val[f.replace('NA#', '')] = recreate_missings(df_val[f.replace('NA#', '')], df_val[f])
        
        # Test data:
        df_test[f.replace('NA#', '')] = recreate_missings(df_test[f.replace('NA#', '')], df_test[f])
    else:
        df_train.drop([f], axis=1, inplace=True)
        
        df_val.drop([f], axis=1, inplace=True)
        
        df_test.drop([f], axis=1, inplace=True)

In [25]:
# Dropping all variables with missing value status:
df_train.drop([f for f in df_train.columns if 'NA#' in f], axis=1, inplace=True)

df_val.drop([f for f in df_val.columns if 'NA#' in f], axis=1, inplace=True)

df_test.drop([f for f in df_test.columns if 'NA#' in f], axis=1, inplace=True)
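
The `recreate_missings` function comes from the `transformations` module and is not shown here. The cells above suggest that each `NA#<feature>` column flags rows where `<feature>` was originally missing and had been filled with a placeholder. Under that assumption, a minimal sketch of restoring the missing values is given below; `recreate_missings_sketch` and the column `AMOUNT()` are hypothetical.

```python
import numpy as np
import pandas as pd


def recreate_missings_sketch(values, missing_flag):
    """Set entries to NaN wherever the companion indicator equals 1."""
    # Series.where keeps values where the condition holds and replaces the rest.
    return values.where(missing_flag != 1, np.nan)


toy = pd.DataFrame({
    "AMOUNT()": [0.0, 12.5, 0.0],
    "NA#AMOUNT()": [1, 0, 0],
})
toy["AMOUNT()"] = recreate_missings_sketch(toy["AMOUNT()"], toy["NA#AMOUNT()"])
```

In the toy frame, the first zero is a placeholder and becomes `NaN`, while the third row keeps its genuine zero.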

#### Describing the frequency of missing values

In [26]:
# Dataframe with the number of missings by feature (training data):
missings_dict = df_train.isnull().sum().sort_values(ascending=False).to_dict()

missings_assess_train = pd.DataFrame(data={
    'feature': list(missings_dict.keys()),
    'missings': list(missings_dict.values())
})

print('\033[1mNumber of features with missings:\033[0m {}'.format(sum(missings_assess_train.missings > 0)) +
      ' out of {} features'.format(len(missings_assess_train)) +
      ' ({}%).'.format(round((sum(missings_assess_train.missings > 0)/len(missings_assess_train))*100, 2)))
print('\033[1mAverage number of missings:\033[0m {}'.format(int(missings_assess_train.missings.mean())) +
      ' out of {} observations'.format(len(df_train)) +
      ' ({}%).'.format(round((int(missings_assess_train.missings.mean())/len(df_train))*100,2)))
print('\n')
missings_assess_train.index.name = 'training_data'
missings_assess_train.head(10)

Number of features with missings: 389 out of 1668 features (23.32%).
Average number of missings: 7108 out of 35897 observations (19.8%).




training_data,feature,missings
0,"CUSTNAVCOUNT(cv,6M)",35601
1,"GDOCUMENT(TOTAL_AMOUNT,60)",35574
2,"GTELEPHONE(TOTAL_AMOUNT,360)",35540
3,"NAME(TOTAL_AMOUNT,1440)",35494
4,"IP(TOTAL_AMOUNT,1440)",35490
5,"EMAIL(TOTAL_AMOUNT,1440)",35482
6,"DOCUMENT(TOTAL_AMOUNT,1440)",35459
7,"CREDITCARD(TOTAL_AMOUNT,1440)",35458
8,"TELEPHONE(TOTAL_AMOUNT,1440)",35447
9,"GEMAIL(TOTAL_AMOUNT,360)",35387


In [27]:
# Dataframe with the number of missings by feature (validation data):
missings_dict = df_val.isnull().sum().sort_values(ascending=False).to_dict()

missings_assess_val = pd.DataFrame(data={
    'feature': list(missings_dict.keys()),
    'missings': list(missings_dict.values())
})

print('\033[1mNumber of features with missings:\033[0m {}'.format(sum(missings_assess_val.missings > 0)) +
      ' out of {} features'.format(len(missings_assess_val)) +
      ' ({}%).'.format(round((sum(missings_assess_val.missings > 0)/len(missings_assess_val))*100, 2)))
print('\033[1mAverage number of missings:\033[0m {}'.format(int(missings_assess_val.missings.mean())) +
      ' out of {} observations'.format(len(df_val)) +
      ' ({}%).'.format(round((int(missings_assess_val.missings.mean())/len(df_val))*100,2)))
print('\n')
missings_assess_val.index.name = 'val_data'
missings_assess_val.head(10)

Number of features with missings: 389 out of 1668 features (23.32%).
Average number of missings: 4176 out of 20940 observations (19.94%).




val_data,feature,missings
0,"CUSTNAVCOUNT(cv,6M)",20741
1,"GDOCUMENT(TOTAL_AMOUNT,60)",20731
2,"GTELEPHONE(TOTAL_AMOUNT,360)",20710
3,"NAME(TOTAL_AMOUNT,1440)",20685
4,"EMAIL(TOTAL_AMOUNT,1440)",20674
5,"CREDITCARD(TOTAL_AMOUNT,1440)",20658
6,"IP(TOTAL_AMOUNT,1440)",20657
7,"DOCUMENT(TOTAL_AMOUNT,1440)",20657
8,"TELEPHONE(TOTAL_AMOUNT,1440)",20654
9,FSBZIPPHONE(),20630


In [28]:
# Dataframe with the number of missings by feature (test data):
missings_dict = df_test.isnull().sum().sort_values(ascending=False).to_dict()

missings_assess_test = pd.DataFrame(data={
    'feature': list(missings_dict.keys()),
    'missings': list(missings_dict.values())
})

print('\033[1mNumber of features with missings:\033[0m {}'.format(sum(missings_assess_test.missings > 0)) +
      ' out of {} features'.format(len(missings_assess_test)) +
      ' ({}%).'.format(round((sum(missings_assess_test.missings > 0)/len(missings_assess_test))*100, 2)))
print('\033[1mAverage number of missings:\033[0m {}'.format(int(missings_assess_test.missings.mean())) +
      ' out of {} observations'.format(len(df_test)) +
      ' ({}%).'.format(round((int(missings_assess_test.missings.mean())/len(df_test))*100,2)))
print('\n')
missings_assess_test.index.name = 'test_data'
missings_assess_test.head(10)

Number of features with missings: 389 out of 1668 features (23.32%).
Average number of missings: 4302 out of 21791 observations (19.74%).




test_data,feature,missings
0,"GDOCUMENT(TOTAL_AMOUNT,60)",21521
1,"GTELEPHONE(TOTAL_AMOUNT,360)",21512
2,"EMAIL(TOTAL_AMOUNT,1440)",21493
3,"CUSTNAVCOUNT(cv,6M)",21490
4,"NAME(TOTAL_AMOUNT,1440)",21482
5,"TELEPHONE(TOTAL_AMOUNT,1440)",21476
6,"DOCUMENT(TOTAL_AMOUNT,1440)",21475
7,"CREDITCARD(TOTAL_AMOUNT,1440)",21472
8,"IP(TOTAL_AMOUNT,1440)",21466
9,FSBZIPPHONE(),21444
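
The three summaries above share the same computation, so it could be wrapped in one helper. The sketch below is one possible arrangement on toy data, not the notebook's own code; `summarize_missings` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def summarize_missings(df, index_name):
    """Return per-feature missing counts (descending) plus two summary figures."""
    counts = df.isnull().sum().sort_values(ascending=False)
    table = pd.DataFrame({"feature": counts.index, "missings": counts.values})
    table.index.name = index_name
    # Percentage of features with at least one missing value:
    share_with_missings = round(float((table.missings > 0).mean()) * 100, 2)
    # Average number of missing values per feature, truncated to an integer:
    avg_missings = int(table.missings.mean())
    return table, share_with_missings, avg_missings


toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, 4.0],
})
table, share, avg = summarize_missings(toy, "toy_data")
```

Calling the helper once per dataset would replace the three near-identical cells while producing the same tables.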


<a id='num_transf'></a>

### Transforming numerical features

#### Logarithmic transformation
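
The `log_transformation` class is defined in the `transformations` module and is not reproduced in this notebook. As a rough illustration of the idea only, assuming the transformation is `log(x + 1)` applied to every column outside `not_log` (the exact formula used by the class is not confirmed by the source), a sketch could be written as follows. The name `log_transform_sketch` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def log_transform_sketch(df, not_log):
    """Apply log(x + 1) to every column not listed in not_log; NaN stays NaN."""
    out = df.copy()
    cols = [c for c in out.columns if c not in not_log]
    out[cols] = np.log1p(out[cols])
    return out


toy = pd.DataFrame({"amount": [0.0, np.nan, 99.0], "y": [0, 1, 0]})
logged = log_transform_sketch(toy, not_log=["y"])
```

Because missing entries remain missing, the total count of missing values should be identical before and after the transformation, which is the consistency check performed in the cell that follows.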

In [29]:
print('---------------------------------------------------------------------------------------------------------')
print('\033[1mAPPLYING LOGARITHMIC TRANSFORMATION OVER NUMERICAL DATA\033[0m')
print('\n')
# Variables that should not be log-transformed:
not_log = [c for c in df_train.columns if c not in cont_vars]

if log_transform:
    print('\033[1mTraining data:\033[0m')

    # Assessing missing values (before logarithmic transformation):
    num_miss_train = df_train.isnull().sum().sum()
    if num_miss_train > 0:
        print('\033[1mNumber of overall missings detected (before logarithmic transformation):\033[0m ' +
              str(num_miss_train) + '.')

    log_transf = log_transformation(not_log=not_log)
    log_transf.transform(df_train)
    df_train = log_transf.log_transformed

    # Assessing missing values (after logarithmic transformation):
    num_miss_train_log = df_train.isnull().sum().sum()
    if num_miss_train_log > 0:
        print('\033[1mNumber of overall missings detected (after logarithmic transformation):\033[0m ' + 
              str(num_miss_train_log) + '.')

    # Checking consistency in the number of missings:
    if num_miss_train_log != num_miss_train:
        print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')

    print('\n')
    print('\033[1mValidation data:\033[0m')

    # Assessing missing values (before logarithmic transformation):
    num_miss_val = df_val.isnull().sum().sum()
    if num_miss_val > 0:
        print('\033[1mNumber of overall missings detected (before logarithmic transformation):\033[0m ' +
              str(num_miss_val) + '.')

    log_transf = log_transformation(not_log=not_log)
    log_transf.transform(df_val)
    df_val = log_transf.log_transformed

    # Assessing missing values (after logarithmic transformation):
    num_miss_val_log = df_val.isnull().sum().sum()
    if num_miss_val_log > 0:
        print('\033[1mNumber of overall missings detected (after logarithmic transformation):\033[0m ' + 
              str(num_miss_val_log) + '.')

    # Checking consistency in the number of missings:
    if num_miss_val_log != num_miss_val:
        print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')
        
    print('\n')
    print('\033[1mTest data:\033[0m')

    # Assessing missing values (before logarithmic transformation):
    num_miss_test = df_test.isnull().sum().sum()
    if num_miss_test > 0:
        print('\033[1mNumber of overall missings detected (before logarithmic transformation):\033[0m ' +
              str(num_miss_test) + '.')

    log_transf = log_transformation(not_log=not_log)
    log_transf.transform(df_test)
    df_test = log_transf.log_transformed

    # Assessing missing values (after logarithmic transformation):
    num_miss_test_log = df_test.isnull().sum().sum()
    if num_miss_test_log > 0:
        print('\033[1mNumber of overall missings detected (after logarithmic transformation):\033[0m ' + 
              str(num_miss_test_log) + '.')

    # Checking consistency in the number of missings:
    if num_miss_test_log != num_miss_test:
        print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')

else:
    print('\033[1mNo transformation performed!\033[0m')

print('\n')
print('---------------------------------------------------------------------------------------------------------')
print('\n')

---------------------------------------------------------------------------------------------------------
APPLYING LOGARITHMIC TRANSFORMATION OVER NUMERICAL DATA


Training data:
Number of overall missings detected (before logarithmic transformation): 11856255.
Number of numerical variables log-transformed: 1619.
Number of overall missings detected (after logarithmic transformation): 11856255.


Validation data:
Number of overall missings detected (before logarithmic transformation): 6967103.
Number of numerical variables log-transformed: 1619.
Number of overall missings detected (after logarithmic transformation): 6967103.


Test data:
Number of overall missings detected (before logarithmic transformation): 7177024.
Number of numerical variables log-transformed: 1619.
Number of overall missings detected (after logarithmic transformation): 7177024.


---------------------------------
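The `log_transformation` class is defined earlier in the notebook and is not reproduced in this section. As a rough illustration only, the sketch below assumes that it applies log(1 + x) to the continuous columns, renames them with the `L#` prefix that the standardization cell later strips, and leaves missing values untouched (which is what the consistency check on the number of missings relies on). The function name `log_transform_sketch` and the use of `np.log1p` are assumptions, not the notebook's implementation.

```python
import numpy as np
import pandas as pd


def log_transform_sketch(df, not_log):
    """Hypothetical stand-in for log_transformation: log(1 + x) on columns not in not_log."""
    out = df.copy()
    log_cols = [c for c in df.columns if c not in not_log]
    # np.log1p propagates NaN, so the number of missing values is unchanged.
    out[log_cols] = np.log1p(out[log_cols])
    # Prefix transformed columns with 'L#', mirroring the naming seen later in the notebook.
    return out.rename(columns={c: 'L#' + c for c in log_cols})


toy = pd.DataFrame({'id': [1, 2, 3], 'amount': [0.0, np.nan, 9.0]})
transformed = log_transform_sketch(toy, not_log=['id'])
```

Under these assumptions, comparing `isnull().sum().sum()` before and after the call gives the same number, as in the output above.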

#### Standardizing numerical features

In [30]:
print('---------------------------------------------------------------------------------------------------------')
print('\033[1mAPPLYING STANDARD SCALE TRANSFORMATION OVER NUMERICAL DATA\033[0m')
print('\n')
# Inputs that should not be standardized:
not_stand = [c for c in df_train.columns if c.replace('L#', '') not in cont_vars]

if standardize:
    print('\033[1mTraining data:\033[0m')

    stand_scale = standard_scale(not_stand = not_stand)
    
    stand_scale.scale(train = df_train, test = df_val)
    
    df_train_scaled = stand_scale.train_scaled
    print('\033[1mShape of df_train_scaled (after scaling):\033[0m ' + str(df_train_scaled.shape) + '.')

    # Assessing missing values (after standardizing numerical features):
    num_miss_train = df_train.isnull().sum().sum()
    num_miss_train_scaled = df_train_scaled.isnull().sum().sum()
    if num_miss_train_scaled > 0:
        print('\033[1mNumber of overall missings:\033[0m ' + str(num_miss_train_scaled) + '.')
    else:
        print('\033[1mNo missing values detected (training data)!\033[0m')

    if num_miss_train_scaled != num_miss_train:
        print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')
    
    print('\n')
    print('\033[1mValidation data:\033[0m')
    df_val_scaled = stand_scale.test_scaled
    print('\033[1mShape of df_val_scaled (after scaling):\033[0m ' + str(df_val_scaled.shape) + '.')

    # Assessing missing values (after standardizing numerical features):
    num_miss_val = df_val.isnull().sum().sum()
    num_miss_val_scaled = df_val_scaled.isnull().sum().sum()
    if num_miss_val_scaled > 0:
        print('\033[1mNumber of overall missings:\033[0m ' + str(num_miss_val_scaled) + '.')
    else:
        print('\033[1mNo missing values detected (val data)!\033[0m')

    if num_miss_val_scaled != num_miss_val:
        print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')
        
    print('\n')
    print('\033[1mTest data:\033[0m')
    stand_scale.scale(train = df_train, test = df_test)
    df_test_scaled = stand_scale.test_scaled
    print('\033[1mShape of df_test_scaled (after scaling):\033[0m ' + str(df_test_scaled.shape) + '.')

    # Assessing missing values (after standardizing numerical features):
    num_miss_test = df_test.isnull().sum().sum()
    num_miss_test_scaled = df_test_scaled.isnull().sum().sum()
    if num_miss_test_scaled > 0:
        print('\033[1mNumber of overall missings:\033[0m ' + str(num_miss_test_scaled) + '.')
    else:
        print('\033[1mNo missing values detected (test data)!\033[0m')

    if num_miss_test_scaled != num_miss_test:
        print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')

else:
    df_train_scaled = df_train.copy()
    df_val_scaled = df_val.copy()
    df_test_scaled = df_test.copy()
    
    print('\033[1mNo transformation performed!\033[0m')

print('\n')
print('---------------------------------------------------------------------------------------------------------')
print('\n')

---------------------------------------------------------------------------------------------------------
APPLYING STANDARD SCALE TRANSFORMATION OVER NUMERICAL DATA


Training data:
Shape of df_train_scaled (after scaling): (35897, 1668).
Number of overall missings: 11856255.


Validation data:
Shape of df_val_scaled (after scaling): (20940, 1668).
Number of overall missings: 6967103.


Test data:
Shape of df_test_scaled (after scaling): (21791, 1668).
Number of overall missings: 7177024.


---------------------------------------------------------------------------------------------------------
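The `standard_scale` class is also defined outside this section. The sketch below shows one way the same interface could behave, assuming that means and standard deviations are estimated on the training data only and then applied to both dataframes. The function name, the population standard deviation (`ddof=0`) and the guard for constant columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def standard_scale_sketch(train, test, not_stand):
    """Hypothetical stand-in for standard_scale: z-scores using training statistics only."""
    cols = [c for c in train.columns if c not in not_stand]
    # pandas skips NaN when computing mean and std, so missing entries stay missing.
    mean = train[cols].mean()
    std = train[cols].std(ddof=0).replace(0, 1.0)  # guard against constant columns
    train_scaled, test_scaled = train.copy(), test.copy()
    train_scaled[cols] = (train[cols] - mean) / std
    test_scaled[cols] = (test[cols] - mean) / std
    return train_scaled, test_scaled


train = pd.DataFrame({'y': [0, 1, 0, 1], 'x': [1.0, 2.0, np.nan, 3.0]})
test = pd.DataFrame({'y': [1, 0], 'x': [2.0, np.nan]})
train_scaled, test_scaled = standard_scale_sketch(train, test, not_stand=['y'])
```

Reusing the training statistics for the validation and test data keeps information from those splits out of the fitted transformation.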




#### Treating missing values

In [31]:
print('---------------------------------------------------------------------------------------------------------')
print('\033[1mTREATING MISSING VALUES\033[0m')
print('\n')

print('\033[1mTraining data:\033[0m')
num_miss_train = df_train_scaled.isnull().sum().sum()
print('\033[1mNumber of overall missing values detected before treatment:\033[0m ' +
      str(num_miss_train) + '.')

# Loop over features:
for f in df_train_scaled.drop(drop_vars, axis=1):
    # Checking if there is missing values for a given feature:
    if df_train_scaled[f].isnull().sum() > 0:
        check_missing = impute_missing(df_train_scaled[f])
        df_train_scaled[f] = check_missing['var']
        df_train_scaled['NA#' + f.replace('L#', '')] = check_missing['missing_var']

num_miss_train_treat = int(sum([sum(df_train_scaled[f]) for f in df_train_scaled.columns if 'NA#' in f]))
print('\033[1mNumber of overall missing values detected during treatment:\033[0m ' +
      str(num_miss_train_treat) + '.')

if num_miss_train_treat != num_miss_train:
    print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')

if df_train_scaled.isnull().sum().sum() > 0:
    print('\033[1mProblem - Number of overall missings detected (training data):\033[0m ' +
          str(df_train_scaled.isnull().sum().sum()) + '.')

print('\n')
print('\033[1mValidation data:\033[0m')
num_miss_val = df_val_scaled.isnull().sum().sum()
num_miss_val_treat = 0
print('\033[1mNumber of overall missing values detected before treatment:\033[0m ' + str(num_miss_val) + '.')

# Loop over features:
for f in df_val_scaled.drop(drop_vars, axis=1):
    # Check if there is dummy variable of missing value status for training data:
    if 'NA#' + f.replace('L#', '') in list(df_train_scaled.columns):
        check_missing = impute_missing(df_val_scaled[f])
        df_val_scaled[f] = check_missing['var']
        df_val_scaled['NA#' + f.replace('L#', '')] = check_missing['missing_var']
    else:
        # Checking if there are missings for variables without missings in training data:
        if df_val_scaled[f].isnull().sum() > 0:
            num_miss_val_treat += df_val_scaled[f].isnull().sum()
            df_val_scaled[f] = df_val_scaled[f].fillna(0)

num_miss_val_treat += int(sum([sum(df_val_scaled[f]) for f in df_val_scaled.columns if 'NA#' in f]))
print('\033[1mNumber of overall missing values detected during treatment:\033[0m ' +
      str(num_miss_val_treat) + '.')

if num_miss_val_treat != num_miss_val:
    print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')

if df_val_scaled.isnull().sum().sum() > 0:
    print('\033[1mProblem - Number of overall missings detected (val data):\033[0m ' +
          str(df_val_scaled.isnull().sum().sum()) + '.')
    
print('\n')
print('\033[1mTest data:\033[0m')
num_miss_test = df_test_scaled.isnull().sum().sum()
num_miss_test_treat = 0
print('\033[1mNumber of overall missing values detected before treatment:\033[0m ' + str(num_miss_test) + '.')

# Loop over features:
for f in df_test_scaled.drop(drop_vars, axis=1):
    # Check if there is dummy variable of missing value status for training data:
    if 'NA#' + f.replace('L#', '') in list(df_train_scaled.columns):
        check_missing = impute_missing(df_test_scaled[f])
        df_test_scaled[f] = check_missing['var']
        df_test_scaled['NA#' + f.replace('L#', '')] = check_missing['missing_var']
    else:
        # Checking if there are missings for variables without missings in training data:
        if df_test_scaled[f].isnull().sum() > 0:
            num_miss_test_treat += df_test_scaled[f].isnull().sum()
            df_test_scaled[f] = df_test_scaled[f].fillna(0)

num_miss_test_treat += int(sum([sum(df_test_scaled[f]) for f in df_test_scaled.columns if 'NA#' in f]))
print('\033[1mNumber of overall missing values detected during treatment:\033[0m ' +
      str(num_miss_test_treat) + '.')

if num_miss_test_treat != num_miss_test:
    print('\033[1mProblem - Inconsistent number of overall missings!\033[0m')

if df_test_scaled.isnull().sum().sum() > 0:
    print('\033[1mProblem - Number of overall missings detected (test data):\033[0m ' +
          str(df_test_scaled.isnull().sum().sum()) + '.')

print('\n')
print('---------------------------------------------------------------------------------------------------------')
print('\n')

---------------------------------------------------------------------------------------------------------
TREATING MISSING VALUES


Training data:
Number of overall missing values detected before treatment: 11856255.
Number of overall missing values detected during treatment: 11856255.


Validation data:
Number of overall missing values detected before treatment: 6967103.
Number of overall missing values detected during treatment: 6967103.


Test data:
Number of overall missing values detected before treatment: 7177024.
Number of overall missing values detected during treatment: 7177024.


---------------------------------------------------------------------------------------------------------
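The cell above relies on `impute_missing`, which returns a dictionary with the treated feature under `'var'` and a missing-value indicator under `'missing_var'`. Its definition is not part of this section, so the sketch below is only a guess at its behavior: it assumes that missing entries are replaced by 0 (the value used in the `fillna` branch above) and that the indicator equals 1 where the original value was missing.

```python
import numpy as np
import pandas as pd


def impute_missing_sketch(var, fill_value=0.0):
    """Hypothetical stand-in for impute_missing: fill NaN and flag where it occurred."""
    missing_var = var.isnull().astype(int)  # 1 where the original value was missing
    return {'var': var.fillna(fill_value), 'missing_var': missing_var}


feature = pd.Series([0.5, np.nan, -1.2, np.nan], name='L#AMOUNT')
result = impute_missing_sketch(feature)
```

Summing the indicator columns recovers the number of values that were missing before treatment, which is the quantity the cell compares against the original count.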




<a id='categorical_transf'></a>

### Transforming categorical features

#### Creating dummies through one-hot encoding

In [32]:
# Create object for one-hot encoding:
categorical_transf = one_hot_encoding(categorical_features = cat_vars)

# Creating dummies:
categorical_transf.create_dummies(categorical_train = categorical_train,
                                  categorical_test = categorical_val)

# Selected dummies:
dummy_vars = list(categorical_transf.dummies_train.columns)

# Training data:
dummies_train = categorical_transf.dummies_train
dummies_train.index = df_train_scaled.index

# Validation data:
dummies_val = categorical_transf.dummies_test
dummies_val.index = df_val_scaled.index

# Create object for one-hot encoding:
categorical_transf = one_hot_encoding(categorical_features = cat_vars)

# Creating dummies:
categorical_transf.create_dummies(categorical_train = categorical_train,
                                  categorical_test = categorical_test)

# Test data:
dummies_test = categorical_transf.dummies_test
dummies_test.index = df_test_scaled.index

# Dropping original categorical features:
df_train_scaled.drop(cat_vars, axis=1, inplace=True)
df_val_scaled.drop(cat_vars, axis=1, inplace=True)
df_test_scaled.drop(cat_vars, axis=1, inplace=True)

print('\033[1mNumber of categorical features:\033[0m {}.'.format(len(categorical_transf.categorical_features)))
print('\033[1mNumber of overall selected dummies:\033[0m {}.'.format(dummies_train.shape[1]))
print('\033[1mShape of dummies_train for store ' + str(s) + ':\033[0m ' +
      str(dummies_train.shape) + '.')
print('\033[1mShape of dummies_val for store ' + str(s) + ':\033[0m ' +
      str(dummies_val.shape) + '.')
print('\033[1mShape of dummies_test for store ' + str(s) + ':\033[0m ' +
      str(dummies_test.shape) + '.')
print('\n')

dummies_train.head()

Number of categorical features: 14.
Number of overall selected dummies: 62.
Shape of dummies_train for store 6044: (35897, 62).
Shape of dummies_val for store 6044: (20940, 62).
Shape of dummies_test for store 6044: (21791, 62).




Unnamed: 0,C#BILLINGCITY()#NA_VALUE,C#BILLINGCITY()#SAO_PAULO,C#BILLINGSTATE()#NA_VALUE,C#BILLINGSTATE()#SP,C#CREDITCARDBRAND()#AMERICAN_EXPRESS,C#CREDITCARDBRAND()#ELO/DISCOVER,C#CREDITCARDBRAND()#HIPERCARD,C#CREDITCARDBRAND()#MASTERCARD,C#CREDITCARDBRAND()#VISA,C#CREDITCARDSUBTYPE()#BLACK,...,C#SHIPPINGSTATE()#DF,C#SHIPPINGSTATE()#ES,C#SHIPPINGSTATE()#GO,C#SHIPPINGSTATE()#MG,C#SHIPPINGSTATE()#PE,C#SHIPPINGSTATE()#PR,C#SHIPPINGSTATE()#RJ,C#SHIPPINGSTATE()#RS,C#SHIPPINGSTATE()#SC,C#SHIPPINGSTATE()#SP
0,1,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
1,1,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
4,1,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
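The dummies for validation and test data have the same 62 columns as the training dummies, which suggests that the set of categories is fixed on the training data. A minimal sketch of that idea with `pd.get_dummies` follows. The function name is hypothetical, and any category selection or grouping performed by the notebook's `one_hot_encoding` class (such as the `NA_VALUE` level) is not reproduced.

```python
import pandas as pd


def one_hot_sketch(categorical_train, categorical_test):
    """Hypothetical stand-in for one_hot_encoding.create_dummies."""
    dummies_train = pd.get_dummies(categorical_train, prefix_sep='#', dtype=int)
    dummies_test = pd.get_dummies(categorical_test, prefix_sep='#', dtype=int)
    # Keep exactly the training columns, in the same order. Categories unseen in
    # training are dropped; categories absent from the test split become zeros.
    dummies_test = dummies_test.reindex(columns=dummies_train.columns, fill_value=0)
    return dummies_train, dummies_test


cat_train = pd.DataFrame({'STATE': ['SP', 'RJ', 'SP'], 'BRAND': ['VISA', 'VISA', 'ELO']})
cat_test = pd.DataFrame({'STATE': ['SP', 'MG'], 'BRAND': ['VISA', 'VISA']})
d_train, d_test = one_hot_sketch(cat_train, cat_test)
```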


#### Concatenating all features

In [33]:
df_train_scaled = pd.concat([df_train_scaled, dummies_train], axis=1)
df_val_scaled = pd.concat([df_val_scaled, dummies_val], axis=1)
df_test_scaled = pd.concat([df_test_scaled, dummies_test], axis=1)

print('\033[1mShape of df_train_scaled for store ' + str(s) + ':\033[0m ' + str(df_train_scaled.shape) + '.')
print('\033[1mShape of df_val_scaled for store ' + str(s) + ':\033[0m ' + str(df_val_scaled.shape) + '.')
print('\033[1mShape of df_test_scaled for store ' + str(s) + ':\033[0m ' + str(df_test_scaled.shape) + '.')
print('\n')

df_train_scaled.head()

Shape of df_train_scaled for store 6044: (35897, 2105).
Shape of df_val_scaled for store 6044: (20940, 2105).
Shape of df_test_scaled for store 6044: (21791, 2105).




Unnamed: 0,BUREAUBILLCITY(),BUREAUBILLSTATE(),BUREAUEMAIL(),BUREAUPHONE(),BUREAUPHONEAREACODE(),BUREAUSHIPCITY(),BUREAUSHIPSTATE(),CREDITCARDCOUNTRYSAMEASSHIPPING(),EMAILHASFRAUD(),EMAILSAMEAMOUNT(),...,C#SHIPPINGSTATE()#DF,C#SHIPPINGSTATE()#ES,C#SHIPPINGSTATE()#GO,C#SHIPPINGSTATE()#MG,C#SHIPPINGSTATE()#PE,C#SHIPPINGSTATE()#PR,C#SHIPPINGSTATE()#RJ,C#SHIPPINGSTATE()#RS,C#SHIPPINGSTATE()#SC,C#SHIPPINGSTATE()#SP
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,1,0,0,0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,1,0


In [34]:
# Assessing missing values (training data):
num_miss_train = df_train_scaled.isnull().sum().sum() > 0
if num_miss_train:
    print('\033[1mProblem - Number of overall missings detected (training data):\033[0m ' +
          str(df_train_scaled.isnull().sum().sum()) + '.')
    print('\n')

# Assessing missing values (validation data):
num_miss_val = df_val_scaled.isnull().sum().sum() > 0
if num_miss_val:
    print('\033[1mProblem - Number of overall missings detected (validation data):\033[0m ' +
          str(df_val_scaled.isnull().sum().sum()) + '.')
    print('\n')
    
# Assessing missing values (test data):
num_miss_test = df_test_scaled.isnull().sum().sum() > 0
if num_miss_test:
    print('\033[1mProblem - Number of overall missings detected (test data):\033[0m ' +
          str(df_test_scaled.isnull().sum().sum()) + '.')
    print('\n')

<a id='datasets_structure'></a>

### Datasets structure

In [35]:
# Checking consistency of structure between training and validation dataframes:
if len(list(df_train_scaled.columns)) != len(list(df_val_scaled.columns)):
    print('\033[1mProblem - Inconsistent number of columns between dataframes for training and validation data!\033[0m')

else:
    consistency_check = 0
    
    # Loop over variables:
    for c in list(df_train_scaled.columns):
        if list(df_train_scaled.columns).index(c) != list(df_val_scaled.columns).index(c):
            print('\033[1mProblem - Feature {0} was positioned differently in training and validation dataframes!\033[0m'.format(c))
            consistency_check += 1
            
    # Reordering columns of val dataframe:
    if consistency_check > 0:
        ordered_columns = list(df_train_scaled.columns)
        df_val_scaled = df_val_scaled[ordered_columns]

In [36]:
# Checking consistency of structure between training and test dataframes:
if len(list(df_train_scaled.columns)) != len(list(df_test_scaled.columns)):
    print('\033[1mProblem - Inconsistent number of columns between dataframes for training and test data!\033[0m')

else:
    consistency_check = 0
    
    # Loop over variables:
    for c in list(df_train_scaled.columns):
        if list(df_train_scaled.columns).index(c) != list(df_test_scaled.columns).index(c):
            print('\033[1mProblem - Feature {0} was positioned differently in training and test dataframes!\033[0m'.format(c))
            consistency_check += 1
            
    # Reordering columns of test dataframe:
    if consistency_check > 0:
        ordered_columns = list(df_train_scaled.columns)
        df_test_scaled = df_test_scaled[ordered_columns]
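The two cells above compare only the number of columns before looking up each position with `list.index`, which raises a `ValueError` if a training column is absent from the other dataframe. A possible alternative, sketched below with a hypothetical helper named `align_columns`, checks the column names first and then reorders in a single step.

```python
import pandas as pd


def align_columns(reference, other):
    """Return `other` with columns ordered as in `reference`, after checking that names match."""
    missing = [c for c in reference.columns if c not in other.columns]
    extra = [c for c in other.columns if c not in reference.columns]
    if missing or extra:
        raise ValueError('Inconsistent columns. Missing: {0}. Extra: {1}.'.format(missing, extra))
    return other[list(reference.columns)]


ref = pd.DataFrame({'a': [1], 'b': [2], 'y': [0]})
shuffled = pd.DataFrame({'y': [1], 'a': [3], 'b': [4]})
aligned = align_columns(ref, shuffled)
```

In the notebook's terms, `reference` would be `df_train_scaled` and `other` would be `df_val_scaled` or `df_test_scaled`.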

<a id='logistic_regression'></a>

## Logistic regression estimation

<a id='lr_params'></a>

### Hyper-parameters definition

In [38]:
# Converting data from dataframes into nd-arrays:
X_train = df_train_scaled.drop(drop_vars, axis=1).values
y_train = df_train_scaled['y'].values

X_val = df_val_scaled.drop(drop_vars, axis=1).values
y_val = df_val_scaled['y'].values

#### Setting

In [51]:
# Number of estimations:
n_estimations = 10

# Grid of values for regularization parameter:
regul_params = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.25, 0.3, 0.5, 0.75, 1, 3, 10]

#### Model estimation

In [53]:
test_bar = progressbar.ProgressBar(maxval=len(regul_params),
                                   widgets=['\033[1mTest progress:\033[0m ',
                                   progressbar.Bar('-', '[', ']'), ' ',
                                   progressbar.Percentage()])

start_time = datetime.now()

for c in range(len(regul_params)):
    estimation_id = str(int(time.time()))
    
    start_time_estimation = datetime.now()
    
    # Lists to store results:
    val_roc_auc = []
    val_avg_prec_score = []
    val_brier_score = []
    
    for t in range(n_estimations):
        # Creating the model object:
        model = LogisticRegression(solver='liblinear', penalty = 'l1', C = regul_params[c],
                                   warm_start=True)

        # Training the model:
        model.fit(X_train, y_train)

        # Performance metrics on validation data (probabilities predicted once and reused):
        val_proba = model.predict_proba(X_val)[:, 1]
        val_roc_auc.append(roc_auc_score(y_val, val_proba))
        val_avg_prec_score.append(average_precision_score(y_val, val_proba))
        val_brier_score.append(brier_score_loss(y_val, val_proba))

    end_time_estimation = datetime.now()
        
    # Dictionary with information on model structure and performance:
    model_assessment_LR[estimation_id] = {
        'hyper_parameters': {
            'regul_param': regul_params[c]
        },
        'n_estimations': n_estimations,
        'performance_metrics': {
            'application': 'validation',
            'avg_roc_auc': np.nanmean(val_roc_auc),
            'avg_avg_prec_score': np.nanmean(val_avg_prec_score),
            'avg_brier_score': np.nanmean(val_brier_score),
            'std_roc_auc': np.nanstd(val_roc_auc),
            'std_avg_prec_score': np.nanstd(val_avg_prec_score),
            'std_brier_score': np.nanstd(val_brier_score)
        },
        'running_time': str(round(((end_time_estimation - start_time_estimation).seconds)/60, 2)) + ' minutes',
        "comment": 'Defining regularization parameter.'
    }

    if export:
        with open('Datasets/model_assessment_LR.json', 'w') as json_file:
            json.dump(model_assessment_LR, json_file, indent=2)
    
    test_bar.update(c+1)
    sleep(0.01)

# Assessing running time:
end_time = datetime.now()

print('------------------------------------')
print('\033[1mOverall running time:\033[0m ' + str(round(((end_time - start_time).seconds)/60, 2)) +
      ' minutes.')
print('Start time: ' + start_time.strftime('%Y-%m-%d') + ', ' + start_time.strftime('%H:%M:%S'))
print('End time: ' + end_time.strftime('%Y-%m-%d') + ', ' + end_time.strftime('%H:%M:%S'))
print('\n')

Test progress: [---------------------------------------------------------] 100%

------------------------------------
Overall running time: 57.95 minutes.
Start time: 2021-03-06, 16:07:15
End time: 2021-03-06, 17:05:12
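In scikit-learn, `C` is the inverse of the regularization strength, so the smallest values in `regul_params` correspond to the strongest L1 penalty. The sketch below illustrates this on synthetic data (not the notebook's dataset) by counting the nonzero coefficients for a few values of `C`. The data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (not the notebook's dataset).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

nonzero_coefs = {}
for C in [0.001, 0.1, 10.0]:
    model = LogisticRegression(solver='liblinear', penalty='l1', C=C)
    model.fit(X, y)
    # Number of features kept by the L1 penalty for this value of C.
    nonzero_coefs[C] = int(np.count_nonzero(model.coef_))
```

Printing `nonzero_coefs` shows how many features survive at each value of `C`, which helps when reading a grid such as the one used here.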




#### Assessing results

In [60]:
estimation_ids = []
regul_params = []
avg_roc_auc = []
std_roc_auc = []
avg_prec = []
std_prec = []
ratio_roc_auc = []
ratio_prec = []
running_time = []

# Loop over estimations:
for e in [model_assessment_LR[e] for e in model_assessment_LR.keys() if
          ('Defining regularization parameter.' in model_assessment_LR[e]['comment'])]:
    estimation_ids.append(list(model_assessment_LR.keys())[list(model_assessment_LR.values()).index(e)])
    regul_params.append(e['hyper_parameters']['regul_param'])
    avg_roc_auc.append(e['performance_metrics']['avg_roc_auc'])
    std_roc_auc.append(e['performance_metrics']['std_roc_auc'])
    avg_prec.append(e['performance_metrics']['avg_avg_prec_score'])
    std_prec.append(e['performance_metrics']['std_avg_prec_score'])
    ratio_roc_auc.append(e['performance_metrics']['avg_roc_auc']/(e['performance_metrics']['std_roc_auc'] + 1e-7))
    ratio_prec.append(e['performance_metrics']['avg_avg_prec_score']/(e['performance_metrics']['std_avg_prec_score'] + 1e-7))
    running_time.append(float(e['running_time'].split(' minutes')[0]))
    
# Dataframe with performance metrics by regularization parameter:
metrics = pd.DataFrame(data={
    'estimation_id': estimation_ids,
    'regul_param': regul_params,
    'avg_roc_auc': avg_roc_auc,
    'std_roc_auc': std_roc_auc,
    'avg_prec': avg_prec,
    'std_prec': std_prec,
    'ratio_roc_auc': ratio_roc_auc,
    'ratio_prec': ratio_prec,
    'running_time': running_time
})

metrics.sort_values('avg_roc_auc', ascending=False)

| | estimation_id | regul_param | avg_roc_auc | std_roc_auc | avg_prec | std_prec | ratio_roc_auc | ratio_prec | running_time |
|---|---|---|---|---|---|---|---|---|---|
| 7 | 1615057981 | 0.25 | 0.955959 | 2.1e-05 | 0.494911 | 0.000267 | 46378.36 | 1855.338737 | 2.4 |
| 6 | 1615057890 | 0.1 | 0.955629 | 4.2e-05 | 0.506327 | 0.000204 | 22498.87 | 2477.968175 | 1.5 |
| 8 | 1615058125 | 0.3 | 0.955378 | 3.4e-05 | 0.490183 | 0.000209 | 27669.67 | 2341.526233 | 2.7 |
| 9 | 1615058288 | 0.5 | 0.952854 | 5.7e-05 | 0.479803 | 0.000601 | 16692.46 | 797.649401 | 3.53 |
| 10 | 1615058500 | 0.75 | 0.950577 | 0.000113 | 0.47475 | 0.000602 | 8435.548 | 788.312979 | 4.2 |
| 11 | 1615058752 | 1.0 | 0.947863 | 7.5e-05 | 0.465347 | 0.00044 | 12617.75 | 1056.52343 | 4.75 |
| 5 | 1615057832 | 0.03 | 0.947221 | 3.2e-05 | 0.483245 | 0.00032 | 29965.69 | 1511.26707 | 0.95 |
| 4 | 1615057787 | 0.01 | 0.934738 | 5e-05 | 0.41064 | 0.000408 | 18659.15 | 1006.517687 | 0.73 |
| 12 | 1615059038 | 3.0 | 0.93462 | 0.000118 | 0.410909 | 0.000559 | 7929.845 | 734.441767 | 9.97 |
| 13 | 1615059637 | 10.0 | 0.916468 | 0.000104 | 0.353622 | 0.000542 | 8781.78 | 652.266124 | 24.58 |


<a id='lr_estimation'></a>

### Final estimation

In [61]:
# Converting data from dataframes into nd-arrays:
X_train = df_train_scaled.drop(drop_vars, axis=1).values
y_train = df_train_scaled['y'].values

X_val = df_val_scaled.drop(drop_vars, axis=1).values
y_val = df_val_scaled['y'].values

X_test = df_test_scaled.drop(drop_vars, axis=1).values
y_test = df_test_scaled['y'].values

#### Setting

In [62]:
# Number of estimations:
n_estimations = 100

# Best value for regularization parameter:
regul_param = 0.25

#### Model estimation

In [65]:
start_time = datetime.now()

estimation_id = str(int(time.time()))

start_time_estimation = datetime.now()

# Lists to store results:
test_roc_auc = []
test_avg_prec_score = []
test_brier_score = []

for t in range(n_estimations):
    # Creating the model object:
    model = LogisticRegression(solver='liblinear', penalty = 'l1', C = regul_param,
                               warm_start=True)

    # Training the model:
    model.fit(X_train, y_train)

    # Performance metrics on test data (probabilities predicted once and reused):
    test_proba = model.predict_proba(X_test)[:, 1]
    test_roc_auc.append(roc_auc_score(y_test, test_proba))
    test_avg_prec_score.append(average_precision_score(y_test, test_proba))
    test_brier_score.append(brier_score_loss(y_test, test_proba))

end_time_estimation = datetime.now()

# Dictionary with information on model structure and performance:
model_assessment_LR[estimation_id] = {
    'hyper_parameters': {
        'regul_param': regul_param
    },
    'n_estimations': n_estimations,
    'performance_metrics': {
        'application': 'test',
        'avg_roc_auc': np.nanmean(test_roc_auc),
        'avg_avg_prec_score': np.nanmean(test_avg_prec_score),
        'avg_brier_score': np.nanmean(test_brier_score),
        'std_roc_auc': np.nanstd(test_roc_auc),
        'std_avg_prec_score': np.nanstd(test_avg_prec_score),
        'std_brier_score': np.nanstd(test_brier_score),
    },
    'running_time': str(round(((end_time_estimation - start_time_estimation).seconds)/60, 2)) + ' minutes',
    "comment": 'Final estimation.'
}

if export:
    with open('Datasets/model_assessment_LR.json', 'w') as json_file:
        json.dump(model_assessment_LR, json_file, indent=2)

# Assessing running time:
end_time = datetime.now()

print('------------------------------------')
print('\033[1mOverall running time:\033[0m ' + str(round(((end_time - start_time).seconds)/60, 2)) +
      ' minutes.')
print('Start time: ' + start_time.strftime('%Y-%m-%d') + ', ' + start_time.strftime('%H:%M:%S'))
print('End time: ' + end_time.strftime('%Y-%m-%d') + ', ' + end_time.strftime('%H:%M:%S'))
print('\n')

------------------------------------
Overall running time: 25.68 minutes.
Start time: 2021-03-06, 17:15:29
End time: 2021-03-06, 17:41:10
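Running times in this notebook are derived from the `.seconds` attribute of a `timedelta`, which holds only the seconds component of the difference and excludes whole days. That is adequate for the runs reported here, all of which finish within 24 hours. For longer runs, `total_seconds()` returns the full duration. The timestamps in the sketch below are hypothetical and chosen only to make the difference visible.

```python
from datetime import datetime

start_time = datetime(2021, 3, 9, 23, 10, 52)
end_time = datetime(2021, 3, 11, 4, 48, 23)  # hypothetical run lasting more than one day

elapsed = end_time - start_time
minutes_from_seconds = round(elapsed.seconds / 60, 2)        # ignores the whole day
minutes_from_total = round(elapsed.total_seconds() / 60, 2)  # full duration
```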




<a id='svm'></a>

## SVM estimation

<a id='svm_params'></a>

### Hyper-parameters definition

In [36]:
# Converting data from dataframes into nd-arrays:
X_train = df_train_scaled.drop(drop_vars, axis=1).values
y_train = df_train_scaled['y'].values

X_val = df_val_scaled.drop(drop_vars, axis=1).values
y_val = df_val_scaled['y'].values

#### Setting

In [37]:
# Number of estimations:
n_estimations = 1

# Grid of values for the hyper-parameters:
grid_param = []
dict_param = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'kernel': ['poly'],
    'degree': [1, 2, 3, 4],
    'gamma': ['scale']
}

list_param = [dict_param[k] for k in dict_param.keys()]
list_param = [list(x) for x in np.array(np.meshgrid(*list_param)).T.reshape(-1,len(list_param))]

for i in list_param:
    grid_param.append(dict(zip(dict_param.keys(), i)))
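Stacking the output of `np.meshgrid` into a single array forces numbers and strings into a common string type, which appears to be why the estimation cell converts `C` and `degree` back with `float()` and `int()`. As an alternative sketch, `itertools.product` builds the same set of combinations while keeping each value's original Python type. The order of the combinations may differ from the one produced by the cell above.

```python
from itertools import product

dict_param = {
    'C': [0.001, 0.01, 0.1, 1, 10],
    'kernel': ['poly'],
    'degree': [1, 2, 3, 4],
    'gamma': ['scale']
}

# One dictionary per combination; values keep their original Python types.
grid_param = [dict(zip(dict_param.keys(), values))
              for values in product(*dict_param.values())]
```

With this grid there are 5 × 1 × 4 × 1 = 20 combinations, each of which can be passed to `SVC` without casting.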

#### Model estimation

In [38]:
test_bar = progressbar.ProgressBar(maxval=len(grid_param),
                                   widgets=['\033[1mTest progress:\033[0m ',
                                   progressbar.Bar('-', '[', ']'), ' ',
                                   progressbar.Percentage()])

start_time = datetime.now()

for j in range(len(grid_param)):
    estimation_id = str(int(time.time()))
    
    start_time_estimation = datetime.now()
    
    # Lists to store results:
    val_roc_auc = []
    val_avg_prec_score = []
    val_brier_score = []
    
    for t in range(n_estimations):
        # Creating the model object:
        model = SVC(C = float(grid_param[j]['C']),
                    kernel = grid_param[j]['kernel'],
                    degree = int(grid_param[j]['degree']),
                    gamma = grid_param[j]['gamma'],
                    probability = True, coef0 = 0.0, shrinking = True, tol = 0.001, max_iter = -1,
                    cache_size = 200, class_weight = None, decision_function_shape = 'ovr',
                    verbose = False, random_state = None)

        # Training the model:
        model.fit(X_train, y_train)

        # Performance metrics on validation data (probabilities predicted once and reused):
        val_proba = model.predict_proba(X_val)[:, 1]
        val_roc_auc.append(roc_auc_score(y_val, val_proba))
        val_avg_prec_score.append(average_precision_score(y_val, val_proba))
        val_brier_score.append(brier_score_loss(y_val, val_proba))

    end_time_estimation = datetime.now()
        
    # Dictionary with information on model structure and performance:
    model_assessment_SVM[estimation_id] = {
        'hyper_parameters': {
            'C': grid_param[j]['C'],
            'kernel': grid_param[j]['kernel'],
            'degree': int(grid_param[j]['degree']),
            'gamma': grid_param[j]['gamma']
        },
        'n_estimations': n_estimations,
        'performance_metrics': {
            'application': 'validation',
            'avg_roc_auc': np.nanmean(val_roc_auc),
            'avg_avg_prec_score': np.nanmean(val_avg_prec_score),
            'avg_brier_score': np.nanmean(val_brier_score),
            'std_roc_auc': np.nanstd(val_roc_auc),
            'std_avg_prec_score': np.nanstd(val_avg_prec_score),
            'std_brier_score': np.nanstd(val_brier_score)
        },
        'running_time': str(round(((end_time_estimation - start_time_estimation).seconds)/60, 2)) + ' minutes',
        "comment": 'Defining hyper-parameters.'
    }

    if export:
        with open('Datasets/model_assessment_SVM.json', 'w') as json_file:
            json.dump(model_assessment_SVM, json_file, indent=2)
    
    test_bar.update(j+1)
    sleep(0.01)

# Assessing running time:
end_time = datetime.now()

print('------------------------------------')
print('\033[1mOverall running time:\033[0m ' + str(round(((end_time - start_time).seconds)/60, 2)) +
      ' minutes.')
print('Start time: ' + start_time.strftime('%Y-%m-%d') + ', ' + start_time.strftime('%H:%M:%S'))
print('End time: ' + end_time.strftime('%Y-%m-%d') + ', ' + end_time.strftime('%H:%M:%S'))
print('\n')

Test progress: [---------------------------------------------------------] 100%

------------------------------------
Overall running time: 337.5 minutes.
Start time: 2021-03-09, 23:10:52
End time: 2021-03-10, 04:48:23




#### Assessing results

In [45]:
estimation_ids = []
c_list = []
kernels = []
degrees = []
gammas = []
avg_roc_auc = []
std_roc_auc = []
avg_prec = []
std_prec = []
ratio_roc_auc = []
ratio_prec = []
running_time = []

# Loop over hyper-parameter search estimations, keeping each estimation id with its record:
for est_id, e in model_assessment_SVM.items():
    if 'Defining hyper-parameters.' not in e['comment']:
        continue
    estimation_ids.append(est_id)
    c_list.append(e['hyper_parameters']['C'])
    kernels.append(e['hyper_parameters']['kernel'])
    degrees.append(e['hyper_parameters']['degree'])
    gammas.append(e['hyper_parameters']['gamma'])
    avg_roc_auc.append(e['performance_metrics']['avg_roc_auc'])
    std_roc_auc.append(e['performance_metrics']['std_roc_auc'])
    avg_prec.append(e['performance_metrics']['avg_avg_prec_score'])
    std_prec.append(e['performance_metrics']['std_avg_prec_score'])
    ratio_roc_auc.append(e['performance_metrics']['avg_roc_auc']/(e['performance_metrics']['std_roc_auc'] + 1e-7))
    ratio_prec.append(e['performance_metrics']['avg_avg_prec_score']/(e['performance_metrics']['std_avg_prec_score'] + 1e-7))
    running_time.append(float(e['running_time'].split(' minutes')[0]))
    
# Dataframe with performance metrics by hyper-parameter combination:
metrics = pd.DataFrame(data={
    'estimation_id': estimation_ids,
    'C': c_list,
    'kernel': kernels,
    'degree': degrees,
    'gamma': gammas,
    'avg_roc_auc': avg_roc_auc,
    'std_roc_auc': std_roc_auc,
    'avg_prec': avg_prec,
    'std_prec': std_prec,
    'ratio_roc_auc': ratio_roc_auc,
    'ratio_prec': ratio_prec,
    'running_time': running_time
})

metrics.sort_values('avg_roc_auc', ascending=False)

| | estimation_id | C | kernel | degree | gamma | avg_roc_auc | std_roc_auc | avg_prec | std_prec | ratio_roc_auc | ratio_prec | running_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1615345366 | 10.0 | poly | 1 | scale | 0.949295 | 0.0 | 0.490812 | 0.0 | 9492951.0 | 4908122.0 | 10.15 |
| 3 | 1615344669 | 1.0 | poly | 1 | scale | 0.940714 | 0.0 | 0.462448 | 0.0 | 9407142.0 | 4624481.0 | 11.6 |
| 8 | 1615348625 | 1.0 | poly | 2 | scale | 0.929405 | 0.0 | 0.399705 | 0.0 | 9294052.0 | 3997050.0 | 14.85 |
| 9 | 1615349516 | 10.0 | poly | 2 | scale | 0.928807 | 0.0 | 0.435525 | 0.0 | 9288068.0 | 4355247.0 | 15.23 |
| 14 | 1615354482 | 10.0 | poly | 3 | scale | 0.917996 | 0.0 | 0.40077 | 0.0 | 9179965.0 | 4007696.0 | 20.98 |
| 2 | 1615343896 | 0.1 | poly | 1 | scale | 0.915404 | 0.0 | 0.342095 | 0.0 | 9154044.0 | 3420952.0 | 12.88 |
| 13 | 1615353349 | 1.0 | poly | 3 | scale | 0.914734 | 0.0 | 0.338512 | 0.0 | 9147338.0 | 3385124.0 | 18.87 |
| 7 | 1615347748 | 0.1 | poly | 2 | scale | 0.912896 | 0.0 | 0.323124 | 0.0 | 9128958.0 | 3231240.0 | 14.62 |
| 1 | 1615343069 | 0.01 | poly | 1 | scale | 0.899856 | 0.0 | 0.308697 | 0.0 | 8998559.0 | 3086969.0 | 13.77 |
| 0 | 1615342252 | 0.001 | poly | 1 | scale | 0.898669 | 0.0 | 0.308039 | 0.0 | 8986690.0 | 3080385.0 | 13.6 |
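
The `ratio_roc_auc` and `ratio_prec` columns divide an average by its standard deviation plus `1e-7`. The zero standard deviations in the table are what a single estimation per configuration would produce, in which case the ratio reduces to the average divided by `1e-7` and carries no information about stability. The sketch below reproduces that arithmetic with the standard library only; the function name `stability_ratio` and the three-score example are illustrative assumptions, not values from the experiments.

```python
import statistics


def stability_ratio(scores, eps=1e-7):
    """Mean of the scores divided by their population standard deviation plus eps."""
    mean = statistics.fmean(scores)
    # Population standard deviation (ddof = 0), the default of np.nanstd.
    std = statistics.pstdev(scores)
    return mean / (std + eps)


# One estimation only: the standard deviation is 0, so the ratio is mean / 1e-7.
single_run = stability_ratio([0.949295])

# Hypothetical scores from three repeated estimations: the ratio becomes informative.
repeated_runs = stability_ratio([0.94, 0.95, 0.96])
```

With a single score of 0.949295 the ratio is roughly 9.49 million, which is the order of magnitude shown in the table; the ratio only separates configurations once `n_estimations` is greater than 1.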


<a id='svm_estimation'></a>

### Final estimation

In [36]:
# Converting data from dataframes into nd-arrays:
X_train = df_train_scaled.drop(drop_vars, axis=1).values
y_train = df_train_scaled['y'].values

X_val = df_val_scaled.drop(drop_vars, axis=1).values
y_val = df_val_scaled['y'].values

X_test = df_test_scaled.drop(drop_vars, axis=1).values
y_test = df_test_scaled['y'].values

#### Setting

In [37]:
# Number of estimations:
n_estimations = 100

# Best value for the hyper-parameters:
grid_param = {
    'C': 10,
    'kernel': 'poly',
    'degree': 1,
    'gamma': 'scale'
}

#### Model estimation

In [38]:
start_time = datetime.now()

estimation_id = str(int(time.time()))

start_time_estimation = datetime.now()

# Lists to store results:
test_roc_auc = []
test_avg_prec_score = []
test_brier_score = []

for t in range(n_estimations):
    # Creating the model object:
    model = SVC(C = float(grid_param['C']),
                kernel = grid_param['kernel'],
                degree = int(grid_param['degree']),
                gamma = grid_param['gamma'],
                probability = True)

    # Training the model:
    model.fit(X_train, y_train)

    # Performance metrics on test data (class-1 probabilities computed once):
    test_proba = model.predict_proba(X_test)[:, 1]
    test_roc_auc.append(roc_auc_score(y_test, test_proba))
    test_avg_prec_score.append(average_precision_score(y_test, test_proba))
    test_brier_score.append(brier_score_loss(y_test, test_proba))

end_time_estimation = datetime.now()

# Dictionary with information on model structure and performance:
model_assessment_SVM[estimation_id] = {
    'hyper_parameters': {
        'C': 10,
        'kernel': 'poly',
        'degree': 1,
        'gamma': 'scale'
    },
    'n_estimations': n_estimations,
    'performance_metrics': {
        'application': 'test',
        'avg_roc_auc': np.nanmean(test_roc_auc),
        'avg_avg_prec_score': np.nanmean(test_avg_prec_score),
        'avg_brier_score': np.nanmean(test_brier_score),
        'std_roc_auc': np.nanstd(test_roc_auc),
        'std_avg_prec_score': np.nanstd(test_avg_prec_score),
        'std_brier_score': np.nanstd(test_brier_score),
    },
    'running_time': str(round(((end_time_estimation - start_time_estimation).seconds)/60, 2)) + ' minutes',
    "comment": 'Final estimation.'
}

if export:
    with open('Datasets/model_assessment_SVM.json', 'w') as json_file:
        json.dump(model_assessment_SVM, json_file, indent=2)

# Assessing running time:
end_time = datetime.now()

print('------------------------------------')
print('\033[1mOverall running time:\033[0m ' + str(round(((end_time - start_time).seconds)/60, 2)) +
      ' minutes.')
print('Start time: ' + start_time.strftime('%Y-%m-%d') + ', ' + start_time.strftime('%H:%M:%S'))
print('End time: ' + end_time.strftime('%Y-%m-%d') + ', ' + end_time.strftime('%H:%M:%S'))
print('\n')

------------------------------------
Overall running time: 1008.53 minutes.
Start time: 2021-03-10, 15:50:05
End time: 2021-03-11, 08:38:38




<a id='gbm'></a>

## GBM estimation

<a id='gbm_params'></a>

### Hyper-parameters definition

In [None]:
# Converting data from dataframes into nd-arrays:
X_train = df_train_scaled.drop(drop_vars, axis=1).values
y_train = df_train_scaled['y'].values

X_val = df_val_scaled.drop(drop_vars, axis=1).values
y_val = df_val_scaled['y'].values

#### Setting

In [40]:
# Number of estimations:
n_estimations = 1

# Grid of values for the hyper-parameters:
grid_param = []
dict_param = {
    'subsample': uniform(0.5, 0.5),
    'learning_rate': uniform(0.0001, 0.1),
    'max_depth': randint(1, 5+1),
    'n_estimators': randint(100, 1000+1)
}

for i in range(1, 15+1):
    list_param = []

    for k in dict_param.keys():
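        try:
            # scipy.stats frozen distributions expose rvs():
            list_param.append(dict_param[k].rvs(1)[0])
        except AttributeError:
            # Plain lists of candidate values are sampled uniformly:
            list_param.append(np.random.choice(dict_param[k]))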
    grid_param.append(dict(zip(dict_param.keys(), list_param)))
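
The cell above draws 15 random hyper-parameter combinations. In `scipy.stats`, `uniform(loc, scale)` samples from the interval `[loc, loc + scale]` and `randint(low, high)` excludes `high`, so the ranges are 0.5 to 1.0 for `subsample`, 0.0001 to 0.1001 for `learning_rate`, 1 to 5 for `max_depth`, and 100 to 1000 for `n_estimators`. The following is a minimal standard-library sketch of the same ranges with a fixed seed, so that the draws can be reproduced; the function name `sample_gbm_params` and the seed value are assumptions added for illustration.

```python
import random


def sample_gbm_params(n_draws, seed=0):
    """Draw n_draws random GBM hyper-parameter combinations, reproducibly."""
    rng = random.Random(seed)  # local generator, leaves global random state untouched
    draws = []
    for _ in range(n_draws):
        draws.append({
            'subsample': rng.uniform(0.5, 1.0),          # uniform(0.5, 0.5) in scipy
            'learning_rate': rng.uniform(0.0001, 0.1001),  # uniform(0.0001, 0.1) in scipy
            'max_depth': rng.randint(1, 5),              # both bounds inclusive here
            'n_estimators': rng.randint(100, 1000),      # both bounds inclusive here
        })
    return draws


grid_param_sketch = sample_gbm_params(15, seed=42)
```

Calling the function again with the same seed returns the same list, which makes it possible to rerun a search over the same grid.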

#### Model estimation

In [43]:
test_bar = progressbar.ProgressBar(maxval=len(grid_param),
                                   widgets=['\033[1mTest progress:\033[0m ',
                                   progressbar.Bar('-', '[', ']'), ' ',
                                   progressbar.Percentage()])

start_time = datetime.now()

for j in range(len(grid_param)):
    estimation_id = str(int(time.time()))
    
    start_time_estimation = datetime.now()
    
    # Lists to store results:
    val_roc_auc = []
    val_avg_prec_score = []
    val_brier_score = []
    
    for t in range(n_estimations):
        # Creating the model object:
        model = GradientBoostingClassifier(subsample = float(grid_param[j]['subsample']),
                                           max_depth = int(grid_param[j]['max_depth']),
                                           learning_rate = float(grid_param[j]['learning_rate']),
                                           n_estimators = int(grid_param[j]['n_estimators']),
                                           warm_start = True)

        # Training the model:
        model.fit(X_train, y_train)

        # Performance metrics on validation data (class-1 probabilities computed once):
        val_proba = model.predict_proba(X_val)[:, 1]
        val_roc_auc.append(roc_auc_score(y_val, val_proba))
        val_avg_prec_score.append(average_precision_score(y_val, val_proba))
        val_brier_score.append(brier_score_loss(y_val, val_proba))

    end_time_estimation = datetime.now()
        
    # Dictionary with information on model structure and performance:
    model_assessment_GBM[estimation_id] = {
        'hyper_parameters': {
            'subsample': str(grid_param[j]['subsample']),
            'max_depth': str(grid_param[j]['max_depth']),
            'learning_rate': str(grid_param[j]['learning_rate']),
            'n_estimators': str(grid_param[j]['n_estimators'])
        },
        'n_estimations': n_estimations,
        'performance_metrics': {
            'application': 'validation',
            'avg_roc_auc': np.nanmean(val_roc_auc),
            'avg_avg_prec_score': np.nanmean(val_avg_prec_score),
            'avg_brier_score': np.nanmean(val_brier_score),
            'std_roc_auc': np.nanstd(val_roc_auc),
            'std_avg_prec_score': np.nanstd(val_avg_prec_score),
            'std_brier_score': np.nanstd(val_brier_score)
        },
        'running_time': str(round(((end_time_estimation - start_time_estimation).seconds)/60, 2)) + ' minutes',
        "comment": 'Defining hyper-parameters.'
    }

    if export:
        with open('Datasets/model_assessment_GBM.json', 'w') as json_file:
            json.dump(model_assessment_GBM, json_file, indent=2)
    
    test_bar.update(j+1)
    sleep(0.01)

# Assessing running time:
end_time = datetime.now()

print('------------------------------------')
print('\033[1mOverall running time:\033[0m ' + str(round(((end_time - start_time).seconds)/60, 2)) +
      ' minutes.')
print('Start time: ' + start_time.strftime('%Y-%m-%d') + ', ' + start_time.strftime('%H:%M:%S'))
print('End time: ' + end_time.strftime('%Y-%m-%d') + ', ' + end_time.strftime('%H:%M:%S'))
print('\n')

Test progress: [---------------------------------------------------------] 100%

------------------------------------
Overall running time: 155.17 minutes.
Start time: 2021-03-12, 10:44:26
End time: 2021-03-12, 13:19:37




#### Assessing results

In [47]:
estimation_ids = []
eta_list = []
J_list = []
v_list = []
M_list = []
avg_roc_auc = []
std_roc_auc = []
avg_prec = []
std_prec = []
ratio_roc_auc = []
ratio_prec = []
running_time = []

# Loop over hyper-parameter search estimations, keeping each estimation id with its record:
for est_id, e in model_assessment_GBM.items():
    if 'Defining hyper-parameters.' not in e['comment']:
        continue
    estimation_ids.append(est_id)
    eta_list.append(e['hyper_parameters']['subsample'])
    J_list.append(e['hyper_parameters']['max_depth'])
    v_list.append(e['hyper_parameters']['learning_rate'])
    M_list.append(e['hyper_parameters']['n_estimators'])
    avg_roc_auc.append(e['performance_metrics']['avg_roc_auc'])
    std_roc_auc.append(e['performance_metrics']['std_roc_auc'])
    avg_prec.append(e['performance_metrics']['avg_avg_prec_score'])
    std_prec.append(e['performance_metrics']['std_avg_prec_score'])
    ratio_roc_auc.append(e['performance_metrics']['avg_roc_auc']/(e['performance_metrics']['std_roc_auc'] + 1e-7))
    ratio_prec.append(e['performance_metrics']['avg_avg_prec_score']/(e['performance_metrics']['std_avg_prec_score'] + 1e-7))
    running_time.append(float(e['running_time'].split(' minutes')[0]))
    
# Dataframe with performance metrics by hyper-parameter combination:
metrics = pd.DataFrame(data={
    'estimation_id': estimation_ids,
    'subsample': eta_list,
    'max_depth': J_list,
    'learning_rate': v_list,
    'n_estimators': M_list,
    'avg_roc_auc': avg_roc_auc,
    'std_roc_auc': std_roc_auc,
    'avg_prec': avg_prec,
    'std_prec': std_prec,
    'ratio_roc_auc': ratio_roc_auc,
    'ratio_prec': ratio_prec,
    'running_time': running_time
})

metrics.sort_values('avg_roc_auc', ascending=False)

| | estimation_id | subsample | max_depth | learning_rate | n_estimators | avg_roc_auc | std_roc_auc | avg_prec | std_prec | ratio_roc_auc | ratio_prec | running_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 1615557195 | 0.7118400417035669 | 4 | 0.0467957863424787 | 884 | 0.949177 | 0.0 | 0.45211 | 0.0 | 9491773.0 | 4521096.0 | 30.85 |
| 1 | 1615553321 | 0.6427877387298072 | 2 | 0.0858423761437097 | 562 | 0.947162 | 0.0 | 0.475113 | 0.0 | 9471622.0 | 4751130.0 | 9.13 |
| 19 | 1615565252 | 0.9456088856900078 | 5 | 0.0383611260433845 | 248 | 0.946713 | 0.0 | 0.444771 | 0.0 | 9467129.0 | 4447707.0 | 12.07 |
| 4 | 1615555108 | 0.8784779222684376 | 4 | 0.0805863642157635 | 351 | 0.945808 | 0.0 | 0.430675 | 0.0 | 9458079.0 | 4306749.0 | 9.98 |
| 10 | 1615560640 | 0.7855770369179904 | 4 | 0.0622410089892493 | 779 | 0.94565 | 0.0 | 0.388148 | 0.0 | 9456505.0 | 3881479.0 | 25.72 |
| 8 | 1615559047 | 0.8243354273385988 | 3 | 0.0813583490569955 | 464 | 0.945471 | 0.0 | 0.458176 | 0.0 | 9454714.0 | 4581762.0 | 10.8 |
| 6 | 1615556968 | 0.9843951879119976 | 1 | 0.0347986189048324 | 782 | 0.945458 | 0.0 | 0.477039 | 0.0 | 9454578.0 | 4770387.0 | 3.78 |
| 2 | 1615553870 | 0.933108217434422 | 4 | 0.0775826943084925 | 499 | 0.945003 | 0.0 | 0.419677 | 0.0 | 9450030.0 | 4196772.0 | 15.95 |
| 14 | 1615563747 | 0.5803623178080404 | 2 | 0.0639118191738419 | 343 | 0.944974 | 0.0 | 0.473211 | 0.0 | 9449740.0 | 4732111.0 | 5.42 |
| 5 | 1615556666 | 0.5638141361469999 | 2 | 0.0694653676990341 | 308 | 0.944828 | 0.0 | 0.456144 | 0.0 | 9448282.0 | 4561436.0 | 5.02 |

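
The final estimation uses the hyper-parameters of estimation `1615553321`, which ranks second on ROC AUC but also scores well on average precision. The notebook does not state the selection rule. As one possible criterion, and only as an illustration, the sketch below sums the ranks on both metrics over the ten rows displayed above and picks the smallest total; the helper `ranks_descending` is hypothetical.

```python
# (estimation_id, avg_roc_auc, avg_prec) copied from the ten rows displayed above.
rows = [
    ('1615557195', 0.949177, 0.452110),
    ('1615553321', 0.947162, 0.475113),
    ('1615565252', 0.946713, 0.444771),
    ('1615555108', 0.945808, 0.430675),
    ('1615560640', 0.945650, 0.388148),
    ('1615559047', 0.945471, 0.458176),
    ('1615556968', 0.945458, 0.477039),
    ('1615553870', 0.945003, 0.419677),
    ('1615563747', 0.944974, 0.473211),
    ('1615556666', 0.944828, 0.456144),
]


def ranks_descending(values):
    """Return ranks where 1 marks the largest value (no ties in this data)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


roc_ranks = ranks_descending([row[1] for row in rows])
prec_ranks = ranks_descending([row[2] for row in rows])

# Sum of the two ranks per estimation; lower is better on both metrics jointly.
rank_sums = {row[0]: roc_ranks[i] + prec_ranks[i] for i, row in enumerate(rows)}
best_id = min(rank_sums, key=rank_sums.get)
```

Under this rule the selected estimation is `1615553321`, with a rank sum of 4. Other rules, for example one that also weighs running time, could lead to a different choice.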

<a id='gbm_estimation'></a>

### Final estimation

In [37]:
# Converting data from dataframes into nd-arrays:
X_train = df_train_scaled.drop(drop_vars, axis=1).values
y_train = df_train_scaled['y'].values

X_val = df_val_scaled.drop(drop_vars, axis=1).values
y_val = df_val_scaled['y'].values

X_test = df_test_scaled.drop(drop_vars, axis=1).values
y_test = df_test_scaled['y'].values

#### Setting

In [38]:
# Number of estimations:
n_estimations = 100

# Best value for the hyper-parameters:
grid_param = {
    'subsample': 0.6427877387298072,
    'max_depth': 2,
    'learning_rate': 0.08584237614370978,
    'n_estimators': 562
}

#### Model estimation

In [39]:
start_time = datetime.now()

estimation_id = str(int(time.time()))

start_time_estimation = datetime.now()

# Lists to store results:
test_roc_auc = []
test_avg_prec_score = []
test_brier_score = []

for t in range(n_estimations):
    # Creating the model object:
    model = GradientBoostingClassifier(subsample = float(grid_param['subsample']),
                                       max_depth = int(grid_param['max_depth']),
                                       learning_rate = float(grid_param['learning_rate']),
                                       n_estimators = int(grid_param['n_estimators']),
                                       warm_start = True)

    # Training the model:
    model.fit(X_train, y_train)

    # Performance metrics on test data (class-1 probabilities computed once):
    test_proba = model.predict_proba(X_test)[:, 1]
    test_roc_auc.append(roc_auc_score(y_test, test_proba))
    test_avg_prec_score.append(average_precision_score(y_test, test_proba))
    test_brier_score.append(brier_score_loss(y_test, test_proba))

end_time_estimation = datetime.now()

# Dictionary with information on model structure and performance:
model_assessment_GBM[estimation_id] = {
    'hyper_parameters': {
        'subsample': str(grid_param['subsample']),
        'max_depth': str(grid_param['max_depth']),
        'learning_rate': str(grid_param['learning_rate']),
        'n_estimators': str(grid_param['n_estimators'])
    },
    'n_estimations': n_estimations,
    'performance_metrics': {
        'application': 'test',
        'avg_roc_auc': np.nanmean(test_roc_auc),
        'avg_avg_prec_score': np.nanmean(test_avg_prec_score),
        'avg_brier_score': np.nanmean(test_brier_score),
        'std_roc_auc': np.nanstd(test_roc_auc),
        'std_avg_prec_score': np.nanstd(test_avg_prec_score),
        'std_brier_score': np.nanstd(test_brier_score),
    },
    'running_time': str(round(((end_time_estimation - start_time_estimation).seconds)/60, 2)) + ' minutes',
    "comment": 'Final estimation.'
}

if export:
    with open('Datasets/model_assessment_GBM.json', 'w') as json_file:
        json.dump(model_assessment_GBM, json_file, indent=2)

# Assessing running time:
end_time = datetime.now()

print('------------------------------------')
print('\033[1mOverall running time:\033[0m ' + str(round(((end_time - start_time).seconds)/60, 2)) +
      ' minutes.')
print('Start time: ' + start_time.strftime('%Y-%m-%d') + ', ' + start_time.strftime('%H:%M:%S'))
print('End time: ' + end_time.strftime('%Y-%m-%d') + ', ' + end_time.strftime('%H:%M:%S'))
print('\n')

------------------------------------
Overall running time: 771.3 minutes.
Start time: 2021-03-12, 17:16:24
End time: 2021-03-13, 06:07:42


