# Modeling

## Contents
* [1. <b> Summary </b>](#summary)
* [2. <b> Model Function </b>](#func)
* [2.1 Hyperparameter Dict ](#hp)
* [2.2 Models to run ](#models_list)
* [3 <b> Model Algorithms</b>](#models_algo)
* [3.1 Logistic Regression](#LR)
* [3.2 Decision Tree](#DT)
* [3.3 Random Forest](#RF)
* [3.4 LGB](#LGB)
* [3.5 Neural Networks](#NN)
* [3.6 KNN](#KNN)
* [3.7 SVM](#SVM)
* [4 <b> Best Model </b>](#best_model)
* [4.1 <b> Three results tables </b>](#3_tables)
* [4.2 <b> Applying on OOT </b>](#oot)

## Summary <a class="anchor" id="summary"></a>

This notebook fits several models to the data set of final features and tunes hyperparameters to find the best set for each model. It then fits the best model on the entire training data set and applies that model to the out-of-time (OOT) validation data set.

To reduce redundancy, a common function, `modelAndOutput`, takes the model type (logistic regression, decision tree, random forest, etc.) and a hyperparameter dict and generates the FDR (fraud detection rate) results for the train, test and OOT data. Only the models listed in `modelsToRun` are explored. The code also writes the results for all models in `modelsToRun` to an Excel file.

To add a new model, add its implementation to the `modelAndOutput` function, its hyperparameters to `hpDict`, and its name to the `modelsToRun` list. `modelAndOutput` can then be called in the same way as for the existing models.
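One way to reduce the edits needed per new model is a name-to-constructor registry combined with keyword unpacking of each scenario's hyperparameters. The sketch below is an assumption about how such a refactor could look, not the notebook's implementation: `scenario_kwargs`, `build_model` and the `registry` argument are hypothetical names, and `dict` stands in for a real estimator class so that the snippet runs without scikit-learn.

```python
def scenario_kwargs(criterion_dict, i):
    """Return the hyperparameters of scenario i as keyword arguments.

    'n_variables' is excluded because it selects input columns and is not
    an estimator argument.
    """
    return {name: values[i]
            for name, values in criterion_dict.items()
            if name != 'n_variables'}


def build_model(model_name, criterion_dict, i, registry):
    """Instantiate the estimator registered under model_name for scenario i."""
    if model_name not in registry:
        raise KeyError(f"Model {model_name!r} is not registered")
    return registry[model_name](**scenario_kwargs(criterion_dict, i))


# Stand-in registry: `dict` simply echoes the keyword arguments it receives.
# With scikit-learn one would register estimator classes instead, for example
# {'KNeighborsClassifier': KNeighborsClassifier}.
demo_registry = {'Demo': dict}
demo_hp = {'n_variables': [5, 10],
           'n_neighbors': [5, 7],
           'weights': ['uniform', 'distance']}

model = build_model('Demo', demo_hp, 1, demo_registry)
print(model)
```

With this layout, a new model would need only a registry entry and an `hpDict` entry. Estimators that need fixed arguments (such as `probability = True` for `SVC`) could be registered through `functools.partial`.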


In [1]:
# Libraries to load
# !pip install lightgbm

In [2]:
from datetime import datetime
start_time = datetime.now()

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
%matplotlib inline

print('LOAD DURATION: ', datetime.now() - start_time) # import time; about 3.5 seconds in the run recorded below

LOAD DURATION:  0:00:03.545280


In [3]:
oot_index = 84299  # row position at which the out-of-time (OOT) data starts; this would be 84300 if vars_final were indexed by Recnum

In [4]:
versionFilterCols = 2

#The following lists of filterCols have been found by running feature selection many times to see which ones are the best.

if versionFilterCols == 1:
    #50,50,50 from each file and then a wrapper
    filterCols = ['Cardnum_Merchnum_total_7','Cardnum_state_max_14','Merchnum_Zip3_max_14','Cardnum_Merchnum_total_14','Cardnum_Merchnum_state_total_7','Merchnum_med_3','Cardnum_Merchnum_zip_total_30','Cardnum_Merchnum_zip3_total_7','Merchnum_med_0','Merchnum_Zip_med_1','Merchnum_Zip_med_0','Merchnum_Zip3_med_1','Merchnum_med_1','Merchnum_Zip3_med_3','Merchnum_Zip_med_3','Cardnum_Merchnum_total_30','Merchnum_Zip3_med_0','Merchnum_Zip3_avg_1','Cardnum_Merchnum_state_total_14','Cardnum_Zip3_total_30']

if versionFilterCols == 2:
    filterCols = ['Cardnum_state_max_14','Cardnum_Merchnum_zip3_total_7','Cardnum_Merchnum_zip3_total_14','Cardnum_Merchnum_total_7','Cardnum_Merchnum_total_14','Cardnum_Merchnum_state_total_7','Cardnum_Merchnum_zip_total_30','Cardnum_Merchnum_total_30','Cardnum_Merchnum_state_total_30','Cardnum_Merchnum_zip_total_14','Cardnum_Merchnum_state_total_14','Merchnum_Zip3_max_14','Merchnum_med_3','Merchnum_med_0','Merchnum_Zip_med_1','Merchnum_Zip_med_0','Merchnum_Zip3_med_1','Merchnum_med_1','Merchnum_Zip3_med_3','Merchnum_Zip_med_3']


if versionFilterCols == 3:
    #Other group members' variables. To use these use corresponding df
    filterCols = ['Cardnum_Merchnum_total_7','Cardnum_state_rel','Cardnum_Merchnum_zip_rel','Cardnum_zip_rel','Cardnum_Merchnum_zip3_rel','Cardnum_Merchnum_state_rel','Cardnum_Merchnum_zip_rel_2',\
                  'Cardnum_Merchnum_zip3_total_14','Cardnum_Zip_total_14',\
                 'Cardnum_Merchnum_state_total_30','Cardnum_Zip3_total_14','Cardnum_state_total_30','Cardnum_Zip3_total_7','Cardnum_Merchnum_total_14']

if versionFilterCols == 4:
    #normal without 50,50,50 logic
    filterCols = ['Cardnum_Merchnum_total_7','Cardnum_state_max_30','Cardnum_state_total_14','Cardnum_Merchnum_total_30','Cardnum_Merchnum_zip_total_30','Cardnum_Zip_total_30','Cardnum_Zip3_total_30','Cardnum_Merchnum_zip3_total_30','Cardnum_Merchnum_state_total_7','Cardnum_Merchnum_zip_total_14',
                 'Cardnum_Merchnum_zip3_total_7','Cardnum_Merchnum_zip3_total_14','Cardnum_Merchnum_state_total_14','Cardnum_Merchnum_zip_total_7','Cardnum_Zip_total_14','Cardnum_Merchnum_state_total_30','Cardnum_Zip3_total_14','Cardnum_state_total_30','Cardnum_Zip3_total_7','Cardnum_Merchnum_total_14']

#direct from the file created by feature selection
if versionFilterCols ==5:
    rank_order = pd.read_excel("final_vars_list.xlsx")
    filterCols = list(rank_order['New Feature'])

y_name = 'Fraud'
index_name = 'Recnum'
fileList = ['final1_1.csv','final2_2.csv','final3_3.csv']

final_df_l = []
for i in fileList:
    if i==fileList[0]:
        final_df = pd.read_csv(i).filter(filterCols + [y_name,index_name])
    else:
        final_df = pd.read_csv(i).filter(filterCols)
    final_df_l.append(final_df)
df = pd.concat(final_df_l,axis = 1)

In [5]:
filterCols

['Cardnum_state_max_14',
 'Cardnum_Merchnum_zip3_total_7',
 'Cardnum_Merchnum_zip3_total_14',
 'Cardnum_Merchnum_total_7',
 'Cardnum_Merchnum_total_14',
 'Cardnum_Merchnum_state_total_7',
 'Cardnum_Merchnum_zip_total_30',
 'Cardnum_Merchnum_total_30',
 'Cardnum_Merchnum_state_total_30',
 'Cardnum_Merchnum_zip_total_14',
 'Cardnum_Merchnum_state_total_14',
 'Merchnum_Zip3_max_14',
 'Merchnum_med_3',
 'Merchnum_med_0',
 'Merchnum_Zip_med_1',
 'Merchnum_Zip_med_0',
 'Merchnum_Zip3_med_1',
 'Merchnum_med_1',
 'Merchnum_Zip3_med_3',
 'Merchnum_Zip_med_3']

In [6]:
# n = 15
# vars = df.filter(['record','fraud_label']+filterCols[0:n])#df.copy()
# vars['Fraud'] = Y
# vars = df.copy()
vars = df.filter(filterCols+['Recnum','Fraud'])

In [7]:
len(vars.columns)

22

In [8]:
vars['Fraud'].value_counts()

0    95338
1     1059
Name: Fraud, dtype: int64
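The counts above imply a fraud rate of roughly 1.1%. The short calculation below makes that explicit and also shows how many records a 3% cutoff (the cutoff used inside `modelAndOutput`) would select if it were applied to the full data set. The variable names are illustrative.

```python
n_legit, n_fraud = 95338, 1059          # counts from value_counts() above
n_total = n_legit + n_fraud
fraud_rate = n_fraud / n_total
top_rows = int(round(n_total * 0.03))   # same rounding as in modelAndOutput

print(n_total, round(fraud_rate, 6), top_rows)
```

A ranking with no predictive power would be expected to capture about 3% of the frauds in the top 3% of records, which gives a baseline for reading the FDR values reported later.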

In [9]:
# using this to cap variables.
Clip = 10

In [10]:
vars.head(10)

Unnamed: 0,Cardnum_state_max_14,Cardnum_Merchnum_zip3_total_7,Cardnum_Merchnum_zip3_total_14,Cardnum_Merchnum_total_7,Cardnum_Merchnum_total_14,Cardnum_Merchnum_state_total_7,Cardnum_Merchnum_zip_total_30,Cardnum_Merchnum_total_30,Cardnum_Merchnum_state_total_30,Cardnum_Merchnum_zip_total_14,...,Merchnum_med_3,Merchnum_med_0,Merchnum_Zip_med_1,Merchnum_Zip_med_0,Merchnum_Zip3_med_1,Merchnum_med_1,Merchnum_Zip3_med_3,Merchnum_Zip_med_3,Recnum,Fraud
0,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,...,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,1,0
1,31.42,31.42,31.42,31.42,31.42,31.42,31.42,31.42,31.42,31.42,...,31.42,31.42,31.42,31.42,31.42,31.42,31.42,31.42,2,0
2,178.49,178.49,178.49,178.49,178.49,178.49,178.49,178.49,178.49,178.49,...,178.49,178.49,178.49,178.49,178.49,178.49,178.49,178.49,3,0
3,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,...,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,4,0
4,3.62,7.24,7.24,7.24,7.24,7.24,7.24,7.24,7.24,7.24,...,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,5,0
5,3.67,3.67,3.67,3.67,3.67,3.67,3.67,3.67,3.67,3.67,...,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,6,0
6,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,...,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,7,0
7,230.32,230.32,230.32,230.32,230.32,230.32,230.32,230.32,230.32,230.32,...,230.32,230.32,230.32,230.32,230.32,230.32,230.32,230.32,8,0
8,62.11,62.11,62.11,62.11,62.11,62.11,62.11,62.11,62.11,62.11,...,62.11,62.11,62.11,62.11,62.11,62.11,62.11,62.11,9,0
9,3.62,10.86,10.86,10.86,10.86,10.86,10.86,10.86,10.86,10.86,...,3.62,3.62,3.62,3.62,3.62,3.62,3.62,3.62,10,0


In [11]:
vars.shape

(96397, 22)

In [12]:
vars.describe()

Unnamed: 0,Cardnum_state_max_14,Cardnum_Merchnum_zip3_total_7,Cardnum_Merchnum_zip3_total_14,Cardnum_Merchnum_total_7,Cardnum_Merchnum_total_14,Cardnum_Merchnum_state_total_7,Cardnum_Merchnum_zip_total_30,Cardnum_Merchnum_total_30,Cardnum_Merchnum_state_total_30,Cardnum_Merchnum_zip_total_14,...,Merchnum_med_3,Merchnum_med_0,Merchnum_Zip_med_1,Merchnum_Zip_med_0,Merchnum_Zip3_med_1,Merchnum_med_1,Merchnum_Zip3_med_3,Merchnum_Zip_med_3,Recnum,Fraud
count,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,...,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0
mean,604.574489,686.091673,765.702015,690.6114,772.109278,689.875292,922.519302,926.928674,923.743287,769.567144,...,359.105159,379.52893,371.718333,381.336867,371.331601,368.794987,361.539682,362.140518,48365.48182,0.010986
std,1190.475781,4101.605158,4165.996211,4104.138245,4170.644835,4103.963679,4299.112926,4304.302283,4299.826303,4168.029315,...,707.308222,742.334681,737.858824,750.513298,737.543383,728.921876,716.128564,716.572697,27945.003883,0.104236
min,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,1.0,0.0
25%,66.96,65.0,76.1,65.6,77.16,65.51,93.78,94.0,93.89,77.0,...,40.47,36.7,38.71,36.3,38.785,39.0,40.005,40.0,24154.0,0.0
50%,258.25,201.6,235.4,203.0,237.75,202.6,289.59,290.09,289.95,236.74,...,150.97,147.67,149.09,147.1,149.1,149.75,150.0,150.0,48365.0,0.0
75%,734.39,597.7,674.0,600.0,679.67,599.98,824.06,829.84,825.0,676.62,...,390.0,420.0,405.47,421.9,404.97,400.0,394.45,395.0,72578.0,0.0
max,47900.0,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,...,27218.0,27218.0,28392.84,28392.84,28392.84,27218.0,28392.84,28392.84,96753.0,1.0


In [13]:
record_save = vars['Recnum']
Y_save = pd.DataFrame(vars.loc[:,'Fraud'])
Y_save.head()

Unnamed: 0,Fraud
0,0
1,0
2,0
3,0
4,0


In [14]:
record_save

0            1
1            2
2            3
3            4
4            5
         ...  
96392    96749
96393    96750
96394    96751
96395    96752
96396    96753
Name: Recnum, Length: 96397, dtype: int64

In [15]:
## Scale and truncate field values

In [16]:
X_no_scaling = vars.drop(columns = ['Recnum','Fraud'])
X_no_scaling.describe()

Unnamed: 0,Cardnum_state_max_14,Cardnum_Merchnum_zip3_total_7,Cardnum_Merchnum_zip3_total_14,Cardnum_Merchnum_total_7,Cardnum_Merchnum_total_14,Cardnum_Merchnum_state_total_7,Cardnum_Merchnum_zip_total_30,Cardnum_Merchnum_total_30,Cardnum_Merchnum_state_total_30,Cardnum_Merchnum_zip_total_14,Cardnum_Merchnum_state_total_14,Merchnum_Zip3_max_14,Merchnum_med_3,Merchnum_med_0,Merchnum_Zip_med_1,Merchnum_Zip_med_0,Merchnum_Zip3_med_1,Merchnum_med_1,Merchnum_Zip3_med_3,Merchnum_Zip_med_3
count,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0
mean,604.574489,686.091673,765.702015,690.6114,772.109278,689.875292,922.519302,926.928674,923.743287,769.567144,770.446108,1003.285185,359.105159,379.52893,371.718333,381.336867,371.331601,368.794987,361.539682,362.140518
std,1190.475781,4101.605158,4165.996211,4104.138245,4170.644835,4103.963679,4299.112926,4304.302283,4299.826303,4168.029315,4168.679531,1830.54797,707.308222,742.334681,737.858824,750.513298,737.543383,728.921876,716.128564,716.572697
min,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01
25%,66.96,65.0,76.1,65.6,77.16,65.51,93.78,94.0,93.89,77.0,77.0,159.92,40.47,36.7,38.71,36.3,38.785,39.0,40.005,40.0
50%,258.25,201.6,235.4,203.0,237.75,202.6,289.59,290.09,289.95,236.74,237.01,479.0,150.97,147.67,149.09,147.1,149.1,149.75,150.0,150.0
75%,734.39,597.7,674.0,600.0,679.67,599.98,824.06,829.84,825.0,676.62,677.4,1488.0,390.0,420.0,405.47,421.9,404.97,400.0,394.45,395.0
max,47900.0,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,306633.41,47900.0,27218.0,27218.0,28392.84,28392.84,28392.84,27218.0,28392.84,28392.84


In [17]:
X = (X_no_scaling - X_no_scaling.mean()) / X_no_scaling.std()

In [18]:
# Cap outliers: clip every standardized column to [-Clip, Clip]
X = X.clip(lower=-Clip, upper=Clip)
X.describe()

Unnamed: 0,Cardnum_state_max_14,Cardnum_Merchnum_zip3_total_7,Cardnum_Merchnum_zip3_total_14,Cardnum_Merchnum_total_7,Cardnum_Merchnum_total_14,Cardnum_Merchnum_state_total_7,Cardnum_Merchnum_zip_total_30,Cardnum_Merchnum_total_30,Cardnum_Merchnum_state_total_30,Cardnum_Merchnum_zip_total_14,Cardnum_Merchnum_state_total_14,Merchnum_Zip3_max_14,Merchnum_med_3,Merchnum_med_0,Merchnum_Zip_med_1,Merchnum_Zip_med_0,Merchnum_Zip3_med_1,Merchnum_med_1,Merchnum_Zip3_med_3,Merchnum_Zip_med_3
count,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0,96397.0
mean,-0.011541,-0.012572,-0.012353,-0.012559,-0.01233,-0.01256,-0.011739,-0.011721,-0.011735,-0.012343,-0.01234,-0.01175,-0.007265,-0.007237,-0.007509,-0.007377,-0.007517,-0.007369,-0.007432,-0.007422
std,0.794573,0.461278,0.48728,0.462481,0.48926,0.462393,0.537883,0.539561,0.538136,0.488158,0.488435,0.795895,0.850221,0.86209,0.849691,0.858191,0.849497,0.853399,0.845882,0.846162
min,-0.507834,-0.167272,-0.183796,-0.16827,-0.185127,-0.168097,-0.214581,-0.215347,-0.21483,-0.184633,-0.184815,-0.548074,-0.507693,-0.51125,-0.503766,-0.508088,-0.503457,-0.505932,-0.504839,-0.505365
25%,-0.451596,-0.151426,-0.165531,-0.152288,-0.166629,-0.152137,-0.19277,-0.193511,-0.192997,-0.166162,-0.166347,-0.460717,-0.45049,-0.461825,-0.451317,-0.459735,-0.450884,-0.452442,-0.44899,-0.449557
50%,-0.290913,-0.118122,-0.127293,-0.11881,-0.128124,-0.118733,-0.147223,-0.147954,-0.1474,-0.127837,-0.127963,-0.286409,-0.294264,-0.312337,-0.301722,-0.312102,-0.301313,-0.300505,-0.295393,-0.296049
75%,0.109045,-0.021551,-0.022012,-0.022078,-0.022164,-0.021905,-0.022902,-0.022556,-0.022964,-0.0223,-0.02232,0.264792,0.043679,0.054519,0.045743,0.054047,0.045609,0.04281,0.045956,0.045856
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
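The two steps above, z-scaling followed by clipping at ±`Clip`, can be combined in one helper. The sketch below is illustrative: `zscale_and_clip` and its `fit_rows` parameter are hypothetical names. `fit_rows` is an optional variation that this notebook does not use. The notebook computes the mean and standard deviation over all rows, including the OOT rows, whereas passing `fit_rows=oot_index` would estimate them from the modeling rows only.

```python
import pandas as pd


def zscale_and_clip(df, clip=10.0, fit_rows=None):
    """Z-scale every column, then clip the result to [-clip, clip].

    fit_rows: if given, mean and std come from the first fit_rows rows only;
    otherwise they come from all rows (the notebook's behaviour).
    """
    reference = df if fit_rows is None else df.iloc[:fit_rows]
    scaled = (df - reference.mean()) / reference.std()   # pandas std uses ddof=1
    return scaled.clip(lower=-clip, upper=clip)


demo = pd.DataFrame({'amount': [1.0, 2.0, 3.0, 1000.0]})
print(zscale_and_clip(demo, clip=1.0))               # statistics from all rows
print(zscale_and_clip(demo, clip=10.0, fit_rows=3))  # statistics from first 3 rows
```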


In [19]:
# # separate data into modeling (traintest) and out of time
# X_trntst = X[0:833507]
# Y_trntst = Y_save[0:833507]
# X_oot = X[833507:]
# Y_oot = Y_save[833507:]

In [20]:
niter = 0
nitermax = 10
FDR3 = pd.DataFrame(np.zeros((nitermax,3)), columns=('trn', 'tst', 'oot'))
pd.options.mode.chained_assignment = None  # default='warn'

In [21]:
type(Y_save)

pandas.core.frame.DataFrame

In [22]:
X[oot_index:]

Unnamed: 0,Cardnum_state_max_14,Cardnum_Merchnum_zip3_total_7,Cardnum_Merchnum_zip3_total_14,Cardnum_Merchnum_total_7,Cardnum_Merchnum_total_14,Cardnum_Merchnum_state_total_7,Cardnum_Merchnum_zip_total_30,Cardnum_Merchnum_total_30,Cardnum_Merchnum_state_total_30,Cardnum_Merchnum_zip_total_14,Cardnum_Merchnum_state_total_14,Merchnum_Zip3_max_14,Merchnum_med_3,Merchnum_med_0,Merchnum_Zip_med_1,Merchnum_Zip_med_0,Merchnum_Zip3_med_1,Merchnum_med_1,Merchnum_Zip3_med_3,Merchnum_Zip_med_3
84299,2.037249,0.571430,0.543488,0.569976,0.541346,0.570179,0.490183,0.488567,0.489817,0.542295,0.542000,1.107092,3.775956,3.570278,3.602521,3.528962,3.604586,3.650700,3.726049,3.722901
84300,-0.377668,-0.133419,-0.150466,-0.134438,-0.151835,-0.134264,-0.182284,-0.183089,-0.182538,-0.151320,-0.151507,-0.472222,-0.311385,-0.324205,-0.315587,-0.323081,-0.315197,-0.315445,-0.310949,-0.311595
84301,-0.373107,-0.156722,-0.173409,-0.157727,-0.174752,-0.157554,-0.204516,-0.205294,-0.204767,-0.174252,-0.174436,-0.524436,-0.449564,-0.455864,-0.448044,-0.453306,-0.447711,-0.449527,-0.447426,-0.447987
84302,1.202986,0.329288,0.305089,0.327983,0.303212,0.328177,0.259165,0.257828,0.258838,0.304012,0.303754,0.702912,2.550861,2.232377,2.428150,2.205641,2.429713,2.461931,2.516043,2.513645
84303,-0.339843,-0.118513,-0.135790,-0.119541,-0.137175,-0.119366,-0.168062,-0.168884,-0.168319,-0.136651,-0.136841,-0.438822,-0.224945,-0.241844,-0.232725,-0.241617,-0.232300,-0.231568,-0.225574,-0.226272
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96392,-0.436619,-0.146602,-0.163445,-0.147612,-0.164799,-0.147439,-0.194861,-0.195650,-0.195113,-0.164293,-0.164478,-0.501760,-0.387830,-0.397043,-0.388866,-0.395125,-0.388508,-0.389623,-0.386453,-0.387051
96393,0.655558,-0.138322,-0.011306,-0.139338,-0.012830,-0.139164,-0.047433,-0.048400,-0.047710,-0.012228,-0.012437,-0.130975,-0.339817,-0.351296,-0.342841,-0.349876,-0.342463,-0.343034,-0.339031,-0.339659
96394,-0.202452,-0.078635,-0.096530,-0.079688,-0.097958,-0.079512,-0.130017,-0.130885,-0.130280,-0.097410,-0.097606,0.811066,1.563067,1.461802,1.481254,1.443463,1.482412,1.503426,1.540415,1.538622
96395,1.341863,0.369596,0.344774,0.368267,0.342854,0.368462,0.424067,0.422531,0.423712,0.343679,0.343414,0.735662,1.705077,2.455094,1.617385,2.425931,1.618601,1.641225,1.680676,1.678796


## Model Function <a class="anchor" id="func"></a>

In [23]:
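def modelAndOutput(modelName, criterionDict, X, Y_save, filterCols, nitermax, oot_index = oot_index, fraud_label = 'Fraud'):
    """
    Fit a model on random train/test splits and compute the FDR at a 3% cutoff for train, test and OOT data,
    averaged over nitermax iterations, for each hyperparameter scenario in criterionDict.
    
    Args:
    modelName(str): name of the model 
    criterionDict(dict): hyperparameter dict; each value is a list with one entry per scenario
    X(dataframe): containing input variables
    Y_save(dataframe): contains Y variable
    filterCols(list): list of columns ranked by importance; the first n_variables of them are used in each scenario
    nitermax(int): number of iterations to run for each scenario
    oot_index(int): row position at which the OOT data starts
    fraud_label(str): name of the target column in Y_save
    
    Returns:
    dataframe: one row per scenario with its hyperparameters, the mean trn/tst/oot FDR, a 'Best OOT' flag and the model name
    """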
    countScenarios = len(criterionDict['n_variables'])
    FDR_results = []
    for i in range(0,countScenarios):
        X_trntst = X[0:oot_index]
        Y_trntst = Y_save[0:oot_index]
        X_oot = X[oot_index:]
        Y_oot = Y_save[oot_index:]
        n = criterionDict['n_variables'][i]
        X_trntst = X_trntst.filter(filterCols[0:n])
        X_oot = X_oot.filter(filterCols[0:n])
        X_oot_orig = X_oot.copy()
        
        FDR3 = pd.DataFrame(np.zeros((nitermax,3)), columns=('trn', 'tst', 'oot'))
        
        for niter in range(nitermax):  
            print("train-test-split")
            X_trn, X_tst, Y_trn, Y_tst = train_test_split(X_trntst, Y_trntst, test_size = .3)
            
            
            if modelName == 'DecisionTreeClassifier':
                model = DecisionTreeClassifier(criterion = criterionDict['criterion'][i], max_depth = criterionDict['max_depth'][i],
                                          max_features = criterionDict['max_features'][i], min_samples_split = criterionDict['min_samples_split'][i],\
                                                  min_samples_leaf = criterionDict['min_samples_leaf'][i])
                    
                    
                
            elif modelName == 'RandomForestClassifier':
                model = RandomForestClassifier(criterion = criterionDict['criterion'][i], max_depth = criterionDict['max_depth'][i],
                                          max_features = criterionDict['max_features'][i],min_samples_split = criterionDict['min_samples_split'][i],\
                                                  min_samples_leaf = criterionDict['min_samples_leaf'][i])

                
            elif modelName == 'lgb':
                model = lgb.LGBMClassifier(learning_rate = criterionDict['learning_rate'][i],\
                                               boosting_type = criterionDict['boosting_type'][i],\
                                               num_leaves = criterionDict['num_leaves'][i])
            
            elif modelName == 'LogisticRegression':
                # Only 'n_variables' is supported for logistic regression
                if any(key != 'n_variables' for key in criterionDict):
                    raise Exception("Hyperparameter tuning not implemented for Logistic Regression")
                model = LogisticRegression()

            elif modelName == 'MLPClassifier':
                model = MLPClassifier(hidden_layer_sizes = criterionDict['hidden_layer_sizes'][i],
                                          activation = criterionDict['activation'][i],
                                          solver = criterionDict['solver'][i],
                                          alpha = criterionDict['alpha'][i],
                                          learning_rate = criterionDict['learning_rate'][i],
                                          learning_rate_init = criterionDict['learning_rate_init'][i])
            
            elif modelName == 'SVC':
                model = SVC(kernel = criterionDict['kernel'][i],probability = True)
            
            elif modelName == 'KNeighborsClassifier':
                model = KNeighborsClassifier(n_neighbors = criterionDict['n_neighbors'][i],weights = criterionDict['weights'][i])

            else:
                raise Exception("Not implemented yet")
                
            
            print("model object created")
            
            X_oot = X_oot_orig.copy()
            X_trn_save = X_trn.copy()
            Y_trn_save = Y_trn.copy()

            print("training started")
            model.fit(X_trn, Y_trn.values.ravel())   
            print("training ended")
            predictions = model.predict_proba(X_trn_save)[:,1]
            X_trn['predicted'] = predictions
            X_trn[fraud_label] = Y_trn_save[fraud_label]
            topRows = int(round(X_trn.shape[0]*0.03))
            temp = X_trn.sort_values('predicted',ascending=False).head(topRows)
            needed = temp.loc[:,fraud_label]
            FDR3.loc[niter, 'trn'] = sum(needed)/sum(X_trn.loc[:,fraud_label])

            predictions = model.predict_proba(X_tst)[:,1]
            X_tst['predicted']=predictions
            X_tst[fraud_label] = Y_tst[fraud_label]
            topRows = int(round(X_tst.shape[0]*0.03))
            temp = X_tst.sort_values('predicted',ascending=False).head(topRows)
            needed = temp.loc[:,fraud_label]
            FDR3.loc[niter, 'tst'] = sum(needed)/sum(X_tst.loc[:,fraud_label])

            predictions = model.predict_proba(X_oot)[:,1]
            X_oot['predicted']=predictions
            X_oot[fraud_label] = Y_oot[fraud_label]
            topRows = int(round(X_oot.shape[0]*0.03))
            temp = X_oot.sort_values('predicted',ascending=False).head(topRows)
            needed = temp.loc[:,fraud_label]
            FDR3.loc[niter, 'oot'] = sum(needed)/sum(X_oot.loc[:,fraud_label])
            print(niter, FDR3.loc[niter, 'trn'],FDR3.loc[niter, 'tst'],FDR3.loc[niter, 'oot'])
        print("Scenario "+ str(i))
        print(FDR3.mean())
        FDR_results.append(FDR3.mean())
        print(f"*************************Scenario {i} completed***********************************************") 

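    # One row per scenario: hyperparameters joined with the mean trn/tst/oot FDR
    resultDf = pd.DataFrame(criterionDict).join(pd.concat(FDR_results,axis = 1).T)
    # Flag the scenario(s) with the highest mean OOT FDR
    resultDf['Best OOT'] = np.where(resultDf['oot'] == max(resultDf['oot']),1,0)
    resultDf['model'] = modelName 
    return resultDf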
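The train, test and OOT blocks in `modelAndOutput` repeat the same calculation: rank the records by predicted fraud probability, keep the top 3%, and divide the number of frauds found there by the number of frauds in the whole set. A minimal sketch of that calculation as a standalone helper follows. `fdr_at_cutoff` is a hypothetical name, and the snippet uses plain Python lists rather than the notebook's dataframes.

```python
def fdr_at_cutoff(scores, labels, cutoff=0.03):
    """Fraction of all frauds that fall in the top `cutoff` share of scores."""
    n_top = int(round(len(scores) * cutoff))
    # Indices ordered from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    caught = sum(labels[j] for j in ranked[:n_top])
    total = sum(labels)
    return caught / total if total else float('nan')


# 100 records with 3 frauds; two of the frauds are ranked in the top 3
scores = [0.99, 0.98, 0.97] + [0.01] * 97
labels = [1, 1, 0] + [0] * 96 + [1]
print(fdr_at_cutoff(scores, labels))
```

Here the top 3% is three records, two of which are frauds, so the helper reports two thirds of the frauds as caught.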

### Models to run <a class="anchor" id="models_list"></a>

In [24]:
modelsToRun = ['LogisticRegression','DecisionTreeClassifier','RandomForestClassifier','lgb','MLPClassifier']  # add 'SVC' or 'KNeighborsClassifier' to run them as well
outputDfList = []
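Because only the names in `modelsToRun` are explored, the per-model cells could also be driven by a single loop. The sketch below assumes a callable that plays the role of `modelAndOutput`; `run_selected_models` and `fake_run` are hypothetical stand-ins so that the snippet runs on its own.

```python
def run_selected_models(models_to_run, hp_dict, run_one):
    """Call run_one(name, hyperparameters) for each selected model.

    Raises before any model is run if a selected model has no
    hyperparameter entry.
    """
    missing = [name for name in models_to_run if name not in hp_dict]
    if missing:
        raise KeyError(f"No hyperparameters defined for: {missing}")
    return [run_one(name, hp_dict[name]) for name in models_to_run]


# Stand-in for modelAndOutput: reports how many scenarios would be run
def fake_run(name, hyperparameters):
    return name, len(hyperparameters['n_variables'])


demo_hp = {'LogisticRegression': {'n_variables': [5, 10, 15, 20]},
           'SVC': {'n_variables': [5, 5], 'kernel': ['rbf', 'linear']}}
results = run_selected_models(['LogisticRegression'], demo_hp, fake_run)
print(results)
```

In the notebook, `run_one` would wrap `modelAndOutput` with `X`, `Y_save`, `filterCols` and `nitermax` already supplied.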

### Hyperparameter Dict <a class="anchor" id="hp"></a>

In [25]:
hpDict = {'DecisionTreeClassifier': {'n_variables':[5,15,15,15,15,20,5,20],
                                    'criterion':['gini','gini','entropy','gini','entropy','entropy','gini','gini'],
                                    'max_depth':[None,None,None,4,1000,2,None,10000],
                                    'min_samples_split':[2,2,2,2,2,2,100,2],
                                    'max_features':[5,15,15,15,15,10,5,20],
                                    'min_samples_leaf':[1,1,1,1,1,1,50,1]},
          
         'LogisticRegression': {'n_variables':[5,10,15,20]},
          
          'RandomForestClassifier': {'n_variables':[5,15,15,15,15,20,3,20,5,5,5],
                                    'criterion':['gini','gini','entropy','gini','entropy','entropy','gini','gini','gini','entropy','entropy'],
                                    'max_depth': [None,None,None,20,20,None,None,100,None,5,1000],
                                    'min_samples_split':[2,2,2,2,2,2,2,2,100,2,2], 
                                    'max_features':[5,15,15,15,15,10,3,20,5,5,5],
                                    'min_samples_leaf':[1,1,1,1,1,1,1,1,50,1,1]},
          
         'lgb': {'n_variables':[5,15,15,15,15,20,5,5],
                 'num_leaves' : [31,31,31,31,40,40,5,10000],
                 'learning_rate':[0.1,0.1,0.01,0.1,0.1,0.01,0.001,0.001],
                 'boosting_type':['gbdt','gbdt','gbdt','dart','gbdt','dart','gbdt','gbdt']
                },
          
         'MLPClassifier': { 'n_variables':[5,5,5,5,5,5,5,5,5],
                            'hidden_layer_sizes':[(100,),(10,),(50,50),(50,50),(100,100,100),(25,25,25),(1,),(1,),(1,)],\
                            'activation':['relu','relu','relu','logistic','logistic','logistic','relu','relu','logistic'],
                            'solver':['adam','adam','adam','adam','sgd','sgd','adam','sgd','adam'], \
                            'alpha' : [0.0001,0.0001,0.001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001],\
                            'learning_rate':['constant','invscaling','adaptive','constant','constant','constant','constant','constant','constant'],\
                            'learning_rate_init':[0.001,0.001,0.001,0.025,0.5,0.05,0.001,0.001,0.001]},
          
          'KNeighborsClassifier':  { 'n_variables':[5],'n_neighbors':[5],'weights':['uniform']},
          
          'SVC' : { 'n_variables':[5,5], 'kernel':["rbf","linear"]}
         
         
         }
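`modelAndOutput` reads scenario `i` by taking element `i` from every list in a model's dict, so all lists for one model need the same length. A mismatch would otherwise surface only mid-run as an `IndexError`, or at the end when the results dataframe is built. A hedged sketch of an up-front check follows; `check_hp_lengths` is a hypothetical helper, and the small dict is illustrative data, not the notebook's `hpDict`.

```python
def check_hp_lengths(hp_dict):
    """Return {model: {param: length}} for models whose lists differ in length."""
    problems = {}
    for model_name, params in hp_dict.items():
        lengths = {param: len(values) for param, values in params.items()}
        if len(set(lengths.values())) > 1:
            problems[model_name] = lengths
    return problems


demo_hp = {
    'LogisticRegression': {'n_variables': [5, 10, 15, 20]},
    # 'weights' is one entry short on purpose
    'KNeighborsClassifier': {'n_variables': [5, 5],
                             'n_neighbors': [5, 7],
                             'weights': ['uniform']},
}
print(check_hp_lengths(demo_hp))
```

An empty result means every model's lists line up. Calling the check on `hpDict` before the first model is run would catch such a mismatch before any training time is spent.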

In [26]:
# hpDict = {'DecisionTreeClassifier': {'n_variables':[5,14,14,14,14,14,5,14],
#                                     'criterion':['gini','gini','entropy','gini','entropy','entropy','gini','gini'],
#                                     'max_depth':[None,None,None,4,1000,2,None,10000],
#                                     'min_samples_split':[2,2,2,2,2,2,100,2],
#                                     'max_features':[5,14,14,14,14,14,5,14],
#                                     'min_samples_leaf':[1,1,1,1,1,1,50,1]},
          
#          'LogisticRegression': {'n_variables':[5,10,14]},
          
#           'RandomForestClassifier': {'n_variables':[15,15,18],
#                                     'criterion':['gini','gini','gini'],
#                                     'max_depth': [20,None,5],
#                                     'min_samples_split':[200,2,100], 
#                                     'max_features':[5,10,5],
#                                     'min_samples_leaf':[100,50]},
          
#          'lgb': {'n_variables':[5,15,15,15,15,20,5,5],
#                  'num_leaves' : [31,31,31,31,40,40,5,10000],
#                  'learning_rate':[0.1,0.1,0.01,0.1,0.1,0.01,0.001,0.001],
#                  'boosting_type':['gbdt','gbdt','gbdt','dart','gbdt','dart','gbdt','gbdt']
#                 },
          
#          'MLPClassifier': { 'n_variables':[5,5,5,5,5,5,5,5,5],
#                             'hidden_layer_sizes':[(100),(10),(50,50),(50,50),(100,100,100),(25,25,25),(1),(1),(1)],\
#                             'activation':['relu','relu','relu','logistic','logistic','logistic','relu','relu','logistic'],
#                             'solver':['adam','adam','adam','adam','sgd','sgd','adam','sgd','adam'], \
#                             'alpha' : [0.0001,0.0001,0.001,0.0001,0.0001,0.0001,0.0001,0.0001,0.0001],\
#                             'learning_rate':['constant','invscaling','adaptive','constant','constant','constant','constant','constant','constant'],\
#                             'learning_rate_init':[0.001,0.001,0.001,0.025,0.5,0.05,0.001,0.001,0.001]},
          
#           'KNeighborsClassifier':  { 'n_variables':[5],'n_neighbors':[5],'weights':['uniform']},
          
#           'SVC' : { 'n_variables':[5,5], 'kernel':["rbf","linear"]}
         
         
#          }

## Model Algorithms  <a class="anchor" id="models_algo"></a>

### Logistic Regression  <a class="anchor" id="LR"></a>

In [27]:
if 'LogisticRegression' in modelsToRun:
    outputLR = modelAndOutput(modelName = 'LogisticRegression',criterionDict = hpDict['LogisticRegression'],\
                   X = X, Y_save = Y_save, filterCols = filterCols, nitermax = nitermax)
    outputDfList.append(outputLR)

train-test-split
model object created
training started
training ended
0 0.6381578947368421 0.6691176470588235 0.3687150837988827
train-test-split
model object created
training started
training ended
1 0.6316639741518578 0.6513409961685823 0.4301675977653631
train-test-split
model object created
training started
training ended
2 0.648424543946932 0.6353790613718412 0.3687150837988827
train-test-split
model object created
training started
training ended
3 0.6477272727272727 0.6439393939393939 0.3575418994413408
train-test-split
model object created
training started
training ended
4 0.6473429951690821 0.6293436293436293 0.3687150837988827
train-test-split
model object created
training started
training ended
5 0.6601307189542484 0.6044776119402985 0.3743016759776536
train-test-split
model object created
training started
training ended
6 0.6392914653784219 0.6525096525096525 0.3687150837988827
train-test-split
model object created
training started
training ended
7 0.6384 0.6588235294117647 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
1 0.6075533661740559 0.5571955719557196 0.3575418994413408
train-test-split
model object created
training started
training ended
2 0.597444089456869 0.6141732283464567 0.3743016759776536
train-test-split
model object created
training started


training ended
3 0.5984251968503937 0.5755102040816327 0.36312849162011174
train-test-split
model object created
training started


training ended
4 0.5931372549019608 0.5932835820895522 0.36312849162011174
train-test-split
model object created
training started


training ended
5 0.603648424543947 0.5740072202166066 0.36312849162011174
train-test-split
model object created
training started
training ended
6 0.5656565656565656 0.6573426573426573 0.36312849162011174
train-test-split
model object created
training started


training ended
7 0.6033950617283951 0.6379310344827587 0.37988826815642457
train-test-split
model object created
training started
training ended
8 0.5915721231766613 0.6083650190114068 0.3575418994413408
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
9 0.6131045241809673 0.5815899581589958 0.36312849162011174
Scenario 2
trn    0.598125
tst    0.595717
oot    0.364804
dtype: float64
*************************Scenario 2 completed***********************************************
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
0 0.5892857142857143 0.5833333333333334 0.37988826815642457
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
1 0.6199376947040498 0.6050420168067226 0.3575418994413408
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
2 0.5990338164251208 0.6486486486486487 0.36312849162011174
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
3 0.6174603174603175 0.576 0.36312849162011174
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
4 0.5980392156862745 0.5932835820895522 0.36312849162011174
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
5 0.5940099833610649 0.6415770609318996 0.3687150837988827
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
6 0.6183574879227053 0.5984555984555985 0.36312849162011174
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
7 0.608130081300813 0.5849056603773585 0.36312849162011174
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
8 0.592891760904685 0.6015325670498084 0.3743016759776536
train-test-split
model object created
training started


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


training ended
9 0.6142625607779578 0.6007604562737643 0.36312849162011174
Scenario 3
trn    0.605141
tst    0.603354
oot    0.365922
dtype: float64
*************************Scenario 3 completed***********************************************


In [28]:
# outputLR = modelAndOutput(modelName = 'LogisticRegression',criterionDict = {'n_variables':[5]},\
#                X = X, Y_save = Y_save, filterCols = filterCols, nitermax = nitermax)
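The repeated `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` messages above come from `LogisticRegression` hitting its default iteration cap. The warning itself suggests scaling the data or raising `max_iter`. The sketch below shows one way to do both with a scikit-learn pipeline. It is a standalone illustration on synthetic data, not the code inside `modelAndOutput`; the names `X_demo`, `y_demo` and `scaled_lr` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features: five columns on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5)) * np.array([1, 10, 100, 1_000, 10_000])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10 > 0).astype(int)

# Standardize the features, then fit logistic regression with a higher iteration cap.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

# Probability of the positive class, used the same way as predict_proba(...)[:, 1] in this notebook.
fraud_scores = scaled_lr.predict_proba(X_demo)[:, 1]
```

Because the scaler is part of the pipeline, it is fitted on the training split only and reused unchanged when scoring test or OOT rows.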

### Decision Tree  <a class="anchor" id="DT"></a>

In [29]:
if 'DecisionTreeClassifier' in modelsToRun:
    outputDT = modelAndOutput(modelName = 'DecisionTreeClassifier',criterionDict = hpDict['DecisionTreeClassifier']
               , X = X, Y_save = Y_save, filterCols = filterCols, nitermax = nitermax)
    outputDfList.append(outputDT)

train-test-split
model object created
training started
training ended
0 1.0 0.568 0.24581005586592178
train-test-split
model object created
training started
training ended
1 1.0 0.549645390070922 0.2122905027932961
train-test-split
model object created
training started
training ended
2 1.0 0.5843137254901961 0.2122905027932961
train-test-split
model object created
training started
training ended
3 1.0 0.5357142857142857 0.2849162011173184
train-test-split
model object created
training started
training ended
4 1.0 0.596 0.24581005586592178
train-test-split
model object created
training started
training ended
5 1.0 0.5638297872340425 0.22346368715083798
train-test-split
model object created
training started
training ended
6 1.0 0.5826446280991735 0.22346368715083798
train-test-split
model object created
training started
training ended
7 1.0 0.5851851851851851 0.27932960893854747
train-test-split
model object created
training started
training ended
8 1.0 0.5675675675675675 0.2458100558659

training ended
1 0.8003194888178914 0.7559055118110236 0.5586592178770949
train-test-split
model object created
training started
training ended
2 0.7964743589743589 0.76953125 0.5586592178770949
train-test-split
model object created
training started
training ended
3 0.8096026490066225 0.7463768115942029 0.36312849162011174
train-test-split
model object created
training started
training ended
4 0.8153078202995009 0.7275985663082437 0.547486033519553
train-test-split
model object created
training started
training ended
5 0.8109375 0.725 0.36312849162011174
train-test-split
model object created
training started
training ended
6 0.8072100313479624 0.7107438016528925 0.36312849162011174
train-test-split
model object created
training started
training ended
7 0.8086816720257235 0.7286821705426356 0.36312849162011174
train-test-split
model object created
training started
training ended
8 0.7987220447284346 0.7401574803149606 0.36312849162011174
train-test-split
model object created
training st

### Random Forest  <a class="anchor" id="RF"></a>

In [30]:
if 'RandomForestClassifier' in modelsToRun:
    outputRF = modelAndOutput(modelName = 'RandomForestClassifier',criterionDict = hpDict['RandomForestClassifier']
               , X = X, Y_save = Y_save, filterCols = filterCols, nitermax = nitermax)
    outputDfList.append(outputRF)

train-test-split
model object created
training started
training ended
0 1.0 0.7635658914728682 0.30726256983240224
train-test-split
model object created
training started
training ended
1 1.0 0.8208333333333333 0.4301675977653631
train-test-split
model object created
training started
training ended
2 1.0 0.7640449438202247 0.39664804469273746
train-test-split
model object created
training started
training ended
3 1.0 0.7938931297709924 0.41899441340782123
train-test-split
model object created
training started
training ended
4 1.0 0.7704280155642024 0.37988826815642457
train-test-split
model object created
training started
training ended
5 1.0 0.7862318840579711 0.4301675977653631
train-test-split
model object created
training started
training ended
6 1.0 0.7689243027888446 0.40782122905027934
train-test-split
model object created
training started
training ended
7 1.0 0.7536764705882353 0.3016759776536313
train-test-split
model object created
training started
training ended
8 1.0 0.74131

training ended
3 1.0 0.7175572519083969 0.41899441340782123
train-test-split
model object created
training started
training ended
4 1.0 0.77734375 0.39664804469273746
train-test-split
model object created
training started
training ended
5 1.0 0.7449392712550608 0.35195530726256985
train-test-split
model object created
training started
training ended
6 1.0 0.7692307692307693 0.26256983240223464
train-test-split
model object created
training started
training ended
7 1.0 0.7789855072463768 0.37988826815642457
train-test-split
model object created
training started
training ended
8 1.0 0.7824427480916031 0.36312849162011174
train-test-split
model object created
training started
training ended
9 1.0 0.7295373665480427 0.4245810055865922
Scenario 6
trn    1.000000
tst    0.757766
oot    0.378771
dtype: float64
*************************Scenario 6 completed***********************************************
train-test-split
model object created
training started
training ended
0 1.0 0.88235294117647

### LGB  <a class="anchor" id="LGB"></a>

In [32]:
if 'lgb' in modelsToRun:
    outputLGB = modelAndOutput(modelName = 'lgb',criterionDict = hpDict['lgb'], X = X, Y_save = Y_save, filterCols = filterCols, nitermax = 10 )
    outputDfList.append(outputLGB)

train-test-split
model object created
training started
training ended
0 0.9207119741100324 0.7633587786259542 0.31843575418994413
train-test-split
model object created
training started
training ended
1 0.9014778325123153 0.7933579335793358 0.3240223463687151
train-test-split
model object created
training started
training ended
2 0.9109697933227345 0.7689243027888446 0.30726256983240224
train-test-split
model object created
training started
training ended
3 0.911864406779661 0.7793103448275862 0.329608938547486
train-test-split
model object created
training started
training ended
4 0.914332784184514 0.7728937728937729 0.5027932960893855
train-test-split
model object created
training started
training ended
5 0.9124797406807131 0.7718631178707225 0.2905027932960894
train-test-split
model object created
training started
training ended
6 0.913961038961039 0.7613636363636364 0.329608938547486
train-test-split
model object created
training started
training ended
7 0.9056910569105691 0.7735849

training ended
8 0.8499156829679595 0.7839721254355401 0.553072625698324
train-test-split
model object created
training started
training ended
9 0.8546409807355516 0.8284789644012945 0.5754189944134078
Scenario 5
trn    0.850039
tst    0.784657
oot    0.560894
dtype: float64
*************************Scenario 5 completed***********************************************
train-test-split
model object created
training started
training ended
0 0.6616915422885572 0.6389891696750902 0.4301675977653631
train-test-split
model object created
training started
training ended
1 0.6888168557536467 0.714828897338403 0.5586592178770949
train-test-split
model object created
training started
training ended
2 0.6996644295302014 0.6866197183098591 0.5698324022346368
train-test-split
model object created
training started
training ended
3 0.6883561643835616 0.706081081081081 0.5586592178770949
train-test-split
model object created
training started
training ended
4 0.711038961038961 0.6363636363636364 0.463687

### Neural Networks  <a class="anchor" id="NN"></a>

In [33]:
if 'MLPClassifier' in modelsToRun:
    outputNN = modelAndOutput(modelName = 'MLPClassifier',criterionDict = hpDict['MLPClassifier'] ,\
                                                            X = X, Y_save = Y_save, filterCols = filterCols, nitermax = 5)
    outputDfList.append(outputNN)

train-test-split
model object created
training started
training ended
0 0.6751700680272109 0.684931506849315 0.5083798882681564
train-test-split
model object created
training started
training ended
1 0.694888178913738 0.6653543307086615 0.553072625698324
train-test-split
model object created
training started
training ended
2 0.686084142394822 0.6984732824427481 0.553072625698324
train-test-split
model object created
training started
training ended
3 0.6769983686786297 0.6928838951310862 0.547486033519553
train-test-split
model object created
training started
training ended
4 0.6868852459016394 0.7037037037037037 0.553072625698324
Scenario 0
trn    0.684005
tst    0.689069
oot    0.543017
dtype: float64
*************************Scenario 0 completed***********************************************
train-test-split
model object created
training started
training ended
0 0.6610738255033557 0.6654929577464789 0.5363128491620112
train-test-split
model object created
training started
training en

### KNN  <a class="anchor" id="KNN"></a>

In [34]:
if 'KNeighborsClassifier' in modelsToRun:
    outputKNN = modelAndOutput(modelName = 'KNeighborsClassifier',criterionDict = hpDict['KNeighborsClassifier'],
                                X = X, Y_save = Y_save, filterCols = filterCols, nitermax = 2) #change nitermax
    outputDfList.append(outputKNN)

### SVM  <a class="anchor" id="SVM"></a>

In [35]:
if 'SVC' in modelsToRun:
    outputSVM = modelAndOutput(modelName = 'SVC',criterionDict = hpDict['SVC'],
                                                            X = X, Y_save = Y_save, filterCols = filterCols, nitermax = 1)#change nitermax
    outputDfList.append(outputSVM)

In [36]:
indexModel = [outputDf['model'][0] for outputDf in outputDfList]
finalDf = pd.concat(outputDfList, keys = indexModel).drop('model',axis = 1)
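Passing `keys` to `pd.concat` stacks the per-model result frames under a two-level index (model name, then the original row label). This is why later cells offset the Excel column positions by 2. A minimal sketch with made-up placeholder numbers (the frames `lr` and `rf` are hypothetical, not results from this notebook):

```python
import pandas as pd

# Toy per-model result frames; the metric values are placeholders only.
lr = pd.DataFrame({"model": ["LR", "LR"], "n_variables": [5, 10], "oot": [0.1, 0.2]})
rf = pd.DataFrame({"model": ["RF"], "n_variables": [5], "oot": [0.3]})
frames = [lr, rf]

# Use each frame's model name as the outer index level, then drop the now-redundant column.
keys = [frame["model"].iloc[0] for frame in frames]
combined = pd.concat(frames, keys=keys).drop("model", axis=1)

print(combined.index.nlevels)  # 2
```

Selecting `combined.loc["RF"]` then returns only the rows for that model.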

In [37]:
new_cols = [col for col in finalDf.columns if col not in ['trn','tst','oot','Best OOT']] + ['trn','tst','oot','Best OOT']
finalDf = finalDf[new_cols]

In [38]:
num_format_cols = [list(finalDf.columns).index(i) for i in finalDf.columns if i in ['trn', 'tst', 'oot'] ]

In [39]:
list(finalDf.columns).index('Best OOT')

17

In [40]:
finalDf.columns

Index(['n_variables', 'criterion', 'max_depth', 'min_samples_split',
       'max_features', 'min_samples_leaf', 'num_leaves', 'learning_rate',
       'boosting_type', 'hidden_layer_sizes', 'activation', 'solver', 'alpha',
       'learning_rate_init', 'trn', 'tst', 'oot', 'Best OOT'],
      dtype='object')

In [41]:
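with pd.ExcelWriter('Model_Performances_fv_'+str(versionFilterCols)+'.xlsx', engine='xlsxwriter') as writer:
    finalDf.to_excel(writer, sheet_name='Results')
    worksheet = writer.sheets['Results']
#     worksheet.conditional_format('F2:F8', {'type': 'number'})
    workbook  = writer.book
    format1 = workbook.add_format({'num_format': '0.000'})
    bold = workbook.add_format({'bold': True})
    for i in num_format_cols:
        # +2 skips the two index columns produced by pd.concat(..., keys=...)
        worksheet.set_column(i+2,i+2, None, format1)
    # No explicit writer.save() here: the with-block saves and closes the file on exit,
    # and calling save() first is what produces the "already closed file" warning.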

  warn("Calling close() on already closed file.")


## Best Model <a class="anchor" id="best_model"></a>

In [64]:
X = df.drop(['Fraud','Recnum'],axis = 1)
Y_save = df[['Fraud']]


#Put the best hyperparameters and best modelName below:
criterionDict = {'n_variables':[5],
                'n_estimators':[25],
                'max_features':[5],
                 'criterion':['gini'],
                'max_depth':[10],
                'min_samples_split':[200],
                'min_samples_leaf':[50],
                'bootstrap':[True]}

modelName = 'RandomForestClassifier'
filterCols = X.columns

In [59]:
filterCols

Index(['Cardnum_state_max_14', 'Cardnum_Merchnum_zip3_total_7',
       'Cardnum_Merchnum_zip3_total_14', 'Cardnum_Merchnum_total_7',
       'Cardnum_Merchnum_total_14', 'Cardnum_Merchnum_state_total_7',
       'Cardnum_Merchnum_zip_total_30', 'Cardnum_Merchnum_total_30',
       'Cardnum_Merchnum_state_total_30', 'Cardnum_Merchnum_zip_total_14',
       'Cardnum_Merchnum_state_total_14', 'Merchnum_Zip3_max_14',
       'Merchnum_med_3', 'Merchnum_med_0', 'Merchnum_Zip_med_1',
       'Merchnum_Zip_med_0', 'Merchnum_Zip3_med_1', 'Merchnum_med_1',
       'Merchnum_Zip3_med_3', 'Merchnum_Zip_med_3'],
      dtype='object')

In [65]:
#best model

i = 0
X_trntst = X[0:oot_index]
Y_trntst = Y_save[0:oot_index]
X_oot = X[oot_index:]
Y_oot = Y_save[oot_index:]
n = criterionDict['n_variables'][i]
X_trntst = X_trntst.filter(filterCols[0:n])
X_oot = X_oot.filter(filterCols[0:n])
X_oot_orig = X_oot.copy()
fraud_label = 'Fraud'

nitermax = 10
FDR3 = pd.DataFrame(np.zeros((nitermax,3)), columns=('trn', 'tst', 'oot'))

for niter in range(nitermax):  
    print("train-test-split")
    X_trn, X_tst, Y_trn, Y_tst = train_test_split(X_trntst, Y_trntst, test_size = .3)


    if modelName == 'DecisionTreeClassifier':
        model = DecisionTreeClassifier(criterion = criterionDict['criterion'][i], max_depth = criterionDict['max_depth'][i],
                                  max_features = criterionDict['max_features'][i], min_samples_split = criterionDict['min_samples_split'][i],\
                                          min_samples_leaf = criterionDict['min_samples_leaf'][i])



    elif modelName == 'RandomForestClassifier':
        model = RandomForestClassifier(criterion = criterionDict['criterion'][i], max_depth = criterionDict['max_depth'][i],
                                        max_features = criterionDict['max_features'][i],min_samples_split = criterionDict['min_samples_split'][i],\
                                        min_samples_leaf = criterionDict['min_samples_leaf'][i],\
                                        bootstrap = criterionDict['bootstrap'][i],\
                                      n_estimators = criterionDict['n_estimators'][i])


    elif modelName == 'lgb':
        model = lgb.LGBMClassifier(learning_rate = criterionDict['learning_rate'][i],\
                                       boosting_type = criterionDict['boosting_type'][i],\
                                       num_leaves = criterionDict['num_leaves'][i])

    elif modelName == 'LogisticRegression':
        if any(key != 'n_variables' for key in criterionDict):
            raise Exception("Hyperparameter tuning not implemented for Logistic Regression")
        model = LogisticRegression()

    elif modelName == 'MLPClassifier':
        model = MLPClassifier(hidden_layer_sizes = criterionDict['hidden_layer_sizes'][i],
                                  activation = criterionDict['activation'][i],
                                  solver = criterionDict['solver'][i],
                                  alpha = criterionDict['alpha'][i],
                                  learning_rate = criterionDict['learning_rate'][i],
                                  learning_rate_init = criterionDict['learning_rate_init'][i])

    elif modelName == 'SVC':
        model = SVC(kernel = criterionDict['kernel'][i],probability = True)

    elif modelName == 'KNeighborsClassifier':
        model = KNeighborsClassifier(n_neighbors = criterionDict['n_neighbors'][i],weights = criterionDict['weights'][i])

    else:
        raise Exception("Not implemented yet")


    print("model object created")

    X_oot = X_oot_orig.copy()
    X_trn_save = X_trn.copy()
    Y_trn_save = Y_trn.copy()

    print("training started")
    model.fit(X_trn, Y_trn.values.ravel())   
    print("training ended")
    predictions = model.predict_proba(X_trn_save)[:,1]
    X_trn['predicted'] = predictions
    X_trn[fraud_label] = Y_trn_save[fraud_label]
    topRows = int(round(X_trn.shape[0]*0.03))
    temp = X_trn.sort_values('predicted',ascending=False).head(topRows)
    needed = temp.loc[:,fraud_label]
    FDR3.loc[niter, 'trn'] = sum(needed)/sum(X_trn.loc[:,fraud_label])

    predictions = model.predict_proba(X_tst)[:,1]
    X_tst['predicted']=predictions
    X_tst[fraud_label] = Y_tst[fraud_label]
    topRows = int(round(X_tst.shape[0]*0.03))
    temp = X_tst.sort_values('predicted',ascending=False).head(topRows)
    needed = temp.loc[:,fraud_label]
    FDR3.loc[niter, 'tst'] = sum(needed)/sum(X_tst.loc[:,fraud_label])


    predictions = model.predict_proba(X_oot)[:,1]
    X_oot['predicted']=predictions
    X_oot[fraud_label] = Y_oot[fraud_label]
    topRows = int(round(X_oot.shape[0]*0.03))
    temp = X_oot.sort_values('predicted',ascending=False).head(topRows)
    needed = temp.loc[:,fraud_label]
    FDR3.loc[niter, 'oot'] = sum(needed)/sum(X_oot.loc[:,fraud_label])
    print(niter, FDR3.loc[niter, 'trn'],FDR3.loc[niter, 'tst'],FDR3.loc[niter, 'oot'])
    
    if FDR3.loc[niter, 'tst']>0.76 and abs(FDR3.loc[niter, 'tst'] - FDR3.loc[niter, 'trn'])<0.05:
        break
    
print("Scenario "+ str(i))
# FDR_results.append(FDR3.mean())
# print(f"*************************Scenario {i} completed***********************************************") 


train-test-split
model object created
training started
training ended
0 0.7825396825396825 0.772 0.5698324022346368
Scenario 0
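The cell above repeats the same "fraud detection rate in the top 3% of scores" computation for the train, test and OOT frames. A small helper makes that metric explicit. This is a sketch, not the notebook's own function: the name `fdr_at_cutoff` is hypothetical, and it assumes at least one fraud label is present (otherwise the division is undefined). Rows with tied scores may be ordered differently than by `sort_values`.

```python
import numpy as np

def fdr_at_cutoff(y_true, scores, cutoff=0.03):
    """Share of all frauds found in the top `cutoff` fraction of rows ranked by score."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores).ravel()
    top_rows = int(round(len(scores) * cutoff))   # same rounding as the notebook
    order = np.argsort(-scores, kind="stable")    # highest score first
    captured = y_true[order][:top_rows].sum()     # frauds inside the cutoff
    return captured / y_true.sum()

# Ten rows, two frauds; the top 20% (2 rows) contains one of them.
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
preds = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
print(fdr_at_cutoff(labels, preds, cutoff=0.2))  # 0.5
```

With such a helper, each of the three blocks reduces to one call, for example `fdr_at_cutoff(Y_tst[fraud_label], model.predict_proba(X_tst)[:, 1])`, provided `X_tst` still holds only the feature columns at that point.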


### Three results tables <a class="anchor" id="3_tables"></a>

In [66]:
#Code for the three tables
X_trn_eval = X_trn.filter([fraud_label,'predicted'])
X_tst_eval = X_tst.filter([fraud_label,'predicted'])
X_oot_eval = X_oot.filter([fraud_label,'predicted'])

cols = ['bin','#recs','#g','#b','%g','%b','tot','cg','cb','%cg','FDR','KS','FPR']
FDR_trn = pd.DataFrame(np.zeros((101, 13)), columns = cols)
FDR_tst = pd.DataFrame(np.zeros((101, 13)), columns = cols)
FDR_oot = pd.DataFrame(np.zeros((101, 13)), columns = cols)
trn_sorted = X_trn_eval.sort_values('predicted',ascending=False) # X_trn
tst_sorted = X_tst_eval.sort_values('predicted',ascending=False) # X_tst
oot_sorted = X_oot_eval.sort_values('predicted',ascending=False) # X_oot
bad_tot_trn = sum(X_trn_eval.loc[:, fraud_label])
bad_tot_tst = sum(X_tst_eval.loc[:, fraud_label])
bad_tot_oot = sum(X_oot_eval.loc[:, fraud_label])
num_tot_trn = len(X_trn_eval)
num_tot_tst = len(X_tst_eval)
num_tot_oot = len(X_oot_eval)
good_tot_trn = num_tot_trn - bad_tot_trn
good_tot_tst = num_tot_tst - bad_tot_tst
good_tot_oot = num_tot_oot - bad_tot_oot

for i in range(101):
    percent_rows_trn = int(round(X_trn_eval.shape[0]*0.01*i))
    percent_rows_tst = int(round(X_tst_eval.shape[0]*0.01*i))
    percent_rows_oot = int(round(X_oot_eval.shape[0]*0.01*i))
    temp_trn = trn_sorted.head(percent_rows_trn)
    temp_tst = tst_sorted.head(percent_rows_tst)
    temp_oot = oot_sorted.head(percent_rows_oot)
    num_bad_trn = sum(temp_trn.loc[:,fraud_label])
    num_bad_tst = sum(temp_tst.loc[:,fraud_label])
    num_bad_oot = sum(temp_oot.loc[:,fraud_label])
    num_tot_trn = len(temp_trn)
    num_tot_tst = len(temp_tst)
    num_tot_oot = len(temp_oot)
    num_good_trn = num_tot_trn - num_bad_trn
    num_good_tst = num_tot_tst - num_bad_tst
    num_good_oot = num_tot_oot - num_bad_oot
    FDR_trn.loc[i, 'bin'] = i
    FDR_trn.loc[i,'#recs'] = 0
    FDR_trn.loc[i, 'tot'] = num_tot_trn
    FDR_trn.loc[i, 'cg'] = num_good_trn
    FDR_trn.loc[i, 'cb'] = num_bad_trn
    FDR_tst.loc[i, 'bin'] = i
    FDR_tst.loc[i, 'tot'] = num_tot_tst
    FDR_tst.loc[i, 'cg'] = num_good_tst
    FDR_tst.loc[i, 'cb'] = num_bad_tst
    FDR_oot.loc[i, 'bin'] = i
    FDR_oot.loc[i, 'tot'] = num_tot_oot
    FDR_oot.loc[i, 'cg'] = num_good_oot
    FDR_oot.loc[i, 'cb'] = num_bad_oot
    if i != 0:
        FDR_trn.loc[i, '#g'] = num_good_trn - FDR_trn.loc[i-1, 'cg']
        FDR_trn.loc[i, '#b'] = num_bad_trn - FDR_trn.loc[i-1, 'cb']
        FDR_trn.loc[i,'#recs'] = FDR_trn.loc[i, '#g'] + FDR_trn.loc[i, '#b']
        FDR_trn.loc[i, '%g'] = 100* (num_good_trn - FDR_trn.loc[i-1, 'cg']) / (num_tot_trn - FDR_trn.loc[i-1, 'tot'])
        FDR_trn.loc[i, '%b'] = 100 - FDR_trn.loc[i, '%g']
        FDR_trn.loc[i, '%cg'] = 100 * num_good_trn / good_tot_trn
        FDR_trn.loc[i, 'FDR'] = 100 * num_bad_trn / bad_tot_trn
        FDR_trn.loc[i, 'KS'] = FDR_trn.loc[i, 'FDR'] - FDR_trn.loc[i, '%cg']
        FDR_trn.loc[i, 'FPR'] = num_good_trn / num_bad_trn
        FDR_tst.loc[i, '#g'] = num_good_tst - FDR_tst.loc[i-1, 'cg']
        FDR_tst.loc[i, '#b'] = num_bad_tst - FDR_tst.loc[i-1, 'cb']
        FDR_tst.loc[i,'#recs'] = FDR_tst.loc[i, '#g'] + FDR_tst.loc[i, '#b']
        FDR_tst.loc[i, '%g'] = 100* (num_good_tst - FDR_tst.loc[i-1, 'cg']) / (num_tot_tst - FDR_tst.loc[i-1, 'tot'])
        FDR_tst.loc[i, '%b'] = 100 - FDR_tst.loc[i, '%g']
        FDR_tst.loc[i, '%cg'] = 100 * num_good_tst / good_tot_tst
        FDR_tst.loc[i, 'FDR'] = 100 * num_bad_tst / bad_tot_tst
        FDR_tst.loc[i, 'KS'] = FDR_tst.loc[i, 'FDR'] - FDR_tst.loc[i, '%cg']
        FDR_tst.loc[i, 'FPR'] = num_good_tst / num_bad_tst
        FDR_oot.loc[i, '#g'] = num_good_oot - FDR_oot.loc[i-1, 'cg']
        FDR_oot.loc[i, '#b'] = num_bad_oot - FDR_oot.loc[i-1, 'cb']
        FDR_oot.loc[i,'#recs'] = FDR_oot.loc[i, '#g'] + FDR_oot.loc[i, '#b']
        FDR_oot.loc[i, '%g'] = 100* (num_good_oot - FDR_oot.loc[i-1, 'cg']) / (num_tot_oot - FDR_oot.loc[i-1, 'tot'])
        FDR_oot.loc[i, '%b'] = 100 - FDR_oot.loc[i, '%g']
        FDR_oot.loc[i, '%cg'] = 100 * num_good_oot / good_tot_oot
        FDR_oot.loc[i, 'FDR'] = 100 * num_bad_oot / bad_tot_oot
        FDR_oot.loc[i, 'KS'] = FDR_oot.loc[i, 'FDR'] - FDR_oot.loc[i, '%cg']
        FDR_oot.loc[i, 'FPR'] = num_good_oot / num_bad_oot


In [67]:
FDR_oot

| | bin | #recs | #g | #b | %g | %b | tot | cg | cb | %cg | FDR | KS | FPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 1.0 | 121.0 | 55.0 | 66.0 | 45.454545 | 54.545455 | 121.0 | 55.0 | 66.0 | 0.461448 | 36.871508 | 36.410060 | 0.833333 |
| 2 | 2.0 | 121.0 | 98.0 | 23.0 | 80.991736 | 19.008264 | 242.0 | 153.0 | 89.0 | 1.283665 | 49.720670 | 48.437006 | 1.719101 |
| 3 | 3.0 | 121.0 | 108.0 | 13.0 | 89.256198 | 10.743802 | 363.0 | 261.0 | 102.0 | 2.189781 | 56.983240 | 54.793459 | 2.558824 |
| 4 | 4.0 | 121.0 | 119.0 | 2.0 | 98.347107 | 1.652893 | 484.0 | 380.0 | 104.0 | 3.188187 | 58.100559 | 54.912372 | 3.653846 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 96 | 96.0 | 121.0 | 121.0 | 0.0 | 100.000000 | 0.000000 | 11614.0 | 11435.0 | 179.0 | 95.939257 | 100.000000 | 4.060743 | 63.882682 |
| 97 | 97.0 | 121.0 | 121.0 | 0.0 | 100.000000 | 0.000000 | 11735.0 | 11556.0 | 179.0 | 96.954442 | 100.000000 | 3.045558 | 64.558659 |
| 98 | 98.0 | 121.0 | 121.0 | 0.0 | 100.000000 | 0.000000 | 11856.0 | 11677.0 | 179.0 | 97.969628 | 100.000000 | 2.030372 | 65.234637 |
| 99 | 99.0 | 121.0 | 121.0 | 0.0 | 100.000000 | 0.000000 | 11977.0 | 11798.0 | 179.0 | 98.984814 | 100.000000 | 1.015186 | 65.910615 |
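The bin table is built row by row in a loop, but its cumulative columns (`tot`, `cg`, `cb`, `%cg`, `FDR`, `KS`) can also be derived from cumulative sums over the score-sorted labels. The sketch below is one possible vectorized version, not the notebook's implementation: the function name `cumulative_bin_table` is hypothetical, it covers only a subset of the columns, and it assumes both classes are present.

```python
import numpy as np
import pandas as pd

def cumulative_bin_table(y_true, scores):
    """Percentile-bin table (bins 0..100) of cumulative goods/bads, ranked by score."""
    scores = np.asarray(scores).ravel()
    y = np.asarray(y_true).ravel()[np.argsort(-scores, kind="stable")]
    n, bad_total = len(y), y.sum()

    # Rows covered at each percent cutoff, using the same rounding as the loop.
    tot = np.array([int(round(n * 0.01 * i)) for i in range(101)])
    cum_bad = np.concatenate(([0], np.cumsum(y)))[tot]   # bads within the first tot[i] rows
    cum_good = tot - cum_bad

    table = pd.DataFrame({"bin": np.arange(101), "tot": tot, "cg": cum_good, "cb": cum_bad})
    table["#g"] = table["cg"].diff().fillna(0)           # goods falling in this bin only
    table["#b"] = table["cb"].diff().fillna(0)           # bads falling in this bin only
    table["#recs"] = table["#g"] + table["#b"]
    table["%cg"] = 100 * table["cg"] / (n - bad_total)
    table["FDR"] = 100 * table["cb"] / bad_total
    table["KS"] = table["FDR"] - table["%cg"]
    return table

# 200 rows with the 10 frauds given the highest scores.
demo_scores = np.linspace(1, 0, 200)
demo_labels = np.array([1] * 10 + [0] * 190)
demo_table = cumulative_bin_table(demo_labels, demo_scores)
print(demo_table.loc[1, "FDR"])  # 20.0
```

In the OOT table above, the same formula gives the bin 3 value: 102 of 179 frauds captured, so `FDR` is 100 * 102 / 179 ≈ 56.98.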


### Applying on OOT <a class="anchor" id="oot"></a>

In [68]:
i = 0
n = criterionDict['n_variables'][i]
X_oot2 = X_oot.filter(filterCols[0:n])
X_trntst = X_trntst.filter(filterCols[0:n])
# Swap in a different estimator here if the best model is not a random forest
model = RandomForestClassifier(criterion = criterionDict['criterion'][i],
                               max_depth = criterionDict['max_depth'][i],
                               max_features = criterionDict['max_features'][i],
                               min_samples_split = criterionDict['min_samples_split'][i],
                               min_samples_leaf = criterionDict['min_samples_leaf'][i],
                               bootstrap = criterionDict['bootstrap'][i],
                               n_estimators = criterionDict['n_estimators'][i])

# filterCols = 

model.fit(X_trntst, Y_trntst.values.ravel())   
print("training ended")
predictions = model.predict_proba(X_oot2)[:,1]
X_oot2['predicted'] = predictions
X_oot2[fraud_label] = Y_oot[fraud_label]
topRows = int(round(X_oot2.shape[0]*0.03))
temp = X_oot2.sort_values('predicted',ascending=False).head(topRows)
needed = temp.loc[:,fraud_label]
# FDR3.loc[niter, 'oot'] = 
print(sum(needed)/sum(X_oot2.loc[:,fraud_label]))

training ended
0.5642458100558659


In [69]:
importances = model.feature_importances_
feature_names = list(X_trntst.columns)
pd.Series(importances, index=feature_names).sort_values(ascending = False).to_csv("Feature_importance"+"_fv_"+str(versionFilterCols)+".csv")

In [70]:
#OOT table

X_oot_eval = X_oot2.filter([fraud_label,'predicted'])

cols = ['bin','#recs','#g','#b','%g','%b','tot','cg','cb','%cg','FDR','KS','FPR']

FDR_oot = pd.DataFrame(np.zeros((101, 13)), columns = cols)

oot_sorted = X_oot_eval.sort_values('predicted',ascending=False) # X_oot

bad_tot_oot = sum(X_oot_eval.loc[:, fraud_label])

num_tot_oot = len(X_oot_eval)

good_tot_oot = num_tot_oot - bad_tot_oot

for i in range(101):

    percent_rows_oot = int(round(X_oot_eval.shape[0]*0.01*i))

    temp_oot = oot_sorted.head(percent_rows_oot)

    num_bad_oot = sum(temp_oot.loc[:,fraud_label])

    num_tot_oot = len(temp_oot)

    num_good_oot = num_tot_oot - num_bad_oot

    FDR_oot.loc[i, 'bin'] = i
    FDR_oot.loc[i, 'tot'] = num_tot_oot
    FDR_oot.loc[i, 'cg'] = num_good_oot
    FDR_oot.loc[i, 'cb'] = num_bad_oot
    if i != 0:
        FDR_oot.loc[i, '#g'] = num_good_oot - FDR_oot.loc[i-1, 'cg']
        FDR_oot.loc[i, '#b'] = num_bad_oot - FDR_oot.loc[i-1, 'cb']
        FDR_oot.loc[i,'#recs'] = FDR_oot.loc[i, '#g'] + FDR_oot.loc[i, '#b']
        FDR_oot.loc[i, '%g'] = 100* (num_good_oot - FDR_oot.loc[i-1, 'cg']) / (num_tot_oot - FDR_oot.loc[i-1, 'tot'])
        FDR_oot.loc[i, '%b'] = 100 - FDR_oot.loc[i, '%g']
        FDR_oot.loc[i, '%cg'] = 100 * num_good_oot / good_tot_oot
        FDR_oot.loc[i, 'FDR'] = 100 * num_bad_oot / bad_tot_oot
        FDR_oot.loc[i, 'KS'] = FDR_oot.loc[i, 'FDR'] - FDR_oot.loc[i, '%cg']
        FDR_oot.loc[i, 'FPR'] = num_good_oot / num_bad_oot


In [71]:
X_oot2.to_csv("X_oot2_fv_"+str(versionFilterCols)+".csv")
X_trn.to_csv("X_trn_fv_"+str(versionFilterCols)+".csv")
X_tst.to_csv("X_tst_fv_"+str(versionFilterCols)+".csv")

In [72]:
len(X_trn)

59009

In [73]:
FDR_oot.to_csv("FDR_oot_fv_"+str(versionFilterCols)+".csv")

In [74]:
with pd.ExcelWriter("Results_project3_fv_"+str(versionFilterCols)+".xlsx") as writer:  
        FDR_trn.to_excel(writer, sheet_name='train')
        FDR_tst.to_excel(writer, sheet_name='test')
        FDR_oot.to_excel(writer, sheet_name='oot')
        

In [None]:
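# Deliberately fails so that "Run All" stops here; the cells below appear to be optional and were not executed
assert 1==2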

### For fraud score dynamics charts

In [None]:
X['Recnum'] = data['Recnum']
index_rec_map = X['Recnum']
# rec_card_map is only loaded in a later cell, so it is not referenced here

In [None]:
X_trn_copy = X_trn.copy()

In [None]:
X_trn_copy['Recnum'] = X_trn_copy.index.map(index_rec_map)

In [None]:
rec_card_map = pd.read_csv("card transactions.csv")[['Recnum','Cardnum','Merchnum','Date']]

In [None]:
X_trn_copy.merge(rec_card_map,on = 'Recnum',how = 'left').to_csv("X_trn_w_num.csv")

In [None]:
X_tst_copy = X_tst.copy()

In [None]:
X_tst_copy['Recnum'] = X_tst_copy.index.map(index_rec_map)

In [None]:
X_tst_copy.merge(rec_card_map,on = 'Recnum',how = 'left').to_csv("X_tst_w_num.csv")

In [None]:
X_oot2_copy = X_oot2.copy()

In [None]:
X_oot2_copy['Recnum'] = X_oot2_copy.index.map(index_rec_map)

In [None]:
X_oot2_copy.merge(rec_card_map,on = 'Recnum',how = 'left').to_csv("X_oot2_w_num.csv")

In [None]:
# X_trn_eval = X_trn.filter(['record','fraud_label','predicted'])
# X_tst_eval = X_tst.filter(['record','fraud_label','predicted'])
# X_oot_eval = X_oot.filter(['record','fraud_label','predicted'])

# X_trn_eval['bin'] = pd.cut(X_trn_eval['predicted'],101)
# X_tst_eval['bin'] = pd.cut(X_trn_eval['predicted'],101)
# X_oot_eval['bin'] = pd.cut(X_trn_eval['predicted'],101)

# FDR_trn = X_trn_eval.groupby('bin').agg({'predicted':'count','fraud_label':np.sum}).rename(columns = {'predicted':'#Records','fraud_label':'#Bads'})
# FDR_trn['%Bads'] = FDR_trn['#Bads']/FDR_trn['#Records']
# FDR_trn['#Goods'] = FDR_trn['#Records'] - FDR_trn['#Bads']
# FDR_trn['%Goods'] = FDR_trn['#Goods']/FDR_trn['#Records']

# FDR_trn['Cumulative_Bads'] = np.cumsum(FDR_trn['#Bads'])
# FDR_trn['Cumulative_Goods'] = np.cumsum(FDR_trn['#Goods'])
# FDR_trn['Cumulative_%Goods'] = FDR_trn['Cumulative_Goods'] / len(X_trn_eval[X_trn_eval['fraud_label']==0])
# FDR_trn['Cumulative_%Bads'] = FDR_trn['Cumulative_Bads'] / len(X_trn_eval[X_trn_eval['fraud_label']==1]) #check
# FDR_trn['KS'] = FDR_trn['Cumulative_%Bads'] - FDR_trn['Cumulative_%Goods'] #Cumulative_%Bads = FDR
# FDR_trn['FPR'] = FDR_trn['Cumulative_Goods'] / FDR_trn['Cumulative_Bads']


