# IEEE-CIS Fraud Detection Study

## Introduction
Credit and debit cards play an important role in modern life and make everyday payments much easier. However, they are vulnerable to fraud.
A robust fraud prevention system can save consumers millions of dollars per year. With a larger dataset and better algorithms, such a system can be improved.

The task here is to benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results.

If successful, we’ll improve the efficiency of fraudulent transaction alerts for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud loss and increase their revenue.

## Load Data
The data can be downloaded from the [Kaggle competition page](https://www.kaggle.com/c/ieee-fraud-detection/data). The notebook below assumes that the data files are in the *data/* folder under the current folder.

### Load modules

In [1]:
# personal modules
from read import read_data, count_missing
from model import train_model_dnn, load_model_dnn

# standard modules
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
import datetime

Using TensorFlow backend.


### Read training and testing data
Some data fields, e.g. *email domain*, contain *string* values. Most machine learning tools cannot use these directly, so the fields have to be converted to numbers first.

The first time the training data was read, the string fields were converted to integers and the result was saved to a separate csv file. Later runs read the converted csv file to save time.

We also record how long it takes to read the data. The first read takes longer; later reads are much shorter.
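
The convert-once-then-reuse idea can be sketched as follows. This is only an illustration: `read_with_cache` and its arguments are hypothetical names, and the actual `read_data` in the personal `read` module is not shown in this notebook and may work differently. The `_i` suffix mirrors the converted column names that appear later (e.g. `DeviceType_i`); the use of `pd.factorize` for the encoding is an assumption.

```python
import os
import pandas as pd
from pandas.api.types import is_numeric_dtype

def read_with_cache(raw_path, cache_path):
    """Hypothetical helper: convert string columns once, then reuse the result."""
    if os.path.exists(cache_path):
        # later runs: the converted file already exists, so just read it
        return pd.read_csv(cache_path)
    df = pd.read_csv(raw_path)
    for col in list(df.columns):
        if not is_numeric_dtype(df[col]):
            # one integer code per distinct string; missing values become -1
            codes, _ = pd.factorize(df[col])
            df[col + "_i"] = codes
            df = df.drop(columns=col)
    # first run: keep the converted table for later runs
    df.to_csv(cache_path, index=False)
    return df
```

Only the first call pays the conversion cost; every later call with the same `cache_path` reads the already converted file.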

In [2]:
cdt0 = datetime.datetime.now()
print (cdt0.strftime("%Y-%m-%d %H:%M:%S"))

train = read_data("train", converted=True)
test = read_data("test", converted=True)
print("Size of training data: %d, testing data: %d" %(len(train), len(test)) )                                        

time_dif_1 = datetime.datetime.now() - cdt0

print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )


2019-10-28 20:03:47
Reading: data/train.csv
Reading: data/test.csv
Size of training data: 590540, testing data: 506691
Time passed:  0 Minutes 52 Seconds.


## Data Cleaning and Normalization
The data is cleaned in three steps:
* Replace NaN with a value just below the column minimum
* Replace outliers with an upper/lower bound
* Normalize each variable to zero mean and unit standard deviation
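
The three steps can be sketched on a toy column as shown here. This is a simplified illustration, not the notebook's exact code: `fit_cleaning` and `apply_cleaning` are hypothetical names, only the float fill rule (minimum minus 0.1 standard deviations) is used, and the column is assumed to have a non-zero standard deviation. The parameters are derived from the training column only and then reused unchanged for the testing column.

```python
import numpy as np
import pandas as pd

def fit_cleaning(col):
    """Derive fill value, mean and std from a training column."""
    fill = col.min() - 0.1 * col.std()   # float rule only (simplification)
    filled = col.fillna(fill)
    return {"fill": fill, "mean": filled.mean(), "std": filled.std()}

def apply_cleaning(col, par, nsig=10.0):
    """Apply stored parameters: fill NaN, bound outliers, standardize."""
    out = col.fillna(par["fill"])
    upper = par["mean"] + nsig * par["std"]
    lower = par["mean"] - nsig * par["std"]
    # values beyond +-nsig std are moved to +-(nsig + 0.5) std
    out = out.mask(out > upper, par["mean"] + (nsig + 0.5) * par["std"])
    out = out.mask(out < lower, par["mean"] - (nsig + 0.5) * par["std"])
    return (out - par["mean"]) / par["std"]

train_col = pd.Series([1.0, 2.0, np.nan, 3.0])
test_col = pd.Series([np.nan, 1000.0])

par = fit_cleaning(train_col)                 # parameters from training data only
train_clean = apply_cleaning(train_col, par)
test_clean = apply_cleaning(test_col, par)    # same parameters reused for testing
```

With this scheme the extreme testing value `1000.0` ends up at 10.5 after standardization, which matches the `max` of 10.5 seen for `TransactionAmt` in the `describe()` outputs.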

### Training data normalization
The normalization parameters are calculated from the training data and kept, so that the same parameters can be applied to the testing data.

In [3]:
# Fill NaN values with:
#  minimum - 1 if it is int
#  minimum - 0.1 * std if it is float
cdt0 = datetime.datetime.now()

# keep the parameters used in calculation
paradict = {}
for icol, colname in enumerate(train.columns):
    if colname == 'isFraud':
        continue
    if icol %50 == 0:
        print(icol, "out of", len(train.columns), "column: ", colname)
    # initialize 5 parameters
    paradict[colname] = [0] * 5
    unq = train[colname].unique()
    isint_col = False
    if train[colname].dtype == np.int64 or ( (train[colname].dtype == np.float64) and (np.ceil( unq ) == unq).all() ):
        isint_col = True
    if train[colname].dtype == np.int64:
        train[colname] = train[colname].astype('float64')
    
    paradict[colname][0] = isint_col
    
    vmin = train[colname].min()
    vstd = train[colname].std()
    paradict[colname][1] = vmin
    paradict[colname][2] = vstd
    
    train[colname] = train[colname].fillna(vmin - 1. if isint_col else vmin - 0.1 * vstd)
    
    vmean = train[colname].mean()
    vstd = train[colname].std()
    paradict[colname][3] = vmean
    paradict[colname][4] = vstd
    
    train[colname] = train[colname].mask( train[colname] > vmean + 10 * vstd, vmean + 10.5 * vstd)
    train[colname] = train[colname].mask( train[colname] < vmean - 10 * vstd, vmean - 10.5 * vstd)
    if vstd < 1.e-9:
        print("std of", colname, " is 0???!!!")
    else:
        train[colname] = (train[colname] - vmean) / vstd

parapd = pd.DataFrame.from_dict(paradict, orient='index', columns=['isint', 'min0', 'std0', 'mean1', 'std1'])
print(parapd.head(5))
time_dif_1 = datetime.datetime.now() - cdt0
print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )


50 out of 433 column:  V11
100 out of 433 column:  V61
150 out of 433 column:  V111
200 out of 433 column:  V161
250 out of 433 column:  V211
300 out of 433 column:  V261
350 out of 433 column:  V311
400 out of 433 column:  id_26
                isint       min0          std0         mean1          std1
TransactionDT    True  86400.000  4.617224e+06  7.372311e+06  4.617224e+06
TransactionAmt  False      0.251  2.391625e+02  1.350272e+02  2.391625e+02
card1            True   1000.000  4.901170e+03  9.898735e+03  4.901170e+03
card2           False    100.000  1.577932e+02  3.583452e+02  1.602380e+02
card3           False    100.000  1.133644e+01  1.530509e+02  1.166086e+01
Time passed:  0 Minutes 20 Seconds.


In [4]:
print(train.describe())

             isFraud  TransactionDT  TransactionAmt         card1  \
count  590540.000000   5.905400e+05   590540.000000  5.905400e+05   
mean        0.034990  -4.620321e-17       -0.003281  1.400535e-16   
std         0.183755   1.000000e+00        0.930382  1.000000e+00   
min         0.000000  -1.577985e+00       -0.563534 -1.815635e+00   
25%         0.000000  -9.410966e-01       -0.383447 -7.915935e-01   
50%         0.000000  -1.424748e-02       -0.277042 -4.503713e-02   
75%         0.000000   8.390992e-01       -0.041926  8.743352e-01   
max         1.000000   1.827683e+00       10.500000  1.733722e+00   

              card2         card3         card5         addr1         addr2  \
count  5.905400e+05  5.905400e+05  5.905400e+05  5.905400e+05  5.905400e+05   
mean   4.922266e-16  3.908960e-15 -1.854866e-16  1.535583e-15  2.168085e-15   
std    1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00   
min   -1.710733e+00 -4.646705e+00 -2.443264e+00 -1.554649e+00 

### Normalization of testing data
Application of normalization using the parameters calculated above.

In [5]:
# standardize the testing data with the parameters from the training data
cdt0 = datetime.datetime.now()

for colname in test.columns:
    if colname == 'isFraud':
        continue
    visint = parapd.loc[colname, 'isint']
    vmin0 = parapd.loc[colname, 'min0']
    vstd0 = parapd.loc[colname, 'std0']
    test[colname] = test[colname].fillna(vmin0 - 1. if visint else vmin0 - 0.1 * vstd0)
    
    vmean1 = parapd.loc[colname, 'mean1']
    vstd1 = parapd.loc[colname, 'std1']
    
    test[colname] = test[colname].mask( test[colname] > vmean1 + 10 * vstd1, vmean1 + 10.5 * vstd1)
    test[colname] = test[colname].mask( test[colname] < vmean1 - 10 * vstd1, vmean1 - 10.5 * vstd1)
    if vstd1 < 1.e-9:
        print("std of", colname, "is ~0 in the training parameters; column left unscaled.")
    else:
        test[colname] = (test[colname] - vmean1) / vstd1

time_dif_1 = datetime.datetime.now() - cdt0
print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )


Time passed:  0 Minutes  9 Seconds.


In [6]:
print(test.describe())

       TransactionDT  TransactionAmt          card1          card2  \
count  506691.000000   506691.000000  506691.000000  506691.000000   
mean        4.235798       -0.006845       0.011933       0.003846   
std         1.030166        0.942928       0.996693       1.007515   
min         2.389079       -0.564508      -1.815431      -1.710733   
25%         3.335171       -0.397333      -0.791594      -0.969465   
50%         4.295297       -0.280467      -0.019533       0.016568   
75%         5.192785       -0.041926       0.893106       0.958916   
max         5.813458       10.500000       1.733926       1.508099   

               card3          card5          addr1          addr2  \
count  506691.000000  506691.000000  506691.000000  506691.000000   
mean        0.014451       0.016516      -0.023440      -0.060388   
std         1.123117       0.989144       1.017935       1.066948   
min        -4.646705      -2.443264      -1.554649      -2.810880   
25%        -0.261640    

#### Information about training and testing data
* Calculate the fraction of fraud transactions.
* Count the number of variables, etc.

In [7]:
frate = 100. * train["isFraud"].sum() /  len(train)
print("Fraud rate: %d out of %d (%4.2f%%)" % 
      (train["isFraud"].sum(), len(train), frate) )

Fraud rate: 20663 out of 590540 (3.50%)


In [8]:
print("Number of columns in training:%d, testing:%d" %( len(train.columns), len(test.columns)) )
print(train.columns)

Number of columns in training:433, testing:432
Index(['isFraud', 'TransactionDT', 'TransactionAmt', 'card1', 'card2', 'card3',
       'card5', 'addr1', 'addr2', 'dist1',
       ...
       'id_30_i', 'id_31_i', 'id_33_i', 'id_34_i', 'id_35_i', 'id_36_i',
       'id_37_i', 'id_38_i', 'DeviceType_i', 'DeviceInfo_i'],
      dtype='object', length=433)


In [9]:
print(test.columns)

Index(['TransactionDT', 'TransactionAmt', 'card1', 'card2', 'card3', 'card5',
       'addr1', 'addr2', 'dist1', 'dist2',
       ...
       'id_30_i', 'id_31_i', 'id_33_i', 'id_34_i', 'id_35_i', 'id_36_i',
       'id_37_i', 'id_38_i', 'DeviceType_i', 'DeviceInfo_i'],
      dtype='object', length=432)


## Data Assessment
What can we learn from the training and testing data before putting them into models?

### Variable separation powers
Separation power quantifies how different the distributions of a given variable are between two categories of data. It is used in two situations:
* Fraud vs. Non-Fraud: it shows which variables are more powerful in identifying fraudulent transactions.
* Training vs. Testing: it shows which variables are inconsistent between training and testing and should therefore be avoided.
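
For reference, the TMVA Users Guide cited in the code of the next cell defines the separation of a variable $y$ in terms of the two normalized densities. The binned sum written underneath is how the function in this notebook appears to approximate it, with $p_i$ and $q_i$ (symbols introduced here for illustration) denoting the fraction of each sample falling into bin $i$.

```latex
\langle S^2 \rangle
  = \frac{1}{2} \int
    \frac{\left(\hat{y}_S(y) - \hat{y}_B(y)\right)^2}
         {\hat{y}_S(y) + \hat{y}_B(y)} \, dy
\qquad\qquad
S_{\text{notebook}}
  = \sum_{i} \frac{(p_i - q_i)^2}{p_i + q_i}
```

As written, the function does not apply the $\frac{1}{2}$ prefactor, so its values range from 0 (identical binned distributions) to 2 (no overlap) rather than from 0 to 1.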

In [11]:
def separation_power(v1 = None, v2 = None, nbins = 10):
    """
    Calculate the separation power between two distributions of one variable.
    
    Parameters:
    -----------
    v1: pd.Series
        one distribution of the variable
    v2: pd.Series
        the other distribution of the variable
    nbins: int
        sets the bin width to (max - min) / nbins, where min and max are
        taken over both distributions; nbins + 1 bins are used, shifted
        by half a bin (default: 10, minimum: 2).
    Returns:
    --------
    float: separation power value, from 0 (identical binned
        distributions) to 2 (no overlap)
    """
    
    if v1 is None or v2 is None:
        print("Cannot calculate separation power if either variable is None. Return 0.")
        return 0.0
    if len(v1) == 0 or len(v2) == 0:
        print("Length of v1:", len(v1), "v2:", len(v2), " No separation power!")
        return 0.0
    
    if nbins < 2:
        nbins = 2
    # separation defined p25: https://root.cern.ch/download/doc/tmva/TMVAUsersGuide.pdf
    vmin = min(v1.min(), v2.min())
    vmax = max(v1.max(), v2.max())
    binsize = (vmax-vmin)/nbins
    ssqr = 0.
    # vrng = [ vmin + (idx - 0.5) * binsize for idx in range(nbins+2)]
    # count 
    for idx in range(nbins+1):
        imin = vmin + (idx - 0.5) * binsize
        b1 = ((v1 >= imin) & (v1 < imin+binsize)).sum()
        b1 = float(b1) / len(v1)
        b2 = ((v2 >= imin) & (v2 < imin+binsize)).sum()
        b2 = float(b2) / len(v2)
        if b1 ==0 and b2==0: continue
        ssqr = ssqr + (b1 - b2) * (b1 - b2) / (b1 + b2)

    return ssqr
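
The per-bin loop above can also be written with `np.histogram`. The sketch below is an assumed equivalent, not part of the original notebook: `separation_power_hist` is a hypothetical name, it uses the same bin edges (shifted by half a bin), and it assumes the inputs contain no NaN, which holds after the cleaning step.

```python
import numpy as np

def separation_power_hist(v1, v2, nbins=10):
    """Vectorized sketch of the binned separation (no 1/2 prefactor)."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    if v1.size == 0 or v2.size == 0:
        return 0.0
    nbins = max(nbins, 2)
    vmin = min(v1.min(), v2.min())
    vmax = max(v1.max(), v2.max())
    if vmax == vmin:
        # all values identical: nothing to separate
        return 0.0
    binsize = (vmax - vmin) / nbins
    # nbins + 1 bins, from vmin - binsize/2 to vmax + binsize/2
    edges = vmin + (np.arange(nbins + 2) - 0.5) * binsize
    b1 = np.histogram(v1, bins=edges)[0] / v1.size
    b2 = np.histogram(v2, bins=edges)[0] / v2.size
    filled = (b1 + b2) > 0   # skip bins that are empty in both samples
    diff = b1[filled] - b2[filled]
    return float(np.sum(diff ** 2 / (b1[filled] + b2[filled])))
```

Two identical samples give 0, and two samples with no shared bin give 2, which matches the range of the loop version.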

#### Training data separation Fraud vs Non-Fraud
The larger this separation is, the better the variable can distinguish fraud from non-fraud transactions.

In [18]:
from pandas.api.types import is_numeric_dtype
cdt0 = datetime.datetime.now()

sepdict_train = {}
for col, content in train.items():
    if col == "isFraud": continue
    if not is_numeric_dtype(content):
        continue
    var = train[[col, "isFraud"]]
    v1 = var[ var["isFraud"] >0 ]
    v2 = var[ var["isFraud"] <1 ]
    ssqr = separation_power(v1[col], v2[col])
    sepdict_train[col] = [ssqr]
    if len(sepdict_train) % 20 == 0:
        print(len(sepdict_train),"name", col, "separation", ssqr)
separationpd = pd.DataFrame.from_dict(sepdict_train, orient='index', columns=['separation_train_isfraud'])
separationpd.sort_values(by=['separation_train_isfraud'], ascending=False, inplace=True)

time_dif_1 = datetime.datetime.now() - cdt0
print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )


20 name C10 separation 0.02459643964997889
40 name V1 separation 0.11851778379106581
60 name V21 separation 0.17521370663654176
80 name V41 separation 0.00018675687269592778
100 name V61 separation 0.03200563622543222
120 name V81 separation 0.18494041279756585
140 name V101 separation 0.0023543558683684174
160 name V121 separation 0.0013851940443661346
180 name V141 separation 0.008229288922090072
200 name V161 separation 0.009128074885787285
220 name V181 separation 0.05751830531501733
240 name V201 separation 0.28437534198663833
260 name V221 separation 0.14890481804708203
280 name V241 separation 0.17461248129504284
300 name V261 separation 0.1873045740141796
320 name V281 separation 0.0436589771015734
340 name V301 separation 0.011775543654432891
360 name V321 separation 0.010090501676662016
380 name id_02 separation 0.20342784582085427
400 name id_26 separation 0.006384034415852924
420 name id_27_i separation 0.005299152905254094
Time passed:  0 Minutes 44 Seconds.


In [28]:
print("The top 20 best separation variables.")
print(separationpd.head(20))
print("\n")
print("The bottom 20 best separation variables.")
print(separationpd.tail(20))

The top 20 best separation variables.
             separation_train_isfraud
V258                         0.287977
V199                         0.287482
V201                         0.284375
V190                         0.278176
V257                         0.277462
V200                         0.276125
V246                         0.266459
V186                         0.262299
V189                         0.261477
V243                         0.259026
V170                         0.254778
ProductCD_i                  0.254646
V176                         0.253911
V188                         0.253672
V230                         0.252413
V244                         0.251987
V242                         0.249132
id_35_i                      0.248911
id_17                        0.247678
V171                         0.242484


The bottom 20 best separation variables.
       separation_train_isfraud
V120               2.447490e-03
V173               2.420176e-03
V122               2.3816

#### Training data and testing data separation
The smaller this separation is, the smaller the difference between the training and testing data for the variable. Variables with a high separation between training and testing data should not be used in this study, because they would bias the results.

In [21]:
cdt0 = datetime.datetime.now()

sepdict_test = {}
for col, content in test.items():
    if not is_numeric_dtype(content):
        continue
    ssqr = separation_power(train[ col ], test[ col ])
    sepdict_test[col] = [ssqr]
septestpd = pd.DataFrame.from_dict(sepdict_test, orient='index', columns=['separation_train_test'])
septestpd.sort_values(by=['separation_train_test'], ascending=False, inplace=True)

print("\nRanking the variables by train vs test separation power:")
print(septestpd.head(20))


time_dif_1 = datetime.datetime.now() - cdt0
print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )



Ranking the variables by train vs test separation power:
               separation_train_test
TransactionDT               1.954791
card4_i                     1.240270
M7_i                        0.724378
M2_i                        0.541231
M4_i                        0.511028
M9_i                        0.498900
M5_i                        0.407274
M8_i                        0.308927
M3_i                        0.286788
card6_i                     0.255866
D15                         0.215991
id_13                       0.202369
D4                          0.129962
ProductCD_i                 0.116852
V78                         0.102801
V77                         0.102467
V87                         0.101863
V86                         0.101834
V88                         0.101656
V66                         0.099123
Time passed:  1 Minutes  2 Seconds.


#### Combine two types of separations

In [26]:
print("Combining the two separation values!")
sepcombpd = pd.concat([separationpd, septestpd], sort=False, axis=1)
print(sepcombpd.head())
sepcombpd.sort_values(by=["separation_train_test"], ascending=False, inplace=True)
print("After sorting")
print(sepcombpd.head())

Combining the two separation values!
      separation_train_isfraud  separation_train_test
V258                  0.287977               0.002569
V199                  0.287482               0.003077
V201                  0.284375               0.002248
V190                  0.278176               0.003110
V257                  0.277462               0.002514
After sorting
               separation_train_isfraud  separation_train_test
TransactionDT                  0.020730               1.954791
card4_i                        0.005974               1.240270
M7_i                           0.079873               0.724378
M2_i                           0.130560               0.541231
M4_i                           0.208226               0.511028


#### Selection of variables based on the two types of separations

In [42]:
# only keep variables that have low separation between train and test
print("Number of variables before selection: %d"%len(sepcombpd))
print("Selecting variables with low training vs. testing separation to reduce bias.")
sepsel01 = sepcombpd[ sepcombpd["separation_train_test"] < 0.2 ]
print("Number of variables after keeping only sep(train, test) < 0.2: %d"%(len(sepsel01)))
# NOTE: this cut is applied to sepcombpd, not to sepsel01, so the
# train vs. test cut above is not carried over into separationpd_ok
separationpd_ok = sepcombpd[ sepcombpd["separation_train_isfraud"] >0.01 ]
print("Number of variables after keeping only sep(train fraud, non-fraud)>0.01: %d"%(len(separationpd_ok)))

Number of variables before selection: 432
Selecting variables with low training vs. testing separation to reduce bias.
Number of variables after keeping only sep(train, test) < 0.2: 420
Number of variables after keeping only sep(train fraud, non-fraud)>0.01: 336
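
In the cell above, both cuts are applied to `sepcombpd` separately, so the 336 variables kept by the fraud vs. non-fraud cut can still include variables with a high train vs. test separation, such as `TransactionDT`. If the intent is to require both conditions at once, a minimal sketch could look like this. `select_variables` is a hypothetical helper, and the toy table reuses four rows from the outputs printed above.

```python
import pandas as pd

def select_variables(sepcomb, max_train_test=0.2, min_fraud=0.01):
    """Keep variables passing both separation cuts (hypothetical helper)."""
    keep = (
        (sepcomb["separation_train_test"] < max_train_test)
        & (sepcomb["separation_train_isfraud"] > min_fraud)
    )
    return sepcomb.index[keep].to_numpy()

# four rows copied from the printed tables above
toy = pd.DataFrame(
    {
        "separation_train_isfraud": [0.287977, 0.020730, 0.005974, 0.208226],
        "separation_train_test": [0.002569, 1.954791, 1.240270, 0.511028],
    },
    index=["V258", "TransactionDT", "card4_i", "M4_i"],
)
selected = select_variables(toy)
print(selected)
```

On these four rows only `V258` passes both cuts; `TransactionDT` and `M4_i` pass the fraud cut alone but fail the train vs. test cut.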


In [29]:
import matplotlib.pyplot as plt
#### %matplotlib notebook
#
# make a variable plot by comparing isFraud = 0 to 1.
#
def isFraud_compare(data, categ, varname, xlim = (1.0, -1.0), data_test = None):
    var = data[[varname,"isFraud"]].dropna()
    if xlim[0] < xlim[1]:
        vmin, vmax = xlim[0], xlim[1]
    else:
        vmin, vmax = var[varname].min(), var[varname].max()
        if data_test is not None:
            vmin = min(vmin, data_test[varname].min())
            vmax = max(vmax, data_test[varname].max())
    binsize = (vmax-vmin)/9.
    vbin = [vmin+(i-0.5)*binsize for i in range(11)]
    v1 = var[ var["isFraud"] >0 ]
    v2 = var[ var["isFraud"] <1 ]
    if len(v1) == 0 or len(v2) == 0:
        print("Length of var", varname, "with fraud:", len(v1), "without fraud:", len(v2), " Skip!")
        return None
    
    figx = plt.figure(figsize=(5,5))
    plt.hist(v1[varname], weights=[1./len(v1)]*len(v1), bins = vbin, #10, #range=(vmin, vmax), 
             alpha=0.85, color='grey', 
             label='Is Fraud: %d (%3.1f%%)'%(len(v1),100*len(v1)*1./(len(v1)+len(v2))) )
    plt.hist(v2[varname], weights=[1./len(v2)]*len(v2), rwidth = 0.75, hatch='/', bins = vbin, #10,
             alpha=0.75, color='darksalmon', 
             label='Non-Fraud: %d (%3.1f%%)'%(len(v2),100*len(v2)*1./(len(v1)+len(v2))) )
    if data_test is not None:
        plt.hist(data_test[varname], weights=[1./len(data_test)]*len(data_test), rwidth = 0.5, 
                 hatch='.', bins = vbin, 
                 alpha=0.65, color='dodgerblue', 
                 label='Test: %d'%(len(data_test)) )
    pltname = "plot/"+categ+"_"+varname+"_sb.png"
    x1,x2,y1,y2 = plt.axis()
    plt.ylim(y1, y2*1.5)
    if xlim[0] < xlim[1]:
        plt.xlim(xlim[0] -0.5 * binsize, xlim[1] +0.5 * binsize)
    plt.xlabel("Value of: "+varname+" ("+categ.replace("_", " ")+")")
    plt.ylabel("Normalized")
    plt.legend(loc='upper left')
    plt.title("Density of "+varname)
    #plt.show()
    plt.savefig(pltname)
    plt.clf()
    plt.close()

In [34]:
cdt0 = datetime.datetime.now()

colname = separationpd.index.values
print("number of variables:", len(colname))
for idx,col in enumerate(colname):
    if idx % 50 == 0:
        print(idx, "column: ", col)
    name = "sep%3d_" % idx
    name = name.replace(" ", "0")
    isFraud_compare(train, name, col)
    
    # use break to draw only one plot, remove for more
    #break
    
time_dif_1 = datetime.datetime.now() - cdt0
print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )


number of variables: 432
0 column:  V258
50 column:  id_11
100 column:  V74
150 column:  V255
200 column:  V123
250 column:  V214
300 column:  V284
350 column:  V212
400 column:  V332
Time passed:  1 Minutes 48 Seconds.


In [37]:
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
# random forest model used as the base estimator
from sklearn.ensemble import RandomForestClassifier
# scikit-learn metrics module for the average precision calculation
from sklearn import metrics

def train_model_rfbag( data, ntree = 100, use_prob = True, use_cols = None):
    cols = []
    for col, content in data.items():
        if col == "isFraud": continue
        if not is_numeric_dtype(content):
            continue
        if use_cols is not None and col not in use_cols:
            continue
        cols.append( col )

    trnX = data[ cols ] # Features
    trnX = trnX.fillna( -1.0 )
    trnY = data['isFraud']  # Labels
    # Split dataset into a training set (95%) and a held-out test set (5%)
    X_train, X_test, Y_train, Y_test = train_test_split(trnX, trnY, test_size=0.05)
    print("Data points: ",len(trnX), len(X_train), len(X_test))
    # Create a bagging ensemble of random forest classifiers
    clf= BaggingClassifier(RandomForestClassifier(n_estimators = ntree))
    print("Model registered.")
    # Train the model using the training set
    clf.fit(X_train, Y_train)

    print("Model trained.")
    if use_prob:
        Y_pred_train=clf.predict_proba(X_train)
        Y_pred_test=clf.predict_proba(X_test)
    else: 
        Y_pred_train=clf.predict(X_train)
        Y_pred_test=clf.predict(X_test)
    return (cols, clf, Y_train, Y_pred_train, Y_test, Y_pred_test)

In [38]:
#
# use the variables with fraud vs. non-fraud separation above 0.01
#
# Index.get_values() was removed in pandas 1.0; to_numpy() returns the same array
cols_select = separationpd_ok.index.to_numpy()
print(type(cols_select), len(cols_select))


<class 'numpy.ndarray'> 336


In [52]:
cdt0 = datetime.datetime.now()

res = train_model_rfbag(train, ntree = 500, use_prob = True, use_cols = cols_select)
cols_prob, model_prob, Ytrain, Ytrain_prd, Ytest, Ytest_prd = res

time_dif_1 = datetime.datetime.now() - cdt0
print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )

Data points:  590540 561013 29527
Model registered.
Model trained.
Time passed: 237 Minutes 47 Seconds.


In [53]:
# print the average precision score on the training data
Ytrain_prd=Ytrain_prd[:,1]
print("Training precision accuracy:", metrics.average_precision_score(Ytrain, Ytrain_prd))


Training precision accuracy: 0.9962975647686902


Ytest_prd=Ytest_prd[:,1]
print("Testing precision accuracy:", metrics.average_precision_score(Ytest, Ytest_prd))

## Applying the Model to the Testing Data

In [49]:
import matplotlib.pyplot as plt
#### %matplotlib notebook
#
# make a variable plot by comparing training to testing data
#
def train_test_compare(data_train, data_test, categ, varname, xlim = (1.0, -1.0)):
    if xlim[0] < xlim[1]:
        vmin, vmax = xlim[0], xlim[1]
    else:
        vmin, vmax = data_train[varname].min(), data_train[varname].max()
        vmin = min(vmin, data_test[varname].min())
        vmax = max(vmax, data_test[varname].max())
    binsize = (vmax-vmin)/9.
    vbin = [vmin+(i-0.5)*binsize for i in range(11)]
    
    figx = plt.figure(figsize=(5,5))
    plt.hist(data_train[varname], weights=[1./len(data_train)]*len(data_train), bins = vbin, #10, #range=(vmin, vmax), 
             alpha=0.85, color='grey', 
             label='Train Data: %d'%(len(data_train)) )
    plt.hist(data_test[varname], weights=[1./len(data_test)]*len(data_test), rwidth = 0.5, 
             hatch='.', bins = vbin, 
             alpha=0.65, color='dodgerblue', 
             label='Test Data: %d'%(len(data_test)) )
    pltname = "plot/"+categ+"_"+varname+"_sb.png"
    x1,x2,y1,y2 = plt.axis()
    plt.ylim(y1, y2*1.5)
    if xlim[0] < xlim[1]:
        plt.xlim(xlim[0] -0.5 * binsize, xlim[1] +0.5 * binsize)
    plt.xlabel("Value of: "+varname+" ("+categ.replace("_", " ")+")")
    plt.ylabel("Normalized")
    plt.legend(loc='upper left')
    plt.title("Density of "+varname)
    #plt.show()
    plt.savefig(pltname)
    plt.clf()
    plt.close()

In [51]:
cdt0 = datetime.datetime.now()
testcolname = septestpd.index.values
#figure with test
for idx,col in enumerate(testcolname):
    name = "test_sep%3d" % idx
    name = name.replace(" ", "0")
    train_test_compare(train, test, name, col)
    
time_dif_1 = datetime.datetime.now() - cdt0
print ("Time passed: %2d Minutes %2d Seconds."\
       % ( time_dif_1.total_seconds()//60, time_dif_1.total_seconds()%60) )


Time passed:  1 Minutes 36 Seconds.


In [45]:
print(test.columns)

Index(['TransactionDT', 'TransactionAmt', 'card1', 'card2', 'card3', 'card5',
       'addr1', 'addr2', 'dist1', 'dist2',
       ...
       'id_30_i', 'id_31_i', 'id_33_i', 'id_34_i', 'id_35_i', 'id_36_i',
       'id_37_i', 'id_38_i', 'DeviceType_i', 'DeviceInfo_i'],
      dtype='object', length=432)


In [28]:
def test_model( data, cols_model, clf, use_prob = True):
    cols = []
    for col, content in data.items():
        if col == "isFraud": continue
        if not is_numeric_dtype(content):
            continue
        # if one variable is not in training model, skip
        if col not in cols_model:
            #print(col,"NOT found in training model. Skip.")
            continue
        cols.append( col )

    tstX = data[ cols ] # Features
    tstX = tstX.fillna( -1.0 )
    print("Data points: ",len(tstX))
    if use_prob:
        Y_pred=clf.predict_proba(tstX)
        Y_pred=Y_pred[:,1]
    else:
        Y_pred=clf.predict(tstX)
    return Y_pred

In [30]:
Ytst_prob = test_model(test, cols_select, model_prob, use_prob = True)
result_prob = pd.DataFrame(Ytst_prob, index = test.index, columns=["isFraud"])
result_prob.sort_index(inplace=True)
result_prob.to_csv("out_prob.csv")

Data points:  506691
