Introduction
==
In an earlier pipeline we implemented a multivariate Gaussian approach to anomaly detection [1]. Since the data resulting from the principal component analysis is fairly normally distributed, this looked promising.

While that approach easily reached an optimal recall, its precision was very low, resulting in many false positives.

The motivation for that approach to anomaly detection is clear: as the model is trained only on the normal/valid data, i.e. the majority class, the skewness of the data is not a problem. For other supervised learning algorithms, this imbalance has to be taken into account.

One possible way is to resample the data to get a more balanced dataset; another is to introduce weighted costs for the classes [2]. We will focus on the second approach, as it would be a pity to down-sample the data set. The cost-sensitive approach also feels natural in the present case, as detecting a fraud is of high value to the customer.

We use the standard sklearn Random Forest algorithm, as it has the weighting of classes already built in.
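To make the idea of class weighting concrete, here is a minimal sketch of how sklearn derives "balanced" weights from label counts. The toy label vector is an assumption for illustration and is not taken from the credit card data; the resulting dictionary has the same shape as the `class_weight` argument of `RandomForestClassifier`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (illustrative only): 990 valid transactions (0) and 10 frauds (1).
y = np.array([0] * 990 + [1] * 10)

# 'balanced' weights follow n_samples / (n_classes * count_per_class),
# so the rare class receives a proportionally larger weight.
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

# Dictionary form, as accepted by RandomForestClassifier(class_weight=...).
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

A hand-picked weight for the positive class works the same way; the balanced heuristic is only one possible starting point.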

[1] https://www.kaggle.com/clemensmzr/simple-multivariate-gaussian-anomaly-detection/

[2] http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf

Data handling
==

In [95]:
import numpy as np 
import pandas as pd 

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from imblearn.ensemble import BalancedRandomForestClassifier

In [96]:
data = pd.read_csv('../input/creditcard.csv')

In [98]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [99]:
df2 = data.loc[data['Class'] == 1]
df3 = data.loc[data['Class'] == 0]
df2.head()


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
541,406.0,-2.312227,1.951992,-1.609851,3.997906,-0.522188,-1.426545,-2.537387,1.391657,-2.770089,...,0.517232,-0.035049,-0.465211,0.320198,0.044519,0.17784,0.261145,-0.143276,0.0,1
623,472.0,-3.043541,-3.157307,1.088463,2.288644,1.359805,-1.064823,0.325574,-0.067794,-0.270953,...,0.661696,0.435477,1.375966,-0.293803,0.279798,-0.145362,-0.252773,0.035764,529.0,1
4920,4462.0,-2.30335,1.759247,-0.359745,2.330243,-0.821628,-0.075788,0.56232,-0.399147,-0.238253,...,-0.294166,-0.932391,0.172726,-0.08733,-0.156114,-0.542628,0.039566,-0.153029,239.93,1
6108,6986.0,-4.397974,1.358367,-2.592844,2.679787,-1.128131,-1.706536,-3.496197,-0.248778,-0.247768,...,0.573574,0.176968,-0.436207,-0.053502,0.252405,-0.657488,-0.827136,0.849573,59.0,1
6329,7519.0,1.234235,3.01974,-4.304597,4.732795,3.624201,-1.357746,1.713445,-0.496358,-1.282858,...,-0.379068,-0.704181,-0.656805,-1.632653,1.488901,0.566797,-0.010016,0.146793,1.0,1


In [100]:
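df4 = df3.head(492 * 99)  # 492 frauds + 492 * 99 valid rows -> 1% fraud
df5 = df3.head(492 * 90)  # note: 492 * 90 valid rows gives about 1.1% fraud; 492 * 9 would give 10%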

In [101]:
df_1_percent_fraud = pd.concat([df2, df4])  # DataFrame.append was removed in pandas 2.0
df_1_percent_fraud.groupby(['Class'])['Class'].count()

Class
0    48708
1      492
Name: Class, dtype: int64

In [102]:
df_10_percent_fraud = pd.concat([df2, df5])  # DataFrame.append was removed in pandas 2.0
df_10_percent_fraud.groupby(['Class'])['Class'].count()

Class
0    44280
1      492
Name: Class, dtype: int64
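The arithmetic behind these subsamples can be sketched as follows: to reach a fraud rate p with n frauds, n * (1 - p) / p valid rows are needed, i.e. 492 * 99 for 1% and 492 * 9 for 10%. The helper name `subsample_to_fraud_rate` and the toy frame are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def subsample_to_fraud_rate(df, rate, label='Class'):
    """Keep all frauds plus the first n valid rows so that frauds make up `rate` of the result."""
    fraud = df[df[label] == 1]
    valid = df[df[label] == 0]
    # n_valid / (n_valid + n_fraud) = 1 - rate  =>  n_valid = n_fraud * (1 - rate) / rate
    n_valid = round(len(fraud) * (1 - rate) / rate)
    return pd.concat([fraud, valid.head(n_valid)])

# Toy data (illustrative only): 5 frauds and 1000 valid rows.
toy = pd.DataFrame({'Class': [1] * 5 + [0] * 1000})

one_percent = subsample_to_fraud_rate(toy, 0.01)
ten_percent = subsample_to_fraud_rate(toy, 0.10)
print(one_percent['Class'].value_counts().to_dict())
print(ten_percent['Class'].value_counts().to_dict())
```

Like the cells above, this takes the first valid rows rather than a random sample, so the subsample is biased towards early transactions.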

In [104]:
def print_metrics_numbers(y_test, y_pred):
    # Unpack the 2x2 confusion matrix into true/false negatives and positives
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    recall = tp / (tp + fn)
    prec = tp / (tp + fp)
    F1 = 2 * recall * prec / (recall + prec)
    print(recall, prec, F1)
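As a cross-check of the hand-computed metrics, the same three numbers are available directly from `sklearn.metrics`. The label vectors below are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels (illustrative only): 3 actual frauds, 3 predicted frauds, 2 of them correct.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 0]

recall = recall_score(y_true, y_hat)        # tp / (tp + fn)
precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
f1 = f1_score(y_true, y_hat)                # harmonic mean of precision and recall
print(recall, precision, f1)
```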

**Random Forest**

In [105]:
def create_and_run_RF_model(df):
    X = df.drop(columns='Class').values  # positional axis argument is no longer accepted by pandas
    y = df['Class'].values
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    
    RF = RandomForestClassifier()
    RF.fit(X_train, y_train)
    
    y_pred = RF.predict(X_test)
    
    print_metrics_numbers(y_test, y_pred)    
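With so few frauds, a plain random split can leave the test set with an unrepresentative number of positives. One option, which is not what the function above does, is the `stratify` argument of `train_test_split`; the toy arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 1% positives, one dummy feature.
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(y_train.sum(), y_test.sum())
```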

In [94]:
create_and_run_RF_model(df_1_percent_fraud)
create_and_run_RF_model(df_10_percent_fraud)

0.9213483146067416 0.9761904761904762 0.9479768786127167
0.9213483146067416 0.9761904761904762 0.9479768786127167

In [92]:
def create_and_run_BRF_model(df):
    X = df.drop(columns='Class').values  # positional axis argument is no longer accepted by pandas
    y = df['Class'].values
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    
    BRF = BalancedRandomForestClassifier(n_estimators=10)
    BRF.fit(X_train, y_train)
    
    y_pred = BRF.predict(X_test)
    
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    recall = tp / (tp + fn)
    prec = tp / (tp + fp)
    F1 = 2 * recall * prec / (recall + prec)
    print(recall, prec, F1)

In [93]:
create_and_run_BRF_model(df_1_percent_fraud)
create_and_run_BRF_model(df_10_percent_fraud)

0.9577464788732394 0.48398576512455516 0.6430260047281324
0.9555555555555556 0.48863636363636365 0.6466165413533835


In [90]:
w = 50 # The weight for the positive class

def create_and_run_WRF_model(df):
    X = df.drop(columns='Class').values  # positional axis argument is no longer accepted by pandas
    y = df['Class'].values
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
    
    WRF = RandomForestClassifier(class_weight={0: 1, 1: w})
    WRF.fit(X_train, y_train)
    
    y_pred = WRF.predict(X_test)
    
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    recall = tp / (tp + fn)
    prec = tp / (tp + fp)
    F1 = 2 * recall * prec / (recall + prec)
    print(recall, prec, F1)
    
    #print_metrics_numbers(y_test, y_pred)   

In [91]:
create_and_run_WRF_model(df_1_percent_fraud)
create_and_run_WRF_model(df_10_percent_fraud)

0.9295774647887324 0.9777777777777777 0.9530685920577617
0.9481481481481482 0.9696969696969697 0.9588014981273408


Evaluation
==

In [None]:
# Some results for different weights (bad implementation:
# these weights should be chosen against a validation set)

#w=1 : 0.735632183908 0.888888888889 0.805031446541
#w=10 : 0.701149425287 0.938461538462 0.802631578947
#w=100 : 0.724137931034 0.940298507463 0.818181818182
#w=1000 : 0.701149425287 0.953125 0.807947019868
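Choosing the weight against a validation set, as the comment suggests, could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the credit card set, and the split sizes, `n_estimators`, and the use of F1 as the selection metric are all assumptions rather than the notebook's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): about 5% positives.
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95, 0.05], random_state=1)

# Hold out a test set first, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest)

# Score each candidate weight on the validation set only.
candidate_weights = [1, 10, 100, 1000]
val_scores = {}
for w in candidate_weights:
    model = RandomForestClassifier(n_estimators=50, class_weight={0: 1, 1: w},
                                   random_state=1)
    model.fit(X_train, y_train)
    val_scores[w] = f1_score(y_val, model.predict(X_val))

best_w = max(val_scores, key=val_scores.get)

# Only the selected weight is evaluated on the untouched test set.
final = RandomForestClassifier(n_estimators=50, class_weight={0: 1, 1: best_w},
                               random_state=1)
final.fit(X_train, y_train)
test_f1 = f1_score(y_test, final.predict(X_test))
print(best_w, val_scores[best_w], test_f1)
```

Keeping the test set out of the selection loop is the point of the exercise: the reported score is then not biased by the choice of weight.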