My first public Kaggle notebook. I use recall and precision to judge the predictions and try out some ideas for novelty / outlier detection. I implemented my own multivariate Gaussian outlier detection function and compare it to scikit-learn's OneClassSVM.

I reach 97% recall with a precision of 0.01. This corresponds to catching 97% of all frauds, but 99% of the alerts are false alarms. Any feedback on this result is much appreciated.

Note: I don't think accuracy is a useful measure of how well a prediction algorithm works here. If we simply set all predictions to "No Fraud", we obtain an accuracy of over 99%. For more information, see https://tryolabs.com/blog/2013/03/25/why-accuracy-alone-bad-measure-classification-tasks-and-what-we-can-do-about-it/
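
To make the accuracy argument concrete, here is a small sketch with made-up labels. The class counts are assumptions chosen only to mimic a heavily imbalanced data set; they are not read from `creditcard.csv`.

```python
import numpy as np

# Assumed class counts, for illustration only
n_legit, n_fraud = 284_000, 492

y_true = np.concatenate([np.zeros(n_legit, dtype=bool), np.ones(n_fraud, dtype=bool)])
y_pred = np.zeros_like(y_true)  # predict "No Fraud" for every transaction

accuracy = np.mean(y_pred == y_true)               # fraction of correct labels
recall = np.sum(y_pred & y_true) / np.sum(y_true)  # fraction of frauds caught

print("accuracy =", accuracy)
print("recall   =", recall)
```

The accuracy comes out above 99% although not a single fraud is caught, which is why recall and precision are used in the rest of this notebook.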

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm

from sklearn.model_selection import train_test_split
import seaborn as sns

%matplotlib inline

In [None]:
#load data
data = pd.read_csv("../input/creditcard.csv")
data.head()

In [None]:
data.tail()

In [None]:
data.groupby(("Class")).mean()

Seems like the values for V1, V2, etc are on average much farther from 0 for fraud.

Let's check out some correlation matrices.

In [None]:
#correlation matrix
f, ax = plt.subplots(figsize=(12, 9))
# pass axis=1 by keyword; the positional form was removed in pandas 2.0
sns.heatmap(data.drop(['Amount','Time'], axis=1).corr(), vmax=.8, square=True);

* Class correlates most with V1 - V18 and not (or barely) with V19 - V28

In [None]:
#correlation matrix for only Fraud
f, (ax1, ax2) = plt.subplots(1,2,figsize=(13, 5))
# pass axis=1 by keyword; the positional form was removed in pandas 2.0
sns.heatmap(data.query("Class==1").drop(['Class','Time'], axis=1).corr(), vmax=.8, square=True, ax=ax1)
ax1.set_title('Fraud')
sns.heatmap(data.query("Class==0").drop(['Class','Time'], axis=1).corr(), vmax=.8, square=True, ax=ax2);
ax2.set_title('Legit')
plt.show()

* Strong correlations between the different V for Fraud data
* Much less correlation for Legit data
* Correlation between the data seems to be an important key (This should be captured by Multivariate Gaussian)
* Seems like Amount correlates as well. Thus, I should perhaps include it...


#### Check out some distributions
Ideally, they should be Gaussian for the non-fraud examples.

In [None]:
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(10,3))
data.query("Class==1").hist(column='V6',bins=np.linspace(-10,10,20),ax=ax1,label='Fraud')
ax1.legend()
data.query("Class==0").hist(column='V6',bins=np.linspace(-10,10,20),ax=ax2,label='Legit')
plt.legend()
plt.show()
fig, (ax1, ax2) = plt.subplots(1,2,figsize=(10,3))
data.query("Class==1").hist(column='V2',bins=np.linspace(-10,10,20),ax=ax1,label='Fraud')
ax1.legend()
data.query("Class==0").hist(column='V2',bins=np.linspace(-10,10,20),ax=ax2,label='Legit')
plt.legend()
plt.show()

(try it for different Vi)

For Legit transactions, the Vi are centered around 0 and look roughly Gaussian. For frauds, they are off-center.
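
"Looks roughly Gaussian" can also be checked numerically. The sketch below is an addition, not part of the original analysis: it computes sample skewness and excess kurtosis, both of which are close to 0 for Gaussian data. The helper name `skew_and_excess_kurtosis` and the synthetic samples are assumptions; in the notebook one would pass a column such as `data.query("Class==0")["V6"]` instead.

```python
import numpy as np

def skew_and_excess_kurtosis(x):
    """Sample skewness and excess kurtosis; both are near 0 for Gaussian data."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()          # standardize the sample
    return np.mean(z**3), np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
gaussian_sample = rng.normal(size=100_000)      # synthetic Gaussian data
skewed_sample = rng.exponential(size=100_000)   # synthetic, clearly non-Gaussian data

print("gaussian:", skew_and_excess_kurtosis(gaussian_sample))
print("skewed:  ", skew_and_excess_kurtosis(skewed_sample))
```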

In [None]:
bins=np.linspace(-10,50,40)
data.query("Class==1").hist(column='Amount',bins=bins)
data.query("Class==0").hist(column="Amount",bins=bins)

In [None]:
data.query("Class==1").hist(column="Time")#,bins=np.linspace(-10,10,20))
data.query("Class==0").hist(column="Time")#,bins=np.linspace(-10,10,20))

* **TIME** makes no difference apparently
* **AMOUNT** hard to say... Does not really look Gaussian. Frauds seem to have a tendency toward low amounts.

### For now drop "Amount"

In [None]:
# Features and labels per class (axis=1 passed by keyword for pandas >= 2.0)
X_Legit = data.query("Class==0").drop(["Amount","Class","Time"], axis=1)
y_Legit = data.query("Class==0")["Class"]

X_Fraud = data.query("Class==1").drop(["Amount","Class","Time"], axis=1)
y_Fraud = data.query("Class==1")["Class"]

#split data into training and cv set
X_train, X_test, y_train, y_test = train_test_split(X_Legit, y_Legit, test_size=0.33, random_state=42)
print(len(X_test))
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
X_test = pd.concat([X_test, X_Fraud])
print(len(X_Fraud),'   ', len(X_test))
y_test = pd.concat([y_test, y_Fraud])
X_test.head()

In [None]:
data.plot.scatter("V21","V22",c="Class")

V22 and V21 are definitely anticorrelated. The distribution looks like a Gaussian.

### Multivariate Gaussian
OneClassSVM is further below
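
For reference, the model fitted in the cells below is the standard multivariate Gaussian density, written here with the same names as the code (`mu`, `Sigma`, `m` training examples, `n` features). The symbol $\varepsilon$ stands for the hand-picked threshold `epsilon` used further down.

$$
p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right),
\qquad
\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)},
\qquad
\Sigma = \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^{T}
$$

A transaction $x$ is flagged as fraud when $p(x) < \varepsilon$.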

In [None]:
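# Write my own Multivariate Gaussian outlier detection
X = X_train
m = len(X)
# X.mean() already divides by m, so no extra 1/m factor is needed
mu = X.mean()
# Covariance matrix: (1/m) * sum of outer products of the centered rows,
# computed as one matrix product instead of a Python loop over all rows
X_centered = (X - mu).values
Sigma = X_centered.T.dot(X_centered) / m
Sig_inv = np.linalg.inv(Sigma)
Sig_det = np.linalg.det(Sigma)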

In [None]:
np.matrix(Sigma).shape

In [None]:
# This function calculates the probability for a Gaussian distribution
def prob(x_example):
    n=len(Sigma)
    xminusmu = x_example - mu
    return 1./((2*np.pi)**(n/2.) * Sig_det**0.5) * np.exp(-0.5* xminusmu.dot(Sig_inv).dot(xminusmu))

In [None]:
Sigma.diagonal()

In [None]:
# Check out some resulting probabilities for Fraud examples
for i in range(10):
    print(prob(X_Fraud.iloc[i]))

In [None]:
# Check out some resulting probabilities for NON-Fraud examples
for i in range(10):
    print(prob(X_train.iloc[i]))

In [None]:
# Picking out 100 training examples to test how many are misclassified as false positive
ptrain_result = np.apply_along_axis(prob, 1, X_train.head(100))

In [None]:
sum(ptrain_result < 1e-13)

With an epsilon of 1e-13, roughly 50% of these 100 legit training samples are falsely classified as Fraud. Let's see how many frauds are classified correctly using that epsilon.
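
Densities around 1e-13 are awkward to compare and can underflow once more features are added. A common workaround is to threshold the log-density instead. The sketch below is an addition to the notebook: `log_prob` and the toy 2-D data are hypothetical, and it uses `np.linalg.slogdet` and `np.linalg.solve` rather than an explicit determinant and inverse.

```python
import numpy as np

def log_prob(X, mu, Sigma):
    """Log-density of a multivariate Gaussian for each row of X (hypothetical helper)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n = Sigma.shape[0]
    diff = X - np.asarray(mu, dtype=float)
    # Solve Sigma * z = diff^T instead of forming the inverse explicitly
    z = np.linalg.solve(Sigma, diff.T).T
    maha = np.einsum("ij,ij->i", diff, z)        # squared Mahalanobis distance per row
    sign, logdet = np.linalg.slogdet(Sigma)      # log|Sigma| computed stably
    return -0.5 * (n * np.log(2 * np.pi) + logdet + maha)

# Toy check against the direct density formula
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 2))
mu_toy = np.zeros(2)
Sigma_toy = np.array([[2.0, 0.3], [0.3, 1.0]])

direct = np.array([
    np.exp(-0.5 * x @ np.linalg.inv(Sigma_toy) @ x)
    / ((2 * np.pi) ** (2 / 2.) * np.linalg.det(Sigma_toy) ** 0.5)
    for x in X_toy
])
print(np.allclose(np.exp(log_prob(X_toy, mu_toy, Sigma_toy)), direct))
```

With this helper, a sample would be flagged when `log_prob(...) < np.log(epsilon)`, which is the same rule as comparing the density to `epsilon`.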

In [None]:
# Copying this to a variable with a new name because I am using the same
# variable below with 'Amount' included as a feature
pTest_result = np.apply_along_axis(prob, 1, X_test)
pTest_result_prev = np.copy(pTest_result)

In [None]:
epsilon = 1e-13
yTest_result_prev = (pTest_result_prev < epsilon)

In [None]:
tp = sum(yTest_result_prev  & y_test)
tn = sum((~ yTest_result_prev)  & (~ y_test))
fp = sum((yTest_result_prev)  & (~ y_test))
fn = sum((~ yTest_result_prev)  & ( y_test))

print("true_pos ",tp)
print("true_neg ",tn)
print("false_pos ",fp)
print("false_neg ",fn)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
F1 = 2*recall*precision/(recall+precision)
print("recall=",recall,"\nprecision=",precision)
print("F1=",F1)

Thus, I obtain a recall of 97%, but a low precision of 0.01, which means only 1 out of 100 fraud 'detections' is an actual fraud.
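
The same confusion-matrix arithmetic is repeated several times in this notebook, so it could be wrapped in a small function. This is only a sketch; the name `confusion_metrics` and the toy labels are made up for illustration. Casting both inputs to boolean first avoids relying on how `~` behaves on integer labels.

```python
import numpy as np

def confusion_metrics(y_pred, y_true):
    """Return (tp, tn, fp, fn, recall, precision, f1) for boolean-like arrays."""
    y_pred = np.asarray(y_pred).astype(bool)
    y_true = np.asarray(y_true).astype(bool)
    tp = int(np.sum(y_pred & y_true))      # flagged and actually fraud
    tn = int(np.sum(~y_pred & ~y_true))    # not flagged and actually legit
    fp = int(np.sum(y_pred & ~y_true))     # flagged but legit
    fn = int(np.sum(~y_pred & y_true))     # missed fraud
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return tp, tn, fp, fn, recall, precision, f1

# Toy labels: 1 = fraud, 0 = legit
toy_pred = [1, 1, 1, 0, 0, 0]
toy_true = [1, 0, 0, 0, 0, 1]
print(confusion_metrics(toy_pred, toy_true))
```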

#### rescale "Amount" and use it


In [None]:
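# Note: this divides by the variance; dividing by the standard deviation is the more common rescaling
data["Amountresc"] = (data["Amount"])/data["Amount"].var()

# axis=1 passed by keyword for pandas >= 2.0; "Amountresc" stays in as a feature
X_Legit = data.query("Class==0").drop(["Amount","Class","Time"], axis=1)
y_Legit = data.query("Class==0")["Class"]

X_Fraud = data.query("Class==1").drop(["Amount","Class","Time"], axis=1)
y_Fraud = data.query("Class==1")["Class"]

#split data into training and cv set
X_train, X_test, y_train, y_test = train_test_split(X_Legit, y_Legit, test_size=0.33, random_state=42)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
X_test = pd.concat([X_test, X_Fraud])
y_test = pd.concat([y_test, y_Fraud])
X_test.head()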

In [None]:
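# Use my outlier detection
X = X_train
m = len(X)
# X.mean() already divides by m, so no extra 1/m factor is needed
mu = X.mean()
# Covariance matrix as one matrix product instead of a Python loop over all rows
X_centered = (X - mu).values
Sigma = X_centered.T.dot(X_centered) / m
Sig_inv = np.linalg.inv(Sigma)
Sig_det = np.linalg.det(Sigma)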

In [None]:
pTest_result = np.apply_along_axis(prob, 1, X_test)

In [None]:
epsilon = 2e-11
yTest_result = (pTest_result < epsilon)

tp = sum(yTest_result  & y_test)
tn = sum((~ yTest_result)  & (~ y_test))
fp = sum((yTest_result)  & (~ y_test))
fn = sum((~ yTest_result)  & ( y_test))

print("true_pos ",tp)
print("true_neg ",tn)
print("false_pos ",fp)
print("false_neg ",fn)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
F1 = 2*recall*precision/(recall+precision)
print("recall=",recall,"\nprecision=",precision)
print("F1=",F1)

### Conclusion
No real improvement. Note that I am using a larger epsilon here.
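
Since epsilon was picked by hand in both runs, one way to compare the two feature sets more fairly would be to sweep the threshold and look at the whole precision/recall trade-off. A minimal sketch on synthetic log-densities follows; the score distributions and the name `sweep_thresholds` are assumptions, not results from this data set.

```python
import numpy as np

def sweep_thresholds(scores, y_true, thresholds):
    """For each threshold, flag scores below it and return (threshold, recall, precision)."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true).astype(bool)
    rows = []
    for t in thresholds:
        flagged = scores < t
        tp = np.sum(flagged & y_true)
        fp = np.sum(flagged & ~y_true)
        fn = np.sum(~flagged & y_true)
        recall = tp / (tp + fn)
        # Convention used here: precision is 1.0 when nothing is flagged
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        rows.append((t, recall, precision))
    return rows

rng = np.random.default_rng(0)
# Synthetic log-densities: frauds are assumed to score lower on average
legit_scores = rng.normal(loc=-20.0, scale=5.0, size=10_000)
fraud_scores = rng.normal(loc=-40.0, scale=10.0, size=50)
scores = np.concatenate([legit_scores, fraud_scores])
labels = np.concatenate([np.zeros(10_000), np.ones(50)])

rows = sweep_thresholds(scores, labels, thresholds=[-50, -40, -30, -20])
for t, r, p in rows:
    print(f"threshold={t:6.1f}  recall={r:.3f}  precision={p:.3f}")
```

Raising the threshold can only flag more samples, so recall never decreases along the sweep, while precision typically drops.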



### Next up
I can try the novelty detection algorithm from scikit-learn (OneClassSVM).

In [None]:
len(X_train)

In [None]:
## Use only part of training set, otherwise it takes very long
Xsmall = X_train.head(20000)

In [None]:
# fit the model
clf = svm.OneClassSVM(kernel="rbf", nu=0.01, gamma=0.3)
clf.fit(Xsmall)

In [None]:
y_pred_test = clf.predict(X_test)
y_pred_test = (y_pred_test == -1)  # OneClassSVM labels outliers as -1

tp = sum(y_pred_test  & y_test)
tn = sum((~ y_pred_test)  & (~ y_test))
fp = sum((y_pred_test)  & (~ y_test))
fn = sum((~ y_pred_test)  & ( y_test))


print("true_pos ",tp)
print("true_neg ",tn)
print("false_pos ",fp)
print("false_neg ",fn)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
F1 = 2*recall*precision/(recall+precision)
print("recall=",recall,"\nprecision=",precision)
print("F1=",F1)

In [None]:
# ## Some results from different test runs
# # nu=0.01, gamma=0.3
# true_pos  478
# true_neg  55794
# false_pos  38030
# false_neg  14
# recall= 0.971544715447 
# precision= 0.0124130050899
# F1= 0.0245128205128

# # nu=0.05, gamma=0.3
# true_pos  478
# true_neg  55774
# false_pos  38050
# false_neg  14
# recall= 0.971544715447 
# precision= 0.0124065614618
# F1= 0.0245002562788

# # nu=0.05, gamma=0.2
# true_pos  463
# true_neg  68954
# false_pos  24870
# false_neg  29
# recall= 0.941056910569 
# precision= 0.0182765562705
# F1= 0.0358567279768

# # nu=0.5, gamma=0.5
# true_pos  487
# true_neg  36001
# false_pos  57823
# false_neg  5
# recall= 0.989837398374 
# precision= 0.00835191219345
# F1= 0.0165640624469

# # nu=0.5, gamma=0.1
# true_pos  478
# true_neg  47316
# false_pos  46508
# false_neg  14
# recall= 0.971544715447 
# precision= 0.0101732430937
# F1= 0.0201356417709

- With nu=0.5, gamma=0.1: 97% of frauds are detected, but only 1 in 100 detections is actual fraud (i.e. 99% false alerts). Seems fine to me... But: it's not better than Multivariate Gaussian (and much slower).
- With a smaller nu we get larger F1 and higher precision BUT smaller recall...
- Same thing goes for gamma... Larger gamma means larger recall and smaller precision.
- TODO: Plot precision, recall and F1 as a function of nu and gamma.
- TODO (?): Add new features V1xV3, V1xV5, V1xV7 to make use of the strong correlation between the two for fraud detection and see if it improves results. For multivariate Gaussian, the correlations should already be included. I am not sure if this is also the case for OneClass SVM with a Gaussian Kernel...
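
The first TODO could be approached with a simple grid over `nu` and `gamma`. The sketch below runs on small synthetic data so that it finishes quickly; the toy arrays and the chosen grid values are assumptions, and in the notebook `Xsmall`, `X_test` and `y_test` would take their place.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-in data: inliers near the origin, outliers shifted away
X_train_toy = rng.normal(size=(300, 2))
X_test_toy = np.vstack([rng.normal(size=(200, 2)), rng.normal(loc=4.0, size=(20, 2))])
y_test_toy = np.concatenate([np.zeros(200, dtype=bool), np.ones(20, dtype=bool)])

results = []
for nu in (0.01, 0.05, 0.5):
    for gamma in (0.1, 0.3):
        clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train_toy)
        flagged = clf.predict(X_test_toy) == -1   # -1 marks predicted outliers
        tp = np.sum(flagged & y_test_toy)
        fp = np.sum(flagged & ~y_test_toy)
        fn = np.sum(~flagged & y_test_toy)
        recall = tp / (tp + fn)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
        results.append((nu, gamma, recall, precision, f1))

for row in results:
    print("nu=%.2f gamma=%.1f recall=%.3f precision=%.3f F1=%.3f" % row)
```

The `results` list can then be put into a DataFrame and plotted, for example as one heatmap per metric.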

### Try another classification algorithm - IsolationForest

In [None]:
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)


clf = IsolationForest(max_samples=10, random_state=rng)
clf.fit(X_train.head(100000))
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
#y_pred_outliers = clf.predict(X_outliers)

In [None]:
y_pred_test = clf.predict(X_test)
y_pred_test = (y_pred_test == -1)  # IsolationForest labels outliers as -1

tp = sum(y_pred_test  & y_test)
tn = sum((~ y_pred_test)  & (~ y_test))
fp = sum((y_pred_test)  & (~ y_test))
fn = sum((~ y_pred_test)  & ( y_test))


print("true_pos ",tp)
print("true_neg ",tn)
print("false_pos ",fp)
print("false_neg ",fn)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
F1 = 2*recall*precision/(recall+precision)
print("recall=",recall,"\nprecision=",precision)
print("F1=",F1)

This does not look very promising and I know too little about Random Forest... 

In fact, isolation forest seems to be meant for outlier detection, not for novelty detection (http://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection). So maybe it's not that useful here.