# Credit Card Fraud

Data: https://www.kaggle.com/mlg-ulb/creditcardfraud

Will implement Naive Bayes and Isolation Forest algorithms to predict fraud in credit card transactions. The goal is to compare the two models in an anomaly detection problem with an imbalanced dataset.

### Conclusion

The Correct Classification Rates are very high for this dataset because the classes are imbalanced and there is a majority of 0s. For instance, predicting a 0 the majority of the time would give an almost perfect CCR (99.84%). That said, the isolation forest method performs better than Naive Bayes for this dataset, 97.81% vs. 99.84%. 

# Naive Bayes Classifier

The goal of any classifier is to prededict the class of an instance from among a set of possible classes. For this example we have a binary class problem where we need to predict if a point belongs to class 1 or 0.


The Naive Bayes algorithm is a classification algorithm that is based on Bayes Rule and the assumption that all features  in our inputs are independent.

Naive Bayes helps us find a proxy for the probability of a point belonging to a class i.e. Probability of class C given a point X:

P(C|X)

Our prediction for the class (0 or 1) will be the value of C for which this probability is the highest.

P(C|X) ∝ P(X|C)P(C) = P(x_1|C)P(x_2|C)...P(x_n|C)P(C)

For calculating P(C|X) we need to calculate the following probabilities:

1. P(x_i|C) for every feature x_i given every class C. This is drawn from a normal distribution with mean and standard deviation derived from the training data.
2. P(C) for every class C. This is derived from the percentage of points in each class found in training set.


In [1]:
import numpy as np
import pandas as pd
import scipy.stats
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("./creditcard.csv")

In [3]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [4]:
del df["Time"]    # delete time column since it is not statistically significant for the prediction

In [5]:
df.shape

(284807, 30)

In [6]:
df.describe()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,1.768627e-15,...,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,1.08885,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,-24.58826,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,-0.5354257,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,-0.09291738,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,0.4539234,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,23.74514,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [7]:
df[df["Class"] == 1].shape[0]    # Class 1

492

In [8]:
df[df["Class"] == 0].shape[0]   # Class 0

284315

In [9]:
df[df["Class"] == 1].shape[0] / df.shape[0]

0.001727485630620034

In [10]:
df[df["Class"] == 0].shape[0] / df.shape[0]

0.9982725143693799

In [11]:
y = df["Class"]   # get class label

In [12]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)

In [13]:
# del X_test["Class"]

### 1. Separate training set by classes and summarize the two groupings by mean and standard deviation

Find mean and standard deviation of each feature of the training set for each of the classes

In [14]:
mean = X_train.groupby("Class").mean()
mean

Unnamed: 0_level_0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.006232,-0.008392,0.014015,-0.009202,0.005552,0.000694,0.011888,-0.002616,0.004171,0.008729,...,-0.000478,-0.001405,-0.001872,3.7e-05,-0.000445,-0.001011,0.000527,-0.000904,-0.000163,88.754825
1,-4.318652,3.385874,-6.691748,4.543605,-2.824286,-1.451156,-5.176784,0.80942,-2.514537,-5.459115,...,0.328545,0.612619,0.042283,-0.120926,-0.106338,0.028429,0.081238,0.183397,0.079234,113.91573


In [15]:
std = X_train.groupby("Class").std()
std

Unnamed: 0_level_0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
Class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,1.935063,1.65454,1.460741,1.399925,1.372097,1.335351,1.196249,1.164935,1.087862,1.04395,...,0.778696,0.720368,0.724137,0.630126,0.60592,0.52164,0.482478,0.402705,0.328805,258.44644
1,6.456369,4.013237,6.842078,2.807483,5.178481,1.709044,6.787486,5.630525,2.46218,4.765614,...,1.198494,3.037548,1.238645,1.627011,0.517226,0.826883,0.476215,1.331331,0.537429,246.318562


### 2. Set up the normal distributions

Get P(X|C)

In [16]:
def probabilities(X):
    prob_0 = np.ones((X.shape[0],))
    prob_1 = np.ones((X.shape[0],))
    for i in range(X.shape[1]):
        prob_0 *= scipy.stats.norm(mean.loc[0][i], std.loc[0][i]).pdf(X.iloc[:,i])
        prob_1 *= scipy.stats.norm(mean.loc[1][i], std.loc[1][i]).pdf(X.iloc[:,i])
    return prob_0, prob_1

Get P(C)

In [17]:
P0 = X_train[X_train["Class"] == 0].shape[0] / X_train.shape[0]
P1 = X_train[X_train["Class"] == 1].shape[0] / X_train.shape[0]

### 3. Make predictions

In [18]:
del X_test["Class"]

In [19]:
prob0, prob1 = probabilities(X_test)

In [20]:
prob0 *= P0
prob1 *= P1

In [21]:
predictions = np.argmax([prob0, prob1],axis=0)

### 4. Evaluate accuracy

Total size of test set

In [22]:
total_pred = len(predictions)
total_pred

85443

Positive prediction (Count of predicted 1's)

In [23]:
sum(predictions)

1965

Percentage of predicted 1's

In [24]:
sum(predictions)/total_pred

0.02299778799901689

Number of corrected predictions

In [25]:
correct_pred = (predictions == y_test)
sum(correct_pred)

83572

Correct Clasisfication rate

In [26]:
ccr = sum(correct_pred)/total_pred

In [27]:
ccr

0.9781023606380862

# Isolation Forest Classifier

In [28]:
from sklearn.ensemble import IsolationForest

In [29]:
del X_train["Class"]

In [30]:
IF = IsolationForest(n_estimators=100, 
                     max_samples='auto', 
                     contamination=float(.12), 
                     max_features=10,
                     random_state=42, 
                     verbose=0)

In [31]:
IF.fit(X_train, y_train)



IsolationForest(behaviour='old', bootstrap=False, contamination=0.12,
                max_features=10, max_samples='auto', n_estimators=100,
                n_jobs=None, random_state=42, verbose=0, warm_start=False)

In [32]:
y_train.unique()

array([0, 1])

In [33]:
predictions = IF.predict(X_test)



In [34]:
predictions = np.where(predictions==-1, 1, predictions) # outliers
predictions = np.where(predictions==1, 0, predictions)  # inliers

In [35]:
correct_pred = (predictions == y_test)
sum(correct_pred)

85307

In [36]:
total_pred = len(predictions)

In [37]:
total_pred

85443

In [38]:
ccr = sum(correct_pred)/total_pred

In [39]:
ccr

0.9984082955888721