# Credit Card Fraud Detection

## The Data   

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
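To make the imbalance concrete, the short calculation below uses only the two counts quoted above. It also shows why plain accuracy is a weak yardstick here: a hypothetical model that labels every transaction as normal would still score very high. The variable names are illustrative and not part of the notebook.

```python
# Figures quoted in the dataset description.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions
print(f"Fraud share: {fraud_rate:.3%}")

# A model that labels every transaction as normal is correct on every
# non-fraud row, so its accuracy equals the share of normal transactions.
always_normal_accuracy = 1 - fraud_rate
print(f"Accuracy of an always-normal model: {always_normal_accuracy:.3%}")
```

This is why the later sections look at precision and recall rather than accuracy alone.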

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.   
The dataset can be found here: https://www.kaggle.com/mlg-ulb/creditcardfraud

## The goal of the Project

For this project, we'll use everything we've learned about Data Science and Machine Learning thus far to source a dataset, preprocess and explore it, and then build and interpret a classification model that answers our chosen question.

Our goal is to predict if a transaction was fraudulent or not.

In [1]:
# basic libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns

# display options
pd.set_option('display.max_columns', None)
sns.set(style='whitegrid')

# date
from datetime import datetime 

# visualizations libraries
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import warnings
warnings.filterwarnings('ignore')

import functions
%load_ext autoreload
%autoreload 2

In [2]:
# ML libraries
import itertools
from collections import Counter
from sklearn.datasets import make_classification
from numpy import where

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_curve, auc
from sklearn.metrics import confusion_matrix

from imblearn.over_sampling import SMOTE

from sklearn.svm import SVC # Support Vector Machine Classifier
from sklearn.metrics import precision_score, recall_score, classification_report, accuracy_score, f1_score  # scikit-learn metrics
from sklearn.neighbors import KNeighborsClassifier # KNN Classifier
from xgboost import XGBClassifier # Boosting Algo
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc # Comparing Various Classifiers
from sklearn.tree import DecisionTreeClassifier

import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'


numpy.core.umath_tests is an internal NumPy module and should not be imported. It will be removed in a future NumPy release.



In [3]:
df = pd.read_csv('creditcard.csv')

## Preview the data

In [4]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


## Amount of transactions

In [28]:
# Total transaction amount per class
print("Fraud")
print(df.Amount[df.Class == 1].sum())
print()
print("Normal")
print(df.Amount[df.Class == 0].sum())

Fraud
25162590.009999998

Normal
0.0


## Split to train and test sets (not normalised data)

In [5]:
y = df['Class']
X = df.drop(columns=['Class'])  # `columns=` already names the axis, so `axis=1` is redundant

In [6]:
print(X.shape)
print(y.shape)

(284807, 30)
(284807,)


In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Data Shapes:")
print(f"X_train: {X_train.shape} | X_test: {X_test.shape} | y_train {y_train.shape} | y_test {y_test.shape}")

Data Shapes:
X_train: (199364, 30) | X_test: (85443, 30) | y_train (199364,) | y_test (85443,)


In [8]:
print(f"Number of Frauds in Train Set: {y_train.sum()}")
print(f"Number of Frauds in Test Set: {y_test.sum()}")

Number of Frauds in Train Set: 356
Number of Frauds in Test Set: 136
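The split above is purely random, so the fraud share differs slightly between the two sets (356 of 199,364 is about 0.18%, while 136 of 85,443 is about 0.16%). One common way to keep the class proportions equal is the `stratify` argument of `train_test_split`. The sketch below uses a small synthetic dataset, not `creditcard.csv`, and the `*_demo` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 20 of them positive.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
print(f"Positives in train: {y_tr.sum()} | positives in test: {y_te.sum()}")
print(f"Positive share in train: {y_tr.mean():.4f} | in test: {y_te.mean():.4f}")
```

Applying the same argument to the notebook's split would change the fraud counts reported above, so it is shown here only as an option.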


## Split the train set to train and validation sets

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
print(f"X_train: {X_train.shape} | X_val: {X_val.shape} | y_train {y_train.shape} | y_val {y_val.shape}")

In [None]:
print(f"Number of Frauds in Train: {y_train.sum()}")
print(f"Number of Frauds in Validation: {y_val.sum()}")

In [None]:
# Training set
print(y_train.value_counts())
print('\n')
# Validation set
print(y_val.value_counts())

## Create a baseline (non normalised data)

For the baseline, we fit a Logistic Regression model with default settings on the unscaled data.

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
functions.scores(y_val,y_pred);

## Logistic Regression with Ridge and Lasso

Regularization adds a penalty on the size of the model's coefficients to the loss function, which discourages overly complex models. The two most common penalties are l1 (lasso) and l2 (ridge).  
scikit-learn's `LogisticRegression` uses the 'l2' penalty by default, so unless you specify otherwise, that is what you have been using.

Besides choosing the type of penalty, you can also control the amount of regularization through the `C` parameter. `C` is the inverse of the regularization strength: smaller values mean stronger regularization.

Ridge --> l2 (default)   
Lasso --> l1  
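As a rough illustration of how `C` and the l1 penalty interact, the sketch below fits lasso-penalised logistic regressions on synthetic data and counts how many coefficients end up exactly zero. The data comes from `make_classification`, not from the fraud dataset, and the chosen `C` values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=1_000, n_features=20, n_informative=3, n_redundant=0,
    random_state=0,
)

zero_counts = {}
for C in (0.01, 1.0, 100.0):
    # liblinear is one of the solvers that supports the l1 penalty.
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    model.fit(X_demo, y_demo)
    zero_counts[C] = int(np.sum(model.coef_ == 0))
    print(f"C={C}: {zero_counts[C]} of 20 coefficients are exactly zero")
```

Smaller `C` should push more coefficients to zero, which is the feature-selection effect usually associated with lasso.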

In [None]:
# Lasso (the l1 penalty needs a solver that supports it, such as liblinear)
clf = LogisticRegression(penalty='l1', solver='liblinear')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
functions.scores(y_val,y_pred);

## Scaling

As the data description says, all features have been PCA transformed except 'Time' and 'Amount', so we scale only these two columns:

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
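# Use one scaler per column so each keeps its own fitted mean and standard deviation
amount_scaler = StandardScaler()
time_scaler = StandardScaler()
df['Amount'] = amount_scaler.fit_transform(df[['Amount']]).ravel()
df['Time'] = time_scaler.fit_transform(df[['Time']]).ravel()
# Note: this changes df only; the X_train / X_val / X_test frames created earlier keep their original values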

In [None]:
df.head()
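The cells above scale the full `df`. A common alternative, which avoids leaking validation statistics into training, is to fit the scaler on the training rows only and reuse it for the other sets. The sketch below shows that pattern on a small synthetic frame; `demo`, `train_part` and `val_part` are illustrative names, not objects from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small synthetic frame with the same two unscaled columns as the dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Time': rng.uniform(0, 172_800, size=1_000),      # two days in seconds
    'Amount': rng.exponential(scale=90, size=1_000),
    'V1': rng.normal(size=1_000),
})
train_part, val_part = train_test_split(demo, test_size=0.3, random_state=42)
train_part, val_part = train_part.copy(), val_part.copy()

cols = ['Time', 'Amount']
scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only...
train_part[cols] = scaler.fit_transform(train_part[cols])
# ...and reuse those statistics for the validation rows.
val_part[cols] = scaler.transform(val_part[cols])

print(train_part[cols].mean().round(3))
print(val_part[cols].mean().round(3))
```

The validation means will generally be close to, but not exactly, zero, because the statistics come from the training rows.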

## Undersampling



In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
# Create the undersampler once; fitting happens in the resampling call
rus = RandomUnderSampler(random_state=0)

In [None]:
# Original class distribution
print(y_train.value_counts())

# Undersample the majority class in the training data
# (fit_resample replaces fit_sample, which was removed from imbalanced-learn)
X_train, y_train = rus.fit_resample(X_train, y_train)

# Class distribution after undersampling
print('\n')
print(pd.Series(y_train).value_counts())

In [None]:
# observe that data has been balanced
pd.Series(y_train).value_counts().plot.bar()
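With its default settings, `RandomUnderSampler` balances the classes by randomly dropping majority-class rows until both classes have the same count. The plain NumPy sketch below reproduces that idea on synthetic labels; it is an illustration of the technique, not the library's implementation, and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 1,000 rows with 10 positives.
y_demo = np.zeros(1_000, dtype=int)
y_demo[:10] = 1
X_demo = rng.normal(size=(1_000, 3))

minority_idx = np.flatnonzero(y_demo == 1)
majority_idx = np.flatnonzero(y_demo == 0)

# Keep every minority row and an equally sized random subset of majority rows.
kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
keep = np.concatenate([minority_idx, kept_majority])

X_bal, y_bal = X_demo[keep], y_demo[keep]
print(np.bincount(y_bal))
```

Only the training set is resampled in this notebook; the validation set keeps its original, imbalanced distribution so that the scores reflect realistic conditions.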

Accuracy = (TP+TN)/total  
Precision = TP/(TP+FP)  
Recall = TP/(TP+FN)  
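The three formulas can be computed directly from the predicted and true labels. The source of the notebook's own `functions.scores` helper is not shown here, so `basic_scores` below is a hypothetical stand-in written only to illustrate the definitions.

```python
import numpy as np

def basic_scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # frauds caught
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # normals left alone
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # frauds missed
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when nothing is predicted or labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy labels: 4 true negatives, 2 false positives, 3 true positives, 1 false negative.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
accuracy, precision, recall = basic_scores(y_true_demo, y_pred_demo)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# prints: accuracy=0.70 precision=0.60 recall=0.75
```

For fraud detection, recall is usually the number to watch first, since a missed fraud (FN) tends to cost more than a false alarm (FP).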

In [None]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_val)
functions.scores(y_val,y_pred);

## KNN

In [None]:
neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)

y_pred = neigh.predict(X_val)
functions.scores(y_val,y_pred);

In [None]:
# NN = 3
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)

y_pred = neigh.predict(X_val)
functions.scores(y_val,y_pred);

In [None]:
# NN = 7
neigh = KNeighborsClassifier(n_neighbors=7)
neigh.fit(X_train, y_train)

y_pred = neigh.predict(X_val)
functions.scores(y_val,y_pred);
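The three KNN cells above differ only in `n_neighbors` (the first uses the scikit-learn default of 5). A loop makes that comparison easier to extend. The sketch below runs on synthetic data and reports recall with `recall_score` instead of the notebook's `functions.scores`; the `*_demo` and `recall_by_k` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, mildly imbalanced data (about 10% positives).
X_demo, y_demo = make_classification(
    n_samples=1_000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

recall_by_k = {}
for k in (3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    recall_by_k[k] = recall_score(y_va, knn.predict(X_va))
    print(f"k={k}: validation recall = {recall_by_k[k]:.3f}")
```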

### Convert the validation set to NumPy arrays

In [None]:
print(type(X_train))
print(type(X_val))
print(type(y_train))
print(type(y_val))

In [None]:
X_val = X_val.values
y_val = y_val.values
print(type(X_val))
print(type(y_val))

## XGBoost

In [None]:
xgb = XGBClassifier(max_depth=5, n_jobs=-1)
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_val)
functions.scores(y_val,y_pred);

## Decision Tree

In [None]:
d_tree = DecisionTreeClassifier(random_state=10)  
d_tree.fit(X_train, y_train) 

y_pred = d_tree.predict(X_val)
functions.scores(y_val,y_pred);

## Random Forest Classifier

In [None]:
r_for = RandomForestClassifier(random_state=0)
r_for.fit(X_train, y_train)

y_pred = r_for.predict(X_val)
functions.scores(y_val,y_pred);

## Support Vector Machines (SVM)

In [None]:
svc = SVC(kernel='linear')
svc.fit(X_train, y_train)

In [None]:
y_pred = svc.predict(X_val)
functions.scores(y_val,y_pred);