# Credit Card Fraud: Anomaly Detection

Jessica Mouras

## Table of Contents
[1. Motivation](#motive)<br> 
[2. Data](#data)
>   [i. Pipeline](#pipeline)<br>
    [ii. Intro to Fraud](#intro)<br>
    [iii. Unbalanced Datasets](#unbalanced)<br>
    
[3. Machine Learning Classification Models](#mlclass)
> [i. Undersampling](#undersample)<br>
> [ii. SMOTE](#smote)<br>
> [iii. Isolation Forest](#isolation)<br>
> [iv. PCA Anomaly Detection](#pca)<br>
> [v. Neural Network Autoencoder](#auto)<br>
> [vi. Best Model](#best)<br>

[4. Conclusion & Further Application](#conclude)

## Motivation
<a id="motive"> </a>

I started my career in financial services as a CPA (license still current, but inactive) who found the grittier aspects of accounting and finance most intriguing. In fact, my first consulting project ever was an Anti Money Laundering (AML) assignment that deployed Machine Learning techniques, but that was 2012 and I was a very junior employee.

I wanted to revisit the topic of money laundering by building methods and models to assist with identifying fraudulent transactions.

Here, I assess a large dataset of credit card transactions made by European cardholders over 2 days in September 2013, classifying each transaction as fraud or non-fraud. As discussed below, this classification amounts to detecting anomalies. Through my analysis, I hope to determine which methods are best for detecting fraudulent transactions.

## Data
<a id="data"> </a>

As discussed briefly above, this dataset contains transactions made by European cardholders over the course of two days in September 2013. The minority class consists of 492 frauds out of 284,807 transactions, which is 0.172% of all transactions. The majority class therefore accounts for 99.827% of transactions.
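As a quick check of the proportions quoted above, the arithmetic can be reproduced in a few lines. This is only a sketch built from the two counts stated in the text; the "always predict normal" baseline is added here for illustration and is not a model from this analysis.

```python
# Class counts quoted in the text above.
n_total = 284_807
n_fraud = 492
n_normal = n_total - n_fraud

fraud_pct = 100 * n_fraud / n_total
normal_pct = 100 * n_normal / n_total

# A "model" that labels every transaction as normal is right this often,
# which is why plain accuracy is a misleading metric for this dataset.
baseline_accuracy = n_normal / n_total

print(f"fraud:  {fraud_pct:.4f}% of transactions")
print(f"normal: {normal_pct:.4f}% of transactions")
print(f"always-normal baseline accuracy: {baseline_accuracy:.5f}")
```

The baseline shows why the rest of this analysis focuses on how many fraud cases are caught rather than on overall accuracy.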

The following had already been done to the data to anonymize the information except for time and amount:

+ PCA Transformation: the features went through a PCA transformation (a dimensionality reduction technique), which creates latent features that are combinations of the original features of the data.

+ Scaling: the features were already scaled, since PCA is sensitive to feature scale and features are normally scaled before the transformation.

I separately scaled time and amount to perform the remainder of this analysis.
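The text does not say which scaler was used for time and amount, so the following is only a minimal sketch assuming plain standardization (zero mean, unit variance). The `standardize` helper and the sample values are made up for illustration.

```python
import numpy as np

def standardize(values):
    """Return values rescaled to zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return np.zeros_like(values)
    return (values - values.mean()) / std

# Made-up stand-ins for the time (seconds) and amount columns.
time_seconds = np.array([0.0, 1.0, 406.0, 7519.0, 172_792.0])
amount = np.array([149.62, 2.69, 0.00, 1.00, 2125.87])

scaled_time = standardize(time_seconds)
scaled_amount = standardize(amount)
```

When a train/test split is involved, the mean and standard deviation would normally be computed on the training rows only and then reused for the test rows.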

Due to the anonymization process, the feature names, descriptions, and nature are largely undisclosed. I reviewed the features for data leakage and potential rank issues (where a feature is a transformation of another feature and therefore redundant).
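One way to look for rank issues is to compare the rank of the feature matrix with its number of columns. The sketch below uses a small hypothetical matrix rather than the real data, and it only catches linear redundancy (a column that is a linear combination of others).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 3 independent columns plus one redundant
# column that is an exact linear combination of the first two.
independent = rng.normal(size=(200, 3))
redundant = 2.0 * independent[:, 0] - 0.5 * independent[:, 1]
X = np.column_stack([independent, redundant])

rank = np.linalg.matrix_rank(X)
n_features = X.shape[1]

# Fewer independent directions than columns means at least one is redundant.
is_rank_deficient = rank < n_features
print(f"rank {rank} of {n_features} columns, deficient: {is_rank_deficient}")
```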


### Pipeline
<a id="pipeline"> </a>

To perform my analysis I used the following libraries:

<p align="center">
  <img width="660" height="300" src="images/keras_regression_logos.png">
</p>


The data was already labeled: class 1 marks a fraudulent transaction and class 0 a normal transaction. Throughout this analysis, a true positive is a fraudulent transaction that the model correctly predicts as fraud.
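To make the true positive convention concrete, here is a small sketch that counts the four confusion matrix cells with fraud (1) as the positive class. The labels and predictions are invented for illustration and are not results from this analysis.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN with fraud (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, left alone
    return tp, fp, fn, tn

# Invented labels: 3 fraud cases among 10 transactions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)      # share of actual fraud that was caught
precision = tp / (tp + fp)   # share of fraud alerts that were real
```

Recall is the number to watch when missing a fraud case is costlier than reviewing a false alarm.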

### Introduction to Fraud
<a id="intro"> </a>

There are many types of financial fraud and many subsets of Money Laundering. In this notebook, we only address Money Laundering through a Retail Banking Institution, also known as Credit Card Fraud.

>*Credit card fraud is the unauthorized use of a credit or debit card, or similar payment tool (ACH, EFT, recurring charge, etc.), to fraudulently obtain money or property.* - **Federal Bureau of Investigation**

Disclaimer: the methods deployed in this notebook may not be applicable or appropriate for other types of Money Laundering e.g. through an Investment Bank / Private Equity Fund, etc.

### Unbalanced Datasets
<a id="unbalanced"> </a>

In classification, an unbalanced dataset is one that contains proportionally far more of one class (target) than the other. This is an issue because, during training, a machine learning classifier does not see enough examples of the minority class to effectively learn the "decision boundary".

Models trained this way tend to struggle on new or true testing data, as they just didn't have enough evidence and "coaching" to learn the difference!

There are three ways to attempt to solve this issue:

1. Undersampling
2. Oversampling
3. Other Techniques: Anomaly Detection

**Undersampling**

First, we count how many instances are in the minority class (fraudulent transactions, in this case). Then we build a new dataset by taking a randomized (shuffled) subsample of the majority class (normal transactions) of the same size as the minority class. This creates an even, balanced dataset with 50% of each class (for binary classification such as fraud vs. not fraud). For this dataset, that means we limit our original data to 492 cases of fraud and 492 cases of normal transactions.
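The undersampling procedure described above can be sketched with NumPy as follows. The `undersample` helper and the toy data are hypothetical; this illustrates the idea and is not necessarily the code used for this analysis.

```python
import numpy as np

def undersample(X, y, seed=0):
    """Balance a binary dataset by randomly dropping majority-class (0) rows."""
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    normal_idx = np.flatnonzero(y == 0)
    # Keep as many normal rows as there are fraud rows, without replacement.
    kept_normal = rng.choice(normal_idx, size=fraud_idx.size, replace=False)
    # Shuffle so the two classes are interleaved.
    keep = rng.permutation(np.concatenate([fraud_idx, kept_normal]))
    return X[keep], y[keep]

# Toy data: 1,000 rows, 20 of which are labeled as fraud.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:20] = 1

X_bal, y_bal = undersample(X, y)
print(X_bal.shape, int(y_bal.sum()))
```

On the real data, the same idea would keep all 492 fraud rows and 492 randomly chosen normal rows.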

Warning: This methodology of undersampling comes at a relatively large price. The original dataset had 284,807 transactions, and now it has been reduced to 984. There is a risk that a classification model will not perform well, since a large volume of information is lost.

One way to solve this information loss issue brings us to our other option:

**Oversampling**

Here we oversample instances of the minority class. How do we oversample something that doesn't exist?

**a.** Make exact copies of samples from the minority class in the training dataset prior to fitting a model. This is rather simplistic and doesn't add information the way genuinely new data would.

**b.** Synthesize new instances from the minority class. This methodology is called Synthetic Minority Oversampling TEchnique, or SMOTE for short. It was described by Nitesh Chawla, et al. in their 2002 paper “SMOTE: Synthetic Minority Over-sampling Technique.”

**Anomaly Detection Techniques**

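*Isolation Forest*: isolates anomalies with an ensemble of randomly split trees. The method itself is unsupervised; labels, where available, are only used to evaluate or tune it. Anomalies are the points with the shortest average path length, because they take the fewest random splits to isolate.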
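A minimal sketch of this idea using scikit-learn's `IsolationForest` is shown below. The toy data, the `contamination` value, and the other parameters are assumptions chosen for illustration; they are not the tuned settings from this analysis.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Toy data: 1,000 "normal" points near the origin and 20 scattered anomalies.
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
anomalies = rng.uniform(low=-8.0, high=8.0, size=(20, 5))
X = np.vstack([normal, anomalies])
y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(20, dtype=int)])

# contamination is the expected share of anomalies; it sets the score cutoff.
# Here it is set a little above the true share (20 / 1020 is about 2%).
model = IsolationForest(n_estimators=100, contamination=0.03, random_state=0)
model.fit(X)  # labels are not used: the model is unsupervised

# predict returns -1 for anomalies and 1 for inliers; map to 1 = anomaly.
y_pred = (model.predict(X) == -1).astype(int)

# Labels are used only afterwards, to measure how many anomalies were caught.
recall = (y_pred[y_true == 1] == 1).mean()
print(f"flagged {int(y_pred.sum())} points, recall on anomalies: {recall:.2f}")
```

Raising `contamination` flags more points, which tends to trade extra false alarms for fewer missed anomalies.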

*Auto-encoders*: detect anomalies through an unsupervised neural network in which we measure how “far” the reconstructed data point provided by the model is from the actual, original data point. If the error is large, then the original data point is likely an anomaly. In this analysis we perform two variants of this reconstruction-error technique: one based on linear compression (PCA) and the other on nonlinear compression (a neural network autoencoder).
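The reconstruction-error idea is easiest to see in the linear case. The sketch below compresses data with PCA (computed through an SVD in NumPy), decompresses it, and scores each point by its squared reconstruction error. The helper name, the toy data, and the 99th percentile threshold are all assumptions for illustration.

```python
import numpy as np

def pca_reconstruction_error(X_train, X_new, n_components):
    """Squared reconstruction error of X_new under a PCA fit on X_train."""
    mean = X_train.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    components = vt[:n_components]
    centered = X_new - mean
    # Compress to n_components dimensions, then decompress back.
    reconstructed = centered @ components.T @ components
    return ((centered - reconstructed) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Toy "normal" data lying close to a 2-D plane inside a 5-D space.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(500, 5))
# Toy anomalies that ignore that structure.
X_odd = rng.normal(scale=3.0, size=(10, 5))

normal_error = pca_reconstruction_error(X_normal, X_normal, n_components=2)
odd_error = pca_reconstruction_error(X_normal, X_odd, n_components=2)

# Flag points whose error exceeds what 99% of normal points show.
threshold = np.quantile(normal_error, 0.99)
flagged = odd_error > threshold
```

A neural network autoencoder follows the same recipe, with the linear projection replaced by a learned nonlinear encoder and decoder.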

As such, during the EDA process, I chose not to remove outliers, as they could be important transactions to review during the anomaly classification process.


## Machine Learning Classification Models
<a id="mlclass"> </a>

The models analyzed over the course of this analysis are as follows:

Undersampling:
+ Logistic Regression
+ kNN Classifier
+ Decision Trees Classifier
+ Random Forest Classifier
+ Gradient Boosted Classifier

Oversampling (SMOTE):
+ None

Anomaly Detection on Unbalanced Classes:
+ Isolation Forest
+ PCA Anomaly Detection
+ Neural Network Autoencoder

### Undersampling
<a id="undersample"> </a>

### SMOTE
<a id="smote"> </a>

This oversampling technique selects instances that are close to each other in the existing dataset's feature space. It then "draws" a line between the selected instances and adds a new sample at a point somewhere along that line.

To put this into a perspective we can digest using machine learning methods and terms:

1. A random instance from the minority class is chosen.
2. The k nearest neighbors of that instance are found (typically k=5).
3. One of those neighbors is selected at random.
4. Finally, a synthetic example is created at a randomly selected point between the two instances in feature space.

That is a lot of randoms!
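The four steps listed above can be sketched directly in NumPy. The `smote_sample` function below is a hypothetical, simplified illustration of those steps, not the reference implementation (libraries such as imbalanced-learn provide a maintained `SMOTE` class).

```python
import numpy as np

def smote_sample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbors."""
    rng = np.random.default_rng(seed)
    n = X_minority.shape[0]
    # Pairwise squared distances between minority rows.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # step 2: k nearest neighbors

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # step 1: random minority row
        neighbor = rng.choice(neighbors[base])  # step 3: random neighbor
        gap = rng.random()                      # step 4: random point on the line
        step = X_minority[neighbor] - X_minority[base]
        synthetic[i] = X_minority[base] + gap * step
    return synthetic

# Toy minority class: 30 rows with 3 features (must have more than k rows).
rng = np.random.default_rng(1)
X_min = rng.normal(size=(30, 3))
X_new = smote_sample(X_min, n_new=100)
```

Because every synthetic row lies on a segment between two real minority rows, no new row can fall outside the range the minority class already covers.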

Why does this work? The "new synthetic" instances generated through this process are, generally speaking, close in feature space to existing examples from the minority class.

SMOTE is less well suited to data with many features: as the number of dimensions grows, distances between points become less informative and finding the nearest neighbors becomes more expensive.

## Anomaly Detection

### Isolation Forest
<a id="isolation"> </a>

explain what is isolation forest 
explain model's strength regarding how gridsearching parameters is best performed on the contamination score.
explain why you want to have a contamination score higher generally than your actual % of anomaly

explain why you want high recall

Show confusion matrix for best performing grid searched parameters here 

### PCA Anomaly Detection
<a id="pca"> </a>

Discuss that this is linear decompression

### Autoencoding
<a id="auto"> </a>

+ describe and discuss autoencoding

### Best Model
<a id="best"> </a>

+ [placeholder for best model confusion matrix]

## Conclusion
<a id="conclude"> </a>

+ [placeholder for conclusion]