# Catching Criminals with Math
### Identification of fraudulent credit card transactions using statistical and machine learning models
#### March 8th, 2019

#### Authors: Angad S. Kalra, Pulkit Mathur and Shuang Di
#### Collaborators: Shobhit Jain and Adam Rahman

### Introduction

##### Problem Statement: 
* Every year billions of dollars are lost worldwide due to credit card fraud, which forces financial institutions to continuously improve their fraud detection systems. In recent years, several studies have proposed the use of machine learning and data mining techniques to address this problem.
* When constructing a credit card fraud detection model, it is very important to extract the right features from the transactional data. However, this has not been addressed much in previous studies.
* Moreover, most studies used some sort of misclassification measure to evaluate the different solutions, and did not take into account the actual financial costs associated with the fraud detection process.
* In this project, we expand the existing work on credit card fraud detection by applying different feature engineering techniques, machine learning algorithms, and model evaluation methods that take the actual financial cost into consideration.

##### Primary Questions: 
1. What are the characteristics of fraudulent transactions?
2. Is there a statistical/ML model that can accurately detect fraud through retrospective data?

##### Dataset description:
* The dataset is generated by a simulator called PaySim.
* PaySim simulator:
    * Generates a synthetic dataset from an aggregated private dataset to resemble normal transaction behavior.
    * Uses a combination of statistical and social network analysis.
    * Malicious transactions are later injected into the synthetic dataset.
    * The dataset contains 6,372,620 transactions simulated to resemble a month of data.
    * Each transaction is described using 10 features.

##### Attributes description:
* CASH-IN - Process of increasing the amount available for purchases (e.g. paying your credit card bill)
* CASH-OUT - Opposite of CASH-IN, it means to withdraw cash which decreases the amount available
* DEBIT - Similar to CASH-OUT; involves sending money to another account (e.g. preauthorized debit)
* BILL-PAYMENT - Paying online bills (e.g. hydro)
* PURCHASE - Process of sending money to another user for goods or services
* Time_Stamp - Transactions are recorded on an hourly basis (~31 days of simulation)
* Transaction_Type - CASH-IN, CASH-OUT, DEBIT, BILL-PAYMENT and PURCHASE 
* Amount - Amount of transaction in local currency
* Client_Id - ID of the client who initiated the transaction (credit card holder)
* Client_Old_Balance - Balance before transaction
* Client_New_Balance - Balance after transaction
* Merchant_Id - ID of the merchant
* Merchant_Old_Balance - Balance before transaction
* Merchant_New_Balance - Balance after transaction
* Is_Fraud - Fraudulent transaction flag (target variable)
* Is_Flagged_Fraud - Flag set for any transaction with an amount > 200,000


###  Methods

#### Data Collection
In this project, we work on a synthetic dataset generated by the PaySim simulator. The data is provided as a CSV file. There are approx. 6.3 million transactions from approx. 950K clients.

#### Feature Engineering

##### ML Models
1. Transform the Transaction_Type column into a one-hot encoding using the pandas `get_dummies()` function.
2. Remove these columns: 'Time_Stamp', 'Merchant_Id', 'Merchant_Old_Balance', 'Merchant_New_Balance', 'Is_Flagged_Fraud'. 
3. Create feature '%_of_balance' by performing transaction-wise division: Amount/Client_Old_Balance.  
4. Combine Client_Old_Balance and Client_New_Balance into one feature by taking the difference in values. Remove the Client_Old_Balance and Client_New_Balance columns.
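
The four steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the `Type` prefix, the `Balance_Diff` column name, the direction of the difference (new minus old), and the handling of a zero old balance are all assumptions.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four feature-engineering steps to a PaySim-style frame."""
    out = df.copy()

    # Step 1: one-hot encode the transaction type.
    out = pd.get_dummies(out, columns=["Transaction_Type"], prefix="Type")

    # Step 2: drop the columns that the ML models do not use.
    out = out.drop(columns=[
        "Time_Stamp", "Merchant_Id", "Merchant_Old_Balance",
        "Merchant_New_Balance", "Is_Flagged_Fraud",
    ])

    # Step 3: amount as a fraction of the balance before the transaction.
    # Assumption: a zero old balance gives inf or NaN, replaced here by 0.
    ratio = out["Amount"] / out["Client_Old_Balance"]
    out["%_of_balance"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

    # Step 4: collapse the two balance columns into their difference.
    # Assumption: the difference is taken as new minus old.
    out["Balance_Diff"] = out["Client_New_Balance"] - out["Client_Old_Balance"]
    return out.drop(columns=["Client_Old_Balance", "Client_New_Balance"])


# Tiny made-up sample with the same column names as the dataset.
sample = pd.DataFrame({
    "Time_Stamp": [1, 1, 2],
    "Transaction_Type": ["CASH-OUT", "PURCHASE", "CASH-IN"],
    "Amount": [100.0, 50.0, 20.0],
    "Client_Id": ["C1", "C2", "C3"],
    "Client_Old_Balance": [200.0, 0.0, 80.0],
    "Client_New_Balance": [100.0, 0.0, 100.0],
    "Merchant_Id": ["M1", "M2", "M3"],
    "Merchant_Old_Balance": [0.0, 10.0, 30.0],
    "Merchant_New_Balance": [100.0, 60.0, 10.0],
    "Is_Fraud": [1, 0, 0],
    "Is_Flagged_Fraud": [0, 0, 0],
})
features = engineer_features(sample)
```

Rows with a zero old balance are the main edge case; a different policy (for example a sentinel value or a separate indicator column) may suit the models better.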

##### Time-Series Models
To capture the time series information associated with a transaction, we will model the time of the transaction as a periodic variable, using the von Mises distribution (periodic normal distribution).
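
As a rough sketch of this idea, the hourly time stamp can be mapped to an angle on a 24-hour circle and a von Mises distribution fitted to those angles. The helper names below are hypothetical, the 24-hour period is an assumption about how Time_Stamp wraps, and the concentration parameter is estimated with a commonly used piecewise approximation rather than an exact maximum-likelihood solve.

```python
import numpy as np


def hours_to_angles(time_stamp, period=24.0):
    """Map an hourly step counter onto the circle [0, 2*pi)."""
    hours = np.asarray(time_stamp, dtype=float) % period
    return 2.0 * np.pi * hours / period


def fit_von_mises(angles):
    """Estimate the mean direction mu and the concentration kappa.

    Assumes the angles are not all identical (resultant length r < 1).
    """
    s, c = np.sin(angles).mean(), np.cos(angles).mean()
    mu = np.arctan2(s, c) % (2.0 * np.pi)
    r = np.hypot(s, c)  # mean resultant length, between 0 and 1
    # Piecewise approximation to the maximum-likelihood kappa.
    if r < 0.53:
        kappa = 2.0 * r + r**3 + 5.0 * r**5 / 6.0
    elif r < 0.85:
        kappa = -0.4 + 1.39 * r + 0.43 / (1.0 - r)
    else:
        kappa = 1.0 / (r**3 - 4.0 * r**2 + 3.0 * r)
    return mu, kappa


def von_mises_pdf(angles, mu, kappa):
    """Density of the von Mises distribution; np.i0 is the Bessel function I0."""
    return np.exp(kappa * np.cos(angles - mu)) / (2.0 * np.pi * np.i0(kappa))


# Made-up time stamps (hours since the start) that cluster around midnight.
stamps = [23, 24, 25, 48, 47, 49, 72, 26, 46, 96]
mu, kappa = fit_von_mises(hours_to_angles(stamps))

# Under the fitted model, midnight is far more likely than noon.
density_midnight, density_noon = von_mises_pdf(hours_to_angles([0, 12]), mu, kappa)
```

Treating the time as an angle keeps 23:00 and 01:00 close together, which a plain arithmetic mean of the hours would not.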

#### Data Preprocessing
1. Standardize all numerical columns by subtracting the column mean from each value and dividing by the column standard deviation.
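
A minimal sketch of this standardization in pandas is shown below. The `standardize` helper is hypothetical, and using the population standard deviation (`ddof=0`) is an assumption; the source does not say which estimator was used.

```python
import pandas as pd


def standardize(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy with the given columns scaled to zero mean, unit variance."""
    out = df.copy()
    for col in columns:
        mean = out[col].mean()
        std = out[col].std(ddof=0)  # assumption: population standard deviation
        # A constant column has std == 0; map it to zeros instead of dividing.
        out[col] = (out[col] - mean) / std if std > 0 else 0.0
    return out


# Made-up example: only the numerical feature is scaled, not the target.
sample = pd.DataFrame({"Amount": [10.0, 20.0, 30.0], "Is_Fraud": [0, 1, 0]})
scaled = standardize(sample, ["Amount"])
```

To avoid leaking information from the test set, the mean and standard deviation would normally be computed on the training split only and then reused for the test split.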

#### Undersample Majority Class
1. Find indices for non-fraud transactions and indices for fraud transactions. 
2. Randomly select n non-fraud indices where n equals number of fraud transactions. 
3. Concatenate non-fraud indices and fraud indices and select those transactions as your new dataset. 
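
The three steps above could look like the following sketch. The function name and the fixed random seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def undersample_majority(df: pd.DataFrame, target: str = "Is_Fraud",
                         seed: int = 0) -> pd.DataFrame:
    """Keep every fraud row plus an equally sized random sample of non-fraud rows."""
    rng = np.random.default_rng(seed)

    # Step 1: indices of each class.
    fraud_idx = df.index[df[target] == 1].to_numpy()
    non_fraud_idx = df.index[df[target] == 0].to_numpy()

    # Step 2: sample as many non-fraud indices as there are fraud rows.
    sampled_idx = rng.choice(non_fraud_idx, size=len(fraud_idx), replace=False)

    # Step 3: concatenate both index sets and select those rows.
    return df.loc[np.concatenate([sampled_idx, fraud_idx])]


# Made-up frame: 100 transactions, 5 of them fraudulent.
toy = pd.DataFrame({
    "Amount": np.arange(100, dtype=float),
    "Is_Fraud": [1] * 5 + [0] * 95,
})
balanced = undersample_majority(toy)
```

Sampling without replacement guarantees that no non-fraud transaction appears twice in the balanced dataset.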

#### Oversample Minority Class
1. TODO SHUANG

#### Model Training
1. Split the dataset into a training and test set using an 80/20 split. 
2. Train the following models on the training set, using cross-validation to select hyperparameters where appropriate:
    * Logistic Regression
    * SVM with RBF kernel
    * Random forest 
    * K Nearest Neighbours
    * LGBM Classifier
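
The training procedure can be sketched with scikit-learn as below. This is an illustration on generated toy data, not the project's actual configuration: the hyperparameter grids, the 3-fold cross-validation, the F1 scoring, and the stratified split are all assumptions. The LGBM Classifier is left out so the sketch depends on scikit-learn only; `lightgbm.LGBMClassifier` exposes the same fit/predict interface and could be added as another entry.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy stand-in for the balanced, preprocessed transaction features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Step 1: 80/20 split (stratified so both splits keep the class balance).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: each model comes with a small, illustrative hyperparameter grid.
candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
    "SVM with RBF kernel": (SVC(kernel="rbf", probability=True),
                            {"C": [1.0, 10.0]}),
    "Random forest": (RandomForestClassifier(n_estimators=50, random_state=0),
                      {"max_depth": [3, None]}),
    "K Nearest Neighbours": (KNeighborsClassifier(),
                             {"n_neighbors": [3, 5]}),
}

fitted = {}
for name, (model, grid) in candidates.items():
    # Cross-validated grid search on the training set only.
    search = GridSearchCV(model, grid, cv=3, scoring="f1")
    search.fit(X_train, y_train)
    fitted[name] = search.best_estimator_  # refit on the full training set
```

The test split is never seen during the grid search, so it remains available for an unbiased comparison of the fitted models.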

#### Model Evaluation 
Calculate the following for each of the models above: 
* Accuracy
* AUC score
* F1 score

#### Model Interpretation
* TODO...

### Results

#### Exploratory Data Analysis:
Exploratory Data Analysis has been carried out to answer the following questions:

1. What is the ratio of non-fraud transactions to fraud transactions?
2. What is the average number of transactions per client?
3. What are the counts of the different transaction types among fraudulent transactions? Use a histogram.
4. How many unique merchant IDs are there? Are there any common merchants for fraud transactions?
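
Each of the questions above maps to a short pandas aggregation. The sketch below uses a tiny made-up frame with the dataset's column names; the variable names are hypothetical, and "common merchant" is interpreted here as a merchant that appears in more than one fraudulent transaction.

```python
import pandas as pd

# Made-up transactions using the dataset's column names.
tx = pd.DataFrame({
    "Client_Id": ["C1", "C2", "C2", "C3", "C4", "C5"],
    "Merchant_Id": ["M1", "M1", "M2", "M3", "M1", "M2"],
    "Transaction_Type": ["CASH-OUT", "PURCHASE", "CASH-IN",
                         "DEBIT", "CASH-OUT", "PURCHASE"],
    "Is_Fraud": [1, 0, 0, 0, 1, 0],
})
fraud = tx[tx["Is_Fraud"] == 1]

# 1. Ratio of non-fraud to fraud transactions.
n_fraud = len(fraud)
ratio = (len(tx) - n_fraud) / n_fraud

# 2. Average number of transactions per client.
avg_per_client = tx.groupby("Client_Id").size().mean()

# 3. Counts of each transaction type among fraudulent transactions.
#    fraud_type_counts.plot(kind="bar") would draw the chart (needs matplotlib).
fraud_type_counts = fraud["Transaction_Type"].value_counts()

# 4. Number of unique merchants, and merchants seen in more than one fraud.
n_merchants = tx["Merchant_Id"].nunique()
per_merchant = fraud["Merchant_Id"].value_counts()
repeat_fraud_merchants = per_merchant[per_merchant > 1].index.tolist()
```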

By exploring the dataset, we've come up with insights that can help with data preprocessing, feature engineering, model selection, and model evaluation.

* First, the data is severely imbalanced: fraudulent transactions make up only 0.3% of the dataset (approx. 18,000). We therefore explored resampling techniques such as under-sampling and over-sampling to obtain an equal number of positive and negative cases.
* Also, the proportion of clients with more than one recorded transaction is very small (around 1.5%), so it is not feasible to fit a time-series model to each client's transaction data.
* In addition, all fraudulent transactions were one of two types: CASH-OUT and PURCHASE. We might therefore explore how filtering incoming transactions by type can increase prediction accuracy.

(TODO...)

#### Model Evaluation:

| Model | Accuracy | AUC | F1 Score |
| --- | --- | --- | --- |
| Logistic Regression | 0.9607430453879942 | 0.9946000805218244 | 0.9604353038826893 |
| SVM with RBF kernel | 0.9652269399707174 | 0.9949346111874423 | 0.9648343512863226 |
| Random forest | 0.9989019033674963 | 0.9995288267461078 | 0.9989000916590285 |
| K Nearest Neighbours (k=5) | 0.9808748169838946 | 0.993928021393885 | 0.9810224280395896 |
| LGBM Classifier | 0.9990894509763931 | 0.9998040363109248 | 0.9990882827791965 |


### Discussion
(TODO...)

### Limitations
(TODO...)