# Catching Criminals with Math
### Identification of fraudulent credit card transactions using statistical and machine learning models
#### April 8 2019

#### Authors: Angad S. Kalra, Pulkit Mathur and Shuang Di
#### Collaborators: Shobhit Jain and Adam Rahman

### Introduction

##### Problem Statement: 
* Every year billions of dollars are lost worldwide due to credit card fraud, which forces financial institutions to continuously improve their fraud detection systems. In recent years, several studies have proposed the use of machine learning and data mining techniques to address this problem.
* When constructing a credit card fraud detection model, it is very important to extract the right features from the transactional data. However, this has not been addressed much in previous studies.
* Moreover, most studies used some sort of misclassification measure to evaluate the different solutions, and do not take into account the actual financial costs associated with the fraud detection process.
* In this project, we expand the existing work on credit card fraud detection by applying different feature engineering techniques, machine learning algorithms, and model evaluation methods with actual financial cost taken into consideration.

##### Primary Questions: 
1. What are the characteristics of fraudulent transactions?
2. Is there a statistical/ML model that can accurately detect fraud through retrospective data?

##### Dataset Description:
* The dataset is generated by a simulator called PaySim.
* PaySim simulator:
    * Generates a synthetic dataset from an aggregated private dataset to resemble normal transaction behavior.
    * Uses a combination of statistical and social network analysis.
    * Malicious transactions are later injected into the synthetic dataset.
    * The dataset contains 6,372,620 transactions simulated to resemble a month of data.
    * Each transaction is described using 10 features.

##### Attributes Description:
* CASH-IN - Process of increasing the amount available for purchases (e.g. paying your credit card bill)
* CASH-OUT - Opposite of CASH-IN, it means to withdraw cash which decreases the amount available
* DEBIT - Similar to CASH-OUT; involves sending money to another account (e.g. preauthorized debit)
* BILL-PAYMENT - Paying online bills (e.g. hydro)
* PURCHASE - Process of sending money to another user for goods or services
* Time_Stamp - Transactions recorded on an hourly basis (~31 days of simulation)
* Transaction_Type - CASH-IN, CASH-OUT, DEBIT, BILL-PAYMENT and PURCHASE 
* Amount - Amount of transaction in local currency
* Client_Id - ID of the client who initiated the transaction (credit card holder)
* Client_Old_Balance - Balance before transaction
* Client_New_Balance - Balance after transaction
* Merchant_Id - ID of the merchant
* Merchant_Old_Balance - Balance before transaction
* Merchant_New_Balance - Balance after transaction
* Is_Fraud - Fraudulent transaction flag (target variable)
* Is_Flagged_Fraud - Any transaction amount > 200,000


###  Methods

#### Data Collection
In this project, we work on a synthetic dataset generated using the simulator called PaySim. The data was provided to us as a CSV file. There are approx. 6.3 million transactions from approx. 950K clients.

#### Feature Engineering

##### ML Models
1. Transform Transaction_Type column into a one-hot encoding using Pandas get_dummies() function.
2. Remove these columns: 'Time_Stamp', 'Merchant_Id', 'Is_Flagged_Fraud'. 
3. Create feature '%_of_balance' by performing transaction-wise division: Amount/Client_Old_Balance.  
4. Combine Client_Old_Balance and Client_New_Balance into one feature by taking the difference in values. Remove Client_Old_Balance and Client_New_Balance Columns.
5. Combine Merchant_Old_Balance and Merchant_New_Balance into one feature by taking the difference in values. Remove Merchant_Old_Balance and Merchant_New_Balance columns. 
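
The five steps above can be sketched with pandas as follows. This is a minimal illustration rather than the exact project code: the function name `engineer_features`, the new column names `client_bal_diff` and `merch_bal_diff`, the direction of the differences (new minus old), and the handling of a zero `Client_Old_Balance` (ratio set to 0) are all assumptions.

```python
import numpy as np
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the five feature-engineering steps to a transactions frame."""
    out = df.copy()

    # Step 1: one-hot encode the transaction type.
    out = pd.get_dummies(out, columns=["Transaction_Type"])

    # Step 2: drop the columns that the ML models do not use.
    out = out.drop(columns=["Time_Stamp", "Merchant_Id", "Is_Flagged_Fraud"])

    # Step 3: transaction amount as a fraction of the client's old balance.
    # Assumption: a zero old balance yields inf/NaN, which we replace with 0.
    ratio = out["Amount"] / out["Client_Old_Balance"]
    out["%_of_balance"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0.0)

    # Steps 4 and 5: replace each balance pair by its difference.
    # Assumption: the difference is taken as new minus old.
    out["client_bal_diff"] = out["Client_New_Balance"] - out["Client_Old_Balance"]
    out["merch_bal_diff"] = out["Merchant_New_Balance"] - out["Merchant_Old_Balance"]
    return out.drop(
        columns=[
            "Client_Old_Balance",
            "Client_New_Balance",
            "Merchant_Old_Balance",
            "Merchant_New_Balance",
        ]
    )
```

`Client_Id` is left in place here because the listed steps do not drop it.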

##### Time-Series Models
To capture the time series information associated with a transaction, we will model the time of the transaction as a periodic variable, using the von Mises distribution (periodic normal distribution).
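
To make the idea of a periodic time variable concrete, the sketch below maps an hour of day onto a circle and evaluates the von Mises density around a mean direction. It assumes a 24-hour period and uses hypothetical helper names (`hour_to_angle`, `circular_mean`, `von_mises_pdf`); the concentration parameter `kappa` is passed in by hand, since this write-up does not specify how it would be estimated.

```python
import math


def hour_to_angle(hour: float, period: float = 24.0) -> float:
    """Map a time of day onto the circle [0, 2*pi)."""
    return 2.0 * math.pi * (hour % period) / period


def circular_mean(angles) -> float:
    """Mean direction of a collection of angles, returned in [0, 2*pi)."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % (2.0 * math.pi)


def bessel_i0(x: float, terms: int = 50) -> float:
    """Modified Bessel function of the first kind, order 0 (power series)."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total


def von_mises_pdf(theta: float, mu: float, kappa: float) -> float:
    """Density of the von Mises distribution with mean mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))
```

With this encoding, transactions at 23:00 and 01:00 are two hours apart on the circle, and their mean direction is midnight rather than noon, which is the behaviour an ordinary arithmetic mean of the hours would get wrong.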

#### Data Preprocessing
1. Standardize all numerical columns by subtracting the column mean from each value and dividing by the column standard deviation.
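
As a small illustration of this step, the function below standardizes one column of values. It is a sketch only: the name `standardize` is hypothetical, and the use of the population standard deviation (and returning zeros for a constant column) are assumptions.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation for each value."""
    mu = fmean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]
```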

#### Undersample Majority Class
1. Find indices for non-fraud transactions and indices for fraud transactions. 
2. Randomly select n non-fraud indices where n equals number of fraud transactions. 
3. Concatenate n non-fraud indices and fraud indices and select those transactions as your new dataset. 
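
The three steps above can be sketched on a plain list of labels. The function name `undersample_indices`, the 0/1 label encoding, and the fixed random seed are assumptions made for illustration.

```python
import random


def undersample_indices(labels, seed=0):
    """Return row indices with all fraud rows and an equal number of non-fraud rows."""
    rng = random.Random(seed)
    # Step 1: split the row indices by class (1 = fraud, 0 = non-fraud).
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    non_fraud_idx = [i for i, y in enumerate(labels) if y == 0]
    # Step 2: draw n non-fraud rows without replacement, n = number of fraud rows.
    kept_non_fraud = rng.sample(non_fraud_idx, k=len(fraud_idx))
    # Step 3: combine both groups and shuffle the row order.
    selected = fraud_idx + kept_non_fraud
    rng.shuffle(selected)
    return selected
```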

#### Oversample Minority Class
1. Find indices for non-fraud transactions and indices for fraud transactions. 
2. Randomly sample n fraud transactions (with replacement) where n equals number of non-fraud transactions.
3. Concatenate non-fraud indices and n fraud indices and select those transactions as your new dataset.
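
A matching sketch for oversampling, under the same assumptions as before (hypothetical function name `oversample_indices`, 0/1 labels, fixed seed):

```python
import random


def oversample_indices(labels, seed=0):
    """Return row indices with all non-fraud rows and an equal number of resampled fraud rows."""
    rng = random.Random(seed)
    # Step 1: split the row indices by class (1 = fraud, 0 = non-fraud).
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    non_fraud_idx = [i for i, y in enumerate(labels) if y == 0]
    # Step 2: draw fraud rows with replacement until the classes are balanced.
    resampled_fraud = rng.choices(fraud_idx, k=len(non_fraud_idx))
    # Step 3: combine both groups and shuffle the row order.
    selected = non_fraud_idx + resampled_fraud
    rng.shuffle(selected)
    return selected
```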

#### Model Training
1. Split the resampled dataset into a training and test set using an 80/20 split. Ensure that the ratio of negative to positive examples is roughly 50-50 in both. 
2. Train the following models on the training set using 5-fold cross-validation for selection of hyperparameters when appropriate: 
    * Regularized Logistic Regression
    * SVM with RBF kernel
    * Random Forest 
    * K Nearest Neighbours
    * LGBM Classifier
    * XGBoost
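
Step 1 amounts to a stratified split: each class is split 80/20 separately so that both sets keep the same class ratio. The sketch below shows the idea on a list of labels; `stratified_split` is a hypothetical helper, not the project code. In practice, scikit-learn's `train_test_split` with its `stratify` argument and `GridSearchCV` with `cv=5` cover steps 1 and 2.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split row indices into train/test while preserving the class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the rows of one class and cut off the test share.
        cls_idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(cls_idx)
        n_test = round(len(cls_idx) * test_fraction)
        test_idx.extend(cls_idx[:n_test])
        train_idx.extend(cls_idx[n_test:])
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx
```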

#### Model Evaluation 
Calculate the following for each of the models above using the predicted labels and true labels of the test set: 
* Accuracy
* AUC score
* Precision
* Recall
* F1 score
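
For reference, the label-based metrics follow directly from the confusion-matrix counts. The sketch below assumes 0/1 labels with 1 meaning fraud; the function name `classification_metrics` is hypothetical. AUC is left out because it is computed from predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-fraud passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```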

### Results

#### Exploratory Data Analysis:

1. What is the ratio of non-fraud transactions to fraud transactions?
2. What is the average number of transactions per client?
3. What are the counts for different transaction types for fraud transactions?
4. For fraud and non-fraud transactions, what is the range for Amount and Amount/Client_Old_Balance?

By exploring the dataset, we've come up with insights that can help with data preprocessing, feature engineering, model selection, and model evaluation.

* The data is severely imbalanced: fraud transactions make up only 0.3% of the dataset (approx. 18,000 transactions). We therefore explored resampling techniques such as under-sampling and over-sampling so that training uses an equal number of positive and negative cases.
* The proportion of clients with more than one recorded transaction is very small (around 1.5%); thus it is not feasible to fit a time series model on each client's transaction data.
* All fraud transactions are one of two types: Cash-Out and Purchase. Thus, we might explore how filtering incoming transactions can increase prediction accuracy.
* Fraud transaction amounts look very normal per client and are not outliers in terms of range of Amount or of Amount/Client Old Balance. 

#### Model Evaluation:

| Model | Accuracy | AUC | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- | 
| Logistic Regression | 0.96074 | 0.99460 | 0.96695 | 0.95401 | 0.96044 |
| SVM with RBF kernel | 0.96523 | 0.99493 | 0.97476 | 0.95511 | 0.96483 |
| Random forest | 0.99890 | 0.99953 | 0.99963 | 0.99835 | 0.99890 |
| K Nearest Neighbours (k=5) | 0.98087 | 0.99393 | 0.98087 | 0.98974 | 0.98102 |
| LGBM Classifier | 0.99909 | 0.99980 | 1.0 | 0.99976 | 0.99909 |
| XGBoost | 0.99890 | 0.99966 | 1.0 | 0.99780 | 0.99890 |


#### Model Interpretability:

##### Random Forest
<img src="RF_interpret.png" width="300">

The Random Forest classifier is one of the top-performing models, and from the image above we can see that the top three features the model uses for classification are "Amount", "%\_of_Balance", and "client_bal_diff". This makes sense because we know from our EDA that fraud transactions try to look inconspicuous with regard to Amount. Thus, information from these three features is important to RF when classifying.  

##### Logistic Regression
<img src="RLR_interpret.png" width="300">

Regularized Logistic Regression is interpretable due to the weights assigned to each feature of the model. It performed very well and we can see that %\_Of_Balance, Client_Bal_Diff, and Merch_Bal_Diff are very important when classifying a transaction as non-fraudulent. In addition, Cash_Out and Purchase are very important for classifying fraud transactions. This aligns well with our EDA where we discovered characteristics of fraud transactions. 

##### LIME on KNN
"The value is not in software, the value is in data, and this is really important for every single company, that they understand what data they’ve got."

LIME (Local Interpretable Model-agnostic Explanations) is an algorithm that can explain the predictions of any classifier or regressor in a faithful way by approximating it locally with an interpretable model. LIME treats every model as a black box and interprets its output by perturbing the model's input and observing how the prediction changes. For each feature, it reports the range of values that drives that feature's influence on the prediction.

The LIME framework was used to explain the results of the KNN classifier on standardized input:
<img src ="Normalized_data.png" width ="300">

<img src ="Lime_output.png" width ="600">

As per the above results, features like "Client_change_in_balance" and "Merchant_change_in_balance" have a negative influence, while features like "Client_New_Balance" and "Transaction_Type_CASH_IN" have a positive influence on predicting a transaction as fraudulent. 

##### SHAP with XGBoost
SHAP (SHapley Additive exPlanations) leverages the idea of Shapley values for model feature influence scoring. The technical definition of a Shapley value is the “average marginal contribution of a feature value over all possible coalitions.” In other words, Shapley values consider all possible predictions for an instance using all possible combinations of inputs. Because of this exhaustive approach, SHAP can guarantee properties like consistency and local accuracy.
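
To illustrate the definition, the sketch below computes exact Shapley values for a toy value function by averaging each feature's marginal contribution over all orderings. Everything here is hypothetical: the feature names, the `toy_value` scoring function, and the brute-force approach, which is only practical for a handful of features. The SHAP library itself relies on faster model-specific or approximate algorithms rather than this enumeration.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            # Marginal contribution of f when it joins the current coalition.
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_value(coalition):
    """Hypothetical model output when only the given features are known."""
    score = 0.0
    if "amount" in coalition:
        score += 1.0
    if "pct_of_balance" in coalition:
        score += 2.0
    if {"amount", "pct_of_balance"} <= coalition:
        score += 1.0  # interaction between the two features
    return score


phi = shapley_values(["amount", "pct_of_balance", "txn_type"], toy_value)
```

The interaction term is shared equally between the two features involved, and the values sum to the difference between the full prediction and the empty-coalition baseline, which is the local accuracy property mentioned above.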

As with the LIME framework, the SHAP framework was used for model interpretability:

<img src ="shap_output.png" width ="600">

As per the above results, "Percentage of Balance", "Client change in Balance", and "Client New Balance" are the top 3 factors for classifying a transaction as fraudulent for the XGBoost classifier.

### Discussion
By experimenting with different machine learning algorithms on the synthetic financial dataset, we found that Random Forest and LightGBM yield the best performance, with a good balance between classification accuracy and training time. Also, transaction type and %\_of_bal (i.e., transaction_amount/client_old_balance) are the two key predictors of a fraudulent transaction. Specifically, transactions that are "normal" amounts of type “cash-out” or “purchase” are more likely to be fraudulent.

### Limitations

#### Model evaluation
In our study we used only misclassification measures to evaluate the different solutions and did not take into account the actual financial costs associated with the fraud detection process. In future research we can develop a cost-based measure for evaluating credit card fraud detection models that accounts for the different financial costs incurred by the fraud detection process.

#### Dataset
The dataset we used was generated by a simulator and is not real transaction data. This could explain the unusually high performance by all the models. We are curious to know how our models would perform on actual transactions.  