<img src="images/nyp_ago_logo.png" width='400'/>

# Fraud Detection 

In this exercise, we will build a financial fraud detection model. The model is a binary classifier that labels each transaction as either non-fraud (negative case) or fraud (positive case).

Publicly available datasets on financial services are scarce, especially in the emerging mobile money transactions domain. We will use a synthetic dataset called PaySim. PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. You can find out more about how the data is generated in the [paper](http://www.msc-les.org/proceedings/emss/2016/EMSS2016_249.pdf).

Here is a description of the columns in the PaySim dataset:

|Field|Description|
|-----|-----|
|step|Maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation).|
|type|CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER|
|amount|Amount of the transaction in local currency|
|nameOrig|Customer who started the transaction|
|oldbalanceOrg|Initial balance before the transaction|
|newbalanceOrig|New balance after the transaction|
|nameDest|Customer who is the recipient of the transaction|
|oldbalanceDest|Initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants)|
|newbalanceDest|New balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants)|
|isFlaggedFraud|The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction|
|isFraud|Marks the transactions made by the fraudulent agents inside the simulation. In this specific dataset, the fraudulent agents aim to profit by taking control of customers' accounts and trying to empty the funds by transferring them to another account and then cashing out of the system|


## Import Packages

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE, BorderlineSMOTE
from xgboost.sklearn import XGBClassifier
from xgboost import plot_importance, to_graphviz
import lightgbm as lgb

In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

## Import the Data 

In [None]:
df = pd.read_csv('Fraud.csv')

In [None]:
df.info()

For consistency, let's correct the spelling and capitalization of the original balance column headers.

In [None]:
df = df.rename(columns={'oldbalanceOrg':'oldBalanceOrig', 'newbalanceOrig':'newBalanceOrig', \
                        'oldbalanceDest':'oldBalanceDest', 'newbalanceDest':'newBalanceDest'})
df.head()

Let's check for any missing values. It turns out there are no obvious missing values but, as we will see below, the value 0 may be used as a proxy when no data is available (e.g. in newBalanceOrig, newBalanceDest, etc.).
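
The idea that zeros can stand in for missing values can be sketched on a few made-up rows. The toy values below are hypothetical and only mimic the renamed PaySim balance columns; `isna()` reports nothing, while an explicit count of zeros does.

```python
import pandas as pd

# Hypothetical toy rows (values are made up, not taken from Fraud.csv).
toy = pd.DataFrame({
    "amount":         [100.0, 250.0, 80.0, 40.0],
    "oldBalanceDest": [0.0, 500.0, 0.0, 10.0],
    "newBalanceDest": [0.0, 750.0, 0.0, 50.0],
})

# isna() finds no missing values in this frame ...
n_missing = int(toy.isna().sum().sum())

# ... but counting exact zeros per column exposes the suspicious entries.
zero_counts = (toy == 0).sum()

print(n_missing)
print(zero_counts)
```

The same `(df[cols] == 0).sum()` pattern could be applied to the balance columns of the real dataframe.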

In [None]:
df.isna().sum()

## Exploratory Data Analysis

In this section, we will do some data-wrangling to gain more insights into the dataset.

**Exercise** 

Let's first find out how many different types of transactions there are in the dataset.

In [None]:
## complete the code



#### Which types of transactions are fraudulent? 

We would like to find out which types of transactions are fraudulent. Of the various types of transactions, fraud occurs in only two of them:
'TRANSFER', where money is sent to a customer / fraudster, and 'CASH_OUT', where money is sent to a merchant who pays the customer / fraudster in cash. 
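
As a rough sketch of how such a per-type breakdown can be computed in one step, the example below groups a small hypothetical frame by `type` and aggregates the `isFraud` flag. The rows are invented for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy transactions (made-up rows, same column names as the dataset).
toy = pd.DataFrame({
    "type":    ["PAYMENT", "TRANSFER", "CASH_OUT", "TRANSFER", "CASH_IN", "CASH_OUT"],
    "isFraud": [0, 1, 0, 0, 0, 1],
})

# Number of frauds and fraud rate for every transaction type.
summary = toy.groupby("type")["isFraud"].agg(fraud_count="sum", fraud_rate="mean")

# Types that contain at least one fraudulent transaction.
fraud_types = sorted(summary.index[summary["fraud_count"] > 0])

print(summary)
print(fraud_types)
```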

In [None]:
df.loc[df.isFraud == 1].type.unique()

Let's find the percentage of fraudulent transactions that belong to TRANSFER type. 

In [None]:
num_fraudulent_xfer = len(df.loc[ (df.isFraud == 1) & (df.type == 'TRANSFER')])
percentage_fraudulent_xfer = num_fraudulent_xfer / len(df.loc[df.isFraud == 1])
print(f'number of fraudulent TRANSFER = {num_fraudulent_xfer}')
print(f'percentage of fraudulent TRANSFER = {percentage_fraudulent_xfer}')

**Exercise**

1. How many fraudulent CASH_OUT transactions are there?  
2. What percentage of fraudulent transactions are of CASH_OUT type?
3. What do you observe? 

<details><summary>Click here for solution</summary>
<br/>
    
```python 
num_fraudulent_cash_out = len(df.loc[ (df.isFraud == 1) & (df.type == 'CASH_OUT')])
# Divide by the number of fraudulent transactions (as done for TRANSFER above)
percentage_fraudulent_cash_out = num_fraudulent_cash_out / len(df.loc[df.isFraud == 1])
print(f'number of fraudulent CASH_OUT = {num_fraudulent_cash_out}')
print(f'percentage of fraudulent CASH_OUT = {percentage_fraudulent_cash_out}')
```
<br/>
We observe that the number of fraudulent TRANSFERs almost equals the number of fraudulent CASH_OUTs. These observations appear to bear out the description provided on Kaggle that fraud in this dataset is committed by first transferring funds to another account, which subsequently cashes them out.

</details>


In [None]:
## complete the code 


#### Are there account labels common to fraudulent TRANSFERs and CASH_OUTs?

From the data description, the modus operandi for committing fraud involves first making a TRANSFER to a (fraudulent) account, which in turn conducts a CASH_OUT. CASH_OUT involves transacting with a merchant who pays out cash. Thus, within this two-step process, the fraudulent account would be both the destination in a TRANSFER and the originator in a CASH_OUT.

In [None]:
dfFraudTransfer = df.loc[(df.isFraud == 1) & (df.type == 'TRANSFER')]
dfFraudCashout = df.loc[(df.isFraud == 1) & (df.type == 'CASH_OUT')]

print('\nOf all fraudulent transactions, destinations for TRANSFERS that are also originators for CASH_OUTs? {}'.format(\
(dfFraudTransfer.nameDest.isin(dfFraudCashout.nameOrig)).any())) # False

However, our analysis above showed that there are no such common accounts among fraudulent transactions. Thus, the data is not imprinted with the expected modus operandi. We can therefore drop nameDest and nameOrig from the features used for modelling later.
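
To make the check concrete, here is a minimal sketch on two hypothetical frames in which the expected two-step pattern *is* present. The account names are made up; the point is only to show what `isin` would return if a TRANSFER destination later appeared as a CASH_OUT originator.

```python
import pandas as pd

# Hypothetical accounts: C9 receives a TRANSFER and later originates a CASH_OUT.
transfers = pd.DataFrame({"nameOrig": ["C1", "C2"], "nameDest": ["C9", "C8"]})
cash_outs = pd.DataFrame({"nameOrig": ["C9", "C7"], "nameDest": ["C5", "C6"]})

# Rows of `transfers` whose destination is also a CASH_OUT originator.
linked = transfers[transfers["nameDest"].isin(cash_outs["nameOrig"])]
has_common_account = bool(transfers["nameDest"].isin(cash_outs["nameOrig"]).any())

print(linked)
print(has_common_account)
```

On this toy data the check is positive, in contrast to the result obtained on the fraudulent transactions of the real dataset.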

#### Are destination accounts with zero balances before and after a non-zero amount is transacted normal?

The data has several transactions with zero balances in the destination account both before and after a non-zero amount is transacted. 

Let's find out how many of these transactions, of type TRANSFER/CASH_OUT, are actually fraudulent. 

In [None]:
dfFraudTransferCashOut = df.loc[(df.isFraud == 1) & ((df.type == 'TRANSFER') | (df.type == 'CASH_OUT'))]
dfNonFraudTransferCashOut = df.loc[(df.isFraud == 0) & ((df.type == 'TRANSFER') | (df.type == 'CASH_OUT'))]

print('\nThe fraction of fraudulent transactions with \'oldBalanceDest\' = \
\'newBalanceDest\' = 0 although the transacted \'amount\' is non-zero is: {}'.\
format(len(dfFraudTransferCashOut.loc[(dfFraudTransferCashOut.oldBalanceDest == 0) & \
(dfFraudTransferCashOut.newBalanceDest == 0) & (dfFraudTransferCashOut.amount != 0)]) / (1.0 * len(dfFraudTransferCashOut))))

**Exercise** 

Now find out how many of these transactions (of type TRANSFER/CASH_OUT) are genuine (non-fraudulent) transactions. 

<details><summary>Click here for solution</summary>
<br/>

```python 
print('\nThe fraction of genuine transactions with \'oldBalanceDest\' = \
\'newBalanceDest\' = 0 although the transacted \'amount\' is non-zero is: {}'.\
format(len(dfNonFraudTransferCashOut.loc[(dfNonFraudTransferCashOut.oldBalanceDest == 0) & \
(dfNonFraudTransferCashOut.newBalanceDest == 0) & (dfNonFraudTransferCashOut.amount != 0 )]) / (1.0 * len(dfNonFraudTransferCashOut))))
```
<br/>
</details>

In [None]:
## Complete the code 



The fraction of such transactions, where zero likely denotes a missing value, is much larger among fraudulent transactions (50%) than among genuine ones (0.06%). This suggests that a 0 in both oldBalanceDest and newBalanceDest is a strong indicator of fraud.

In [None]:
print("The fraction of fraudulent transactions with 'oldBalanceOrig' = 'newBalanceOrig' = 0 although the transacted 'amount' is non-zero is: {}".
      format(len(dfFraudTransferCashOut.loc[(dfFraudTransferCashOut.oldBalanceOrig == 0) & 
                                            (dfFraudTransferCashOut.newBalanceOrig == 0) & 
                                            (dfFraudTransferCashOut.amount != 0)]) / (1.0 * len(dfFraudTransferCashOut))))

print("The fraction of genuine transactions with 'oldBalanceOrig' = 'newBalanceOrig' = 0 although the transacted 'amount' is non-zero is: {}".
      format(len(dfNonFraudTransferCashOut.loc[(dfNonFraudTransferCashOut.oldBalanceOrig == 0) & 
                                               (dfNonFraudTransferCashOut.newBalanceOrig == 0) & 
                                               (dfNonFraudTransferCashOut.amount != 0 )]) / (1.0 * len(dfNonFraudTransferCashOut))))

### Feature-engineering

Motivated by the possibility that zero balances help differentiate between fraudulent and genuine transactions, we create 2 new features (columns) recording the balance errors in the originating and destination accounts for each transaction.
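
A small worked example may help to read these features. It uses the same two formulas as the cell below on two hypothetical rows: one where the balances are consistent with the amount, and one where the destination balance stays at zero.

```python
import pandas as pd

# Hypothetical rows (made-up values). Row 0 is internally consistent;
# in row 1 the destination balance is 0 before and after the transaction.
toy = pd.DataFrame({
    "amount":         [30.0, 30.0],
    "oldBalanceOrig": [100.0, 100.0],
    "newBalanceOrig": [70.0, 70.0],
    "oldBalanceDest": [0.0, 0.0],
    "newBalanceDest": [30.0, 0.0],
})

# Same definitions as in the notebook.
toy["errorBalanceOrig"] = toy.newBalanceOrig + toy.amount - toy.oldBalanceOrig
toy["errorBalanceDest"] = toy.oldBalanceDest + toy.amount - toy.newBalanceDest

print(toy[["errorBalanceOrig", "errorBalanceDest"]])
```

A consistent transaction gives an error of 0 on both sides, while a destination balance that was never updated gives an error equal to the transacted amount.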

In [None]:
df['errorBalanceOrig'] = df.newBalanceOrig + df.amount - df.oldBalanceOrig
df['errorBalanceDest'] = df.oldBalanceDest + df.amount - df.newBalanceDest

### Correlation of features to the target label

Let's find out the correlation of each of our numerical features with the target label, for all TRANSFER/CASH_OUT transactions.

In [None]:
dfTransferCashOut = df[ (df.type == 'TRANSFER') | (df.type == 'CASH_OUT') ] 

In [None]:
# Restrict to numeric columns: 'type', 'nameOrig' and 'nameDest' are strings
corr_matrix = dfTransferCashOut.select_dtypes(include='number').corr()

In [None]:
corr_matrix['isFraud'].abs().sort_values(ascending=False)

## Create Train/Test Set 

From the exploratory data analysis (EDA), we know that fraud only occurs in 'TRANSFER's and 'CASH_OUT's. So we create the train/test set only from those transactions. We also drop nameOrig and nameDest, as our EDA suggests that they are not relevant for predicting whether a transaction is fraudulent. Finally, we need to convert the type values TRANSFER and CASH_OUT to numeric values.

**Exercise**

1. Create a dataframe that consists of TRANSFER/CASH_OUT transactions only
2. Drop the following features 'nameOrig', 'nameDest' 
3. Map the type TRANSFER to numeric value 0, and CASH_OUT to numeric value 1
4. Create features (X) and labels (y) 
5. Create a stratified train/test split with an 80:20 ratio 
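
To illustrate what "stratified" means in step 5, the sketch below splits a hypothetical label vector with 5% positives. The arrays `X_toy` and `y_toy` are made up for the illustration; passing `stratify` keeps the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 50 positives out of 1000 samples.
y_toy = np.array([1] * 50 + [0] * 950)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Both splits keep the 5% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a rare class can end up over- or under-represented in the test split purely by chance.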

<br/>
<details><summary>Click here for solution</summary>

<br/>

```python
#1. Create a dataframe that consists of TRANSFER/CASH_OUT transactions only

dfTransferCashOut = df.loc[(df.type == 'TRANSFER') | (df.type == 'CASH_OUT')]

#2. Drop 'nameOrig' and 'nameDest' (this solution also drops 'isFlaggedFraud')
dfTransferCashOut = dfTransferCashOut.drop(['nameOrig', 'nameDest', 'isFlaggedFraud'], axis = 1)

#3. Map the type TRANSFER to numeric value 0, and CASH_OUT to numeric value 1
dfTransferCashOut['type'] = dfTransferCashOut['type'].apply(lambda x: 0 if x == 'TRANSFER' else 1)

#4. create features (X) and labels (y) 
y = dfTransferCashOut['isFraud'] 
X = dfTransferCashOut.drop('isFraud', axis=1)

# 5. Create a stratified train/test split with an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=y, random_state = 49)

```
</details>

In [None]:
## Complete the code 

# 1. Create a dataframe that consists of TRANSFER/CASH_OUT transactions only



# 2. Drop the following features 'nameOrig', 'nameDest'


# 3. Map the type TRANSFER to numeric value 0, and CASH_OUT to numeric value 1



# 4. create features (X) and labels (y) 



# 5. Create train/test split of 80:20 ratio 




In [None]:
X_train.head()

## Modeling 

As we can see below, the data is highly imbalanced. 

In [None]:
y.value_counts()

As the data is highly imbalanced, we will use the area under the precision-recall curve (AUPRC, computed here as average precision) as the metric for measuring the performance of the classifier.
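
A short sketch of why accuracy is a poor choice here, using hypothetical labels with 1% positives: a model that never predicts fraud still reaches 99% accuracy, while the average precision of an uninformative (constant) score is only as high as the share of positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Hypothetical labels: 10 frauds among 1000 transactions.
y_true = np.array([1] * 10 + [0] * 990)

# A "classifier" that always predicts non-fraud.
always_negative = np.zeros_like(y_true)
acc = accuracy_score(y_true, always_negative)

# An uninformative model that gives every transaction the same score.
constant_scores = np.full(len(y_true), 0.5)
ap = average_precision_score(y_true, constant_scores)

print(acc, ap)
```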

### Linear Models

Let's train a logistic regression model to classify transactions as fraud or non-fraud.

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train)

In [None]:
test_probs = lr_clf.predict_proba(X_test)
print('AUPRC for test set = {}'.format(average_precision_score(y_test, test_probs[:, 1])))

In [None]:
train_probs = lr_clf.predict_proba(X_train)
print('AUPRC for train set = {}'.format(average_precision_score(y_train, train_probs[:, 1])))

In [None]:
test_preds = lr_clf.predict(X_test)
print(classification_report(y_test, test_preds))

The performance is not great. The recall and precision for the fraud class are rather disappointing. Let's give a higher weight to the minority class and try again. 
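
For intuition about what `class_weight='balanced'` does, the sketch below computes the weights on a hypothetical label vector with scikit-learn's `compute_class_weight` and compares them with the formula `n_samples / (n_classes * np.bincount(y))` given in the scikit-learn documentation. The labels in `y_toy` are made up.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 non-fraud, 10 fraud.
y_toy = np.array([0] * 990 + [1] * 10)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# Manual version of the "balanced" heuristic.
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(weights)  # the minority class gets a much larger weight
```

Each fraud sample therefore counts far more in the loss than each non-fraud sample, which pushes the model towards higher recall on the fraud class.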

In [None]:
lr_clf = LogisticRegression(class_weight='balanced')
lr_clf.fit(X_train, y_train)

In [None]:
train_scores = lr_clf.decision_function(X_train)
print('AUPRC for train set = {}'.format(average_precision_score(y_train, train_scores)))
test_scores = lr_clf.decision_function(X_test)
print('AUPRC for test set = {}'.format(average_precision_score(y_test, test_scores)))

In [None]:
test_preds = lr_clf.predict(X_test)
print(classification_report(y_test, test_preds))

**Exercise**

What do you observe about the precision and recall of the fraud class? 

<details><summary>Click here for answer</summary>
    
It seems that by placing more weight on the 'fraud' class, we have managed to improve the recall for 'fraud', but at the cost of more false positives (lower precision). 

</details>

### Oversampling 

It is difficult for our model to learn from a highly skewed dataset such as this, as there are simply too few fraud samples (minority class) compared to non-fraud samples (majority class). We can use an oversampling technique to create more samples for the minority class and make the dataset more balanced. SMOTE and its variants are one such family of techniques. 

Note that we should perform oversampling on the train split only, so that the test set keeps the original class distribution. 
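
The core idea behind SMOTE, creating a synthetic point on the line segment between a minority sample and one of its nearest minority neighbours, can be sketched in plain NumPy. This is a simplified illustration under that assumption, not the imbalanced-learn implementation; the helper `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_like_samples(points, n_new, k=2, rng=rng):
    """Simplified SMOTE-style interpolation (illustrative helper only)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(points))                 # pick a minority sample
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]       # k nearest, skipping itself
        j = rng.choice(neighbours)                    # pick one neighbour
        lam = rng.random()                            # position on the segment
        synthetic.append(points[i] + lam * (points[j] - points[i]))
    return np.array(synthetic)

new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Because every synthetic point is an interpolation between two existing minority points, the new samples stay inside the region already occupied by the minority class.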

In [None]:
#over_sample = SMOTE(sampling_strategy='auto')
over_sample = BorderlineSMOTE(sampling_strategy='auto')
X_train_oversampled,y_train_oversampled = over_sample.fit_resample(X_train,y_train)
print(y_train_oversampled.value_counts()) #resampled

In [None]:
lr_clf_oversampled = LogisticRegression()
lr_clf_oversampled.fit(X_train_oversampled, y_train_oversampled)

### Evaluation of Model Performance 

As mentioned before, for a skewed dataset, accuracy is not a good metric to use. Let's look at our accuracy score first:

In [None]:
from sklearn.metrics import accuracy_score

test_preds = lr_clf_oversampled.predict(X_test)
accuracy_score(y_test, test_preds)

The accuracy looks very good. But this is misleading. We can find better insights by looking at the performance of each class (positive/negative) with classification report.

In [None]:
print(classification_report(y_test, test_preds))

#### Area under Precision Recall Curve (Average Precision)

Let's calculate the Area under Precision Recall curve. 

In [None]:
train_probs = lr_clf_oversampled.predict_proba(X_train)
print('AUPRC for train set = {}'.format(average_precision_score(y_train, train_probs[:,1])))
test_probs = lr_clf_oversampled.predict_proba(X_test)
print('AUPRC for test set = {}'.format(average_precision_score(y_test, test_probs[:,1])))

Our AUPRC seems to improve with oversampling. However, if we look at the classification report for the positive and negative classes, we see that our precision for the positive class (i.e. the fraud case) is extremely low, i.e. there are a lot of false positives. The `predict()` function uses a default threshold of 0.5 (probability) to decide whether a sample belongs to the positive or the negative class. We can adjust this threshold to trade some recall for precision, as described in the "Precision Recall Trade-off" section that follows. 
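
The role of the 0.5 threshold can be checked on synthetic data. The sketch below uses `make_classification` to build a hypothetical imbalanced dataset (it is not the PaySim data), verifies that `predict()` matches thresholding `predict_proba` at 0.5, and then evaluates a stricter threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical imbalanced dataset with roughly 10% positives.
X_toy, y_toy = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)
probs = clf.predict_proba(X_toy)[:, 1]

# predict() corresponds to a probability cut-off of 0.5.
same_as_predict = np.array_equal(clf.predict(X_toy), (probs > 0.5).astype(int))

# Raising the threshold can only keep or reduce the number of predicted positives.
results = {}
for t in (0.5, 0.9):
    preds = (probs >= t).astype(int)
    results[t] = (
        precision_score(y_toy, preds, zero_division=0),
        recall_score(y_toy, preds),
    )

print(same_as_predict)
print(results)
```

A higher threshold never increases recall; whether precision improves depends on the data.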

#### Precision Recall Trade-off 

Let's plot the precision recall curve for visualization.  

In [None]:
precisions, recalls, thresholds = precision_recall_curve(y_test, test_probs[:,1])

In [None]:
display = PrecisionRecallDisplay.from_predictions(
    y_test, test_probs[:,1], name="LogisticRegression"
)
_ = display.ax_.set_title("Precision-Recall curve")

Suppose we want a higher precision and are willing to accept a lower recall in exchange, e.g. a recall of 0.65. 

In [None]:
threshold_65_recall = thresholds[np.argmin(recalls >= 0.65)]
threshold_65_recall

Note that the threshold is now at more than 0.99, i.e. we will only classify a case as "fraud" if the classifier is 99% sure that this is fraud. Now we can use this threshold for deciding if a prediction is positive (fraud) or negative (non-fraud) by checking if it is above or below this threshold.
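
It can help to see how the three arrays returned by `precision_recall_curve` line up. The sketch below uses four hypothetical scores and picks the largest threshold whose recall still meets a target. This selection rule is an illustrative variant, not the exact expression used in the notebook: `np.argmin(recalls >= target)` returns the first index where recall has already dropped below the target, so the recall it yields can land slightly under it.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

# Hypothetical labels and scores.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions/recalls have one more entry than thresholds: the final
# (precision=1, recall=0) point has no corresponding threshold.
target_recall = 0.5
idx = np.where(recalls[:-1] >= target_recall)[0][-1]  # last index meeting the target
chosen = thresholds[idx]

preds = (scores >= chosen).astype(int)
achieved = recall_score(y_true, preds)

print(chosen, achieved)
```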

In [None]:
y_test_pred_65 = (test_probs[:,1] >= threshold_65_recall)

In [None]:
print(classification_report(y_test, y_test_pred_65))

From the classification report, you can see that we achieve a higher precision of 0.58 if we are willing to lower our recall rate to 0.65. 

**Exercise** 

Suppose you want a higher recall of 0.75 instead of 0.65. What is the expected precision for the positive class? 
Write code to print the classification report based on this new criterion. 

<details><summary>Click here for solution</summary>
    
```python 
threshold_75_recall = thresholds[np.argmin(recalls >= 0.75)]
threshold_75_recall
y_test_pred_75 = (test_probs[:,1] >= threshold_75_recall)
print(classification_report(y_test, y_test_pred_75))
```
    
</details>

In [None]:
## Complete the code 



### Non-linear Model

It seems that our linear model underfits the data quite badly. Let's try a more complex ensemble model based on boosting. In this case we will use a very fast boosting algorithm called LightGBM (you can also try other boosting algorithms such as XGBoost).

Note: We did not cover this algorithm in the lecture, but we are using it here for comparison only. To learn more about LightGBM, you can refer to the [LightGBM website](https://github.com/microsoft/LightGBM). 

In [None]:
lgbm_clf = lgb.LGBMClassifier(num_leaves=30, learning_rate=0.05, n_estimators=30) 
lgbm_clf.fit(X_train, y_train)


In [None]:
train_probs = lgbm_clf.predict_proba(X_train)
print('AUPRC for train set = {}'.format(average_precision_score(y_train, train_probs[:, 1])))
test_probs = lgbm_clf.predict_proba(X_test)
print('AUPRC for test set = {}'.format(average_precision_score(y_test, test_probs[:, 1])))

In [None]:
test_preds = lgbm_clf.predict(X_test)
print(classification_report(y_test, test_preds))

We can see that we achieve an almost perfect classifier! Boosting classifiers are often very effective at dealing with imbalanced datasets. 

LightGBM also allows us to plot the importance of the various features used in the model.

In [None]:
lgb.plot_importance(lgbm_clf)