## Study: Applying ML models to fraud detection on financial transactions
This notebook applies a decision-tree-based model (RandomForestClassifier) to classify financial transactions as fraud or non-fraud.

The dataset used in this study is the **Synthetic Financial Datasets For Fraud Detection**: https://www.kaggle.com/ealaxi/paysim1

It contains the following columns:
- **step**: maps a unit of time in the real world. In this case 1 step is 1 hour. There are 744 steps in total (the dataset description calls this a 30-day simulation, although 744 hours is 31 days).
- **type**: CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.
- **amount**: amount of the transaction in local currency.
- **nameOrig**: customer who started the transaction.
- **oldbalanceOrg**: initial balance of the originator before the transaction.
- **newbalanceOrig**: new balance of the originator after the transaction.
- **nameDest**: customer who is the recipient of the transaction.
- **oldbalanceDest**: initial balance of the recipient before the transaction. Note that there is no information for customers whose names start with M (Merchants).
- **newbalanceDest**: new balance of the recipient after the transaction. Note that there is no information for customers whose names start with M (Merchants).
- **isFraud**: marks the transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying them by transferring the funds to another account and then cashing out of the system.
- **isFlaggedFraud**: the business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200,000 in a single transaction.
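The flagging rule described for **isFlaggedFraud** can be sketched on a toy frame. This is only an illustration of the rule as worded above: it assumes "transfer" means rows with `type == 'TRANSFER'`, the rows and the `flagged` column name are made up, and the real `isFlaggedFraud` column may not follow this rule exactly.

```python
import pandas as pd

# Toy transactions; the values are invented for illustration only.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250_000.0, 150_000.0, 300_000.0, 500.0],
})

FLAG_THRESHOLD = 200_000  # threshold quoted in the column description

# Flag only TRANSFER rows whose amount exceeds the threshold (assumed reading of the rule).
is_transfer = toy["type"] == "TRANSFER"
is_large = toy["amount"] > FLAG_THRESHOLD
toy["flagged"] = (is_transfer & is_large).astype(int)

print(toy)
```

Only the first row is flagged here: the large CASH_OUT is not a transfer, and the second TRANSFER is below the threshold.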

In [None]:
!pip install seaborn

In [None]:
%matplotlib inline

import pandas as pd
import os
import seaborn as sns
import numpy as np

## Load the data

In [None]:
SM_BASE_TRAIN = '../data/PS_20174392719_1491204439457_log.csv'

df = pd.read_csv(SM_BASE_TRAIN)
df.head()

### Creating new features based on the dataset description

In [None]:
# Fraud occurs only in TRANSFER and CASH_OUT transactions. Note that no filtering
# is applied in this cell: all transaction types are kept.
df_dataset = df.copy()
df_dataset['hour'] = (df_dataset.step % 24)
df_dataset['dayOfMonth'] = (df_dataset.step // 24) + 1
df_dataset['signal'] = df_dataset.type.apply(lambda x: -1 if x == 'CASH_IN' else 1)
df_dataset['currbalanceDest'] = df_dataset.oldbalanceDest + (df_dataset.signal * df_dataset.amount)
df_dataset['isMerchantDest'] = df_dataset.nameDest.apply(lambda x: 1 if x.startswith('M') else 0)

df_dataset.type = df_dataset.type.astype('category').cat.codes

## After some analysis, we found inconsistencies between the transaction amount
## and the balances of both accounts. Let's expose them to the model as features.
df_dataset['errorBalanceOrig'] = df_dataset.newbalanceOrig + df_dataset.amount - df_dataset.oldbalanceOrg
df_dataset['errorBalanceDest'] = df_dataset.oldbalanceDest + df_dataset.amount - df_dataset.newbalanceDest

df_dataset = df_dataset.drop(columns=['step', 'nameOrig', 'nameDest', 'isFlaggedFraud', 'currbalanceDest', 'signal']).fillna(0)

df_dataset.head()
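To make the derived features concrete, here is a minimal sketch that applies the same formulas for `hour`, `dayOfMonth`, `errorBalanceOrig` and `errorBalanceDest` to three made-up rows. The toy values are assumptions chosen for illustration; the third row mimics a transaction whose balances do not reflect the amount moved.

```python
import pandas as pd

# Three invented transactions; only the columns needed by the formulas are included.
toy = pd.DataFrame({
    "step": [1, 25, 743],
    "amount": [100.0, 50.0, 200.0],
    "oldbalanceOrg": [100.0, 80.0, 0.0],
    "newbalanceOrig": [0.0, 30.0, 0.0],
    "oldbalanceDest": [0.0, 10.0, 0.0],
    "newbalanceDest": [100.0, 60.0, 0.0],
})

# Time features: each step is one hour.
toy["hour"] = toy["step"] % 24
toy["dayOfMonth"] = toy["step"] // 24 + 1

# Balance errors: zero when the balances are consistent with the amount.
toy["errorBalanceOrig"] = toy["newbalanceOrig"] + toy["amount"] - toy["oldbalanceOrg"]
toy["errorBalanceDest"] = toy["oldbalanceDest"] + toy["amount"] - toy["newbalanceDest"]

print(toy[["hour", "dayOfMonth", "errorBalanceOrig", "errorBalanceDest"]])
```

The first two rows have consistent balances, so both error columns are zero. In the third row neither balance changes, so both errors equal the transaction amount.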

### The dataset is very imbalanced, but we will not use SMOTE or ADASYN here to fix that

In [None]:
df_dataset[['isFraud', 'amount']].groupby(['isFraud']).count()
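The imbalance can also be expressed as a rate rather than raw counts. The sketch below uses a made-up label series (98 legitimate rows, 2 fraudulent) purely to show the calculation; the proportions in the real dataset are different.

```python
import pandas as pd

# Invented labels: 98 non-fraud (0) and 2 fraud (1) rows.
labels = pd.Series([0] * 98 + [1] * 2, name="isFraud")

counts = labels.value_counts()   # rows per class
fraud_rate = labels.mean()       # share of rows labelled as fraud

print(counts)
print(f"fraud rate: {fraud_rate:.2%}")
```

On the notebook's data, the same two calls could be applied to `df_dataset['isFraud']`.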

### There are some features with high correlation
We could have applied PCA here to reduce the number of features, but let's keep all of them for now.

In [None]:
import matplotlib.pyplot as plt
corr = df_dataset.corr()

f, ax = plt.subplots(figsize=(15, 8))
sns.heatmap(corr, annot=True, fmt="f",
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            ax=ax)
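A heatmap with many columns can be hard to read, so it may help to list the strongly correlated pairs directly. The following is a sketch on synthetic data: the columns `a`, `b`, `c` and the 0.9 cut-off are arbitrary assumptions, not values taken from this study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2.0 + rng.normal(scale=0.01, size=200),  # nearly a linear copy of "a"
    "c": rng.normal(size=200),                        # independent noise
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

THRESHOLD = 0.9  # arbitrary cut-off for "high" correlation
high = pairs[pairs > THRESHOLD].sort_values(ascending=False)

print(high)
```

The same steps could be applied to `df_dataset.corr()` to find candidate columns to drop or combine.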

### Now, we can select some features and generate the dataset
After a few rounds of training, testing and optimization, SHAP was applied to help us select the best features.

In [None]:
df_train = df_dataset[[
    'isFraud', 'type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
    'newbalanceDest', 'hour', 'dayOfMonth', 'isMerchantDest',
    'errorBalanceOrig', 'errorBalanceDest'
]].copy()
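The SHAP analysis itself is not shown in this notebook. As a rough stand-in, the sketch below ranks features with the impurity-based importances of a `RandomForestClassifier` from scikit-learn. This is a different technique from SHAP and may rank features differently; the synthetic data and the column names `informative`, `noise_1` and `noise_2` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic features: one drives the label, the other two are pure noise.
X = pd.DataFrame({
    "informative": rng.normal(size=500),
    "noise_1": rng.normal(size=500),
    "noise_2": rng.normal(size=500),
})
y = (X["informative"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1; higher means the feature was used more for splits.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Features with consistently low importance across several runs are candidates for removal, which is the kind of decision the SHAP step supported in this study.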

### Register the dataset in the Workspace

In [None]:
from azureml.core import Workspace, Datastore, Dataset

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, 'workspaceblobstore')
dataset = Dataset.Tabular.register_pandas_dataframe(df_train, datastore, "train-data", show_progress=True)