# Anti-Money Laundering System

## Problem Statement

Money laundering is a major challenge in today's highly digitalized economic ecosystem. In the scenario considered here, fraudulent agents try to profit by taking control of customers' accounts, emptying the funds by transferring them to another account, and then cashing out of the system. The main objective of this project is to build a machine learning model, based on financial transaction data, for detecting such fraudulent behavior.

The ***CRISP-ML(Q)*** process model describes six phases:

1. Business and Data Understanding
2. Data Preparation
3. Model Building
4. Model Evaluation
5. Deployment
6. Monitoring and Maintenance

**Objective(s):** Maximizing the detection of fraud transactions via different channels.

**Constraints:** Minimizing the false positives generated when flagging fraud transactions.
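The objective and the constraint pull in opposite directions, which is usually expressed through recall (how much fraud is caught) and precision (how many alerts are real). The snippet below is a minimal sketch of both metrics; the counts are hypothetical and only illustrate the trade-off.

```python
def recall(tp: int, fn: int) -> float:
    """Share of actual fraud transactions that the model catches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Share of fraud alerts that are actually fraud."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, fn = 80, 40, 20

print(f"recall    = {recall(tp, fn):.2f}")     # 80 / (80 + 20)
print(f"precision = {precision(tp, fp):.2f}")  # 80 / (80 + 40)
```

Maximizing detection corresponds to pushing recall up, while the constraint on false positives corresponds to keeping precision from collapsing.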

## Data Collection

**Data:** This is a synthetic dataset generated using the simulator called PaySim. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.

**Dataset:** 
* The dataset has 6,362,620 observations of financial transactions.
* Each transaction has 11 associated variables.

**Variables Description:**
* step - Maps a unit of time in the real world (1 step = 1 hour of time).

* type - CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER.

* amount - Amount of the transaction in local currency.

* nameOrig - Customer who started the transaction.

* oldbalanceOrg - Initial balance before the transaction.

* newbalanceOrig - New balance after the transaction.

* nameDest - Customer who is the recipient of the transaction.

* oldbalanceDest - Initial balance recipient before the transaction.

* newbalanceDest - New balance recipient after the transaction.

* isFraud - Marks the transactions made by the fraudulent agents inside the simulation.

* isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt is an attempt to transfer more than 200,000 in a single transaction.

> Note that there is no information for customers whose names start with M (Merchants).
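The `isFlaggedFraud` rule described above can be written as a simple filter. This is a sketch on made-up rows, assuming the rule applies only to `TRANSFER` transactions; `wouldBeFlagged` and `FLAG_THRESHOLD` are hypothetical names, and the EDA cells later compare the rule against the actual column.

```python
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [250_000.0, 150_000.0, 300_000.0, 500.0],
})

FLAG_THRESHOLD = 200_000  # threshold from the isFlaggedFraud description

# Assumption: only TRANSFER transactions above the threshold are flagged.
toy["wouldBeFlagged"] = (
    (toy["type"] == "TRANSFER") & (toy["amount"] > FLAG_THRESHOLD)
).astype(int)

print(toy)
```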

**Required Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import sweetviz
import dtale
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from scipy.stats import boxcox, skew

**Importing the data**

In [None]:
transaction_details_dataset = pd.read_csv("transaction_details.csv")

In [None]:
transaction_details_dataset.isnull().values.any()

In [None]:
print("Our dataset has {} observations and {} columns".format(transaction_details_dataset.shape[0], transaction_details_dataset.shape[1]))

In [None]:
transaction_details_dataset.head()

In [None]:
transaction_details_dataset.describe()

In [None]:
transaction_details_dataset.info()

## Exploratory Data Analysis (EDA) / Descriptive Statistics

In [None]:
print(transaction_details_dataset.type.value_counts())

In [None]:
print('Types of fraud transactions: {}'.format(list(transaction_details_dataset.loc[transaction_details_dataset.isFraud == 1].type.drop_duplicates().values)))

fraud_transfer = transaction_details_dataset.loc[(transaction_details_dataset.isFraud == 1) & (transaction_details_dataset.type == 'TRANSFER')]
fraud_cash_out = transaction_details_dataset.loc[(transaction_details_dataset.isFraud == 1) & (transaction_details_dataset.type == 'CASH_OUT')]

print('\nNumber of fraud TRANSFER\'s: {}'.format(len(fraud_transfer)))
print('Number of fraud CASH_OUT\'s: {}'.format(len(fraud_cash_out)))

print('\nPercentage of fraud TRANSFER\'s: {} %'.format((len(fraud_transfer)/len(transaction_details_dataset)) * 100))
print('Percentage of fraud CASH_OUT\'s: {} %'.format((len(fraud_cash_out)/len(transaction_details_dataset)) * 100))
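The per-type counts and percentages above can also be produced in one pass with `groupby`. The following is a minimal sketch on toy data (the rows are invented; `summary`, `fraud_rate_pct` and `share_of_all_pct` are illustrative names), showing both the fraud rate within each type and the share of all transactions.

```python
import pandas as pd

# Toy data with made-up labels; the real dataset has the same two columns.
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT", "PAYMENT", "PAYMENT"],
    "isFraud": [1, 0, 1, 0, 0, 0],
})

# Fraud count and total count per transaction type.
summary = toy.groupby("type")["isFraud"].agg(fraud_count="sum", total="count")

# Fraud rate within each type, and fraud as a share of all transactions.
summary["fraud_rate_pct"] = 100 * summary["fraud_count"] / summary["total"]
summary["share_of_all_pct"] = 100 * summary["fraud_count"] / len(toy)

print(summary)
```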

In [None]:
print('Types of transactions which are \'isFlaggedFraud\': {}'.format(list(transaction_details_dataset.loc[transaction_details_dataset.isFlaggedFraud == 1].type.drop_duplicates().values)))

transfer = transaction_details_dataset.loc[(transaction_details_dataset.type == 'TRANSFER')]
is_flagged_fraud = transaction_details_dataset.loc[(transaction_details_dataset.isFlaggedFraud == 1)]
not_flagged_fraud = transaction_details_dataset.loc[(transaction_details_dataset.isFlaggedFraud == 0)]

print('Minimum amount of transaction when \'isFlaggedFraud\' is set: {}'.format(is_flagged_fraud.amount.min()))
print('Maximum amount of transaction in a TRANSFER when \'isFlaggedFraud\' is not set: {}'.format(transfer.loc[(transfer.isFlaggedFraud == 0)].amount.max()))

In [None]:
print('Number of TRANSFER\'s where isFlaggedFraud = 1 yet oldbalanceDest = 0, & newbalanceDest = 0: {}'.format(len(transfer.loc[(transfer.isFlaggedFraud == 1) & (transfer.newbalanceDest == 0) & (transfer.oldbalanceDest == 0)])))
print('Number of TRANSFER\'s where isFlaggedFraud = 0 yet oldbalanceDest = 0, & newbalanceDest = 0: {}'.format(len(transfer.loc[(transfer.isFlaggedFraud == 0) & (transfer.newbalanceDest == 0) & (transfer.oldbalanceDest == 0)])))

In [None]:
print('Minimum value of newbalanceOrig when isFlaggedFraud = 0 where oldbalanceOrg = newbalanceOrig: {}'.format(transfer.loc[(transfer.isFlaggedFraud == 0) & (transfer.oldbalanceOrg == transfer.newbalanceOrig)].oldbalanceOrg.min()))
print('Maximum value of newbalanceOrig when isFlaggedFraud = 0 where oldbalanceOrg = newbalanceOrig: {}'.format(transfer.loc[(transfer.isFlaggedFraud == 0) & (transfer.oldbalanceOrg == transfer.newbalanceOrig)].oldbalanceOrg.max()))
print('Minimum value of oldbalanceOrg when isFlaggedFraud = 1: {}'.format(is_flagged_fraud.oldbalanceOrg.min()))
print('Maximum value of oldbalanceOrg when isFlaggedFraud = 1: {}'.format(is_flagged_fraud.oldbalanceOrg.max()))

In [None]:
print('Do originators of transactions flagged as fraud appear in any other transaction? {}'.format(is_flagged_fraud.nameOrig.isin(pd.concat([not_flagged_fraud.nameOrig, not_flagged_fraud.nameDest])).any()))
print('Have destinations of transactions flagged as fraud initiated other transactions? {}'.format(is_flagged_fraud.nameDest.isin(not_flagged_fraud.nameOrig).any()))
print('Number of destination accounts of flagged transactions that were also destinations of non-flagged transactions: {}'.format(sum(is_flagged_fraud.nameDest.isin(not_flagged_fraud.nameDest))))

In [None]:
print('Any merchants among originator account? {}'.format(transaction_details_dataset.nameOrig.str.contains('M').any()))
print('Are there any transactions having merchants as destination accounts other than \'PAYMENT\' type? {}'.format((transaction_details_dataset.loc[transaction_details_dataset.nameDest.str.contains('M')].type != 'PAYMENT').any()))
print('Any merchants among originator who accounts for \'CASH_IN\' transactions? {}'.format(transaction_details_dataset.loc[transaction_details_dataset.type == 'CASH_IN'].nameOrig.str.contains('M').any()))
print('Any merchants among originator who accounts for \'CASH_OUT\' transactions? {}'.format(transaction_details_dataset.loc[transaction_details_dataset.type == 'CASH_OUT'].nameOrig.str.contains('M').any()))

In [None]:
not_fraud = transaction_details_dataset.loc[transaction_details_dataset.isFraud == 0]
print('Fraud TRANSFERs whose destination accounts are originators of genuine CASH_OUTs:\n {}'.format(fraud_transfer.loc[fraud_transfer.nameDest.isin(not_fraud.loc[not_fraud.type == 'CASH_OUT'].nameOrig.drop_duplicates())]))

In [None]:
print('Fraud TRANSFER to \'C1714931087\' occurs at step [65] whereas genuine \'CASH_OUT\' from this account occurred at step = {}'.format(not_fraud.loc[(not_fraud.type == 'CASH_OUT') & (not_fraud.nameOrig == 'C1714931087')].step.values))
print('Fraud TRANSFER to \'C423543548\' occurs at step [486] whereas genuine \'CASH_OUT\' from this account occurred at step = {}'.format(not_fraud.loc[(not_fraud.type == 'CASH_OUT') & (not_fraud.nameOrig == 'C423543548')].step.values))
print('Fraud TRANSFER to \'C1023330867\' occurs at step [738] whereas genuine \'CASH_OUT\' from this account occurred at step = {}'.format(not_fraud.loc[(not_fraud.type == 'CASH_OUT') & (not_fraud.nameOrig == 'C1023330867')].step.values))

In [None]:
figure, ax = plt.subplots(1, 1, figsize = (8, 6))
transaction_details_dataset.type.value_counts().plot(kind = 'bar', title = 'Transaction Type', color ='red')

In [None]:
figure, ax = plt.subplots(1, 1, figsize = (8, 6))
ax = transaction_details_dataset.groupby(['type', 'isFraud']).size().plot(kind = 'bar')
ax.set_title('Number of actual fraud transaction per transaction type')
ax.set_xlabel('Type, isFraud')
ax.set_ylabel('Number of transactions')

for x in ax.patches:
    ax.annotate(str(format(int(x.get_height()))), (x.get_x(), x.get_height() * 1.01))

In [None]:
figure, ax = plt.subplots(1, 1, figsize = (8, 6))
ax = transaction_details_dataset.groupby(['type', 'isFlaggedFraud']).size().plot(kind = 'bar')
ax.set_title('Number of flagged fraud transactions per transaction type')
ax.set_xlabel('Type, isFlaggedFraud')
ax.set_ylabel('Number of transactions')

for x in ax.patches:
    ax.annotate(str(format(int(x.get_height()))), (x.get_x(), x.get_height() * 1.01))

In [None]:
figure, axis = plt.subplots(2, 2, figsize = (8, 8))

figure_1 = sns.boxplot(x = 'isFlaggedFraud', y = 'amount', data = transfer, ax = axis[0][0])
axis[0][0].set_yscale('log')
figure_2 = sns.boxplot(x = 'isFlaggedFraud', y = 'oldbalanceDest', data = transfer, ax = axis[0][1])
axis[0][1].set(ylim=(0, 0.5e8))
figure_3 = sns.boxplot(x = 'isFlaggedFraud', y = 'oldbalanceOrg', data = transfer, ax = axis[1][0])
axis[1][0].set(ylim=(0, 3e7))
figure_4 = sns.regplot(x = 'oldbalanceOrg', y = 'amount', data = transfer.loc[transfer.isFlaggedFraud == 1], ax = axis[1][1])
plt.show()

## Data Preprocessing

***'TRANSFER' OR 'CASH_OUT' Transactions***

In [None]:
transfer_or_cash_out = transaction_details_dataset.loc[(transaction_details_dataset.type == 'TRANSFER') | (transaction_details_dataset.type == 'CASH_OUT')]

print('We have a total of {} transactions that are either \'TRANSFER\' OR \'CASH_OUT\'.'.format(transfer_or_cash_out.shape[0]))

### Automated Library

In [None]:
automated_report = sweetviz.analyze(transfer_or_cash_out)
automated_report.show_html('Report.html')

In [None]:
dataset_visuals = dtale.show(transfer_or_cash_out)
dataset_visuals.open_browser()

**Cleaning Unwanted Columns**

'nameDest', 'nameOrig', 'isFlaggedFraud': These columns are not relevant to the analysis that follows, as they carry no usable nominal (categorical) information for the model. Hence, we drop them.

In [None]:
transfer_or_cash_out = transfer_or_cash_out.drop(['nameDest', 'nameOrig', 'isFlaggedFraud'], axis = 1)

In [None]:
fraud_transfer_or_cash_out = transfer_or_cash_out.loc[transfer_or_cash_out['isFraud'] == 1]
not_fraud_transfer_or_cash_out = transfer_or_cash_out.loc[transfer_or_cash_out['isFraud'] == 0]

print('Fraction of fraud transactions with \'oldbalanceDest\' = \'newbalanceDest\' = 0 despite a non-zero transaction amount: {}'.format(len(fraud_transfer_or_cash_out.loc[(fraud_transfer_or_cash_out.oldbalanceDest == 0) & (fraud_transfer_or_cash_out.newbalanceDest == 0) & (fraud_transfer_or_cash_out.amount != 0)]) / len(fraud_transfer_or_cash_out)))

***Computation:***
1. Imputing null values using mean imputation.
2. Converting 'categorical data' to 'numerical data' using LabelEncoder.
3. Using DataFrameMapper to apply the encoder to the categorical attribute.
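The steps listed above can be sketched on toy data. Note that this sketch swaps in scikit-learn's `OrdinalEncoder` for the `LabelEncoder`/`DataFrameMapper` pair, because `OrdinalEncoder` is designed for 2D feature columns inside a `ColumnTransformer`; the rows and the `toy`, `categorical`, `numerical` names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with one missing amount; values are invented.
toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "TRANSFER", "CASH_OUT"],
    "amount": [100.0, np.nan, 300.0, 200.0],
    "isFraud": [1, 0, 0, 1],
})

categorical = ["type"]
numerical = ["amount", "isFraud"]

preprocess = ColumnTransformer([
    # Encode categories as integers (sorted order: CASH_OUT -> 0, TRANSFER -> 1).
    ("category", OrdinalEncoder(), categorical),
    # Replace missing numerical values with the column mean.
    ("numerical", Pipeline([("impute", SimpleImputer(strategy="mean"))]), numerical),
])

# The output columns follow the transformer order: categorical first, then numerical.
encoded = pd.DataFrame(preprocess.fit_transform(toy), columns=categorical + numerical)
print(encoded)
```

Keeping the column list in the same order as the transformers avoids mislabelled columns when the array is turned back into a DataFrame.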

In [None]:
numerical_features = transfer_or_cash_out.select_dtypes(include = ['int32', 'int64', 'float32', 'float64']).columns
numerical_features

In [None]:
categorical_features = transfer_or_cash_out.select_dtypes(include = ['object']).columns
categorical_features

In [None]:
numerical_pipeline = Pipeline([('impute', SimpleImputer(strategy = 'mean'))])
categorical_pipeline = Pipeline([('label', DataFrameMapper([(categorical_features, LabelEncoder())]))])
scale_pipeline = Pipeline([('scale', MinMaxScaler())])

In [None]:
preprocess_pipeline = ColumnTransformer([('category', categorical_pipeline, categorical_features), 
                                         ('numerical', numerical_pipeline, numerical_features)],
                                        remainder = 'passthrough')
preprocess_fit = preprocess_pipeline.fit(transfer_or_cash_out)
preprocess_fit

In [None]:
preprocess_transform = preprocess_fit.transform(transfer_or_cash_out)
preprocess_transform

In [None]:
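# ColumnTransformer outputs the categorical columns first, then the numerical ones
preprocess_2 = pd.DataFrame(preprocess_transform, columns = [list(categorical_features) + list(numerical_features)])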
preprocess_2

## Data Visualization

In [None]:
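def correlation_plot(dataframe, labels):
    figure, axis = plt.subplots(1, 1, figsize = (8, 8))
    # plt.get_cmap is used because matplotlib.cm.get_cmap was removed in recent Matplotlib releases
    cmap = plt.get_cmap('inferno')
    cax = axis.imshow(dataframe.corr(), cmap = cmap)
    plt.title('Correlation HeatMap')
    # Place one tick per column before assigning the labels so they line up with the cells
    axis.set_xticks(range(len(labels)))
    axis.set_yticks(range(len(labels)))
    axis.set_xticklabels(labels, fontsize = 12, rotation = 60)
    axis.set_yticklabels(labels, fontsize = 12)
    figure.colorbar(cax)
    plt.show()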
    
correlation_plot_labels = preprocess_2.columns.tolist()

for i in range(len(correlation_plot_labels)):
    correlation_plot_labels[i] = correlation_plot_labels[i][0]
    
correlation_plot(preprocess_2, correlation_plot_labels)

In [None]:
sns.heatmap(preprocess_2.corr())
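A heatmap shows every pairwise correlation at once; when the interest is mainly in the target, it can help to rank the features by their correlation with `isFraud`. The snippet below is a small sketch on invented values, not output from the dataset.

```python
import pandas as pd

# Toy data: 'amount' tracks the label closely, 'step' only loosely.
toy = pd.DataFrame({
    "amount": [1, 2, 3, 4, 5, 6],
    "step": [1, 6, 2, 5, 3, 4],
    "isFraud": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    toy.corr()["isFraud"]
    .drop("isFraud")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```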

In [None]:
transaction_type_plot = transfer_or_cash_out.type.value_counts().plot(kind = 'bar', title = 'Transaction Type', figsize = (8, 6))

for patch in transaction_type_plot.patches:
    transaction_type_plot.annotate(str(format(int(patch.get_height()))), (patch.get_x(), patch.get_height() * 1.01))

In [None]:
fraud_transaction_plot = transfer_or_cash_out['isFraud'].value_counts().plot(kind = 'bar', title = 'Fraud Transaction', figsize = (8, 6))

for patch in fraud_transaction_plot.patches:
    fraud_transaction_plot.annotate(str(format(int(patch.get_height()))), (patch.get_x(), patch.get_height() * 1.01))

The numerical variables in the data are heavily skewed. We therefore apply a Box-Cox transformation to each of them, followed by standardization, to reduce the skew.

In [None]:
transfer_or_cash_out['amount_boxcox'] = preprocessing.scale(boxcox(transfer_or_cash_out['amount'] + 1)[0])
transfer_or_cash_out['amount_oldbalanceOrg'] = preprocessing.scale(boxcox(transfer_or_cash_out['oldbalanceOrg'] + 1)[0])
transfer_or_cash_out['amount_newbalanceOrig'] = preprocessing.scale(boxcox(transfer_or_cash_out['newbalanceOrig'] + 1)[0])
transfer_or_cash_out['amount_oldbalanceDest'] = preprocessing.scale(boxcox(transfer_or_cash_out['oldbalanceDest'] + 1)[0])
transfer_or_cash_out['amount_newbalanceDest'] = preprocessing.scale(boxcox(transfer_or_cash_out['newbalanceDest'] + 1)[0])
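The effect of the transformation in the cell above can be checked on synthetic data. Box-Cox maps a positive value x to (x^λ − 1)/λ for λ ≠ 0 and to log(x) for λ = 0, with λ fitted by `scipy.stats.boxcox`. The sketch below uses log-normal samples as a stand-in for skewed amounts; the distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import boxcox, skew
from sklearn import preprocessing

rng = np.random.default_rng(0)

# Synthetic right-skewed amounts (log-normal); not taken from the dataset.
amounts = rng.lognormal(mean=10, sigma=1.5, size=10_000)

# Box-Cox requires strictly positive input, hence the +1 shift used above.
transformed, fitted_lambda = boxcox(amounts + 1)

# Standardize to zero mean and unit variance, as preprocessing.scale does above.
standardized = preprocessing.scale(transformed)

print(f"skew before: {skew(amounts):.2f}")
print(f"skew after:  {skew(standardized):.2f}")
print(f"fitted lambda: {fitted_lambda:.3f}")
```

The `+ 1` shift matters for the balance columns, which contain zeros that Box-Cox cannot handle directly.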

In [None]:
figure, axis = plt.subplots(1, 3, figsize = (12, 5))

axis[0].hist(transfer_or_cash_out['amount'])
axis[0].set_xlabel('Transaction Amount')
axis[0].set_title('Transaction Amount')
axis[0].text(0.3e8, 2750000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['amount'])))

axis[1].hist(np.sqrt(transfer_or_cash_out['amount']))
axis[1].set_xlabel('Square Root of Transaction Amount')
axis[1].set_title('SQRT on Transaction Amount')
axis[1].text(3000, 2650000, 'Skewness: {:.2f}'.format(skew(np.sqrt(transfer_or_cash_out['amount']))))

axis[2].hist(transfer_or_cash_out['amount_boxcox'])
axis[2].set_xlabel('Boxcox of Transaction Amount')
axis[2].set_title('Boxcox on Transaction Amount')
axis[2].text(-2, 1625000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['amount_boxcox'])))

plt.show()

In [None]:
figure, axis = plt.subplots(1, 3, figsize = (12, 5))

axis[0].hist(transfer_or_cash_out['oldbalanceOrg'])
axis[0].set_xlabel('Original Old Balance')
axis[0].set_title('Original Old Balance')
axis[0].text(0.2e8, 2650000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['oldbalanceOrg'])))

axis[1].hist(np.sqrt(transfer_or_cash_out['oldbalanceOrg']))
axis[1].set_xlabel('Square Root of Original Old Balance')
axis[1].set_title('SQRT on Original Old Balance')
axis[1].text(2500, 2650000, 'Skewness: {:.2f}'.format(skew(np.sqrt(transfer_or_cash_out['oldbalanceOrg']))))

axis[2].hist(transfer_or_cash_out['amount_oldbalanceOrg'])
axis[2].set_xlabel('Boxcox of Original Old Balance')
axis[2].set_title('Boxcox on Original Old Balance')
axis[2].text(0, 1275000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['amount_oldbalanceOrg'])))

plt.show()

In [None]:
figure, axis = plt.subplots(1, 3, figsize = (12, 5))

axis[0].hist(transfer_or_cash_out['newbalanceOrig'])
axis[0].set_xlabel('Original New Balance')
axis[0].set_title('Original New Balance')
axis[0].text(0.1e8, 2650000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['newbalanceOrig'])))

axis[1].hist(np.sqrt(transfer_or_cash_out['newbalanceOrig']))
axis[1].set_xlabel('Square Root of Original New Balance')
axis[1].set_title('SQRT on Original New Balance')
axis[1].text(2250, 2600000, 'Skewness: {:.2f}'.format(skew(np.sqrt(transfer_or_cash_out['newbalanceOrig']))))

axis[2].hist(transfer_or_cash_out['amount_newbalanceOrig'])
axis[2].set_xlabel('Boxcox of Original New Balance')
axis[2].set_title('Boxcox on Original New Balance')
axis[2].text(0.75, 2300000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['amount_newbalanceOrig'])))

plt.show()

In [None]:
figure, axis = plt.subplots(1, 3, figsize = (12, 5))

axis[0].hist(transfer_or_cash_out['oldbalanceDest'])
axis[0].set_xlabel('Destination Old Balance')
axis[0].set_title('Destination Old Balance')
axis[0].text(0.1e9, 2700000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['oldbalanceDest'])))

axis[1].hist(np.sqrt(transfer_or_cash_out['oldbalanceDest']))
axis[1].set_xlabel('Square Root of Destination Old Balance')
axis[1].set_title('SQRT on Destination Old Balance')
axis[1].text(6000, 2350000, 'Skewness: {:.2f}'.format(skew(np.sqrt(transfer_or_cash_out['oldbalanceDest']))))

axis[2].hist(transfer_or_cash_out['amount_oldbalanceDest'])
axis[2].set_xlabel('Boxcox of Destination Old Balance')
axis[2].set_title('Boxcox on Destination Old Balance')
axis[2].text(0.75, 1000000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['amount_oldbalanceDest'])))

plt.show()

In [None]:
figure, axis = plt.subplots(1, 3, figsize = (12, 5))

axis[0].hist(transfer_or_cash_out['newbalanceDest'])
axis[0].set_xlabel('Destination New Balance')
axis[0].set_title('Destination New Balance')
axis[0].text(0.1e9, 2700000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['newbalanceDest'])))

axis[1].hist(np.sqrt(transfer_or_cash_out['newbalanceDest']))
axis[1].set_xlabel('Square Root of Destination New Balance')
axis[1].set_title('SQRT on Destination New Balance')
axis[1].text(2250, 2650000, 'Skewness: {:.2f}'.format(skew(np.sqrt(transfer_or_cash_out['newbalanceDest']))))

axis[2].hist(transfer_or_cash_out['amount_newbalanceDest'])
axis[2].set_xlabel('Boxcox of Destination New Balance')
axis[2].set_title('Boxcox on Destination New Balance')
axis[2].text(0, 1200000, 'Skewness: {:.2f}'.format(skew(transfer_or_cash_out['amount_newbalanceDest'])))

plt.show()

In [None]:
print('Percentage of fraud transactions of the filtered dataset: {}%'.format((len(transfer_or_cash_out[transfer_or_cash_out['isFraud'] == 1]) / len(transfer_or_cash_out)) * 100))

As we can observe, only about 0.3% of the filtered transactions are actual fraud; the irrelevant transaction types have already been filtered out.

From here on, only the Box-Cox-transformed features will be used for the model prediction.

In [None]:
transfer_or_cash_out.drop(['type', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest'], axis = 1, inplace = True)

In [None]:
transfer_or_cash_out.reset_index(drop = True, inplace = True)

fraud_record_count = len(transfer_or_cash_out[transfer_or_cash_out['isFraud'] == 1])

fraud_indices = transfer_or_cash_out[transfer_or_cash_out['isFraud'] == 1].index.values
normal_indices = transfer_or_cash_out[transfer_or_cash_out['isFraud'] == 0].index

random_normal_indices = np.array(np.random.choice(normal_indices, fraud_record_count, replace = False))

combine_sample_indices = np.concatenate([fraud_indices, random_normal_indices])
combine_sample_data = transfer_or_cash_out.iloc[combine_sample_indices, :]

not_fraud_undersample = combine_sample_data.loc[:, combine_sample_data.columns != 'isFraud']
is_fraud_undersample = combine_sample_data.loc[:, combine_sample_data.columns == 'isFraud']

print('Fraction of normal transactions: ', len(combine_sample_data[combine_sample_data.isFraud == 0]) / len(combine_sample_data))
print('Fraction of fraud transactions: ', len(combine_sample_data[combine_sample_data.isFraud == 1]) / len(combine_sample_data))
print('Total count of sample transactions data: ', len(combine_sample_data))
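The undersampling above draws a different random subset on every run. A reproducible variant can be sketched with `DataFrame.sample` and a fixed `random_state`; the toy data, the seed value and the names `balanced`, `X`, `y` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy imbalanced data: 5 fraud rows among 100; values are made up.
toy = pd.DataFrame({
    "feature": rng.normal(size=100),
    "isFraud": [1] * 5 + [0] * 95,
})

# Keep every fraud row and draw an equally sized random sample of normal rows.
fraud = toy[toy["isFraud"] == 1]
normal = toy[toy["isFraud"] == 0].sample(n=len(fraud), random_state=42)

# Combine and shuffle so the classes are not grouped together.
balanced = pd.concat([fraud, normal]).sample(frac=1, random_state=42)

X = balanced.drop(columns="isFraud")
y = balanced["isFraud"]

print(y.value_counts())
```

Random undersampling discards most of the normal transactions, so a fixed seed at least makes the retained subset repeatable between runs.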