# Fraud Detection in Transactions

First, let's import the required packages and the data set.

We will be using the Kaggle API to download the dataset and then load it into a pandas DataFrame.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from random import randint

In [None]:
!mkdir .kaggle

mkdir: cannot create directory ‘.kaggle’: File exists


In [None]:
from google.colab import files
files.upload()

{}

In [None]:
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

mkdir: cannot create directory ‘/root/.kaggle’: File exists


In [None]:
!kaggle datasets download -d ntnu-testimon/paysim1

paysim1.zip: Skipping, found more recently modified local copy (use --force to force download)


In [None]:
!unzip 'paysim1.zip'

Archive:  paysim1.zip
  inflating: PS_20174392719_1491204439457_log.csv  

In [None]:
# Read the file extracted from paysim1.zip in the previous cell
df = pd.read_csv('PS_20174392719_1491204439457_log.csv')
print(df.shape)

(6362620, 11)


We see that the dataset has more than 6 million rows. Let's take a look at the dataframe.

In [None]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


Let's print out a summary of the dataset before going ahead and also check whether there are any null values in the dataset.

In [None]:
df.describe(include='all')

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620,6362620.0,6362620,6362620.0,6362620.0,6362620,6362620.0,6362620.0,6362620.0,6362620.0
unique,,5,,6353307,,,2722362,,,,
top,,CASH_OUT,,C2051359467,,,C1286084959,,,,
freq,,2237500,,3,,,113,,,,
mean,243.3972,,179861.9,,833883.1,855113.7,,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,,603858.2,,2888243.0,2924049.0,,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,,0.0,,0.0,0.0,,0.0,0.0,0.0,0.0
25%,156.0,,13389.57,,0.0,0.0,,0.0,0.0,0.0,0.0
50%,239.0,,74871.94,,14208.0,0.0,,132705.7,214661.4,0.0,0.0
75%,335.0,,208721.5,,107315.2,144258.4,,943036.7,1111909.0,0.0,0.0


One interesting point to note here is that most of the 'nameOrig' values are unique, unlike the 'nameDest' values.

In [None]:
df.isnull().values.any()

False

There are no null (NaN) values in the dataframe. This doesn't mean that there are no missing values, as some of them might be stored as 0, as negative or very large numbers, etc. We will try to do a more detailed exploration of that in the EDA section.
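As a starting point for that check, one option is to count exact zeros per numeric column. This is only a minimal sketch: the `toy` frame and its values are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "amount": [181.0, 9839.64, 0.0, 2806.0],
    "oldbalanceDest": [0.0, 0.0, 21182.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0],
})

# isnull() finds no NaNs, yet zeros may still be placeholders for unknown balances.
print(toy.isnull().values.any())

# Count exact zeros per column to see where such placeholders could hide.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

A high zero count does not prove that values are missing; it only tells us which columns deserve a closer look.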

### Exploratory Data Analysis

First, let's do some basic analysis using pandas methods before going into visualisations and plots.

Let's check the total number of fraudulent transactions.

In [None]:
df['isFraud'].value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64

There are only 8213 fraudulent transactions (~0.13%) in the dataset. As expected, the dataset is highly imbalanced, and this will need to be handled during modelling.
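The percentage can be reproduced from the two counts printed above. The snippet below is a small sketch that hard-codes those counts instead of loading the full dataframe; the variable names are ours.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 6354407, 1: 8213}, name="isFraud")

# Share of each class; on the real frame,
# df['isFraud'].value_counts(normalize=True) gives the same ratios directly.
shares = counts / counts.sum()
fraud_pct = round(shares[1] * 100, 2)
print(fraud_pct)  # 0.13
```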

Let's check the type of transactions that we have and also how fraudulent transactions are distributed among these types.

In [None]:
df[['type', 'isFraud']][df['isFraud'] == 1].groupby('type').count()

Unnamed: 0_level_0,isFraud
type,Unnamed: 1_level_1
CASH_OUT,4116
TRANSFER,4097
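If we want the genuine and fraudulent counts side by side for every type, `pd.crosstab` is one way to get them. The rows below are invented sample data (only the column names follow the dataset), so treat this as a sketch of the technique rather than a result.

```python
import pandas as pd

# Invented sample rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_OUT", "DEBIT", "TRANSFER"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Rows are transaction types, columns are the isFraud classes (0 and 1).
table = pd.crosstab(toy["type"], toy["isFraud"])
print(table)

# Types with at least one fraudulent row.
fraud_types = table[table[1] > 0].index.tolist()
print(fraud_types)
```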


We see that fraudulent transactions occur in only two types of payments: CASH_OUT and TRANSFER. Let's have a more detailed look into the fraudulent transactions.

In [None]:
df[df['isFraud'] == 1].head(25)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
251,1,TRANSFER,2806.0,C1420196421,2806.0,0.0,C972765878,0.0,0.0,1,0
252,1,CASH_OUT,2806.0,C2101527076,2806.0,0.0,C1007251739,26202.0,0.0,1,0
680,1,TRANSFER,20128.0,C137533655,20128.0,0.0,C1848415041,0.0,0.0,1,0
681,1,CASH_OUT,20128.0,C1118430673,20128.0,0.0,C339924917,6268.0,12145.85,1,0
724,1,CASH_OUT,416001.33,C749981943,0.0,0.0,C667346055,102.0,9291619.62,1,0
969,1,TRANSFER,1277212.77,C1334405552,1277212.77,0.0,C431687661,0.0,0.0,1,0
970,1,CASH_OUT,1277212.77,C467632528,1277212.77,0.0,C716083600,0.0,2444985.19,1,0
1115,1,TRANSFER,35063.63,C1364127192,35063.63,0.0,C1136419747,0.0,0.0,1,0


We see that in most cases the cash is transferred and then immediately cashed out.

Another important point to note is that in most of the fraudulent transactions, the entire 'oldbalanceOrg' is transacted as the 'amount'. Let's dig a little deeper into this.

In [None]:
df[(df['amount'] == df['oldbalanceOrg'])]['isFraud'].value_counts()

1    8034
Name: isFraud, dtype: int64

In 8034 out of the 8213 fraudulent transactions, the entire amount in the origin account gets transacted. And quite surprisingly, there is not a single genuine transaction that shows a similar behaviour.

This observation will come in handy during feature engineering. **We will be adding an 'entireAmount' feature that can flag this behaviour.**

We can also see that the difference in the balance in the destination account is not the same as the amount of the transaction and in many cases both the original and new balances are 0. These might be indicative of missing values. Let's have a more detailed look into this on the entire dataset.

Note that from the data description, for transactions with merchants where the nameDest starts with M, this behaviour can be expected.

In [None]:
df[((df['newbalanceDest'] - df['oldbalanceDest']) != df['amount']) & (df['nameDest'].str.startswith('M') != True)]

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
9,1,DEBIT,5337.77,C712410124,41720.00,36382.23,C195600860,41898.00,40348.79,0,0
10,1,DEBIT,9644.94,C1900366749,4465.00,0.00,C997608398,10845.00,157982.12,0,0
15,1,CASH_OUT,229133.94,C905080434,15325.00,0.00,C476402209,5083.00,51513.44,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362614,743,TRANSFER,339682.13,C2013999242,339682.13,0.00,C1850423904,0.00,0.00,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0
6362617,743,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0
6362618,743,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0


We see that most of the rows have this discrepancy. Let's also check for cases where both the balances are zero.



In [None]:
df[(df['newbalanceDest'] == 0.00) & (df['oldbalanceDest'] == 0.00) & (df['nameDest'].str.startswith('M') != True)]

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,TRANSFER,181.00,C1305486145,181.00,0.0,C553264065,0.0,0.0,1,0
251,1,TRANSFER,2806.00,C1420196421,2806.00,0.0,C972765878,0.0,0.0,1,0
680,1,TRANSFER,20128.00,C137533655,20128.00,0.0,C1848415041,0.0,0.0,1,0
969,1,TRANSFER,1277212.77,C1334405552,1277212.77,0.0,C431687661,0.0,0.0,1,0
1115,1,TRANSFER,35063.63,C1364127192,35063.63,0.0,C1136419747,0.0,0.0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
6362610,742,TRANSFER,63416.99,C778071008,63416.99,0.0,C1812552860,0.0,0.0,1,0
6362612,743,TRANSFER,1258818.82,C1531301470,1258818.82,0.0,C1470998563,0.0,0.0,1,0
6362614,743,TRANSFER,339682.13,C2013999242,339682.13,0.0,C1850423904,0.0,0.0,1,0
6362616,743,TRANSFER,6311409.28,C1529008245,6311409.28,0.0,C1881841831,0.0,0.0,1,0


Among non-merchant transactions, about 166k have both the old and the new destination balance at zero. These can be considered missing values.


We can also see that none of the merchant transactions are fraudulent.

In [None]:
df[(df['nameDest'].str.startswith('M') == True)]['isFraud'].value_counts()

0    2151495
Name: isFraud, dtype: int64

Let's see whether this behaviour (both balances being zero) is indicative of a fraudulent transaction. We will check how often it occurs in fraudulent transactions and in genuine ones.

In [None]:
df[(df['newbalanceDest'] == 0.00) & (df['oldbalanceDest'] == 0.00) & (df['nameDest'].str.startswith('M') != True)]['isFraud'].value_counts()

0    161711
1      4076
Name: isFraud, dtype: int64

As can be seen, nearly 50% of the fraudulent transactions (4076 of 8213) show this behaviour, whereas it is far less common among the genuine ones (roughly 4% of the genuine non-merchant transactions).
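The two rates can be worked out from the counts already printed in this notebook. This is just arithmetic on those outputs; the variable names are ours.

```python
# Counts copied from the outputs above.
fraud_total = 8213            # all fraudulent transactions
genuine_total = 6354407       # all genuine transactions
merchant_genuine = 2151495    # merchant transactions, none of them fraudulent
zero_dest_fraud = 4076        # fraud rows with both destination balances at zero
zero_dest_genuine = 161711    # genuine non-merchant rows with both at zero

# The filter above only looked at non-merchant transactions.
genuine_non_merchant = genuine_total - merchant_genuine

fraud_rate = zero_dest_fraud / fraud_total
genuine_rate = zero_dest_genuine / genuine_non_merchant
print(f"{fraud_rate:.1%} of fraudulent vs {genuine_rate:.1%} of genuine")
```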

**We can add a new 'destDiscrepancy' feature that can capture this.**

We will make it a categorical variable with the following three values: Merchant, Yes and No.

Now let's try to use some visualisation to learn more about the dataset. Since frauds only happen in two particular types of transactions, we will use a reduced dataset with only these types for our visualisations. This way the effect of the different features is captured better, as the observations are not swayed by the other types of transactions, which are completely non-fraudulent.

In [None]:
df_reduced = df[df['type'].isin(['CASH_OUT', 'TRANSFER'])]
df_reduced.shape

(2770409, 11)

Let's analyse the 'isFlaggedFraud' feature.

In [None]:
df['isFlaggedFraud'].value_counts()

0    6362604
1         16
Name: isFlaggedFraud, dtype: int64

There are only 16 transactions that have been flagged as fraud. Let's see whether those transactions were actually fraudulent.

In [None]:
df[df['isFlaggedFraud'] == 1]['isFraud'].value_counts()

1    16
Name: isFraud, dtype: int64

All these 16 transactions were actually fraudulent. Let's see if these transactions belong to a certain type.

In [None]:
df[df['isFlaggedFraud'] == 1]['type'].value_counts()

TRANSFER    16
Name: type, dtype: int64

All of these were transfers. Prima facie, this feature doesn't tell us much. It wouldn't hurt if this feature is dropped.

Now let's get to the 'step' feature. Each step denotes one hour, so let's take the step modulo 24, which will indicate the hour of the day in which the transaction occurred.

In [None]:
df['step'] = df['step'].mod(24)
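To see what this mapping does, here is a tiny sketch on a handful of step values (the outputs above show steps from 1 up to 743). Which clock hour step 1 corresponds to is not stated here, so the result is best read as an hour offset within a 24-hour cycle.

```python
import pandas as pd

# A few step values picked for illustration.
steps = pd.Series([1, 23, 24, 25, 48, 743])

# Values fall in the range 0..23; steps 24, 48, ... map to 0.
hours = steps.mod(24)
print(hours.tolist())  # [1, 23, 0, 1, 0, 23]
```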

Now let's go ahead with the visualisations. Let's define some functions for our visualisations.

In [None]:
def dist(feature, log_type=False):
    """Plot a histogram of `feature` from df_reduced and print its range."""
    plt.hist(df_reduced[feature], log=log_type)
    plt.xlabel(feature)
    num1 = df_reduced[feature].max()
    num2 = df_reduced[feature].min()
    print(f"The maximum and minimum values in '{feature}' are {num1} and {num2}")

### Feature Engineering

Now let's create the two features, 'entireAmount' and 'destDiscrepancy'

In [None]:
#entireAmount

df_final = df.copy()

In [None]:
df_final['entireAmount'] = (df['amount'] == df['oldbalanceOrg']).astype(int)

In [None]:
df_final

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,entireAmount
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0,1
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0,1
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
6362615,23,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0,1
6362616,23,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0,1
6362617,23,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0,1
6362618,23,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0,1


In [None]:
#destDiscrepancy

def discrepancy(row):
    # Called once per row via df.apply(..., axis=1), so `row` is a Series.
    if row['nameDest'].startswith('M'):
        return 'Merchant'
    elif (row['newbalanceDest'] == 0.00) and (row['oldbalanceDest'] == 0.00):
        return 'Yes'
    else:
        return 'No'

In [None]:
df_final['destDiscrepancy'] = df.apply(discrepancy, axis=1)
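A row-wise `apply` can be slow on more than 6 million rows. One possible alternative, which is not what this notebook uses, is to build the same labels with `np.select` on whole columns. The `toy` frame below is invented; only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows; values are invented, column names follow the dataset.
toy = pd.DataFrame({
    "nameDest": ["M1979787155", "C553264065", "C38997010"],
    "oldbalanceDest": [0.0, 0.0, 21182.0],
    "newbalanceDest": [0.0, 0.0, 0.0],
})

is_merchant = toy["nameDest"].str.startswith("M").to_numpy()
both_zero = ((toy["oldbalanceDest"] == 0) & (toy["newbalanceDest"] == 0)).to_numpy()

# np.select takes the first matching condition per row, so merchants win
# over the zero-balance check, mirroring the if/elif order of discrepancy().
toy["destDiscrepancy"] = np.select(
    [is_merchant, both_zero], ["Merchant", "Yes"], default="No"
)
print(toy["destDiscrepancy"].tolist())  # ['Merchant', 'Yes', 'No']
```

The order of the conditions matters: merchant rows also have zero balances, so the merchant check has to come first.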

In [None]:
df_final

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,entireAmount,destDiscrepancy
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0,0,Merchant
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0,0,Merchant
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0,1,Yes
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0,1,No
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0,0,Merchant
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6362615,23,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0,1,No
6362616,23,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0,1,Yes
6362617,23,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0,1,No
6362618,23,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0,1,Yes


Let's also create another feature indicating whether the payment is made to a merchant or not.

In [None]:
#Merchant
df_final['merchant'] = (df_final['nameDest'].str.startswith('M')).astype(int)

In [None]:
df_final

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,entireAmount,destDiscrepancy,merchant
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0,0,Merchant,1
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0,0,Merchant,1
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0,1,Yes,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0,1,No,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0,0,Merchant,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6362615,23,CASH_OUT,339682.13,C786484425,339682.13,0.00,C776919290,0.00,339682.13,1,0,1,No,0
6362616,23,TRANSFER,6311409.28,C1529008245,6311409.28,0.00,C1881841831,0.00,0.00,1,0,1,Yes,0
6362617,23,CASH_OUT,6311409.28,C1162922333,6311409.28,0.00,C1365125890,68488.84,6379898.11,1,0,1,No,0
6362618,23,TRANSFER,850002.52,C1685995037,850002.52,0.00,C2080388513,0.00,0.00,1,0,1,Yes,0


Now let's drop the redundant features and then use one-hot encoding to create the data for modelling.

In [None]:
df_final.drop(columns = ['nameOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest'], inplace = True)

In [None]:
df_final

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,isFraud,isFlaggedFraud,entireAmount,destDiscrepancy,merchant
0,1,PAYMENT,9839.64,170136.00,160296.36,0,0,0,Merchant,1
1,1,PAYMENT,1864.28,21249.00,19384.72,0,0,0,Merchant,1
2,1,TRANSFER,181.00,181.00,0.00,1,0,1,Yes,0
3,1,CASH_OUT,181.00,181.00,0.00,1,0,1,No,0
4,1,PAYMENT,11668.14,41554.00,29885.86,0,0,0,Merchant,1
...,...,...,...,...,...,...,...,...,...,...
6362615,23,CASH_OUT,339682.13,339682.13,0.00,1,0,1,No,0
6362616,23,TRANSFER,6311409.28,6311409.28,0.00,1,0,1,Yes,0
6362617,23,CASH_OUT,6311409.28,6311409.28,0.00,1,0,1,No,0
6362618,23,TRANSFER,850002.52,850002.52,0.00,1,0,1,Yes,0


In [None]:
#one-hot encoding
X = df_final.drop(columns='isFraud')
y = df_final['isFraud']
X = pd.get_dummies(X)
X.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,isFlaggedFraud,entireAmount,merchant,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,destDiscrepancy_Merchant,destDiscrepancy_No,destDiscrepancy_Yes
0,1,9839.64,170136.0,160296.36,0,0,1,0,0,0,1,0,1,0,0
1,1,1864.28,21249.0,19384.72,0,0,1,0,0,0,1,0,1,0,0
2,1,181.0,181.0,0.0,0,1,0,0,0,0,0,1,0,0,1
3,1,181.0,181.0,0.0,0,1,0,0,1,0,0,0,0,1,0
4,1,11668.14,41554.0,29885.86,0,0,1,0,0,0,1,0,1,0,0


We will now split the dataset into train, dev and test sets in a 90-5-5 ratio. An explicit dev set is important here because we can't use cross-validation on the training set: we undersample it to tackle the class imbalance, so the training set is no longer representative of the real-world class distribution and scores computed on it would not generalise.

In [None]:
#train-test-dev (90-5-5) sets split 
from sklearn.model_selection import train_test_split

X_1, X_test, y_1, y_test = train_test_split(X, y, test_size=0.05, random_state=3)
X_train, X_dev, y_train, y_dev = train_test_split(X_1, y_1, test_size=1/19, random_state=3)
print(X_train.shape,X_dev.shape,X_test.shape,y_train.shape,y_dev.shape,y_test.shape)

(5726358, 15) (318131, 15) (318131, 15) (5726358,) (318131,) (318131,)
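The split above is purely random. With so few fraud cases, one option worth knowing (an alternative, not what this notebook does) is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in each part. The sketch below uses synthetic labels with a 5% positive rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 5% positive rate; the real data is far more skewed.
y = np.array([1] * 50 + [0] * 950)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class ratio (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=3, stratify=y
)
print(y_tr.mean(), y_te.mean())
```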


We will create an undersampled dataset using the imblearn package.

In [None]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_under, y_under = rus.fit_resample(X_train, y_train)
X_under.shape



(14820, 15)
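With its default settings, random undersampling keeps every minority row and draws an equally sized random sample from the majority class, which is consistent with the 14820 rows above (7410 per class). The idea can be sketched in plain pandas; the `train` frame below is invented and this is not the imblearn implementation.

```python
import pandas as pd

# Invented training data: 4 fraud rows among 20.
train = pd.DataFrame({
    "amount": range(20),
    "isFraud": [1, 0, 0, 0, 0] * 4,
})

# Size of the smallest class.
n_minority = train["isFraud"].value_counts().min()

# Draw n_minority random rows from every class.
balanced = train.groupby("isFraud").sample(n=n_minority, random_state=0)
print(balanced["isFraud"].value_counts().to_dict())
```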

We will be using the XGBoost model for building our fraud detector.

In [None]:
import xgboost as xgb

XGBC = xgb.XGBClassifier(max_depth = 5, verbose = 1)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_under = scaler.fit_transform(X_under)
X_dev_under = scaler.transform(X_dev)
X_test_under = scaler.transform(X_test)

In [None]:
XGBC.fit(X_train_under,y_under)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbose=1, verbosity=1)

In [None]:
y_pred = XGBC.predict(X_test_under)

In [None]:
from sklearn.metrics import f1_score

In [None]:
print(f'The f1 score on the test dataset is {f1_score(y_test, y_pred)}')

0.9987357774968395
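As a reminder of what this metric measures: the F1 score is the harmonic mean of precision and recall for the positive (fraud) class, and it ignores true negatives, which is why it is more informative than accuracy on imbalanced data. The helper and the confusion counts below are hypothetical and are not results from this notebook.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts, chosen only to illustrate the formula.
tp, fp, fn = 90, 30, 10
print(round(f1_from_counts(tp, fp, fn), 4))  # 0.8182
```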