# Table of Contents

1. [Initial Setup](#Initial-Setup)
2. [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-(EDA))
    1. [Dataset Exploration](#Dataset-Exploration)

# Dataset Description

I am using the PaySim dataset for this project, which can be found on Kaggle (here is the [link](https://www.kaggle.com/datasets/mtalaltariq/paysim-data) to the dataset). PaySim is a **synthetic** financial transaction dataset that simulates mobile money operations based on real transactional patterns. It was generated using aggregated and anonymized data from a mobile financial service provider (examples of this kind of service include JazzCash and Easypaisa).

The dataset is commonly used for fraud detection research and machine learning model training, allowing analysts and data scientists to explore techniques for identifying suspicious or fraudulent activity without using real customer data.

<ins> Difference between a mobile financial system and a digital bank </ins>

A mobile financial system (or mobile money service) is not a traditional bank. These services are typically operated by telecommunications providers, not by licensed banks. While they offer bank-like features—such as money transfers, bill payments, and mobile wallet storage—they are not registered financial institutions.

The key distinction lies in access and structure:

* Mobile financial systems are SIM card–based, allowing users to send and receive money using their mobile phone number without needing a formal bank account.
* In contrast, digital banks require users to open an official bank account, identified by an account number, and are regulated under banking laws.

Additionally, mobile money services primarily focus on payments and transfers, whereas banks provide a broader range of services, such as savings accounts, loans, interest-bearing deposits, and official bank statements.

<ins> How are the rules to detect financial fraud formulated? </ins>

Three main types of institutions define the rules for detecting financial crime: governments, international bodies, and the financial institutions themselves (banks, fintechs, and mobile financial systems).

| Institution | Purpose |
| :------- | :------: |
| Governments | Define the legal requirements (e.g., reporting financial crime) |
| International bodies (ICA, CFCS, ACAMS) | Define the frameworks for identifying fraudulent transactions |
| Banks/Fintechs/Mobile banking providers | Define the actual rules for flagging specific transactions as fraudulent |

**Example**: \\$10,000 threshold rule

* The U.S. government defined the \\$10,000 threshold rule in the Bank Secrecy Act (BSA) of 1970, requiring financial institutions to report large transactions.
* International bodies then issued guidelines that tell banks, fintechs, and mobile banking providers how to apply this law. For example, they advise that financial institutions should have systems in place to detect unusual activity over \\$10,000.
* The banks/fintechs/mobile banking providers then establish internal systems that automatically detect and flag any transaction over \\$10,000.
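
As a rough illustration of the last step, the sketch below flags every transaction above a fixed threshold. The function name `flag_large_transactions`, the record layout, and the sample amounts are assumptions made up for this example; real institutions keep their actual rules private.

```python
# Hypothetical sketch of a fixed-threshold rule; not any institution's real system.
REPORTING_THRESHOLD = 10_000  # the threshold discussed above


def flag_large_transactions(transactions, threshold=REPORTING_THRESHOLD):
    """Return the transactions whose amount is strictly over the threshold."""
    return [t for t in transactions if t["amount"] > threshold]


# Made-up sample transactions, for illustration only
sample_transactions = [
    {"id": "T1", "amount": 9_500.00},
    {"id": "T2", "amount": 10_000.00},
    {"id": "T3", "amount": 12_250.00},
]

flagged = flag_large_transactions(sample_transactions)
print([t["id"] for t in flagged])
```

A transaction of exactly 10,000 is not flagged here because the rule is written as "over" the threshold. Whether the boundary is inclusive is a design choice that this sketch does not settle.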

<ins> What are some rules that indicate fraudulent transactions? </ins>

It is not possible to list the specific rules that each financial institution uses to detect financial crime. Institutions keep their rules and methods private, both to maintain a competitive advantage and to prevent criminals from finding workarounds in the financial system.

What can be described is where those rules come from. Financial institutions build internal fraud detection systems on top of government regulations (e.g., the \\$10,000 threshold rule) and international guidance (FATF, ICA, ACAMS). Governments and international bodies provide the legal and framework-level requirements, and each institution then designs its own monitoring strategies and thresholds to identify potentially fraudulent activity.

* Governments set legally binding rules and reporting requirements, such as the $10,000 threshold rule, which requires institutions to flag or report transactions above a certain amount.
* International bodies (e.g., FATF, ACAMS, CFCS, ICA) issue guidelines and typologies describing methods that criminals use to launder money or commit fraud — such as smurfing (structuring transactions to avoid thresholds) and the use of shell companies (fake or inactive businesses used to conceal ownership or move illicit funds).
    
<ins> What machine learning models are used to detect fraudulent transactions? </ins>

Machine learning models for fraud detection are typically divided into supervised and unsupervised approaches, depending on whether the dataset contains labels that identify which transactions are fraudulent.

**Supervised Learning Models**: These models are trained using labeled data — where each transaction is marked as fraudulent or legitimate.

Examples include:

* Logistic Regression
* Decision Trees / Random Forests (classification)
* Gradient Boosting (XGBoost, LightGBM)
* Neural Networks

The model learns patterns from the labeled data and applies these patterns to predict whether new transactions are fraudulent.
(Example: The PaySim dataset is synthetic and labeled, so it can be used to train supervised models.)

**Unsupervised Learning Models**: These are used when fraud labels are not available. The model attempts to find unusual patterns or “anomalies” in the data that deviate from normal behavior.

Examples include:

* Clustering (e.g., K-Means, DBSCAN)
* Autoencoders (Dimensionality Reduction)
* Isolation Forest
* One-Class SVM (Anomaly Detection)

These methods suit real-world banking environments, where labeled fraud is rare and fraudulent behavior is constantly evolving.

In practice, many financial institutions use hybrid systems, combining supervised and unsupervised models. For example, an anomaly detection system might flag suspicious transactions, and a supervised model can then classify whether those are likely to be fraudulent.

# Initial Setup

Installing necessary packages 

In [1]:
#pip install kagglehub

Libraries

In [2]:
import os #current working directory, changing directories, listing files in directory
import kagglehub #connect to Kaggle’s API and easily download datasets

Checking current working directory

In [3]:
current_directory = os.getcwd()
print("Current working directory:", current_directory)

Current working directory: C:\Users\hussainsarfraz\0_finecrime_online_payment_project


Connecting to the Kaggle API to download the dataset and obtain the path to the downloaded files

In [4]:
# Download latest version
path = kagglehub.dataset_download("mtalaltariq/paysim-data")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\hussainsarfraz\.cache\kagglehub\datasets\mtalaltariq\paysim-data\versions\1


Setting working directory

In [5]:
# Use the path returned by kagglehub instead of hard-coding a user-specific absolute path
os.chdir(path)

print(os.getcwd())

C:\Users\hussainsarfraz\.cache\kagglehub\datasets\mtalaltariq\paysim-data\versions\1


Checking files in current working directory

In [6]:
# Get the list of all files and directories in the current working directory
contents = os.listdir()
print("Contents of the current directory (including files and folders):")
for item in contents:
    print(item)

Contents of the current directory (including files and folders):
paysim dataset.csv
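
An alternative to changing the working directory is to build the file path from the folder that `kagglehub.dataset_download` returned. The sketch below uses a hypothetical helper, `find_csv_files`, and a temporary folder standing in for the download directory, so it runs without a Kaggle connection.

```python
from pathlib import Path
import tempfile


def find_csv_files(dataset_dir):
    """Return the CSV files directly inside dataset_dir, sorted by name."""
    return sorted(Path(dataset_dir).glob("*.csv"))


# Demonstration on a throwaway directory that mimics the download folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "paysim dataset.csv").write_text("step,type\n1,PAYMENT\n")
    (Path(tmp) / "notes.txt").write_text("not a dataset\n")
    found_names = [p.name for p in find_csv_files(tmp)]

print(found_names)
```

In the notebook, the first entry of `find_csv_files(path)` could then be passed directly to `pd.read_csv`, which accepts `Path` objects.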


# Exploratory Data Analysis (EDA)

## Dataset Exploration

<ins> Column Descriptions </ins>

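| # | Column | Dtype |
| :------- | :------: | :------: |
| 0 | step | int64 |
| 1 | type | object |
| 2 | amount | float64 |
| 3 | nameOrig | object |
| 4 | oldbalanceOrg | float64 |
| 5 | newbalanceOrig | float64 |
| 6 | nameDest | object |
| 7 | oldbalanceDest | float64 |
| 8 | newbalanceDest | float64 |
| 9 | isFraud | int64 |
| 10 | isFlaggedFraud | int64 |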


| Column Name | Common Description | `PAYMENT` |
| :------- | :------: | :------: |
| step | The number of hours into the PaySim simulation at which the transaction took place |  |
| type | The type of transaction (PAYMENT, CASH_OUT, CASH_IN, TRANSFER, DEBIT) |  |
| amount | The amount of money moved by the transaction | The price of the item the customer purchased |
| nameOrig | The name of the account that sends the money | Only includes customer accounts |
| oldbalanceOrg | The sender's balance before the transaction |  |
| newbalanceOrig | The sender's balance after the transaction |  |
| nameDest | The name of the account that receives the money | Only includes merchant accounts |
| oldbalanceDest | The recipient's balance before the transaction |  |
| newbalanceDest | The recipient's balance after the transaction |  |
| isFraud | 1 if the transaction was actually fraudulent, 0 otherwise |  |
| isFlaggedFraud | 1 if the transaction was flagged as fraud by the detection system already in place in the simulation, 0 otherwise (according to the original PaySim documentation, this is a rule that flags attempts to transfer more than 200,000 in a single transaction) |  |

Here is a description of the different transaction types:
* `PAYMENT`: A customer pays a merchant
* `CASH_OUT`: A customer withdraws money from their account
* `CASH_IN`: A customer deposits money into their account
* `TRANSFER`: A customer sends money to another customer
* `DEBIT`: A bank removes money from a customer's account (like fees or chargebacks)

The columns mean something different depending on the transaction type. Here is the description of each column by transaction type:

`PAYMENT`: A customer pays a merchant



Questions
* At what hour into the simulation was the number of fraudulent transactions the highest?
* Which transaction types had the most fraud? Which had the most flagged fraud? Compare whether the two results are similar.
* Are there any particular naming conventions for the accounts based on the transaction type?
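
The first two questions come down to grouped counts. A minimal sketch of how they could be approached, run here on only the first five rows of the dataset so that it stays self-contained; the variable names are chosen for illustration, and the results on five rows say nothing about the full data.

```python
import pandas as pd

# The first five rows of the dataset (a tiny sample, not the full data)
sample = pd.DataFrame({
    "step": [1, 1, 1, 1, 1],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [0, 0, 1, 1, 0],
    "isFlaggedFraud": [0, 0, 0, 0, 0],
})

# Question 1: number of fraudulent transactions per step, and the step with the most
fraud_per_step = sample.groupby("step")["isFraud"].sum()
busiest_step = fraud_per_step.idxmax()

# Question 2: actual fraud vs. flagged fraud, side by side for each transaction type
fraud_by_type = sample.groupby("type")[["isFraud", "isFlaggedFraud"]].sum()

print("Step with the most fraud:", busiest_step)
print(fraud_by_type)
```

Replacing `sample` with the full DataFrame would apply the same grouping to all rows.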

In [7]:
import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.read_csv('paysim dataset.csv')

In [8]:
df.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


### EDA of `step` column

In [9]:
# '{:.0f}' displays floats with 0 decimal places; this is a global setting that also affects all later outputs
pd.options.display.float_format = '{:.0f}'.format
df['step'].describe()  # used here only to read off the min (1) and max (743) values

count   6362620
mean        243
std         142
min           1
25%         156
50%         239
75%         335
max         743
Name: step, dtype: float64
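
Since `step` counts hours, the range of 1 to 743 covers roughly 31 days. A minimal sketch of converting a step into a day number and an hour of the day follows. The helper name `step_to_day_and_hour` is hypothetical, and it assumes that step 1 is the first hour of the first day, which the dataset itself does not confirm.

```python
HOURS_PER_DAY = 24


def step_to_day_and_hour(step):
    """Split a 1-based hourly step into (day number, hour of day).

    Assumes step 1 is the first hour (hour 0) of day 1.
    """
    day = (step - 1) // HOURS_PER_DAY + 1
    hour = (step - 1) % HOURS_PER_DAY
    return day, hour


print(step_to_day_and_hour(1))    # first step in the data → (1, 0)
print(step_to_day_and_hour(743))  # last step in the data → (31, 22)
```

Because the function only uses arithmetic operators, it also accepts the whole `df['step']` column and returns two Series.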

### EDA of `type` column

In [10]:
df['type'].unique()

array(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'DEBIT', 'CASH_IN'],
      dtype=object)

### EDA of `PAYMENT` type transactions

<ins> Conclusions </ins>

Since the project's goal is to analyse potential indicators of fraud, the `PAYMENT` transactions **should not** be included in the analysis: none of these rows are marked as fraud, so they offer no insight into the indicators of fraud. This does not necessarily mean that financial fraud is never committed through payments. The paysim dataset simply cannot show it, because it does not include the `oldbalanceDest` and `newbalanceDest` values for the merchants who receive the money.

The paysim dataset does not include the financial information of the merchants because the paysim dataset only simulates the behavior of customers. In a real financial dataset, merchant balances would increase after receiving payments.

In [57]:
df['differenceOrg'] = df['oldbalanceOrg'] - df['newbalanceOrig']  # drop in the sender's balance, to check whether it equals the amount column
df[df['type'] == 'PAYMENT'].head(5)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,differenceOrg
0,1,PAYMENT,9840,C1231006815,170136,160296,M1979787155,0,0,0,0,9840
1,1,PAYMENT,1864,C1666544295,21249,19385,M2044282225,0,0,0,0,1864
4,1,PAYMENT,11668,C2048537720,41554,29886,M1230701703,0,0,0,0,11668
5,1,PAYMENT,7818,C90045638,53860,46042,M573487274,0,0,0,0,7818
6,1,PAYMENT,7108,C154988899,183195,176087,M408069119,0,0,0,0,7108


In [33]:
df.columns

Index(['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
       'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud',
       'isFlaggedFraud', 'differenceOrg'],
      dtype='object')

In [63]:
temp_df = df[df['type'] == 'PAYMENT']
temp_df.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,differenceOrg
count,2151495,2151495,2151495,2151495,2151495,2151495,2151495,2151495,2151495
mean,244,13058,68217,61838,0,0,0,0,6379
std,143,12556,198991,196992,0,0,0,0,9530
min,1,0,0,0,0,0,0,0,0
25%,156,4384,0,0,0,0,0,0,0
50%,249,9482,10530,0,0,0,0,0,2136
75%,335,17561,60883,49654,0,0,0,0,9713
max,718,238638,43686616,43673802,0,0,0,0,185123


In [64]:
# Keep only the columns of interest; 'step', 'oldbalanceDest', 'newbalanceDest', 'isFraud' and 'isFlaggedFraud' are left out
df.loc[df['type'] == 'PAYMENT', ['type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig',
                                 'nameDest', 'differenceOrg']].head(10)

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,differenceOrg
0,PAYMENT,9840,C1231006815,170136,160296,M1979787155,9840
1,PAYMENT,1864,C1666544295,21249,19385,M2044282225,1864
4,PAYMENT,11668,C2048537720,41554,29886,M1230701703,11668
5,PAYMENT,7818,C90045638,53860,46042,M573487274,7818
6,PAYMENT,7108,C154988899,183195,176087,M408069119,7108
7,PAYMENT,7862,C1912850431,176087,168226,M633326333,7862
8,PAYMENT,4024,C1265012928,2671,0,M1176932104,2671
11,PAYMENT,3100,C249177573,20771,17671,M2096539129,3100
12,PAYMENT,2561,C1648232591,5070,2509,M972865270,2561
13,PAYMENT,11634,C1716932897,10127,0,M801569151,10127
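
The displayed rows suggest that `differenceOrg` matches `amount` except when the payment is larger than the sender's starting balance (rows 8 and 13), in which case the balance simply drops to 0. The sketch below checks this on four of those rows. It is an observation about this small sample, not a verified property of the full dataset, and the column names `fullAmountDeducted` and `exceedsBalance` are introduced here for illustration.

```python
import pandas as pd

# Four of the PAYMENT rows displayed above (values rounded as in the output)
rows = pd.DataFrame(
    {
        "amount": [9840, 1864, 4024, 11634],
        "oldbalanceOrg": [170136, 21249, 2671, 10127],
        "newbalanceOrig": [160296, 19385, 0, 0],
    },
    index=[0, 1, 8, 13],
)

rows["differenceOrg"] = rows["oldbalanceOrg"] - rows["newbalanceOrig"]

# Tolerance of 1 because the displayed values are rounded to whole numbers
TOLERANCE = 1

# True where the balance dropped by the full transaction amount
rows["fullAmountDeducted"] = (rows["amount"] - rows["differenceOrg"]).abs() <= TOLERANCE

# True where the payment was larger than the balance the sender had available
rows["exceedsBalance"] = rows["amount"] > rows["oldbalanceOrg"]

print(rows)
```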


For `PAYMENT` transactions, every value in the `nameOrig` column starts with the letter C and every value in the `nameDest` column starts with the letter M. I believe that names beginning with C represent customers, while names beginning with M represent merchants.

In [46]:
temp_df['nameOrig'].str[0].unique()

array(['C'], dtype=object)

In [48]:
temp_df['nameDest'].str[0].unique()

array(['M'], dtype=object)
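
The same prefix check can be run for every transaction type at once with a cross-tabulation. A minimal sketch on a handful of rows copied from the outputs in this notebook; the names `sample` and `dest_prefix_by_type` are chosen here for illustration, and six rows are far too few to draw conclusions about the full dataset.

```python
import pandas as pd

# A few (type, nameOrig, nameDest) rows copied from the outputs in this notebook
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"],
    "nameOrig": ["C1231006815", "C1666544295", "C1305486145",
                 "C840083671", "C712410124", "C1862994526"],
    "nameDest": ["M1979787155", "M2044282225", "C553264065",
                 "C38997010", "C195600860", "C1688019098"],
})

# Count the first letter of nameDest for each transaction type
dest_prefix_by_type = pd.crosstab(
    sample["type"], sample["nameDest"].str[0], colnames=["destPrefix"]
)

print(dest_prefix_by_type)
```

Swapping `nameDest` for `nameOrig` gives the corresponding table for the sending accounts.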

### EDA of `TRANSFER` type transactions

In [22]:
df[df['type'] == 'TRANSFER'].head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,TRANSFER,181,C1305486145,181,0,C553264065,0,0,1,0
19,1,TRANSFER,215310,C1670993182,705,0,C1100439041,22425,0,0,0
24,1,TRANSFER,311686,C1984094095,10835,0,C932583850,6267,2719173,0,0
58,1,TRANSFER,62611,C1976401987,79114,16503,C1937962514,517,8383,0,0
78,1,TRANSFER,42712,C283039401,10363,0,C1330106945,57902,24044,0,0
79,1,TRANSFER,77958,C207471778,0,0,C1761291320,94900,22234,0,0
80,1,TRANSFER,17231,C1243171897,0,0,C783286238,24672,0,0,0
81,1,TRANSFER,78766,C1376151044,0,0,C1749186397,103772,277515,0,0
82,1,TRANSFER,224607,C873175411,0,0,C766572210,354679,0,0,0
83,1,TRANSFER,125873,C1443967876,0,0,C392292416,348512,3420103,0,0


In [23]:
df[df['type'] == 'CASH_OUT'].head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
3,1,CASH_OUT,181,C840083671,181,0,C38997010,21182,0,1,0
15,1,CASH_OUT,229134,C905080434,15325,0,C476402209,5083,51513,0,0
42,1,CASH_OUT,110415,C768216420,26845,0,C1509514333,288800,2415,0,0
47,1,CASH_OUT,56954,C1570470538,1942,0,C824009085,70253,64106,0,0
48,1,CASH_OUT,5347,C512549200,0,0,C248609774,652637,6453431,0,0
51,1,CASH_OUT,23261,C2072313080,20412,0,C2001112025,25742,0,0,0
60,1,CASH_OUT,82940,C1528834618,3018,0,C476800120,132372,49864,0,0
70,1,CASH_OUT,47459,C527211736,209535,162076,C2096057945,52120,0,0,0
71,1,CASH_OUT,136873,C1533123860,162076,25203,C766572210,217806,0,0,0
72,1,CASH_OUT,94253,C1718906711,25203,0,C977993101,99773,965870,0,0


In [24]:
df[df['type'] == 'DEBIT'].head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
9,1,DEBIT,5338,C712410124,41720,36382,C195600860,41898,40349,0,0
10,1,DEBIT,9645,C1900366749,4465,0,C997608398,10845,157982,0,0
21,1,DEBIT,9303,C1566511282,11299,1996,C1973538135,29832,16897,0,0
22,1,DEBIT,1065,C1959239586,1817,752,C515132998,10330,0,0,0
41,1,DEBIT,5759,C1466917878,32604,26845,C1297685781,209699,16997,0,0
59,1,DEBIT,5529,C867288517,8547,3018,C242131142,10206,0,0,0
61,1,DEBIT,4510,C280615803,10256,5746,C1254526270,10697,0,0,0
62,1,DEBIT,8728,C166694583,882770,874042,C1129670968,12636,0,0,0
64,1,DEBIT,4874,C811207775,153,0,C1971489295,253104,0,0,0
68,1,DEBIT,5150,C1955990522,4782,0,C1330106945,52752,24044,0,0


In [25]:
df[df['type'] == 'CASH_IN'].head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
389,1,CASH_IN,143236,C1862994526,0,143236,C1688019098,608932,97264,0,0
390,1,CASH_IN,228452,C1614133563,143236,371688,C2083562754,719678,1186557,0,0
391,1,CASH_IN,35902,C839771540,371688,407591,C2001112025,49003,0,0,0
392,1,CASH_IN,232954,C1037163664,407591,640544,C33524623,1172672,1517262,0,0
393,1,CASH_IN,65913,C180316302,640544,706457,C1330106945,104198,24044,0,0
394,1,CASH_IN,193493,C1200546947,706457,899950,C1531333864,1247284,55975,0,0
395,1,CASH_IN,60837,C443713699,899950,960787,C488044861,143436,5204,0,0
396,1,CASH_IN,62325,C695530017,960787,1023112,C564160838,1880272,1254956,0,0
397,1,CASH_IN,349641,C1493042329,1023112,1372752,C909295153,360951,5602235,0,0
398,1,CASH_IN,135324,C1751403001,1372752,1508076,C453211571,1356748,3461666,0,0


Number of Rows and Columns

In [11]:
print('Number of rows:', df.shape[0],
      '\nNumber of columns:', df.shape[1])  # approximately 6 million rows

Number of rows: 6362620 
Number of columns: 11


Summary functions: `info()` and `describe()`

Description of Columns

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


In [13]:
df.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620,6362620,6362620,6362620,6362620,6362620,6362620,6362620
mean,243,179862,833883,855114,1100702,1224996,0,0
std,142,603858,2888243,2924049,3399180,3674129,0,0
min,1,0,0,0,0,0,0,0
25%,156,13390,0,0,0,0,0,0
50%,239,74872,14208,0,132706,214661,0,0
75%,335,208721,107315,144258,943037,1111909,0,0
max,743,92445517,59585040,49585040,356015889,356179279,1,1


In [14]:
df[(df['isFraud'] == 1) & (df['isFlaggedFraud'] == 0)].head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2,1,TRANSFER,181,C1305486145,181,0,C553264065,0,0,1,0
3,1,CASH_OUT,181,C840083671,181,0,C38997010,21182,0,1,0
251,1,TRANSFER,2806,C1420196421,2806,0,C972765878,0,0,1,0
252,1,CASH_OUT,2806,C2101527076,2806,0,C1007251739,26202,0,1,0
680,1,TRANSFER,20128,C137533655,20128,0,C1848415041,0,0,1,0
681,1,CASH_OUT,20128,C1118430673,20128,0,C339924917,6268,12146,1,0
724,1,CASH_OUT,416001,C749981943,0,0,C667346055,102,9291620,1,0
969,1,TRANSFER,1277213,C1334405552,1277213,0,C431687661,0,0,1,0
970,1,CASH_OUT,1277213,C467632528,1277213,0,C716083600,0,2444985,1,0
1115,1,TRANSFER,35064,C1364127192,35064,0,C1136419747,0,0,1,0


In [15]:
df[(df['isFraud'] == 1) & (df['isFlaggedFraud'] == 1)].head(10)

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
2736446,212,TRANSFER,4953893,C728984460,4953893,4953893,C639921569,0,0,1,1
3247297,250,TRANSFER,1343002,C1100582606,1343002,1343002,C1147517658,0,0,1,1
3760288,279,TRANSFER,536624,C1035541766,536624,536624,C1100697970,0,0,1,1
5563713,387,TRANSFER,4892193,C908544136,4892193,4892193,C891140444,0,0,1,1
5996407,425,TRANSFER,10000000,C689608084,19585040,19585040,C1392803603,0,0,1,1
5996409,425,TRANSFER,9585040,C452586515,19585040,19585040,C1109166882,0,0,1,1
6168499,554,TRANSFER,3576297,C193696150,3576297,3576297,C484597480,0,0,1,1
6205439,586,TRANSFER,353874,C1684585475,353874,353874,C1770418982,0,0,1,1
6266413,617,TRANSFER,2542664,C786455622,2542664,2542664,C661958277,0,0,1,1
6281482,646,TRANSFER,10000000,C19004745,10399045,10399045,C1806199534,0,0,1,1
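
In the flagged rows above, every transaction is a `TRANSFER`, the sender's balance is the same before and after, and the amounts are all large. This pattern would be consistent with a rule that blocks very large transfers, though that is an interpretation rather than something the output proves. A small sketch of checking the pattern on four of the displayed rows (variable names are illustrative):

```python
import pandas as pd

# Four of the flagged rows displayed above
flagged = pd.DataFrame(
    {
        "type": ["TRANSFER"] * 4,
        "amount": [4953893, 536624, 10000000, 353874],
        "oldbalanceOrg": [4953893, 536624, 19585040, 353874],
        "newbalanceOrig": [4953893, 536624, 19585040, 353874],
    },
    index=[2736446, 3760288, 5996407, 6205439],
)

# Did the sender's balance stay the same in every row?
balance_unchanged = (flagged["oldbalanceOrg"] == flagged["newbalanceOrig"]).all()

# Smallest flagged amount in this sample
smallest_flagged_amount = flagged["amount"].min()

print("Sender balance unchanged in every row:", balance_unchanged)
print("Smallest flagged amount:", smallest_flagged_amount)
```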


In [16]:
df[['type', 'isFraud']].value_counts()

type      isFraud
CASH_OUT  0          2233384
PAYMENT   0          2151495
CASH_IN   0          1399284
TRANSFER  0           528812
DEBIT     0            41432
CASH_OUT  1             4116
TRANSFER  1             4097
Name: count, dtype: int64

In [17]:
df[['type']].value_counts()

type    
CASH_OUT    2237500
PAYMENT     2151495
CASH_IN     1399284
TRANSFER     532909
DEBIT         41432
Name: count, dtype: int64

In [18]:
df[['isFraud']].value_counts()

isFraud
0          6354407
1             8213
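
The raw counts above can be turned into fraud rates, which are easier to compare across transaction types of very different sizes. A minimal sketch, using only the counts printed above (the variable names are chosen here for illustration):

```python
import pandas as pd

# Counts copied from the value_counts() outputs above
total_by_type = pd.Series({
    "CASH_OUT": 2237500, "PAYMENT": 2151495, "CASH_IN": 1399284,
    "TRANSFER": 532909, "DEBIT": 41432,
})
fraud_by_type = pd.Series({"CASH_OUT": 4116, "TRANSFER": 4097})

# Types with no fraud rows are missing from fraud_by_type, so fill them with 0
fraud_by_type = fraud_by_type.reindex(total_by_type.index, fill_value=0)

# Fraud rate per type and overall, as percentages
fraud_rate_pct = (fraud_by_type / total_by_type * 100).round(3)
overall_rate_pct = round(fraud_by_type.sum() / total_by_type.sum() * 100, 3)

print(fraud_rate_pct.sort_values(ascending=False))
print("Overall fraud rate (%):", overall_rate_pct)
```

On these counts, `TRANSFER` has a noticeably higher fraud rate than `CASH_OUT` (roughly 0.77% versus 0.18%), even though both types contain about 4,100 fraudulent rows.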
Name: count, dtype: int64