1. Understand the Dataset

Introduction: This dataset is a synthetic representation of mobile money transactions created using PaySim, simulating real-world financial activities and fraudulent behaviors. It spans a simulated period of 30 days and includes various transaction types.
**Structure**
step: Unit of time, 1 step = 1 hour.
type: Transaction type (CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER).
amount: Transaction amount.
nameOrig: Initiator of the transaction.
oldbalanceOrg: Initial balance before the transaction (not for fraud analysis).
newbalanceOrig: New balance after the transaction (not for fraud analysis).
nameDest: Recipient of the transaction.
oldbalanceDest: Initial recipient's balance before the transaction (not for fraud analysis).
newbalanceDest: New recipient's balance after the transaction (not for fraud analysis).
isFraud: Indicator if the transaction is fraudulent.
isFlaggedFraud: Indicator if the transaction is flagged for potential fraud.
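
Because one step equals one hour, a step value can be turned into a day and an hour of the simulation with integer arithmetic. The following is a minimal sketch, assuming steps are numbered from 1 and that step 1 is the first hour of day 1; the column names `day` and `hour` and the sample step values are illustrative and not part of the dataset.

```python
import pandas as pd

# Illustrative step values; in the notebook these would come from df["step"].
steps = pd.DataFrame({"step": [1, 24, 25, 100]})

# Assumption: step 1 is the first hour of day 1, so shift to 0-based before dividing.
steps["day"] = (steps["step"] - 1) // 24 + 1   # 1-based day of the simulation
steps["hour"] = (steps["step"] - 1) % 24       # hour within that day, 0 to 23

print(steps)
```

Under these assumptions, step 24 is the last hour of day 1 and step 25 is the first hour of day 2.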

2. Load the Dataset

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import the dataset
df = pd.read_csv("Synthetic_Financial_datasets_log.csv")

# Check the first five rows of the dataset
df.head()

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


3. Initial Data Exploration

In [5]:
# Check the shape of the data
df.shape

(6362620, 11)

In [6]:
# Summary statistics for numerical variables
df.describe()

,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0
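
For a 0/1 indicator column such as isFraud or isFlaggedFraud, the mean reported by describe() is the share of rows equal to 1, so multiplying it by the row count recovers an approximate number of such rows. The sketch below does this arithmetic with the values printed above; the variable names are illustrative, and the results are approximate because the displayed means are rounded.

```python
# Values copied from the df.shape and df.describe() outputs above.
n_rows = 6_362_620
mean_is_fraud = 0.00129082
mean_is_flagged = 2.514687e-06

# For a 0/1 column: mean = (number of ones) / (number of rows).
n_fraud = round(mean_is_fraud * n_rows)
n_flagged = round(mean_is_flagged * n_rows)

print(f"fraud rows: about {n_fraud} ({mean_is_fraud:.3%} of all rows)")
print(f"flagged rows: about {n_flagged}")
```

This suggests roughly 8,213 fraudulent rows (about 0.13% of the data) and about 16 flagged rows. In the notebook itself, `df["isFraud"].sum()` and `df["isFlaggedFraud"].sum()` would give the exact counts.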


In [7]:
# Check the missing values in the data
df.isna().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

There are no missing values in any column.

In [8]:
# Check the information of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


step, isFraud, and isFlaggedFraud are integer columns; amount and the four balance columns are floats; type, nameOrig, and nameDest are object (string) columns. Object columns generally hold categorical data, which must be converted to numerical form before most machine learning models can use it.
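
As a minimal sketch of that conversion, the example below finds the non-numeric columns and one-hot encodes `type` with `pd.get_dummies`. The small `sample` frame is a stand-in built from rows shown in the `df.head()` output, and one-hot encoding is only one common option, assumed here for illustration rather than taken from the notebook.

```python
import pandas as pd

# Tiny stand-in for df, built from rows shown by df.head().
sample = pd.DataFrame({
    "step": [1, 1, 1],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9839.64, 181.0, 181.0],
    "nameOrig": ["C1231006815", "C1305486145", "C840083671"],
    "isFraud": [0, 1, 1],
})

# Columns that are not numeric are the ones that need encoding.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# One-hot encode `type`: each category becomes its own 0/1 column.
encoded = pd.get_dummies(sample, columns=["type"], dtype=int)
print(encoded.columns.tolist())
```

Only `type` is encoded in this sketch; `nameOrig` is left as-is.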

4. Data Cleaning and Encoding Categorical Variables