# Data Preprocessing for Financial Fraud Detection

This notebook walks through the preprocessing of a financial transactions dataset that contains both categorical and numerical features. The goal is to prepare the data for training fraud-detection models.


## Importing Libraries

We start by importing the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

## Loading the Dataset

Next, we load the dataset from a CSV file. Each row is a financial transaction, and a binary `targets` column marks the fraudulent ones. We also drop the redundant `index` column and raise the display limit so that every column is shown.


In [5]:
file_path = r'C:\Users\jadia\Financial-Fraud-Detection\datasets\financial_data_with_targets.csv'
df = pd.read_csv(file_path)
df.drop(columns= 'index', inplace= True)

pd.set_option('display.max_columns', df.shape[1] + 1)
df.head(10)

| | year | month | hour | txn_type | txn_status | error_code | remitter_bank | beneficiary_bank | payer_handle | payer_app | payee_handle | payee_app | payee_requested_amount | payee_settlement_amount | difference_amount | payer_state | payee_state | cred_type | cred_subtype | time_of_day | targets |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023 | 8 | 0 | Refund | Successful | 0 | Allahabad Bank | Karur Vysya Bank | SCB | Standard Chartered UPI | KOTAK | BHIM KOTAK Pay | 54020 | 54020 | 0 | Punjab | Maharashtra | Debit Card | Prepaid Debit Card | LateNight | 0 |
| 1 | 2021 | 10 | 0 | Payment | Successful | 0 | Madhya Bihar Gramin Bank | Kotak Mahindra Bank | WASBI | WhatsApp Pay | HDFCBANKJD | JustDial | 37670 | 37670 | 0 | Uttar Pradesh | Rajasthan | Overdraft | Business Overdraft | LateNight | 0 |
| 2 | 2019 | 11 | 0 | Withdrawal | Successful | 0 | Karur Vysya Bank | United Bank of India | KMBL | Khalijeb | UTKARSHBANK | UTKARSHBANK | 22984 | 22984 | 0 | Haryana | Punjab | Auto Loan | Used Car Loan | LateNight | 0 |
| 3 | 2023 | 9 | 0 | Transfer | Successful | 0 | HDFC Bank | Corporation Bank | IDBI | BHIM PAyWIZ by IDBI Bank | WASBI | WhatsApp Pay | 62038 | 62038 | 0 | Punjab | Goa | Overdraft | Personal Overdraft | LateNight | 0 |
| 4 | 2021 | 9 | 0 | Fee | Successful | 0 | Union Bank of India | Bank of India | UNIONBANK | BHIM Union Bank UPI App | NSDL | NSDL | 72624 | 72624 | 0 | Odisha | Maharashtra | Personal Loan | Unsecured Personal Loan | LateNight | 0 |
| 5 | 2020 | 8 | 0 | Deposit | Successful | 0 | Jammu & Kashmir Bank | Federal Bank | UBOI | BHIM Union Bank UPI App | YESBANK | BHIM YES Pay | 15544 | 15544 | 0 | Bihar | Andhra Pradesh | Auto Loan | New Car Loan | LateNight | 0 |
| 6 | 2020 | 10 | 0 | Payment | Successful | 0 | ESAF Small Finance Bank | Punjab National Bank (PNB) | YESPAY | JusPay Technologies | IMOBILE | ICICI iMobile | 70626 | 70626 | 0 | Kerala | Odisha | Debit Card | Prepaid Debit Card | LateNight | 0 |
| 7 | 2023 | 4 | 0 | Transfer | Successful | 0 | Fincare Small Finance Bank | NKGSB Co-operative Bank | YAPL | AmazonPay | SBI | BHIM SBI Pay | 91678 | 91678 | 0 | Goa | Punjab | Home Loan | Adjustable-Rate Mortgage (ARM) | LateNight | 0 |
| 8 | 2023 | 4 | 0 | Payment | Successful | 0 | Bank of Baroda | Punjab and Maharashtra Co-operative Bank (PMC) | FBL | Cointab | HDFCBANKJD | JustDial | 98740 | 98740 | 0 | Kerala | Goa | Personal Loan | Secured Personal Loan | LateNight | 0 |
| 9 | 2020 | 2 | 0 | Reversal | Successful | 0 | Utkarsh Small Finance Bank | Lakshmi Vilas Bank | MAHB | BHIM Maha UPI(Bank of Maharashtra) | CITI | CITI Bank (Mobile Banking App) | 10282 | 10282 | 0 | Uttar Pradesh | Jammu and Kashmir | Home Loan | Adjustable-Rate Mortgage (ARM) | LateNight | 1 |
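The loading cell uses a machine-specific absolute path and assumes that an `index` column is always present. As a minimal sketch of a more defensive variant, the helper below (`load_transactions` is a hypothetical name, and the in-memory CSV is a toy stand-in for the real file) drops `index` only if it exists:

```python
import io

import pandas as pd

# Toy stand-in for the real CSV; only a few of the real columns are shown.
csv_text = """index,year,month,txn_type,targets
0,2023,8,Refund,0
1,2021,10,Payment,0
"""


def load_transactions(source):
    """Read the CSV and drop the leftover 'index' column if it is present."""
    df = pd.read_csv(source)
    # errors="ignore" keeps drop() from raising when 'index' is absent
    return df.drop(columns="index", errors="ignore")


df_demo = load_transactions(io.StringIO(csv_text))
print(df_demo.columns.tolist())
```

`pd.read_csv` accepts a path or a file-like object, so the same helper would work with a relative path to the dataset.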


## Descriptive Statistics

We then generate descriptive statistics for the dataset, which helps in understanding the distribution of numerical features.


In [6]:
df.describe()

| | year | month | hour | payee_requested_amount | payee_settlement_amount | difference_amount | targets |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 55671.0 | 55671.0 | 55671.0 | 55671.0 | 55671.0 | 55671.0 | 55671.0 |
| mean | 2020.981265 | 6.447468 | 11.437229 | 50224.339279 | 52509.940579 | -2285.6013 | 0.064396 |
| std | 1.40054 | 3.421936 | 6.660917 | 29003.090558 | 32316.082141 | 11412.568862 | 0.24546 |
| min | 2019.0 | 1.0 | 0.0 | 14.0 | 14.0 | -317340.0 | 0.0 |
| 25% | 2020.0 | 3.0 | 6.0 | 24988.0 | 25636.0 | 0.0 | 0.0 |
| 50% | 2021.0 | 6.0 | 12.0 | 50264.0 | 51808.0 | 0.0 | 0.0 |
| 75% | 2022.0 | 9.0 | 18.0 | 75530.0 | 77826.0 | 0.0 | 0.0 |
| max | 2023.0 | 12.0 | 22.0 | 99990.0 | 416716.0 | 27976.0 | 1.0 |
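For a 0/1 label, the mean equals the share of positive rows, so the `targets` mean of 0.064396 suggests that roughly 6.4% of the 55,671 transactions are flagged. The sketch below shows how that class balance could be inspected directly; the ten-row series is a toy stand-in for `df['targets']`, not real data.

```python
import pandas as pd

# Toy labels standing in for df['targets']: 1 positive out of 10 rows.
targets = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="targets")

class_share = targets.value_counts(normalize=True)  # fraction of rows per class
positive_rate = targets.mean()                      # share of 1s for a 0/1 label

print(class_share)
print(f"positive rate: {positive_rate:.3f}")
```

An imbalance of this kind is worth keeping in mind later, for example when choosing evaluation metrics or a stratified split.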


We also describe the categorical features to understand their distribution.


In [7]:
df.describe(include= 'O')

| | txn_type | txn_status | error_code | remitter_bank | beneficiary_bank | payer_handle | payer_app | payee_handle | payee_app | payer_state | payee_state | cred_type | cred_subtype | time_of_day |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 | 55671 |
| unique | 7 | 8 | 32 | 59 | 59 | 104 | 81 | 104 | 81 | 22 | 22 | 7 | 15 | 8 |
| top | Payment | Successful | 0 | Indian Bank | Indian Bank | OKAXIS | WhatsApp Pay | OKAXIS | Google Pay | Haryana | Haryana | Credit Card | Secured Personal Loan | Afternoon |
| freq | 10806 | 50768 | 50768 | 2018 | 1903 | 1181 | 2246 | 1131 | 2236 | 4580 | 4523 | 11131 | 3927 | 9637 |
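The `unique` row matters for the encoding step, because one-hot encoding with one level dropped creates (levels - 1) columns per feature. For the three categorical features kept later (8, 7, and 32 levels), that would come to 7 + 6 + 31 = 44 indicator columns. A small sketch of the same count on a toy frame (all values illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "time_of_day": ["LateNight", "Morning", "Night", "Morning"],
    "cred_type": ["Debit Card", "Overdraft", "Debit Card", "Auto Loan"],
    "amount": [10, 20, 30, 40],
})

# Number of distinct levels per non-numeric column
cardinality = toy.select_dtypes(exclude="number").nunique()

# Indicator columns produced by one-hot encoding with one level dropped per feature
n_dummy_cols = int((cardinality - 1).sum())

print(cardinality)
print("indicator columns:", n_dummy_cols)
```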


## Selecting Relevant Features

Next, we select the features to keep: three categorical and three numerical. The `targets` label is listed alongside the numerical columns so that it stays with the data through the following steps.

In [8]:
df.columns

Index(['year', 'month', 'hour', 'txn_type', 'txn_status', 'error_code',
       'remitter_bank', 'beneficiary_bank', 'payer_handle', 'payer_app',
       'payee_handle', 'payee_app', 'payee_requested_amount',
       'payee_settlement_amount', 'difference_amount', 'payer_state',
       'payee_state', 'cred_type', 'cred_subtype', 'time_of_day', 'targets'],
      dtype='object')

In [9]:
cat_feat_to_keep = ['time_of_day', 'cred_type', 'error_code']
num_feat_to_keep = ['payee_requested_amount', 'payee_settlement_amount', 'difference_amount', 'targets']
feat_to_keep = cat_feat_to_keep + num_feat_to_keep
feat_to_keep

['time_of_day',
 'cred_type',
 'error_code',
 'payee_requested_amount',
 'payee_settlement_amount',
 'difference_amount',
 'targets']
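Because `targets` is the label rather than a feature, an alternative layout is to hold it in its own variable. The sketch below is one possible arrangement, not the notebook's method; the names `cat_features`, `num_features`, and `target` are illustrative, and the two-row frame is a toy stand-in for `df`.

```python
import pandas as pd

# Column lists from the notebook, with the label held separately.
cat_features = ["time_of_day", "cred_type", "error_code"]
num_features = ["payee_requested_amount", "payee_settlement_amount", "difference_amount"]
target = "targets"

# Toy frame standing in for df; the extra 'year' column is deliberately not selected.
toy = pd.DataFrame({
    "year": [2023, 2021],
    "time_of_day": ["LateNight", "LateNight"],
    "cred_type": ["Debit Card", "Overdraft"],
    "error_code": ["0", "0"],
    "payee_requested_amount": [54020, 37670],
    "payee_settlement_amount": [54020, 37670],
    "difference_amount": [0, 0],
    "targets": [0, 0],
})

# Fail early if a requested column is missing from the data.
missing = set(cat_features + num_features + [target]) - set(toy.columns)
assert not missing, f"missing columns: {sorted(missing)}"

X = toy[cat_features + num_features]  # model inputs
y = toy[target]                       # label

print(X.shape, y.name)
```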

## Splitting the Data

We split the data into categorical and numerical features for separate processing.


In [10]:
df_cat = df[cat_feat_to_keep]
df_num = df[num_feat_to_keep]

In [11]:
df_cat.head()

| | time_of_day | cred_type | error_code |
| --- | --- | --- | --- |
| 0 | LateNight | Debit Card | 0 |
| 1 | LateNight | Overdraft | 0 |
| 2 | LateNight | Auto Loan | 0 |
| 3 | LateNight | Overdraft | 0 |
| 4 | LateNight | Personal Loan | 0 |


In [12]:
df_num.head()

| | payee_requested_amount | payee_settlement_amount | difference_amount | targets |
| --- | --- | --- | --- | --- |
| 0 | 54020 | 54020 | 0 | 0 |
| 1 | 37670 | 37670 | 0 | 0 |
| 2 | 22984 | 22984 | 0 | 0 |
| 3 | 62038 | 62038 | 0 | 0 |
| 4 | 72624 | 72624 | 0 | 0 |


## Encoding Categorical Features

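We encode the categorical features using one-hot encoding. With `drop_first=True`, the first level of each feature is dropped and acts as the baseline (a row of all zeros for that feature), and the resulting indicators are cast to 32-bit integers.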


In [13]:
df_cat_encoded = pd.get_dummies(df_cat, drop_first= True).astype(np.int32)
df_cat_encoded

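| | time_of_day_EarlyMorning | time_of_day_Evening | time_of_day_LateAfternoon | time_of_day_LateMorning | time_of_day_LateNight | time_of_day_Morning | time_of_day_Night | cred_type_Credit Card | cred_type_Debit Card | cred_type_Home Loan | cred_type_Line of Credit | ... | error_code_U80 | error_code_U85 | error_code_U86 | error_code_U88 | error_code_U89 | error_code_U90 | error_code_U91 | error_code_U92 | error_code_U93 | error_code_U94 | error_code_U96 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55666 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 55667 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 55668 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 55669 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |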
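To make the effect of `drop_first=True` concrete, here is a minimal sketch on a three-row toy column. The level names are taken from `time_of_day`, but the frame itself is illustrative. `Afternoon` sorts first, so it is the dropped baseline level, which matches the absence of a `time_of_day_Afternoon` column in the output above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"time_of_day": ["Afternoon", "LateNight", "Night"]})

# One indicator column per level, minus the first (alphabetically) level
encoded = pd.get_dummies(toy, drop_first=True).astype(np.int32)

print(encoded)
# The 'Afternoon' row has no column of its own: it is encoded as all zeros.
```

Dropping one level avoids perfectly collinear indicator columns, which mainly matters for linear models.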


## Scaling Numerical Features

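We scale the numerical features with `StandardScaler`, which standardizes each column to zero mean and unit variance. This matters for models that are sensitive to feature scale, such as distance-based and gradient-based methods. The `targets` column is left unscaled and re-attached afterwards.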


In [14]:
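scaler = StandardScaler()
# Fit and transform every column except the last one ('targets')
df_num_scaled = scaler.fit_transform(df_num.iloc[:, :-1])
# fit_transform returns a NumPy array, so restore the column names
df_num_scaled = pd.DataFrame(df_num_scaled, columns= num_feat_to_keep[:-1])
# Re-attach the unscaled label; both frames share the default 0..n-1 index
df_num_scaled = pd.concat([df_num_scaled, df_num['targets']], axis= 1)
df_num_scaled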

| | payee_requested_amount | payee_settlement_amount | difference_amount | targets |
| --- | --- | --- | --- | --- |
| 0 | 0.130872 | 0.046728 | 0.200272 | 0 |
| 1 | -0.432866 | -0.459216 | 0.200272 | 0 |
| 2 | -0.939230 | -0.913669 | 0.200272 | 0 |
| 3 | 0.407328 | 0.294842 | 0.200272 | 0 |
| 4 | 0.772327 | 0.622422 | 0.200272 | 0 |
| ... | ... | ... | ... | ... |
| 55666 | 0.379882 | 0.270210 | 0.200272 | 0 |
| 55667 | 1.522391 | 1.295591 | 0.200272 | 0 |
| 55668 | -1.358431 | -1.289894 | 0.200272 | 0 |
| 55669 | -1.222513 | -1.167910 | 0.200272 | 0 |
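The scaled values can be checked by hand: `StandardScaler` computes z = (x - mean) / std per column, using the population standard deviation (`ddof=0`). The sketch below compares the scaler with the manual formula on a toy column, then reproduces the first `payee_requested_amount` value from the statistics reported earlier. `describe()` reports the sample standard deviation (`ddof=1`), so that last figure is an approximation, though the gap is negligible at this row count.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"amount": [10.0, 20.0, 30.0, 40.0]})

scaled = StandardScaler().fit_transform(toy)
manual = (toy - toy.mean()) / toy.std(ddof=0)  # population std, as the scaler uses

print(np.allclose(scaled, manual.to_numpy()))

# Row 0 of the notebook: x = 54020, with mean and std taken from df.describe()
z_row0 = (54020 - 50224.339279) / 29003.090558
print(round(z_row0, 4))
```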


## Combining Processed Features

We combine the encoded categorical features and the scaled numerical features into a single DataFrame.


In [15]:
df_encoded = pd.concat([df_cat_encoded, df_num_scaled], axis= 1)
df_encoded

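| | time_of_day_EarlyMorning | time_of_day_Evening | time_of_day_LateAfternoon | time_of_day_LateMorning | time_of_day_LateNight | time_of_day_Morning | time_of_day_Night | cred_type_Credit Card | cred_type_Debit Card | cred_type_Home Loan | cred_type_Line of Credit | ... | error_code_U89 | error_code_U90 | error_code_U91 | error_code_U92 | error_code_U93 | error_code_U94 | error_code_U96 | payee_requested_amount | payee_settlement_amount | difference_amount | targets |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.130872 | 0.046728 | 0.200272 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.432866 | -0.459216 | 0.200272 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -0.939230 | -0.913669 | 0.200272 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.407328 | 0.294842 | 0.200272 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.772327 | 0.622422 | 0.200272 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55666 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.379882 | 0.270210 | 0.200272 | 0 |
| 55667 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1.522391 | 1.295591 | 0.200272 | 0 |
| 55668 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.358431 | -1.289894 | 0.200272 | 0 |
| 55669 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1.222513 | -1.167910 | 0.200272 | 0 |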
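One detail worth knowing about this step: `pd.concat(..., axis=1)` aligns rows by index label, not by position. It works here because both frames carry the default 0..n-1 index. The toy sketch below shows what could happen if one side had a different index (for example after filtering rows), and how resetting the index restores positional alignment. The frames and values are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10.0, 20.0, 30.0]}, index=[5, 6, 7])  # e.g. rows left after a filter

# Index labels do not overlap, so the result is the union of both indexes, padded with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes makes the concat effectively positional
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

A quick check such as `df_encoded.isna().sum().sum() == 0` after combining would catch this kind of misalignment.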


## Saving the Processed Data

Finally, we save the processed DataFrame to a CSV file, which can be used for training machine learning models.


In [17]:
df_encoded.to_csv(r'fin_data_processed.csv', index= False)
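A quick round trip can confirm that the saved file reads back as expected. The sketch below writes a toy frame to a temporary directory (the file name `toy_processed.csv` is hypothetical) and reloads it. Note that CSV stores no dtype information, so the `int32` indicator columns would come back as `int64` unless a `dtype` is passed to `read_csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "time_of_day_Night": np.array([0, 1, 0], dtype=np.int32),
    "difference_amount": [0.5, -1.25, 0.0],
    "targets": [0, 0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_processed.csv")  # hypothetical file name
    toy.to_csv(path, index=False)  # index=False avoids writing an extra unnamed column
    restored = pd.read_csv(path)

# Values and column order survive; dtypes are re-inferred on read.
pd.testing.assert_frame_equal(toy, restored, check_dtype=False)
print(toy.dtypes["time_of_day_Night"], "->", restored.dtypes["time_of_day_Night"])
```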
