<br>

<br>

# 💳 **BANK FRAUD DETECTION** 💳

<br>

**MODEL**

<br>

## **INDEX**

- **🧠 STEP 1: PROBLEM DEFINITION**
- **🎣 STEP 2: DATA COLLECTION**
- **🔎 STEP 3: EXPLORATORY DATA ANALYSIS (EDA)**


<br>

<br>

# **🧠 STEP 1: PROBLEM DEFINITION**


In today's digital banking landscape, credit card fraud continues to be a persistent threat, resulting in significant financial losses and operational risks for financial institutions. Detecting fraudulent transactions is a complex task, especially due to their rarity and the evolving nature of fraud patterns.

This project aims to build a **machine learning model to detect credit card fraud** using a realistic, synthetic dataset provided by **Feedzai** (published at NeurIPS 2022). Although the data is synthetic, it has been generated to reflect the structure and behavior of real-world fraud cases, providing a robust environment for model development and evaluation.

The dataset is:
- **Highly imbalanced**, with roughly 1.1% of records labeled as fraud (the mean of `fraud_bool` is 0.011029).
- **Bias-controlled**, with multiple variants designed to test model fairness and robustness.
- **Rich in features**, including both numerical and categorical variables.
- **Time-aware**, with a `month` column, though this project will not focus on temporal modeling.

### 💼 BUSINESS CONTEXT

The target user of this model is a **fraud analyst working at a bank**. Rather than being deployed in real-time production, the model is intended to serve as an analytical tool to help prioritize suspicious transactions and guide human decision-making.

### 🎯 PROJECT GOAL

The primary goal is to **maximize fraud detection capability** (high recall) while keeping false positives under control, using the F1-score to balance recall against precision. The model should identify fraudulent behavior patterns to help reduce financial losses without overwhelming analysts with irrelevant alerts.
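
As a rough illustration of how recall, precision and F1 relate, the sketch below computes all three from confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for this example; they are not results from this project.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return precision, recall and F1 computed from confusion-matrix counts."""
    # Precision: share of raised alerts that are truly fraudulent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of all fraud cases that were caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=80, fp=120, fn=20)
print(metrics)
```

In this made-up case the model catches most fraud (high recall) but most alerts are false alarms (low precision), which pulls F1 down. That is the trade-off an analyst-facing tool has to manage.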

<br>


<br>

<br>

# **🎣 STEP 2: DATA COLLECTION**

<br>

### **IMPORTING LIBRARIES**

In [2]:
import pandas as pd
import numpy as np
import os
import zipfile


In [3]:
# Kaggle dataset identifier
dataset_identifier = "sgpjesus/bank-account-fraud-dataset-neurips-2022"

# Download the dataset using the Kaggle API
os.system(f'kaggle datasets download -d {dataset_identifier}')

0
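
The `0` above is the exit status returned by `os.system`; a failed download would return a non-zero value without raising an error. A possible alternative, sketched here with the standard `subprocess` module, raises on failure instead. The helper names `build_download_command` and `download_dataset` are hypothetical, and the sketch assumes the `kaggle` CLI is installed and configured with credentials.

```python
import subprocess


def build_download_command(dataset_identifier: str) -> list:
    """Build the same Kaggle CLI command used above, as an argument list."""
    return ["kaggle", "datasets", "download", "-d", dataset_identifier]


def download_dataset(dataset_identifier: str) -> None:
    """Run the download and raise a clear error if it fails."""
    command = build_download_command(dataset_identifier)
    try:
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, check=True)
    except FileNotFoundError as exc:
        raise RuntimeError("The `kaggle` CLI was not found on PATH.") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(
            f"Kaggle download failed with exit code {exc.returncode}."
        ) from exc
```

Calling `download_dataset(dataset_identifier)` would take the place of the `os.system` call. It is left uncalled here so the sketch has no side effects.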

### **SELECTIVE EXTRACTION**

In [4]:
# Name of the ZIP file (based on the dataset identifier)
zip_filename = dataset_identifier.split('/')[-1] + ".zip"

# Check if the ZIP file exists and extract only the Base.csv file
if os.path.exists(zip_filename):
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        # List all files in the ZIP
        files_in_zip = zip_ref.namelist()
        
        # Check if Base.csv is in the ZIP file
        if "Base.csv" in files_in_zip:
            # Extract only Base.csv
            zip_ref.extract("Base.csv", ".")
            print("Base.csv has been successfully extracted!")
        else:
            print("Base.csv not found in the ZIP file.")
else:
    print("The downloaded ZIP file was not found.")

Base.csv has been successfully extracted!
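
Before extracting, it can be useful to see what the archive contains and how large each file is once unpacked. The sketch below does this with `ZipFile.infolist()`. The helper name `list_archive_members` is hypothetical, and the demo runs on a tiny in-memory archive with made-up contents rather than on the real download.

```python
import io
import zipfile


def list_archive_members(zip_source) -> dict:
    """Map each member name in a ZIP archive to its uncompressed size in bytes."""
    with zipfile.ZipFile(zip_source, "r") as zip_ref:
        return {info.filename: info.file_size for info in zip_ref.infolist()}


# Demo on a small in-memory archive; member names and contents are illustrative
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_zip:
    demo_zip.writestr("Base.csv", "fraud_bool,income\n0,0.3\n")
    demo_zip.writestr("Other.csv", "fraud_bool,income\n1,0.8\n")
buffer.seek(0)

members = list_archive_members(buffer)
for name, size in members.items():
    print(f"{name}: {size} bytes")
```

On the real archive the call would be `list_archive_members(zip_filename)`.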


<br>

<br>

<br>

# **🔎 STEP 3: EXPLORATORY DATA ANALYSIS (EDA)**

### **LOADING THE DATASET**

In [7]:
df = pd.read_csv("Base.csv")
df.head()

| | fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | ... | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.3 | 0.986506 | -1 | 25 | 40 | 0.006735 | 102.453711 | AA | 1059 | ... | 0 | 1500.0 | 0 | INTERNET | 16.224843 | linux | 1 | 1 | 0 | 0 |
| 1 | 0 | 0.8 | 0.617426 | -1 | 89 | 20 | 0.010095 | -0.849551 | AD | 1658 | ... | 0 | 1500.0 | 0 | INTERNET | 3.363854 | other | 1 | 1 | 0 | 0 |
| 2 | 0 | 0.8 | 0.996707 | 9 | 14 | 40 | 0.012316 | -1.490386 | AB | 1095 | ... | 0 | 200.0 | 0 | INTERNET | 22.730559 | windows | 0 | 1 | 0 | 0 |
| 3 | 0 | 0.6 | 0.4751 | 11 | 14 | 30 | 0.006991 | -1.863101 | AB | 3483 | ... | 0 | 200.0 | 0 | INTERNET | 15.215816 | linux | 1 | 1 | 0 | 0 |
| 4 | 0 | 0.9 | 0.842307 | -1 | 29 | 40 | 5.742626 | 47.152498 | AA | 2339 | ... | 0 | 200.0 | 0 | INTERNET | 3.743048 | other | 0 | 1 | 0 | 0 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 32 columns):
 #   Column                            Non-Null Count    Dtype  
---  ------                            --------------    -----  
 0   fraud_bool                        1000000 non-null  int64  
 1   income                            1000000 non-null  float64
 2   name_email_similarity             1000000 non-null  float64
 3   prev_address_months_count         1000000 non-null  int64  
 4   current_address_months_count      1000000 non-null  int64  
 5   customer_age                      1000000 non-null  int64  
 6   days_since_request                1000000 non-null  float64
 7   intended_balcon_amount            1000000 non-null  float64
 8   payment_type                      1000000 non-null  object 
 9   zip_count_4w                      1000000 non-null  int64  
 10  velocity_6h                       1000000 non-null  float64
 11  velocity_24h                      1000

In [10]:
df.nunique()

fraud_bool                               2
income                                   9
name_email_similarity               998861
prev_address_months_count              374
current_address_months_count           423
customer_age                             9
days_since_request                  989330
intended_balcon_amount              994971
payment_type                             5
zip_count_4w                          6306
velocity_6h                         998687
velocity_24h                        998940
velocity_4w                         998318
bank_branch_count_8w                  2326
date_of_birth_distinct_emails_4w        40
employment_status                        7
credit_risk_score                      551
email_is_free                            2
housing_status                           7
phone_home_valid                         2
phone_mobile_valid                       2
bank_months_count                       33
has_other_cards                          2
proposed_cr
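
The unique-value counts suggest a natural split between low-cardinality columns (flags and categories such as `payment_type`) and near-continuous ones (such as `velocity_6h`). A minimal sketch of that split follows. The helper name `split_by_cardinality` and the default threshold of 10 levels are assumptions made for illustration, and the demo uses a toy frame rather than the full dataset.

```python
import pandas as pd


def split_by_cardinality(frame: pd.DataFrame, max_levels: int = 10) -> tuple:
    """Split column names into low- and high-cardinality groups.

    `max_levels` is an arbitrary cut-off chosen for this sketch.
    """
    n_unique = frame.nunique()
    low = n_unique[n_unique <= max_levels].index.tolist()
    high = n_unique[n_unique > max_levels].index.tolist()
    return low, high


# Toy frame reusing three column names from the dataset; values are made up
toy = pd.DataFrame({
    "fraud_bool": [0, 1, 0, 0],
    "payment_type": ["AA", "AB", "AA", "AD"],
    "velocity_6h": [10.5, 20.1, 30.7, 40.2],
})

low_card, high_card = split_by_cardinality(toy, max_levels=3)
print("Low cardinality:", low_card)
print("High cardinality:", high_card)
```

On the real data the call would be `split_by_cardinality(df)`. The threshold would need tuning, since integer counters such as `bank_months_count` (33 levels) sit between the two groups.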

In [12]:
print("Dataset shape:", df.shape)

Dataset shape: (1000000, 32)
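
With the shape confirmed, a natural next check is the class balance of the target. The sketch below tabulates counts and shares per class. The helper name `class_balance` is hypothetical, and the demo uses a toy label series rather than the real `fraud_bool` column.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return the count and relative share of each class in a label column."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Toy labels for illustration: 97 legitimate records and 3 fraudulent ones
toy_labels = pd.Series([0] * 97 + [1] * 3, name="fraud_bool")

balance = class_balance(toy_labels)
print(balance)
```

On the real data the call would be `class_balance(df["fraud_bool"])`.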


In [14]:
df.dtypes

fraud_bool                            int64
income                              float64
name_email_similarity               float64
prev_address_months_count             int64
current_address_months_count          int64
customer_age                          int64
days_since_request                  float64
intended_balcon_amount              float64
payment_type                         object
zip_count_4w                          int64
velocity_6h                         float64
velocity_24h                        float64
velocity_4w                         float64
bank_branch_count_8w                  int64
date_of_birth_distinct_emails_4w      int64
employment_status                    object
credit_risk_score                     int64
email_is_free                         int64
housing_status                       object
phone_home_valid                      int64
phone_mobile_valid                    int64
bank_months_count                     int64
has_other_cards                 

In [15]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| fraud_bool | 1000000.0 | 0.011029 | 0.104438 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| income | 1000000.0 | 0.562696 | 0.290343 | 0.1 | 0.3 | 0.6 | 0.8 | 0.9 |
| name_email_similarity | 1000000.0 | 0.493694 | 0.289125 | 1.43455e-06 | 0.225216 | 0.492153 | 0.755567 | 0.999999 |
| prev_address_months_count | 1000000.0 | 16.718568 | 44.04623 | -1.0 | -1.0 | -1.0 | 12.0 | 383.0 |
| current_address_months_count | 1000000.0 | 86.587867 | 88.406599 | -1.0 | 19.0 | 52.0 | 130.0 | 428.0 |
| customer_age | 1000000.0 | 33.68908 | 12.025799 | 10.0 | 20.0 | 30.0 | 40.0 | 90.0 |
| days_since_request | 1000000.0 | 1.025705 | 5.381835 | 4.03686e-09 | 0.007193 | 0.015176 | 0.026331 | 78.456904 |
| intended_balcon_amount | 1000000.0 | 8.661499 | 20.236155 | -15.53055 | -1.181488 | -0.830507 | 4.984176 | 112.956928 |
| zip_count_4w | 1000000.0 | 1572.692049 | 1005.374565 | 1.0 | 894.0 | 1263.0 | 1944.0 | 6700.0 |
| velocity_6h | 1000000.0 | 5665.296605 | 3009.380665 | -170.6031 | 3436.365848 | 5319.769349 | 7680.717827 | 16715.565404 |
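
The summary shows a minimum of -1 for `prev_address_months_count` and `current_address_months_count`, and for the former the median is also -1. One plausible reading, treated here as an assumption and not confirmed by anything in this notebook, is that -1 is a sentinel for a missing value. The sketch below measures how often the sentinel appears per column. The helper name `sentinel_share` is hypothetical, and the toy values are the five rows shown earlier by `df.head()`.

```python
import pandas as pd


def sentinel_share(frame: pd.DataFrame, columns: list, sentinel: int = -1) -> pd.Series:
    """Return the fraction of rows equal to `sentinel` in each listed column."""
    # Assumption: `sentinel` marks a missing value in these columns
    return frame[columns].eq(sentinel).mean()


# First five rows of the two address columns, as shown by df.head()
toy = pd.DataFrame({
    "prev_address_months_count": [-1, -1, 9, 11, -1],
    "current_address_months_count": [25, 89, 14, 14, 29],
})

shares = sentinel_share(
    toy, ["prev_address_months_count", "current_address_months_count"]
)
print(shares)
```

If the assumption holds, a high sentinel share would argue for handling these columns explicitly (for example with a missing-value indicator) instead of treating -1 as a real month count.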
