<br>

<br>

# 💳 **BANK FRAUD DETECTION** 💳

<br>

**MODEL**

<br>

## **INDEX**

- **🧠 STEP 1: PROBLEM DEFINITION**
- **🎣 STEP 2: DATA COLLECTION**
- **🔎 STEP 3: EXPLORATORY DATA ANALYSIS (EDA)**


<br>

<br>

# **🧠 STEP 1: PROBLEM DEFINITION**


In today's digital banking landscape, fraud continues to be a persistent threat, resulting in significant financial losses and operational risks for financial institutions. Detecting it is a complex task, especially because fraudulent cases are rare and fraud patterns keep evolving.

This project aims to build a **machine learning model to detect bank account fraud** using the realistic, synthetic Bank Account Fraud dataset provided by **Feedzai** (published at NeurIPS 2022). Although the data is synthetic, it has been generated to reflect the structure and behavior of real-world fraud cases, providing a robust environment for model development and evaluation.

The dataset is:
- **Highly imbalanced**, with only about 1% of records labeled as fraud.
- **Bias-controlled**, with multiple variants designed to test model fairness and robustness.
- **Rich in features**, including both numerical and categorical variables.
- **Time-aware**, with a `month` column, though this project will not focus on temporal modeling.

### 💼 BUSINESS CONTEXT

The target user of this model is a **fraud analyst working at a bank**. Rather than being deployed in real-time production, the model is intended to serve as an analytical tool to help prioritize suspicious transactions and guide human decision-making.

### 🎯 PROJECT GOAL

The primary goal is to **maximize fraud detection capability** (high recall) while keeping false positives under control, using the F1-score to balance recall against precision. The model should identify fraudulent behavior patterns to help reduce financial losses without overwhelming analysts with irrelevant alerts.
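To make the trade-off between recall and false positives concrete, here is a minimal sketch of how recall, precision, and F1 follow from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not results from this project.

```python
def recall_precision_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (recall, precision, F1) from confusion-matrix counts."""
    # Recall: share of actual fraud cases that the model flags
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of flagged cases that really are fraud
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Hypothetical counts for illustration only (not results from this project)
recall, precision, f1 = recall_precision_f1(tp=80, fp=120, fn=20)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

With these made-up counts, recall is high (0.80) but precision is only 0.40, so analysts would review many false alerts. F1 (about 0.53) reflects that imbalance in a single number.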

<br>


<br>

<br>

# **🎣 STEP 2: DATA COLLECTION**

<br>

### **IMPORTING LIBRARIES**

In [2]:
import pandas as pd
import numpy as np
import os
import zipfile


In [3]:
# Kaggle dataset identifier
dataset_identifier = "sgpjesus/bank-account-fraud-dataset-neurips-2022"

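# Download the dataset using the Kaggle API (requires the kaggle CLI and API credentials).
# os.system returns the command's exit status, so the 0 printed below means success.
os.system(f'kaggle datasets download -d {dataset_identifier}')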

0
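Because `os.system` only returns an exit status, a failed download is easy to miss. One possible alternative, sketched here under the assumption that raising an exception on failure is preferable, is `subprocess.run` with `check=True`. The helper name `run_command` is hypothetical, and the demo uses the Python interpreter instead of the Kaggle CLI so that it runs without credentials.

```python
import subprocess
import sys


def run_command(args: list[str]) -> str:
    """Run a command, return its stdout, and raise if it exits with a non-zero status."""
    completed = subprocess.run(args, check=True, capture_output=True, text=True)
    return completed.stdout


# Demonstrated with the Python interpreter so the sketch runs without Kaggle credentials
demo_output = run_command([sys.executable, "-c", "print('ok')"])
print(demo_output.strip())

# The equivalent call for the download would be (left commented out on purpose):
# run_command(["kaggle", "datasets", "download", "-d", dataset_identifier])
```

A non-zero exit status raises `subprocess.CalledProcessError`, which stops the notebook at the failing cell instead of continuing with a missing ZIP file.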

### **SELECTIVE EXTRACTION**

In [4]:
# Name of the ZIP file (based on the dataset identifier)
zip_filename = dataset_identifier.split('/')[-1] + ".zip"

# Check if the ZIP file exists and extract only the Base.csv file
if os.path.exists(zip_filename):
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        # List all files in the ZIP
        files_in_zip = zip_ref.namelist()
        
        # Check if Base.csv is in the ZIP file
        if "Base.csv" in files_in_zip:
            # Extract only Base.csv
            zip_ref.extract("Base.csv", ".")
            print("Base.csv has been successfully extracted!")
        else:
            print("Base.csv not found in the ZIP file.")
else:
    print("The downloaded ZIP file was not found.")

Base.csv has been successfully extracted!
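As a complement to the extraction step, the sketch below peeks at the first rows of a CSV inside a ZIP archive without writing anything to disk, which can be handy for confirming that the right file is being targeted. The helper `peek_csv_in_zip` and the toy archive are hypothetical; the toy file only borrows two column names from the dataset.

```python
import csv
import io
import itertools
import os
import tempfile
import zipfile


def peek_csv_in_zip(zip_path: str, member: str, n_rows: int = 2) -> list[list[str]]:
    """Return the header plus the first n_rows data rows of a CSV stored in a ZIP."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        if member not in zip_ref.namelist():
            raise FileNotFoundError(f"{member} not found in {zip_path}")
        # Stream the member as text instead of extracting it to disk
        with zip_ref.open(member) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
            return list(itertools.islice(reader, n_rows + 1))


# Toy archive built in a temporary directory (stand-in for the real download)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_zip = os.path.join(tmp_dir, "toy.zip")
    with zipfile.ZipFile(toy_zip, "w") as zf:
        zf.writestr("Base.csv", "fraud_bool,income\n0,0.3\n1,0.8\n0,0.6\n")
        zf.writestr("Other.csv", "a,b\n1,2\n")
    preview = peek_csv_in_zip(toy_zip, "Base.csv", n_rows=2)

print(preview)
```

On the real archive the call would look like `peek_csv_in_zip(zip_filename, "Base.csv")`, assuming the ZIP has already been downloaded.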


<br>

<br>

<br>

# **🔎 STEP 3: EXPLORATORY DATA ANALYSIS (EDA)**

### **LOADING THE DATASET**

In [7]:
df = pd.read_csv("Base.csv")
df.head()

| | fraud_bool | income | name_email_similarity | prev_address_months_count | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | ... | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | device_fraud_count | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.3 | 0.986506 | -1 | 25 | 40 | 0.006735 | 102.453711 | AA | 1059 | ... | 0 | 1500.0 | 0 | INTERNET | 16.224843 | linux | 1 | 1 | 0 | 0 |
| 1 | 0 | 0.8 | 0.617426 | -1 | 89 | 20 | 0.010095 | -0.849551 | AD | 1658 | ... | 0 | 1500.0 | 0 | INTERNET | 3.363854 | other | 1 | 1 | 0 | 0 |
| 2 | 0 | 0.8 | 0.996707 | 9 | 14 | 40 | 0.012316 | -1.490386 | AB | 1095 | ... | 0 | 200.0 | 0 | INTERNET | 22.730559 | windows | 0 | 1 | 0 | 0 |
| 3 | 0 | 0.6 | 0.4751 | 11 | 14 | 30 | 0.006991 | -1.863101 | AB | 3483 | ... | 0 | 200.0 | 0 | INTERNET | 15.215816 | linux | 1 | 1 | 0 | 0 |
| 4 | 0 | 0.9 | 0.842307 | -1 | 29 | 40 | 5.742626 | 47.152498 | AA | 2339 | ... | 0 | 200.0 | 0 | INTERNET | 3.743048 | other | 0 | 1 | 0 | 0 |
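With the data loaded, a natural next check is the class balance and the split between numerical and categorical features. The sketch below is one possible way to do that. The helper `profile_frame` and the toy frame are hypothetical, with column names borrowed from the preview, and the sketch assumes the label column is `fraud_bool` and that every non-numeric column is categorical.

```python
import pandas as pd


def profile_frame(frame: pd.DataFrame, target: str = "fraud_bool") -> dict:
    """Summarise class balance and column types for a labelled DataFrame."""
    # Numeric feature columns, excluding the label itself
    numeric_cols = frame.drop(columns=[target]).select_dtypes(include="number").columns
    # Assumption: every non-numeric column is treated as categorical
    categorical_cols = frame.select_dtypes(exclude="number").columns
    return {
        "rows": len(frame),
        "fraud_rate": float(frame[target].mean()),
        "numeric_features": list(numeric_cols),
        "categorical_features": list(categorical_cols),
    }


# Toy frame for illustration only; the values are made up
toy = pd.DataFrame({
    "fraud_bool": [0, 0, 0, 1],
    "income": [0.3, 0.8, 0.8, 0.6],
    "payment_type": ["AA", "AD", "AB", "AB"],
    "device_os": ["linux", "other", "windows", "linux"],
})
summary = profile_frame(toy)
print(summary)
```

On the real data the call would be `profile_frame(df)`. Since `df.head()` elides the middle columns, setting `pd.set_option("display.max_columns", None)` beforehand shows all of them.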
