# Probabilistic Machine Learning - Project Report

## Fraud detection

- **Course:** Probabilistic Machine Learning (SoSe 2025)
- **Lecturer:** Alvaro Diaz-Ruelas
- **Student Names:** Khalid Sabih, Abdellah Charki
- **GitHub Usernames:** @khalidsabih / @abdellahcharki
- **Date:** 05/07/2025
- **PROJECT-ID:** 26-1CASKXX

---

# 1. Introduction

## 1.1 Motivation
Fraud detection has become an increasingly critical task in financial systems and digital transactions, where even a small number of fraudulent activities can result in significant financial losses and erode trust in institutions. The complexity of detecting fraud arises from its rarity and the constantly evolving tactics used by fraudsters to conceal illicit activities within massive volumes of legitimate transactions. As organizations handle millions of financial operations daily, distinguishing fraudulent patterns from normal behavior is both technically challenging and essential for operational security and customer trust.

## 1.2 Dataset
The dataset used in this project, `Fraud.csv`, consists of roughly 6.36 million synthetic financial transactions, created to reflect realistic banking operations while protecting privacy. Each record describes one transaction through numerical and categorical attributes. The main challenge is severe class imbalance: fraudulent transactions make up about 0.13% of all records (see Section 2.2.1), which makes fraud detection a highly imbalanced classification problem.

The dataset includes the following columns:

- `step`: The hour of the simulation.
- `type`: The type of transaction, such as PAYMENT, TRANSFER, CASH_OUT, DEBIT, or CASH_IN.
- `amount`: The amount of money involved in the transaction.
- `nameOrig`: An anonymized identifier for the originator’s account.
- `oldbalanceOrg`: The account balance of the originator before the transaction.
- `newbalanceOrig`: The account balance of the originator after the transaction.
- `nameDest`: An anonymized identifier for the recipient’s account.
- `oldbalanceDest`: The account balance of the recipient before the transaction.
- `newbalanceDest`: The account balance of the recipient after the transaction.
- `isFraud`: A binary label indicating whether the transaction was fraudulent (1) or not (0).
- `isFlaggedFraud`: A binary flag indicating whether the transaction was flagged as suspicious by internal business rules.

## 1.3 Hypothesis
- Fraudulent transactions are more likely to occur in specific transaction types, particularly TRANSFER and CASH_OUT, compared to other types such as PAYMENT or CASH_IN.
- Fraudulent transactions tend to involve higher transaction amounts than legitimate transactions.
- Fraudulent transactions often result in the destination account balance dropping to zero, suggesting immediate withdrawal or transfer of illicit funds.

# 2. Data Loading and Exploration
## 2.1 Data Loading
We use a synthetic fraud detection dataset, available on Kaggle, to train and evaluate fraud detection models: [Fraud Detection Dataset – Kaggle](https://www.kaggle.com/datasets/ashishkumarjayswal/froud-detection-dataset)

Our analysis begins with loading `Fraud.csv`, which is stored in the `data/` folder of our project repository. We use the pandas library in Python to read the file, as it handles large tabular datasets efficiently and provides convenient tools for data exploration.
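A minimal sketch of the loading step, assuming the file sits at `data/Fraud.csv` relative to the repository root. The helper name `load_transactions` and the explicit dtypes are illustrative assumptions, not taken from the project code.

```python
import pandas as pd

# Assumed dtypes: compact integer types and a categorical `type` column
# reduce memory use on a file with several million rows.
DTYPES = {
    "step": "int32",
    "type": "category",
    "isFraud": "int8",
    "isFlaggedFraud": "int8",
}


def load_transactions(path="data/Fraud.csv"):
    """Read the transaction CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path, dtype=DTYPES)


# Usage (requires the CSV to be present at the given path):
# df = load_transactions()
# print(df.shape)
# print(df.head())
```

Columns not listed in `DTYPES` keep the types pandas infers, so the balance and amount columns are read as floats.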

**Snapshot of Original Dataset (Before Preprocessing)**


| step | type      | amount   | nameOrig       | oldbalanceOrg | newbalanceOrig | nameDest     | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|------|-----------|----------|----------------|---------------|----------------|--------------|----------------|----------------|---------|----------------|
| 1    | PAYMENT   | 9839.64  | C1231006815    | 170136.0      | 160296.36      | M1979787155  | 0.0            | 0.0            | 0       | 0              |
| 1    | PAYMENT   | 1864.28  | C1666544295    | 21249.0       | 19384.72       | M2044282225  | 0.0            | 0.0            | 0       | 0              |
| 1    | TRANSFER  | 181.00   | C1305486145    | 181.0         | 0.00           | C553264065   | 0.0            | 0.0            | 1       | 0              |
| 1    | CASH_OUT  | 181.00   | C840083671     | 181.0         | 0.00           | C38997010    | 21182.0        | 0.0            | 1       | 0              |
| 1    | PAYMENT   | 11668.14 | C2048537720    | 41554.0       | 29885.86       | M1230701703  | 0.0            | 0.0            | 0       | 0              |


## 2.2 Data Exploration
After successfully loading the dataset, we performed a detailed exploratory data analysis to better understand its structure and the nature of fraudulent transactions.


### 2.2.1 Class Distribution
A critical first step was to examine the distribution of our target variable, `isFraud`. As shown in Figure 1 (Class Distribution), fraudulent transactions are extremely rare: 8,213 out of a total of 6,362,620 transactions, or about 0.13%. The class counts are as follows:
- Non-fraudulent (0): 6,354,407 transactions (99.87%)
- Fraudulent (1): 8,213 transactions (0.13%)
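
Counts and percentages of this kind can be obtained with `value_counts`. The sketch below uses a tiny made-up frame in place of the real data so that it runs on its own; the helper name `class_distribution` is an assumption.

```python
import pandas as pd


def class_distribution(df, target="isFraud"):
    """Return absolute counts and percentage shares of each class."""
    counts = df[target].value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy stand-in for the real dataset: 3 fraud cases in 1,000 rows.
toy = pd.DataFrame({"isFraud": [0] * 997 + [1] * 3})
summary = class_distribution(toy)
print(summary)
```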


![Class Distribution](./results/class_distribution.png)

This severe class imbalance is a central difficulty in fraud detection: metrics such as overall accuracy are misleading, because a model can score very highly while missing most or all fraudulent transactions.
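
To make the point concrete, consider a trivial baseline that labels every transaction as non-fraudulent. Using the class counts reported in Section 2.2.1, a short calculation gives its accuracy and recall:

```python
# Class counts as reported in Section 2.2.1.
n_legit = 6_354_407
n_fraud = 8_213

# A baseline that always predicts "not fraud" gets every legitimate
# transaction right and every fraudulent one wrong.
true_negatives = n_legit
true_positives = 0
false_negatives = n_fraud

accuracy = (true_negatives + true_positives) / (n_legit + n_fraud)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4%}")
print(f"recall   = {recall:.4%}")
```

The baseline reaches roughly 99.87% accuracy while detecting no fraud at all, which is why metrics such as precision and recall on the fraud class are more informative here.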

### 2.2.2 Fraud Rate by Transaction Type
We then analyzed how fraud is distributed across the transaction types PAYMENT, TRANSFER, CASH_OUT, DEBIT, and CASH_IN. Fraud is concentrated almost exclusively in the TRANSFER and CASH_OUT types. Figure 2 (Fraud Rate by Transaction Type) illustrates that:

- TRANSFER transactions account for approximately 80.69% of fraudulent activity.
- CASH_OUT transactions account for about 19.31% of fraud.
- Other transaction types show virtually no fraud.

![Fraud Rate by Transaction Type](./results/fraud_rate_by_transaction.png)

These findings highlight the importance of transaction type as a strong predictor of fraud.
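
Two related but distinct quantities can be computed per transaction type: the fraud rate (fraud cases divided by all transactions of that type) and the share of all fraud cases that fall into that type. A minimal sketch on made-up data, with `fraud_by_type` as a hypothetical helper name:

```python
import pandas as pd


def fraud_by_type(df):
    """Per-type transaction count, fraud count, fraud rate and fraud share."""
    grouped = df.groupby("type", observed=True)["isFraud"]
    out = pd.DataFrame({
        "n_transactions": grouped.size(),
        "n_fraud": grouped.sum(),
    })
    # Fraud rate: fraction of this type's transactions that are fraudulent.
    out["fraud_rate"] = out["n_fraud"] / out["n_transactions"]
    # Fraud share: fraction of all fraud cases that belong to this type.
    out["fraud_share"] = out["n_fraud"] / out["n_fraud"].sum()
    return out


# Made-up toy data, not the real dataset.
toy = pd.DataFrame({
    "type": ["TRANSFER"] * 4 + ["CASH_OUT"] * 10 + ["PAYMENT"] * 6,
    "isFraud": [1, 1, 0, 0] + [1, 1] + [0] * 8 + [0] * 6,
})
print(fraud_by_type(toy))
```

In the toy data both TRANSFER and CASH_OUT hold the same share of fraud cases, yet TRANSFER has the higher fraud rate because it has fewer transactions overall. When reading a chart of this kind it is worth checking which of the two quantities is plotted.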

### 2.2.3 Transaction Type vs. Fraud Count
To visualize how fraudulent and non-fraudulent transactions are distributed across transaction types, we drew a count plot, shown in Figure 3 (Transaction Type vs. Fraud). The chart confirms that while PAYMENT, CASH_IN, and DEBIT transactions are numerous, they rarely involve fraud. By contrast, TRANSFER and CASH_OUT transactions, although less frequent overall, carry a much higher proportion of fraudulent cases relative to their volume.

This information is critical for model development, suggesting that the transaction type should be included as a categorical feature in any predictive modeling approach.


![Transaction Type vs. Fraud Count](./results/Transaction_Type_vs_Fraud.png)


### 2.2.4 Correlation Analysis
To further investigate relationships between variables, we computed a correlation matrix and visualized it as a heatmap, presented in Figure 4 (Correlation Heatmap). The heatmap shows how features relate to each other and to the target variable `isFraud`.

The strongest positive correlations with `isFraud` were observed for:

- `amount` (correlation coefficient ≈ 0.0767)
- `type_TRANSFER` (≈ 0.0539)
- `isFlaggedFraud` (≈ 0.0441)

Meanwhile, features such as `type_PAYMENT` show a slight negative correlation with fraud. All of these coefficients are small in absolute terms (below 0.1), so they indicate weak linear trends rather than strong individual predictors, but they still point to patterns that may help distinguish fraudulent transactions.

![Correlation Analysis](./results/heatmap.png)
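
A minimal sketch of how such coefficients can be computed: the Pearson correlation of every numeric or boolean column with `isFraud`, sorted in descending order. The toy values are made up, and `correlation_with_target` is a hypothetical helper name.

```python
import pandas as pd


def correlation_with_target(df, target="isFraud"):
    """Pearson correlation of each numeric/boolean column with the target."""
    # Cast booleans (one-hot columns) to float so they enter the matrix.
    numeric = df.select_dtypes(include=["number", "bool"]).astype(float)
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(ascending=False)


# Made-up toy data, not the real dataset.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 500.0, 15.0, 700.0, 30.0],
    "type_TRANSFER": [False, False, True, False, True, False],
    "type_PAYMENT": [True, True, False, True, False, True],
    "isFraud": [0, 0, 1, 0, 1, 0],
})
print(correlation_with_target(toy))
```

Pearson correlation only captures linear association, so a low coefficient does not rule out a feature being useful to a non-linear model.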

### 2.2.5 Insights from Data Exploration
From this exploratory phase, we draw several conclusions:

- The dataset is highly imbalanced, with fraud representing about 0.13% of transactions.
- Fraud occurs almost exclusively in TRANSFER and CASH_OUT transactions.
- Fraud is positively, though weakly, correlated with `amount` (≈ 0.0767), which is consistent with the hypothesis that fraudulent transactions tend to involve larger amounts.
- Correlation analysis, while showing only modest relationships, suggests that transaction type and amount are among the more informative features for predicting fraud.

These findings inform the direction of our feature engineering and modeling strategies. In particular, they emphasize the need to account for class imbalance and to focus on transaction types and amounts when building fraud detection models.

# 3. Data Preprocessing
Before developing any models, we performed several preprocessing steps to prepare the dataset for analysis. These steps ensured that the data was clean, consistent, and suitable for machine learning algorithms.

- **Verified missing values:** Checked all columns and confirmed that there is no missing data.
- **Removed unneeded columns:** Dropped the account identifiers `nameOrig` and `nameDest`.
- **One-hot encoding:** Transformed the categorical `type` column into binary columns such as `type_PAYMENT` and `type_TRANSFER`. Each new column indicates whether the transaction belongs to that type (True/False). The transformed snapshot below has no `type_CASH_IN` column, so CASH_IN serves as the implicit reference category.
- **Data type checks:** Verified that numeric columns remained floats or integers and that the new one-hot encoded columns are stored as boolean values.
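
The steps above can be sketched as a single function. Using `pd.get_dummies` with `drop_first=True` is an assumption inferred from the transformed snapshot, which has no `type_CASH_IN` column; the helper name `preprocess` is likewise illustrative.

```python
import pandas as pd


def preprocess(df):
    """Apply the preprocessing steps described in this section."""
    # 1. Confirm that no column contains missing values.
    if df.isna().any().any():
        raise ValueError("dataset contains missing values")

    # 2. Drop the account identifier columns.
    out = df.drop(columns=["nameOrig", "nameDest"])

    # 3. One-hot encode `type` as boolean columns. drop_first=True removes
    #    the alphabetically first category (assumed here to be CASH_IN).
    out = pd.get_dummies(out, columns=["type"], drop_first=True, dtype=bool)
    return out


# Two rows from the original snapshot plus one made-up CASH_IN row,
# so that the dropped reference category is present in the example.
raw = pd.DataFrame({
    "step": [1, 1, 1],
    "type": ["PAYMENT", "TRANSFER", "CASH_IN"],
    "amount": [9839.64, 181.00, 100.00],
    "nameOrig": ["C1231006815", "C1305486145", "C0000000001"],
    "oldbalanceOrg": [170136.0, 181.0, 50.0],
    "newbalanceOrig": [160296.36, 0.0, 150.0],
    "nameDest": ["M1979787155", "C553264065", "C0000000002"],
    "oldbalanceDest": [0.0, 0.0, 200.0],
    "newbalanceDest": [0.0, 0.0, 100.0],
    "isFraud": [0, 1, 0],
    "isFlaggedFraud": [0, 0, 0],
})
clean = preprocess(raw)
print(clean.dtypes)
```

Dropping one category avoids a redundant column, since a row with all `type_*` flags set to False can only belong to the reference category.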

**Snapshot of Transformed Data**


| step | amount   | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER |
|------|----------|---------------|----------------|----------------|----------------|---------|----------------|---------------|------------|--------------|---------------|
| 1    | 9839.64  | 170136.00     | 160296.36      | 0.00           | 0.00           | 0       | 0              | False         | False      | True         | False         |
| 1    | 1864.28  | 21249.00      | 19384.72       | 0.00           | 0.00           | 0       | 0              | False         | False      | True         | False         |
| 1    | 181.00   | 181.00        | 0.00           | 0.00           | 0.00           | 1       | 0              | False         | False      | False        | True          |
| 1    | 181.00   | 181.00        | 0.00           | 21182.00       | 0.00           | 1       | 0              | True          | False      | False        | False         |
| 1    | 11668.14 | 41554.00      | 29885.86       | 0.00           | 0.00           | 0       | 0              | False         | False      | True         | False         |

# 4. Modeling Approach
- Description of the models chosen
- Why they are suitable for your problem
- Mathematical formulations (if applicable)

# 5. Model Training and Evaluation
- Training process
- Model evaluation (metrics, plots, performance)
- Cross-validation or uncertainty quantification

# 6. Results
- Present key findings
- Comparison of models if multiple approaches were used

# 7. Discussion
- Interpretation of results
- Limitations of the approach
- Possible improvements or extensions

# 8. Conclusion
- Summary of main outcomes

# 9. References
- Cite any papers, datasets, or tools used