# Fraud Detection with Random Forest Classifier

# Overview
#### This notebook presents a machine learning project for detecting fraudulent transactions in a financial dataset. The goal is to build and evaluate a classification model that identifies fraud, a rare class that makes up roughly 0.13% of the transactions. The workflow covers data exploration, feature engineering, training a Random Forest Classifier, and evaluation with metrics suited to imbalanced data.

# 1. Data Loading and Initial Exploration

#### Importing Required Dependencies

In [2]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix, roc_auc_score, average_precision_score)
from sklearn.pipeline import Pipeline


#### Load the dataset from the 'Fraud.csv' file into a pandas DataFrame.


In [3]:
df = pd.read_csv('Fraud.csv')

#### Check the dimensions (number of rows and columns) of the DataFrame.


In [4]:
df.shape

(6362620, 11)

#### Display the first few rows of the DataFrame to get a glimpse of the data structure and content.


In [5]:
df.head()

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


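#### Display every row except the final 10. `df.head(-10)` returns all rows but the last 10, so the notebook shows a truncated preview of the first and last rows of that selection.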


In [6]:
df.head(-10)

,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.00,160296.36,M1979787155,0.00,0.00,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.00,19384.72,M2044282225,0.00,0.00,0,0
2,1,TRANSFER,181.00,C1305486145,181.00,0.00,C553264065,0.00,0.00,1,0
3,1,CASH_OUT,181.00,C840083671,181.00,0.00,C38997010,21182.00,0.00,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.00,29885.86,M1230701703,0.00,0.00,0,0
...,...,...,...,...,...,...,...,...,...,...,...
6362605,742,CASH_OUT,54652.46,C43545501,54652.46,0.00,C830041824,0.00,54652.46,1,0
6362606,742,TRANSFER,303846.74,C959102961,303846.74,0.00,C114421319,0.00,0.00,1,0
6362607,742,CASH_OUT,303846.74,C1148860488,303846.74,0.00,C846260566,343660.89,647507.63,1,0
6362608,742,TRANSFER,258355.42,C1226129332,258355.42,0.00,C1744173808,0.00,0.00,1,0


#### Generate descriptive statistics for the numerical columns in the DataFrame, providing insights into their central tendency, dispersion, and shape.

In [7]:
df.describe()

,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


#### Print the data types of each column, check for missing values, and show the distribution of the target variable 'isFraud'.


In [8]:
print("\nColumns and dtypes:\n", df.dtypes)
print("\nMissing values per column:\n", df.isnull().sum())
print("\nTarget distribution:\n", df['isFraud'].value_counts())


Columns and dtypes:
 step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

Missing values per column:
 step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

Target distribution:
 isFraud
0    6354407
1       8213
Name: count, dtype: int64
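
To make the imbalance concrete, the class counts printed above can be turned into a fraud rate and an imbalance ratio. This is a small arithmetic sketch that hard-codes the two counts from the output rather than reading them from `df`.

```python
# Class counts copied from the value_counts() output above
n_legit = 6_354_407
n_fraud = 8_213

n_total = n_legit + n_fraud
fraud_rate = n_fraud / n_total        # fraction of rows labeled as fraud
imbalance_ratio = n_legit / n_fraud   # legitimate rows per fraudulent row

print(f"Total rows: {n_total}")
print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Legitimate transactions per fraud: {imbalance_ratio:.1f}")
```

The fraud rate agrees with the mean of `isFraud` reported by `df.describe()`. This is why plain accuracy is a poor metric here: a model that never predicts fraud would still be right on almost every row.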


#### Check the number of unique values in the 'nameOrig' and 'nameDest' columns to judge whether one-hot encoding is feasible. Columns with millions of distinct IDs are not practical to one-hot encode.


In [9]:
# Quick check: are nameOrig and nameDest unique-like?
print("Unique nameOrig:", df['nameOrig'].nunique(), "rows:", len(df))
print("Unique nameDest:", df['nameDest'].nunique(), "rows:", len(df))

Unique nameOrig: 6353307 rows: 6362620
Unique nameDest: 2722362 rows: 6362620
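
One way to quantify "unique-like" is the share of rows that carry a distinct value. The sketch below uses a hypothetical helper, `cardinality_ratio`, on a toy frame; neither the helper nor the toy data is part of the original notebook.

```python
import pandas as pd

def cardinality_ratio(series: pd.Series) -> float:
    """Return distinct values divided by row count (1.0 means every row is unique)."""
    return series.nunique() / len(series)

# Toy data standing in for the real ID columns
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],   # every ID appears once
    "nameDest": ["M1", "M1", "C9", "C9"],   # IDs repeat
})

orig_ratio = cardinality_ratio(toy["nameOrig"])
dest_ratio = cardinality_ratio(toy["nameDest"])
print(f"nameOrig ratio: {orig_ratio:.2f}, nameDest ratio: {dest_ratio:.2f}")
```

Applied to the counts printed above, `nameOrig` is nearly unique per row and `nameDest` still has millions of distinct values, so one-hot encoding either column would create millions of sparse columns.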


# 2. Feature Engineering

#### Create new features by calculating the difference in balances for both the originating and destination accounts.


In [10]:
df['orig_balance_diff'] = df['oldbalanceOrg'] - df['newbalanceOrig']
df['dest_balance_diff'] = df['newbalanceDest'] - df['oldbalanceDest']

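#### Create ratio-based features to capture the relationship between transaction amount and original balances. A small epsilon is added to the denominator so that zero balances do not produce infinite or undefined ratios. Rows with a zero balance and a positive amount instead receive a very large finite value.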

In [11]:
# Ratio features (guard against division by zero)
df['amount_to_oldOrig'] = df['amount'] / (df['oldbalanceOrg'] + 1e-9)
df['amount_to_oldDest'] = df['amount'] / (df['oldbalanceDest'] + 1e-9)

#### Create binary flag features to indicate specific transactional behaviors.


In [12]:
# Flags
df['orig_zero_after'] = (df['newbalanceOrig'] == 0).astype(int)
df['dest_zero_before'] = (df['oldbalanceDest'] == 0).astype(int)

#### Calculate the frequency of transactions for each originating and destination account and add these as new features.


In [13]:
# Frequency (count) features for origin and destination IDs
orig_counts = df['nameOrig'].value_counts()
dest_counts = df['nameDest'].value_counts()


df['orig_txn_count'] = df['nameOrig'].map(orig_counts)
df['dest_txn_count'] = df['nameDest'].map(dest_counts)
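
The counts above are computed over the full dataset, before the train/test split. If that is a concern for a given deployment, for example because future transactions would not be known at prediction time, the frequencies can be learned from the training rows only. The sketch below shows this on toy frames named `train` and `test`, which are assumptions for illustration.

```python
import pandas as pd

train = pd.DataFrame({"nameDest": ["C1", "C1", "C2", "M1"]})
test = pd.DataFrame({"nameDest": ["C1", "C3"]})

# Learn account frequencies from the training rows only
dest_counts = train["nameDest"].value_counts()

train["dest_txn_count"] = train["nameDest"].map(dest_counts)

# Accounts unseen during training map to NaN; treat them as zero prior transactions
test["dest_txn_count"] = test["nameDest"].map(dest_counts).fillna(0).astype(int)

print(train)
print(test)
```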

#### Create a new binary feature to identify if the destination account is a merchant, based on the account ID prefix.


In [14]:
# Is the destination an external merchant? (in many datasets merchant names start with 'M')
# This is dataset specific; adjust as necessary.
df['dest_is_merchant'] = df['nameDest'].str.startswith('M').astype(int)

#### Convert the categorical 'type' column into numerical features using one-hot encoding. The first dummy variable is dropped to avoid multicollinearity.


In [15]:
# Encode transaction type using one-hot
df = pd.get_dummies(df, columns=['type'], drop_first=True)
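
The toy example below shows what `drop_first=True` does. It assumes the five transaction types are `CASH_IN`, `CASH_OUT`, `DEBIT`, `PAYMENT`, and `TRANSFER`. `CASH_IN` does not appear in the output shown in this notebook, so that level is an assumption here.

```python
import pandas as pd

# Assumed set of transaction types; CASH_IN is an assumption
toy = pd.DataFrame({"type": ["CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER"]})

encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)

# The alphabetically first level is dropped and becomes the implicit baseline
print(encoded.columns.tolist())

# A row of the dropped level is encoded as all zeros
baseline_row_sum = int(encoded.iloc[0].sum())
print(baseline_row_sum)
```

The dropped level carries no information loss: a row with all four dummy columns equal to zero is the baseline type.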

# 3. Model Training

#### Drop non-essential columns from the DataFrame before model training.


In [18]:
df_model = df.drop(columns = ['nameOrig','nameDest','isFlaggedFraud'])

#### Print the columns of the modeling DataFrame. The list still includes the target 'isFraud', which is separated from the features in the next step.


In [19]:
print("\nModel features (sample):", df_model.columns.tolist()[:30])


Model features (sample): ['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest', 'isFraud', 'orig_balance_diff', 'dest_balance_diff', 'amount_to_oldOrig', 'amount_to_oldDest', 'orig_zero_after', 'dest_zero_before', 'orig_txn_count', 'dest_txn_count', 'dest_is_merchant', 'type_CASH_OUT', 'type_DEBIT', 'type_PAYMENT', 'type_TRANSFER']


#### Separate the features (X) and the target variable (y).


In [None]:
X = df_model.drop(columns=['isFraud'])
y = df_model['isFraud']

#### Split the data into training and testing sets using a stratified split to maintain the class distribution of the target variable.


In [24]:
# Stratified split to preserve rare class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

Train shape: (5090096, 19) Test shape: (1272524, 19)
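
A quick way to confirm that stratification worked is to compare the positive rate in each split. The sketch below does this on synthetic data with 1% positives. The array names and sizes are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(10_000, 3))   # synthetic features
y_toy = np.zeros(10_000, dtype=int)
y_toy[:100] = 1                        # 1% positive class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate in train: {train_rate:.4f}, in test: {test_rate:.4f}")
```

The same check on the real data would compare `y_train.mean()` with `y_test.mean()`. Both should sit close to the overall fraud rate.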


#### Create a machine learning pipeline that first scales the features and then fits a Random Forest Classifier.
#### The `class_weight='balanced'` parameter is used to handle the class imbalance.

In [26]:
# Standardize features (tree-based models do not require scaling, but it does no harm here)
scaler = StandardScaler()

# Choose classifier (RandomForest with class_weight to handle imbalance)
clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced', n_jobs=-1)

# Build the pipeline (fitting happens in the next cell)
pipeline = Pipeline([('scaler', scaler), ('clf', clf)])
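
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * np.bincount(y))`, so the rare class receives a proportionally larger weight. The sketch below reproduces that formula on a toy label vector and checks it against `compute_class_weight`. The toy labels are an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 98 negatives and 2 positives
y_toy = np.array([0] * 98 + [1] * 2)
classes = np.array([0, 1])

# Weights as computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights written out by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

On the real training set the fraud class is far rarer than in this toy example, so its weight relative to the majority class would be correspondingly larger.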

#### Fit the pipeline to the training data.


In [27]:
print("Fitting RandomForest...")
pipeline.fit(X_train, y_train)

Fitting RandomForest...


# 4. Model Evaluation

#### Make predictions and calculate probabilities on the test set.


In [None]:
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]

#### Print the classification report and confusion matrix to evaluate the model's performance on the test data.


In [28]:
print("\nClassification report:\n", classification_report(y_test, y_pred, digits=4))

cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)



Classification report:
               precision    recall  f1-score   support

           0     0.9997    1.0000    0.9999   1270881
           1     0.9747    0.7973    0.8771      1643

    accuracy                         0.9997   1272524
   macro avg     0.9872    0.8986    0.9385   1272524
weighted avg     0.9997    0.9997    0.9997   1272524

Confusion matrix:
 [[1270847      34]
 [    333    1310]]
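
The fraud-class figures in the report can be re-derived from the confusion matrix, which helps when reading the two outputs side by side. The sketch below hard-codes the matrix printed above. Rows are true labels and columns are predicted labels, following scikit-learn's convention.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[1270847,   34],
               [    333, 1310]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of flagged transactions that are truly fraud
recall = tp / (tp + fn)      # share of fraud cases that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In words: the model raised 34 false alarms and missed 333 of the 1,643 fraudulent transactions in the test set.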


#### Calculate and print the ROC AUC and the average precision. Average precision summarizes the precision-recall curve and is usually more informative than ROC AUC when the positive class is rare.


In [29]:
roc_auc = roc_auc_score(y_test, y_proba)
pr_auc = average_precision_score(y_test, y_proba)
print(f"ROC AUC: {roc_auc:.4f}, PR AUC (avg precision): {pr_auc:.4f}")

ROC AUC: 0.9987, PR AUC (avg precision): 0.9584
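
Because `pipeline.predict` applies the default decision rule, one possible next step is to pick a different probability threshold from the precision-recall curve and trade some precision for recall. The sketch below illustrates the mechanics on synthetic scores. The data, the F1 criterion, and the variable names are all assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic stand-ins for y_test and y_proba: 950 negatives, 50 positives
y_true = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])
scores = np.concatenate([rng.beta(1, 8, size=950), rng.beta(5, 2, size=50)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than thresholds; drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
print(f"threshold={best_threshold:.3f} precision={precision[best]:.3f} recall={recall[best]:.3f}")

# Apply the chosen threshold instead of the default rule
y_pred_tuned = (scores >= best_threshold).astype(int)
```

On the real model the same steps would use `y_test` and `y_proba`. The threshold should ideally be chosen on a separate validation split rather than on the test set, so that the reported test metrics stay unbiased.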
