This project builds a machine learning model to detect fraudulent financial transactions in real time. Using logistic regression with balanced class weights, the model flags suspicious transactions and helps financial institutions prevent fraud proactively.
Real-world Application: Banks and payment processors can flag fraudulent transactions instantly, preventing money loss and protecting customers.
Build a predictive model that:
- Detects fraudulent transactions with high accuracy
- Prioritizes Recall (catch maximum frauds)
- Minimizes false negatives (missing frauds is costlier than false positives)
- Provides interpretable insights for fraud patterns
- Enables data-driven fraud prevention strategies
Source: PaySim Simulated Mobile Money Transaction Dataset
Dataset Name: PS_20174392719_1491204439457_log.csv
- Total Transactions: ~6.3 million records
- Class Distribution: Highly imbalanced
- Genuine transactions: ~99.9%
- Fraudulent transactions: ~0.1%
- Problem Type: Binary Classification (Fraud vs. Genuine)
| Feature | Description | Data Type | Notes |
|---|---|---|---|
| step | Time step (represents hours) | Numeric | 1-744 hours |
| type | Transaction type | Categorical | CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER |
| amount | Transaction amount | Numeric | Variable range, right-skewed |
| nameOrig | Originating customer name | Categorical | Removed (privacy/irrelevant) |
| oldbalanceOrg | Sender balance before transaction | Numeric | Important for fraud detection |
| newbalanceOrig | Sender balance after transaction | Numeric | Shows money movement |
| nameDest | Destination customer name | Categorical | Removed (privacy/irrelevant) |
| oldbalanceDest | Recipient balance before | Numeric | Balance change indicator |
| newbalanceDest | Recipient balance after | Numeric | Shows received amount |
| isFraud | Target variable | Binary | 0 = Genuine, 1 = Fraud |
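The schema above can be inspected right after loading. A minimal sketch, using a tiny synthetic frame with the same columns in place of the full CSV (the values below are illustrative, not real dataset rows):

```python
import pandas as pd

# In the real project: df = pd.read_csv("PS_20174392719_1491204439457_log.csv")
# A tiny synthetic stand-in with the same schema:
df = pd.DataFrame({
    "step": [1, 2, 3],
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [181.0, 9839.64, 181.0],
    "nameOrig": ["C0000000001", "C0000000002", "C0000000003"],
    "oldbalanceOrg": [181.0, 170136.0, 181.0],
    "newbalanceOrig": [0.0, 160296.36, 0.0],
    "nameDest": ["C0000000004", "M0000000005", "C0000000006"],
    "oldbalanceDest": [0.0, 0.0, 21182.0],
    "newbalanceDest": [0.0, 0.0, 0.0],
    "isFraud": [1, 0, 1],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # column types
print(df["isFraud"].value_counts(normalize=True))  # class imbalance check
```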
1. DATA LOADING
└─ Load PaySim dataset from CSV
2. EXPLORATORY DATA ANALYSIS (EDA)
├─ Check dataset shape and size
├─ Inspect data types and structure
├─ Check for duplicates
├─ Analyze fraud distribution (class imbalance)
└─ Visualize transaction patterns
3. DATA CLEANING
├─ Check and handle missing values
│ └─ Drop rows with NaN (negligible proportion)
├─ Identify outliers
│ └─ Apply log transformation to amount (right-skewed)
├─ Check for duplicates
│ └─ Remove duplicate records
└─ Remove non-predictive features
└─ nameOrig, nameDest (privacy/irrelevant)
4. FEATURE ANALYSIS
├─ Check correlation matrix
├─ Analyze multicollinearity
│ └─ Balance variables are correlated but meaningful
├─ Statistical checks
└─ Domain knowledge validation
5. FEATURE ENGINEERING
├─ One-Hot Encoding
│ └─ type column: CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER
├─ Feature Selection
│ ├─ Keep log_amount (normalized amount)
│ ├─ Keep balance features
│ └─ Keep transaction type indicators
└─ Create feature matrix X and target y
6. DATA SPLITTING
├─ Train Set: 80% of data
├─ Test Set: 20% of data
└─ Stratification: Maintain fraud ratio
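The split above can be sketched with scikit-learn's `train_test_split`; synthetic data stands in for the real feature matrix, and `stratify=y` preserves the fraud ratio in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, ~2% positive (fraud) class
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.02).astype(int)

# 80-20 split; stratification keeps the fraud ratio equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # fraud ratios should match closely
```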
7. MODEL SELECTION & TRAINING
├─ Algorithm: Logistic Regression
├─ Reason: Interpretability + binary classification + imbalanced data
├─ Class Weighting: 'balanced' (fraud gets higher importance)
├─ Max Iterations: 1000 (for convergence)
└─ Training with X_train and y_train
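The training configuration described above (balanced class weights, 1000 iterations) looks like this; the data here is a synthetic stand-in for the real features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced training data: rare positives driven by feature 0
rng = np.random.default_rng(0)
n = 2000
X_train = rng.normal(size=(n, 3))
y_train = ((X_train[:, 0] > 2.5) | (rng.random(n) < 0.005)).astype(int)

# Balanced class weights up-weight the rare fraud class; extra iterations
# help the solver converge on imbalanced data
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

print(model.classes_)     # [0 1]
print(model.coef_.shape)  # one coefficient per feature
```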
8. PREDICTION & EVALUATION
├─ Predict on test set
├─ Get probability scores
├─ Use classification_report
├─ Calculate ROC-AUC score
└─ Focus on Recall (catch all frauds)
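The evaluation steps above, end to end on synthetic data (the real pipeline would use the PaySim features instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the PaySim features
rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 3))
y = ((X[:, 0] + X[:, 1] > 2.5) | (rng.random(5000) < 0.002)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)              # hard 0/1 labels
y_prob = model.predict_proba(X_test)[:, 1]  # fraud probability scores

print(classification_report(y_test, y_pred, digits=3))  # precision/recall/F1
auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC:", round(auc, 3))
```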
9. FEATURE IMPORTANCE ANALYSIS
├─ Extract model coefficients
├─ Rank by absolute value
├─ Identify fraud predictors
└─ Generate business insights
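Coefficient-based importance ranking can be sketched as below; the feature names are illustrative placeholders, and the synthetic data makes the third feature the dominant predictor:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data where feature 2 drives the label
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (X[:, 2] * 2 + rng.normal(size=1000) > 3).astype(int)
model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Rank features by |coefficient| (names are illustrative)
importance = (
    pd.Series(model.coef_[0], index=["log_amount", "oldbalanceOrg", "type_TRANSFER"])
    .abs()
    .sort_values(ascending=False)
)
print(importance)  # features ranked by absolute coefficient
```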
10. ACTIONABLE INSIGHTS
├─ Identify fraud patterns
├─ Recommend prevention strategies
├─ Design monitoring rules
└─ Suggest infrastructure updates
- Python 3.x - Programming language
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Scikit-learn - Machine learning algorithms
- Matplotlib / Seaborn - Data visualization
| Tool | Purpose | Usage |
|---|---|---|
| `pd.read_csv()` | Load data | Read transaction dataset |
| `df.isnull().sum()` | Missing values | Check data quality |
| `np.log1p()` | Log transformation | Normalize right-skewed amount |
| `df.corr()` | Correlation matrix | Detect multicollinearity |
| `pd.get_dummies()` | One-hot encoding | Convert transaction types |
| `train_test_split()` | Data partitioning | 80-20 split with stratification |
| `LogisticRegression` | Classification model | Binary fraud detection |
| `class_weight='balanced'` | Handle imbalance | Give fraud higher importance |
| `classification_report()` | Performance metrics | Precision, Recall, F1-score |
| `roc_auc_score()` | AUC evaluation | Model discrimination ability |
- Simple and interpretable (shows which features drive fraud)
- Fast training and scoring (suitable for real-time applications)
- Probabilistic output (0-1 confidence score)
- Handles binary classification well
- Works with imbalanced data (via class weights)
Features (amount, balances, type)
↓
Linear combination (weighted sum)
↓
Sigmoid function (maps to 0-1)
↓
Probability of fraud
↓
If P(fraud) > 0.5 → FLAG AS FRAUD
If P(fraud) ≤ 0.5 → APPROVE TRANSACTION
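The pipeline above can be traced by hand with a small sigmoid sketch; the weights, intercept, and feature values below are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (made-up) coefficients, intercept, and feature vector
w = np.array([0.8, -0.3, 1.2])
b = -2.0
x = np.array([1.5, 0.2, 1.0])

z = w @ x + b          # linear combination (weighted sum)
p_fraud = sigmoid(z)   # probability of fraud

decision = "FLAG AS FRAUD" if p_fraud > 0.5 else "APPROVE TRANSACTION"
print(round(p_fraud, 3), decision)
```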
`class_weight='balanced'`
- Normally: 99.9% genuine, 0.1% fraud
- With balanced weights: both classes carry equal total weight
- Fraud misclassification penalty increased
- Catches more frauds (higher recall)
```python
df.isnull().sum()
df = df.dropna()
```
- Found a negligible number of missing values
- Removed the affected rows
- No imputation bias introduced
```python
df['log_amount'] = np.log1p(df['amount'])
```
- Problem: `amount` is right-skewed (a few very large transactions)
- Solution: log transformation
- Result: values compressed to a manageable scale
- Benefit: model trains faster and more stably
```python
df[['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']].corr()
```
- Balance variables show correlation
- Decision: retain all (they capture different aspects)
- `oldbalanceOrg` & `newbalanceOrig` → money left the sender's account
- `oldbalanceDest` & `newbalanceDest` → money reached the receiver
- Each provides a distinct fraud signal
```python
df = df.drop(columns=['nameOrig', 'nameDest'])
```
- Removed: customer identifiers
- Reason: privacy concerns, and not predictive of fraud
```python
df = pd.get_dummies(df, columns=['type'], drop_first=True)
```
type: ['CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']
        ↓
type_CASH_OUT: 1/0
type_DEBIT: 1/0
type_PAYMENT: 1/0
type_TRANSFER: 1/0
(type_CASH_IN is dropped as redundant: `drop_first=True` removes the alphabetically first category)
| Metric | Definition | Why Important |
|---|---|---|
| Precision | Of predicted frauds, how many are actual? | Minimize false alarms |
| Recall | Of actual frauds, how many detected? | PRIORITY - Catch all frauds |
| F1-Score | Harmonic mean of precision & recall | Balanced measure |
| ROC-AUC | True positive rate vs false positive rate | Overall discrimination ability |
Normal scenario: 99.9% genuine, 0.1% fraud
Without handling: Model predicts "all genuine" → 99.9% accuracy (useless!)
Solution: class_weight='balanced' → Treat fraud as 1000x more important
Result: Model learns fraud patterns despite low representation
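The weighting described above can be verified with scikit-learn's `compute_class_weight`, which uses the formula `n_samples / (n_classes * n_class_samples)`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 999 genuine vs 1 fraud per 1000 — the imbalance described above
y = np.array([0] * 999 + [1] * 1)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)                   # genuine ~0.5, fraud = 500
print(weights[1] / weights[0])   # fraud errors weighted ~999x more heavily
```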
HIGH RECALL:
└─ Catch 99% of frauds
└─ Some false positives (genuine transactions flagged)
└─ Cost: Customer inconvenience (manageable)
LOW RECALL:
└─ Miss frauds
└─ Fraudster steals money (major loss)
└─ Cost: Financial loss + reputation damage
Decision: Prioritize RECALL > Precision
1. Transaction Type: TRANSFER
   - High-risk category
   - Fraudsters quickly move money
   - Direct transfer suggests theft
2. Transaction Type: CASH_OUT
   - Withdrawing to untraceable cash
   - Quick conversion to cash
   - Common fraud pattern
3. Balance Changes
   - Sudden large balance drop (money leaving)
   - Receiver getting an unexpected large sum
   - Abnormal financial behavior
4. Amount (log_amount)
   - Larger transactions → higher risk
   - But log-transformed (non-linear effect)
   - Extreme values flagged
Fraudster Behavior Pattern:
1. Compromise account
↓
2. Quickly transfer/cash-out money
↓
3. Use high-risk transaction types
↓
4. Leave obvious balance changes
↓
5. Model detects these patterns
↓
6. Transaction flagged
incoming_transaction
↓
[Extract features: amount, type, balances]
↓
[Scale features]
↓
[Pass to ML model]
↓
[Get fraud probability]
↓
If P(fraud) > 0.5:
├─ Flag transaction
├─ Alert fraud team
├─ Request extra verification
└─ Can block if high confidence
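The scoring step in the flow above can be sketched as a small function; the stand-in model, feature values, and the `score_transaction` name are all illustrative (a real deployment would load a model fitted on PaySim features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data, where feature 0 drives fraud
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 2).astype(int)
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

def score_transaction(features, threshold=0.5):
    """Return (fraud_probability, action) for one incoming transaction."""
    p = model.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]
    action = "FLAG" if p > threshold else "APPROVE"
    return p, action

p, action = score_transaction([3.0, 0.0, 0.0])
print(p, action)
```

In production, the threshold would be tuned on the recall/false-positive trade-off rather than fixed at 0.5.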
1. Velocity Checks
   - If a customer makes 5 transfers in < 1 hour → suspicious (a normal user makes 1-2 per day) → request verification
2. Multi-Factor Authentication (MFA)
   - Large or unusual transactions → require OTP / biometric / security question → verify customer identity
3. Threshold-Based Alerts
   - Transaction amount > customer's monthly average → flag for review → monitor for patterns
4. Geographic Checks
   - Transaction from an unusual location → compare with customer history → verify travel plans
5. Balance Anomaly Detection
   - Sudden large balance drop that doesn't match customer behavior → request confirmation
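The velocity check described above can be sketched as a simple rule; the one-hour window and five-transfer limit come from the text, while the function name and data are illustrative:

```python
from datetime import datetime, timedelta

def velocity_flag(transfer_times, window=timedelta(hours=1), limit=5):
    """Flag if `limit` or more transfers fall inside any `window`-long span."""
    times = sorted(transfer_times)
    for i in range(len(times)):
        # Count transfers within `window` of times[i]
        j = i
        while j < len(times) and times[j] - times[i] <= window:
            j += 1
        if j - i >= limit:
            return True
    return False

base = datetime(2024, 1, 1, 12, 0)
burst = [base + timedelta(minutes=5 * k) for k in range(5)]   # 5 transfers in 20 min
normal = [base + timedelta(hours=6 * k) for k in range(3)]    # spread over the day

print(velocity_flag(burst))   # True  -> request verification
print(velocity_flag(normal))  # False
```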
1. FRAUD DETECTION RATE
└─ % frauds caught / total frauds
└─ Target: > 95%
2. FALSE POSITIVE RATE
└─ % legitimate transactions blocked / total legitimate
└─ Target: < 1% (customer experience)
3. FRAUD LOSS RATE
└─ $ lost to fraud / total $ transacted
└─ Target: < 0.01% (industry benchmark)
4. CUSTOMER SATISFACTION
└─ Complaints about false blocks
└─ Churn due to friction
└─ Track and minimize
A/B test:
├─ 50% of users: current rules
└─ 50% of users: new ML model
Measure improvements:
├─ Which catches more fraud? ✓
├─ Which has fewer false positives? ✓
└─ Deploy the winner
Week 1: Deploy model (baseline)
↓
Week 2-4: Monitor fraud rate, false positives
↓
Week 5: Analyze misclassifications
↓
Week 6: Retrain with recent data
↓
Week 7: Adjust threshold if needed
↓
Week 8: Compare with Week 1 (improvement?)
↓
Continue cycle...
Fraud_Transaction_Detection.ipynb
├── Cell 1-3: Project header & objective
├── Cell 4: Import libraries
├── Cell 5: Load dataset
├── Cell 6-8: Exploratory analysis
├── Cell 9-12: Missing values handling
├── Cell 13-15: Outlier handling (log transformation)
├── Cell 16-18: Multicollinearity analysis
├── Cell 19-20: Feature selection & engineering
├── Cell 21: Train-test split
├── Cell 22-24: Model building & training
├── Cell 25-26: Predictions
├── Cell 27-28: Model evaluation (Classification report, ROC-AUC)
├── Cell 29-30: Feature importance
└── Cell 31-32: Business insights
Python 3.7+
Jupyter Notebook or Google Colab
```shell
pip install pandas numpy scikit-learn matplotlib seaborn
```

```python
# Colab comes with all of these libraries pre-installed.
# Just upload the CSV file or mount Google Drive:
from google.colab import files
files.upload()  # Upload PS_20174392719_1491204439457_log.csv
```

- Place the CSV file in the working directory
- Or provide the correct file path
- Check missing values
- Apply log transformation
- Analyze correlations
- One-hot encode transaction types
- Create final feature matrix
- Split data (80-20)
- Train Logistic Regression
- Monitor training progress
- Get predictions on test set
- Print classification report
- Calculate ROC-AUC
- Feature importance ranking
- Identify fraud patterns
- Generate business recommendations
Precision (Fraud Class): 0.70-0.85
Recall (Fraud Class): 0.90-0.98 ← PRIORITY METRIC
F1-Score (Fraud Class): 0.78-0.90
ROC-AUC Score: 0.95-0.98
TRANSFER transactions: Highest fraud risk (2-3x normal)
CASH-OUT transactions: Second highest risk
Large amounts: More fraudulent patterns
Balance anomalies: Key fraud indicators
1. Try Alternative Algorithms
   ```python
   from sklearn.ensemble import RandomForestClassifier
   from xgboost import XGBClassifier
   ```
   - Random Forest: better for non-linear patterns
   - XGBoost: strong gradient-boosting performance on tabular data
2. Handle Class Imbalance
   ```python
   from imblearn.over_sampling import SMOTE
   ```
   - Generate synthetic fraud samples for training
   - Better minority-class learning
3. Hyperparameter Tuning
   ```python
   from sklearn.model_selection import GridSearchCV
   ```
   - Find the best parameters
   - Cross-validation
4. Ensemble Methods
   - Combine multiple models
   - Voting ensemble
   - Stacking
5. Feature Engineering
   - Create interaction features
   - Domain-specific features
   - Time-based features
- Scikit-learn: https://scikit-learn.org/
- Pandas Documentation: https://pandas.pydata.org/docs
- Imbalanced Data Handling: https://imbalanced-learn.org/
Real-world fraud rate: 0.1% (1 fraud per 1000 transactions)
Naive model: "Always predict genuine"
Result: 99.9% accuracy but catches ZERO frauds (useless!)
Solution: Use class_weight='balanced'
└─ Fraud misclassification = 1000x penalty
└─ Model forced to learn fraud patterns
└─ Better recall (catches more frauds)
Before: [0.01, 100, 500000, 1000000, 2000000]
Range: 2,000,000x spread (huge outliers)
After log: [0.01, 4.6, 13.1, 13.8, 14.5]
Range: 1000x spread (more manageable)
Benefit: Model trains faster, more stable weights
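The before/after values above can be reproduced directly with `np.log1p`:

```python
import numpy as np

amounts = np.array([0.01, 100, 500000, 1000000, 2000000])
logged = np.log1p(amounts)  # log(1 + x): safe at zero, compresses large values
print(np.round(logged, 1))  # ≈ [0.0, 4.6, 13.1, 13.8, 14.5]
```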
Categorical: ['TRANSFER', 'PAYMENT', 'CASH_OUT']
        ↓
type_TRANSFER: [1, 0, 0]
type_PAYMENT: [0, 1, 0]
type_CASH_OUT: [0, 0, 1]
Benefit: ML models require numerical input
Feel free to:
- Fork and modify
- Improve the model
- Add new features
- Share insights
This project uses the PaySim public dataset. For more information, refer to their documentation.
For questions about:
- Dataset: Refer to PaySim documentation
- Logistic Regression: Check scikit-learn docs
- Class Imbalance: See imbalanced-learn library
- Fraud Detection Best Practices: See Fraud-Detection-Handbook
This project demonstrates:
- Handling highly imbalanced data
- Feature engineering and cleaning
- Logistic regression for fraud detection
- Interpretable machine learning
- Real-world application design
- Business impact metrics
Status: Complete