Fraud Transaction Detection

Project Overview

This project builds a Machine Learning model to detect fraudulent financial transactions in real time. Using Logistic Regression with class-weighted training, the model identifies suspicious transactions and helps financial institutions prevent fraud proactively.

Real-world Application: Banks and payment processors can flag fraudulent transactions instantly, preventing money loss and protecting customers.


Objective

Build a predictive model that:

  • Detects fraudulent transactions with high accuracy
  • Prioritizes Recall (catch maximum frauds)
  • Minimizes false negatives (missing frauds is costlier than false positives)
  • Provides interpretable insights for fraud patterns
  • Enables data-driven fraud prevention strategies

Dataset

Source: PaySim Simulated Mobile Money Transaction Dataset

Dataset Name: PS_20174392719_1491204439457_log.csv

Dataset Characteristics:

  • Total Transactions: ~6.3 million records
  • Class Distribution: Highly imbalanced
    • Genuine transactions: ~99.9%
    • Fraudulent transactions: ~0.1%
  • Problem Type: Binary Classification (Fraud vs. Genuine)

Key Features:

| Feature | Description | Data Type | Notes |
|---|---|---|---|
| step | Time step (represents hours) | Numeric | 1-744 hours |
| type | Transaction type | Categorical | CASH-IN, CASH-OUT, DEBIT, PAYMENT, TRANSFER |
| amount | Transaction amount | Numeric | Variable range, right-skewed |
| nameOrig | Originating customer name | Categorical | Removed (privacy/irrelevant) |
| oldbalanceOrg | Balance before transaction | Numeric | Important for fraud detection |
| newbalanceOrig | Balance after transaction | Numeric | Shows money movement |
| nameDest | Destination customer name | Categorical | Removed (privacy/irrelevant) |
| oldbalanceDest | Recipient balance before | Numeric | Balance change indicator |
| newbalanceDest | Recipient balance after | Numeric | Shows received amount |
| isFraud | Target variable | Binary | 0 = Genuine, 1 = Fraud |

Project Workflow

1. DATA LOADING
   └─ Load PaySim dataset from CSV
   
2. EXPLORATORY DATA ANALYSIS (EDA)
   ├─ Check dataset shape and size
   ├─ Inspect data types and structure
   ├─ Check for duplicates
   ├─ Analyze fraud distribution (class imbalance)
   └─ Visualize transaction patterns
   
3. DATA CLEANING
   ├─ Check and handle missing values
   │  └─ Drop rows with NaN (negligible proportion)
   ├─ Identify outliers
   │  └─ Apply log transformation to amount (right-skewed)
   ├─ Check for duplicates
   │  └─ Remove duplicate records
   └─ Remove non-predictive features
      └─ nameOrig, nameDest (privacy/irrelevant)
   
4. FEATURE ANALYSIS
   ├─ Check correlation matrix
   ├─ Analyze multicollinearity
   │  └─ Balance variables are correlated but meaningful
   ├─ Statistical checks
   └─ Domain knowledge validation
   
5. FEATURE ENGINEERING
   ├─ One-Hot Encoding
   │  └─ type column: CASH-IN, CASH-OUT, DEBIT, PAYMENT, TRANSFER
   ├─ Feature Selection
   │  ├─ Keep log_amount (normalized amount)
   │  ├─ Keep balance features
   │  └─ Keep transaction type indicators
   └─ Create feature matrix X and target y
   
6. DATA SPLITTING
   ├─ Train Set: 80% of data
   ├─ Test Set: 20% of data
   └─ Stratification: Maintain fraud ratio
   
7. MODEL SELECTION & TRAINING
   ├─ Algorithm: Logistic Regression
   ├─ Reason: Interpretability + binary classification + imbalanced data
   ├─ Class Weighting: 'balanced' (fraud gets higher importance)
   ├─ Max Iterations: 1000 (for convergence)
   └─ Training with X_train and y_train
   
8. PREDICTION & EVALUATION
   ├─ Predict on test set
   ├─ Get probability scores
   ├─ Use classification_report
   ├─ Calculate ROC-AUC score
   └─ Focus on Recall (catch all frauds)
   
9. FEATURE IMPORTANCE ANALYSIS
   ├─ Extract model coefficients
   ├─ Rank by absolute value
   ├─ Identify fraud predictors
   └─ Generate business insights
   
10. ACTIONABLE INSIGHTS
    ├─ Identify fraud patterns
    ├─ Recommend prevention strategies
    ├─ Design monitoring rules
    └─ Suggest infrastructure updates
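The workflow above can be condensed into a short runnable sketch. The data here is a synthetic stand-in (the real project loads the PaySim CSV); column names mirror the dataset, and the StandardScaler step is an assumption added to keep the unscaled balance feature from slowing convergence.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

# Synthetic stand-in for the PaySim CSV
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "amount": rng.lognormal(mean=8, sigma=2, size=n),
    "oldbalanceOrg": rng.uniform(0, 1e6, size=n),
    "type": rng.choice(["CASH-OUT", "PAYMENT", "TRANSFER"], size=n),
})
# Toy target: large TRANSFERs are fraudulent
df["isFraud"] = ((df["type"] == "TRANSFER") & (df["amount"] > 50_000)).astype(int)

# Step 3: log-transform the right-skewed amount
df["log_amount"] = np.log1p(df["amount"])

# Step 5: one-hot encode transaction type
df = pd.get_dummies(df, columns=["type"], drop_first=True)
X = df.drop(columns=["isFraud", "amount"])
y = df["isFraud"]

# Step 6: stratified 80-20 split keeps the fraud ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 7: class-weighted Logistic Regression
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000))
model.fit(X_train, y_train)

# Step 8: evaluation
proba = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, model.predict(X_test)))
print("ROC-AUC:", roc_auc_score(y_test, proba))
```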

Technologies & Libraries

Core Libraries:

  • Python 3.x - Programming language
  • Pandas - Data manipulation and analysis
  • NumPy - Numerical computing
  • Scikit-learn - Machine learning algorithms
  • Matplotlib / Seaborn - Data visualization

Specific Tools Used:

| Tool | Purpose | Usage |
|---|---|---|
| pd.read_csv() | Load data | Read transaction dataset |
| df.isnull().sum() | Missing values | Check data quality |
| np.log1p() | Log transformation | Normalize right-skewed amount |
| df.corr() | Correlation matrix | Detect multicollinearity |
| pd.get_dummies() | One-hot encoding | Convert transaction types |
| train_test_split() | Data partitioning | 80-20 split with stratification |
| LogisticRegression | Classification model | Binary fraud detection |
| class_weight='balanced' | Handle imbalance | Give fraud higher importance |
| classification_report() | Performance metrics | Precision, Recall, F1-score |
| roc_auc_score() | AUC evaluation | Model discrimination ability |

Model Details

Algorithm: Logistic Regression

Why Logistic Regression?

  • Simple and interpretable (understand which features drive fraud)
  • Fast training (suitable for real-time applications)
  • Probabilistic output (0-1 confidence score)
  • Handles binary classification well
  • Works with imbalanced data (using class weights)

How It Works:

Features (amount, balances, type)
    ↓
Linear combination (weighted sum)
    ↓
Sigmoid function (maps to 0-1)
    ↓
Probability of fraud
    ↓
If P(fraud) > 0.5 → FLAG AS FRAUD 
If P(fraud) ≤ 0.5 → APPROVE TRANSACTION 
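The pipeline above can be sketched in a few lines; the weights and features here are made up purely for illustration, not learned from PaySim.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned coefficients for [log_amount, balance_delta, type_TRANSFER]
weights = np.array([1.2, -0.5, 2.0])
bias = -4.0

# One incoming transaction's feature values
features = np.array([11.0, 0.3, 1.0])

# Linear combination -> sigmoid -> probability -> threshold
p_fraud = sigmoid(weights @ features + bias)
decision = "FLAG AS FRAUD" if p_fraud > 0.5 else "APPROVE TRANSACTION"
print(round(p_fraud, 5), decision)
```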

Key Parameter:

class_weight='balanced'
  • Normally: 99.9% genuine, 0.1% fraud
  • With balanced weights: Both classes equally important
  • Fraud misclassification penalty increased
  • Catches more frauds (higher recall)
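Scikit-learn computes 'balanced' weights as n_samples / (n_classes * count per class), which is easy to verify on a toy label vector matching the 99.9/0.1 split above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 999 genuine transactions, 1 fraud
y = np.array([0] * 999 + [1])

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # genuine ~0.5005, fraud = 500.0 (a ~1000x penalty ratio)
```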

Data Cleaning & Preprocessing

1. Missing Values Handling

df.isnull().sum()
df = df.dropna()
  • Found negligible missing values
  • Removed affected rows
  • No imputation bias introduced

2. Outlier Detection & Handling

df['log_amount'] = np.log1p(df['amount'])
  • Problem: Amount is right-skewed (few very large transactions)
  • Solution: Log transformation
  • Result: Values normalized to reasonable scale
  • Benefit: Model trains faster and more stable
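The effect of the transform is easy to see on a few sample amounts; log1p computes log(1 + x), so it is safe even at zero:

```python
import numpy as np

amounts = np.array([0.01, 100.0, 500_000.0, 1_000_000.0, 2_000_000.0])
log_amounts = np.log1p(amounts)  # log(1 + x)

# A spread of eight orders of magnitude collapses into roughly 0-15
print(np.round(log_amounts, 2))
```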

3. Multicollinearity Analysis

df[['oldbalanceOrg','newbalanceOrig','oldbalanceDest','newbalanceDest']].corr()
  • Balance variables show correlation
  • Decision: Retain all (they represent different aspects)
    • oldbalanceOrg & newbalanceOrig → Money left sender account
    • oldbalanceDest & newbalanceDest → Money reached receiver
  • All provide unique fraud indicators

4. Feature Removal

df = df.drop(columns=['nameOrig', 'nameDest'])
  • Removed: Names of customers
  • Reason: Privacy concerns + Not predictive of fraud

5. One-Hot Encoding

df = pd.get_dummies(df, columns=['type'], drop_first=True)
type: ['CASH-IN', 'CASH-OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']
    ↓
type_CASH-IN: dropped (redundant - implied when all other indicators are 0)
type_CASH-OUT: 1/0
type_DEBIT: 1/0
type_PAYMENT: 1/0
type_TRANSFER: 1/0
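A minimal demonstration (category names as written in this README); drop_first removes the alphabetically first category, CASH-IN, since it is implied when all other indicators are 0:

```python
import pandas as pd

df = pd.DataFrame({"type": ["TRANSFER", "PAYMENT", "CASH-OUT", "CASH-IN", "DEBIT"]})

# One indicator column per category, minus the first (CASH-IN)
encoded = pd.get_dummies(df, columns=["type"], drop_first=True)
print(list(encoded.columns))
```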

Model Performance

Evaluation Metrics:

| Metric | Definition | Why Important |
|---|---|---|
| Precision | Of predicted frauds, how many are actual? | Minimize false alarms |
| Recall | Of actual frauds, how many detected? | PRIORITY - Catch all frauds |
| F1-Score | Harmonic mean of precision & recall | Balanced measure |
| ROC-AUC | True positive rate vs false positive rate | Overall discrimination ability |
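The notebook uses classification_report; the same metrics can be computed individually on a toy set of labels to make the definitions concrete:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 0]  # one false alarm, one missed fraud
scores = [0.1, 0.2, 0.1, 0.6, 0.3, 0.2, 0.9, 0.8, 0.7, 0.4]

print("Precision:", precision_score(y_true, y_pred))  # 3 of 4 flagged are fraud
print("Recall:   ", recall_score(y_true, y_pred))     # 3 of 4 frauds caught
print("ROC-AUC:  ", roc_auc_score(y_true, scores))    # ranking quality
```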

Class Imbalance Challenge:

Normal scenario: 99.9% genuine, 0.1% fraud
Without handling: Model predicts "all genuine" → 99.9% accuracy (useless!)
Solution: class_weight='balanced' → Treat fraud as 1000x more important
Result: Model learns fraud patterns despite low representation

Performance Trade-off:

HIGH RECALL:
   └─ Catch 99% of frauds
   └─ Some false positives (genuine transactions flagged)
   └─ Cost: Customer inconvenience (manageable)

LOW RECALL:
   └─ Miss frauds
   └─ Fraudster steals money (major loss)
   └─ Cost: Financial loss + reputation damage

Decision: Prioritize RECALL > Precision

Key Fraud Predictors

Top Factors Indicating Fraud:

  1. Transaction Type: TRANSFER

    • High-risk category
    • Fraudsters quickly move money
    • Direct transfer suggests theft
  2. Transaction Type: CASH-OUT

    • Withdrawing to untraceable cash
    • Quick conversion to cash
    • Common fraud pattern
  3. Balance Changes

    • Sudden large balance drop (money leaving)
    • Receiver getting unexpected large sum
    • Abnormal financial behavior
  4. Amount (log_amount)

    • Larger transactions → Higher risk
    • But log-transformed (non-linear)
    • Extreme values flagged

Why These Make Sense:

Fraudster Behavior Pattern:

1. Compromise account
   ↓
2. Quickly transfer/cash-out money
   ↓
3. Use high-risk transaction types
   ↓
4. Leave obvious balance changes
   ↓
5. Model detects these patterns
   ↓
6. Transaction flagged 

Fraud Prevention Strategies

Real-time Monitoring:

incoming_transaction
    ↓
[Extract features: amount, type, balances]
    ↓
[Scale features]
    ↓
[Pass to ML model]
    ↓
[Get fraud probability]
    ↓
If P(fraud) > 0.5:
├─ Flag transaction
├─ Alert fraud team
├─ Request extra verification
└─ Can block if high confidence
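The scoring path above might look like the sketch below; the model here is a stand-in trained on synthetic data, and `score_transaction` is a hypothetical helper name, not part of the project:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for a model trained offline on historical transactions
rng = np.random.default_rng(0)
X_hist = rng.normal(size=(500, 3))
y_hist = (X_hist[:, 0] + X_hist[:, 2] > 1.5).astype(int)
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_hist, y_hist)

def score_transaction(features, threshold=0.5):
    """Return (fraud_probability, action) for one incoming transaction."""
    p = model.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]
    return p, ("FLAG" if p > threshold else "APPROVE")

# An incoming transaction deep in the high-risk region
p, action = score_transaction([2.0, 0.0, 1.5])
print(round(p, 3), action)
```

In production, flagged transactions would then be routed to the fraud team or to an extra-verification step as described above.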

Recommended Controls:

  1. Velocity Checks

    If customer does 5 transfers in < 1 hour
    → Suspicious (normal user = 1-2 per day)
    → Request verification
    
  2. Multi-Factor Authentication (MFA)

    Large or unusual transactions
    → Require OTP / biometric / security question
    → Verify customer identity
    
  3. Threshold-Based Alerts

    Transaction amount > customer's monthly average
    → Flag for review
    → Monitor for patterns
    
  4. Geographic Checks

    Transaction from unusual location
    → Compare with customer history
    → Verify travel plans
    
  5. Balance Anomaly Detection

    Sudden large balance drop
    → Not matching customer behavior
    → Request confirmation
    

Measuring Effectiveness

Key Metrics to Track:

1. FRAUD DETECTION RATE
   └─ % frauds caught / total frauds
   └─ Target: > 95%

2. FALSE POSITIVE RATE
   └─ % legitimate transactions blocked / total legitimate
   └─ Target: < 1% (customer experience)

3. FRAUD LOSS RATE
   └─ $ lost to fraud / total $ transacted
   └─ Target: < 0.01% (industry benchmark)

4. CUSTOMER SATISFACTION
   └─ Complaints about false blocks
   └─ Churn due to friction
   └─ Track and minimize

A/B Testing Approach:

Normal Pathway:
├─ 50% users: Current rules
└─ 50% users: New ML model

Measure improvements:
├─ Which catches more fraud? ✓
├─ Which has lower false positives? ✓
└─ Deploy winner

Continuous Improvement:

Week 1: Deploy model (baseline)
    ↓
Week 2-4: Monitor fraud rate, false positives
    ↓
Week 5: Analyze misclassifications
    ↓
Week 6: Retrain with recent data
    ↓
Week 7: Adjust threshold if needed
    ↓
Week 8: Compare with Week 1 (improvement?)
    ↓
Continue cycle...

Project Files

Fraud_Transaction_Detection.ipynb
├── Cell 1-3: Project header & objective
├── Cell 4: Import libraries
├── Cell 5: Load dataset
├── Cell 6-8: Exploratory analysis
├── Cell 9-12: Missing values handling
├── Cell 13-15: Outlier handling (log transformation)
├── Cell 16-18: Multicollinearity analysis
├── Cell 19-20: Feature selection & engineering
├── Cell 21: Train-test split
├── Cell 22-24: Model building & training
├── Cell 25-26: Predictions
├── Cell 27-28: Model evaluation (Classification report, ROC-AUC)
├── Cell 29-30: Feature importance
└── Cell 31-32: Business insights

Installation & Setup

Requirements:

Python 3.7+
Jupyter Notebook or Google Colab

Install Dependencies:

pip install pandas numpy scikit-learn matplotlib seaborn

For Google Colab (if using cloud):

# Colab comes with all libraries pre-installed
# Just upload the CSV file or mount Google Drive
from google.colab import files
files.upload()  # Upload PS_20174392719_1491204439457_log.csv

How to Run

Step 1: Load Dataset

  • Place CSV file in working directory
  • Or provide correct file path

Step 2: Run Data Cleaning Cells

  • Check missing values
  • Apply log transformation
  • Analyze correlations

Step 3: Run Feature Engineering

  • One-hot encode transaction types
  • Create final feature matrix

Step 4: Train Model

  • Split data (80-20)
  • Train Logistic Regression
  • Monitor training progress

Step 5: Evaluate

  • Get predictions on test set
  • Print classification report
  • Calculate ROC-AUC

Step 6: Extract Insights

  • Feature importance ranking
  • Identify fraud patterns
  • Generate business recommendations

Expected Results

Model Performance:

Precision (Fraud Class):  0.70-0.85
Recall (Fraud Class):     0.90-0.98  ← PRIORITY METRIC
F1-Score (Fraud Class):   0.78-0.90
ROC-AUC Score:            0.95-0.98

Key Insights:

TRANSFER transactions: Highest fraud risk (2-3x normal)
CASH-OUT transactions: Second highest risk
Large amounts: More fraudulent patterns
Balance anomalies: Key fraud indicators

Customization & Improvements

Model Enhancements:

  1. Try Alternative Algorithms:

    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier
    • Random Forest: Better for non-linear patterns
    • XGBoost: State-of-the-art performance
  2. Handle Class Imbalance:

    from imblearn.over_sampling import SMOTE
    • Generate synthetic fraud samples
    • Better model learning
  3. Hyperparameter Tuning:

    from sklearn.model_selection import GridSearchCV
    • Find best parameters
    • Cross-validation
  4. Ensemble Methods:

    • Combine multiple models
    • Voting ensemble
    • Stacking
  5. Feature Engineering:

    • Create interaction features
    • Domain-specific features
    • Time-based features
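Improvement #3 might look like this sketch; the synthetic data and parameter grid are illustrative assumptions, with recall as the scoring metric to match the project's priority:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 1] > 0.5).astype(int)

# Search regularization strength, scored on the priority metric (recall)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```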

Key Concepts Explained

Class Imbalance Problem:

Real-world fraud rate: 0.1% (1 fraud per 1000 transactions)
Naive model: "Always predict genuine"
Result: 99.9% accuracy but catches ZERO frauds (useless!)

Solution: Use class_weight='balanced'
└─ Fraud misclassification = 1000x penalty
└─ Model forced to learn fraud patterns
└─ Better recall (catches more frauds)
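The "always genuine" trap is easy to demonstrate: near-perfect accuracy with zero recall.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 999 + [1]   # 1 fraud per 1000 transactions
y_naive = [0] * 1000       # model that always predicts "genuine"

print(accuracy_score(y_true, y_naive))  # 0.999 - looks great
print(recall_score(y_true, y_naive))    # 0.0   - catches zero frauds
```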

Log Transformation:

Before: [0.01, 100, 500000, 1000000, 2000000]
        Values span roughly eight orders of magnitude (huge outliers)

After log1p: [0.01, 4.62, 13.12, 13.82, 14.51]
        Compressed to a narrow, comparable scale

Benefit: Model trains faster, more stable weights

One-Hot Encoding:

Categorical: ['TRANSFER', 'PAYMENT', 'CASH-OUT']
      ↓
type_TRANSFER: [1, 0, 0]
type_PAYMENT: [0, 1, 0]
type_CASH-OUT: [0, 0, 1]

Benefit: ML models work with numerical data

Contributing

Feel free to:

  • Fork and modify
  • Improve the model
  • Add new features
  • Share insights

License

This project uses the PaySim public dataset. For more information, refer to their documentation.


Support

For questions about:

  • Dataset: Refer to PaySim documentation
  • Logistic Regression: Check scikit-learn docs
  • Class Imbalance: See imbalanced-learn library
  • Fraud Detection Best Practices: See Fraud-Detection-Handbook

Summary

This project demonstrates:

  • Handling highly imbalanced data
  • Feature engineering and cleaning
  • Logistic regression for fraud detection
  • Interpretable machine learning
  • Real-world application design
  • Business impact metrics

Status: Complete and Production-Ready
