This project builds a machine learning model to detect fraudulent financial transactions in real time. Using logistic regression with balanced class weights, the model flags suspicious transactions and helps financial institutions prevent fraud proactively.
Real-world Application: Banks and payment processors can flag fraudulent transactions instantly, preventing money loss and protecting customers.
Build a predictive model that:
- Detects fraudulent transactions with high accuracy
- Prioritizes Recall (catch maximum frauds)
- Minimizes false negatives (missing frauds is costlier than false positives)
- Provides interpretable insights for fraud patterns
- Enables data-driven fraud prevention strategies
Source: PaySim Simulated Mobile Money Transaction Dataset
Dataset Name: PS_20174392719_1491204439457_log.csv
- Total Transactions: ~6.3 million records
- Class Distribution: Highly imbalanced
- Genuine transactions: ~99.9%
- Fraudulent transactions: ~0.1%
- Problem Type: Binary Classification (Fraud vs. Genuine)
| Feature | Description | Data Type | Notes |
|---|---|---|---|
| step | Time step (represents hours) | Numeric | 1-744 hours |
| type | Transaction type | Categorical | CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER |
| amount | Transaction amount | Numeric | Variable range, right-skewed |
| nameOrig | Originating customer name | Categorical | Removed (privacy/irrelevant) |
| oldbalanceOrg | Sender balance before transaction | Numeric | Important for fraud detection |
| newbalanceOrig | Sender balance after transaction | Numeric | Shows money movement |
| nameDest | Destination customer name | Categorical | Removed (privacy/irrelevant) |
| oldbalanceDest | Recipient balance before | Numeric | Balance change indicator |
| newbalanceDest | Recipient balance after | Numeric | Shows received amount |
| isFraud | Target variable | Binary | 0 = Genuine, 1 = Fraud |
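The schema above can be inspected right after loading. A minimal sketch, using a tiny synthetic frame with the same columns in place of the full CSV (the values below are illustrative, not real dataset rows):

```python
import pandas as pd

# In the real project: df = pd.read_csv("PS_20174392719_1491204439457_log.csv")
# A tiny synthetic stand-in with the same schema:
df = pd.DataFrame({
    "step": [1, 2, 3],
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT"],
    "amount": [181.0, 9839.64, 181.0],
    "nameOrig": ["C0000000001", "C0000000002", "C0000000003"],
    "oldbalanceOrg": [181.0, 170136.0, 181.0],
    "newbalanceOrig": [0.0, 160296.36, 0.0],
    "nameDest": ["C0000000004", "M0000000005", "C0000000006"],
    "oldbalanceDest": [0.0, 0.0, 21182.0],
    "newbalanceDest": [0.0, 0.0, 0.0],
    "isFraud": [1, 0, 1],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # column types
print(df["isFraud"].value_counts(normalize=True))  # class imbalance check
```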
1. DATA LOADING
└─ Load PaySim dataset from CSV
2. EXPLORATORY DATA ANALYSIS (EDA)
├─ Check dataset shape and size
├─ Inspect data types and structure
├─ Check for duplicates
├─ Analyze fraud distribution (class imbalance)
└─ Visualize transaction patterns
3. DATA CLEANING
├─ Check and handle missing values
│ └─ Drop rows with NaN (negligible proportion)
├─ Identify outliers
│ └─ Apply log transformation to amount (right-skewed)
├─ Check for duplicates
│ └─ Remove duplicate records
└─ Remove non-predictive features
└─ nameOrig, nameDest (privacy/irrelevant)
4. FEATURE ANALYSIS
├─ Check correlation matrix
├─ Analyze multicollinearity
│ └─ Balance variables are correlated but meaningful
├─ Statistical checks
└─ Domain knowledge validation
5. FEATURE ENGINEERING
├─ One-Hot Encoding
│ └─ type column: CASH_IN, CASH_OUT, DEBIT, PAYMENT, TRANSFER
├─ Feature Selection
│ ├─ Keep log_amount (normalized amount)
│ ├─ Keep balance features
│ └─ Keep transaction type indicators
└─ Create feature matrix X and target y
6. DATA SPLITTING
├─ Train Set: 80% of data
├─ Test Set: 20% of data
└─ Stratification: Maintain fraud ratio
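The split above can be sketched with scikit-learn's `train_test_split`; synthetic data stands in for the real feature matrix, and `stratify=y` preserves the fraud ratio in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, ~2% positive (fraud) class
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.02).astype(int)

# 80-20 split; stratification keeps the fraud ratio equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # fraud ratios should match closely
```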
7. MODEL SELECTION & TRAINING
├─ Algorithm: Logistic Regression
├─ Reason: Interpretability + binary classification + imbalanced data
├─ Class Weighting: 'balanced' (fraud gets higher importance)
├─ Max Iterations: 1000 (for convergence)
└─ Training with X_train and y_train
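The training configuration described above (balanced class weights, 1000 iterations) looks like this; the data here is a synthetic stand-in for the real features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced training data: rare positives driven by feature 0
rng = np.random.default_rng(0)
n = 2000
X_train = rng.normal(size=(n, 3))
y_train = ((X_train[:, 0] > 2.5) | (rng.random(n) < 0.005)).astype(int)

# Balanced class weights up-weight the rare fraud class; extra iterations
# help the solver converge on imbalanced data
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_train, y_train)

print(model.classes_)     # [0 1]
print(model.coef_.shape)  # one coefficient per feature
```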
8. PREDICTION & EVALUATION
├─ Predict on test set
├─ Get probability scores
├─ Use classification_report
├─ Calculate ROC-AUC score
└─ Focus on Recall (catch all frauds)
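The evaluation steps above, end to end on synthetic data (the real pipeline would use the PaySim features instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the PaySim features
rng = np.random.default_rng(3)
X = rng.normal(size=(5000, 3))
y = ((X[:, 0] + X[:, 1] > 2.5) | (rng.random(5000) < 0.002)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)              # hard 0/1 labels
y_prob = model.predict_proba(X_test)[:, 1]  # fraud probability scores

print(classification_report(y_test, y_pred, digits=3))  # precision/recall/F1
auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC:", round(auc, 3))
```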
9. FEATURE IMPORTANCE ANALYSIS
├─ Extract model coefficients
├─ Rank by absolute value
├─ Identify fraud predictors
└─ Generate business insights
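Coefficient-based importance ranking can be sketched as below; the feature names are illustrative placeholders, and the synthetic data makes the third feature the dominant predictor:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data where feature 2 drives the label
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = (X[:, 2] * 2 + rng.normal(size=1000) > 3).astype(int)
model = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)

# Rank features by |coefficient| (names are illustrative)
importance = (
    pd.Series(model.coef_[0], index=["log_amount", "oldbalanceOrg", "type_TRANSFER"])
    .abs()
    .sort_values(ascending=False)
)
print(importance)  # features ranked by absolute coefficient
```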
10. ACTIONABLE INSIGHTS
├─ Identify fraud patterns
├─ Recommend prevention strategies
├─ Design monitoring rules
└─ Suggest infrastructure updates
- Python 3.x - Programming language
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Scikit-learn - Machine learning algorithms
- Matplotlib / Seaborn - Data visualization
| Tool | Purpose | Usage |
|---|---|---|
| `pd.read_csv()` | Load data | Read transaction dataset |
| `df.isnull().sum()` | Missing values | Check data quality |
| `np.log1p()` | Log transformation | Normalize right-skewed amount |
| `df.corr()` | Correlation matrix | Detect multicollinearity |
| `pd.get_dummies()` | One-hot encoding | Convert transaction types |
| `train_test_split()` | Data partitioning | 80-20 split with stratification |
| `LogisticRegression` | Classification model | Binary fraud detection |
| `class_weight='balanced'` | Handle imbalance | Give fraud higher importance |
| `classification_report()` | Performance metrics | Precision, Recall, F1-score |
| `roc_auc_score()` | AUC evaluation | Model discrimination ability |
- Simple and interpretable (shows which features drive fraud)
- Fast training and scoring (suitable for real-time applications)
- Probabilistic output (0-1 confidence score)
- Handles binary classification well
- Works with imbalanced data (via class weights)
Features (amount, balances, type)
↓
Linear combination (weighted sum)
↓
Sigmoid function (maps to 0-1)
↓
Probability of fraud
↓
If P(fraud) > 0.5 → FLAG AS FRAUD
If P(fraud) ≤ 0.5 → APPROVE TRANSACTION
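The pipeline above can be traced by hand with a small sigmoid sketch; the weights, intercept, and feature values below are purely illustrative:

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (made-up) coefficients, intercept, and feature vector
w = np.array([0.8, -0.3, 1.2])
b = -2.0
x = np.array([1.5, 0.2, 1.0])

z = w @ x + b          # linear combination (weighted sum)
p_fraud = sigmoid(z)   # probability of fraud

decision = "FLAG AS FRAUD" if p_fraud > 0.5 else "APPROVE TRANSACTION"
print(round(p_fraud, 3), decision)
```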
`class_weight='balanced'`
- Normally: 99.9% genuine, 0.1% fraud
- With balanced weights: both classes carry equal total weight
- Fraud misclassification penalty increased
- Catches more frauds (higher recall)
```python
df.isnull().sum()
df = df.dropna()
```
- Found a negligible number of missing values
- Removed the affected rows
- No imputation bias introduced
```python
df['log_amount'] = np.log1p(df['amount'])
```
- Problem: `amount` is right-skewed (a few very large transactions)
- Solution: log transformation
- Result: values compressed to a manageable scale
- Benefit: model trains faster and more stably
```python
df[['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']].corr()
```
- Balance variables show correlation
- Decision: retain all (they capture different aspects)
- `oldbalanceOrg` & `newbalanceOrig` → money left the sender's account
- `oldbalanceDest` & `newbalanceDest` → money reached the receiver
- Each provides a distinct fraud signal
```python
df = df.drop(columns=['nameOrig', 'nameDest'])
```
- Removed: customer identifiers
- Reason: privacy concerns, and not predictive of fraud
```python
df = pd.get_dummies(df, columns=['type'], drop_first=True)
```
type: ['CASH_IN', 'CASH_OUT', 'DEBIT', 'PAYMENT', 'TRANSFER']
        ↓
type_CASH_OUT: 1/0
type_DEBIT: 1/0
type_PAYMENT: 1/0
type_TRANSFER: 1/0
(type_CASH_IN is dropped as redundant: `drop_first=True` removes the alphabetically first category)
| Metric | Definition | Why Important |
|---|---|---|
| Precision | Of predicted frauds, how many are actual? | Minimize false alarms |
| Recall | Of actual frauds, how many detected? | PRIORITY - Catch all frauds |
| F1-Score | Harmonic mean of precision & recall | Balanced measure |
| ROC-AUC | True positive rate vs false positive rate | Overall discrimination ability |
Normal scenario: 99.9% genuine, 0.1% fraud
Without handling: Model predicts "all genuine" → 99.9% accuracy (useless!)
Solution: class_weight='balanced' → Treat fraud as 1000x more important
Result: Model learns fraud patterns despite low representation
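The weighting described above can be verified with scikit-learn's `compute_class_weight`, which uses the formula `n_samples / (n_classes * n_class_samples)`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 999 genuine vs 1 fraud per 1000 — the imbalance described above
y = np.array([0] * 999 + [1] * 1)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)                   # genuine ~0.5, fraud = 500
print(weights[1] / weights[0])   # fraud errors weighted ~999x more heavily
```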
HIGH RECALL:
└─ Catch 99% of frauds
└─ Some false positives (genuine transactions flagged)
└─ Cost: Customer inconvenience (manageable)
LOW RECALL:
└─ Miss frauds
└─ Fraudster steals money (major loss)
└─ Cost: Financial loss + reputation damage
Decision: Prioritize RECALL > Precision
1. Transaction Type: TRANSFER
   - High-risk category
   - Fraudsters quickly move money
   - Direct transfer suggests theft
2. Transaction Type: CASH_OUT
   - Withdrawing to untraceable cash
   - Quick conversion to cash
   - Common fraud pattern
3. Balance Changes
   - Sudden large balance drop (money leaving)
   - Receiver getting an unexpected large sum
   - Abnormal financial behavior
4. Amount (log_amount)
   - Larger transactions → higher risk
   - But log-transformed (non-linear effect)
   - Extreme values flagged
Fraudster Behavior Pattern:
1. Compromise account
↓
2. Quickly transfer/cash-out money
↓
3. Use high-risk transaction types
↓
4. Leave obvious balance changes
↓
5. Model detects these patterns
↓
6. Transaction flagged
incoming_transaction
↓
[Extract features: amount, type, balances]
↓
[Scale features]
↓
[Pass to ML model]
↓
[Get fraud probability]
↓
If P(fraud) > 0.5:
├─ Flag transaction
├─ Alert fraud team
├─ Request extra verification
└─ Can block if high confidence
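The scoring step in the flow above can be sketched as a small function; the stand-in model, feature values, and the `score_transaction` name are all illustrative (a real deployment would load a model fitted on PaySim features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data, where feature 0 drives fraud
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 2).astype(int)
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

def score_transaction(features, threshold=0.5):
    """Return (fraud_probability, action) for one incoming transaction."""
    p = model.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]
    action = "FLAG" if p > threshold else "APPROVE"
    return p, action

p, action = score_transaction([3.0, 0.0, 0.0])
print(p, action)
```

In production, the threshold would be tuned on the recall/false-positive trade-off rather than fixed at 0.5.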
1. Velocity Checks
   - If a customer makes 5 transfers in < 1 hour → suspicious (a normal user makes 1-2 per day) → request verification
2. Multi-Factor Authentication (MFA)
   - Large or unusual transactions → require OTP / biometric / security question → verify customer identity
3. Threshold-Based Alerts
   - Transaction amount > customer's monthly average → flag for review → monitor for patterns
4. Geographic Checks
   - Transaction from an unusual location → compare with customer history → verify travel plans
5. Balance Anomaly Detection
   - Sudden large balance drop that doesn't match customer behavior → request confirmation
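The velocity check described above can be sketched as a simple rule; the one-hour window and five-transfer limit come from the text, while the function name and data are illustrative:

```python
from datetime import datetime, timedelta

def velocity_flag(transfer_times, window=timedelta(hours=1), limit=5):
    """Flag if `limit` or more transfers fall inside any `window`-long span."""
    times = sorted(transfer_times)
    for i in range(len(times)):
        # Count transfers within `window` of times[i]
        j = i
        while j < len(times) and times[j] - times[i] <= window:
            j += 1
        if j - i >= limit:
            return True
    return False

base = datetime(2024, 1, 1, 12, 0)
burst = [base + timedelta(minutes=5 * k) for k in range(5)]   # 5 transfers in 20 min
normal = [base + timedelta(hours=6 * k) for k in range(3)]    # spread over the day

print(velocity_flag(burst))   # True  -> request verification
print(velocity_flag(normal))  # False
```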
1. FRAUD DETECTION RATE
└─ % frauds caught / total frauds
└─ Target: > 95%
2. FALSE POSITIVE RATE
└─ % legitimate transactions blocked / total legitimate
└─ Target: < 1% (customer experience)
3. FRAUD LOSS RATE
└─ $ lost to fraud / total $ transacted
└─ Target: < 0.01% (industry benchmark)
4. CUSTOMER SATISFACTION
└─ Complaints about false blocks
└─ Churn due to friction
└─ Track and minimize
A/B test:
├─ 50% of users: current rules
└─ 50% of users: new ML model
Measure improvements:
├─ Which catches more fraud? ✓
├─ Which has fewer false positives? ✓
└─ Deploy the winner
Week 1: Deploy model (baseline)
↓
Week 2-4: Monitor fraud rate, false positives
↓
Week 5: Analyze misclassifications
↓
Week 6: Retrain with recent data
↓
Week 7: Adjust threshold if needed
↓
Week 8: Compare with Week 1 (improvement?)
↓
Continue cycle...
Fraud_Transaction_Detection.ipynb
├── Cell 1-3: Project header & objective
├── Cell 4: Import libraries
├── Cell 5: Load dataset
├── Cell 6-8: Exploratory analysis
├── Cell 9-12: Missing values handling
├── Cell 13-15: Outlier handling (log transformation)
├── Cell 16-18: Multicollinearity analysis
├── Cell 19-20: Feature selection & engineering
├── Cell 21: Train-test split
├── Cell 22-24: Model building & training
├── Cell 25-26: Predictions
├── Cell 27-28: Model evaluation (Classification report, ROC-AUC)
├── Cell 29-30: Feature importance
└── Cell 31-32: Business insights
Python 3.7+
Jupyter Notebook or Google Colab
```shell
pip install pandas numpy scikit-learn matplotlib seaborn
```

```python
# Colab comes with all of these libraries pre-installed.
# Just upload the CSV file or mount Google Drive:
from google.colab import files
files.upload()  # Upload PS_20174392719_1491204439457_log.csv
```

- Place the CSV file in the working directory
- Or provide the correct file path
- Check missing values
- Apply log transformation
- Analyze correlations
- One-hot encode transaction types
- Create final feature matrix
- Split data (80-20)
- Train Logistic Regression
- Monitor training progress
- Get predictions on test set
- Print classification report
- Calculate ROC-AUC
- Feature importance ranking
- Identify fraud patterns
- Generate business recommendations
Precision (Fraud Class): 0.70-0.85
Recall (Fraud Class): 0.90-0.98 ← PRIORITY METRIC
F1-Score (Fraud Class): 0.78-0.90
ROC-AUC Score: 0.95-0.98
TRANSFER transactions: Highest fraud risk (2-3x normal)
CASH-OUT transactions: Second highest risk
Large amounts: More fraudulent patterns
Balance anomalies: Key fraud indicators
1. Try Alternative Algorithms
   ```python
   from sklearn.ensemble import RandomForestClassifier
   from xgboost import XGBClassifier
   ```
   - Random Forest: better for non-linear patterns
   - XGBoost: strong gradient-boosting performance on tabular data
2. Handle Class Imbalance
   ```python
   from imblearn.over_sampling import SMOTE
   ```
   - Generate synthetic fraud samples for training
   - Better minority-class learning
3. Hyperparameter Tuning
   ```python
   from sklearn.model_selection import GridSearchCV
   ```
   - Find the best parameters
   - Cross-validation
4. Ensemble Methods
   - Combine multiple models
   - Voting ensemble
   - Stacking
5. Feature Engineering
   - Create interaction features
   - Domain-specific features
   - Time-based features
- Scikit-learn: https://scikit-learn.org/
- Pandas Documentation: https://pandas.pydata.org/docs
- Imbalanced Data Handling: https://imbalanced-learn.org/
Real-world fraud rate: 0.1% (1 fraud per 1000 transactions)
Naive model: "Always predict genuine"
Result: 99.9% accuracy but catches ZERO frauds (useless!)
Solution: Use class_weight='balanced'
└─ Fraud misclassification = 1000x penalty
└─ Model forced to learn fraud patterns
└─ Better recall (catches more frauds)
Before: [0.01, 100, 500000, 1000000, 2000000]
Range: 2,000,000x spread (huge outliers)
After log: [0.01, 4.6, 13.1, 13.8, 14.5]
Range: 1000x spread (more manageable)
Benefit: Model trains faster, more stable weights
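The before/after values above can be reproduced directly with `np.log1p`:

```python
import numpy as np

amounts = np.array([0.01, 100, 500000, 1000000, 2000000])
logged = np.log1p(amounts)  # log(1 + x): safe at zero, compresses large values
print(np.round(logged, 1))  # ≈ [0.0, 4.6, 13.1, 13.8, 14.5]
```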
Categorical: ['TRANSFER', 'PAYMENT', 'CASH_OUT']
        ↓
type_TRANSFER: [1, 0, 0]
type_PAYMENT: [0, 1, 0]
type_CASH_OUT: [0, 0, 1]
Benefit: ML models require numerical input
Feel free to:
- Fork and modify
- Improve the model
- Add new features
- Share insights
This project uses the PaySim public dataset. For more information, refer to their documentation.
For questions about:
- Dataset: Refer to PaySim documentation
- Logistic Regression: Check scikit-learn docs
- Class Imbalance: See imbalanced-learn library
- Fraud Detection Best Practices: See Fraud-Detection-Handbook
This project demonstrates:
- Handling highly imbalanced data
- Feature engineering and cleaning
- Logistic regression for fraud detection
- Interpretable machine learning
- Real-world application design
- Business impact metrics
Status: Complete