# Anomaly Detection Modeling

This notebook demonstrates how to use the rule-based, statistical, and ensemble anomaly detection models in our fraud detection system.

In [1]:
import sys
import os
sys.path.append('/Users/m1pro/Documents/GitHub/fraud_detection_system') # Adjust the path as necessary

# Import necessary libraries
import pandas as pd
import numpy as np
from src.models.rule_based import RuleBasedAnomalyDetector
from src.models.statistical import StatisticalAnomalyDetector
from src.models.ensemble import EnsembleAnomalyDetector
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


## Rule-Based Anomaly Detection

We start with the rule-based anomaly detector, which applies business rules and heuristics to flag unusual transactions.

In [2]:
# Read the feature_store.csv as input data
feature_store_path = "../results/feature_store.csv"
feature_df = pd.read_csv(feature_store_path)

# Display the first few rows to verify
print("Loaded feature_store.csv:")
display(feature_df.head())


Loaded feature_store.csv:


Unnamed: 0,raw_log,timestamp,user_id,transaction_type,amount,currency,location,device,is_parsed,parse_errors,...,type_rarity,location_device,location_device_frequency,location_device_rarity,hour_type,hour_type_frequency,hour_type_rarity,amount_percentile,amount_deviation_location,amount_deviation_type
0,2025-06-01 12:03:31 - user=user1000 - action=c...,2025-06-01 12:03:31,user1000,cashout,2235.91,$,London,Samsung Galaxy S10,True,,...,0.000984,London_Samsung Galaxy S10,124,0.008,12_cashout,37,0.026316,0.49434,32.055041,122.37464
1,01/06/2025 19:19:50 ::: user1000 *** DEBIT :::...,2025-06-01 19:19:50,user1000,debit,1267.67,£,Manchester,Xiaomi Mi 11,True,,...,0.000992,Manchester_Xiaomi Mi 11,136,0.007299,19_debit,41,0.02381,0.280293,1024.647175,1053.774806
2,2025-06-02 19:52:44 | user: user1000 | txn: re...,2025-06-02 19:52:44,user1000,refund,2708.01,$,Cardiff,Huawei P30,True,,...,0.001104,Cardiff_Huawei P30,113,0.008772,19_refund,38,0.025641,0.601106,399.260344,288.250044
3,2025-06-03 10:11:53 - user=user1000 - action=c...,2025-06-03 10:11:53,user1000,cashout,4659.06,£,Birmingham,Nokia 3310,True,,...,0.000984,Birmingham_Nokia 3310,132,0.007519,10_cashout,45,0.021739,0.961796,2371.998033,2300.77536
4,2025-06-03 21:23:30 | user: user1000 | txn: ca...,2025-06-03 21:23:30,user1000,cashout,4063.97,£,Liverpool,,True,,...,0.000984,Liverpool_nan,138,0.007194,21_cashout,45,0.021739,0.871495,1770.661188,1705.68536


### How the RuleBasedAnomalyDetector Works
 
The `RuleBasedAnomalyDetector` uses a set of predefined business rules and heuristics to identify potentially fraudulent transactions. Each rule is designed to capture a specific type of suspicious behavior, such as:
 - Large or unusual transaction amounts
 - Transactions at odd hours (e.g., late night)
 - High frequency of transactions in a short time
 - Rare or new locations/devices
 - Unusual transaction types for a user
 
**Here's how it works:**
 
 - **Initialization:** When the detector is created, it loads a set of rules, each with its own logic, parameters, and weight (importance).
 - **Rule Evaluation:** For each transaction, every rule is applied. If a rule is triggered (i.e., the transaction matches the suspicious pattern), it contributes to the transaction's anomaly score.
 - **Scoring:** The anomaly score for a transaction is a weighted sum of the triggered rules. The higher the score, the more likely the transaction is considered anomalous or fraudulent.
 - **Thresholding:** Transactions with scores above a certain threshold are flagged as potential fraud.
 
This approach is transparent and interpretable: because the score is built from named rules, it is easy to see why a transaction was flagged. It is especially useful for incorporating domain knowledge and business logic into the fraud detection process.
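
The evaluate, score, and threshold steps described above can be sketched in a few lines. This is a minimal illustration, not the detector's actual implementation: the rule names, thresholds, weights, the `hour` column, and the normalization by total weight are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical rules: (name, predicate over one transaction, weight).
# Names, thresholds, and weights are illustrative, not the detector's real configuration.
RULES = [
    ("large_amount", lambda tx: tx["amount"] > 4000, 3.0),
    ("odd_hour", lambda tx: tx["hour"] < 6, 2.0),
    ("rare_location_device", lambda tx: tx["location_device_rarity"] > 0.05, 1.0),
]
TOTAL_WEIGHT = sum(weight for _, _, weight in RULES)


def score_transaction(tx):
    """Return (score, triggered rule names) for one transaction.

    The score is the weighted sum of triggered rules, divided by the total
    weight so that it falls in [0, 1] (the normalization is an assumption).
    """
    triggered = [(name, weight) for name, rule, weight in RULES if rule(tx)]
    score = sum(weight for _, weight in triggered) / TOTAL_WEIGHT
    return score, [name for name, _ in triggered]


# Small synthetic sample standing in for feature_df
transactions = pd.DataFrame({
    "amount": [120.0, 4659.06, 5000.0],
    "hour": [14, 10, 3],
    "location_device_rarity": [0.008, 0.0075, 0.2],
})

results = [score_transaction(tx) for _, tx in transactions.iterrows()]
transactions["anomaly_score"] = [score for score, _ in results]
transactions["triggered_rules"] = [names for _, names in results]

THRESHOLD = 0.4  # assumed cut-off; scores above it are flagged
transactions["flagged"] = transactions["anomaly_score"] > THRESHOLD
print(transactions[["amount", "anomaly_score", "triggered_rules", "flagged"]])
```

Keeping the list of triggered rule names next to each score is what makes this kind of detector easy to audit: a reviewer can see which patterns fired, not just the total.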





In [3]:
# Initialize and run the rule-based detector
rule_detector = RuleBasedAnomalyDetector()

# Run anomaly detection
rule_scores, rule_details = rule_detector.detect_anomalies(feature_df)

# Display rule-based results
print('Rule-Based Anomaly Scores:', rule_scores)




INFO:src.models.rule_based:Initialized 10 fraud detection rules
INFO:src.models.rule_based:Rule-based anomaly detector initialized
INFO:src.models.rule_based:Running rule-based anomaly detection on 7774 transactions...
INFO:src.models.rule_based:Rule-based detection complete. Mean anomaly score: 0.1821


Rule-Based Anomaly Scores: 0       0.142857
1       0.142857
2       0.142857
3       0.314286
4       0.142857
          ...   
7769    0.142857
7770    0.142857
7771    0.142857
7772    0.114286
7773    0.142857
Name: amount, Length: 7774, dtype: float64


In [4]:
# Display the top 10 rule-based anomaly scores

# Use the user_id column if present; otherwise fall back to the first column
user_col = 'user_id' if 'user_id' in feature_df.columns else feature_df.columns[0]

# Pair each transaction's user with its anomaly score
top_n = 10
rule_scores_df = pd.DataFrame({
    'user_id': feature_df[user_col],
    'anomaly_score': rule_scores
})

# Sort by anomaly score descending and display the top N
top_rule_scores = rule_scores_df.sort_values('anomaly_score', ascending=False).head(top_n)
print(f"Top {top_n} Rule-Based Anomaly Scores:")
print(top_rule_scores.reset_index(drop=True))


Top 10 Rule-Based Anomaly Scores:
    user_id  anomaly_score
0  user1097       0.542857
1  user1097       0.542857
2  user1020       0.514286
3  user1003       0.514286
4  user1098       0.514286
5  user1081       0.514286
6  user1021       0.457143
7  user1065       0.457143
8  user1031       0.457143
9  user1037       0.457143


## Statistical Anomaly Detection

Next, we use statistical models to detect anomalies.


The `StatisticalAnomalyDetector` uses statistical machine learning models to identify unusual transactions based on patterns in the data. 
The run below produces score sets for Isolation Forest, One-Class SVM, Local Outlier Factor, DBSCAN, and HDBSCAN. Two of these work as follows:
 
- **Isolation Forest** works by randomly partitioning the data and isolating observations. Anomalies are more easily isolated and thus receive higher anomaly scores.
- **DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and labels points that lie alone in low-density regions as anomalies.
 
The detector is first fitted to the transaction feature data to learn the normal patterns. It then returns one set of anomaly scores per model, with one score per transaction; higher scores indicate a greater likelihood that the transaction is anomalous or fraudulent.
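
To make the two algorithms concrete, here is a small sketch that uses scikit-learn directly on synthetic data. It is not the internals of `StatisticalAnomalyDetector`: the synthetic data, the `eps` and `min_samples` values, and the min-max scaling of the scores are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 200 typical points plus one obvious outlier
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), [[10.0, 10.0]]])
X_scaled = StandardScaler().fit_transform(X)

# Isolation Forest: score_samples returns lower values for more abnormal points,
# so negate it and min-max scale to get "higher = more anomalous" in [0, 1].
iso = IsolationForest(n_estimators=100, random_state=0).fit(X_scaled)
raw = -iso.score_samples(X_scaled)
iso_scores = (raw - raw.min()) / (raw.max() - raw.min())

# DBSCAN: points labelled -1 belong to no dense cluster and are treated as noise.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
dbscan_scores = (labels == -1).astype(float)

print("Most anomalous index (Isolation Forest):", int(np.argmax(iso_scores)))
print("Points labelled noise by DBSCAN:", int(dbscan_scores.sum()))
```

Note the difference in output type: Isolation Forest yields a continuous score that can be ranked, while DBSCAN in this sketch yields only a binary noise flag, so it is sensitive to the choice of `eps` relative to the feature scale.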


In [5]:
# Initialize and fit the statistical detector
statistical_detector = StatisticalAnomalyDetector()
statistical_detector.fit(feature_df)

# Predict anomalies
stat_scores = statistical_detector.predict_anomalies(feature_df)

# Display statistical results
print('Statistical Anomaly Scores:', stat_scores)

INFO:src.models.statistical:Initialized 5 statistical models
INFO:src.models.statistical:Statistical anomaly detector initialized
INFO:src.models.statistical:Fitting statistical models on 7774 samples...
INFO:src.models.statistical:Preparing features for statistical models...
INFO:src.models.statistical:Prepared 53 features for 7774 samples
INFO:src.models.statistical:✅ Isolation Forest fitted
INFO:src.models.statistical:✅ One-Class SVM fitted
INFO:src.models.statistical:✅ Local Outlier Factor fitted
INFO:src.models.statistical:Statistical models fitting complete. 3 models fitted.
INFO:src.models.statistical:Predicting anomalies for 7774 samples...
INFO:src.models.statistical:Preparing features for statistical models...
INFO:src.models.statistical:Prepared 53 features for 7774 samples
INFO:src.models.statistical:Statistical anomaly detection complete. Generated 5 score sets.


Statistical Anomaly Scores: {'isolation_forest': array([0.12992247, 0.12444061, 0.23723035, ..., 0.36334529, 0.6232587 ,
       0.49936178]), 'one_class_svm': array([0.2920501 , 0.18405193, 0.32209424, ..., 0.18803552, 0.40121102,
       0.32655532]), 'lof': array([0.20341514, 0.15641548, 0.16825087, ..., 0.06699607, 0.25230028,
       0.13222091]), 'dbscan': array([1., 1., 1., ..., 1., 1., 1.]), 'hdbscan': array([0.13194602, 0.0223217 , 0.19747059, ..., 0.04932377, 0.0623968 ,
       0.28943835])}
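
As printed above, `predict_anomalies` appears to return a dictionary that maps each model name to an array of per-transaction scores. A dictionary of that shape is easier to inspect as a table. The sketch below uses a small synthetic dictionary and made-up user IDs rather than the real output, and the unweighted mean across models is an assumption for ranking purposes, not necessarily how the project combines them.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same shape as the printed output:
# a dict mapping model name -> array of per-transaction scores.
stat_scores_demo = {
    "isolation_forest": np.array([0.13, 0.12, 0.62, 0.50]),
    "one_class_svm": np.array([0.29, 0.18, 0.40, 0.33]),
    "lof": np.array([0.20, 0.16, 0.25, 0.13]),
}
user_ids = pd.Series(["user1000", "user1000", "user1001", "user1002"], name="user_id")

# One column per model, one row per transaction
scores_df = pd.DataFrame(stat_scores_demo)
scores_df.insert(0, "user_id", user_ids)

# Simple unweighted mean across models (an assumption for this sketch)
scores_df["mean_score"] = scores_df[list(stat_scores_demo)].mean(axis=1)

# Show the most anomalous transactions first
top = scores_df.sort_values("mean_score", ascending=False).head(2)
print(top)
```

A per-model table also makes degenerate score sets easy to spot. The six `dbscan` values visible in the output above are all 1.0; if that holds for every sample, that score set carries no ranking information and the clustering parameters may be worth revisiting.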


<!-- # 
# ### Interpreting StatisticalAnomalyDetector Output
# 
# The output from the `StatisticalAnomalyDetector` is typically a set of anomaly scores—one for each transaction. These scores quantify how unusual or "anomalous" each transaction appears based on the statistical models used (e.g., Isolation Forest, One-Class SVM, Local Outlier Factor).
# 
# **How to interpret the scores:**
# - **Higher scores** indicate transactions that are more likely to be anomalous or fraudulent.
# - **Lower scores** suggest transactions that are more typical or expected.
# 
# The exact range and meaning of the scores can depend on the underlying model:
# - For some models, scores may be negative (e.g., Isolation Forest), with lower values indicating higher anomaly.
# - For others, scores may be probabilities or distances from the "normal" cluster.
# 
# **Making sense of the output:**
# - You can sort transactions by their anomaly score to identify the most suspicious ones.
# - Investigate the top-scoring transactions to look for patterns or commonalities.
# - Use a threshold (e.g., top 1% of scores) to flag transactions for further review.
# 
# **Example:**
# ```python
# # Combine one model's scores with user IDs for inspection
# # (stat_scores is a dict keyed by model name)
# stat_scores_df = pd.DataFrame({
#     'user_id': feature_df['user_id'],
#     'anomaly_score': stat_scores['isolation_forest']
# })
# 
# # Show the top 10 most anomalous transactions
# print(stat_scores_df.sort_values('anomaly_score', ascending=False).head(10))
# ```
# 
# By analyzing these results, you can prioritize which transactions to investigate for potential fraud.
 -->


## Ensemble Anomaly Detection

Finally, we run the ensemble anomaly detector, which combines the rule-based and statistical approaches into a single score per transaction.
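
One plausible way to combine the two families of scores is a weighted average followed by bucketing into risk levels. The sketch below is an assumption about the general technique, not the `EnsembleAnomalyDetector` code: the weights and sample scores are invented, and only the band boundaries (0.3, 0.5, 0.7, 0.8) are taken from the report that the ensemble prints. How the real detector treats values exactly on a boundary is not shown in the report, so the left-closed intervals here are also an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic per-transaction scores standing in for the detectors' outputs
rule_scores_demo = np.array([0.14, 0.46, 0.54, 0.90])
stat_scores_demo = {
    "isolation_forest": np.array([0.13, 0.36, 0.80, 0.95]),
    "lof": np.array([0.20, 0.30, 0.70, 0.85]),
}

# Assumed weights for the rule-based and statistical components
RULE_WEIGHT, STAT_WEIGHT = 0.4, 0.6

# Average the statistical models, then blend with the rule-based score
stat_mean = np.mean(list(stat_scores_demo.values()), axis=0)
ensemble = RULE_WEIGHT * rule_scores_demo + STAT_WEIGHT * stat_mean

# Bucket scores into the risk bands used in the ensemble report
risk = pd.cut(
    ensemble,
    bins=[-np.inf, 0.3, 0.5, 0.7, 0.8, np.inf],
    labels=["NORMAL", "LOW", "MEDIUM", "HIGH", "CRITICAL"],
    right=False,  # intervals are [low, high); boundary handling is assumed
)

print(pd.DataFrame({"ensemble_score": ensemble, "risk_level": risk}))
```

Averaging the statistical models before blending keeps any single model from dominating; a different design could instead weight each model individually.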

In [6]:
# Initialize and fit the ensemble detector
ensemble_detector = EnsembleAnomalyDetector()
ensemble_detector.fit(feature_df)

# Detect anomalies using the ensemble
ensemble_scores, ensemble_results = ensemble_detector.detect_anomalies(feature_df)

# Display ensemble results
# print('Ensemble Anomaly Scores:', ensemble_scores)
# print('Ensemble Results:', ensemble_results)

INFO:src.models.rule_based:Initialized 10 fraud detection rules
INFO:src.models.rule_based:Rule-based anomaly detector initialized
INFO:src.models.statistical:Initialized 5 statistical models
INFO:src.models.statistical:Statistical anomaly detector initialized
INFO:src.models.ensemble:Ensemble anomaly detector initialized
INFO:src.models.ensemble:Fitting ensemble detector on 7774 samples...
INFO:src.models.statistical:Fitting statistical models on 7774 samples...
INFO:src.models.statistical:Preparing features for statistical models...
INFO:src.models.statistical:Prepared 53 features for 7774 samples
INFO:src.models.statistical:✅ Isolation Forest fitted
INFO:src.models.statistical:✅ One-Class SVM fitted
INFO:src.models.statistical:✅ Local Outlier Factor fitted
INFO:src.models.statistical:Statistical models fitting complete. 3 models fitted.
INFO:src.models.ensemble:✅ Rule-based detector ready
INFO:src.models.ensemble:✅ Ensemble detector fitting complete
INFO:src.models.ensemble:Running 


🚨 ENSEMBLE FRAUD DETECTION SYSTEM - COMPREHENSIVE REPORT

📊 OVERALL STATISTICS
--------------------------------------------------
📈 Total Transactions Analyzed: 7,774
🎯 Mean Ensemble Score: 0.3727
📏 Score Standard Deviation: 0.0635
🔝 Highest Score: 0.7221
🔻 Lowest Score: 0.2481

⚠️ RISK LEVEL BREAKDOWN
--------------------------------------------------
🔥 CRITICAL RISK (>0.8): 0 transactions (0.0%)
⚠️ HIGH RISK (0.7-0.8): 1 transactions (0.0%)
⚡ MEDIUM RISK (0.5-0.7): 303 transactions (3.9%)
📊 LOW RISK (0.3-0.5): 6,635 transactions (85.3%)
✅ NORMAL (<0.3): 835 transactions (10.7%)

👥 USERS BY RISK CATEGORY
--------------------------------------------------

⚠️ HIGH RISK USERS (1 unique users):
   • user1026: Score 0.7221 (Max: 0.7221), 1 tx, £654 total, Unknown

⚡ MEDIUM RISK USERS (79 unique users):
   • user1080: Score 0.5964 (Max: 0.5964), 1 tx, £4759 total, Liverpool
   • user1054: Score 0.5767 (Max: 0.5767), 1 tx, £4866 total, Unknown
   • user1015: Score 0.5758 (Max: 0.6438), 7 t

## Conclusion

This notebook walked through rule-based, statistical, and ensemble anomaly detection on the transaction feature store. Of the 7,774 transactions analyzed, the ensemble placed 1 in the high-risk band and 303 in the medium-risk band, with none reaching the critical band.