# 🚦 Real-Time Fraud Detection System – Exercise Notebook

## **Notebook Overview**

***

- Generate synthetic transaction data in Databricks using Python
- Ingest streaming data with Databricks Structured Streaming
- Apply ETL: cleansing, enrichment, deduplication, anomaly tagging
- Implement rule-based fraud detection logic (NOT machine learning)
- Store data through Bronze → Silver → Gold tables
- Add checkpointing, monitoring, and fault tolerance

***

## 1. Setup & Initialization

```python
# Import needed libraries
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Define storage and checkpoint paths
bronze_path = "dbfs:/mnt/fraud/bronze"
silver_path = "dbfs:/mnt/fraud/silver"
gold_path = "dbfs:/mnt/fraud/gold"
checkpoint_path = "dbfs:/mnt/fraud/checkpoints"
```


***

## 2. Synthetic Data Generation (Python – Databricks Notebook Cell)

```python
import random
from datetime import datetime

def generate_transaction():
    return {
        "transaction_id": f"TX{random.randint(1000000,9999999)}",
        "timestamp": datetime.now().isoformat(),
        "customer_id": f"C{random.randint(1000,9999)}",
        "amount": round(random.uniform(5,10000),2),
        "merchant_id": f"M{random.randint(10,999)}",
        "country": random.choice(["IN", "US", "UK", "CA", "DE", "JP", "BR"]),
        "channel": random.choice(["ecommerce", "offline", "mobile"]),
        "payment_method": random.choice(["credit_card", "debit_card", "upi", "wallet"]),
        "is_international": random.choice([True, False])
    }

# Generate test data
transactions = [generate_transaction() for _ in range(1000)]

# Convert to Spark DataFrame
transaction_schema = StructType([
    StructField("transaction_id", StringType()),
    StructField("timestamp", StringType()),
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("merchant_id", StringType()),
    StructField("country", StringType()),
    StructField("channel", StringType()),
    StructField("payment_method", StringType()),
    StructField("is_international", BooleanType())
])

df = spark.createDataFrame(transactions, schema=transaction_schema)

# Write as a single batch to a folder emulating streaming source
df.write.mode("overwrite").json(bronze_path + "/raw")
```
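
The deduplication step in Section 4 only has work to do if the source contains repeated `transaction_id` values, which the generator above rarely produces. A minimal sketch, assuming a hypothetical helper `with_duplicates` (not part of the original notebook, with illustrative `duplicate_fraction` and `seed` parameters), that re-emits a fraction of the records before they are converted to a DataFrame:

```python
import random

def with_duplicates(records, duplicate_fraction=0.05, seed=42):
    """Return a shuffled copy of `records` with some entries repeated.

    Hypothetical helper: `duplicate_fraction` and `seed` are illustrative
    parameters, not part of the original notebook.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    n_dupes = int(len(records) * duplicate_fraction)
    # Pick records to repeat; copy each dict so later edits do not alias.
    dupes = [dict(r) for r in rng.sample(records, n_dupes)]
    combined = list(records) + dupes
    rng.shuffle(combined)
    return combined

# Tiny stand-in records; in the notebook, pass the `transactions` list instead.
sample = [{"transaction_id": f"TX{i}", "amount": 10.0 * (i + 1)} for i in range(100)]
noisy = with_duplicates(sample, duplicate_fraction=0.1)
print(len(noisy), len({r["transaction_id"] for r in noisy}))
```

In the notebook this would presumably be applied as `transactions = with_duplicates(transactions)` before `spark.createDataFrame`.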


***

## 3. Real-Time Ingestion – Structured Streaming

```python
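# Read from the simulated real-time source (bronze/raw) with Auto Loader.
# An explicit schema keeps amount and is_international typed; Auto Loader's
# JSON schema inference would otherwise read every column as a string.
stream_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint_path + "/schema")
    .schema(transaction_schema)
    .load(bronze_path + "/raw"))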

# Store raw events to Bronze table (append)
bronze_query = (stream_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path + "/bronze")
    .outputMode("append")
    .toTable("fraud_bronze"))
```
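
Auto Loader processes new files as they land in the source folder, so one way to watch the stream advance incrementally is to write several small JSON Lines files instead of a single batch. The sketch below assumes a hypothetical helper `write_in_chunks` and writes to a temporary local directory; how the landing folder is reached through local file APIs on Databricks depends on the workspace setup and is left as an assumption.

```python
import json
import tempfile
from pathlib import Path

def write_in_chunks(records, target_dir, chunk_size=100):
    """Write `records` as JSON Lines files of at most `chunk_size` rows each.

    Hypothetical helper for illustration; returns the list of files written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        path = target / f"part-{start // chunk_size:05d}.json"
        with path.open("w", encoding="utf-8") as fh:
            for record in chunk:
                fh.write(json.dumps(record) + "\n")  # one JSON object per line
        written.append(path)
    return written

# Demo against a temporary local directory with stand-in records.
demo_records = [{"transaction_id": f"TX{i}", "amount": float(i)} for i in range(250)]
demo_dir = tempfile.mkdtemp()
files = write_in_chunks(demo_records, demo_dir, chunk_size=100)
print([f.name for f in files])
```

Writing the chunks a few seconds apart while the Bronze query is running should show up as separate micro-batches in the query progress.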


***

## 4. ETL – Cleansing, Enrichment, Deduplication

```python
from pyspark.sql import functions as F

bronze_stream = spark.readStream.table("fraud_bronze")

# Cleansing: filter out incomplete rows
clean_df = bronze_stream.filter(
    "transaction_id IS NOT NULL and customer_id IS NOT NULL and amount > 0"
)

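# Deduplication: withWatermark needs a timestamp-typed column, so parse the
# ISO-8601 string first
dedup_df = (clean_df
    .withColumn("timestamp", F.to_timestamp("timestamp"))
    .withWatermark("timestamp", "10 seconds")
    .dropDuplicates(["transaction_id"]))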

# Enrichment: flag high-value transactions and those made from 22:00 onward
enriched_df = (dedup_df
    .withColumn("high_value", F.col("amount") > 5000)
    .withColumn("is_night", F.hour(F.to_timestamp("timestamp")) >= 22)
)

# Write to Silver table
silver_query = (enriched_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path + "/silver")
    .outputMode("append")
    .toTable("fraud_silver"))
```


***

## 5. Rule-Based Fraud Detection Logic (No ML)

```python
silver_stream = spark.readStream.table("fraud_silver")

# Add fraud rules column (simple, production-style logic)
fraud_rules_df = (silver_stream
    .withColumn("fraud_suspect", 
       (
           (F.col("is_international") & F.col("high_value")) |               # high-value & international
           ((F.col("channel") == "ecommerce") & (F.col("amount") > 8000)) |  # big ecommerce transactions
           (F.col("is_night") & (F.col("amount") > 2000))                    # night time, high amount
       )
    )
    .withColumn("fraud_reason",
       F.when((F.col("is_international") & F.col("high_value")), "International High Value")
        .when((F.col("channel") == "ecommerce") & (F.col("amount") > 8000), "Ecommerce Large Payment")
        .when((F.col("is_night") & (F.col("amount") > 2000)), "Night-Time Large Transaction")
        .otherwise(None)
    )
)

# Write to Gold table
gold_query = (fraud_rules_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path + "/gold")
    .outputMode("append")
    .toTable("fraud_gold"))
```
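
Because the `fraud_reason` column is built from a chained `when`, a transaction that matches several rules is labelled with the first match only. A plain-Python reference of the same rules can make that ordering easy to check without a running stream. This is a sketch, assuming the field names from Section 2 and made-up sample transactions; the function `fraud_reason` here is a hypothetical test helper, not part of the pipeline.

```python
from datetime import datetime

def fraud_reason(tx):
    """Return the first matching rule name for a transaction dict, else None.

    Mirrors the rule order of the `fraud_reason` column; `tx` uses the field
    names from Section 2, with `timestamp` as an ISO-8601 string.
    """
    amount = tx["amount"]
    high_value = amount > 5000
    is_night = datetime.fromisoformat(tx["timestamp"]).hour >= 22
    if tx["is_international"] and high_value:
        return "International High Value"
    if tx["channel"] == "ecommerce" and amount > 8000:
        return "Ecommerce Large Payment"
    if is_night and amount > 2000:
        return "Night-Time Large Transaction"
    return None

# Made-up transactions: one per rule, one clean, one matching two rules.
examples = [
    {"amount": 6000.0, "channel": "offline", "is_international": True,
     "timestamp": "2024-01-01T12:00:00"},
    {"amount": 9000.0, "channel": "ecommerce", "is_international": False,
     "timestamp": "2024-01-01T12:00:00"},
    {"amount": 2500.0, "channel": "mobile", "is_international": False,
     "timestamp": "2024-01-01T23:15:00"},
    {"amount": 50.0, "channel": "mobile", "is_international": True,
     "timestamp": "2024-01-01T09:00:00"},
    {"amount": 9000.0, "channel": "ecommerce", "is_international": True,
     "timestamp": "2024-01-01T12:00:00"},
]
reasons = [fraud_reason(tx) for tx in examples]
print(reasons)
```

The last example satisfies both the international and the ecommerce rule, yet only the international label is reported, which matches the behaviour of the chained `when` expression.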


***

## 6. Monitoring, Checkpointing, and Fault Tolerance

**Monitoring Metrics**

```python
# Example: Monitor counts and suspects
fraud_gold_df = spark.read.table("fraud_gold")
fraud_gold_df.groupBy("fraud_suspect", "fraud_reason").count().show()
```

**View latest suspect transactions**

```python
# Query recent fraud suspects
spark.sql("SELECT * FROM fraud_gold WHERE fraud_suspect = true ORDER BY timestamp DESC LIMIT 10").show()
```

**Check Streaming Query Health**

```python
# Inspect recent micro-batch progress reports (input rows, processing rates, durations)
print(gold_query.recentProgress)
```
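
The raw progress output is verbose, so a small summary can be easier to scan. The sketch below assumes each progress entry behaves like a dictionary carrying the `batchId`, `numInputRows`, and `processedRowsPerSecond` fields found in Structured Streaming progress reports; `summarize_progress` is a hypothetical helper and the sample entries are made up.

```python
def summarize_progress(progress_entries):
    """Reduce a list of progress dicts to a few headline numbers."""
    entries = list(progress_entries)
    if not entries:
        return {"batches": 0, "total_input_rows": 0,
                "last_batch_id": None, "max_processed_rows_per_sec": 0.0}
    return {
        "batches": len(entries),
        "total_input_rows": sum(e.get("numInputRows", 0) for e in entries),
        "last_batch_id": entries[-1].get("batchId"),
        "max_processed_rows_per_sec": max(
            e.get("processedRowsPerSecond", 0.0) for e in entries
        ),
    }

# Made-up progress entries standing in for a running query's reports.
fake_progress = [
    {"batchId": 0, "numInputRows": 1000, "processedRowsPerSecond": 250.0},
    {"batchId": 1, "numInputRows": 0, "processedRowsPerSecond": 0.0},
]
summary = summarize_progress(fake_progress)
print(summary)
```

In the notebook, this could be called as `summarize_progress(gold_query.recentProgress)`, provided the entries are dict-like in the runtime in use.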


***

## 7. Production Recap & Extension Ideas

- The Bronze, Silver, and Gold tables are all **Delta**, which provides ACID transactions and time travel; the raw landing files are plain JSON
- Checkpointing lets each streaming query resume from where it stopped after a failure or restart
- Extend with data lineage tags, periodic access audits, or alerting (simple print/logging for now)
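
The logging-based alerting mentioned in the last bullet could start out as something like the following. It is a minimal sketch, assuming rows are available as dictionaries with the Gold table's column names; `log_suspects` and the logger name `fraud_alerts` are hypothetical.

```python
import logging

logger = logging.getLogger("fraud_alerts")  # hypothetical logger name

def log_suspects(rows, batch_id):
    """Log one warning per suspect row and return how many were logged."""
    count = 0
    for row in rows:
        if row.get("fraud_suspect"):
            logger.warning(
                "batch=%s tx=%s reason=%s amount=%.2f",
                batch_id, row["transaction_id"], row["fraud_reason"], row["amount"],
            )
            count += 1
    return count

# Made-up rows standing in for one micro-batch of the Gold stream.
demo_rows = [
    {"transaction_id": "TX1", "amount": 9500.0, "fraud_suspect": True,
     "fraud_reason": "Ecommerce Large Payment"},
    {"transaction_id": "TX2", "amount": 20.0, "fraud_suspect": False,
     "fraud_reason": None},
]
alerts = log_suspects(demo_rows, batch_id=0)
print(alerts)
```

One possible hookup is a `foreachBatch` callback on the Gold stream that filters to suspects, collects them, and passes `[r.asDict() for r in rows]` to this function; collecting to the driver is only reasonable while the number of suspects per batch stays small.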

***

