# Real-Time Cryptocurrency Streaming Analytics

## Processing Live Market Data with Databricks Structured Streaming

This notebook demonstrates how to process real-time cryptocurrency trade data using
Databricks Auto Loader and Structured Streaming. The data is sourced from Binance's
WebSocket API via the companion producer notebook.

### What You'll See:
1. **Auto Loader Ingestion** - Automatically pick up new JSON files as they arrive
2. **Real-Time Analytics** - VWAP, volume, price volatility per symbol
3. **Windowed Aggregations** - Time-based metrics (per micro-batch and per minute)
4. **Price Alerts** - Detect significant price movements
5. **Delta Lake Integration** - Persist streaming data for historical analysis
6. **Live Dashboards** - Visualize market activity in real-time

### Prerequisites:
- Run the **`crypto_data_producer`** notebook first (or in parallel) to generate data
- Data should be landing in the Unity Catalog Volume

## Setup & Configuration

In [0]:
# Configuration - must match producer settings
CATALOG = "takamol_demo"
SCHEMA = "crypto_streaming"
VOLUME_NAME = "crypto_landing"

# Paths
VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/{VOLUME_NAME}"
LANDING_PATH = f"{VOLUME_PATH}/trades"

# Session ID for unique checkpoints
import uuid
from datetime import datetime
SESSION_ID = datetime.now().strftime("%Y%m%d_%H%M%S") + "_" + str(uuid.uuid4())[:8]
CHECKPOINT_BASE = f"{VOLUME_PATH}/checkpoints/{SESSION_ID}"

print("=" * 60)
print("CRYPTOCURRENCY STREAMING ANALYTICS")
print("=" * 60)
print(f"\nConfiguration:")
print(f"  Catalog: {CATALOG}")
print(f"  Schema: {SCHEMA}")
print(f"  Landing Path: {LANDING_PATH}")
print(f"  Checkpoint Base: {CHECKPOINT_BASE}")
print(f"  Session ID: {SESSION_ID}")

CRYPTOCURRENCY STREAMING ANALYTICS

Configuration:
  Catalog: takamol_demo
  Schema: crypto_streaming
  Landing Path: /Volumes/takamol_demo/crypto_streaming/crypto_landing/trades
  Checkpoint Base: /Volumes/takamol_demo/crypto_streaming/crypto_landing/checkpoints/20260112_054612_841e1cc7
  Session ID: 20260112_054612_841e1cc7


## Verify Data Availability

In [0]:
# Check if data is available
try:
    files = dbutils.fs.ls(LANDING_PATH)
    print(f"\n✓ Found {len(files)} files in landing zone")

    # Show recent files
    files_sorted = sorted(files, key=lambda x: x.name, reverse=True)
    print("\nMost recent files:")
    for f in files_sorted[:5]:
        print(f"  {f.name} ({f.size / 1024:.1f} KB)")

    # Quick data preview
    sample_df = spark.read.json(LANDING_PATH).limit(5)
    print(f"\nSample schema:")
    sample_df.printSchema()

except Exception as e:
    print(f"\n⚠ No data found in landing zone: {e}")
    print(f"\nPlease run the crypto_data_producer notebook first to generate data.")
    dbutils.notebook.exit("No data available - run producer first")


✓ Found 70 files in landing zone

Most recent files:
  trades_20260112_054543_b26e5e0f.json (0.3 KB)
  trades_20260112_054533_0cc90eb7.json (0.3 KB)
  trades_20260112_054515_e418e430.json (0.3 KB)
  trades_20260112_051757_503428aa.json (0.3 KB)
  trades_20260112_051529_270d4bec.json (0.3 KB)

Sample schema:
root
 |-- buyer_order_id: long (nullable = true)
 |-- event_time: long (nullable = true)
 |-- event_type: string (nullable = true)
 |-- ingestion_time: long (nullable = true)
 |-- is_buyer_maker: boolean (nullable = true)
 |-- price: double (nullable = true)
 |-- producer_id: string (nullable = true)
 |-- quantity: double (nullable = true)
 |-- seller_order_id: long (nullable = true)
 |-- symbol: string (nullable = true)
 |-- trade_id: long (nullable = true)
 |-- trade_time: long (nullable = true)
 |-- trade_value_usdt: double (nullable = true)



## Create Delta Tables for Streaming Output

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *

# The wildcard import above shadows Python's built-ins abs, sum, min, max, and round.
# Keep a handle to the built-in abs for plain-Python arithmetic later on.
import builtins
py_abs = builtins.abs

# Ensure schema exists
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")

# 1. Raw trades table
trades_raw_table = f"{CATALOG}.{SCHEMA}.trades_raw"
spark.sql(f"DROP TABLE IF EXISTS {trades_raw_table}")
spark.sql(f"""
    CREATE TABLE {trades_raw_table} (
        event_type STRING,
        event_time BIGINT,
        symbol STRING,
        trade_id BIGINT,
        price DOUBLE,
        quantity DOUBLE,
        buyer_order_id BIGINT,
        seller_order_id BIGINT,
        trade_time BIGINT,
        is_buyer_maker BOOLEAN,
        trade_value_usdt DOUBLE,
        ingestion_time BIGINT,
        producer_id STRING,
        processing_time TIMESTAMP,
        trade_timestamp TIMESTAMP,
        batch_id BIGINT
    )
    COMMENT 'Raw cryptocurrency trades from Binance WebSocket'
""")
print(f"✓ Created: {trades_raw_table}")

# 2. Analytics aggregations table
trades_analytics_table = f"{CATALOG}.{SCHEMA}.trades_analytics"
spark.sql(f"DROP TABLE IF EXISTS {trades_analytics_table}")
spark.sql(f"""
    CREATE TABLE {trades_analytics_table} (
        window_start TIMESTAMP,
        window_end TIMESTAMP,
        symbol STRING,
        trade_count BIGINT,
        total_volume DOUBLE,
        total_value_usdt DOUBLE,
        vwap DOUBLE,
        avg_price DOUBLE,
        min_price DOUBLE,
        max_price DOUBLE,
        price_range_pct DOUBLE,
        buy_volume DOUBLE,
        sell_volume DOUBLE,
        buy_sell_ratio DOUBLE,
        batch_id BIGINT
    )
    COMMENT 'Windowed cryptocurrency trade analytics'
""")
print(f"✓ Created: {trades_analytics_table}")

# 3. Price alerts table
price_alerts_table = f"{CATALOG}.{SCHEMA}.price_alerts"
spark.sql(f"DROP TABLE IF EXISTS {price_alerts_table}")
spark.sql(f"""
    CREATE TABLE {price_alerts_table} (
        alert_time TIMESTAMP,
        symbol STRING,
        alert_type STRING,
        current_price DOUBLE,
        previous_price DOUBLE,
        price_change_pct DOUBLE,
        volume_in_window DOUBLE,
        trade_count BIGINT,
        severity STRING,
        message STRING,
        batch_id BIGINT
    )
    COMMENT 'Price movement alerts for cryptocurrency trades'
""")
print(f"✓ Created: {price_alerts_table}")

✓ Created: takamol_demo.crypto_streaming.trades_raw
✓ Created: takamol_demo.crypto_streaming.trades_analytics
✓ Created: takamol_demo.crypto_streaming.price_alerts


---
# Part 1: Auto Loader - Streaming File Ingestion

## Reading New JSON Files as They Arrive

Auto Loader (`cloudFiles` format) automatically:
- Discovers new files in the landing zone
- Tracks which files have been processed
- Handles schema evolution when it infers the schema (this notebook supplies an explicit schema instead)
- Scales to millions of files
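
To make the first two bullets concrete, here is a minimal plain-Python sketch of incremental file discovery. It only illustrates the idea and is not how Auto Loader is implemented: the `discover_new_files` helper and the in-memory `seen_files` set are hypothetical stand-ins for Auto Loader's checkpointed file-tracking state, and the file names are made up.

```python
import json
import tempfile
from pathlib import Path


def discover_new_files(landing_dir: Path, seen: set[str]) -> list[Path]:
    """Return JSON files not yet recorded in `seen`, sorted by name."""
    new_files = sorted(
        p for p in landing_dir.glob("*.json") if p.name not in seen
    )
    # Record the files so the next call skips them.
    seen.update(p.name for p in new_files)
    return new_files


# Simulate a landing zone that receives files over time.
landing = Path(tempfile.mkdtemp())
seen_files: set[str] = set()

(landing / "trades_001.json").write_text(json.dumps({"symbol": "BTCUSDT"}))
first_pass = discover_new_files(landing, seen_files)

(landing / "trades_002.json").write_text(json.dumps({"symbol": "ETHUSDT"}))
second_pass = discover_new_files(landing, seen_files)

print([p.name for p in first_pass])   # only the first file
print([p.name for p in second_pass])  # only the file added afterwards
```

Each pass returns only the files that arrived since the previous one, which is the behavior the stream relies on when it is restarted against the same checkpoint.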

### 1.1 Configure Auto Loader Stream

In [0]:
# Define schema for trade data
trade_schema = StructType([
    StructField("event_type", StringType(), True),
    StructField("event_time", LongType(), True),
    StructField("symbol", StringType(), True),
    StructField("trade_id", LongType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", DoubleType(), True),
    StructField("buyer_order_id", LongType(), True),
    StructField("seller_order_id", LongType(), True),
    StructField("trade_time", LongType(), True),
    StructField("is_buyer_maker", BooleanType(), True),
    StructField("trade_value_usdt", DoubleType(), True),
    StructField("ingestion_time", LongType(), True),
    StructField("producer_id", StringType(), True)
])

# Create Auto Loader stream.
# The schema is supplied explicitly, so the inference-only options
# (cloudFiles.inferColumnTypes, cloudFiles.schemaHints) are not needed.
auto_loader_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_BASE}/schema")
    .schema(trade_schema)
    .load(LANDING_PATH)
)

print("Auto Loader Stream Schema:")
auto_loader_stream.printSchema()

Auto Loader Stream Schema:
root
 |-- event_type: string (nullable = true)
 |-- event_time: long (nullable = true)
 |-- symbol: string (nullable = true)
 |-- trade_id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- quantity: double (nullable = true)
 |-- buyer_order_id: long (nullable = true)
 |-- seller_order_id: long (nullable = true)
 |-- trade_time: long (nullable = true)
 |-- is_buyer_maker: boolean (nullable = true)
 |-- trade_value_usdt: double (nullable = true)
 |-- ingestion_time: long (nullable = true)
 |-- producer_id: string (nullable = true)



### 1.2 Stream Raw Trades to Delta Lake
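
The batch function in the next cell derives `trade_timestamp` by dividing `trade_time` by 1000 and casting the result to a timestamp, so `trade_time` is treated as epoch milliseconds. Here is a minimal plain-Python sketch of the same conversion. The `millis_to_utc` helper and the sample value are illustrative assumptions, not part of the notebook.

```python
from datetime import datetime, timezone


def millis_to_utc(epoch_ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


sample_trade_time = 1_768_196_743_000  # illustrative value, not real data
print(millis_to_utc(sample_trade_time).isoformat())
```

Passing the raw millisecond value without dividing by 1000 would be read as seconds and land tens of thousands of years in the future, which is a quick way to spot a unit mix-up.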

In [0]:
# Define foreachBatch function for raw trades
def process_raw_trades_batch(batch_df, batch_id):
    """Process and persist raw trade data."""
    # Count once and reuse it; every count() call runs a separate Spark job
    trade_count = batch_df.count()
    if trade_count > 0:
        # Enrich with processing metadata
        enriched_df = batch_df \
            .withColumn("processing_time", current_timestamp()) \
            .withColumn("trade_timestamp",
                (col("trade_time") / 1000).cast("timestamp")) \
            .withColumn("batch_id", lit(batch_id).cast("bigint"))

        # Write to Delta table
        enriched_df.write.mode("append").saveAsTable(trades_raw_table)

        # Statistics for this batch
        symbols = enriched_df.select("symbol").distinct().collect()
        symbol_list = [r.symbol for r in symbols]

        print(f"  Batch {batch_id:3d}: {trade_count:5d} trades | "
              f"Symbols: {', '.join(sorted(symbol_list))} | "
              f"{datetime.now().strftime('%H:%M:%S')}")

# Start the raw trades stream
print("=" * 60)
print("STARTING RAW TRADES STREAM")
print("=" * 60)
print(f"Reading from: {LANDING_PATH}")
print(f"Writing to: {trades_raw_table}")
print("-" * 60)

raw_trades_query = (
    auto_loader_stream
    .writeStream
    .foreachBatch(process_raw_trades_batch)
    .option("checkpointLocation", f"{CHECKPOINT_BASE}/raw_trades")
    .trigger(availableNow=True)
    .start()
)

# Wait for completion
raw_trades_query.awaitTermination()

print("-" * 60)
total_raw = spark.sql(f"SELECT COUNT(*) as cnt FROM {trades_raw_table}").collect()[0]['cnt']
print(f"✓ Raw trades stream complete! Total trades: {total_raw:,}")

STARTING RAW TRADES STREAM
Reading from: /Volumes/takamol_demo/crypto_streaming/crypto_landing/trades
Writing to: takamol_demo.crypto_streaming.trades_raw
------------------------------------------------------------


26/01/12 05:46:26 Error occurred in _on_start_callback_serverless: User neha.saxena@novigosolutions.com has already started Streaming queries in this Notebook. Only one user could start Streaming queries in a Serverless Notebook. If you need to start a Streaming query, detach and reattach the notebook and try again.
Traceback (most recent call last):
  File "/databricks/python_shell/dbruntime/databricks_connect_streaming_listener.py", line 84, in wrapper
    return func(*args, **kwargs)
  File "/databricks/python_shell/dbruntime/databricks_connect_streaming_listener.py", line 531, in _on_start_callback_serverless
    raise e
  File "/databricks/python_shell/dbruntime/databricks_connect_streaming_listener.py", line 523, in _on_start_callback_serverless
    listener = _set_streaming_user_and_register_listener(user_id, query._session,
  File "/databricks/python_shell/dbruntime/databricks_connect_streaming_listener.py", line 175, in _set_streaming_user_and_register_listener
    raise Runti

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <command-8535008811241395>, line 37
     28 print(f"Writing to: {trades_raw_table}")
     29 print("-" * 60)
     31 raw_trades_query = (
     32     auto_loader_stream
     33     .writeStream
     34     .foreachBatch(process_raw_trades_batch)
     35     .option("checkpointLocation", f"{CHECKPOINT_BASE}/raw_trad

In [0]:
# Display raw trades
print(f"Recent Trades from {trades_raw_table}:")
display(spark.sql(f"""
    SELECT
        trade_timestamp,
        symbol,
        price,
        quantity,
        trade_value_usdt,
        CASE WHEN is_buyer_maker THEN 'SELL' ELSE 'BUY' END as trade_side
    FROM {trades_raw_table}
    ORDER BY trade_timestamp DESC
    LIMIT 50
"""))

Recent Trades from takamol_demo.crypto_streaming.trades_raw:


trade_timestamp,symbol,price,quantity,trade_value_usdt,trade_side


---
# Part 2: Real-Time Analytics

## Windowed Aggregations for Market Intelligence

### 2.1 Calculate VWAP and Volume Metrics
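
As a sanity check on the aggregation in the next cell, the same per-symbol metrics can be computed by hand on a few rows. This is a minimal plain-Python sketch with made-up trades. The `Trade` tuple and `summarize` helper are hypothetical and only mirror the formulas used in `process_analytics_batch`, including the notebook's convention that `is_buyer_maker = true` counts as a sell.

```python
from typing import NamedTuple


class Trade(NamedTuple):
    price: float
    quantity: float
    is_buyer_maker: bool  # True is counted as a sell, False as a buy


def summarize(trades: list[Trade]) -> dict[str, float]:
    """Compute VWAP, price range %, and buy/sell volume for one symbol."""
    total_volume = sum(t.quantity for t in trades)
    # VWAP = Sum(price * quantity) / Sum(quantity)
    vwap = sum(t.price * t.quantity for t in trades) / total_volume
    prices = [t.price for t in trades]
    avg_price = sum(prices) / len(prices)
    return {
        "vwap": vwap,
        "avg_price": avg_price,
        "price_range_pct": (max(prices) - min(prices)) / avg_price * 100,
        "buy_volume": sum(t.quantity for t in trades if not t.is_buyer_maker),
        "sell_volume": sum(t.quantity for t in trades if t.is_buyer_maker),
    }


# Made-up trades: the larger trade at 100 pulls VWAP below the simple average.
sample = [Trade(100.0, 3.0, False), Trade(110.0, 1.0, True)]
stats = summarize(sample)
print(stats)
```

Run this in a plain Python session rather than in the notebook itself, because `from pyspark.sql.functions import *` replaces the built-in `sum`, `min`, and `max` with their column-expression counterparts.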

In [0]:
# Re-read from Delta as a stream for downstream processing
trades_delta_stream = (
    spark.readStream
    .format("delta")
    .table(trades_raw_table)
)

# Define foreachBatch function for analytics
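def process_analytics_batch(batch_df, batch_id):
    """Calculate per-symbol analytics for each micro-batch.

    window_start and window_end both record the processing time of the batch,
    not an event-time window.
    """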
    if batch_df.count() > 0:
        analytics_df = batch_df \
            .groupBy("symbol") \
            .agg(
                count("*").alias("trade_count"),
                sum("quantity").alias("total_volume"),
                sum("trade_value_usdt").alias("total_value_usdt"),
                # VWAP = Sum(Price * Volume) / Sum(Volume)
                (sum(col("price") * col("quantity")) / sum("quantity")).alias("vwap"),
                avg("price").alias("avg_price"),
                min("price").alias("min_price"),
                max("price").alias("max_price"),
                # Buy volume (when is_buyer_maker = false, it's a buy)
                sum(when(~col("is_buyer_maker"), col("quantity")).otherwise(0)).alias("buy_volume"),
                # Sell volume
                sum(when(col("is_buyer_maker"), col("quantity")).otherwise(0)).alias("sell_volume")
            ) \
            .withColumn("price_range_pct",
                round((col("max_price") - col("min_price")) / col("avg_price") * 100, 4)) \
            .withColumn("buy_sell_ratio",
                round(col("buy_volume") / (col("sell_volume") + 0.0001), 4)) \
            .withColumn("window_start", current_timestamp()) \
            .withColumn("window_end", current_timestamp()) \
            .withColumn("batch_id", lit(batch_id).cast("bigint")) \
            .select(
                "window_start", "window_end", "symbol", "trade_count",
                round("total_volume", 6).alias("total_volume"),
                round("total_value_usdt", 2).alias("total_value_usdt"),
                round("vwap", 4).alias("vwap"),
                round("avg_price", 4).alias("avg_price"),
                round("min_price", 4).alias("min_price"),
                round("max_price", 4).alias("max_price"),
                "price_range_pct",
                round("buy_volume", 6).alias("buy_volume"),
                round("sell_volume", 6).alias("sell_volume"),
                "buy_sell_ratio",
                "batch_id"
            )

        analytics_df.write.mode("append").saveAsTable(trades_analytics_table)

        # Print summary
        for row in analytics_df.collect():
            print(f"  {row.symbol:8s} | VWAP: ${row.vwap:,.2f} | "
                  f"Volume: {row.total_volume:,.4f} | "
                  f"B/S Ratio: {row.buy_sell_ratio:.2f}")

# Start analytics stream
print("\n" + "=" * 60)
print("STARTING ANALYTICS STREAM")
print("=" * 60)
print(f"Writing to: {trades_analytics_table}")
print("-" * 60)

analytics_query = (
    trades_delta_stream
    .writeStream
    .foreachBatch(process_analytics_batch)
    .option("checkpointLocation", f"{CHECKPOINT_BASE}/analytics")
    .trigger(availableNow=True)
    .start()
)

analytics_query.awaitTermination()

print("-" * 60)
print("✓ Analytics stream complete!")



In [0]:
# Display analytics results
print(f"Trade Analytics from {trades_analytics_table}:")
display(spark.sql(f"""
    SELECT
        symbol,
        SUM(trade_count) as total_trades,
        ROUND(SUM(total_value_usdt), 2) as total_volume_usdt,
        ROUND(SUM(vwap * total_volume) / NULLIF(SUM(total_volume), 0), 2) as session_vwap,
        ROUND(MIN(min_price), 2) as session_low,
        ROUND(MAX(max_price), 2) as session_high,
        ROUND(SUM(buy_volume) / NULLIF(SUM(sell_volume), 0), 2) as buy_sell_ratio
    FROM {trades_analytics_table}
    GROUP BY symbol
    ORDER BY total_volume_usdt DESC
"""))



### 2.2 Price Movement Alerts
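
The alert rules in the next cell reduce to a small decision function: a price alert fires first, and the volume check only runs when no price alert fired. Here is a minimal plain-Python sketch of that logic, using the same demo thresholds. The `classify` helper and the `HIGH_SEVERITY_PCT` name are hypothetical; the notebook hardcodes that 0.5 value inline.

```python
from typing import Optional

PRICE_CHANGE_THRESHOLD_PCT = 0.1  # same demo thresholds as the notebook cell
HIGH_SEVERITY_PCT = 0.5           # hypothetical name for the inline 0.5 cutoff
VOLUME_SPIKE_THRESHOLD = 2


def classify(first_price: float, last_price: float,
             trade_count: int) -> Optional[tuple[str, str]]:
    """Return (alert_type, severity) for one symbol, or None if nothing fires."""
    change_pct = (last_price - first_price) / first_price * 100
    if abs(change_pct) > PRICE_CHANGE_THRESHOLD_PCT:
        direction = "UP" if change_pct > 0 else "DOWN"
        severity = "HIGH" if abs(change_pct) > HIGH_SEVERITY_PCT else "MEDIUM"
        return (f"PRICE_{direction}", severity)
    # Only reached when no price alert fired.
    if trade_count > VOLUME_SPIKE_THRESHOLD:
        return ("VOLUME_SPIKE", "MEDIUM")
    return None


print(classify(100.0, 101.0, 1))   # 1% move up
print(classify(100.0, 99.8, 5))    # 0.2% move down takes precedence over volume
print(classify(100.0, 100.05, 5))  # small move, many trades
print(classify(100.0, 100.0, 1))   # nothing fires
```

One consequence of this ordering is that a symbol with both a large price move and many trades produces a single price alert, not two alerts.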

In [0]:
# Re-read for alerts stream
trades_for_alerts = (
    spark.readStream
    .format("delta")
    .table(trades_raw_table)
)

# Alert thresholds (adjusted for testing/demo with small datasets)
PRICE_CHANGE_THRESHOLD_PCT = 0.1  # Alert if price changes > 0.1% in a batch
VOLUME_SPIKE_THRESHOLD = 2        # Alert if > 2 trades in a batch

def process_alerts_batch(batch_df, batch_id):
    """Detect price movements and volume spikes."""
    if batch_df.count() == 0:
        return

    # Calculate metrics per symbol
    metrics_df = batch_df \
        .groupBy("symbol") \
        .agg(
            count("*").alias("trade_count"),
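            # first()/last() depend on row order, which a grouped aggregate
            # does not guarantee, so pick the prices by trade_time instead
            min_by("price", "trade_time").alias("first_price"),
            max_by("price", "trade_time").alias("last_price"),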
            min("price").alias("min_price"),
            max("price").alias("max_price"),
            sum("quantity").alias("volume")
        ) \
        .withColumn("price_change_pct",
            round((col("last_price") - col("first_price")) / col("first_price") * 100, 4))

    # Generate alerts
    alerts = []

    for row in metrics_df.collect():
        alert = None

        # Price movement alert
        if py_abs(row.price_change_pct) > PRICE_CHANGE_THRESHOLD_PCT:
            direction = "UP" if row.price_change_pct > 0 else "DOWN"
            severity = "HIGH" if py_abs(row.price_change_pct) > 0.5 else "MEDIUM"
            alert = {
                "alert_time": datetime.now(),
                "symbol": row.symbol,
                "alert_type": f"PRICE_{direction}",
                "current_price": row.last_price,
                "previous_price": row.first_price,
                "price_change_pct": row.price_change_pct,
                "volume_in_window": row.volume,
                "trade_count": row.trade_count,
                "severity": severity,
                "message": f"{row.symbol} price moved {direction} {py_abs(row.price_change_pct):.2f}% "
                          f"(${row.first_price:,.2f} → ${row.last_price:,.2f})",
                "batch_id": int(batch_id)
            }

        # Volume spike alert
        elif row.trade_count > VOLUME_SPIKE_THRESHOLD:
            alert = {
                "alert_time": datetime.now(),
                "symbol": row.symbol,
                "alert_type": "VOLUME_SPIKE",
                "current_price": row.last_price,
                "previous_price": row.first_price,
                "price_change_pct": row.price_change_pct,
                "volume_in_window": row.volume,
                "trade_count": row.trade_count,
                "severity": "MEDIUM",
                "message": f"{row.symbol} volume spike: {row.trade_count} trades in batch",
                "batch_id": int(batch_id)
            }

        if alert:
            alerts.append(alert)
            print(f"  🚨 ALERT: {alert['message']}")

    # Write alerts to table
    if alerts:
        # Use the session attached to the batch rather than the notebook global
        alerts_df = batch_df.sparkSession.createDataFrame(alerts)
        alerts_df.write.mode("append").saveAsTable(price_alerts_table)

# Start alerts stream
print("\n" + "=" * 60)
print("STARTING PRICE ALERT MONITORING")
print("=" * 60)
print(f"Thresholds: Price Change > {PRICE_CHANGE_THRESHOLD_PCT}% | Volume > {VOLUME_SPIKE_THRESHOLD} trades")
print(f"Writing to: {price_alerts_table}")
print("-" * 60)

alerts_query = (
    trades_for_alerts
    .writeStream
    .foreachBatch(process_alerts_batch)
    .option("checkpointLocation", f"{CHECKPOINT_BASE}/alerts")
    .trigger(availableNow=True)
    .start()
)

alerts_query.awaitTermination()

print("-" * 60)
total_alerts = spark.sql(f"SELECT COUNT(*) as cnt FROM {price_alerts_table}").collect()[0]['cnt']
print(f"✓ Alert monitoring complete! Total alerts: {total_alerts}")



In [0]:
# Display any alerts
print(f"Price Alerts from {price_alerts_table}:")
display(spark.sql(f"""
    SELECT
        alert_time,
        symbol,
        alert_type,
        severity,
        message,
        current_price,
        price_change_pct,
        trade_count
    FROM {price_alerts_table}
    ORDER BY alert_time DESC
"""))



---
# Part 3: Market Dashboard Queries

## Pre-Built Queries for Dashboards

### 3.1 Current Market Summary
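
The `change_pct` column in the query below compares the latest price with the session's opening price. Here is a minimal plain-Python sketch of that calculation, taking the open as the price of the earliest trade by timestamp, which is an assumption about what "open" should mean here. The `session_change_pct` helper and the rows are made up for illustration.

```python
def session_change_pct(trades: list[tuple[int, float]]) -> float:
    """Percent change from the earliest to the latest trade.

    `trades` holds (trade_time_ms, price) pairs in any order.
    """
    open_price = min(trades, key=lambda t: t[0])[1]     # price at earliest time
    current_price = max(trades, key=lambda t: t[0])[1]  # price at latest time
    return (current_price - open_price) / open_price * 100


# Made-up rows, deliberately out of time order.
rows = [(2_000, 102.0), (1_000, 100.0), (3_000, 105.0)]
print(round(session_change_pct(rows), 2))
```

Selecting both prices by timestamp keeps the result independent of the order in which rows happen to be read.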

In [0]:
print("=" * 60)
print("MARKET SUMMARY")
print("=" * 60)

display(spark.sql(f"""
    WITH latest_prices AS (
        SELECT
            symbol,
            price as current_price,
            trade_timestamp,
            ROW_NUMBER() OVER (PARTITION BY symbol ORDER BY trade_timestamp DESC) as rn
        FROM {trades_raw_table}
    ),
    symbol_stats AS (
        SELECT
            symbol,
            COUNT(*) as total_trades,
            SUM(trade_value_usdt) as total_volume_usdt,
            MIN(price) as session_low,
            MAX(price) as session_high,
            MIN_BY(price, trade_timestamp) as open_price,
            SUM(CASE WHEN NOT is_buyer_maker THEN quantity ELSE 0 END) as buy_volume,
            SUM(CASE WHEN is_buyer_maker THEN quantity ELSE 0 END) as sell_volume
        FROM {trades_raw_table}
        GROUP BY symbol
    )
    SELECT
        s.symbol,
        ROUND(lp.current_price, 2) as current_price,
        ROUND((lp.current_price - s.open_price) / s.open_price * 100, 2) as change_pct,
        s.total_trades,
        ROUND(s.total_volume_usdt, 2) as volume_usdt,
        ROUND(s.session_low, 2) as session_low,
        ROUND(s.session_high, 2) as session_high,
        ROUND(s.buy_volume / NULLIF(s.sell_volume, 0), 2) as buy_sell_ratio
    FROM symbol_stats s
    JOIN latest_prices lp ON s.symbol = lp.symbol AND lp.rn = 1
    ORDER BY s.total_volume_usdt DESC
"""))



### 3.2 Trade Volume Over Time
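
The query below groups trades into one-minute buckets with `date_trunc('minute', ...)`. Here is a minimal plain-Python sketch of the same bucketing on made-up trades. The `minute_bucket` helper is hypothetical and stands in for `date_trunc`.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


# Made-up trades: (timestamp, symbol, trade_value_usdt)
trades = [
    (datetime(2026, 1, 12, 5, 45, 15, tzinfo=timezone.utc), "BTCUSDT", 50.0),
    (datetime(2026, 1, 12, 5, 45, 43, tzinfo=timezone.utc), "BTCUSDT", 25.0),
    (datetime(2026, 1, 12, 5, 46, 2, tzinfo=timezone.utc), "BTCUSDT", 10.0),
]

# Sum traded value per (minute, symbol) pair.
volume_by_minute: dict[tuple[datetime, str], float] = defaultdict(float)
for ts, symbol, value in trades:
    volume_by_minute[(minute_bucket(ts), symbol)] += value

for (minute, symbol), volume in sorted(volume_by_minute.items()):
    print(minute.isoformat(), symbol, volume)
```

The first two trades fall in the same minute and are summed, while the third starts a new bucket.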

In [0]:
print("Trade Volume by Symbol (Per Minute):")
display(spark.sql(f"""
    SELECT
        date_trunc('minute', trade_timestamp) as minute,
        symbol,
        COUNT(*) as trades,
        ROUND(SUM(trade_value_usdt), 2) as volume_usdt,
        ROUND(AVG(price), 2) as avg_price
    FROM {trades_raw_table}
    GROUP BY date_trunc('minute', trade_timestamp), symbol
    ORDER BY minute DESC, volume_usdt DESC
"""))



### 3.3 Buy/Sell Pressure Analysis
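
The query below splits traded value by side and divides buys by sells, using `NULLIF(..., 0)` so that a symbol with no sells yields NULL instead of a division error. Here is a minimal plain-Python sketch of the same ratio with made-up values. The `buy_sell_ratio` helper is hypothetical, and it follows the notebook's convention that `is_buyer_maker = true` is a sell.

```python
from typing import Optional


def buy_sell_ratio(trades: list[tuple[float, bool]]) -> Optional[float]:
    """Ratio of buy value to sell value; None when there are no sells.

    `trades` holds (trade_value_usdt, is_buyer_maker) pairs.
    """
    buy_value = sum(v for v, is_buyer_maker in trades if not is_buyer_maker)
    sell_value = sum(v for v, is_buyer_maker in trades if is_buyer_maker)
    # Same effect as NULLIF(sell, 0) in SQL: avoid dividing by zero.
    return buy_value / sell_value if sell_value else None


print(buy_sell_ratio([(300.0, False), (100.0, True)]))  # buys outweigh sells
print(buy_sell_ratio([(300.0, False)]))                 # no sells in the sample
```

A ratio above 1 indicates more value traded on the buy side than on the sell side for that symbol.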

In [0]:
print("Buy/Sell Pressure by Symbol:")
display(spark.sql(f"""
    SELECT
        symbol,
        COUNT(*) as total_trades,
        SUM(CASE WHEN NOT is_buyer_maker THEN 1 ELSE 0 END) as buy_trades,
        SUM(CASE WHEN is_buyer_maker THEN 1 ELSE 0 END) as sell_trades,
        ROUND(SUM(CASE WHEN NOT is_buyer_maker THEN trade_value_usdt ELSE 0 END), 2) as buy_volume_usdt,
        ROUND(SUM(CASE WHEN is_buyer_maker THEN trade_value_usdt ELSE 0 END), 2) as sell_volume_usdt,
        ROUND(
            SUM(CASE WHEN NOT is_buyer_maker THEN trade_value_usdt ELSE 0 END) /
            NULLIF(SUM(CASE WHEN is_buyer_maker THEN trade_value_usdt ELSE 0 END), 0)
        , 2) as buy_sell_ratio
    FROM {trades_raw_table}
    GROUP BY symbol
    ORDER BY total_trades DESC
"""))



---
# Part 4: Continuous Streaming Mode

## For Live Demos - Run Streams Continuously

The cells above use `trigger(availableNow=True)`, which processes all available data and then stops.
For a live demo where data arrives continuously, use a `processingTime` trigger instead.

### 4.1 Continuous Analytics Stream (Optional)

Uncomment and run this cell for continuous processing during a live demo.
**Note:** Stop any previous streams first.

In [0]:
# # UNCOMMENT FOR CONTINUOUS STREAMING
#
# # Stop any existing streams
# for stream in spark.streams.active:
#     print(f"Stopping: {stream.name}")
#     stream.stop()
#
# # Create continuous Auto Loader stream
# continuous_stream = (
#     spark.readStream
#     .format("cloudFiles")
#     .option("cloudFiles.format", "json")
#     .option("cloudFiles.schemaLocation", f"{CHECKPOINT_BASE}/continuous_schema")
#     .schema(trade_schema)
#     .load(LANDING_PATH)
# )
#
# def process_continuous_batch(batch_df, batch_id):
#     """Process trades continuously with live output."""
#     if batch_df.count() > 0:
#         # Calculate quick stats
#         stats = batch_df.groupBy("symbol").agg(
#             count("*").alias("trades"),
#             round(avg("price"), 2).alias("avg_price"),
#             round(sum("trade_value_usdt"), 2).alias("volume")
#         ).collect()
#
#         print(f"\n[Batch {batch_id}] {datetime.now().strftime('%H:%M:%S')} - {batch_df.count()} trades")
#         for s in stats:
#             print(f"  {s.symbol}: {s.trades} trades @ ${s.avg_price:,.2f} (${s.volume:,.2f} vol)")
#
#         # Write to Delta
#         enriched = batch_df \
#             .withColumn("processing_time", current_timestamp()) \
#             .withColumn("trade_timestamp", (col("trade_time") / 1000).cast("timestamp")) \
#             .withColumn("batch_id", lit(batch_id).cast("bigint"))
#         enriched.write.mode("append").saveAsTable(trades_raw_table)
#
# # Start continuous stream with 5-second intervals
# continuous_query = (
#     continuous_stream
#     .writeStream
#     .foreachBatch(process_continuous_batch)
#     .option("checkpointLocation", f"{CHECKPOINT_BASE}/continuous")
#     .trigger(processingTime="5 seconds")  # Process every 5 seconds
#     .start()
# )
#
# print("Continuous streaming started!")
# print("Press 'Cancel' to stop, or run: continuous_query.stop()")



---
# Part 5: Summary & Key Takeaways

In [0]:
# Final statistics
print("=" * 60)
print("SESSION SUMMARY")
print("=" * 60)

raw_count = spark.sql(f"SELECT COUNT(*) as cnt FROM {trades_raw_table}").collect()[0]['cnt']
analytics_count = spark.sql(f"SELECT COUNT(*) as cnt FROM {trades_analytics_table}").collect()[0]['cnt']
alerts_count = spark.sql(f"SELECT COUNT(*) as cnt FROM {price_alerts_table}").collect()[0]['cnt']

print(f"\nData Collected:")
print(f"  Raw Trades: {raw_count:,}")
print(f"  Analytics Records: {analytics_count:,}")
print(f"  Price Alerts: {alerts_count:,}")

print(f"\nDelta Tables Created:")
print(f"  {trades_raw_table}")
print(f"  {trades_analytics_table}")
print(f"  {price_alerts_table}")



## Key Capabilities Demonstrated

| Capability | Description |
|------------|-------------|
| **Auto Loader** | Automatic, incremental file discovery |
| **Real External Data** | Live cryptocurrency trades from Binance |
| **foreachBatch Processing** | Serverless-compatible streaming pattern |
| **VWAP Calculation** | Volume-weighted average price |
| **Price Alerts** | Real-time anomaly detection |
| **Delta Lake** | ACID transactions for streaming data |
| **Multi-Stream Architecture** | Separate ingestion from processing |

## Production Recommendations

1. **Run Producer as Job** - Continuous data ingestion with monitoring
2. **Use Delta Live Tables** - Simplified pipeline management
3. **Add Schema Registry** - Handle schema evolution gracefully
4. **Set Up Alerts** - Route price alerts to Slack/Teams/PagerDuty
5. **Optimize Checkpointing** - Use one stable checkpoint location per query; this notebook creates a new one each session, so every run reprocesses all landed files
6. **Monitor Latency** - Track end-to-end streaming latency

---
## Cleanup

In [0]:
# Stop any running streams
for stream in spark.streams.active:
    print(f"Stopping stream: {stream.name}")
    stream.stop()

print("\n✓ All streams stopped")

# Optional: Clean up checkpoints
# dbutils.fs.rm(CHECKPOINT_BASE, recurse=True)
# print("✓ Checkpoints cleaned")



---

**Takamol Demo - Real-Time Cryptocurrency Streaming Analytics**

*Demonstrating Databricks Structured Streaming with Live Market Data*