# 04 - Streaming Aggregations

## Overview
This notebook performs streaming aggregations on the transformed transaction data using Spark SQL.

## Aggregation Patterns
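Structured Streaming supports two main aggregation patterns. Both are stateful: Spark keeps the running aggregates in a state store across micro-batches.

1. **Non-windowed (running) aggregations**: Plain `GROUP BY` operations without time windows; state is kept for every group for the lifetime of the query
2. **Windowed aggregations**: Aggregations grouped by event-time windows; with a watermark, Spark can finalize old windows and drop their state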
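
To make the idea of a time window concrete, here is a minimal pure-Python sketch (no Spark required) of how an event time maps to a tumbling window. The function name `tumbling_window` is hypothetical and only for illustration. It assumes windows are aligned to the Unix epoch in UTC, which is also the default alignment of Spark's `window` function.

```python
from datetime import datetime, timedelta, timezone


def tumbling_window(event_time: datetime, size: timedelta) -> tuple[datetime, datetime]:
    """Return the [start, end) tumbling window that contains event_time."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Number of whole windows that have elapsed since the epoch
    n = (event_time - epoch) // size
    start = epoch + n * size
    return start, start + size


ts = datetime(2024, 1, 1, 10, 42, 7, tzinfo=timezone.utc)
start, end = tumbling_window(ts, timedelta(hours=1))
print(start.isoformat(), "->", end.isoformat())
# prints: 2024-01-01T10:00:00+00:00 -> 2024-01-01T11:00:00+00:00
```

Every event whose time falls in the same window lands in the same group, so a windowed aggregation is a `GROUP BY` over the window plus any other keys.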

## Use Cases
- Real-time KPIs and metrics
- Windowed analytics (hourly, daily)
- Customer behavior analysis
- Revenue tracking by segment

## Output Modes
- **Complete**: Output the entire result table on every trigger (supported only for aggregation queries; used in this notebook)
- **Update**: Output only rows that changed since the last trigger
- **Append**: Output only new, finalized rows (aggregations require an event-time watermark)
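
The difference between complete and update mode is easiest to see on a toy example. The sketch below is a plain-Python simulation, not Spark code: it keeps a running `SUM(amount)` per category across micro-batches and records what each mode would emit. The function name `run_micro_batches` and the sample data are made up for illustration.

```python
from collections import defaultdict


def run_micro_batches(batches):
    """Simulate SUM(amount) GROUP BY category across micro-batches.

    For each batch, return what complete mode and update mode would emit.
    """
    state = defaultdict(float)  # running totals kept across batches
    emitted = []
    for batch in batches:
        touched = set()
        for category, amount in batch:
            state[category] += amount
            touched.add(category)
        emitted.append({
            "complete": dict(state),                   # the whole result table
            "update": {k: state[k] for k in touched},  # only groups updated by this batch
        })
    return emitted


batches = [
    [("books", 10.0), ("toys", 5.0)],
    [("books", 2.5)],
]
for batch_id, out in enumerate(run_micro_batches(batches)):
    print(batch_id, out)
```

In the second batch, complete mode re-emits both categories while update mode emits only `books`. This is why complete mode grows more expensive as the number of groups increases.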

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pathlib import Path
import os

## Initialize Spark Session

In [None]:
# Reuse the active Spark session if there is one; otherwise create a new one
spark = SparkSession.getActiveSession()
if spark is not None:
    print("Using existing Spark session")
else:
    spark = SparkSession.builder \
        .appName("TransactionStreamingETL") \
        .master("local[*]") \
        .config("spark.sql.streaming.schemaInference", "false") \
        .config("spark.sql.shuffle.partitions", "4") \
        .getOrCreate()
    print("Created new Spark session")

spark.sparkContext.setLogLevel("WARN")
print(f"Spark Version: {spark.version}")

## Configure Paths

In [None]:
# Configure paths
BASE_DIR = Path(os.path.abspath('')).parent
SQL_DIR = BASE_DIR / 'sql'
INPUT_DIR = str(BASE_DIR / 'data' / 'input')

print(f"SQL Directory: {SQL_DIR}")
print(f"Input Directory: {INPUT_DIR}")

## Ensure Transformed View Exists

Verify that the `transformed_transactions` view is available from the previous notebook. If it is not, rebuild the pipeline from the raw CSV stream.

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# Check if transformed view exists
existing_views = [table.name for table in spark.catalog.listTables() if table.isTemporary]

if 'transformed_transactions' not in existing_views:
    print("Setting up data pipeline from scratch...")
    
    # Define schema
    transaction_schema = StructType([
        StructField("transaction_id", StringType(), False),
        StructField("user_id", StringType(), False),
        StructField("product_id", StringType(), False),
        StructField("product_category", StringType(), True),
        StructField("amount", DoubleType(), False),
        StructField("quantity", IntegerType(), False),
        StructField("payment_method", StringType(), True),
        StructField("status", StringType(), False),
        StructField("event_time", StringType(), False),
        StructField("country_code", StringType(), True),
        StructField("discount_percent", DoubleType(), True),
        StructField("customer_segment", StringType(), True)
    ])
    
    # Create raw stream
    raw_stream = spark.readStream \
        .format("csv") \
        .schema(transaction_schema) \
        .option("header", "true") \
        .option("maxFilesPerTrigger", 1) \
        .load(INPUT_DIR)
    
    raw_stream.createOrReplaceTempView("raw_transactions")
    
    # Load and apply transformations
    with open(SQL_DIR / 'transformations.sql', 'r') as f:
        transformation_sql = f.read()
    
    transformed_stream = spark.sql(transformation_sql)
    transformed_stream.createOrReplaceTempView("transformed_transactions")
    
    print("Pipeline setup complete")
else:
    print("Using existing 'transformed_transactions' view")

## Load Aggregation SQL

Load aggregation queries from the external SQL file.

In [None]:
# Load SQL from external file
sql_file_path = SQL_DIR / 'aggregations.sql'

with open(sql_file_path, 'r') as f:
    aggregation_sql = f.read()

print(f"Loaded SQL from: {sql_file_path}")
print(f"\nSQL Query ({len(aggregation_sql)} characters):")
print("=" * 80)
print(aggregation_sql)
print("=" * 80)

## Execute Aggregation Query

Apply the SQL aggregation to create streaming analytics.

In [None]:
# Execute aggregation SQL
aggregated_stream = spark.sql(aggregation_sql)

print("Aggregation applied successfully!")
print(f"Is Streaming: {aggregated_stream.isStreaming}")
print("\nAggregated Schema:")
aggregated_stream.printSchema()

## Register Aggregated View

Make the aggregated results available as a temporary view.

In [None]:
# Register aggregated stream
aggregated_stream.createOrReplaceTempView("transaction_metrics")

print("Registered as 'transaction_metrics' view")
print("This view contains real-time aggregated KPIs")

## Test Aggregation Output

Display the aggregated metrics to verify calculations.

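**Note**: This aggregation has no watermark, so `append` mode is not available. We use `complete` output mode, which emits the entire result table on each trigger; `update` mode would also work and would emit only the changed rows.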

In [None]:
# Write aggregated results to console
aggregation_test = aggregated_stream.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .option("numRows", 50) \
    .trigger(processingTime='10 seconds') \
    .start()

print("Aggregation query started...")
print(f"Query ID: {aggregation_test.id}")
print("Output Mode: complete")
print("\nNote: Complete mode outputs the full result table on each trigger")

In [None]:
# Let it run to show results
import time
time.sleep(25)

# Stop test query
aggregation_test.stop()
print("Aggregation test stopped.")
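
A fixed `time.sleep` keeps waiting even if the query has already failed. As an alternative sketch, the hypothetical helper `run_for` below uses `StreamingQuery.awaitTermination(timeout=...)`, which returns early when the query terminates and raises if it terminated with an error, and it always stops the query afterwards.

```python
def run_for(query, seconds: float) -> bool:
    """Let a streaming query run for up to `seconds`, then stop it.

    Returns True if the query terminated on its own before the timeout.
    """
    try:
        # Blocks until the query terminates or the timeout elapses
        return query.awaitTermination(timeout=seconds)
    finally:
        # Stop the query whether the wait returned normally or raised
        query.stop()


# Usage in this notebook (requires a running query):
# run_for(aggregation_test, 25)
```

The `try`/`finally` guarantees the test query does not keep running in the background if the cell is interrupted.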

## Create Additional Analytics Views

Generate secondary metrics by querying the aggregated view.

In [None]:
# Top performing categories by revenue
top_categories_sql = """
SELECT 
    product_category,
    total_revenue,
    total_transactions,
    avg_transaction_value,
    high_value_transactions
FROM transaction_metrics
ORDER BY total_revenue DESC
LIMIT 10
"""

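# ORDER BY on a streaming aggregate is only supported when the query is
# written in complete output mode.
top_categories = spark.sql(top_categories_sql)

print("Created top categories streaming DataFrame")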
print(f"Is Streaming: {top_categories.isStreaming}")

## Monitor Stream Health

Check the status of all active streaming queries.

In [None]:
# List all active streaming queries
active_streams = spark.streams.active
print(f"Active Streams: {len(active_streams)}\n")

for stream in active_streams:
    print(f"Stream ID: {stream.id}")
    print(f"Name: {stream.name}")
    print(f"Status: {stream.status['message']}")
    print("-" * 60)
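
Beyond the status message, each query reports throughput through `lastProgress`. The helper below is a small sketch that formats a few of its fields; `summarize_progress` is a hypothetical name, and the code assumes `lastProgress` is a dict (as in PySpark 3.x) or `None` before the first micro-batch completes.

```python
def summarize_progress(progress) -> str:
    """Format key throughput fields from a StreamingQuery.lastProgress dict."""
    if progress is None:
        return "no micro-batch completed yet"
    return (
        f"batch={progress.get('batchId')} "
        f"rows={progress.get('numInputRows')} "
        f"input/s={progress.get('inputRowsPerSecond')} "
        f"processed/s={progress.get('processedRowsPerSecond')}"
    )


# A hand-written dict standing in for query.lastProgress
sample = {
    "batchId": 3,
    "numInputRows": 120,
    "inputRowsPerSecond": 12.0,
    "processedRowsPerSecond": 48.0,
}
print(summarize_progress(sample))
```

In the loop above, this could be called as `summarize_progress(stream.lastProgress)`. A processed rate that stays below the input rate suggests the query is falling behind.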

## Aggregation Columns Summary

In [None]:
# Display aggregated columns
print("Aggregated Metrics Available:")
for col_name in aggregated_stream.columns:
    print(f"  - {col_name}")

## Summary

This notebook successfully:

1. Loaded aggregation logic from an external SQL file
2. Performed streaming aggregations using `GROUP BY`
3. Calculated real-time KPIs and metrics
4. Used complete output mode for a stateful aggregation without a watermark
5. Created derived analytics views

**Key Metrics Calculated:**
- Total transaction counts by category
- Revenue totals and averages
- High-value transaction counts
- Completion rates
- Unique customer counts

**Aggregation Considerations:**
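- **Complete mode**: Used here because the aggregation has no event-time watermark, which rules out append mode; update mode is the other option
- **State management**: Spark maintains aggregation state across micro-batches
- **Memory**: Without a watermark, state is kept for every group, so high-cardinality keys can be memory-intensive; complete mode also re-emits the full table on each trigger
- **Production**: Consider windowed aggregations with watermarks so that append mode can be used and old state can be dropped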
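
As a rough sketch of the production suggestion, the code below builds a tumbling-window aggregation in Spark SQL and applies the watermark on the DataFrame before exposing it to SQL. Several names are assumptions: `event_timestamp` stands in for whatever timestamp column `transformations.sql` produces (that file is not shown here), and `watermarked_transactions`, `build_windowed_sql`, and `start_windowed_query` are hypothetical names.

```python
# Hypothetical column name; the real one is defined in transformations.sql
EVENT_TS_COL = "event_timestamp"


def build_windowed_sql(view_name: str, window_duration: str = "1 hour") -> str:
    """Return Spark SQL for a tumbling-window revenue aggregation."""
    return f"""
SELECT
    window.start AS window_start,
    window.end AS window_end,
    product_category,
    SUM(amount) AS total_revenue,
    COUNT(*) AS total_transactions
FROM {view_name}
GROUP BY window({EVENT_TS_COL}, '{window_duration}'), product_category
""".strip()


def start_windowed_query(spark, transformed_stream, delay: str = "10 minutes"):
    """Apply a watermark, run the windowed SQL, and write it in append mode."""
    # The watermark is set on the DataFrame, then the result is exposed to SQL
    transformed_stream.withWatermark(EVENT_TS_COL, delay) \
        .createOrReplaceTempView("watermarked_transactions")
    windowed = spark.sql(build_windowed_sql("watermarked_transactions"))
    # Append mode emits a window only after the watermark has passed its end
    return windowed.writeStream.outputMode("append").format("console").start()


print(build_windowed_sql("watermarked_transactions"))
```

The delay is a trade-off: a longer delay tolerates later-arriving events but postpones output and keeps more state.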

**Next Steps:**
- Proceed to notebook 05 to write results to a sink
- Configure checkpointing for fault tolerance
- Write aggregated data to Parquet format
- Set up monitoring and alerting