# Lab 4: Gold Layer - Aggregations & Features

## 🎯 **Learning Objectives:**
- Aggregate Silver data with windowed operations
- Create real-time metrics (avg price, volume, volatility)
- Build feature tables for ML
- Write aggregated data to Gold layer
- Query Gold tables for analytics

## 📚 **Key Concepts:**
1. **Gold Layer**: Aggregated, feature-rich data
2. **Windowed Aggregations**: Time-based aggregations (1-min, 5-min)
3. **Real-time Metrics**: Average price, total volume, price volatility
4. **Feature Engineering**: Create features for ML models
5. **Analytics Ready**: Data ready for BI, ML, dashboards
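
As a rough illustration of the volatility metric named above, the sketch below computes the sample standard deviation of the prices inside one window, next to the simpler max-minus-min range. The prices are made-up sample values and the helper names are hypothetical; treating standard deviation as "volatility" is an assumption made for illustration. In PySpark the matching aggregate would be `stddev("price")`, which also returns the sample standard deviation.

```python
import builtins
import statistics

# Hypothetical trade prices that fall into one 1-minute window for one symbol.
window_prices = [100.0, 102.0, 101.0, 99.0, 103.0]


def price_range(prices):
    """Spread between the highest and lowest price in the window."""
    # builtins.* is used explicitly because the notebook's wildcard import
    # from pyspark.sql.functions shadows the built-in max and min.
    return builtins.max(prices) - builtins.min(prices)


def price_volatility(prices):
    """Sample standard deviation of prices; undefined for fewer than 2 trades."""
    if len(prices) < 2:
        return None
    return statistics.stdev(prices)


rng = price_range(window_prices)
vol = price_volatility(window_prices)
print(f"price_range={rng}, price_volatility={vol:.4f}")
```

The range only looks at the two extreme trades, while the standard deviation uses every trade in the window, so the two can rank windows differently.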


In [None]:
# Setup
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time

spark = SparkSession.builder \
    .appName("StreamingLakehouseGold") \
    .master("spark://spark-master:7077") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

SILVER_TABLE_PATH = "/warehouse/silver/trades"
GOLD_TABLE_PATH = "/warehouse/gold/trade_metrics"

print("🚀 Spark Session initialized for Gold Layer!")


## Exercise 1: Windowed Aggregations

### Aggregations:
- **1-minute windows**: Real-time metrics
- **By symbol**: Per-stock aggregations
- **Metrics**: avg_price, max_price, min_price, price_range (a simple volatility proxy), total_volume, total_value, trade_count
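
To make the grouping concrete, here is a minimal pure-Python sketch of what a 1-minute tumbling window keyed by symbol computes. It does not use Spark. The sample trades and the helper names `window_start` and `aggregate` are hypothetical, and the sketch assumes windows are aligned to whole minutes.

```python
import builtins
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts):
    """Floor a timestamp to the start of its 1-minute tumbling window."""
    return ts - ((ts - EPOCH) % WINDOW)


def aggregate(trades):
    """Group trades by (window_start, symbol) and compute per-group metrics."""
    groups = defaultdict(list)
    for trade in trades:
        key = (window_start(trade["event_timestamp"]), trade["symbol"])
        groups[key].append(trade)

    rows = []
    for start, symbol in sorted(groups):
        items = groups[(start, symbol)]
        prices = [t["price"] for t in items]
        # builtins.* is explicit because the notebook's wildcard import from
        # pyspark.sql.functions shadows the built-in sum, max, and min.
        rows.append({
            "window_start": start,
            "window_end": start + WINDOW,
            "symbol": symbol,
            "avg_price": builtins.sum(prices) / len(prices),
            "max_price": builtins.max(prices),
            "min_price": builtins.min(prices),
            "price_range": builtins.max(prices) - builtins.min(prices),
            "total_volume": builtins.sum(t["volume"] for t in items),
            "trade_count": len(items),
        })
    return rows


# Hypothetical sample trades.
base = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)
trades = [
    {"symbol": "AAPL", "event_timestamp": base + timedelta(seconds=5), "price": 100.0, "volume": 10},
    {"symbol": "AAPL", "event_timestamp": base + timedelta(seconds=40), "price": 102.0, "volume": 5},
    {"symbol": "AAPL", "event_timestamp": base + timedelta(seconds=70), "price": 101.0, "volume": 7},
    {"symbol": "MSFT", "event_timestamp": base + timedelta(seconds=20), "price": 300.0, "volume": 2},
]

rows = aggregate(trades)
for row in rows:
    print(row["window_start"].time(), row["symbol"], row["avg_price"], row["total_volume"])
```

The third AAPL trade arrives 70 seconds after 09:30, so it lands in the next window and produces its own row.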


In [None]:
# Windowed Aggregations
print("📊 Exercise 1: Windowed Aggregations")
print("=" * 60)

# Read Silver as a stream
silver_stream = spark.readStream \
    .format("parquet") \
    .schema(spark.read.parquet(SILVER_TABLE_PATH).schema) \
    .load(SILVER_TABLE_PATH)

print("1️⃣ Reading Silver stream...")

# Add a watermark: accept events that arrive up to 5 minutes late
with_watermark = silver_stream \
    .withWatermark("event_timestamp", "5 minutes")

print("2️⃣ Windowed aggregations (1-minute windows):")

gold_aggregated = with_watermark \
    .groupBy(
        window(col("event_timestamp"), "1 minute"),
        col("symbol")
    ) \
    .agg(
        avg("price").alias("avg_price"),
        max("price").alias("max_price"),
        min("price").alias("min_price"),
        (max("price") - min("price")).alias("price_range"),
        sum("volume").alias("total_volume"),
        sum("total_value").alias("total_value"),
        count("*").alias("trade_count")
    ) \
    .select(
        col("window.start").alias("window_start"),
        col("window.end").alias("window_end"),
        col("symbol"),
        col("avg_price"),
        col("max_price"),
        col("min_price"),
        col("price_range"),
        col("total_volume"),
        col("total_value"),
        col("trade_count")
    )

print("✅ Aggregations defined!")
gold_aggregated.printSchema()
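
The 5-minute watermark above controls two things: how late an event may arrive and still be counted, and when a window's result can be treated as final. The following is a simplified pure-Python model of that rule, not Spark's implementation. The function names are hypothetical, and the exact boundary comparisons are an assumption; Spark also updates its watermark between micro-batches rather than per event.

```python
from datetime import datetime, timedelta

# Matches withWatermark("event_timestamp", "5 minutes").
DELAY = timedelta(minutes=5)


def watermark(max_event_time):
    """The watermark trails the latest event time seen by the allowed delay."""
    return max_event_time - DELAY


def is_too_late(event_time, max_event_time):
    """Events older than the watermark may be dropped from aggregation."""
    return event_time < watermark(max_event_time)


def is_final(window_end, max_event_time):
    """A window is final once the watermark has reached its end."""
    return window_end <= watermark(max_event_time)


# Hypothetical situation: the newest event seen so far is at 09:40:00.
max_seen = datetime(2024, 1, 1, 9, 40, 0)

print("watermark:", watermark(max_seen).time())
print("window ending 09:31 final?", is_final(datetime(2024, 1, 1, 9, 31), max_seen))
print("window ending 09:37 final?", is_final(datetime(2024, 1, 1, 9, 37), max_seen))
print("event at 09:34:59 too late?", is_too_late(datetime(2024, 1, 1, 9, 34, 59), max_seen))
```

A longer delay tolerates more late data but keeps window state in memory longer and postpones final results by the same amount.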


## Exercise 2: Write to Gold Layer

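### Output Modes:
- Append mode: writes each window once, after the watermark has passed its end. File sinks such as Parquet support only this mode.
- Update mode: re-emits a window every time its aggregates change. It needs a sink that can overwrite rows, for example one written with `foreachBatch`.
- Complete mode: re-emits the full result table on every trigger (alternative).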
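
To show how output modes differ in what they emit per micro-batch, here is a small pure-Python simulation that counts trades per 1-minute window. It is a simplified model, not Spark's engine: the function `run` and the sample batches are hypothetical, the watermark is recomputed right after each batch, every batch is assumed non-empty, and late-event dropping is left out.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)
DELAY = timedelta(minutes=5)  # watermark delay


def window_start(ts):
    """Floor a timestamp to the start of its 1-minute window."""
    return ts.replace(second=0, microsecond=0)


def run(batches):
    """Return what update mode and append mode would emit after each batch."""
    counts = {}       # window_start -> running trade count (the state)
    emitted = set()   # windows that append mode has already written
    max_seen = None
    update_out, append_out = [], []

    for batch in batches:
        changed = set()
        for ts in batch:
            start = window_start(ts)
            counts[start] = counts.get(start, 0) + 1
            changed.add(start)
            if max_seen is None or ts > max_seen:
                max_seen = ts

        # Update mode: every window touched in this batch, with its new value.
        update_out.append({w: counts[w] for w in sorted(changed)})

        # Append mode: only windows the watermark has passed, each one once.
        wm = max_seen - DELAY
        final = [w for w in sorted(counts) if w + WINDOW <= wm and w not in emitted]
        emitted.update(final)
        append_out.append({w: counts[w] for w in final})

    return update_out, append_out


def at(minute, second):
    return datetime(2024, 1, 1, 9, minute, second)


# Hypothetical micro-batches of event timestamps.
batches = [
    [at(30, 10), at(30, 50)],
    [at(30, 55), at(31, 5)],
    [at(37, 0)],
]

update_out, append_out = run(batches)
w0, w1, w7 = at(30, 0), at(31, 0), at(37, 0)
for i, (upd, app) in enumerate(zip(update_out, append_out), start=1):
    print(f"batch {i}: update emits {len(upd)} row(s), append emits {len(app)} row(s)")
```

In this model update mode emits the 09:30 window twice with different counts, so the sink must be able to replace the earlier row. Append mode emits each window a single time, which is what lets an immutable file format accept it.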


In [None]:
# Write to Gold Layer
print("💾 Exercise 2: Write to Gold Layer")
print("=" * 60)

# The Parquet file sink supports only append mode; "update" fails at start().
# In append mode each window is written once, after the watermark passes its end.
gold_query = gold_aggregated \
    .writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", GOLD_TABLE_PATH) \
    .option("checkpointLocation", "/tmp/gold_checkpoint") \
    .trigger(processingTime='10 seconds') \
    .start()

print("✅ Gold streaming query started!")
print(f"   Writing to: {GOLD_TABLE_PATH}")
print("   Output mode: Append")

print("\n💡 Gold layer characteristics:")
print("   ✅ Aggregated metrics")
print("   ✅ Windowed by time")
print("   ✅ Ready for analytics/ML")


## Exercise 3: Query Gold Data

### Use Cases:
- Real-time dashboards
- Historical analysis
- ML feature extraction
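
As one possible shape for the ML feature extraction use case, the sketch below derives two features from the Gold rows of a single symbol: a volume-weighted average price (VWAP) and the percentage change in `avg_price` from the previous window. The sample rows are invented and the feature names `vwap` and `avg_price_change_pct` are hypothetical. The VWAP formula assumes `total_value` is the sum of price times volume per trade, which this lab does not state.

```python
# Hypothetical Gold rows for one symbol, already ordered by window_start.
gold_rows = [
    {"window_start": "09:30", "avg_price": 100.0, "total_volume": 20, "total_value": 2010.0},
    {"window_start": "09:31", "avg_price": 102.0, "total_volume": 10, "total_value": 1025.0},
    {"window_start": "09:32", "avg_price": 101.0, "total_volume": 0, "total_value": 0.0},
]


def add_features(rows):
    """Return copies of the rows with vwap and avg_price_change_pct added."""
    features = []
    previous = None
    for row in rows:
        # VWAP is undefined when no volume traded in the window.
        vwap = row["total_value"] / row["total_volume"] if row["total_volume"] else None

        # Percentage change against the previous window; None for the first row.
        change = None
        if previous is not None and previous["avg_price"]:
            change = (row["avg_price"] - previous["avg_price"]) / previous["avg_price"] * 100

        features.append({**row, "vwap": vwap, "avg_price_change_pct": change})
        previous = row
    return features


features = add_features(gold_rows)
for f in features:
    print(f["window_start"], f["vwap"], f["avg_price_change_pct"])
```

Features that look back at earlier windows depend on row order, so the input must be sorted by `window_start` within each symbol before they are computed.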


In [None]:
# Query Gold Data
print("🔍 Exercise 3: Query Gold Data")
print("=" * 60)

# Give the stream time to commit at least one micro-batch. If the Gold
# path is still empty, wait longer and re-run this cell.
time.sleep(15)

try:
    gold_df = spark.read.parquet(GOLD_TABLE_PATH)
    
    print(f"✅ Gold data found! Records: {gold_df.count()}")
    
    print("\n1️⃣ Latest metrics by symbol:")
    gold_df.orderBy(desc("window_start")).show(10, truncate=False)
    
    print("\n2️⃣ Average price by symbol:")
    gold_df.groupBy("symbol").agg(avg("avg_price").alias("overall_avg_price")).show()
    
    print("\n3️⃣ Total volume by symbol:")
    gold_df.groupBy("symbol").agg(sum("total_volume").alias("total_vol")).orderBy(desc("total_vol")).show()
    
except Exception as e:
    print(f"⚠️  Error: {e}")


## Summary

### ✅ What we learned:
1. **Gold Layer**: Aggregated, analytics-ready data
2. **Windowed Aggregations**: Time-based metrics
3. **Real-time Metrics**: Price, volume, volatility
4. **Output Modes**: Update re-emits changed windows; file sinks such as Parquet accept only append
5. **Query Gold**: Real-time and historical analytics

### 🚀 Next Steps:
- Lab 5: Unified batch + streaming (same code)
- Lab 6: Querying and time travel
