# Lab 6: Query, Time Travel & Advanced Features

## 🎯 **Learning Objectives:**
- Query Gold tables (real-time + historical)
- Use time travel queries
- Understand schema evolution with streaming
- Compare snapshots
- Integration patterns

## 📚 **Key Concepts:**
1. **Unified Query**: Real-time and historical data from the same tables
2. **Time Travel**: Query data as of any point in time
3. **Schema Evolution**: Add/modify columns with streaming writes
4. **Snapshots**: Version history of the data
5. **Integration**: BI tools, ML models


In [None]:
# Setup
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .appName("QueryTimeTravel") \
    .master("spark://spark-master:7077") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

GOLD_TABLE_PATH = "/warehouse/gold/trade_metrics"

print("🚀 Spark Session initialized!")


## Exercise 1: Query Real-time + Historical

### Key Point:
Same table, query both real-time and historical data!


In [None]:
# Query Real-time + Historical
print("🔍 Exercise 1: Query Real-time + Historical")
print("=" * 60)

gold_df = spark.read.parquet(GOLD_TABLE_PATH)

print("1️⃣ Latest data (real-time):")
gold_df.orderBy(desc("window_start")).show(10)

print("\n2️⃣ Historical data (last hour):")
from datetime import datetime, timedelta
one_hour_ago = datetime.now() - timedelta(hours=1)
gold_df.filter(col("window_start") >= one_hour_ago).show()

print("\n3️⃣ All-time aggregates:")
gold_df.groupBy("symbol").agg(
    avg("avg_price").alias("all_time_avg"),
    sum("total_volume").alias("all_time_volume")
).show()

print("\n💡 Same table, real-time + historical!")


## Exercise 2: Time Travel Queries

### With Iceberg:
```sql
SELECT * FROM gold.trade_metrics 
VERSION AS OF 17;

SELECT * FROM gold.trade_metrics 
TIMESTAMP AS OF '2025-01-15 10:30:00';
```
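
The same two reads can also be expressed through the DataFrame reader. The following is a minimal sketch, assuming the Iceberg Spark runtime is on the classpath and that a catalog exposes `gold.trade_metrics`. The helper names (`to_epoch_millis`, `read_at_snapshot`, `read_as_of`) are illustrative, and treating the timestamp string as UTC is an assumption. It relies on Iceberg's `snapshot-id` and `as-of-timestamp` read options, the latter expressed in epoch milliseconds.

```python
from datetime import datetime, timezone

TABLE = "gold.trade_metrics"  # table name taken from the SQL above


def to_epoch_millis(ts: str) -> int:
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) to epoch milliseconds."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def read_at_snapshot(spark, snapshot_id: int):
    """Read the table as it was at a given Iceberg snapshot ID."""
    return (
        spark.read.format("iceberg")
        .option("snapshot-id", snapshot_id)
        .load(TABLE)
    )


def read_as_of(spark, ts: str):
    """Read the table as it was at a given wall-clock time."""
    return (
        spark.read.format("iceberg")
        .option("as-of-timestamp", to_epoch_millis(ts))
        .load(TABLE)
    )
```

With a working Iceberg setup, `read_as_of(spark, "2025-01-15 10:30:00").show()` would correspond to the `TIMESTAMP AS OF` query above, up to the time zone assumption.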

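### Note:
Full time travel requires the Iceberg Spark runtime JAR on the classpath, so the next cell only prints the query patterns instead of running them. In Iceberg, the value after `VERSION AS OF` is a snapshot ID (or a branch or tag name), so `17` stands in for a real ID taken from the table's `snapshots` metadata table.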
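
Comparing two snapshots follows from the same syntax. This is a rough sketch, assuming an Iceberg-enabled session. The function names are hypothetical and the snapshot IDs are placeholders that would come from the `snapshots` metadata table. It builds a query returning the rows present in the newer snapshot but not in the older one.

```python
def snapshot_diff_sql(table: str, old_snapshot: int, new_snapshot: int) -> str:
    """Build SQL for rows that exist in new_snapshot but not in old_snapshot."""
    old_id, new_id = int(old_snapshot), int(new_snapshot)  # reject non-numeric IDs
    return (
        f"SELECT * FROM {table} VERSION AS OF {new_id} "
        f"EXCEPT "
        f"SELECT * FROM {table} VERSION AS OF {old_id}"
    )


def added_rows(spark, table: str, old_snapshot: int, new_snapshot: int):
    """Run the diff query; requires the Iceberg runtime to be available."""
    return spark.sql(snapshot_diff_sql(table, old_snapshot, new_snapshot))
```

Swapping the two IDs gives the rows that were removed or changed between the snapshots.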


In [None]:
# Time Travel Queries
print("⏰ Exercise 2: Time Travel Queries")
print("=" * 60)

print("💡 With Iceberg, you can query historical versions:")
print("\n1️⃣ Query by version:")
print("   spark.sql('SELECT * FROM gold.trade_metrics VERSION AS OF 17')")

print("\n2️⃣ Query by timestamp:")
print("   spark.sql(\"SELECT * FROM gold.trade_metrics TIMESTAMP AS OF '2025-01-15 10:30:00'\")")

print("\n3️⃣ List snapshots:")
print("   spark.sql('SELECT * FROM gold.trade_metrics.snapshots')")

print("\n💡 Time travel benefits:")
print("   ✅ Query data at any point in time")
print("   ✅ Compare before/after transformations")
print("   ✅ Rollback if needed")
print("   ✅ Audit trail")
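
Schema evolution is listed in the objectives but has no cell of its own. The following is a minimal sketch of adding a column, assuming `gold.trade_metrics` is an Iceberg table. The column `trade_count` and both function names are hypothetical, chosen only for illustration. Rows written before the change read the new column as `NULL`, and writers that include the new column can populate it afterwards.

```python
def add_column_sql(table: str, column: str, spark_type: str) -> str:
    """Build an ALTER TABLE statement that appends one column."""
    return f"ALTER TABLE {table} ADD COLUMNS ({column} {spark_type})"


def evolve_schema(spark, table: str = "gold.trade_metrics"):
    """Add the hypothetical trade_count column and read it back.

    Requires an Iceberg-enabled session; existing rows return NULL
    for the newly added column.
    """
    spark.sql(add_column_sql(table, "trade_count", "BIGINT"))
    return spark.table(table).select("symbol", "trade_count")
```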


## Summary

### ✅ What we learned:
1. **Unified Query**: Real-time + historical data from the same tables
2. **Time Travel**: Query data as of any point in time
3. **Schema Evolution**: Modify the schema with streaming
4. **Snapshots**: Version history
5. **Integration**: Ready for BI, ML

### 🎯 Streaming Lakehouse Advantages:
- ✅ Unified storage (Iceberg)
- ✅ Unified code (batch + streaming)
- ✅ Unified query (real-time + historical)
- ✅ Time travel capabilities
- ✅ ACID transactions

### 🚀 This is the future of data architecture!
