# Lab 5: Unified Batch & Streaming - Same Code

## 🎯 **Learning Objectives:**
- Write shared transformation functions
- Apply the same code to batch and streaming
- Process historical data with the same logic
- Write to the same Iceberg tables
- Compare results and performance

## 📚 **Key Concepts:**
1. **Unified Code**: The same transformation for batch and streaming
2. **Code Reuse**: Write once, use for both
3. **Same Storage**: The same Iceberg tables
4. **Consistency**: Same logic → same results


In [None]:
# Setup
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("UnifiedBatchStreaming") \
    .master("spark://spark-master:7077") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

print("🚀 Spark Session initialized!")
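
The session above sets no Iceberg catalog, yet the objectives include writing to Iceberg tables. The settings below are a minimal sketch of what that could look like, assuming a Hadoop-type catalog and an Iceberg Spark runtime JAR already on the classpath. The catalog name `lakehouse`, the warehouse path, and the helper `with_iceberg` are hypothetical and not part of the lab.

```python
# Sketch only: the catalog name "lakehouse" and the warehouse path are
# hypothetical placeholders; adjust them to your environment.
ICEBERG_CONF = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.lakehouse": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lakehouse.type": "hadoop",
    "spark.sql.catalog.lakehouse.warehouse": "/warehouse/iceberg",
}


def with_iceberg(builder, conf=ICEBERG_CONF):
    """Apply each Iceberg setting to a SparkSession builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

The helper would be called before the session is created, for example `with_iceberg(SparkSession.builder.appName("UnifiedBatchStreaming")).getOrCreate()`, because settings such as `spark.sql.extensions` are read when the session starts.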


## Exercise 1: Shared Transformation Function

### Key Idea:
Write one function that serves both batch and streaming.


In [None]:
# Shared Transformation Function
print("🔧 Exercise 1: Shared Transformation Function")
print("=" * 60)

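def process_stock_trades(df):
    """
    Process stock trades; works for both batch and streaming DataFrames.
    Same logic, different triggers.

    Expects the columns symbol, price, volume, and event_timestamp.
    Note: sum below is pyspark.sql.functions.sum, which shadows the
    Python builtin after the wildcard import in the setup cell.
    """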
    return (df
        .filter(col("price") > 0)
        .withColumn("total_value", col("price") * col("volume"))
        .withColumn("symbol", upper(trim(col("symbol"))))
        .groupBy("symbol", window(col("event_timestamp"), "1 minute"))
        .agg(
            avg("price").alias("avg_price"),
            sum("volume").alias("total_volume"),
            sum("total_value").alias("total_value"),
            count("*").alias("trade_count")
        )
        .select(
            col("window.start").alias("window_start"),
            col("window.end").alias("window_end"),
            col("symbol"),
            col("avg_price"),
            col("total_volume"),
            col("total_value"),
            col("trade_count")
        )
    )

print("✅ Shared function defined!")
print("\n💡 This function works for:")
print("   ✅ Streaming DataFrames (readStream)")
print("   ✅ Batch DataFrames (read)")
print("   ✅ Same logic, same results!")
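
To check the function's output by hand, it can help to have a plain-Python mirror of the same steps: filter, normalize the symbol, assign each trade to a one-minute tumbling window, then aggregate. The sketch below is such a reference, not Spark code. The names `window_start` and `reference_aggregate` and the sample trades are made up for illustration, timestamps are assumed to be timezone-aware, and null handling is ignored.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts, width=timedelta(minutes=1)):
    """Start of the tumbling window containing ts (windows align to the epoch)."""
    return ts - (ts - EPOCH) % width


def reference_aggregate(trades):
    """Plain-Python mirror of process_stock_trades for small hand checks.

    trades: iterable of dicts with symbol, price, volume, event_timestamp.
    Returns {(symbol, window_start): stats}.
    """
    groups = defaultdict(list)
    for trade in trades:
        if trade["price"] > 0:  # same filter as the Spark function
            symbol = trade["symbol"].strip(" ").upper()
            groups[(symbol, window_start(trade["event_timestamp"]))].append(trade)
    return {
        key: {
            "avg_price": sum(t["price"] for t in rows) / len(rows),
            "total_volume": sum(t["volume"] for t in rows),
            "total_value": sum(t["price"] * t["volume"] for t in rows),
            "trade_count": len(rows),
        }
        for key, rows in groups.items()
    }


# Made-up sample: two trades fall in the 09:30 window, one in the 09:31
# window, and one is dropped because its price is not positive.
t0 = datetime(2024, 1, 1, 9, 30, 5, tzinfo=timezone.utc)
sample = [
    {"symbol": " aapl ", "price": 100.0, "volume": 10, "event_timestamp": t0},
    {"symbol": "AAPL", "price": 102.0, "volume": 5,
     "event_timestamp": t0 + timedelta(seconds=40)},
    {"symbol": "AAPL", "price": 0.0, "volume": 7, "event_timestamp": t0},
    {"symbol": "AAPL", "price": 101.0, "volume": 1,
     "event_timestamp": t0 + timedelta(seconds=70)},
]
expected = reference_aggregate(sample)
```

Feeding the same rows to `process_stock_trades` through `spark.createDataFrame` should give matching values per symbol and window, which is one way to test the shared function once for both use cases.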


## Exercise 2: Apply to Streaming

### Streaming: Real-time processing from Kafka


In [None]:
# Apply to Streaming
print("📡 Exercise 2: Apply to Streaming")
print("=" * 60)

# Read from Kafka (streaming)
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "stock-trades") \
    .option("startingOffsets", "earliest") \
    .load()

# Parse JSON (simplified - use actual schema)
# ... parsing code ...

# Apply shared function
# stream_result = process_stock_trades(parsed_stream)

print("✅ Streaming pipeline with shared function!")
print("   Same function, streaming trigger")
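
The cell above leaves the parsing and the write step open. One possible way to fill them in is sketched here, assuming the Kafka value is a JSON object whose fields match the columns used by `process_stock_trades`. The field types, the helper names `parse_trades` and `start_streaming_job`, and the two-minute watermark delay are assumptions, not part of the lab. The watermark is added before the shared function is called because Spark rejects append output mode for a streaming aggregation that has no watermark; on a batch DataFrame `withWatermark` has no effect, so the shared function itself stays unchanged.

```python
# Assumed message layout: the field names come from process_stock_trades,
# the types are a guess.
TRADE_SCHEMA_DDL = (
    "symbol STRING, price DOUBLE, volume BIGINT, event_timestamp TIMESTAMP"
)


def parse_trades(kafka_df, schema_ddl=TRADE_SCHEMA_DDL):
    """Decode the Kafka value bytes as JSON and flatten the fields to columns."""
    return (
        kafka_df
        .selectExpr("CAST(value AS STRING) AS json")
        .selectExpr(f"from_json(json, '{schema_ddl}') AS trade")
        .select("trade.*")
    )


def start_streaming_job(kafka_df, transform, table, checkpoint,
                        delay="2 minutes"):
    """Parse, watermark, apply the shared transform, and append to a table."""
    parsed = parse_trades(kafka_df).withWatermark("event_timestamp", delay)
    return (
        transform(parsed)
        .writeStream
        .format("iceberg")
        .outputMode("append")
        .option("checkpointLocation", checkpoint)
        .toTable(table)
    )
```

A call might look like `query = start_streaming_job(kafka_stream, process_stock_trades, "lakehouse.db.trade_stats_1m", "/tmp/checkpoints/trade_stats_1m")`, where the table name and checkpoint path are placeholders. The Iceberg documentation advises creating the target table before the query starts. In append mode each window is written once, after the watermark has passed the window's end.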


## Exercise 3: Apply to Batch (Same Code!)

### Batch: Historical data processing
### Same function, same results!


In [None]:
# Apply to Batch (Same Code!)
print("📦 Exercise 3: Apply to Batch (Same Code!)")
print("=" * 60)

# Read historical data (batch)
# batch_df = spark.read.parquet("/historical/trades/")

# Apply SAME function!
# batch_result = process_stock_trades(batch_df)

print("✅ Batch pipeline with SAME function!")
print("   Same function, batch trigger")
print("   Same logic → Same results!")

print("\n💡 Key Benefits:")
print("   ✅ Code reuse: Write once, use twice")
print("   ✅ Consistency: Same logic → same results")
print("   ✅ Maintainability: Only 1 codebase to maintain")
print("   ✅ Testing: Test 1 function for both use cases")
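
The batch side and the "compare results" objective could be sketched as follows. The helpers `run_batch_job` and `tables_match` are hypothetical names, and the sketch assumes the historical Parquet files carry the same columns as the parsed stream. Note that appending batch and streaming output for overlapping time ranges into one table would duplicate windows, so a comparison is easier against separate tables or disjoint time ranges.

```python
def run_batch_job(spark, transform, source_path, table):
    """Read historical Parquet, apply the shared transform, append to a table."""
    result = transform(spark.read.parquet(source_path))
    result.writeTo(table).append()
    return result


def tables_match(left, right):
    """True when two DataFrames hold the same rows, duplicates included."""
    return (left.exceptAll(right).count() == 0
            and right.exceptAll(left).count() == 0)
```

For example, `run_batch_job(spark, process_stock_trades, "/historical/trades/", "lakehouse.db.trade_stats_1m_batch")` reuses the path from the commented code above with a placeholder table name, and `tables_match(spark.table(a), spark.table(b))` then checks two result tables against each other. Because `avg_price` is a floating-point value, an exact row comparison may report tiny differences when the summation order differs; rounding before comparing is one way around that.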


## Summary

### ✅ Key Takeaways:
1. **Unified Code**: The same function for batch and streaming
2. **Code Reuse**: Write once, use for both
3. **Consistency**: Same logic → same results
4. **Maintainability**: 1 codebase instead of 2
5. **This is the power of Streaming Lakehouse!**

### 🎯 Why This Matters:
- **Lambda**: Separate batch and speed layers, so the logic is written twice
- **Kappa**: Streaming only; there is no batch path
- **Streaming Lakehouse**: 1 function, both use cases ✅
