# Lab 2: Spark Streaming with Kafka

## 🎯 **Learning Objectives:**
- Master Spark Structured Streaming
- Learn real-time data processing patterns
- Practice Kafka integration with Spark
- Understand streaming analytics concepts
- Implement real-time aggregations

## 📚 **Key Concepts:**
1. **Structured Streaming**: Real-time data processing framework
2. **Kafka Integration**: Reading from Kafka topics
3. **Stream Processing**: Continuous data transformation
4. **Watermarking**: Handling late-arriving data
5. **Checkpointing**: Ensuring fault tolerance

## 🏗️ **Architecture Overview:**
```
┌─────────────────┐    ┌────────────────────┐    ┌─────────────────┐
│   Kafka Topic   │───▶│  Spark Streaming   │───▶│   Real-time     │
│   (Stock Data)  │    │      Engine        │    │   Analytics     │
│                 │    │                    │    │                 │
└─────────────────┘    └────────────────────┘    └─────────────────┘
         │                        │                       │
         ▼                        ▼                       ▼
┌─────────────────┐    ┌────────────────────┐    ┌─────────────────┐
│ High-frequency  │    │ Stream Processing  │    │ Output Sinks    │
│ Data Producer   │    │ • Aggregations     │    │ • Console       │
│                 │    │ • Window Functions │    │ • Memory        │
│                 │    │ • Watermarking     │    │ • File System   │
└─────────────────┘    └────────────────────┘    └─────────────────┘
```

## 📊 **Streaming Use Cases:**
- **Real-time Analytics**: Live dashboards and metrics
- **Alert Systems**: Immediate notifications and triggers
- **Data Pipeline**: Continuous ETL processes
- **Monitoring**: Real-time system health checks


In [None]:
# Install and Import Dependencies
%pip install pyspark findspark pandas numpy pyarrow kafka-python

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.streaming import StreamingQuery
import json
import time
import threading
from datetime import datetime, timedelta  # timedelta is used in the exercises below
import random

print("✅ Dependencies installed and imported successfully!")


In [None]:
# Initialize Spark Session for Streaming
spark = SparkSession.builder \
    .appName("SparkStreamingLab") \
    .master("spark://spark-master:7077") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.streaming.checkpointLocation", "/tmp/checkpoint") \
    .config("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "false") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

# Set log level to reduce verbosity
spark.sparkContext.setLogLevel("WARN")

print("🚀 Spark Streaming Session initialized successfully!")
print(f"📊 Spark Version: {spark.version}")
print(f"🔗 Master URL: {spark.sparkContext.master}")
print("📁 Checkpoint Location: /tmp/checkpoint")


## Exercise 1: Event-Time vs Processing-Time

### 🎯 **Learning Objectives:**
- Understand the difference between event-time and processing-time
- Learn why event-time is important for accurate analytics
- Practice working with event timestamps
- Understand late-arriving data challenges

### 📚 **Key Concepts:**
- **Event-Time**: Time when the event actually occurred (in the data)
- **Processing-Time**: Time when Spark processes the event
- **Late Data**: Events that arrive after their event-time window has closed
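
The difference between the two notions of time can be sketched in plain Python, without Spark. The example below is a minimal illustration under assumed data: the event IDs, timestamps, and delays are hypothetical. It counts the same four events per minute, once by event-time and once by processing-time, to show how arrival delays shift events into the wrong bucket.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical events: (event_id, event_time, delay before the engine sees it)
base = datetime(2024, 1, 1, 10, 0, 0)
events = [
    ("EVT_1", base + timedelta(seconds=10), timedelta(seconds=5)),
    ("EVT_2", base + timedelta(seconds=50), timedelta(seconds=30)),  # crosses into the next minute
    ("EVT_3", base + timedelta(seconds=70), timedelta(seconds=5)),
    ("EVT_4", base + timedelta(seconds=20), timedelta(minutes=2)),   # arrives two minutes late
]


def minute_bucket(ts):
    """Truncate a timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


# Group by when the event happened vs. when it was processed
by_event_time = Counter(minute_bucket(t) for _, t, _ in events)
by_processing_time = Counter(minute_bucket(t + d) for _, t, d in events)

for label, counts in (("event-time", by_event_time), ("processing-time", by_processing_time)):
    print(label)
    for bucket in sorted(counts):
        print(f"  {bucket:%H:%M} -> {counts[bucket]} event(s)")
```

By event-time, three events fall in the 10:00 minute. By processing-time only one does, because the delayed events are attributed to later minutes.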


In [None]:
# Create Sample Streaming Data with Event-Time
print("📊 Exercise 1: Event-Time vs Processing-Time")
print("=" * 60)

# Create sample data with event timestamps
# In real scenarios, this would come from Kafka
sample_events = []

# Simulate events with event-time (when event occurred)
# and processing delays
base_time = datetime.now()

for i in range(20):
    # Event-time: when the event actually happened
    event_time = base_time - timedelta(minutes=random.randint(0, 30))
    
    # Processing-time: when Spark receives it (with delay)
    processing_delay = random.randint(0, 5)  # 0-5 minutes delay
    processing_time = event_time + timedelta(minutes=processing_delay)
    
    sample_events.append({
        'event_id': f'EVT_{i+1:04d}',
        'event_time': event_time.strftime('%Y-%m-%d %H:%M:%S'),
        'processing_time': processing_time.strftime('%Y-%m-%d %H:%M:%S'),
        'value': random.randint(10, 100),
        'category': random.choice(['A', 'B', 'C'])
    })

# Create DataFrame
events_df = spark.createDataFrame(sample_events)

print("\n📋 Sample Events (showing event-time vs processing-time):")
events_df.show(truncate=False)

print("\n💡 Key Insight:")
print("   - Event-time: When the event actually occurred")
print("   - Processing-time: When Spark processes it (may be delayed)")
print("   - Late data: Events arriving after their time window")
print("\n⚠️ Using processing-time can lead to incorrect results!")
print("   ✅ Always use event-time for accurate analytics")


## Exercise 2: Tumbling Windows (Fixed Windows)

### 🎯 **Learning Objectives:**
- Understand tumbling windows (non-overlapping)
- Learn to aggregate data in fixed time intervals
- Practice event-time based windowing
- Understand window boundaries

### 📚 **Key Concepts:**
- **Tumbling Window**: Fixed-size, non-overlapping windows
- **Window Duration**: Size of each window (e.g., 5 minutes)
- **Window Start/End**: Boundaries of each window
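
How an event is assigned to a tumbling window can be sketched with simple arithmetic. The helper below is a hypothetical plain-Python model, not Spark code. It assumes windows are aligned to the Unix epoch (Spark's `window` function behaves this way when no `startTime` offset is given) and that timestamps are naive UTC values.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for all windows


def tumbling_window(ts, duration):
    """Return the [start, end) window of size `duration` that contains `ts`."""
    offset = (ts - EPOCH) % duration  # how far ts sits past its window start
    start = ts - offset
    return start, start + duration


start, end = tumbling_window(datetime(2024, 1, 1, 10, 3, 30), timedelta(minutes=2))
print(f"[{start:%H:%M} - {end:%H:%M})")  # [10:02 - 10:04)
```

Because the end of each window is exclusive, an event stamped exactly 10:04:00 belongs to `[10:04 - 10:06)`, not to the window before it.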


In [None]:
# Tumbling Windows Example
print("🪟 Exercise 2: Tumbling Windows (Fixed Windows)")
print("=" * 60)

# Create sample streaming data with timestamps
streaming_data = []
base_time = datetime(2024, 1, 1, 10, 0, 0)  # Start at 10:00

for i in range(50):
    # Events spread over 10 minutes
    event_time = base_time + timedelta(minutes=i*0.2)
    streaming_data.append({
        'timestamp': event_time,
        'user_id': f'user_{random.randint(1, 5)}',
        'action': random.choice(['click', 'view', 'purchase']),
        'amount': random.randint(10, 200) if random.random() > 0.7 else 0
    })

# Create DataFrame and convert timestamp
stream_df = spark.createDataFrame(streaming_data)
stream_df = stream_df.withColumn("timestamp", col("timestamp").cast("timestamp"))

# Register as temporary view for SQL
stream_df.createOrReplaceTempView("events")

print("\n1️⃣ Tumbling Window - 2 minute windows:")
print("   Each window is 2 minutes, non-overlapping")
print("   Windows: [10:00-10:02), [10:02-10:04), [10:04-10:06), ...")

tumbling_result = spark.sql("""
    SELECT 
        window(timestamp, '2 minutes') as window,
        COUNT(*) as event_count,
        COUNT(DISTINCT user_id) as unique_users,
        SUM(amount) as total_amount
    FROM events
    GROUP BY window(timestamp, '2 minutes')
    ORDER BY window
""")

tumbling_result.show(truncate=False)

print("\n📊 Window Structure:")
print("   - Window start: Beginning of 2-minute interval")
print("   - Window end: End of 2-minute interval")
print("   - No overlap between windows")
print("   - Each event belongs to exactly one window")


## Exercise 3: Sliding Windows (Overlapping Windows)

### 🎯 **Learning Objectives:**
- Understand sliding windows (overlapping)
- Learn to create windows with slide intervals
- Practice aggregating over overlapping time ranges
- Understand when to use sliding vs tumbling windows

### 📚 **Key Concepts:**
- **Sliding Window**: Overlapping windows with a slide interval
- **Window Duration**: Total size of the window
- **Slide Interval**: How much the window moves forward
- **Overlap**: Events can belong to multiple windows
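
The overlap can be made concrete by listing every sliding window that contains a given event. The function below is a hypothetical plain-Python sketch rather than Spark's implementation. It assumes window starts are aligned to the Unix epoch in steps of the slide interval.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for window starts


def sliding_windows(ts, duration, slide):
    """Return every [start, end) window of size `duration`, advancing by `slide`, that contains `ts`."""
    # Latest window start at or before ts, aligned to the slide interval
    start = ts - (ts - EPOCH) % slide
    windows = []
    # Walk backwards one slide at a time while the window still covers ts
    while start + duration > ts:
        windows.append((start, start + duration))
        start -= slide
    return sorted(windows)


event_time = datetime(2024, 1, 1, 10, 0, 30)
for start, end in sliding_windows(event_time, timedelta(minutes=5), timedelta(minutes=1)):
    print(f"[{start:%H:%M} - {end:%H:%M})")
```

With a 5-minute window and a 1-minute slide, the event at 10:00:30 falls into five windows, from `[09:56 - 10:01)` through `[10:00 - 10:05)`. In general an event belongs to about `duration / slide` windows, so a smaller slide means more rows to aggregate.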


In [None]:
# Sliding Windows Example
print("🔄 Exercise 3: Sliding Windows (Overlapping Windows)")
print("=" * 60)

print("\n1️⃣ Sliding Window - 5 minute window, 1 minute slide:")
print("   Window size: 5 minutes")
print("   Slide interval: 1 minute")
print("   Windows overlap by 4 minutes")

sliding_result = spark.sql("""
    SELECT 
        window(timestamp, '5 minutes', '1 minute') as window,
        COUNT(*) as event_count,
        COUNT(DISTINCT user_id) as unique_users,
        SUM(amount) as total_amount
    FROM events
    GROUP BY window(timestamp, '5 minutes', '1 minute')
    ORDER BY window
""")

sliding_result.show(truncate=False)

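print("\n📊 Window Structure (three consecutive windows as an example):")
print("   Window A: [10:00 - 10:05)")
print("   Window B: [10:01 - 10:06)  ← overlaps with Window A")
print("   Window C: [10:02 - 10:07)  ← overlaps with Windows A and B")
print("   ...")
print("   Note: the output above starts at [09:56 - 10:01), because every")
print("   5-minute window that contains at least one event is reported.")
print("\n💡 Key Differences:")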
print("   - Tumbling: No overlap, each event in 1 window")
print("   - Sliding: Overlap, each event can be in multiple windows")
print("   - Sliding provides smoother, more continuous results")
print("   - Useful for moving averages, trend analysis")


## Exercise 4: Watermarking for Late Data

### 🎯 **Learning Objectives:**
- Understand watermarking mechanism
- Learn to handle late-arriving data
- Practice setting appropriate watermark thresholds
- Understand when late data is dropped vs processed

### 📚 **Key Concepts:**
- **Watermark**: Threshold for how late data can arrive
- **Late Data**: Events arriving after watermark threshold
- **State Management**: Spark maintains state for incomplete windows
- **State Cleanup**: Watermark allows Spark to clean up old state
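
The interaction between watermark, late data, and state can be sketched as a small simulation. This is a simplified plain-Python model under stated assumptions, not Spark's engine: the watermark is recomputed after each micro-batch as the latest event time seen minus the delay, and an event is dropped once the end of its window is at or behind the watermark. The batch contents, `WINDOW`, and `DELAY` are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
WINDOW = timedelta(minutes=2)  # tumbling window size
DELAY = timedelta(minutes=5)   # watermark threshold


def window_end(ts):
    """End of the epoch-aligned tumbling window that contains ts."""
    return ts - (ts - EPOCH) % WINDOW + WINDOW


def run(batches):
    watermark = None       # no watermark until the first batch has finished
    max_event_time = None
    counts, dropped = {}, []
    for batch in batches:
        for event_id, ts in batch:
            end = window_end(ts)
            if watermark is not None and end <= watermark:
                dropped.append(event_id)  # window already finalized: too late
                continue
            counts[end - WINDOW] = counts.get(end - WINDOW, 0) + 1
            max_event_time = ts if max_event_time is None else max(max_event_time, ts)
        if max_event_time is not None:
            watermark = max_event_time - DELAY  # takes effect for the next batch
    return counts, dropped, watermark


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute)


batches = [
    [("A", at(10, 1)), ("B", at(10, 3))],   # watermark afterwards: 09:58
    [("C", at(10, 12)), ("D", at(10, 0))],  # D is late but still accepted; watermark: 10:07
    [("E", at(10, 4)), ("F", at(10, 13))],  # E's window ended at 10:06, so E is dropped
]
counts, dropped, watermark = run(batches)
print({f"{start:%H:%M}": n for start, n in sorted(counts.items())})
print("dropped:", dropped, "| final watermark:", f"{watermark:%H:%M}")
```

Event `D` is 12 minutes older than the newest event in its batch, yet it is still counted, because the watermark in force during that batch came from the previous one. Event `E` arrives one batch later and is discarded.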


In [None]:
# Watermarking Example
print("💧 Exercise 4: Watermarking for Late Data")
print("=" * 60)

# Create data with some late-arriving events
late_data = []
base_time = datetime(2024, 1, 1, 10, 0, 0)

# Normal events (on-time)
for i in range(30):
    event_time = base_time + timedelta(minutes=i*0.5)
    late_data.append({
        'timestamp': event_time,
        'event_id': f'EVT_{i+1:04d}',
        'value': random.randint(10, 100),
        'arrival_status': 'on-time'
    })

# Late-arriving events (arriving after their window)
# These events have old timestamps but arrive late
for i in range(5):
    # Event happened 10 minutes ago but arriving now
    event_time = base_time - timedelta(minutes=10-i)
    late_data.append({
        'timestamp': event_time,
        'event_id': f'LATE_{i+1:04d}',
        'value': random.randint(10, 100),
        'arrival_status': 'late'
    })

late_df = spark.createDataFrame(late_data)
late_df = late_df.withColumn("timestamp", col("timestamp").cast("timestamp"))
late_df.createOrReplaceTempView("late_events")

print("\n📊 Sample Data (including late events):")
late_df.orderBy("timestamp").show(20, truncate=False)

print("\n1️⃣ Aggregation WITHOUT watermark (processes all data, including late):")
print("   ⚠️ Without watermark, Spark keeps state forever (memory issue)")

no_watermark = spark.sql("""
    SELECT 
        window(timestamp, '2 minutes') as window,
        COUNT(*) as event_count,
        SUM(value) as total_value
    FROM late_events
    GROUP BY window(timestamp, '2 minutes')
    ORDER BY window
""")

no_watermark.show(truncate=False)

print("\n2️⃣ Aggregation WITH watermark (drops very late data):")
print("   ✅ With a 5-minute watermark, data up to 5 minutes behind the latest event time seen is processed")
print("   ❌ Data more than 5 minutes behind the latest event time seen may be dropped")

# Note: In real streaming, this would be:
# stream_df.withWatermark("timestamp", "5 minutes")...
# For batch demo, we show the concept
print("""
# Real streaming example:
stream_df = spark.readStream \\
    .format("kafka") \\
    .option("kafka.bootstrap.servers", "localhost:9092") \\
    .option("subscribe", "events") \\
    .load()

# Parse and add watermark
events_with_watermark = stream_df \\
    .select(from_json(col("value").cast("string"), schema).alias("data")) \\
    .select("data.*") \\
    .withWatermark("timestamp", "5 minutes")  # 5-minute watermark

# Aggregate with window
windowed = events_with_watermark \\
    .groupBy(
        window(col("timestamp"), "2 minutes"),
        col("category")
    ) \\
    .agg(count("*").alias("count"))
""")

print("\n💡 Watermark Behavior:")
print("   - Events within watermark threshold: ✅ Processed")
print("   - Events beyond watermark threshold: ❌ May be dropped")
print("   - State cleanup: Old windows beyond watermark are cleaned up")
print("   - Trade-off: Shorter watermark = less state to keep, but more late data dropped")


## Exercise 5: Advanced Window Operations

### 🎯 **Learning Objectives:**
- Combine windows with other aggregations
- Use multiple windows in the same query
- Practice window functions within windows
- Understand output modes with windows


In [None]:
# Advanced Window Operations
print("⚙️ Exercise 5: Advanced Window Operations")
print("=" * 60)

print("\n1️⃣ Multiple aggregations in the same window:")
advanced_result = spark.sql("""
    SELECT 
        window(timestamp, '3 minutes') as window,
        COUNT(*) as total_events,
        COUNT(DISTINCT user_id) as unique_users,
        SUM(amount) as total_revenue,
        AVG(amount) as avg_amount,
        MAX(amount) as max_amount,
        MIN(amount) as min_amount
    FROM events
    GROUP BY window(timestamp, '3 minutes')
    ORDER BY window
""")

advanced_result.show(truncate=False)

print("\n2️⃣ Window with category grouping:")
category_window = spark.sql("""
    SELECT 
        window(timestamp, '2 minutes') as window,
        action as category,
        COUNT(*) as count,
        SUM(amount) as total
    FROM events
    GROUP BY window(timestamp, '2 minutes'), action
    ORDER BY window, category
""")

category_window.show(truncate=False)

print("\n3️⃣ Window with filtering:")
filtered_window = spark.sql("""
    SELECT 
        window(timestamp, '2 minutes') as window,
        COUNT(*) as total_events,
        SUM(amount) as revenue
    FROM events
    WHERE amount > 0  -- Only events with a non-zero amount (amount is independent of action in the sample data)
    GROUP BY window(timestamp, '2 minutes')
    ORDER BY window
""")

filtered_window.show(truncate=False)


## Exercise 6: Output Modes with Windows

### 🎯 **Learning Objectives:**
- Understand different output modes
- Learn when to use Append vs Update vs Complete mode
- Practice output modes with windowed aggregations
- Understand state management implications
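
Before looking at the streaming snippets, the three output modes can be compared with a toy model. The function below is a hypothetical plain-Python sketch, not a Spark API. It assumes the aggregation state is a dict from window end (in minutes) to count, and that in Append mode windows written on an earlier trigger have already been removed from state.

```python
def emit(mode, previous, current, watermark):
    """Rows written on one trigger; `previous` and `current` map window_end -> count."""
    if mode == "complete":
        # Every aggregated row, whether or not it changed
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed since the last trigger
        return {w: c for w, c in current.items() if previous.get(w) != c}
    if mode == "append":
        # Only finalized rows: windows whose end the watermark has reached
        return {w: c for w, c in current.items() if w <= watermark}
    raise ValueError(f"unknown output mode: {mode}")


previous = {2: 3, 4: 1}        # state after the last trigger
current = {2: 3, 4: 2, 6: 1}   # state after this trigger
watermark = 2                  # the window ending at minute 2 is final

for mode in ("complete", "update", "append"):
    print(f"{mode:>8}: {emit(mode, previous, current, watermark)}")
```

Complete writes all three windows, Update writes only the two that changed, and Append writes only the one window that can no longer change.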


In [None]:
# Output Modes with Windows
print("📤 Exercise 6: Output Modes with Windows")
print("=" * 60)

print("\n1️⃣ Append Mode (requires watermark):")
print("""
# Append mode: Only outputs new rows when window closes
# Requires watermark to know when window is complete

# events_with_watermark already carries the 5-minute watermark from Exercise 4,
# so withWatermark is not applied a second time here
query = events_with_watermark \\
    .groupBy(
        window(col("timestamp"), "2 minutes"),
        col("category")
    ) \\
    .agg(count("*").alias("count")) \\
    .writeStream \\
    .outputMode("append") \\
    .format("console") \\
    .start()

# Behavior:
# - Outputs only when window closes (after watermark passes)
# - Each window appears once in output
# - Lower memory usage
# - Cannot use with aggregations without watermark
""")

print("\n2️⃣ Update Mode:")
print("""
# Update mode: Outputs updated rows as they change
# Works with or without watermark
# events_df is the parsed event stream (raw Kafka rows have no "category" column)

query = events_df \\
    .groupBy(
        window(col("timestamp"), "2 minutes"),
        col("category")
    ) \\
    .agg(count("*").alias("count")) \\
    .writeStream \\
    .outputMode("update") \\
    .format("console") \\
    .start()

# Behavior:
# - Outputs whenever a window result changes
# - Can output same window multiple times (as data arrives)
# - Works without watermark
# - Without a watermark, state grows without bound (no window is ever dropped)
""")

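print("\n3️⃣ Complete Mode:")
print("""
# Complete mode: Outputs all rows every trigger
# Keeps every aggregated result in state (state is never dropped)
# events_df is the parsed event stream (raw Kafka rows have no "category" column)

query = events_df \\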
    .groupBy(
        window(col("timestamp"), "2 minutes"),
        col("category")
    ) \\
    .agg(count("*").alias("count")) \\
    .writeStream \\
    .outputMode("complete") \\
    .format("console") \\
    .start()

# Behavior:
# - Outputs all windows every trigger
# - Keeps all aggregated results in state (not the raw events), so state grows with the number of windows
# - Use only when necessary (e.g., all-time aggregations)
# - High memory usage
""")

print("\n💡 Output Mode Selection:")
print("   - Append: Use with watermark for windowed aggregations")
print("   - Update: Use for incremental updates, can work without watermark")
print("   - Complete: Use for small datasets or when you need all results")


## Exercise 7: Real-World Streaming Pipeline

### 🎯 **Learning Objectives:**
- Build a complete streaming pipeline
- Combine windows, watermarking, and aggregations
- Practice with realistic scenarios
- Understand best practices


In [None]:
# Real-World Streaming Pipeline Example
print("🌐 Exercise 7: Real-World Streaming Pipeline")
print("=" * 60)

print("\n📋 Complete Streaming Pipeline Pattern:")
print("""
# Step 1: Read from Kafka
stream_df = spark.readStream \\
    .format("kafka") \\
    .option("kafka.bootstrap.servers", "localhost:9092") \\
    .option("subscribe", "user-events") \\
    .option("startingOffsets", "latest") \\
    .load()

# Step 2: Parse JSON and extract fields
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("amount", IntegerType())
])

events_df = stream_df \\
    .select(from_json(col("value").cast("string"), schema).alias("data")) \\
    .select("data.*")

# Step 3: Add watermark for late data handling
events_with_watermark = events_df \\
    .withWatermark("timestamp", "10 minutes")  # 10-minute watermark

# Step 4: Windowed aggregation with sliding window
windowed_agg = events_with_watermark \\
    .groupBy(
        window(col("timestamp"), "5 minutes", "1 minute"),  # 5-min window, 1-min slide
        col("event_type")
    ) \\
    .agg(
        count("*").alias("event_count"),
        approx_count_distinct("user_id").alias("unique_users"),  # exact countDistinct is not supported on streaming DataFrames
        sum("amount").alias("total_revenue"),
        avg("amount").alias("avg_amount")
    )

# Step 5: Write to output (console, Kafka, or file)
query = windowed_agg \\
    .writeStream \\
    .outputMode("update") \\
    .format("console") \\
    .option("truncate", "false") \\
    .trigger(processingTime='10 seconds') \\
    .start()

# Step 6: Wait for termination
query.awaitTermination()
""")

print("\n✅ Key Components:")
print("   1. Kafka source: Real-time data ingestion")
print("   2. Schema parsing: Extract structured data")
print("   3. Watermark: Handle late-arriving data")
print("   4. Sliding window: Continuous aggregations")
print("   5. Multiple metrics: Count, sum, avg, distinct")
print("   6. Output mode: Update for incremental results")
print("   7. Trigger: Control processing frequency")


## Summary

### ✅ What we learned:
1. **Event-Time vs Processing-Time**: Understanding the difference and why event-time matters
2. **Tumbling Windows**: Fixed-size, non-overlapping windows for periodic aggregations
3. **Sliding Windows**: Overlapping windows for continuous, smooth aggregations
4. **Watermarking**: Mechanism to handle late-arriving data and manage state
5. **Advanced Window Operations**: Multiple aggregations, filtering, grouping
6. **Output Modes**: Append, Update, and Complete modes for different use cases
7. **Real-World Pipeline**: Complete streaming pipeline with best practices

### 🎯 Key Takeaways:
- **Always use event-time** for accurate analytics, not processing-time
- **Tumbling windows** for periodic reports (e.g., hourly sales)
- **Sliding windows** for continuous metrics (e.g., moving averages)
- **Watermarking** is essential for state management and late data handling
- **Output mode selection** depends on use case and memory constraints
- **Window size and slide** should match your business requirements

### 🚀 Best Practices:
- Set watermark threshold based on expected data delays
- Use Append mode with watermark for windowed aggregations
- Choose window size based on analysis needs (not too small, not too large)
- Monitor state size and adjust watermark accordingly
- Use sliding windows for smoother, more continuous results
- Test with late-arriving data to validate watermark behavior
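
One way to act on the first practice is to derive the threshold from measured arrival delays. The helper below is a hypothetical sketch, not part of Spark. It assumes you have logged, for a sample of events, the number of seconds between event-time and arrival, and it picks a nearest-rank percentile of those delays. The sample values are invented.

```python
def suggest_watermark_seconds(delays_seconds, percentile=99):
    """Nearest-rank percentile (integer 1-100) of observed arrival delays, in seconds."""
    if not delays_seconds:
        raise ValueError("need at least one observed delay")
    ordered = sorted(delays_seconds)
    # 1-based nearest rank, computed with integer ceiling division
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: most events arrive within seconds, a few straggle
delays = [2, 3, 3, 4, 5, 5, 6, 8, 40, 300]

threshold = suggest_watermark_seconds(delays, percentile=90)
print(f"{threshold} seconds")  # 40 seconds
```

The resulting string has the form accepted by `withWatermark("timestamp", ...)`. A higher percentile keeps more late data at the cost of holding window state longer, so the choice is a trade-off rather than a single correct value.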

### 📚 Next Steps:
- Practice with real Kafka streams
- Experiment with different window sizes and slides
- Test watermark behavior with intentionally late data
- Monitor streaming query performance and state size
- Explore advanced features (stream-stream joins, deduplication)
