# Lab 2: Bronze Layer - Raw Streaming Data

## 🎯 **Learning Objectives:**
- Ingest stock trade data from Kafka into an Iceberg Bronze table
- Set up Spark Structured Streaming with a Kafka source
- Write streaming data to Iceberg tables
- Understand checkpointing and fault tolerance
- Verify raw data in the Bronze layer

## 📚 **Key Concepts:**
1. **Bronze Layer**: Raw, unprocessed data from the source
2. **Kafka Source**: Real-time event stream
3. **Iceberg Sink**: Write streaming data to Iceberg
4. **Checkpointing**: Fault tolerance for streaming
5. **Schema**: Stock trade event structure
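
The checkpointing idea can be illustrated without Spark. The sketch below is a toy model, not Spark's actual checkpoint format: it writes each batch to a sink, then records the next offset in a JSON file, so a restarted run resumes where the last committed batch ended. The function name `run_stream` and the file layout are hypothetical.

```python
import json
import os
import tempfile


def run_stream(source, sink, checkpoint_path, batch_size=2, fail_at_batch=None):
    """Toy micro-batch loop: write a batch, then commit the next offset."""
    # Resume from the last committed offset, or start at 0 on the first run
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["next_offset"]

    batch_no = 0
    while offset < len(source):
        if batch_no == fail_at_batch:
            raise RuntimeError("simulated crash")
        batch = source[offset:offset + batch_size]
        sink.extend(batch)                      # 1. write the batch
        offset += len(batch)
        with open(checkpoint_path, "w") as f:   # 2. commit progress
            json.dump({"next_offset": offset}, f)
        batch_no += 1


source = ["t0", "t1", "t2", "t3", "t4"]
sink = []
with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    try:
        run_stream(source, sink, ckpt, fail_at_batch=1)  # crashes before the second batch
    except RuntimeError:
        pass
    after_crash = list(sink)
    run_stream(source, sink, ckpt)                       # restart: resumes at offset 2

print(after_crash)  # ['t0', 't1']
print(sink)         # ['t0', 't1', 't2', 't3', 't4']
```

The simulated crash here falls between batches. A crash between the write and the commit would replay one batch, which is why real sinks also need idempotent or transactional writes.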


In [None]:
# Install and Import Dependencies
%pip install pyspark findspark pandas numpy pyarrow kafka-python

import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.streaming import StreamingQuery
import json
import time

print("✅ Dependencies installed and imported successfully!")


In [None]:
# Initialize Spark Session for Streaming Lakehouse
spark = SparkSession.builder \
    .appName("StreamingLakehouseBronze") \
    .master("spark://spark-master:7077") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.streaming.checkpointLocation", "/tmp/bronze_checkpoint") \
    .config("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "false") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()

# Set log level
spark.sparkContext.setLogLevel("WARN")

print("🚀 Spark Streaming Session initialized!")
print(f"📊 Spark Version: {spark.version}")
print(f"🔗 Master URL: {spark.sparkContext.master}")

# Configuration
KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
KAFKA_TOPIC = "stock-trades"
BRONZE_TABLE_PATH = "/warehouse/bronze/trades"

print(f"\n📡 Configuration:")
print(f"   Kafka: {KAFKA_BOOTSTRAP_SERVERS}")
print(f"   Topic: {KAFKA_TOPIC}")
print(f"   Bronze Path: {BRONZE_TABLE_PATH}")


## Exercise 1: Define Stock Trade Schema

### Stock Trade Event Schema
```json
{
  "trade_id": "TRD_001",
  "symbol": "AAPL",
  "price": 175.50,
  "volume": 100,
  "timestamp": "2025-01-15T10:30:00Z",
  "trade_type": "BUY",
  "exchange": "NASDAQ"
}
```
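
Before wiring up Spark, it can help to confirm what types the sample event actually carries. The following is a minimal plain-Python sketch for inspecting a producer's output. It is not part of the Bronze pipeline, which stores events without validation. The names `EXPECTED_TYPES` and `check_event` are hypothetical.

```python
import json
from datetime import datetime, timezone

# Expected Python type per field, mirroring the JSON example above
EXPECTED_TYPES = {
    "trade_id": str,
    "symbol": str,
    "price": float,
    "volume": int,
    "timestamp": str,
    "trade_type": str,
    "exchange": str,
}

SAMPLE = """
{
  "trade_id": "TRD_001",
  "symbol": "AAPL",
  "price": 175.50,
  "volume": 100,
  "timestamp": "2025-01-15T10:30:00Z",
  "trade_type": "BUY",
  "exchange": "NASDAQ"
}
"""


def check_event(raw: str) -> dict:
    """Parse one JSON event and verify field presence and types."""
    event = json.loads(raw)
    for name, expected in EXPECTED_TYPES.items():
        if not isinstance(event.get(name), expected):
            raise ValueError(f"field {name!r} is missing or not {expected.__name__}")
    # The trailing 'Z' means UTC, so attach the UTC zone explicitly
    event["event_time"] = datetime.strptime(
        event["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return event


event = check_event(SAMPLE)
print(event["symbol"], event["price"], event["event_time"].isoformat())
```

Note that a price written without a decimal point (for example `175`) parses as `int` in Python and would fail this strict check; Spark's `DoubleType` is more lenient.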


In [None]:
# Define Stock Trade Schema
print("📋 Exercise 1: Define Stock Trade Schema")
print("=" * 60)

trade_schema = StructType([
    StructField("trade_id", StringType()),
    StructField("symbol", StringType()),
    StructField("price", DoubleType()),
    StructField("volume", IntegerType()),
    StructField("timestamp", StringType()),
    StructField("trade_type", StringType()),
    StructField("exchange", StringType())
])

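print("✅ Stock Trade Schema defined:")
# PySpark's StructType has no printTreeString(); print the tree via an empty DataFrame
spark.createDataFrame([], trade_schema).printSchema()

print("\n💡 Note: Bronze layer stores raw data as-is")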
print("   - No transformations")
print("   - No validations")
print("   - Preserves original format for replay capability")


## Exercise 2: Read from Kafka Stream

### Steps:
1. Read the stream from the Kafka topic
2. Parse JSON messages
3. Extract and cast fields
4. Add Kafka metadata (partition, offset)
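
The stream only has something to parse once events reach the topic. The lab relies on `stock_trade_simulator.py` for that. As a rough stand-in, the sketch below publishes a few events with `kafka-python` (installed in the first cell). The helper names, the symbol list, and the price and volume ranges are all made up for illustration.

```python
import json
import random
from datetime import datetime, timezone

SYMBOLS = ["AAPL", "MSFT", "GOOG"]  # hypothetical sample symbols


def make_trade(seq: int) -> dict:
    """Build one event shaped like the schema in Exercise 1."""
    return {
        "trade_id": f"TRD_{seq:03d}",
        "symbol": random.choice(SYMBOLS),
        "price": round(random.uniform(100.0, 300.0), 2),
        "volume": random.randint(1, 1000),
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "trade_type": random.choice(["BUY", "SELL"]),
        "exchange": "NASDAQ",
    }


def encode(event: dict) -> bytes:
    """Serialize an event to the UTF-8 JSON bytes Kafka will carry."""
    return json.dumps(event).encode("utf-8")


def send_trades(n: int, bootstrap_servers: str = "localhost:9092",
                topic: str = "stock-trades") -> None:
    """Publish n test events; requires a reachable Kafka broker."""
    from kafka import KafkaProducer  # imported here so the helpers above work without a broker

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers,
                             value_serializer=encode)
    for seq in range(1, n + 1):
        event = make_trade(seq)
        # Keying by symbol keeps each symbol's trades in one partition
        producer.send(topic, key=event["symbol"].encode("utf-8"), value=event)
    producer.flush()
    producer.close()
```

Calling `send_trades(10)` with the broker running should give the streaming query below a first batch to process.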


In [None]:
# Read from Kafka Stream
print("📥 Exercise 2: Read from Kafka Stream")
print("=" * 60)

print(f"\n1️⃣ Reading from Kafka topic: {KAFKA_TOPIC}")

kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS) \
    .option("subscribe", KAFKA_TOPIC) \
    .option("startingOffsets", "earliest") \
    .option("failOnDataLoss", "false") \
    .load()

print("✅ Kafka stream created!")
print("\n📋 Kafka stream schema:")
kafka_stream.printSchema()

# Parse JSON from Kafka value
print("\n2️⃣ Parsing JSON messages:")

parsed_stream = kafka_stream \
    .select(
        col("key").cast("string").alias("kafka_key"),
        col("value").cast("string").alias("json_value"),
        col("timestamp").alias("kafka_timestamp"),
        col("partition"),
        col("offset")
    ) \
    .select(
        col("kafka_key"),
        from_json(col("json_value"), trade_schema).alias("data"),
        col("kafka_timestamp"),
        col("partition"),
        col("offset")
    ) \
    .select(
        col("kafka_key"),
        col("data.*"),
        col("kafka_timestamp").alias("kafka_ingestion_time"),
        col("partition").alias("kafka_partition"),
        col("offset").alias("kafka_offset")
    ) \
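    .withColumn(
        "event_timestamp",
        # XXX parses the trailing "Z" as a UTC offset; a quoted 'Z' literal would
        # silently interpret the time in the session time zone instead
        to_timestamp(col("timestamp"), "yyyy-MM-dd'T'HH:mm:ssXXX")
    )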

print("✅ Parsed stream schema:")
parsed_stream.printSchema()

print("\n💡 Bronze layer includes:")
print("   - Original trade data (from Kafka)")
print("   - Kafka metadata (partition, offset, ingestion time)")
print("   - Event timestamp (from data)")


## Exercise 3: Write to Iceberg Bronze Table

### Key Points:
- **Format**: Iceberg (unified storage)
- **Checkpointing**: For fault tolerance
- **Trigger**: Processing time (e.g., 10 seconds)
- **Output Mode**: Append (new data only)

### Note:
Full Iceberg functionality requires Iceberg Spark runtime JAR.
For demo, we'll show the pattern. In production, configure:
```python
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
.config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
```
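
Putting the pieces together, an Iceberg streaming write might look like the sketch below. It assumes the Iceberg Spark runtime JAR is already on the classpath (no version is given here), and it uses a hypothetical Hadoop-type catalog named `lakehouse` rooted at `/warehouse` rather than the session catalog shown above. The helper names are illustrative.

```python
# Hypothetical catalog name and table identifier; adjust to your environment
CATALOG = "lakehouse"
BRONZE_TABLE = f"{CATALOG}.bronze.trades"

# Settings to pass to SparkSession.builder.config(key, value)
ICEBERG_CONF = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    f"spark.sql.catalog.{CATALOG}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{CATALOG}.type": "hadoop",
    f"spark.sql.catalog.{CATALOG}.warehouse": "/warehouse",
}


def create_bronze_table(spark):
    """Create the target table up front; columns follow parsed_stream's order."""
    spark.sql(f"CREATE NAMESPACE IF NOT EXISTS {CATALOG}.bronze")
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {BRONZE_TABLE} (
            kafka_key STRING,
            trade_id STRING,
            symbol STRING,
            price DOUBLE,
            volume INT,
            `timestamp` STRING,
            trade_type STRING,
            exchange STRING,
            kafka_ingestion_time TIMESTAMP,
            kafka_partition INT,
            kafka_offset BIGINT,
            event_timestamp TIMESTAMP
        ) USING iceberg
    """)


def start_bronze_query(stream_df, checkpoint="/tmp/bronze_checkpoint_iceberg"):
    """Start an append-only streaming write into the Iceberg table."""
    return (
        stream_df.writeStream
        .format("iceberg")
        .outputMode("append")
        .option("checkpointLocation", checkpoint)
        .trigger(processingTime="10 seconds")
        .toTable(BRONZE_TABLE)
    )

# Usage, once the session has been built with ICEBERG_CONF applied:
#   create_bronze_table(spark)
#   bronze_query = start_bronze_query(parsed_stream)
```

The sketch uses its own checkpoint directory because a checkpoint belongs to a single query; reusing the Parquet demo's `/tmp/bronze_checkpoint` for a different sink is likely to cause trouble.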


In [None]:
# Write to Iceberg Bronze Table
print("💾 Exercise 3: Write to Iceberg Bronze Table")
print("=" * 60)

print(f"\n1️⃣ Writing to Bronze layer: {BRONZE_TABLE_PATH}")

# For demo: Write to Parquet (Iceberg pattern)
# In production with Iceberg JAR, use:
# .format("iceberg")
# .option("path", BRONZE_TABLE_PATH)
# .option("checkpointLocation", "/tmp/bronze_checkpoint")

bronze_query = parsed_stream \
    .writeStream \
    .outputMode("append") \
    .format("parquet") \
    .option("path", BRONZE_TABLE_PATH) \
    .option("checkpointLocation", "/tmp/bronze_checkpoint") \
    .trigger(processingTime='10 seconds') \
    .start()

print("✅ Bronze streaming query started!")
print(f"   Writing to: {BRONZE_TABLE_PATH}")
print(f"   Checkpoint: /tmp/bronze_checkpoint")
print(f"   Trigger: Every 10 seconds")

print("\n💡 With Iceberg JAR, use:")
print("   .format('iceberg')")
print("   .option('path', 'warehouse.bronze.trades')")
print("   .option('checkpointLocation', '/tmp/bronze_checkpoint')")

print("\n⚠️  Query is running. To stop: bronze_query.stop()")
print("   To check status: bronze_query.status")


## Exercise 4: Verify Bronze Data

### Check:
- Data is being written
- Schema is correct
- Kafka metadata is preserved
- Event timestamps are correct
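
Besides reading the files back, the running query itself reports whether data is flowing. Here is a small sketch of a helper that condenses the dictionary returned by `bronze_query.lastProgress` into one line. The function name is hypothetical; the keys it reads (`batchId`, `numInputRows`, `processedRowsPerSecond`, `sources`) are part of Spark's streaming progress report.

```python
from typing import Optional


def summarize_progress(progress: Optional[dict]) -> str:
    """Condense a StreamingQuery.lastProgress payload into one line."""
    if not progress:
        # lastProgress is None until the first micro-batch completes
        return "no completed batch yet"
    # For a Kafka source, endOffset holds the per-partition offsets read so far
    end_offsets = [src.get("endOffset") for src in progress.get("sources", [])]
    return (
        f"batch {progress.get('batchId')}: "
        f"{progress.get('numInputRows')} input rows, "
        f"{progress.get('processedRowsPerSecond')} rows/s, "
        f"end offsets {end_offsets}"
    )
```

With the query from Exercise 3 running, `print(summarize_progress(bronze_query.lastProgress))` shows the latest batch; a steady `0 input rows` suggests the producer is not sending.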


In [None]:
# Verify Bronze Data
print("🔍 Exercise 4: Verify Bronze Data")
print("=" * 60)

# Wait a bit for data to arrive
print("\n⏳ Waiting for data to be written...")
time.sleep(15)

# Read Bronze data (batch read)
print("\n1️⃣ Reading Bronze data:")

try:
    bronze_df = spark.read.parquet(BRONZE_TABLE_PATH)
    
    print(f"✅ Bronze data found!")
    print(f"   Total records: {bronze_df.count()}")

    print("\n2️⃣ Sample Bronze data:")
    bronze_df.show(10, truncate=False)

    print("\n3️⃣ Schema:")
    bronze_df.printSchema()

    print("\n4️⃣ Data summary:")
    bronze_df.select("symbol", "price", "volume", "trade_type").summary().show()

    print("\n5️⃣ Records by symbol:")
    bronze_df.groupBy("symbol").count().orderBy(desc("count")).show()

except Exception as e:
    print(f"⚠️  Error reading Bronze data: {e}")
    print("   Make sure:")
    print("   1. Kafka producer is running (stock_trade_simulator.py)")
    print("   2. Streaming query has processed some batches")
    print("   3. Wait a bit longer for data to arrive")


## Summary

### ✅ What we learned:
1. **Bronze Layer**: Raw, unprocessed data from Kafka
2. **Kafka Integration**: Read streaming data from Kafka topics
3. **Schema Definition**: Define the structure of stock trade events
4. **Iceberg Write**: Write streaming data to Iceberg (pattern)
5. **Checkpointing**: Fault tolerance for streaming queries
6. **Verification**: Check data in the Bronze layer

### 🎯 Key Takeaways:
- **Bronze = Raw**: No transformations, preserves original data
- **Replay Capability**: Downstream layers can be reprocessed from Bronze if needed
- **Kafka Metadata**: Preserve partition, offset, ingestion time
- **Unified Storage**: Iceberg allows querying both real-time and historical data

### 🚀 Next Steps:
- Lab 3: Silver Layer (cleaning, deduplication)
- Lab 4: Gold Layer (aggregations)
- Lab 5: Unified batch + streaming
