# 02 - Streaming Ingestion

## Overview
This notebook sets up Spark Structured Streaming to read transaction data from the input directory.

## Key Concepts
- **Structured Streaming**: Spark's declarative streaming API built on DataFrames
- **Explicit Schema**: File-based streaming sources expect a user-supplied schema by default; declaring it keeps types consistent and avoids an inference pass
- **File Source**: We use CSV file streaming, which monitors a directory for new files

## Why Explicit Schema?
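In production streaming, we avoid schema inference (`inferSchema`, or `spark.sql.streaming.schemaInference` for file sources) because:
1. Performance: Inference requires an extra pass over the input before any data is processed
2. Correctness: A fixed schema keeps column types consistent across micro-batches
3. Safety: Schema violations surface early instead of propagating downstream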
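
To make the correctness point concrete, the sketch below runs a deliberately simplified, pure-Python inference routine over two hypothetical CSV batches. It is a toy stand-in and not Spark's actual inference algorithm; the helper names `infer_type` and `infer_schema` and the sample rows are assumptions made for illustration.

```python
import csv
import io


def infer_type(values):
    """Toy inference: pick the narrowest type that fits every value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def infer_schema(csv_text):
    """Infer a {column: type} mapping from CSV text that has a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: infer_type([row[col] for row in rows]) for col in rows[0]}


# Two hypothetical batches: the second contains a malformed amount
batch_1 = "transaction_id,amount\nT1,10\nT2,25\n"
batch_2 = "transaction_id,amount\nT3,10.50\nT4,N/A\n"

schema_1 = infer_schema(batch_1)
schema_2 = infer_schema(batch_2)

print(schema_1)  # amount is inferred as an integer here
print(schema_2)  # amount falls back to string here
```

The same column ends up with a different type in each batch, which is the kind of drift a fixed schema rules out.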

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, TimestampType
from pathlib import Path
import os

## Initialize Spark Session

Create SparkSession with appropriate configurations for structured streaming.

In [None]:
# Initialize Spark Session
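spark = (
    SparkSession.builder
    .appName("TransactionStreamingETL")
    .master("local[*]")  # Run locally on all available cores
    # Restates the default; kept to make the intent explicit
    .config("spark.sql.streaming.schemaInference", "false")
    # Small partition count suited to local runs (Spark's default is 200)
    .config("spark.sql.shuffle.partitions", "4")
    # Restates the default state store provider
    .config("spark.sql.streaming.stateStore.providerClass",
            "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider")
    .getOrCreate()
)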

# Set log level to reduce verbosity
spark.sparkContext.setLogLevel("WARN")

print(f"Spark Version: {spark.version}")
print(f"Spark UI: {spark.sparkContext.uiWebUrl}")

## Define Schema

Explicitly define the schema for our transaction data. This is critical for production streaming.

In [None]:
# Define explicit schema for transaction data
transaction_schema = StructType([
    StructField("transaction_id", StringType(), False),      # Primary key - not nullable
    StructField("user_id", StringType(), False),             # Customer identifier
    StructField("product_id", StringType(), False),          # Product identifier
    StructField("product_category", StringType(), True),     # Product category
    StructField("amount", DoubleType(), False),              # Transaction amount
    StructField("quantity", IntegerType(), False),           # Item quantity
    StructField("payment_method", StringType(), True),       # Payment type
    StructField("status", StringType(), False),              # Transaction status
    StructField("event_time", StringType(), False),          # Will be cast to timestamp later
    StructField("country_code", StringType(), True),         # Country code
    StructField("discount_percent", DoubleType(), True),     # Discount applied (nullable)
    StructField("customer_segment", StringType(), True)      # Customer segment (nullable)
])

print("Schema defined successfully:")
print(transaction_schema.simpleString())

## Configure Paths

Set up directory paths for reading streaming data.

In [None]:
# Configure paths
# Assumes the notebook runs from a directory one level below the project root
BASE_DIR = Path.cwd().parent
INPUT_DIR = str(BASE_DIR / 'data' / 'input')
OUTPUT_DIR = str(BASE_DIR / 'data' / 'output')

print(f"Input Directory: {INPUT_DIR}")
print(f"Output Directory: {OUTPUT_DIR}")
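
The file source may raise an error if the input directory does not exist when the stream is defined, so it can help to create the directories up front. The following is a minimal sketch; the helper name `ensure_dirs` is an assumption for illustration, and the demonstration uses a temporary directory rather than the project paths.

```python
import tempfile
from pathlib import Path


def ensure_dirs(*dirs):
    """Create each directory (and missing parents); return the ones created."""
    created = []
    for d in dirs:
        path = Path(d)
        if not path.exists():
            path.mkdir(parents=True, exist_ok=True)
            created.append(str(path))
    return created


# Demonstration against a throwaway directory tree
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    created = ensure_dirs(base / "data" / "input", base / "data" / "output")
    print(f"Created {len(created)} directories")

# Hypothetical usage with the variables defined above:
# ensure_dirs(INPUT_DIR, OUTPUT_DIR)
```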

## Read Streaming Data

Set up the streaming DataFrame using `readStream`. This creates a streaming source that monitors the input directory for new CSV files.

### Streaming Options Explained:
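- **maxFilesPerTrigger**: Caps the number of new files read per micro-batch (throttling)
- **cleanSource**: Optionally archives or deletes files after they are processed; it is not set in this notebook. Already-processed files are tracked in the query's checkpoint, not by this option
- **header**: Treats the first line of each CSV file as column names rather than data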
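
As a rough mental model of the throttling, the sketch below splits a list of pending files into per-trigger groups. It is a pure-Python illustration and not Spark's implementation; the function name `plan_micro_batches` and the file names are hypothetical.

```python
def plan_micro_batches(pending_files, max_files_per_trigger):
    """Split an ordered list of pending files into per-trigger groups."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    return [
        pending_files[i:i + max_files_per_trigger]
        for i in range(0, len(pending_files), max_files_per_trigger)
    ]


# Hypothetical files waiting in the input directory
files = ["tx_001.csv", "tx_002.csv", "tx_003.csv"]

print(plan_micro_batches(files, 1))  # one file per micro-batch
print(plan_micro_batches(files, 2))  # at most two files per micro-batch
```

With a limit of 1, as used in this notebook, three waiting files take three micro-batches to drain, which makes each batch easy to inspect during testing.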

In [None]:
# Create streaming DataFrame
raw_stream = spark.readStream \
    .format("csv") \
    .schema(transaction_schema) \
    .option("header", "true") \
    .option("maxFilesPerTrigger", 1) \
    .load(INPUT_DIR)

print("Streaming DataFrame created successfully!")
print(f"Is Streaming: {raw_stream.isStreaming}")
print("\nSchema:")
raw_stream.printSchema()

## Register as Temporary View

Register the streaming DataFrame as a temporary SQL view. This enables us to use Spark SQL for all transformations.

In [None]:
# Register streaming DataFrame as temporary view
raw_stream.createOrReplaceTempView("raw_transactions")

print("Registered streaming DataFrame as 'raw_transactions' view")
print("This view can now be queried using Spark SQL")

## Test Query (Console Output)

Execute a simple streaming query to verify data ingestion. This writes to the console sink in `append` output mode.

**Note:** This is a test query. In production, we would not use console output.

In [None]:
# Test query - select a subset of columns (the console sink shows up to 10 rows per batch)
test_query = spark.sql("""
    SELECT 
        transaction_id,
        user_id,
        product_category,
        amount,
        status,
        event_time
    FROM raw_transactions
""")

# Write to console for verification (will run briefly)
console_query = test_query.writeStream \
    .queryName("console_test") \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .option("numRows", 10) \
    .trigger(processingTime='5 seconds') \
    .start()

print("Console output query started...")
print("Query ID:", console_query.id)
print("Query Name:", console_query.name)
print("\nNote: This query will run for 20 seconds for demonstration")

In [None]:
# Let it run for up to 20 seconds to see output; awaitTermination also
# re-raises any exception that terminated the query
console_query.awaitTermination(20)

# Stop the test query
console_query.stop()
print("Test query stopped.")

## Streaming Query Metrics

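List the streaming queries that are still active and print their status. Because the test query was stopped above, this list is expected to be empty unless other queries are running.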

In [None]:
# List all active streaming queries
active_streams = spark.streams.active
print(f"Active Streams: {len(active_streams)}")

for stream in active_streams:
    print(f"\nStream ID: {stream.id}")
    print(f"Name: {stream.name}")
    print(f"Status: {stream.status}")
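
The loop above prints status only. For per-batch metrics, a query's `lastProgress` can be inspected. The sketch below assumes the PySpark 3.x behavior where `lastProgress` is a dict (or `None` before the first batch completes); the helper name `summarize_progress` and the sample values are made up for illustration.

```python
def summarize_progress(progress):
    """Return a one-line summary of a progress report, or a placeholder."""
    if progress is None:
        return "No progress reported yet"
    return (
        f"batch={progress['batchId']} "
        f"rows={progress['numInputRows']} "
        f"in={progress.get('inputRowsPerSecond', 0.0):.1f}/s "
        f"out={progress.get('processedRowsPerSecond', 0.0):.1f}/s"
    )


# Made-up sample shaped like a progress report, not real output
sample_progress = {
    "batchId": 3,
    "numInputRows": 250,
    "inputRowsPerSecond": 50.0,
    "processedRowsPerSecond": 125.0,
}

print(summarize_progress(sample_progress))
print(summarize_progress(None))

# Hypothetical usage with a running query:
# for stream in spark.streams.active:
#     print(summarize_progress(stream.lastProgress))
```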

## Verify Schema Compliance

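Print each column's type and nullability as Spark resolved them. Note that file-based streaming sources treat every column as nullable by default (see `spark.sql.streaming.fileSource.schema.forceNullable`, Spark 3.0+), so fields declared with `False` in `transaction_schema` may still be reported as `NULL` here.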

In [None]:
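# Show data types as resolved by the streaming source
print("Column Data Types:")
for field in raw_stream.schema.fields:
    nullable = "NULL" if field.nullable else "NOT NULL"
    # simpleString() gives short type names such as "string" or "double"
    print(f"  {field.name:<20} {field.dataType.simpleString():<15} {nullable}")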

## Summary

This notebook successfully:

1. Initialized Spark with streaming configurations
2. Defined an explicit schema for type safety
3. Created a streaming DataFrame from CSV files
4. Registered the stream as a SQL temporary view
5. Verified data ingestion with a test query

**Key Takeaways:**
- Explicit schema is mandatory for production streaming
- Once a query starts, the file source created by `readStream` picks up new files as they arrive
- Streaming DataFrames can be registered as SQL views
- `raw_stream.isStreaming` returns `True` for a streaming DataFrame

**Next Steps:**
- Proceed to notebook 03 for data transformations using Spark SQL
- Apply data cleaning, type casting, and business logic
- Load SQL queries from external files