# Spark Streaming Enrichment with Parallel

This notebook demonstrates real-time data enrichment using Spark Structured Streaming and the Parallel API.

## Prerequisites

```bash
pip install "parallel-web-tools[spark]"
export PARALLEL_API_KEY="your-api-key"
```

In [None]:
from pyspark.sql import SparkSession

from parallel_web_tools.integrations.spark import enrich_streaming_batch

spark = SparkSession.builder.master("local[*]").appName("StreamingEnrichmentDemo").getOrCreate()

print(f"Spark version: {spark.version}")

## Create Streaming Source

We'll use a rate source to simulate streaming data. In production, the source would typically be Kafka, Kinesis, or a file-based source.

In [None]:
# Simulate streaming data: the rate source emits (timestamp, value) rows, here 1 row per second
stream_df = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 1)
    .load()
    .selectExpr(
        "value % 5 as company_id",
        "CASE value % 5 "
        "  WHEN 0 THEN 'Google' "
        "  WHEN 1 THEN 'Microsoft' "
        "  WHEN 2 THEN 'Apple' "
        "  WHEN 3 THEN 'Amazon' "
        "  ELSE 'Tesla' "
        "END as company_name",
    )
)

print("Streaming schema:")
stream_df.printSchema()
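
To make the mapping concrete, the sketch below mirrors the `selectExpr` logic in plain Python for the first few `value`s the rate source emits (0, 1, 2, ...). It does not touch Spark, and `COMPANIES` and `to_row` are hypothetical names introduced only for this illustration.

```python
# Plain-Python mirror of the CASE expression above (illustration only, no Spark involved).
COMPANIES = ["Google", "Microsoft", "Apple", "Amazon", "Tesla"]


def to_row(value: int) -> dict:
    """Map a rate-source `value` to the (company_id, company_name) pair."""
    company_id = value % 5
    return {"company_id": company_id, "company_name": COMPANIES[company_id]}


# The five names repeat in a fixed cycle as `value` increases.
for value in range(7):
    print(value, to_row(value))
```

Because the names cycle, every micro-batch contains the same five companies repeated several times.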

## Define Batch Processing Function

The `enrich_streaming_batch` function enriches each micro-batch using the Parallel Task Group API.

In [None]:
def process_batch(batch_df, batch_id):
    """Process each micro-batch with enrichment."""
    # Count once and reuse the result: each count() call triggers a separate Spark job
    row_count = batch_df.count()
    if row_count == 0:
        return

    print(f"\n{'=' * 50}")
    print(f"Processing batch {batch_id} ({row_count} rows)")
    print("=" * 50)

    # Enrich the batch
    enriched_df = enrich_streaming_batch(
        batch_df,
        input_columns={"company_name": "company_name"},
        output_columns=["CEO name", "Stock ticker symbol"],
        processor="lite-fast",
        timeout=120,
    )

    # Display results (in production, write to a sink)
    display(enriched_df.toPandas())

## Start Streaming Query

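The query runs for 90 seconds and triggers a micro-batch every 30 seconds. If enriching a batch takes longer than 30 seconds, the next batch starts as soon as the previous one finishes.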

In [None]:
import time

query = stream_df.writeStream.foreachBatch(process_batch).trigger(processingTime="30 seconds").start()

print("Streaming query started. Will run for 90 seconds...")
print("Watch for enriched batches below.\n")

# Run for 90 seconds (roughly 3 batches, depending on enrichment latency)
time.sleep(90)

query.stop()
print("\nStreaming query stopped.")
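
As a rough guide to what to expect from this run, the arithmetic below estimates batch sizes from the settings used in this notebook. It is an idealized sketch: it ignores startup time and enrichment latency, and the first trigger fires shortly after `start()`, so it may see few or no rows. The variable names are introduced here for illustration.

```python
# Idealized back-of-the-envelope numbers for this demo (not measured values).
ROWS_PER_SECOND = 1        # rate source setting used above
TRIGGER_INTERVAL_S = 30    # processingTime trigger interval
RUN_TIME_S = 90            # duration of the time.sleep call
DISTINCT_COMPANIES = 5     # value % 5 cycles through five names

rows_per_batch = ROWS_PER_SECOND * TRIGGER_INTERVAL_S
full_intervals = RUN_TIME_S // TRIGGER_INTERVAL_S
repeats_per_company = rows_per_batch // DISTINCT_COMPANIES

print(f"~{rows_per_batch} rows per full batch")
print(f"{full_intervals} full trigger intervals in {RUN_TIME_S} s")
print(f"each company appears ~{repeats_per_company} times per batch")
```

Since each batch holds only five distinct names, deduplicating on `company_name` before enrichment might reduce the number of API requests; whether `enrich_streaming_batch` already does this is not covered here.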

## Production Example: Write to Parquet

In production, you'd write enriched results to a sink instead of displaying them.

In [None]:
# Example: Write to Parquet (uncomment to run)

# def process_and_save(batch_df, batch_id):
#     if batch_df.count() == 0:
#         return
#
#     enriched_df = enrich_streaming_batch(
#         batch_df,
#         input_columns={"company_name": "company_name"},
#         output_columns=["CEO name", "headquarters"],
#         processor="lite-fast"
#     )
#
#     enriched_df.write.mode("append").parquet("/path/to/output")

print("See commented code above for production sink example.")
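
A long-running query usually also sets a checkpoint location so that Spark can resume from its last recorded progress after a restart. The sketch below shows one way to wire that up with the standard `DataStreamWriter` calls. The helper name `start_enrichment_query` and the checkpoint path in the usage comment are hypothetical, not part of `parallel_web_tools`.

```python
def start_enrichment_query(stream_df, process_fn, checkpoint_dir, interval="30 seconds"):
    """Start a foreachBatch query that records its progress in checkpoint_dir.

    stream_df      -- a streaming DataFrame (for example, stream_df from above)
    process_fn     -- a callable taking (batch_df, batch_id)
    checkpoint_dir -- directory on reliable storage (hypothetical path in this demo)
    """
    return (
        stream_df.writeStream
        .foreachBatch(process_fn)
        .option("checkpointLocation", checkpoint_dir)
        .trigger(processingTime=interval)
        .start()
    )


# Usage (assumes process_and_save from the commented example has been defined):
# query = start_enrichment_query(stream_df, process_and_save, "/path/to/checkpoints")
# query.awaitTermination()
```

With `foreachBatch`, a batch can be re-run after a failure, so a sink that appends blindly may end up with duplicate rows. The `batch_id` argument can be used to make writes idempotent.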

## Cleanup

In [None]:
spark.stop()
print("Spark session stopped.")