<a href="https://colab.research.google.com/github/margaridagomes/dataeng-basic-course/blob/main/spark_streaming/challenges/final_challenges.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up PySpark

In [None]:
%pip install pyspark



# Context
Message events arrive from the platform's message broker (Kafka, Pub/Sub, Kinesis, ...).
You need to process the data according to the requirements.

Message schema:
- timestamp
- value
- event_type
- message_id
- channel
- country_id
- user_id



# Challenge 1

Step 1
- Change the existing producer
	- Change parquet location to "/content/lake/bronze/messages/data"
	- Add checkpoint (/content/lake/bronze/messages/checkpoint)
	- Delete /content/lake/bronze/messages and reprocess data
	- For reprocessing, run the streaming for at least 1 minute, then stop it

Step 2
- Implement a new streaming job that reads the messages from the bronze layer and splits the result into two locations
	- "messages_corrupted"
		- logic: event_type is null, empty or equal to "NONE"
		- extra logic: add country name by joining message with countries dataset
		- partition by "date" (extracted from timestamp)
		- location: /content/lake/silver/messages_corrupted/data

	- "messages"
		- logic: not corrupted data
		- extra logic: add country name by joining message with countries dataset
		- partition by "date" (extracted from timestamp)
		- location: /content/lake/silver/messages/data

	- technical requirements
		- add checkpoint (choose location)
		- use a StructType schema
		- Set trigger interval to 5 seconds
		- run streaming for at least 20 seconds, then stop it

	- alternatives
		- implement a single streaming job with foreach/foreachBatch logic to write into two locations
		- implement two streaming jobs, one for messages and another for messages_corrupted
		- (pay attention to the paths and checkpoints)


	- check results
		- the record count of messages in the bronze layer should match the sum of messages + messages_corrupted in the silver layer
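
The Step 2 split rule can be sketched in plain Python, independent of Spark. This is only an illustration of the logic: the helper names `is_corrupted` and `partition_date` and the sample rows are hypothetical, and the rule is applied to the `event_type` column that the producer writes.

```python
from datetime import datetime

def is_corrupted(event_type):
    """True when event_type is null, empty, or the literal "NONE"."""
    return event_type is None or event_type in ("", "NONE")

def partition_date(ts):
    """Derive the "date" partition value (yyyy-MM-dd) from a timestamp."""
    return ts.strftime("%Y-%m-%d")

# Hypothetical sample rows shaped like the bronze messages
rows = [
    {"timestamp": datetime(2025, 7, 13, 1, 4, 27), "event_type": "RECEIVED"},
    {"timestamp": datetime(2025, 7, 13, 1, 4, 28), "event_type": ""},
    {"timestamp": datetime(2025, 7, 13, 1, 4, 29), "event_type": "NONE"},
    {"timestamp": datetime(2025, 7, 13, 1, 4, 30), "event_type": None},
]

corrupted = [r for r in rows if is_corrupted(r["event_type"])]
clean = [r for r in rows if not is_corrupted(r["event_type"])]

# Every input row lands in exactly one of the two outputs,
# which is the invariant the "check results" step relies on
assert len(clean) + len(corrupted) == len(rows)
print(len(clean), len(corrupted), partition_date(rows[0]["timestamp"]))
```

Because the two filters are exact complements, the bronze count equals the sum of both silver counts as long as every bronze row has been processed.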

In [None]:
%pip install faker



# Producer

In [None]:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test streaming').getOrCreate()
sc = spark.sparkContext

fake = Faker()
messages = [fake.uuid4() for _ in range(50)]

def enrich_data(df, messages=messages):
  fake = Faker()
  new_columns = {
      'event_type': F.lit(fake.random_element(elements=('OPEN', 'RECEIVED', 'SENT', 'CREATED', 'CLICKED', '', 'NONE'))),
      'message_id': F.lit(fake.random_element(elements=messages)),
      'channel': F.lit(fake.random_element(elements=('CHAT', 'EMAIL', 'SMS', 'PUSH', 'OTHER'))),
      'country_id': F.lit(fake.random_int(min=2000, max=2015)),
      'user_id': F.lit(fake.random_int(min=1000, max=1050)),
  }
  df = df.withColumns(new_columns)
  return df

def insert_messages(df: DataFrame, batch_id):
  enrich = enrich_data(df)
  enrich.write.mode("append").format("parquet").save("content/lake/bronze/messages")

# read stream
df_stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# write stream
query = (df_stream.writeStream
.outputMode('append')
.trigger(processingTime='1 seconds')
.foreachBatch(insert_messages)
.start()
)

query.awaitTermination(60)


False

In [None]:
query.stop()

In [None]:
df = spark.read.format("parquet").load("content/lake/bronze/messages/*")
df.show()

# === CHALLENGE 1 - STEP 1: PRODUCER ===

In [63]:
import os
import shutil
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession
from faker import Faker

# ----------------------------------------
# Spark session initialization and config
# ----------------------------------------
def validate_and_initialize_spark():
    """
    Validate and initialize Spark session with proper configuration.
    If a Spark session is already active, reuse it.
    """
    try:
        # Check if Spark session already exists
        existing_spark = SparkSession.getActiveSession()

        if existing_spark:
            spark = existing_spark
        else:
            # Initialize Spark Session with settings for streaming
            spark = SparkSession.builder.appName("StreamingETLProducer").getOrCreate()

        return spark

    except Exception as e:
        print(f"Spark session validation failed: {e}")
        raise e

# Initialize and validate Spark session
spark = validate_and_initialize_spark()
sc = spark.sparkContext

# ----------------------------------------
# Clean up and prepare data directories
# ----------------------------------------
def cleanup_and_setup_directories():
    """
    Clean up existing directories and create a fresh structure for the bronze/messages layer.
    Removes any previous run's data and ensures a clean slate for new data.
    """
    base_dir = "content/lake/bronze/messages"

    # Remove existing bronze directory as required
    if os.path.exists(base_dir):
        shutil.rmtree(base_dir)

    # Create new directory structure with updated paths
    os.makedirs(os.path.join(base_dir, "data"), exist_ok=True)
    os.makedirs(os.path.join(base_dir, "checkpoint"), exist_ok=True)

# ----------------------------------------
# Producer: Generate and stream fake data
# ----------------------------------------
def run_updated_producer():
    """
    Run the streaming producer with new configuration.
    Generates fake message events and writes them in micro-batches to Parquet files using Spark Structured Streaming.
    Runs for 1 minute.
    """
    # Generate fake messages (as in original code)
    fake = Faker()
    messages = [fake.uuid4() for _ in range(50)]

    def enrich_data(df, messages=messages):
        """
        Enriches a DataFrame with random event metadata using Faker.
        Adds event_type, message_id, channel, country_id, and user_id.
        """
        fake = Faker()
        new_columns = {
            'event_type': F.lit(fake.random_element(elements=('OPEN', 'RECEIVED', 'SENT', 'CREATED', 'CLICKED', '', 'NONE'))),
            'message_id': F.lit(fake.random_element(elements=messages)),
            'channel': F.lit(fake.random_element(elements=('CHAT', 'EMAIL', 'SMS', 'PUSH', 'OTHER'))),
            'country_id': F.lit(fake.random_int(min=2000, max=2015)),
            'user_id': F.lit(fake.random_int(min=1000, max=1050)),
        }
        df = df.withColumns(new_columns)
        return df

    def insert_messages(df: DataFrame, batch_id):
        """
        Called for each micro-batch.
        Enriches the DataFrame and writes to Parquet in append mode.
        """
        enrich = enrich_data(df)
        enrich.write.mode("append").format("parquet").save("content/lake/bronze/messages/data")

    # Create streaming source with a rate of 1 row per second
    df_stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    # Start streaming with checkpointing enabled for fault-tolerance
    query = (df_stream.writeStream
             .outputMode('append')
             .trigger(processingTime='1 seconds')
             .option("checkpointLocation", "content/lake/bronze/messages/checkpoint")  # Ensure recovery if interrupted
             .foreachBatch(insert_messages)
             .start())

    print("2. Producer streaming for 1 minute as required...")

    # Block for up to 60 seconds; the query itself keeps running until it is stopped
    query.awaitTermination(60)

    print("✅ Producer completed successfully")

    return query

# ----------------------------------------
# Display results after streaming is done
# ----------------------------------------
def show_producer_results():
    """
    Reads and displays results from the produced Parquet data.
    Shows total records, schema, and a sample of the data.
    """
    print("3. Producer Results")

    try:
        # Read produced data
        produced_data = spark.read.parquet("content/lake/bronze/messages/data")
        total_records = produced_data.count()

        print(f"Total records produced: {total_records}")

        # Show schema
        print("Data Schema:")
        produced_data.printSchema()

        # Show sample data
        print("Sample Data (first 10 records):")
        produced_data.show(10, truncate=False)

        return produced_data

    except Exception as e:
        print(f"❌ Error reading producer results: {e}")
        return None

# ----------------------------------------
# Resource cleanup
# ----------------------------------------
def cleanup_producer_resources():
    """
    Stops all active Spark streaming queries to free resources.
    Prints a warning if unable to stop any stream.
    """
    try:
        # Stop any active streaming queries
        active_streams = spark.streams.active
        if active_streams:
            for stream in active_streams:
                try:
                    stream.stop()
                except Exception as e:
                    print(f"⚠️  Warning stopping stream {stream.id}: {e}")

        print("Producer cleanup completed")

    except Exception as e:
        print(f"⚠️  Warning during producer cleanup: {e}")

# =====================================
# MAIN EXECUTION: Run the complete step
# =====================================
def run_producer_step1():
    """
    Executes Step 1 of Challenge 1:
    1. Cleans up and prepares directories,
    2. Runs the producer for 1 minute,
    3. Displays results,
    4. Always performs resource cleanup.
    """
    print("Executing Challenge 1 - Step 1: Producer")

    try:
        # 1. Setup directories
        cleanup_and_setup_directories()
        print("1. Cleanup and Setup Directories")

        # 2. Run updated producer, then stop it so no batch is written while results are read
        query = run_updated_producer()
        query.stop()

        # 3. Show results
        produced_data = show_producer_results()

        print("Step 1 completed successfully!")

        return produced_data

    finally:
        # Always cleanup
        cleanup_producer_resources()

In [64]:
run_producer_step1()

Executing Challenge 1 - Step 1: Producer
1. Cleanup and Setup Directories
2. Producer streaming for 1 minute as required...
✅ Producer completed successfully
3. Producer Results
Total records produced: 59
Data Schema:
root
 |-- timestamp: timestamp (nullable = true)
 |-- value: long (nullable = true)
 |-- event_type: string (nullable = true)
 |-- message_id: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- country_id: integer (nullable = true)
 |-- user_id: integer (nullable = true)

Sample Data (first 10 records):
+-----------------------+-----+----------+------------------------------------+-------+----------+-------+
|timestamp              |value|event_type|message_id                          |channel|country_id|user_id|
+-----------------------+-----+----------+------------------------------------+-------+----------+-------+
|2025-07-13 01:04:27.546|21   |RECEIVED  |1b692875-885d-432d-9e09-9c7d6f6bdca3|EMAIL  |2015      |1038   |
|2025-07-13 01:04:35.546|29   |

DataFrame[timestamp: timestamp, value: bigint, event_type: string, message_id: string, channel: string, country_id: int, user_id: int]

# Additional datasets

In [None]:
countries = [
    {"country_id": 2000, "country": "Brazil"},
    {"country_id": 2001, "country": "Portugal"},
    {"country_id": 2002, "country": "Spain"},
    {"country_id": 2003, "country": "Germany"},
    {"country_id": 2004, "country": "France"},
    {"country_id": 2005, "country": "Italy"},
    {"country_id": 2006, "country": "United Kingdom"},
    {"country_id": 2007, "country": "United States"},
    {"country_id": 2008, "country": "Canada"},
    {"country_id": 2009, "country": "Australia"},
    {"country_id": 2010, "country": "Japan"},
    {"country_id": 2011, "country": "China"},
    {"country_id": 2012, "country": "India"},
    {"country_id": 2013, "country": "South Korea"},
    {"country_id": 2014, "country": "Russia"},
    {"country_id": 2015, "country": "Argentina"}
]

countries = spark.createDataFrame(countries)

# === CHALLENGE 1 - STEP 2 ===


# Streaming Messages x Messages Corrupted

In [65]:
import os
import time
import shutil
from pyspark.sql import SparkSession
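from pyspark.sql.functions import broadcast, col, date_format
from pyspark.sql.types import (
    IntegerType,
    LongType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)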

# =====================================
# Countries Dataset Setup
# =====================================
def create_countries_dataset():
    """
    Create static dataset mapping country_id to country name.
    """
    countries = [
        {"country_id": 2000, "country": "Brazil"},
        {"country_id": 2001, "country": "Portugal"},
        {"country_id": 2002, "country": "Spain"},
        {"country_id": 2003, "country": "Germany"},
        {"country_id": 2004, "country": "France"},
        {"country_id": 2005, "country": "Italy"},
        {"country_id": 2006, "country": "United Kingdom"},
        {"country_id": 2007, "country": "United States"},
        {"country_id": 2008, "country": "Canada"},
        {"country_id": 2009, "country": "Australia"},
        {"country_id": 2010, "country": "Japan"},
        {"country_id": 2011, "country": "China"},
        {"country_id": 2012, "country": "India"},
        {"country_id": 2013, "country": "South Korea"},
        {"country_id": 2014, "country": "Russia"},
        {"country_id": 2015, "country": "Argentina"}
    ]

    countries_df = spark.createDataFrame(countries)
    return countries_df

# =====================================
# Define schema for bronze layer data
# =====================================
def define_bronze_schema():
    """
    Define the expected schema for messages in the bronze layer.
    Ensures consistent parsing of timestamp, IDs, and event attributes.
    """
    return StructType([
        StructField("timestamp", TimestampType(), True),
        StructField("value", LongType(), True),
        StructField("event_type", StringType(), True),
        StructField("message_id", StringType(), True),
        StructField("channel", StringType(), True),
        StructField("country_id", IntegerType(), True),
        StructField("user_id", IntegerType(), True)
    ])

# ---------------------------------------------
# Clean up and prepare silver layer directories
# ---------------------------------------------
def setup_silver_directories():
    """
    Clean up and recreate silver layer directories.
    Ensures clean separation of valid and corrupted messages.
    """
    silver_paths = [
        "content/lake/silver/messages",
        "content/lake/silver/messages_corrupted",
        "content/lake/silver/etl_checkpoint"
    ]

    for path in silver_paths:
        if os.path.exists(path):
            shutil.rmtree(path)

    os.makedirs("content/lake/silver/messages/data", exist_ok=True)
    os.makedirs("content/lake/silver/messages_corrupted/data", exist_ok=True)
    os.makedirs("content/lake/silver/etl_checkpoint/checkpoint", exist_ok=True)

# =====================================
# Foreachbatch processing
# =====================================
def process_batch_to_both_sinks(batch_df, batch_id):
    """
    Split incoming micro-batch into corrupted and clean messages.
    Enrich with country name and write each set to its respective silver path.
    """
    if batch_df.isEmpty():
        print(f"📭 Batch {batch_id}: No data")
        return

    countries_df = create_countries_dataset()

    enriched_batch = batch_df.join(
        broadcast(countries_df), "country_id", "left"
    ).withColumn("country_name", col("country")).drop("country")

    enriched_batch = enriched_batch.withColumn(
        "date", date_format(col("timestamp"), "yyyy-MM-dd")
    )

    # Corruption rule: event_type is null, empty, or the literal "NONE"
    is_corrupted = col("event_type").isNull() | (col("event_type") == "") | (col("event_type") == "NONE")
    corrupted = enriched_batch.filter(is_corrupted)
    clean = enriched_batch.filter(~is_corrupted)

    if not corrupted.isEmpty():
        corrupted.write.mode("append").partitionBy("date").parquet("content/lake/silver/messages_corrupted/data")
    if not clean.isEmpty():
        clean.write.mode("append").partitionBy("date").parquet("content/lake/silver/messages/data")

# ----------------------------------------
# Streaming query setup and execution
# ----------------------------------------
def start_single_stream_with_foreach():
    """
    Start structured streaming query using foreachBatch logic.
    Reads from bronze and routes to silver based on corruption logic.
    """
    setup_silver_directories()

    bronze_schema = define_bronze_schema()

    bronze_stream = spark.readStream \
        .format("parquet") \
        .schema(bronze_schema) \
        .load("content/lake/bronze/messages/data")

    return bronze_stream.writeStream \
        .foreachBatch(process_batch_to_both_sinks) \
        .option("checkpointLocation", "content/lake/silver/etl_checkpoint/checkpoint") \
        .trigger(processingTime="5 seconds") \
        .start()

# ----------------------------------------
# Resource cleanup
# ----------------------------------------
def cleanup_etl_resources():
    """
    Stops all active Spark streaming queries to free resources.
    Prints a warning if unable to stop any stream.
    """
    try:
        active_streams = spark.streams.active
        if active_streams:
            for stream in active_streams:
                try:
                    stream.stop()
                except Exception as e:
                    print(f"⚠️  Warning stopping stream {stream.id}: {e}")
        print("ETL cleanup completed")
    except Exception as e:
        print(f"⚠️  Warning during ETL cleanup: {e}")

# =====================================
# MAIN EXECUTION
# =====================================
def run_etl_step2(run_time=20):
    """
    Executes Step 2 of Challenge 1: Main execution for Bronze to Silver ETL pipeline using foreachBatch.
    Only runs the streaming process and stops it after run_time seconds.
    """
    print("Executing Challenge 1 - Step 2: Bronze to Silver Layer")

    try:
        query = start_single_stream_with_foreach()
        time.sleep(run_time)
        query.stop()
        print("✅ Stream stopped successfully after ETL run")

        # Quick status
        try:
            clean_count = spark.read.parquet("content/lake/silver/messages/data").count()
            corrupted_count = spark.read.parquet("content/lake/silver/messages_corrupted/data").count()
            print(f"Quick results: {clean_count} clean, {corrupted_count} corrupted")
        except Exception as e:
            print(f"Data still processing or not available yet: {e}")

    except Exception as e:
        print(f"❌ ETL failed: {e}")
        raise e
    finally:
        cleanup_etl_resources()

In [66]:
run_etl_step2()

Executing Challenge 1 - Step 2: Bronze to Silver Layer
✅ Stream stopped successfully after ETL run
Quick results: 43 clean, 16 corrupted
ETL cleanup completed


## Checking data

In [69]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# ----------------------------
# Count and compare all layers
# ----------------------------
def validate_counts():
    """
    Compare bronze layer record count with total of silver clean + corrupted.
    Return summary with counts and validation result.
    """
    try:
        bronze_count = spark.read.parquet("content/lake/bronze/messages/data").count()
    except Exception as e:
        print(f"❌ Error reading bronze layer: {e}")
        return None

    # A missing or empty silver path counts as zero records
    try:
        clean_count = spark.read.parquet("content/lake/silver/messages/data").count()
    except Exception:
        clean_count = 0

    try:
        corrupted_count = spark.read.parquet("content/lake/silver/messages_corrupted/data").count()
    except Exception:
        corrupted_count = 0

    silver_total = clean_count + corrupted_count
    result = {
        "bronze": bronze_count,
        "clean": clean_count,
        "corrupted": corrupted_count,
        "silver_total": silver_total,
        "match": bronze_count == silver_total
    }

    print("\n📊 COUNT SUMMARY")
    print(f"Bronze:           {bronze_count}")
    print(f"Silver Clean:     {clean_count}")
    print(f"Silver Corrupted: {corrupted_count}")
    print(f"Silver Total:     {silver_total}")
    print(f"Match:            {'✅ YES' if result['match'] else '❌ NO'}")

    return result

# ----------------------------
# Show 5 sample records
# ----------------------------
def show_samples():
    """Display 5 records from each layer for quick inspection."""
    try:
        print("\n🟦 Bronze Sample:")
        spark.read.parquet("content/lake/bronze/messages/data").show(5, truncate=False)
    except Exception:
        print("No bronze data found")

    try:
        print("\n🟢 Silver Clean Sample:")
        spark.read.parquet("content/lake/silver/messages/data").show(5, truncate=False)
    except Exception:
        print("No clean data found")

    try:
        print("\n🔴 Silver Corrupted Sample:")
        spark.read.parquet("content/lake/silver/messages_corrupted/data").show(5, truncate=False)
    except Exception:
        print("No corrupted data found")

# ----------------------------
# Validate Results
# ----------------------------
def validate_results():
    """Run validation."""
    result = validate_counts()
    return result

In [71]:
validate_results()


📊 COUNT SUMMARY
Bronze:           59
Silver Clean:     43
Silver Corrupted: 16
Silver Total:     59
Match:            ✅ YES


{'bronze': 59, 'clean': 43, 'corrupted': 16, 'silver_total': 59, 'match': True}

# Challenge 2

- Run business report
- But first: a bug in the system is causing some duplicated messages, and these rows must be excluded from the report

- removing duplicates logic:
  - Identify possible duplicates on message_id, event_type and channel
  - In case of duplicates, keep only the first message (earliest timestamp)
  - Example: in the table below, the correct message to keep is the second row

```
    message_id | channel | event_type | timestamp
    123        | CHAT    | CREATED    | 10:10:01
    123        | CHAT    | CREATED    | 07:56:45 (first occurrence)
    123        | CHAT    | CREATED    | 08:13:33
```

- After cleaning the data, we can create the business report
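
The "keep the first occurrence" rule can be illustrated without Spark. The sketch below is plain Python; the function name `dedup_first_occurrence` is hypothetical, and the input reuses the three rows from the example table.

```python
def dedup_first_occurrence(rows):
    """Keep, per (message_id, event_type, channel), the row with the earliest timestamp."""
    first = {}
    for row in rows:
        key = (row["message_id"], row["event_type"], row["channel"])
        if key not in first or row["timestamp"] < first[key]["timestamp"]:
            first[key] = row
    return list(first.values())

# The three rows from the example table. Timestamps are zero-padded
# HH:MM:SS strings, so string comparison matches chronological order.
rows = [
    {"message_id": 123, "channel": "CHAT", "event_type": "CREATED", "timestamp": "10:10:01"},
    {"message_id": 123, "channel": "CHAT", "event_type": "CREATED", "timestamp": "07:56:45"},
    {"message_id": 123, "channel": "CHAT", "event_type": "CREATED", "timestamp": "08:13:33"},
]

deduped = dedup_first_occurrence(rows)
print(deduped)
```

Only the 07:56:45 row survives, which is the second line of the table. In Spark, the same rule is commonly expressed with a window partitioned by the three key columns and ordered by timestamp.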

In [None]:
# dedup data: keep the earliest row per (message_id, event_type, channel)
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.read.format("parquet").load("content/lake/silver/messages/data")
first_by_key = Window.partitionBy("message_id", "event_type", "channel").orderBy("timestamp")
dedup = (
    df.withColumn("row_number", F.row_number().over(first_by_key))
    .filter("row_number = 1")
    .drop("row_number")
)

## Report 1
- Aggregate data by date, event_type and channel
- Count number of messages
- Pivot event_type from rows into columns
- schema expected:

```
|      date|channel|CLICKED|CREATED|OPEN|RECEIVED|SENT|
+----------+-------+-------+-------+----+--------+----+
|2024-12-03|    SMS|      4|      4|   1|       1|   5|
|2024-12-03|   CHAT|      3|      7|   5|       8|   4|
|2024-12-03|   PUSH|   NULL|      3|   4|       3|   4|
```
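
A minimal sketch of the count-and-pivot logic behind this report, written in plain Python so that it runs without a Spark session. The function name `pivot_event_counts` and the sample rows are hypothetical; `None` stands in for the `NULL` shown in the expected schema when a combination has no messages.

```python
from collections import Counter

def pivot_event_counts(rows):
    """Count messages per (date, channel) and spread event_type into columns."""
    counts = Counter((r["date"], r["channel"], r["event_type"]) for r in rows)
    event_types = sorted({r["event_type"] for r in rows})
    report = {}
    for (date, channel, event_type), n in counts.items():
        # Start each output row with None (NULL) for every event_type column
        row = report.setdefault((date, channel), dict.fromkeys(event_types))
        row[event_type] = n
    return event_types, report

# Hypothetical deduplicated rows, already carrying a "date" column
rows = [
    {"date": "2024-12-03", "channel": "SMS", "event_type": "CLICKED"},
    {"date": "2024-12-03", "channel": "SMS", "event_type": "CLICKED"},
    {"date": "2024-12-03", "channel": "SMS", "event_type": "SENT"},
    {"date": "2024-12-03", "channel": "PUSH", "event_type": "SENT"},
]

columns, report = pivot_event_counts(rows)
for (date, channel), counts_by_type in report.items():
    print(date, channel, [counts_by_type[c] for c in columns])
```

In PySpark the same shape is usually obtained with `groupBy("date", "channel").pivot("event_type").count()` on the deduplicated DataFrame; the version above only illustrates the logic.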

In [None]:
# report 1
# TODO

## Report 2

- Identify the most active users by channel (sorted by number of iterations)
- schema expected:

```
+-------+----------+----+-----+-----+----+---+
|user_id|iterations|CHAT|EMAIL|OTHER|PUSH|SMS|
+-------+----------+----+-----+-----+----+---+
|   1022|         5|   2|    0|    1|   0|  2|
|   1004|         4|   1|    1|    1|   1|  0|
|   1013|         4|   0|    0|    2|   1|  1|
|   1020|         4|   2|    0|    1|   1|  0|
```
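
The ranking can be sketched in plain Python as well. The names `most_active_users` and `CHANNELS` and the sample rows are hypothetical. Breaking ties by `user_id` is an assumption; it happens to match the order in the expected output. Missing channels are filled with 0, as in the expected schema.

```python
from collections import Counter, defaultdict

# Channel columns, in the order shown in the expected schema
CHANNELS = ("CHAT", "EMAIL", "OTHER", "PUSH", "SMS")

def most_active_users(rows):
    """Return one record per user: total iterations plus a count per channel."""
    per_user = defaultdict(Counter)
    for r in rows:
        per_user[r["user_id"]][r["channel"]] += 1

    report = []
    for user_id, by_channel in per_user.items():
        record = {"user_id": user_id, "iterations": sum(by_channel.values())}
        record.update({ch: by_channel.get(ch, 0) for ch in CHANNELS})
        report.append(record)

    # Most active first; ties broken by user_id (assumption)
    report.sort(key=lambda rec: (-rec["iterations"], rec["user_id"]))
    return report

# Hypothetical deduplicated rows
rows = [
    {"user_id": 1022, "channel": "CHAT"},
    {"user_id": 1022, "channel": "SMS"},
    {"user_id": 1022, "channel": "SMS"},
    {"user_id": 1004, "channel": "EMAIL"},
    {"user_id": 1004, "channel": "PUSH"},
    {"user_id": 1013, "channel": "OTHER"},
]

report = most_active_users(rows)
for record in report:
    print(record)
```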


In [None]:
# report 2
# TODO

# Challenge 3

In [None]:
# Theoretical question:

# A new use case requires the message data to be aggregated in near real time
# They want to build a dashboard embedded in the platform website to analyze message data with low latency (a few minutes)
# This application will directly access the data aggregated by the streaming process

# Q1:
- What would you suggest to achieve that using Spark Structured Streaming?
Or would you choose a different data processing tool?

# Q2:
- Which storage would you use and why? (database?, data lake?, kafka?)



In [None]:
# real-time aggregation, ending with a real-time dashboard that shows the results (low latency)
# the engine that serves the aggregated data must be fast at accessing and displaying it
# which storage? database? data lake? Kafka?
# would we use Spark, or another way of processing the data?
# test the architecture logic