<a href="https://colab.research.google.com/github/jorgeneves16/dataeng-dataprocessing/blob/main/datastreaming_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up PySpark

In [3]:
%pip install pyspark



# Context
Message events arrive from the platform's message broker (Kafka, Pub/Sub, Kinesis, ...).
Process the data according to the requirements below.

Message schema:
- timestamp
- value
- event_type
- message_id
- channel
- country_id
- user_id



# Challenge 1

Step 1
- Change the existing producer
	- Change parquet location to "/content/lake/bronze/messages/data"
	- Add checkpoint (/content/lake/bronze/messages/checkpoint)
	- Delete /content/lake/bronze/messages and reprocess data
	- For reprocessing, run the streaming for at least 1 minute, then stop it

Step 2
- Implement a new stream job that reads messages from the bronze layer and splits the result into two locations
	- "messages_corrupted"
		- logic: event_type is null, empty or equal to "NONE"
		- extra logic: add country name by joining message with countries dataset
		- partition by "date" (extracted from timestamp)
		- location: /content/lake/silver/messages_corrupted/data

	- "messages"
		- logic: not corrupted data
		- extra logic: add country name by joining message with countries dataset
		- partition by "date" (extracted from timestamp)
		- location: /content/lake/silver/messages/data

	- technical requirements
		- add checkpoint (choose location)
		- use an explicit schema (StructType)
		- Set trigger interval to 5 seconds
		- run streaming for at least 20 seconds, then stop it

	- alternatives
		- implement a single streaming job with foreach/foreachBatch logic to write into two locations
		- implement two streaming jobs, one for messages and another for messages_corrupted
		- (pay attention to the paths and checkpoints)


	- Check results:
		- the message count in the bronze layer should equal the sum of messages + messages_corrupted in the silver layer

In [4]:
%pip install faker



In [1]:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from faker import Faker
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test streaming').getOrCreate()
sc = spark.sparkContext

# Producer

In [2]:
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from faker import Faker
from pyspark.sql import SparkSession

fake = Faker()
messages = [fake.uuid4() for _ in range(50)]

def enrich_data(df, messages=messages):
  fake = Faker()
  new_columns = {
      'event_type': F.lit(fake.random_element(elements=('OPEN', 'RECEIVED', 'SENT', 'CREATED', 'CLICKED', '', 'NONE'))),
      'message_id': F.lit(fake.random_element(elements=messages)),
      'channel': F.lit(fake.random_element(elements=('CHAT', 'EMAIL', 'SMS', 'PUSH', 'OTHER'))),
      'country_id': F.lit(fake.random_int(min=2000, max=2015)),
      'user_id': F.lit(fake.random_int(min=1000, max=1050)),
  }
  df = df.withColumns(new_columns)
  return df

def insert_messages(df: DataFrame, batch_id):
  enrich = enrich_data(df)
  enrich.write.mode("append").format("parquet").save("content/lake/bronze/messages/data")



# Delete old directory data
import shutil
import os
base_path = "content/lake/bronze/messages/"
if os.path.exists(base_path):
    shutil.rmtree(base_path)
    print("Old data deleted with success.")

# read stream
df_stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

# write stream
query = (df_stream.writeStream
.outputMode('append')
.option('checkpointLocation', 'content/lake/bronze/messages/checkpoint')
.trigger(processingTime='1 seconds')
.foreachBatch(insert_messages)
.start()
)

query.awaitTermination(60)
print("Event producing ended.")

Old data deleted with success.
Event producing ended.


In [3]:
query.stop()

In [6]:
df = spark.read.format("parquet").load("content/lake/bronze/messages/data/")

df.show()

+--------------------+-----+----------+--------------------+-------+----------+-------+
|           timestamp|value|event_type|          message_id|channel|country_id|user_id|
+--------------------+-----+----------+--------------------+-------+----------+-------+
|2025-07-05 14:54:...|    1|          |ec9f31d9-a72f-47d...|    SMS|      2007|   1016|
|2025-07-05 14:54:...|    3|          |ec9f31d9-a72f-47d...|    SMS|      2007|   1016|
|2025-07-05 14:54:...|    5|          |ec9f31d9-a72f-47d...|    SMS|      2007|   1016|
|2025-07-05 14:54:...|    0|          |ec9f31d9-a72f-47d...|    SMS|      2007|   1016|
|2025-07-05 14:54:...|    2|          |ec9f31d9-a72f-47d...|    SMS|      2007|   1016|
|2025-07-05 14:54:...|    4|          |ec9f31d9-a72f-47d...|    SMS|      2007|   1016|
|2025-07-05 14:55:...|   27|  RECEIVED|9bca5d73-8c7b-4ae...|  EMAIL|      2004|   1039|
|2025-07-05 14:54:...|    9|  RECEIVED|1add1028-1d56-4d0...|  EMAIL|      2000|   1021|
|2025-07-05 14:55:...|   55|  RE

# Additional datasets

In [5]:
countries = [
    {"country_id": 2000, "country": "Brazil"},
    {"country_id": 2001, "country": "Portugal"},
    {"country_id": 2002, "country": "Spain"},
    {"country_id": 2003, "country": "Germany"},
    {"country_id": 2004, "country": "France"},
    {"country_id": 2005, "country": "Italy"},
    {"country_id": 2006, "country": "United Kingdom"},
    {"country_id": 2007, "country": "United States"},
    {"country_id": 2008, "country": "Canada"},
    {"country_id": 2009, "country": "Australia"},
    {"country_id": 2010, "country": "Japan"},
    {"country_id": 2011, "country": "China"},
    {"country_id": 2012, "country": "India"},
    {"country_id": 2013, "country": "South Korea"},
    {"country_id": 2014, "country": "Russia"},
    {"country_id": 2015, "country": "Argentina"}
]

countries = spark.createDataFrame(countries)

# Streaming Messages x Messages Corrupted

In [6]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, to_date
from pyspark.sql.types import *

# 1. Spark Session
def create_spark_session(app_name="Streaming Job"):
    return SparkSession.builder.master("local").appName(app_name).getOrCreate()

# 2. Schema definition
def get_message_schema():
    return StructType([
        StructField("timestamp", TimestampType(), True),
        StructField("value", LongType(), True),  # rate-source counter; fields missing from the schema are dropped on read
        StructField("event_type", StringType(), True),
        StructField("message_id", StringType(), True),
        StructField("channel", StringType(), True),
        StructField("country_id", IntegerType(), True),
        StructField("user_id", IntegerType(), True)
    ])

# 3. Read stream from bronze
def read_bronze_stream(spark: SparkSession, schema: StructType, path: str) -> DataFrame:
    return (spark.readStream
        .schema(schema)
        .format("parquet")
        .load(path)
    )

# 4. Filter logic
def filter_messages(df: DataFrame, is_corrupted=True) -> DataFrame:
    if is_corrupted:
        return df.filter(
            col("event_type").isNull() |
            (col("event_type") == "") |
            (col("event_type") == "NONE")
        )
    else:
        return df.filter(
            col("event_type").isNotNull() &
            (col("event_type") != "") &
            (col("event_type") != "NONE")
        )

# 5. Join with countries and add partition date
def enrich_and_partition(df: DataFrame, countries_df: DataFrame) -> DataFrame:
    return (df.join(countries_df, on="country_id", how="left")
             .withColumn("date", to_date(col("timestamp"))))

# 6. Write stream to silver
def write_to_silver(df: DataFrame, path: str, checkpoint_path: str):
    return (df.writeStream
        .format("parquet")
        .option("path", path)
        .option("checkpointLocation", checkpoint_path)
        .partitionBy("date")
        .outputMode("append")
        .trigger(processingTime="5 seconds")
        .start()
    )


# Setup
spark = create_spark_session("Messages Processor")
schema = get_message_schema()

# Read of bronze layer data
df_stream = read_bronze_stream(spark, schema, "content/lake/bronze/messages/data/")

# Valid messages:
df_valid = filter_messages(df_stream, is_corrupted=False)
df_valid_enriched = enrich_and_partition(df_valid, countries)
query_valid = write_to_silver(
    df_valid_enriched,
    "content/lake/silver/messages/data",
    "content/lake/silver/messages/checkpoint"
)

# Corrupted messages:
df_corrupted = filter_messages(df_stream, is_corrupted=True)
df_corrupted_enriched = enrich_and_partition(df_corrupted, countries)
query_corrupted = write_to_silver(
    df_corrupted_enriched,
    "content/lake/silver/messages_corrupted/data",
    "content/lake/silver/messages_corrupted/checkpoint"
)

query_valid.awaitTermination(20)
query_corrupted.awaitTermination(20)


In [13]:
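# Stop both silver streaming queries (query is the producer, which was already stopped above)
query_valid.stop()
query_corrupted.stop()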

In [4]:
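# Delete old silver data. Run this before starting the silver streaming job,
# not after it, otherwise the results checked below are removed.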
import shutil
import os
base_path = "content/lake/silver/"
if os.path.exists(base_path):
    shutil.rmtree(base_path)
    print("Old data deleted with success.")

## Checking data

In [7]:
df_bronze = spark.read.format("parquet").load("content/lake/bronze/messages/data/")
bronze_count = df_bronze.count()
print("Bronze count:", bronze_count)

df_valid = spark.read.format("parquet").load("content/lake/silver/messages/data/")
valid_count = df_valid.count()
print("Silver valid count:", valid_count)

df_corrupted = spark.read.format("parquet").load("content/lake/silver/messages_corrupted/data/")
corrupted_count = df_corrupted.count()
print("Silver corrupted count:", corrupted_count)

total_silver = valid_count + corrupted_count
print("Total Silver count:", total_silver)

if bronze_count == total_silver:
    print("Valid data: bronze == silver valid + silver corrupted")
else:
    print("Inconsistent: silver total data is not consistent with bronze data")


Bronze count: 58
Silver valid count: 39
Silver corrupted count: 19
Total Silver count: 58
Valid data: bronze == silver valid + silver corrupted


# Challenge 2

- Run the business report
- First, however, a bug in the system is producing duplicated messages; these rows must be excluded from the report

- Duplicate-removal logic:
  - Identify possible duplicates by message_id, event_type and channel
  - If duplicates exist, keep only the first message (earliest timestamp)
  - Example:
    In the table below, the correct message to keep is the second row

```
    message_id | channel | event_type | timestamp
    123        | CHAT    | CREATED    | 10:10:01
    123        | CHAT    | CREATED    | 07:56:45 (first occurrence)
    123        | CHAT    | CREATED    | 08:13:33
```

- After cleaning the data, we can create the business report

## Report 1
  - Aggregate data by date, event_type and channel
  - Count number of messages
  - pivot event_type from rows into columns
  - schema expected:
  
```
|      date|channel|CLICKED|CREATED|OPEN|RECEIVED|SENT|
+----------+-------+-------+-------+----+--------+----+
|2024-12-03|    SMS|      4|      4|   1|       1|   5|
|2024-12-03|   CHAT|      3|      7|   5|       8|   4|
|2024-12-03|   PUSH|   NULL|      3|   4|       3|   4|
```

In [None]:
# report 1
# TODO

## Report 2

- Identify the most active users by channel (sorted by number of iterations)
- schema expected:

```
+-------+----------+----+-----+-----+----+---+
|user_id|iterations|CHAT|EMAIL|OTHER|PUSH|SMS|
+-------+----------+----+-----+-----+----+---+
|   1022|         5|   2|    0|    1|   0|  2|
|   1004|         4|   1|    1|    1|   1|  0|
|   1013|         4|   0|    0|    2|   1|  1|
|   1020|         4|   2|    0|    1|   1|  0|
```


In [None]:
# report 2
# TODO

# Challenge 3

In [None]:
# Theoretical question:

# A new use case requires the message data to be aggregated in near real time.
# The team wants to build a dashboard embedded in the platform website to analyze message data with low latency (a few minutes).
# This application will directly access the data aggregated by the streaming process.

# Q1:
# - What would you suggest to achieve that using Spark Structured Streaming?
#   Or would you choose a different data processing tool?
#
# - Which storage would you use and why? (database? data lake? Kafka?)

