# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>

#### <center> **Final Project: Structured Streaming** </center>
---

**Date**: November, 2025

**Student Name**: Luis Adrian Bravo Ramirez

**Professor**: Pablo Camarillo Ramirez

---

# Objective
Build a data pipeline in Python with Apache Spark that continuously consumes, transforms, and persists data in order to address a practical problem.

---


# Producer

The producer is a Python script that continuously generates random records modeled on the _Twitch Gamers_ dataset. Each record looks like the following:

```json
{
  "numeric_id": 83382684,
  "views": 303579,
  "mature": true,
  "life_time": 137,
  "created_at": "2010-05-17",
  "updated_at": "2010-10-01",
  "dead_account": true,
  "language": "DE",
  "affiliate": true,
  "activity_status": "Inactive",
  "account_category": "Mega"
}
```
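
The producer script itself is not reproduced in this report, so the following is only a minimal sketch of how a record with this shape could be generated and serialized. The value ranges and the lists `LANGUAGES`, `ACTIVITY_STATUSES`, and `ACCOUNT_CATEGORIES` are assumptions made for illustration. In the sample above, `life_time` equals the number of days between `created_at` and `updated_at`, and the sketch keeps that relation. Publishing to Kafka is left out because the client library used by the real script is not shown.

```python
import json
import random
from datetime import date, timedelta

# Hypothetical value pools; the real producer may use different ones.
LANGUAGES = ["EN", "DE", "ES", "FR"]
ACTIVITY_STATUSES = ["Active", "Inactive"]
ACCOUNT_CATEGORIES = ["Small", "Medium", "Large", "Mega"]


def random_record(rng):
    """Build one random record with the same fields as the sample."""
    created = date(2010, 1, 1) + timedelta(days=rng.randrange(0, 3650))
    life_time = rng.randrange(0, 3000)
    # Keep life_time consistent with the two dates, as in the sample record.
    updated = created + timedelta(days=life_time)
    return {
        "numeric_id": rng.randrange(1, 100_000_000),
        "views": rng.randrange(0, 1_000_000),
        "mature": rng.random() < 0.5,
        "life_time": life_time,
        "created_at": created.isoformat(),
        "updated_at": updated.isoformat(),
        "dead_account": rng.random() < 0.5,
        "language": rng.choice(LANGUAGES),
        "affiliate": rng.random() < 0.5,
        "activity_status": rng.choice(ACTIVITY_STATUSES),
        "account_category": rng.choice(ACCOUNT_CATEGORIES),
    }


def encode(record):
    """Serialize a record to UTF-8 JSON bytes, the usual Kafka payload form."""
    return json.dumps(record).encode("utf-8")


if __name__ == "__main__":
    rng = random.Random()
    print(encode(random_record(rng)).decode("utf-8"))
```

Passing a seeded `random.Random` instance makes the generated records reproducible, which helps when testing the consumer.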


## Command Line Arguments
The producer requires two positional arguments:
```bash
python3 producer_script_luis_bravo.py <broker> <topic>
```
- `<broker>`: Kafka broker address (for example, `kafka:9093`)

- `<topic>`: Name of the Kafka topic to publish messages to
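
As an illustration only, two positional arguments like these could be parsed with the standard `argparse` module. The function name `parse_args` and the help texts are assumptions; the actual script may read `sys.argv` directly.

```python
import argparse


def parse_args(argv=None):
    """Parse the broker address and topic name from the command line."""
    parser = argparse.ArgumentParser(
        description="Kafka producer for random Twitch gamers records (sketch)"
    )
    parser.add_argument("broker", help="Kafka broker address, e.g. kafka:9093")
    parser.add_argument("topic", help="Kafka topic to publish messages to")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Explicit argument list used here for demonstration purposes.
    args = parse_args(["kafka:9093", "streaming_processing_luisbravo"])
    print(args.broker, args.topic)
```

With `argparse`, omitting either argument prints a usage message and exits instead of failing later with an index error.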

## Creating topic

The producer publishes to a Kafka topic, which must exist before the producer starts. The following command creates a topic named `streaming_processing_luisbravo`:

```bash
docker exec -it <kafka_container_id> \
/opt/kafka/bin/kafka-topics.sh \
--create --zookeeper zookeeper:2181 \
--replication-factor 1 --partitions 1 \
--topic streaming_processing_luisbravo
```

NOTE: Replace `<kafka_container_id>` with the ID of your Kafka container.

## Testing from Jupyter Notebook
To test the producer, open Jupyter Notebook in your web browser and do the following:
- Select _File > New > Terminal_

- Change to the project folder with `cd /opt/spark/work-dir/lib/luisbravor00`

- Run `python3 producer_script_luis_bravo.py kafka:9093 streaming_processing_luisbravo` to start the producer

This producer uses Kafka as the message broker to simulate real-time streaming data ingestion for analytics and processing pipelines.

---

# Consumer

## Create Spark Session

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Structured Streaming: Twitch Gamers") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:4.0.0") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")
```


## Dataset and Stream creation

### Create the remote connection to Kafka

*REMINDER*: Create the topic before running the following cell, using the ID of your own Kafka container:

`docker exec -it 776688eed872951b48536248b0cdc3182f813837eb7ebf30deb8162ab5c8cb3c /opt/kafka/bin/kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic streaming_processing_luisbravo`

```python
# Create the remote connection
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9093") \
    .option("subscribe", "streaming_processing_luisbravo") \
    .load()
```

### Create the schema for the dataset

```python
from luisbravor00.spark_utils import SparkUtils
from pyspark.sql.functions import from_json

# Kafka delivers the message value as bytes; cast it to a string first
ts_telemetry_df = kafka_df.select(kafka_df.value.cast("string").alias("value_str"))

# Declare the columns to extract from the input JSON
schema_columns = [("views", "int"),
                  ("mature", "int"),
                  ("life_time", "int"),
                  ("created_at", "string"),
                  ("updated_at", "string"),
                  ("numeric_id", "int"),
                  ("dead_account", "int"),
                  ("language", "string"),
                  ("affiliate", "int")]
pkg_schema = SparkUtils.generate_schema(schema_columns)

# Parse each JSON string into a struct column named "telemetry"
ts_extracted_df = ts_telemetry_df.withColumn("telemetry", from_json(ts_telemetry_df.value_str, pkg_schema))

ts_extracted_df.printSchema()
```

```text
root
 |-- value_str: string (nullable = true)
 |-- telemetry: struct (nullable = true)
 |    |-- views: integer (nullable = true)
 |    |-- mature: integer (nullable = true)
 |    |-- life_time: integer (nullable = true)
 |    |-- created_at: string (nullable = true)
 |    |-- updated_at: string (nullable = true)
 |    |-- numeric_id: integer (nullable = true)
 |    |-- dead_account: integer (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- affiliate: integer (nullable = true)
```
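
The sample record in the Producer section shows `mature`, `dead_account`, and `affiliate` as JSON booleans, while the schema above declares them as `int`. Whether the real producer emits booleans or 0/1 is not confirmed here, but a JSON value that does not match its declared type may come back as null from `from_json`, so it is worth checking. The following is a minimal sketch, using only the standard library, that compares one payload against the declared columns. The helper `type_mismatches` and the mapping `PYTHON_TYPES` are hypothetical names introduced for this example.

```python
import json

# Column names and type names copied from the consumer's schema_columns list.
SCHEMA_COLUMNS = [
    ("views", "int"), ("mature", "int"), ("life_time", "int"),
    ("created_at", "string"), ("updated_at", "string"), ("numeric_id", "int"),
    ("dead_account", "int"), ("language", "string"), ("affiliate", "int"),
]

# Assumed mapping from the schema's type names to Python types.
PYTHON_TYPES = {"int": int, "string": str}


def type_mismatches(payload, columns=SCHEMA_COLUMNS):
    """Return (column, problem) pairs for fields that are missing or mistyped."""
    record = json.loads(payload)
    problems = []
    for name, type_name in columns:
        if name not in record:
            problems.append((name, "missing"))
            continue
        value = record[name]
        expected = PYTHON_TYPES[type_name]
        # bool is a subclass of int in Python, so it must be rejected explicitly.
        is_bool_for_int = expected is int and isinstance(value, bool)
        if is_bool_for_int or not isinstance(value, expected):
            problems.append((name, type(value).__name__))
    return problems


# Sample record from the Producer section.
SAMPLE = json.dumps({
    "numeric_id": 83382684, "views": 303579, "mature": True, "life_time": 137,
    "created_at": "2010-05-17", "updated_at": "2010-10-01", "dead_account": True,
    "language": "DE", "affiliate": True, "activity_status": "Inactive",
    "account_category": "Mega",
})

if __name__ == "__main__":
    print(type_mismatches(SAMPLE))
```

For the sample record, the check reports `mature`, `dead_account`, and `affiliate` as `bool`. If the real messages look like the sample, two possible ways to align both sides would be to emit 0/1 from the producer or to declare those columns as booleans, assuming `SparkUtils.generate_schema` supports that type.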



### Stream

```python
# Send the stream to a Parquet file sink
query_files = ts_extracted_df.writeStream \
    .trigger(processingTime="4 seconds") \
    .format("parquet") \
    .option("path", "/opt/spark/work-dir/data/ts_output/") \
    .option("checkpointLocation", "/opt/spark/work-dir/ts_checkpoint") \
    .start()
```

```text
25/11/07 22:10:53 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
```

```python
# Cleanup: run only after the analysis below, to reset the sink output and the checkpoint
!rm -rf /opt/spark/work-dir/data/ts_output/
!rm -rf /opt/spark/work-dir/ts_checkpoint/
```

## Transformations and Actions

```python
path = "/opt/spark/work-dir/data/ts_output/"

# Read all parquet part files in that folder
df = spark.read.parquet(path)

# Inspect the schema and show some rows
df.printSchema()
df.show(10, truncate=False)
```

```text
root
 |-- value_str: string (nullable = true)
 |-- telemetry: struct (nullable = true)
 |    |-- views: integer (nullable = true)
 |    |-- mature: integer (nullable = true)
 |    |-- life_time: integer (nullable = true)
 |    |-- created_at: string (nullable = true)
 |    |-- updated_at: string (nullable = true)
 |    |-- numeric_id: integer (nullable = true)
 |    |-- dead_account: integer (nullable = true)
 |    |-- language: string (nullable = true)
 |    |-- affiliate: integer (nullable = true)
```

(The `df.show` output is truncated in this export; it lists the columns `value_str` and `telemetry`.)

```python
df.count()
```

```text
30
```

## Data Persistence

## Power BI Dashboard