# <center> <img src="../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Department of Electronics, Systems, and Informatics** </center>
---
## <center> **Massive Data Processing** </center>
---
### <center> **Spring 2025** </center>
---
### <center> **Spark Examples: Structured Streaming (Kafka + Watermarking)** </center>

---

**Lab 09**

**Date**: April 8, 2025

**Team Name**: Arriba Linux

**Team Members**: Tirzah Peniche Barba / Ana Cristina Luna Arellano / Juan Pedro Bihouet

**Professor**: Dr. Pablo Camarillo Ramirez

In [3]:
import findspark
findspark.init()

#### Creating the connection to the Spark cluster


In [4]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka-Watermarking") \
    .master("spark://d9c3cc2bade8:7077") \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
sc = spark.sparkContext

:: loading settings :: url = jar:file:/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-2c5ca1fe-22b6-47c9-b868-53dc85d9263b;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.13;3.5.4 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.13;3.5.4 in central
	found org.apache.kafka#kafka-clients;3.4.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.10.5 in central
	found org.slf4j#slf4j-api;2.0.7 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.4 in central
	found org.apache.hadoop#hadoop-client-api;3.3.4 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.scala-lang.modules#scala-parallel-collections_2.13;1.0.4 in central
	found org.apache.commons#commons-pool2;2.11.1 in centr

### Creating the Kafka stream

In [5]:
kafka_lines = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "49917d1b98f5:9093") \
                .option("subscribe", "kafka-spark-example") \
                .load()

kafka_lines.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



### Transform binary data into string

In [6]:
kafka_df = kafka_lines.withColumn("value_str", kafka_lines.value.cast("string"))

In [7]:
from pyspark.sql.functions import explode, split

# Split the decoded string column (value_str) rather than the raw binary value
words = kafka_df.select(explode(split(kafka_df.value_str, " ")).alias("word"), "timestamp")
words.printSchema()

root
 |-- word: string (nullable = false)
 |-- timestamp: timestamp (nullable = true)
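The cast and `explode(split(...))` steps above can be pictured without Spark. The following is a minimal plain-Python sketch, not the Spark implementation; the helper name `explode_words` and the sample message and timestamp are hypothetical.

```python
from datetime import datetime

def explode_words(value: bytes, timestamp: datetime):
    """Decode a Kafka message value and emit one (word, timestamp) row per token."""
    text = value.decode("utf-8")  # plays the role of casting the binary value to string
    # One output row per space-separated token, each keeping the message timestamp
    return [(word, timestamp) for word in text.split(" ")]

ts = datetime(2025, 4, 8, 13, 54, 40)  # hypothetical message timestamp
rows = explode_words(b"holaaa jajaajaa", ts)
for row in rows:
    print(row)
```

Every word inherits the timestamp of the message it came from, which is why the `words` schema carries both `word` and `timestamp`.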



### Applying watermarking to handle late data

In [8]:
from pyspark.sql.functions import window, col
windowed_counts =  words \
                        .withWatermark("timestamp", "2 minutes") \
                        .groupBy(window(words.timestamp, 
                                        "30 seconds", # Window duration 
                                        "5 seconds"), # Slide duration
                                 words.word) \
                        .count()
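As a rough illustration of what the 2-minute watermark does, here is a plain-Python sketch under simplifying assumptions: the watermark is recomputed after each micro-batch as the maximum event time seen so far minus the delay, and it is compared against the event time itself. Spark's windowed aggregation applies the threshold per window, so this is only an approximation. The function name `apply_watermark` and the sample event times are hypothetical.

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=2)  # same threshold as withWatermark("timestamp", "2 minutes")

def apply_watermark(batches, delay=DELAY):
    """Simplified model: events older than the current watermark are dropped."""
    watermark = None
    max_event_time = None
    accepted, dropped = [], []
    for batch in batches:
        for event_time in batch:
            if watermark is not None and event_time < watermark:
                dropped.append(event_time)   # too late for the current watermark
            else:
                accepted.append(event_time)
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
        # The watermark only moves after the batch and applies to the next one
        if max_event_time is not None:
            watermark = max_event_time - delay
    return accepted, dropped, watermark

t0 = datetime(2025, 4, 8, 13, 54, 0)  # hypothetical event time
batches = [
    [t0, t0 + timedelta(seconds=30)],
    [t0 + timedelta(minutes=5)],
    [t0 + timedelta(minutes=1), t0 + timedelta(minutes=4)],
]
accepted, dropped, watermark = apply_watermark(batches)
print("dropped:", dropped)
print("watermark:", watermark)
```

In this run the event at `t0 + 1 minute` arrives after the watermark has advanced to `t0 + 3 minutes`, so it is discarded, while the event at `t0 + 4 minutes` is still accepted.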

### Configuring the stream sink

In [8]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

query = windowed_counts \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='30 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

query.awaitTermination(300)

25/04/08 13:53:01 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-6cdc5767-44b3-4e06-9b4b-84804a7fa472. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/08 13:53:01 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/08 13:53:02 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+----+-----+
|window|word|count|
+------+----+-----+
+------+----+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+------+-----+
|window                                    |word  |count|
+------------------------------------------+------+-----+
|{2025-04-08 13:54:30, 2025-04-08 13:55:30}|holaaa|1    |
|{2025-04-08 13:54:00, 2025-04-08 13:55:00}|holaaa|1    |
+------------------------------------------+------+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+--------+-----+
|window                                    |word    |count|
+------------------------------------------+--------+-----+
|{2025-04-08 13:54:30, 2025-04-08 13:55:30}|jajaajaa|1    |
|{2025-04-08 13:54:00, 2025-04-08 13:55:00}|jajaajaa|1    |
+------------------------------------------+--------+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+------+----+-----+
|window|word|count|
+------+----+-----+
+------+----+-----+



                                                                                

-------------------------------------------
Batch: 4
-------------------------------------------
+------------------------------------------+------+-----+
|window                                    |word  |count|
+------------------------------------------+------+-----+
|{2025-04-08 13:56:30, 2025-04-08 13:57:30}|srgf  |1    |
|{2025-04-08 13:56:30, 2025-04-08 13:57:30}|wesrfe|1    |
|{2025-04-08 13:56:00, 2025-04-08 13:57:00}|srgf  |1    |
|{2025-04-08 13:56:00, 2025-04-08 13:57:00}|wesrfe|1    |
+------------------------------------------+------+-----+

-------------------------------------------
Batch: 5
-------------------------------------------
+------+----+-----+
|window|word|count|
+------+----+-----+
+------+----+-----+



False
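The empty batches in the output above are consistent with `update` output mode, which emits only the rows whose aggregate changed since the previous trigger. A trigger that sees no new data therefore likely prints an empty table. The sketch below models this in plain Python, keyed by word only (the real query keys by window and word); the function name `update_mode_batches` and the sample input are hypothetical.

```python
from collections import Counter

def update_mode_batches(batches):
    """Yield, for each micro-batch, only the (key, count) pairs whose count changed."""
    state = Counter()  # running counts kept across batches
    for batch in batches:
        changed = set()
        for key in batch:
            state[key] += 1
            changed.add(key)
        # Emit updated keys only; untouched keys stay in state but are not printed
        yield {key: state[key] for key in sorted(changed)}

outputs = list(update_mode_batches([[], ["holaaa"], ["jajaajaa"], [], ["holaaa"]]))
for i, out in enumerate(outputs):
    print(f"Batch {i}: {out}")
```

With `complete` mode, by contrast, every batch would reprint the whole result table.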

### AVG

In [9]:
kafka_df = kafka_lines.selectExpr("CAST(value AS STRING) AS value_str", "timestamp")

In [10]:
words = kafka_df.select(explode(split(col("value_str"), " ")).alias("word"), "timestamp")

In [11]:
numerical_words = words.withColumn("reading", col("word").cast("double"))
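Casting a word to `double` yields null when the word is not numeric (with ANSI mode disabled, which is the default in Spark 3.5), and aggregates such as `avg` and `sum` skip nulls. A minimal plain-Python sketch of that behaviour follows. The helpers `to_double` and `average` and the sample words are hypothetical, and Python's `float` parsing only roughly matches Spark's cast.

```python
def to_double(word: str):
    """Return the word as a float, or None when it cannot be parsed as a number."""
    try:
        return float(word)
    except ValueError:
        return None

def average(readings):
    """Average the non-null readings; return None when there are none."""
    values = [r for r in readings if r is not None]
    return sum(values) / len(values) if values else None

sample_words = ["10", "abc", "20"]  # hypothetical message content
readings = [to_double(w) for w in sample_words]
print(readings)
print(average(readings))
```

Non-numeric words therefore do not break the aggregation. They simply do not contribute to it.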

In [12]:
windowed_avg = numerical_words \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(window(col("timestamp"), 
                    "40 minute", 
                    "20 seconds")) \
    .avg("reading")
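With a 40-minute window sliding every 20 seconds, each event belongs to 40 × 60 / 20 = 120 overlapping windows, which would explain why one reading produces many output rows with the same aggregate. The sketch below enumerates those windows in plain Python, assuming window starts are aligned to multiples of the slide since the Unix epoch (no start offset). The function name `sliding_windows` and the sample event time are hypothetical.

```python
from datetime import datetime, timedelta

def sliding_windows(event_time, window, slide):
    """Return the [start, end) windows that contain event_time, newest first."""
    epoch = datetime(1970, 1, 1)
    # Latest window start that is <= event_time
    start = event_time - ((event_time - epoch) % slide)
    windows = []
    while start + window > event_time:
        windows.append((start, start + window))
        start -= slide  # step back to the previous overlapping window
    return windows

event = datetime(2025, 4, 8, 14, 41, 55)  # hypothetical event time
avg_windows = sliding_windows(event, timedelta(minutes=40), timedelta(seconds=20))
count_windows = sliding_windows(event, timedelta(seconds=30), timedelta(seconds=5))
print(len(avg_windows), len(count_windows))
print(avg_windows[0])
```

The amount of state and output grows with the ratio of window duration to slide duration, so a large ratio makes every micro-batch more expensive.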

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

query = windowed_avg \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='30 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

query.awaitTermination(150)

25/04/08 14:41:54 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-fddc006a-8ec3-434a-a032-721de47d5291. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/08 14:41:54 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/08 14:41:55 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
25/04/08 14:41:57 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+------------+
|window|avg(reading)|
+------+------------+
+------+------------+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-----------------+
|window                                    |avg(reading)     |
+------------------------------------------+-----------------+
|{2025-04-08 14:40:20, 2025-04-08 15:20:20}|26.66666666666667|
|{2025-04-08 14:40:00, 2025-04-08 15:20:00}|26.66666666666667|
|{2025-04-08 14:39:20, 2025-04-08 15:19:20}|26.66666666666667|
|{2025-04-08 14:39:00, 2025-04-08 15:19:00}|26.66666666666667|
|{2025-04-08 14:35:20, 2025-04-08 15:15:20}|26.66666666666667|
|{2025-04-08 14:34:00, 2025-04-08 15:14:00}|26.66666666666667|
|{2025-04-08 14:33:40, 2025-04-08 15:13:40}|26.66666666666667|
|{2025-04-08 14:32:20, 2025-04-08 15:12:20}|26.66666666666667|
|{2025-04-08 14:29:20, 2025-04-08 15:09:20}|26.66666666666667|
|{2025-04-08 14:21:40, 2025-04-08 15:01:40}|26.66666666666667|
|{2025-04-08 14:21:00, 2025-04-08 15:01:00}|26.66666666666667|
|{2025-04-08 14:20:00

                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+------+------------+
|window|avg(reading)|
+------+------------+
+------+------------+



In [31]:
query.stop()
sc.stop()

### SUM

In [9]:
kafka_df = kafka_lines.selectExpr("CAST(value AS STRING) AS value_str", "timestamp")


In [10]:
words = kafka_df.select(explode(split(col("value_str"), " ")).alias("word"), "timestamp")


In [11]:
numerical_words = words.withColumn("reading", col("word").cast("double"))


In [12]:
windowed_sum = numerical_words \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(
        window(col("timestamp"), "40 minute", "20 seconds")
    ) \
    .sum("reading")


In [13]:
spark.conf.set("spark.sql.shuffle.partitions", "5")


In [16]:
query = windowed_sum \
    .writeStream \
    .outputMode("update") \
    .trigger(processingTime='30 seconds') \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination(150)


25/04/10 04:47:07 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-5ad410ac-102e-46e4-958c-a8e078115c31. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/10 04:47:07 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/10 04:47:07 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+------------+
|window|sum(reading)|
+------+------------+
+------+------------+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+------------+
|window                                    |sum(reading)|
+------------------------------------------+------------+
|{2025-04-10 04:46:00, 2025-04-10 05:26:00}|95.0        |
|{2025-04-10 04:41:00, 2025-04-10 05:21:00}|95.0        |
|{2025-04-10 04:40:40, 2025-04-10 05:20:40}|95.0        |
|{2025-04-10 04:39:40, 2025-04-10 05:19:40}|95.0        |
|{2025-04-10 04:38:20, 2025-04-10 05:18:20}|95.0        |
|{2025-04-10 04:36:40, 2025-04-10 05:16:40}|95.0        |
|{2025-04-10 04:35:40, 2025-04-10 05:15:40}|95.0        |
|{2025-04-10 04:33:20, 2025-04-10 05:13:20}|95.0        |
|{2025-04-10 04:30:00, 2025-04-10 05:10:00}|95.0        |
|{2025-04-10 04:29:00, 2025-04-10 05:09:00}|95.0        |
|{2025-04-10 04:27:40, 2025-04-10 05:07:40}|95.0        |
|{2025-04-10 04:27:00, 2025-04-10 05:07:00}|95.0        |
|{2025-04-10 04:26:20, 2025-04-10

                                                                                

-------------------------------------------
Batch: 8
-------------------------------------------
+------------------------------------------+------------+
|window                                    |sum(reading)|
+------------------------------------------+------------+
|{2025-04-10 04:46:00, 2025-04-10 05:26:00}|95.0        |
|{2025-04-10 04:41:00, 2025-04-10 05:21:00}|1303.0      |
|{2025-04-10 04:40:40, 2025-04-10 05:20:40}|1303.0      |
|{2025-04-10 04:39:40, 2025-04-10 05:19:40}|1303.0      |
|{2025-04-10 04:38:20, 2025-04-10 05:18:20}|1303.0      |
|{2025-04-10 04:36:40, 2025-04-10 05:16:40}|1303.0      |
|{2025-04-10 04:35:40, 2025-04-10 05:15:40}|1303.0      |
|{2025-04-10 04:33:20, 2025-04-10 05:13:20}|1303.0      |
|{2025-04-10 04:30:00, 2025-04-10 05:10:00}|1303.0      |
|{2025-04-10 04:29:00, 2025-04-10 05:09:00}|1303.0      |
|{2025-04-10 04:27:40, 2025-04-10 05:07:40}|1303.0      |
|{2025-04-10 04:27:00, 2025-04-10 05:07:00}|1303.0      |
|{2025-04-10 04:26:20, 2025-04-10

25/04/10 04:48:22 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 30000 milliseconds, but spent 52459 milliseconds

-------------------------------------------
Batch: 4
-------------------------------------------
+------------------------------------------+------------+
|window                                    |sum(reading)|
+------------------------------------------+------------+
|{2025-04-10 04:46:00, 2025-04-10 05:26:00}|95.0        |
|{2025-04-10 04:41:00, 2025-04-10 05:21:00}|1303.0      |
|{2025-04-10 04:40:40, 2025-04-10 05:20:40}|1303.0      |
|{2025-04-10 04:39:40, 2025-04-10 05:19:40}|1303.0      |
|{2025-04-10 04:38:20, 2025-04-10 05:18:20}|1303.0      |
|{2025-04-10 04:36:40, 2025-04-10 05:16:40}|1303.0      |
|{2025-04-10 04:35:40, 2025-04-10 05:15:40}|1303.0      |
|{2025-04-10 04:33:20, 2025-04-10 05:13:20}|1303.0      |
|{2025-04-10 04:30:00, 2025-04-10 05:10:00}|1303.0      |
|{2025-04-10 04:29:00, 2025-04-10 05:09:00}|1303.0      |
|{2025-04-10 04:27:40, 2025-04-10 05:07:40}|1303.0      |
|{2025-04-10 04:27:00, 2025-04-10 05:07:00}|1303.0      |
|{2025-04-10 04:26:20, 2025-04-10

25/04/10 04:50:06 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 30000 milliseconds, but spent 156084 milliseconds

False

In [None]:
query.stop()
sc.stop()