# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Procesamiento de Datos Masivos** </center>
---
### <center> **Spring 2025** </center>
---
### <center> **Spark Examples: Structured Streaming (Kafka + Watermarking)** </center>

---
**Professor**: Dr. Pablo Camarillo Ramirez

In [10]:
import findspark
findspark.init()

#### Creating the connection to the Spark cluster


In [11]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka-Watermarking") \
    .master("spark://dc612074df78:7077") \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
sc = spark.sparkContext

### Creating the Kafka stream

In [12]:
kafka_lines = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "7d9f66003388:9093") \
                .option("subscribe", "kafka-spark-example") \
                .load()

kafka_lines.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



### Transform binary data into string

In [13]:
kafka_df = kafka_lines.withColumn("value_str", kafka_lines.value.cast("string"))

In [14]:
from pyspark.sql.functions import explode, split

# Split the decoded string column (value_str) rather than the raw binary value column
words = kafka_df.select(explode(split(kafka_df.value_str, " ")).alias("word"), "timestamp")
words.printSchema()

root
 |-- word: string (nullable = false)
 |-- timestamp: timestamp (nullable = true)
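The cast-then-split step can be pictured on a single record. The following is a minimal plain-Python sketch, not Spark code: the payload `b"hola  cat"`, the timestamp string, and the helper name `record_to_words` are made up for illustration. It mirrors what casting `value` to a string, splitting on `" "`, and exploding the result would do, including the empty token produced by two consecutive spaces.

```python
def record_to_words(value: bytes, timestamp: str) -> list:
    """Decode a Kafka-style binary payload and emit one (word, timestamp) row per token."""
    text = value.decode("utf-8")   # analogous to value.cast("string")
    tokens = text.split(" ")       # analogous to split(column, " ")
    # analogous to explode(...): one output row per array element
    return [(token, timestamp) for token in tokens]


# Hypothetical record with a double space between the two words
rows = record_to_words(b"hola  cat", "2025-04-08 18:26:40")
for row in rows:
    print(row)
```

If empty tokens are unwanted in the counts, filtering them out before the aggregation (for example with a `word != ""` condition) is one option.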



### Handling late data with watermarking

In [15]:
from pyspark.sql.functions import length

words_with_length = words.withColumn("word_length", length("word"))
words_with_length.printSchema()

root
 |-- word: string (nullable = false)
 |-- timestamp: timestamp (nullable = true)
 |-- word_length: integer (nullable = false)
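As a rough mental model of the watermarks applied in this section: Spark tracks the largest event time seen so far, subtracts the configured delay, and stops updating windows that end at or before that point. The sketch below is plain Python, not Spark's implementation. The function names and sample timestamps are hypothetical, and it ignores the fact that Spark only advances the watermark between micro-batches.

```python
from datetime import datetime, timedelta


def current_watermark(max_event_time: datetime, delay: timedelta) -> datetime:
    """Watermark = latest event time observed so far minus the allowed delay."""
    return max_event_time - delay


def window_still_open(window_end: datetime, watermark: datetime) -> bool:
    """A window keeps accepting late rows while its end is after the watermark."""
    return window_end > watermark


# Hypothetical state: the newest event seen so far is stamped 18:30:00
max_seen = datetime(2025, 4, 8, 18, 30, 0)
wm = current_watermark(max_seen, timedelta(minutes=3))  # 18:27:00

on_time = window_still_open(datetime(2025, 4, 8, 18, 28, 0), wm)   # ends after the watermark
too_late = window_still_open(datetime(2025, 4, 8, 18, 26, 0), wm)  # ends before the watermark
print(wm, on_time, too_late)
```

Note that the guarantee is one-sided: data arriving within the delay is kept, while data arriving later than the delay may or may not be dropped.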



In [17]:
from pyspark.sql.functions import window, avg, max, min

# 1. Word count with a 2-minute sliding window that advances every 60 seconds
windowed_counts = words \
                    .withWatermark("timestamp", "3 minutes") \
                    .groupBy(window(words.timestamp, 
                                    "120 seconds", 
                                    "60 seconds"),
                             words.word) \
                    .count()

# 2. Average word length with 90-second sliding window
avg_word_length = words_with_length \
                    .withWatermark("timestamp", "2 minutes") \
                    .groupBy(window(words_with_length.timestamp, 
                                    "90 seconds",
                                    "45 seconds")) \
                    .agg(avg("word_length").alias("avg_length"))

# 3. Maximum and minimum word length with 3-minute tumbling window
word_length_extremes = words_with_length \
                    .withWatermark("timestamp", "4 minutes") \
                    .groupBy(window(words_with_length.timestamp, 
                                    "180 seconds",
                                    "180 seconds")) \
                    .agg(
                        max("word_length").alias("max_length"),
                        min("word_length").alias("min_length")
                    )
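To see how the three window definitions above differ, it can help to compute by hand which windows a single event falls into. The sketch below is plain Python and only approximates Spark's behaviour: it assumes windows are aligned to the Unix epoch with no start offset, the function name `assign_windows` is made up, and the event time is a hypothetical value expressed in UTC.

```python
from datetime import datetime, timedelta, timezone


def assign_windows(event_time, window_seconds, slide_seconds):
    """Return the [start, end) windows, aligned to the Unix epoch, that contain event_time."""
    epoch = int(event_time.timestamp())
    start = epoch - (epoch % slide_seconds)  # latest window start at or before the event
    windows = []
    while start > epoch - window_seconds:    # this window still covers the event
        begin = datetime.fromtimestamp(start, tz=timezone.utc)
        windows.append((begin, begin + timedelta(seconds=window_seconds)))
        start -= slide_seconds               # step back to the previous overlapping window
    return sorted(windows)


event = datetime(2025, 4, 8, 18, 27, 20, tzinfo=timezone.utc)  # hypothetical event time

count_windows = assign_windows(event, 120, 60)      # sliding: two overlapping windows
avg_windows = assign_windows(event, 90, 45)         # sliding: two overlapping windows
extreme_windows = assign_windows(event, 180, 180)   # tumbling: exactly one window

for label, result in [("count", count_windows), ("avg", avg_windows), ("extremes", extreme_windows)]:
    for begin, end in result:
        print(label, begin.strftime("%H:%M:%S"), end.strftime("%H:%M:%S"))
```

When the slide equals the window length, every event belongs to exactly one window (tumbling). When the slide is shorter, each event is counted once per overlapping window, which is why the same word can appear under two windows in the console output.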

### Configuring the stream sink

In [18]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

# Start count query
count_query = windowed_counts \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='30 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

# Start avg length query
avg_query = avg_word_length \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='30 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

# Start min/max length query
extremes_query = word_length_extremes \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='30 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

# Let the queries run for a short time to see results
import time
time.sleep(120)

# Stop all queries
count_query.stop()
avg_query.stop()
extremes_query.stop()
sc.stop()
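The queries above use the `update` output mode, in which each trigger prints only the rows whose aggregate changed since the previous trigger. The following plain-Python sketch mimics that behaviour for a running word count. The batches of words are invented for illustration, and windows and watermarks are left out to keep it short.

```python
from collections import Counter


def run_update_mode(batches):
    """Simulate 'update' output: per batch, emit only the keys whose count changed."""
    totals = Counter()   # running state kept across batches
    emitted = []
    for batch in batches:
        changed = Counter(batch)   # words seen in this micro-batch
        totals.update(changed)     # fold them into the running totals
        emitted.append({word: totals[word] for word in changed})
    return emitted


# Hypothetical micro-batches of words
batches = [["hola", "cat", "hola"], ["texto"], ["texto", "mas"]]
outputs = run_update_mode(batches)
for batch_id, rows in enumerate(outputs):
    print(f"Batch {batch_id}: {rows}")
```

In `complete` mode, by contrast, the whole result table would be printed on every trigger.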

25/04/08 18:26:40 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-41e14f3a-82c2-4d65-b8ae-3d8bf39249e9. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/08 18:26:40 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/08 18:26:40 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-7f51574b-31f9-44c0-9d58-1355c6a8a70f. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/08 18:26:40 WARN AdminClientConfig: These configurations '[key.deserializer, val

-------------------------------------------
Batch: 0
-------------------------------------------
+------+----+-----+
|window|word|count|
+------+----+-----+
+------+----+-----+

-------------------------------------------
Batch: 0
-------------------------------------------
+------+----------+
|window|avg_length|
+------+----------+
+------+----------+

-------------------------------------------
Batch: 0
-------------------------------------------
+------+----------+----------+
|window|max_length|min_length|
+------+----------+----------+
+------+----------+----------+

-------------------------------------------
Batch: 1
-------------------------------------------
-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-----------------+
|window                                    |avg_length       |
+------------------------------------------+-----------------+
|{2025-04-08 18:26:15, 2025-04-08 18:27:45}|4.181818181818182|
|{2025-04-08 18:27:00, 2025-04-08 18:28:30}|4.181818181818182|
+------------------------------------------+-----------------+

+------------------------------------------+-------+-----+
|window                                    |word   |count|
+------------------------------------------+-------+-----+
|{2025-04-08 18:26:00, 2025-04-08 18:28:00}|hola   |2    |
|{2025-04-08 18:26:00, 2025-04-08 18:28:00}|cat    |2    |
|{2025-04-08 18:26:00, 2025-04-08 18:28:00}|texto  |1    |
|{2025-04-08 18:27:00, 2025-04-08 18:29:00}|de     |1    |
|{2025-04-08 1

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+-----------------+
|window                                    |avg_length       |
+------------------------------------------+-----------------+
|{2025-04-08 18:26:15, 2025-04-08 18:27:45}|4.153846153846154|
|{2025-04-08 18:27:00, 2025-04-08 18:28:30}|4.153846153846154|
+------------------------------------------+-----------------+

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+-----+-----+
|window                                    |word |count|
+------------------------------------------+-----+-----+
|{2025-04-08 18:26:00, 2025-04-08 18:28:00}|texto|2    |
|{2025-04-08 18:26:00, 2025-04-08 18:28:00}|mas  |1    |
|{2025-04-08 18:27:00, 2025-04-08 18:29:00}|texto|2    |
|{2025-04-08 18:27:00, 2025-04-08 18:29:00}|mas  |1    |
+---------------------------