# <center> <img src="../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Department of Electronics, Systems and Informatics** </center>
---
## <center> **Massive Data Processing** </center>
---
### <center> **Spring 2025** </center>
---
### <center> **Spark Examples: Structured Streaming (Kafka + Watermarking)** </center>

---
**Professor**: Dr. Pablo Camarillo Ramirez

**LAB**: LAB09

**Students**: Angel Ramirez, Roberto Osorno, Yochabel Cazares, Samuel Romero

In [56]:
import findspark
findspark.init()

### Creating the connection to the Spark cluster


In [57]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka-Watermarking") \
    .master("spark://56a250e0d184:7077") \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
sc = spark.sparkContext

### Creating the Kafka stream

In [58]:
kafka_lines = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "5baf6a7c8147:9093") \
                .option("subscribe", "kafka-spark-example") \
                .load()

kafka_lines.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



### Transform binary data into string

In [59]:
kafka_df = kafka_lines.withColumn("value_str", kafka_lines.value.cast("string"))

In [60]:
from pyspark.sql.functions import explode, split

# Split the decoded string column (value_str, not the raw binary value) on spaces
# and emit one row per word, keeping the Kafka record timestamp as the event time.
words = kafka_df.select(explode(split(kafka_df.value_str, " ")).alias("word"), "timestamp")
words.printSchema()

root
 |-- word: string (nullable = false)
 |-- timestamp: timestamp (nullable = true)
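
To make the `cast("string")` plus `explode(split(...))` step concrete, here is a minimal plain-Python sketch of the same row-level transformation. The function name `explode_words` and the sample record are hypothetical and only illustrate the shape of the data; Spark applies the equivalent logic to every Kafka record in the stream.

```python
def explode_words(records):
    """Mimic value.cast("string") followed by explode(split(value, " "))."""
    rows = []
    for raw_value, timestamp in records:
        text = raw_value.decode("utf-8")      # binary -> string
        for word in text.split(" "):          # one output row per space-separated token
            rows.append((word, timestamp))
    return rows

# Hypothetical Kafka record: (binary value, event timestamp)
sample = [(b"hello spark hello", "2025-04-08 14:41:05")]
rows = explode_words(sample)
for row in rows:
    print(row)
```

Every output row carries the timestamp of the record it came from, which is what the windowed aggregation later uses as event time.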



### Handling late data with watermarking

In [61]:
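from pyspark.sql.functions import avg, col, count, window
from pyspark.sql.functions import max as max_, min as min_  # avoid shadowing Python built-ins

# Count each word per 60-second window that slides every 30 seconds.
# The 2-minute watermark bounds how long Spark keeps state for late events.
word_counts = words \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(
        window(col("timestamp"), "60 seconds", "30 seconds"),
        col("word")
    ) \
    .agg(count("*").alias("word_count"))

# Summarize the per-word counts of each window (a second stateful aggregation).
window_stats = word_counts \
    .groupBy("window") \
    .agg(
        avg("word_count").alias("avg_count"),
        min_("word_count").alias("min_count"),
        max_("word_count").alias("max_count")
    )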
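
The effect of `withWatermark("timestamp", "2 minutes")` can be sketched in plain Python. This is a simplified model, not Spark's implementation: it assumes the watermark is the latest event time seen minus the delay, and that a window stops accepting rows once its end is at or before the watermark. In a real query Spark advances the watermark between micro-batches, so the cutoff lags behind this idealized version. The names `watermark` and `window_still_open` are hypothetical.

```python
from datetime import datetime, timedelta

WATERMARK_DELAY = timedelta(minutes=2)  # same delay as in withWatermark above

def watermark(max_event_time):
    """Idealized watermark: latest event time seen minus the allowed delay."""
    return max_event_time - WATERMARK_DELAY

def window_still_open(window_end, max_event_time):
    """A window accepts late rows only while its end is later than the watermark."""
    return window_end > watermark(max_event_time)

latest_seen = datetime(2025, 4, 8, 14, 45, 0)        # hypothetical latest event time
old_window_end = datetime(2025, 4, 8, 14, 42, 0)     # window ended 3 minutes earlier
recent_window_end = datetime(2025, 4, 8, 14, 43, 30)

print(watermark(latest_seen))                            # 2025-04-08 14:43:00
print(window_still_open(old_window_end, latest_seen))     # False: late rows are dropped
print(window_still_open(recent_window_end, latest_seen))  # True: late rows still counted
```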




### Configuring the stream sink

In [None]:
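# Chaining two stateful aggregations under one global watermark triggers Spark's
# correctness check. Disabling it lets the query run, but rows emitted by the first
# aggregation may be treated as late and discarded by the second.
spark.conf.set("spark.sql.streaming.statefulOperator.checkCorrectness.enabled", "false")
# A small number of shuffle partitions is enough for this demo stream
spark.conf.set("spark.sql.shuffle.partitions", "5")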

query = window_stats \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='30 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

query.awaitTermination(300)

25/04/08 14:41:09 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-4f7055fd-e0b2-48d5-96e1-99700a8d66d3. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/08 14:41:09 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/08 14:41:09 WARN UnsupportedOperationChecker: Detected pattern of possible 'correctness' issue due to global watermark. The query contains stateful operation which can emit rows older than the current watermark plus allowed late record delay, which are "late rows" in downstream stateful operations and these rows can be discarded. Please refer the programming guide doc for more details. If you understand the possible risk of correctness issue and still need to r

-------------------------------------------
Batch: 0
-------------------------------------------
+------+---------+---------+---------+
|window|avg_count|min_count|max_count|
+------+---------+---------+---------+
+------+---------+---------+---------+



25/04/08 14:41:30 ERROR MicroBatchExecution: Query [id = 524f3f08-452e-4fe6-a7ae-621965fb6be7, runId = c4ecf413-f9f4-4f00-a17d-390227d8a8c1] terminated with error
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:

org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
py4j.Gateway.invoke(Gateway.ja

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-----------------+---------+---------+
|window                                    |avg_count        |min_count|max_count|
+------------------------------------------+-----------------+---------+---------+
|{2025-04-08 14:41:00, 2025-04-08 14:42:00}|6.333333333333333|1        |15       |
|{2025-04-08 14:40:30, 2025-04-08 14:41:30}|6.333333333333333|1        |15       |
+------------------------------------------+-----------------+---------+---------+
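
Batch 1 reports two overlapping windows with identical statistics because a 60-second window sliding every 30 seconds places every event in two windows. The plain-Python sketch below reproduces that assignment, assuming window starts are aligned to multiples of the slide interval counted from the Unix epoch and that windows are half-open (start inclusive, end exclusive). The function name `sliding_windows` and the sample event time are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def sliding_windows(event_time, size_s=60, slide_s=30):
    """Return the (start, end) pairs of every sliding window that contains event_time."""
    epoch_s = int(event_time.timestamp())
    start_s = epoch_s - (epoch_s % slide_s)    # latest window start at or before the event
    windows = []
    while start_s + size_s > epoch_s:          # this window still covers the event
        start = datetime.fromtimestamp(start_s, tz=timezone.utc)
        windows.append((start, start + timedelta(seconds=size_s)))
        start_s -= slide_s                     # step back to the previous window start
    return sorted(windows)

# Hypothetical event that arrived at 14:41:10
event = datetime(2025, 4, 8, 14, 41, 10, tzinfo=timezone.utc)
for start, end in sliding_windows(event):
    print(start.strftime("%H:%M:%S"), "->", end.strftime("%H:%M:%S"))
# 14:40:30 -> 14:41:30
# 14:41:00 -> 14:42:00
```

If all events of a micro-batch fall between 14:41:00 and 14:41:30, both windows contain exactly the same rows, which would explain the identical `avg_count`, `min_count`, and `max_count` values above.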



In [55]:
query.stop()
sc.stop()