# <center> <img src="../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Program: Computer Systems Engineering** </center>
---
### <center> **Spring 2025** </center>
---

**Lab 09**: Watermarking with Spark

**Date**: April 8, 2025

**Students**: Marco Albanese, Vicente Siloe

**Professor**: Pablo Camarillo Ramirez

In [25]:
import findspark
findspark.init()

### Creating the connection to the Spark cluster


In [26]:
from pyspark.sql import SparkSession

# Connect to the standalone cluster and load the Kafka connector for Structured Streaming
spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka-Watermarking") \
    .master("spark://2da3617855ce:7077") \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
sc = spark.sparkContext
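
The connector coordinate passed to `spark.jars.packages` encodes both the Scala binary version and the Spark version, and these generally need to match the Spark build running on the cluster. As a minimal sketch, the helper below (a hypothetical function, not part of PySpark) assembles the coordinate from the two versions, which makes a mismatch easier to spot. The values `3.5.4` and `2.13` are taken from the installation path that appears in this notebook's output.

```python
def kafka_connector_coordinate(spark_version: str, scala_binary_version: str) -> str:
    """Build the Maven coordinate of the Spark Kafka connector (hypothetical helper)."""
    return (
        "org.apache.spark:"
        f"spark-sql-kafka-0-10_{scala_binary_version}:{spark_version}"
    )


# Versions assumed from the path spark-3.5.4-bin-hadoop3-scala2.13
coordinate = kafka_connector_coordinate("3.5.4", "2.13")
print(coordinate)
```

The printed string is the same coordinate used in the session configuration above.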

### Creating the Kafka stream

In [27]:
# Subscribe to the topic as a streaming source; key and value arrive as binary
kafka_lines = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "7c05a3b0cf9d:9093") \
                .option("subscribe", "kafka-spark-example") \
                .load()

kafka_lines.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



### Transform binary data into string

In [28]:
# Kafka delivers the payload as bytes, so decode it to a string first
kafka_df = kafka_lines.withColumn("value_str", kafka_lines.value.cast("string"))

In [29]:
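from pyspark.sql.functions import split, col

# Each message is expected to have the form "word,value"
split_col = split(kafka_df.value_str, ',')
kafka_df = kafka_df.withColumn("word", split_col.getItem(0))
# Note: this replaces the original binary "value" column with the parsed integer
kafka_df = kafka_df.withColumn("value", split_col.getItem(1).cast("integer"))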

kafka_df.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: integer (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
 |-- value_str: string (nullable = true)
 |-- word: string (nullable = true)
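
The three transformations above (cast to string, split on a comma, cast the second field to integer) can be approximated for a single record in plain Python. This is only a rough sketch: `parse_message` is a hypothetical helper, the sample payloads are assumed from the words and values that appear in the console output, and Spark's own cast rules differ from Python's `int()` in some edge cases.

```python
from typing import Optional, Tuple


def parse_message(raw: bytes) -> Tuple[Optional[str], Optional[int]]:
    """Approximate the notebook's cast/split/cast steps for one Kafka record value."""
    text = raw.decode("utf-8")        # similar to value.cast("string")
    parts = text.split(",")           # similar to split(value_str, ',')
    word = parts[0]
    try:
        value = int(parts[1])         # similar to getItem(1).cast("integer")
    except (IndexError, ValueError):
        value = None                  # a missing or non-numeric field becomes null
    return word, value


print(parse_message(b"lab09,99"))     # ('lab09', 99)
print(parse_message(b"testing"))      # ('testing', None)
```

A record with a null `value` still reaches the aggregation step, so malformed messages may be worth filtering out before grouping.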



### Handling late data with watermarking
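
A watermark tells Spark how long to keep waiting for late events. As a rough mental model (not Spark's actual implementation), the watermark trails the largest event time seen so far by a fixed delay, and any window that ends at or before the watermark is treated as closed. The `WatermarkTracker` class and its method names are hypothetical, and the event time is a made-up sample; only the 2-minute delay comes from this lab.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


class WatermarkTracker:
    """Hypothetical helper that mimics watermark bookkeeping; not a Spark API."""

    def __init__(self, delay: timedelta) -> None:
        self.delay = delay
        self.max_event_time: Optional[datetime] = None
        self.watermark: Optional[datetime] = None  # undefined until a batch finishes

    def is_too_late(self, window_end: datetime) -> bool:
        """A window counts as closed once the watermark has reached its end."""
        return self.watermark is not None and window_end <= self.watermark

    def end_of_batch(self, event_times: Iterable[datetime]) -> None:
        """Advance the watermark using the largest event time seen so far."""
        for event_time in event_times:
            if self.max_event_time is None or event_time > self.max_event_time:
                self.max_event_time = event_time
        if self.max_event_time is not None:
            self.watermark = self.max_event_time - self.delay


tracker = WatermarkTracker(delay=timedelta(minutes=2))      # same delay as the lab
tracker.end_of_batch([datetime(2025, 4, 10, 5, 19, 40)])    # assumed sample event time

print(tracker.watermark)
print(tracker.is_too_late(datetime(2025, 4, 10, 5, 17, 30)))
print(tracker.is_too_late(datetime(2025, 4, 10, 5, 18, 0)))
```

With the sample event at 05:19:40, the watermark lands at 05:17:40, so a window ending at 05:17:30 would no longer accept data while one ending at 05:18:00 still would.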

In [None]:
from pyspark.sql import functions as F  # alias avoids shadowing Python's built-in sum, min, and max

# Tolerate events up to 2 minutes late, then aggregate per word over
# 1-minute windows that slide every 30 seconds.
windowed_aggregates = kafka_df \
    .withWatermark("timestamp", "2 minutes") \
    .groupBy(
        F.window(F.col("timestamp"),
            "1 minute",  # Window duration
            "30 seconds"  # Slide duration
        ),
        F.col("word")
    ) \
    .agg(
        F.sum("value").alias("total"),
        F.avg("value").alias("average"),
        F.min("value").alias("min_value"),
        F.max("value").alias("max_value")
    )
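
Because the window lasts 1 minute and slides every 30 seconds, each event belongs to two overlapping windows. The sketch below shows one way to compute that assignment in plain Python, assuming window starts are aligned to the Unix epoch. `sliding_windows` is a hypothetical helper, and the event time 05:17:40 is an assumed sample chosen to fall inside the windows shown in the console output.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

EPOCH = datetime(1970, 1, 1)


def sliding_windows(
    ts: datetime,
    window: timedelta = timedelta(minutes=1),
    slide: timedelta = timedelta(seconds=30),
) -> List[Tuple[datetime, datetime]]:
    """Return every [start, end) window that contains ts, earliest first."""
    offset = (ts - EPOCH) % slide      # time elapsed since the latest slide boundary
    start = ts - offset                # latest window start that is <= ts
    windows = []
    while start + window > ts:         # stop once the window no longer covers ts
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


for start, end in sliding_windows(datetime(2025, 4, 10, 5, 17, 40)):
    print(start, "->", end)
```

For the sample timestamp this yields the windows starting at 05:17:00 and 05:17:30, which is why every word shows up in two rows per event in the output.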

### Configuring the stream sink
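
In `update` output mode, each micro-batch prints only the rows whose aggregates changed in that batch, while the running state is kept between batches. The sketch below imitates that behavior in plain Python. `UpdateModeAggregator` is a hypothetical class, the window labels are shortened to their start time, and the second batch (value 100) is an invented example; the values 1234 and 4567 are the ones that appear in Batch 5 of the console output.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (window start label, word)


class UpdateModeAggregator:
    """Hypothetical running aggregation that emits only updated keys per batch."""

    def __init__(self) -> None:
        # key -> [total, count, min, max]
        self.state: Dict[Key, List[int]] = {}

    def process_batch(self, rows: Iterable[Tuple[str, str, int]]) -> Dict[Key, dict]:
        touched = set()
        for window, word, value in rows:
            key = (window, word)
            if key in self.state:
                entry = self.state[key]
                entry[0] += value
                entry[1] += 1
                entry[2] = min(entry[2], value)
                entry[3] = max(entry[3], value)
            else:
                self.state[key] = [value, 1, value, value]
            touched.add(key)
        # Emit only the keys that this batch modified
        return {
            key: {
                "total": self.state[key][0],
                "average": self.state[key][0] / self.state[key][1],
                "min_value": self.state[key][2],
                "max_value": self.state[key][3],
            }
            for key in touched
        }


word = "lab09kafkawatermarking"
agg = UpdateModeAggregator()

# Two events, each expanded into the two sliding windows that contain it
batch_a = agg.process_batch([
    ("05:18:30", word, 1234), ("05:19:00", word, 1234),
    ("05:19:00", word, 4567), ("05:19:30", word, 4567),
])
# Invented follow-up event that touches a single window
batch_b = agg.process_batch([("05:19:30", word, 100)])

print(batch_a[("05:19:00", word)])
print(sorted(batch_b))
```

The first batch emits three rows, with the 05:19:00 window combining both values (total 5801, average 2900.5). The second batch emits only the 05:19:30 window, since that is the only key it changed.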

In [31]:
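# Use few shuffle partitions; the default of 200 is excessive for this small stream
spark.conf.set("spark.sql.shuffle.partitions", "5")

# Print only the rows updated in each micro-batch, triggering every 30 seconds
query = windowed_aggregates \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='30 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

# Block for up to 300 seconds, then return even if the query is still running
query.awaitTermination(300)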

25/04/10 05:17:22 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-c60e66e3-741a-4016-803e-6d2c26d52ead. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/10 05:17:22 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/10 05:17:22 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+----+-----+-------+---------+---------+
|window|word|total|average|min_value|max_value|
+------+----+-----+-------+---------+---------+
+------+----+-----+-------+---------+---------+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-------+-----+-------+---------+---------+
|window                                    |word   |total|average|min_value|max_value|
+------------------------------------------+-------+-----+-------+---------+---------+
|{2025-04-10 05:17:00, 2025-04-10 05:18:00}|testing|10   |10.0   |10       |10       |
|{2025-04-10 05:17:00, 2025-04-10 05:18:00}|lab09  |99   |99.0   |99       |99       |
|{2025-04-10 05:17:30, 2025-04-10 05:18:30}|lab09  |99   |99.0   |99       |99       |
|{2025-04-10 05:17:30, 2025-04-10 05:18:30}|testing|10   |10.0   |10       |10       |
+------------------------------------------+-------+-----+-------+---------+---------+

-------------------------------------------
Batch: 2
-------------------------------------------
[output truncated; Batches 2 through 4 are not shown]


-------------------------------------------
Batch: 5
-------------------------------------------
+------------------------------------------+----------------------+-----+-------+---------+---------+
|window                                    |word                  |total|average|min_value|max_value|
+------------------------------------------+----------------------+-----+-------+---------+---------+
|{2025-04-10 05:19:00, 2025-04-10 05:20:00}|lab09kafkawatermarking|5801 |2900.5 |1234     |4567     |
|{2025-04-10 05:19:30, 2025-04-10 05:20:30}|lab09kafkawatermarking|4567 |4567.0 |4567     |4567     |
|{2025-04-10 05:18:30, 2025-04-10 05:19:30}|lab09kafkawatermarking|1234 |1234.0 |1234     |1234     |
+------------------------------------------+----------------------+-----+-------+---------+---------+



ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt


-------------------------------------------
Batch: 6
-------------------------------------------
+------+----+-----+-------+---------+---------+
|window|word|total|average|min_value|max_value|
+------+----+-----+-------+---------+---------+
+------+----+-----+-------+---------+---------+



KeyboardInterrupt: 

In [32]:
query.stop()
sc.stop()