# <center> <img src="../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Big Data** </center>
---
### <center> **Spring 2025** </center>
---
### <center> **Examples on Structured Streaming (Kafka with Watermarking)** </center>

---
**Professor**: Dr. Pablo Camarillo Ramirez

In [2]:
import findspark
findspark.init()

### Spark Session creation

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka-Watermarking") \
    .master("spark://98dc643237f2:7077") \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
sc = spark.sparkContext

:: loading settings :: url = jar:file:/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-f5fc10dc-3cec-44b1-9d5e-fb5e0b38dca7;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.13;3.5.4 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.13;3.5.4 in central
	found org.apache.kafka#kafka-clients;3.4.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.10.5 in central
	found org.slf4j#slf4j-api;2.0.7 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.4 in central
	found org.apache.hadoop#hadoop-client-api;3.3.4 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.scala-lang.modules#scala-parallel-collections_2.13;1.0.4 in central
	found org.apache.commons#commons-pool2;2.11.1 in centr
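
The `spark.jars.packages` coordinate used above should line up with the Spark distribution in use: the suffix after the underscore is the Scala binary version and the last field is the Spark version (2.13 and 3.5.4 here, as the Ivy log shows). A minimal sketch of that naming rule follows; `kafka_connector_coordinate` is a hypothetical helper written for illustration, not part of PySpark.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str) -> str:
    """Format the Maven coordinate of the Structured Streaming Kafka source.

    Hypothetical helper: it only builds a string, it does not resolve anything.
    """
    # Artifacts are published per Scala *binary* version (major.minor only).
    scala_binary = ".".join(scala_version.split(".")[:2])
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_binary}:{spark_version}"


coordinate = kafka_connector_coordinate("3.5.4", "2.13")
print(coordinate)
```

The resulting string can be passed to `.config("spark.jars.packages", coordinate)` when building the session.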

### Kafka Stream init

In [4]:
kafka_lines = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "29019ad08f3a:9093") \
                .option("subscribe", "kafka-spark-example") \
                .load()

kafka_lines.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



### Transform binary data into string

In [6]:
kafka_df = kafka_lines.withColumn("value_str", kafka_lines.value.cast("string"))

In [35]:
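from pyspark.sql.functions import explode, split, col

# Split the decoded string (not the raw binary column) on spaces, one row per token
numbers = kafka_df.select(explode(split(kafka_df.value_str, " ")).alias("number"), "timestamp")
# Tokens that are not valid integers become NULL after the cast
numbers = numbers.withColumn("number", col("number").cast("integer"))
numbers.printSchema()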

root
 |-- number: integer (nullable = true)
 |-- timestamp: timestamp (nullable = true)
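
To make the transformation above concrete, here is a rough plain-Python analogue of decoding the Kafka value, splitting it on spaces, and casting each token to an integer. It is only a sketch: the function names are hypothetical, and Spark's casting rules differ from Python's `int()` in some edge cases. It does show why tokens that are not numbers (or empty tokens produced by repeated spaces) end up as NULL, which is one plausible source of the `NULL` rows in the console output further down.

```python
from typing import List, Optional


def to_int_or_none(token: str) -> Optional[int]:
    """Approximate a non-ANSI string-to-integer cast: invalid input gives None."""
    try:
        return int(token)
    except ValueError:
        return None


def explode_numbers(value: bytes) -> List[Optional[int]]:
    """One list element per output row of the explode(split(...)) step."""
    text = value.decode("utf-8")   # analogous to value.cast("string")
    tokens = text.split(" ")       # analogous to split(value_str, " ")
    return [to_int_or_none(token) for token in tokens]


print(explode_numbers(b"1 120 abc"))   # [1, 120, None]
print(explode_numbers(b"10  100"))     # [10, None, 100]
```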



### Use Watermarking to handle late arrival events

In [41]:
from pyspark.sql import functions as F  # avoids shadowing Python's built-in min

windowed_counts = numbers \
                    .withWatermark("timestamp", "2 minutes") \
                    .groupBy(F.window(numbers.timestamp,
                                      "20 seconds",  # Window duration
                                      "10 seconds"), # Slide duration
                             numbers.number) \
                    .agg(F.count("number"), F.min("number"))
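
The effect of `withWatermark("timestamp", "2 minutes")` can be sketched in plain Python. The watermark trails the latest event time seen so far by the configured delay, and a window stops accepting late events once its end is no later than the watermark. This is a simplified model with hypothetical function names: in Spark the watermark is only advanced between micro-batches, and data later than the threshold is not guaranteed to be dropped immediately.

```python
from datetime import datetime, timedelta

WATERMARK_DELAY = timedelta(minutes=2)  # mirrors withWatermark("timestamp", "2 minutes")


def compute_watermark(max_event_time: datetime) -> datetime:
    """Watermark = latest event time seen so far minus the allowed lateness."""
    return max_event_time - WATERMARK_DELAY


def window_accepts_late_data(window_end: datetime, watermark: datetime) -> bool:
    """Simplified rule: a window keeps its state while its end is after the watermark."""
    return window_end > watermark


max_seen = datetime(2025, 4, 10, 1, 43, 0)   # illustrative value, not from the run above
wm = compute_watermark(max_seen)
print(wm)                                                               # 2025-04-10 01:41:00
print(window_accepts_late_data(datetime(2025, 4, 10, 1, 40, 30), wm))   # False
print(window_accepts_late_data(datetime(2025, 4, 10, 1, 41, 10), wm))   # True
```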

### Sink configuration

In [42]:
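# A small number of shuffle partitions is enough for this example
spark.conf.set("spark.sql.shuffle.partitions", "5")

# "update" mode prints only the rows whose aggregates changed in each micro-batch
query = windowed_counts \
                .writeStream \
                .outputMode("update") \
                .trigger(processingTime='5 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

# Block for at most 300 seconds, then return control to the notebook
query.awaitTermination(300)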

25/04/10 01:40:08 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-b2a5628f-92ad-4693-a3c8-6c24c64465aa. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/10 01:40:08 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/10 01:40:08 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+------+-------------+-----------+
|window|number|count(number)|min(number)|
+------+------+-------------+-----------+
+------+------+-------------+-----------+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+------+-------------+-----------+
|window                                    |number|count(number)|min(number)|
+------------------------------------------+------+-------------+-----------+
|{2025-04-10 01:40:10, 2025-04-10 01:40:30}|1     |6            |1          |
|{2025-04-10 01:40:10, 2025-04-10 01:40:30}|120   |1            |120        |
|{2025-04-10 01:40:00, 2025-04-10 01:40:20}|120   |1            |120        |
|{2025-04-10 01:40:00, 2025-04-10 01:40:20}|1     |6            |1          |
+------------------------------------------+------+-------------+-----------+
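
Each number in Batch 1 appears in two windows because the window is 20 seconds long but slides every 10 seconds, so every event falls into two overlapping windows. The sketch below reproduces that assignment in plain Python, assuming window starts are aligned to multiples of the slide duration counted from the Unix epoch. The function name and the sample event time are illustrative.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

EPOCH = datetime(1970, 1, 1)


def sliding_windows(event_time: datetime,
                    window: timedelta = timedelta(seconds=20),
                    slide: timedelta = timedelta(seconds=10)
                    ) -> List[Tuple[datetime, datetime]]:
    """Return every half-open [start, end) window that contains event_time."""
    # Latest window start at or before the event, aligned to the slide duration.
    start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    # Walk backwards while the window still covers the event.
    while start > event_time - window:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


for w_start, w_end in sliding_windows(datetime(2025, 4, 10, 1, 40, 15)):
    print(w_start, "->", w_end)
# 2025-04-10 01:40:00 -> 2025-04-10 01:40:20
# 2025-04-10 01:40:10 -> 2025-04-10 01:40:30
```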



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+------+-------------+-----------+
|window                                    |number|count(number)|min(number)|
+------------------------------------------+------+-------------+-----------+
|{2025-04-10 01:40:10, 2025-04-10 01:40:30}|100   |1            |100        |
|{2025-04-10 01:40:00, 2025-04-10 01:40:20}|100   |1            |100        |
|{2025-04-10 01:40:10, 2025-04-10 01:40:30}|101   |1            |101        |
|{2025-04-10 01:40:00, 2025-04-10 01:40:20}|10    |1            |10         |
|{2025-04-10 01:40:10, 2025-04-10 01:40:30}|10    |1            |10         |
|{2025-04-10 01:40:00, 2025-04-10 01:40:20}|101   |1            |101        |
|{2025-04-10 01:40:10, 2025-04-10 01:40:30}|NULL  |0            |NULL       |
|{2025-04-10 01:40:00, 2025-04-10 01:40:20}|NULL  |0            |NULL       |
+------------------------------------------+------+-------------+-----------+
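
Batch 2 lists only the numbers that arrived in that trigger (100, 101, 10 and the NULL group), not the rows for 1 and 120 already printed in Batch 1. That is the behaviour of the `update` output mode: each micro-batch emits only the aggregates that changed. A minimal plain-Python sketch of the idea follows; it keys the state by number only and ignores windows, and both the function name and the third batch are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List


def update_mode_batches(batches: Iterable[List[int]]) -> Iterator[Dict[int, int]]:
    """Keep running counts and, per batch, emit only the keys that changed."""
    state: Counter = Counter()
    for batch in batches:
        changed = Counter(batch)   # keys touched by this micro-batch
        state.update(changed)      # fold them into the running totals
        yield {key: state[key] for key in changed}


batches = [
    [1, 1, 1, 1, 1, 1, 120],   # like Batch 1 above
    [100, 101, 10],            # like Batch 2 above
    [1],                       # hypothetical later batch
]
for batch_id, emitted in enumerate(update_mode_batches(batches), start=1):
    print(batch_id, emitted)
# 1 {1: 6, 120: 1}
# 2 {100: 1, 101: 1, 10: 1}
# 3 {1: 7}
```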

ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt


KeyboardInterrupt: 

In [43]:
query.stop()

In [44]:

sc.stop()