# <center> <img src="../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Big Data** </center>
---
### <center> **Spring 2025** </center>
---
### <center> **Examples on Structured Streaming (Kafka with Watermarking)** </center>

---
**Professor**: Dr. Pablo Camarillo Ramirez

##### Team grandeInformacion members:
- Miguel Alberto Torres Dueñas
- Juan Pablo Cortez Navarro
- Luther Williams Sandria 
- Ferdinand Bierbaum

In [None]:
import findspark
findspark.init()

### Spark session creation

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka-Watermarking") \
    .master("spark://f5db43ce3d38:7077") \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
sc = spark.sparkContext

:: loading settings :: url = jar:file:/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0aeb179c-9307-459a-9fed-1e7b991d9e56;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.13;3.5.4 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.13;3.5.4 in central
	found org.apache.kafka#kafka-clients;3.4.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.10.5 in central
	found org.slf4j#slf4j-api;2.0.7 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.4 in central
	found org.apache.hadoop#hadoop-client-api;3.3.4 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.scala-lang.modules#scala-parallel-collections_2.13;1.0.4 in central
	found org.apache.commons#commons-pool2;2.11.1 in centr

### Kafka Stream init

In [3]:
kafka_lines = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "67261d1016d7:9093") \
                .option("subscribe", "kafka-spark-example") \
                .load()

kafka_lines.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



### Transform binary data into string

In [4]:
kafka_df = kafka_lines.withColumn("value_str", kafka_lines.value.cast("string"))

In [5]:
from pyspark.sql.functions import split, col

# Split the decoded message (e.g. "1 2 3") on spaces and keep the Kafka timestamp
words = kafka_df.select(split(col("value_str"), " ").alias("reading"), "timestamp")

# Cast each field to an integer; text that is not a number becomes null
words = words.withColumn("value_one", col("reading")[0].cast("integer"))
words = words.withColumn("value_two", col("reading")[1].cast("integer"))
words = words.withColumn("value_three", col("reading")[2].cast("integer"))

words.printSchema()

root
 |-- reading: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- timestamp: timestamp (nullable = true)
 |-- value_one: integer (nullable = true)
 |-- value_two: integer (nullable = true)
 |-- value_three: integer (nullable = true)
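To make the parsing step concrete, here is a minimal plain-Python sketch of what happens to a single Kafka message value. It only mirrors the logic of the cast and split above and does not use Spark. The helper name `parse_reading` is hypothetical, and the null handling assumes Spark's default (non-ANSI) behaviour, where a failed integer cast or a missing array index yields null.

```python
def parse_reading(raw: bytes):
    """Decode a message value and return its first three fields as integers."""
    text = raw.decode("utf-8")   # plays the role of cast("string")
    parts = text.split(" ")      # plays the role of split(..., " ")

    def to_int(token):
        try:
            return int(token)
        except ValueError:
            return None          # assumption: a failed cast becomes null

    fields = [to_int(p) for p in parts[:3]]
    fields += [None] * (3 - len(fields))  # assumption: missing index becomes null
    return tuple(fields)


print(parse_reading(b"1 2 3"))   # (1, 2, 3)
print(parse_reading(b"10 6 7"))  # (10, 6, 7)
print(parse_reading(b"11 x"))    # (11, None, None)
```

A message with fewer than three fields, or with non-numeric text, therefore produces nulls rather than stopping the stream.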



### Use watermarking to handle late-arriving events

In [6]:
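from pyspark.sql.functions import window, col

# Accept events up to 2 minutes late, then average value_one over
# 20-second windows that slide every 10 seconds
windowedAverages = words.withWatermark("timestamp", "2 minutes") \
                     .groupBy(window(col("timestamp"), "20 seconds", "10 seconds")) \
                     .avg("value_one")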
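Because the window is 20 seconds long and slides every 10 seconds, each event belongs to two overlapping windows. The following plain-Python sketch shows that assignment. The function name `sliding_windows` is hypothetical, and the sketch assumes that window boundaries are aligned to the Unix epoch and are half-open (start inclusive, end exclusive).

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def sliding_windows(ts, size=timedelta(seconds=20), slide=timedelta(seconds=10)):
    """Return every (start, end) window that contains the timestamp ts."""
    # Latest window start that is at or before ts
    start = ts - (ts - EPOCH) % slide
    windows = []
    # Walk backwards while the window still covers ts
    while start + size > ts:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


for start, end in sliding_windows(datetime(2025, 4, 10, 1, 33, 35)):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}")
# 01:33:20 -> 01:33:40
# 01:33:30 -> 01:33:50
```

This is why every reading shows up in two rows of the console output.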

### Sink configuration

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

query = windowedAverages \
                .writeStream \
                .outputMode("complete") \
                .trigger(processingTime='5 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

query.awaitTermination(20)  # Wait up to 20 seconds; messages sent to the topic during this run: "1 2 3", "10 6 7", "11 8 1"

25/04/10 01:33:32 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-7b9633ba-b186-4e32-a297-94095e611e51. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/10 01:33:32 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/10 01:33:32 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+--------------+
|window|avg(value_one)|
+------+--------------+
+------+--------------+



25/04/10 01:33:39 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 7324 milliseconds
                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+--------------+
|window                                    |avg(value_one)|
+------------------------------------------+--------------+
|{2025-04-10 01:33:20, 2025-04-10 01:33:40}|5.5           |
|{2025-04-10 01:33:30, 2025-04-10 01:33:50}|5.5           |
+------------------------------------------+--------------+



25/04/10 01:33:44 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 5000 milliseconds, but spent 5133 milliseconds


False

                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+--------------+
|window                                    |avg(value_one)|
+------------------------------------------+--------------+
|{2025-04-10 01:33:40, 2025-04-10 01:34:00}|11.0          |
|{2025-04-10 01:33:20, 2025-04-10 01:33:40}|5.5           |
|{2025-04-10 01:33:50, 2025-04-10 01:34:10}|11.0          |
|{2025-04-10 01:33:30, 2025-04-10 01:33:50}|5.5           |
+------------------------------------------+--------------+
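The batches above can be reproduced by hand. The sketch below is plain Python, not Spark, and groups the three `value_one` readings (1, 10, 11) into sliding windows. The exact event times are assumptions chosen to be consistent with the windows printed above, since the real Kafka timestamps are not shown in the output.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
SIZE, SLIDE = timedelta(seconds=20), timedelta(seconds=10)


def window_starts(ts):
    """Yield the start of every 20 s window (sliding by 10 s) that contains ts."""
    start = ts - (ts - EPOCH) % SLIDE
    while start + SIZE > ts:
        yield start
        start -= SLIDE


def averages(events):
    """Average the values per window, as a complete-mode result table would."""
    buckets = defaultdict(list)
    for ts, value in events:
        for start in window_starts(ts):
            buckets[start].append(value)
    return {s: sum(v) / len(v) for s, v in sorted(buckets.items())}


# (assumed event time, value_one)
events = [
    (datetime(2025, 4, 10, 1, 33, 35), 1),
    (datetime(2025, 4, 10, 1, 33, 37), 10),
    (datetime(2025, 4, 10, 1, 33, 52), 11),
]

batch1 = averages(events[:2])  # first two messages have arrived
batch2 = averages(events)      # third message has arrived

for start, avg in batch2.items():
    print(f"{start:%H:%M:%S} -> {start + SIZE:%H:%M:%S}  avg={avg}")
# 01:33:20 -> 01:33:40  avg=5.5
# 01:33:30 -> 01:33:50  avg=5.5
# 01:33:40 -> 01:34:00  avg=11.0
# 01:33:50 -> 01:34:10  avg=11.0
```

The first two readings average to (1 + 10) / 2 = 5.5, and the third sits alone in its windows. Batch 2 lists all four windows because the `complete` output mode re-emits the whole result table on every trigger, not only the rows that changed.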



In [9]:
query.stop()

In [10]:
windowedAggregates = words.withWatermark("timestamp", "2 minutes") \
                     .groupBy(window(col("timestamp"), "20 seconds", "10 seconds")) \
                     .sum("value_two")

In [None]:
spark.conf.set("spark.sql.shuffle.partitions", "5")

query = windowedAggregates \
                .writeStream \
                .outputMode("complete") \
                .trigger(processingTime='5 seconds') \
                .format("console") \
                .option("truncate", "false") \
                .start()

query.awaitTermination(20)  # Wait up to 20 seconds; messages sent to the topic during this run: "1 2 3", "9 10 5"

25/04/10 01:35:02 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-bfdc9c8c-4d2d-4880-a831-fd9d1b22f075. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/04/10 01:35:02 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/04/10 01:35:02 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+------+--------------+
|window|sum(value_two)|
+------+--------------+
+------+--------------+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+--------------+
|window                                    |sum(value_two)|
+------------------------------------------+--------------+
|{2025-04-10 01:34:50, 2025-04-10 01:35:10}|2             |
|{2025-04-10 01:35:00, 2025-04-10 01:35:20}|2             |
+------------------------------------------+--------------+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+------------------------------------------+--------------+
|window                                    |sum(value_two)|
+------------------------------------------+--------------+
|{2025-04-10 01:34:50, 2025-04-10 01:35:10}|12            |
|{2025-04-10 01:35:00, 2025-04-10 01:35:20}|12            |
+------------------------------------------+--------------+
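Neither run above contained late data, so the 2-minute watermark never had a visible effect. As a rough mental model, the watermark trails the largest event time seen so far by the configured delay, and a window counts as expired once its end is at or before the watermark. The sketch below is plain Python with hypothetical function names and assumed timestamps. It simplifies Spark's behaviour, where the watermark only advances between micro-batches.

```python
from datetime import datetime, timedelta

DELAY = timedelta(minutes=2)  # same threshold as withWatermark("timestamp", "2 minutes")


def watermark(max_event_time, delay=DELAY):
    """Simplified model: the watermark trails the newest event time by the delay."""
    return max_event_time - delay


def window_is_expired(window_end, wm):
    """A window whose end is at or before the watermark no longer accepts data."""
    return window_end <= wm


# Assumed newest event time seen by the query
wm = watermark(datetime(2025, 4, 10, 1, 35, 15))
print(wm)  # 2025-04-10 01:33:15

# A hypothetical late event stamped 01:33:05 falls into two windows
late_window_ends = [
    datetime(2025, 4, 10, 1, 33, 10),  # window 01:32:50 -> 01:33:10
    datetime(2025, 4, 10, 1, 33, 20),  # window 01:33:00 -> 01:33:20
]
for end in late_window_ends:
    print(f"{end:%H:%M:%S}", "expired" if window_is_expired(end, wm) else "still open")
# 01:33:10 expired
# 01:33:20 still open
```

Note that, according to the Structured Streaming guide, the watermark only cleans up aggregation state in `append` or `update` output mode. With `complete` mode, as used in this notebook, all aggregates are kept, so late readings would still be counted.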



False

In [12]:
query.stop()

In [13]:
sc.stop()