# EX9-STREAM: Spark Structured Streaming + Kafka

Your assignment: complete the `TODO`s and include the **output of each cell**.

#### You may need to read the [Structured Streaming API Documentation](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/index.html) to complete this lab.

### Before starting this exercise: (1) start the `kafka` stack; (2) start the `kafkafakestream` stack.

### Step 1: **[PLAN A]** Start Spark Session

In [2]:
from pyspark.sql import SparkSession

for ctx in ("spark", "sc"):
    try:
        globals()[ctx].stop()
    except Exception:
        pass

spark = (
    SparkSession.builder
        .appName("Spark SQL basic example")
        .master("local[*]")
        .config(
            "spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,"
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.4"
        )
        .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
        .config("spark.hadoop.fs.s3a.access.key", "pdm_minio")
        .config("spark.hadoop.fs.s3a.secret.key", "pdm_minio")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
        .getOrCreate()
)

print("✅ SparkSession created successfully!")
print("🔍 Spark version:", spark.sparkContext.version)

✅ SparkSession created successfully!
🔍 Spark version: 3.5.4


### Step 2: **[PLAN A]** Create stream of pizza orders from Kafka

In [3]:
# Read from Kafka
df_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "pizza") \
    .option("startingOffsets", "earliest") \
    .load()

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    ArrayType, IntegerType, LongType, StringType, StructField, StructType
)

schema = StructType([
    StructField("id", IntegerType()),
    StructField("shop", StringType()),
    StructField("name", StringType()),
    StructField("phoneNumber", StringType()),
    StructField("address", StringType()),
    StructField("pizzas", ArrayType(
        StructType([
            StructField("pizzaName", StringType()),
            StructField("additionalToppings", ArrayType(StringType()))
        ])
    )),
    StructField("timestamp", LongType())  # This is in epoch millis
])


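# Kafka delivers the payload as bytes: cast it to a string, then parse the JSON with the schema.
df_stream = df_stream.selectExpr("CAST(value AS STRING) as json_str")
df_stream = df_stream.select(from_json(col("json_str"), schema).alias("data")).select("data.*")

# Print each micro-batch to the console, triggering once per second.
df_stream_writer = df_stream.writeStream.format("console").outputMode("append")
df_stream_writer = df_stream_writer.trigger(processingTime="1 second")
df_stream_query = df_stream_writer.start()
# Block for at most 10 seconds; returns False if the query is still running afterwards.
df_stream_query.awaitTermination(10)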

25/05/20 01:12:29 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-1b95a873-3d24-4f1b-8c7a-e80d0cdf5790. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/05/20 01:12:29 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/05/20 01:12:29 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+---+--------------------+---------------+--------------------+--------------------+--------------------+-------------+
| id|                shop|           name|         phoneNumber|             address|              pizzas|    timestamp|
+---+--------------------+---------------+--------------------+--------------------+--------------------+-------------+
|  0|        Marios Pizza|    Jason Brown|       (851)502-9074|2701 Samuel Summi...|[{Salami, []}, {M...|1747703245306|
|  1|Ill Make You a Pi...|     Edward Liu|   (366)611-5493x353|561 Lester Points...|[{Peperoni, [🍓 s...|1747703249858|
|  2|      Mammamia Pizza|Joshua Peterson|        933-884-1198|90464 Amanda Port...|[{Diavola, [🥚 eg...|1747703256172|
|  3|        Marios Pizza|    Maria Lopez|  783-390-4640x10333|0329 James Drive ...|[{Marinara, [🧄 g...|1747703259694|
|  4|        Luigis Pizza|    Chris Lyons|001-371-729-3812x...|385

25/05/20 01:12:31 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 1000 milliseconds, but spent 2160 milliseconds


-------------------------------------------
Batch: 1
-------------------------------------------
+---+------------+-------------+-------------+--------------------+--------------------+-------------+
| id|        shop|         name|  phoneNumber|             address|              pizzas|    timestamp|
+---+------------+-------------+-------------+--------------------+--------------------+-------------+
| 65|Marios Pizza|Sheena Waller|(331)263-9943|2753 Brandon Plai...|[{Margherita, []}...|1747703551389|
+---+------------+-------------+-------------+--------------------+--------------------+-------------+

-------------------------------------------
Batch: 2
-------------------------------------------
+---+--------------+-------------+-------------+--------------------+--------------------+-------------+
| id|          shop|         name|  phoneNumber|             address|              pizzas|    timestamp|
+---+--------------+-------------+-------------+--------------------+-----------

False

-------------------------------------------
Batch: 3
-------------------------------------------
+---+------------+-------------+-----------------+--------------------+--------------------+-------------+
| id|        shop|         name|      phoneNumber|             address|              pizzas|    timestamp|
+---+------------+-------------+-----------------+--------------------+--------------------+-------------+
| 67|Luigis Pizza|Misty Johnson|414-348-8870x7196|060 Flores Ridge\...|[{Mari & Monti, [...|1747703567874|
+---+------------+-------------+-----------------+--------------------+--------------------+-------------+

-------------------------------------------
Batch: 4
-------------------------------------------
+---+------------+------------+-------------------+--------------------+--------------------+-------------+
| id|        shop|        name|        phoneNumber|             address|              pizzas|    timestamp|
+---+------------+------------+-------------------+---

                                                                                

-------------------------------------------
Batch: 28
-------------------------------------------
+---+------------+------------+-----------------+--------------------+--------------------+-------------+
| id|        shop|        name|      phoneNumber|             address|              pizzas|    timestamp|
+---+------------+------------+-----------------+--------------------+--------------------+-------------+
| 92|Luigis Pizza|Darius Roach|234.245.7196x0805|8235 Robert Cliff...|[{Peperoni, [🥚 e...|1747703703420|
+---+------------+------------+-----------------+--------------------+--------------------+-------------+

-------------------------------------------
Batch: 29
-------------------------------------------
+---+--------------+-----------+-------------+--------------------+--------------------+-------------+
| id|          shop|       name|  phoneNumber|             address|              pizzas|    timestamp|
+---+--------------+-----------+-------------+--------------------+-
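
In the console output above, the `timestamp` column shows raw epoch milliseconds and the `pizzas` column is a truncated nested array, which makes the batches hard to read. The following is a minimal sketch, not part of the original exercise, of one way to derive a readable timestamp and one row per pizza. It assumes `df` is the parsed `df_stream` from Step 2; the names `epoch_millis_to_utc`, `add_event_time_and_flatten`, and the `event_time` column are hypothetical.

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(ms: int) -> datetime:
    """Plain-Python reference for the conversion done in Spark below."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def add_event_time_and_flatten(df):
    """Return one row per pizza with a real timestamp column.

    `df` is assumed to be the parsed `df_stream` from Step 2.
    """
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql.functions import col, explode

    return (
        df
        # Dividing by 1000 gives seconds; casting a numeric value to
        # timestamp interprets it as seconds since the Unix epoch.
        .withColumn("event_time", (col("timestamp") / 1000).cast("timestamp"))
        # explode() emits one output row per element of the `pizzas` array.
        .withColumn("pizza", explode(col("pizzas")))
        .select(
            "id",
            "shop",
            "event_time",
            col("pizza.pizzaName").alias("pizzaName"),
            col("pizza.additionalToppings").alias("additionalToppings"),
        )
    )


# Sanity check on the first order's timestamp from Batch 0.
print(epoch_millis_to_utc(1747703245306).isoformat())
```

The first order's value, `1747703245306`, corresponds to 2025-05-20 01:07:25 UTC, a few minutes before the log lines at 01:12:29. The flattened DataFrame can be written to the console sink in the same way as `df_stream`.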

### Step 3: Explore the example above, change parameters, see the results

This is an open exercise: show your work and explain the output. The fake Kafka stream has other options, such as the subject, the number of messages, and the waiting time.
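
One possible way to organise these experiments is sketched below; it is a suggestion under stated assumptions, not the expected solution. The helpers `kafka_source_options` and `run_for` are hypothetical names. The option keys (`subscribe`, `startingOffsets`, `maxOffsetsPerTrigger`) are standard Kafka source options, and the broker address and topic are the ones used in Step 2. `run_for` stops the query after the timeout, so batches from one experiment do not keep printing while the next one runs.

```python
def kafka_source_options(topic, bootstrap="kafka:9092",
                         starting_offsets="latest",
                         max_offsets_per_trigger=None):
    """Build the option dict for the Kafka source (hypothetical helper)."""
    options = {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": topic,
        # "earliest" replays the whole topic; "latest" reads only new messages.
        "startingOffsets": starting_offsets,
    }
    if max_offsets_per_trigger is not None:
        # Caps how many records each micro-batch may read.
        options["maxOffsetsPerTrigger"] = str(max_offsets_per_trigger)
    return options


def run_for(query, seconds):
    """Let a streaming query run for `seconds`, then stop it.

    Returns True if the query ended on its own before the timeout.
    """
    finished = query.awaitTermination(seconds)
    if not finished:
        query.stop()
    return finished


# Intended use inside the notebook (requires the Spark session from Step 1):
#
#   raw = (spark.readStream.format("kafka")
#          .options(**kafka_source_options("pizza", max_offsets_per_trigger=10))
#          .load())
#   query = (raw.selectExpr("CAST(value AS STRING) AS json_str")
#            .writeStream.format("console")
#            .option("truncate", "false")          # show full column values
#            .trigger(processingTime="5 seconds")  # slower trigger than Step 2
#            .start())
#   run_for(query, 30)

print(kafka_source_options("pizza", max_offsets_per_trigger=10))
```

Changing one parameter at a time (starting offsets, batch cap, trigger interval) and noting how the batch sizes and batch numbers change makes the output easier to explain.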

### Summary of Tests