# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 07**: Structured Streaming with Files

**Date**: October 7th, 2025

**Student Name**: José Ángel León Pérez

**Professor**: Pablo Camarillo Ramirez

**Description**

**Jupyter Notebook**: Build a Jupyter Notebook (`spark_cluster/notebooks/labs/lab07/lab07_<your_name>.ipynb`) containing a data pipeline that uses Structured Streaming. The pipeline should monitor a directory for simulated server log files, analyze error patterns in real time, and filter alerts for critical issues (for example, repeated 500 errors). The sink should be the console.

**Producer**: Create a script (Bash or Python) that generates random log entries. Include this script in your module under the lib directory.

Submit a pull request (PR) link to Canvas that includes both the script that produces random log entries and the Jupyter Notebook with your data pipeline. The notebook should display at least three micro-batches of the streaming process.
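
The producer script itself is not reproduced in this notebook, so the following is only a minimal sketch of what such a generator might look like. The field names match the schema used by the pipeline (`timestamp`, `level`, `code`, `message`); everything else is an assumption made for illustration: the `LOG_TEMPLATES` values, the `log_<nanoseconds>.json` file naming, the timestamp format, and the helper names `make_entry` and `write_batch`.

```python
import json
import os
import random
import tempfile
import time
from datetime import datetime, timezone

# Hypothetical level/code/message combinations; the real producer may differ.
LOG_TEMPLATES = [
    ("INFO", 200, "Request served"),
    ("WARN", 404, "Resource not found"),
    ("ERROR", 500, "Internal server error"),
]


def make_entry(rng: random.Random, now: datetime) -> dict:
    """Build one log record with the four fields the pipeline's schema expects."""
    level, code, message = rng.choice(LOG_TEMPLATES)
    return {
        # Assumed timestamp format; any format Spark can cast to a timestamp would do.
        "timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
        "level": level,
        "code": code,
        "message": message,
    }


def write_batch(out_dir: str, n_entries: int, rng: random.Random) -> str:
    """Write n_entries as JSON Lines and return the path of the finished file."""
    os.makedirs(out_dir, exist_ok=True)
    now = datetime.now(timezone.utc)
    lines = [json.dumps(make_entry(rng, now)) for _ in range(n_entries)]

    # Write to a dot-prefixed temp file first, then rename it into place, so a
    # directory watcher never reads a half-written file. Dot-prefixed names are
    # expected to be skipped as hidden files by Spark's file listing.
    fd, tmp_path = tempfile.mkstemp(prefix=".tmp_", suffix=".json", dir=out_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    # mkstemp creates owner-only files; relax so another user can read them.
    os.chmod(tmp_path, 0o644)

    final_path = os.path.join(out_dir, f"log_{time.time_ns()}.json")
    os.replace(tmp_path, final_path)  # atomic rename within the same directory
    return final_path
```

Calling `write_batch` in a loop with a short `time.sleep` between iterations, pointed at the directory the stream monitors, would produce one new file per iteration and therefore a steady sequence of micro-batches.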

In [1]:
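import findspark
findspark.init()  # Locate the Spark installation and make pyspark importable

from pyspark.sql import SparkSession

# Connect to the standalone cluster master and expose the Spark UI on port 4040.
spark = SparkSession.builder \
    .appName("Structured Streaming - Log Monitoring") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")  # Show only errors in the driver log output

# Use fewer shuffle partitions than the default of 200; the aggregation here is small.
spark.conf.set("spark.sql.shuffle.partitions", "5")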

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/10 03:01:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
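from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: by default, file-based streaming sources require one up front.
log_schema = StructType([
    # Read as a string; window() in the next cell relies on Spark casting it to a timestamp.
    StructField("timestamp", StringType(), True),
    StructField("level", StringType(), True),
    StructField("code", IntegerType(), True),
    StructField("message", StringType(), True)
])

# Treat every new JSON file that lands in the directory as streaming input.
logs_df = spark.readStream \
    .schema(log_schema) \
    .format("json") \
    .load("/opt/spark/work-dir/data/server_logs/")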

In [3]:
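from pyspark.sql.functions import col, window, count

# Keep only HTTP 500 responses, the critical errors this lab alerts on.
critical_errors = logs_df.filter(col("code") == 500)

# Count 500 errors per one-minute tumbling window.
error_counts = critical_errors.groupBy(
    window(col("timestamp"), "1 minute"),
    col("code")
).agg(count("*").alias("count"))

# Raise an alert only when a window holds three or more 500 errors.
alerts = error_counts.filter(col("count") >= 3)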
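
To make the logic of this cell easier to check by hand, the sketch below reproduces the same three steps (filter on code 500, count per one-minute tumbling window, keep windows with at least three hits) in plain Python over a static list. It is an illustration of the semantics only, not how Spark executes the query. The sample records, the timestamp format, and the helper names `window_start` and `find_alerts` are made up for this example.

```python
from collections import Counter
from datetime import datetime

THRESHOLD = 3  # same cutoff as the filter on "count" in the cell above


def window_start(ts: datetime) -> datetime:
    """Floor a timestamp to the start of its one-minute tumbling window."""
    return ts.replace(second=0, microsecond=0)


def find_alerts(records):
    """Return {(window_start, code): count} for windows with >= THRESHOLD 500 errors."""
    counts = Counter(
        (window_start(datetime.strptime(r["timestamp"], "%Y-%m-%d %H:%M:%S")), r["code"])
        for r in records
        if r["code"] == 500  # mirrors logs_df.filter(col("code") == 500)
    )
    return {key: n for key, n in counts.items() if n >= THRESHOLD}


# Made-up sample records, not taken from the lab's data.
sample = [
    {"timestamp": "2025-10-10 03:01:05", "code": 500},
    {"timestamp": "2025-10-10 03:01:20", "code": 500},
    {"timestamp": "2025-10-10 03:01:59", "code": 500},
    {"timestamp": "2025-10-10 03:02:00", "code": 500},  # next window, only one hit
    {"timestamp": "2025-10-10 03:01:30", "code": 200},  # not a 500, filtered out
]
alerts_by_window = find_alerts(sample)
```

Here only the 03:01 window reaches the threshold. The single 500 error at 03:02:00 belongs to the next window and stays below it.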

In [4]:
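# Complete mode writes the full aggregated table to the console on every micro-batch.
query = alerts.writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Block for up to 120 seconds; returns False if the query is still running at the timeout.
query.awaitTermination(120)
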
-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    4|
+--------------------+----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    5|
+--------------------+----+-----+

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    5|
+--------------------+----+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 4
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 5
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 6
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 7
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 8
-------------------------------------------

-------------------------------------------
Batch: 14
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:02...| 500|    7|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 15
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:02...| 500|    7|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 16
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:02...| 500|    7|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 33
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:02...| 500|   10|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 34
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:02...| 500|   10|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 35
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:02...| 500|   10|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

-------------------------------------------
Batch: 36
-------------------------------------------
+--------------------+----+-----+
|              window|code|count|
+--------------------+----+-----+
|{2025-10-10 02:51...| 500|    3|
|{2025-10-10 03:02...| 500|   10|
|{2025-10-10 03:01...| 500|    6|
+--------------------+----+-----+

False
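
The trailing `False` is the return value of `query.awaitTermination(120)`: the query was still running when the 120-second timeout elapsed. The repeated rows across batches are consistent with complete output mode, which prints the whole result table on every micro-batch rather than only the rows that changed. The sketch below contrasts that behavior with update mode using a plain-Python simulation. It is a simplification: the `run_batches` helper and the window labels are hypothetical, and the threshold filter is left out.

```python
from collections import Counter


def run_batches(batches, mode):
    """Simulate a running count over micro-batches; return what each batch would emit."""
    if mode not in ("complete", "update"):
        raise ValueError(f"unsupported mode: {mode}")
    state = Counter()  # running count per window, kept across batches
    outputs = []
    for batch in batches:
        state.update(batch)
        if mode == "complete":
            # Emit every row of the result table, changed or not.
            outputs.append(dict(state))
        else:
            # Emit only the rows touched by this batch.
            outputs.append({key: state[key] for key in set(batch)})
    return outputs


# Each inner list holds the window labels of the 500 errors seen in one micro-batch.
batches = [["03:01", "03:01", "03:01"], ["03:01"], [], ["03:02"]]
complete = run_batches(batches, "complete")
update = run_batches(batches, "update")
```

In this simulation the third batch contains no new 500 errors. Complete mode still emits the `03:01` row with its unchanged count, which mirrors the identical consecutive batches in the output above, while update mode emits nothing for that batch.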

In [5]:
query.stop()
sc.stop()