# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 07**: Structured Streaming with Files

**Date**: October 7th, 2025

**Student Name**: Luis Angel Santana Hernandez

**Professor**: Pablo Camarillo Ramirez

# Description

Jupyter Notebook: Build a notebook (`spark_cluster/notebooks/labs/lab07/lab07_<your_name>.ipynb`) containing a data pipeline that uses Structured Streaming. The pipeline should monitor a directory for simulated server log files, analyze error patterns in real time, and filter alerts for critical issues (for example, repeated 500 errors). The sink should be the console.

Producer: Create a script that generates random log entries (using Bash or Python). This script should be included in your module under the lib directory.
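
A minimal sketch of such a producer, assuming the pipe-delimited `timestamp | level | message | node` layout that appears in the console output of this notebook. The function names, the value pools, and the file naming are hypothetical and are not taken from the actual script in the `lib` directory.

```python
import random
import time
from datetime import datetime
from pathlib import Path

# Hypothetical value pools; the real producer may use different ones.
LEVELS = ["INFO", "WARN", "ERROR"]
MESSAGES = ["200 OK", "404 Not Found", "500 Internal Server Error", "502 Bad Gateway"]
NODES = ["server-node-1", "server-node-2", "server-node-3"]


def make_log_line(rng: random.Random, now: datetime) -> str:
    """Build one entry in the form: timestamp | level | message | node."""
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return " | ".join([stamp, rng.choice(LEVELS), rng.choice(MESSAGES), rng.choice(NODES)])


def write_log_file(directory: Path, n_lines: int, rng: random.Random) -> Path:
    """Write n_lines entries to a new, uniquely named file in the given directory."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"log_{time.time_ns()}.log"
    lines = [make_log_line(rng, datetime.now()) for _ in range(n_lines)]
    path.write_text("\n".join(lines) + "\n")
    return path
```

Calling `write_log_file(Path("data/logs"), 5, random.Random())` every few seconds should give the file source a new file to pick up in each micro-batch. Writing each file in one step, rather than appending to an existing one, matters because the file source treats each file as a complete unit of input.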

Submit a pull request (PR) link to Canvas that includes both the script that produces random log entries and the Jupyter Notebook with your data pipeline. The notebook should display at least three micro-batches of the streaming process.



# Create SparkSession

In [8]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on Structured Streaming (files)") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

# Check on producer

In [9]:
!ls ./data/logs
!pwd

/opt/spark/work-dir


# Data Pipelines

In [10]:
from luis_santana.spark_utils import SparkUtils
from pyspark.sql.functions import split, col

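# Intended schema of the parsed logs (not applied below: the text source
# always yields a single string column named `value`)
logs_schema = SparkUtils.generate_schema([("Timestamp", "timestamp"), ("Status", "string"), ("Data","string"), ("node","string")])

# Read the raw log lines as a stream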
logs_raw = spark.readStream \
    .format("text") \
    .load("/opt/spark/work-dir/data/logs/")


# Parse each line, expected as "timestamp | status | data | node"
logs = logs_raw.select(
    split(col("value"), " \\| ").alias("split")
).select(
    col("split").getItem(0).cast("timestamp").alias("Timestamp"),
    col("split").getItem(1).alias("Status"),
    col("split").getItem(2).alias("Data"),
    col("split").getItem(3).alias("node")
)

# Keep only alerts: ERROR-level entries, or entries whose Data field
# contains a 50x (500-509) or 404 status code
alerts = logs.filter(
    (col("Status") == "ERROR") | col("Data").rlike("50[0-9]|404")
)

# Write the filtered alerts to the console sink
query = alerts.writeStream \
    .outputMode("append") \
    .format("console") \
    .start()


# Block for up to 60 s; returns False if the query is still running when the timeout expires
query.awaitTermination(60)


-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+------+--------------------+-------------+
|          Timestamp|Status|                Data|         node|
+-------------------+------+--------------------+-------------+
|2025-10-09 14:09:44| ERROR|500 Internal Serv...|server-node-2|
+-------------------+------+--------------------+-------------+

-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+------+---------------+-------------+
|          Timestamp|Status|           Data|         node|
+-------------------+------+---------------+-------------+
|2025-10-09 14:09:51|  INFO|502 Bad Gateway|server-node-2|
|2025-10-09 14:09:51|  WARN|  404 Not Found|server-node-3|
+-------------------+------+---------------+-------------+

-------------------------------------------
Batch: 2
-------------------------------------------
+-------------------+------+-----

False
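
The `rlike` patterns used in the filter match anywhere in the `Data` field, so a message that merely contains those digits (for example a duration such as `1500 ms`) would also be flagged. Below is a minimal sketch of a stricter check in plain Python, assuming the status code is always the first token of `Data`, as in the batches shown above. The helper name `is_alert` and the sample messages are hypothetical.

```python
import re

# Anchored pattern: Data must start with a 50x or 404 code followed by a word boundary.
STATUS_CODE = re.compile(r"^(50[0-9]|404)\b")


def is_alert(status: str, data: str) -> bool:
    """Flag ERROR-level entries, or entries whose Data starts with a 50x/404 code."""
    return status == "ERROR" or STATUS_CODE.search(data) is not None


print(is_alert("INFO", "502 Bad Gateway"))    # True
print(is_alert("WARN", "404 Not Found"))      # True
print(is_alert("INFO", "200 OK in 1500 ms"))  # False
print(is_alert("ERROR", "Disk failure"))      # True
```

The same pattern string should carry over to the streaming filter as `col("Data").rlike(r"^(50[0-9]|404)\b")`, since Java regular expressions also support `^` and `\b`.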

In [None]:
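# Stop the streaming query (still active after the 60 s timeout), then the Spark session
query.stop()
spark.stop()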