# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 07**: Structured Streaming with Files

**Date**: October 7th 2025

**Student Name**: Mateo Garcia Lopez

**Professor**: Pablo Camarillo Ramirez

In [1]:
# Import Dependencies
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
from mateogarcial.log_stream import generate_log_files

**Initialize Spark**

In [3]:
spark = SparkSession.builder \
        .appName("RealTimeLogAnalyzer") \
        .master("spark://spark-master:7077") \
        .config("spark.ui.port", "4040") \
        .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")
spark.conf.set("spark.sql.shuffle.partitions", "5")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/10 01:41:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


**Generate 100 log files**

In [7]:
generate_log_files(100)

Generating 100 log files in '/opt/spark/work-dir/data/logs/'...
Successfully generated 100 log files.


**Spark Streaming Pipeline**
- Show the counts for all logs, sorted from highest to lowest

In [14]:
# 1. SOURCE: Read from the file stream.
log_dir_path = "/opt/spark/work-dir/data/logs"
print("\nShows the count of all the logs, for each server-node")
print(f"\nMonitoring directory '{log_dir_path}' for files...")

raw_logs_df = spark.readStream \
    .format("text") \
    .load(log_dir_path)

# 2. TRANSFORMATION: Parse logs and create a full summary.
structured_logs_df = raw_logs_df.select(
    split(col("value"), " \\| ").alias("parts")
).select(
    col("parts").getItem(1).alias("log_level"),
    col("parts").getItem(2).alias("message"),
    col("parts").getItem(3).alias("server_id")
)

# Group by log level, server, and message to get counts for ALL log types.
log_summary_df = structured_logs_df.groupBy(
    "log_level", "server_id", "message"
).count()

# Sort by count, highest first (sorting a stream requires an aggregation and "complete" output mode).
sorted_summary_df = log_summary_df.orderBy(col("count").desc())

# 3. SINK: Output the complete, sorted summary to the console.
query = sorted_summary_df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .start()

# Wait up to 5 seconds; returns False if the query is still running after the timeout.
query.awaitTermination(5)


Shows the count of all the logs, for each server-node

Monitoring directory '/opt/spark/work-dir/data/logs' for files...




False

-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-------------+-------------------------+-----+
|log_level|server_id    |message                  |count|
+---------+-------------+-------------------------+-----+
|INFO     |server-node-1|Data export completed    |29   |
|INFO     |server-node-3|User login successful    |27   |
|INFO     |server-node-4|Data export completed    |24   |
|INFO     |server-node-1|User login successful    |24   |
|INFO     |server-node-2|Data export completed    |23   |
|INFO     |server-node-3|Data export completed    |22   |
|INFO     |server-node-2|User login successful    |21   |
|INFO     |server-node-4|User login successful    |16   |
|WARN     |server-node-1|Disk usage 85%           |10   |
|ERROR    |server-node-1|404 Not Found            |10   |
|WARN     |server-node-2|Disk usage 85%           |9    |
|WARN     |server-node-1|High CPU load detected   |9    |
|WARN     |server-node-3|High CPU
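
The parsing step in the pipeline above splits each line on the regular expression `" \\| "`. The pipe is escaped because an unescaped `|` means alternation in a regex. The sketch below reproduces that split in plain Python so it can be checked without a Spark cluster. The sample line is hypothetical: its field order (timestamp, level, message, server) is assumed from the `getItem` indices used in this lab, not taken from the real output of `generate_log_files`.

```python
import re

# Hypothetical log line; the field order is an assumption based on the
# getItem(0..3) indices used in the streaming pipeline.
sample_line = "2025-10-07 12:00:00 | ERROR | 500 Internal Server Error | server-node-1"


def parse_log_line(line):
    """Split a log line on ' | ' and name the four resulting fields."""
    # Same pattern as the Spark split(): a literal pipe surrounded by spaces.
    parts = re.split(r" \| ", line)
    return {
        "timestamp": parts[0],
        "log_level": parts[1],
        "message": parts[2],
        "server_id": parts[3],
    }


parsed = parse_log_line(sample_line)
print(parsed["log_level"], parsed["server_id"])  # prints: ERROR server-node-1
```

If a line has fewer than four fields, this helper raises `IndexError`, whereas Spark's `getItem` typically yields `null` for a missing index, so the two are equivalent only for well-formed lines.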

- Show only 500 errors that occur more than twice for the same server and message

In [None]:
# 1. TRANSFORMATION: Parse and filter logs.
structured_logs_df = raw_logs_df.select(
    split(col("value"), " \\| ").alias("parts")
).select(
    col("parts").getItem(0).alias("timestamp"),
    col("parts").getItem(1).alias("log_level"),
    col("parts").getItem(2).alias("message"),
    col("parts").getItem(3).alias("server_id")
)

critical_errors_df = structured_logs_df.filter(
    (col("log_level") == "ERROR") & (col("message").contains("500"))
)

# Count 500 errors per (server, message) pair and keep pairs seen more than twice.
error_counts_df = critical_errors_df.groupBy("server_id", "message").count()
alerts_df = error_counts_df.filter(col("count") > 2)

# 2. SINK: Output the alerts to the console.
query = alerts_df.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .start()

query.awaitTermination(5)

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------+-------------------------+-----+
|server_id    |message                  |count|
+-------------+-------------------------+-----+
|server-node-3|500 Service Unavailable  |5    |
|server-node-4|500 Service Unavailable  |4    |
|server-node-1|500 Internal Server Error|6    |
|server-node-3|500 Internal Server Error|4    |
|server-node-2|500 Service Unavailable  |5    |
+-------------+-------------------------+-----+
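
The alert logic above (filter to `ERROR` rows whose message contains `500`, count per server and message, keep counts above 2) can be sketched on a small in-memory list to make the threshold behavior easy to verify. This is a plain-Python illustration, not the streaming implementation; the `records` list and the `find_alerts` helper are hypothetical names introduced only for this example.

```python
from collections import Counter

# Hypothetical parsed records as (log_level, message, server_id) tuples.
records = [
    ("ERROR", "500 Internal Server Error", "server-node-1"),
    ("ERROR", "500 Internal Server Error", "server-node-1"),
    ("ERROR", "500 Internal Server Error", "server-node-1"),
    ("ERROR", "500 Service Unavailable", "server-node-2"),
    ("ERROR", "404 Not Found", "server-node-1"),
    ("INFO", "User login successful", "server-node-3"),
]


def find_alerts(rows, threshold=2):
    """Count 500 errors per (server_id, message) and keep counts above threshold."""
    counts = Counter(
        (server_id, message)
        for level, message, server_id in rows
        if level == "ERROR" and "500" in message
    )
    # Strictly greater than, matching col("count") > 2 in the pipeline.
    return {key: n for key, n in counts.items() if n > threshold}


alerts = find_alerts(records)
print(alerts)
```

Only `server-node-1` with `500 Internal Server Error` passes here, because it is the only pair with more than two occurrences; the single `500 Service Unavailable` row and the `404` row are dropped.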

