# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 07**: Structured Streaming with Files 

**Date**: October 10th, 2025

**Student Name**: Carolina Arellano

**Professor**: Pablo Camarillo Ramirez

## Setup files

### Create Random Logs

In [14]:
from carolinarellano.random_logs_generator import LogGenerator

logs = LogGenerator()
logs.generate_log_file(100)  # Generate 100 entries

Generated /opt/spark/work-dir/data/carolinarellano/logs/log_1_1760031264.txt with 100 entries.


PosixPath('/opt/spark/work-dir/data/carolinarellano/logs/log_1_1760031264.txt')
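`LogGenerator` comes from the author's own package and its source is not part of this notebook. As a rough illustration of the kind of file it produces, the sketch below writes lines in the `timestamp | level | message | server` layout seen in the log files. The function names, the message pools, and the output directory are assumptions made for this example, not the actual implementation.

```python
import random
import tempfile
import time
from datetime import datetime
from pathlib import Path

# Hypothetical pools, modeled on values visible in the sample logs
MESSAGES = {
    "INFO": ["Request received", "Data processed correctly", "User login successful"],
    "WARN": ["Disk usage 85%"],
    "ERROR": ["Database connection failed", "500 Internal Server Error"],
}
SERVERS = [f"server-node-{i}" for i in range(1, 5)]


def make_log_lines(n, seed=None):
    """Return n log lines formatted as 'timestamp | level | message | server'."""
    rng = random.Random(seed)
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    lines = []
    for _ in range(n):
        level = rng.choice(list(MESSAGES))
        message = rng.choice(MESSAGES[level])
        server = rng.choice(SERVERS)
        lines.append(f"{now} | {level} | {message} | {server}")
    return lines


def write_log_file(directory, n, seed=None):
    """Write n lines to a new timestamped file and return its path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"log_{int(time.time())}.txt"
    path.write_text("\n".join(make_log_lines(n, seed)) + "\n")
    return path


# Example: write 100 entries into a temporary directory
out_path = write_log_file(tempfile.mkdtemp(), 100, seed=42)
print(f"Generated {out_path} with 100 entries.")
```

Dropping a new file into the watched directory while the stream is running is what triggers a new micro-batch in the file source.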

In [15]:
!pwd
!ls /opt/spark/work-dir/data/carolinarellano/logs
!cat /opt/spark/work-dir/data/carolinarellano/logs/*

/opt/spark/work-dir
log_1_1760029707.txt  log_1_1760031151.txt
log_1_1760030257.txt  log_1_1760031264.txt
2025-10-09 17:08:27 | INFO | Request received | server-node-3
2025-10-09 17:08:27 | WARN | Disk usage 85% | server-node-2
2025-10-09 17:08:27 | INFO | Data processed correctly | server-node-2
2025-10-09 17:08:27 | INFO | Data processed correctly | server-node-4
2025-10-09 17:08:27 | INFO | Request received | server-node-3
2025-10-09 17:08:27 | ERROR | Database connection failed | server-node-2
2025-10-09 17:08:27 | INFO | Data processed correctly | server-node-4
2025-10-09 17:08:27 | INFO | User login successful | server-node-4
2025-10-09 17:08:27 | INFO | Request received | server-node-1
2025-10-09 17:08:27 | INFO | User login successful | server-node-4
2025-10-09 17:08:27 | INFO | Request received | server-node-1
2025-10-09 17:08:27 | INFO | Data processed correctly | server-node-2
2025-10-09 17

### Initialize Spark

In [16]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Lab 07 - Structured Streaming with Files") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

### Create Schema

In [17]:
from pyspark.sql.functions import col, trim
from carolinarellano.spark_utils import SparkUtils

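# Target layout of a parsed log line. The text source always yields a single
# string column named `value`, so this schema is not passed to readStream;
# it documents the columns produced by the parsing step.
logs_schema = SparkUtils.generate_schema([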
    ("timestamp", "string"),
    ("level", "string"),
    ("message", "string"),
    ("server", "string")
])

### Read stream with Spark

In [18]:
from pyspark.sql.functions import split

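# The text source reads each line of every file in the directory into one `value` column
logs_df = spark.readStream \
    .format("text") \
    .load("/opt/spark/work-dir/data/carolinarellano/logs")

# Split each line on " | " (the pipe is escaped because split() takes a regex),
# then name the four resulting fields
parsed_logs_df = logs_df.select(
    split(col("value"), " \\| ").alias("parsed")
).select(
    col("parsed")[0].alias("timestamp"),
    trim(col("parsed")[1]).alias("level"),
    col("parsed")[2].alias("message"),
    col("parsed")[3].alias("server")
)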
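The split-and-index parsing can be checked on a single line without starting a stream. The following is a minimal sketch in plain Python, assuming the `timestamp | level | message | server` layout of the generated logs; `parse_log_line` is a hypothetical helper name and is not part of the lab code.

```python
import re

# Same delimiter pattern the Spark job passes to split(): a pipe with one space on each side
DELIMITER = re.compile(r" \| ")
FIELDS = ("timestamp", "level", "message", "server")


def parse_log_line(line):
    """Return a dict with the four log fields, or None if the line is malformed."""
    parts = DELIMITER.split(line.rstrip("\n"))
    if len(parts) != len(FIELDS):
        return None
    return {name: value.strip() for name, value in zip(FIELDS, parts)}


sample = "2025-10-09 17:08:27 | ERROR | Database connection failed | server-node-2"
record = parse_log_line(sample)
print(record)
```

Unlike this helper, the Spark version does not reject malformed lines: indexing past the end of the split array typically yields nulls, which the later filter on `level` then discards.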

### Filter critical errors

In [19]:
# Keep only ERROR entries whose message reports an HTTP 500
critical_errors_df = parsed_logs_df \
    .filter((col("level") == "ERROR") & (col("message").contains("500 Internal Server Error")))

### Create queries

In [20]:
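# Running count of critical errors per message (an aggregation)
errors_count = critical_errors_df.groupBy("message").count()
# Row-level view of the same critical errors (no aggregation)
errors_detailed = critical_errors_df.select("timestamp", "level", "message", "server")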
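To make explicit what the filter and the two queries compute, here is a small plain-Python sketch of the same logic over a list of already-parsed rows. The sample rows are made up for illustration and `is_critical` is a hypothetical name.

```python
from collections import Counter

# Made-up sample rows in the parsed layout (not taken from the real log files)
rows = [
    {"timestamp": "2025-10-09 17:08:27", "level": "ERROR",
     "message": "500 Internal Server Error", "server": "server-node-2"},
    {"timestamp": "2025-10-09 17:08:27", "level": "ERROR",
     "message": "Database connection failed", "server": "server-node-2"},
    {"timestamp": "2025-10-09 17:08:27", "level": "INFO",
     "message": "Request received", "server": "server-node-3"},
    {"timestamp": "2025-10-09 17:32:31", "level": "ERROR",
     "message": "500 Internal Server Error", "server": "server-node-4"},
]


def is_critical(row):
    """Mirror of the streaming filter: ERROR level and an HTTP 500 message."""
    return row["level"] == "ERROR" and "500 Internal Server Error" in row["message"]


# Equivalent of errors_detailed: the matching rows themselves
critical = [row for row in rows if is_critical(row)]

# Equivalent of errors_count: one counter per distinct message
counts = Counter(row["message"] for row in critical)

print(critical)
print(counts)
```

In the streaming job the counter is kept as state across micro-batches, which is why the aggregated query is written in `complete` mode while the row-level query can use `append`.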

### Write streaming

In [22]:
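# Aggregated query: "complete" mode re-emits the full result table on every batch
count = errors_count.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .start()

# Row-level query: "append" mode emits only the rows added since the last batch
query = errors_detailed.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

print("Streaming started!")
# Both queries run concurrently; wait up to 20 seconds, then stop the first one
count.awaitTermination(20)
count.stop()

# The second query has already been running; this waits up to 20 more seconds
query.awaitTermination(20)
query.stop()

print("Stream finalized!")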

Streaming started!
-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+-----+-------------------------+-------------+
|timestamp          |level|message                  |server       |
+-------------------+-----+-------------------------+-------------+
|2025-10-09 17:08:27|ERROR|500 Internal Server Error|server-node-2|
|2025-10-09 17:08:27|ERROR|500 Internal Server Error|server-node-2|
|2025-10-09 17:32:31|ERROR|500 Internal Server Error|server-node-4|
|2025-10-09 17:32:31|ERROR|500 Internal Server Error|server-node-2|
|2025-10-09 17:32:31|ERROR|500 Internal Server Error|server-node-4|
|2025-10-09 17:34:24|ERROR|500 Internal Server Error|server-node-2|
|2025-10-09 17:34:24|ERROR|500 Internal Server Error|server-node-4|
|2025-10-09 17:34:24|ERROR|500 Internal Server Error|server-node-4|
|2025-10-09 17:34:24|ERROR|500 Internal Server Error|server-node-2|
|2025-10-09 17:34:24|ERROR|500 Internal Server Error|server-node-3|


                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------------+-----+
|message                  |count|
+-------------------------+-----+
|500 Internal Server Error|15   |
+-------------------------+-----+

Stream finalized!
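Because the two waits in the cell run one after the other, the second query can stay alive for up to 40 seconds in total. One possible alternative, shown here only as a sketch, is to give all queries a single shared deadline. `run_queries_for` is a hypothetical helper; it relies only on the `awaitTermination(timeout)` and `stop()` methods that the notebook already calls on its queries.

```python
import time


def run_queries_for(queries, seconds, min_wait=0.05):
    """Let all queries run for roughly `seconds` in total, then stop each one."""
    deadline = time.monotonic() + seconds
    for q in queries:
        remaining = deadline - time.monotonic()
        # Skip tiny or negative remainders: a non-positive timeout is not a valid argument
        if remaining >= min_wait:
            q.awaitTermination(remaining)
    for q in queries:
        q.stop()


# Intended usage in the notebook (assumes `count` and `query` were started as in the cell):
# run_queries_for([count, query], 20)
```

With this helper the time budget is spent once, no matter how many queries are passed in.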


In [23]:
sc.stop()