# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Structured Streaming with Files

**Date**: October 2nd 2025

**Student Name**: Juan Alonso

**Professor**: Pablo Camarillo Ramirez

## Setup files

## Create Random Logs

In [1]:
from juanalonso.producer import StreamLogProducer

logs = StreamLogProducer()

logs.entries_count = 100
logs._write_file()

Creating log directory: /opt/spark/work-dir/data/juanalonso/logs
[08:31:07] Wrote file: /opt/spark/work-dir/data/juanalonso/logs/logfile_20251010083107602950.txt


In [2]:
!pwd
!ls /opt/spark/work-dir/data/juanalonso/logs
!cat /opt/spark/work-dir/data/juanalonso/logs/*

/opt/spark/work-dir
logfile_20251010083107602950.txt
2025-10-10 08:31:07 | INFO | User login successful | server-node-3
2025-10-10 08:31:07 | ERROR | 404 Not Found | server-node-3
2025-10-10 08:31:07 | INFO | Port Currently Listening | server-node-1
2025-10-10 08:31:07 | INFO | Port Currently Listening | server-node-2
2025-10-10 08:31:07 | INFO | API got 'GET' request | server-node-4
2025-10-10 08:31:07 | WARN | Timeout, no response received | server-node-4
2025-10-10 08:31:07 | INFO | Port Currently Listening | server-node-1
2025-10-10 08:31:07 | INFO | API got 'GET' request | server-node-3
2025-10-10 08:31:07 | INFO | Port Currently Listening | server-node-1
2025-10-10 08:31:07 | WARN | Timeout, no response received | server-node-2
2025-10-10 08:31:07 | INFO | API got 'GET' request | server-node-4
2025-10-10 08:31:07 | ERROR | 500 Internal Server Error | server-node-4
2025-10-10 08:31:07 | INFO | API got 'GET' request | server-node-1
2025-10-10 08:31:07 | WARN | Timeout, no response 
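
Each log line above carries four fields separated by ` | `: timestamp, level, message, and server. As a minimal plain-Python sketch of that layout (the `LogEntry` type and `parse_log_line` helper are hypothetical and not part of the `juanalonso` package), a single line could be split like this:

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    message: str
    server: str


def parse_log_line(line: str) -> LogEntry:
    """Split one 'timestamp | level | message | server' line into its fields."""
    parts = [part.strip() for part in line.split(" | ")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return LogEntry(*parts)


entry = parse_log_line("2025-10-10 08:31:07 | ERROR | 404 Not Found | server-node-3")
print(entry.level, entry.server)
```

Rejecting lines that do not have exactly four fields is a design choice for this sketch; it makes truncated or malformed lines visible instead of silently producing missing values.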

## Initialize Spark

In [3]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Lab 05 - Structured Streaming with Files") \
    .master("spark://spark-master:7077") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
spark.conf.set("spark.sql.shuffle.partitions", "5")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/10/10 08:33:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Create Schema

In [4]:
from pyspark.sql.functions import col, trim
from juanalonso.spark_utils import SparkUtils

logs_schema = SparkUtils.generate_schema([
    ("timestamp", "string"),
    ("level", "string"),
    ("message", "string"),
    ("server", "string")
])
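
The implementation of `SparkUtils.generate_schema` is not shown in this notebook. As a hypothetical stand-in, the same column list can be rendered as a DDL string, which Spark's `schema(...)` reader method also accepts in place of a `StructType`. The `to_ddl` name is an assumption made for illustration only.

```python
from typing import Iterable, Tuple


def to_ddl(columns: Iterable[Tuple[str, str]]) -> str:
    """Render (name, type) pairs as a Spark DDL schema string.

    Column names are wrapped in backticks so that names such as
    `timestamp` cannot be mistaken for SQL keywords.
    """
    return ", ".join(f"`{name}` {dtype.upper()}" for name, dtype in columns)


logs_ddl = to_ddl([
    ("timestamp", "string"),
    ("level", "string"),
    ("message", "string"),
    ("server", "string"),
])
print(logs_ddl)
```

A schema of this kind is what a structured source such as CSV or JSON would consume. The `text` source used in this lab produces a single string column named `value`, so the four fields are extracted by splitting that column instead.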

## Read the stream with Spark

In [8]:
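from pyspark.sql.functions import split

# The text source yields one string column, `value`, holding each raw log line.
logs_df = spark.readStream \
    .format("text") \
    .load("/opt/spark/work-dir/data/juanalonso/logs")

# Split each line on " | " (split takes a regex, so the pipe is escaped),
# then pick the four fields by position.
parsed_logs_df = logs_df.select(
    split(col("value"), " \\| ").alias("parsed")
).select(
    col("parsed")[0].alias("timestamp"),
    trim(col("parsed")[1]).alias("level"),
    col("parsed")[2].alias("message"),
    col("parsed")[3].alias("server")
)

# Show the schema of the raw (unparsed) stream.
logs_df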

DataFrame[value: string]

## Filter critical errors

In [9]:
critical_errors_df = parsed_logs_df \
    .filter((col("level") == "ERROR") & (col("message").contains("500 Internal Server Error")))

## Create queries

In [10]:
errors_count = critical_errors_df.groupBy("message").count()
errors_detailed = critical_errors_df.select("timestamp", "level", "message", "server")
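
To make the filter and the two queries concrete, here is a plain-Python sketch of the same logic on a handful of rows: keep only `ERROR` rows whose message contains `500 Internal Server Error`, then count them per message. The sample rows are made up for illustration and the code does not use Spark.

```python
from collections import Counter

# Made-up rows in (timestamp, level, message, server) order.
rows = [
    ("2025-10-10 08:31:07", "ERROR", "500 Internal Server Error", "server-node-4"),
    ("2025-10-10 08:31:07", "ERROR", "404 Not Found", "server-node-3"),
    ("2025-10-10 08:31:07", "INFO", "User login successful", "server-node-3"),
    ("2025-10-10 08:31:07", "ERROR", "500 Internal Server Error", "server-node-2"),
]

# Same predicate as the filter that builds critical_errors_df.
critical = [
    row for row in rows
    if row[1] == "ERROR" and "500 Internal Server Error" in row[2]
]

# Equivalent of groupBy("message").count().
counts = Counter(message for _, _, message, _ in critical)
print(counts)
```

The `404 Not Found` row is also an `ERROR`, but it fails the message check, so it appears in neither the detailed rows nor the count.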

## Write the stream

In [11]:
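# The aggregated count uses "complete" mode, which prints the full
# result table on every trigger.
count = errors_count.writeStream \
    .outputMode("complete") \
    .format("console") \
    .option("truncate", "false") \
    .start()

# The detailed rows use "append" mode, which prints only newly arrived rows.
query = errors_detailed.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("truncate", "false") \
    .start()

# Wait up to 20 seconds for each query, then stop it.
count.awaitTermination(20)
count.stop()

query.awaitTermination(20)
query.stop()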



-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+-----+-------------------------+-------------+
|timestamp          |level|message                  |server       |
+-------------------+-----+-------------------------+-------------+
|2025-10-10 08:31:07|ERROR|500 Internal Server Error|server-node-4|
|2025-10-10 08:31:07|ERROR|500 Internal Server Error|server-node-2|
|2025-10-10 08:31:07|ERROR|500 Internal Server Error|server-node-1|
|2025-10-10 08:31:07|ERROR|500 Internal Server Error|server-node-3|
|2025-10-10 08:31:07|ERROR|500 Internal Server Error|server-node-4|
+-------------------+-----+-------------------------+-------------+




-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------------+-----+
|message                  |count|
+-------------------------+-----+
|500 Internal Server Error|5    |
+-------------------------+-----+
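
The two sinks use different output modes: `append` emits only the rows that arrived in the current micro-batch, while `complete` re-emits the whole aggregated result on every trigger. The following plain-Python simulation illustrates the difference. It does not use Spark, and the batch contents are invented for the example.

```python
from collections import Counter
from typing import Dict, List

# Two invented micro-batches of already-filtered error messages.
batches: List[List[str]] = [
    ["500 Internal Server Error"] * 5,
    ["500 Internal Server Error"] * 2,
]

running_counts: Counter = Counter()          # aggregation state kept across batches
append_output: List[List[str]] = []          # what an append-mode sink would receive
complete_output: List[Dict[str, int]] = []   # what a complete-mode sink would receive

for batch in batches:
    # append: only the new rows from this batch are emitted.
    append_output.append(list(batch))
    # complete: update the state, then emit the full result table.
    running_counts.update(batch)
    complete_output.append(dict(running_counts))

print(append_output[1])
print(complete_output[1])
```

In the run above only one log file existed when the queries started, which is why each query printed a single `Batch: 0`.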

