# <center> <img src="../../labs/img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Big Data** </center>
---
### <center> **Spring 2025** </center>
---
### <center> **Application: Video Streaming Analytics** </center>
---
#### <center> **Live monitoring of video quality, viewer behavior, and content recommendations from services like Netflix or YouTube.** </center>

# <center> <img src="https://upload.wikimedia.org/wikipedia/commons/e/ef/Youtube_logo.png" width="640" height="443"> </center>
---
**Profesor**: Dr. Pablo Camarillo Ramirez

**Team members**: 
- Miguel Alberto Torres Dueñas
- Juan Pablo Cortez Navarro
- Luther Williams Sandria 
- Ferdinand Bierbaum
---

# 1. Introduction and Problem Definition

## Project's Objective
Develop a real-time recommendation system for a streaming platform that:
- Analyze user behavior (viewing time, pauses, skips, etc.)
- Generate personalized recommendations using machine learning
- Scale to handle large volumes of data

## App's Description
Our solution implements:
- **Data Ingestion**: Consuming real-time visualization events from Kafka
- **Processing**: Data transformation and enrichment with PySpark
- **Modeling**: ALS (Alternating Least Squares) based recommendation system
- **Visualization**: Dashboard in PowerBI with key metrics

# 2. Arquitectura del Sistema

pongan el esquema aqui

# 3. Justificación de las 5V's del Big Data

### Volume
- **Estimación de Crecimiento**:
  
  |    Time Perriod  | Data Processed |
  |------------------|----------------|
  | 1 Second         | 500 KB         |
  | 1 Minute (60s)   | 30 MB          |
  | 1 Hour (3,600s)  | 1.8 GB         |
  | 1 Day (86,400s)  | 43.2 GB        |
  | 1 Year (31.5M s) | 15.7 TB        |


- **Management Strategies**:
  - Data Partitioning in Parquet
  - Distributed processing with Spark
  - Schema optimization (appropriate data types)

### Velocity
- **Performance Metrics**:
  - `processedRowsPerSecond`: X rows/second
  - Latencia end-to-end: < X seconds for recommendations
- **Techniques Implemented**:
  - Structured Streaming with triggers each 3 seconds
  - Checkpointing to ensure exactly-one-processing

### Variety
- **Data Schema**:
```python
root
 |-- user_id: string (nullable = true)
 |-- video_id: string (nullable = true)
 |-- watch_time_seconds: double (nullable = true)
 |-- resolution: string (nullable = true)
 |-- buffering_events: integer (nullable = true)
 |-- paused: boolean (nullable = true)
 |-- skipped: boolean (nullable = true)
 |-- genre: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
```

### Veracity
- Schema validation when ingesting data
- Filtering incomplete records
- Quality metrics in PowerBI

### Value
- watch_time_seconds: User engagement
- buffering_events: Quality of Experience
- genre: Personal preferences
- skipped: Non-relevant content

# 4. Implementation

## Spark Configuration

In [None]:
import findspark
findspark.init()
#0be7b65b50a239d7ee8b621f3c329b25c5c4aadafbae5ac7

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQLStructuredStreaming-Kafka") \
    .master("spark://7a8106b8550d:7077") \
    .config("spark.ui.port","4040") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.13:3.5.4") \
    .getOrCreate()
sc = spark.sparkContext

:: loading settings :: url = jar:file:/opt/conda/spark-3.5.4-bin-hadoop3-scala2.13/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.spark#spark-sql-kafka-0-10_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5af0a233-6c59-43c6-a9c6-25dc29bae959;1.0
	confs: [default]
	found org.apache.spark#spark-sql-kafka-0-10_2.13;3.5.4 in central
	found org.apache.spark#spark-token-provider-kafka-0-10_2.13;3.5.4 in central
	found org.apache.kafka#kafka-clients;3.4.1 in central
	found org.lz4#lz4-java;1.8.0 in central
	found org.xerial.snappy#snappy-java;1.1.10.5 in central
	found org.slf4j#slf4j-api;2.0.7 in central
	found org.apache.hadoop#hadoop-client-runtime;3.3.4 in central
	found org.apache.hadoop#hadoop-client-api;3.3.4 in central
	found commons-logging#commons-logging;1.1.3 in central
	found com.google.code.findbugs#jsr305;3.0.0 in central
	found org.scala-lang.modules#scala-parallel-collections_2.13;1.0.4 in central
	found org.apache.commons#commons-pool2;2.11.1 in centr

## Kafka Link

In [3]:
kafka_lines = spark \
                .readStream \
                .format("kafka") \
                .option("kafka.bootstrap.servers", "ed69dac0a4e4:9093") \
                .option("subscribe", "kafka-spark-example-0") \
                .option("subscribe", "kafka-spark-example-1") \
                .option("subscribe", "kafka-spark-example-2") \
                .option("subscribe", "kafka-spark-example-3") \
                .load()

kafka_lines.printSchema()

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)



## Schema

In [4]:
from pyspark.sql.functions import split, col, expr

kafka_df = kafka_lines.select(split(col("value"), ",").alias("pairs_array"))

kafka_df = kafka_df.withColumn("user_id", split(col("pairs_array").getItem(0), ":").getItem(1))
kafka_df = kafka_df.withColumn("video_id", split(col("pairs_array").getItem(1), ":").getItem(1))
kafka_df = kafka_df.withColumn("watch_time_seconds", split(col("pairs_array").getItem(2), ":").getItem(1))
kafka_df = kafka_df.withColumn("resolution", split(col("pairs_array").getItem(3), ":").getItem(1))
kafka_df = kafka_df.withColumn("bitrate_kbps", split(col("pairs_array").getItem(4), ":").getItem(1))
kafka_df = kafka_df.withColumn("buffering_events", split(col("pairs_array").getItem(5), ":").getItem(1))
kafka_df = kafka_df.withColumn("paused", split(col("pairs_array").getItem(6), ":").getItem(1))
kafka_df = kafka_df.withColumn("skipped", split(col("pairs_array").getItem(7), ":").getItem(1))
kafka_df = kafka_df.withColumn("genre", split(col("pairs_array").getItem(8), ":").getItem(1))
kafka_df = kafka_df.withColumn("region", split(col("pairs_array").getItem(9), ":").getItem(1))
kafka_df = kafka_df.withColumn("recommended", split(col("pairs_array").getItem(10), ":").getItem(1))

# Usamos expr para hacer la resta de longitud directamente
kafka_df = kafka_df.withColumn(
    "timestamp",
    expr("substring(split(pairs_array[11], ':')[1], 1, length(split(pairs_array[11], ':')[1]) - 1)")
)

kafka_df.printSchema()


root
 |-- pairs_array: array (nullable = true)
 |    |-- element: string (containsNull = false)
 |-- user_id: string (nullable = true)
 |-- video_id: string (nullable = true)
 |-- watch_time_seconds: string (nullable = true)
 |-- resolution: string (nullable = true)
 |-- bitrate_kbps: string (nullable = true)
 |-- buffering_events: string (nullable = true)
 |-- paused: string (nullable = true)
 |-- skipped: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- region: string (nullable = true)
 |-- recommended: string (nullable = true)
 |-- timestamp: string (nullable = true)



## Streaming processing

In [6]:
query_files = kafka_df \
                .writeStream \
                .outputMode("append") \
                .trigger(processingTime='3 seconds') \
                .format("parquet") \
                .option("path", "/home/jovyan/notebooks/data/project_parquet") \
                .option("truncate", "false") \
                .option("checkpointLocation", "/home/jovyan/checkpoint") \
                .start()
query_files.awaitTermination(300)
query_files.stop()

25/05/14 00:53:45 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
25/05/14 00:53:46 WARN AdminClientConfig: These configurations '[key.deserializer, value.deserializer, enable.auto.commit, max.poll.records, auto.offset.reset]' were supplied but are not used yet.
25/05/14 00:53:50 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 3000 milliseconds, but spent 4208 milliseconds
25/05/14 00:53:53 WARN ProcessingTimeExecutor: Current batch is falling behind. The trigger interval is 3000 milliseconds, but spent 3703 milliseconds
                                                                                

In [7]:
df = spark.read.parquet("/home/jovyan/notebooks/data/project_parquet")
df.show(5, False)


+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+----------+------------------+----------+------------+----------------+------+-------+--------------+------+-----------+-------------+
|pairs_array                                                                                                                                                                                                                                                                         |user_id     |video_id  |watch_time_seconds|resolution|bitrate_kbps|buffering_events|paused|skipped|genre         |region|recommended|timestamp    |
+---------------------------------------------------------------------------------------------------------------------------------------------------

## Machine Learning Model - ALS

In [8]:
from pyspark.sql.functions import col, regexp_extract, when

# Leer datos Parquet
video_data = spark.read.parquet("/home/jovyan/notebooks/data/project_parquet")

# Convertir tipos de datos correctamente
video_data = video_data.withColumn("watch_time_seconds", col("watch_time_seconds").cast("double")) \
                      .withColumn("buffering_events", col("buffering_events").cast("double")) \
                      .withColumn("skipped", when(col("skipped") == "true", 1).otherwise(0))

# Crear IDs numéricos (como lo tienes)
video_data = video_data.withColumn("user_id_int", regexp_extract(col("user_id"), "user_(\\d+)", 1).cast("integer")) \
                      .withColumn("video_id_int", regexp_extract(col("video_id"), "vid_(\\d+)", 1).cast("integer"))

# Crear métrica de feedback implícito (versión corregida)
video_data = video_data.withColumn("implicit_feedback", 
                                 (col("watch_time_seconds")/3600 * 0.6 +  # Tiempo de visualización en horas
                                  (1 - col("buffering_events")/10) * 0.3 +  # Inverso de buffering
                                  (1 - col("skipped")) * 0.1))  # No saltado

# Verificar datos
video_data.select("user_id_int", "video_id_int", "implicit_feedback").show(5)

+-----------+------------+-------------------+
|user_id_int|video_id_int|  implicit_feedback|
+-----------+------------+-------------------+
|       5232|          21| 0.2813333333333333|
|       2262|          51| 0.7746666666666666|
|       8935|          75| 0.4754999999999999|
|       2928|          67|0.30466666666666664|
|       6080|          25| 0.5861666666666667|
+-----------+------------+-------------------+
only showing top 5 rows



In [9]:
from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="user_id_int",
    itemCol="video_id_int",
    ratingCol="implicit_feedback",
    implicitPrefs=True,  # Usar feedback implícito
    coldStartStrategy="drop",  # Manejar nuevos usuarios/videos
    nonnegative=True,  # Solo factores positivos
    rank=10,  # Factores latentes (ajustable)
    maxIter=15,  # Iteraciones (ajustable)
    regParam=0.1  # Regularización (ajustable)
)

In [10]:
# Entrenar modelo
model = als.fit(video_data)

25/05/14 00:59:17 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
25/05/14 00:59:17 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS


In [12]:
# Recomendaciones para todos los usuarios (top 10)
user_recs = model.recommendForAllUsers(5)

# Mostrar recomendaciones para algunos usuarios
print("Recomendaciones para usuarios de ejemplo:")
user_recs.show(truncate=False)

# Recomendaciones para un usuario específico
user_id_ejemplo = 8072
recs_usuario = model.recommendForUserSubset(
    spark.createDataFrame([(user_id_ejemplo,)]).toDF("user_id_int"), 
    5
)
print(f"\nTop 5 recomendaciones para usuario {user_id_ejemplo}:")
recs_usuario.show(truncate=False)

Recomendaciones para usuarios de ejemplo:


                                                                                

+-----------+-------------------------------------------------------------------------------------------------+
|user_id_int|recommendations                                                                                  |
+-----------+-------------------------------------------------------------------------------------------------+
|6080       |[{51, 0.38348034}, {1, 0.3802694}, {25, 0.28687268}, {67, 0.049135257}, {58, 0.026772548}]       |
|7321       |[{60, 0.50269973}, {86, 0.44174686}, {100, 0.3505718}, {43, 0.029861854}, {67, 0.008550466}]     |
|2391       |[{95, 0.48697206}, {66, 0.38709062}, {7, 0.29938403}, {73, 0.06898091}, {13, 1.0658878E-6}]      |
|9811       |[{37, 0.5696879}, {13, 0.53095186}, {21, 0.16064978}, {1, 4.3297778E-6}, {60, 9.962939E-7}]      |
|9531       |[{58, 0.8010914}, {74, 0.3921574}, {25, 0.02575272}, {7, 1.2776397E-4}, {31, 9.470557E-5}]       |
|9442       |[{3, 0.89450336}, {79, 0.18814947}, {74, 0.016068548}, {100, 5.4950247E-6}, {95, 2.4708405E

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# Hacer predicciones
predictions = model.transform(video_data)
predictions.show(truncate=False)

                                                                                

+--------------------+------------+----------+------------------+----------+------------+----------------+------+-------+--------------+------+-----------+-------------+-----------+------------+-------------------+-----+------------+
|         pairs_array|     user_id|  video_id|watch_time_seconds|resolution|bitrate_kbps|buffering_events|paused|skipped|         genre|region|recommended|    timestamp|user_id_int|video_id_int|  implicit_feedback|label|  prediction|
+--------------------+------------+----------+------------------+----------+------------+----------------+------+-------+--------------+------+-----------+-------------+-----------+------------+-------------------+-----+------------+
|[{"user_id": "use...| "user_6825"| "vid_100"|            1609.0|      "4K"|        7104|             4.0|  true|      0|      "Sci-Fi"|  "BR"|       true| "05/14/2025"|       6825|         100| 0.5481666666666667|    1|  0.27823943|
|[{"user_id": "use...| "user_8935"| "vid_075"|             813.0

In [14]:
# Evaluar con RMSE
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="implicit_feedback",
    predictionCol="prediction"
)
rmse = evaluator.evaluate(predictions)
print(f"\nRMSE del modelo: {rmse}")

                                                                                


RMSE del modelo: 0.3187478627985201


# 5. Results and Evaluation

In [15]:
from pyspark.ml.feature import VectorAssembler, StringIndexer

# Prepare data
video_data = video_data.withColumn("label", 
                                 expr("CASE WHEN implicit_feedback >= 0.5 THEN 1 ELSE 0 END"))

# Features
feature_cols = ["watch_time_seconds", "buffering_events", "skipped", "user_id_int", "video_id_int"]

# Change genre and region in indexes
genre_indexer = StringIndexer(inputCol="genre", outputCol="genre_index")
region_indexer = StringIndexer(inputCol="region", outputCol="region_index")

# Assembler
assembler = VectorAssembler(
    inputCols=feature_cols + ["genre_index", "region_index"],
    outputCol="features"
)

In [None]:
# Apply transformations
data_label = genre_indexer.fit(video_data).transform(video_data)
data_label = region_indexer.fit(data_label).transform(data_label)
data_with_features = assembler.transform(data_label).select("label", "features")

# Divide data (train, test)
train_df, test_df = data_with_features.randomSplit([0.8, 0.2], seed=57)

print("Original Dataset")
data_with_features.show(5)

print("\nTrain set")
train_df.show()

                                                                                

Original Dataset
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[188.0,5.0,0.0,52...|
|    1|[2428.0,1.0,0.0,2...|
|    0|[813.0,2.0,0.0,89...|
|    0|[328.0,5.0,0.0,29...|
|    1|[1837.0,4.0,0.0,6...|
+-----+--------------------+
only showing top 5 rows


Train set


[Stage 1384:>                                                       (0 + 1) / 1]

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|[328.0,5.0,0.0,29...|
|    0|[402.0,3.0,0.0,65...|
|    0|[813.0,2.0,0.0,89...|
|    0|[871.0,3.0,0.0,95...|
|    1|[1161.0,1.0,0.0,2...|
|    1|[1337.0,0.0,0.0,9...|
|    1|[1609.0,4.0,0.0,6...|
|    1|[2645.0,3.0,0.0,7...|
|    1|[2818.0,0.0,0.0,6...|
|    1|[3342.0,3.0,0.0,5...|
|    1|[3401.0,0.0,0.0,4...|
|    1|[3507.0,3.0,0.0,2...|
|    1|[3508.0,2.0,0.0,8...|
|    0|[252.0,5.0,0.0,29...|
|    1|[1865.0,5.0,0.0,7...|
|    1|[3061.0,5.0,0.0,8...|
|    1|[3110.0,4.0,0.0,3...|
+-----+--------------------+



                                                                                

In [25]:
from pyspark.ml.classification import DecisionTreeClassifier

# Initialize and train the Decision Tree model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
dt_model = dt.fit(train_df)

# Display model summary
print("\nDecision Tree model summary:{0}".format(dt_model.toDebugString))
 

25/05/14 01:12:01 WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 17 (= number of training instances)


Decision Tree model summary:DecisionTreeClassificationModel: uid=DecisionTreeClassifier_eecd8e840088, depth=1, numNodes=3, numClasses=2, numFeatures=7
  If (feature 0 <= 1016.0)
   Predict: 0.0
  Else (feature 0 > 1016.0)
   Predict: 1.0



                                                                                

In [27]:
# Make predictions
predictions = dt_model.transform(test_df)

# Show predictions
predictions.select("features", "label", "prediction").show()

[Stage 1396:>                                                       (0 + 1) / 1]

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|[188.0,5.0,0.0,52...|    0|       0.0|
|[347.0,4.0,0.0,56...|    0|       0.0|
|[348.0,5.0,0.0,94...|    0|       0.0|
|[1001.0,4.0,0.0,9...|    0|       0.0|
|[1837.0,4.0,0.0,6...|    1|       1.0|
|[2028.0,0.0,0.0,6...|    1|       1.0|
|[2399.0,0.0,0.0,3...|    1|       1.0|
|[2428.0,1.0,0.0,2...|    1|       1.0|
|[2837.0,2.0,0.0,8...|    1|       1.0|
|[408.0,5.0,0.0,20...|    0|       0.0|
|[1018.0,3.0,0.0,2...|    0|       1.0|
|[1550.0,0.0,0.0,3...|    1|       1.0|
|[2294.0,2.0,0.0,5...|    1|       1.0|
|[2740.0,3.0,0.0,4...|    1|       1.0|
|[2751.0,5.0,0.0,7...|    1|       1.0|
|[2887.0,4.0,0.0,7...|    1|       1.0|
+--------------------+-----+----------+



                                                                                

In [28]:
# Evaluate model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label",
                            predictionCol="prediction")

accuracy = evaluator.evaluate(predictions, 
                  {evaluator.metricName: "accuracy"})
print(f"Accuracy: {accuracy}")
precision = evaluator.evaluate(predictions,
                  {evaluator.metricName: "weightedPrecision"})
print(f"Precision: {precision}")
recall = evaluator.evaluate(predictions,
                  {evaluator.metricName: "weightedRecall"})
print(f"Recall: {recall}")
f1 = evaluator.evaluate(predictions,
                {evaluator.metricName: "f1"})
print(f"F1 Score: {f1}")

                                                                                

Accuracy: 0.9375


                                                                                

Precision: 0.9431818181818181


                                                                                

Recall: 0.9375




F1 Score: 0.9361471861471862


                                                                                

In [29]:
from pyspark.ml.classification import LinearSVC, OneVsRest

# LinearSVC
lsvc = LinearSVC(maxIter=10, regParam=0.1, labelCol="label", featuresCol="features")

# OneVsRest
ovr = OneVsRest(classifier=lsvc, labelCol="label", featuresCol="features")

# Train model
ovr_model = ovr.fit(train_df)

# Make predictions
ovr_predictions = ovr_model.transform(test_df)

# Show predictions
ovr_predictions.select("features", "label", "prediction").show(5)

# Evaluate model
accuracy_ovr = evaluator.evaluate(ovr_predictions, {evaluator.metricName: "accuracy"})
print(f"\nAccuracy Score (OneVsRest + LinearSVC): {accuracy_ovr}")

precision_ovr = evaluator.evaluate(ovr_predictions, {evaluator.metricName: "weightedPrecision"})
print(f"Precision Score (OneVsRest + LinearSVC): {precision_ovr}")

recall_ovr = evaluator.evaluate(ovr_predictions, {evaluator.metricName: "weightedRecall"})
print(f"Recall Score (OneVsRest + LinearSVC): {recall_ovr}")

f1_ovr = evaluator.evaluate(ovr_predictions, {evaluator.metricName: "f1"})
print(f"F1 Score (OneVsRest + LinearSVC): {f1_ovr}")

                                                                                

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|[188.0,5.0,0.0,52...|    0|       0.0|
|[347.0,4.0,0.0,56...|    0|       1.0|
|[348.0,5.0,0.0,94...|    0|       0.0|
|[1001.0,4.0,0.0,9...|    0|       1.0|
|[1837.0,4.0,0.0,6...|    1|       1.0|
+--------------------+-----+----------+
only showing top 5 rows



                                                                                


Accuracy Score (OneVsRest + LinearSVC): 0.8125


                                                                                

Precision Score (OneVsRest + LinearSVC): 0.8557692307692308


                                                                                

Recall Score (OneVsRest + LinearSVC): 0.8125




F1 Score (OneVsRest + LinearSVC): 0.7934782608695653


                                                                                

In [62]:
sc.stop()

# 6. Conclusion

## Main Achievements
- Real-time recommendation system operational
- Scalable data pipeline
- Model with good accuracy metrics

## Key Learnings
1. Importance of preprocessing for streaming data
2. Advantages of ALS for implicit feedback
3. Challenges in the balance between coverage and diversity