# <center> <img src="../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Big Data** </center>
---
### <center> **Autumn 2025** </center>
---
### <center> **Poryect Machine Learning: Logistic Regression** </center>
---
**Profesor**: Pablo Camarillo Ramirez

Diego Orozco Alvarado

## Machine Learning algorithm to use (10 points)

El problema que busco resolver es clasificar si un usuario va a dar “like” a un video recomendado o no, utilizando información de comportamiento como tiempo de visualización, categoría del video, dispositivo, interacciones previas, entre otros atributos.
Este problema es naturalmente un caso de clasificación binaria, ya que la variable objetivo (liked) solo toma dos valores: 0 (no le dio like) y 1 (sí le dio like).

Elegí utilizar Logistic Regression por que:

- Es un modelo ampliamente utilizado para clasificación binaria y proporciona una interpretación clara en términos de probabilidades.
- Escala bien para datasets grandes gracias a su implementación distribuida en PySpark.
- Permite inspeccionar coeficientes y contribuciones de cada feature, lo cual resulta útil para entender los factores que influyen en que un usuario dé like.

Ademas dado que el objetivo de esa parte del proyecto es construir un modelo de los vistos en clase, me parecio adecuado usar este modelo.

## Dataset Description (20 points)

El dataset utilizado proviene de Kaggle:
“YouTube Recommendation Data for Cleaning and ML”
(https://www.kaggle.com/datasets/iitanshravan/youtube-recommendation-data-for-cleaning-and-ml)

Este dataset contiene información sobre la interacción de usuarios con videos recomendados, incluyendo métricas de visualización, dispositivos, categorías, horario del día, acciones del usuario (likes, comentarios, suscripciones), entre otras.

El dataset original contiene aproximadamente:

40,000+ filas (registros de interacciones usuario–video)

columnas: user_id, video_id, video_duration, watch_time, liked, commented, subscribed_after, category, device, watch_time_of_day, recommended, clicked, timestamp, watch_percent

#### is Balanced?

Como el objetivo del proyecto es resolver un problema de clasificación, se analizó la distribución de la clase objetivo liked usando PySpark. (como se muestra mas adelante)
Los resultados muestran que:

La clase 0 (no like) es más frecuente.

La clase 1 (like) aparece menos en proporción.

Esto significa que el dataset no está perfectamente balanceado, aunque no presenta un desbalance extremo.
En este escenario, en lugar de usar pesos (weightCol), se optó por usar un ajuste directo del umbral de clasificación (threshold) en el modelo, lo cual permite controlar la sensibilidad del modelo sin modificar las proporciones del dataset.

Features Utilizados

El dataset mezcla características numéricas y categóricas:
- Numéricas: watch_time, video_duration, watch_percent, interacciones históricas, etc.
- Categóricas: device, category, watch_time_of_day.
- Objetivo: liked.

Para poder usar este algoritmo todas las variables categóricas fueron transformadas usando StringIndexer y OneHotEncoder para crear un vector de características adecuado.

## ML Training Process (30 points)

Para entrenar el modelo se implementó un pipeline completo en PySpark.

Data Transformation Steps

1 Conversión de tipos
- Se normalizaron los tipos: columnas numéricas a double, categóricas a string.

2 Indexación de columnas categóricas
- Con StringIndexer se generaron índices numéricos consistentes.

3 One-Hot Encoding
- Las categorías indexadas se transformaron en vectores dispersos mediante OneHotEncoder.

4 VectorAssembler
- Todas las características numéricas + OHE se integraron en un único vector features.

Train/Test Split
Se dividió el dataset en:
- 80% entrenamiento
- 20% prueba

Logistic Regression config:
- El modelo utilizado fue Logistic Regression, configurado de la siguiente manera:

lr = LogisticRegression(
    featuresCol="features",
    labelCol="liked",
    threshold=0.45
)

Hyperparameters

Además del threshold, se usaron los parámetros estándar:

- featuresCol="features": vector ensamblado.
- labelCol="liked": variable objetivo.
- maxIter=100 (por defecto): asegura convergencia del modelo.
- Regularización por defecto (regParam estándar de Spark): controla el sobreajuste.

## ML Evaluation (20 points)

Para evaluar el modelo se usaron las métricas estándar para clasificación binaria:

Metrics used

- Accuracy: proporción de predicciones correctas.
- Precision: qué tan confiables son las predicciones positivas.
- Recall: capacidad del modelo para detectar correctamente los casos positivos.
- F1 Score: balance entre precision y recall.

Estas métricas fueron calculadas usando BinaryClassificationEvaluator y funciones de PySpark.

Prediction Process

Una vez entrenado el modelo se aplicó el modelo al conjunto de prueba (test_df).

El modelo generó columnas:

- rawPrediction
- probability (vector con probabilidades de cada clase)
- prediction (clase binaria final)

A partir de estas columnas se computaron las métricas mencionadas.

El modelo obtuvo resultados coherentes y consistentes, mostrando que es capaz de capturar patrones en el comportamiento del usuario respecto a la probabilidad de dar like.

# Create SparkSession

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ML: Logistic Regression") \
    .master("local[*]") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("INFO")

# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/22 22:50:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Collect Data

In [2]:
!pwd

/opt/spark/work-dir/final_project/machine_learning


In [3]:
!du -sh ../../data/data_proy/ml_regresion
base_path = "/opt/spark/work-dir/data/"

83M	../../data/data_proy/ml_regresion


In [4]:
from diego_orozco.spark_utils import SparkUtils

df = spark.read.option("header", True).csv(base_path+"/data_proy/ml_regresion") # hay strings mezclados (por eso no uso mi schema)

                                                                                

In [5]:
df.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- video_id: string (nullable = true)
 |-- video_duration: string (nullable = true)
 |-- watch_time: string (nullable = true)
 |-- liked: string (nullable = true)
 |-- commented: string (nullable = true)
 |-- subscribed_after: string (nullable = true)
 |-- category: string (nullable = true)
 |-- device: string (nullable = true)
 |-- watch_time_of_day: string (nullable = true)
 |-- recommended: string (nullable = true)
 |-- clicked: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- watch_percent: string (nullable = true)



### Limpiar DF

In [6]:
from pyspark.sql.functions import col, when, regexp_replace


df = df.withColumn(
    "watch_percent",
    regexp_replace("watch_percent", "Infinity|-Infinity|NaN", None)
)

# columnas numéricas a float
df = df.withColumn("watch_time", col("watch_time").cast("float")) \
    .withColumn("video_duration", col("video_duration").cast("float")) \
    .withColumn("watch_percent", col("watch_percent").cast("float"))

df = df.fillna({"watch_percent": 0.0})

df = df.withColumn(
    "watch_time",
    when(col("watch_time") < 0, 0).otherwise(col("watch_time"))
)
df = df.withColumn(
    "video_duration",
    when(col("video_duration") <= 0, 1).otherwise(col("video_duration"))
)
df = df.withColumn(
    "watch_percent",
    when(col("watch_percent") <= 0, 1).otherwise(col("watch_percent"))
)


In [7]:
from pyspark.sql.functions import col, when, lower

df = df.withColumn("subscribed_after", col("subscribed_after").cast("string"))
df = df.withColumn("recommended", col("recommended").cast("string"))
df = df.withColumn("commented", col("commented").cast("string"))
df = df.withColumn("clicked", col("clicked").cast("string"))
df = df.withColumn("liked", col("liked").cast("string"))

df = df.withColumn(
    "commented",
    when(lower(col("commented")).isin("1", "yes"), 1)
    .when(lower(col("commented")).isin("0", "no"), 0)
    .otherwise(None)
)
df = df.withColumn(
    "subscribed_after",
    when(lower(col("subscribed_after")).isin("1", "yes"), 1)
    .when(lower(col("subscribed_after")).isin("0", "no"), 0)
    .otherwise(None)
)
df = df.withColumn(
    "recommended",
    when(lower(col("recommended")).isin("1", "yes"), 1)
    .when(lower(col("recommended")).isin("0", "no"), 0)
    .otherwise(None)
)

df = df.withColumn(
    "clicked",
    when(lower(col("clicked")).isin("1", "yes"), 1)
    .when(lower(col("clicked")).isin("0", "no"), 0)
    .otherwise(None)
)

df = df.withColumn(
    "liked",
    when(lower(col("liked")).isin("1", "yes"), 1)
    .when(lower(col("liked")).isin("0", "no"), 0)
    .otherwise(None)
)

df = df.filter(col("liked").isNotNull())

In [8]:
df = df.withColumn("device", lower(col("device")))
df = df.withColumn(
    "device",
    when(col("device").like("%mobil%"), "mobile")
    .when(col("device").like("%desk%"), "desktop")
    .when(col("device").like("%tabl%"), "tablet")
    .otherwise(col("device"))
)

df = df.withColumn("category", lower(col("category")))
df = df.withColumn(
    "category",
    when(col("category").like("music%"), "music")
    .when(col("category").like("educ%"), "education")
    .when(col("category").like("gaming%"), "gaming")
    .otherwise(col("category"))
)

### Código PySpark ver y corregir el balanceo del df

In [9]:
from pyspark.sql import functions as F

df.groupBy("liked").count().show()

df.groupBy("liked") \
  .agg((F.count("*") / df.count()).alias("percentage")) \
  .show()


                                                                                

+-----+------+
|liked| count|
+-----+------+
|    0|697143|
|    1|299802|
+-----+------+





+-----+-------------------+
|liked|         percentage|
+-----+-------------------+
|    0| 0.6992792982561726|
|    1|0.30072070174382737|
+-----+-------------------+



                                                                                

In [10]:
# Cantidad de positivos y negativos
counts = df.groupBy("liked").count().collect()
count_0 = [row['count'] for row in counts if row['liked'] == 0][0]
count_1 = [row['count'] for row in counts if row['liked'] == 1][0]

# Clase minoritaria
min_count = min(count_0, count_1)

# Submuestrear ambos grupos al tamaño de la clase minoritaria
df_0 = df.filter("liked = 0").sample(False, min_count / count_0)
df_1 = df.filter("liked = 1").sample(False, min_count / count_1)

# DataFrame balanceado
df = df_0.union(df_1)


                                                                                

In [11]:
df.groupBy("liked").count().show()

df.groupBy("liked") \
  .agg((F.count("*") / df.count()).alias("percentage")) \
  .show()

                                                                                

+-----+------+
|liked| count|
+-----+------+
|    0|299688|
|    1|299802|
+-----+------+





+-----+------------------+
|liked|        percentage|
+-----+------------------+
|    0|0.4999049191813041|
|    1|0.5000950808186959|
+-----+------------------+



                                                                                

### Columnas finales

In [12]:
cols_keep = ["watch_time","video_duration","watch_percent",
             "recommended","clicked","subscribed_after",
             "device","category","watch_time_of_day",
             "commented","liked"]
df = df.select(*[c for c in cols_keep if c in df.columns])
df.printSchema()
df.show(5, truncate=120)


root
 |-- watch_time: double (nullable = true)
 |-- video_duration: double (nullable = true)
 |-- watch_percent: double (nullable = false)
 |-- recommended: integer (nullable = true)
 |-- clicked: integer (nullable = true)
 |-- subscribed_after: integer (nullable = true)
 |-- device: string (nullable = true)
 |-- category: string (nullable = true)
 |-- watch_time_of_day: string (nullable = true)
 |-- commented: integer (nullable = true)
 |-- liked: integer (nullable = true)

+----------+--------------+-------------+-----------+-------+----------------+-------+---------+-----------------+---------+-----+
|watch_time|video_duration|watch_percent|recommended|clicked|subscribed_after| device| category|watch_time_of_day|commented|liked|
+----------+--------------+-------------+-----------+-------+----------------+-------+---------+-----------------+---------+-----+
|     389.0|         389.0|          1.0|          0|      0|               0|desktop|   sports|        Afternoon|        0|   

## ENCODING de variables categóricas

In [13]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

categorical_cols = ["device", "category", "watch_time_of_day"]

indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    for c in categorical_cols
]

encoder = OneHotEncoder(
    inputCols=[c + "_idx" for c in categorical_cols],
    outputCols=[c + "_ohe" for c in categorical_cols]
)

numeric_features = [
    "watch_time", "video_duration", "watch_percent",
    "recommended", "clicked", "subscribed_after","commented"
]

### Assemble the features into a single vector column

In [14]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline


assembler_inputs = numeric_features + [c + "_ohe" for c in categorical_cols]

assembler = VectorAssembler( inputCols=assembler_inputs, outputCol="features", handleInvalid="keep" )

pipeline = Pipeline(stages=indexers + [encoder] + [assembler])

pipeline_model = pipeline.fit(df)
df_transformed = pipeline_model.transform(df)

                                                                                

# Data splitting
#### 80% training data and 20% testing data

In [15]:
train_df, test_df = df_transformed.randomSplit([0.8, 0.2], seed=42)


### Show dataset (for debugging)

In [16]:
print("Original Dataset")
df.show()

# Print train dataset
print("train set")
train_df.show()

Original Dataset
+----------+--------------+-------------+-----------+-------+----------------+-------+---------+-----------------+---------+-----+
|watch_time|video_duration|watch_percent|recommended|clicked|subscribed_after| device| category|watch_time_of_day|commented|liked|
+----------+--------------+-------------+-----------+-------+----------------+-------+---------+-----------------+---------+-----+
|     389.0|         389.0|          1.0|          0|      0|               0|desktop|   sports|        Afternoon|        0|    0|
|     962.0|         962.0|          1.0|          0|      0|               0|desktop|     news|            Night|        0|    0|
|    1710.0|        1710.0|          1.0|          1|      0|               0|desktop|   gaming|        Afternoon|        0|    0|
|     986.0|        1396.0|          1.0|          1|      0|               0|     tv|lifestyle|        Afternoon|        0|    0|
|    1504.0|        2793.0|          1.0|          1|      0|     

[Stage 42:>                                                         (0 + 1) / 1]

+----------+--------------+-------------+-----------+-------+----------------+-------+---------+-----------------+---------+-----+----------+------------+---------------------+-------------+--------------+---------------------+--------------------+
|watch_time|video_duration|watch_percent|recommended|clicked|subscribed_after| device| category|watch_time_of_day|commented|liked|device_idx|category_idx|watch_time_of_day_idx|   device_ohe|  category_ohe|watch_time_of_day_ohe|            features|
+----------+--------------+-------------+-----------+-------+----------------+-------+---------+-----------------+---------+-----+----------+------------+---------------------+-------------+--------------+---------------------+--------------------+
|       0.0|      100000.0|          1.0|          0|      0|               0|desktop|   comedy|          Evening|        0|    0|       2.0|         1.0|                  0.0|(4,[2],[1.0])|(10,[1],[1.0])|        (4,[0],[1.0])|(25,[1,2,9,12,21]...|
|   

                                                                                

# Create ML Model

In [17]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    inputCol="features",
    outputCol="scaled_features",
    withMean=True,
    withStd=True
)

scaler_model = scaler.fit(train_df)
train_scaled = scaler_model.transform(train_df)
test_scaled  = scaler_model.transform(test_df)


                                                                                

In [18]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(
    featuresCol="scaled_features",
    labelCol="liked"
)


# Train ML Model

In [19]:
lr_model = lr.fit(train_scaled)

# Print coefficients
print("Coefficients:", lr_model.coefficients)
print("Intercept:", lr_model.intercept)

# Display model summary
training_summary = lr_model.summary

25/11/22 22:52:20 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
25/11/22 22:52:34 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
                                                                                

Coefficients: [0.0028860835682247357,0.012991825620190436,0.0,0.00023686316036718375,0.0015476308741016198,0.001631676094180211,0.006303280443736069,-0.0006228164725007162,0.0004996278817889857,0.0002669912377043324,-0.00014317732859853995,0.004268190606815065,-0.0003553710641351675,-0.003276011244396921,0.0013998257022772495,-0.0012413634005454832,0.0008263627955438581,-0.0025798110100449067,-0.00046159394956371287,0.004583213987423227,0.008973123972715812,0.00042249219894984765,-0.0005521291595423634,0.001047551221454139,-0.00091672174002509]
Intercept: 0.0004574609931941647


## Predictions

In [20]:
predictions = lr_model.transform(test_scaled)

predictions.select("liked", "prediction", "probability").show(20, False)


[Stage 58:>                                                         (0 + 1) / 1]

+-----+----------+----------------------------------------+
|liked|prediction|probability                             |
+-----+----------+----------------------------------------+
|0    |1.0       |[0.36900792709623526,0.6309920729037648]|
|0    |1.0       |[0.36782690569589405,0.632173094304106] |
|0    |1.0       |[0.31225468022867225,0.6877453197713277]|
|0    |1.0       |[0.3635568432748836,0.6364431567251164] |
|0    |1.0       |[0.3683961170493194,0.6316038829506806] |
|0    |0.0       |[0.5011703944523913,0.4988296055476087] |
|0    |0.0       |[0.5004033953517848,0.4995966046482152] |
|0    |1.0       |[0.49901046358189133,0.5009895364181087]|
|0    |1.0       |[0.49810736996315447,0.5018926300368456]|
|0    |0.0       |[0.5025113328646585,0.49748866713534146]|
|0    |0.0       |[0.5005813286836277,0.4994186713163723] |
|0    |1.0       |[0.49764395140480033,0.5023560485951997]|
|0    |0.0       |[0.501798528756852,0.49820147124314795] |
|0    |0.0       |[0.5031235644078774,0.

                                                                                

In [21]:
predictions.groupBy("prediction").count().show()




+----------+-----+
|prediction|count|
+----------+-----+
|       0.0|65016|
|       1.0|54896|
+----------+-----+



                                                                                

In [22]:
train_scaled.groupBy("liked").count().show()




+-----+------+
|liked| count|
+-----+------+
|    0|239736|
|    1|239842|
+-----+------+



                                                                                

In [23]:
predictions.printSchema()

root
 |-- watch_time: double (nullable = true)
 |-- video_duration: double (nullable = true)
 |-- watch_percent: double (nullable = false)
 |-- recommended: integer (nullable = true)
 |-- clicked: integer (nullable = true)
 |-- subscribed_after: integer (nullable = true)
 |-- device: string (nullable = true)
 |-- category: string (nullable = true)
 |-- watch_time_of_day: string (nullable = true)
 |-- commented: integer (nullable = true)
 |-- liked: integer (nullable = true)
 |-- device_idx: double (nullable = false)
 |-- category_idx: double (nullable = false)
 |-- watch_time_of_day_idx: double (nullable = false)
 |-- device_ohe: vector (nullable = true)
 |-- category_ohe: vector (nullable = true)
 |-- watch_time_of_day_ohe: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- scaled_features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



# Test ML Model

In [24]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="liked",
                            predictionCol="prediction")

accuracy = evaluator.evaluate(predictions, 
                  {evaluator.metricName: "accuracy"})
print(f"Accuracy: {accuracy}")
precision = evaluator.evaluate(predictions,
                  {evaluator.metricName: "weightedPrecision"})
print(f"Precision: {precision}")
recall = evaluator.evaluate(predictions,
                  {evaluator.metricName: "weightedRecall"})
print(f"Recall: {recall}")
f1 = evaluator.evaluate(predictions,
                {evaluator.metricName: "f1"})
print(f"F1 Score: {f1}")  

                                                                                

Accuracy: 0.5002668623657349


                                                                                

Precision: 0.5002716159314518


                                                                                

Recall: 0.5002668623657348




F1 Score: 0.4993754340927429


                                                                                

In [25]:
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

assembler_corr = VectorAssembler(
    inputCols=["watch_time","video_duration","watch_percent","recommended","clicked","subscribed_after"],
    outputCol="corr_features"
)

df_corr = assembler_corr.transform(df)

corr_matrix = Correlation.corr(df_corr, "corr_features").head()[0]
print(corr_matrix)




DenseMatrix([[ 1.00000000e+00,  6.92453642e-01,             nan,
               7.41644505e-04, -1.23716731e-03,  5.94209803e-04],
             [ 6.92453642e-01,  1.00000000e+00,             nan,
               1.03302358e-03, -1.29672085e-03,  4.67204024e-05],
             [            nan,             nan,  1.00000000e+00,
                          nan,             nan,             nan],
             [ 7.41644505e-04,  1.03302358e-03,             nan,
               1.00000000e+00,  1.31173328e-04,  8.30501903e-04],
             [-1.23716731e-03, -1.29672085e-03,             nan,
               1.31173328e-04,  1.00000000e+00,  4.27227898e-04],
             [ 5.94209803e-04,  4.67204024e-05,             nan,
               8.30501903e-04,  4.27227898e-04,  1.00000000e+00]])


25/11/22 22:54:02 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.


In [26]:
sc.stop()