# <center> <img src="../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> **Program: Computer Systems Engineering** </center>
---
### <center> **Spring 2025** </center>
---

**Lab 06**: Big Data Pipeline for Netflix data

**Date**: March 7, 2025

**Student Names**: Marco Albanese, Vicente Siloe

**Professor**: Pablo Camarillo Ramirez

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Big Data Pipeline for Netflix data") \
    .master("spark://bfb6d658c7db:7077") \
    .config("spark.ui.port","4040") \
    .getOrCreate()

# Create SparkContext
sc = spark.sparkContext
sc.setLogLevel("ERROR")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/09 07:28:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Problem Statement

In teams, write a Jupyter Notebook (within the directory spark_cluster/notebooks/labs/lab06) to clean up the Netflix dataset and persist it. To do so, you need to complete the following steps:

- **Data Ingestion:** Download and uncompress the dataset and move it to the **spark_cluster/data** directory.
- **Compute:** Add the code needed to remove all null values from the Netflix dataset. You need to create two methods (clean_df and write_df) as part of your **spark_utils** module.
- **Store:** Persist the dataframe, using **release_year** as the partitioning criterion.
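
The two helper methods requested above are not shown in this notebook. As a minimal sketch, assuming that `clean_df` drops every row containing at least one null and that `write_df` writes Parquet in overwrite mode (both are assumptions; the class name `SparkUtilsSketch` is hypothetical and is not the team's `spark_utils` module), they could look like this:

```python
class SparkUtilsSketch:
    """Hypothetical stand-in for the team's SparkUtils helpers."""

    @staticmethod
    def clean_df(df):
        # Assumption: "remove all null values" means dropping every row
        # that has a null in any column.
        return df.dropna(how="any")

    @staticmethod
    def write_df(df, partition_col, path):
        # Assumption: Parquet output, replacing data from any previous run.
        # partitionBy creates one subdirectory per distinct value,
        # e.g. <path>/release_year=2020/.
        df.write.mode("overwrite").partitionBy(partition_col).parquet(path)
```

The argument order of `write_df` mirrors the call used later in this notebook: dataframe, partition column, output path.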


In [3]:
from equipo_mcqueen.spark_utils import SparkUtils

# Define the data schema for the netflix.csv file
netflix_data = [
    ("show_id", "StringType"),
    ("type", "StringType"),
    ("title", "StringType"),
    ("director", "StringType"),
    ("country", "StringType"),
    ("date_added", "StringType"),
    ("release_year", "IntegerType"),
    ("rating", "StringType"),
    ("duration", "StringType"),
    ("listed_in", "StringType")
]

# Create the schema using our generate_schema method
netflix_schema = SparkUtils.generate_schema(netflix_data)

# Read the netflix.csv file
netflix_df = spark.read.schema(netflix_schema).option("header", "true").csv("/home/jovyan/notebooks/data/netflix.csv")
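
`generate_schema` belongs to the team's module and its implementation is not shown here. As an alternative sketch (not the team's method), the same `(name, type)` list can be turned into a DDL-formatted string, which `spark.read.schema(...)` also accepts. The helper name `to_ddl` and the type mapping below are assumptions and cover only the two types used in this lab:

```python
# Hypothetical mapping from the type names used above to Spark SQL DDL types.
_DDL_TYPES = {"StringType": "STRING", "IntegerType": "INT"}

def to_ddl(columns):
    """Build a DDL schema string from (column_name, type_name) pairs."""
    parts = []
    for name, type_name in columns:
        if type_name not in _DDL_TYPES:
            raise ValueError(f"Unsupported type name: {type_name}")
        # Backticks guard column names such as `type` that are also SQL keywords.
        parts.append(f"`{name}` {_DDL_TYPES[type_name]}")
    return ", ".join(parts)

ddl = to_ddl([("show_id", "StringType"), ("release_year", "IntegerType")])
print(ddl)  # `show_id` STRING, `release_year` INT
```

The resulting string could then be passed as `spark.read.schema(ddl)` in place of the `StructType`.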

In [4]:
# The following setting lets the 'date_added' column be converted from MM/dd/yyyy to yyyy-MM-dd.
# Since Spark 3.0, the time parser is stricter when converting a string to a date (DateType)
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

# Import the date conversion functions
from pyspark.sql.functions import to_date, date_format

# Convert the 'date_added' column from MM/dd/yyyy to yyyy-MM-dd (the ISO format that a plain cast to DateType expects)
# Adapted from https://stackoverflow.com/questions/74007217/converting-string-type-date-values-to-date-format-in-pyspark
netflix_df = netflix_df.withColumn("date_added", date_format(to_date(netflix_df.date_added, "MM/dd/yyyy"), "yyyy-MM-dd"))

# Show the dataframe to verify the changes
netflix_df.show(5, truncate=False)

                                                                                

+-------+-------+--------------------------------+---------------+-------------+----------+------------+------+--------+-------------------------------------------------------------+
|show_id|type   |title                           |director       |country      |date_added|release_year|rating|duration|listed_in                                                    |
+-------+-------+--------------------------------+---------------+-------------+----------+------------+------+--------+-------------------------------------------------------------+
|s1     |Movie  |Dick Johnson Is Dead            |Kirsten Johnson|United States|2021-09-25|2020        |PG-13 |90 min  |Documentaries                                                |
|s3     |TV Show|Ganglands                       |Julien Leclercq|France       |2021-09-24|2021        |TV-MA |1 Season|Crime TV Shows, International TV Shows, TV Action & Adventure|
|s6     |TV Show|Midnight Mass                   |Mike Flanagan  |United States|2021-
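
To sanity-check what the conversion above should produce for a given value, the same MM/dd/yyyy to yyyy-MM-dd transformation can be reproduced outside Spark with Python's standard library. This is only an illustrative sketch; the helper name `to_iso_date` is an assumption, and note that `strptime` also accepts single-digit months and days:

```python
from datetime import datetime

def to_iso_date(value):
    """Convert an 'MM/dd/yyyy' string to 'yyyy-MM-dd'; return None if it cannot be parsed."""
    try:
        return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")
    except (TypeError, ValueError):
        return None

print(to_iso_date("09/25/2021"))  # 2021-09-25
print(to_iso_date("not a date"))  # None
```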

In [5]:
# Print the schema
netflix_df.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: integer (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)



In [6]:
# date_added is still a string; it must be converted to DateType
from pyspark.sql.types import DateType

netflix_df = netflix_df.withColumn("date_added", netflix_df["date_added"].cast(DateType()))

# Print the schema to verify the changes
netflix_df.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: date (nullable = true)
 |-- release_year: integer (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)



In [7]:
# Note: this import shadows Python's built-in sum for the rest of the notebook
from pyspark.sql.functions import col, sum

# Null counts before cleaning
netflix_df.select([sum(col(c).isNull().cast("int")).alias(c) for c in netflix_df.columns]).show()

+-------+----+-----+--------+-------+----------+------------+------+--------+---------+
|show_id|type|title|director|country|date_added|release_year|rating|duration|listed_in|
+-------+----+-----+--------+-------+----------+------------+------+--------+---------+
|      0|   0|    0|       1|      1|         2|           2|     1|       2|        2|
+-------+----+-----+--------+-------+----------+------------+------+--------+---------+
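
The expression in the cell above flags each null as 1 and each non-null as 0, then sums the flags per column. The same idea is sketched below in plain Python on a few made-up rows (the rows and the helper name `null_counts` are illustrative assumptions, not data from the Netflix file). `builtins.sum` is called explicitly because the notebook imports `sum` from `pyspark.sql.functions`:

```python
import builtins

def null_counts(rows, columns):
    """Count None values per column in a list of dict rows."""
    return {
        c: builtins.sum(int(row.get(c) is None) for row in rows)
        for c in columns
    }

# Made-up sample rows, for illustration only.
sample_rows = [
    {"show_id": "a1", "director": "Jane Doe", "release_year": 2020},
    {"show_id": "a2", "director": None, "release_year": None},
    {"show_id": "a3", "director": None, "release_year": 2019},
]

print(null_counts(sample_rows, ["show_id", "director", "release_year"]))
# {'show_id': 0, 'director': 2, 'release_year': 1}
```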



                                                                                

In [8]:
# Remove null values from the dataset
clean_netflix_df = SparkUtils.clean_df(netflix_df)

In [9]:
# Null counts after cleaning
clean_netflix_df.select([sum(col(c).isNull().cast("int")).alias(c) for c in clean_netflix_df.columns]).show()

+-------+----+-----+--------+-------+----------+------------+------+--------+---------+
|show_id|type|title|director|country|date_added|release_year|rating|duration|listed_in|
+-------+----+-----+--------+-------+----------+------------+------+--------+---------+
|      0|   0|    0|       0|      0|         0|           0|     0|       0|        0|
+-------+----+-----+--------+-------+----------+------------+------+--------+---------+



In [None]:
from equipo_mcqueen.spark_utils import SparkUtils

# Save the dataframe
SparkUtils.write_df(clean_netflix_df, "release_year", "/home/jovyan/notebooks/data/netflix_output")

                                                                                

In [12]:
!ls notebooks/data/netflix_output/

'release_year=1925'  'release_year=1973'  'release_year=1998'
'release_year=1942'  'release_year=1974'  'release_year=1999'
'release_year=1943'  'release_year=1975'  'release_year=2000'
'release_year=1944'  'release_year=1976'  'release_year=2001'
'release_year=1945'  'release_year=1977'  'release_year=2002'
'release_year=1946'  'release_year=1978'  'release_year=2003'
'release_year=1947'  'release_year=1979'  'release_year=2004'
'release_year=1954'  'release_year=1980'  'release_year=2005'
'release_year=1955'  'release_year=1981'  'release_year=2006'
'release_year=1956'  'release_year=1982'  'release_year=2007'
'release_year=1958'  'release_year=1983'  'release_year=2008'
'release_year=1959'  'release_year=1984'  'release_year=2009'
'release_year=1960'  'release_year=1985'  'release_year=2010'
'release_year=1961'  'release_year=1986'  'release_year=2011'
'release_year=1962'  'release_year=1987'  'release_year=2012'
'release_year=1963'  'release_year=1988'  'release_year=2013'
'release

In [13]:
!ls notebooks/data/netflix_output/ | wc -l

75
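
Each entry listed above follows the Hive-style `column=value` directory naming that partitioned writes produce. As a small sketch, the partition values can be recovered from the directory names with the standard library; the helper name `partition_values` is an assumption, and it assumes the values are integers. Filtering by prefix also excludes any non-partition entries (for example, a `_SUCCESS` marker file, if one is present), which a plain `ls | wc -l` would count.

```python
from pathlib import Path

def partition_values(output_dir, column):
    """Return the sorted integer values found in '<column>=<value>' subdirectories."""
    prefix = f"{column}="
    values = []
    for entry in Path(output_dir).iterdir():
        # Only directories named like 'release_year=2020' are partitions.
        if entry.is_dir() and entry.name.startswith(prefix):
            values.append(int(entry.name[len(prefix):]))
    return sorted(values)
```

For example, `len(partition_values("notebooks/data/netflix_output", "release_year"))` would give the number of distinct release years written, assuming the relative path used in the cell above.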
