# Big Data Pipeline for Netflix Data

|                |   |
|:---------------|---|
| **Team**     | Foraneos  |
| **Date**      | 03/10/2025  |
| **Lab** | 06  |

## Problem Statement

In teams, write a Jupyter Notebook (in the directory
spark_cluster/notebooks/labs/lab06) that cleans up the Netflix dataset and persists it.
To do so you need:<br>
- Data Ingestion. Download and uncompress the dataset and move it to the
spark_cluster/data directory. <br><br>
- Compute. Add the code needed to remove all null values from the Netflix dataset. You
need to create two methods (clean_df and write_df) as part of your
spark_utils module.<br><br>
- Store. Persist the dataframe, using release_year as the criterion to partition the data.
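
The ingestion step can also be scripted from Python. The following is a minimal sketch, assuming the dataset was downloaded as a zip archive; the function name `extract_dataset` and the paths in the usage comment are hypothetical and not part of the lab code.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, target_dir):
    """Extract a zip archive into target_dir and return the extracted paths."""
    target = Path(target_dir)
    # Create the data directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    return sorted(target / name for name in names)


# Hypothetical usage (archive name and target directory are assumptions):
# extract_dataset("netflix_dataset.zip", "spark_cluster/data/netflix_dataset")
```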


In [25]:
import findspark
findspark.init()

In [26]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkSQL-Transformations-Actions") \
    .master("spark://0638c7435d1d:7077") \
    .config("spark.ui.port","4040") \
    .getOrCreate()
sc = spark.sparkContext

### Loading the Netflix dataset

In [27]:
#import importlib
from foraneos.spark_utils import SparkUtils as SpU

#importlib.reload(SpU)


netflix_schema = SpU.generate_schema([
    ("show_id", "string", True),
    ("type", "string", True),
    ("title", "string", True),
    ("director", "string", False),
    ("country", "string", False),
    ("date_added", "date", True),
    ("release_year", "int", False),
    ("rating", "string", False),
    ("duration", "string", False),
    ("listed_in", "string", False),
])

netflix_df = spark.read \
                .schema(netflix_schema) \
                .option("header", "true") \
                .csv("/home/jovyan/notebooks/data/netflix_dataset/netflix1.csv")

netflix_df.printSchema()

netflix_df.show(10, truncate=False)

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: date (nullable = true)
 |-- release_year: integer (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
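
`generate_schema` is part of the team's spark_utils module and is not shown in this notebook. Note that the printed schema reports `nullable = true` for every column even though several were declared with `False`, which is consistent with Spark treating columns read from files as nullable. As a rough illustration of what a schema helper does with the `(name, type, nullable)` tuples, the sketch below turns them into a DDL string, a form that `spark.read.schema(...)` also accepts. The name `schema_to_ddl` is hypothetical and this is not the actual `SpU` implementation.

```python
def schema_to_ddl(columns):
    """Build a Spark DDL schema string from (name, type, nullable) tuples."""
    parts = []
    for name, type_name, nullable in columns:
        # Backticks keep column names with unusual characters valid.
        column = f"`{name}` {type_name.upper()}"
        if not nullable:
            column += " NOT NULL"
        parts.append(column)
    return ", ".join(parts)


ddl = schema_to_ddl([
    ("show_id", "string", True),
    ("release_year", "int", False),
])
print(ddl)
```

The resulting string could then be passed in place of a `StructType`, for example `spark.read.schema(ddl)`.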



+-------+-------+--------------------------------+-------------------+--------------+----------+------------+------+---------+-------------------------------------------------------------+
|show_id|type   |title                           |director           |country       |date_added|release_year|rating|duration |listed_in                                                    |
+-------+-------+--------------------------------+-------------------+--------------+----------+------------+------+---------+-------------------------------------------------------------+
|s1     |Movie  |Dick Johnson Is Dead            |Kirsten Johnson    |United States |NULL      |2020        |PG-13 |90 min   |Documentaries                                                |
|s3     |TV Show|Ganglands                       |Julien Leclercq    |France        |NULL      |2021        |TV-MA |1 Season |Crime TV Shows, International TV Shows, TV Action & Adventure|
|s6     |TV Show|Midnight Mass                   |Mike 
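
In the rows shown above, `date_added` is NULL. With the CSV reader this typically happens when the text in the file does not match the date pattern the reader expects. Assuming the file stores dates in a month/day/year form such as `9/25/2021` (an assumption that has not been verified against the file), passing a `dateFormat` option may recover the values. The helper name `read_netflix_csv` is hypothetical; it takes the session as a parameter so it does not depend on notebook state.

```python
def read_netflix_csv(spark, path, schema, date_format="M/d/yyyy"):
    """Read a headered CSV with an explicit schema and date pattern.

    date_format is an assumption about how date_added is written in the
    file; adjust it after inspecting the raw values.
    """
    return (
        spark.read
        .schema(schema)
        .option("header", "true")
        .option("dateFormat", date_format)
        .csv(path)
    )
```

If the dates parse correctly, the later null replacement would only touch rows where `date_added` is genuinely missing.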

                                                                                

### Cleaning null values in the dataframe

In [28]:
correction_dict={"director": "unknown", 
                 "country": "unknown", 
                 "date_added": "1800-01-01", 
                 "release_year": 1800, 
                 "rating": "unknown", 
                 "duration": "unknown", 
                 "listed_in": "unknown"
                }

netflix_nn = SpU.clean_df(netflix_df, correction_dict)
netflix_nn.show(n=10, truncate=False)

+-------+-------+--------------------------------+-------------------+--------------+----------+------------+------+---------+-------------------------------------------------------------+
|show_id|type   |title                           |director           |country       |date_added|release_year|rating|duration |listed_in                                                    |
+-------+-------+--------------------------------+-------------------+--------------+----------+------------+------+---------+-------------------------------------------------------------+
|s1     |Movie  |Dick Johnson Is Dead            |Kirsten Johnson    |United States |1800-01-01|2020        |PG-13 |90 min   |Documentaries                                                |
|s3     |TV Show|Ganglands                       |Julien Leclercq    |France        |1800-01-01|2021        |TV-MA |1 Season |Crime TV Shows, International TV Shows, TV Action & Adventure|
|s6     |TV Show|Midnight Mass                   |Mike 
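
`clean_df` lives in the spark_utils module and its body is not shown here. One way such a method could be written is sketched below, assuming it delegates to `DataFrame.fillna` with a column-to-value dictionary. This is an illustration, not the team's actual implementation; the check for unknown column names is an addition made for the example.

```python
def clean_df(df, corrections):
    """Replace nulls using a {column: replacement} mapping.

    Sketch only: assumes df is a PySpark DataFrame, whose fillna
    accepts a dict of per-column replacement values.
    """
    # Fail early on typos in column names instead of skipping them.
    unknown = sorted(set(corrections) - set(df.columns))
    if unknown:
        raise ValueError(f"columns not in dataframe: {unknown}")
    return df.fillna(corrections)
```

Because `fillna` returns a new dataframe, the original `netflix_df` is left unchanged and the cleaned result has to be kept in its own variable, as the cell above does with `netflix_nn`.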

### Saving the modified Netflix dataset

In [29]:
params = {
    "mode": "overwrite",
    "dataframe": netflix_nn,  # write the cleaned dataframe, not the raw one
    "path": "/home/jovyan/notebooks/data/netflix_output/",
    "criteria": ["release_year"],
}
SpU.write_df(params)
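
`write_df` receives a single `params` dictionary. The sketch below shows how such a method might use the keys from the cell above. It assumes Parquet as the output format, which the notebook does not state, and it is not the actual spark_utils code.

```python
def write_df(params):
    """Write params["dataframe"] to params["path"], partitioned by params["criteria"].

    Sketch only: Parquet is an assumed output format.
    """
    # "errorifexists" is Spark's default save mode; the caller opts in to overwrite.
    writer = params["dataframe"].write.mode(params.get("mode", "errorifexists"))
    criteria = params.get("criteria") or []
    if criteria:
        # Each distinct value combination becomes its own sub-directory.
        writer = writer.partitionBy(*criteria)
    writer.parquet(params["path"])
```

With `criteria` set to `["release_year"]`, a partitioned write produces one sub-directory per distinct year, named like `release_year=2020`.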

                                                                                

### Checking the number of output files

In [30]:
!ls notebooks/data/netflix_output/ | wc -l

76
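
The `ls | wc -l` count includes every entry in the output directory, which for a Spark job may include marker files such as `_SUCCESS` in addition to the partition directories. A small sketch that counts only the partition directories follows; the function name `count_partitions` is hypothetical.

```python
from pathlib import Path


def count_partitions(output_dir, column="release_year"):
    """Count sub-directories named like '<column>=<value>' in output_dir."""
    prefix = f"{column}="
    return sum(
        1
        for entry in Path(output_dir).iterdir()
        if entry.is_dir() and entry.name.startswith(prefix)
    )


# Hypothetical usage against the output path used in this notebook:
# count_partitions("/home/jovyan/notebooks/data/netflix_output/")
```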


In [31]:
# Stop the SparkContext
sc.stop()