### **Step 2.3 - Start the data transformation logic**

1. Start the data transformation logic

    *   Identify and remove duplicate rows
    *   Replace null values
    *   Create new columns
    *   Change data types
    *   Rename columns

#### Identify and remove duplicate rows



In [None]:
%%pyspark
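from pyspark.sql.functions import col, count, round, to_date, when
# Group by every column and show the rows that appear more than once
source_df.groupBy(source_df.columns).count().filter(col('count')>1).show(truncate=False)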

+----------------------------------------+-------------------+-----------+-------+----------+--------+------+----------+-----+
|Title                                   |Genre              |ReleaseDate|Runtime|IMDB Score|Language|Views |AddedDate |count|
+----------------------------------------+-------------------+-----------+-------+----------+--------+------+----------+-----+
|Caught by a Wave                        |Romantic teen drama|25-03-2021 |99     |5.7       |Italian |246360|2023-01-21|2    |
|Point Blank                             |Action             |12-07-2019 |86     |5.7       |English |243230|2023-01-21|2    |
|When We First Met                       |Romantic comedy    |09-02-2018 |97     |6.4       |English |300803|2023-01-21|2    |
|Good Sam                                |Drama              |16-05-2019 |89     |5.7       |English |442082|2023-01-21|2    |
|Squared Love                            |Romantic comedy    |11-02-2021 |102    |5.0       |Polish  |3739  |20

In [None]:
%%pyspark
df_nodups = source_df

In [None]:
%%pyspark
# Drop rows that are identical in every column
df_nodups = df_nodups.drop_duplicates()

In [None]:
%%pyspark
# Re-run the duplicate check; the result should now be empty
df_nodups.groupBy(source_df.columns).count().filter(col('count')>1).show(truncate=False)

+-----+-----+-----------+-------+----------+--------+-----+---------+-----+
|Title|Genre|ReleaseDate|Runtime|IMDB Score|Language|Views|AddedDate|count|
+-----+-----+-----------+-------+----------+--------+-----+---------+-----+
+-----+-----+-----------+-------+----------+--------+-----+---------+-----+
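
The check-then-drop pattern above can be mirrored in plain Python to make its semantics concrete. This is only an illustrative sketch, not Spark code: `rows` is a hypothetical three-row sample (trimmed to three columns, reusing titles from the output above), and `Counter` stands in for `groupBy(...).count()`.

```python
from collections import Counter

# Hypothetical sample: tuples stand in for DataFrame rows (Title, Genre, Language).
rows = [
    ("Point Blank", "Action", "English"),
    ("Good Sam", "Drama", "English"),
    ("Point Blank", "Action", "English"),  # exact duplicate of the first row
]

# Equivalent of groupBy(all columns).count().filter(count > 1)
counts = Counter(rows)
duplicates = {row: n for row, n in counts.items() if n > 1}
print(duplicates)

# Equivalent of drop_duplicates(): keep one copy of each distinct row.
# dict.fromkeys preserves first-seen order; Spark gives no ordering guarantee.
deduped = list(dict.fromkeys(rows))
print(deduped)

# Re-running the check on the deduplicated rows finds nothing.
remaining = {row: n for row, n in Counter(deduped).items() if n > 1}
print(remaining)
```

Because the grouping key is the full row, two rows count as duplicates only when every column matches.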



In [None]:
%%pyspark
# Count the null values in each column
df_nodups.select([count(when(col(c).isNull(),c)).alias(c) for c in df_nodups.columns]).show()

+-----+-----+-----------+-------+----------+--------+-----+---------+
|Title|Genre|ReleaseDate|Runtime|IMDB Score|Language|Views|AddedDate|
+-----+-----+-----------+-------+----------+--------+-----+---------+
|    1|    2|          1|      1|         1|      16|    1|        1|
+-----+-----+-----------+-------+----------+--------+-----+---------+



In [None]:
%%pyspark

# Drop a row only if all of its columns are null
df_nodups = df_nodups.na.drop(how='all')


In [None]:
%%pyspark
df_nodups.select([count(when(col(c).isNull(),c)).alias(c) for c in df_nodups.columns]).show()

+-----+-----+-----------+-------+----------+--------+-----+---------+
|Title|Genre|ReleaseDate|Runtime|IMDB Score|Language|Views|AddedDate|
+-----+-----+-----------+-------+----------+--------+-----+---------+
|    0|    1|          0|      0|         0|      15|    0|        0|
+-----+-----+-----------+-------+----------+--------+-----+---------+



#### Replace null values

In [None]:
%%pyspark
# A string fill value only replaces nulls in string columns; other column types are left unchanged
df_nonulls = df_nodups.na.fill(value="Unknown")

In [None]:
%%pyspark
df_nonulls.filter(col('Language').isNull()).show()

+-----+-----+-----------+-------+----------+--------+-----+---------+
|Title|Genre|ReleaseDate|Runtime|IMDB Score|Language|Views|AddedDate|
+-----+-----+-----------+-------+----------+--------+-----+---------+
+-----+-----+-----------+-------+----------+--------+-----+---------+



In [None]:
%%pyspark
df_nonulls.filter(col('Genre').isNull()).show()

+-----+-----+-----------+-------+----------+--------+-----+---------+
|Title|Genre|ReleaseDate|Runtime|IMDB Score|Language|Views|AddedDate|
+-----+-----+-----------+-------+----------+--------+-----+---------+
+-----+-----+-----------+-------+----------+--------+-----+---------+



In [None]:
%%pyspark
df_nonulls.select([count(when(col(c).isNull(),c)).alias(c) for c in df_nonulls.columns]).show()

+-----+-----+-----------+-------+----------+--------+-----+---------+
|Title|Genre|ReleaseDate|Runtime|IMDB Score|Language|Views|AddedDate|
+-----+-----+-----------+-------+----------+--------+-----+---------+
|    0|    0|          0|      0|         0|       0|    0|        0|
+-----+-----+-----------+-------+----------+--------+-----+---------+
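
As a rough illustration of the two null-handling steps, the following plain-Python sketch mimics `na.drop('all')` and `na.fill(value="Unknown")` on a hypothetical list of dictionaries. It is not Spark code: `drop_all_null` and `fill_strings` are made-up helper names, and the set of string columns is passed explicitly because plain dictionaries carry no schema. It shows why a string fill value leaves numeric columns untouched.

```python
from typing import Any, Dict, List, Set

Row = Dict[str, Any]

def drop_all_null(rows: List[Row]) -> List[Row]:
    """Keep a row unless every one of its values is None (like na.drop('all'))."""
    return [r for r in rows if any(v is not None for v in r.values())]

def fill_strings(rows: List[Row], string_cols: Set[str], value: str) -> List[Row]:
    """Replace None with `value` in string columns only (like na.fill with a string)."""
    return [
        {k: (value if v is None and k in string_cols else v) for k, v in r.items()}
        for r in rows
    ]

# Hypothetical sample rows
rows = [
    {"Title": "His House", "Language": None, "Views": 352801},
    {"Title": None, "Language": None, "Views": None},  # entirely null: dropped
    {"Title": "Good Sam", "Language": "English", "Views": None},
]

cleaned = fill_strings(drop_all_null(rows), {"Title", "Language"}, "Unknown")
for row in cleaned:
    print(row)
```

In this dataset the numeric columns had no nulls left after the all-null row was dropped, so the string fill was enough to bring every count to zero.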



In [None]:
%%pyspark
df_nonulls.filter(col('Genre') == "Unknown").show(truncate=False)

+----------------------------+-------+-----------+-------+----------+--------+------+----------+
|Title                       |Genre  |ReleaseDate|Runtime|IMDB Score|Language|Views |AddedDate |
+----------------------------+-------+-----------+-------+----------+--------+------+----------+
|Between Two Ferns: The Movie|Unknown|20-09-2019 |82     |6.1       |English |283445|2023-01-28|
+----------------------------+-------+-----------+-------+----------+--------+------+----------+



In [None]:
%%pyspark
df_nonulls.filter(col('Language') == "Unknown").show(truncate=False)

+------------------------------------------+-------------------+-----------+-------+----------+--------+-------+----------+
|Title                                     |Genre              |ReleaseDate|Runtime|IMDB Score|Language|Views  |AddedDate |
+------------------------------------------+-------------------+-----------+-------+----------+--------+-------+----------+
|Bomb Scared                               |Black comedy       |12-10-2017 |89     |5.6       |Unknown |255483 |2023-01-21|
|Space Sweepers                            |Science fiction    |05-02-2021 |136    |6.6       |Unknown |327838 |2023-02-03|
|His House                                 |Thriller           |30-10-2020 |93     |6.5       |Unknown |352801 |2023-02-03|
|Mowgli: Legend of the Jungle              |Adventure          |07-12-2018 |104    |6.5       |Unknown |115    |2023-01-21|
|Hot Girls Wanted                          |Documentary        |29-05-2015 |84     |6.1       |Unknown |268877 |2023-01-14|
|Fatal A

#### Create a new column based on the IMDB rating

In [None]:
%%pyspark
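# NOTE: the third condition repeats the "Low" range, so "Medium" is never assigned
# and every score above 4.9 falls through to "High"
df = df_nonulls.withColumn("IMDB Category", when(col("IMDB Score").between(0.1,2.9),"Very Low"). \
                                            when(col("IMDB Score").between(3,4.9),"Low"). \
                                            when(col("IMDB Score").between(3,4.9),"Medium"). \
                                            otherwise("High"))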
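
A chained `when` expression is evaluated top to bottom, the first matching condition wins, and `between` includes both endpoints. The plain-Python sketch below mirrors the chain above to show the effect of the repeated `between(3,4.9)` condition. The function name `imdb_category` is hypothetical, and the sketch models only the control flow, not Spark's null handling.

```python
def imdb_category(score: float) -> str:
    """Plain-Python mirror of the when/otherwise chain; first match wins."""
    if 0.1 <= score <= 2.9:   # between(0.1, 2.9) is inclusive on both ends
        return "Very Low"
    if 3 <= score <= 4.9:
        return "Low"
    if 3 <= score <= 4.9:     # same range as the branch above, so never reached
        return "Medium"
    return "High"             # otherwise(...)

for s in (2.5, 4.9, 5.2, 7.7):
    print(s, imdb_category(s))

# Scan scores from 0.0 to 10.0 in steps of 0.1: none of them yields "Medium"
scores = [i / 10 for i in range(0, 101)]
print("Medium" in {imdb_category(s) for s in scores})
```

Giving the third branch its own range would make "Medium" reachable. A range such as 5 to 6.9 is one possibility, but that is an assumption, as the notebook does not state the intended bounds.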

In [None]:
%%pyspark
df.show(15,truncate=False) 

+-------------------------------------------------+---------------------+-----------+-------+----------+---------+-------+----------+-------------+
|Title                                            |Genre                |ReleaseDate|Runtime|IMDB Score|Language |Views  |AddedDate |IMDB Category|
+-------------------------------------------------+---------------------+-----------+-------+----------+---------+-------+----------+-------------+
|Holiday Rush                                     |Family film          |28-11-2019 |94     |4.9       |English  |20221  |2023-01-21|Low          |
|IO                                               |Science fiction/Drama|18-01-2019 |95     |4.7       |English  |122853 |2023-01-21|Low          |
|Strip Down, Rise Up                              |Documentary          |05-02-2021 |112    |5.2       |English  |1583625|2023-01-21|High         |
|Handsome: A Netflix Mystery Movie                |Comedy               |05-05-2017 |81     |5.2       |English 

#### New column: Runtime in hours

In [None]:
%%pyspark
# Convert Runtime from minutes to hours, rounded to two decimals
df = df.withColumn("RuntimeinHours", round(df.Runtime/60,2))

In [None]:
%%pyspark
df.show()

+--------------------+--------------------+-----------+-------+----------+---------+-------+----------+-------------+--------------+
|               Title|               Genre|ReleaseDate|Runtime|IMDB Score| Language|  Views| AddedDate|IMDB Category|RuntimeinHours|
+--------------------+--------------------+-----------+-------+----------+---------+-------+----------+-------------+--------------+
|        Holiday Rush|         Family film| 28-11-2019|     94|       4.9|  English|  20221|2023-01-21|          Low|          1.57|
|                  IO|Science fiction/D...| 18-01-2019|     95|       4.7|  English| 122853|2023-01-21|          Low|          1.58|
|                Mute|Science fiction/M...| 23-02-2018|    126|       5.5|  English|  30761|2023-01-21|         High|           2.1|
| Strip Down, Rise Up|         Documentary| 05-02-2021|    112|       5.2|  English|1583625|2023-01-21|         High|          1.87|
|         Bomb Scared|        Black comedy| 12-10-2017|     89|      

In [None]:
%%pyspark
df = df.drop('Runtime')

In [None]:
%%pyspark
df.show(15,truncate=False)   

+-------------------------------------------------+---------------------+-----------+----------+---------+-------+----------+-------------+--------------+
|Title                                            |Genre                |ReleaseDate|IMDB Score|Language |Views  |AddedDate |IMDB Category|RuntimeinHours|
+-------------------------------------------------+---------------------+-----------+----------+---------+-------+----------+-------------+--------------+
|Holiday Rush                                     |Family film          |28-11-2019 |4.9       |English  |20221  |2023-01-21|Low          |1.57          |
|IO                                               |Science fiction/Drama|18-01-2019 |4.7       |English  |122853 |2023-01-21|Low          |1.58          |
|Strip Down, Rise Up                              |Documentary          |05-02-2021 |5.2       |English  |1583625|2023-01-21|High         |1.87          |
|Handsome: A Netflix Mystery Movie                |Comedy             

#### 'Runtime' category

In [None]:
%%pyspark
# Bucket the runtime (in decimal hours) into three categories
df = df.withColumn("Runtime_Category", when(col("RuntimeinHours").between(0,1.30),"Short Runtime").\
                                       when(col("RuntimeinHours").between(1.31,2.15) , "Medium Runtime").\
                                       otherwise("Long Runtime"))
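
To sanity-check the bucket boundaries, here is a plain-Python sketch of the two runtime steps: minutes to hours rounded to two decimals, then the category lookup. The helper names are hypothetical, and Python's built-in `round` is only a stand-in for Spark's `round` (the two can differ on exact ties). The long label is written with a space here for consistency with the other two.

```python
def runtime_in_hours(minutes: int) -> float:
    """Convert minutes to hours, rounded to two decimals."""
    return round(minutes / 60, 2)

def runtime_category(hours: float) -> str:
    """Mirror of the when/otherwise chain; both bounds of each range are inclusive."""
    if 0 <= hours <= 1.30:
        return "Short Runtime"
    if 1.31 <= hours <= 2.15:
        return "Medium Runtime"
    return "Long Runtime"

# A few runtimes in minutes, including values near the bucket edges
for minutes in (54, 78, 79, 94, 129, 130):
    hours = runtime_in_hours(minutes)
    print(minutes, hours, runtime_category(hours))
```

The thresholds are decimal hours, so 1.30 corresponds to 78 minutes, not 1 hour 30 minutes. Because the hours are rounded to two decimals before bucketing, no value can fall between 1.30 and 1.31.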

In [None]:
%%pyspark
df.show()

+--------------------+--------------------+-----------+----------+--------+-------+----------+-------------+--------------+----------------+
|               Title|               Genre|ReleaseDate|IMDB Score|Language|  Views| AddedDate|IMDB Category|RuntimeinHours|Runtime_Category|
+--------------------+--------------------+-----------+----------+--------+-------+----------+-------------+--------------+----------------+
|               Barry|              Biopic| 16-12-2016|       5.8| English| 156567|2023-01-15|         High|          1.73|  Medium Runtime|
|        Holiday Rush|         Family film| 28-11-2019|       4.9| English|  20221|2023-01-21|          Low|          1.57|  Medium Runtime|
|Michael Bolton's ...|        Variety Show| 07-02-2017|       6.7| English|  84016|2023-02-03|         High|           0.9|   Short Runtime|
|        Road to Roma|           Making-of| 11-02-2020|       7.7| Spanish|  90372|2023-02-03|         High|           1.2|   Short Runtime|
|    The Midn

#### Change the data type of the 'ReleaseDate' column from String to Date

In [None]:
%%pyspark
df.printSchema()

root
 |-- Title: string (nullable = false)
 |-- Genre: string (nullable = false)
 |-- ReleaseDate: string (nullable = false)
 |-- IMDB Score: double (nullable = true)
 |-- Language: string (nullable = false)
 |-- Views: integer (nullable = true)
 |-- AddedDate: date (nullable = true)
 |-- IMDB Category: string (nullable = false)
 |-- RuntimeinHours: double (nullable = true)
 |-- Runtime_Category: string (nullable = false)



In [None]:
%%pyspark
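# Parse day-month-year strings into dates; by default, values that do not match the pattern become null
df = df.withColumn("ReleaseDate", to_date("ReleaseDate",'dd-MM-yyyy'))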

In [None]:
%%pyspark
df.printSchema()

root
 |-- Title: string (nullable = false)
 |-- Genre: string (nullable = false)
 |-- ReleaseDate: date (nullable = true)
 |-- IMDB Score: double (nullable = true)
 |-- Language: string (nullable = false)
 |-- Views: integer (nullable = true)
 |-- AddedDate: date (nullable = true)
 |-- IMDB Category: string (nullable = false)
 |-- RuntimeinHours: double (nullable = true)
 |-- Runtime_Category: string (nullable = false)



In [None]:
%%pyspark
df.sort('ReleaseDate').show()

+--------------------+--------------------+-----------+----------+--------------------+------+----------+-------------+--------------+----------------+
|               Title|               Genre|ReleaseDate|IMDB Score|            Language| Views| AddedDate|IMDB Category|RuntimeinHours|Runtime_Category|
+--------------------+--------------------+-----------+----------+--------------------+------+----------+-------------+--------------+----------------+
|Feminists: What W...|         Documentary|       null|       7.0|             English|494819|2023-02-23|         High|          1.43|  Medium Runtime|
|The Meyerowitz St...|        Comedy-drama|       null|       6.9|             English| 48297|2023-02-23|         High|          1.87|  Medium Runtime|
| Gaga: Five Foot Two|         Documentary|       null|       7.0|             English|382459|2023-02-23|         High|          1.67|  Medium Runtime|
|The Lonely Island...|    Comedy / Musical|       null|       6.9|             English|2

In [None]:
%%pyspark
df.select([count(when(col(c).isNull(),c)).alias(c) for c in df.columns]).show()

+-----+-----+-----------+----------+--------+-----+---------+-------------+--------------+----------------+
|Title|Genre|ReleaseDate|IMDB Score|Language|Views|AddedDate|IMDB Category|RuntimeinHours|Runtime_Category|
+-----+-----+-----------+----------+--------+-----+---------+-------------+--------------+----------------+
|    0|    0|          7|         0|       0|    0|        0|            0|             0|               0|
+-----+-----+-----------+----------+--------+-----+---------+-------------+--------------+----------------+
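
`ReleaseDate` had no nulls before the conversion, so the seven nulls counted above are values that did not parse with the `dd-MM-yyyy` pattern. The plain-Python sketch below imitates that behaviour with `datetime.strptime`, returning `None` instead of raising on a mismatch. The helper name `parse_release_date` and the non-matching sample string are hypothetical, since the notebook does not show the raw values that failed.

```python
from datetime import date, datetime
from typing import Optional

def parse_release_date(text: str, fmt: str = "%d-%m-%Y") -> Optional[date]:
    """Return a date if `text` matches `fmt`, else None (instead of raising)."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

# The first two strings come from the output above; the last one is a
# hypothetical value in a different format.
samples = ["28-11-2019", "16-12-2016", "November 28, 2019"]
parsed = [parse_release_date(s) for s in samples]
print(parsed)

# Count the values that failed to parse, like the null count above
print(sum(p is None for p in parsed))
```

Writing the parsed dates to a separate column instead of overwriting `ReleaseDate` would keep the original strings available, making it possible to inspect the rows that failed.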



#### Rename the "IMDB Score" and "IMDB Category" columns

In [None]:
%%pyspark
df = df.withColumnRenamed("IMDB Score","IMDB_Score")

In [None]:
%%pyspark
df = df.withColumnRenamed("IMDB Category","IMDB_Category")
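
If more columns contained spaces, the renames could be generated rather than written by hand. The sketch below builds the old-to-new mapping in plain Python from the column names shown in the schema above. Applying it with one `withColumnRenamed` call per pair is an assumption about how it would be used in the notebook.

```python
# Column names as listed by printSchema() before the renames
columns = [
    "Title", "Genre", "ReleaseDate", "IMDB Score", "Language", "Views",
    "AddedDate", "IMDB Category", "RuntimeinHours", "Runtime_Category",
]

# Map every column name that contains a space to an underscore version.
renames = {c: c.replace(" ", "_") for c in columns if " " in c}
print(renames)

# The resulting column list after applying the mapping.
renamed_columns = [renames.get(c, c) for c in columns]
print(renamed_columns)
```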

In [None]:
%%pyspark
df.show()

+--------------------+--------------------+-----------+----------+--------+-------+----------+-------------+--------------+----------------+
|               Title|               Genre|ReleaseDate|IMDB_Score|Language|  Views| AddedDate|IMDB_Category|RuntimeinHours|Runtime_Category|
+--------------------+--------------------+-----------+----------+--------+-------+----------+-------------+--------------+----------------+
|               Barry|              Biopic| 2016-12-16|       5.8| English| 156567|2023-01-15|         High|          1.73|  Medium Runtime|
|        Holiday Rush|         Family film| 2019-11-28|       4.9| English|  20221|2023-01-21|          Low|          1.57|  Medium Runtime|
|Michael Bolton's ...|        Variety Show| 2017-02-07|       6.7| English|  84016|2023-02-03|         High|           0.9|   Short Runtime|
|        Road to Roma|           Making-of| 2020-02-11|       7.7| Spanish|  90372|2023-02-03|         High|           1.2|   Short Runtime|
|    The Midn