## Netflix Movies

In [1]:
import boto3
import tempfile
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
    .appName("NetflixTitlesAnalysis") \
    .master("local[*]") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.files.maxPartitionBytes", "128MB") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

In [3]:
spark.conf.get("spark.executor.cores")

'2'

We open the Netflix CSV.

In [5]:
df = spark.read.csv("netflix_titles.csv",header=True, inferSchema=True)
df.show()

+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|            director|                cast|             country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|     Kirsten Johnson|                NULL|       United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|                NULL|Ama Qamata, Khosi...|        South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglan

In [6]:
from pyspark.sql.functions import corr, col, count, when

In [8]:
print(df.count())

8809


In [9]:
print(df.columns)

['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']


In [10]:
print(len(df.columns))

12


Because of the data types in the CSV (every column is read as a string), we cannot compute correlations.

In [14]:
df.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)
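Even with `inferSchema=True`, `release_year` is read as a string, which suggests that some rows do not line up with the header. One possible cause, stated here as an assumption and not verified against the file, is quoted fields that contain commas, doubled quotes, or line breaks. The sketch below first shows how such a field parses under standard CSV quoting, using only the standard library, and then a hypothetical helper `read_titles` that passes the corresponding `quote`, `escape`, and `multiLine` options to `spark.read.csv`.

```python
import csv
import io

# A tiny made-up sample: the description contains doubled quotes and a newline.
sample = (
    "show_id,title,description\n"
    's1,"Title, with comma","He said ""hi""\nand left"\n'
)

# Standard CSV quoting keeps the quoted text together as one field.
rows = list(csv.reader(io.StringIO(sample)))
print(len(rows), len(rows[1]))


def read_titles(spark, path):
    """Hypothetical helper: read the CSV with doubled-quote escaping.

    Assumes the file uses "" to escape quotes and may contain
    line breaks inside quoted fields.
    """
    return spark.read.csv(
        path,
        header=True,
        inferSchema=True,
        quote='"',
        escape='"',
        multiLine=True,
    )
```

If the assumption holds, `read_titles(spark, "netflix_titles.csv").printSchema()` may report `release_year` as an integer; if it does not, the misaligned rows have another cause.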



In [13]:
df.select(corr("rating", "duration")).show()

+----------------------+
|corr(rating, duration)|
+----------------------+
|                  NULL|
+----------------------+



If the shows had numeric columns, we could compute a correlation.
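One way to obtain numeric columns is to derive them from the existing strings. The sketch below is a minimal example under stated assumptions: movie durations look like `90 min`, valid years are four digits, and anything else (for example misparsed rows) becomes NULL. The names `duration_to_minutes`, `year_to_int`, and `movie_year_duration_corr` are hypothetical and not part of the original notebook.

```python
import re


def duration_to_minutes(duration):
    """Return the minutes in a value like '90 min', or None for anything else."""
    match = re.fullmatch(r"(\d+) min", duration or "")
    return int(match.group(1)) if match else None


def year_to_int(release_year):
    """Return a four-digit year as an int, or None for other values."""
    text = (release_year or "").strip()
    return int(text) if re.fullmatch(r"\d{4}", text) else None


def movie_year_duration_corr(df):
    """Correlate release year with runtime, for movies only."""
    # pyspark is imported here so the helpers above can be tried without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # Python UDFs are used for clarity; built-in column functions may be faster.
    to_minutes = F.udf(duration_to_minutes, IntegerType())
    to_year = F.udf(year_to_int, IntegerType())

    movies = (
        df.filter(F.col("type") == "Movie")
        .withColumn("duration_min", to_minutes("duration"))
        .withColumn("release_year_int", to_year("release_year"))
        .dropna(subset=["duration_min", "release_year_int"])
    )
    return movies.select(
        F.corr("release_year_int", "duration_min").alias("corr")
    )
```

With the notebook's `df`, `movie_year_duration_corr(df).show()` would print a single correlation value. TV shows are excluded because their `duration` is expressed in seasons, not minutes.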

In [16]:
# show() returns None, so keep the filtered DataFrame and display it separately
df_filter_mexico = df.filter(col("country") == "Mexico")
df_filter_mexico.show()

+-------+-------+--------------------+--------------------+--------------------+-------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|            director|                cast|country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+--------------------+--------------------+-------+------------------+------------+------+---------+--------------------+--------------------+
|    s18|TV Show|     Falsa identidad|                NULL|Luis Ernesto Fran...| Mexico|September 22, 2021|        2020| TV-MA|2 Seasons|Crime TV Shows, S...|Strangers Diego a...|
|   s283|  Movie|La diosa del asfalto|    Julián Hernández|Ximena Romo, Mabe...| Mexico|   August 11, 2021|        2020| TV-MA|  127 min|Dramas, Independe...|A woman from a to...|
|   s312|TV Show|           Control Z|                NULL|Ana Valeria Becer...| Mexico|    August 4

In [20]:
df.groupBy("date_added").count().show()

+------------------+-----+
|        date_added|count|
+------------------+-----+
|      May 21, 2021|    8|
|     March 2, 2021|    3|
|September 23, 2020|    2|
| September 8, 2020|    4|
|    April 14, 2020|    2|
| December 30, 2019|    1|
|   August 12, 2019|    1|
|     June 22, 2019|    2|
|      May 30, 2017|    2|
|    April 29, 2016|    2|
|    March 25, 2016|    1|
|  October 27, 2015|    1|
|   January 1, 2008|    1|
|     March 2, 2017|    2|
|  October 31, 2015|    1|
|     June 23, 2021|    5|
|  November 1, 2020|   32|
|  February 9, 2020|    3|
| November 28, 2019|    6|
|   October 5, 2019|    3|
+------------------+-----+
only showing top 20 rows
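Grouping by the raw `date_added` string yields one group per calendar day, which is hard to read. A possible refinement, sketched below, is to extract the year first and group on that. It assumes dates follow the `Month D, YYYY` pattern seen in the output above and that month names are in English; values that do not match become NULL. The names `added_year` and `titles_added_per_year` are hypothetical.

```python
from datetime import datetime


def added_year(date_added):
    """Return the year of a value like 'September 25, 2021', or None."""
    try:
        parsed = datetime.strptime((date_added or "").strip(), "%B %d, %Y")
    except ValueError:
        # Not a date in the expected format (for example a misparsed row).
        return None
    return parsed.year


def titles_added_per_year(df):
    """Count titles per year of addition in a Spark DataFrame."""
    # pyspark is imported here so added_year can be tried without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    to_year = F.udf(added_year, IntegerType())
    return (
        df.withColumn("year_added", to_year("date_added"))
        .groupBy("year_added")
        .count()
        .orderBy("year_added")
    )
```

With the notebook's `df`, `titles_added_per_year(df).show()` would list one row per year, plus a NULL row collecting the values that could not be parsed.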



In [21]:
df.groupBy("date_added", "rating").count().show()

+-----------------+--------------------+-----+
|       date_added|              rating|count|
+-----------------+--------------------+-----+
| December 1, 2020|                TV-Y|    5|
|     June 5, 2020|               PG-13|    1|
|     June 1, 2020|                TV-G|    1|
| February 8, 2020|               TV-14|    1|
|December 31, 2019|               TV-14|   40|
|   August 8, 2019|               TV-14|    1|
|     June 1, 2019|               TV-MA|    6|
| January 15, 2019|               TV-MA|    5|
| October 20, 2018|               TV-MA|    1|
|  October 5, 2018|               TV-MA|    2|
|    June 29, 2018|               TV-14|    1|
| October 20, 2017|               TV-PG|    1|
|   March 10, 2017|               TV-14|    6|
| January 15, 2017|               TV-PG|    2|
|  January 1, 2018|                TV-G|    4|
|   April 18, 2017|                  PG|    1|
| January 17, 2018|               TV-PG|    1|
|            TV-PG|Classic Movies, D...|    1|
|September 5,

In [23]:
df.groupBy("listed_in", "rating").count().show(50)

+--------------------+------+-----+
|           listed_in|rating|count|
+--------------------+------+-----+
|British TV Shows,...|  TV-Y|   15|
|Dramas, Romantic ...| PG-13|   24|
|Cult Movies, Horr...|     R|    2|
|International Movies| TV-MA|    1|
|Dramas, Faith & S...| TV-14|    5|
|           Thrillers|     R|   30|
|Crime TV Shows, T...| TV-14|    2|
|Children & Family...| TV-PG|    1|
|Children & Family...|  TV-Y|    2|
|Documentaries, St...| TV-MA|    1|
|           Thrillers| TV-MA|    7|
|Action & Adventur...| TV-MA|    1|
|Dramas, Faith & S...| TV-PG|    5|
|Documentaries, Fa...|  TV-G|    1|
|Dramas, Independe...| TV-PG|    1|
|Action & Adventur...|    PG|    1|
|TV Action & Adven...|    NR|    1|
|Comedies, Interna...| TV-MA|   86|
|Dramas, Independe...|     R|    5|
|Anime Series, Rom...| TV-14|    1|
|Dramas, Sci-Fi & ...|    PG|    1|
|International Mov...| TV-MA|    1|
|Dramas, Independe...|    NR|    4|
|     Romantic Movies| TV-14|    1|
|Crime TV Shows, I...| TV-MA

In [25]:
df.groupBy("type", "rating").count().show(100)

+-------------+--------------------+-----+
|         type|              rating|count|
+-------------+--------------------+-----+
|        Movie|                TV-G|  126|
|        Movie|    Shavidee Trotter|    1|
|        Movie|                TV-Y|  131|
|        Movie|               TV-Y7|  139|
|        Movie|              66 min|    1|
|        Movie|                   G|   41|
|        Movie|               NC-17|    3|
|        Movie|               PG-13|  489|
|         NULL|                NULL|    1|
|      TV Show|               TV-PG|  323|
|        Movie|            TV-Y7-FV|    5|
|      TV Show|                  NR|    5|
|        Movie|                   R|  794|
|      TV Show|                NULL|    2|
|        Movie|                  NR|   75|
|        Movie|                2021|    2|
|        Movie|         Jide Kosoko|    1|
|      TV Show| Keppy Ekpenyong ...|    1|
|      TV Show|               TV-Y7|  195|
|        Movie|                2017|    1|
|        Mo