# Movie data decomposition

The purpose of this notebook is to split the data into normalized, independent datasets so that they can be stored in Apache Hive tables.

Creating the Spark session

In [50]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Movies data decomposition").getOrCreate()

In [2]:
from pyspark.sql.types import  *
from pyspark.sql.functions import col, from_json, regexp_replace

Read and write directories

In [3]:
hadoop_base_directory = 'hdfs://192.168.56.101:9000/obligatorio'
hadoop_datasets_folder = f'{hadoop_base_directory}/datasets_parquet'
hadoop_dest_folder = f'{hadoop_base_directory}/processed_tables'

# Processing the movies dataset

Reading the movies dataset from HDFS:

In [4]:
# Parquet files carry their own schema, so the CSV reader options (header, mode, quote, escape) do not apply here.
movies = spark.read.parquet(f'{hadoop_datasets_folder}/movies_metadata')

In [5]:
movies.show(10, truncate=False)

In [6]:
movies.count()

45463

Attribute names and selection of the relevant attributes

In [7]:
a_adult = "adult"
a_belongs_to = "belongs_to_collection"
a_budget = "budget"
a_genres = "genres"
a_id = "movie_id"
a_original_language = "original_language"
a_original_title = "original_title"
a_overview = "overview"
a_popularity = "popularity"
a_prod_companies = "production_companies"
a_production_countries = "production_countries"
a_release_date = "release_date"
a_revenue = "revenue"
a_spoken_languages = "spoken_languages"
a_title = "title"
a_vote_average = "vote_average"
a_vote_count = "vote_count"

relevant_fields = [a_adult, a_budget, a_genres, a_id, a_original_language, a_original_title,
                   a_overview, a_popularity, a_prod_companies, a_production_countries, a_release_date,
                  a_revenue, a_spoken_languages, a_title]

movies_selected_fields = [a_adult, a_budget, a_id, a_original_language, a_original_title, a_overview,
                          a_popularity, a_release_date, a_revenue, a_title]

All columns are initially of type string

In [8]:
movies.dtypes

[('adult', 'string'),
 ('belongs_to_collection', 'string'),
 ('budget', 'string'),
 ('genres', 'string'),
 ('homepage', 'string'),
 ('id', 'string'),
 ('imdb_id', 'string'),
 ('original_language', 'string'),
 ('original_title', 'string'),
 ('overview', 'string'),
 ('popularity', 'string'),
 ('poster_path', 'string'),
 ('production_companies', 'string'),
 ('production_countries', 'string'),
 ('release_date', 'string'),
 ('revenue', 'string'),
 ('runtime', 'string'),
 ('spoken_languages', 'string'),
 ('status', 'string'),
 ('tagline', 'string'),
 ('title', 'string'),
 ('video', 'string'),
 ('vote_average', 'string'),
 ('vote_count', 'string'),
 ('c_0', 'string'),
 ('c_1', 'string'),
 ('c_2', 'string'),
 ('c_3', 'string')]

Schema definitions for the JSON fields, so that they can be parsed and worked with.

In [9]:
genres_schema = ArrayType(
    StructType([StructField("id", IntegerType()), 
                StructField("name", StringType())]))

prod_companies_schema = ArrayType(
    StructType([StructField("id", IntegerType()),
                StructField("name", StringType())]))

prod_countries_schema = ArrayType(
    StructType([StructField("iso_3166_1", StringType()),
                StructField("name", StringType())]))

spoken_languages_schema = ArrayType(
    StructType([StructField("iso_639_1", StringType()),
                StructField("name", StringType())]))

Casting the data to their proper types

In [10]:
# revenue is cast to Long: with default (non-ANSI) settings, values above 2,147,483,647 do not fit
# in a 32-bit Integer and become null, and the na.drop step that follows would discard those rows.
movies = movies.withColumn(a_adult, (movies.adult).cast("Boolean"))\
         .withColumn(a_id, (movies.id).cast("Integer"))\
         .withColumn(a_budget, (movies.budget).cast("Integer"))\
         .withColumn(a_genres, from_json(movies.genres, genres_schema))\
         .withColumn(a_prod_companies, from_json(movies.production_companies, prod_companies_schema))\
         .withColumn(a_production_countries, from_json(movies.production_countries, prod_countries_schema))\
         .withColumn(a_spoken_languages, from_json(movies.spoken_languages, spoken_languages_schema))\
         .withColumn(a_popularity, (movies.popularity).cast("Float"))\
         .withColumn(a_release_date, (movies.release_date).cast("Date"))\
         .withColumn(a_revenue, (movies.revenue).cast("Long"))\
         .withColumn(a_vote_average, (movies.vote_average).cast("Float"))\
         .withColumn(a_vote_count, (movies.vote_count).cast("Integer"))
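
A note on the integer casts, as a hedged sketch: Spark's `Integer` type is a 32-bit signed integer, and with default (non-ANSI) settings a numeric string outside that range is cast to null instead of raising an error. The pure-Python check below only illustrates the range; `fits_int32` and the sample values are hypothetical and are not taken from the dataset.

```python
# 32-bit signed range used by Spark's IntegerType.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def fits_int32(text):
    """Return True if the numeric string fits in a 32-bit signed integer."""
    try:
        value = int(text)
    except (TypeError, ValueError):
        return False  # non-numeric strings cannot be cast either
    return INT32_MIN <= value <= INT32_MAX

# Hypothetical sample values (not taken from the dataset).
samples = ["30000000", "2500000000", "not a number"]
checks = {s: fits_int32(s) for s in samples}
```

Columns whose values may fall outside this range can be cast to `Long` instead.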

Removing malformed records and records with a duplicate id

In [11]:
movies = movies.na.drop(subset=[a_adult, a_id, a_budget, a_genres, a_prod_companies, a_production_countries, 
                   a_spoken_languages, a_popularity, a_revenue, a_vote_average, a_vote_count])\
         .dropDuplicates(subset=[a_id])
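
The effect of this cleaning step can be sketched in plain Python, without Spark. `drop_nulls_and_duplicates` and the sample rows are hypothetical. The sketch keeps the first row seen per id, whereas Spark's `dropDuplicates` makes no guarantee about which of the duplicates survives.

```python
def drop_nulls_and_duplicates(rows, required, key):
    """Keep rows whose required fields are all non-null, one row per key."""
    seen = set()
    kept = []
    for row in rows:
        if any(row.get(field) is None for field in required):
            continue  # malformed: a cast or JSON parse produced null
        if row[key] in seen:
            continue  # duplicate id: only the first occurrence is kept here
        seen.add(row[key])
        kept.append(row)
    return kept

# Hypothetical rows, made up for illustration.
rows = [
    {"movie_id": 1, "budget": 100},
    {"movie_id": 1, "budget": 100},   # duplicate id
    {"movie_id": None, "budget": 5},  # id failed to cast
    {"movie_id": 2, "budget": None},  # budget failed to cast
    {"movie_id": 3, "budget": 0},     # zero is a valid value, not a null
]
clean = drop_nulls_and_duplicates(rows, ["movie_id", "budget"], "movie_id")
```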

Keeping only the relevant attributes defined earlier for the movies dataset and dropping the rest.

In [12]:
movies = movies[relevant_fields]

# Adding the computed ratings

The ratings were computed by a Pig script and are also stored in HDFS. Their ids are not the same as those in movies_metadata, but the links dataset maps rating ids to movie ids, so it is used to connect the two.

HDFS locations of the ratings and links

In [13]:
hadoop_ratings_addr = f'{hadoop_base_directory}/ratings'
hadoop_links_addr = f'{hadoop_datasets_folder}/links'

Schema of the ratings dataset.

In [14]:
a_rating = "rating"
a_vote_count = "vote_count"

ratings_schema = StructType([
    StructField(a_id, IntegerType(), True),
    StructField(a_rating, FloatType(), True),
    StructField(a_vote_count, IntegerType(), True)
])

Id in links that corresponds to movies_metadata:

In [15]:
a_tmdbId = "tmdbId"

In [16]:
# The ratings files are read without a header row; the explicit schema names the columns.
ratings = spark.read.format("csv").option("header", "false").schema(ratings_schema).load(hadoop_ratings_addr)
# Parquet stores its own schema, so no header option is needed.
links = spark.read.parquet(hadoop_links_addr)

Removing nulls and duplicates in ratings, and casting the ids in links

In [17]:
ratings = ratings.na.drop(subset=[a_id])\
          .dropDuplicates(subset=[a_id])

In [18]:
links = links.withColumn(a_id, (links.movieId).cast("Integer"))\
             .withColumn(a_tmdbId, (links.tmdbId).cast("Integer"))[a_id, a_tmdbId]

Joining ratings with links:

In [19]:
ratings_links = ratings.join(links, on=[a_id])

In [20]:
ratings = ratings_links.withColumn(a_id, (ratings_links.tmdbId).cast("Integer"))\
                 .withColumn(a_rating, (ratings_links.rating).cast("Float"))[a_id, a_rating, a_vote_count]

Joining the ratings with the movies:

In [21]:
movies = movies.join(ratings, on=[a_id], how='left')
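
The chain of joins can be sketched in plain Python to make the id mapping explicit: ratings are re-keyed from the ratings id to `tmdbId` through links (an inner join), and then attached to the movies with a left join. All names and values below are hypothetical and made up for illustration.

```python
# Hypothetical rows: (movieId, rating, vote_count).
sample_ratings = [(10, 4.0, 120), (20, 3.5, 80), (30, 2.0, 5)]
# Hypothetical links mapping: movieId -> tmdbId.
sample_links = {10: 862, 20: 8844}
# Hypothetical movies, keyed by the movies_metadata id.
sample_movies = [{"movie_id": 862, "title": "A"}, {"movie_id": 999, "title": "B"}]

# Inner join of ratings with links: ratings without a link are dropped,
# and the remaining ones are re-keyed by tmdbId.
ratings_by_tmdb = {
    sample_links[movie_id]: (rating, votes)
    for movie_id, rating, votes in sample_ratings
    if movie_id in sample_links
}

# Left join: every movie is kept; movies without a rating get None.
joined = []
for movie in sample_movies:
    rating, votes = ratings_by_tmdb.get(movie["movie_id"], (None, None))
    joined.append({**movie, "rating": rating, "vote_count": votes})
```

With a left join, movies that have no rating stay in the table with null `rating` and `vote_count` values.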

Writing the formatted movies table to HDFS:

In [22]:
movies.write.mode('overwrite').parquet(f'{hadoop_dest_folder}/movies')

### Splitting composite attributes into separate tables

Defining functions that extract the nested data of a dataset into a separate dataset, and that store the resulting datasets.

In [23]:
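def separate_normalized_tables(df, entity_name, entity_identifier="id", table_id="movie_id"):
    """Split an array-of-structs column into an entity RDD and a movie-entity relation RDD."""
    sub_df = df.select([table_id, entity_name])
    sub_df = sub_df.filter(sub_df[entity_name].isNotNull())
    rdd = sub_df.rdd
    # One (movie id, entity id) pair per element of the array column.
    movie_entity = rdd.flatMap(lambda r: [(r[table_id], g[entity_identifier]) for g in r[entity_name]])
    # Flatten all entities and keep a single tuple per key (the first field of each struct).
    entity = rdd.flatMap(lambda r: r[entity_name])
    entity = entity.map(tuple)
    entity = entity.reduceByKey(lambda a, b: a)
    return entity, movie_entity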
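
As an illustration of what the function returns, the following is a pure-Python analogue that works on small lists of dicts instead of RDDs. `separate_in_memory` and the sample rows are hypothetical; the sketch keeps the first occurrence of each entity, whereas `reduceByKey` in Spark keeps one occurrence without guaranteeing which.

```python
def separate_in_memory(rows, entity_name, entity_identifier="id", table_id="movie_id"):
    """Pure-Python analogue of separate_normalized_tables for lists of dicts."""
    entities = {}   # identifier -> entity tuple, first occurrence wins
    relations = []  # (movie id, entity identifier) pairs
    for row in rows:
        if row.get(entity_name) is None:
            continue  # same role as the isNotNull filter
        for item in row[entity_name]:
            relations.append((row[table_id], item[entity_identifier]))
            entities.setdefault(item[entity_identifier], tuple(item.values()))
    return list(entities.values()), relations

# Hypothetical rows, made up for illustration.
sample = [
    {"movie_id": 1, "genres": [{"id": 16, "name": "Animation"},
                               {"id": 35, "name": "Comedy"}]},
    {"movie_id": 2, "genres": [{"id": 35, "name": "Comedy"}]},
    {"movie_id": 3, "genres": None},
]
genre_rows, movie_genre_rows = separate_in_memory(sample, "genres")
```

The first result is the deduplicated entity table and the second is the many-to-many relation between movies and entities.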

In [24]:
def store_rdd(rdd, fields, table_name):
    df = rdd.toDF(fields)
    df.write.mode('overwrite').parquet(f'{hadoop_dest_folder}/{table_name}')

Calling the function to split the composite attributes

In [25]:
genre, movie_genre = separate_normalized_tables(movies,a_genres)
prod_company, movie_prod_company = separate_normalized_tables(movies, a_prod_companies)
country, movie_prod_country = separate_normalized_tables(movies, a_production_countries, "iso_3166_1")
language, movie_spoken_language = separate_normalized_tables(movies, a_spoken_languages, "iso_639_1")

Names for the new datasets:

In [26]:
t_movies = "movies"
t_genres = "genres"
t_movies_genres = "movies_genres"
t_prod_companies = "prod_companies"
t_movies_prod_companies = "movies_prod_companies"
t_countries = "prod_countries"
t_movies_countries = "movies_prod_countries"
t_languages = "spoken_languages"
t_movies_languages = "movies_spoken_languages"

Writing the datasets:

In [27]:
store_rdd(genre, ["id", "name"], t_genres)
store_rdd(movie_genre, ["id_movie", "id_genre"], t_movies_genres)
store_rdd(prod_company, ["id", "name"], t_prod_companies)
store_rdd(movie_prod_company, ["id_movie", "id_prod_company"], t_movies_prod_companies)
store_rdd(country, ["id", "name"], t_countries)
store_rdd(movie_prod_country, ["id_movie", "id_prod_country"], t_movies_countries)
store_rdd(language, ["id", "name"], t_languages)
store_rdd(movie_spoken_language, ["id_movie", "id_spoken_language"], t_movies_languages)

# Processing the Keywords dataset

Reading the keywords dataset:

In [28]:
keywords = spark.read.parquet(f'{hadoop_datasets_folder}/keywords')

Schema of the nested keywords field:

In [29]:
a_keywords = "keywords"

keywords_schema = ArrayType(
    StructType([StructField("id", IntegerType()), 
                StructField("name", StringType())]))

Casting the data to their corresponding types:

In [30]:
keywords = keywords.withColumn("id", (keywords.id).cast("Integer"))\
                   .withColumn("keywords", from_json(keywords.keywords, keywords_schema))

In [31]:
keywords.dtypes

[('id', 'int'), ('keywords', 'array<struct<id:int,name:string>>')]

In [32]:
keyword, movie_keyword = separate_normalized_tables(keywords,"keywords", table_id="id")

In [33]:
t_keyword = "keywords"
t_movies_keywords = "movies_keywords"

In [34]:
store_rdd(keyword, ["id", "name"], t_keyword)
store_rdd(movie_keyword, ["id_movie", "id_keyword"], t_movies_keywords)

# Processing the Credits dataset

In [35]:
hadoop_credits_addr = f'{hadoop_datasets_folder}/credits'

In [36]:
credits = spark.read.parquet(hadoop_credits_addr)

None values are replaced with an empty string

In [37]:
credits = credits.withColumn('cast', regexp_replace('cast', ': None', ": ''"))\
          .withColumn('crew', regexp_replace('crew', ': None', ": ''"))
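
The substitution can be tried on a single string in plain Python. This is a minimal sketch: the raw value is hypothetical, written in the Python-literal style that the `: None` pattern suggests, and `ast.literal_eval` only stands in for `from_json` so that the result can be inspected without Spark.

```python
import ast
import re

# Hypothetical raw value; it is not copied from the dataset.
raw_cast = "[{'cast_id': 14, 'name': 'Jane Doe', 'profile_path': None}]"

# Same substitution as the regexp_replace call: ': None' becomes ": ''".
cleaned = re.sub(r": None", ": ''", raw_cast)

# Parse the cleaned literal to check that the field is now an empty string.
parsed = ast.literal_eval(cleaned)
```

The pattern is purely textual, so it would also rewrite a `: None` that happens to appear inside a string value such as a character name.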

Schema definitions for cast and crew

In [38]:
a_cast = "cast"
a_crew = "crew"


crew_schema = ArrayType(
    StructType([StructField("credit_id", StringType()), 
                StructField("department", StringType()),
                StructField("gender", IntegerType()),
                StructField("id", IntegerType()),
                StructField("job", StringType()),
                StructField("name", StringType()),
                StructField("profile_path", StringType())
               ]))

cast_schema = ArrayType(
    StructType([StructField("cast_id", IntegerType()), 
                StructField("character", StringType()),
                StructField("credit_id", StringType(), True),
                StructField("gender", IntegerType(), True),
                StructField("id", IntegerType()),
                StructField("name", StringType()),
                StructField("order", IntegerType(), True),
                StructField("profile_path", StringType(), True),
               ]))

Formatting the credits dataset

In [39]:
credits = credits.withColumn("id", (credits.id).cast("Integer"))\
                  .withColumn("cast", from_json(credits.cast, cast_schema))\
                  .withColumn("crew", from_json(credits.crew, crew_schema))

In [40]:
credits.dtypes

[('cast',
  'array<struct<cast_id:int,character:string,credit_id:string,gender:int,id:int,name:string,order:int,profile_path:string>>'),
 ('crew',
  'array<struct<credit_id:string,department:string,gender:int,id:int,job:string,name:string,profile_path:string>>'),
 ('id', 'int')]

### Extracting the cast field (actors):

In [41]:
sub_df = credits.select(["id","cast"])
sub_df = sub_df.filter(sub_df["cast"].isNotNull())
rdd = sub_df.rdd
# One (movie id, person id) pair per cast entry.
movie_cast = rdd.flatMap(lambda r: map(lambda g: (r.id, g["id"]), r["cast"]))
cast = rdd.flatMap(lambda r: r["cast"])
# Key each entry by person id, keep one entry per person, then flatten back to a plain tuple.
cast = cast.map(lambda e: (e.id, (e.cast_id, e.character, e.gender, e.name, e.order)))
cast = cast.reduceByKey(lambda a, b : a)
cast = cast.map(lambda t: (t[0], t[1][0], t[1][1], t[1][2], t[1][3], t[1][4]))

Names for the new datasets:

In [42]:
t_cast = "cast"
t_movies_cast = "movies_cast"

Writing the new datasets to HDFS:

In [43]:
store_rdd(cast, ["id", "cast_id", "character", "gender", "name", "order"], t_cast)
store_rdd(movie_cast, ["id_movie", "cast_id"], t_movies_cast)

### Extracting the crew field (crew members):

In [44]:
sub_df = credits.select(["id","crew"])
sub_df = sub_df.filter(sub_df["crew"].isNotNull())
rdd = sub_df.rdd
movie_crew = rdd.flatMap(lambda r: map(lambda g: (r.id, g["id"]), r["crew"]))
crew = rdd.flatMap(lambda r: r["crew"])
crew = crew.map(lambda e: (e.id, (e.department, e.gender, e.job)))
crew = crew.reduceByKey(lambda a, b : a)
crew = crew.map(lambda t: (t[0], t[1][0], t[1][1], t[1][2]))

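In [53]:
cast.take(10)

AttributeError: 'NoneType' object has no attribute 'sc'

Note: the execution counter of this cell (53) is higher than that of the `spark.stop()` cell (48), so it was most likely run after the session had been stopped, which would explain the error. It is not part of the crew extraction.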

Names for the new datasets:

In [46]:
t_crew = "crew"
t_movie_crew = "movies_crew"

Writing the new datasets to HDFS:

In [47]:
store_rdd(crew, ["id", "department", "gender", "job"], t_crew)
store_rdd(movie_crew, ["id_movie", "crew_id"], t_movie_crew)

Closing the Spark session

In [48]:
spark.stop()