# <h1 align=center> **ETL - DATASET MOVIES** </h1>
<h1 align=center> (Extract, Transform, Load) </h1>

For the first phase of this project I will apply the ETL process, which basically consists of `"Extracting"` the raw data from its origin (Source), `"Transforming"` it to fit our analytics needs or the structure we want, and `"Loading"` it into a database oriented to analytical processes (Target).

<p align=center><img src=https://www.informatica.com/content/dam/informatica-com/en/images/misc/etl-process-explained-diagram.png height=300></p>

Before applying the ETL process, the libraries that will be useful throughout it are imported:

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from scipy.spatial import distance
import json
import ast

# <h1 align=left>**`Extract`**</h1>

At this point, structured datasets stored as .csv files are extracted and loaded into `Pandas` `DataFrames`.

In [3]:
# Raw string, so the backslashes in the Windows-style path are not read as escape sequences
df_movies = pd.read_csv(r"..\Data\movies_dataset.csv")


# <h1 align=left>**`Transform`**</h1>

At this point I will apply the rules the project requires in order to carry out a sound _`Exploratory Data Analysis-EDA`_ and build a _`Recommendation system`_. These rules can include processes such as:

+ Filtering rows by certain characteristics.
+ Removing duplicates.
+ Transforming data.
+ Computing new data (for example, the percentage return on investment per movie).
+ Extracting data (for example, the year from the release date).
+ Joining or combining data from different sources.
+ Un-nesting data from columns whose row values are a dictionary or a list.
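As a minimal sketch of two of the rules above (computing a new value and extracting part of a date), the snippet below works on a small hand-built DataFrame. The column names `return` and `release_year` are illustrative assumptions, and the return is expressed as a plain revenue/budget ratio rather than a percentage.

```python
import numpy as np
import pandas as pd

# Two rows copied by hand from the dataset preview, for illustration only
toy = pd.DataFrame({
    "title": ["Toy Story", "Grumpier Old Men"],
    "budget": [30000000, 0],
    "revenue": [373554033.0, 0.0],
    "release_date": ["1995-10-30", "1995-12-22"],
})

# Return as revenue / budget; a zero budget is replaced by NaN so the result is NaN, not inf
toy["return"] = toy["revenue"] / toy["budget"].replace(0, np.nan)

# Parse the date strings and keep only the year; unparseable dates would become NaT
toy["release_year"] = pd.to_datetime(toy["release_date"], errors="coerce").dt.year
```

Replacing the zero budget before dividing keeps missing information visible as `NaN` instead of producing infinite or misleading ratios.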

**1.** The data in the DataFrame is inspected to identify its structure and the information it contains:

In [4]:
df_movies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [5]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

**2.** The names of the DataFrame's columns are listed. This is useful for keeping the full names of all columns at hand so they can be used in the exploratory code that follows.

In [6]:
columnas= df_movies.columns.tolist()
columnas

['adult',
 'belongs_to_collection',
 'budget',
 'genres',
 'homepage',
 'id',
 'imdb_id',
 'original_language',
 'original_title',
 'overview',
 'popularity',
 'poster_path',
 'production_companies',
 'production_countries',
 'release_date',
 'revenue',
 'runtime',
 'spoken_languages',
 'status',
 'tagline',
 'title',
 'video',
 'vote_average',
 'vote_count']

**3.** Having observed in step 1 that the first column of the DataFrame holds boolean values, the column is sorted in ascending order to check whether it contains values other than the ones already identified.

In [7]:
df_movies.sort_values(by=['adult'], ascending=True)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12,,,,,,,,,
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
30300,False,,0,"[{'id': 35, 'name': 'Comedy'}]",http://www.cc.com/shows/roast-of-flavor-flav,315850,tt1037714,en,Comedy Central Roast of Flavor Flav,It's Flavor Flav's turn to step in to the cele...,...,2007-08-12,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Comedy Central Roast of Flavor Flav,False,6.4,9.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39901,True,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 27, 'name...",,35731,tt1161951,en,Amateur Porn Star Killer 2,Shane Ryan's sequel to the disturbing Amateur ...,...,2008-05-13,0.0,0.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Humiliation. Rape. Murder. You know the drill.,Amateur Porn Star Killer 2,False,6.3,8.0
32113,True,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",http://www.dietofsex.com/,324230,tt3094816,es,Diet of Sex,Ágata suffers from a psychological disorder wh...,...,2014-02-14,0.0,72.0,"[{'iso_639_1': 'es', 'name': 'Español'}]",Released,"Comedy, food, drama and sex, a lot of sex",Diet of Sex,False,4.0,12.0
31934,True,,0,"[{'id': 35, 'name': 'Comedy'}]",,44781,tt0322232,cn,發電悄嬌娃,Electrical Girl centers around a horny young w...,...,2001-04-26,0.0,89.0,"[{'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}]",Released,,Electrical Girl,False,0.0,0.0
39902,True,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,55774,tt1153101,en,The Band,Australian film about a fictional sub-par Aust...,...,2009-11-17,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Sex, drugs and Rock 'n Roll",The Band,False,3.3,7.0
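Sorting works for spotting stray values by eye; a complementary check is to count the distinct values directly. The sketch below uses a made-up Series standing in for the `adult` column and assumes its values were read as text, which is an assumption about how the real file is parsed.

```python
import pandas as pd

# Toy stand-in for the 'adult' column, assumed to be read as strings (dtype object)
adult = pd.Series(["False", "True", "False", " - Written by Ørnås", "False"])

# Count every distinct value, including missing ones
counts = adult.value_counts(dropna=False)

# Anything that is not one of the two expected labels is suspicious
unexpected = adult[~adult.isin(["True", "False"])]
```

The index of `unexpected` points straight at the rows that need attention, without scrolling through a sorted table.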


**4.** The output of step 3 shows 3 rows whose values sit in variables they do not belong to: the data is shifted 8 positions to the left. Since the information they contain may be needed for the _`Exploratory Data Analysis-EDA`_, these rows are extracted from the original dataset and placed in a new dataset, to which a parallel transformation process is applied later:

**4.1. Identifying the errors and creating the new DataFrame** 

In [8]:
df_movies["id"] = pd.to_numeric(df_movies["id"], errors='coerce')
mistake_rows = df_movies["id"].isnull()
df_anomalies = df_movies[mistake_rows]
df_anomalies.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
19730,- Written by Ørnås,0.065736,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...","[{'iso_3166_1': 'CA', 'name': 'Canada'}, {'iso...",,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,1,,,,,,,,,
29503,Rune Balot goes to a casino connected to the ...,1.931659,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...","[{'iso_3166_1': 'US', 'name': 'United States o...",,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,...,12,,,,,,,,,
35587,Avalanche Sharks tells the story of a bikini ...,2.185485,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...","[{'iso_3166_1': 'CA', 'name': 'Canada'}]",,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...,22,,,,,,,,,
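To make the detection step concrete, here is a small sketch of how `pd.to_numeric(..., errors='coerce')` turns non-numeric entries into `NaN`, which `isnull()` then flags. The Series is made up for illustration; the date string mimics the kind of value that ends up in `id` when a row is shifted.

```python
import pandas as pd

# Made-up 'id' values: three valid ids and one date that landed in the wrong column
ids = pd.Series(["862", "8844", "1997-08-20", "15602"])

# Values that cannot be parsed as numbers become NaN instead of raising an error
numeric_ids = pd.to_numeric(ids, errors="coerce")

# Boolean mask marking the rows whose id could not be parsed
bad_rows = numeric_ids.isnull()
```

Note that the presence of `NaN` makes the resulting column `float64`, which is why the ids are displayed as `862.0` in later outputs.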


**4.2. Creating a new dataset from the anomalies DataFrame**

In [9]:
# Raw string, so the backslashes in the Windows-style path are not read as escape sequences
df_anomalies.to_csv(r"..\Data\movies_dataset_anomalies.csv")
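One detail worth knowing about `to_csv`: by default it also writes the index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference using in-memory buffers instead of files; the one-row DataFrame is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({"title": ["Midnight Man"], "vote_count": [1]}, index=[19730])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out, so the columns round-trip unchanged
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
cols_clean = pd.read_csv(no_index).columns.tolist()
```

Whether to keep the index depends on the case: here the original row labels identify the anomalous rows, so keeping them can be a deliberate choice.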

**4.3. Removing the rows identified as anomalies from the original DataFrame**

In [10]:
df_movies = df_movies.drop(df_movies[mistake_rows].index)

**4.4. The original DataFrame is inspected again to validate the changes**

In [11]:
df_movies.sort_values(by=['adult'], ascending=True)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862.0,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
30301,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,270886.0,tt3483194,fr,Tu dors Nicole,Making the most of the family home while her p...,...,2014-08-22,0.0,93.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,You're Sleeping Nicole,False,6.8,19.0
30302,False,,14500000,"[{'id': 18, 'name': 'Drama'}]",http://themoviefreedom.com/,288980.0,tt2584018,en,Freedom,Two men separated by 100 years are united in t...,...,2014-08-21,0.0,98.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,John Newton's Amazing Grace,Freedom,False,5.8,9.0
30303,False,,0,"[{'id': 18, 'name': 'Drama'}]",,69352.0,tt1309178,tr,Bizim Büyük Çaresizliğimiz,The peaceful cohabitation of two 30-something ...,...,2011-01-01,0.0,102.0,"[{'iso_639_1': 'tr', 'name': 'Türkçe'}]",Released,,Our Grand Despair,False,6.0,7.0
30304,False,,123690,"[{'id': 35, 'name': 'Comedy'}, {'id': 12, 'nam...",,212481.0,tt2902898,en,Ashens and the Quest for the Gamechild,Ashens is going on a quest to find the legenda...,...,2013-08-08,0.0,88.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Ashens and the Quest for the Gamechild,False,5.0,14.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19489,True,,0,"[{'id': 27, 'name': 'Horror'}]",,5422.0,tt0079642,it,Le notti erotiche dei morti viventi,A sailor takes an American businessman and his...,...,1980-11-18,0.0,112.0,"[{'iso_639_1': 'it', 'name': 'Italiano'}]",Released,,Erotic Nights of the Living Dead,False,2.2,7.0
32113,True,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",http://www.dietofsex.com/,324230.0,tt3094816,es,Diet of Sex,Ágata suffers from a psychological disorder wh...,...,2014-02-14,0.0,72.0,"[{'iso_639_1': 'es', 'name': 'Español'}]",Released,"Comedy, food, drama and sex, a lot of sex",Diet of Sex,False,4.0,12.0
31934,True,,0,"[{'id': 35, 'name': 'Comedy'}]",,44781.0,tt0322232,cn,發電悄嬌娃,Electrical Girl centers around a horny young w...,...,2001-04-26,0.0,89.0,"[{'iso_639_1': 'cn', 'name': '广州话 / 廣州話'}]",Released,,Electrical Girl,False,0.0,0.0
39902,True,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 10402, '...",,55774.0,tt1153101,en,The Band,Australian film about a fictional sub-par Aust...,...,2009-11-17,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Sex, drugs and Rock 'n Roll",The Band,False,3.3,7.0


**5.** With the anomalous rows removed, the transformation continues by reordering the columns so that `"id"` becomes the first variable.

In [12]:
df_movies = df_movies.reindex(columns=['id',
                                        'adult',
                                        'belongs_to_collection',
                                        'budget',
                                        'genres',
                                        'homepage',
                                        'imdb_id',
                                        'original_language',
                                        'original_title',
                                        'overview',
                                        'popularity',
                                        'poster_path',
                                        'production_companies',
                                        'production_countries',
                                        'release_date',
                                        'revenue',
                                        'runtime',
                                        'spoken_languages',
                                        'status',
                                        'tagline',
                                        'title',
                                        'video',
                                        'vote_average',
                                        'vote_count'])
df_movies

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,862.0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,8844.0,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,15602.0,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,31357.0,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,11862.0,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,439050.0,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,111109.0,False,,0,"[{'id': 18, 'name': 'Drama'}]",,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,67758.0,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,227506.0,False,,0,[],,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0
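Listing every column by hand works, but it has to be updated whenever a column is added or renamed. As a hedged alternative, the sketch below moves one column to the front while keeping the rest in their existing order; the small DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "adult": ["False"],
    "budget": ["30000000"],
    "id": [862.0],
    "title": ["Toy Story"],
})

# Put 'id' first and keep every other column in its current relative order
first = "id"
ordered = [first] + [c for c in df.columns if c != first]
df = df[ordered]
```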


At this point the new anomalies dataset is transformed first, before continuing with the transformation of the full data...

**6.** Once the new `anomalies` dataset is ready, its data is loaded into a new DataFrame:

**6.1.** The new DataFrame is created from the transformed anomalies.

In [13]:
# Raw string, so the backslashes in the Windows-style path are not read as escape sequences
df_movies_anomalies = pd.read_csv(r"..\Data\movies_newdataset_anomalies.csv")

**6.2.** The DataFrame is inspected to validate the data.

In [14]:
df_movies_anomalies.head()

Unnamed: 0.1,Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,0,0,0,0,0,0,0,0,0,0,...,1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Midnight Man,False,6.0,1
1,1,0,0,0,0,0,1,0,0,0,...,2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Mardock Scramble: The Third Exhaust,False,7.0,12
2,2,0,0,0,0,0,4,0,0,0,...,2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Beware Of Frost Bites,Avalanche Sharks,False,4.3,22


**6.3.** The variables of the new DataFrame are arranged in the same order as those of the original `movies` DataFrame.

In [15]:
df_movies_anomalies = df_movies_anomalies.reindex(columns=['id',
                                        'adult',
                                        'belongs_to_collection',
                                        'budget',
                                        'genres',
                                        'homepage',
                                        'imdb_id',
                                        'original_language',
                                        'original_title',
                                        'overview',
                                        'popularity',
                                        'poster_path',
                                        'production_companies',
                                        'production_countries',
                                        'release_date',
                                        'revenue',
                                        'runtime',
                                        'spoken_languages',
                                        'status',
                                        'tagline',
                                        'title',
                                        'video',
                                        'vote_average',
                                        'vote_count'])
df_movies_anomalies

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,0,0,0,0,0,0,0,0,- Written by Ørnås,0,...,1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Midnight Man,False,6.0,1
1,1,0,0,0,0,0,0,0,Rune Balot goes to a casino connected to the ...,0,...,2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Mardock Scramble: The Third Exhaust,False,7.0,12
2,4,0,0,0,0,0,0,0,Avalanche Sharks tells the story of a bikini ...,0,...,2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Beware Of Frost Bites,Avalanche Sharks,False,4.3,22


**6.4.** With both DataFrames ready, they are concatenated so the `Transform` process can continue on the full data.

In [16]:
newdf_movies = pd.concat([df_movies, df_movies_anomalies], ignore_index=True)

**6.5.** The DataFrame is inspected to validate the data and confirm that the information was concatenated.

In [17]:
newdf_movies.tail()

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
45461,227506.0,False,,0,[],,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0
45462,461257.0,False,,0,[],,tt6980792,en,Queerama,50 years after decriminalisation of homosexual...,...,2017-06-09,0.0,75.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Queerama,False,0.0,0.0
45463,0.0,0,0.0,0,0,0.0,0,0,- Written by Ørnås,0,...,1997-08-20,0.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Midnight Man,False,6.0,1.0
45464,1.0,0,0.0,0,0,0.0,0,0,Rune Balot goes to a casino connected to the ...,0,...,2012-09-29,0.0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,Mardock Scramble: The Third Exhaust,False,7.0,12.0
45465,4.0,0,0.0,0,0,0.0,0,0,Avalanche Sharks tells the story of a bikini ...,0,...,2014-01-01,0.0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Beware Of Frost Bites,Avalanche Sharks,False,4.3,22.0


**7.1.** The data type of the _**'id'**_ column is changed from _`"float"`_ (the result of `pd.to_numeric` in step 4.1, as the `862.0`-style values above show) to _`"int"`_.

In [18]:
newdf_movies['id'] = newdf_movies['id'].astype('int')
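The cast above works because no missing ids remain after the anomalous rows were handled. As a small sketch of what happens otherwise, the snippet below shows that a plain integer cast fails when `NaN` is present, and that pandas' nullable `Int64` dtype is one possible alternative. The Series are made up for illustration.

```python
import numpy as np
import pandas as pd

# No missing values: the float ids convert cleanly
ids = pd.Series([862.0, 8844.0, 15602.0])
as_int = ids.astype("int64")

# With a missing value, a plain integer cast raises an error
with_gap = pd.Series([862.0, np.nan, 15602.0])
try:
    with_gap.astype("int64")
    failed = False
except (ValueError, TypeError):
    failed = True

# The nullable integer dtype keeps the gap as <NA> instead of failing
nullable = with_gap.astype("Int64")
```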

**8.** At this point the un-nesting process begins for the columns whose values are nested lists, dictionaries, or both.

**8.1.** A function is created to un-nest the 'belongs_to_collection' column, as follows:

It checks whether the given value is a valid dictionary or a string representing a dictionary.
If it is a valid dictionary, it returns the value for the specified key;
otherwise, it returns None.

_**`Args`**_
    diccionario (dict or str): The dictionary, or the string representing a dictionary.
    clave (str): The key whose value is wanted.

_**`Returns:`**_
    object: The value for the specified key, or None if it is not found.

In [19]:
def obtener_valor(diccionario, clave):

    if pd.isnull(diccionario):
        return None
    if not isinstance(diccionario, dict):
        try:
            diccionario = json.loads(diccionario.replace("'", "\""))
        except (json.JSONDecodeError, AttributeError):
            return None
    if isinstance(diccionario, dict):
        return diccionario.get(clave)
    else:
        return None
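One limitation of swapping single quotes for double quotes before calling `json.loads` is that any value containing an apostrophe, or a Python literal such as `None`, no longer parses, so the function returns `None` for that row. The sketch below is a hypothetical variant (the name `get_value_literal` is not part of the original notebook) that parses the Python-style literal with `ast.literal_eval` instead. The sample string is made up for illustration.

```python
import ast

def get_value_literal(raw, key):
    """Hypothetical variant of obtener_valor based on ast.literal_eval."""
    if isinstance(raw, dict):
        return raw.get(key)
    if not isinstance(raw, str):
        return None  # covers NaN/None and any other non-string cell
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    # Only dictionaries carry keys; anything else (e.g. the literal 0) yields None
    return parsed.get(key) if isinstance(parsed, dict) else None

# Made-up cell value with an apostrophe and a Python None inside
sample = "{'id': 1, 'name': \"Ocean's Collection\", 'poster_path': None}"
name = get_value_literal(sample, "name")
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts, `None`, and so on), so it reads these cells as they were written without executing arbitrary code.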

+ Apply the function to the "belongs_to_collection" column.

In [20]:
newdf_movies['id_btc'] = newdf_movies['belongs_to_collection'].apply(lambda x: obtener_valor(x, 'id'))
newdf_movies['name_btc'] = newdf_movies['belongs_to_collection'].apply(lambda x: obtener_valor(x, 'name'))
newdf_movies['poster_btc'] = newdf_movies['belongs_to_collection'].apply(lambda x: obtener_valor(x, 'poster_path'))
newdf_movies['backdrop_btc'] = newdf_movies['belongs_to_collection'].apply(lambda x: obtener_valor(x, 'backdrop_path'))

+ Select the new columns to review the extracted values.

In [21]:
result = newdf_movies[['id_btc', 'name_btc', 'poster_btc', 'backdrop_btc']]

+ The DataFrame is inspected to validate the data.

_The new columns are created at the end of the DataFrame._

In [22]:
newdf_movies.head(3)

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,id_btc,name_btc,poster_btc,backdrop_btc
0,862,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg
1,8844,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,,,
2,15602,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg


+ Once it has been verified that the columns were created as expected, the original column is dropped.

In [23]:
newdf_movies.drop(['belongs_to_collection'], axis=1, inplace=True)

**8.2.** A function is created to un-nest the 'genres' column, as follows:

Converts a string into a list of dictionaries using ast.literal_eval(); if the string cannot be parsed, it returns an empty list.

Args:
    x (str): String to convert.

Returns:
    list: Resulting list of dictionaries.

In [24]:
def desanidar_genres(x):
    try:
        return ast.literal_eval(x)
    except (SyntaxError, ValueError):
        return []

+ Convert the text strings in the "genres" column into lists of dictionaries

In [25]:
newdf_movies['genres'] = newdf_movies['genres'].apply(desanidar_genres)

+ Create the new columns extracted from the list

In [26]:
newdf_movies['genres_id'] = newdf_movies['genres'].apply(lambda x: ', '.join(str(pc['id']) for pc in x))
newdf_movies['genres_name'] = newdf_movies['genres'].apply(lambda x: ', '.join(pc['name'] for pc in x))

result = newdf_movies[['genres_id', 'genres_name']]
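Joining the genre names into one comma-separated string is convenient for display. For counting or filtering by genre, a long format with one row per movie and genre can be easier to work with. The following is a sketch of that alternative using `DataFrame.explode` on a made-up two-row DataFrame; the column name `genre_name` is an assumption for illustration.

```python
import ast
import pandas as pd

movies = pd.DataFrame({
    "title": ["Toy Story", "Father of the Bride Part II"],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",
        "[{'id': 35, 'name': 'Comedy'}]",
    ],
})

# Parse each string into a list of dictionaries
movies["genres"] = movies["genres"].apply(ast.literal_eval)

# One row per (movie, genre) pair; rows with empty lists would become NaN
# and would need to be dropped before the next step
long_form = movies.explode("genres", ignore_index=True)
long_form["genre_name"] = long_form["genres"].apply(lambda g: g["name"])

# Number of movies per genre
genre_counts = long_form["genre_name"].value_counts()
```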

+ The DataFrame is inspected to validate the data.

_The new columns are created at the end of the DataFrame._

In [27]:
newdf_movies.head(5)

Unnamed: 0,id,adult,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,title,video,vote_average,vote_count,id_btc,name_btc,poster_btc,backdrop_btc,genres_id,genres_name
0,862,False,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,...,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"16, 35, 10751","Animation, Comedy, Family"
1,8844,False,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,...,Jumanji,False,6.9,2413.0,,,,,"12, 14, 10751","Adventure, Fantasy, Family"
2,15602,False,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,...,Grumpier Old Men,False,6.5,92.0,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"10749, 35","Romance, Comedy"
3,31357,False,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,...,Waiting to Exhale,False,6.1,34.0,,,,,"35, 18, 10749","Comedy, Drama, Romance"
4,11862,False,0,"[{'id': 35, 'name': 'Comedy'}]",,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,...,Father of the Bride Part II,False,5.7,173.0,96871.0,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg,35,Comedy


+ Once it has been verified that the columns were created as expected, the original column is dropped.

In [28]:
newdf_movies.drop(['genres'], axis=1, inplace=True)

**8.3.** A function is created to un-nest the 'production_companies' column, as follows:

Converts a string into a list of dictionaries using ast.literal_eval(); if the string cannot be parsed, it returns an empty list.

Args:
    x (str): String to convert.

Returns:
    list: Resulting list of dictionaries.

In [29]:
def desanidar_production_companies(x):
    try:
        return ast.literal_eval(x)
    except (SyntaxError, ValueError):
        return []

+ Convert the text strings in the "production_companies" column into lists of dictionaries

In [30]:
newdf_movies['production_companies'] = newdf_movies['production_companies'].apply(desanidar_production_companies)

+ Create the new columns extracted from the list

In [31]:
newdf_movies['ption_companies_id'] = newdf_movies['production_companies'].apply(lambda x: ', '.join(str(pc['id']) for pc in x))
newdf_movies['ption_companies_name'] = newdf_movies['production_companies'].apply(lambda x: ', '.join(pc['name'] for pc in x))

result = newdf_movies[['ption_companies_id', 'ption_companies_name']]

+ The DataFrame is inspected to validate the data.

_The new columns are created at the end of the DataFrame._

In [32]:
newdf_movies.head(3)

Unnamed: 0,id,adult,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,...,vote_average,vote_count,id_btc,name_btc,poster_btc,backdrop_btc,genres_id,genres_name,ption_companies_id,ption_companies_name
0,862,False,30000000,http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"16, 35, 10751","Animation, Comedy, Family",3,Pixar Animation Studios
1,8844,False,65000000,,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,6.9,2413.0,,,,,"12, 14, 10751","Adventure, Fantasy, Family","559, 2550, 10201","TriStar Pictures, Teitler Film, Interscope Com..."
2,15602,False,0,,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,6.5,92.0,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"10749, 35","Romance, Comedy","6194, 19464","Warner Bros., Lancaster Gate"


+ Once the new columns have been verified to match what was expected, the original column is dropped.

In [33]:
newdf_movies.drop(['production_companies'], axis=1, inplace=True)
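
The parse-then-join pattern used in this step can be tried on a small frame before running it on the full dataset. The sketch below is only illustrative: the `toy` DataFrame and the `parse_list` helper are made up for the example, and `TypeError` is caught in addition to the exceptions handled in the notebook as an extra precaution.

```python
import ast

import pandas as pd

# Toy rows imitating the stringified lists stored in "production_companies".
toy = pd.DataFrame({
    "production_companies": [
        "[{'name': 'Pixar Animation Studios', 'id': 3}]",
        "[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]",
        None,  # missing value
    ]
})


def parse_list(x):
    """Return the parsed Python literal, or [] when the cell cannot be parsed."""
    try:
        return ast.literal_eval(x)
    except (SyntaxError, ValueError, TypeError):
        return []


parsed = toy["production_companies"].apply(parse_list)

# One comma-separated string per row for each key of interest.
toy["ption_companies_id"] = parsed.apply(lambda items: ", ".join(str(d["id"]) for d in items))
toy["ption_companies_name"] = parsed.apply(lambda items: ", ".join(d["name"] for d in items))

print(toy[["ption_companies_id", "ption_companies_name"]])
```

Rows that cannot be parsed end up as empty strings rather than NaN, which is worth keeping in mind when counting missing values later.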

**8.4.** A function is created to un-nest the 'production_countries' column, as follows:

Converts a string into a list of dictionaries using ast.literal_eval().

    Args:
        x (str): String to convert.

    Returns:
        list: Resulting list of dictionaries.

In [None]:
def desanidar_production_countries(x):
    try:
        return ast.literal_eval(x)
    except (SyntaxError, ValueError):
        return []

+ Convert the strings in the "production_countries" column into lists of dictionaries.

In [None]:
newdf_movies['production_countries'] = newdf_movies['production_countries'].apply(desanidar_production_countries)

+ Create the new columns extracted from the list.

In [None]:
newdf_movies['country_code'] = newdf_movies['production_countries'].apply(lambda x: ', '.join(str(pc['iso_3166_1']) for pc in x))
newdf_movies['country_name'] = newdf_movies['production_countries'].apply(lambda x: ', '.join(pc['name'] for pc in x))
result = newdf_movies[['country_code', 'country_name']]

+ The DataFrame is inspected to validate the data.

_The new columns are added at the end of the DataFrame._

In [34]:
newdf_movies.head(3)

Unnamed: 0,id,adult,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,...,vote_average,vote_count,id_btc,name_btc,poster_btc,backdrop_btc,genres_id,genres_name,ption_companies_id,ption_companies_name
0,862,False,30000000,http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"16, 35, 10751","Animation, Comedy, Family",3,Pixar Animation Studios
1,8844,False,65000000,,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,6.9,2413.0,,,,,"12, 14, 10751","Adventure, Fantasy, Family","559, 2550, 10201","TriStar Pictures, Teitler Film, Interscope Com..."
2,15602,False,0,,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,6.5,92.0,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"10749, 35","Romance, Comedy","6194, 19464","Warner Bros., Lancaster Gate"


+ Once the new columns have been verified to match what was expected, the original column is dropped.

In [35]:
newdf_movies.drop(['production_countries'], axis=1, inplace=True)

**8.5.** A function is created to un-nest the 'spoken_languages' column, as follows:

Converts a string into a list of dictionaries using ast.literal_eval().

    Args:
        x (str): String to convert.

    Returns:
        list: Resulting list of dictionaries.

In [36]:
def desanidar_languages(x):
    try:
        return ast.literal_eval(x)
    except (SyntaxError, ValueError):
        return []

+ Convert the strings in the "spoken_languages" column into lists of dictionaries.

In [37]:
newdf_movies['spoken_languages'] = newdf_movies['spoken_languages'].apply(desanidar_languages)

+ Create the new columns extracted from the list.

In [38]:
newdf_movies['languages_code'] = newdf_movies['spoken_languages'].apply(lambda x: ', '.join(str(pc['iso_639_1']) for pc in x))
newdf_movies['languages_name'] = newdf_movies['spoken_languages'].apply(lambda x: ', '.join(pc['name'] for pc in x))
result = newdf_movies[['languages_code', 'languages_name']]
newdf_movies.head(3)

Unnamed: 0,id,adult,budget,homepage,imdb_id,original_language,original_title,overview,popularity,poster_path,...,id_btc,name_btc,poster_btc,backdrop_btc,genres_id,genres_name,ption_companies_id,ption_companies_name,languages_code,languages_name
0,862,False,30000000,http://toystory.disney.com/toy-story,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"16, 35, 10751","Animation, Comedy, Family",3,Pixar Animation Studios,en,English
1,8844,False,65000000,,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,,,,,"12, 14, 10751","Adventure, Fantasy, Family","559, 2550, 10201","TriStar Pictures, Teitler Film, Interscope Com...","en, fr","English, Français"
2,15602,False,0,,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"10749, 35","Romance, Comedy","6194, 19464","Warner Bros., Lancaster Gate",en,English


+ Once the new columns have been verified to match what was expected, the original column is dropped.

In [39]:
newdf_movies.drop(['spoken_languages'], axis=1, inplace=True)
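
Since the un-nesting helpers in steps 8.4 and 8.5 share the same body, the whole sequence (parse, build the joined columns, drop the original) could be folded into one reusable function. The following is a minimal sketch under that assumption; `parse_dict_list`, `unnest_column` and the `demo` frame are hypothetical names that do not exist in the notebook. It also guards against cells that parse to something other than a list, which would otherwise break the join step.

```python
import ast

import pandas as pd


def parse_dict_list(x):
    """Parse a stringified list of dicts; fall back to [] for anything else."""
    try:
        value = ast.literal_eval(x)
    except (SyntaxError, ValueError, TypeError):
        return []
    # A valid literal that is not a list (e.g. "False") cannot be iterated as dicts.
    return value if isinstance(value, list) else []


def unnest_column(df, column, fields):
    """Return a copy of df with `column` replaced by one joined column per field.

    `fields` maps a dictionary key to the name of the new column.
    """
    parsed = df[column].apply(parse_dict_list)
    out = df.drop(columns=[column])
    for key, new_name in fields.items():
        out[new_name] = parsed.apply(
            lambda items, k=key: ", ".join(
                str(d[k]) for d in items if isinstance(d, dict) and k in d
            )
        )
    return out


demo = pd.DataFrame({
    "spoken_languages": [
        "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",
        "False",       # parses, but is not a list
        float("nan"),  # missing value
    ]
})
demo = unnest_column(
    demo,
    "spoken_languages",
    {"iso_639_1": "languages_code", "name": "languages_name"},
)
print(demo)
```

The same call, with a different column name and key mapping, would cover 'production_companies' and 'production_countries' as well.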

**9.** Null values in the **`revenue`** and **`budget`** fields are filled with **`0`**.

+ Convert the "budget" column to float64.

In [40]:
newdf_movies["budget"] = pd.to_numeric(newdf_movies["budget"], errors="coerce")
newdf_movies["budget"] = newdf_movies["budget"].astype(float)

+ Convert the "revenue" column to float64.

In [41]:
newdf_movies["revenue"] = pd.to_numeric(newdf_movies["revenue"], errors="coerce")
newdf_movies["revenue"] = newdf_movies["revenue"].astype(float)

+ Replace the null values in both columns.

In [42]:
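# Assign the result back instead of calling fillna(inplace=True) on a selected
# column, which newer pandas versions warn about and may not apply to the DataFrame.
newdf_movies["revenue"] = newdf_movies["revenue"].fillna(0)
newdf_movies["budget"] = newdf_movies["budget"].fillna(0)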
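
To see what `errors="coerce"` does before the fill, the conversion can be run on a handful of values. This is just a sketch with made-up inputs; the non-numeric string is a hypothetical example of a malformed cell.

```python
import pandas as pd

# Made-up budget values: two numeric strings, one malformed string, one missing value.
raw_budget = pd.Series(["30000000", "65000000", "/path/to/poster.jpg", None])

# Values that cannot be parsed as numbers become NaN instead of raising an error.
budget = pd.to_numeric(raw_budget, errors="coerce")

# NaN (both the malformed and the missing entry) is then replaced with 0.
budget = budget.fillna(0)

print(budget)
```

Note that after this step a budget of 0 can mean either "reported as zero" or "unknown", so the two cases are no longer distinguishable.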

**10.** Drop the columns that will not be used: **`video`**, **`imdb_id`**, **`adult`**, **`original_title`**, **`poster_path`** and **`homepage`**.

In [43]:
newdf_movies.drop(['video','imdb_id', 'adult', 'original_title', 'poster_path', 'homepage'], axis=1, inplace=True)

**11.** Rows with null values in the **`release_date`** field are dropped.

+ First, the number of null values in the **`release_date`** column is counted and printed.

In [44]:
cantidad_nulos = newdf_movies['release_date'].isnull().sum()
print("Number of null values in the 'release_date' column:", cantidad_nulos)

Number of null values in the 'release_date' column: 87


+ The anomalous records are set aside for a later manual review.

In [47]:
fechas_nulas = newdf_movies["release_date"].isnull()
df_anomalies_fechas = newdf_movies[fechas_nulas]
df_anomalies_fechas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 87 entries, 711 to 45458
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    87 non-null     int32  
 1   budget                87 non-null     float64
 2   original_language     87 non-null     object 
 3   overview              74 non-null     object 
 4   popularity            84 non-null     object 
 5   release_date          0 non-null      object 
 6   revenue               87 non-null     float64
 7   runtime               73 non-null     float64
 8   status                83 non-null     object 
 9   tagline               14 non-null     object 
 10  title                 84 non-null     object 
 11  vote_average          84 non-null     float64
 12  vote_count            84 non-null     float64
 13  id_btc                3 non-null      float64
 14  name_btc              3 non-null      object 
 15  poster_btc          

+ Now the rows identified as having a null release date are dropped.

In [48]:
newdf_movies = newdf_movies.drop(newdf_movies[fechas_nulas].index)
cantidad_nulos = newdf_movies['release_date'].isnull().sum()
cantidad_nulos

0
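
The mask-based split used in this step can be sketched on a toy frame. The data below is invented for illustration; the point is that a single boolean mask yields both the rows kept for review and the cleaned frame.

```python
import pandas as pd

# Invented rows: two of them have no release date.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "release_date": ["1995-10-30", None, "1995-12-22", None],
})

is_null = toy["release_date"].isnull()

# Rows set aside for manual review, and the rows that continue through the pipeline.
anomalies = toy[is_null].copy()
clean = toy[~is_null].copy()

print(len(anomalies), "rows set aside;", len(clean), "rows kept")
```

`toy.dropna(subset=["release_date"])` would give the same cleaned frame when the anomalous rows do not need to be kept.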

**12.** The **`release_date`** column is converted to a datetime type, so that it follows the **`YYYY-mm-dd`** format, and the release year is extracted into the new **`release_year`** column.

In [49]:
newdf_movies['release_date']=pd.to_datetime(newdf_movies['release_date'])
newdf_movies['release_year']=newdf_movies['release_date'].dt.year
newdf_movies.head(3)

Unnamed: 0,id,budget,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,name_btc,poster_btc,backdrop_btc,genres_id,genres_name,ption_companies_id,ption_companies_name,languages_code,languages_name,release_year
0,862,30000000.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,,...,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"16, 35, 10751","Animation, Comedy, Family",3,Pixar Animation Studios,en,English,1995
1,8844,65000000.0,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,...,,,,"12, 14, 10751","Adventure, Fantasy, Family","559, 2550, 10201","TriStar Pictures, Teitler Film, Interscope Com...","en, fr","English, Français",1995
2,15602,0.0,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,...,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"10749, 35","Romance, Comedy","6194, 19464","Warner Bros., Lancaster Gate",en,English,1995
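
A datetime column does not store a text format; pandas simply displays it as `YYYY-mm-dd`. The sketch below, on made-up values, shows the conversion with an explicit format and `errors="coerce"`, which are assumptions added here so that an unparseable string becomes `NaT` instead of raising.

```python
import pandas as pd

# Made-up values: two valid dates and one string that is not a date.
dates = pd.Series(["1995-10-30", "1995-12-15", "not a date"])

# Parse with an explicit format; anything that does not match becomes NaT.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# The year is available through the .dt accessor (NaN where the date is NaT).
years = parsed.dt.year

# If a text representation is needed again, strftime produces it on demand.
formatted = parsed.dt.strftime("%Y-%m-%d")

print(parsed)
print(years)
```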


**13.** The return-on-investment column, named **`return`**, is created from the **`revenue`** and **`budget`** fields by dividing one by the other, **`revenue / budget`**. When the data needed to compute it is not available, it takes the value **`0`**.

+ Any NaN left in `budget` is filled with 1 so that the division is defined for those rows (after step 9 there should be none).

+ A budget of 0 produces inf (non-zero revenue) or NaN (0 / 0); both are replaced with 0.

In [50]:
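# Divide revenue by budget; any NaN left in budget is treated as 1.
newdf_movies['return'] = newdf_movies['revenue'] / newdf_movies['budget'].fillna(1)
# x / 0 yields inf: replace it with 0.
newdf_movies['return'] = newdf_movies['return'].replace([np.inf, -np.inf], 0)
newdf_movies['return'] = pd.to_numeric(newdf_movies['return'], errors='coerce')
newdf_movies['return'] = newdf_movies['return'].astype(float)
# 0 / 0 yields NaN: replace it with 0 as well (assigned back rather than inplace).
newdf_movies['return'] = newdf_movies['return'].fillna(0)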

+ The **`return`** column is rounded to two decimal places.

In [54]:
newdf_movies['return'] = newdf_movies['return'].round(2)
newdf_movies.head(3)

Unnamed: 0,id,budget,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,poster_btc,backdrop_btc,genres_id,genres_name,ption_companies_id,ption_companies_name,languages_code,languages_name,release_year,return
0,862,30000000.0,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,,...,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"16, 35, 10751","Animation, Comedy, Family",3,Pixar Animation Studios,en,English,1995,12.45
1,8844,65000000.0,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,...,,,"12, 14, 10751","Adventure, Fantasy, Family","559, 2550, 10201","TriStar Pictures, Teitler Film, Interscope Com...","en, fr","English, Français",1995,4.04
2,15602,0.0,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,...,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"10749, 35","Romance, Comedy","6194, 19464","Warner Bros., Lancaster Gate",en,English,1995,0.0
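
An alternative way to express the same rule, shown only as a sketch and not as the notebook's method, is to divide solely where the budget is positive and leave 0 everywhere else. This avoids producing inf or NaN in the first place. The `toy` frame reuses the revenue and budget figures of the first two rows above plus two invented zero-budget rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "revenue": [373554033.0, 262797249.0, 0.0, 500.0],
    "budget": [30000000.0, 65000000.0, 0.0, 0.0],
})

# Only rows with a positive budget have a meaningful ratio.
valid = toy["budget"] > 0

# Start from 0 and fill in the ratio where it can be computed.
toy["return"] = 0.0
toy.loc[valid, "return"] = (toy.loc[valid, "revenue"] / toy.loc[valid, "budget"]).round(2)

print(toy)
```

Both approaches assign 0 to rows with a zero budget, so a return of 0 should be read as "not computable" rather than as a total loss.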


**14.** The DataFrame is sorted by the **`id`** column in ascending order.

In [55]:
newdf_movies.sort_values(by=['id'], ascending=True, inplace=True)
newdf_movies

Unnamed: 0,id,budget,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,poster_btc,backdrop_btc,genres_id,genres_name,ption_companies_id,ption_companies_name,languages_code,languages_name,release_year,return
45463,0,0.0,0,0,0.065736,1997-08-20,0.0,104.0,Released,,...,,,,,"11176, 11602, 29812","Carousel Productions, Vision View Entertainmen...",en,English,1997,0.0
45464,1,0.0,0,0,1.931659,2012-09-29,0.0,68.0,Released,,...,,,,,"2883, 7759, 7760, 7761, 33751","Aniplex, GoHands, BROSTA TV, Mardock Scramble ...",ja,日本語,2012,0.0
4342,2,0.0,fi,Taisto Kasurinen is a Finnish coal miner whose...,3.860491,1988-10-21,0.0,69.0,Released,,...,,,"18, 80","Drama, Crime","2303, 2396","Villealfa Filmproduction Oy, Finnish Film Foun...","fi, de","suomi, Deutsch",1988,0.0
12947,3,0.0,fi,"An episode in the life of Nikander, a garbage ...",2.29211,1986-10-16,0.0,76.0,Released,,...,,,"18, 35","Drama, Comedy",2303,Villealfa Filmproduction Oy,"en, fi, sv","English, suomi, svenska",1986,0.0
45465,4,0.0,0,0,2.185485,2014-01-01,0.0,82.0,Released,Beware Of Frost Bites,...,,,,,"17161, 18012, 18013, 23822","Odyssey Media, Pulser Productions, Rogue State...",en,English,2014,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45075,465044,0.0,en,A horror comedy spoofing conspiracy theory mov...,0.281008,2017-06-28,0.0,90.0,Released,Horrifically Funny,...,,,"14, 18","Fantasy, Drama",,,en,English,2017,0.0
45270,467731,0.0,en,Fifteen-year-old girl Dotty Fisher is assaulte...,0.001189,1956-02-19,0.0,60.0,Released,,...,,,18,Drama,,,en,English,1956,0.0
21890,468343,0.0,fi,"In the 1910s, beautiful young Silja loses both...",0.001202,1956-01-01,0.0,87.0,Released,,...,,,"18, 10749","Drama, Romance",,,,,1956,0.0
45395,468707,1254040.0,fi,,0.347806,2017-07-28,0.0,90.0,Released,,...,,,"10749, 35","Romance, Comedy",84883,Elokuvayhtiö Oy Aamu,fi,suomi,2017,0.0


**15.** Rows with duplicated values in the **`id`** column are removed.

In [56]:
newdf_movies.drop_duplicates(subset='id', inplace=True)
duplicados_count = newdf_movies['id'].duplicated().sum()
print("Number of duplicates in the 'id' column:", duplicados_count)

Number of duplicates in the 'id' column: 0
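
Counting the duplicates before dropping them gives a quick sense of how many rows are affected. A minimal sketch on invented rows follows; `keep="first"` is the pandas default and is spelled out here only to make explicit which copy survives.

```python
import pandas as pd

# Invented rows: id 862 appears twice.
toy = pd.DataFrame({
    "id": [862, 8844, 862, 15602],
    "title": ["Toy Story", "Jumanji", "Toy Story (copy)", "Grumpier Old Men"],
})

# Number of rows whose id already appeared earlier in the frame.
n_dupes_before = int(toy["id"].duplicated().sum())

# Keep the first occurrence of each id and drop the rest.
deduped = toy.drop_duplicates(subset="id", keep="first")

print(n_dupes_before, "duplicate row(s) removed")
```

Because the first occurrence is kept, the row order at the time of the call (here, after sorting by `id`) decides which copy remains.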


**16.** The **`poster_btc`** and **`backdrop_btc`** columns are dropped, since they are not needed for the rest of the analysis.

In [57]:
newdf_movies.drop(['poster_btc', 'backdrop_btc'], axis=1, inplace=True)

**17.** The columns are reordered to make the information easier to read.

In [58]:
newdf_movies = newdf_movies.reindex(columns=['id',
                                            'title', 
                                            'overview',
                                            'tagline',
                                            'id_btc',
                                            'name_btc',
                                            'release_date',
                                            'release_year',
                                            'budget',
                                            'revenue',
                                            'return',
                                            'genres_id',
                                            'genres_name',
                                            'ption_companies_id',
                                            'ption_companies_name',
                                            'country_code',
                                            'country_name',
                                            'original_language',
                                            'languages_code',
                                            'languages_name',
                                            'runtime',
                                            'vote_average',
                                            'popularity',
                                            'vote_count',
                                            'status',])
newdf_movies

Unnamed: 0,id,title,overview,tagline,id_btc,name_btc,release_date,release_year,budget,revenue,...,country_code,country_name,original_language,languages_code,languages_name,runtime,vote_average,popularity,vote_count,status
45463,0,Midnight Man,0,,,,1997-08-20,1997,0.0,0.0,...,,,0,en,English,104.0,6.0,0.065736,1.0,Released
45464,1,Mardock Scramble: The Third Exhaust,0,,,,2012-09-29,2012,0.0,0.0,...,,,0,ja,日本語,68.0,7.0,1.931659,12.0,Released
4342,2,Ariel,Taisto Kasurinen is a Finnish coal miner whose...,,,,1988-10-21,1988,0.0,0.0,...,,,fi,"fi, de","suomi, Deutsch",69.0,7.1,3.860491,44.0,Released
12947,3,Shadows in Paradise,"An episode in the life of Nikander, a garbage ...",,,,1986-10-16,1986,0.0,0.0,...,,,fi,"en, fi, sv","English, suomi, svenska",76.0,7.1,2.29211,35.0,Released
45465,4,Avalanche Sharks,0,Beware Of Frost Bites,,,2014-01-01,2014,0.0,0.0,...,,,0,en,English,82.0,4.3,2.185485,22.0,Released
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45075,465044,Abduction,A horror comedy spoofing conspiracy theory mov...,Horrifically Funny,,,2017-06-28,2017,0.0,0.0,...,,,en,en,English,90.0,0.0,0.281008,0.0,Released
45270,467731,Tragedy in a Temporary Town,Fifteen-year-old girl Dotty Fisher is assaulte...,,,,1956-02-19,1956,0.0,0.0,...,,,en,en,English,60.0,0.0,0.001189,0.0,Released
21890,468343,Silja - nuorena nukkunut,"In the 1910s, beautiful young Silja loses both...",,,,1956-01-01,1956,0.0,0.0,...,,,fi,,,87.0,0.0,0.001202,0.0,Released
45395,468707,Thick Lashes of Lauri Mäntyvaara,,,,,2017-07-28,2017,1254040.0,0.0,...,,,fi,fi,suomi,90.0,8.0,0.347806,1.0,Released
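
One detail of `reindex(columns=...)` is worth knowing: a requested name that is absent from the DataFrame does not raise an error; the column is created and filled with NaN. This could explain the empty `country_code` and `country_name` values in the output above if the cells of step 8.4 were not executed. The sketch below, on a toy frame with hypothetical names, shows the behaviour and a simple check that can be run before reordering.

```python
import pandas as pd

toy = pd.DataFrame({"id": [1, 2], "title": ["Ariel", "Shadows in Paradise"]})
wanted = ["id", "title", "country_code"]  # "country_code" does not exist in toy

# reindex silently adds the missing column, filled with NaN.
reindexed = toy.reindex(columns=wanted)

# Listing the absent names first makes the problem visible.
missing = [c for c in wanted if c not in toy.columns]

print("Missing columns:", missing)
print(reindexed)
```

Selecting with `toy[wanted]` instead would raise a `KeyError` for the missing name, which is a stricter alternative when every column is expected to exist.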


**18.** The summary information of the final DataFrame is inspected.

In [59]:
newdf_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45349 entries, 45463 to 20188
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    45349 non-null  int32         
 1   title                 45349 non-null  object        
 2   overview              44408 non-null  object        
 3   tagline               20388 non-null  object        
 4   id_btc                3163 non-null   float64       
 5   name_btc              3163 non-null   object        
 6   release_date          45349 non-null  datetime64[ns]
 7   release_year          45349 non-null  int64         
 8   budget                45349 non-null  float64       
 9   revenue               45349 non-null  float64       
 10  return                45349 non-null  float64       
 11  genres_id             45349 non-null  object        
 12  genres_name           45349 non-null  object        
 13  ption_compan

# <h1 align=left>**`Load`**</h1>

**19.** With the data in the desired shape, the new dataset is created.

In [60]:
# Raw string so the backslashes in the Windows path are not read as escape sequences.
newdf_movies.to_csv(r"..\Data\movies_dataset_ETL.csv")
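
As a hedged variation on the export, the path can be built with `pathlib` so that it works on any operating system, and `index=False` can be passed so that the row index is not written as an extra unnamed column. Both choices are assumptions, not part of the original notebook, and the sketch writes to a temporary directory rather than to the project's `Data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"id": [2, 3], "title": ["Ariel", "Shadows in Paradise"]})

with tempfile.TemporaryDirectory() as tmp:
    # Path objects join segments portably, avoiding hand-written separators.
    out_path = Path(tmp) / "movies_dataset_ETL.csv"

    # index=False leaves the DataFrame index out of the file.
    toy.to_csv(out_path, index=False)

    # Reading the file back confirms that only the data columns were written.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```

If the index carries information that should be preserved, `index=False` should be omitted and `index_col=0` used when reading the file back.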