# Recommendation System Project for a Streaming Platform


Individual project, Soy Henry PT09
July 2024


### **Introduction**

Extract, Transform, Load (ETL) is a core process in data science and analytics. In this project, we ran a thorough ETL process on a set of movie-related datasets. The main goal was to prepare the data for later analysis and for use in an API by improving its quality, structure, and usefulness.

### **Objectives**

1. Integrate multiple data sources into a single, coherent dataset.
2. Clean and structure the data to make it easier to analyze.
3. Unnest complex data to make it more accessible.
4. Create new derived variables to enrich the analysis.
5. Optimize data formats and types to improve performance.

### **Libraries Used**:


- Pandas for data manipulation.
- Scikit-learn to implement the recommendation system.
- FastAPI and Uvicorn to build and deploy the API.
- Matplotlib and Seaborn for visualizations during the EDA.

This notebook walks step by step through each phase of the project, from the initial data load to the implementation and demonstration of the recommendation system. **Let's get started!**

### **ETL process, step by step**:

1. Data loading:

    - Import of the two main datasets: 'credits.csv' and 'movies_dataset.csv'.


2. Data unnesting:

    - Unnesting of complex columns such as 'belongs_to_collection', 'genres', 'production_companies', 'production_countries', 'spoken_languages', 'cast', and 'crew'.


3. Data cleaning:

    - Removal of unnecessary columns.
    - Handling of null and NaN values.
    - Removal of rows with a null release date.


4. Creation of new variables:

    - Generation of the 'return' column from 'revenue' and 'budget'.
    - Extraction of the year and month from the release date.


5. Format standardization:

    - Conversion of dates to the 'yyyy-mm-dd' format.
    - Translation of month names into Spanish.


6. Data type optimization:

    - Conversion of the 'popularity' column to float.


7. Integration of additional data:

    - Incorporation of 'crew' data from an external Excel file.


8. Data export:

    - Conversion of the final DataFrame to Parquet format to optimize storage and read efficiency.
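
The cleaning, derived-variable, formatting, and type-conversion steps listed above can be sketched on a toy DataFrame. This is a minimal illustration, not the notebook's actual code: the function name `clean_and_derive`, the column names `release_year` and `release_month`, the `MESES` mapping, and the convention of setting `return` to 0 when the budget is 0 or missing are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical month-number to Spanish-name mapping (an assumption for illustration).
MESES = {1: "enero", 2: "febrero", 3: "marzo", 4: "abril", 5: "mayo", 6: "junio",
         7: "julio", 8: "agosto", 9: "septiembre", 10: "octubre",
         11: "noviembre", 12: "diciembre"}

def clean_and_derive(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Coerce the money columns to numbers; unparseable or missing values become 0.
    for col in ("revenue", "budget"):
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0)
    # Convert popularity to float; unparseable values become NaN.
    out["popularity"] = pd.to_numeric(out["popularity"], errors="coerce")
    # Parse the dates; missing or invalid dates become NaT and those rows are dropped.
    dates = pd.to_datetime(out["release_date"], errors="coerce")
    keep = dates.notna()
    out, dates = out[keep].copy(), dates[keep]
    # Standardize the date format and extract year and (Spanish) month name.
    out["release_date"] = dates.dt.strftime("%Y-%m-%d")
    out["release_year"] = dates.dt.year
    out["release_month"] = dates.dt.month.map(MESES)
    # return = revenue / budget, set to 0 where the budget is 0 (assumed convention).
    positive_budget = out["budget"].where(out["budget"] > 0)
    out["return"] = (out["revenue"] / positive_budget).fillna(0.0)
    return out

# Made-up sample rows, only to exercise the function.
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": ["100", "0", "50"],
    "revenue": [250.0, 10.0, None],
    "popularity": ["21.9", "bad", "3.5"],
    "release_date": ["1995-10-30", None, "1995-12-15"],
})
result = clean_and_derive(sample)
print(result[["title", "release_date", "release_year", "release_month", "return"]])
```

Dividing by `budget.where(budget > 0)` avoids infinities from a zero budget: those rows yield NaN first and are then filled with 0.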

### **Conclusions**:

The ETL process turned the original data into a coherent, clean, and structured dataset. The main improvements are:

1. Unnesting of complex data structures, which makes the relevant information easier to access.
2. Creation of new variables with analytical value, such as return on investment and time-based categories.
3. Standardization of formats, especially for dates and numeric values, which improves data consistency.
4. More efficient storage and reads through conversion to Parquet format.

These transformations improve data quality and also make the data easier to use in later analyses and in the API implementation. The resulting dataset is ready to provide insights into the film industry and to serve as a solid base for future applications and analyses.

**Let's walk through** the ETL data transformation process.

# Development

# **1. TRANSFORMATIONS (ETL)**

### 1.1 Loading the datasets

In [309]:
# Import the libraries
import pandas as pd
import numpy as np

In [310]:
# Load the datasets
data1 = '/Users/felipeamezquita/Library/Mobile Documents/com~apple~CloudDocs/Documents/HENRY/PROYECTO INDIVIDUAL/EJERCICIO 1/DATASETS/Movies/credits.csv'
data2 = '/Users/felipeamezquita/Library/Mobile Documents/com~apple~CloudDocs/Documents/HENRY/PROYECTO INDIVIDUAL/EJERCICIO 1/DATASETS/Movies/movies_dataset.csv'

df1 = pd.read_csv(data1)
df2 = pd.read_csv(data2)

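
The `df2.info()` output further down reports `id`, `budget`, and `popularity` as `object`, which suggests that some rows hold non-numeric text in those columns. One way to handle this, sketched here on a tiny made-up CSV rather than the real file, is to read the ambiguous columns as strings and coerce them explicitly. The sample rows and the choice to drop rows with a non-numeric `id` are assumptions for illustration.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for movies_dataset.csv; the third row is deliberately malformed.
csv_text = """id,budget,popularity,title
862,30000000,21.946943,Toy Story
8844,65000000,17.015539,Jumanji
1997-08-20,/poster.jpg,Beware Of Frost Bites,Broken row
"""

# Read the ambiguous columns as strings so pandas does not have to guess their type.
raw = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": str, "budget": str, "popularity": str},
)

# Coerce to numbers; values that cannot be parsed become NaN instead of raising.
for col in ("id", "budget", "popularity"):
    raw[col] = pd.to_numeric(raw[col], errors="coerce")

# Rows without a numeric id cannot be matched against the credits table, so drop them.
clean = raw.dropna(subset=["id"]).astype({"id": "int64"})
print(clean.dtypes)
```

Coercing explicitly keeps the malformed rows visible as NaN until the decision to drop them is made, instead of leaving whole columns typed as `object`.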


### Preliminary view of the data

In [180]:
# Preliminary view of dataset 1 and its information
df1.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [181]:
print(df1.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB
None


In [182]:
# Preliminary view of dataset 2 and its information
df2.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [183]:
print(df2.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

### 1.2 Unnesting

### Unnesting df2

In [311]:
# Inspect the nested data in the belongs_to_collection column
print(df2.loc[0,'belongs_to_collection'])

{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}


In [312]:
# Function to unnest the 'belongs_to_collection' column
import ast

def desanidar_belongs_to_collection(row):
    if pd.isna(row):
        return None, None, None, None
    else:
        try:
            # literal_eval only parses Python literals, so it cannot execute arbitrary code
            data = ast.literal_eval(row)
            if isinstance(data, dict):
                return data.get('id'), data.get('name'), data.get('poster_path'), data.get('backdrop_path')
            else:
                return None, None, None, None
        except Exception as e:
            print(f"Error in row: {row}. Detail: {e}")
            return None, None, None, None

# Apply the function to the 'belongs_to_collection' column to unnest it
desanidados = df2['belongs_to_collection'].apply(lambda row: pd.Series(desanidar_belongs_to_collection(row)))

# Rename the columns of the resulting DataFrame
desanidados.columns = ['belongs_to_collection_id', 'belongs_to_collection_name', 'belongs_to_collection_poster_path', 'belongs_to_collection_backdrop_path']


In [313]:
# Combine the original DataFrame with the unnested columns
df2 = pd.concat([df2, desanidados], axis=1)

# Drop the original 'belongs_to_collection' column
df2.drop(columns=['belongs_to_collection'], inplace=True)
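
Applying a function that returns a `pd.Series` for every row builds one small Series per row, which can be slow on a table of this size. As an alternative sketch (not the notebook's method), the column can be parsed once into dictionaries and expanded in a single `pd.DataFrame` call. The helper name `parse_dict` and the sample values in `col` are assumptions for illustration; the first entry is copied from the printed example above.

```python
import ast
import pandas as pd

def parse_dict(value):
    """Return the dict encoded in `value`, or an empty dict if it is missing or malformed."""
    if not isinstance(value, str):
        return {}
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}

# Stand-in for df2['belongs_to_collection']: one valid entry, one missing, one malformed.
col = pd.Series([
    "{'id': 10194, 'name': 'Toy Story Collection', "
    "'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', "
    "'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",
    None,
    "not a dict",
])

keys = ["id", "name", "poster_path", "backdrop_path"]
records = [parse_dict(v) for v in col]

# Build all columns at once; reindex guarantees the four columns exist even if no row has them.
expanded = (
    pd.DataFrame(records, index=col.index)
    .reindex(columns=keys)
    .add_prefix("belongs_to_collection_")
)
print(expanded)
```

Because `expanded` keeps the index of the source column, it can be concatenated to the original DataFrame with `pd.concat(..., axis=1)` in the same way as before.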



In [187]:
# Show the first rows of the DataFrame to verify the results
df2.head()

Unnamed: 0,adult,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,...,status,tagline,title,video,vote_average,vote_count,belongs_to_collection_id,belongs_to_collection_name,belongs_to_collection_poster_path,belongs_to_collection_backdrop_path
0,False,30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,...,Released,,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg
1,False,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,,,,
2,False,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg
3,False,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,,,,
4,False,0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,96871.0,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg


In [314]:
# Function to unnest 'genres' in df2
import ast

def desanidar_genres(row):
    try:
        if pd.isna(row):
            return None, None
        # literal_eval only parses Python literals, so it cannot execute arbitrary code
        genres_list = ast.literal_eval(row)
        genres_ids = [genre['id'] for genre in genres_list]
        genres_names = [genre['name'] for genre in genres_list]
        return genres_ids, genres_names
    except Exception as e:
        print(f"Error in row: {row}. Detail: {e}")
        return None, None

# Apply the function and expand the result into new columns of df2
desanidados_genres = df2['genres'].apply(lambda row: pd.Series(desanidar_genres(row)))

# Rename the columns
desanidados_genres.columns = ['genres_ids', 'genres_names']

In [315]:
# Concatenate with the original DataFrame df2
df2 = pd.concat([df2, desanidados_genres], axis=1)

In [316]:
# Drop the original 'genres' column from df2
df2.drop(columns=['genres'], inplace=True)

In [191]:
df2.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,title,video,vote_average,vote_count,belongs_to_collection_id,belongs_to_collection_name,belongs_to_collection_poster_path,belongs_to_collection_backdrop_path,genres_ids,genres_names
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,Toy Story,False,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[16, 35, 10751]","[Animation, Comedy, Family]"
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,Jumanji,False,6.9,2413.0,,,,,"[12, 14, 10751]","[Adventure, Fantasy, Family]"
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,Grumpier Old Men,False,6.5,92.0,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"[10749, 35]","[Romance, Comedy]"
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,Waiting to Exhale,False,6.1,34.0,,,,,"[35, 18, 10749]","[Comedy, Drama, Romance]"
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,Father of the Bride Part II,False,5.7,173.0,96871.0,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg,[35],[Comedy]


In [317]:
# Inspect the nested data in the production_companies column
print(df2.loc[0,'production_companies']) 

[{'name': 'Pixar Animation Studios', 'id': 3}]


In [318]:
# Function to unnest 'production_companies'
import ast

def desanidar_production_companies(row):
    try:
        if pd.isna(row):
            return None, None
        # literal_eval only parses Python literals, so it cannot execute arbitrary code
        companies_list = ast.literal_eval(row)
        if not isinstance(companies_list, list):
            return None, None
        company_names = [company['name'] for company in companies_list]
        company_ids = [company['id'] for company in companies_list]
        return company_names, company_ids
    except Exception as e:
        print(f"Error in row: {row}. Detail: {e}")
        return None, None

# Apply the function and expand the result into new columns of df2
desanidados_companies = df2['production_companies'].apply(lambda row: pd.Series(desanidar_production_companies(row)))

# Rename the columns
desanidados_companies.columns = ['company_names', 'company_ids']

In [319]:
# Concatenate with the original DataFrame df2
df2 = pd.concat([df2, desanidados_companies], axis=1)

# Drop the original 'production_companies' column from df2
df2.drop(columns=['production_companies'], inplace=True)


In [320]:
df2.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,vote_average,vote_count,belongs_to_collection_id,belongs_to_collection_name,belongs_to_collection_poster_path,belongs_to_collection_backdrop_path,genres_ids,genres_names,company_names,company_ids
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,7.7,5415.0,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[16, 35, 10751]","[Animation, Comedy, Family]",[Pixar Animation Studios],[3]
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,6.9,2413.0,,,,,"[12, 14, 10751]","[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...","[559, 2550, 10201]"
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,6.5,92.0,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"[10749, 35]","[Romance, Comedy]","[Warner Bros., Lancaster Gate]","[6194, 19464]"
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,6.1,34.0,,,,,"[35, 18, 10749]","[Comedy, Drama, Romance]",[Twentieth Century Fox Film Corporation],[306]
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,5.7,173.0,96871.0,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg,[35],[Comedy],"[Sandollar Productions, Touchstone Pictures]","[5842, 9195]"


In [321]:
# Inspect the nested data in the production_countries column
print(df2.loc[0,'production_countries']) 

[{'iso_3166_1': 'US', 'name': 'United States of America'}]


In [322]:
# Function to unnest 'production_countries'
import ast

def desanidar_production_countries(row):
    try:
        if pd.isna(row):
            return None, None
        # literal_eval only parses Python literals, so it cannot execute arbitrary code
        countries_list = ast.literal_eval(row)
        if not isinstance(countries_list, list):
            return None, None
        country_names = [country['name'] for country in countries_list]
        country_codes = [country['iso_3166_1'] for country in countries_list]
        return country_names, country_codes
    except Exception as e:
        print(f"Error in row: {row}. Detail: {e}")
        return None, None

# Apply the function and expand the result into new columns of df2
desanidados_countries = df2['production_countries'].apply(lambda row: pd.Series(desanidar_production_countries(row)))

# Rename the columns
desanidados_countries.columns = ['country_names', 'country_codes']

In [323]:
# Concatenate with the original DataFrame df2
df2 = pd.concat([df2, desanidados_countries], axis=1)

# Drop the original 'production_countries' column from df2
df2.drop(columns=['production_countries'], inplace=True)

In [324]:
df2.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,belongs_to_collection_id,belongs_to_collection_name,belongs_to_collection_poster_path,belongs_to_collection_backdrop_path,genres_ids,genres_names,company_names,company_ids,country_names,country_codes
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,10194.0,Toy Story Collection,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[16, 35, 10751]","[Animation, Comedy, Family]",[Pixar Animation Studios],[3],[United States of America],[US]
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,,,,,"[12, 14, 10751]","[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...","[559, 2550, 10201]",[United States of America],[US]
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,119050.0,Grumpy Old Men Collection,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"[10749, 35]","[Romance, Comedy]","[Warner Bros., Lancaster Gate]","[6194, 19464]",[United States of America],[US]
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,,,,,"[35, 18, 10749]","[Comedy, Drama, Romance]",[Twentieth Century Fox Film Corporation],[306],[United States of America],[US]
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,96871.0,Father of the Bride Collection,/nts4iOmNnq7GNicycMJ9pSAn204.jpg,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg,[35],[Comedy],"[Sandollar Productions, Touchstone Pictures]","[5842, 9195]",[United States of America],[US]


In [325]:
# Inspect the nested data in the spoken_languages column
print(df2.loc[0,'spoken_languages']) 

[{'iso_639_1': 'en', 'name': 'English'}]


In [326]:
# Function to unnest 'spoken_languages'
import ast

def desanidar_spoken_languages(row):
    try:
        if pd.isna(row):
            return None, None
        # literal_eval only parses Python literals, so it cannot execute arbitrary code
        languages_list = ast.literal_eval(row)
        if not isinstance(languages_list, list):
            return None, None
        language_names = [lang['name'] for lang in languages_list]
        language_codes = [lang['iso_639_1'] for lang in languages_list]
        return language_names, language_codes
    except Exception as e:
        print(f"Error in row: {row}. Detail: {e}")
        return None, None

# Apply the function and expand the result into new columns of df2
desanidados_languages = df2['spoken_languages'].apply(lambda row: pd.Series(desanidar_spoken_languages(row)))

# Rename the columns
desanidados_languages.columns = ['language_names', 'language_codes']

In [327]:
# Concatenate with the original DataFrame df2
df2 = pd.concat([df2, desanidados_languages], axis=1)

# Drop the original 'spoken_languages' column from df2
df2.drop(columns=['spoken_languages'], inplace=True)

In [328]:
df2.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,belongs_to_collection_poster_path,belongs_to_collection_backdrop_path,genres_ids,genres_names,company_names,company_ids,country_names,country_codes,language_names,language_codes
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg,/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg,"[16, 35, 10751]","[Animation, Comedy, Family]",[Pixar Animation Studios],[3],[United States of America],[US],[English],[en]
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,,,"[12, 14, 10751]","[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...","[559, 2550, 10201]",[United States of America],[US],"[English, Français]","[en, fr]"
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,/nLvUdqgPgm3F85NMCii9gVFUcet.jpg,/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg,"[10749, 35]","[Romance, Comedy]","[Warner Bros., Lancaster Gate]","[6194, 19464]",[United States of America],[US],[English],[en]
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,,,"[35, 18, 10749]","[Comedy, Drama, Romance]",[Twentieth Century Fox Film Corporation],[306],[United States of America],[US],[English],[en]
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,/nts4iOmNnq7GNicycMJ9pSAn204.jpg,/7qwE57OVZmMJChBpLEbJEmzUydk.jpg,[35],[Comedy],"[Sandollar Productions, Touchstone Pictures]","[5842, 9195]",[United States of America],[US],[English],[en]
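
The functions for `genres`, `production_companies`, `production_countries`, and `spoken_languages` all follow the same pattern: parse a stringified list of dictionaries and collect one list per key. A single generic helper could cover all four. The sketch below is one possible shape for it; the name `unnest_list_column`, its parameters, and the `prefix_key` column naming are assumptions and differ from the column names used in the cells above.

```python
import ast
import pandas as pd

def unnest_list_column(series: pd.Series, keys: list, prefix: str) -> pd.DataFrame:
    """Expand a column of stringified lists of dicts into one list-valued column per key."""
    rows = []
    for value in series:
        try:
            items = ast.literal_eval(value) if isinstance(value, str) else None
        except (ValueError, SyntaxError):
            items = None
        if not isinstance(items, list):
            # Missing or malformed entry: every output column gets None.
            rows.append({f"{prefix}_{k}": None for k in keys})
            continue
        dicts = [d for d in items if isinstance(d, dict)]
        rows.append({f"{prefix}_{k}": [d.get(k) for d in dicts] for k in keys})
    return pd.DataFrame(rows, index=series.index)

# Stand-in for df2['spoken_languages']: a two-language entry, a missing one, an empty list.
languages = pd.Series([
    "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",
    None,
    "[]",
])
out = unnest_list_column(languages, ["name", "iso_639_1"], "language")
print(out)
```

Using `d.get(k)` instead of `d[k]` means a dictionary that lacks a key contributes `None` to the list rather than discarding the whole row through an exception.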


### Unnesting df1

In [329]:
# Inspect the nested data in the cast column
print(df1.loc[0,'cast']) 

[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4tN

In [330]:
# Function to unnest 'cast'
import ast

def desanidar_cast(row):
    try:
        if pd.isna(row):
            return None, None, None, None, None, None, None
        # literal_eval only parses Python literals, so it cannot execute arbitrary code
        cast_list = ast.literal_eval(row)
        if not isinstance(cast_list, list):
            return None, None, None, None, None, None, None
        cast_names = [cast['name'] for cast in cast_list]
        cast_characters = [cast['character'] for cast in cast_list]
        cast_ids = [cast['id'] for cast in cast_list]
        cast_genders = [cast['gender'] for cast in cast_list]
        cast_orders = [cast['order'] for cast in cast_list]
        cast_profile_paths = [cast['profile_path'] for cast in cast_list]
        cast_credit_ids = [cast['credit_id'] for cast in cast_list]
        return cast_names, cast_characters, cast_ids, cast_genders, cast_orders, cast_profile_paths, cast_credit_ids
    except Exception as e:
        print(f"Error in row: {row}. Detail: {e}")
        return None, None, None, None, None, None, None

# Apply the function and expand the result into new columns of df1
desanidados_cast = df1['cast'].apply(lambda row: pd.Series(desanidar_cast(row)))

# Rename the columns
desanidados_cast.columns = ['cast_names', 'cast_characters', 'cast_ids', 'cast_genders', 'cast_orders', 'cast_profile_paths', 'cast_credit_ids']


In [331]:
# Concatenate with the original DataFrame df1
df1 = pd.concat([df1, desanidados_cast], axis=1)

# Drop the original 'cast' column from df1
df1.drop(columns=['cast'], inplace=True)

In [332]:
df1.head()

Unnamed: 0,crew,id,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_profile_paths,cast_credit_ids
0,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,"[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,"[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,"[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,"[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."


In [333]:
# Inspect the nested data in the crew column
print(df1.loc[0,'crew']) 

[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, {'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}, {'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}, {'credit_id': '52fe4284c3a36847f8024f5b', 'department': 'Writing', 'gender': 2, 'id': 12892, 'job': 'Screenplay', 'name': 'Joel Cohen', 'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'}, {'credit_id': '52fe4284c3a36847f8024f61', 'department': 'Writing', 'gender': 0, 'id': 12893, 'job': 'Screenplay', 'name': 'Alec Sokolow', 'profile_path': '/v79vlRYi94BZUQnkkyznbGUZLjT.jpg'}, {'credit_id': '52fe4284c3a36847f8024f67', 'depart

In [334]:
# Explode the 'crew' column (no effect here: the values are still strings, not lists)
df_crew = df1.explode('crew').reset_index(drop=True)

# Normalize the nested JSON data (adds no columns here, since no element is a dict)
crew_normalized = pd.json_normalize(df_crew['crew'])

# Concatenate with df1 minus the old 'crew' column; only 'id' and the cast columns remain
df_expanded = pd.concat([df1.drop(columns=['crew']), crew_normalized], axis=1)
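
For `explode` and `json_normalize` to produce crew fields, the `crew` strings first have to be parsed into real lists of dictionaries. The sketch below shows one way to do that on a two-row stand-in for `df1`. The first row is abridged from the printed `crew` example above, the second row (an empty list) is made up, and the variable names and the `crew_` prefix are assumptions for illustration.

```python
import ast
import pandas as pd

# Two-row stand-in for df1 after the cast step; `crew` is still a string here.
df = pd.DataFrame({
    "id": [862, 8844],
    "crew": [
        "[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', "
        "'job': 'Director', 'name': 'John Lasseter'}, "
        "{'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', "
        "'job': 'Screenplay', 'name': 'Joss Whedon'}]",
        "[]",
    ],
})

# 1. Turn each string into a real list of dicts.
parsed = df.assign(crew=df["crew"].apply(ast.literal_eval))

# 2. One row per crew member; a movie with an empty list keeps one row holding NaN.
long = parsed.explode("crew").reset_index(drop=True)

# 3. Flatten the dicts; NaN placeholders are swapped for {} so every element is a dict.
crew_dicts = [c if isinstance(c, dict) else {} for c in long["crew"]]
crew_flat = pd.json_normalize(crew_dicts).add_prefix("crew_")

# 4. Both frames share the same 0..n-1 index, so a column-wise concat lines up row by row.
crew_long = pd.concat([long.drop(columns=["crew"]), crew_flat], axis=1)
print(crew_long)
```

Note that the concat uses the exploded frame (`long`), not the original one: after `explode` the row count changes, so concatenating against the un-exploded DataFrame would misalign movies and crew members.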



In [335]:
print(df1.loc[0,]) 

crew                  [{'credit_id': '52fe4284c3a36847f8024f49', 'de...
id                                                                  862
cast_names            [Tom Hanks, Tim Allen, Don Rickles, Jim Varney...
cast_characters       [Woody (voice), Buzz Lightyear (voice), Mr. Po...
cast_ids              [31, 12898, 7167, 12899, 12900, 7907, 8873, 11...
cast_genders                    [2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]
cast_orders                  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
cast_profile_paths    [/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...
cast_credit_ids       [52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...
Name: 0, dtype: object


In [336]:
df_expanded.head()

Unnamed: 0,id,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_profile_paths,cast_credit_ids
0,862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,31357,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."


In [337]:

# Prefix every column of df_expanded with 'Crew_'
df_expanded.columns = ['Crew_' + col for col in df_expanded.columns]

# Show the first rows of the DataFrame with the renamed columns
print(df_expanded.head())


   Crew_id                                    Crew_cast_names  \
0      862  [Tom Hanks, Tim Allen, Don Rickles, Jim Varney...   
1     8844  [Robin Williams, Jonathan Hyde, Kirsten Dunst,...   
2    15602  [Walter Matthau, Jack Lemmon, Ann-Margret, Sop...   
3    31357  [Whitney Houston, Angela Bassett, Loretta Devi...   
4    11862  [Steve Martin, Diane Keaton, Martin Short, Kim...   

                                Crew_cast_characters  \
0  [Woody (voice), Buzz Lightyear (voice), Mr. Po...   
1  [Alan Parrish, Samuel Alan Parrish / Van Pelt,...   
2  [Max Goldman, John Gustafson, Ariel Gustafson,...   
3  [Savannah 'Vannah' Jackson, Bernadine 'Bernie'...   
4  [George Banks, Nina Banks, Franck Eggelhoffer,...   

                                       Crew_cast_ids  \
0  [31, 12898, 7167, 12899, 12900, 7907, 8873, 11...   
1  [2157, 8537, 205, 145151, 5149, 10739, 58563, ...   
2       [6837, 3151, 13567, 16757, 589, 16523, 7166]   
3  [8851, 9780, 18284, 51359, 66804, 352, 87118,

In [338]:
print(df_expanded.columns)

Index(['Crew_id', 'Crew_cast_names', 'Crew_cast_characters', 'Crew_cast_ids',
       'Crew_cast_genders', 'Crew_cast_orders', 'Crew_cast_profile_paths',
       'Crew_cast_credit_ids'],
      dtype='object')
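
As a side note on the renaming step, pandas offers `DataFrame.add_prefix`, which returns a new DataFrame instead of assigning to `.columns`. The following is a minimal sketch on a toy frame; `toy` and its values are illustrative stand-ins, not the project's data.

```python
import pandas as pd

# Toy frame standing in for df_expanded (illustrative values only)
toy = pd.DataFrame({
    "id": [862, 8844],
    "cast_names": ["Tom Hanks", "Robin Williams"],
})

# add_prefix returns a new DataFrame; the original frame keeps its column names
prefixed = toy.add_prefix("Crew_")

print(list(prefixed.columns))
print(list(toy.columns))
```

Because the original frame is left untouched, re-running the cell does not stack the prefix a second time, which can happen when assigning to `.columns` in place.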


In [339]:
df_expanded.head()

Unnamed: 0,Crew_id,Crew_cast_names,Crew_cast_characters,Crew_cast_ids,Crew_cast_genders,Crew_cast_orders,Crew_cast_profile_paths,Crew_cast_credit_ids
0,862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,31357,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."


In [340]:
# Replace NaN values in Crew_id with -1
df_expanded['Crew_id'] = df_expanded['Crew_id'].fillna(-1)



In [341]:
# Convert Crew_id to integers
df_expanded['Crew_id'] = df_expanded['Crew_id'].astype(int)

In [342]:
# Check the column names in df_expanded
print(df_expanded.columns)


Index(['Crew_id', 'Crew_cast_names', 'Crew_cast_characters', 'Crew_cast_ids',
       'Crew_cast_genders', 'Crew_cast_orders', 'Crew_cast_profile_paths',
       'Crew_cast_credit_ids'],
      dtype='object')


In [343]:
# Check whether any column name (for example 'Crew_id') appears more than once
duplicated_cols = df_expanded.columns[df_expanded.columns.duplicated()]
print(duplicated_cols)


Index([], dtype='object')


In [345]:
# Get the positions of the columns named 'Crew_id'
cols = df_expanded.columns
duplicated_indices = [i for i, col in enumerate(cols) if col == 'Crew_id']

# If 'Crew_id' appears more than once, rename the second occurrence to 'Crew_id2'
if len(duplicated_indices) > 1:
    df_expanded.columns = ['Crew_id2' if i == duplicated_indices[1] else col for i, col in enumerate(cols)]

# Verify the changes
print(df_expanded.columns)


Index(['Crew_id', 'Crew_cast_names', 'Crew_cast_characters', 'Crew_cast_ids',
       'Crew_cast_genders', 'Crew_cast_orders', 'Crew_cast_profile_paths',
       'Crew_cast_credit_ids'],
      dtype='object')
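
The duplicate check above found nothing to rename in this dataset. For cases where repeated names do appear, the idea can be generalised as in the sketch below. `dedupe_columns` is a hypothetical helper written for illustration and the toy frame is made up; it assumes that appending a counter does not collide with an existing column name.

```python
import pandas as pd

def dedupe_columns(columns):
    """Return the names with repeats suffixed 2, 3, ... (hypothetical helper)."""
    seen = {}
    result = []
    for name in columns:
        seen[name] = seen.get(name, 0) + 1
        # The first occurrence keeps its name; later ones get the running count appended
        result.append(name if seen[name] == 1 else f"{name}{seen[name]}")
    return result

# Toy frame with a repeated column name (illustrative values only)
toy = pd.DataFrame([[1, 2, 3]], columns=["Crew_id", "Crew_cast_names", "Crew_id"])

repeated = toy.columns[toy.columns.duplicated()].tolist()
toy.columns = dedupe_columns(toy.columns)

print(repeated)
print(toy.columns.tolist())
```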


In [346]:

# Left-join df1 with df_expanded, matching df1['id'] against df_expanded['Crew_id']
df_merged = pd.merge(df1, df_expanded, left_on='id', right_on='Crew_id', how='left')


In [347]:

# Turn the -1 placeholders in Crew_id back into NaN in the merged DataFrame
df_merged['Crew_id'] = df_merged['Crew_id'].replace(-1, np.nan)
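
The `-1` placeholder round trip used in the cells above works, but it is possible to avoid it. The sketch below is an alternative under stated assumptions, not the method used in this notebook: it relies on pandas' nullable `Int64` dtype and on the `indicator=True` option of `pd.merge`. The frames `movies_df` and `credits_df` are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy frames (illustrative values only)
movies_df = pd.DataFrame({"id": [862, 8844, 401840]})
credits_df = pd.DataFrame({
    "Crew_id": [862.0, 8844.0, None],
    "Crew_cast_names": ["Tom Hanks", "Robin Williams", None],
})

# Nullable integers keep a missing key as <NA>, so no -1 placeholder is needed
movies_df["id"] = movies_df["id"].astype("Int64")
credits_df["Crew_id"] = credits_df["Crew_id"].astype("Int64")

# A row without a key cannot match any movie, so drop it before joining
credits_df = credits_df[credits_df["Crew_id"].notna()]

# indicator=True adds a '_merge' column telling where each row came from
merged = pd.merge(
    movies_df, credits_df,
    left_on="id", right_on="Crew_id",
    how="left", indicator=True,
)
print(merged[["id", "Crew_id", "_merge"]])
```

Rows marked `left_only` in `_merge` are movies with no matching credits, which makes unmatched keys visible right after the join.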

In [348]:
df_merged.head()

Unnamed: 0,crew,id,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_profile_paths,cast_credit_ids,Crew_id,Crew_cast_names,Crew_cast_characters,Crew_cast_ids,Crew_cast_genders,Crew_cast_orders,Crew_cast_profile_paths,Crew_cast_credit_ids
0,"[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,"[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,"[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,"[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910...",31357,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,"[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750...",11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."


In [349]:
df3 = df_merged
# Drop the 'crew' column from df3
df3 = df3.drop(columns=['crew'], errors='ignore')
df3.head()

Unnamed: 0,id,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_profile_paths,cast_credit_ids,Crew_id,Crew_cast_names,Crew_cast_characters,Crew_cast_ids,Crew_cast_genders,Crew_cast_orders,Crew_cast_profile_paths,Crew_cast_credit_ids
0,862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",862,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",8844,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",15602,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,31357,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910...",31357,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750...",11862,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."


In [224]:

# Path where the Excel file will be saved
ruta_archivo = 'df3.xlsx'

# Export df3 to an Excel file
df3.to_excel(ruta_archivo, index=False, engine='openpyxl')

print(f"File exported to {ruta_archivo}")


File exported to df3.xlsx


In [350]:
# Convert the 'id' column in df2 to a numeric type; values that cannot be parsed become NaN
df2['id'] = pd.to_numeric(df2['id'], errors='coerce', downcast='integer')

# Convert the 'id' column in df3 the same way
df3['id'] = pd.to_numeric(df3['id'], errors='coerce', downcast='integer')

# Left-join df2 with df3 on the 'id' column
df_combined = pd.merge(df2, df3, on='id', how='left')
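
The `info()` output further down reports `id` as `float64`, which is what `pd.to_numeric` produces when `errors='coerce'` turns an unparseable value into NaN: the `downcast='integer'` request cannot be honoured while NaN is present. The sketch below illustrates this on made-up values and shows the nullable `Int64` dtype as one possible alternative to filling with `-1`. The series `raw_ids` is an assumption for illustration, not data from the project.

```python
import pandas as pd

# Made-up ids, one of which is not a number
raw_ids = pd.Series(["862", "8844", "1997-08-20", "15602"])

# The bad value becomes NaN, which forces a float result despite downcast='integer'
coerced = pd.to_numeric(raw_ids, errors="coerce", downcast="integer")
print(coerced.dtype)

# Nullable integers hold the valid ids as integers and the bad one as <NA>
nullable = coerced.astype("Int64")
print(nullable.tolist())
```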

In [351]:
df_combined.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,cast_profile_paths,cast_credit_ids,Crew_id,Crew_cast_names,Crew_cast_characters,Crew_cast_ids,Crew_cast_genders,Crew_cast_orders,Crew_cast_profile_paths,Crew_cast_credit_ids
0,False,30000000,http://toystory.disney.com/toy-story,862.0,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,"[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",862.0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,False,65000000,,8844.0,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,"[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",8844.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,False,0,,15602.0,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,"[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",15602.0,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,False,16000000,,31357.0,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,"[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910...",31357.0,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,False,0,,11862.0,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,"[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750...",11862.0,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."


In [352]:
df_combined.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45700 entries, 0 to 45699
Data columns (total 46 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   adult                                45700 non-null  object 
 1   budget                               45700 non-null  object 
 2   homepage                             7821 non-null   object 
 3   id                                   45697 non-null  float64
 4   imdb_id                              45683 non-null  object 
 5   original_language                    45689 non-null  object 
 6   original_title                       45700 non-null  object 
 7   overview                             44746 non-null  object 
 8   popularity                           45695 non-null  object 
 9   poster_path                          45314 non-null  object 
 10  release_date                         45613 non-null  object 
 11  revenue                     

In [353]:
# Convert 'id' in df_combined to integers, using -1 for missing values
df_combined['id'] = df_combined['id'].fillna(-1).astype(int)

In [354]:
# Refer to the combined dataset as 'data' from here on
data = df_combined

In [355]:
data.head()

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,cast_profile_paths,cast_credit_ids,Crew_id,Crew_cast_names,Crew_cast_characters,Crew_cast_ids,Crew_cast_genders,Crew_cast_orders,Crew_cast_profile_paths,Crew_cast_credit_ids
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,"[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",862.0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,"[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",8844.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,"[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",15602.0,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,"[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910...",31357.0,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,"[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750...",11862.0,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."


In [231]:
# Path where the Excel file will be saved
ruta_archivo = 'data.xlsx'

# Export data to an Excel file
data.to_excel(ruta_archivo, index=False, engine='openpyxl')

print(f"File exported to {ruta_archivo}")

File exported to data.xlsx


### 1.3 Drop Unused Columns

In [356]:
# Show the current columns of the dataset
print(data.columns)
print(data.info())

Index(['adult', 'budget', 'homepage', 'id', 'imdb_id', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'release_date', 'revenue', 'runtime', 'status', 'tagline', 'title',
       'video', 'vote_average', 'vote_count', 'belongs_to_collection_id',
       'belongs_to_collection_name', 'belongs_to_collection_poster_path',
       'belongs_to_collection_backdrop_path', 'genres_ids', 'genres_names',
       'company_names', 'company_ids', 'country_names', 'country_codes',
       'language_names', 'language_codes', 'cast_names', 'cast_characters',
       'cast_ids', 'cast_genders', 'cast_orders', 'cast_profile_paths',
       'cast_credit_ids', 'Crew_id', 'Crew_cast_names', 'Crew_cast_characters',
       'Crew_cast_ids', 'Crew_cast_genders', 'Crew_cast_orders',
       'Crew_cast_profile_paths', 'Crew_cast_credit_ids'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45700 entries, 0 to 45699
Data columns (total 46 columns):
 # 

In [357]:
# Check whether the two columns are identical row by row
todas_iguales = data['id'].equals(data['Crew_id'])

if todas_iguales:
    print("All rows are equal")
else:
    print("Not all rows are equal")

    # Identify the rows that differ
    diferencias = data[data['id'] != data['Crew_id']]
    datos_diferentes = diferencias[['id', 'Crew_id']]
    print("Rows that differ between the two columns:")
    print(datos_diferentes)

Not all rows are equal
Rows that differ between the two columns:
           id  Crew_id
19844      -1      NaN
29704      -1      NaN
35800      -1      NaN
43108  401840      NaN

In [358]:
# Drop the mismatched rows
# Comparing the 'id' and 'Crew_id' columns shows 4 rows whose values do not match.
# On inspection, these rows are either completely empty or hold values that make no
# sense for their columns, so they were removed from the dataset. The review was done
# manually and with care, to avoid deleting important data. In total, 4 rows were removed.

mask = data['id'] != data['Crew_id']

# Keep only the rows where both columns agree (the result is stored in data_cleaned)
data_cleaned = data[~mask]
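
The mask above flags rows with a missing `Crew_id` because NaN never compares equal to anything, so `!=` returns True for those rows. A minimal sketch of that behaviour on a made-up frame (`toy` and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up frame mirroring the kind of mismatch found above
toy = pd.DataFrame({
    "id": [862, -1, 401840],
    "Crew_id": [862.0, np.nan, np.nan],
})

# NaN != anything evaluates to True, so rows with a missing Crew_id are flagged
mask = toy["id"] != toy["Crew_id"]

# Keep only the rows where both columns agree
cleaned = toy[~mask].reset_index(drop=True)

print(mask.tolist())
print(cleaned)
```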

In [359]:
# Check whether the two columns are identical row by row
todas_iguales = data['cast_names'].equals(data['Crew_cast_names'])

if todas_iguales:
    print("All rows are equal")
else:
    print("Not all rows are equal")

    # Identify the rows that differ
    diferencias = data[data['cast_names'] != data['Crew_cast_names']]
    datos_diferentes = diferencias[['cast_names', 'Crew_cast_names']]
    print("Rows that differ between the two columns:")
    print(datos_diferentes)

# The columns are identical, so one of them can be dropped to avoid duplicated data.


All rows are equal

In [360]:
# Check whether the two columns are identical row by row
todas_iguales = data['cast_characters'].equals(data['Crew_cast_characters'])

if todas_iguales:
    print("All rows are equal")
else:
    print("Not all rows are equal")

    # Identify the rows that differ
    diferencias = data[data['cast_characters'] != data['Crew_cast_characters']]
    datos_diferentes = diferencias[['cast_characters', 'Crew_cast_characters']]
    print("Rows that differ between the two columns:")
    print(datos_diferentes)

# The columns are identical, so one of them can be dropped to avoid duplicated data.


All rows are equal

In [361]:
# Check whether the two columns are identical row by row
todas_iguales = data['cast_ids'].equals(data['Crew_cast_ids'])

if todas_iguales:
    print("All rows are equal")
else:
    print("Not all rows are equal")

    # Identify the rows that differ
    diferencias = data[data['cast_ids'] != data['Crew_cast_ids']]
    datos_diferentes = diferencias[['cast_ids', 'Crew_cast_ids']]
    print("Rows that differ between the two columns:")
    print(datos_diferentes)

# The columns are identical, so one of them can be dropped to avoid duplicated data.


All rows are equal

In [362]:
# Check whether the two columns are identical row by row
todas_iguales = data['cast_genders'].equals(data['Crew_cast_genders'])

if todas_iguales:
    print("All rows are equal")
else:
    print("Not all rows are equal")

    # Identify the rows that differ
    diferencias = data[data['cast_genders'] != data['Crew_cast_genders']]
    datos_diferentes = diferencias[['cast_genders', 'Crew_cast_genders']]
    print("Rows that differ between the two columns:")
    print(datos_diferentes)

# The columns are identical, so one of them can be dropped to avoid duplicated data.


All rows are equal

In [363]:
# Check whether the two columns are identical row by row
todas_iguales = data['cast_orders'].equals(data['Crew_cast_orders'])

if todas_iguales:
    print("All rows are equal")
else:
    print("Not all rows are equal")

    # Identify the rows that differ
    diferencias = data[data['cast_orders'] != data['Crew_cast_orders']]
    datos_diferentes = diferencias[['cast_orders', 'Crew_cast_orders']]
    print("Rows that differ between the two columns:")
    print(datos_diferentes)

# The columns are identical, so one of them can be dropped to avoid duplicated data.


All rows are equal

In [364]:
# Check whether the two columns are identical row by row
todas_iguales = data['cast_credit_ids'].equals(data['Crew_cast_credit_ids'])

if todas_iguales:
    print("All rows are equal")
else:
    print("Not all rows are equal")

    # Identify the rows that differ
    diferencias = data[data['cast_credit_ids'] != data['Crew_cast_credit_ids']]
    datos_diferentes = diferencias[['cast_credit_ids', 'Crew_cast_credit_ids']]
    print("Rows that differ between the two columns:")
    print(datos_diferentes)

# The columns are identical, so one of them can be dropped to avoid duplicated data.


All rows are equal
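
The pairwise comparisons above all follow the same pattern, so they could be expressed as one loop. The sketch below is one possible way to do that, not the approach used in this notebook. `compare_prefixed_columns` is a hypothetical helper, and the toy frame uses plain strings and integers rather than the list-valued cells of the real data.

```python
import pandas as pd

def compare_prefixed_columns(df, base_cols, prefix="Crew_"):
    """Map each base column to True when it equals its prefixed twin (hypothetical helper)."""
    return {col: df[col].equals(df[prefix + col]) for col in base_cols}

# Toy frame: one pair of identical columns and one pair that differs (illustrative values)
toy = pd.DataFrame({
    "cast_names": ["Tom Hanks", "Robin Williams"],
    "Crew_cast_names": ["Tom Hanks", "Robin Williams"],
    "cast_ids": [31, 2157],
    "Crew_cast_ids": [31, 9999],
})

report = compare_prefixed_columns(toy, ["cast_names", "cast_ids"])
print(report)

# Drop only the prefixed copies that were confirmed to be exact duplicates
redundant = ["Crew_" + col for col, same in report.items() if same]
deduplicated = toy.drop(columns=redundant)
print(deduplicated.columns.tolist())
```

Note that `Series.equals` also requires matching dtypes, which is why an integer `id` column and a float `Crew_id` column are reported as different even when their values agree.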

In [365]:
print(data.columns)

Index(['adult', 'budget', 'homepage', 'id', 'imdb_id', 'original_language',
       'original_title', 'overview', 'popularity', 'poster_path',
       'release_date', 'revenue', 'runtime', 'status', 'tagline', 'title',
       'video', 'vote_average', 'vote_count', 'belongs_to_collection_id',
       'belongs_to_collection_name', 'belongs_to_collection_poster_path',
       'belongs_to_collection_backdrop_path', 'genres_ids', 'genres_names',
       'company_names', 'company_ids', 'country_names', 'country_codes',
       'language_names', 'language_codes', 'cast_names', 'cast_characters',
       'cast_ids', 'cast_genders', 'cast_orders', 'cast_profile_paths',
       'cast_credit_ids', 'Crew_id', 'Crew_cast_names', 'Crew_cast_characters',
       'Crew_cast_ids', 'Crew_cast_genders', 'Crew_cast_orders',
       'Crew_cast_profile_paths', 'Crew_cast_credit_ids'],
      dtype='object')


In [366]:
data

Unnamed: 0,adult,budget,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,...,cast_profile_paths,cast_credit_ids,Crew_id,Crew_cast_names,Crew_cast_characters,Crew_cast_ids,Crew_cast_genders,Crew_cast_orders,Crew_cast_profile_paths,Crew_cast_credit_ids
0,False,30000000,http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,...,"[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",862.0,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[/pQFoyx7rp09CJTAb932F2g8Nlho.jpg, /uX2xVf6pMm...","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,False,65000000,,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,...,"[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",8844.0,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/sojtJyIV3lkUeThD7A2oHNm8183.jpg, /7il5D76vx6...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,False,0,,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,...,"[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",15602.0,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[/xJVkvprOnzP5Zdh5y63y8HHniDZ.jpg, /chZmNRYMtq...","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,False,16000000,,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,...,"[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910...",31357.0,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[/69ouDnXnmklYPr4sMJXWKYz81AL.jpg, /tHkgSzhEuJ...","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,False,0,,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,...,"[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750...",11862.0,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[/rI2EMvkfKKPKa5z0nM2pFVBtUyO.jpg, /fzgUMnbOkx...","[52fe44959251416c75039eb9, 52fe44959251416c750..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45695,False,0,http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,...,"[/ppI1Q4a7zAXTaAyPRSvPm0vhove.jpg, None, None]","[5894a909925141427e0079a5, 5894a952c3a36840a70...",439050.0,"[Leila Hatami, Kourosh Tahami, Elham Korda]","[, , ]","[240240, 1749839, 1619957]","[1, 0, 0]","[1, 2, 3]","[/ppI1Q4a7zAXTaAyPRSvPm0vhove.jpg, None, None]","[5894a909925141427e0079a5, 5894a952c3a36840a70..."
45696,False,0,,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,...,"[/aZLCxPbokn0EX1cNXxF9UCaA8gP.jpg, None, /jvAa...","[52fe4af1c3a36847f81e9b1f, 559eb4ecc3a368081d0...",111109.0,"[Angel Aquino, Perry Dizon, Hazel Orencio, Joe...","[Sister Angela, Homer, Crazy Woman/Virgin, Ama...","[1043186, 111636, 1204271, 278923, 1042953, 57...","[1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[/aZLCxPbokn0EX1cNXxF9UCaA8gP.jpg, None, /jvAa...","[52fe4af1c3a36847f81e9b1f, 559eb4ecc3a368081d0..."
45697,False,0,,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,...,"[/7ZdUGY78qpB2MkTzAQvYEYzncwN.jpg, /vhZ8AD36h0...","[52fe4776c3a368484e0c83a3, 52fe4776c3a368484e0...",67758.0,"[Erika Eleniak, Adam Baldwin, Julie du Page, J...","[Emily Shaw, Det. Mark Winston, Jayne Ferré, A...","[23764, 2059, 46277, 1736, 58646, 54649, 55270...","[1, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[/7ZdUGY78qpB2MkTzAQvYEYzncwN.jpg, /vhZ8AD36h0...","[52fe4776c3a368484e0c83a3, 52fe4776c3a368484e0..."
45698,False,0,,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,...,"[None, None, None, None, /n1NXVGNzNxtqsMWxLT1h...","[52fe4ea59251416c7515d7d5, 52fe4ea59251416c751...",227506.0,"[Iwan Mosschuchin, Nathalie Lissenko, Pavel Pa...","[, , , , ]","[544742, 1090923, 1136422, 1261758, 29199]","[2, 1, 2, 0, 1]","[0, 1, 2, 3, 4]","[None, None, None, None, /n1NXVGNzNxtqsMWxLT1h...","[52fe4ea59251416c7515d7d5, 52fe4ea59251416c751..."


In [367]:
# List of columns to drop
columns_to_drop = [
    'homepage',
    'poster_path',
    'belongs_to_collection_poster_path',
    'belongs_to_collection_backdrop_path',
    'cast_profile_paths',
    'Crew_cast_profile_paths',
    
    'video',
    'imdb_id',
    'adult',
    'original_title',
    'Crew_id',
    'Crew_cast_names',
    'Crew_cast_characters',
    'Crew_cast_ids',
    'Crew_cast_genders',
    'Crew_cast_orders',
    'Crew_cast_credit_ids'
]

# Drop the columns from the DataFrame
data_cleaned = data.drop(columns=columns_to_drop)


In [368]:
data = data_cleaned
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45700 entries, 0 to 45699
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   budget                      45700 non-null  object 
 1   id                          45700 non-null  int64  
 2   original_language           45689 non-null  object 
 3   overview                    44746 non-null  object 
 4   popularity                  45695 non-null  object 
 5   release_date                45613 non-null  object 
 6   revenue                     45694 non-null  float64
 7   runtime                     45437 non-null  float64
 8   status                      45613 non-null  object 
 9   tagline                     20499 non-null  object 
 10  title                       45694 non-null  object 
 11  vote_average                45694 non-null  float64
 12  vote_count                  45694 non-null  float64
 13  belongs_to_collection_id    451

### 1.4 Handling nulls and NaN

In [369]:
# Count the null values in each column
nulos = data.isnull().sum()

# Keep only the columns that have at least one null value
columnas_con_nulos = nulos[nulos > 0]

# Show the summary of columns with null values
print("Columnas con valores nulos y la cantidad de nulos en cada una:")
print(columnas_con_nulos)

Columnas con valores nulos y la cantidad de nulos en cada una:
original_language                11
overview                        954
popularity                        5
release_date                     87
revenue                           6
runtime                         263
status                           87
tagline                       25201
title                             6
vote_average                      6
vote_count                        6
belongs_to_collection_id      41182
belongs_to_collection_name    41182
company_names                     6
company_ids                       6
country_names                     6
country_codes                     6
language_names                    6
language_codes                    6
cast_names                        4
cast_characters                   4
cast_ids                          4
cast_genders                      4
cast_orders                       4
cast_credit_ids                   4
dtype: int64
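
Raw counts like the ones above are often easier to compare as a share of the total rows. A minimal sketch, assuming a small toy DataFrame (`toy` and `null_summary` are hypothetical names, not part of the notebook), that reports both the count and the percentage of nulls per column:

```python
import pandas as pd

# Toy frame standing in for `data` (hypothetical values)
toy = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "tagline": [None, None, None, "x"],
    "revenue": [1.0, 2.0, 3.0, 4.0],
})

null_counts = toy.isnull().sum()
null_summary = pd.DataFrame({
    "nulls": null_counts,
    "percent": (null_counts / len(toy) * 100).round(2),
})

# Keep only the columns with at least one null, most affected first
null_summary = null_summary[null_summary["nulls"] > 0].sort_values("nulls", ascending=False)
print(null_summary)
```

Seeing the percentage makes it easier to decide whether a column such as `tagline` is worth imputing or is better left as optional metadata.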


In [370]:
# Replace the NaN values in the 'revenue' column with 0
data['revenue'] = data['revenue'].fillna(0)

In [371]:
# Count the null values in each column again ('revenue' should no longer appear)
nulos = data.isnull().sum()

# Keep only the columns that have at least one null value
columnas_con_nulos = nulos[nulos > 0]

# Show the summary of columns with null values
print("Columnas con valores nulos y la cantidad de nulos en cada una:")
print(columnas_con_nulos)

Columnas con valores nulos y la cantidad de nulos en cada una:
original_language                11
overview                        954
popularity                        5
release_date                     87
runtime                         263
status                           87
tagline                       25201
title                             6
vote_average                      6
vote_count                        6
belongs_to_collection_id      41182
belongs_to_collection_name    41182
company_names                     6
company_ids                       6
country_names                     6
country_codes                     6
language_names                    6
language_codes                    6
cast_names                        4
cast_characters                   4
cast_ids                          4
cast_genders                      4
cast_orders                       4
cast_credit_ids                   4
dtype: int64


### 1.5 Null values in the release_date field must be dropped

In [372]:
print(data['release_date'].isnull().sum())  # Count the nulls before dropping them (87 at this point)

87


In [373]:
# Drop the rows with NaN values in the 'release_date' column
data = data.dropna(subset=['release_date'])

In [374]:
print(data['release_date'].isnull().sum())  # This should print 0

0


In [375]:
data

Unnamed: 0,budget,id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,country_names,country_codes,language_names,language_codes,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_credit_ids
0,30000000,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,,...,[United States of America],[US],[English],[en],"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80..."
1,65000000,8844,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,...,[United States of America],[US],"[English, Français]","[en, fr]","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80..."
2,0,15602,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,...,[United States of America],[US],[English],[en],"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[52fe466a9251416c75077a8d, 52fe466a9251416c750..."
3,16000000,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,Released,Friends are the people who let you be yourself...,...,[United States of America],[US],[English],[en],"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[52fe44779251416c91011aad, 52fe44779251416c910..."
4,0,11862,en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,Released,Just When His World Is Back To Normal... He's ...,...,[United States of America],[US],[English],[en],"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[52fe44959251416c75039eb9, 52fe44959251416c750..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45694,0,30840,en,"Yet another version of the classic epic, with ...",5.683753,1991-05-13,0.0,104.0,Released,,...,"[Canada, Germany, United Kingdom, United State...","[CA, DE, GB, US]",[English],[en],"[Patrick Bergin, Uma Thurman, David Morrissey,...","[Sir Robert Hode, Maid Marian, Little John, Si...","[29459, 139, 18616, 920, 1924]","[2, 1, 2, 2, 0]","[0, 1, 2, 3, 4]","[52fe44439251416c9100a887, 52fe44439251416c910..."
45696,0,111109,tl,An artist struggles to finish his work while a...,0.178241,2011-11-17,0.0,360.0,Released,,...,[Philippines],[PH],[],[tl],"[Angel Aquino, Perry Dizon, Hazel Orencio, Joe...","[Sister Angela, Homer, Crazy Woman/Virgin, Ama...","[1043186, 111636, 1204271, 278923, 1042953, 57...","[1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[52fe4af1c3a36847f81e9b1f, 559eb4ecc3a368081d0..."
45697,0,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,0.0,90.0,Released,A deadly game of wits.,...,[United States of America],[US],[English],[en],"[Erika Eleniak, Adam Baldwin, Julie du Page, J...","[Emily Shaw, Det. Mark Winston, Jayne Ferré, A...","[23764, 2059, 46277, 1736, 58646, 54649, 55270...","[1, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe4776c3a368484e0c83a3, 52fe4776c3a368484e0..."
45698,0,227506,en,"In a small town live two brothers, one a minis...",0.003503,1917-10-21,0.0,87.0,Released,,...,[Russia],[RU],[],[],"[Iwan Mosschuchin, Nathalie Lissenko, Pavel Pa...","[, , , , ]","[544742, 1090923, 1136422, 1261758, 29199]","[2, 1, 2, 0, 1]","[0, 1, 2, 3, 4]","[52fe4ea59251416c7515d7d5, 52fe4ea59251416c751..."


The null values in the requested field are dropped, as required by the assignment.
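
Later cells assign new columns to the filtered `data`. Assigning into a DataFrame that pandas still tracks as derived from another one can trigger `SettingWithCopyWarning`. A minimal sketch, assuming a toy frame (`toy` and `cleaned` are hypothetical names), of taking an explicit copy right after filtering so that later assignments act on an independent DataFrame:

```python
import pandas as pd

# Toy frame standing in for `data` (hypothetical values)
toy = pd.DataFrame({
    "release_date": ["1995-10-30", None, "1995-12-22"],
    "revenue": [373554033.0, 0.0, None],
})

# Drop rows without a release date and take an explicit copy,
# so the result is independent of `toy`
cleaned = toy.dropna(subset=["release_date"]).copy()

# Later column assignments operate on `cleaned` itself
cleaned["revenue"] = cleaned["revenue"].fillna(0)
print(cleaned)
```

The original frame is left untouched, which also makes it easier to compare the data before and after cleaning.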

### 1.6 Creating the return column

Create a column with the return on investment, named `return`, by dividing revenue by budget (`revenue / budget`). When the data needed to compute it is not available, the column must take the value 0.

In [376]:
# Replace NaN in 'revenue' and 'budget' with 0 to avoid division errors
data['revenue'] = data['revenue'].fillna(0)
data['budget'] = data['budget'].fillna(0)

# Convert 'revenue' and 'budget' to numeric (in case they contain text strings)
data['revenue'] = pd.to_numeric(data['revenue'], errors='coerce').fillna(0)
data['budget'] = pd.to_numeric(data['budget'], errors='coerce').fillna(0)

# Create the 'return' column as revenue / budget; when budget is 0 the return is set to 0
data['return'] = data.apply(lambda row: row['revenue'] / row['budget'] if row['budget'] != 0 else 0, axis=1)

# Check the result
print(data[['revenue', 'budget', 'return']].head())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['revenue'] = data['revenue'].fillna(0)

       revenue      budget     return
0  373554033.0  30000000.0  12.451801
1  262797249.0  65000000.0   4.043035
2          0.0         0.0   0.000000
3   81452156.0  16000000.0   5.090760
4   76578911.0         0.0   0.000000
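
The row-wise `apply` used for `return` works, but the same rule can be written with column operations, which tends to be faster on a frame of this size. A minimal sketch, assuming a toy frame with the first rows shown above (`toy` and `has_budget` are hypothetical names):

```python
import pandas as pd

# Toy frame with sample revenue/budget pairs taken from the output above
toy = pd.DataFrame({
    "revenue": [373554033.0, 262797249.0, 0.0, 76578911.0],
    "budget": [30000000.0, 65000000.0, 0.0, 0.0],
})

# Start from 0 and divide only where the budget is non-zero
has_budget = toy["budget"] != 0
toy["return"] = 0.0
toy.loc[has_budget, "return"] = (
    toy.loc[has_budget, "revenue"] / toy.loc[has_budget, "budget"]
)
print(toy)
```

Rows with a zero budget keep the default of 0, which matches the rule stated for this column.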




In [377]:
data

Unnamed: 0,budget,id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,country_codes,language_names,language_codes,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_credit_ids,return
0,30000000.0,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,,...,[US],[English],[en],"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",12.451801
1,65000000.0,8844,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,...,[US],"[English, Français]","[en, fr]","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",4.043035
2,0.0,15602,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,...,[US],[English],[en],"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",0.000000
3,16000000.0,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,Released,Friends are the people who let you be yourself...,...,[US],[English],[en],"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[52fe44779251416c91011aad, 52fe44779251416c910...",5.090760
4,0.0,11862,en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,Released,Just When His World Is Back To Normal... He's ...,...,[US],[English],[en],"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[52fe44959251416c75039eb9, 52fe44959251416c750...",0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45694,0.0,30840,en,"Yet another version of the classic epic, with ...",5.683753,1991-05-13,0.0,104.0,Released,,...,"[CA, DE, GB, US]",[English],[en],"[Patrick Bergin, Uma Thurman, David Morrissey,...","[Sir Robert Hode, Maid Marian, Little John, Si...","[29459, 139, 18616, 920, 1924]","[2, 1, 2, 2, 0]","[0, 1, 2, 3, 4]","[52fe44439251416c9100a887, 52fe44439251416c910...",0.000000
45696,0.0,111109,tl,An artist struggles to finish his work while a...,0.178241,2011-11-17,0.0,360.0,Released,,...,[PH],[],[tl],"[Angel Aquino, Perry Dizon, Hazel Orencio, Joe...","[Sister Angela, Homer, Crazy Woman/Virgin, Ama...","[1043186, 111636, 1204271, 278923, 1042953, 57...","[1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[52fe4af1c3a36847f81e9b1f, 559eb4ecc3a368081d0...",0.000000
45697,0.0,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,0.0,90.0,Released,A deadly game of wits.,...,[US],[English],[en],"[Erika Eleniak, Adam Baldwin, Julie du Page, J...","[Emily Shaw, Det. Mark Winston, Jayne Ferré, A...","[23764, 2059, 46277, 1736, 58646, 54649, 55270...","[1, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe4776c3a368484e0c83a3, 52fe4776c3a368484e0...",0.000000
45698,0.0,227506,en,"In a small town live two brothers, one a minis...",0.003503,1917-10-21,0.0,87.0,Released,,...,[RU],[],[],"[Iwan Mosschuchin, Nathalie Lissenko, Pavel Pa...","[, , , , ]","[544742, 1090923, 1136422, 1261758, 29199]","[2, 1, 2, 0, 1]","[0, 1, 2, 3, 4]","[52fe4ea59251416c7515d7d5, 52fe4ea59251416c751...",0.000000


### 1.7 Handling date formats

In [378]:
# Parse the 'release_date' column as dates in 'yyyy-mm-dd' format; values that do not match become NaT
data['release_date'] = pd.to_datetime(data['release_date'], errors='coerce', format='%Y-%m-%d')

# Count the dates that do not match the expected format (YYYY-MM-DD)
invalid_dates = data['release_date'].isna().sum()

# Report the result
if invalid_dates > 0:
    print(f"Hay {invalid_dates} fechas que no cumplen con el formato 'yyyy-mm-dd'.")
else:
    print("Todas las fechas cumplen con el formato 'yyyy-mm-dd'.")

# Store the dates back as 'yyyy-mm-dd' strings
data['release_date'] = data['release_date'].dt.strftime('%Y-%m-%d')

# Show the first rows to check the result
print(data['release_date'].head())

Hay 3 fechas que no cumplen con el formato 'yyyy-mm-dd'.
0    1995-10-30
1    1995-12-15
2    1995-12-22
3    1995-12-22
4    1995-02-10
Name: release_date, dtype: object




In [379]:
# Make sure the 'release_date' column is in datetime format
data['release_date'] = pd.to_datetime(data['release_date'], errors='coerce', format='%Y-%m-%d')




In [380]:

# Make sure the 'release_date' column is in datetime format
data['release_date'] = pd.to_datetime(data['release_date'], errors='coerce', format='%Y-%m-%d')

# Look for values that could not be converted and show them.
# Note: isna() and notnull() are never both true, so this filter is always empty;
# the unparseable dates are already NaT at this point and can be selected with isna() alone.
invalid_dates = data[data['release_date'].isna() & data['release_date'].notnull()]
print("Fechas inválidas que no se pudieron convertir:")
print(invalid_dates['release_date'])

# Extract the year and the month into new columns
data['release_year'] = data['release_date'].dt.year
data['release_month'] = data['release_date'].dt.month

# Create a column with the month name
data['month_name'] = data['release_date'].dt.strftime('%B')

Fechas inválidas que no se pudieron convertir:
Series([], Name: release_date, dtype: datetime64[ns])
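
Once a column has been coerced, the original text of the failed values is gone, so the check has to compare the raw column with the parsed one. A minimal sketch, assuming a toy frame with made-up bad values (`toy`, `raw`, `parsed` and `failed` are hypothetical names):

```python
import pandas as pd

# Toy frame: one valid date, two invalid strings, one missing value
toy = pd.DataFrame({
    "release_date": ["1995-10-30", "not-a-date", "1995-13-45", None],
})

raw = toy["release_date"]
parsed = pd.to_datetime(raw, errors="coerce", format="%Y-%m-%d")

# Values that were present originally but failed to parse
failed = raw[raw.notna() & parsed.isna()]
print(failed)
```

Keeping `raw` around until the check is done shows exactly which strings were rejected, instead of only how many.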



In [381]:
data

Unnamed: 0,budget,id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_credit_ids,return,release_year,release_month,month_name
0,30000000.0,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,,...,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",12.451801,1995.0,10.0,October
1,65000000.0,8844,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,...,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",4.043035,1995.0,12.0,December
2,0.0,15602,en,A family wedding reignites the ancient feud be...,11.7129,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,...,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",0.000000,1995.0,12.0,December
3,16000000.0,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,Released,Friends are the people who let you be yourself...,...,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[52fe44779251416c91011aad, 52fe44779251416c910...",5.090760,1995.0,12.0,December
4,0.0,11862,en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,Released,Just When His World Is Back To Normal... He's ...,...,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[52fe44959251416c75039eb9, 52fe44959251416c750...",0.000000,1995.0,2.0,February
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45694,0.0,30840,en,"Yet another version of the classic epic, with ...",5.683753,1991-05-13,0.0,104.0,Released,,...,"[Patrick Bergin, Uma Thurman, David Morrissey,...","[Sir Robert Hode, Maid Marian, Little John, Si...","[29459, 139, 18616, 920, 1924]","[2, 1, 2, 2, 0]","[0, 1, 2, 3, 4]","[52fe44439251416c9100a887, 52fe44439251416c910...",0.000000,1991.0,5.0,May
45696,0.0,111109,tl,An artist struggles to finish his work while a...,0.178241,2011-11-17,0.0,360.0,Released,,...,"[Angel Aquino, Perry Dizon, Hazel Orencio, Joe...","[Sister Angela, Homer, Crazy Woman/Virgin, Ama...","[1043186, 111636, 1204271, 278923, 1042953, 57...","[1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[52fe4af1c3a36847f81e9b1f, 559eb4ecc3a368081d0...",0.000000,2011.0,11.0,November
45697,0.0,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,0.0,90.0,Released,A deadly game of wits.,...,"[Erika Eleniak, Adam Baldwin, Julie du Page, J...","[Emily Shaw, Det. Mark Winston, Jayne Ferré, A...","[23764, 2059, 46277, 1736, 58646, 54649, 55270...","[1, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe4776c3a368484e0c83a3, 52fe4776c3a368484e0...",0.000000,2003.0,8.0,August
45698,0.0,227506,en,"In a small town live two brothers, one a minis...",0.003503,1917-10-21,0.0,87.0,Released,,...,"[Iwan Mosschuchin, Nathalie Lissenko, Pavel Pa...","[, , , , ]","[544742, 1090923, 1136422, 1261758, 29199]","[2, 1, 2, 0, 1]","[0, 1, 2, 3, 4]","[52fe4ea59251416c7515d7d5, 52fe4ea59251416c751...",0.000000,1917.0,10.0,October


In [382]:
# Define the translation dictionary (English month name -> Spanish month name)
month_translation = {
    'January': 'Enero',
    'February': 'Febrero',
    'March': 'Marzo',
    'April': 'Abril',
    'May': 'Mayo',
    'June': 'Junio',
    'July': 'Julio',
    'August': 'Agosto',
    'September': 'Septiembre',
    'October': 'Octubre',
    'November': 'Noviembre',
    'December': 'Diciembre'
}

# Create a new column with the month names in Spanish using the dictionary
data['month_name_es'] = data['month_name'].map(month_translation)

# Check the result
print(data[['month_name', 'month_name_es']].head())

  month_name month_name_es
0    October       Octubre
1   December     Diciembre
2   December     Diciembre
3   December     Diciembre
4   February       Febrero
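
The English month name is only an intermediate step here, and the text produced by `strftime('%B')` depends on the active locale. A minimal sketch, assuming a toy frame with valid dates only (`toy` and `MONTHS_ES` are hypothetical names), that maps the month number straight to its Spanish name:

```python
import pandas as pd

# Month number -> Spanish month name
MONTHS_ES = {
    1: "Enero", 2: "Febrero", 3: "Marzo", 4: "Abril",
    5: "Mayo", 6: "Junio", 7: "Julio", 8: "Agosto",
    9: "Septiembre", 10: "Octubre", 11: "Noviembre", 12: "Diciembre",
}

toy = pd.DataFrame({
    "release_date": pd.to_datetime(["1995-10-30", "1995-12-15", "1995-02-10"]),
})

# Map the numeric month directly, without the English intermediate column
toy["month_name_es"] = toy["release_date"].dt.month.map(MONTHS_ES)
print(toy)
```

This avoids creating and then dropping the temporary `month_name` column.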




In [383]:
# Drop the 'month_name' column
data = data.drop(columns=['month_name'])

In [384]:
# Check the first values in the 'popularity' column
print(data['popularity'].head(10))

# Check the data type of the 'popularity' column
print(data['popularity'].dtype)


0    21.946943
1    17.015539
2      11.7129
3     3.859495
4     8.387519
5    17.924927
6     6.677277
7     2.561161
8      5.23158
9    14.686036
Name: popularity, dtype: object
object


In [385]:
# Convert the 'popularity' column to float, replacing non-convertible values with NaN
data['popularity'] = pd.to_numeric(data['popularity'], errors='coerce')

# Count how many values ended up as NaN
print(data['popularity'].isna().sum())


3


In [386]:
# Select the rows whose 'popularity' could not be converted (they are already NaN after the previous cell)
invalid_popularity_values = data[data['popularity'].isna()]
print("Valores no convertibles en 'popularity':")
print(invalid_popularity_values['popularity'])


Valores no convertibles en 'popularity':
19844   NaN
29704   NaN
35800   NaN
Name: popularity, dtype: float64


In [387]:
# Drop the rows with NaN values in the 'popularity' column
data = data.dropna(subset=['popularity'])


In [388]:
# Check the first values of the 'popularity' column after the cleanup
print(data['popularity'].head())


0    21.946943
1    17.015539
2    11.712900
3     3.859495
4     8.387519
Name: popularity, dtype: float64


In [389]:
# Load a backup file because two crew columns that are needed later for the API were lost
# I could not determine why they were lost, so I re-import them from a backup copy of the dataset

%pip install pandas openpyxl

data3 = '/Users/felipeamezquita/Library/Mobile Documents/com~apple~CloudDocs/Documents/HENRY/PROYECTO INDIVIDUAL/EJERCICIO 1/DATASETS/Movies/Crew.xlsx'
df1 = pd.read_excel(data3)




In [390]:
df1

Unnamed: 0,id,Crew_job,Crew_name
0,862,Director,John Lasseter
1,8844,Screenplay,Joss Whedon
2,15602,Screenplay,Andrew Stanton
3,31357,Screenplay,Joel Cohen
4,11862,Screenplay,Alec Sokolow
...,...,...,...
45561,439050,Screenplay,Joe Eszterhas
45562,111109,Producer,Don Simpson
45563,67758,Producer,Jerry Bruckheimer
45564,227506,Original Music Composer,Giorgio Moroder


In [391]:
data

Unnamed: 0,budget,id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,cast_names,cast_characters,cast_ids,cast_genders,cast_orders,cast_credit_ids,return,release_year,release_month,month_name_es
0,30000000.0,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,,...,"[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...","[Woody (voice), Buzz Lightyear (voice), Mr. Po...","[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",12.451801,1995.0,10.0,Octubre
1,65000000.0,8844,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,...,"[Robin Williams, Jonathan Hyde, Kirsten Dunst,...","[Alan Parrish, Samuel Alan Parrish / Van Pelt,...","[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",4.043035,1995.0,12.0,Diciembre
2,0.0,15602,en,A family wedding reignites the ancient feud be...,11.712900,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,...,"[Walter Matthau, Jack Lemmon, Ann-Margret, Sop...","[Max Goldman, John Gustafson, Ariel Gustafson,...","[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",0.000000,1995.0,12.0,Diciembre
3,16000000.0,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,Released,Friends are the people who let you be yourself...,...,"[Whitney Houston, Angela Bassett, Loretta Devi...","[Savannah 'Vannah' Jackson, Bernadine 'Bernie'...","[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[52fe44779251416c91011aad, 52fe44779251416c910...",5.090760,1995.0,12.0,Diciembre
4,0.0,11862,en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,Released,Just When His World Is Back To Normal... He's ...,...,"[Steve Martin, Diane Keaton, Martin Short, Kim...","[George Banks, Nina Banks, Franck Eggelhoffer,...","[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[52fe44959251416c75039eb9, 52fe44959251416c750...",0.000000,1995.0,2.0,Febrero
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45694,0.0,30840,en,"Yet another version of the classic epic, with ...",5.683753,1991-05-13,0.0,104.0,Released,,...,"[Patrick Bergin, Uma Thurman, David Morrissey,...","[Sir Robert Hode, Maid Marian, Little John, Si...","[29459, 139, 18616, 920, 1924]","[2, 1, 2, 2, 0]","[0, 1, 2, 3, 4]","[52fe44439251416c9100a887, 52fe44439251416c910...",0.000000,1991.0,5.0,Mayo
45696,0.0,111109,tl,An artist struggles to finish his work while a...,0.178241,2011-11-17,0.0,360.0,Released,,...,"[Angel Aquino, Perry Dizon, Hazel Orencio, Joe...","[Sister Angela, Homer, Crazy Woman/Virgin, Ama...","[1043186, 111636, 1204271, 278923, 1042953, 57...","[1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[52fe4af1c3a36847f81e9b1f, 559eb4ecc3a368081d0...",0.000000,2011.0,11.0,Noviembre
45697,0.0,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,0.0,90.0,Released,A deadly game of wits.,...,"[Erika Eleniak, Adam Baldwin, Julie du Page, J...","[Emily Shaw, Det. Mark Winston, Jayne Ferré, A...","[23764, 2059, 46277, 1736, 58646, 54649, 55270...","[1, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe4776c3a368484e0c83a3, 52fe4776c3a368484e0...",0.000000,2003.0,8.0,Agosto
45698,0.0,227506,en,"In a small town live two brothers, one a minis...",0.003503,1917-10-21,0.0,87.0,Released,,...,"[Iwan Mosschuchin, Nathalie Lissenko, Pavel Pa...","[, , , , ]","[544742, 1090923, 1136422, 1261758, 29199]","[2, 1, 2, 0, 1]","[0, 1, 2, 3, 4]","[52fe4ea59251416c7515d7d5, 52fe4ea59251416c751...",0.000000,1917.0,10.0,Octubre


In [392]:
# Join the DataFrames on the 'id' column
merged_df = pd.merge(data, df1, on='id', how='inner')

# Show the first rows of the merged DataFrame
print(merged_df.head())


       budget     id original_language  \
0  30000000.0    862                en   
1  65000000.0   8844                en   
2         0.0  15602                en   
3  16000000.0  31357                en   
4         0.0  11862                en   

                                            overview  popularity release_date  \
0  Led by Woody, Andy's toys live happily in his ...   21.946943   1995-10-30   
1  When siblings Judy and Peter discover an encha...   17.015539   1995-12-15   
2  A family wedding reignites the ancient feud be...   11.712900   1995-12-22   
3  Cheated on, mistreated and stepped on, the wom...    3.859495   1995-12-22   
4  Just when George Banks has recovered from his ...    8.387519   1995-02-10   

       revenue  runtime    status  \
0  373554033.0     81.0  Released   
1  262797249.0    104.0  Released   
2          0.0    101.0  Released   
3   81452156.0    127.0  Released   
4   76578911.0    106.0  Released   

                                     

In [393]:
data = merged_df
data

Unnamed: 0,budget,id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,...,cast_ids,cast_genders,cast_orders,cast_credit_ids,return,release_year,release_month,month_name_es,Crew_job,Crew_name
0,30000000.0,862,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033.0,81.0,Released,,...,"[31, 12898, 7167, 12899, 12900, 7907, 8873, 11...","[2, 2, 2, 2, 2, 2, 1, 0, 2, 1, 2, 1, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]","[52fe4284c3a36847f8024f95, 52fe4284c3a36847f80...",12.451801,1995.0,10.0,Octubre,Director,John Lasseter
1,65000000.0,8844,en,When siblings Judy and Peter discover an encha...,17.015539,1995-12-15,262797249.0,104.0,Released,Roll the dice and unleash the excitement!,...,"[2157, 8537, 205, 145151, 5149, 10739, 58563, ...","[2, 2, 1, 0, 1, 1, 2, 1, 0, 1, 2, 1, 2, 0, 0, ...","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe44bfc3a36847f80a7c73, 52fe44bfc3a36847f80...",4.043035,1995.0,12.0,Diciembre,Screenplay,Joss Whedon
2,0.0,15602,en,A family wedding reignites the ancient feud be...,11.712900,1995-12-22,0.0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,...,"[6837, 3151, 13567, 16757, 589, 16523, 7166]","[2, 2, 1, 1, 1, 2, 2]","[0, 1, 2, 3, 4, 5, 6]","[52fe466a9251416c75077a8d, 52fe466a9251416c750...",0.000000,1995.0,12.0,Diciembre,Screenplay,Andrew Stanton
3,16000000.0,31357,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,1995-12-22,81452156.0,127.0,Released,Friends are the people who let you be yourself...,...,"[8851, 9780, 18284, 51359, 66804, 352, 87118, ...","[1, 1, 1, 1, 2, 2, 2, 2, 2, 2]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]","[52fe44779251416c91011aad, 52fe44779251416c910...",5.090760,1995.0,12.0,Diciembre,Screenplay,Joel Cohen
4,0.0,11862,en,Just when George Banks has recovered from his ...,8.387519,1995-02-10,76578911.0,106.0,Released,Just When His World Is Back To Normal... He's ...,...,"[67773, 3092, 519, 70696, 59222, 18793, 14592,...","[2, 1, 2, 1, 2, 0, 2, 2, 1, 1, 2, 1]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]","[52fe44959251416c75039eb9, 52fe44959251416c750...",0.000000,1995.0,2.0,Febrero,Screenplay,Alec Sokolow
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46660,0.0,30840,en,"Yet another version of the classic epic, with ...",5.683753,1991-05-13,0.0,104.0,Released,,...,"[29459, 139, 18616, 920, 1924]","[2, 1, 2, 2, 0]","[0, 1, 2, 3, 4]","[52fe44439251416c9100a887, 52fe44439251416c910...",0.000000,1991.0,5.0,Mayo,Screenplay,Thomas Hedley Jr.
46661,0.0,111109,tl,An artist struggles to finish his work while a...,0.178241,2011-11-17,0.0,360.0,Released,,...,"[1043186, 111636, 1204271, 278923, 1042953, 57...","[1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]","[52fe4af1c3a36847f81e9b1f, 559eb4ecc3a368081d0...",0.000000,2011.0,11.0,Noviembre,Producer,Don Simpson
46662,0.0,67758,en,"When one of her hits goes wrong, a professiona...",0.903007,2003-08-01,0.0,90.0,Released,A deadly game of wits.,...,"[23764, 2059, 46277, 1736, 58646, 54649, 55270...","[1, 2, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0]","[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[52fe4776c3a368484e0c83a3, 52fe4776c3a368484e0...",0.000000,2003.0,8.0,Agosto,Producer,Jerry Bruckheimer
46663,0.0,227506,en,"In a small town live two brothers, one a minis...",0.003503,1917-10-21,0.0,87.0,Released,,...,"[544742, 1090923, 1136422, 1261758, 29199]","[2, 1, 2, 0, 1]","[0, 1, 2, 3, 4]","[52fe4ea59251416c7515d7d5, 52fe4ea59251416c751...",0.000000,1917.0,10.0,Octubre,Original Music Composer,Giorgio Moroder
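
The merged frame's index runs to 46663, which is more rows than the crew file displayed earlier (its index ends at 45564). An inner join can only grow like this when 'id' is repeated in at least one of the inputs. The following is a minimal sketch of how key duplication could be checked before merging. The `movies` and `crew` frames are made up for illustration, and `validate` is a standard `pd.merge` parameter.

```python
import pandas as pd

# Hypothetical inputs with repeated ids on both sides.
movies = pd.DataFrame({'id': [862, 8844, 8844],
                       'title': ['Toy Story', 'Jumanji', 'Jumanji']})
crew = pd.DataFrame({'id': [862, 862, 8844],
                     'Crew_name': ['John Lasseter', 'Joss Whedon', 'Joe Johnston']})

# Count ids that appear more than once in each frame.
dup_movies = int(movies['id'].duplicated().sum())
dup_crew = int(crew['id'].duplicated().sum())
print(dup_movies, dup_crew)

# An inner join pairs every matching row, so repeated ids multiply rows.
merged = pd.merge(movies, crew, on='id', how='inner')
print(len(merged))

# validate raises MergeError when the keys are not unique as declared.
try:
    pd.merge(movies, crew, on='id', how='inner', validate='one_to_one')
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False
print(unique_keys)
```

If the duplicates are unintended, dropping them with `drop_duplicates(subset='id')` before the merge is one option. Whether that suits this dataset depends on which crew row should be kept per film.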


In [394]:
print(data.columns)

Index(['budget', 'id', 'original_language', 'overview', 'popularity',
       'release_date', 'revenue', 'runtime', 'status', 'tagline', 'title',
       'vote_average', 'vote_count', 'belongs_to_collection_id',
       'belongs_to_collection_name', 'genres_ids', 'genres_names',
       'company_names', 'company_ids', 'country_names', 'country_codes',
       'language_names', 'language_codes', 'cast_names', 'cast_characters',
       'cast_ids', 'cast_genders', 'cast_orders', 'cast_credit_ids', 'return',
       'release_year', 'release_month', 'month_name_es', 'Crew_job',
       'Crew_name'],
      dtype='object')


In [395]:
# Show the data type of specific columns
print("Data type of column 'vote_average':", data['vote_average'].dtype)
print("Data type of column 'cast_names':", data['cast_names'].dtype)
print("Data type of column 'company_names':", data['company_names'].dtype)


Data type of column 'vote_average': float64
Data type of column 'cast_names': object
Data type of column 'company_names': object


- The DataFrame is converted to Parquet

In [407]:
import pyarrow as pa
import pyarrow.parquet as pq


table = pa.Table.from_pandas(data)
pq.write_table(table, 'data.parquet')



### Summary of step 1: Data transformation


1. **Data unnesting:**

    Nested columns were flattened, and unnecessary values and irrelevant columns were identified and removed.

2. **Data cleaning:**

    Null or NaN values in the relevant columns were handled: NaN values in 'revenue' were replaced with 0, and rows with NaN in 'release_date' were dropped.
    Formatting problems in the 'release_date' column were identified and corrected.

3. **Column removal:**

    Unnecessary columns such as 'homepage', 'poster_path' and 'belongs_to_collection_poster_path', among others, were removed.

4. **Data type conversion and feature extraction:**

    The 'release_date' column was converted to a proper date type.
    New columns 'release_year', 'release_month' and 'month_name_es' were derived from 'release_date', holding the year, the month number and the month name, respectively.
    The 'popularity' column was converted to float, and the 3 rows left with NaN were dropped.

5. **Dataset converted to Parquet**
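
The cleaning and feature-extraction steps summarized above can be sketched on a small made-up frame. The `sample` data and the `MONTHS_ES` dictionary are assumptions for illustration, and the notebook's own code may differ in detail.

```python
import pandas as pd

# Hypothetical lookup for Spanish month names, independent of the system locale.
MONTHS_ES = {1: 'Enero', 2: 'Febrero', 3: 'Marzo', 4: 'Abril',
             5: 'Mayo', 6: 'Junio', 7: 'Julio', 8: 'Agosto',
             9: 'Septiembre', 10: 'Octubre', 11: 'Noviembre', 12: 'Diciembre'}

sample = pd.DataFrame({
    'revenue': [373554033.0, None, 0.0],
    'release_date': ['1995-10-30', '1995-12-15', None],
})

# Replace missing revenue with 0 and drop rows without a release date.
sample['revenue'] = sample['revenue'].fillna(0)
sample = sample.dropna(subset=['release_date']).copy()

# Parse the dates, then derive year, month number and Spanish month name.
sample['release_date'] = pd.to_datetime(sample['release_date'], errors='coerce')
sample['release_year'] = sample['release_date'].dt.year
sample['release_month'] = sample['release_date'].dt.month
sample['month_name_es'] = sample['release_month'].map(MONTHS_ES)

print(sample)
```

Mapping month numbers through a dictionary avoids depending on a Spanish locale being installed on the machine that runs the notebook.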