# 0.0 - Data Preparation: Movie Recommendation System

This notebook loads, cleans, and merges the TMDB dataset.
The goal is to produce a dataset containing:
- Movie title
- Synopsis (`overview`)
- Genres
- Keywords
- Cast
- Director (stored inside the `crew` column)


In [4]:
import pandas as pd

# Load the datasets
movies_df = pd.read_csv('../data/tmdb_5000_movies.csv')
credits_df = pd.read_csv('../data/tmdb_5000_credits.csv')

print("Movies:", movies_df.shape)
print("Credits:", credits_df.shape)


Movies: (4803, 20)
Credits: (4803, 4)


In [None]:
# Inspect the available columns
print(movies_df.columns)
print(credits_df.columns)

# Merge on 'title'.
# Note: titles are not unique in this dataset, so the join produces a few
# duplicated rows (4809 rows from 4803 movies).
merged_df = movies_df.merge(credits_df, on='title')


# Keep only the columns needed after the merge
columns_to_keep = ['title', 'overview', 'genres', 'keywords', 'cast', 'crew']
merged_df = merged_df[columns_to_keep]
merged_df.head(2)


Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')
Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')


|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | ... | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | movie_id | cast | crew |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | `[{"id": 28, "name": "Action"}, {"id": 12, "nam...` | http://www.avatarmovie.com/ | 19995 | `[{"id": 1463, "name": "culture clash"}, {"id":...` | en | Avatar | In the 22nd century, a paraplegic Marine is di... | 150.437577 | `[{"name": "Ingenious Film Partners", "id": 289...` | ... | 162.0 | `[{"iso_639_1": "en", "name": "English"}, {"iso...` | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 | 19995 | `[{"cast_id": 242, "character": "Jake Sully", "...` | `[{"credit_id": "52fe48009251416c750aca23", "de...` |
| 1 | 300000000 | `[{"id": 12, "name": "Adventure"}, {"id": 14, "...` | http://disney.go.com/disneypictures/pirates/ | 285 | `[{"id": 270, "name": "ocean"}, {"id": 726, "na...` | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | 139.082615 | `[{"name": "Walt Disney Pictures", "id": 2}, {"...` | ... | 169.0 | `[{"iso_639_1": "en", "name": "English"}]` | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 | 285 | `[{"cast_id": 4, "character": "Captain Jack Spa...` | `[{"credit_id": "52fe4232c3a36847f800b579", "de...` |
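
Joining on `title` turns 4803 movies into 4809 rows (see the `info()` output further down), because a few titles repeat. A minimal sketch of joining on the numeric identifiers instead (`id` in the movies file, `movie_id` in the credits file), using small made-up frames in place of the real CSVs. The helper name `merge_on_id` and the toy rows are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd


def merge_on_id(movies: pd.DataFrame, credits: pd.DataFrame) -> pd.DataFrame:
    """Join movies and credits on the movie identifier instead of the title."""
    # Drop the credits title so the merge does not create title_x / title_y.
    credits_no_title = credits.drop(columns=["title"])
    return movies.merge(
        credits_no_title,
        left_on="id",
        right_on="movie_id",
        how="inner",
        validate="one_to_one",  # raises MergeError if either key has duplicates
    )


# Toy data (made up): two different movies share the same title.
toy_movies = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "overview": ["first film", "second film", "third film"],
})
toy_credits = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "title": ["Batman", "Batman", "Avatar"],
    "cast": ["[]", "[]", "[]"],
    "crew": ["[]", "[]", "[]"],
})

by_title = toy_movies.merge(toy_credits, on="title")  # repeated titles multiply
by_id = merge_on_id(toy_movies, toy_credits)          # one row per movie

print(len(by_title), len(by_id))
```

With the identifier as the key, the row count stays equal to the number of movies, and `validate="one_to_one"` fails loudly if that assumption does not hold for the real files.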


In [6]:
# Keep only the columns we need
selected_df = merged_df[['title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
selected_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     4809 non-null   object
 1   overview  4806 non-null   object
 2   genres    4809 non-null   object
 3   keywords  4809 non-null   object
 4   cast      4809 non-null   object
 5   crew      4809 non-null   object
dtypes: object(6)
memory usage: 225.6+ KB
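
The `info()` summary shows that only `overview` has missing values. A small sketch of counting them directly, on a made-up frame; the blank-string check is an extra assumption about what "no synopsis" might mean, not something the original notebook does:

```python
import pandas as pd

# Toy frame (made up): one missing overview and one empty string.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "overview": ["text", None, "", "more text"],
})

# Missing values per column, the same information info() reports indirectly.
missing_per_column = toy.isna().sum()

# Empty strings are not NaN, so blank overviews are counted separately.
blank_overviews = (toy["overview"].fillna("").str.strip() == "").sum()

print(missing_per_column)
print("Blank or missing overviews:", blank_overviews)
```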


In [7]:
# Drop rows without a synopsis and reset the index.
# Reassigning the result (rather than using inplace=True on a frame selected
# from merged_df) avoids pandas' SettingWithCopyWarning.
selected_df = selected_df.dropna(subset=['overview']).reset_index(drop=True)

print("Clean dataset:", selected_df.shape)


Clean dataset: (4806, 6)
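
Selecting columns from one frame and then modifying the result in place is the pattern that can trigger `SettingWithCopyWarning` in the pandas versions that emit it. A minimal sketch of one way to sidestep it, on a made-up frame: take an explicit `.copy()` of the selection and use calls that return new objects. The variable names here are illustrative only:

```python
import pandas as pd

# Toy frame (made up) with one missing overview.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["text", None, "more text"],
    "budget": [1, 2, 3],
})

# .copy() makes the column selection an independent frame.
selected = toy[["title", "overview"]].copy()

# dropna and reset_index return new frames; nothing is modified in place.
clean = selected.dropna(subset=["overview"]).reset_index(drop=True)

print(clean.shape)
```

The original frame is left untouched, so later cells can still rely on it.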


In [8]:
selected_df.to_csv('../data/clean_movies.csv', index=False)



A `clean_movies.csv` file was generated with the following columns:
- title
- overview
- genres
- keywords
- cast
- crew

This file is used in the next notebook to explore the data and build the recommendation engine.
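
The `genres`, `keywords`, `cast`, and `crew` columns are still JSON-encoded strings, as the table preview above shows. A minimal sketch of how names and the director might be pulled out with `json.loads`. The helpers `extract_names` and `extract_director` and the `top_n` parameter are hypothetical, the sample strings are made up, and the `job` key with the value `"Director"` is an assumption about the crew entries that the truncated preview does not confirm:

```python
import json
from typing import List, Optional


def extract_names(json_text: str, top_n: Optional[int] = None) -> List[str]:
    """Return the 'name' field of each entry in a JSON-encoded list."""
    items = json.loads(json_text)
    names = [item["name"] for item in items if "name" in item]
    return names if top_n is None else names[:top_n]


def extract_director(crew_json: str) -> Optional[str]:
    """Return the first crew member whose job is 'Director' (assumed key)."""
    for member in json.loads(crew_json):
        if member.get("job") == "Director":
            return member.get("name")
    return None


# Made-up sample strings in the same shape as the dataset columns.
sample_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
sample_crew = (
    '[{"job": "Editor", "name": "Editor One"},'
    ' {"job": "Director", "name": "Director One"}]'
)

print(extract_names(sample_genres))
print(extract_names(sample_genres, top_n=1))
print(extract_director(sample_crew))
```

On the real data these helpers could be applied column by column, for example with `Series.apply`, after checking that every row parses as valid JSON.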
