## Sprint 02 - Lab 05 - Caso 1 Crear un recomendador de películas usando word embeddings
**Solución a problemas de Procesamiento de Lenguaje Natural (PLN)**

## 0. Importamos las librerías necesarias

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# Descargar recursos necesarios de nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hugoc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hugoc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\hugoc\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## 1. Análisis y procesado de los datos


In [2]:
# Cargar el dataset desde el archivo CSV
df = pd.read_csv("wiki_movie_plots_deduped.csv")
df.head()

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...


In [3]:
# Mostramos el número de filas y columnas
print(f"El dataset tiene {df.shape[0]} filas")

El dataset tiene 34886 filas


In [4]:
# Como el modelo queremos que recomiendo en función a su sinopsis, eliminamos
# las columnas que no aportan valor sustancial
df_clean = df.drop(columns=['Release Year', 'Origin/Ethnicity', 'Director', 'Genre', 'Wiki Page', 'Cast'])
df_clean.head()

Unnamed: 0,Title,Plot
0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr..."
1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov..."
2,The Martyred Presidents,"The film, just over a minute long, is composed..."
3,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...
4,Jack and the Beanstalk,The earliest known adaptation of the classic f...


In [5]:
# Muestra cuantas filas vacías tienen valores nulos
print(f"El dataset tiene {df_clean.isnull().any(axis=1).sum()} filas con valores nulos")

# Eliminamos filas vacías
df_clean.dropna()

El dataset tiene 0 filas con valores nulos


Unnamed: 0,Title,Plot
0,Kansas Saloon Smashers,"A bartender is working at a saloon, serving dr..."
1,Love by the Light of the Moon,"The moon, painted with a smiling face hangs ov..."
2,The Martyred Presidents,"The film, just over a minute long, is composed..."
3,"Terrible Teddy, the Grizzly King",Lasting just 61 seconds and consisting of two ...
4,Jack and the Beanstalk,The earliest known adaptation of the classic f...
...,...,...
34881,The Water Diviner,"The film begins in 1919, just after World War ..."
34882,Çalgı Çengi İkimiz,"Two musicians, Salih and Gürkan, described the..."
34883,Olanlar Oldu,"Zafer, a sailor living with his mother Döndü i..."
34884,Non-Transferable,The film centres around a young woman named Am...


In [6]:
# Preprocesamiento de texto
def preprocess_text(text):
    text = text.lower()
    
    # Eliminar stopwords
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token.isalpha()]
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens

# Seleccionar solo la columna 'Plot' que contiene las sinopsis
plot_data = df_clean['Plot'].tolist()

# Aplicar preprocesamiento a las sinopsis
preprocessed_plots = [preprocess_text(plot) for plot in plot_data]

KeyboardInterrupt: 

## 2. Entrenamiento del modelo Word2Vec
**En este ejercicio se utiliza Word2Vec por su eficiencia y capacidad para capturar el contexto semántico de palabras en grandes conjuntos de datos de texto.**

In [None]:
# Entrenar el modelo Word2Vec
model = Word2Vec(preprocessed_plots, vector_size=100, window=5, min_count=1, workers=4)

## 3. Entrenamiento del modelo Word2Vec con el método de similaridad
**En este caso vamos convierto la sinopsis de películas en números usando Word2Vec.**
**Luego, se promedian estos números para obtener una idea promedio de cada película.**
**Finalmente, se comparan estas "ideas" para encontrar similitudes entre películas.**


In [None]:
# Calcular representaciones vectoriales de las sinopsis
plot_vectors = [model.wv[plot] for plot in preprocessed_plots]

# Calcular el vector promedio de cada sinopsis
plot_vectors_avg = [sum(vec) / len(vec) for vec in plot_vectors]

# Convertir la lista de vectores promedio en un arreglo de NumPy
plot_vectors_avg_np = np.array(plot_vectors_avg)

# Calcular similitud coseno entre las sinopsis
similarity_matrix = cosine_similarity(plot_vectors_avg_np, plot_vectors_avg_np)

In [None]:
# Función para obtener las películas más similares
def get_similar_movies(movie_title, top_n=5):
    movie_index = df_clean[df_clean['Title'] == movie_title].index[0]
    similar_movies_indices = similarity_matrix[movie_index].argsort()[-top_n-1:-1][::-1] # Últimos top_n elementos
    similar_movies = [(df_clean.iloc[i]['Title'], similarity_matrix[movie_index][i]) for i in similar_movies_indices]
    return similar_movies

## 4. Prueba del modelo

**En este caso se nos pide que introduciendo el nombre de una película, nos recomiende algunas similares en función de la sinopsis**

In [None]:
# Recogemos el nombre de la película
movie_title = input("Por favor ingresa el título de la película: ")

# Prueba del modelo recomendando películas similares a una película dada
similar_movies = get_similar_movies(movie_title)
print(f"Aquí tiene una lista de 5 películas similares a '{movie_title}':\n")
for movie, similarity_score in similar_movies:
    print(f"- {movie} (SIMILITUD: {similarity_score:.2f})")


IndexError: index 0 is out of bounds for axis 0 with size 0