# UNR - FCEIA 
## Tecnicatura Universitaria en Programación 
### NLP: Practical Assignment No. 1

---

**Team members**
- López Ceratto, Julieta : L-3311/1
- Crenna, Giuliano : C-7438/1

# Importing the required libraries

In [26]:
import os
import pandas as pd
import pickle
from typing import List, Dict, Any
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.neighbors import NearestNeighbors
import numpy as np
import tensorflow_text
import spacy

In [27]:
import warnings
warnings.filterwarnings("ignore")

# Loading the datasets

In [28]:
JUEGOS_PATH: str = os.path.join(os.getcwd(), 'data', 'bgg_database.csv')
PELICULAS_PATH: str = os.path.join(os.getcwd(), 'data', 'IMDB-Movie-Data.csv')
LIBROS_PATH: str = os.path.join(os.getcwd(), 'data', 'dataset_libros.csv')

In [29]:
dataset_juegos: pd.DataFrame = pd.read_csv(JUEGOS_PATH)
dataset_peliculas: pd.DataFrame = pd.read_csv(PELICULAS_PATH)
dataset_libros: pd.DataFrame = pd.read_csv(LIBROS_PATH)

In [30]:
dataset_libros['Resumen'][0]

'"Frankenstein; Or, The Modern Prometheus" by Mary Wollstonecraft Shelley is a novel written in the early 19th century. The story explores themes of ambition, the quest for knowledge, and the consequences of man\'s hubris through the experiences of Victor Frankenstein and the monstrous creation of his own making.   The opening of the book introduces Robert Walton, an ambitious explorer on a quest to discover new lands and knowledge in the icy regions of the Arctic. In his letters to his sister Margaret, he expresses both enthusiasm and the fear of isolation in his grand venture. As Walton\'s expedition progresses, he encounters a mysterious, emaciated stranger who has faced great suffering—furthering the intrigue of his narrative. This stranger ultimately reveals his tale of creation, loss, and the profound consequences of seeking knowledge that lies beyond human bounds. The narrative is set up in a manner that suggests a deep examination of the emotions and ethical dilemmas faced by t

## Modifying the Resumen column of dataset_libros

Because the book summaries are very long and we do not have the compute to embed them in full, we use spaCy to shorten them.

In [31]:
summary = spacy.load('en_core_web_sm')

In [32]:
# Extract keywords from a text (only tokens tagged as NOUN are kept)
def extraer_palabras_clave(texto):
    doc = summary(texto)
    palabras_clave = [token.text for token in doc if token.pos_ in {'NOUN'}]
    return " ".join(palabras_clave)

In [None]:
dataset_libros['Resumen'].str.split().str.len().sum()

899588

In [None]:
dataset_peliculas['Description'].str.split().str.len().sum()

27921
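Dividing the totals above by the dataset sizes reported in this notebook (5986 books, 1000 movies) gives roughly 150 words per book summary against roughly 28 per movie description. A minimal sketch of the same `str.split().str.len()` word count broken down per row, using a made-up toy DataFrame rather than the real datasets:

```python
import pandas as pd

# Toy stand-in for a text column such as dataset_libros['Resumen'] (illustrative data only)
toy = pd.DataFrame({"Resumen": ["a short summary", "a much longer summary of a book"]})

# Words per row: split on whitespace, then count the resulting tokens
words_per_row = toy["Resumen"].str.split().str.len()

total_words = int(words_per_row.sum())
mean_words = float(words_per_row.mean())
print(total_words, mean_words)
```

The per-row mean is the figure that matters for embedding cost, since each row is encoded as one input.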

# Sentiment Analysis Model

We created a simple dataset to train the classifier.

In [34]:
ESTADOS_ANIMO_PATH: str = os.path.join(os.getcwd(), 'data', 'estados_de_animo.csv')

In [35]:
df_estados_de_animo: pd.DataFrame = pd.read_csv(ESTADOS_ANIMO_PATH)

X: pd.Series = df_estados_de_animo['prompt']
y: pd.Series = df_estados_de_animo['estado_animo']

We split the data 70/30 into training and test sets and build a pipeline that feeds TF-IDF features into a **MultinomialNB** classifier.

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

modelo_animo = make_pipeline(TfidfVectorizer(), MultinomialNB())
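To make the pipeline concrete, here is a small sketch of the same `TfidfVectorizer` + `MultinomialNB` combination fitted on four made-up prompts. The prompts and the name `toy_model` are illustrative assumptions and are not taken from `estados_de_animo.csv`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training prompts, two per label (illustrative only)
prompts = [
    "estoy muy feliz hoy",
    "que dia tan alegre",
    "me siento triste y solo",
    "todo es gris y triste",
]
labels = ["Alegre", "Alegre", "Melancólico", "Melancólico"]

# The vectorizer turns text into TF-IDF features; the classifier learns from them
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(prompts, labels)

# make_pipeline names each step after its lowercased class name
vocabulary_size = len(toy_model.named_steps["tfidfvectorizer"].vocabulary_)
prediction = toy_model.predict(["que feliz y alegre"])[0]
print(vocabulary_size, prediction)
```

Because the vectorizer lives inside the pipeline, raw strings can be passed to both `fit` and `predict` without a separate transform step.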

In [37]:
modelo_animo.fit(X_train, y_train)

y_pred = modelo_animo.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      Alegre       1.00      0.96      0.98        79
 Melancólico       1.00      1.00      1.00        65
 Ni fu ni fa       0.96      1.00      0.98        76

    accuracy                           0.99       220
   macro avg       0.99      0.99      0.99       220
weighted avg       0.99      0.99      0.99       220



We define a function that classifies the user's prompt.

In [38]:
def clasificar_animo(prompt_usuario: str) -> str:
    estado_animo_predicho = modelo_animo.predict([prompt_usuario])[0]
    
    return estado_animo_predicho

In [39]:
nuevo_prompt = "La vida no significa nada"

print(f"Estado de ánimo: {clasificar_animo(nuevo_prompt)}")

Estado de ánimo: Ni fu ni fa


In [40]:
nuevo_prompt = "Me siento feliz"

print(f"Estado de ánimo: {clasificar_animo(nuevo_prompt)}")

Estado de ánimo: Alegre


In [41]:
nuevo_prompt = "El vacio se siente en mi"

print(f"Estado de ánimo: {clasificar_animo(nuevo_prompt)}")

Estado de ánimo: Melancólico
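The predictions above are hard labels, so they do not show how confident the classifier is. A hedged sketch of how `predict_proba` could expose the per-class probabilities, again with made-up prompts and a hypothetical `toy_model` instead of the notebook's `modelo_animo`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data (illustrative only)
prompts = ["estoy muy feliz hoy", "que dia tan alegre", "me siento triste", "todo es gris"]
labels = ["Alegre", "Alegre", "Melancólico", "Melancólico"]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(prompts, labels)

# One row of probabilities per input, ordered like toy_model.classes_
probabilities = toy_model.predict_proba(["me siento feliz"])[0]
for label, probability in zip(toy_model.classes_, probabilities):
    print(f"{label}: {probability:.2f}")
```

Probabilities close to each other would signal a borderline prompt, which a single label hides.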


Exporting the model (the export code is left commented out):

In [42]:
# ESTADO_ANIMO_MODEL_PATH: str = os.path.join(os.getcwd(), 'models', 'modelo_estado_animo.pickle')

# pickle.dump(modelo_animo, open(ESTADO_ANIMO_MODEL_PATH, 'wb'))
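A minimal sketch of a pickle round trip that closes the file handles with `with` blocks. The dictionary is a stand-in for a fitted model and the temporary directory is an assumption made so the example runs anywhere; the notebook itself targets a `models` folder:

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object is handled the same way
stand_in_model = {"labels": ["Alegre", "Melancólico", "Ni fu ni fa"]}

# Hypothetical location inside a temporary directory
model_dir = tempfile.mkdtemp()
model_path = os.path.join(model_dir, "modelo_estado_animo.pickle")

# Write the object; the file is closed when the block exits
with open(model_path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Read it back and check that nothing was lost
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```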

# Dataset analysis

## Games dataset analysis
As the output below shows, the games dataset has no null values.

In [43]:
dataset_juegos.isna().sum()

rank                0
game_name           0
game_href           0
geek_rating         0
avg_rating          0
num_voters          0
description         0
yearpublished       0
minplayers          0
maxplayers          0
minplaytime         0
maxplaytime         0
minage              0
avgweight           0
best_num_players    0
designers           0
mechanics           0
categories          0
dtype: int64

## Movies dataset analysis

In [44]:
dataset_peliculas.isna().sum()

Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Revenue (Millions)    0
Metascore             0
dtype: int64

## Books dataset
Of the 5986 books in the dataset, only 53 have a secondary title and only 2796 have a registered author.

In [45]:
dataset_libros.shape

(5986, 5)

Because embedding almost 6000 books exceeds our available compute, we reduce the books dataset to the size of the movies dataset (1000 rows). We also make sure that this reduced dataset only contains books that have a summary.

In [46]:
dataset_libros_resumido = dataset_libros[dataset_libros['Resumen']!= 'Resumen no encontrado']

We also drop duplicate titles.

In [47]:
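# Reassign instead of using inplace=True: the frame is a filtered slice of dataset_libros
dataset_libros_resumido = dataset_libros_resumido.drop_duplicates(subset='Titulo Principal')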

In [48]:
dataset_libros_resumido = dataset_libros_resumido.sample(n = 1000, random_state= 42)

In [88]:
dataset_libros_resumido.reset_index(drop=True, inplace=True)
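The four cells above (filter, deduplicate, sample, reset the index) can also be read as one chain. A sketch on a made-up five-row frame, with `n=2` standing in for the notebook's 1000:

```python
import pandas as pd

# Toy stand-in for dataset_libros (illustrative data only)
toy = pd.DataFrame({
    "Titulo Principal": ["A", "A", "B", "C", "D"],
    "Resumen": ["x", "y", "Resumen no encontrado", "z", "w"],
})

reduced = (
    toy[toy["Resumen"] != "Resumen no encontrado"]  # keep rows that have a summary
    .drop_duplicates(subset="Titulo Principal")      # keep the first row per title
    .sample(n=2, random_state=42)                    # reproducible random subset
    .reset_index(drop=True)                          # labels 0..n-1 match positions
)
print(reduced)
```

Resetting the index last matters because positional lookups with `.iloc` only agree with the index labels once the labels run from 0 to n-1.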

In [50]:
dataset_peliculas.shape

(1000, 12)

In [51]:
dataset_libros.isna().sum()

Titulo Principal     0
Titulo Secundario    0
Autor                0
N° Ref               0
Resumen              0
dtype: int64
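The null counts above are all zero even though the text states that many books lack an author, which suggests that missing values are stored as placeholder strings such as the `'Resumen no encontrado'` value filtered out earlier. A hedged sketch of counting placeholders next to nulls on toy data; the `'Autor no encontrado'` string is an assumption, not a value confirmed by the notebook:

```python
import pandas as pd

# Toy stand-in for dataset_libros (illustrative data only)
toy_books = pd.DataFrame({
    "Titulo Principal": ["A", "B", "C"],
    "Autor": ["Mary Shelley", "Autor no encontrado", "Autor no encontrado"],
    "Resumen": ["some text", "Resumen no encontrado", "more text"],
})

# 'Resumen no encontrado' appears in the notebook; the Autor placeholder is assumed
placeholders = {"Autor": "Autor no encontrado", "Resumen": "Resumen no encontrado"}

null_counts = toy_books.isna().sum()
placeholder_counts = {
    column: int((toy_books[column] == value).sum())
    for column, value in placeholders.items()
}
print(null_counts.to_dict(), placeholder_counts)
```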

# Part 1

## KNN model implementation with a TensorFlow encoder

In [53]:
# Load the multilingual Universal Sentence Encoder
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

In [54]:
def crear_embeding():
    '''
    Builds the embeddings dataset used to fit the model and to look up
    the results afterwards.
    '''
    # Add a positional 'index' and a 'tipo' label to each dataset so that the
    # nearest neighbour can later be retrieved with .iloc.
    dataset_juegos['index'] = np.arange(len(dataset_juegos))
    dataset_juegos['tipo'] = 'juego'
    dataset_libros_resumido['index'] = np.arange(len(dataset_libros_resumido))
    dataset_libros_resumido['tipo'] = 'libro'
    dataset_peliculas['index'] = np.arange(len(dataset_peliculas))
    dataset_peliculas['tipo'] = 'pelicula'

    # Add a 'frase_embeding' column that joins the text to be embedded.

    # dataset_juegos
    dataset_juegos['frase_embeding'] = dataset_juegos.apply(
        lambda row: f"{row['description']}, tipo juego, {row['maxplayers']}", axis=1)

    # dataset_peliculas
    dataset_peliculas['frase_embeding'] = dataset_peliculas.apply(
        lambda row: f"{row['Description']}, tipo pelicula, {row['Genre']}", axis=1)

    # dataset_libros_resumido
    dataset_libros_resumido['frase_embeding'] = dataset_libros_resumido.apply(
        lambda row: f"{row['Titulo Principal']}, tipo libro, {row['Autor']}, {row['Resumen']}", axis=1)

    # Generate the embeddings for each dataset (the encoder takes a list of strings)
    embeding_juegos = embed(dataset_juegos['frase_embeding'].tolist()).numpy()
    embeding_libros = embed(dataset_libros_resumido['frase_embeding'].tolist()).numpy()
    embeding_peliculas = embed(dataset_peliculas['frase_embeding'].tolist()).numpy()

    # Wrap the embeddings in DataFrames
    embeding_juegos_df = pd.DataFrame(embeding_juegos)
    embeding_libros_df = pd.DataFrame(embeding_libros)
    embeding_peliculas_df = pd.DataFrame(embeding_peliculas)

    # Attach 'index' and 'tipo' by position; .to_numpy() avoids misalignment
    # when a source DataFrame does not carry a 0..n-1 index.
    embeding_juegos_df['index'] = dataset_juegos['index'].to_numpy()
    embeding_libros_df['index'] = dataset_libros_resumido['index'].to_numpy()
    embeding_peliculas_df['index'] = dataset_peliculas['index'].to_numpy()
    embeding_juegos_df['tipo'] = dataset_juegos['tipo'].to_numpy()
    embeding_libros_df['tipo'] = dataset_libros_resumido['tipo'].to_numpy()
    embeding_peliculas_df['tipo'] = dataset_peliculas['tipo'].to_numpy()

    # Concatenate all the embeddings into a single table
    df_embedings_totales = pd.concat(
        [embeding_juegos_df, embeding_libros_df, embeding_peliculas_df], ignore_index=True)

    # Save the embeddings DataFrame to a CSV file
    df_embedings_totales.to_csv('./data/embedings_totales.csv', index=False)


In [93]:
crear_embeding()
df_embedings_totales = pd.read_csv('./data/embedings_totales.csv')
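Since compute is the stated bottleneck, one option is to encode the texts in batches so that only a slice is in flight at a time. The sketch below is an assumption rather than part of the notebook: `embed_in_batches` and `fake_embed` are hypothetical names, and `fake_embed` stands in for the encoder. With the real model, `embed_fn` would presumably be something like `lambda batch: embed(batch).numpy()`.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts batch by batch and stack the results into one array."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch)))
    return np.vstack(chunks)

def fake_embed(batch):
    # Stand-in encoder: maps each text to a 3-dimensional vector
    return np.array([[len(t), t.count(" "), 1.0] for t in batch], dtype=float)

texts = [f"text number {i}" for i in range(10)]
vectors = embed_in_batches(texts, fake_embed, batch_size=4)
print(vectors.shape)
```

The row order of the stacked array matches the order of `texts`, so positional lookups remain valid.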

In [56]:
df_embedings_totales

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,504,505,506,507,508,509,510,511,index,tipo
0,-0.047936,0.068728,-0.042762,-0.057224,0.037448,0.066649,0.030122,-0.056059,0.069155,0.025768,...,-0.052796,0.049141,0.003954,0.052749,-0.028988,-0.046117,0.028948,-0.027748,0.0,juego
1,0.018236,0.027856,-0.064104,0.003759,0.034112,0.065483,-0.059687,0.053690,0.049393,0.020000,...,0.007075,-0.033128,-0.062432,-0.061036,-0.027173,-0.065109,0.056215,-0.057410,1.0,juego
2,-0.015504,0.058139,-0.069847,0.024894,0.042606,0.065332,0.001296,0.009519,0.062509,-0.036491,...,-0.022464,-0.062281,0.008048,-0.052787,-0.031792,-0.052469,0.053836,0.005630,2.0,juego
3,-0.008350,0.062483,-0.062257,-0.065673,0.062957,0.065244,-0.044344,-0.048647,0.059848,-0.067839,...,-0.025592,-0.009789,0.067237,0.024409,0.047927,-0.065013,-0.006761,-0.055704,3.0,juego
4,-0.011809,0.061071,-0.035048,-0.056053,0.055633,0.061631,0.034980,-0.050839,0.060246,0.058773,...,-0.053403,0.028869,-0.003057,-0.026638,-0.058163,-0.055203,-0.007861,-0.061118,4.0,juego
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,0.094688,0.005679,0.033949,0.022188,-0.081190,0.032900,-0.024884,0.042659,0.067939,0.010390,...,0.041620,-0.018868,-0.007200,0.002785,0.023341,-0.015587,0.066750,-0.039837,995.0,pelicula
2996,0.012903,-0.076499,0.004223,0.016400,-0.072026,0.040490,0.026115,-0.013240,0.019930,-0.015047,...,0.007965,-0.008668,-0.010873,-0.024861,-0.052551,0.031445,0.042686,-0.038097,996.0,pelicula
2997,-0.020377,-0.004738,0.006126,0.030710,-0.066414,-0.013004,-0.067743,-0.039744,0.031397,0.039048,...,0.017172,0.019270,-0.051282,0.024418,-0.053818,0.012994,0.043990,-0.013857,997.0,pelicula
2998,0.068452,0.009533,0.028528,-0.033708,-0.077274,0.015580,0.000097,0.006803,0.062297,0.007245,...,0.047684,-0.019694,0.029868,-0.028147,-0.044766,-0.007948,0.055236,-0.055687,998.0,pelicula


In [57]:
n_neighbors = 5
modelor_recomendador = make_pipeline(NearestNeighbors(n_neighbors=n_neighbors, metric='cosine', algorithm='brute'))
modelor_recomendador.fit(df_embedings_totales.drop(columns=['index', 'tipo']))
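With `metric='cosine'`, the distance reported by `NearestNeighbors` is one minus the cosine similarity, so vectors that point in the same direction are at distance 0 regardless of their length. A small sketch on made-up 2-D vectors that compares scikit-learn's result with a manual computation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three toy "embeddings" (illustrative only)
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(vectors)

# Same direction as the first vector, but twice as long
query = np.array([[2.0, 0.0]])
distances, indices = knn.kneighbors(query)

# Manual cosine distance: 1 - (a . b) / (|a| * |b|)
manual = 1 - (vectors @ query[0]) / (
    np.linalg.norm(vectors, axis=1) * np.linalg.norm(query[0])
)
print(indices[0], distances[0], manual)
```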

In [58]:
RECOMENDADOR_MODEL_PATH: str = os.path.join(os.getcwd(), 'models', 'modelo_recomendador.pickle')

# Use a context manager so the file handle is closed after writing
with open(RECOMENDADOR_MODEL_PATH, 'wb') as f:
    pickle.dump(modelor_recomendador, f)

In [62]:
animo = clasificar_animo('Quiero estar tranquilo')

In [None]:
user_prompt = f'pelicula de terror psicologico desarrollada en estados unidos, {animo}'

In [95]:
def que_hacer (consulta : str):
    '''
    Prints the n_neighbors recommendations closest to the query
    '''
    # Embed the query as a batch of one sentence
    consulta_embedding = embed([consulta]).numpy()

    # Search for the nearest neighbours
    distances, indices = modelor_recomendador[0].kneighbors(consulta_embedding)

    for j in range(n_neighbors):
        # Row of the j-th nearest neighbour in the embeddings table
        idx = indices[0][j]
        # Cast to int: the CSV round trip can load 'index' as float
        i = int(df_embedings_totales['index'].iloc[idx])
        dataset = df_embedings_totales['tipo'].iloc[idx]

        # Look the title up in the DataFrame the neighbour came from
        if dataset == 'juego':
            vecino = dataset_juegos['game_name'].iloc[i]
        elif dataset == 'libro':
            vecino = dataset_libros_resumido['Titulo Principal'].iloc[i]
        elif dataset == 'pelicula':
            vecino = dataset_peliculas['Title'].iloc[i]

        # Print the neighbour and its distance
        print(f"Vecino {j + 1}: {vecino} - Distancia: {distances[0][j]:.4f} - {dataset}")

    print("-" * 40)

In [78]:
def user():
    user_prompt = f"{input('ingrese qué está buscando')}, {animo}"
    que_hacer(user_prompt)
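In the cell above, `animo` is the value computed earlier from the fixed sentence `'Quiero estar tranquilo'`. One possible variant, sketched here under assumptions, derives the mood from the text the user actually types. `build_query` and `fake_classifier` are hypothetical names; in the notebook the classifier argument would presumably be `clasificar_animo`:

```python
from typing import Callable

def build_query(user_text: str, classify_mood: Callable[[str], str]) -> str:
    """Append the mood predicted for user_text to the search query."""
    mood = classify_mood(user_text)
    return f"{user_text}, {mood}"

def fake_classifier(text: str) -> str:
    # Stand-in for the trained mood classifier (illustrative rule only)
    return "Alegre" if "feliz" in text else "Ni fu ni fa"

query = build_query("pelicula romantica, me siento feliz", fake_classifier)
print(query)
```

Passing the classifier as an argument keeps the function free of global state, which makes it easier to test in isolation.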

In [96]:
df_embedings_totales['tipo'].iloc[1953]

'libro'

In [100]:
user()

Vecino 1: The Longest Ride - Distancia: 0.6167 - pelicula
Vecino 2: The First Time - Distancia: 0.6568 - pelicula
Vecino 3: Tramps - Distancia: 0.6632 - pelicula
Vecino 4: The Best of Me - Distancia: 0.6660 - pelicula
Vecino 5: (500) Days of Summer - Distancia: 0.6734 - pelicula
----------------------------------------
