# UNR - FCEIA 
## Tecnicatura Universitaria en Programación 
### NLP: Practical Assignment No. 1

---

**Team members**
- López Ceratto, Julieta : L-3311/1
- Crenna, Giuliano : C-7438/1

# Import the required libraries

In [1]:
import os
import pandas as pd
import pickle
from typing import List, Dict, Any
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
import tensorflow as tf
import tensorflow_text
import tensorflow_hub as hub
from sklearn.neighbors import NearestNeighbors
import numpy as np

In [2]:
import warnings
warnings.filterwarnings("ignore")

# Loading the datasets

In [3]:
JUEGOS_PATH: str = os.path.join(os.getcwd(), 'data', 'bgg_database.csv')
PELICULAS_PATH: str = os.path.join(os.getcwd(), 'data', 'IMDB-Movie-Data.csv')
LIBROS_PATH: str = os.path.join(os.getcwd(), 'data', 'dataset_libros.csv')

In [4]:
dataset_juegos: pd.DataFrame = pd.read_csv(JUEGOS_PATH)
dataset_peliculas: pd.DataFrame = pd.read_csv(PELICULAS_PATH)
dataset_libros: pd.DataFrame = pd.read_csv(LIBROS_PATH)

# Sentiment Analysis Model

We load a simple labeled dataset, with a `prompt` column and an `estado_animo` (mood) column, to train the classifier.

In [5]:
ESTADOS_ANIMO_PATH: str = os.path.join(os.getcwd(), 'data', 'estados_de_animo.csv')

In [6]:
df_estados_de_animo: pd.DataFrame = pd.read_csv(ESTADOS_ANIMO_PATH)

X: pd.Series = df_estados_de_animo['prompt']
y: pd.Series = df_estados_de_animo['estado_animo']

We split the data into training and test sets (70/30) and build a pipeline that chains a `TfidfVectorizer` with the **MultinomialNB** classifier.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

modelo_animo = make_pipeline(TfidfVectorizer(), MultinomialNB())
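
To make the classifier stage of the pipeline more concrete, here is a minimal sketch of multinomial Naive Bayes written with the standard library only. It is a simplification, not the scikit-learn implementation: it works on raw word counts rather than TF-IDF weights, tokenizes on whitespace, and uses Laplace smoothing with `alpha=1.0`. The function names `train_nb` and `predict_nb` and the toy sentences are hypothetical and are not taken from `estados_de_animo.csv`.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels, alpha=1.0):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    class_counts = Counter(labels)      # label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "class_counts": class_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, text):
    """Return the label with the highest log-posterior for `text`."""
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for token in text.lower().split():
            if token not in model["vocab"]:
                continue  # skip words never seen during training
            # Laplace-smoothed log likelihood of the token given the class
            score += math.log((counts[token] + alpha) /
                              (total_words + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, used only to exercise the sketch
toy_texts = ["me siento feliz hoy", "que dia tan feliz",
             "estoy muy triste", "me siento triste y solo"]
toy_labels = ["Alegre", "Alegre", "Melancólico", "Melancólico"]

toy_model = train_nb(toy_texts, toy_labels)
prediction = predict_nb(toy_model, "hoy estoy feliz")
print(prediction)
```

In the real pipeline the `TfidfVectorizer` stage replaces the whitespace tokenizer and the raw counts, so the numbers differ, but the decision rule (prior plus summed log likelihoods per class) follows the same idea.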

In [8]:
modelo_animo.fit(X_train, y_train)

y_pred = modelo_animo.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

      Alegre       1.00      0.96      0.98        79
 Melancólico       1.00      1.00      1.00        65
 Ni fu ni fa       0.96      1.00      0.98        76

    accuracy                           0.99       220
   macro avg       0.99      0.99      0.99       220
weighted avg       0.99      0.99      0.99       220
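
The figures in the report can be reproduced by hand from a confusion matrix. The notebook does not print the matrix, so the counts in this sketch are an assumption: they are one set of counts consistent with the rounded values and supports above (three "Alegre" prompts predicted as "Ni fu ni fa"). The helper `per_class_metrics` is a hypothetical name.

```python
# Hypothetical confusion matrix (rows = true label, columns = predicted label).
# The counts are an assumption chosen to match the rounded report.
labels = ["Alegre", "Melancólico", "Ni fu ni fa"]
confusion = [
    [76, 0, 3],   # 79 true "Alegre"
    [0, 65, 0],   # 65 true "Melancólico"
    [0, 0, 76],   # 76 true "Ni fu ni fa"
]


def per_class_metrics(matrix, k):
    """Return (precision, recall, f1) for class index k."""
    tp = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # column total
    actual = sum(matrix[k])                    # row total (support)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


report = {label: per_class_metrics(confusion, k)
          for k, label in enumerate(labels)}
correct = sum(confusion[k][k] for k in range(len(labels)))
accuracy = correct / sum(sum(row) for row in confusion)

for label, (p, r, f1) in report.items():
    print(f"{label:>12}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```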



We create a function to classify the user's prompt.

In [9]:
def clasificar_animo(prompt_usuario: str) -> str:
    estado_animo_predicho = modelo_animo.predict([prompt_usuario])[0]
    
    return estado_animo_predicho

In [10]:
nuevo_prompt = "La vida no significa nada"

print(f"Estado de ánimo: {clasificar_animo(nuevo_prompt)}")

Estado de ánimo: Ni fu ni fa


In [11]:
nuevo_prompt = "Me siento feliz"

print(f"Estado de ánimo: {clasificar_animo(nuevo_prompt)}")

Estado de ánimo: Alegre


In [12]:
nuevo_prompt = "El vacio se siente en mi"

print(f"Estado de ánimo: {clasificar_animo(nuevo_prompt)}")

Estado de ánimo: Melancólico


We export the model.

In [13]:
# ESTADO_ANIMO_MODEL_PATH: str = os.path.join(os.getcwd(), 'models', 'modelo_estado_animo.pickle')

# with open(ESTADO_ANIMO_MODEL_PATH, 'wb') as f:
#     pickle.dump(modelo_animo, f)
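
As a reference for the export step, here is a minimal sketch of a pickle round trip using context managers, so that the file handles are closed even if an error occurs. The object `modelo_demo` is a hypothetical stand-in for the fitted pipeline and the file name is made up for the example; a temporary directory is used so that nothing is written to the project folders.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted pipeline; any picklable object is handled the same way.
modelo_demo = {"labels": ["Alegre", "Melancólico", "Ni fu ni fa"], "version": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "modelo_demo.pickle")  # hypothetical file name

    # Write the object to disk
    with open(model_path, "wb") as f:
        pickle.dump(modelo_demo, f)

    # Read it back
    with open(model_path, "rb") as f:
        modelo_cargado = pickle.load(f)

print(modelo_cargado == modelo_demo)
```

With the real model, `modelo_demo` would be replaced by `modelo_animo`. Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same library version that produced it.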

# Dataset Analysis

## Games Dataset Analysis
As the output below shows, the games dataset has no null values.

In [14]:
dataset_juegos.isna().sum()

rank                0
game_name           0
game_href           0
geek_rating         0
avg_rating          0
num_voters          0
description         0
yearpublished       0
minplayers          0
maxplayers          0
minplaytime         0
maxplaytime         0
minage              0
avgweight           0
best_num_players    0
designers           0
mechanics           0
categories          0
dtype: int64

## Movies Dataset Analysis

In [15]:
dataset_peliculas.isna().sum()

Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Revenue (Millions)    0
Metascore             0
dtype: int64

## Books Dataset Analysis
Of the 5984 books in the dataset, only 53 have a secondary title and only 2796 have a registered author.

In [16]:
dataset_libros.shape

(5984, 4)

In [17]:
dataset_libros.isna().sum()

Titulo Principal        0
Titulo Secundario    5931
Autor                3188
N° Ref                  0
dtype: int64
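
The counts quoted for the books dataset follow directly from the null counts above. A short worked example of the arithmetic, using only the numbers printed by `shape` and `isna().sum()` (the variable names are chosen for the example):

```python
total_libros = 5984  # rows reported by dataset_libros.shape
nulos = {"Titulo Secundario": 5931, "Autor": 3188}  # from isna().sum()

# Non-null values per column = total rows minus null count
presentes = {columna: total_libros - n for columna, n in nulos.items()}

for columna, n_presentes in presentes.items():
    share = n_presentes / total_libros
    print(f"{columna}: {n_presentes} non-null values ({share:.1%} of the rows)")
```

Fewer than 1% of the books have a secondary title and a little under half have an author, which is worth keeping in mind when deciding which columns are usable as text features.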

# Part 1

## KNN Model Implementation with a TensorFlow Encoder

In [18]:
# Load the Universal Sentence Encoder
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

In [33]:
len(dataset_juegos)+len(dataset_libros)+len(dataset_peliculas)

7984

In [37]:
dataset_juegos['index_dataset'] = [(i, 'dataset_juegos') for i in dataset_juegos.index]
dataset_libros['index_dataset'] = [(i, 'dataset_libros') for i in dataset_libros.index]
dataset_peliculas['index_dataset'] = [(i, 'dataset_peliculas') for i in dataset_peliculas.index]
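
The `index_dataset` column stores a `(row index, dataset name)` tag for every row, so that a position in the stacked embeddings can later be traced back to the DataFrame it came from. The idea can be sketched with plain lists; the titles and the `resolver` helper are hypothetical and only illustrate the bookkeeping.

```python
# Toy stand-ins for the three DataFrames (hypothetical titles)
catalogos = {
    "dataset_juegos": ["Game A", "Game B"],
    "dataset_libros": ["Book A", "Book B", "Book C"],
    "dataset_peliculas": ["Movie A"],
}

# Tag every row with (row index, dataset name) and stack the tags
# in the same order in which the embeddings are concatenated.
etiquetas = [(i, nombre)
             for nombre, titulos in catalogos.items()
             for i in range(len(titulos))]


def resolver(posicion):
    """Map a position in the stacked list back to the original title."""
    i, nombre = etiquetas[posicion]
    return catalogos[nombre][i]


print(etiquetas)
print(resolver(3))
```

The tags only stay valid as long as the stacking order of the embeddings matches the order of the tags, which is why both are built from the same sequence of datasets.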

In [19]:
embeding_juegos = embed(dataset_juegos['description']).numpy()
embeding_libros = embed(dataset_libros['Titulo Principal']).numpy()
embeding_peliculas = embed(dataset_peliculas['Description']).numpy()

In [38]:
embeding_juegos_df = pd.DataFrame(embeding_juegos)
embeding_libros_df = pd.DataFrame(embeding_libros)
embeding_peliculas_df = pd.DataFrame(embeding_peliculas)
embeding_juegos_df['index_dataset'] = dataset_juegos['index_dataset']
embeding_libros_df['index_dataset'] = dataset_libros['index_dataset']
embeding_peliculas_df['index_dataset'] = dataset_peliculas['index_dataset']

In [20]:
embedings_totales = np.concatenate([embeding_juegos, embeding_libros, embeding_peliculas])

In [39]:
df_embedings_totales = pd.concat([embeding_juegos_df, embeding_libros_df, embeding_peliculas_df])

In [40]:
df_embedings_totales

,0,1,2,3,4,5,6,7,8,9,...,503,504,505,506,507,508,509,510,511,index_dataset
0,-0.048300,0.068703,-0.043667,-0.056702,0.038015,0.066751,0.030109,-0.055992,0.069133,0.024812,...,0.043030,-0.054104,0.049328,0.005198,0.051768,-0.027934,-0.046247,0.027357,-0.028809,"(0, dataset_juegos)"
1,0.017649,0.028976,-0.064173,0.004160,0.035593,0.065500,-0.059804,0.053789,0.048729,0.018230,...,0.024176,0.006184,-0.034213,-0.062382,-0.061460,-0.025917,-0.065143,0.055290,-0.057559,"(1, dataset_juegos)"
2,-0.012979,0.058477,-0.070121,0.028392,0.038904,0.066474,-0.001906,0.008118,0.062959,-0.037352,...,0.029357,-0.029355,-0.061351,0.008848,-0.053755,-0.029861,-0.049717,0.050994,0.001451,"(2, dataset_juegos)"
3,-0.006771,0.062177,-0.062085,-0.065413,0.063442,0.065518,-0.045002,-0.048304,0.059370,-0.067893,...,-0.002811,-0.026211,-0.010782,0.067397,0.021924,0.048966,-0.065105,-0.010564,-0.054653,"(3, dataset_juegos)"
4,-0.011964,0.061035,-0.033034,-0.055925,0.056099,0.061576,0.034298,-0.050790,0.060119,0.058633,...,0.059892,-0.053786,0.027728,-0.001129,-0.026491,-0.058215,-0.055197,-0.010142,-0.061052,"(4, dataset_juegos)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.101780,0.027629,0.031078,0.006868,-0.081075,-0.002862,0.004631,0.047269,0.034093,0.004214,...,-0.062894,0.039775,-0.024407,-0.005727,-0.021108,0.042280,0.014414,0.082030,-0.034652,"(995, dataset_peliculas)"
996,0.029539,-0.097512,0.009582,0.011753,-0.082796,-0.021360,-0.005233,-0.020894,0.034498,-0.035226,...,-0.064742,0.016436,0.010886,-0.023831,-0.028152,-0.050418,0.055626,0.044340,-0.015197,"(996, dataset_peliculas)"
997,-0.054547,-0.012116,0.016270,0.009599,-0.056786,-0.060565,-0.065082,-0.045898,0.009691,0.044520,...,-0.009174,0.004557,0.059068,-0.065523,0.036204,-0.050877,0.040817,0.063030,0.018572,"(997, dataset_peliculas)"
998,0.061582,0.040589,0.054283,-0.035823,-0.081507,-0.059563,0.005052,-0.019061,0.055058,0.000301,...,-0.051153,0.064794,-0.015714,0.033275,-0.052432,-0.027591,0.032127,0.043266,-0.042989,"(998, dataset_peliculas)"


In [49]:
n_neighbors = 5 
knn = NearestNeighbors(n_neighbors=n_neighbors, metric='cosine', algorithm='brute')
knn.fit(df_embedings_totales.drop(columns = 'index_dataset'))
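
With `metric='cosine'` and `algorithm='brute'`, the search compares the query against every stored vector and ranks them by cosine distance, that is, one minus the cosine similarity. The sketch below reproduces that idea with the standard library on two-dimensional toy vectors; the function names and the vectors are hypothetical, and the real embeddings have 512 dimensions.

```python
import math


def cosine_distance(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def brute_force_knn(query, vectors, k):
    """Return the k closest (distance, position) pairs, nearest first."""
    scored = sorted((cosine_distance(query, v), idx)
                    for idx, v in enumerate(vectors))
    return scored[:k]


# Toy vectors standing in for the embeddings
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
neighbors = brute_force_knn([2.0, 0.0], vectors, k=2)

for distance, idx in neighbors:
    print(f"position {idx}: distance {distance:.4f}")
```

Because cosine distance ignores vector length, the query `[2.0, 0.0]` is at distance zero from `[1.0, 0.0]` even though the two vectors differ in magnitude.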

In [50]:
consulta = 'Juego para familia'
consulta = embed(consulta).numpy()

In [51]:
# Search for the nearest neighbors
distances, indices = knn.kneighbors(consulta)

for j in range(n_neighbors):
    # Get the row index and the dataset name of the j-th nearest neighbor
    i, dataset = df_embedings_totales['index_dataset'].iloc[indices[0][j]]  # indices[0][j] is the neighbor's position

    # Look up the title in the DataFrame the neighbor came from
    if dataset == 'dataset_juegos':
        vecino = dataset_juegos['game_name'].iloc[i]
    elif dataset == 'dataset_libros':
        vecino = dataset_libros['Titulo Principal'].iloc[i]
    elif dataset == 'dataset_peliculas':
        vecino = dataset_peliculas['Title'].iloc[i]

    # Print the neighbor and its distance
    print(f"Vecino {j + 1}: {vecino} - Distancia: {distances[0][j]:.4f}")

print("-" * 40)

Vecino 1: Plays - Distancia: 0.5421
Vecino 2: Plays - Distancia: 0.5421
Vecino 3: Plays - Distancia: 0.5421
Vecino 4: Plays - Distancia: 0.5421
Vecino 5: Plays - Distancia: 0.5421
----------------------------------------
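
All five neighbors above share the same title and the same distance, which is what we would expect if the books dataset contained repeated titles. That cause is an assumption; the notebook does not check for duplicates. Under that assumption, one possible remedy is to keep only the closest entry per title. The helper `unique_neighbors` and the two extra candidates are hypothetical and are not results from the notebook.

```python
def unique_neighbors(candidates, k):
    """Keep the closest entry per title and return up to k (title, distance) pairs."""
    seen = set()
    result = []
    for title, distance in sorted(candidates, key=lambda c: c[1]):
        if title in seen:
            continue  # a closer (or equally close) entry with this title was kept
        seen.add(title)
        result.append((title, distance))
        if len(result) == k:
            break
    return result


# "Plays" and its distance come from the output above;
# the other two candidates are made up for illustration.
candidates = [
    ("Plays", 0.5421),
    ("Plays", 0.5421),
    ("Plays", 0.5421),
    ("Hypothetical Game", 0.61),
    ("Hypothetical Movie", 0.65),
]

top = unique_neighbors(candidates, k=3)
for title, distance in top:
    print(f"{title} - distance {distance:.4f}")
```

For this to return `k` distinct titles in practice, more than `k` candidates have to be requested from the search, for example through the `n_neighbors` argument of `kneighbors`. Dropping duplicate titles from the books dataset before computing the embeddings would be an alternative.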
