# Practical Assignment 1: NLP

## Authors: Iair Borgo Elgart, Albano Nardi
#### Universidad Nacional de Rosario
#### Year 2024

# Installation and loading of libraries and models (must be run)

In [None]:
!pip install gliner



In [None]:
!pip install deep_translator



In [None]:
!pip install transformers sentence_transformers



In [None]:
# Import the required libraries

from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from sentence_transformers import SentenceTransformer, util
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import requests
from bs4 import BeautifulSoup
from time import sleep
from gliner import GLiNER
import re
from deep_translator import GoogleTranslator
import pickle
import unicodedata
import jellyfish
import ast


def remove_accents(input_str):
    # Remove accents and other combining marks (e.g. 'é' -> 'e')
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return ''.join([c for c in nfkd_form if not unicodedata.combining(c)])
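As a quick check of what the normalization does, the sketch below applies the same NFKD decomposition to two names that appear in the book data. `normalize_entity` is a hypothetical helper (not part of the notebook) that mirrors the lowercase-then-strip step used later in `return_entidades`.

```python
import unicodedata


def remove_accents(input_str):
    # Decompose characters (NFKD) and drop the combining marks
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return ''.join(c for c in nfkd_form if not unicodedata.combining(c))


def normalize_entity(text):
    # Hypothetical helper: lowercase first, then strip accents
    return remove_accents(text.lower())


print(normalize_entity("Chrétien de Troyes"))
print(normalize_entity("Kun Lü"))
```

Normalizing both the stored entities and the user query this way lets them be compared without worrying about accents or casing.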


In [None]:
# Load the pretrained models

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
entity_model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")
entity_model.eval()

# Labels that the entity model will look for
entity_labels = ['actor', 'book', 'date', 'character', 'location',
          'historical event', 'era', 'person', 'writer', 'famous', 'country',
                 'genre']



def return_entidades(text):
  # Given a text, return a list of its unique entities
  entidades = entity_model.predict_entities(text, entity_labels, threshold=0.4)
  lista = []
  for entidad in entidades:
    lista.append(remove_accents(entidad['text'].lower()))
  return list(set(lista))




# Web scraping and dataset preparation (do not run: it takes a long time)

## Web scraping + books dataset

In [None]:
links = []

# Index page from which the individual book links are collected
link_toscrap = "https://www.gutenberg.org/browse/scores/top1000.php"
response = requests.get(link_toscrap)
soup = BeautifulSoup(response.text, 'html.parser')


In [None]:
for link in soup.find_all('li'):
  # Add each ebook link found on the page, skipping <li> items without a link
  # and links that were already collected
  anchor = link.find('a', href=True)
  if anchor and "ebooks" in anchor['href'] and anchor['href'] not in links:
    links.append(anchor['href'])

In [None]:
link_toscrap = "https://www.gutenberg.org"

books = []

for link in links[3:]:
  # Iterate over every link and scrape the author, title and summary.
  # Any field that is missing or laid out differently is stored as None
  response = requests.get(link_toscrap + link)
  soup = BeautifulSoup(response.text, 'html.parser')
  title, author, summary = None, None, None

  for box in soup.find_all('table', {'class': 'bibrec'}):
    for x in (box.find_all('tr')):
      if x.th and x.th.text == 'Title':
        title = x.td.text.replace('\n', '').strip()
      elif x.th and x.th.text == 'Author':
        author = x.td.text.replace('\n', '').strip()
      elif x.th and x.th.text == 'Summary':
        summary = x.td.text.replace('\n', '').strip()

    books.append([title, author, summary])
  sleep(0.1)

In [None]:
# Convert the list into a DataFrame
df = pd.DataFrame(books, columns=['Title', 'Author', 'Summary'])

In [None]:
df

Unnamed: 0,Title,Author,Summary
0,呻吟語,"Lü, Kun, 1536-1618","""呻吟語"" by Kun Lü is a philosophical treatise wr..."
1,"Frankenstein; Or, The Modern Prometheus","Shelley, Mary Wollstonecraft, 1797-1851","""Frankenstein; Or, The Modern Prometheus"" by M..."
2,"Moby Dick; Or, The Whale","Melville, Herman, 1819-1891","""Moby Dick; Or, The Whale"" by Herman Melville ..."
3,Romeo and Juliet,"Shakespeare, William, 1564-1616","""Romeo and Juliet"" by William Shakespeare is a..."
4,Pride and Prejudice,"Austen, Jane, 1775-1817","""Pride and Prejudice"" by Jane Austen is a clas..."
...,...,...,...
1252,The Financier: A Novel,"Dreiser, Theodore, 1871-1945","""The Financier: A Novel"" by Theodore Dreiser i..."
1253,"The Philippine Islands, 1493-1803 — Volume 05 ...",,"""The Philippine Islands, 1493-1803 — Volume 05..."
1254,Crump folk going home,"Holme, Constance, 1880-1955",
1255,Four Arthurian Romances,"Chrétien, de Troyes, active 12th century","""Four Arthurian Romances"" by Chrétien de Troye..."


In [None]:
# Drop duplicates (in case any slipped through) and rows containing None
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

In [None]:

df['Summary'] = df['Summary'].str.replace('\n', '')

In [None]:
# Extract the entities from each summary and add them to the DataFrame
entidades_encontradas = []
for summary in df['Summary']:
  entidades_encontradas.append(return_entidades(summary))

df['Entidades'] = entidades_encontradas



In [None]:
df

Unnamed: 0,Title,Author,Summary,Entidades
0,呻吟語,"Lü, Kun, 1536-1618",by Kun Lü is a philosophical treatise written...,"[kun lu, late 16th century]"
1,"Frankenstein; Or, The Modern Prometheus","Shelley, Mary Wollstonecraft, 1797-1851",by Mary Wollstonecraft Shelley is a novel wri...,"[arctic, victor frankenstein, robert walton, e..."
2,"Moby Dick; Or, The Whale","Melville, Herman, 1819-1891",by Herman Melville is a novel written in the ...,"[mid-19th century, queequeg, new bedford, tatt..."
3,Romeo and Juliet,"Shakespeare, William, 1564-1616",by William Shakespeare is a tragedy likely wr...,"[feud, juliet capulet, tybalt, juliet, benvoli..."
4,Pride and Prejudice,"Austen, Jane, 1775-1817",by Jane Austen is a classic novel written in ...,"[regency england, netherfield park, early 19th..."
...,...,...,...,...
1248,Brown Wolf and Other Jack London StoriesChosen...,"London, Jack, 1876-1916",by Jack London is a collection of short stori...,"[walt irvine, late 19th century, madge, wolf, ..."
1250,Sir Gawain and the Green Knight: A Middle-Engl...,"Weston, Jessie L. (Jessie Laidlay), 1850-1928",by Jessie L. Weston is a retelling of a class...,"[king arthur, green knight, sir gawain, gawain..."
1252,The Financier: A Novel,"Dreiser, Theodore, 1871-1945",by Theodore Dreiser is a fictional work writt...,"[late 19th century, frank, frank algernon cowp..."
1255,Four Arthurian Romances,"Chrétien, de Troyes, active 12th century",by Chrétien de Troyes is a collection of medi...,"[enide, erec, king arthur, lancelot, cliges, 1..."


In [None]:
# Save the CSV
df.to_csv('books.csv', index=False)

In [None]:
df  = pd.read_csv('books.csv')

In [None]:
# Embed the book summaries and save the embeddings
book_summary_embed = embedding_model.encode(df['Summary'].values)
book_summary_embed = pd.DataFrame(book_summary_embed)
book_summary_embed.to_csv('book_summary_embed.csv', index=False)

## Embeddings for the other DataFrames

In [None]:
# Repeat the same steps for the remaining datasets

In [None]:
df_boardgames = pd.read_csv('bgg_database.csv')
df_boardgames.head()

Unnamed: 0,rank,game_name,game_href,geek_rating,avg_rating,num_voters,description,yearpublished,minplayers,maxplayers,minplaytime,maxplaytime,minage,avgweight,best_num_players,designers,mechanics,categories
0,1,Brass: Birmingham,https://boardgamegeek.com/boardgame/224517/bra...,8.415,8.6,46836.0,Brass: Birmingham is an economic strategy game...,2018,2,4,60,120,14,3.8776,"[{'min': 3, 'max': 4}]","['Gavan Brown', 'Matt Tolman', 'Martin Wallace']","['Hand Management', 'Income', 'Loans', 'Market...","['Age of Reason', 'Economic', 'Industry / Manu..."
1,2,Pandemic Legacy: Season 1,https://boardgamegeek.com/boardgame/161936/pan...,8.377,8.53,53807.0,Pandemic Legacy is a co-operative campaign gam...,2015,2,4,60,60,13,2.8308,"[{'min': 4, 'max': 4}]","['Rob Daviau', 'Matt Leacock']","['Action Points', 'Cooperative Game', 'Hand Ma...","['Environmental', 'Medical']"
2,3,Gloomhaven,https://boardgamegeek.com/boardgame/174430/glo...,8.349,8.59,62592.0,Gloomhaven is a game of Euro-inspired tactica...,2017,1,4,60,120,14,3.9132,"[{'min': 3, 'max': 3}]",['Isaac Childres'],"['Action Queue', 'Action Retrieval', 'Campaign...","['Adventure', 'Exploration', 'Fantasy', 'Fight..."
3,4,Ark Nova,https://boardgamegeek.com/boardgame/342942/ark...,8.335,8.54,44728.0,"In Ark Nova, you will plan and design a modern...",2021,1,4,90,150,14,3.7653,"[{'min': 2, 'max': 2}]",['Mathias Wigge'],"['Action Queue', 'End Game Bonuses', 'Grid Cov...","['Animals', 'Economic', 'Environmental']"
4,5,Twilight Imperium: Fourth Edition,https://boardgamegeek.com/boardgame/233078/twi...,8.24,8.6,24148.0,Twilight Imperium (Fourth Edition) is a game o...,2017,3,6,240,480,14,4.3173,"[{'min': 6, 'max': 6}]","['Dane Beltrami', 'Corey Konieczka', 'Christia...","['Action Drafting', 'Area-Impulse', 'Dice Roll...","['Civilization', 'Economic', 'Exploration', 'N..."


In [None]:
df_boardgames.drop(columns = ['rank', 'geek_rating', 'avg_rating', 'num_voters',
                              'yearpublished', 'minplayers', 'maxplayers',
                              'minplaytime', 'maxplaytime', 'minage', 'avgweight', 'best_num_players'],
                   inplace = True)

# Rename game_name to Title
df_boardgames.rename(columns={'game_name': 'Title'}, inplace=True)

In [None]:
df_boardgames.drop_duplicates(inplace=True)
df_boardgames.dropna(inplace=True)

In [None]:
entidades_encontradas = []

for summary in df_boardgames['description']:

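  # Split at the first space at or after the midpoint of the description;
  # if there is no such space (find returns -1), cut at the exact midpoint
  mitad = len(summary) // 2
  corte = summary.find(' ', mitad)
  if corte == -1:
    corte = mitad
  parte_1 = summary[:corte]
  parte_2 = summary[corte:]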

  parte_1 = return_entidades(parte_1)
  parte_2 = return_entidades(parte_2)

  parte_1.extend(parte_2)
  parte_1 = list(set(parte_1))

  entidades_encontradas.append(parte_1)

df_boardgames['Entidades'] = entidades_encontradas
df_boardgames.to_csv('boardgames.csv', index=False)
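The halving step can be isolated into a small function. A minimal sketch, assuming a hypothetical helper `split_near_middle` that is not part of the notebook: `str.find` returns -1 when no space exists after the midpoint, and slicing with -1 would leave all but the last character in the first half, so the sketch falls back to the exact midpoint in that case.

```python
def split_near_middle(text):
    # Hypothetical helper: cut at the first space at or after the midpoint
    middle = len(text) // 2
    cut = text.find(' ', middle)
    if cut == -1:
        # No space after the midpoint: cut at the midpoint itself
        cut = middle
    return text[:cut], text[cut:]


sample = "Brass: Birmingham is an economic strategy game"
first, second = split_near_middle(sample)
print(repr(first), repr(second))
print(split_near_middle("Gloomhaven"))
```

Both halves always concatenate back to the original string, so no text is lost before entity extraction.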



In [None]:
df_boardgames_embed = embedding_model.encode(df_boardgames['description'].values)
df_boardgames_embed = pd.DataFrame(df_boardgames_embed)
df_boardgames_embed.to_csv('df_boardgames_embed.csv', index=False)

In [None]:
df_movies = pd.read_csv('IMDB-Movie-Data.csv')
df_movies.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40


In [None]:
df_movies.drop(columns = ['Rank', 'Genre', 'Year', 'Runtime (Minutes)',
                          'Rating', 'Votes', 'Revenue (Millions)', 'Metascore'],
               inplace = True)

In [None]:
df_movies.drop_duplicates(inplace=True)
df_movies.dropna(inplace=True)

In [None]:
entidades_encontradas = []
for summary in df_movies['Description']:
  entidades_encontradas.append(return_entidades(summary))

# Add the director and the actors (comma-separated) to entidades_encontradas
for i in range(len(entidades_encontradas)):

  # .iloc is positional, so it still works after rows have been dropped
  director_lower = df_movies['Director'].iloc[i].lower()
  entidades_encontradas[i].extend(remove_accents(x.strip()) for x in director_lower.split(','))

  # Some actor lists use ',' without a space, so split on ',' and strip
  actors_lower = df_movies['Actors'].iloc[i].lower()
  entidades_encontradas[i].extend(remove_accents(x.strip()) for x in actors_lower.split(','))

df_movies['Entidades'] = entidades_encontradas

In [None]:
df_movies.to_csv('movies.csv', index=False)

In [None]:
df_movies

Unnamed: 0,Title,Description,Director,Actors,Entidades
0,Guardians of the Galaxy,A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...","[fanatical warrior, universe, james gunn, chri..."
1,Prometheus,"Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...","[distant moon, ridley scott, noomi rapace, log..."
2,Split,Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...","[three girls, m. night shyamalan, james mcavoy..."
3,Sing,"In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...","[hustling theater impresario, theater, city of..."
4,Suicide Squad,A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...","[world, apocalypse, david ayer, will smith, ja..."
...,...,...,...,...,...
995,Secret in Their Eyes,"A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...","[supervisor, one of their own teenage daughter..."
996,Hostel: Part II,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...","[three american college students, slovakian ho..."
997,Step Up 2: The Streets,Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...","[dance students, maryland school of the arts, ..."
998,Search Party,A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...","[pal, woman, a pair of friends, scot armstrong..."


In [None]:
df_movies = pd.read_csv('movies.csv')
df_movies_embed = embedding_model.encode(df_movies['Description'].values)
df_movies_embed = pd.DataFrame(df_movies_embed)
df_movies_embed.to_csv('df_movies_embed.csv', index=False)

# Exercise 1

## Dataset preparation

For this exercise I will use the GoEmotions dataset, published by Google, which contains comments scraped from Reddit and labeled by hand. Since the dataset is only available in English, I will train the model in that language and translate the user's comment instead. The reverse would also be possible, but I believe that translating close to 30 thousand internet comments would be more expensive and would also introduce more errors.

In [None]:
# The emotions, in label order, are:
emociones = [
    "admiración",
    "diversión",
    "enojo",
    "molestia",
    "aprobación",
    "cariño",
    "confusión",
    "curiosidad",
    "deseo",
    "decepción",
    "desaprobación",
    "asco",
    "vergüenza",
    "emoción",
    "miedo",
    "gratitud",
    "dolor",
    "alegría",
    "amor",
    "nerviosismo",
    "optimismo",
    "orgullo",
    "realización",
    "alivio",
    "remordimiento",
    "tristeza",
    "sorpresa",
    "neutral"
]

In [None]:
link = 'hf://datasets/google-research-datasets/go_emotions/'

splits = {'train': 'simplified/train-00000-of-00001.parquet',
          'validation': 'simplified/validation-00000-of-00001.parquet',
          'test': 'simplified/test-00000-of-00001.parquet'}

df_train = pd.read_parquet(link + splits["train"])
df_test = pd.read_parquet(link + splits["test"])
df_val = pd.read_parquet(link + splits["validation"])

In [None]:
df_train

Unnamed: 0,text,labels,id
0,My favourite food is anything I didn't have to...,[27],eebbqej
1,"Now if he does off himself, everyone will thin...",[27],ed00q6i
2,WHY THE FUCK IS BAYLESS ISOING,[2],eezlygj
3,To make her feel threatened,[14],ed7ypvh
4,Dirty Southern Wankers,[3],ed0bdzj
...,...,...,...
43405,Added you mate well I’ve just got the bow and ...,[18],edsb738
43406,Always thought that was funny but is it a refe...,[6],ee7fdou
43407,What are you talking about? Anything bad that ...,[3],efgbhks
43408,"More like a baptism, with sexy results!",[13],ed1naf8


To simplify training, we keep only a subset of the emotions.

In [None]:
# mlb converts the label lists to one-hot vectors (an observation can have more than one label)
mlb = MultiLabelBinarizer()
df_train_sparse = pd.DataFrame(mlb.fit_transform(df_train['labels']))
df_test_sparse = pd.DataFrame(mlb.transform(df_test['labels']))
df_val_sparse = pd.DataFrame(mlb.transform(df_val['labels']))

# Name the columns
df_train_sparse.columns = emociones
df_test_sparse.columns = emociones
df_val_sparse.columns = emociones
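To make the encoding concrete, here is a dependency-free sketch of the multi-hot matrix that `MultiLabelBinarizer` builds from integer label lists. `multi_hot` is a hypothetical helper and the toy label lists are made up for illustration; the notebook itself relies on scikit-learn.

```python
def multi_hot(label_lists):
    # Sorted set of every label seen; one output column per label
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: idx for idx, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        rows.append(row)
    return classes, rows


classes, rows = multi_hot([[27], [2, 3], [3]])
print(classes)
for row in rows:
    print(row)
```

Because the columns follow the sorted label ids, assigning `emociones` as column names only lines up if every id from 0 to 27 appears in the training split.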

In [None]:
df_train_sparse.mean().sort_values(ascending=False)

Unnamed: 0,0
neutral,0.327551
admiración,0.095139
aprobación,0.067703
gratitud,0.061322
molestia,0.056899
diversión,0.053628
curiosidad,0.050472
amor,0.048053
desaprobación,0.046579
optimismo,0.03642


In [None]:
# Keep only some of the emotions to narrow the search
emociones = ['neutral', 'admiración', 'diversión',
             'amor', 'enojo', 'alegría',
             'tristeza', 'remordimiento']

df_train_sparse = df_train_sparse[emociones]
df_test_sparse = df_test_sparse[emociones]
df_val_sparse = df_val_sparse[emociones]

In [None]:
# Drop the rows that contain only zeros
df_train_sparse = df_train_sparse[df_train_sparse.sum(axis=1) > 0]
df_test_sparse = df_test_sparse[df_test_sparse.sum(axis=1) > 0]
df_val_sparse = df_val_sparse[df_val_sparse.sum(axis=1) > 0]

In [None]:
x = df_train['text']
y = df_train_sparse

x_test = df_test['text']
x_val = df_val['text']
y_test = df_test_sparse
y_val = df_val_sparse

# Keep only the rows of x whose index is still present in y
x = x[y.index]
x_test = x_test[y_test.index]
x_val = x_val[y_val.index]

# Merge test and validation, since the validation split is not used on its own
x_test = pd.concat([x_test, x_val])
y_test = pd.concat([y_test, y_val])

In [None]:
embedding = embedding_model.encode(x.values)
embedding_test = embedding_model.encode(x_test.values)

In [None]:
# Wrap the embeddings in DataFrames (one column per embedding dimension)
x_trainembed = pd.DataFrame(embedding)
x_testembed = pd.DataFrame(embedding_test)

## Training a multi-label logistic regression

In [None]:
# MultiOutputClassifier trains several logistic regression models
# (one per emotion)
lr_model = MultiOutputClassifier(LogisticRegression()).fit(x_trainembed, y)

In [None]:
first_report_LR = metrics.classification_report(y_test, lr_model.predict(x_testembed))

print("Reporte de clasificación Regresión Logística:\n", first_report_LR)


Reporte de clasificación Regresión Logística:
               precision    recall  f1-score   support

           0       0.80      0.85      0.83      3553
           1       0.74      0.47      0.58       992
           2       0.84      0.59      0.70       567
           3       0.77      0.49      0.60       490
           4       0.69      0.24      0.36       393
           5       0.70      0.21      0.33       333
           6       0.64      0.28      0.39       299
           7       0.69      0.29      0.41       124

   micro avg       0.79      0.65      0.71      6751
   macro avg       0.74      0.43      0.52      6751
weighted avg       0.77      0.65      0.68      6751
 samples avg       0.65      0.66      0.65      6751





In [None]:
# Precision is good, but recall is very low for most emotions.
# Let's see whether this can be improved

predictions = lr_model.predict_proba(x_testembed)

In [None]:
labels = y_test.columns.values
num_items, num_labels = len(y_test), len(labels)

In [None]:
# predict_proba returns the predictions label by label.
# To work observation by observation, transpose them into an
# (observations x labels) matrix
y_probas_all = np.zeros((num_items, num_labels), dtype=float)
for i, item_probas in enumerate(lr_model.predict_proba(x_testembed)):
    for j, item_proba in enumerate(item_probas):
        y_probas_all[j, i] = item_proba[1]

y_targets_all = y_test.values

In [None]:
# Function that computes all the metrics for one label
def calc_label_metrics(label, y_targets, y_preds, threshold):
    return {
        "label": label,
        "accuracy": metrics.accuracy_score(y_targets, y_preds),
        "precision": metrics.precision_score(y_targets, y_preds, zero_division=0),
        "recall": metrics.recall_score(y_targets, y_preds, zero_division=0),
        "f1": metrics.f1_score(y_targets, y_preds, zero_division=0),
        "support": y_targets.sum(),
        "threshold": threshold,
    }

In [None]:
threshold_results = {}
# Sweep the model's threshold from 0.05 to 0.99 in steps of 0.01 and store
# the results obtained at each value
for t in range(5, 100, 1):
    threshold = t / 100
    y_preds_all = (y_probas_all > threshold).astype(int)
    threshold_results[threshold] = []
    for label_index, label in enumerate(labels):
        y_targets, y_preds = y_targets_all[:, label_index], y_preds_all[:, label_index]
        threshold_results[threshold].append(calc_label_metrics(label, y_targets, y_preds, threshold))

In [None]:
metric_name = "f1"
best = {label: {metric_name: -1, "result": None} for label in labels}
# Then walk through the dictionary looking for the best F1 value for each
# label and keep the best result together with its threshold
for threshold, results in threshold_results.items():
    for result in results:
        label = result["label"]
        if result[metric_name] > best[label][metric_name]:
            best[label] = {metric_name: result[metric_name], "result": result}

results = [b["result"] for b in best.values()]
per_label_threshold_results = pd.DataFrame(results, index=[result["label"] for result in results])
display(per_label_threshold_results.drop(columns=["label"]).round(3))

Unnamed: 0,accuracy,precision,recall,f1,support,threshold
neutral,0.802,0.782,0.883,0.83,3553,0.45
admiración,0.893,0.649,0.659,0.654,992,0.31
diversión,0.961,0.76,0.801,0.78,567,0.26
amor,0.952,0.675,0.696,0.685,490,0.28
enojo,0.941,0.515,0.534,0.524,393,0.22
alegría,0.943,0.452,0.535,0.49,333,0.19
tristeza,0.954,0.501,0.562,0.53,299,0.22
remordimiento,0.986,0.631,0.621,0.626,124,0.17
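The per-label search above can be reduced to a few lines for a single label. The following is a minimal, dependency-free sketch of the same idea: sweep the threshold from 0.05 to 0.99 and keep the first value with the best F1. The helper names `f1_score` and `best_threshold` and the toy data are made up for illustration; the notebook uses `sklearn.metrics` instead.

```python
def f1_score(targets, preds):
    # Binary F1 from true positives, false positives and false negatives
    tp = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(targets, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(targets, probas):
    # Sweep 0.05 .. 0.99 in steps of 0.01; keep the first threshold with the best F1
    best_t, best_f1 = None, -1.0
    for t in range(5, 100):
        threshold = t / 100
        preds = [int(p > threshold) for p in probas]
        score = f1_score(targets, preds)
        if score > best_f1:
            best_t, best_f1 = threshold, score
    return best_t, best_f1


# Toy data (made up): a rare label whose positives get modest probabilities
targets = [1, 1, 1, 0, 0, 0, 0, 0]
probas = [0.45, 0.30, 0.22, 0.25, 0.10, 0.08, 0.05, 0.02]
threshold, score = best_threshold(targets, probas)
print(threshold, round(score, 3))
```

For rare labels the best threshold tends to fall well below 0.5, which matches the table above. A common precaution is to run this sweep on a validation split so that the test metrics are not used for tuning.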


In [None]:
def predict_custom(predictions):

  # Same as predict, but using the per-label thresholds found above.
  # If no emotion is predicted, each probability is divided by its
  # threshold and the label with the highest ratio (the one closest to its
  # threshold) is chosen. Finally, to avoid inconsistencies, neutral is
  # dropped whenever any other emotion is predicted: a person cannot be
  # neutral and feel something else at the same time

  emociones = ['neutral', 'admiración', 'diversión',
             'amor', 'enojo', 'alegría',
             'tristeza', 'remordimiento']

  threshold = [0.45, 0.31, 0.26, 0.28, 0.22, 0.19, 0.22, 0.17]

  prediction = []
  for i in range(len(predictions[0])):
    og_preds = []
    preds = []
    for j in range(len(predictions)):
      og_preds.append(predictions[j][i][1])
      pred = (predictions[j][i][1] > threshold[j]).astype(int)
      preds.append(pred)

    if sum(preds) == 0:
      for pred in range(len(og_preds)):
        og_preds[pred] = og_preds[pred] / threshold[pred]
      preds[np.argmax(og_preds)] = 1

    if sum(preds[1:]) > 0:
      preds[0] = 0
    prediction.append(preds)

  return np.array(prediction)
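The decision rule is easier to follow for a single observation. A minimal sketch, assuming a hypothetical helper `decide` that takes the positive-class probability of each label directly (rather than the nested `predict_proba` output); the probabilities are made up.

```python
def decide(probas, thresholds):
    # probas: positive-class probability per label; index 0 is 'neutral'
    preds = [int(p > t) for p, t in zip(probas, thresholds)]

    if sum(preds) == 0:
        # Nothing passed its threshold: pick the label closest to its own threshold
        ratios = [p / t for p, t in zip(probas, thresholds)]
        preds[ratios.index(max(ratios))] = 1

    if sum(preds[1:]) > 0:
        # Neutral cannot coexist with another emotion
        preds[0] = 0
    return preds


thresholds = [0.45, 0.31, 0.26, 0.28, 0.22, 0.19, 0.22, 0.17]

# Made-up probabilities covering the three branches
print(decide([0.50, 0.40, 0.05, 0.05, 0.05, 0.30, 0.05, 0.05], thresholds))
print(decide([0.30, 0.10, 0.05, 0.05, 0.05, 0.18, 0.05, 0.05], thresholds))
print(decide([0.60, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05], thresholds))
```

The first call drops neutral because two other emotions fire, the second falls back to the label with the highest probability-to-threshold ratio, and the third keeps neutral on its own.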


In [None]:
new_report_LR = metrics.classification_report(y_test, predict_custom(lr_model.predict_proba(x_testembed)))

print("Reporte de clasificación Regresión Logística con nueva funcion:\n", new_report_LR)

# Macro-average F1 rises from 0.52 to 0.65, at the cost of some
# precision (0.74 -> 0.62). The thresholds were tuned on this same
# test set, so the improvement is likely somewhat optimistic

Reporte de clasificación Regresión Logística con nueva funcion:
               precision    recall  f1-score   support

           0       0.84      0.84      0.84      3553
           1       0.63      0.71      0.67       992
           2       0.74      0.81      0.77       567
           3       0.66      0.73      0.69       490
           4       0.50      0.55      0.52       393
           5       0.45      0.57      0.50       333
           6       0.49      0.59      0.53       299
           7       0.62      0.64      0.63       124

   micro avg       0.72      0.76      0.74      6751
   macro avg       0.62      0.68      0.65      6751
weighted avg       0.73      0.76      0.74      6751
 samples avg       0.74      0.77      0.75      6751



In [None]:
print("Reporte de clasificación Regresión Logística con funcion default:\n", first_report_LR)

Reporte de clasificación Regresión Logística con funcion default:
               precision    recall  f1-score   support

           0       0.80      0.85      0.83      3553
           1       0.74      0.47      0.58       992
           2       0.84      0.59      0.70       567
           3       0.77      0.49      0.60       490
           4       0.69      0.24      0.36       393
           5       0.70      0.21      0.33       333
           6       0.64      0.28      0.39       299
           7       0.69      0.29      0.41       124

   micro avg       0.79      0.65      0.71      6751
   macro avg       0.74      0.43      0.52      6751
weighted avg       0.77      0.65      0.68      6751
 samples avg       0.65      0.66      0.65      6751



In [None]:
# Save the model
with open('multioutput_emotion_logclass.pkl','wb') as f:
    pickle.dump(lr_model,f)


## Asking the user to enter how they feel

## Functional test

In [None]:
with open('multioutput_emotion_logclass.pkl','rb') as f:
    sentiment_model = pickle.load(f)

In [None]:
user_sentiment = input("Como te sentis? 😊: ")
user_sentiment = user_sentiment.lower().strip()
translated_sentiment = GoogleTranslator(source='auto', target='en').translate(user_sentiment)
print(translated_sentiment)

Como te sentis? 😊: fue el mejor dia de mi vida, la pase bien en el parque de diversiones
It was the best day of my life, I had a good time at the amusement park


In [None]:
embedding = embedding_model.encode(translated_sentiment)


In [None]:
probs = sentiment_model.predict_proba(embedding.reshape(1, -1))

In [None]:
emociones = ['neutral', 'admiración', 'diversión',
             'amor', 'enojo', 'alegría',
             'tristeza', 'remordimiento']

In [None]:
predict = predict_custom(probs)

In [None]:
predict

array([[0, 1, 0, 0, 0, 1, 0, 0]])

In [None]:
tags = []
for i in range(len(predict[0])):
  if predict[0][i] == 1:
    print(emociones[i])
    tags.append(emociones[i])

admiración
alegría


# Exercise 2

In [None]:
def is_match(entidad, lista_entidades, threshold = 0.8):
  # Since actors and other names have to be recognized, this function
  # uses Jaro-Winkler similarity to tolerate typos

  # ast.literal_eval turns the string stored in the DataFrames back into a list:
  # "['entidad1', 'entidad2', ...]" -> ['entidad1', 'entidad2', ...]

  lista_entidades = ast.literal_eval(lista_entidades)
  for ent in lista_entidades:
    distancia = jellyfish.jaro_winkler_similarity(entidad, ent)
    if distancia >= threshold:
      return True
  return False


def check_entidad(lista_entidades, df):
  # One counter per row of the DataFrame
  suma = [0] * len(df['Entidades'])

  # Add 1 for every query entity that matches (val is a boolean)
  for entidad in lista_entidades:
    for pos, val in enumerate(df['Entidades'].apply(lambda x: is_match(entidad, x))):
      suma[pos] += val
  return suma

def calcular_similitud_logodds(user_query_Embed, embed_df):
  # Map the cosine similarity to the range 0-1 (like a probability)
  # and then convert it to log-odds so it spans the whole real line

  # Compute the similarity
  similitud = util.cos_sim(user_query_Embed, embed_df)

  # Rescale the similarity from [-1, 1] to [0, 1]
  similitud = (similitud + 1) / 2

  # Convert to log-odds
  similitud = np.log(similitud / (1 - similitud))

  return similitud[0].tolist() # similitud is a tensor, so return a plain list

def suma_entidad(similitud_logodds, lista_entidades, alfa = 3):
    # alfa is added to the log-odds for every matching entity
    resultado = [logodd + alfa * nro_entidades for logodd, nro_entidades in zip(similitud_logodds, lista_entidades)]

    return resultado
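The scoring used by `calcular_similitud_logodds` and `suma_entidad` can be checked by hand on scalar values. A minimal sketch using only the standard library: `cosine_to_logodds` and `score` are hypothetical helpers, and the `eps` clamp is an assumption added here so the logarithm stays finite when the similarity is exactly 1 or -1 (the notebook code does not clamp).

```python
import math


def cosine_to_logodds(cosine, eps=1e-6):
    # Rescale [-1, 1] to [0, 1], then clamp away from 0 and 1
    # (eps is an assumption of this sketch, not part of the notebook)
    p = (cosine + 1) / 2
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def score(cosine, n_matches, alfa=3):
    # Each matching entity adds alfa to the log-odds
    return cosine_to_logodds(cosine) + alfa * n_matches


# Made-up similarities: a close description with no entity match versus
# a weaker one that mentions an entity from the query
print(round(score(0.60, 0), 3))
print(round(score(0.20, 1), 3))
```

With `alfa = 3`, a single entity match outweighs a sizeable gap in cosine similarity, so items that mention an entity from the query tend to dominate the ranking.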


## Functional test

In [None]:
df_books = pd.read_csv('books.csv')
df_movies = pd.read_csv('movies.csv')
df_boardgames = pd.read_csv('boardgames.csv')

book_summary_embed = pd.read_csv('book_summary_embed.csv').to_numpy().astype(np.float32)
df_movies_embed = pd.read_csv('df_movies_embed.csv').to_numpy().astype(np.float32)
df_boardgames_embed = pd.read_csv('df_boardgames_embed.csv').to_numpy().astype(np.float32)

In [None]:
user_query = input("Ingrese que temática le gustaría probar: ")

user_query = user_query.strip().lower()
translated_query = GoogleTranslator(source='auto', target='en').translate(user_query)

user_query_embed = embedding_model.encode(translated_query)

entidades_lista = return_entidades(translated_query)

recomendaciones = []

# Pair each DataFrame with its own embedding matrix
for df, embed in zip([df_books, df_movies, df_boardgames],
                     [book_summary_embed, df_movies_embed, df_boardgames_embed]):
  entidades = check_entidad(entidades_lista, df)
  logodds = calcular_similitud_logodds(user_query_embed, embed)
  logodds = suma_entidad(logodds, entidades)

  recomendaciones.append(np.argpartition(logodds,-3)[-3:])

print('\n')
for recomendacion, df in zip(recomendaciones, [df_books, df_movies, df_boardgames]):
  for rec in recomendacion:
    print(df.iloc[rec]['Title'])
  print('\n')




Ingrese que temática le gustaría probar: una pelota de futbol rompiendo la estrella de la muerte, protagonizada por matt damon


The United States Constitution
The Expedition of Humphry Clinker
The Castle of Otranto


Gold
Il racconto dei racconti - Tale of Tales
Ocean's Thirteen


Apiary
Star Wars: Rebellion
Star Wars: Imperial Assault
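`np.argpartition` only guarantees that the last three positions hold the three highest scores, not that they are sorted, so the titles above are not necessarily listed best first. A minimal sketch of an ordered alternative, using the standard-library `heapq` instead of NumPy; `top_k_indices` is a hypothetical helper and the scores are made up.

```python
import heapq


def top_k_indices(scores, k=3):
    # Indices of the k highest scores, best first
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])


print(top_k_indices([0.2, 3.4, -1.0, 1.4, 2.9]))
```

The returned indices can be passed to `df.iloc` in the same way as the `argpartition` result.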




# Final program

## Only needed if nothing from Exercises 1 and 2 was run

## Loading the functions

In [None]:
emociones = ['neutral', 'admiración', 'diversión',
             'amor', 'enojo', 'alegría',
             'tristeza', 'remordimiento']

def is_match(entidad, lista_entidades, threshold = 0.8):
  # Since actors and other names have to be recognized, this function
  # uses Jaro-Winkler similarity to tolerate typos

  # ast.literal_eval turns the string stored in the DataFrames back into a list:
  # "['entidad1', 'entidad2', ...]" -> ['entidad1', 'entidad2', ...]

  lista_entidades = ast.literal_eval(lista_entidades)
  for ent in lista_entidades:
    distancia = jellyfish.jaro_winkler_similarity(entidad, ent)
    if distancia >= threshold:
      return True
  return False


def check_entidad(lista_entidades, df):
  # One counter per row of the DataFrame
  suma = [0] * len(df['Entidades'])

  # Add 1 for every query entity that matches (val is a boolean)
  for entidad in lista_entidades:
    for pos, val in enumerate(df['Entidades'].apply(lambda x: is_match(entidad, x))):
      suma[pos] += val
  return suma

def calcular_similitud_logodds(user_query_Embed, embed_df):
  # Map the cosine similarity to the range 0-1 (like a probability)
  # and then convert it to log-odds so it spans the whole real line

  # Compute the similarity
  similitud = util.cos_sim(user_query_Embed, embed_df)

  # Rescale the similarity from [-1, 1] to [0, 1]
  similitud = (similitud + 1) / 2

  # Convert to log-odds
  similitud = np.log(similitud / (1 - similitud))

  return similitud[0].tolist() # similitud is a tensor, so return a plain list

def suma_entidad(similitud_logodds, lista_entidades, alfa = 3):
  # For every matched entity, add alfa to the item's log-odds
  resultado = [logodd + alfa * nro_entidades for logodd, nro_entidades in zip(similitud_logodds, lista_entidades)]

  return resultado

def predict_custom(predictions):

  # Works like predict, but uses the custom per-emotion thresholds found
  # earlier. If no emotion clears its threshold, each probability is divided
  # by its threshold and the emotion with the highest ratio (the one closest
  # to its threshold) is selected. Finally, to avoid inconsistent output,
  # 'neutral' is dropped whenever any other emotion is predicted.

  # Thresholds follow the order of the global `emociones` list
  threshold = [0.45, 0.31, 0.26, 0.28, 0.22, 0.19, 0.22, 0.17]

  # predictions holds one array per emotion, each of shape (n_samples, 2);
  # column 1 is the probability of the positive class
  prediction = []
  for i in range(len(predictions[0])):
    og_preds = []
    preds = []
    for j in range(len(predictions)):
      og_preds.append(predictions[j][i][1])
      pred = int(predictions[j][i][1] > threshold[j])
      preds.append(pred)

    # Fallback: nothing cleared its threshold, so pick the closest emotion
    if sum(preds) == 0:
      for pred in range(len(og_preds)):
        og_preds[pred] = og_preds[pred] / threshold[pred]
      preds[np.argmax(og_preds)] = 1

    # 'neutral' (index 0) cannot coexist with another emotion
    if sum(preds[1:]) > 0:
      preds[0] = 0
    prediction.append(preds)

  return np.array(prediction)
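
The fallback and the neutral rule in `predict_custom` can be traced on a single sample. The sketch below is a simplified, pure-Python restatement for one sample only. The helper name `apply_thresholds` and the probabilities are made up for illustration; only the thresholds are copied from the function above.

```python
# Thresholds copied from predict_custom; everything else is illustrative.
THRESHOLDS = [0.45, 0.31, 0.26, 0.28, 0.22, 0.19, 0.22, 0.17]

def apply_thresholds(probs, thresholds=THRESHOLDS):
    """Return 0/1 flags for one sample (hypothetical helper)."""
    flags = [int(p > t) for p, t in zip(probs, thresholds)]

    # Fallback: if nothing clears its threshold, pick the label whose
    # probability is largest relative to its own threshold.
    if sum(flags) == 0:
        ratios = [p / t for p, t in zip(probs, thresholds)]
        flags[ratios.index(max(ratios))] = 1

    # 'neutral' (index 0) is dropped when any other label is active.
    if sum(flags[1:]) > 0:
        flags[0] = 0
    return flags

# Made-up probabilities: no label clears its threshold.
weak = [0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.16]
# Made-up probabilities: neutral and label 6 both clear their thresholds.
mixed = [0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.40, 0.05]

print(apply_thresholds(weak))
print(apply_thresholds(mixed))
```

In the first case the last label wins because 0.16 is the value closest to its own threshold (0.17), even though 0.30 is the largest raw probability. In the second case neutral is cleared because another label is active.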

## Loading data

In [None]:
with open('multioutput_emotion_logclass.pkl','rb') as f:
    sentiment_model = pickle.load(f)

df_books = pd.read_csv('books.csv')
df_movies = pd.read_csv('movies.csv')
df_boardgames = pd.read_csv('boardgames.csv')

book_summary_embed = pd.read_csv('book_summary_embed.csv').to_numpy().astype(np.float32)
df_movies_embed = pd.read_csv('df_movies_embed.csv').to_numpy().astype(np.float32)
df_boardgames_embed = pd.read_csv('df_boardgames_embed.csv').to_numpy().astype(np.float32)
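
Because the catalogues are read back from CSV, the `Entidades` column arrives as text rather than as Python lists, which is why `is_match` calls `ast.literal_eval` before comparing. Below is a minimal sketch of that round trip. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for Jaro-Winkler; the two measures give different scores, so the 0.8 cut-off here is only illustrative. The stored string and the helper name `fuzzy_in` are made up.

```python
import ast
from difflib import SequenceMatcher

def fuzzy_in(query, stored, threshold=0.8):
    """True if `query` is close to any entity in the stringified list."""
    entities = ast.literal_eval(stored)  # "['a', 'b']" -> ['a', 'b']
    return any(SequenceMatcher(None, query, ent).ratio() >= threshold
               for ent in entities)

# Made-up cell value, shaped like the strings stored in the CSV files.
stored = "['matt damon', 'france']"

print(fuzzy_in('mat damon', stored))   # a one-letter typo still matches
print(fuzzy_in('star wars', stored))   # an unrelated entity does not
```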

## Combined exercises

In [None]:
def recomendacion_final():
  user_sentiment = input("Buenas, como te sentis? 😊: ")
  user_sentiment = user_sentiment.lower().strip()
  user_sentiment = GoogleTranslator(source='auto', target='en').translate(user_sentiment)

  embedding = embedding_model.encode(user_sentiment)

  probs = sentiment_model.predict_proba(embedding.reshape(1, -1))
  predict = predict_custom(probs)

  tags = []
  print('Sentimientos detectados:')
  for i in range(len(predict[0])):
    if predict[0][i] == 1:
      print(emociones[i])
      tags.append(emociones[i])
  print('\n')

  user_query = input("Que temática te gustaría que busquemos? 👀: ")

  user_query = user_query.strip().lower()
  translated_query = GoogleTranslator(source='auto', target='en').translate(user_query)

  user_query_embed = embedding_model.encode(translated_query)

  entidades_lista = return_entidades(translated_query)

  recomendaciones = []

  # Pair each DataFrame with its own embedding matrix
  datasets = [(df_books, book_summary_embed),
              (df_movies, df_movies_embed),
              (df_boardgames, df_boardgames_embed)]

  for df, embed in datasets:
    entidades = check_entidad(entidades_lista, df)
    logodds = calcular_similitud_logodds(user_query_embed, embed)
    logodds = suma_entidad(logodds, entidades)

    recomendaciones.append(np.argpartition(logodds,-3)[-3:])

  print('\n')
  print('Coincidencias encontradas en libros, peliculas y juegos de mesa')
  for recomendacion, df in zip(recomendaciones, [df_books, df_movies, df_boardgames]):
    for rec in recomendacion:
      print(df.iloc[rec]['Title'])
    print('\n')
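
The scoring inside the loop can be traced by hand. The sketch below restates the two transformations with only the standard library: a cosine similarity c is rescaled to p = (c + 1) / 2, converted to log(p / (1 - p)), and then raised by `alfa` for each matched entity. The clipping constant `eps` is an addition of this sketch (the notebook's function does not clip) to avoid dividing by zero when c is exactly 1 or -1. The similarity values are made up.

```python
import math

def cos_to_logodds(c, eps=1e-6):
    """Map a cosine similarity in [-1, 1] to log-odds (eps is an assumption)."""
    p = (c + 1) / 2                    # rescale to [0, 1]
    p = min(max(p, eps), 1 - eps)      # keep away from 0 and 1
    return math.log(p / (1 - p))

def boosted(c, n_matches, alfa=3):
    """Log-odds plus alfa per matched entity, as in suma_entidad."""
    return cos_to_logodds(c) + alfa * n_matches

# Made-up similarities: a strong semantic match with no entity hit versus
# a weaker match that shares one entity with the query.
strong_no_entity = boosted(0.60, 0)
weak_one_entity = boosted(0.20, 1)

print(round(strong_no_entity, 3), round(weak_one_entity, 3))
```

With `alfa = 3`, one entity match multiplies the odds by e^3, roughly 20, so a single shared entity outweighs a fairly large gap in cosine similarity. In this example the weaker semantic match ends up ranked above the stronger one.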


In [None]:
recomendacion_final()

Buenas, como te sentis? 😊: Me arrepiento de no haberle hablado cuando tuve la chance
Sentimientos detectados:
tristeza
remordimiento


Que temática te gustaría que busquemos? 👀: Una historia ambientada en medio del siglo 16, que tome lugar en francia 






Coincidencias encontradas en libros, peliculas y juegos de mesa
Les crimes de l'amourPrécédé d'un avant-propos, suivi des idées sur les romans, de l'auteur des crimes de l'amour à Villeterque, d'une notice bio-bibliographique du marquis de Sade: l'homme et ses écrits et du discours prononcé par le marquis de Sade à la section des piques.
Reflections; or Sentences and Moral Maxims
Memoirs of Fanny HillA New and Genuine Edition from the Original Text (London, 1749)


Marie Antoinette
Mr. Nobody
The Host


Great Western Trail: Second Edition
Maria
Imperial Struggle
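
Note that `np.argpartition(logodds, -3)[-3:]` returns the indices of the three highest scores but does not guarantee their order, so the titles in each group above are not necessarily listed from best to worst. If a ranked list is wanted, one option is to select the indices with the standard library's `heapq.nlargest`, which returns them best-first. A minimal sketch with made-up scores and a hypothetical helper name `top_k_ranked`:

```python
import heapq

def top_k_ranked(scores, k=3):
    """Indices of the k highest scores, best first (hypothetical helper)."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)

# Made-up log-odds for six catalogue rows.
scores = [0.4, 3.4, -1.2, 1.4, 2.9, 0.0]

print(top_k_ranked(scores))
```

The returned indices can be passed to `df.iloc` exactly as the partitioned indices are in `recomendacion_final`.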


