## Creating the raw training datasets

This *notebook* builds the datasets used to train the book recommender. There are two:

1. Books dataset:
  - `book_id`: Book identifier.
  - `semantic_sbert`: `numpy` array encoding the semantic content of the book's synopsis, computed with the [SBERT](https://www.sbert.net/docs/pretrained_models.html) model `all-distilroberta-v1`. It has 768 dimensions.
  - `semantic_use`: `numpy` array encoding the semantic content of the book's synopsis, computed with *Google*'s [*Universal Sentence Encoder*](https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder). It has 512 dimensions.
  - `sentiment`: `numpy` array encoding the *sentiment analysis* of each synopsis, computed with the `SamLowe/roberta-base-go_emotions` model available on [*HuggingFace*](https://huggingface.co/SamLowe/roberta-base-go_emotions). It has 28 dimensions.

2. Ratings dataset:
  - `user_id`: User identifier.
  - `book_id`: Identifier of the rated book.
  - `rating`: Rating given to the book, from 1 to 5.
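
The two schemas above can be sketched as a pair of row checks. This is a minimal illustration, not part of the original notebook: `check_book_row` and `check_rating_row` are hypothetical helpers, and the rows are plain dictionaries so that the example runs without any model or external library.

```python
# Vector sizes taken from the dataset description above.
EXPECTED_DIMS = {"semantic_sbert": 768, "semantic_use": 512, "sentiment": 28}


def check_book_row(row):
    """Return a list of problems found in one books-dataset row (hypothetical helper)."""
    problems = []
    if "book_id" not in row:
        problems.append("missing book_id")
    for column, dim in EXPECTED_DIMS.items():
        vector = row.get(column)
        if vector is None:
            problems.append(f"missing {column}")
        elif len(vector) != dim:  # len() also works on 1-D numpy arrays
            problems.append(f"{column}: expected {dim} values, got {len(vector)}")
    return problems


def check_rating_row(row):
    """Return a list of problems found in one ratings-dataset row (hypothetical helper)."""
    problems = [f"missing {c}" for c in ("user_id", "book_id", "rating") if c not in row]
    rating = row.get("rating")
    if rating is not None and not 1 <= rating <= 5:
        problems.append(f"rating {rating} is outside 1..5")
    return problems


# Toy rows with made-up values, only to exercise the checks.
good_book = {
    "book_id": 1,
    "semantic_sbert": [0.0] * 768,
    "semantic_use": [0.0] * 512,
    "sentiment": [0.0] * 28,
}
print(check_book_row(good_book))
print(check_rating_row({"user_id": 7, "book_id": 1, "rating": 6}))
```

A check of this kind could be applied to a few rows after the embedding columns are computed, to confirm that each vector has the size the description promises.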

In [None]:
%%capture
!pip3 install seaborn
!pip3 install -U transformers
!pip3 install -U sentence-transformers

In [1]:
import pandas as pd
from ast import literal_eval

books_df = pd.read_csv(
  'https://raw.githubusercontent.com/malcolmosh/goodbooks-10k/master/books_enriched.csv',
  index_col=[0],
  converters={"genres": literal_eval}
)

# Keep only books tagged as English that have a non-null description
books_with_summary_df = books_df[(books_df['language_code'] == 'eng') & (books_df['description'].notna())]
books_with_summary_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9628 entries, 0 to 9999
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   index                      9628 non-null   int64  
 1   authors                    9628 non-null   object 
 2   average_rating             9628 non-null   float64
 3   best_book_id               9628 non-null   int64  
 4   book_id                    9628 non-null   int64  
 5   books_count                9628 non-null   int64  
 6   description                9628 non-null   object 
 7   genres                     9628 non-null   object 
 8   goodreads_book_id          9628 non-null   int64  
 9   image_url                  9628 non-null   object 
 10  isbn                       9014 non-null   object 
 11  isbn13                     9100 non-null   float64
 12  language_code              9628 non-null   object 
 13  original_publication_year  9608 non-null   float64
 ...

In [2]:
books_reduced = books_with_summary_df.loc[:, ['book_id', 'description']]
books_reduced.head(10)

Unnamed: 0,book_id,description
0,1,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...
1,2,Harry Potter's life is miserable. His parents ...
2,3,About three things I was absolutely positive.\...
3,4,The unforgettable novel of a childhood in a sl...
4,5,Alternate Cover Edition ISBN: 0743273567 (ISBN...
5,6,Despite the tumor-shrinking medical miracle th...
6,7,In a hole in the ground there lived a hobbit. ...
7,8,The hero-narrator of The Catcher in the Rye is...
8,9,World-renowned Harvard symbologist Robert Lang...
9,10,Alternate cover edition of ISBN 9780679783268S...


In [3]:
from sentence_transformers import SentenceTransformer

# Load the SBERT model
bert_model = SentenceTransformer('all-distilroberta-v1')



In [4]:
import numpy as np

def embed_sbert(description):
  """
  Encode a description as an SBERT embedding
  """
  return np.array(bert_model.encode([description])[0])
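
`embed_sbert` encodes one description per call. Since `SentenceTransformer.encode` also accepts a list of sentences, the descriptions could instead be passed in chunks. The sketch below shows only the chunking logic: `embed_in_batches` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in encoder so the example runs without downloading a model. In the notebook, `encode_fn` would be `bert_model.encode`.

```python
def _chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts, encode_fn, batch_size=64):
    """Encode `texts` chunk by chunk and return one embedding per text, in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings = []
    for batch in _chunks(texts, batch_size):
        # encode_fn receives a list of strings and returns one vector per string
        embeddings.extend(encode_fn(list(batch)))
    return embeddings


def fake_encode(batch):
    """Stand-in encoder (assumption): maps each text to [character count, word count]."""
    return [[float(len(text)), float(len(text.split()))] for text in batch]


vectors = embed_in_batches(
    ["a hobbit hole", "fame and fortune", "x"], fake_encode, batch_size=2
)
print(len(vectors))
```

The result keeps the input order, so it can be aligned with the rows the descriptions came from.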

In [None]:
import tensorflow_hub as hub

# Load the Universal Sentence Encoder model
use_model = hub.load('https://tfhub.dev/google/universal-sentence-encoder/4')

In [6]:
def embed_use(description):
  """
  Encode a description as a USE embedding
  """
  return np.array(use_model([description])[0])

In [7]:
from transformers import pipeline

# Load the classifier model used for sentiment analysis
classifier = pipeline(task="text-classification", model="SamLowe/roberta-base-go_emotions", top_k=None, truncation=True)

In [8]:
def sentiment_to_array(sentiment_list):
  """
  Convert a list of sentiment-analysis dictionaries into an array
  holding the emotion scores in alphabetical order of their labels
  """
  label_score = [(sa['label'], sa['score']) for sa in sentiment_list]
  label_score.sort(key=lambda x: x[0]) # Sort by label
  return np.array([score for (label, score) in label_score])

def embed_sentiment(description):
  """
  Encode a description as a sentiment-analysis embedding
  """
  return sentiment_to_array(classifier([description])[0])
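
Sorting by label is what gives every emotion a fixed position in the vector, whatever order the classifier returns its scores in. The sketch below reproduces that step on made-up data: `scores_in_label_order` is a hypothetical pure-Python analogue of `sentiment_to_array` (it returns a list instead of a `numpy` array), and the three labels and scores are invented for illustration.

```python
def scores_in_label_order(sentiment_list):
    """Return the scores sorted by label name (pure-Python analogue of sentiment_to_array)."""
    ordered = sorted(sentiment_list, key=lambda item: item["label"])
    return [item["score"] for item in ordered]


# Invented classifier output with three labels, listed from highest to lowest score.
toy_output = [
    {"label": "joy", "score": 0.70},
    {"label": "fear", "score": 0.25},
    {"label": "anger", "score": 0.05},
]

by_label = scores_in_label_order(toy_output)
by_label_shuffled = scores_in_label_order(list(reversed(toy_output)))

print(by_label)                       # [0.05, 0.25, 0.7] -> anger, fear, joy
print(by_label == by_label_shuffled)  # True: input order does not matter
```

This only holds if every description yields the same set of labels, which is what `top_k=None` is meant to ensure in the `pipeline` call above.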

In [9]:
books_final_df = books_reduced.copy()

In [None]:
books_final_df['semantic_sbert'] = books_final_df.loc[:, 'description'].apply(embed_sbert)

In [None]:
books_final_df['semantic_use'] = books_final_df.loc[:, 'description'].apply(embed_use)

In [None]:
books_final_df['sentiment'] = books_final_df.loc[:, 'description'].apply(embed_sentiment)

In [None]:
books_raw = books_final_df.copy()
del books_raw['description']

In [None]:
books_raw.to_pickle("books_embedded_raw.pkl")
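
A likely reason for saving with pickle rather than CSV (an assumption, since the notebook does not say) is that the embedding columns hold arrays, and a text format would turn them into strings. The sketch below illustrates the difference with the standard library only, using a made-up row and a plain list in place of a `numpy` array.

```python
import csv
import io
import pickle

# Made-up row: a list stands in for the numpy array stored in the real column.
row = {"book_id": 1, "sentiment": [0.0045, 0.00065, 0.12]}

# Pickle round trip: the list comes back as a list.
restored = pickle.loads(pickle.dumps(row))
print(type(restored["sentiment"]).__name__)

# CSV round trip: the list is written as its text representation
# and is read back as a plain string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["book_id", "sentiment"])
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(type(from_csv["sentiment"]).__name__)
```

After a CSV round trip the vectors would have to be parsed back from text before use, whereas the pickled file can be loaded directly.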

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Copy the dataset to Drive (keeps the raw version)
!cp books_embedded_raw.pkl /content/gdrive/MyDrive/Colab\ Notebooks/TFG_SR_Libros/training

In [None]:
books_final_df['sentiment'] = books_final_df['description'].apply(embed_sentiment)
books_final_df

Unnamed: 0,book_id,description,sentiment
0,1,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"[0.004516009707003832, 0.000651626440230757, 0..."
1,2,Harry Potter's life is miserable. His parents ...,"[0.07255825400352478, 0.001317441463470459, 0...."
2,3,About three things I was absolutely positive.\...,"[0.046528104692697525, 0.0008789585554040968, ..."
3,4,The unforgettable novel of a childhood in a sl...,"[0.6824034452438354, 0.0008955516968853772, 0...."
4,5,Alternate Cover Edition ISBN: 0743273567 (ISBN...,"[0.8346691131591797, 0.0006059535662643611, 0...."
...,...,...,...
9995,9981,"A high-school girl in Harlem, Geneva Settle, i...","[0.002152122324332595, 0.0006707774009555578, ..."
9996,9982,In Karen Marie Moning’s latest installment of ...,"[0.013859235681593418, 0.0005022927070967853, ..."
9997,9985,"In the year 2000, computers are the new superp...","[0.0016449115937575698, 0.0012249279534444213,..."
9998,9987,A CIA agent's two-year-old child was stolen in...,"[0.003782995045185089, 0.0007758404244668782, ..."


In [None]:
books_sentiment = pd.read_pickle("books_embedded_raw.pkl")
books_sentiment

Unnamed: 0,book_id,description,sentiment
0,1,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"[0.004516009707003832, 0.000651626440230757, 0..."
1,2,Harry Potter's life is miserable. His parents ...,"[0.07255825400352478, 0.001317441463470459, 0...."
2,3,About three things I was absolutely positive.\...,"[0.046528104692697525, 0.0008789585554040968, ..."
3,4,The unforgettable novel of a childhood in a sl...,"[0.6824034452438354, 0.0008955516968853772, 0...."
4,5,Alternate Cover Edition ISBN: 0743273567 (ISBN...,"[0.8346691131591797, 0.0006059535662643611, 0...."
...,...,...,...
9995,9981,"A high-school girl in Harlem, Geneva Settle, i...","[0.002152122324332595, 0.0006707774009555578, ..."
9996,9982,In Karen Marie Moning’s latest installment of ...,"[0.013859235681593418, 0.0005022927070967853, ..."
9997,9985,"In the year 2000, computers are the new superp...","[0.0016449115937575698, 0.0012249279534444213,..."
9998,9987,A CIA agent's two-year-old child was stolen in...,"[0.003782995045185089, 0.0007758404244668782, ..."


In [None]:
books_raw

Unnamed: 0,book_id,description,sentiment,semantic_sbert,semantic_use
0,1,WINNING MEANS FAME AND FORTUNE.LOSING MEANS CE...,"[0.004516009707003832, 0.000651626440230757, 0...","[-0.004934204, -0.08143925, 0.024838133, -0.00...","[0.00072222785, -0.06841643, -0.032095194, 0.0..."
1,2,Harry Potter's life is miserable. His parents ...,"[0.07255825400352478, 0.001317441463470459, 0....","[-0.009699796, -0.04587975, 0.008891147, 0.048...","[-0.024099799, -0.040485196, -0.053515114, 0.0..."
2,3,About three things I was absolutely positive.\...,"[0.046528104692697525, 0.0008789585554040968, ...","[-0.028360855, -0.005439244, -0.03582615, 0.01...","[-0.029888729, -0.040947303, 0.05555888, 0.013..."
3,4,The unforgettable novel of a childhood in a sl...,"[0.6824034452438354, 0.0008955516968853772, 0....","[0.009296236, -0.036142815, 0.018441785, -0.01...","[-0.0022137966, 0.004525279, 0.029811809, 0.02..."
4,5,Alternate Cover Edition ISBN: 0743273567 (ISBN...,"[0.8346691131591797, 0.0006059535662643611, 0....","[-0.020718819, -0.027652211, 0.034484472, 0.03...","[0.037782643, -0.014753511, 0.05975853, -0.012..."
...,...,...,...,...,...
9995,9981,"A high-school girl in Harlem, Geneva Settle, i...","[0.002152122324332595, 0.0006707774009555578, ...","[0.005982023, -0.08106781, 0.009434896, -0.007...","[-0.021464368, -0.0095358975, 0.046445213, -0...."
9996,9982,In Karen Marie Moning’s latest installment of ...,"[0.013859235681593418, 0.0005022927070967853, ...","[-0.037881684, -0.06854862, 0.015421294, -0.01...","[-0.050711267, -0.04681007, 0.02529571, 0.0029..."
9997,9985,"In the year 2000, computers are the new superp...","[0.0016449115937575698, 0.0012249279534444213,...","[-0.0050055524, 0.006095925, 0.009436544, -0.0...","[-0.036081206, -0.0245497, -0.045305472, 0.031..."
9998,9987,A CIA agent's two-year-old child was stolen in...,"[0.003782995045185089, 0.0007758404244668782, ...","[0.03438998, -0.018278314, 0.008121892, -0.015...","[-0.004951604, -0.0228154, 0.0080927145, 0.031..."
