# Embeddings

This notebook shows, in compact form, how to obtain semantic embeddings with `Transformers` and `PyTorch` using `google-bert/bert-base-cased`. We start from an excerpt about Don Quixote (Wikipedia), tokenize it with `AutoTokenizer` (applying truncation and padding), and generate per-token representations by passing the tensors to the base model `AutoModel` inside `torch.no_grad()`, which disables gradient tracking and so reduces memory use and speeds up inference. We inspect `outputs.last_hidden_state` (e.g., shape [1, 512, 768]) and compute two sentence embeddings: (i) [CLS] pooling, taking the vector of the first token; and (ii) mean pooling, averaging the valid token embeddings according to `attention_mask` so that padding is ignored.

In [None]:
# This notebook requires the transformers and torch libraries to be installed
from transformers import AutoModel, AutoTokenizer

import torch
import torch.nn.functional as F

In [42]:
# The text we are going to convert into an embedding, taken from: https://en.wikipedia.org/wiki/Don_Quixote
text= """
Cervantes, in a metafictional narrative, writes that the first few chapters were taken from "the archives of La Mancha", and the rest were translated from an Arabic text by the Moorish historian Cide Hamete Benengeli.

Alonso Quixano is an hidalgo nearing 50 years of age who lives in a deliberately unspecified region of La Mancha with his niece and housekeeper. While he lives a frugal life, he is full of fantasies about chivalry stemming from his obsession with chivalric romance books. Eventually, his obsession becomes madness when he decides to become a knight errant, donning an old suit of armor. He renames himself "Don Quixote", names his old workhorse "Rocinante", and designates Aldonza Lorenzo (a slaughterhouse worker with a famed hand for salting pork) his lady love, renaming her Dulcinea del Toboso.

As he travels in search of adventure, he arrives at an inn that he believes to be a castle, calls the prostitutes he meets there "ladies", and demands that the innkeeper, whom he takes to be the lord of the castle, dub him a knight. The innkeeper agrees. Quixote starts the night holding vigil at the inn's horse trough, which Quixote imagines to be a chapel. He then becomes involved in a fight with muleteers who try to remove his armor from the horse trough to water their mules. In a pretend ceremony, the innkeeper dubs him a knight to be rid of him and sends him on his way.

Quixote next encounters a servant named Andres who is tied to a tree and being beaten by his master over disputed wages. Quixote orders the master to stop the beating, untie Andres and swear to treat his servant fairly. However, the beating is resumed, and redoubled, as soon as Quixote leaves.

Quixote then chances upon traders from Toledo. He demands that they agree that Dulcinea del Toboso is the most beautiful woman in the world. One of them demands to see her picture so that he can decide for himself. Enraged, Quixote charges at them but his horse stumbles, causing him to fall. One of the traders beats up Quixote, who is left at the side of the road until a neighboring peasant brings him back home.

While Quixote lies unconscious in his bed, his niece, the housekeeper, the parish curate, and the local barber burn most of his chivalric and other books, seeing them as the root of his madness. They seal up the library room, later telling Quixote that it was done by a wizard. """
print(text)


Cervantes, in a metafictional narrative, writes that the first few chapters were taken from "the archives of La Mancha", and the rest were translated from an Arabic text by the Moorish historian Cide Hamete Benengeli.

Alonso Quixano is an hidalgo nearing 50 years of age who lives in a deliberately unspecified region of La Mancha with his niece and housekeeper. While he lives a frugal life, he is full of fantasies about chivalry stemming from his obsession with chivalric romance books. Eventually, his obsession becomes madness when he decides to become a knight errant, donning an old suit of armor. He renames himself "Don Quixote", names his old workhorse "Rocinante", and designates Aldonza Lorenzo (a slaughterhouse worker with a famed hand for salting pork) his lady love, renaming her Dulcinea del Toboso.

As he travels in search of adventure, he arrives at an inn that he believes to be a castle, calls the prostitutes he meets there "ladies", and demands that the innkeeper, whom he t

In [None]:
# We will generate the embeddings with the following model:
pretrained_model= "google-bert/bert-base-cased"

# Before feeding any text to a model it must be tokenized, that is, split into tokens and mapped to the integer IDs the model understands.
# To do this, we load the tokenizer and use it as explained in the tokenization tutorial.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)

tokenized_texts = tokenizer(text, 
                            max_length=tokenizer.model_max_length, 
                            padding=True, truncation=True, 
                            return_tensors='pt')
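
To make the effect of `truncation` and `padding` more concrete, here is a minimal pure-Python sketch of what the two options do conceptually to a batch of token IDs. The helper `pad_and_truncate`, the token IDs, and the pad ID of 0 are all hypothetical and chosen for illustration only. The real tokenizer does this internally and also keeps special tokens such as [SEP] in place when it truncates, which this sketch does not attempt.

```python
# Hypothetical helper for illustration only; not part of the transformers API.
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each ID list to max_length, then pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Made-up token IDs, not real vocabulary entries.
batch = [[101, 7, 8, 9, 10, 11, 102], [101, 5, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)

print(input_ids)       # [[101, 7, 8, 9, 10], [101, 5, 102, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

With a single input text, as in this notebook, there is nothing to pad against, so only the truncation to the maximum length has a visible effect.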


In [None]:
# Once the text is tokenized, we load the model that will generate the embeddings and put it in inference (evaluation) mode
model= AutoModel.from_pretrained(pretrained_model)
model.eval()  

# Now we pass everything the tokenizer returned through the model.
# We do this inside the no_grad() context, which disables gradient tracking so that inference uses less memory and runs faster.
with torch.no_grad():
    outputs= model(**tokenized_texts) # La salida del modelo
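
As a small illustration of what `torch.no_grad()` changes, the sketch below runs the same forward pass with and without it. The single `torch.nn.Linear` layer is only a stand-in for the real model (an assumption made so the example runs without downloading anything).

```python
import torch

# Stand-in for a model: one linear layer (hypothetical, for illustration only).
layer = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

y_tracked = layer(x)          # autograd records the operation for a later backward pass
with torch.no_grad():
    y_untracked = layer(x)    # no computation graph is built

print(y_tracked.requires_grad)                  # True
print(y_untracked.requires_grad)                # False
print(torch.allclose(y_tracked, y_untracked))   # True: the values are the same
```

The outputs are numerically identical; only the bookkeeping needed for backpropagation is skipped, which is why this is safe for inference.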

In [None]:
# The embeddings that represent the tokens of the text are in last_hidden_state.
# In this case, the embeddings have shape [1, 512, 768]:
# 1 text, 512 tokens, and a 768-dimensional embedding for each token

print(outputs.last_hidden_state.shape)

torch.Size([1, 512, 768])


In [None]:
# To obtain a single embedding that represents the whole text, there are two main strategies: [CLS] pooling and mean pooling.
# The first strategy, [CLS] pooling, takes the embedding of the first token, [CLS], which
# marks the start of the sequence. It is a simple, fast strategy that often gives good results.
# The second strategy averages the embeddings of all tokens and uses the mean as the representation of the text.

# Both approaches yield a 768-dimensional embedding for each text; which one is more suitable depends on the context.

cls_pooling_embedding= outputs.last_hidden_state[:, 0, :] 
print(f"[CLS] pooling embedding: {cls_pooling_embedding.shape}")

mask = tokenized_texts["attention_mask"].unsqueeze(-1) # Shape [1, 512, 1]; used to exclude the embeddings of padding tokens
mean_pooling_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
# mean_pooling_embedding= F.normalize(mean_pooling_embedding, p=2, dim=1)    # Optional L2 normalization, useful when comparing embeddings

print(f"Mean pooling embedding: {mean_pooling_embedding.shape}")

[CLS] pooling embedding: torch.Size([1, 768])
Mean pooling embedding: torch.Size([1, 768])
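
To see why the attention mask matters in mean pooling, the following sketch applies the same formula to a tiny hand-made tensor. All values are made up for illustration: the third position plays the role of a padding token with large values, so the difference from a naive mean is easy to see. It also shows how the commented-out `F.normalize` call could be used to compare two embeddings with cosine similarity.

```python
import torch
import torch.nn.functional as F

# Toy "last_hidden_state": 1 text, 3 token positions, 2 dimensions (made-up values).
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])  # the third position is padding

mask = attention_mask.unsqueeze(-1)                  # shape [1, 3, 1]
masked_mean = (hidden * mask).sum(1) / mask.sum(1)   # averages the two real tokens only
naive_mean = hidden.mean(1)                          # padding values leak into the average

print(masked_mean)  # tensor([[2., 3.]])
print(naive_mean)   # roughly [[34.67, 35.33]], distorted by the padding position

# After L2 normalization, the dot product of two embeddings is their cosine similarity.
a = F.normalize(masked_mean, p=2, dim=1)
b = F.normalize(torch.tensor([[4.0, 6.0]]), p=2, dim=1)  # hypothetical second embedding
cosine = (a * b).sum(dim=1)
print(cosine)  # close to 1.0, since [4, 6] points in the same direction as [2, 3]
```

In the notebook itself the single input is truncated rather than padded, so the mask is all ones and both means coincide; the mask becomes important once several texts of different lengths are batched together.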
