# Miguel Angel Ruiz Ortiz
## Natural Language Processing
## Assignment 5: Neural Language Models

In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
from typing import Union, Optional
import os
import shutil
import json
from pathlib import Path
import time
from itertools import permutations
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.util import ngrams
import math
from sklearn.metrics import accuracy_score
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
import torch
import torch.nn as nn

Directory creation

In [4]:
base_path = Path("/content/drive/My Drive/Academic Stuff/NLP (CIMAT)")

# Saving directories
savedir = base_path / "models-tarea-5" / "from_pretrain" # directory for model using pretrained embeddings
os.makedirs(savedir, exist_ok = True)

savedir_scratch = base_path / "models-tarea-5" / "from_scratch" # directory for model from scratch
os.makedirs(savedir_scratch, exist_ok = True)

savedir_char = base_path / "models-tarea-5" / "characters" # directory for characters model
os.makedirs(savedir_char, exist_ok = True)

# 1) Word-level Neural Language Model (Bengio 2003)

---

## 1.1) Model implementation

Based on the implementation of Bengio's model shown in the lab session, build a word-level neural language model, but pre-initialized with the provided embeddings. Use sequences of length 4 for the model, i.e., up to 3 words of context.

---

The following cells contain the code to read and process the "MEX-A3T" corpus. TweetTokenizer is used to tokenize the text:

In [5]:
def get_corpus(corpus_path: Path) -> list[str]:
    with open(corpus_path, "r", encoding="utf-8") as corpus_file:
        corpus = [line for line in corpus_file]

    return corpus

In [6]:
mex_a3t_path = base_path / "corpus/MEX-A3T"

train_corpus_path = mex_a3t_path / "mex20_train.txt"
val_corpus_path = mex_a3t_path / "mex20_val.txt"
embeddings_path = mex_a3t_path / "word2vec_col.txt"

In [7]:
train_corpus = get_corpus(train_corpus_path)
val_corpus = get_corpus(val_corpus_path)

In [8]:
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

train_corpus_tk = [tokenizer.tokenize(tweet) for tweet in train_corpus]
val_corpus_tk = [tokenizer.tokenize(tweet) for tweet in val_corpus]

We load the initial embeddings:

In [9]:
with open(embeddings_path, "r", encoding="utf-8") as file:
    lines_embeddings_file = file.readlines()

In [10]:
num_words, embedding_dim = map(int, lines_embeddings_file[0].split())
word2embedding = {
    line_splitted[0]: np.array(list(map(float, line_splitted[1:])))
    for line_splitted in map(str.split, lines_embeddings_file[1:])
}

In [11]:
print("Número de palabras con embedding:", num_words)
print("Dimensión de los embeddings:", embedding_dim)

Número de palabras con embedding: 973265
Dimensión de los embeddings: 100
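
The parsing above assumes the word2vec text layout: a header line with the word count and the dimension, followed by one word and its vector per line. A minimal sketch on a made-up three-word file (the names `toy_lines`, `toy_word2embedding` and all values are illustrative, not taken from `word2vec_col.txt`):

```python
# Toy file in the word2vec text format (contents are hypothetical).
toy_lines = [
    "3 4\n",
    "hola 0.1 0.2 0.3 0.4\n",
    "mundo -1.0 0.0 1.0 2.0\n",
    "adios 0.5 0.5 0.5 0.5\n",
]

# Header: number of words and embedding dimension.
toy_num_words, toy_dim = map(int, toy_lines[0].split())

# Remaining lines: the first field is the word, the rest is its vector.
toy_word2embedding = {
    fields[0]: [float(value) for value in fields[1:]]
    for fields in map(str.split, toy_lines[1:])
}

# Sanity checks: the header should agree with what was actually parsed.
assert len(toy_word2embedding) == toy_num_words
assert all(len(vec) == toy_dim for vec in toy_word2embedding.values())
print(toy_word2embedding["mundo"])
```

Checking the header against the parsed dictionary is a cheap way to notice duplicated words or malformed lines in a file of this size.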


Example:

In [12]:
word2embedding["hola"]

array([-1.419667, -0.490418, -1.444962,  0.864942, -2.474545, -2.819041,
       -0.195111,  3.268535,  4.201846, -0.446295,  0.132508, -3.323097,
       -1.335639,  2.86414 ,  0.775206, -2.351034,  3.294083, -2.585027,
       -3.064607,  0.274417,  3.548857,  3.086329,  0.119739,  0.577198,
       -1.788768,  2.477334, -1.746314,  0.747134,  2.337681, -4.256221,
        3.570596, -0.41506 ,  1.456289,  0.148753, -4.042562, -1.551155,
       -0.978901,  1.965899, -0.331655,  1.018842,  2.553949,  1.254084,
       -0.789299,  2.823506, -5.736207, -0.169698,  1.530003,  3.976882,
        0.497212,  0.294316,  1.58776 , -2.974533, -0.832896, -0.161019,
       -1.31667 , -2.505708,  1.711155, -0.819489,  0.6929  , -6.522143,
       -2.402351,  3.085217, -1.504392,  0.314337,  1.760254,  0.297669,
        0.689544, -0.704122, -3.248115,  0.832989, -0.923742, -0.966281,
        0.48139 , -5.741403, -2.064541,  1.670688, -4.450252,  0.124791,
        0.393129,  1.819823,  0.462336,  2.106388, 

Next we look at the number of distinct words in the training set and how many of them have a pretrained embedding:

In [13]:
train_words = set()
for tweet in train_corpus_tk:
    train_words.update(tweet)

print("Número de palabras en el conjunto de entrenamiento:",  len(train_words))

Número de palabras en el conjunto de entrenamiento: 13071


In [14]:
embedddings_words = set(word2embedding.keys())

print("Número de palabras en el conjunto de entrenamiento que también tienen un embedding:", len(train_words.intersection(embedddings_words)))

Número de palabras en el conjunto de entrenamiento que también tienen un embedding: 11425


The vocabulary we will consider consists of exactly those words in our training set that also have a pretrained embedding. In addition, we add the special tokens ``<s>``, ``</s>`` and ``<unk>`` to the vocabulary, each with a random initial embedding.

In [15]:
# special tokens
INIT_TKN = "<s>"
END_TKN = "</s>"
UNK_TKN = "<unk>"

In [16]:
np.random.seed(0)

vocab = train_words.intersection(embedddings_words)
embeddings = { word: word2embedding[word] for word in vocab }

vocab.update([INIT_TKN, END_TKN, UNK_TKN])
vocab_len = len(vocab)

embeddings[INIT_TKN] = np.random.rand(embedding_dim)
embeddings[END_TKN] = np.random.rand(embedding_dim)
embeddings[UNK_TKN] = np.random.rand(embedding_dim)

We reuse the class from the previous assignment to process the text (remove punctuation marks, create the mappings ``{word -> id}`` (and vice versa), and mask out-of-vocabulary words), with some modifications so that it works with a given vocabulary. It can optionally receive a precomputed ``{word -> id}`` mapping.

In [17]:
class TextProcessor:
    def __init__(self, vocab: set, word2id: Optional[dict[str, int]] = None):
        # special tokens. we assume they are already present in the vocabulary given
        self.INIT_TKN = "<s>"
        self.END_TKN = "</s>"
        self.UNK_TKN = "<unk>"

        # punctuation marks that are removed from the text
        self.punctuation = {"¡", "!", '"', "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", ":", ";", "¿", "?", "@", "[", "]", "_", "`", "{", "}", "«", "»", "…"}

        self.vocab = vocab
        self.vocab_len = len(vocab)

        if word2id is None:
            self.word2id = { word: idx for idx, word in enumerate(vocab) } # mapping {word -> word id}
        else:
            self.word2id = word2id

        self.id2word = { idx: word for word, idx in self.word2id.items() } # mapping {word id -> word}

    def mask_oov(self, text: Union[str, list[str]]) -> Union[str, list[str]]:
        """Replace out-of-vocabulary words with <unk>"""
        if isinstance(text, str):
            return text if text in self.vocab else self.UNK_TKN
        else:
            # otherwise it is a list of strings
            return [self.mask_oov(word) for word in text if word not in self.punctuation]

    def mask_text(self, text: list[str]) -> list[str]:
        """ Mask if out of vocabulary and add initial and end sentence
        """
        return [self.INIT_TKN] + self.mask_oov(text) + [self.END_TKN]

    def transform(self, corpus: list[list[str]]) -> list[list[str]]:
        return [self.mask_text(text) for text in corpus]


Note that the ``word2id`` dictionary in the ``TextProcessor`` class depends on the iteration order over the ``set`` ``vocab`` (the vocabulary). Python does not guarantee any iteration order for a ``set``, and because string hashes are randomized per process by default, the {word -> id} mapping can change between Python sessions (verified while working on this assignment). If one wants to load the best model trained in a previous session, the row order of the saved embedding matrix may not match the new {word -> id} mapping computed in the current session. We therefore save that mapping to a json file.

In [18]:
word2id_path = base_path / "models-tarea-5" / "word2id.json"

if word2id_path.exists():
    with open(word2id_path, "r", encoding="utf-8") as file:
        word2id_json = json.load(file)
else:
    word2id_json = None

In [19]:
processor = TextProcessor(vocab=vocab, word2id=word2id_json)
train_corpus_msk = processor.transform(train_corpus_tk)
val_corpus_msk = processor.transform(val_corpus_tk)

If the json file had not been created before, we create it with the mapping computed in ``processor``.

In [20]:
if word2id_json is None:
    with open(word2id_path, "w", encoding="utf-8") as file:
        json.dump(processor.word2id, file, ensure_ascii=False)
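
An alternative to persisting the mapping, sketched here only as an illustration, is to make it deterministic by assigning ids in sorted order. The helper `build_word2id` and the toy vocabulary are hypothetical and not part of the notebook; a model already trained with the saved json mapping would still need that file.

```python
def build_word2id(vocab: set[str]) -> dict[str, int]:
    """Assign ids in sorted order so the mapping is identical in every session."""
    return {word: idx for idx, word in enumerate(sorted(vocab))}


toy_vocab = {"<s>", "</s>", "<unk>", "hola", "mundo"}
toy_word2id = build_word2id(toy_vocab)
print(toy_word2id)
```

Sorting removes the dependence on set iteration order, at the cost of an O(V log V) sort that is negligible for a vocabulary of this size.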

Percentage of ``<unk>`` tokens in the training corpus:

In [21]:
sum(
    sum([1 if tkn == UNK_TKN else 0 for tkn in tweet]) for tweet in train_corpus_msk
) / sum(len(tweet) for tweet in train_corpus_msk) * 100

5.585791741398149

With the ``{ word -> id }`` mapping from the ``processor`` object, we build the pretrained embedding matrix associated with our vocabulary.

In [22]:
embeddings_w = np.array([embeddings[processor.id2word[i]] for i in range(vocab_len)])

In [23]:
embeddings_w.shape

(11428, 100)

The following function generates the training and validation sets from the n-grams (n=4) of the corresponding corpora (3 context words and one word to predict). It receives a ``TextProcessor`` object and a corpus previously masked with ``TextProcessor``.

In [24]:
def get_ngrams(masked_corpus: list[list[str]], n: int, text_processor: TextProcessor) -> tuple[np.ndarray, np.ndarray]:

    X_ngrams = []
    y = []

    for doc in masked_corpus:
        # we assume doc has only one initial token <s> and one end token </s>, added in TextProcessor
        doc_pad = [text_processor.INIT_TKN]*(n-2) + doc

        for ngram in ngrams(doc_pad, n):
            ngram_ids = [text_processor.word2id[w] for w in ngram]
            X_ngrams.append(ngram_ids[:-1])
            y.append(ngram_ids[-1])

    return np.array(X_ngrams), np.array(y)

In [25]:
X_train, y_train = get_ngrams(masked_corpus=train_corpus_msk, n=4, text_processor=processor)
X_val, y_val = get_ngrams(masked_corpus=val_corpus_msk, n=4, text_processor=processor)

In [26]:
X_train.shape, y_train.shape, X_val.shape, y_val.shape

((91736, 3), (91736,), (10319, 3), (10319,))

In [27]:
[[processor.id2word[w] for w in tw] for tw in X_train[:10]]

[['<s>', '<s>', '<s>'],
 ['<s>', '<s>', 'q'],
 ['<s>', 'q', 'se'],
 ['q', 'se', 'puede'],
 ['se', 'puede', 'esperar'],
 ['puede', 'esperar', 'del'],
 ['esperar', 'del', 'maricon'],
 ['del', 'maricon', 'de'],
 ['maricon', 'de', 'closet'],
 ['de', 'closet', 'de']]
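
To make the padding explicit, here is a small dependency-free sketch of the same windowing on a toy sentence. The helper `context_target_pairs` is a hypothetical stand-in for `get_ngrams` that works on tokens instead of ids and does not use `nltk`.

```python
def context_target_pairs(
    doc: list[str], n: int, init_token: str = "<s>"
) -> list[tuple[list[str], str]]:
    """Return (context, target) pairs; doc already starts with one init token."""
    # Add n-2 extra init tokens so the first real word has a full context.
    padded = [init_token] * (n - 2) + doc
    return [
        (padded[i : i + n - 1], padded[i + n - 1])
        for i in range(len(padded) - n + 1)
    ]


toy_doc = ["<s>", "hola", "mundo", "</s>"]
for context, target in context_target_pairs(toy_doc, n=4):
    print(context, "->", target)
```

With this padding, a document with M words yields M + 1 training examples: one per word plus one for the end token.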

We define the ``TensorDataset`` and ``DataLoader`` objects with our data.

In [28]:
batch_size = 64
num_workers = 2

# training
train_dataset = TensorDataset(
    torch.tensor(X_train, dtype=torch.int64), torch.tensor(y_train, dtype=torch.int64)
)

train_loader = DataLoader(
    train_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True
)

# validation
val_dataset = TensorDataset(
    torch.tensor(X_val, dtype=torch.int64), torch.tensor(y_val, dtype=torch.int64)
)

val_loader = DataLoader(
    val_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False
)

The following class implements Bengio's neural language model. The code is the instructor's from lab session 4, adapted to use pretrained embeddings and with *type hints*.

In [29]:
class NeuralLM(nn.Module):

    def __init__(
        self,
        window_size: int,
        embedding_dim: int,
        hidden_dim: int,
        vocab_size: int,
        pretrained_embeddings: Optional[np.ndarray] = None,
        dropout: float = 0.1,
    ):
        super().__init__()

        self.window_size = window_size
        self.embedding_dim = embedding_dim

        if pretrained_embeddings is not None:
            self.emb = nn.Embedding.from_pretrained(
                torch.tensor(pretrained_embeddings, dtype=torch.float),
                freeze=False,
            )
        else:
            self.emb = nn.Embedding(vocab_size, embedding_dim)

        self.dense_1 = nn.Linear(embedding_dim * (window_size), hidden_dim)
        self.drop1 = nn.Dropout(p=dropout)
        self.dense_2 = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, x: torch.LongTensor) -> torch.FloatTensor:
        x = self.emb(x)
        x = x.view(-1, self.window_size * self.embedding_dim)
        h = nn.functional.relu(self.dense_1(x))
        h = self.drop1(h)
        return self.dense_2(h)

The following functions are also from the instructor's code; they obtain the model's predictions, evaluate the model, and save model checkpoints during training.

In [30]:
def get_preds(raw_logits: torch.FloatTensor) -> np.ndarray:
    probs = nn.functional.softmax(raw_logits.detach(), dim=1)
    y_pred = torch.argmax(probs, dim=1).cpu().numpy()

    return y_pred

def model_eval(data: DataLoader, model: NeuralLM, gpu: bool = False) -> float:
    with torch.no_grad():
        preds, tgts = [], []
        for window_words, labels in data:
            if gpu:
                window_words = window_words.cuda()

            outputs = model(window_words)

            # Get prediction
            y_pred = get_preds(outputs)

            tgt = labels.numpy()
            tgts.append(tgt)
            preds.append(y_pred)

    tgts = [e for l in tgts for e in l]
    preds = [e for l in preds for e in l]

    return accuracy_score(tgts, preds)

def save_checkpoint(state: dict, is_best: bool, checkpoint_path: Path):
    filename = checkpoint_path / "checkpoint.pt"
    torch.save(state, filename)

    if is_best:
        shutil.copyfile(filename, checkpoint_path / "model_best.pt")

Model hyperparameters

In [31]:
# Model hyperparameters
embedding_dim = 100  # Dimension of word embedding
hidden_dim = 200  # Dimension for hidden layer
dropout = 0.2

# Training hyperparameters
lr = 2.3e-1
num_epochs = 100
patience = 20

# Scheduler hyperparameters
lr_patience = 10
lr_factor = 0.7

Definition of the model, loss function, optimizer, and scheduler for updating the *learning rate*.

In [None]:
# Create model
model = NeuralLM(
    window_size = 3,
    embedding_dim = embedding_dim,
    hidden_dim = hidden_dim,
    vocab_size = vocab_len,
    pretrained_embeddings = embeddings_w,
    dropout = dropout
)

# Send to GPU
use_gpu = torch.cuda.is_available()
if use_gpu:
    model.cuda()

# Loss, Optimizer and Scheduler
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                optimizer, "max",  # stepped with validation accuracy, which should be maximized
                patience = lr_patience,
                factor = lr_factor
            )

Model training (instructor's code):

In [None]:
start_time = time.time()
best_metric = 0
n_no_improve = 0  # epochs since the last validation improvement
metric_history = []
train_metric_history = []

for epoch in range(num_epochs):
    epoch_start_time = time.time()
    loss_epoch = []
    training_metric = []
    model.train()

    for window_words, labels in train_loader:

        # If GPU available
        if use_gpu:
            window_words = window_words.cuda()
            labels = labels.cuda()

        # Forward pass
        outputs = model(window_words)
        loss = criterion(outputs, labels)
        loss_epoch.append(loss.item())

        # Get training metrics
        y_pred = get_preds(outputs)
        tgt = labels.cpu().numpy()
        training_metric.append(accuracy_score(tgt, y_pred))

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Get metric in training dataset
    mean_epoch_metric = np.mean(training_metric)
    train_metric_history.append(mean_epoch_metric)

    # Get metric in validation dataset
    model.eval()
    tuning_metric = model_eval(val_loader, model, gpu = use_gpu)
    metric_history.append(tuning_metric)

    # Update scheduler
    scheduler.step(tuning_metric)

    # Check for metric improvement
    is_improvement = tuning_metric > best_metric
    if is_improvement:
        best_metric = tuning_metric
        n_no_improve = 0
    else:
        n_no_improve += 1

    # Save a checkpoint every epoch; it is also copied as the best model if the metric improved
    save_checkpoint(
      {
        'epoch': epoch + 1,
        'state_dict': model.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'best_metric': best_metric,
      },
      is_improvement,
      savedir
    )

    if n_no_improve >= patience:
        print("No improvement. Breaking out of loop")
        break

    print('Train acc: {}'.format(mean_epoch_metric))
    print('Epoch [{}/{}], Loss: {:.4f} - Val accuracy: {:.4f} - Epoch time: {:.2f}s'.format(
        epoch + 1,
        num_epochs,
        np.mean(loss_epoch),
        tuning_metric,
        time.time() - epoch_start_time
    ))

    print("---%s seconds ---" % (time.time() - start_time))

Train acc: 0.09933243258949327
Epoch [1/100], Loss: 6.0868 - Val accuracy: 0.1119 - Epoch time: 12.04s
---12.04413890838623 seconds ---
Train acc: 0.10462793468154348
Epoch [2/100], Loss: 5.8519 - Val accuracy: 0.1037 - Epoch time: 7.16s
---19.204473972320557 seconds ---
Train acc: 0.10935683984193399
Epoch [3/100], Loss: 5.6523 - Val accuracy: 0.1052 - Epoch time: 5.98s
---25.187881469726562 seconds ---
Train acc: 0.11151789865178986
Epoch [4/100], Loss: 5.4623 - Val accuracy: 0.1034 - Epoch time: 7.12s
---32.30402064323425 seconds ---
Train acc: 0.11458696536494654
Epoch [5/100], Loss: 5.2962 - Val accuracy: 0.1166 - Epoch time: 6.30s
---38.606375217437744 seconds ---
Train acc: 0.11929407833565785
Epoch [6/100], Loss: 5.1337 - Val accuracy: 0.1197 - Epoch time: 7.21s
---45.812177419662476 seconds ---
Train acc: 0.1230423349604835
Epoch [7/100], Loss: 4.9877 - Val accuracy: 0.1114 - Epoch time: 6.10s
---51.90981698036194 seconds ---
Train acc: 0.12840321362157137
Epoch [8/100], Loss:
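
The early-stopping rule used in the training loop above can be isolated in a few lines. This is only a sketch: `early_stopping_epoch` and `toy_history` are hypothetical names, and the accuracy values are made up rather than taken from the log.

```python
def early_stopping_epoch(val_metrics: list[float], patience: int) -> int:
    """Return the 1-based epoch at which training stops (or the last epoch)."""
    best_metric = 0.0
    n_no_improve = 0
    for epoch, metric in enumerate(val_metrics, start=1):
        if metric > best_metric:
            # New best validation metric: reset the counter.
            best_metric = metric
            n_no_improve = 0
        else:
            n_no_improve += 1
        if n_no_improve >= patience:
            return epoch
    return len(val_metrics)


toy_history = [0.10, 0.12, 0.11, 0.115, 0.119, 0.13]
print(early_stopping_epoch(toy_history, patience=3))
```

With `patience=3` the toy run stops at epoch 5, one epoch before the metric would have improved again, which is why a generous patience (20 in this notebook) is a reasonable choice for a noisy validation accuracy.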

We load the best model obtained during training:

In [32]:
best_model = NeuralLM(
    window_size = 3,
    embedding_dim = embedding_dim,
    hidden_dim = hidden_dim,
    vocab_size = vocab_len,
    dropout = dropout
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
best_model.load_state_dict(torch.load(savedir / "model_best.pt", map_location=device)["state_dict"])
best_model.train(False)

NeuralLM(
  (emb): Embedding(11428, 100)
  (dense_1): Linear(in_features=300, out_features=200, bias=True)
  (drop1): Dropout(p=0.2, inplace=False)
  (dense_2): Linear(in_features=200, out_features=11428, bias=False)
)

---

## 1.2) Word similarity

After training the model, retrieve the n most similar words to three given words of your choice.

---

Function that returns the n words most similar to a given one, measured by the Euclidean distance between their embeddings.

In [34]:
def n_similar_words(word: str, n: int, model: NeuralLM, processor: TextProcessor) -> list[tuple[str, float]]:
    word_id = processor.word2id[word]
    word_emb = model.emb(torch.LongTensor([word_id]))
    # convert to plain floats so the returned distances match the annotated return type
    dists = torch.linalg.norm(model.emb.weight - word_emb, dim=1).detach().tolist()
    dists_ord = sorted(enumerate(dists), key=lambda x: x[1])

    return [(processor.id2word[idx], dist) for idx, dist in dists_ord[1:n+1]]
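
The same nearest-neighbour idea can be checked by hand on a toy embedding table. The sketch below is dependency-free; `nearest_words` and the two-dimensional vectors are hypothetical, and unlike the function above it excludes the query word explicitly instead of dropping the first sorted entry.

```python
import math


def nearest_words(
    word: str, n: int, embeddings: dict[str, list[float]]
) -> list[tuple[str, float]]:
    """Rank every other word by Euclidean distance to `word`, closest first."""
    target = embeddings[word]
    distances = [
        (other, math.dist(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(distances, key=lambda pair: pair[1])[:n]


toy_embeddings = {
    "madre": [1.0, 1.0],
    "mama": [1.0, 2.0],
    "padre": [4.0, 5.0],
    "clima": [-5.0, -7.0],
}
print(nearest_words("madre", 2, toy_embeddings))
```

Excluding the query by name avoids returning it when another word happens to share exactly the same vector and sorts ahead of it.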


We consider the 10 words most similar to "madre", "cabrón", and "chingada".

In [35]:
words = ["madre", "cabrón", "chingada"]

for word in words:
    print("Palabras similares a", word, ":")
    close_words = n_similar_words(word, 10, best_model, processor)
    print(", ".join([f"{tup[0]} ({tup[1]:.4f})" for tup in close_words]))
    print("-"*30)

Palabras similares a madre :
mama (15.2459), hermana (15.9248), abuela (15.9911), hija (17.1942), mamá (17.3497), abuelita (17.5991), vecina (18.3809), papá (18.4850), mamà (18.7762), padre (19.0859)
------------------------------
Palabras similares a cabrón :
mamón (10.6598), cabron (11.3008), maricón (12.0740), culero (12.2599), marica (12.6395), putito (12.6703), imbécil (12.9490), baboso (13.1617), pendejo (13.3371), desgraciado (13.3585)
------------------------------
Palabras similares a chingada :
fregada (11.1419), verga (15.3635), verch (16.5052), gaver (16.7693), vrg (16.7979), reputa (18.2148), vg (18.3495), putisima (18.5655), friendzone (18.6779), reputisima (18.7386)
------------------------------


---

## 1.3) Text generation

Have the model generate text from three starting sequences of your choice.

---

The following function generates text using the *temperature* scaling that the instructor used in the lab session, but after scaling, only the top-k most probable tokens are considered when sampling the next word.

In [36]:
def generate_text(
    model: NeuralLM,
    processor: TextProcessor,
    seed: Optional[list[str]] = None,
    n_tokens: int = 50,
    top_k : int = 1000,
    temperature: float = 1.0,
) -> list[str]:
    if seed is None:
        text = [processor.INIT_TKN] * (model.window_size)
    else:
        if len(seed) != model.window_size:
            raise ValueError("seed should be of length of the window size of the model")

        text = list(map(processor.mask_oov, seed))


    for i in range(n_tokens):
        context = text[-model.window_size :]  # last tokens for the window

        logits = model(torch.LongTensor([[processor.word2id[w] for w in context]])).detach().numpy()[0]
        logits_adj = logits/temperature

        # softmax
        probs = np.exp(logits_adj)
        probs = probs / np.sum(probs)

        word_and_probs = [(processor.id2word[i], p) for i, p in enumerate(probs)]
        word_and_probs.sort(key=lambda x: x[1], reverse=True)

        words_topk = [w for w, p in word_and_probs[:top_k]]
        probs_topk = np.array([p for w, p in word_and_probs[:top_k]])
        probs_topk = probs_topk / np.sum(probs_topk) # normalize probabilities

        pred_word = np.random.choice(words_topk, p=probs_topk)

        text.append(pred_word)

        if pred_word == processor.END_TKN:
            break

    return text
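
The effect of the two sampling parameters can be seen on a handful of logits. The following is a sketch with hypothetical names (`top_k_distribution`, `toy_logits`); it subtracts the maximum before exponentiating, a standard stabilization that the function above does not apply, and otherwise yields the same renormalized top-k probabilities.

```python
import math


def top_k_distribution(
    logits: list[float], top_k: int, temperature: float = 1.0
) -> dict[int, float]:
    """Temperature-scaled softmax restricted to the top_k most likely ids."""
    scaled = [logit / temperature for logit in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    weights = [math.exp(value - shift) for value in scaled]
    # Keep the ids of the top_k largest weights and renormalize over them.
    ranked = sorted(range(len(weights)), key=lambda idx: weights[idx], reverse=True)[:top_k]
    total = sum(weights[idx] for idx in ranked)
    return {idx: weights[idx] / total for idx in ranked}


toy_logits = [2.0, 1.0, 0.0, -1.0]
for temperature in (0.5, 1.0, 2.0):
    dist = top_k_distribution(toy_logits, top_k=2, temperature=temperature)
    print(temperature, {idx: round(p, 3) for idx, p in dist.items()})
```

Lower temperatures concentrate the probability mass on the most likely token, while higher temperatures flatten the distribution; `top_k` simply removes the long tail before sampling.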

Examples:

In [37]:
np.random.seed(0)
sequence_1 = ["<s>", "<s>", "<s>"]
print(" ".join(generate_text(best_model, processor, seed=sequence_1, temperature=0.7, top_k=500)))

<s> <s> <s> es más fácil hacerla para que putas estas fotos y te valgo verga la vida de estos nacos que hace como las putas de colosio no se da y la madre teresa de calcuta </s>


In [38]:
np.random.seed(0)
sequence_2 = ["<s>", "hola", "como"]
print(" ".join(generate_text(best_model, processor, seed=sequence_2, temperature=0.8, top_k=500)))

<s> hola como la puta madre ya no son como tonta la pregunta para que se están mamando con coco de toronja verga ya clima lo me la pela la me vale verga </s>


In [39]:
np.random.seed(7)
sequence_3 = ["hijo", "de", "la"]
print(" ".join(generate_text(best_model, processor, seed=sequence_3, temperature=0.8, top_k=500)))

hijo de la vergueishon tonatiuh cervantes explicó sobre las próximas inversiones para el estadio por la pero <unk> verga </s>


---

## 1.4) Sentence likelihood

Write 5 example sentences and measure their likelihood.

---

Function that computes the log-likelihood of a given (tokenized) text, adapted from the instructor's code. The log-likelihood is preferred over the likelihood because the likelihood is a product of probabilities, so it becomes a very small number as the number of factors grows.

In [31]:
def log_likelihood(model: NeuralLM, text: list[str], processor: TextProcessor) -> float:
    text = processor.mask_text(text)

    X, y = get_ngrams(masked_corpus=[text], n=4, text_processor=processor)
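    # NOTE: this drops the first two n-grams, i.e. the predictions of the
    # first two words of the text given the <s> padding are not scored
    X, y = X[2:], y[2:]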
    X = torch.LongTensor(X)

    logits = model(X).detach()
    probs = nn.functional.softmax(logits, dim=1).numpy()

    # consider the case when the probability is practically 0
    return sum(np.log(probs[i][w] if not np.isclose(probs[i][w], 0) else 1e-8) for  i, w in enumerate(y))
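
The numerical reason for working in log space can be shown directly. In this sketch the per-token probability of 0.01 and the length of 400 tokens are made-up values chosen only to trigger the underflow.

```python
import math

# 400 tokens, each with probability 0.01 (hypothetical values).
toy_probs = [0.01] * 400

likelihood = math.prod(toy_probs)             # product of probabilities
log_lh = sum(math.log(p) for p in toy_probs)  # sum of log-probabilities

print(likelihood)  # the product underflows to 0.0 in double precision
print(log_lh)      # the sum of logs stays finite
```

The true likelihood here is 1e-800, far below the smallest positive double, while its logarithm is an ordinary number that can still be compared across sentences.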

Examples:

In [41]:
sentences = [
    ["por", "eso", "estamos", "como", "estamos"],
    ["hola", "buen", "dia", "como", "están"],
    ["hijos", "de", "la", "chingada", "como", "creen"],
    ["vas", "a", "ver", "hijo", "de", "tu"],
    ["maldito", "clima", "hace", "un", "chingo", "de", "calor"]
]
for sent in sentences:
    result = log_likelihood(best_model, sent, processor)
    print(" ".join(sent))
    print("Log-verosimilitud:", result)
    print("-"*30)

por eso estamos como estamos
Log-verosimilitud: -42.60297
------------------------------
hola buen dia como están
Log-verosimilitud: -57.450253
------------------------------
hijos de la chingada como creen
Log-verosimilitud: -29.718723
------------------------------
vas a ver hijo de tu
Log-verosimilitud: -28.189758
------------------------------
maldito clima hace un chingo de calor
Log-verosimilitud: -45.35263
------------------------------


---

## 1.5) Syntactic structures

Propose an example to find good syntactic structures (word permutations of some sentence) using the likelihood, starting from a sentence you propose.

---

Given a sentence, we compute the log-likelihood of every permutation of its tokens. We then show the 5 permutations with the highest log-likelihood, and therefore the highest likelihood, i.e., the ones with the most syntactic structure according to the model. The 5 permutations with the lowest likelihood are also shown.

In [42]:
sentence = ["hijos", "de", "la", "chingada", "van", "a", "ver"]

log_likelihood_perms = [(perm, log_likelihood(best_model, perm, processor)) for perm in permutations(sentence)]
log_likelihood_perms.sort(key=lambda x: x[1], reverse=True)
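
A 7-token sentence has 7! = 5040 permutations, so the search above calls the model 5040 times. The ranking idea itself can be illustrated without the neural model; the sketch below uses a hypothetical bigram table (`toy_bigram_logp`) with made-up probabilities as a stand-in scorer.

```python
from itertools import permutations
import math

# Hypothetical bigram log-probabilities; unseen bigrams get a low default score.
toy_bigram_logp = {
    ("<s>", "el"): math.log(0.5),
    ("el", "perro"): math.log(0.4),
    ("perro", "ladra"): math.log(0.3),
    ("ladra", "</s>"): math.log(0.6),
}
UNSEEN_LOGP = math.log(1e-4)


def toy_log_likelihood(tokens: tuple[str, ...]) -> float:
    """Sum bigram log-probabilities over the sentence with boundary tokens."""
    padded = ("<s>",) + tokens + ("</s>",)
    return sum(toy_bigram_logp.get(pair, UNSEEN_LOGP) for pair in zip(padded, padded[1:]))


toy_sentence = ["ladra", "el", "perro"]
ranked = sorted(permutations(toy_sentence), key=toy_log_likelihood, reverse=True)
print(math.factorial(len(toy_sentence)), "permutations; best:", " ".join(ranked[0]))
```

Because the number of permutations grows factorially, this exhaustive ranking is only practical for short sentences.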

Permutations with the highest log-likelihood:

In [43]:
for perm, log_lh in log_likelihood_perms[:5]:
    print(" ".join(perm))
    print("Log-verosimilitud:", log_lh)
    print("-"*30)

van ver a hijos de la chingada
Log-verosimilitud: -12.792529
------------------------------
la van a ver hijos de chingada
Log-verosimilitud: -15.818754
------------------------------
hijos van a ver de la chingada
Log-verosimilitud: -16.934473
------------------------------
ver van a hijos de la chingada
Log-verosimilitud: -17.56024
------------------------------
chingada hijos de la van a ver
Log-verosimilitud: -19.902334
------------------------------


Permutations with the lowest log-likelihood:

In [44]:
for perm, log_lh in log_likelihood_perms[-5:]:
    print(" ".join(perm))
    print("Log-verosimilitud:", log_lh)
    print("-"*30)

de a chingada van la hijos ver
Log-verosimilitud: -117.186806
------------------------------
de chingada la a hijos van ver
Log-verosimilitud: -117.82024
------------------------------
la de chingada a hijos van ver
Log-verosimilitud: -118.17029
------------------------------
a de chingada la hijos van ver
Log-verosimilitud: -118.603195
------------------------------
de a chingada la hijos van ver
Log-verosimilitud: -122.77367
------------------------------


---

## 1.6) Model evaluation through perplexity

Compute the perplexity of the model on the validation data. Compare it with the perplexity of the language model without pretrained embeddings and with the probabilistic model from the previous assignment.

---

First we train the model without pretrained embeddings for the comparison.

Model hyperparameters

In [None]:
# Model hyperparameters
embedding_dim = 100  # Dimension of word embedding
hidden_dim = 200  # Dimension for hidden layer
dropout = 0.2

# Training hyperparameters
lr = 2.3e-1
num_epochs = 100
patience = 20

# Scheduler hyperparameters
lr_patience = 10
lr_factor = 0.7

In [None]:
# Create model
model_scratch = NeuralLM(
    window_size = 3,
    embedding_dim = embedding_dim,
    hidden_dim = hidden_dim,
    vocab_size = vocab_len,
    dropout = dropout
)

# Send to GPU
use_gpu = torch.cuda.is_available()
if use_gpu:
    model_scratch.cuda()

# Loss, Optimizer and Scheduler
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_scratch.parameters(), lr = lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                optimizer, "max",  # stepped with validation accuracy, which should be maximized
                patience = lr_patience,
                factor = lr_factor
            )

Model training:

In [None]:
start_time = time.time()
best_metric = 0
n_no_improve = 0  # epochs since the last validation improvement
metric_history = []
train_metric_history = []

for epoch in range(num_epochs):
    epoch_start_time = time.time()
    loss_epoch = []
    training_metric = []
    model_scratch.train()

    for window_words, labels in train_loader:

        # If GPU available
        if use_gpu:
            window_words = window_words.cuda()
            labels = labels.cuda()

        # Forward pass
        outputs = model_scratch(window_words)
        loss = criterion(outputs, labels)
        loss_epoch.append(loss.item())

        # Get training metrics
        y_pred = get_preds(outputs)
        tgt = labels.cpu().numpy()
        training_metric.append(accuracy_score(tgt, y_pred))

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Get metric in training dataset
    mean_epoch_metric = np.mean(training_metric)
    train_metric_history.append(mean_epoch_metric)

    # Get metric in validation dataset
    model_scratch.eval()
    tuning_metric = model_eval(val_loader, model_scratch, gpu = use_gpu)
    metric_history.append(tuning_metric)

    # Update scheduler
    scheduler.step(tuning_metric)

    # Check for metric improvement
    is_improvement = tuning_metric > best_metric
    if is_improvement:
        best_metric = tuning_metric
        n_no_improve = 0
    else:
        n_no_improve += 1

    # Save a checkpoint every epoch; it is also copied as the best model if the metric improved
    save_checkpoint(
      {
        'epoch': epoch + 1,
        'state_dict': model_scratch.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'best_metric': best_metric,
      },
      is_improvement,
      savedir_scratch
    )

    if n_no_improve >= patience:
        print("No improvement. Breaking out of loop")
        break

    print('Train acc: {}'.format(mean_epoch_metric))
    print('Epoch [{}/{}], Loss: {:.4f} - Val accuracy: {:.4f} - Epoch time: {:.2f}s'.format(
        epoch + 1,
        num_epochs,
        np.mean(loss_epoch),
        tuning_metric,
        time.time() - epoch_start_time
    ))

    print("---%s seconds ---" % (time.time() - start_time))

Train acc: 0.08506218038121803
Epoch [1/100], Loss: 6.4539 - Val accuracy: 0.0942 - Epoch time: 11.17s
---11.173033714294434 seconds ---
Train acc: 0.1019039109716411
Epoch [2/100], Loss: 5.9382 - Val accuracy: 0.1090 - Epoch time: 7.91s
---19.082343339920044 seconds ---
Train acc: 0.1070831880520688
Epoch [3/100], Loss: 5.7054 - Val accuracy: 0.1199 - Epoch time: 7.33s
---26.412657260894775 seconds ---
Train acc: 0.1120045908879591
Epoch [4/100], Loss: 5.5124 - Val accuracy: 0.0992 - Epoch time: 6.21s
---32.61926031112671 seconds ---
Train acc: 0.11610515457926546
Epoch [5/100], Loss: 5.3314 - Val accuracy: 0.1071 - Epoch time: 7.26s
---39.881489515304565 seconds ---
Train acc: 0.1181136680613668
Epoch [6/100], Loss: 5.1573 - Val accuracy: 0.1134 - Epoch time: 6.39s
---46.26910924911499 seconds ---
Train acc: 0.12061976987447699
Epoch [7/100], Loss: 4.9837 - Val accuracy: 0.1234 - Epoch time: 7.34s
---53.60463190078735 seconds ---
Train acc: 0.1228026208740121
Epoch [8/100], Loss: 4.8

We load the best model obtained during training:

In [45]:
best_model_scratch = NeuralLM(
    window_size = 3,
    embedding_dim = embedding_dim,
    hidden_dim = hidden_dim,
    vocab_size = vocab_len,
    dropout = dropout
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
best_model_scratch.load_state_dict(torch.load(savedir_scratch / "model_best.pt", map_location=device)["state_dict"])
best_model_scratch.train(False)

NeuralLM(
  (emb): Embedding(11428, 100)
  (dense_1): Linear(in_features=300, out_features=200, bias=True)
  (drop1): Dropout(p=0.2, inplace=False)
  (dense_2): Linear(in_features=200, out_features=11428, bias=False)
)

To compute the perplexity of a model, we can reuse our function that computes the log-likelihood of a text $X=(x_1, \dots, x_N)$, since its perplexity $PPL(X)$ is exactly
$$
PPL(X) = \exp\left(-\frac{1}{N} \log \mathbb{P}(X) \right),
$$
where $\log \mathbb{P}(X)$ is the log-likelihood of the text $X$ and $N$ is its number of tokens.

In [32]:
def perplexity(model: NeuralLM, corpus: list[list[str]], text_processor: TextProcessor) -> float:
    # corpus is masked inside log_likelihood

    corpus_len = sum(len(text) for text in corpus)
    log_lh = sum(log_likelihood(model, text, text_processor) for text in corpus)

    return math.exp(-log_lh/corpus_len)
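
As a sanity check on the formula, the following minimal sketch computes perplexity directly from per-token log-probabilities. The helper name `perplexity_from_log_probs` is hypothetical and is not part of the notebook; it only illustrates that a model spreading probability uniformly over $V$ tokens has perplexity $V$.

```python
import math

def perplexity_from_log_probs(log_probs):
    """Perplexity of a sequence given the natural-log probability of each token."""
    n_tokens = len(log_probs)
    log_likelihood = sum(log_probs)  # log P(X) = sum of per-token log-probabilities
    return math.exp(-log_likelihood / n_tokens)

# A uniform model over V tokens assigns log(1/V) to every token,
# so its perplexity equals V regardless of the sequence length.
V = 429
uniform_ppl = perplexity_from_log_probs([math.log(1 / V)] * 10)
print(uniform_ppl)
```

This gives a useful upper reference: a trained model should reach a perplexity well below its vocabulary size.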

Perplexity of the best model with pretrained embeddings:

In [122]:
perplexity(best_model, val_corpus_tk, processor)

602.0006596709721

Perplexity of the best model trained from scratch:

In [123]:
perplexity(best_model_scratch, val_corpus_tk, processor)

185.10026931719662

---

## 1.7) Discussion

---

Through the exercise of finding similar words, we see that the neural language model does learn something from the text. For example, the model finds the word "madre" similar to "abuela", "mama", "papá", etc. It also finds "chingada" similar to words such as "fregada" and "verga", which makes sense because those words tend to appear in similar phrases (such as "vete a la ..."). Likewise, if we fix a sentence and permute its tokens, the permutations with the highest likelihood do show some syntactic structure.

Curiously, the model trained from scratch reached a lower perplexity (185.1) than the model with pretrained embeddings (602.0). My suspicion is that training from scratch finds embeddings that work for this particular corpus, which yields a lower perplexity. The pretrained embeddings, by contrast, were most likely trained on a general-purpose corpus, so the training process has a harder time adapting them to this particular context.

The perplexity of the best frequency-based (interpolated) language model from the previous assignment was 138.74. This value is close to the one obtained with the neural language model trained from scratch. However, the corpora of this assignment and the previous one are different, so the perplexities may not be comparable. The corpus of the previous assignment, made up of press conferences by the presidents of Mexico, uses more formal language and may therefore have more structure, so a model could be expected to reach a lower perplexity on it.

# 2) Neural Language Model (Bengio 2003) at the character level

---

## 2.1) Model implementation

Based on the NLM implementation shown in the lab sessions, build a character-level neural language model. Use sequences of length 6 or more for the model, i.e., 5 or more characters of context.

---

Note that ``list(text)``, where ``text`` is a string, gives the character-level tokenization. As vocabulary we use all the characters found in the training set, without considering punctuation marks, plus the special tokens.

In [33]:
train_tk_char = [list(text) for text in train_corpus]
val_tk_char = [list(text) for text in val_corpus]

vocab_char = set()
for text in train_tk_char:
    vocab_char.update(text)

vocab_char.update([INIT_TKN, END_TKN, UNK_TKN])
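
To make the tokenization step concrete, here is a small sketch on a toy corpus. The two strings and the `toy_` names are made up for illustration; the notebook applies the same idea to the MEX-A3T tweets.

```python
# Toy corpus; the real notebook uses the MEX-A3T tweets instead.
toy_corpus = ["hola", "ola k ase"]

# list(text) splits a string into its individual characters, spaces included.
toy_tokens = [list(text) for text in toy_corpus]

# The vocabulary is the set of all characters seen in the corpus.
toy_vocab = set()
for chars in toy_tokens:
    toy_vocab.update(chars)

print(toy_tokens[0])
print(sorted(toy_vocab))
```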

In [34]:
vocab_char_len = len(vocab_char)
print("Tamaño del vocabulario:", vocab_char_len)

Tamaño del vocabulario: 429


We load the ``{ char -> char id }`` mapping if it was already computed earlier.

In [35]:
char2id_path = base_path / "models-tarea-5" / "char2id.json"

if char2id_path.exists():
    with open(char2id_path, "r", encoding="utf-8") as file:
        char2id_json = json.load(file)
else:
    char2id_json = None

In [36]:
processor_char = TextProcessor(vocab=vocab_char, word2id=char2id_json)
train_char_msk = processor_char.transform(train_tk_char)
val_char_msk = processor_char.transform(val_tk_char)

If the JSON file had not been created before, we create it from the mapping computed in ``processor_char``.

In [37]:
if char2id_json is None:
    with open(char2id_path, "w", encoding="utf-8") as file:
        json.dump(processor_char.word2id, file, ensure_ascii=False)

Percentage of ``<UNK>`` tokens in the validation corpus:

In [38]:
sum(
    sum([1 if tkn == UNK_TKN else 0 for tkn in tweet]) for tweet in val_char_msk
) / sum(len(tweet) for tweet in val_char_msk) * 100

0.03829290241053821

We see that some characters in the validation set do not appear in the training set. Since the vocabulary consists of all the characters in the training set, no embedding would ever be trained for the ``<UNK>`` token, so we remove those characters from the validation set. The next cell also shows that the out-of-vocabulary characters are mostly emojis, so removing them should have little effect.

In [39]:
for tweet in val_tk_char:
    for tkn in tweet:
        if tkn not in processor_char.vocab:
            print(tkn)

💥
🆙
👱
📧
🏈
♠
🐃
🖤
🍠
💰
̶
̶
̶
̶
̶
̶
🍾
🤚
😼
😼


Filtering unknown characters out of the validation corpus:

In [40]:
val_tk_char = [[tkn for tkn in tweet if tkn in processor_char.vocab] for tweet in val_tk_char]
val_char_msk = processor_char.transform(val_tk_char)

We build the training and validation sets from the n-grams of the corpora:

In [41]:
X_char_train, y_char_train = get_ngrams(masked_corpus=train_char_msk, n=4, text_processor=processor_char)
X_char_val, y_char_val = get_ngrams(masked_corpus=val_char_msk, n=4, text_processor=processor_char)
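
The sketch below illustrates how context/target pairs can be built from a character sequence with `n=4`. It is not the notebook's `get_ngrams`; the function `char_ngrams`, its padding scheme, and the `"</s>"` end token are assumptions made for illustration (only the `"<s>"` start token appears in this notebook's examples).

```python
def char_ngrams(chars, n, init_tkn="<s>", end_tkn="</s>"):
    """Split a character sequence into (context, target) pairs of an n-gram model.

    Assumption: the sequence is padded with n-1 start tokens and one end token.
    """
    padded = [init_tkn] * (n - 1) + list(chars) + [end_tkn]
    contexts, targets = [], []
    for i in range(len(padded) - n + 1):
        window = padded[i : i + n]
        contexts.append(window[:-1])  # first n-1 symbols are the context
        targets.append(window[-1])    # last symbol is the one to predict
    return contexts, targets

contexts, targets = char_ngrams("hola", n=4)
for ctx, tgt in zip(contexts, targets):
    print(ctx, "->", tgt)
```

With `n=4` every example has 3 context symbols, which matches the `window_size = 3` passed to the model.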

We define the ``TensorDataset`` and ``DataLoader`` objects for our data.

In [42]:
batch_size = 64
num_workers = 2

# training
train_char_dataset = TensorDataset(
    torch.tensor(X_char_train, dtype=torch.int64), torch.tensor(y_char_train, dtype=torch.int64)
)

train_char_loader = DataLoader(
    train_char_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=True
)

# validation
val_char_dataset = TensorDataset(
    torch.tensor(X_char_val, dtype=torch.int64), torch.tensor(y_char_val, dtype=torch.int64)
)

val_char_loader = DataLoader(
    val_char_dataset, batch_size=batch_size, num_workers=num_workers, shuffle=False
)

Model hyperparameters

In [44]:
# Model hyperparameters
embedding_dim = 100  # Dimension of the character embeddings
hidden_dim = 200  # Dimension for hidden layer
dropout = 0.2

# Training hyperparameters
lr = 2.3e-1
num_epochs = 100
patience = 20

# Scheduler hyperparameters
lr_patience = 10
lr_factor = 0.7

Model:

In [61]:
# Create model
model_char = NeuralLM(
    window_size = 3,
    embedding_dim = embedding_dim,
    hidden_dim = hidden_dim,
    vocab_size = vocab_char_len,
    dropout = dropout
)

# Send to GPU
use_gpu = torch.cuda.is_available()
if use_gpu:
    model_char.cuda()

# Loss, Optimizer and Scheduler
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_char.parameters(), lr = lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
                optimizer, "max",  # stepped with validation accuracy, which should increase
                patience = lr_patience,
                factor = lr_factor
            )

Training

In [62]:
start_time = time.time()
best_metric = 0
n_no_improve = 0  # epochs since the last improvement in validation accuracy
metric_history = []
train_metric_history = []

for epoch in range(num_epochs):
    epoch_start_time = time.time()
    loss_epoch = []
    training_metric = []
    model_char.train()

    for window_words, labels in train_char_loader:

        # If GPU available
        if use_gpu:
            window_words = window_words.cuda()
            labels = labels.cuda()

        # Forward pass
        outputs = model_char(window_words)
        loss = criterion(outputs, labels)
        loss_epoch.append(loss.item())

        # Get training metrics
        y_pred = get_preds(outputs)
        tgt = labels.cpu().numpy()
        training_metric.append(accuracy_score(tgt, y_pred))

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Get mean training metric for the epoch
    mean_epoch_metric = np.mean(training_metric)
    train_metric_history.append(mean_epoch_metric)

    # Get metric in validation dataset
    model_char.eval()
    tuning_metric = model_eval(val_char_loader, model_char, gpu = use_gpu)
    metric_history.append(tuning_metric)

    # Update scheduler
    scheduler.step(tuning_metric)

    # Check for metric improvement
    is_improvement = tuning_metric > best_metric
    if is_improvement:
        best_metric = tuning_metric
        n_no_improve = 0
    else:
        n_no_improve += 1

    # Save best model if metric improved
    save_checkpoint(
      {
        'epoch': epoch + 1,
        'state_dict': model_char.state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict(),
        'best_metric': best_metric,
      },
      is_improvement,
      savedir_char
    )

    if n_no_improve >= patience:
        print("No improvement. Breaking out of loop")
        break

    print('Train acc: {}'.format(mean_epoch_metric))
    print('Epoch [{}/{}], Loss: {:.4f} - Val accuracy: {:.4f} - Epoch time: {:.2f}s'.format(
        epoch + 1,
        num_epochs,
        np.mean(loss_epoch),
        tuning_metric,
        time.time() - epoch_start_time
    ))

    print("---%s seconds ---" % (time.time() - start_time))

Train acc: 0.39974606701450194
Epoch [1/100], Loss: 2.1194 - Val accuracy: 0.4389 - Epoch time: 42.66s
---42.663923501968384 seconds ---
Train acc: 0.42846156394094925
Epoch [2/100], Loss: 1.9779 - Val accuracy: 0.4478 - Epoch time: 28.78s
---71.44033074378967 seconds ---
Train acc: 0.4364697700007652
Epoch [3/100], Loss: 1.9363 - Val accuracy: 0.4503 - Epoch time: 29.11s
---100.5498297214508 seconds ---
Train acc: 0.44156003169911207
Epoch [4/100], Loss: 1.9107 - Val accuracy: 0.4554 - Epoch time: 32.10s
---132.65417861938477 seconds ---
Train acc: 0.4449938660021709
Epoch [5/100], Loss: 1.8944 - Val accuracy: 0.4605 - Epoch time: 29.49s
---162.14362502098083 seconds ---
Train acc: 0.4469325494049863
Epoch [6/100], Loss: 1.8811 - Val accuracy: 0.4634 - Epoch time: 38.41s
---200.55559015274048 seconds ---
Train acc: 0.44944836530640586
Epoch [7/100], Loss: 1.8706 - Val accuracy: 0.4640 - Epoch time: 42.30s
---242.85403394699097 seconds ---
Train acc: 0.4513575123494753
Epoch [8/100], L
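
The early-stopping rule used in the training loop can be isolated as a small function. This is a sketch with a hypothetical helper name (`early_stopping_epoch`) and made-up validation accuracies, meant only to show how `patience` interacts with the improvement counter.

```python
def early_stopping_epoch(val_metrics, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best_metric = float("-inf")
    n_no_improve = 0  # initialized before the loop so it is always defined
    for epoch, metric in enumerate(val_metrics, start=1):
        if metric > best_metric:
            best_metric = metric
            n_no_improve = 0      # improvement: reset the counter
        else:
            n_no_improve += 1     # no improvement this epoch
        if n_no_improve >= patience:
            return epoch
    return None

# Made-up validation accuracies, for illustration only.
toy_val_acc = [0.44, 0.45, 0.46, 0.46, 0.455, 0.459, 0.47]
print(early_stopping_epoch(toy_val_acc, patience=3))
print(early_stopping_epoch(toy_val_acc, patience=4))
```

Note that a tie with the best value does not count as an improvement, because the comparison is strict.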

We load the best model obtained during training:

In [45]:
best_model_char = NeuralLM(
    window_size = 3,
    embedding_dim = embedding_dim,
    hidden_dim = hidden_dim,
    vocab_size = vocab_char_len,
    dropout = dropout
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
best_model_char.load_state_dict(torch.load(savedir_char / "model_best.pt", map_location=device)["state_dict"])
best_model_char.train(False)

NeuralLM(
  (emb): Embedding(429, 100)
  (dense_1): Linear(in_features=300, out_features=200, bias=True)
  (drop1): Dropout(p=0.2, inplace=False)
  (dense_2): Linear(in_features=200, out_features=429, bias=False)
)

---

## 2.2) Text generation

Have the model generate text 3 times, with a maximum of 300 characters.

---

Examples:

In [102]:
np.random.seed(0)
sequence_1 = ["<s>", "<s>", "<s>"]
print("".join(generate_text(best_model_char, processor_char, seed=sequence_1, n_tokens=300, temperature=0.6, top_k=50)))

<s><s><s>Me de palar cuando que puta la gorda ahora que se a Madre la verga en femir esta es la ven a está te del amos amos ando seguiero para para de para y Por porque no chinga madre mi mamar pero que pero que estoy vuelven mis no su me esta esta y putas putos verga de no madre muy habe y la y por con loca


In [103]:
np.random.seed(0)
sequence_2 = ["h", "o", "l"]
print("".join(generate_text(best_model_char, processor_char, seed=sequence_2, n_tokens=300, temperature=0.6, top_k=50)))

hola con mueres verga cula ver mi a te las a estoy me duelva a la verga el  son la madre a tien a la la mejo si que en ser del cabron madre si de una haz te para que cribio que putos para por a las madre se la ponse es de vas ten en esta esta y putas putos verga de no madre muy habe y la y por con loca


In [104]:
np.random.seed(7)
sequence_3 = ["h", "i", "j"]
print("".join(generate_text(best_model_char, processor_char, seed=sequence_3, n_tokens=300, temperature=0.6, top_k=50)))

hijos lleva a con que sigues que está todo me de para de putos gusta fiera es que estar compezó no de te se va a mama gorda a chingue cuando cagar madre pero paner tienero putos putas de la vivida pones a el muchos a llos putos me hable todos tontas sona estoy pata se ten madre dicios por madre pre es 
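
The calls above pass `temperature=0.6` and `top_k=50` to `generate_text`. The following sketch shows one common way such a sampling step can work; `sample_top_k` is a hypothetical helper and is not the notebook's implementation, and the logits are made up.

```python
import math
import random

def sample_top_k(logits, temperature=0.6, top_k=50, rng=random):
    """Sample an index from the top_k largest logits after temperature scaling."""
    # Keep only the indices of the top_k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scaled softmax weights (shifted by the max for numerical stability).
    scaled = [logits[i] / temperature for i in top]
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for 5 symbols
samples = [sample_top_k(toy_logits, temperature=0.6, top_k=3, rng=rng) for _ in range(1000)]
print({i: samples.count(i) for i in range(len(toy_logits))})
```

A temperature below 1 sharpens the distribution toward the most likely symbols, and `top_k` removes the long tail entirely, which may explain why the generated text keeps returning to a few frequent words.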


---

## 2.3) Sentence likelihood

Write 5 example sentences and measure their likelihood.

---

Examples:

In [106]:
sentences = [
    list("por eso estamos como estamos"),
    list("hola buen dia como están"),
    list("hijos de la chingada como creen"),
    list("vas a ver hijo de tu"),
    list("maldito clima hace un chingo de calor")
]
for sent in sentences:
    result = log_likelihood(best_model_char, sent, processor_char)
    print("".join(sent))
    print("Log-verosimilitud:", result)
    print("-"*30)

por eso estamos como estamos
Log-verosimilitud: -51.607082
------------------------------
hola buen dia como están
Log-verosimilitud: -57.900665
------------------------------
hijos de la chingada como creen
Log-verosimilitud: -48.82113
------------------------------
vas a ver hijo de tu
Log-verosimilitud: -45.066654
------------------------------
maldito clima hace un chingo de calor
Log-verosimilitud: -66.77584
------------------------------
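
Because log-likelihood is a sum over characters, longer sentences tend to score lower. A rough way to compare sentences of different lengths is to divide by the number of characters, as sketched below with the values reported above. This normalization is an assumption: it uses the raw string length and ignores any boundary tokens that `log_likelihood` may add.

```python
# Log-likelihoods copied from the output above.
reported = {
    "por eso estamos como estamos": -51.607082,
    "hola buen dia como están": -57.900665,
    "hijos de la chingada como creen": -48.82113,
    "vas a ver hijo de tu": -45.066654,
    "maldito clima hace un chingo de calor": -66.77584,
}

# Average log-likelihood per character (raw string length, no boundary tokens).
per_char = {sent: ll / len(sent) for sent, ll in reported.items()}

for sent, value in sorted(per_char.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value:.3f}  {sent}")

best_per_char = max(per_char, key=per_char.get)
lowest_raw = min(reported, key=reported.get)
```

Under this normalization the longest sentence, which has the lowest raw log-likelihood, is no longer the worst-scoring one.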


---

## 2.4) Morphological structure

Write an example of morphological structure (permutations of characters), similar to the professor's syntactic-structure example, with 5 or more characters of your choice (e.g., "ando").

Given a word (a sequence of characters), we compute the log-likelihood of every permutation of its characters. We then show the 5 permutations with the highest log-likelihood, and therefore the highest likelihood, i.e., those with the most morphological structure. We also show the 5 permutations with the lowest likelihood.

In [79]:
sequence = list("saludos")

log_likelihood_perms = [(perm, log_likelihood(best_model_char, perm, processor_char)) for perm in permutations(sequence)]
log_likelihood_perms.sort(key=lambda x: x[1], reverse=True)

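Permutations with the highest log-likelihood (strings appear twice because "saludos" contains two "s" characters, which ``permutations`` treats as distinct):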

In [107]:
for perm, log_lh in log_likelihood_perms[:5]:
    print("".join(perm))
    print("Log-verosimilitud:", log_lh)
    print("-"*30)

saludos
Log-verosimilitud: -22.508583
------------------------------
saludos
Log-verosimilitud: -22.508583
------------------------------
saludso
Log-verosimilitud: -23.867504
------------------------------
saludso
Log-verosimilitud: -23.867504
------------------------------
lusados
Log-verosimilitud: -25.414679
------------------------------


Permutations with the lowest log-likelihood:

In [109]:
for perm, log_lh in log_likelihood_perms[-5:]:
    print("".join(perm))
    print("Log-verosimilitud:", log_lh)
    print("-"*30)

slsuaod
Log-verosimilitud: -67.943954
------------------------------
saodlsu
Log-verosimilitud: -69.886406
------------------------------
saodlsu
Log-verosimilitud: -69.886406
------------------------------
suaodls
Log-verosimilitud: -72.88739
------------------------------
suaodls
Log-verosimilitud: -72.88739
------------------------------
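
The repeated entries in both listings come from how `itertools.permutations` works: it permutes positions, not values. The short sketch below counts the orderings of "saludos" and shows how deduplicating with a set would halve the number of strings to score.

```python
from itertools import permutations

word = "saludos"

# permutations() treats the two "s" characters as different elements.
all_perms = ["".join(p) for p in permutations(word)]

# Deduplicate so each distinct string is kept once.
distinct_perms = sorted(set(all_perms))

print(len(all_perms))       # 7! orderings of the 7 positions
print(len(distinct_perms))  # half as many, since swapping the two "s" gives the same string
```

Scoring only `distinct_perms` would give the same ranking with half the model calls.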


---

## 2.5) Perplexity

Compute the perplexity of the model on the validation data.

---

Perplexity of the best character-level model on the validation set:

In [46]:
perplexity(best_model_char, val_tk_char, processor_char)

5.336686747862487

---
## 2.6) Discussion

---

When we generate text with this character-level language model, some of the generated words do exist, but the sentences are not coherent. The model also assigns a high likelihood to the word "saludos" and to permutations that move only one character, so that the word remains recognizable. This model reached a very low perplexity, but this is explained by the small character vocabulary: predicting the next character is a less complex task than predicting the next word. Since the word-level vocabulary (11,428 entries) is more than an order of magnitude larger than the character-level one (429 entries), it is reasonable to obtain a lower perplexity with the character-level model.
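
To put the three perplexities on a common footing, the sketch below converts each one to bits per token and compares it with the cost of a uniform guess over the corresponding vocabulary. The perplexities and vocabulary sizes are copied from the outputs above; assuming both word-level models share the 11,428-entry vocabulary is my reading of the notebook, since both are evaluated with the same `processor`.

```python
import math

# (perplexity, vocabulary size) taken from the notebook outputs.
results = {
    "word, pretrained embeddings": (602.0006596709721, 11428),
    "word, from scratch": (185.10026931719662, 11428),
    "character": (5.336686747862487, 429),
}

bits_per_token = {}
for name, (ppl, vocab_size) in results.items():
    bits_per_token[name] = math.log2(ppl)   # average bits needed per predicted token
    uniform_bits = math.log2(vocab_size)    # bits needed by a uniform guess
    print(f"{name}: {bits_per_token[name]:.2f} bits/token "
          f"vs {uniform_bits:.2f} bits for a uniform guess")
```

The per-token figures are still not directly comparable across levels, because one word spans several characters, but the gap to the uniform baseline shows how much each model has learned relative to its own vocabulary.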