# **Tarea 4 - Sequence to Sequence 📚**

**Procesamiento de Lenguaje Natural (CC6205-1 - Otoño 2024)**

## Tarjeta de identificación

**Nombres:**

```- Ignacio Albornoz```

```- Eduardo Silva```  

**Fecha límite de entrega 📆:** 10/07.

**Tiempo estimado de dedicación:** 4 horas


## Instrucciones

Bienvenid@s a la tercera tarea en el curso de Natural Language Processing (NLP). Esta tarea tiene como objetivo evaluar los contenidos teóricos de las últimas semanas de clases posteriores a la Tarea 3, enfocado en **Sequence-to-Sequence + Attention**. Si aún no has visto las clases, se recomienda visitar los links de las referencias.

* La tarea es en **grupo** (maximo hasta 3 personas).
* La entrega es a través de u-cursos a más tardar el día estipulado arriba.
* El formato de entrega es este mismo Jupyter Notebook.
* Al momento de la revisión su código será ejecutado. Por favor verifiquen que su entrega no tenga errores de compilación.
* Completar la tarjeta de identificación. Sin ella no podrá tener nota.
* Recomendamos mirar el enunciado completo con atención (*ba dum tss*) antes de empezar la tarea, para tener una idea más completa de lo que se pide.

## Material de referencia

Diapositivas del curso 📄
    
- [Sequence-to-Sequence + Attention](https://github.com/dccuchile/CC6205/blob/master/slides/NLP-seq2seq.pdf)
- [Transformer](https://github.com/dccuchile/CC6205/blob/master/slides/NLP-transformer.pdf)

Videos del curso 📺

- [Sequence-to-Sequence + Attention](https://www.youtube.com/watch?v=OpKxRjISqmM&list=PLppKo85eGXiXIh54H_qz48yHPHeNVJqBi&index=35)
- [Transformer](https://www.youtube.com/watch?v=8RE23Uq8rU0)

## Parte 1: Traducción automática con arquitectura Encoder-Decoder con RNNs
En esta sección crearemos nuestro propio traductor de español a inglés con la arquitecura Encoder-Decoder con RNNs + Attention vista en clases.

In [1]:
#%pip install torch
#%pip install numpy
#!pip install matplotlib
#%pip install scikit-learn


In [2]:
#%pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117


In [3]:
## Importamos librerías

from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

import numpy as np
from torch.utils.data import TensorDataset, DataLoader, RandomSampler

In [4]:
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))


True
NVIDIA GeForce GTX 1650 Ti


### P0. Preparación del dataset y tokenización

Utilizaremos un dataset de pares de oraciones en inglés y castellano.

In [5]:
#!wget https://www.manythings.org/anki/spa-eng.zip
#!unzip spa-eng.zip

Vamos a crear una clase que nos permita procesar mejor cada idioma del corpus. Esto nos será útil para manejar dos vocabularios distintos.

In [6]:
# Código base

SOS_token = 0
EOS_token = 1

class Lang:
  def __init__(self, name):
    self.name = name
    self.word2index = {}
    self.word2count = {}
    self.index2word = {0: "*", 1: "STOP"}
    self.n_tokens = 2  # * y STOP

  def add_sentence(self, sentence):
    for word in sentence.split(' '):
      self.add_word(word)

  def add_word(self, word):
    if word not in self.word2index:
      self.word2index[word] = self.n_tokens
      self.word2count[word] = 1
      self.index2word[self.n_tokens] = word
      self.n_tokens += 1
    else:
      self.word2count[word] += 1

Implemente acá funciones para leer, procesar y filtar el dataset según estime.

In [7]:
# Minúsculas, puntuación y remoción de caracteres que no son letras
# Puede añadir cualquier otro preprocesamiento que estime conveniente
def normalize_string(s):
  s = unicodedata.normalize('NFC', s) # Normalización de caracteres unicode
  s = s.lower().strip()
  s = re.sub(r"([.!?])", r" \1", s) # Regex para separar puntuación de las palabras
  s = re.sub(r"[^a-zA-Z!?áéíóúñ´]+", r" ", s) # Regex para excluir otros caracteres
  return s.strip()

# Recomendamos mantener sólo oraciones con ~10 palabras o menos
def filter_pairs(pairs, max_length):
  return [p for p in pairs if len(p[0].split(' ')) < max_length and \
                len(p[1].split(' ')) < max_length]

In [8]:
def read_dataset(path, reverse=False):
  lines = open(path, encoding='utf-8').\
    read().strip().split('\n')

  pairs = [[normalize_string(s) for s in l.split('\t')][:2] for l in lines]

  # Reverse pairs, make Lang instances
  if reverse:
    pairs = [list(reversed(p)) for p in pairs]
    input_lang = Lang("spa")
    output_lang = Lang("eng")
  else:
    input_lang = Lang("eng")
    output_lang = Lang("spa")

  return input_lang, output_lang, pairs

def read_langs(lang1, lang2, reverse=False, max_length=10):
  input_lang, output_lang, pairs = read_dataset("spa.txt", reverse)
  print(f"Total de oraciones en dataset: {len(pairs)}")
  pairs = filter_pairs(pairs, max_length)
  print(f"Reducido a: {len(pairs)}")
  for pair in pairs:
    input_lang.add_sentence(pair[0])
    output_lang.add_sentence(pair[1])
  print(f"Tamaño vocab {input_lang.name}: {input_lang.n_tokens}")
  print(f"Tamaño vocab {output_lang.name}: {output_lang.n_tokens}")
  return input_lang, output_lang, pairs

In [9]:
input_lang, output_lang, pairs = read_langs('eng', 'spa', reverse=True, max_length=10)
print(random.choice(pairs))

Total de oraciones en dataset: 141543
Reducido a: 119626
Tamaño vocab spa: 24306
Tamaño vocab eng: 12105
['solo estoy exponiendo mi punto de vista', 'i m just giving my perspective']


### P1. Encoder (1.2 pt.)
Implemente una red Encoder utilizando redes neuronales recurrentes.

In [10]:
# Garantizar reproducibilidad de los experimentos
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [11]:
# Definir dispositivo
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [12]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=2, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=num_layers, dropout=dropout_p)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input, hidden):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def init_hidden(self):
        return torch.zeros(self.num_layers, 1, self.hidden_size, device=device)



### P2. Attention Decoder (1.8 pt.)

Ahora diseñe un mecanismo de atención según estime conveniente y otra red que servirá de decoder con el modelo de attention. Utilice la predicción objetivo (en caso de existir) como siguiente input de cada oración para la etapa de entrenamiento (teacher forcing).

In [13]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers=2, dropout_p=0.1, max_length=10):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size, num_layers=num_layers, dropout=dropout_p)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def init_hidden(self):
        return torch.zeros(self.num_layers, 1, self.hidden_size, device=device)

### P3. Entrenamiento y evaluación (1 pt.)
Entrene su modelo Sequence-to-Sequence. Para esto entrene el encoder, decoder y attention en conjunto, es decir utilizando la misma función de loss para los parámetros de cada componente. Recuerde entregar las predicciones objetivo al decoder en cada iteración.

In [14]:
import math
import time
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

In [15]:
# Funciones auxiliares para cargar datos de entrenamiento

def sentence2indexes(lang, sentence):
  return [lang.word2index[word] for word in sentence.split(' ')]

def sentence2tensor(lang, sentence):
  indexes = sentence2indexes(lang, sentence)
  indexes.append(EOS_token)
  return torch.tensor(indexes, dtype=torch.long, device=device).view(1, -1)

def pair2tensors(pair):
  input_tensor = sentence2tensor(input_lang, pair[0])
  target_tensor = sentence2tensor(output_lang, pair[1])
  return (input_tensor, target_tensor)



In [16]:
class EarlyStopping:
    def __init__(self, patience=5, delta=0, path='checkpoint.pt'):
        self.patience = patience
        self.delta = delta
        self.path = path
        self.best_loss = None
        self.counter = 0
        self.early_stop = False

    def __call__(self, loss, encoder, decoder, encoder_optimizer, decoder_optimizer):
        if self.best_loss is None:
            self.best_loss = loss
            self.save_checkpoint(encoder, decoder, encoder_optimizer, decoder_optimizer)
        elif loss > self.best_loss - self.delta:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_loss = loss
            self.save_checkpoint(encoder, decoder, encoder_optimizer, decoder_optimizer)
            self.counter = 0

    def save_checkpoint(self, encoder, decoder, encoder_optimizer, decoder_optimizer):
        '''Guarda el modelo cuando la pérdida disminuye.'''
        torch.save({
            'encoder_state_dict': encoder.state_dict(),
            'decoder_state_dict': decoder.state_dict(),
            'encoder_optimizer_state_dict': encoder_optimizer.state_dict(),
            'decoder_optimizer_state_dict': decoder_optimizer.state_dict(),
        }, self.path)
        print(f'Modelo guardado con pérdida de: {self.best_loss:.4f}')


In [17]:
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=10):
    encoder_hidden = encoder.init_hidden()

    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_tensor.size(1)
    target_length = target_tensor.size(1)

    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    for ei in range(input_length):
        input_step = input_tensor[:, ei]
        encoder_output, encoder_hidden = encoder(input_step, encoder_hidden)
        encoder_outputs[ei] = encoder_output[0, 0]
        # Print the encoder output for debugging
        #print(f"Encoder output at step {ei}: {encoder_output}")

    decoder_input = torch.tensor([[SOS_token]], device=device)
    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < 0.5 else False

    if use_teacher_forcing:
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)
            loss += criterion(decoder_output, target_tensor[:, di])
            decoder_input = target_tensor[:, di]  # Teacher forcing
            # Print the decoder output for debugging
            #print(f"Decoder output at step {di} (teacher forcing): {decoder_output}")

    else:
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # No teacher forcing
            loss += criterion(decoder_output, target_tensor[:, di])
             # Print the decoder output for debugging
            #print(f"Decoder output at step {di} (no teacher forcing): {decoder_output}")
            if decoder_input.item() == EOS_token:
                break

    loss.backward()

    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length


def train_iters(encoder, decoder, n_iters, train_pairs, val_pairs, print_every=1, learning_rate=0.01, patience=5):
    max_length = 10  # Asegúrate de definir max_length en el código
    print_loss_total = 0
    early_stopping = EarlyStopping(patience=patience, path='best_model.pt')

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate, weight_decay=1e-4)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate, weight_decay=1e-4)
    training_pairs = [pair2tensors(random.choice(train_pairs)) for _ in range(n_iters)]
    validation_pairs = [pair2tensors(pair) for pair in val_pairs]
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        loss = train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print(f"Iteración: {iter}, Pérdida promedio (entrenamiento): {print_loss_avg:.4f}")

            # Validación
            val_loss_total = 0
            for val_pair in validation_pairs:
                val_input_tensor, val_target_tensor = val_pair
                encoder_hidden = encoder.init_hidden()
                input_length = val_input_tensor.size(1)
                target_length = val_target_tensor.size(1)
                encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

                #print(f"Validation input tensor shape: {val_input_tensor.shape}")
                #print(f"Validation target tensor shape: {val_target_tensor.shape}")

                with torch.no_grad():
                    for ei in range(input_length):
                        val_input_step = val_input_tensor[:, ei]
                        encoder_output, encoder_hidden = encoder(val_input_step, encoder_hidden)
                        encoder_outputs[ei] = encoder_output[0, 0]

                    decoder_input = torch.tensor([[SOS_token]], device=device)
                    decoder_hidden = encoder_hidden

                    for di in range(target_length):
                        decoder_output, decoder_hidden, _ = decoder(decoder_input, decoder_hidden, encoder_outputs)
                        val_loss_total += criterion(decoder_output, val_target_tensor[:, di]).item()
                        if decoder_output.topk(1)[1].item() == EOS_token:
                            break
                        decoder_input = val_target_tensor[:, di]

            val_loss_avg = val_loss_total / len(validation_pairs)
            print(f"Iteración: {iter}, Pérdida promedio (validación): {val_loss_avg:.4f}")

            # Pasar la pérdida promedio de validación a EarlyStopping en cada iteración
            early_stopping(val_loss_avg, encoder, decoder, encoder_optimizer, decoder_optimizer)
            
            if early_stopping.early_stop:
                print("Early stopping triggered. Stopping training.")
                break


In [18]:
from sklearn.model_selection import train_test_split

# Separar el conjunto de datos en entrenamiento, validación y prueba
train_pairs, temp_pairs = train_test_split(pairs, test_size=0.2, random_state=SEED)
val_pairs, test_pairs = train_test_split(temp_pairs, test_size=0.5, random_state=SEED)



In [19]:
hidden_size = 512  # Aumentar el tamaño de las capas
num_layers = 3     # Aumentar el número de capas
input_size = input_lang.n_tokens
output_size = output_lang.n_tokens

encoder = EncoderRNN(input_size, hidden_size, num_layers=num_layers, dropout_p=0.1).to(device)
decoder = AttnDecoderRNN(hidden_size, output_size, num_layers=num_layers, dropout_p=0.1).to(device)

n_iters = 100
train_iters(encoder, decoder, n_iters, train_pairs, val_pairs, print_every=1, learning_rate=0.01, patience=5)


Encoder output at step 0: tensor([[[ 0.0152,  0.0225, -0.0404,  0.0070, -0.0259, -0.0246, -0.0304,
          -0.0054, -0.0554,  0.0148,  0.0200,  0.0433,  0.0143, -0.0010,
           0.0184,  0.0014, -0.0285,  0.0102,  0.0465,  0.0035,  0.0121,
           0.0033, -0.0109,  0.0005, -0.0068, -0.0630, -0.0107, -0.0255,
           0.0230, -0.0397, -0.0463,  0.0393,  0.0433, -0.0560,  0.0071,
          -0.0047,  0.0215, -0.0170,  0.0172,  0.0139,  0.0251,  0.0230,
          -0.0093, -0.0319,  0.0415, -0.0283, -0.0352, -0.0160,  0.0094,
          -0.0077,  0.0117,  0.0330,  0.0057, -0.0456,  0.0358, -0.0004,
          -0.0160, -0.0356,  0.0080, -0.0292, -0.0191,  0.0191, -0.0042,
          -0.0070, -0.0262,  0.0252, -0.0507,  0.0339,  0.0386, -0.0061,
          -0.0180, -0.0564, -0.0034,  0.0166, -0.0027, -0.0186,  0.0473,
          -0.0415,  0.0216, -0.0412, -0.0173, -0.0472, -0.0101,  0.0046,
           0.0048, -0.0489,  0.0258, -0.0549, -0.0240, -0.0267, -0.0225,
          -0.0053,  0.008

Comente sus resultados. ¿Cómo evoluciona la loss a medida que aumenta el número de epochs?

```
Comentar aquí.
```

Adapte las siguientes funciones para traducir oraciones con su modelo. Pruebe su traductor con algunas oraciones aleatorias del corpus.

In [None]:
def translate(encoder, decoder, sentence, input_lang, output_lang, max_length=10):
    with torch.no_grad():
        input_tensor = sentence2tensor(input_lang, sentence)
        input_length = input_tensor.size(1)  # Cambiar a .size(1) para la longitud correcta
        
        encoder_hidden = encoder.init_hidden()
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            input_step = input_tensor[0, ei].unsqueeze(0)  # Asegurarse de que la dimensión sea correcta
            encoder_output, encoder_hidden = encoder(input_step, encoder_hidden)
            encoder_outputs[ei] = encoder_output[0, 0]
        
        decoder_input = torch.tensor([[SOS_token]], device=device)
        decoder_hidden = encoder_hidden
        
        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)
        
        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('STOP')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])
            
            decoder_input = topi.squeeze().detach()
        
        return decoded_words, decoder_attentions[:di + 1]





In [None]:
# Cargar el mejor modelo guardado
checkpoint = torch.load('best_model.pt')
encoder.load_state_dict(checkpoint['encoder_state_dict'])
decoder.load_state_dict(checkpoint['decoder_state_dict'])
encoder.eval()
decoder.eval()

def evaluate_randomly(encoder, decoder, dataset, n=10):
    for i in range(n):
        pair = random.choice(dataset)
        print('Input:', pair[0])
        print('Traducción:', pair[1])
        output_words, _ = translate(encoder, decoder, pair[0], input_lang, output_lang)
        output_sentence = ' '.join(output_words)
        print('Predicción:', output_sentence)
        print('')

evaluate_randomly(encoder, decoder, test_pairs, 10)



Input: no estamos seguros
Traducción: we aren t sure
Predicción: STOP

Input: me está empezando a entrar hambre
Traducción: i m starting to get hungry
Predicción: STOP

Input: el gato está ronroneando
Traducción: the cat is purring
Predicción: STOP

Input: nada puede justificarlo por su comportamiento tan grosero
Traducción: nothing can excuse him for such rude behavior
Predicción: STOP

Input: no tengo nada de hambre
Traducción: i don t feel hungry at all
Predicción: STOP

Input: te debo una disculpa
Traducción: i owe you an apology
Predicción: STOP

Input: sacaré a pasear a mi perro
Traducción: i ll take my dog out for a walk
Predicción: STOP

Input: no hace falta que te vayas inmediatamente
Traducción: you don t have to leave right now
Predicción: STOP

Input: espero que esté bien
Traducción: i hope he ll be ok
Predicción: STOP

Input: yo sé cuando alguien me está mintiendo
Traducción: i know when someone s lying to me
Predicción: STOP



Comente sus resultados. ¿Qué ocurre con las cuando la traducción predicha es válida pero no igual al ground truth? ¿Qué haría para abordar este problema?

```
Comentar aquí.
```

### P4. Visualizando Attention (0.5 pt.)

Ahora visualizaremos los pesos de atención asignados entre las palabras traducidas. Para esto, les entregamos funciones para visualizar los pesos de atención entregados por su decoder a partir de una oración. Adapte el código al output de su modelo de ser necesario.

In [None]:
import matplotlib.pyplot as plt
plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np

In [None]:
%matplotlib inline

In [None]:
def plot_attention(input_sentence, output_words, attentions):
  fig = plt.figure()
  ax = fig.add_subplot(111)
  cax = ax.matshow(attentions.cpu().numpy(), cmap='bone')
  fig.colorbar(cax)

  # Set up axes
  ax.xaxis.set_ticks(attentions)
  ax.set_xticklabels([''] + input_sentence.split(' ') +
             ['STOP'], rotation=90)
  ax.set_yticklabels([''] + output_words)

  # Show label at every tick
  ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
  ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

  plt.show()


def show_attention(input_sentence):
  output_words, attentions = translate(encoder, decoder, input_sentence, input_lang, output_lang)
  print('input =', input_sentence)
  print('output =', ' '.join(output_words))
  plot_attention(input_sentence, output_words, attentions[0, :len(output_words), :])

Grafique la atención de las siguientes oraciones de ejemplo. Haga lo mismo con al menos tres oraciones más que puedan ser interesantes.

In [None]:
## Oraciones de ejemplo

show_attention('tom necesita un poco de ayuda')

show_attention('el perro corre rápidamente')

show_attention('el banco le ofreció un alto interés')

show_attention('él toca la flauta el clarinete y el saxofón')

input = tom necesita un poco de ayuda
output = STOP


IndexError: too many indices for tensor of dimension 2

In [None]:
## Oraciones de ejemplo

show_attention('última tarea del ramo !')

show_attention('')

show_attention('')

Comente sus resultados. ¿Eran lo que esperaba?
```
Comentar aquí.
```

## Parte 2: BERT

Lo primero es instalar las librerías necesarias.

In [None]:
%%capture
!pip install transformers
from transformers import BertTokenizer, BertForNextSentencePrediction, BertForMaskedLM, BertForQuestionAnswering
import torch

Para las preguntas que siguen, utilizaremos distintas variantes de BERT disponibles en la librería transformers. [Aquí](https://huggingface.co/transformers/model_doc/bert.html) pueden encontrar toda la documentación necesaria. El modelo pre-entrenado a utilizar es "bert-base-uncased" (salvo para question answering).

BERT es un modelo de lenguaje que fue entrenado exhaustivamente sobre dos tareas: 1) Next sentence prediction. 2) Masked language modeling.

### **BertForNextSentencePrediction** (0.5 pt.)

**Pregunta 1:**  Utilizando el modelo BertForNextSentencePrediction de la librería transformers, muestre cual de las 2 oraciones es **más probable** que sea una continuación de la primera. Para esto defina la función $oracion\_mas\_probable$, que recibe el inicio de una frase, las alternativas para continuar esta frase y retorna un string indicando cual de las dos oraciones es más probable.

Por ejemplo:

Initial: "The sky is blue."\
A: "This is due to the shorter wavelength of blue light."\
B: "Chile is one of the world's greatest economies."

Debería retornar "La oración que continúa más probable es A", justificándolo con la evaluación de BERT.



In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

In [None]:
def oracion_mas_probable(first,sentA,sentB):
  #Tu implementacion
  encodingA = tokenizer(first, sentA, return_tensors='pt')
  encodingB = tokenizer(first, sentB, return_tensors='pt')
  outputA = model(**encodingA, labels=torch.LongTensor([1]))
  outputB = model(**encodingB, labels=torch.LongTensor([1]))
  logitsA = outputA.logits
  logitsB = outputB.logits

  return

1.1)
Initial: "My cat is fluffy."\
A: "My dog has a curling tail."\
B: "A song can make or ruin a person’s day if they let it get to them."

1.2)
Initial: "The Big Apple is famous worldwide."\
A: "You can add cinnamon for the perfect combination."\
B: "It is America's largest city."

1.3)
Initial: "Roses are red."\
A: "Violets are blue."\
B: "Fertilize them regularly for impressive flowers."

1.4)
Initial: "I play videogames the whole day."\
A: "They make me happy."\
B: "They make me rage."\

### **BertForMaskedLM** (0.5 pt.)

**Pregunta 2:**  Ahora utilizaremos BertForMaskedLM para **predecir una palabra oculta** en una oración.\
Por ejemplo:\
BERT input: "I want to _ a new car."\
BERT prediction: "buy"

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

In [None]:
def palabra_mas_probable(sentence):
  #Tu implementacion
  tokenized_text = tokenizer.tokenize(sentence)
  masked_index = tokenized_text.index('[MASK]')
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  tokens_tensor = torch.tensor([indexed_tokens])

  segments_tensors = None

  predictions = model(tokens_tensor, segments_tensors)
  predicted_index = None
  predicted_token = None
  pass

2.1)
BERT input: "[CLS] I love [MASK] . [SEP]"

In [None]:
sent = "[CLS] I love [MASK] . [SEP]"
palabra_mas_probable(sent)

2.2)
BERT input: "[CLS] I hear that Karen is very [MASK] . [SEP]"

In [None]:
sent = "[CLS] I heard that Karen is very [MASK] . [SEP]"
palabra_mas_probable(sent)

2.3)
BERT input: "[CLS] She had the gift of being able to [MASK] . [SEP]"

In [None]:
sent = "[CLS] She had the gift of being able to [MASK] . [SEP]"
palabra_mas_probable(sent)

2.4)
BERT input: "[CLS] It's not often you find a [MASK] on the street. [SEP]"

In [None]:
sent = "[CLS] It's not often you find an [MASK] on the circus . [SEP]"
palabra_mas_probable(sent)

### **BertForQuestionAnswering** (0.5 pt.)

**Pregunta 3**  Utilizando el modelo BertForQuestionAnswering pre-entrenado con 'bert-large-uncased-whole-word-masking-finetuned-squad', **extraiga la respuesta** a cada una de las siguientes 4 preguntas y su contexto. Recuerde cambiar el tokenizer para que coincida con el modelo.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/443 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

In [None]:
def entregar_respuesta(qst, cntxt):
  #Tu implementacion
  inputs = tokenizer(qst, cntxt, return_tensors='pt')
  start_positions = torch.tensor([1])
  end_positions = torch.tensor([3])

  outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
  start_scores = None
  end_scores = None
  answer = None
  return answer

3.1)

Pregunta: "When was the Battle of Iquique?"

Contexto: "The Battle of Iquique was a naval engagement that occurred between a Chilean corvette under the command of Arturo Prat and a Peruvian ironclad under the command of Miguel Grau Seminario on 21 May 1879, during the naval stage of the War of the Pacific, and resulted in a Peruvian victory."

In [None]:
q = "When was the Battle of Iquique?"
c = "The Battle of Iquique was a naval engagement that occurred between a Chilean corvette under the command of Arturo Prat and a Peruvian ironclad under the command of Miguel Grau Seminario on 21 May 1879, during the naval stage of the War of the Pacific, and resulted in a Peruvian victory."
entregar_respuesta(q, c)

3.2)

Pregunta: "Who won the Battle of Iquique?"

Contexto: "The Battle of Iquique was a naval engagement that occurred between a Chilean corvette under the command of Arturo Prat and a Peruvian ironclad under the command of Miguel Grau Seminario on 21 May 1879, during the naval stage of the War of the Pacific, and resulted in a Peruvian victory."

In [None]:
q = "Who won the Battle of Iquique?"
c = "The Battle of Iquique was a naval engagement that occurred between a Chilean corvette under the command of Arturo Prat and a Peruvian ironclad under the command of Miguel Grau Seminario on 21 May 1879, during the naval stage of the War of the Pacific, and resulted in a Peruvian victory."
entregar_respuesta(q, c)

3.3)

Pregunta: "Who introduced peephole connections to LSTM networks?"
Contexto: "What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them. One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state."

In [None]:
q = "Who introduced peephole connections to LSTM networks?"
c = "What I’ve described so far is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, it seems like almost every paper involving LSTMs uses a slightly different version. The differences are minor, but it’s worth mentioning some of them. One popular LSTM variant, introduced by Gers & Schmidhuber (2000), is adding “peephole connections.” This means that we let the gate layers look at the cell state."
entregar_respuesta(q, c)

3.4)

Pregunta: "When is the cat most active?"

Contexto: "The cat is similar in anatomy to the other felid species: it has a strong flexible body, quick reflexes, sharp teeth and retractable claws adapted to killing small prey. Its night vision and sense of smell are well developed. Cat communication includes vocalizations like meowing, purring, trilling, hissing, growling and grunting as well as cat-specific body language. It is a solitary hunter but a social species. It can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small mammals. It is a predator that is most active at dawn and dusk. It secretes and perceives pheromones."

In [None]:
q = "When is the cat most active?"
c = "The cat is similar in anatomy to the other felid species: it has a strong flexible body, quick reflexes, sharp teeth and retractable claws adapted to killing small prey. Its night vision and sense of smell are well developed. Cat communication includes vocalizations like meowing, purring, trilling, hissing, growling and grunting as well as cat-specific body language. It is a solitary hunter but a social species. It can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small mammals. It is a predator that is most active at dawn and dusk. It secretes and perceives pheromones."
entregar_respuesta(q, c)