# Homework 2
# Recurrent Neural Networks


The second homework consists of two parts:

*   Task A. Named Entity Recognition
*   Task B. Summarization

To do this homework, make a copy of this notebook (`File` -> `Save a copy in Drive`) and work on it. Alternatively, download the notebook if you want to work on it locally (`File` -> `Download` -> `Download .ipynb`).

**Note:** The model is intended to be trained on a GPU; training on a CPU will take considerably more time. To use a GPU in Colab, go to `Runtime` -> `Change runtime type` and choose GPU as the hardware accelerator. Be aware that GPU time in Colab is limited to 3-4 hours, so it is advisable to develop the model on CPU (`None` as hardware accelerator) and switch the hardware accelerator to GPU when you are ready to train your model.

In [5]:
!pip3 install --quiet torchtext datasets torchinfo torchdata evaluate seqeval

## Task A. Named Entity Recognition [6 Points]

In this task, we will work on Named Entity Recognition (NER). We will concentrate on four types of named entities: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups. As a dataset, we will use the [conll2003](https://huggingface.co/datasets/conll2003) dataset. The dataset has already been uploaded for you. The dataset contains the following columns:
* `id`: an identifier.
* `tokens`: a list of tokens.
* `tags`: a list of classification labels (NER tags)


In [6]:
from datasets import load_dataset

dataset = load_dataset("conll2003")

old_indices = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

def org_net_tag(data):
    # Map the integer NER labels of every sentence in the batch to their string tags
    ner_tags = [[old_indices[tag] for tag in ner] for ner in data['ner_tags']]
    data['tags'] = ner_tags
    return data

dataset = dataset.map(org_net_tag, batched=True, remove_columns=["pos_tags", "chunk_tags", "ner_tags"])

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'tags'],
        num_rows: 3453
    })
})

In [8]:
dataset['train'][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'tags': ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']}
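
The tags in the output above use `B-`/`I-`/`O` prefixes: `B-` marks the first token of an entity, `I-` a continuation of it, and `O` a token outside any entity. As a minimal sketch (the helper name `extract_entities` is hypothetical and not part of the assignment), the entity spans of the first training example could be recovered like this:

```python
def extract_entities(tokens, tags):
    """Group tokens into (entity_type, text) spans from B-/I-/O tags."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts here; flush the previous one if any
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the current entity
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
entities = extract_entities(tokens, tags)
print(entities)
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```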

### Task 1. Preparing the dataset [1 point]

Since the sentences are already tokenized, we can directly create the vocabularies, process the datasets, and create mini-batches.


**a. Vocabulary**

You need to create 2 vocabularies: one for tokens and one for tags:
*   Create the tokens vocabulary from the tokens in the training data. Add the `<pad>` and `<unk>` special tokens, and make the index of `<unk>` the default index of the tokens vocabulary.
*   Create NER tags vocabulary by using tags in the training data. Add `<pad>` special token.
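
To make the role of the special tokens and the default index concrete, here is a minimal pure-Python sketch of what such a vocabulary does. The class `SimpleVocab` is a hypothetical stand-in written for illustration, not the `torchtext` implementation; tokens that were never seen fall back to the index of `<unk>`.

```python
from collections import Counter


class SimpleVocab:
    """Toy vocabulary: special tokens first, then tokens in first-seen order."""

    def __init__(self, counter, specials=()):
        self.itos = list(specials) + [tok for tok in counter if tok not in specials]
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}
        self.default_index = None

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            raise KeyError(token)
        return self.default_index

    def lookup_indices(self, tokens):
        return [self[tok] for tok in tokens]


counter = Counter(["EU", "rejects", "German", "call", "EU"])
toy_vocab = SimpleVocab(counter, specials=("<unk>", "<pad>"))
toy_vocab.set_default_index(toy_vocab["<unk>"])

ids = toy_vocab.lookup_indices(["EU", "rejects", "Brussels"])
print(ids)  # [2, 3, 0]: "Brussels" is out of vocabulary and maps to <unk>
```

In this toy class, looking up an unseen token without a default index raises an error. The tags vocabulary can presumably do without a default index because its label set is closed.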


In [9]:
from torchtext.vocab import vocab
from collections import Counter

token_counter = Counter()
tag_counter = Counter()


for doc in dataset['train']:
    token_counter.update(doc['tokens'])
    tag_counter.update(doc['tags']) 
    
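# Special tokens come first, so '<unk>' gets index 0 and '<pad>' gets index 1
token_vocab = vocab(token_counter, specials=('<unk>', '<pad>'))
token_vocab.set_default_index(token_vocab['<unk>'])

# The tag vocabulary has no '<unk>' token and no default index; '<pad>' gets index 0
tag_vocab = vocab(tag_counter, specials=('<pad>',))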


**b. Process the datasets**

For each dataset split:
*   Convert the tokens of the dataset into token ids using the tokens vocabulary.
*   Convert the tags of the dataset into tag ids using the NER tags vocabulary.
*   Convert the results (both token ids and tag ids) into PyTorch tensors.

In [10]:
import torch

def convert_to_ids(sentences, vocab):
    return [torch.tensor(vocab.lookup_indices(sentence)) for sentence in sentences]


tensor_dataset = {}
for split in dataset:
    split_dataset = dataset[split]
    token_ids = convert_to_ids(split_dataset['tokens'], token_vocab)

    tag_ids = convert_to_ids(split_dataset['tags'], tag_vocab)

    tensor_dataset[split] = {}
    tensor_dataset[split]['token_tensors'] = token_ids
    tensor_dataset[split]['tag_tensors'] = tag_ids
    


**c. Automatic batching**

In order to automatically batch each dataset split:
*   Use `DataLoader` for automatic batching
*   In collate function, return
    *   padded tokens
    *   padded tags
    *   length of sentences
    *   *Note: If you will use GPU for training, send the outputs of collate function to GPU (device).*
*   Decide on batch size (e.g. 8, 16, 32, or any 2^n) and initialize DataLoader for each dataset split. If you run out of memory, change the batch size to a lower value.
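
As a rough illustration of what the collate function has to produce, the following plain-Python sketch pads a batch of id lists to the length of the longest one and records the true lengths. The helper `pad_batch` and the ids are made up for this example; in the notebook, `pad_sequence` with `batch_first=True` plays this role on tensors.

```python
def pad_batch(sequences, pad_value):
    """Pad variable-length id lists to the longest one and report true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


batch = [[5, 8, 2], [7], [4, 9]]
padded, lengths = pad_batch(batch, pad_value=1)
print(padded)   # [[5, 8, 2], [7, 1, 1], [4, 9, 1]]
print(lengths)  # [3, 1, 2]
```

The lengths are what later allows the model and the evaluation code to ignore the padded positions.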

In [12]:
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class NERDataset(Dataset):
    def __init__(self, sentences, tags, token_vocab, tag_vocab):
        self.sentences = sentences
        self.tags = tags
        self.token_vocab = token_vocab
        self.tag_vocab = tag_vocab

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        return self.sentences[idx], self.tags[idx]
    
def collate_fn(batch):
    # Separate the tokens and tags and sort by length (descending)
    batch.sort(key=lambda x: len(x[0]), reverse=True)
    tokens, tags = zip(*batch)
    
    # Get lengths of each sequence before padding
    lengths = [len(seq) for seq in tokens]

    # Pad the sequences
    tokens_padded = pad_sequence(tokens, batch_first=True, padding_value=token_vocab['<pad>'])
    tags_padded = pad_sequence(tags, batch_first=True, padding_value=tag_vocab['<pad>'])
    
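    # .to() is not in-place, so the result must be reassigned
    tokens_padded = tokens_padded.to(device)
    tags_padded = tags_padded.to(device)

    # Lengths stay on the CPU, as required by pack_padded_sequence
    return tokens_padded, tags_padded, torch.tensor(lengths)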

ner_dataset_train = NERDataset(tensor_dataset['train']['token_tensors'], 
                         tensor_dataset['train']['tag_tensors'], 
                         token_vocab, tag_vocab)
ner_dataset_valid = NERDataset(tensor_dataset['validation']['token_tensors'], 
                         tensor_dataset['validation']['tag_tensors'], 
                         token_vocab, tag_vocab)
ner_dataset_test = NERDataset(tensor_dataset['test']['token_tensors'], 
                         tensor_dataset['test']['tag_tensors'], 
                         token_vocab, tag_vocab)

ner_dataloader_train = DataLoader(ner_dataset_train, batch_size=64, collate_fn=collate_fn)
ner_dataloader_valid = DataLoader(ner_dataset_valid, batch_size=64, collate_fn=collate_fn)
ner_dataloader_test = DataLoader(ner_dataset_test, batch_size=64, collate_fn=collate_fn)

In [13]:
device

device(type='cuda')

### Task 2. RNN model [1.5 points]

Here, you will need to create an RNN model class. It takes the tokens and the lengths of the sentences as input and produces, for each token in the sentence, a score for every NER tag.

In the model, define  
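*  **Embedding layer** with the number of embeddings equal to the size of the tokens vocabulary and the embedding size equal to `emb_dim`. Also, specify `padding_idx` as the index of the `<pad>` token in the tokens vocabulary (`0` if `<pad>` is the first special token).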
*   **Bidirectional LSTM layer**. It should have the hidden size equal to `hid_dim` and the number of layers equal to `n_layers`. Also, specify `batch_first=True`.
*   **Classification layer.** Decide on the structure of the classification layer. You can use a single Linear layer, or multiple Linear layers with ReLU and/or Dropout in between. If you use several layers for classification, you can store them in `nn.Sequential` for easy use in the forward method. The input size of the classification layer should be twice `hid_dim` because we are using a bidirectional LSTM, and the output size should be the size of the NER tags vocabulary.
*   **Dropout layer** with the dropout probability of `0.5`.

In the forward method:

*   Encode the input `tokens` with the embedding layer.
*   Apply dropout on the embeddings.
*   Pack the embedded inputs with [pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html) function. Make sure to specify `batch_first=True` and `enforce_sorted=False`.
*   Pass the packed inputs to the LSTM layer.
*   Get the output features from the last layer of the LSTM and pass it to [pad_packed_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html). Make sure to specify `batch_first=True`. *Note: `pad_packed_sequence` is inverse operation to `pack_padded_sequence`.*
*   Pass padded sequence from `pad_packed_sequence` to classification layer
*   Return prediction from the classification layer

*Note: Refer to the [`nn.LSTM`](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html?highlight=lstm#torch.nn.LSTM) documentation to get more information about the parameters, the inputs, and outputs.*
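
The following sketch walks through the shapes involved in the pack, LSTM, and unpack steps described above. All sizes and token ids are toy values chosen for illustration; it is not the model required by the task.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Toy sizes chosen only for illustration
vocab_size, emb_dim, hid_dim, pad_idx = 20, 8, 16, 0

embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=1, batch_first=True, bidirectional=True)

# Two sentences of true lengths 4 and 2, padded to length 4
tokens = torch.tensor([[3, 7, 5, 9],
                       [4, 6, pad_idx, pad_idx]])
lengths = torch.tensor([4, 2])  # must be a CPU tensor

embedded = embedding(tokens)  # (batch, seq_len, emb_dim)
packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(output.shape)               # torch.Size([2, 4, 32]): 2 * hid_dim features per token
print(out_lengths)                # tensor([4, 2])
print(output[1, 2:].abs().sum())  # tensor(0.): padded positions are filled with zeros
```

Because the LSTM never sees the padded positions, its outputs there are just the fill value of `pad_packed_sequence` (zero by default).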

In [14]:
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class BiLSTMNERTagger(nn.Module):
    def __init__(self, emb_dim, hid_dim, n_layers, token_vocab_size, tag_vocab_size):
        super().__init__()
        self.emb = nn.Embedding(token_vocab_size, emb_dim, padding_idx=1)
        self.dropout = nn.Dropout(0.5)
        self.lstm_bidirectional = nn.LSTM(input_size=emb_dim, num_layers=n_layers, 
                                          batch_first=True, hidden_size=hid_dim, 
                                          bidirectional=True)
        self.classification = nn.Sequential(
           nn.Linear(hid_dim*2, hid_dim*2),
           nn.ReLU(),
           nn.Linear(hid_dim*2, tag_vocab_size)
        )


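    def forward(self, words, words_len):
        emb = self.emb(words)
        emb = self.dropout(emb)
        # lengths must be a CPU tensor (or a list) for pack_padded_sequence
        packed = pack_padded_sequence(emb, lengths=words_len, batch_first=True, enforce_sorted=False)
        lstm_output, (hn, cn) = self.lstm_bidirectional(packed)
        # padding_value here fills LSTM *features*, not token ids, so keep the
        # default of 0.0; padded positions are excluded from the loss anyway
        seq_unpacked, lens_unpacked = pad_packed_sequence(lstm_output, batch_first=True)
        prediction = self.classification(seq_unpacked)

        return prediction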

### Task 3. Training [2 points]
**a. Initialize the model:**

*   Calculate token and tag vocabulary sizes
*   Decide on the number of layers (`n_layers`), the embedding size (`emb_dim`), and the hidden dimension (`hid_dim`).
*   Initialize the model by passing arguments to the model. Send the model to GPU if you are using one.


In [15]:
token_vocab_size = len(token_vocab)
tag_vocab_size = len(tag_vocab)

model = BiLSTMNERTagger(64, 128, 3, token_vocab_size, tag_vocab_size)
model.to(device)

BiLSTMNERTagger(
  (emb): Embedding(23625, 64, padding_idx=1)
  (dropout): Dropout(p=0.5, inplace=False)
  (lstm_bidirectional): LSTM(64, 128, num_layers=3, batch_first=True, bidirectional=True)
  (classification): Sequential(
    (0): Linear(in_features=256, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=10, bias=True)
  )
)

**b.**

**Loss function(criterion)**: use [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html?highlight=crossentropy#torch.nn.CrossEntropyLoss) and set `ignore_index` parameter to ignore the padding.

**Optimizer**: use [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer
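
To see what `ignore_index` does, the sketch below compares the loss computed on a padded batch with the loss computed by hand on the non-padded positions only. The padding index `0`, the number of tags, and the random logits are assumptions made for this toy example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

pad_tag_idx = 0  # assumed index of <pad> in the tag vocabulary
num_tags = 4     # toy number of tags, including <pad>

logits = torch.randn(2, 3, num_tags)  # (batch, seq_len, num_tags)
targets = torch.tensor([[1, 2, 3],
                        [2, pad_tag_idx, pad_tag_idx]])

flat_logits = logits.view(-1, num_tags)
flat_targets = targets.view(-1)

criterion = nn.CrossEntropyLoss(ignore_index=pad_tag_idx)
loss_with_padding = criterion(flat_logits, flat_targets)

# Same loss computed on the non-padded positions only
mask = flat_targets != pad_tag_idx
loss_without_padding = nn.CrossEntropyLoss()(flat_logits[mask], flat_targets[mask])

print(torch.allclose(loss_with_padding, loss_without_padding))  # True
```

With the default mean reduction, the ignored positions contribute neither to the sum nor to the count, which is why both values agree.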


In [16]:
from torch.optim import Adam
criterion = nn.CrossEntropyLoss(ignore_index=tag_vocab['<pad>'])
optimizer = Adam(model.parameters())


**c. Training:**

*   Decide on the number of epochs you want to train.
  *  *Note: You can estimate the total time of the training loop by running the training loop for one epoch and multiplying the time with the chosen number of epochs.*
  *  *Note: One epoch should not take more than 2-3 minutes on a GPU.*
*   Implement the training loop
   *   Report training and validation loss for each epoch.
   *   If validation loss is decreased, save the model. (You can use `torch.save(model, file_path)` )
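
The time estimate mentioned in the first note can be sketched as follows. The helper `estimate_total_time` and the placeholder `fake_epoch` are hypothetical; in the notebook, the timed function would be one pass over the training and validation data loaders.

```python
import time


def estimate_total_time(run_one_epoch, num_epochs):
    """Time a single epoch and extrapolate to the full run (in seconds)."""
    start = time.perf_counter()
    run_one_epoch()
    epoch_seconds = time.perf_counter() - start
    return epoch_seconds, epoch_seconds * num_epochs


def fake_epoch():
    # Placeholder workload standing in for one real training epoch
    time.sleep(0.01)


epoch_seconds, total_seconds = estimate_total_time(fake_epoch, num_epochs=20)
print(f"one epoch: {epoch_seconds:.2f}s, estimated total: {total_seconds:.2f}s")
```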


In [17]:
import numpy as np
num_epochs = 20
best_val_loss = np.inf

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0

    for tokens_padded, tags_padded, lengths in ner_dataloader_train:
        tokens_padded = tokens_padded.to(device)
        tags_padded = tags_padded.to(device)
        optimizer.zero_grad()
        
        # Forward pass: Get logits from the model
        logits = model(tokens_padded, lengths)

        # Compute the loss and perform a backward pass
        loss = criterion(logits.view(-1, logits.size(-1)), tags_padded.view(-1))
        loss.backward()
        optimizer.step()

        total_train_loss += loss.item()

    # Validation (no gradients are needed, which saves memory and computation)
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for tokens_padded, tags_padded, lengths in ner_dataloader_valid:
            tokens_padded = tokens_padded.to(device)
            tags_padded = tags_padded.to(device)

            outputs = model(tokens_padded, lengths)
            loss = criterion(outputs.view(-1, tag_vocab_size), tags_padded.view(-1))
            total_val_loss += loss.item()
    print(f'Epoch {epoch + 1}/{num_epochs}, Training Loss: {total_train_loss}, Validation Loss: {total_val_loss}')

    if total_val_loss < best_val_loss:
        print('Validation loss decreased ({:.6f} --> {:.6f}). Saving model ...'.format(
            best_val_loss,
            total_val_loss))
        torch.save(model.state_dict(), 'model.pth')
        best_val_loss = total_val_loss

Epoch 1/20, Training Loss: 191.80757600069046, Validation Loss: 37.187063217163086
Validation loss decreased (inf --> 37.187063). Saving model ...
Epoch 2/20, Training Loss: 145.73386059701443, Validation Loss: 26.692867010831833
Validation loss decreased (37.187063 --> 26.692867). Saving model ...
Epoch 3/20, Training Loss: 110.90190483257174, Validation Loss: 21.525157721713185
Validation loss decreased (26.692867 --> 21.525158). Saving model ...
Epoch 4/20, Training Loss: 93.20787839964032, Validation Loss: 19.425325771793723
Validation loss decreased (21.525158 --> 19.425326). Saving model ...
Epoch 5/20, Training Loss: 81.80387690383941, Validation Loss: 18.032829579897225
Validation loss decreased (19.425326 --> 18.032830). Saving model ...
Epoch 6/20, Training Loss: 73.33292641118169, Validation Loss: 16.31408913107589
Validation loss decreased (18.032830 --> 16.314089). Saving model ...
Epoch 7/20, Training Loss: 66.04235683009028, Validation Loss: 14.894242275040597
Validation

KeyboardInterrupt: 

### Task 4. Evaluation [1.5 points]

To evaluate the validation and test datasets, we will use [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) from HuggingFace.
It takes two mandatory arguments:
*  predictions: a list of lists of predicted labels
*  references: a list of lists of reference labels

To pass these arguments to `seqeval`, you first need to get the model's predictions on the datasets and convert them into the format `seqeval` expects.

**a.** First, for evaluation on **validation** set:
*  Load the saved model with the lowest validation loss and set it to evaluation mode. *Note: You can use `torch.load('path_to_model')` to load the model*
*  Loop over data loader
*  Get predictions from the model and convert model prediction to NER ids by using [argmax](https://pytorch.org/docs/stable/generated/torch.argmax.html). Specify `dim=2`.
*  Remove paddings from the predictions and references (true labels) using the length of the sentence
* Convert predictions and references NER ids to NER tags by using NER tag vocabulary (`lookup_tokens` function)
* Add predictions and references to the separate lists
* Pass these lists to `seqeval` and print results.
* *Note: Your model should achieve a minimum 0.60 `overall_f1` score on the validation dataset. If your model has a lower overall F1 score, you should consider going back to the previous task and trying to change the parameters of the model or training.*
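
The padding-removal and id-to-tag steps from the list above can be sketched in plain Python. The helper `strip_padding_and_decode`, the small tag list, and the ids are hypothetical and only illustrate the data format.

```python
def strip_padding_and_decode(id_rows, lengths, itos):
    """Cut each row to its true length and map tag ids to tag strings."""
    return [[itos[i] for i in row[:length]] for row, length in zip(id_rows, lengths)]


# Hypothetical tag vocabulary with <pad> at index 0
itos = ["<pad>", "O", "B-PER", "I-PER"]

predicted_ids = [[2, 3, 1], [1, 2, 0]]  # the second row has one padded position
reference_ids = [[2, 3, 1], [1, 1, 0]]
lengths = [3, 2]

predictions = strip_padding_and_decode(predicted_ids, lengths, itos)
references = strip_padding_and_decode(reference_ids, lengths, itos)
print(predictions)  # [['B-PER', 'I-PER', 'O'], ['O', 'B-PER']]
print(references)   # [['B-PER', 'I-PER', 'O'], ['O', 'O']]
```

Nested lists of tag strings like these are what the `predictions` and `references` arguments of `seqeval` take.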

In [None]:
model = BiLSTMNERTagger(64, 128, 3, token_vocab_size, tag_vocab_size)
model.to(device)
# Load the weights that were saved at the lowest validation loss
model.load_state_dict(torch.load('model.pth', map_location=device))
model.eval()

def remove_padding(seqs, lengths):
    new_seqs = []
    for i, seq in enumerate(seqs):
        new_seq = seq[:lengths[i]]
        new_seqs.append(new_seq)
    return new_seqs

def seqs_to_tok(seqs, vocab):
    seqs_tokens = []
    for seq in seqs:
        # Convert the id tensor to a list of Python ints before the lookup
        seqs_tokens.append(vocab.lookup_tokens(seq.tolist()))
    return seqs_tokens


def get_all_preds(dataloader):
    all_preds, all_tags = [], []
    with torch.no_grad():  # Turn off gradients for validation, saves memory and computations
        for tokens_padded, tags_padded, lengths in dataloader:
            tokens_padded = tokens_padded.to(device)
            tags_padded = tags_padded.to(device)

            outputs = model(tokens_padded, lengths)
            preds = torch.argmax(outputs, dim=2)

            preds_unpad = remove_padding(preds, lengths)
            tags_unpad = remove_padding(tags_padded, lengths)

            preds_tokens = seqs_to_tok(preds_unpad, tag_vocab)
            tags_tokens = seqs_to_tok(tags_unpad, tag_vocab)

            all_preds += preds_tokens
            all_tags += tags_tokens
    return all_preds, all_tags



In [None]:
from seqeval.metrics import accuracy_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score

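# get_all_preds returns (predictions, references), in that order
all_preds, all_tags = get_all_preds(ner_dataloader_valid)
# seqeval expects the references (y_true) first, then the predictions (y_pred)
print(f1_score(all_tags, all_preds))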

0.7773584905660378


**b. Evaluation of the test set.** After you achieve the desired result on the validation set, use similar procedures as above to evaluate the model on the test set and print the results of `seqeval`.

In [None]:
# get_all_preds returns (predictions, references), in that order
all_preds, all_tags = get_all_preds(ner_dataloader_test)
print(f1_score(all_tags, all_preds))

0.7215311004784689


## Task B. Summarization [9 Points]

Summarization is the task of condensing a piece of text to a shorter version that contains the main information from the original. For this task, you will implement a sequence-to-sequence attentional model with a simple version of Pointer-Generator Networks introduced in this [paper](https://arxiv.org/abs/1704.04368).





**Note:** For the modeling part of this task, we provide starter code. You are free to change any part of the code if necessary.

For this task, we will use the [SAMSum](https://huggingface.co/datasets/samsum) dataset, which contains about 16k messenger-like conversations with summaries.

The dataset contains the following data fields:
* `dialogue`: text of the dialogue.
* `summary`: human-written summary of the dialogue.
* `id`: unique id of an example.


In [1]:
from datasets import load_dataset

dataset = load_dataset("samsum")
dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [2]:
dataset['train'][0]

{'id': '13818513',
 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

### Task 1. Dataset processing [1 point]

**a. Vocabulary**

You need to create a vocabulary:
*   Create a vocabulary by using tokens in `dialogue` and `summary` fields of the training data.
*   Add the `<pad>`, `<unk>`, `<sos>`, and `<eos>` special tokens.
*   Make the index of the `<unk>` token the default index of the vocabulary.

In [3]:
from torchtext.vocab import vocab
from collections import Counter
from torchtext.data.utils import get_tokenizer
import re

tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
#tokenizer = get_tokenizer('basic_english')
joint_counter = Counter()


def remove_emojis(s):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"  # Dingbats
                               u"\U000024C2-\U0001F251" 
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', s)

def keep_alphanumeric(s):
    # Remove every character that is not a letter, digit, whitespace, or basic punctuation
    pattern = re.compile(r'[^a-zA-Z0-9\s.,;:!?-]+')
    return pattern.sub('', s)

def preprocessing(s):
    s = s.replace('\r\n', ' ')
    s = s.replace('\n', ' ')
    s = s.replace('\r', ' ')
    s = remove_emojis(s)
    s = s.lower()
    s = keep_alphanumeric(s)

    return s



for doc in dataset['train']:
    #print(repr(preprocessing(doc['dialogue'])))
    tokenized_dialogue = tokenizer(preprocessing(doc['dialogue']))
    tokenized_summary = tokenizer(preprocessing(doc['summary']))
    
    joint_counter.update(tokenized_dialogue)
    joint_counter.update(tokenized_summary)

joint_vocab = vocab(joint_counter, specials=('<unk>', '<pad>', '<sos>', '<eos>'))
# Look up the index in the vocabulary, not the count in the counter
joint_vocab.set_default_index(joint_vocab['<unk>'])

print(len(joint_vocab))

32526
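
To see what the cleaning steps in the cell above do to a dialogue, the following sketch re-implements the newline replacement, the lowercasing, and the character filter in compact form and applies them to the first training example. The function name `clean` is hypothetical; emoji removal is left out here because the character filter already drops every non-ASCII character.

```python
import re

# Same character filter as in the cell above: keep letters, digits,
# whitespace, and a few punctuation marks
DISALLOWED = re.compile(r"[^a-zA-Z0-9\s.,;:!?-]+")


def clean(text):
    for newline in ("\r\n", "\n", "\r"):
        text = text.replace(newline, " ")
    return DISALLOWED.sub("", text.lower())


dialogue = "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
cleaned = clean(dialogue)
print(cleaned)
# amanda: i baked  cookies. do you want some? jerry: sure! amanda: ill bring you tomorrow :-
```

Note that the filter also removes apostrophes, so contractions such as `I'll` become `ill`.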


**b. Process the datasets**

For each dataset split:
*   Convert the dataset into token IDs using the created vocabulary.
*   Calculate the source (`dialogue`) and target (`summary`) lengths of each sample.

In [4]:
import torch

def convert_to_ids(sentences, vocab, tokenizer):
    tensors = []
    for sentence in sentences:
        tokenized_sentence = tokenizer(preprocessing(sentence))
        # Empty texts yield an empty tensor so that list positions stay aligned
        tensor = torch.tensor(vocab.lookup_indices(tokenized_sentence), dtype=torch.long)
        tensors.append(tensor)
    return tensors


tensor_dataset = {}
for split in dataset:
    split_dataset = dataset[split]

    dialogue_ids = convert_to_ids(split_dataset['dialogue'], joint_vocab, tokenizer)
    summary_ids = convert_to_ids(split_dataset['summary'], joint_vocab, tokenizer)

    # Drop pairs in which either side is empty. Filtering both lists together
    # keeps every dialogue aligned with its own summary.
    pairs = [(d, s) for d, s in zip(dialogue_ids, summary_ids) if len(d) > 0 and len(s) > 0]
    dialogue_ids = [d for d, _ in pairs]
    summary_ids = [s for _, s in pairs]

    dialogue_lens = torch.tensor([len(dialogue) for dialogue in dialogue_ids])
    summary_lens = torch.tensor([len(summary) for summary in summary_ids])

    tensor_dataset[split] = {}

    tensor_dataset[split]['dialogue_ids'] = dialogue_ids
    tensor_dataset[split]['summary_ids'] = summary_ids

    tensor_dataset[split]['dialogue_lens'] = dialogue_lens
    tensor_dataset[split]['summary_lens'] = summary_lens
    


**c. Automatic batching**

*   Calculate maximum length of the sources (SOURCE_LEN) and targets (TARGET_LEN)
*   Pad each source and target to their maximum length (SOURCE_LEN and TARGET_LEN, respectively) and return appropriate fields with collate function
*   Decide on batch size (e.g., 8, 16, 32, or any 2^n) and initialize DataLoader for each dataset split.

In [5]:
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

PAD_IDX = joint_vocab['<pad>']
SOS_IDX = joint_vocab['<sos>']
EOS_IDX = joint_vocab['<eos>']
# SOURCE_LEN is not used: sources are padded per batch, which is more efficient
#SOURCE_LEN = max(tensor_dataset['train']['dialogue_lens']) + 1  # FOR EOS
TARGET_LEN = int(tensor_dataset['train']['summary_lens'].max()) + 1  # +1 for <eos>

class SEMSUMDataset(Dataset):
    def __init__(self, tensor_dataset):
        self.dialogues = tensor_dataset['dialogue_ids']
        self.summaries = tensor_dataset['summary_ids']

        self.dialogue_lens = tensor_dataset['dialogue_lens']
        self.summary_lens = tensor_dataset['summary_lens']

    def __len__(self):
        return len(self.dialogues)

    def __getitem__(self, idx):
        return self.dialogues[idx], self.summaries[idx], self.dialogue_lens[idx], self.summary_lens[idx]
    
def collate_fn(batch):
    #batch.sort(key=lambda x: len(x[0]), reverse=True)
    #source, target, source_len, target_len = zip(*batch)
    
    source_batch, target_batch, source_lens, target_lens = [], [], [], []
    for source, target, source_len, target_len in batch:
        source_batch.append(torch.cat([source, torch.tensor([EOS_IDX])], dim=0))
        target_batch.append(torch.cat([target, torch.tensor([EOS_IDX])], dim=0))
        source_lens.append(source_len + 1)
        target_lens.append(target_len + 1)                         
    #source_batch[0] = nn.ConstantPad1d((0, SOURCE_LEN - source_batch[0].shape[0]), PAD_IDX)(source_batch[0])
    #target_batch[0] = nn.ConstantPad1d((0, TARGET_LEN - target_batch[0].shape[0]), PAD_IDX)(target_batch[0])

    # Pad the sequences, padding both target and source to max in batch
    source_padded = pad_sequence(source_batch, batch_first=True, padding_value=PAD_IDX)
    target_padded = pad_sequence(target_batch, batch_first=True, padding_value=PAD_IDX)
    
    return source_padded, target_padded, torch.tensor(source_lens), torch.tensor(target_lens) 

semsum_dataset_train = SEMSUMDataset(tensor_dataset['train'])
semsum_dataset_valid = SEMSUMDataset(tensor_dataset['validation'])
semsum_dataset_test = SEMSUMDataset(tensor_dataset['test'])

semsum_dataloader_train = DataLoader(semsum_dataset_train, batch_size=64, collate_fn=collate_fn, shuffle=True)
semsum_dataloader_valid = DataLoader(semsum_dataset_valid, batch_size=64, collate_fn=collate_fn)
semsum_dataloader_test = DataLoader(semsum_dataset_test, batch_size=64, collate_fn=collate_fn)

### Task 2. Encoder [1 point]

In the encoder, we will use a single-layer bidirectional LSTM. Additionally, we will define two linear layers to reduce the dimensionality of the concatenated hidden (`h_n`) and cell states (`c_n`). These layers will take the concatenation of the forward and backward hidden and cell states from the bidirectional LSTM as input and produce hidden states of dimensionality `hidden_size`.
  
**Forward Method:**
1. Pass the input sequences (`input`) through the bidirectional LSTM layer.
2. Concatenate the forward and backward hidden states along the last dimension.
3. Apply linear transformations to reduce the dimensionality of the concatenated hidden and cell states using the defined linear layers
4. Apply the ReLU activation function to the outputs of the linear layers.
5. Return the output sequence of the LSTM along with the final hidden and cell states.

**Additional Notes:**
- Ensure that the LSTM layer is configured to handle batched input sequences correctly
- Use PyTorch functions such as `torch.cat` and `unsqueeze` for tensor manipulations
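
The sketch below shows the shape of `h_n` for a single-layer bidirectional LSTM and one way to concatenate the two directions per example. The sizes are toy values chosen for illustration. It also shows that a plain `reshape` yields the right shape but not the right content.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, input_size, hidden_size = 3, 5, 4, 6
lstm = nn.LSTM(input_size, hidden_size, num_layers=1, batch_first=True, bidirectional=True)

x = torch.randn(batch_size, seq_len, input_size)
output, (h_n, c_n) = lstm(x)
print(h_n.shape)  # torch.Size([2, 3, 6]): (directions, batch, hidden_size)

# Concatenate the forward (index 0) and backward (index 1) states per example
h_cat = torch.cat((h_n[0], h_n[1]), dim=-1).unsqueeze(0)
print(h_cat.shape)  # torch.Size([1, 3, 12])

# A plain reshape has the same shape but mixes states of different examples
h_reshaped = h_n.reshape(1, batch_size, -1)
print(torch.equal(h_cat, h_reshaped))  # False

# Without padding, the forward final state is the last output step and the
# backward final state is the first output step
print(torch.allclose(output[:, -1, :hidden_size], h_n[0]))  # True
print(torch.allclose(output[:, 0, hidden_size:], h_n[1]))   # True
```

The same concatenation applies to `c_n` before the linear reduction layers.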

In [6]:
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):

    def __init__(self, input_size, hidden_size):
        super().__init__()

        #TODO: define a single-layer bidirectional LSTM
        self.lstm = nn.LSTM(input_size, hidden_size, 1, batch_first=True, bidirectional=True)

        #TODO: Define two linear layers with bias to reduce the dimensionality
             # of the concatenated hidden (h_n) and cell states (c_n) to `hidden_size`
        self.reduce_h_n = nn.Linear(hidden_size * 2, hidden_size)
        self.reduce_c_n = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, input, input_len):
        #print('-----ENCODER------')
        #TODO: pass the input to LSTM layer. If you want, you can use
             # pack_padded_sequence and pad_packed_sequence for input and output
        input_len = input_len.to(torch.int64).to('cpu')
        packed_input = nn.utils.rnn.pack_padded_sequence(input, lengths=input_len, batch_first=True, enforce_sorted=False)
        #print('input', input.shape)
        output, (h_n, c_n) = self.lstm(packed_input)

        output, _ = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
        #print('Encoder hidden', h_n.shape)
        #print('Encoder cell', c_n.shape)
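        #TODO: Concatenate the forward and backward hidden and cell states
        # h_n and c_n have shape (2, batch, hidden_size): index 0 is the forward
        # direction, index 1 the backward one. A plain reshape would mix states
        # of different batch elements, so concatenate along the feature dimension.
        h_n = torch.cat((h_n[0], h_n[1]), dim=-1).unsqueeze(0)
        c_n = torch.cat((c_n[0], c_n[1]), dim=-1).unsqueeze(0)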
        #print('Encoder hidden cat', h_n.shape)
        #print('Encoder cell cat', c_n.shape)
        #print('---')
        #TODO:Apply linear transformations to reduce the dimensionality
        h_n = self.reduce_h_n(h_n)
        c_n = self.reduce_c_n(c_n)
        #print('Encoder hidden cat reduce', h_n.shape)
        #print('Encoder cell cat reduce', c_n.shape)
        #TODO:Apply the ReLU activation function to the outputs of the linear layers
        h_n = F.relu(h_n)
        c_n = F.relu(c_n)
        #print('--------------------------')
        return output, (h_n, c_n)

### Task 3. Attention [2 points]

In the Attention model class, we will define linear projections (`Ea` and `Da`) to transform the encoder outputs and decoder input, respectively, to a common feature space. We will also define a linear layer (`Va`) to learn the importance of each encoder output when computing attention scores.

**Forward Method:**
  - Compute the attention scores using the Bahdanau attention mechanism (Equation 1 in the [paper](https://arxiv.org/pdf/1704.04368.pdf#page=2)). This involves:
    - Projecting the encoder outputs (`encoder_outputs`) and decoder input (`input`) to a common feature space.
    - Calculating the attention scores by adding the projected encoder outputs and decoder input and applying the tanh activation function followed by the linear layer (`Va`).
  - Compute the attention distribution by applying the softmax function to the masked attention scores (Equation 2 in the [paper](https://arxiv.org/pdf/1704.04368.pdf#page=2)).
  - Compute the context vector by performing a weighted sum of the encoder outputs using the attention distribution (Equation 3 in the [paper](https://arxiv.org/pdf/1704.04368.pdf#page=3)).
  - Return the attention distribution and the context vector.
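
For reference, Equations 1 to 3 of the paper can be written as follows, where $h_i$ are the encoder outputs, $s_t$ is the decoder state at step $t$, and $W_h$, $W_s$, $b_{\text{attn}}$, and $v$ are learned parameters. The correspondence to the layers named above ($W_h$ to `Ea`, $W_s$ with $b_{\text{attn}}$ to `Da`, $v$ to `Va`) is our reading of the task description.

```latex
\begin{aligned}
e_i^t &= v^\top \tanh\left(W_h h_i + W_s s_t + b_{\text{attn}}\right) \\
a^t &= \operatorname{softmax}\left(e^t\right) \\
h_t^* &= \sum_i a_i^t h_i
\end{aligned}
```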

**Additional Notes:**
- Ensure compatibility of dimensions throughout the operations, especially when performing tensor manipulations such as squeezing, unsqueezing, and summing.
- Use PyTorch functions such as `torch.bmm`, `torch.tanh`, `torch.softmax` operations for efficient implementation.
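
The masking, softmax, and weighted-sum steps can be sketched on toy tensors as follows. The padding index `1`, the token ids, and the random scores are assumptions for this example. In the real module the scores come from the `Va` layer.

```python
import torch

torch.manual_seed(0)

PAD_IDX = 1  # assumed padding index

# Two source sequences padded to length 4
source = torch.tensor([[5, 6, 7, 8],
                       [9, 4, PAD_IDX, PAD_IDX]])
pad_mask = source == PAD_IDX  # True where the input is padding

scores = torch.randn(2, 4)              # raw attention scores (batch, src_len)
encoder_outputs = torch.randn(2, 4, 6)  # (batch, src_len, 2 * hidden_size)

# Padded positions get a very negative score, so softmax assigns them ~0 weight
masked_scores = scores.masked_fill(pad_mask, -1e9)
attn_dist = torch.softmax(masked_scores, dim=-1)

# Weighted sum of encoder outputs: (batch, 1, src_len) x (batch, src_len, feat)
context = torch.bmm(attn_dist.unsqueeze(1), encoder_outputs).squeeze(1)

print(attn_dist[1])       # the last two weights are (numerically) zero
print(attn_dist.sum(-1))  # each row sums to 1
print(context.shape)      # torch.Size([2, 6])
```

A boolean mask of this kind is what the `pad_mask` argument of the `forward` method is expected to be.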

In [7]:
from torch import nn

class Attention(nn.Module):

    def __init__(self, hidden_size):
        super().__init__()

        #TODO: define linear layer WITHOUT bias to transform encoder output
        self.Ea = nn.Linear(hidden_size*2, hidden_size, bias=False)

        #TODO: define linear layer WITH bias to transform decoder input
        self.Da = nn.Linear(hidden_size, hidden_size, bias=True)

        #TODO: define linear layer WITHOUT bias to learn attention weights
        self.Va = nn.Linear(hidden_size, 1, bias=False)
        self.softmax = nn.Softmax(-1)

    def forward(self, input, encoder_outputs, pad_mask):
        #print('-----ATTENTION------')
        #print('Decoder hidden', input.shape)
        #print('Encoder output', encoder_outputs.shape)
        #TODO: apply linear layer to transform encoder outputs
        enc = self.Ea(encoder_outputs) 
        #print('Linear (encoder)', enc.shape)
        #TODO: apply linear layer to transform decoder input

        dec = self.Da(input).squeeze(0)  
        #print('Linear (decoder)', dec.shape)

        #print('-----------------')

        #TODO: add transformed encoder and decoder features together to get attention scores
        scores = enc + dec.unsqueeze(1)

        #TODO: apply the tanh activation function to attention scores
        scores = torch.tanh(scores)

        #TODO: apply the linear layer to attention scores
        scores = self.Va(scores).squeeze(2) 
        if pad_mask is not None:    
            scores = scores.float().masked_fill_(
                pad_mask,
                float(-1e9)
            ).type_as(scores)

        #TODO:Compute the attention weights by applying the softmax
        attn_dist = self.softmax(scores)

        # Weighted sum of encoder outputs via a batch matrix-matrix product:
        # (B, 1, S) x (B, S, 2H) -> (B, 1, 2H)
        context = torch.bmm(attn_dist.unsqueeze(1), encoder_outputs)

        # Drop the singleton dimension: (B, 1, 2H) -> (B, 2H)
        context = context.squeeze(1)

        return attn_dist, context


### Task 4. Decoder with Attention [1 point]

The decoder will generate output sequences based on the attention mechanism, which allows it to focus on relevant parts of the input sequence.
We will define a single-layer unidirectional LSTM and use our `Attention` module to compute attention scores and context vectors. Additionally, we will define two linear layers to produce the vocabulary distribution.

**Forward Method:**
  - Pass the decoder input (`input`) through the LSTM decoder to get the output sequence, hidden and cell states (`h_n` and `c_n`).
  - Compute the attention distribution and context vector using the `Attention` module, passing the decoder hidden state, encoder outputs, and padding mask.
  - Produce vocabulary distribution (Equation 4 in the [paper](https://arxiv.org/pdf/1704.04368.pdf#page=3)):
    - Concatenate the decoder hidden state and context vector
    - Pass the concatenated vector through the linear layers `linear_v` and `linear_v_out` to obtain the vocabulary distribution over the output tokens.
    - Apply the softmax function to obtain the final probability distribution over the vocabulary.
  - Return the vocabulary distribution, attention distribution, context vector, and updated hidden and cell states.

In [8]:
class AttentionDecoder(nn.Module):

    def __init__(self, input_size, hidden_size, vocab_size):
        super().__init__()

        # TODO: define a single-layer unidirectional LSTM
        self.lstm = nn.LSTM(input_size, hidden_size, 1, batch_first=True)

        # TODO: define an attention layer
        self.attention = Attention(hidden_size)

        self.linear_v = nn.Linear(hidden_size * 3, hidden_size, bias=True)
        self.linear_v_out = nn.Linear(hidden_size, vocab_size, bias=True)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input, prev_h, prev_c, encoder_outputs, pad_mask):
        #TODO: pass the decoder input, previous hidden and cells to LSTM layer
        #print('----DECODER----')
        #print('Input', input.shape)
        #print('prev_h', prev_h.shape)
        #print('prev_c', prev_c.shape)
        output, (h_n, c_n) = self.lstm(input, (prev_h, prev_c))
        #TODO: pass the hidden states (h_n), encoder_outputs and pad_mask to Attention layer

        attn_dist, context = self.attention(h_n, encoder_outputs, pad_mask)

        #TODO: concatenate context vector with hidden state of decoder

        output =  torch.cat((context, h_n.squeeze(0)), dim=1)

        #TODO: feed the output through two linear layers
        output = self.linear_v(output)
        output = self.linear_v_out(output)
        #TODO: apply softmax to output
        vocab_dist = self.softmax(output)
        return vocab_dist, attn_dist, context, h_n, c_n

### Task 5. Pointer-Generator Network [2 points]

In the PGNet class, we will define the embedding layer, encoder, decoder with attention and add a pointer-generator network (3 linear transformations). The pointer-generator network can copy words from the source text via pointing, which aids accurate reproduction of information while retaining the ability to produce novel words through the generator.

**Forward Method:**
  - Iterate over the target sequence length (`TARGET_LEN`).
  - For each time step `t`:
    - Compute the decoder input embedding (`input`) based on the previous decoder output.
    - Use the `AttentionDecoder` to generate the vocabulary distribution (`vocab_dist`), attention distribution (`attn_dist`),  context vector, decoder hidden and cell states.
    - Calculate the generation probability (`p_gen`) based on the context vector, decoder hidden state, and decoder input (Equation 8 in the [paper](https://arxiv.org/pdf/1704.04368.pdf#page=3)).
    - Compute the final distribution (`final_dist`) by combining the vocabulary distribution and the weighted attention distribution (`w_attn_dist`) based on the generation probability (Equation 9 in the [paper](https://arxiv.org/pdf/1704.04368.pdf#page=3)).
  - Stack the final distributions and attention distributions over time steps.
  - Return the final distributions and attention distributions.
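
The combination step in Equation 9 can be worked through on a tiny example. This is a sketch under assumed toy values (a vocabulary of 5 ids, a source of 3 tokens, and a fixed `p_gen` of 0.6), not output from the actual model. It shows how `scatter_add` accumulates the copy probability of a source token that appears more than once.

```python
import torch

# Toy values, assumed for illustration only
vocab_dist = torch.tensor([[0.1, 0.2, 0.3, 0.4, 0.0]])  # (batch=1, vocab=5), sums to 1
attn_dist = torch.tensor([[0.5, 0.3, 0.2]])             # (batch=1, src_len=3), sums to 1
source_tensor = torch.tensor([[4, 1, 4]])               # token id 4 appears twice in the source
p_gen = torch.tensor([[0.6]])                           # (batch, 1) generation probability

# Generator part: p_gen * P_vocab(w)
weighted_vocab = p_gen * vocab_dist
# Pointer part: (1 - p_gen) * attention over source positions
w_attn_dist = (1 - p_gen) * attn_dist

# Add each source position's copy probability onto the vocabulary id found there
final_dist = weighted_vocab.scatter_add(dim=-1, index=source_tensor, src=w_attn_dist)

print(final_dist)        # approximately [[0.06, 0.24, 0.18, 0.24, 0.28]]
print(final_dist.sum())  # approximately 1.0
```

Token id 4 has zero probability under the generator alone but receives 0.4 * (0.5 + 0.2) = 0.28 through copying, which is how the pointer mechanism can produce words the generator would not.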

In [9]:


class PGNet(nn.Module):

    def __init__(self, vocab_size, embed_size, hidden_size, top_n = 3):
        super().__init__()
        self.top_n = top_n
        self.embed_size = embed_size
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=PAD_IDX)

        self.encoder = Encoder(embed_size, hidden_size)
        self.decoder = AttentionDecoder(embed_size, hidden_size, vocab_size)

        self.w_context = nn.Linear(hidden_size * 2, 1, bias=False)
        self.w_hidden = nn.Linear(hidden_size, 1, bias=False)
        self.w_input = nn.Linear(embed_size, 1, bias=True)


    def forward(self, source_tensor, target_tensor, pad_mask, enc_len):

        batch_size=source_tensor.size(0)
        d_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(SOS_IDX)
        
        enc_emb = self.embedding(source_tensor)
        encoder_outputs, (h, c) = self.encoder(enc_emb, enc_len)

        final_dists = []
        attn_dists = []

        #TODO: set the number of decoding steps: the target length when
             # teacher forcing, otherwise the maximum TARGET_LEN
        if target_tensor is not None:
            target_len = target_tensor.size(1)
        else: 
            target_len = TARGET_LEN
        for t in range(target_len):
            input = self.embedding(d_input)
            vocab_dist, attn_dist, context, h, c = self.decoder(input, h, c, encoder_outputs, pad_mask)

            #TODO: apply the appropriate linear layers to
                 # context vector, decoder hidden states and decoder input
            context_feature = self.w_context(context)        # (B, 1)
            decoder_feature = self.w_hidden(h.squeeze(0))    # (1, B, H) -> (B, 1)
            input_feature = self.w_input(input.squeeze(1))   # (B, 1, E) -> (B, 1)

            #TODO: obtain generation features by adding up all features
            gen_feat = context_feature + decoder_feature + input_feature

            #TODO: obtain generation probability by applying sigmoid to generation feature
            # Keeping the shape (B, 1) avoids a bare squeeze(), which breaks for batch size 1
            p_gen = torch.sigmoid(gen_feat)

            #TODO: Calculate new vocab distribution by multiplying by p_gen
            vocab_dist = vocab_dist * p_gen

            #TODO: Calculate weighted attention distribution by
                # multiplying attention distribution by (1-p_gen)
            w_attn_dist = attn_dist * (1 - p_gen)

            #TODO:  add vocab_dist to w_attn_dist on the indexes of source_tensor
            final_dist = vocab_dist.scatter_add(dim=-1, index=source_tensor, src=w_attn_dist)
            final_dists.append(final_dist)
            attn_dists.append(attn_dist)

            #TODO:  If `target_tensor` is provided, use teacher forcing
                  # by feeding the actual target token as the next input;
                  # otherwise, use the predicted token from the current step.
            if target_tensor is not None:
                d_input = target_tensor[:, t:t+1]
            else:
                # Adding randomness to the selection of the next input
                _, topi = torch.topk(final_dist, self.top_n , dim = -1)  
                d_input = torch.empty(d_input.size(), dtype=torch.int32, device=device)
                for i in range(topi.shape[0]):
                    d_input[i] = topi[i, torch.randint(0, self.top_n, (1,))]

        final_dists = torch.stack(final_dists, dim=1)
        attn_dists = torch.stack(attn_dists, dim=1)

        return final_dists, attn_dists

### Task 6. Training [1.5 points]

*   Decide on model and training parameters
*   Implement the training loop
   *   Report training and validation loss for each epoch. You can refer to Equations 6 and 7 in the [paper](https://arxiv.org/pdf/1704.04368.pdf#page=3) regarding loss calculation.
   *   If the validation loss decreases, save the model.
*   Initialize the model and train the model.
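
The loss in Equations 6 and 7 is the negative log-likelihood of each target token. The sketch below checks, on assumed toy values, how `nn.NLLLoss` with `ignore_index` computes it from a `(batch, time, vocab)` tensor of probabilities. `PAD_IDX = 0` and the probabilities are made up for the illustration. Note that `nn.NLLLoss` with its default reduction averages over all non-padding tokens in the batch, which is close to, but not the same as, the per-sequence average in Equation 7.

```python
import torch
from torch import nn

PAD_IDX = 0  # assumed padding id for this sketch

# Probabilities over a vocabulary of 4 ids for 3 time steps: (batch=1, time=3, vocab=4)
final_dists = torch.tensor([[[0.10, 0.70, 0.10, 0.10],
                             [0.10, 0.20, 0.60, 0.10],
                             [0.25, 0.25, 0.25, 0.25]]])
target = torch.tensor([[1, 2, PAD_IDX]])  # the last step is padding and is ignored

criterion = nn.NLLLoss(ignore_index=PAD_IDX)

# NLLLoss expects log-probabilities shaped (batch, vocab, time), hence the transpose
log_probs = torch.log(final_dists).transpose(1, 2)
loss = criterion(log_probs, target)

# Manual check: mean of -log P(w_t) over the two non-padding steps
manual = -(torch.log(torch.tensor(0.7)) + torch.log(torch.tensor(0.6))) / 2

print(loss.item(), manual.item())  # both approximately 0.4337
```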


In [10]:
model = PGNet(len(joint_vocab), embed_size=128, hidden_size=256)
#model.load_state_dict(torch.load('best_pgnet.pth'))
model = model.to(device)
criterion = nn.NLLLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

In [11]:
n_epochs = 10
best_val_loss = torch.inf
# Training loop
for epoch in range(n_epochs):
    model.train()
    total_loss = 0
    for batch in semsum_dataloader_train:
        source_tensor, target_tensor, source_len, target_len = batch
        source_tensor, target_tensor = source_tensor.to(device), target_tensor.to(device)
        source_len, target_len = source_len.to(device), target_len.to(device)        
             
        optimizer.zero_grad()
        pad_mask = (source_tensor == PAD_IDX)
        final_dists, _ = model(source_tensor, target_tensor, pad_mask, source_len)

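        # Negative log-likelihood of the target tokens, averaged over non-padding positions.
        # NLLLoss expects (batch, vocab, time), so the last two dimensions are swapped.
        loss = criterion(torch.transpose(torch.log(final_dists), -1, -2), target_tensor)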
        print(loss)
        loss.backward()  # Backpropagation
        optimizer.step()
        total_loss += loss.item()
    
    print('finished epoch')
    avg_train_loss = total_loss / len(semsum_dataloader_train)
    print(f'Epoch {epoch}: Training Loss: {avg_train_loss}')

    # # Validation loop
    model.eval()
    with torch.no_grad():
        total_val_loss = 0
        for batch in semsum_dataloader_valid:  
            source_tensor, target_tensor, source_len, target_len = batch
            source_tensor, target_tensor = source_tensor.to(device), target_tensor.to(device)
            source_len, target_len = source_len.to(device), target_len.to(device)        
                
            pad_mask = (source_tensor == PAD_IDX)
            final_dists, _ = model(source_tensor, target_tensor, pad_mask, source_len)
            # Same token-level negative log-likelihood as in training
            loss = criterion(torch.transpose(torch.log(final_dists), -1, -2), target_tensor)
            total_val_loss += loss.item()  # accumulate a Python float, not a tensor

        avg_val_loss = total_val_loss / len(semsum_dataloader_valid)
        print(f'Epoch {epoch}: Validation Loss: {avg_val_loss}')

        # Save the model if validation loss has decreased
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save(model.state_dict(), 'best_pgnet.pth')
            print('Model saved.')

tensor(8.0975, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(8.1348, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(8.2066, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(8.0441, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(8.1492, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(8.3777, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(7.8367, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(7.4861, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(7.4092, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(7.3479, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(6.9020, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(6.7174, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(6.5171, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(6.1269, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(6.0678, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(6.3821, device='cuda:0', grad_fn=<NllLoss2DBackward0>)
tensor(6

KeyboardInterrupt: 

### Task 7. Inference [0.5 points]

*   Create an inference function that gets text as input and returns the summary of the text based on the trained model.
*   Pass at least 10 samples from the test set to the inference function
*   Print the dialogues, human-written summaries, and generated summaries
*   Question: What are the common patterns among the generated summaries?



In [15]:
model = PGNet(len(joint_vocab), embed_size=128, hidden_size=256)
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load('best_pgnet.pth', map_location=device))
model.to(device)
model.eval()

def inference(s_batch, top_n):
    preprocessed_s_batch = [preprocessing(s) for s in s_batch] 
    source_batch = convert_to_ids(preprocessed_s_batch, joint_vocab, tokenizer)
    
    source_eos = []
    for source in source_batch:
        source = torch.cat([source, torch.tensor([EOS_IDX])], dim=0)
        source_eos.append(source)
    
    source_len = torch.tensor([len(source) for source in source_eos])
    source_padded = pad_sequence(source_eos, batch_first=True, padding_value=joint_vocab['<pad>'])
    source_len, source_padded = source_len.to(device), source_padded.to(device)
    
    pad_mask = (source_padded == PAD_IDX)

    # Adding randomness to selected tokens
    final_dists, _ = model(source_padded, None, pad_mask, source_len)
    _, topi = torch.topk(final_dists, top_n , dim=-1)
    targets = torch.empty(final_dists.shape[0:2], dtype=torch.int32, device=device)
    for i in range(topi.shape[0]):
        for j in range(topi.shape[1]):
            targets[i, j] = topi[i,j, torch.randint(0, top_n, (1,))]

    sentences = []
    for target in targets:
        target_l = target.tolist()
        for i in range(len(target_l)):
            if target_l[i] == EOS_IDX:
                target_l = target_l[0:i+1]
                break
        target_strs = joint_vocab.lookup_tokens(target_l)
        sentences.append(' '.join(target_strs))
    return sentences
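
The two post-processing steps in `inference` (picking one of the `top_n` most probable ids per position, then cutting the sequence at the first end-of-sequence id) can also be written without explicit loops over tensor elements. The helpers below are a hypothetical sketch: `sample_top_n` and `truncate_at_eos` are names introduced here for illustration and are not part of the notebook.

```python
import torch

def sample_top_n(dist, top_n):
    """Pick, for every position, one of the top_n most probable ids uniformly at random."""
    _, topi = torch.topk(dist, top_n, dim=-1)                      # (..., top_n)
    # One random index into the top_n candidates per position
    choice = torch.randint(0, top_n, (*topi.shape[:-1], 1), device=dist.device)
    return topi.gather(-1, choice).squeeze(-1)                     # (...)

def truncate_at_eos(ids, eos_idx):
    """Keep ids up to and including the first eos_idx; return ids unchanged if absent."""
    if eos_idx in ids:
        return ids[: ids.index(eos_idx) + 1]
    return ids

# Toy distributions: (batch=1, time=2, vocab=3), values assumed for illustration
dists = torch.tensor([[[0.1, 0.6, 0.3],
                       [0.7, 0.2, 0.1]]])

greedy = sample_top_n(dists, top_n=1)   # top_n=1 reduces to greedy (argmax) decoding
print(greedy.tolist())                  # [[1, 0]]
print(truncate_at_eos([5, 7, 3, 9], eos_idx=3))  # [5, 7, 3]
```

With `top_n=1` the sampling is deterministic; larger values trade fidelity for variety in the generated tokens.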



In [18]:
inds = torch.randint(0, len(dataset['train']), (10,)).tolist()
# Greedy decoding (top_n=1, no randomness) seemed to work best
generated_summaries = inference([dataset['train'][i]['dialogue'] for i in inds], top_n=1)
for i, rndi in enumerate(inds):
    print('---DIALOGUE---')
    print(dataset['train'][rndi]['dialogue'])
    print('---TRUE SUMMARY---')
    print(dataset['train'][rndi]['summary'])
    print('---MODEL SUMMARY---')
    print(generated_summaries[i])


---DIALOGUE---
Christina: I need your help!
Lee: What's wrong?
Christina: Computer not working ;(
Lee: What happened?
Christina: I was working on my paper for Mr. Anderson and it suddenly turned off...
Lee: W8. What paper?
Christina: He gave us an essay on the most meaningful event in American history. It's due 2moro.
Lee: Good to know, no sleep 2nite... But what about ur computer?
Christina: I can't turn it back on and my whole paper and other stuff are on it!
Lee: Did you try charging it first?
Christina: Do you think I'm that stupid?
Lee: No, but the cable could've disconnected and you didn't notice.
Christina: Lemme check. BRB.
Lee: OK.
Christina: UR right! Silly me!
Lee: See? E123!
Christina: Thank you <3
Lee: UR welcome.
---TRUE SUMMARY---
Christina thinks her computer broke, and she's not able to finish her essay on American History. Lee forgot about the essay. Christina follows Lee's advice, and finds out that the cable is disconnected.
---MODEL SUMMARY---
christina is working 

There are some common patterns among the generated summaries. First of all, they are not very good. They tend to contain too many full stops, likely because the full stop is a very frequent token in the vocabulary. They also tend to repeat words, likely because only the previously generated token is fed into the decoder at each step, so the model can get stuck in loops. Finally, they tend to repeat the names of people from the dialogue.