<a href="https://colab.research.google.com/github/hjori66/Kaist-AI605-2021-Spring/blob/main/KAIST_AI605_Assignment_2_20194364.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KAIST AI605 Assignment 2: Token Classification with RNNs and Attention
Author: Minjoon Seo (minjoon@kaist.ac.kr)

TA in charge: Taehyung Kwon (taehyung.kwon@kaist.ac.kr)

**Due date**:  April 19 (Mon) 11:00pm, 2021  


Your name: Taehwan Kim

Your student ID: 20194364

Your collaborators: -

## Assignment Objectives
- Verify theoretically and empirically how the Transformer's attention mechanism works for sequence modeling tasks.
- Implement Transformer's encoder attention layer from scratch using PyTorch.
- Design an Attention-based token classification model using PyTorch.
- Apply the token classification model to a popular machine reading comprehension task, Stanford Question Answering Dataset (SQuAD).
- (Bonus) Analyze pros and cons between using RNN + attention versus purely attention.

## Your Submission
Your submission will be a link to a Colab notebook that has all written answers and is fully executable. You will submit your assignment via KLMS. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed but it is not a group assignment so make sure your answer and code are your own. Also make sure to mention your collaborators in your assignment with their names and their student ids.

## Grading
The entire assignment is out of 100 points. There are two bonus questions with 30 points altogether. Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which is already available on Colab:

In [None]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.10
torch 1.8.1+cu101


## 1. Transformer's Attention Layer

We will first start with going over a few concepts that you learned in your high school statistics class. The variance of a random variable $X$, $\text{Var}(X)$ is defined as $\text{E}[(X-\mu)^2]$ where $\mu$ is the mean of $X$. Furthermore, given two independent random variables $X$ and $Y$ and a constant $a$,
$$ \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y), \quad \ldots \; \text{(1)}$$ 
$$ \text{Var}(aX) = a^2\text{Var}(X), \quad \ldots \; \text{(2)}$$
$$ \text{Var}(XY) = \text{E}(X^2)\text{E}(Y^2) - [\text{E}(X)]^2[\text{E}(Y)]^2. \quad \ldots \; \text{(3)}$$
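
Identity (3) is stated without proof. For reference, it follows from the definition of variance together with the fact that expectations of products factor for independent variables. A short derivation, assuming $X$ and $Y$ are independent with finite second moments (so that $X^2$ and $Y^2$ are independent as well):

$$
\begin{align}
  \text{Var}(XY)
  &= \text{E}\left[(XY)^2\right] - [\text{E}(XY)]^2 \\
  &= \text{E}(X^2 Y^2) - [\text{E}(X)\text{E}(Y)]^2 \\
  &= \text{E}(X^2)\text{E}(Y^2) - [\text{E}(X)]^2[\text{E}(Y)]^2.
\end{align}
$$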

**Problem 1.1** *(10 points)* Suppose we are given two sets of $n$ random variables, $X_1 \dots X_n$ and $Y_1 \dots Y_n$, where all of these $2n$ variables are mutually independent and have a mean of $0$ and a variance of $1$. Prove that
$$\text{Var}\left(\sum_i^n X_i Y_i\right) = n.$$

There is a typo in the formula $\text{(2)}$. I changed it.

**Answer 1.1** 

$$
\begin{align}
  \text{Var}\left(\sum_i^n X_i Y_i\right)
  &= \sum_i^n \text{Var} \left(X_i Y_i\right) \quad \because \text{(1), independence} \\
  &= \sum_i^n \left[\text{E}(X_i^2)\text{E}(Y_i^2) - [\text{E}(X_i)]^2[\text{E}(Y_i)]^2 \right] \quad \because \text{(3)} \\
  &= \sum_i^n \left[\text{E}((X_i-0)^2)\text{E}((Y_i-0)^2)\right] \\
  &= \sum_i^n \left[\text{E}((X_i-[\text{E}(X_i)])^2)\text{E}((Y_i-[\text{E}(Y_i)])^2)\right] \\
  &= \sum_i^n \left[\text{Var}(X_i) \text{Var}(Y_i)\right] \\
  &= \sum_i^n \left[1\right] = n \\
\end{align}
\\
$$
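
As an empirical sanity check of this result, the following minimal sketch estimates the variance by Monte Carlo simulation. It assumes standard normal variables (any zero-mean, unit-variance distribution should behave the same way); the helper name `simulate_sum_variance`, the number of trials, and the seed are illustrative choices, not part of the assignment.

```python
import random
import statistics


def simulate_sum_variance(n, trials=5000, seed=0):
    """Estimate Var(sum_i X_i * Y_i) for independent standard normal X_i and Y_i."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # One draw of the sum of n independent products X_i * Y_i
        total = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(n))
        samples.append(total)
    return statistics.pvariance(samples)


# The estimated variance should be close to n for each n
for n in (4, 16, 64):
    print(n, round(simulate_sum_variance(n), 2))
```

The estimates fluctuate around $n$ because of sampling noise; increasing `trials` tightens them.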

In Lecture 08 and 09, we discussed how the attention is computed in Transformer via the following equation,
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
**Problem 1.2** *(10 points)*  Suppose $Q$ and $K$ are matrices of independent variables each of which has a mean of $0$ and a variance of $1$. Using what you learned from Problem 1.1., show that
$$\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1.$$

**Answer 1.2** 

$$
\begin{align}
  \text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right)
  &= \left(\frac{1}{\sqrt{d_k}}\right)^2 \text{Var}\left(QK^\top\right)  \quad \because \text{(2)} \\
  &= \frac{1}{d_k} \text{Var}\left(QK^\top\right) \\
\end{align}
\\
\\
$$

Then, without loss of generality, we can focus on a single element $\left(QK^\top\right)_{ij}$ of $QK^\top$.

$$
\begin{align}
  \frac{1}{d_k} \text{Var}\left(\left(QK^\top\right)_{ij}\right)
  &= \frac{1}{d_k} \text{Var}\left(\sum_{t=1}^{d_k} Q_{it} K_{jt} \right) \\
  &= \frac{1}{d_k} \left(\sum_{t=1}^{d_k} \text{Var}\left(Q_{it} K_{jt}\right) \right) \quad \because \text{(1), the products } Q_{it} K_{jt} \text{ are mutually independent} \\
  &= \frac{1}{d_k} \left(d_k\right) = 1 \quad \because \text{Problem 1.1} \\
\end{align}
\\
$$

Therefore, 

$$
\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1.
$$



**Problem 1.3** *(10 points)* What would happen if the assumption that the variance of $Q$ and $K$ is $1$ does not hold? Consider each case of it being higher and lower than $1$ and conjecture what it implies, respectively.

**Answer 1.3** \

If the variance of $Q_{ij}$ and $K_{ij}$ is higher than 1 for all i and j, then

$$
\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) > 1.
$$

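Then, the variance of the scaled scores that are fed into the softmax becomes larger.
If we apply the softmax function to these scores, the resulting attention weights might be "too" sharp, i.e. close to one-hot, which is also the regime where the softmax gradients become very small. (The residual connection may soften the effect on the layer's final output.) \
So, I guess that the model overfits faster than the original model.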

\

Otherwise, if the variance of $Q_{ij}$ and $K_{ij}$ is lower than 1 for all $i$ and $j$, then

$$
\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) < 1.
$$

Then, the variance of the scaled scores becomes smaller.
If we apply the softmax function to these scores, the resulting attention weights might be "too" smooth, i.e. close to uniform, so every query attends to all keys almost equally.
\
So, I guess that early training would be less stable than normal even with a larger learning rate, and the training time would be longer than with the original version.
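
The "too sharp" versus "too smooth" intuition can be illustrated numerically. The sketch below assumes the scores are independent Gaussians with a chosen standard deviation and measures the average entropy of the softmax over 32 scores; low entropy means a sharp distribution, and entropy near $\log 32$ means a nearly uniform one. The helper names, the number of keys, and the trial count are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_softmax_entropy(std, n_keys=32, trials=2000, seed=0):
    """Average softmax entropy when the scores are N(0, std^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        scores = [rng.gauss(0.0, std) for _ in range(n_keys)]
        total += entropy(softmax(scores))
    return total / trials


# Larger score variance -> lower entropy (sharper attention weights)
for std in (0.1, 1.0, 10.0):
    print(std, round(mean_softmax_entropy(std), 3))
```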

## 2. Preprocessing SQuAD

We will use `datasets` package offered by Hugging Face, which allows us to easily download various language datasets, including Stanford Question Answering Dataset (SQuAD).

First, install the package:

In [None]:
!pip install datasets



Then, download SQuAD and print the first example:

In [None]:
from datasets import load_dataset
squad_dataset = load_dataset('squad')
print(squad_dataset['train'][0])

Reusing dataset squad (/root/.cache/huggingface/datasets/squad/plain_text/1.0.0/4fffa6cf76083860f85fa83486ec3028e7e32c342c218ff2a620fc6b2868483a)


{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f41900661182', 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'title': 'University_of_Notre_Dame'}


Here, `answer_start` corresponds to the character-level start position of the answer, and `text` is the answer text itself. You will note that the `answer_start` and `text` fields are given as lists, but they only contain one item each. In fact, you can safely assume that this is the case for the training data. During evaluation, however, you will utilize several possible answers so that your prediction can be compared against all of them. So your code needs to handle the multiple-answer case as well.
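
To make the relation between character positions and token positions concrete, here is a minimal sketch that maps each (possibly multiple) character-level answer to inclusive token indices. It assumes whitespace tokenization and a shortened context; `token_spans` and `char_to_token_span` are hypothetical helper names used only for illustration.

```python
def token_spans(context, tokens):
    """Locate each token in the context and return (char_start, char_end) pairs."""
    spans, cursor = [], 0
    for token in tokens:
        begin = context.index(token, cursor)  # raises ValueError if the token is missing
        spans.append((begin, begin + len(token)))
        cursor = begin + len(token)
    return spans


def char_to_token_span(spans, answer_start, answer_text):
    """Map a character-level answer to inclusive (start, end) token indices."""
    answer_end = answer_start + len(answer_text)
    # A token is covered if its character range overlaps the answer range
    covered = [i for i, (b, e) in enumerate(spans) if b < answer_end and e > answer_start]
    if not covered:
        return None
    return covered[0], covered[-1]


context = "The Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858."
tokens = context.split()
spans = token_spans(context, tokens)

# Evaluation examples may carry several reference answers; handle each one
texts = ["Saint Bernadette Soubirous", "Bernadette Soubirous"]
answers = {"text": texts, "answer_start": [context.index(t) for t in texts]}
for text, start in zip(answers["text"], answers["answer_start"]):
    print(text, "->", char_to_token_span(spans, start, text))
```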

As we discussed in Lecture 05, we want to formulate this task as a token classification problem. That is, we want to find which token of the context corresponds to the start position of the answer, and which corresponds to the end.
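
One practical consequence of predicting the start and the end separately is that the two highest-scoring positions need not form a valid span (the end may come before the start). A common remedy, sketched here as an assumption rather than something the assignment requires, is to choose the pair with `start <= end` that maximizes the sum of the two scores. The function name `best_span` and the example scores are illustrative.

```python
def best_span(start_scores, end_scores):
    """Return the (start, end) pair with start <= end that maximizes the summed score."""
    best_pair = (0, 0)
    best_score = float("-inf")
    best_start = 0  # index of the best start score seen so far (positions <= end)
    for end in range(len(end_scores)):
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        score = start_scores[best_start] + end_scores[end]
        if score > best_score:
            best_score = score
            best_pair = (best_start, end)
    return best_pair


start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [1.5, 0.2, 1.0, 0.4]
# Independent argmax would give start=1 and end=0, which is not a valid span
print(best_span(start_scores, end_scores))
```

The single pass keeps the running best start position, so the search is linear in the number of tokens.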

**Problem 2.1** *(10 points)* Write `preprocess()` function that takes a SQuAD example as the input and outputs space-tokenized context and question, as well as the start and end token position of the answer if it has the answer field. That is, a pseudo code would look like:
```python
def preprocess(example):
  out = {'context': ['each', 'token'], 
         'question': ['each', 'token']}
  if 'answers' not in example:
    return out
  out['answers'] = [{'start': 3, 'end': 5}]
  return out
```
Verify that this code works by comparing the original answer text with the concatenation of the answer tokens from start to end in the training data. Report the percentage of the questions that have an exact match.
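
The verification step can be sketched as follows, assuming inclusive token indices and whitespace tokenization. `answer_matches` is a hypothetical helper, and the answer text is whitespace-normalized before the comparison so that repeated spaces do not cause spurious mismatches.

```python
def answer_matches(context_tokens, start, end, answer_text):
    """Return True if joining tokens[start..end] (inclusive) reproduces the answer text."""
    candidate = " ".join(context_tokens[start:end + 1])
    return candidate == " ".join(answer_text.split())


tokens = "It is a replica of the grotto".split()
print(answer_matches(tokens, 3, 5, "replica of the"))

# Space-based tokenization keeps punctuation attached, so this one does not match
print(answer_matches("hello world!".split(), 1, 1, "world"))
```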

**Answer 2.1**



In [None]:
def preprocess(example):
    def tokenizer(sentence):
        if sentence is None:
            return list()
        return sentence.split()

    out = dict()
    context = example['context']
    question = example['question']

    out['context'] = tokenizer(context)
    out['question'] = tokenizer(question)

    answer_list = list()
    for answer_index in range(len(example['answers']['text'])):
        answer = example['answers']['text'][answer_index]
        answer_start = example['answers']['answer_start'][answer_index]
        answer_end = answer_start + len(answer)

        if (answer_start == 0 or context[answer_start-1] == ' ') \
            and (answer_end == len(context) or context[answer_end] == ' '):
            n_tokens_before_answer = len(tokenizer(context[:answer_start]))
            n_tokens_answer = len(tokenizer(answer))
            answer_list.append({'start': n_tokens_before_answer, 'end': n_tokens_before_answer+n_tokens_answer-1})
    
    # Attach the answers only when at least one answer aligns with token boundaries
    if answer_list:
        out['answers'] = answer_list

    return out

n_exact_match = 0.0
for i, example in enumerate(squad_dataset['train']):
    out = preprocess(example)
    if 'answers' in out.keys():
        n_exact_match += 1



**Result 2.1**


```
print("number of answers that have exact match : ", n_exact_match)
print("number of training data : ", len(squad_dataset['train']))
print("the percentage of the exact match (training) : ", n_exact_match / len(squad_dataset['train']))

# number of answers that have exact match :  41616.0
# number of training data :  87599
# the percentage of the exact match :  0.47507391636890833 -> pretty low

```



We want to maximize the percentage of exact matches. You might see a low percentage, however, due to bad tokenization. For instance, such space-based tokenization will fail to separate "world" and "!" in "hello world!". 

**Problem 2.2** *(10 points)* Write an advanced tokenization model that always separates non-alphabet characters as independent tokens. For instance, "hello1 world!!" will be tokenized into "hello", "1", "world", "!", and "!". Using this new tokenizer, re-run the `preprocess` function and report the exact match percentage. How does the ratio change?

**Answer 2.2**



In [None]:
def find_nonalpha_list(dataset):
    nonalpha_list = list()
    for example in dataset:
        for c in example['context']:
            if not c.isalpha() and c not in nonalpha_list:
                nonalpha_list.append(c)
        for c in example['question']:
            if not c.isalpha() and c not in nonalpha_list:
                nonalpha_list.append(c)
    return nonalpha_list

    
def convert_idx(context, tokens):
    current = 0
    spans = []
    for token in tokens:
        current = context.find(token, current)
        if current < 0:
            print("No Token", token)
            raise Exception()
        spans.append((current, current + len(token)))
        current += len(token)
    return spans


def preprocess(example, nonalpha_list):
    def tokenizer(sentence, nonalpha_list):
        if sentence is None:
            return list()

        for nonalpha_token in nonalpha_list:
            sentence = sentence.replace(nonalpha_token, ' ' + nonalpha_token + ' ')

        sentence = ' '.join(sentence.split())
        return sentence.split()
        
    out = dict()
    context = example['context']
    question = example['question']
    id = example['id']

    out['context'] = tokenizer(context, nonalpha_list)
    out['question'] = tokenizer(question, nonalpha_list)
    out['id'] = id
    
    out['context_span'] = {'context':context, 'span':convert_idx(context, out['context'])}
    # print(out['context_span'], context)

    answer_list = list()
    for answer_index in range(len(example['answers']['text'])):
        answer = example['answers']['text'][answer_index]
        answer_start = example['answers']['answer_start'][answer_index]
        answer_end = answer_start + len(answer)

        if (answer_start == 0 or context[answer_start-1] in nonalpha_list) \
            and (answer_end == len(context) or context[answer_end] in nonalpha_list):
            n_tokens_before_answer = len(tokenizer(context[:answer_start], nonalpha_list))
            n_tokens_answer = len(tokenizer(answer, nonalpha_list))
            # print(answer_start, answer_end, len(context), len(out['context_span']))
            # print(answer_start, out['context_span']['span'][n_tokens_before_answer])
            # print(answer_end, out['context_span']['span'][n_tokens_before_answer+n_tokens_answer-1])
            answer_list.append({'start': n_tokens_before_answer, 'end': n_tokens_before_answer+n_tokens_answer-1})

    if answer_list:
        out['answers'] = answer_list

    return out

dataset = squad_dataset['train']
# dataset = squad_dataset['validation'] # Do this if you want to check the valid dataset

nonalpha_list = find_nonalpha_list(squad_dataset['train'])
print("nonalpha_list : ", nonalpha_list)

n_exact_match = 0.0
for i, example in enumerate(dataset):
  out = preprocess(example, nonalpha_list)
  if 'answers' in out.keys():
    n_exact_match += 1

print("number of answers that have exact match : ", n_exact_match)
print("number of training data : ", len(dataset))
print("the percentage of the exact match (training) : ", n_exact_match / len(dataset))



**Result 2.2**


```
nonalpha_list :  [',', ' ', '.', "'", '"', '1', '8', '5', '(', '3', ')', '?', '-', '7', '6', '9', '2', '0', ';', '–', '&', '4', '%', '$', '[', ']', '/', ':', '#', '—', '!', '“', '’', '”', '<', '\u200b', '̃', '£', '½', '+', '¢', '−', '°', '>', '€', '《', '》', '±', '~', '¥', '²', '❤', '=', '\u200e', '͡', '́', '`', '्', 'ु', 'ः', 'ॊ', 'ि', 'ा', '\u200d', '\u200c', '*', '‘', '\u3000', '•', '§', '⁄', '\n', '̯', '̩', '…', '·', 'ָ', 'ִ', 'ׁ', 'ַ', 'ּ', 'ְ', 'ّ', '⟨', '◌', '⟩', '˭', '̤', '♠', '∅', '̞', '×', '̥', '′', '″', '\ufeff', '_', 'ֿ', '´', '^', '̧', '̄', '→', '‑', '，', '₹', '\u202f', '♯', '₂', '₥', '⁊', '\u2009', '{', '}', '|', '@', '̪', '‚', '›', 'ׂ', 'ֵ', 'ِ', 'ْ', 'َ', '̍', '˥', '˨', '˩', '¡', '√', '¿', 'ာ', 'း', 'ُ', '≥', '˚', '≈', '⋅', 'ี', '︘', '�', '～', '〜', '̀', 'ོ', '་', '˧', 'ಾ', 'ು', '್', 'া', '্', 'ಿ', '∗', '∈', '≡', '∖', '№', '÷', 'ٔ', '¶', 'ิ', '₤', '♆', '⅓', '∝', '¼', 'ٍ', 'ֹ', '̌', '。', '̠', '₯']
number of answers that have exact match :  87108.0
number of training data :  87599
the percentage of the exact match (training) :  0.9943949131839405 -> now, it is OK

number of answers that have exact match :  10566.0
number of validation data :  10570
the percentage of the exact match (validation) :  0.9996215704824977 -> Also, it is OK

```
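
The tokenizer in Answer 2.2 calls `str.replace` once for every character in `nonalpha_list`. As an alternative, the same splitting rule can be written as a single pass over the text using `str.isalpha()`, without first collecting the non-alphabet characters. This is a minimal sketch, not the tokenizer used for the results above; the function name `tokenize` is illustrative, and edge cases (for example unusual Unicode whitespace) may be handled slightly differently.

```python
def tokenize(text):
    """Split text so that every non-alphabet, non-space character is its own token."""
    tokens, word = [], []
    for ch in text:
        if ch.isalpha():
            word.append(ch)  # extend the current alphabetic run
            continue
        if word:
            tokens.append("".join(word))  # close the current alphabetic run
            word = []
        if not ch.isspace():
            tokens.append(ch)  # digits, punctuation, symbols: one token each
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize("hello1 world!!"))
```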



## 3. LSTM Baseline for SQuAD

We will bring and reuse our model from Assignment 1. There are two key differences, however. First, we need to classify each token instead of the entire sentence. Second, we have two inputs (context and question) instead of just one.  

In [None]:
# My model from Assignment 1 is too slow to use here.
# I will use torchtext and torch.nn.LSTM in Assignment 2.

import torchtext
from torchtext.legacy import data
from torchtext.legacy import datasets
from torchtext.legacy.data import BucketIterator


class SQuAD1Dataset(data.Dataset):
    """
    Defines a dataset for squad1.0.
    """
    
    """
    @staticmethod
    def sort_key(ex):
        return data.interleave_keys(len(ex.context), len(ex.question))
    """  

    def __init__(self, data_list, fields, use_bos=True, max_length=None, **kwargs):
        if not isinstance(fields[0], (tuple, list)):
            fields = [('context', fields[0]), 
                      ('question', fields[1]), 
                      # ('context_question', fields[2]), # For Problem 3.2+, put the question after the context
                      ('answer_start', fields[2]), 
                      ('answer_end', fields[3]), 
                      ('id_index', fields[4])
                      ]

        examples = []
        nonalpha_list = find_nonalpha_list(data_list)

        self.id_list = list()
        self.context_span = list()
        self.reference = list()

        for _, example in enumerate(data_list):
            out = preprocess(example, nonalpha_list)
            # use data if the answer exists
            if max_length and max_length < max(len(out['context']), len(out['question'])):
                continue
            if 'answers' in out.keys():
                answer_start = out['answers'][0]['start'] # Use index 0 for instant valid accuracy
                answer_end = out['answers'][0]['end'] # Use index 0 for instant valid accuracy
                if use_bos:
                    answer_start += 1 # for <BOS> token
                    answer_end += 1 # for <BOS> token
                examples.append(data.Example.fromlist([out['context'], 
                                                    out['question'], 
                                                    #    out['context'] + ["<CLS>"] + out['question'], # Use <CLS> token
                                                    answer_start,
                                                    answer_end,
                                                    len(self.id_list)], 
                                                    fields))
                self.id_list.append(out['id'])
                self.context_span.append(out['context_span'])
                self.reference.append({'id':example['id'], 'answers':example['answers']})

        super(SQuAD1Dataset, self).__init__(examples, fields, **kwargs)


class SQuAD1Dataloader():
  """
  Make the dataloader for SQuAD 1.0
  """
  def __init__(self, train_data=None, valid_data=None, batch_size=64, device='cpu', 
                max_length=255, min_freq=2, fix_length=None,
                use_bos=True, use_eos=True, shuffle=True
              ):

    super(SQuAD1Dataloader, self).__init__()

    self.text = data.Field(sequential=True, use_vocab=True, batch_first=True, 
                           include_lengths=True, fix_length=fix_length, 
                           init_token='<BOS>' if use_bos else None, 
                           eos_token='<EOS>' if use_eos else None
                          )
    self.answer_start = data.Field(sequential = False, use_vocab = False)
    self.answer_end = data.Field(sequential = False, use_vocab = False)
    self.id_index = data.Field(sequential = False, use_vocab = False)
    
    train = SQuAD1Dataset(data_list=train_data, 
                          fields = [('context', self.text),
                                    ('question', self.text),
                                    # ('context_question', self.text),
                                    ('answer_start', self.answer_start),
                                    ('answer_end', self.answer_end),
                                    ('id_index', self.id_index)
                                    ], 
                          use_bos = use_bos,
                          max_length = max_length
                          )
    valid = SQuAD1Dataset(data_list=valid_data, 
                          fields = [('context', self.text),
                                    ('question', self.text),
                                    # ('context_question', self.text),
                                    ('answer_start', self.answer_start),
                                    ('answer_end', self.answer_end),
                                    ('id_index', self.id_index)
                                    ], 
                          use_bos = use_bos,
                          max_length = max_length
                          )
    self.train_id_list = train.id_list
    self.valid_id_list = valid.id_list
    self.train_context_span = train.context_span
    self.valid_context_span = valid.context_span
    self.train_reference = train.reference
    self.valid_reference = valid.reference
    
    self.train_iter = data.BucketIterator(train, batch_size=batch_size,
                                          device=device,
                                          shuffle=shuffle,
                                          sort_key=lambda x: len(x.question) + (max_length * len(x.context)), 
                                          sort_within_batch = True
                                          )
    self.valid_iter = data.BucketIterator(valid, batch_size=batch_size,
                                          device=device,
                                          shuffle=shuffle,
                                          sort_key=lambda x: len(x.question) + (max_length * len(x.context)), 
                                          sort_within_batch = True
                                          )
    
    # Note: min_freq is accepted by __init__ but not passed on here, so the default threshold is used.
    self.text.build_vocab(train)


train_dataset = squad_dataset['train']
valid_dataset = squad_dataset['validation']

print('# of train data : {}'.format(len(train_dataset)))
print('# of vaild data : {}'.format(len(valid_dataset)))

batch_size = 128
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_length = 253 # 255 - 2 for <BOS> and <EOS>
min_freq = 2
use_bos = False
use_eos = False
print("device : ", device)

loader = SQuAD1Dataloader(train_dataset, valid_dataset, batch_size=batch_size, 
                          device=device, max_length=max_length, min_freq=min_freq,
                          use_bos=use_bos, use_eos=use_eos)
print('\nFinish making the dataloader')
print("batch_size : ", batch_size)
print("max_length : ", max_length)
print('number of used train data ~ {}'.format((len(loader.train_iter)) * batch_size))
print('number of used vaild data ~ {}'.format((len(loader.valid_iter)) * batch_size))

vocab = loader.text.vocab
vocab_list = list(vocab.stoi.keys())
print('number of vocab : {}'.format(len(vocab)))


**Result 3.1.1 (Preprocessing)**


```
# of train data : 87599
# of vaild data : 10570
device :  cuda

Finish making the dataloader
batch_size :  128
max_length :  253
number of used train data ~ 81408
number of used vaild data ~ 9856
number of vocab : 86389

```
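
The loader above relies on `BucketIterator` with `sort_within_batch=True` to group examples of similar length and to sort each batch. Sorting matters for recurrent models that pack padded sequences, since `torch.nn.utils.rnn.pack_padded_sequence` expects lengths in decreasing order by default. As a framework-free illustration of what padding and length sorting amount to (the helper `pad_batch` and the `'<pad>'` string are illustrative assumptions, not torchtext internals):

```python
def pad_batch(sequences, pad_token="<pad>"):
    """Sort sequences by length (longest first) and pad them to a common length."""
    ordered = sorted(sequences, key=len, reverse=True)
    lengths = [len(seq) for seq in ordered]
    max_len = lengths[0] if lengths else 0
    padded = [seq + [pad_token] * (max_len - len(seq)) for seq in ordered]
    return padded, lengths


batch = [["a"], ["b", "c", "d"], ["e", "f"]]
padded, lengths = pad_batch(batch)
print(padded)
print(lengths)
```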




Before resolving these differences, you will need to define your evaluation function to correctly evaluate how well your model is doing. Note that the evaluation was very straightforward in Assignment 1's sentiment classification (it is either positive or negative), while it is a bit more complicated in SQuAD. We will use the evaluation function provided by `datasets`. You can access it via the following code.

In [None]:
from datasets import load_metric
squad_metric = load_metric('squad')

You can also easily learn about how to use the function by simply typing the function:

In [None]:
squad_metric
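
To give a rough idea of what such a metric computes, here is a self-contained sketch of exact match and token-level F1 with several reference answers. It assumes the usual SQuAD v1.1 conventions (lowercasing, removing punctuation and the articles "a", "an", "the", then taking the best score over all references). It is not the `datasets` implementation itself, and the function names are illustrative.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


def f1_score(prediction, references):
    """Best token-level F1 between the prediction and any reference."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for reference in references:
        ref_tokens = normalize_answer(reference).split()
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


references = ["Saint Bernadette Soubirous"]
print(exact_match("the Saint Bernadette Soubirous.", references))
print(f1_score("Bernadette Soubirous", references))
```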

**Problem 3.1** *(10 points)* Let's resolve the first issue here. Hence, for now, assume that your only input is the context and you want to obtain the answer without seeing the question. While this may seem like nonsense, it can actually be considered as modeling the prior $\text{Prob}(a|c)$ before observing $q$ (we ultimately want $\text{Prob}(a|q,c)$). Transform your model into a token classification model by imposing $\text{softmax}$ over the tokens instead of predefined classes. You will need to do this twice, once for each of start and end. Report the accuracy (using the metric above) on `squad_dataset['validation']`. 
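
As a framework-free illustration of "softmax over the tokens", the sketch below computes the loss for one example: a log-softmax over the token positions for the start and another for the end, summed as negative log-likelihoods of the gold positions. Excluding padded positions through a `length` argument is an assumption made here for clarity; the helper names are hypothetical.

```python
import math


def masked_log_softmax(logits, length):
    """Log-softmax over the first `length` positions; padded positions get -inf."""
    valid = logits[:length]
    m = max(valid)
    log_z = m + math.log(sum(math.exp(x - m) for x in valid))
    return [x - log_z if i < length else float("-inf") for i, x in enumerate(logits)]


def span_loss(start_logits, end_logits, length, start, end):
    """Sum of the negative log-likelihoods of the gold start and end positions."""
    log_p_start = masked_log_softmax(start_logits, length)
    log_p_end = masked_log_softmax(end_logits, length)
    return -(log_p_start[start] + log_p_end[end])


# Four positions, the last one is padding and must not influence the loss
start_logits = [0.0, 0.0, 0.0, 100.0]
end_logits = [0.0, 0.0, 0.0, 100.0]
print(span_loss(start_logits, end_logits, length=3, start=1, end=2))
```

In PyTorch, the same quantity for a batch corresponds to applying a cross-entropy loss to the start logits and to the end logits, with the token positions as targets.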

**Answer 3.1**



In [None]:
import torch.nn as nn
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class ClassificationLSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers, n_label, emb_dropout, rnn_dropout, bidirectional, enable_layer_norm, device):
        super(ClassificationLSTMModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, 
                            hidden_size=hidden_dim, 
                            num_layers=n_layers, 
                            dropout=rnn_dropout if n_layers > 1 else 0.0, 
                            bidirectional=bidirectional)
      
        n_direction = 2 if bidirectional else 1
        self.new_hidden_dim = hidden_dim * n_direction
        self.fc_start = nn.Linear(self.new_hidden_dim, 1, bias=True)
        self.fc_end = nn.Linear(self.new_hidden_dim, 1, bias=True)

        # Layer_normalization
        self.enable_layer_norm = enable_layer_norm
        if enable_layer_norm:
            self.norm1 = nn.LayerNorm(embedding_dim)
            self.norm2 = nn.LayerNorm(hidden_dim*n_direction)

        self.emb_dropout = nn.Dropout(emb_dropout)
        self.fc_dropout = nn.Dropout(rnn_dropout)
        self.bidirectional = bidirectional
        self.device = device

    def forward(self, input_tensor, src_seq_lens):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * hidden

        # Layer_normalization
        if self.enable_layer_norm:
            emb = self.norm1(emb)

        emb = self.emb_dropout(emb)
        emb = emb.transpose(0, 1) # emb.shape = len * batch * hidden

        # n_direction = 2 if bidirectional else 1
        # hidden = torch.zeros(n_layers*n_direction, context.shape[0], hidden_dim, requires_grad=True).to(self.device)
        # cell = torch.zeros(n_layers*n_direction, context.shape[0], hidden_dim, requires_grad=True).to(self.device)

        # nn.LSTM
        packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        outs, (hidden, cell) = self.lstm(packed)
        outs, out_lens = pad_packed_sequence(outs, batch_first=False)  # outs.shape = len * batch * self.new_hidden_dim
        
        outs = outs.transpose(0, 1) # outs.shape = batch * len * self.new_hidden_dim
        outs = self.fc_dropout(outs)
        logits_start = self.fc_start(outs).squeeze(2)
        logits_end = self.fc_end(outs).squeeze(2)

        # if self.bidirectional:
        #     hidden = torch.stack([hidden[-2], hidden[-1]], dim=0)
        # else:
        #     hidden = hidden[-1].unsqueeze(dim=0)
        # hidden = hidden.transpose(0, 1)
        # hidden = hidden.contiguous().view(hidden.shape[0], -1)
        
        # # Layer_normalization
        # if self.enable_layer_norm:
        #     hidden = self.norm2(hidden)

        # hidden = self.fc_dropout(hidden)
        # logits_start = self.fc_start(hidden)
        # logits_end = self.fc_end(hidden)

        return (logits_start, logits_end)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device : ", device)

# Vocabulary : Use vocab_list

# Construct the LSTM Model
embedding_dim = 128 # usually bigger, e.g. 128
hidden_dim = 128
n_layers = 2
n_label = max_length+1 if use_bos else max_length
emb_dropout = 0.5
rnn_dropout = 0.5
bidirectional = True
enable_layer_norm = True
rnnmodel = ClassificationLSTMModel(embedding_dim, hidden_dim, n_layers, n_label, emb_dropout, rnn_dropout, bidirectional, enable_layer_norm, device).to(device)

# Training
epochs = 30
learning_rate = 1e-3
max_norm = 5

print("batch_size : ", batch_size)
print("max_length : ", max_length)
print("embedding_dim : ", embedding_dim)
print("hidden_dim : ", hidden_dim)
print("n_layers : ", n_layers)
print("emb_dropout : ", emb_dropout)
print("rnn_dropout_and_fc_dropout : ", rnn_dropout)
if bidirectional:
    print("bidirectional : True")
else:
    print("bidirectional : False")
if enable_layer_norm:
    print("enable_layer_norm : True")
else:
    print("enable_layer_norm : False")
    
print("learning_rate : ", learning_rate)

# Construct the data loader
train_iter = loader.train_iter
valid_iter = loader.valid_iter

train_id_list = loader.train_id_list
valid_id_list = loader.valid_id_list
train_context_span = loader.train_context_span
valid_context_span = loader.valid_context_span
train_reference = loader.train_reference
valid_reference = loader.valid_reference

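PAD_IDX = vocab.stoi['<pad>']
# The targets are token positions, not vocabulary ids, so no target value should be ignored:
# ignore_index=PAD_IDX would silently drop every answer located at position PAD_IDX.
cel = nn.CrossEntropyLoss()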
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=learning_rate)

# Evaluate
squad_metric = load_metric('squad') # get_tokens() in squad_metric is not exactly same as my preprocess()..


for epoch in tqdm(range(epochs)):
    rnnmodel.train() # enable dropout (validation below switches the model to eval mode)
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    train_prediction = list()
    for train_i, train_batch in enumerate(train_iter):
        context, context_length = train_batch.context
        question, question_length = train_batch.question # Unused
        answer_start = train_batch.answer_start
        answer_end = train_batch.answer_end
        train_id_index = train_batch.id_index

        logits_start, logits_end = rnnmodel(context, context_length)

        optimizer.zero_grad() # reset process
        loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L

        loss.backward() # compute gradients
        # print(torch.norm(rnnmodel.lstm.weight_hh_l0.grad), loss.item())
        torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), max_norm) # gradient clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_start_preds = torch.max(logits_start, 1)
        _, train_end_preds = torch.max(logits_end, 1)
        # train_accuracy += ((train_start_preds == answer_start) * (train_end_preds == answer_end)).sum().float()

        train_data_num += context.shape[0]

        for train_j in range(context.shape[0]):
            current_id = train_id_index[train_j]
            start = train_start_preds[train_j]
            end = train_end_preds[train_j]

            context_sentence = train_context_span[current_id]['context']
            context_span = train_context_span[current_id]['span']
            pred_text = ""
            # end is inclusive, so start == end is a valid one-token answer
            if start <= end and end < len(context_span):
                pred_text = context_sentence[context_span[start][0]:context_span[end][1]]
            
            # start = answer_start[train_j]
            # end = answer_end[train_j]
            # answer_text = context_sentence[context_span[start][0]:context_span[end][1]+1]
            # print(pred_text, answer_text, train_reference[current_id], "\n")

            train_prediction.append({'id':train_id_list[current_id], 'prediction_text':pred_text})

    train_result = squad_metric.compute(predictions=train_prediction, references=train_reference)
    print('train:: Epoch:', '%04d' % (epoch + 1), 
          'cost =', '{:.6f},'.format(train_loss / train_data_num), 
        #   'my exact_match =', '{:.6f}'.format(train_accuracy / train_data_num),
          'squad_metric : ', train_result
          )
        
    if (epoch + 1) % 1 == 0:
        rnnmodel.eval() # disable dropout during validation
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            valid_prediction = list()
            for valid_i, valid_batch in enumerate(valid_iter):
                context, context_length = valid_batch.context
                question, question_length = valid_batch.question # Unused
                answer_start = valid_batch.answer_start
                answer_end = valid_batch.answer_end
                valid_id_index = valid_batch.id_index

                logits_start, logits_end = rnnmodel(context, context_length)

                loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_start_preds = torch.max(logits_start, 1)
                _, valid_end_preds = torch.max(logits_end, 1)
                # valid_accuracy += ((valid_start_preds == answer_start) * (valid_end_preds == answer_end)).sum().float()

                valid_data_num += context.shape[0]

                for valid_j in range(context.shape[0]):
                    current_id = valid_id_index[valid_j]
                    start = valid_start_preds[valid_j]
                    end = valid_end_preds[valid_j]

                    context_sentence = valid_context_span[current_id]['context']
                    context_span = valid_context_span[current_id]['span']
                    pred_text = ""
                    if start < end and end < len(context_span):
                        pred_text = context_sentence[context_span[start][0]:context_span[end][1]]
                    
                    # start = answer_start[valid_j]
                    # end = answer_end[valid_j]
                    # answer_text = context_sentence[context_span[start][0]:context_span[end][1]+1]
                    # print(pred_text, answer_text, valid_reference[current_id], "\n")

                    valid_prediction.append({'id':valid_id_list[current_id], 'prediction_text':pred_text})
                
            valid_result = squad_metric.compute(predictions=valid_prediction, references=valid_reference)
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                  'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                #   'my exact_match =', '{:.6f},'.format(valid_accuracy / valid_data_num),
                  'squad_metric : ', valid_result
                 )
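
The loop above takes the argmax of the start and end logits independently, so the predicted end can come before the start; in that case (and also when start == end, because the check is `start < end`) the prediction is an empty string. One common alternative, which is not what the reported runs used, is to pick the pair with start <= end that maximizes the sum of the two logits. A minimal sketch in plain Python, where `best_span` and `max_answer_len` are hypothetical names and the logits are made-up numbers:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) with start <= end maximizing start_logits[start] + end_logits[end]."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


start_logits = [0.1, 2.0, 0.3, 0.0]
end_logits = [3.0, 0.2, 1.5, 0.1]
# Independent argmax would give start=1, end=0, which is not a valid span.
print(best_span(start_logits, end_logits))  # (1, 2)
```

The double loop is quadratic in the sequence length, so `max_answer_len` keeps it cheap; with tensors the same search is usually done on an upper-triangular score matrix.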
            

**Comment 3.1**

`squad_metric` reports two metrics, `exact_match` and `f1`.
It takes each prediction as a string, not as a list of tokens.
Therefore, I joined the word tokens with single spaces before passing them to the metric.
The metric then normalizes the string (lowercasing and punctuation removal) and tokenizes it with a plain whitespace `split()`.

However, this differs from my numericalizer in Prob 2.2, the function `preprocess()`. I believe this mismatch causes the poor scores (especially `exact_match`).

For example, I double-checked the exact-match accuracy using only the start and end positions of the FIRST answer. That accuracy is much higher than the `exact_match` value (argmax accuracy on Prob 3.2).

**I fixed the above issue on 210422.**
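
For reference, the normalization that the metric applies can be sketched as follows. This mirrors, to my knowledge, the official SQuAD v1.1 evaluation script that `load_metric('squad')` wraps; besides lowercasing and punctuation removal it also drops the articles a/an/the. The function names follow that script, and the example strings are made up for illustration.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(s):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction, ground_truth):
    """True if both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)


def f1_score(prediction, ground_truth):
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(normalize_answer("The Denver Broncos!"))               # denver broncos
print(exact_match("the Denver Broncos", "Denver Broncos."))  # True
```

Because the comparison is done on normalized strings, a prediction sliced directly from the original context text is scored fairly even when my own tokenizer splits words differently.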

**Result 3.1.2 (Training & Validation) on 210422**


```
# Result ::

device :  cuda
batch_size :  128
max_length :  253
embedding_dim :  128
hidden_dim :  128
n_layers :  2
emb_dropout :  0.5
rnn_dropout_and_fc_dropout :  0.5
bidirectional : True
enable_layer_norm : True
learning_rate :  0.001
100%
30/30 [31:59<00:00, 63.98s/it]
train:: Epoch: 0001 cost = 0.065187, squad_metric :  {'exact_match': 2.056580565805658, 'f1': 5.804066277241986}
valid:: Epoch: 0001 cost = 0.063628, squad_metric :  {'exact_match': 2.5017797213464865, 'f1': 6.626567457311269}
train:: Epoch: 0002 cost = 0.061263, squad_metric :  {'exact_match': 2.872078720787208, 'f1': 6.630574448540768}
valid:: Epoch: 0002 cost = 0.061789, squad_metric :  {'exact_match': 3.295026950066104, 'f1': 7.197033583409913}
train:: Epoch: 0003 cost = 0.059741, squad_metric :  {'exact_match': 3.140221402214022, 'f1': 6.922236813976194}
valid:: Epoch: 0003 cost = 0.061271, squad_metric :  {'exact_match': 3.111969897284654, 'f1': 6.925491932446894}
train:: Epoch: 0004 cost = 0.058734, squad_metric :  {'exact_match': 3.3997539975399755, 'f1': 7.345496100542807}
valid:: Epoch: 0004 cost = 0.060731, squad_metric :  {'exact_match': 3.244177768737923, 'f1': 7.40480242676518}
train:: Epoch: 0005 cost = 0.057881, squad_metric :  {'exact_match': 3.5202952029520294, 'f1': 7.482867259268445}
valid:: Epoch: 0005 cost = 0.060427, squad_metric :  {'exact_match': 3.2238380962066513, 'f1': 7.544678171053291}
train:: Epoch: 0006 cost = 0.057167, squad_metric :  {'exact_match': 3.6346863468634685, 'f1': 7.671537235894837}
valid:: Epoch: 0006 cost = 0.060406, squad_metric :  {'exact_match': 3.4068951489881014, 'f1': 7.939744846165134}
train:: Epoch: 0007 cost = 0.056531, squad_metric :  {'exact_match': 3.787207872078721, 'f1': 8.002235148122486}
valid:: Epoch: 0007 cost = 0.060026, squad_metric :  {'exact_match': 3.5391030204413707, 'f1': 8.043496321836761}
train:: Epoch: 0008 cost = 0.056052, squad_metric :  {'exact_match': 3.988929889298893, 'f1': 8.177588042776978}
valid:: Epoch: 0008 cost = 0.060133, squad_metric :  {'exact_match': 3.3255364588630125, 'f1': 7.839441067283323}
train:: Epoch: 0009 cost = 0.055551, squad_metric :  {'exact_match': 4.011070110701107, 'f1': 8.21664527910432}
valid:: Epoch: 0009 cost = 0.060019, squad_metric :  {'exact_match': 3.518763347910099, 'f1': 8.12918467530948}
train:: Epoch: 0010 cost = 0.055113, squad_metric :  {'exact_match': 4.115621156211562, 'f1': 8.317789639877057}
valid:: Epoch: 0010 cost = 0.059959, squad_metric :  {'exact_match': 3.152649242347198, 'f1': 7.685162704048659}
train:: Epoch: 0011 cost = 0.054717, squad_metric :  {'exact_match': 4.321033210332104, 'f1': 8.580317576396196}
valid:: Epoch: 0011 cost = 0.060023, squad_metric :  {'exact_match': 3.478084002847554, 'f1': 7.838234784404577}
train:: Epoch: 0012 cost = 0.054317, squad_metric :  {'exact_match': 4.237392373923739, 'f1': 8.484760252335024}
valid:: Epoch: 0012 cost = 0.060156, squad_metric :  {'exact_match': 3.4475744940506456, 'f1': 7.856218883584121}
train:: Epoch: 0013 cost = 0.053976, squad_metric :  {'exact_match': 4.3628536285362856, 'f1': 8.605783816935014}
valid:: Epoch: 0013 cost = 0.060177, squad_metric :  {'exact_match': 3.386555476456829, 'f1': 7.787905192658652}
train:: Epoch: 0014 cost = 0.053590, squad_metric :  {'exact_match': 4.397293972939729, 'f1': 8.644590212653958}
valid:: Epoch: 0014 cost = 0.060172, squad_metric :  {'exact_match': 3.3458761313942844, 'f1': 7.7599370837192}
train:: Epoch: 0015 cost = 0.053299, squad_metric :  {'exact_match': 4.520295202952029, 'f1': 8.768772265882282}
valid:: Epoch: 0015 cost = 0.060146, squad_metric :  {'exact_match': 3.6814807281602766, 'f1': 7.859592225429869}
train:: Epoch: 0016 cost = 0.052974, squad_metric :  {'exact_match': 4.565805658056581, 'f1': 8.85582993856305}
valid:: Epoch: 0016 cost = 0.060741, squad_metric :  {'exact_match': 3.4984236753788265, 'f1': 7.766187098597552}
train:: Epoch: 0017 cost = 0.052682, squad_metric :  {'exact_match': 4.539975399753997, 'f1': 8.803159052462188}
valid:: Epoch: 0017 cost = 0.060593, squad_metric :  {'exact_match': 3.7323299094884574, 'f1': 8.117706658077221} -> near-best valid exact_match & f1 (used for the comparison in Comment 3.2)
train:: Epoch: 0018 cost = 0.052393, squad_metric :  {'exact_match': 4.688806888068881, 'f1': 8.995489220012683}
valid:: Epoch: 0018 cost = 0.060750, squad_metric :  {'exact_match': 3.386555476456829, 'f1': 7.614236168878465}
train:: Epoch: 0019 cost = 0.052137, squad_metric :  {'exact_match': 4.821648216482165, 'f1': 9.079067942461126}
valid:: Epoch: 0019 cost = 0.060971, squad_metric :  {'exact_match': 3.7323299094884574, 'f1': 7.77000816638873}
train:: Epoch: 0020 cost = 0.051826, squad_metric :  {'exact_match': 4.7749077490774905, 'f1': 9.013046929151512}
valid:: Epoch: 0020 cost = 0.061129, squad_metric :  {'exact_match': 3.528933184175735, 'f1': 7.585744742845281}
train:: Epoch: 0021 cost = 0.051532, squad_metric :  {'exact_match': 4.908979089790898, 'f1': 9.159416891249757}
valid:: Epoch: 0021 cost = 0.061273, squad_metric :  {'exact_match': 3.1628190786128343, 'f1': 7.410190235482942}
train:: Epoch: 0022 cost = 0.051295, squad_metric :  {'exact_match': 4.9175891758917585, 'f1': 9.173378621994955}
valid:: Epoch: 0022 cost = 0.061564, squad_metric :  {'exact_match': 3.4882538391131903, 'f1': 7.626373532177601}
train:: Epoch: 0023 cost = 0.051006, squad_metric :  {'exact_match': 4.939729397293973, 'f1': 9.268019479019237}
valid:: Epoch: 0023 cost = 0.061508, squad_metric :  {'exact_match': 3.7526695820197293, 'f1': 7.7489979711838535}
train:: Epoch: 0024 cost = 0.050738, squad_metric :  {'exact_match': 5.030750307503075, 'f1': 9.223411318450689}
valid:: Epoch: 0024 cost = 0.062149, squad_metric :  {'exact_match': 3.7526695820197293, 'f1': 7.824465489995462}
train:: Epoch: 0025 cost = 0.050490, squad_metric :  {'exact_match': 5.163591635916359, 'f1': 9.374094644338166}
valid:: Epoch: 0025 cost = 0.061921, squad_metric :  {'exact_match': 3.30519678633174, 'f1': 7.466523830086144}
train:: Epoch: 0026 cost = 0.050288, squad_metric :  {'exact_match': 5.084870848708487, 'f1': 9.306312264102903}
valid:: Epoch: 0026 cost = 0.062100, squad_metric :  {'exact_match': 3.3662158039255567, 'f1': 7.421691393618396}
train:: Epoch: 0027 cost = 0.049983, squad_metric :  {'exact_match': 5.088560885608856, 'f1': 9.35156305705315}
valid:: Epoch: 0027 cost = 0.062905, squad_metric :  {'exact_match': 3.3560459676599206, 'f1': 7.64952733556857}
train:: Epoch: 0028 cost = 0.049738, squad_metric :  {'exact_match': 5.083640836408364, 'f1': 9.368993351445374}
valid:: Epoch: 0028 cost = 0.063298, squad_metric :  {'exact_match': 3.213668259941015, 'f1': 7.420978200337307}
train:: Epoch: 0029 cost = 0.049470, squad_metric :  {'exact_match': 5.099630996309963, 'f1': 9.324632873470037}
valid:: Epoch: 0029 cost = 0.062680, squad_metric :  {'exact_match': 3.457744330316282, 'f1': 7.781609326613748}
train:: Epoch: 0030 cost = 0.049204, squad_metric :  {'exact_match': 5.279212792127922, 'f1': 9.52412685897577}
valid:: Epoch: 0030 cost = 0.063807, squad_metric :  {'exact_match': 3.1831587511441066, 'f1': 7.174881290925502}
```

**Problem 3.2** *(10 points)*  Now let's resolve the second issue, by simply concatenating the two inputs into one sequence. The simplest way would be to append the question at the start *OR* the end of the context. If you put it at the start, you will need to shift the start and the end positions of the answer accordingly. If you put it at the end, it will be necessary to use bidirectional LSTM for the context to be aware of what is ahead (though it is recommended to use bidirectional LSTM even if the question is appended at the start). Whichever you choose, carry it out and report the accuracy. How does it differ from 3.1?
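
As a small illustration of the index bookkeeping described above, here is a sketch with plain Python lists. `concat_question_context` and the sample tokens are hypothetical and not part of the notebook's pipeline; the `<CLS>` separator matches the token used in the code below, and answer positions are assumed to be inclusive token indices into the context.

```python
def concat_question_context(context_tokens, question_tokens, start, end,
                            sep_token="<CLS>", question_first=True):
    """Merge question and context into one sequence and adjust the answer positions."""
    if question_first:
        # The context is pushed right by the question length plus one separator.
        offset = len(question_tokens) + 1
        tokens = question_tokens + [sep_token] + context_tokens
        return tokens, start + offset, end + offset
    # Context first: the answer positions are unchanged.
    tokens = context_tokens + [sep_token] + question_tokens
    return tokens, start, end


context = ["Denver", "Broncos", "won", "Super", "Bowl", "50"]
question = ["Who", "won", "?"]

tokens, start, end = concat_question_context(context, question, 0, 1)
print(start, end, tokens[start:end + 1])  # 4 5 ['Denver', 'Broncos']
```

If a `<BOS>` token is also prepended, both positions shift by one more, which is what the `use_bos` branch in the dataset class below does.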

In [None]:
# Put the question sentence after the context sentence (for Prob 3.2+), so the answer positions need no shift
# Use <CLS> token to separate two sentences

import torchtext
from torchtext.legacy import data
from torchtext.legacy import datasets
from torchtext.legacy.data import BucketIterator


class SQuAD2Dataset(data.Dataset):
  """
  Defines a dataset for squad1.0.
  """
  
  @staticmethod
  def sort_key(ex):
    return len(ex.context_question)  # interleave_keys() expects two lengths; there is only one sequence here

  def __init__(self, data_list, fields, use_bos=True, max_length=None, **kwargs):
    if not isinstance(fields[0], (tuple, list)):
      fields = [
                # ('context', fields[0]), 
                # ('question', fields[1]), 
                ('context_question', fields[0]), # For Problem 3.2+, put the question after the context
                ('answer_start', fields[1]), 
                ('answer_end', fields[2]), 
                ('id_index', fields[3])
                ]

    examples = []
    nonalpha_list = find_nonalpha_list(data_list)

    self.id_list = list()
    self.reference = list()
    self.context_span = list()

    for _, example in enumerate(data_list):
        out = preprocess(example, nonalpha_list)
        # use data if the answer exists
        if max_length and max_length < max(len(out['context']), len(out['question'])):
            continue
        if 'answers' in out.keys():
            answer_start = out['answers'][0]['start'] # Use index 0 for instant valid accuracy
            answer_end = out['answers'][0]['end'] # Use index 0 for instant valid accuracy
            if use_bos:
                answer_start += 1 # for <BOS> token
                answer_end += 1 # for <BOS> token
            examples.append(data.Example.fromlist([
                                                #    out['context'], 
                                                #    out['question'], 
                                                   out['context'] + ["<CLS>"] + out['question'], # Use <CLS> token
                                                   answer_start,
                                                   answer_end,
                                                   len(self.id_list)], 
                                                  fields))
            self.id_list.append(out['id'])
            self.context_span.append(out['context_span'])
            self.reference.append({'id':example['id'], 'answers':example['answers']})

    super(SQuAD2Dataset, self).__init__(examples, fields, **kwargs)


class SQuAD2Dataloader():
  """
  Make the dataloader for SQuAD 1.0
  """
  def __init__(self, train_data=None, valid_data=None, batch_size=64, device='cpu', 
                max_length=255, min_freq=2, fix_length=None,
                use_bos=True, use_eos=True, shuffle=True
              ):

    super(SQuAD2Dataloader, self).__init__()

    self.text = data.Field(sequential=True, use_vocab=True, batch_first=True, 
                           include_lengths=True, fix_length=fix_length, 
                           init_token='<BOS>' if use_bos else None, 
                           eos_token='<EOS>' if use_eos else None
                          )
    self.answer_start = data.Field(sequential = False, use_vocab = False)
    self.answer_end = data.Field(sequential = False, use_vocab = False)
    self.id_index = data.Field(sequential = False, use_vocab = False)
    
    train = SQuAD2Dataset(data_list=train_data, 
                          fields = [
                                    # ('context', self.text),
                                    # ('question', self.text),
                                    ('context_question', self.text),
                                    ('answer_start', self.answer_start),
                                    ('answer_end', self.answer_end),
                                    ('id_index', self.id_index)
                                    ], 
                          use_bos = use_bos,
                          max_length = max_length
                          )
    valid = SQuAD2Dataset(data_list=valid_data, 
                          fields = [
                                    # ('context', self.text),
                                    # ('question', self.text),
                                    ('context_question', self.text),
                                    ('answer_start', self.answer_start),
                                    ('answer_end', self.answer_end),
                                    ('id_index', self.id_index)
                                    ], 
                          use_bos = use_bos,
                          max_length = max_length
                          )
    self.train_id_list = train.id_list
    self.valid_id_list = valid.id_list
    self.train_context_span = train.context_span
    self.valid_context_span = valid.context_span
    self.train_reference = train.reference
    self.valid_reference = valid.reference
    
    self.train_iter = data.BucketIterator(train, batch_size=batch_size,
                                          device=device,
                                          shuffle=shuffle,
                                          sort_key=lambda x: len(x.context_question), 
                                          sort_within_batch = True
                                          )
    self.valid_iter = data.BucketIterator(valid, batch_size=batch_size,
                                          device=device,
                                          shuffle=shuffle,
                                          sort_key=lambda x: len(x.context_question),
                                          sort_within_batch = True
                                          )
    
    self.text.build_vocab(train) # NOTE: min_freq is not passed here, so every training token enters the vocabulary


train_dataset = squad_dataset['train']
valid_dataset = squad_dataset['validation']

print('# of train data : {}'.format(len(train_dataset)))
print('# of vaild data : {}'.format(len(valid_dataset)))

batch_size = 128
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
max_length = 253 # 255 - 2 for <BOS> and <EOS>
min_freq = 2
use_bos = False
use_eos = False
print("device : ", device)

loader = SQuAD2Dataloader(train_dataset, valid_dataset, batch_size=batch_size, 
                          device=device, max_length=max_length, min_freq=min_freq,
                          use_bos=use_bos, use_eos=use_eos)
print('\nFinish making the dataloader')
print("batch_size : ", batch_size)
print("max_length : ", max_length)
print('# of used train data ~ {}'.format((len(loader.train_iter)) * batch_size))
print('# of used vaild data ~ {}'.format((len(loader.valid_iter)) * batch_size))

vocab = loader.text.vocab
vocab_list = list(vocab.stoi.keys())
print('# of vocab : {}'.format(len(vocab_list)))





**Result 3.2.1 (Preprocessing)**

```
# of train data : 87599
# of vaild data : 10570
device :  cuda

Finish making the dataloader
batch_size :  128
max_length :  253
# of used train data ~ 81408
# of used vaild data ~ 9856
# of vocab : 86390
```

**Answer 3.2**



In [None]:
import torch.nn as nn
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device : ", device)

# Vocabulary : Use vocab_list

# Construct the LSTM Model
embedding_dim = 128 # word embedding size
hidden_dim = 128
n_layers = 3
n_label = max_length+1 if use_bos else max_length
emb_dropout = 0.5
rnn_dropout = 0.5
bidirectional = True
enable_layer_norm = True
rnnmodel = ClassificationLSTMModel(embedding_dim, hidden_dim, n_layers, n_label, emb_dropout, rnn_dropout, bidirectional, enable_layer_norm, device).to(device)

# Training
epochs = 50
learning_rate = 1e-3
max_norm = 5

print("batch_size : ", batch_size)
print("max_length : ", max_length)
print("embedding_dim : ", embedding_dim)
print("hidden_dim : ", hidden_dim)
print("n_layers : ", n_layers)
print("emb_dropout : ", emb_dropout)
print("rnn_dropout_and_fc_dropout : ", rnn_dropout)
if bidirectional:
    print("bidirectional : True")
else:
    print("bidirectional : False")
if enable_layer_norm:
    print("enable_layer_norm : True")
else:
    print("enable_layer_norm : False")
    
print("learning_rate : ", learning_rate)

# Construct the data loader
train_iter = loader.train_iter
valid_iter = loader.valid_iter

train_id_list = loader.train_id_list
valid_id_list = loader.valid_id_list
train_context_span = loader.train_context_span
valid_context_span = loader.valid_context_span
train_reference = loader.train_reference
valid_reference = loader.valid_reference

PAD_IDX = vocab.stoi['<pad>']
cel = nn.CrossEntropyLoss() # targets are answer positions, not vocabulary ids, so ignore_index=PAD_IDX would wrongly drop answers at that position
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=learning_rate)

# Evaluate
squad_metric = load_metric('squad') # get_tokens() in squad_metric is not exactly the same as my preprocess()


for epoch in tqdm(range(epochs)):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    train_prediction = list()
    for train_i, train_batch in enumerate(train_iter):
        context, context_length = train_batch.context_question
        answer_start = train_batch.answer_start
        answer_end = train_batch.answer_end
        train_id_index = train_batch.id_index

        logits_start, logits_end = rnnmodel(context, context_length)

        optimizer.zero_grad() # reset process
        loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L

        loss.backward() # compute gradients
        # print(torch.norm(rnnmodel.lstm.weight_hh_l0.grad), loss.item())
        torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), max_norm) # gradient clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_start_preds = torch.max(logits_start, 1)
        _, train_end_preds = torch.max(logits_end, 1)
        # train_accuracy += ((train_start_preds == answer_start) * (train_end_preds == answer_end)).sum().float()

        train_data_num += context.shape[0]

        for train_j in range(context.shape[0]):
            current_id = train_id_index[train_j]
            start = train_start_preds[train_j]
            end = train_end_preds[train_j]

            context_sentence = train_context_span[current_id]['context']
            context_span = train_context_span[current_id]['span']
            pred_text = ""
            if start < end and end < len(context_span):
                pred_text = context_sentence[context_span[start][0]:context_span[end][1]]
            
            # start = answer_start[train_j]
            # end = answer_end[train_j]
            # answer_text = context_sentence[context_span[start][0]:context_span[end][1]+1]
            # print(pred_text, answer_text, train_reference[current_id], "\n")

            train_prediction.append({'id':train_id_list[current_id], 'prediction_text':pred_text})

    train_result = squad_metric.compute(predictions=train_prediction, references=train_reference)
    print('train:: Epoch:', '%04d' % (epoch + 1), 
          'cost =', '{:.6f},'.format(train_loss / train_data_num), 
        #   'my exact_match =', '{:.6f}'.format(train_accuracy / train_data_num),
          'squad_metric : ', train_result
          )
        
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            valid_prediction = list()
            for valid_i, valid_batch in enumerate(valid_iter):
                context, context_length = valid_batch.context_question
                answer_start = valid_batch.answer_start
                answer_end = valid_batch.answer_end
                valid_id_index = valid_batch.id_index

                logits_start, logits_end = rnnmodel(context, context_length)

                loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_start_preds = torch.max(logits_start, 1)
                _, valid_end_preds = torch.max(logits_end, 1)
                # valid_accuracy += ((valid_start_preds == answer_start) * (valid_end_preds == answer_end)).sum().float()

                valid_data_num += context.shape[0]

                for valid_j in range(context.shape[0]):
                    current_id = valid_id_index[valid_j]
                    start = valid_start_preds[valid_j]
                    end = valid_end_preds[valid_j]

                    context_sentence = valid_context_span[current_id]['context']
                    context_span = valid_context_span[current_id]['span']
                    pred_text = ""
                    if start < end and end < len(context_span):
                        pred_text = context_sentence[context_span[start][0]:context_span[end][1]]
                    
                    # start = answer_start[valid_j]
                    # end = answer_end[valid_j]
                    # answer_text = context_sentence[context_span[start][0]:context_span[end][1]+1]
                    # print(pred_text, answer_text, valid_reference[current_id], "\n")

                    valid_prediction.append({'id':valid_id_list[current_id], 'prediction_text':pred_text})
                
            valid_result = squad_metric.compute(predictions=valid_prediction, references=valid_reference)
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                  'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                #   'my exact_match =', '{:.6f},'.format(valid_accuracy / valid_data_num),
                  'squad_metric : ', valid_result
                 )
            

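**Comment 3.2**

Much better performance than in Prob 3.1: the best validation `exact_match` rises from about 3.7 to 12.4, and `f1` from about 8.1 to 18.8.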

Before (Prob 3.1) : 
```
train:: Epoch: 0017 cost = 0.052682, squad_metric :  {'exact_match': 4.539975399753997, 'f1': 8.803159052462188}
valid:: Epoch: 0017 cost = 0.060593, squad_metric :  {'exact_match': 3.7323299094884574, 'f1': 8.117706658077221}
```

After (Prob 3.2) : 
```
Lowest Valid Loss (Cost) ::
valid:: Epoch: 0026 cost = 0.053489, squad_metric :  {'exact_match': 10.403742499745753, 'f1': 16.172590913652723}

Best Accuracy :: (it decreases after this, up to the 100th epoch)
train:: Epoch: 0050 cost = 0.033762, squad_metric :  {'exact_match': 20.202952029520294, 'f1': 28.131849215758283}
valid:: Epoch: 0050 cost = 0.057306, squad_metric :  {'exact_match': 12.407200244076071, 'f1': 18.79763380045527}
```
-> Using 3 layers or other settings gives similar results
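
The two picks above (lowest validation cost vs. best validation accuracy) can be read off the log mechanically. A small sketch, assuming log lines in exactly the format printed by this notebook; `parse_log` is a hypothetical helper, and the three sample lines are copied from Result 3.2.2:

```python
import ast
import re

LINE_RE = re.compile(
    r"(train|valid):: Epoch: (\d+) cost = ([\d.]+), squad_metric :\s+(\{.*?\})"
)


def parse_log(text):
    """Return one dict per log line with split, epoch, cost, exact_match and f1."""
    rows = []
    for split, epoch, cost, metrics in LINE_RE.findall(text):
        row = {"split": split, "epoch": int(epoch), "cost": float(cost)}
        row.update(ast.literal_eval(metrics))  # the metric dict is a Python literal
        rows.append(row)
    return rows


log = """
valid:: Epoch: 0024 cost = 0.053541, squad_metric :  {'exact_match': 9.803722160073223, 'f1': 15.722414091514771}
valid:: Epoch: 0025 cost = 0.054125, squad_metric :  {'exact_match': 10.271534628292484, 'f1': 16.509836370890504}
valid:: Epoch: 0026 cost = 0.053489, squad_metric :  {'exact_match': 10.403742499745753, 'f1': 16.172590913652723}
"""

valid_rows = [r for r in parse_log(log) if r["split"] == "valid"]
lowest_cost = min(valid_rows, key=lambda r: r["cost"])
best_f1 = max(valid_rows, key=lambda r: r["f1"])
print(lowest_cost["epoch"], best_f1["epoch"])  # 26 25
```

Even within these three epochs the lowest cost and the best F1 fall on different epochs, which is why I report both.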


**Result 3.2.2 (Training & Validation)**

```
device :  cuda
batch_size :  128
max_length :  253
embedding_dim :  128
hidden_dim :  128
n_layers :  2
emb_dropout :  0.5
rnn_dropout_and_fc_dropout :  0.5
bidirectional : True
enable_layer_norm : True
learning_rate :  0.001
100%
100/100 [1:51:32<00:00, 66.92s/it]
train:: Epoch: 0001 cost = 0.065915, squad_metric :  {'exact_match': 1.831488314883149, 'f1': 5.6030915428402555}
valid:: Epoch: 0001 cost = 0.063906, squad_metric :  {'exact_match': 2.471270212549578, 'f1': 6.824002713974203}
train:: Epoch: 0002 cost = 0.061610, squad_metric :  {'exact_match': 2.5940959409594098, 'f1': 6.373685500606473}
valid:: Epoch: 0002 cost = 0.062415, squad_metric :  {'exact_match': 3.4068951489881014, 'f1': 7.220094547220149}
train:: Epoch: 0003 cost = 0.060113, squad_metric :  {'exact_match': 3.041820418204182, 'f1': 6.888533255208931}
valid:: Epoch: 0003 cost = 0.061853, squad_metric :  {'exact_match': 3.4374046577850095, 'f1': 7.596915873555304}
train:: Epoch: 0004 cost = 0.059059, squad_metric :  {'exact_match': 3.2853628536285364, 'f1': 7.1800294259277795}
valid:: Epoch: 0004 cost = 0.061258, squad_metric :  {'exact_match': 3.3560459676599206, 'f1': 7.642649897310097}
train:: Epoch: 0005 cost = 0.058154, squad_metric :  {'exact_match': 3.4538745387453873, 'f1': 7.466333179048848}
valid:: Epoch: 0005 cost = 0.060824, squad_metric :  {'exact_match': 3.4475744940506456, 'f1': 8.075712455467693}
train:: Epoch: 0006 cost = 0.056687, squad_metric :  {'exact_match': 4.601476014760148, 'f1': 9.163396904572009}
valid:: Epoch: 0006 cost = 0.058756, squad_metric :  {'exact_match': 5.288314858130784, 'f1': 10.492455047059936}
train:: Epoch: 0007 cost = 0.054387, squad_metric :  {'exact_match': 6.009840098400984, 'f1': 11.107497678592521}
valid:: Epoch: 0007 cost = 0.058129, squad_metric :  {'exact_match': 5.512051255974779, 'f1': 10.615991024594434}
train:: Epoch: 0008 cost = 0.053317, squad_metric :  {'exact_match': 6.376383763837638, 'f1': 11.598137045952164}
valid:: Epoch: 0008 cost = 0.057132, squad_metric :  {'exact_match': 6.091731923116038, 'f1': 11.571220423813529}
train:: Epoch: 0009 cost = 0.052425, squad_metric :  {'exact_match': 6.645756457564576, 'f1': 12.067977465751273}
valid:: Epoch: 0009 cost = 0.056873, squad_metric :  {'exact_match': 6.752771280382386, 'f1': 11.98787253008515}
train:: Epoch: 0010 cost = 0.051378, squad_metric :  {'exact_match': 7.592865928659287, 'f1': 13.18126304252747}
valid:: Epoch: 0010 cost = 0.056349, squad_metric :  {'exact_match': 7.362961456320553, 'f1': 12.491330908418615}
train:: Epoch: 0011 cost = 0.050625, squad_metric :  {'exact_match': 8.22509225092251, 'f1': 13.909443005132173}
valid:: Epoch: 0011 cost = 0.056135, squad_metric :  {'exact_match': 7.464659818976915, 'f1': 13.079978771771689}
train:: Epoch: 0012 cost = 0.049965, squad_metric :  {'exact_match': 8.575645756457565, 'f1': 14.356301075042378}
valid:: Epoch: 0012 cost = 0.055828, squad_metric :  {'exact_match': 7.657886708024001, 'f1': 13.151851682816195}
train:: Epoch: 0013 cost = 0.049443, squad_metric :  {'exact_match': 8.826568265682656, 'f1': 14.49584519741705}
valid:: Epoch: 0013 cost = 0.055689, squad_metric :  {'exact_match': 8.034170649852538, 'f1': 13.876875284508266}
train:: Epoch: 0014 cost = 0.048815, squad_metric :  {'exact_match': 8.96309963099631, 'f1': 14.8451849634834}
valid:: Epoch: 0014 cost = 0.055366, squad_metric :  {'exact_match': 8.085019831180718, 'f1': 13.664861901052312}
train:: Epoch: 0015 cost = 0.048145, squad_metric :  {'exact_match': 9.17220172201722, 'f1': 15.038099097013292}
valid:: Epoch: 0015 cost = 0.055125, squad_metric :  {'exact_match': 8.288416556493441, 'f1': 13.844800621792489}
train:: Epoch: 0016 cost = 0.047499, squad_metric :  {'exact_match': 9.563345633456334, 'f1': 15.542709741829917}
valid:: Epoch: 0016 cost = 0.055032, squad_metric :  {'exact_match': 8.644360825790704, 'f1': 14.035106664498212}
train:: Epoch: 0017 cost = 0.046962, squad_metric :  {'exact_match': 9.84870848708487, 'f1': 15.922161801908059}
valid:: Epoch: 0017 cost = 0.054708, squad_metric :  {'exact_match': 8.664700498321977, 'f1': 14.61675786628199}
train:: Epoch: 0018 cost = 0.046354, squad_metric :  {'exact_match': 10.194341943419435, 'f1': 16.330892496035965}
valid:: Epoch: 0018 cost = 0.054640, squad_metric :  {'exact_match': 9.203701820400692, 'f1': 14.829749625301385}
train:: Epoch: 0019 cost = 0.045840, squad_metric :  {'exact_match': 10.665436654366543, 'f1': 16.839291029629607}
valid:: Epoch: 0019 cost = 0.054670, squad_metric :  {'exact_match': 8.847757551103427, 'f1': 14.676242165810274}
train:: Epoch: 0020 cost = 0.045346, squad_metric :  {'exact_match': 10.70110701107011, 'f1': 16.93470377106904}
valid:: Epoch: 0020 cost = 0.054627, squad_metric :  {'exact_match': 9.335909691853962, 'f1': 15.135763860805596}
train:: Epoch: 0021 cost = 0.044856, squad_metric :  {'exact_match': 11.153751537515376, 'f1': 17.45634459567992}
valid:: Epoch: 0021 cost = 0.054354, squad_metric :  {'exact_match': 9.335909691853962, 'f1': 14.770820339318377}
train:: Epoch: 0022 cost = 0.044357, squad_metric :  {'exact_match': 11.6309963099631, 'f1': 17.96955037604607}
valid:: Epoch: 0022 cost = 0.053942, squad_metric :  {'exact_match': 9.763042815010678, 'f1': 15.61061819504899}
train:: Epoch: 0023 cost = 0.043858, squad_metric :  {'exact_match': 11.817958179581796, 'f1': 18.127874016031576}
valid:: Epoch: 0023 cost = 0.054091, squad_metric :  {'exact_match': 10.057968066714126, 'f1': 16.123198602298523}
train:: Epoch: 0024 cost = 0.043312, squad_metric :  {'exact_match': 12.365313653136532, 'f1': 18.856645672005456}
valid:: Epoch: 0024 cost = 0.053541, squad_metric :  {'exact_match': 9.803722160073223, 'f1': 15.722414091514771}
train:: Epoch: 0025 cost = 0.042859, squad_metric :  {'exact_match': 12.715867158671587, 'f1': 19.33001363819892}
valid:: Epoch: 0025 cost = 0.054125, squad_metric :  {'exact_match': 10.271534628292484, 'f1': 16.509836370890504}
train:: Epoch: 0026 cost = 0.042346, squad_metric :  {'exact_match': 12.733087330873309, 'f1': 19.291530383036037}
valid:: Epoch: 0026 cost = 0.053489, squad_metric :  {'exact_match': 10.403742499745753, 'f1': 16.172590913652723} -> lowest valid loss(cost)
train:: Epoch: 0027 cost = 0.041864, squad_metric :  {'exact_match': 13.346863468634686, 'f1': 20.111493559554685}
valid:: Epoch: 0027 cost = 0.054237, squad_metric :  {'exact_match': 10.576629716261568, 'f1': 16.29722519285199}
train:: Epoch: 0028 cost = 0.041457, squad_metric :  {'exact_match': 13.795817958179581, 'f1': 20.551966333297944}
valid:: Epoch: 0028 cost = 0.054089, squad_metric :  {'exact_match': 10.352893318417573, 'f1': 16.44364839988443}
train:: Epoch: 0029 cost = 0.041006, squad_metric :  {'exact_match': 13.985239852398523, 'f1': 20.84467734900896}
valid:: Epoch: 0029 cost = 0.054058, squad_metric :  {'exact_match': 10.688497915183566, 'f1': 16.93774885759844}
train:: Epoch: 0030 cost = 0.040610, squad_metric :  {'exact_match': 14.404674046740467, 'f1': 21.23351628514915}
valid:: Epoch: 0030 cost = 0.054174, squad_metric :  {'exact_match': 11.176650055934099, 'f1': 17.123895216357646}
train:: Epoch: 0031 cost = 0.040201, squad_metric :  {'exact_match': 14.469864698646987, 'f1': 21.44862837078835}
valid:: Epoch: 0031 cost = 0.055336, squad_metric :  {'exact_match': 11.380046781246822, 'f1': 17.433383942744744}
train:: Epoch: 0032 cost = 0.039767, squad_metric :  {'exact_match': 14.75768757687577, 'f1': 21.770707802468173}
valid:: Epoch: 0032 cost = 0.054204, squad_metric :  {'exact_match': 11.034272348215193, 'f1': 17.305017194698312}
train:: Epoch: 0033 cost = 0.039367, squad_metric :  {'exact_match': 15.191881918819188, 'f1': 22.239563024323438}
valid:: Epoch: 0033 cost = 0.055078, squad_metric :  {'exact_match': 11.420726126309367, 'f1': 17.5118477027124}
train:: Epoch: 0034 cost = 0.039054, squad_metric :  {'exact_match': 15.533825338253383, 'f1': 22.642943385436844}
valid:: Epoch: 0034 cost = 0.054161, squad_metric :  {'exact_match': 11.074951693277738, 'f1': 17.20631077339324}
train:: Epoch: 0035 cost = 0.038645, squad_metric :  {'exact_match': 15.74169741697417, 'f1': 22.922018448053343}
valid:: Epoch: 0035 cost = 0.054278, squad_metric :  {'exact_match': 11.613953015356453, 'f1': 18.151906815073925}
train:: Epoch: 0036 cost = 0.038290, squad_metric :  {'exact_match': 16.121771217712176, 'f1': 23.412312859552326}
valid:: Epoch: 0036 cost = 0.055055, squad_metric :  {'exact_match': 11.898708430794263, 'f1': 18.264296985917255}
train:: Epoch: 0037 cost = 0.037895, squad_metric :  {'exact_match': 16.26568265682657, 'f1': 23.627408553167754}
valid:: Epoch: 0037 cost = 0.055274, squad_metric :  {'exact_match': 11.573273670293908, 'f1': 18.00834918259013}
train:: Epoch: 0038 cost = 0.037600, squad_metric :  {'exact_match': 16.41820418204182, 'f1': 23.775117724618738}
valid:: Epoch: 0038 cost = 0.055362, squad_metric :  {'exact_match': 12.010576629716262, 'f1': 18.43531629648551}
train:: Epoch: 0039 cost = 0.037273, squad_metric :  {'exact_match': 17.04059040590406, 'f1': 24.463296120113775}
valid:: Epoch: 0039 cost = 0.054763, squad_metric :  {'exact_match': 11.919048103325537, 'f1': 18.17531778533365}
train:: Epoch: 0040 cost = 0.036804, squad_metric :  {'exact_match': 17.30627306273063, 'f1': 24.802421982963637}
valid:: Epoch: 0040 cost = 0.055043, squad_metric :  {'exact_match': 12.030916302247533, 'f1': 18.680133064371933}
train:: Epoch: 0041 cost = 0.036571, squad_metric :  {'exact_match': 17.60270602706027, 'f1': 25.060528770688975}
valid:: Epoch: 0041 cost = 0.055297, squad_metric :  {'exact_match': 12.336011390216617, 'f1': 18.786567305589674}
train:: Epoch: 0042 cost = 0.036218, squad_metric :  {'exact_match': 17.79089790897909, 'f1': 25.245082246017937}
valid:: Epoch: 0042 cost = 0.055380, squad_metric :  {'exact_match': 12.051255974778806, 'f1': 18.49026799780359}
train:: Epoch: 0043 cost = 0.035909, squad_metric :  {'exact_match': 18.062730627306273, 'f1': 25.730630204611998}
valid:: Epoch: 0043 cost = 0.055506, squad_metric :  {'exact_match': 12.203803518763348, 'f1': 18.670169966949498}
train:: Epoch: 0044 cost = 0.035703, squad_metric :  {'exact_match': 18.36408364083641, 'f1': 26.12316664968998}
valid:: Epoch: 0044 cost = 0.056148, squad_metric :  {'exact_match': 12.610596969388792, 'f1': 19.088426408066756}
train:: Epoch: 0045 cost = 0.035274, squad_metric :  {'exact_match': 18.735547355473553, 'f1': 26.45695740597906}
valid:: Epoch: 0045 cost = 0.056149, squad_metric :  {'exact_match': 12.152954337435167, 'f1': 18.39536052846257}
train:: Epoch: 0046 cost = 0.034942, squad_metric :  {'exact_match': 19.023370233702337, 'f1': 26.80306566315575}
valid:: Epoch: 0046 cost = 0.056176, squad_metric :  {'exact_match': 12.600427133123157, 'f1': 19.055526981715527}
train:: Epoch: 0047 cost = 0.034627, squad_metric :  {'exact_match': 19.394833948339482, 'f1': 27.218841859179573}
valid:: Epoch: 0047 cost = 0.056190, squad_metric :  {'exact_match': 12.488558934201158, 'f1': 18.966638160046813}
train:: Epoch: 0048 cost = 0.034356, squad_metric :  {'exact_match': 19.654366543665436, 'f1': 27.441058984250834}
valid:: Epoch: 0048 cost = 0.056246, squad_metric :  {'exact_match': 12.346181226482253, 'f1': 19.070243088408173}
train:: Epoch: 0049 cost = 0.034025, squad_metric :  {'exact_match': 19.60147601476015, 'f1': 27.574851474592617}
valid:: Epoch: 0049 cost = 0.057362, squad_metric :  {'exact_match': 12.132614664903896, 'f1': 18.786286669756787}
train:: Epoch: 0050 cost = 0.033762, squad_metric :  {'exact_match': 20.202952029520294, 'f1': 28.131849215758283}
valid:: Epoch: 0050 cost = 0.057306, squad_metric :  {'exact_match': 12.407200244076071, 'f1': 18.79763380045527}
```

**Result 3.2.2 (n_layers = 3, compare with Prob 4.2)**

```
device :  cuda
batch_size :  128
max_length :  253
embedding_dim :  128
hidden_dim :  128
n_layers :  3
emb_dropout :  0.5
rnn_dropout_and_fc_dropout :  0.5
bidirectional : True
enable_layer_norm : True
learning_rate :  0.001
70%
35/50 [53:32<22:50, 91.40s/it]
train:: Epoch: 0001 cost = 0.065585, squad_metric :  {'exact_match': 1.976629766297663, 'f1': 5.764914304293227}
valid:: Epoch: 0001 cost = 0.063549, squad_metric :  {'exact_match': 3.193328587409743, 'f1': 7.116450009478686}
train:: Epoch: 0002 cost = 0.061254, squad_metric :  {'exact_match': 2.8794587945879457, 'f1': 6.609257645917817}
valid:: Epoch: 0002 cost = 0.062051, squad_metric :  {'exact_match': 3.0204413708939284, 'f1': 7.277975980680746}
train:: Epoch: 0003 cost = 0.059870, squad_metric :  {'exact_match': 3.265682656826568, 'f1': 7.09678259751752}
valid:: Epoch: 0003 cost = 0.061299, squad_metric :  {'exact_match': 3.0306112071595646, 'f1': 7.579524635346263}
train:: Epoch: 0004 cost = 0.058775, squad_metric :  {'exact_match': 3.5018450184501844, 'f1': 7.4881703613338795}
valid:: Epoch: 0004 cost = 0.060763, squad_metric :  {'exact_match': 3.467914166581918, 'f1': 7.86356843769654}
train:: Epoch: 0005 cost = 0.057868, squad_metric :  {'exact_match': 3.4710947109471095, 'f1': 7.604822338629064}
valid:: Epoch: 0005 cost = 0.060409, squad_metric :  {'exact_match': 3.4984236753788265, 'f1': 8.034024162560828}
train:: Epoch: 0006 cost = 0.057160, squad_metric :  {'exact_match': 3.676506765067651, 'f1': 7.90582171613864}
valid:: Epoch: 0006 cost = 0.060061, squad_metric :  {'exact_match': 3.640801383097732, 'f1': 8.1366395649221}
train:: Epoch: 0007 cost = 0.056533, squad_metric :  {'exact_match': 3.7146371463714636, 'f1': 7.834796650907879}
valid:: Epoch: 0007 cost = 0.060067, squad_metric :  {'exact_match': 3.213668259941015, 'f1': 7.523847773106492}
train:: Epoch: 0008 cost = 0.056002, squad_metric :  {'exact_match': 3.9003690036900367, 'f1': 8.061712280256744}
valid:: Epoch: 0008 cost = 0.059820, squad_metric :  {'exact_match': 3.630631546832096, 'f1': 8.201126776978658}
train:: Epoch: 0009 cost = 0.055464, squad_metric :  {'exact_match': 4.108241082410824, 'f1': 8.335688033457552}
valid:: Epoch: 0009 cost = 0.059394, squad_metric :  {'exact_match': 3.6814807281602766, 'f1': 8.305539520960904}
train:: Epoch: 0010 cost = 0.053273, squad_metric :  {'exact_match': 6.017220172201722, 'f1': 10.970731510033588}
valid:: Epoch: 0010 cost = 0.056465, squad_metric :  {'exact_match': 6.071392250584766, 'f1': 11.494156628386047}
train:: Epoch: 0011 cost = 0.051256, squad_metric :  {'exact_match': 7.586715867158672, 'f1': 13.015780422300661}
valid:: Epoch: 0011 cost = 0.055608, squad_metric :  {'exact_match': 7.952811959727448, 'f1': 13.458483212774043}
train:: Epoch: 0012 cost = 0.049822, squad_metric :  {'exact_match': 8.745387453874539, 'f1': 14.545286067874851}
valid:: Epoch: 0012 cost = 0.054885, squad_metric :  {'exact_match': 7.413810637648734, 'f1': 12.711715190940277}
train:: Epoch: 0013 cost = 0.048763, squad_metric :  {'exact_match': 9.364083640836409, 'f1': 15.317276722455547}
valid:: Epoch: 0013 cost = 0.054469, squad_metric :  {'exact_match': 8.674870334587613, 'f1': 14.11292198214176}
train:: Epoch: 0014 cost = 0.048062, squad_metric :  {'exact_match': 9.880688806888068, 'f1': 16.008796867005916}
valid:: Epoch: 0014 cost = 0.054117, squad_metric :  {'exact_match': 9.112173294009967, 'f1': 15.204945146937826}
train:: Epoch: 0015 cost = 0.047370, squad_metric :  {'exact_match': 10.351783517835178, 'f1': 16.559673147276335}
valid:: Epoch: 0015 cost = 0.053597, squad_metric :  {'exact_match': 9.315570019322688, 'f1': 15.197528404735717}
train:: Epoch: 0016 cost = 0.046616, squad_metric :  {'exact_match': 10.928659286592866, 'f1': 17.288328297311928}
valid:: Epoch: 0016 cost = 0.053512, squad_metric :  {'exact_match': 9.763042815010678, 'f1': 15.879120888615722}
train:: Epoch: 0017 cost = 0.046096, squad_metric :  {'exact_match': 11.052890528905289, 'f1': 17.57585102332387}
valid:: Epoch: 0017 cost = 0.052808, squad_metric :  {'exact_match': 9.641004779823044, 'f1': 15.84140623938965}
train:: Epoch: 0018 cost = 0.045390, squad_metric :  {'exact_match': 11.60270602706027, 'f1': 18.241460727197175}
valid:: Epoch: 0018 cost = 0.052952, squad_metric :  {'exact_match': 10.129156920573578, 'f1': 16.280470230744513}
train:: Epoch: 0019 cost = 0.044803, squad_metric :  {'exact_match': 12.020910209102091, 'f1': 18.857435473034187}
valid:: Epoch: 0019 cost = 0.053431, squad_metric :  {'exact_match': 10.485101189870843, 'f1': 16.458671978283423}
train:: Epoch: 0020 cost = 0.044382, squad_metric :  {'exact_match': 12.361623616236162, 'f1': 19.16388012805624}
valid:: Epoch: 0020 cost = 0.052247, squad_metric :  {'exact_match': 10.586799552527204, 'f1': 16.91631801099672}
train:: Epoch: 0021 cost = 0.043786, squad_metric :  {'exact_match': 12.725707257072571, 'f1': 19.56743338503797}
valid:: Epoch: 0021 cost = 0.052441, squad_metric :  {'exact_match': 11.04444218448083, 'f1': 17.326214816620112}
train:: Epoch: 0022 cost = 0.043256, squad_metric :  {'exact_match': 13.334563345633457, 'f1': 20.35287519952955}
valid:: Epoch: 0022 cost = 0.051849, squad_metric :  {'exact_match': 11.105461202074647, 'f1': 17.55276943762167}
train:: Epoch: 0023 cost = 0.042928, squad_metric :  {'exact_match': 13.367773677736777, 'f1': 20.340014918348974}
valid:: Epoch: 0023 cost = 0.051963, squad_metric :  {'exact_match': 11.400386453778093, 'f1': 17.55979109536984}
train:: Epoch: 0024 cost = 0.042540, squad_metric :  {'exact_match': 13.777367773677737, 'f1': 20.893844978993524}
valid:: Epoch: 0024 cost = 0.052338, squad_metric :  {'exact_match': 11.258008746059188, 'f1': 17.456402192377052}
train:: Epoch: 0025 cost = 0.042046, squad_metric :  {'exact_match': 14.141451414514146, 'f1': 21.306083351355298}
valid:: Epoch: 0025 cost = 0.052068, squad_metric :  {'exact_match': 11.624122851622088, 'f1': 17.857217922771888}
train:: Epoch: 0026 cost = 0.041597, squad_metric :  {'exact_match': 14.407134071340714, 'f1': 21.607037752218865}
valid:: Epoch: 0026 cost = 0.051999, squad_metric :  {'exact_match': 12.183463846232076, 'f1': 18.62646442022408}
train:: Epoch: 0027 cost = 0.041216, squad_metric :  {'exact_match': 14.708487084870848, 'f1': 21.887353261890084}
valid:: Epoch: 0027 cost = 0.051823, squad_metric :  {'exact_match': 11.695311705481542, 'f1': 18.1139621265219}
train:: Epoch: 0028 cost = 0.040778, squad_metric :  {'exact_match': 15.141451414514146, 'f1': 22.470419287053442}
valid:: Epoch: 0028 cost = 0.051810, squad_metric :  {'exact_match': 12.519068442998067, 'f1': 18.98805961206842}
train:: Epoch: 0029 cost = 0.040352, squad_metric :  {'exact_match': 15.71709717097171, 'f1': 23.148573468189998}
valid:: Epoch: 0029 cost = 0.052474, squad_metric :  {'exact_match': 12.061425811044442, 'f1': 18.173078033524344}
train:: Epoch: 0030 cost = 0.039955, squad_metric :  {'exact_match': 15.862238622386224, 'f1': 23.353634435533106}
valid:: Epoch: 0030 cost = 0.052704, squad_metric :  {'exact_match': 11.959727448388081, 'f1': 18.415472454343917}
train:: Epoch: 0031 cost = 0.039642, squad_metric :  {'exact_match': 16.26568265682657, 'f1': 23.760474477553394}
valid:: Epoch: 0031 cost = 0.052407, squad_metric :  {'exact_match': 12.936031729889148, 'f1': 19.30269611152689}
train:: Epoch: 0032 cost = 0.039230, squad_metric :  {'exact_match': 16.79458794587946, 'f1': 24.25346386044038}
valid:: Epoch: 0032 cost = 0.051929, squad_metric :  {'exact_match': 12.498728770466796, 'f1': 18.947009615337333}
train:: Epoch: 0033 cost = 0.038894, squad_metric :  {'exact_match': 17.091020910209103, 'f1': 24.69983245105435}
valid:: Epoch: 0033 cost = 0.052002, squad_metric :  {'exact_match': 12.936031729889148, 'f1': 19.778920040533272}
train:: Epoch: 0034 cost = 0.038637, squad_metric :  {'exact_match': 17.10209102091021, 'f1': 24.680818824499696}
valid:: Epoch: 0034 cost = 0.052356, squad_metric :  {'exact_match': 13.129258618936236, 'f1': 19.65156620066013}
train:: Epoch: 0035 cost = 0.038106, squad_metric :  {'exact_match': 17.595325953259533, 'f1': 25.38333790763741}
valid:: Epoch: 0035 cost = 0.052579, squad_metric :  {'exact_match': 13.149598291467507, 'f1': 19.549002646178028}
```



## 4. LSTM + Attention for SQuAD

**Problem 4.1** *(20 points)* Here, we will append an attention layer on top of the LSTM outputs. We will use a single-head attention sublayer from the Transformer. That is, you will implement
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$
where $Q, K, V$ are obtained by linear transformations of the LSTM output hidden states $H$, i.e. $Q = HW^Q, K=HW^K, V=HW^V$ ($W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are trainable weights). Note that the output of the $\text{Attention}$ layer has the same dimension as $H$, so you can directly append your token classification layer on top of it. Report the accuracy and compare it with 3.2.
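
The formula above can be sketched as a small stand-alone PyTorch module. This is only a minimal illustration under stated assumptions: the class name `SingleHeadAttention`, the toy sizes, and the random input are hypothetical and not part of the assignment, and no padding mask is applied.

```python
import math

import torch
import torch.nn as nn


class SingleHeadAttention(nn.Module):
    """Single-head scaled dot-product self-attention over hidden states H (illustrative)."""

    def __init__(self, d):
        super().__init__()
        # W^Q, W^K, W^V in R^{d x d}, without bias, as in the formula
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.d = d

    def forward(self, h):
        # h: (batch, length, d)
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        # scores[b, i, j] = <q_i, k_j> / sqrt(d)
        scores = torch.matmul(q, k.transpose(1, 2)) / math.sqrt(self.d)
        # Softmax over keys j, so each query row i sums to 1
        weights = torch.softmax(scores, dim=-1)
        # out[b, i, :] = sum_j weights[b, i, j] * v[b, j, :]
        out = torch.matmul(weights, v)
        return out, weights


torch.manual_seed(0)
attn = SingleHeadAttention(d=8)   # toy dimension, chosen only for the example
h = torch.randn(2, 5, 8)          # (batch=2, length=5, d=8)
out, weights = attn(h)
print(out.shape)                  # torch.Size([2, 5, 8]), same shape as h
```

Because the output has the same shape as $H$, a per-token classification layer such as `nn.Linear(d, 1)` can be applied to it directly.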



**Answer 4.1**



In [None]:
import torch.nn as nn
import numpy as np
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class AttentionLSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers, n_label, emb_dropout, rnn_dropout, bidirectional, enable_layer_norm, device):
        super(AttentionLSTMModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, 
                            hidden_size=hidden_dim, 
                            num_layers=n_layers, 
                            dropout=rnn_dropout if n_layers > 1 else 0.0,  
                            bidirectional=bidirectional)
      
        n_direction = 2 if bidirectional else 1
        self.new_hidden_dim = hidden_dim*n_direction
        self.fc_start = nn.Linear(self.new_hidden_dim, 1, bias=True)
        self.fc_end = nn.Linear(self.new_hidden_dim, 1, bias=True)

        # Layer_normalization
        self.enable_layer_norm = enable_layer_norm
        if enable_layer_norm:
            self.norm1 = nn.LayerNorm(embedding_dim)
            self.norm2 = nn.LayerNorm(self.new_hidden_dim)
            self.norm3 = nn.LayerNorm(self.new_hidden_dim)

        # Attention layer
        self.att_weight_Q = nn.Linear(self.new_hidden_dim, self.new_hidden_dim, bias=False)
        self.att_weight_K = nn.Linear(self.new_hidden_dim, self.new_hidden_dim, bias=False)
        self.att_weight_V = nn.Linear(self.new_hidden_dim, self.new_hidden_dim, bias=False)
        self.softmax_layer = nn.Softmax(dim=2)

        self.emb_dropout = nn.Dropout(emb_dropout)
        self.att_dropout = nn.Dropout(rnn_dropout)
        self.fc_dropout = nn.Dropout(rnn_dropout)
        self.bidirectional = bidirectional
        self.device = device

    def forward(self, input_tensor, src_seq_lens):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * embedding_size

        # Layer_normalization
        if self.enable_layer_norm:
            emb = self.norm1(emb)

        emb = self.emb_dropout(emb)
        emb = emb.transpose(0, 1) # emb.shape = len * batch * embedding_size

        # nn.LSTM
        packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        outs, (hn, cn) = self.lstm(packed)
        outs, out_lens = pad_packed_sequence(outs, batch_first=False)  # outs.shape = len * batch * self.new_hidden_dim
        
        outs = outs.transpose(0, 1) # outs.shape = batch * len * self.new_hidden_dim
        
        # Layer_normalization
        if self.enable_layer_norm:
            outs = self.norm2(outs)
        
        outs = self.att_dropout(outs)

        # Attention Mechanism
        att_q = self.att_weight_Q(outs)
        att_k = self.att_weight_K(outs)
        att_v = self.att_weight_V(outs)
        
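        # Dot-product scores: att_qk[b, i, j] = <q_i, k_j>
        att_qk = torch.einsum('bih,bjh->bij', att_q, att_k)
        # Softmax over keys j (dim=2), so each query row i sums to 1
        att_soft_qk = self.softmax_layer(att_qk / np.sqrt(self.new_hidden_dim))
        # Weighted sum of values over keys j: att[b, i, h] = sum_j w[b, i, j] * v[b, j, h]
        att = torch.einsum('bij,bjh->bih', att_soft_qk, att_v)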

        # Layer_normalization
        if self.enable_layer_norm:
            att = self.norm3(att)

        att = self.fc_dropout(att)
        logits_start = self.fc_start(att).squeeze(2)
        logits_end = self.fc_end(att).squeeze(2)

        return (logits_start, logits_end)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device : ", device)

# Vocabulary : Use vocab_list

# Construct the LSTM Model
embedding_dim = 128 # usually bigger, e.g. 128
hidden_dim = 128
n_layers = 2
n_label = max_length+1 if use_bos else max_length
emb_dropout = 0.5
rnn_dropout = 0.5
bidirectional = True
enable_layer_norm = True
rnnmodel = AttentionLSTMModel(embedding_dim, hidden_dim, n_layers, n_label, emb_dropout, rnn_dropout, bidirectional, enable_layer_norm, device).to(device)

print("batch_size : ", batch_size)
print("max_length : ", max_length)
print("embedding_dim : ", embedding_dim)
print("hidden_dim : ", hidden_dim)
print("n_layers : ", n_layers)
print("emb_dropout : ", emb_dropout)
print("rnn_dropout_and_fc_dropout : ", rnn_dropout)
if bidirectional:
    print("bidirectional : True")
else:
    print("bidirectional : False")
if enable_layer_norm:
    print("enable_layer_norm : True")
else:
    print("enable_layer_norm : False")

# Construct the data loader
train_iter = loader.train_iter
valid_iter = loader.valid_iter

train_id_list = loader.train_id_list
valid_id_list = loader.valid_id_list
train_reference = loader.train_reference
valid_reference = loader.valid_reference

# Training
learning_rate = 1e-3
print("learning_rate : ", learning_rate)

PAD_IDX = vocab.stoi['<pad>']
cel = nn.CrossEntropyLoss(ignore_index=PAD_IDX) # Ignore Padding
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=learning_rate)

epochs = 50
max_norm = 5

# Evaluate
squad_metric = load_metric('squad')


for epoch in tqdm(range(epochs)):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    train_prediction = list()
    for train_i, train_batch in enumerate(train_iter):
        context, context_length = train_batch.context_question
        answer_start = train_batch.answer_start
        answer_end = train_batch.answer_end
        train_id_index = train_batch.id_index

        logits_start, logits_end = rnnmodel(context, context_length)

        optimizer.zero_grad() # reset process
        loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
        loss.backward() # compute gradients
        # print(torch.norm(rnnmodel.lstm.weight_hh_l0.grad), loss.item())
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), max_norm) # gradient clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_start_preds = torch.max(logits_start, 1)
        _, train_end_preds = torch.max(logits_end, 1)
        # train_accuracy += ((train_start_preds == answer_start) * (train_end_preds == answer_end)).sum().float()

        train_data_num += context.shape[0]

        for train_j in range(context.shape[0]):
            pred_text = ""
            start = train_start_preds[train_j]
            end = train_end_preds[train_j]
            if start < end:
                pred_text = [vocab_list[text_id] for text_id in context[train_j][start:end+1]]
                pred_text = " ".join(pred_text)

            # start = answer_start[train_j]
            # end = answer_end[train_j]
            # answer_text = [vocab_list[text_id] for text_id in context[train_j][start:end+1]]
            # answer_text = " ".join(answer_text)
            # print(pred_text, answer_text, train_reference[train_id_index[train_j]], "\n")

            train_prediction.append({'id':train_id_list[train_id_index[train_j]], 'prediction_text':pred_text})

    train_result = squad_metric.compute(predictions=train_prediction, references=train_reference)
    print('train:: Epoch:', '%04d' % (epoch + 1), 
          'cost =', '{:.6f},'.format(train_loss / train_data_num), 
        #   'my exact_match =', '{:.6f}'.format(train_accuracy / train_data_num),
          'other_squad_metric : ', train_result)
        
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            valid_prediction = list()
            for valid_i, valid_batch in enumerate(valid_iter):
                context, context_length = valid_batch.context_question
                answer_start = valid_batch.answer_start
                answer_end = valid_batch.answer_end
                valid_id_index = valid_batch.id_index

                logits_start, logits_end = rnnmodel(context, context_length)

                loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_start_preds = torch.max(logits_start, 1)
                _, valid_end_preds = torch.max(logits_end, 1)
                # valid_accuracy += ((valid_start_preds == answer_start) * (valid_end_preds == answer_end)).sum().float()

                valid_data_num += context.shape[0]

                for valid_j in range(context.shape[0]):
                    pred_text = ""
                    start = valid_start_preds[valid_j]
                    end = valid_end_preds[valid_j]
                    if start < end:
                        pred_text = [vocab_list[text_id] for text_id in context[valid_j][start:end+1]]
                        pred_text = " ".join(pred_text)
                    valid_prediction.append({'id':valid_id_list[valid_id_index[valid_j]], 'prediction_text':pred_text})
                
            valid_result = squad_metric.compute(predictions=valid_prediction, references=valid_reference)
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                  'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                #   'my exact_match =', '{:.6f},'.format(valid_accuracy / valid_data_num),
                  'other_squad_metric : ', valid_result)
            

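**Comment 4.1**

This model only adds the attention mechanism on top of the LSTM outputs, together with LayerNorm and dropout.

Training and validation loss decrease faster and reach lower values than in Prob 3.2, but the SQuAD metrics (exact match and F1) are worse.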

Before (Prob 3.2) : 
```
Lowest Valid Loss (Cost) ::
valid:: Epoch: 0026 cost = 0.053489, squad_metric :  {'exact_match': 10.403742499745753, 'f1': 16.172590913652723}

Best Accuracy :: (It decrease until the 100th epoch)
train:: Epoch: 0050 cost = 0.033762, squad_metric :  {'exact_match': 20.202952029520294, 'f1': 28.131849215758283}
valid:: Epoch: 0050 cost = 0.057306, squad_metric :  {'exact_match': 12.407200244076071, 'f1': 18.79763380045527}
```

After (Prob 4.1) : 
```
Lowest Valid Loss (Cost) ::
valid:: Epoch: 0032 cost = 0.050406, other_squad_metric :  {'exact_match': 4.30184074036408, 'f1': 11.474164388062889}
```
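
The "lowest valid loss" entries above are read off the training logs. As a small sketch, the epoch with the lowest validation cost can be extracted from log text in the format printed by the training loop. The helper name `best_valid_epoch` and the regular expression are illustrative and not part of the notebook; the sample lines are copied from the Prob 3.2 log.

```python
import re

# Matches lines such as "valid:: Epoch: 0026 cost = 0.053489, squad_metric : {...}"
LOG_PATTERN = re.compile(r"^(train|valid):: Epoch: (\d+) cost = ([0-9.]+),")


def best_valid_epoch(log_text):
    """Return (epoch, cost) of the 'valid' line with the smallest cost, or None."""
    best = None
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line.strip())
        if match is None or match.group(1) != "valid":
            continue  # skip train lines and anything that is not a log line
        epoch, cost = int(match.group(2)), float(match.group(3))
        if best is None or cost < best[1]:
            best = (epoch, cost)
    return best


sample_log = """
valid:: Epoch: 0025 cost = 0.054125, squad_metric :  {'exact_match': 10.271534628292484, 'f1': 16.509836370890504}
train:: Epoch: 0026 cost = 0.042346, squad_metric :  {'exact_match': 12.733087330873309, 'f1': 19.291530383036037}
valid:: Epoch: 0026 cost = 0.053489, squad_metric :  {'exact_match': 10.403742499745753, 'f1': 16.172590913652723}
valid:: Epoch: 0027 cost = 0.054237, squad_metric :  {'exact_match': 10.576629716261568, 'f1': 16.29722519285199}
"""

print(best_valid_epoch(sample_log))  # (26, 0.053489)
```

The same function also accepts the `other_squad_metric` lines of Prob 4.1, since only the prefix up to the cost value is matched.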

**Result 4.1 (Training & Validation)**
```
device :  cuda
batch_size :  128
max_length :  253
embedding_dim :  128
hidden_dim :  128
n_layers :  2
emb_dropout :  0.5
rnn_dropout_and_fc_dropout :  0.5
bidirectional : True
enable_layer_norm : True
learning_rate :  0.001
84%
42/50 [1:04:37<11:51, 88.89s/it]
train:: Epoch: 0001 cost = 0.066841, other_squad_metric :  {'exact_match': 0.1918819188191882, 'f1': 3.4376693573703014}
valid:: Epoch: 0001 cost = 0.063277, other_squad_metric :  {'exact_match': 0.4373029594223533, 'f1': 4.347033320320883}
train:: Epoch: 0002 cost = 0.060166, other_squad_metric :  {'exact_match': 0.7908979089790897, 'f1': 4.771716845701368}
valid:: Epoch: 0002 cost = 0.060228, other_squad_metric :  {'exact_match': 0.9152852639072511, 'f1': 5.3861041696909995}
train:: Epoch: 0003 cost = 0.057422, other_squad_metric :  {'exact_match': 1.3321033210332103, 'f1': 5.6525669767512685}
valid:: Epoch: 0003 cost = 0.058470, other_squad_metric :  {'exact_match': 1.2813993694701515, 'f1': 6.324177553770438}
train:: Epoch: 0004 cost = 0.055610, other_squad_metric :  {'exact_match': 1.7539975399753998, 'f1': 6.318895777595019}
valid:: Epoch: 0004 cost = 0.057428, other_squad_metric :  {'exact_match': 1.5966642937048714, 'f1': 6.6456229925696775}
train:: Epoch: 0005 cost = 0.054287, other_squad_metric :  {'exact_match': 2.148831488314883, 'f1': 6.829366843979518}
valid:: Epoch: 0005 cost = 0.056535, other_squad_metric :  {'exact_match': 1.647513475033052, 'f1': 6.957331315431395}
train:: Epoch: 0006 cost = 0.053049, other_squad_metric :  {'exact_match': 2.3874538745387452, 'f1': 7.262275577427034}
valid:: Epoch: 0006 cost = 0.055766, other_squad_metric :  {'exact_match': 1.9119292179395913, 'f1': 7.240777072262927}
train:: Epoch: 0007 cost = 0.052007, other_squad_metric :  {'exact_match': 2.80319803198032, 'f1': 7.791143841817625}
valid:: Epoch: 0007 cost = 0.054906, other_squad_metric :  {'exact_match': 1.7695515102206854, 'f1': 7.289975603571721}
train:: Epoch: 0008 cost = 0.050931, other_squad_metric :  {'exact_match': 3.1020910209102093, 'f1': 8.258806954135775}
valid:: Epoch: 0008 cost = 0.054153, other_squad_metric :  {'exact_match': 2.6644971015966643, 'f1': 8.081381168992158}
train:: Epoch: 0009 cost = 0.049987, other_squad_metric :  {'exact_match': 3.4440344403444034, 'f1': 8.739966395239879}
valid:: Epoch: 0009 cost = 0.054544, other_squad_metric :  {'exact_match': 2.3492321773619445, 'f1': 8.055001152575757}
train:: Epoch: 0010 cost = 0.049163, other_squad_metric :  {'exact_match': 3.726937269372694, 'f1': 9.145453911171098}
valid:: Epoch: 0010 cost = 0.053677, other_squad_metric :  {'exact_match': 2.654327265331028, 'f1': 8.325830550204099}
train:: Epoch: 0011 cost = 0.048398, other_squad_metric :  {'exact_match': 3.918819188191882, 'f1': 9.407541477465061}
valid:: Epoch: 0011 cost = 0.053502, other_squad_metric :  {'exact_match': 2.9085731719719314, 'f1': 8.842631995680266}
train:: Epoch: 0012 cost = 0.047689, other_squad_metric :  {'exact_match': 4.291512915129151, 'f1': 9.886274846650787}
valid:: Epoch: 0012 cost = 0.052566, other_squad_metric :  {'exact_match': 3.2238380962066513, 'f1': 9.113956613367362}
train:: Epoch: 0013 cost = 0.046888, other_squad_metric :  {'exact_match': 4.654366543665437, 'f1': 10.388261947617485}
valid:: Epoch: 0013 cost = 0.052353, other_squad_metric :  {'exact_match': 2.8373843181124783, 'f1': 8.66699077180726}
train:: Epoch: 0014 cost = 0.046180, other_squad_metric :  {'exact_match': 4.922509225092251, 'f1': 10.815424597790287}
valid:: Epoch: 0014 cost = 0.052175, other_squad_metric :  {'exact_match': 3.2543476050035594, 'f1': 9.512344665790563}
train:: Epoch: 0015 cost = 0.045535, other_squad_metric :  {'exact_match': 5.266912669126691, 'f1': 11.217056063471766}
valid:: Epoch: 0015 cost = 0.052000, other_squad_metric :  {'exact_match': 3.061120715956473, 'f1': 9.455902452252491}
train:: Epoch: 0016 cost = 0.045013, other_squad_metric :  {'exact_match': 5.531365313653136, 'f1': 11.555561514406875}
valid:: Epoch: 0016 cost = 0.051729, other_squad_metric :  {'exact_match': 3.478084002847554, 'f1': 9.89657458812739}
train:: Epoch: 0017 cost = 0.044488, other_squad_metric :  {'exact_match': 5.902829028290283, 'f1': 12.015609084082987}
valid:: Epoch: 0017 cost = 0.051203, other_squad_metric :  {'exact_match': 3.3458761313942844, 'f1': 9.986771204060988}
train:: Epoch: 0018 cost = 0.043927, other_squad_metric :  {'exact_match': 6.190651906519065, 'f1': 12.317853853795269}
valid:: Epoch: 0018 cost = 0.051325, other_squad_metric :  {'exact_match': 3.4984236753788265, 'f1': 9.728076684247345}
train:: Epoch: 0019 cost = 0.043417, other_squad_metric :  {'exact_match': 6.349323493234932, 'f1': 12.567450526476316}
valid:: Epoch: 0019 cost = 0.052718, other_squad_metric :  {'exact_match': 3.630631546832096, 'f1': 10.054131087857952}
train:: Epoch: 0020 cost = 0.042941, other_squad_metric :  {'exact_match': 6.506765067650677, 'f1': 12.969659069010456}
valid:: Epoch: 0020 cost = 0.050884, other_squad_metric :  {'exact_match': 4.0882741787857215, 'f1': 10.359784387280477}
train:: Epoch: 0021 cost = 0.042421, other_squad_metric :  {'exact_match': 6.944649446494465, 'f1': 13.450272535426203}
valid:: Epoch: 0021 cost = 0.050903, other_squad_metric :  {'exact_match': 3.7730092545510017, 'f1': 10.31306528906065}
train:: Epoch: 0022 cost = 0.042001, other_squad_metric :  {'exact_match': 7.11070110701107, 'f1': 13.651087383183153}
valid:: Epoch: 0022 cost = 0.050722, other_squad_metric :  {'exact_match': 4.1492931963795385, 'f1': 10.713900286197479}
train:: Epoch: 0023 cost = 0.041442, other_squad_metric :  {'exact_match': 7.372693726937269, 'f1': 14.017682719817978}
valid:: Epoch: 0023 cost = 0.050775, other_squad_metric :  {'exact_match': 3.8950472897386352, 'f1': 10.44782133711661}
train:: Epoch: 0024 cost = 0.040935, other_squad_metric :  {'exact_match': 7.621156211562115, 'f1': 14.353988530853623}
valid:: Epoch: 0024 cost = 0.050397, other_squad_metric :  {'exact_match': 4.423878775551714, 'f1': 11.141475654144077}
train:: Epoch: 0025 cost = 0.040546, other_squad_metric :  {'exact_match': 8.046740467404675, 'f1': 14.827385071520968}
valid:: Epoch: 0025 cost = 0.051457, other_squad_metric :  {'exact_match': 4.230651886504627, 'f1': 10.66147547605718}
train:: Epoch: 0026 cost = 0.040124, other_squad_metric :  {'exact_match': 8.17220172201722, 'f1': 15.014283599942933}
valid:: Epoch: 0026 cost = 0.050561, other_squad_metric :  {'exact_match': 4.352689921692261, 'f1': 11.017594072653083}
train:: Epoch: 0027 cost = 0.039703, other_squad_metric :  {'exact_match': 8.419434194341944, 'f1': 15.492427309531816}
valid:: Epoch: 0027 cost = 0.050666, other_squad_metric :  {'exact_match': 4.342520085426624, 'f1': 11.230974599601335}
train:: Epoch: 0028 cost = 0.039337, other_squad_metric :  {'exact_match': 8.70110701107011, 'f1': 15.795327350697617}
valid:: Epoch: 0028 cost = 0.050722, other_squad_metric :  {'exact_match': 4.037424997457541, 'f1': 10.791206092542701}
train:: Epoch: 0029 cost = 0.038912, other_squad_metric :  {'exact_match': 8.832718327183272, 'f1': 15.981105774594253}
valid:: Epoch: 0029 cost = 0.050663, other_squad_metric :  {'exact_match': 4.606935828333164, 'f1': 11.706722090592367}
train:: Epoch: 0030 cost = 0.038556, other_squad_metric :  {'exact_match': 9.04920049200492, 'f1': 16.233137857105547}
valid:: Epoch: 0030 cost = 0.050263, other_squad_metric :  {'exact_match': 4.261161395301536, 'f1': 11.011236570293459}
train:: Epoch: 0031 cost = 0.038147, other_squad_metric :  {'exact_match': 9.1389913899139, 'f1': 16.347653848899522}
valid:: Epoch: 0031 cost = 0.051511, other_squad_metric :  {'exact_match': 4.627275500864436, 'f1': 11.904271792176463}
train:: Epoch: 0032 cost = 0.037769, other_squad_metric :  {'exact_match': 9.46740467404674, 'f1': 16.715881617206705}
valid:: Epoch: 0032 cost = 0.050406, other_squad_metric :  {'exact_match': 4.30184074036408, 'f1': 11.474164388062889}
train:: Epoch: 0033 cost = 0.037405, other_squad_metric :  {'exact_match': 9.74169741697417, 'f1': 17.083492777584222}
valid:: Epoch: 0033 cost = 0.051017, other_squad_metric :  {'exact_match': 4.434048611817349, 'f1': 11.532417814356961}
train:: Epoch: 0034 cost = 0.037065, other_squad_metric :  {'exact_match': 9.993849938499386, 'f1': 17.44161352386353}
valid:: Epoch: 0034 cost = 0.050672, other_squad_metric :  {'exact_match': 4.820502389911522, 'f1': 11.820118813488032}
train:: Epoch: 0035 cost = 0.036787, other_squad_metric :  {'exact_match': 10.259532595325954, 'f1': 17.771612152656697}
valid:: Epoch: 0035 cost = 0.050937, other_squad_metric :  {'exact_match': 5.105257805349334, 'f1': 12.567752502226693}
train:: Epoch: 0036 cost = 0.036416, other_squad_metric :  {'exact_match': 10.53628536285363, 'f1': 18.042843992451598}
valid:: Epoch: 0036 cost = 0.050991, other_squad_metric :  {'exact_match': 4.657785009661344, 'f1': 11.866455986127031}
train:: Epoch: 0037 cost = 0.036077, other_squad_metric :  {'exact_match': 10.71709717097171, 'f1': 18.28638432578662}
valid:: Epoch: 0037 cost = 0.050659, other_squad_metric :  {'exact_match': 4.973049933896064, 'f1': 12.319620978966558}
train:: Epoch: 0038 cost = 0.035646, other_squad_metric :  {'exact_match': 10.813038130381305, 'f1': 18.550725901870106}
valid:: Epoch: 0038 cost = 0.051416, other_squad_metric :  {'exact_match': 4.688294518458253, 'f1': 12.03695848489165}
train:: Epoch: 0039 cost = 0.035400, other_squad_metric :  {'exact_match': 10.944649446494465, 'f1': 18.76226837126298}
valid:: Epoch: 0039 cost = 0.052219, other_squad_metric :  {'exact_match': 4.851011898708431, 'f1': 12.256517555307225}
train:: Epoch: 0040 cost = 0.035087, other_squad_metric :  {'exact_match': 11.051660516605166, 'f1': 18.800013049509367}
valid:: Epoch: 0040 cost = 0.051661, other_squad_metric :  {'exact_match': 4.983219770161701, 'f1': 12.26508045994208}
train:: Epoch: 0041 cost = 0.034794, other_squad_metric :  {'exact_match': 11.39360393603936, 'f1': 19.24503981440576}
valid:: Epoch: 0041 cost = 0.051462, other_squad_metric :  {'exact_match': 4.881521407505339, 'f1': 12.325864402440793}
train:: Epoch: 0042 cost = 0.034495, other_squad_metric :  {'exact_match': 11.597785977859779, 'f1': 19.463299378549635}
valid:: Epoch: 0042 cost = 0.051654, other_squad_metric :  {'exact_match': 5.125597477880606, 'f1': 12.471675872179091}
```

**Problem 4.2** *(10 points)* On top of the attention layer, let's add another layer of (bi-directional) LSTM. So this will look like a *sandwich* where the LSTM is bread and the attention is ham. How does it affect the accuracy? Explain why you think this happens.

**Answer 4.2**



In [None]:
import torch.nn as nn
import numpy as np
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class AttentionLSTMModel(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, n_layers, n_label, emb_dropout, rnn_dropout, bidirectional, enable_layer_norm, device):
        super(AttentionLSTMModel, self).__init__()
        self.embedding = nn.Embedding(len(vocab), embedding_dim)
        self.lstm = nn.LSTM(input_size=embedding_dim, 
                            hidden_size=hidden_dim, 
                            num_layers=n_layers, 
                            dropout=rnn_dropout if n_layers > 1 else 0.0,  
                            bidirectional=bidirectional)
      
        n_direction = 2 if bidirectional else 1
        self.new_hidden_dim = hidden_dim*n_direction
        self.fc_start = nn.Linear(self.new_hidden_dim, 1, bias=True)
        self.fc_end = nn.Linear(self.new_hidden_dim, 1, bias=True)

        # Layer_normalization
        self.enable_layer_norm = enable_layer_norm
        if enable_layer_norm:
            self.norm1 = nn.LayerNorm(embedding_dim)
            self.norm2 = nn.LayerNorm(self.new_hidden_dim)
            self.norm3 = nn.LayerNorm(self.new_hidden_dim)
            self.norm4 = nn.LayerNorm(self.new_hidden_dim)

        # Attention layer
        self.att_weight_Q = nn.Linear(self.new_hidden_dim, self.new_hidden_dim, bias=False)
        self.att_weight_K = nn.Linear(self.new_hidden_dim, self.new_hidden_dim, bias=False)
        self.att_weight_V = nn.Linear(self.new_hidden_dim, self.new_hidden_dim, bias=False)
        self.softmax_layer = nn.Softmax(dim=2)

        # LSTM after Attention
        self.lstm2 = nn.LSTM(input_size=self.new_hidden_dim, 
                            hidden_size=hidden_dim, 
                            bidirectional=bidirectional)  # output size must equal self.new_hidden_dim (norm4, fc_start, fc_end)

        self.emb_dropout = nn.Dropout(emb_dropout)
        self.att_dropout = nn.Dropout(rnn_dropout)
        self.fc_dropout = nn.Dropout(rnn_dropout)
        self.bidirectional = bidirectional
        self.device = device

    def forward(self, input_tensor, src_seq_lens):
        emb = self.embedding(input_tensor) # emb.shape = batch * len * embedding_size

        # Layer_normalization
        if self.enable_layer_norm:
            emb = self.norm1(emb)

        emb = self.emb_dropout(emb)
        emb = emb.transpose(0, 1) # emb.shape = len * batch * embedding_size

        # nn.LSTM
        packed = pack_padded_sequence(emb, src_seq_lens.tolist(), batch_first=False)
        outs, (hn, cn) = self.lstm(packed)
        outs, out_lens = pad_packed_sequence(outs, batch_first=False)  # outs.shape = len * batch * self.new_hidden_dim
        
        outs = outs.transpose(0, 1) # outs.shape = batch * len * self.new_hidden_dim
        
        # Layer_normalization
        if self.enable_layer_norm:
            outs = self.norm2(outs)
        
        outs = self.att_dropout(outs)

        # Attention Mechanism
        att_q = self.att_weight_Q(outs)
        att_k = self.att_weight_K(outs)
        att_v = self.att_weight_V(outs)
        
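        att_qk = torch.einsum('bih,bjh->bij', att_q, att_k)  # att_qk[b, i, j] = <q_i, k_j>
        att_soft_qk = self.softmax_layer(att_qk / np.sqrt(self.new_hidden_dim))  # softmax over the key index j
        # NOTE: this contraction sums over the query index i. Standard scaled dot-product
        # attention sums over the key index j: torch.einsum('bij,bjh->bih', att_soft_qk, att_v).
        att = torch.einsum('bij,bih->bjh', att_soft_qk, att_v)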

        # Layer_normalization
        if self.enable_layer_norm:
            att = self.norm3(att)

        att = self.att_dropout(att)
        att = att.transpose(0, 1) # att.shape = len * batch * self.new_hidden_dim
        
        # nn.LSTM
        packed = pack_padded_sequence(att, src_seq_lens.tolist(), batch_first=False)
        outs, (hn, cn) = self.lstm2(packed)
        outs, out_lens = pad_packed_sequence(outs, batch_first=False)  # outs.shape = len * batch * self.new_hidden_dim
        
        outs = outs.transpose(0, 1) # outs.shape = batch * len * self.new_hidden_dim

        # Layer_normalization
        if self.enable_layer_norm:
            outs = self.norm4(outs)

        outs = self.fc_dropout(outs)
        logits_start = self.fc_start(outs).squeeze(2)
        logits_end = self.fc_end(outs).squeeze(2)

        return (logits_start, logits_end)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device : ", device)

# Vocabulary : Use vocab_list

# Construct the LSTM Model
embedding_dim = 128
hidden_dim = 128
n_layers = 2
n_label = max_length+1 if use_bos else max_length
emb_dropout = 0.5
rnn_dropout = 0.5
bidirectional = True
enable_layer_norm = True
rnnmodel = AttentionLSTMModel(embedding_dim, hidden_dim, n_layers, n_label, emb_dropout, rnn_dropout, bidirectional, enable_layer_norm, device).to(device)

print("batch_size : ", batch_size)
print("max_length : ", max_length)
print("embedding_dim : ", embedding_dim)
print("hidden_dim : ", hidden_dim)
print("n_layers : ", n_layers)
print("emb_dropout : ", emb_dropout)
print("rnn_dropout_and_fc_dropout : ", rnn_dropout)
print("bidirectional :", bidirectional)
print("enable_layer_norm :", enable_layer_norm)

# Construct the data loader
train_iter = loader.train_iter
valid_iter = loader.valid_iter

train_id_list = loader.train_id_list
valid_id_list = loader.valid_id_list
train_reference = loader.train_reference
valid_reference = loader.valid_reference

# Training
learning_rate = 1e-3
print("learning_rate : ", learning_rate)

PAD_IDX = vocab.stoi['<pad>']
cel = nn.CrossEntropyLoss(ignore_index=PAD_IDX) # NOTE: targets are answer positions, so this skips any target whose position equals PAD_IDX
# optimizer = torch.optim.SGD(rnnmodel.parameters(), lr=1e-1)
optimizer = torch.optim.Adam(rnnmodel.parameters(), lr=learning_rate)

epochs = 50
max_norm = 5

# Evaluate
squad_metric = load_metric('squad')


for epoch in tqdm(range(epochs)):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    train_prediction = list()
    for train_i, train_batch in enumerate(train_iter):
        context, context_length = train_batch.context_question
        answer_start = train_batch.answer_start
        answer_end = train_batch.answer_end
        train_id_index = train_batch.id_index

        logits_start, logits_end = rnnmodel(context, context_length)

        optimizer.zero_grad() # reset process
        loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
        loss.backward() # compute gradients
        # print(torch.norm(rnnmodel.lstm.weight_hh_l0.grad), loss.item())
        # torch.nn.utils.clip_grad_norm_(rnnmodel.parameters(), max_norm) # gradient clipping
        optimizer.step() # update parameters
        train_loss += loss.item()
        
        _, train_start_preds = torch.max(logits_start, 1)
        _, train_end_preds = torch.max(logits_end, 1)
        # train_accuracy += ((train_start_preds == answer_start) * (train_end_preds == answer_end)).sum().float()

        train_data_num += context.shape[0]

        for train_j in range(context.shape[0]):
            pred_text = ""
            start = train_start_preds[train_j]
            end = train_end_preds[train_j]
            if start <= end: # a single-token answer has start == end
                pred_text = [vocab_list[text_id] for text_id in context[train_j][start:end+1]]
                pred_text = " ".join(pred_text)

            # start = answer_start[train_j]
            # end = answer_end[train_j]
            # answer_text = [vocab_list[text_id] for text_id in context[train_j][start:end+1]]
            # answer_text = " ".join(answer_text)
            # print(pred_text, answer_text, train_reference[train_id_index[train_j]], "\n")

            train_prediction.append({'id':train_id_list[train_id_index[train_j]], 'prediction_text':pred_text})

    train_result = squad_metric.compute(predictions=train_prediction, references=train_reference)
    print('train:: Epoch:', '%04d' % (epoch + 1), 
          'cost =', '{:.6f},'.format(train_loss / train_data_num), 
        #   'my exact_match =', '{:.6f}'.format(train_accuracy / train_data_num),
          'other_squad_metric : ', train_result)
        
    if (epoch + 1) % 1 == 0:
        rnnmodel.eval() # disable dropout while evaluating
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            valid_prediction = list()
            for valid_i, valid_batch in enumerate(valid_iter):
                context, context_length = valid_batch.context_question
                answer_start = valid_batch.answer_start
                answer_end = valid_batch.answer_end
                valid_id_index = valid_batch.id_index

                logits_start, logits_end = rnnmodel(context, context_length)

                loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
                valid_loss += loss.item()

                _, valid_start_preds = torch.max(logits_start, 1)
                _, valid_end_preds = torch.max(logits_end, 1)
                # valid_accuracy += ((valid_start_preds == answer_start) * (valid_end_preds == answer_end)).sum().float()

                valid_data_num += context.shape[0]

                for valid_j in range(context.shape[0]):
                    pred_text = ""
                    start = valid_start_preds[valid_j]
                    end = valid_end_preds[valid_j]
                    if start <= end: # a single-token answer has start == end
                        pred_text = [vocab_list[text_id] for text_id in context[valid_j][start:end+1]]
                        pred_text = " ".join(pred_text)
                    valid_prediction.append({'id':valid_id_list[valid_id_index[valid_j]], 'prediction_text':pred_text})
                
            valid_result = squad_metric.compute(predictions=valid_prediction, references=valid_reference)
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                  'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                #   'my exact_match =', '{:.6f},'.format(valid_accuracy / valid_data_num),
                  'other_squad_metric : ', valid_result)
        rnnmodel.train() # re-enable dropout for the next training epoch
            

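**Comment 4.2**

This model adds only a plain attention layer between the two bi-LSTMs, together with LayerNorm and dropout.

Training and validation loss decrease faster and further than in Prob 3.2 (the model does not get stuck in local minima as easily). At epoch 50 the validation cost is 0.0423 versus 0.0573, and the validation F1 score is higher (22.09 versus 18.80), although exact match is slightly lower (11.41 versus 12.41).

The number of layers is not the reason for this improvement, because a 3-layer bi-LSTM does not achieve comparably good results.

It seems that the attention layer captures the "context" shared between the context sentence and the question sentence, and the final bi-LSTM layer then helps to capture long-term dependencies.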
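As a minimal sketch of the attention step discussed in this comment, the following computes single-head scaled dot-product self-attention. It assumes NumPy in place of PyTorch and adds a key padding mask, which the model above does not apply. The function name `self_attention`, the weight matrices `w_q`, `w_k`, `w_v`, and the toy shapes are illustrative assumptions, not part of the assignment code.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v, lengths):
    """Single-head scaled dot-product self-attention with a key padding mask.

    x: (batch, len, hidden), w_*: (hidden, hidden), lengths: true length per example.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # scores[b, i, j] = <q_i, k_j> / sqrt(hidden)
    scores = np.einsum('bih,bjh->bij', q, k) / np.sqrt(x.shape[-1])
    # Mask padded key positions so they receive zero weight after the softmax.
    key_is_real = np.arange(x.shape[1])[None, :] < np.asarray(lengths)[:, None]
    scores = np.where(key_is_real[:, None, :], scores, -1e9)
    # Numerically stable softmax over the key axis j.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum over keys: out[b, i] = sum_j weights[b, i, j] * v[b, j]
    out = np.einsum('bij,bjh->bih', weights, v)
    return out, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))  # toy batch: 2 sequences, 5 tokens, hidden size 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v, lengths=[5, 3])
print(out.shape)  # (2, 5, 8)
```

Each row of `weights` sums to 1 over the key axis, and the two padded keys of the second example receive zero weight, so every output position is a convex combination of the real value vectors only.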

Before (Prob 3.2) : 
```
Lowest Valid Loss (Cost) ::
valid:: Epoch: 0026 cost = 0.053489, squad_metric :  {'exact_match': 10.403742499745753, 'f1': 16.172590913652723}

Best Accuracy :: (It decreases until the 100th epoch)
train:: Epoch: 0050 cost = 0.033762, squad_metric :  {'exact_match': 20.202952029520294, 'f1': 28.131849215758283}
valid:: Epoch: 0050 cost = 0.057306, squad_metric :  {'exact_match': 12.407200244076071, 'f1': 18.79763380045527}
```

Before (Prob 4.1) : 
```
Lowest Valid Loss (Cost) ::
valid:: Epoch: 0032 cost = 0.050406, other_squad_metric :  {'exact_match': 4.30184074036408, 'f1': 11.474164388062889}
```

After (Prob 4.2) : 
```
train:: Epoch: 0050 cost = 0.026692, other_squad_metric :  {'exact_match': 20.587945879458793, 'f1': 32.11980032343598}
valid:: Epoch: 0050 cost = 0.042287, other_squad_metric :  {'exact_match': 11.410556290043731, 'f1': 22.092787217705098}
```
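To make the comparison concrete, the short sketch below subtracts the epoch-50 validation metrics of Prob 3.2 from those of Prob 4.2. The numbers are copied from the logs quoted above; the dictionary names `prob_3_2` and `prob_4_2` are only illustrative.

```python
# Validation metrics at epoch 50, copied from the logs quoted above.
prob_3_2 = {'cost': 0.057306, 'exact_match': 12.407200244076071, 'f1': 18.79763380045527}
prob_4_2 = {'cost': 0.042287, 'exact_match': 11.410556290043731, 'f1': 22.092787217705098}

# Positive delta = higher value in Prob 4.2 (lower is better for cost only).
deltas = {name: prob_4_2[name] - prob_3_2[name] for name in prob_3_2}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

Rounded to three decimals, the cost changes by -0.015, exact match by -0.997, and F1 by +3.295, so the gain shows up in F1 and loss rather than in exact match.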


**Result 4.2 (Training & Validation)**
```
device :  cuda
batch_size :  128
max_length :  253
embedding_dim :  128
hidden_dim :  128
n_layers :  2
emb_dropout :  0.5
rnn_dropout_and_fc_dropout :  0.5
bidirectional : True
enable_layer_norm : True
learning_rate :  0.001
100%
50/50 [1:25:12<00:00, 102.26s/it]
train:: Epoch: 0001 cost = 0.069466, other_squad_metric :  {'exact_match': 0.16974169741697417, 'f1': 3.361814277629193}
valid:: Epoch: 0001 cost = 0.065247, other_squad_metric :  {'exact_match': 0.3457744330316282, 'f1': 3.4212623387303736}
train:: Epoch: 0002 cost = 0.062712, other_squad_metric :  {'exact_match': 0.4194341943419434, 'f1': 3.4249592633782338}
valid:: Epoch: 0002 cost = 0.062507, other_squad_metric :  {'exact_match': 0.4373029594223533, 'f1': 4.28673967780121}
train:: Epoch: 0003 cost = 0.059058, other_squad_metric :  {'exact_match': 1.002460024600246, 'f1': 5.159480458249725}
valid:: Epoch: 0003 cost = 0.059469, other_squad_metric :  {'exact_match': 1.4441167497203296, 'f1': 6.274151525233699}
train:: Epoch: 0004 cost = 0.056561, other_squad_metric :  {'exact_match': 1.878228782287823, 'f1': 6.5468186298290725}
valid:: Epoch: 0004 cost = 0.057507, other_squad_metric :  {'exact_match': 1.647513475033052, 'f1': 6.874641229932915}
train:: Epoch: 0005 cost = 0.054578, other_squad_metric :  {'exact_match': 2.674046740467405, 'f1': 7.6359413446573114}
valid:: Epoch: 0005 cost = 0.055929, other_squad_metric :  {'exact_match': 2.4509305400183057, 'f1': 8.10258175294164}
train:: Epoch: 0006 cost = 0.053056, other_squad_metric :  {'exact_match': 3.3185731857318572, 'f1': 8.722633118829728}
valid:: Epoch: 0006 cost = 0.054805, other_squad_metric :  {'exact_match': 2.9289128445032033, 'f1': 8.847399503302155}
train:: Epoch: 0007 cost = 0.051621, other_squad_metric :  {'exact_match': 4.092250922509225, 'f1': 9.905997718422435}
valid:: Epoch: 0007 cost = 0.054266, other_squad_metric :  {'exact_match': 3.2543476050035594, 'f1': 9.628237962463725}
train:: Epoch: 0008 cost = 0.050263, other_squad_metric :  {'exact_match': 4.841328413284133, 'f1': 11.160153927859104}
valid:: Epoch: 0008 cost = 0.052925, other_squad_metric :  {'exact_match': 3.93572663480118, 'f1': 10.466658823921307}
train:: Epoch: 0009 cost = 0.049018, other_squad_metric :  {'exact_match': 5.523985239852398, 'f1': 12.24201291867065}
valid:: Epoch: 0009 cost = 0.051985, other_squad_metric :  {'exact_match': 4.5560866470049834, 'f1': 11.92370672510731}
train:: Epoch: 0010 cost = 0.047711, other_squad_metric :  {'exact_match': 6.305043050430505, 'f1': 13.309821553123026}
valid:: Epoch: 0010 cost = 0.051142, other_squad_metric :  {'exact_match': 4.871351571239703, 'f1': 12.189955179372221}
train:: Epoch: 0011 cost = 0.046498, other_squad_metric :  {'exact_match': 7.088560885608856, 'f1': 14.603247437540979}
valid:: Epoch: 0011 cost = 0.050100, other_squad_metric :  {'exact_match': 5.786636835146954, 'f1': 13.432881098920634}
train:: Epoch: 0012 cost = 0.045335, other_squad_metric :  {'exact_match': 8.081180811808117, 'f1': 15.982700032089694}
valid:: Epoch: 0012 cost = 0.049419, other_squad_metric :  {'exact_match': 6.244279467100579, 'f1': 14.931068475250042}
train:: Epoch: 0013 cost = 0.044141, other_squad_metric :  {'exact_match': 8.880688806888068, 'f1': 17.205677402302314}
valid:: Epoch: 0013 cost = 0.048984, other_squad_metric :  {'exact_match': 6.620563408929116, 'f1': 15.079029022956652}
train:: Epoch: 0014 cost = 0.043162, other_squad_metric :  {'exact_match': 9.622386223862238, 'f1': 18.18664299195288}
valid:: Epoch: 0014 cost = 0.047982, other_squad_metric :  {'exact_match': 6.834129970507475, 'f1': 15.513491374991682}
train:: Epoch: 0015 cost = 0.042102, other_squad_metric :  {'exact_match': 10.134071340713406, 'f1': 18.905590852124877}
valid:: Epoch: 0015 cost = 0.047175, other_squad_metric :  {'exact_match': 7.068036204617106, 'f1': 16.573952177134657}
train:: Epoch: 0016 cost = 0.041315, other_squad_metric :  {'exact_match': 10.751537515375153, 'f1': 19.73184574430156}
valid:: Epoch: 0016 cost = 0.046982, other_squad_metric :  {'exact_match': 7.678226380555273, 'f1': 16.626068945768726}
train:: Epoch: 0017 cost = 0.040586, other_squad_metric :  {'exact_match': 11.244772447724477, 'f1': 20.393195360085272}
valid:: Epoch: 0017 cost = 0.046342, other_squad_metric :  {'exact_match': 7.9426421234618125, 'f1': 17.473379322453372}
train:: Epoch: 0018 cost = 0.039779, other_squad_metric :  {'exact_match': 11.832718327183272, 'f1': 20.988569010463284}
valid:: Epoch: 0018 cost = 0.045816, other_squad_metric :  {'exact_match': 8.176548357571443, 'f1': 17.613587476416974}
train:: Epoch: 0019 cost = 0.039056, other_squad_metric :  {'exact_match': 12.136531365313653, 'f1': 21.569753455162243}
valid:: Epoch: 0019 cost = 0.045723, other_squad_metric :  {'exact_match': 8.664700498321977, 'f1': 18.134952513635778}
train:: Epoch: 0020 cost = 0.038483, other_squad_metric :  {'exact_match': 12.611316113161132, 'f1': 22.23884296024986}
valid:: Epoch: 0020 cost = 0.045164, other_squad_metric :  {'exact_match': 8.756229024712702, 'f1': 18.370361531154288}
train:: Epoch: 0021 cost = 0.037894, other_squad_metric :  {'exact_match': 13.018450184501845, 'f1': 22.70288988243523}
valid:: Epoch: 0021 cost = 0.044323, other_squad_metric :  {'exact_match': 9.396928709447778, 'f1': 19.545964064849855}
train:: Epoch: 0022 cost = 0.037355, other_squad_metric :  {'exact_match': 13.289052890528906, 'f1': 23.099903744208138}
valid:: Epoch: 0022 cost = 0.044395, other_squad_metric :  {'exact_match': 9.244381165463237, 'f1': 19.162249720166074}
train:: Epoch: 0023 cost = 0.036674, other_squad_metric :  {'exact_match': 13.907749077490775, 'f1': 23.91963101160192}
valid:: Epoch: 0023 cost = 0.044672, other_squad_metric :  {'exact_match': 9.081663785213058, 'f1': 18.92874966399299}
train:: Epoch: 0024 cost = 0.036194, other_squad_metric :  {'exact_match': 14.190651906519065, 'f1': 24.170065149163246}
valid:: Epoch: 0024 cost = 0.044040, other_squad_metric :  {'exact_match': 9.468117563307231, 'f1': 19.907402754194923}
train:: Epoch: 0025 cost = 0.035654, other_squad_metric :  {'exact_match': 14.559655596555965, 'f1': 24.653949400193746}
valid:: Epoch: 0025 cost = 0.044101, other_squad_metric :  {'exact_match': 9.203701820400692, 'f1': 18.93926897199623}
train:: Epoch: 0026 cost = 0.035298, other_squad_metric :  {'exact_match': 14.499384993849938, 'f1': 24.783760914299492}
valid:: Epoch: 0026 cost = 0.043624, other_squad_metric :  {'exact_match': 9.396928709447778, 'f1': 19.444442074884538}
train:: Epoch: 0027 cost = 0.034831, other_squad_metric :  {'exact_match': 15.007380073800737, 'f1': 25.2753825608119}
valid:: Epoch: 0027 cost = 0.043016, other_squad_metric :  {'exact_match': 9.874911013932676, 'f1': 20.074305100156142}
train:: Epoch: 0028 cost = 0.034336, other_squad_metric :  {'exact_match': 15.391143911439114, 'f1': 25.69711846783883}
valid:: Epoch: 0028 cost = 0.043293, other_squad_metric :  {'exact_match': 10.586799552527204, 'f1': 20.69354665468823}
train:: Epoch: 0029 cost = 0.033910, other_squad_metric :  {'exact_match': 15.474784747847478, 'f1': 25.838288078455015}
valid:: Epoch: 0029 cost = 0.043318, other_squad_metric :  {'exact_match': 9.854571341401403, 'f1': 19.9525020629975}
train:: Epoch: 0030 cost = 0.033509, other_squad_metric :  {'exact_match': 15.905289052890529, 'f1': 26.407931590801045}
valid:: Epoch: 0030 cost = 0.042680, other_squad_metric :  {'exact_match': 10.485101189870843, 'f1': 20.949752360512093}
train:: Epoch: 0031 cost = 0.033029, other_squad_metric :  {'exact_match': 16.239852398523986, 'f1': 26.922576986959044}
valid:: Epoch: 0031 cost = 0.042419, other_squad_metric :  {'exact_match': 10.627478897589748, 'f1': 21.3655582733813}
train:: Epoch: 0032 cost = 0.032662, other_squad_metric :  {'exact_match': 16.451414514145142, 'f1': 27.10572095301116}
valid:: Epoch: 0032 cost = 0.043325, other_squad_metric :  {'exact_match': 10.596969388792841, 'f1': 21.17618395521775}
train:: Epoch: 0033 cost = 0.032248, other_squad_metric :  {'exact_match': 16.703567035670357, 'f1': 27.447812050845094}
valid:: Epoch: 0033 cost = 0.043247, other_squad_metric :  {'exact_match': 10.352893318417573, 'f1': 20.535471910443267}
train:: Epoch: 0034 cost = 0.031884, other_squad_metric :  {'exact_match': 16.854858548585486, 'f1': 27.723784387556876}
valid:: Epoch: 0034 cost = 0.042398, other_squad_metric :  {'exact_match': 10.352893318417573, 'f1': 20.9875390788655}
train:: Epoch: 0035 cost = 0.031533, other_squad_metric :  {'exact_match': 17.06888068880689, 'f1': 27.946554352435065}
valid:: Epoch: 0035 cost = 0.042969, other_squad_metric :  {'exact_match': 10.515610698667752, 'f1': 21.172966506592992}
train:: Epoch: 0036 cost = 0.031066, other_squad_metric :  {'exact_match': 17.61869618696187, 'f1': 28.462887843815444}
valid:: Epoch: 0036 cost = 0.042215, other_squad_metric :  {'exact_match': 10.546120207464659, 'f1': 21.104462782877796}
train:: Epoch: 0037 cost = 0.030768, other_squad_metric :  {'exact_match': 17.66789667896679, 'f1': 28.57176608422322}
valid:: Epoch: 0037 cost = 0.042015, other_squad_metric :  {'exact_match': 11.034272348215193, 'f1': 21.95038002092573}
train:: Epoch: 0038 cost = 0.030376, other_squad_metric :  {'exact_match': 18.079950799507994, 'f1': 29.033229936814934}
valid:: Epoch: 0038 cost = 0.042066, other_squad_metric :  {'exact_match': 11.105461202074647, 'f1': 21.476216073547164}
train:: Epoch: 0039 cost = 0.030158, other_squad_metric :  {'exact_match': 18.062730627306273, 'f1': 29.02067901956705}
valid:: Epoch: 0039 cost = 0.042298, other_squad_metric :  {'exact_match': 11.319027763653006, 'f1': 22.54225148502073}
train:: Epoch: 0040 cost = 0.029738, other_squad_metric :  {'exact_match': 18.366543665436655, 'f1': 29.437049239866806}
valid:: Epoch: 0040 cost = 0.042144, other_squad_metric :  {'exact_match': 11.369876944981186, 'f1': 21.81951078805602}
train:: Epoch: 0041 cost = 0.029409, other_squad_metric :  {'exact_match': 18.603936039360395, 'f1': 29.84116969975234}
valid:: Epoch: 0041 cost = 0.042219, other_squad_metric :  {'exact_match': 10.698667751449202, 'f1': 21.324294165265954}
train:: Epoch: 0042 cost = 0.029036, other_squad_metric :  {'exact_match': 18.85239852398524, 'f1': 30.03953318610129}
valid:: Epoch: 0042 cost = 0.042358, other_squad_metric :  {'exact_match': 11.430895962575002, 'f1': 21.903664668576834}
train:: Epoch: 0043 cost = 0.028731, other_squad_metric :  {'exact_match': 19.142681426814267, 'f1': 30.487018032170674}
valid:: Epoch: 0043 cost = 0.042273, other_squad_metric :  {'exact_match': 11.400386453778093, 'f1': 22.00874349671015}
train:: Epoch: 0044 cost = 0.028418, other_squad_metric :  {'exact_match': 19.301353013530136, 'f1': 30.615485274996946}
valid:: Epoch: 0044 cost = 0.042939, other_squad_metric :  {'exact_match': 10.790196277839927, 'f1': 21.3082415313785}
train:: Epoch: 0045 cost = 0.028091, other_squad_metric :  {'exact_match': 19.57441574415744, 'f1': 30.805847900909203}
valid:: Epoch: 0045 cost = 0.041975, other_squad_metric :  {'exact_match': 11.329197599918642, 'f1': 21.802735916987405}
train:: Epoch: 0046 cost = 0.027783, other_squad_metric :  {'exact_match': 19.837638376383765, 'f1': 31.098738674682618}
valid:: Epoch: 0046 cost = 0.041832, other_squad_metric :  {'exact_match': 11.705481541747178, 'f1': 23.02309877554507}
train:: Epoch: 0047 cost = 0.027449, other_squad_metric :  {'exact_match': 20.195571955719558, 'f1': 31.530467885588784}
valid:: Epoch: 0047 cost = 0.042251, other_squad_metric :  {'exact_match': 11.522424488965727, 'f1': 22.214844197606723}
train:: Epoch: 0048 cost = 0.027248, other_squad_metric :  {'exact_match': 20.253382533825338, 'f1': 31.857298200338626}
valid:: Epoch: 0048 cost = 0.041667, other_squad_metric :  {'exact_match': 12.071595647310078, 'f1': 22.922900916737586}
train:: Epoch: 0049 cost = 0.026940, other_squad_metric :  {'exact_match': 20.416974169741696, 'f1': 31.914431302435737}
valid:: Epoch: 0049 cost = 0.042164, other_squad_metric :  {'exact_match': 11.735991050544087, 'f1': 22.91321449036422}
train:: Epoch: 0050 cost = 0.026692, other_squad_metric :  {'exact_match': 20.587945879458793, 'f1': 32.11980032343598}
valid:: Epoch: 0050 cost = 0.042287, other_squad_metric :  {'exact_match': 11.410556290043731, 'f1': 22.092787217705098}
```
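A log in this format can be scanned programmatically for the epoch with the lowest validation cost or the best F1. The sketch below assumes the exact line layout printed above. `LOG_LINE` and `parse_log` are hypothetical helper names, and the sample holds three validation lines copied from the log.

```python
import ast
import re

# Matches e.g. "valid:: Epoch: 0046 cost = 0.041832, other_squad_metric :  {...}"
LOG_LINE = re.compile(r"(train|valid):: Epoch: (\d+) cost = ([\d.]+), \w+ :\s+(\{.*\})")


def parse_log(text):
    """Turn each 'train::' / 'valid::' line into a dict; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = LOG_LINE.search(line)
        if match:
            split, epoch, cost, metrics = match.groups()
            rows.append({'split': split, 'epoch': int(epoch), 'cost': float(cost),
                         **ast.literal_eval(metrics)})
    return rows


sample = """
valid:: Epoch: 0046 cost = 0.041832, other_squad_metric :  {'exact_match': 11.705481541747178, 'f1': 23.02309877554507}
valid:: Epoch: 0047 cost = 0.042251, other_squad_metric :  {'exact_match': 11.522424488965727, 'f1': 22.214844197606723}
valid:: Epoch: 0048 cost = 0.041667, other_squad_metric :  {'exact_match': 12.071595647310078, 'f1': 22.922900916737586}
"""

valid_rows = [row for row in parse_log(sample) if row['split'] == 'valid']
lowest_cost = min(valid_rows, key=lambda row: row['cost'])
best_f1 = max(valid_rows, key=lambda row: row['f1'])
print("lowest valid cost at epoch", lowest_cost['epoch'])
print("best valid f1 at epoch", best_f1['epoch'])
```

On these three lines the lowest cost falls on epoch 48 and the best F1 on epoch 46, which illustrates that the two selection criteria need not pick the same checkpoint.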

## 5. Attention is All You Need

**Problem 5.1 (bonus)** *(20 points)*  Implement a full Transformer encoder to entirely replace LSTMs. You are allowed to copy and paste code from [*Annotated Transformer*](https://nlp.seas.harvard.edu/2018/04/03/attention.html) (but nowhere else). Report the accuracy and explain what seems to be happening with the attention-only model compared to the LSTM+Attention model(s).

**Answer 5.1**


In [None]:
import torch.nn as nn
import torch.nn.functional as F
import copy
import math
from torch.autograd import Variable 
from tqdm.notebook import tqdm
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class Encoder(nn.Module):
    "Core encoder is a stack of N layers"
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)


class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)  # one for self-attention, one for feed-forward
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)


class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))


class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn


class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)


class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(F.relu(self.w_1(x))))


class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)


class PositionalEncoding(nn.Module):
    "Implement the PE function."
    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        # `pe` is a registered buffer, so it never requires grad;
        # the deprecated Variable wrapper is not needed.
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


class EncoderClassification(nn.Module):
    """
    An encoder-only architecture for token classification: embed the source,
    encode it, and project each token to start/end logits.
    """
    def __init__(self, encoder, src_embed, generator):
        super(EncoderClassification, self).__init__()
        self.encoder = encoder
        self.src_embed = src_embed
        self.generator = generator
        
    def forward(self, src, src_mask=None):
        "Encode the (optionally masked) src sequence and return start/end logits."
        return self.generator(self.encode(src, src_mask))
    
    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)


class Generator(nn.Module):
    "Project each token to a start logit and an end logit (no softmax here)."
    def __init__(self, d_model):
        super(Generator, self).__init__()
        self.proj_start = nn.Linear(d_model, 1)
        self.proj_end = nn.Linear(d_model, 1)

    def forward(self, x):
        logit_start = self.proj_start(x).squeeze(dim=-1)
        logit_end = self.proj_end(x).squeeze(dim=-1)
        return (logit_start, logit_end)

def make_encoder(src_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderClassification(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        Generator(d_model)
        )
    
    # This was important from their code. 
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model


class NoamOpt:
    "Optim wrapper that implements rate."
    def __init__(self, model_size, factor, warmup, optimizer):
        self.optimizer = optimizer
        self._step = 0
        self.warmup = warmup
        self.factor = factor
        self.model_size = model_size
        self._rate = 0
        
    def step(self):
        "Update parameters and rate"
        self._step += 1
        rate = self.rate()
        for p in self.optimizer.param_groups:
            p['lr'] = rate
        self._rate = rate
        self.optimizer.step()
        
    def rate(self, step = None):
        "Implement `lrate` above"
        if step is None:
            step = self._step
        return self.factor * \
            (self.model_size ** (-0.5) *
            min(step ** (-0.5), step * self.warmup ** (-1.5)))
        
def get_std_opt(model):
    return NoamOpt(model.src_embed[0].d_model, 1, 4000,
            torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))


class LabelSmoothing(nn.Module):
    "Implement label smoothing."
    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction='sum')
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None
        
    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device : ", device)

# Construct the Transformer Encoder
src_vocab = len(vocab_list) # vocabulary size
N = 2
d_model = 256
d_ff = 1024
h = 8
dropout = 0.5
smoothing = 0.1

model = make_encoder(src_vocab, N, d_model, d_ff, h, dropout).to(device)

print("src_vocab : ", src_vocab)
print("N : ", N)
print("d_model : ", d_model)
print("d_ff : ", d_ff)
print("h : ", h)
print("dropout : ", dropout)
print("smoothing : ", smoothing)

# Construct the data loader
train_iter = loader.train_iter
valid_iter = loader.valid_iter

train_id_list = loader.train_id_list
valid_id_list = loader.valid_id_list
train_reference = loader.train_reference
valid_reference = loader.valid_reference

# Training
# learning_rate = 2e-5
# weight_decay = 1e-3
factor = 1
warmup_step = 4000
# print("learning_rate : ", learning_rate)
# print("weight_decay : ", weight_decay)
print("factor : ", factor)
print("warmup_step : ", warmup_step)

PAD_IDX = vocab.stoi['<pad>']
cel = nn.CrossEntropyLoss(ignore_index=PAD_IDX) # Ignore Padding
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
model_opt = NoamOpt(model.src_embed[0].d_model, factor, warmup_step,
        torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

epochs = 50
max_norm = 5

# Evaluate
squad_metric = load_metric('squad')

for epoch in tqdm(range(epochs)):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    train_prediction = list()
    for train_i, train_batch in enumerate(train_iter):
        context, context_length = train_batch.context_question
        answer_start = train_batch.answer_start
        answer_end = train_batch.answer_end
        train_id_index = train_batch.id_index

        logits_start, logits_end = model(context)

        model_opt.optimizer.zero_grad() # reset process
        # loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
        label_smoothing = LabelSmoothing(logits_start.size(1), PAD_IDX, smoothing=smoothing)
        loss = label_smoothing(F.log_softmax(logits_start, dim=-1), answer_start) + label_smoothing(F.log_softmax(logits_end, dim=-1), answer_end)

        loss.backward() # compute gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) # gradient clipping
        model_opt.step() # update parameters
        train_loss += loss.item()
        
        _, train_start_preds = torch.max(logits_start, 1)
        _, train_end_preds = torch.max(logits_end, 1)
        # train_accuracy += ((train_start_preds == answer_start) * (train_end_preds == answer_end)).sum().float()

        train_data_num += context.shape[0]

        for train_j in range(context.shape[0]):
            pred_text = ""
            start = train_start_preds[train_j]
            end = train_end_preds[train_j]
            if start < end:
                pred_text = [vocab_list[text_id] for text_id in context[train_j][start:end+1]]
                pred_text = " ".join(pred_text)

            train_prediction.append({'id':train_id_list[train_id_index[train_j]], 'prediction_text':pred_text})

    train_result = squad_metric.compute(predictions=train_prediction, references=train_reference)
    print('train:: Epoch:', '%04d' % (epoch + 1), 
          'cost =', '{:.6f},'.format(train_loss / train_data_num), 
        #   'my exact_match =', '{:.6f}'.format(train_accuracy / train_data_num),
          'other_squad_metric : ', train_result)
        
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            valid_prediction = list()
            for valid_i, valid_batch in enumerate(valid_iter):
                context, context_length = valid_batch.context_question
                answer_start = valid_batch.answer_start
                answer_end = valid_batch.answer_end
                valid_id_index = valid_batch.id_index

                logits_start, logits_end = model(context)

                # loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
                label_smoothing = LabelSmoothing(logits_start.size(1), PAD_IDX, smoothing=smoothing)
                loss = label_smoothing(F.log_softmax(logits_start, dim=-1), answer_start) + label_smoothing(F.log_softmax(logits_end, dim=-1), answer_end)
                valid_loss += loss.item()

                _, valid_start_preds = torch.max(logits_start, 1)
                _, valid_end_preds = torch.max(logits_end, 1)
                # valid_accuracy += ((valid_start_preds == answer_start) * (valid_end_preds == answer_end)).sum().float()

                valid_data_num += context.shape[0]

                for valid_j in range(context.shape[0]):
                    pred_text = ""
                    start = valid_start_preds[valid_j]
                    end = valid_end_preds[valid_j]
                    if start < end:
                        pred_text = [vocab_list[text_id] for text_id in context[valid_j][start:end+1]]
                        pred_text = " ".join(pred_text)
                    valid_prediction.append({'id':valid_id_list[valid_id_index[valid_j]], 'prediction_text':pred_text})
                
            valid_result = squad_metric.compute(predictions=valid_prediction, references=valid_reference)
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                  'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                #   'my exact_match =', '{:.6f},'.format(valid_accuracy / valid_data_num),
                  'other_squad_metric : ', valid_result)
            


**Comment 5.1**

Use Transformer Encoder.

This model is hard to train. When its size is similar to the baseline (Prob 4.2, bi-LSTM + Att + bi-LSTM), it is too small to train well; when it is larger than that, it overfits very quickly.
\
I tried to address this with hyperparameter tuning, L2 regularization, various optimizers, label smoothing, and so on, but the results are still not good (worse than the bi-LSTM), and I have run out of time.
\
I think this classification task (performed twice, for the start and end positions) on a small dataset (~80 thousand) with long sequences (~256) is difficult to solve using only a small Transformer encoder.
\
Please comment and let me know if there is a good approach I haven't found.
\
Anyway, I enjoyed this assignment a lot. Thanks. :)
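
One detail that may be worth checking, offered as a sketch rather than a diagnosis: the training loop above calls `model(context)` without a `src_mask`, so `attention` never masks `<pad>` positions and padded tokens receive attention weight. The snippet below re-declares a dropout-free copy of `attention` and compares the weights with and without a padding mask. The value `PAD_IDX = 1` and the toy tensors are assumptions for illustration only; in the notebook the mask would be built from `vocab.stoi['<pad>']` and passed as `model(context, src_mask)`, assuming the `Encoder` defined earlier forwards the mask to each layer.

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    "Scaled dot-product attention, same masking rule as the cell above."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn

PAD_IDX = 1  # assumed padding id, for illustration only
tokens = torch.tensor([[5, 8, 3, PAD_IDX, PAD_IDX]])  # (batch=1, len=5)

# Shape (batch, 1, len): broadcasts over the query dimension of the scores.
src_mask = (tokens != PAD_IDX).unsqueeze(-2)

torch.manual_seed(0)
x = torch.randn(1, 5, 4)  # stand-in for embedded tokens

_, p_nomask = attention(x, x, x)
_, p_mask = attention(x, x, x, mask=src_mask)

print("weight on pad positions, no mask:", p_nomask[0, 0, 3:].tolist())
print("weight on pad positions, masked :", p_mask[0, 0, 3:].tolist())
```

With the mask, the softmax assigns no weight to the two padded key positions, while each row still sums to 1.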

Before (Prob 4.2) : 
```
train:: Epoch: 0050 cost = 0.026692, other_squad_metric :  {'exact_match': 20.587945879458793, 'f1': 32.11980032343598}
valid:: Epoch: 0050 cost = 0.042287, other_squad_metric :  {'exact_match': 11.410556290043731, 'f1': 22.092787217705098}
```

After (Prob 5.1) : 
```
Worse performance than before, with overfitting.
```
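
A second thing that might affect the validation numbers: the validation loop runs under `torch.no_grad()` but never calls `model.eval()`, and `torch.no_grad()` only disables gradient tracking, not dropout. With `dropout = 0.5`, half of the activations are still zeroed at random during validation. The toy layer below is an assumption for illustration (it is not the notebook's model); it only shows the difference between the two modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for a model that contains dropout.
layer = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.ones(4, 8)

# Training mode: dropout is active even inside torch.no_grad().
layer.train()
with torch.no_grad():
    a = layer(x)
    b = layer(x)
train_outputs_differ = not torch.equal(a, b)

# Evaluation mode: dropout is a no-op, so outputs are deterministic.
layer.eval()
with torch.no_grad():
    c = layer(x)
    d = layer(x)
eval_outputs_match = torch.equal(c, d)

layer.train()  # switch back before the next training epoch

print("train-mode outputs differ:", train_outputs_differ)
print("eval-mode outputs match  :", eval_outputs_match)
```

In the training script this would correspond to calling `model.eval()` before the validation loop and `model.train()` at the start of each epoch. Whether it changes the reported scores has not been measured here.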

**Result 5.1**
```
device :  cuda
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:208: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
src_vocab :  86390
N :  2
d_model :  256
d_ff :  1024
h :  8
dropout :  0.5
smoothing :  0.1
factor :  1
warmup_step :  4000
40%
20/50 [34:49<56:26, 109.23s/it]
/usr/local/lib/python3.7/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
train:: Epoch: 0001 cost = 8.253701, other_squad_metric :  {'exact_match': 0.023370233702337023, 'f1': 2.935981514807625}
valid:: Epoch: 0001 cost = 7.393155, other_squad_metric :  {'exact_match': 0.10169836265636123, 'f1': 3.3971864145653523}
train:: Epoch: 0002 cost = 7.102990, other_squad_metric :  {'exact_match': 0.14391143911439114, 'f1': 3.2376178820137933}
valid:: Epoch: 0002 cost = 7.180850, other_squad_metric :  {'exact_match': 0.24407607037526696, 'f1': 3.573024803347468}
train:: Epoch: 0003 cost = 6.892999, other_squad_metric :  {'exact_match': 0.4034440344403444, 'f1': 3.714113833715695}
valid:: Epoch: 0003 cost = 7.109185, other_squad_metric :  {'exact_match': 0.2542459066409031, 'f1': 3.5770800842095762}
train:: Epoch: 0004 cost = 6.636440, other_squad_metric :  {'exact_match': 0.6469864698646987, 'f1': 4.302447070499241}
valid:: Epoch: 0004 cost = 6.916278, other_squad_metric :  {'exact_match': 0.11186819892199736, 'f1': 3.717694980715409}
train:: Epoch: 0005 cost = 6.412479, other_squad_metric :  {'exact_match': 0.947109471094711, 'f1': 4.734920028223306}
valid:: Epoch: 0005 cost = 6.926673, other_squad_metric :  {'exact_match': 0.28475541543781147, 'f1': 3.677312838015529}
train:: Epoch: 0006 cost = 6.261926, other_squad_metric :  {'exact_match': 1.1439114391143912, 'f1': 5.098255703770199}
valid:: Epoch: 0006 cost = 6.944342, other_squad_metric :  {'exact_match': 0.27458557917217535, 'f1': 3.8953411330630074}
train:: Epoch: 0007 cost = 6.134191, other_squad_metric :  {'exact_match': 1.2988929889298892, 'f1': 5.4574991573947225}
valid:: Epoch: 0007 cost = 7.101853, other_squad_metric :  {'exact_match': 0.2949252517034476, 'f1': 3.7357224677152314}
train:: Epoch: 0008 cost = 5.973475, other_squad_metric :  {'exact_match': 1.5313653136531364, 'f1': 5.847280898695564}
valid:: Epoch: 0008 cost = 6.972993, other_squad_metric :  {'exact_match': 0.19322688904708635, 'f1': 3.672133882086412}
train:: Epoch: 0009 cost = 5.821159, other_squad_metric :  {'exact_match': 1.7749077490774907, 'f1': 6.234060815818788}
valid:: Epoch: 0009 cost = 7.209159, other_squad_metric :  {'exact_match': 0.27458557917217535, 'f1': 3.654622558678816}
train:: Epoch: 0010 cost = 5.686322, other_squad_metric :  {'exact_match': 2.06150061500615, 'f1': 6.5751845049001405}
valid:: Epoch: 0010 cost = 7.219692, other_squad_metric :  {'exact_match': 0.24407607037526696, 'f1': 3.6828197942077256}
train:: Epoch: 0011 cost = 5.560183, other_squad_metric :  {'exact_match': 2.27060270602706, 'f1': 6.92478130217407}
valid:: Epoch: 0011 cost = 7.325802, other_squad_metric :  {'exact_match': 0.2542459066409031, 'f1': 3.968983027781578}
train:: Epoch: 0012 cost = 5.451693, other_squad_metric :  {'exact_match': 2.4981549815498156, 'f1': 7.162360337287207}
valid:: Epoch: 0012 cost = 7.579209, other_squad_metric :  {'exact_match': 0.28475541543781147, 'f1': 3.6451924468858}
train:: Epoch: 0013 cost = 5.332609, other_squad_metric :  {'exact_match': 2.676506765067651, 'f1': 7.428378856795247}
valid:: Epoch: 0013 cost = 7.512624, other_squad_metric :  {'exact_match': 0.2237363978439947, 'f1': 3.6873263706079573}
train:: Epoch: 0014 cost = 5.237840, other_squad_metric :  {'exact_match': 3.0, 'f1': 7.68477950372832}
valid:: Epoch: 0014 cost = 7.641328, other_squad_metric :  {'exact_match': 0.32543476050035597, 'f1': 3.928084102818584}
train:: Epoch: 0015 cost = 5.138872, other_squad_metric :  {'exact_match': 3.108241082410824, 'f1': 7.852693449119078}
valid:: Epoch: 0015 cost = 7.725737, other_squad_metric :  {'exact_match': 0.32543476050035597, 'f1': 4.107738959662916}
train:: Epoch: 0016 cost = 5.068342, other_squad_metric :  {'exact_match': 3.3640836408364083, 'f1': 8.123721326258412}
valid:: Epoch: 0016 cost = 7.776266, other_squad_metric :  {'exact_match': 0.2237363978439947, 'f1': 3.644713678466444}
train:: Epoch: 0017 cost = 4.977685, other_squad_metric :  {'exact_match': 3.5596555965559658, 'f1': 8.384191372175446}
valid:: Epoch: 0017 cost = 7.823332, other_squad_metric :  {'exact_match': 0.2237363978439947, 'f1': 3.4999117223788367}
train:: Epoch: 0018 cost = 4.903867, other_squad_metric :  {'exact_match': 3.782287822878229, 'f1': 8.668882795267757}
valid:: Epoch: 0018 cost = 7.826819, other_squad_metric :  {'exact_match': 0.28475541543781147, 'f1': 3.9372846285464527}
train:: Epoch: 0019 cost = 4.839563, other_squad_metric :  {'exact_match': 3.997539975399754, 'f1': 8.858334616538366}
valid:: Epoch: 0019 cost = 8.060945, other_squad_metric :  {'exact_match': 0.27458557917217535, 'f1': 3.8723254167929637}
train:: Epoch: 0020 cost = 4.764492, other_squad_metric :  {'exact_match': 4.148831488314883, 'f1': 9.041191018972658}
valid:: Epoch: 0020 cost = 8.070532, other_squad_metric :  {'exact_match': 0.3050950879690837, 'f1': 3.8264716532301795}
```
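
To make the logged `factor` and `warmup_step` values concrete, the sketch below evaluates the same formula as `NoamOpt.rate` for `d_model = 256`. The function name `noam_rate` is introduced here for illustration only. The rate rises linearly during warmup, peaks at step 4000 at just under 1e-3, and then decays with the inverse square root of the step.

```python
def noam_rate(step, d_model=256, factor=1, warmup=4000):
    "Same expression as NoamOpt.rate; step must be >= 1."
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Learning rate at a few steps: warmup, peak, and decay.
for step in (1, 1000, 4000, 16000):
    print(step, noam_rate(step))

peak = noam_rate(4000)  # the two branches of min() meet at step == warmup
```

Because the peak scales with `d_model ** -0.5`, a smaller model gets a larger peak learning rate under this schedule, so `factor` is the knob to turn if the peak needs to be lowered.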


**Problem 5.2 (bonus)** *(10 points)* Replace Transformer's sinusoidal position encoding with a fixed-length (of 256) position embedding. That is, you will create a 256-by-$d$ trainable parameter matrix for the position encoding that replaces the variable-length sinusoidal encoding. What is the clear disadvantage of this approach? Report the accuracy and compare it with 5.1. Note that this also has a clear advantage, as we will see in our future lecture on Pretrained Language Model, and more specifically, BERT (Devlin et al., 2018).

**Answer 5.2**



In [None]:
# Change PositionalEncoding layer


class TrainablePositionalEncoding(nn.Module):
    "Trainable position embedding that replaces the sinusoidal PE."
    def __init__(self, d_model, dropout, max_len=256):
        super(TrainablePositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Register the table as nn.Parameter so that it moves with .to(device)
        # and is returned by model.parameters(); a plain tensor with
        # requires_grad=True would stay on the CPU and never be optimized.
        # len(context) + len(question) < max_len * 2 here
        self.pe = nn.Parameter(torch.zeros(1, max_len * 2, d_model))

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


def make_encoder_with_trainable_pe(src_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = TrainablePositionalEncoding(d_model, dropout)
    model = EncoderClassification(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        Generator(d_model)
        )
    
    # This was important from their code. 
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device : ", device)

# Construct the Transformer Encoder
src_vocab = len(vocab_list) # vocabulary size
N = 2
d_model = 256
d_ff = 1024
h = 8
dropout = 0.5
smoothing = 0.1

model = make_encoder_with_trainable_pe(src_vocab, N, d_model, d_ff, h, dropout).to(device)


print("src_vocab : ", src_vocab)
print("N : ", N)
print("d_model : ", d_model)
print("d_ff : ", d_ff)
print("h : ", h)
print("dropout : ", dropout)
print("smoothing : ", smoothing)

# Construct the data loader
train_iter = loader.train_iter
valid_iter = loader.valid_iter

train_id_list = loader.train_id_list
valid_id_list = loader.valid_id_list
train_reference = loader.train_reference
valid_reference = loader.valid_reference

# Training
# learning_rate = 2e-5
# weight_decay = 1e-3
factor = 1
warmup_step = 4000
# print("learning_rate : ", learning_rate)
# print("weight_decay : ", weight_decay)
print("factor : ", factor)
print("warmup_step : ", warmup_step)

PAD_IDX = vocab.stoi['<pad>']
cel = nn.CrossEntropyLoss(ignore_index=PAD_IDX) # Ignore Padding
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)
# optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
model_opt = NoamOpt(model.src_embed[0].d_model, factor, warmup_step,
        torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

epochs = 50
max_norm = 5

# Evaluate
squad_metric = load_metric('squad')

for epoch in tqdm(range(epochs)):
    train_loss = 0
    train_accuracy = 0.0
    train_data_num = 0
    train_prediction = list()
    for train_i, train_batch in enumerate(train_iter):
        context, context_length = train_batch.context_question
        answer_start = train_batch.answer_start
        answer_end = train_batch.answer_end
        train_id_index = train_batch.id_index

        logits_start, logits_end = model(context)

        model_opt.optimizer.zero_grad() # reset process
        # loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
        label_smoothing = LabelSmoothing(logits_start.size(1), PAD_IDX, smoothing=smoothing)
        loss = label_smoothing(F.log_softmax(logits_start, dim=-1), answer_start) + label_smoothing(F.log_softmax(logits_end, dim=-1), answer_end)

        loss.backward() # compute gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm) # gradient clipping
        model_opt.step() # update parameters
        train_loss += loss.item()
        
        _, train_start_preds = torch.max(logits_start, 1)
        _, train_end_preds = torch.max(logits_end, 1)
        # train_accuracy += ((train_start_preds == answer_start) * (train_end_preds == answer_end)).sum().float()

        train_data_num += context.shape[0]

        for train_j in range(context.shape[0]):
            pred_text = ""
            start = train_start_preds[train_j]
            end = train_end_preds[train_j]
            if start < end:
                pred_text = [vocab_list[text_id] for text_id in context[train_j][start:end+1]]
                pred_text = " ".join(pred_text)

            train_prediction.append({'id':train_id_list[train_id_index[train_j]], 'prediction_text':pred_text})

    train_result = squad_metric.compute(predictions=train_prediction, references=train_reference)
    print('train:: Epoch:', '%04d' % (epoch + 1), 
          'cost =', '{:.6f},'.format(train_loss / train_data_num), 
        #   'my exact_match =', '{:.6f}'.format(train_accuracy / train_data_num),
          'other_squad_metric : ', train_result)
        
    if (epoch + 1) % 1 == 0:
        with torch.no_grad():
            valid_loss = 0
            valid_accuracy = 0.0
            valid_data_num = 0
            valid_prediction = list()
            for valid_i, valid_batch in enumerate(valid_iter):
                context, context_length = valid_batch.context_question
                answer_start = valid_batch.answer_start
                answer_end = valid_batch.answer_end
                valid_id_index = valid_batch.id_index

                logits_start, logits_end = model(context)

                # loss = cel(logits_start, answer_start) + cel(logits_end, answer_end) # Loss, a.k.a L
                label_smoothing = LabelSmoothing(logits_start.size(1), PAD_IDX, smoothing=smoothing)
                loss = label_smoothing(F.log_softmax(logits_start, dim=-1), answer_start) + label_smoothing(F.log_softmax(logits_end, dim=-1), answer_end)
                valid_loss += loss.item()

                _, valid_start_preds = torch.max(logits_start, 1)
                _, valid_end_preds = torch.max(logits_end, 1)
                # valid_accuracy += ((valid_start_preds == answer_start) * (valid_end_preds == answer_end)).sum().float()

                valid_data_num += context.shape[0]

                for valid_j in range(context.shape[0]):
                    pred_text = ""
                    start = valid_start_preds[valid_j]
                    end = valid_end_preds[valid_j]
                    if start < end:
                        pred_text = [vocab_list[text_id] for text_id in context[valid_j][start:end+1]]
                        pred_text = " ".join(pred_text)
                    valid_prediction.append({'id':valid_id_list[valid_id_index[valid_j]], 'prediction_text':pred_text})
                
            valid_result = squad_metric.compute(predictions=valid_prediction, references=valid_reference)
            print('valid:: Epoch:', '%04d' % (epoch + 1), 
                  'cost =', '{:.6f},'.format(valid_loss / valid_data_num), 
                #   'my exact_match =', '{:.6f},'.format(valid_accuracy / valid_data_num),
                  'other_squad_metric : ', valid_result)
            

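**Comment 5.2**

The code is written above.

Disadvantage: training cost, since the position table is an extra set of parameters that must be learned from data. The table also has a fixed length, so inputs longer than it cannot be encoded. \
Advantage: after pretraining, the learned positional encoding can be reused on other tasks. A pretrained model (in this case, the positional encoding) dramatically reduces training time on the downstream task.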
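
As a minimal sketch of the fixed-length trade-off, the class below stores positions in an `nn.Embedding` table. This is an alternative formulation, not the class used in the cell above, and the name `LearnedPositionEmbedding` and the small `d_model=16` are assumptions for illustration. It shows that the table adds `max_len * d_model` trainable parameters and that a sequence longer than `max_len` has no position vector to look up.

```python
import torch
import torch.nn as nn

class LearnedPositionEmbedding(nn.Module):
    "Hypothetical learned position table of fixed length max_len."
    def __init__(self, d_model, max_len=256, dropout=0.1):
        super().__init__()
        self.max_len = max_len
        self.pos = nn.Embedding(max_len, d_model)  # trainable (max_len, d_model)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        seq_len = x.size(1)
        if seq_len > self.max_len:
            # Unlike the sinusoidal encoding, nothing exists past max_len.
            raise ValueError(f"sequence length {seq_len} exceeds {self.max_len}")
        positions = torch.arange(seq_len, device=x.device)
        # (batch, seq_len, d_model) + (seq_len, d_model) broadcasts over batch.
        return self.dropout(x + self.pos(positions))

layer = LearnedPositionEmbedding(d_model=16, max_len=256)
n_params = sum(p.numel() for p in layer.parameters())
print("trainable position parameters:", n_params)

out = layer(torch.zeros(2, 256, 16))  # exactly max_len positions: fine
try:
    layer(torch.zeros(2, 257, 16))    # one position too many
    too_long_rejected = False
except ValueError:
    too_long_rejected = True
print("length 257 rejected:", too_long_rejected)
```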