# KAIST AI605 Assignment 3: Transformer

TA in charge: Miyoung Ko (miyoungko@kaist.ac.kr)

**Due date**:  May 17 (Tue) 11:00pm, 2022  


## Your Submission
If you are a KAIST student, you will submit your assignment via [KLMS](https://klms.kaist.ac.kr). If you are a NAVER student, you will submit via [Google Form](https://forms.gle/FSng5HUwtQinTFAU8). 

You need to submit both (1) an .ipynb file (it must be fully executable on Colab), and (2) a PDF of the file.

Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed, but this is not a group assignment, so make sure your answer and code are your own. Make sure to mention your collaborators in your assignment with their names and student IDs.

## Grading
The entire assignment is out of 20 points. For every late day, 2 points will be deducted from your grade (KAIST students only). You can use one of your no-penalty late days (7 days in total). Make sure to mention this in your submission. You will receive a grade of zero if you submit after 7 days.


## Environment
You will need Python 3.7+ and PyTorch 1.9+, which are already available on Colab:

In [None]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.13
torch 1.11.0+cu113


## 1. Attention Layer

We will first go over a few concepts that you learned in your high school statistics class. The variance of a random variable $X$, $\text{Var}(X)$, is defined as $\text{E}[(X-\mu)^2]$, where $\mu$ is the mean of $X$. Furthermore, given two independent random variables $X$ and $Y$ and a constant $a$,
$$ \text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y),$$
$$ \text{Var}(aX) = a^2\text{Var}(X),$$
$$ \text{Var}(XY) = \text{E}(X^2)\text{E}(Y^2) - [\text{E}(X)]^2[\text{E}(Y)]^2.$$

> **Problem 1.1** *(3 points)* Suppose we are given two sets of $n$ random variables, $X_1 \dots X_n$ and $Y_1 \dots Y_n$, where all of these $2n$ variables are mutually independent and have a mean of $0$ and a variance of $1$. Prove that
$$\text{Var}\left(\sum_i^n X_i Y_i\right) = n.$$

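ANSWER

Since all $2n$ variables are mutually independent, the products $X_iY_i$ are independent of each other, so the variance of the sum is the sum of the variances:
$$ \text{Var}\left(\sum_i^n X_i Y_i\right) = \text{Var}(X_1Y_1) + \text{Var}(X_2Y_2) + \dots + \text{Var}(X_nY_n). $$

For a single product of independent $X$ and $Y$,
$$ \text{Var}(XY) = \text{E}(X^2)\text{E}(Y^2) - [\text{E}(X)]^2[\text{E}(Y)]^2. $$

From $\text{Var}(X) = \text{E}(X^2) - [\text{E}(X)]^2$ we get
$$ \text{E}(X^2)\text{E}(Y^2) = \left(\text{Var}(X) + [\text{E}(X)]^2\right)\left(\text{Var}(Y) + [\text{E}(Y)]^2\right), $$
so
$$ \text{Var}(XY) = \left(\text{Var}(X) + [\text{E}(X)]^2\right)\left(\text{Var}(Y) + [\text{E}(Y)]^2\right) - [\text{E}(X)]^2[\text{E}(Y)]^2. $$

Because $\text{E}(X) = \text{E}(Y) = 0$, this reduces to
$$ \text{Var}(XY) = \text{Var}(X)\,\text{Var}(Y) \qquad \text{eq (1)} $$

With $\text{Var}(X) = \text{Var}(Y) = 1$ we have $\text{Var}(XY) = 1$, and therefore
$$ \text{Var}\left(\sum_i^n X_i Y_i\right) = 1 + 1 + \dots + 1 = n. $$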
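
As a quick numerical sanity check of this result, the variance can also be estimated by simulation. The sketch below is not part of the assignment: it uses only the Python standard library, and the helper name `sum_of_products_variance`, the trial count, and the seed are illustrative assumptions.

```python
import random


def sum_of_products_variance(n, trials=20000, seed=0):
    """Estimate Var(sum_i X_i * Y_i) for independent standard normal X_i, Y_i."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        # One draw of the sum of n independent products.
        s = sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(n))
        samples.append(s)
    mean = sum(samples) / trials
    # Variance estimate around the sample mean.
    return sum((s - mean) ** 2 for s in samples) / trials


estimates = {n: sum_of_products_variance(n) for n in (4, 16)}
for n, est in estimates.items():
    print(n, round(est, 2))
```

The printed estimates should land close to $n$, up to sampling noise.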


In Lectures 11 and 12, we discussed how attention is computed in the Transformer via the following equation,
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.$$
> **Problem 1.2** *(3 points)*  Suppose $Q$ and $K$ are matrices of independent random variables, each of which has a mean of $0$ and a variance of $1$. Using what you learned from Problem 1.1, show that (lazily defining the equality to be element-wise comparison)
$$\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = 1.$$


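$QK^\top$ is a matrix product in which each element is a sum of $d_k$ products of independent random variables, $(QK^\top)_{ij} = \sum_{l=1}^{d_k} Q_{il}K_{jl}$. By Problem 1.1 with $n = d_k$, every element of $QK^\top$ has variance $d_k$. Applying $\text{Var}(aX) = a^2\text{Var}(X)$ with $a = 1/\sqrt{d_k}$ gives

$$ \text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = \frac{\text{Var}(QK^\top)}{d_k} = \frac{d_k}{d_k} = 1 \qquad \text{eq (2)} $$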



> **Problem 1.3** *(2 points)* What would happen if the assumption that the variance of $Q$ and $K$ is $1$ does not hold? Consider each case of it being higher and lower than $1$ and conjecture what it implies, respectively.


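Using the following equation (which still assumes zero means):
$$ \text{Var}(XY) = \text{Var}(X)\,\text{Var}(Y) \qquad \text{eq (1)} $$

we can see that if the elements of $Q$ and $K$ have variances $\text{Var}(X)$ and $\text{Var}(Y)$ that are lower or higher than $1$, the variance of a sum of $n = d_k$ products becomes
$n \times \text{Var}(X) \times \text{Var}(Y)$.

Hence eq (2) no longer holds, and
$$\text{Var}\left(\frac{QK^\top}{\sqrt{d_k}}\right) = \text{Var}(X) \times \text{Var}(Y).$$

Conjecture: if the variances are higher than $1$, the scaled scores spread out, the softmax saturates toward a near one-hot distribution, and its gradients become very small. If the variances are lower than $1$, the scores cluster around zero, the softmax approaches a uniform distribution, and attention barely distinguishes between positions.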
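
To see what a higher or lower score variance could imply for the attention weights, one can pass a fixed score vector through a softmax at different scales. This is a minimal sketch with made-up scores (`base_scores` is an assumption, not data from the assignment); multiplying the scores by $c$ multiplies their variance by $c^2$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


base_scores = [1.0, 0.5, -0.5, -1.0]  # illustrative values only

peaks = {}
for scale in (0.1, 1.0, 10.0):
    probs = softmax([scale * s for s in base_scores])
    peaks[scale] = max(probs)  # largest attention weight at this scale
    print(scale, [round(p, 3) for p in probs])
```

With a small scale the weights stay close to uniform, while with a large scale almost all of the weight goes to the largest score. This is consistent with the usual motivation for dividing by $\sqrt{d_k}$.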

> **Problem 1.4** *(2 points)* Now it is time to experimentally verify the theory! Create the random variables $X$ and $Y$ with the mean of zero and the variance of one (using `torch.randn`) and verify the equation in Problem 1.2. Then experiment with higher and lower variance to verify your finding in Problem 1.3. Briefly explain your results.

In [None]:
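n = 100  # here d_k = n, so dividing the variance by n applies the 1/sqrt(d_k) scaling

# Unit variance: Var(QK^T / sqrt(n)) should be close to 1.
Q_unitVar = torch.empty([n, n]).normal_(mean=0, std=1)
K_unitVar = torch.empty([n, n]).normal_(mean=0, std=1)

print(torch.var(Q_unitVar@K_unitVar.T)/n)

# Lower variance: std 0.1, i.e. variance 0.01 for both Q and K.
Q_lowVar = torch.empty([n, n]).normal_(mean=0, std=0.1)
K_lowVar = torch.empty([n, n]).normal_(mean=0, std=0.1)

print(torch.var(Q_lowVar@K_lowVar.T)/n)

# Higher variance: std 10, i.e. variance 100 for both Q and K.
Q_highVar = torch.empty([n, n]).normal_(mean=0, std=10)
K_highVar = torch.empty([n, n]).normal_(mean=0, std=10)

print(torch.var(Q_highVar@K_highVar.T)/n)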

tensor(1.0000)
tensor(9.9986e-05)
tensor(9998.7422)


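The results agree with the theory. With unit variance, the scaled scores have a variance close to $1$. With a standard deviation of $0.1$ (variance $0.01$), the variance drops to about $10^{-4} = 0.01 \times 0.01$, and with a standard deviation of $10$ (variance $100$), it rises to about $10^{4} = 100 \times 100$. In each case the measured value matches $\text{Var}(X) \times \text{Var}(Y)$.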

## 2. Transformer for Spelling Error Correction

In this section, you will implement a Transformer for a few tasks that are simpler than machine translation. Feel free to copy and paste from [The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/) (note that this is a new version released in 2022), though make sure to mention that you copied the code from it. Note that we do not provide separate training or evaluation data, so it is your job to create these in a reasonable manner.


### Copied Code

In [None]:
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


warnings.filterwarnings("ignore")
# Set to False to skip notebook execution (e.g. for debugging)
RUN_EXAMPLES = True

In [None]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


In [None]:
class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

In [None]:
def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

In [None]:
class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

In [None]:
class LayerNorm(nn.Module):
    "Construct a layernorm module (See citation for details)."

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2
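
To make the normalization concrete, here is a small standard-library sketch of what the `forward` above computes for a single feature vector when `a_2` is all ones and `b_2` is all zeros. The helper name `layer_norm` is hypothetical, and `statistics.stdev` is used because it is the sample standard deviation, which is what `torch.std` returns by default.

```python
import statistics


def layer_norm(x, eps=1e-6):
    """Normalize one feature vector, assuming a_2 = 1 and b_2 = 0."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)  # sample standard deviation (divides by n - 1)
    return [(v - mean) / (std + eps) for v in x]


normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 4) for v in normalized])
```

The output has mean zero and a sample standard deviation very close to one; the small gap comes from `eps`.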

In [None]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

In [None]:
class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        "Follow Figure 1 (left) for connections."
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        return self.sublayer[1](x, self.feed_forward)

In [None]:
class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

In [None]:
class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

In [None]:
def subsequent_mask(size):
    "Mask out subsequent positions."
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(
        torch.uint8
    )
    return subsequent_mask == 0
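
The shape of this mask is easier to see on a small example. The following pure-Python sketch builds the same boolean pattern for one sequence without the leading batch dimension; `causal_mask` is a hypothetical helper, not part of the copied code.

```python
def causal_mask(size):
    """Row i is True at column j when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


for row in causal_mask(4):
    # "x" marks an allowed position, "." a masked (future) position.
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row allows the current position and everything before it, which is the lower-triangular pattern that `subsequent_mask(size)[0]` encodes.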

In [None]:
def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
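
For intuition, the same computation can be written out on plain Python lists for a single sequence and a single head. This is only an illustrative sketch: `tiny_attention` and the toy vectors are assumptions, and dropout is left out.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def tiny_attention(query, key, value, mask=None):
    """Scaled dot-product attention on lists of vectors (one sequence, one head)."""
    d_k = len(query[0])
    weights = []
    for i, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in key]
        if mask is not None:
            # Same convention as above: masked positions get a large negative score.
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[c] for w, v in zip(row, value)) for c in range(len(value[0]))]
        for row in weights
    ]
    return output, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = [[j <= i for j in range(3)] for i in range(3)]  # causal mask
out, p_attn = tiny_attention(x, x, x, mask)
print(p_attn[0])  # the first position can only attend to itself
```

Every row of `p_attn` sums to one, and masked positions receive zero weight.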

In [None]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)

In [None]:
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

In [None]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        return self.lut(x) * math.sqrt(self.d_model)

In [None]:
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
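
The sinusoidal table built in `__init__` can be reproduced for a single position with the standard library. The sketch below is an illustration under that assumption (the helper name `positional_encoding` is hypothetical); it follows the same formula, with sine on even indices and cosine on odd indices.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd ones."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        # Same frequency term as div_term in the class above.
        angle = pos * math.exp(-i * math.log(10000.0) / d_model)
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe


print([round(v, 3) for v in positional_encoding(0, 8)])
print([round(v, 3) for v in positional_encoding(1, 8)])
```

Position 0 alternates between 0 and 1 because every angle is zero there; later positions rotate each sine and cosine pair at a different frequency.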

In [None]:
class Batch:
    """Object for holding a batch of data with mask during training."""

    def __init__(self, src, tgt=None, pad=2):  # 2 = <blank>
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if tgt is not None:
            self.tgt = tgt[:, :-1]
            self.tgt_y = tgt[:, 1:]
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            self.ntokens = (self.tgt_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(
            tgt_mask.data
        )
        return tgt_mask

In [None]:
class TrainState:
    """Track number of steps, examples, and tokens processed"""

    step: int = 0  # Steps in the current epoch
    accum_step: int = 0  # Number of gradient accumulation steps
    samples: int = 0  # total # of examples used
    tokens: int = 0  # total # of tokens processed

In [None]:
def rate(step, model_size, factor, warmup):
    """
    we have to default the step to 1 for LambdaLR function
    to avoid zero raising to negative power.
    """
    if step == 0:
        step = 1
    return factor * (
        model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))
    )
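
To get a feel for this schedule, the sketch below evaluates it at a few steps. It repeats the function so that it runs on its own, and it assumes the settings used in the training cells of this notebook (`model_size=512`, `factor=1.0`, `warmup=400`, and a base learning rate of 0.5 passed to Adam); the choice of steps is arbitrary.

```python
def rate(step, model_size, factor, warmup):
    # Copy of the schedule above so the snippet is self-contained.
    if step == 0:
        step = 1
    return factor * (
        model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))
    )


base_lr = 0.5  # base learning rate given to Adam in the training cells
for step in (1, 2, 200, 400, 800, 4000):
    lr = base_lr * rate(step, model_size=512, factor=1.0, warmup=400)
    print(step, "%.1e" % lr)
```

The learning rate grows linearly during warmup, peaks at `step == warmup`, and then decays proportionally to the inverse square root of the step.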

In [None]:
class LabelSmoothing(nn.Module):
    "Implement label smoothing."

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist.clone().detach())
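
The target distribution that `forward` builds can be illustrated for a single token in plain Python. This is a sketch with a hypothetical helper `smoothed_target`; it mirrors the order of operations above (fill, set the confidence, zero the padding column, zero padded rows).

```python
def smoothed_target(size, target, padding_idx, smoothing):
    """Smoothed target distribution for one token, as a list of probabilities."""
    if target == padding_idx:
        return [0.0] * size  # padded positions contribute nothing to the loss
    dist = [smoothing / (size - 2)] * size  # spread mass over non-target, non-pad classes
    dist[target] = 1.0 - smoothing
    dist[padding_idx] = 0.0
    return dist


dist = smoothed_target(size=5, target=2, padding_idx=0, smoothing=0.4)
print([round(p, 4) for p in dist])
```

The division by `size - 2` excludes the target class and the padding class, so the remaining probabilities still sum to one.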


In [None]:
def loss(x, crit):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d]])
    return crit(predict.log(), torch.LongTensor([1])).data


def penalization_visualization():
    crit = LabelSmoothing(5, 0, 0.1)
    loss_data = pd.DataFrame(
        {
            "Loss": [loss(x, crit) for x in range(1, 100)],
            "Steps": list(range(99)),
        }
    ).astype("float")

    return (
        alt.Chart(loss_data)
        .mark_line()
        .properties(width=350)
        .encode(
            x="Steps",
            y="Loss",
        )
        .interactive()
    )



In [None]:
# Some convenience helper functions used throughout the notebook


def is_interactive_notebook():
    return __name__ == "__main__"


def show_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        return fn(*args)


def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)


class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        self.param_groups = [{"lr": 0}]
        None

    def step(self):
        None

    def zero_grad(self, set_to_none=False):
        None


class DummyScheduler:
    def step(self):
        None

In [None]:
def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

In [None]:
def run_epoch(
    data_iter,
    model,
    loss_compute,
    optimizer,
    scheduler,
    mode="train",
    accum_iter=1,
    train_state=TrainState(),
    epochID=None
):
    """Train a single epoch"""
    start = time.time()
    total_tokens = 0
    total_loss = 0
    tokens = 0
    n_accum = 0
    for i, batch in enumerate(data_iter):
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        # loss_node = loss_node / accum_iter
        if mode == "train" or mode == "train+log":
            loss_node.backward()
            train_state.step += 1
            train_state.samples += batch.src.shape[0]
            train_state.tokens += batch.ntokens
            if i % accum_iter == 0:
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                n_accum += 1
                train_state.accum_step += 1
            scheduler.step()

        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        if i % 40 == 1 and (mode == "train" or mode == "train+log"):
            lr = optimizer.param_groups[0]["lr"]
            elapsed = time.time() - start
            print(
                (
                    "Epoch Step: %6d | Batch Number: %6d | Accumulation Step: %3d | Loss: %6.2f "
                    + "| Tokens / Sec: %7.1f | Learning Rate: %6.1e"
                )
                % (epochID, i, n_accum, loss / batch.ntokens, tokens / elapsed, lr)
            )
            start = time.time()
            tokens = 0
        del loss
        del loss_node
    return total_loss / total_tokens, train_state

In [None]:
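def data_gen(V, batch_size, nbatches, seq_len=32):
    "Generate random data for a src-tgt copy task."
    # `device` is the global defined in the Problem 2.1 cell below.
    for i in range(nbatches):
        data = torch.randint(0, V, size=(batch_size, seq_len)).to(device)
        data[:, 0] = 0  # symbol 0 doubles as the start symbol
        src = data.requires_grad_(False).clone().detach()
        tgt = data.requires_grad_(False).clone().detach()
        # pad=0: any sampled 0 is treated as padding by the masks and the loss
        yield Batch(src, tgt, 0)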

In [None]:
class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion):
        self.generator = generator
        self.criterion = criterion

    def __call__(self, x, y, norm):
        x = self.generator(x)
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm
        )
        return sloss.data * norm, sloss

In [None]:
def greedy_decode(model, src, src_mask, max_len, start_symbol, end_symbol=None):
    memory = model.encode(src, src_mask)
    ys = torch.zeros(1, 1).fill_(start_symbol).type_as(src.data)
    i = 0
    next_word = 0
    while next_word != end_symbol and i < max_len-1:
        out = model.decode(memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data))
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat([ys, torch.zeros(1, 1).type_as(src.data).fill_(next_word)], dim=1)
        i += 1
    return ys

### **Problem 2.1** *(3 points)* 
> Create a model that takes a random set of input symbols from a vocabulary of digits (i.e. 0, 1, ... , 8, 9) as the input and generate back the same symbols. Instead of varying length, we fix the length to 32. Make sure to report that your model's sequence-level (not token level) accuracy goes above 90%. Note that a similar problem is also in Annotated Transformer, and copying the code is allowed.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [None]:
# Train the simple copy task.
V = 10
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)

optimizer = torch.optim.Adam(
    model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
)
lr_scheduler = LambdaLR(
    optimizer=optimizer,
    lr_lambda=lambda step: rate(
        step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
    ),
)

batch_size = 80
model = model.to(device)

for epoch in range(30):
    model.train()
    run_epoch(
        data_gen(V, batch_size, 30, 32),
        model,
        SimpleLossCompute(model.generator, criterion),
        optimizer,
        lr_scheduler,
        mode="train",
        epochID=epoch+1
    )
    

Epoch Step:      1 | Batch Number:      1 | Accumulation Step:   2 | Loss:   3.05 | Tokens / Sec:  9058.6 | Learning Rate: 5.5e-06
Epoch Step:      2 | Batch Number:      1 | Accumulation Step:   2 | Loss:   2.17 | Tokens / Sec: 12228.1 | Learning Rate: 8.8e-05
Epoch Step:      3 | Batch Number:      1 | Accumulation Step:   2 | Loss:   2.05 | Tokens / Sec: 12272.9 | Learning Rate: 1.7e-04
Epoch Step:      4 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.99 | Tokens / Sec: 12303.4 | Learning Rate: 2.5e-04
Epoch Step:      5 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.55 | Tokens / Sec: 12305.7 | Learning Rate: 3.4e-04
Epoch Step:      6 | Batch Number:      1 | Accumulation Step:   2 | Loss:   0.86 | Tokens / Sec: 10657.7 | Learning Rate: 4.2e-04
Epoch Step:      7 | Batch Number:      1 | Accumulation Step:   2 | Loss:   0.27 | Tokens / Sec: 12218.3 | Learning Rate: 5.0e-04
Epoch Step:      8 | Batch Number:      1 | Accumulation Step:   2 | Loss:   0.15 |

In [None]:
model.eval()
max_len = 32
src_mask = torch.ones(1, 1, max_len).to(device)

ite = 100
TP = 0
for i in range(ite):
  src = torch.randint(1, V, size=(1, max_len)).to(device)
  src[0][0] = 0
  predicted = greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0)

  TP += torch.all(src == predicted)

print(TP/ite)

tensor(1., device='cuda:0')
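
The problem asks for sequence-level rather than token-level accuracy. The difference between the two can be sketched on a toy example with plain lists; the helper names and the data below are illustrative assumptions, not part of the evaluation above.

```python
def token_accuracy(predictions, targets):
    """Fraction of individual symbols that match."""
    matches = sum(
        p == t for pred, tgt in zip(predictions, targets) for p, t in zip(pred, tgt)
    )
    total = sum(len(tgt) for tgt in targets)
    return matches / total


def sequence_accuracy(predictions, targets):
    """Fraction of sequences that match in every position."""
    exact = sum(pred == tgt for pred, tgt in zip(predictions, targets))
    return exact / len(targets)


targets = [[0, 3, 1, 4], [0, 5, 9, 2]]
predictions = [[0, 3, 1, 4], [0, 5, 9, 7]]  # one wrong symbol in the second sequence
print(token_accuracy(predictions, targets), sequence_accuracy(predictions, targets))
```

A single wrong symbol lowers token accuracy only slightly but counts the whole sequence as wrong, which is why the sequence-level number is the stricter of the two.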





### **Problem 2.2** *(4 points)* 
>Now, we will implement a bit more useful function, so-called spelling error correction. Your job is to create a model whose input is a word with spelling errors, and the output is the spelling-corrected word. Here, your vocabulary will be character instead of word. You can create your own training data by using an existing text corpus as the target and inject noise into it to use it as the input. You are free to use whichever text corpus you like. If you can't think of one, please use context data in SQuAD Dataset (see Assignment 2). Report accuracy in your own evaluation data (you will receive full credit as long as both the evaluation data and the accuracy are reasonable), and also show 3 examples where it succeeds at correcting spelling.

In [None]:
!pip install -q datasets

In [None]:
maxWordLen = 20
batchSize = 64

In [None]:
from datasets import load_dataset
import torchtext
from torchtext.data import get_tokenizer

tokenizer = get_tokenizer("basic_english") 

squad_dataset = load_dataset('squad')
dataSet = squad_dataset['validation']

string = ' '.join([str(elem).strip('\n') for elem in dataSet['context']])

words = set(tokenizer(string))
words = [i for i in words if len(i) > 2 and len(i) <= maxWordLen - 2]

alphabatedWords = [list(i) for i in words]

vocab = ['SOW', 'EOW', 'PAD', 'UNK'] + list(set(''.join([str(elem).strip('\n') for elem in words])))
word2id = {word: id_ for id_, word in enumerate(vocab)}

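noisyWords = []

# Corrupt each word by substituting one character (never the first one)
# with a character drawn uniformly from the non-special vocabulary entries.
# The draw may return the original character, leaving the word unchanged.
for i in alphabatedWords:
  replaceID = torch.randint(1, len(i), [1])[0]
  replacingID = torch.randint(4, len(vocab), [1])[0]
  temp = i.copy()
  temp[replaceID] = vocab[replacingID]
  noisyWords.append(temp)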
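
The cell above injects noise by substituting a single character. As a possible extension, other common typo types (deletion, insertion, and swapping adjacent characters) could be mixed in. The sketch below is an assumption and not what this notebook trains on; `add_noise` is a hypothetical helper that uses only the standard library and a lowercase ASCII alphabet.

```python
import random
import string


def add_noise(word, rng, alphabet=string.ascii_lowercase):
    """Apply one random edit: substitute, delete, insert, or swap adjacent characters."""
    chars = list(word)
    op = rng.choice(["substitute", "delete", "insert", "swap"])
    pos = rng.randrange(len(chars))
    if op == "substitute":
        chars[pos] = rng.choice(alphabet)
    elif op == "delete" and len(chars) > 1:
        del chars[pos]
    elif op == "insert":
        chars.insert(pos, rng.choice(alphabet))
    elif op == "swap" and len(chars) > 1:
        pos = min(pos, len(chars) - 2)  # keep pos + 1 inside the word
        chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
    return "".join(chars)


rng = random.Random(0)
for word in ["ideology", "anchors", "prophet"]:
    print(word, "->", add_noise(word, rng))
```

Deletions and insertions change the word length by one, so the source and target sequences would no longer align position by position; the padded fixed-length layout used here can still hold them as long as the noisy word fits within `maxWordLen - 2` characters.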


In [None]:
numberOfWords = (len(alphabatedWords)//batchSize)*batchSize
alphabetDataSet = torch.zeros([numberOfWords, maxWordLen], device=device).fill_(2)
alphabetDataSet[:, 0] = 0
alphabetDataSet[:, -1] = 1

for i in range(numberOfWords):
  for j, alphabet in enumerate(alphabatedWords[i]):
    if alphabet not in word2id:
      alphabetDataSet[i][j+1] = 3
    else:
      alphabetDataSet[i][j+1] = word2id[alphabet]

alphabetDataSet = alphabetDataSet.reshape(len(alphabatedWords)//batchSize, batchSize, maxWordLen)



noisyAlphabetDataSet = torch.zeros([numberOfWords, maxWordLen], device=device).fill_(2)
noisyAlphabetDataSet[:, 0] = 0
noisyAlphabetDataSet[:, -1] = 1

for i in range(numberOfWords):
  for j, alphabet in enumerate(noisyWords[i]):
    if alphabet not in word2id:
      noisyAlphabetDataSet[i][j+1] = 3
    else:
      noisyAlphabetDataSet[i][j+1] = word2id[alphabet]

noisyAlphabetDataSet = noisyAlphabetDataSet.reshape(len(alphabatedWords)//batchSize, batchSize, maxWordLen)

In [None]:
def originalTraining():
  for i in range(len(alphabatedWords)//batchSize):
      data = alphabetDataSet[i, :, :].long()
      src = data.requires_grad_(False).clone().detach()
      tgt = data.requires_grad_(False).clone().detach()
      yield Batch(src, tgt, 0)

def noisyFineTuning():
  # noisyAlphabetDataSet has shape (num_batches, batchSize, maxWordLen),
  # so its length is already the number of batches.
  for i in range(len(noisyAlphabetDataSet)):
      sData = noisyAlphabetDataSet[i, :, :].long()
      tData = alphabetDataSet[i, :, :].long()
      src = sData.requires_grad_(False).clone().detach()
      tgt = tData.requires_grad_(False).clone().detach()
      yield Batch(src, tgt, 0)

In [None]:
# Train the character-level spelling-correction model.
V = len(vocab)
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)

optimizer = torch.optim.Adam(
    model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
)
lr_scheduler = LambdaLR(
    optimizer=optimizer,
    lr_lambda=lambda step: rate(
        step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
    ),
)

model = model.to(device)

#for epoch in range(10):
#    model.train()
#    run_epoch(
#        originalTraining(),
#        model,
#        SimpleLossCompute(model.generator, criterion),
#        optimizer,
#        lr_scheduler,
#        mode="train",
#        epochID=epoch+1
#    )

    
for epoch in range(100):
    model.train()
    run_epoch(
        noisyFineTuning(),
        model,
        SimpleLossCompute(model.generator, criterion),
        optimizer,
        lr_scheduler,
        mode="train",
        epochID=epoch+1
    )

Epoch Step:      1 | Batch Number:      1 | Accumulation Step:   2 | Loss:   6.30 | Tokens / Sec:  9866.3 | Learning Rate: 5.5e-06
Epoch Step:      2 | Batch Number:      1 | Accumulation Step:   2 | Loss:   4.03 | Tokens / Sec:  9329.4 | Learning Rate: 2.2e-05
Epoch Step:      3 | Batch Number:      1 | Accumulation Step:   2 | Loss:   3.48 | Tokens / Sec:  9250.0 | Learning Rate: 3.9e-05
Epoch Step:      4 | Batch Number:      1 | Accumulation Step:   2 | Loss:   2.32 | Tokens / Sec:  9914.2 | Learning Rate: 5.5e-05
Epoch Step:      5 | Batch Number:      1 | Accumulation Step:   2 | Loss:   2.09 | Tokens / Sec: 10984.3 | Learning Rate: 7.2e-05
Epoch Step:      6 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.82 | Tokens / Sec:  9701.6 | Learning Rate: 8.8e-05
Epoch Step:      7 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.58 | Tokens / Sec:  9816.8 | Learning Rate: 1.0e-04
Epoch Step:      8 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.47 |

In [None]:
model.eval()
max_len = 20
src_mask = torch.ones(1, 1, max_len).to(device)
TP = 0

total = len(noisyAlphabetDataSet)
correctlyPredicted = []
wronglyPredicted = []

for i in range(total):
  src = torch.tensor(noisyAlphabetDataSet[i])[1].long().view(1, -1)

  predicted = greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0)
  original = torch.tensor(alphabetDataSet[i])[1].long().view(1, -1)

  # Exact match: every position of the prediction must equal the target.
  isCorrect = torch.all(original == predicted)
  if isCorrect:
    correctlyPredicted.append(predicted)
  else:
    wronglyPredicted.append(predicted)
  TP += isCorrect

print(TP/total)

tensor(0.0364, device='cuda:0')


In [None]:
wordsbyID = list(word2id.keys())
for i in correctlyPredicted:
  word = []
  for j in i[0]:
    if j == 0:
      continue
    if j == 2:
      break
    word.extend([wordsbyID[j]])
  print(''.join(word))

ideology
laying
anchors
1896
singing
bob
cashew
beating
railing
ingenious
thing
teamed
imaginative
prophet


### **Problem 2.3** *(3 points)* 
>Extend this word-level spelling correction model to sentence-level. You can assume that the number of characters of each sentence is 100 or less. You do not have to report accuracy, but find one example where the word-level model fails and sentence-level model correctly predicts.

In [None]:
maxSenLen = 100
batchSize = 64

In [None]:
!pip install tokenizers
!pip install -q datasets

Collecting tokenizers
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Installing collected packages: tokenizers
Successfully installed tokenizers-0.12.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasc

In [None]:
from datasets import load_dataset
import torchtext
from torchtext.data import get_tokenizer
from tokenizers import normalizers

normalize = normalizers.BertNormalizer().normalize_str
squad_dataset = load_dataset('squad')
dataSet = squad_dataset['validation']

# Join all contexts into one string, then split it into sentences on '.'.
string = ' '.join([str(elem).strip('\n') for elem in dataSet['context']])

sentences = normalize(string).split('.')
# Keep sentences that still fit in maxSenLen once SOW and EOW are added.
sentences = [i for i in sentences if len(i) > 10 and len(i) <= 98]

alphabatedSentences = [list(i) for i in sentences]

# Sort the character set so that the ids are the same on every run.
vocab = ['SOW', 'EOW', 'PAD', 'UNK'] + sorted(set(''.join([str(elem).strip('\n') for elem in sentences])))
word2id = {word: id_ for id_, word in enumerate(vocab)}

noisySentences = []
noiseAmount = 1
for i in alphabatedSentences:
  temp = i.copy()  # copy once so that every replacement is kept
  for _ in range(noiseAmount):
    replaceID = torch.randint(1, len(i), [1]).item()
    replacingID = torch.randint(4, len(vocab), [1]).item()  # skip the 4 special tokens
    temp[replaceID] = vocab[replacingID]
  noisySentences.append(temp)

Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.
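
The noise step in the cell above replaces a single character per sentence. As a rough sketch of the same idea without PyTorch, the helper below copies the sentence once and then applies every replacement to that copy, so a noise amount above 1 accumulates. `add_char_noise` and its parameters are hypothetical names, and the toy sentence is only for illustration.

```python
import random


def add_char_noise(chars, alphabet, noise_amount=1, rng=None):
    """Return a copy of `chars` with `noise_amount` positions overwritten."""
    rng = rng or random.Random()
    noisy = list(chars)  # copy once, so every replacement is kept
    for _ in range(noise_amount):
        # Position 0 is never replaced, mirroring randint(1, len(i)) above.
        pos = rng.randrange(1, len(noisy))
        noisy[pos] = rng.choice(alphabet)
    return noisy


rng = random.Random(0)  # fixed seed so the example is repeatable
clean = list(" super bowl 50")
alphabet = sorted(set(clean))
noisy = add_char_noise(clean, alphabet, noise_amount=2, rng=rng)
print("".join(clean))
print("".join(noisy))
```

A replacement can pick the character that is already there, so the number of changed positions is at most `noise_amount`, not exactly `noise_amount`.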



In [None]:
# Drop the last partial batch so that the data reshapes into whole batches.
numberOfWords = (len(alphabatedSentences)//batchSize)*batchSize

def encodeSentences(charLists):
  # Row layout: SOW (0), character ids, PAD (2) up to the last slot, EOW (1).
  data = torch.full([numberOfWords, maxSenLen], 2.0, device=device)
  data[:, 0] = 0
  data[:, -1] = 1
  for i in range(numberOfWords):
    for j, alphabet in enumerate(charLists[i]):
      data[i][j+1] = word2id.get(alphabet, 3)  # 3 is UNK
  return data.reshape(numberOfWords//batchSize, batchSize, maxSenLen)

alphabetDataSet = encodeSentences(alphabatedSentences)
noisyAlphabetDataSet = encodeSentences(noisySentences)
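
The row layout built above can be checked on a toy input. The sketch below encodes one sentence into a fixed-length list of ids using plain Python lists instead of tensors. `encode_sentence`, `toy_vocab` and `toy_char2id` are hypothetical names, and the special-token ids are assumed to follow the order used in `vocab`.

```python
SOW, EOW, PAD, UNK = 0, 1, 2, 3  # same order as the special tokens in `vocab`


def encode_sentence(chars, char2id, max_len=100):
    """Map characters to ids in a fixed-length row: SOW, chars, PAD..., EOW."""
    if len(chars) > max_len - 2:
        raise ValueError("sentence does not fit between SOW and EOW")
    row = [PAD] * max_len
    row[0] = SOW
    row[-1] = EOW
    for j, ch in enumerate(chars):
        row[j + 1] = char2id.get(ch, UNK)  # unseen characters map to UNK
    return row


toy_vocab = ["SOW", "EOW", "PAD", "UNK"] + sorted(set("bob"))
toy_char2id = {tok: i for i, tok in enumerate(toy_vocab)}
print(encode_sentence(list("bob"), toy_char2id, max_len=8))
# -> [0, 4, 5, 4, 2, 2, 2, 1]
```

Note that EOW sits in the last slot of the row, after the padding, which is the layout the cell above produces.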

In [None]:
def originalTraining():
  # Copy task: source and target are the same clean sentences.
  # alphabetDataSet is already shaped (num_batches, batchSize, maxSenLen).
  for i in range(len(alphabetDataSet)):
      data = alphabetDataSet[i, :, :].long()
      src = data.clone().detach()
      tgt = data.clone().detach()
      yield Batch(src, tgt, 0)

def noisyFineTuning():
  # Correction task: noisy sentences as source, clean sentences as target.
  # len() already counts batches; dividing by batchSize again would skip most of them.
  for i in range(len(noisyAlphabetDataSet)):
      sData = noisyAlphabetDataSet[i, :, :].long()
      tData = alphabetDataSet[i, :, :].long()
      src = sData.clone().detach()
      tgt = tData.clone().detach()
      yield Batch(src, tgt, 0)

In [None]:
# Train the sentence-level spelling-correction model.
# Note: padding_idx=0 below is the id of 'SOW' in `vocab`; 'PAD' has id 2.
V = len(vocab)
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)

optimizer = torch.optim.Adam(
    model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
)
lr_scheduler = LambdaLR(
    optimizer=optimizer,
    lr_lambda=lambda step: rate(
        step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
    ),
)

model = model.to(device)

#for epoch in range(10):
#    model.train()
#    run_epoch(
#        originalTraining(),
#        model,
#        SimpleLossCompute(model.generator, criterion),
#        optimizer,
#        lr_scheduler,
#        mode="train",
#        epochID=epoch+1
#    )

    
for epoch in range(100):
    model.train()
    run_epoch(
        noisyFineTuning(),
        model,
        SimpleLossCompute(model.generator, criterion),
        optimizer,
        lr_scheduler,
        mode="train",
        epochID=epoch+1
    )

Epoch Step:      1 | Batch Number:      1 | Accumulation Step:   2 | Loss:   6.63 | Tokens / Sec: 11488.1 | Learning Rate: 5.5e-06
Epoch Step:      2 | Batch Number:      1 | Accumulation Step:   2 | Loss:   5.40 | Tokens / Sec: 13354.5 | Learning Rate: 1.7e-05
Epoch Step:      3 | Batch Number:      1 | Accumulation Step:   2 | Loss:   2.69 | Tokens / Sec: 13287.6 | Learning Rate: 2.8e-05
Epoch Step:      4 | Batch Number:      1 | Accumulation Step:   2 | Loss:   2.31 | Tokens / Sec: 13422.5 | Learning Rate: 3.9e-05
Epoch Step:      5 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.94 | Tokens / Sec: 13418.8 | Learning Rate: 5.0e-05
Epoch Step:      6 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.64 | Tokens / Sec: 13444.2 | Learning Rate: 6.1e-05
Epoch Step:      7 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.55 | Tokens / Sec: 13260.7 | Learning Rate: 7.2e-05
Epoch Step:      8 | Batch Number:      1 | Accumulation Step:   2 | Loss:   1.44 |

In [None]:
max_len = 100

model.eval()
src_mask = torch.ones(1, 1, max_len).to(device)
TP = 0

# noisyAlphabetDataSet has shape (num_batches, batchSize, maxSenLen), so
# `total` counts batches and the loop evaluates the sample at index 1 of each batch.
total = len(noisyAlphabetDataSet)
correctlyPredicted = []

for i in range(total):
  src = noisyAlphabetDataSet[i][1].long().view(1, -1)

  predicted = greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0)
  original = alphabetDataSet[i][1].long().view(1, -1)

  # Exact match: every position of the prediction must equal the target.
  isCorrect = torch.all(original == predicted)
  if isCorrect:
    correctlyPredicted.append(predicted)
  TP += isCorrect

print(TP/total)

tensor(0.0075, device='cuda:0')
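
The score above is an exact-match rate: a sentence counts only if all 100 positions agree with the target. As a small sketch, the helpers below compute exact-match accuracy together with a per-position accuracy over plain lists of ids. Both function names and the toy ids are hypothetical and are not part of the notebook.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of sequences that are reproduced exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets differ in length")
    if not targets:
        return 0.0
    hits = sum(list(p) == list(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


def char_accuracy(predictions, targets):
    """Fraction of positions predicted correctly, over all sequences."""
    correct = total = 0
    for p, t in zip(predictions, targets):
        correct += sum(a == b for a, b in zip(p, t))
        total += len(t)
    return correct / total if total else 0.0


# Toy id sequences: the second prediction has one wrong position.
preds = [[0, 4, 5, 1], [0, 4, 4, 1]]
targets = [[0, 4, 5, 1], [0, 4, 5, 1]]
print(exact_match_accuracy(preds, targets))
print(char_accuracy(preds, targets))
```

Reporting both numbers separates a model that is wrong almost everywhere from one that misses a single character per sentence.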


In [None]:
wordsbyID = list(word2id.keys())
for i in correctlyPredicted:
  word = []
  for j in i[0]:
    if j == 0:  # skip SOW
      continue
    if j == 2:  # stop at the first PAD
      break
    word.append(wordsbyID[j])
  print(''.join(word))

 cbs broadcast super bowl 50 in the u
 the miami bid depended on whether the stadium underwent renovations


None of the words in these two sentences appears among the words that the word-level model corrected successfully, so the sentence-level model solves at least one example that the word-level model above does not.
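
This comparison can also be done programmatically. The sketch below takes the two sentences and the word list printed earlier and returns the words that are absent from the word-level model's correct predictions. `words_missing_from` is a hypothetical helper. A word being absent only means it was not among the printed correct predictions, so it may have been mispredicted or simply not evaluated.

```python
def words_missing_from(sentences, word_level_correct):
    """Words in `sentences` that are not in the word-level correct list."""
    solved = set(word_level_correct)
    return {w for s in sentences for w in s.split() if w not in solved}


# Outputs copied from the two decoding cells above.
word_level_correct = [
    "ideology", "laying", "anchors", "1896", "singing", "bob", "cashew",
    "beating", "railing", "ingenious", "thing", "teamed", "imaginative",
    "prophet",
]
sentence_level_correct = [
    " cbs broadcast super bowl 50 in the u",
    " the miami bid depended on whether the stadium underwent renovations",
]

missing = words_missing_from(sentence_level_correct, word_level_correct)
print(sorted(missing))
```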