# Lab 11

This code is from https://pytorch.org/tutorials/beginner/transformer_tutorial.html, the official PyTorch documentation for seq2seq modeling using the Transformer.

In [None]:
pip install scikit-image

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
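# The two functions below are excerpted from scikit-image. The imports for the
# private helpers (new_float_type, _reshape_nd) are assumed from the
# scikit-image source layout and may differ between versions.
from collections.abc import Iterable

import numpy as np
from scipy import ndimage as ndi
from skimage._shared.utils import new_float_type
from skimage.filters.edges import _reshape_nd
from skimage.util import img_as_float

def _supported_float_type(input_dtype, allow_complex=False):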
    if isinstance(input_dtype, Iterable) and not isinstance(input_dtype, str):
        return np.result_type(*(_supported_float_type(d) for d in input_dtype))
    input_dtype = np.dtype(input_dtype)
    if not allow_complex and input_dtype.kind == 'c':
        raise ValueError("complex valued input is not supported")
    return new_float_type.get(input_dtype.char, np.float64)

def _generic_edge_filter(image, *, smooth_weights, edge_weights=[1, 0, -1],
                         axis=None, mode='reflect', cval=0.0, mask=None):
    ndim = image.ndim
    if axis is None:
        axes = list(range(ndim))
    elif np.isscalar(axis):
        axes = [axis]
    else:
        axes = axis
    return_magnitude = (len(axes) > 1)

    if image.dtype.kind == 'f':
        float_dtype = _supported_float_type(image.dtype)
        image = image.astype(float_dtype, copy=False)
    else:
        image = img_as_float(image)
    output = np.zeros(image.shape, dtype=image.dtype)

    for edge_dim in axes:
        kernel = _reshape_nd(edge_weights, ndim, edge_dim)
        smooth_axes = list(set(range(ndim)) - {edge_dim})
        for smooth_dim in smooth_axes:
            kernel = kernel * _reshape_nd(smooth_weights, ndim, smooth_dim)
        ax_output = ndi.convolve(image, kernel, mode=mode)
        if return_magnitude:
            ax_output *= ax_output
        output += ax_output

    if return_magnitude:
        output = np.sqrt(output) / np.sqrt(ndim)
    return output

In [None]:
from skimage import filters
import numpy as np

np.random.seed(7777)  # seed() returns None, so there is nothing useful to assign
image = np.random.rand(128,512,512)
edges = filters.sobel(image)

In [None]:
!pip3 install torchtext==0.4.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.4.0
  Downloading torchtext-0.4.0-py3-none-any.whl (53 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.1/53.1 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.15.1
    Uninstalling torchtext-0.15.1:
      Successfully uninstalled torchtext-0.15.1
Successfully installed torchtext-0.4.0



Sequence-to-Sequence Modeling with nn.Transformer and TorchText
===============================================================

This is a tutorial on how to train a sequence-to-sequence model
that uses the
[nn.Transformer](https://pytorch.org/docs/master/nn.html?highlight=nn%20transformer#torch.nn.Transformer) module.

The PyTorch 1.2 release includes a standard transformer module based on the
paper [Attention is All You
Need](https://arxiv.org/pdf/1706.03762.pdf). The transformer model
has been shown to be superior in quality for many sequence-to-sequence
problems while being more parallelizable. The ``nn.Transformer`` module
relies entirely on an attention mechanism (implemented in [nn.MultiheadAttention](https://pytorch.org/docs/master/generated/torch.nn.MultiheadAttention.html#torch.nn.MultiheadAttention)) to draw global dependencies
between input and output. The ``nn.Transformer`` module is highly
modularized, so a single component (like `nn.TransformerEncoder`
in this tutorial) can be easily adapted or composed.

The figure below is an overall summary of the transformer architecture.

![](https://github.com/pytorch/tutorials/blob/gh-pages/_static/img/transformer_architecture.jpg?raw=1)





Define the model
----------------




In this tutorial, we train a ``nn.TransformerEncoder`` model on a
language modeling task. One formulation of language modeling is to assign a
probability for the likelihood of a given word (or a sequence of words)
to follow a given sequence of words. 

A sequence of tokens is passed to the embedding
layer first, followed by a positional encoding layer to account for the order
of the words (see the Positional Encoding section below for more details). The
``nn.TransformerEncoder`` module consists of multiple layers of
[nn.TransformerEncoderLayer](https://pytorch.org/docs/master/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer). Along with the input sequence, a square
attention mask is required because the self-attention layers in
``nn.TransformerEncoder`` are only allowed to attend to earlier positions in
the sequence. For language modeling, any tokens in future
positions should be masked. To get the actual words, the output
of the ``nn.TransformerEncoder`` model is sent to a final Linear
layer, which is followed by a log-Softmax function.
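The role of the final log-softmax can be sketched in plain Python, independent of PyTorch. The helper names `log_softmax` and `cross_entropy` are illustrative assumptions, not library functions. The sketch shows that the loss on the raw Linear outputs (logits) is the negative log-probability of the target token.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the target index."""
    return -log_softmax(logits)[target]

logits = [2.0, 0.5, -1.0]  # toy scores over a hypothetical 3-token vocabulary
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]  # exponentiating recovers probabilities
loss = cross_entropy(logits, target=0)
print(probs, loss)
```

Note that `nn.CrossEntropyLoss` combines the log-softmax and the negative log-likelihood in one step, which is why the `forward()` method in the model code returns the Linear output directly.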




### Transformer Model

In [None]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerModel(nn.Module):

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        '''
        ntoken/ntokens = len(TEXT.vocab.stoi) # the size of vocabulary
        ninp/emsize = 200 # embedding dimension
        nhead = 2 # the number of heads in the multiheadattention models
        nhid = 200 # the dimension of the feedforward network model in nn.TransformerEncoder
        nlayers = 2 # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
        dropout = 0.2 # the dropout value
        '''
        super(TransformerModel, self).__init__()
        from torch.nn import TransformerEncoder, TransformerEncoderLayer
        self.model_type = 'Transformer'
        self.src_mask = None
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    def _generate_square_subsequent_mask(self, sz):
        #triu returns the upper triangular part of a matrix (2-D tensor) or batch of matrices (see section below)
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src):
        if self.src_mask is None or self.src_mask.size(0) != len(src):
            device = src.device
            mask = self._generate_square_subsequent_mask(len(src)).to(device)
            self.src_mask = mask

        src = self.encoder(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, self.src_mask)
        output = self.decoder(output)
        return output

#### Masking
By passing the mask into the transformer_encoder forward() function, attention will only be calculated based on the earlier positions in the sequence.

In [None]:
#triu returns the upper triangular part of a matrix (2-D tensor) or batch of matrices (see section below)
torch.triu(torch.ones(3, 3))

tensor([[1., 1., 1.],
        [0., 1., 1.],
        [0., 0., 1.]])

In [None]:
# Masking
def masking():
  sz = 4
  mask = (torch.triu(torch.ones(sz, sz)) == 1)
  mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
  
  return mask

masking()

tensor([[0., 0., 0., 0.],
        [-inf, 0., 0., 0.],
        [-inf, -inf, 0., 0.],
        [-inf, -inf, -inf, 0.]])

In [None]:
# Masking
def masking():
  sz = 4
  mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
  ## mask = (torch.triu(torch.ones(sz, sz)) == 0)
  mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
  
  return mask

masking()

tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])
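
To see why an additive mask of `0` and `-inf` hides future positions, it helps to push it through a softmax. The following is a plain-Python sketch with made-up attention scores; `causal_mask` and `softmax` are hypothetical helpers that mirror the transposed mask printed above.

```python
import math

def causal_mask(sz):
    """Additive mask: 0.0 where attention is allowed, -inf for future positions."""
    return [[0.0 if j <= i else float("-inf") for j in range(sz)]
            for i in range(sz)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy scores: every query position scores the three key positions identically.
scores = [[1.0, 2.0, 3.0] for _ in range(3)]
mask = causal_mask(3)

# Add the mask to the scores, then normalize each row.
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]

for row in weights:
    print([round(w, 3) for w in row])
```

Row `i` of `weights` has non-zero entries only in columns `0..i`, so position `i` can only draw on itself and earlier tokens.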

### Positional Encoding

The Transformer architecture follows the base architecture of a Seq2Seq model (Encoder - Decoder). However, the transformer does not use a recurrent model, so we need another way to capture sequence information in the input and output.

The ``PositionalEncoding`` module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimensions as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies.
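
The sine and cosine pattern can be sketched without tensors. This plain-Python version assumes the usual formulation, with even indices taking `sin(pos / 10000^(i / d_model))` and the following odd index taking the matching cosine. The function name `positional_encoding` and the tiny sizes are illustrative only, and the `i + 1 < d_model` guard is an addition for odd dimensions.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same quantity as exp(i * (-log(10000) / d_model)) in tensor form.
            div_term = math.exp(i * (-math.log(10000.0) / d_model))
            pe[pos][i] = math.sin(pos * div_term)
            if i + 1 < d_model:  # guard so an odd d_model does not overflow
                pe[pos][i + 1] = math.cos(pos * div_term)
    return pe

pe = positional_encoding(max_len=4, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Position 0 always encodes to alternating `0` and `1`, and the higher dimensions change more slowly with position because their frequencies are lower.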






In [None]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        '''
        d_model = 200 # embedding dimension
        max_len = 5000 # the maximum sentence length
        '''
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model) ## a matrix of shape [max_len,d_model] with all zeros
        ### each row of pe represents one possible position with d_model embedding dimension
        ### i.e., each word is represented by d_model embedding dimension, and each possible position is also represented by d_model embedding dimension
        #####     so, we can add the word embedding and the positional encoding element-wise
        ##### e.g., the third word 'MONEY' is represented by [1,2,...,200], and the third position in a sentence is represented by [1,2,...,200]
        #####        then, the 'MONEY' passed to the transformer encoder is [2,4,...,400]
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        ## torch.arange(start,end,step=1) returns a 1-D tensor of size ceil((end-start)/step) with values
        ### from the interval [start,end) taken with common difference step beginning from start
        ## .unsqueeze(1) returns a new tensor with a dimension of size one inserted at the specified position.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term) #0::2 means starting with index 0, step = 2
        pe[:, 1::2] = torch.cos(position * div_term) #1::2 means starting with index 1, step = 2
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)
        ## If your model has tensors that should be saved and restored in the state_dict, but not trained by the optimizer, you should register them as buffers.
        ## Buffers won’t be returned in model.parameters(), so the optimizer won’t have a chance to update them.

    def forward(self, x):
        x = x + self.pe[:x.size(0), :] # x.size(0)=the length of the sentence
        return self.dropout(x)

Load and batch data
-------------------




The training process uses the WikiText-2 dataset from ``torchtext``. The
vocab object is built based on the training dataset and is used to numericalize
tokens into tensors. Starting from sequential data, the ``batchify()``
function arranges the dataset into columns, trimming off any tokens remaining
after the data has been divided into batches of size ``batch_size``.
For instance, with the alphabet as the sequence (total length of 26)
and a batch size of 4, we would divide the alphabet into 4 sequences of
length 6:

\begin{align}\begin{bmatrix}
  \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y} & \text{Z}
  \end{bmatrix}
  \Rightarrow
  \begin{bmatrix}
  \begin{bmatrix}\text{A} \\ \text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} &
  \begin{bmatrix}\text{G} \\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} &
  \begin{bmatrix}\text{M} \\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} &
  \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
  \end{bmatrix}\end{align}

These columns are treated as independent by the model. This means that
the dependence of ``G`` on ``F`` cannot be learned, but it allows more
efficient batch processing.
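
The alphabet example can be reproduced with plain lists. This is a sketch of the same trim-and-reshape idea, not the tensor-based `batchify()` itself; `batchify_tokens` is a hypothetical helper name.

```python
import string

def batchify_tokens(tokens, bsz):
    """Arrange a flat token list into bsz columns, dropping the remainder."""
    nbatch = len(tokens) // bsz              # rows per column
    trimmed = tokens[:nbatch * bsz]          # drop tokens that do not fit
    columns = [trimmed[k * nbatch:(k + 1) * nbatch] for k in range(bsz)]
    # Transpose so each row holds one time step across all columns.
    return [[col[r] for col in columns] for r in range(nbatch)]

rows = batchify_tokens(list(string.ascii_uppercase), bsz=4)
for row in rows:
    print(row)
```

The result has 6 rows and 4 columns, the first row is `A G M S`, and `Y` and `Z` are trimmed off.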




In [None]:
import torchtext
from torch.utils.data import DataLoader  # DataLoader is a class, so it cannot be imported as a module
from torchtext.data.utils import get_tokenizer
TEXT = torchtext.data.Field(tokenize=get_tokenizer("basic_english"),
                            init_token='<sos>',
                            eos_token='<eos>',
                            lower=True)
train_txt, val_txt, test_txt = torchtext.datasets.WikiText2.splits(TEXT)
TEXT.build_vocab(train_txt)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def batchify(data, bsz):
    data = TEXT.numericalize([data.examples[0].text])
    # Divide the dataset into bsz parts.
    nbatch = data.size(0) // bsz
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Evenly divide the data across the bsz batches.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_txt, batch_size)
val_data = batchify(val_txt, eval_batch_size)
test_data = batchify(test_txt, eval_batch_size)

downloading wikitext-2-v1.zip


wikitext-2-v1.zip: 100%|██████████| 4.48M/4.48M [00:00<00:00, 18.9MB/s]


extracting


## Functions to generate input and target sequence



The ``get_batch()`` function generates the input and target sequence for
the transformer model. It subdivides the source data into chunks of
length ``bptt``. For the language modeling task, the model needs the
following words as ``Target``. For example, with a ``bptt`` value of 2,
we’d get the following two Variables for ``i`` = 0:

![](https://github.com/pytorch/tutorials/blob/gh-pages/_static/img/transformer_input_target.png?raw=1)


It should be noted that the chunks are along dimension 0, consistent
with the ``S`` dimension in the Transformer model. The batch dimension
``N`` is along dimension 1.




In [None]:
bptt = 35
def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].view(-1)
    return data, target
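
A toy run makes the one-step shift between input and target visible. The sketch below uses nested lists in place of tensors, with rows as time steps and columns as batch entries; `get_batch_lists` and the sample values are assumptions for illustration.

```python
def get_batch_lists(source, i, bptt):
    """List-based analogue of get_batch(): target is the input shifted by one row."""
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    # Flatten row by row, mirroring .view(-1) on a [seq_len, batch] tensor.
    target = [tok for row in source[i + 1:i + 1 + seq_len] for tok in row]
    return data, target

# Five time steps, two independent columns (0..4 and 10..14).
source = [[0, 10], [1, 11], [2, 12], [3, 13], [4, 14]]

data, target = get_batch_lists(source, 0, bptt=2)
print(data)    # [[0, 10], [1, 11]]
print(target)  # [1, 11, 2, 12]
```

Each target entry is the token that follows the corresponding input token in the same column.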

Initiate an instance
--------------------




The model is set up with the hyperparameters below. The vocab size is
equal to the length of the vocab object.




In [None]:
ntokens = len(TEXT.vocab.stoi) # the size of vocabulary
emsize = 200 # embedding dimension
nhid = 200 # the dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2 # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2 # the number of heads in the multiheadattention models
dropout = 0.2 # the dropout value
model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(device)

Run the model
-------------




`CrossEntropyLoss`
is applied to track the loss and
`SGD`
implements the stochastic gradient descent method as the optimizer. The initial
learning rate is set to 5.0. `StepLR` is
applied to adjust the learning rate through epochs. During
training, we use the
`nn.utils.clip_grad_norm_`
function to scale all the gradients together to avoid the exploding gradient problem.
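
Both the step schedule and the gradient clipping come down to simple arithmetic, which can be sketched in plain Python. The helpers `step_lr` and `clip_by_global_norm` are hypothetical and only approximate what the library routines do; the small `eps` in the denominator is an assumption made to avoid division by zero.

```python
import math

def step_lr(initial_lr, gamma, epoch, step_size=1):
    """Learning rate after `epoch` completed epochs of a step decay schedule."""
    return initial_lr * gamma ** (epoch // step_size)

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

# Learning rate for the first three epochs with lr=5.0 and gamma=0.95.
lrs = [step_lr(5.0, 0.95, epoch) for epoch in range(3)]
print(lrs)

# A gradient with norm 5.0 is shrunk to roughly norm 0.5.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.5)
print(norm_before, clipped)

# A gradient already below the threshold is left unchanged.
small, _ = clip_by_global_norm([0.03, 0.04], max_norm=0.5)
print(small)
```

Because every gradient is multiplied by the same factor, clipping changes the step length but not its direction.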




In [None]:
criterion = nn.CrossEntropyLoss()
lr = 5.0 # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

import time
def train():

    model.train() # Turn on the train mode
    total_loss = 0.
    start_time = time.time()
    ntokens = len(TEXT.vocab.stoi)
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        log_interval = 200
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | '
                  'lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                    epoch, batch, len(train_data) // bptt, scheduler.get_last_lr()[0],
                    elapsed * 1000 / log_interval,
                    cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()


def evaluate(eval_model, data_source):
    eval_model.eval() # Turn on the evaluation mode
    total_loss = 0.
    ntokens = len(TEXT.vocab.stoi)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            output = eval_model(data)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
    return total_loss / (len(data_source) - 1)
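
The `evaluate()` function weights each chunk's loss by its length before averaging, because the last chunk is usually shorter than `bptt`. A plain-Python sketch of that bookkeeping, with made-up chunk losses and a hypothetical helper `average_loss`:

```python
import math

def average_loss(chunk_losses, chunk_lengths):
    """Length-weighted mean of per-chunk average losses."""
    total = sum(loss * n for loss, n in zip(chunk_losses, chunk_lengths))
    return total / sum(chunk_lengths)

chunk_lengths = [35, 35, 10]    # the final chunk is shorter than bptt = 35
chunk_losses = [5.0, 5.2, 6.0]  # illustrative values, not measured results

avg = average_loss(chunk_losses, chunk_lengths)
naive = sum(chunk_losses) / len(chunk_losses)
ppl = math.exp(avg)             # perplexity is the exponential of the mean loss
print(avg, naive, ppl)
```

The unweighted mean would give the short final chunk as much influence as a full one, which is why the two averages differ here.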

Loop over epochs. Save the model if the validation loss is the best
we've seen so far. Adjust the learning rate after each epoch.



In [None]:
import copy

best_val_loss = float("inf")
epochs = 3 # The number of epochs
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train()
    val_loss = evaluate(model, val_data)
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
          'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                     val_loss, math.exp(val_loss)))
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        # Keep a snapshot; `best_model = model` would only alias the model
        # that continues to be updated in later epochs.
        best_model = copy.deepcopy(model)

    scheduler.step()



| epoch   1 |   200/ 2981 batches | lr 5.00 | ms/batch 29.40 | loss  8.04 | ppl  3111.89
| epoch   1 |   400/ 2981 batches | lr 5.00 | ms/batch 14.16 | loss  6.78 | ppl   877.50
| epoch   1 |   600/ 2981 batches | lr 5.00 | ms/batch 13.48 | loss  6.37 | ppl   581.48
| epoch   1 |   800/ 2981 batches | lr 5.00 | ms/batch 13.61 | loss  6.22 | ppl   504.30
| epoch   1 |  1000/ 2981 batches | lr 5.00 | ms/batch 13.55 | loss  6.12 | ppl   454.53
| epoch   1 |  1200/ 2981 batches | lr 5.00 | ms/batch 13.98 | loss  6.09 | ppl   439.72
| epoch   1 |  1400/ 2981 batches | lr 5.00 | ms/batch 13.80 | loss  6.04 | ppl   419.91
| epoch   1 |  1600/ 2981 batches | lr 5.00 | ms/batch 13.62 | loss  6.05 | ppl   424.26
| epoch   1 |  1800/ 2981 batches | lr 5.00 | ms/batch 13.62 | loss  5.96 | ppl   387.72
| epoch   1 |  2000/ 2981 batches | lr 5.00 | ms/batch 13.73 | loss  5.95 | ppl   384.22
| epoch   1 |  2200/ 2981 batches | lr 5.00 | ms/batch 14.35 | loss  5.85 | ppl   346.17
| epoch   1 |  2400/ 

## Evaluate the model with the test dataset

Measure results on the test set for the best model.



In [None]:
test_loss = evaluate(best_model, test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  5.41 | test ppl   222.67


# HuggingFace Transformer

HuggingFace is a company that develops software and services for NLP. They have a widely used library providing transformers and related models. A large set of tutorial links can be found here: https://huggingface.co/transformers/notebooks.html

# Exercise

## E1. Multiple Choice Questions

Please answer the two questions below. Note - you must get both right to get a point.

### Question 1

Which of the following are true of attention?

1. It allows the decoder in a sequence-to-sequence model to use information from specific parts of the input.
2. It is a form of weighted sum.
3. It is a key part of the Long Short-Term Memory model (LSTM).
4. It is a key part of the Transformer model.
5. It is critical for accurate POS tagging.
6. It is critical for accurate translation.

Answer:



### Question 2

Does each of the following text representation methods produce sparse or dense vectors?

Answer:

- Bag of words - 
- Dynamic Embedding (e.g., BERT) -
- One-hot - 
- Static Embedding (e.g., GloVe) -
- TF-IDF -

## E2 Test with Transformer
Try either:

1. varying the number of heads in Multi-head Attention, **or**
2. varying the number of encoder layers

Record the test performance for each configuration you try.

Draw a graph of the test performance (or validation loss/perplexity) against the number of heads or the number of encoder layers, whichever you varied. You can keep epochs = 3.
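
When choosing head counts to try, keep in mind that multi-head attention splits the embedding dimension evenly across heads, so `emsize` must be divisible by `nhead`. The sketch below lists the valid choices for `emsize = 200` and sets up a placeholder for recording results; `valid_head_counts` and `results` are hypothetical names, and the training call is left for you to fill in.

```python
def valid_head_counts(emsize, max_heads):
    """Head counts up to max_heads that divide the embedding dimension evenly."""
    return [h for h in range(1, max_heads + 1) if emsize % h == 0]

heads_to_try = valid_head_counts(emsize=200, max_heads=10)
print(heads_to_try)  # [1, 2, 4, 5, 8, 10]

# Placeholder table: nhead -> test loss, to be filled by your own runs.
results = {}
for nhead in heads_to_try:
    results[nhead] = None  # train and evaluate a model with this nhead here
```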