## Attention Model  (10 pt)

Credit to https://github.com/yandexdataschool/nlp_course/blob/2023/week04_seq2seq/practice_and_homework_pytorch.ipynb

In previous notebook we composed encoder-decoder recurrent neural networks and applied it to the task of machine translation.

![img](https://esciencegroup.files.wordpress.com/2016/03/seq2seq.jpg)
_(img: esciencegroup.files.wordpress.com)_


## Our task today to add additive attention


In [3]:
#We'll use data from https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench

In [1]:
!kaggle datasets download -d devicharith/language-translation-englishfrench

Dataset URL: https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench
License(s): CC0-1.0
language-translation-englishfrench.zip: Skipping, found more recently modified local copy (use --force to force download)


In [2]:
!unzip language-translation-englishfrench.zip

Archive:  language-translation-englishfrench.zip
replace eng_-french.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [3]:
# !pip3 install torch>=1.3.0
!pip3 install subword-nmt &> log
#!wget https://www.dropbox.com/s/yy2zqh34dyhv07i/data.txt?dl=1 -O data.txt
!wget https://raw.githubusercontent.com/yandexdataschool/nlp_course/2020/week04_seq2seq/vocab.py -O vocab.py
# thanks to tilda and deephack teams for the data, Dmitry Emelyanenko for the code :)

--2025-03-04 01:36:30--  https://raw.githubusercontent.com/yandexdataschool/nlp_course/2020/week04_seq2seq/vocab.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2879 (2.8K) [text/plain]
Saving to: ‘vocab.py’


2025-03-04 01:36:31 (22.8 MB/s) - ‘vocab.py’ saved [2879/2879]



In [4]:
import csv
from nltk.tokenize import WordPunctTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE
tokenizer = WordPunctTokenizer()
def tokenize(x):
    return ' '.join(tokenizer.tokenize(x.lower()))

# split and tokenize the data
with open('train.en', 'w') as f_src,  open('train.fr', 'w') as f_dst:
  with open('eng_-french.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    header = next(csv_reader)
    for line in csv_reader:
        src_line, dst_line = line[0], line[1]
        f_src.write(tokenize(src_line) + '\n')
        f_dst.write(tokenize(dst_line) + '\n')

# build and apply bpe vocs
bpe = {}
for lang in ['en', 'fr']:
    learn_bpe(open('./train.' + lang), open('bpe_rules.' + lang, 'w'), num_symbols=8000)
    bpe[lang] = BPE(open('./bpe_rules.' + lang))

    with open('train.bpe.' + lang, 'w') as f_out:
        for line in open('train.' + lang):
            f_out.write(bpe[lang].process_line(line.strip()) + '\n')

100%|██████████| 8000/8000 [00:13<00:00, 577.47it/s] 
100%|██████████| 8000/8000 [00:12<00:00, 656.44it/s]


In [5]:
bpe['en'].process_line('A quick brown fox jumps over a lazy dog')

'A quick brown fox ju@@ mps over a lazy dog'

### Building vocabularies

We now need to build vocabularies that map strings to token ids and vice versa. We're gonna need these fellas when we feed training data into model or convert output matrices into words.

In [6]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [7]:
#data_inp = np.array(open('./train.bpe.ru').read().split('\n'))
data_inp = np.array(open('./train.bpe.fr').read().split('\n'))
data_out = np.array(open('./train.bpe.en').read().split('\n'))

from sklearn.model_selection import train_test_split
train_inp, dev_inp, train_out, dev_out = train_test_split(data_inp, data_out, test_size=3000,
                                                          random_state=42)
for i in range(3):
    print('inp:', train_inp[i])
    print('out:', train_out[i], end='\n\n')

inp: chez quel gla@@ cier allez - vous ?
out: which ice cream shop are you going to ?

inp: il fallait s ' y attendre .
out: it was to be expected .

inp: soyez dis@@ cr@@ ète !
out: be discreet .



In [8]:
from vocab import Vocab
inp_voc = Vocab.from_lines(train_inp)
out_voc = Vocab.from_lines(train_out)

### Encoder-decoder model

The code below contains a template for a simple encoder-decoder model: single GRU encoder/decoder, no attention or anything. This model is implemented for you as a reference and a baseline for your homework assignment.

In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
# device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [10]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') # f'cuda:{2}' if torch.cuda.is_available() else 'cpu' #0 -is GPU2, 1 is GPU3, 3 is GPU1, 4 is 4 5 is GPU5,

In [11]:
class BasicModel(nn.Module):
    def __init__(self, inp_voc, out_voc, emb_size=64, hid_size=128):
        """
        A simple encoder-decoder seq2seq model
        """
        super().__init__() # initialize base class to track sub-layers, parameters, etc.

        self.inp_voc, self.out_voc = inp_voc, out_voc
        self.hid_size = hid_size

        self.emb_inp = nn.Embedding(len(inp_voc), emb_size)
        self.emb_out = nn.Embedding(len(out_voc), emb_size)
        self.enc0 = nn.GRU(emb_size, hid_size, batch_first=True)

        self.dec_start = nn.Linear(hid_size, hid_size) #connection between encoder and decoder
        self.dec0 = nn.GRUCell(emb_size, hid_size)
        self.logits = nn.Linear(hid_size, len(out_voc))

    def forward(self, inp, out):
        """ Apply model in training mode """
        initial_state = self.encode(inp)
        return self.decode(initial_state, out)


    def encode(self, inp, **flags):
        """
        Takes symbolic input sequence, computes initial state
        :param inp: matrix of input tokens [batch, time]
        :returns: initial decoder state tensors, one or many
        """
        inp_emb = self.emb_inp(inp)
        batch_size = inp.shape[0]

        enc_seq, [last_state_but_not_really] = self.enc0(inp_emb)
        # enc_seq: [batch, time, hid_size], last_state: [batch, hid_size]

        # note: last_state is not _actually_ last because of padding, let's find the real last_state
        lengths = (inp != self.inp_voc.eos_ix).to(torch.int64).sum(dim=1).clamp_max(inp.shape[1] - 1)
        # print((inp != self.inp_voc.eos_ix).to(torch.int64))
        last_state = enc_seq[torch.arange(len(enc_seq)), lengths]
        # ^-- shape: [batch_size, hid_size]

        dec_start = self.dec_start(last_state)
        return [dec_start]

    def decode_step(self, prev_state, prev_tokens, **flags):
        """
        Takes previous decoder state and tokens, returns new state and logits for next tokens
        :param prev_state: a list of previous decoder state tensors, same as returned by encode(...)
        :param prev_tokens: previous output tokens, an int vector of [batch_size]
        :return: a list of next decoder state tensors, a tensor of logits [batch, len(out_voc)]
        """
        # prev_gru0_state = prev_state[0]
        [prev_gru0_state, ] = prev_state

        prev_token_embs = self.emb_out(prev_tokens)

        new_gru_activations = self.dec0(prev_token_embs, prev_gru0_state)
        new_dec_state = [new_gru_activations]
        output_logits = self.logits(new_gru_activations)

        return new_dec_state, output_logits

    def decode(self, initial_state, out_tokens, **flags):
        """ Iterate over reference tokens (out_tokens) with decode_step """
        batch_size = out_tokens.shape[0]
        state = initial_state

        # initial logits: always predict BOS
        onehot_bos = F.one_hot(torch.full([batch_size], self.out_voc.bos_ix, dtype=torch.int64),
                               num_classes=len(self.out_voc)).to(device=out_tokens.device)
        first_logits = torch.log(onehot_bos.to(torch.float32) + 1e-9)

        logits_sequence = [first_logits]
        for i in range(out_tokens.shape[1] - 1):
            state, logits = self.decode_step(state, out_tokens[:, i])
            logits_sequence.append(logits)
        return torch.stack(logits_sequence, dim=1)

    def decode_inference(self, initial_state, max_len=100, **flags):
        """ Generate translations from model (greedy version) """
        batch_size, device = len(initial_state[0]), initial_state[0].device
        state = initial_state
        outputs = [torch.full([batch_size], self.out_voc.bos_ix, dtype=torch.int64,
                              device=device)]
        all_states = [initial_state]

        for i in range(max_len):
            state, logits = self.decode_step(state, outputs[-1])
            outputs.append(logits.argmax(dim=-1))
            all_states.append(state)

        return torch.stack(outputs, dim=1), all_states

    def translate_lines(self, inp_lines, **kwargs):
        inp = self.inp_voc.to_matrix(inp_lines).to(device)
        initial_state = self.encode(inp)
        out_ids, states = self.decode_inference(initial_state, **kwargs)
        return self.out_voc.to_lines(out_ids.cpu().numpy()), states


### Training loss

Our training objective is almost the same as it was for neural language models:
$$ L = {\frac1{|D|}} \sum_{X, Y \in D} \sum_{y_t \in Y} - \log p(y_t \mid y_1, \dots, y_{t-1}, X, \theta) $$

where $|D|$ is the __total length of all sequences__, including BOS and first EOS, but excluding PAD.

### Evaluation: BLEU

Machine translation is commonly evaluated with [BLEU](https://en.wikipedia.org/wiki/BLEU) score. This metric simply computes which fraction of predicted n-grams is actually present in the reference translation. It does so for n=1,2,3 and 4 and computes the geometric average with penalty if translation is shorter than reference.

While BLEU [has many drawbacks](http://www.cs.jhu.edu/~ccb/publications/re-evaluating-the-role-of-bleu-in-mt-research.pdf), it still remains the most commonly used metric and one of the simplest to compute.

### Your Attention Required

In this section we want you to improve over the basic model by implementing a simple attention mechanism.

This is gonna be a two-parter: building the __attention layer__ and using it for an __attentive seq2seq model__.

### Attention layer (1 points)

Here you will have to implement a layer that computes a simple additive attention:

Given encoder sequence $ h^e_0, h^e_1, h^e_2, ..., h^e_T$ and a single decoder state $h^d$,

* Compute logits with a 2-layer neural network
$$a_t = linear_{out}(tanh(linear_{e}(h^e_t) + linear_{d}(h_d)))$$
* Get probabilities from logits,
$$ p_t = {{e ^ {a_t}} \over { \sum_\tau e^{a_\tau} }} $$

* Add up encoder states with probabilities to get __attention response__
$$ attn = \sum_t p_t \cdot h^e_t $$

You can learn more about attention layers in the lecture slides or [from this post](https://distill.pub/2016/augmented-rnns/).

In [12]:
class AttentionLayer(nn.Module):
    def __init__(self, name, enc_size, dec_size, hid_size, activ=torch.tanh):
        """ A layer that computes additive attention response and weights """
        super().__init__()
        self.name = name
        self.enc_size = enc_size # num units in encoder state
        self.dec_size = dec_size # num units in decoder state
        self.hid_size = hid_size # attention layer hidden units
        self.activ = activ       # attention layer hidden nonlinearity

        # create trainable paramteres like this:
        # self.<PARAMETER_NAME> = nn.Parameter(<INITIAL_VALUES>, requires_grad=True)
        # <...>  # you will need a couple of these
        self.linear_e = nn.Linear(enc_size, hid_size, bias=False)
        self.linear_d = nn.Linear(dec_size, hid_size, bias=False)
        self.linear_out = nn.Linear(hid_size, 1, bias=False)


    def forward(self, enc, dec, inp_mask):
        """
        Computes attention response and weights
        :param enc: encoder activation sequence, float32[batch_size, ninp, enc_size]
        :param dec: single decoder state used as "query", float32[batch_size, dec_size]
        :param inp_mask: mask on enc activatons (0 after first eos), float32 [batch_size, ninp]
        :returns: attn[batch_size, enc_size], probs[batch_size, ninp]
            - attn - attention response vector (weighted sum of enc)
            - probs - attention weights after softmax
        """

        # Compute logits
        dec = dec.unsqueeze(1)
        enc_proj = self.linear_e(enc)
        dec_proj = self.linear_d(dec)

        # Apply mask - if mask is 0, logits should be -inf or -1e9
        # You may need torch.where
        attn_scores = self.linear_out(self.activ(enc_proj + dec_proj)).squeeze(-1)
        attn_scores = torch.where(inp_mask.unsqueeze(1) == 0, -1e9, attn_scores)

        # Compute attention probabilities (softmax)
        probs = torch.softmax(attn_scores, dim=1)

        # Compute attention response using enc and probs
        attn = torch.sum(probs.unsqueeze(-1) * enc, dim=1)

        return attn, probs

### Seq2seq model with attention

You can now use the attention layer to build a network. The simplest way to implement attention is to use it in decoder phase:
![img](https://i.imgur.com/6fKHlHb.png)
_image from distill.pub [article](https://distill.pub/2016/augmented-rnns/)_

On every step, use __previous__ decoder state to obtain attention response. Then feed concat this response to the inputs of next attention layer.

The key implementation detail here is __model state__. Put simply, you can add any tensor into the list of `encode` outputs. You will then have access to them at each `decode` step. This may include:
* Last RNN hidden states (as in basic model)
* The whole sequence of encoder outputs (to attend to) and mask
* Attention probabilities (to visualize)

_There are, of course, alternative ways to wire attention into your network and different kinds of attention. Take a look at [this](https://arxiv.org/abs/1609.08144), [this](https://arxiv.org/abs/1706.03762) and [this](https://arxiv.org/abs/1808.03867) for ideas. And for image captioning/im2latex there's [visual attention](https://arxiv.org/abs/1502.03044)_

In [25]:
class AttentiveModel(BasicModel):
    def __init__(self, name, inp_voc, out_voc,
                 emb_size=64, hid_size=128, attn_size=128):
        """ Translation model that uses attention. See instructions above. """
        nn.Module.__init__(self)  # initialize base class to track sub-layers, trainable variables, etc.
        self.inp_voc, self.out_voc = inp_voc, out_voc
        self.hid_size = hid_size
        self.name = name

        self.enc_emb = nn.Embedding(len(inp_voc), emb_size)
        self.dec_emb = nn.Embedding(len(out_voc), emb_size)

        self.enc = nn.GRU(emb_size, hid_size, batch_first=True)
        self.lin = nn.Linear(hid_size, hid_size)
        self.attn = AttentionLayer("attention", enc_size=hid_size, dec_size=hid_size, hid_size=attn_size)
        self.dec = nn.GRU(emb_size + hid_size, hid_size, batch_first=True)

        self.device = torch.device("cpu")

        # Output layer
        self.output = nn.Linear(hid_size, len(out_voc))

    def encode(self, inp, **flags):
        """
        Takes symbolic input sequence, computes initial state
        :param inp: matrix of input tokens [batch, time]
        :return: a list of initial decoder state tensors
        """
        inp = inp.to(self.device)
        inp_emb = self.enc_emb(inp)
        enc_out, enc_hidden = self.enc(inp_emb)

        inp_mask = (inp != self.inp_voc.eos_ix).to(torch.int64)

        dec_init_state = enc_hidden

        # Build first state: include
        # * initial states for decoder recurrent layers
        # * encoder sequence and encoder attn mask (for attention)
        # * make sure that last state item is attention probabilities tensor
        first_attn, first_attn_probas = self.attn(enc_out, dec_init_state.squeeze(0), inp_mask)
        return [dec_init_state, enc_out, inp_mask, first_attn_probas]

    def decode_step(self, prev_state, prev_tokens, **flags):
        """
        Takes previous decoder state and tokens, returns new state and logits for next tokens
        :param prev_state: a list of previous decoder state tensors
        :param prev_tokens: previous output tokens, an int vector of [batch_size]
        :return: a list of next decoder state tensors, a tensor of logits [batch, n_tokens]
        """
        dec_state, enc_out, inp, _ = prev_state
        is_inference = prev_tokens.shape[0] == 1

        prev_tokens = prev_tokens.view(-1)
        dec_emb = self.dec_emb(prev_tokens)

        print("dec_emb shape:", dec_emb.shape)

        atten, atten_probs = self.attn(enc_out, dec_state.squeeze(0), inp)
        atten = atten.squeeze(1)
        print("atten shape before slicing:", atten.shape)
        # atten = atten[:, -1, :]
        # print("atten shape after slicing:", atten.shape)
        atten = atten.unsqueeze(1) if is_inference else atten[:, -1, :]
        # if atten.shape[0] != dec_emb.shape[0]:
        #   atten = atten.expand(dec_emb.shape[0], -1)

        rnn_in = torch.cat([dec_emb, atten], dim=-1)
        dec_out, dec_state = self.dec(rnn_in.unsqueeze(1), dec_state)

        log_out = self.output(dec_out.squeeze(1))

        return [dec_state, enc_out, inp, atten_probs], log_out

### Training attentive model

Please reuse the infrastructure you've built for the regular model. I hope you didn't hard-code anything :)

In [26]:
#<YOUR CODE: create AttentiveModel and training utilities>
model = AttentiveModel("att_model", inp_voc, out_voc, hid_size=128, emb_size=128, attn_size=128)

data_loader = torch.utils.data.DataLoader
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

train_data = torch.utils.data.TensorDataset(torch.tensor(inp_voc.to_matrix(train_inp)), torch.tensor(out_voc.to_matrix(train_out)))
train_loader = data_loader(train_data, batch_size=32, shuffle=True)

test_data = torch.utils.data.TensorDataset(torch.tensor(inp_voc.to_matrix(dev_inp)), torch.tensor(out_voc.to_matrix(dev_out)))
test_loader = data_loader(test_data, batch_size=32, shuffle=False)

  train_data = torch.utils.data.TensorDataset(torch.tensor(inp_voc.to_matrix(train_inp)), torch.tensor(out_voc.to_matrix(train_out)))
  test_data = torch.utils.data.TensorDataset(torch.tensor(inp_voc.to_matrix(dev_inp)), torch.tensor(out_voc.to_matrix(dev_out)))


In [20]:
#<YOUR CODE: training loop>
EPOCHS = 3

device = torch.device("cpu")
model.to(device)

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0

    print(f"Starting Epoch {epoch+1}/{EPOCHS}")
    batch_count = 0

    for batch_inp, batch_out in train_loader:
        batch_count += 1

        print(f"Batch {batch_count}/{len(train_loader)}")


        batch_inp, batch_out = batch_inp.to(device), batch_out.to(device)

        optimizer.zero_grad()

        initial_state = model.encode(batch_inp)
        logits = model.decode(initial_state, batch_out)

        loss = criterion(logits.view(-1, logits.shape[-1]), batch_out.view(-1))

        if torch.isnan(loss) or torch.isinf(loss):
          break


        loss.backward()

        nn.utils.clip_grad_norm_(model.parameters(), 5.0)

        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{EPOCHS}, Loss: {avg_loss:.4f}")

Starting Epoch 1/3
Batch 1/5395
dec_emb shape: torch.Size([32, 128])
atten shape before slicing: torch.Size([32, 77, 128])
atten shape after slicing: torch.Size([32, 128])
dec_emb shape: torch.Size([32, 128])
atten shape before slicing: torch.Size([32, 77, 128])
atten shape after slicing: torch.Size([32, 128])
dec_emb shape: torch.Size([32, 128])
atten shape before slicing: torch.Size([32, 77, 128])
atten shape after slicing: torch.Size([32, 128])
dec_emb shape: torch.Size([32, 128])
atten shape before slicing: torch.Size([32, 77, 128])
atten shape after slicing: torch.Size([32, 128])
dec_emb shape: torch.Size([32, 128])
atten shape before slicing: torch.Size([32, 77, 128])
atten shape after slicing: torch.Size([32, 128])
dec_emb shape: torch.Size([32, 128])
atten shape before slicing: torch.Size([32, 77, 128])
atten shape after slicing: torch.Size([32, 128])
dec_emb shape: torch.Size([32, 128])
atten shape before slicing: torch.Size([32, 77, 128])
atten shape after slicing: torch.Size

KeyboardInterrupt: 

In [27]:
from nltk.translate.bleu_score import corpus_bleu

# <YOUR CODE: measure final BLEU>
model.eval()
references = []
hypotheses = []

device = torch.device("cpu")

with torch.no_grad():
    for batch_inp, batch_out in test_loader:
        batch_inp = batch_inp.to(device)
        pred_out, _ = model.decode_inference(model.encode(batch_inp))

        ref_text = out_voc.to_lines(batch_out.cpu().numpy())
        hyp_text = out_voc.to_lines(pred_out.cpu().numpy())

        references.extend([[r.split()] for r in ref_text])
        hypotheses.extend([h.split() for h in hyp_text])

bleu_score = corpus_bleu(references, hypotheses)
print(f"Final BLEU Score: {bleu_score:.4f}")

dec_emb shape: torch.Size([1, 128])
atten shape before slicing: torch.Size([32, 36, 128])


RuntimeError: Tensors must have same number of dimensions: got 2 and 4

In [95]:
torch.save(model.state_dict(), 'attention.pt')

### Visualizing model attention (1 points)

After training the attentive translation model, you can check it's sanity by visualizing its attention weights.

We provided you with a function that draws attention maps using [`Bokeh`](https://bokeh.pydata.org/en/latest/index.html). Once you managed to produce something better than random noise, please leave them in the notebook or save  bokeh figures and add to your sumbission. You can save bokeh images as screenshots or using this button:

![bokeh_panel](https://github.com/yandexdataschool/nlp_course/raw/2019/resources/bokeh_panel.png)

__Note:__ you're not locked into using bokeh. If you prefer a different visualization method, feel free to use that instead of bokeh.

In [96]:
import bokeh.plotting as pl
import bokeh.models as bm
from bokeh.io import output_notebook, show
output_notebook()

def draw_attention(inp_line, translation, probs):
    """ An intentionally ambiguous function to visualize attention weights """
    inp_tokens = inp_voc.tokenize(inp_line)
    trans_tokens = out_voc.tokenize(translation)
    probs = probs[:len(trans_tokens), :len(inp_tokens)]

    fig = pl.figure(x_range=(0, len(inp_tokens)), y_range=(0, len(trans_tokens)),
                    x_axis_type=None, y_axis_type=None, tools=[])
    fig.image([probs[::-1]], 0, 0, len(inp_tokens), len(trans_tokens))

    fig.add_layout(bm.LinearAxis(axis_label='source tokens'), 'above')
    fig.xaxis.ticker = np.arange(len(inp_tokens)) + 0.5
    fig.xaxis.major_label_overrides = dict(zip(np.arange(len(inp_tokens)) + 0.5, inp_tokens))
    fig.xaxis.major_label_orientation = 45

    fig.add_layout(bm.LinearAxis(axis_label='translation tokens'), 'left')
    fig.yaxis.ticker = np.arange(len(trans_tokens)) + 0.5
    fig.yaxis.major_label_overrides = dict(zip(np.arange(len(trans_tokens)) + 0.5, trans_tokens[::-1]))

    show(fig)

In [97]:
for inp_line, trans_line in zip(dev_inp[::500], model.translate_lines(dev_inp[::500])[0]):
    print(inp_line)
    print(trans_line)
    print()

RuntimeError: The expanded size of the tensor (1) must match the existing size (6) at non-singleton dimension 0.  Target sizes: [1, -1].  Tensor sizes: [6, 128]

In [98]:
inp = dev_inp[::500]

trans, states = model.translate_lines(inp)

# select attention probs from model state (you may need to change this for your custom model)
# attention_probs below must have shape [batch_size, translation_length, input_length], extracted from states
# e.g. if attention probs are at the end of each state, use np.stack([state[-1] for state in states], axis=1)


RuntimeError: The expanded size of the tensor (1) must match the existing size (6) at non-singleton dimension 0.  Target sizes: [1, -1].  Tensor sizes: [6, 128]

In [99]:
probs = [states[i][-1] for i in range(len(states))]

NameError: name 'states' is not defined

In [100]:
attention_probs = np.stack([state[-1].cpu().detach().numpy() for state in states], axis=1)

NameError: name 'states' is not defined

In [101]:
for i in range(5):
    draw_attention(inp[i], trans[i], attention_probs[i])

# Does it look fine already? don't forget to save images for anytask!

NameError: name 'trans' is not defined

__Note:__ If the attention maps are not iterpretable, try starting encoder from zeros (instead of dec_start), forcing model to use attention.

## Implement Two different archetectures with attention (8 points)

We want you to find the best model for the task. Use everything you know.

* add attention after RNN or as a hidden state input
* different recurrent units: rnn/gru/lstm; deeper architectures
* bidirectional encoder, different attention methods for decoder (additive, dot-product, multi-head)
* word dropout, training schedules, anything you can imagine
* replace greedy inference with beam search

Describe what you tried and what results you obtained in a short report.