# Primer on Attention

## Resources
- [Visualize NMT w Attention](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)
- [Attention Explanation](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)
- [Another](https://medium.com/analytics-vidhya/attention-is-all-you-need-1-3b960b7b6500)
- [Another](https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f)

## Primer

*Below is a summary of the above articles. Note that this is not a polished educational resource in its own right. This is simply me summarizing the concepts that I learn.*

__Attention__ originated in __Neural Machine Translation__ (NMT). Before attention, LSTM encoder-decoder architectures were standard practice: the *final hidden state of the encoder LSTM* (called the __context vector__) is passed to the decoder LSTM. The decoder uses this summary of the input sentence to generate a sequence of words (the translation) as output.

![img](https://lilianweng.github.io/lil-log/assets/images/encoder-decoder-example.png)

In __models with attention__, the encoder passes the hidden states for all time steps to the decoder. At each decoding time step, the decoder combines these hidden states into a single context vector. __Attention is in the decoder.__

1. The decoder RNN takes in the embedding of the `<BEGIN_SENTENCE>` token and an initial decoder hidden state.
2. The RNN produces an initial output and a hidden state vector (h1). The output is discarded.
3. Use the encoder hidden states and the decoder hidden state h1 to create a probability mask over the encoder hidden states (__attention step__). Combine the encoder hidden states weighted by the probability mask -> context vector.
4. Concatenate the decoder hidden state h1 with the context vector.
5. Pass this vector through a feedforward head network to produce an output from the vocabulary.
6. The output is saved as the decoder output for time step 1 ($y_1$).
7. Pass the output of the previous step ($y_1$) and the decoder hidden state (h1) to the decoder for the next time step -> produce a new hidden state (h2). Return to step 3 and repeat for the entire sequence.
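
The attention step (3) and the concatenation (4) can be sketched in plain Python. This is only a minimal illustration under stated assumptions: it uses a plain dot product as the scoring function, the hidden states are made-up 2-dimensional vectors, and `softmax` and `attention_step` are hypothetical helper names, not part of any library.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability mask that sums to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(dec_hidden, enc_hiddens):
    """Steps 3 and 4: score, normalize, combine, concatenate."""
    # Assumption: dot-product score between the decoder state and each encoder state.
    scores = [sum(d * e for d, e in zip(dec_hidden, h)) for h in enc_hiddens]
    weights = softmax(scores)
    # Context vector: encoder hidden states weighted by the probability mask.
    context = [sum(w * h[j] for w, h in zip(weights, enc_hiddens))
               for j in range(len(enc_hiddens[0]))]
    # Concatenate the decoder hidden state with the context vector.
    return weights, context, dec_hidden + context

enc_hiddens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states, n = 3
h1 = [2.0, 0.0]                                     # toy decoder hidden state
weights, context, concatenated = attention_step(h1, enc_hiddens)
print(len(weights), len(context), len(concatenated))  # 3 2 4
```

The first and third encoder states have the larger dot product with `h1`, so they receive most of the probability mass. The concatenated vector is what would be fed to the feedforward head in step 5.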



[Here's Jay Alammar's visual of the decoder](http://jalammar.github.io/images/attention_tensor_dance.mp4)

__Things to Note:__ 
- This process generalizes to any Seq2Seq task.
- The process is simply: encoder hidden states -> decoder hidden state -> attention prob mask -> context vector -> concatenate -> feedforward net to softmax output

## In Depth with the Math

Task: given a source sequence of length $n$ and an output sequence of length $m$, $$x = [x_1, ..., x_n]$$ $$y = [y_1, ..., y_m]$$

The encoder is a bidirectional RNN whose forward and backward hidden states are concatenated at each time step $i=1,..,n$.
$$h_i = [\overrightarrow{h_i},\overleftarrow{h_i}]$$

The encoder processes the input sequence $x$, and produces a matrix of hidden state vectors with shape = $(len(\overrightarrow{h_i})+len(\overleftarrow{h_i}),  n)$
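
As a small illustration of the concatenation, here is a sketch with made-up numbers (the values and the hidden size of 2 per direction are assumptions for the example). The states are stored as a list of $n$ vectors, which is the transpose of the matrix layout described above.

```python
# Toy forward and backward hidden states for n = 3 time steps.
# Both lists are assumed to be aligned so that index i refers to the same input token.
forward = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
backward = [[0.9, 0.8], [0.7, 0.6], [0.5, 0.4]]

# h_i = [forward_i, backward_i]: concatenate the two directions per time step.
h = [f + b for f, b in zip(forward, backward)]

n = len(h)              # number of time steps
state_size = len(h[0])  # len(forward_i) + len(backward_i)
print(n, state_size)    # 3 4
```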

Next, the decoder initializes a hidden state $s_0$ based on the sentence start token (this can be confusing, but the initialization step makes all the formulas make sense) 

The decoder hidden state $s_t = f(s_{t-1}, y_{t-1}, c_t)$ for output $t=1,..,m$

$$c_t = \alpha_t\cdot h$$
$$\alpha_t = softmax(score(s_{t-1},h))$$

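__Alignment Score Functions__ score the previous decoder hidden state $s_{t-1}$ against each encoder hidden state $h_i$. (Luong et al. originally score the current decoder state $s_t$; the previous state is used here to match the formula for $\alpha_t$.)
$$score_{Bahdanau}(s_{t-1},h_i) = v^T\cdot tanh(W \cdot [s_{t-1};h_i])$$
$$score_{Luong}(s_{t-1}, h_i) = s_{t-1}^T\cdot W\cdot h_i$$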
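
The two score functions can be sketched in plain Python for a single decoder state and a single encoder state. This is a toy illustration: the weights `W_add`, `v`, and `W_mul` are hand-picked values rather than learned parameters, and `matvec` and `dot` are hypothetical helpers.

```python
import math

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Matrix-vector product, with the matrix given as a list of rows."""
    return [dot(row, vector) for row in matrix]

def score_bahdanau(s, h_i, W, v):
    """Additive score: v^T tanh(W [s; h_i])."""
    return dot(v, [math.tanh(z) for z in matvec(W, s + h_i)])

def score_luong(s, h_i, W):
    """Multiplicative ("general") score: s^T W h_i."""
    return dot(s, matvec(W, h_i))

s = [1.0, 0.0]    # toy decoder hidden state
h_i = [0.0, 1.0]  # toy encoder hidden state

W_add = [[1.0, 0.0, 0.0, 1.0]]    # shape (1, 4): maps [s; h_i] to one unit
v = [1.0]                         # shape (1,)
W_mul = [[0.0, 1.0], [1.0, 0.0]]  # shape (2, 2)

bahdanau = score_bahdanau(s, h_i, W_add, v)  # tanh(1 + 1)
luong = score_luong(s, h_i, W_mul)           # [1, 0] . [1, 0]
print(round(bahdanau, 4), luong)
```

Either function yields one scalar per encoder position. Applying it to every $h_i$ and taking a softmax over the results gives the attention weights.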

With attention, long-term dependencies can be established readily.

__Self attention__ - Relates different positions of the same input sequence in order to compute a representation of the same sequence.

__Soft Attention__ - Attention mechanism has access to the entire input. 

__Hard Attention__ - Attention mechanism has access to one patch of input at a time (more efficient compute)
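
The difference between soft and hard attention can be shown on toy data. This sketch assumes the attention weights are already computed. It also simplifies hard attention to picking the highest-weighted position, whereas hard attention is typically trained by sampling a position. The function names are hypothetical.

```python
def soft_attention(weights, values):
    """Blend all input positions, weighted by their attention probability."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def hard_attention(weights, values):
    """Attend to a single input position (simplification: take the argmax)."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return values[best]

weights = [0.1, 0.7, 0.2]                      # toy attention probabilities
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy input representations

soft = soft_attention(weights, values)  # mixes all three positions
hard = hard_attention(weights, values)  # only the second position
print(hard)  # [0.0, 1.0]
```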

# Neural Machine Translation w TensorFlow

## Resources
- [TF2 Docs Example NMT w Attention](https://www.tensorflow.org/tutorials/text/nmt_with_attention)
- [NMT GitHub](https://github.com/tensorflow/nmt)
- [NMT w Attention Notebook](https://github.com/tensorflow/examples/blob/master/community/en/nmt_with_luong_attention.ipynb)



## TF2 Docs NMT Example ([link](https://www.tensorflow.org/tutorials/text/nmt_with_attention))

This section goes through the basics of:
- Preprocessing and tokenizing a text sequence.
- Padding the sequences to a common length.
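
To make these two steps concrete, here is a library-free sketch of tokenizing and post-padding. It is an illustration only: `build_vocab` and `encode_and_pad` are hypothetical helpers, and ids are assigned in order of first appearance, which is an assumption of this sketch and not necessarily how the Keras `Tokenizer` orders them.

```python
def build_vocab(sentences):
    """Assign each word an integer id starting at 1; 0 is reserved for padding."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode_and_pad(sentences, vocab):
    """Map words to ids, then append zeros so all sequences share one length."""
    encoded = [[vocab[word] for word in sentence.split()] for sentence in sentences]
    longest = max(len(seq) for seq in encoded)
    return [seq + [0] * (longest - len(seq)) for seq in encoded]

sentences = ["<start> go . <end>", "<start> i am here . <end>"]
vocab = build_vocab(sentences)
padded = encode_and_pad(sentences, vocab)
print(padded)  # [[1, 2, 3, 4, 0, 0], [1, 5, 6, 7, 3, 4]]
```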

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import re
import os
import unicodedata
import io
from sklearn.model_selection import train_test_split

import tensorflow as tf
print(tf.__version__)

2.1.0


In [2]:
base_dir = "E://Data/spa-eng/"
print(os.listdir(base_dir))
data_dir = os.path.join(base_dir, 'spa.txt')

['spa.txt', '_about.txt']


In [3]:
def unicode_to_ascii(s):
    """ Converts unicode characters to ascii."""
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def preprocess_data(sentence):
    """
    Removes whitespaces, converts to ascii, adds spaces before after punctuation,
    removes extra spaces, removes non letter/punctuation characters,
    adds start and end tokens.
    """
    sentence = sentence.lower().strip()
    sentence = unicode_to_ascii(sentence)
    sentence = re.sub(r'([?.!,¿])', r' \1 ', sentence) #add space between words and punctuations
    sentence = re.sub(r'[" "]+', r' ', sentence) #remove double spaces
    sentence = re.sub(r"[^a-zA-Z?.!,¿]+", " ", sentence)
    sentence = sentence.strip()
    sentence = '<start> '+ sentence + ' <end>' #add start and end tokens
    return sentence

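def create_dataset(path):
    """Reads tab-separated sentence pairs and returns one tuple of sentences per column."""
    with io.open(path, encoding='UTF-8') as f: #close the file once it has been read
        lines = f.read().strip().split("\n")
    word_pairs = [[preprocess_data(sentence) for sentence in l.split('\t')] for l in lines]
    return zip(*word_pairs)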

In [4]:
eng, spa, _ = create_dataset(data_dir)

print(eng[-1])
print(spa[-1])

<start> it may be impossible to get a completely error free corpus due to the nature of this kind of collaborative effort . however , if we encourage members to contribute sentences in their own languages rather than experiment in languages they are learning , we might be able to minimize errors . <end>
<start> puede que sea imposible obtener un corpus completamente libre de errores debido a la naturaleza de este tipo de esfuerzo de colaboracion . sin embargo , si animamos a los miembros a contribuir frases en sus propios idiomas en lugar de experimentar con los idiomas que estan aprendiendo , podriamos ser capaces de minimizar los errores . <end>


In [5]:
def max_length(tensor):
    return max(len(t) for t in tensor)

def tokenize(lang):
    lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters = "")
    lang_tokenizer.fit_on_texts(lang)
    tensor = lang_tokenizer.texts_to_sequences(lang)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding='post')
    return tensor, lang_tokenizer

def load_dataset(path):
    #use the path argument rather than the global data_dir
    targ_lang, inp_lang, _ = create_dataset(path)
    input_tensor, inp_lang_tokenizer = tokenize(inp_lang)
    target_tensor, targ_lang_tokenizer = tokenize(targ_lang)
    return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

In [6]:
input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer = load_dataset(data_dir)
in_max, out_max = max_length(input_tensor), max_length(target_tensor)

input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)

print(len(input_tensor_train), len(input_tensor_val))

98668 24667


In [7]:
#what the tensor looks like. 
input_tensor[:2]

array([[   1,  370,    3,    2,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0],
       [   1, 1397,    3,    2,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]])

In [8]:
targ_lang_tokenizer

<keras_preprocessing.text.Tokenizer at 0x1a29f5f3d88>

In [9]:
def index_to_word(tokenizer, tensor):
    for t in tensor:
        if t!=0:
            print("%d ---> %s" % (t, tokenizer.index_word[t]))
    
index_to_word(inp_lang_tokenizer, input_tensor[0])
print()
index_to_word(targ_lang_tokenizer, target_tensor[0])

1 ---> <start>
370 ---> ve
3 ---> .
2 ---> <end>

1 ---> <start>
49 ---> go
3 ---> .
2 ---> <end>


### Create a tf.data dataset

In [10]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
STEPS_PER_EPOCH = len(input_tensor_train) // BATCH_SIZE
embedding_dim = 256
units = 1024

vocab_inp_size = len(inp_lang_tokenizer.word_index)+1
vocab_tar_size = len(targ_lang_tokenizer.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE) #shuffle over the whole training set
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [11]:
example_input_batch, example_target_batch = next(iter(dataset))
example_input_batch.shape, example_target_batch.shape

(TensorShape([64, 59]), TensorShape([64, 54]))

### Implementing the Attention Model

The encoder will output a tensor with shape (batch_size, max_length, hidden_size) and an encoder hidden state of shape (batch_size, hidden_size).
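
The two shapes can be illustrated with a toy recurrent loop in plain Python. This is a sketch under loud assumptions: the update rule is a made-up stand-in for the GRU cell, each time step has a single scalar feature instead of an embedding vector, and `toy_rnn` is a hypothetical name.

```python
import math

def toy_rnn(batch, hidden_size):
    """Returns the per-step outputs and the final hidden state for each sequence."""
    outputs, finals = [], []
    for seq in batch:
        h = [0.0] * hidden_size  # zero initial state
        seq_out = []
        for x in seq:
            # Hypothetical update rule standing in for the GRU cell.
            h = [math.tanh(0.5 * h_k + x) for h_k in h]
            seq_out.append(h)  # one hidden vector per time step
        outputs.append(seq_out)
        finals.append(h)       # state after the last time step
    return outputs, finals

batch = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # batch_size = 2, max_length = 3
outputs, finals = toy_rnn(batch, hidden_size=4)

out_shape = (len(outputs), len(outputs[0]), len(outputs[0][0]))
final_shape = (len(finals), len(finals[0]))
print(out_shape, final_shape)  # (2, 3, 4) (2, 4)
```

In this sketch the final hidden state is simply the output at the last time step, which is why it loses the time dimension.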



In [23]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.hidden_dim = hidden_dim
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.GRU = tf.keras.layers.GRU(hidden_dim, 
                                       return_sequences=True, #because we are doing seq2seq vector for each timestep
                                      return_state=True) #return the final hidden state
    
    def init_state(self):
        return tf.zeros((self.batch_size, self.hidden_dim))
    
    def call(self, x):
        x = self.embedding(x)
        out, hidden = self.GRU(x, initial_state=self.init_state())
        return out, hidden

In [30]:
enc = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

sample_out, sample_hidden = enc(example_input_batch)
print("Input Batch Shape:",example_input_batch.shape) 
#Input to GRU will be shape (batch_size, max_length, emb_dim)
print("Output shape:",sample_out.shape)
print("Hidden Shape:", sample_hidden.shape)

Input Batch Shape: (64, 59)
Output shape: (64, 59, 1024)
Hidden Shape: (64, 1024)


In [17]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units, activation=None)
        self.W2 = tf.keras.layers.Dense(units, activation=None)
        self.W3 = tf.keras.layers.Dense(1, activation=None)
    
    def call(self, s, h):
        """
        s - Decoder hidden state for t-1, shape: (batch_size, dec_hidden_units)
        h - Encoder outputs for t=1,..,n. shape: (batch_size, max_len, enc_hidden_units)
        """
        s = tf.expand_dims(s,1) #makes s and h both have rank 3
        score = self.W3(tf.nn.tanh(self.W1(s) + self.W2(h))) #shape: (batch_size, max_len, 1)
        a = tf.nn.softmax(score, axis=1) #normalize over the max_len (time) axis
        context_vector = a*h
        #equivalent to tf.squeeze(tf.linalg.matmul(a, h, transpose_a=True), axis=1)
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, a

    
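class LuongAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(LuongAttention, self).__init__()
        #units must equal the encoder hidden size so that W1(s) can be multiplied with h
        self.W1 = tf.keras.layers.Dense(units, use_bias=False)
    
    def call(self, s, h):
        #s shape: (batch_size, dec_hidden_units), h shape: (batch_size, max_len, enc_hidden_units)
        s = tf.expand_dims(self.W1(s), 2) #shape: (batch_size, enc_hidden_units, 1)
        score = tf.linalg.matmul(h, s) #shape: (batch_size, max_len, 1)
        a = tf.nn.softmax(score, axis=1)
        context_vector = tf.linalg.matmul(a, h, transpose_a=True) #shape: (batch_size, 1, enc_hidden_units)
        return tf.squeeze(context_vector, axis=1), a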
        
        

In [19]:
attention = BahdanauAttention(10)
context, att = attention(tf.random.normal([BATCH_SIZE,units]), sample_out)
print("Context Vector Shape:", context.shape)
print("Attention Vector Shape:", att.shape)

Context Vector Shape: (64, 1024)
Attention Vector Shape: (64, 59, 1)


In [27]:
class Decoder(tf.keras.Model):
    def __init__(self, target_lang_size, emb_dim, dec_units, batch_size):
        super(Decoder, self).__init__()
        self.dec_units = dec_units
        self.batch_size = batch_size
        self.emb = tf.keras.layers.Embedding(target_lang_size, emb_dim)
        self.GRU = tf.keras.layers.GRU(dec_units, 
                                       return_sequences=True, 
                                       return_state=True)
        self.attention = BahdanauAttention(dec_units)
        self.Dense1 = tf.keras.layers.Dense(target_lang_size)
    
    def init_hidden_state(self):
        return tf.zeros((self.batch_size, self.dec_units))
    
    def call(self, x, dec_hidden, enc_output):
        #context vector has shape (batch_size, enc_units)
        context, att = self.attention(dec_hidden, enc_output)
        x = self.emb(x)
        # x shape after concatenation == (batch_size, 1, hidden_size + embedding_dim)
        x = tf.concat([tf.expand_dims(context, axis=1), x], axis=-1)
        #note that the GRU will only output one vector per sample since the input format is (batch_size, time_steps=1, features)
        out, hidden = self.GRU(x)
        out = tf.reshape(out, (-1, out.shape[2]))
        out = self.Dense1(out)
        return out, hidden, att

In [33]:
dec = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)
sample_output, _, att_weights = dec(tf.random.uniform((BATCH_SIZE, 1)), #simulate a word
                                      sample_hidden, 
                                    sample_out)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_output.shape))


Decoder output shape: (batch_size, vocab size) (64, 13048)


### Training

In [None]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
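
The effect of the padding mask in the loss function above can be checked by hand on a tiny example. The sketch below is plain Python with made-up numbers. `masked_mean_loss` is a hypothetical helper that takes the predicted probability of the correct token directly, rather than logits.

```python
import math

def masked_mean_loss(targets, correct_token_probs):
    """Cross-entropy per position, zeroed where the target is padding (id 0)."""
    losses = []
    for target, p in zip(targets, correct_token_probs):
        token_loss = -math.log(p)
        losses.append(token_loss if target != 0 else 0.0)
    # Like a plain mean, this divides by all positions, including padded ones.
    return sum(losses) / len(losses)

targets = [5, 0]                   # second position is padding
correct_token_probs = [0.5, 0.01]  # the poor prediction on padding is ignored
loss = masked_mean_loss(targets, correct_token_probs)
print(round(loss, 4))  # 0.3466
```

Because the mean is taken over every position, batches with more padding report a smaller average loss. Dividing by the number of non-padded positions instead would be an alternative design choice.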

In [None]:
checkpoint_dir = 'E://Models/Neural_Machine_Translation'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
#enc and dec are the Encoder and Decoder instances created above
checkpoint = tf.train.Checkpoint(optimizer=optimizer,
                                 encoder=enc,
                                 decoder=dec)


In [None]:
@tf.function
def train_step(inp, targ):
    loss = 0
    with tf.GradientTape() as tape:
        #Encoder.call creates its own zero initial state, so only the input is passed
        enc_output, enc_hidden = enc(inp)
        dec_hidden = enc_hidden
        dec_input = tf.expand_dims([targ_lang_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)
        #teacher forcing: feed the ground-truth token as the next decoder input
        for t in range(1, targ.shape[1]):
            predictions, dec_hidden, _ = dec(dec_input, dec_hidden, enc_output)
    
            loss += loss_function(targ[:,t], predictions)
            dec_input = tf.expand_dims(targ[:,t], 1)
    batch_loss = (loss / int(targ.shape[1]))
    variables = enc.trainable_variables + dec.trainable_variables
    gradient = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradient, variables))
    return batch_loss      

In [None]:
import time

EPOCHS = 10
for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0
    
    for (batch, (inp, targ)) in enumerate(dataset.take(STEPS_PER_EPOCH)):
        batch_loss = train_step(inp, targ)
        total_loss+=batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.numpy()))
    if (epoch+1) % 2 == 0:
        checkpoint.save(file_prefix=checkpoint_prefix)
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,
                                      total_loss / STEPS_PER_EPOCH))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

### Model Evaluation