# 16. Natural Language Processing with RNNs and Attention

Looking at it from a certain perspective, the Turing test is an NLP task. This chapter will focus on how to tackle NLP tasks (albeit less complex than a Turing test) using RNNs. 

### Generating Shakespearean Text Using a Character RNN

Let's look at how to build a Char-RNN, a net that predicts the next character in a sentence. 

#### Creating the Training Dataset

The Shakespeare text comes from Andrej Karpathy's GitHub repo. Here we read a local copy of it:

In [1]:
from tensorflow import keras
import os

filepath = os.path.join(os.getcwd(), 'datasets', 'shakespeare', 'input.txt')
with open(filepath) as f:
    shakespeare_text = f.read()

Next, we must encode every character as an integer. We will use `Tokenizer` for this. 

In [2]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts([shakespeare_text])

Let's quickly check what it does: 

In [3]:
tokenizer.texts_to_sequences(["Hello"])

[[7, 2, 12, 12, 4]]

In [4]:
tokenizer.sequences_to_texts([[7, 2, 12, 12, 4]])

['h e l l o']

In [5]:
max_id = len(tokenizer.word_index) # number of distinct characters

In [6]:
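# tokenizer.document_count would be 1 here (a single text was fitted),
# so we count the characters directly
dataset_size = len(shakespeare_text) # total number of characters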

In [7]:
import numpy as np

[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1 # starting from 0
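
To make the encoding step more concrete, here is a minimal pure-Python sketch of what a character-level tokenizer roughly does: lowercase the text, rank the characters by frequency, and assign IDs starting at 1. The helper names (`fit_char_index`, `encode_chars`, `decode_chars`) are made up for this illustration, and the alphabetical tie-breaking is an assumption, so the exact IDs may differ from the ones Keras assigns.

```python
from collections import Counter

def fit_char_index(text):
    """Map each lowercased character to an ID (1 = most frequent)."""
    counts = Counter(text.lower())
    # Sort by descending count; break ties alphabetically (an assumption)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode_chars(text, char_index):
    """Turn a string into a list of character IDs."""
    return [char_index[ch] for ch in text.lower()]

def decode_chars(ids, char_index):
    """Turn a list of character IDs back into a string."""
    index_char = {i: ch for ch, i in char_index.items()}
    return "".join(index_char[i] for i in ids)

sample_text = "To be, or not to be"
char_index = fit_char_index(sample_text)
ids = encode_chars("to be", char_index)
# Subtract 1 to get IDs starting from 0, as in the cell above
zero_based = [i - 1 for i in ids]
```

Because the IDs start at 1, subtracting 1 gives values in the range 0 to `max_id - 1`, which is what one-hot encoding expects later on.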

#### How to Split a Sequential Dataset

It is very important to avoid any overlap between the training set, the validation set, and the test set. It is also a good idea to leave a gap between these sets, so that no paragraph spans two of them.

In our case, let's keep 90% for the training set:

In [8]:
import tensorflow as tf

train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
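
As an illustration of the gap idea, here is a small sketch that splits a sequence into three contiguous parts and skips a few items between them. The function name `split_sequence`, the 90/5/5 percentages and the gap size are assumptions chosen for the example, not values from the notebook.

```python
def split_sequence(seq, train_pct=90, valid_pct=5, gap=0):
    """Split a sequence into contiguous train/valid/test parts,
    skipping `gap` items between consecutive parts."""
    n = len(seq)
    train_end = n * train_pct // 100
    valid_start = train_end + gap
    valid_end = valid_start + n * valid_pct // 100
    test_start = valid_end + gap
    # Whatever is left after the second gap becomes the test set
    return seq[:train_end], seq[valid_start:valid_end], seq[test_start:]

encoded_ids = list(range(1000))  # stand-in for the encoded text
train, valid, test = split_sequence(encoded_ids, gap=10)
```

With these toy values, the skipped items are taken out of the test portion, so the three parts never share or touch the same region of the text.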

#### Chopping the Sequential Dataset into Multiple Windows

Now we have a single very long sequence of characters. We can't train our RNN on it directly, so let's use `window` to convert this long sequence of characters into many smaller windows of text.

In [9]:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)

By default, windows are **not** overlapping, but we used `shift=1` to make them so. We set `drop_remainder=True` so that every window is exactly 101 characters long.

In [10]:
# flattening our dataset
dataset = dataset.flat_map(lambda window: window.batch(window_length))

Now the dataset contains consecutive windows of 101 characters each. To get the best out of Gradient Descent, we shuffle the windows, batch them, and then separate the inputs (the first 100 characters) from the targets (the last 100 characters, i.e. the inputs shifted one character ahead).

In [11]:
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
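
The windowing and the input/target split can be sketched without TensorFlow on a tiny sequence. This is only an illustration of the idea: `sliding_windows` is a hypothetical helper, and the toy sequence and `toy_n_steps = 3` stand in for the real encoded text and `n_steps = 100`.

```python
def sliding_windows(seq, window_length, shift=1):
    """Return every full window of `window_length` items, moving the
    start by `shift` items each time (incomplete windows are dropped)."""
    last_start = len(seq) - window_length
    return [seq[start:start + window_length]
            for start in range(0, last_start + 1, shift)]

toy_encoded = [0, 1, 2, 3, 4, 5, 6, 7]  # stand-in for the encoded text
toy_n_steps = 3
toy_windows = sliding_windows(toy_encoded, toy_n_steps + 1)
# Inputs are all but the last item, targets are all but the first one
pairs = [(w[:-1], w[1:]) for w in toy_windows]
```

Each target sequence is the input sequence shifted one step ahead, so the network is asked to predict the next character at every time step, not just the last one.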

Let's one-hot encode each input character (the targets stay as integer IDs):

In [12]:
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
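
For readers unfamiliar with one-hot encoding, the following sketch shows what happens to each character ID. The `one_hot` helper and the toy values are assumptions for illustration; `toy_max_id` plays the role of `max_id`.

```python
def one_hot(index, depth):
    """Return a list of `depth` zeros with a single 1.0 at `index`."""
    vector = [0.0] * depth
    vector[index] = 1.0
    return vector

toy_max_id = 4           # pretend there are only 4 distinct characters
toy_inputs = [2, 0, 3]   # one short window of character IDs
encoded_inputs = [one_hot(i, toy_max_id) for i in toy_inputs]
```

Every character becomes a vector with as many entries as there are distinct characters, which is why the IDs were shifted to start from 0 earlier.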

In [13]:
# prefetching
dataset = dataset.prefetch(1)

### Sentiment Analysis

The IMDb reviews dataset is the MNIST of NLP. Let's play around with it in this section.

In [14]:
(X_train, y_train), (X_test, y_test) = keras.datasets.imdb.load_data()

In [15]:
X_train[0][:10]

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65]

The reviews are already tokenized and indexed by frequency (low = most frequent). Special integers:

0 = Padding token  
1 = Start of sequence token  
2 = Unknown words  

Let's visualize a review: 

In [16]:
word_index = keras.datasets.imdb.get_word_index()

In [17]:
id_to_word = {id_ + 3: word for word, id_ in word_index.items()}

In [18]:
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token

In [19]:
" ".join([id_to_word[id_] for id_ in X_train[0][:10]])

'<sos> this film was just brilliant casting location scenery story'
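
The `+ 3` offset can be easier to follow on a toy example. The sketch below uses a made-up three-word `toy_word_index` instead of the real one returned by `get_word_index()`, and a hypothetical `decode_review` helper that falls back to `<unk>` for IDs it does not know.

```python
# Toy stand-in for keras.datasets.imdb.get_word_index(): word -> frequency rank
toy_word_index = {"the": 1, "and": 2, "film": 3}

# Ranks are shifted by 3 because IDs 0, 1 and 2 are reserved
id_to_word = {id_ + 3: word for word, id_ in toy_word_index.items()}
for id_, token in enumerate(("<pad>", "<sos>", "<unk>")):
    id_to_word[id_] = token

def decode_review(ids):
    """Turn a list of word IDs back into a readable string."""
    return " ".join(id_to_word.get(id_, "<unk>") for id_ in ids)

decoded = decode_review([1, 4, 6, 2, 0])
```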

Although tokenizing by word may work well in English (at least in most cases), it may create issues in other languages (e.g. Chinese, 中文), where words are not separated by spaces.

Next, let's do the preprocessing ourselves in TF. Let's load the reviews as strings:

In [20]:
import tensorflow_datasets as tfds
datasets, info = tfds.load("imdb_reviews", as_supervised=True,
                            with_info=True)
train_size = info.splits["train"].num_examples



In [21]:
# preprocessing function

def preprocess(X_batch, y_batch):
    # keeping only first 300 chars
    X_batch = tf.strings.substr(X_batch, 0, 300)
    # turn <br /> tags into spaces
    X_batch = tf.strings.regex_replace(X_batch, b"<br\\s*/?>", b" ")
    # anything that is not a letter or an apostrophe becomes a space
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch
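
To see what these steps do to a single review, here is a plain-Python sketch of the same pipeline using the standard `re` module. It is only an approximation: `preprocess_text` is a hypothetical helper, it works on one string rather than a batch, it truncates by characters rather than bytes, and it does not pad the result.

```python
import re

def preprocess_text(review, max_chars=300):
    """Truncate, strip <br> tags, keep letters and apostrophes, split on spaces."""
    review = review[:max_chars]
    review = re.sub(r"<br\s*/?>", " ", review)    # <br>, <br/>, <br /> -> space
    review = re.sub(r"[^a-zA-Z']", " ", review)   # other symbols -> space
    return review.split()

tokens = preprocess_text("Great film!<br />Loved it, didn't you?")
```

In the TF version, the batch of token lists has different lengths per review, which is why it is converted to a dense tensor padded with `<pad>`.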

Next, we need to construct the vocabulary.

In [22]:
from collections import Counter
vocabulary = Counter()
for X_batch, y_batch in datasets["train"].batch(32).map(preprocess):
    for review in X_batch:
        vocabulary.update(list(review.numpy()))

In [23]:
# most common words
vocabulary.most_common()[:3]

[(b'<pad>', 215349), (b'the', 61137), (b'a', 38564)]

In [24]:
# keeping only 10000 most popular 
vocab_size = 10000
truncated_vocabulary = [
    word for word, count in vocabulary.most_common()[:vocab_size]]

In [25]:
# replace each word with ID
words = tf.constant(truncated_vocabulary)
word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
vocab_init = tf.lookup.KeyValueTensorInitializer(words, word_ids)
# out of vocabulary (oov)
num_oov_buckets = 1000
table = tf.lookup.StaticVocabularyTable(vocab_init, num_oov_buckets)

In [26]:
table.lookup(tf.constant([b"This movie was amaaaazing".split()]))

<tf.Tensor: shape=(1, 4), dtype=int64, numpy=array([[   22,    12,    11, 10698]], dtype=int64)>

Note how the last word is not in the vocabulary, so it is mapped to one of the OOV buckets (ID ≥ 10,000, since vocabulary IDs run from 0 to 9,999).
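
The idea behind OOV buckets can be sketched in a few lines of plain Python: known words get their vocabulary index, and unknown words are hashed into one of a fixed number of extra IDs placed after the vocabulary. The `make_lookup` helper and the use of CRC32 are assumptions for this illustration; TensorFlow uses its own hash function, so the actual bucket numbers will differ.

```python
import zlib

def make_lookup(vocab, num_oov_buckets):
    """Return a function mapping a word to its ID, or to an OOV bucket ID."""
    word_to_id = {word: i for i, word in enumerate(vocab)}

    def lookup(word):
        if word in word_to_id:
            return word_to_id[word]
        # Deterministic hash so the same unknown word always lands in the
        # same bucket (CRC32 is an arbitrary choice for this sketch)
        bucket = zlib.crc32(word.encode("utf-8")) % num_oov_buckets
        return len(vocab) + bucket

    return lookup

toy_vocab = ["<pad>", "the", "a", "movie", "was"]
lookup = make_lookup(toy_vocab, num_oov_buckets=3)
known_id = lookup("movie")
oov_id = lookup("amaaaazing")
```

Using several buckets rather than a single unknown ID lets different unknown words map to different IDs, at the cost of occasional collisions.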

In [27]:
def encode_words(X_batch, y_batch):
    return table.lookup(X_batch), y_batch

train_set = datasets["train"].batch(32).map(preprocess)
train_set = train_set.map(encode_words).prefetch(1)

Time to create the model and train it!

In [None]:
embed_size = 128

model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size + num_oov_buckets, embed_size,
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    keras.layers.Dense(1, activation="sigmoid")
])

model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])
history = model.fit(train_set, epochs=5)

Epoch 1/5
    627/Unknown - 131s 209ms/step - loss: 0.6166 - accuracy: 0.6385

### Masking

Our model needs to ignore padding tokens (ID = 0). We can do that using `mask_zero=True` in the `Embedding` layer. 
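
As a rough, framework-free illustration of what masking achieves, the sketch below builds a boolean mask from the word IDs and uses it to ignore the padded time steps in a simple average. The helper names and toy values are made up, and the averaging is just one example of a computation that should skip padding; it is not what the GRU layers compute.

```python
def padding_mask(ids, pad_id=0):
    """True for real tokens, False for padding tokens."""
    return [token_id != pad_id for token_id in ids]

def masked_mean(values, mask):
    """Average only the values whose mask entry is True."""
    kept = [v for v, keep in zip(values, mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

toy_ids = [22, 12, 0, 0]                     # a review padded to length 4
per_step_values = [1.0, 3.0, 100.0, 100.0]   # e.g. one value per time step
mask = padding_mask(toy_ids)
mean_without_padding = masked_mean(per_step_values, mask)
```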

#### Reusing Pretrained Embeddings

In TensorFlow Hub, reusable model components are called **modules**. Let's build an example on top of the sentence embedding module `nnlm-en-dim50`:

In [None]:
import tensorflow_hub as hub

model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                   dtype=tf.string, input_shape=[], output_shape=[50]),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

Training the model:

In [None]:
datasets, info = tfds.load("imdb_reviews", as_supervised=True,
                           with_info=True)
train_size = info.splits["train"].num_examples
batch_size = 32
train_set = datasets["train"].batch(batch_size).prefetch(1)
history = model.fit(train_set, epochs=5)

### An Encoder–Decoder Network for Neural Machine Translation

![Machine Translation Model](images/16.Encoder-Decoder.png)

In short, sentences in one language are fed to the encoder, and the decoder outputs the translations in the other language.

Here is how an encoder-decoder machine translation model would work in slightly more detail:
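1. Every sentence is expected to have a **start-of-sentence** (SOS) and an **end-of-sentence** (EOS) token
2. Sentences in the first language are **reversed** before being fed to the encoder, so that the beginning of the sentence is fed last, which is usually the first thing the decoder needs to translate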
3. Each word is initially **represented by its ID** 
4. An embedding layer returns the **word embeddings**, which are fed to the encoder and the decoder
5. At each step, the decoder outputs a score for each word in the output vocabulary, and then the softmax layer turns these scores into probabilities. The word with the **highest probability** is output
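
The data preparation and the final softmax step in the list above can be sketched in plain Python. This is a simplified illustration: the helper names, the toy sentences and the scores are made up, and placing `<sos>` at the start of the decoder input and `<eos>` at the end of the target is an assumption about how the tokens are used during training.

```python
import math

def prepare_example(source_words, target_words):
    """Build the encoder input, decoder input and decoder target."""
    encoder_input = list(reversed(source_words))
    decoder_input = ["<sos>"] + target_words    # target shifted right by one
    decoder_target = target_words + ["<eos>"]
    return encoder_input, decoder_input, decoder_target

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

enc_in, dec_in, dec_target = prepare_example(
    ["I", "drink", "milk"], ["Je", "bois", "du", "lait"])

toy_vocab = ["Je", "bois", "du", "lait", "<eos>"]
toy_scores = [0.1, 2.5, 0.3, -1.0, 0.0]   # made-up decoder scores for one step
probas = softmax(toy_scores)
predicted_word = toy_vocab[probas.index(max(probas))]
```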

Example code for a basic Encoder–Decoder model using TF Addons:

In [None]:
import tensorflow_addons as tfa

encoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
decoder_inputs = keras.layers.Input(shape=[None], dtype=np.int32)
sequence_lengths = keras.layers.Input(shape=[], dtype=np.int32)

embeddings = keras.layers.Embedding(vocab_size, embed_size)
encoder_embeddings = embeddings(encoder_inputs)
decoder_embeddings = embeddings(decoder_inputs)

encoder = keras.layers.LSTM(512, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_embeddings)
encoder_state = [state_h, state_c]

sampler = tfa.seq2seq.sampler.TrainingSampler()

decoder_cell = keras.layers.LSTMCell(512)
output_layer = keras.layers.Dense(vocab_size)
decoder = tfa.seq2seq.basic_decoder.BasicDecoder(decoder_cell, sampler,
                                                 output_layer=output_layer)
final_outputs, final_state, final_sequence_lengths = decoder(
    decoder_embeddings, initial_state=encoder_state,
    sequence_length=sequence_lengths)
Y_proba = tf.nn.softmax(final_outputs.rnn_output)

model = keras.Model(inputs=[encoder_inputs, decoder_inputs,
                            sequence_lengths],
                    outputs=[Y_proba])

#### Bidirectional RNNs

At each time step, a regular recurrent layer only looks at past and present inputs before generating its output. For translation, however, it may make sense to look ahead a few words. To implement this, run two recurrent layers on the same inputs, one reading the words from left to right and the other reading them from right to left, and combine their outputs at each time step.

In [None]:
keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True))
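
The mechanics can be illustrated with a toy recurrent function in plain Python. This is not a GRU: `run_recurrent` is a made-up stand-in whose state is a decayed running sum, used only to show how the forward and backward passes are aligned and combined. Pairing the two states per time step mimics concatenating the two outputs.

```python
def run_recurrent(inputs, decay=0.5):
    """Toy recurrent layer: each state mixes the previous state with the
    current input, so it only depends on the inputs seen so far."""
    state, states = 0.0, []
    for x in inputs:
        state = decay * state + x
        states.append(state)
    return states

def run_bidirectional(inputs):
    """Run the toy layer in both directions and pair up the states."""
    forward = run_recurrent(inputs)
    backward = run_recurrent(inputs[::-1])[::-1]  # realign with time steps
    return [(f, b) for f, b in zip(forward, backward)]

outputs = run_bidirectional([1.0, 2.0, 3.0])
```

At the first time step, the forward state has only seen the first input, while the backward state already summarizes the whole sequence, which is the look-ahead the paragraph above describes.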

### Attention Mechanisms

Instead of just sending the encoder’s final hidden state to the decoder, we now send all of its outputs to the decoder. At each time step, the decoder computes a weighted sum of all these encoder outputs, and the weights determine which words it focuses on at that step (our **attention**).

![Machine Translation with Attention](images/16.Encoder-Decoder_Attention.png)

Where do the weights come from? From a small neural network called an **alignment model** (or an **attention layer**) trained alongside the rest of the Encoder-Decoder model.

Explained in a few steps:

1. We start with a time-distributed `Dense` layer with a single neuron, which receives as input all the encoder outputs, concatenated with the decoder’s previous hidden state (e.g., $h_{(2)}$ in the figure above)
2. This layer outputs a score (or energy) for each encoder output (e.g. $e_{(3,0)}$) which measures how well each output is aligned with the decoder’s previous hidden state
3. All the scores go through a `Softmax` layer to get a final weight for each encoder output (e.g., $\alpha_{(3,0)}$ )

This is what is usually called **concatenative attention** (or **additive attention**) since it concatenates the encoder output with the decoder’s previous hidden state.

A more widely used approach nowadays is **multiplicative attention**, which instead computes the dot product of the encoder’s outputs and the decoder’s previous hidden state as a similarity measure.

Example code for multiplicative attention in TF:

In [None]:
# units, n_units and encoder_sequence_length are placeholders
# that must be defined beforehand
attention_mechanism = tfa.seq2seq.attention_wrapper.LuongAttention(
    units, encoder_state,
    memory_sequence_length=encoder_sequence_length)
attention_decoder_cell = tfa.seq2seq.attention_wrapper.AttentionWrapper(
    decoder_cell, attention_mechanism, attention_layer_size=n_units)
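
To show the arithmetic behind dot-product attention without any library, here is a minimal pure-Python sketch. It is not the TF Addons implementation: the function names and toy vectors are made up, and it assumes the encoder outputs and the decoder state have the same size so that the dot product is defined.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(encoder_outputs, decoder_state):
    """Return the attention weights and the weighted sum of encoder outputs."""
    scores = [sum(e * d for e, d in zip(enc_out, decoder_state))
              for enc_out in encoder_outputs]
    weights = softmax(scores)
    size = len(encoder_outputs[0])
    context = [sum(w * enc_out[j] for w, enc_out in zip(weights, encoder_outputs))
               for j in range(size)]
    return weights, context

toy_encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_decoder_state = [0.0, 2.0]
weights, context = dot_product_attention(toy_encoder_outputs, toy_decoder_state)
```

The encoder output most similar to the decoder state (the second one here) receives the largest weight, so it dominates the context vector passed to the decoder.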

### Visual Attention

Visual attention is particularly useful for generating image captions. At each decoder time step (i.e. for each word), the decoder uses the attention model to focus on just the right part of the image.

### Transformer Architecture

Aka: why bother with RNNs or CNNs when all you need is attention mechanisms (plus embedding layers, dense layers, normalization layers, and a few other bits and pieces)? Plus, it is faster to train and easier to parallelize.


![Transformer Architecture](images/16.Transformer.png)

* Left: Encoder > It takes as input a batch of sentences represented as sequences of word IDs and it encodes each word into a 512-dimensional representation.   
* Right: Decoder > During training, it takes the target sentence as input shifted one time step to the right. It also receives outputs of the encoder. It outputs a probability for each possible next word, at each time step.  

So far, it looks like the model handles each word independently of all the others. How can it translate full sentences this way? Two components make this possible:

* **Multi-Head attention** layer to encode each word’s relationship with every other word  
* **Positional embeddings** are simply dense vectors (much like word embeddings) that represent the position of a word in the sentence 
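
The two components above can be illustrated with a heavily simplified, pure-Python sketch. It shows a single attention head with no learned projections (queries, keys and values are all the input vectors themselves), and the position vectors are made-up values rather than trained embeddings. All names are hypothetical; a real Multi-Head attention layer adds learned linear projections and several heads.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, vectors))
                        for j in range(d)])
    return outputs

def add_positions(word_vectors, position_vectors):
    """Add one position vector to each word vector, element-wise."""
    return [[w + p for w, p in zip(word, pos)]
            for word, pos in zip(word_vectors, position_vectors)]

toy_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]       # words 0 and 2 are identical
toy_positions = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]   # made-up position vectors

without_positions = self_attention(toy_words)
with_positions = self_attention(add_positions(toy_words, toy_positions))
```

Without position information, the two identical words get exactly the same output, since attention alone ignores word order. Once the position vectors are added, their outputs differ.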