## 11.4 The Transformer architecture
Transformers were introduced in the seminal paper “Attention Is All You Need”. The paper showed that a simple mechanism called “neural attention” could be used to build powerful sequence models that didn’t feature any recurrent layers or convolution layers.
### Understanding self-attention
Not all input information seen by a model is equally important to the task at hand, so models should “pay more attention” to some features and “pay less attention” to other features.
Does that sound familiar? You’ve already encountered a similar concept twice in this book:
- Max pooling in convnets looks at a pool of features in a spatial region and selects just one feature to keep. That’s an “all or nothing” form of attention: keep the most important feature and discard the rest.
- TF-IDF normalization assigns importance scores to tokens based on how much information different tokens are likely to carry. Important tokens get boosted while irrelevant tokens get faded out. That’s a continuous form of attention.


Word embeddings are vector spaces that capture the “shape” of the semantic relationships between different words. In an embedding space, a single word has a fixed position, i.e. a fixed set of relationships with every other word in the space. But that’s not quite how language works: the meaning of a word is usually context-specific.

Clearly, a smart embedding space would provide a different vector representation for a word depending on the other words surrounding it. That’s where **self-attention** comes in. The purpose of **self-attention** is to modulate the representation of a token by using the representations of related tokens in the sequence. 

![alt text](static/self-attention.png "self-attention")

For every word in the sentence, we'll do the following:
1. Compute relevancy scores between the vector for the word and the vector for every other word in the sentence. These are the attention scores. We use the dot product between two word vectors as a measure of the strength of their relationship.
2. Compute the sum of all word vectors in the sentence, weighted by our relevancy scores. Words closely related to the word will contribute more to the sum (including the word itself), while irrelevant words will contribute almost nothing.
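
The two steps above can be sketched in plain NumPy. This is a minimal illustration, not the Keras implementation: the random `sentence` array stands in for real word embeddings, and the scaling by the square root of the embedding size followed by a softmax is an assumed (though common) way of turning raw dot products into weights that sum to 1.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def self_attention(input_sequence):
    """input_sequence: array of shape (sequence_length, embed_dim)."""
    output = np.zeros_like(input_sequence)
    for i, pivot_vector in enumerate(input_sequence):
        # Step 1: dot product between the pivot word and every word (itself included).
        scores = np.array([np.dot(pivot_vector, vector) for vector in input_sequence])
        # Assumed normalization: scale, then softmax so the weights sum to 1.
        weights = softmax(scores / np.sqrt(input_sequence.shape[1]))
        # Step 2: weighted sum of all word vectors.
        output[i] = np.sum(weights[:, np.newaxis] * input_sequence, axis=0)
    return output

rng = np.random.default_rng(0)
sentence = rng.normal(size=(5, 8))  # 5 hypothetical words, 8-dimensional embeddings
contextual = self_attention(sentence)
print(contextual.shape)  # (5, 8)
```

Each output row has the same size as the input word vector, but it now mixes in information from the other words in proportion to their attention weights.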

A Transformer is a sequence-to-sequence model: it was designed to convert one sequence into another.

![alt text](static/simple-self-attention.png "self-attention")
This means “for each token in inputs (A), compute how much the token is related to every token in inputs (B), and use these scores to weight a sum of tokens from inputs (C).”

In the general case, you could be doing this with three different sequences. We’ll call them “query,” “keys,” and “values.” The operation becomes “for each element in the query, compute how much the element is related to every key, and use these scores to weight a sum of values.”
```
outputs = sum(values * pairwise_score(query, keys))
```

So, if you’re just doing sequence classification, then query, keys, and values are all the same: you’re comparing a sequence to itself, to enrich each token with context from the whole sequence.
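
As a rough, vectorized sketch of the query/keys/values formulation, the function below accepts three separate arrays. The names `target_tokens` and `source_tokens` and their sizes are made up for illustration, and the scaled softmax is an assumed choice for `pairwise_score`. Passing the same array three times recovers self-attention.

```python
import numpy as np

def attention(query, keys, values):
    """query: (q_len, dim); keys: (kv_len, dim); values: (kv_len, value_dim)."""
    # Relatedness of every query element to every key.
    scores = query @ keys.T / np.sqrt(keys.shape[-1])   # (q_len, kv_len)
    # Row-wise softmax (max subtracted for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the values for each query element.
    return weights @ values                              # (q_len, value_dim)

rng = np.random.default_rng(1)
target_tokens = rng.normal(size=(3, 4))  # hypothetical query sequence
source_tokens = rng.normal(size=(6, 4))  # hypothetical keys/values sequence

cross = attention(target_tokens, source_tokens, source_tokens)
self_attn = attention(source_tokens, source_tokens, source_tokens)
print(cross.shape, self_attn.shape)  # (3, 4) (6, 4)
```

The output always has one row per query element, regardless of how many keys and values there are.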

### Multi-head attention
Multi-head attention is an improvement to self-attention from the “Attention Is All You Need” paper. It works by splitting the attention process into several independent parts called heads, each of which learns its own set of features. Within each head, the query, key, and value are sent through their own dense projections, resulting in three separate sets of vectors that are then processed through attention. Finally, the outputs of all heads are concatenated into one final output.

![alt text](static/multi-head-attention.png "mh-attention")

Having independent heads helps the layer learn different groups of features for each token, where features within one group are correlated with each other but are mostly independent of features in a different group.
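
A minimal NumPy sketch of this idea follows. It is not how `layers.MultiHeadAttention` is implemented internally: the random matrices `w_q`, `w_k`, `w_v`, and `w_o` are hypothetical stand-ins for learned dense projections, biases are left out, and all sizes are arbitrary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """x: (seq_len, embed_dim); w_q, w_k, w_v: (num_heads, embed_dim, head_dim);
    w_o: (num_heads * head_dim, embed_dim)."""
    head_outputs = []
    for q_proj, k_proj, v_proj in zip(w_q, w_k, w_v):
        # Each head gets its own projected query, key, and value.
        q, k, v = x @ q_proj, x @ k_proj, x @ v_proj        # each (seq_len, head_dim)
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (seq_len, seq_len)
        head_outputs.append(weights @ v)
    # Concatenate the heads and mix them with a final projection.
    concatenated = np.concatenate(head_outputs, axis=-1)
    return concatenated @ w_o

rng = np.random.default_rng(0)
seq_len, embed_dim, num_heads, head_dim = 5, 8, 2, 4
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(num_heads, embed_dim, head_dim)) for _ in range(3))
w_o = rng.normal(size=(num_heads * head_dim, embed_dim))
out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)  # (5, 8)
```

Because every head has its own projections, each one can attend to a different kind of relationship between tokens before the results are merged.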



### The Transformer encoder
If adding extra dense projections is so useful, why don’t we also apply one or two to the output of the attention mechanism? Actually, that’s a great idea—let’s do that.
1. Let's add residual connections to make sure we don’t destroy any valuable information along the way.
2. Let's add normalization layers, which are supposed to help gradients flow better during backpropagation.

The first Dense layer in the dense projection (dense_proj) temporarily increases the dimensionality of the token vectors to dense_dim.

The second Dense layer projects the result back to embed_dim, so that the output of the dense projection matches its input dimension and the two can be added in the residual connection.

![alt text](static/transformers.png "mh-attention")
Figure: The TransformerEncoder chains a MultiHeadAttention layer with a dense projection and adds normalization as well as residual connections.
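
The dense projection, residual connections, and normalization can be sketched in NumPy as follows. This is only an illustration under stated assumptions: `attention_output` is a random stand-in for the real attention layer, the weights are random rather than learned, the `epsilon` value is an arbitrary choice, and the learnable scale and offset that a real layer-normalization layer has are omitted.

```python
import numpy as np

def layer_norm(x, epsilon=1e-3):
    # Normalize each token vector over its feature axis (the last axis),
    # independently of the other tokens and of other sequences.
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

def dense_projection(x, w1, b1, w2, b2):
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU layer, widens to dense_dim
    return hidden @ w2 + b2                # projects back to embed_dim

rng = np.random.default_rng(0)
embed_dim, dense_dim = 8, 16
tokens = rng.normal(size=(5, embed_dim))
attention_output = rng.normal(size=(5, embed_dim))  # stand-in for attention
w1, b1 = rng.normal(size=(embed_dim, dense_dim)), np.zeros(dense_dim)
w2, b2 = rng.normal(size=(dense_dim, embed_dim)), np.zeros(embed_dim)

# Residual connection around attention, then normalization.
proj_input = layer_norm(tokens + attention_output)
# Residual connection around the dense projection, then normalization.
encoder_output = layer_norm(proj_input + dense_projection(proj_input, w1, b1, w2, b2))
print(encoder_output.shape)  # (5, 8)
```

The residual additions only work because the second dense layer restores the original vector size.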

The original Transformer architecture consists of two parts:
1. A Transformer encoder that processes the source sequence.
2. A Transformer decoder that uses the encoded source sequence to generate a translated version.

The encoder part can be used for text classification. It’s a very generic module that ingests a sequence and learns to turn it into a more useful representation.


In [8]:
# imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt

In [9]:
batch_size = 32
train_ds = keras.utils.text_dataset_from_directory(
    "data/aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "data/aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "data/aclImdb/test", batch_size=batch_size
)
text_vectorization = layers.TextVectorization(output_mode="int")
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y), num_parallel_calls=4)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [37]:
# implementing the previous architecture
# trying the transformer encoder on the movie review sentiment classification task.

# embed dim: It is used as the key_dim for the MultiHeadAttention layer, which determines the dimension of the query, key, and value vectors in the attention mechanism. It is also the output dimension of the dense_proj layer, ensuring that the output of the dense projection matches the input dimension for residual connections.

# dense dim: It is used in the first Dense layer of the dense_proj sequential model to increase the dimensionality of the input vectors temporarily. The second Dense layer projects this higher-dimensional representation back to the embed_dim.
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim  # Size of the input token vectors
        self.dense_dim = dense_dim  # Size of the inner dense layer
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)

        self.dense_proj = keras.Sequential([
            layers.Dense(dense_dim, activation="relu"),
            layers.Dense(embed_dim)
        ])

        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        # BatchNormalization doesn't work well for sequence data. Instead, we’re using the LayerNormalization layer, which normalizes each sequence independently of other sequences in the batch

    def call(self, inputs, mask=None):
        # The mask generated by the Embedding layer is 2D, but the attention layer expects it to be 3D or 4D, so we expand its rank.
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):  # Serialization to save the model
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

# When you write custom layers, make sure to implement the get_config method: this enables the layer to be reinstantiated from its config dict, which is useful during model saving and loading. The method should return a Python dict that contains the values of the constructor arguments used to create the layer.

In [33]:
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_13 (InputLayer)       [(None, None)]            0         
                                                                 
 embedding_15 (Embedding)    (None, None, 256)         5120000   
                                                                 
 transformer_encoder_12 (Tr  (None, None, 256)         543776    
 ansformerEncoder)                                               
                                                                 
 global_max_pooling1d_6 (Gl  (None, 256)               0         
 obalMaxPooling1D)                                               
                                                                 
 dropout_6 (Dropout)         (None, 256)               0         
                                                                 
 dense_30 (Dense)            (None, 1)                 257 

In [34]:
callbacks = [
    keras.callbacks.ModelCheckpoint("models/transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=1,
          callbacks=callbacks)

# stopped this cell since it's very expensive and takes a lot of time

  1/625 [..............................] - ETA: 2:19:48 - loss: 1.6602 - accuracy: 0.4375

KeyboardInterrupt: 

In [6]:
model = keras.models.load_model("models/transformer_encoder.keras",
                                custom_objects={"TransformerEncoder": TransformerEncoder})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.842


The Transformer encoder you just saw in action wasn't a sequence model at all.

You could change the order of the tokens in a sequence, and you’d get the exact same pairwise attention scores and the exact same context-aware representations. If you were to completely scramble the words in every movie review, the model wouldn't notice, and you’d still get the exact same accuracy.

Self-attention is a set-processing mechanism, focused on the relationships between pairs of sequence elements. It’s blind to whether these elements occur at the beginning, at the end, or in the middle of a sequence.
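
A small NumPy experiment can make this order-blindness concrete. It is a sketch with random vectors standing in for token embeddings and without any learned projections: shuffling the tokens simply shuffles the outputs in the same way, so pooling over the token axis gives an identical result.

```python
import numpy as np

def self_attention(x):
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))  # 6 hypothetical tokens
order = rng.permutation(6)        # a random reordering of the tokens

original = self_attention(tokens)
shuffled = self_attention(tokens[order])

# Shuffling the inputs just shuffles the outputs the same way...
print(np.allclose(shuffled, original[order]))  # True
# ...so pooling over the token axis cannot tell the difference.
print(np.allclose(shuffled.max(axis=0), original.max(axis=0)))  # True
```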

![alt text](static/nlp_compare.png "nlp compare")



### Using positional encoding to re-inject order information
The Transformer is a hybrid approach: it is technically order-agnostic, but it manually injects order information into the representations it processes.
This is the missing ingredient! It’s called **positional encoding.**

We're going to add the word's position in the sentence to each word embedding. 
Our input word embeddings will have two components:
- A word vector: represents the word independently of any specific context.
- A position vector: represents the position of the word in the current sentence.

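The simplest scheme would be to use the raw integer position of each word, but neural networks don’t like very large input values or discrete input distributions!
Two solutions are used:
1. Use sine and cosine functions of different frequencies to encode each position as a vector of values in the range [-1, 1] (used in the “Attention Is All You Need” paper)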
2. Learn position-embedding vectors the same way we learn to embed word indices. Then, proceed to add our position embeddings to the corresponding word embeddings (used in the book)
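
For reference, the first option can be sketched in NumPy. This follows the sinusoidal formula from the original paper, where even feature indices use a sine and odd indices use a cosine of `position / 10000^(2i / dim)`. The function name `sinusoidal_positions` is made up for this example, and the sketch assumes `dim` is even.

```python
import numpy as np

def sinusoidal_positions(sequence_length, dim):
    positions = np.arange(sequence_length)[:, np.newaxis]  # (seq_len, 1)
    even_dims = np.arange(0, dim, 2)[np.newaxis, :]        # (1, dim / 2)
    # One frequency per pair of feature dimensions.
    angles = positions / np.power(10000.0, even_dims / dim)
    encoding = np.zeros((sequence_length, dim))
    encoding[:, 0::2] = np.sin(angles)  # even feature indices
    encoding[:, 1::2] = np.cos(angles)  # odd feature indices
    return encoding

pe = sinusoidal_positions(sequence_length=600, dim=256)
print(pe.shape)  # (600, 256)
print(pe.min() >= -1.0 and pe.max() <= 1.0)  # True
```

These vectors are fixed rather than learned, and they would be added to the word embeddings just like the learned position embeddings used in the rest of this section.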

In [38]:
# A reminder: 
# The input dim is the size of the vocabulary, i.e., the total number of unique tokens (words) in the input data.
# The output dim is the representation power of each word embedding.
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        # original word embeddings
        self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
        # new position embeddings
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        # get the length of the input
        length = tf.shape(inputs)[-1]
        # tensor of positions
        positions = tf.range(start=0, limit=length, delta=1)
        # get the token embeddings
        embedded_tokens = self.token_embeddings(inputs)
        # get the position embeddings
        embedded_positions = self.position_embeddings(positions)
        # add both
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        # this will be called automatically by the framework to mask zero padded input
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config

In [36]:
# Putting it all together
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("models/full_transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=1, callbacks=callbacks)
model = keras.models.load_model(
    "models/full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

# stopped it because it takes a lot of time.

NameError: name 'TransformerEncoder' is not defined

## When to use Sequence Models over Bag of Words?
You may sometimes hear that bag-of-words methods are outdated, and that Transformer-based sequence models are the way to go, no matter what task or dataset you’re looking at. This is definitely not the case: a small stack of Dense layers on top of a bag-of-bigrams remains a perfectly valid and relevant approach in many cases.

In 2017, my team and I (the author) ran a systematic analysis of the performance of various text-classification techniques across many different types of text datasets, and we discovered a remarkable and surprising rule of thumb for deciding whether to go with a bag-of-words model or a sequence model.

It turns out that when approaching a new text-classification task, you should pay close attention to **the ratio between the number of samples in your training data and the mean number of words per sample**:
- If that ratio is small (less than 1,500), then the bag-of-bigrams model will perform better.
- If that ratio is higher than 1,500, then you should go with a sequence model.

In other words, **sequence models work best when lots of training data is available and when each sample is relatively short**.
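
The rule of thumb is easy to turn into a small helper. The function name and the dataset sizes below are hypothetical and only serve to show the arithmetic. The text does not say what to do at a ratio of exactly 1,500, so this sketch arbitrarily falls back to the bag-of-bigrams model in that case.

```python
def recommend_text_model(num_samples, mean_words_per_sample, threshold=1500):
    # Ratio of training-set size to mean sample length.
    ratio = num_samples / mean_words_per_sample
    model = "sequence model" if ratio > threshold else "bag-of-bigrams"
    return ratio, model

# Hypothetical dataset: few, long documents.
print(recommend_text_model(20_000, 200))   # (100.0, 'bag-of-bigrams')
# Hypothetical dataset: many, short documents.
print(recommend_text_model(500_000, 40))   # (12500.0, 'sequence model')
```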

![alt text](static/bow-vs-sequence.png "bag of words vs sequence model")

Now, keep in mind that this heuristic rule was developed specifically for text classification. It may not necessarily hold for other NLP tasks. When it comes to machine translation, for instance, the Transformer shines especially for very long sequences, compared to RNNs.

## Beyond text classification: Sequence-to-sequence learning
A sequence-to-sequence model takes a sequence as input (often a sentence or paragraph) and translates it into a different sequence. This is the task at the heart of many of the most successful applications of NLP:
- Machine translation: Convert a paragraph in a source language to its equivalent in a target language.
- Text summarization: Convert a long document to a shorter version that retains the most important information.
- Question answering: Convert an input question into its answer.
- Chatbots: Convert a dialogue prompt into a reply to this prompt, or convert the history of a conversation into the next reply in the conversation.
- Text generation: Convert a text prompt into a paragraph that completes the prompt.

The general template behind sequence-to-sequence models is described in the next figure. During training:
- An encoder model turns the source sequence into an intermediate representation.
- A decoder is trained to predict the next token i in the target sequence by looking at both previous tokens (0 to i - 1) and the encoded source sequence.
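
The training setup can be illustrated with plain Python lists. The Spanish sentence is a made-up example: the decoder input is the target sentence without its last token, and the prediction target is the same sentence shifted one step ahead.

```python
target_sentence = ["[start]", "el", "clima", "es", "agradable", "[end]"]

decoder_inputs = target_sentence[:-1]   # what the decoder sees
decoder_targets = target_sentence[1:]   # what it must predict at each step

for i, target in enumerate(decoder_targets):
    # At step i, the decoder looks at tokens 0..i and predicts token i + 1.
    print(f"sees {decoder_inputs[: i + 1]} -> predicts {target!r}")
```

Both lists have the same length, which is what lets a single forward pass produce one prediction per position.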

![alt text](static/seq2seq.png "sequence to sequence")

During inference, we don’t have access to the target sequence. We’re trying to predict it from scratch. We’ll have to generate it one token at a time:
- We obtain the encoded source sequence from the encoder.
- The decoder starts by looking at the encoded source sequence as well as an initial “seed” token (such as the string "[start]"), and uses them to predict the first real token in the sequence.
- The predicted sequence so far is fed back into the decoder, which generates the next token, and so on, until it generates a stop token (such as the string "[end]").
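
The inference loop above can be sketched independently of any particular model. In this illustration, `greedy_decode` and `toy_decoder` are hypothetical names, and `toy_decoder` is a stand-in that replays a canned translation instead of running a trained decoder.

```python
def greedy_decode(encoded_source, predict_next_token, max_length=20):
    decoded = ["[start]"]  # the initial seed token
    for _ in range(max_length):
        # Feed the sequence generated so far back into the decoder.
        next_token = predict_next_token(encoded_source, decoded)
        decoded.append(next_token)
        if next_token == "[end]":  # stop token reached
            break
    return decoded

# Hypothetical stand-in for a trained decoder: it replays a fixed translation.
def toy_decoder(encoded_source, decoded_so_far):
    canned = ["hola", "mundo", "[end]"]
    return canned[len(decoded_so_far) - 1]

print(greedy_decode("hello world", toy_decoder))
# ['[start]', 'hola', 'mundo', '[end]']
```

The `max_length` cap matters in practice because a real model is not guaranteed to ever emit the stop token.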




In [1]:
# We'll do machine translation.
# We'll be working with an English-to-Spanish translation dataset available at www.manythings.org/anki/. 
# Each line holds an English sentence and its Spanish translation, separated by a tab.
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip -P ./data
!unzip -q ./data/spa-eng.zip -d ./data

--2024-07-28 17:14:19--  http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 2607:f8b0:4023:1009::cf, 2607:f8b0:4023:1004::cf, 2607:f8b0:4023:1002::cf, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|2607:f8b0:4023:1009::cf|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘./data/spa-eng.zip’


2024-07-28 17:14:19 (7.48 MB/s) - ‘./data/spa-eng.zip’ saved [2638744/2638744]



In [5]:
text_file = "data/spa-eng/spa.txt"
with open(text_file) as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
for line in lines:
    english, spanish = line.split("\t")
    spanish = "[start] " + spanish + " [end]"
    text_pairs.append((english, spanish))
# the file contains words and sentences
print(text_pairs[:10])
print(text_pairs[-10:])

[('Go.', '[start] Ve. [end]'), ('Go.', '[start] Vete. [end]'), ('Go.', '[start] Vaya. [end]'), ('Go.', '[start] Váyase. [end]'), ('Hi.', '[start] Hola. [end]'), ('Run!', '[start] ¡Corre! [end]'), ('Run.', '[start] Corred. [end]'), ('Who?', '[start] ¿Quién? [end]'), ('Fire!', '[start] ¡Fuego! [end]'), ('Fire!', '[start] ¡Incendio! [end]')]
[("You can't view Flash content on an iPad. However, you can easily email yourself the URLs of these web pages and view that content on your regular computer when you get home.", "[start] No puedes ver contenido en Flash en un iPad. Sin embargo, puedes fácilmente enviarte por correo electrónico las URL's de esas páginas web y ver el contenido en tu computadora cuando llegas a casa. [end]"), ('A mistake young people often make is to start learning too many languages at the same time, as they underestimate the difficulties and overestimate their own ability to learn them.', '[start] Un error que cometen a menudo los jóvenes es el de comenzar a aprender 

In [6]:
import random

# shuffle the data and split it
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]

Next, let’s prepare two separate TextVectorization layers: one for English and one for Spanish. We’re going to need to customize the way strings are preprocessed:
- We need to preserve the "[start]" and "[end]" tokens that we’ve inserted. By default, the characters \[ and \] would be stripped, but we want to keep them around so we can tell apart the word “start” and the start token "[start]".
- Punctuation is different from language to language! In the Spanish TextVectorization layer, if we’re going to strip punctuation characters, we need to also strip the character ¿.

Note that for a non-toy translation model, we would treat punctuation characters as separate tokens rather than stripping them, since we would want to be able to generate correctly punctuated sentences. In our case, for simplicity, we’ll get rid of all punctuation.
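
To see what this preprocessing does to a single string, here is a plain-Python equivalent using the standard `re` module. It is only a sketch for checking behavior by hand, and `standardize` is a made-up helper name. The TextVectorization layer needs a function built from TensorFlow string ops instead.

```python
import re
import string

# ASCII punctuation plus "¿", but keeping "[" and "]" for the special tokens.
strip_chars = (string.punctuation + "¿").replace("[", "").replace("]", "")
strip_pattern = re.compile(f"[{re.escape(strip_chars)}]")

def standardize(text):
    return strip_pattern.sub("", text.lower())

print(standardize("[start] ¿Quién? [end]"))  # [start] quién [end]
print(standardize("[start] ¡Corre! [end]"))  # [start] ¡corre [end]
```

Note that `string.punctuation` only contains ASCII characters, so with this set of characters "¡" survives. It would have to be added explicitly, the same way "¿" is, if it should be stripped too.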

In [9]:
import re
import string

import tensorflow as tf
from tensorflow.keras import layers

strip_chars = string.punctuation + "¿"  # add the flipped ? to the chars to be stripped
strip_chars = strip_chars.replace("[", "")  # remove [ from the chars to be stripped
strip_chars = strip_chars.replace("]", "")  # remove ] from the chars to be stripped


def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")


# We’ll only look at the top 15,000 words in each language, and we’ll restrict sentences to 20 words.
vocab_size = 15000
sequence_length = 20

# English Vectorization
source_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Spanish Vectorization
target_vectorization = layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    # Why one extra token? The target sequence fed to the decoder during training is the ground truth sentence shifted by one position. This helps the model learn to predict the next token in the sequence. 
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
train_english_texts = [pair[0] for pair in train_pairs]  # pair[0] is the English sentence
train_spanish_texts = [pair[1] for pair in train_pairs]  # pair[1] is the Spanish sentence
source_vectorization.adapt(train_english_texts)  # adapt the english text
target_vectorization.adapt(train_spanish_texts)  # adapt the spanish text



In [10]:
# Finally, we can turn our data into a tf.data pipeline. We want it to return a tuple (inputs, target) where inputs is a dict with two keys, "english" (the English sentence) and "spanish" (the Spanish sentence), and target is the Spanish sentence offset by one step ahead.
batch_size = 64


def format_dataset(eng, spa):
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)
    # The input Spanish sentence doesn't include the last token to keep inputs and targets at the same length.
    # The target Spanish sentence is one step ahead. Both are still the same length (20 words).
    return ({
                "english": eng,
                "spanish": spa[:, :-1],
            }, spa[:, 1:])


def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=4)
    # prefetch(16): This means that the data pipeline will keep 16 batches in the buffer ready to be used. 
    return dataset.shuffle(2048).prefetch(16).cache()


train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)

In [16]:
# check the data
for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")
# The output shows that each item in the inputs has shape (64, 20): a batch of 64 sequences, each of length 20 (the sequence length).

inputs['english'].shape: (64, 20)
inputs['spanish'].shape: (64, 20)
targets.shape: (64, 20)


2024-07-28 18:59:31.058922: W tensorflow/core/kernels/data/cache_dataset_ops.cc:858] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset  will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.


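The data is now ready.
Let's build some models.
## Sequence-to-sequence learning with RNNs
The simplest, naive way to use an RNN to turn one sequence into another is to keep the output of the RNN at each time step and predict a target token from it: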

In [None]:
inputs = keras.Input(shape=(sequence_length,))
x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
x = layers.LSTM(16, return_sequences=True)(x)
outputs = layers.Dense(vocab_size, activation='softmax')(x)
model = keras.Model(inputs=inputs, outputs=outputs)

There are two flaws with this approach:
1. The target sequence must always be the same length as the source sequence. In practice, this is rarely the case. Technically, this isn’t critical, as you could always pad either the source sequence or the target sequence to make their lengths match.
2. Due to the step-by-step nature of RNNs, the model will only be looking at tokens 0...N in the source sequence in order to predict token N in the target sequence. This constraint makes this setup unsuitable for most tasks, and particularly translation. Consider translating “The weather is nice today” to French—that would be “Il fait beau aujourd’hui.” You’d need to be able to predict “Il” from just “The,” “Il fait” from just “The weather,” etc., which is simply impossible.

If you’re a human translator, you’d start by reading the entire source sentence before starting to translate it. This is especially important if you’re dealing with languages that have wildly different word ordering, like English and Japanese. And that’s exactly what standard sequence-to-sequence models do.

In a proper sequence-to-sequence setup, you would first use an RNN **(the encoder)** to turn the source sequence into a single vector (or set of vectors). This could be the last output of the RNN, or alternatively, its final internal state vectors. Then you would use this vector (or vectors) as the initial state of another RNN **(the decoder)**, which would look at elements 0...N in the target sequence, and try to predict step N+1 in the target sequence.

![alt text](static/seq2seq_model.png "sequence2seq model")

The following figure shows this in more detail:

<img src="static/seq2seq_detailed.png" width="800">

Notice that in the RNN-based sequence-to-sequence (Seq2Seq) models, the input sequence must be fully processed by the encoder before it can be sent to the decoder.


In [28]:
# Defining the Encoder 
embed_dim = 256
# The latent_dim parameter in the Bidirectional GRU layer defines the size of the output space of the GRU layer, which is essentially the dimensionality of the intermediate representation produced by the GRU units. This output will be fed to the decoder.
latent_dim = 1024

english_input = keras.Input(shape=(sequence_length,), name='english')
x = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim, mask_zero=True)(english_input)
# merge_mode in the Bidirectional layer specifies how to combine the outputs from the forward and backward GRU layers.
# Here, the outputs are summed. Other options for merge_mode include "concat" (concatenates outputs from both directions) and "ave" (averages the outputs).
encoded_english = layers.Bidirectional(layers.GRU(latent_dim), merge_mode="sum")(x)

In [29]:
# Defining the Decoder 
# In training, we use the true previous tokens to predict the next token in the sequence. This is called "teacher forcing."
# spanish_input is the true sequence of tokens from the Spanish sentence.
spanish_input = keras.Input(shape=(sequence_length,), name='spanish')
# The embedding layer converts each token into a dense vector representation.
x = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim, mask_zero=True)(spanish_input)

# The GRU processes the entire sequence at once, using encoded_english as the initial state.
# The GRU layer outputs a sequence of hidden states, one for each time step.
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
x = decoder_gru(x, initial_state=encoded_english)
x = layers.Dropout(0.5)(x)
spanish_output = layers.Dense(vocab_size, activation="softmax")(x)

seq2seq_rnn = keras.Model([english_input, spanish_input], spanish_output)

In [30]:
seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=1, validation_data=val_ds)

 121/1302 [=>............................] - ETA: 9:39 - loss: 5.5182 - accuracy: 0.2530

KeyboardInterrupt: 

If you work on a real-world machine translation system, you will likely use “BLEU scores” to evaluate your models—a metric that looks at entire generated sequences and that seems to correlate well with human perception of translation quality.
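
To give a feel for what BLEU measures, the sketch below computes only one of its ingredients: the modified (clipped) n-gram precision against a single reference. It is deliberately not a full BLEU implementation, since BLEU combines these precisions for several n-gram sizes and applies a brevity penalty. The function name and the example sentences are illustrative.

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    candidate_counts = ngrams(candidate)
    reference_counts = ngrams(reference)
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, reference_counts[gram])
                  for gram, count in candidate_counts.items())
    total = sum(candidate_counts.values())
    return clipped / total if total else 0.0

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(modified_precision(candidate, reference))  # 0.2857142857142857
```

Without clipping, this degenerate candidate would score a perfect unigram precision. With clipping, only two of its seven words are credited.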

The RNN approach to sequence-to-sequence learning has a few fundamental limitations:
- The source sequence representation has to be held entirely in the encoder state vector(s), which puts significant limitations on the size and complexity of the sentences you can translate. It’s a bit as if a human were translating a sentence entirely from memory, without looking twice at the source sentence while producing the translation.
- RNNs have trouble dealing with very long sequences, since they tend to progressively forget about the past—by the time you’ve reached the 100th token in either sequence, little information remains about the start of the sequence. That means RNN-based models can’t hold onto long-term context, which can be essential for translating long documents.


In [24]:
# Using the trained model for inference
import random

import numpy as np

spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        # target_vectorization emits 21 tokens, but the decoder input expects 20,
        # so drop the last one. Without this slice, predict() fails with:
        # ValueError: Input 1 of layer "model_1" is incompatible with the layer:
        # expected shape=(None, 20), found shape=(None, 21)
        tokenized_target_sentence = target_vectorization([decoded_sentence])[:, :-1]
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        # Greedy decoding: pick the most likely token at position i.
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence


test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))


## Sequence-to-sequence learning with Transformer
Sequence-to-sequence learning is the task where Transformer really shines. Neural attention enables Transformer models to successfully process sequences that are considerably longer and more complex than those RNNs can handle.

As a human translating English to Spanish, you’re not going to read the English sentence one word at a time, keep its meaning in memory, and then generate the Spanish sentence one word at a time. That may work for a five-word sentence, but it’s unlikely to work for an entire paragraph. Instead, you’ll probably want to go back and forth between the source sentence and your translation in progress, and **pay attention** to different words in the source as you’re writing down different parts of your translation.

In a sequence-to-sequence Transformer, the Transformer encoder naturally plays the role of the encoder: it reads the source sequence and produces an encoded representation of it. Unlike our previous RNN encoder, which compresses the entire input sequence into a single context vector, the Transformer encoder keeps the encoded representation in sequence format: a sequence of context-aware embedding vectors.

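The second half of the model is the **Transformer decoder**. Just like the RNN decoder, it reads tokens 0...N in the target sequence and tries to predict token N+1, using neural attention to decide which parts of the encoded source sequence are most relevant to the token being predicted.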
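A minimal sketch of how this next-token setup is usually expressed as data: the decoder input is the target sequence without its last token, and the prediction target is the same sequence shifted one step ahead. The token list below is a toy example, not taken from the dataset.

```python
# Toy target sentence, wrapped in the [start]/[end] markers
# that the decoding code in this chapter relies on.
target_tokens = ["[start]", "el", "gato", "duerme", "[end]"]

# Decoder input: tokens 0...N (everything except the last token).
decoder_input = target_tokens[:-1]
# Prediction target: tokens 1...N+1 (the same sequence shifted by one).
decoder_target = target_tokens[1:]

for step, expected in enumerate(decoder_target):
    # At each step the decoder may only use the tokens seen so far.
    print(step, decoder_input[:step + 1], "->", expected)
```

Each printed row pairs the prefix the decoder is allowed to see with the token it should predict next.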
 
In a Transformer decoder:
- **The target sequence serves as the attention "query".**
- **The source sequence plays the roles of both keys and values.**

That is, the queries come from the decoder, while the keys and values come from the encoder's output.
<img src="static/seq2seq_with_transformer.png">
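As a rough illustration of this query/key/value split, here is a single-head, NumPy-only sketch of scaled dot-product attention. It deliberately leaves out the learned projections and multiple heads that `layers.MultiHeadAttention` applies. The function name `cross_attention` and the toy shapes are assumptions made for this example.

```python
import numpy as np


def cross_attention(decoder_states, encoder_outputs):
    """Single-head scaled dot-product attention, no learned projections.

    decoder_states:  (target_len, dim) -> used as queries
    encoder_outputs: (source_len, dim) -> used as keys and values
    """
    dim = decoder_states.shape[-1]
    # One score per (target position, source position) pair.
    scores = decoder_states @ encoder_outputs.T / np.sqrt(dim)
    # Softmax over the source axis so that each target row sums to 1.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the encoder vectors.
    return weights @ encoder_outputs, weights


rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(6, 4))  # 6 source tokens, dim 4
decoder_states = rng.normal(size=(3, 4))   # 3 target tokens, dim 4
context, weights = cross_attention(decoder_states, encoder_outputs)
print(context.shape)  # one context vector per target token
print(weights.shape)  # one weight per (target, source) pair
```

Note that the output has one row per target token, while the attention weights span the source tokens: this is how each target position gets to "look back" at the whole source sentence.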

In [33]:
class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim), ]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        # This attribute ensures that the layer will propagate its input mask to its outputs; masking in Keras is explicitly opt-in.
        # If you pass a mask to a layer that doesn't implement compute_mask() and that doesn't expose this supports_masking attribute, that’s an error.
        self.supports_masking = True

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
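        causal_mask = self.get_causal_attention_mask(inputs)
        # Default to no mask so that padding_mask is always defined,
        # even when the layer is called without an input mask.
        padding_mask = None
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)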
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

### Causal masking
Causal masking (sometimes loosely called "causal padding") is a technique used in sequence-to-sequence models, particularly in Transformer decoders, to ensure that the model does not use information from future tokens when predicting the next token in a sequence.

### Why causal masking is needed
In Transformer models, the self-attention mechanism lets each token in a sequence attend to every other token, including tokens that come later. That is useful when the whole sequence is available, but it is a problem for autoregressive generation tasks such as language modeling and translation: during training the decoder sees the full target sequence, so without a mask it could simply copy token N+1 from its input instead of learning to predict it.

### How causal masking works
Causal masking is implemented by masking out future positions in the attention mechanism: position i is only allowed to attend to positions 0...i. When predicting the next token in a sequence, the model can therefore only attend to the tokens that have already been generated, not the ones that come after.
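The mechanics can be sketched in NumPy as follows: build a lower-triangular mask using the same `i >= j` comparison as `get_causal_attention_mask`, then push the masked-out scores to a large negative value before the softmax so that they receive zero weight. The helper names `causal_mask` and `masked_softmax` and the toy scores are assumptions for illustration, not how Keras implements it internally.

```python
import numpy as np


def causal_mask(sequence_length):
    """mask[i, j] is 1 when position i may attend to position j (j <= i)."""
    i = np.arange(sequence_length)[:, np.newaxis]
    j = np.arange(sequence_length)
    return (i >= j).astype("int32")


def masked_softmax(scores, mask):
    """Softmax over the last axis, ignoring positions where mask == 0."""
    scores = np.where(mask == 1, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


mask = causal_mask(4)
print(mask)
# Uniform toy scores: every allowed position gets an equal share of attention.
weights = masked_softmax(np.zeros((4, 4)), mask)
print(weights.round(2))
```

The first row attends only to itself, the second row splits its attention over the first two positions, and so on. Everything above the diagonal stays at zero.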

In [42]:
embed_dim = 256
dense_dim = 2048
num_heads = 8
sequence_length = 20
vocab_size = 15000

encoder_inputs = keras.Input(shape=(None,), name="english")
x = PositionalEmbedding(sequence_length, input_dim=vocab_size, output_dim=embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

decoder_inputs = keras.Input(shape=(None,), name="spanish")
x = PositionalEmbedding(sequence_length, input_dim=vocab_size, output_dim=embed_dim)(decoder_inputs)
# Reassign x so that the dropout and softmax layers consume the decoder output.
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
x = layers.Dropout(0.5)(x)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)
transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [43]:
transformer.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
transformer.fit(train_ds, epochs=1, validation_data=val_ds)



<keras.src.callbacks.History at 0x3ef8a9910>

In [None]:
# The decode_sequence function defined earlier can be reused for inference
# by calling transformer.predict in place of seq2seq_rnn.predict.

## Mother of Summaries
- There are two kinds of NLP models: bag-of-words models that process sets of words or N-grams without taking into account their order, and sequence models that process word order. A bag-of-words model is made of Dense layers, while a sequence model could be an RNN, a 1D convnet, or a Transformer.
- When it comes to text classification, the ratio between the number of samples in your training data and the mean number of words per sample can help you determine whether you should use a bag-of-words model or a sequence model.
- Word embeddings are vector spaces where semantic relationships between words are modeled as distance relationships between vectors that represent those words.
- Sequence-to-sequence learning is a generic, powerful learning framework that can be applied to solve many NLP problems, including machine translation. A sequence-to-sequence model is made of an encoder, which processes a source sequence, and a decoder, which tries to predict future tokens in the target sequence by looking at past tokens, with the help of the encoder-processed source sequence.
- Neural attention is a way to create context-aware word representations. It’s the basis for the Transformer architecture.
- The Transformer architecture, which consists of a TransformerEncoder and a TransformerDecoder, yields excellent results on sequence-to-sequence tasks. The first half, the TransformerEncoder, can also be used for text classification or any sort of single-input NLP task.
