
# Transformers Overview

This notebook provides an overview of Transformer models, their architecture, how they work, and a basic implementation using TensorFlow and Keras.



## Background

### Transformer Models

Transformers are a neural network architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. They have since become the foundation for many state-of-the-art models in natural language processing, including BERT, GPT, and T5.

### Key Components of Transformers

- **Self-Attention Mechanism**: Allows the model to weigh the importance of different words in a sequence relative to each other.
- **Positional Encoding**: Adds information about the order of words in a sequence, since the Transformer architecture doesn't inherently capture this.
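- **Multi-Head Attention**: Runs several attention operations in parallel, each with its own learned projections, so the model can attend to different parts of the input sequence simultaneously.
- **Feedforward Neural Network**: A position-wise network that processes the output of the attention mechanism independently at each position.
- **Layer Normalization**: Stabilizes and accelerates training by normalizing activations across the feature dimension around each sub-layer.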
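
To make the last two components concrete, here is a minimal NumPy sketch of a position-wise feedforward network wrapped in a residual connection and layer normalization. The function names, sizes, and random weights are illustrative assumptions, and the learned scale and shift parameters that real layer normalization carries are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each vector along the last axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feedforward network: Dense -> ReLU -> Dense."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
seq_len, embed_dim, ff_dim = 4, 8, 16  # illustrative sizes, not from the notebook
x = rng.normal(size=(seq_len, embed_dim))
w1 = rng.normal(size=(embed_dim, ff_dim)) * 0.1
b1 = np.zeros(ff_dim)
w2 = rng.normal(size=(ff_dim, embed_dim)) * 0.1
b2 = np.zeros(embed_dim)

# Residual connection followed by layer normalization
out = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
print(out.shape)  # (4, 8)
```

Each row of `out` has approximately zero mean and unit variance, which is the property that keeps activations in a stable range from layer to layer.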

### Applications of Transformers

Transformers are used in various NLP tasks, including machine translation, text summarization, sentiment analysis, and more. They are also being adapted for tasks in other domains such as vision and speech processing.



## Mathematical Foundation

### Self-Attention Mechanism

The self-attention mechanism calculates a weighted sum of the input values, where the weights are determined by the similarity between the input elements.

Given an input sequence \( X \), the self-attention mechanism involves three steps:

1. **Compute Query, Key, and Value Matrices**:

\[
Q = XW_Q, \quad K = XW_K, \quad V = XW_V
\]

2. **Compute Scaled Dot-Product Attention**:

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
\]

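Where \( d_k \) is the dimension of the key vectors. Dividing by \( \sqrt{d_k} \) keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.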

3. **Multi-Head Attention**:

\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W_O
\]

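Where each head \( \text{head}_i \) is a separate attention mechanism with its own learned projection matrices, and \( W_O \) projects the concatenated heads back to the model dimension.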
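
The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function names and random weights are hypothetical, there is no masking or batching, and the per-head projections are obtained by slicing one large projection matrix per role, which is one common way to implement them.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = k.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Split the projections into heads, attend per head, concatenate, project."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(m):
        # (n, d_model) -> (num_heads, n, d_head)
        return m.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads, _ = attention(q, k, v)                          # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1, ..., head_h)
    return concat @ w_o

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2  # illustrative sizes
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

single, weights = attention(x @ w_q, x @ w_k, x @ w_v)
multi = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(single.shape, multi.shape)  # (5, 8) (5, 8)
print(weights.sum(axis=-1))       # each row of attention weights sums to 1
```

Both variants return one output vector per input position. The multi-head version computes a separate set of attention weights in each lower-dimensional subspace before recombining them.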

### Positional Encoding

Positional encoding is added to the input embeddings to inject information about the position of each word in a sequence. The original paper uses fixed sinusoids of different frequencies:

\[
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
\]

\[
\text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d}}}\right)
\]

Where \( pos \) is the position in the sequence, \( i \) indexes the sine/cosine dimension pairs, and \( d \) is the embedding dimension.
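
As a quick check of the two formulas, the following NumPy sketch builds the full encoding matrix. The function name `positional_encoding` and the sizes are illustrative assumptions, and an even `d` is assumed so that the sine and cosine columns pair up.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Return a (max_len, d) array of sinusoidal position encodings (d must be even)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2), one index per sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d)  # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions 2i+1
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # [0. 1. 0. 1.] because sin(0) = 0 and cos(0) = 1
```

Low dimension indices oscillate quickly with position and high ones slowly, so each position receives a distinct pattern across the `d` dimensions.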



## Implementation in Python

We'll implement a simple Transformer using TensorFlow and Keras on a text classification task (e.g., sentiment analysis with the IMDB dataset).


In [None]:

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence

# Load and preprocess the IMDB dataset
max_features = 10000  # Number of words to consider as features
maxlen = 500  # Pad or truncate every review to this many tokens

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

# Define a simple Transformer block
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.ffn = models.Sequential([
            layers.Dense(ff_dim, activation="relu"),
            layers.Dense(embed_dim),
        ])
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

# Learned positional embedding layer (used here in place of the sinusoidal encoding described earlier)
class PositionalEncoding(layers.Layer):
    def __init__(self, maxlen, embed_dim):
        super(PositionalEncoding, self).__init__()
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

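    def call(self, x):
        # x has shape (batch, seq_len, embed_dim), so the sequence length is axis 1
        seq_len = tf.shape(x)[1]
        positions = tf.range(start=0, limit=seq_len, delta=1)
        positions = self.pos_emb(positions)  # (seq_len, embed_dim), broadcast over the batch
        return x + positions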

# Build the Transformer model
embed_dim = 32
num_heads = 2
ff_dim = 32
maxlen = 500

inputs = layers.Input(shape=(maxlen,))
embedding_layer = layers.Embedding(max_features, embed_dim)(inputs)
pos_encoding = PositionalEncoding(maxlen, embed_dim)(embedding_layer)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)(pos_encoding)
pooling = layers.GlobalAveragePooling1D()(transformer_block)
dropout = layers.Dropout(0.1)(pooling)
outputs = layers.Dense(1, activation="sigmoid")(dropout)

model = models.Model(inputs=inputs, outputs=outputs)

# Compile and train the model (the test set doubles as validation data here for brevity)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=64, validation_data=(x_test, y_test))

# Evaluate the model
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"Test accuracy: {test_acc}")
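
The preprocessing step in the cell above calls `pad_sequences` with its default settings, which pad and truncate at the start of each sequence, so long reviews keep their last `maxlen` tokens. The pure-Python sketch below mimics that behavior for a single sequence. `pad_pre` is a hypothetical helper written for illustration, not part of Keras, and it assumes `maxlen >= 1`.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq` and left-pad with `value`."""
    kept = list(seq)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

print(pad_pre([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

Because the padding value 0 is fed to the model like any other token index, the learned embeddings and the average pooling layer both see the padded positions. That is a simplification this basic implementation accepts.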



## Conclusion

This notebook provided an overview of Transformer models, their architecture, and a basic implementation using the IMDB dataset for sentiment analysis. Transformers have become the state-of-the-art approach for many NLP tasks, largely because self-attention relates every position in a sequence to every other position directly and can be computed in parallel across the sequence.
