### Part 1: Introduction

#### Exploring Transformers with Keras

Welcome to this Jupyter Notebook, in which we explore the Transformer architecture using Keras. It is written for the Artificial Intelligence course at ESIR – Université Rennes and focuses on the practical side of implementing and using Transformers for natural language processing (NLP).

#### Background

Since their introduction in the 2017 paper “Attention Is All You Need” by Vaswani et al., Transformers have reshaped the field of NLP. Self-attention lets them process all positions of a sequence in parallel and capture long-range dependencies, and Transformer-based models have set new benchmarks across a wide range of NLP tasks.

In this notebook, we will:

- Implement a Transformer block as a Keras layer.
- Use this Transformer for text classification.
- Compare its performance with a traditional LSTM-based approach.
- Experiment with different hyperparameters to understand their impact.

#### Dataset

We will work with the Ohsumed dataset, a collection of medical abstracts that makes for a challenging text classification task. Unlike the IMDB dataset typically used in Transformer examples, it poses a more domain-specific problem.




### Part 2: Setup and Imports

First, let's set up our environment by importing the necessary libraries. We'll need TensorFlow and its high-level API, Keras, for building and training our models. The layers submodule from Keras provides us with the necessary layers to construct our Transformer model.



In [1]:

# Importing necessary libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers




TensorFlow will serve as our backend for matrix operations and neural network functionalities, while Keras will offer a user-friendly interface to build and train our models.




### Part 3: Transformer Block Implementation

#### Implementing the Transformer Block

The core of the Transformer model is its encoder block, which we implement as a custom Keras layer. The `TransformerBlock` class combines self-attention, a feed-forward network, residual connections, layer normalization, and dropout.


In [2]:
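class TransformerBlock(layers.Layer):
    """Single Transformer encoder block: self-attention followed by a feed-forward network."""

    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1, **kwargs):
        super().__init__(**kwargs)
        # Multi-head self-attention over the token sequence
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        # Position-wise feed-forward network; projects back to embed_dim for the residual sum
        self.ffn = keras.Sequential(
            [
                layers.Dense(ff_dim, activation="relu"),
                layers.Dense(embed_dim),
            ]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # `training` defaults to None so Keras can decide whether dropout is active
        # Self-attention: the same sequence is used as query, key and value
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        # Residual connection followed by layer normalization
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)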


#### Explanation

This class is a representation of a single Transformer block, which consists of:

1. **Multi-Head Attention Mechanism**: This allows the model to jointly attend to information from different representation subspaces at different positions.

2. **Feed Forward Network**: A simple fully connected neural network applied to each position separately and identically.

3. **Layer Normalization and Dropout**: Layer normalization stabilizes training, while dropout regularizes the block.

4. **Residual Connections**: Encourage gradient flow through the network.

This block can process a batch of sequences in parallel, making it highly efficient for NLP tasks.
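
To make the attention step less abstract, here is a minimal NumPy sketch of scaled dot-product attention, the operation each attention head performs. It is an illustration only, not how `layers.MultiHeadAttention` is implemented: the function names are made up for this example, there is a single head, and the learned query/key/value projections are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k, v: arrays of shape (seq_len, d_k). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights         # weighted average of the value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))             # toy sequence: 4 tokens, 8 features each

# Self-attention: queries, keys and values all come from the same sequence.
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)         # (4, 8) (4, 4)
```

Each row of `weights` tells how much one token attends to every other token, which is how a token can draw on information from distant positions in a single step.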



### Part 4: Data Preparation

#### Loading the Ohsumed Dataset

The Ohsumed dataset is a comprehensive collection of medical abstracts, making it an excellent choice for our text classification task. To load the dataset, we use TensorFlow Datasets (TFDS), which simplifies the process of downloading and preparing datasets.



In [5]:
import tensorflow_datasets as tfds

# Load the Ohsumed dataset
ds = tfds.load('huggingface:ohsumed/ohsumed')




#### Dataset Overview

The Ohsumed dataset contains 348,566 references from MEDLINE, covering titles, abstracts, and other metadata from medical journals over five years (1987-1991). The dataset is split into 'train' (54,709 examples) and 'test' (293,855 examples) sets. The key features we will focus on are the 'title' and 'abstract', which will serve as the input, and 'mesh_terms' for the output labels.

#### Preprocessing the Data

Preprocessing consists of tokenizing the text and padding (or truncating) the sequences to a uniform length. This is required because the model's input layer expects sequences of exactly `maxlen` tokens.
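
As a rough illustration of what these two steps do, the pure-Python sketch below builds a frequency-ranked vocabulary and pads the resulting id sequences. The helper names and the toy sentences are invented for this example. The padding mirrors the defaults of `pad_sequences` (zeros added at the front, long sequences truncated from the front), but the word splitting is much cruder than what the Keras `Tokenizer` does.

```python
def build_word_index(texts, num_words):
    """Map the most frequent words to ids starting at 1 (0 is reserved for padding)."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # Keep num_words - 1 words, since id 0 is not assigned to any word.
    return {word: i + 1 for i, word in enumerate(ranked[: num_words - 1])}

def texts_to_ids(texts, word_index):
    # Words outside the vocabulary are simply dropped.
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

def pad(sequences, maxlen):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                           # keep the last maxlen ids
        padded.append([0] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded

toy_texts = [
    "chest pain in adults",
    "pain management",
    "acute chest pain in elderly adults",
]
word_index = build_word_index(toy_texts, num_words=10)
padded = pad(texts_to_ids(toy_texts, word_index), maxlen=5)
for row in padded:
    print(row)
```

Every row now has exactly five ids, so the rows can be stacked into a single integer array and fed to the model as one batch.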

In [4]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Parameters
vocab_size = 20000  # considering the top 20k words
maxlen = 200  # maximum length of the sequences

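# tfds.load returns tf.data.Dataset objects, which cannot be indexed by column name,
# so first collect the abstracts of each split into a list of Python strings.
# (Assumes 'abstract' is a string feature, as described in the dataset overview.)
def extract_abstracts(split):
    return [example['abstract'].decode('utf-8') for example in tfds.as_numpy(split)]

train_texts = extract_abstracts(ds['train'])
test_texts = extract_abstracts(ds['test'])

# Tokenizer (fitted on the training split only)
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_texts)

# Tokenize and pad sequences
def preprocess_text(texts):
    sequences = tokenizer.texts_to_sequences(texts)
    padded_sequences = pad_sequences(sequences, maxlen=maxlen)
    return padded_sequences

# Preprocess the dataset
x_train = preprocess_text(train_texts)
x_test = preprocess_text(test_texts)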




### Part 5: Model Building

#### Creating the Classifier Model

Now, we will use the previously defined Transformer block to build our classifier model. The model architecture is designed to process the input text, apply the Transformer block, and then classify the text into appropriate categories.
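
The model code in this part uses a `TokenAndPositionEmbedding` layer that is not defined in the cells above. A minimal sketch of such a layer follows, modeled on the pattern used in the Keras text classification example; treat it as one plausible definition rather than the notebook's original. It sums a learned token embedding and a learned position embedding. The `demo` objects at the end are only there to check shapes.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Sum of a learned token embedding and a learned position embedding."""

    def __init__(self, maxlen, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        # x: integer token ids of shape (batch, seq_len), with seq_len <= maxlen
        seq_len = tf.shape(x)[-1]
        positions = tf.range(start=0, limit=seq_len, delta=1)
        positions = self.pos_emb(positions)  # (seq_len, embed_dim)
        tokens = self.token_emb(x)           # (batch, seq_len, embed_dim)
        return tokens + positions            # positions broadcast over the batch

# Shape check on a toy batch of 2 sequences with 5 token ids each.
demo = TokenAndPositionEmbedding(maxlen=10, vocab_size=50, embed_dim=4)
demo_out = demo(tf.constant([[1, 2, 3, 0, 0], [4, 5, 6, 7, 8]]))
print(demo_out.shape)  # (2, 5, 4)
```

The position embedding is needed because self-attention on its own is insensitive to token order; adding a position-dependent vector to each token lets the block distinguish "first word" from "last word".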



In [None]:
# Model Parameters
embed_dim = 32  # Embedding size for each token
num_heads = 2   # Number of attention heads
ff_dim = 32     # Hidden layer size in feed forward network inside transformer

num_classes = 2  # Placeholder: set this to the number of label categories

# Model Architecture
# TokenAndPositionEmbedding is a custom layer (token + position embeddings);
# it must be defined before this cell is run.
inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)
x = embedding_layer(inputs)
transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
x = transformer_block(x)
x = layers.GlobalAveragePooling1D()(x)  # one vector per document
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=outputs)



#### Explanation

In this architecture:

- The `TokenAndPositionEmbedding` layer first processes the input sequences.
- The `TransformerBlock` applies the attention mechanism and feedforward network.
- The `GlobalAveragePooling1D` layer averages the transformer block's output over the sequence dimension, yielding one fixed-size vector per document.
- The `Dense` layers at the end serve as a classifier on top of the transformer outputs.
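
The pooling step in the list above amounts to a mean over the token axis. The NumPy sketch below shows this on a made-up array standing in for the transformer block's output; the values are arbitrary and only the shapes matter.

```python
import numpy as np

# Hypothetical transformer output: 2 documents, 3 tokens each, 4 features per token.
seq_out = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Average over the token axis (axis=1): one 4-dimensional vector per document.
pooled = seq_out.mean(axis=1)
print(pooled.shape)  # (2, 4)
print(pooled)
```

When no mask is passed, padded positions are averaged in along with real tokens, which is worth keeping in mind for abstracts much shorter than `maxlen`.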

### Part 6: Model Compilation

#### Compiling the Model

After building the model, it needs to be compiled with the appropriate loss function and optimizer. Since this is a classification task, we'll use the 'sparse_categorical_crossentropy' loss.



In [None]:
model.compile(optimizer="adam", 
              loss="sparse_categorical_crossentropy", 
              metrics=["accuracy"])



#### Explanation

- **Optimizer**: 'Adam' adapts the learning rate for each parameter, copes well with sparse gradients, and usually works with its default settings.
- **Loss Function**: 'sparse_categorical_crossentropy' is suitable for multi-class classification tasks where labels are integers.
- **Metrics**: We are tracking 'accuracy' as it's a common metric for classification tasks.
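
To show what the loss computes, here is a small NumPy sketch of sparse categorical cross-entropy: for each example it takes the negative log of the probability the model assigned to the true class. The function and the toy numbers are illustrative assumptions, not the Keras implementation.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    """labels: integer class ids, shape (n,). probs: softmax outputs, shape (n, num_classes)."""
    probs = np.clip(probs, eps, 1.0)                 # avoid log(0)
    picked = probs[np.arange(len(labels)), labels]   # probability of the true class
    return -np.log(picked)                           # one loss value per example

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 1])   # integer labels, not one-hot vectors

losses = sparse_categorical_crossentropy(labels, probs)
print(losses)          # the third example is penalized most: true class got only 0.4
print(losses.mean())   # the per-batch loss is the mean of the per-example values
```

The "sparse" variant takes integer labels directly; with one-hot encoded labels, `categorical_crossentropy` would be the matching choice.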
