<a href="https://colab.research.google.com/github/marimcmurtrie/NLP/blob/main/Mari_Transformer_Based_Text_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Classification Using Transformers


This is a four-part coding activity to practice the concepts behind transformers, particularly those used in models like BERT and GPT. This activity builds understanding from a foundational level, covering:

- **Tokenization and Embedding Layers**

- **Self-Attention Mechanism**

- **Transformer Encoder**

- **Text Classification Using BERT-style Transformer Encoder**


### Part 1: Tokenization and Embedding Layers

**Objective:** Introduce tokenization and word embeddings, the first steps in processing text for transformers.

**Explanation:**

- **Tokenization:** We use a pre-trained BERT tokenizer to convert the sentence into token IDs.

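- **Embedding Layer:** Converts token IDs into dense vectors. Each token ID gets mapped to a 768-dimensional vector (following BERT's embedding size). The layer here is newly initialized, so its vectors are random rather than BERT's trained embeddings.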

- **Output:** `tokens['input_ids']` shows token IDs, and `embedded_tokens.shape` gives the shape of the embedding tensor, `[1, num_tokens, 768]`.

In [None]:
import torch
import torch.nn as nn
from transformers import BertTokenizer

# Sample sentence
sentence = "Transformers are amazing for natural language processing!"

# 1. Tokenization using BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer(sentence, return_tensors="pt")

# 2. Word Embedding with Embedding Layer
embedding_layer = nn.Embedding(num_embeddings=tokenizer.vocab_size, embedding_dim=768)
embedded_tokens = embedding_layer(tokens['input_ids'])

# Output shapes
print("\n\nTokens:", tokens['input_ids'])
print("\nEmbedded tokens shape:", embedded_tokens.shape)





Tokens: tensor([[  101, 19081,  2024,  6429,  2005,  3019,  2653,  6364,   999,   102]])

Embedded tokens shape: torch.Size([1, 10, 768])
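
To make the lookup concrete, here is a minimal sketch with a toy vocabulary of 5 tokens and 4-dimensional vectors. Both sizes, and the names `toy_embedding`, `toy_ids`, and `rows`, are illustrative assumptions rather than BERT's values. It shows that `nn.Embedding` behaves like a lookup table: each token ID selects one row of the layer's weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen for illustration only (the notebook uses 30522 x 768)
toy_embedding = nn.Embedding(num_embeddings=5, embedding_dim=4)

# A batch with one "sentence" of three token IDs; ID 3 appears twice
toy_ids = torch.tensor([[0, 3, 3]])
toy_vectors = toy_embedding(toy_ids)

# The layer simply indexes rows of its weight matrix
rows = toy_embedding.weight[toy_ids]

print(toy_vectors.shape)                # torch.Size([1, 3, 4])
print(torch.equal(toy_vectors, rows))   # True
```

Repeated token IDs map to identical vectors, which is why the model needs attention (and, in BERT, positional information) to tell occurrences apart.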


### Part 2: Self-Attention Mechanism


**Objective:** Show how self-attention calculates the importance of each word in a sentence with respect to the others.

In this part, we'll compute a simplified version of self-attention for the token embeddings.

**Explanation:**

- **Query, Key, and Value:** Each word’s embedding serves as `query`, `key`, and `value`.

- **Dot Product Attention:** We calculate scores using dot products of `query` and `key`, scaled by the square root of the dimension. This gives us the relationship of each word to every other word.

- **Softmax and Output:** Softmax gives attention weights, which are applied to `value` to produce the attention output. The shape should match the embedded tokens `[1, num_tokens, 768]`.

In [None]:
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Calculate the attention scores using the dot product of query and key
    scores = torch.matmul(query, key.transpose(-2, -1)) / torch.sqrt(torch.tensor(query.size(-1), dtype=torch.float32))
    # Apply softmax to get the attention weights
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values
    output = torch.matmul(weights, value)
    return output, weights

# Define query, key, and value
query = embedded_tokens
key = embedded_tokens
value = embedded_tokens

# Compute self-attention
attention_output, attention_weights = scaled_dot_product_attention(query, key, value)

print("Attention output shape:", attention_output.shape)
print("Attention weights shape:", attention_weights.shape)


Attention output shape: torch.Size([1, 10, 768])
Attention weights shape: torch.Size([1, 10, 10])
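
In full transformer layers, queries, keys, and values are usually learned linear projections of the embeddings rather than the raw embeddings themselves. The sketch below illustrates that variant on a small random tensor. The projection layers `w_q`, `w_k`, and `w_v` and the toy size `d_model = 8` are assumptions for illustration, not part of the notebook.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model = 8                       # toy size; the notebook uses 768
x = torch.randn(1, 4, d_model)    # stand-in for embedded tokens: [batch, tokens, d_model]

# Hypothetical projection layers (these would be learned during training)
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

q, k, v = w_q(x), w_k(x), w_v(x)

# Same computation as scaled_dot_product_attention above
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)
weights = F.softmax(scores, dim=-1)
output = weights @ v

print(output.shape)          # torch.Size([1, 4, 8])
print(weights.sum(dim=-1))   # each row of attention weights sums to 1
```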


### Part 3: Transformer Encoder Layer


**Objective:** Build a transformer encoder layer using self-attention and feed-forward layers.

**Explanation:**

- **Multi-Head Attention:** `nn.MultiheadAttention` applies self-attention across multiple heads.

- **Feed-Forward Network:** After attention, the output passes through a 2-layer feed-forward network.

- **Normalization and Residual Connections:** Adds residuals and normalizes for stability.

- **Output:** `encoder_output.shape` shows `[1, num_tokens, 768]`, the same as the input embedding dimensions.

In [None]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=768, nhead=8, dim_feedforward=2048, dropout=0.1):
        super().__init__()
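        # Multi-head attention layer; batch_first=True so inputs are read as
        # [batch, tokens, d_model] (the default layout is [tokens, batch, d_model])
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=True)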
        # Feed-forward layers
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        # Layer norm and dropout
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, src):
        # Self-attention and add + norm
        src2, _ = self.self_attn(src, src, src)
        src = src + self.dropout1(src2)
        src = self.norm1(src)

        # Feed-forward network and add + norm
        src2 = self.linear2(self.dropout(F.relu(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src

# Initialize transformer encoder
encoder_layer = TransformerEncoderLayer()
encoder_output = encoder_layer(embedded_tokens)

print("Transformer Encoder output shape:", encoder_output.shape)


Transformer Encoder output shape: torch.Size([1, 10, 768])
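
PyTorch also ships a built-in `nn.TransformerEncoderLayer` with the same overall structure (self-attention, feed-forward network, residuals, and layer norm). The sketch below is a quick shape check with toy sizes chosen for speed. Passing `batch_first=True` tells the layer to read input as `[batch, tokens, d_model]`; by default PyTorch's attention modules expect `[tokens, batch, d_model]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes for a quick check; d_model must be divisible by nhead
builtin_layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=4, dim_feedforward=32, dropout=0.1, batch_first=True
)

x = torch.randn(2, 5, 16)    # [batch, tokens, d_model]
y = builtin_layer(x)

print(y.shape)               # torch.Size([2, 5, 16])
```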


### Part 4: Text Classification Using Transformer Encoder


**Objective:** Use a transformer encoder to classify text, similar to how BERT can be used for classification tasks.

**Explanation:**

- **Encoder:** A single encoder layer processes the token embeddings.

- **Classification Token:** Similar to BERT, the embedding of the first token represents the entire sentence, which we pass to a classifier.

- **Output:** `classification_output` has shape `[1, num_classes]`, giving one score per class.

In [None]:
class TextClassifier(nn.Module):
    def __init__(self, num_classes, d_model=768):
        super().__init__()
        self.encoder = TransformerEncoderLayer(d_model=d_model)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # Pass through the encoder
        x = self.encoder(x)
        # Use the first token's embedding as the sentence representation (similar to BERT [CLS] token)
        cls_token_embedding = x[:, 0, :]
        # Classification layer
        output = self.classifier(cls_token_embedding)
        return output

# Define a simple binary classifier
num_classes = 2  # Binary classification
model = TextClassifier(num_classes=num_classes)

# Forward pass
classification_output = model(embedded_tokens)
print("Classification output shape:", classification_output.shape)


Classification output shape: torch.Size([1, 2])


The output shape `torch.Size([1, 2])` from the classifier indicates that the model has produced a **logit score** for each class in a binary classification setting.

This shape corresponds to the raw scores (before applying an activation function like `softmax`) for each of the two classes in your task.

In this setup:

- The first dimension, `1`, is the batch size (here, a single input sentence).

- The second dimension, `2`, is the number of classes (e.g., "positive" and "negative").

The output of `TextClassifier` is not a class label yet; it's a tensor of two values, each corresponding to a score for one of the two classes.

To obtain the final predicted class label, you need to convert these scores to probabilities or take the index of the highest score.

**Converting to Class Labels**

- **Softmax:** Applying softmax turns the raw scores into probabilities that sum to 1 across the classes. This works for binary and multi-class classification alike.

- **Argmax:** `torch.argmax` returns the index of the highest score, which is the predicted class. Because softmax preserves the ordering of the scores, argmax gives the same result on logits and on probabilities.

In [None]:
import torch.nn.functional as F

# Assuming 'classification_output' is the raw output from the model
# Apply softmax to get probabilities
probabilities = F.softmax(classification_output, dim=1)

# Get the predicted class (0 or 1) based on the highest probability
predicted_class = torch.argmax(probabilities, dim=1).item()

print("Predicted class:", predicted_class)
print("Probabilities:", probabilities)

Predicted class: 1
Probabilities: tensor([[0.2799, 0.7201]], grad_fn=<SoftmaxBackward0>)
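
As a small worked example, the sketch below applies both steps to hand-picked logits for three inputs. The values are hypothetical and chosen only so the arithmetic is easy to follow. It also checks that taking the argmax of the logits gives the same labels as taking the argmax of the probabilities.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of three inputs and two classes
logits = torch.tensor([[0.0, 1.0],
                       [2.0, -1.0],
                       [0.3, 0.5]])

probs = F.softmax(logits, dim=1)

# Softmax preserves ordering, so both argmax calls agree
labels_from_logits = torch.argmax(logits, dim=1)
labels_from_probs = torch.argmax(probs, dim=1)

print(probs.sum(dim=1))                                     # each row sums to 1
print(labels_from_logits.tolist())                          # [1, 0, 1]
print(torch.equal(labels_from_logits, labels_from_probs))   # True
```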


### Try with real dataset: IMDB Movie Review

To implement the transformer-based text classifier on a real dataset like the IMDB Movie Review dataset (a common binary sentiment classification task), we'll follow these steps:

1. Load and preprocess the IMDB data.

2. Define the transformer-based text classifier model.

3. Train the model on the training data.

4. Evaluate the model on test data.

We'll be using PyTorch and **Hugging Face**'s Transformers library for easier access to pre-trained tokenizers and tools.

In [None]:
pip install torch transformers datasets



**Step 1: Load and Preprocess the IMDB Dataset**

- **Load Dataset:** We load the IMDB dataset using Hugging Face's `datasets` library.

- **Tokenization:** We use BERT's tokenizer to convert each review to token IDs, padding and truncating each review to a maximum length of 128 tokens.

- **DataLoader:** The `DataLoader` batches the data for efficient processing and shuffles the training set.
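
Before running the real tokenizer, the padding and truncation step can be illustrated on toy token IDs. The helper `pad_or_truncate` below is hypothetical and deliberately simplified: unlike the BERT tokenizer, it does not keep the closing `[SEP]` token when it truncates. The IDs `101`, `102`, and `0` are the `[CLS]`, `[SEP]`, and `[PAD]` IDs in `bert-base-uncased`; the other IDs are arbitrary.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncate long inputs
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad slots to append
    return ids + [pad_id] * padding, mask + [0] * padding

short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=5)

print(short_ids, short_mask)   # [101, 2023, 102, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)     # [101, 7, 8, 9, 10] [1, 1, 1, 1, 1]
```

The attention mask is what later lets the model ignore the padded positions.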

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer
from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset("imdb")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

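# Define tokenization function; padding="max_length" gives every review exactly
# 128 token IDs, so the DataLoader can stack them into fixed-size batches
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)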

# Tokenize the datasets
train_data = dataset["train"].map(tokenize, batched=True)
test_data = dataset["test"].map(tokenize, batched=True)

# Set format for PyTorch DataLoader
train_data.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_data.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Create DataLoader
train_loader = DataLoader(train_data, batch_size=16, shuffle=True)
test_loader = DataLoader(test_data, batch_size=16)



**Step 2: Define the Transformer-Based Classifier Model**

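We define a new `TransformerClassifier`. It follows the same idea as the `TextClassifier` above but includes its own embedding layer and a stack of encoder layers. It is far smaller than BERT and is trained from scratch.

**Transformer Layer:** The model pairs a trainable embedding layer sized to BERT's vocabulary (30,522 tokens) with four `nn.TransformerEncoderLayer` blocks using 8-head attention. Unlike BERT, it loads no pre-trained weights and adds no positional embeddings.

**Classification Head:** After encoding, the embedding of the first token (`[CLS]`) represents the sentence, and we pass it to a `Linear` layer for binary classification.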

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformerClassifier(nn.Module):
    def __init__(self, num_classes=2, vocab_size=30522, d_model=768):
        super().__init__()
        # Embedding layer for token ids
        self.embedding = nn.Embedding(vocab_size, d_model)
        # Transformer encoder
        self.bert = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8),
            num_layers=4
        )
        # Classification layer
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, input_ids, attention_mask):
        # Convert input_ids to embeddings
        x = self.embedding(input_ids)

        # Ensure attention_mask is a boolean tensor with True for padded positions
        attention_mask = attention_mask == 0  # Convert to boolean with True for padding

        # Forward through the transformer encoder with masking
        x = self.bert(x.transpose(0, 1), src_key_padding_mask=attention_mask)  # Transpose to [seq_len, batch, d_model]

        # Use the first token's embedding for classification (similar to [CLS] token in BERT)
        cls_token_embedding = x[0, :, :]  # Get the embeddings of the first token in each sequence

        # Classification layer
        output = self.classifier(cls_token_embedding)
        return output

# Initialize model with vocab_size, for BERT typically 30522
model = TransformerClassifier(num_classes=2, vocab_size=30522)
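
The `attention_mask == 0` conversion in `forward` can be checked in isolation. The sketch below is a minimal illustration with toy sizes, using a single `nn.MultiheadAttention` module rather than the full encoder stack. It shows that positions flagged `True` in the padding mask receive zero attention weight.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tokenizer-style mask: 1 = real token, 0 = padding
attention_mask = torch.tensor([[1, 1, 1, 0]])
padding_mask = attention_mask == 0          # True marks positions to ignore

# Toy attention module; sizes are illustrative only
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)                    # [batch, tokens, embed_dim]

_, attn_weights = attn(x, x, x, key_padding_mask=padding_mask)

print(padding_mask)            # tensor([[False, False, False,  True]])
print(attn_weights[0, :, 3])   # attention paid to the padded position: all zeros
```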





**Step 3: Training the Model**

Define the optimizer, loss function, and a training loop.

**Loss Function:** Cross-entropy loss, computed from the two-class logits and the integer labels.

**Optimizer:** AdamW, common for transformer training.

**Training Loop:** We zero the gradients, do a forward pass to compute the loss, perform backpropagation, and update the model parameters.
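
To see what the loss function computes, the sketch below compares `nn.CrossEntropyLoss` with a manual calculation on hypothetical logits: the mean negative log-probability that softmax assigns to the true class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical scores for two reviews and their true classes
logits = torch.tensor([[2.0, 0.5],
                       [0.1, 1.2]])
labels = torch.tensor([0, 1])

loss = nn.CrossEntropyLoss()(logits, labels)

# Manual version: mean negative log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), labels].mean()

print(torch.allclose(loss, manual))    # True
```

Note that `nn.CrossEntropyLoss` expects raw logits, so the model should not apply softmax before the loss.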

In [None]:
from torch.optim import AdamW
from tqdm import tqdm

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop
def train(model, train_loader):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader):
        optimizer.zero_grad()
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["label"]

        # Forward pass
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        total_loss += loss.item()

        # Backward pass
        loss.backward()
        optimizer.step()

    avg_loss = total_loss / len(train_loader)
    print(f"Training loss: {avg_loss:.4f}")

# Run training
for epoch in range(1):  # Train for 1 epoch; the run recorded below took about 2 h 17 min
    print(f"Epoch {epoch + 1}")
    train(model, train_loader)


Epoch 1


100%|██████████| 1563/1563 [2:16:51<00:00,  5.25s/it]

Training loss: 0.5472
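
The loop above never moves the model or the batches to a GPU, so it runs on the CPU even when an accelerator is available. The sketch below shows one possible device-aware variant. `train_one_epoch`, `TinyModel`, and `toy_loader` are hypothetical names introduced here so the example runs without the IMDB data; it is not the code that produced the loss reported above.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

def train_one_epoch(model, loader, criterion, optimizer, device):
    """Hypothetical variant of train() that moves each batch to `device`."""
    model.train()
    total_loss = 0.0
    for batch in loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        optimizer.zero_grad()
        loss = criterion(model(input_ids, attention_mask), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

class TinyModel(nn.Module):
    """Stand-in classifier with the same forward signature as the notebook's model."""
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(10, 4)
        self.classifier = nn.Linear(4, 2)

    def forward(self, input_ids, attention_mask):
        # Classify from the first token's embedding; the mask is unused here
        return self.classifier(self.embedding(input_ids)[:, 0, :])

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# One fake batch shaped like the DataLoader output
toy_loader = [{"input_ids": torch.randint(0, 10, (2, 3)),
               "attention_mask": torch.ones(2, 3, dtype=torch.long),
               "label": torch.tensor([0, 1])}]

# Move the model first, then build the optimizer from its parameters
toy_model = TinyModel().to(device)
toy_optimizer = AdamW(toy_model.parameters(), lr=1e-3)

avg = train_one_epoch(toy_model, toy_loader, nn.CrossEntropyLoss(), toy_optimizer, device)
print(f"Average loss: {avg:.4f}")
```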





**Step 4: Evaluate the Model**

After training, we evaluate the model on the test set.

**Evaluation Mode:** `model.eval()` switches the model to evaluation mode, which turns off dropout so that predictions are deterministic.

**Accuracy Calculation:** After predictions, we compare them with the true labels to calculate accuracy.



In [None]:
def evaluate(model, test_loader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in tqdm(test_loader):
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]
            labels = batch["label"]

            outputs = model(input_ids, attention_mask)
            predictions = torch.argmax(outputs, dim=1)

            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    accuracy = correct / total
    print(f"Test Accuracy: {accuracy:.4f}")

# Run evaluation
evaluate(model, test_loader)


100%|██████████| 1563/1563 [40:09<00:00,  1.54s/it]

Test Accuracy: 0.7701
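
The accuracy calculation inside `evaluate` can be traced on a handful of values. The logits and labels below are hypothetical, picked so that exactly three of four predictions are correct.

```python
import torch

# Hypothetical logits for four reviews and their true labels
logits = torch.tensor([[1.5, -0.2],
                       [0.1, 0.9],
                       [0.4, 0.3],
                       [-1.0, 2.0]])
labels = torch.tensor([0, 1, 1, 1])

predictions = torch.argmax(logits, dim=1)          # tensor([0, 1, 0, 1])
correct = (predictions == labels).sum().item()     # 3 matches
accuracy = correct / labels.size(0)

print(f"Accuracy: {accuracy:.4f}")                 # Accuracy: 0.7500
```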





In [None]:
# Define the function to test the model with custom examples
def test_model(model, sentences, tokenizer):
    model.eval()  # Set the model to evaluation mode
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Tokenize input sentences
    inputs = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")

    # Move inputs to the appropriate device
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    # Get predictions
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs, dim=-1)  # Directly use the tensor output

    # Return the predicted class labels
    return predictions.cpu().numpy()

# Example sentences to test
sentences = [
    "The movie was absolutely fantastic! I loved every moment of it.",
    "The plot was dull and the characters were poorly developed.",
    "It was an average movie with some good and bad moments.",
]

# Test the model
predictions = test_model(model, sentences, tokenizer)

# Display the results
for sentence, label in zip(sentences, predictions):
    sentiment = "Positive" if label == 1 else "Negative"
    print(f"Sentence: {sentence}\nPredicted Sentiment: {sentiment}\n")


Sentence: The movie was absolutely fantastic! I loved every moment of it.
Predicted Sentiment: Positive

Sentence: The plot was dull and the characters were poorly developed.
Predicted Sentiment: Negative

Sentence: It was an average movie with some good and bad moments.
Predicted Sentiment: Positive

