<div style="line-height:1.2;">

<h1 style="color:#BF66F2; margin-bottom: 0.3em;"> Embeddings in PyTorch </h1>

<h4 style="margin-top: 0.3em; margin-bottom: 1em;"> Text Classification with a Feedforward Network on the AG_NEWS dataset. </h4>

<div style="line-height:1.4; margin-bottom: 0.5em;">
    <h3 style="color: lightblue; display: inline; margin-right: 0.5em;">Keywords:</h3> 
Torchtext + yield + to_map_style_dataset + nn.EmbeddingBag + torch.optim.lr_scheduler.StepLR
</div>

</div>

In [20]:
#!pip install -U portalocker>=2.0.0
!pip install portalocker



In [21]:
import torch  # needed here because this cell runs before the main import cell

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [22]:
import os
import time

import torch
from torch import nn
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
from torchtext.data.utils import get_tokenizer
from torchtext.datasets import AG_NEWS
from torchtext.vocab import build_vocab_from_iterator

In [23]:
train_iter = iter(AG_NEWS(split="train"))
train_iter

<generator object ShardingFilterIterDataPipe.__iter__ at 0x7e5e0781f530>

In [24]:
tokenizer = get_tokenizer("basic_english")
# Load the training split of the AG_NEWS dataset
train_iter = AG_NEWS(split="train")

def yield_tokens(data_iter):
    """ Apply the tokenizer to each text.

    Each item of data_iter is a (label, text) tuple; the label is ignored and only the text is tokenized.
    Because the function uses yield, it is a generator: texts are tokenized lazily, one token list at a time.
    """
    for _, text in data_iter:
        yield tokenizer(text)

In [25]:
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

In [26]:
""" The vocabulary block converts a list of tokens into integers. """
vocab(['here', 'is', 'an', 'example'])

[475, 21, 30, 5297]
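
To make the token-to-integer mapping concrete, here is a minimal pure-Python sketch of the idea. It is a simplified stand-in, not torchtext's implementation: `build_vocab`, `lookup`, the whitespace tokenizer, and the tie-breaking rule (alphabetical among equally frequent tokens) are all assumptions made for illustration.

```python
from collections import Counter


def yield_tokens(texts):
    """Lazily yield one token list per text (whitespace split stands in for the tokenizer)."""
    for text in texts:
        yield text.lower().split()


def build_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Assumed ordering: most frequent first, ties broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}


def lookup(vocab, tokens, default="<unk>"):
    """Convert tokens to ids, falling back to the id of the default token."""
    return [vocab.get(tok, vocab[default]) for tok in tokens]


texts = ["the cat sat", "the dog sat down"]
toy_vocab = build_vocab(yield_tokens(texts))
ids = lookup(toy_vocab, ["the", "cat", "flew"])
print(ids)  # [2, 3, 0] -> "flew" is unknown and maps to <unk>
```

The fallback in `lookup` plays the role of `vocab.set_default_index(vocab["<unk>"])` above: any token not seen while building the vocabulary maps to index 0.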

In [27]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

In [28]:
text_pipeline('here is the an example')

[475, 21, 2, 30, 5297]

In [29]:
label_pipeline('10')

9

<h2 style="color:#BF66F2 "> Generate data batch and iterator </h2>

In [30]:
def collate_batch(batch):
    """ Combine a batch of examples into tensors that can be fed into the network.
    
    Parameters:
        batch: List of examples, where each example is a (label, text) tuple.
    
    Details:
        - Initialize the label and text lists, and an offsets list that starts at 0
        - Loop:
            - Process the label using the label_pipeline function and append it to the label list
            - Process the text using the text_pipeline function and convert the resulting list of token ids to a tensor
            - Append the processed text tensor to the text list
            - Append the length of the processed text tensor to the offsets list
        - Convert the label list to a tensor
        - Drop the last length and take the cumulative sum, so that each offset is the start index of a text
        - Concatenate the text tensors into a single 1-D tensor
        - Move the label, text, and offset tensors to the device specified by the `device` variable
    
    Returns:
        Tuple (labels, texts, offsets) of tensors for the batch
    """
    label_list, text_list, offsets = [], [], [0]

    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

    text_list = torch.cat(text_list)

    return label_list.to(device), text_list.to(device), offsets.to(device)
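
The offsets computation in `collate_batch` can be checked on a toy batch without PyTorch. The sketch below uses only the standard library; `flatten_with_offsets` and the token ids are made up for illustration.

```python
from itertools import accumulate


def flatten_with_offsets(sequences):
    """Concatenate variable-length id lists and record where each one starts."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # The start of sequence i is the total length of sequences 0..i-1,
    # i.e. the cumulative sum of lengths with the last entry dropped.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return flat, offsets


batch = [[475, 21, 30], [2, 5297], [21, 2, 30, 475]]
flat, offsets = flatten_with_offsets(batch)
print(offsets)                       # [0, 3, 5]
print(flat[offsets[1]:offsets[2]])   # [2, 5297] -> the second example
```

This is the same result as `torch.tensor(offsets[:-1]).cumsum(dim=0)` in the function above, where the list starts as `[0]` and each text appends its length.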

In [31]:
train_iter = AG_NEWS(split="train")
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

<div style="line-height:0.5">
<h2 style="color:#BF66F2 "> Define the model </h2>
</div>
The model is composed of an nn.EmbeddingBag layer (default mode “mean”) followed by a linear layer for classification. <br>
Although the text entries have different lengths, nn.EmbeddingBag requires no padding, because the start position of each text is stored in offsets.<br>
Additionally, since nn.EmbeddingBag computes the mean of the embeddings on the fly, without materializing the per-token embeddings, it can improve performance and memory efficiency when processing a sequence of tensors.
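
As a quick sanity check of the description above, the following sketch compares nn.EmbeddingBag in "mean" mode against a manual lookup followed by a per-text average. The sizes and token ids are arbitrary toy values chosen for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two texts of lengths 3 and 2, concatenated into one flat tensor.
text = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int64)
offsets = torch.tensor([0, 3], dtype=torch.int64)

pooled = bag(text, offsets)  # one pooled vector per text

# Manual equivalent: look up each token's vector, then average per text.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])

print(pooled.shape)                     # torch.Size([2, 4])
print(torch.allclose(pooled, manual))
```

No padding is involved: the two texts have different lengths, and `offsets` alone tells the layer where each one begins.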

In [32]:
class TextClassificationModel(nn.Module):
    """ Network model for text classification.
    
    Attributes:
        - Embedding layer that maps input tokens to dense vectors [nn.EmbeddingBag]
        - Linear fully-connected layer that maps the embeddings to class scores [nn.Linear]
    
    Args:
        - vocab_size: The size of the vocabulary [int]
        - embed_dim: The dimensionality of the embedding vectors [int]
        - num_class: The number of classes [int]
    """
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()

        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        """ Initialize the weights of the embedding and linear layers. """
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        """Compute the forward pass of the model.
        
        Parameters:
            - text: Flat tensor of token ids for the whole batch [torch.Tensor]
            - offsets: Tensor of offsets that indicate the starting index of each example in the input text [torch.Tensor]
        
        Details:
            - Embed the input text using the embedding layer.
            - Map the embeddings to class scores using the linear layer.
        
        Returns:
            Tensor of class scores.
        """
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

<h2 style="color:#BF66F2 "> Initiate an instance </h2>

In [33]:
train_iter = AG_NEWS(split="train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

In [34]:
def train(dataloader):
    """ Train the TextClassificationModel on a dataset.
    
    Parameters:
        DataLoader object that provides batches of training data

    Details:
        - Set the model to training mode

        - Initialize variables for tracking accuracy and time

        - Loop: Iterate over each batch in the dataloader
            - Zero the gradients
            - Compute the predicted class scores for the batch
            - Compute the loss between the predicted scores and the true labels
            - Backpropagate the loss, clip the gradient norm, and update the model parameters
            - Update the accuracy and count variables
            - Print the accuracy every `log_interval` batches
    
    """
    model.train()

    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches | accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            # Reset the accuracy and count variables, and update the start time
            total_acc, total_count = 0, 0
            start_time = time.time()


def evaluate(dataloader):
    """ Evaluate the TextClassificationModel on a dataset.
    
    Parameters:
        DataLoader object that provides batches of evaluation data
    
    Details:
        - Set the model to evaluation mode
        - Initialize variables for tracking accuracy
        - Loop: Iterate over each batch in the dataloader
            - Compute the predicted labels for the batch
            - Compute the loss between the predicted and true labels
            - Update the accuracy and count variables
    
    Returns:
        Accuracy of the model on the evaluation data [float]
    """
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return total_acc / total_count

<div style="line-height:0.5">
<h2 style="color:#BF66F2 "> Main: Split the dataset and run the model </h2>
</div>
The CrossEntropyLoss criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class, which is convenient when training a classification problem with C classes.<br>
SGD implements the stochastic gradient descent method and is used here as the optimizer.
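
The equivalence stated above can be verified numerically. This is a small sketch with random logits and made-up labels (5 examples, C = 4 classes), not part of the training code.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 4)               # raw class scores: 5 examples, 4 classes
labels = torch.tensor([0, 3, 1, 2, 1])   # made-up target classes

# One step: CrossEntropyLoss applied directly to the raw scores.
ce = nn.CrossEntropyLoss()(logits, labels)

# Two steps: log-softmax over the class dimension, then negative log-likelihood.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), labels)

print(torch.allclose(ce, nll))
```

This is why the model's `forward` returns raw scores from the linear layer and applies no softmax itself.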

In [35]:
# Hyperparameters
EPOCHS = 10  # number of training epochs
LR = 5  # initial learning rate
BATCH_SIZE = 64  # batch size for training

In [36]:
### Loss function + optimizer + learning rate scheduler
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

##### Training and testing sets created from the AG_NEWS dataset
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)

# Split the training set into a training subset (95%) and a validation subset (5%).
split_train_, split_valid_ = random_split(train_dataset, [num_train, len(train_dataset) - num_train])

### DataLoader objects for the training, validation, and testing sets.
train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)

#################### Train the model
for epoch in range(1, EPOCHS + 1):  #for the specified number of epochs.
    # Record the start time of the epoch.
    epoch_start_time = time.time()
    # Train the model on the training subset.
    train(train_dataloader)
    # Evaluate the model on the validation subset and record the accuracy.
    accu_val = evaluate(valid_dataloader)
    # Adjust the learning rate using the scheduler.
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val

    print("-" * 59)
    print("| end of epoch {:3d} | time: {:5.2f}s | valid accuracy {:8.3f} ".format(epoch, time.time() - epoch_start_time, accu_val))
    print("-" * 59)

| epoch   1 |   500/ 1782 batches | accuracy    0.682
| epoch   1 |  1000/ 1782 batches | accuracy    0.853
| epoch   1 |  1500/ 1782 batches | accuracy    0.877
-----------------------------------------------------------
| end of epoch   1 | time: 41.63s | valid accuracy    0.888 
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.895
| epoch   2 |  1000/ 1782 batches | accuracy    0.902
| epoch   2 |  1500/ 1782 batches | accuracy    0.901
-----------------------------------------------------------
| end of epoch   2 | time: 36.89s | valid accuracy    0.892 
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.915
| epoch   3 |  1000/ 1782 batches | accuracy    0.915
| epoch   3 |  1500/ 1782 batches | accuracy    0.911
-----------------------------------------------------------
| end of epoch   3 | time: 37.82s | valid accuracy    0.902 
-------------------------------
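
The loop above calls `scheduler.step()` only in epochs where validation accuracy falls below the best value seen so far, so the learning rate is multiplied by gamma on those epochs and left unchanged otherwise. The following pure-Python sketch mimics that logic; `simulate_lr` is a hypothetical helper and the accuracy values are invented for illustration, not taken from the run above.

```python
def simulate_lr(val_accuracies, lr=5.0, gamma=0.1):
    """Return the learning rate in effect after each epoch."""
    best_acc = None
    history = []
    for acc in val_accuracies:
        if best_acc is not None and best_acc > acc:
            lr *= gamma      # stands in for scheduler.step() with step_size=1
        else:
            best_acc = acc   # new best (or equal) validation accuracy
        history.append(lr)
    return history


# Hypothetical validation accuracies for five epochs.
history = simulate_lr([0.80, 0.85, 0.84, 0.86, 0.85])
print(history)
```

With these invented values the rate stays at 5.0 for two epochs, drops by a factor of 10 in the third, holds in the fourth, and drops again in the fifth.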

In [37]:
print("Checking the results of test dataset.")
accu_test = evaluate(test_dataloader)
print("test accuracy {:8.3f}".format(accu_test))

Checking the results of test dataset.
test accuracy    0.907


In [38]:
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1


ex_text_str = """Apple has unveiled a new iPhone that has a faster processor, improved cameras, and longer battery life.
The iPhone 13, which comes in four different models,
also features a new cinematic mode that allows users to record professional-looking video with shallow depth of field.
The new models start at $699 and are available for pre-order now."""

model = model.to("cpu")

print("This is a %s news" % ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sci/Tec news


In [39]:
ex_text_str2 = "Scientists have discovered a new species of dinosaur that lived over 90 million years ago. \
The dinosaur, named Hesperornithoides miessleri, was a small, bird-like predator that likely fed on insects and small animals. \
Its fossils were found in Montana, USA, and provide new insights into the evolution of dinosaurs during the Late Cretaceous period."

print("This is a %s news" % ag_news_label[predict(ex_text_str2, text_pipeline)])

This is a Sci/Tec news


In [41]:
ex_text_str3 = "Ex-Inter Milan striker Diego Milito had a hand in the club's latest signing, as Lautaro Martinez completed a €22 million \
move from Racing Club. The Argentine forward, who scored 13 goals in 21 appearances for Racing Club last season, has signed a five-year contract with Inter Milan. \
Martinez has been compared to Milito, who played for Inter Milan from 2009 to 2014 and helped the club win the treble in his first season. \
Martinez will later on became the captain in 2022."

print("This is a %s news!" % ag_news_label[predict(ex_text_str3, text_pipeline)])

This is a Sports news!
