# Document Classification using Neural Networks and TorchText
This notebook demonstrates how to implement a simple document classification model using PyTorch and TorchText. We will work with the AG News dataset and build a pipeline for:
1. Text preprocessing and tokenization
2. Embedding text using `nn.EmbeddingBag`
3. Building a simple neural classifier (an embedding layer followed by a single linear layer)
4. Making predictions using `argmax` over logits
5. Understanding logits, classes, and hyperparameters

---

In [None]:
# Step 1: Imports and Setup
import torch  # Import PyTorch core library
import torch.nn as nn  # Import neural network module
import torchtext  # Import torchtext for NLP datasets and tools
from torchtext.datasets import AG_NEWS  # Import AG_NEWS dataset
from torchtext.data.utils import get_tokenizer  # Import tokenizer utility
from torchtext.vocab import build_vocab_from_iterator  # Import vocab builder
from torch.utils.data import DataLoader  # Import DataLoader for batching

import numpy as np  # Import NumPy for numerical operations
import time  # Import time module for timing

# Set seed for reproducibility
torch.manual_seed(42)  # Set random seed for PyTorch


<torch._C.Generator at 0x221f0d3ffb0>

In [None]:
# Step 2: Load AG_NEWS Dataset and Tokenize
train_iter = AG_NEWS(split='train')  # Load the AG_NEWS training dataset
tokenizer = get_tokenizer('basic_english')  # Create a basic English tokenizer

def yield_tokens(data_iter):  # Define a generator to yield tokens from dataset
    for _, text in data_iter:  # Iterate over each (label, text) pair
        yield tokenizer(text)  # Yield tokenized text

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])  # Build vocabulary from tokens, add <unk> special token
# The <unk> (unknown) token represents words that are not in the vocabulary.
# Real-world text often contains words that were never seen while the vocabulary was built.
# Setting "<unk>" as the default index maps every such out-of-vocabulary (OOV) word to the <unk> embedding,
# so lookups do not fail at inference time and the model can process any input text.
vocab.set_default_index(vocab["<unk>"])  # Set default index for unknown tokens

print("Vocabulary size:", len(vocab))  # Print the size of the vocabulary

Vocabulary size: 95811
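
The fallback behaviour described in the comments above can be illustrated without torchtext. This is a minimal sketch using a plain dictionary; `toy_vocab`, `UNK_INDEX`, and `lookup` are hypothetical names invented for the illustration and are not part of the torchtext API.

```python
# Toy stand-in for a vocabulary with a default index (hypothetical names,
# not the torchtext API): unseen tokens fall back to the index of "<unk>".
toy_vocab = {"<unk>": 0, "the": 1, "market": 2, "rises": 3}
UNK_INDEX = toy_vocab["<unk>"]


def lookup(tokens):
    """Map each token to its index, using UNK_INDEX for unseen tokens."""
    return [toy_vocab.get(tok, UNK_INDEX) for tok in tokens]


indices = lookup(["the", "market", "skyrockets"])
print(indices)  # [1, 2, 0]
```

The word "skyrockets" is not in the toy vocabulary, so it receives index 0, the same index as `<unk>`.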


In [12]:
# Step 3: Pipeline to Encode Text as Tensor
text_pipeline = lambda x: vocab(tokenizer(x))  # Convert text to list of token indices using tokenizer and vocab
label_pipeline = lambda x: int(x) - 1  # Shift the 1-based AG_NEWS labels (1..4) to 0-based class indices (0..3)

example_text = "Google's quantum computer achieves new milestone in speed."  # Example text for preview
print("Token indices:", text_pipeline(example_text))  # Print token indices for the example text


Token indices: [202, 16, 9, 9254, 179, 20726, 23, 4216, 7, 1096, 1]


In [13]:
# Step 4: Create Batch Function with Offsets (for EmbeddingBag)
def collate_batch(batch):  # Define function to collate a batch for DataLoader
    label_list, text_list, offsets = [], [], [0]  # Initialize lists for labels, texts, and offsets
    for (_label, _text) in batch:  # Iterate over each sample in the batch
        label_list.append(label_pipeline(_label))  # Process and append label
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)  # Tokenize and convert text to tensor
        text_list.append(processed_text)  # Append processed text tensor
        offsets.append(processed_text.size(0))  # Append length of processed text to offsets
    label_tensor = torch.tensor(label_list, dtype=torch.int64)  # Convert label list to tensor
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # Cumulative sum of lengths (last one dropped) gives the start index of each text
    text_tensor = torch.cat(text_list)  # Concatenate all text tensors into one tensor
    return label_tensor, text_tensor, offsets  # Return label tensor, text tensor, and offsets

# Create DataLoader
train_iter = AG_NEWS(split='train')  # Reload AG_NEWS training dataset
dataloader = DataLoader(list(train_iter)[:1000], batch_size=8, shuffle=True, collate_fn=collate_batch)  # DataLoader over only the first 1,000 training samples (quick demo), using the custom collate function
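
To make the offsets logic in `collate_batch` concrete, here is a small sketch on three made-up "documents" that are already converted to token indices. The index values and the names `docs`, `bounds`, and `recovered` are assumptions chosen for the illustration.

```python
import torch

# Three toy documents as token indices (made-up values).
docs = [[4, 8, 15], [16, 23], [42, 7, 7, 9]]

text_list = [torch.tensor(d, dtype=torch.int64) for d in docs]
lengths = [0] + [t.size(0) for t in text_list]       # [0, 3, 2, 4]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)    # start index of each document
text = torch.cat(text_list)                           # flat tensor of 9 token indices

print(offsets.tolist())  # [0, 3, 5]

# Recover each document from the flat tensor using the offsets.
bounds = offsets.tolist() + [text.size(0)]
recovered = [text[s:e].tolist() for s, e in zip(bounds[:-1], bounds[1:])]
```

Each offset marks where a document starts in the flat tensor, which is what `nn.EmbeddingBag` needs in order to pool variable-length texts without padding.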

In [None]:
# Step 5: Define Model Architecture (EmbeddingBag + Linear Layer)
class TextClassificationModel(nn.Module):  # Define the model class inheriting from nn.Module
    def __init__(self, vocab_size, embed_dim, num_class):  # Initialize model with vocab size, embedding dim, and number of classes
        super(TextClassificationModel, self).__init__()  # Call parent constructor
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)  # EmbeddingBag averages the token embeddings of each text (default mode='mean'); sparse=True produces sparse gradients
        self.fc = nn.Linear(embed_dim, num_class)  # Linear layer for classification
        self.init_weights()  # Initialize weights

    def init_weights(self):  # Method to initialize weights
        initrange = 0.5  # Set initialization range
        self.embedding.weight.data.uniform_(-initrange, initrange)  # Uniform init for embedding weights
        self.fc.weight.data.uniform_(-initrange, initrange)  # Uniform init for linear weights
        self.fc.bias.data.zero_()  # Zero init for linear bias

    def forward(self, text, offsets):  # Define forward pass
        embedded = self.embedding(text, offsets)  # Get embeddings for input text
        return self.fc(embedded)  # Pass embeddings through linear layer

num_classes = 4  # Number of output classes
vocab_size = len(vocab)  # Size of the vocabulary
embed_dim = 64  # Dimension of embeddings

model = TextClassificationModel(vocab_size, embed_dim, num_classes)  # Instantiate the model
print(model)  # Print model architecture

TextClassificationModel(
  (embedding): EmbeddingBag(95811, 64, mode='mean')
  (fc): Linear(in_features=64, out_features=4, bias=True)
)
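
The printed architecture shows `mode='mean'`, meaning each text is represented by the average of its token embeddings. The sketch below checks this on toy sizes (10 tokens, 4 dimensions, both assumptions for the illustration) by comparing the `nn.EmbeddingBag` output with a manual average over rows of its weight matrix.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bag = nn.EmbeddingBag(10, 4, mode="mean")  # toy sizes: 10 tokens, 4 dimensions

text = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int64)  # two documents, flattened
offsets = torch.tensor([0, 3], dtype=torch.int64)         # doc 0 = [1, 2, 3], doc 1 = [4, 5]

pooled = bag(text, offsets)  # one pooled vector per document, shape (2, 4)

# Manual version: look up the embedding rows and average them per document.
manual = torch.stack([
    bag.weight[text[0:3]].mean(dim=0),
    bag.weight[text[3:5]].mean(dim=0),
])

print(pooled.shape)                               # torch.Size([2, 4])
print(torch.allclose(pooled, manual, atol=1e-6))  # True
```

Because the pooled output has a fixed size regardless of text length, it can be fed directly to the linear layer.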


In [None]:
# Step 6: Make Predictions with Argmax (the model is still untrained, so predictions are close to random)
for labels, text, offsets in dataloader:  # Iterate over batches from the dataloader
    outputs = model(text, offsets)  # Get model outputs (logits) for the batch
    predictions = torch.argmax(outputs, dim=1)  # Get predicted class indices using argmax
    print("Logits:\n", outputs)  # Print the raw logits
    print("Predicted classes:", predictions)  # Print the predicted class indices
    print("True labels:", labels)  # Print the true labels for comparison
    break  # Process only the first batch


Logits:
 tensor([[ 0.1303,  0.1744,  0.0400, -0.0927],
        [ 0.1418,  0.0776, -0.0377, -0.1060],
        [ 0.0352,  0.3715, -0.0508, -0.1522],
        [-0.0834, -0.0282, -0.1419,  0.0064],
        [ 0.2335,  0.1354,  0.0988,  0.0340],
        [-0.0714,  0.0370,  0.0396, -0.1004],
        [ 0.0967, -0.1384, -0.0711,  0.0736],
        [ 0.1472,  0.0940, -0.2131, -0.0617]], grad_fn=<AddmmBackward0>)
Predicted classes: tensor([1, 0, 1, 3, 0, 2, 0, 0])
True labels: tensor([0, 3, 0, 2, 3, 2, 2, 3])
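
Logits are unnormalized scores. As a small sketch, the first two rows of the logits printed above can be converted to probabilities with softmax, which does not change the argmax. The batch accuracy is computed from the predicted classes and true labels shown in the same output.

```python
import torch

# First two rows of logits copied from the printed output.
logits = torch.tensor([[0.1303, 0.1744, 0.0400, -0.0927],
                       [0.1418, 0.0776, -0.0377, -0.1060]])

probs = torch.softmax(logits, dim=1)  # each row now sums to 1
same_argmax = torch.equal(probs.argmax(dim=1), logits.argmax(dim=1))

# Predicted classes and true labels copied from the printed output.
predictions = torch.tensor([1, 0, 1, 3, 0, 2, 0, 0])
labels = torch.tensor([0, 3, 0, 2, 3, 2, 2, 3])
accuracy = (predictions == labels).float().mean().item()

print(same_argmax)  # True
print(accuracy)     # 0.125
```

Only one of the eight predictions matches its label, which is expected for a model whose weights are still at their random initial values.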


## Summary
In this notebook, you:
- Loaded the AG_NEWS dataset using TorchText
- Built a vocabulary and tokenized the data
- Used `nn.EmbeddingBag` to aggregate word embeddings efficiently
- Built a simple classifier with a linear output layer
- Used the `argmax` function to predict classes from logits

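**Next steps**: Train the model using a loss function like `CrossEntropyLoss` and an optimizer like `SGD`, and evaluate it on a test set. Because the embedding layer is created with `sparse=True`, `torch.optim.Adam` will reject its sparse gradients; to use Adam, set `sparse=False` (or use `torch.optim.SparseAdam` for the embedding parameters).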
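
As a rough sketch of what a training loop could look like, the code below repeatedly fits one made-up batch with `CrossEntropyLoss` and plain `SGD`. The layer sizes, the batch contents, the learning rate, and the standalone `embedding` and `fc` layers are all assumptions for the illustration; a real run would iterate over the `dataloader` and use the notebook's `TextClassificationModel`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the notebook's model: EmbeddingBag (sparse gradients) + Linear.
embedding = nn.EmbeddingBag(50, 8, sparse=True)
fc = nn.Linear(8, 4)
params = list(embedding.parameters()) + list(fc.parameters())

criterion = nn.CrossEntropyLoss()            # takes raw logits and int64 class indices
optimizer = torch.optim.SGD(params, lr=0.1)  # plain SGD accepts sparse gradients

# One made-up batch: two documents flattened into a single tensor.
text = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int64)
offsets = torch.tensor([0, 3], dtype=torch.int64)
labels = torch.tensor([0, 2], dtype=torch.int64)

losses = []
for _ in range(50):
    optimizer.zero_grad()                    # clear gradients from the previous step
    logits = fc(embedding(text, offsets))    # forward pass
    loss = criterion(logits, labels)         # softmax + negative log-likelihood
    loss.backward()                          # backpropagate
    optimizer.step()                         # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Note that `CrossEntropyLoss` applies log-softmax internally, so the model should keep returning raw logits as it does in this notebook.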

---