In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import zipfile


with zipfile.ZipFile('data/data.zip', 'r') as zip_ref:
    zip_ref.extractall('data')

# Introduction
This notebook trains a text classifier that assigns AG News articles to one of four classes (World, Sports, Business, Sci/Tech) using a Recurrent Neural Network (RNN) built with PyTorch. The workflow covers preprocessing the text, tokenizing and encoding it as integer sequences, and training a classifier that combines an embedding layer, an RNN, and mean pooling over the sequence to predict the article category.

In [None]:
!pip install torchtext==0.5.0

# Import all required packages

The next cell imports the libraries used in the rest of the notebook. PyTorch (`torch`) is the core library for building and training the network, and `Dataset` and `DataLoader` from `torch.utils.data` handle dataset access, batching, and shuffling. From `torchtext` we use `get_tokenizer` to obtain the `basic_english` tokenizer, along with the library's vocabulary utilities. The standard-library `collections` module provides `Counter`, which we use to count token frequencies when building the vocabulary. Finally, `pandas` loads the CSV files into DataFrames.

In [None]:
import torch
from torchtext.data.utils import get_tokenizer
import collections
import torchtext
from torch.utils.data import Dataset, DataLoader
import pandas as pd

tokenizer = get_tokenizer('basic_english')

In [None]:
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train_df.head()

# Data exploration
Let's explore the data to see what types of values the dataset contains, and check whether it has any null values. The Title and Description columns hold the content of each news article, while Class Index holds its class.

In [None]:
# Visualize the train_df DataFrame
import matplotlib.pyplot as plt
import seaborn as sns

print('------------Description of the data-------------')
print(train_df.describe())

# Check for missing values
print('\n------------Missing values in the data-------------')
print(train_df.isnull().sum())

# Distribution of classes
plt.figure(figsize=(8, 6))
sns.countplot(x='Class Index', data=train_df)
plt.title('Distribution of Classes in Training Data')
plt.xlabel('Class Index')
plt.ylabel('Count')
plt.show()


The dataset has a title and a description for each news article. We will combine the Title and Description columns into a single Text column so that the model sees more vocabulary and more context per article.

In [None]:
def combine_text(row):
    return f"{row['Title']} - {row['Description']}"

train_df['Text'] = train_df.apply(combine_text, axis=1)
test_df['Text'] = test_df.apply(combine_text, axis=1)

# Create a custom dataset

We will create a custom PyTorch dataset called NewsDataset that wraps a DataFrame of news articles as a PyTorch Dataset object. It is initialized with the DataFrame (df) and stores the number of samples. The __getitem__ method looks up a row by its index and returns a tuple containing the class label (Class Index) and the combined Title and Description (Text). The __len__ method returns the total number of samples in the dataset. Implementing these two methods is what lets PyTorch's DataLoader batch and shuffle the data.

In [None]:
class NewsDataset(Dataset):
    def __init__(self, df):
        self.n_samples = len(df)
        self.dataframe = df

    def __getitem__(self, index):
        # Return a (label, text) tuple for the row at this position
        row = self.dataframe.iloc[index]
        return row['Class Index'], row['Text']

    def __len__(self):
        return self.n_samples

In [None]:
# now we convert the dataframe for the training and testing into datasets
train_dataset = NewsDataset(train_df)
test_dataset = NewsDataset(test_df)
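
To make the role of `__len__` and `__getitem__` concrete, here is a rough plain-Python sketch of how a loader can walk over such a dataset in batches. `ToyDataset` and `iterate_batches` are hypothetical names used only for illustration; the real `DataLoader` additionally handles shuffling, worker processes, and more, so treat this as a simplified picture rather than its implementation.

```python
class ToyDataset:
    """Minimal object exposing the two methods a loader relies on."""

    def __init__(self, rows):
        self.rows = rows

    def __getitem__(self, index):
        return self.rows[index]

    def __len__(self):
        return len(self.rows)


def iterate_batches(dataset, batch_size, collate_fn=list):
    """Yield consecutive batches, passing each list of samples to collate_fn."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        samples = [dataset[i] for i in range(start, stop)]
        yield collate_fn(samples)


rows = [(1, "a"), (2, "b"), (3, "c")]  # made-up (label, text) pairs
batches = list(iterate_batches(ToyDataset(rows), batch_size=2))
print(batches)
```

The last batch is smaller than `batch_size` because the toy dataset has an odd number of rows.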

# Tokenization

Neural networks work with numbers, but our dataset is made up of text. To pass the text to a neural network we need to convert it into numbers, a process called vectorization. Before we can do that, we have to tokenize the dataset, that is, break the text down into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenization method used.

To convert text to numbers, we need to build a vocabulary of all tokens. We first count the tokens using a Counter object, and then create a Vocab object from those counts to help us with vectorization.
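
As a small illustration of word-level tokenization, the sketch below lowercases a sentence and splits it into word and punctuation tokens with a regular expression. `simple_tokenize` is a hypothetical helper, not the `basic_english` tokenizer from torchtext, which applies its own normalization rules and may produce different tokens.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # Runs of letters/digits become one token; any other non-space
    # character becomes a token of its own.
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


tokens = simple_tokenize("Hello, world!")
print(tokens)
```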

In [None]:
counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(torchtext.data.utils.ngrams_iterator(tokenizer(line),ngrams=1))
vocab = torchtext.vocab.Vocab(counter, min_freq=1)
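
The idea behind building a vocabulary from counts can be sketched in plain Python. `build_vocab` below is a hypothetical helper, and the special tokens and their ordering are assumptions chosen for illustration; the `Vocab` class may order and reserve entries differently.

```python
from collections import Counter


def build_vocab(token_lists, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (stoi, itos) for tokens that occur at least min_freq times."""
    counter = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent tokens first; ties are broken alphabetically
    kept = sorted(
        (tok for tok, count in counter.items() if count >= min_freq),
        key=lambda tok: (-counter[tok], tok),
    )
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos


corpus = [["stocks", "rise"], ["stocks", "fall"], ["team", "wins"]]
stoi, itos = build_vocab(corpus)
stoi_pruned, itos_pruned = build_vocab(corpus, min_freq=2)
print(len(itos), len(itos_pruned))
```

Raising `min_freq` shrinks the vocabulary by dropping rare tokens, which in turn shrinks the embedding table the model has to learn.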

Next, we use the `vocab.stoi` dictionary from `torchtext` to convert tokens into their numeric indices, with `stoi` standing for "string to integer". This is what the `encode` function does. Conversely, the `vocab.itos` list ("integer to string") lets us map indices back to tokens, as seen in the `decode` function.

In [None]:
vocab_size = len(vocab)
print(f"Vocab size is {vocab_size}")

def encode(x):
    return [vocab.stoi[s] for s in tokenizer(x)]

def decode(x):
    return [vocab.itos[i] for i in x]
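
One consequence worth seeing on a toy example: encoding followed by decoding is not always lossless. The sketch below assumes that tokens missing from the vocabulary fall back to a reserved `<unk>` index; the names `toy_stoi`, `toy_itos`, `toy_encode`, and `toy_decode` are hypothetical stand-ins for the notebook's objects.

```python
toy_itos = ["<unk>", "stocks", "rise", "fall"]
toy_stoi = {tok: i for i, tok in enumerate(toy_itos)}
UNK_INDEX = toy_stoi["<unk>"]


def toy_encode(tokens):
    # Tokens not in the vocabulary are mapped to the <unk> index (assumption)
    return [toy_stoi.get(tok, UNK_INDEX) for tok in tokens]


def toy_decode(indices):
    return [toy_itos[i] for i in indices]


ids = toy_encode(["stocks", "soar"])
round_trip = toy_decode(ids)
print(ids, round_trip)
```

The unseen word "soar" comes back as `<unk>`, so words that never appeared in the training data carry no word-specific signal for the model.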

The padify function turns a batch of samples into tensors for the model. It takes a batch b (a list of tuples, each containing a label and a text). The texts are encoded into lists of token indices, and every sequence is padded with zeros on the right up to the length of the longest sequence in the batch. The function returns a tuple of two tensors: the first holds the labels, shifted by subtracting 1 so that they start at 0 as `CrossEntropyLoss` expects, and the second holds the padded features. Padding gives every sequence in a batch the same length, so they can be stacked into a single tensor.

In [None]:
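def padify(b):
    # b is a list of (label, text) tuples
    v = [encode(x[1]) for x in b]
    # Length of the longest encoded sequence in this batch
    max_len = max(map(len,v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0]-1 for t in b]),  # shift 1-based labels to 0-based
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,max_len-len(t)),mode='constant',value=0) for t in v])
    )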
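
The same padding logic can be followed on plain lists. This is only a sketch with made-up token indices; `pad_batch` is a hypothetical helper that mirrors the right-padding with zeros and the label shift described above.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Made-up batch of (label, encoded_text) pairs with 1-based labels
batch = [(3, [5, 8, 2]), (1, [7]), (4, [9, 4])]
labels = [label - 1 for label, _ in batch]
features = pad_batch([ids for _, ids in batch])
print(labels)
print(features)
```

Because padding is computed per batch, a batch of short articles stays short instead of being padded to the longest article in the whole dataset.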

# Define the RNN Classifier
We will now define a simple RNN classifier. The RNNClassifier class is a PyTorch model for text classification. It consists of an embedding layer that converts token indices into dense vectors, a simple Recurrent Neural Network (RNN) that processes the sequence, and a fully connected layer that maps the pooled RNN output to one score (logit) per class. During the forward pass, the model computes embeddings, passes them through the RNN, averages the RNN outputs over the sequence dimension (mean pooling), and feeds the result to the fully connected layer. The scores are not normalized into probabilities here; `CrossEntropyLoss` applies the softmax internally during training.

In [None]:
class RNNClassifier(torch.nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = torch.nn.Embedding(vocab_size, embed_dim)
        self.rnn = torch.nn.RNN(embed_dim,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, num_class)

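    def forward(self, x):
        # x: (batch, seq_len) tensor of token indices
        x = self.embedding(x)          # (batch, seq_len, embed_dim)
        x,h = self.rnn(x)              # x: (batch, seq_len, hidden_dim)
        # Mean pooling over the sequence dimension, then classify
        return self.fc(x.mean(dim=1))  # (batch, num_class)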
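
One design point to be aware of: averaging over the whole sequence dimension also averages over padded positions. The plain-Python sketch below contrasts a plain mean with a masked mean that skips padding. The per-step vectors are made up, padded steps are shown as zero vectors purely for simplicity (a real RNN still produces non-zero outputs there), and `mean_pool` and `masked_mean_pool` are hypothetical helpers, not part of the model above.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def masked_mean_pool(vectors, token_ids, pad_value=0):
    """Average only the vectors whose token is not padding.

    Assumes the sequence contains at least one non-padding token.
    """
    kept = [vec for vec, tok in zip(vectors, token_ids) if tok != pad_value]
    return mean_pool(kept)


token_ids = [5, 8, 0, 0]  # two real tokens followed by two padding steps
outputs = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]

plain = mean_pool(outputs)
masked = masked_mean_pool(outputs, token_ids)
print(plain, masked)
```

With the plain mean, short texts in a batch of long ones are diluted by their padding; whether that matters in practice for this dataset is something to check empirically.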

In [None]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, collate_fn=padify, shuffle=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
model = RNNClassifier(vocab_size,64,32,len(classes)).to(device)
lr = 0.01
report_freq=200
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(),lr=lr)
loss_fn = loss_fn.to(device)
model.train()
total_loss,acc,count,i = 0,0,0,0
for labels,features in train_loader:
    optimizer.zero_grad()
    features, labels = features.to(device), labels.to(device)
    out = model(features)
    loss = loss_fn(out,labels)
    loss.backward()
    optimizer.step()
    total_loss+=loss.item()  # accumulate a Python float so the autograd graph is not kept alive
    _,predicted = torch.max(out,1)
    acc+=(predicted==labels).sum().item()
    count+=len(labels)
    i+=1
    if i%report_freq==0:
        print(f"{count}: acc={acc/count}, loss={total_loss/i}")
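
To see what the loss and the prediction step compute for a single sample, here is a plain-Python sketch of softmax cross-entropy on one vector of class scores. The logits are made up, and `softmax` and `cross_entropy` are hypothetical helpers; `torch.nn.CrossEntropyLoss` performs the equivalent computation on tensors and, by default, averages over the batch.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


logits = [2.0, 0.5, 0.1, -1.0]  # made-up scores for the four classes
probs = softmax(logits)
loss_if_class_0 = cross_entropy(logits, 0)
loss_if_class_3 = cross_entropy(logits, 3)
# The predicted class is the index of the largest score
predicted = max(range(len(logits)), key=logits.__getitem__)
print(predicted, loss_if_class_0, loss_if_class_3)
```

The loss is small when the true class has the highest score and grows as the model puts less probability on it.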


In [None]:
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, collate_fn=padify, shuffle=True)


model.eval()

with torch.no_grad():
    # Inspect the first sample of one (shuffled) test batch
    for target, data in test_loader:
        sample_idx = 0
        word_lookup = [vocab.itos[w] for w in data[sample_idx]]
        # Hide '<unk>' entries when displaying the text
        unknown_vals = {'<unk>'}
        word_lookup = [ele for ele in word_lookup if ele not in unknown_vals]
        print('Input text:\n {}\n'.format(word_lookup))

        data, target = data.to(device), target.to(device)
        pred = model(data)
        actual_idx = target[sample_idx].item()
        predicted_idx = pred[sample_idx].argmax(0).item()
        print("Actual:\nvalue={}, class_name= {}\n".format(actual_idx, classes[actual_idx]))
        print("Predicted:\nvalue={}, class_name= {}\n".format(predicted_idx, classes[predicted_idx]))
        break
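
Looking at a single sample only gives a spot check. If the predicted and actual class indices for the whole test set were collected into two lists, a per-class breakdown could be computed as sketched below. The two lists here are made-up illustrative values, not results from this model, and `per_class_accuracy` is a hypothetical helper.

```python
from collections import Counter


def per_class_accuracy(actual, predicted, class_names):
    """Return {class_name: fraction of that class predicted correctly}."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {class_names[i]: correct[i] / totals[i] for i in sorted(totals)}


class_names = ['World', 'Sports', 'Business', 'Sci/Tech']
actual = [0, 0, 1, 2, 3, 3]      # made-up labels for illustration only
predicted = [0, 2, 1, 2, 3, 1]   # made-up predictions for illustration only

report = per_class_accuracy(actual, predicted, class_names)
overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(report, overall)
```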