# Classifying News Articles Using LSTM

We'll be using TorchText, the PyTorch text library, which provides utilities for language processing.

The dataset we'll be using is AG News. AG News (AG's News Corpus) is a subset of AG's corpus of news articles, constructed by assembling the title and description fields of articles from the 4 largest classes ("World", "Sports", "Business", "Sci/Tech") of AG's Corpus. AG News contains 30,000 training and 1,900 test samples per class.

**Note:** AG_NEWS is not a PyTorch dataset class; it is a function that returns a PyTorch DataPipe. PyTorch DataPipes are used for data loading and processing in PyTorch.

More information on PyTorch DataPipes: https://pytorch.org/data/main/torchdata.datapipes.iter.html

Difference between DataSet and DataPipe: https://medium.com/deelvin-machine-learning/comparison-of-pytorch-dataset-and-torchdata-datapipes-486e03068c58

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import DataLoader
from torch.utils.data.dataset import Dataset
import torch.nn.functional as F

from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import torchtext.transforms as T

from tqdm.notebook import trange, tqdm

Import the AG News Dataset

In [None]:
data_set_root = "../../datasets"  # Root directory of the dataset
dataset_train = AG_NEWS(root=data_set_root, split="train")  # Training dataset
dataset_test = AG_NEWS(root=data_set_root, split="test")  # Test dataset

Define the hyperparameters

In [None]:
learning_rate = 1e-4  # Learning rate for the optimizer
nepochs = 20  # Number of training epochs
batch_size = 32  # Batch size for training
max_len = 128  # Maximum length of input sequences

## Tokenization
Tokenization is the process of splitting a text into individual words or tokens.

Here, we use the `get_tokenizer` function from the `torchtext.data.utils` module to create a tokenizer based on the "basic_english" tokenization method.

Then we define a generator function `yield_tokens` that yields the token list for each sample in the data iterator. The data iterator provides `(label, text)` pairs, where `text` is the input sentence. From these tokens we can build the vocabulary.
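To make the tokenization step concrete, here is a minimal pure-Python sketch of what a "basic_english"-style tokenizer roughly does: lowercase the text and split punctuation into separate tokens. The names `simple_tokenize` and `yield_tokens_demo` and the sample pairs are hypothetical, and the exact normalization rules inside TorchText may differ from this approximation.

```python
import re

# Assumption: this only approximates "basic_english"; TorchText's actual
# replacement rules may handle more (or different) characters.
_PUNCT = re.compile(r"([.,!?()'])")


def simple_tokenize(text):
    """Lowercase the text and isolate common punctuation as its own token."""
    text = text.lower()
    text = _PUNCT.sub(r" \1 ", text)  # surround punctuation with spaces
    return text.split()               # split on any run of whitespace


def yield_tokens_demo(pairs):
    """Yield one token list per (label, text) pair, ignoring the label."""
    for _, text in pairs:
        yield simple_tokenize(text)


# Hypothetical (label, text) pairs in the same shape as the data iterator
samples = [(3, "Wall St. Bears Claw Back"), (4, "Is it fast? Yes!")]
for tokens in yield_tokens_demo(samples):
    print(tokens)
```

Because the function is a generator, the token lists are produced one sample at a time, so the whole corpus never has to be held in memory while the vocabulary is counted.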

In [None]:
tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)


# We set min_freq=2 to include a token only if it appears at least 2 times in the dataset.

# We will also add "special" tokens that we'll use to signal something to our model
# <pad> is a padding token that is added to the end of a sentence to ensure
# the length of all sequences in a batch is the same
# <sos> signals the "Start-Of-Sentence" aka the start of the sequence
# <eos> signals the "End-Of-Sentence" aka the end of the sequence
# <unk> "unknown" token is used if a token is not contained in the vocab
vocab = build_vocab_from_iterator(
    yield_tokens(dataset_train),  # Tokenized data iterator
    min_freq=2,  # Minimum frequency threshold for token inclusion
    specials=['<pad>', '<sos>', '<eos>', '<unk>'],  # Special case tokens
    special_first=True  # Place special tokens first in the vocabulary
)

# Set the default index of the vocabulary to the index of the <unk> token.
# If a token is not found in the vocabulary, it will be replaced with the <unk> token.
vocab.set_default_index(vocab['<unk>'])
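
The vocabulary logic above can be sketched in plain Python to show how `min_freq`, the special tokens, and the `<unk>` fallback interact. This is a simplified stand-in, not TorchText's implementation: `build_vocab_sketch`, `encode`, the toy corpus, and the token ordering (most frequent first, ties alphabetical) are all assumptions made for illustration, as is the choice to truncate before adding `<sos>` and `<eos>`.

```python
from collections import Counter

SPECIALS = ("<pad>", "<sos>", "<eos>", "<unk>")


def build_vocab_sketch(token_lists, min_freq=2, specials=SPECIALS):
    """Map tokens to indices: specials first, then tokens seen >= min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Assumed ordering: descending frequency, alphabetical on ties
    kept = sorted(
        (t for t, c in counts.items() if c >= min_freq),
        key=lambda t: (-counts[t], t),
    )
    itos = list(specials) + [t for t in kept if t not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(tokens, stoi, max_len):
    """Wrap tokens in <sos>/<eos>, map unknowns to <unk>, pad to max_len."""
    unk = stoi["<unk>"]
    body = tokens[: max_len - 2]  # leave room for <sos> and <eos>
    ids = [stoi["<sos>"]] + [stoi.get(t, unk) for t in body] + [stoi["<eos>"]]
    ids += [stoi["<pad>"]] * (max_len - len(ids))  # right-pad with <pad>
    return ids


# Toy corpus: "dog" and "a" appear once, so min_freq=2 drops them
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat"]]
stoi = build_vocab_sketch(corpus, min_freq=2)

print(stoi)
print(encode(["the", "dog", "sat"], stoi, max_len=8))  # [1, 6, 3, 5, 2, 0, 0, 0]
```

Here "dog" falls below the frequency threshold, so it is encoded as index 3 (`<unk>`), which mirrors the role of `set_default_index` in the cell above.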