# Text Classification With Torchtext

This tutorials shows how to use the text classification datasets in `torchtext`, including: 

- AG_NEWS
- SogouNews
- DBPedia
- YelpReviewPolarity
- YelpReivewFull
- YahooAnswers
- AmazonReviewPolarity
- AmazonReviewFull

This examples shows how to train supervised leanring algorithm for classification using one of these `TextClassification` datasets.

## Load data with ngrams

A bag of ngrams fetures is applied to capture some partial information about the local word order. In practice, bi-gram or tri-gram are applied to provide more benefits as word groups than only one word. An exmaple:

```python
"load data with ngrams"
Bi-grams results: "load data", "data with", "with ngrams"
Tri-grams results: "load data with", "data with ngrams"
```

`TextClassification` Dataset supports the ngrams method. By setting ngram to 2, the example text in the dataset will be a list of single words plus bi-gram string.

In [1]:
import os
import torch
import torchtext
from torchtext.datasets import text_classification

NGRAMS = 2

if not os.path.isdir('./data'):
    os.mkdir('./data')

ret = text_classification.DATASETS['AG_NEWS'](root='./data', 
                                              ngrams=NGRAMS,
                                             vocab=None)
train_dataset, test_dataset = ret
BATCH_SIZE = 16
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


120000lines [00:07, 16169.64lines/s]
120000lines [00:12, 9247.07lines/s]
7600lines [00:00, 9335.48lines/s]


## Define the model

The model is composed of the Embedding layer and the linear layer. `nn.EmbeddingBag` computes the mean value of a "bag" of embedding. The text entries here have different lengths. `nn.EmbeddingBag` requires no padding here since the text lengths are saed in offsets.

Additionally, since `nn.EmbeddingBag` accumulates the average across the embeddings on the fly, `nn.EmbeddingBag` can enhance the performance and memory efficiency to proces a sequence of tensors.

![](https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png)

In [2]:
import torch.nn as nn
import torch.nn.functional as F

class TextSentiment(nn.Module):
    
    def __init__(self, vocal_size, embed_dim, num_class):
        super().__init__()
        
        self.embedding = nn.EmbeddingBag(vocal_size, embed_dim, 
                                         sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()
        
    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
        
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)

## Initiate an instance

The AG_NEWS dataset has four labels and therefore the number of classes is four.

1. Word
2. Sports
3. Business
4. Sci/Tec

The vocab size is equal to the length of vocab (including single wod and ngrams). The number of classes is equal to the number of labels, which four in AG_NEWS case.

In [3]:
VOCAB_SIZE = len(train_dataset.get_vocab())
EMBED_DIM = 32
NUM_CLASS = len(train_dataset.get_labels())
model = TextSentiment(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

## Functions used to generate batch

Since the text entries have different lengths, a custom function generate_batch() is used to generate batches and offses. The function is passed to `collate_fn` in `torch.utils.data.DataLoader`. The input to `collate_fn` is a list of tensors with the size of batch_size, and the `collate_fn` function packs them into a mini-batch. Pay attention here and maek sure that `collate_fn` is declared as a top level def. This ensures that function is available in each worker.

The text entries in the original data batch input are packed into a list and concatenated as a single tensor as the input of `nn.EmbeddingBag`. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is tensor saving the labels of individual text entries.

In [4]:
def generate_batch(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    offsets = [0] + [len(entry) for entry in text]
    
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label

## Define functions to train the model and evaluate results

[torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) is recommended for PyTorch users, and it makes data loading in parallel easily. We use `DataLoader` here to load AG_NEWS datasets and send ti to the model for training/validation.

In [5]:
from torch.utils.data import DataLoader

def train_func(sub_train_):
    # Train the model
    train_loss = 0
    train_acc = 0
    data = DataLoader(sub_train_, batch_size=BATCH_SIZE,
                     shuffle=True, collate_fn=generate_batch)
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)
        loss = criterion(output, cls)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        train_acc += (output.argmax(1) == cls).sum().item()
        
    # Adjust the learning rate
    scheduler.step()
    
    return train_loss / len(sub_train_), train_acc / len(sub_train_)

def test(data_):
    loss = 0
    acc = 0
    data = DataLoader(data_, batch_size=BATCH_SIZE,
                      collate_fn=generate_batch)

    for text, offsets, cls in data:
        with torch.no_grad():
            output = model(text, offsets)
            loss = criterion(output, cls)
            loss += loss.item()
            acc += (output.argmax(1) == cls).sum().item()
            
    return loss / len(data_), acc / len(data_)

## Split the dataset and run the model

Since the orginal AG_NEWS has no valid dataset, we split the training dataset into train/valid sets  with a split ratio of 0.95(train) and 0.05(valid). Here we use [torch.utils.data.dataset.random_split](https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split)function in PyTorch core library.

[CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss) criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class. It is useful when training a classification problem with C classes. SGD.  implements stochastic gradient descent method as optimizer. The initial learning rate is set to 4.0. [StepLR](https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR) is used here to ajust the learning rate through epochs.

In [6]:
import time
from torch.utils.data.dataset import random_split

N_EPOCHS = 5
min_valid_loss = float('inf')

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=4.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_len = int(len(train_dataset) * 0.95)
sub_train_, sub_valid_ = random_split(train_dataset,
                                     [train_len, len(train_dataset)-train_len])

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train_func(sub_train_)
    valid_loss, valid_acc = test(sub_valid_)
    
    secs = int(time.time() - start_time) 
    mins = secs / 60
    secs = secs % 60
    
    print('Epoch: %d' % (epoch+1), ' | time in %s minutes, %d seconds' % (mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

Epoch: 1  | time in 0.4 minutes, 24 seconds
	Loss: 0.0263(train)	|	Acc: 84.7%(train)
	Loss: 0.0001(valid)	|	Acc: 90.5%(valid)
Epoch: 2  | time in 0.38333333333333336 minutes, 23 seconds
	Loss: 0.0119(train)	|	Acc: 93.7%(train)
	Loss: 0.0001(valid)	|	Acc: 91.3%(valid)
Epoch: 3  | time in 0.38333333333333336 minutes, 23 seconds
	Loss: 0.0070(train)	|	Acc: 96.4%(train)
	Loss: 0.0001(valid)	|	Acc: 91.5%(valid)
Epoch: 4  | time in 0.38333333333333336 minutes, 23 seconds
	Loss: 0.0039(train)	|	Acc: 98.1%(train)
	Loss: 0.0001(valid)	|	Acc: 91.7%(valid)
Epoch: 5  | time in 0.36666666666666664 minutes, 22 seconds
	Loss: 0.0023(train)	|	Acc: 99.0%(train)
	Loss: 0.0001(valid)	|	Acc: 91.7%(valid)


## Test on a random news

In [7]:
import re
from torchtext.data.utils import ngrams_iterator, get_tokenizer


ag_news_label = {
    1: 'World',
    2: 'Sports',
    3: 'Business',
    4: 'Sci/Tec'
}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer('basic_english')
    with torch.no_grad():
        text = torch.tensor([vocab[token] for token in ngrams_iterator(tokenizer(text), ngrams)])

        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1
    
ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
print('This is a %s news' % ag_news_label[predict(ex_text_str, model, vocab, 2)])

This is a Sports news
