
# CS6493 - Tutorial 2


## Advanced applications with PyTorch - Text classification on AG_NEWS

Text classification is a typical NLP task in which a model assigns one of a set of predefined categories to open-ended text. Text classifiers can organize, structure, and categorize almost any kind of text, including documents, medical studies, files, and web content.

For example, this tutorial uses the AG_NEWS dataset, which classifies news articles into 4 categories:

ag_news_label = {0: "World",
                 1: "Sports",
                 2: "Business",
                 3: "Sci/Tech"}

In this tutorial, we show how to use the **datasets** library to load the raw data and **torchtext** to build the dataset for text classification. You will have the flexibility to

   - Access the raw data
   - Build a data processing pipeline that converts the raw text strings into ``torch.Tensor`` objects that can be used to train the model
   - Shuffle and iterate over the data with `torch.utils.data.DataLoader`
---

To use torchtext on JupyterHub, install the torchtext release that matches your torch version. The compatible version pairs are listed at https://github.com/pytorch/text. Note that pinning torchtext 0.11.0, as done below, also replaces torch 2.0.0 with torch 1.10.0.

In [1]:
!pip list | grep torch

torch                         2.0.0+cu118
torchaudio                    2.0.1+cu118
torchdata                     0.6.0
torchsummary                  1.5.1
torchtext                     0.15.1
torchvision                   0.15.1+cu118


In [2]:
!pip install torchtext==0.11.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.11.0
  Downloading torchtext-0.11.0-cp39-cp39-manylinux1_x86_64.whl (8.0 MB)
Collecting torch==1.10.0
  Downloading torch-1.10.0-cp39-cp39-manylinux1_x86_64.whl (881.9 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 2.0.0+cu118
    Uninstalling torch-2.0.0+cu118:
      Successfully uninstalled torch-2.0.0+cu118
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.15.1
    Uninstalling torchtext-0.15.1:
      Successfully uninstalled torchtext-0.15.1
ERROR: pip's dependency resolver does not currently take into account all the p

In [3]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py39-none-any.whl (132 kB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39

## Loading raw dataset

The Hugging Face `datasets` library provides a large number of raw datasets. For example, the ``ag_news`` dataset can be downloaded and loaded with the script below.


In [4]:
import torch
import random
torch.manual_seed(42)
random.seed(42)

In [5]:
from datasets import load_dataset

# Download AG_NEWS from the Hugging Face Hub, or reuse the cached copy.
ag_news = load_dataset('ag_news')

Downloading builder script:   0%|          | 0.00/4.06k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.65k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading and preparing dataset ag_news/default to /root/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548...


Downloading data:   0%|          | 0.00/11.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/751k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/120000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7600 [00:00<?, ? examples/s]

Dataset ag_news downloaded and prepared to /root/.cache/huggingface/datasets/ag_news/default/0.0.0/bc2bcb40336ace1a0374767fc29bb0296cdaf8a6da7298436239c54d79180548. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Now, we can look inside this variable:

In [6]:
print(ag_news)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})


It is stored in a `DatasetDict` and is already split into train and test sets. To access a split, index it with the standard syntax of a Python `dict`.

In [7]:
train_split = ag_news['train']
print(train_split)
print("Features of the train_split: ", train_split.features, '\n')
print('Text in Sample 0: ', train_split[0]['text'], '\n')
print('Label of Sample 0: ',train_split[0]['label'], '\n')
print('Map the Label Back to the Original String:', train_split.features['label'].int2str(train_split[0]['label']), '\n')

Dataset({
    features: ['text', 'label'],
    num_rows: 120000
})
Features of the train_split:  {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'], id=None)} 

Text in Sample 0:  Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again. 

Label of Sample 0:  2 

Map the Label Back to the Original String: Business 



There are two ways to access the `text` in a `Dataset` object: index a row first, as in `train_split[0]['text']` above, or select the column first.
In the column-first form, we can treat the result as a **list** that contains all the `text` samples:

In [8]:
print('The type of the train_split["text"] is: ', type(train_split['text']))
print("train_split['text'][0]:", train_split['text'][0])

The type of the train_split["text"] is:  <class 'list'>
train_split['text'][0]: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.


Some of the more powerful applications of 🤗 Datasets come from using the map() function. 
The primary purpose of map() is to speed up processing functions. 
It allows you to apply a processing function to each example in a dataset, independently or in batches.

Start by creating a function that adds 'News: ' to the beginning of each sentence. The function needs to accept and output a dict:

In [9]:
def add_prefix(example):
    example["text"] = 'News: ' + example["text"]
    return example

# Now use map() to apply the add_prefix function to the entire dataset:

updated_ag_news = ag_news.map(add_prefix)
updated_ag_news['train']["text"][:5]

Map:   0%|          | 0/120000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]

["News: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'News: Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
 "News: Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.",
 'News: Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.',
 'News: Oil prices soar to all-time record, posi
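
The description above mentions that `map()` can also process examples in batches. A minimal sketch of what a batched version of `add_prefix` could look like follows. The name `add_prefix_batched` and the toy batch are illustrative assumptions, not part of the original notebook; the only library feature relied on is the `batched=True` argument of `map()`, shown in a comment so that the sketch runs without downloading anything.

```python
# Hypothetical batched counterpart of add_prefix. In batched mode the
# function receives a dict of lists (one list per column) instead of a
# single example dict, and must return a dict of lists.
def add_prefix_batched(batch):
    batch["text"] = ["News: " + text for text in batch["text"]]
    return batch

# A plain dict stands in for one batch here (made-up data).
toy_batch = {"text": ["Stocks rally.", "Team wins final."], "label": [2, 1]}
result = add_prefix_batched(toy_batch)
print(result["text"])

# With the real dataset, the call would look like:
# updated_ag_news = ag_news.map(add_prefix_batched, batched=True)
```

Processing a whole batch per call reduces per-example Python overhead, which is why batched mode is usually the faster option for simple string operations.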

## Prepare data processing pipelines

This tutorial uses torchtext to generate the dataset. The torchtext library has three basic components: a **tokenizer**, a **vocabulary**, and **word vectors**.

*   A **tokenizer** breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, new lines). Here, we use the predefined tokenizer in torchtext.
*   The **vocab** object is built from the training set and is used to numericalize tokens, that is, to map each token to an integer index. Tokens that are not in the vocabulary are mapped to `<unk>`.
*   After mapping each token to a numerical index according to the constructed vocabulary, we convert the indices into **word vectors** in the model definition part below.

Here is an example of typical NLP data processing with a tokenizer and a vocabulary. The first step is to build a vocabulary from the raw training dataset. We use the built-in factory function `build_vocab_from_iterator`, which accepts an iterator that yields lists (or iterators) of tokens. We can also pass any special symbols to be added to the vocabulary.



In [10]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

def yield_tokens(data):
    for text in data['text']:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_split), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

The vocabulary block converts a list of tokens into integers.

In [11]:
vocab(['here', 'is', 'an', 'example', '.'])

[476, 22, 31, 5298, 2]
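
To make the tokenizer and vocabulary steps concrete, here is a rough pure-Python sketch of the same idea. It is a simplified stand-in, not torchtext's implementation: `simple_tokenizer`, `build_simple_vocab`, and `lookup` are hypothetical helpers, the tokenizer only handles periods and commas, and the ordering rule (specials first, then descending frequency with alphabetical tie-breaking) is an assumption made for illustration.

```python
from collections import Counter

def simple_tokenizer(text):
    # Simplified stand-in for a basic English tokenizer: lowercase,
    # separate periods and commas, then split on whitespace.
    text = text.lower().replace(".", " . ").replace(",", " , ")
    return text.split()

def build_simple_vocab(token_lists, specials=("<unk>", "<pad>")):
    # Count every token produced by the iterator.
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Specials come first; remaining tokens are ordered by descending
    # frequency, with ties broken alphabetically (an assumption here).
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, default="<unk>"):
    # Unknown tokens fall back to the index of the default token,
    # mirroring the role of set_default_index.
    return [stoi.get(tok, stoi[default]) for tok in tokens]

corpus = ["The cat sat.", "The dog sat, the cat ran."]
stoi = build_simple_vocab(simple_tokenizer(t) for t in corpus)
ids = lookup(stoi, simple_tokenizer("The bird sat."))
print(ids)  # "bird" is not in the toy corpus, so it maps to index 0
```

The word "bird" never appears in the toy corpus, so it receives the index of `<unk>`; every other token gets the index assigned when the vocabulary was built.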

Then, we prepare the text processing pipeline with the tokenizer and vocabulary. The text pipeline will be used to process the raw data strings from the dataset.

In [12]:
text_pipeline = lambda x: vocab(tokenizer(x))

The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. For example,

In [13]:
text = "here is an example."
print(f"Raw text: '{text}' \nWord ids: {text_pipeline(text)}")

Raw text: 'here is an example.' 
Word ids: [476, 22, 31, 5298, 2]


## Generate data batch and iterator 

[torch.utils.data.DataLoader](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) is recommended for PyTorch users.
It works with a map-style dataset that implements the ``__getitem__()`` and ``__len__()`` protocols, and represents a map from indices/keys to data samples. It also works with an iterable dataset when the shuffle argument is ``False``.

Before a batch is sent to the model, the ``collate_fn`` function processes the samples drawn by ``DataLoader``. The input to ``collate_fn`` is a list of samples whose length is the batch size set in ``DataLoader``, and ``collate_fn`` processes them according to the data processing pipelines declared previously. Make sure that ``collate_fn`` is declared as a top-level definition. This ensures that the function is available in each worker.

In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of ``nn.EmbeddingBag``. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of individual text entries.
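
For reference, the map-style protocol mentioned above amounts to two methods. The class below is a hypothetical toy (the name `ToyNewsDataset` and its data are made up) that returns samples in the same `{'text': ..., 'label': ...}` shape as the Hugging Face dataset used in this tutorial. It is plain Python and only illustrates the protocol.

```python
class ToyNewsDataset:
    """Minimal map-style dataset: supports len(ds) and ds[i]."""

    def __init__(self, texts, labels):
        if len(texts) != len(labels):
            raise ValueError("texts and labels must have the same length")
        self.texts = texts
        self.labels = labels

    def __len__(self):
        # Number of samples; a loader needs this to build index batches.
        return len(self.texts)

    def __getitem__(self, index):
        # Map an integer index to one sample dict.
        return {"text": self.texts[index], "label": self.labels[index]}

toy = ToyNewsDataset(["Stocks rally.", "Team wins final."], [2, 1])
print(len(toy), toy[1])
```

Any object of this shape can be indexed sample by sample, which is what allows shuffling: the loader only has to permute the indices it asks for.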



In [14]:
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

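def collate_batch(batch):
    # offsets starts with 0 because the first sample begins at index 0.
    label_list, text_list, offsets = [], [], [0]
    for sample in batch:
        label_list.append(sample['label'])
        # Convert the raw string into a 1-D tensor of token ids.
        processed_text = torch.tensor(text_pipeline(sample['text']), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length, then take the cumulative sum to get each sample's start index.
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Concatenate all samples into one flat tensor for nn.EmbeddingBag.
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)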

train_split = ag_news['train']
dataloader = DataLoader(train_split, batch_size=8, shuffle=False, collate_fn=collate_batch)
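
The offsets arithmetic in `collate_batch` can be checked by hand. The sketch below repeats it in plain Python on three made-up token-id lists (the ids are illustrative, not real vocabulary lookups) and then shows that the flat list plus the offsets is enough to recover the original samples.

```python
from itertools import accumulate

# Three made-up samples of different lengths (3, 2, and 4 tokens).
token_ids_per_sample = [[476, 22, 31], [5298, 2], [22, 31, 476, 2]]

lengths = [len(ids) for ids in token_ids_per_sample]
# Same arithmetic as torch.tensor([0] + lengths)[:-1].cumsum(dim=0):
# each offset is the start index of a sample in the flat list.
offsets = list(accumulate([0] + lengths[:-1]))

# Concatenate all samples, as torch.cat does for the tensors.
flat = [token for ids in token_ids_per_sample for token in ids]

# Recover the samples by slicing between consecutive offsets.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds[:-1], bounds[1:])]

print(offsets)    # [0, 3, 5]
print(recovered == token_ids_per_sample)  # True
```

Because the boundaries are carried by `offsets`, no padding token is needed to mark where one text ends and the next begins.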

## Define the model

The model is composed of the [nn.EmbeddingBag](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer plus a linear layer for the classification purpose. ``nn.EmbeddingBag`` with the default mode of "mean" computes the mean value of a “bag” of embeddings. Although the text entries here have different lengths, `nn.EmbeddingBag` module requires no padding here since the text lengths are saved in offsets.

Additionally, since ``nn.EmbeddingBag`` accumulates the average across
the embeddings on the fly, ``nn.EmbeddingBag`` can enhance the
performance and memory efficiency to process a sequence of tensors.

In [15]:
from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
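
To see what the "mean" mode computes, the following plain-Python sketch reproduces the arithmetic on a tiny hand-written embedding table. The function `embedding_bag_mean` and all numbers are illustrative assumptions; the real layer works on tensors and learns its weights, and this sketch assumes every bag is non-empty.

```python
def embedding_bag_mean(weight, flat_ids, offsets):
    """Average the embedding rows of each bag delimited by offsets."""
    bounds = list(offsets) + [len(flat_ids)]
    dim = len(weight[0])
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Look up the embedding row of every token in this bag.
        rows = [weight[i] for i in flat_ids[start:end]]
        # Element-wise mean over the rows (bags assumed non-empty).
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled

# Toy table: 4 tokens, embedding dimension 2.
weight = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flat_ids = [1, 2, 3, 1, 2]  # two texts concatenated
offsets = [0, 3]            # the second text starts at index 3

pooled = embedding_bag_mean(weight, flat_ids, offsets)
print(pooled)  # [[3.0, 4.0], [2.0, 3.0]]
```

Each text, whatever its length, is reduced to one vector of size `embed_dim`, which is what lets the linear layer in `TextClassificationModel` accept a fixed-size input.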

## Initiate an instance

The ``AG_NEWS`` dataset has four labels, so this is a four-class classification task with the following labels:

`{0 : "World", 1 : "Sports", 2 : "Business", 3 : "Sci/Tech"}`

We build a model with an embedding dimension of 64. The vocab size is equal to the length of the vocabulary instance. The number of classes is equal to the number of labels.




In [16]:
num_class = len(set([label for label in train_split['label']]))
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

Define functions to train the model and evaluate results.
---------------------------------------------------------




In [17]:
import time

def train(dataloader):
    # Relies on the globals `model`, `optimizer`, `criterion`, and `epoch`.
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        # Clip the global gradient norm to at most 0.1.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            # Report the accuracy over the batches since the last log line.
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    # No gradients are needed for evaluation.
    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
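
The training loop calls `clip_grad_norm_` with a maximum norm of 0.1. The sketch below shows the underlying arithmetic on a flat list of made-up gradient values. The helper `clip_grad_norm` is a hypothetical pure-Python illustration, and the small `eps` in the denominator is an assumption made for numerical safety; the real function operates in place on parameter tensors.

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds max_norm; never scale up.
    clip_coef = min(1.0, max_norm / (total_norm + eps))
    return [g * clip_coef for g in grads], total_norm

# A large gradient (norm 5.0) is scaled down to norm of about 0.1.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=0.1)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# A small gradient (norm 0.05) is left unchanged.
unchanged, _ = clip_grad_norm([0.03, 0.04], max_norm=0.1)
print(unchanged)
```

Clipping rescales the whole gradient vector by one factor, so its direction is preserved and only its length is capped.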


Split the dataset and run the model
-----------------------------------

Since the original AG_NEWS has no *valid* dataset, we split the training
dataset into train/valid sets with a split ratio of 0.9 (train) and
0.1 (valid). Here we use the `train_test_split` method of the
🤗 Datasets `Dataset` object.

We use the [CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss) criterion, which combines ``nn.LogSoftmax()`` and ``nn.NLLLoss()`` in a single class, to supervise the training process.
It is useful when training a classification problem with C classes.
We use [SGD](https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html) with the step learning rate scheduler [StepLR](https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR) to update the model parameters. The initial learning rate is set to 5.0, and the scheduler is stepped only after an epoch in which the validation accuracy drops below the best value seen so far.
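
The statement that cross-entropy combines a log-softmax with a negative log-likelihood can be verified on a single example. The sketch below does so in plain Python for one sample with four made-up logits; `log_softmax` and `cross_entropy` are hypothetical helper names, and the real criterion additionally averages over the batch.

```python
import math

def log_softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def cross_entropy(logits, target):
    # Negative log-likelihood of the target class under the softmax.
    return -log_softmax(logits)[target]

logits = [2.0, 0.5, -1.0, 0.0]  # made-up scores for the 4 classes
loss_correct = cross_entropy(logits, target=0)  # class with the highest score
loss_wrong = cross_entropy(logits, target=2)    # class with the lowest score
print(loss_correct, loss_wrong)

# With identical logits the prediction is uniform and the loss is log(C).
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=1)
print(loss_uniform, math.log(4))
```

The loss is small when the target class has the highest score and grows as the target's score falls behind, which is the signal that drives training.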


In [18]:
# Hyperparameters
EPOCHS = 10      # number of training epochs
LR = 5           # initial learning rate
BATCH_SIZE = 64  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
# Each scheduler.step() call multiplies the learning rate by gamma.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)
total_accu = None  # best validation accuracy seen so far
train_dataset, test_dataset = ag_news['train'], ag_news['test']
# Hold out 10% of the training data for validation.
splited_ = train_dataset.train_test_split(test_size=0.1)
split_train_, split_valid_ = splited_['train'], splited_['test']

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    # Decay the learning rate only when validation accuracy drops.
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 1688 batches | accuracy    0.689
| epoch   1 |  1000/ 1688 batches | accuracy    0.856
| epoch   1 |  1500/ 1688 batches | accuracy    0.878
-----------------------------------------------------------
| end of epoch   1 | time: 16.72s | valid accuracy    0.876 
-----------------------------------------------------------
| epoch   2 |   500/ 1688 batches | accuracy    0.897
| epoch   2 |  1000/ 1688 batches | accuracy    0.900
| epoch   2 |  1500/ 1688 batches | accuracy    0.901
-----------------------------------------------------------
| end of epoch   2 | time: 16.47s | valid accuracy    0.895 
-----------------------------------------------------------
| epoch   3 |   500/ 1688 batches | accuracy    0.914
| epoch   3 |  1000/ 1688 batches | accuracy    0.913
| epoch   3 |  1500/ 1688 batches | accuracy    0.915
-----------------------------------------------------------
| end of epoch   3 | time: 18.67s | valid accuracy    0.899 
-------------------------------
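
The learning-rate logic in the loop above can be simulated without training anything. In the sketch below, `simulate_lr` is a hypothetical helper that replays the same rule (multiply the rate by `gamma` whenever validation accuracy falls below the best value so far). The first three accuracies match the log above; the last two are invented for illustration.

```python
def simulate_lr(valid_accuracies, lr=5.0, gamma=0.1):
    """Replay the decay rule of the training loop on a list of accuracies."""
    best = None
    history = []
    for acc in valid_accuracies:
        if best is not None and best > acc:
            # Corresponds to scheduler.step() with step_size=1.
            lr *= gamma
        else:
            best = acc
        history.append(lr)
    return history

# Epochs 1-3 come from the log; epochs 4-5 are made-up values.
accuracies = [0.876, 0.895, 0.899, 0.897, 0.901]
history = simulate_lr(accuracies)
print(history)
```

In this replay the rate stays at 5.0 while accuracy improves and drops by a factor of ten after the fourth epoch, the first time validation accuracy decreases.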

Evaluate the model with test dataset
------------------------------------




Checking the results of the test dataset…



In [19]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.908


Test on a random news
---------------------

Use the best model so far and test it on a golf news story.




In [20]:
ag_news_label = {0: "World",
                 1: "Sports",
                 2: "Business",
                 3: "Sci/Tech"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item()

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news


## Practice

Please try more experimental settings and hyperparameters to obtain better performance. You can consider the following aspects:

- Hyperparameters: batch size, learning rate, training epochs.
- The type of optimizer and learning rate scheduler.
- A more advanced network.

In [None]:
# insert your code