#Text Classification Example with Simple Recurrent Neural Network (RNN)

Source: https://coderzcolumn.com/tutorials/artificial-intelligence/pytorch-rnn-for-text-classification-tasks

Some examples use the torchtext `Field` API, which is deprecated, so look for recent examples when searching. Remember that in text classification and other NLP tasks, "text" covers not only the characters, words, and sentences of the language but also punctuation such as commas and exclamation marks. You will also need to install dependencies such as torchtext and spaCy.




In [14]:

!pip install torchtext
!pip install torchdata

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


#Building vocabulary

A slightly tricky part of text classification and other NLP tasks is the tokenization and vectorization step, where the words in the text are converted into numbers or vectors. Be aware that tokenization can also be done at the character level (e.g. for character classification or prediction tasks).
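To make the word-level versus character-level distinction concrete, here is a minimal sketch in plain Python. The `word_tokenize` and `char_tokenize` helpers are hypothetical and only approximate the kind of output a word tokenizer produces (lowercasing, punctuation as separate tokens); they are not torchtext's implementation.

```python
import re

def word_tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())

def char_tokenize(text):
    """Treat every character (including spaces) as one token."""
    return list(text.lower())

sentence = "Hello, world!"
word_tokens = word_tokenize(sentence)
char_tokens = char_tokenize(sentence)

print(word_tokens)       # ['hello', ',', 'world', '!']
print(len(char_tokens))  # 13
```

Word-level tokens give shorter sequences but a much larger vocabulary; character-level tokens give the opposite trade-off.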

In this example, functions such as `get_tokenizer` and `build_vocab_from_iterator` are used. When dealing with text datasets, it is common to use Python generators.
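Generators fit well here because they yield one tokenized sample at a time instead of building the whole token list in memory. A small sketch, assuming a made-up two-item dataset of `(label, text)` pairs in the same layout as AG_NEWS, with a plain `str.split` standing in for the real tokenizer:

```python
def yield_tokens(dataset):
    """Lazily yield one token list per (label, text) sample."""
    for _, text in dataset:
        yield text.lower().split()

# Hypothetical toy data, not taken from AG_NEWS.
toy_dataset = [(1, "Stocks rally on Wall Street"), (2, "Team wins the final")]

token_stream = yield_tokens(toy_dataset)
first = next(token_stream)      # only the first sample has been tokenized so far
remaining = list(token_stream)  # consume the rest of the stream

print(first)
print(remaining)
```

A consumer such as `build_vocab_from_iterator` can pull from a generator like this one sample at a time.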

Note that this example uses only `<UNK>` as a special token and has no `<pad>` token. This is most probably because the vocabulary takes all the unique words from the dataset; the collate function further down also pads short samples with index 0, which is the index of `<UNK>`. When the vocabulary is very large, it is common to keep only the most frequent words and ignore low-frequency ones (such as Pneumonoultramicroscopicsilicovolcanoconiosis).
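The idea of dropping low-frequency words can be sketched without torchtext. The helper below is a hypothetical illustration built on `collections.Counter`; in torchtext itself, `build_vocab_from_iterator` accepts a `min_freq` argument for the same purpose. Words below the cutoff fall back to the `<UNK>` index at lookup time.

```python
from collections import Counter

def build_vocab_with_cutoff(token_lists, min_freq=2, unk_token="<UNK>"):
    """Map tokens seen at least min_freq times to indexes; index 0 is <UNK>."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = [unk_token] + [tok for tok, c in counts.most_common() if c >= min_freq]
    return {tok: idx for idx, tok in enumerate(itos)}

token_lists = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird"]]
stoi = build_vocab_with_cutoff(token_lists, min_freq=2)

def lookup(tokens):
    """Convert tokens to indexes, sending unknown tokens to <UNK>."""
    return [stoi.get(tok, stoi["<UNK>"]) for tok in tokens]

print(stoi)                           # {'<UNK>': 0, 'the': 1, 'sat': 2}
print(lookup(["the", "cat", "sat"]))  # [1, 0, 2]
```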

In [15]:
import torch
import torchtext
from torchtext import data 
from torch.utils.data import DataLoader
from collections.abc import Iterable  # `collections.Iterable` was removed in Python 3.10


train, test = torchtext.datasets.AG_NEWS()

labels = ["world","sports","biz","science"]

from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer  =  get_tokenizer("basic_english")

def build_vocab(datasets):
    for dataset in datasets:
        for _, text in dataset:
            yield tokenizer(text)

vocab = build_vocab_from_iterator(build_vocab([train, test]), specials=["<UNK>"])
vocab.set_default_index(vocab["<UNK>"])
len(vocab.get_itos())


98635

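The vocabulary maps each unique token to its own integer index. Tokens that are not in the vocabulary, such as 'coderzcolumn' in the output below, map to index 0, the index of `<UNK>`.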

In [16]:
tokens = tokenizer("Hello how are you?, Welcome to CoderzColumn!!")
indexes = vocab(tokens)

print("Indexes for the sentence 'what is your name?'")
print(vocab(["what","is","your","name"]))

print("\n Tokens and indexes for the sentence 'Hello how are you?, Welcome to CoderzColumn!!'")
tokens, indexes


Indexes for the sentence 'what is your name?'
[183, 21, 379, 971]

 Tokens and indexes for the sentence 'Hello how are you?, Welcome to CoderzColumn!!'


(['hello',
  'how',
  'are',
  'you',
  '?',
  ',',
  'welcome',
  'to',
  'coderzcolumn',
  '!',
  '!'],
 [12388, 355, 42, 164, 80, 3, 3298, 4, 0, 747, 747])

In [17]:
from torch.utils.data import DataLoader
from torchtext.data.functional import to_map_style_dataset

train_dataset, test_dataset  = torchtext.datasets.AG_NEWS()
train_dataset, test_dataset  = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

target_classes = ["World", "Sports", "Business", "Sci/Tech"]

max_words = 25

def vectorize_batch(batch):
    # Each sample is a (label, text) pair; split the batch into labels and texts.
    Y, X = list(zip(*batch))
    # Tokenize each text and map the tokens to vocabulary indexes.
    X = [vocab(tokenizer(text)) for text in X]
    # Bring every sample to exactly max_words tokens: pad short ones with 0, truncate long ones.
    X = [tokens + [0] * (max_words - len(tokens)) if len(tokens) < max_words else tokens[:max_words] for tokens in X]

    # Subtract 1 from the labels to move them from [1, 2, 3, 4] to [0, 1, 2, 3].
    return torch.tensor(X, dtype=torch.int32), torch.tensor(Y) - 1


train_loader = DataLoader(train_dataset, batch_size=1024, collate_fn=vectorize_batch, shuffle=True)
test_loader  = DataLoader(test_dataset , batch_size=1024, collate_fn=vectorize_batch)
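The padding and truncation step inside `vectorize_batch` is easier to follow in isolation. The sketch below pulls it into a hypothetical `pad_or_truncate` helper (not part of the original notebook) and runs it on two made-up index lists. Note that the pad value 0 is also the `<UNK>` index in this vocabulary.

```python
def pad_or_truncate(token_ids, max_words, pad_index=0):
    """Return a list of exactly max_words ids."""
    if len(token_ids) < max_words:
        # Too short: append pad_index until the target length is reached.
        return token_ids + [pad_index] * (max_words - len(token_ids))
    # Long enough: keep only the first max_words ids.
    return token_ids[:max_words]

short_sample = pad_or_truncate([12, 7, 5], max_words=5)
long_sample = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_words=5)

print(short_sample)  # [12, 7, 5, 0, 0]
print(long_sample)   # [1, 2, 3, 4, 5]
```

Because every sample ends up with the same length, the batch can be stacked into a single rectangular tensor.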

#Defining the RNN

In this example, the network consists of one embedding layer, followed by an RNN layer, followed by a linear layer.

In [18]:
from torch import nn
from torch.nn import functional as F

embed_len = 50
hidden_dim = 50
n_layers=1

class RNNClassifier(nn.Module):
    def __init__(self):
        super(RNNClassifier, self).__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.rnn = nn.RNN(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(target_classes))

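    def forward(self, X_batch):
        # (batch, max_words) -> (batch, max_words, embed_len)
        embeddings = self.embedding_layer(X_batch)
        # The initial hidden state is drawn at random on every call; if it is omitted, nn.RNN uses zeros.
        output, hidden = self.rnn(embeddings, torch.randn(n_layers, len(X_batch), hidden_dim))
        # Classify from the RNN output at the last time step.
        return self.linear(output[:,-1])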
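What the RNN layer computes at each time step can be sketched in plain Python. For its default `tanh` nonlinearity, `nn.RNN` applies the recurrence h_t = tanh(W_ih x_t + b_ih + W_hh h_(t-1) + b_hh). The `rnn_step` helper, the toy sizes, and the weight values below are all made up for illustration, and the initial state is zero here rather than random.

```python
import math

def rnn_step(x, h_prev, W_ih, W_hh, b_ih, b_hh):
    """One Elman RNN step: h_t = tanh(W_ih x + b_ih + W_hh h_prev + b_hh)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_ih[i] + b_hh[i]
        total += sum(W_ih[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

# Toy sizes: input (embedding) size 2, hidden size 2. Weights are arbitrary.
W_ih = [[0.5, 0.0], [0.0, 0.5]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b_ih = [0.0, 0.0]
b_hh = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three "embedded" tokens
h = [0.0, 0.0]                                   # zero initial hidden state
for x in sequence:
    h = rnn_step(x, h, W_ih, W_hh, b_ih, b_hh)

print(h)  # hidden state after the last token
```

For a single-layer, unidirectional RNN, the hidden state after the last token is what `output[:,-1]` selects in the classifier above.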

In [19]:
rnn_classifier = RNNClassifier()

rnn_classifier

RNNClassifier(
  (embedding_layer): Embedding(98635, 50)
  (rnn): RNN(50, 50, batch_first=True)
  (linear): Linear(in_features=50, out_features=4, bias=True)
)

In [20]:
for layer in rnn_classifier.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()

Layer : Embedding(98635, 50)
Parameters : 
torch.Size([98635, 50])

Layer : RNN(50, 50, batch_first=True)
Parameters : 
torch.Size([50, 50])
torch.Size([50, 50])
torch.Size([50])
torch.Size([50])

Layer : Linear(in_features=50, out_features=4, bias=True)
Parameters : 
torch.Size([4, 50])
torch.Size([4])
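The parameter shapes printed above can be checked with a little arithmetic. The sketch below only restates the sizes from the output (vocabulary of 98635, embedding and hidden size of 50, 4 classes) and adds them up; the variable names are chosen for illustration.

```python
vocab_size, embed_len, hidden_dim, n_classes = 98635, 50, 50, 4

# Embedding: one embed_len-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_len
# RNN: input-to-hidden weights, hidden-to-hidden weights, and two bias vectors.
rnn_params = hidden_dim * embed_len + hidden_dim * hidden_dim + 2 * hidden_dim
# Linear: weight matrix plus bias.
linear_params = n_classes * hidden_dim + n_classes

total_params = embedding_params + rnn_params + linear_params

print(embedding_params)  # 4931750
print(rnn_params)        # 5100
print(linear_params)     # 204
print(total_params)      # 4937054
```

Almost all of the parameters sit in the embedding layer, which is one reason limiting the vocabulary size matters.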

