<a href="https://colab.research.google.com/github/nisha1729/Text-Classifier/blob/master/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification in PyTorch

## Introduction
This project implements neural text classification in PyTorch on the `AG_NEWS` news dataset.

## Load Data

A bag of **ngrams** feature captures partial information about local word order. In practice, bi-grams or tri-grams are used because word groups carry more information than single words alone.

**Example:**

*"I love Neural Networks"*
* **Bi-grams:** "I love", "love Neural", "Neural Networks"
* **Tri-grams:** "I love Neural", "love Neural Networks"
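
As a rough illustration of the example above, the following sketch builds ngrams from a whitespace-tokenized sentence. The helper names `ngrams` and `bag_of_ngrams` are hypothetical and not part of torchtext; the second one assumes a feature list made of single words followed by the higher-order ngrams.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens`, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(tokens, max_n):
    """Return single words followed by all 2..max_n grams (hypothetical helper)."""
    features = list(tokens)
    for n in range(2, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


tokens = "I love Neural Networks".split()
print(ngrams(tokens, 2))         # ['I love', 'love Neural', 'Neural Networks']
print(ngrams(tokens, 3))         # ['I love Neural', 'love Neural Networks']
print(bag_of_ngrams(tokens, 2))  # the four single words plus the three bi-grams
```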

The code below loads the `AG_NEWS` dataset from the ``torchtext.datasets.text_classification`` module with bi-gram features. The dataset loader accepts an ``ngrams`` argument; with ``ngrams=2``, each example text in the dataset is represented as a list of single words plus bi-gram strings.

In [None]:
!pip install torchtext==0.4.0

In [2]:
# Load the AG_NEWS dataset in bi-gram features format.

import torch
import torchtext
from torchtext.datasets import text_classification
import os


NGRAMS = 2

if not os.path.isdir('./.data'):
    os.mkdir('./.data')

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](root='./.data', ngrams=NGRAMS, vocab=None)

BATCH_SIZE = 16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

120000lines [00:07, 16406.36lines/s]
120000lines [00:14, 8010.90lines/s]
7600lines [00:00, 8205.85lines/s]



## Analysing the Dataset

The vocabulary size is the number of entries in the vocabulary, counting both single words and ngrams.

In [3]:
# Vocabulary size and number of classes.

VOCAB_SIZE = len(train_dataset.get_vocab())
NUM_CLASS = len(train_dataset.get_labels())

print(VOCAB_SIZE)
print(NUM_CLASS)

# see how the data looks
print(train_dataset[0]) 

1308844
4
(2, tensor([    572,     564,       2,    2326,   49106,     150,      88,       3,
           1143,      14,      32,      15,      32,      16,  443749,       4,
            572,     499,      17,      10,  741769,       7,  468770,       4,
             52,    7019,    1050,     442,       2,   14341,     673,  141447,
         326092,   55044,    7887,     411,    9870,  628642,      43,      44,
            144,     145,  299709,  443750,   51274,     703,   14312,      23,
        1111134,  741770,  411508,  468771,    3779,   86384,  135944,  371666,
           4052]))


## Model

The first model is a simple one, composed of an [`EmbeddingBag`](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer and a linear layer.

``EmbeddingBag`` computes the mean value of a “bag” of embeddings. The text entries here have different lengths, but ``EmbeddingBag`` requires no padding because the start of each text is recorded in the offsets. Additionally, since ``EmbeddingBag`` accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.
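
To make the offsets mechanism concrete, here is a minimal sketch with toy sizes and made-up indices. It checks that ``EmbeddingBag`` in its default ``mode="mean"`` returns the same vectors as averaging the embedding rows of each bag by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy vocabulary of 10 tokens with 3-dimensional embeddings (illustrative sizes).
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="mean")

# Two texts of different lengths, concatenated into one flat tensor.
text = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])  # text 0 = text[0:4], text 1 = text[4:8]

pooled = bag(text, offsets)     # one pooled vector per text

# Manual check: average the embedding rows belonging to each bag.
manual = torch.stack([
    bag.weight[text[0:4]].mean(dim=0),
    bag.weight[text[4:8]].mean(dim=0),
])
print(pooled.shape)                               # torch.Size([2, 3])
print(torch.allclose(pooled, manual, atol=1e-6))  # True
```

No padding is involved: the two bags have the same length here only to keep the example short, and any split point given in `offsets` would work the same way.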

In [4]:
import torch.nn as nn

class TextClassifier(nn.Module):

    def __init__(self, vocab_size, dim, num_class):
        super().__init__()

        self.layer1 = nn.EmbeddingBag(vocab_size, dim)
        self.layer2 = nn.Linear(dim, num_class)

        self.init_weights()

    def init_weights(self):
        self.layer1.weight.data.uniform_(-0.5, 0.5)
        self.layer2.weight.data.uniform_(-0.5, 0.5)
        self.layer2.bias.data.fill_(0.)
    
    def forward(self, in_text, in_offset):
        # Pool each text into one embedding vector, then map it to class scores
        out1 = self.layer1(in_text, in_offset)
        out2 = self.layer2(out1)
        return out2

## Generate batch

Since the text entries have different lengths, a custom function is needed to generate data batches and offsets. This function is passed to the ``collate_fn`` parameter of the ``DataLoader``. The input to ``collate_fn`` is a list of ``batch_size`` dataset entries, and ``collate_fn`` packs them into a mini-batch. ``collate_fn`` must be declared as a top-level definition so that it is available in each worker process.

The text entries in the batch are concatenated into a single tensor that serves as the input to ``EmbeddingBag``. The offsets tensor holds the starting index of each individual sequence within the text tensor. The label tensor stores the labels of the individual text entries.

The function takes the batch as its input parameter. Each entry in the batch is a pair consisting of the label and the corresponding text.

In [5]:
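def generate_batch(data_batch):
    # Each entry is a (label, text_tensor) pair
    label = torch.tensor([entry[0] for entry in data_batch])
    text = [entry[1] for entry in data_batch]
    offsets = [0] + [len(entry) for entry in text]

    # Cumulative sum of the lengths (dropping the last one) gives the
    # starting index of each text in the concatenated tensor
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    return text, offsets, label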
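
To see what the offsets look like, the arithmetic in `generate_batch` can be reproduced on plain lists. This is only a sketch with made-up token ids; `itertools.accumulate` stands in for the tensor `cumsum`.

```python
from itertools import accumulate

# Three toy texts of lengths 3, 2 and 4 (token ids are made up).
texts = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

lengths = [0] + [len(t) for t in texts]    # [0, 3, 2, 4]
offsets = list(accumulate(lengths[:-1]))   # starting index of each text
flat = [tok for t in texts for tok in t]   # all texts concatenated

# Each text can be recovered from the flat list using consecutive offsets.
bounds = offsets + [len(flat)]
recovered = [flat[a:b] for a, b in zip(bounds, bounds[1:])]

print(offsets)             # [0, 3, 5]
print(recovered == texts)  # True
```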

## Train function



In [6]:
from torch.utils.data import DataLoader

def train(train_data):
    
    train_loss = 0
    total_acc = 0

    data = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=generate_batch)
  
    for i, (text, offsets, cls) in enumerate(data):
        optimizer.zero_grad()        
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        output = model(text, offsets)

        loss = criterion(output, cls)
        train_loss += loss.item()

        loss.backward()
        optimizer.step()

        total_acc += (output.argmax(1) == cls).sum().item() 

    scheduler.step()

    total_train_loss = train_loss/len(train_data)
    total_train_acc = total_acc/len(train_data)

    return total_train_loss, total_train_acc 

## Test function

In [7]:
def test(test_data):
        
    test_acc = 0
    test_loss = 0
    data = DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False, collate_fn=generate_batch)
    
    for text, offsets, cls in data:
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        with torch.no_grad():
            output = model(text, offsets)

            loss = criterion(output, cls)
            test_loss += loss.item()

            test_acc += (output.argmax(1) == cls).sum().item()

    # Compute the averages once, after all batches have been processed
    total_test_loss = test_loss/len(test_data)
    total_test_acc = test_acc/len(test_data)

    return total_test_loss, total_test_acc


## Split the dataset and run the model
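
Before running the cell below, it can help to see what its settings imply. The sketch is plain arithmetic, assuming the 120,000 training examples reported in the loader output and the values `TRAIN_RATIO = 0.9`, `LEARNING_RATE = 4.0`, `gamma = 0.9` and `N_EPOCHS = 5` used in this notebook. It shows the train/validation sizes and the learning rate that `StepLR` with a step size of 1 yields for each epoch.

```python
n_examples = 120_000      # size of the AG_NEWS training set loaded earlier
train_ratio = 0.9
initial_lr, gamma, n_epochs = 4.0, 0.9, 5

train_size = int(n_examples * train_ratio)
val_size = n_examples - train_size
print(train_size, val_size)   # 108000 12000

# StepLR with step_size=1 multiplies the learning rate by gamma after each epoch.
schedule = [initial_lr * gamma ** epoch for epoch in range(n_epochs)]
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.4f}")
```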

In [8]:
import time
from tqdm import tqdm
from torch.utils.data.dataset import random_split


N_EPOCHS = 5
LEARNING_RATE = 4.0
TRAIN_RATIO = 0.9
EMBED_DIM = 32 

model = TextClassifier(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

# set the initial validation loss to positive infinity
valid_loss = float('inf')

criterion = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_size = int(len(train_dataset)*TRAIN_RATIO) 
val_size = len(train_dataset) - train_size

train_set, val_set = random_split(train_dataset, [train_size, val_size])

for epoch in tqdm(range(N_EPOCHS)):
  start_time = time.time()
  train_loss, train_acc = train(train_set)
  valid_loss, valid_acc = test(val_set)

  secs = int(time.time() - start_time)
  mins, secs = divmod(secs, 60)
  print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
  print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
  print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

 20%|██        | 1/5 [00:48<03:13, 48.43s/it]

Epoch: 1  | time in 0 minutes, 48 seconds
	Loss: 0.0266(train)	|	Acc: 84.3%(train)
	Loss: 0.0227(valid)	|	Acc: 87.2%(valid)


 40%|████      | 2/5 [01:36<02:25, 48.41s/it]

Epoch: 2  | time in 0 minutes, 48 seconds
	Loss: 0.0119(train)	|	Acc: 93.7%(train)
	Loss: 0.0173(valid)	|	Acc: 90.9%(valid)


 60%|██████    | 3/5 [02:25<01:36, 48.39s/it]

Epoch: 3  | time in 0 minutes, 48 seconds
	Loss: 0.0069(train)	|	Acc: 96.4%(train)
	Loss: 0.0184(valid)	|	Acc: 91.2%(valid)


 80%|████████  | 4/5 [03:13<00:48, 48.38s/it]

Epoch: 4  | time in 0 minutes, 48 seconds
	Loss: 0.0038(train)	|	Acc: 98.2%(train)
	Loss: 0.0215(valid)	|	Acc: 90.5%(valid)


100%|██████████| 5/5 [04:01<00:00, 48.39s/it]

Epoch: 5  | time in 0 minutes, 48 seconds
	Loss: 0.0022(train)	|	Acc: 99.0%(train)
	Loss: 0.0219(valid)	|	Acc: 91.0%(valid)
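
One detail is worth keeping in mind when reading these numbers. `train` and `test` add up the per-batch mean loss returned by `CrossEntropyLoss` and then divide by the number of examples rather than the number of batches, so the reported loss is roughly the average cross-entropy divided by `BATCH_SIZE`. A small sketch of the conversion, using the epoch 1 values printed above (the helper name is hypothetical, and the results are approximate because the printed values are rounded):

```python
BATCH_SIZE = 16


def per_example_loss(reported_loss, batch_size=BATCH_SIZE):
    """Undo the extra division by the batch size (hypothetical helper)."""
    return reported_loss * batch_size


print(f"{per_example_loss(0.0266):.4f}")  # 0.4256 (train, epoch 1)
print(f"{per_example_loss(0.0227):.4f}")  # 0.3632 (valid, epoch 1)
```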





## Test the model

In [9]:
# the results (loss and accuracy) on the test data

print('Checking the results of test dataset')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset
	Loss: 0.0278(test)	|	Acc: 88.4%(test)


In [10]:
# importing necessary libraries

import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

# labels for the AG_NEWS dataset

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])

This is a 'Sports' news


Try the model on some other text samples.

In [11]:
ex_text_str_sport = "In finishing third in Group D, Ireland scored a measly \
    seven goals in eight matches, a figure in stark contrast to the 23 scored \
    by second-placed Denmark, and the 19 scored by group winners Switzerland. \
    It’s a problem that has affected Ireland for a while, especially since the \
    retirement of legendary striker Robbie Keane. Both Martin O’Neill and now \
    McCarthy have struggled to find a forward who can be relied upon to \
    consistently deliver the goods."

ex_text_str_scitech = "A new month means new games: Sony has announced the \
    new free games for PS Plus members. The company seems to have been quite \
    generous lately, because there are more than the two titles we are used to.\
    Already in January 2020 PS-Plus members could look forward to four games, \
    in February Sony awards three titles worth almost 120 euros, which are \
    presented as usual via a short trailer."

ex_text_str_world = "Asia reported hundreds of new coronavirus cases \
    on Wednesday, including a U.S. soldier stationed in South Korea, as the \
    United States warned of an inevitable pandemic and outbreaks in Italy and \
    Iran spread to other countries."

ex_text_str_business = "World stocks tumbled for the fifth day on fears of \
    prolonged disruption to global supply chains, while safe-haven gold rose \
    back towards seven-year highs and U.S. bond yields held near record lows. \
    Stock markets globally have wiped out $3.3 trillion of value in the past \
    four trading sessions, as measured by the MSCI all-country index."

print("This is a '%s' text" % ag_news_label[predict(ex_text_str_sport, model, vocab, 2)])
print("This is a '%s' text" % ag_news_label[predict(ex_text_str_scitech, model, vocab, 2)])
print("This is a '%s' text" % ag_news_label[predict(ex_text_str_world, model, vocab, 2)])
print("This is a '%s' text" % ag_news_label[predict(ex_text_str_business, model, vocab, 2)])

This is a 'Sports' text
This is a 'Sci/Tec' text
This is a 'World' text
This is a 'Business' text


The model performs quite well on these samples. It was able to distinguish between video games and sports, and correctly classified the second example as 'Sci/Tec'.

In [12]:
ex_text_str1 = "The effect of elections on the economy is stunning. It could \
    either be the start of a new trend or the same pattern that has been \
    going on for the past few decades."

ex_text_str2 = "In the beginning the Universe was created.\
    This had made many people very angry and has been widely \
    regarded as a bad move. The humans have been trying to figure out \
    why they spend so much time between looking at the digital clocks"

ex_text_str3 = "At the spawn point, there’s a lovely little village. \
    This was started as a communal building area, says team member Trog. \
    We let everyone build wherever in that area and then connected paths \
    to each building to make it look more like a village. Even with ten \
    team members all building their own things in the same space, the \
    village manages to look remarkably well put-together, with shops, \
    farms, and even a graveyard."

print("This is a '%s' text" % ag_news_label[predict(ex_text_str1, model, vocab, 2)])
print("This is a '%s' text" % ag_news_label[predict(ex_text_str2, model, vocab, 2)])
print("This is a '%s' text" % ag_news_label[predict(ex_text_str3, model, vocab, 2)])

This is a 'World' text
This is a 'Sci/Tec' text
This is a 'World' text


## LSTM model


In [18]:
from torch import nn
class LSTM(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, output_dim, vocab_size):
        
        super().__init__()

        self.embedding_dim = embedding_dim
        self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
        
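    def forward(self, text, offset):
        # EmbeddingBag pools each text into a single vector: (batch, embedding_dim)
        embedded = self.embedding(text.to(device), offset.to(device))
        print(embedded.shape)
        # Reshape to (batch, 1, embedding_dim). nn.LSTM defaults to
        # batch_first=False, so it reads this as (seq_len=batch, batch=1, embedding_dim).
        embedded = embedded.unsqueeze(0).permute(1, 0, 2)
        print(embedded.shape)

        last_output, (last_hidden, last_cell) = self.lstm(embedded)

        print(last_output.shape, last_hidden.shape, last_cell.shape)
        # last_hidden[-1] has shape (1, hidden_dim) here, not (batch, hidden_dim)
        output = self.fc(last_hidden[-1])
        return output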

In [14]:
HIDDEN_DIM = 256
OUTPUT_DIM = 4
VOCAB_SIZE = len(train_dataset.get_vocab())
NUM_CLASS = len(train_dataset.get_labels())
lr =  2e-5
EMBEDDING_DIM = 32
criterion = nn.CrossEntropyLoss()


In [19]:
import time
from torch.utils.data.dataset import random_split
from torch import nn
from tqdm.autonotebook import tqdm

N_EPOCHS = 13
TRAIN_RATIO = 0.9
valid_loss = float('inf')

model = LSTM(EMBEDDING_DIM, HIDDEN_DIM, NUM_CLASS, VOCAB_SIZE).to(device)


criterion = nn.CrossEntropyLoss().to(device)
optimizer = torch.optim.Adam(model.parameters() )
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)

train_size = int(len(train_dataset)*TRAIN_RATIO)
val_size = len(train_dataset) - train_size
train_set, val_set = random_split(train_dataset, [train_size, val_size])

for epoch in tqdm(range(N_EPOCHS)):
    start_time = time.time()
    train_loss, train_acc = train(train_set)
    valid_loss, valid_acc = test(val_set)

    secs = int(time.time() - start_time)
    mins, secs = divmod(secs, 60)

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
    # Save one checkpoint per epoch; the names match the loading loop in the last cell
    torch.save(model.state_dict(), f"model_{epoch}")

torch.Size([16, 32])
torch.Size([16, 1, 32])
torch.Size([16, 1, 256]) torch.Size([1, 1, 256]) torch.Size([1, 1, 256])


ValueError: ignored
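
The shapes printed above suggest where the error comes from. `nn.LSTM` defaults to `batch_first=False`, so the `(16, 1, 32)` tensor is read as a sequence of length 16 with batch size 1. `last_hidden[-1]` then has a single row, and the loss receives one prediction for 16 labels. The following sketch shows one possible fix, not necessarily the one the author intended: it passes `batch_first=True` so that each pooled text is treated as a length-1 sequence. The sizes mirror the notebook, and the random input is a stand-in for the `EmbeddingBag` output.

```python
import torch
import torch.nn as nn

batch_size, embedding_dim, hidden_dim, num_class = 16, 32, 256, 4

# Stand-in for the EmbeddingBag output: one pooled vector per text.
embedded = torch.randn(batch_size, embedding_dim)

# As in the notebook: (batch, 1, embed) is read as (seq_len=16, batch=1, embed).
lstm_seq_first = nn.LSTM(embedding_dim, hidden_dim)
_, (h_bad, _) = lstm_seq_first(embedded.unsqueeze(1))
print(h_bad[-1].shape)   # torch.Size([1, 256]): one row for 16 labels

# Assumed fix: tell the LSTM that dimension 0 is the batch.
lstm_batch_first = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
_, (h_ok, _) = lstm_batch_first(embedded.unsqueeze(1))
print(h_ok[-1].shape)    # torch.Size([16, 256])

logits = nn.Linear(hidden_dim, num_class)(h_ok[-1])
labels = torch.randint(0, num_class, (batch_size,))
loss = nn.CrossEntropyLoss()(logits, labels)   # batch sizes now agree
print(logits.shape)      # torch.Size([16, 4])
```

Because `EmbeddingBag` has already averaged each text into one vector, the LSTM only ever sees a single time step in this setup. Feeding it per-token embeddings would be a larger change and is outside the scope of this sketch.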

In [None]:
for i in range(13):
    torch.cuda.empty_cache() 
    model = LSTM(32, HIDDEN_DIM, 4, VOCAB_SIZE).to(device)
    model.load_state_dict(torch.load(f"model_{i}"))
    # Report the results (loss and accuracy) on the test data for this checkpoint
    print(f'Epoch{i+1} test results........using model{i+1}')
    test_loss, test_acc = test(test_dataset)
    print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')