# Project 3: Text Classification in PyTorch

## Instructions

* All the tasks that you need to complete in this project are either coding tasks (mentioned inside the code cells of the notebook with `#TODO` notations) or theoretical questions that you need to answer by editing the markdown question cells.
* **Please make sure you read the [Notes](#Important-Notes) section carefully before you start the project.**

## Introduction
This project deals with neural text classification using PyTorch. Text classification is the process of assigning tags or categories to text according to its content. It's one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Text classification algorithms are at the heart of a variety of software systems that process text data at scale. Email software uses text classification to determine whether incoming mail is sent to the inbox or filtered into the spam folder. Discussion forums use text classification to determine whether comments should be flagged as inappropriate.

**_Example:_** A simple example of text classification is spam classification. Consider all the emails you would receive in your personal inbox if your email service provider had no spam filter. Because of the spam filter, spam emails are redirected to the Spam folder, and only non-spam ("_ham_") emails reach your inbox.

![](http://blog.yhat.com/static/img/spam-filter.png)

## Task
Here, we want you to focus on a specific type of text classification task: "Document Classification into Topics". The task is to classify text data, or even large documents, into discrete topics or genres of interest.


![](https://miro.medium.com/max/700/1*YWEqFeKKKzDiNWy5UfrTsg.png)

In this project, you will classify text data into discrete topics or genres. You are given a collection of text samples, each with a label attached. Your model has to learn, from the words in each document, why the document was given its label. You need to create a neural classifier that is trained on this labeled data. Once trained, the classifier should be able to predict the label of any new document or text sample that is fed to it. The labels need not have any meaning to us, nor necessarily to you.

## Data
There are various datasets that we can use for this purpose. This tutorial shows how to use the text classification datasets in the PyTorch library ``torchtext``. There are different datasets in this library like `AG_NEWS`, `SogouNews`, `DBpedia`, and others. This project will deal with training a supervised learning algorithm for classification using one of these datasets. In task 1 of this project, we will work with the `AG_NEWS` dataset.

## Load Data

A bag of **n-grams** feature is used to capture partial information about local word order. In practice, bi-grams or tri-grams are used because word groups carry more information than single words.

**Example:**

*"I love Neural Networks"*
* **Bi-grams:** "I love", "love Neural", "Neural Networks"
* **Tri-grams:** "I love Neural", "love Neural Networks"
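
The n-gram lists above can be reproduced with a few lines of plain Python. This is only a sketch: the helper `ngrams` is a hypothetical name introduced for illustration and is not the torchtext implementation.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens`, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love Neural Networks".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# A bag of n-grams with n=2 keeps the single words and adds the bi-grams.
features = tokens + bigrams

print(bigrams)
print(trigrams)
print(features)
```

A sentence with `k` words yields `k - n + 1` n-grams, so the four-word example produces three bi-grams and two tri-grams.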

In the code below, we load the `AG_NEWS` dataset from the ``torchtext.datasets.text_classification`` module with bi-gram features. The dataset supports the `ngrams` argument. With `ngrams` set to 2, each example text in the dataset is a list of single words plus bi-gram strings.

In [1]:
"""
Load the AG_NEWS dataset in bi-gram features format.
"""

!pip install torchtext==0.4

import torch
import torchtext
from torchtext.datasets import text_classification
import os

from tqdm import tqdm
NGRAMS = 2

if not os.path.isdir('./.data'):
    os.mkdir('./.data')

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)

BATCH_SIZE = 16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")



120000lines [00:10, 11993.46lines/s]
120000lines [00:20, 5995.63lines/s]
7600lines [00:01, 6476.75lines/s]


## Model

Our first simple model is composed of an [`EmbeddingBag`](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer and a linear layer.

``EmbeddingBag`` computes the mean value of a “bag” of embeddings. The text entries here have different lengths, but ``EmbeddingBag`` requires no padding because the start of each entry is stored in offsets. Additionally, because ``EmbeddingBag`` accumulates the average across the embeddings on the fly, it can improve the performance and memory efficiency of processing a sequence of tensors.
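
To make the role of the offsets concrete, here is a plain-Python sketch of mean pooling over bags. It is not the PyTorch implementation: `mean_bags` and the tiny embedding table are made up for illustration, and the sketch assumes every bag contains at least one index.

```python
def mean_bags(embedding_table, indices, offsets):
    """Average the embedding rows of each bag.

    Bag k covers indices[offsets[k]:offsets[k + 1]]; the last bag runs
    to the end of `indices`. Assumes no bag is empty.
    """
    ends = list(offsets[1:]) + [len(indices)]
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [embedding_table[i] for i in indices[start:end]]
        dim = len(rows[0])
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled


# Hypothetical 4-word vocabulary with 2-dimensional embeddings.
table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
indices = [1, 2, 3, 1]   # two texts concatenated: [1, 2, 3] and [1]
offsets = [0, 3]         # start position of each text

pooled = mean_bags(table, indices, offsets)
print(pooled)
```

Each text, whatever its length, is reduced to one fixed-size vector, which is why no padding is needed.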

In [0]:
# TODO: Import the necessary libraries
from torch import nn

# TODO: Create a class TextClassifier. Remember that this class will be your model.
class TextClassifier(nn.Module):

    # TODO: Define the __init__() method with proper parameters
    # (vocabulary size, dimensions of the embeddings, number of classes)
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        # TODO: define the embedding layer
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        # TODO: define the linear forward layer
        self.linear = nn.Linear(embed_dim, num_class)
        # TODO: Initialize weights
        self._initWeights()

    # TODO: Define a method to initialize weights.
    def _initWeights(self):
        self._setWeightsAndBias(self.embedding)
        self._setWeightsAndBias(self.linear)

        
    def _setWeightsAndBias(self, layer):
        # The weights should be random in the range of -0.5 to 0.5.
        stdv = 0.5
        layer.weight.data.uniform_(-stdv, stdv)
        # You can initialize bias values as zero.
        # EmbeddingBag has no bias, and Linear(bias=False) sets it to None.
        if getattr(layer, "bias", None) is not None:
            layer.bias.data.zero_()

    
    # TODO: Define the forward function.
    def forward(self, inputs, offsets):
        # This should calculate the embeddings and return the linear layer
        # with calculated embedding values.
        x = self.embedding(inputs, offsets)
        x = self.linear(x)
        return x


## Check your data before you proceed!

Okay, so we know that we are using the `AG_NEWS` dataset in this project, but do you know what the data contains? What is the format of the data? How many classes are there in this dataset? We do not know yet. Let's find out!


## Question 1:
Create a new cell in this notebook and try to analyze the dataset that we loaded for you before. Report the following:
* Vocabulary size (VOCAB_SIZE)
* Number of classes (NUM_CLASS)
* Names of the classes


## Answer 1:

In [3]:
def getVocabSize(dataset):
    return len(dataset.get_vocab())

def getClasses(dataset):
    return dataset.get_labels()

VOCAB_SIZE = getVocabSize(test_dataset)
CLASSES = getClasses(test_dataset)
NUM_CLASSES = len(CLASSES)


print(f"VOCAB_SIZE: {VOCAB_SIZE}")
print(f"NUM_CLASSES: {NUM_CLASSES}")
print(f"CLASSES: {CLASSES}")

VOCAB_SIZE: 1308844
NUM_CLASSES: 4
CLASSES: {0, 1, 2, 3}


## Create an instance for your model

Great! You have successfully completed a basic analysis of the data that you are going to work with. The vocabulary size is equal to the length of the vocabulary (including single words and n-grams). The number of classes is equal to the number of labels. Copy and paste the code statements you used in your analysis to complete the code below. Then, using these parameters, create an instance `model` of your text classifier `TextClassifier`.

In [0]:
'''
Parameters and model instance creation.
'''

# TODO: Instantiate the Vocabulary size and the number of classes
# from the training dataset that we loaded for you.

# Hint: Remember that these are PyTorch datasets. So, there should be 
# readily available functions that you can use to save time. ;)

VOCAB_SIZE = getVocabSize(train_dataset)
EMBED_DIM = 32
NUM_CLASS = len(getClasses(train_dataset))

# TODO: Instantiate the model with the parameters you defined above. 
# Remember to allocate it to your 'device' variable.

model = TextClassifier(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

## Generate batch

Since the text entries have different lengths, you need to create a custom function to generate data batches and offsets. This function should be passed to the ``collate_fn`` parameter of the PyTorch ``DataLoader``, which you will use to load the data later on. The input to ``collate_fn`` is a list of `batch_size` dataset entries, and the ``collate_fn`` function packs them into a mini-batch. Make sure that ``collate_fn`` is declared as a top-level definition, so that the function is available in each worker. This is why you need to define this custom function before you call `DataLoader()`.

The text entries in the original data batch are packed into a list and concatenated into a single tensor, which is the input of ``EmbeddingBag``. The offsets tensor holds the delimiters that mark the beginning index of each individual sequence in the text tensor. The label tensor stores the labels of the individual text entries.

Finish the function definition below. The function should take batch as an input parameter. Each entry in the batch contains a pair of values of the text and the corresponding label.
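
The offsets are simply a cumulative sum of the entry lengths, shifted so that the first entry starts at 0. The following plain-Python sketch shows the arithmetic on made-up token ids; `make_offsets` is a hypothetical helper, and lists stand in for tensors.

```python
from itertools import accumulate


def make_offsets(lengths):
    """Start index of each entry once all entries are concatenated."""
    return [0] + list(accumulate(lengths))[:-1]


# Three made-up text entries of different lengths (token ids).
batch = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

flat = [tok for entry in batch for tok in entry]          # concatenated text
offsets = make_offsets([len(entry) for entry in batch])   # entry start positions

# The offsets are enough to cut the flat list back into the original entries.
ends = offsets[1:] + [len(flat)]
recovered = [flat[s:e] for s, e in zip(offsets, ends)]

print(offsets)
print(recovered == batch)
```

The length of the last entry is never needed, because the last bag runs to the end of the concatenated text.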

In [0]:
# TODO: Finish the function definition.

def generate_batch(raw_batch):
    label = torch.tensor([entry[0] for entry in raw_batch])
    text = [entry[1] for entry in raw_batch]
    offsets = [0] + [len(entry) for entry in text]

    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    
    return text, offsets, label

## Define the train function

Here, you need to define a function that you will use later in the project to train your model. It is very similar to the training steps that you have encountered in previous coding assignments. The outline of the function is as follows:

* load the data as batches
* iterate over the batches
* find the model output for a forward pass
* calculate the loss
* perform backpropagation on the loss (optimize it)
* find the training accuracy

In addition to this, you also need to find the total loss and total training accuracy values. Also, you need to return the average values of the total loss and total accuracy.
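
As a small illustration of the accuracy step, the sketch below picks the highest-scoring class for each sample and compares it with the true label. The logits and labels are made up, `batch_accuracy` is a hypothetical helper, and plain lists stand in for tensors.

```python
def batch_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    # Divide by the actual number of samples, which may be smaller than
    # the nominal batch size for the last batch of an epoch.
    return correct / len(labels)


# Made-up scores for 3 samples over 4 classes.
logits = [
    [2.0, 0.1, -1.0, 0.3],   # predicted class 0
    [0.2, 0.1, 3.5, 0.0],    # predicted class 2
    [0.0, 1.2, 0.4, 0.9],    # predicted class 1
]
labels = [0, 2, 3]

acc = batch_accuracy(logits, labels)
print(f"{acc:.3f}")
```

Averaging these per-batch values over all batches gives the epoch accuracy, provided the batches have equal size.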

In [0]:
def train(train_data):

    # Initial values of training loss and training accuracy
    
    train_loss = 0
    train_acc = 0

    # TODO: Use the PyTorch DataLoader class to load the data 
    # into shuffled batches of appropriate sizes into the variable 'data'.
    # Remember, this is the place where you need to generate batches.
    data = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=generate_batch, shuffle=True)
    
    
    for i, (text, offsets, cls) in tqdm(enumerate(data)):
        
        # TODO: What do you need to do in order to perform backprop on the optimizer?
        optimizer.zero_grad()
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)

        # TODO: Store the output of the model in variable 'output'
        output = model(text, offsets)
        
        # print(f"Output: {output}")
        # print(f"Text: {text}")
        # print(f"Offsets: {offsets}")
        # print(f"Cls: {cls}")
        
        # TODO: Define the 'loss' variable (with respect to 'output' and 'cls').
        loss = criterion(output, cls)
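        # Also calculate the total loss in variable 'train_loss'.
        # .item() stores a plain float, so the autograd graph of each batch
        # is not kept alive across the whole epoch.
        train_loss += loss.item()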
        
        # TODO: Perform the backward propagation on 'loss' and 
        # optimize it through the 'optimizer' step
        loss.backward()
        optimizer.step()
        
        
        # TODO: Calculate and store the total training accuracy
        # in the variable 'train_acc'.
        # Remember, you need to find the 
        _, pred_labels = output.max(dim=1)
        accuracy = (pred_labels == cls).sum().item() / cls.size(0)
        train_acc += accuracy
        

    # TODO: Adjust the learning rate here using the scheduler step
    scheduler.step()
    
    # Return the average loss and average accuracy per batch
    return train_loss / len(data), train_acc / len(data)

## Define the test function

Using the framework of the `train()` function in the previous cell, try to figure out the structure of the test function below.

In [0]:
def test(test_data):
    
    # Initial values of test loss and test accuracy
    
    loss = 0
    acc = 0
    
    # TODO: Use DataLoader class to load the data
    # into non-shuffled batches of appropriate sizes.
    # Remember, you need to generate batches here too.
    data = torch.utils.data.DataLoader(test_data, batch_size=BATCH_SIZE, collate_fn=generate_batch, shuffle=False)
    
    
    for text, offsets, cls in data:
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        
        # Hint: There is a 'hidden hint' here. Let's see if you can find it :)
        with torch.no_grad():
        
            
            # TODO: Get the model output
            output = model(text, offsets)
            
            
            
            # TODO: Calculate and add the loss to find total 'loss'
            loss += criterion(output, cls).item()
        
            
            
            # TODO: Calculate the accuracy and store it in the 'acc' variable
            _, pred_labels = output.max(dim=1)
            acc += (pred_labels == cls).sum().item() / cls.size(0)
            

    return loss / len(data), acc / len(data)

## Split the dataset and run the model

The original `AG_NEWS` dataset has no validation set. For this reason, you need to split the training dataset into training and validation sets with a proper split ratio. The `random_split()` function in `torch.utils.data` should be able to help you with this. We have already imported it for you. :)
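
The split sizes follow directly from the ratio. The sketch below works through the arithmetic for the 120,000 training samples reported when the data was loaded, using the 0.9 training ratio and the batch size of 16 from this notebook; `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_total, train_ratio):
    """Sizes of the training and validation parts for a given ratio."""
    n_train = int(n_total * train_ratio)
    return n_train, n_total - n_train


n_train, n_valid = split_sizes(120000, 0.9)

# Number of mini-batches per epoch for a batch size of 16.
train_batches = math.ceil(n_train / 16)
valid_batches = math.ceil(n_valid / 16)

print(n_train, n_valid)
print(train_batches, valid_batches)
```

Computing the validation size as the remainder guarantees that the two sizes add up to the dataset length, which `random_split()` requires.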

* Consider the initial learning rate as 4.0, number of epochs as 5, training data ratio as 0.9.
* You need to define and use a proper loss function
* Define an Optimization algorithm (Suggestion: SGD)
* Define a scheduler function to adjust the learning rate through epochs (gamma parameter = 0.9).
(Hint: Look at the `StepLR` function)
* Monitor the loss and accuracy values for both training and validation data sets.
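
To see what the scheduler does to the learning rate, here is a closed-form sketch of a step decay with the suggested values (initial rate 4.0, gamma 0.9, one decay step per epoch). `step_lr` is a hypothetical helper that mirrors the documented behaviour of `StepLR`; it does not call PyTorch.

```python
def step_lr(initial_lr, gamma, epoch, step_size=1):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate used in each of the 5 suggested epochs.
schedule = [step_lr(4.0, 0.9, epoch) for epoch in range(5)]

for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {lr:.4f}")
```

With gamma below 1, the rate shrinks by 10% after every epoch, so the early epochs take large steps and the later ones take finer steps.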

In [8]:
import time
from torch.utils.data.dataset import random_split

# TODO: Set the number of epochs and the learning rate to 
# their initial values here

# The instructions suggest 5 epochs; the recorded run below used 1.
N_EPOCHS = 1
LEARNING_RATE = 4.0
TRAIN_RATIO = 0.9

# TODO: Set the initial validation loss to positive infinity
INIT_VAL_LOSS = float('inf')


# TODO: Use the appropriate loss function
criterion = nn.CrossEntropyLoss()


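# TODO: Use the appropriate optimization algorithm with parameters (Suggested: SGD)
# Note: Adam is used here instead of the suggested SGD. LEARNING_RATE is not
# passed to it, so Adam runs with its default learning rate (1e-3).
optimizer = torch.optim.Adam(model.parameters())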

# TODO: Use a scheduler function
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)



# TODO: Split the data into train and validation sets using random_split()
size = len(train_dataset)
split_size = int(size * TRAIN_RATIO)
train_dataset_split, validation_dataset_split = random_split(train_dataset, [split_size, size - split_size])


# TODO: Finish the rest of the code below

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train(train_dataset_split)
    valid_loss, valid_acc = test(validation_dataset_split)

    secs = int(time.time() - start_time)
    mins = secs // 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')

6750it [03:34, 31.11it/s]


Epoch: 1  | time in 3 minutes, 34 seconds
	Loss: 0.4414(train)	|	Acc: 87.3%(train)
	Loss: 0.2528(valid)	|	Acc: 92.0%(valid)


## Let's check the test loss and test accuracy

So you have trained your model and seen how well it performs on the training and validation datasets. Now, you need to check your model's performance against the test dataset. Using the test dataset as input, report the test loss and test accuracy scores of your model.

In [9]:
# TODO: Complete the code below to find 
# the results (loss and accuracy) on the test data

print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.2790(test)	|	Acc: 90.8%(test)


In [10]:
# importing necessary libraries

import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

# labels for the AG_NEWS dataset

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, NGRAMS)])

# If you have done everything correctly in this task,
# then the output of this cell should be - "This is a 'Sports' news".

This is a 'Sports' news


# Congratulations! You just designed your first neural classifier!

And probably you have achieved a good accuracy score too. Great job!

## Question 2:
You just tested your model with a new sample text. Try to feed some more random examples of similar text (which you think are related to at least one of the four topics _"World", "Sports", "Business", "Sci/Tec"_ of our problem) to the model and see how your model reacts. Give at least 3 such examples (You are free to include more examples if you wish to).

## Answer 2:



In [11]:
#World article
article_1 = 'China is still seeing the huge majority of confirmed cases, but they are beginning to climb in other countries that are thousands of miles away.Iran and Italy have both seen surging numbers of infected patients.Irans official death toll stands at 34, but sources within the countrys healthcare system tell BBC Persian that the count could be as high as 210. Iran has denied that it is withholding information about the number of dead and infected.Seventeen people have died in Italy.There have also been 13 deaths in South Korea and three deaths in Japan.'
#Sports article
article_2 = 'Juventus lost 1-0 at Lyon in a Champions League last 16 first leg tie on Wednesday night.The goal came from Lucas Tousart.The 22-year-old French midfielder slotted home in the 31st minute.Juve have won the scudetto eight times in a row but have not bagged Europes premier silverware since 1996, when they won their second Champions League title since the Haysel-disaster-marred win in 1985. To help them win another European crown, they signed Cristiano Ronaldo from Real Madrid the summer before last, but again had no joy.They have again prioritised Europe this year even though they appear to be heading for a possible ninth straight Serie A title despite strong challenges from Lazio and Inter.'
#Business article
article_3 = 'Dozens of big companies have warned, for example, that the coronavirus will hit their share price.Trouble over an extended period could have an effect on the amount of work available, says Moira ONeill from Interactive Investor. Stock market falls also affect business confidence and the ability of companies to raise money, she says. So if this short-term blip becomes something more pronounced, there could be an impact for the wider economy and maybe your job.'

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(article_1, model, vocab, NGRAMS)])
print("This is a '%s' news" % ag_news_label[predict(article_2, model, vocab, NGRAMS)])
print("This is a '%s' news" % ag_news_label[predict(article_3, model, vocab, NGRAMS)])




This is a 'World' news
This is a 'Sports' news
This is a 'Business' news


## Question 3:
Okay, the model probably still works well with the examples you fed to it in the previous question. How about a twist in the plot? Let's feed it some more random text data from completely different genres/topics (not belonging to the 4 topics that we talked about in the previous question). How does your model react now? Give at least 3 such examples (you are free to include more examples if you wish).

Of course the predictions will be limited to the four class labels that your model is trained on. Can you somehow justify the labels that your model predicted now for the given text inputs?
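
One way to probe such predictions is to look at how confident the model is, not just which class it picks. The sketch below applies a softmax to made-up scores for the four classes; the numbers are purely illustrative and do not come from the trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["World", "Sports", "Business", "Sci/Tec"]
logits = [0.2, 0.1, 0.4, 0.3]   # made-up scores for an off-topic text

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

for name, p in zip(labels, probs):
    print(f"{name:9s} {p:.3f}")
print("Predicted:", labels[best])
```

The highest-scoring class is always one of the four labels, even when the probabilities are nearly uniform. A low top probability is a hint that the text does not really fit any of the trained classes.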

## Answer 3:


The article about the tiger has been categorized as World. This is possibly due to the presence of words like 'regions' and the names of the areas that the tiger inhabits. The article also mentions words like 'rivers', 'China', and 'Indian subcontinent', which bring it closer to a World article.

The article about changing tyres was categorized as Business. This may be because the words in the article are more similar (in a latent or semantic sense) to words appearing in business articles.

The article about cooking pasta has terms like 'boiling water', 'tongs', and 'add salt', which resemble a scientific experiment (chemistry lab work), and hence it is closer to science and technology.




In [12]:
#Animals
article_1 = """The tiger (Panthera tigris) is the largest species among the Felidae and classified in the genus Panthera. 
It is most recognisable for its dark vertical stripes on orangish-brown fur with a lighter underside. 
It is an apex predator, primarily preying on ungulates such as deer and wild boar. 
It is territorial and generally a solitary but social predator, requiring large contiguous areas of habitat, 
which support its requirements for prey and rearing of its offspring. 
Tiger cubs stay with their mother for about two years, before they become independent and leave their mothers home range to establish their own.
The tiger once ranged widely from the Eastern Anatolia Region in the west to the Amur River basin, and in the south from the foothills of the Himalayas 
to Bali in the Sunda islands. 
Since the early 20th century, tiger populations have lost at least 93% of their historic range and have been extirpated in Western and Central Asia, 
from the islands of Java and Bali, and in large areas of Southeast and South Asia and China. 
Todays tiger range is fragmented, stretching from Siberian temperate forests to subtropical and tropical forests on the Indian subcontinent and Sumatra."""
#How to change car tyres
article_2 = """As soon as you realize you have a flat tire, do not abruptly brake or turn.  
Slowly reduce speed and scan your surroundings for a level, straight stretch of road with a wide shoulder. 
An empty parking lot would be an ideal place. Level ground is good because it will prevent your vehicle from rolling.
 Also, straight stretches of road are better than curves because oncoming traffic is more likely to see you."""
#Cooking Pasta
article_3 = """Boil water in a large pot
To make sure pasta doesn’t stick together, use at least 4 quarts of water for every pound of noodles.
Salt the water with at least a tablespoon—more is fine
The salty water adds flavor to the pasta.
Add pasta
Pour pasta into boiling water. Don’t break the pasta; it will soften up within 30 seconds and fit into the pot.
Stir the pasta
As the pasta starts to cook, stir it well with the tongs so the noodles don’t stick to each other (or the pot).
Test the pasta by tasting it
Follow the cooking time on the package, but always taste pasta before draining to make sure the texture is right. 
Pasta cooked properly should be little chewy."""

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper parameters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(article_1, model, vocab, NGRAMS)])
print("This is a '%s' news" % ag_news_label[predict(article_2, model, vocab, NGRAMS)])
print("This is a '%s' news" % ag_news_label[predict(article_3, model, vocab, NGRAMS)])



This is a 'Business' news
This is a 'Business' news
This is a 'Business' news


## Question 4:
Your model has probably achieved a good accuracy score. However, there may be many things that you could still try in order to improve your classifier. Can you list some changes that you think would improve the above model's performance?

_(Hint: Maybe think about alternate architectures, #layers, hyper-parameters, etc., but try not to come up with anything too complex! :) )_

## Answer 4:

We can introduce the following:
1. Use Adam, an optimizer with adaptive learning rates.
2. Increase the number of hidden layers and the number of epochs.
3. Increase the batch size so that the gradient estimates are more accurate.
4. Use dropout to prevent overfitting.
5. Use batch normalization.
6. Introduce a nonlinearity such as ReLU or one of its variants, like Leaky ReLU.
 


# Important Notes

## NOTE 1:
If you want, you can try out the models on other datasets too for comparison. Although this is not mandatory, it would be really interesting to see how your model performs on data from different domains. Note that you may need to tweak the code a little when you work with other datasets and formats. 

## NOTE 2:
Any form of plagiarism is strictly prohibited. If it is found that you have copied sample code from the internet, the entire team will be penalized.

## NOTE 3:
Jupyter Notebooks often stop working or crash when memory is overloaded (many variables, big neural models, memory-intensive training, etc.). Moreover, as the number of tasks grows, the number of variables that you use will surely increase. Therefore, it is recommended that you use a separate notebook for each _Task_ in this project.

## NOTE 4:
You are expected to write well-documented code, that is, code with proper comments wherever you think they are needed. Make sure you write a comprehensive report for the entire project, covering data analysis, your model architecture, the methods used, a discussion and comparison of the models against the accuracy and loss metrics, and a final conclusion. If you want to prepare separate reports for each _Task_, you could do this in the Jupyter Notebook itself, using Markdown and $\LaTeX$ code if needed. If you want to submit a single report for the entire project, you could submit a PDF file instead (prepared in Word or $\LaTeX$).

All the very best for Project 3. Wishing you happy holidays and a very happy new year in advance! :)