# Project 3: Text Classification in PyTorch

## Instructions

* All the tasks that you need to complete in this project are either coding tasks (mentioned inside the code cells of the notebook with `#TODO` notations) or theoretical questions that you need to answer by editing the markdown question cells.
* **Please make sure you read the [Notes](#Important-Notes) section carefully before you start the project.**

## Introduction
This project deals with neural text classification using PyTorch. Text classification is the process of assigning tags or categories to text according to its content. It's one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Text classification algorithms are at the heart of a variety of software systems that process text data at scale. Email software uses text classification to determine whether incoming mail is sent to the inbox or filtered into the spam folder. Discussion forums use text classification to determine whether comments should be flagged as inappropriate.

**_Example:_** A simple example of text classification is spam classification. Consider all the emails you would receive in your personal inbox if the email service provider had no spam filter. Because of the spam filter, spam emails are redirected to the Spam folder, and only non-spam ("_ham_") emails reach your inbox.

![](http://blog.yhat.com/static/img/spam-filter.png)

## Task
Here, we want you to focus on a specific type of text classification task: "Document Classification into Topics". This means classifying text data, or even large documents, into discrete topics or genres of interest.


![](https://miro.medium.com/max/700/1*YWEqFeKKKzDiNWy5UfrTsg.png)

In this project, you will classify text data into discrete topics or genres. You are given a collection of text samples, each with a label attached. We ask you to learn, from the words in each document, why its contents were given that label. You need to create a neural classifier trained on this information. Once trained, the classifier should be able to predict the label for any new document or text sample fed to it. The labels need not have any meaning to us, nor necessarily to you.

## Data
There are various datasets that we can use for this purpose. This tutorial shows how to use the text classification datasets in the PyTorch library ``torchtext``. There are different datasets in this library like `AG_NEWS`, `SogouNews`, `DBpedia`, and others. This project will deal with training a supervised learning algorithm for classification using one of these datasets. In task 1 of this project, we will work with the `AG_NEWS` dataset.

## Load Data

A bag of **ngrams** feature is applied to capture some partial information about the local word order. In practice, bi-grams or tri-grams are used because word groups carry more information than single words alone.

**Example:**

*"I love Neural Networks"*
* **Bi-grams:** "I love", "love Neural", "Neural Networks"
* **Tri-grams:** "I love Neural", "love Neural Networks"
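
As a minimal sketch of how such n-grams can be produced from a whitespace-tokenized sentence (the helper name `make_ngrams` is our own and is not part of `torchtext`):

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I love Neural Networks".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
print(bigrams)   # ['I love', 'love Neural', 'Neural Networks']
print(trigrams)  # ['I love Neural', 'love Neural Networks']
```

A sentence shorter than `n` tokens simply yields an empty list.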

The `AG_NEWS` dataset in the ``torchtext.datasets.TextClassification`` package supports the ngrams method. Setting ngrams to 2 makes each example text in the dataset a list of single words plus bi-gram strings. Note that the loading cell for task 3 below passes `ngrams=1`, so that model receives single words only.


# Task 3: Let your creativity flow!

As discussed earlier, you are free to come up with anything in task 3. Think up and model a unique (not too complex!) neural architecture of your own. Remember that this model should be as novel as possible, so try not to copy other people's existing work. Using the same data, train the new model and report the accuracy scores. How much better or worse is this model than the previous two models? Why do you think it is better or worse?


In [0]:
"""
Load the AG_NEWS dataset as unigram (single-token) features (ngrams=1).
The CNN below captures 2-, 3- and 4-gram patterns through its kernel heights.
"""

!pip install torchtext==0.4

import os
import torch
import torchtext
from torchtext.datasets import text_classification

from tqdm import tqdm


if not os.path.isdir('./.data'):
    os.mkdir('./.data')

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=1, vocab=None, include_unk=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Collecting torchtext==0.4
  Downloading https://files.pythonhosted.org/packages/43/94/929d6bd236a4fb5c435982a7eb9730b78dcd8659acf328fd2ef9de85f483/torchtext-0.4.0-py3-none-any.whl (53kB)
Installing collected packages: torchtext
  Found existing installation: torchtext 0.3.1
    Uninstalling torchtext-0.3.1:
      Successfully uninstalled torchtext-0.3.1
Successfully installed torchtext-0.4.0


ag_news_csv.tar.gz: 11.8MB [00:00, 42.8MB/s]
120000lines [00:04, 29337.61lines/s]
120000lines [00:06, 17459.86lines/s]
7600lines [00:00, 18121.97lines/s]


In [0]:
def getVocabSize(dataset):
    return len(dataset.get_vocab())

def getClasses(dataset):
    return dataset.get_labels()
  
VOCAB_SIZE = getVocabSize(test_dataset)
CLASSES = getClasses(test_dataset)
NUM_CLASSES = len(CLASSES)

print(f"VOCAB_SIZE: {VOCAB_SIZE}")
print(f"NUM_CLASSES: {NUM_CLASSES}")
print(f"CLASSES: {CLASSES}")

VOCAB_SIZE: 95812
NUM_CLASSES: 4
CLASSES: {0, 1, 2, 3}


# Hyperparams

In [0]:
EMBED_DIM = 100
BATCH_SIZE = 1
FILTER_DIM = 100
NGA = 2
NGB = 3 
NGC = 4 

# The model

In [0]:
import torch.nn as nn
import torch.nn.functional as F


class CreativeNN(nn.Module):

    def __init__(self, vocab_size, embedding_dim, fil_dim, nga, ngb, ngc, num_classes):
        """
        Initialize the model by setting up the layers.
        """
        super().__init__()

        self.num_classes = num_classes
        
        # Possibly interchange this layer with glove pre trained embeddings
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.cnn1 = nn.Conv2d(1, fil_dim, (nga, embedding_dim))

        self.cnn2 = nn.Conv2d(1, fil_dim, (ngb, embedding_dim))
        
        self.cnn3 = nn.Conv2d(1, fil_dim, (ngc, embedding_dim))

        self.linear = nn.Linear(3 * fil_dim, num_classes)

        

    def forward(self, input):
        # input: LongTensor of token ids, shape (1, num_words).
        # The squeezes below assume a batch size of exactly 1.
        x = self.embedding(input)   # (1, num_words, embedding_dim)
        x = x[:, None, :, :]        # add channel dim: (1, 1, num_words, embedding_dim)

        features = []
        for conv in (self.cnn1, self.cnn2, self.cnn3):
            h = F.relu(conv(x))                # (1, fil_dim, num_words - k + 1, 1)
            h = torch.squeeze(h, 3)            # (1, fil_dim, num_words - k + 1)
            h = F.max_pool1d(h, h.shape[2])    # max over all positions: (1, fil_dim, 1)
            h = torch.squeeze(h, 2)            # (1, fil_dim)
            h = torch.squeeze(h, 0)            # (fil_dim,)
            features.append(h)

        x = torch.cat(features)     # (3 * fil_dim,)
        x = self.linear(x)          # (num_classes,)
        x = x[None, :]              # restore batch dim: (1, num_classes)

        return x

model = CreativeNN(VOCAB_SIZE, EMBED_DIM, FILTER_DIM, NGA, NGB, NGC, NUM_CLASSES).to(device)
print(model)

CreativeNN(
  (embedding): Embedding(95812, 100)
  (cnn1): Conv2d(1, 100, kernel_size=(2, 100), stride=(1, 1))
  (cnn2): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
  (cnn3): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  (linear): Linear(in_features=300, out_features=4, bias=True)
)


# The training

In [0]:
def train(train_data):

    # Initial values of training loss and training accuracy
    
    train_loss = 0
    train_acc = 0

    # TODO: Use the PyTorch DataLoader class to load the data 
    # into shuffled batches of appropriate sizes into the variable 'data'.
    # Remember, this is the place where you need to generate batches.
    data = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
    
    
    for i, (cls, text) in tqdm(enumerate(data)):
        
        # TODO: What do you need to do in order to perform backprop on the optimizer?
        optimizer.zero_grad()
        
        cls, text = cls.to(device), text.to(device)

        # TODO: Store the output of the model in variable 'output'
        output = model(text)

        # print(f"Output: {output}")
        # print(f"Cls: {cls}")

        # TODO: Define the 'loss' variable (with respect to 'output' and 'cls').
        loss = criterion(output, cls)
        # Also accumulate the total loss in 'train_loss' (as a Python float,
        # so the autograd graph is not kept alive across iterations)
        train_loss += loss.item()
        
        # TODO: Perform the backward propagation on 'loss' and 
        # optimize it through the 'optimizer' step
        loss.backward()
        optimizer.step()
        
        
        # TODO: Calculate and store the total training accuracy
        # in the variable 'total_acc'.
        # Remember, you need to find the 
        _, pred_labels = output.max(dim=1)
        
        accuracy = (pred_labels == cls).sum() / float(BATCH_SIZE)
        train_acc += accuracy
        

    # TODO: Adjust the learning rate here using the scheduler step
    scheduler.step()
    
    # TODO: CHANGE THIS
    return train_loss / len(data), train_acc / len(data)

# The testing

In [0]:
def test(test_data):
    
    # Initial values of test loss and test accuracy
    
    loss = 0
    acc = 0
    
    # TODO: Use DataLoader class to load the data
    # into non-shuffled batches of appropriate sizes.
    # Remember, you need to generate batches here too.
    data = torch.utils.data.DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=False)
    
    
    for cls, text in data:
        
        cls, text = cls.to(device), text.to(device)
        
        # Hint: There is a 'hidden hint' here. Let's see if you can find it :)
        with torch.no_grad():
        
            
            # TODO: Get the model output
            output = model(text)
            
            
            # TODO: Calculate and add the loss to find total 'loss'
            loss += criterion(output, cls)
            
            
            # TODO: Calculate the accuracy and store it in the 'acc' variable
            _, pred_labels = output.max(dim=1)
            acc += (pred_labels == cls).sum() / float(BATCH_SIZE)
            

    return loss / len(data), acc / len(data)

In [0]:
import time
from torch.utils.data.dataset import random_split

# TODO: Set the number of epochs and the learning rate to 
# their initial values here

# TODO: FIGURE THIS OUT
N_EPOCHS = 3 
TRAIN_RATIO = 0.9

# TODO: Set the initial validation loss to positive infinity
INIT_VAL_LOSS = float('inf')


# TODO: Use the appropriate loss function
criterion = nn.CrossEntropyLoss()


# TODO: Use the appropriate optimization algorithm with parameters (Suggested: SGD)
# optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
optimizer = torch.optim.Adam(model.parameters())


# TODO: Use a scheduler function
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)



# TODO: Split the data into train and validation sets using random_split()
size = len(train_dataset)
split_size = int(size * TRAIN_RATIO)
train_dataset_split, validation_dataset_split = random_split(train_dataset, [split_size, size - split_size])

# TODO: Finish the rest of the code below

DEBUG = True

if DEBUG:
  pass

else:
    for epoch in range(N_EPOCHS):

        start_time = time.time()
        train_loss, train_acc = train(train_dataset_split)
        valid_loss, valid_acc = test(validation_dataset_split)
    
        secs = int(time.time() - start_time)
        mins = secs // 60
        secs = secs % 60
    
        print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
        print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
        print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
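
The scheduler above multiplies the learning rate by `gamma=0.9` after every epoch. A small sketch of the resulting schedule in plain Python, assuming Adam's default initial learning rate of `1e-3` (the optimizer above is created without an explicit `lr`); the helper name `step_lr_schedule` is ours:

```python
def step_lr_schedule(initial_lr, gamma, step_size, n_epochs):
    """Learning rate used in each epoch under a StepLR-style decay."""
    return [initial_lr * gamma ** (epoch // step_size) for epoch in range(n_epochs)]


lrs = step_lr_schedule(initial_lr=1e-3, gamma=0.9, step_size=1, n_epochs=3)
for epoch, lr in enumerate(lrs, start=1):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

With only three epochs the decay is mild: the third epoch still runs at 81% of the initial learning rate.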

# Evaluation

In [0]:
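def trainAndSafe(embedding_dim, filter_dim, ng_list):
  # train() and test() read these names from module scope, so rebind the
  # globals here; otherwise the freshly built model would never be trained.
  global model, criterion, optimizer, scheduler

  nga, ngb, ngc = ng_list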
  filename = f"/content/gdrive/My Drive/task3_ed-fd-ng_{embedding_dim, filter_dim, nga, ngb, ngc}"
  model = CreativeNN(VOCAB_SIZE, embedding_dim, filter_dim, nga, ngb, ngc, NUM_CLASSES).to(device)

  INIT_VAL_LOSS = float('inf')
  criterion = nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(model.parameters())
  scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)
  
  size = len(train_dataset)
  split_size = int(size * TRAIN_RATIO)
  train_dataset_split, validation_dataset_split = random_split(train_dataset, [split_size, size - split_size])
  
  for epoch in range(N_EPOCHS):

        start_time = time.time()
        train_loss, train_acc = train(train_dataset_split)
        valid_loss, valid_acc = test(validation_dataset_split)
        test_loss, test_acc = test(test_dataset)
    
        secs = int(time.time() - start_time)
        mins = secs // 60
        secs = secs % 60

       
        data = {
            'epoch': epoch,
            'mins': mins,
            'secs': secs,
            'model_state_dict': model.state_dict(),
            'train_loss': train_loss,
            'train_acc': train_acc,
            'valid_loss': valid_loss,
            'valid_acc': valid_acc,
            'test_loss': test_loss,
            'test_acc': test_acc
        }

        torch.save(data, f"{filename}_{epoch}")

        print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
        print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
        print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')
        print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

def loadModel(embedding_dim, filter_dim, ng_list, epoch):
  nga, ngb, ngc = ng_list

  filename = f"/content/gdrive/My Drive/task3_ed-fd-ng_{embedding_dim, filter_dim, nga, ngb, ngc}"
  data = torch.load(f"{filename}_{epoch}")

  return data


def d2s(data, params, epoch=-1):
    if epoch == -1 or epoch-1 == data['epoch']:
      print(f"Task3: {params}")
      print('Epoch: %d' %(data["epoch"] + 1), " | time in %d minutes, %d seconds" %(data["mins"], data["secs"]))
      print(f'\tLoss: {data["train_loss"]:.4f}(train)\t|\tAcc: {data["train_acc"] * 100:.1f}%(train)')
      print(f'\tLoss: {data["valid_loss"]:.4f}(valid)\t|\tAcc: {data["valid_acc"] * 100:.1f}%(valid)')


def bestAcc(data, epoch, previousBestAcc, params):
  if data['epoch'] == epoch and data['valid_acc'].item() > previousBestAcc[0]:
    return [data['valid_acc'].item(), params]
  return previousBestAcc


# Do your experiments here

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')


Mounted at /content/gdrive


In [0]:
TRAIN = False

if TRAIN:
  trainAndSafe(32, 32, [2, 3, 4])
  trainAndSafe(32, 64, [2, 3, 4])
  trainAndSafe(64, 32, [2, 3, 4])
  trainAndSafe(64, 64, [2, 3, 4])

In [0]:
hyperparams = [[32, 32, [2, 3, 4]],
               [32, 64, [2, 3, 4]],
               [64, 32, [2, 3, 4]],
               [64, 64, [2, 3, 4]]]


bestAccForEpoch = [[0]] * N_EPOCHS

for hps in hyperparams:
  for epoch in range(N_EPOCHS):
    previousBest = bestAccForEpoch[epoch]
    try:
      data = loadModel(*hps, epoch)
      bestAccForEpoch[epoch] = bestAcc(data, epoch, previousBest, hps)
      d2s(data, hps, N_EPOCHS)
    except FileNotFoundError:
      # No checkpoint saved for this configuration/epoch; skip it
      pass

In [0]:

trainAndSafe(64, 64, [2, 3, 4])

108000it [07:17, 246.88it/s]
21it [00:00, 201.87it/s]

Epoch: 1  | time in 7 minutes, 34 seconds
	Loss: 0.3134(train)	|	Acc: 91.4%(train)
	Loss: 0.2919(valid)	|	Acc: 91.9%(valid)
	Loss: 0.3872(test)	|	Acc: 89.9%(test)


108000it [07:16, 247.41it/s]
21it [00:00, 209.98it/s]

Epoch: 2  | time in 7 minutes, 33 seconds
	Loss: 0.2466(train)	|	Acc: 93.2%(train)
	Loss: 0.3735(valid)	|	Acc: 90.5%(valid)
	Loss: 0.5121(test)	|	Acc: 88.3%(test)


108000it [07:17, 208.68it/s]


Epoch: 3  | time in 7 minutes, 34 seconds
	Loss: 0.1873(train)	|	Acc: 94.9%(train)
	Loss: 0.4036(valid)	|	Acc: 91.5%(valid)
	Loss: 0.5482(test)	|	Acc: 90.1%(test)


| embedding_dim | filter_dim | kernel heights | train acc | val acc |
|---------------|------------|----------------|-----------|---------|
| 32            | 32         | [2, 3, 4]      | 91.5%     | 90.3%   |
| 32            | 64         | [2, 3, 4]      | 91.8%     | 88.5%   |
| 64            | 32         | [2, 3, 4]      | 91.7%     | 89.6%   |
| 64            | 64         | [2, 3, 4]      | 91.6%     | 90.4%   |
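
The validation-set selection behind this table can be sketched in a few lines. The numbers are copied from the table above; the variable names are illustrative only:

```python
# (embedding_dim, filter_dim) -> (train acc %, val acc %), copied from the table
results = {
    (32, 32): (91.5, 90.3),
    (32, 64): (91.8, 88.5),
    (64, 32): (91.7, 89.6),
    (64, 64): (91.6, 90.4),
}

# Pick the configuration with the highest validation accuracy
best_config = max(results, key=lambda cfg: results[cfg][1])
best_train, best_val = results[best_config]
print(best_config, best_val)  # (64, 64) 90.4
```

Selecting on validation accuracy rather than training accuracy matters here: the configuration with the highest training accuracy, (32, 64), has the lowest validation accuracy.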

# Model Architecture:
1. A single convolutional layer using 3 convolutions with kernels of the following sizes:

    *   [2 x embedding_dimension]
    *   [3 x embedding_dimension]
    *   [4 x embedding_dimension]
  
  The input to the convolution layer is the word embeddings of a single article, produced by the embedding layer.
2. The embedding dimension is a hyperparameter. The candidates are 32 and 64, and we select between them using the validation set approach.
3. Each convolution is followed by a ReLU activation.
4. Since the width of each kernel equals the embedding_dimension, the kernels can only slide downwards, i.e. along the height of the input. The height of the input is the number of words in the input (passage). We use a stride of **1** for each kernel.
5. As a result, the output of each kernel is a vector rather than a matrix (effectively a one-dimensional convolution).
6. The number of kernels of each size is a hyperparameter, which we call '*filter_dim*'. The candidates are 32 and 64, and we select between them using the validation set.
7. After the ReLU, we apply max pooling, which simply selects the highest value from each vector. The max pooling outputs of all kernels are concatenated into a new vector of length 3 x filter_dim, which is fed to a feedforward neural network.
8. The feedforward network has 1 input layer and 1 output layer, with no hidden layer. We use the cross-entropy function to calculate the loss.
9. To optimize our network, we use **Adam**, which adapts the learning rate per parameter.
10. We use a batch size of one. Articles have different numbers of words, and stacking them into one batch tensor requires every article in the batch to have the same length.
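
The sizes implied by the steps above can be worked out with simple arithmetic. The sketch below is plain Python and only computes sizes; the function name `conv_feature_shapes` is hypothetical. It also makes one caveat visible: an article with fewer words than the tallest kernel leaves no valid position for that kernel.

```python
def conv_feature_shapes(num_words, kernel_heights, filter_dim):
    """Sizes produced by stride-1 kernels whose width equals the embedding dim.

    Returns a tuple (conv_lengths, feature_size): conv_lengths maps each
    kernel height k to the length of its output vector, num_words - k + 1;
    feature_size is the length of the concatenated vector after max pooling.
    """
    if num_words < max(kernel_heights):
        raise ValueError("input is shorter than the tallest kernel")
    conv_lengths = {k: num_words - k + 1 for k in kernel_heights}
    # Max pooling keeps one value per filter, so each kernel size
    # contributes filter_dim features regardless of num_words.
    feature_size = len(kernel_heights) * filter_dim
    return conv_lengths, feature_size


conv_lengths, feature_size = conv_feature_shapes(
    num_words=40, kernel_heights=[2, 3, 4], filter_dim=64)
print(conv_lengths)   # {2: 39, 3: 38, 4: 37}
print(feature_size)   # 192
```

Because the pooled feature size does not depend on the article length, the final linear layer can have a fixed input size even though articles vary in length.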












# Conclusion

The model yields results comparable to the first two. Although we did not use batching in this task (to avoid introducing noise as in part 2), training does not take much time compared to the LSTM. Without batching, training the LSTM for a given set of hyperparameters took around 1.5 hours, whereas the CNN approach takes only about 15 minutes for one epoch. While the model did not perform better than the first or second task, we are confident that it could surpass both previous networks with further hyperparameter tuning. Introducing a batch size greater than 1 would also speed up training.

We think it might be able to outperform the first model because it also gets a notion of bi-grams, tri-grams and even four-grams through kernels of these sizes. The convolution thus learns features comparable to the features learnt in the first task. All in all, we have shown that all three approaches work very well for categorizing articles.
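
Batching, mentioned above as a possible speed-up, requires all articles in a batch to have the same length. One common way to get there is to pad the shorter articles. The following is a sketch only, not the code used for the reported results: the `collate_padded` helper, the padding id `PAD_IDX = 0`, and the minimum length of 4 (the tallest kernel) are our assumptions, and padding tokens would still pass through the embedding and the convolutions.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0   # assumed padding id; must match the vocabulary actually used
MIN_LEN = 4   # tallest kernel height, so every kernel has a valid position


def collate_padded(batch):
    """Turn a list of (label, token_id_tensor) pairs into padded batch tensors."""
    labels = torch.tensor([label for label, _ in batch])
    texts = [text for _, text in batch]
    padded = pad_sequence(texts, batch_first=True, padding_value=PAD_IDX)
    # Right-pad further if the longest article is still shorter than MIN_LEN
    if padded.shape[1] < MIN_LEN:
        extra = padded.new_full(
            (padded.shape[0], MIN_LEN - padded.shape[1]), PAD_IDX)
        padded = torch.cat([padded, extra], dim=1)
    return labels, padded


batch = [(2, torch.tensor([5, 9, 3])), (0, torch.tensor([7, 1, 4, 8, 6]))]
labels, texts = collate_padded(batch)
print(labels.shape, texts.shape)  # torch.Size([2]) torch.Size([2, 5])
```

Such a function could be passed as `collate_fn` to `torch.utils.data.DataLoader`. The model's `forward` method would also need changes, since it currently squeezes away the batch dimension and concatenates the pooled features along dimension 0.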