# Project 3: Text Classification in PyTorch

## Instructions

* All the tasks that you need to complete in this project are either coding tasks (mentioned inside the code cells of the notebook with `#TODO` notations) or theoretical questions that you need to answer by editing the markdown question cells.
* **Please make sure you read the [Notes](#Important-Notes) section carefully before you start the project.**

## Introduction
This project deals with neural text classification using PyTorch. Text classification is the process of assigning tags or categories to text according to its content. It's one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

Text classification algorithms are at the heart of a variety of software systems that process text data at scale. Email software uses text classification to determine whether incoming mail is sent to the inbox or filtered into the spam folder. Discussion forums use text classification to determine whether comments should be flagged as inappropriate.

**_Example:_** A simple example of text classification would be Spam Classification. Consider the bunch of emails that you would receive in the your personal inbox if the email service provider did not have a spam filter algorithm. Because of the spam filter, spam emails get redirected to the Spam folder, while you receive only non-spam ("_ham_") emails in your inbox.

![](http://blog.yhat.com/static/img/spam-filter.png)

## Task
Here, we want you to focus on a specific type of text classification task, "Document Classification into Topics". It can be addressed as classifying text data or even large documents into separate discrete topics/genres of interest.


![](https://miro.medium.com/max/700/1*YWEqFeKKKzDiNWy5UfrTsg.png)

In this project, you will be working on classifying given text data into discrete topics or genres. You are given a bunch of text data, each of which has a label attached. We ask you to learn why you think the contents of the documents have been given these labels based on their words. You need to create a neural classifier that is trained on this given information. Once you have a trained classifier, it should be able to predict the label for any new document or text data sample that is fed to it. The labels need not have any meaning to us, nor to you necessarily.

## Data
There are various datasets that we can use for this purpose. This tutorial shows how to use the text classification datasets in the PyTorch library ``torchtext``. There are different datasets in this library like `AG_NEWS`, `SogouNews`, `DBpedia`, and others. This project will deal with training a supervised learning algorithm for classification using one of these datasets. In task 1 of this project, we will work with the `AG_NEWS` dataset.

## Load Data

A bag of **ngrams** feature is applied to capture some partial information about the local word order. In practice, bi-grams or tri-grams are applied to provide more benefits as word groups than only one word.

**Example:**

*"I love Neural Networks"*
* **Bi-grams:** "I love", "love Neural", "Neural Networks"
* **Tri-grams:** "I love Neural", "love Neural Networks"

In the code below, we have loaded the `AG_NEWS` dataset from the ``torchtext.datasets.TextClassification`` package with bi-grams feature. The dataset supports the ngrams method. By setting ngrams to 2, the example text in the dataset will be a list of single words plus bi-grams string.

In [11]:
"""
Load the AG_NEWS dataset in bi-gram features format.
"""

!pip install torchtext==0.4

import torch
import torchtext
from torchtext.datasets import text_classification
import os

from tqdm import tqdm

NGRAMS = 2

if not os.path.isdir('./.data'):
    os.mkdir('./.data')

train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)

BATCH_SIZE = 16

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")




0lines [00:00, ?lines/s][A
1712lines [00:00, 17110.76lines/s][A
3264lines [00:00, 16437.48lines/s][A
5018lines [00:00, 16751.88lines/s][A
8413it [02:00, 79.68it/s]
8102lines [00:00, 15842.32lines/s][A
9513lines [00:00, 15279.15lines/s][A
11004lines [00:00, 15164.38lines/s][A
12453lines [00:00, 14953.32lines/s][A
13909lines [00:00, 14831.50lines/s][A
15412lines [00:01, 14889.60lines/s][A
16856lines [00:01, 14734.87lines/s][A
18310lines [00:01, 14673.60lines/s][A
19926lines [00:01, 15089.85lines/s][A
21423lines [00:01, 14911.66lines/s][A
22958lines [00:01, 15039.82lines/s][A
24590lines [00:01, 15402.01lines/s][A
26295lines [00:01, 15861.03lines/s][A
27969lines [00:01, 16114.04lines/s][A
29662lines [00:01, 16349.57lines/s][A
31310lines [00:02, 16387.82lines/s][A
32951lines [00:02, 16365.78lines/s][A
34589lines [00:02, 16031.64lines/s][A
36195lines [00:02, 15524.31lines/s][A
37753lines [00:02, 15499.84lines/s][A
39307lines [00:02, 15190.12lines/s][A
40844lines [00

## Model

Our first simple model is composed of an [`EmbeddingBag`](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer and a linear layer.

``EmbeddingBag`` computes the mean value of a “bag” of embeddings. The text entries here have different lengths. ``EmbeddingBag`` requires no padding here since the text lengths are saved in offsets. Additionally, since ``EmbeddingBag`` accumulates the average across the embeddings on the fly, ``EmbeddingBag`` can enhance the performance and memory efficiency to process a sequence of tensors.

In [0]:
# TODO: Import the necessary libraries
from torch import nn

# TODO: Create a class TextClassifier. Remember that this class will be your model.
class TextClassifier(nn.Module):

    # TODO: Define the __init__() method with proper parameters
    # (vocabulary size, dimensions of the embeddings, number of classes)
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        # TODO: define the embedding layer
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
        # TODO: define the linear forward layer
        self.linear = nn.Linear(embed_dim, num_class)
        # TODO: Initialize weights
        self._initWeights()

    # TODO: Define a method to initialize weights.
    def _initWeights(self):
        self._setWeightsAndBias(self.embedding)
        self._setWeightsAndBias(self.linear)

        
    def _setWeightsAndBias(self, layer):
        # The weights should be random in the range of -0.5 to 0.5.
        stdv = 0.5
        layer.weight.data.uniform_(-stdv, stdv)
        # You can initialize bias values as zero.
        if hasattr(layer, "bias"):
            layer.bias.data.zero_()

    
    # TODO: Define the forward function.
    def forward(self, inputs, offsets):
        # This should calculate the embeddings and return the linear layer
        # with calculated embedding values.
        x = self.embedding(inputs, offsets)
        x = self.linear(x)
        return x


## Check your data before you proceed!

Okay, so we know that we are using the `AG_NEWS` dataset in this project, but do you know what does the data contain? What is the format of the data? How many classes of data are there in this dataset? We do not know, yet. Let's find out!


## Question 1:
Create a new cell in this notebook and try to analyze the dataset that we loaded for you before. Report the following:
* Vocabulary size (VOCAB_SIZE)
* Number of classes (NUM_CLASS)
* Names of the classes


## Answer 1:

In [13]:
def getVocabSize(dataset):
    return len(dataset.get_vocab())

def getClasses(dataset):
    return dataset.get_labels()

VOCAB_SIZE = getVocabSize(test_dataset)
CLASSES = getClasses(test_dataset)
NUM_CLASSES = len(CLASSES)


print(f"VOCAB_SIZE: {VOCAB_SIZE}")
print(f"NUM_CLASSES: {NUM_CLASSES}")
print(f"CLASSES: {CLASSES}")

VOCAB_SIZE: 1308844
NUM_CLASSES: 4
CLASSES: {0, 1, 2, 3}


## Create an instance for your model

Great! You have successfully completed a basic analysis of the data that you are going to work with. The vocab size is equal to the length of vocab (including single word and ngrams). The number of classes is equal to the number of labels. Copy paste the code statements you used in your analysis to complete the code below. Also, using these parameters, create an instance `model` of your text classifier `TextClassifier`.

In [0]:
'''
Paramters and model instance creation.
'''

# TODO: Instantiate the Vocabulary size and the number of classes
# from the training dataset that we loaded for you.

# Hint: Remember that these are PyTorch datasets. So, there should be 
# readily available functions that you can use to save time. ;)

VOCAB_SIZE = getVocabSize(train_dataset)
EMBED_DIM = 32
NUM_CLASS = len(getClasses(train_dataset))

# TODO: Instantiate the model with the parameters you defined above. 
# Remember to allocate it to your 'device' variable.

model = TextClassifier(VOCAB_SIZE, EMBED_DIM, NUM_CLASS).to(device)

## Generate batch

Since the text entries have different lengths, you need to create a custom function to generate data batches and offsets. This function should be passed to the ``collate_fn`` parameter in the ``DataLoader`` call of pyTorch which you will use to create the data later on. The input to ``collate_fn`` is a list of tensors with the size of batch_size, and the ``collate_fn`` function packs them into a mini-batch. Pay attention here and make sure that ``collate_fn`` is declared as a top level definition. This ensures that the function is available in each worker. This is the reason why you need to define this custom function first before you call DataLoader().

The text entries in the original data batch input are packed into a list and concatenated as a single tensor as the input of ``EmbeddingBag``. The offsets is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of individual text entries.

Finish the function definition below. The function should take batch as an input parameter. Each entry in the batch contains a pair of values of the text and the corresponding label.

In [0]:
# TODO: Finish the function definition.

def generate_batch(raw_batch):
    label = torch.tensor([entry[0] for entry in raw_batch])
    text = [entry[1] for entry in raw_batch]
    offsets = [0] + [len(entry) for entry in text]

    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text = torch.cat(text)
    
    return text, offsets, label

## Define the train function

Here, you need to define a function which you will use later on in the project to train your model. This is very similar to the training steps that you have encountered before in previous coding assignment(s). The outline of the function is something like this -

* load the data as batches
* iterate over the batches
* find the model output for a forward pass
* calculate the loss
* perform backpropagation on the loss (optimize it)
* find the training accuracy

In addition to this, you also need to find the total loss and total training accuracy values. Also, you need to return the average values of the total loss and total accuracy.

In [0]:
def train(train_data):

    # Initial values of training loss and training accuracy
    
    train_loss = 0
    train_acc = 0

    # TODO: Use the PyTorch DataLoader class to load the data 
    # into shuffled batches of appropriate sizes into the variable 'data'.
    # Remember, this is the place where you need to generate batches.
    data = torch.utils.data.DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=generate_batch, shuffle=True)
    
    
    for i, (text, offsets, cls) in tqdm(enumerate(data)):
        
        # TODO: What do you need to do in order to perform backprop on the optimizer?
        optimizer.zero_grad()
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)

        # TODO: Store the output of the model in variable 'output'
        output = model(text, offsets)
        
        
        # TODO: Define the 'loss' variable (with respect to 'output' and 'cls').
        loss = criterion(output, cls)
        # Also calculate the total loss in variable 'train_loss'
        train_loss += loss
        
        # TODO: Perform the backward propagation on 'loss' and 
        # optimize it through the 'optimizer' step
        loss.backward()
        optimizer.step()
        
        
        # TODO: Calculate and store the total training accuracy
        # in the variable 'total_acc'.
        # Remember, you need to find the 
        _, pred_labels = output.max(dim=1)
        accuracy = (pred_labels == cls).sum() / float(BATCH_SIZE)
        train_acc += accuracy
        

    # TODO: Adjust the learning rate here using the scheduler step
    scheduler.step()
    
    # TODO: CHANGE THIS
    return train_loss / len(data), train_acc / len(data)

## Define the test function

Using the framework of the `train()` function in the previous cell, try to figure out the structure of the test function below.

In [0]:
def test(test_data):
    
    # Initial values of test loss and test accuracy
    
    loss = 0
    acc = 0
    
    # TODO: Use DataLoader class to load the data
    # into non-shuffled batches of appropriate sizes.
    # Remember, you need to generate batches here too.
    data = torch.utils.data.DataLoader(test_data, batch_size=BATCH_SIZE, collate_fn=generate_batch, shuffle=False)
    
    
    for text, offsets, cls in data:
        
        text, offsets, cls = text.to(device), offsets.to(device), cls.to(device)
        
        # Hint: There is a 'hidden hint' here. Let's see if you can find it :)
        with torch.no_grad():
        
            
            # TODO: Get the model output
            output = model(text, offsets)
            
            
            
            # TODO: Calculate and add the loss to find total 'loss'
            loss += criterion(output, cls)
        
            
            
            # TODO: Calculate the accuracy and store it in the 'acc' variable
            _, pred_labels = output.max(dim=1)
            acc += (pred_labels == cls).sum() / float(BATCH_SIZE)
            

    return loss / len(data), acc / len(data)

## Split the dataset and run the model

The original `AG_NEWS` has no validation dataset. For this reason, you need to split the training dataset into training and validation sets with a proper split ratio. The `random_split()` function in the torch.utils core PyTorch library should be able to help you with this. We have already imported it for you. :)

* Consider the initial learning rate as 4.0, number of epochs as 5, training data ratio as 0.9.
* You need to define and use a proper loss function
* Define an Optimization algorithm (Suggestion: SGD)
* Define a scheduler function to adjust the learning rate through epochs (gamma parameter = 0.9).
(Hint: Look at the `StepLR` function)
* Monitor the loss and accuracy values for both training and validation data sets.

In [18]:
import time
from torch.utils.data.dataset import random_split

# TODO: Set the number of epochs and the learning rate to 
# their initial values here

# TODO: FIGURE THIS OUT
N_EPOCHS = 1
LEARNING_RATE = 4.0
TRAIN_RATIO = 0.9

# TODO: Set the intial validation loss to positive infinity
INIT_VAL_LOSS = float('inf')


# TODO: Use the appropriate loss function
criterion = nn.CrossEntropyLoss()


# TODO: Use the appropriate optimization algorithm with parameters (Suggested: SGD)
optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)


# TODO: Use a scheduler function
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.9)



# TODO: Split the data into train and validation sets using random_split()
size = len(train_dataset)
split_size = int(size * TRAIN_RATIO)
train_dataset_split, validation_dataset_split = random_split(train_dataset, [split_size, size - split_size])


# TODO: Finish the rest of the code below

for epoch in range(N_EPOCHS):

    start_time = time.time()
    train_loss, train_acc = train(train_dataset_split)
    valid_loss, valid_acc = test(validation_dataset_split)

    secs = int(time.time() - start_time)
    mins = secs / 60
    secs = secs % 60

    print('Epoch: %d' %(epoch + 1), " | time in %d minutes, %d seconds" %(mins, secs))
    print(f'\tLoss: {train_loss:.4f}(train)\t|\tAcc: {train_acc * 100:.1f}%(train)')
    print(f'\tLoss: {valid_loss:.4f}(valid)\t|\tAcc: {valid_acc * 100:.1f}%(valid)')


0it [00:00, ?it/s][A
14it [00:00, 133.66it/s][A
29it [00:00, 136.86it/s][A
44it [00:00, 138.40it/s][A
59it [00:00, 139.61it/s][A
74it [00:00, 140.98it/s][A
89it [00:00, 141.47it/s][A
104it [00:00, 142.39it/s][A
119it [00:00, 143.00it/s][A
134it [00:00, 142.57it/s][A
149it [00:01, 142.74it/s][A
164it [00:01, 143.54it/s][A
179it [00:01, 143.92it/s][A
194it [00:01, 144.08it/s][A
209it [00:01, 143.83it/s][A
224it [00:01, 143.18it/s][A
239it [00:01, 142.64it/s][A
254it [00:01, 141.96it/s][A
269it [00:01, 141.46it/s][A
284it [00:01, 141.47it/s][A
299it [00:02, 142.55it/s][A
314it [00:02, 143.06it/s][A
329it [00:02, 143.08it/s][A
344it [00:02, 142.82it/s][A
359it [00:02, 142.60it/s][A
374it [00:02, 143.44it/s][A
389it [00:02, 142.65it/s][A
404it [00:02, 142.28it/s][A
419it [00:02, 142.78it/s][A
434it [00:03, 141.81it/s][A
449it [00:03, 142.23it/s][A
464it [00:03, 143.03it/s][A
479it [00:03, 142.78it/s][A
494it [00:03, 142.73it/s][A
509it [00:03, 143.38it/s]

Epoch: 1  | time in 0 minutes, 47 seconds
	Loss: 0.4221(train)	|	Acc: 84.6%(train)
	Loss: 0.3135(valid)	|	Acc: 89.4%(valid)


## Let's  check the test loss and test accuracy

So you have trained your model and seen how well it performs on the training and validation datasets. Now, you need to check your model's performance against the test dataset. Using the test dataset as input, report the test loss and test accuracy scores of your model.

In [19]:
# TODO: Compete the code below to find 
# the results (loss and accuracy) on the test data

print('Checking the results of test dataset...')
test_loss, test_acc = test(test_dataset)
print(f'\tLoss: {test_loss:.4f}(test)\t|\tAcc: {test_acc * 100:.1f}%(test)')

Checking the results of test dataset...
	Loss: 0.3004(test)	|	Acc: 90.1%(test)


In [20]:
# importing necessary libraries

import re
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

# labels for the AG_NEWS dataset

ag_news_label = {1 : "World",
                 2 : "Sports",
                 3 : "Business",
                 4 : "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                            for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")

# TODO: Predict the topic of the above given random text (use bigrams)
# Use the proper paramters in the predict() function

print("This is a '%s' news" % ag_news_label[predict(ex_text_str, model, vocab, NGRAMS)])

# If you have done everything correctly in this task,
# then the output of this cell should be - "This is a 'Sports' news".

This is a 'Sports' news


# Congratulations! You just designed your first neural classifier!

And probably you have achieved a good accuracy score too. Great job!

## Question 2:
You just tested your model with a new sample text. Try to feed some more random examples of similar text (which you think are related to at least one of the four topics _"World", "Sports", "Business", "Sci/Tec"_ of our problem) to the model and see how your model reacts. Give at least 3 such examples (You are free to include more examples if you wish to).

## Answer 2:



## Question 3:
Okay, probably the model still works great with the examples you fed to it in the previous question. How about a twist in the plot? Let's feed it some more random text data from completely different genres/topics (not belonging to the 4 topics which we talk about the in the first question). How does your model react now? Give at least 3 such examples (You are free to include more examples if you wish to).

Of course the predictions will be limited to the four class labels that your model is trained on. Can you somehow justify the labels that your model predicted now for the given text inputs?

## Answer 3:

## Question 4:
Your model probably has achieved a good accuracy score. However, there may be lots of things that you could still try to do to improve your classifier model. Can you try to list down some improvements that you think would be able to improve the above model's performance?

_(Hint: Maybe think about alternate architectures, #layers, hyper-paramters, etc..., but try not to come up with too complex stuff! :) )_

## Answer 4:

# Task 2: Try the better option that you proposed

In Question 4, you have proposed some alternate solution that you think will be able to somehow improve your model. Following one of the options below, try to build and train a new model, and report the new loss and accuracy scores. Is it better than your initial classifier model for the same data?

For your reference, here are some neural models using which researchers have tried to classify text before:

* Recurrent Neural Networks (RNNs)
* Long-Short Term Memory (LSTM)
* Bi-directional LSTM (BiLSTM)
* Gated Recurrent Units (GRUs)


# The model

In [25]:
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers,
                 num_classes, dropout):
        """
        Initialize the model by setting up the layers.
        """
        super().__init__()
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.RNN(embedding_dim, hidden_size=hidden_size,
                            num_layers=num_layers, dropout=dropout, batch_first=True)
        # Last layer in lstm doesnt use dropout
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, inputs):
      x = self.embedding(inputs)
      x, (h_n, c_n) = self.lstm(x)
      x = self.dropout(h_n[-1])
      x = self.linear(x)
      return x

model = SentimentLSTM(VOCAB_SIZE, EMBED_DIM, 64, 2, NUM_CLASSES, 0.3).to(device)
print(model)

SentimentLSTM(
  (embedding): Embedding(1308844, 32)
  (lstm): RNN(32, 64, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (linear): Linear(in_features=64, out_features=4, bias=True)
)


# The training

In [0]:
def train(train_data):

    # Initial values of training loss and training accuracy
    
    train_loss = 0
    train_acc = 0

    # TODO: Use the PyTorch DataLoader class to load the data 
    # into shuffled batches of appropriate sizes into the variable 'data'.
    # Remember, this is the place where you need to generate batches.
    data = torch.utils.data.DataLoader(train_data, batch_size=1, shuffle=True)
    
    
    for i, (cls, text) in tqdm(enumerate(data)):
        
        # TODO: What do you need to do in order to perform backprop on the optimizer?
        optimizer.zero_grad()
        
        cls, text = cls.to(device), text.to(device)

        # TODO: Store the output of the model in variable 'output'
        output = model(text)

        # TODO: Define the 'loss' variable (with respect to 'output' and 'cls').
        loss = criterion(output, cls)
        # Also calculate the total loss in variable 'train_loss'
        train_loss += loss
        
        # TODO: Perform the backward propagation on 'loss' and 
        # optimize it through the 'optimizer' step
        loss.backward()
        optimizer.step()
        
        
        # TODO: Calculate and store the total training accuracy
        # in the variable 'total_acc'.
        # Remember, you need to find the 
        _, pred_labels = output.max(dim=1)
        
        accuracy = (pred_labels == cls).sum() / float(BATCH_SIZE)
        train_acc += accuracy
        

    # TODO: Adjust the learning rate here using the scheduler step
    scheduler.step()
    
    # TODO: CHANGE THIS
    return train_loss / len(data), train_acc / len(data)

# The testing

In [0]:
def test(test_data):
    
    # Initial values of test loss and test accuracy
    
    loss = 0
    acc = 0
    
    # TODO: Use DataLoader class to load the data
    # into non-shuffled batches of appropriate sizes.
    # Remember, you need to generate batches here too.
    data = torch.utils.data.DataLoader(test_data, batch_size=1, collate_fn=generate_batch, shuffle=False)
    
    
    for cls, text in data:
        
        cls, text = cls.to(device), text.to(device)
        
        # Hint: There is a 'hidden hint' here. Let's see if you can find it :)
        with torch.no_grad():
        
            
            # TODO: Get the model output
            output = model(text)
            
            
            
            # TODO: Calculate and add the loss to find total 'loss'
            loss += criterion(output, cls)
        
            
            
            # TODO: Calculate the accuracy and store it in the 'acc' variable
            _, pred_labels = output.max(dim=1)
            acc += (pred_labels == cls).sum() / float(BATCH_SIZE)
            

    return loss / len(data), acc / len(data)

In [28]:
data = torch.utils.data.DataLoader(train_dataset_split, batch_size=BATCH_SIZE, shuffle=True)

train(train_dataset_split)
test(validation_dataset_split)

# first = next(iter(data))
# inp = first[1].to(device)
# print(inp.shape)
# o = model(inp)
# print(o.shape)
# print(o)



0it [00:00, ?it/s][A[A

[A[A

IndexError: ignored


# Task 3: Let your creativity flow!

As discussed earlier, you are free to come up with anything in task 3. Think and try to model unique (not too complex!) neural architecture on your own. Remember that this model has to be novel as much as possible, so try not to copy other people's existing work. Using the same data, train the new model, and report the accuracy scores. How much better/worse is this model than the previous two models? Why do you think this is better/worse?


In [0]:
class CreativeNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment analysis.
    """

    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers,
                 num_classes, dropout, glove):
        """
        Initialize the model by setting up the layers.
        """
        super().__init__()
        
        # embedding and LSTM layers
        self.embedding = nn.Embedding.from_pretrained(glove)
        self.cnn1 = (embedding_dim, cnn_dim1, (3, embedding_dim))
        self.cnn2 = (embedding_dim, cnn_dim2, (5, embedding_dim))
        self.cnn3 = (embedding_dim, cnn_dim3, (7, embedding_dim))
        self.pooling = nn.MaxPool2d(2, 2)
        self.linear = nn.Linear(num_features, num_classes)

        

    def forward(self, inputs):
      x = self.embedding(inputs)
      x1 = self.pooling(nn.relu(self.cnn1(x)))
      x2 = self.pooling(nn.relu(self.cnn2(x)))
      x3 = self.pooling(nn.relu(self.cnn3(x)))
      x = x1 + x2 + x3
      x = self.linear(x)
      return x

model = CreativeCNN(VOCAB_SIZE, EMBED_DIM, 64, 2, NUM_CLASSES, 0.3).to(device)
print(model)


# Important Notes

## NOTE 1:
If you want, you can try out the models on other datasets too for comparisons. Although this is not mandatory, it would be really interesting to see how your model performs for data from different domains maybe. Note that you may need to tweak the code a little bit when you are considering other datasets and formats. 

## NOTE 2:
Any form of plagiarism is strictly prohibited. If it is found that you have copied sample code from the internet, the entire team will be penalized.

## NOTE 3:
Often Jupyter Notebooks tend to stop working or crash due to overload of memory (lot of variables, big neural models, memory-intensive training of models, etc...). Moreover, with more number of tasks, the number of variables that you will be using will surely incerase. Therefore, it is recommended that you use separate notebooks for each _Task_ in this project.

## NOTE 4:
You are expected to write well-documented code, that is, with proper comments wherever you think is needed. Make sure you write a comprehensive report for the entire project consisting of data analysis, your model architecture, methods used, discussing and comparing the models against the accuracy and loss metrics, and a final conslusion. If you want to prepare separate reports for each _Task_, you could do this in the Jupyter Notebook itself using $Mardown$ and $\LaTeX$ code if needed. If you want to submit a single report for the entire project, you could submit a PDF file in that case (Word or $\LaTeX$).

All the very best for project 2. Wishing you happy holidays and a very happy new year in advance! :)