# Text Classification with Neural Networks
Implement classifiers based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to detect the sentiment of movie reviews from the IMDb movie reviews dataset.

We recommend running this notebook on Google Colab instead of your local computer, which avoids the hassle of installing the necessary Python packages locally. Select "GPU" as the runtime type, as this will speed up the training of your models: go to <TT>Runtime > Change Runtime Type</TT> and select "GPU" from the dropdown menu.

In [None]:
from collections import defaultdict
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils import data
import torchtext 
import random

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

if __name__=='__main__':
    print('Using device:', device)

Using device: cuda


# Step 1: Download the Data
First, download the dataset using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), which is a package that supports NLP for PyTorch. The following cell will get the `train_data` and `test_data`. It also does some basic tokenization.

*   To access the list of textual tokens for the *i*th example, use `train_data[i][1]`
*   To access the label for the *i*th example, use `train_data[i][0]`
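
As a quick illustration of this layout, the sketch below builds two made-up examples (invented reviews, not taken from IMDb) in the same `(label, tokens)` form and reads them back by index. The name `toy_data` is hypothetical; the label strings `'pos'` and `'neg'` are the ones the dataset class later in this notebook checks for.

```python
# Two hand-written examples in the same (label, tokens) layout as train_data.
# The reviews themselves are invented for illustration.
toy_data = [
    ('pos', ['A', 'great', 'movie', '!']),
    ('neg', ['Not', 'worth', 'watching', '.']),
]

i = 1
label = toy_data[i][0]    # label of the i-th example
tokens = toy_data[i][1]   # list of textual tokens of the i-th example

print('Label:', label)
print('Tokens:', tokens)
```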



In [None]:
def preprocess(review):
    '''
    Simple preprocessing function.
    '''
    res = []
    for x in review.split(' '):
        remove_beg=True if x[0] in {'(', '"', "'"} else False
        remove_end=True if x[-1] in {'.', ',', ';', ':', '?', '!', '"', "'", ')'} else False
        if remove_beg and remove_end: res += [x[0], x[1:-1], x[-1]]
        elif remove_beg: res += [x[0], x[1:]]
        elif remove_end: res += [x[:-1], x[-1]]
        else: res += [x]
    return res

if __name__=='__main__':
    train_data = torchtext.datasets.IMDB(root='.data', split='train')
    train_data = list(train_data)
    train_data = [(x[0], preprocess(x[1])) for x in train_data]
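    # Use the first 10,000 reviews of each 12,500-review half for training and the rest for testing
    train_data, test_data = (
        train_data[0:10000] + train_data[12500:12500+10000],
        train_data[10000:12500] + train_data[12500+10000:],
    )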

    print('Num. Train Examples:', len(train_data))
    print('Num. Test Examples:', len(test_data))

    print("\nSAMPLE DATA:")
    for x in random.sample(train_data, 5):
        print('Sample text:', x[1])
        print('Sample label:', x[0], '\n')
        



Num. Train Examples: 20000
Num. Test Examples: 5000

SAMPLE DATA:
Sample text: ['The', 'DVD', 'version', 'consists', 'of', '2', 'episodes', ',', 'the', 'parricide', 'of', 'Caesar', 'being', 'the', 'juncture', '.', 'In', 'addition', ',', 'the', 'language', 'was', 'Spanish', 'without', 'subtitles', '.', 'Hence', ',', "it's", 'hard', 'for', 'me', 'to', 'review', 'in', 'depth', 'this', 'movie', 'because', 'because', 'i', "didn't", 'understand', 'what', 'was', 'said.<br', '/><br', '/>Cleopatra', 'being', 'an', 'historic', 'icon', ',', 'the', 'part', 'is', 'very', 'difficult', 'and', 'i', 'found', 'that', 'for', 'a', 'newcomer', ',', 'Leonor', 'Varela', 'just', 'plays', 'fine', '.', 'She', 'is', 'strong-willed', 'but', 'also', 'a', 'very', 'supportive', ',', 'tender', 'soul', 'mate', '.', 'Thimothy', 'Dalton', 'as', 'Caesar', 'is', 'perfect', 'and', 'their', 'romance', 'is', 'the', 'main', 'thing', 'of', 'the', 'first', 'episode', '.', 'So', ',', 'it', 'is', 'not', 'really', 'a', 'documentar

# Step 2: Create Dataloader




## Define the Dataset Class

The dataset contains the tokenized data for the model. The following functions will be implemented: 

*   <b>`build_dictionary(self)`:</b> Creates the dictionaries `idx2word` and `word2idx`. Each word in the dataset is represented by a unique index, and these dictionaries keep track of the mapping. Use the hyperparameter `threshold` to control which words appear in the dictionary: a training word’s frequency should be `>= threshold` to be included in the dictionary.

*   <b>`convert_text(self)`:</b> Converts each review in the dataset to a list of indices, given by the `word2idx` dictionary. Store the result in the `textual_ids` variable; the function does not return anything. If a word is not present in the `word2idx` dictionary, use the `<UNK>` token for that word. Be sure to append the `<END>` token to the end of each review.

*   <b>`get_text(self, idx)`:</b> Return the review at `idx` in the dataset as an array of indices corresponding to the words in the review. If the length of the review is less than `max_len`, pad the review with the `<PAD>` token up to the length of `max_len`. If the length is greater than `max_len`, return only the first `max_len` words. The return type should be `torch.LongTensor`.

*   <b>`get_label(self, idx)`:</b> Return `1` if the label for `idx` in the dataset is positive (stored as `'pos'`), `0` if it is negative (stored as `'neg'`). The return type should be `torch.LongTensor`.

*   <b>`__len__(self)`:</b> Return the total number of reviews in the dataset as an `int`.

*   <b>`__getitem__(self, idx)`:</b> Return the (padded) text and the label. The return type for both items should be `torch.LongTensor`.


<b>Note:</b> Convert all words to lower case in these functions.
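
The vocabulary, `<UNK>`, `<END>` and padding rules above can be sketched in plain Python before wiring them into the dataset class. This is a minimal sketch of one possible reading of the specification, not the notebook's reference solution: the helper names `build_vocab` and `encode` and the toy reviews are hypothetical.

```python
from collections import Counter

PAD, END, UNK = '<PAD>', '<END>', '<UNK>'

def build_vocab(token_lists, threshold):
    """Map each lowercased word seen at least `threshold` times to an index."""
    counts = Counter(tok.lower() for tokens in token_lists for tok in tokens)
    word2idx = {PAD: 0, END: 1, UNK: 2}
    for word, freq in counts.items():
        if freq >= threshold:
            word2idx[word] = len(word2idx)  # next free index, starting at 3
    return word2idx

def encode(tokens, word2idx, max_len):
    """Convert tokens to ids, append <END>, then truncate or pad to max_len."""
    ids = [word2idx.get(tok.lower(), word2idx[UNK]) for tok in tokens]
    ids.append(word2idx[END])
    ids = ids[:max_len]
    return ids + [word2idx[PAD]] * (max_len - len(ids))

# Invented toy reviews: only "good" occurs at least twice.
toy_reviews = [['Good', 'film'], ['good', 'plot', 'good', 'cast']]
vocab = build_vocab(toy_reviews, threshold=2)

short_ids = encode(['Good', 'acting'], vocab, max_len=5)  # [3, 2, 1, 0, 0]
long_ids = encode(['good'] * 6, vocab, max_len=5)         # [3, 3, 3, 3, 3]
print(vocab)
print(short_ids)
print(long_ids)
```

In the short review, `acting` is unknown and maps to `<UNK>` (2), `<END>` (1) follows the last word, and `<PAD>` (0) fills the rest. In the long review, truncation to `max_len` cuts off the `<END>` token.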

In [None]:
PAD = '<PAD>'
END = '<END>'
UNK = '<UNK>'

class TextDataset(data.Dataset):
    def __init__(self, examples, split, threshold, max_len, idx2word=None, word2idx=None):

        self.examples = examples
        assert split in {'train', 'val', 'test'}
        self.split = split
        self.threshold = threshold
        self.max_len = max_len

        # Dictionaries
        self.idx2word = idx2word
        self.word2idx = word2idx
        if split == 'train':
            self.build_dictionary()
        self.vocab_size = len(self.word2idx)
        
        # Convert text to indices
        self.textual_ids = [] # one LongTensor of word indices per review, filled by convert_text()
        self.convert_text()

    
    def build_dictionary(self): 
        '''
        Build the dictionaries idx2word and word2idx. This is only called when split='train', as these
        dictionaries are passed in to the __init__(...) function otherwise. Be sure to use self.threshold
        to control which words are assigned indices in the dictionaries.
        Returns nothing.
        '''
        assert self.split == 'train'
        
        # Don't change this
        self.idx2word = {0:PAD, 1:END, 2: UNK}
        self.word2idx = {PAD:0, END:1, UNK: 2}

        # Count the frequencies of all words in the training data (self.examples)
        # Assign idx (starting from 3) to all words having word_freq >= self.threshold
        # Make sure to call word.lower() on each word to convert it to lowercase
        temp_dic = {}
        
        # In the following, the [1] after [i] indicate the list of textual tokens for the ith example as mentioned in
        # Step 1: Download the Data: To access the list of textual tokens for the ith example, use train_data[i][1]

        for i in range(len(self.examples)):
            for j in range(len(self.examples[i][1])):
                token_lower_case = self.examples[i][1][j].lower()
                # save the lower case version
                self.examples[i][1][j] = token_lower_case
                # increment the count 
                temp_dic[token_lower_case] = temp_dic.get(token_lower_case, 0) + 1
        
        # starting idx from 3
        idx = 3
        for token in temp_dic.keys():
            if temp_dic[token] >= self.threshold:
                self.idx2word[idx] = token
                self.word2idx[token] = idx
                idx += 1
    
    def convert_text(self):
        '''
        Convert each review in the dataset (self.examples) to a list of indices, given by self.word2idx.
        Store this in self.textual_ids; returns nothing.
        '''

        # Remember to replace a word with the <UNK> token if it does not exist in the word2idx dictionary.
        # Remember to append the <END> token to the end of each review.
        # Note: both steps above are implemented in the get_text() function.
        for i in range(len(self.examples)):
            self.textual_ids.append(self.get_text(i))


    def get_text(self, idx):
        '''
        Return the review at idx as a long tensor (torch.LongTensor) of integers corresponding to the words in the review.
        You may need to pad as necessary (see above).
        '''
        # In the following, index [1] accesses the textual tokens (see Step 1:
        # "To access the list of textual tokens for the ith example, use train_data[i][1]").
        review_text = self.examples[idx][1]

        # Map each lowercased token to its index. If a token is not in word2idx, get() falls back
        # to the index of <UNK> (2). Lowercasing here matters for the val/test splits, whose
        # tokens are never lowercased by build_dictionary().
        txt_id_list = [self.word2idx.get(token.lower(), self.word2idx[UNK]) for token in review_text]
        # As required in Step 2, append the <END> token (index 1) to the end of the review.
        txt_id_list.append(self.word2idx[END])

        # Keep only the first max_len entries, then pad with <PAD> (index 0) up to max_len.
        txt_id_list = txt_id_list[:self.max_len]
        txt_id_list += [self.word2idx[PAD]] * (self.max_len - len(txt_id_list))

        return torch.LongTensor(txt_id_list)
    
    def get_label(self, idx):
        '''
        This function should return the value 1 if the label for idx in the dataset is 'positive', 
        and 0 if it is 'negative'. The return type should be torch.LongTensor.
        '''
        # Note: in the following, the [0] after [idx] indicate the label for the ith example as mentioned in
        # Step 1: Download the Data: To access the label for the ith example, use train_data[i][0]

        # The label for "positive" is "pos" and the label for "negative" is "neg", so DO NOT compare
        # self.examples[idx][0] against "positive" below. Otherwise every example would get the
        # "negative" label, and the train and test accuracy would both be a suspicious 100%.
        if self.examples[idx][0] == "pos":
            return torch.squeeze(torch.LongTensor([1]))
        else:
            return torch.squeeze(torch.LongTensor([0]))

    def __len__(self):
        '''
        Return the number of reviews (int value) in the dataset
        '''
        return len(self.examples)
    
    def __getitem__(self, idx):
        '''
        Return the review, and label of the review specified by idx.
        '''

        text_ids = self.get_text(idx)
        label = self.get_label(idx)
        return text_ids, label

In [None]:
if __name__=='__main__':
    # Sample item
    Ds = TextDataset(train_data, 'train', threshold=10, max_len=150)
    print('Vocab size:', Ds.vocab_size)

    # random.randint includes both endpoints, so the upper bound must be len(Ds) - 1
    text, label = Ds[random.randint(0, len(Ds) - 1)]
    print('Example text:', text)
    print('Example label:', label)

Vocab size: 19002
Example text: tensor([   41,   458,    49,   472,     3,   536, 13647,    27,    41,  8372,
          391,    91,    17,    19,  2484,    24,   233,  2498,  5120,   932,
           24,    17,   633,  1146,    70,  4269,   934,   127,    41,  7450,
           60,    24,  7292, 14305,  1512,   863,  2610,   105,    74,  2526,
         4384,    11,  1048,    38,  1320,    38,    91,  1569,   339,   283,
           24,   724, 14306,    38,    13,   969,    38,   103,   129,    41,
         3238,  6920,   157,    15,   224,    13,  3243,   837,    11, 11526,
         6468,    22,    13,  5708,  7414,   351,    24,    50,    12,    11,
          124,  1133,    38, 13647,  2934,   105,   124,  3018,    24,   232,
           55,   302,  3302,    50,    41,    20,   323,   969,    38,   511,
          105,    41,  1965,  1133,    24,   732,    55,    41,   136,   551,
           24,     2,     2,    38,  5365,  3783, 13834,    38, 10838,     2,
           38,    91,     2,    

# Step 3: Train a Convolutional Neural Network (CNN)

## Define the CNN Model
Define a convolutional neural network for text classification.
In particular, pay attention to the desired tensor shapes, and print them out if necessary for debugging. Also refer to the PyTorch documentation for the modules and functions used, since it describes their input and output dimensions.
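
To illustrate the kind of shape check meant here, the sketch below pushes a random batch through an embedding layer and a single convolution and records the shape after each step. The sizes are made up for illustration and are not the notebook's hyperparameters; only one filter height is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only; they are not the notebook's hyperparameters.
batch_size, max_len, vocab_size = 4, 10, 50
embed_size, out_channels, filter_height = 8, 6, 3

texts = torch.randint(0, vocab_size, (batch_size, max_len))   # [4, 10]
embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
conv = nn.Conv2d(1, out_channels, (filter_height, embed_size))

embedded = embedding(texts)                  # [4, 10, 8]
conv_in = embedded.unsqueeze(1)              # [4, 1, 10, 8]: add the single input channel
conv_out = F.relu(conv(conv_in)).squeeze(3)  # [4, 6, 8]: 10 - 3 + 1 = 8 positions
pooled = conv_out.max(dim=2).values          # [4, 6]: max over positions

for name, t in [('embedded', embedded), ('conv_in', conv_in),
                ('conv_out', conv_out), ('pooled', pooled)]:
    print(name, tuple(t.shape))
```

With stride 1, a filter of height `h` fits into `max_len - h + 1` positions, which is why the third dimension of `conv_out` differs between filter heights and has to be removed by max pooling before the outputs can be concatenated.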

In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, embed_size, out_channels, filter_heights, stride, dropout, num_classes, pad_idx):
        super(CNN, self).__init__()
        
        # Create an embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
        #   to represent the words in your vocabulary. Make sure to use vocab_size, embed_size, and pad_idx here.
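        #   Note: pad_idx = 0, refer to the definition above: self.word2idx = {PAD:0, END:1, UNK: 2}
        #   Passing padding_idx keeps the <PAD> embedding fixed at zero during training.
        self.EmbeddingLayer = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)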

        # Define multiple Convolution layers (nn.Conv2d) with filter (kernel) size [filter_height, embed_size] based on 
        #   different filter_heights.
        # Input channels will be 1 and output channels will be out_channels (these many different filters will be trained 
        #   for each convolution layer)
        # Note: even though the conv layers are nn.Conv2d, we are doing a 1d convolution since we are only moving the filter 
        #   in one direction

        # Note: in the following, "1" is the number of input channels mentioned above.
        # There are 3 CNN layers (num_of_cnn_layers below), one per entry of filter_heights; they are
        # applied side by side to the same input, not stacked. This assumes exactly 3 filter heights.
        self.conv1 = nn.Conv2d(1, out_channels, (filter_heights[0], embed_size), stride=stride)
        self.conv2 = nn.Conv2d(1, out_channels, (filter_heights[1], embed_size), stride=stride)
        self.conv3 = nn.Conv2d(1, out_channels, (filter_heights[2], embed_size), stride=stride)

        # Create a dropout layer (nn.Dropout) using dropout
        self.DropoutLayer = nn.Dropout(dropout)

        # Define a linear layer (nn.Linear) that consists of num_classes units 
        # and takes as input the concatenated output for all cnn layers (out_channels * num_of_cnn_layers units)
        # Note: in the following "3" is the num_of_cnn_layers, see a comment above
        self.LinearLayer = nn.Linear(out_channels * 3, num_classes)


    def forward(self, texts):
        """
        texts: LongTensor [batch_size, max_len]
        
        Returns output: Tensor [batch_size, num_classes]
        """

        # Pass texts through the embedding layer to convert from word ids to word embeddings
        #   Resulting: shape: [batch_size, max_len, embed_size]
        embeddings  = self.EmbeddingLayer(texts)

        # Input to conv should have 1 channel. Take a look at torch's unsqueeze() function
        #   Resulting shape: [batch_size, 1, MAX_LEN, embed_size]
       
        # Pass these texts to each of the conv layers and compute their output as follows:
        #   The cnn output will have shape [batch_size, out_channels, *, 1] where * depends on filter_height and stride
        #   Convert to shape [batch_size, out_channels, *] (see torch's squeeze() function)
        #   Apply non-linearity on it (F.relu() is a commonly used one)
        #   Take the max value across last dimension to have shape [batch_size, out_channels]
        # Concatenate outputs from all the cnn's [batch_size, (out_channels*num_of_cnn_layers)]
        #
        Conv1_Output = self.conv1(torch.unsqueeze(embeddings, 1))
        Conv2_Output = self.conv2(torch.unsqueeze(embeddings, 1))
        Conv3_Output = self.conv3(torch.unsqueeze(embeddings, 1))

        Conv1_Output_Converted = torch.squeeze(Conv1_Output, 3)
        Conv2_Output_Converted = torch.squeeze(Conv2_Output, 3)
        Conv3_Output_Converted = torch.squeeze(Conv3_Output, 3)

        Relu1 = F.relu(Conv1_Output_Converted)
        Relu2 = F.relu(Conv2_Output_Converted)
        Relu3 = F.relu(Conv3_Output_Converted)

        Max1 = torch.max(Relu1, 2)[0]
        Max2 = torch.max(Relu2, 2)[0]
        Max3 = torch.max(Relu3, 2)[0]

        CNNOutputConcat = torch.cat([Max1, Max2, Max3], dim = 1)

        # Let's summarize what has been done so far:
        #   Since each cnn is of different filter_height, it will look at different number of words at a time
        #     So, a filter_height of 3 means the cnn looks at 3 words (3-grams) at a time and tries to extract some information from it
        #   Each cnn will learn out_channels number of features from the words it sees at a time
        #   Then a non-linearity is applied and the max value for all channels is taken
        #     We are essentially trying to find important n-grams from the entire text
        # Everything happens on a batch simultaneously hence we have that additional batch_size as the first dimension

        # Apply dropout
        DropoutCNNConcat = self.DropoutLayer(CNNOutputConcat)

        # Pass the output through the linear layer and return its output 
        #   Resulting shape: [batch_size, num_classes]

        output = self.LinearLayer(DropoutCNNConcat) 

        # NOTE: Do not apply a sigmoid or softmax to the final output; nn.CrossEntropyLoss applies
        #   log-softmax internally during training.

        return output

## Train CNN Model

First, initialize the train and test <b>dataloaders</b>. A dataloader is responsible for providing batches of data to the model.
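
To make the `batch_size`, `shuffle` and `drop_last` options concrete, here is a plain-Python sketch of what a dataloader does conceptually. It is not how `torch.utils.data.DataLoader` is implemented (the real class also handles worker processes and stacks items into tensors); `iter_batches` is a hypothetical helper.

```python
import random

def iter_batches(items, batch_size, shuffle=False, drop_last=False, seed=0):
    """Yield lists of at most batch_size items, mimicking the batching options."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)  # visit the items in a random order
    for start in range(0, len(order), batch_size):
        batch = [items[i] for i in order[start:start + batch_size]]
        if drop_last and len(batch) < batch_size:
            break  # skip the final, incomplete batch
        yield batch

toy_items = list(range(10))
kept = list(iter_batches(toy_items, batch_size=4, drop_last=False))
dropped = list(iter_batches(toy_items, batch_size=4, drop_last=True))
shuffled = list(iter_batches(toy_items, batch_size=4, shuffle=True))

print([len(b) for b in kept])     # batch sizes with the incomplete batch kept
print([len(b) for b in dropped])  # batch sizes with the incomplete batch dropped
```

With the 20,000 training examples and a batch size of 32 used below, the data divides into exactly 625 batches per epoch, so `drop_last=True` only matters for other batch sizes.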

In [None]:
if __name__=='__main__':
    THRESHOLD = 5 # Don't change this
    MAX_LEN = 100 # Don't change this
    BATCH_SIZE = 32 # Feel free to try other batch sizes

    train_Ds = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
    train_loader = torch.utils.data.DataLoader(train_Ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)

    test_Ds = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_Ds.idx2word, train_Ds.word2idx)
    test_loader = torch.utils.data.DataLoader(test_Ds, batch_size=1, shuffle=False, num_workers=1, drop_last=False)

Then the following function takes the model and trains it on the data.


In [None]:
from tqdm.notebook import tqdm

def train_model(model, num_epochs, data_loader, optimizer, criterion):
    print('Training Model...')
    model.train()
    for epoch in tqdm(range(num_epochs)):
        epoch_loss = 0
        epoch_acc = 0
        for texts, labels in data_loader:
            texts = texts.to(device) # shape: [batch_size, MAX_LEN]
            labels = labels.to(device) # shape: [batch_size]

            optimizer.zero_grad()

            output = model(texts)
            acc = accuracy(output, labels)
            
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}\t Train Accuracy: {:.2f}%'.format(epoch+1, epoch_loss/len(data_loader), 100*epoch_acc/len(data_loader)))
    print('Model Trained!\n')

Here are some other helper functions we will need.

In [None]:

def count_parameters(model):
    """
    Count number of trainable parameters in the model
    """
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def accuracy(output, labels):
    """
    Returns accuracy per batch
    output: Tensor [batch_size, n_classes]
    labels: LongTensor [batch_size]
    """
    preds = output.argmax(dim=1) # find predicted class
    correct = (preds == labels).sum().float() # convert into float for division 
    acc = correct / len(labels)
    return acc

Now instantiate the model with some hyperparameter values. Feel free to experiment with them.

In [None]:
if __name__=='__main__':
    cnn_model = CNN(vocab_size = train_Ds.vocab_size, # Don't change this
                embed_size = 128, 
                out_channels = 64, 
                filter_heights = [2, 3, 4], 
                stride = 1, 
                dropout = 0.5, 
                num_classes = 2, # Don't change this
                pad_idx = train_Ds.word2idx[PAD]) # Don't change this

    # Put the model on the device (cuda or cpu)
    cnn_model = cnn_model.to(device)
    
    print('The model has {:,d} trainable parameters'.format(count_parameters(cnn_model)))

The model has 3,879,746 trainable parameters


Next, we create the **criterion**, which is the loss function: it is a measure of how well the model matches the empirical distribution of the data. We use cross-entropy loss (https://en.wikipedia.org/wiki/Cross_entropy).
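
For a single example, cross-entropy is the negative log of the softmax probability assigned to the correct class. The sketch below computes it directly from raw model outputs (logits) in plain Python. The function `cross_entropy` and the example logits are hypothetical; `nn.CrossEntropyLoss` does the same computation on tensors and, by default, averages over the batch.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

confident_right = cross_entropy([0.0, 4.0], label=1)  # small loss
confident_wrong = cross_entropy([0.0, 4.0], label=0)  # large loss
undecided = cross_entropy([1.0, 1.0], label=0)        # ln(2), about 0.693

print(confident_right, confident_wrong, undecided)
```

An undecided two-class model has a loss of ln 2 ≈ 0.693, which is close to the loss logged for the first training epoch of the CNN. Because the softmax is part of the loss, the models return raw logits.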

We also define the **optimizer**, which performs gradient descent. We use the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), which has been shown to work well on these types of models.
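
To give a feel for what the optimizer does on each call to `optimizer.step()`, here is a sketch of the Adam update rule for a single scalar parameter, following the paper linked above. The function `adam_step` and the toy objective are hypothetical; the default values for `beta1`, `beta2` and `eps` match the defaults of `torch.optim.Adam`, and `lr` matches the learning rate used in this notebook.

```python
import math

def adam_step(theta, grad, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = theta**2, whose gradient is 2 * theta.
first_theta, _, _ = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)

print(first_theta, theta)
```

On the first step the bias-corrected ratio is `grad / |grad|`, so the parameter moves by almost exactly `lr` regardless of the gradient's magnitude. This scale invariance is one reason a single learning rate tends to work across layers.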

In [None]:
if __name__=='__main__':    
    LEARNING_RATE = 5e-4 # Feel free to try other learning rates

    # Define the loss function
    criterion = nn.CrossEntropyLoss().to(device)

    # Define the optimizer
    optimizer = optim.Adam(cnn_model.parameters(), lr=LEARNING_RATE)

Finally, we can train the model.

In [None]:
if __name__=='__main__':    
    N_EPOCHS = 20 # Feel free to change this
    
    # train model for N_EPOCHS epochs
    train_model(cnn_model, N_EPOCHS, train_loader, optimizer, criterion)

Training Model...



[TRAIN]	 Epoch:  1	 Loss: 0.6920	 Train Accuracy: 59.05%
[TRAIN]	 Epoch:  2	 Loss: 0.5919	 Train Accuracy: 68.05%
[TRAIN]	 Epoch:  3	 Loss: 0.5366	 Train Accuracy: 73.09%
[TRAIN]	 Epoch:  4	 Loss: 0.4968	 Train Accuracy: 75.78%
[TRAIN]	 Epoch:  5	 Loss: 0.4510	 Train Accuracy: 79.01%
[TRAIN]	 Epoch:  6	 Loss: 0.4083	 Train Accuracy: 81.47%
[TRAIN]	 Epoch:  7	 Loss: 0.3648	 Train Accuracy: 83.78%
[TRAIN]	 Epoch:  8	 Loss: 0.3239	 Train Accuracy: 85.89%
[TRAIN]	 Epoch:  9	 Loss: 0.2804	 Train Accuracy: 88.17%
[TRAIN]	 Epoch: 10	 Loss: 0.2374	 Train Accuracy: 90.05%
[TRAIN]	 Epoch: 11	 Loss: 0.1997	 Train Accuracy: 91.94%
[TRAIN]	 Epoch: 12	 Loss: 0.1681	 Train Accuracy: 93.32%
[TRAIN]	 Epoch: 13	 Loss: 0.1428	 Train Accuracy: 94.42%
[TRAIN]	 Epoch: 14	 Loss: 0.1107	 Train Accuracy: 95.89%
[TRAIN]	 Epoch: 15	 Loss: 0.0920	 Train Accuracy: 96.61%
[TRAIN]	 Epoch: 16	 Loss: 0.0816	 Train Accuracy: 96.89%
[TRAIN]	 Epoch: 17	 Loss: 0.0674	 Train Accuracy: 97.60%
[TRAIN]	 Epoch: 18	 Loss: 0.058

## Evaluate CNN Model

Now that we have trained a model for text classification, it is time to evaluate it. 


In [None]:
@torch.no_grad()  # gradients are not needed for evaluation
def evaluate(model, data_loader, criterion):
    print('Evaluating performance on the test dataset...')
    model.eval()
    epoch_loss = 0
    epoch_acc = 0
    all_predictions = []
    print("\nSOME PREDICTIONS FROM THE MODEL:")
    for texts, labels in tqdm(data_loader):
        texts = texts.to(device)
        labels = labels.to(device)
        
        output = model(texts)
        acc = accuracy(output, labels)
        pred = output.argmax(dim=1)
        all_predictions.append(pred)
        
        loss = criterion(output, labels)
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()

        if random.random() < 0.0015:
            print("Input: "+' '.join([data_loader.dataset.idx2word[idx] for idx in texts[0].tolist() if idx not in {data_loader.dataset.word2idx[PAD], data_loader.dataset.word2idx[END]}]))
            print("Prediction:", pred.item(), '\tCorrect Output:', labels.item(), '\n')

    full_acc = 100*epoch_acc/len(data_loader)
    full_loss = epoch_loss/len(data_loader)
    print('[TEST]\t Loss: {:.4f}\t Accuracy: {:.2f}%'.format(full_loss, full_acc))
    predictions = torch.cat(all_predictions)
    return predictions, full_acc, full_loss

In [None]:
if __name__=='__main__':
    evaluate(cnn_model, test_loader, criterion) # Compute test data accuracy

Evaluating performance on the test dataset...

SOME PREDICTIONS FROM THE MODEL:



Input: i couldn't help but think of behind the mask : the rise of leslie vernon ( a massively more amazing film ) when watching this because of the realistic feel to it as well as the great innovative idea . this could have been a <UNK> film . the acting <UNK> some of the actors alright . from <UNK> downright horrible.<br /><br />that aside the idea is great and the format is great . the story is pretty good as well , though suffering often from big blows to the logical mind.<br /><br <UNK> that though right ? it
Prediction: 1 	Correct Output: 0 

Input: <UNK> <UNK> <UNK> is a vast improvement on <UNK> <UNK> <UNK> as it has sound mostly in the right places and a rudimentary plot . <UNK> time they've <UNK> slightly further away from the car park the other two movies were filmed in which is a good move as you can no longer hear cars driving past what is supposed to be a remote <UNK> /><br <UNK> time around there's a reality <UNK> show and a fake clown to scare off the contestants . <UNK>

# Step 4: Train a Recurrent Neural Network (RNN)
Build a text classification model that is based on a **recurrent neural network**.

## Define the RNN Model

Define the methods of the RNN class; `__init__(...)` and `forward(...)` are the most important ones.
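
Before writing `forward(...)`, it can help to check how `nn.GRU` lays out its final hidden state `h_n` in the bidirectional, multi-layer case. The sketch below uses made-up sizes (not the notebook's hyperparameters) and no dropout, and compares `h_n` with the per-timestep `output`.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; they are not the notebook's hyperparameters.
batch_size, seq_len, embed_size, hidden_size, num_layers = 3, 7, 5, 4, 2

gru = nn.GRU(embed_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
inputs = torch.randn(batch_size, seq_len, embed_size)
output, h_n = gru(inputs)

print(tuple(output.shape))  # [batch_size, seq_len, 2 * hidden_size]
print(tuple(h_n.shape))     # [num_layers * 2, batch_size, hidden_size]

# Final states of the last layer: forward direction, then backward direction.
last_forward, last_backward = h_n[-2], h_n[-1]
features = torch.cat([last_forward, last_backward], dim=1)  # [batch_size, 2 * hidden_size]

# The forward state equals the output at the last time step; the backward state
# equals the output at the first time step (where the backward pass finishes).
forward_matches = torch.allclose(last_forward, output[:, -1, :hidden_size])
backward_matches = torch.allclose(last_backward, output[:, 0, hidden_size:])
print(forward_matches, backward_matches)
```

Using `h_n[-2]` and `h_n[-1]` therefore gives each direction's summary of the whole sequence, whereas taking `output[:, -1, :]` alone would give the backward direction only one step of context.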

In [None]:
class RNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout, num_classes, pad_idx):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers


        if bidirectional:
            self.num_dirs = 2
        else:
            self.num_dirs = 1

        # Create an embedding layer (https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html)
        #   to represent the words in the vocabulary. Make sure to use vocab_size, embed_size, and pad_idx here.
        self.EmbeddingLayer = nn.Embedding(vocab_size, embed_size, padding_idx=pad_idx)

        # Create a recurrent network (use nn.GRU) with batch_first = True
        # Make sure to use hidden_size, num_layers, dropout, and bidirectional here.
        # Note: If parameter "dropout" is non-zero, it introduces a Dropout layer on the outputs 
        # of each GRU layer except the last layer, with dropout probability equal to dropout, Default: 0
        self.rnn = nn.GRU(embed_size, hidden_size, num_layers, batch_first = True, dropout = dropout, bidirectional = bidirectional)
        
        
        # Create a dropout layer (nn.Dropout) using dropout
        # Note: this dropout is for the last layer of GRU, i.e. for the input of the linear layer after GRU
        self.DropoutLayer = nn.Dropout(dropout)

        # Define a linear layer (nn.Linear) that consists of num_classes units 
        #   and takes as input the output of the last timestep. In the bidirectional case, need to concatenate
        #   the output of the last timestep of the forward direction with the output of the last timestep of the backward direction.
        # Note: see the PyTorch documentation for GRU; its output has size D * H_out, where D is 2 if
        # bidirectional = True and 1 otherwise, and H_out = hidden_size. Hence the input size
        # self.num_dirs * hidden_size below.
        self.LinearLayer = nn.Linear(self.num_dirs * hidden_size, num_classes)


    def forward(self, texts):
        """
        texts: LongTensor [batch_size, MAX_LEN]
        
        Returns output: Tensor [batch_size, num_classes]
        """

        # Pass texts through the embedding layer to convert from word ids to word embeddings
        #   Resulting: shape: [batch_size, max_len, embed_size]
        embeddings  = self.EmbeddingLayer(texts)

        # Pass the result through the recurrent network
        #   See PyTorch documentation for resulting shape for nn.GRU
        _, h_n = self.rnn(embeddings)
        
        # Concatenate the outputs of the last timestep for each direction 
        #   This depends on whether or not the model is bidirectional.
        #   Resulting shape: [batch_size, num_dirs*hidden_size]

        # According to the GRU documentation, h_n is a tensor of shape [num_layers*num_dirs, batch_size, hidden_size].
        # It can be viewed as if it had shape [num_layers, num_dirs, batch_size, hidden_size].
        # Along the first dimension, if bidirectional:
        #   layer 1 (index 0) has its forward and backward states at h_n[0], h_n[1]
        #   layer 2 (index 1) has its forward and backward states at h_n[2], h_n[3], ...
        #   the last layer has its forward and backward states as the last two elements, h_n[-2] and h_n[-1]
        # If not bidirectional (forward direction only), the first layer's state is h_n[0], ..., the last layer's is h_n[-1].
        # In the following, we only need the last layer's state.
        if self.num_dirs == 2:
            RnnOutput = torch.cat([h_n[-2],h_n[-1]],dim = 1)
        else:
            RnnOutput = h_n[-1]
        # Apply dropout
        DropoutRnnOutput = self.DropoutLayer(RnnOutput)

        # Pass the output through the linear layer and return its output 
        #   Resulting shape: [batch_size, num_classes]
        output = self.LinearLayer(DropoutRnnOutput)

        # NOTE: Do not apply a sigmoid or softmax to the final output; nn.CrossEntropyLoss applies
        #   log-softmax internally during training.
        
        return output

## Train RNN Model
First, we initialize the train and test dataloaders.

In [None]:
if __name__=='__main__':
    THRESHOLD = 5 # Don't change this
    MAX_LEN = 100 # Don't change this
    BATCH_SIZE = 32 # Feel free to try other batch sizes

    train_Ds = TextDataset(train_data, 'train', THRESHOLD, MAX_LEN)
    train_loader = torch.utils.data.DataLoader(train_Ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, drop_last=True)

    test_Ds = TextDataset(test_data, 'test', THRESHOLD, MAX_LEN, train_Ds.idx2word, train_Ds.word2idx)
    test_loader = torch.utils.data.DataLoader(test_Ds, batch_size=1, shuffle=False, num_workers=1, drop_last=False)

Now instantiate the model with some hyperparameter values.

In [None]:
if __name__=='__main__':
    rnn_model = RNN(vocab_size = train_Ds.vocab_size, # Don't change this
                embed_size = 128, 
                hidden_size = 128, 
                num_layers = 2,
                bidirectional = True,
                dropout = 0.5,
                num_classes = 2, # Don't change this
                pad_idx = train_Ds.word2idx[PAD]) # Don't change this

    # Put the model on device
    rnn_model = rnn_model.to(device)

    print('The model has {:,d} trainable parameters'.format(count_parameters(rnn_model)))

The model has 4,300,546 trainable parameters


Here, we create the criterion and optimizer; as with the CNN, we use cross-entropy loss and Adam optimization.

In [None]:
if __name__=='__main__':    
    LEARNING_RATE = 5e-4 # Feel free to try other learning rates

    # Define the loss function
    criterion = nn.CrossEntropyLoss().to(device)

    # Define the optimizer
    optimizer = optim.Adam(rnn_model.parameters(), lr=LEARNING_RATE)

Finally, we can train the model. We use the same `train_model(...)` function that we defined for the CNN.

In [None]:
if __name__=='__main__':    
    N_EPOCHS = 15 # Feel free to change this
    
    # train model for N_EPOCHS epochs
    train_model(rnn_model, N_EPOCHS, train_loader, optimizer, criterion)

Training Model...



[TRAIN]	 Epoch:  1	 Loss: 0.6584	 Train Accuracy: 59.77%
[TRAIN]	 Epoch:  2	 Loss: 0.5054	 Train Accuracy: 75.36%
[TRAIN]	 Epoch:  3	 Loss: 0.3927	 Train Accuracy: 82.33%
[TRAIN]	 Epoch:  4	 Loss: 0.3041	 Train Accuracy: 87.32%
[TRAIN]	 Epoch:  5	 Loss: 0.2299	 Train Accuracy: 91.11%
[TRAIN]	 Epoch:  6	 Loss: 0.1595	 Train Accuracy: 94.12%
[TRAIN]	 Epoch:  7	 Loss: 0.1046	 Train Accuracy: 96.28%
[TRAIN]	 Epoch:  8	 Loss: 0.0652	 Train Accuracy: 97.96%
[TRAIN]	 Epoch:  9	 Loss: 0.0411	 Train Accuracy: 98.66%
[TRAIN]	 Epoch: 10	 Loss: 0.0319	 Train Accuracy: 98.95%
[TRAIN]	 Epoch: 11	 Loss: 0.0257	 Train Accuracy: 99.11%
[TRAIN]	 Epoch: 12	 Loss: 0.0189	 Train Accuracy: 99.35%
[TRAIN]	 Epoch: 13	 Loss: 0.0204	 Train Accuracy: 99.31%
[TRAIN]	 Epoch: 14	 Loss: 0.0140	 Train Accuracy: 99.53%
[TRAIN]	 Epoch: 15	 Loss: 0.0117	 Train Accuracy: 99.59%
Model Trained!



## Evaluate RNN Model


In [None]:
if __name__=='__main__':    
    evaluate(rnn_model, test_loader, criterion) # Compute test data accuracy

Evaluating performance on the test dataset...

SOME PREDICTIONS FROM THE MODEL:



Input: <UNK> , <UNK> wrote this review in anger at <UNK> <UNK> and <UNK> /><br <UNK> has produced movies based on one of the darkest days of our nation . 911 changed everything . <UNK> changed our perception of security . <UNK> changed our understanding of the evil of man and humanity . <UNK> importantly and devastatingly  , it changed our world.<br /><br <UNK> , <UNK> can't not stress how utterly repulsed , disillusioned , and angry <UNK> am at the careless , blatant ignorance of <UNK> seeking to make a lucrative profit out of death and destruction .
Prediction: 0 	Correct Output: 0 

Input: <UNK> review good movies when you can review " <UNK> <UNK> /><br <UNK> , this film is soooo lame . <UNK> can just picture the cast and crew driving around <UNK> . with a camcorder , hurling extras in silly monster make-up at poor , long-suffering <UNK> <UNK> . <UNK> stars ' families actually turn up to play cameos , probably because <UNK> <UNK> couldn't afford " real " extras . <UNK> effects , lam