# LSTMs Lab

### Introduction

In this lesson we'll practice training an LSTM on the TREC question classification dataset.  Let's get started.

### Loading our Data

> To start, change the runtime type to GPU.

We'll begin by loading up the necessary libraries and seeding PyTorch's random number generator, so that our results are reproducible.

In [2]:
import torch
from torchtext import data
from torchtext import datasets
SEED = 12
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
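
To see what seeding buys us, here is a minimal sketch that reseeds the generator and checks that the same random numbers come back. The variable names `first_draw` and `second_draw` are just for illustration.

```python
import torch

SEED = 12  # same seed value as in the cell above

torch.manual_seed(SEED)
first_draw = torch.rand(3)

torch.manual_seed(SEED)  # reseeding restarts the random sequence
second_draw = torch.rand(3)

# Both draws come from the same starting point, so they should match.
print(torch.equal(first_draw, second_draw))
```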

Next, initialize a field object for the text, and another one for the label.  For the TEXT field, tokenize with `spacy` and set `include_lengths` to True so that we can pack the tensor later on.

In [3]:
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)



In [4]:
TEXT.include_lengths
# True


Ok, now we'll download both the training and test data for the TREC dataset.  Pass through the TEXT and LABEL fields to specify how to process the questions and their labels.

In [4]:
from torchtext import datasets

train_data, test_data = datasets.TREC.splits(TEXT, LABEL)

downloading train_5500.label


train_5500.label: 100%|██████████| 336k/336k [00:00<00:00, 2.60MB/s]
TREC_10.label: 100%|██████████| 23.4k/23.4k [00:00<00:00, 1.10MB/s]


downloading TREC_10.label


Ok, next numericalize both the TEXT and LABEL fields by building a vocabulary for each.  For the TEXT field, download the 300-dimensional GloVe word vectors that were trained on 6 billion tokens.  Only numericalize the 25,000 most frequent words in the corpus.

In [5]:
TEXT.build_vocab(train_data, 
                 max_size = 25_000, 
                 vectors = "glove.6B.300d", 
                 unk_init = torch.Tensor.normal_)

.vector_cache/glove.6B.zip: 862MB [06:50, 2.10MB/s]                               
100%|█████████▉| 399999/400000 [00:28<00:00, 14197.02it/s]


In [9]:
LABEL.build_vocab(train_data)

Check that the label field has numericalized the six categories.

In [10]:
LABEL.vocab.stoi
# defaultdict(<function torchtext.vocab._default_unk_index>,
#             {'ABBR': 5, 'DESC': 2, 'ENTY': 0, 'HUM': 1, 'LOC': 4, 'NUM': 3})
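
If it's unclear what numericalizing does, the following is a rough sketch in plain Python, not torchtext's actual implementation. The helpers `build_vocab` and `numericalize` are hypothetical, and we assume the unknown and pad tokens take ids 0 and 1, which matches the `padding_idx = 1` used later in this lab.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size):
    """Hypothetical helper: map the `max_size` most frequent tokens to integer ids."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Special tokens come first, then the most frequent words.
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    return {tok: idx for idx, tok in enumerate(itos)}

docs = [["what", "is", "an", "lstm"],
        ["what", "is", "an", "rnn"],
        ["what", "is", "trec"]]
stoi = build_vocab(docs, max_size=3)

def numericalize(doc):
    """Replace each token with its id, falling back to the <unk> id."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in doc]

print(stoi)
print(numericalize(["what", "is", "glove"]))
```

Words that fall outside the `max_size` cutoff, or that never appeared in the training data, all share the unknown token's id.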

Then, let's bucket our data into batches.  Set the device equal to a cuda device when one is available, set the batch size equal to 64, and set `sort_within_batch` to True.

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size = 64,
    sort_within_batch = True,
    device = device)
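
Bucketing groups documents of similar length so that each batch needs less padding. Here is a simplified sketch of the idea in plain Python. The helpers `bucket_batches` and `padding_needed` are hypothetical, and the real `BucketIterator` also shuffles, which we ignore here.

```python
def bucket_batches(docs, batch_size):
    """Hypothetical helper: group documents of similar length into batches."""
    by_length = sorted(docs, key=len)
    return [by_length[i:i + batch_size] for i in range(0, len(by_length), batch_size)]

def padding_needed(batches):
    """Count the pad tokens required to make every batch rectangular."""
    return sum(max(len(d) for d in batch) - len(d) for batch in batches for d in batch)

# Six made-up documents with lengths 2, 9, 3, 10, 2 and 8.
docs = [["word"] * n for n in [2, 9, 3, 10, 2, 8]]

unsorted_batches = [docs[i:i + 2] for i in range(0, len(docs), 2)]
bucketed_batches = bucket_batches(docs, batch_size=2)

print(padding_needed(unsorted_batches))  # padding when batching in the original order
print(padding_needed(bucketed_batches))  # padding after grouping by length
```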



Select the first batch of text from the `train_iterator` by looping through the data and breaking after the first iteration.

In [10]:
for batch in train_iterator:
    first_batch_text = batch.text
    break



The `first_batch_text` should be a tuple of the numericalized words and the number of non-padded words for each document.

In [None]:
type(first_batch_text)
# tuple
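
For intuition, we can build a similar tuple by hand with plain PyTorch. This is only a sketch with made-up word ids, and `TOY_PAD_IDX = 1` is an assumption chosen to match the padding index used later in the lab.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

TOY_PAD_IDX = 1  # assumed id of the pad token

# Three made-up numericalized documents, longest first.
docs = [torch.tensor([5, 8, 2, 9]), torch.tensor([7, 3, 4]), torch.tensor([6])]

text = pad_sequence(docs, padding_value=TOY_PAD_IDX)  # shape: (seq_len, batch_size)
lengths = torch.tensor([len(doc) for doc in docs])    # non-padded length of each document
toy_batch_text = (text, lengths)

print(text)
print(lengths)
```

Each column of `text` is one document, and the shorter documents are filled out with the pad id.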

### Building our Layers

Let's start by initializing the layers of our LSTM model.  First we need to set the embedding layer.  It should have one row for each word in our vocabulary, and one column for each dimension of the word vectors.  Also pass through the padding index.

Next comes the LSTM layer.  Specify four bidirectional LSTM layers, with a dropout rate of 0.5 and 200 dimensions for the hidden state.  Then initialize a dropout layer with a rate of 0.5, which we'll eventually use between our last LSTM layer and our linear layer.

Then create the linear layer, which will be the output layer.  It takes in both the forwards and backwards hidden states from the last LSTM layer, so it needs 200 * 2 input features, and it outputs one value for each of the six categories.  Then, let's initialize the neural network, and check the result below.

In [18]:
import torch.nn as nn
import torch.nn.functional as F
class LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(9343, 300, padding_idx = 1)
        self.lstm_layer = nn.LSTM(300, 200, num_layers=4, bidirectional=True, 
                           dropout=.5)
        self.dropout = nn.Dropout(.5)
        self.fc = nn.Linear(200 * 2, 6)
    def forward(self, text, document_lengths):
        # text: (seq_len, batch_size) -> embedded_batch: (seq_len, batch_size, 300)
        embedded_batch = self.embedding(text)
        # pack so the LSTM skips the padding; recent PyTorch versions need the lengths on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded_batch, 
                                                    document_lengths.cpu(), 
                                                    enforce_sorted=False)
        packed_output, (hidden, cell) = self.lstm_layer(packed_embedded)
        # hidden: (num_layers * 2, batch_size, 200); the last two entries belong to the top layer
        last_forwards = hidden[-2, :, :]
        last_backwards = hidden[-1, :, :]
        combined_hidden = torch.cat((last_forwards, 
                                     last_backwards), dim = 1)
        dropout = self.dropout(combined_hidden)
        output_layer = self.fc(dropout)
        return F.log_softmax(output_layer, dim = 1)

In [19]:
lstm = LSTM()
lstm

# LSTM(
#   (embedding): Embedding(9343, 300, padding_idx=1)
#   (lstm_layer): LSTM(300, 200, num_layers=4, dropout=0.5, bidirectional=True)
#   (dropout): Dropout(p=0.5, inplace=False)
#   (fc): Linear(in_features=400, out_features=6, bias=True)
# )



Ok, next it's time to fill in the `forward` method.  The method should take in both the text and the document_lengths of each batch.  First pass the text through the embedding layer, then pack the embedded documents so that the LSTM skips over the padding.  Then pass the packed sequence to the LSTM layer.  

The LSTM layer returns to us the output, the hidden state and the cell state of the LSTM.  From hidden, select the forwards and backwards hidden states of the last layer, and then concatenate the two states into a single vector per document.  Then apply the dropout to the concatenated hidden state.  Finally, pass the result to the linear layer and apply the `log_softmax` to output our predictions.
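
The shapes in these steps can be hard to picture, so here is a small sketch on random data. The sizes are made up and much smaller than the lab's, and random numbers stand in for the output of the embedding layer.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)

seq_len, batch_size, emb_dim, hidden_dim, num_layers = 5, 3, 8, 4, 2
embedded = torch.randn(seq_len, batch_size, emb_dim)  # stand-in for the embedding output
lengths = torch.tensor([5, 3, 2])                     # a CPU tensor of document lengths

toy_lstm_layer = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers, bidirectional=True)
packed = pack_padded_sequence(embedded, lengths, enforce_sorted=False)
packed_output, (hidden, cell) = toy_lstm_layer(packed)

# hidden is (num_layers * 2, batch_size, hidden_dim); the last two entries are the top layer.
last_forwards, last_backwards = hidden[-2], hidden[-1]
combined_hidden = torch.cat((last_forwards, last_backwards), dim=1)

print(hidden.shape)           # one entry per layer and direction
print(combined_hidden.shape)  # one vector of size hidden_dim * 2 per document
```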

> Before trying out our forward function, we need to move our model to the same device as our data.  Do so below, and reassign the model. 

In [17]:
lstm = lstm.to(device)

Then try out the model by making predictions, passing through both the `text` and the `document_lengths`.

In [None]:
predictions = None

predictions.shape
# torch.Size([64, 6])

Once our predictions are working, copy the GloVe word embeddings over to the embedding layer of the LSTM.

> Replace the vectors associated with the unknown and pad tokens with a vector of zeros.

In [16]:
pretrained_embeddings = TEXT.vocab.vectors
lstm.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

lstm.embedding.weight.data[UNK_IDX] = torch.zeros(300)
lstm.embedding.weight.data[PAD_IDX] = torch.zeros(300)
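
The same steps on a tiny embedding layer make it easy to inspect the result. This is a sketch with a made-up 4-word, 3-dimensional "pretrained" matrix, and it assumes the unknown and pad tokens sit at ids 0 and 1.

```python
import torch
import torch.nn as nn

TOY_UNK_IDX, TOY_PAD_IDX = 0, 1                    # assumed ids of the special tokens
toy_pretrained = torch.arange(12.0).reshape(4, 3)  # made-up pretrained vectors

toy_embedding = nn.Embedding(4, 3, padding_idx=TOY_PAD_IDX)
toy_embedding.weight.data.copy_(toy_pretrained)    # overwrite the random initialization
toy_embedding.weight.data[TOY_UNK_IDX] = torch.zeros(3)
toy_embedding.weight.data[TOY_PAD_IDX] = torch.zeros(3)

print(toy_embedding.weight.data)
```

The first two rows are zeroed out, while the remaining rows keep their pretrained values.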

### Training our model

Now it's time to train our model.  Use the Adam optimizer, and cross entropy for our loss, as we are performing multiclass classification.

Then move both the model and the cross entropy loss to the device.

In [24]:
import torch.optim as optim

optimizer = optim.Adam(lstm.parameters())

c_e_loss = nn.CrossEntropyLoss()

lstm = lstm.to(device)
c_e_loss = c_e_loss.to(device)
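
One thing worth checking: our model already applies `log_softmax`, and `nn.CrossEntropyLoss` applies a log softmax of its own. The sketch below, on made-up scores, suggests this is harmless, since applying a log softmax to log-probabilities leaves them unchanged. It also shows `nn.NLLLoss`, the loss that expects log-probabilities directly, giving the same value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 6)           # made-up scores: 4 documents, 6 classes
labels = torch.tensor([0, 5, 2, 3])  # class ids of dtype long

log_probs = F.log_softmax(logits, dim=1)

ce_on_logits = nn.CrossEntropyLoss()(logits, labels)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, labels)
nll_on_log_probs = nn.NLLLoss()(log_probs, labels)

# All three values should agree up to floating point error.
print(ce_on_logits.item(), ce_on_log_probs.item(), nll_on_log_probs.item())
```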

Then define the training loop, printing the loss at the end of each epoch.  Our labels were loaded as floats, so we need to convert them to type `long` when calculating our cross entropy loss.

In [None]:
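for epoch in range(15):
    lstm.train() # make sure dropout is active while training
    for batch in train_iterator:
        preds = lstm(*batch.text)
        loss = c_e_loss(preds, batch.label.to(device).long())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(loss.item()) # loss of the last batch in this epoch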
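
The loss of a single batch can be noisy, so it may be easier to follow the mean loss over each epoch. Here is a sketch of that bookkeeping, where a small linear model and random batches stand in for our LSTM and the `train_iterator`. All of the `toy_` names are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
toy_model = nn.Linear(10, 6)  # stand-in for the LSTM classifier
toy_batches = [(torch.randn(8, 10), torch.randint(0, 6, (8,))) for _ in range(5)]

toy_optimizer = optim.Adam(toy_model.parameters())
toy_loss_fn = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(3):
    toy_model.train()
    running_loss = 0.0
    for features, labels in toy_batches:
        toy_optimizer.zero_grad()
        loss = toy_loss_fn(toy_model(features), labels)
        loss.backward()
        toy_optimizer.step()
        running_loss += loss.item()  # accumulate a plain float, not a tensor
    epoch_losses.append(running_loss / len(toy_batches))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```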

### Evaluating our model

Ok, when training is completed, compute the accuracy on the test data.  We'll define the `categorical_accuracy` function for you.

In [None]:
def categorical_accuracy(preds, y):
    max_preds = preds.argmax(dim = 1) # index of the highest log-probability for each document
    correct = max_preds.eq(y)
    # divide by a plain int so the result stays on the same device as the predictions
    return correct.sum().float() / y.shape[0]

It's your task to create a list of `accuracies`, with one accuracy for each batch of data.  Then calculate the accuracy for the entire test set using a weighted average below, where each batch is weighted by its number of documents.

In [25]:
accuracies = []
batch_lengths = []
lstm.eval()
with torch.no_grad():
    for batch in test_iterator:
        predictions = lstm(*batch.text)
        acc = categorical_accuracy(predictions.to(device), batch.label.to(device).long())
        accuracies.append(acc.item())
        batch_lengths.append(len(predictions))

In [None]:
weighted_sum = sum([accuracy*batch_length for accuracy, batch_length in zip(accuracies, batch_lengths)])/sum(batch_lengths)
weighted_sum
# 0.8699999995231629
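
Why weight by batch size at all? The last batch is usually smaller than the others, so a simple mean would give its documents too much influence. Here is a small worked example with made-up numbers.

```python
# Made-up results: 56/64, 48/64 and 4/8 documents classified correctly.
batch_accuracies = [0.875, 0.75, 0.5]
batch_sizes = [64, 64, 8]

simple_mean = sum(batch_accuracies) / len(batch_accuracies)
weighted_mean = sum(a * n for a, n in zip(batch_accuracies, batch_sizes)) / sum(batch_sizes)

print(round(simple_mean, 4))    # treats the small batch like a full one
print(round(weighted_mean, 4))  # equals correct documents / total documents
```

The weighted version matches what we would get by counting the correct predictions over the whole test set, here 108 out of 136.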