# PyTorch LSTMs

### Introduction

In this lesson, we'll work through how to implement LSTMs in PyTorch.  Let's get started.

### Loading the Data

> As a first step, let's change the runtime type in Colab to GPU.

We'll begin by downloading our data.  As usual, we first set a seed for PyTorch.

In [12]:
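import torch
# Field, LabelField and BucketIterator belong to torchtext's legacy API.
# On torchtext 0.9 to 0.11 they live under torchtext.legacy instead
# (from torchtext.legacy import data, datasets).
from torchtext import data
from torchtext import datasets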
SEED = 12
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)

> Notice that this time, we included an argument `include_lengths = True`.  This will come in handy later.  Let's keep going.

Next, we download our IMDB data, build our vocabulary, and split the data into batches of 64.

In [59]:
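import random
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
# Hold out part of the training set for validation; valid_data is needed by the iterators below
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())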
TEXT.build_vocab(train_data, 
                 max_size = 25_000, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = 64,
    sort_within_batch = True,
    device = device)

Finally, we select the first batch of text.

In [58]:
for batch in train_iterator:
    first_batch_text = batch.text
    break



### Building our LSTM model

Now it's time to construct our LSTM model.  Here is the skeleton of the whole model.

In [134]:
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(25002, 100, padding_idx = 1)
        self.lstm_layer = nn.LSTM(100, 256, num_layers=2, bidirectional=True, 
                           dropout=.5)
        self.fc = nn.Linear(256 * 2, 1)
    def forward(self, text, document_lengths):
        return self.embedding(text) 

As we can see, there are really only three layers that we'll need: an embedding layer, our LSTM, and our linear layer, which is our output layer. 

> Notice that our forward function is not fleshed out right now.  It simply takes in text and returns the output from the embedding layer.  We'll fill it in as we move along.

Ok, so let's initialize the model and copy over the pretrained embeddings.  Then we'll go through each layer of our model. 

In [137]:
lstm = LSTM().to(device)  # keep the model on the same device as our batches

In [43]:
pretrained_embeddings = TEXT.vocab.vectors
lstm.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# Zero out the <unk> and <pad> rows; 100 is the length of our embedding vectors
lstm.embedding.weight.data[UNK_IDX] = torch.zeros(100)
lstm.embedding.weight.data[PAD_IDX] = torch.zeros(100)

### LSTM Layers

Ok, now let's take a look at the layers in our LSTM model.  We'll go through each of these layers in turn.

In [44]:
lstm

LSTM(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (lstm_layer): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
)

1. Embedding Layer

* In the embedding layer, the first argument is the number of rows, one for each token in our vocabulary: the 25,000 most frequent words plus the `<unk>` and `<pad>` tokens, which gives 25,002.  The second argument is the number of columns, which is the length of our word vectors, 100.  

* When we pass the padding index, we tell PyTorch not to update this vector during training, as we do not believe that padding can help us predict the sentiment of a review.
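
To make the effect of `padding_idx` concrete, here is a minimal sketch on a toy embedding. The sizes (10 rows, vectors of length 4) and the token ids are made up for illustration and are not the lesson's vocabulary. It checks that the padding row starts at zero and receives no gradient, while a real token's row does.

```python
import torch
import torch.nn as nn

# Toy embedding: 10 "words", vectors of length 4, index 1 reserved for padding
toy_embedding = nn.Embedding(10, 4, padding_idx=1)

# One document of two real tokens followed by two padding tokens
tokens = torch.tensor([[2, 5, 1, 1]])
embedded = toy_embedding(tokens)  # shape [1, 4, 4]

# Backpropagate a dummy loss so every looked-up row would normally get a gradient
embedded.sum().backward()

pad_row_is_zero = bool((toy_embedding.weight[1] == 0).all())
pad_grad_is_zero = bool((toy_embedding.weight.grad[1] == 0).all())
real_grad_is_nonzero = bool((toy_embedding.weight.grad[2] != 0).all())
print(pad_row_is_zero, pad_grad_is_zero, real_grad_is_nonzero)  # True True True
```

Note that copying pretrained vectors into `weight.data` overwrites the zero padding row, which is presumably why the lesson zeroes that row again by hand after the copy.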

2. LSTM Layer

```python
nn.LSTM(100, 256, num_layers=2, bidirectional=True, dropout=.5)
```

* Because the output from our preceding *embedding* layer is a vector of length 100 for each word, we set the first argument, the LSTM's `input_size`, to 100.  The second argument, `hidden_size`, determines the size of the hidden state.  Here we set it to 256.  
* We make each layer bidirectional with `bidirectional = True`, and because this concatenates the forward and backward hidden states, our combined hidden state is really of length 512.  
* Finally, we specify two layers for our LSTM with `num_layers = 2`.  At each time step, the first layer passes its hidden state up to the same time step in the second layer.  And between the two layers we have a dropout of `.5`.
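
As a quick check of these sizes, the sketch below passes a made-up batch of random "embeddings" (7 time steps, 3 documents) through a layer built with the same arguments. The input is random noise, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

lstm_layer = nn.LSTM(100, 256, num_layers=2, bidirectional=True, dropout=.5)

# Made-up input: 7 time steps, a batch of 3 documents, 100-dimensional embeddings
fake_embedded = torch.randn(7, 3, 100)
output, (hidden, cell) = lstm_layer(fake_embedded)

print(output.shape)  # torch.Size([7, 3, 512]): forward and backward states concatenated
print(hidden.shape)  # torch.Size([4, 3, 256]): 2 layers x 2 directions
print(cell.shape)    # torch.Size([4, 3, 256])
```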


3. Output layer

```python 
nn.Linear(256 * 2, 1)
```

Notice that the output layer takes in `256*2` features and outputs a single neuron.  This is because the input from the preceding layer is a hidden state from a bidirectional RNN, where the forward and backward hidden states are each of length 256.  And we want a single output to represent whether our review is positive or negative.

Let's take another look at how we initialize our layers before moving on.  Try to feel comfortable with each of the parameters we use to initialize our different layers, and with the task that each layer performs.

In [135]:
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(25002, 100, padding_idx = 1)
        self.lstm_layer = nn.LSTM(100, 256, num_layers=2, bidirectional=True, 
                           dropout=.5)
        self.fc = nn.Linear(256 * 2, 1)
    def forward(self, text, document_lengths):
        pass

### Forward Function

Now that we've initialized each of our layers, it's time to build out the forward function.  

1. Document lengths

Looking at the forward function, notice that we pass through both the numericalized text of each document, and the `document_lengths` as arguments.

The `document_lengths` are a tensor of numbers signifying the number of words in each document, not including the padding tokens.

> Because we specified `include_lengths` when defining our field, `TEXT = data.Field(tokenize = 'spacy', include_lengths = True)`, torchtext returns the non-padded lengths to us.  They are the second element of each batch's `text` attribute.

In [132]:
batch_document_lengths = first_batch_text[1]
batch_document_lengths[-3:]

tensor([139, 115,  60])

So, above, we see the document lengths for the last three documents in our batch.  We'll eventually use these document lengths so that when we pass each document to our LSTM, we can remove the padding tokens, which we do not believe affect sentiment, and so do not want captured in the hidden state.
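
torchtext computes these lengths for us, but the idea can be sketched by hand. The example below assumes a padding index of 1 (the `padding_idx` our embedding layer uses) and a made-up batch laid out like ours, as `[sequence length, batch size]`, so each column is one document.

```python
import torch

TOY_PAD_IDX = 1  # assumed padding index, matching padding_idx = 1 in our embedding layer

# Made-up batch of 3 documents (columns), padded to 3 time steps (rows)
toy_batch = torch.tensor([[4, 7, 9],
                          [5, 8, 1],
                          [6, 1, 1]])

# Count the non-padding tokens in each column
toy_lengths = (toy_batch != TOY_PAD_IDX).sum(dim=0)
print(toy_lengths)  # tensor([3, 2, 1])
```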

Before doing that, let's select the first element from our text, our batch of numericalized documents.

In [130]:
batch_text = first_batch_text[0]

In [131]:
batch_text.shape

torch.Size([713, 64])

2. The embedding layer and document packing

Ok, now it's time to pass our data through the embedding layer.  We can fill in the first two lines of the forward function with the following.

In [133]:
embedded_batch = lstm.embedding(batch_text) # torch.Size([713, 64, 100])
# Recent PyTorch versions expect the lengths as a CPU tensor, hence .cpu()
packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded_batch, 
                                                    batch_document_lengths.cpu(), 
                                                    enforce_sorted=False)

In the first line, we pass the numericalized text through our embedding layer, which returns a corresponding embedding vector for each word.  Then, in the second line, we call `pack_padded_sequence` to remove the padding from each document before we pass our documents through the LSTM layer.

We can see that there is now a vector of length 100 for each non-padded word in the batch.

In [148]:
packed_embedded.data.shape

torch.Size([16470, 100])

In [149]:
batch_document_lengths.sum()

tensor(16470)
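
The same bookkeeping is easier to see on a tiny batch. In this sketch the tensor values, sizes, and lengths are made up; the point is only that the packed data has one row per non-padded token, and that `batch_sizes` records how many documents are still "alive" at each time step.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Made-up batch: 3 time steps, 3 documents, embedding vectors of length 2
toy_embedded = torch.arange(18, dtype=torch.float).reshape(3, 3, 2)
toy_lengths = torch.tensor([3, 2, 1])  # non-padded length of each document

packed = pack_padded_sequence(toy_embedded, toy_lengths, enforce_sorted=False)

print(packed.data.shape)   # torch.Size([6, 2]): 3 + 2 + 1 non-padded tokens
print(packed.batch_sizes)  # tensor([3, 2, 1]): documents remaining at each time step
```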

3. The LSTM layer

Now that we have "packed" versions of each of our documents, we can pass our documents to our lstm layer.

In [138]:
packed_output, (hidden, cell) = lstm.lstm_layer(packed_embedded)

This returns three different components: `packed_output`, `hidden`, and `cell`.  We will not need `cell`, which holds the final cell states.  The first, `packed_output`, represents the hidden state of the last layer of our LSTM at every non-padded token. 

In [142]:
packed_output.data.shape

torch.Size([16470, 512])

> So we have the output from the 512 neurons for each of the non-padded words in our batch.
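
If we ever want these per-token outputs back in padded form, for example to look at the output at a particular position, `nn.utils.rnn.pad_packed_sequence` reverses the packing. Our model does not need this step; the sketch below uses a small made-up LSTM and batch just to show the round trip.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
small_lstm = nn.LSTM(5, 4, bidirectional=True)  # toy sizes, not the lesson's
toy_embedded = torch.randn(6, 3, 5)             # [seq len, batch, embedding dim]
toy_lengths = torch.tensor([6, 4, 2])

packed = pack_padded_sequence(toy_embedded, toy_lengths, enforce_sorted=False)
packed_output, (hidden, cell) = small_lstm(packed)

# Back to a padded tensor, plus the lengths in the original batch order
padded_output, output_lengths = pad_packed_sequence(packed_output)

print(padded_output.shape)  # torch.Size([6, 3, 8])
print(output_lengths)       # tensor([6, 4, 2])

# Positions past a document's length are filled with zeros
padding_total = padded_output[4:, 1].abs().sum().item()
print(padding_total)        # 0.0
```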

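Second is `hidden`, which holds only the final hidden state for each document.  For the forward direction, this is the state after the last word of the document, and for the backward direction, it is the state after the first word.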

In [143]:
hidden.data.shape

torch.Size([4, 64, 256])

Above, the first dimension is 4 because `hidden` contains the final hidden state from both of the layers in our LSTM, in both the forward and backward directions.  They are ordered as first layer forward, first layer backward, second layer forward, second layer backward.

We only need the final states from the last layer, the last two matrices, so we can select them with the following.

In [153]:
last_forwards = hidden[-2,:,:]
last_forwards.shape

torch.Size([64, 256])

In [154]:
last_backwards = hidden[-1, :, :]
last_backwards.shape

torch.Size([64, 256])

> There is a final state for each of the 64 documents in the batch, and the hidden state is of length 256.
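
To convince ourselves that `hidden[-2]` and `hidden[-1]` really are the last layer's final forward and backward states, we can compare them against the per-token outputs. This is a sketch on a small made-up LSTM and batch (dropout left at its default of 0 so the comparison is exact); it unpacks the output with `pad_packed_sequence`, which the lesson's model does not need.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
hidden_size = 4
toy_lstm = nn.LSTM(5, hidden_size, num_layers=2, bidirectional=True)  # toy sizes
toy_embedded = torch.randn(6, 3, 5)     # [seq len, batch, embedding dim]
toy_lengths = torch.tensor([6, 4, 2])   # made-up document lengths

packed = pack_padded_sequence(toy_embedded, toy_lengths, enforce_sorted=False)
packed_output, (hidden, cell) = toy_lstm(packed)
padded_output, _ = pad_packed_sequence(packed_output)  # [6, 3, 2 * hidden_size]

# Forward direction: last layer's output at each document's last real token
forward_at_last_token = padded_output[toy_lengths - 1, torch.arange(3), :hidden_size]
# Backward direction: last layer's output at the first token
backward_at_first_token = padded_output[0, :, hidden_size:]

forward_matches = torch.allclose(hidden[-2], forward_at_last_token, atol=1e-6)
backward_matches = torch.allclose(hidden[-1], backward_at_first_token, atol=1e-6)
print(forward_matches, backward_matches)  # True True
```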

4. Passing the hidden states to the output layer

Finally, we can concatenate the forwards and backwards hidden states together.

In [155]:
combined_hidden = torch.cat((last_forwards, last_backwards), dim = 1)
combined_hidden.shape

torch.Size([64, 512])

And then pass this to our output layer.

In [156]:
lstm.fc(combined_hidden).shape

torch.Size([64, 1])

Now that we've walked through each of the steps, let's update our neural network by filling in our `forward` method.

In [163]:
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(25002, 100, padding_idx = 1)
        self.lstm_layer = nn.LSTM(100, 256, num_layers=2, bidirectional=True, 
                           dropout=.5)
        self.fc = nn.Linear(256 * 2, 1)
    def forward(self, text, document_lengths):
        embedded_batch = self.embedding(text) # torch.Size([713, 64, 100])
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded_batch, 
                                                    document_lengths.cpu(),  # recent PyTorch versions expect CPU lengths
                                                    enforce_sorted=False)
        packed_output, (hidden, cell) = self.lstm_layer(packed_embedded)
        l2_forwards = hidden[-2,:,:]
        l2_backwards = hidden[-1, :, :]
        combined_hidden = torch.cat((l2_forwards, 
                                     l2_backwards), dim = 1)
        return self.fc(combined_hidden)

This is the final version of our LSTM, so take a look at it to ensure that each of the components makes sense.

Then let's initialize an instance of the LSTM, and see how we can pass through some data.

In [164]:
updated_lstm = LSTM()

In [None]:
pretrained_embeddings = TEXT.vocab.vectors
updated_lstm.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

updated_lstm.embedding.weight.data[UNK_IDX] = torch.zeros(100)
updated_lstm.embedding.weight.data[PAD_IDX] = torch.zeros(100)

> Remember that the `text` attribute of each batch is a tuple of both the numericalized documents and the document lengths.

In [161]:
type(first_batch_text)

tuple

So we can pass them to our LSTM individually.

In [None]:
updated_lstm = updated_lstm.to(device)

In [165]:
predictions = updated_lstm(first_batch_text[0], first_batch_text[1])

Or we can use Python's splat operator, `*`, to unpack the tuple and pass its elements through as separate arguments.

In [166]:
predictions = updated_lstm(*first_batch_text)

### Train the model

Ok, now it's time to train the model.  We initialize our optimizer, and loss function.

In [None]:
import torch.optim as optim

optimizer = optim.Adam(updated_lstm.parameters())
bce_loss = nn.BCEWithLogitsLoss()
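
Our model's output layer returns raw scores (logits) rather than probabilities, which is why we use `BCEWithLogitsLoss` here: it applies the sigmoid internally. A minimal sketch, using made-up logits and labels, showing that it agrees with applying `torch.sigmoid` ourselves and then using `nn.BCELoss`:

```python
import torch
import torch.nn as nn

toy_logits = torch.tensor([2.0, -1.0, 0.5])  # made-up raw model outputs
toy_labels = torch.tensor([1.0, 0.0, 0.0])   # made-up targets

loss_from_logits = nn.BCEWithLogitsLoss()(toy_logits, toy_labels)
loss_from_probs = nn.BCELoss()(torch.sigmoid(toy_logits), toy_labels)

losses_agree = torch.allclose(loss_from_logits, loss_from_probs, atol=1e-6)
print(losses_agree)  # True
```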

Then we call `to(device)` so that our model and loss function live on the GPU when one is available.

In [None]:
updated_lstm = updated_lstm.to(device)
bce_loss = bce_loss.to(device)

And then we train the model.

In [None]:
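updated_lstm.train()  # make sure dropout is active while training
for epoch in range(7):
    for batch in train_iterator:
        text, document_lengths = batch.text
        # .to(device) instead of .cuda(), so the loop also runs without a GPU
        preds = updated_lstm(text.to(device), document_lengths.to(device))
        loss = bce_loss(preds.squeeze(1), batch.label.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(loss.item())  # loss on the last batch of the epoch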
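
The loss printed at the end of each epoch comes from a single batch, so it can be noisy. One option is to track the average loss over the whole epoch instead. The sketch below shows that pattern on synthetic data, so it runs without downloading IMDB; `TinyLSTM`, `make_fake_batch`, and all of the sizes are hypothetical stand-ins, not the lesson's model or data.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

class TinyLSTM(nn.Module):
    """A scaled-down stand-in for the lesson's model (toy sizes)."""
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(20, 8, padding_idx=1)
        self.lstm_layer = nn.LSTM(8, 16, bidirectional=True)
        self.fc = nn.Linear(16 * 2, 1)

    def forward(self, text, document_lengths):
        embedded = self.embedding(text)
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, document_lengths.cpu(), enforce_sorted=False)
        _, (hidden, _) = self.lstm_layer(packed)
        return self.fc(torch.cat((hidden[-2], hidden[-1]), dim=1))

def make_fake_batch(batch_size=4, max_len=6):
    """Random token ids, lengths, and labels; padding uses index 1."""
    lengths = torch.randint(1, max_len + 1, (batch_size,))
    text = torch.randint(2, 20, (max_len, batch_size))
    for i, n in enumerate(lengths.tolist()):
        text[n:, i] = 1  # pad everything past the document's length
    labels = torch.randint(0, 2, (batch_size,)).float()
    return text, lengths, labels

fake_batches = [make_fake_batch() for _ in range(5)]
model = TinyLSTM()
optimizer = optim.Adam(model.parameters())
bce_loss = nn.BCEWithLogitsLoss()

epoch_losses = []
model.train()
for epoch in range(3):
    running_loss, n_examples = 0.0, 0
    for text, lengths, labels in fake_batches:
        preds = model(text, lengths)
        loss = bce_loss(preds.squeeze(1), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * len(labels)  # undo the per-batch mean
        n_examples += len(labels)
    epoch_losses.append(running_loss / n_examples)  # average loss per example
print(epoch_losses)
```

Because the labels here are random, the numbers themselves mean nothing; only the bookkeeping carries over to the real training loop.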

After training the model, we can compute the accuracy on the test data.

In [None]:
def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
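
To see what `binary_accuracy` does, we can call it on a few made-up logits and labels. The function is repeated here only so the example runs on its own.

```python
import torch

def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

# Made-up logits: sigmoid gives roughly 0.88, 0.27, 0.62, 0.05, which round to 1, 0, 1, 0
toy_preds = torch.tensor([2.0, -1.0, 0.5, -3.0])
toy_labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

toy_accuracy = binary_accuracy(toy_preds, toy_labels)
print(toy_accuracy.item())  # 0.75, since three of the four predictions match
```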

In [None]:
updated_lstm.eval()

accuracies = []
batch_lengths = []
with torch.no_grad():
    for batch in test_iterator:
        outputs = updated_lstm(*batch.text)
        labels = batch.label
        accuracy = binary_accuracy(outputs.squeeze(1), labels)
        accuracies.append(accuracy.item())
        batch_lengths.append(len(outputs))

In [None]:
sum([accuracy*batch_length for accuracy, batch_length in zip(accuracies, batch_lengths)])/sum(batch_lengths)
# .878
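
We weight each batch's accuracy by its size because the last batch is usually smaller than the rest, so a plain average of the per-batch accuracies would give its documents too much influence. A small sketch with made-up numbers:

```python
# Made-up per-batch results: a full batch of 64 and a smaller final batch of 16
toy_accuracies = [1.0, 0.5]
toy_batch_lengths = [64, 16]

simple_mean = sum(toy_accuracies) / len(toy_accuracies)
weighted_mean = sum(acc * n for acc, n in zip(toy_accuracies, toy_batch_lengths)) / sum(toy_batch_lengths)

print(simple_mean)    # 0.75
print(weighted_mean)  # 0.9, i.e. (64 * 1.0 + 16 * 0.5) / 80
```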

> As a sanity check, we can write a function that predicts the sentiment of a single sentence.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy 3

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()