## Sequence Models

In this tutorial, you will learn about LSTM neural networks and see an example of how they can be used to recognize parts of speech.

In [1]:
import math
import matplotlib.pyplot as plt 
import numpy as np 
import os
import torch
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
from IPython.display import Image

%matplotlib inline

torch.manual_seed(1)

<torch._C.Generator at 0x12557b650>

## Introduction

### RNNs
* Difficulty learning widely separated relationships in long sequences.

### What is an LSTM?
* Long Short-Term Memory.
* Able to learn long term dependencies in a sequence.
* Uses a cell state and gates to remember relationships.

**Example**: *The humidity is very high. Today it is going to **rain**.* A regular RNN might not do a good job of predicting the word **rain**.

### Rolled Out LSTM Network

![rolled out lstm](data/roll_out.png)

### LSTM Cell

![lstm cell](data/lstm_cell.png)



### Input Gate
* Decide what information to store in the cell state.
* Decide what to update.
  * Sigmoid layer.
* Create candidate values for the cell state.
  * Tanh layer.

### Example Continued:

* Replace the previous rain likelihood with the new likelihood for dry weather.

![](data/LSTM3-focus-i.png)

### Update Cell State
* Remove forgotten information.
  * Multiply $f_t$ by the old cell state.
* Add new candidate values scaled by their importance.
  * Add $i_t\ast\tilde{C}_t$.

  ![](data/LSTM3-focus-C.png)
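
Taken together, the two steps above amount to a single update. As a sketch, assuming $C_{t-1}$ denotes the old cell state and $\ast$ is element-wise multiplication:

$$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t$$

The first term keeps the part of the old state that $f_t$ lets through; the second adds the candidate values, scaled by $i_t$.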

### Output Gate
* Return a filtered version of the cell state.

### Example Continued:

* Keep track of humidity magnitude to help the network decide whether to predict a large storm or just light rain.

![](data/LSTM3-focus-o.png)
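
The gate descriptions above can be put together in a few lines. The following is a minimal sketch of one LSTM step for scalar values, using the standard LSTM equations. The function name `lstm_step`, the weight dictionary `params`, and all numeric values are assumptions chosen for illustration; they are not taken from the diagrams.

```python
import math


def sigmoid(z):
    """Logistic function: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step for a scalar input, hidden state, and cell state.

    params maps each gate name to hypothetical (w_h, w_x, b) weights.
    """
    def pre_activation(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("forget"))            # how much old cell state to keep
    i_t = sigmoid(pre_activation("input"))             # how much of the candidate to store
    c_tilde = math.tanh(pre_activation("candidate"))   # candidate values for the cell state
    o_t = sigmoid(pre_activation("output"))            # how much of the cell state to expose

    c_t = f_t * c_prev + i_t * c_tilde                 # update the cell state
    h_t = o_t * math.tanh(c_t)                         # filtered version of the cell state
    return h_t, c_t


# Hypothetical weights, chosen only to make the example run.
params = {
    "forget": (0.5, 0.5, 0.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (0.5, 0.5, 0.0),
    "output": (0.5, 0.5, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

In a real network the weights are matrices learned during training, and PyTorch's `nn.LSTM` performs these steps internally.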

## Very Simple Example 1
Create an LSTM in PyTorch using a for loop.

In [2]:
# define LSTM 
SEQUENCE_LEN = 5  # The length of the sequence
INPUT_SIZE = 1  # Number of input features per time step
HIDDEN_SIZE = 1  # Number of features in the hidden state
BATCH_SIZE = 1  # The batch size

lstm = nn.LSTM(input_size=INPUT_SIZE, hidden_size=HIDDEN_SIZE)

# create fake inputs for LSTM
inputs = torch.randn(SEQUENCE_LEN, BATCH_SIZE, INPUT_SIZE)

# initialize the hidden state and cell states
hidden_0 = torch.randn(1, BATCH_SIZE, HIDDEN_SIZE)
cell_0 = torch.randn(1, BATCH_SIZE, HIDDEN_SIZE)

# Step through the LSTM one time step at a time
for i, in_value in enumerate(inputs):
    # Note: every call below receives the same initial state (hidden_0, cell_0).
    # The state returned in hidden_out is printed but not fed into the next step.
    out, hidden_out = lstm(in_value.view(1, 1, -1), (hidden_0, cell_0))
    print('x_{}: {}'.format(i+1, out))
    print('h_{}: {}'.format(i+1, hidden_out))
    print('')

x_1: tensor([[[0.1286]]], grad_fn=<StackBackward>)
h_1: (tensor([[[0.1286]]], grad_fn=<StackBackward>), tensor([[[0.5273]]], grad_fn=<StackBackward>))

x_2: tensor([[[0.1395]]], grad_fn=<StackBackward>)
h_2: (tensor([[[0.1395]]], grad_fn=<StackBackward>), tensor([[[0.4831]]], grad_fn=<StackBackward>))

x_3: tensor([[[0.1322]]], grad_fn=<StackBackward>)
h_3: (tensor([[[0.1322]]], grad_fn=<StackBackward>), tensor([[[0.5156]]], grad_fn=<StackBackward>))

x_4: tensor([[[0.1449]]], grad_fn=<StackBackward>)
h_4: (tensor([[[0.1449]]], grad_fn=<StackBackward>), tensor([[[0.4218]]], grad_fn=<StackBackward>))

x_5: tensor([[[0.1443]]], grad_fn=<StackBackward>)
h_5: (tensor([[[0.1443]]], grad_fn=<StackBackward>), tensor([[[0.4398]]], grad_fn=<StackBackward>))



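Here, `x` is the output and `h` is the tuple of hidden and cell states at each step in the sequence. With a single-layer LSTM the output equals the hidden state, which is why the first value in each `h` matches `x`.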



## Very Simple Example 2
Create an LSTM in PyTorch and pass it the whole sequence at once, using `torch.cat` to build the input tensor.

In [3]:
# Define LSTM architecture
SEQUENCE_LEN = 5  # The length of the sequence
INPUT_SIZE = 1  # Number of input features per time step
HIDDEN_SIZE = 1  # Number of features in the hidden state
BATCH_SIZE = 1  # The batch size
lstm = nn.LSTM(input_size=INPUT_SIZE, hidden_size=HIDDEN_SIZE)

# create the inputs for the LSTM
inputs = [torch.randn(BATCH_SIZE, INPUT_SIZE) for _ in range(SEQUENCE_LEN)]

# Concatenate the inputs so that they are a tensor
inputs = torch.cat(inputs).view(len(inputs), 1, -1)

# initialize the hidden and cell states
hidden_0 = torch.randn(1, BATCH_SIZE, HIDDEN_SIZE)
cell_0 = torch.randn(1, BATCH_SIZE, HIDDEN_SIZE)

# out = hidden states for every time step, hidden = (last hidden state, last cell state)
out, hidden = lstm(inputs, (hidden_0, cell_0))


print('out: {}'.format(out))
print('last hidden and cell states: {}'.format(hidden))

out: tensor([[[-0.1616]],

        [[-0.1378]],

        [[ 0.0745]],

        [[-0.2183]],

        [[-0.2227]]], grad_fn=<StackBackward>)
last hidden and cell states: (tensor([[[-0.2227]]], grad_fn=<StackBackward>), tensor([[[-0.9171]]], grad_fn=<StackBackward>))
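
The loop in Example 1 and the single call in Example 2 produce the same hidden states only if the state returned by each step is fed into the next one. The sketch below checks this, assuming the same `nn.LSTM` setup as above. The variable names `state`, `step_outputs`, and `looped_out` are introduced here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

SEQUENCE_LEN, BATCH_SIZE, INPUT_SIZE, HIDDEN_SIZE = 5, 1, 1, 1
lstm = nn.LSTM(input_size=INPUT_SIZE, hidden_size=HIDDEN_SIZE)

inputs = torch.randn(SEQUENCE_LEN, BATCH_SIZE, INPUT_SIZE)
hidden_0 = torch.randn(1, BATCH_SIZE, HIDDEN_SIZE)
cell_0 = torch.randn(1, BATCH_SIZE, HIDDEN_SIZE)

# Whole sequence in a single call
full_out, (h_n, c_n) = lstm(inputs, (hidden_0, cell_0))

# One time step per call, carrying the state forward
state = (hidden_0, cell_0)
step_outputs = []
for in_value in inputs:
    # in_value has shape (batch, input_size); add the sequence axis back
    out, state = lstm(in_value.view(1, BATCH_SIZE, -1), state)
    step_outputs.append(out)
looped_out = torch.cat(step_outputs)

# Compare the two approaches
print(torch.allclose(full_out, looped_out, atol=1e-6))  # per-step outputs
print(torch.allclose(h_n, state[0], atol=1e-6))         # final hidden state
```

Passing the full sequence in one call is usually the simpler choice; stepping manually is mainly useful when the next input depends on the previous output.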


## Example: An LSTM for Part-of-Speech Tagging
* Predict the part of speech of each word in a sentence.

### Prepare the data:
The training data is a list of (sentence, tags) pairs.
* The first list in each pair is a sentence, split into words.
* The second list holds the part-of-speech tag for each word in the sentence.

In [5]:
training_data = [
    ("The dog ate the apple.".split(), ["Determiner", "Noun", "Verb", "Determiner", "Noun"]),
    ("Everybody read that book.".split(), ["Noun", "Verb", "Determiner", "Noun"])
]

In [7]:
training_sentences = [sentence for sentence, tags in training_data]
training_sentences

[['The', 'dog', 'ate', 'the', 'apple.'],
 ['Everybody', 'read', 'that', 'book.']]

### Clean the data
Make all words lower case and remove punctuation.



In [10]:
def clean_words(words_list):
    '''lowercase words and remove punctuation
    returns: clean words list
    '''
    return [word.lower().split('.')[0] for word in words_list]

In [11]:
training_data_clean = [(clean_words(sentence), tags) for sentence, tags in training_data]
training_data_clean



[(['the', 'dog', 'ate', 'the', 'apple'],
  ['Determiner', 'Noun', 'Verb', 'Determiner', 'Noun']),
 (['everybody', 'read', 'that', 'book'],
  ['Noun', 'Verb', 'Determiner', 'Noun'])]

In [12]:
training_sentences_clean = [sentence for sentence, tag in training_data_clean]
training_sentences_clean

[['the', 'dog', 'ate', 'the', 'apple'], ['everybody', 'read', 'that', 'book']]
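
Note that `clean_words` only handles periods: it keeps whatever precedes the first `.` in each word. For text with other punctuation, one possible variant uses `str.translate`. This is a sketch under the assumption that stripping every ASCII punctuation character is acceptable; the name `clean_words_general` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def clean_words_general(words_list):
    '''Lowercase words and strip ASCII punctuation.
    returns: clean words list
    '''
    return [word.lower().translate(_PUNCTUATION_TABLE) for word in words_list]


print(clean_words_general("The dog ate the apple.".split()))
print(clean_words_general("Wait, who read that book?".split()))
```

This variant also removes apostrophes and hyphens inside words, which may or may not be wanted for a given dataset.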

### Create vocabulary
Using all words in each sentence of the training data, create a vocabulary.

In [18]:
words = []

for sentence in training_sentences_clean:
    words += sentence

vocab = list(set(words))
print(vocab)

['apple', 'read', 'ate', 'that', 'the', 'dog', 'everybody', 'book']


### Create mapping dictionaries
Use a dictionary to convert words to integers.

In [21]:
word_to_ix = {word:i for i, word in enumerate(vocab)}
print(word_to_ix)

{'apple': 0, 'read': 1, 'ate': 2, 'that': 3, 'the': 4, 'dog': 5, 'everybody': 6, 'book': 7}
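
The order of `list(set(words))` is not guaranteed to be the same across Python sessions, so the indices printed above may differ on another run. If the mapping has to be reproducible, for instance when a saved model is loaded in a different process, one option is to sort the vocabulary first. A minimal sketch, with `vocab_sorted` and `word_to_ix_sorted` as hypothetical names:

```python
training_sentences_clean = [
    ['the', 'dog', 'ate', 'the', 'apple'],
    ['everybody', 'read', 'that', 'book'],
]

# Collect the unique words, then sort them so the order is deterministic
vocab_sorted = sorted({word for sentence in training_sentences_clean for word in sentence})
word_to_ix_sorted = {word: i for i, word in enumerate(vocab_sorted)}

print(word_to_ix_sorted)
```

The rest of this tutorial keeps the original `word_to_ix`, so the indices shown in later outputs refer to that mapping.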


Map the parts-of-speech tags to integers:



In [22]:
# Tags to integers
tag_to_ix = {'Determiner': 0, 'Noun': 1, 'Verb': 2}

In [23]:
tag_to_ix

{'Determiner': 0, 'Noun': 1, 'Verb': 2}

Map the integers back to parts-of-speech.



In [24]:
ix_to_tag = {i:tag for tag, i in tag_to_ix.items()}
ix_to_tag

{0: 'Determiner', 1: 'Noun', 2: 'Verb'}

## Set Hyperparameters


In [25]:
EMBEDDING_DIM = 6
HIDDEN_DIM = 6
LEARNING_RATE = 0.1
NUM_EPOCHS = 300

## Create the model

The `LSTMTagger` class:

* Inherits from PyTorch's `nn.Module`.
* Inputs:
    * Embedding dimension.
    * Hidden dimension (size of the LSTM hidden state).
    * Vocabulary size.
    * Tag set size.


In [40]:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # LSTM: Inputs are embeddings, outputs are hidden states
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        # Linear layer maps hidden space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Initialize the hidden and cell states to zeros.
        # The axes correspond to (num_layers, minibatch_size, hidden_dim).
        return (torch.zeros(1, 1, self.hidden_dim),
                torch.zeros(1, 1, self.hidden_dim))

    def forward(self, sentence):
        '''
        Make a forward pass through the LSTM

        :param sentence: The input sentence as word indices
        :type sentence: Tensor
        :return: A Tensor of tag scores (log-probabilities), one row per word.
        :rtype: Tensor
        '''
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
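
To make the tensor shapes inside `forward` concrete, the sketch below runs the same three layers by hand on a five-word sentence. The layer sizes mirror the tutorial's setup (8 words, 3 tags, dimensions of 6), but the word indices are arbitrary placeholders and the layers are untrained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1)

VOCAB_SIZE, TAGSET_SIZE = 8, 3
EMBEDDING_DIM, HIDDEN_DIM = 6, 6

word_embeddings = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM)
hidden2tag = nn.Linear(HIDDEN_DIM, TAGSET_SIZE)

# Placeholder word indices for a five-word sentence
sentence = torch.tensor([4, 5, 2, 4, 0], dtype=torch.long)

embeds = word_embeddings(sentence)                        # (5, 6)
lstm_in = embeds.view(len(sentence), 1, -1)               # (5, 1, 6): seq_len, batch, features
lstm_out, _ = lstm(lstm_in)                               # (5, 1, 6); state defaults to zeros
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))  # (5, 3)
tag_scores = F.log_softmax(tag_space, dim=1)              # (5, 3) log-probabilities

for name, t in [('embeds', embeds), ('lstm_in', lstm_in), ('lstm_out', lstm_out),
                ('tag_space', tag_space), ('tag_scores', tag_scores)]:
    print(f'{name:10s} {tuple(t.shape)}')
```

Each row of `tag_scores` holds the log-probabilities of the three tags for one word, so exponentiating a row gives values that sum to 1.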





## Helper Function
Map either words or tags to integers, using the previously defined dictionaries (`word_to_ix`, `tag_to_ix`).

In [41]:
def prepare_sequence(seq, to_ix):
    '''
    Convert words or tags to integers and return a Tensor
    :param seq: sequence of words or tags
    :type seq: list
    :param to_ix: Dictionary mapping words or tags to integers
    :return: A PyTorch tensor of indices
    :rtype: Tensor
    '''
    idx = [to_ix[w] for w in seq]
    return torch.tensor(idx, dtype=torch.long)

In [42]:
# test
prepare_sequence(['the'], word_to_ix)

tensor([4])

## Train the model:
Create the PyTorch LSTM model using the hyperparameters defined above.

In [43]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(vocab), len(tag_to_ix))

Define the loss function. We use negative log likelihood (`nn.NLLLoss`), a standard loss for classification problems. It expects log-probabilities as input, which is why `forward` ends with `log_softmax`.



In [44]:
loss_function = nn.NLLLoss()

### Negative Log Likelihood
We can illustrate negative log likelihood in the following diagram:

![data/nll.png](data/nll.png)
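
As a small numeric illustration, the negative log likelihood of a prediction is minus the log of the probability assigned to the correct class, averaged over the items (the default reduction of `nn.NLLLoss`). The sketch below uses plain Python; the helper name `nll` and the probability rows are made up for the example.

```python
import math


def nll(probabilities, targets):
    '''Mean negative log likelihood of the target classes.

    probabilities: list of rows, each a list of class probabilities
    targets: list of correct class indices, one per row
    '''
    losses = [-math.log(row[t]) for row, t in zip(probabilities, targets)]
    return sum(losses) / len(losses)


targets = [0, 1, 2]  # e.g. Determiner, Noun, Verb

# Hypothetical predictions: unsure vs. confident and correct
unsure = [[0.24, 0.26, 0.50], [0.24, 0.28, 0.48], [0.25, 0.27, 0.48]]
confident = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]

print(f'unsure:    {nll(unsure, targets):.4f}')
print(f'confident: {nll(confident, targets):.4f}')
```

The loss approaches zero as the probability of the correct class approaches 1, and grows quickly when the correct class is given a low probability.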

We will train the model using stochastic gradient descent.



In [45]:
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)
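
Each call to `optimizer.step()` moves every parameter a small step against its gradient. The update rule for plain SGD (no momentum) can be sketched on a one-parameter toy problem. The loss $(w - 3)^2$ and the variable names are assumptions for illustration only.

```python
LEARNING_RATE = 0.1

w = 0.0  # single parameter, arbitrary starting value
for step in range(50):
    grad = 2 * (w - 3)            # derivative of the toy loss (w - 3) ** 2
    w = w - LEARNING_RATE * grad  # plain SGD update: parameter -= lr * gradient

print(f'w after 50 steps: {w:.4f}')
```

In this tutorial the gradient is computed from one sentence at a time rather than from the whole dataset, which is what makes the descent stochastic.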

Let's run the model before any training has been done and store the resulting probabilities and predictions in lists. We will then compare them with the results after training.



In [46]:
training_sentences

[['The', 'dog', 'ate', 'the', 'apple.'],
 ['Everybody', 'read', 'that', 'book.']]

In [51]:
store_initial_probabilities = []
store_initial_predictions = []

with torch.no_grad():
    for sentence in training_sentences_clean:
        inputs = prepare_sequence(sentence, word_to_ix)
        tag_scores = model(inputs)
        tag_probabilities = tag_scores.exp()
        # print(tag_probabilities)
        max_values, max_indices = torch.max(tag_probabilities, dim=1)
        initial_prediction = [ix_to_tag[x] for x in max_indices.numpy()]
        store_initial_predictions.append(initial_prediction)
        store_initial_probabilities.append(tag_probabilities)

In [52]:
print(store_initial_probabilities)

[tensor([[0.2398, 0.2627, 0.4975],
        [0.2419, 0.2779, 0.4802],
        [0.2534, 0.2684, 0.4782],
        [0.2489, 0.2651, 0.4860],
        [0.2222, 0.2934, 0.4845]]), tensor([[0.2205, 0.3022, 0.4773],
        [0.2305, 0.2911, 0.4784],
        [0.2230, 0.2964, 0.4807],
        [0.2240, 0.2953, 0.4807]])]


In [53]:
print(store_initial_predictions)

[['Verb', 'Verb', 'Verb', 'Verb', 'Verb'], ['Verb', 'Verb', 'Verb', 'Verb']]


Now, we will train the model.



In [54]:
training_data_clean

[(['the', 'dog', 'ate', 'the', 'apple'],
  ['Determiner', 'Noun', 'Verb', 'Determiner', 'Noun']),
 (['everybody', 'read', 'that', 'book'],
  ['Noun', 'Verb', 'Determiner', 'Noun'])]

In [56]:
for epoch in range(NUM_EPOCHS):
    for sentence, tags in training_data_clean:
        optimizer.zero_grad()

        # Reset the hidden state of the LSTM before each training sentence
        model.hidden = model.init_hidden()

        # Turn inputs into tensors of word indices
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Forward pass
        tag_scores = model(sentence_in)

        # Compute the loss, gradients and update parameters
        loss = loss_function(tag_scores, targets)
        loss.backward()

        print(f'epoch: {epoch+1}, loss: {loss:.4f}')
        optimizer.step()

epoch: 1, loss: 0.0402
epoch: 1, loss: 0.0526
epoch: 2, loss: 0.0397
epoch: 2, loss: 0.0521
epoch: 3, loss: 0.0393
epoch: 3, loss: 0.0515
epoch: 4, loss: 0.0388
epoch: 4, loss: 0.0510
epoch: 5, loss: 0.0384
epoch: 5, loss: 0.0505
epoch: 6, loss: 0.0380
epoch: 6, loss: 0.0500
epoch: 7, loss: 0.0376
epoch: 7, loss: 0.0495
epoch: 8, loss: 0.0371
epoch: 8, loss: 0.0491
epoch: 9, loss: 0.0367
epoch: 9, loss: 0.0486
epoch: 10, loss: 0.0364
epoch: 10, loss: 0.0481
epoch: 11, loss: 0.0360
epoch: 11, loss: 0.0477
epoch: 12, loss: 0.0356
epoch: 12, loss: 0.0472
epoch: 13, loss: 0.0352
epoch: 13, loss: 0.0468
epoch: 14, loss: 0.0349
epoch: 14, loss: 0.0464
epoch: 15, loss: 0.0345
epoch: 15, loss: 0.0459
epoch: 16, loss: 0.0342
epoch: 16, loss: 0.0455
epoch: 17, loss: 0.0338
epoch: 17, loss: 0.0451
epoch: 18, loss: 0.0335
epoch: 18, loss: 0.0447
epoch: 19, loss: 0.0332
epoch: 19, loss: 0.0443
epoch: 20, loss: 0.0328
epoch: 20, loss: 0.0439
epoch: 21, loss: 0.0325
epoch: 21, loss: 0.0435
epoch: 22,

Our model has now finished training. Let's print out some statistics to show how well the model training performed.



In [58]:
training_sentences_clean

[['the', 'dog', 'ate', 'the', 'apple'], ['everybody', 'read', 'that', 'book']]

In [57]:
store_initial_predictions.reverse()
store_initial_probabilities.reverse()

with torch.no_grad():
    for sentence in training_sentences_clean:
        inputs = prepare_sequence(sentence, word_to_ix)
        tag_scores = model(inputs)
        tag_probabilities = tag_scores.exp()
        max_values, max_indices = torch.max(tag_probabilities, 1)
        predictions = [ix_to_tag[x] for x in max_indices.numpy()]

        print('Before training:')
        print(' - initial probabilities: {}'.format(store_initial_probabilities.pop()))
        print(' - sentence: {}'.format(' '.join(sentence)))
        print(' - prediction: {}'.format(store_initial_predictions.pop()))
        print('After training:')
        print(' - final probabilities: {}'.format(tag_probabilities))
        print(' - sentence: {}'.format(' '.join(sentence)))
        print(' - prediction: {}'.format(predictions))
        print('')

Before training:
 - initial probabilities: tensor([[0.2398, 0.2627, 0.4975],
        [0.2419, 0.2779, 0.4802],
        [0.2534, 0.2684, 0.4782],
        [0.2489, 0.2651, 0.4860],
        [0.2222, 0.2934, 0.4845]])
 - sentence: the dog ate the apple
 - prediction: ['Verb', 'Verb', 'Verb', 'Verb', 'Verb']
After training:
 - final probabilities: tensor([[9.8002e-01, 1.1596e-02, 8.3884e-03],
        [2.3367e-03, 9.9410e-01, 3.5672e-03],
        [8.6107e-03, 6.2304e-03, 9.8516e-01],
        [9.9327e-01, 8.0867e-04, 5.9249e-03],
        [2.7279e-03, 9.9586e-01, 1.4118e-03]])
 - sentence: the dog ate the apple
 - prediction: ['Determiner', 'Noun', 'Verb', 'Determiner', 'Noun']

Before training:
 - initial probabilities: tensor([[0.2205, 0.3022, 0.4773],
        [0.2305, 0.2911, 0.4784],
        [0.2230, 0.2964, 0.4807],
        [0.2240, 0.2953, 0.4807]])
 - sentence: everybody read that book
 - prediction: ['Verb', 'Verb', 'Verb', 'Verb']
After training:
 - final probabilities: tensor([[1.1

## Save Model
Save the PyTorch model to disk. This model will be used in the deployment tutorial.

In [62]:
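models_path = os.path.join(os.getcwd(), 'models', 'model.pt')
# Create the target directory if it does not exist yet, so that torch.save does not fail
os.makedirs(os.path.dirname(models_path), exist_ok=True)
models_path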

'/Users/hzh/Desktop/Pytorch_Tutorial/models/model.pt'

In [65]:
torch.save(model.state_dict(), models_path)

## Load Model

In [66]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))

In [67]:
model.load_state_dict(torch.load(models_path))

<All keys matched successfully>

In [68]:
model.eval()

LSTMTagger(
  (word_embeddings): Embedding(8, 6)
  (lstm): LSTM(6, 6)
  (hidden2tag): Linear(in_features=6, out_features=3, bias=True)
)

Run a training example through the loaded model and make a prediction.



In [69]:
with torch.no_grad():
    inputs = prepare_sequence(training_sentences_clean[0], word_to_ix)
    tag_scores = model(inputs)
    tag_probabilities = tag_scores.exp()
    max_values, max_indices = torch.max(tag_probabilities, 1)
    predictions = [ix_to_tag[x] for x in max_indices.numpy()]
    print('sentence: {}'.format(' '.join(training_sentences[0])))
    print('parts-of-speech: {}'.format(predictions))

sentence: The dog ate the apple.
parts-of-speech: ['Determiner', 'Noun', 'Verb', 'Determiner', 'Noun']
