In [None]:
%matplotlib inline


Sequence Models and Long-Short Term Memory Networks
===================================================

At this point, we have seen various feed-forward networks. That is,
there is no state maintained by the network at all. This might not be
the behavior we want. Sequence models are central to NLP: they are
models where there is some sort of dependence through time between your
inputs. The classical example of a sequence model is the Hidden Markov
Model for part-of-speech tagging. Another example is the conditional
random field.

A recurrent neural network is a network that maintains some kind of
state. For example, its output could be used as part of the next input,
so that information can propogate along as the network passes over the
sequence. In the case of an LSTM, for each element in the sequence,
there is a corresponding *hidden state* $h_t$, which in principle
can contain information from arbitrary points earlier in the sequence.
We can use the hidden state to predict words in a language model,
part-of-speech tags, and a myriad of other things.


LSTM's in Pytorch
~~~~~~~~~~~~~~~~~

Before getting to the example, note a few things. Pytorch's LSTM expects
all of its inputs to be 3D tensors. The semantics of the axes of these
tensors is important. The first axis is the sequence itself, the second
indexes instances in the mini-batch, and the third indexes elements of
the input. We haven't discussed mini-batching, so lets just ignore that
and assume we will always have just 1 dimension on the second axis. If
we want to run the sequence model over the sentence "The cow jumped",
our input should look like

\begin{align}\begin{bmatrix}
   \overbrace{q_\text{The}}^\text{row vector} \\
   q_\text{cow} \\
   q_\text{jumped}
   \end{bmatrix}\end{align}

Except remember there is an additional 2nd dimension with size 1.

In addition, you could go through the sequence one at a time, in which
case the 1st axis will have size 1 also.

Let's see a quick example.



In [1]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1285ca4f0>

In [77]:
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5
inputs, len(inputs), len(inputs[0]), len(inputs[0][0])

([tensor([[ 2.4088, -0.4895, -1.1209]]),
  tensor([[0.0227, 1.6322, 1.4544]]),
  tensor([[0.8580, 0.2502, 0.8019]]),
  tensor([[-1.9631, -0.6891, -1.8564]]),
  tensor([[-0.7074, -0.0188,  0.4403]])],
 5,
 1,
 3)

input dim = (5,3)

In [59]:
# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
hidden

(tensor([[[-0.8395,  0.1390, -0.0379]]]),
 tensor([[[-0.0193, -0.7948,  0.9273]]]))

hidden dim = (2,3)

## `.view()`

In [28]:
a = torch.arange(1, 17)  # a's shape is (16,)
a

tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

In [29]:
a.view(4, 4) # output below

tensor([[ 1,  2,  3,  4],
        [ 5,  6,  7,  8],
        [ 9, 10, 11, 12],
        [13, 14, 15, 16]])

In [30]:
a.view(2, 2, 4) # output below

tensor([[[ 1,  2,  3,  4],
         [ 5,  6,  7,  8]],

        [[ 9, 10, 11, 12],
         [13, 14, 15, 16]]])

In [32]:
a.view(2, 2, 4)[0], a.view(2, 2, 4)[1]

(tensor([[1, 2, 3, 4],
         [5, 6, 7, 8]]), tensor([[ 9, 10, 11, 12],
         [13, 14, 15, 16]]))

In [39]:
a.view(2, 4, -1), a.view(2, 4, -1).size ()

(tensor([[[ 1,  2],
          [ 3,  4],
          [ 5,  6],
          [ 7,  8]],
 
         [[ 9, 10],
          [11, 12],
          [13, 14],
          [15, 16]]]), torch.Size([2, 4, 2]))

In [42]:
a.view(-1, 2, 4), a.view(-1, 2, 4).size()

(tensor([[[ 1,  2,  3,  4],
          [ 5,  6,  7,  8]],
 
         [[ 9, 10, 11, 12],
          [13, 14, 15, 16]]]), torch.Size([2, 2, 4]))

In [40]:
a.view(2, 8), a.view(2, 8).size() # output below, 

(tensor([[ 1,  2,  3,  4,  5,  6,  7,  8],
         [ 9, 10, 11, 12, 13, 14, 15, 16]]), torch.Size([2, 8]))

#### Back to tutorial

In [43]:
print(inputs[0], inputs[0].size())
print(inputs[0].view(1, 1, -1), inputs[0].view(1, 1, -1).size())

tensor([[-1.8107, -1.9912, -0.2434]]) torch.Size([1, 3])
tensor([[[-1.8107, -1.9912, -0.2434]]]) torch.Size([1, 1, 3])


`inputs[0].view(1,1,-1)` adds another dim  

## `lstm()`

- **input** of shape `(seq_len, batch, input_size)`: 
      tensor containing the features of the input sequence.

- **h_0** of shape `(num_layers * num_directions, batch, hidden_size)`: tensor
      containing the initial hidden state for each element in the batch.
      If the LSTM is bidirectional, num_directions should be 2, else it should be 1.
- **c_0** of shape `(num_layers * num_directions, batch, hidden_size)`: tensor
      containing the initial cell state for each element in the batch.

If `(h_0, c_0)` is not provided, both **h_0** and **c_0** default to zero.

In [None]:
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3

Args:
- input_size: The number of expected features in the input `x`
- hidden_size: The number of features in the hidden state `h`
- num_layers: Number of recurrent layers. E.g., setting ``num_layers=2`` would mean stacking two LSTMs together to form a `stacked LSTM`, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
- bias: If ``False``, then the layer does not use bias weights `b_ih` and `b_hh`. Default: ``True``
- batch_first: If ``True``, then the input and output tensors are provided as (batch, seq, feature). Default: ``False``
- dropout: If non-zero, introduces a `Dropout` layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to :attr:`dropout`. Default: 0
- bidirectional: If ``True``, becomes a bidirectional LSTM. Default: ``False``

`lstm` inputs:

In [67]:
print(inputs)
print()
print(inputs[0].size())
print()

print(f'seq_len = {inputs[0].view(1, 1, -1).size()[0]}')
print(f'batch  = {inputs[0].view(1, 1, -1).size()[1]}')
print(f'input_size = {inputs[0].view(1, 1, -1).size()[2]}')

[tensor([[-1.8107, -1.9912, -0.2434]]), tensor([[-0.5782,  0.6942, -1.0025]]), tensor([[-0.0356, -1.3917, -0.3523]]), tensor([[1.9390, 0.4224, 0.0572]]), tensor([[-2.1116,  2.1080,  0.3232]])]

torch.Size([1, 3])

seq_len = 1
batch  = 1
input_size = 3


In [65]:
print(hidden)
print()
print(f'num_layers * num_directions = {hidden[0].size()[0]}')
print(f'batch dim = {hidden[0].size()[1]}')
print(f'hidden_size = {hidden[0].size()[2]}')

(tensor([[[-0.8395,  0.1390, -0.0379]]]), tensor([[[-0.0193, -0.7948,  0.9273]]]))

num_layers * num_directions = 1
batch dim = 1
hidden_size = 3


In [47]:
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

print(out.size())
print(hidden)

torch.Size([1, 1, 3])
(tensor([[[-0.0870,  0.0206, -0.1474]]], grad_fn=<StackBackward>), tensor([[[-0.1651,  0.0521, -0.2960]]], grad_fn=<StackBackward>))


In [78]:
# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
print(f'seq_len = {inputs.size()[0]}')
print(f'batch  = {inputs.size()[1]}')
print(f'input_size = {inputs.size()[2]}')
inputs.size()

seq_len = 5
batch  = 1
input_size = 3


torch.Size([5, 1, 3])

In [84]:
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(out.size())
print()
print(hidden)
print()
print(hidden[0].size())
print(hidden[1].size())

tensor([[[ 0.4328, -0.0734, -0.2011]],

        [[ 0.2481,  0.0485, -0.0943]],

        [[ 0.1556,  0.0235, -0.0594]],

        [[-0.0202, -0.0036, -0.0719]],

        [[-0.0756, -0.0395, -0.1422]]], grad_fn=<StackBackward>)
torch.Size([5, 1, 3])

(tensor([[[-0.0756, -0.0395, -0.1422]]], grad_fn=<StackBackward>), tensor([[[-0.1334, -0.0800, -0.2875]]], grad_fn=<StackBackward>))

torch.Size([1, 1, 3])
torch.Size([1, 1, 3])


Example: An LSTM for Part-of-Speech Tagging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this section, we will use an LSTM to get part of speech tags. We will
not use Viterbi or Forward-Backward or anything like that, but as a
(challenging) exercise to the reader, think about how Viterbi could be
used after you have seen what is going on.

The model is as follows: let our input sentence be
$w_1, \dots, w_M$, where $w_i \in V$, our vocab. Also, let
$T$ be our tag set, and $y_i$ the tag of word $w_i$.
Denote our prediction of the tag of word $w_i$ by
$\hat{y}_i$.

This is a structure prediction, model, where our output is a sequence
$\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $h_i$. Also, assign each tag a
unique index (like how we had word\_to\_ix in the word embeddings
section). Then our prediction rule for $\hat{y}_i$ is

\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}

That is, take the log softmax of the affine map of the hidden state,
and the predicted tag is the tag that has the maximum value in this
vector. Note this implies immediately that the dimensionality of the
target space of $A$ is $|T|$.


Prepare data:



In [106]:
def prepare_sequence(seq, to_ix):
    # to_idx is a dict mapping of word to index
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

In [87]:
training_data

[(['The', 'dog', 'ate', 'the', 'apple'], ['DET', 'NN', 'V', 'DET', 'NN']),
 (['Everybody', 'read', 'that', 'book'], ['NN', 'V', 'DET', 'NN'])]

In [145]:
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
print(tag_to_ix)

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
{'DET': 0, 'NN': 1, 'V': 2}


Note that EMBEDDING_DIM = HIDDEN_DIM

Create the model:



In [170]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        #print(self.word_embeddings)
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        #print(sentence.size())
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

#### LSTM Inputs

Quick look again at the training data and the converted inputs

In [101]:
training_data

[(['The', 'dog', 'ate', 'the', 'apple'], ['DET', 'NN', 'V', 'DET', 'NN']),
 (['Everybody', 'read', 'that', 'book'], ['NN', 'V', 'DET', 'NN'])]

In [108]:
inputs1 = prepare_sequence(training_data[0][0], word_to_ix)
training_data[0][0], inputs1

(['The', 'dog', 'ate', 'the', 'apple'], tensor([0, 1, 2, 3, 4]))

In [109]:
inputs2 = prepare_sequence(training_data[1][0], word_to_ix)
training_data[1][0], inputs2

(['Everybody', 'read', 'that', 'book'], tensor([5, 6, 7, 8]))

What do the embeddings look like?

In [118]:
len(word_to_ix)

9

In [120]:
print(f'sentence is {training_data[0][0]}')
print(f'tagset is {training_data[0][1]}')
print()

vocab_size=len(word_to_ix)
tagset_size=len(tag_to_ix)


word_embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=EMBEDDING_DIM)

word_embeddings(inputs1), word_embeddings(inputs1).size()

sentence is ['The', 'dog', 'ate', 'the', 'apple']
tagset is ['DET', 'NN', 'V', 'DET', 'NN']



(tensor([[ 0.2381, -1.8574, -2.3299],
         [ 0.2288, -1.1299,  0.3972],
         [-1.3283,  0.0452,  1.1780],
         [ 0.9890, -0.2442, -1.4545],
         [ 0.3249, -1.7692,  0.0739]], grad_fn=<EmbeddingBackward>),
 torch.Size([5, 3]))

So we see that each word is mapped to a tensor of 3 floats, where each float represents the weight of a tag

What is the input to the `lstm` function itself?

In [132]:
print(inputs1.size())
print(word_embeddings(inputs1).size())
word_embeddings(inputs1).view(len(inputs1), 1, -1), word_embeddings(inputs1).view(len(inputs1), 1, -1).size()

torch.Size([5])
torch.Size([5, 3])


(tensor([[[ 0.2381, -1.8574, -2.3299]],
 
         [[ 0.2288, -1.1299,  0.3972]],
 
         [[-1.3283,  0.0452,  1.1780]],
 
         [[ 0.9890, -0.2442, -1.4545]],
 
         [[ 0.3249, -1.7692,  0.0739]]], grad_fn=<ViewBackward>),
 torch.Size([5, 1, 3]))

Remember `lstm` takes inputs of shape: 
- seq_len,
- batch,
- input_size

So here an extra dimension was added in the middle, which is the batch size. 
- 5 words in sequence, 1 batch, input embedding size = 3

#### LSTM Outputs

In [142]:
inputs1, len(inputs1)

(tensor([0, 1, 2, 3, 4]), 5)

In [141]:
embeds.view(len(inputs1), 1, -1)[0], embeds.view(len(inputs1), 1, -1)[1]

(tensor([[-0.4898,  0.1734,  0.2236]], grad_fn=<SelectBackward>),
 tensor([[-1.8068, -0.2216, -0.1605]], grad_fn=<SelectBackward>))

In [161]:
HIDDEN_DIM, EMBEDDING_DIM

(6, 6)

In [167]:
embeds.view(len(inputs1), 1, -1).size(), inputs1.size()

(torch.Size([5, 1, 6]), torch.Size([5]))

In [385]:
# word_embeddings = nn.Embedding(vocab_size, embedding_dim)
# self.lstm = nn.LSTM(embedding_dim, hidden_dim)
# embeds = self.word_embeddings(sentence)
# lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))

vocab_size=len(word_to_ix)
tagset_size=len(tag_to_ix)
word_embeddings = nn.Embedding(vocab_size, EMBEDDING_DIM)
embeds = word_embeddings(inputs1)

lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM)

lstm_out, _ = lstm(embeds.view(len(inputs1), 1, -1))
inputs1, lstm_out, lstm_out.size(), _

(tensor([0, 1, 2, 3, 4]),
 tensor([[[ 0.3158, -0.1654, -0.0170, -0.1158, -0.1461, -0.0255]],
 
         [[ 0.0227, -0.0561,  0.0953, -0.2123,  0.1872,  0.2304]],
 
         [[ 0.0487, -0.0696,  0.1490, -0.1778,  0.0638,  0.2308]],
 
         [[ 0.0413, -0.1821,  0.0544,  0.2306, -0.1145,  0.1279]],
 
         [[ 0.2577, -0.1178,  0.0208,  0.0358, -0.2629,  0.2879]]],
        grad_fn=<StackBackward>),
 torch.Size([5, 1, 6]),
 (tensor([[[ 0.2577, -0.1178,  0.0208,  0.0358, -0.2629,  0.2879]]],
         grad_fn=<StackBackward>),
  tensor([[[ 0.5504, -0.1857,  0.0402,  0.0669, -0.6002,  0.8489]]],
         grad_fn=<StackBackward>)))

Output size from the lstm after passing the sentence through is (5,1,6)
- input indexes size: (5)
- emb_input size: (5,1,6)
- output size: ((5,1,6)

### Pass to linear layer

In [175]:
hidden2tag = nn.Linear(HIDDEN_DIM, tagset_size)
tag_space = hidden2tag(lstm_out.view(len(inputs1), -1))
tag_scores = F.log_softmax(tag_space, dim=1)
tag_scores, tag_scores.size()

(tensor([[-0.8272, -1.1052, -1.4629],
         [-0.9013, -1.0333, -1.4350],
         [-0.9042, -1.0531, -1.4012],
         [-0.8687, -1.0910, -1.4079],
         [-0.8381, -1.1143, -1.4299]], grad_fn=<LogSoftmaxBackward>),
 torch.Size([5, 3]))

## Train the Entire model:

In [168]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, vocab_size=len(word_to_ix), tagset_size=len(tag_to_ix))

Embedding(9, 6)


In [191]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "the dog ate the apple".  i,j corresponds to score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1
    # since 0 is index of the maximum value of row 1,
    # 1 is the index of maximum value of row 2, etc.
    # Which is DET NOUN VERB DET NOUN, the correct sequence!
    print(tag_scores, tag_scores.size())

tensor([[-1.0121e-02, -4.6538e+00, -7.5149e+00],
        [-4.6781e+00, -4.0647e-02, -3.4889e+00],
        [-5.1345e+00, -3.7689e+00, -2.9397e-02],
        [-1.5419e-02, -4.4585e+00, -5.5937e+00],
        [-6.0074e+00, -3.7719e-03, -6.6419e+00]])
tensor([[-5.3158e-03, -5.2923e+00, -8.2111e+00],
        [-5.5502e+00, -1.9739e-02, -4.1567e+00],
        [-5.5924e+00, -4.5351e+00, -1.4557e-02],
        [-8.1100e-03, -5.1436e+00, -6.1011e+00],
        [-6.7140e+00, -1.8006e-03, -7.4437e+00]]) torch.Size([5, 3])


In [185]:
tag_scores[0]

tensor([-0.8272, -1.1052, -1.4629], grad_fn=<SelectBackward>)

In [190]:
import numpy as np
for w in tag_scores:
    #print(w)
    print(np.argmax(w.detach()))

tensor(0)
tensor(0)
tensor(0)
tensor(0)
tensor(0)


Exercise: Augmenting the LSTM part-of-speech tagger with character-level features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the example above, each word had an embedding, which served as the
inputs to our sequence model. Let's augment the word embeddings with a
representation derived from the characters of the word. We expect that
this should help significantly, since character-level information like
affixes have a large bearing on part-of-speech. For example, words with
the affix *-ly* are almost always tagged as adverbs in English.

To do this, let $c_w$ be the character-level representation of
word $w$. Let $x_w$ be the word embedding as before. Then
the input to our sequence model is the concatenation of $x_w$ and
$c_w$. So if $x_w$ has dimension 5, and $c_w$
dimension 3, then our LSTM should accept an input of dimension 8.

To get the character level representation, do an LSTM over the
characters of a word, and let $c_w$ be the final hidden state of
this LSTM. Hints:

* There are going to be two LSTM's in your new model.
  The original one that outputs POS tag scores, and the new one that
  outputs a character-level representation of each word.
* To do a sequence model over characters, you will have to embed characters.
  The character embeddings will be the input to the character LSTM.




# Char-LSTM
### Steps
1. Get character vocab
2. Get char-embeddings (c_w)   [x_w is the word embedding]
3. Input = concat(x_W, c_w)
4. create char-lstm

Target is still [part of speech]

1. Char vocab

In [200]:
def prepare_char_sequence(seq, to_ix):
    # to_idx is a dict mapping of char to index
    idxs = [to_ix[c] for c in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

In [201]:
char_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        for letter in word:
            if letter not in char_to_ix:
                char_to_ix[letter] = len(char_to_ix)
                
print(char_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
print(tag_to_ix)

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'T': 0, 'h': 1, 'e': 2, 'd': 3, 'o': 4, 'g': 5, 'a': 6, 't': 7, 'p': 8, 'l': 9, 'E': 10, 'v': 11, 'r': 12, 'y': 13, 'b': 14, 'k': 15}
{'DET': 0, 'NN': 1, 'V': 2}


In [207]:
training_data[0][0]

['The', 'dog', 'ate', 'the', 'apple']

In [203]:
training_data[0][0][0], prepare_char_sequence(training_data[0][0][0], char_to_ix)

('The', tensor([0, 1, 2]))

Get char representation per word, 1 word is an entire sequence here!!!

In [239]:
word1 = training_data[0][0][0]
c_input = prepare_char_sequence(word1, char_to_ix)
c_input, c_input.size()

(tensor([0, 1, 2]), torch.Size([3]))

Init embedding for this sequence

In [247]:
char_vocab_size = len(char_to_ix)
char_embeddings = nn.Embedding(num_embeddings=char_vocab_size, embedding_dim=EMBEDDING_DIM)
char_embeddings

Embedding(16, 6)

Get actual embedding

In [248]:
char_embeds = char_embeddings(c_input)
char_embeds, char_embeds.size()

(tensor([[-0.2845,  0.5521,  1.3652,  0.9553, -0.5913,  0.0407],
         [ 1.2207, -0.4602,  1.0191,  1.0931,  0.8007,  0.7547],
         [ 0.6940,  1.4390, -0.9423,  1.4651,  0.7460,  0.0379]],
        grad_fn=<EmbeddingBackward>), torch.Size([3, 6]))

Put seq embedding into lstm. First, input dimensions need to match lstm needs

In [255]:
char_embeds_view = char_embeds.view(len(c_input), 1, -1)
print(f'seq_len = {char_embeds_view.size()[0]}')
print(f'batch  = {char_embeds_view.size()[1]}')
print(f'each input_size = {char_embeds_view.size()[2]}')
char_embeds_view.size()

seq_len = 3
batch  = 1
each input_size = 6


torch.Size([3, 1, 6])

And now put in lstm

In [266]:
char_lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM)
char_lstm_out, _ = char_lstm(char_embeds.view(len(c_input), 1, -1))
char_lstm_out, char_lstm_out.size()

(tensor([[[-0.0493, -0.0139, -0.1389, -0.0993,  0.1474,  0.0917]],
 
         [[ 0.0563,  0.0796, -0.1367, -0.0643,  0.0982,  0.1793]],
 
         [[ 0.1091,  0.0565, -0.0305, -0.1263,  0.0511,  0.1545]]],
        grad_fn=<StackBackward>), torch.Size([3, 1, 6]))

And now concat to that word's embedding

In [267]:
print(f'Input sentence: {inputs1}')
vocab_size=len(word_to_ix)
tagset_size=len(tag_to_ix)
word_embeddings = nn.Embedding(vocab_size, EMBEDDING_DIM)
embeds = word_embeddings(inputs1)
lstm = nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM)
lstm_out, _ = lstm(embeds.view(len(inputs1), 1, -1))
lstm_out, lstm_out.size()

Input sentence: tensor([0, 1, 2, 3, 4])


(tensor([[[ 0.1641,  0.0171,  0.0625, -0.1705, -0.0794,  0.1250]],
 
         [[ 0.0121,  0.0100,  0.0601, -0.0949, -0.1124,  0.1637]],
 
         [[ 0.0457,  0.0562,  0.0252,  0.0169, -0.0844,  0.0319]],
 
         [[ 0.3130,  0.1223,  0.1611, -0.2087, -0.1743,  0.0638]],
 
         [[-0.0072, -0.0025,  0.0291, -0.0332, -0.0922,  0.1330]]],
        grad_fn=<StackBackward>), torch.Size([5, 1, 6]))

In [273]:
print(lstm_out[0])
print()
print(char_lstm_out)
print()
print(f'word embedding size is: {lstm_out[0].size()}')
print(f'char embedding size for that seq (word) is: {char_lstm_out.size()}')

tensor([[ 0.1641,  0.0171,  0.0625, -0.1705, -0.0794,  0.1250]],
       grad_fn=<SelectBackward>)

tensor([[[-0.0493, -0.0139, -0.1389, -0.0993,  0.1474,  0.0917]],

        [[ 0.0563,  0.0796, -0.1367, -0.0643,  0.0982,  0.1793]],

        [[ 0.1091,  0.0565, -0.0305, -0.1263,  0.0511,  0.1545]]],
       grad_fn=<StackBackward>)

word embedding size is: torch.Size([1, 6])
char embedding size for that seq (word) is: torch.Size([3, 1, 6])


Pop an extra dim in the word output embedding and we're able to concatenate!

In [288]:
fin_out = torch.cat((lstm_out[0].view(1,1,6), char_lstm_out), dim=0)
fin_out, fin_out.size()

(tensor([[[ 0.1641,  0.0171,  0.0625, -0.1705, -0.0794,  0.1250]],
 
         [[-0.0493, -0.0139, -0.1389, -0.0993,  0.1474,  0.0917]],
 
         [[ 0.0563,  0.0796, -0.1367, -0.0643,  0.0982,  0.1793]],
 
         [[ 0.1091,  0.0565, -0.0305, -0.1263,  0.0511,  0.1545]]],
        grad_fn=<CatBackward>), torch.Size([4, 1, 6]))

In [290]:
fin_out = torch.cat((lstm_out, char_lstm_out), dim=0)
fin_out, fin_out.size()

(tensor([[[ 0.1641,  0.0171,  0.0625, -0.1705, -0.0794,  0.1250]],
 
         [[ 0.0121,  0.0100,  0.0601, -0.0949, -0.1124,  0.1637]],
 
         [[ 0.0457,  0.0562,  0.0252,  0.0169, -0.0844,  0.0319]],
 
         [[ 0.3130,  0.1223,  0.1611, -0.2087, -0.1743,  0.0638]],
 
         [[-0.0072, -0.0025,  0.0291, -0.0332, -0.0922,  0.1330]],
 
         [[-0.0493, -0.0139, -0.1389, -0.0993,  0.1474,  0.0917]],
 
         [[ 0.0563,  0.0796, -0.1367, -0.0643,  0.0982,  0.1793]],
 
         [[ 0.1091,  0.0565, -0.0305, -0.1263,  0.0511,  0.1545]]],
        grad_fn=<CatBackward>), torch.Size([8, 1, 6]))

In [502]:
class CharLSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, char_vocab_size, tagset_size):
        super(CharLSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.char_embeddings = nn.Embedding(char_vocab_size, embedding_dim)
        
        self.char_lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.lstm = nn.LSTM(embedding_dim*2, hidden_dim*2)
        self.hidden2tag = nn.Linear(hidden_dim*2, tagset_size)
        
    def forward(self, sentence, sentence_ix):
        embeds = self.word_embeddings(sentence_ix)
        
        for i,word in enumerate(sentence):
            char_ix = prepare_char_sequence(word, char_to_ix)
            char_embeds=self.char_embeddings(char_ix)
            char_lstm_out, (h_n, c_n) = self.char_lstm(char_embeds.view(len(word),1,-1))
            
            # Now the encoded word has the word embedding and a generated char embedding
            # TODO: Should I pass the hidden state or the cell state? I think cell state
            encoded_word = torch.cat((embeds[i].view(1,1,-1), c_n), dim=2)
            #print(encoded_word.size())
            
            # Add the encoded word to the final embeddings for the last lstm
            if i == 0: fin_embeds = encoded_word
            else: fin_embeds = torch.cat((fin_embeds, encoded_word))
                
        #print(f'final encoded embeds size is: {fin_embeds.size()}')
        lstm_out, (h_n, c_n) = self.lstm(fin_embeds)
        
        #print(f'final output size is: {lstm_out.size()}')
        
        # Send the lstm outputs through
        #print(f'linear input is of size: {lstm_out.view(len(sentence), -1).size()}')
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        
        return tag_scores  

In [497]:
model = CharLSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(char_to_ix), len(tag_to_ix))

In [501]:
# for sentence, tags in training_data:
#     # Step 1. Remember that Pytorch accumulates gradients.
#     # We need to clear them out before each instance
#     model.zero_grad()

#     # Step 2. Get our inputs ready for the network, that is, turn them into
#     # Tensors of word indices.
#     sentence_ix = prepare_sequence(sentence, word_to_ix)
#     targets = prepare_sequence(tags, tag_to_ix)

#     # Step 3. Run our forward pass.
#     out = model(sentence, sentence_ix)
#     print(out, out.size())
#     break

In [499]:
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(training_data[0][0], sentence_ix)
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
#         sentence_in = prepare_sequence(sentence, word_to_ix)
#         targets = prepare_sequence(tags, tag_to_ix)

#         # Step 3. Run our forward pass.
#         tag_scores = model(sentence_in)
        sentence_ix = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence, sentence_ix)
        
        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(sentence, sentence_ix)

    # The sentence is "the dog ate the apple".  i,j corresponds to score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1
    # since 0 is index of the maximum value of row 1,
    # 1 is the index of maximum value of row 2, etc.
    # Which is DET NOUN VERB DET NOUN, the correct sequence!
    print(tag_scores, tag_scores.size())

tensor([[-1.0467, -1.2343, -1.0276],
        [-1.0468, -1.2271, -1.0333],
        [-1.0245, -1.2663, -1.0240],
        [-1.0492, -1.2778, -0.9912],
        [-1.0566, -1.2747, -0.9866]])
tensor([[-6.7249, -0.0219, -3.8870],
        [-4.4208, -3.8466, -0.0339],
        [-0.0141, -5.0470, -4.8858],
        [-5.4596, -0.0092, -5.3204]]) torch.Size([4, 3])


In [495]:
import numpy as np
print(sentence)

ix_to_tag = {v: k for k, v in tag_to_ix.items()}
for w in tag_scores:
    #print(w)
    print(ix_to_tag[np.argmax(w.detach()).item()])

['Everybody', 'read', 'that', 'book']
NN
V
DET
NN
