# Deep learning tutorial with pytorch

http://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html

## This tutorial stars out with introducing affine maps
f(x) = Ax + b

In [1]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
torch.manual_seed(1)

<torch._C.Generator at 0x1104c2e10>

In [3]:
lin = nn.Linear(5, 3)  # maps from R^5 to R^3, paramters A, b
# data is 2x5. A maps from 5 to 3... can we "map" data under A?
data = autograd.Variable(torch.randn(2, 5))
print(lin(data))  # yes

Variable containing:
 0.1755 -0.3268 -0.5069
-0.6602  0.2260  0.1089
[torch.FloatTensor of size 2x3]



## Non-linearities

In [4]:
# In pytorch, most non-linearities are in torch.functional (we have it imported as F)
# Note that non-linearites typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = autograd.Variable(torch.randn(2, 2))
print(data)
print(F.relu(data))

Variable containing:
-0.5404 -2.2102
 2.1130 -0.0040
[torch.FloatTensor of size 2x2]

Variable containing:
 0.0000  0.0000
 2.1130  0.0000
[torch.FloatTensor of size 2x2]



## Softmax and probabilities

Let x be a vector of real numbers (positive, negative, whatever, there are no constraints). Then the i’th component of Softmax(x) is

\begin{equation*} \frac{\text{exp}(x_i)}{\sum_j{exp(x_j)}} \end{equation*}

It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.

You could also think of it as just applying an element-wise exponentiation operator to the input to make everything non-negative and then dividing by the normalization constant.

In [5]:
# Softmax is also in torch.nn.functional
data = autograd.Variable(torch.randn(5))
print(data)
print(F.softmax(data, 0))
print(F.softmax(data, 0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, 0))  # theres also log_softmax

Variable containing:
 1.3800
-1.3505
 0.3455
 0.5046
 1.8213
[torch.FloatTensor of size 5]

Variable containing:
 0.2948
 0.0192
 0.1048
 0.1228
 0.4584
[torch.FloatTensor of size 5]

Variable containing:
 1
[torch.FloatTensor of size 1]

Variable containing:
-1.2214
-3.9519
-2.2560
-2.0969
-0.7801
[torch.FloatTensor of size 5]



Let’s write an annotated example of a network that takes in a sparse bag-of-words representation and outputs a probability distribution over two labels: “English” and “Spanish”. This model is just logistic regression.

### Logistic regression bag-of-words classifier

The bag-of-words (BOW) vector will look like this:

`[Count(hello),Count(world)]`

And the network output will be:

`log Softmax(Ax + b)`

In other words, pass the input through an affine map and then do <br>`log Softmax`

In [6]:
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]


In [7]:
# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of Words vectors
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}


In [8]:
class BOWClassifier(nn.Module):  # inherit from nn.Module!
    
    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Dont get confused by syntax,
        # just always do it in an nn.Module
        super(BOWClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here
        
    def forward(self, bow_vec):
        # Pass the input through the linear layer
        # then pass that through Log_softmax.
        # Many Non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), 1)


def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])


model = BOWClassifier(NUM_LABELS, VOCAB_SIZE)

In [9]:
# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the Pytorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(param)

Parameter containing:

Columns 0 to 9 
 0.1194  0.0609 -0.1268  0.1274  0.1191  0.1739 -0.1099 -0.0323 -0.0038  0.0286
 0.1152 -0.1136 -0.1743  0.1427 -0.0291  0.1103  0.0630 -0.1471  0.0394  0.0471

Columns 10 to 19 
-0.1488 -0.1392  0.1067 -0.0460  0.0958  0.0112  0.0644  0.0431  0.0713  0.0972
-0.1313 -0.0931  0.0669  0.0351 -0.0834 -0.0594  0.1796 -0.0363  0.1106  0.0849

Columns 20 to 25 
-0.1816  0.0987 -0.1379 -0.1480  0.0119 -0.0334
-0.1268 -0.1668  0.1882  0.0102  0.1344  0.0406
[torch.FloatTensor of size 2x26]

Parameter containing:
 0.0631
 0.1465
[torch.FloatTensor of size 2]



In [10]:
# To run the model, pass in a BoW vector, but wrapped in an autograd.Variable
sample = data[0]
bow_vector = make_bow_vector(sample[0], word_to_ix)
log_probs = model(autograd.Variable(bow_vector))
print(log_probs)

Variable containing:
-0.5378 -0.8771
[torch.FloatTensor of size 1x2]



In [11]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

Now train the model.

Note that the input to NLLLoss is a vector of log probabilities, and a target label. It doesn’t compute the log probabilities for us. This is why the last layer of our network is log softmax. The loss function nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log softmax for you.

In [12]:
# run on test data before training to get a 'before and after' 
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)


# Print the matrix column corresponding to "creo"
print('matrix column corresponding to "creo" from test w/o training:')
print(next(model.parameters())[:, word_to_ix["creo"]])


loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you pass over the training data several times.
# 100 (used here) is much bigger than on a real data set, but real datasets have more than
# two instances. Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()
        
        # Step 2. Make our BOW vector and also wrap the target in a 
        # Variable as an integer. For example, if the target is SPANISH, then 
        # wrap the integer 0. The Loss Function then knows that the 0th 
        # element of the log probabilities is the log probability 
        # that corresponds to SPANISH
        bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
        target = autograd.Variable(make_target(label, label_to_ix))
        
        # Step 3. Run the forward pass
        log_probs = model(bow_vec)
        
        # Step 4. Compute the loss, gradients, and update the parameters by 
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

        
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)


# Index corresponding to Spanish goes up, English goes down!
print("after training:")
print(next(model.parameters())[:, word_to_ix["creo"]])

Variable containing:
-0.9297 -0.5020
[torch.FloatTensor of size 1x2]

Variable containing:
-0.6388 -0.7506
[torch.FloatTensor of size 1x2]

matrix column corresponding to "creo" from test w/o training:
Variable containing:
-0.1488
-0.1313
[torch.FloatTensor of size 2]

Variable containing:
-0.2093 -1.6669
[torch.FloatTensor of size 1x2]

Variable containing:
-2.5330 -0.0828
[torch.FloatTensor of size 1x2]

after training:
Variable containing:
 0.2803
-0.5605
[torch.FloatTensor of size 2]



In [16]:
print(next(model.parameters())[:, word_to_ix["Give"]])

Variable containing:
-0.6310
 0.5841
[torch.FloatTensor of size 2]



## Word embeddings: Encoding lexical semantics

**word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand. **

In [1]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [2]:
torch.manual_seed(1)

<torch._C.Generator at 0x10c9ffed0>

In [3]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.LongTensor([word_to_ix["hello"]])
hello_embed = embeds(autograd.Variable(lookup_tensor))
print(hello_embed)

Variable containing:
 0.6614  0.2669  0.0617  0.6213 -0.4519
[torch.FloatTensor of size 1x5]



### an example: N-Gram Language Modeling

In [4]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so I can see what they look like
print(trigrams[:3])

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


In [5]:
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

In [6]:
class NGramLanguageModeler(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
    
    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [7]:
for epoch in range(10):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:
        
        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idxs))

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()
        
        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_var)
        
        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, autograd.Variable(
            torch.LongTensor([word_to_ix[target]])))
        
        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()
        
        total_loss += loss.data
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

  del sys.path[0]


[
 516.7384
[torch.FloatTensor of size 1]
, 
 514.2780
[torch.FloatTensor of size 1]
, 
 511.8333
[torch.FloatTensor of size 1]
, 
 509.4035
[torch.FloatTensor of size 1]
, 
 506.9883
[torch.FloatTensor of size 1]
, 
 504.5865
[torch.FloatTensor of size 1]
, 
 502.1965
[torch.FloatTensor of size 1]
, 
 499.8185
[torch.FloatTensor of size 1]
, 
 497.4510
[torch.FloatTensor of size 1]
, 
 495.0930
[torch.FloatTensor of size 1]
]


## Exercise: Computing word embeddings: Continuous Bag-of-Words

The CBOW model is as follows. Given a target word *w$_i$* and an *N* context window on each side, $w_{i−1}$,…,w$_{i−N}$ and w$_{i+1}$,…,w$_{i+N}$, referring to all context words collectively as ***C***, CBOW tries to minimize

\begin{equation*} −\log p(w_i|C)=−\log \text{Softmax}(A(\sum_{w∈C}q_w)+b) \end{equation*}

is the embedding for word *w*.

Implement this model in Pytorch by filling in the class below. Some tips:

*    Think about which parameters you need to define.
*    Make sure you know what shape each operation expects. Use .view() if you need to reshape.


In [8]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


In [9]:
class CBOW(nn.Module):
    
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).sum(dim=0).view((1, -1))
        out = self.linear(embeds)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)


make_context_vector(data[0][0], word_to_ix)  # example

losses = []
loss_function = nn.NLLLoss()
model = CBOW(vocab_size, EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [10]:
for epoch in range(100):
    total_loss = torch.Tensor([0])
    for context, target in data:
        
        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idxs))

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()
        
        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_var)
        
        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, autograd.Variable(
            torch.LongTensor([word_to_ix[target]])))
        
        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()
        
        total_loss += loss.data
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

[
 260.6818
[torch.FloatTensor of size 1]
, 
 257.9831
[torch.FloatTensor of size 1]
, 
 255.3309
[torch.FloatTensor of size 1]
, 
 252.7230
[torch.FloatTensor of size 1]
, 
 250.1575
[torch.FloatTensor of size 1]
, 
 247.6327
[torch.FloatTensor of size 1]
, 
 245.1471
[torch.FloatTensor of size 1]
, 
 242.6995
[torch.FloatTensor of size 1]
, 
 240.2887
[torch.FloatTensor of size 1]
, 
 237.9137
[torch.FloatTensor of size 1]
, 
 235.5737
[torch.FloatTensor of size 1]
, 
 233.2677
[torch.FloatTensor of size 1]
, 
 230.9952
[torch.FloatTensor of size 1]
, 
 228.7553
[torch.FloatTensor of size 1]
, 
 226.5475
[torch.FloatTensor of size 1]
, 
 224.3713
[torch.FloatTensor of size 1]
, 
 222.2261
[torch.FloatTensor of size 1]
, 
 220.1114
[torch.FloatTensor of size 1]
, 
 218.0267
[torch.FloatTensor of size 1]
, 
 215.9717
[torch.FloatTensor of size 1]
, 
 213.9458
[torch.FloatTensor of size 1]
, 
 211.9487
[torch.FloatTensor of size 1]
, 
 209.9799
[torch.FloatTensor of size 1]
, 
 208.0390

In [11]:
#For sanity check, let's see what our model predicts
acc = 0
for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in variables)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idxs))

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_var)
        _,idx = torch.min(log_probs,-1)
print (context, word_to_ix[target], idx.data[0])

['the', 'computer', 'our', 'spells.'] 48 5


In [12]:
loss_function(log_probs, autograd.Variable(torch.LongTensor([word_to_ix[target]]))).data


 1.6186
[torch.FloatTensor of size 1]

## Sequence model and Long-Short Term Memory Networks

A recurrent neural network is a network that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propogate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence, there is a corresponding hidden state _h_$_{t}$, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.

### LSTMs in Pytorch

you could go through the sequence one at a time, in which case the 1st axis will have size 1 also.

Let’s see a quick example.

In [60]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x10c9ffed0>

In [61]:
lstm = nn.LSTM(3, 3)  # input dim is 3, output dim is 3
inputs = [autograd.Variable(torch.randn(1, 3))
          for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state
hidden = (autograd.Variable(torch.randn(1, 1, 3)),
          autograd.Variable(torch.randn((1, 1, 3))))

for i in inputs:
    # step through the sequence one element at a time
    # after each step, hidden contains the hidden state
    out, hidden = lstm(i.view(1, 1, -1), hidden)

In [62]:
print(out)
print(hidden)

Variable containing:
(0 ,.,.) = 
 -0.3600  0.0893  0.0215
[torch.FloatTensor of size 1x1x3]

(Variable containing:
(0 ,.,.) = 
 -0.3600  0.0893  0.0215
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
 -1.1298  0.4467  0.0254
[torch.FloatTensor of size 1x1x3]
)


In [63]:
# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (autograd.Variable(torch.randn(1, 1, 3)), autograd.Variable(
    torch.randn((1, 1, 3))))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print(hidden)

Variable containing:
(0 ,.,.) = 
 -0.0187  0.1713 -0.2944

(1 ,.,.) = 
 -0.3521  0.1026 -0.2971

(2 ,.,.) = 
 -0.3191  0.0781 -0.1957

(3 ,.,.) = 
 -0.1634  0.0941 -0.1637

(4 ,.,.) = 
 -0.3368  0.0959 -0.0538
[torch.FloatTensor of size 5x1x3]

(Variable containing:
(0 ,.,.) = 
 -0.3368  0.0959 -0.0538
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
 -0.9825  0.4715 -0.0633
[torch.FloatTensor of size 1x1x3]
)


In [64]:
print(hidden)

(Variable containing:
(0 ,.,.) = 
 -0.3368  0.0959 -0.0538
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
 -0.9825  0.4715 -0.0633
[torch.FloatTensor of size 1x1x3]
)


### Example: An LSTM for Part-of-Speech Tagging

In [65]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)

training_data = [("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
                 ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])]

word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


#### Create the model:

In [66]:
class LSTMTagger(nn.Module):
    
    def __init__(self, embedding_dim, hidden_dim, vocab_size, target_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        
        # The linear layer that maps from hidden state to space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, target_size)
        self.hidden = self.init_hidden()
        
    def init_hidden(self):
        # Before we've done anything, we dont have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))
    
    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, self.hidden = self.lstm(
            embeds.view(len(sentence), 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

#### Train the model:

In [67]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()
        
        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()
        
        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Variables of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        
        # Step 3. Run our forward pass
        tag_scores = model(sentence_in)
        
        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

        
# See what the scores are after training
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
# The sentence is "the dog ate the apple".  i,j corresponds to score for tag j
#  for word i. The predicted tag is the maximum scoring tag.
# Here, we can see the predicted sequence below is 0 1 2 0 1
# since 0 is index of the maximum value of row 1,
# 1 is the index of maximum value of row 2, etc.
# Which is DET NOUN VERB DET NOUN, the correct sequence!
print(tag_scores)

Variable containing:
-1.1989 -0.9630 -1.1497
-1.2522 -0.9158 -1.1586
-1.2563 -1.0022 -1.0550
-1.1518 -1.1443 -1.0065
-1.1728 -1.0677 -1.0593
[torch.FloatTensor of size 5x3]

Variable containing:
-0.1902 -1.8654 -3.9957
-4.1051 -0.0263 -4.6590
-4.0204 -3.1797 -0.0614
-0.0372 -4.3504 -3.7448
-4.0387 -0.0348 -4.1001
[torch.FloatTensor of size 5x3]



In [81]:
torch.max(tag_scores, 1)

(Variable containing:
 -0.1902
 -0.0263
 -0.0614
 -0.0372
 -0.0348
 [torch.FloatTensor of size 5], Variable containing:
  0
  1
  2
  0
  1
 [torch.LongTensor of size 5])