In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

Our model will map a sparse bag-of-words (BoW) representation to log
probabilities over labels. We assign each word in the vocab an index.
For example, say our entire vocab is two words, "hello" and "world",
with indices 0 and 1 respectively. The BoW vector for the sentence
"hello hello hello hello" is

\begin{align}\left[ 4, 0 \right]\end{align}

For "hello world world hello", it is

\begin{align}\left[ 2, 2 \right]\end{align}

etc. In general, it is

\begin{align}\left[ \text{Count}(\text{hello}), \text{Count}(\text{world}) \right]\end{align}

Denote this BoW vector as $x$. The output of our network is:

\begin{align}\log \text{Softmax}(Ax + b)\end{align}

That is, we pass the input through an affine map, with weight matrix
$A$ and bias vector $b$, and then apply log softmax.
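
To make the formula concrete, here is a minimal sketch in plain Python
that builds the BoW vector for the two-word vocab above and evaluates
$\log \text{Softmax}(Ax + b)$ by hand. The values of `A` and `b` are
hypothetical, picked only for illustration, and the helper names
`bow_vector` and `log_softmax` are not part of the tutorial code.

```python
import math
from collections import Counter

# Toy two-word vocabulary from the text above.
word_to_ix = {"hello": 0, "world": 1}


def bow_vector(sentence, word_to_ix):
    """Return per-word counts ordered by vocab index (unknown words are ignored)."""
    counts = Counter(sentence.split())
    vec = [0] * len(word_to_ix)
    for word, ix in word_to_ix.items():
        vec[ix] = counts[word]
    return vec


def log_softmax(scores):
    """Numerically stable log softmax of a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


# Hypothetical parameters, chosen only for illustration (not learned values).
A = [[0.5, -0.5],   # row of weights for label 0
     [-0.5, 0.5]]   # row of weights for label 1
b = [0.0, 0.0]

x = bow_vector("hello hello hello hello", word_to_ix)

# Affine map Ax + b: one score per label.
scores = [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]
log_probs = log_softmax(scores)

print(x)
print(log_probs)
```

Exponentiating the entries of `log_probs` gives probabilities that sum
to one, which is a quick sanity check on any log softmax output.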

In [2]:
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

In [3]:
# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the bag-of-words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

{'la': 4, 'good': 19, 'comer': 2, 'creo': 10, 'Give': 6, 'at': 22, 'gusta': 1, 'me': 0, 'Yo': 23, 'en': 3, 'to': 8, 'not': 17, 'lost': 21, 'idea': 15, 'buena': 14, 'que': 11, 'it': 7, 'cafeteria': 5, 'si': 24, 'get': 20, 'a': 18, 'sea': 12, 'una': 13, 'is': 16, 'No': 9, 'on': 25}


In [4]:
VOCAB_SIZE

26

In [5]:
class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # Calls the init function of nn.Module.  Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.tensor([label_to_ix[label]], dtype=torch.long)


model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# The model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a submodule to an attribute of a module in its __init__
# function, which was done with the line
# self.linear = nn.Linear(...)
# nn.Module registers it, so your module (in this case, BoWClassifier)
# keeps track of the nn.Linear's parameters
for param in model.parameters():
    print(param)

# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print(log_probs)

Parameter containing:
tensor([[ 0.0350, -0.0104, -0.0060, -0.1439,  0.0208,  0.0688,  0.0875,
          0.1016, -0.0837,  0.0521, -0.1487,  0.1121, -0.0628,  0.1276,
          0.0418,  0.0903,  0.1391,  0.0010,  0.1580,  0.0348, -0.0716,
          0.1603, -0.1323, -0.1786,  0.0727, -0.0489],
        [-0.0816, -0.0938,  0.1653, -0.0293,  0.1641, -0.0407, -0.1666,
         -0.0695, -0.1416,  0.0866, -0.1788, -0.1954,  0.0416,  0.1858,
          0.0629, -0.1355,  0.1906, -0.1136,  0.1900,  0.0569,  0.1498,
          0.1036,  0.1350, -0.1901, -0.0192, -0.0135]])
Parameter containing:
tensor([ 0.0994,  0.1434])
tensor([[-0.7783, -0.6147]])
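
If it is unclear which printed tensor is A and which is b, one option
is to iterate over `named_parameters()` instead of `parameters()`. The
sketch below uses a hypothetical stand-in class, `TinyClassifier`, with
the same structure and sizes as the model above (26 words, 2 labels),
so that it runs on its own.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for BoWClassifier: one nn.Linear stored under
# the attribute name "linear", followed by log softmax.
class TinyClassifier(nn.Module):
    def __init__(self, num_labels, vocab_size):
        super().__init__()
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        return torch.log_softmax(self.linear(bow_vec), dim=1)


model = TinyClassifier(num_labels=2, vocab_size=26)

# named_parameters() yields (name, parameter) pairs; the name is built
# from the attribute name and the parameter's name inside nn.Linear.
shapes = {name: tuple(param.shape) for name, param in model.named_parameters()}
for name, shape in shapes.items():
    print(name, shape)
```

The weight has one row per label and one column per vocab word, which
is why the input dimension of the linear layer is the vocab size and
the output dimension is the number of labels.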


In [6]:
# model.forward(bow_vector)

Which of the above values corresponds to the log probability of ENGLISH,
and which to SPANISH? We never defined it, but we need to if we want to
train the thing.

In [7]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

So let's train! To do this, we pass instances through the model to get
log probabilities, compute a loss function, compute the gradient of the
loss function, and then update the parameters with a gradient step. Loss
functions are provided by Torch in the nn package. nn.NLLLoss() is the
negative log likelihood loss we want. Torch also provides optimizers
in torch.optim. Here, we will just use SGD.
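
To illustrate what "a gradient step" means, the sketch below checks
that one step of plain SGD (no momentum or weight decay) equals
subtracting the learning rate times the gradient from each parameter.
The layer sizes, input, and target are hypothetical and chosen only to
keep the example small.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical tiny model: 3 input features, 2 labels.
layer = nn.Linear(3, 2)
lr = 0.1
optimizer = optim.SGD(layer.parameters(), lr=lr)

x = torch.tensor([[1.0, 0.0, 2.0]])
target = torch.tensor([1])

# Keep a copy of the weights before the update.
before = layer.weight.detach().clone()

# Forward pass, loss, and gradient computation.
optimizer.zero_grad()
log_probs = torch.log_softmax(layer(x), dim=1)
loss = nn.NLLLoss()(log_probs, target)
loss.backward()

# What plain SGD should produce: w <- w - lr * grad.
expected = before - lr * layer.weight.grad

optimizer.step()

matches = torch.allclose(layer.weight.detach(), expected)
print(matches)
```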

Note that the *input* to NLLLoss is a vector of log probabilities, and a
target label. It doesn't compute the log probabilities for us. This is
why the last layer of our network is log softmax. The loss function
nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log
softmax for you.
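
The relationship between the two losses can be checked directly. The
sketch below first applies NLLLoss to the log probabilities printed
earlier for the first training sentence, with SPANISH as the target,
and then compares CrossEntropyLoss on raw scores against NLLLoss on
their log softmax. The `logits` values are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Log probabilities printed above for the first training sentence,
# with the label mapping SPANISH -> 0, ENGLISH -> 1.
log_probs = torch.tensor([[-0.7783, -0.6147]])
target = torch.tensor([0])  # SPANISH

# NLLLoss picks out the log probability of the target class and negates it.
nll = nn.NLLLoss()(log_probs, target)
print(nll.item())

# CrossEntropyLoss expects raw scores and applies log softmax itself.
logits = torch.tensor([[1.5, -0.3]])  # hypothetical scores, for illustration only
ce = nn.CrossEntropyLoss()(logits, target)
nll_of_log_softmax = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

same = torch.allclose(ce, nll_of_log_softmax)
print(same)
```

If a model already ends in log softmax, as ours does, passing its
output to CrossEntropyLoss would apply log softmax a second time, so
NLLLoss is the matching choice here.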

In [8]:
# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])
print()

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 epochs is far more than you would use on a real dataset, but real datasets
# have more than four instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BoW vector, and wrap the target in a tensor as
        # an integer. For example, if the target is SPANISH, then we wrap
        # the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# The weight for "creo" in the Spanish row goes up, and in the English row it goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([[-0.5142, -0.9113]])
tensor([[-0.5920, -0.8057]])
tensor([-0.1487, -0.1788])

tensor([[-0.1144, -2.2251]])
tensor([[-2.6406, -0.0740]])
tensor([ 0.2572, -0.5847])


We got the right answer! After training, the log probability for
Spanish is much higher for the first test sentence (-0.1144 versus
-2.2251), and the log probability for English is much higher for the
second (-0.0740 versus -2.6406), as it should be.
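
To turn these outputs into predicted labels, we can take the index of
the largest log probability and map it back through the inverse of
`label_to_ix`. The sketch below does this in plain Python using the
values printed after training; `ix_to_label` and `after_training` are
names introduced here for illustration.

```python
import math

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
# Inverse mapping, from index back to label name.
ix_to_label = {ix: label for label, ix in label_to_ix.items()}

# Log probabilities printed after training for the two test sentences.
after_training = [[-0.1144, -2.2251], [-2.6406, -0.0740]]

predictions = []
for log_probs in after_training:
    # The predicted label is the one with the largest log probability.
    best_ix = max(range(len(log_probs)), key=lambda i: log_probs[i])
    # Exponentiate to recover ordinary probabilities.
    probs = [round(math.exp(lp), 3) for lp in log_probs]
    predictions.append(ix_to_label[best_ix])
    print(ix_to_label[best_ix], probs)
```

Exponentiating shows the model assigns roughly 0.89 probability to
Spanish for the first sentence and roughly 0.93 to English for the
second.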

Now you see how to make a PyTorch component, pass some data through it
and do gradient updates. We are ready to dig deeper into what deep NLP
has to offer.

In [9]:
next(model.parameters())

Parameter containing:
tensor([[-0.1031,  0.4798,  0.4842,  0.3463,  0.5111,  0.5591, -0.5409,
         -0.7497, -0.9350,  0.2352,  0.2572,  0.5180,  0.1203,  0.5335,
          0.4477,  0.2734, -0.0837, -0.2218, -0.0649, -0.1881, -0.2944,
         -0.0625, -0.3551, -0.1786,  0.0727, -0.0489],
        [ 0.0566, -0.5840, -0.3250, -0.5195, -0.3262, -0.5309,  0.4618,
          0.7817,  0.7096, -0.0965, -0.5847, -0.6014, -0.1415, -0.2202,
         -0.3431, -0.3186,  0.4134,  0.1093,  0.4128,  0.2797,  0.3726,
          0.3265,  0.3578, -0.1901, -0.0192, -0.0135]])

In [10]:
p = list(model.parameters())
p

[Parameter containing:
 tensor([[-0.1031,  0.4798,  0.4842,  0.3463,  0.5111,  0.5591, -0.5409,
          -0.7497, -0.9350,  0.2352,  0.2572,  0.5180,  0.1203,  0.5335,
           0.4477,  0.2734, -0.0837, -0.2218, -0.0649, -0.1881, -0.2944,
          -0.0625, -0.3551, -0.1786,  0.0727, -0.0489],
         [ 0.0566, -0.5840, -0.3250, -0.5195, -0.3262, -0.5309,  0.4618,
           0.7817,  0.7096, -0.0965, -0.5847, -0.6014, -0.1415, -0.2202,
          -0.3431, -0.3186,  0.4134,  0.1093,  0.4128,  0.2797,  0.3726,
           0.3265,  0.3578, -0.1901, -0.0192, -0.0135]]), Parameter containing:
 tensor([ 0.1443,  0.0984])]