# Training a model in DyNet
In this tutorial, we will train a bag-of-words model to distinguish negative from positive movie reviews, using the movie review dataset from Cornell [(Pang and Lee 2004)](http://www.cs.cornell.edu/home/llee/papers/cutsent.home.html).

## Simple data pre-processing
First, we load the negative and positive reviews into memory. To keep things small, this code uses only the first sentence of each review (the first line of each file).

In [1]:
import os
import nltk

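def load_data(directory):
    examples = [ ]
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            # Keep only the first line of each review, lowercased and tokenized.
            with open(os.path.join(directory, filename)) as review_file:
                first_line = review_file.readline()
            words = nltk.word_tokenize(first_line.lower())
            examples.append({"id" : filename, "data" : words})
    return examples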

negative_examples = load_data("neg/")
positive_examples = load_data("pos/")

print("num per class: negative " + str(len(negative_examples)) + "; positive " + str(len(positive_examples)))

num per class: negative 1000; positive 1000


Now we split the data into training and development (dev) sets. This dataset has no standard split (prior work uses cross-validation), so we assign each example to the training set with probability 0.8 and to the dev set otherwise.

In [2]:
dev_examples = [ ]
train_examples = [ ]

import random
random.shuffle(negative_examples)
random.shuffle(positive_examples)

for example in negative_examples:
    example["label"] = 0
    randnum = random.random()
    if randnum < 0.8:
        train_examples.append(example)
    else:
        dev_examples.append(example)
        
for example in positive_examples:
    example["label"] = 1
    randnum = random.random()
    if randnum < 0.8:
        train_examples.append(example)
    else:
        dev_examples.append(example)
        
print("lengths: train " + str(len(train_examples)) + "; dev " + str(len(dev_examples)))

lengths: train 1598; dev 402
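Because each example is assigned by an independent random draw, the split sizes vary from run to run (1598/402 above rather than exactly 1600/400). If a reproducible, exactly proportioned split is preferred, one option is to seed a private random generator and cut each shuffled class at 80%. The following is a minimal sketch of that alternative, not the method used in this notebook; `split_class`, the seed value, and the toy data are hypothetical names chosen for illustration.

```python
import random

def split_class(examples, label, train_fraction=0.8, seed=0):
    """Shuffle one class with a fixed seed and cut it at train_fraction."""
    rng = random.Random(seed)      # private generator, so the result is repeatable
    shuffled = list(examples)      # copy, to leave the caller's list untouched
    rng.shuffle(shuffled)
    labeled = [dict(example, label=label) for example in shuffled]
    cut = int(len(labeled) * train_fraction)
    return labeled[:cut], labeled[cut:]

# Toy stand-ins for the loaded reviews (hypothetical data).
toy_negative = [{"id": "neg%d.txt" % i, "data": ["bad"]} for i in range(10)]
toy_positive = [{"id": "pos%d.txt" % i, "data": ["good"]} for i in range(10)]

neg_train, neg_dev = split_class(toy_negative, label=0)
pos_train, pos_dev = split_class(toy_positive, label=1)
toy_train = neg_train + pos_train
toy_dev = neg_dev + pos_dev

print("lengths: train " + str(len(toy_train)) + "; dev " + str(len(toy_dev)))
```

Splitting each class separately also keeps the two labels balanced in both sets, which the per-example draw only achieves on average.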


In [3]:
" ".join(train_examples[0]["data"])

"david schwimmer ( from the television series `` friends `` ) stars as a sensitive ( and slightly neurotic ) single guy who gets more than he expected from the grieving mother ( barbara hershey ) of a classmate he ca n't remember ."

Next we will create a vocabulary list using the words in the training data.

In [4]:
vocab = list(set([word for example in train_examples for word in example["data"]])) + ["UNK"]
len(vocab)

7110

We can verify that it works:

In [5]:
indices = [vocab.index(word) for word in train_examples[0]["data"][:10] if word in vocab]
indices

[6084, 4333, 4396, 5193, 581, 4065, 2184, 1192, 1462, 1192]

In [6]:
[vocab[index] for index in indices]

['david',
 'schwimmer',
 '(',
 'from',
 'the',
 'television',
 'series',
 '``',
 'friends',
 '``']
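Note that `vocab.index(word)` and `word in vocab` both scan the list from the start, so every lookup costs time proportional to the vocabulary size. A dictionary from word to index avoids this. The sketch below shows the idea on a toy vocabulary; `build_index` and `lookup` are hypothetical helpers, not functions used elsewhere in this tutorial.

```python
def build_index(vocab):
    """Map each word to its position in the vocabulary list."""
    return {word: index for index, word in enumerate(vocab)}

def lookup(word_to_index, word, unk_token="UNK"):
    """Return the index of word, falling back to the unknown-word token."""
    return word_to_index.get(word, word_to_index[unk_token])

# Toy vocabulary (hypothetical); the real one is built from the training data.
toy_vocab = ["david", "schwimmer", "friends", "UNK"]
toy_index = build_index(toy_vocab)

# "seinfeld" is not in the toy vocabulary, so it maps to the UNK index.
print([lookup(toy_index, word) for word in ["david", "friends", "seinfeld"]])  # [0, 2, 3]
```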

Great. Now we can start building the model!

## Constructing the ParameterCollection
Our model will be a simple bag-of-words model with a single hidden layer. The model will have the following parameters:

1. A `LookupParameters` object for the words in the data, which is of size len(vocab) x 64.
2. A `Parameters` object for the hidden layer of size 64 x 64.
3. A `Parameters` object for the hidden layer's biases of size 64 x 1.
4. A `Parameters` object for the final layer of size 64 x 2.
5. A `Parameters` object for the final layer's biases of size 2 x 1.
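
As a quick sanity check on these shapes, the total number of trainable values can be worked out by hand. The sketch below assumes a vocabulary of 7110 entries, the size printed earlier (it will differ with another random split); the variable names are illustrative only.

```python
vocab_size = 7110   # len(vocab) from the run above; varies with the random split
hidden_size = 64
num_classes = 2

shapes = {
    "word-embeddings": (vocab_size, hidden_size),
    "hidden-weights": (hidden_size, hidden_size),
    "hidden-biases": (hidden_size, 1),
    "final-weights": (hidden_size, num_classes),
    "final-biases": (num_classes, 1),
}

# Number of values in each parameter, and the overall total.
sizes = {name: rows * cols for name, (rows, cols) in shapes.items()}
total = sum(sizes.values())

print(total)                                # 459330
print(sizes["word-embeddings"] / total)     # roughly 0.99
```

Nearly all of the parameters sit in the embedding table, which is typical for small bag-of-words models.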

We will show off a few things in this section:

1. Naming parameters. This will make things a bit easier to debug if necessary.
2. Initializing parameters with an explicit initializer (here `NormalInitializer`), one of the [`PyInitializer`](http://dynet.readthedocs.io/en/latest/python_ref.html#parameters-initializers) classes.

In [7]:
import _dynet as dy
dyparams = dy.DynetParams()
dyparams.set_mem(2048)
dyparams.set_autobatch(True)
dyparams.init()

pc = dy.ParameterCollection()

HIDDEN_SIZE = 64
NUM_CLASSES = 2
word_embeddings = pc.add_lookup_parameters((len(vocab), HIDDEN_SIZE), name="word-embeddings")
hidden_weights = pc.add_parameters((HIDDEN_SIZE, HIDDEN_SIZE), name="hidden-weights", init=dy.NormalInitializer())
hidden_biases = pc.add_parameters((HIDDEN_SIZE, 1), name="hidden-biases", init=dy.NormalInitializer())
final_weights = pc.add_parameters((HIDDEN_SIZE, NUM_CLASSES), name = "final-weights")
final_biases = pc.add_parameters((NUM_CLASSES, 1), name = "final-biases")

[(param.name(), param.shape()) for param in pc.lookup_parameters_list() + pc.parameters_list()]

[('/word-embeddings', (7110, 64)),
 ('/hidden-weights', (64, 64)),
 ('/hidden-biases', (64, 1)),
 ('/final-weights', (64, 2)),
 ('/final-biases', (2, 1))]

Next, we implement a function that takes a single example and computes a score for each of the two classes. It will be used later for both training and evaluation.

In [8]:
def get_scores(example):
    word_vectors = [ ]
    
    # First, get the word embeddings for every word in the data.
    for word in example["data"]:
        if word in vocab:
            word_vectors.append(word_embeddings[vocab.index(word)])
        else:
            word_vectors.append(word_embeddings[vocab.index("UNK")])
    
    # Then get the average word embedding for the example.
    embedding = dy.esum(word_vectors) / float(len(word_vectors))
    
    # Intermediate representation...
    intermediate_value = dy.parameter(hidden_weights) * dy.reshape(embedding, (HIDDEN_SIZE, 1)) + dy.parameter(hidden_biases)
    
    # With a nonlinearity
    intermediate_value = dy.tanh(intermediate_value)
    
    # Unnormalized scores over the classes (the softmax is applied later, in the loss)
    scores = dy.transpose(dy.parameter(final_weights)) * intermediate_value  + dy.parameter(final_biases)
    return scores

dy.renew_cg()
get_scores(train_examples[0]).value()

array([[ 1.2319001 ],
       [-1.97725821]])
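
To make the computation in `get_scores` concrete, here is the same sequence of steps (average the word embeddings, apply an affine layer followed by tanh, then a second affine layer) in plain Python on tiny hand-picked numbers. This is an illustrative sketch rather than DyNet code: all weights are made up, and `matvec` and `toy_scores` are hypothetical helpers.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def toy_scores(word_vectors, hidden_w, hidden_b, final_w, final_b):
    # Average the word embeddings, one dimension at a time.
    dim = len(word_vectors[0])
    average = [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]
    # Hidden layer: affine transform followed by tanh.
    hidden = [math.tanh(h + b) for h, b in zip(matvec(hidden_w, average), hidden_b)]
    # final_w is stored as hidden_size x num_classes, so transpose it first.
    final_w_t = [list(column) for column in zip(*final_w)]
    return [s + b for s, b in zip(matvec(final_w_t, hidden), final_b)]

word_vectors = [[1.0, 0.0], [0.0, 1.0]]   # two "words" with 2-dimensional embeddings
hidden_w = [[1.0, 0.0], [0.0, 1.0]]       # identity, to keep the arithmetic visible
hidden_b = [0.0, 0.0]
final_w = [[1.0, -1.0], [0.0, 0.0]]       # 2 hidden units x 2 classes
final_b = [0.0, 0.0]

scores = toy_scores(word_vectors, hidden_w, hidden_b, final_w, final_b)
print(scores)  # [tanh(0.5), -tanh(0.5)], roughly [0.462, -0.462]
```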

## Training: loss, prediction

To get the loss for a particular example, we can simply use the [`dy.pickneglogsoftmax`](http://dynet.readthedocs.io/en/latest/python_ref.html#dynet.pickneglogsoftmax) function.

In [9]:
dy.pickneglogsoftmax(get_scores(train_examples[0]), train_examples[0]["label"]).value()

0.03959619998931885
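
This value can be reproduced by hand: the loss is the negative log of the softmax probability assigned to the gold class. The sketch below recomputes it in plain Python from the two scores printed earlier, assuming the gold label of this example is 0 (which is consistent with the printed loss). `neg_log_softmax` is a hypothetical helper written for illustration, not a DyNet function.

```python
import math

def neg_log_softmax(scores, label):
    """Return -log(softmax(scores)[label]), using the log-sum-exp trick for stability."""
    highest = max(scores)
    log_normalizer = highest + math.log(sum(math.exp(s - highest) for s in scores))
    return log_normalizer - scores[label]

scores = [1.2319001, -1.97725821]   # scores printed for train_examples[0] above

print(neg_log_softmax(scores, 0))   # close to 0.0396, matching the DyNet output
print(neg_log_softmax(scores, 1))   # about 3.25: the loss if the gold label were 1
```

A confident, correct prediction gives a loss near zero, while the same scores paired with the other label would be penalized heavily.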

Before continuing, let's write a simple function that will compute whether a prediction was correct.

In [10]:
import numpy as np

def evaluate(example):
    prediction = np.argmax(get_scores(example).value())
    return prediction == example["label"]

evaluate(train_examples[0])

True

OK. Let's look at how training works for 20 examples. First we need to create an optimizer. Let's use the [Adam optimizer](http://dynet.readthedocs.io/en/latest/python_ref.html#dynet.AdamTrainer).

In [11]:
import time

optimizer = dy.AdamTrainer(pc)

epoch_start_time = time.time()

for example in train_examples[:20]:
    # Start a fresh computation graph for each example so the graph does not keep growing.
    dy.renew_cg()
    loss = dy.pickneglogsoftmax(get_scores(example), example["label"])
    loss.forward()
    loss.backward()
    optimizer.update()
    
    print(loss.value())
print("total time: " + str(time.time() - epoch_start_time))


0.03959619998931885
0.043017446994781494
0.05421280860900879
0.04489290714263916
0.052370548248291016
0.03637802600860596
0.045955777168273926
0.05160510540008545
0.035446763038635254
0.04442179203033447
0.043647170066833496
0.03885173797607422
0.03299403190612793
0.03218638896942139
0.04452073574066162
0.04988741874694824
0.05559486150741577
0.03305995464324951
0.033492207527160645
0.03645956516265869
total time: 0.20117998123168945


### Batching
Batching your updates means that you collect a certain number of losses, average them into a single value, and backpropagate through that average. This has two effects on performance:

1. Training can be a lot faster, especially with autobatching in DyNet. 
2. Model quality changes empirically with smaller or larger batch sizes. The batch size is a hyperparameter you can tune to get the best final results, and the best value depends heavily on the task.

The following code waits until a certain number of examples have been processed, and only then performs an update. Importantly, autobatching was enabled when we initialized DyNet (`dyparams.set_autobatch(True)`), so the intermediate computations are batched automatically, which makes things faster.

In [None]:
import random
BATCH_SIZE = 100

epoch_start_time = time.time()

def epoch_train(examples):
    dy.renew_cg()
    random.shuffle(examples)
    current_losses = [ ]
    for i, example in enumerate(examples):
        loss = dy.pickneglogsoftmax(get_scores(example), example["label"])
        current_losses.append(loss)

        if len(current_losses) >= BATCH_SIZE:
            mean_loss = dy.esum(current_losses) / float(len(current_losses))
            mean_loss.forward()
            mean_loss.backward()
            optimizer.update()
            current_losses = [ ]
            dy.renew_cg()
    if current_losses:
        mean_loss = dy.esum(current_losses) / float(len(current_losses))
        mean_loss.forward()
        mean_loss.backward()
        optimizer.update()
       
epoch_train(train_examples[:20])
print("total time: " + str(time.time() - epoch_start_time))

total time: 0.17264699935913086
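
It can help to see how many updates the accumulate-and-flush pattern in `epoch_train` actually performs. The sketch below mirrors only the counting logic (no model, no DyNet); `batch_sizes` is a hypothetical helper, and 1598 is the training-set size from the run above.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of the groups produced by accumulating items and flushing when full."""
    sizes = []
    pending = 0
    for _ in range(num_examples):
        pending += 1
        if pending >= batch_size:
            sizes.append(pending)   # a full batch triggers one update
            pending = 0
    if pending:
        sizes.append(pending)       # leftover examples form a final, smaller batch
    return sizes

print(batch_sizes(20, 100))         # [20]: the 20-example call above makes one update

sizes = batch_sizes(1598, 100)
print(len(sizes), sizes[-1])        # 16 98
```

With a batch size of 100, a full pass over 1598 training examples therefore makes 16 updates instead of 1598.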


Now let's write some code that trains for a few epochs.

In [None]:
max_accuracy = 0.
best_epoch = 0
for i in range(25):
    epoch_train(train_examples)
    
    accuracy = sum([float(evaluate(example)) for example in dev_examples]) / float(len(dev_examples))
    print("epoch " + str(i) + " accuracy: " + str(accuracy))
    if accuracy > max_accuracy:
        print("improved!")
        pc.save("model-epoch" + str(i) + ".dy")
        best_epoch = i
        max_accuracy = accuracy

print("loading from model at epoch " + str(best_epoch))
pc.populate("model-epoch" + str(best_epoch) + ".dy")

epoch 0 accuracy: 0.4925373134328358
improved!
epoch 1 accuracy: 0.4925373134328358
epoch 2 accuracy: 0.5174129353233831
improved!
epoch 3 accuracy: 0.5323383084577115
improved!
epoch 4 accuracy: 0.5870646766169154
improved!
epoch 5 accuracy: 0.5845771144278606
epoch 6 accuracy: 0.5796019900497512
epoch 7 accuracy: 0.5845771144278606
epoch 8 accuracy: 0.5895522388059702
improved!
epoch 9 accuracy: 0.599502487562189
improved!
epoch 10 accuracy: 0.6044776119402985
improved!
epoch 11 accuracy: 0.599502487562189
epoch 12 accuracy: 0.6019900497512438
epoch 13 accuracy: 0.6119402985074627
improved!
epoch 14 accuracy: 0.6169154228855721
improved!
epoch 15 accuracy: 0.6094527363184079
epoch 16 accuracy: 0.6069651741293532
epoch 17 accuracy: 0.6069651741293532
epoch 18 accuracy: 0.6069651741293532
epoch 19 accuracy: 0.6019900497512438
epoch 20 accuracy: 0.599502487562189
epoch 21 accuracy: 0.599502487562189


The model's dev accuracy improved over training, from about 49% after the first epoch to about 62% at its best. Now we should analyze its errors.

## Error analysis
First, we will print 20 randomly chosen dev examples that the model misclassified and see if we can identify why it got them wrong.

In [None]:
wrong_examples = [example for example in dev_examples if not evaluate(example)]
random.shuffle(wrong_examples)
for example in wrong_examples[:20]:
    print(str(example["label"]) + "\t" + " ".join(example["data"]))
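
Besides reading individual errors, it is often useful to check whether the mistakes are concentrated in one class. The sketch below counts errors by gold label on made-up predictions; `errors_by_label` and the toy data are hypothetical, and in the notebook the predictions would come from the trained model.

```python
from collections import Counter

def errors_by_label(examples, predictions):
    """Count misclassified examples, grouped by their gold label."""
    counts = Counter()
    for example, prediction in zip(examples, predictions):
        if prediction != example["label"]:
            counts[example["label"]] += 1
    return counts

# Made-up gold labels and predictions (0 = negative, 1 = positive).
toy_dev = [{"label": 0}, {"label": 0}, {"label": 1}, {"label": 1}, {"label": 1}]
toy_predictions = [0, 1, 0, 0, 1]

counts = errors_by_label(toy_dev, toy_predictions)
print("negative reviews misclassified:", counts[0])   # 1
print("positive reviews misclassified:", counts[1])   # 2
```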