# Training a model in DyNet
In this tutorial, we will train a bag-of-words model to differentiate between negative and positive movie reviews using the movie review dataset from Cornell [(Pang and Lee 2004)](http://www.cs.cornell.edu/home/llee/papers/cutsent.home.html).

## Simple data pre-processing
We start by loading the negative and positive reviews into memory. This code uses only the first sentence (the first line of the file) in each review.

In [None]:
import os
import nltk

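def load_data(directory):
    """Load the first line of every .txt review in `directory` as a list of tokens."""
    examples = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            # Use a context manager so the file handle is closed promptly.
            with open(os.path.join(directory, filename)) as f:
                first_line = f.readline()
            examples.append({"id": filename, "data": nltk.word_tokenize(first_line.lower())})
    return examples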

negative_examples = load_data("neg/")
positive_examples = load_data("pos/")

print("num per class: negative " + str(len(negative_examples)) + "; positive " + str(len(positive_examples)))

In [None]:
negative_examples[0]

In [None]:
" ".join(negative_examples[24]["data"])

Now we are going to split the data into train and dev sets. This dataset has no standard splits (prior work uses cross-validation).

In [None]:
dev_examples = [ ]
train_examples = [ ]

import random
random.shuffle(negative_examples)
random.shuffle(positive_examples)

for example in negative_examples:
    example["label"] = 0
    randnum = random.random()
    if randnum < 0.8:
        train_examples.append(example)
    else:
        dev_examples.append(example)
        
for example in positive_examples:
    example["label"] = 1
    randnum = random.random()
    if randnum < 0.8:
        train_examples.append(example)
    else:
        dev_examples.append(example)
        
print("lengths: train " + str(len(train_examples)) + "; dev " + str(len(dev_examples)))
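
Because each example above is assigned to train or dev independently with probability 0.8, the split sizes change from run to run. As a minimal sketch of one alternative, assuming a reproducible split of fixed size is wanted, the helper below shuffles a copy with a seeded generator and cuts it at 80%. The name `split_examples` and the toy data are hypothetical and not part of the tutorial's code.

```python
import random

def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of `examples` and cut it at `train_fraction`."""
    rng = random.Random(seed)      # local generator; the global random state is untouched
    shuffled = list(examples)      # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for the review dictionaries used above.
toy_examples = [{"id": str(i), "data": ["word"], "label": i % 2} for i in range(10)]
toy_train, toy_dev = split_examples(toy_examples)
print(len(toy_train), len(toy_dev))
```

To keep the classes balanced, this could be applied to the negative and positive lists separately before concatenating the results.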

In [None]:
" ".join(train_examples[0]["data"])

Next we will create a vocabulary list using the words in the training data.

In [None]:
vocab = list(set([word for example in train_examples for word in example["data"]])) + ["UNK"]
len(vocab)

We can verify that it works:

In [None]:
indices = [vocab.index(word) for word in train_examples[0]["data"][:10] if word in vocab]
indices

In [None]:
[vocab[index] for index in indices]
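
Note that `vocab.index(word)` and `word in vocab` both scan the list, so every lookup costs time proportional to the vocabulary size. A minimal sketch of a dictionary-based alternative follows; `build_index` and `lookup` are hypothetical helper names, and the fallback to `"UNK"` mirrors what the model code later does for out-of-vocabulary words.

```python
def build_index(vocab):
    """Map each word to its position in the vocabulary list."""
    return {word: i for i, word in enumerate(vocab)}

def lookup(word_to_index, word, unk="UNK"):
    """Return the index of `word`, falling back to the UNK index."""
    return word_to_index.get(word, word_to_index[unk])

toy_vocab = ["bad", "good", "movie", "UNK"]
toy_index = build_index(toy_vocab)
print([lookup(toy_index, w) for w in ["good", "movie", "zzz"]])  # [1, 2, 3]
```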

Let's have a look at word frequencies in training data:

In [None]:
from collections import defaultdict
import matplotlib.pyplot as plt
d = defaultdict(int)
wordlist = [word for example in train_examples for word in example["data"]]
for word in wordlist:
    d[word] += 1
plt.plot(range(len(d)), list(reversed(sorted(d.values()))))
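
The same counts can be built with `collections.Counter`, which also makes it easy to inspect the most frequent words directly. This is a small sketch on toy sentences rather than the real training data:

```python
from collections import Counter

# Toy tokenized sentences standing in for example["data"].
toy_sentences = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "film"]]

counts = Counter(word for sentence in toy_sentences for word in sentence)
print(counts.most_common(2))  # [('a', 3), ('movie', 2)]

# Frequencies in descending order, the same series the plot above draws.
sorted_counts = sorted(counts.values(), reverse=True)
print(sorted_counts)  # [3, 2, 1, 1, 1]
```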

Great. Now we can start building the model!

## Constructing the ParameterCollection
Our model will be a simple bag-of-words model with a single hidden layer. The model will have the following parameters:

1. A `LookupParameters` object for the words in the data, which is of size len(vocab) x 64.
2. A `Parameters` object for the hidden layer of size 64 x 64.
3. A `Parameters` object for the hidden layer's biases of size 64 x 1.
4. A `Parameters` object for the final layer of size 64 x 2.
5. A `Parameters` object for the final layer's biases of size 2 x 1.

We will show off a few things in this section:

1. Naming parameters. This will make things a bit easier to debug if necessary.
2. Initializing parameters using [`PyInitializer`](http://dynet.readthedocs.io/en/latest/python_ref.html#parameters-initializers).

In [None]:
import _dynet as dy
dyparams = dy.DynetParams()
dyparams.set_mem(2048)
dyparams.set_autobatch(True)
dyparams.init()

pc = dy.ParameterCollection()

HIDDEN_SIZE = 64
NUM_CLASSES = 2
word_embeddings = pc.add_lookup_parameters((len(vocab), HIDDEN_SIZE), name="word-embeddings")
hidden_weights = pc.add_parameters((HIDDEN_SIZE, HIDDEN_SIZE), name="hidden-weights", init=dy.NormalInitializer())
hidden_biases = pc.add_parameters((HIDDEN_SIZE, 1), name="hidden-biases", init=dy.NormalInitializer())
final_weights = pc.add_parameters((HIDDEN_SIZE, NUM_CLASSES), name = "final-weights")
final_biases = pc.add_parameters((NUM_CLASSES, 1), name = "final-biases")

[(param.name(), param.shape()) for param in pc.lookup_parameters_list() + pc.parameters_list()]

Now we should implement a function that takes a single example and computes a set of scores over the two classes. This will later be used for training and evaluation.

In [None]:
def get_scores(example):
    word_vectors = [ ]
    
    # First, get the word embeddings for every word in the data.
    for word in example["data"]:
        if word in vocab:
            word_vectors.append(word_embeddings[vocab.index(word)])
        else:
            word_vectors.append(word_embeddings[vocab.index("UNK")])
    
    # Then get the average word embedding for the example.
    embedding = dy.esum(word_vectors) / float(len(word_vectors))
    
    # Intermediate representation...
    intermediate_value = hidden_weights * dy.reshape(embedding, (HIDDEN_SIZE, 1)) + hidden_biases
    
    # With a nonlinearity
    intermediate_value = dy.tanh(intermediate_value)
    
    # Final unnormalized scores, one per class (the softmax is applied inside the loss)
    scores = dy.transpose(final_weights) * intermediate_value  + final_biases
    return scores

dy.renew_cg()
get_scores(train_examples[0]).value()
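
To make the shapes in `get_scores` concrete, here is a NumPy-only sketch of the same computation: average the embeddings, apply one `tanh` hidden layer, then project to class scores. It is an illustration under assumed toy sizes with random weights, not DyNet code, and all names in it are hypothetical.

```python
import numpy as np

HIDDEN, CLASSES, VOCAB = 4, 2, 5   # toy sizes; the tutorial uses 64 hidden units and 2 classes
rng = np.random.default_rng(0)

embeddings = rng.normal(size=(VOCAB, HIDDEN))   # one row per vocabulary word
W_h = rng.normal(size=(HIDDEN, HIDDEN))
b_h = rng.normal(size=(HIDDEN, 1))
W_out = rng.normal(size=(HIDDEN, CLASSES))      # stored as hidden x classes, transposed below
b_out = rng.normal(size=(CLASSES, 1))

def toy_scores(word_ids):
    # Average the word embeddings and shape the result as a column vector.
    avg = embeddings[word_ids].mean(axis=0).reshape(HIDDEN, 1)
    hidden = np.tanh(W_h @ avg + b_h)
    # One unnormalized score per class.
    return W_out.T @ hidden + b_out

print(toy_scores([0, 3, 3]).shape)  # (2, 1)
```

Because the embeddings are averaged, word order has no effect on the scores, which is what makes this a bag-of-words model.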

## Training: loss, prediction

To get the loss for a particular example, we can simply use the [`dy.pickneglogsoftmax`](http://dynet.readthedocs.io/en/latest/python_ref.html#dynet.pickneglogsoftmax) function.

In [None]:
dy.pickneglogsoftmax(get_scores(train_examples[0]), train_examples[0]["label"]).value()
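
Conceptually, this loss is the negative log of the softmax probability assigned to the gold class. The NumPy sketch below illustrates that quantity; `neg_log_softmax_at` is a hypothetical name, and the code is not DyNet's implementation.

```python
import numpy as np

def neg_log_softmax_at(scores, gold):
    """Negative log-probability of class `gold` under softmax(scores)."""
    scores = np.asarray(scores, dtype=float).ravel()
    shifted = scores - scores.max()                  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[gold]

# Equal scores give each of the two classes probability 0.5, so the loss is log(2).
print(neg_log_softmax_at([0.0, 0.0], 1))
# The loss is small when the gold class has the higher score, and large otherwise.
print(neg_log_softmax_at([3.0, -3.0], 0), neg_log_softmax_at([3.0, -3.0], 1))
```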

Before continuing, let's write a simple function that will compute whether a prediction was correct.

In [None]:
import numpy as np

def evaluate(example):
    prediction = np.argmax(get_scores(example).value())
    return prediction == example["label"]

evaluate(train_examples[0])

OK. Let's look at how training works for 20 examples. First we need to create an optimizer. Let's use the [Adam optimizer](http://dynet.readthedocs.io/en/latest/python_ref.html#dynet.AdamTrainer).

In [None]:
import time

optimizer = dy.AdamTrainer(pc)

epoch_start_time = time.time()

for i, example in enumerate(train_examples[:20]):
    dy.renew_cg()
    loss = dy.pickneglogsoftmax(get_scores(example), example["label"])
    loss.forward()
    loss.backward()
    optimizer.update()
    
    print(loss.value())
print("total time: " + str(time.time() - epoch_start_time))


### Batching
Batching your updates means that you consolidate a certain number of losses into a single value, then backpropagate through the average of those values. This has two effects on performance:

1. Training could be a lot faster, especially with autobatching in DyNet. 
2. Model quality changes empirically with smaller or larger batch sizes. The batch size is a hyperparameter you can tune to get the best results, and the best value depends heavily on the task.

The following code waits until a certain number of examples have been processed, and only then does an update. Importantly, autobatching is also on (we enabled it earlier with `dyparams.set_autobatch(True)`), so intermediate computations are batched automatically, which makes things faster.

In [None]:
import random
BATCH_SIZE = 10


def epoch_train(examples):
    epoch_start_time = time.time()
    dy.renew_cg()
    random.shuffle(examples)
    current_losses = [ ]
    for i, example in enumerate(examples):
        loss = dy.pickneglogsoftmax(get_scores(example), example["label"])
        current_losses.append(loss)

        if len(current_losses) >= BATCH_SIZE:
            mean_loss = dy.esum(current_losses) / float(len(current_losses))
            mean_loss.forward()
            mean_loss.backward()
            optimizer.update()
            current_losses = [ ]
            dy.renew_cg()
    if current_losses:
        mean_loss = dy.esum(current_losses) / float(len(current_losses))
        mean_loss.forward()
        mean_loss.backward()
        optimizer.update()
    print("total time: " + str(time.time() - epoch_start_time))
       
epoch_train(train_examples[:20])
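
The grouping that `epoch_train` performs (full batches of `BATCH_SIZE`, then one smaller batch for whatever is left over) can be seen in isolation with plain numbers. This is a sketch only: `iter_batches` is a hypothetical helper and the loss values are made up.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Made-up per-example losses; in the tutorial these are DyNet expressions.
toy_losses = [0.9, 0.7, 0.8, 0.4, 0.6, 0.5, 0.3]

for batch in iter_batches(toy_losses, 3):
    # One parameter update would happen per batch, using the mean loss.
    print(len(batch), sum(batch) / len(batch))
```

With seven losses and a batch size of three, this produces batches of sizes 3, 3, and 1, so the final update averages over fewer examples than the others.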

Now let's write some code that trains for a few epochs.

In [None]:
NUM_EPOCHS = 25

max_accuracy = 0.
best_epoch = 0
start_time = time.time()
for i in range(NUM_EPOCHS):
    epoch_train(train_examples)
    
    accuracy = sum([float(evaluate(example)) for example in dev_examples]) / float(len(dev_examples))
    print("epoch " + str(i) + " accuracy: " + str(accuracy))
    if accuracy > max_accuracy:
        print("improved!")
        pc.save("model-epoch" + str(i) + ".dy")
        best_epoch = i
        max_accuracy = accuracy

total_time = time.time() - start_time
print("total training time: " + str(total_time) + "; " + str(float(total_time) / NUM_EPOCHS) + " per epoch")
print("loading from model at epoch " + str(best_epoch))
pc.populate("model-epoch" + str(best_epoch) + ".dy")

Our model has improved performance! Now we should do some analysis of its errors. 

## Evaluation and error analysis
First, we will print 10 randomly chosen errors that the model made and see if we can identify why it made them.

In [None]:
wrong_examples = [example for example in dev_examples if not evaluate(example)]
print("Error rate: " + str(float(len(wrong_examples)) / len(dev_examples)))
random.shuffle(wrong_examples)
for example in wrong_examples[:10]:
    print(str(example["label"]) + "\t" + " ".join(example["data"]))

In [None]:
wrong_examples = [example for example in train_examples if not evaluate(example)]
print("Error rate: " + str(float(len(wrong_examples)) / len(train_examples)))
random.shuffle(wrong_examples)
for example in wrong_examples[:10]:
    print(str(example["label"]) + "\t" + " ".join(example["data"]))

In [None]:
def evaluate_input(input_string):
    example = {"data": nltk.word_tokenize(input_string.lower())}
    return np.argmax(get_scores(example).value())

evaluate_input("This movie is really bad")

In [None]:
evaluate_input("This movie made me feel happy")

* What mistake was I making in terms of handling data?
* How do we improve development / test performance?
* Issues with the UNK token?