# Deep Value Networks tutorial
## Overview
This tutorial presents Deep Value Networks (DVNs), an energy-based model for structured prediction.
DVNs learn a value function, i.e. they estimate the performance of a solution __y__ w.r.t. the ground truth __y∗__,
as a function of the input __x__ and the solution __y__. That is, the goal of training is to obtain a value network _v_ for which

_v_(__x__, __y__; __θ__) = _v_∗(__y__, __y∗__) (Eq. (5) in the paper),

where _v_∗ is the true performance of __y__ given the ground truth __y∗__. In this example, we use the F1 score (Eq. (7) in the paper).
Once the network is trained, as in this notebook, it can predict the performance of a solution in the absence of the ground truth __y∗__.
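
To make the training target concrete, the F1 score between a binary labelling and the ground truth can be computed as in the minimal NumPy sketch below. This is not the repository's implementation: the helper name `f1_value` and the convention of returning 1.0 when both label sets are empty are assumptions made for illustration.

```python
import numpy as np


def f1_value(y, y_true):
    """F1 overlap between two binary label vectors (hypothetical helper)."""
    y = np.asarray(y, dtype=bool)
    y_true = np.asarray(y_true, dtype=bool)
    intersection = np.logical_and(y, y_true).sum()
    total = y.sum() + y_true.sum()
    if total == 0:
        # Assumption: two empty label sets count as a perfect match
        return 1.0
    return 2.0 * intersection / total


# Three true tags, four predicted tags, three of them correct
y_true = np.array([1, 1, 0, 1, 0])
y_pred = np.array([1, 1, 1, 1, 0])
print("%.3f" % f1_value(y_pred, y_true))  # prints 0.857
```

With three correct tags out of four predicted and three true tags, the score is 2·3 / (4 + 3) ≈ 0.857.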

Predictions are made by gradient ascent, w.r.t. __y__.
Given some initial solution __y__, the network estimates the value of that solution; __y__ is then updated so that the estimated value increases.
For more details, see the paper.
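
The update can be illustrated on a toy problem. The sketch below replaces the trained network with a hypothetical differentiable value function, `toy_value` (one minus the mean squared distance to a fixed target), and runs gradient ascent on __y__. Clipping to [0, 1] after each step, all function names, and the step size are assumptions for illustration; the actual DVN obtains the gradient by backpropagation through the network.

```python
import numpy as np

# Hypothetical "best" labelling that the toy value function rewards
target = np.array([1.0, 0.0, 1.0])


def toy_value(y):
    """Toy stand-in for the value network: highest when y equals target."""
    return 1.0 - np.mean((y - target) ** 2)


def toy_value_grad(y):
    """Analytic gradient of toy_value with respect to y."""
    return -2.0 * (y - target) / y.size


def gradient_ascent(y_init, step_size=0.5, num_iterations=20):
    """Repeatedly move y in the direction that increases the estimated value."""
    y = np.array(y_init, dtype=np.float64)
    for _ in range(num_iterations):
        # Ascent step, then clip so every entry stays a valid probability
        y = np.clip(y + step_size * toy_value_grad(y), 0.0, 1.0)
    return y


y0 = np.zeros(3)
y_final = gradient_ascent(y0)
print("value before: %.3f, after: %.3f" % (toy_value(y0), toy_value(y_final)))
print(y_final >= 0.5)  # binarized prediction
```

Starting from all zeros, the estimated value increases with every step and the binarized result matches the toy target.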

__Paper:__ Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs.
M. Gygli, M. Norouzi, A. Angelova. ICML 2017. https://people.ee.ethz.ch/~gyglim/dvn/dvn_imcl17.pdf

__Code by:__ Michael Gygli https://twitter.com/GygliMichael

## Contents of this tutorial
In this tutorial we use a DVN that was already trained on the Bibtex dataset and analyze its predictions.
In particular, we look at what value the DVN assigns to the ground truth and how it iteratively makes predictions.

## Prerequisites
This code works with Python 2.7.
Additionally, the notebook requires Jupyter. See http://jupyter.readthedocs.io/en/latest/install.html for installation instructions.

### Installing DVN
Run
```bash
git clone git@github.com:gyglim/dvn.git
cd dvn
pip install -r requirements.txt
```

Then the notebook can be started with
```bash
jupyter notebook
```

In [1]:
# Imports
import mlc_datasets
import value_nets
import numpy as np

# Create the model
net = value_nets.ValueNetwork('/tmp/bibtex',
                              feature_dim=1836,
                              label_dim=159,
                              learning_rate=0.1,  # the learning rate when estimating the DVN parameters
                              inf_lr=0.5,  # the learning rate for the gradient ascent at inference
                              num_hidden=150,  # number of hidden units
                              weight_decay=0.001,
                              include_second_layer=False)


# Restore the pre-trained weights
net.restore("./bibtex_pretrained/weights-88148")

INFO:tensorflow:Restoring parameters from ./bibtex_pretrained/weights-88148


In [2]:
# Load the test data
test_labels, test_features, tagnames, txt_inputs = mlc_datasets.get_bibtex('test')

# Normalize the test features
normalized_test_features = np.array(test_features, np.float64)
normalized_test_features -= net.mean
normalized_test_features /= net.std

## Analyzing what the model learns
The DVN learns to predict the performance of a labelling __y__.
Thus, it is also possible to inspect the model and understand how it makes predictions.
Let's look at one example and what performance the model estimates for the ground truth.

In [3]:
# Let's pick an example, either a specific one or a random one
test_idx = 7  # or: np.random.randint(len(test_labels))

# Print the input words for this example
# The network uses a Bag of Words representation of these words as input
print "Input words: %s" % ", ".join(txt_inputs)

Input words: 0, 000, 02, 05, 06, 1, 10, 100, 11, 12, 13, 14, 15, 16, 17, 18, 1997, 1998, 1999, 2, 20, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 24, 25, 3, 30, 4, 40, 5, 50, 6, 60, 7, 8, 9, 95, 98, a, ab, ability, able, about, above, absence, abstract, academic, access, accessible, according, account, accuracy, accurate, achieve, achieved, acid, acids, acm, acquisition, across, act, action, activation, active, activities, activity, actual, adaptation, adapted, adaptive, added, addition, additional, address, addressed, addresses, advanced, advances, advantage, advantages, affect, affected, affinity, after, against, age, agent, agents, agreement, aim, aims, al, algorithm, algorithms, all, allow, allowed, allowing, allows, almost, along, alpha, already, als, also, alternative, although, always, am, american, amino, among, amount, amounts, amperometric, an, analyse, analyses, analysis, analytical, analyze, analyzed, analyzing, and, annotation, annual, another, answer, anti, antibodies

### Let's see what performance DVN estimates for the ground truth labelling

In [4]:
# Compute the estimated value
estimated_value = net.sess.run(net.predicted_values, feed_dict={net.features_pl: normalized_test_features[test_idx].reshape(1,-1),
                                       net.labels_pl: test_labels[test_idx].reshape(1,-1)})
print "Print estimated value of true tags %s: %.3f; True score: 1.0" % (", ".join(np.array(tagnames)[test_labels[test_idx]==1]),
                                                  estimated_value)

Print estimated value of true tags TAG_children, TAG_education, TAG_mathematics: 0.868; True score: 1.0


### We now investigate how the model thinks it can improve the current labelling, by running inference with a labelling initialized with the ground truth

In [5]:
# Run the gradient ascent procedure for 20 iterations
# This corresponds to Eq (8) in the paper, but here we actually start from the ground truth instead
predicted_scores = net.inference(normalized_test_features[test_idx].reshape(1,-1),
                                 initial_labels=test_labels[test_idx].reshape(1,-1).astype(np.float64),
                                 learning_rate=net.inf_lr,
                                 num_iterations=20).flatten()
# Binarize the predictions
predicted_labels = predicted_scores >= 0.5

# Compute the true performance of the predicted labelling
true_value = net.gt_value(predicted_labels, test_labels[test_idx])

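# Note: estimated_value is not recomputed in this cell; it is still the value
# estimated for the ground-truth labelling in cell 4
print "Predicted tags: %s. Estimated score %.3f; True score %.3f" % (", ".join(np.array(tagnames)[predicted_labels==True]),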
                                                                     estimated_value,
                                                                     true_value)

Predicted tags: TAG_children, TAG_education, TAG_kaldesignresearch, TAG_mathematics. Estimated score 0.868; True score 0.857


As we can see (for test_idx = 7), the labelling changes slightly: one extra tag is added. For most examples, however, it stays the same, indicating that the DVN is accurate in estimating the performance in the proximity of the ground truth.
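
One way to check this claim is to count how often the binarized labelling after inference equals the ground truth it was initialized with. The sketch below is hypothetical: `fraction_unchanged` and its `inference_fn` argument are assumed names, and the demo uses a stand-in function in place of the trained model so that it runs on its own.

```python
import numpy as np


def fraction_unchanged(inference_fn, features, labels, threshold=0.5):
    """Share of examples whose labelling is unchanged after inference.

    inference_fn(x, y_init) is assumed to return refined label scores
    for a single example, given features x and initial labels y_init.
    """
    unchanged = 0
    for x, y_true in zip(features, labels):
        y_init = y_true.reshape(1, -1).astype(np.float64)
        scores = np.asarray(inference_fn(x.reshape(1, -1), y_init)).flatten()
        predicted = scores >= threshold
        unchanged += int(np.array_equal(predicted, y_true == 1))
    return unchanged / float(len(labels))


# Stand-in for a trained model: returns the initial labels unchanged
def identity_inference(x, y_init):
    return y_init


demo_features = np.zeros((4, 3))
demo_labels = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]])
print(fraction_unchanged(identity_inference, demo_features, demo_labels))  # prints 1.0
```

With the model from this notebook, `inference_fn` could wrap the call used in cell 5, e.g. `lambda x, y: net.inference(x, initial_labels=y, learning_rate=net.inf_lr, num_iterations=20)`, applied to `normalized_test_features` and `test_labels`.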

### We can also see what the model thinks are the most probable tags by running standard inference

In [6]:
# Run the iterative prediction on this test example
predicted_labels = net.predict(test_features[test_idx], num_iterations=20)

# Get the value the value net assigns to these labels
estimated_value = net.sess.run(net.predicted_values, feed_dict={net.features_pl: normalized_test_features[test_idx].reshape(1,-1),
                                       net.labels_pl: predicted_labels.reshape(1,-1)})

# Compute the true performance of the predicted labelling
true_value = net.gt_value(predicted_labels, test_labels[test_idx])

print "Predicted tags: %s. Estimated score %.3f; True score %.3f" % (", ".join(np.array(tagnames)[predicted_labels==True]),
                                                                     estimated_value,
                                                                     true_value)

Predicted tags: TAG_children, TAG_kaldesignresearch, TAG_mathematics. Estimated score 0.936; True score 0.667


### Now it's up to you to explore. What happens when you change the number of iterations? How does the initialization affect the final result?
For example, we initialized with __y__ = __0__, i.e. all labels start with zero probability. Is the network affected by setting the initial probabilities to one or to random values?
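
As a starting point for these experiments, the three initializations could be built as follows. This sketch only constructs the initial label arrays. It assumes that `net.inference` accepts them through the `initial_labels` argument in the same way as in cell 5; that call is shown in a comment and is not executed here. The helper name `make_initializations` and the fixed seed are assumptions.

```python
import numpy as np


def make_initializations(label_dim, seed=0):
    """Build candidate starting labellings of shape (1, label_dim)."""
    rng = np.random.RandomState(seed)  # fixed seed so runs are repeatable
    return {
        "zeros": np.zeros((1, label_dim), dtype=np.float64),
        "ones": np.ones((1, label_dim), dtype=np.float64),
        "random": rng.uniform(0.0, 1.0, size=(1, label_dim)),
    }


# 159 is the label dimension used for Bibtex in this notebook
inits = make_initializations(label_dim=159)
for name in sorted(inits):
    print("%s: shape %s" % (name, inits[name].shape))

# With the notebook's model loaded, each start could be passed to inference:
# scores = net.inference(normalized_test_features[test_idx].reshape(1, -1),
#                        initial_labels=inits["random"],
#                        learning_rate=net.inf_lr,
#                        num_iterations=20).flatten()
```

Comparing the binarized results across the three starting points shows whether inference converges to the same labelling regardless of initialization.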

## Next steps
While we used a pre-trained net here, you can also train one yourself on your own structured prediction task.
For more info see: https://github.com/gyglim/dvn and https://github.com/gyglim/dvn/blob/master/reproduce_results.py

### For reference, let's also evaluate the model on the test dataset of bibtex

In [7]:
# Evaluate the final model
print "F1 score: %.3f" % mlc_datasets.evaluate_f1(net.predict, test_features, test_labels)

0.857 (0 of 2515)


0.413 (100 of 2515)
0.415 (200 of 2515)
0.427 (300 of 2515)
0.432 (400 of 2515)
0.437 (500 of 2515)
0.435 (600 of 2515)
0.442 (700 of 2515)
0.442 (800 of 2515)
0.440 (900 of 2515)
0.439 (1000 of 2515)
0.449 (1100 of 2515)
0.449 (1200 of 2515)
0.447 (1300 of 2515)
0.448 (1400 of 2515)
0.448 (1500 of 2515)
0.448 (1600 of 2515)
0.447 (1700 of 2515)
0.448 (1800 of 2515)
0.450 (1900 of 2515)
0.447 (2000 of 2515)
0.445 (2100 of 2515)
0.444 (2200 of 2515)
0.444 (2300 of 2515)
0.446 (2400 of 2515)
0.447 (2500 of 2515)
0.447
F1 score: 0.447
