# Neural Bag-of-Words Model

In this notebook, we'll move beyond linear classifiers and implement a neural network for our classification task. 

We'll also introduce the [TensorFlow Estimator API](https://www.tensorflow.org/extend/estimators), which provides a high-level interface similar to scikit-learn. This involves a few new concepts, such as the idea of a `model_fn` and an `input_fn`, but it greatly simplifies experiments and reduces the need to write tedious data-feeding code.

## Outline

- **Part (a):** Model architecture
- **Part (b):** Implementing the Neural BOW model
- **Part (c):** Introduction to `tf.Estimator`
- **Part (d):** Training, evaluation, and tuning

As with the first half of the assignment, exercises are interspersed throughout the notebook. In particular, Part (a) has 4 questions, Part (b) asks you to write code in `models.py`, and Part (d) has 4 questions plus one optional implementation exercise.

In [1]:
# Install a few python packages using pip
from common import utils
utils.require_package("wget")      # for fetching dataset

In [2]:
from __future__ import division
import os, sys, re, json, time, datetime, shutil, copy
import itertools, collections
from importlib import reload
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
import tensorflow as tf
#assert(tf.__version__.startswith("1.4"))

# Helper libraries
from common import utils, vocabulary, tf_embed_viz, treeviz
from common import patched_numpy_io
# Code for this assignment
import sst, models, models_test

# Monkey-patch NLTK with better Tree display that works on Cloud or other display-less server.
print("Overriding nltk.tree.Tree pretty-printing to use custom GraphViz.")
treeviz.monkey_patch(nltk.tree.Tree, node_style_fn=sst.sst_node_style, format='svg')

#test trace
print('Success! All imports done!')

Overriding nltk.tree.Tree pretty-printing to use custom GraphViz.
Success! All imports done!


# Part (a): Model Architecture

The neural bag-of-words classifier is one of the simplest neural models for text classification. It takes its name from the bag-of-words assumption common to linear models, in which the weights for each input word are summed to make a prediction. For our neural version, we'll instead sum the _vector representations_ of each word, and then add feed-forward (hidden) layers to make a deep network.

Here's a diagram:

![Neural Bag-of-Words Model](images/neural_bow.png)

We'll use the following notation:
- $w^{(i)} \in \mathbb{Z}$ for the $i^{th}$ word of the sequence (as an integer index)
- $x^{(i)} \in \mathbb{R}^d$ for the vector representation (embedding) of $w^{(i)}$
- $x \in \mathbb{R}^d$ for the fixed-length vector given by summing all the $x^{(i)}$ for an example
- $h^{(j)}$ for the hidden state after the $j^{th}$ fully-connected layer
- $y$ for the target label ($\in 1,\ldots,\mathtt{num\_classes}$)

Our model is defined as:
- **Embedding layer:** $x^{(i)} = W_{embed}[w^{(i)}]$
- **Summing vectors:** $x = \sum_{i=1}^n x^{(i)}$
- **Hidden layer(s):** $h^{(j)} = f(h^{(j-1)} W^{(j)} + b^{(j)})$ where $h^{(-1)} = x$ and $j = 0,1,\ldots,J-1$
- **Output layer:** $\hat{y} = \hat{P}(y) = \mathrm{softmax}(h^{(final)} W_{out} + b_{out})$ where $h^{(final)} = h^{(J-1)}$ is the output of the last hidden layer.

As per usual, we define the logits to be the argument of the softmax:

$$ \mathrm{logits} = h^{(final)}W_{out} + b_{out} $$
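To make the relationship between the logits and $\hat{P}(y)$ concrete, here is a minimal NumPy sketch of a row-wise softmax. The function name `softmax` and the example logits are illustrative assumptions, not part of `models.py`; subtracting the row maximum is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax for a [batch_size, num_classes] array of logits."""
    # Subtracting the per-row max avoids overflow in exp() without changing the output.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Illustrative logits for two examples and two classes (made-up values).
example_logits = np.array([[2.0, 0.0],
                           [0.0, 0.0]])
probs = softmax(example_logits)
```

Each row of `probs` sums to one, and equal logits yield a uniform distribution over the classes.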

We'll refer to the first part of this model (**Embedding layer**, **Summing vectors**, and **Hidden layer(s)**) as the **Encoder**: it has the role of encoding the input sequence into a fixed-length vector representation that we pass to the output layer.

We'll also use these as shorthand for important dimensions:
- `V`: the vocabulary size (equal to `ds.vocab.size`)
- `embed_dim`: the embedding dimension $d$
- `hidden_dims`: a list of dimensions for the output of each hidden layer (i.e. $\mathrm{dim}(h^{(j)})$&nbsp;=&nbsp;`hidden_dims[j]`)
- `num_classes`: the number of target classes (2 for the binary task)

## Part (a) Exercises

Think about the following questions, and work with your group to answer them before moving on to the code in the next section.

1. Let `embed_dim = d`, `hidden_dims = [h1, h2]`, and `num_classes = k`. In terms of these values and the vocabulary size `V`, write down the shapes of the following variables: $W_{embed}$, $W^{(0)}$, $b^{(0)}$, $W^{(1)}$, $b^{(1)}$, $W_{out}$, $b_{out}$. (*Hint: $W_{embed}$ has a row for each word in the vocabulary.*)
<p>
2. Using your answer to 1., how many parameters (matrix or vector elements) are in the embedding layer? How about in the hidden layers? And the output layer?  
<p>
<p>
3. Recall that logistic regression can be thought of as a single-layer neural network. What should we set as the values of `embed_dim` and `hidden_dims` such that this model implements logistic regression?
<p>
4. Suppose that we have two examples, `[foo bar baz]` and `[baz bar foo]`. Will this model make the same predictions on these? Why or why not?

## Training with Minibatches

Modern hardware (especially GPUs) performs most efficiently when processing a large amount of data in parallel. Because of this, we usually feed data to a neural network in batches - that is, running several examples at a time, in parallel. If each example is represented by a vector $x \in \mathbb{R}^d$, then we can feed in a batch of $m$ examples as a matrix $X \in \mathbb{R}^{m \times d}$, where each row is an example. Note that if we write our matrix-vector products with the vector on the left, as in the equations above, the batch dimension carries through while the rows remain independent:

$$ H = f(X W + b) $$

is equivalent to computing in parallel $H_i = f(X_i W + b)$ for each $i = 0, \ldots, m - 1$. Most TensorFlow operations are designed to handle batching seamlessly, so long as the batch dimension ($m$ = `batch_size`) is the first dimension of the input data.
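The equivalence between the batched product and the per-row computation can be checked numerically. This is a minimal NumPy sketch with arbitrary made-up sizes and `tanh` standing in for the nonlinearity $f$; none of these values come from the assignment.

```python
import numpy as np

rng = np.random.RandomState(0)
m, d_in, d_out = 4, 3, 2          # arbitrary sizes chosen for illustration
X = rng.randn(m, d_in)            # batch of m examples, one per row
W = rng.randn(d_in, d_out)
b = rng.randn(d_out)

# Batched computation: one matrix product handles every row at once.
H_batched = np.tanh(X.dot(W) + b)

# Per-example computation: each row is processed independently.
H_rowwise = np.stack([np.tanh(X[i].dot(W) + b) for i in range(m)])

rows_match = np.allclose(H_batched, H_rowwise)
```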

### Padding Sequences

Unlike the Naive Bayes classifier, which took long ($d = V \approx 16,000$) sparse vectors as input, our neural network will operate directly on a _sequence_ of ids (as stored in `ds.train.ids`). This can be variable-length (depending on the length of the sequence), but we'll need to coerce it into a fixed-length vector for training.

The easiest thing to do here is to pad the vectors with a dummy index, which we can zero-out inside our model. Consider the inputs:
```
[great movies] (2 tokens)
[this is a terrible movie] (5 tokens)
```
We'll convert these to IDs, then pad with a dummy index `0` to get a 2 x 5 matrix:
```
[[144, 104,  0,   0,  0 ]
 [ 20,  10,  6, 937, 21]]
```

For SST, we'll arbitrarily choose to pad to length 40, and clip any examples longer than that. _(Recall from Part (a) that this clips fewer than 5% of the examples in the dataset.)_

The `ds.as_padded_array` function is implemented for you, and will handle clipping and padding automatically. Note the second return value, `*_ns`: this is a vector containing the original (clipped) sequence lengths. We'll use this inside the model to mask the dummy indices so they don't bias our predictions.
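As a rough illustration of what padding, clipping, and masking involve, here is a minimal NumPy sketch. The helper `pad_and_clip` is hypothetical and is not the implementation of `ds.as_padded_array`; it only mirrors the behavior described above (pad with index `0`, clip to `max_len`, return the clipped lengths).

```python
import numpy as np

def pad_and_clip(id_lists, max_len, pad_id=0):
    """Hypothetical helper: pad or clip each id list to max_len.

    Returns (padded, ns), where ns holds the clipped sequence lengths.
    """
    padded = np.full((len(id_lists), max_len), pad_id, dtype=np.int32)
    ns = np.zeros(len(id_lists), dtype=np.int32)
    for i, ids in enumerate(id_lists):
        clipped = ids[:max_len]
        padded[i, :len(clipped)] = clipped
        ns[i] = len(clipped)
    return padded, ns

# The two example inputs from above, as ids.
x, ns = pad_and_clip([[144, 104], [20, 10, 6, 937, 21]], max_len=5)

# Mask that is 1.0 for real tokens and 0.0 for padding positions.
mask = (np.arange(x.shape[1])[None, :] < ns[:, None]).astype(np.float32)
```

Multiplying per-token embeddings by `mask[:, :, None]` before summing zeroes out the contribution of the padding positions.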

One last thing: we'll be using [GloVe vectors](https://nlp.stanford.edu/projects/glove/) later, so let's load these now so we can use the same vocabulary. (This will save us from having to re-run the data processing steps later.)

In [3]:
import glove_helper; reload(glove_helper)

hands = glove_helper.Hands(ndim=100)  # 50, 100, 200, 300 dim are available

Loading vectors from data/glove/glove.6B.zip
Parsing file: data/glove/glove.6B.zip:glove.6B.100d.txt
Found 400,000 words.
Parsing vectors... Done! (W.shape = (400003, 100))


In [4]:
import sst
ds = sst.SSTDataset(V=hands.vocab).process(label_scheme="binary")
print('Success! Stanford Sentiment Treebank (SST Dataset) loaded!')

Loading SST from data/sst/trainDevTestTrees_PTB.zip
Training set:     8,544 trees
Development set:  1,101 trees
Test set:         2,210 trees
Using pre-built vocabulary - 400,003 words
Processing to phrases...  Done!
Splits: train / dev / test : 98,794 / 13,142 / 26,052
Success! Stanford Sentiment Treebank (SST Dataset) loaded!


### Using Pre-trained Representations

We'll build three versions of the same model: a base model, a version using GloVe embeddings for each word type, and a version using the ELMo language model to provide _contextual_ embeddings for each token. 

For the first version, we'll learn embeddings from scratch, and so we need to provide the model with lists of token ids. For the latter two, we'll replace the token ids with pre-computed vectors, and train a continuous classifier on top.

In [5]:
max_len = 40
train_x, train_ns, train_y = ds.as_padded_array('train', max_len=max_len, root_only=True)
dev_x,   dev_ns,   dev_y   = ds.as_padded_array('dev',   max_len=max_len, root_only=True)
test_x,  test_ns,  test_y  = ds.as_padded_array('test',  max_len=max_len, root_only=True)

In [6]:
print("Examples:\n\n", train_x[:3])
print("\nOriginal sequence lengths: \t", train_ns[:3])
print("Target labels: \t\t\t", train_y[:3])
print("")
print("Padded:\n\n", " ".join(ds.vocab.ids_to_words(train_x[0])))
print("\nUn-padded:\n\n", " ".join(ds.vocab.ids_to_words(train_x[0,:train_ns[0]])))

Examples:

 [[    3  1140    17 10456     7    33     3  5036   592    12    53    31
  18515    30     8    15    21    12   225     7   162    10 16809   154
   1416    76  5821  6683     4     2  1464 43710    49  4414 26987     5
      0     0     0     0]
 [    3 78618  5137 10119     6    31     3  2373     6     3  6822    30
  12307    17   103  1327    15    10  3238     6  1377    89    39 12426
   4469     2  1297  1757    12  2855  3141     6 63556 23465    12 55756
      5     0     0     0]
 [    2  7109  4129 12905    10 15416     6  1502    68    10   309  1159
   2045     4    10   309    59  1484 22738     7     3   526    68    37
      3  1118  2086  2035 14694     3  2024     4 12601     4  2946     6
      3  2368     5     0]]

Original sequence lengths: 	 [36 37 39]
Target labels: 			 [1 1 1]

Padded:

 the rock is destined to be the 21st century 's new `` conan '' and that he 's going to make a splash even greater than arnold schwarzenegger , <unk> van damme or

# Part (b): Implementing the Neural BOW Model

In order to better manage the model code, we'll implement our BOW model in `models.py`. In particular, you should implement the following functions:

- `embedding_layer(...)`: constructs an embedding layer
- `BOW_encoder(...)`: constructs the encoder stack as described above
- `softmax_output_layer(...)`: constructs a softmax output layer

**Follow the instructions in the code (function docstrings and comments) carefully!**

In particular, for unit tests to work, you shouldn't change (or add) any `tf.name_scope` or `tf.variable_scope` calls, and must name the variables exactly as documented. (Your model may work just fine, of course, but the test harness will throw all sorts of errors!)

To aid debugging and readability, we've adopted a convention that TensorFlow tensors are represented by variables ending in an underscore, such as `W_embed_` or `train_op_`.

You may find the following TensorFlow API functions useful:
- [`tf.nn.embedding_lookup`](https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/embedding_lookup)
- [`tf.nn.sparse_softmax_cross_entropy_with_logits`](https://www.tensorflow.org/versions/master/api_docs/python/tf/nn/sparse_softmax_cross_entropy_with_logits)
- [`tf.reduce_mean`](https://www.tensorflow.org/versions/master/api_docs/python/tf/reduce_mean) and [`tf.reduce_sum`](https://www.tensorflow.org/versions/master/api_docs/python/tf/reduce_sum)

**Do your work in `models.py`.** When ready, run the cell below to run the unit tests.

In [7]:
reload(models)
utils.run_tests(models_test, ["TestLayerBuilders", "TestNeuralBOW"])

test_embedding_layer (models_test.TestLayerBuilders) ... ok
test_softmax_output_layer (models_test.TestLayerBuilders) ... ok
test_BOW_encoder (models_test.TestNeuralBOW) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.085s

OK


# Training a Neural Network (the hard way)

We've implemented a wrapper function, `models.classifier_model_fn`, which uses the functions you wrote in **Part (b)** to build a model graph. It takes as input `features` and `labels`, which contain input and target tensors, as well as `mode` and `params`, which configure the model.

**Exercise (not graded):** Read through the code for `classifier_model_fn()` in `models.py`. Where is the code you wrote in Part (b) called? Where is the loss function set up, and what loss is used? How is the optimizer set up, and what options are available? What types of predictions are returned in the `predictions` dict?

Using this function directly, we can write a simple training loop that directly feeds data from Python into a TensorFlow session. This will be fast, but very barebones:

In [8]:
import models; reload(models)

x, ns, y = train_x, train_ns, train_y
batch_size = 32

# Specify model hyperparameters as used by model_fn
model_params = dict(V=ds.vocab.size, embed_dim=100, hidden_dims=[50, 25], num_classes=len(ds.target_names),
                    encoder_type='bow',
                    lr=0.1, optimizer='adagrad', beta=0)
# model_params['embed_vecs'] = hands.W
model_fn = models.classifier_model_fn

total_batches = 0
total_examples = 0
total_loss = 0
loss_ema = np.log(2)  # track exponential-moving-average of loss
ema_decay = np.exp(-1/10)  # decay parameter for moving average = np.exp(-1/history_length)
with tf.Graph().as_default(), tf.Session() as sess:
    ##
    # Construct the graph here. No session.run calls - just wiring up Tensors.
    ##
    # Add placeholders so we can feed in data.
    x_ph_  = tf.placeholder(tf.int32, shape=[None, x.shape[1]])  # [batch_size, max_len]
    ns_ph_ = tf.placeholder(tf.int32, shape=[None])              # [batch_size]
    y_ph_  = tf.placeholder(tf.int32, shape=[None])              # [batch_size]
    
    # Construct the graph using model_fn
    features = {"ids": x_ph_, "ns": ns_ph_}  # note that values are Tensors
    estimator_spec = model_fn(features, labels=y_ph_, mode=tf.estimator.ModeKeys.TRAIN,
                              params=model_params)
    loss_     = estimator_spec.loss
    train_op_ = estimator_spec.train_op
    
    ##
    # Done constructing the graph, now we can make session.run calls.
    ##
    sess.run(tf.global_variables_initializer())
    
    # Run a single epoch
    t0 = time.time()
    for (bx, bns, by) in utils.multi_batch_generator(batch_size, x, ns, y):
        # feed NumPy arrays into the placeholder Tensors
        feed_dict = {x_ph_: bx, ns_ph_: bns, y_ph_: by}
        batch_loss, _ = sess.run([loss_, train_op_], feed_dict=feed_dict)
        
        # Compute some statistics
        total_batches += 1
        total_examples += len(bx)
        total_loss += batch_loss * len(bx)  # re-scale, since batch loss is mean
        # Compute moving average to smooth out noisy per-batch loss
        loss_ema = ema_decay * loss_ema + (1 - ema_decay) * batch_loss
        
        if (total_batches % 25 == 0):
            print("{:5,} examples, moving-average loss {:.2f}".format(total_examples, 
                                                                      loss_ema))    
    print("Completed one epoch in {:s}".format(utils.pretty_timedelta(since=t0)))

  800 examples, moving-average loss 0.33
1,600 examples, moving-average loss 0.22
2,400 examples, moving-average loss 0.19
3,200 examples, moving-average loss 0.28
4,000 examples, moving-average loss 0.40
4,800 examples, moving-average loss 0.29
5,600 examples, moving-average loss 0.30
6,400 examples, moving-average loss 0.29
Completed one epoch in 0:00:00
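
The moving-average loss printed above follows the update `loss_ema = ema_decay * loss_ema + (1 - ema_decay) * batch_loss`. The sketch below isolates that update in plain Python; the function name `update_ema` and the constant sample loss are illustrative assumptions.

```python
import math

def update_ema(ema, value, history_length=10):
    """One step of an exponential moving average with decay exp(-1/history_length)."""
    decay = math.exp(-1.0 / history_length)
    return decay * ema + (1.0 - decay) * value

# Start from log(2), the cross-entropy of a uniform guess over two classes,
# and feed in a constant (made-up) batch loss of 0.3.
ema = math.log(2)
history = []
for _ in range(50):
    ema = update_ema(ema, 0.3)
    history.append(ema)
```

With a decay of `exp(-1/10)`, the gap between the average and a steady loss shrinks by a factor of roughly 0.905 per batch, so the average mostly reflects the last ten or so batches.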


# Part (c): Training a Neural Network with tf.Estimator

As you see above, there's a lot of boilerplate involved with training a model - we need to instantiate the graph, manage a TensorFlow session, and manually feed data for each batch. This can get tedious, especially as we add support for checkpointing, saving models, and tracking statistics during training. (And as you build more complex models, these are _definitely_ things you'll want!) To streamline this process, we can use a high-level API like `tf.Estimator`.

The Estimator API allows us to define custom models, then provides an `Estimator` object that exposes `train()`, `evaluate()`, and `predict()` functions through an interface similar to scikit-learn's. Take a few minutes to skim through the main documentation:

- [TensorFlow Estimator API](https://www.tensorflow.org/extend/estimators)
- [Estimators in 'Effective TensorFlow'](https://github.com/vahidk/EffectiveTensorflow#tf_learn) (advanced)

### Model Functions (model_fn)

The Estimator API is a functional interface, built around the idea of a `model_fn`. A `model_fn` is just a function that follows a specific interface, and when called constructs a graph of TensorFlow variables and ops that constitutes your model. Here's an example of what one looks like:

```python
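def my_model_fn(features, labels, mode, params):
    x_ = features['x']
    logits_ = my_network(x_, hidden_dims=params['hidden_dims'],
                         foo=params['foo'], bar=params['bar'])

    predictions_dict = {"max": tf.argmax(logits_, 1)}
    if mode == tf.estimator.ModeKeys.PREDICT:
        # No labels at inference time, so return predictions only.
        return tf.estimator.EstimatorSpec(mode=mode,
                                          predictions=predictions_dict)

    loss_ = my_loss_fn(logits_, labels)
    # Any optimizer works here; Adagrad and params['lr'] are just for illustration.
    optimizer = tf.train.AdagradOptimizer(learning_rate=params['lr'])
    train_op_ = optimizer.minimize(loss_, global_step=tf.train.get_global_step())
    eval_metrics = {"accuracy": tf.metrics.accuracy(labels=labels,
                                                    predictions=predictions_dict['max'])}
    return tf.estimator.EstimatorSpec(mode=mode,
                                      predictions=predictions_dict,
                                      loss=loss_,
                                      train_op=train_op_,
                                      eval_metric_ops=eval_metrics)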
```
You can read more about the arguments here: 
- [Constructing the model_fn](https://www.tensorflow.org/extend/estimators#constructing_the_model_fn)

The Estimator API takes a pointer to this _function_, then calls it internally to instantiate your model in the appropriate context. This allows it to handle things like writing and restoring checkpoints automatically, as well as feeding data to the model during training and evaluation. 

### Input Functions (input_fn)

Data feeding is handled by an `input_fn`, which takes the place of the placeholder variables and `feed_dict` we'd otherwise need. The `input_fn` is defined separately from the `model_fn`, and builds the part of the graph up to `features` and `labels`.

We won't write our own `input_fn` in this assignment, but instead we can just use the existing `numpy_input_fn` implementation. This takes NumPy arrays as inputs, and creates an `input_fn` that will generate minibatches:

```python
train_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": train_x, "ns": train_ns}, y=train_y,
                    batch_size=32, num_epochs=20, shuffle=True
                 )
```

You can read more about `input_fn`-s here: 
- [Building Input Functions with tf.Estimator](https://www.tensorflow.org/get_started/input_fn)

**Note:** for this assignment, we'll use a patched version of `tf.estimator.inputs.numpy_input_fn` included with this assignment. This version allows us to seed the random number generator so that training data is shuffled but deterministic.
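
To give a feel for what such an input function does under the hood, here is a minimal pure-NumPy sketch of seeded, shuffled minibatching. The generator `make_batches` is hypothetical; it is neither the TensorFlow implementation nor the patched version shipped with the assignment, and it reshuffles once per epoch rather than using a queue.

```python
import numpy as np

def make_batches(features, labels, batch_size, num_epochs, shuffle=True, seed=None):
    """Hypothetical generator yielding (feature_dict, labels) minibatches."""
    rng = np.random.RandomState(seed)
    n = len(labels)
    for _ in range(num_epochs):
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield {k: v[idx] for k, v in features.items()}, labels[idx]

# Toy data: 5 examples with a single feature named "ids" (made-up values).
toy_x = {"ids": np.arange(10).reshape(5, 2)}
toy_y = np.array([0, 1, 0, 1, 1])

run_a = [y.tolist() for _, y in make_batches(toy_x, toy_y, 2, num_epochs=1, seed=42)]
run_b = [y.tolist() for _, y in make_batches(toy_x, toy_y, 2, num_epochs=1, seed=42)]
```

Because both runs use the same seed, `run_a` and `run_b` contain identical batches, which is the kind of reproducibility the seeded input function is meant to provide.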

### Building an Estimator

With a `model_fn` and an `input_fn` in hand, we can now build and train an Estimator with just a couple of lines:

```python
model_params = dict(...)   # passed as 'params' to the model_fn
model = tf.estimator.Estimator(model_fn=my_model_fn, 
                               params=model_params,
                               model_dir="/tmp/my_model_checkpoints")
model.train(input_fn=train_input_fn)
```

The last line will kick off a train loop, ingesting data until the `input_fn` runs dry (20 epochs, for the one above). We can then evaluate on labeled data by calling `model.evaluate(input_fn=...)`, and run inference on unlabeled data by calling `model.predict(input_fn=...)` with appropriate `input_fn`-s.

_**Note:** You might be wondering why TensorFlow adds all this boilerplate on top of the actual model. It doesn't seem necessary for small-scale experiments like this assignment, but as soon as you scale up to models that take hours, days, or even weeks to train, having robust checkpoint management, live dashboards, and distributed data queues really starts to pay off!_

# Part (d): Training and Evaluation

The cell below defines some model params and sets up a checkpoint directory for TensorBoard.

Use the following default parameters to start, as given below in `model_params`:
```python
embed_dim = 50
hidden_dims = [25]  # single hidden layer
optimizer = 'adagrad'
lr = 0.1  # learning rate
beta = 0.001  # L2 regularization
```

**Note:** Due to a bug in TensorFlow, if you re-use the same checkpoint directory (even after deleting the contents) it will sometimes fail to write the event data for TensorBoard. To work around this, the code below creates a new checkpoint directory each time with a name derived from the timestamp. You may want to delete these after a few runs, since they can take up ~35MB each. To do so just run:

```sh
# On command line, from the notebook's working directory
rm -rfv tmp/tf_bow_sst_*
```

In [9]:
import models; reload(models)

# Specify model hyperparameters as used by model_fn
model_params = dict(V=ds.vocab.size, embed_dim=50, hidden_dims=[25], num_classes=len(ds.target_names),
                    encoder_type='bow',
                    lr=0.1, optimizer='adagrad', beta=0.001)

def make_estimator(model_params):
    tmp_dir = os.path.join(os.getcwd(), "tmp")
    if not os.path.isdir(tmp_dir): os.mkdir(tmp_dir)
    checkpoint_dir = os.path.join(tmp_dir, "tf_bow_sst_" + datetime.datetime.now().strftime("%Y%m%d-%H%M"))
    # Start from a clean slate if a directory with this timestamp already exists.
    if os.path.isdir(checkpoint_dir):
        shutil.rmtree(checkpoint_dir)

    # Write vocabulary to file, so TensorBoard can label embeddings.
    # creates checkpoint_dir/projector_config.pbtxt and checkpoint_dir/metadata.tsv
    ds.vocab.write_projector_config(checkpoint_dir, "Encoder/Embedding_Layer/W_embed")

    run_config = tf.estimator.RunConfig(tf_random_seed=42)
    model = tf.estimator.Estimator(model_fn=models.classifier_model_fn, 
                                   params=model_params,
                                   config=run_config, 
                                   model_dir=checkpoint_dir)
    print("")
    print("To view training (once it starts), run:\n")
    print("    conda activate jsalt; tensorboard --logdir='{:s}' --port 6006".format(checkpoint_dir))
    print("\nThen in your browser, open: http://localhost:6006")
    
    return model
    
model = make_estimator(model_params)

Vocabulary (400,003 words) written to '/home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/metadata.tsv'
Projector config written to /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/projector_config.pbtxt
INFO:tensorflow:Using config: {'_model_dir': '/home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853', '_tf_random_seed': 42, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff14d455358>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To view training (once it

Now run the cell below to start training! If you run TensorBoard from the command line, you should see loss curves in the "Scalars" tab as training progresses. We've set it up to run an evaluation on the dev set every `train_params['eval_every']` epochs, and this should appear in the same tab as a blue line after a couple minutes.

Using the default `model_params` above and the following training params, as given in `train_params` below:
```python
batch_size = 32
total_epochs = 20
eval_every = 2  # every 2 epochs, eval the dev set
```
Your model should train very quickly - 20 epochs in under two minutes on a single-core GCE instance.

After 20 epochs, your loss curves should look something like this:

![Loss curves](images/tensorboard_curves.png)

Don't worry if they don't match exactly - colors may vary, and the red dot labeled "eval\_test" won't appear until you run the evaluation cell below. There are also some other curves that you might see: "global\_step/sec" is the number of minibatches per second that the model processes, and the "enqueue\_input/..." plot has to do with the feeder queues that the Estimator API uses to stream data to the model.
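
As a sanity check on the step counts reported in the training log, the arithmetic below estimates the number of minibatch steps. It assumes 6,920 root-level training examples (the first dimension of the training arrays in this run) and that batches may span epoch boundaries within a single `train()` call; the second assumption is inferred from the logged checkpoint steps rather than taken from the TensorFlow documentation.

```python
import math

num_train = 6920      # root-level training examples in this run
batch_size = 32
eval_every = 2        # epochs per call to model.train()
total_epochs = 20

# Steps in one train() call, assuming batches may span epoch boundaries.
steps_per_call = math.ceil(num_train * eval_every / batch_size)

# Total steps after all train() calls.
total_steps = steps_per_call * (total_epochs // eval_every)
```

This gives 433 steps per call and 4,330 steps in total, which agrees with the checkpoint steps in the log for this run.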

In [10]:
# Training params, just used in this cell for the input_fn-s
# change total epochs to test!
# train_params = dict(batch_size=32, total_epochs=2, eval_every=2)
train_params = dict(batch_size=32, total_epochs=20, eval_every=2)
assert(train_params['total_epochs'] % train_params['eval_every'] == 0)

# Construct and train the model, saving checkpoints to the directory above.
# Input function for training set batches
# Do 'eval_every' epochs at once, followed by evaluating on the dev set.
# NOTE: use patched_numpy_io.numpy_input_fn instead of tf.estimator.inputs.numpy_input_fn
train_input_fn = patched_numpy_io.numpy_input_fn(
                    x={"ids": train_x, "ns": train_ns}, y=train_y,
                    batch_size=train_params['batch_size'], 
                    num_epochs=train_params['eval_every'], shuffle=True, seed=42,
                 )

# Input function for dev set batches. As above, but:
# - Don't randomize order
# - Iterate exactly once (one epoch)
dev_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": dev_x, "ns": dev_ns}, y=dev_y,
                    batch_size=128, num_epochs=1, shuffle=False
                )

for _ in range(train_params['total_epochs'] // train_params['eval_every']):
    # Train for a few epochs, then evaluate on dev
    model.train(input_fn=train_input_fn)
    eval_metrics = model.evaluate(input_fn=dev_input_fn, name="dev")

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt.
INFO:tensorflow:loss = 0.7497469, step = 1
INFO:tensorflow:global_step/sec: 338.453
INFO:tensorflow:loss = 0.40973058, step = 101 (0.297 sec)
INFO:tensorflow:global_step/sec: 401.767
INFO:tensorflow:loss = 0.30776173, step = 201 (0.249 sec)
INFO:tensorflow:global_step/sec: 446.361
INFO:tensorflow:loss = 0.39250714, step = 301 (0.224 sec)
INFO:tensorflow:global_step/sec: 418.215
INFO:tensorflow:loss = 0.22943294, step = 401 (0.239 sec)
INFO:tensorflow:Saving checkpoints for 433 into /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt.
INFO:tensorflow:Loss for fin

INFO:tensorflow:Starting evaluation at 2018-06-21-18:54:07
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt-2165
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-21-18:54:08
INFO:tensorflow:Saving dict for global step 2165: accuracy = 0.7614679, cross_entropy_loss = 1.1556056, global_step = 2165, loss = 1.209686
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt-2165
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 2166 into /home/iftenney/jsalt/summerschool/assignme

INFO:tensorflow:global_step/sec: 344.273
INFO:tensorflow:loss = 0.035007425, step = 3998 (0.292 sec)
INFO:tensorflow:global_step/sec: 431.059
INFO:tensorflow:loss = 0.035551317, step = 4098 (0.232 sec)
INFO:tensorflow:global_step/sec: 438.278
INFO:tensorflow:loss = 0.03541856, step = 4198 (0.228 sec)
INFO:tensorflow:global_step/sec: 435.222
INFO:tensorflow:loss = 0.03548367, step = 4298 (0.230 sec)
INFO:tensorflow:Saving checkpoints for 4330 into /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt.
INFO:tensorflow:Loss for final step: 0.033432286.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-21-18:54:24
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt-4330
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.

## Evaluating Your Model

To evaluate on the test set, we just need to construct another `input_fn`, then call `model.evaluate`. 

**1.)** Fill in the cell below, and run it to compute accuracy on the test set. With the default parameters, you should get accuracy around 77%.

In [11]:
#### YOUR CODE HERE ####
# Code for Part (d).1
test_input_fn = None  # replace with an input_fn, similar to dev_input_fn
test_input_fn = tf.estimator.inputs.numpy_input_fn(              #--SOLUTION--
                    x={"ids": test_x, "ns": test_ns}, y=test_y,  #--SOLUTION--
                    batch_size=128, num_epochs=1, shuffle=False  #--SOLUTION--
                )                                                #--SOLUTION--

eval_metrics = None  # replace with result of model.evaluate(...)
eval_metrics = model.evaluate(input_fn=test_input_fn, name="test")  #--SOLUTION--

#### END(YOUR CODE) ####
print("Accuracy on test set: {:.02%}".format(eval_metrics['accuracy']))
eval_metrics

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-21-18:54:25
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt-4330
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-21-18:54:25
INFO:tensorflow:Saving dict for global step 4330: accuracy = 0.778693, cross_entropy_loss = 1.0988533, global_step = 4330, loss = 1.1398171
Accuracy on test set: 77.87%


{'accuracy': 0.778693,
 'cross_entropy_loss': 1.0988533,
 'global_step': 4330,
 'loss': 1.1398171}

We can also evaluate the old-fashioned way, by calling `model.predict(...)` and working with the predicted labels directly:

In [12]:
from sklearn.metrics import accuracy_score
predictions = list(model.predict(test_input_fn))  # list of dicts
y_pred = [p['max'] for p in predictions]
acc = accuracy_score(test_y, y_pred)  # scikit-learn expects (y_true, y_pred)
print("Accuracy on test set: {:.02%}".format(acc))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1853/model.ckpt-4330
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Accuracy on test set: 77.87%
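
As a sanity check on the `predict` route, the same accuracy can be computed with plain NumPy. The sketch below uses a small hand-written `predictions` list as a hypothetical stand-in for the output of `model.predict(...)`. Only the `'max'` key is taken from the cell above; the labels are illustrative.

```python
import numpy as np

# Hypothetical stand-in for list(model.predict(test_input_fn)):
# one dict per example, with the predicted label under 'max'.
predictions = [{"max": 1}, {"max": 0}, {"max": 1}, {"max": 1}]
y_true = np.array([1, 0, 0, 1])  # illustrative gold labels, not SST data

y_pred = np.array([p["max"] for p in predictions])

# Accuracy is the fraction of examples whose prediction matches the gold label.
acc = float(np.mean(y_pred == y_true))
print("Accuracy: {:.02%}".format(acc))
```

Three of the four toy predictions match, so this prints an accuracy of 75.00%.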


# Part (e): GloVe vectors

Let's see if we can improve performance by using pre-trained GloVe vectors. While the SST dataset contains only hundreds of thousands of tokens (and thousands of target labels), unsupervised methods like GloVe can be trained on billions of tokens of unlabeled text. Here we treat the GloVe vectors as fixed and sum them over each sentence's tokens, which gives one feature vector per example.

In [13]:
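def get_glove_vectors(X, hands):
    """Sum the GloVe vectors of the tokens in each row of X.

    X is a [num_examples, max_len] array of token ids, with 0 as padding;
    hands.W is the embedding matrix and hands.ndim its dimensionality.
    """
    output_mat = np.zeros((X.shape[0], hands.ndim))
    for i, row in enumerate(X):
        for token_id in row:
            if token_id == 0:
                continue  # skip padding
            output_mat[i] += hands.W[token_id]
    return output_mat.astype(np.float32)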

train_g = get_glove_vectors(train_x, hands); print("Train GloVe (summed): ", train_g.shape)
dev_g   = get_glove_vectors(dev_x, hands);   print("Dev GloVe (summed):   ", dev_g.shape)
test_g  = get_glove_vectors(test_x, hands);  print("Test GloVe (summed):  ", test_g.shape)

Train GloVe (summed):  (6920, 100)
Dev GloVe (summed):    (872, 100)
Test GloVe (summed):   (1821, 100)
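
The double loop in `get_glove_vectors` can also be written as a single vectorized NumPy expression. The sketch below is one possible way to do this, using a small random matrix `W` as a hypothetical stand-in for `hands.W`. The helper name `sum_embeddings` and the toy data are illustrative, not part of the assignment code.

```python
import numpy as np

def sum_embeddings(X, W, pad_id=0):
    """Sum rows of W for each row of token ids in X, ignoring padding."""
    mask = (X != pad_id).astype(W.dtype)   # [num_examples, max_len]
    gathered = W[X]                        # [num_examples, max_len, dim]
    # Zero out padding positions, then sum over the token axis.
    return (gathered * mask[:, :, None]).sum(axis=1).astype(np.float32)

# Toy data: a 5-word vocabulary with 3-dimensional vectors (illustrative only).
rng = np.random.RandomState(0)
W = rng.randn(5, 3)
X = np.array([[1, 2, 0, 0],
              [3, 3, 4, 1]])

out = sum_embeddings(X, W)

# Reference result from an explicit loop, as in the cell above.
ref = np.zeros((X.shape[0], W.shape[1]))
for i, row in enumerate(X):
    for token_id in row:
        if token_id != 0:
            ref[i] += W[token_id]

print(out.shape, np.allclose(out, ref, atol=1e-6))
```

The vectorized form materializes a `[num_examples, max_len, dim]` array, so it trades memory for speed. On a very large dataset it may be preferable to process the rows in chunks.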


In [14]:
reload(models)

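def train_and_evaluate_glove(model, train_params):
    """Train on summed GloVe features, evaluating on dev every few epochs.

    Returns the evaluation metrics on the test set.
    """
    # Each call to model.train() consumes this input_fn once,
    # i.e. runs for eval_every epochs.
    train_input_fn = patched_numpy_io.numpy_input_fn(
                        x={"x": train_g}, y=train_y,
                        batch_size=train_params['batch_size'],
                        num_epochs=train_params['eval_every'], shuffle=True, seed=42,
                     )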

    # Input function for dev set batches. As above, but:
    # - Don't randomize order
    # - Iterate exactly once (one epoch)
    dev_input_fn = tf.estimator.inputs.numpy_input_fn(
                        x={"x": dev_g}, y=dev_y,
                        batch_size=128, num_epochs=1, shuffle=False
                    )
    

    for _ in range(train_params['total_epochs'] // train_params['eval_every']):
        # Train for a few epochs, then evaluate on dev
        model.train(input_fn=train_input_fn)
        eval_metrics = model.evaluate(input_fn=dev_input_fn, name="dev")
        
    test_input_fn = tf.estimator.inputs.numpy_input_fn(
                        x={"x": test_g}, y=test_y,
                        batch_size=128, num_epochs=1, shuffle=False
                    )
    return model.evaluate(input_fn=test_input_fn, name="test")


model_params_glove = copy.deepcopy(model_params)
model_params_glove['hidden_dims'] = [25, 10]
model_params_glove['lr'] = 0.01
# model_params_glove['optimizer'] = 'sgd'
# model_params_glove['beta'] = 0.05
model_params_glove['encoder_type'] = 'mlp'  # directly classify from vectors
model = make_estimator(model_params_glove)

train_params = dict(batch_size=32, total_epochs=20, eval_every=2)
train_and_evaluate_glove(model, train_params)

Vocabulary (400,003 words) written to '/home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1854/metadata.tsv'
Projector config written to /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1854/projector_config.pbtxt
INFO:tensorflow:Using config: {'_model_dir': '/home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1854', '_tf_random_seed': 42, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff14b2f22b0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To view training (once it

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1854/model.ckpt-1732
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1733 into /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1854/model.ckpt.
INFO:tensorflow:loss = 0.30640823, step = 1733
INFO:tensorflow:global_step/sec: 661.13
INFO:tensorflow:loss = 0.35540166, step = 1833 (0.153 sec)
INFO:tensorflow:global_step/sec: 895.895
INFO:tensorflow:loss = 0.32820868, step = 1933 (0.112 sec)
INFO:tensorflow:global_step/sec: 853.316
INFO:tensorflow:loss = 0.47953492, step = 2033 (0.118 sec)
INFO:tensorflow:global_step/sec: 835.492
INFO:tensorflow:loss = 0.4065003, step = 2133 (0.120 sec)
INFO:tensorflow:Saving checkpoints for 2165 into /home/iftenney/jsalt/summerschool/assignment/

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-21-18:54:42
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1854/model.ckpt-3897
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-21-18:54:43
INFO:tensorflow:Saving dict for global step 3897: accuracy = 0.6857798, cross_entropy_loss = 0.62961435, global_step = 3897, loss = 0.66102445
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/iftenney/jsalt/summerschool/assignment/classifier-dev/tmp/tf_bow_sst_20180621-1854/model.ckpt-3897
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow

{'accuracy': 0.66886324,
 'cross_entropy_loss': 0.6390379,
 'global_step': 4330,
 'loss': 0.670235}

In fact, since we don't need an explicit embedding layer (so long as the GloVe vectors are taken as fixed), we can even train a simple linear model over these vectors. Here it performs quite well, reaching 76.06% test accuracy versus 66.89% for the MLP above:

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a linear classifier directly on the summed GloVe features.
lr = LogisticRegression()
lr.fit(train_g, train_y)
y_pred = lr.predict(test_g)

acc = accuracy_score(test_y, y_pred)  # sklearn convention: (y_true, y_pred)
print("Accuracy on test set: {:.02%}".format(acc))

Accuracy on test set: 76.06%


# Part (f): ELMo Contextual Vectors

In [16]:
import numpy as np
import elmo_runner; reload(elmo_runner)
from tqdm import tqdm

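def tokens_to_elmo(tokens_input):
    """Run ELMo over tokenized sentences and stack the per-sentence vectors."""
    gen = elmo_runner.process_elmo(tokens_input, bow=True)
    # Materialize first: newer NumPy versions reject a generator in np.vstack.
    t = np.vstack(list(gen))
    return t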

In [17]:
train_tok, train_y = ds.as_token_list('train', root_only=True)
train_elmo = tokens_to_elmo(tqdm(train_tok))

print(train_elmo.shape)
# np.save(os.path.join(os.getcwd(), "sst.train.elmo_bow.npy"), train_elmo)

  0%|          | 0/6920 [00:00<?, ?it/s]

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
INFO:tensorflow:Initialize variable module/aggregation/scaling:0 from checkpoint b'/tmp/tfhub_modules/9bb74bc86f9caffc8c47dd7b33ec4bb354d9602d/variables/variables' with aggregation/scaling
INFO:tensorflow:Initialize variable module/aggregation/weights:0 from checkpoint b'/tmp/tfhub_modules/9bb74bc86f9caffc8c47dd7b33ec4bb354d9602d/variables/variables' with aggregation/weights
INFO:tensorflow:Initialize variable module/bilm/CNN/W_cnn_0:0 from checkpoint b'/tmp/tfhub_modules/9bb74bc86f9caffc8c47dd7b33ec4bb354d9602d/variables/variables' with bilm/CNN/W_cnn_0
INFO:tensorflow:Initialize variable module/bilm/CNN/W_cnn_1:0 from checkpoint b'/tmp/tfhub_modules/9bb74bc86f9caffc8c47dd7b33ec4bb354d9602d/variables/variables' with bilm/CNN/W_cnn_1
INFO:tensorflow:Initialize variable module/bilm/CNN/W_cnn_2:0 from checkpoint b'/tmp/tfhub_modules/9bb74bc86f9caffc8c47dd7b33ec4bb354d9602d/variables/variables' with bilm/CNN/W_cnn_2
INFO:tensorflo

  5%|▍         | 332/6920 [02:27<48:49,  2.25it/s] 

KeyboardInterrupt: 

In [None]:
dev_tok, dev_y = ds.as_token_list('dev', root_only=True)
dev_elmo = tokens_to_elmo(tqdm(dev_tok))

print(dev_elmo.shape)
np.save(os.path.join(os.getcwd(), "sst.dev.elmo_bow.npy"), dev_elmo)

In [None]:
test_tok, test_y = ds.as_token_list('test', root_only=True)
test_elmo = tokens_to_elmo(tqdm(test_tok))

print(test_elmo.shape)
np.save(os.path.join(os.getcwd(), "sst.test.elmo_bow.npy"), test_elmo)
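
Because the ELMo pass is slow (the training-set run above was interrupted at 5% after about two and a half minutes), it can help to compute each array once and reload it from disk afterwards. The sketch below shows one way to wrap `np.save` and `np.load` for this. The helper name `load_or_compute` and the cheap `fake_features` function are hypothetical, and stand in for the real `tokens_to_elmo` call.

```python
import os
import tempfile
import numpy as np

def load_or_compute(path, compute_fn):
    """Return the array saved at `path`, computing and saving it first if needed."""
    if os.path.exists(path):
        return np.load(path)
    arr = compute_fn()
    np.save(path, arr)
    return arr

# Cheap stand-in for the slow ELMo computation; records how often it runs.
calls = []
def fake_features():
    calls.append(1)
    return np.ones((4, 8), dtype=np.float32)

cache_path = os.path.join(tempfile.mkdtemp(), "demo.elmo_bow.npy")
first = load_or_compute(cache_path, fake_features)   # computes and saves
second = load_or_compute(cache_path, fake_features)  # loads from disk
print(len(calls), np.array_equal(first, second))
```

In the notebook, the same pattern could be applied as `load_or_compute("sst.dev.elmo_bow.npy", lambda: tokens_to_elmo(tqdm(dev_tok)))`, assuming the file name used in the cells above.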

In [None]:
reload(elmo_runner)

In [None]:
import tensorflow_hub as hub
with tf.Graph().as_default():
    elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
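    # Sentences are padded to a common length with empty strings;
    # sequence_len gives the true (unpadded) length of each one.
    tokens_input = [["the", "cat", "is", "on", "the", "mat"],
                    ["dogs", "are", "in", "the", "fog", ""]]
    tokens_length = [6, 5]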
    embeddings = elmo(
        inputs={
            "tokens": tokens_input,
            "sequence_len": tokens_length
        },
        signature="tokens",
        as_dict=True)["elmo"]
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        emb_np = sess.run(embeddings)

In [None]:
emb_np.shape
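
The `elmo` output of this TF Hub module has shape `[batch_size, max_length, 1024]`, so padded positions must be excluded when the per-token vectors are reduced to one bag-of-words vector per sentence. The sketch below shows a masked sum in NumPy. It uses a random array with a small feature dimension as a stand-in for `emb_np`, and it assumes a sum, mirroring the GloVe cell. How `elmo_runner.process_elmo(..., bow=True)` actually aggregates is not shown in this notebook.

```python
import numpy as np

def masked_sum(emb, lengths):
    """Sum emb[i, :lengths[i], :] over the token axis for each example i."""
    max_len = emb.shape[1]
    # mask[i, t] is True for real tokens and False for padding positions.
    mask = np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]
    return (emb * mask[:, :, None]).sum(axis=1)

# Random stand-in for emb_np, with dimension 4 instead of 1024 (illustrative).
rng = np.random.RandomState(0)
emb = rng.randn(2, 6, 4)
lengths = [6, 5]  # same lengths as tokens_length above

bow = masked_sum(emb, lengths)
print(bow.shape)
```

Dividing each row of `bow` by its sentence length would give a mean instead of a sum, which keeps long and short sentences on the same scale.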