# Convolutional Neural Networks

This notebook introduces convolutional neural networks (CNNs), a more powerful classification model than the Neural Bag-of-Words (BOW) model you explored earlier.

## Outline

- **Part (g):** Model Architecture
- **Part (h):** Implementing the CNN Model
- **Part (i):** Training and Evaluation
- **Part (j):** Tuning

This section of the assignment is similar to the one on Neural BOW models, parts (d)-(f). Part (g) has 4 questions, part (h) asks for a model implementation, part (i) asks you to train and evaluate the model, and part (j) has 5 questions on model tuning.

In [1]:
from __future__ import division
import os, sys, re, json, time, datetime, shutil
import itertools, collections
from importlib import reload
from IPython.display import display, HTML

# NLTK for NLP utils and corpora
import nltk

# NumPy and TensorFlow
import numpy as np
import pandas as pd
import tensorflow as tf

# Helper libraries
from w266_common import utils, vocabulary, tf_embed_viz, treeviz
from w266_common import patched_numpy_io
# Code for this assignment
import sst, models, models_test

# Monkey-patch NLTK with better Tree display that works on Cloud or other display-less server.
print("Overriding nltk.tree.Tree pretty-printing to use custom GraphViz.")
treeviz.monkey_patch(nltk.tree.Tree, node_style_fn=sst.sst_node_style, format='svg')

# Part (g): Model Architecture

CNNs are a more sophisticated neural model for sentence classification than the Neural BOW model we saw in the last section. CNNs operate by sweeping a collection of filters over a text. Each filter produces a sequence of feature values known as a _feature map_. In one of the most basic formulations, introduced by [Kim (2014)](https://www.aclweb.org/anthology/D14-1181), a single layer of _pooling_ is used to summarize the _feature maps_ as a fixed-length vector. The fixed-length vector is then fed to the output layer to produce classification labels. A popular choice for the pooling operation is to take the maximum feature value from each _feature map_.

![Convolutional Neural Network from Kim 2014](kim_2014_figure_1_cnn.png)
*CNN model architecture, Figure 1 from Kim (2014)*
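
To make the filter-sweeping idea concrete, here is a minimal NumPy sketch of a single filter producing a feature map, followed by max pooling. The embeddings, the filter weights, and the choice of ReLU as the non-linearity are made-up illustrative assumptions, not values or settings from the assignment.

```python
import numpy as np

# Toy sentence of 5 tokens, each with a 2-dimensional embedding (made-up values).
x = np.array([[ 0.1,  0.3],
              [ 0.9, -0.2],
              [ 0.8,  0.7],
              [-0.5,  0.1],
              [ 0.0,  0.4]])

h = 2                               # filter width: each window covers h words
w = np.array([1.0, 0.0, 1.0, 0.0])  # one filter over a flattened window (h * d weights)
b = 0.0                             # filter bias

def feature_map(x, w, b, h):
    """Slide one filter over every window of h consecutive words (stride 1, no padding)."""
    n_windows = x.shape[0] - h + 1
    values = []
    for i in range(n_windows):
        window = x[i:i + h].reshape(-1)          # concatenate the h embeddings
        values.append(max(0.0, window @ w + b))  # ReLU non-linearity (an assumption)
    return np.array(values)

c = feature_map(x, w, b, h)  # one feature value per window position
c_hat = c.max()              # max pooling keeps the strongest response
print(c, c_hat)
```

The pooled value `c_hat` is a single number regardless of how many windows the sentence has, which is what lets pooling turn variable-length text into a fixed-length vector.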

We'll use the following notation:
- $w^{(i)} \in \mathbb{Z}$, the word id for the $i^{th}$ word of the sequence (as an integer index)
- $x^{(i)} \in \mathbb{R}^d$ for the vector representation (embedding) of $w^{(i)}$
- $x^{i:i+j}$ is the concatenation of $x^{(i)}, x^{(i+1)}, \ldots, x^{(i+j)}$
- $c^{(i)}_{k}$ is the value of the $k^{th}$ feature map at position $i$ of the word sequence; each filter is applied over a window of $h$ words and uses non-linearity $f$.
- $\hat{c}_{k}$ is the value of the $k^{th}$ feature after pooling the feature map over the whole sequence.
- $\hat{C}$ is the concatenation of pooled feature maps. 
- $y$ for the target label ($\in 1,\ldots,\mathtt{num\_classes}$)

Our model is defined as:
- **Embedding layer:** $x^{(i)} = W_{embed}[w^{(i)}]$
- **Convolutional layer:** $c^{(i)}_{k} = f(x^{i:i+h-1} W_k + b)$
- **Pooling layer:** $\hat{c}_{k} = \max(c^{(0)}_{k}, c^{(1)}_{k}, \ldots)$
- **Output layer:** $\hat{y} = \hat{P}(y) = \mathrm{softmax}(\hat{C} W_{out} + b_{out})$
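
The four layers above can be sketched end to end in plain NumPy. This is only an illustration under stated assumptions: the sizes are tiny made-up values, the parameters are random rather than trained, the non-linearity $f$ is taken to be `tanh`, and `cnn_forward` and `conv_params` are hypothetical names that do not correspond to anything in `models.py`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (not the assignment's settings).
V, d, N = 50, 8, 12     # vocabulary size, embedding dim, sequence length
kernel_sizes = [2, 3]   # filter widths h
filters = 4             # filters per width
num_classes = 2

# Randomly initialized (untrained) parameters.
W_embed = rng.normal(size=(V, d))
conv_params = {h: (rng.normal(size=(h * d, filters)), np.zeros(filters))
               for h in kernel_sizes}
W_out = rng.normal(size=(len(kernel_sizes) * filters, num_classes))
b_out = np.zeros(num_classes)

def softmax(z):
    z = z - z.max()     # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cnn_forward(w_ids):
    x = W_embed[w_ids]                        # embedding layer
    pooled = []
    for h in kernel_sizes:
        W_k, b_k = conv_params[h]
        # Stack every window x^{i:i+h-1} as a row (stride 1, no padding).
        windows = np.stack([x[i:i + h].reshape(-1)
                            for i in range(len(w_ids) - h + 1)])
        c = np.tanh(windows @ W_k + b_k)      # feature maps, with f = tanh (assumed)
        pooled.append(c.max(axis=0))          # max over window positions
    C_hat = np.concatenate(pooled)            # fixed-length encoding
    return softmax(C_hat @ W_out + b_out)     # output layer

w_ids = rng.integers(0, V, size=N)            # a random "sentence" of word ids
y_hat = cnn_forward(w_ids)
print(y_hat, y_hat.sum())
```

Calling `cnn_forward` on a shorter id sequence still yields a distribution over `num_classes`, because pooling removes the dependence on sequence length.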


We'll refer to the first part of this model (**Embedding layer**, **Convolutional layer**, and **Pooling layer**) as the **Encoder**: it has the role of encoding the input sequence into a fixed-length vector representation that we pass to the output layer.

We'll also use these as shorthand for important dimensions:
- `V`: the vocabulary size (equal to `ds.vocab.size`)
- `N`: the maximum number of tokens in the input text
- `embed_dim`: the embedding dimension $d$
- `kernel_size`: a list of filter lengths
- `filters`: the number of filters per filter length
- `num_classes`: the number of target classes (2 for the binary task)

## Part (g) Short Answer Questions

Answer the following in the cell below. 

1. Let `embed_dim = d`, `kernel_size = [3, 4, 5]`, `filters=128`, and `num_classes = k`. 
   In terms of these values, the vocabulary size `V` and the maximum sequence length `N`, what are the
   shapes of the following variables: 
   $c^{(i)}_{kernel\_size=3}$, $c^{(i)}_{kernel\_size=4}$, $c^{(i)}_{kernel\_size=5}$, $\hat{c}^{(i)}_{kernel\_size=3}$, $\hat{c}^{(i)}_{kernel\_size=4}$, $\hat{c}^{(i)}_{kernel\_size=5}$, and $\hat{C}$. Assume a stride size of 1. Some of the shapes will depend on whether or not the edges of the sequence are padded.
   
      a. Assuming padding is not used
      (e.g., for tf.nn.max_pool and tf.nn.conv1d, setting padding='VALID'), provide the shapes listed above.

      b. Provide the names and shapes of the tensors that change when padding is used
      (e.g., for tf.nn.max_pool and tf.nn.conv1d, setting padding='SAME').
<p>
2. How many parameters are in each of the convolutional filters, $W_{filter\_length=3}$, $W_{filter\_length=4}$, $W_{filter\_length=5}$? And in the output layer, $W_{out}$?
<p>
<p>
3. Historically NLP models made heavy use of manual feature engineering. In relation to systems with manually engineered features, describe what type of operation is being performed by the convolutional filters.
<p>
4. Suppose that we have two examples, `[foo bar baz]` and `[baz bar foo]`. Will this model make the same predictions on these? Why or why not?

# Part (h): Implementing the CNN Model

We'll implement our CNN model in `models.py`. Since we can re-use most of the code you already wrote for the neural BOW model, you'll only need to implement the `CNN_encoder(...)` that constructs the encoder portion of the CNN described above. Our implementation will differ from [Kim (2014)](https://www.aclweb.org/anthology/D14-1181) in that we will support using multiple dense hidden layers after the convolutional layers.

**Follow the instructions in the code (function docstrings and comments) carefully!**

In particular, for unit tests to work, you shouldn't change (or add) any `tf.name_scope` or `tf.variable_scope` calls, and must name the variables exactly as documented. (Your model may work just fine, of course, but the test harness will throw all sorts of errors!)

To aid debugging and readability, we've adopted a convention that TensorFlow tensors are represented by variables ending in an underscore, such as `W_embed_` or `train_op_`.

**Before you start**, be sure to answer the short-answer questions in Part (g).

You may find the following TensorFlow API functions useful:
- [`tf.layers.conv1d`](https://www.tensorflow.org/api_docs/python/tf/layers/conv1d)
- [`tf.reduce_max`](https://www.tensorflow.org/api_docs/python/tf/math/reduce_max)

**Do your work in `models.py`.** When ready, run the cell below to run the unit tests.

In [2]:
import models; reload(models)
utils.run_tests(models_test, ["TestCNN"])

# Part (i): Training and Evaluation

As with the Neural BOW model, we will now train and evaluate our CNN. The code in this section is very similar to the Neural BOW part of the assignment. Revisit the Neural BOW section if anything looks unfamiliar.

## Training

Run the code below to train the CNN encoder on SST. 

In [3]:
import models; reload(models)
import sst

# Load SST dataset
ds = sst.SSTDataset(V=20000).process(label_scheme="binary")
max_len = 40
train_x, train_ns, train_y = ds.as_padded_array('train', max_len=max_len, root_only=True)
dev_x,   dev_ns,   dev_y   = ds.as_padded_array('dev',   max_len=max_len, root_only=True)
test_x,  test_ns,  test_y  = ds.as_padded_array('test',  max_len=max_len, root_only=True)

# Specify model hyperparameters as used by model_fn
model_params = dict(V=ds.vocab.size, embed_dim=25,  filters=10, kernel_sizes=[2, 3, 4],
                    hidden_dims=[], num_classes=len(ds.target_names), encoder_type='cnn',
                    dropout_rate=0.5, lr=0.1, optimizer='adagrad', beta=0.0)

checkpoint_dir = "/tmp/tf_bow_sst_" + datetime.datetime.now().strftime("%Y%m%d-%H%M")
if os.path.isdir(checkpoint_dir):
    shutil.rmtree(checkpoint_dir)
# Write vocabulary to file, so TensorBoard can label embeddings.
# creates checkpoint_dir/projector_config.pbtxt and checkpoint_dir/metadata.tsv
ds.vocab.write_projector_config(checkpoint_dir, "Encoder/Embedding_Layer/W_embed")

model = tf.estimator.Estimator(model_fn=models.classifier_model_fn, 
                               params=model_params,
                               model_dir=checkpoint_dir)
print("")
print("To view training (once it starts), run:\n")
print("    tensorboard --logdir='{:s}' --port 6006".format(checkpoint_dir))
print("\nThen in your browser, open: http://localhost:6006")

# Training params, just used in this cell for the input_fn-s
train_params = dict(batch_size=32, total_epochs=25, eval_every=5)
assert(train_params['total_epochs'] % train_params['eval_every'] == 0)

# Construct and train the model, saving checkpoints to the directory above.
# Input function for training set batches
# Do 'eval_every' epochs at once, followed by evaluating on the dev set.
# NOTE: use patched_numpy_io.numpy_input_fn instead of tf.estimator.inputs.numpy_input_fn
train_input_fn = patched_numpy_io.numpy_input_fn(
                    x={"ids": train_x, "ns": train_ns}, y=train_y,
                    batch_size=train_params['batch_size'], 
                    num_epochs=train_params['eval_every'], shuffle=True, seed=42
                 )

# Input function for dev set batches. As above, but:
# - Don't randomize order
# - Iterate exactly once (one epoch)
dev_input_fn = tf.estimator.inputs.numpy_input_fn(
                    x={"ids": dev_x, "ns": dev_ns}, y=dev_y,
                    batch_size=128, num_epochs=1, shuffle=False
                )

for _ in range(train_params['total_epochs'] // train_params['eval_every']):
    # Train for a few epochs, then evaluate on dev
    model.train(input_fn=train_input_fn)
    eval_metrics = model.evaluate(input_fn=dev_input_fn, name="dev")

## Evaluation

As in the NeuralBOW section, define the `test_input_fn` and provide the appropriate call to `model.evaluate(...)` to evaluate your CNN model.

**Hint: This should be trivial if you have already completed the NeuralBOW section.**

In [4]:
#### YOUR CODE HERE ####
# Code for Part (i)
test_input_fn = None  # replace with an input_fn, similar to dev_input_fn

eval_metrics = None  # replace with result of model.evaluate(...)

#### END(YOUR CODE) ####
print("Accuracy on test set: {:.02%}".format(eval_metrics['accuracy']))
eval_metrics

# Part (j): Tuning Your Model

We'll once again want to optimize hyperparameters for our model to see if we can improve performance. The CNN model includes a number of new parameters that can significantly influence model performance.

In this section, you will be asked to describe the new parameters as well as use them to attempt to improve the performance of your model.

## Part (j) Short Answer Questions

  1. What model hyperparameters are shared by the NeuralBOW and CNN models?
  2. What new parameters are introduced by the CNN model?
  3. Choose two parameters unique to the CNN model, perform at least 10 runs with different combinations of values for these parameters, and then report the dev set results below. ***Hint: Consider wrapping the training code above in a for loop that examines the different values.*** To do this efficiently, you should consider [this paper](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) from Bergstra and Bengio, which underscores the low efficiency of grid search. [This blog post](https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/) also has a less formal treatment of the same topic.
  4. Describe any trends you see in the experiments above (e.g., can you identify good ranges for the individual parameters; are there any interesting interactions?)
  5. Pick the three best configurations according to the dev set and evaluate them on the test data. Is the ranking of the three best models the same on the dev and test sets?
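
For question 3, one way to follow the random-search suggestion is to sample configurations instead of enumerating a grid. The sketch below uses only the standard library. The candidate values and the helper `sample_config` are hypothetical, chosen purely for illustration, and the training call itself is left as a comment.

```python
import random

def sample_config(rng):
    """Draw one hypothetical CNN configuration at random."""
    # The candidate ranges below are illustrative guesses, not recommendations.
    n_sizes = rng.choice([1, 2, 3])   # how many filter widths to use
    start = rng.choice([2, 3])        # smallest filter width
    return {
        "kernel_sizes": list(range(start, start + n_sizes)),
        "filters": rng.choice([10, 25, 50, 100, 200]),
    }

rng = random.Random(0)                # fixed seed so the trial list is reproducible
configs = []
while len(configs) < 10:
    cfg = sample_config(rng)
    if cfg not in configs:            # skip exact duplicates
        configs.append(cfg)

for cfg in configs:
    # In the notebook, you would merge cfg into model_params, for example
    # dict(model_params, **cfg), then train and record the dev accuracy.
    print(cfg)
```

Each run should use its own `checkpoint_dir` so that one configuration does not resume from another's checkpoints.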