In [9]:
# Run this cell to import everything we'll need.
import tensorflow as tf
import graph
import graph_test
from matplotlib import pyplot as plt
import numpy as np
import unittest

%matplotlib inline

reload(graph)
reload(graph_test)

# Fun with TensorFlow (10 points)

The goal of this section is to familiarize yourself with the Python [TensorFlow API](https://www.tensorflow.org/api_docs/python/index.html). We'll be using TensorFlow throughout the class to implement deep learning models, which are the state-of-the-art on many NLP tasks such as machine translation, sentiment analysis, and language modeling.

### TensorFlow: Declarative Numerical Programming

The TensorFlow programming model has two phases:
1.  **Construct a graph** by running Python code
2.  **Execute the graph** by calling `session.run`

In the **graph construction** phase, we operate on everything symbolically. Executing the Python code doesn't actually do any numerical calculations - it just tells TensorFlow how to do the computation later. Every variable you define here is a **Tensor**, which creates a node in the computation graph.

In the **execution phase**, we give TensorFlow input data and a list of output operations. It runs the data through the graph and returns numerical results as NumPy arrays.

#### Tensor Objects

Tensor objects are the symbolic equivalent of NumPy arrays, and support many similar operations. For example, to compute a linear model $y = vW + b$ in NumPy, you might do:
```python
# v, w, b are np.ndarray
y = np.dot(v, w) + b
```
In TensorFlow, this would be expressed as:
```python
# w, v, b are tf.Tensor
y = tf.matmul(v, w) + b
```

There are a few ways to define Tensors, but the most important are:

- **[Constants and sequences](https://www.tensorflow.org/versions/r0.10/api_docs/python/constant_op.html#constants-sequences-and-random-values)**, like `tf.constant()`, `tf.zeros()`, or `tf.linspace()`. These create a Tensor with a fixed value, and pretty much work like their NumPy equivalents.

- **[Variables](https://www.tensorflow.org/versions/r0.10/how_tos/variables/index.html)**, which are persistent and can be modified during execution. Think model parameters, which get updated by training.

- **[Placeholders](https://www.tensorflow.org/versions/r0.10/api_docs/python/io_ops.html#placeholder)**, which are used for data inputs. You feed these in by passing a NumPy array at execution time.

Operations on tensors - like `tf.matmul()` or `tf.nn.softmax()` - produce other tensors and add additional nodes to the graph.

#### Delayed Execution

The key difference between the NumPy code `y = np.dot(v, w) + b` and the TensorFlow equivalent `y = tf.matmul(v, w) + b` is that the latter _doesn't actually do the computation_. Instead, it tells TensorFlow that `y` is derived by performing the `matmul` operation on `v` and `w`, followed by adding `b`. In order to crunch the numbers, you need to run the graph, such as:
```python
# v is a tf.placeholder of shape (1, 10); w and b are persistent tf.Variables,
# with w of shape (10, 1), since tf.matmul expects matrices
y = tf.matmul(v, w) + b  # Add Op (Tensor) to the graph
y_value = session.run(y, feed_dict={v: np.ones((1, 10))})  # Run the graph
```
where `feed_dict` is how we "feed" input (NumPy arrays) to TensorFlow, and `y_value` will be a NumPy array containing the result of the computation.

This seems clunky for such a simple example - but it will dramatically simplify things when we start working with more complicated models.
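
To make the two phases concrete, here is a toy, pure-Python sketch of deferred execution. It is only an illustration of the idea, not how TensorFlow is implemented: the names `Node`, `Placeholder`, `Constant`, `Add`, `Mul`, and `run` are hypothetical and exist only for this example.

```python
# Toy illustration of "build a graph first, run it later".
# All class and function names here are hypothetical, not TensorFlow APIs.

class Node(object):
    """A symbolic value: it records how to compute, but holds no number."""
    def __add__(self, other):
        return Add(self, other)

    def __mul__(self, other):
        return Mul(self, other)


class Placeholder(Node):
    """An input whose value is supplied at run time via feed_dict."""
    def evaluate(self, feed_dict):
        return feed_dict[self]


class Constant(Node):
    """A fixed value baked into the graph."""
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed_dict):
        return self.value


class Add(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def evaluate(self, feed_dict):
        return self.a.evaluate(feed_dict) + self.b.evaluate(feed_dict)


class Mul(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b

    def evaluate(self, feed_dict):
        return self.a.evaluate(feed_dict) * self.b.evaluate(feed_dict)


def run(node, feed_dict):
    """Walk the graph and compute a number, loosely like session.run."""
    return node.evaluate(feed_dict)


# Phase 1: graph construction. No arithmetic happens here.
v = Placeholder()
w = Constant(3.0)
b = Constant(1.0)
y = v * w + b          # y is an Add node, not a number

# Phase 2: execution. Only now are the numbers crunched.
y_value = run(y, feed_dict={v: 2.0})
print(y_value)         # 7.0
```

The same graph `y` can be run again with a different `feed_dict` without rebuilding anything, which is the property that pays off for larger models.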

## Part 0: Simple Adder

Open [`graph.py`](graph.py).  This file contains a number of class/function placeholders that we will implement through the course of this notebook - as well as a wealth of comments explaining how they work.

Implement the methods of the `AddTwo` class using TensorFlow.  In particular:
- `__init__` should construct a graph that adds the numbers.  Use two placeholders for the numbers to add.
- `Add` should (only) execute the graph with its two arguments and return the result.  It should not create any graph nodes or the session object.

When you are done, execute the next cell to test it, and verify that the unit tests below both pass.

Be sure that your adder can handle parameters of any dimension! (*Hint: TensorFlow will mostly do this automatically, unless you tell it not to.*)

In [2]:
reload(graph)
adder = graph.AddTwo()  # Internally, creates tf.Graph and tf.Session
print adder.Add(40, 2)
print adder.Add([1,2],[3,4])

In [3]:
reload(graph)
reload(graph_test)
unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromName(
        'TestAdder.test_adder', graph_test))

If you didn't already, make sure that your adder can handle parameters of any dimension.

In [4]:
reload(graph)
reload(graph_test)
unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromName(
        'TestAdder.test_vector_adder', graph_test))

## Part 1: Affine & Fully Connected Layers

### Brief Review of Machine Learning

In supervised learning, parametric models are those where the model is a function of a certain form with a number of unknown _parameters_.  Given a loss function and a training set, an optimizer selects parameters that minimize the loss on the training set.  A common optimizer is stochastic gradient descent, which repeatedly tweaks the parameters slightly to move the loss "downhill" on a small batch of examples drawn from the training set.
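
As a rough sketch of that loop, the following pure-Python example runs minibatch gradient descent on a one-feature linear model with a mean squared error loss. The data, learning rate, batch size, and step count are all made up for illustration, and the gradients are written out by hand (TensorFlow would derive them for you).

```python
import random

random.seed(0)

# Hypothetical, noise-free training set drawn from y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-20, 21)]

w, b = 0.0, 0.0          # the model's parameters
learning_rate = 0.1
batch_size = 8


def mean_squared_error(w, b, examples):
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)


initial_loss = mean_squared_error(w, b, data)

for step in range(500):
    # Draw a small batch of examples from the training set.
    batch = random.sample(data, batch_size)
    # Gradient of the batch's mean squared error with respect to w and b.
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / batch_size
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / batch_size
    # Take a small step "downhill".
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_squared_error(w, b, data)
print("w=%.3f b=%.3f loss=%.6f" % (w, b, final_loss))
```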

### Linear & Logistic Regression

You've likely seen linear regression before.  In linear regression, we fit a line (technically, a hyperplane) that predicts a target variable, $y$, based on some features $x$.  The form of this model is affine (even if we call it "linear"):  $\hat{y} = xW + b$, where $W$ and $b$ are the weights and an offset, respectively, and are the parameters of this parametric model.  The loss function that the optimizer uses to fit these parameters is the squared error ($||\hat{y} - y||_2^2$) between the prediction and the ground truth in the training set.

You've also likely seen logistic regression, which is closely related to linear regression.  Logistic regression also fits a line, this time separating the positive and negative examples of a binary classifier.  The form of this model is similar: $\hat{y} = \mathrm{sigmoid}(xW + b)$.  Again, $W$ and $b$ are the parameters of this model.  The loss function that the optimizer uses to fit these parameters is the cross entropy between the prediction and the ground truth in the training set.

This pattern of an affine transform, $xW + b$, occurs over and over in machine learning.
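
The two models differ only in what happens after the affine transform. A minimal pure-Python sketch for a single example follows; the parameter values and helper names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def affine(x, W, b):
    """Compute xW + b for one example x and a weight vector W."""
    return sum(x_i * w_i for x_i, w_i in zip(x, W)) + b


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error(y_hat, y):
    """Loss used for linear regression."""
    return (y_hat - y) ** 2


def cross_entropy(y_hat, y):
    """Binary cross entropy for a label y in {0, 1} and probability y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


# Hypothetical parameters and one example with two features.
W = [0.5, -0.25]
b = 0.1
x = [2.0, 4.0]

z = affine(x, W, b)   # linear regression prediction: 0.5*2 - 0.25*4 + 0.1
p = sigmoid(z)        # logistic regression prediction, a probability in (0, 1)

print("linear prediction:   %.3f" % z)
print("logistic prediction: %.3f" % p)
print("squared error vs y=1:  %.3f" % squared_error(z, 1.0))
print("cross entropy vs y=1:  %.3f" % cross_entropy(p, 1))
```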

In all of these cases, the optimizer needs:

- A batch of examples of $x$ and $y$ from the training set
- Variables to maintain the current values of the parameters of the model
- A loss function
- An optimization strategy

### Your Task

In this section, you don't need to create the graph and session (we've done it for you).  Instead, you will simply implement functions that construct parts of a larger graph.

You will first build an affine layer, $z = xW + b$, and then a stack of fully connected layers (described in more detail below), each implementing $h = f(xW + b)$.  You'll use the former as a building block for the latter.

### 1(a): Affine Layer
In particular, your function will accept a TensorFlow Op that represents the value of $x$ and should return an Op that represents $z$ with the desired output dimension.  You must construct whatever variables you need.

Implement `affine_layer(...)`.

Hints:
- use `tf.get_variable()` to create variables to store the current values of parameters.
- `W` should be randomly initialized using [Xavier initialization](https://www.tensorflow.org/versions/master/api_docs/python/contrib.layers/initializers)
- `b` should be initialized to a vector of zeros
- `a * b` is an element-wise product.  Look for the function that performs matrix products!
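
For intuition about the shapes and the initialization, here is a pure-Python sketch using nested lists. It is not the TensorFlow solution: the helper names are hypothetical, and it assumes the uniform variant of Xavier (Glorot) initialization, which draws weights from $U(-l, l)$ with $l = \sqrt{6 / (\text{fan\_in} + \text{fan\_out})}$.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


def matmul(x, W):
    """Matrix product of x (batch x fan_in) and W (fan_in x fan_out)."""
    return [[sum(row[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))]
            for row in x]


def affine(x, W, b):
    """Compute xW + b, adding the offset b to every row of the product."""
    return [[z_j + b_j for z_j, b_j in zip(row, b)] for row in matmul(x, W)]


rng = random.Random(0)
W = xavier_uniform(3, 1, rng)   # shape (3, 1), randomly initialized
b = [0.0]                        # shape (1,), initialized to zeros
x = [[1.0, 2.0, 3.0]]            # a batch of one example, shape (1, 3)
z = affine(x, W, b)              # shape (1, 1)

print("z has %d row(s) and %d column(s)" % (len(z), len(z[0])))
```

Note that the matrix product contracts the feature dimension (3) and leaves one output per example, whereas an element-wise product would require both operands to have compatible shapes and would not sum anything.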

Run the little fragment below until you get your code up and running, then run the more comprehensive unit tests in the cell below that.

In [27]:
reload(graph)
with tf.Graph().as_default():
    sess = tf.Session()
    x_ph = tf.placeholder(tf.float32, shape=(None, 3))
    y = graph.affine_layer(1, x_ph)  #### <---- Your code called here.
    sess.run(tf.global_variables_initializer())
    
    print 'You should have two trainable variables, one for each of parameters W and b: ', len(tf.trainable_variables())
    assert len(tf.trainable_variables()) == 2

    print 'These should be a (3, 1) W weight matrix and a (1,) offset.'
    variables = sess.run(tf.trainable_variables())
    print variables[0].shape
    print variables[1].shape
    assert set([variables[0].shape, variables[1].shape]) == set([(3, 1), (1,)])

    print 'This should be [[3.888]].'
    y_val = sess.run(y, feed_dict={x_ph: np.array([[1, 2, 3]])})
    print y_val
    assert y_val.shape == (1, 1)

In [28]:
reload(graph)
reload(graph_test)
unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromName(
        'TestLayer.test_affine', graph_test))

### 1(b): Fully-Connected Layers

A fully connected layer has the following form (you'll notice this is very similar to logistic regression!):

1.  An affine transform
2.  An elementwise nonlinearity $f(z)$ (this is sigmoid in logistic regression; we'll use [relu](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) here, but the idea is the same)

These fully connected layers, in square brackets below, can be stacked repeatedly to build a deep neural network:

$x \rightarrow [xW_1 + b_1 \rightarrow z_1 \rightarrow f(z_1)] \rightarrow h_1 \rightarrow [h_1W_2 + b_2 \rightarrow z_2 \rightarrow f(z_2)] \rightarrow h_2 \cdots$
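
The stacking can be sketched in pure Python for a single example. This is only an illustration with hand-picked, hypothetical weights (one `(W, b)` pair per layer), not the TensorFlow function you are asked to write.

```python
def relu(z):
    """Element-wise max(0, z)."""
    return [max(0.0, z_j) for z_j in z]


def affine(x, W, b):
    """x: fan_in floats; W: fan_in x fan_out nested list; b: fan_out floats."""
    return [sum(x[k] * W[k][j] for k in range(len(x))) + b[j]
            for j in range(len(b))]


def forward(x, layers):
    """Apply h = relu(hW + b) once per (W, b) pair, feeding each output forward."""
    h = x
    for W, b in layers:
        h = relu(affine(h, W, b))
    return h


# Hypothetical parameters: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0], [1.0], [1.0]], [-1.0]),
]

h = forward([1.0, 2.0], layers)
print(h)  # [3.0]
```

Here the first layer produces `[1.0, 3.0, -1.5]` before the nonlinearity and `[1.0, 3.0, 0.0]` after it, so the second layer computes `1 + 3 + 0 - 1 = 3`.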

- Implement the `fully_connected_layers()` function.

In [6]:
reload(graph)
reload(graph_test)
unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromName(
        'TestLayer.test_fully_connected_layers', graph_test))

## Part 2: Training a Neural Network

Let's put it all together, and build a simple neural network that fits some training data.

- Implement the `train_nn()` function.

**Note:** you will need to do all the work (creating the graph and the session and a training op).

To get the tests to pass, please use [`tf.train.GradientDescentOptimizer`](https://www.tensorflow.org/api_docs/python/train/optimizers#GradientDescentOptimizer) as your optimizer.

In [7]:
reload(graph_test)
X_train, y_train, X_test, y_test = graph_test.generate_data(1000, 10)
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap='bwr')

**Hint:** You should expect to see an initial loss here of 0.2 - 1.0.  This is because a well-initialized random classifier tends to output a uniform distribution.  For each example in the batch, we compute the cross-entropy loss of the label (either [1, 0] or [0, 1]) against the model's output (~[0.5, 0.5]).  Both cases result in -ln(0.5) = ln(2) ≈ 0.69.

Of course, your random classifier won't output exactly uniform distributions (it's random after all), but you should anticipate it being pretty close.  If it's not, your initialization may be broken, which will make it hard for your network to learn.
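
You can check this arithmetic directly. The sketch below assumes a two-class softmax output and uses made-up logits; it compares an exactly uniform prediction, a nearly uniform one, and a confidently wrong one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_label, probs):
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))


uniform = softmax([0.0, 0.0])                    # exactly [0.5, 0.5]
loss_uniform = cross_entropy([1, 0], uniform)    # -ln(0.5) = ln(2)

near_uniform = softmax([0.1, -0.1])              # roughly [0.55, 0.45]
loss_near = cross_entropy([0, 1], near_uniform)

confident_wrong = softmax([8.0, -8.0])           # almost [1, 0]
loss_confident = cross_entropy([0, 1], confident_wrong)

print("uniform:         %.4f" % loss_uniform)
print("near uniform:    %.4f" % loss_near)
print("confident wrong: %.4f" % loss_confident)
```

The first two losses fall inside the expected 0.2 - 1.0 range, while the confidently wrong prediction produces a loss an order of magnitude larger.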

**[Optional]** Some technical details... if your randomly initialized network is outputting very confident predictions, the loss computed may be very large while at the same time the sigmoids in the network are likely in saturation, quickly shrinking gradients.  The result is that you make tiny updates in the face of a huge loss.
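
The saturation effect is easy to see numerically. This small sketch evaluates the sigmoid and its derivative, $\mathrm{sigmoid}(z)(1 - \mathrm{sigmoid}(z))$, at a few arbitrarily chosen inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid with respect to its input."""
    s = sigmoid(z)
    return s * (1.0 - s)


for z in [0.0, 2.0, 5.0, 10.0]:
    print("z=%5.1f  sigmoid=%.5f  gradient=%.6f"
          % (z, sigmoid(z), sigmoid_grad(z)))
```

The gradient peaks at 0.25 when `z` is 0 and drops below 0.0001 by the time `z` reaches 10, so any update that has to pass back through a saturated sigmoid is scaled down by that factor.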

In [8]:
reload(graph)
reload(graph_test)
unittest.TextTestRunner(verbosity=2).run(
    unittest.TestLoader().loadTestsFromName(
        'TestNN.test_train_nn', graph_test))

That was fairly straightforward...  the data is clearly linearly separable.

### Tuning Parameters

Let's try our network on a problem that's a bit harder!

Here, we'll train a neural network with a couple of hidden layers before the final sigmoid.  This lets the network learn non-linear decision boundaries.

Try playing around with the hyperparameters to get a feel for what happens if you set the learning rate too big (or too small), or if you don't give the network enough capacity (i.e. hidden layers and width).
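
To build intuition for the learning rate before experimenting with the network, here is a toy sketch that runs plain gradient descent on $f(x) = x^2$ (gradient $2x$). The function and the three learning rates are chosen only for illustration and have nothing to do with the values that work for the network below.

```python
def gradient_descent(learning_rate, steps=20, x0=1.0):
    """Minimize f(x) = x**2 starting from x0 and return the final x."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x


too_small = gradient_descent(0.001)   # barely moves away from 1.0
reasonable = gradient_descent(0.1)    # ends close to the minimum at 0
too_big = gradient_descent(1.1)       # overshoots further each step

print("too small:  %.4f" % too_small)
print("reasonable: %.4f" % reasonable)
print("too big:    %.4f" % too_big)
```

With a rate of 1.1 every step multiplies `x` by -1.2, so the iterate flips sign and grows instead of converging.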

In [9]:
reload(graph_test)
X_train, y_train, X_test, y_test = graph_test.generate_non_linear_data(1000, 10)
plt.scatter(X_train[:,0], X_train[:,1], c=y_train, cmap='bwr')

In [10]:
hidden_layers = [10, 10]
batch_size = 50
epochs = 2000
learning_rate = 0.001
predictions = graph.train_nn(X_train, y_train, X_test, hidden_layers, batch_size, epochs, learning_rate)

In [11]:
plt.scatter(X_test[:,0], X_test[:,1], c=predictions, cmap='bwr')

That looks pretty good!

Let's compare the predictions vs. the labels and see what we got wrong...

In [12]:
plt.scatter(X_test[:,0], X_test[:,1], c=(predictions==y_test), cmap='bwr')

Only a tiny number of errors (hopefully!).  Good work!

## Congratulations

You have implemented a deep neural network using TensorFlow!

One remaining API you may want to take a look at is [tf.nn.embedding_lookup](https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#embedding_lookup).  It is simply an op that takes a variable (like the `W` you created in your affine layer) and returns rows from it.  This will be useful later when we "embed" words into vector space.  We'll have our embedding table as a single variable with dimensions `[#words x word_vector_length]` and we'll use this op to select word vectors from it efficiently.
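
Conceptually, for a table of shape `[#words x word_vector_length]`, the lookup selects one row per word id. The pure-Python sketch below mimics that behavior with a hypothetical 4-word table; the `embedding_lookup` helper here is a stand-in written for this example, not the TensorFlow op itself.

```python
# Hypothetical embedding table with shape [#words x word_vector_length] = [4 x 3].
embedding_table = [
    [0.1, 0.2, 0.3],   # word id 0
    [0.4, 0.5, 0.6],   # word id 1
    [0.7, 0.8, 0.9],   # word id 2
    [1.0, 1.1, 1.2],   # word id 3
]


def embedding_lookup(table, ids):
    """Return one row of the table per id, in the order the ids are given."""
    return [table[i] for i in ids]


word_ids = [2, 0, 2]   # a three-word "sentence"; repeated ids are allowed
vectors = embedding_lookup(embedding_table, word_ids)

for word_id, vector in zip(word_ids, vectors):
    print("%d -> %s" % (word_id, vector))
```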