# Getting Started
In today's lab, we will be working with tensorflow to see an implementation of a neural net (deep learning) in action.

We will be using an ipython notebook. If you are familiar with ipython notebooks, you can skip to the "Tensorflow" section. Otherwise, keep reading!

# Ipython Notebooks
Ipython notebooks are an interactive environment for running code and viewing images. Code is written in cells like the one below, and in order to run the code, you must select the cell by clicking on it, and then press enter while holding down shift. Try this with the below cell to see the first 10 primes printed out.

In [None]:
print("Hello world!")
print("The first 10 primes are:")
primes = [2]
num = 2
while len(primes) < 10:
    num_is_prime = True
    for prime in primes:
        if num % prime == 0:
            num_is_prime = False
    if num_is_prime:
        primes.append(num)
    num += 1
print(primes)

# Running multiple cells
Whenever a cell is run in an ipython notebook, the variables it creates are saved and can be accessed from any other cell (it's like the interactive python shell). Try running the following two cells in sequence. This is important because the notebook is designed so that all the cells will be run in sequence. So, don't skip ahead to later cells without first running all the previous cells in order!

In [None]:
my_string = "I ran these two cells in sequence."

In [None]:
print(my_string)

# Tensorflow
Let's get started using tensorflow! First, import all the packages we will need. If you get a warning about the compile-time version of tensorflow not matching the runtime version, or about deprecated support, you can ignore it; everything still works fine.

In [None]:
import tensorflow as tf
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline

# logistic regression: http://web.stanford.edu/class/cs109/lectureHandouts/25%20LogisticRegression.pdf
# gradient ascent: http://web.stanford.edu/class/cs109/lectureHandouts/22%20GradientAscent.pdf
# thanks http://machinelearninguru.com/deep_learning/tensorflow/machine_learning_basics/logistic_regresstion/logistic_regression.html

## The dataset
We will be working with the MNIST dataset, which is a dataset of hand-written digits. We will only work with the
zeros and ones, so we can use logistic regression to classify whether an image is a 0 or a 1. Load the dataset and print out how many training and testing datapoints we are working with:

In [None]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", reshape=True, one_hot=False)
data={}
data['train/image'] = mnist.train.images
data['train/label'] = mnist.train.labels
data['test/image'] = mnist.test.images
data['test/label'] = mnist.test.labels
# Get only the samples with zero and one label for training.
index_list_train = []
for sample_index in range(data['train/label'].shape[0]):
    label = data['train/label'][sample_index]
    if label == 1 or label == 0:
        index_list_train.append(sample_index)
# Reform the train data structure.
data['train/image'] = mnist.train.images[index_list_train]
data['train/label'] = mnist.train.labels[index_list_train]
# Get only the samples with zero and one label for test set.
index_list_test = []
for sample_index in range(data['test/label'].shape[0]):
    label = data['test/label'][sample_index]
    if label == 1 or label == 0:
        index_list_test.append(sample_index)
# Reform the test data structure.
data['test/image'] = mnist.test.images[index_list_test]
data['test/label'] = mnist.test.labels[index_list_test]

print("\nNumber of training datapoints: %d" % data['train/label'].shape[0])
print("Number of testing datapoints: %d" % data['test/label'].shape[0])
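The two filtering loops above can also be written with a NumPy boolean mask. The following is only a minimal sketch: it uses small made-up arrays in place of the MNIST data, and the names `labels`, `images`, and `keep` are illustrative rather than part of the notebook.

```python
import numpy as np

# Toy stand-ins for mnist.train.labels and mnist.train.images (assumed shapes).
labels = np.array([5, 0, 4, 1, 9, 1, 0])
images = np.arange(len(labels) * 4, dtype=np.float32).reshape(len(labels), 4)

# Boolean mask that is True wherever the label is 0 or 1.
keep = np.isin(labels, [0, 1])

# Indexing with the mask keeps only the matching rows, in their original order.
binary_images = images[keep]
binary_labels = labels[keep]

print(binary_labels)
print(binary_images.shape)
```

The mask selects the same rows as the index lists built in the loops, so either approach should give the same filtered arrays.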

### Visualizing MNIST examples
Let's visualize a few example images:

In [None]:
"""Let's visualize a couple examples of mnist images:"""
def gen_image(arr, ax):
    two_d = (np.reshape(arr, (28, 28)) * 255).astype(np.uint8)
    ax.imshow(two_d, cmap='gray')

f, (ax1, ax2, ax3, ax4, ax5) = plt.subplots(1, 5, sharey=True, figsize=(8,1))
gen_image(data['train/image'][0], ax1)
gen_image(data['train/image'][1], ax2)
gen_image(data['train/image'][2], ax3)
gen_image(data['train/image'][3], ax4)
gen_image(data['train/image'][4], ax5)
print("Labels: %d, %d, %d, %d, %d" % tuple(data['train/label'][:5].tolist()))


## Building the deep learning model (2 layer fully connected neural network):
First, just set some parameters. These work well, but feel free to play around with them if you want.


In [None]:
# Parameters
learning_rate = 5
training_epochs = 1000
batch_size = 100 # the number of training images per gradient update (note: the training loop below feeds the full training set at every step, so this value is not used)
display_step = 10
n_hidden = 2 # the number of "neurons" in the first hidden layer of the neural network.

Now, we need to build the model. Tensorflow works by setting up a computation graph that input is then fed into. 
We first specify the inputs to the graph with tf.placeholder. Basically, x and y must be specified when we run the model.
x will be a batch of input images, and y will be a batch of labels for the image (if necessary). 

In [None]:
# tf Graph Input
x = tf.placeholder(tf.float32, [None, 784]) # mnist data image of shape 28*28=784
y = tf.placeholder(tf.float32, [None, 1]) # 0-1 digits recognition => 2 classes

Next, we have to set up the parameters of the model. For now, just run this cell, and we'll go over what the variables are exactly later in the notebook. Note that 784 = 28*28 = the number of pixels in an mnist image.

In [None]:
# Set model weights/parameters
# note that 784 = 28*28 = the number of pixels in an mnist image.
W_h = tf.get_variable("W_h", shape=[784, n_hidden],
       initializer=tf.contrib.layers.xavier_initializer())
b_h = tf.Variable(tf.zeros([n_hidden]))
W_y = tf.get_variable("W_y", shape=[n_hidden, 1],
       initializer=tf.contrib.layers.xavier_initializer())
b_y = tf.Variable(tf.zeros([1]))

Now we can do the interesting part, constructing the model!

"a" in the below cell is performing a bunch of different logistic regressions on the flattened input image, x. 

tf.matmul(x, W_h) does a matrix multiplication between x and W_h. This corresponds to $\theta^T x$ in logistic regression. b_h is just $\theta_0$. However, notice that the size of W_h, as specified in the above cell, is 784 x n_hidden. Basically, this means that we are performing n_hidden different logistic regressions on the input, all with their own $\theta$ parameters. The same goes for b_h - there is a unique $\theta_0$ for each logistic regression on the input data. 

Now, the key difference between "deep learning" or "neural nets" and simple logistic regression: we perform logistic regression AGAIN, on the outputs of the first logistic regression pass, to get our final output probabilities, prob_y1! 

To provide a little more detail for the interested: what we have done here is apply a linear transformation to the input image,
tf.matmul(x, W_h) + b_h
mapping the input image to an n_hidden-dimensional space. Then we apply a non-linear function to the result of this linear transformation (in this case the sigmoid, but tf.tanh and tf.nn.relu are other popular choices), and then we repeat the process: apply a linear transformation, and then a nonlinear function to the outputs. We only need two layers (one linear transformation, one nonlinear function, and then another linear transformation) to get a universal function approximator (which means the neural net can approximate any continuous function on a bounded domain to arbitrary precision, if n_hidden is large enough), but deeper neural nets with more layers often allow this approximation to be better with smaller values of n_hidden.

TODO add image to this cell of neural network, or go over on board in section.

In [None]:
# Construct model

a = tf.nn.sigmoid(tf.matmul(x, W_h) + b_h) # activate the hidden layer
prob_y1 = tf.nn.sigmoid(tf.matmul(a, W_y) + b_y) # softmax with 2 dimensions is just a sigmoid
prob_y0 = 1 - prob_y1
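To make the shapes in the cell above concrete, here is a rough NumPy sketch of the same forward pass. It is not the notebook's model: the weights are random, the batch of "images" is fake, and every name ending in `_np` is made up for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
batch, n_pixels, n_hidden = 3, 784, 2

# Fake batch of flattened images and randomly initialised parameters (assumptions).
x_np = rng.random((batch, n_pixels))
W_h_np = rng.normal(scale=0.05, size=(n_pixels, n_hidden))
b_h_np = np.zeros(n_hidden)
W_y_np = rng.normal(scale=0.05, size=(n_hidden, 1))
b_y_np = np.zeros(1)

# Hidden layer: n_hidden logistic regressions applied to every image.
a_np = sigmoid(x_np @ W_h_np + b_h_np)    # shape (batch, n_hidden)

# Output layer: one more logistic regression on the hidden activations.
p1_np = sigmoid(a_np @ W_y_np + b_y_np)   # shape (batch, 1)
p0_np = 1.0 - p1_np

print(a_np.shape, p1_np.shape)  # (3, 2) (3, 1)
```

Each row of `a_np` holds the two hidden activations for one image, and each row of `p1_np` holds that image's probability of being a 1.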


## Compute loss
Now that we have set up our model, for any given input images x and their classifications y, we can compute the negative log likelihood loss as follows:


In [None]:
# Maximize log likelihood (aka minimize negative log likelihood) using cross entropy
neg_LL = - tf.reduce_mean(y*tf.log(prob_y1)+(1-y)*tf.log(prob_y0))

tf.log just takes the element-wise logarithm of the inputs.

tf.reduce_mean takes the mean of all entries in an array (so this is actually -LL/N, where N is batch size)

tensorflow supports broadcasting just like numpy. So 1 - A just replaces every entry $a_{ij}$ with $1 - a_{ij}$

Can you recognize how the above cell computes negative log likelihood?

$-LL = - \sum_x \log p(y = \hat{y} | x)$

$ = - \left( \sum_{x | \hat{y}=1} \log p(y=1 | x) + \sum_{x | \hat{y}=0} \log p(y=0 | x) \right) $

$ = - \sum_x \left( \hat{y} \log p(y=1|x) + (1-\hat{y}) \log p(y=0 | x) \right) $
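A small numeric check may help here. The sketch below uses made-up probabilities and labels (`y_true` and `p1` are illustrative names, not notebook variables) to compare two ways of computing the negative log likelihood: summing the log-probability the model assigns to each true label, and the cross-entropy expression used in the neg_LL cell.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # true labels (made up)
p1 = np.array([0.9, 0.2, 0.6, 0.4])       # model's p(y=1 | x) for each example (made up)
p0 = 1.0 - p1

# Direct form: pick the probability assigned to the true label, then sum the logs.
p_of_true_label = np.where(y_true == 1, p1, p0)
nll_direct = -np.sum(np.log(p_of_true_label))

# Cross-entropy form: the label acts as a switch that selects one of the two log terms.
nll_cross_entropy = -np.sum(y_true * np.log(p1) + (1 - y_true) * np.log(p0))

print(nll_direct, nll_cross_entropy)

# tf.reduce_mean corresponds to dividing by the number of examples.
print(nll_cross_entropy / len(y_true))
```

The two values agree because, for each example, exactly one of the label and one minus the label is 1 and the other is 0.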

### Gradient ascent / descent with tensorflow:
Now, we're just going to add an operation to the graph that performs gradient descent for us (to minimize -LL)!
The way this works is that tensorflow computes the gradients of the loss, neg_LL, with respect to the four variables we specified, 
W_h, b_h, W_y, and b_y. Then, every time we evaluate the "optimizer" operation, it will perform a gradient update step for us!
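To see what a gradient descent step does without tensorflow, here is a hand-written sketch for a plain one-dimensional logistic regression. This is not what the optimizer does internally for our network; the toy data, the hand-derived gradients, and the names `xs_toy`, `ys_toy`, `theta`, `theta0`, and `step_size` are all assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D dataset: negative inputs are class 0, positive inputs are class 1.
xs_toy = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
ys_toy = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

def neg_ll(theta, theta0):
    """Mean negative log likelihood of the toy data."""
    p1 = sigmoid(theta * xs_toy + theta0)
    return -np.mean(ys_toy * np.log(p1) + (1 - ys_toy) * np.log(1 - p1))

theta, theta0 = 0.0, 0.0
step_size = 0.5
losses = [neg_ll(theta, theta0)]

for _ in range(100):
    p1 = sigmoid(theta * xs_toy + theta0)
    # Gradients of the mean negative log likelihood, derived by hand.
    grad_theta = np.mean((p1 - ys_toy) * xs_toy)
    grad_theta0 = np.mean(p1 - ys_toy)
    # Descent step: move against the gradient to reduce the loss.
    theta -= step_size * grad_theta
    theta0 -= step_size * grad_theta0
    losses.append(neg_ll(theta, theta0))

print(losses[0], losses[-1])
```

Tensorflow saves us from deriving `grad_theta` and `grad_theta0` by hand: it works out the corresponding gradients for all four variables from the graph.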

In [None]:
""" Gradient Descent (NOTE: gradient ascent of log likelihood is the same as gradient descent of 
the negative log likelihood. Tensorflow has handy implementations of gradient descent, but not ascent.)

This performs all the gradient computations for us! Isn't that great?
"""
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(neg_LL)

OK, now we've set up the entire neural network graph, so let's start training it!

In [None]:
# Initialize the variables (i.e. assign their default value)
init = tf.global_variables_initializer()
sess = tf.Session()

In [None]:
# Start training
# Run the initializer
sess.run(init)
    
xs, ys = data['train/image'], data['train/label'].reshape(-1, 1) # -1 infers the number of rows, so the dataset size is not hard-coded
for epoch in range(training_epochs):
    # Run optimization op (backprop) and neg_LL op (to get loss value)
    _, nll = sess.run([optimizer, neg_LL], feed_dict={x: xs, y: ys}) 
    # following https://cs224d.stanford.edu/lectures/CS224d-Lecture7.pdf
    # Print cost
    if (epoch+1) % display_step == 0:
        print("Epoch:", '%04d' % (epoch+1), "-LL =", "{:.9f}".format(nll))


## Classification accuracy
Now, let's see how well our classifier does on the test dataset. To do this, we need to build an
operation "accuracy", which, given a set of datapoints X and labels Y, computes the accuracy of classification:

In [None]:

classifications = tf.round(prob_y1) # get classifications of each datapoint from probabilities
accuracy = 1.0 - tf.reduce_mean(tf.abs(classifications - y)) # compute accuracy using classifications and truth y
# training
xs, ys = data['train/image'], data['train/label'].reshape(-1, 1)
# sess.run is how we use the tensorflow model to evaluate things.
# feed_dict provides the input to the placeholders x and y, 
# and accuracy is what we want to evaluate.
training_acc = sess.run(accuracy, feed_dict={x: xs, y: ys})
# testing
xs, ys = data['test/image'], data['test/label'].reshape(-1, 1)
testing_acc = sess.run(accuracy, feed_dict={x: xs, y: ys})
print("Training accuracy: %f" % training_acc)
print("Testing accuracy: %f" % testing_acc)
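The accuracy formula in the cell above may look unusual, so here is a small NumPy sketch with made-up probabilities and labels (`probs` and `truth` are illustrative names). It suggests that one minus the mean absolute difference between rounded predictions and 0/1 labels is the same as the fraction of correct predictions.

```python
import numpy as np

probs = np.array([[0.97], [0.08], [0.61], [0.30]])   # made-up p(y=1 | x)
truth = np.array([[1.0], [0.0], [0.0], [0.0]])       # made-up true labels

# Rounding turns each probability into a 0/1 prediction.
predictions = np.round(probs)

# |prediction - label| is 1 for a mistake and 0 for a correct answer,
# so its mean is the error rate and one minus that is the accuracy.
acc_formula = 1.0 - np.mean(np.abs(predictions - truth))

# The same quantity, counted directly.
acc_counted = np.mean(predictions == truth)

print(acc_formula, acc_counted)  # 0.75 0.75
```

Here the third example is misclassified, so three out of four predictions are correct.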


You should see accuracy > 0.999 on the testing dataset! So this model does a very good job of classifying zeros and ones. Let's try to visualize approximately what it is doing in the next section.

# Part 2: Maximal Activation in the Context of MNIST
Read question 2 in the section handout and ponder it a bit before proceeding. This next portion deals with similar concepts in the context of the data we've already used above for question 1. Here, however, we're trying to generate, from scratch, images that will maximally activate our final "1" neuron.

We can find the image that the model most classifies as a "1" by doing gradient ascent on prob_y1, with respect to the input image. We first initialize an image where all pixel values are 0. Then, repeatedly, we feed the image into the model, compute the gradients of prob_y1 with respect to the image, and then update the image with those gradients to increase prob_y1.
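To illustrate what "the gradient of a probability with respect to the input image" means, here is a sketch on a tiny single-layer model rather than our network. The 5-"pixel" image, the weights `w`, the bias `b`, and the helper `prob_one` are all hypothetical. It compares the hand-derived gradient with a finite-difference estimate, which is roughly the quantity tf.gradients computes for us automatically.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=5)   # made-up weights of a tiny 5-"pixel" logistic model
b = 0.1                  # made-up bias

def prob_one(image):
    """Probability that the toy model assigns to class 1."""
    return sigmoid(image @ w + b)

image = np.zeros(5)      # start from an all-zero image

# Hand-derived gradient of sigmoid(w . x + b) with respect to x: p * (1 - p) * w
p = prob_one(image)
analytic = p * (1 - p) * w

# Numerical check: nudge one pixel at a time and measure how the probability changes.
eps = 1e-6
numeric = np.zeros_like(image)
for i in range(image.size):
    bump = np.zeros_like(image)
    bump[i] = eps
    numeric[i] = (prob_one(image + bump) - prob_one(image - bump)) / (2 * eps)

print(analytic)
print(numeric)
```

Each entry of the gradient says how much the probability would change if the corresponding pixel were made slightly brighter.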

Below is most of the code to complete this task on the MNIST dataset. There are two lines that you have to supply (the gradient update step) marked with TODO.

In [None]:
"""Trying to find the image that the model is most confident is a 1:"""
# first, find gradient of probs with respect to image
one_grads = tf.gradients(prob_y1, x)
zero_grads = tf.gradients(prob_y0, x)

# now, repeatedly feed an image into model, calculate gradients, and perform gradient ascent on prob_y1 with respect
# to the image: 
eta = 1 # the learning rate; this choice is important here
# these are our starting images that we will update with gradient ascent
perfect_zero = np.zeros((1,28*28))
perfect_one = np.zeros((1,28*28))

for i in range(10000):
    p0_grads = sess.run((zero_grads), feed_dict={x:perfect_zero})[0] # the gradient with respect to perfect_zero
    p1_grads = sess.run((one_grads), feed_dict={x:perfect_one})[0] # the gradient with respect to perfect_one
    
    # TODO: YOU NEED TO ADD TWO LINES HERE to perform the gradient ascent update on perfect_zero and perfect_one

    
    # ----------------------

f, (ax1,ax2) = plt.subplots(1, 2, sharey=True, figsize=(5,2))
# images that we've generated are not scaled like MNIST images. So need to rescale:
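def normalize_image(image):
    # normalize so all pixels in image are between 0 and 1
    image = image - np.min(image)
    max_val = np.max(image)
    # a constant image has max_val == 0; return it unchanged to avoid dividing by zero
    return image / max_val if max_val > 0 else image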
    

perfect_zero = normalize_image(perfect_zero)
perfect_one = normalize_image(perfect_one)
gen_image(perfect_zero, ax1)
gen_image(perfect_one, ax2)


What do you notice about the images above? Do they look like "perfect" ones or zeros? Why or why not?



Let's try something else: instead of finding the image that maximally activates the probability of being a one or zero, let's find the images that maximally activate each of the two "neurons" in the hidden layer:

YOUR TASK:
The code below is mostly copied and pasted from the cell above. Modify it so that you find the images that maximally activate the two neurons in the hidden layer (rather than the final probabilities). The lines you need to change are marked with TODO comments.

HINT: The neurons in the hidden layer are represented as "a". "a" is a tensor with shape [batch_size, 2]. Tensors can be indexed the same way as numpy arrays. That is, if "a" is

[[0,1],
 [0,1],
 [0,1],
 [2,3]]
 
Then "a[:,0]" will give 
[0,0,0,2]
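The same indexing can be tried directly in NumPy. In this small sketch, `a_example` is just the array from the hint and not the actual hidden layer.

```python
import numpy as np

a_example = np.array([[0, 1],
                      [0, 1],
                      [0, 1],
                      [2, 3]])

# ":" keeps every row; the second index picks a single column.
print(a_example[:, 0])  # first column:  [0 0 0 2]
print(a_example[:, 1])  # second column: [1 1 1 3]
```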


In [None]:
# TODO: "zero_grads" should be the gradients of the first neuron in the hidden layer, with respect to the input image
# and "one_grads" should be the gradients of the second neuron in the hidden layer, with respect to the input image.



# ----------------------


# now, repeatedly feed an image into model, calculate gradients, and perform gradient ascent with respect
# to the image: 
eta = .01 # choice of learning rate (eta) is important here
perfect_zero = np.zeros((1,28*28))
perfect_one = np.zeros((1,28*28))

for i in range(1000):
    p0_grads = sess.run((zero_grads), feed_dict={x:perfect_zero})[0]
    p1_grads = sess.run((one_grads), feed_dict={x:perfect_one})[0]
    # TODO: update images with gradients (gradient ascent step)

    
    # ----------------------

f, (ax1,ax2) = plt.subplots(1, 2, sharey=True, figsize=(5,2))
# images that we've generated are not scaled like mnist images. So need to rescale:
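def normalize_image(image):
    # normalize so all pixels in image are between 0 and 1
    image = image - np.min(image)
    max_val = np.max(image)
    # a constant image has max_val == 0; return it unchanged to avoid dividing by zero
    return image / max_val if max_val > 0 else image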
    

perfect_zero = normalize_image(perfect_zero)
perfect_one = normalize_image(perfect_one)
gen_image(perfect_zero, ax1)
gen_image(perfect_one, ax2)

What we have just done is a very simple start towards something really cool, which is neural networks "dreaming"
up images of things that they know how to classify! In this example, the hidden neurons are learning relatively simple things, but for more complicated datasets, hidden neurons can learn interesting things, like how to detect wheels, eyes, or branches (even if none of those are things that need to be classified). Take a look at the blog post below for some awesome examples of this.

Deep dream blog post:
https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html

        

In [None]:
# the weights of the last layer of the model
print(sess.run(W_y))