# Notes on Neural Networks

From Chapter 10 of Hands-On Machine Learning textbook by Aurelien Geron. 

A few reasons why Neural nets are going to stick around:
- There is now huge quanitity of data available to train neural nets, and ANNs frequently outperform other ML techniques
- Tremendous increase in computing power since 1990s now makes it possible to train huge nnnets. THis is partly due to Moore's Law, and also thanks to the gaming industry which has produced powerful GPU cards by the millions.
- Training algos have been improved. To be fair, they are only slihglty diferent from the ones used in the 19990s, but these relatively small tweaks have had a huge impact.
- Some theoretical limitations of ANNs have turned out to be benign in practice. For example many people thought that ANN training algos were doomed because they were likely to get stuck in local optima, but it turns out that this is rather rare in practice (or when it is the case, they are usually fairly close to the global optimum)
- ANNs seem to have entered a virtuous circle of funding and progress. Amazing products based on ANNs regularily make the headline news, which pulls more and more attentions and funding towards them. 

### The Perceptron

An Linear Treshold Unity can be used for simple linear binary classification. It computers a linear combination of inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class. 

A perception is simply composed of a single layer Linear Threshold Units (LTU), with each neuron connected to all of the inputs. **These connections are often represented using special passthrough neurons called input neurons: they just output whatever input they are fed.** Moreover, an extra bias feature is gnerally added (x0=1). This bias feature is typically represented using a special type of neuron called a bias neuron, which just outputs 1 all of the time.

A perceptron with two inputs and three outputs is represented below. This peceptron can classify instances simultaneously into three binary classes, which makes it a multioutput classifier:
![](pictures/ann.jpg)


So how is a preceptron trained? The perceptron is fed one training instance at a time, and for each instance it makes its predictions. **For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.** The rule is shown below: 

![](pictures/anneq.jpg)

The decision boundary of each output neural is linear, so Perceptrons are INCAPABLE of learning complex patterns (just like Logistic Regression classifiers). However, if the training instances are linearly separable, Rosenblatt demonstrated that this algorithm would converge to a solution. This is called Perceptron convergence theorem. 

Scikit-Learn provides a Perceptron class that implements a single LTU network. It can be used
pretty much as you would expect — for example, on the iris dataset:

In [57]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)] # petal length, petal width
y = (iris.target == 0).astype(np.int) # Iris Setosa?

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)
y_pred = per_clf.predict([[2, 0.5]])

y_pred



array([1])

You may have recognized that the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. 

**In fact, Scikit-Learn’s Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss="perceptron", learning_rate="constant", eta0=1 (the learning rate), and penalty=None (no regularization).**

Note that contrary to Logistic Regression classifiers, Perceptrons do not output a class probabiliity; rather they just make predictions based on a hard threshold. This is one reason to prefer LogReg over Perceptrons. 

### Multi-Layer Perceptron and Backpropogation

Many of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a Multi-Layer Perceptron (MLP). An MP is composed of one (passthrough) input layer, one or more layers of LTUs, called (hidden layers) and one final layer of LTUs called the output layer.  ** Every layer except the output layer includes a bias neuron and is fully connected to the next layer. WHEN AN ANN HAS TWO OR MORE HIDDEN LAYERS, IT IS CALLED A DEEP NEURAL NETWORK**.

![](pictures/ann2.jpg)

For many years researchers struggled to find a way to train MLPs, without success. But in 1986, D. E. Rumelhart et al. published a groundbreaking article8 introducing the backpropagation training algorithm. **Today we would describe it as Gradient Descent using reverse-mode autodiff**. 

**For each training instance:**
- the algorithm feeds it to the network and computes the output of every neuron in each consecutive layer (this is the forward pass, just like when making predictions). 
- Then it measures the network’s output error (i.e., the difference between the desired output and the actual output of the network), 
- and it computes how much each neuron in the last hidden layer contributed to each output neuron’s error. It then proceeds to measure how much of these error contributions came from each neuron in the previous hidden layer - and so on until the algorithm reaches the input layer. 

<font color=red> This reverse pass efficiently measures the error gradient across all the connection weights in the network by propogating the error gradient backward in the network (hence the name of the algorithm). If you check out the reverse-mode autodiff algorithm in Appendix D, you will find that the forward and reverse passes of backpropogations simply perform revese-mode autodiff. The last step of back propogation algorithm is a Gradient Descent step on al the connection weights in the network, using the error gradients measured earlier.  </font>

Let's make this process even shorter. For each training instance:
- the back prop algo first makes a prediction (forward pass)
- measures the error
- then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), 
- and finally slihgtly tweaks the connection weights to reduce the error (gradient descent step)

In order for this to work properly, the authors made a key change to the MLPs archtiechture: **they replaced the step function with the logistic function  (σ(z) = 1 / (1 + exp(–z)). This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), <font color = blue>which the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step</font>. Two other popular activation functions are:**

- The hyperbolic tangent function tanh (z) = 2σ(2z) – 1
    - **Just like the logistic function it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function),** which tends to make each layer’s output more or less normalized (i.e., centered around 0) at the beginning of training. This often helps speed up convergence.


- The ReLU function (introduced in Chapter 9): ReLU (z) = max (0, z)
    - It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around). However, **in practice it works very well and has the advantage of being fast to compute. Most importantly, the fact that it does not have a maximum output value also helps reduce some issues during Gradient Descent**.

These popular activation functions and their derivatives are represented in Figure 10-8.

![](pictures/deriv.jpg)

An MLP is oftern used for classification, with each output corresponding to a different binary class. When the classes are exclusive (eg. classes 0 through 9 for digits), the output later is typically modified by replacing the individual activation functions by a shared softmax function. The outcome of each neuron corresponds to the estimated probabilitiy of the corespoinding class. **Note that the signal flows only in one direction (from the inputs to the outputs) so this architecture is an example of a feedforward nerual network (FNN).**

![](pictures/nnet.jpg)

# Training and MLP with TensorFlows HighLevel API

The simplest way to train an MLP with TF is to use the high level API TF.Learn. which is quite similar to SKlearns API. The DNNClassifier class makes it trivial to train a deep neural network with any number of hidden layers, and a softmax output layer to output layer to output estimated probabilities. For example, the following code trains a DNN for classificaiton with two hidden layers (one with 300 neurons, and the other with 100 neurons) and a softmax output layer with 10 Neurons:

        import tensorflow as tf

        feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
        dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300,100], n_classes=10,
                                                 feature_columns=feature_cols, config=config)

        dnn_clf = tf.contrib.learn.SKCompat(dnn_clf) # if TensorFlow >= 1.1
        dnn_clf.fit(X_train, y_train, batch_size=50, steps=40000)

<font color=red size=5> Under the hood, the DNNClassifier class creates all the neuron layers, **based on the ReLU activation
function (we can change this by setting the activation_fn hyperparameter)**. The output layer relies
on the softmax function, and the cost function is cross entropy.

The TF.Learn API is still quite new, so some of the names and functions used in these examples may evolve a bit by the
time you read this book. However, the general ideas should not change.

# Training a Deep Neural Network (DNN) using Plain TensorFlow

If you want more control over the architecture of the network, you may prefer to use TFs lower level Python API. 

# Construction Phase

First lets import tf, and specify the number of inputs and outputs, and set the number of hidden neurons in each layer:

In [58]:
import tensorflow as tf
n_inputs = 28*28 #MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

Next, use palceholder nodes to represent the training data and the targets. The shape of the x is only partially defined. We know that it will be a 2D tensor (i.e. a matrix), with instances along the first dimension and features along the second dimension, and we know that the number of features is going to be 28*28 (one per pixel), but we dont know how many instances each training batch will contain. **SO the shape of x is (None, n_inputs). ** Similarily we know that y will be a 1D tesnro, with only one entry per instance, but agian we dont know the size of the training batch, so the shape is (None). 

In [59]:
tf.reset_default_graph()
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64)

Now lets create a NNet. The placeholder x will act as the input layer during the execution phase, it will be replaced wiht one training batch at a time (**note that all instances in a training batch will be processed simultaneously by the neural network)**. The two hidden layers are almost identical: the differ only by the inputs they are connect to and by the number of neurons they contain. **The output layer is also very similar, but it uses a softmax activation function instead of a ReLU activation (this creates the probabilitiy for each output)**. Let's create a neuron_layer() funciton that we will use to create one layer at a time. It will need params to specify inputs, number of neurons, activation function, and the name of the layer.

In [60]:
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="weights")
        b = tf.Variable(tf.zeros([n_neurons]), name="biases")
        z = tf.matmul(X, W) + b
        if activation=="relu":
            return tf.nn.relu(z)
        else:
            return z

Lets go through this line by line:
1. First we create a name scope using the name of the layer: it will contain all the computation nodes for this neuron layer. THIS IS OPTIONAL BUT WILL LOOK MUCH NICER IN TENSORBOARD IF NODES ARE WELL ORGANIZED.
2. Next, we get the number of inputs by **looking up the input matrix's shape and getting the size of the second dimension (the first dimension is for instances (rows)).**
3. W Variable: The next three lines will create a variable W, that will hold the weights matrix. 
    - it will be a 2D tensor (**when they say 2D, I think they mean like, there will be two parameters going into the tuple... so there could be multiple rows and columns, but it comes from a 2D tensor... I think?**) containing all the connection weights between each input and each neuron; hence, its shape will be (n_inputs, n_neurons)
    - the weights matrix will be initialized randomly, using a truncated normal (Guassian) distribution with a standard deviation of 2/np.sqrt(n_inputs). **Using this specific standard deviation helps the algo converge much faster. It is one of those small tweaks to NNETS that has had HUGE benefits. It is important to initialize connection weights randomly for ALL hidden layers to avoid any symmetries that the GD alo would be unable to break. ALSO, Using a "truncated normal distribution" rather than a regular normal distribution ensures that there won’t be any large weights, which could slow down training.**
4. b Variable: the next line creates a b variable for baises, initialized to 0 (so no symmetry isses in this case) with one bias paramater per neuron.
5. Then we create a subgraph to compute z = X · W + b. This vectorized implementation will efficiently compute the weighted sums of the ipnuts plus the bias terms for each and every neuron in the layer, for all the instances in the batch **in just one shot. (ie. matrix multiplication!)**
6. Finally, if the activation param is set to "relu" the code returns relu(z) (i.e. max (0,z) or else it just reutrns zero... **(so I think that the max(0,z) think just replaces all the negative numbers with 0s in the output matrix amirite?!?!?)**

### You can see above that each negative value is replaced by 0s in the relu hidden layers. WOOOHOOO

Okay, so now that we have a nice function to create a neuron layer, lets use it to create the deep neural network!!! 
- The first hidden layer takes x as its input. 
- The second takes the output of the first hidden layer as its input.
- and finally the output layer takes the ouput of the second layer as its input

In [61]:
with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, "hidden1", activation="relu")
    hidden2 = neuron_layer(hidden1, n_hidden2, "hidden2", activation="relu")
    logits = neuron_layer(hidden2, n_outputs, "outputs")

<font color =red> Notice that once again we used a name scope for clarity... Also not that logits is the output for the neural network BEFORE going through the softmax activation function: **for opimization reasons, we will handle softmax computation later**</font>

**Obvi TF comes with many handy functions to create standard neural network layers, so theres often no need to define your own neuron_layer() function like we just did.** For example, TF's fully_connected() functionc reates a fully connected layer, where all inputs are connected to all neurons in the layer. It takes care of creating the weights and biases variables, with proper initialization strategy and it uses the ReLU activation function by default (tho this can be changed using the activation_fn argumenbt). It also supports regularization and normalization params. Lets tweek the preceding code to use 'fully_connected()' instead of our neuron_layer() function.

In [62]:
from tensorflow.contrib.layers import fully_connected

with tf.name_scope("dnn"):
    hidden1 = fully_connected(X,n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, scope="outputs",
                            activation_fn=None)

<font color=red>WARNING:
The tensorflow.contrib package contains many useful functions, but it is a place for experimental code that has not yet
graduated to be part of the main TensorFlow API. So the fully_connected() function (and any other contrib code) may
change or move in the future.

Now tht we have the neural network model ready to go, we need to define the cose function that we will use to train it. Just as we did for Softmax Regression in Ch4, we will use cross entropy. **As we discussed earlier, cross entropy will penalize models that estimate a low probability for the correct target class.** TF provides several functions to compute cross entropy. We will use "sparse_softmax_cross_entropy_with_logits()": <font color=red size=4> This computes the cross entropy based on the "logits" (i.e. the output of the network before going through the softmax activation function), and it expects labels in the form of integers ranging from 0 to the number of the classes minus 1 (in this case from 0 to 9) </font>.

This will give us a 1D tensor containing the cross entropy for each instance. We then use TFs reduce_mean function to compute the mean cross entropy over all instances.

In [63]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

<font color =blue size=5> The sparse_softmax_cross_entropy_with_logits() function is equivalent to applying the softmax activation function and then computing the cross entropy, BUT IT IS MORE EFFICIENT, AND IT PROPELY DEALS WITH CORNER CASES LIKE THE LOGITS EQUAL TO 0. 

There is also softmax_corss_entropy_with_logits() which takes labels in the form of one hot vectors (instead of ints from 0 to n_classes minus 1.

<font color =red> We now have the components needed for our DNN:
- our neural net model
- our cost function

Now we need to define our GradientDescentOptimizer that will tweak the params to minimize the cost funciton. 

In [64]:
learning_rate = 0.01
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    trainig_op = optimizer.minimize(loss)

The last step in the construction phase is to specify how to evaluate the model. We will simply use accuracy as our performance measure. 
- First, for each instance, determine if the neural nets pred is correct by checking whether or not the highest logit corresponds to the target class. **For this you  call call in in_top_k() funciton. This returns a 1D tensor full of boolean values so we need to cast these booleans to floats and then compute the average. This will give us the nerworks overall accuracy.**

In [65]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

And as usual we need to create a node to initialize all variables, and we will also create a Saver to save our trained model params to disk:

In [66]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

#### WHOAH! That was intense. To recap, here are the steps we just went through:
1. created placeholders for the inputs and targers
2. created function to build a neuron layers
3. used step 2 function to create the multilayer DNN
4. defined the cost function for the DNN
5. defined optimizer to tweak params of DNN to reduce cost function
6. lastly we defined performance measure.

Lets put it all together!

In [67]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
tf.reset_default_graph()

#define input params
n_inputs = 28*28 # MNIST
n_hidden1 = 300
n_hidden2 = 10

#set placeholders
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

#create Deep Neural Network fully_connected() function
with tf.name_scope("dnn"):
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, scope="outputs",
                             activation_fn=None)

#calculate loss with cross entropy cost function (w/ built in softmax output layer)
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    
#define Gradient Descent Optimizer
learning_rate = 0.01
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

#specify evaluation metric (accuracy)
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
#set initializer and saver
init = tf.global_variables_initializer()
saver = tf.train.Saver()

# Execution Phase
Execution phase is generally much shorter and simpler. First, load MNIST. TF offers its own helper that fetches data, scales it (between 0 and 1), shuffles it, and provides a simple function to load one mini-batches at a time.

In [68]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data")

Extracting /tmp/data\train-images-idx3-ubyte.gz
Extracting /tmp/data\train-labels-idx1-ubyte.gz
Extracting /tmp/data\t10k-images-idx3-ubyte.gz
Extracting /tmp/data\t10k-labels-idx1-ubyte.gz


Now we define the number of epochs that we want to run, as well as the size of the mini-batches. And then we train the model!

In [69]:
n_epochs = 50
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
        y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)
    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Train accuracy: 0.94 Test accuracy: 0.8697
1 Train accuracy: 0.92 Test accuracy: 0.9041
2 Train accuracy: 0.86 Test accuracy: 0.9189
3 Train accuracy: 0.92 Test accuracy: 0.9278
4 Train accuracy: 0.9 Test accuracy: 0.9353
5 Train accuracy: 0.92 Test accuracy: 0.942
6 Train accuracy: 0.98 Test accuracy: 0.945
7 Train accuracy: 1.0 Test accuracy: 0.9502
8 Train accuracy: 0.96 Test accuracy: 0.9535
9 Train accuracy: 0.98 Test accuracy: 0.9551
10 Train accuracy: 0.96 Test accuracy: 0.956
11 Train accuracy: 0.98 Test accuracy: 0.9595
12 Train accuracy: 0.98 Test accuracy: 0.9599
13 Train accuracy: 0.98 Test accuracy: 0.9621
14 Train accuracy: 0.96 Test accuracy: 0.9639
15 Train accuracy: 0.96 Test accuracy: 0.9635
16 Train accuracy: 0.98 Test accuracy: 0.9654
17 Train accuracy: 0.98 Test accuracy: 0.9654
18 Train accuracy: 0.96 Test accuracy: 0.9672
19 Train accuracy: 0.96 Test accuracy: 0.9682
20 Train accuracy: 0.96 Test accuracy: 0.9684
21 Train accuracy: 0.98 Test accuracy: 0.9693
22 

- This code opens a TensorFlow session, and it runs the init node that initializes all the variables. 
- Then it runs the main training loop: at each epoch, the code iterates through a number of mini-batches that corresponds to the training set size. 
- Each mini-batch is fetched via the next_batch() method, and then the code simply runs the training operation, feeding it the current mini-batch input data and targets. 
- Next, at the end of each epoch, the code evaluates the model on the last mini-batch and on the full training set, and it prints out the result. Finally, the model parameters are saved to disk.

# Using the Neural Network

Now that the nerual network is trained, you can use it to make predictions. To do that, you can reuse the same construction phase, but change the execution phase like this:

In [70]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_new_scaled = mnist.test.images[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)
y_pred

INFO:tensorflow:Restoring parameters from ./my_model_final.ckpt


array([7, 2, 1, 0, 4, 1, 4, 9, 4, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4], dtype=int64)

In [71]:
t = [[1,2,3,4,1000,6,7,8,9],
     [1000,6,3,4,5,6,7,8,10],
     [1,2,3,10,5,6,7,13,1000]]

np.argmax(t,axis=1)

array([4, 0, 8], dtype=int64)

Key steps of prediction code above:
1. First the code loads the model parameters from disk. 
2. Then it loads some new images that you want to classify. **Remember to apply the same feature scaling as for the training data (in this case, scale it from 0 to 1).**
3. Then the code evaluates the logits node. If you wanted to know all the estimated class probabilities, you would need to apply the softmax() function to the logits, but if you just want to predict a class, you can simply pick the class that has the highest logit value (using the argmax() function does the trick).

# Fine-Tuning Neural Network Hyperparameters

The flexibility of neural nets is also one of their main drawbacks (ie. tons of hyperparameters to tweak: <font color =red>**Not only can you use ANY IMAGINABLE NETWORK TOPOLOGY (how neurons are interconnected), but even a simple MLP can change THE NUMBER OF LAYERS, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and MUCH MORE!!!**</font> How do you know what combination of hyperparameters is the best for you task?

Of course, you can use grid search with cross-validaiton, but this will take a shit ton of time with neural nets (you will only be able to explore a tiny part of the hyperparamater space). **It is much better to use RANDOMIZED SEARCH. Another option is to use a tool such as OSCAR, which implements more complex algorithms to help you find a good set of hyperparameters quickly.**

It also helps to have an idea of what values are reasonable for each hyperparameter so you can restrict the search space. Lets start with n_hidden_layers.

## 1. Number of Hidden Layers

For many problems, a single hidden layers will get you reasonable results. **It has actually been shown that a Multi-Layer-Perceptron with just one hidden layer can model even the most complex functions provided it has enough neurons.** For a long time, these facts convinced researchers that there was no need to investigate any deep neural networks. But they overlooked the fact that **deep networks have a much higher PARAMETER EFFICIENCY than shallow ones: <font color=red> they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.**</font>

To understand why,supposed you are asked to draw a forest using some drawing software, but you are forbidden to use copy/paste. You would have to draw each tree individually, branch per brance, leaf per leaf. If you could instead draw one leaf, copy/paste it to draw a branch, then copy/paste that
branch to create a tree, and finally copy/paste this tree to make a forest, you would be finished in no time. **Real-world data is often structured in such a hierarchical way and DNNs automatically take advantage of this fact:** <font color=green size=5> lower hidden layers model low-level structures (e.g., line segments of various
shapes and orientations), intermediate hidden layers combine these low-level structures to model
intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer
combine these intermediate structures to model high-level structures (e.g., faces).</font>

Not only does this hierarchical arcitecture help DNNs converge faster to a good solution, it also improves their abilitiy to generalize to new datasets. **For example, if you have already trained a model to recognize faces in pictures, and you now want to train a new neural network to recognize hairstyles, then you can kickstart training by reusing the lower  layers of the first network.** Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the value of the weights and biases of the lower layers of the first network. This way the network will not have to learn from scratch all the low level structures that occur in most pictures. It will only have to learn the high level structures (e.g. hairstyles).

In summary, for many problems you can start with just one or two hidden layers and it will work just fine (e.g., you can easily reach above 97% accuracy on the MNIST dataset using just one hidden layer with a few hundred neurons, and above 98% accuracy using two hidden layers with the same total amount of neurons, in roughly the same amount of training time). **For more complex problems, you can
gradually ramp up the number of hidden layers, UNTIL YOU START OVERFITTING THE TRAINING SET.** Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers (or even hundreds, but not fully connected ones, as we will see in Chapter 13), and they need a huge amount of training data. However, you will rarely have to train such networks from scratch: it is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task. Training will be a lot faster and require much less data (we will discuss this in Chapter 11).


## 2. Number of Neurons per Hidden Layer

**Obviously, the number of neurons in the INPUT and OUTPUT layers is determined by the type of input and output your task requires.** For example, the MNIST tasks requires:
- 28*28=784 input neurons (which are essentially features) 
- and 10 output neurons

As for the hiddden layers, a common practice is to size them to form a funnel, with fewer and fewer neurons at each layer. The rational begin that **many low level features can coalesce into far fewer high level features. For example, a typical neural network for MNIST may have two hidden layers, the first with 300 neurons and the second with 100. HOWEVER THIS PRACTICE IS NOT AS COMMON THESE DAYS, AND YOU CAN PROBABLY USE THE SAME SIZE FOR ALL HIDDEN LAYERS: for example, all hiddenlayers with 150 nerons. THIS MEANS YOU NOW ONLY HAVE TO CHOOSE ONE NUMBER OF NEURONS PARAM**

<font color =green> One strategy for choosing the number of neurons per hidden layers is to keep make all layers have same number of neurons, and then gradually increase them until the model starts overfitting the training data.<font size=5> in general you will get more bang for your buck by increasing the number of layers than the number of neurons per layer. BUT ITS STILL A BIT OF AN ART </font> </font>

<font color=red>A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting (and other regularization techniques, especially dropout, as we will see in Chapter 11). This has been dubbed the “stretch pants” approach: instead of
wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.


## 3. Activation Functions

**In most cases you can use the ReLU activation function in the hidden layers. It is a bit faster to compute than other activation functions, and Gradient Descent does NOT get stuck as much on plateaus, thanks to the fact that it does not saturate for large input values (as opposed to logistic function or the hyperbolic tangent function which saturate at 1)**.

For the output layer, the softmax activation function is generally a good choice for classification tasks (when classes are mutually exclusive). For regression tasks, you can simply use NO activation function at all. 



# Exercises

#### (1) Draw an ANN using the original artificial neurons (like the ones in Figure 10-3) that computes A ⊕ B (where ⊕ represents the XOR operation). Hint: A ⊕ B = (A ∧ ¬ B) ∨ (¬ A∧ B).

![](pictures/ann3.jpg)


#### (2) Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of linear threshold units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

A classical Perceptron will converge ONLY if the dataset is linearly separable, and it wont be able to estimate probabilities. In contrast, a LOGISTIC REGRESSION CLASSIFIER will converge to a good solution EVEN IF the dataset is NOT LINEARLY SEPARABLY, and it will output class probabilities. If you change the Perceptrons activation function to the logistic activation function (OR SOFTMAX FUNCTION IF THERE ARE MULTIPLE NEURONS), and if you train Gradient Descent (or some other optimization algo minimizng the cost function, typically cross entropy), then it becomes equivalent to a Logistic Regression classifier.

#### (3) Why was the logistic activation function a key ingredient in training the first MLPs?

The logistic activation function was a key ingredient in training the first MLPs because its
derivative is always nonzero, so Gradient Descent can always roll down the slope. When the
activation function is a step function, Gradient Descent cannot move, as there is no slope at
all.

#### (4) Name three popular activation functions. Can you draw them?
- step function
- logistic function
- hyperbolic tangent
- rectified linear unit

![](pictures/deriv.jpg)


#### (5). Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.
**What is the shape of the input matrix X?**
- input X shape: (None, 10), or (m, 10) where m respresents the batch size

**What about the shape of the hidden layer’s weight vector Wh, and the shape of its bias vector bh?**
- **W**h shape: (10,50)
- length of bias vector is 50 (one for each neuron)
- note these wont change with increase in instances!!!!

**What is the shape of the output layer’s weight vector Wo, and its bias vector bo?**
- output  layer weight vector shape: (50, 3)
- output layer weight bias length is 3. 

**What is the shape of the network’s output matrix Y?**
- output shape is (m,3)

**Write the equation that computes the network’s output matrix Y as a function of X, Wh, bh, Wo and bo.**

-**Y = (X · Wh + bh) · Wo + bo** Note that when you are adding a bias vector to a matrix, it is added to every single row in the matrix, which is called broadcasting.



#### (6) How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.
- BINARY CLASSIFICATION, spam or ham: you need 1 neuron in the output layer of the neural network - for example, indicating the probability of that the email is spam. You would typically use the logistic activation function in the output layer when estimtaing probability. 
- MULTI-CLASS CLASSIFICATION, MNIST: you need 10 neurons in the output layer, and you must replace the logistic function with the softmax activation function, which can handle MULTIPLE classes, outputting ONE PROBABILITIY PER CLASS. 
- REGRESSION, predicting housing prices: all you need is one output neuron, using NO ACTIVATION FUNCTION in the output layer.

**7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?**

**Backpropagation is a technique used to train artificial neural networks. It first computes the gradients of the cost function with regards to every model parameter (all the weights and biases), and then it performs a Gradient Descent step using these gradients.** This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that (hopefully) minimize the
cost function. 

<font color =red> To compute the gradients, backpropagation uses reverse-mode autodiff. Reverse-mode autodiff performs a forward pass through a
computation graph, computing every node’s value for the current training batch, and then it performs a reverse pass, computing all the gradients at once (see Appendix D for more details). **So what’s the difference? Well, backpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps, each of which
computes gradients and uses them to perform a Gradient Descent step. In contrast, reversemode autodiff is a simply a technique to compute gradients efficiently, and it happens to be used by backpropagation.**


#### (8) Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

Here is a list of all the hyperparameters you can tweak in a basic MLP: 
- the number of hidden layers, 
- the number of neurons in each hidden layer, 
- and the activation function used in each hidden layer and in the output layer. **In general, the ReLU activation function is a good default for the hidden layers. For the output layer, in general you will want the logistic activation function for binary classification, the softmax activation function for multiclass classification, or no activation function for regression.**

If the MLP overfits the training data, you can try reducing the number of hidden layers and reducing the number of neurons per hidden layer.



#### (9) Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on)

First lets create the deep net. Exactly the same as earlier, with one addition: **we add a tf.summary.scalar() to track the loss and the accuracy during training, we can view nice learning curves using TensorBoard**

In [72]:
import tensorflow as tf
n_inputs = 28*28 #MNIST features
n_hidden1 = 150
n_hidden2 = 150
n_hidden3 = 150
n_outputs = 10

#### Define input data placeholders and layers

In [73]:
tf.reset_default_graph()

#create input data placeholders
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

#crete hidden and output layers
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", 
                              activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2",
                              activation=tf.nn.relu)
    hidden3 = tf.layers.dense(hidden2, n_hidden3, name="hidden3",
                              activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

#### Define loss function and gradient descent

In [74]:
with tf.name_scope("loss"):
    #calculate cross entropy cost function with softmax for each 
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, 
                                                              logits=logits)
    #take mean crossentropy across all observations
    loss = tf.reduce_mean(xentropy, 
                          name="loss")
    #record mean cross entropy (loss)
    loss_summary = tf.summary.scalar('log_loss', 
                                     loss)
    
learning_rate=0.01
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

#### Assess model accuracy

In [75]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits,y,1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    #record accuracy
    accuracy_summary = tf.summary.scalar('accuracy', accuracy)

#### Create function to create time specific log_dirs

In [76]:
from datetime import datetime

def log_dir(prefix=""):
    now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
    root_logdir = "tf_logs"
    if prefix:
        prefix += "-"
    name = prefix + "run-" + now
    return "{}/{}/".format(root_logdir, name)

log_dir('TEST')

'tf_logs/TEST-run-20171229222102/'

In [77]:
#create FileWriter that we use to write TensorBoard logs
logdir = log_dir("MCs_MNIST_DNN")
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

### Run the model!

Hey! Why don't we implement early stopping? For this, we are going to need a validation set. Luckily, the dataset returned by TensorFlow's input_data() function (see above) is already split into a training set (60,000 instances, already shuffled for us), a validation set (5,000 instances) and a test set (5,000 instances). So we can easily define X_valid and y_valid:

In [78]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data")
X_train = mnist.train.images
X_test = mnist.test.images
y_train = mnist.train.labels.astype("int")
y_test = mnist.test.labels.astype("int")

X_valid = mnist.validation.images
y_valid = mnist.validation.labels

print(X_train.shape)
m, n = X_train.shape

Extracting /tmp/data\train-images-idx3-ubyte.gz
Extracting /tmp/data\train-labels-idx1-ubyte.gz
Extracting /tmp/data\t10k-images-idx3-ubyte.gz
Extracting /tmp/data\t10k-labels-idx1-ubyte.gz
(55000, 784)


In [83]:
import os
#initialize initializer and saver
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 10001
batch_size = 50
n_batches = int(np.ceil(m / batch_size))

checkpoint_path = "/tmp/my_deep_mnist_model.ckpt"
checkpoint_epoch_path = checkpoint_path + ".epoch"
final_model_path = "./my_deep_mnist_model"

best_loss = np.infty
epochs_without_progress = 0
max_epochs_without_progress = 50

with tf.Session() as sess:
    if os.path.isfile(checkpoint_epoch_path):
        # if the checkpoint file exists, restore the model and load the epoch number
        with open(checkpoint_epoch_path, "rb") as f:
            start_epoch = int(f.read())
        print("Training was interrupted. Continuing at epoch", start_epoch)
        saver.restore(sess, checkpoint_path)
    else:
        start_epoch = 0
        sess.run(init)

    for epoch in range(start_epoch, n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val, loss_val, accuracy_summary_str, loss_summary_str = sess.run([accuracy, loss, accuracy_summary, loss_summary], feed_dict={X: X_valid, y: y_valid})
        file_writer.add_summary(accuracy_summary_str, epoch)
        file_writer.add_summary(loss_summary_str, epoch)
        if epoch % 5 == 0:
            print("Epoch:", epoch,
                  "\tValidation accuracy: {:.3f}%".format(accuracy_val * 100),
                  "\tLoss: {:.5f}".format(loss_val))
            saver.save(sess, checkpoint_path)
            with open(checkpoint_epoch_path, "wb") as f:
                f.write(b"%d" % (epoch + 1))
            if loss_val < best_loss:
                saver.save(sess, final_model_path)
                best_loss = loss_val
            else:
                epochs_without_progress += 5
                if epochs_without_progress > max_epochs_without_progress:
                    print("Early stopping")
                    break

os.remove(checkpoint_epoch_path)

with tf.Session() as sess:
    saver.restore(sess, final_model_path)
    accuracy_val = accuracy.eval(feed_dict={X: X_test, y: y_test})

Epoch: 0 	Validation accuracy: 90.000% 	Loss: 0.37552
Epoch: 5 	Validation accuracy: 94.840% 	Loss: 0.19159
Epoch: 10 	Validation accuracy: 96.040% 	Loss: 0.14298
Epoch: 15 	Validation accuracy: 96.740% 	Loss: 0.11783
Epoch: 20 	Validation accuracy: 97.140% 	Loss: 0.10232
Epoch: 25 	Validation accuracy: 97.420% 	Loss: 0.09089
Epoch: 30 	Validation accuracy: 97.500% 	Loss: 0.08531
Epoch: 35 	Validation accuracy: 97.780% 	Loss: 0.08001
Epoch: 40 	Validation accuracy: 97.800% 	Loss: 0.07633
Epoch: 45 	Validation accuracy: 97.920% 	Loss: 0.07261
Epoch: 50 	Validation accuracy: 97.980% 	Loss: 0.07212
Epoch: 55 	Validation accuracy: 97.940% 	Loss: 0.07270
Epoch: 60 	Validation accuracy: 98.000% 	Loss: 0.07154
Epoch: 65 	Validation accuracy: 98.020% 	Loss: 0.07194
Epoch: 70 	Validation accuracy: 98.100% 	Loss: 0.07199
Epoch: 75 	Validation accuracy: 98.220% 	Loss: 0.07137
Epoch: 80 	Validation accuracy: 98.080% 	Loss: 0.07253
Epoch: 85 	Validation accuracy: 98.100% 	Loss: 0.07296
Epoch: 90 	V

In [86]:
accuracy_val

0.97860003