#### Setup

In [1]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import tensorflow as tf  # used by reset_graph() below

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "ann"

## From Biological to Artificial Neurons

1. Artificial Neural Networks (ANNs) frequently outperform other machine learning algorithms
2. Large amounts of computing power are needed to train ANNs

### *Biological Neurons*

Neurons structured in successive layers can perform highly complex computations. 

### *Logical Computations with Neurons*

An **artificial neuron** has one or more binary inputs and one binary output. If you assume that such a neuron is activated only when at least two of its inputs are active, then it is possible to construct a network that computes any logical expression. Example: logical AND can be computed if input A and input B each have a single connection to output C. This works because both A and B must be active to activate C.
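The activation rule above can be sketched in a few lines of plain Python. The function names and the `threshold` parameter are illustrative, not from any library, and the OR wiring (two connections per input) is an added assumption that is not described in these notes.

```python
def neuron(*inputs, threshold=2):
    """Return 1 when at least `threshold` of the binary inputs are active."""
    return int(sum(inputs) >= threshold)

def logical_and(a, b):
    # A and B each have one connection to C, so both must be active.
    return neuron(a, b)

def logical_or(a, b):
    # Assumed wiring: A and B each have two connections to C,
    # so a single active input already reaches the threshold.
    return neuron(a, a, b, b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```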

### *The Perceptron*

The **Perceptron** is a simple ANN architecture based on the *linear threshold unit* (LTU), an artificial neuron with weighted numerical inputs and a single output. The LTU first computes a weighted sum of its inputs, $z = w_1 x_1 + \, ... \, + w_n x_n = w^T \cdot x$, then applies a step function to that sum and outputs the result: $h_w(x) = step(z) = step(w^T \cdot x)$. A common step function is the [Heaviside](https://en.wikipedia.org/wiki/Heaviside_step_function) function.

A single LTU can be used to perform simple binary classification. 

A perceptron is simply an ANN composed of a single layer of LTUs. Each LTU gets its input from a layer of *input neurons*, which simply pass their values through, and from a single *bias* neuron that always outputs 1.
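As a minimal sketch of one LTU, the snippet below computes the weighted sum and applies the Heaviside step function. The weights are hypothetical values chosen for illustration, and the bias neuron is modeled as a trailing input that is always 1.

```python
def heaviside(z):
    """Heaviside step function: 0 for negative z, 1 otherwise."""
    return 1 if z >= 0 else 0

def ltu(x, w, activation=heaviside):
    """Linear threshold unit: step(w^T . x)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return activation(z)

# Hypothetical weights; the last weight belongs to the bias input (always 1).
w = [1.0, 1.0, -1.5]

print(ltu([1, 1, 1], w))  # z = 0.5
print(ltu([1, 0, 1], w))  # z = -0.5
```

With these weights the unit outputs 1 only when both real inputs are 1, which is the binary classification behavior described above.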

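Perceptron learning is inspired by Hebb's rule, often summarized as "Cells that fire together, wire together": the connection weight between two neurons is increased whenever they have the same output. Perceptrons are trained with a variant of this rule that takes the network's error into account, so connections that lead to a wrong output are not reinforced.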

**Perceptron learning rule**:

$\large w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \, (y_j - \hat{y}_j) x_i$

where: 
* $w_{i,j}$ is the connection weight between the i-th input neuron and the j-th output neuron
* $x_i$ is the i-th input of the current training instance
* $\hat{y}_j$ is the output of the j-th output neuron for the current training instance
* $y_j$ is the target output of the j-th output neuron for the current training instance
* $\eta$ is the learning rate

The decision boundary of each output neuron is linear, so perceptrons are incapable of learning complex patterns. However, by the *perceptron convergence theorem*, if the training instances are [linearly separable](https://en.wikipedia.org/wiki/Linear_separability) then the algorithm will converge to a solution.
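To make the learning rule concrete, here is a small pure-Python sketch that trains a single-output perceptron on the logical AND problem, which is linearly separable. Each weight is updated by adding the learning rate times (target minus output) times the input. The helper names, the dataset, and the choice of `eta=1` with zero initial weights are assumptions made for this illustration.

```python
def step(z):
    return 1 if z >= 0 else 0

def predict(w, x):
    return step(sum(w_i * x_i for w_i, x_i in zip(w, x)))

def train_perceptron(samples, eta=1, max_epochs=100):
    """Apply the perceptron rule one instance at a time until an epoch has no errors."""
    w = [0] * len(samples[0][0])
    for _ in range(max_epochs):
        n_errors = 0
        for x, y in samples:
            error = y - predict(w, x)  # target minus output
            if error != 0:
                n_errors += 1
                w = [w_i + eta * error * x_i for w_i, x_i in zip(w, x)]
        if n_errors == 0:  # every instance is classified correctly
            break
    return w

# Each instance ends with a constant 1 that plays the role of the bias neuron.
and_samples = [([0, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]

w = train_perceptron(and_samples)
print(w, [predict(w, x) for x, _ in and_samples])
```

Because the AND instances are linearly separable, the loop stops after a handful of epochs, as the convergence theorem predicts.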

Scikit-Learn comes with a built-in implementation of the perceptron, the `Perceptron` class:

In [3]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # np.int was removed from recent NumPy releases

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

print(per_clf.predict([[2, 0.5]]))

[1]


Many of the weaknesses of perceptrons can be eliminated if you simply stack multiple layers of them (a **Multi-Layer Perceptron (MLP)**).

### *Multi-Layer Perceptron and Backpropagation*

MLP: one passthrough input layer, one or more layers of LTUs called *hidden layers*, and one final layer of LTUs called the *output layer*. Every layer except the output layer has a bias neuron. When an ANN has two or more hidden layers, it's called a *deep neural network* (DNN).

For each training instance the *backpropagation algorithm* first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step). The backpropagation algorithm cannot be used with the Heaviside step function, because that function is flat everywhere except at the jump and so provides no gradient to work with. It can however use any of the following:
* [ReLU Function](https://en.wikipedia.org/wiki/Rectifier_(neural_networks))
* [Logistic Function](https://en.wikipedia.org/wiki/Logistic_function)
* [Hyperbolic Tangent Function](http://mathworld.wolfram.com/HyperbolicTangent.html)
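
The difference between the step function and the three alternatives can be checked numerically. The sketch below estimates each function's derivative with a central finite difference; the helper `numerical_derivative` and the evaluation point `z = 0.5` are arbitrary choices for illustration.

```python
import math

def heaviside(z):
    return 1.0 if z >= 0 else 0.0

def relu(z):
    return max(0.0, z)

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

activations = [("heaviside", heaviside), ("relu", relu),
               ("logistic", logistic), ("tanh", math.tanh)]

for name, f in activations:
    print(name, "value:", f(0.5), "slope:", numerical_derivative(f, 0.5))
```

The Heaviside slope is zero away from the jump, so Gradient Descent gets no signal, while the other three functions have a usable nonzero slope at this point.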

## Training an MLP with TensorFlow's High Level API

The simplest way to train an MLP with TensorFlow is to use its high-level TF.Learn API. The following code trains an MLP with two hidden layers (300 and 100 neurons, respectively) and a softmax output layer with 10 neurons:

In [4]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [5]:
X_train = mnist.train.images
X_test = mnist.test.images
y_train = mnist.train.labels.astype("int")
y_test = mnist.test.labels.astype("int")

In [6]:
import tensorflow as tf

config = tf.contrib.learn.RunConfig(tf_random_seed=42) # not shown in the config

feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300,100], n_classes=10,
                                         feature_columns=feature_cols, config=config)
dnn_clf = tf.contrib.learn.SKCompat(dnn_clf) # if TensorFlow >= 1.1
dnn_clf.fit(X_train, y_train, batch_size=50, steps=40000)

INFO:tensorflow:Using config: {'_tf_random_seed': 42, '_task_type': None, '_keep_checkpoint_every_n_hours': 10000, '_save_checkpoints_secs': 600, '_is_chief': True, '_save_checkpoints_steps': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000014EF536F0F0>, '_model_dir': None, '_evaluation_master': '', '_keep_checkpoint_max': 5, '_save_summary_steps': 100, '_num_ps_replicas': 0, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_master': '', '_num_worker_replicas': 0, '_task_id': 0, '_environment': 'local'}
Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\

INFO:tensorflow:global_step/sec: 452.507
INFO:tensorflow:step = 7001, loss = 0.0166421 (0.221 sec)
INFO:tensorflow:global_step/sec: 462.027
INFO:tensorflow:step = 7101, loss = 0.00448675 (0.229 sec)
INFO:tensorflow:global_step/sec: 430.203
INFO:tensorflow:step = 7201, loss = 0.0508161 (0.220 sec)
INFO:tensorflow:global_step/sec: 432.814
INFO:tensorflow:step = 7301, loss = 0.00550252 (0.229 sec)
INFO:tensorflow:global_step/sec: 495.617
INFO:tensorflow:step = 7401, loss = 0.0154965 (0.216 sec)
INFO:tensorflow:global_step/sec: 464.523
INFO:tensorflow:step = 7501, loss = 0.004822 (0.200 sec)
INFO:tensorflow:global_step/sec: 464.782
INFO:tensorflow:step = 7601, loss = 0.0131625 (0.215 sec)
INFO:tensorflow:global_step/sec: 462.137
INFO:tensorflow:step = 7701, loss = 0.00680886 (0.216 sec)
INFO:tensorflow:global_step/sec: 449.153
INFO:tensorflow:step = 7801, loss = 0.00376618 (0.223 sec)
INFO:tensorflow:global_step/sec: 435.125
INFO:tensorflow:step = 7901, loss = 0.0131479 (0.231 sec)
INFO:te

INFO:tensorflow:global_step/sec: 445.94
INFO:tensorflow:step = 15201, loss = 0.00159225 (0.224 sec)
INFO:tensorflow:global_step/sec: 484.181
INFO:tensorflow:step = 15301, loss = 0.0014488 (0.207 sec)
INFO:tensorflow:global_step/sec: 462.529
INFO:tensorflow:step = 15401, loss = 0.00376972 (0.234 sec)
INFO:tensorflow:global_step/sec: 411.421
INFO:tensorflow:step = 15501, loss = 0.00406733 (0.228 sec)
INFO:tensorflow:global_step/sec: 414.105
INFO:tensorflow:step = 15601, loss = 0.00472252 (0.241 sec)
INFO:tensorflow:global_step/sec: 411.989
INFO:tensorflow:step = 15701, loss = 0.0146699 (0.243 sec)
INFO:tensorflow:global_step/sec: 410.582
INFO:tensorflow:step = 15801, loss = 0.00153426 (0.244 sec)
INFO:tensorflow:global_step/sec: 408.716
INFO:tensorflow:step = 15901, loss = 0.000447847 (0.245 sec)
INFO:tensorflow:global_step/sec: 428.012
INFO:tensorflow:step = 16001, loss = 0.00656253 (0.233 sec)
INFO:tensorflow:global_step/sec: 490.373
INFO:tensorflow:step = 16101, loss = 0.0043496 (0.21

INFO:tensorflow:step = 23301, loss = 0.00169613 (0.216 sec)
INFO:tensorflow:global_step/sec: 499.098
INFO:tensorflow:step = 23401, loss = 0.000720154 (0.216 sec)
INFO:tensorflow:global_step/sec: 462.592
INFO:tensorflow:step = 23501, loss = 0.000805685 (0.201 sec)
INFO:tensorflow:global_step/sec: 437.162
INFO:tensorflow:step = 23601, loss = 0.000631053 (0.232 sec)
INFO:tensorflow:global_step/sec: 439.401
INFO:tensorflow:step = 23701, loss = 0.000255326 (0.228 sec)
INFO:tensorflow:global_step/sec: 441.806
INFO:tensorflow:step = 23801, loss = 0.00180432 (0.226 sec)
INFO:tensorflow:global_step/sec: 493.99
INFO:tensorflow:step = 23901, loss = 0.000976085 (0.199 sec)
INFO:tensorflow:global_step/sec: 462.591
INFO:tensorflow:step = 24001, loss = 0.00113664 (0.216 sec)
INFO:tensorflow:global_step/sec: 499.03
INFO:tensorflow:step = 24101, loss = 0.000741892 (0.200 sec)
INFO:tensorflow:global_step/sec: 463.635
INFO:tensorflow:step = 24201, loss = 0.000928554 (0.216 sec)
INFO:tensorflow:global_ste

INFO:tensorflow:step = 31401, loss = 0.000649023 (0.200 sec)
INFO:tensorflow:global_step/sec: 458.962
INFO:tensorflow:step = 31501, loss = 0.000197947 (0.220 sec)
INFO:tensorflow:global_step/sec: 488.868
INFO:tensorflow:step = 31601, loss = 6.59552e-05 (0.202 sec)
INFO:tensorflow:global_step/sec: 500.649
INFO:tensorflow:step = 31701, loss = 0.000625275 (0.200 sec)
INFO:tensorflow:global_step/sec: 462.219
INFO:tensorflow:step = 31801, loss = 9.4202e-05 (0.216 sec)
INFO:tensorflow:global_step/sec: 499.543
INFO:tensorflow:step = 31901, loss = 0.000756386 (0.200 sec)
INFO:tensorflow:global_step/sec: 463.935
INFO:tensorflow:step = 32001, loss = 0.000197061 (0.216 sec)
INFO:tensorflow:global_step/sec: 500.246
INFO:tensorflow:step = 32101, loss = 0.000425281 (0.216 sec)
INFO:tensorflow:global_step/sec: 464.192
INFO:tensorflow:step = 32201, loss = 0.00102074 (0.200 sec)
INFO:tensorflow:global_step/sec: 462.823
INFO:tensorflow:step = 32301, loss = 0.00041044 (0.216 sec)
INFO:tensorflow:global_s

INFO:tensorflow:global_step/sec: 498.677
INFO:tensorflow:step = 39501, loss = 0.000241011 (0.201 sec)
INFO:tensorflow:global_step/sec: 464.027
INFO:tensorflow:step = 39601, loss = 0.000744708 (0.216 sec)
INFO:tensorflow:global_step/sec: 497.95
INFO:tensorflow:step = 39701, loss = 0.000178735 (0.218 sec)
INFO:tensorflow:global_step/sec: 449.325
INFO:tensorflow:step = 39801, loss = 0.00115475 (0.205 sec)
INFO:tensorflow:global_step/sec: 500.747
INFO:tensorflow:step = 39901, loss = 0.00126455 (0.200 sec)
INFO:tensorflow:Saving checkpoints for 40000 into C:\Users\hugoj\AppData\Local\Temp\tmpims7vzmp\model.ckpt.
INFO:tensorflow:Loss for final step: 0.000402969.


SKCompat()

In [7]:
from sklearn.metrics import accuracy_score

y_pred = dnn_clf.predict(X_test)
accuracy_score(y_test, y_pred['classes'])

INFO:tensorflow:Restoring parameters from C:\Users\hugoj\AppData\Local\Temp\tmpims7vzmp\model.ckpt-40000


0.98209999999999997

## Training a DNN Using Plain TensorFlow

If you want more control over the structure of your NN, use TensorFlow's lower-level Python API to build and train your network.

### *Construction Phase*

Start by importing TensorFlow and specifying the number of inputs, outputs, and neurons per hidden layer:

In [8]:
import tensorflow as tf

n_inputs = 28*28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

Specify your placeholder nodes for the inputs. We don't know how many instances each training batch will contain, so leave the number of rows as **None**. The same is true for the label placeholder.

In [9]:
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

Let's define a function that will handle creating the NN layers for us:

In [10]:
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name): # Create a scope for this layer so that it looks nice on TensorBoard
        n_inputs = int(X.get_shape()[1]) # Determine input number by looking at 2nd dimension of input
        
        # Create a randomly initialized variable W that holds all the weights for the layer
        stddev = 2 / np.sqrt(n_inputs)  
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        
        # Create one bias parameter per neuron; these can be set to 0 with no symmetry issues
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        
        # Node to compute X * W + b
        Z = tf.matmul(X, W) + b
        
        if activation is not None:
            return activation(Z)
        else:
            return Z
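
To see what a layer like this computes without running TensorFlow, here is a pure-Python sketch of the same forward pass, Z = X · W + b followed by an optional activation. The function `dense_layer` and the small demo matrices are hypothetical and exist only for this illustration.

```python
def relu(z):
    return max(0.0, z)

def dense_layer(X, W, b, activation=None):
    """X: list of instances (rows); W: n_inputs x n_neurons; b: one bias per neuron."""
    outputs = []
    for row in X:
        # Weighted sum of the inputs for each neuron j, plus that neuron's bias.
        z = [sum(x_i * W[i][j] for i, x_i in enumerate(row)) + b[j]
             for j in range(len(b))]
        outputs.append([activation(v) for v in z] if activation else z)
    return outputs

X_demo = [[1.0, 2.0], [0.0, -1.0]]               # 2 instances, 2 inputs
W_demo = [[1.0, -1.0, 0.5], [0.5, 1.0, -1.0]]    # 2 inputs, 3 neurons
b_demo = [0.0, 0.0, 0.0]

print(dense_layer(X_demo, W_demo, b_demo, activation=relu))
# → [[2.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```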

In [11]:
with tf.name_scope("dnn"):
    # first layer takes X as input
    hidden1 = neuron_layer(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    
    # takes the 1st hidden layer as input
    hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    
    # takes the 2nd hidden layer as input 
    logits = neuron_layer(hidden2, n_outputs, name="outputs")

Alternatively, we can have TensorFlow do everything our function does. `fully_connected()` creates the weights and biases for us and uses the ReLU activation function by default:

In [12]:
# Note: tensorflow.contrib is for experimental code that has yet to be fully added to the api 
from tensorflow.contrib.layers import fully_connected

with tf.name_scope("dnn"):
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    logits = fully_connected(hidden2, n_outputs, scope="outputs", activation_fn=None)

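For the error, we will use cross entropy. `sparse_softmax_cross_entropy_with_logits()` computes the cross entropy directly from the logits (the output of the network **before** the softmax function) and expects the labels as integer class indices. It returns a 1D tensor containing the cross entropy for each instance, which we then average with `reduce_mean()` to get the mean cross entropy over all instances: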

In [13]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
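
As a rough sketch of the quantity being computed here, the plain-Python function below evaluates the cross entropy of one instance from its logits and integer label, using the log-sum-exp form so that no explicit softmax is needed. The function name and the tiny batch are made up for illustration and are not TensorFlow's implementation.

```python
import math

def sparse_softmax_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed as log-sum-exp minus the target logit."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
batch_labels = [0, 2]

# One cross-entropy value per instance (the "1D tensor"), then the mean.
xentropy = [sparse_softmax_cross_entropy(row, target)
            for row, target in zip(batch_logits, batch_labels)]
loss = sum(xentropy) / len(xentropy)
print(xentropy, loss)
```

The second instance has three equal logits, so its cross entropy is log(3), the loss of a uniform guess over three classes.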

Our gradient descent optimizer is similar to the one used in Chapter 9: 

In [14]:
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
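
For intuition about what each training step does, the sketch below applies the plain Gradient Descent update (parameter minus learning rate times gradient) to a one-dimensional toy function. The function `gradient_descent` and the quadratic objective are assumptions for illustration; TensorFlow computes the gradients of the real loss automatically.

```python
def gradient_descent(grad, theta, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient."""
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2 * (t - 3.0), theta=0.0)
print(theta_min)  # approaches the minimum at theta = 3
```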

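Now we specify how to evaluate the model. We will use accuracy. First we need to check whether the network predicted the target label correctly, that is, whether the highest logit corresponds to the target class. **in_top_k()** does this and returns a 1D tensor of booleans, which we cast to floats and average: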

In [15]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
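
The same evaluation logic can be written out in plain Python. The `in_top_k` below is a simplified stand-in written for this note (it counts how many logits are strictly larger than the target's logit), and the logits and targets are made-up values.

```python
def in_top_k(logits_row, target, k=1):
    """True if the target class's logit is among the k largest logits."""
    n_larger = sum(1 for v in logits_row if v > logits_row[target])
    return n_larger < k

batch_logits = [[0.1, 2.3, -0.5], [1.2, 0.3, 0.9], [0.0, 0.1, 3.0]]
batch_targets = [1, 2, 2]

correct = [in_top_k(row, target) for row, target in zip(batch_logits, batch_targets)]
accuracy = sum(float(c) for c in correct) / len(correct)  # cast booleans, then average
print(correct, accuracy)
```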

Lastly, we create a node to initialize all variables and a `Saver` to save the trained model parameters to disk:

In [16]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

### *Execution Phase*

Import the data and set your training parameters:

In [18]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/") # Data is automatically scaled for us 
n_epochs = 40
batch_size = 50

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


Next train your model:

In [20]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                            y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Train accuracy: 0.9 Test accuracy: 0.9036
1 Train accuracy: 0.88 Test accuracy: 0.9218
2 Train accuracy: 0.9 Test accuracy: 0.9307
3 Train accuracy: 0.92 Test accuracy: 0.9378
4 Train accuracy: 0.92 Test accuracy: 0.9434
5 Train accuracy: 0.94 Test accuracy: 0.947
6 Train accuracy: 0.96 Test accuracy: 0.9509
7 Train accuracy: 0.94 Test accuracy: 0.954
8 Train accuracy: 0.96 Test accuracy: 0.958
9 Train accuracy: 0.96 Test accuracy: 0.9604
10 Train accuracy: 0.96 Test accuracy: 0.9619
11 Train accuracy: 0.92 Test accuracy: 0.9645
12 Train accuracy: 0.98 Test accuracy: 0.9666
13 Train accuracy: 0.96 Test accuracy: 0.9648
14 Train accuracy: 1.0 Test accuracy: 0.9683
15 Train accuracy: 1.0 Test accuracy: 0.9688
16 Train accuracy: 1.0 Test accuracy: 0.9691
17 Train accuracy: 0.94 Test accuracy: 0.97
18 Train accuracy: 1.0 Test accuracy: 0.9706
19 Train accuracy: 1.0 Test accuracy: 0.9712
20 Train accuracy: 1.0 Test accuracy: 0.9713
21 Train accuracy: 1.0 Test accuracy: 0.9718
22 Train acc
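
The inner loop above draws one mini-batch per iteration, so each epoch runs `num_examples // batch_size` training steps. A minimal sketch of that batching scheme follows. The generator `iterate_minibatches`, the shuffling, and the dataset size of 1000 are assumptions chosen for illustration; this is not the code behind `next_batch()`.

```python
import random

def iterate_minibatches(n_examples, batch_size, rng):
    """Yield lists of shuffled instance indices, one list per mini-batch."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    # Drop the final partial batch, mirroring the integer division in the loop above.
    for start in range(0, n_examples - batch_size + 1, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(42)
batches = list(iterate_minibatches(n_examples=1000, batch_size=50, rng=rng))
print(len(batches), len(batches[0]))  # 1000 // 50 batches of 50 indices each
```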

### *Using the Neural Network*

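To use your newly trained NN, keep the construction phase and change only the execution phase: restore the saved parameters, feed in new (scaled) instances, and pick the class with the highest logit for each one: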

In [21]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt") # or better, use save_path
    X_new_scaled = mnist.test.images[:20]
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

INFO:tensorflow:Restoring parameters from ./my_model_final.ckpt
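
Taking the argmax of the logits is enough to get the predicted class, because softmax preserves their ordering. If you also want class probabilities, apply softmax to the logits. The sketch below shows both in plain Python; the helper names and the logit values are hypothetical.

```python
import math

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores):
    """Index of the largest score (what np.argmax returns for one row)."""
    return max(range(len(scores)), key=lambda i: scores[i])

z = [1.5, -0.3, 4.2, 0.0]  # made-up logits for one instance
probs = softmax(z)
print(predict_class(z), predict_class(probs), probs)
```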


## Fine-Tuning Neural Network Hyperparameters

If you want to find the optimum hyperparameters for your NN, you're better off conducting a randomized search or using an external tool rather than trying every combination. A NN has so many hyperparameters that the number of possible combinations is too high to explore exhaustively.
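
A randomized search can be sketched with the standard library alone. Everything in this snippet is an assumption for illustration: the hyperparameter ranges, the function names, and especially `toy_evaluate`, which merely stands in for "train the network and return its validation accuracy".

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one random configuration (ranges are arbitrary examples)."""
    return {
        "n_hidden_layers": rng.randint(1, 5),
        "n_neurons": rng.choice([50, 100, 200, 300]),
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform sampling
        "batch_size": rng.choice([32, 50, 128]),
    }

def random_search(evaluate, n_trials, seed=42):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = sample_hyperparameters(rng)
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

def toy_evaluate(params):
    # Placeholder score that prefers learning rates near 0.01;
    # a real version would train and validate a network here.
    return -abs(math.log10(params["learning_rate"]) + 2)

best_params, best_score = random_search(toy_evaluate, n_trials=20)
print(best_params, best_score)
```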

### *Number of Hidden Layers*

For many problems a single hidden layer gives reasonable results. However, deep networks have a much higher *parameter efficiency* than shallow ones: they can model complex functions using far fewer neurons. Therefore start with one or two hidden layers and keep adding more until your DNN starts to overfit the training data.

The lower hidden layers capture low-level structures, the intermediate layers combine them into intermediate-level structures, and the highest layers put it all together to model the high-level structure of your data. This is why you can initialize the lower layers of a new network with the weights of a previously trained NN that tackles a similar dataset.

### *Number of Neurons per Hidden Layer*

Try to form a funnel of sorts when structuring your network. Each layer should have fewer neurons than the one before it, since many low-level features can combine into far fewer high-level features.

Typically it's better to add more layers than it is to add more neurons per layer.
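
To get a feel for the size of a funnel-shaped network, the sketch below counts the trainable parameters (weights plus biases) of fully connected layers for the 784-300-100-10 architecture used earlier in these notes. The helper `count_parameters` is a hypothetical name.

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each consecutive layer pair."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

funnel = [28 * 28, 300, 100, 10]  # inputs, hidden1, hidden2, outputs
print(count_parameters(funnel))  # → 266610
```

Most of the parameters sit in the first layer (784 × 300 weights), which is typical of a funnel shape.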

### *Activation Functions*

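ReLU is a good default for the hidden layers in most cases: it is fast to compute, and Gradient Descent (GD) doesn't tend to get stuck on plateaus as often because ReLU does not saturate for large input values.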