# Intro to Artificial Neural Networks

*Artificial Neural Networks* (ANNs) are inspired by the architecture of the brain. ANNs are at the core of Deep Learning. They're versatile, powerful, and scalable, making them ideal to tackle large and highly complex ML tasks such as classifying billions of images (e.g. Google Images), powering speech recognition services (e.g. Siri, Cortana, etc), recommending the best videso to watch to hundreds of millions of users daily (e.g. the YouTube algorithm), or learning to beat the world champion at the game of *Go* by examining millions of paast games and then playing against istelf (DeepMind's AlphaGo).

This chapter will introduce ANNs, starting with a quick tour of the first ANN architectures. We'll then present *Multi-Layer Perceptrons* (MLPs) and implement one using TensorFlow to tackle the MNIST digit classification problem.

## From Biological to Artificial Neurons

ANNs are surprisingly old; they were first introduced in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts in their landmark paper ["A Logical Calculus of Ideas Immanent in Nervous Activity"](https://goo.gl/Ul4mxW). They presented a simplified computational model of how biological neurons may work together in animal brains to perform complex computations using *propositional logic*. This was the first ANN architecture, and since then many others have been invented.

The early successes of ANNS till the 60s led to the belief that truly intelligent machines would exist. The funding went elsewhere when people realized that that dream wouldn't be feasible, but in the early 80s there was a revival of interest in ANNs as the new network architectures were invented and better training techniques were developed. By the 90s, powerful alternative ML techniques such as SVMs were favored by most researchers as they seemed to offer better results and stronger theoretical foundations.

We're in another ANN renaissance, but this one may last because:

* There's a __huge__ quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.

* The tremendous increase in computing power since the 90s now makes it possible to train large neural nets in a reasonable amount of time. This is partially due to Moore's Law, but also due to powerful GPUs being developed by the gaming industry.

* The training algos have improved. They're really only slightly different than the ones from the 90s, but those tweaks have a huge positive impact.

* Some theoretical limits of ANNs have turned out to be benign in practice. (One example is how people thought that training algos would get stuck at local optima but that's rather rare in practice).

* ANNs seem to have entered a virtuous circle of funding and progress. Amazing products based on ANNs regularly make the headline news, which pulls more attention and funding towards them, resulting in more progress and products.

### The Perceptron

*Perceptrons* are one of the simplest ANN architectures. They were invented in 1957 by Frank Rosenblatt (yup, that guy). It's blased on an artificial neuron known as a *Linear Threshold Unit* (LTU); the inputs and outputs are numbers and each input connection is associated with a weight. The LTU computes a weighted sum of the inputs ($z = w_1x_1 + w_2x_2 + \cdots + w_nx_n = \textbf{w}^T \cdot \textbf{x}$), then applies a *step function* to that sum and outputs the result: $h_w(\textbf{x}) = \text{step}(z) = \text{step}(\textbf{w}^T \cdot \textbf{x})$

The most common step function used in Perceptrons is the *Heaviside step function* given in the next equation:

$$\text{heaviside }(z) = \left\{\begin{array}{ll} 0 &\text{ if } z \lt 0 \\ 1 &\text{ if } z \geq 0\end{array}\right.$$

Another common function is the *sign function* given below:

$$\text{sgn }(z) = \left\{\begin{array}{ll} -1 &\text{ if } z \lt 0 \\ 0 &\text{ if } z = 0 \\ 1 &\text{ if } z \gt 0\end{array}\right.$$

A single LTU can be used for simple linear binary calssification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class (else outputs the negative class). This works like a Logistic Regression classifier or a Linear SVM. You could use a single LTU to classify iris flowers based on petal length and width (with an extra bias feature $x_0 = 1$ like in the previous chapters). Training an LTU means finding the right values for $w_0, w_1, \text{ and } w_2$.

A Perceptron is a single layer of LTUs with each neuron connected to all of the inputs. These connections are often represented using special pass-through neurons called *input neurons*; they just output whatever input they're fed. Moreover, an extra bias feature is generally added ($x_0 = 1$). This bias feature is typically represented using a special type of neuron called a *bias neuron*, which just outputs 1 all of the time.

The first training algo for Perceptrons (proposed by Rosenblatt) was largely inspired by *Hebb's rule*. In his book *The Organization of Behavior*, Hebb suggested that when a biologial neuron triggers another neuron, the connection between these two neurons grows stronger. This rule became known as *Hebbian learning*; that is, the connection weight between two neurons is increased whenever they have the same output. Perceptrons are trained using a variant of this rule that accounts for the error made by the network (it doesn't reinforce connections that lead to the wrong output). The equation is given below:

$$w_{i, j}^{(\text{next step})} = w_{i, j} + \eta\Big(y_j - \hat{y}_j\Big)x_i$$

* $w_{i, j}$ is the connection weight between the i<sup>th</sup> input neuron and the j<sup>th</sup> output nuron.
* $x_i$ is the i<sup>th</sup> input value of the current training instance.
* $\hat{y}_j$ is the output of the j<sup>th</sup> output neuron for the current training instance.
* $y_i$ is the target output of the j<sup>th</sup> output neuron for the current training instance.
* $\eta$ is the learning rate.

The decision boundary of each output neron is linear, so Perceptrons are incapable of learning complex patters (just like Logistic Regression classifiers). However, if the traiing instanes are linearly separable, Rosenblatt demonstrated that this algo will converge to a solution. This is called the *Perceptron convergence theorem*.

Scikit-Learn provides a `Perceptron` class that implements a single LTU network. It can be used exactly as expected (here using the iris dataset):

In [1]:
%matplotlib inline

import os
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

  return f(*args, **kwds)


In [2]:
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)] # petal length, petal width
y = (iris.target == 0).astype(np.int) # Iris-Setosa?

per_clf = Perceptron(random_state=42, max_iter=100)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])
y_pred

array([1])

Perceptrons strongly resemble Stochastic Gradient Descent, and sklearn's `Perceptron` class is equivalent to the `SGDClassifier` class with the following hyperparams: `loss='perceptron'`, `learning_rate='constant'`, `eta0=1` (learning rate), and `penalty=None` (no regularization).

Contrary to logistic regression classifiers, Perceptrons don't output a class probability; rather, they make predictions based on a hard threshold. This is one of the good reasons to prefer Logistic Regression over Perceptrons.

Many of the limits of Perceptrons (like that they can't solve some trivial problems) are eliminated by stacking multiple Perceptrons. The resulting ANN is known as a *Multi-Layer Perceptron* (MLP). An MLP is capable of solving the XOR problem.

### Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) input layer, one or more layers of LTUs called *hidden layers*, and one final layer of LTUs called the *output layer*. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an ANN has two or more hidden layers, it is called a *deep neural network* (DNN).

Researchers struggled to find a way to train MLPs without success for years, but in 1986, D.E. Rumelhart et al. published a [groundbreaking article](https://goo.gl/Wl7Xyc) introducing *backpropagation* (we know it today as Gradient Descent using reverse-mode autodiff; Gradient Descent was introduced in chapter 4 and autodiff was introduced in chapter 9)

For each training instance, the algo feeds it to the network and computes the output of each neuron in each consecutive layer (this is the forward pass). It then measures the network's output error (i.e. the difference between the desired output and the actual output of the network) and it computes how much each neuron in the last hidden layer contributed to each output neuron's error. It then proceeds to measure how much of these error contributions came from each neuron in the previous hidden layer–and so on until the algo reaches the input layer. This revese pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward in the network (hence the name).

If you check out the reverse-mode autodiff algo in the book in Appendix D, you'll find that the forward and reverse passses of backprop simply perform this autodiff. The last step of backprop is a Gradient Descent step on all the connection weights in the network using the error gradients measured earlier.

In order for this algo to work, the authors made a key change to the MLP architecture: they replaced the step function with the logistic function $\sigma(z) = \frac{1}{1 + \exp(-z)}$ This was essential because the step function contains only flat segments (so there isn't a gradient) while the logistic function has a well-defined, nonzero derivative everywhere. The backprop algo can be used with other *activiation functions* instead of the logistic function. Two other popular ones include:

* *The hyperbolic tangent function $tanh(z) = 2\sigma(2z) - 1$*

Just like the logistic function, it's S-shaped, continuous, and differentiable, but its output value ranges from -1 to 1 (instead of 0 to 1 like in logistic function) which tends to make each layer's output more or less normalized (i.e. centered around 0) at the beginning of training. This often helps speed up convergence.
   
* *The ReLU function (introduced in Chapter 9)*

$\text{ReLU}(z) = \text{ max }(0, z)$ is continuous but unfortunately not differentiable at $z = 0$ (the slope changes abruptly, which can make Gradient Descent bounce around). It works very well in practice and has the advantage of being fast to compute. Most importatly, the fact that it doesn't have a maximum output value also helps reduce some issues during Gradient Descent (we'll revisit this in the next chapter)

An MLP is often used for classification, with each output corresponding to a different binary class. When the classes are exclusive (like digits 0-9), the output layer is typically modified by replacing the individual activiation functions by a shared *softmax* function. The softmax function was introduced back in Chapter 4. The output of each neuron then corresponds to the estimated probability of the corresponding class. Signal only flows one-way, making this architecture an example of a *feedforward neural network* (FNN).

*Note: biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time. Turns out that the ReLU activation function generally works better in ANNs though.*

## Training an MLP with TensorFlow's High-Level API

The simplest way to train an MLP with TensorFlow is to use the high-level API tf.learn, which offers a sklearn-compatible API. The `DNNClassifier` class makes it fairly easy to train a deep neural network with any number of hidden layers and a softmax output layer to output estimated class probabilities. For example, the following code trains a DNN for classification with two hidden layers (one with 300 neurons, one with 100) and a softmax output layer with 10 neurons:

In [3]:
# First, separate the data
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

In [4]:
feature_cols = [tf.feature_column.numeric_column('X', shape=[28 * 28])]
dnn_clf = tf.estimator.DNNClassifier(hidden_units=[300, 100], n_classes=10, feature_columns=feature_cols)
input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_train}, y=y_train, num_epochs=40, batch_size=50,
                                              shuffle=True)
dnn_clf.train(input_fn=input_fn)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/pz/0k_47k855d194vh0354xcvrc0000gn/T/tmplpcyceck', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x107d22a58>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 i

INFO:tensorflow:global_step/sec: 483.199
INFO:tensorflow:loss = 0.30129975, step = 7201 (0.207 sec)
INFO:tensorflow:global_step/sec: 486.739
INFO:tensorflow:loss = 0.16883413, step = 7301 (0.205 sec)
INFO:tensorflow:global_step/sec: 485.15
INFO:tensorflow:loss = 1.0265357, step = 7401 (0.206 sec)
INFO:tensorflow:global_step/sec: 485.923
INFO:tensorflow:loss = 0.6697571, step = 7501 (0.206 sec)
INFO:tensorflow:global_step/sec: 403.799
INFO:tensorflow:loss = 0.05456145, step = 7601 (0.248 sec)
INFO:tensorflow:global_step/sec: 479.235
INFO:tensorflow:loss = 0.30643904, step = 7701 (0.209 sec)
INFO:tensorflow:global_step/sec: 413.284
INFO:tensorflow:loss = 0.14411016, step = 7801 (0.243 sec)
INFO:tensorflow:global_step/sec: 420.941
INFO:tensorflow:loss = 1.7541127, step = 7901 (0.237 sec)
INFO:tensorflow:global_step/sec: 466.814
INFO:tensorflow:loss = 0.24501503, step = 8001 (0.214 sec)
INFO:tensorflow:global_step/sec: 401.829
INFO:tensorflow:loss = 0.26312822, step = 8101 (0.250 sec)
INFO

INFO:tensorflow:global_step/sec: 476.54
INFO:tensorflow:loss = 0.08832723, step = 15401 (0.210 sec)
INFO:tensorflow:global_step/sec: 454.33
INFO:tensorflow:loss = 0.07513741, step = 15501 (0.220 sec)
INFO:tensorflow:global_step/sec: 454.469
INFO:tensorflow:loss = 0.1700801, step = 15601 (0.220 sec)
INFO:tensorflow:global_step/sec: 373.818
INFO:tensorflow:loss = 0.003381093, step = 15701 (0.267 sec)
INFO:tensorflow:global_step/sec: 457.055
INFO:tensorflow:loss = 0.018435977, step = 15801 (0.219 sec)
INFO:tensorflow:global_step/sec: 475.194
INFO:tensorflow:loss = 0.025203666, step = 15901 (0.210 sec)
INFO:tensorflow:global_step/sec: 481.195
INFO:tensorflow:loss = 0.19151206, step = 16001 (0.208 sec)
INFO:tensorflow:global_step/sec: 462.278
INFO:tensorflow:loss = 0.13565953, step = 16101 (0.216 sec)
INFO:tensorflow:global_step/sec: 405.127
INFO:tensorflow:loss = 0.2928977, step = 16201 (0.247 sec)
INFO:tensorflow:global_step/sec: 480.568
INFO:tensorflow:loss = 0.2682612, step = 16301 (0.2

INFO:tensorflow:global_step/sec: 440.698
INFO:tensorflow:loss = 0.027692752, step = 23501 (0.227 sec)
INFO:tensorflow:global_step/sec: 437.223
INFO:tensorflow:loss = 0.02722666, step = 23601 (0.229 sec)
INFO:tensorflow:global_step/sec: 454.686
INFO:tensorflow:loss = 0.020128762, step = 23701 (0.220 sec)
INFO:tensorflow:global_step/sec: 366.608
INFO:tensorflow:loss = 0.043561637, step = 23801 (0.273 sec)
INFO:tensorflow:global_step/sec: 449.055
INFO:tensorflow:loss = 0.044166967, step = 23901 (0.222 sec)
INFO:tensorflow:global_step/sec: 435.88
INFO:tensorflow:loss = 0.0037725873, step = 24001 (0.229 sec)
INFO:tensorflow:global_step/sec: 462.032
INFO:tensorflow:loss = 0.09617338, step = 24101 (0.217 sec)
INFO:tensorflow:global_step/sec: 482.147
INFO:tensorflow:loss = 0.066949494, step = 24201 (0.207 sec)
INFO:tensorflow:global_step/sec: 467.067
INFO:tensorflow:loss = 0.0013058052, step = 24301 (0.214 sec)
INFO:tensorflow:global_step/sec: 405.385
INFO:tensorflow:loss = 0.020831391, step =

INFO:tensorflow:global_step/sec: 426.854
INFO:tensorflow:loss = 0.22976612, step = 31601 (0.234 sec)
INFO:tensorflow:global_step/sec: 423.575
INFO:tensorflow:loss = 0.01615125, step = 31701 (0.236 sec)
INFO:tensorflow:global_step/sec: 469.657
INFO:tensorflow:loss = 0.0356962, step = 31801 (0.213 sec)
INFO:tensorflow:global_step/sec: 445.814
INFO:tensorflow:loss = 0.00809639, step = 31901 (0.224 sec)
INFO:tensorflow:global_step/sec: 474.936
INFO:tensorflow:loss = 0.03560849, step = 32001 (0.210 sec)
INFO:tensorflow:global_step/sec: 447.708
INFO:tensorflow:loss = 0.013653827, step = 32101 (0.224 sec)
INFO:tensorflow:global_step/sec: 459.449
INFO:tensorflow:loss = 0.017421462, step = 32201 (0.218 sec)
INFO:tensorflow:global_step/sec: 458.947
INFO:tensorflow:loss = 0.051849775, step = 32301 (0.218 sec)
INFO:tensorflow:global_step/sec: 462.142
INFO:tensorflow:loss = 0.025272995, step = 32401 (0.217 sec)
INFO:tensorflow:global_step/sec: 472.328
INFO:tensorflow:loss = 0.090703145, step = 3250

INFO:tensorflow:global_step/sec: 477.569
INFO:tensorflow:loss = 0.015949624, step = 39701 (0.209 sec)
INFO:tensorflow:global_step/sec: 489.18
INFO:tensorflow:loss = 0.007895786, step = 39801 (0.204 sec)
INFO:tensorflow:global_step/sec: 464.033
INFO:tensorflow:loss = 0.0019254872, step = 39901 (0.217 sec)
INFO:tensorflow:global_step/sec: 451.337
INFO:tensorflow:loss = 0.011241112, step = 40001 (0.221 sec)
INFO:tensorflow:global_step/sec: 350.533
INFO:tensorflow:loss = 0.020417571, step = 40101 (0.286 sec)
INFO:tensorflow:global_step/sec: 330.003
INFO:tensorflow:loss = 0.020271366, step = 40201 (0.302 sec)
INFO:tensorflow:global_step/sec: 453.116
INFO:tensorflow:loss = 0.02745831, step = 40301 (0.221 sec)
INFO:tensorflow:global_step/sec: 443.198
INFO:tensorflow:loss = 0.0059565185, step = 40401 (0.226 sec)
INFO:tensorflow:global_step/sec: 310.849
INFO:tensorflow:loss = 0.0191252, step = 40501 (0.323 sec)
INFO:tensorflow:global_step/sec: 291.712
INFO:tensorflow:loss = 0.008302279, step = 

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x119f41b38>

This code first creates a set of real valued columns from the training set. Then we create the `DNNClassifier` and we wrap it in a sklearn compatibility helper. Finally, we run 40,000 training iterations using batches of 50 instances. If you run this code on the dataset after scaling it, you'll get a model that achieves around 98.2% accuracy on the test set!

Under the hood, the `DNNClassifier` class creates all the neuron layers based on the ReLU activation function (we can change this by setting the `activation_fn` hyperparameter). The output layer relies on softmax, and the cost function is cross entropy (from chapter 4).

In [5]:
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"X": X_test}, y=y_test, shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-09-19-05:39:35
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/pz/0k_47k855d194vh0354xcvrc0000gn/T/tmplpcyceck/model.ckpt-44000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-09-19-05:39:35
INFO:tensorflow:Saving dict for global step 44000: accuracy = 0.9804, average_loss = 0.107365295, global_step = 44000, loss = 13.590544
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 44000: /var/folders/pz/0k_47k855d194vh0354xcvrc0000gn/T/tmplpcyceck/model.ckpt-44000


In [6]:
eval_results

{'accuracy': 0.9804,
 'average_loss': 0.107365295,
 'loss': 13.590544,
 'global_step': 44000}

In [7]:
y_pred_iter = dnn_clf.predict(input_fn=test_input_fn)
y_pred = list(y_pred_iter)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/pz/0k_47k855d194vh0354xcvrc0000gn/T/tmplpcyceck/model.ckpt-44000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


We could also use Keras for this task. The code would look like the following:

```
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(300, activation='relu'))
model.add(tf.keras.layers.Dense(100, activation='relu))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.01, loss='categorical_crossentropy', metrics=['categorical_accuracy'])
```

## Training a DNN Using Plain TensorFlow

If you want more control over the architecture of the network, you may prefer to use TensorFlow's lower-level Python API. We'll build the same model as above using this API implementing mini-batch gradient descent to train it on the MNIST dataset.

Step 1 is building the graph in the construction phase, and step two is the execution phase where we run the graph to train the model.

### Construction Phase

First, we need to specify the inputs and outputs and set the number of hidden neurons in each layer:

In [8]:
n_inputs = 28 * 28 # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

reset_graph()

Placeholder nodes will represent the training data and the targets. We're only partially defining the shape of __X__ cause we know it'll be a 2D matrix with instances along the first dimension and features along the second, and we know we'll have one feature per pixel for 784 features, but we don't know how many instances the training batches will contain. Hence the shape must be `(None, n_inputs)`. Similarly, we know that `y` will be a 1D tensor with one entry per instance, but we don't know sizes so the shape must be `(None)`.

In [9]:
# Create placeholders

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

Now, create the ANN. `X` will act as the input layer; during the execution phase, it will be replaced with one training batch at a time *note that ll instance in a training batch will be processed simultaneously by the neural net).* Now you need to create the two hidden layers and the output layer. The two hidden layers will really only differ by the inputs they're connected to and the number of neurons, and the output layer will be softmax instead of ReLU. Time to write a function for it:

In [10]:
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        std_dev = 2 / np.sqrt(n_inputs + n_neurons)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=std_dev)
        W = tf.Variable(init, name='kernel')
        b = tf.Variable(tf.zeros([n_neurons]), name='bias')
        Z = tf.matmul(X, W) + b
        
        return activation(Z) if activation is not None else Z

Break the code down line by line:

1) Create a namescope using the name of the layer; it will contain all of the computation nodes for this neuron layer. This is optional, but the graph will look much nicer in TensorBoard if the nodes are well organized.

2) Get the number of inputs by looking up the input matrix's shape and getting the size of the second dimension (first is for instances)

3) Next three lines create a `W` variable that will hold the weights matrix (often called the *kernel* of the layer). It will be a 2D matrix containing all the connection weights between each input and each neuron; hense, its shape will be (n_inputs, n_neurons). It'll be randomly initialized using a truncated Gaussian distribution with standard deviation of $\frac{2}{\sqrt{n_{\text{inputs}} + n_{\text{neurons}}}}$ Using this specific standard deviation helps the algo converge must faster (more to come in chapter 11). It's important to initialize connection weights randomly for all hidden layers to avoid any symmetries that Gradient Descent wouldn't be able to break.

4) The nxt line creates a `b` variable for biases, initialized to 0 (no syhmmetry issue in this case) with one bias param per neuron.

5) WE create a subgraph to compute $\textbf{Z} = \textbf{X} \cdot \textbf{W} + \textbf{b}$. This vectorized implementation will efficiently compute the weighted sums of the inputs plus the bias term for each and every neuron in the layer, for all instanes in the batch in just one shot. *Note: adding a 1D array to a 2D matrix with the same number of columns results in the 1D array being added to every row in the matrix. This is known as *broadcasting*

6) Finally, if an `activation` param is provided such as `tf.nn.relu`, then the code returns `activation(Z)` otherwise it returns just `Z`.

Okay so we can create a layer, so let's make a network!

In [None]:
"""
with tf.name_scope('dnn'):
    hidden1 = neuron_layer(X, n_hidden1, name='hidden1', activation=tf.nn.relu)
    hidden2 = neuron_layer(hidden1, n_hidden2, name='hidden2', activation=tf.nn.relu)
    logits = neuron_layer(hidden2, n_outputs, name='outputs')
"""

*Note that `logits` is the output __before__ going through softmax activation. We'll handle softmax computation later*

TensorFlow comes with functinos to create standard neural network lauyers, so there usually isn't a need to define your own `neuron_layer` function in the way that we did. The function `tf.layers.dense()` creates a fully connected layer, where all of the inputs are connected to all the neuron in the layer. It creates the weights and biases variables (named `kernel` and `bias` respectively) using the appropriate initialization strategy, and the activation function can be set via the `activation` hyperparameter.

The following code will use the built-in function instead of our own custom one:

In [11]:
with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1', activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name='hidden2', activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name='outputs')

With the model ready to go, we'll need our cost function. We'll use cross entropy (the one that penalizes models that estimate a low probability for the target class). TensorFlow provides a few functions for computing cross entropy. The one we'll use is called `sparse_soft_max_cross_entropy_with_logits()`; it computes cross entropy based on the "logits" (output of the network *before* going through softmax activation) and it expects labels in the form of integers ranging from 0 to the number of classes - 1 (so for us, 0-9). This will give a 1D tensor containing the cross entropy for each instance. We can then use `reduce_mean()` to compute the mean cross entropy over all instances:

In [12]:
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')

*Note: The `sparse_soft` function is equivalent to applying the softmax activation function and then computing the cross entropy, but it's more efficent and it properly takes care of corner cases. When logits are large, floating-point rounding errors may cause the output to be exactly equal to 0 or 1, and in this case the cross entropy equation would contain a $log(0)$ term, equal to negative infinity. The function solves this by computing $log(\epsilon)$ instead, where $\epsilon$ is a tiny positive number. This is why we didn't apply the softmax function earlier. There's also another function called `softmax_cross_entropy_with_logits()` that takes labels in the form of one-hot vectors instead of ints from 0 to the number of classes minus 1.*

We have the neural net, we have the cost function, and now we need to define a `GradientDescentOptimizer` that'll tweak model params to minimize cost function. This is the same thing we did last chapter:

In [13]:
learning_rate = 0.01

with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

Last important step in construction is to specify how to evaluate the model. We'll simply use accuracy as our metric. First, for each instance, determine if the neural net's prediction is correct by checking whether or not the highest logit corresponds to the target class. For this, you can use the `in_top_k()` function. This returns a 1D tensor full of boolean values, so we'll need to cast these to floats and then compute the average to get the overall accuracy:

In [14]:
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

As usual, we'll create an initializer node

In [15]:
init = tf.global_variables_initializer()

Construction is complete! This was less than 40 lines, but we did a lot: we created placeholders for inputs and targets, created a function to build the neuron layer, used that function to create a DNN, defined the cost function, created an optimizer, and defined the performance metric. Now to execution!

### Execution Phase

This part is shorter and simpler. First up, load the MNIST data. We'll use sklearn for this

In [16]:
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

In [17]:
n_epochs = 40
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_valid = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Batch accuracy:", acc_batch, "Validation accuracy:", acc_valid)
        

0 Batch accuracy: 0.9 Validation accuracy: 0.9024
1 Batch accuracy: 0.92 Validation accuracy: 0.9254
2 Batch accuracy: 0.94 Validation accuracy: 0.9372
3 Batch accuracy: 0.9 Validation accuracy: 0.9416
4 Batch accuracy: 0.94 Validation accuracy: 0.9472
5 Batch accuracy: 0.94 Validation accuracy: 0.9512
6 Batch accuracy: 1.0 Validation accuracy: 0.9548
7 Batch accuracy: 0.94 Validation accuracy: 0.961
8 Batch accuracy: 0.96 Validation accuracy: 0.9622
9 Batch accuracy: 0.94 Validation accuracy: 0.9648
10 Batch accuracy: 0.92 Validation accuracy: 0.9656
11 Batch accuracy: 0.98 Validation accuracy: 0.9666
12 Batch accuracy: 0.98 Validation accuracy: 0.9684
13 Batch accuracy: 0.98 Validation accuracy: 0.9704
14 Batch accuracy: 1.0 Validation accuracy: 0.9694
15 Batch accuracy: 0.94 Validation accuracy: 0.9718
16 Batch accuracy: 0.98 Validation accuracy: 0.9726
17 Batch accuracy: 1.0 Validation accuracy: 0.9728
18 Batch accuracy: 0.98 Validation accuracy: 0.9744
19 Batch accuracy: 0.98 Vali

This code opens a session and initializes all variables. It then runs the training loop and at the end of each epoch, it evaluates the model on the last mini-batch and on the validation set and reports the accuracy.

### Using the Neural Network

Now that the network is trained, you can use it for predictions. To do this, you'd reuse the construction phase but change the execution phase to restore from a previous session.

The next section of the book talks about fine-tuning hyperparams, so read the book to learn this info.

# Exercises

### 9) Train a deep MLP on the MNIST dataset and aim for 98% precision

In [18]:
# numbers setup
n_inputs = 28 * 28 # number of pixels per MNIST photo
n_hidden1 = 300 # number of nodes in 1st layer
n_hidden2 = 100 # number of nodes in 2nd layer
n_output = 10 # number of output nodes

In [19]:
# set up placeholders
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

In [20]:
# set up the neural net
with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1', activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name='hidden2', activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_output, name='outputs')
    
# set up the loss function
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    loss_summary = tf.summary.scalar('log_loss', loss)

# set up the training system
learning_rate = 0.01
with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
    
# set up the evaluation system
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    accuracy_summary = tf.summary.scalar('accuracy', accuracy)
    
# initialize variables
init = tf.global_variables_initializer()

Okay, now let's implement early stopping. We'll need the validation set for this

In [21]:
m, n = X_train.shape

In [25]:
n_epochs = 10001
batch_size = 50
n_batches = int(np.ceil(m / batch_size))
saver = tf.train.Saver()

# Set up directories for the model
checkpoint_path = '/tmp/my_deep_mnist_model.ckpt'
checkpoint_epoch_path = checkpoint_path + '.epoch'
final_model_path = './my_deep_mnist_model'

# Prepare for early stopping
best_loss = np.infty
epochs_without_progress = 0
max_epochs_without_progress = 50

with tf.Session() as sess:
    if os.path.isfile(checkpoint_epoch_path):
        # if the file exists, load it and load the epoch number
        with open(checkpoint_epoch_path, 'rb') as infile:
            start_epoch = int(infile.read())
        print("Training was interrupted, continuing at epoch", start_epoch)
        saver.restore(sess, checkpoint_path)
    else:
        start_epoch = 0
        sess.run(init)
        
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val, loss_val, accuracy_summary_str, loss_summary_str = sess.run([accuracy, loss, accuracy_summary, loss_summary], feed_dict={X: X_valid, y: y_valid})
        if epoch % 5 == 0:
            print("Epoch:", epoch,
                  "\tValidation accuracy: {:.3f}%".format(accuracy_val * 100),
                  "\tLoss: {:.5f}".format(loss_val))
            saver.save(sess, checkpoint_path)
            with open(checkpoint_epoch_path, 'wb') as outfile:
                outfile.write(b"%d" % (epoch + 1))
            if loss_val < best_loss:
                saver.save(sess, final_model_path)
                best_loss = loss_val
            else:
                epochs_without_progress += 5
                if epochs_without_progress > max_epochs_without_progress:
                    print("Early stopping")
                    break

Training was interrupted, continuing at epoch 1
INFO:tensorflow:Restoring parameters from /tmp/my_deep_mnist_model.ckpt
Epoch: 0 	Validation accuracy: 92.540%
Epoch: 5 	Validation accuracy: 95.480%
Epoch: 10 	Validation accuracy: 96.660%
Epoch: 15 	Validation accuracy: 97.260%
Epoch: 20 	Validation accuracy: 97.340%
Epoch: 25 	Validation accuracy: 97.680%
Epoch: 30 	Validation accuracy: 97.800%
Epoch: 35 	Validation accuracy: 97.920%
Epoch: 40 	Validation accuracy: 98.080%
Epoch: 45 	Validation accuracy: 98.080%
Epoch: 50 	Validation accuracy: 98.000%
Epoch: 55 	Validation accuracy: 98.100%
Epoch: 60 	Validation accuracy: 98.140%
Epoch: 65 	Validation accuracy: 98.160%
Epoch: 70 	Validation accuracy: 98.200%
Epoch: 75 	Validation accuracy: 98.180%
Epoch: 80 	Validation accuracy: 98.220%
Epoch: 85 	Validation accuracy: 98.200%
Epoch: 90 	Validation accuracy: 98.200%
Epoch: 95 	Validation accuracy: 98.240%
Epoch: 100 	Validation accuracy: 98.200%
Epoch: 105 	Validation accuracy: 98.260%


In [26]:
os.remove(checkpoint_epoch_path)

with tf.Session() as sess:
    saver.restore(sess, final_model_path)
    accuracy_val = accuracy.eval(feed_dict={X: X_test, y: y_test})
    
accuracy_val

INFO:tensorflow:Restoring parameters from ./my_deep_mnist_model


0.9791