# Introduction to Artificial Neural Networks
page 253<br>
See
- https://github.com/ageron/handson-ml/blob/master/10_introduction_to_artificial_neural_networks.ipynb,
- https://link.springer.com/article/10.1007/BF02478259,
- https://en.wikipedia.org/wiki/Neuron,
- https://en.wikipedia.org/wiki/Dual_number,
- http://oscar.calldesk.ai, and
- https://www.tensorflow.org/api_docs/python/tf/math/in_top_k for details.

The study of birds inspired humans to build airplanes and fly. Similarly, the study of the brain inspired the development of artificial neural networks (ANNs), with the aim of building intelligent machines. Yet, just as airplanes do not need to flap their wings in order to fly, ANNs work differently from biological neurons. ANNs are the central ingredient of Deep Learning. They are versatile, powerful, and scalable. Many great feats have been achieved with programs based on ANNs: beating humans at the game of "Go", in the TV game show "Jeopardy", at identifying cancerous tissue, at classifying billions of images, and so on.

This chapter will give a tour of basic ANN architectures, introduce *Multi-Layer Perceptrons* (MLPs), and use TensorFlow to implement a program that uses ANNs to tackle the MNIST dataset (see also Chapter 3).

## From Biological to Artificial Neurons
page 254<br>
In 1943, Warren McCulloch (neurophysiologist) and Walter Pitts (mathematician) invented the first ANN architecture (see the second link above). Many architectures have followed since. Due to early successes, there was significant belief that humans would soon converse with truly intelligent machines. But funding dropped around 1960 due to stagnation of the field. It recovered around 1980 due to new network architectures and better training techniques but by 1990, alternative machine learning techniques like support vector machines seemed more promising. Today, we are once more facing enthusiasm around ANNs. There are several reasons why this time, ANNs will have a lasting impact on our lives:
- There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.
- The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time. This is in part due to Moore's Law, but also thanks to the gaming industry, which has produced powerful GPU cards by the millions.
- The training algorithms have been improved. To be fair, they are only slightly different from the ones used in the 1990s, but these relatively small tweaks have a huge positive impact.
- Some theoretical limitations of ANNs have turned out to be benign in practice. For example, many people thought that ANN training algorithms were doomed because they were likely to get stuck in local optima, but it turns out that this is rather rare in practice (or when it is the case, they are usually fairly close to the global optimum).
- ANNs seem to have entered a virtuous circle of funding and progress. Amazing products based on ANNs regularly make the headline news, which pulls more and more attention and funding toward them, resulting in more and more progress, and even more amazing products.

### Biological Neurons
page 255<br>
Biological neurons are composed of a cell body - or *soma* - that contains the cell's complex components (the Golgi apparatus, nucleus, endoplasmic reticulum, and mitochondria), *dendrites*, and an *axon*. An axon (or *nerve fiber*) can reach a length of 1 meter, and at its end it splits into many branches called *telodendria* that have *synaptic terminals* (or simply *synapses*) at their ends. Many axons are often gathered into a bundle, i.e., a nerve. Dendrites reach lengths of hundreds of micrometers and branch into complex dendritic trees. Most neurons receive signals via the dendrites and send signals out via the axon. Electric impulses called *signals* are transmitted between neurons via the synapses. When a neuron receives a certain signal intensity within a few milliseconds, it will emit its own signal via its axon (see the third link above for details).<br>
So biological neurons function in a rather simple way. But they are organised in a vast network of billions of neurons, each neuron typically connected to thousands of other neurons. This allows the biological neural networks (BNNs - *neural networks* or NNs are usually assumed to be artificial, at least in the context of machine learning) to perform highly complex computations. Although our understanding of these networks is far from complete, it seems that neurons are often organized in consecutive layers.

### Logical Computations with Neurons
page 256<br>
McCulloch and Pitts introduced a model that later became known as the *artificial neuron*: it has one or more binary inputs and one binary output. It only fires its binary output when the number of activated inputs exceeds a certain threshold. Importantly, McCulloch and Pitts showed that even with this simple model, a big enough network will be able to compute any logical function. Figure 10-3 of the book shows how minimal networks of such artificial neurons perform *AND, OR, NOT* and *NAND* operations.
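
The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of the networks in Figure 10-3: the helper names `mp_neuron`, `and_gate`, `or_gate`, `not_gate`, and `nand_gate` are hypothetical, the neuron fires when *at least* `threshold` excitatory inputs are active, and negation is modelled with an inhibitory input that vetoes firing.

```python
def mp_neuron(excitatory, threshold, inhibitory=()):
    """Binary threshold neuron (hypothetical helper).

    Returns 1 if no inhibitory input is active and at least
    `threshold` excitatory inputs are active, else 0.
    """
    if any(inhibitory):
        return 0
    return int(sum(excitatory) >= threshold)

def and_gate(a, b):
    return mp_neuron([a, b], threshold=2)    # both inputs must be active

def or_gate(a, b):
    return mp_neuron([a, b], threshold=1)    # one active input suffices

def not_gate(a):
    # fires by default (threshold 0) unless the inhibitory input is active
    return mp_neuron([], threshold=0, inhibitory=[a])

def nand_gate(a, b):
    return not_gate(and_gate(a, b))          # a two-neuron network

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_gate(a, b), or_gate(a, b), nand_gate(a, b))
```

Since NAND alone is functionally complete, chaining such neurons is enough to build any logical function, which is the essence of the McCulloch-Pitts result.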

### The Perceptron
page 257<br>
In 1957, Frank Rosenblatt invented the *perceptron*, one of the simplest ANN architectures. Its artificial neuron is a *linear threshold unit* (LTU), which multiplies each real-valued input $x_i$ by a weight $w_i$, adds the weighted inputs up to $z(x)=\sum_ix_iw_i$, and then applies a step function (e.g., the Heaviside or sign function), thus yielding the result
$$h_w(x)={\rm step}(z)={\rm step}(w^T\cdot x)\,.$$
<br>
In previous chapters, we had also included a bias term $w_0$. This can be included in the perceptron by using an additional constant input $x_0=1$ via an additional neuron, the *bias neuron*. Then, a single LTU can be used as a binary classifier, e.g., on the iris dataset: use a constant bias input $x_0=1$ and the features $x_1$ and $x_2$ for petal length and petal width.<br>
Sometimes, the word *perceptron* is used to mean a single LTU, and sometimes it is meant to represent a simple network that consists of only one single layer of LTUs. The perceptron diagram in Figure 10-5 of the book shows a single-layer network with 3 parallel LTUs, all of which receive the same inputs, $x_0=1$, $x_1$, and $x_2$. The three outputs could be used to classify the three classes of the iris dataset. But to do so, the network needs to be trained. How is this done?<br><br>
According to Donald Hebb and Siegrid Löwel, "cells that fire together, wire together" (*Hebbian learning* or Hebb's rule), that is, the connection weight between two neurons is increased whenever they produce the same output. Perceptrons use a variant of this rule: they only reinforce connections that lead to the correct output. The corresponding **perceptron learning rule** (weight update) is<br><br>
$$w_{i,j}^{\rm(next\,step)}=w_{i,j}+\eta(y_j-y^{'}_j)x_i\,,$$
<br>
where $w_{i,j}$ is the weight between input neuron $i$ and output neuron $j$, $x_i$ is the output of input neuron $i$ (i.e., the $i$-th feature of the current instance), $y^{'}_j$ is the output of the output neuron $j$, $y_j$ is the target output (class) of the current instance, and $\eta$ is the learning rate. In short, the weight $w_{i,j}$ only changes if the input $x_i$ is nonzero and the prediction is wrong; in that case, it is updated in proportion to the deviation from the desired class and to the learning rate. Perceptrons are linear models with linear decision boundaries (see, e.g., Chapter 4 on linear models like logistic regression classifiers). But according to Rosenblatt's *perceptron convergence theorem*, this algorithm will converge towards a solution, provided that the training instances are linearly separable.<br><br>
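
Before turning to a library, the update rule can be spelled out in plain NumPy. The following is a minimal sketch for a single output neuron, assuming a Heaviside step that fires for $z>0$; the function name `train_perceptron` and the toy dataset (logical AND, which is linearly separable) are illustrative assumptions, not taken from the book.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=100):
    """Train a single LTU with the perceptron learning rule (hypothetical helper)."""
    X_b = np.c_[np.ones(len(X)), X]           # prepend the bias input x_0 = 1
    w = np.zeros(X_b.shape[1])                # weights w_0, w_1, ..., initialised to zero
    for _ in range(n_epochs):
        n_errors = 0
        for x_i, target in zip(X_b, y):
            y_hat = 1 if w @ x_i > 0 else 0   # step(w^T x)
            delta = eta * (target - y_hat)    # zero when the prediction is correct
            w += delta * x_i                  # only nonzero inputs change their weights
            n_errors += int(delta != 0)
        if n_errors == 0:                     # a full pass without mistakes: converged
            break
    return w

# toy data: logical AND of two binary features
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 0, 1])

w = train_perceptron(X_toy, y_toy)
predictions = (np.c_[np.ones(len(X_toy)), X_toy] @ w > 0).astype(int)
print("weights:", w)
print("predictions:", predictions)
```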
Let's try perceptrons out with Scikit-Learn's perceptron class!

In [1]:
# imports
import os
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
# prepare the training data
iris = load_iris()
X = iris.data[:, (2,3)] # petal length, petal width
y = (iris.target == 0).astype(int) # Iris Setosa (np.int was removed from recent NumPy versions)
print(iris.target)
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0]


According to the above outputs, the positive class (yes, it is an Iris Setosa) is encoded by the label 1. Below, the features (petal length = 2cm, petal width = 0.5cm) lead to the prediction that such a specimen is indeed an Iris Setosa.

In [2]:
# training of the perceptron classifier
per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)
# inference on a new instance
y_pred = per_clf.predict([[2, 0.5]])
print(y_pred)

[1]


The perceptron learning rule displayed above (Equation 10-2 in the book) strongly resembles stochastic gradient descent (SGD). Indeed, Scikit-Learn's Perceptron class is equivalent to an SGDClassifier with the hyperparameters loss="perceptron", learning_rate="constant", eta0=1 (the learning rate), and penalty=None (no regularization). Unlike logistic regression classifiers, perceptrons do not output a class probability. Instead, they just make predictions based on a hard threshold, which is one of many good reasons to prefer logistic regression over perceptrons.<br>
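
The stated equivalence can be checked directly. The sketch below trains both estimators on the same petal features with the same `random_state` and compares their predictions on the training set; the variable names `X_iris`, `y_iris`, and `same` are chosen here for illustration, and identical results are only to be expected when all remaining hyperparameters are left at their defaults.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X_iris = iris.data[:, (2, 3)]              # petal length, petal width
y_iris = (iris.target == 0).astype(int)    # 1 for Iris Setosa, 0 otherwise

per_clf = Perceptron(random_state=42).fit(X_iris, y_iris)
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant", eta0=1,
                        penalty=None, random_state=42).fit(X_iris, y_iris)

# compare the two classifiers instance by instance
same = np.array_equal(per_clf.predict(X_iris), sgd_clf.predict(X_iris))
print("identical predictions on the training set:", same)
```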
In 1969, Marvin Minsky and Seymour Papert showed that perceptrons have serious weaknesses, e.g., the inability to compute the exclusive OR (XOR). This is true for any linear classification model (e.g., logistic regression classifiers) but was a disappointment for researchers, nevertheless. As a consequence, many researchers stopped their efforts towards *connectionism*, i.e., the study of neural networks.<br>
But meanwhile, people found out that many shortcomings can be resolved by stacking several perceptrons. An XOR gate based on stacked perceptrons is demonstrated by the code below, where the network shown in the right part of Figure 10-6 of the book is implemented. The code writes the weights in an unusual way: for binary values, `x2*(x2+1)` equals `2*x2`, `out2*(out2+1)` equals `2*out2`, and `out1*(out1-3)` equals `-2*out1`. In particular, the penalty due to firing of the left (first) LTU in the bottom (first) layer is -2 instead of -1. Of course, this does not change the fact that stacked perceptrons can indeed perform XOR operations.

In [3]:
x1, x2 = 0, 0
in1 = x1 + x2 - 1.5                       # input to left LTU in first layer
in2 = x1 + (x2*(x2+1)) - 0.5              # input to right LTU in first layer
out1 = np.heaviside(in1, 0.5)             # left LTU
out2 = np.heaviside(in2, 0.5)             # right LTU
in3 = out1*(out1-3) + out2*(out2+1) - 0.5 # input LTU in second layer
out3 = np.heaviside(in3, 0.5)             # top LTU
print(out3)

0.0
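
The cell above only evaluates the input (0, 0). As a minimal sketch, the function below runs all four input combinations through a two-layer network of LTUs. The weights are assumptions chosen for clarity (all connection weights 1 in the first layer, biases -1.5 and -0.5; output weights -1 and +1 with bias -0.5), so that the first hidden LTU computes AND, the second computes OR, and the output fires for "OR but not AND". The name `xor_mlp` is hypothetical.

```python
import numpy as np

def xor_mlp(x1, x2):
    """XOR via two stacked layers of linear threshold units (hypothetical helper)."""
    h_and = np.heaviside(x1 + x2 - 1.5, 0)       # fires only if both inputs are 1
    h_or = np.heaviside(x1 + x2 - 0.5, 0)        # fires if at least one input is 1
    out = np.heaviside(-h_and + h_or - 0.5, 0)   # OR and not AND
    return int(out)

truth_table = {(x1, x2): xor_mlp(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, output in truth_table.items():
    print(inputs, "->", output)
```

No single LTU can produce this table, because the two positive cases (0, 1) and (1, 0) cannot be separated from (0, 0) and (1, 1) by one straight line.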


### Multi-Layer Perceptron and Backpropagation
page 261<br>
An MLP (Multi-Layer Perceptron) consists of an *input layer*, one or more *hidden layers*, and an additional *output layer*, see also Figure 10-7 in the book. Apart from the output layer, every layer also has a bias neuron. When there is more than one hidden layer, it is called a *deep neural network* (DNN). Training MLPs was a struggle until 1986, when D. E. Rumelhart *et al.* introduced the *backpropagation* training algorithm. In short, it is gradient descent with reverse-mode autodiff (see chapters 4 and 9, as well as Appendix D). In order for this to work, the discontinuous step function had to be replaced by a differentiable activation function: a step function is flat everywhere except at the jump, so its gradient gives gradient descent nothing to work with. Rumelhart *et al.* chose the *logistic function*, $\sigma(z)=(1+e^{-z})^{-1}$, but other activation functions are also possible. When $\tanh(z)$ is used, the output values lie between -1 and +1 such that the input from one layer of LTUs to an LTU of the next layer will on average be close to 0, at least at the beginning of training. Notably, this can speed up convergence. The *rectified linear unit*, ${\rm ReLU}(z)={\rm max}(0,z)$ – introduced in chapter 9 – has a discontinuous derivative but can help reduce issues with gradient descent due to its lack of an upper bound (more on that in chapter 11).<br><br>
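
For a quick numerical feel, the three activation functions and their derivatives can be written down in NumPy. This is an illustrative sketch; the function names `logistic` and `relu` and the sample grid `z` are chosen here and are not part of the book's code.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))       # values in (0, 1)

def relu(z):
    return np.maximum(0.0, z)             # zero for negative z, unbounded above

z = np.linspace(-3.0, 3.0, 7)             # sample points -3, -2, ..., 3

# derivatives used by gradient descent
logistic_grad = logistic(z) * (1.0 - logistic(z))   # at most 0.25, vanishes for large |z|
tanh_grad = 1.0 - np.tanh(z) ** 2                   # at most 1, vanishes for large |z|
relu_grad = (z > 0).astype(float)                   # 0 or 1, jumps at z = 0

print("logistic:", logistic(z))
print("tanh:    ", np.tanh(z))
print("relu:    ", relu(z))
print("gradients:", logistic_grad, tanh_grad, relu_grad, sep="\n")
```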
MLPs are often used for classification tasks, with each output corresponding to a different binary class (spam/ham, urgent/not urgent, ..., or 0/not 0, 1/not 1, 2/not 2, ...). When these classes are exclusive, it makes sense to further process the output of the output neurons with a softmax function (see chapter 4) so that the actual outputs are probabilities for that class. This makes sense when classes are exclusive since then, the outputs should add up to one, such that an interpretation as probability is meaningful.<br><br>
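
A small NumPy sketch shows what the softmax step does to the raw outputs of the output layer. The function name `softmax` and the example values are assumptions for illustration; subtracting the row maximum before exponentiating is a common trick for numerical stability and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw output scores into probabilities that sum to one."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # made-up outputs of three output neurons
probs = softmax(logits)
print(probs, probs.sum())
```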
**General note**<br>
Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time. But it turns out that the ReLU activation function generally works better in ANNs. This is one of the cases where the biological analogy was misleading.<br><br>
TensorFlow's reverse-mode autodiff algorithm is nicely explained in Appendix D. The parts "Forward-Mode Autodiff" and "Reverse-Mode Autodiff" are really worth reading. The concept of dual numbers can be understood from the Wikipedia page, https://en.wikipedia.org/wiki/Dual_number, in particular from the parts on "Linear representation" and "Differentiation". A "physicist's" alternative to the linear formulation using matrices would be to introduce the "dual element" as an infinitesimal that squares to zero, $x+{\rm d}x\to(x+{\rm d}x)/x=1+\epsilon$ where ${\rm d}x^2=\epsilon^2=0$, while both ${\rm d}x$ and $\epsilon$ are nonzero.<br>
With these tools, the Taylor expansion of $f(a+b\epsilon)$ around $a$ (see also the linked Wikipedia page),
$$f(a+b\epsilon)=\sum_n\frac{f^{(n)}(a)}{n!}(b\epsilon)^n=f(a)+f'(a)b\epsilon\,,$$
explains differentiation via dual numbers very well ($\epsilon^2=0$). Figure D-3 on page 516 shows how reverse-mode autodiff requires going through the graph only twice. Importantly, the node $n_2$ highlights that in the chain rule, $\partial_xf=\sum_i(\partial_{n_i}f)\,(\partial_xn_i)$, the sum remains necessary. In the individual nodes, $x$ is replaced by $n_j$. A short part from that appendix section shall be quoted here: "How much does $f$ vary when $n_5$ varies? The answer is $\partial_{n_5}f=(\partial_{n_7}f)\cdot(\partial_{n_5}n_7)$."
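
The dual-number idea translates almost literally into code. The following is a minimal sketch of forward-mode autodiff, assuming a hypothetical `Dual` class that only supports addition and multiplication; the example function $f(x,y)=x^2y+y+2$ is chosen here for illustration. Setting the $\epsilon$-part of one input to 1 and all others to 0 yields the partial derivative with respect to that input.

```python
class Dual:
    """Number of the form real + eps * epsilon with epsilon**2 == 0 (hypothetical helper)."""

    def __init__(self, real, eps=0.0):
        self.real = real
        self.eps = eps

    @staticmethod
    def _lift(value):
        # treat plain numbers as duals with zero epsilon part
        return value if isinstance(value, Dual) else Dual(value)

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps**2 == 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    __rmul__ = __mul__

def f(x, y):
    return x * x * y + y + 2

df_dx = f(Dual(3.0, 1.0), Dual(4.0, 0.0))   # epsilon part carries df/dx at (3, 4)
df_dy = f(Dual(3.0, 0.0), Dual(4.0, 1.0))   # epsilon part carries df/dy at (3, 4)
print(df_dx.real, df_dx.eps, df_dy.eps)
```

Each partial derivative needs its own pass through the function here, which is why reverse-mode autodiff is preferred when there are many inputs and few outputs.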

## Training an MLP with TensorFlow's High-Level API
page 264<br>
The simplest way to train an MLP with TensorFlow is to use a high-level API that is compatible with Scikit-Learn. The book uses TF.Learn for this; the code below uses the DNNClassifier class from tf.estimator instead.

In [4]:
import tensorflow as tf
# the below code is taken from the github link above, since tensorflow has evolved since publication of the book
# features and labels
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]
# training
feature_cols = [tf.feature_column.numeric_column("X", shape=[28 * 28])]
dnn_clf = tf.estimator.DNNClassifier(hidden_units=[300,100], n_classes=10, feature_columns=feature_cols)
input_fn = tf.estimator.inputs.numpy_input_fn(x={"X": X_train}, y=y_train, num_epochs=40, batch_size=50, shuffle=True)
dnn_clf.train(input_fn=input_fn)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/j2/hf6944zn74l9y35sr4mgbzb00000gn/T/tmp1i3slmao', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x125f85e48>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/j2/hf6944zn74l9y35sr4mgbzb00000gn/T/tmp1i3slmao/model.ckpt.
INFO:tensorflow:loss = 111.2738, step = 1
INFO:tensorflow:global_step/sec: 74.6508
INFO:tensorflow:loss = 12.76069, step = 101 (1.336 sec)
INFO:tensorflow:global_step/sec: 71.1829
INFO:tensorflow:loss = 14.68929, step

INFO:tensorflow:global_step/sec: 79.5877
INFO:tensorflow:loss = 2.466872, step = 7501 (1.257 sec)
INFO:tensorflow:global_step/sec: 77.9726
INFO:tensorflow:loss = 1.2315158, step = 7601 (1.282 sec)
INFO:tensorflow:global_step/sec: 81.9786
INFO:tensorflow:loss = 5.6266356, step = 7701 (1.220 sec)
INFO:tensorflow:global_step/sec: 82.5624
INFO:tensorflow:loss = 0.09036364, step = 7801 (1.211 sec)
INFO:tensorflow:global_step/sec: 81.9089
INFO:tensorflow:loss = 0.2519115, step = 7901 (1.221 sec)
INFO:tensorflow:global_step/sec: 80.4359
INFO:tensorflow:loss = 6.8407655, step = 8001 (1.243 sec)
INFO:tensorflow:global_step/sec: 83.2174
INFO:tensorflow:loss = 0.39095354, step = 8101 (1.202 sec)
INFO:tensorflow:global_step/sec: 82.053
INFO:tensorflow:loss = 1.024648, step = 8201 (1.219 sec)
INFO:tensorflow:global_step/sec: 80.9161
INFO:tensorflow:loss = 0.54383826, step = 8301 (1.236 sec)
INFO:tensorflow:global_step/sec: 80.3709
INFO:tensorflow:loss = 0.496621, step = 8401 (1.244 sec)
INFO:tensor

INFO:tensorflow:global_step/sec: 80.7374
INFO:tensorflow:loss = 0.018453483, step = 15701 (1.238 sec)
INFO:tensorflow:global_step/sec: 74.2792
INFO:tensorflow:loss = 0.005063211, step = 15801 (1.346 sec)
INFO:tensorflow:global_step/sec: 82.165
INFO:tensorflow:loss = 0.045956515, step = 15901 (1.217 sec)
INFO:tensorflow:global_step/sec: 78.044
INFO:tensorflow:loss = 0.14933804, step = 16001 (1.281 sec)
INFO:tensorflow:global_step/sec: 79.9524
INFO:tensorflow:loss = 0.02612287, step = 16101 (1.252 sec)
INFO:tensorflow:global_step/sec: 81.993
INFO:tensorflow:loss = 0.1607581, step = 16201 (1.219 sec)
INFO:tensorflow:global_step/sec: 83.0595
INFO:tensorflow:loss = 0.19471687, step = 16301 (1.206 sec)
INFO:tensorflow:global_step/sec: 78.5668
INFO:tensorflow:loss = 0.03702511, step = 16401 (1.271 sec)
INFO:tensorflow:global_step/sec: 77.9001
INFO:tensorflow:loss = 0.13329874, step = 16501 (1.285 sec)
INFO:tensorflow:global_step/sec: 79.2804
INFO:tensorflow:loss = 0.048678815, step = 16601 (1

INFO:tensorflow:global_step/sec: 80.7882
INFO:tensorflow:loss = 0.047273476, step = 23801 (1.238 sec)
INFO:tensorflow:global_step/sec: 79.6435
INFO:tensorflow:loss = 0.07347919, step = 23901 (1.256 sec)
INFO:tensorflow:global_step/sec: 84.4893
INFO:tensorflow:loss = 0.09225047, step = 24001 (1.183 sec)
INFO:tensorflow:global_step/sec: 80.8638
INFO:tensorflow:loss = 0.010594401, step = 24101 (1.237 sec)
INFO:tensorflow:global_step/sec: 83.0595
INFO:tensorflow:loss = 0.07059836, step = 24201 (1.203 sec)
INFO:tensorflow:global_step/sec: 80.5136
INFO:tensorflow:loss = 0.0026259483, step = 24301 (1.242 sec)
INFO:tensorflow:global_step/sec: 80.2328
INFO:tensorflow:loss = 0.12016117, step = 24401 (1.247 sec)
INFO:tensorflow:global_step/sec: 83.8898
INFO:tensorflow:loss = 0.07194876, step = 24501 (1.192 sec)
INFO:tensorflow:global_step/sec: 79.6262
INFO:tensorflow:loss = 0.0043789004, step = 24601 (1.256 sec)
INFO:tensorflow:global_step/sec: 79.7332
INFO:tensorflow:loss = 0.037498884, step = 2

INFO:tensorflow:global_step/sec: 84.5704
INFO:tensorflow:loss = 0.073672, step = 31901 (1.181 sec)
INFO:tensorflow:global_step/sec: 83.8997
INFO:tensorflow:loss = 0.14368615, step = 32001 (1.193 sec)
INFO:tensorflow:global_step/sec: 79.7406
INFO:tensorflow:loss = 0.016972434, step = 32101 (1.253 sec)
INFO:tensorflow:global_step/sec: 81.3682
INFO:tensorflow:loss = 0.015100708, step = 32201 (1.229 sec)
INFO:tensorflow:global_step/sec: 85.5397
INFO:tensorflow:loss = 0.07284905, step = 32301 (1.168 sec)
INFO:tensorflow:global_step/sec: 88.118
INFO:tensorflow:loss = 0.07950735, step = 32401 (1.136 sec)
INFO:tensorflow:global_step/sec: 83.2871
INFO:tensorflow:loss = 0.018703043, step = 32501 (1.200 sec)
INFO:tensorflow:global_step/sec: 82.249
INFO:tensorflow:loss = 0.0143075995, step = 32601 (1.216 sec)
INFO:tensorflow:global_step/sec: 83.2776
INFO:tensorflow:loss = 0.013998631, step = 32701 (1.202 sec)
INFO:tensorflow:global_step/sec: 80.1945
INFO:tensorflow:loss = 0.0028412684, step = 3280

INFO:tensorflow:global_step/sec: 80.1984
INFO:tensorflow:loss = 0.02137714, step = 40001 (1.245 sec)
INFO:tensorflow:global_step/sec: 79.2889
INFO:tensorflow:loss = 0.010872903, step = 40101 (1.261 sec)
INFO:tensorflow:global_step/sec: 80.1327
INFO:tensorflow:loss = 0.0070594614, step = 40201 (1.248 sec)
INFO:tensorflow:global_step/sec: 81.2819
INFO:tensorflow:loss = 0.00930143, step = 40301 (1.230 sec)
INFO:tensorflow:global_step/sec: 80.2444
INFO:tensorflow:loss = 0.017696235, step = 40401 (1.246 sec)
INFO:tensorflow:global_step/sec: 81.2309
INFO:tensorflow:loss = 0.016471116, step = 40501 (1.231 sec)
INFO:tensorflow:global_step/sec: 83.3296
INFO:tensorflow:loss = 0.025980936, step = 40601 (1.200 sec)
INFO:tensorflow:global_step/sec: 79.448
INFO:tensorflow:loss = 0.013709705, step = 40701 (1.259 sec)
INFO:tensorflow:global_step/sec: 77.1169
INFO:tensorflow:loss = 0.009738188, step = 40801 (1.296 sec)
INFO:tensorflow:global_step/sec: 79.1722
INFO:tensorflow:loss = 0.00778044, step = 4

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1100e33c8>

This code first creates the feature and label data for training, validation, and testing, followed by the DNNClassifier. Then, the classifier is trained. With TensorFlow version 1.5, the code from the book needed to be adapted, see the GitHub link above. To get the accuracy, we also deviate from the code in the book (see the GitHub link above) and obtain the following. The accuracy is around 98%, indeed a little bit higher than what we found with a k-nearest neighbor classifier in chapter 3.

In [5]:
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"X": X_test}, y=y_test, shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=test_input_fn)
eval_results

INFO:tensorflow:Starting evaluation at 2019-09-19-13:39:04
INFO:tensorflow:Restoring parameters from /var/folders/j2/hf6944zn74l9y35sr4mgbzb00000gn/T/tmp1i3slmao/model.ckpt-44000
INFO:tensorflow:Finished evaluation at 2019-09-19-13:39:06
INFO:tensorflow:Saving dict for global step 44000: accuracy = 0.9779, average_loss = 0.11287731, global_step = 44000, loss = 14.288267


{'accuracy': 0.9779,
 'average_loss': 0.11287731,
 'loss': 14.288267,
 'global_step': 44000}

**Warning / caution**<br>
The tensorflow.contrib package contains many useful functions, but it is a place for experimental code that has not yet graduated to be part of the core TensorFlow API. So the DNNClassifier class (and any other contrib code) may change without notice in the future. [This is exactly why the code had to be adapted, see the GitHub link above].

## Training a DNN Using Plain TensorFlow
page 265<br>
Here, we use TensorFlow's lower-level Python API (see chapter 9) to build the same network as above and train it on the MNIST dataset using mini-batch gradient descent.

### Construction Phase and Execution Phase
page 265<br>
For the construction of the DNN, we first specify its overall structure. Here, we use two fully connected hidden layers: the first with 300 neurons and the second with 100 neurons. The output layer has 10 neurons, one for each of the 10 classes (digits 0 to 9) of the MNIST dataset. Then, we need placeholder nodes that will receive the data (features and labels). Next, we build the layers. There are two options: (i) construction "by hand" with our own function and (ii) using the functions provided by TensorFlow. Here, we try out both, selected via the variable dnn_choice in the code below. Only one of the two networks should live in the graph at a time, so the default graph has to be cleared before the other one is built, e.g., with tf.reset_default_graph(). Restarting the kernel might not be necessary but is sufficient to switch from our own construction to construction via TensorFlow.<br>
Let's go through the function that we use to construct the layers ourselves:
1. In order to keep the graph neat, a name scope for the layer is defined.
2. The number of inputs equals the number of features, i.e., the size of the second dimension of the data (first dimension = instances).
3. W – or the *kernel* of that layer – will be a node for the matrix holding the weights for this layer. Its shape is (n_inputs, n_neurons). The weights shall be initialized randomly (to avoid certain symmetries - e.g. all weights are 0 - that might compromise training) and truncated (i.e., have a cutoff, which reduces the probability of slow training). A truncated normal (Gaussian) distribution with a standard deviation of $2/\sqrt{n_{\rm inputs}}$ is used. This standard deviation can speed up training tremendously for reasons that are going to be discussed in chapter 11.
4. Node "b" will hold the biases.
5. Compute $Z=X\cdot W+b$. In this vectorized form, all instances (of the current batch) are processed in one go.
6. Finally, pass $Z$ to an activation function, if provided, and return the result. If no activation function is provided just return $Z$.

After construction of the layers - via the own function or directly via TensorFlow - is finished, we use softmax regression to assign probabilities to the classes and cross entropy to penalize probabilities that differ from perfect prediction of the correct class.<br><br>
**General note**<br>
The sparse_softmax_cross_entropy_with_logits() function is equivalent to applying the softmax activation function and then computing the cross entropy, but it is more efficient, and it properly takes care of corner cases like logits equal to 0. This is why we did not apply the softmax activation function earlier. There is also another function called softmax_cross_entropy_with_logits(), which takes labels in the form of one-hot vectors (instead of ints from 0 to the number of classes minus 1).<br><br>
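
What the two loss functions compute can be sketched in NumPy. This is an illustration of the mathematics only, not TensorFlow's implementation; the function names `sparse_softmax_xent` and `dense_softmax_xent`, the helper `log_softmax`, and the example values are assumptions. Working with the log of the softmax directly avoids taking the logarithm of a probability that has underflowed to zero.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def sparse_softmax_xent(logits, labels):
    """Cross entropy per instance; labels are class indices 0 .. n_classes-1."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels]      # pick the log-prob of the true class

def dense_softmax_xent(logits, one_hot_labels):
    """Same quantity, but labels are given as one-hot vectors."""
    return -(one_hot_labels * log_softmax(logits)).sum(axis=1)

example_logits = np.array([[2.0, 0.5, -1.0],
                           [0.1, 0.2, 3.0]])
example_labels = np.array([0, 2])
one_hot = np.eye(3)[example_labels]                        # one-hot encoding of the labels

xent_sparse = sparse_softmax_xent(example_logits, example_labels)
xent_dense = dense_softmax_xent(example_logits, one_hot)
print(xent_sparse, xent_dense, xent_sparse.mean())
```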
With the cost function ready, we next implement training via gradient descent with the "train" node. We will also want to evaluate the quality of the predictions. Here, we use the network's overall accuracy within the "eval" node. The node also uses the "in_top_k()" function. With $k=1$, it checks whether the prediction with the highest assigned probability is correct. Finally, there is a global variable initializer node and a saver node.<br><br>
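
The evaluation step can be mimicked in NumPy to see what is being measured. The sketch below is an assumption-laden stand-in for the TensorFlow node, not its implementation: the function name `in_top_k_numpy` is hypothetical, and ties are resolved by counting only strictly larger scores. For $k=1$ it amounts to checking whether the largest logit belongs to the true class.

```python
import numpy as np

def in_top_k_numpy(predictions, targets, k):
    """True where the target class is among the k highest-scoring classes."""
    target_scores = predictions[np.arange(len(targets)), targets]
    n_better = (predictions > target_scores[:, None]).sum(axis=1)  # classes scoring strictly higher
    return n_better < k

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
true_classes = np.array([1, 1, 2])

correct_top1 = in_top_k_numpy(scores, true_classes, k=1)
accuracy_top1 = correct_top1.astype(np.float32).mean()   # same idea as reduce_mean(cast(...))
print(correct_top1, accuracy_top1)
```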
The execution phase starts by loading the MNIST data.

In [6]:
import numpy as np # needed for np.sqrt below (e.g., after a kernel restart)
import tensorflow as tf
# CONSTRUCTION
# DNN overall structure
n_inputs = 28 * 28 # MNIST features size
n_hidden1 = 300    # neurons in hidden layer 1
n_hidden2 = 100    # neurons in hidden layer 2
n_outputs = 10     # outputs
# data placeholder nodes
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X") # placeholder for features
y = tf.placeholder(tf.int64, shape=(None), name="y")             # placeholder for labels
# choose one option to build the two hidden layers and the output layer
dnn_choice = 2
if dnn_choice == 1:
    # function for building a layer
    def neuron_layer(X, n_neurons, name, activation=None):
        with tf.name_scope(name):
            n_inputs = int(X.get_shape()[1])                                 # number of features
            stddev = 2 / np.sqrt(n_inputs)                                   # standard deviation, see text above
            init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev) # initial matrix with random weights
            W = tf.Variable(init, name="kernel")                             # node holding the weights
            b = tf.Variable(tf.zeros([n_neurons]), name="bias")              # node holding the biases
            Z = tf.matmul(X, W) + b                                          # input to activation function
            if activation is not None:                                       # return activation(Z) or Z
                return activation(Z)
            else:
                return Z
    # build the layers via the function
    with tf.name_scope("dnn"):
        hidden1 = neuron_layer(X, n_hidden1, name="hidden1", activation=tf.nn.relu) # take X as input use ReLU
        hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu) # the same, but input is "hidden1"
        logits = neuron_layer(hidden2, n_outputs, name="outputs")                   # now "hidden2", keep output as is
else:
    # build the layers directly from tensorflow
    with tf.name_scope("dnn"):
        hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)      # everything clear ...
        hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2",activation=tf.nn.relu) # ... no further ...
        logits = tf.layers.dense(hidden2, n_outputs, name="outputs")                        # ... explanation required
# define a loss function and node
with tf.name_scope("loss"):
    xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=logits) # labels = classes (0-9) ...
                                                                                    # ... logits = previous output
    loss = tf.reduce_mean(xentropy, name="loss")                                    # average the cross entropy ...
                                                                                    # ... over all instances in the batch
# build the optimizer and give it a learning rate
learning_rate = 0.01
with tf.name_scope("train"):                                     # training node
    optimizer = tf.train.GradientDescentOptimizer(learning_rate) # use gradient descent ...
    training_op = optimizer.minimize(loss)                       # ... on the loss function (cross entropy)
# specify evaluation
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)                  # check whether the topmost prediction is correct
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) # get accuracy
# initialize all variables and use a saver
init = tf.global_variables_initializer()
saver = tf.train.Saver()
# EXECUTION
# get the MNIST dataset
from tensorflow.examples.tutorials.mnist import input_data # load the data ...
mnist = input_data.read_data_sets("/tmp/data/")            # ... from here

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


Next, define the number of epochs and the number of instances in a batch. Then start a session, run the initializer, and loop through the epochs. In each epoch, loop through all the batches and successively train the algorithm on each of them. At the end of an epoch, evaluate the accuracy on the last batch and on the test set and print the results on screen. After all epochs are run, save the final model.

In [7]:
# specify epochs and batches
n_epochs = 40   # number of epochs
batch_size = 50 # batch size
# train the algorithm for all epochs on all batches
with tf.Session() as sess:                                                               # session start
    init.run()                                                                           # initialize variables
    for epoch in range(n_epochs):                                                        # loop through epochs
        for iteration in range(mnist.train.num_examples // batch_size):                  # loop through batches
            X_batch, y_batch = mnist.train.next_batch(batch_size)                        # get a batch
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})                    # train on the batch
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})                    # accuracy on the last batch only
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels}) # accuracy on the test set
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)           # print the results
    save_path = saver.save(sess, "./tf_logs/10_NN_Intro/my_model_final.ckpt")            # save the final model

0 Train accuracy: 0.9 Test accuracy: 0.9045
1 Train accuracy: 0.94 Test accuracy: 0.924
2 Train accuracy: 0.96 Test accuracy: 0.9335
3 Train accuracy: 0.96 Test accuracy: 0.9354
4 Train accuracy: 0.92 Test accuracy: 0.9441
5 Train accuracy: 0.98 Test accuracy: 0.9476
6 Train accuracy: 0.98 Test accuracy: 0.9516
7 Train accuracy: 1.0 Test accuracy: 0.9527
8 Train accuracy: 0.94 Test accuracy: 0.9564
9 Train accuracy: 0.98 Test accuracy: 0.9589
10 Train accuracy: 0.94 Test accuracy: 0.9603
11 Train accuracy: 0.98 Test accuracy: 0.9645
12 Train accuracy: 0.98 Test accuracy: 0.9642
13 Train accuracy: 0.96 Test accuracy: 0.9658
14 Train accuracy: 0.98 Test accuracy: 0.9667
15 Train accuracy: 0.96 Test accuracy: 0.9693
16 Train accuracy: 0.94 Test accuracy: 0.969
17 Train accuracy: 1.0 Test accuracy: 0.9703
18 Train accuracy: 1.0 Test accuracy: 0.9707
19 Train accuracy: 0.96 Test accuracy: 0.9703
20 Train accuracy: 0.98 Test accuracy: 0.9721
21 Train accuracy: 1.0 Test accuracy: 0.9723
22 Tr

### Using the Neural Network
page 269<br>
With the model trained, we can now use it to make predictions. To do so, we first restore the model, then take a few instances from the test set, calculate the logits for each instance, and pick the class with the highest logit (which is also the class with the highest probability).

In [8]:
number_of_instances = 20                                              # number of instances from the test set
with tf.Session() as sess:                                            # start a session
    saver.restore(sess, "./tf_logs/10_NN_Intro/my_model_final.ckpt")  # restore the saved model
    X_new_scaled = X_test[:number_of_instances]                       # specify new instances
    Z = logits.eval(feed_dict={X: X_new_scaled})                      # calculate the logits for each instance
    y_pred = np.argmax(Z, axis=1)                                     # pick the class with the highest logit for each instance
# print the predicted classes and check whether all instances have been processed
print("Predicted classes:", y_pred)
print("Actual classes:   ", y_test[:number_of_instances])
len(y_pred) == number_of_instances

INFO:tensorflow:Restoring parameters from ./tf_logs/10_NN_Intro/my_model_final.ckpt
Predicted classes: [7 2 1 0 4 1 4 9 6 9 0 6 9 0 1 5 9 7 3 4]
Actual classes:    [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]


True
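
The cell above takes the `argmax` of the logits directly. This yields the same class as taking the `argmax` of the softmax probabilities, because softmax preserves the ordering of the logits. A small sketch in plain Python may make this concrete; the function name `softmax` and the toy logits are chosen for this illustration only and are not part of the notebook's code.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [1.2, -0.3, 4.1, 0.0]           # made-up logits for one instance
probs = softmax(toy_logits)

# index of the largest logit and index of the largest probability
pred_from_logits = max(range(len(toy_logits)), key=lambda c: toy_logits[c])
pred_from_probs = max(range(len(probs)), key=lambda c: probs[c])
print(pred_from_logits, pred_from_probs)     # both select the same class index
```

Skipping the softmax at prediction time therefore saves a little computation whenever only the predicted class is needed and not the probabilities themselves.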

## Fine-Tuning Neural Network Hyperparameters
page 270<br>
The flexibility of neural networks also poses a problem: there are so many possibilities for the network structure (number of layers, number of neurons in a layer, connections, activation functions, initialization of weights and biases, learning rate, etc.) that choosing hyperparameters remains something of a black art. For hyperparameter search, randomized search is usually much more efficient than grid search when there are many hyperparameters (see also Chapter 2). Another option is to use tools like Oscar: http://oscar.calldesk.ai.
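
To illustrate what randomized search over network hyperparameters could look like, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the sampled ranges, the function names `sample_config` and `random_search`, and in particular `dummy_evaluate`, which merely stands in for "train the network with this configuration and return its validation accuracy".

```python
import math
import random


def sample_config(rng):
    """Draw one random hyperparameter configuration (ranges are illustrative assumptions)."""
    return {
        "n_hidden_layers": rng.randint(1, 4),
        "n_neurons": rng.choice([50, 100, 150, 300]),
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform between 1e-4 and 1e-1
        "activation": rng.choice(["relu", "tanh"]),
    }


def random_search(evaluate, n_trials=20, seed=42):
    """Evaluate n_trials random configurations and return the best one with its score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)                      # e.g. accuracy on a validation set
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def dummy_evaluate(config):
    """Stand-in score (NOT a real model): highest when the learning rate is close to 0.01."""
    return -abs(math.log10(config["learning_rate"]) + 2.0)


best_config, best_score = random_search(dummy_evaluate, n_trials=50)
print(best_config, best_score)
```

Sampling the learning rate on a logarithmic scale is a common choice because plausible values span several orders of magnitude.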

### Number of Hidden Layers
page 270<br>
It has been shown that a neural network with only one hidden layer can approximate *any* continuous function to arbitrary accuracy, provided this layer has a sufficient number of neurons. For this reason, researchers initially did not pay much attention to multilayer networks. However, the *parameter efficiency* is usually much better in networks with more layers. This can be understood as follows: imagine the task of drawing a forest in a graphics program (e.g. Inkscape). Without being allowed to copy, one has to draw every leaf to get a twig, every twig to get a branch, every branch to get a tree, and every tree to get the forest, each time from scratch. Obviously, it would be much faster to draw a few leaves, copy them in a random manner to get a twig, and do the same with twigs, branches, and trees. One would be finished much sooner since repetitive tasks have to be done only once.

In deep neural networks, different layers match different patterns: low-level patterns such as edges in the lower layers, then possibly shapes in the intermediate layers, and objects in the higher layers. It would be inefficient if every object had to be matched by *its own* neurons in a single layer, since most of these neurons would recognize things that other neurons already recognize for other objects. Using more layers not only speeds up training (due to higher parameter efficiency) but can also help with generalization. Tasks that mainly differ at a high level can share the low-level and intermediate layers, at least as a starting point. As a start, one hidden layer should usually be fine. From there, one may incrementally add layers until the model starts to overfit. It is also very common to start with a pretrained model instead of starting from scratch.

### Number of Neurons per Hidden Layer
page 271<br>
The number of input neurons should obviously match the number of features in the data. Likewise, 10 output neurons are appropriate for the MNIST dataset with its 10 mutually exclusive classes. For the hidden layers, the consensus used to be that a funnel structure is appropriate, the idea being that many low-level features coalesce into fewer high-level features. On the other hand, all the richness of the universe seems to emerge from only 16 particles in the Standard Model, and those 16 particles give rise to far more than 16 objects. In any case, the former consensus in favor of a funnel-shaped layout is no longer widely shared. For the MNIST dataset, one may also get good results with two layers of 150 neurons each. Just as with the number of layers, one can increase the number of neurons per layer until overfitting sets in. In general, increasing the number of layers improves the results more than increasing the number of neurons per layer.
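
To get a feeling for how the layout affects model size, one can count the trainable parameters (weights and biases) of a fully connected network. The following sketch does this in plain Python; the helper name `count_parameters` is made up for this illustration, and the two layouts are the funnel with 300 and 100 hidden neurons and the uniform layout with two hidden layers of 150 neurons each.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, inputs first, outputs last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per neuron
    return total


funnel = count_parameters([28 * 28, 300, 100, 10])    # funnel-shaped hidden layers
uniform = count_parameters([28 * 28, 150, 150, 10])   # two hidden layers of 150 neurons
print(funnel, uniform)
```

The funnel layout has 266,610 parameters and the uniform layout 141,910. In both cases, most parameters sit in the first hidden layer because of the 784 input features.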

Another, more brute-force way to get a good model is to use more layers and neurons than one actually expects to need and then use early stopping to prevent the model from overfitting. This is also called the *stretch pants* approach: instead of searching for pants that fit perfectly, one buys large stretch pants that shrink down to the right size. This avoids a tedious search.
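
The early-stopping part of this approach can be sketched independently of any framework. The function below is a minimal illustration under stated assumptions: it takes a list of per-epoch validation losses (made-up numbers here) and a `patience` parameter, and the name `early_stopping_epoch` is hypothetical. Resetting the counter after every improvement is a design choice of this sketch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the loss has not improved for `patience` consecutive epochs;
    stop_epoch is None when that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_progress = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_progress = 0       # progress made: reset the counter
        else:
            epochs_without_progress += 1
            if epochs_without_progress >= patience:
                return best_epoch, epoch      # stop here, keep the model from best_epoch
    return best_epoch, None


# made-up validation losses: improvement up to epoch 3, then overfitting sets in
losses = [0.35, 0.18, 0.13, 0.10, 0.11, 0.12, 0.14, 0.15]
print(early_stopping_epoch(losses, patience=3))
```

In a real training loop, one would save a checkpoint whenever the validation loss improves and restore that checkpoint after stopping.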

### Activation Functions
page 272<br>
The ReLU function (or one of its variants, see Chapter 11) is a good default activation function for the hidden layers because it is cheaper to compute than many other activation functions and because gradient descent is less likely to get stuck on plateaus: ReLU does not saturate for large positive inputs (as opposed to the logistic function and the hyperbolic tangent, which both saturate at 1 for large inputs). For the output layer, the softmax activation function is usually a good choice for classification (when the classes are mutually exclusive). For regression, one may simply use the last layer's outputs without any activation function.
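
The saturation argument can be checked numerically by comparing the derivatives of the three activation functions for increasingly large inputs. This is a small sketch in plain Python; the function names are chosen for this illustration and are not part of the notebook's code.

```python
import math


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = logistic(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of the hyperbolic tangent: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


# the gradients of logistic and tanh vanish for large inputs, the ReLU gradient does not
for z in (0.5, 5.0, 20.0):
    print(z, relu_grad(z), logistic_grad(z), tanh_grad(z))
```

For large inputs, the logistic and tanh gradients become vanishingly small, so weight updates that pass through such a neuron almost stop, whereas the ReLU gradient stays at 1.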

## Exercises
page 272
### 1.-8.
Solutions are shown in Appendix A of the book and in the separate notebook *ExercisesWithoutCode*.
### 9.
Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).

In [9]:
# The entire solution of exercise 9 is from the github link above.
# first, define the architecture
n_inputs = 28*28  # MNIST features
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10
# function to reset the graph
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)
reset_graph()
# feature and label placeholders
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
# neural network within node "dnn"
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
# loss node with softmax (=> probabilities) and cross entropy (=> penalty for wrong probabilities)
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    loss_summary = tf.summary.scalar('log_loss', loss)
# training node and learning rate
learning_rate = 0.01
with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
# evaluation node uses only the highest probability and calculates the accuracy
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)                     # check whether the topmost prediction is correct
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))    # get the accuracy
    accuracy_summary = tf.summary.scalar('accuracy', accuracy) # append the accuracy to the summary
# initialization node
init = tf.global_variables_initializer()
# saver node
saver = tf.train.Saver()
# import os (needed for the checkpoint handling below) and datetime;
# use datetime to build a function that returns a path for saving the logged data
import os
from datetime import datetime
def log_dir(prefix=""):
    now = datetime.utcnow().strftime("%Y%m%d%H%M%S") # date string
    root_logdir = "./tf_logs/10_NN_Intro"            # default directory
    if prefix:                                       # if a prefix has been specified ...
        prefix += "-"                                # ... append a "-" sign
    name = prefix + "run-" + now                     # in any case, append "run-" and the date string from above
    return "{}/{}/".format(root_logdir, name)        # build the complete path and return it
# specify a directory
logdir = log_dir("mnist_dnn")
# create a file writer
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())
# get the number of instances (m) and features (n)
m, n = X_train.shape
# training schedule
n_epochs = 1001
batch_size = 50
n_batches = int(np.ceil(m / batch_size))
# checkpoints etc.
checkpoint_path = "./tf_logs/10_NN_Intro/tmp/my_deep_mnist_model.ckpt"
checkpoint_epoch_path = checkpoint_path + ".epoch"
final_model_path = "./tf_logs/10_NN_Intro/my_deep_mnist_model"
# initialize some quantities
best_loss = np.inf
epochs_without_progress = 0
max_epochs_without_progress = 50
# build a function that randomly puts instances into a batch
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch
# And now open a session and run this thing!
with tf.Session() as sess:
    # if the checkpoint file exists, restore the model, including the session, and load the epoch number
    if os.path.isfile(checkpoint_epoch_path):
        with open(checkpoint_epoch_path, "rb") as f:
            start_epoch = int(f.read())
        print("Training was interrupted. Continuing at epoch", start_epoch)
        saver.restore(sess, checkpoint_path)
    # if no checkpoint file exists, start from the beginning (with a new session)
    else:
        start_epoch = 0
        sess.run(init)
    # now loop through the (remaining) epochs
    for epoch in range(start_epoch, n_epochs):
        # in each epoch, loop through all the (randomized) batches and run the training operation on each batch
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # calculate metrics on the validation set
        accuracy_val, loss_val, accuracy_summary_str, loss_summary_str = sess.run([accuracy, loss, accuracy_summary,
                                                                                   loss_summary],
                                                                                  feed_dict={X: X_valid, y: y_valid})
        # save the results
        file_writer.add_summary(accuracy_summary_str, epoch)
        file_writer.add_summary(loss_summary_str, epoch)
        # print a report every 5 epochs
        if epoch % 5 == 0:
            print("Epoch:", epoch,                                             # current epoch
                  "\tValidation accuracy: {:.3f}%".format(accuracy_val * 100), # current accuracy on validation set
                  "\tLoss: {:.5f}".format(loss_val))                           # current loss on validation set
            saver.save(sess, checkpoint_path)                                  # save a checkpoint for the session
            with open(checkpoint_epoch_path, "wb") as f:                       # open epoch checkpoints and ...
                f.write(b"%d" % (epoch + 1))                                   # update the epoch
            # if the loss has decreased (good), update the according quantities
            if loss_val < best_loss:
                saver.save(sess, final_model_path)
                best_loss = loss_val
            # if it has not decreased over the last 5 epochs, ...
            else:
                # ... then count the epochs without progress ...
                epochs_without_progress += 5
                # ... and when this count exceeds a certain threshold, ...
                if epochs_without_progress > max_epochs_without_progress:
                    # ... then it is time to stop (early, i.e., before the model totally goes south overfitting)
                    print("Early stopping")
                    break

Epoch: 0 	Validation accuracy: 90.240% 	Loss: 0.35380
Epoch: 5 	Validation accuracy: 95.120% 	Loss: 0.17921
Epoch: 10 	Validation accuracy: 96.520% 	Loss: 0.12785
Epoch: 15 	Validation accuracy: 97.180% 	Loss: 0.10320
Epoch: 20 	Validation accuracy: 97.480% 	Loss: 0.09166
Epoch: 25 	Validation accuracy: 97.620% 	Loss: 0.08206
Epoch: 30 	Validation accuracy: 97.740% 	Loss: 0.07877
Epoch: 35 	Validation accuracy: 97.820% 	Loss: 0.07419
Epoch: 40 	Validation accuracy: 97.840% 	Loss: 0.07160
Epoch: 45 	Validation accuracy: 98.080% 	Loss: 0.06740
Epoch: 50 	Validation accuracy: 98.040% 	Loss: 0.06734
Epoch: 55 	Validation accuracy: 98.000% 	Loss: 0.06683
Epoch: 60 	Validation accuracy: 98.040% 	Loss: 0.06728
Epoch: 65 	Validation accuracy: 98.180% 	Loss: 0.06668
Epoch: 70 	Validation accuracy: 98.180% 	Loss: 0.06600
Epoch: 75 	Validation accuracy: 98.140% 	Loss: 0.06639
Epoch: 80 	Validation accuracy: 98.120% 	Loss: 0.06659
Epoch: 85 	Validation accuracy: 98.220% 	Loss: 0.06595
Epoch: 90 	V

Above, the whole model is explained via comments. Training stops early once the loss on the validation set has failed to improve for more than `max_epochs_without_progress` epochs in total (a sign of overfitting). Below, the best model before the onset of overfitting is restored and its accuracy is evaluated on the **test set** (since this is the *final* model).

In [10]:
os.remove(checkpoint_epoch_path)                                    # remove the epoch file so a later run starts from scratch
with tf.Session() as sess:                                          # start a new session ...
    saver.restore(sess, final_model_path)                           # ... and restore the best model saved during training
    accuracy_test = accuracy.eval(feed_dict={X: X_test, y: y_test}) # get the accuracy on the test set ...
accuracy_test                                                       # ... and display it

INFO:tensorflow:Restoring parameters from ./tf_logs/10_NN_Intro/my_deep_mnist_model


0.9796

Now, we have already learned quite a lot about artificial neural networks!