## Chapter 10. Introduction to Artificial Neural Networks

The key idea behind _artificial neural networks_ (ANNs) is to study the brain's architecture for inspiration on how to build an intelligent machine.

ANNs are the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal to tackle large and highly complex Machine Learning tasks.

### From Biological to Artificial Neurons

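ANNs were first introduced in 1943 by Warren McCulloch and Walter Pitts, who used _propositional logic_ to present a simplified computational model of how biological neurons might work together.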

In the early 1980s there was a revival of interest in ANNs as new network architectures were invented and better training techniques were developed. But by the 1990s, powerful alternative Machine Learning techniques such as Support Vector Machines were favored by most researchers.

Reasons to believe this wave of interest in ANNs is different and will have a much more profound impact on our lives:
 - Huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.
 - Computing power (Moore's Law, GPUs)
 - The training algorithms have been improved.
 - Some theoretical limitations of ANNs have turned out to be benign in practice.
 - ANNs seem to have entered a virtuous circle of funding and progress.
 
#### Biological Neurons

Each neuron is typically connected to thousands of other neurons. Highly complex computations can be performed by a vast network of fairly simple neurons.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-1.png" width=400px alt="fig10-1" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-1. Biological neuron_</div>

#### Logical Computations with Neurons

_Artificial neuron_: one or more binary (on/off) inputs and one binary output. Even with such a simplified model, it is possible to build a network of artificial neurons that computes any logical proposition.
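
As a minimal sketch of this idea, the snippet below models a neuron that fires when at least two of its incoming signals are active. The threshold of 2 and the helper names (`mp_neuron`, `logical_and`, `logical_or`) are assumptions made for illustration. Connecting each input once gives a logical AND, while connecting each input twice gives a logical OR.

```python
def mp_neuron(inputs, threshold=2):
    """Fire (return 1) when at least `threshold` incoming signals are active."""
    return int(sum(inputs) >= threshold)

def logical_and(a, b):
    # One connection per input: both must be active to reach the threshold.
    return mp_neuron([a, b])

def logical_or(a, b):
    # Two connections per input: a single active input already reaches the threshold.
    return mp_neuron([a, a, b, b])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Larger propositions can be built by feeding the outputs of such neurons into other neurons.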

#### The Perceptron

*Perceptron*: one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a _linear threshold unit_ (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-4.png" width=400px alt="fig10-4" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-4. Linear threshold unit_</div>

_Common step functions used in Perceptrons_

$$\operatorname{heaviside}(z) = \begin{cases}
0 & \text{if } z < 0 \\
1 & \text{if } z \ge 0
\end{cases}
\qquad
\operatorname{sgn}(z) = \begin{cases}
-1 & \text{if } z < 0 \\
0 & \text{if } z = 0 \\
+1 & \text{if } z > 0
\end{cases}$$
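
A linear threshold unit can be sketched in a few lines of NumPy: it computes the weighted sum of its inputs and passes the result through one of the step functions. The input and weight values below are made up for illustration, and the function names are not taken from any library.

```python
import numpy as np

def heaviside(z):
    # 0 for z < 0, 1 for z >= 0
    return np.where(z >= 0, 1, 0)

def sgn(z):
    # -1 for z < 0, 0 for z == 0, +1 for z > 0
    return np.sign(z).astype(int)

def ltu(x, w, step=heaviside):
    """Linear threshold unit: weighted sum of the inputs followed by a step function."""
    z = np.dot(w, x)
    return step(z)

x = np.array([1.0, 2.0, -1.0])   # example inputs (made up)
w = np.array([0.5, -1.0, 2.0])   # example weights (made up)
print(np.dot(w, x))              # weighted sum: 0.5 - 2.0 - 2.0 = -3.5
print(ltu(x, w), ltu(x, w, step=sgn))
```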

A Perceptron is simply composed of a single layer of LTUs, with each neuron connected to all the inputs.
These connections are often represented using special passthrough neurons called _input neurons_: they just
output whatever input they are fed. Moreover, an extra bias feature is generally added ($x_0 = 1$). This bias
feature is typically represented using a special type of neuron called a _bias neuron_, which just outputs 1
all the time.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-5.png" width=400px alt="fig10-5" style="padding-bottom:1.0em;padding-top:2.0em;"></center>Figure 10-5. Perceptron diagram. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput
classifier.</div>

How is a Perceptron trained?

Hebb's rule (Hebbian learning): the connection weight between two neurons is increased whenever they have the same
output.

Perceptrons are trained using a variant of this rule that takes into account the error made by the network; it does not reinforce connections that lead to the wrong output. More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. The rule is shown as

*Perceptron learning rule (weight update)*

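$$ w_{i,j}^{(\text{next step})} = w_{i,j} + \eta\,(y_j - \hat{y}_j)\,x_i$$

where $w_{i,j}$ is the connection weight between the $i$-th input neuron and the $j$-th output neuron, $x_i$ is the $i$-th input value of the current training instance, $\hat{y}_j$ is the output of the $j$-th output neuron, $y_j$ is its target output, and $\eta$ is the learning rate.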

*Perceptron convergence theorem*: if the training instances are linearly separable, this algorithm converges to a solution.
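
The training procedure can be sketched as follows for a Perceptron with a single output neuron. Each weight is increased by the learning rate times (target minus prediction) times the input, so nothing changes when the prediction is already correct. The toy dataset (logical AND), the learning rate of 1.0, and the function names are assumptions chosen for illustration.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Train a single-output Perceptron, one training instance at a time."""
    X_b = np.c_[np.ones(len(X)), X]            # prepend the bias feature x0 = 1
    w = np.zeros(X_b.shape[1])
    for _ in range(n_epochs):
        for x_i, target in zip(X_b, y):
            y_hat = int(np.dot(w, x_i) >= 0)   # Heaviside step on the weighted sum
            w += eta * (target - y_hat) * x_i  # no update when the prediction is right
    return w

def predict(X, w):
    X_b = np.c_[np.ones(len(X)), X]
    return (X_b @ w >= 0).astype(int)

# Linearly separable toy problem (logical AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w = train_perceptron(X, y)
print(w, predict(X, w))
```

Because the AND problem is linearly separable, the loop stops changing the weights after a few epochs.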

Perceptrons do not output a class probability; they just make predictions based on a hard threshold. This is one good reason to prefer Logistic Regression over Perceptrons.

Single-layer Perceptrons are incapable of solving some trivial problems, such as the Exclusive OR (XOR) classification problem. Stacking multiple Perceptrons into a *Multi-Layer Perceptron* (MLP) eliminates some of these limitations.
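
The XOR problem is a classic example of such a limitation: no single line separates its two classes, but a network with one hidden layer can solve it. The sketch below uses hand-picked weights (an assumption for illustration, not learned values): one hidden unit behaves like OR, the other like AND, and the output unit fires when OR is on and AND is off.

```python
import numpy as np

def step(z):
    return (np.asarray(z) >= 0).astype(int)

def xor_mlp(x1, x2):
    """Two-layer network of threshold units with hand-picked weights."""
    h_or = step(x1 + x2 - 0.5)        # hidden unit 1: at least one input is on
    h_and = step(x1 + x2 - 1.5)       # hidden unit 2: both inputs are on
    return step(h_or - h_and - 0.5)   # output: OR but not AND, i.e. XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", int(xor_mlp(x1, x2)))
```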

#### Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called _hidden layers_, and one final layer of LTUs called the _output layer_. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. 

*Deep neural network* (DNN): an ANN with two or more hidden layers.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-7.png" width=400px alt="fig10-7" style="padding-bottom:1.0em;padding-top:2.0em;"></center>*Figure 10-7. Multi-Layer Perceptron*</div>

The backpropagation training algorithm can be described as Gradient Descent using reverse-mode autodiff. For each training instance, the algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
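
The three stages can be sketched in NumPy for a tiny network. This is only an illustration under stated assumptions: a toy XOR dataset, one hidden layer of 4 tanh units, a single sigmoid output, a cross-entropy loss, and full-batch updates rather than one instance at a time. None of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (XOR), used only to exercise the algorithm.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer with 4 tanh units and one sigmoid output unit (sizes are arbitrary).
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
eta = 0.1
losses = []

for step in range(5000):
    # Forward pass: compute and keep every layer's output.
    A1 = np.tanh(X @ W1 + b1)
    P = 1 / (1 + np.exp(-(A1 @ W2 + b2)))
    # Measure the error (mean cross entropy).
    losses.append(float(-np.mean(y * np.log(P) + (1 - y) * np.log(1 - P))))

    # Reverse pass: propagate the error gradient from the output back to the input.
    dZ2 = (P - y) / len(X)                   # gradient w.r.t. the output pre-activation
    dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * (1 - A1 ** 2)       # chain rule through tanh
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # Gradient Descent step: nudge each parameter against its gradient.
    W1 -= eta * dW1
    b1 -= eta * db1
    W2 -= eta * dW2
    b2 -= eta * db2

print("first loss:", losses[0], "last loss:", losses[-1])
```

Frameworks such as TensorFlow derive the reverse pass automatically, so in practice only the forward computation and the loss are written by hand.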

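*Activation functions* (used in place of the step function, whose flat segments give Gradient Descent no gradient to work with):
 - *Logistic function*, $\sigma (z)=1/(1+\exp(-z)) \ \ \in (0,1)$, has a well-defined nonzero derivative everywhere.
 - _Hyperbolic tangent function_, $\tanh (z) = 2\sigma (2z)-1 \ \ \in (-1,1)$, tends to make each layer's output more or less normalized (i.e., centered around 0) at the beginning of training, which often helps speed up convergence.
 - _ReLU function_, $ReLU(z)=\max(0,z) \ \  \in [0,\infty)$, is not differentiable at $z=0$ and has a derivative of 0 for $z<0$, but it works well in practice and is fast to compute.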
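
A quick NumPy sketch makes these properties concrete. It checks the identity between tanh and the logistic function and estimates each derivative numerically; the sample points and the finite-difference helper are illustrative choices, not part of the original notes.

```python
import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def derivative(f, z, eps=1e-6):
    # Central finite difference, good enough for a quick look at the slope.
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-10.0, -1.0, 0.5, 10.0])

# tanh is a rescaled and shifted logistic function.
print(np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1))

for name, f in [("logistic", logistic), ("tanh", np.tanh), ("relu", relu)]:
    print(name, f(z).round(4), derivative(f, z).round(4))
```

At large input values the logistic and tanh slopes are close to zero, while the ReLU slope stays at 1 for any positive input.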
 
<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-8.png" width=500px alt="fig10-8" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-8. Activation functions and their derivatives_</div>

An MLP is often used for classification, with each output corresponding to a different binary class. When the classes are exclusive, the output layer is typically modified by replacing the individual activation
functions by a shared _softmax_ function. The output of each neuron corresponds to the estimated probability of the corresponding class. Note that the signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a *feedforward neural network* (FNN).
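
For reference, the softmax function is commonly written as follows, using $z_k$ for the output layer's score for class $k$ and $K$ for the number of classes (this notation is an assumption, as the notes above do not define it):

$$\hat{p}_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$$

Every $\hat{p}_k$ is positive and the values sum to 1. For example, the scores $(1, 2, 3)$ give estimated probabilities of roughly $(0.090, 0.245, 0.665)$.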

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-9.png" width=400px alt="fig10-9" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-9. A modern MLP (including ReLU and softmax) for classification_</div>

<font color=blue>*NOTE*</font>
>Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time. But it turns out that the ReLU activation function generally works better in ANNs.

### Training an MLP with TensorFlow's High-Level API

The following code trains a DNN for classification with two hidden layers (one with 300 neurons, and the other with 100 neurons) and a softmax output layer with 10 neurons:

In [3]:
import numpy as np
import tensorflow as tf

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

In [5]:
feature_cols = [tf.feature_column.numeric_column("X", shape=[28 * 28])]
dnn_clf = tf.estimator.DNNClassifier(hidden_units=[300,100], n_classes=10,
                                     feature_columns=feature_cols)

input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"X": X_train}, y=y_train, num_epochs=40, batch_size=50, shuffle=True)
dnn_clf.train(input_fn=input_fn)

INFO:tensorflow:Using default config.

INFO:tensorflow:loss = 0.31678146, step = 4700 (0.244 sec)
INFO:tensorflow:global_step/sec: 438.598
INFO:tensorflow:loss = 0.37759888, step = 4800 (0.226 sec)
INFO:tensorflow:global_step/sec: 437.283
INFO:tensorflow:loss = 0.371641, step = 4900 (0.229 sec)
INFO:tensorflow:global_step/sec: 424.439
INFO:tensorflow:loss = 0.2906327, step = 5000 (0.236 sec)
INFO:tensorflow:global_step/sec: 476.048
INFO:tensorflow:loss = 0.26791564, step = 5100 (0.211 sec)
INFO:tensorflow:global_step/sec: 457.227
INFO:tensorflow:loss = 0.3268251, step = 5200 (0.217 sec)
INFO:tensorflow:global_step/sec: 461.902
INFO:tensorflow:loss = 0.40923783, step = 5300 (0.216 sec)
INFO:tensorflow:global_step/sec: 467.934
INFO:tensorflow:loss = 0.5843144, step = 5400 (0.214 sec)
INFO:tensorflow:global_step/sec: 424.606
INFO:tensorflow:loss = 0.48011422, step = 5500 (0.238 sec)
INFO:tensorflow:global_step/sec: 447.317
INFO:tensorflow:loss = 0.48280215, step = 5600 (0.222 sec)
INFO:tensorflow:global_step/sec: 463.955
...
INFO:tensorflow:loss = 0.55250883, step = 13000 (0.216 sec)
INFO:tensorflow:global_step/sec: 439.515
...
INFO:tensorflow:loss = 0.21004833, step = 21200 (0.233 sec)
INFO:tensorflow:global_step/sec: 441.514
...
INFO:tensorflow:loss = 0.07693576, step = 30000 (0.218 sec)
INFO:tensorflow:global_step/sec: 466.055
...
INFO:tensorflow:loss = 0.18860286, step = 38000 (0.226 sec)
INFO:tensorflow:global_step/sec: 451.628
...

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifierV2 at 0x7f18a00fb8d0>

In [7]:
test_input_fn = tf.compat.v1.estimator.inputs.numpy_input_fn(
    x={"X": X_test}, y=y_test, shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-02-18T21:50:32Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpxbj91xes/model.ckpt-44000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2020-02-18-21:50:33
INFO:tensorflow:Saving dict for global step 44000: accuracy = 0.9488, average_loss = 0.18019786, global_step = 44000, loss = 0.17890291
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 44000: /tmp/tmpxbj91xes/model.ckpt-44000


In [8]:
eval_results

{'accuracy': 0.9488,
 'average_loss': 0.18019786,
 'loss': 0.17890291,
 'global_step': 44000}

In [9]:
y_pred_iter = dnn_clf.predict(input_fn=test_input_fn)
y_pred = list(y_pred_iter)
y_pred[0]

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpxbj91xes/model.ckpt-44000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


{'logits': array([ 0.25431165, -3.9832058 ,  1.7883002 ,  2.7589452 , -3.0482638 ,
        -0.64774305, -7.0888457 ,  9.104177  , -1.8259383 ,  1.6111878 ],
       dtype=float32),
 'probabilities': array([1.4294300e-04, 2.0645834e-06, 6.6277420e-04, 1.7494904e-03,
        5.2586311e-06, 5.7996989e-05, 9.2484719e-08, 9.9680638e-01,
        1.7853439e-05, 5.5519660e-04], dtype=float32),
 'class_ids': array([7]),
 'classes': array([b'7'], dtype=object),
 'all_class_ids': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32),
 'all_classes': array([b'0', b'1', b'2', b'3', b'4', b'5', b'6', b'7', b'8', b'9'],
       dtype=object)}

The `DNNClassifier` class creates all the neuron layers, using the ReLU activation function by default (this can be changed by setting the `activation_fn` hyperparameter). The output layer relies on the softmax function, and the cost function is cross entropy.
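
To get a feel for the size of this network, the sketch below counts its trainable parameters. It assumes plain fully connected layers with one weight per connection and one bias per neuron.

```python
layer_sizes = [28 * 28, 300, 100, 10]   # inputs, two hidden layers, outputs

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = n_in * n_out + n_out      # weights plus biases for this layer
    print(f"{n_in:>4} -> {n_out:>3}: {n_params} parameters")
    total += n_params

print("total:", total)   # 266610
```

Almost all of the parameters sit in the first hidden layer, because every one of the 784 pixels connects to every one of its 300 neurons.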

### Training a DNN Using Plain TensorFlow

This section trains a similar network on the MNIST dataset with Mini-batch Gradient Descent, using TensorFlow's lower-level API. The first step is the construction phase, which builds the TensorFlow graph. The second step is the execution phase, where you actually run the graph to train the model.

#### Construction Phase

In [10]:
# The graph-and-session code below uses the TensorFlow 1.x API. Under TensorFlow 2.x
# it is reachable through the compat.v1 module, with 2.x behavior switched off.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

n_inputs = 28*28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

In [11]:
tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None,), name="y")  # 1D tensor of class ids

In [12]:
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)    # helps the algorithm converge much faster
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z

In [None]:
with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, name="hidden1", activation=tf.nn.relu)
    hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2", activation=tf.nn.relu)
    logits = neuron_layer(hidden2, n_outputs, name="outputs")
    y_proba = tf.nn.softmax(logits)

In [None]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

In [None]:
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [None]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [None]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

Note that `logits` is the output of the neural network _before_ going through the `softmax` activation function: for optimization reasons, we will handle the softmax computation later.

Instead of writing `neuron_layer()` yourself, you can use TensorFlow's `tf.layers.dense()` function to create a fully connected layer. It accepts parameters such as `name`, `activation`, and `kernel_initializer`; the default `activation` is `None`.

The `sparse_softmax_cross_entropy_with_logits()` function is equivalent to applying the softmax activation function and then computing the cross entropy, but it is more efficient, and it properly takes care of corner cases such as an estimated probability that rounds to exactly 0 (which would otherwise put a $\log(0)$ term in the cross entropy). It expects labels in the form of integer class ids. There is also another function called `softmax_cross_entropy_with_logits()`, which takes labels in the form of one-hot vectors.
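
The NumPy sketch below illustrates why computing the loss directly from the logits is more robust. It is not TensorFlow's implementation, only a common way to get the same quantity (the log-sum-exp trick), and the function names and sample values are made up for illustration.

```python
import numpy as np

def sparse_softmax_xentropy(logits, labels):
    """Cross entropy computed directly from logits via the log-sum-exp trick."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # does not change the softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def naive_xentropy(logits, labels):
    """Softmax first, then log: fine for moderate logits, fragile for extreme ones."""
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels])

logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
labels = np.array([0, 1])                    # integer class ids, not one-hot vectors
print(sparse_softmax_xentropy(logits, labels))
print(naive_xentropy(logits, labels))        # same values for these moderate logits

# With extreme logits the softmax of the second class underflows to 0,
# but the log-sum-exp version still returns a finite loss.
extreme = np.array([[1000.0, 0.0]])
print(sparse_softmax_xentropy(extreme, np.array([1])))
```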

#### Execution Phase

In [None]:
# tensorflow.examples.tutorials is not shipped with TensorFlow 2.x, so this cell reuses
# the X_train, y_train, X_test and y_test arrays loaded with tf.keras.datasets earlier.
def shuffle_batch(X, y, batch_size):
    # Yield mini-batches that together cover one shuffled pass over the data
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        yield X[batch_idx], y[batch_idx]

n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})  # last batch only
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./logs/my_model_final.ckpt")

#### Using the Neural Network

First the code loads the model parameters from disk. Then it loads some new images that you want to classify. Remember to apply the same feature scaling as for the training data (in this case, scale it from 0 to 1). Then the code evaluates the `logits` node. If you wanted to know all the estimated class probabilities, you would need to apply the `softmax()` function to the logits, but if you just want to predict a class, you can simply pick the class that has the highest logit value (using the `argmax()` function does the trick).

In [None]:
with tf.Session() as sess:
    saver.restore(sess, "./logs/my_model_final.ckpt")
    X_new_scaled = [...] # some new images (scaled from 0 to 1)
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)
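
Picking the largest logit works because the softmax function preserves the ordering of its inputs. The short check below uses the logits printed earlier for the first test image; the softmax is written by hand in NumPy here as an illustration.

```python
import numpy as np

# Logits printed earlier for the first test image.
logits = np.array([0.25431165, -3.9832058, 1.7883002, 2.7589452, -3.0482638,
                   -0.64774305, -7.0888457, 9.104177, -1.8259383, 1.6111878])

exp = np.exp(logits - logits.max())   # subtracting the max avoids overflow
probas = exp / exp.sum()

print(np.argmax(logits), np.argmax(probas))   # same class either way
print(probas.round(4))
```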

### Fine-Tuning Neural Network Hyperparameters

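The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable _network topology_ (how the neurons are interconnected), but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function used in each layer, the weight initialization logic, and more.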

It is much better to use randomized search over grid search. Another option is to use a tool such as Oscar (http://oscar.calldesk.ai/), which implements more complex algorithms to help you find a good set of hyperparameters quickly.
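
A randomized search can be sketched in plain Python as below. The hyperparameter ranges are illustrative assumptions, and `evaluate` is a hypothetical placeholder that returns a random score; in real use it would train a network with the given settings and return its validation accuracy.

```python
import random

random.seed(42)

def sample_hyperparameters():
    """Draw one random configuration; the ranges are illustrative assumptions."""
    return {
        "n_hidden_layers": random.randint(1, 5),
        "n_neurons": random.choice([50, 100, 200, 300]),
        "learning_rate": 10 ** random.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
        "activation": random.choice(["relu", "tanh"]),
    }

def evaluate(params):
    """Hypothetical placeholder for training a model and scoring it on a validation set."""
    return random.random()   # stand-in score so the sketch runs end to end

candidates = [sample_hyperparameters() for _ in range(10)]
best = max(candidates, key=evaluate)
print(best)
```

Sampling the learning rate on a log scale is a common choice, since useful values often span several orders of magnitude.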

#### Number of Hidden Layers

An MLP with just one hidden layer can model even the most complex functions provided it has enough neurons. However, deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.

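Real-world data is often structured in a hierarchical way, and DNNs automatically take advantage of this fact: lower hidden layers model low-level structures, intermediate layers combine them into intermediate-level structures, and the highest hidden layers and the output layer combine those into high-level structures.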

Not only does this hierarchical architecture help DNNs converge faster to a good solution, it also
improves their ability to generalize to new datasets. 

In summary, for many problems you can start with just one or two hidden layers and it will work just fine. For more complex problems, you can gradually ramp up the number of hidden layers until you start overfitting the training set. Very complex tasks typically require networks with dozens of layers (or even hundreds, but not fully connected ones), and they need a huge amount of training data. However, you will rarely have to train such networks from scratch: it is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task. Training will then be a lot faster and require much less data.

#### Number of Neurons per Hidden Layer

As for the hidden layers, a common practice is to size them to form a funnel, with fewer and fewer neurons at each layer - the rationale being that many low-level features can coalesce into far fewer high-level features. However, this practice is not as common now, and you may simply use the same size for all hidden layers. Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. In general you will get more bang for the buck by increasing the number of layers than the number of neurons per layer.

A simpler approach is to pick a model with more layers and neurons than you actually need, then use _early stopping_ to prevent it from overfitting (and other regularization techniques, especially _dropout_). This has been dubbed the “stretch pants” approach: instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.
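
Early stopping can be sketched as a simple rule over the validation loss measured after each epoch. The `patience` parameter, the function name, and the loss values below are assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the best epoch, stopping once `patience` epochs pass without improvement."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # in practice, also save a checkpoint here
        elif epoch - best_epoch >= patience:
            break                                 # validation loss stopped improving
    return best_epoch

# Made-up validation losses: improvement until epoch 3, then overfitting.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.43, 0.47, 0.35]
print(early_stopping_epoch(val_losses))   # 3
```

Training stops at epoch 6, so the later value of 0.35 is never reached; a larger `patience` trades extra training time for a chance to find such late improvements.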

#### Activation Functions

In most cases you can use the ReLU activation function in the hidden layers (or one of its variants). It is a bit faster to compute than other activation functions, and Gradient Descent does not get stuck as much on plateaus, thanks to the fact that it does not saturate for large input values (as opposed to the logistic function or the hyperbolic tangent function, which saturate at 1).

For the output layer, the softmax activation function is generally a good choice for classification tasks (when the classes are mutually exclusive). For regression tasks, you can simply use no activation function at all.