## X. Introduction to Artificial Neural Networks

Pages 253 to 263 of 'Hands-on-ML' give an excellent introduction to Artificial Neural Networks (ANNs). Here, I will only present the implementation of an ANN using TensorFlow. TensorFlow's high-level API *TF.Learn* makes it very easy to train a deep neural network (DNN). In fact, the following code trains a DNN for classification with two hidden layers, one with 300 neurons and one with 100 neurons, followed by a softmax output layer with 10 neurons (= 10 classes). Note that the Iris target used below is binary (Setosa or not), so only two of the ten output classes ever occur in the data.

### X.1 Training an MLP with TensorFlow's High-Level API

In [9]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris() 
X = iris.data[:, (2, 3)]  # petal length, petal width 
y = (iris.target == 0).astype(int)  # Iris Setosa? (np.int was removed from recent NumPy releases)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

In [10]:
import tensorflow as tf

feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300,100],
                                         n_classes=10, 
                                         feature_columns=feature_columns)
dnn_clf.fit(x=X_train, y=y_train, batch_size=50, steps=40000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fcdec24ab70>, '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_device_fn': None, '_tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1.0
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': None, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/tmp/tmpm4ddjuee'}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Savi

INFO:tensorflow:global_step/sec: 421.756
INFO:tensorflow:loss = 6.083028e-05, step = 6801 (0.239 sec)
INFO:tensorflow:global_step/sec: 433.376
INFO:tensorflow:loss = 6.901702e-05, step = 6901 (0.229 sec)
INFO:tensorflow:global_step/sec: 386.096
INFO:tensorflow:loss = 4.332005e-05, step = 7001 (0.260 sec)
INFO:tensorflow:global_step/sec: 449.935
INFO:tensorflow:loss = 8.935858e-05, step = 7101 (0.220 sec)
INFO:tensorflow:global_step/sec: 495.183
INFO:tensorflow:loss = 7.238131e-05, step = 7201 (0.202 sec)
INFO:tensorflow:global_step/sec: 426.195
INFO:tensorflow:loss = 3.9033064e-05, step = 7301 (0.234 sec)
INFO:tensorflow:global_step/sec: 605.118
INFO:tensorflow:loss = 7.784444e-05, step = 7401 (0.166 sec)
INFO:tensorflow:global_step/sec: 452.214
INFO:tensorflow:loss = 4.179583e-05, step = 7501 (0.221 sec)
INFO:tensorflow:global_step/sec: 468.644
INFO:tensorflow:loss = 4.0688003e-05, step = 7601 (0.213 sec)
INFO:tensorflow:global_step/sec: 482.256
INFO:tensorflow:loss = 4.858956e-05, st

INFO:tensorflow:global_step/sec: 436.712
INFO:tensorflow:loss = 1.1943605e-05, step = 14501 (0.228 sec)
INFO:tensorflow:global_step/sec: 442.327
INFO:tensorflow:loss = 3.278858e-05, step = 14601 (0.226 sec)
INFO:tensorflow:global_step/sec: 428.112
INFO:tensorflow:loss = 1.9164525e-05, step = 14701 (0.233 sec)
INFO:tensorflow:global_step/sec: 442.679
INFO:tensorflow:loss = 1.2363733e-05, step = 14801 (0.226 sec)
INFO:tensorflow:global_step/sec: 445.973
INFO:tensorflow:loss = 2.4470028e-05, step = 14901 (0.224 sec)
INFO:tensorflow:global_step/sec: 428.957
INFO:tensorflow:loss = 2.1035921e-05, step = 15001 (0.233 sec)
INFO:tensorflow:global_step/sec: 455.754
INFO:tensorflow:loss = 2.2732862e-05, step = 15101 (0.220 sec)
INFO:tensorflow:global_step/sec: 438.103
INFO:tensorflow:loss = 1.7674563e-05, step = 15201 (0.228 sec)
INFO:tensorflow:global_step/sec: 449.529
INFO:tensorflow:loss = 2.2717508e-05, step = 15301 (0.223 sec)
INFO:tensorflow:global_step/sec: 639.481
INFO:tensorflow:loss = 2

INFO:tensorflow:global_step/sec: 448.001
INFO:tensorflow:loss = 2.3319828e-05, step = 22401 (0.223 sec)
INFO:tensorflow:global_step/sec: 427.877
INFO:tensorflow:loss = 1.41868695e-05, step = 22501 (0.234 sec)
INFO:tensorflow:global_step/sec: 445.065
INFO:tensorflow:loss = 1.262058e-05, step = 22601 (0.225 sec)
INFO:tensorflow:global_step/sec: 446.309
INFO:tensorflow:loss = 1.12214075e-05, step = 22701 (0.225 sec)
INFO:tensorflow:global_step/sec: 441.881
INFO:tensorflow:loss = 1.7300543e-05, step = 22801 (0.225 sec)
INFO:tensorflow:global_step/sec: 435.1
INFO:tensorflow:loss = 2.1164531e-05, step = 22901 (0.230 sec)
INFO:tensorflow:global_step/sec: 501.797
INFO:tensorflow:loss = 1.8883127e-05, step = 23001 (0.199 sec)
INFO:tensorflow:global_step/sec: 459.144
INFO:tensorflow:loss = 2.4878787e-05, step = 23101 (0.218 sec)
INFO:tensorflow:global_step/sec: 488.18
INFO:tensorflow:loss = 1.933852e-05, step = 23201 (0.205 sec)
INFO:tensorflow:global_step/sec: 502.671
INFO:tensorflow:loss = 4.1

INFO:tensorflow:loss = 8.093653e-06, step = 30301 (0.224 sec)
INFO:tensorflow:global_step/sec: 450.616
INFO:tensorflow:loss = 1.4451791e-05, step = 30401 (0.221 sec)
INFO:tensorflow:global_step/sec: 489.727
INFO:tensorflow:loss = 8.0031705e-06, step = 30501 (0.204 sec)
INFO:tensorflow:global_step/sec: 480.538
INFO:tensorflow:loss = 1.0871064e-05, step = 30601 (0.208 sec)
INFO:tensorflow:global_step/sec: 462.654
INFO:tensorflow:loss = 5.2950304e-06, step = 30701 (0.216 sec)
INFO:tensorflow:global_step/sec: 455.243
INFO:tensorflow:loss = 6.7418932e-06, step = 30801 (0.221 sec)
INFO:tensorflow:global_step/sec: 459.625
INFO:tensorflow:loss = 8.9068535e-06, step = 30901 (0.216 sec)
INFO:tensorflow:global_step/sec: 475.395
INFO:tensorflow:loss = 1.0241767e-05, step = 31001 (0.210 sec)
INFO:tensorflow:global_step/sec: 486.962
INFO:tensorflow:loss = 1.2163376e-05, step = 31101 (0.205 sec)
INFO:tensorflow:global_step/sec: 422.393
INFO:tensorflow:loss = 1.2585232e-05, step = 31201 (0.237 sec)
IN

INFO:tensorflow:global_step/sec: 448.392
INFO:tensorflow:loss = 9.993902e-06, step = 38301 (0.223 sec)
INFO:tensorflow:global_step/sec: 442.624
INFO:tensorflow:loss = 7.30942e-06, step = 38401 (0.226 sec)
INFO:tensorflow:global_step/sec: 443.727
INFO:tensorflow:loss = 1.4053937e-05, step = 38501 (0.226 sec)
INFO:tensorflow:global_step/sec: 426.535
INFO:tensorflow:loss = 4.889791e-06, step = 38601 (0.234 sec)
INFO:tensorflow:global_step/sec: 435.889
INFO:tensorflow:loss = 8.775602e-06, step = 38701 (0.229 sec)
INFO:tensorflow:global_step/sec: 449.786
INFO:tensorflow:loss = 4.3176246e-06, step = 38801 (0.223 sec)
INFO:tensorflow:global_step/sec: 441.474
INFO:tensorflow:loss = 1.1131127e-05, step = 38901 (0.226 sec)
INFO:tensorflow:global_step/sec: 466.569
INFO:tensorflow:loss = 6.0270518e-06, step = 39001 (0.215 sec)
INFO:tensorflow:global_step/sec: 448.518
INFO:tensorflow:loss = 4.7943886e-06, step = 39101 (0.223 sec)
INFO:tensorflow:global_step/sec: 407.483
INFO:tensorflow:loss = 6.982

DNNClassifier(params={'head': <tensorflow.contrib.learn.python.learn.estimators.head._MultiClassHead object at 0x7fcdec24a7b8>, 'hidden_units': [300, 100], 'feature_columns': (_RealValuedColumn(column_name='', dimension=2, default_value=None, dtype=tf.float64, normalizer=None),), 'optimizer': None, 'activation_fn': <function relu at 0x7fce87ee69d8>, 'dropout': None, 'gradient_clip_norm': None, 'embedding_lr_multipliers': None, 'input_layer_min_slice_size': None})

Under the hood, DNNClassifier uses ReLU activation functions by default. This can be changed by setting the activation_fn hyperparameter.

In [11]:
from sklearn.metrics import accuracy_score

y_pred = list(dnn_clf.predict(X_test))
accuracy_score(y_test, y_pred)

Instructions for updating:
Please switch to predict_classes, or set `outputs` argument.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
INFO:tensorflow:Graph was finalized.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /tmp/tmpm4ddjuee/model.ckpt-40000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


1.0

TF.Learn also has its own function to evaluate models:

In [12]:
dnn_clf.evaluate(X_test, y_test) 

Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
INFO:tensorflow:Starting evaluation at 2020-05-13T20:51:40Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpm4ddjuee/model.ckpt-40000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2020-05-13-20:51:40
INFO:tensorflow:Saving dict for global step 40000: accuracy = 1.0, global_ste

{'accuracy': 1.0, 'global_step': 40000, 'loss': 1.8024381e-06}

### X.2 Training a DNN Using Plain TensorFlow

In this part, we will build the same model as before and implement Mini-batch Gradient Descent to train it on the MNIST dataset. As always with TensorFlow, we start with the construction phase, building the graph. In the second step, the execution phase, we run the graph to train the model.

**Construction Phase**

In [13]:
import tensorflow as tf

n_inputs = 28*28 # the number of features --> each pixel = one feature
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

I will create some placeholders for X and y. We do not know the batch size yet, which is why we insert 'None' for the first dimension. We do know the number of features though.

In [14]:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int64, shape=(None), name='y')

Next will be the actual structure of the neural network. Placeholder 'X' will be the input layer. During the execution phase, it will be replaced with one training batch at a time.

In [15]:
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2/np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name='weights')
        b = tf.Variable(tf.zeros([n_neurons]), name='bias')
        z = tf.matmul(X, W)+b
        if activation=='relu':
            return tf.nn.relu(z)
        else:
            return z

Let’s go through this code line by line: 
1. First we create a name scope using the name of the layer: it will contain all the computation nodes for this neuron layer. This is optional, but the graph will look much nicer in TensorBoard if its nodes are well organized. 
2. Next, we get the number of inputs by looking up the input matrix’s shape and getting the size of the second dimension (the first dimension is for instances). 
3. The next three lines create a W variable that will hold the weights matrix. It will be a 2D tensor containing all the connection weights between each input and each neuron; hence, its shape will be (n_inputs, n_neurons). It will be initialized randomly, using a truncated normal (Gaussian) distribution with a standard deviation of $2 / \sqrt{n_\text{inputs}}$. Using this specific standard deviation helps the algorithm converge much faster (we will discuss this further in Chapter 11; it is one of those small tweaks to neural networks that have had a tremendous impact on their efficiency). It is important to initialize connection weights randomly for all hidden layers to avoid any symmetries that the Gradient Descent algorithm would be unable to break. 
4. The next line creates a b variable for biases, initialized to 0 (no symmetry issue in this case), with one bias parameter per neuron. 
5. Then we create a subgraph to compute z = X · W + b. This vectorized implementation will efficiently compute the weighted sums of the inputs plus the bias term for each and every neuron in the layer, for all the instances in the batch in just one shot. 
6. Finally, if the activation parameter is set to "relu", the code returns relu(z) (i.e., max (0, z)), or else it just returns z.<br>
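
The six steps above can also be mirrored outside TensorFlow. The following is a minimal NumPy sketch, not the notebook's TensorFlow code: the names `truncated_normal_np` and `neuron_layer_np` are hypothetical, and the truncation is imitated by redrawing any value that lies more than two standard deviations from the mean, which is how TensorFlow's `truncated_normal` is documented to behave.

```python
import numpy as np

def truncated_normal_np(shape, stddev, rng):
    """Draw N(0, stddev^2) samples and redraw any value beyond 2 standard deviations."""
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, stddev, size=int(out_of_range.sum()))
        out_of_range = np.abs(values) > 2 * stddev
    return values

def neuron_layer_np(X, n_neurons, activation=None, rng=None):
    """Forward pass of one fully connected layer: z = X . W + b, optionally ReLU."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_inputs = X.shape[1]                  # second dimension = number of features
    stddev = 2 / np.sqrt(n_inputs)         # same standard deviation as in the cell above
    W = truncated_normal_np((n_inputs, n_neurons), stddev, rng)
    b = np.zeros(n_neurons)                # biases start at 0, no symmetry issue
    z = X @ W + b                          # weighted sums for the whole batch at once
    if activation == 'relu':
        return np.maximum(0, z)
    return z

# Hypothetical batch of 5 instances with 28*28 features each
X_demo = np.random.default_rng(42).random((5, 28 * 28))
hidden_demo = neuron_layer_np(X_demo, 300, activation='relu')
print(hidden_demo.shape)
```

The sketch only covers the forward computation; it does not create trainable variables or name scopes as the TensorFlow version does.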

Now we can create the actual DNN. The first hidden layer takes X as input, the second hidden layer takes the output of the first hidden layer as input, and, finally, the output layer takes the output of the second hidden layer as input.

In [17]:
with tf.name_scope('dnn'):
    hidden1 = neuron_layer(X, n_hidden1, 'hidden1', activation='relu')
    hidden2 = neuron_layer(hidden1, n_hidden2, 'hidden2', activation='relu')
    logits = neuron_layer(hidden2, n_outputs, 'outputs')

TensorFlow already provides a function for this, so we do not need to write our own neuron layers: 'fully_connected()' creates a fully connected layer. It also takes care of creating the weights and biases with the proper initialization strategy, and it uses ReLU by default, which is why the output layer below sets activation_fn=None.

In [18]:
from tensorflow.contrib.layers import fully_connected

with tf.name_scope('dnn'):
    hidden1 = fully_connected(X, n_hidden1, scope='hidden1')
    hidden2 = fully_connected(hidden1, n_hidden2, scope='hidden2')
    logits = fully_connected(hidden2, n_outputs, scope='outputs',
                             activation_fn=None)

Next, we need to choose a cost function to train our model on. For softmax, cross entropy is a good choice. Cross entropy penalizes models that estimate a low probability for the target class. TensorFlow's 'sparse_softmax_cross_entropy_with_logits()' function computes the cross entropy based on the logits (the output *before* going through the softmax activation function). This results in a 1D tensor containing the cross entropy for each instance. We can then use 'reduce_mean()' to compute the mean cross entropy over all instances.

In [19]:
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                             logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
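
To make the loss computation concrete, here is a small NumPy sketch of what the cell above computes conceptually: softmax over the logits, then the negative log-probability of the target class for each instance, then the mean. The function name `sparse_softmax_cross_entropy_np` and the demo values are hypothetical, and this is an illustration rather than TensorFlow's actual implementation.

```python
import numpy as np

def sparse_softmax_cross_entropy_np(logits, labels):
    """Per-instance cross entropy from raw logits and integer class labels."""
    # Subtract the row-wise maximum for numerical stability (does not change softmax)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the target class for each instance
    return -log_probs[np.arange(len(labels)), labels]

logits_demo = np.array([[2.0, 0.5, -1.0],   # target class 0 has the highest logit
                        [0.1, 0.2, 3.0]])   # target class 1 has a low logit
labels_demo = np.array([0, 1])

xentropy_demo = sparse_softmax_cross_entropy_np(logits_demo, labels_demo)  # 1D, one value per instance
loss_demo = xentropy_demo.mean()                                           # mean over the batch
print(xentropy_demo, loss_demo)
```

The second instance receives the larger penalty because the model assigns a low probability to its target class.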

Next, we define the optimizer that will tweak the model parameters to minimize the cost function. Here we are using 'GradientDescentOptimizer'.

In [20]:
learning_rate = 0.01

with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)
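
As a rough illustration of what running the training operation does at each step, the sketch below applies the plain Gradient Descent update (parameter minus learning rate times gradient) to a toy quadratic loss. The loss, its hand-written gradient, and all names are assumptions made for this example; in the notebook, TensorFlow derives the gradients of the cross-entropy loss automatically.

```python
import numpy as np

learning_rate = 0.01

def loss_fn(theta):
    """Toy loss with its minimum at theta = 3 in every coordinate."""
    return np.sum((theta - 3.0) ** 2)

def grad_fn(theta):
    """Gradient of the toy loss, written by hand."""
    return 2.0 * (theta - 3.0)

theta = np.zeros(2)
initial_loss = loss_fn(theta)
for step in range(1000):
    # One Gradient Descent step: move against the gradient
    theta = theta - learning_rate * grad_fn(theta)
final_loss = loss_fn(theta)
print(initial_loss, final_loss, theta)
```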

The last important step of the construction phase is to specify how to evaluate the model. For this we can simply use accuracy as our performance measure. We first determine whether the neural network's prediction is correct by checking whether the highest logit corresponds to the target class (using the 'in_top_k()' function). This returns a 1D tensor full of booleans, which we can then cast to floats and average. The result is the network's overall accuracy.

In [21]:
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
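
The same evaluation logic can be sketched in NumPy on a few made-up logits. This uses `argmax` as a stand-in for `in_top_k(..., 1)`, which is an assumption of this sketch: the two agree when the highest logit is unique, but they may treat ties differently.

```python
import numpy as np

# Hypothetical logits for 3 instances and 3 classes
logits_demo = np.array([[2.0, 0.5, -1.0],
                        [0.1, 0.2, 3.0],
                        [1.5, 1.0, 0.0]])
labels_demo = np.array([0, 1, 0])

# True where the class with the highest logit equals the target class
correct_demo = np.argmax(logits_demo, axis=1) == labels_demo
# Cast booleans to floats and average to get the accuracy
accuracy_demo = correct_demo.astype(np.float32).mean()
print(correct_demo, accuracy_demo)
```

Here the first and third instances are classified correctly and the second is not, so the accuracy is two thirds.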

We also need a node to initialize all variables, and we will need a Saver to save the trained model to disk:

In [22]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

**Execution Phase**

In [23]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('/tmp/data/')

Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
Instructions for updating:
Please write your own downloading logic.
Instructions for updating:
Please use urllib or similar directly.
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.


Next, we define the number of epochs and the size of the mini-batches and train the model:

In [24]:
n_epochs = 400
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                           y: mnist.test.labels})
        print(epoch, 'Train accuracy:', acc_train, 'Test accuracy:', acc_test)
    save_path = saver.save(sess, './test_tensorflow_model.ckpt')

0 Train accuracy: 0.84 Test accuracy: 0.905
1 Train accuracy: 0.88 Test accuracy: 0.9221
2 Train accuracy: 0.86 Test accuracy: 0.9305
3 Train accuracy: 0.96 Test accuracy: 0.9389
4 Train accuracy: 0.94 Test accuracy: 0.9418
5 Train accuracy: 0.98 Test accuracy: 0.9455
6 Train accuracy: 0.96 Test accuracy: 0.9506
7 Train accuracy: 0.96 Test accuracy: 0.9532
8 Train accuracy: 0.96 Test accuracy: 0.9558
9 Train accuracy: 0.94 Test accuracy: 0.9571
10 Train accuracy: 0.96 Test accuracy: 0.9588
11 Train accuracy: 0.94 Test accuracy: 0.9608
12 Train accuracy: 0.96 Test accuracy: 0.9624
13 Train accuracy: 1.0 Test accuracy: 0.964
14 Train accuracy: 1.0 Test accuracy: 0.964
15 Train accuracy: 1.0 Test accuracy: 0.9661
16 Train accuracy: 0.96 Test accuracy: 0.9666
17 Train accuracy: 0.96 Test accuracy: 0.9672
18 Train accuracy: 0.96 Test accuracy: 0.9684
19 Train accuracy: 0.96 Test accuracy: 0.9705
20 Train accuracy: 0.96 Test accuracy: 0.9716
21 Train accuracy: 1.0 Test accuracy: 0.9712
22 Tr

181 Train accuracy: 1.0 Test accuracy: 0.9794
182 Train accuracy: 1.0 Test accuracy: 0.9794
183 Train accuracy: 1.0 Test accuracy: 0.9794
184 Train accuracy: 1.0 Test accuracy: 0.9793
185 Train accuracy: 1.0 Test accuracy: 0.9795
186 Train accuracy: 1.0 Test accuracy: 0.9793
187 Train accuracy: 1.0 Test accuracy: 0.9792
188 Train accuracy: 1.0 Test accuracy: 0.9793
189 Train accuracy: 1.0 Test accuracy: 0.9791
190 Train accuracy: 1.0 Test accuracy: 0.9792
191 Train accuracy: 1.0 Test accuracy: 0.9792
192 Train accuracy: 1.0 Test accuracy: 0.9796
193 Train accuracy: 1.0 Test accuracy: 0.9791
194 Train accuracy: 1.0 Test accuracy: 0.9796
195 Train accuracy: 1.0 Test accuracy: 0.9794
196 Train accuracy: 1.0 Test accuracy: 0.9788
197 Train accuracy: 1.0 Test accuracy: 0.9793
198 Train accuracy: 1.0 Test accuracy: 0.9793
199 Train accuracy: 1.0 Test accuracy: 0.9791
200 Train accuracy: 1.0 Test accuracy: 0.9793
201 Train accuracy: 1.0 Test accuracy: 0.9794
202 Train accuracy: 1.0 Test accur

360 Train accuracy: 1.0 Test accuracy: 0.9791
361 Train accuracy: 1.0 Test accuracy: 0.9792
362 Train accuracy: 1.0 Test accuracy: 0.9793
363 Train accuracy: 1.0 Test accuracy: 0.9792
364 Train accuracy: 1.0 Test accuracy: 0.9791
365 Train accuracy: 1.0 Test accuracy: 0.9792
366 Train accuracy: 1.0 Test accuracy: 0.979
367 Train accuracy: 1.0 Test accuracy: 0.9794
368 Train accuracy: 1.0 Test accuracy: 0.9792
369 Train accuracy: 1.0 Test accuracy: 0.9793
370 Train accuracy: 1.0 Test accuracy: 0.9791
371 Train accuracy: 1.0 Test accuracy: 0.9793
372 Train accuracy: 1.0 Test accuracy: 0.9794
373 Train accuracy: 1.0 Test accuracy: 0.9791
374 Train accuracy: 1.0 Test accuracy: 0.9793
375 Train accuracy: 1.0 Test accuracy: 0.979
376 Train accuracy: 1.0 Test accuracy: 0.9792
377 Train accuracy: 1.0 Test accuracy: 0.9794
378 Train accuracy: 1.0 Test accuracy: 0.9792
379 Train accuracy: 1.0 Test accuracy: 0.9793
380 Train accuracy: 1.0 Test accuracy: 0.9792
381 Train accuracy: 1.0 Test accurac

To use the saved neural network for new data, change the execution phase as follows:

In [None]:
with tf.Session() as sess:
    saver.restore(sess, './test_tensorflow_model.ckpt')
    X_new_scaled = [...] # Some new images (scaled from 0 to 1)
    Z = logits.eval(feed_dict={X: X_new_scaled})  # keyword is feed_dict, not feed_dicts
    y_pred = np.argmax(Z, axis=1)

### X.3 Fine-Tuning Neural Network Hyperparameters

A neural network has many hyperparameters to tune. To name just a few obvious ones: the number of layers, the number of neurons per layer, the type of activation function in each layer, the weight initialization logic, etc. Using grid search to try all combinations would be too costly. Using a **randomized search** or a tool such as [Oscar](http://oscar.calldesk.ai/) would be much better.
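
A randomized search can be sketched in a few lines: instead of enumerating every combination, draw a fixed number of random configurations from the search space. The search space, its values, and the function name `sample_configurations` below are hypothetical; each sampled configuration would still have to be trained and validated to pick the best one.

```python
import random

# Hypothetical search space: 4 * 4 * 2 * 4 = 128 combinations in a full grid
search_space = {
    'n_hidden_layers': [1, 2, 3, 4],
    'n_neurons': [50, 100, 200, 300],
    'activation': ['relu', 'tanh'],
    'learning_rate': [0.001, 0.003, 0.01, 0.03],
}

def sample_configurations(space, n_samples, seed=42):
    """Draw n_samples random configurations instead of trying the full grid."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_samples)]

candidates = sample_configurations(search_space, n_samples=10)
for config in candidates:
    print(config)
```

With 10 samples, only a small fraction of the 128 grid points is evaluated, which is the cost saving that makes randomized search attractive.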

#### X.3.1 Number of Hidden Layers

Start with one or two hidden layers and work your way up if needed. Remember, nets with more layers need fewer neurons per layer to model the same complexity and are therefore more efficient to train. Also, more layers usually help neural nets generalize better. Add layers until the model starts overfitting the data, then stop.

#### X.3.2 Number of Neurons per Hidden Layer

Obviously, the number of neurons for the input and output layers is determined by the number of features (input) and classes/categories (output). As for the hidden layers, it makes sense, from a logical perspective, to use fewer and fewer neurons per layer, since the first layers evaluate low-level structures (like lines in images), the intermediate layers combine these to evaluate intermediate-level structures (like squares etc.), and the highest hidden layers combine the intermediate levels to model high-level structures (like faces etc.). However, you could also use the same number of neurons for all layers. It would just take a bit longer to train.<br>

One way to start is to gradually increase the number of neurons until the network starts overfitting. Alternatively, you could use a very large number of neurons, more than you need, and implement an early-stopping policy to prevent it from overfitting (the *stretch pants approach*).
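
An early-stopping policy can be sketched as follows: track the best validation loss seen so far and stop once it has not improved for a given number of epochs. The function name, the `patience` parameter, and the list of validation losses are made up for illustration; in a real training loop the losses would come from evaluating the model after each epoch, and the model would be saved whenever a new best is found.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return (best_epoch, best_loss, stopped_at) for a sequence of validation losses."""
    best_loss = float('inf')
    best_epoch = -1
    stopped_at = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        stopped_at = epoch
        if loss < best_loss:
            # New best model: this is where a checkpoint would be saved
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, best_loss, stopped_at

# Hypothetical validation losses: improving until epoch 3, then overfitting
losses_demo = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.50, 0.55]
best_epoch, best_loss, stopped_at = early_stopping_epoch(losses_demo)
print(best_epoch, best_loss, stopped_at)
```

In this example the best loss is reached at epoch 3 and training stops at epoch 6, after three epochs without improvement.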

#### X.3.3 Activation Functions

For hidden layers, ReLU is usually a good choice: it is fast to compute and does not saturate for large input values, so Gradient Descent does not get stuck on plateaus as easily as with the logistic function or the hyperbolic tangent function.<br>

For classification tasks, use a softmax function in the output layer. For regression tasks, use no activation function at all.<br>
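
The saturation argument can be checked numerically. The sketch below, with input values chosen only for illustration, compares the derivatives of the logistic function, the hyperbolic tangent, and ReLU, and then applies a softmax to one row of made-up logits.

```python
import numpy as np

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])

# Derivatives of each activation function with respect to z
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))   # nearly 0 for large |z|
tanh_grad = 1.0 - np.tanh(z) ** 2                # nearly 0 for large |z|
relu_grad = (z > 0).astype(float)                # exactly 1 for any positive z

def softmax(logits):
    """Turn each row of logits into class probabilities that sum to 1."""
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

probs_demo = softmax(np.array([[2.0, 0.5, -1.0]]))
print(sigmoid_grad, tanh_grad, relu_grad, probs_demo)
```

For z = 10 the logistic and tanh gradients are almost zero, while the ReLU gradient stays at 1, which is why ReLU does not plateau for large positive inputs.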