# Chapter 11: Training Deep Neural Nets - Part 3

In [2]:
#set up libs and data from first notebook!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

mnist = input_data.read_data_sets("/tmp/data/")

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


# Tweaking, Dropping, or Replacing the Upper Layers
In general, the upper hidden layers of the original model that you are reusing are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse.

- Try freezing all the copied layers first, then train your model and see how it performs. 
- Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. 

The more training data you have, the more layers you can unfreeze. If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freeze all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even add more hidden layers.

# Model Zoos
#### Where can you find a neural network trained for a task similar to the one you want to tackle? 
The first place to look is obviously in your own catalog of models. This is one good reason to save all your models and organize them so you can retrieve them later easily. Another option is to search in a model zoo. Many people train Machine Learning models for various tasks and kindly release their pretrained models to the public.

#### TensorFlow has its own model zoo available at https://github.com/tensorflow/models. 
In particular, it contains most of the state-of-the-art image classification nets such as VGG, Inception, and ResNet (see Chapter 13, and check out the models/slim directory), including the code, the pretrained models, and tools to download popular image datasets. Another popular model zoo is Caffe’s Model Zoo. It also contains many computer vision models
(e.g., LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet, Inception) trained on various datasets (e.g., ImageNet, Places Database, CIFAR10, etc.). Saumitro Dasgupta wrote a converter from Caffe models to TensorFlow, which is available at https://github.com/ethereon/caffe-tensorflow.

# Unsupervised Pretraining
Suppose you want to tackle a complex task for which you don’t have much labeled training data, but unfortunately you cannot find a model trained on a similar task. Don’t lose all hope! 
- First, you should of course try to gather more labeled training data, but if this is too hard or too expensive, you may still be able to perform unsupervised pretraining (see Figure 11-5). 
- That is, if you have plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines (RBMs; see Appendix E) or autoencoders (see Chapter 15). 
- Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can fine-tune the network using supervised learning (i.e., with backpropagation).

This is a rather long and tedious process, but it often works well; in fact, it is this technique that Geoffrey Hinton and his team used in 2006 and which led to the revival of neural networks and the success of Deep Learning. Until 2010, unsupervised pretraining (typically using RBMs) was the norm for deep nets, and it was only after the vanishing gradients problem was alleviated that it became much more common to train DNNs purely using backpropagation. However, unsupervised pretraining (today typically using autoencoders rather than RBMs) is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.
![](pictures/homl_ch11_unsup.jpg)

# Pretraining on an Auxiliary Task
One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.

For example, if you want to build a system to recognize faces, you may only have a few pictures of each individual — clearly not enough to train a good classifier. Gathering hundreds of pictures of each person would not be practical. **However, you could gather a lot of pictures of random people on the internet and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier using little training data.**

It is often rather cheap to gather unlabeled training examples, but quite expensive to label them. In this situation, a common technique is to label all your training examples as “good,” then generate many new training instances by corrupting the good ones, and label these corrupted instances as “bad.” Then you can train a first neural network to classify instances as good or bad. For example, you could download millions of sentences, label them as “good,” then randomly change a word in each sentence and label the resulting sentences as “bad.” If a neural network can tell that “The dog sleeps” is a good
sentence but “The dog they” is bad, it probably knows quite a lot about language. 

Reusing its lower layers will likely help in many language processing tasks. **Another approach is to train a first network to output a score for each training instance, and use a cost function that ensures that a good instance’s score is greater than a bad instance’s score by at least some margin. This is called max margin learning.**
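The two ideas above (corrupting "good" instances to create "bad" ones, and scoring with a margin) can be sketched in plain Python. This is only an illustration: the `corrupt` and `max_margin_loss` helpers, the tiny vocabulary, the sample sentences, and the margin of 1.0 are all assumptions made up for the example, not something taken from the book or from TensorFlow.

```python
import random

def corrupt(sentence, vocabulary, rng):
    """Return a copy of `sentence` with one randomly chosen word replaced."""
    words = sentence.split()
    position = rng.randrange(len(words))
    # Pick a replacement that differs from the word being replaced
    candidates = [w for w in vocabulary if w != words[position]]
    words[position] = rng.choice(candidates)
    return " ".join(words)

def max_margin_loss(good_score, bad_score, margin=1.0):
    """Zero once the good score beats the bad score by at least `margin`."""
    return max(0.0, margin - (good_score - bad_score))

rng = random.Random(42)
vocabulary = ["the", "dog", "sleeps", "they", "cat", "runs"]   # hypothetical
good_sentences = ["the dog sleeps", "the cat runs"]            # hypothetical

# Label 1 = "good" (original), label 0 = "bad" (corrupted)
dataset = [(s, 1) for s in good_sentences]
dataset += [(corrupt(s, vocabulary, rng), 0) for s in good_sentences]
```

With this loss, a pair scored 2.0 (good) and 0.5 (bad) costs nothing, while a pair scored 0.5 and 0.2 is still penalized because the gap is smaller than the margin.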

# Faster Optimizers
Training a very large deep neural network can be painfully slow. **So far we have seen four ways to speed up training (and reach a better solution):**
- applying a good initialization strategy for the connection weights,
- using a good activation function,
- using Batch Normalization,
- and reusing parts of a pretrained network.

### Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. 

**In this section we will present the most popular ones:**
- Momentum optimization, 
- Nesterov Accelerated Gradient,
- AdaGrad, 
- RMSProp,
- **Adam optimization.**

# SPOILER ALERT LOLZ: the conclusion of this section is that you should almost ALWAYS USE Adam optimization...
...so if you don’t care about how it works, simply replace your GradientDescentOptimizer with an AdamOptimizer and skip to the next section! 

With just this small change, training will typically be several times faster. However, **Adam optimization does have three hyperparameters that you can tune (plus the learning rate);** the default values usually work fine, but if you ever need to tweak them it may be helpful to know what they do. Adam optimization combines several ideas from other optimization algorithms, so it is useful to look at these algorithms first.

# Momentum Optimization
Recall that Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regard to the weights, $\nabla_\theta J(\theta)$, multiplied by the learning rate η. **The equation is: $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$. It does not care about what the earlier gradients were.** If the local gradient is tiny, it goes very slowly.

Momentum optimization cares a great deal about what previous gradients were: 
- at each iteration, it adds the local gradient to the momentum vector m (multiplied by the learning rate η), 
- and it updates the weights by simply subtracting this momentum vector. 

In other words, the gradient is used as an acceleration, not as a speed. To simulate some sort of friction mechanism and
prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

![](pictures/homl_ch11_momenopt.jpg)

**You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η multiplied by 1/(1-β).**

For example, if β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, **so Momentum optimization ends up going 10 times faster than Gradient Descent!** This allows Momentum optimization to escape from plateaus much faster than Gradient Descent. In particular, when the inputs have very different scales the cost function will look like an elongated bowl. Gradient Descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, Momentum optimization will roll down the bottom of the valley faster and faster until it reaches the bottom (the optimum). 
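The terminal-velocity claim can be checked numerically. The sketch below assumes the update described above, m ← βm + η∇, with a gradient that never changes; the `terminal_velocity` helper and the gradient value of 2.0 are made up for the example.

```python
def terminal_velocity(gradient, learning_rate=0.01, beta=0.9, n_steps=500):
    """Iterate m <- beta * m + learning_rate * gradient with a constant gradient."""
    m = 0.0
    for _ in range(n_steps):
        m = beta * m + learning_rate * gradient
    return m

gradient = 2.0                                 # hypothetical constant gradient
m_final = terminal_velocity(gradient)
predicted = 0.01 * gradient / (1 - 0.9)        # learning_rate * gradient / (1 - beta)
```

After enough iterations `m_final` agrees with `predicted`, which is 10 times the plain Gradient Descent step `learning_rate * gradient`.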

**In deep neural networks that DON'T USE BATCH NORMALIZATION, the upper layers will often end up having inputs with very different scales, so using Momentum optimization helps a lot. It can also help roll past local optima.**

**NOTE:** Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons why it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

In [5]:
#Momentum optimization
learning_rate = 0.01
momentum = 0.9

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=momentum)

# Nesterov Accelerated Gradient
One small variant to Momentum optimization, proposed by Yurii Nesterov in 1983, is almost always faster than vanilla Momentum optimization. The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ.

![](pictures/homl_ch11_nestopt.jpg)

**This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than using the gradient at the original position**, as you can see in Figure 11-6 (where ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm). 

As you can see, the Nesterov update ends up slightly closer to the optimum. After a while, these small improvements add up and NAG ends up being significantly faster than regular Momentum optimization. Moreover, note that when the momentum pushes the weights across a valley, ∇1 continues to push further across the valley, while ∇2 pushes back toward the bottom of the valley. This helps reduce oscillations and thus converges faster.

![](pictures/homl_ch11_nestopt2.jpg)
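To make the difference concrete, here is a small sketch that runs both variants on an elongated bowl. It is written under an assumed sign convention, m ← βm − η∇J(point) followed by θ ← θ + m, so that the look-ahead point is θ + βm as in the text; the equations in the figure may use a different but equivalent convention. The cost function, starting point, and step count are arbitrary choices for the example.

```python
def gradient(theta):
    """Gradient of the elongated bowl J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)."""
    return [theta[0], 10.0 * theta[1]]

def cost(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)

def run(nesterov, n_steps=100, learning_rate=0.01, beta=0.9):
    theta = [5.0, 5.0]
    m = [0.0, 0.0]
    for _ in range(n_steps):
        if nesterov:
            # Look ahead in the direction of the momentum before measuring the gradient
            point = [t + beta * mi for t, mi in zip(theta, m)]
        else:
            point = theta
        g = gradient(point)
        # m accumulates the negative gradient, so theta + m moves downhill
        m = [beta * mi - learning_rate * gi for mi, gi in zip(m, g)]
        theta = [t + mi for t, mi in zip(theta, m)]
    return cost(theta)

initial_cost = cost([5.0, 5.0])
momentum_cost = run(nesterov=False)
nesterov_cost = run(nesterov=True)
```

Printing `momentum_cost` and `nesterov_cost` for a few values of `n_steps` is an easy way to compare how quickly each variant approaches the optimum on this toy problem.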

In [9]:
#Nesterov Accelerated Gradient
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9, use_nesterov=True)

# AdaGrad
Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, then slowly goes down the bottom of the valley. **It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum. The AdaGrad algorithm achieves this by scaling down the gradient vector along the steepest dimensions:**

![](pictures/homl_ch11_adag.jpg)
- The first step accumulates the square of the gradients into the vector s (the ⊗ symbol represents the element-wise multiplication). This vectorized form is equivalent to computing $s_i \leftarrow s_i + \left(\partial J(\theta) / \partial \theta_i\right)^2$ for each element $s_i$ of the vector s; in other words, each $s_i$ accumulates the squares of the partial derivative of the cost function with regard to parameter $\theta_i$. If the cost function is steep along the i-th dimension, then $s_i$ will get larger and larger at each iteration.
- The second step is almost identical to Gradient Descent, but with one big difference: **the gradient vector is scaled down by a factor of $\sqrt{s + \epsilon}$ (the ⊘ symbol represents the element-wise division, and ϵ is a smoothing term to avoid division by zero, typically set to $10^{-10}$).** This vectorized form is equivalent to computing $\theta_i \leftarrow \theta_i - \eta \, \dfrac{\partial J(\theta) / \partial \theta_i}{\sqrt{s_i + \epsilon}}$ for all parameters $\theta_i$ (simultaneously).

#### In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum. One additional benefit is that it requires much less tuning of the learning rate hyperparameter η.
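The two AdaGrad steps can be sketched with plain Python lists. The `adagrad_step` helper, the learning rate of 0.1, and the example gradient are assumptions chosen for illustration; this is not how TensorFlow's AdagradOptimizer is implemented internally.

```python
import math

def adagrad_step(theta, grad, s, learning_rate=0.1, epsilon=1e-10):
    """One AdaGrad update; returns the new (theta, s)."""
    # Step 1: accumulate the squared gradients, element by element
    s = [si + gi * gi for si, gi in zip(s, grad)]
    # Step 2: Gradient Descent, with each component divided by sqrt(s_i + epsilon)
    theta = [ti - learning_rate * gi / math.sqrt(si + epsilon)
             for ti, gi, si in zip(theta, grad, s)]
    return theta, s

theta, s = [1.0, 1.0], [0.0, 0.0]
grad = [10.0, 0.1]     # steep along dimension 0, gentle along dimension 1
theta, s = adagrad_step(theta, grad, s)
```

On this first step both coordinates move by roughly the same amount (about 0.1), even though the raw gradients differ by a factor of 100: the steep dimension has been scaled down much more than the gentle one.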

![](pictures/homl_ch11_adag2.jpg)

AdaGrad often performs well for simple quadratic problems, **but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though TensorFlow has an AdagradOptimizer, YOU SHOULD NOT USE IT TO TRAIN DEEP NEURAL NETWORKS** (it may be efficient for simpler tasks such as Linear Regression, though).


In [10]:
#AdaGrad
optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)

# RMSProp
Although **AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the
RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations (as
opposed to all the gradients since the beginning of training).** 
- It does so by using exponential decay in the first step.

![](pictures/homl_ch11_rmsopt.jpg)

The decay rate β is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.
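A minimal sketch of the RMSProp update, assuming the usual formulation in which the first step becomes s ← βs + (1 − β)∇⊗∇ and the second step is the same scaled update as in AdaGrad. The `rmsprop_step` helper and the constant gradient of 2.0 are made up for the example.

```python
import math

def rmsprop_step(theta, grad, s, learning_rate=0.01, beta=0.9, epsilon=1e-10):
    """One RMSProp update; returns the new (theta, s)."""
    # Step 1: exponentially decaying average of the squared gradients
    s = [beta * si + (1 - beta) * gi * gi for si, gi in zip(s, grad)]
    # Step 2: scale each gradient component by sqrt(s_i + epsilon)
    theta = [ti - learning_rate * gi / math.sqrt(si + epsilon)
             for ti, gi, si in zip(theta, grad, s)]
    return theta, s

theta, s = [0.0], [0.0]
for _ in range(1000):
    theta, s = rmsprop_step(theta, [2.0], s)   # constant gradient of 2.0
```

With a constant gradient, `s` settles near the squared gradient (4.0 here) instead of growing with every iteration as AdaGrad's running sum would, so the effective learning rate does not keep shrinking.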

#### Except on very simple problems, this optimizer almost always performs much better than AdaGrad. It also generally performs better than Momentum optimization and Nesterov Accelerated Gradients. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.

In [8]:
#RMSProp
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)

# Adam Optimization
Adam, which stands for **adaptive moment estimation**, combines the ideas of **Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients.**
![](pictures/homl_ch11_adam.jpg)

- t represents the iteration number (starting at 1).

If you just look at steps 1, 2, and 5, you will notice Adam’s close similarity to both Momentum optimization and RMSProp. **The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just 1 – β1 times the decaying sum).** 

Steps 3 and 4 are somewhat of a technical detail: 
- since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost m and s at the beginning of training.
- The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing term ϵ is usually initialized to a tiny number such as $10^{-8}$.

These are the default values for TensorFlow’s AdamOptimizer class, so you can simply use:
    
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

#### In fact, since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.
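The five steps discussed above can be sketched as follows. This assumes the standard formulation of Adam (decaying averages of the gradient and of its square, bias correction, then a scaled update) with ϵ placed inside the square root as in the AdaGrad description; the `adam_step` helper and the example gradient are made up for illustration and are not TensorFlow's implementation.

```python
import math

def adam_step(theta, grad, m, s, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update at iteration t (starting at 1); returns (theta, m, s)."""
    # Step 1: decaying average of past gradients (the Momentum part)
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # Step 2: decaying average of past squared gradients (the RMSProp part)
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, grad)]
    # Steps 3 and 4: correct the bias toward 0 caused by initializing m and s at 0
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    s_hat = [si / (1 - beta2 ** t) for si in s]
    # Step 5: update the parameters
    theta = [ti - learning_rate * mh / math.sqrt(sh + epsilon)
             for ti, mh, sh in zip(theta, m_hat, s_hat)]
    return theta, m, s

theta, m, s = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
theta, m, s = adam_step(theta, [3.0, -0.5], m, s, t=1)
```

Thanks to the bias correction, the very first update has a magnitude of about `learning_rate` in every dimension, whatever the scale of the gradient: here the parameters move to roughly 0.999 and 1.001.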

#### NOTE: 
All the optimization techniques discussed so far only rely on the first-order partial derivatives (Jacobians). The optimization literature contains amazing algorithms based on the second-order partial derivatives (the Hessians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are $n^2$ Hessians per output
(where n is the number of parameters), as opposed to just n Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimization algorithms often don’t even fit in memory, and even when they do, computing the Hessians is just too slow.

In [None]:
#Adam Optimization
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

# TRAINING SPARSE MODELS

**ALL of the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.**

- One trivial way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to 0).
- Another option is to **apply strong ℓ1 regularization during training**, as it pushes the optimizer to ZERO OUT as many weights as it can.
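The first option (zeroing out tiny weights after training) is simple enough to sketch directly. The `prune_small_weights` helper, the threshold of 1e-3, and the weight values are hypothetical; in practice the threshold would be chosen by checking how much accuracy is lost.

```python
def prune_small_weights(weights, threshold=1e-3):
    """Set weights whose magnitude is below `threshold` to exactly 0."""
    return [0.0 if abs(w) < threshold else w for w in weights]

weights = [0.5, -0.0004, 0.0002, -1.2, 0.0009]   # hypothetical trained weights
sparse_weights = prune_small_weights(weights)
# Fraction of weights that are now exactly zero
sparsity = sparse_weights.count(0.0) / len(sparse_weights)
```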

#### However, in some cases these techniques may remain insufficient. One last option is to apply Dual Averaging, often called Follow The Regularized Leader (FTRL), a technique proposed by Yurii Nesterov. When used with ℓ1 regularization, this technique often leads to very sparse models. TensorFlow implements a variant of FTRL called FTRL-Proximal in the FtrlOptimizer class.

# Learning Rate Scheduling
Finding a good learning rate can be tricky: 
- If you set it way too high, training may actually diverge
- If you set it too low, training will eventually converge to the optimum, but it will take a very long time. 
- If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never settling down 
    - (unless you use an adaptive learning rate optimization algorithm such as AdaGrad, RMSProp, or Adam, but even then it may take time to settle). **If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution.**
    
![](pictures/homl_ch11_learningrate.jpg)

**You may be able to find a fairly good learning rate by training your network several times during just a few epochs using various learning rates and comparing the learning curves.** The ideal learning rate will learn quickly and converge to a good solution.

### However, you can do better than a constant learning rate: if you start with a high learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. 

There are many different strategies to reduce the learning rate during training. **These strategies are called learning schedules**, the most common of which are:
#### Predetermined piecewise constant learning rate
- For example, set the learning rate to η0 = 0.1 at first, then to η1 = 0.001 after 50 epochs.
- Although this solution can work very well, it often requires fiddling around to figure out the right learning rates and when to use them.

#### Performance scheduling
- Measure the validation error every N steps (just like for early stopping) and **reduce the learning rate by a factor of λ when the error STOPS dropping.**
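A rough sketch of performance scheduling, assuming "stops dropping" means "the latest validation error is not better than the best one seen so far". The `performance_schedule` helper, the factor λ = 2, and the list of validation errors are invented for the example.

```python
def performance_schedule(validation_errors, initial_learning_rate=0.1, lam=2.0):
    """Return the learning rate in effect after each validation check."""
    learning_rate = initial_learning_rate
    best_error = float("inf")
    rates = []
    for error in validation_errors:
        if error < best_error:
            best_error = error          # still improving: keep the current rate
        else:
            learning_rate /= lam        # no improvement: reduce by a factor of lam
        rates.append(learning_rate)
    return rates

# Hypothetical validation errors, measured every N steps
rates = performance_schedule([0.30, 0.20, 0.21, 0.15, 0.15])
```

Here the rate is halved at the third check (0.21 is worse than 0.20) and again at the fifth (0.15 does not improve on 0.15).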

#### Exponential scheduling <- this is probably the best one I think?!?
- Set the learning rate to a function of the iteration number t: $\eta(t) = \eta_0 \, 10^{-t/r}$. This works great, but it requires tuning η0 and r. The learning rate will drop by a factor of 10 every r steps.

#### Power scheduling
- Set the learning rate to $\eta(t) = \eta_0 \, (1 + t/r)^{-c}$. The hyperparameter c is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.
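The three predetermined schedules can be written as small Python functions, which makes it easy to compare them at a given step. The function names and default values are assumptions for the example (the defaults reuse the numbers quoted in the text).

```python
def piecewise_constant(epoch, boundary=50, eta0=0.1, eta1=0.001):
    """eta0 for the first `boundary` epochs (0-based), eta1 afterwards."""
    return eta0 if epoch < boundary else eta1

def exponential_schedule(t, eta0=0.1, r=10000):
    """eta0 * 10**(-t / r): drops by a factor of 10 every r steps."""
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=10000, c=1):
    """eta0 * (1 + t / r)**(-c): drops much more slowly."""
    return eta0 * (1 + t / r) ** (-c)

exp_rate = exponential_schedule(10000)    # one tenth of eta0
pow_rate = power_schedule(10000)          # only half of eta0 when c = 1
```

After r = 10,000 steps the exponential schedule has already divided the learning rate by 10, while the power schedule with c = 1 has only halved it.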

A 2013 paper by Andrew Senior et al. compared the performance of some of the most popular learning schedules when training deep neural networks for speech recognition using Momentum optimization. The authors concluded that, in this setting, both performance scheduling and exponential scheduling performed well, **but they favored exponential scheduling because it is simpler to implement, is easy to tune, and converged slightly faster to the optimal solution.**

Implementing a learning schedule with TensorFlow is fairly straightforward:

    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    global_step = tf.Variable(0, trainable=False)
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                               decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)

After setting the hyperparameter values:
- we create a nontrainable variable global_step (initialized to 0) to keep track of the current training iteration number.
- Then we define an exponentially decaying learning rate (with η0 = 0.1 and r = 10,000) using TensorFlow’s exponential_decay() function.
- Next, we create an optimizer (in this example, a MomentumOptimizer) using this decaying learning rate. 
- Finally, we create the training operation by calling the optimizer’s minimize() method; since we pass it the global_step variable, it will kindly take care of incrementing it. That’s it!

Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during training, it is not necessary to add an extra learning schedule. For other optimization algorithms, using exponential decay or performance scheduling can considerably speed up convergence.

Let's give this a shot:

In [14]:
#------------------------------------------------------------#
##################### CONSTRUCTION PHASE #####################
#------------------------------------------------------------#

reset_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
    
#set up train scope with exponential_decay learning rate
with tf.name_scope("train"):       # not shown in the book
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    global_step = tf.Variable(0, trainable=False, name="global_step")
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                               decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [15]:
#------------------------------------------------------------#
####################### EXECUTION PHASE ######################
#------------------------------------------------------------#
n_epochs = 5
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    #save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.955
1 Test accuracy: 0.9696
2 Test accuracy: 0.9752
3 Test accuracy: 0.9769
4 Test accuracy: 0.979


# Avoiding Overfitting Through Regularization
Deep neural networks typically have tens of thousands of parameters, sometimes even millions. With so many parameters, the network has an incredible amount of freedom and can fit a huge variety of complex datasets. But this great flexibility also means that it is prone to overfitting the training set. With millions of parameters you can fit the whole zoo. In this section we will present some of the most popular regularization techniques for neural networks, and how to implement them with TensorFlow: 
- early stopping, 
- ℓ1 and ℓ2 regularization, 
- dropout, 
- max-norm regularization, 
- and data augmentation.

## Early Stopping
To avoid overfitting the training set, a great solution is early stopping: 
- just interrupt training when its performance on the validation set starts dropping.

One way to implement this with TensorFlow is to evaluate the model on a validation set at regular intervals (e.g., every 50 steps), and save a “winner” snapshot if it outperforms previous “winner” snapshots. Count the number of steps since the last “winner” snapshot was saved, and interrupt training when this number reaches some limit (e.g., 2,000 steps). Then restore the last “winner” snapshot.
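The bookkeeping described above can be sketched independently of TensorFlow. The `early_stopping` helper and the list of validation losses are hypothetical; in a real training loop the loss would come from evaluating the model, and the "winner" snapshot would be saved and restored with a Saver.

```python
def early_stopping(validation_losses, check_interval=50, max_steps_without_progress=2000):
    """Return (best_step, best_loss, stop_step) for validation losses
    measured every `check_interval` training steps."""
    best_loss = float("inf")
    best_step = 0
    steps_since_best = 0
    step = 0
    for loss in validation_losses:
        step += check_interval
        if loss < best_loss:
            # New "winner": this is where a snapshot would be saved
            best_loss, best_step = loss, step
            steps_since_best = 0
        else:
            steps_since_best += check_interval
            if steps_since_best >= max_steps_without_progress:
                break   # interrupt training; the last winner would be restored
    return best_step, best_loss, step

# Hypothetical losses; a small limit of 100 steps keeps the example short
result = early_stopping([0.9, 0.7, 0.6, 0.65, 0.64, 0.66],
                        check_interval=50, max_steps_without_progress=100)
```

In this run the best loss (0.6) is reached at step 150, and training is interrupted at step 250 after 100 steps without a new winner.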

##### Although early stopping works very well in practice, you can usually get much higher performance out of your network by combining it with other regularization techniques.

## ℓ1 and ℓ2 Regularization
You can use ℓ1 and ℓ2 regularization to constrain a neural network’s connection weights (but typically not its biases). One way to do this using TensorFlow is to **simply add the appropriate regularization terms to your cost function.** For example, assuming you have just one hidden layer with weights weights1 and one output layer with weights weights2, then you can apply ℓ1 regularization like this:

    [...] # construct the neural network
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(weights1)) + tf.reduce_sum(tf.abs(weights2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")

However, if there are many layers, this approach is not very convenient. Fortunately, TensorFlow provides a better option.
#####  Many functions that create variables (such as get_variable() or fully_connected()) accept a *_regularizer argument for each created variable (e.g., weights_regularizer). You can pass any function that takes weights as an argument and returns the corresponding regularization loss. The l1_regularizer(), l2_regularizer(), and l1_l2_regularizer() functions return such functions. 

The following code puts all this together:

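    from tensorflow.contrib.framework import arg_scope
    from tensorflow.contrib.layers import fully_connected

    # Apply the same l1 weights regularizer to every fully_connected() layer below
    with arg_scope(
            [fully_connected],
            weights_regularizer=tf.contrib.layers.l1_regularizer(scale=0.01)):
        hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
        hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
        logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope="out")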

This code creates a neural network with two hidden layers and one output layer, and it also creates nodes in the graph to compute the ℓ1 regularization loss corresponding to each layer’s weights. TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. You just need to add these regularization losses to your overall loss, like this:

    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    loss = tf.add_n([base_loss] + reg_losses, name="loss")

### WARNING: Don’t forget to add the regularization losses to your overall loss, or else they will simply be ignored. Remember, the larger the weights, the greater the penalty!!! So l1 forces the weights to remain as small as possible.
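The regularizer pattern (a function that takes weights and returns a penalty, produced by a factory that fixes the scale) can be mimicked in plain Python. The `make_l1_regularizer` helper, the weight values, and the base loss are invented for the example; this only illustrates the idea and is not the code behind TensorFlow's l1_regularizer().

```python
def make_l1_regularizer(scale):
    """Return a function mapping a list of weights to its l1 penalty."""
    def regularizer(weights):
        return scale * sum(abs(w) for w in weights)
    return regularizer

regularizer = make_l1_regularizer(scale=0.01)

weights1 = [0.5, -1.5, 2.0]     # hypothetical hidden layer weights
weights2 = [-1.0, 1.0]          # hypothetical output layer weights
base_loss = 0.3                 # hypothetical cross entropy

# One penalty per layer, then added to the base loss (the step that must not be forgotten)
reg_losses = [regularizer(weights1), regularizer(weights2)]
loss = base_loss + sum(reg_losses)
```

The penalties are 0.04 and 0.02, so the total loss is about 0.36; leaving out the last line would train on `base_loss` alone and ignore the regularization entirely.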


Let's implement $\ell_1$ regularization manually. First, we create the model, as usual (with just one hidden layer this time, for simplicity):

In [17]:
reset_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    logits = tf.layers.dense(hidden1, n_outputs, name="outputs")

### Next, we get a handle on the layer weights, and we compute the total loss, which is equal to the sum of the usual cross entropy loss and the $\ell_1$ loss (i.e., the sum of the absolute values of the weights, multiplied by the regularization scale).

In [18]:
#ACCESS THE WEIGHTS SO YOU CAN SUM THEIR ABSOLUTE VALUES FOR L1 REGULARIZATION
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")

scale = 0.001 # l1 regularization hyperparameter

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")

In [19]:
#then just the rest as usual!
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 5
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    #save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.8357
1 Test accuracy: 0.8713
2 Test accuracy: 0.8819
3 Test accuracy: 0.8911
4 Test accuracy: 0.8954


### Alternatively, we can pass a regularization function to the tf.layers.dense() function, which will use it to create operations that will compute the regularization loss, and it adds these operations to the collection of regularization losses. The beginning is the same as above:

In [25]:
reset_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

#### Next, we will use Python's partial() function to avoid repeating the same arguments over and over again. Note that we set the kernel_regularizer argument:

In [26]:
from functools import partial
scale = 0.001

#create partial function to avoid duplicating code.. use kernel_regularizer arg!!!
my_dense_layer = partial(tf.layers.dense, 
                         activation=tf.nn.relu, 
                         kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None, name="outputs")
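
To see what partial() is doing here without needing TensorFlow, here is a minimal sketch. `make_layer` is a hypothetical stand-in for tf.layers.dense (it only records its arguments), and the string values are placeholders:

```python
from functools import partial

# Hypothetical stand-in for tf.layers.dense: it only records its arguments.
def make_layer(inputs, units, activation="linear", kernel_regularizer=None, name=None):
    return {"inputs": inputs, "units": units, "activation": activation,
            "kernel_regularizer": kernel_regularizer, "name": name}

# Pre-fill the arguments that are the same for every layer.
my_layer = partial(make_layer, activation="relu", kernel_regularizer="l1")

hidden = my_layer("X", 300, name="hidden1")
# Keyword arguments given at call time override the pre-filled ones.
output = my_layer("hidden1", 10, activation=None, name="outputs")

print(hidden["activation"], output["activation"], output["kernel_regularizer"])
```

Note that the output layer overrides only the activation, so it still gets the pre-filled regularizer, just like the `outputs` layer in the cell above.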

### Next we must add the regularization losses to the base loss:
### Use tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES) to get all the regularization loss terms. This is WAYYYY better than accessing and summing all the weights manually as we did above... especially if our network has, like, thousands of layers!!!

In [27]:
# use tf.get_collection() to get the reg_losses val!
with tf.name_scope("loss"):                                     
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")   # not shown
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    loss = tf.add_n([base_loss] + reg_losses, name="loss")

# And the rest is the same as usual:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    #save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.8307
1 Test accuracy: 0.8757
2 Test accuracy: 0.8935
3 Test accuracy: 0.902
4 Test accuracy: 0.9068
5 Test accuracy: 0.9108
6 Test accuracy: 0.9116
7 Test accuracy: 0.9158
8 Test accuracy: 0.916
9 Test accuracy: 0.9167
10 Test accuracy: 0.9189
11 Test accuracy: 0.9161
12 Test accuracy: 0.9196
13 Test accuracy: 0.9194
14 Test accuracy: 0.9191
15 Test accuracy: 0.9194
16 Test accuracy: 0.9191
17 Test accuracy: 0.9195
18 Test accuracy: 0.9185
19 Test accuracy: 0.9186


# Dropout
#### The most popular regularization technique for deep neural networks is arguably DROPOUT. It was proposed by G. E. Hinton in 2012 and further detailed in a paper by Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks got a 1–2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).
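
The error-rate arithmetic above can be checked in a couple of lines. This is just a sketch; `relative_error_reduction` is a helper name made up for the example:

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction by which the error rate shrinks when accuracy improves."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

# Going from 95% to 97% accuracy means going from 5% to 3% error.
reduction = relative_error_reduction(0.95, 0.97)
print(f"relative error reduction: {reduction:.0%}")
```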

It is a fairly simple algorithm: 
- at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step. 
- The hyperparameter p is called the dropout rate, and it is typically set to 50%. After training, neurons don’t get dropped anymore. And that’s all (except for a technical detail we will discuss momentarily).

![](pictures/homl_ch11_dropout.jpg)


It is quite surprising at first that this rather brutal technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would obviously be forced to adapt its organization; it could not rely on any single person to fill in the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn’t make much of a difference. It’s unclear whether this idea would actually work for companies, but it certainly does for neural networks. 

##### Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there is a total of 2^N possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. **Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent since they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.**
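
To get a feel for how big 2^N is, here is a quick sketch that counts the droppable neurons of the MNIST network used in this notebook (784 inputs plus the two hidden layers; using this particular network is an assumption made for the example):

```python
# Droppable neurons in the MNIST network used in this notebook:
# inputs and hidden layers can be dropped, outputs cannot.
n_inputs, n_hidden1, n_hidden2 = 28 * 28, 300, 50
n_droppable = n_inputs + n_hidden1 + n_hidden2

n_networks = 2 ** n_droppable          # Python integers have arbitrary precision
n_digits = len(str(n_networks))

print(n_droppable, "droppable neurons ->", n_digits, "decimal digits in 2^N")
```

Even for this small network the count has several hundred digits, so sampling the same sub-network twice in 10,000 steps is not a realistic worry.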

There is one small but important technical detail:
- Suppose p = 50%, in which case during testing a neuron will be connected to twice as many input neurons as it was (on average) during training. 
- To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training. If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well. 
- More generally, we need to multiply each input connection weight by the keep probability (1 – p) after training. Alternatively, we can divide each neuron’s output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
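
The second alternative (dividing by the keep probability during training) can be sketched in NumPy as follows. This is a simplified illustration, not TensorFlow's implementation; `dropout_forward` and its arguments are names made up for the example:

```python
import numpy as np

def dropout_forward(activations, rate, training, rng):
    """Inverted dropout: drop with probability `rate`, rescale the survivors."""
    if not training:
        return activations                      # no-op at test time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob       # divide by the keep probability

rng = np.random.default_rng(42)
a = np.ones((1000, 100))

train_out = dropout_forward(a, rate=0.5, training=True, rng=rng)
test_out = dropout_forward(a, rate=0.5, training=False, rng=rng)

# Roughly half of the entries are zeroed and the rest are scaled up to 2.0,
# so the average signal stays close to what the network sees at test time.
print("fraction dropped:", np.mean(train_out == 0))
print("train mean:", train_out.mean(), "test mean:", test_out.mean())
```

Because the rescaling happens during training, nothing has to be done to the weights afterwards.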

To implement dropout using TensorFlow, you can simply apply the dropout() function to the input layer and to the output of every hidden layer. During training, this function randomly drops some items (setting them to 0) and divides the remaining items by the keep probability. After training, this function does nothing at all. The following code applies dropout regularization to our three-layer neural network:

    from tensorflow.contrib.layers import dropout, fully_connected
    is_training = tf.placeholder(tf.bool, shape=(), name='is_training')
    keep_prob = 0.5
    X_drop = dropout(X, keep_prob, is_training=is_training)
    hidden1 = fully_connected(X_drop, n_hidden1, scope="hidden1")
    hidden1_drop = dropout(hidden1, keep_prob, is_training=is_training)
    hidden2 = fully_connected(hidden1_drop, n_hidden2, scope="hidden2")
    hidden2_drop = dropout(hidden2, keep_prob, is_training=is_training)
    logits = fully_connected(hidden2_drop, n_outputs, activation_fn=None, scope="outputs")

#### WARNING:
You want to use the dropout() function in tensorflow.contrib.layers, not the one in tensorflow.nn. The first one turns off (no-op) when not training, which is what you want, while the second one does not. **Of course, just like you did earlier for Batch Normalization, you need to set is_training to True when training, and to False when testing.** 
- **OVERFITTING:** If you observe that the model is overfitting, you can increase the dropout rate (i.e., REDUCE the keep_prob hyperparameter). 
- **UNDERFITTING:** Conversely, you should try decreasing the dropout rate (i.e., increasing keep_prob) if the model underfits the training set. 

### It can also help to increase the dropout rate for large layers, and reduce it for small ones. Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.

### NOTE: Dropconnect is a variant of dropout where individual connections are dropped randomly rather than whole neurons. In general dropout performs better.

#### Let's implement dropout!!!!

In [28]:
#usual construction phase
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

#set training true/false placeholder so we can train and test data!!!
#remember, we dont want to use dropout when testing
training = tf.placeholder_with_default(False, shape=(), name='training')

#set dropout rate param
dropout_rate = 0.5  # == 1 - keep_prob
#randomly drop some of the features!!!
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu, name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={training: True, X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Test accuracy:", acc_test)

    #save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.9214
1 Test accuracy: 0.9407
2 Test accuracy: 0.9437
3 Test accuracy: 0.951
4 Test accuracy: 0.9548
5 Test accuracy: 0.9585
6 Test accuracy: 0.9583
7 Test accuracy: 0.9601
8 Test accuracy: 0.9617
9 Test accuracy: 0.9614
10 Test accuracy: 0.9644
11 Test accuracy: 0.965
12 Test accuracy: 0.9669
13 Test accuracy: 0.9657
14 Test accuracy: 0.9641
15 Test accuracy: 0.9677
16 Test accuracy: 0.9695
17 Test accuracy: 0.9681
18 Test accuracy: 0.9679
19 Test accuracy: 0.9729


# Max-Norm Regularization
#### Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥₂ ≤ r, where r is the max-norm hyperparameter and ∥·∥₂ is the ℓ2 norm.

We typically implement this constraint by computing ∥w∥₂ after each training step and clipping w if needed: ![](pictures/homl_ch11_maxnorm.jpg)

#### Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (IF YOU ARE NOT USING BATCH NORMALIZATION).

TensorFlow does not provide an off-the-shelf max-norm regularizer, but it is not too hard to implement. The following code creates a node clip_weights that will clip the weights variable along the second axis so that each row vector has a maximum norm of 1.0:

    threshold = 1.0
    clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
    clip_weights = tf.assign(weights, clipped_weights)
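
To make the clipping step concrete, here is a NumPy sketch of what clipping along the second axis computes, assuming the usual definition (a row is rescaled by threshold / norm only when its norm exceeds the threshold). `clip_rows_by_norm` is a name made up for the example:

```python
import numpy as np

def clip_rows_by_norm(weights, threshold):
    """Rescale each row whose L2 norm exceeds `threshold`; leave the others alone."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    scale = threshold / np.maximum(norms, threshold)   # 1.0 where norm <= threshold
    return weights * scale

weights = np.array([[3.0, 4.0],     # norm 5.0 -> rescaled down to norm 1.0
                    [0.3, 0.4]])    # norm 0.5 -> left unchanged
clipped = clip_rows_by_norm(weights, threshold=1.0)

print(clipped)
print("row norms after clipping:", np.linalg.norm(clipped, axis=1))
```

Rows are rescaled rather than truncated, so their direction is preserved and only their length changes.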

You would then apply this operation after each training step, like so:

    with tf.Session() as sess:
        [...]
        for epoch in range(n_epochs):
            [...]
            for X_batch, y_batch in zip(X_batches, y_batches):
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                clip_weights.eval()

You may wonder how to get access to the weights variable of each layer. For this you can simply use a variable scope like this:

    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    with tf.variable_scope("hidden1", reuse=True):
        weights1 = tf.get_variable("weights")

Alternatively, you can use the root variable scope:

    hidden1 = fully_connected(X, n_hidden1, scope="hidden1")
    hidden2 = fully_connected(hidden1, n_hidden2, scope="hidden2")
    [...]
    with tf.variable_scope("", default_name="", reuse=True): # root scope
        weights1 = tf.get_variable("hidden1/weights")
        weights2 = tf.get_variable("hidden2/weights")

Although the preceding solution should work fine, it is a bit messy. A cleaner solution is to create a max_norm_regularizer() function and use it just like the earlier l1_regularizer() function:

    def max_norm_regularizer(threshold, axes=1, name="max_norm", collection="max_norm"):
        def max_norm(weights):
            clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
            clip_weights = tf.assign(weights, clipped, name=name)
            tf.add_to_collection(collection, clip_weights)
            return None # there is no regularization loss term
        return max_norm

This function returns a parametrized max_norm() function that you can use like any other regularizer:

    max_norm_reg = max_norm_regularizer(threshold=1.0)
    hidden1 = fully_connected(X, n_hidden1, scope="hidden1", weights_regularizer=max_norm_reg)
    
Note that max-norm regularization does not require adding a regularization loss term to your overall loss function, so the max_norm() function returns None. But you still need to be able to run the clip_weights operation after each training step, so you need to be able to get a handle on it. This is why the max_norm() function adds the clip_weights node to a collection of max-norm clipping operations. You need to fetch these clipping operations and run them after each training step:

    clip_all_weights = tf.get_collection("max_norm")
    with tf.Session() as sess:
        [...]
        for epoch in range(n_epochs):
            [...]
            for X_batch, y_batch in zip(X_batches, y_batches):
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
                sess.run(clip_all_weights)
                        
# Let's implement this!

In [30]:
# create a basic network
reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

##### Next, let's get a handle on the first hidden layer's weight and create an operation that will compute the clipped weights using the clip_by_norm() function. Then we create an assignment operation to assign the clipped weights to the weights variable:

In [32]:
threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)

#We can do this as well for the second hidden layer:
weights2 = tf.get_default_graph().get_tensor_by_name("hidden2/kernel:0")
clipped_weights2 = tf.clip_by_norm(weights2, clip_norm=threshold, axes=1)
clip_weights2 = tf.assign(weights2, clipped_weights2)

#Let's add an initializer and a saver:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

#And now we can train the model. 
#Only change is that right after running the training_op, we run the clip_weights and clip_weights2 operations:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:                                             
    init.run()                                                         
    for epoch in range(n_epochs):                                      
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)      
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            clip_weights.eval()
            clip_weights2.eval()                                       
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,      
                                            y: mnist.test.labels})     
        print(epoch, "Test accuracy:", acc_test)                       

    #save_path = saver.save(sess, "./my_model_final.ckpt")   

0 Test accuracy: 0.9587
1 Test accuracy: 0.9651
2 Test accuracy: 0.9742
3 Test accuracy: 0.9761
4 Test accuracy: 0.9772
5 Test accuracy: 0.979
6 Test accuracy: 0.9785
7 Test accuracy: 0.9792
8 Test accuracy: 0.9764
9 Test accuracy: 0.9778
10 Test accuracy: 0.9801
11 Test accuracy: 0.9793
12 Test accuracy: 0.9808
13 Test accuracy: 0.9818
14 Test accuracy: 0.9814
15 Test accuracy: 0.9818
16 Test accuracy: 0.9828
17 Test accuracy: 0.9828
18 Test accuracy: 0.9822
19 Test accuracy: 0.9832


## The implementation above is straightforward and it works fine, but it is a bit messy. A better approach is to define a max_norm_regularizer() function.

### Then you can call this function to get a max norm regularizer (with the threshold you want). When you create a hidden layer, you can pass this regularizer to the kernel_regularizer argument:

In [33]:
def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None # there is no regularization loss term
    return max_norm

In [None]:
reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

#Training is as usual, except you must run the weights clipping operations after each training operation:
n_epochs = 20
batch_size = 50

clip_all_weights = tf.get_collection("max_norm")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print(epoch, "Test accuracy:", acc_test)

    #save_path = saver.save(sess, "./my_model_final.ckpt")

# Data Augmentation
#### Data augmentation consists of generating new training instances from existing ones, artificially boosting the size of the training set. THIS WILL REDUCE OVERFITTING, MAKING IT A REGULARIZATION TECHNIQUE!

##### The trick is to generate realistic training instances; ideally, a human should not be able to tell which instances were generated and which ones were not. Moreover, simply adding white noise will not help; the modifications you apply should be learnable (white noise is not).

For example, **if your model is meant to classify pictures of mushrooms, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set.** This forces the model to be more tolerant to the position, orientation, and size of the mushrooms in the picture. **If you want the model to be more tolerant to lighting conditions, you can similarly generate many images with various contrasts.** Assuming the mushrooms are symmetrical, you can also flip the pictures horizontally. 
##### By combining these transformations you can greatly increase the size of your training set.

![](pictures/homl_ch11_dataaugment.jpg)

It is often preferable to generate training instances on the fly during training rather than wasting storage space and network bandwidth. TensorFlow offers several image manipulation operations such as transposing (shifting), rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation, and hue (see the API documentation for more details). This makes it easy to implement data augmentation for image datasets.
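
As a rough illustration of generating instances on the fly, here is a NumPy sketch of three simple transformations. It does not use TensorFlow's image operations; the function names and the random placeholder image are assumptions made for the example:

```python
import numpy as np

def flip_horizontal(image):
    """Mirror the image left to right."""
    return image[:, ::-1]

def shift_right(image, pixels):
    """Shift the image to the right, filling the vacated columns with zeros."""
    shifted = np.zeros_like(image)
    shifted[:, pixels:] = image[:, :image.shape[1] - pixels]
    return shifted

def adjust_contrast(image, factor):
    """Scale the distance of each pixel from the mean intensity."""
    mean = image.mean()
    return np.clip((image - mean) * factor + mean, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((28, 28))            # placeholder for a real grayscale picture

augmented = [flip_horizontal(image),
             shift_right(image, 2),
             adjust_contrast(image, 1.5)]
print(len(augmented), "new instances, each of shape", augmented[0].shape)
```

In a real pipeline you would apply transformations like these with random parameters to each batch as it is drawn, so the augmented pictures never need to be stored.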

### NOTE: Another powerful technique to train very deep neural networks is to add skip connections (a skip connection is when you add the input of a layer to the output of a higher layer).
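
A skip connection is easy to sketch in NumPy. In this illustration the input is added back just before the final activation, which is one common choice and an assumption here; the layer sizes and the name `block_with_skip` are also made up for the example:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def block_with_skip(x, W1, b1, W2, b2):
    """Two dense layers whose output has the block's input added back in."""
    h = relu(x @ W1 + b1)
    return relu(h @ W2 + b2 + x)        # the "+ x" is the skip connection

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1, b1 = rng.normal(size=(8, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)) * 0.1, np.zeros(8)

out = block_with_skip(x, W1, b1, W2, b2)
print(out.shape)
```

For the addition to work, the block's output must have the same shape as its input.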

# Practical Guidelines:

### We have covered a wide range of techniques and you may be wondering which ones you should use. The configuration in Table 11-2 will work fine in most cases.

![](pictures/homl_ch11_dnndefaults.jpg)

Of course, you should try to reuse parts of a pretrained neural network if you can find one that solves a similar problem.
This default configuration may need to be tweaked:
- If you can’t find a good learning rate (convergence was too slow, so you increased the learning rate, and now convergence is fast but the network’s accuracy is suboptimal), then you can try adding a learning schedule such as exponential decay.
- If your training set is a bit too small, you can implement data augmentation.
- If you need a sparse model, you can add some ℓ1 regularization to the mix (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Adam optimization, along with ℓ1 regularization.
- If you need a lightning-fast model at runtime, you may want to drop Batch Normalization, and possibly replace the ELU activation function with the leaky ReLU. Having a sparse model will also help.
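
For the learning schedule mentioned in the first bullet, exponential decay multiplies the learning rate by a fixed factor every `decay_steps` steps. A plain-Python sketch of the formula (the parameter values are illustrative, not recommendations):

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate):
    """Learning rate after `step` training steps."""
    return initial_rate * decay_rate ** (step / decay_steps)

# Illustrative values: start at 0.1 and divide by 10 every 10,000 steps.
for step in (0, 5_000, 10_000, 20_000):
    rate = exponential_decay(0.1, step, decay_steps=10_000, decay_rate=0.1)
    print(step, rate)
```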

With these guidelines, you are now ready to train very deep nets, if you are very patient, that is! If you use a single machine, you may have to wait for days or even months for training to complete. In the next chapter we will discuss how to use distributed TensorFlow to train and run models across many servers and GPUs.

# Exercises:
**1. Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?**

- No, all weights should be sampled independently; they should not all have the same initial value. **One important goal of sampling weights randomly is to break symmetries: if all the weights have the same initial value, even if that value is not zero, then symmetry is not broken (i.e., all neurons in a given layer are equivalent), and backpropagation will be unable to break it.** Concretely, this means that all the neurons in any given layer will always have the same weights. It’s like having just one neuron per layer, and much slower. It is virtually impossible for such a configuration to converge to a good solution.
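
The symmetry argument can be checked numerically. The sketch below is a toy setup assumed for illustration (a tiny tanh network without biases, random data, and half the mean squared error as the loss). It compares the hidden-layer gradients under constant and random initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = rng.normal(size=(32, 1))

def hidden_weight_gradient(W1, W2):
    """Gradient of half the MSE loss w.r.t. W1 for a tiny tanh network."""
    h = np.tanh(X @ W1)                     # hidden activations, shape (32, 4)
    error = h @ W2 - y                      # prediction error, shape (32, 1)
    grad_h = error @ W2.T * (1 - h ** 2)    # backpropagate through tanh
    return X.T @ grad_h / len(X)

# Every weight gets the same value: each hidden neuron receives the same gradient.
same = hidden_weight_gradient(np.full((5, 4), 0.3), np.full((4, 1), 0.3))
# Independent random weights: the gradients differ from neuron to neuron.
rand = hidden_weight_gradient(rng.normal(size=(5, 4)), rng.normal(size=(4, 1)))

print("identical columns (constant init):", np.allclose(same, same[:, :1]))
print("identical columns (random init):  ", np.allclose(rand, rand[:, :1]))
```

Since identical neurons receive identical updates, they stay identical after every step, which is why backpropagation cannot break the symmetry on its own.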

**2. Is it okay to initialize the bias terms to 0?**
- It is perfectly fine to initialize the bias terms to zero. Some people like to initialize them just like weights, and that’s okay too; it does not make much difference.

**3. Name three advantages of the ELU activation function over ReLU.**
- It can take on negative values, so the average output of the neurons in any given layer is typically closer to 0 than when using the ReLU activation function (which never outputs negative values). This helps alleviate the vanishing gradients problem.
- It always has a nonzero derivative, which avoids the dying units issue that can affect ReLU units.
- It is smooth everywhere, whereas the ReLU’s slope abruptly jumps from 0 to 1 at z = 0. Such an abrupt change can slow down Gradient Descent because it will bounce around z = 0.
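
These properties are easy to check numerically. A small NumPy sketch, assuming α = 1 for the ELU and zero-mean Gaussian inputs as a stand-in for real pre-activations:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def elu(z, alpha=1.0):
    # expm1 on the clipped input avoids overflow for large positive z
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def elu_derivative(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(np.minimum(z, 0.0)))

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)            # zero-mean pre-activations

print("mean ReLU output:", relu(z).mean())
print("mean ELU output: ", elu(z).mean())
print("smallest ELU derivative:", elu_derivative(z).min())
```

The ELU mean comes out closer to zero than the ReLU mean, and the ELU derivative stays strictly positive even for negative inputs, where the ReLU derivative is exactly 0.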

**4. In which cases would you want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?**
- The ELU activation function is a good default. If you need the neural network to be as fast as possible, you can use one of the leaky ReLU variants instead (e.g., a simple leaky ReLU using the default hyperparameter value). The simplicity of the ReLU activation function makes it many people’s preferred option, despite the fact that it is generally outperformed by the ELU and leaky ReLU. However, the ReLU activation function’s capability of outputting precisely zero can be useful in some cases.
- The hyperbolic tangent (tanh) can be useful in the output layer if you need to output a number between –1 and 1, but nowadays it is not used much in hidden layers. 
- The logistic activation function is also useful in the output layer when you need to estimate a probability (e.g., for binary classification), but it is also rarely used in hidden layers (there are exceptions — for example, for the coding layer of variational autoencoders). 
- Finally, the softmax activation function is useful in the output layer to output probabilities for mutually exclusive classes, but other than that it is rarely (if ever) used in hidden layers.

**5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using a MomentumOptimizer?**
- If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using a MomentumOptimizer, then the algorithm will likely pick up a lot of speed, hopefully roughly toward the global minimum, but then it will shoot right past the minimum, due to its momentum. Then it will slow down and come back, accelerate again, overshoot again, and so on. 
- It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value.
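
The overshooting behavior can be reproduced on a toy problem. This sketch runs momentum optimization on f(x) = x²/2 and counts the steps until both the position and the velocity are tiny. It uses 0.99 rather than 0.99999 to keep the run short; the learning rate, tolerance, and function name are all assumptions made for the example:

```python
def steps_to_converge(momentum, learning_rate=0.1, tol=1e-3, max_steps=100_000):
    """Momentum optimization on f(x) = x**2 / 2, whose gradient is simply x."""
    x, velocity = 1.0, 0.0
    for step in range(1, max_steps + 1):
        velocity = momentum * velocity - learning_rate * x
        x += velocity
        # Require both to be small: x alone crosses zero on every oscillation.
        if abs(x) < tol and abs(velocity) < tol:
            return step
    return max_steps

for momentum in (0.9, 0.99):
    print("momentum", momentum, "->", steps_to_converge(momentum), "steps")
```

With the larger momentum the iterate keeps swinging back and forth across the minimum and needs many more steps to settle.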

**6. Name three ways you can produce a sparse model.**
- One way to produce a sparse model (i.e., with most weights equal to zero) is to train the model normally, then zero out tiny weights. 
- For more sparsity, you can apply ℓ1 regularization during training, which pushes the optimizer toward sparsity. 
- A third option is to combine ℓ1 regularization with dual averaging, using TensorFlow’s FTRLOptimizer class.
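
The first approach (zeroing out tiny weights after training) takes only a line of NumPy. In this sketch the weight matrix is random placeholder data, and the threshold and the name `zero_out_tiny_weights` are illustrative assumptions:

```python
import numpy as np

def zero_out_tiny_weights(weights, threshold):
    """Copy of `weights` with entries smaller in magnitude than `threshold` set to 0."""
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(300, 50))   # placeholder for trained weights

sparse = zero_out_tiny_weights(weights, threshold=0.05)
print("fraction of zero weights:", np.mean(sparse == 0.0))
```

How much sparsity you get depends on the threshold and on how the trained weights are distributed, so you would want to check that accuracy does not suffer after pruning.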

**7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)?**
- Yes, dropout does slow down training, in general roughly by a factor of two. However, it has no impact on inference since it is only turned on during training.
