# Training Deep Neural Nets

Suppose you want to solve a more complex problem, such as detecting objects in high-resolution images. You may need a much deeper network, perhaps 10 layers, each containing hundreds of neurons connected by hundreds of thousands of connections. Several issues arise when training such a network:

* The vanishing/exploding gradients problem can make the lower layers very hard to train.
* A network this large is extremely slow to train.
* A model with millions of parameters severely risks overfitting the training set.

## Vanishing/Exploding Gradients Problems 
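Backpropagation works from the output layer down to the input layer, propagating the error gradient on the way. The gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the connection weights of the lower layers virtually unchanged, and training never converges to a good solution. This is the vanishing gradients problem.

In some cases the opposite happens: the gradients grow bigger and bigger, so many layers get very large weight updates and the algorithm diverges. This is the exploding gradients problem. More generally, deep neural networks suffer from unstable gradients: different layers may learn at widely different speeds.

The problem is aggravated by combining the sigmoid activation function with random weight initialisation from a Gaussian distribution. With this combination the variance of each layer's outputs is greater than the variance of its inputs, so the variance keeps increasing after each layer until the activation function saturates in the top layers. The fact that the sigmoid has a mean of 0.5 rather than 0 makes this worse.

For high and low input values the sigmoid saturates at 1 or 0, with a derivative extremely close to 0. This leaves virtually no gradient to propagate back through the network, which reduces the ability of Gradient Descent to converge. What little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.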
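
The saturation described above can be checked numerically. The following is a minimal sketch using only the standard library; the function names are illustrative, and the ten-layer bound ignores the weight matrices, so it is only a rough indication of how quickly the gradient signal can shrink.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Gradient at the centre versus in the saturated regions.
grads = {z: sigmoid_grad(z) for z in (-10.0, 0.0, 10.0)}
for z, g in grads.items():
    print(f"z = {z:6.1f}  gradient = {g:.6f}")

# The derivative never exceeds 0.25, so ten stacked sigmoids contribute
# a factor of at most 0.25 ** 10 (weights ignored in this rough bound).
print("upper bound through 10 layers:", 0.25 ** 10)
```

The derivative peaks at 0.25 for z = 0 and is close to zero at z = ±10, which is why a saturated sigmoid passes almost no gradient down.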

### Xavier and He initialisation
Glorot and Bengio showed that, for the signal to flow properly, the variance of the outputs of each layer should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction. Both conditions cannot hold at once unless the layer has an equal number of inputs and outputs.

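They proposed a compromise that works very well in practice: initialise the connection weights randomly using one of the following schemes, where $n_{\text{inputs}}$ and $n_{\text{outputs}}$ are the number of input and output connections of the layer.

A normal distribution with mean 0 and standard deviation
$\sigma = \sqrt{\frac{2}{n_{\text{inputs}} + n_{\text{outputs}}}}$

or a uniform distribution between $-r$ and $+r$ with
$r = \sqrt{\frac{6}{n_{\text{inputs}} + n_{\text{outputs}}}}$

In practice this looks like the following. Note that variance_scaling_initializer() defaults to He initialisation, a variant aimed at ReLU-style activations that only takes the number of inputs into account.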

In [1]:
import tensorflow as tf

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, 
                          kernel_initializer=he_init, name="hidden1")
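
To make the two Xavier formulas concrete, here is a small sketch that computes $\sigma$ and $r$ for the same layer sizes and samples a weight matrix from the uniform variant. It uses only the standard library; the helper names and the seed are assumptions made for illustration, not part of TensorFlow.

```python
import math
import random

def xavier_normal_std(n_inputs, n_outputs):
    """Standard deviation for the normal variant."""
    return math.sqrt(2.0 / (n_inputs + n_outputs))

def xavier_uniform_limit(n_inputs, n_outputs):
    """Limit r for the uniform variant, sampling from [-r, +r]."""
    return math.sqrt(6.0 / (n_inputs + n_outputs))

n_inputs, n_outputs = 28 * 28, 300  # same sizes as the MNIST layer above
sigma = xavier_normal_std(n_inputs, n_outputs)
r = xavier_uniform_limit(n_inputs, n_outputs)

rng = random.Random(42)  # fixed seed, chosen arbitrarily
# One weight matrix as a list of rows, drawn from the uniform variant.
weights = [[rng.uniform(-r, r) for _ in range(n_outputs)]
           for _ in range(n_inputs)]

print("sigma =", sigma)
print("r     =", r)
```

The two variants are consistent with each other: a uniform distribution on $[-r, +r]$ has standard deviation $r / \sqrt{3}$, which equals $\sigma$.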

## Nonsaturating Activation Functions 
The sigmoid saturates at high and low values. The ReLU function does not saturate for positive values and is very fast to compute, but it suffers from a problem known as dying ReLUs: once a neuron's weights are updated so that the weighted sum of its inputs is negative for every training instance, it outputs only 0. The gradient of ReLU is 0 for negative inputs, so the neuron is unlikely to come back to life and becomes effectively useless.

A solution is the leaky ReLU, defined as LeakyReLU$_\alpha(z) = \max(\alpha z, z)$. The hyperparameter $\alpha$ defines how much the function leaks: it is the slope of the function for $z < 0$. Because that slope is not zero, a leaky ReLU can go dormant for a long time but always has a chance of coming back to life. In the randomized variant, $\alpha$ is picked randomly in a given range during training and fixed to an average value during testing; this also acts as a regularisation technique, reducing overfitting.

The exponential linear unit (ELU) outperformed all of the ReLU variants in its authors' experiments: training time was reduced and the neural network performed better on the test set.

$\text{ELU}_\alpha(z) = \alpha(\exp(z) - 1)$ if $z < 0$    
$\text{ELU}_\alpha(z) = z$ if $z \geq 0$

* It has a nonzero gradient for $z < 0$, which avoids the dying units issue.
* If $\alpha$ is equal to 1 the function is smooth everywhere, including around $z = 0$, which helps speed up Gradient Descent.
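
The two activation functions above are easy to write out for a single scalar input. This is a plain-Python sketch for illustration only; the default values of alpha are assumptions (0.01 for the leak, 1 for ELU).

```python
import math

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): slope alpha for z < 0, identity otherwise."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, identity otherwise."""
    return alpha * (math.exp(z) - 1.0) if z < 0 else z

for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z = {z:5.1f}  leaky_relu = {leaky_relu(z):8.4f}  elu = {elu(z):8.4f}")
```

For very negative inputs the leaky ReLU keeps decreasing linearly, while ELU levels off at $-\alpha$.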

### TensorFlow offers an ELU function

In [2]:
import tensorflow as tf
tf.reset_default_graph()
X = tf.placeholder(tf.float32, shape=(None, 3))
X2 = tf.placeholder(tf.float32, shape=(None, 3))
n_hidden1 = 100
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

In [3]:
# leaky ReLU, defined by hand with a leak of 0.01
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

n_hidden2 = 100
hidden2 = tf.layers.dense(X2, n_hidden2, activation=leaky_relu, name="hidden2")

# Batch Normalisation 
Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it does not guarantee that they won't come back later during training. Batch Normalization addresses this.

The technique consists of adding an operation in the model just before the activation function of each layer: it zero-centers and normalizes the inputs, then scales and shifts the result using two new parameters per layer (one for scaling, the other for shifting). To zero-center and normalize the inputs, the algorithm needs to estimate their mean and standard deviation. It does so by evaluating them over the current mini-batch, hence the name Batch Normalization.

![image.png](attachment:image.png)

At test time there is no mini-batch to compute the empirical mean and standard deviation, so you use the whole training set's mean and standard deviation instead. These are typically computed efficiently during training using a moving average.
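
The training-time computation can be sketched for a single feature over one mini-batch. This is an illustration only: the names gamma (scale), beta (shift) and eps (a small smoothing term to avoid division by zero) follow the usual Batch Normalization notation and are assumptions here, as are the sample values.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift.

    Returns the outputs together with the batch mean and variance.
    """
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    outputs = [gamma * x_hat + beta for x_hat in normalized]
    return outputs, mean, var

batch = [2.0, 4.0, 6.0, 8.0]  # made-up inputs for one feature
outputs, mean, var = batch_norm(batch)
print("mean:", mean, "variance:", var)
print("outputs:", outputs)
```

With gamma = 1 and beta = 0 the outputs are simply the zero-centered, unit-variance inputs; during training the network learns better values for both parameters.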

Batch Normalization also acts as a regularizer, reducing the need for other regularization techniques. It does, however, add some complexity to the model, and the neural network makes slower predictions because of the extra computations required at each layer. If you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization performs before playing with Batch Normalization.

## Implementing Batch Normalization with TensorFlow

TensorFlow provides a tf.nn.batch_normalization() function that simply centers and normalizes the inputs, but you must compute the mean and standard deviation yourself and pass them as parameters to this function, and you must also handle the creation of the scaling and offset parameters and pass them in.

Alternatively, the tf.layers.batch_normalization() function handles all of this for you.

In [8]:
import numpy as np

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]


import tensorflow as tf
tf.reset_default_graph()
n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 200
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

training = tf.placeholder_with_default(False, shape=(), name="training")

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training, momentum=0.9)

The training placeholder is set to True during training; otherwise it defaults to False. It tells the tf.layers.batch_normalization() function whether it should use the current mini-batch's mean and standard deviation (during training) or the whole training set's mean and standard deviation (during testing).

The code then alternates tf.layers.dense() layers with batch_normalization() layers. The BN algorithm uses exponential decay to compute the running averages, which is why it requires the momentum parameter: given a new value $v$, the running average $\hat{v}$ is updated as $\hat{v} \leftarrow \hat{v} \times \text{momentum} + v \times (1 - \text{momentum})$. A good momentum value is typically close to 1, for example 0.9, 0.99, or 0.999 (you want more 9s for larger datasets).
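
The role of the momentum parameter can be illustrated with a few lines of plain Python. This sketch assumes the update rule running = running * momentum + new_value * (1 - momentum); the helper name and the per-batch means are made up.

```python
def update_running_average(running, new_value, momentum=0.9):
    """running <- running * momentum + new_value * (1 - momentum)."""
    return running * momentum + new_value * (1.0 - momentum)

running_mean = 0.0
batch_means = [5.0, 5.2, 4.9, 5.1, 5.0]  # made-up per-batch means
for step, batch_mean in enumerate(batch_means, start=1):
    running_mean = update_running_average(running_mean, batch_mean)
    print(step, round(running_mean, 4))
```

A momentum closer to 1 makes each new batch count for less, so the estimate is smoother but takes longer to catch up.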


In [9]:
def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch


from functools import partial 

tf.reset_default_graph()

learning_rate = 0.01
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

training = tf.placeholder_with_default(False, shape=(), name="training")

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)


hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

The partial() function from functools creates a thin wrapper around another function and lets you set default values for some of its parameters, which avoids repeating the same arguments (here training and momentum) for every layer.
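
For readers who have not used partial() before, here is a tiny standalone sketch. The scale() helper is hypothetical and exists only to show how arguments are fixed once and reused.

```python
from functools import partial

def scale(value, factor, offset=0.0):
    """Hypothetical helper: multiply by a factor and add an offset."""
    return value * factor + offset

# Fix factor and offset once; the result only needs the value.
double_plus_one = partial(scale, factor=2.0, offset=1.0)

print(double_plus_one(3.0))
# Keyword defaults set through partial() can still be overridden per call.
print(double_plus_one(3.0, offset=0.0))
```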

The execution phase differs from the usual one in two ways. First, whenever you run an operation that depends on the batch_normalization() layer during training, you need to set the training placeholder to True.

Second, batch_normalization() creates a few operations that must be evaluated at each training step in order to update the moving averages, which are needed to evaluate the training set's mean and standard deviation. These operations are automatically added to the UPDATE_OPS collection, so all you need to do is get the list of operations in that collection and run them at each training iteration.


In [10]:



with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()
n_epochs = 20
batch_size = 200

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "models/my_model_final.cpkt")

0 Validation accuracy: 0.8894
1 Validation accuracy: 0.9166
2 Validation accuracy: 0.9284
3 Validation accuracy: 0.9384
4 Validation accuracy: 0.9444
5 Validation accuracy: 0.9494
6 Validation accuracy: 0.9528
7 Validation accuracy: 0.9544
8 Validation accuracy: 0.9598
9 Validation accuracy: 0.9596
10 Validation accuracy: 0.9616
11 Validation accuracy: 0.9632
12 Validation accuracy: 0.964
13 Validation accuracy: 0.9656
14 Validation accuracy: 0.9652
15 Validation accuracy: 0.967
16 Validation accuracy: 0.9666
17 Validation accuracy: 0.968
18 Validation accuracy: 0.9698
19 Validation accuracy: 0.9712


## Gradient Clipping 
A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. In general, Batch Normalization is now preferred.

In TensorFlow the optimizer's minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer's compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer's apply_gradients() method.
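
What clipping by value does to a single gradient can be shown without TensorFlow. The function below is a plain-Python stand-in written for illustration, not the library implementation, and the gradient values are made up.

```python
def clip_by_value(gradient, min_value, max_value):
    """Clip every component of a gradient (a list of floats) into a range."""
    return [max(min_value, min(max_value, g)) for g in gradient]

threshold = 1.0
gradient = [0.3, -2.5, 7.0, -0.9]  # made-up gradient values
clipped = clip_by_value(gradient, -threshold, threshold)
print(clipped)
```

Each component is clipped independently, so the direction of the gradient vector can change when only some components exceed the threshold.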


In [24]:

tf.reset_default_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_hidden3 = 50
n_hidden4 = 50
n_hidden5 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4")
    hidden5 = tf.layers.dense(hidden4, n_hidden5, activation=tf.nn.relu, name="hidden5")
    logits = tf.layers.dense(hidden5, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")
    

learning_rate = 0.01

############# Gradient Clipping ############
threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

########## rest as usual ############
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()


n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "models/my_model_final.cpkt")

0 Validation accuracy: 0.5956
1 Validation accuracy: 0.8354
2 Validation accuracy: 0.886
3 Validation accuracy: 0.9098
4 Validation accuracy: 0.922
5 Validation accuracy: 0.9324
6 Validation accuracy: 0.9346
7 Validation accuracy: 0.9412
8 Validation accuracy: 0.9416
9 Validation accuracy: 0.949
10 Validation accuracy: 0.9502
11 Validation accuracy: 0.9512
12 Validation accuracy: 0.9556
13 Validation accuracy: 0.957
14 Validation accuracy: 0.9594
15 Validation accuracy: 0.96
16 Validation accuracy: 0.9602
17 Validation accuracy: 0.963
18 Validation accuracy: 0.9602
19 Validation accuracy: 0.9638


## Reusing Pretrained Layers
It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task and reuse its lower layers.

### Reusing a TensorFlow Model
You can use the import_meta_graph() function to import the operations into the default graph.

In [29]:
tf.reset_default_graph()
saver = tf.train.import_meta_graph("models/my_model_final.cpkt.meta")

You must then get a handle on the operations and tensors you will need for training. You can do this using the graph's get_operation_by_name() and get_tensor_by_name() methods. The name of a tensor is the name of the operation that outputs it followed by :0 (or :1 if it is the second output, and so on). Listing the graph's operations is one way to find those names.

In [30]:

for op in tf.get_default_graph().get_operations():
    print(op.name)

X
y
hidden1/kernel/Initializer/random_uniform/shape
hidden1/kernel/Initializer/random_uniform/min
hidden1/kernel/Initializer/random_uniform/max
hidden1/kernel/Initializer/random_uniform/RandomUniform
hidden1/kernel/Initializer/random_uniform/sub
hidden1/kernel/Initializer/random_uniform/mul
hidden1/kernel/Initializer/random_uniform
hidden1/kernel
hidden1/kernel/Assign
hidden1/kernel/read
hidden1/bias/Initializer/zeros
hidden1/bias
hidden1/bias/Assign
hidden1/bias/read
dnn/hidden1/MatMul
dnn/hidden1/BiasAdd
dnn/hidden1/Relu
hidden2/kernel/Initializer/random_uniform/shape
hidden2/kernel/Initializer/random_uniform/min
hidden2/kernel/Initializer/random_uniform/max
hidden2/kernel/Initializer/random_uniform/RandomUniform
hidden2/kernel/Initializer/random_uniform/sub
hidden2/kernel/Initializer/random_uniform/mul
hidden2/kernel/Initializer/random_uniform
hidden2/kernel
hidden2/kernel/Assign
hidden2/kernel/read
hidden2/bias/Initializer/zeros
hidden2/bias
hidden2/bias/Assign
hidden2/bias/read
dn

### Let's reuse the layers up to hidden layer 3

In [32]:
import tensorflow as tf
tf.reset_default_graph() 

n_hidden4 = 20
n_outputs = 10

saver = tf.train.import_meta_graph("models/my_model_final.cpkt.meta")

########## Reuse placeholders to feed data into ##########
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")

############ Reuse layer ############
hidden3 = tf.get_default_graph().get_tensor_by_name("dnn/hidden3/Relu:0") 

#create new layer
new_hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="new_hidden4") 
new_logits = tf.layers.dense(new_hidden4, n_outputs, name="new_outputs")

with tf.name_scope("new_loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=new_logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("new_eval"):
    correct = tf.nn.in_top_k(new_logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("new_train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
new_saver = tf.train.Saver()                                                                                                     

In [34]:
with tf.Session() as sess:
    init.run()
    saver.restore(sess, "models/my_model_final.cpkt")
    
    for epoch in range(n_epochs):
            for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
                sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
            print(epoch, ":validation accuracy:", accuracy_val)
    save_path = new_saver.save(sess, "models/my_new_model_final.cpkt")
    

INFO:tensorflow:Restoring parameters from models/my_model_final.cpkt
0 :validation accuracy: 0.896
1 :validation accuracy: 0.9272
2 :validation accuracy: 0.9388
3 :validation accuracy: 0.9468
4 :validation accuracy: 0.9502
5 :validation accuracy: 0.9528
6 :validation accuracy: 0.956
7 :validation accuracy: 0.9574
8 :validation accuracy: 0.9606
9 :validation accuracy: 0.9618
10 :validation accuracy: 0.9618
11 :validation accuracy: 0.963
12 :validation accuracy: 0.9634
13 :validation accuracy: 0.9638
14 :validation accuracy: 0.9666
15 :validation accuracy: 0.9648
16 :validation accuracy: 0.9674
17 :validation accuracy: 0.9668
18 :validation accuracy: 0.9662
19 :validation accuracy: 0.9656


## Freezing the Lower Layers 
It is likely that the lower layers of the first DNN have learned to detect low-level features in pictures that will be useful across both image classification tasks, so you can just reuse these layers as they are. It is generally a good idea to freeze their weights when training the new DNN: if the lower-layer weights are fixed, the higher-layer weights are easier to train because they do not have to learn a moving target.

In [26]:
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 
                              scope="hidden[34]|outputs")
training_op = optimizer.minimize(loss, var_list=train_vars)

The first line gets the list of all trainable variables in hidden layers 3 and 4 and in the output layer. You then pass this restricted list of trainable variables to the optimizer's minimize() function. This leaves out the variables in hidden layers 1 and 2, which are now frozen: they will not change during training.
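
The scope argument is treated as a regular expression matched against the start of each variable name. The sketch below mimics that filtering with the standard re module; the variable names are hypothetical, modelled on the operation names printed earlier.

```python
import re

# Hypothetical variable names, modelled on the ones printed earlier.
variable_names = [
    "hidden1/kernel", "hidden1/bias",
    "hidden2/kernel", "hidden2/bias",
    "hidden3/kernel", "hidden3/bias",
    "hidden4/kernel", "hidden4/bias",
    "outputs/kernel", "outputs/bias",
]

scope = "hidden[34]|outputs"
# re.match anchors at the start of the string, so the pattern acts as a
# filter on the beginning of each name.
selected = [name for name in variable_names if re.match(scope, name)]
frozen = [name for name in variable_names if name not in selected]

print("trained:", selected)
print("frozen: ", frozen)
```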

## Tweaking, Dropping, or Replacing the Upper Layers 
If you still cannot get good performance and you have little training data, try dropping the top hidden layer(s) and freezing all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.

# Model Zoos
TensorFlow has its own model zoo, and there are others you can search to find a similar model that can be adapted for your purposes.

Another popular one is Caffe's Model Zoo (https://homl.info/53). It also contains many computer vision models.

## Unsupervised Pretraining 
Suppose you want to tackle a complex task for which you do not have much labeled training data, you cannot find a model trained on a similar task, and gathering more labeled data is too expensive or time consuming. You can start by training each layer one by one, starting with the lowest, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines or autoencoders. Each layer is trained on the output of the previously trained layers. Once all layers have been trained this way, you can fine-tune the network using supervised learning.

## Pretraining on an Auxiliary Task 
If it is difficult to get good labeled data for the task you want to tackle, you can first train the network on a similar auxiliary task for which labeled data is easier to obtain, then reuse its lower layers for the actual task.

You can also generate some labeled data yourself. For instance, if you need a model for NLP, you could read in a corpus of documents and label the sentences "good", then automate a process that adds errors to them and label the results "bad". You then pretrain your model on this data.

## Faster Optimizers 
Training a very large deep neural network can be painfully slow. The sections above cover four ways to speed up training: applying a good initialization strategy for the connection weights, using a good activation function, using Batch Normalization, and reusing parts of a pretrained network.

Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. The next sections look at the most popular ones:

* Momentum optimization
* Nesterov Accelerated Gradient 
* AdaGrad
* RMSProp
* Adam optimization 

## Momentum Optimization
Imagine a bowling ball rolling down a gentle slope on a smooth surface. It starts out slowly, but it quickly picks up momentum until it eventually reaches terminal velocity.

Gradient Descent updates the weights $\theta$ by directly subtracting the gradient of the cost function $J(\theta)$ with regard to the weights, $\nabla_{\theta} J(\theta)$, multiplied by the learning rate $\eta$:

$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta)$

It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly. 

Momentum optimization takes previous gradients into account. At each iteration, it subtracts the local gradient (multiplied by the learning rate $\eta$) from the momentum vector $m$, then updates the weights by simply adding this momentum vector. The gradient is therefore used as an acceleration, not as a speed. To simulate some friction and keep the momentum from growing too large, the algorithm introduces a new hyperparameter $\beta$, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical value is 0.9.

$m \leftarrow \beta m - \eta \nabla_{\theta} J(\theta)$    
$\theta \leftarrow \theta + m$

Because of the momentum, the optimizer may overshoot a bit, come back, overshoot again, and oscillate like this several times before stabilizing at the minimum, a kind of damped oscillation. It is fairly simple to use in TensorFlow.

In [5]:
import tensorflow as tf
learning_rate = 0.9
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)

The one drawback is that it adds another hyperparameter to tune. However, the momentum value of 0.9 usually works well in practice and almost always goes faster than plain Gradient Descent.
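
The speed-up on a gentle slope can be reproduced with a one-dimensional toy problem. In this sketch the cost function $J(\theta) = \frac{1}{2} a \theta^2$ with a small $a$, the starting point and the step count are all assumptions chosen for illustration; eta is the learning rate and beta the momentum.

```python
def gradient(theta, a=0.01):
    """Gradient of the hypothetical cost J(theta) = 0.5 * a * theta ** 2."""
    return a * theta

def gradient_descent(theta, eta, n_steps):
    for _ in range(n_steps):
        theta = theta - eta * gradient(theta)
    return theta

def momentum_descent(theta, eta, beta, n_steps):
    m = 0.0
    for _ in range(n_steps):
        m = beta * m - eta * gradient(theta)  # gradient acts as acceleration
        theta = theta + m
    return theta

theta_gd = gradient_descent(10.0, eta=0.1, n_steps=100)
theta_momentum = momentum_descent(10.0, eta=0.1, beta=0.9, n_steps=100)
print("plain gradient descent:", theta_gd)
print("momentum optimization: ", theta_momentum)
```

After the same number of steps, the momentum version ends up much closer to the minimum at 0 because the tiny gradients keep adding up in the momentum vector.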

## Nesterov Accelerated Gradient 
Nesterov Accelerated Gradient (NAG), a small variant of Momentum optimization, is almost always faster than vanilla Momentum optimization. The idea is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum. The only difference from vanilla Momentum optimization is that the gradient is measured at $\theta + \beta m$ rather than at $\theta$.

$m \leftarrow \beta m - \eta \nabla_{\theta} J(\theta + \beta m)$    
$\theta \leftarrow \theta + m$

This small tweak works because in general the momentum vector points in the right direction (toward the optimum), so it is slightly more accurate to use the gradient measured a bit farther in that direction than the gradient at the original position. This helps reduce oscillations, so the algorithm converges faster.

In [6]:
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9,
                                      use_nesterov=True)
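
The effect of measuring the gradient ahead of the current position can be seen on a one-dimensional quadratic. This is a sketch under stated assumptions: the cost function $J(\theta) = \frac{1}{2} \theta^2$, the starting point, the learning rate eta and the momentum beta are all chosen for illustration.

```python
def gradient(theta, a=1.0):
    """Gradient of the hypothetical cost J(theta) = 0.5 * a * theta ** 2."""
    return a * theta

def momentum_step(theta, m, eta, beta):
    m = beta * m - eta * gradient(theta)
    return theta + m, m

def nesterov_step(theta, m, eta, beta):
    # Measure the gradient slightly ahead, at theta + beta * m.
    m = beta * m - eta * gradient(theta + beta * m)
    return theta + m, m

def run(step, n_steps=50, theta=10.0, eta=0.1, beta=0.9):
    m, path = 0.0, []
    for _ in range(n_steps):
        theta, m = step(theta, m, eta, beta)
        path.append(theta)
    return path

momentum_path = run(momentum_step)
nesterov_path = run(nesterov_step)
print("largest overshoot, momentum:", min(momentum_path))
print("largest overshoot, Nesterov:", min(nesterov_path))
```

Both optimizers overshoot the minimum at 0, but in this setting the Nesterov variant overshoots less.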

## AdaGrad
Imagine an elongated bowl. Gradient Descent starts by quickly going down the steepest slope, then slowly moves along the bottom of the valley. AdaGrad takes a more direct route by scaling down the gradient vector along the steepest dimensions.

$s \leftarrow s + \nabla_{\theta} J(\theta) \otimes \nabla_{\theta} J(\theta)$    
$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \oslash \sqrt{s + \epsilon}$

The first step accumulates the squares of the gradients into the vector $s$ (the symbol $\otimes$ represents element-wise multiplication). Each element $s_i$ accumulates the squares of the partial derivative of the cost function with regard to parameter $\theta_i$. If the cost function is steep along the $i^{th}$ dimension, then $s_i$ gets larger and larger at each iteration.

The second step is almost identical to Gradient Descent, except that the gradient vector is scaled down by a factor of $\sqrt{s + \epsilon}$. The symbol $\oslash$ represents element-wise division, and $\epsilon$ is a smoothing term to avoid division by zero, typically set to $10^{-10}$.

The algorithm decays the learning rate, but does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. A side benefit is that it requires much less tuning of the learning rate hyperparameter $\eta$.

AdaGrad often performs well on simple quadratic problems, but it often stops too early when training neural networks: the learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum.
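
The two AdaGrad steps can be written out for a small elongated bowl. This sketch assumes the cost function $J(x, y) = \frac{1}{2}(x^2 + 20 y^2)$, a learning rate eta of 1.0 and 50 steps, all chosen for illustration; parameters are plain Python lists.

```python
import math

def gradient(theta):
    """Gradient of the hypothetical bowl J = 0.5 * (x**2 + 20 * y**2)."""
    x, y = theta
    return [x, 20.0 * y]

def adagrad(theta, eta=1.0, eps=1e-10, n_steps=50):
    s = [0.0 for _ in theta]
    for _ in range(n_steps):
        g = gradient(theta)
        # Step 1: accumulate the squared gradients, element by element.
        s = [s_i + g_i * g_i for s_i, g_i in zip(s, g)]
        # Step 2: scale each gradient component by sqrt(s_i + eps).
        theta = [t_i - eta * g_i / math.sqrt(s_i + eps)
                 for t_i, g_i, s_i in zip(theta, g, s)]
    return theta, s

theta, s = adagrad([5.0, 5.0])
print("theta after 50 steps:", theta)
print("accumulated squares: ", s)
```

In this particular bowl the two coordinates shrink in step even though the raw gradient along y is 20 times larger, because each component is divided by its own accumulated scale.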

## RMSProp
AdaGrad slows down a bit too fast and ends up never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations, as opposed to all the gradients since the beginning of training. It does so by using exponential decay in the first step:

$s \leftarrow \beta s + (1 - \beta) \nabla_{\theta} J(\theta) \otimes \nabla_{\theta} J(\theta)$    
$\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \oslash \sqrt{s + \epsilon}$

The decay rate $\beta$ is typically set to 0.9. This default often works well, so it is usually not necessary to tune it.


In [8]:
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-6)

Except on very simple problems, this optimizer almost always performs much better than AdaGrad. It was the preferred optimization algorithm of many researchers until Adam optimization came around.
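
The difference from AdaGrad is easiest to see by feeding a constant gradient into the accumulator. The sketch below is plain Python with list parameters; the function name, the constant gradient of 2.0 and the step count are assumptions for illustration.

```python
import math

def rmsprop_step(theta, s, g, eta=0.01, beta=0.9, eps=1e-10):
    """One RMSProp update for parameters given as lists of floats."""
    # Exponentially decaying average of the squared gradients.
    s = [beta * s_i + (1.0 - beta) * g_i * g_i for s_i, g_i in zip(s, g)]
    theta = [t_i - eta * g_i / math.sqrt(s_i + eps)
             for t_i, g_i, s_i in zip(theta, g, s)]
    return theta, s

# Feed a constant gradient: s settles near g**2 instead of growing forever.
theta, s = [1.0], [0.0]
for _ in range(200):
    theta, s = rmsprop_step(theta, s, g=[2.0])
print("s after 200 steps:", s[0])
```

With AdaGrad the same 200 steps would accumulate a sum of 800, and the effective learning rate would keep shrinking; here the accumulator levels off, so the step size does not vanish.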

## Adam Optimization
Adam, which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients.

![image.png](attachment:image.png)

The momentum decay hyperparameter $\beta_1$ is typically initialized to 0.9, while the scaling decay hyperparameter $\beta_2$ is often initialized to 0.999. The smoothing term $\epsilon$ is usually initialized to a tiny number such as $10^{-8}$.

In [9]:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
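
Since the update equations appear only in the image above, here is a sketch of one common way to write Adam for a single scalar parameter. It follows the formulation from the original Adam paper, including the bias-correction terms, and may differ in small details (such as where epsilon is added) from the equations in the image. The cost function and the learning rate of 0.01 are assumptions for illustration.

```python
import math

def adam_step(theta, m, s, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t starts at 1."""
    m = beta1 * m + (1.0 - beta1) * g          # decaying average of gradients
    s = beta2 * s + (1.0 - beta2) * g * g      # decaying average of squares
    m_hat = m / (1.0 - beta1 ** t)             # bias correction: m, s start at 0
    s_hat = s / (1.0 - beta2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

def gradient(theta):
    """Gradient of the hypothetical cost J(theta) = theta ** 2."""
    return 2.0 * theta

theta, m, s = 1.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, s = adam_step(theta, m, s, gradient(theta), t, eta=0.01)
print("theta after 1000 steps:", theta)
```

Thanks to the bias correction, the very first step has a size close to the learning rate regardless of the gradient's magnitude.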

*Adam optimization has generally been recommended because it is considered faster than other methods. However, a 2017 paper by Ashia C. Wilson et al. showed that adaptive optimization methods can lead to solutions that generalize poorly on some datasets, so it may be best to stick to Momentum optimization or Nesterov Accelerated Gradient for now.*

## Learning Rate Scheduling 
Getting the right learning rate is important. If you set it too high, training may actually diverge. If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make quick progress at first but end up dancing around the optimum, never settling down.

### Learning Schedules 
* **Predetermined piecewise constant learning rate** 
For example, set the learning rate to 0.1 at first, then to 0.0001 after 50 epochs. Although this solution can work well, it often requires fiddling around to find the right learning rates and when to use them.

* **Performance scheduling** 
Measure the validation error every N steps and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.

* **Exponential scheduling** 
Set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_0 10^{-t/r}$. The learning rate drops by a factor of 10 every $r$ steps. This works great, but it requires tuning $\eta_0$ and $r$.

* **Power scheduling**
Set the learning rate to $\eta(t) = \eta_0 (1 + t/r)^{-c}$. The hyperparameter $c$ is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.

A 2013 paper by Andrew Senior et al. compared the performance of some of the most popular learning schedules. The authors concluded that both performance scheduling and exponential scheduling performed well, but they favored exponential scheduling because it converged slightly faster to the optimal solution.
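
Three of the schedules above are simple enough to write as plain functions. This sketch writes the exponential schedule as a decaying function, $\eta_0 10^{-t/r}$; the function names and default values are assumptions taken from the examples in the list.

```python
def piecewise_constant(epoch, boundary=50, before=0.1, after=0.0001):
    """Predetermined piecewise constant schedule (values are illustrative)."""
    return before if epoch < boundary else after

def exponential_schedule(t, eta0=0.1, r=10000):
    """eta(t) = eta0 * 10 ** (-t / r): divided by 10 every r steps."""
    return eta0 * 10.0 ** (-t / r)

def power_schedule(t, eta0=0.1, r=10000, c=1):
    """eta(t) = eta0 * (1 + t / r) ** (-c)."""
    return eta0 * (1.0 + t / r) ** (-c)

for t in (0, 10000, 20000, 50000):
    print(t, exponential_schedule(t), power_schedule(t))
```

After $r$ steps the exponential schedule has divided the learning rate by 10, while the power schedule with $c = 1$ has only halved it.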

In [11]:
initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10
global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, 
                                          decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
#training_op = optimizer.minimize(loss, global_step=global_step)

After setting the hyperparameter values, a nontrainable variable global_step (initialized to 0) is created to keep track of the current training iteration number. The code then defines an exponentially decaying learning rate with $\eta_0 = 0.1$ and $r = 10{,}000$ using exponential_decay(). Next, it creates the optimizer using this decaying learning rate. The final step is the minimize() training operation (commented out here because no loss is defined in this cell); since that call is passed the global_step variable, it takes care of incrementing it.

Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during training, it is not necessary to add an extra learning schedule for them.