# Notes L12_CustomModels_and_Training_with_Tensorflow.ipynb
* Here are some of the problems you could run into when training a deep neural network (DNN):
  * Gradients may grow ever smaller or ever larger when flowing backward through the DNN during training. Both of these problems make the lower layers very hard to train.
  * You might not have enough training data for such a large network, or it might be too costly to label.
  * Training may be extremely slow.
  * A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.
* These notes look at transfer learning and unsupervised pretraining, which can help you tackle complex tasks even when you have little labeled data, and at various optimizers that can speed up training large models tremendously.

# Training Deep Neural Networks

* Better initialization
  
* Other Activation functions
* Gradient Clipping
* Better Optimizers
* Transfer Learning, Auxiliary output training
  

# Vanishing / Exploding Gradients problem

* Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the gradient descent update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good solution. This is called the **vanishing gradients problem.**

* Partly because of this, neural networks were largely **abandoned** in the early 2000s.
* Two of the **suspects** were the sigmoid activation function and the weight initialization technique used at that time: a normal distribution with a mean of 0 and a standard deviation of 1.
* The sigmoid function saturates for large positive or negative inputs, and its output has a mean of 0.5, not 0, which makes the problem worse.
  * The **tanh** function has a mean of 0 and behaves slightly better than the sigmoid in DNNs.

![image.png](attachment:image.png)
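
The saturation effect described above can be checked numerically. This is a minimal sketch in plain Python (no TensorFlow needed); the helper names `sigmoid` and `sigmoid_grad` are illustrative and not part of the notebook. It shows that the sigmoid's gradient peaks at 0.25 and collapses toward 0 as the input grows.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), at most 0.25 (reached at z = 0)."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The further z is from 0, the closer the gradient gets to 0 (saturation).
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.5f}  gradient={sigmoid_grad(z):.6f}")

# Ignoring the weights, each sigmoid layer scales the backpropagated gradient
# by at most 0.25, so ten stacked layers scale it by at most 0.25 ** 10.
max_factor_after_10_layers = 0.25 ** 10
print(max_factor_after_10_layers)
```

Because the factor shrinks geometrically with depth, the lower layers receive almost no gradient signal, which is the vanishing gradients problem in miniature.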

# Glorot & He initialization

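* For the signal to flow properly in both directions, the variance of each layer's outputs should equal the variance of its inputs, and the gradients should have equal variance before and after flowing through the layer in the reverse direction. It is actually not possible to guarantee both unless the layer has an equal number of inputs and outputs (these numbers are called the fan-in and fan-out of the layer).

* Glorot initialization is a compromise that works well in practice: draw the weights from a normal distribution with mean 0 and variance σ² = 1 / fan_avg, or from a uniform distribution between -r and +r with r = sqrt(3 / fan_avg), where fan_avg = (fan_in + fan_out) / 2.
* The initialization strategy proposed for the ReLU activation function and its variants is called He initialization.
* ![image.png](attachment:image.png)
* Table: initialization parameters for each type of activation function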
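
As a rough illustration of the formulas behind the table, the sketch below computes the standard deviation and the uniform bound for Glorot and He initialization. It assumes the usual variances (1 / fan_avg for Glorot, 2 / fan_in for He); the function names are hypothetical helpers, not Keras APIs.

```python
import math

def init_stddev(fan_in, fan_out, scheme):
    """Standard deviation of the weight distribution for a given scheme.

    Assumed variances: Glorot = 1 / fan_avg, He = 2 / fan_in.
    """
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,
        "he": 2.0 / fan_in,
    }
    return math.sqrt(variances[scheme])

def uniform_limit(stddev):
    """Bound r of a uniform distribution on [-r, r] with the given standard deviation.

    A uniform distribution on [-r, r] has variance r**2 / 3, so r = sqrt(3) * stddev.
    """
    return math.sqrt(3.0) * stddev

# Example: a layer with 300 inputs and 100 outputs
glorot_std = init_stddev(300, 100, "glorot")
he_std = init_stddev(300, 100, "he")
print(glorot_std, uniform_limit(glorot_std))
print(he_std, uniform_limit(he_std))
```

In Keras the same choice is made through `kernel_initializer`, as in the cells that follow.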

In [34]:
import tensorflow as tf 
import warnings
warnings.filterwarnings("ignore")

In [35]:

dense = tf.keras.layers.Dense(50, activation='relu', kernel_initializer='he_normal')

In [36]:
# We can obtain any of the initializations listed in the table above, and more,
# using the VarianceScaling initializer.
# Let's do He initialization with a uniform distribution and fan_avg scaling rather than fan_in

he_avg_init = tf.keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                                    distribution='uniform')
dense = tf.keras.layers.Dense(50, activation='sigmoid',
                              kernel_initializer=he_avg_init)
he_avg_init

<keras.src.initializers.random_initializers.VarianceScaling at 0x14529fa00>

# Better Activation Functions
## ReLU and Leaky ReLU

* The **ReLU** activation function is popular because it does not saturate for positive values and it is very fast to compute.

* Unfortunately, the ReLU activation function is not perfect.
* It suffers from a problem known as the dying ReLUs: during training, some neurons effectively “die”, meaning they stop outputting anything other than 0.
  * In some cases, you may find that half of your network’s neurons are dead, especially if you used a **large learning rate**.
  * A neuron dies when its weights get tweaked in such a way that the input of the ReLU function (i.e., the weighted sum of the neuron’s inputs plus its bias term) is negative for all instances in the training set. When this happens, it just keeps **outputting zeros,** and gradient descent does not affect it anymore because the gradient of the ReLU function is zero when its input is negative.
* To solve this problem, you may want to use a variant of the ReLU function, such as the **leaky ReLU**.
* The leaky ReLU activation function is defined as **LeakyReLU_α(z) = max(αz, z)**.
  * The hyperparameter α defines how much the function “leaks”: it is the slope of the function for z < 0. Having a slope for z < 0 ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.

* A 2015 paper by Bing Xu et al. compared several variants of the ReLU activation function and concluded that the leaky variants outperformed the strict ReLU activation function.
  * In fact, setting α = 0.2 (a huge leak) seemed to result in better performance than α = 0.01 (a small leak).
  * The paper also evaluated the **randomized leaky ReLU (RReLU)**, where α is picked randomly in a given range during training and is fixed to an average value during testing. RReLU also performed fairly well and seemed to act as a regularizer, reducing the risk of overfitting the training set.
  * Finally, the paper evaluated the **parametric leaky ReLU (PReLU)**, where α is allowed to be learned during training: instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter. PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
* ![image.png](attachment:image.png)
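
To make the definition concrete, here is a small plain-Python sketch of ReLU and leaky ReLU with their gradients. The function names are illustrative only, and the default α = 0.2 is simply the "huge leak" value mentioned above.

```python
def relu(z):
    """Strict ReLU: max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 0 for negative inputs, which is why a neuron can die."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.2):
    """Leaky ReLU: max(alpha * z, z), valid for 0 <= alpha <= 1."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.2):
    """Gradient of leaky ReLU: the slope alpha for z < 0, so it is never exactly zero."""
    return 1.0 if z > 0 else alpha

for z in (-3.0, -0.5, 2.0):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For a negative input the ReLU gradient is exactly 0, so no update can reach the neuron's weights, while the leaky variant still passes a fraction α of the gradient.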

In [37]:
# using the leaky ReLU activation function
# (recent Keras versions call this argument negative_slope; alpha is the older name)

leaky_relu = tf.keras.layers.LeakyReLU(alpha=0.2)
dense = tf.keras.layers.Dense(50, activation=leaky_relu,
                              kernel_initializer='he_normal')
leaky_relu

<LeakyReLU name=leaky_re_lu_4, built=True>

In [38]:
# using leaky relu as a separate layer in the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(50, kernel_initializer='he_normal'),
    tf.keras.layers.LeakyReLU(alpha=0.2)
])
model.summary()

* ReLU, leaky ReLU, and PReLU all suffer from the fact that they are not smooth functions: their derivatives change abruptly at z = 0. This sort of discontinuity can make gradient descent bounce around the optimum and slow down convergence. So now we will look at some smooth variants of the ReLU activation function, starting with ELU and SELU.

## ELU and SELU

* **ELU** (exponential linear unit)
* ![image.png](attachment:image.png)
* It takes on negative values when z < 0, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem.
* The hyperparameter α defines the opposite of the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter.
* It has a nonzero gradient for z < 0, which avoids the dead neurons problem.
* If α is equal to 1 then the function is smooth everywhere, including around z = 0, which helps speed up gradient descent since it does not bounce as much to the left and right of z = 0.

* As with other ReLU variants, you should use He initialization.
* The main drawback of ELU is that it is slower to compute than ReLU and its variants, because it uses the exponential function. Its faster convergence rate during training may compensate for that slow computation, but at test time an ELU network will still be a bit slower than a ReLU network.

* **Scaled ELU (SELU)** activation function: as its name suggests, it is a scaled variant of the ELU activation function (about 1.05 times ELU, using α ≈ 1.67).
* ![image-2.png](attachment:image-2.png)
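
The two functions can be sketched in a few lines of plain Python. This assumes the standard ELU definition (α(exp(z) − 1) for z < 0, z otherwise) and the commonly quoted SELU constants, rounded to four decimals; the names are illustrative.

```python
import math

def elu(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

# Commonly quoted SELU constants (rounded): scale of about 1.05, alpha of about 1.67
SELU_SCALE = 1.0507
SELU_ALPHA = 1.6733

def selu(z):
    """SELU: a scaled ELU, i.e. SELU_SCALE * ELU(z, SELU_ALPHA)."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

# For large negative z, ELU approaches -alpha (the "opposite" of alpha).
print(elu(-50.0), elu(-50.0, alpha=2.0))
print(selu(1.0), selu(-50.0))
```

In Keras these are available by name, for example `activation='elu'` or `activation='selu'`.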

## GELU, Swish, and Mish

* All the activation functions discussed so far were both convex and monotonic. The GELU activation function is neither: from left to right, it starts by going straight, then it wiggles down, reaches a low point around -0.17 (near z ≈ -0.75), and finally bounces up and ends up going straight toward the top right.
  * In practice, **GELU** often **outperforms** every other activation function discussed so far. However, it is a bit more computationally intensive, and the performance boost it provides is not always sufficient to justify the extra cost.
* ![image.png](attachment:image.png)

* ReLU remains a good default for simple tasks. However, Swish is probably a better default for more complex tasks, and you can even try parametrized Swish with a learnable β parameter for the most complex tasks.
* **If you care a lot about runtime latency, then you may prefer leaky ReLU, or parametrized leaky ReLU for more complex tasks.**
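
The notes name GELU, Swish, and Mish without giving formulas, so the sketch below uses their standard definitions as an assumption: GELU(z) = z·Φ(z) with Φ the standard normal CDF, Swish_β(z) = z·σ(βz), and Mish(z) = z·tanh(softplus(z)). Function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    """GELU: z times the standard normal CDF evaluated at z."""
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z, beta=1.0):
    """Swish: z * sigmoid(beta * z); beta can be fixed or learned."""
    return z * sigmoid(beta * z)

def mish(z):
    """Mish: z * tanh(softplus(z)), with softplus(z) = log(1 + exp(z))."""
    return z * math.tanh(math.log1p(math.exp(z)))

# GELU dips to roughly -0.17 near z = -0.75, then behaves like z for large z.
for z in (-3.0, -0.75, 0.0, 3.0):
    print(f"z={z:5.2f}  gelu={gelu(z):.4f}  swish={swish(z):.4f}  mish={mish(z):.4f}")
```

The dip below zero for small negative inputs is what makes these functions non-monotonic.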

# Batch Normalization

* Even with the best initialization and activation functions, which reduce the vanishing/exploding gradients problems at the start of training, there is no guarantee that these problems won't come back during training.
* Batch Normalization (BN) consists of adding an operation in the model just before or after the activation function of each hidden layer.
  * This operation simply zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer: one for scaling, the other for shifting. In other words, the operation lets the model learn the optimal scale and mean of each of the layer’s inputs.
  * In many cases, if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set.

* The authors of the BN paper reported that the vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the sigmoid activation function.
* The networks were also much less sensitive to the weight initialization.
* They were able to use much larger learning rates, significantly speeding up the learning process.
* **Batch Normalization acts like a regularizer.**
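
The zero-center, normalize, scale, and shift steps can be sketched for a single feature in plain Python. This is a training-time illustration only (at test time BN uses moving averages instead of batch statistics); the names `gamma` (scale), `beta` (shift), and the smoothing term `eps` are assumptions chosen to mirror the Keras variable names shown later.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    # eps avoids a division by zero when the batch variance is tiny
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)
out_mean = sum(out) / len(out)
out_var = sum((y - out_mean) ** 2 for y in out) / len(out)
print(out)
print(out_mean, out_var)  # mean is beta, variance is close to gamma ** 2
```

After the operation, the feature's mean and scale are set by `beta` and `gamma` rather than by the raw data, which is what lets the model learn them.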

## Implementing Batch Normalization with Keras

In [39]:
## Batch Normalization layer after the activation function
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),  
    tf.keras.layers.BatchNormalization(),
    
    tf.keras.layers.Dense(300, activation='relu',
                          kernel_initializer='he_normal'),  
    tf.keras.layers.BatchNormalization(),
    
    tf.keras.layers.Dense(100,activation='relu',
                          kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    
    tf.keras.layers.Dense(10, activation='softmax') 
])
model.summary()
## Each BN layer adds four parameters per input: gamma and beta are trainable,
## while moving_mean and moving_variance are not affected by backpropagation

In [40]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]

In [41]:
# Batch Normalization before the activation function

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28,28]),
    tf.keras.layers.Dense(300, kernel_initializer='he_normal', use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    
    tf.keras.layers.Dense(100, kernel_initializer='he_normal', use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation('relu'),
    
    tf.keras.layers.Dense(10, activation='softmax')     
])

model.summary()

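* Whether to add the BN layers before or after the activation functions is a matter of debate.
* When BN comes before the activation, the previous layer's bias is redundant with BN's own shift parameter, which is why the Dense layers above use use_bias=False.
* You can experiment with both placements to see which one gives your model the best performance.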

In [42]:
# saving some model for future work purposes
model.save('my_model.h5')   # save the model



# Gradient clipping

* Another technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold. This is called gradient clipping.
* This technique is generally used in recurrent neural networks, where using batch normalization is **tricky**.

In [43]:
# clipvalue=1.0 clips every component of the gradient vector to a value between -1.0 and 1.0
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3, clipvalue=1.0)
model.compile(optimizer=optimizer)
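
Clipping by value can change the direction of the gradient, while clipping by norm (the `clipnorm` argument in Keras optimizers) rescales the whole vector and preserves its direction. The plain-Python sketch below illustrates the difference; the helper names are hypothetical and are not Keras functions.

```python
import math

def clip_by_value(grad, clip_value):
    """Clip each component independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its L2 norm exceeds clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
by_value = clip_by_value(grad, 1.0)  # second component dominated before, now both are similar
by_norm = clip_by_norm(grad, 1.0)    # same direction as grad, but with norm 1.0
print(by_value)
print(by_norm)
```

With this example, value clipping turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal, whereas norm clipping keeps the original orientation and almost eliminates the first component.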

# Reusing pretrained layers
## Transfer Learning

* When a neural network already exists for a similar task, you can generally reuse most of its layers, except for the top ones.
* This technique is called **transfer learning**.
* It will not only speed up training considerably, but also requires significantly less training data.
* If the input pictures for your new task don’t have the same size as the ones used in the original task, you will usually have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will work best when the inputs have similar low-level features.
  * ![image.png](attachment:image.png)

* The output layer of the original model should usually be replaced, because it is most likely not useful for the new task and may not even have the right number of outputs.
* Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task.
  * You want to find the **right number of layers to reuse.**

* The more similar the tasks are, the more layers you will want to reuse (starting with the lower layers). For very similar tasks, try to keep all the hidden layers and just replace the output layer.
* Try freezing all the reused layers first (i.e., make their weights non-trainable so that gradient descent won't modify them), then train your model and see how it performs.
* Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze.
* If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freezing all the remaining hidden layers again.

## Transfer learning with Keras

In [44]:
# let's load the model saved earlier and build a new model on top of its layers
model_a = tf.keras.models.load_model('my_model.h5')
model_b_on_a = tf.keras.Sequential(model_a.layers[:-1])
model_b_on_a.add(tf.keras.layers.Dense(1, activation='sigmoid'))

## model_b_on_a shares its layers with model_a, so training model_b_on_a also modifies model_a
## to avoid this, clone model_a (and copy its weights) before reusing its layers



In [45]:
## clone model_a to keep an independent copy that does not share layers
model_a_clone = tf.keras.models.clone_model(model_a)  ## clone_model() only copies the architecture
model_a_clone.set_weights(model_a.get_weights())  ## the weights are not cloned, so copy them manually
model_a_clone.summary()

* You must always compile your model after you freeze or unfreeze layers.
* After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights.


In [46]:
for layer in model_b_on_a.layers[:-1]:
    layer.trainable = False
    
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
model_b_on_a.compile(loss='binary_crossentropy', optimizer=optimizer,
                     metrics=['accuracy'])


In [47]:
# history = model_b_on_a.fit(xtrainb, ytrainb, epochs=4,
#                            validation_data=(xvalidb, yvalidb))  ## uncomment once the task B data is loaded

# unfreeze the reused layers, then compile again before continuing training
for layer in model_b_on_a.layers[:-1]:
    layer.trainable = True

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
model_b_on_a.compile(loss='binary_crossentropy', optimizer=optimizer,
                     metrics=['accuracy'])
# history = model_b_on_a.fit(xtrainb, ytrainb, epochs=16,
#                            validation_data=(xvalidb, yvalidb))  ## uncomment once the task B data is loaded


# “torturing the data until it confesses”
* It turns out that transfer learning does not work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks. Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are much more general.

# Unsupervised Pretraining

* If you can gather plenty of unlabeled training data, you can try to use it to train an unsupervised model, such as an autoencoder or a generative adversarial network (GAN).
* Then you can reuse the lower layers of the autoencoder or the lower layers of the GAN’s discriminator, add the output layer for your task on top, and fine-tune the final network using supervised learning.
* ![image.png](attachment:image.png)
* In the image above: in unsupervised pretraining, a model is trained on all data, including the unlabeled data, using an unsupervised learning technique, then it is fine-tuned for the final task on just the labeled data using a supervised learning technique. The unsupervised part may train one layer at a time as shown here, or it may train the full model directly.

# Pretraining on an auxiliary task

* One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.
* Self-supervised learning is when you automatically generate the labels from the data itself (for example, by masking out some words in a text and training the model to predict them), then train a model on the resulting “labeled” dataset using supervised learning techniques.

# Fast Optimizers

* Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training (and reach a better solution): applying a good initialization strategy for the connection weights, using a good activation function, using batch normalization, and reusing parts of a pretrained network.

## Momentum

* Regular gradient descent is generally much slower to reach the minimum than momentum optimization.
* Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient (multiplied by the learning rate η) from the momentum vector m, and it updates the weights by adding this momentum vector.
  * ![image.png](attachment:image.png)
* The hyperparameter β, called the momentum, acts like friction: it is set between 0 (high friction) and 1 (no friction), and a typical value is 0.9. It allows momentum optimization to escape from plateaus much faster than gradient descent.
* Gradient descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, momentum optimization will roll down the valley faster and faster until it reaches the bottom (the optimum).
* Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons it’s good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.
* The one drawback of momentum optimization is that it adds yet another hyperparameter to tune.

In [48]:
# implement momentum optimizer in Keras
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9)
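
The update rule described above (m ← βm − η∇J(θ), then θ ← θ + m) can be sketched in plain Python. The example assumes a constant gradient of 1.0, standing in for a plateau with a gentle constant slope; the function name and the values of `lr` and `beta` are illustrative.

```python
def momentum_step(theta, m, grad, lr=0.1, beta=0.9):
    """One momentum update: m <- beta * m - lr * grad, then theta <- theta + m."""
    m = beta * m - lr * grad
    return theta + m, m

# On a constant slope, the step size grows until friction balances the gradient.
theta, m = 0.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, grad=1.0)

terminal_velocity = m          # approaches -lr / (1 - beta)
plain_gd_step = -0.1 * 1.0     # what regular gradient descent would move per step
print(terminal_velocity, plain_gd_step)
```

With β = 0.9 the terminal velocity is about 10 times the plain gradient descent step, which is why momentum crosses plateaus so much faster.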


## Nesterov Accelerated Gradient (NAG)

* NAG measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm.
* ![image.png](attachment:image.png)
* This small tweak works because the momentum vector generally points in the right direction (toward the optimum), so NAG ends up slightly closer to the optimum at each step.
* After a while, these small improvements add up and NAG ends up being significantly faster than regular momentum optimization.

In [49]:
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9, nesterov=True)
optimizer.variables

[<Variable path=SGD/iteration, shape=(), dtype=int64, value=0>,
 <Variable path=SGD/learning_rate, shape=(), dtype=float32, value=0.0010000000474974513>]
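
The only difference between regular momentum and NAG is where the gradient is measured. The sketch below makes that explicit on an assumed toy cost J(θ) = θ², whose gradient is 2θ; the function names and starting values are illustrative.

```python
def grad(theta):
    """Gradient of the toy cost J(theta) = theta ** 2."""
    return 2.0 * theta

def momentum_step(theta, m, lr=0.1, beta=0.9):
    """Regular momentum: gradient measured at the current position theta."""
    m = beta * m - lr * grad(theta)
    return theta + m, m

def nag_step(theta, m, lr=0.1, beta=0.9):
    """NAG: gradient measured slightly ahead, at theta + beta * m."""
    m = beta * m - lr * grad(theta + beta * m)
    return theta + m, m

# Same starting point and momentum, one step of each method
theta_mom, m_mom = momentum_step(1.0, -0.1)
theta_nag, m_nag = nag_step(1.0, -0.1)
print(theta_mom, m_mom)
print(theta_nag, m_nag)
```

Because the look-ahead point is already closer to the minimum here, NAG sees a smaller gradient and takes a slightly more cautious step, which is what dampens the oscillations of regular momentum.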

## AdaGrad

* Consider the elongated bowl problem again: gradient descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley. It would be nice if the algorithm could correct its direction earlier to point a bit more toward the global optimum. The AdaGrad algorithm achieves this correction by scaling down the gradient vector along the steepest dimensions.
* In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum.
  * One additional benefit is that it requires much less tuning of the learning rate hyperparameter η.
* AdaGrad frequently performs well for simple quadratic problems, but it often stops too early when training neural networks, because the learning rate gets scaled down too much.
* You should not use it to train deep neural networks, but it may be efficient for simpler tasks such as linear regression.
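
The scaling described above can be sketched in plain Python, assuming the usual AdaGrad formulation: accumulate the squared gradients in a vector s, then divide each gradient component by sqrt(s + ε). The function name and the values of `lr` and `eps` are illustrative.

```python
import math

def adagrad_step(theta, s, grad, lr=0.1, eps=1e-10):
    """One AdaGrad update on lists of parameters.

    s accumulates the squared gradients since the start of training, so steep
    dimensions (large gradients) get their steps scaled down the most.
    """
    s = [si + g * g for si, g in zip(s, grad)]
    theta = [t - lr * g / math.sqrt(si + eps) for t, g, si in zip(theta, grad, s)]
    return theta, s

# A steep dimension (gradient 10.0) and a gentle one (gradient 0.1)
theta, s = adagrad_step([0.0, 0.0], [0.0, 0.0], grad=[10.0, 0.1])
print(theta, s)

# A second step with the same gradients is smaller: s keeps growing,
# which is why AdaGrad can stop too early on long training runs.
theta2, s2 = adagrad_step(theta, s, grad=[10.0, 0.1])
print(theta2, s2)
```

Despite a 100-fold difference in gradient size, both dimensions move by about the same amount on the first step, so the update points more directly toward the optimum.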

## RMSProp

* AdaGrad runs the risk of slowing down a bit too fast and never converging to the global optimum. RMSProp fixes this by accumulating only the gradients from the most recent iterations (using exponential decay), as opposed to all the gradients since the beginning of training.
* Except on very simple problems, this optimizer almost always performs much better than AdaGrad.

In [50]:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
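
As a rough sketch of what changes relative to AdaGrad, the plain-Python step below replaces the running sum of squared gradients with an exponentially decaying average controlled by `rho`. The function name and the exact placement of the smoothing term `eps` are assumptions; library implementations differ in such details.

```python
import math

def rmsprop_step(theta, s, grad, lr=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update for a single parameter.

    s is an exponentially decaying average of squared gradients, so old
    gradients are gradually forgotten instead of accumulating forever.
    """
    s = rho * s + (1.0 - rho) * grad * grad
    theta = theta - lr * grad / math.sqrt(s + eps)
    return theta, s

theta, s = rmsprop_step(theta=1.0, s=0.0, grad=2.0)
print(theta, s)
```

The decay rate `rho` plays the role of the `rho=0.9` argument in the Keras cell above.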

## Adam

* Adam, which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp.
* Just like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.
  * These are estimations of the mean and (uncentered) variance of the gradients. The mean is often called the first moment while the variance is often called the second moment, hence the name of the algorithm.

In [51]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                     beta_2=0.999)

* Since Adam is an adaptive learning rate algorithm, like AdaGrad and RMSProp, it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than gradient descent.
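
The two moving averages can be sketched in plain Python. This follows one common formulation of Adam, including the bias-correction step that compensates for m and v starting at 0; sign conventions and the placement of `eps` vary between references, and the function name is illustrative.

```python
import math

def adam_step(theta, m, v, grad, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update for a single parameter at iteration t (starting at 1)."""
    m = beta_1 * m + (1.0 - beta_1) * grad          # first moment: mean of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad * grad   # second moment: mean of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)                 # bias correction, since m starts at 0
    v_hat = v / (1.0 - beta_2 ** t)                 # bias correction, since v starts at 0
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, m=0.0, v=0.0, grad=5.0, t=1)
print(theta, m, v)
```

On the first step the update is close to the learning rate itself regardless of the gradient's magnitude, which is part of why the default η = 0.001 works across many problems.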

## AdaMax, Nadam, AdamW

* When you are not satisfied with your model's performance using an adaptive optimizer such as Adam, try using NAG instead: your dataset may just be allergic to adaptive gradients.
* All the optimization techniques discussed so far rely only on the first-order partial derivatives (Jacobians).