# The vanishing/exploding gradients problem

fan_in = number of inputs coming into the layer
fan_out = number of neurons in that layer
## Glorot --> tanh, logistic, softmax
## He initialization --> ReLU
## LeCun --> SELU
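
As a rough sketch of what the three schemes above mean, the snippet below computes the target weight variance for each one from fan_in and fan_out, then samples a weight matrix with that variance. The helper name `init_variance` and the layer sizes are made up for illustration. It uses a plain normal distribution, whereas Keras's `*_normal` initializers draw from a truncated normal, so this is not a reproduction of the Keras code.

```python
import numpy as np

def init_variance(fan_in, fan_out, scheme):
    """Target variance of the weights for a given initialization scheme."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        return 1.0 / fan_avg   # Glorot: based on the average of fan_in and fan_out
    if scheme == "he":
        return 2.0 / fan_in    # He: based on fan_in, with a factor of 2
    if scheme == "lecun":
        return 1.0 / fan_in    # LeCun: based on fan_in
    raise ValueError(f"unknown scheme: {scheme}")

fan_in, fan_out = 300, 100  # hypothetical layer: 300 inputs, 100 neurons
rng = np.random.default_rng(42)
for scheme in ("glorot", "he", "lecun"):
    var = init_variance(fan_in, fan_out, scheme)
    W = rng.normal(0.0, np.sqrt(var), size=(fan_in, fan_out))
    print(f"{scheme:>6}: target variance {var:.5f}, sample variance {W.var():.5f}")
```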

In [1]:
# by default, Keras uses Glorot initialization; you can change it like this

import tensorflow as tf
import keras

keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")


<Dense name=dense, built=False>

In [2]:
# If you want He initialization with a uniform distribution but based on fan_avg rather
# than fan_in, you can use the VarianceScaling initializer like this:

he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg", distribution="uniform")

keras.layers.Dense(10, activation="sigmoid", kernel_initializer=he_avg_init)


<Dense name=dense_1, built=False>

One of the insights in the 2010 paper by Glorot and Bengio was that the problems with unstable gradients were in part due to a poor choice of activation function. Until then most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks, in particular the ReLU activation function, mostly because it does not saturate for positive values (and because it is fast to compute).

## Dying ReLU (the weighted sum becomes negative, where the slope is zero, so gradient descent no longer updates the neuron)

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as the dying ReLUs: during training, some neurons effectively “die,” meaning they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. When this happens, it just keeps outputting zeros, and Gradient Descent does not affect it anymore because the gradient of the ReLU function is zero when its input is negative.

To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as LeakyReLU_α(z) = max(αz, z) (see Figure 11-2). The hyperparameter α defines how much the function “leaks”: it is the slope of the function for z < 0 and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up.

1. Leaky ReLU
2. Randomized leaky ReLU (RReLU)
3. Parametric leaky ReLU (PReLU)
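
A minimal NumPy sketch of why a plain ReLU can die while a leaky ReLU cannot: for negative inputs the ReLU gradient is exactly 0, whereas the leaky version keeps a small slope α. The function names and the sample inputs are illustrative only, and α = 0.01 is the typical value mentioned above.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # slope is 0 for negative inputs: no gradient flows, so the neuron cannot recover
    return np.where(z > 0, 1.0, 0.0)

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): slope alpha for z < 0, slope 1 otherwise (assumes 0 < alpha < 1)
    return np.maximum(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    # slope is alpha (small but non-zero) for negative inputs
    return np.where(z < 0, alpha, 1.0)

z = np.array([-3.0, -0.5, 2.0])
print("relu:            ", relu(z))
print("relu grad:       ", relu_grad(z))
print("leaky relu:      ", leaky_relu(z))
print("leaky relu grad: ", leaky_relu_grad(z))
```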

# Exponential linear unit (ELU): a newer activation function

## Often performs much better than ReLU

The main drawback of the ELU activation function is that it is slower to compute than the ReLU function and its variants (due to the use of the exponential function). Its faster convergence rate during training compensates for that slow computation, but still, at test time an ELU network will be slower than a ReLU network.

Then, a 2017 paper by Günter Klambauer et al. introduced the Scaled ELU (SELU) activation function: as its name suggests, it is a scaled variant of the ELU activation function. The authors showed that if you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the vanishing/exploding gradients problem. As a result, the SELU activation function often significantly outperforms other activation functions for such neural nets (especially deep ones). There are, however, a few conditions for self-normalization to happen (see the paper for the mathematical justification):
• The input features must be standardized (mean 0 and standard deviation 1).
• Every hidden layer’s weights must be initialized with LeCun normal initialization. In Keras, this means setting kernel_initializer="lecun_normal".
• The network’s architecture must be sequential. Unfortunately, if you try to use SELU in nonsequential architectures, such as recurrent networks (see Chapter 15) or networks with skip connections (i.e., connections that skip layers, such as in Wide & Deep nets), self-normalization will not be guaranteed, so SELU will not necessarily outperform other activation functions.
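
The self-normalizing behaviour described above can be checked with a small NumPy experiment: standardized inputs are pushed through a deep stack of dense layers with LeCun normal weights and SELU activations, and the mean and standard deviation of the final activations are printed. This is only a sketch under assumptions: the layer count, layer width, and sample size are arbitrary, biases are left out, and the two SELU constants are the commonly quoted values.

```python
import numpy as np

# Commonly quoted SELU constants (assumed here, not taken from the notes above)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def elu(z, alpha=1.0):
    # alpha * (exp(z) - 1) for z < 0, identity otherwise
    return np.where(z < 0, alpha * (np.exp(z) - 1.0), z)

def selu(z):
    # SELU is a scaled ELU with a specific alpha
    return SELU_SCALE * elu(z, SELU_ALPHA)

rng = np.random.default_rng(42)
n_units = 100
Z = rng.normal(size=(500, n_units))  # standardized inputs: mean 0, std 1

for layer in range(50):
    # LeCun normal initialization: variance 1 / fan_in
    W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
    Z = selu(Z @ W)

final_mean, final_std = Z.mean(), Z.std()
print(f"after 50 layers: mean {final_mean:.2f}, std {final_std:.2f}")
```

If the activations stay close to mean 0 and standard deviation 1 after many layers, the signal is neither vanishing nor exploding on the forward pass.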

### Which activation function to use, then?
So, which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, in general SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don’t want to tweak yet another hyperparameter, you may use the default α values used by Keras (e.g., 0.3 for leaky ReLU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU if your network is overfitting or PReLU if you have a huge training set. That said, because ReLU is the most used activation function (by far), many libraries and hardware accelerators provide ReLU-specific optimizations; therefore, if speed is your priority, ReLU might still be the best choice.

In [4]:
# To use the leaky ReLU activation function, create a LeakyReLU layer and add it to your model just after the layer you want to apply it to:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(negative_slope=0.2)  # applies leaky ReLU to the layer above (the argument is named alpha in Keras 2)
])



In [8]:
# apply PReLU activation
# To use the PReLU activation function, create a PReLU layer and add it to your model just after the layer you want to apply it to:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.PReLU()  # applies PReLU to the layer above (alpha is learned during training)
])



# SELU activation (with LeCun normal initialization)
layer = keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

## No need to use StandardScaler, use batch normalization layer instead

If you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set (e.g., using a StandardScaler); the BN layer will do it for you (well, approximately, since it only looks at one batch at a time, and it can also rescale and shift each input feature).
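
To make the "BN standardizes the batch for you" point concrete, here is a minimal NumPy sketch of the training-time forward pass of batch normalization: standardize each feature using the batch mean and variance, then rescale with gamma and shift with beta. The function name `batch_norm_train` and the toy batch are made up for illustration, and the moving averages that a real BN layer keeps for inference are left out.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Training-time BN: standardize per feature over the batch, then scale and shift."""
    mu = X.mean(axis=0)                     # per-feature batch mean
    var = X.var(axis=0)                     # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero-centered, roughly unit variance
    return gamma * X_hat + beta             # learnable rescaling and shift

rng = np.random.default_rng(0)
X_batch = rng.normal(loc=50.0, scale=10.0, size=(32, 3))  # unscaled toy features
gamma, beta = np.ones(3), np.zeros(3)                     # initial values: no rescale, no shift

out = batch_norm_train(X_batch, gamma, beta)
print("input mean per feature: ", X_batch.mean(axis=0).round(2))
print("output mean per feature:", out.mean(axis=0).round(2))
print("output std per feature: ", out.std(axis=0).round(2))
```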

In [None]:
#### Batch normalization

# applying batch normalization in Keras

model = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),  # explicit Input instead of passing input_shape to Flatten
    keras.layers.Flatten(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")  # output layer; no batch normalization needed after it
])



In [11]:
model.summary()  # each BN layer adds four parameters per input: gamma, beta, moving mean, moving variance (the last two are non-trainable)
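
A quick arithmetic check of the "four parameters per input" remark, assuming the architecture defined above (784 flattened inputs, dense layers of 300, 100 and 10 units, and a BN layer in front of each dense layer). The variable names are illustrative.

```python
layer_sizes = [784, 300, 100, 10]  # flattened 28x28 input, two hidden layers, output layer

# Each BN layer sits in front of a Dense layer, so it sees 784, 300 and 100 features
bn_inputs = layer_sizes[:-1]
bn_params = [4 * n for n in bn_inputs]  # gamma, beta, moving mean, moving variance

# Dense layer: one weight per (input, neuron) pair plus one bias per neuron
dense_params = [n_in * n_out + n_out
                for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

total_params = sum(bn_params) + sum(dense_params)
non_trainable = sum(2 * n for n in bn_inputs)  # moving mean and moving variance

print(bn_params)      # [3136, 1200, 400]
print(dense_params)   # [235500, 30100, 1010]
print(total_params)   # 271346
print(non_trainable)  # 2368
```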

In [12]:
### Gradient clipping

## clips every gradient component so it stays within a fixed range

optimizer = keras.optimizers.SGD(clipvalue=1.0)  # clips each gradient component to [-1, 1] to prevent the exploding gradients problem
model.compile(optimizer=optimizer, loss="mse")

In [13]:
## Tip: use clipnorm instead of clipvalue to preserve the direction of the gradient vector

### But clipping by norm can shrink a component to almost nothing when the gradient is small in one direction and large in another,
## so it is worth trying both options (with different thresholds) and comparing them on the validation set
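
The difference between the two clipping modes is easiest to see on a single gradient vector. The NumPy sketch below mimics what clipping by value and clipping by norm do; the helper names are made up and this is not the Keras implementation.

```python
import numpy as np

def clip_by_value(grad, limit):
    # clip every component independently to [-limit, limit]; this can change the direction
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm):
    # rescale the whole vector if its L2 norm exceeds max_norm; the direction is preserved
    norm = np.linalg.norm(grad)
    if norm <= max_norm:
        return grad
    return grad * (max_norm / norm)

grad = np.array([0.9, 100.0])            # small in one direction, huge in the other
by_value = clip_by_value(grad, 1.0)      # first component kept, second cut down to 1.0
by_norm = clip_by_norm(grad, 1.0)        # same direction, but first component nearly vanishes

print("clipvalue=1.0:", by_value)
print("clipnorm=1.0: ", by_norm)
```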


In [None]:
bn_layer = model.layers[1] # batch normalisation layer


[]

In [25]:
### reusing pretrained layers

# reuse the layers of "model" in a new model

model_layers_without_output_layer = model.layers[:-1]
new_model = keras.models.Sequential(model_layers_without_output_layer)
new_model.add(keras.layers.Dense(1, activation="sigmoid"))  # output layer for our new model

Note that `model` and `new_model` now share some layers (they are model_A and model_B_on_A in the original example). When you train `new_model`, it will also affect `model`. If you want to avoid that, you need to clone the model before you train it. To do this, you clone the architecture with clone_model(), then copy the weights (since clone_model() does not clone the weights):

In [31]:
new_model_clone = keras.models.clone_model(new_model)
new_model_clone.set_weights(new_model.get_weights())  # now you can train the clone without altering the original model's layer parameters

In [32]:
## making reused layers untrainable

for layer in new_model_clone.layers[:-1]:  # every layer except the new output layer
    layer.trainable = False

# recompile the model whenever you change a layer's settings (such as freezing or unfreezing it)
new_model_clone.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])


So, the recipe is:
1. First train the model with all reused layers set as untrainable, so that the output layer weights of new_model_clone can adjust to the reused layers, say for 4 epochs.

2. Then set all layers as trainable again, recompile, and train for the rest of the epochs.

3. Bonus: also reduce the learning rate so we do not completely ruin the reused layers' weights.

# And boom, we get great accuracy!

In [None]:
# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
#                            validation_data=(X_valid_B, y_valid_B))
#
# for layer in model_B_on_A.layers[:-1]:
#     layer.trainable = True
#
# optimizer = keras.optimizers.SGD(learning_rate=1e-4)  # the default learning rate is 1e-2
# model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
#                      metrics=["accuracy"])
# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
#                            validation_data=(X_valid_B, y_valid_B))

## Optimizers: I have gone through them and made notes on when to use which one

In [36]:
# implementing momentum with SGD

optimizer = keras.optimizers.SGD(momentum=0.9, learning_rate=0.001)  # sets beta (the momentum) to 0.9 in the SGD optimizer



In [37]:
# implementing NAG (Nesterov accelerated gradient)

# NAG is generally faster than regular momentum optimization. To use it, simply set nesterov=True when creating the SGD optimizer:

optimizer = keras.optimizers.SGD(momentum=0.9, learning_rate=0.001, nesterov=True)  # NAG implemented

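
For intuition, here is a NumPy sketch of the textbook update rules behind the two optimizers above: regular momentum measures the gradient at the current position, while NAG measures it slightly ahead, at theta + beta * m. The toy loss, the starting point, and the helper names are assumptions made for illustration, and the code follows the textbook equations rather than Keras's internal implementation.

```python
import numpy as np

def grad(theta):
    # gradient of the toy loss J(theta) = 0.5 * (theta[0]**2 + 10 * theta[1]**2)
    return np.array([1.0, 10.0]) * theta

def momentum_step(theta, m, lr=0.01, beta=0.9):
    m = beta * m - lr * grad(theta)             # gradient at the current position
    return theta + m, m

def nesterov_step(theta, m, lr=0.01, beta=0.9):
    m = beta * m - lr * grad(theta + beta * m)  # gradient measured slightly ahead
    return theta + m, m

def run(step, n_steps=200):
    theta, m = np.array([5.0, 5.0]), np.zeros(2)
    for _ in range(n_steps):
        theta, m = step(theta, m)
    return theta

theta_momentum = run(momentum_step)
theta_nesterov = run(nesterov_step)
print("momentum:", theta_momentum)
print("nesterov:", theta_nesterov)
```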

In [38]:
# implementing AdaGrad (for sparse data): adaptive learning rate

## Keras has an Adagrad optimizer, but you should not use it to train deep neural networks (it may be efficient for simpler tasks such as Linear Regression, though).

optimizer = keras.optimizers.Adagrad(learning_rate=0.001)  # needs much less learning-rate tuning than plain SGD





In [39]:
## RMSProp (a favourite), for sparse data: it is built on top of AdaGrad's concepts and deals with the vanishing learning rate problem
# it does so by using exponential decay


optimizer = keras.optimizers.RMSprop(rho=0.9)  # rho plays the role of beta here: the larger rho is, the more weight old squared gradients keep (and the less the newest ones get)
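
The "vanishing learning rate" point can be illustrated with a few lines of NumPy. Both optimizers divide the learning rate by the square root of an accumulator s; AdaGrad keeps adding squared gradients to s forever, while RMSProp uses an exponentially decaying average. The sketch assumes a constant gradient of 1.0 purely to keep the arithmetic easy to follow, and the helper names are made up.

```python
import numpy as np

def adagrad_accumulate(s, g):
    return s + g ** 2                      # grows without bound

def rmsprop_accumulate(s, g, rho=0.9):
    return rho * s + (1 - rho) * g ** 2    # decaying average: old gradients fade out

lr, eps = 0.001, 1e-7
s_ada, s_rms = 0.0, 0.0
g = 1.0  # pretend the gradient stays constant

for step in range(100):
    s_ada = adagrad_accumulate(s_ada, g)
    s_rms = rmsprop_accumulate(s_rms, g)

eff_lr_ada = lr / np.sqrt(s_ada + eps)  # effective step size for AdaGrad
eff_lr_rms = lr / np.sqrt(s_rms + eps)  # effective step size for RMSProp

print(f"AdaGrad effective learning rate after 100 steps: {eff_lr_ada:.6f}")
print(f"RMSProp effective learning rate after 100 steps: {eff_lr_rms:.6f}")
```

After 100 identical gradients, the AdaGrad step has shrunk by a factor of 10, while the RMSProp step stays close to the configured learning rate.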



In [None]:
# Adam optimizer (the best of the lot!): combines momentum with an RMSProp-style exponentially decaying average of squared gradients in a single optimizer

adam_opti = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# beta_1 is the momentum decay rate
# beta_2 is the decay rate for the running average of squared gradients


# Nadam is basically Adam with the Nesterov trick
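
To show how beta_1 and beta_2 are used, here is a NumPy sketch of a single Adam update following the standard published equations (with bias correction). The function name `adam_step`, the toy loss, and the learning rate of 0.01 are choices made for this illustration; Keras's own implementation may differ in small details.

```python
import numpy as np

def adam_step(theta, grad, m, s, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-7):
    m = beta_1 * m + (1 - beta_1) * grad        # momentum-style average of gradients
    s = beta_2 * s + (1 - beta_2) * grad ** 2   # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta_1 ** t)               # bias correction (m and s start at 0)
    s_hat = s / (1 - beta_2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

theta = np.array([5.0, -3.0])
m, s = np.zeros_like(theta), np.zeros_like(theta)

for t in range(1, 5001):
    grad = 2 * theta  # gradient of the toy loss J(theta) = sum(theta ** 2)
    theta, m, s = adam_step(theta, grad, m, s, t, lr=0.01)

print("theta after 5000 Adam steps:", theta)
```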