<center><font size="10"> Problems And Issues In Neural Networks </font></center>

In [8]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt

#### The backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way.
#### Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' connection weights virtually unchanged, and training never converges to a good solution. This is called the **VANISHING GRADIENTS** problem. In some cases, the opposite can happen: the gradients grow bigger and bigger, so many layers get extremely large weight updates and the algorithm diverges. This is the **EXPLODING GRADIENTS** problem, which is mostly encountered in recurrent neural networks.
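
#### As a rough illustration of why gradients can vanish, the sketch below assumes sigmoid activations and ignores the weights entirely (a simplification, not an exact model of a real network). The sigmoid's derivative never exceeds 0.25, so under these assumptions the backpropagated gradient is scaled by at most 0.25 per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0, where it equals 0.25.
max_factor = sigmoid_grad(0.0)

# Upper bound on the gradient scale after flowing back through n layers
# (weights ignored, which is an assumption of this toy example).
for n_layers in (1, 5, 10, 20):
    print(n_layers, max_factor ** n_layers)
```

#### Under these assumptions, the bound shrinks geometrically with depth, which is why the lowest layers barely move.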

#### To alleviate these problems, the signal needs to flow properly in both directions: we need the **variance of the outputs of each layer to be equal to the variance of its inputs**, and we also need the gradients to have **equal variance before and after flowing through a layer in the reverse direction**. It is not possible to guarantee both unless the layer has an equal number of inputs and neurons (these numbers are called the **fan-in and fan-out** of the layer).

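#### To overcome this, we initialize the weights randomly with a variance scaled by the layer's fan-in and fan-out: this is **Xavier (Glorot) initialization**, or **He initialization** for ReLU and its variants.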
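
#### The effect of the initialization scale can be sketched with plain NumPy. The example below is a toy experiment, not the Keras implementation: it pushes random inputs through a stack of ReLU layers (the width, depth, and helper name `forward_std` are arbitrary choices for illustration) and compares He normal initialization, which draws weights with variance 2 / fan_in, against a fixed small standard deviation of 0.01.

```python
import numpy as np

rng = np.random.default_rng(42)

def forward_std(init_std_fn, n_layers=20, width=256, n_samples=500):
    """Push random inputs through n_layers ReLU layers; return the final output std."""
    x = rng.standard_normal((n_samples, width))
    for _ in range(n_layers):
        # Weights drawn from a normal distribution with the chosen std.
        W = rng.standard_normal((width, width)) * init_std_fn(width, width)
        x = np.maximum(0.0, x @ W)  # ReLU
    return x.std()

def he_std(fan_in, fan_out):
    # He normal: variance = 2 / fan_in
    return np.sqrt(2.0 / fan_in)

def small_std(fan_in, fan_out):
    # Naive choice: a fixed small std regardless of layer size
    return 0.01

he_out = forward_std(he_std)
small_out = forward_std(small_std)
print("He initialization:", he_out)
print("std = 0.01:       ", small_out)
```

#### With He initialization the signal keeps roughly the same scale through all 20 layers, whereas with the fixed small std it collapses toward zero.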

In [4]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<keras.layers.core.dense.Dense at 0x21c5b2d9610>

#### If you want He initialization with a uniform distribution, but based on $ fan_{avg} $ rather than $ fan_{in} $, you can use the VarianceScaling initializer like this:

In [5]:
init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg',
                                        distribution='uniform')

keras.layers.Dense(10, activation="relu", kernel_initializer=init)

<keras.layers.core.dense.Dense at 0x21c10fd2e80>
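
#### To see what this initializer does numerically, here is a minimal sketch of the sampling range. It relies on the documented VarianceScaling rule for the uniform distribution, limit = sqrt(3 * scale / n), where n is fan_in, fan_out, or their average depending on `mode`. The helper name and the fan-in of 300 are hypothetical, chosen only for illustration.

```python
import math

def variance_scaling_uniform_limit(fan_in, fan_out, scale=2.0, mode="fan_avg"):
    """Return the limit such that weights are sampled from U(-limit, limit)."""
    if mode == "fan_in":
        n = fan_in
    elif mode == "fan_out":
        n = fan_out
    elif mode == "fan_avg":
        n = (fan_in + fan_out) / 2.0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return math.sqrt(3.0 * scale / n)

# Hypothetical layer: 300 inputs feeding the Dense(10) layer above.
limit = variance_scaling_uniform_limit(fan_in=300, fan_out=10)
print("limit:", limit)

# A uniform distribution on [-limit, limit] has variance limit**2 / 3,
# which here equals scale / fan_avg = 2 / 155.
print("variance:", limit ** 2 / 3.0)
```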

### Nonsaturating Activation Functions

#### The ReLU activation function suffers from a problem known as the **dying ReLUs**: during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network’s neurons are dead, especially if you used a large learning rate.

#### To solve this problem, you may want to use a variant of the ReLU function, such as the **leaky ReLU**, the **randomized leaky ReLU (RReLU)**, or the **parametric leaky ReLU (PReLU)**. The three differ only in how the negative-side slope α is chosen: it is fixed for leaky ReLU, sampled at random during training for RReLU, and learned during training for PReLU. You can also use the **exponential linear unit (ELU)** or the **Scaled ELU (SELU)**.
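
#### For reference, these activation functions can be written in a few lines of NumPy. This is an illustrative sketch rather than the Keras implementation; the default α values are arbitrary illustrative choices, and the SELU constants are the standard ones from the SELU paper.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for z < 0, so the gradient is never exactly zero.
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    # SELU is a scaled ELU with fixed alpha and scale constants.
    return SELU_SCALE * elu(z, SELU_ALPHA)

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("leaky ReLU:", leaky_relu(z))
print("ELU:       ", elu(z))
print("SELU:      ", selu(z))
```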

#### Let's train models on Fashion MNIST using leaky ReLU, PReLU, and SELU.

#### Leaky ReLU

In [6]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [9]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [10]:
model.compile(loss="sparse_categorical_crossentropy",
            optimizer=keras.optimizers.SGD(learning_rate=1e-3),
            metrics=["accuracy"])

In [11]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### PReLU

In [12]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [13]:
model.compile(loss="sparse_categorical_crossentropy",
            optimizer=keras.optimizers.SGD(learning_rate=1e-3),
            metrics=["accuracy"])

In [14]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Leaky ReLU ==> 8s 5ms/step - loss: 0.4880 - accuracy: 0.8347 - val_loss: 0.4806 - val_accuracy: 0.8414
####   PReLU   ==> 10s 6ms/step - loss: 0.4935 - accuracy: 0.8319 - val_loss: 0.4795 - val_accuracy: 0.8400

#### Although the two models differ little here apart from the time taken per epoch, the choice of activation function can matter more on harder problems.
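
#### One plausible contributor to the extra time is that PReLU adds trainable parameters while leaky ReLU adds none. The arithmetic below is a sketch that assumes the Keras default of one learned α per unit (no shared axes); the helper name is hypothetical.

```python
def dense_params(n_inputs, n_units):
    # One weight per connection plus one bias per unit.
    return n_inputs * n_units + n_units

# Parameters shared by both models: Dense(300), Dense(100), Dense(10).
dense_total = (dense_params(784, 300)
               + dense_params(300, 100)
               + dense_params(100, 10))

# Leaky ReLU has a fixed slope, so it adds no parameters.
leaky_total = dense_total

# PReLU learns one alpha per unit after each hidden Dense layer (assumed default).
prelu_total = dense_total + 300 + 100

print("leaky ReLU model:", leaky_total)
print("PReLU model:     ", prelu_total)
```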


#### SELU

In [15]:
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="selu",
                            kernel_initializer="lecun_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="selu",
                                kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

In [16]:
model.compile(loss="sparse_categorical_crossentropy",
            optimizer=keras.optimizers.SGD(learning_rate=1e-3),
            metrics=["accuracy"])

#### Scaling the Inputs for SELU

In [17]:
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds
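
#### One caveat with this kind of per-pixel standardization is that a pixel with the same value in every training image has a standard deviation of 0, which makes the division produce NaN or infinity. The sketch below uses made-up toy data and a small hypothetical `eps` term to guard against that; whether it is needed for Fashion MNIST depends on the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for image data: 100 "images" of 4x4 pixels in [0, 1).
X_toy = rng.random((100, 4, 4))
X_toy[:, 0, 0] = 0.0  # a constant pixel, so its std is exactly 0

means = X_toy.mean(axis=0, keepdims=True)
stds = X_toy.std(axis=0, keepdims=True)

eps = 1e-7  # hypothetical small constant to avoid division by zero
X_toy_scaled = (X_toy - means) / (stds + eps)

print("all finite:", np.isfinite(X_toy_scaled).all())
print("mean of pixel (1, 1):", X_toy_scaled[:, 1, 1].mean())
print("std of pixel (1, 1): ", X_toy_scaled[:, 1, 1].std())
```

#### As in the cell above, the statistics should come from the training set only and then be reused for the validation and test sets.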

In [18]:
history = model.fit(X_train_scaled, y_train, epochs=5,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


#### Now let's try the same deep network with ReLU

In [19]:
np.random.seed(42)
tf.random.set_seed(42)
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

In [20]:
model.compile(loss="sparse_categorical_crossentropy",
            optimizer=keras.optimizers.SGD(learning_rate=1e-3),
            metrics=["accuracy"])

In [21]:
history = model.fit(X_train_scaled, y_train, epochs=5,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


#### The accuracy is much worse: this network suffered from the vanishing/exploding gradients problem.

#### Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn’t guarantee that they won’t come back during training.
#### To reduce the risk of these problems coming back during training, we can use **Batch Normalization**.

## Batch Normalization
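
#### Before using the Keras layer, here is a minimal NumPy sketch of what Batch Normalization computes during training: it standardizes each feature over the current mini-batch, then rescales with γ and shifts with β. The function name is hypothetical, `eps` is a small smoothing term (1e-3 is assumed here), and the moving averages used at inference time are left out.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mu = X.mean(axis=0)                     # per-feature batch mean
    var = X.var(axis=0)                     # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # standardized inputs
    return gamma * X_hat + beta             # learned scale and shift

rng = np.random.default_rng(42)
X_batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # toy mini-batch

gamma = np.ones(4)   # scale parameters, initialized to 1
beta = np.zeros(4)   # shift parameters, initialized to 0

Z = batch_norm_train(X_batch, gamma, beta)
print("output mean per feature:", Z.mean(axis=0))
print("output std per feature: ", Z.std(axis=0))
```

#### With γ = 1 and β = 0 each feature comes out with a mean of about 0 and a standard deviation of about 1, whatever the scale of the inputs.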

In [23]:
model = keras.models.Sequential([
        keras.layers.Flatten(input_shape = [28, 28]),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(300, activation = 'elu', kernel_initializer = 'he_normal'),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(100, activation = "elu", kernel_initializer = 'he_normal'),
        keras.layers.BatchNormalization(),
        keras.layers.Dense(10, activation = 'softmax')
        ])

In [24]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_5 (Flatten)         (None, 784)               0         
                                                                 
 batch_normalization_2 (Batc  (None, 784)              3136      
 hNormalization)                                                 
                                                                 
 dense_211 (Dense)           (None, 300)               235500    
                                                                 
 batch_normalization_3 (Batc  (None, 300)              1200      
 hNormalization)                                                 
                                                                 
 dense_212 (Dense)           (None, 100)               30100     
                                                                 
 batch_normalization_4 (Batc  (None, 100)             
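
#### The parameter counts in the summary can be checked by hand. Each BatchNormalization layer holds four values per input feature (γ, β, the moving mean, and the moving variance), and each Dense layer holds one weight per connection plus one bias per unit. The helper names below are hypothetical.

```python
def batch_norm_params(n_features):
    # gamma, beta, moving mean, moving variance: 4 values per feature
    return 4 * n_features

def dense_params(n_inputs, n_units):
    # weights plus biases
    return n_inputs * n_units + n_units

print("BN after Flatten:   ", batch_norm_params(784))   # 3136
print("Dense(300):         ", dense_params(784, 300))   # 235500
print("BN after Dense(300):", batch_norm_params(300))   # 1200
print("Dense(100):         ", dense_params(300, 100))   # 30100
```

#### Only γ and β are trained by backpropagation; the moving mean and moving variance are updated as running statistics, so half of each BatchNormalization layer's parameters are non-trainable.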

In [25]:
model.compile(loss= "sparse_categorical_crossentropy",
                optimizer = keras.optimizers.SGD(learning_rate=1e-3),
                metrics = ['accuracy'])

In [26]:
history = model.fit(X_train, y_train, epochs = 10, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


#### Another technique to mitigate the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold. This is called **Gradient Clipping**.

### Gradient Clipping

In [27]:
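# Clip every component of the gradient to the range [-1.0, 1.0].
# This can change the direction of the gradient vector.
optimizer = keras.optimizers.SGD(clipvalue = 1.0)
model.compile(loss = 'mse', optimizer=optimizer)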

#### OR

In [None]:
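# Rescale the gradient whenever its L2 norm exceeds 1.0.
# This preserves the direction of the gradient vector.
optimizer = keras.optimizers.SGD(clipnorm=1.0)
model.compile(loss = 'mse', optimizer=optimizer)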
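
#### The difference between the two options is easier to see on a single gradient vector. The NumPy sketch below is illustrative only (the function names are hypothetical and this is not the Keras implementation): clipping by value caps each component independently, while clipping by norm rescales the whole vector.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    # Cap each component to [-clip_value, clip_value] independently.
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    # Rescale the whole vector if its L2 norm exceeds clip_norm.
    norm = np.linalg.norm(grad)
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)

grad = np.array([0.9, 100.0])

by_value = clip_by_value(grad, 1.0)
by_norm = clip_by_norm(grad, 1.0)

print("clipped by value:", by_value)  # components capped separately
print("clipped by norm: ", by_norm)   # same direction, L2 norm of 1
```

#### Clipping by value turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal, whereas clipping by norm keeps the original direction but nearly eliminates the first component.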