### FSDS DL Assignment 3

### 1.	Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

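No. He initialization samples every weight independently, so drawing a single random value and copying it into all the weights defeats its purpose. If all the weights in a layer are identical, every neuron in that layer computes the same output and receives the same gradient during backpropagation. The neurons therefore stay identical throughout training, and the layer behaves like a single neuron.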

He initialization is designed to keep the variance of the activations (forward pass) and of the gradients (backward pass) roughly constant from layer to layer when ReLU-family activation functions are used. It achieves this by sampling each weight from a Gaussian distribution with mean 0 and variance 2/n, where n is the number of inputs to the layer (its fan-in).

By initializing the weights randomly using He initialization, the network is able to learn diverse features that are more effective in solving the problem at hand.

Therefore, it is not recommended to initialize all the weights to the same value, even if that value is selected randomly using He initialization. Instead, it is recommended to use He initialization to randomly initialize each weight with a different value sampled from a Gaussian distribution with a mean of 0 and a variance of 2/n.
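The symmetry argument can be checked with a small sketch in plain Python (no Keras). The layer sizes, the helper names `he_normal_weights` and `layer_outputs`, and the constant 0.05 are illustrative assumptions, not part of the assignment.

```python
import math
import random

def he_normal_weights(n_in, n_out, rng):
    """Sample an n_in x n_out weight matrix from N(0, 2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

def layer_outputs(x, weights):
    """ReLU(x @ W) for a single input vector x (biases omitted)."""
    n_out = len(weights[0])
    return [max(0.0, sum(x[i] * weights[i][j] for i in range(len(x))))
            for j in range(n_out)]

rng = random.Random(42)
n_in, n_out = 200, 50
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]

# Case 1: one value copied into every weight, so every neuron is identical.
same_value = 0.05
constant_w = [[same_value] * n_out for _ in range(n_in)]
constant_out = layer_outputs(x, constant_w)

# Case 2: independent He-normal samples, so the neurons differ.
he_w = he_normal_weights(n_in, n_out, rng)
he_out = layer_outputs(x, he_w)

# Empirical variance of the sampled weights, to compare with 2 / n_in.
flat = [w for row in he_w for w in row]
mean_w = sum(flat) / len(flat)
var_w = sum((w - mean_w) ** 2 for w in flat) / len(flat)

print("distinct outputs, constant weights:", len(set(constant_out)))
print("distinct outputs, He weights:", len(set(he_out)))
print("empirical weight variance:", var_w, "target:", 2.0 / n_in)
```

With constant weights all 50 neurons produce one and the same output, whereas the independently sampled weights give many different outputs and an empirical variance close to 2/n.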

### 2.	Is it OK to initialize the bias terms to 0?

Yes, initializing the bias terms to 0 is fine in most cases. The symmetry between neurons is already broken by the randomly initialized weights, so the biases do not need to be random as well; they start at 0 and are adjusted during training like any other parameter.

Non-zero initial biases are rarely necessary, and large initial biases can push units into saturated or inactive regions at the start of training, which slows convergence.

However, there are cases where a non-zero bias can help. For example, some practitioners initialize the biases of ReLU units to a small positive value (such as 0.01) so that the units are active at the start of training and pass gradients back during backpropagation.

Overall, initializing the biases to 0 is a reasonable default that is likely to work well in most cases. However, it's always a good idea to experiment with different initialization methods and find the one that works best for your specific problem.

### 3.	Name three advantages of the SELU activation function over ReLU.

The SELU (Scaled Exponential Linear Unit) activation function has several advantages over the ReLU (Rectified Linear Unit) activation function, including:

Negative outputs: SELU can take negative values, so the average output of the neurons in a layer is typically closer to 0 than with ReLU, which never outputs negative values. This helps alleviate the vanishing gradients problem.

Self-normalization: Under the right conditions (standardized inputs, LeCun normal initialization, and a plain stack of dense layers), the activations in a SELU network tend to converge towards a mean of 0 and a standard deviation of 1 in every layer. This helps prevent vanishing or exploding gradients in deep networks and can lead to faster convergence.

No dying units: SELU has a non-zero derivative for negative inputs, so neurons cannot get stuck outputting 0 for every training instance, which is what happens to "dead" ReLU neurons.

Overall, SELU can provide better performance and faster convergence than ReLU in deep stacks of dense layers, provided the conditions for self-normalization are met.
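The self-normalizing property can be illustrated numerically. The sketch below, in plain Python, integrates over a standard normal input to estimate the mean and variance of the activation output. The helper name `output_mean_and_variance` and the integration settings are assumptions made for this illustration; the two constants are the usual rounded SELU values.

```python
import math

# Rounded constants from the SELU definition.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def relu(z):
    return max(0.0, z)

def output_mean_and_variance(activation, step=0.001, limit=10.0):
    """Mean and variance of activation(z) for z ~ N(0, 1), by a Riemann sum."""
    n = round(2 * limit / step)
    first = second = 0.0
    for k in range(n + 1):
        z = -limit + k * step
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        a = activation(z)
        first += a * density * step       # accumulates E[a]
        second += a * a * density * step  # accumulates E[a^2]
    return first, second - first ** 2

selu_mean, selu_var = output_mean_and_variance(selu)
relu_mean, relu_var = output_mean_and_variance(relu)
print(f"SELU: mean={selu_mean:.4f}, variance={selu_var:.4f}")
print(f"ReLU: mean={relu_mean:.4f}, variance={relu_var:.4f}")
```

For a standard normal input, SELU returns an output with mean approximately 0 and variance approximately 1, so the statistics are preserved. ReLU shifts the mean to about 0.40 and shrinks the variance.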

### 4.	In which cases would you want to use each of the following activation functions: SELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

SELU (Scaled Exponential Linear Unit): SELU is a good choice for deep neural networks with many layers, as it is designed to maintain a constant mean and variance of the activations throughout the network, which can help prevent the vanishing or exploding gradient problem. It is also self-normalizing, which can lead to faster convergence and better generalization performance. However, it requires that the input data is normalized and that the weights are initialized in a certain way, so it may not be appropriate for all cases.

Leaky ReLU and its variants (e.g., RReLU and PReLU): Leaky ReLU and its variants are a good choice when we want to avoid the "dying ReLU" problem, where a large number of neurons become inactive and output zero for every input during training. They use a small, non-zero slope for negative inputs, which keeps the gradient from vanishing there. Leaky ReLU is fast to compute, randomized leaky ReLU (RReLU) can help when the network overfits, and parametric leaky ReLU (PReLU) learns the slope and suits large training sets. ELU (Exponential Linear Unit) is a related function that replaces the linear negative part with a smooth exponential curve; it often performs better than leaky ReLU but is slower to compute.

ReLU (Rectified Linear Unit): ReLU is a popular choice for many deep learning tasks, particularly for image recognition and other computer vision tasks. It is computationally efficient, as it simply outputs the input if it is positive and zero otherwise. However, it can suffer from the "dying ReLU" problem if the learning rate is too high or the initial weights are not well-tuned.

Tanh (Hyperbolic Tangent): Tanh is useful in the output layer when the outputs must lie in the range of -1 to 1, and it is still common in recurrent networks. It has the same S-shape as the logistic function, but its output is centered on 0 rather than 0.5. It is rarely used in the hidden layers of feedforward networks nowadays, because it saturates for large inputs and can cause vanishing gradients.

Logistic (Sigmoid): Logistic is a good choice for binary classification tasks, where the output is a probability between 0 and 1. It is often used as the activation function for the output layer in binary classification problems.

Softmax: Softmax is a good choice for multi-class classification tasks, where the output is a probability distribution over multiple classes. It is often used as the activation function for the output layer in multi-class classification problems.

In practice, the choice of activation function depends on the specific task at hand, as well as the properties of the data and the network architecture. It is often a good idea to experiment with different activation functions and architectures to find the one that works best for a given problem.
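As a quick reference, several of the functions discussed above can be written in a few lines of plain Python. This is a minimal sketch for scalar inputs. The default slope `alpha=0.01` and the sample scores are illustrative choices, not values prescribed by the assignment.

```python
import math

def leaky_relu(z, alpha=0.01):
    """Small slope alpha for negative inputs keeps the gradient non-zero."""
    return z if z > 0 else alpha * z

def logistic(z):
    """Maps a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Maps a list of scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(leaky_relu(-2.0))          # small negative value instead of 0
print(math.tanh(-2.0))           # bounded in (-1, 1)
print(logistic(0.0))             # 0.5
print(softmax([2.0, 1.0, 0.1]))  # three probabilities summing to 1
```

The logistic function fits a single binary output, while softmax couples all the outputs so that they form one probability distribution over mutually exclusive classes.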

### 5.	What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

When using Stochastic Gradient Descent (SGD) optimizer with momentum, the momentum hyperparameter controls the contribution of the previous gradient direction to the current update. A higher momentum means that the optimizer relies more on the previous direction and less on the current direction, resulting in smoother updates that can help the optimizer to escape from local minima and converge faster.

However, if the momentum hyperparameter is set too close to 1 (e.g., 0.99999), the optimizer is likely to pick up a lot of speed and overshoot the minimum. Because almost none of the accumulated momentum is lost at each step, the optimizer cannot adjust its direction quickly to follow the gradient of the loss function. It slows down, comes back, accelerates again, and overshoots again, oscillating around the minimum many times. As a result, it takes much longer to converge than with a smaller momentum value, and in some cases it may diverge.

Therefore, it's important to choose a momentum hyperparameter that balances the benefit of smooth updates against the risk of overshooting. A typical range for the momentum hyperparameter is between 0.9 and 0.99, depending on the dataset and model architecture. It's recommended to experiment with different values and monitor the training progress to find the best value for a particular task.
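The overshooting behaviour can be reproduced on a one-dimensional toy problem. The sketch below applies a momentum update to f(x) = x²/2 in plain Python. The function, the learning rate of 0.1, the 500 steps, and the helper names are all assumptions chosen for illustration.

```python
def momentum_descent(beta, learning_rate=0.1, steps=500, start=1.0):
    """Minimize f(x) = x**2 / 2 with momentum; return every visited x."""
    x, velocity = start, 0.0
    path = [x]
    for _ in range(steps):
        gradient = x  # f'(x) = x
        velocity = beta * velocity - learning_rate * gradient
        x += velocity
        path.append(x)
    return path

def tail_amplitude(path, last=100):
    """Largest distance from the minimum (x = 0) over the final steps."""
    return max(abs(x) for x in path[-last:])

tail_typical = tail_amplitude(momentum_descent(beta=0.9))
tail_extreme = tail_amplitude(momentum_descent(beta=0.99999))
print(f"beta=0.9:     largest |x| in last 100 steps = {tail_typical:.2e}")
print(f"beta=0.99999: largest |x| in last 100 steps = {tail_extreme:.2e}")
```

With a momentum of 0.9 the iterate has settled onto the minimum after 500 steps. With 0.99999 it is still swinging back and forth with nearly its initial amplitude.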

### 8.	Practice training a deep neural network on the CIFAR10 image dataset


#### a.	Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the ELU activation function.

In [None]:
import tensorflow as tf
from tensorflow import keras

# Load the CIFAR10 dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()

# Preprocess the data
X_train = X_train / 255.0
X_test = X_test / 255.0
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

# Build the deep neural network
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

# Compile the model
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the model, holding out 10% of the training set for validation
# so that the test set stays untouched
history = model.fit(X_train, y_train, epochs=100, validation_split=0.1)
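For a sense of scale, the number of trainable parameters implied by this architecture can be worked out by hand. The helper below is a plain-Python sketch; `count_dense_parameters` is a name introduced here, not a Keras function.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 32 * 32 * 3 flattened inputs, 20 hidden layers of 100 units, 10 outputs
layer_sizes = [32 * 32 * 3] + [100] * 20 + [10]
total_parameters = count_dense_parameters(layer_sizes)
print(total_parameters)  # 500210
```

Most of the parameters sit in the first dense layer, which connects the 3,072 input values to 100 neurons.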

#### b.	Using Nadam optimization and early stopping, train the network on the CIFAR10 dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons. Remember to search for the right learning rate each time you change the model’s architecture or hyperparameters.

In [None]:
import tensorflow as tf
from tensorflow import keras

# Load the CIFAR10 dataset
(X_train, y_train), (X_test, y_test) = keras.datasets.cifar10.load_data()

# Preprocess the data
X_train = X_train / 255.0
X_test = X_test / 255.0
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)

# Define the deep neural network
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

# Compile the model with Nadam optimizer and early stopping
optimizer = keras.optimizers.Nadam(learning_rate=0.001)  # "lr" is a deprecated alias
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

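# Train the model; early stopping monitors a validation split taken from
# the training set, so the test set is not used for model selection
history = model.fit(X_train, y_train, epochs=100, validation_split=0.1, callbacks=[early_stopping_cb])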

This code uses the Nadam optimizer with a learning rate of 0.001, and the EarlyStopping callback with a patience of 10 epochs, which stops training if the validation loss doesn't improve for 10 consecutive epochs and then restores the best weights seen. The model is compiled with the categorical cross-entropy loss function and accuracy as the evaluation metric, and is trained for up to 100 epochs, with the validation data evaluated after each epoch.
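The patience rule can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation. The function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has not
    improved for `patience` consecutive epochs; if that never happens,
    stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses, for illustration only.
val_losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

Here training stops at epoch index 5, three epochs after the best loss at index 2. Restoring the best weights corresponds to rolling the model back to that earlier epoch.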

To search for the right learning rate, you can use the learning rate finder technique, which involves gradually increasing the learning rate during training and monitoring the loss. Here's an example implementation:

In [None]:
import math

def find_learning_rate(model, X, y, epochs=1, batch_size=32, min_lr=1e-5, max_lr=1e-1):
    num_batches = math.ceil(len(X) / batch_size)
    factor = (max_lr / min_lr) ** (1 / (epochs * num_batches))
    lr = min_lr
    model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.SGD(learning_rate=lr, momentum=0.9), metrics=["accuracy"])
    losses, lrs = [], []
    for epoch in range(epochs):
        for batch in range(num_batches):
            idx_start = batch * batch_size
            idx_end = (batch + 1) * batch_size
            X_batch, y_batch = X[idx_start:idx_end], y[idx_start:idx_end]
            history = model.train_on_batch(X_batch, y_batch)
            losses.append(history[0])
            lrs.append(lr)
            lr *= factor
            model.optimizer.learning_rate = lr  # update the optimizer's learning rate for the next batch
    return losses, lrs

# Test the learning rate finder on a small subset of the CIFAR10 dataset
X_small, y_small = X_train[:1000], y_train[:1000]
model_small = keras.Sequential([
    keras.layers.Flatten(input_shape=[32, 32, 3]),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax")
])
# min_lr and max_lr are left at the defaults of find_learning_rate
losses, lrs = find_learning_rate(model_small, X_small, y_small, epochs=5, batch_size=32)
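
The exponential schedule used by the finder, and one common way to read a learning rate off its output, can be sketched in plain Python. The helper names, the four-step schedule, the loss values, and the "divide by 10" rule of thumb are assumptions made for this illustration.

```python
def exponential_learning_rates(min_lr, max_lr, iterations):
    """Learning rates used at each iteration, growing by a constant factor."""
    factor = (max_lr / min_lr) ** (1 / iterations)
    return [min_lr * factor ** k for k in range(iterations)], factor

def suggest_learning_rate(lrs, losses, divisor=10.0):
    """Rule of thumb: take the rate with the lowest loss, then back off."""
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best_index] / divisor

lrs_demo, factor = exponential_learning_rates(min_lr=1e-5, max_lr=1e-1, iterations=4)
# Made-up losses: they fall, bottom out, then blow up as the rate grows.
losses_demo = [2.30, 2.10, 1.90, 4.50]
suggested = suggest_learning_rate(lrs_demo, losses_demo)
print(lrs_demo)
print(suggested)
```

In practice the same two steps would be applied to the `lrs` and `losses` lists returned by `find_learning_rate`, ideally after plotting the loss against the learning rate on a logarithmic axis.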

#### c.	Now try adding Batch Normalization and compare the learning curves: Is it converging faster than before? Does it produce a better model? How does it affect training speed?

In [None]:
from tensorflow import keras

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
    model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(10, activation="softmax"))

To compare the learning curves with and without batch normalization, we can train both models using Nadam optimization and early stopping, then plot their learning curves. Here is an example:

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import cifar10

# Load CIFAR10 data
(X_train_full, y_train_full), (X_test, y_test) = cifar10.load_data()
X_train_full = X_train_full.astype(np.float32) / 255.0
X_test = X_test.astype(np.float32) / 255.0
y_train_full = y_train_full.astype(np.int32)
y_test = y_test.astype(np.int32)

# Split validation set
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

# Define DNN without batch normalization
model_without_bn = keras.models.Sequential()
model_without_bn.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model_without_bn.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model_without_bn.add(keras.layers.Dense(10, activation="softmax"))
model_without_bn.compile(loss="sparse_categorical_crossentropy",
                         optimizer=keras.optimizers.Nadam(learning_rate=5e-4),
                         metrics=["accuracy"])
early_stopping_cb = keras.callbacks.EarlyStopping(patience=20)
history_without_bn = model_without_bn.fit(X_train, y_train, epochs=100,
                                           validation_data=(X_valid, y_valid),
                                           callbacks=[early_stopping_cb])

# Define DNN with batch normalization
model_with_bn = keras.models.Sequential()
model_with_bn.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model_with_bn.add(keras.layers.Dense(100, kernel_initializer="he_normal"))
    model_with_bn.add(keras.layers.BatchNormalization())
    model_with_bn.add(keras.layers.Activation("elu"))
model_with_bn.add(keras.layers.Dense(10, activation="softmax"))
model_with_bn.compile(loss="sparse_categorical_crossentropy",
                      optimizer=keras.optimizers.Nadam(learning_rate=5e-4),
                      metrics=["accuracy"])
history_with_bn = model_with_bn.fit(X_train, y_train, epochs=100,
                                    validation_data=(X_valid, y_valid),
                                    callbacks=[early_stopping_cb])

# Plot learning curves
import matplotlib.pyplot as plt

plt.plot(history_without_bn.history["accuracy"], label="Without BN (training)")
plt.plot(history_with_bn.history["accuracy"], label="With BN (training)")
plt.plot(history_without_bn.history["val_accuracy"], label="Without BN (validation)")
plt.plot(history_with_bn.history["val_accuracy"], label="With BN (validation)")
plt.legend(loc="lower right")
plt.show()
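
To answer the comparison questions with numbers as well as plots, a small helper can pull the best epoch out of a Keras-style history dictionary. This is a sketch: `summarize_history` is a name introduced here, and the dictionary holds placeholder values, not measured results.

```python
def summarize_history(history):
    """Best validation accuracy and the (1-based) epoch where it occurred."""
    val_acc = history["val_accuracy"]
    best_index = max(range(len(val_acc)), key=val_acc.__getitem__)
    return {"epochs_run": len(val_acc),
            "best_epoch": best_index + 1,
            "best_val_accuracy": val_acc[best_index]}

# Placeholder values for illustration only. In the notebook, pass
# history_without_bn.history or history_with_bn.history instead.
example_history = {"val_accuracy": [0.30, 0.38, 0.42, 0.41]}
summary = summarize_history(example_history)
print(summary)
```

Comparing `best_epoch` between the two runs indicates which model converges in fewer epochs, and `best_val_accuracy` indicates which one ends up better. Time per epoch still has to be read from the training logs.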

#### d.	Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).

To replace Batch Normalization with SELU activation function and ensure that the network self-normalizes, we need to make the following adjustments:

Standardize the input features: We need to preprocess the input data to have a mean of 0 and a standard deviation of 1. This can be done using the StandardScaler class from scikit-learn.

Use LeCun normal initialization: We can use the lecun_normal initializer from Keras to initialize the weights.

Use the SELU activation function: We can set the activation function of each hidden layer to SELU.

Ensure that the DNN contains only a sequence of dense layers: We can use the Dense class from Keras to create the layers of the DNN.

In [None]:
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler

# Load CIFAR10 data
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()

# Preprocess the input data
scaler = StandardScaler()
X_train_full = scaler.fit_transform(X_train_full.reshape(-1, 3072)).reshape(-1, 32, 32, 3)
X_test = scaler.transform(X_test.reshape(-1, 3072)).reshape(-1, 32, 32, 3)

# Define the model
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

# Compile the model
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])

# Train the model
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
history = model.fit(X_train_full, y_train_full, epochs=100, validation_split=0.1, callbacks=[early_stopping_cb])

# Evaluate the model on the test set
model.evaluate(X_test, y_test)
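
The standardization step can also be written out by hand, which makes it explicit that the statistics come from the training set only. The plain-Python sketch below uses tiny made-up rows in place of flattened images. The helper names and the `eps` constant are assumptions for this illustration.

```python
import math

def fit_standardizer(rows):
    """Per-feature mean and standard deviation of the training rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(d)]
    return means, stds

def standardize(rows, means, stds, eps=1e-7):
    """Apply (x - mean) / std with statistics computed elsewhere."""
    return [[(r[j] - means[j]) / (stds[j] + eps) for j in range(len(r))]
            for r in rows]

# Tiny made-up "pixel" rows standing in for flattened images.
train_rows = [[0.0, 255.0], [128.0, 0.0], [255.0, 128.0]]
test_rows = [[64.0, 64.0]]

means, stds = fit_standardizer(train_rows)          # training set only
train_scaled = standardize(train_rows, means, stds)
test_scaled = standardize(test_rows, means, stds)   # reuse training statistics
print(train_scaled)
print(test_scaled)
```

After scaling, each training feature has mean 0 and standard deviation 1, which is the input condition that SELU's self-normalization relies on.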

#### e.	Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.

To regularize the model with alpha dropout, we can simply add an AlphaDropout layer after each hidden layer. Here is an example of how to modify the previous model to use alpha dropout:

In [None]:
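from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense, AlphaDropout
from tensorflow.keras.optimizers import Nadam

model = Sequential()
model.add(Input(shape=(32, 32, 3)))
model.add(Flatten())

# Hidden layers: alpha dropout is designed for SELU networks, so keep the
# self-normalizing setup from part d (SELU + LeCun normal initialization)
for _ in range(20):
    model.add(Dense(100, activation='selu', kernel_initializer='lecun_normal'))
    model.add(AlphaDropout(rate=0.1))

# Output layer
model.add(Dense(10, activation='softmax'))

model.compile(optimizer=Nadam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])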

To use MC Dropout without retraining, we keep the trained model as it is and leave its dropout layers active at prediction time by calling the model with the training flag set to True. Running many such stochastic forward passes and averaging the predicted probabilities gives the MC Dropout estimate. Here is an example, assuming the model has already been trained and X_test has been preprocessed in the same way as the training data:

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

# Run 100 stochastic forward passes with dropout active.
# model.predict() has no training argument, so call the model directly.
y_probas = np.stack([model(X_test, training=True).numpy() for _ in range(100)])
y_mean = y_probas.mean(axis=0)  # averaged class probabilities
y_std = y_probas.std(axis=0)    # spread across passes (uncertainty)

# Evaluate the averaged predictions
accuracy = accuracy_score(y_test.ravel(), np.argmax(y_mean, axis=1))
print("MC Dropout Accuracy:", accuracy)

This code reuses the trained model unchanged; adding a new output layer would not work here, because a freshly created layer has untrained weights. Calling the model with training=True keeps the dropout layers active during inference, so each of the 100 forward passes uses a different random dropout mask. The class probabilities are averaged over the 100 passes, and the most probable class is taken for each sample. The standard deviation across passes gives a rough measure of the model's uncertainty.

Note that MC Dropout is computationally expensive at prediction time, since it requires many forward passes for each sample. In return, it often gives a modest accuracy improvement over a single deterministic pass, along with better estimates of the model's uncertainty.
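
The averaging at the heart of MC Dropout can be shown on a toy model in plain Python. The sketch below uses a linear softmax classifier with ordinary (inverted) dropout on its inputs rather than alpha dropout. The weights, the input, and all helper names are made up for illustration.

```python
import math
import random

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_forward(x, weights, rate, rng):
    """One forward pass of a linear softmax classifier with input dropout."""
    keep = 1.0 - rate
    # Inverted dropout: zero each input with probability `rate`, rescale the rest.
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    scores = [sum(w * xi for w, xi in zip(row, dropped)) for row in weights]
    return softmax(scores)

def mc_dropout_predict(x, weights, rate=0.1, passes=100, seed=0):
    """Average class probabilities over several stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, weights, rate, rng) for _ in range(passes)]
    n_classes = len(weights)
    mean = [sum(s[c] for s in samples) / passes for c in range(n_classes)]
    std = [math.sqrt(sum((s[c] - mean[c]) ** 2 for s in samples) / passes)
           for c in range(n_classes)]
    return mean, std

# Made-up weights (one row per class) and input, for illustration only.
weights = [[1.0, 0.5, -0.5], [-0.5, 1.0, 0.5], [0.2, -0.3, 1.0]]
x = [2.0, 0.5, 0.1]
mean, std = mc_dropout_predict(x, weights)
predicted_class = max(range(len(mean)), key=mean.__getitem__)
print(mean, std, predicted_class)
```

The averaged probabilities still sum to 1, and the per-class standard deviation shows how much the prediction changes from one dropout mask to the next.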