# <center>DeepLearning: Assignment 03</center>

# Question 01

Is it OK to initialize all the weights to the same value as long as that value is selected
randomly using He initialization?


## <span style='color:blue'>Answer</span>

No. He initialization draws every weight independently at random; it does not pick a single random value and copy it to all the weights. Its purpose is to keep the scale of the signals (and therefore the gradients) roughly constant from layer to layer, which reduces the risk of vanishing or exploding gradients. Setting all the weights to the same value, even a value sampled from the He distribution, defeats this purpose.

In He initialization, weights are initialized with random values drawn from a normal distribution with mean 0 and variance 2 divided by the number of input units in the neuron's layer. This helps prevent gradients from becoming too small during backpropagation, enabling more stable and efficient training.
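As a rough illustration of the sampling rule described above, the following NumPy sketch draws a weight matrix from a plain normal distribution with variance 2 / fan-in and compares the empirical variance with the target. The helper name `he_normal` and the layer sizes are made up for this example, and framework implementations may differ in detail (for instance by truncating the distribution).

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def he_normal(fan_in, fan_out, rng):
    """Sample a (fan_in, fan_out) weight matrix from N(0, 2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))

# Illustrative layer sizes (assumptions, not taken from the assignment)
fan_in, fan_out = 256, 512
W = he_normal(fan_in, fan_out, rng)

target_var = 2.0 / fan_in
print(f"target variance:    {target_var:.5f}")
print(f"empirical variance: {W.var():.5f}")
print(f"distinct values:    {np.unique(W).size} of {W.size}")
```

Every entry is sampled independently, so the weights are all different, which is exactly what a single shared value would lose.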

If all weights are initialized to the same value, every neuron in a given layer computes the same output and receives the same gradient during backpropagation. The neurons therefore stay identical after every update, and the layer behaves as if it contained a single neuron. Random, distinct initial weights are needed to break this symmetry so that different neurons can learn different features.
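The symmetry problem can be checked numerically. The sketch below is a minimal illustration on made-up data: it computes the gradient of a squared-error loss with respect to the hidden-layer weights of a tiny tanh network, once with constant weights and once with random weights. The function name `hidden_weight_gradient` and all sizes are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(8, 4))   # 8 samples, 4 features (made-up data)
y = rng.normal(size=(8, 1))   # made-up regression targets

def hidden_weight_gradient(W1, W2):
    """Gradient of 0.5 * mean squared error w.r.t. W1 for a one-hidden-layer tanh network."""
    h = np.tanh(X @ W1)                 # hidden activations, shape (8, 3)
    err = h @ W2 - y                    # prediction error, shape (8, 1)
    dh = (err @ W2.T) * (1.0 - h ** 2)  # backpropagate through tanh
    return X.T @ dh / len(X)            # shape (4, 3): one column per hidden neuron

# Case 1: every weight set to the same constant
W1_const = np.full((4, 3), 0.5)
W2_const = np.full((3, 1), 0.5)
grad_const = hidden_weight_gradient(W1_const, W2_const)

# Case 2: independent random weights
W1_rand = rng.normal(scale=np.sqrt(2.0 / 4), size=(4, 3))
W2_rand = rng.normal(scale=np.sqrt(2.0 / 3), size=(3, 1))
grad_rand = hidden_weight_gradient(W1_rand, W2_rand)

same_const = np.allclose(grad_const, grad_const[:, :1])
same_rand = np.allclose(grad_rand, grad_rand[:, :1])
print("constant init, all neurons get the same gradient:", same_const)
print("random init, all neurons get the same gradient:  ", same_rand)
```

With constant weights the gradient columns are identical, so a gradient step leaves the hidden neurons identical to each other.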

# Question 02

Is it OK to initialize the bias terms to 0?

## <span style='color:blue'>Answer</span>

Yes. Initializing bias terms to 0 is common practice and generally works well. Symmetry is already broken by the randomly initialized weights, so zero biases do not make neurons identical, and they do not cause vanishing gradients. During training, the network adjusts both weights and biases based on the gradients computed during backpropagation.

However, non-zero bias initialization can be beneficial in certain scenarios. For example, the forget-gate bias of an LSTM is often initialized to 1 so that the cell retains information early in training, and some practitioners use a small positive bias with ReLU so that units start out active. Note that when a layer is followed by Batch Normalization, its bias is redundant, because the normalization layer has its own learned offset.

In most cases, initializing biases to 0 is a reasonable default. Keras, for instance, initializes the biases of `Dense` layers to zero unless told otherwise.

# Question 03

Name three advantages of the SELU activation function over ReLU.

## <span style='color:blue'>Answer</span>

The Scaled Exponential Linear Unit (SELU) activation function offers several advantages over the Rectified Linear Unit (ReLU) activation function:

1. **Outputs Closer to Zero Mean:**
   - SELU takes negative values for negative inputs, so the average output of a layer is typically closer to zero than with ReLU, whose outputs are never negative. This helps alleviate the vanishing gradients problem.

2. **Self-Normalizing Properties:**
   - SELU has self-normalizing properties, meaning it can stabilize the activations of neurons during training. Neural networks with many layers and SELU activations tend to converge faster and generalize better, especially in deep architectures, without the need for elaborate normalization techniques like Batch Normalization.

3. **No Dying Units:**
   - SELU has a nonzero derivative for negative inputs, so a neuron cannot get stuck outputting zero with a zero gradient, as can happen with ReLU (the dying ReLU problem).

These advantages make SELU a favorable choice in deep neural networks, particularly when dealing with architectures where stable gradients and consistent weight updates are crucial for efficient and effective learning.
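The self-normalizing behaviour can be illustrated with a small NumPy experiment. The sketch below pushes standardized random inputs through a stack of dense layers with LeCun normal weights and compares the output statistics for SELU and ReLU. The depth, width, and sample count are arbitrary choices for this example, and the two constants are the standard SELU values.

```python
import numpy as np

# Standard SELU constants
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    # Clamp the exp argument so large positive inputs cannot overflow
    return SCALE * np.where(z > 0, z, ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0))

def relu(z):
    return np.maximum(z, 0.0)

def forward_stats(activation, n_layers=50, width=100, n_samples=1000, seed=0):
    """Return the mean and standard deviation of the activations after n_layers."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, width))  # standardized inputs
    for _ in range(n_layers):
        # LeCun normal initialization: variance 1 / fan_in
        W = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))
        a = activation(a @ W)
    return a.mean(), a.std()

selu_mean, selu_std = forward_stats(selu)
relu_mean, relu_std = forward_stats(relu)
print(f"SELU after 50 layers: mean {selu_mean:+.3f}, std {selu_std:.3f}")
print(f"ReLU after 50 layers: mean {relu_mean:+.3e}, std {relu_std:.3e}")
```

With SELU the activations stay close to mean 0 and standard deviation 1, whereas with ReLU and the same initialization the signal shrinks layer after layer.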

# Question 04

In which cases would you want to use each of the following activation functions: SELU, leaky
ReLU (and its variants), ReLU, tanh, logistic, and softmax?

## <span style='color:blue'>Answer</span>

1. **SELU (Scaled Exponential Linear Unit):**
   - **When to Use:** Use SELU when building deep neural networks, especially deep feedforward networks (MLPs).
   - **Advantages:** SELU can mitigate vanishing gradients and stabilize training in deep architectures, leading to faster convergence and better generalization.
   - **Note:** Ensure the input features are standardized (zero mean and unit variance) for SELU to exhibit its self-normalizing properties effectively.

2. **Leaky ReLU and Its Variants (e.g., Parametric ReLU, Randomized Leaky ReLU):**
   - **When to Use:** Use Leaky ReLU and its variants when dealing with dead neurons (neurons always output 0) in traditional ReLU activations.
   - **Advantages:** Leaky ReLU variants address the dying ReLU problem by allowing small negative slopes for negative inputs, preventing neurons from becoming inactive during training.
  
3. **ReLU (Rectified Linear Unit):**
   - **When to Use:** Use ReLU as a default choice for most hidden layers in deep neural networks.
   - **Advantages:** ReLU is computationally efficient and helps mitigate vanishing gradients for positive inputs, promoting faster training for deep networks.
   - **Note:** Watch out for the dying ReLU problem, where a neuron ends up outputting 0 for every input in the training set and stops learning because its gradient is zero. If this occurs, consider using Leaky ReLU or its variants.

4. **Tanh (Hyperbolic Tangent):**
   - **When to Use:** Use tanh for hidden layers when the data is centered around 0 (e.g., mean-shifted to zero) or when the output range needs to be between -1 and 1.
   - **Advantages:** Tanh squashes inputs to the range [-1, 1], making it zero-centered, which can help learning in certain scenarios, especially in the hidden layers of neural networks.
  
5. **Logistic (Sigmoid):**
   - **When to Use:** Use logistic (sigmoid) for binary classification problems in the output layer.
   - **Advantages:** Sigmoid function maps inputs to the range [0, 1], making it suitable for binary classification tasks where the output needs to represent probabilities.
   - **Note:** Avoid using sigmoid in hidden layers of deep networks due to vanishing gradients, unless specifically required for a certain design reason.
  
6. **Softmax:**
   - **When to Use:** Use softmax in the output layer for multi-class classification problems.
   - **Advantages:** Softmax converts raw scores (logits) into a probability distribution over mutually exclusive classes, so the model assigns each instance to exactly one class.
   - **Note:** Softmax is essential for multi-class problems, ensuring that the class probabilities sum up to 1, making it suitable for classification tasks with more than two classes.
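
For reference, several of the functions listed above can be written in a few lines of NumPy. This is a minimal sketch for comparing their output ranges, not the implementation used by any framework; the default leaky slope of 0.01 and the sample inputs are arbitrary choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for negative inputs keeps the gradient nonzero
    return np.where(z > 0, z, alpha * z)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("relu:      ", relu(z))
print("leaky_relu:", leaky_relu(z))
print("tanh:      ", np.tanh(z))
print("logistic:  ", logistic(z))

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print("softmax:   ", probs, "sum =", probs.sum())
```

ReLU and leaky ReLU are unbounded above, tanh stays in (-1, 1), the logistic function stays in (0, 1), and the softmax outputs sum to 1.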

# Question 05

What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999)
when using an SGD optimizer?


## <span style='color:blue'>Answer</span>

Setting the momentum hyperparameter too close to 1 (e.g., 0.99999) in Stochastic Gradient Descent (SGD) can lead to several issues:

1. **Very Large Effective Step Size:** Momentum accumulates past gradients into a velocity. With a constant gradient, the velocity approaches the learning rate times the gradient divided by (1 − momentum), so a momentum of 0.99999 allows steps up to roughly 100,000 times larger than plain SGD with the same learning rate. The optimizer picks up a great deal of speed rather than slowing down.

2. **Overshooting and Instability:** With extremely high momentum, the optimizer might overshoot the optimal point. It can cause the algorithm to oscillate around the minimum without converging. The momentum can become so dominant that the optimizer might miss the optimal solution completely.

3. **Dampened Responsiveness:** High momentum makes the optimizer less responsive to local changes in the loss landscape. The update is dominated by the accumulated velocity, so the current gradient has little immediate influence on the direction of travel.

4. **Difficulty in Convergence Analysis:** Extremely high momentum can make it difficult to analyze the convergence behavior of the optimization algorithm. It might converge, diverge, or oscillate unpredictably, making it challenging to determine the convergence criteria and stopping conditions.

5. **Slow Convergence Near a Minimum:** Momentum can help the optimizer roll past shallow local minima, but with momentum this close to 1 it also shoots past good minima, comes back, and overshoots again. Because each step removes very little velocity, these oscillations take many iterations to die out, so training takes much longer to converge than with a smaller momentum value.

To avoid these issues, it's crucial to carefully choose the momentum hyperparameter based on the specific problem and dataset. It's often beneficial to start with moderate values (e.g., 0.9) and tune the hyperparameters through experimentation and validation performance.
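The oscillation can be reproduced on a toy problem. The sketch below applies momentum SGD to the one-dimensional function f(x) = 0.5·x², whose minimum is at x = 0, and compares a momentum of 0.9 with 0.99999. The learning rate, step count, and starting point are arbitrary choices for this illustration.

```python
import numpy as np

def momentum_sgd(beta, lr=0.1, n_steps=500, x0=1.0):
    """Minimize f(x) = 0.5 * x**2 with momentum SGD and return the trajectory."""
    x, v = x0, 0.0
    trajectory = [x]
    for _ in range(n_steps):
        grad = x               # derivative of 0.5 * x**2
        v = beta * v - lr * grad
        x = x + v
        trajectory.append(x)
    return np.array(trajectory)

# Largest distance from the minimum over the last 100 steps
tail_moderate = np.abs(momentum_sgd(beta=0.9)[-100:]).max()
tail_extreme = np.abs(momentum_sgd(beta=0.99999)[-100:]).max()

print(f"momentum 0.9:     max |x| over last 100 steps = {tail_moderate:.2e}")
print(f"momentum 0.99999: max |x| over last 100 steps = {tail_extreme:.2e}")
```

With momentum 0.9 the iterate settles at the minimum, while with 0.99999 it is still swinging back and forth with nearly its initial amplitude after 500 steps.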

# Question 06

Name three ways you can produce a sparse model.

## <span style='color:blue'>Answer</span>

1. **L1 Regularization (Lasso):**
   - Introducing an L1 penalty in the loss function encourages sparsity by pushing irrelevant or less relevant features' weights towards zero. Features with zero weights are effectively pruned, creating a sparse model.

2. **Feature Selection Techniques:**
   - Utilize feature selection methods like Mutual Information, Recursive Feature Elimination, or Tree-based feature importance to identify and keep only the most informative features, discarding irrelevant or redundant ones, leading to a sparser representation.

3. **Magnitude Pruning:**
   - Train the model normally, then set the smallest weights to zero (optionally fine-tuning afterwards). Tools such as the TensorFlow Model Optimization Toolkit can apply pruning gradually during training. Note that dropout is not a way to obtain a sparse model: it deactivates neurons at random only during training, and all the weights remain non-zero in the final model.
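
Two simple operations that produce exact zeros in a weight vector are sketched below: the soft-thresholding step associated with an L1 penalty, and zeroing weights whose magnitude falls below a threshold. The function names, the example weights, and the threshold values are all made up for this illustration.

```python
import numpy as np

def soft_threshold(w, lam):
    """Shrink every weight toward zero by lam; weights smaller than lam become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def magnitude_prune(w, threshold):
    """Set weights with magnitude below threshold to 0 and leave the rest unchanged."""
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.array([0.8, -0.03, 0.002, -0.6, 0.04, 0.0005])

w_l1 = soft_threshold(w, lam=0.05)
w_pruned = magnitude_prune(w, threshold=0.05)

print("original:      ", w)
print("soft-threshold:", w_l1, "non-zeros:", np.count_nonzero(w_l1))
print("pruned:        ", w_pruned, "non-zeros:", np.count_nonzero(w_pruned))
```

Both operations keep only the two large weights; soft-thresholding also shrinks the surviving weights, while pruning leaves them untouched.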

# Question 07

Does dropout slow down training? Does it slow down inference (i.e., making predictions on
new instances)? What about MC Dropout?

## <span style='color:blue'>Answer</span>

1. **Dropout and Training Speed:**
   - Yes, dropout slows down training. The extra computation per step (sampling and applying a mask) is small, but because only part of the network is updated on each step, the model generally needs more iterations to converge, commonly on the order of twice as many. Dropout is still widely used in practice because of its regularization benefits.

2. **Dropout and Inference Speed:**
   - Dropout does not slow down inference (making predictions on new instances). During inference, dropout is typically turned off, and the model operates with all neurons active. Therefore, the prediction speed is not affected by dropout regularization.

3. **MC Dropout (Monte Carlo Dropout):**
   - MC Dropout involves performing multiple forward passes with dropout enabled during inference and averaging the predictions. While this technique introduces additional computations, it provides uncertainty estimates along with predictions, making it valuable for tasks like uncertainty quantification or Bayesian neural networks.
   - MC Dropout can significantly slow down inference, especially if a large number of forward passes are required for accurate uncertainty estimation. The trade-off between prediction accuracy and computational cost needs to be considered when using MC Dropout in real-time applications.
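
The difference between standard inference and MC Dropout can be sketched with a single linear layer in NumPy. Everything here (layer size, dropout rate, number of passes, the `forward` helper) is an assumption chosen for illustration; the point is that standard inference needs one deterministic pass, while MC Dropout needs one stochastic pass per sample.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_features, rate = 50, 0.2
x = rng.normal(size=(1, n_features))  # one made-up input instance
W = rng.normal(scale=np.sqrt(1.0 / n_features), size=(n_features, 1))

def forward(x, training):
    """One linear layer preceded by (inverted) dropout on its inputs."""
    if training:
        keep = rng.random(x.shape) >= rate  # drop each input with probability `rate`
        x = x * keep / (1.0 - rate)         # rescale so the expected value is unchanged
    return x @ W

# Standard inference: dropout off, a single deterministic pass
deterministic = forward(x, training=False)

# MC Dropout: dropout on, many stochastic passes, then average
n_passes = 2000
samples = np.stack([forward(x, training=True) for _ in range(n_passes)])
mc_mean = samples.mean(axis=0)
mc_std = samples.std(axis=0)

print(f"single pass, dropout off:  {deterministic.item():+.3f}")
print(f"MC mean over {n_passes} passes: {mc_mean.item():+.3f}")
print(f"MC standard deviation:     {mc_std.item():.3f}")
```

The spread of the stochastic passes gives a rough uncertainty estimate, at the price of `n_passes` forward passes instead of one.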

# Question 08

**a) Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the point of this exercise). Use He initialization and the ELU activation function.**

## <span style='color:blue'>Answer</span>

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.activations import elu

# Assumed from part (b): CIFAR10 images are 32 x 32 x 3 and there are 10 classes
input_shape = (32, 32, 3)
num_classes = 10

# Build the DNN model
model = Sequential()

# Flatten each image into a vector (this layer has no trainable weights)
model.add(Flatten(input_shape=input_shape))

# Exactly 20 hidden layers with 100 neurons each, ELU activation, He initialization
for _ in range(20):
    model.add(Dense(units=100, activation=elu, kernel_initializer=HeNormal()))

# Output layer with one neuron per class
model.add(Dense(units=num_classes, activation='softmax'))

# Compile the model (the sparse loss expects integer class labels)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Print the model summary
model.summary()


**b) Using Nadam optimization and early stopping, train the network on the CIFAR10
dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is
composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for
testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons.
Remember to search for the right learning rate each time you change the model’s
architecture or hyperparameters.**


## <span style='color:blue'>Answer</span>

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.activations import elu
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

# Load CIFAR10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Preprocess the data
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values to be between 0 and 1
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)  # One-hot encode labels

# Build the DNN model
model = Sequential()
model.add(Flatten(input_shape=(32, 32, 3)))  # Flatten input images
for _ in range(20):
    model.add(Dense(100, activation=elu, kernel_initializer=HeNormal()))  # 20 hidden layers with 100 neurons each
model.add(Dense(10, activation='softmax'))  # Output layer with 10 neurons for 10 classes

# Compile the model with Nadam optimizer and categorical crossentropy loss
model.compile(optimizer=Nadam(), loss='categorical_crossentropy', metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

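# Train the model, holding out the last 10% of the training set for validation
# so that the test set is not used to decide when to stop
history = model.fit(x_train, y_train, epochs=100, batch_size=32,
                    validation_split=0.1, callbacks=[early_stopping])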

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test, y_test)

print(f"Test Accuracy: {test_accuracy:.2f}")
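
The early-stopping rule used in the cell above can be summarized in plain Python. This is a simplified sketch of the idea (stop once the validation loss has not improved for `patience` consecutive epochs and keep the best epoch), not the Keras implementation; the function name and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops at the first epoch where the validation loss has failed to
    improve on the best value seen so far for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Patience never ran out: training used every available epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses for illustration
val_losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.69, 0.75, 0.76, 0.77, 0.78]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}")
```

Restoring the weights from `best_epoch` corresponds to what `restore_best_weights=True` requests in the cell above.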


**c) Now try adding Batch Normalization and compare the learning curves: Is it
converging faster than before? Does it produce a better model? How does it affect
training speed?**

## <span style='color:blue'>Answer</span>

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, BatchNormalization, Activation
from tensorflow.keras.optimizers import Nadam
from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.activations import elu
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Load CIFAR10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Preprocess the data
x_train, x_test = x_train / 255.0, x_test / 255.0  # Normalize pixel values to be between 0 and 1
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)  # One-hot encode labels

# Build the DNN model with Batch Normalization
model = Sequential()
model.add(Flatten(input_shape=(32, 32, 3)))  # Flatten input images
for _ in range(20):
    model.add(Dense(100, activation=None, kernel_initializer=HeNormal()))  # 20 hidden layers with 100 neurons each
    model.add(BatchNormalization())  # Batch Normalization layer
    model.add(Activation(elu))  # ELU activation function
model.add(Dense(10, activation='softmax'))  # Output layer with 10 neurons for 10 classes

# Compile the model with Nadam optimizer and categorical crossentropy loss
model.compile(optimizer=Nadam(), loss='categorical_crossentropy', metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

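# Train the model, holding out the last 10% of the training set for validation
# so that the test set is not used to decide when to stop
history_with_batch_norm = model.fit(x_train, y_train, epochs=100, batch_size=32,
                                    validation_split=0.1, callbacks=[early_stopping])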

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test, y_test)

print(f"Test Accuracy (with Batch Normalization): {test_accuracy:.2f}")

# Plot learning curves
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy (Without Batch Norm)')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy (Without Batch Norm)')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.title('Learning Curves without Batch Normalization')

plt.subplot(1, 2, 2)
plt.plot(history_with_batch_norm.history['accuracy'], label='Training Accuracy (With Batch Norm)')
plt.plot(history_with_batch_norm.history['val_accuracy'], label='Validation Accuracy (With Batch Norm)')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.title('Learning Curves with Batch Normalization')

plt.tight_layout()
plt.show()
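
As a reminder of what each Batch Normalization layer computes during training, here is a minimal NumPy sketch of the training-mode forward pass: standardize each feature over the batch, then scale by `gamma` and shift by `beta`. The batch size, feature count, and `eps` value are arbitrary choices for this example, and the moving averages used at inference time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Training-mode batch normalization over the batch axis (axis 0)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, roughly unit variance per feature
    return gamma * x_hat + beta              # learned scale and offset

rng = np.random.default_rng(seed=0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # made-up pre-activations
gamma = np.ones(4)   # usual initial scale
beta = np.zeros(4)   # usual initial offset

out = batch_norm_train(batch, gamma, beta)
print("input  mean per feature:", batch.mean(axis=0).round(2))
print("output mean per feature:", out.mean(axis=0).round(6))
print("output std per feature: ", out.std(axis=0).round(4))
```

The extra mean, variance, and scaling operations in every layer are the reason each epoch costs more with Batch Normalization, even when fewer epochs are needed.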


**d) Try replacing Batch Normalization with SELU, and make the necessary adjustments
to ensure the network self-normalizes (i.e., standardize the input features, use
LeCun normal initialization, make sure the DNN contains only a sequence of dense
layers, etc.).**

## <span style='color:blue'>Answer</span>

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.initializers import LeCunNormal
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Load CIFAR10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

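# Preprocess the data: scale to [0, 1], then standardize each input feature
# (zero mean, unit variance) using statistics computed on the training set only
x_train, x_test = x_train / 255.0, x_test / 255.0
pixel_means = x_train.mean(axis=0, keepdims=True)
pixel_stds = x_train.std(axis=0, keepdims=True)
x_train = (x_train - pixel_means) / pixel_stds
x_test = (x_test - pixel_means) / pixel_stds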

# Build the DNN model with SELU activation and LeCun initialization
model = Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))  # Flatten input images
for _ in range(20):
    model.add(Dense(100, activation='selu', kernel_initializer=LeCunNormal()))  # 20 hidden layers with SELU activation
model.add(Dense(10, activation='softmax'))  # Output layer with 10 neurons for 10 classes

# Compile the model with Nadam optimizer and categorical crossentropy loss
model.compile(optimizer='nadam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

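# Train the model, holding out the last 10% of the training set for validation
# so that the test set is not used to decide when to stop
history_with_selu = model.fit(x_train, y_train, epochs=100, batch_size=32,
                              validation_split=0.1, callbacks=[early_stopping])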

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test, y_test)

print(f"Test Accuracy (with SELU activation): {test_accuracy:.2f}")

# Plot learning curves
plt.plot(history_with_selu.history['accuracy'], label='Training Accuracy (SELU)')
plt.plot(history_with_selu.history['val_accuracy'], label='Validation Accuracy (SELU)')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.title('Learning Curves with SELU Activation')
plt.show()


**e) Try regularizing the model with alpha dropout. Then, without retraining your model,
see if you can achieve better accuracy using MC Dropout.**

## <span style='color:blue'>Answer</span>

1. **Regularize the Model with Alpha Dropout:**

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, AlphaDropout
from tensorflow.keras.initializers import LeCunNormal
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

# Load CIFAR10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

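# Preprocess the data: scale to [0, 1], then standardize each input feature
# (zero mean, unit variance) using statistics computed on the training set only
x_train, x_test = x_train / 255.0, x_test / 255.0
pixel_means = x_train.mean(axis=0, keepdims=True)
pixel_stds = x_train.std(axis=0, keepdims=True)
x_train = (x_train - pixel_means) / pixel_stds
x_test = (x_test - pixel_means) / pixel_stds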

# Build the DNN model with SELU activation, LeCun initialization, and Alpha Dropout regularization
model = Sequential()
model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))  # Flatten input images
for _ in range(20):
    model.add(Dense(100, activation='selu', kernel_initializer=LeCunNormal()))  # 20 hidden layers with SELU activation
    model.add(AlphaDropout(rate=0.1))  # Alpha Dropout regularization with dropout rate 0.1
model.add(Dense(10, activation='softmax'))  # Output layer with 10 neurons for 10 classes

# Compile the model with Nadam optimizer and categorical crossentropy loss
model.compile(optimizer='nadam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Define early stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

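# Train the model, holding out the last 10% of the training set for validation
# so that the test set is not used to decide when to stop
history_with_alpha_dropout = model.fit(x_train, y_train, epochs=100, batch_size=32,
                                       validation_split=0.1, callbacks=[early_stopping])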

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(x_test, y_test)

print(f"Test Accuracy (with Alpha Dropout): {test_accuracy:.2f}")

# Plot learning curves
plt.plot(history_with_alpha_dropout.history['accuracy'], label='Training Accuracy (Alpha Dropout)')
plt.plot(history_with_alpha_dropout.history['val_accuracy'], label='Validation Accuracy (Alpha Dropout)')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.title('Learning Curves with Alpha Dropout')
plt.show()


2. **Apply MC Dropout for Improved Accuracy:**

In [None]:
import numpy as np

# Number of Monte Carlo samples (forward passes) for MC Dropout
num_mc_samples = 100

# Predict using MC Dropout: call the model with training=True so the dropout layers
# stay active (model.predict() runs in inference mode, where dropout is turned off,
# so every pass would return the same predictions)
predictions = np.zeros((len(x_test), 10))
for _ in range(num_mc_samples):
    predictions += model(x_test, training=True).numpy()
predictions /= num_mc_samples

# Calculate accuracy based on MC Dropout predictions
# (y_test holds integer class labels with shape (n, 1) here, so flatten it before comparing)
mc_dropout_accuracy = np.mean(np.argmax(predictions, axis=1) == y_test.ravel())
print(f"MC Dropout Accuracy: {mc_dropout_accuracy:.2f}")
