In [1]:
# 1. Is it OK to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

# Ans:
# No. Every weight must be sampled independently. If a single value is drawn (even from the He distribution) and copied to every
# weight, all neurons in a layer receive the same inputs, produce the same outputs, and get the same gradients during backpropagation,
# so they stay identical and the layer behaves like a single neuron.
# Independent random initialization breaks this symmetry and lets the neurons learn different features.
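
# The symmetry argument above can be checked on a toy network. This is a minimal sketch, not part of the original answer: the
# network (2 inputs, 3 tanh hidden units, 1 linear output), the shared value 0.3, and the helper name hidden_weight_gradients
# are all illustrative assumptions. tanh is used only because its derivative is never exactly zero.

```python
import math
import random

def hidden_weight_gradients(w_hidden, w_out, x, target):
    """Gradient of the squared error w.r.t. each hidden neuron's input weights.

    Tiny network: inputs -> tanh hidden layer -> one linear output, no biases.
    """
    activations = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sum(wo * a for wo, a in zip(w_out, activations))
    d_y = 2.0 * (y - target)  # derivative of (y - target) ** 2
    grads = []
    for wo, a in zip(w_out, activations):
        d_pre = d_y * wo * (1.0 - a * a)  # backpropagate through tanh
        grads.append([d_pre * xi for xi in x])
    return grads

x, target = [1.0, -2.0], 0.5

# Case 1: one value copied to every weight (0.3 is an arbitrary stand-in).
shared = 0.3
same_grads = hidden_weight_gradients(
    [[shared, shared] for _ in range(3)], [shared] * 3, x, target)

# Case 2: every weight drawn independently with a He-style standard deviation.
rng = random.Random(0)
w_hidden = [[rng.gauss(0.0, math.sqrt(2.0 / len(x))) for _ in x] for _ in range(3)]
w_out = [rng.gauss(0.0, math.sqrt(2.0 / 3)) for _ in range(3)]
independent_grads = hidden_weight_gradients(w_hidden, w_out, x, target)

print("shared value:", same_grads)         # every neuron gets the same gradient
print("independent: ", independent_grads)  # gradients differ between neurons
```

# With the shared value, each gradient step changes the three neurons identically, so they can never become different.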

In [2]:
# 2. Is it OK to initialize the bias terms to 0?

# Ans:
# Yes, it is generally fine to initialize the bias terms to 0. Symmetry is already broken by the randomly initialized weights,
# so the biases do not need random values. Initializing them like the weights also works and rarely makes much difference.

In [4]:
# 3. Name three advantages of the SELU activation function over ReLU.

# Ans:
# SELU can take negative values, so the average output of each layer stays closer to 0 than with ReLU (whose outputs are never
# negative). This helps alleviate the vanishing gradients problem.
# SELU always has a nonzero derivative, which avoids the "dying units" problem that can affect ReLU.
# When the conditions are right (a plain stack of dense layers, standardized inputs, LeCun normal initialization), SELU makes the
# network self-normalize, which keeps activations and gradients stable even in very deep networks.
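
# A small numerical check of the self-normalizing property: for standard normal inputs, SELU outputs should again have mean close
# to 0 and variance close to 1, while ReLU outputs do not. This is a sketch under assumptions: the sample size and seed are
# arbitrary, and the two constants are the standard SELU alpha and scale values.

```python
import math
import random

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """Scaled ELU: linear for z > 0, scaled exponential for z <= 0."""
    return SELU_SCALE * (z if z > 0.0 else SELU_ALPHA * (math.exp(z) - 1.0))

def relu(z):
    return max(0.0, z)

def mean_and_variance(values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance

rng = random.Random(42)
z = [rng.gauss(0.0, 1.0) for _ in range(200_000)]  # standardized inputs

selu_mean, selu_var = mean_and_variance([selu(v) for v in z])
relu_mean, relu_var = mean_and_variance([relu(v) for v in z])

print(f"SELU: mean={selu_mean:.3f}, variance={selu_var:.3f}")  # near 0 and 1
print(f"ReLU: mean={relu_mean:.3f}, variance={relu_var:.3f}")  # mean shifted above 0
```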

In [5]:
# 4. In which cases would you want to use each of the following activation functions: SELU, leaky
# ReLU (and its variants), ReLU, tanh, logistic, and softmax?

# Ans:
# SELU: A good choice for deep networks built as a plain stack of dense layers, where it lets the network self-normalize and keeps
# gradients stable. The self-normalization guarantee requires standardized inputs and LeCun normal initialization.
# Leaky ReLU (and its variants): Useful for avoiding the "dying ReLU" problem, where ReLU units output 0 for every input and stop
# learning. The small slope for negative inputs keeps gradients flowing. It is a reasonable alternative when SELU's conditions
# are not met.
# ReLU: Widely used in hidden layers because it is simple and fast to compute and does not saturate for positive inputs, which
# helps against vanishing gradients. It is suitable for most hidden layers in deep neural networks.
# tanh: Useful in an output layer when the output must lie between -1 and 1, and still common inside recurrent neural networks (RNNs).
# Its output is zero-centered, but it saturates for large inputs, so it is rarely used in the hidden layers of deep feedforward networks.
# logistic (sigmoid): The logistic activation function (sigmoid) is commonly used in binary classification problems or when outputs 
# need to be interpreted as probabilities. It maps inputs to a range between 0 and 1, enabling probabilistic interpretations.
# softmax: Softmax activation function is used in multi-class classification problems, where the goal is to assign probabilities
# to each class. It normalizes the output values to sum up to 1, allowing for easy interpretation as class probabilities.
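
# The three output-layer activations above differ mainly in the range they produce. A minimal sketch in plain Python follows; the
# function names and the example logits are illustrative, not taken from the original answer.

```python
import math

def logistic(z):
    """Sigmoid, written to avoid overflow for large negative z."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, -1.0])  # one score per class
print("softmax: ", probs, "sum =", sum(probs))
print("logistic:", logistic(-3.0), logistic(0.0), logistic(3.0))        # inside (0, 1)
print("tanh:    ", math.tanh(-3.0), math.tanh(0.0), math.tanh(3.0))    # inside (-1, 1)
```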

In [6]:
# 5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer?

# Ans:
# With momentum this close to 1 (e.g., 0.99999), the optimizer sheds almost none of its accumulated velocity. It picks up a lot of
# speed, shoots past the minimum, turns around, and overshoots again. It may oscillate this way many times before settling, so
# convergence takes much longer than with a smaller momentum value such as 0.9.
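
# The overshooting can be seen on a one-dimensional toy problem. The sketch below runs SGD with momentum on f(x) = x**2 / 2; the
# learning rate, tolerance, step budget, and the helper name steps_to_settle are illustrative assumptions.

```python
def steps_to_settle(momentum, learning_rate=0.1, max_steps=1000, tol=1e-3):
    """Run SGD with momentum on f(x) = x**2 / 2, starting from x = 1.

    Returns (steps, sign_changes). steps is the number of updates until both
    the position and the velocity are within tol of zero, or None if that
    does not happen within max_steps. sign_changes counts how often the
    iterate jumped across the minimum at x = 0.
    """
    x, velocity, sign_changes = 1.0, 0.0, 0
    for step in range(1, max_steps + 1):
        gradient = x  # derivative of x**2 / 2
        velocity = momentum * velocity - learning_rate * gradient
        new_x = x + velocity
        if new_x * x < 0.0:
            sign_changes += 1
        x = new_x
        if abs(x) < tol and abs(velocity) < tol:
            return step, sign_changes
    return None, sign_changes

moderate = steps_to_settle(momentum=0.9)
extreme = steps_to_settle(momentum=0.99999)
print("momentum 0.9:     (steps, sign changes) =", moderate)
print("momentum 0.99999: (steps, sign changes) =", extreme)  # steps is None: still oscillating
```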

In [7]:
# 6. Name three ways you can produce a sparse model.

# Ans:
# Three ways to produce a sparse model are:

# L1 regularization: Adding a strong L1 penalty to the loss pushes the optimizer to drive as many weights as it can to zero.
# Zeroing out small weights: Train the model normally, then set the weights whose magnitude falls below a threshold to zero.
# Pruning during training: Iteratively remove low-magnitude connections while training continues, for example with the TensorFlow
# Model Optimization Toolkit.
# Dropout is not one of these techniques: it zeroes activations at random during training only, and the trained weights stay dense.
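
# A minimal sketch of removing small weights, as described above. The weight values, the threshold, and the helper names are
# made up for illustration; a real model would apply the same idea to each weight matrix.

```python
def prune_small_weights(weights, threshold):
    """Set every weight whose magnitude is below threshold to exactly 0."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.8, -0.03, 0.002, -1.2, 0.04, 0.5]  # hypothetical trained weights
pruned = prune_small_weights(weights, threshold=0.05)

print(pruned)
print("sparsity before:", sparsity(weights), "after:", sparsity(pruned))
```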

In [8]:
# 7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)? What about MC Dropout?

# Ans:
# Yes, dropout slows down training: because a random subset of units is dropped at every step, the model generally needs more
# epochs to converge. It has no impact on inference speed, since dropout is only active during training.

# MC Dropout (Monte Carlo Dropout) keeps dropout active at inference and averages the predictions of several stochastic forward
# passes. A single pass costs about the same as a regular prediction, but running, say, 10 passes makes inference roughly 10 times
# slower. In exchange, the averaged predictions are often more reliable and come with uncertainty estimates.
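
# Why MC Dropout costs more at inference can be sketched without any framework. The toy model below is a single linear unit with
# dropout on its inputs; the weights, inputs, rate, and function names are illustrative assumptions. With a Keras model, the
# stochastic passes would instead come from calling the model with training=True.

```python
import random
import statistics

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability rate, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def predict(x, weights, rate=0.0, rng=None):
    """Linear model; dropout is applied to the inputs only when rate > 0."""
    inputs = dropout(x, rate, rng) if rate > 0.0 else x
    return sum(w * v for w, v in zip(weights, inputs))

def mc_dropout_predict(x, weights, rate, n_samples, rng):
    """Average n_samples stochastic passes; the spread is an uncertainty estimate."""
    samples = [predict(x, weights, rate, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)

rng = random.Random(0)
x, weights = [1.0, 2.0, 3.0, 4.0], [0.5, -0.25, 0.1, 0.2]

plain = predict(x, weights)  # one pass, dropout off
mc_mean, mc_std = mc_dropout_predict(x, weights, rate=0.2, n_samples=2000, rng=rng)
print("regular prediction:", plain)
print("MC Dropout mean and spread:", mc_mean, mc_std)  # n_samples passes instead of one
```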

In [9]:
# 8. Practice training a deep neural network on the CIFAR10 image dataset:
# a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the
# point of this exercise). Use He initialization and the ELU activation function.
# b. Using Nadam optimization and early stopping, train the network on the CIFAR10
# dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is
# composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for 
# testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons.
# Remember to search for the right learning rate each time you change the model’s
# architecture or hyperparameters.
# c. Now try adding Batch Normalization and compare the learning curves: Is it
# converging faster than before? Does it produce a better model? How does it affect training speed?
# d. Try replacing Batch Normalization with SELU, and make the necessary adjustments
# to ensure the network self-normalizes (i.e., standardize the input features, use
# LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).
# e. Try regularizing the model with alpha dropout. Then, without retraining your model,
# see if you can achieve better accuracy using MC Dropout.

# Ans:
# a. Building a DNN: The first step is to construct a deep neural network with 20 hidden layers, each consisting of 100 neurons. 
# He initialization is used to initialize the weights, and the ELU activation function is applied to introduce non-linearity.
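
# One way step a could look in code. This is a sketch assuming TensorFlow 2.x with its bundled Keras; the function name
# build_elu_dnn and its parameters are illustrative.

```python
import tensorflow as tf

def build_elu_dnn(n_hidden=20, n_neurons=100, n_classes=10):
    """Flatten 32x32x3 images, then stack n_hidden dense ELU layers with He initialization."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(32, 32, 3)))
    model.add(tf.keras.layers.Flatten())
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="elu",
                                        kernel_initializer="he_normal"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model

model = build_elu_dnn()
model.summary()
```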

# b. Training with Nadam optimization and early stopping: The network is trained using the CIFAR10 dataset, which contains 60,000 
# 32x32-pixel color images with 10 different classes. Nadam optimization is utilized as the optimizer, and early stopping is 
# implemented to monitor the validation loss and halt training if there is no improvement.
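
# A possible training setup for step b, assuming TensorFlow 2.x. The validation split size, the learning rate of 5e-5, the patience
# of 20 epochs, the pixel scaling, and both helper names are placeholder assumptions; the exercise asks for the learning rate to
# be searched rather than fixed.

```python
import tensorflow as tf

def load_cifar10_splits(n_validation=5000):
    """Load CIFAR10, scale pixels to [0, 1], and hold out a validation set."""
    (X_full, y_full), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
    X_full = X_full.astype("float32") / 255.0
    X_test = X_test.astype("float32") / 255.0
    train = (X_full[n_validation:], y_full[n_validation:])
    valid = (X_full[:n_validation], y_full[:n_validation])
    return train, valid, (X_test, y_test)

def compile_and_fit(model, train, valid, learning_rate=5e-5, epochs=100, patience=20):
    """Train with Nadam; stop when the validation loss stops improving."""
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=tf.keras.optimizers.Nadam(learning_rate=learning_rate),
                  metrics=["accuracy"])
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=patience, restore_best_weights=True)
    return model.fit(train[0], train[1], epochs=epochs,
                     validation_data=valid, callbacks=[early_stopping])
```

# Typical use, given a model built for 32x32x3 inputs:
#   train, valid, test = load_cifar10_splits()
#   history = compile_and_fit(model, train, valid)
#   model.evaluate(*test)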

# c. Adding Batch Normalization: Batch Normalization, a technique that normalizes the inputs of each layer during training, 
# is added to the network. The learning curves are compared to observe if the model converges faster and produces better results. 
# Additionally, the impact on training speed is assessed.
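
# A sketch of the Batch Normalization variant for step c, assuming TensorFlow 2.x. Placing BatchNormalization before the
# activation and dropping the dense bias (the normalization layer has its own offset) are design choices of this sketch, not
# requirements stated in the exercise.

```python
import tensorflow as tf

def build_bn_dnn(n_hidden=20, n_neurons=100, n_classes=10):
    """Same architecture as the ELU network, with Batch Normalization before each activation."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(32, 32, 3)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.BatchNormalization())  # also normalizes the raw inputs
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, kernel_initializer="he_normal",
                                        use_bias=False))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.Activation("elu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model

bn_model = build_bn_dnn()
```

# Comparing the per-epoch validation loss and the time per epoch of this model against the plain ELU network answers the three
# questions in step c.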

# d. Replacing Batch Normalization with SELU: The Batch Normalization layers are removed and the hidden layers use the SELU (Scaled
# Exponential Linear Unit) activation function instead. For the network to self-normalize, the input features are standardized,
# the weights use LeCun normal initialization, and the model stays a plain sequence of dense layers.
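
# A sketch of the SELU variant for step d, assuming TensorFlow 2.x and NumPy. The helper names are illustrative; standardize
# computes the per-feature mean and standard deviation on the training set only and reuses them for the other splits.

```python
import numpy as np
import tensorflow as tf

def build_selu_dnn(n_hidden=20, n_neurons=100, n_classes=10):
    """Plain stack of dense SELU layers with LeCun normal initialization."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(32, 32, 3)))
    model.add(tf.keras.layers.Flatten())
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="selu",
                                        kernel_initializer="lecun_normal"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model

def standardize(X_train, *others):
    """Scale every split with the training set's per-feature mean and std."""
    X_train = X_train.astype(np.float32)
    mean = X_train.mean(axis=0, keepdims=True)
    std = X_train.std(axis=0, keepdims=True) + 1e-7  # avoid division by zero
    return tuple((X.astype(np.float32) - mean) / std for X in (X_train, *others))

selu_model = build_selu_dnn()
```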

# e. Regularizing with alpha dropout and MC Dropout: The model is regularized using alpha dropout, a dropout variant that preserves
# the mean and standard deviation of its inputs so that SELU self-normalization is not broken. Afterward, without retraining, the model
# is evaluated using MC Dropout: several forward passes are run with dropout still active and their predicted probabilities are averaged.
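
# A sketch of step e, assuming TensorFlow 2.x, where tf.keras.layers.AlphaDropout is available (its availability in other Keras
# versions is an assumption). The dropout rate, the single dropout layer before the output, the number of MC samples, and the
# helper names are illustrative choices.

```python
import numpy as np
import tensorflow as tf

def build_selu_dnn_with_alpha_dropout(rate=0.1, n_hidden=20, n_neurons=100, n_classes=10):
    """SELU network with one AlphaDropout layer before the output layer."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(32, 32, 3)))
    model.add(tf.keras.layers.Flatten())
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="selu",
                                        kernel_initializer="lecun_normal"))
    model.add(tf.keras.layers.AlphaDropout(rate=rate))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return model

def mc_dropout_predict_probas(model, X, n_samples=10):
    """Average class probabilities over n_samples passes with dropout active."""
    passes = [np.asarray(model(X, training=True)) for _ in range(n_samples)]
    return np.stack(passes).mean(axis=0)

def mc_dropout_accuracy(model, X, y, n_samples=10):
    """Accuracy of the averaged MC Dropout predictions; no retraining involved."""
    y_pred = mc_dropout_predict_probas(model, X, n_samples).argmax(axis=1)
    return float((y_pred == np.asarray(y).reshape(-1)).mean())
```

# After training the model as in step b (on standardized inputs), mc_dropout_accuracy(model, X_test, y_test) can be compared with
# the accuracy reported by model.evaluate on the same data.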

# By following these steps and experimenting with different techniques, it is possible to explore the effects of various strategies
# on the model's performance, convergence speed, and accuracy on the CIFAR10 dataset.