In [None]:
1. Is it OK to initialize all the weights to the same value as long as that value is selected
randomly using He initialization?
2. Is it OK to initialize the bias terms to 0?
3. Name three advantages of the SELU activation function over ReLU.
4. In which cases would you want to use each of the following activation functions: SELU, leaky
ReLU (and its variants), ReLU, tanh, logistic, and softmax?
5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999)
when using an SGD optimizer?
6. Name three ways you can produce a sparse model.
7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on
new instances)? What about MC Dropout?
8. Practice training a deep neural network on the CIFAR10 image dataset:
a. Build a DNN with 20 hidden layers of 100 neurons each (that’s too many, but it’s the
point of this exercise). Use He initialization and the ELU activation function.
b. Using Nadam optimization and early stopping, train the network on the CIFAR10
dataset. You can load it with keras.datasets.cifar10.load_data(). The dataset is
composed of 60,000 32 × 32–pixel color images (50,000 for training, 10,000 for
testing) with 10 classes, so you’ll need a softmax output layer with 10 neurons.
Remember to search for the right learning rate each time you change the model’s
architecture or hyperparameters.
c. Now try adding Batch Normalization and compare the learning curves: Is it
converging faster than before? Does it produce a better model? How does it affect
training speed?
d. Try replacing Batch Normalization with SELU, and make the necessary adjustments
to ensure the network self-normalizes (i.e., standardize the input features, use
LeCun normal initialization, make sure the DNN contains only a sequence of dense
layers, etc.).
e. Try regularizing the model with alpha dropout. Then, without retraining your model,
see if you can achieve better accuracy using MC Dropout.

In [None]:
1. No. All the weights should be sampled independently; they must not all share the same initial value, even if that value was drawn at random using He initialization. The point of random initialization is to break symmetry: if every weight in a layer starts at the same value, every neuron in that layer computes the same output and receives the same gradient, so backpropagation keeps them identical forever. The layer then behaves like a single neuron, only much slower. He initialization limits vanishing or exploding gradients by scaling the variance of the weights to the number of inputs, but it only helps if each weight is drawn separately.
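
A minimal NumPy sketch of the symmetry problem, assuming a toy one-hidden-layer ReLU network with a mean squared error loss (the network, the random data, and the helper names are illustrative, not part of the exercise). When every weight shares one randomly drawn value, each hidden neuron receives exactly the same gradient; independently drawn He-normal weights give each neuron its own gradient.

```python
import numpy as np

def hidden_weight_gradient(W1, W2, X, y):
    """Gradient of L = 0.5 * mean((y_pred - y)**2) w.r.t. W1 (toy ReLU network)."""
    z = X @ W1                       # pre-activations, shape (n, hidden)
    h = np.maximum(z, 0.0)           # ReLU
    y_pred = h @ W2                  # shape (n, 1)
    d_out = (y_pred - y) / len(X)    # dL/dy_pred
    d_hidden = (d_out @ W2.T) * (z > 0)
    return X.T @ d_hidden            # shape (inputs, hidden)

def columns_identical(grad):
    """True if every hidden neuron (column) gets the same gradient."""
    return bool(np.allclose(grad, grad[:, :1]))

rng = np.random.default_rng(42)
n_inputs, n_hidden = 5, 4
X = rng.normal(size=(64, n_inputs))
y = rng.normal(size=(64, 1))

# Case 1: one value drawn with He scaling, then copied into every weight.
c = rng.normal(scale=np.sqrt(2 / n_inputs))
grad_same = hidden_weight_gradient(
    np.full((n_inputs, n_hidden), c), np.full((n_hidden, 1), c), X, y)

# Case 2: every weight drawn independently with He scaling.
W1 = rng.normal(scale=np.sqrt(2 / n_inputs), size=(n_inputs, n_hidden))
W2 = rng.normal(scale=np.sqrt(2 / n_hidden), size=(n_hidden, 1))
grad_independent = hidden_weight_gradient(W1, W2, X, y)

print("shared value  -> identical gradients:", columns_identical(grad_same))
print("independent   -> identical gradients:", columns_identical(grad_independent))
```

Because the gradients are identical in the first case, a gradient step leaves the hidden neurons identical, and the same holds after every later step.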

2. Yes, it is perfectly fine to initialize the bias terms to 0. The randomly initialized weights already break the symmetry between neurons, so the biases do not need to. Some people initialize biases the same way as weights, and that is fine too; it makes little difference in practice.

3. Three advantages of the SELU (Scaled Exponential Linear Unit) activation function over ReLU (Rectified Linear Unit) are:

   a. **Self-normalization**: Under the right conditions (standardized inputs, LeCun normal initialization, and a plain sequence of dense layers), a SELU network self-normalizes: the output of each layer tends to keep mean 0 and standard deviation 1 during training, which largely removes the vanishing and exploding gradients problem.
   
   b. **Nonzero gradient for negative inputs**: SELU always has a nonzero derivative, so its units cannot die the way ReLU units do. A ReLU neuron whose input stays negative outputs 0, receives a gradient of 0, and stops learning.
   
   c. **Negative outputs**: SELU can output negative values, so the average output of the neurons in a layer is typically closer to 0 than with ReLU, which never outputs negative values. This helps alleviate the vanishing gradients problem.
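
The self-normalizing behaviour can be checked with a short NumPy simulation. This is only a sketch under stated assumptions: the layer width, depth, and sample count are arbitrary, the weights are drawn from a plain normal distribution with variance 1 / fan_in (Keras's `lecun_normal` uses a truncated normal), and no training takes place. It pushes standardized inputs through a deep stack of SELU layers and reports the mean and standard deviation of the final activations.

```python
import numpy as np

# Constants from the SELU definition.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    """SELU: scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    negative_part = ALPHA * (np.exp(np.minimum(z, 0.0)) - 1.0)
    return SCALE * np.where(z > 0, z, negative_part)

def propagate(n_layers=50, n_units=100, n_samples=1000, seed=0):
    """Forward standardized inputs through a deep SELU stack; return (mean, std)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, n_units))          # mean 0, std 1 inputs
    for _ in range(n_layers):
        # LeCun-style initialization: variance = 1 / fan_in
        W = rng.normal(scale=np.sqrt(1.0 / n_units), size=(n_units, n_units))
        a = selu(a @ W)
    return float(a.mean()), float(a.std())

mean, std = propagate()
print(f"after 50 layers: mean={mean:.2f}, std={std:.2f}")
```

The mean should stay near 0 and the standard deviation near 1 even after many layers; replacing `selu` with a plain ReLU in the same loop is an easy way to see the contrast.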

4. Activation function selection depends on the specific characteristics of the problem and the network architecture. Here are some guidelines for choosing activation functions:

   - **SELU**: A good default for deep stacks of dense layers, where its self-normalizing property stabilizes training. It needs standardized inputs and LeCun normal initialization, and self-normalization is not guaranteed in architectures with skip connections or recurrent layers.
   
   - **Leaky ReLU and its variants**: They avoid the dying ReLU problem by keeping a small nonzero gradient for negative inputs. They are a good choice when self-normalization is not available or when you need something cheaper to compute than SELU.
   
   - **ReLU**: Widely used for its simplicity and computational efficiency, and well supported by hardware and library optimizations. Its drawback is that units can die: a neuron whose input stays negative outputs 0, gets a gradient of 0, and stops learning. Its ability to output exactly 0 can be useful when sparse activations are wanted.
   
   - **Tanh**: Useful in the output layer when the target is a number in the range [-1, 1]. It is rarely used in the hidden layers of feedforward networks nowadays, though it remains standard inside recurrent cells such as LSTM and GRU.
   
   - **Logistic (Sigmoid)**: Useful in the output layer when you need to estimate a probability, for example in binary or multilabel classification. It is rarely used in hidden layers, with exceptions such as the gates of LSTM and GRU cells.
   
   - **Softmax**: Useful in the output layer for multiclass classification with mutually exclusive classes. It converts raw scores into probabilities that sum to 1, and it is generally not used in hidden layers.

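5. If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using an SGD optimizer, the algorithm will likely pick up a lot of speed, hopefully heading roughly toward the global minimum, but its momentum will carry it right past the minimum. It then slows down, comes back, accelerates again, overshoots again, and so on. Because almost no friction is left, it may oscillate this way many times before converging, so overall it takes much longer to converge than with a smaller momentum value. For a constant gradient, the terminal velocity equals the learning rate times the gradient times 1 / (1 - β), so β = 0.9 amplifies steps by a factor of 10 while β = 0.99999 amplifies them by a factor of 100,000.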
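
The overshooting can be reproduced on a one-dimensional toy problem. The sketch below assumes the classic momentum update (m ← βm − η∇f, then x ← x + m) applied to f(x) = x² / 2; the function names, learning rate, and step count are illustrative choices, not values from the exercise.

```python
def momentum_descent(beta, learning_rate=0.1, n_steps=500, x0=1.0):
    """Minimize f(x) = x**2 / 2 with momentum; return the trajectory of x."""
    x, m = x0, 0.0
    history = [x]
    for _ in range(n_steps):
        grad = x                               # f'(x) = x
        m = beta * m - learning_rate * grad    # accumulate velocity
        x = x + m
        history.append(x)
    return history

def late_amplitude(history, window=50):
    """Largest |x| over the last `window` steps: how far we still swing."""
    return max(abs(x) for x in history[-window:])

for beta in (0.9, 0.99999):
    amplitude = late_amplitude(momentum_descent(beta))
    print(f"beta={beta}: swing over the last 50 steps = {amplitude:.3g}")
```

With β = 0.9 the iterate settles at the minimum well within 500 steps, whereas with β = 0.99999 it is still swinging back and forth across the minimum with almost undiminished amplitude.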

6. Three ways to produce a sparse model are:

   - **L1 Regularization (Lasso)**: Introduces a penalty term proportional to the absolute values of weights, encouraging sparsity by driving some weights to zero. It promotes feature selection by shrinking less important weights towards zero.
   
   - **Zeroing out tiny weights**: Train the model normally, then set every weight whose magnitude falls below a small threshold to zero. (Dropout, by contrast, does not produce a sparse model: it masks activations during training only, and the trained weights remain dense.)
   
   - **Pruning during training**: Iteratively remove the lowest-magnitude weights while training continues, so that the remaining weights can compensate. The TensorFlow Model Optimization Toolkit implements this approach (for example, with `prune_low_magnitude`). Pruning reduces model size and computational cost while largely preserving accuracy.
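
Magnitude-based sparsification, in its simplest one-shot form, can be sketched in a few lines of NumPy. The helper name `zero_out_small_weights` and the threshold are hypothetical; in a Keras workflow the array would come from something like `layer.get_weights()`, which is an assumption about how the model is stored rather than part of the answer above.

```python
import numpy as np

def zero_out_small_weights(weights, threshold=1e-2):
    """Set weights with |w| < threshold to 0; return (pruned copy, sparsity)."""
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    sparsity = float(np.mean(pruned == 0.0))   # fraction of exact zeros
    return pruned, sparsity

weights = np.array([[0.5, -0.001],
                    [0.02, 0.0005]])
pruned, sparsity = zero_out_small_weights(weights)
print(pruned)
print(f"sparsity: {sparsity:.0%}")
```

Training with L1 regularization first makes this step far more effective, since many weights will already sit very close to zero.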

7. Yes, dropout slows down training, in general roughly by a factor of two, because each step trains a noisy, thinned version of the network. It has no impact on inference speed, since it is only active during training. MC Dropout is identical to dropout during training, but it stays active at inference, so each individual forward pass is very slightly slower. More importantly, MC Dropout runs the prediction many times (typically 10 or more) and averages the results, so making predictions is slower by that same factor. In exchange, you get better-calibrated probabilities and an estimate of the model's uncertainty.
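
The difference in inference cost is easy to see in a bare NumPy sketch of inverted dropout. The function names are hypothetical and the "layer" is reduced to the dropout mask itself: standard inference is a no-op, while MC Dropout needs `n_passes` stochastic passes that are then averaged.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: active only when training=True."""
    if not training:
        return activations                       # no cost at inference
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)  # rescale the kept units

def mc_dropout_mean(activations, rate, n_passes, rng):
    """MC Dropout: average n_passes stochastic passes (n_passes times the cost)."""
    passes = [dropout(activations, rate, True, rng) for _ in range(n_passes)]
    return np.mean(passes, axis=0)

rng = np.random.default_rng(0)
a = np.ones(1000)
print("inference output unchanged:", dropout(a, 0.5, False, rng) is a)
print("MC mean over 200 passes:   ", mc_dropout_mean(a, 0.5, 200, rng).mean())
```

The spread of the individual passes, not shown here, is what provides the uncertainty estimate.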

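8. Training a deep neural network on the CIFAR10 dataset:

   a. Build a DNN with 20 hidden `Dense` layers of 100 neurons each, using He initialization and the ELU activation function. The 32 × 32 × 3 images must be flattened first, and the output layer is a softmax layer with 10 neurons.
   
   b. Train the network with the Nadam optimizer and an early stopping callback that monitors a validation set held out from the 50,000 training images. Search for a good learning rate before the real run.
   
   c. Add a Batch Normalization layer to each hidden layer (for example, before the activation function) and compare the learning curves: check whether the model converges in fewer epochs, whether the final validation accuracy improves, and how much longer each epoch takes.
   
   d. Replace Batch Normalization with SELU. For the network to self-normalize, standardize the input features, use LeCun normal initialization, and keep the DNN a plain sequence of dense layers.
   
   e. Regularize the model with alpha dropout, which preserves the mean and standard deviation of its inputs and so does not break self-normalization. Then, without retraining, keep dropout active at prediction time and average several predictions (MC Dropout) to see whether accuracy improves.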
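
The five steps can be sketched with `tf.keras` as follows. This is one possible layout, not a reference solution: the helper names (`build_dnn`, `standardize`, `fit_with_early_stopping`, `mc_dropout_predict`), the learning rate, the dropout rate, and the patience are all placeholder assumptions that should be tuned, and the single alpha dropout layer before the output is just one reasonable placement.

```python
import numpy as np
from tensorflow import keras

def build_dnn(variant="elu", n_hidden=20, n_units=100, dropout_rate=0.1,
              learning_rate=5e-5):
    """variant: "elu" (a), "bn" (c), "selu" (d) or "alpha_dropout" (e)."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(32, 32, 3)))
    model.add(keras.layers.Flatten())
    for _ in range(n_hidden):
        if variant == "bn":
            # Dense -> BatchNorm -> ELU; BN supplies the offset, so no bias
            model.add(keras.layers.Dense(n_units, use_bias=False,
                                         kernel_initializer="he_normal"))
            model.add(keras.layers.BatchNormalization())
            model.add(keras.layers.Activation("elu"))
        elif variant in ("selu", "alpha_dropout"):
            model.add(keras.layers.Dense(n_units, activation="selu",
                                         kernel_initializer="lecun_normal"))
        else:
            model.add(keras.layers.Dense(n_units, activation="elu",
                                         kernel_initializer="he_normal"))
    if variant == "alpha_dropout":
        model.add(keras.layers.AlphaDropout(dropout_rate))
    model.add(keras.layers.Dense(10, activation="softmax"))
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.Nadam(learning_rate=learning_rate),
                  metrics=["accuracy"])
    return model

def standardize(X_train, *others):
    """Scale every feature to mean 0, std 1 using training statistics only."""
    mean = X_train.mean(axis=0, keepdims=True)
    std = X_train.std(axis=0, keepdims=True) + 1e-7
    return [(X - mean) / std for X in (X_train, *others)]

def fit_with_early_stopping(model, X_train, y_train, X_valid, y_valid,
                            epochs=100, patience=20):
    """Train until validation loss stops improving; keep the best weights."""
    stop = keras.callbacks.EarlyStopping(patience=patience,
                                         restore_best_weights=True)
    return model.fit(X_train, y_train, epochs=epochs,
                     validation_data=(X_valid, y_valid), callbacks=[stop])

def mc_dropout_predict(model, X, n_samples=10):
    """Average n_samples predictions made with dropout left active."""
    probas = [np.array(model(X, training=True)) for _ in range(n_samples)]
    return np.mean(probas, axis=0)

# Typical use (downloads the dataset, so it is left commented out):
# (X_full, y_full), (X_test, y_test) = keras.datasets.cifar10.load_data()
# X_train, X_valid = X_full[5000:] / 255.0, X_full[:5000] / 255.0
# y_train, y_valid = y_full[5000:], y_full[:5000]
# X_train, X_valid = standardize(X_train, X_valid)   # needed for (d) and (e)
# history = fit_with_early_stopping(build_dnn("selu"),
#                                   X_train, y_train, X_valid, y_valid)
```

Calling the model with `training=True` is only a safe shortcut for MC Dropout here because the alpha dropout variant contains no Batch Normalization layers, which would otherwise also switch to training mode. Comparing the `history.history` curves of the `"elu"`, `"bn"`, and `"selu"` variants answers the questions in steps c and d.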