1. Initializing all weights to the same value is not recommended, even if that value is drawn at random using He initialization. He initialization samples each weight independently from a zero-mean distribution whose variance is scaled to the layer's fan-in (2 / fan_in), which helps limit vanishing and exploding gradients. The independent sampling is what breaks symmetry: if every weight in a layer starts at the same value, every neuron in that layer computes the same output and receives the same gradient, so backpropagation can never make them differ and the layer behaves as if it had a single neuron. Each weight should therefore be sampled independently.
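
The symmetry problem can be seen in a few lines of NumPy. The sketch below is illustrative only: the tiny one-hidden-layer network, the synthetic data, and the helper name `train_tiny_mlp` are assumptions made for the demonstration, not part of the original exercise.

```python
import numpy as np

def train_tiny_mlp(W1, W2, X, y, lr=0.1, steps=50):
    """Plain gradient descent on a 1-hidden-layer tanh network with MSE loss."""
    W1, W2 = W1.copy(), W2.copy()
    for _ in range(steps):
        H = np.tanh(X @ W1)                 # hidden activations, shape (n, hidden)
        err = H @ W2 - y                    # prediction error, shape (n, 1)
        grad_W2 = H.T @ err / len(X)
        grad_H = (err @ W2.T) * (1 - H**2)  # backpropagate through tanh
        grad_W1 = X.T @ grad_H / len(X)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W1, W2

rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 4
X = rng.normal(size=(32, n_inputs))   # synthetic inputs (assumption)
y = rng.normal(size=(32, 1))          # synthetic targets (assumption)

# Case 1: every weight in a layer shares one value (here the constant 0.5).
W1_same, _ = train_tiny_mlp(np.full((n_inputs, n_hidden), 0.5),
                            np.full((n_hidden, 1), 0.5), X, y)

# Case 2: He initialization, each weight drawn independently with std = sqrt(2 / fan_in).
W1_he, _ = train_tiny_mlp(
    rng.normal(scale=np.sqrt(2 / n_inputs), size=(n_inputs, n_hidden)),
    rng.normal(scale=np.sqrt(2 / n_hidden), size=(n_hidden, 1)), X, y)

# Each column of W1 holds one hidden neuron's incoming weights.
same_init_identical = np.allclose(W1_same[:, 0], W1_same[:, 1])
he_init_identical = np.allclose(W1_he[:, 0], W1_he[:, 1])
print("identical neurons after training (same-value init):", same_init_identical)
print("identical neurons after training (He init):        ", he_init_identical)
```

With the shared starting value, the hidden neurons receive identical gradients at every step, so their weight columns never diverge. With independent He-style sampling they start different and stay different.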

2. Initializing bias terms to 0 is fine and is common practice. Symmetry is already broken by the randomly initialized weights, so the biases do not need to be random, and 0 gives each activation function a neutral starting point. Some practitioners use small positive biases with ReLU so that fewer units start out inactive, but this is rarely necessary and makes little difference in most cases.

3. Three advantages of the SELU (Scaled Exponential Linear Unit) activation function over ReLU (Rectified Linear Unit):
   - **Self-normalization:** In a stack of dense layers with standardized inputs and LeCun normal initialization, SELU keeps each layer's outputs close to mean 0 and standard deviation 1 during training. This mitigates vanishing and exploding gradients and makes training more stable.
   - **Non-zero gradient for negative inputs:** ReLU's derivative is exactly 0 for negative inputs, so a neuron can get stuck outputting 0 (the "dying ReLU" problem). SELU's derivative is non-zero everywhere, so units keep learning. Note that SELU is continuous but, unlike ELU with α = 1, its slope changes abruptly at 0.
   - **Negative outputs:** SELU can output negative values, so a layer's average output can sit closer to 0 than with ReLU, which only outputs non-negative values. When self-normalization holds, this can remove the need for Batch Normalization.
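
The self-normalizing behavior can be checked numerically. The following is a minimal NumPy sketch, assuming randomly initialized (untrained) dense layers with LeCun normal initialization and standardized random inputs; the `propagate` helper and its parameters are hypothetical. ReLU is run with the same initialization purely for contrast (it would normally be paired with He initialization).

```python
import numpy as np

# SELU constants as published with the activation function.
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    # np.minimum keeps exp() from overflowing on large positive inputs.
    return SCALE * np.where(z > 0, z, ALPHA * (np.exp(np.minimum(z, 0.0)) - 1))

def relu(z):
    return np.maximum(0.0, z)

def propagate(activation, n_layers=20, n_units=100, n_samples=1000, seed=0):
    """Push standardized inputs through randomly initialized dense layers."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_samples, n_units))  # standardized inputs
    for _ in range(n_layers):
        # LeCun normal initialization: std = sqrt(1 / fan_in)
        W = rng.normal(scale=np.sqrt(1 / n_units), size=(n_units, n_units))
        Z = activation(Z @ W)
    # Overall mean, and per-unit standard deviation averaged over units.
    return Z.mean(), Z.std(axis=0).mean()

selu_mean, selu_std = propagate(selu)
relu_mean, relu_std = propagate(relu)
print(f"SELU after 20 layers: mean {selu_mean:.2f}, std {selu_std:.2f}")
print(f"ReLU after 20 layers: mean {relu_mean:.4f}, std {relu_std:.4f}")
```

Under these assumptions the SELU activations stay near mean 0 and standard deviation 1, while the ReLU activations shrink toward 0 layer after layer.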

4. Use cases for different activation functions:
   - **SELU:** Use SELU in deep neural networks, especially when network self-normalization is desired. It's a good choice for feedforward architectures with many hidden layers.
   - **Leaky ReLU (and variants):** Leaky ReLU variants like Parametric ReLU (PReLU) are helpful when you want to mitigate the "dying ReLU" problem, which can occur with standard ReLU. They are robust and can be used in various scenarios.
   - **ReLU:** A good default for hidden layers: it is simple and fast to compute, and it is well supported by libraries and hardware accelerators. Its ability to output exact zeros can also be useful.
   - **tanh:** Use tanh in the output layer when outputs must lie between -1 and 1. It is less common in the hidden layers of feedforward networks today, though it is still widely used in recurrent networks.
   - **Logistic (Sigmoid):** Use sigmoid in the output layer when you need to estimate a probability, as in binary classification. It is rarely used in hidden layers.
   - **Softmax:** Softmax is used in the output layer of multi-class classification problems to produce class probabilities.
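
To make the output ranges above concrete, here is a small NumPy sketch of several of these functions. The function names and the sample input `z` are chosen for illustration; in practice you would use the built-in Keras activations.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the slope for negative inputs; PReLU learns it instead of fixing it.
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
activations = [("relu", relu), ("leaky_relu", leaky_relu), ("tanh", np.tanh),
               ("sigmoid", sigmoid), ("softmax", softmax)]
for name, fn in activations:
    print(f"{name:>10}: {np.round(fn(z), 3)}")
```

ReLU clips negatives to 0, leaky ReLU lets a small fraction through, tanh and sigmoid squash into (-1, 1) and (0, 1) respectively, and softmax turns the whole vector into values that sum to 1.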

5. Setting the momentum hyperparameter too close to 1 (e.g., 0.99999) in an SGD optimizer makes the optimizer retain almost all of its past velocity. It picks up a lot of speed, heads roughly toward the minimum, and then shoots right past it. It then slows down, comes back, and overshoots again, oscillating many times before settling. Convergence therefore takes much longer than with a smaller value such as 0.9.
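
The overshooting can be reproduced on a one-dimensional toy problem. This is a sketch under simple assumptions: the objective f(x) = x², the learning rate, the step count, and the helper names are all chosen for illustration.

```python
def run_momentum_gd(beta, lr=0.01, steps=2000, x0=1.0):
    """Minimize f(x) = x**2 with momentum; return the trajectory of x."""
    x, velocity = x0, 0.0
    trajectory = []
    for _ in range(steps):
        grad = 2.0 * x                          # derivative of x**2
        velocity = beta * velocity - lr * grad  # accumulate past gradients
        x += velocity
        trajectory.append(x)
    return trajectory

def tail_amplitude(trajectory, window=100):
    """Largest distance from the minimum (x = 0) over the last `window` steps."""
    return max(abs(x) for x in trajectory[-window:])

moderate = tail_amplitude(run_momentum_gd(beta=0.9))
extreme = tail_amplitude(run_momentum_gd(beta=0.99999))
print(f"momentum 0.9:     within {moderate:.1e} of the minimum after 2000 steps")
print(f"momentum 0.99999: still swinging up to {extreme:.2f} away after 2000 steps")
```

With momentum 0.9 the oscillations die out quickly. With 0.99999 almost no velocity is lost per step, so the iterate keeps swinging back and forth across the minimum with nearly its initial amplitude.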

6. Three ways to produce a sparse model:
   - **L1 Regularization (Lasso):** Apply L1 regularization to the model's weights during training. This encourages many weights to become exactly zero, resulting in a sparse model.
   - **Pruning:** After training a model, prune (remove) weights that are close to zero or have low impact on the model's performance. This results in a sparse network.
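   - **Pruning during training:** Rather than pruning once at the end, progressively zero out the smallest-magnitude weights while training, for example with the TensorFlow Model Optimization Toolkit. Note that dropout does not produce a sparse model: it deactivates neurons only temporarily during training, and the trained weights remain dense.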
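
Magnitude-based pruning can be sketched in a few lines of NumPy. The helper `prune_by_magnitude` is hypothetical and the random matrix merely stands in for trained weights; a real workflow would usually fine-tune the model after pruning.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold  # keep only weights above the threshold
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 100))  # stand-in for a trained weight matrix (assumption)
W_sparse = prune_by_magnitude(W, sparsity=0.8)

zero_fraction = np.mean(W_sparse == 0)
print(f"fraction of zero weights: {zero_fraction:.2f}")
```

The surviving weights are left untouched, so only the smallest 80% by magnitude are removed.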

7. Yes, dropout slows down training: each step updates only a random subset of the network, so the gradient estimates are noisier and the model typically needs more epochs to converge. In exchange, it helps prevent overfitting, which can lead to better generalization.

   Dropout is active only during training. During inference (making predictions on new instances) it is turned off, so it has no impact on inference speed.

   MC Dropout (Monte Carlo Dropout) keeps dropout active at inference time and averages the predictions over many stochastic forward passes. Training is unaffected, but inference is slower by roughly the number of passes (for example, 100 passes cost about 100 times as much as one). In return, it can provide more reliable predictions and uncertainty estimates.
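
The mechanics of MC Dropout can be illustrated without a deep learning framework. The sketch below assumes a tiny network with fixed, untrained random weights; the `forward` helper, the layer sizes, and the dropout rate are hypothetical and only show how the stochastic passes are combined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny network with fixed random weights, for illustration only.
W1 = rng.normal(scale=0.5, size=(4, 16))
W2 = rng.normal(scale=0.5, size=(16, 3))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, rate=0.0):
    """One forward pass; rate > 0 applies (inverted) dropout to the hidden layer."""
    h = np.maximum(0.0, X @ W1)
    if rate > 0.0:
        keep = rng.random(h.shape) >= rate
        h = h * keep / (1.0 - rate)  # rescale so the expected activation is unchanged
    return softmax(h @ W2)

X = rng.normal(size=(5, 4))

y_single = forward(X)  # standard inference: dropout off, one pass
y_samples = np.stack([forward(X, rate=0.2) for _ in range(100)])  # 100 stochastic passes
y_mean = y_samples.mean(axis=0)  # MC Dropout prediction
y_std = y_samples.std(axis=0)    # spread across passes, a rough uncertainty signal

print("single pass shape:", y_single.shape)
print("MC samples shape: ", y_samples.shape)
```

The 100 passes are what make MC Dropout slower at inference time; the per-class standard deviation across passes is the extra information gained in exchange.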

8. Here's how you can perform the tasks mentioned for training a deep neural network on the CIFAR10 dataset using Keras:

   a. Build a DNN with 20 hidden layers of 100 neurons each, using He initialization and the ELU activation function:

   ```python
   import tensorflow as tf
   from tensorflow import keras

   model = keras.models.Sequential()
   model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
   for _ in range(20):
       model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
   model.add(keras.layers.Dense(10, activation="softmax"))
   ```

   b. Train the network using Nadam optimization and early stopping:

   ```python
   (X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()
   X_train_full = X_train_full / 255.0
   X_test = X_test / 255.0
   early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
   model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
   model.fit(X_train_full, y_train_full, epochs=100, validation_split=0.1, callbacks=[early_stopping_cb])
   ```

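   c. Add Batch Normalization to the model, retrain it, and compare learning curves:

   ```python
   model = keras.models.Sequential()
   model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
   for _ in range(20):
       model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
       model.add(keras.layers.BatchNormalization())
   model.add(keras.layers.Dense(10, activation="softmax"))

   # Train with the same settings as in (b) so the learning curves are comparable.
   model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
   history_bn = model.fit(X_train_full, y_train_full, epochs=100, validation_split=0.1, callbacks=[early_stopping_cb])
   # history_bn.history["val_loss"] holds the per-epoch validation loss to compare against (b).
   ```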

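   d. Replace Batch Normalization with SELU and make the necessary adjustments (standardized inputs, LeCun normal initialization, dense layers only):

   ```python
   # SELU only self-normalizes with standardized inputs; reuse these statistics for the test set.
   X_means = X_train_full.mean(axis=0)
   X_stds = X_train_full.std(axis=0)
   X_train_scaled = (X_train_full - X_means) / X_stds

   model = keras.models.Sequential()
   model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
   for _ in range(20):
       model.add(keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"))
   model.add(keras.layers.Dense(10, activation="softmax"))
   ```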

   e. Regularize the model with alpha dropout, then explore MC Dropout (which reuses the trained model and does not require retraining):

   ```python
   import numpy as np
   from tensorflow.keras.layers import AlphaDropout

   model = keras.models.Sequential()
   model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
   for _ in range(20):
       model.add(keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"))
       # A rate of 0.5 after each of 20 layers is very aggressive; a lower rate
       # such as 0.1 is a safer starting point and can be tuned from there.
       model.add(AlphaDropout(rate=0.1))
   model.add(keras.layers.Dense(10, activation="softmax"))

   # MC Dropout: after compiling and training as in (b), call the model with
   # training=True so dropout stays active (model.predict() turns it off), then
   # average the class probabilities over several stochastic forward passes.
   # X_test must be preprocessed exactly like the training data.
   y_probas = np.stack([model(X_test, training=True).numpy() for _ in range(100)])
   y_proba = y_probas.mean(axis=0)
   ```

   Remember to fine-tune hyperparameters and monitor training performance to achieve the best results.