1. **Logistic Regression vs. Perceptron:**
   - Logistic Regression is generally preferred over the classical Perceptron for several reasons:
     - Logistic Regression outputs a class probability between 0 and 1, so predictions can be thresholded or ranked by confidence.
     - The Perceptron outputs only a hard class label, which has no probabilistic interpretation.
     - The Perceptron learning rule converges only when the training data is linearly separable, whereas Logistic Regression still finds a reasonable solution when it is not.
   - To make a Perceptron equivalent to a Logistic Regression classifier, you can:
     - Replace the step function (threshold activation) with the logistic (sigmoid) activation function.
     - Use a training algorithm like gradient descent with cross-entropy loss instead of the Perceptron training algorithm.
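
The two changes above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `train_logistic_unit`, the learning rate, the epoch count, and the toy dataset are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_unit(X, y, lr=0.5, n_epochs=2000):
    """Train one sigmoid unit with batch gradient descent on cross-entropy loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_epochs):
        p = sigmoid(X @ w + b)        # predicted probabilities
        error = p - y                 # per-sample gradient of cross-entropy w.r.t. the logit
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b

# Hypothetical toy data: two well-separated clusters in 2D.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.6],
              [2.0, 2.0], [1.8, 2.5], [2.5, 1.7]])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = train_logistic_unit(X, y)
probs = sigmoid(X @ w + b)            # continuous outputs, unlike a step function
preds = (probs >= 0.5).astype(int)    # threshold only when a hard label is needed
```

Thresholding `probs` at 0.5 recovers a Perceptron-style hard decision, while the probabilities themselves remain available for ranking or for choosing a different threshold.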

2. **Logistic Activation in MLPs:**
   - The logistic (sigmoid) activation function was crucial in training the first MLPs because it is smooth and differentiable everywhere, with a nonzero derivative at every point. The step function it replaced is flat on both sides of its threshold, so its gradient is zero almost everywhere and gradient descent has nothing to follow.
   - Because the logistic function always has a usable slope, gradient information can flow backward through the network during backpropagation, which makes it possible to adjust the weights and biases at every layer.
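
As a small numerical illustration of this point, the sketch below compares finite-difference derivatives of a step function and of the sigmoid. The helper `numerical_derivative` and the sample points are assumptions chosen for the example.

```python
import numpy as np

def step(z):
    """Threshold activation used by the classical Perceptron."""
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_derivative(f, z, eps=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-2.0, -0.5, 0.5, 2.0])   # sample points away from the threshold

step_grad = numerical_derivative(step, z)        # zero: the step is flat at these points
sigmoid_grad = numerical_derivative(sigmoid, z)  # strictly positive everywhere
analytic_grad = sigmoid(z) * (1 - sigmoid(z))    # closed form: sigma'(z) = sigma(z)(1 - sigma(z))
```

With a zero gradient, a weight update of the form `w -= lr * grad` never changes the weights, which is why the step function cannot be trained by gradient descent.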

3. **Popular Activation Functions:**
   - Three popular activation functions are:
     1. **Sigmoid Function:**
        - Formula: σ(x) = 1 / (1 + e^(-x))
     2. **ReLU (Rectified Linear Unit) Function:**
        - Formula: ReLU(x) = max(0, x)
     3. **Tanh (Hyperbolic Tangent) Function:**
        - Formula: tanh(x) = (e^(x) - e^(-x)) / (e^(x) + e^(-x))
   - Drawing them requires plotting their respective curves, which can be done using libraries like Matplotlib.
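
One way to draw the three curves is sketched below, assuming NumPy and Matplotlib are installed. The function name `plot_activations` and the plotting range are illustrative choices; Matplotlib is imported inside the function so the activation definitions can be used without it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def tanh(x):
    # Written out to match the formula above; np.tanh(x) gives the same values.
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def plot_activations(x_min=-5.0, x_max=5.0, n_points=200):
    """Plot the three activation functions on one set of axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    x = np.linspace(x_min, x_max, n_points)
    for name, fn in [("sigmoid", sigmoid), ("ReLU", relu), ("tanh", tanh)]:
        plt.plot(x, fn(x), label=name)
    plt.axhline(0, color="gray", linewidth=0.5)
    plt.xlabel("x")
    plt.legend()
    plt.grid(True)
    plt.show()
```

Calling `plot_activations()` shows the S-shaped sigmoid bounded in (0, 1), tanh with the same shape but bounded in (-1, 1), and ReLU as a flat line for negative inputs followed by the identity.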

4. **Shapes in MLP:**
   - Shape of Input Matrix X: (batch_size, num_inputs)
   - Shape of Hidden Layer's Weight Matrix Wh: (num_inputs, num_hidden_neurons)
   - Shape of Hidden Layer's Bias Vector bh: (num_hidden_neurons,)
   - Shape of Output Layer's Weight Matrix Wo: (num_hidden_neurons, num_output_neurons)
   - Shape of Output Layer's Bias Vector bo: (num_output_neurons,)
   - Shape of Network's Output Matrix Y: (batch_size, num_output_neurons)
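   - Equation for Y: Y = ReLU(X · Wh + bh) · Wo + bo, where · denotes matrix multiplication and each bias vector is broadcast across the rows of the batch.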
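
The shapes can be checked with a short NumPy sketch. The sizes below (32, 10, 50, 3) are arbitrary values picked for illustration; any positive integers would work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the text above).
batch_size, num_inputs = 32, 10
num_hidden_neurons, num_output_neurons = 50, 3

X = rng.normal(size=(batch_size, num_inputs))
Wh = rng.normal(size=(num_inputs, num_hidden_neurons))
bh = np.zeros(num_hidden_neurons)
Wo = rng.normal(size=(num_hidden_neurons, num_output_neurons))
bo = np.zeros(num_output_neurons)

# Hidden layer: (32, 10) @ (10, 50) -> (32, 50); bh is broadcast over the batch.
H = np.maximum(0, X @ Wh + bh)

# Output layer: (32, 50) @ (50, 3) -> (32, 3); bo is broadcast over the batch.
Y = H @ Wo + bo

print(H.shape, Y.shape)
```

Note that `@` is matrix multiplication; NumPy's `*` would attempt an element-wise product and fail on these shapes.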

5. **Output Layer Neurons for Email Spam Classification and MNIST:**
   - For email spam classification (binary classification), you need one neuron in the output layer with a sigmoid activation function.
   - For MNIST (multi-class classification with 10 classes), you need 10 neurons in the output layer with a softmax activation function.
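
The difference between the two output layers can be sketched in NumPy. The logit values below are made up for the example; in a real network they would come from the last layer's weighted sums.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Spam filter: one output neuron per email, giving P(spam).
spam_logits = np.array([[1.2], [-0.7]])        # hypothetical logits, shape (2, 1)
spam_probs = sigmoid(spam_logits)
is_spam = spam_probs >= 0.5

# MNIST: ten output neurons per image, giving one probability per digit.
digit_logits = rng.normal(size=(2, 10))        # hypothetical logits, shape (2, 10)
digit_probs = softmax(digit_logits)
predicted_digits = digit_probs.argmax(axis=1)  # most probable class per image
```

Each row of `digit_probs` sums to 1, which is what makes softmax appropriate when the classes are mutually exclusive.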

6. **Backpropagation vs. Reverse-Mode Autodiff:**
   - Backpropagation refers to the overall procedure for training a neural network: a forward pass computes the predictions and the loss, a reverse pass computes the gradient of the loss with respect to every model parameter, and a gradient descent step then updates the parameters.
   - Reverse-Mode Autodiff is only the technique for computing those gradients efficiently on a computational graph. It needs one forward and one backward traversal of the graph, however many parameters there are, and backpropagation uses it for its reverse pass.
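
To make the distinction concrete, the sketch below writes out the reverse pass by hand for a tiny one-hidden-layer network, then applies a single gradient descent step. The network sizes, the mean squared error loss, and the learning rate are assumptions chosen to keep the example small; an autodiff framework would derive the reverse pass automatically.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 3))     # 4 samples, 3 features (illustrative)
y = rng.normal(size=(4, 1))
Wh = rng.normal(size=(3, 5)); bh = np.zeros(5)
Wo = rng.normal(size=(5, 1)); bo = np.zeros(1)

def forward(Wh, bh, Wo, bo):
    """Forward pass: keep intermediate values, since the reverse pass reuses them."""
    Z = X @ Wh + bh
    H = np.maximum(0, Z)                 # ReLU
    Y = H @ Wo + bo
    loss = np.mean((Y - y) ** 2)         # mean squared error
    return Z, H, Y, loss

Z, H, Y, loss = forward(Wh, bh, Wo, bo)

# Reverse pass: apply the chain rule from the loss back toward the inputs.
dY = 2 * (Y - y) / Y.size                # d loss / d Y
dWo = H.T @ dY
dbo = dY.sum(axis=0)
dH = dY @ Wo.T
dZ = dH * (Z > 0)                        # ReLU passes gradient only where Z > 0
dWh = X.T @ dZ
dbh = dZ.sum(axis=0)

# Gradient descent step: together with the passes above, this is backpropagation.
lr = 1e-3
new_loss = forward(Wh - lr * dWh, bh - lr * dbh, Wo - lr * dWo, bo - lr * dbo)[3]
```

Everything from `dY` to `dbh` is the gradient computation that reverse-mode autodiff performs; the final update is the part that turns it into a training algorithm.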

7. **Hyperparameters in MLP:**
   - Learning Rate
   - Number of Hidden Layers
   - Number of Neurons in Each Layer
   - Activation Functions
   - Batch Size
   - Weight Initialization
   - Regularization (e.g., Dropout)
   - Optimizer (e.g., SGD, Adam)
   - Loss Function
   - Number of Epochs
   - Early Stopping
   - Learning Rate Schedule

   If the MLP overfits, you can:
   - Decrease the model complexity (reduce the number of neurons or layers).
   - Apply regularization techniques (e.g., dropout).
   - Increase the amount of training data.
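   - Use early stopping or train for fewer epochs, so training ends before the network starts memorizing the training set.
   - Re-tune the remaining hyperparameters (e.g., the learning rate and its schedule) against a validation set.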

8. **Training Deep MLP on MNIST:**
   - Training a deep MLP on MNIST with all the mentioned features takes a fair amount of code. It is typically done with a framework such as TensorFlow or PyTorch and involves several steps: loading the data, defining the model, choosing an optimizer and loss, and monitoring training. The full code is too long to include here; online tutorials and the framework documentation provide detailed examples.