#### 1. Convergence.
    a. [E] When we say an algorithm converges, what does convergence mean?
    b. [E] How do we know when a model has converged?

**Answer:**

a. When we say an algorithm converges, it means that it approaches a specific value or set of values as the number of iterations increases. This means that the algorithm's output becomes stable and does not change significantly with further iterations.

b. A model has converged when the change in the value of the objective function, or the change in the model's parameters, is below a certain threshold. Additionally, if the training error stops decreasing or plateaus, the model is considered to have converged. Other techniques like monitoring the change in the validation error or using early stopping can also be used to determine convergence.
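
The threshold-based stopping rule described above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name `gradient_descent`, the quadratic objective, and the values of `lr` and `tol` are all assumptions chosen for the example.

```python
def gradient_descent(grad, loss, x0, lr=0.1, tol=1e-8, max_iters=10_000):
    """Minimise `loss` starting from x0.

    Stops when the loss changes by less than `tol` between two
    consecutive iterations (the convergence criterion), or when
    `max_iters` is reached (no convergence).
    """
    x = x0
    prev = loss(x)
    for step in range(1, max_iters + 1):
        x = x - lr * grad(x)          # one gradient step
        cur = loss(x)
        if abs(prev - cur) < tol:     # objective has stopped changing
            return x, step, True
        prev = cur
    return x, max_iters, False


# Hypothetical objective: f(x) = (x - 3)^2, minimised at x = 3.
x_star, steps, converged = gradient_descent(
    grad=lambda x: 2 * (x - 3),
    loss=lambda x: (x - 3) ** 2,
    x0=0.0,
)
print(x_star, steps, converged)
```

The same pattern applies to a change-in-parameters criterion: replace the loss difference with the distance between successive parameter values.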

#### 2. [E] Draw the loss curves for overfitting and underfitting.

**Answer:**
In the loss curve for overfitting, the training loss and the validation loss start out close together, but as the model continues to train, the training loss keeps decreasing while the validation loss flattens and then increases, so the gap between them widens. This is because the model is fitting too closely to the training data and is not generalizing well to new data.
![image.png](attachment:image.png)

On the other hand, the loss curve for underfitting shows both the training loss and the validation loss staying high, usually with only a small gap between them. The two losses may plateau early at a high value or still be decreasing when training stops. This is because the model is not fitting the training data well enough and is not capturing the underlying patterns in the data.
![image-2.png](attachment:image-2.png)

A good model should have a small gap between the training loss and the validation loss, and as the model continues to train, both losses should decrease until they reach a minimum and converge.
![image-3.png](attachment:image-3.png)

#### 3. Bias-variance trade-off
    1. [E] What’s the bias-variance trade-off?
    1. [M] How’s this tradeoff related to overfitting and underfitting?
    1. [M] How do you know that your model is high variance, low bias? What would you do in this case?
    1. [M] How do you know that your model is low variance, high bias? What would you do in this case?

**Answer:**

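a. The bias-variance trade-off is a fundamental concept in machine learning that describes the tension between two sources of prediction error. Bias is the error caused by overly simple assumptions in the model, which make it miss real patterns in the data. Variance is the error caused by sensitivity to the particular training set, which makes the model fit noise and change substantially when trained on a different sample. High bias models are often simple and make strong assumptions about the data, while high variance models are more complex and can fit the training data very well, but may not generalize well to new data. Reducing one typically increases the other, so the goal is the level of complexity that minimizes the total error on unseen data.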

b. The bias-variance trade-off is related to overfitting and underfitting because overfitting occurs when a model has high variance and low bias, while underfitting occurs when a model has high bias and low variance. In the case of overfitting, the model is too complex and is able to fit the noise in the training data, leading to poor generalization. On the other hand, underfitting occurs when the model is too simple and is not able to capture the underlying patterns in the data.

c. A model is high variance, low bias if it performs well on the training data but poorly on validation or test data, that is, there is a large gap between the training error and the validation/test error. In this case, we can reduce the variance by simplifying the model, adding more regularization, or gathering more training data.

d. A model is low variance, high bias if it performs poorly on the training data and validation/test data. This can be determined by comparing the training and validation/test error. In this case, we can try to increase the model's complexity by adding more features, increasing the capacity of the model or decreasing the regularization.
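
The diagnosis in (c) and (d) comes down to comparing two numbers, which the sketch below makes explicit. The function name `diagnose` and the thresholds `acceptable_error` and `max_gap` are hypothetical; suitable values depend on the task and on the error achievable in principle (for example, human-level error).

```python
def diagnose(train_error, val_error, acceptable_error=0.05, max_gap=0.05):
    """Rough bias/variance diagnosis from training and validation error.

    acceptable_error and max_gap are illustrative thresholds, not
    standard values.
    """
    high_bias = train_error > acceptable_error             # poor fit even on training data
    high_variance = (val_error - train_error) > max_gap    # large generalisation gap

    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "ok"


print(diagnose(train_error=0.01, val_error=0.20))  # low training error, large gap
print(diagnose(train_error=0.30, val_error=0.32))  # both errors high, small gap
```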

#### 4. Cross-validation.
    a. [E] Explain different methods for cross-validation.
    b. [M] Why don’t we see more cross-validation in deep learning? 

**Answer:**

a. Cross-validation is a technique used to assess the performance of a machine learning model by repeatedly splitting the data into training and validation sets and aggregating the results across the splits. Some of the most popular methods for cross-validation include:

- i. *k-fold cross-validation:* The data is divided into k subsets, and the model is trained and validated on different subsets k times, each time using a different subset as the validation set.

- ii. *Leave-p-out cross-validation:* This method involves training the model on all but p observations, and then validating on the p observations that were left out.

- iii. *Leave-one-out cross-validation:* A special case of leave-p-out cross-validation where p=1.

- iv. *Stratified k-fold cross-validation:* This method is similar to k-fold cross-validation, but it ensures that each fold contains a representative proportion of observations from each class.
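
As a rough sketch of how k-fold splitting works, the generator below yields index lists for each fold. The name `k_fold_indices` is an assumption for illustration, and the sketch does not shuffle or stratify; in practice the indices would normally be shuffled first, and setting `k` equal to the number of samples gives leave-one-out cross-validation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Each sample lands in the validation set exactly once.
    Fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```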

b. Cross-validation is less common in deep learning (DL) because every CV technique requires training the model several times. Training a single deep network is often expensive, so we tend to avoid CV because of the cost of training k different models.

Cross-validation is most valuable when data is limited: with a single split into train and validation sets, some data points with useful information may be excluded from training, and the model may fail to learn the data distribution properly. Deep learning datasets are often large enough that one held-out validation set gives a reasonably reliable estimate.

5. Train, valid, test splits.
    a. [E] What’s wrong with training and testing a model on the same data?
    b. [E] Why do we need a validation set on top of a train set and a test set?
    c. [M] Your model’s loss curves on the train, valid, and test sets look like this. What might have been the cause of this? What would you do?
    ![image.png](attachment:image.png)

**Answer:**
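a. [E] If we test a model on the same data it was trained on, the test error becomes an overly optimistic estimate of its performance. A model that simply memorized the training data would score perfectly, and we would have no way to detect overfitting or to know how well the model generalizes to new, unseen data.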

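b. [E] A validation set is used to tune the hyperparameters of a model, select among candidate models, and monitor performance during training (for example, for early stopping). Because these decisions are made using the validation set, the model becomes indirectly fitted to it, so we keep a separate test set that is used only at the end to get an unbiased estimate of performance on unseen data.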

c. [M] This could indicate that the model is overfitting to the training data. As the model is trained, it is able to decrease its loss on the training data but is unable to generalize well to the validation data. The plot also suggests that the test set may not follow the same distribution as the training and validation sets.

This could be caused by having too many parameters in the model, or a lack of regularization to prevent overfitting. To address this issue, one could try using techniques such as regularization, early stopping, or using a simpler model with fewer parameters. Additionally, it's important to consider the size of the dataset, the complexity of the model, and the amount of noise in the data.
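
Of the remedies listed above, early stopping is the easiest to show in code. The sketch below is one simple variant under stated assumptions: the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the validation losses are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs. The weights from
    `best_epoch` are the ones to keep.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Patience never ran out: training used every epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses that start rising after epoch 3.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```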


6. [E] Your team is building a system to aid doctors in predicting whether a patient has cancer or not from their X-ray scan. Your colleague announces that the problem is solved now that they’ve built a system that can predict with 99.99% accuracy. How would you respond to that claim?


**Answer:**

I would respond that accuracy alone is not enough to call the problem solved. Cancer is typically rare among screened patients, so the classes are heavily imbalanced: a model that always predicts "no cancer" can reach very high accuracy while missing every actual case. We should also look at precision, recall, and performance on different subsets of the data (e.g. specific patient demographics or types of cancer). Additionally, it's important to weigh the consequences of false positives and false negatives, since a missed cancer is usually far more costly than a false alarm, and to check that the model is not overfitting to the training data. Finally, the model should be evaluated on a new, unseen dataset to ensure it generalizes well.
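
To see why accuracy can mislead here, consider a made-up screening set in which 1 of 10,000 scans shows cancer (the prevalence is an assumption for illustration only) and a "model" that always predicts "no cancer":

```python
# Hypothetical screening set: 10,000 scans, exactly 1 of which shows cancer.
y_true = [1] + [0] * 9_999
y_pred = [0] * 10_000  # trivial "model" that always predicts "no cancer"

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives  # share of cancer cases detected

print(f"accuracy = {accuracy:.2%}, recall = {recall:.0%}")
```

The trivial model matches the 99.99% accuracy in the claim yet detects none of the cancer cases, which is why recall and precision on the positive class need to be reported as well.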

7. F1 score.
    a. [E] What’s the benefit of F1 over the accuracy?
    b. [M] Can we still use F1 for a problem with more than two classes? How?

**Answer:**
a. [E] The F1 score is the harmonic mean of precision and recall. It is particularly useful when the classes in the problem are imbalanced: accuracy can be inflated by correctly predicting the majority class, while F1 ignores true negatives and is high only when both precision and recall are high. The F1 score is defined as:
$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.

b. [M] Yes, we can still use F1 for problems with more than two classes by treating each class in turn as the positive class (one-vs-rest) and then aggregating. There are three common ways to aggregate. The macro-average F1 score computes the F1 score for each class and takes the unweighted mean, so every class counts equally regardless of its size. The weighted-average F1 score also averages the per-class F1 scores, but weights each class by the number of samples it contains. The micro-average F1 score pools the true positives, false positives, and false negatives across all classes and computes a single F1 from the totals; in single-label multi-class classification it equals accuracy.
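
The averaging schemes for multi-class F1 can be made concrete with a small from-scratch sketch. The helper names `f1` and `multiclass_f1` and the toy labels are assumptions for illustration; the sketch uses the equivalent form $F_1 = \frac{2 TP}{2 TP + FP + FN}$.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multiclass_f1(y_true, y_pred):
    """Return macro, weighted, and micro F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for c in labels:  # one-vs-rest counts for each class
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        counts[c] = (tp, fp, fn)

    per_class = {c: f1(*counts[c]) for c in labels}
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * y_true.count(c) for c in labels) / len(y_true)
    micro = f1(*(sum(col) for col in zip(*counts.values())))  # pooled counts
    return {"macro": macro, "weighted": weighted, "micro": micro}


# Toy example with three classes of different sizes.
y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
scores = multiclass_f1(y_true, y_pred)
print(scores)
```

On imbalanced data the macro average is pulled down by poorly predicted rare classes, while the micro and weighted averages are dominated by the frequent ones.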

8. Given a binary classifier that outputs the following confusion matrix.
  <table>
    <tr>
     <td>
     </td>
     <td>
  Predicted True
     </td>
     <td>Predicted False
     </td>
    </tr>
    <tr>
     <td>Actual True
     </td>
     <td>30
     </td>
     <td>20
     </td>
    </tr>
    <tr>
     <td>Actual False
     </td>
     <td>5
     </td>
     <td>40
     </td>
    </tr>
  </table>

  1. [E] Calculate the model’s precision, recall, and F1.
  1. [M] What can we do to improve the model’s performance?


**Answer:**

a. $Precision = \frac{T_P}{T_P + F_P} = \frac{30}{30 + 5} \approx 0.86$

   $Recall = \frac{T_P}{T_P + F_N} = \frac{30}{30 + 20} = 0.6$

   $F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.86 \cdot 0.6}{0.86 + 0.6} \approx 0.71$

   $Accuracy = \frac{T_P + T_N}{T_P + T_N + F_P + F_N} = \frac{30 + 40}{30 + 40 + 5 + 20} \approx 0.74$
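
The arithmetic above can be checked with a few lines of Python, using the four cells of the confusion matrix from the question (the variable names are just for illustration):

```python
# Cells of the confusion matrix given in the question.
tp, fn = 30, 20   # actual True:  predicted True / predicted False
fp, tn = 5, 40    # actual False: predicted True / predicted False

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```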

9. Consider a classification where 99% of data belongs to class A and 1% of data belongs to class B.
    1. [M] If your model predicts A 100% of the time, what would the F1 score be? **Hint**: The F1 score when A is mapped to 0 and B to 1 is different from the F1 score when A is mapped to 1 and B to 0.
    1. [M] If we have a model that predicts A and B at a random (uniformly), what would the expected F1 be?
10. [M] For logistic regression, why is log loss recommended over MSE (mean squared error)?
11. [M] When should we use RMSE (Root Mean Squared Error) over MAE (Mean Absolute Error) and vice versa?
12. [M] Show that the negative log-likelihood and cross-entropy are the same for binary classification tasks.
13. [M] For classification tasks with more than two labels (e.g. MNIST with 10 labels), why is cross-entropy a better loss function than MSE?
14. [E] Consider a language with an alphabet of 27 characters. What would be the maximal entropy of this language?
15. [E] A lot of machine learning models aim to approximate probability distributions. Let’s say P is the distribution of the data and Q is the distribution learned by our model. How do we measure how close Q is to P?
16. MPE (Most Probable Explanation) vs. MAP (Maximum A Posteriori)
    1. [E] How do MPE and MAP differ?
    1. [H] Give an example of when they would produce different results.
17. [E] Suppose you want to build a model to predict the price of a stock in the next 8 hours and that the predicted price should never be off more than 10% from the actual price. Which metric would you use?

    **Hint**: check out MAPE.

---
> In case you need a refresh on information entropy, here's an explanation without any math.

> Your parents are finally letting you adopt a pet! They spend the entire weekend taking you to various pet shelters to find a pet.

> The first shelter has only dogs. Your mom covers your eyes when your dad picks out an animal for you. You don't need to open your eyes to know that this animal is a dog. It isn't hard to guess.

> The second shelter has both dogs and cats. Again your mom covers your eyes and your dad picks out an animal. This time, you have to think harder to guess which animal is that. You make a guess that it's a dog, and your dad says no. So you guess it's a cat and you're right. It takes you two guesses to know for sure what animal it is.

> The next shelter is the biggest one of them all. They have so many different kinds of animals: dogs, cats, hamsters, fish, parrots, cute little pigs, bunnies, ferrets, hedgehogs, chickens, even the exotic bearded dragons! There must be close to a hundred different types of pets. Now it's really hard for you to guess which one your dad brings you. It takes you a dozen guesses to guess the right animal.

> Entropy is a measure of the "spread out" in diversity. The more spread out the diversity, the harder it is to guess an item correctly. The first shelter has very low entropy. The second shelter has a little bit higher entropy. The third shelter has the highest entropy.
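
For readers who do want the math, the three shelters can be turned into numbers with Shannon entropy, $H = -\sum_i p_i \log_2 p_i$. The sketch below assumes every kind of animal in a shelter is equally likely to be picked, which the story does not state, and the function name `entropy_bits` is made up for the example.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits; outcomes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# Assumption: each kind of animal in a shelter is equally likely.
dogs_only = entropy_bits([1.0])                # first shelter: only dogs
dogs_and_cats = entropy_bits([0.5, 0.5])       # second shelter: dogs and cats
hundred_kinds = entropy_bits([1 / 100] * 100)  # third shelter: about 100 kinds

print(dogs_only, dogs_and_cats, hundred_kinds)
```

The entropy grows from the first shelter to the third, matching the intuition that the animal becomes harder to guess. For a uniform distribution over $n$ outcomes the entropy is $\log_2 n$, which is the largest value possible for $n$ outcomes.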