$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\cset}[1]{\mathcal{#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
\newcommand{\E}[2][]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\ip}[3]{\left<#1,#2\right>_{#3}}
\newcommand{\given}[]{\,\middle\vert\,}
\newcommand{\DKL}[2]{\cset{D}_{\text{KL}}\left(#1\,\Vert\, #2\right)}
\newcommand{\grad}[]{\nabla}
$$

# Part 3: Summary Questions
<a id=part2></a>

This section contains summary questions about various topics from the course material.

You can add your answers in new cells below the questions.

**Notes**

- Clearly mark where your answer begins, e.g. write "**Answer:**" in the beginning of your cell.
- Provide a full explanation, even if the question doesn't explicitly state so. We will reduce points for partial explanations!
- This notebook should be runnable from start to end without any errors.

### CNNs

1. Explain the meaning of the term "receptive field" in the context of CNNs.

**Answer:**

The receptive field of a neuron in a CNN is the region in the input space that affects the value of the neuron. In other words, it is the region in the input space that the neuron "sees". For example, if you consider a single neuron in the first convolutional layer of a CNN that is processing an image, its receptive field is the size of the filter (e.g., 3x3 pixels), and it's the portion of the image that the filter scans at a time. As you go deeper into the network, each neuron's receptive field becomes larger, covering a broader area of the input image. This is because each neuron in a layer is connected to neurons of the previous layer, effectively increasing the size of the input image area that it can 'see'. This concept allows CNNs to automatically and adaptively learn spatial hierarchies, where high-level, global features are defined in terms of lower-level, local features.

2. Explain and elaborate about three different ways to control the rate at which the receptive field grows from layer to layer. Compare them to each other in terms of how they combine input features.

**Answer:**

**Stride**: The stride controls how many pixels the filter moves between consecutive applications. A stride of 1 moves the filter one pixel at a time; a stride of 2 moves it two pixels at a time, which also halves the spatial size of the output. Striding does not change the receptive field of the layer it is applied in, but adjacent outputs are now further apart in input space, so the receptive field of every subsequent layer grows faster (by a factor equal to the stride). The features are still combined by a learned weighted sum over a dense window; some window positions are simply skipped.

**Dilation**: The dilation controls the spacing between the filter taps. A dilation of 1 is a regular convolution; a dilation of 2 leaves a gap of one pixel between taps, so a kernel of size $k$ with dilation $d$ spans $(k-1)\cdot d+1$ pixels. This enlarges the receptive field of the layer itself without adding parameters and without reducing the output resolution (given suitable padding). The features are combined by a learned weighted sum over a sparse grid, so the pixels between the taps are ignored by that layer.

**Pooling**: Pooling replaces each sub-region of its input with a summary statistic, e.g. the maximum (max pooling) or the mean (average pooling), usually with a stride equal to the window size. Like a strided convolution it downsamples the feature map, so the receptive field of subsequent layers grows faster. Unlike stride and dilation, it combines features with a fixed, non-learned function applied per channel, which discards information (max pooling keeps only the largest value in each sub-region).

3. Imagine a CNN with three convolutional layers, defined as follows:

In [1]:
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
    nn.ReLU(),
)

cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape

torch.Size([1, 32, 122, 122])

What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?

**Answer:**
Let's track, layer by layer, the receptive field $r$ and the jump $j$ (the distance, in input pixels, between two adjacent outputs of a layer). A layer with kernel size $k$, stride $s$ and dilation $d$ has an effective kernel size $k_\text{eff}=(k-1)d+1$. It updates $r \leftarrow r+(k_\text{eff}-1)\,j$ and then $j \leftarrow j\cdot s$, starting from $r=1$, $j=1$ at the input. Padding and ReLU do not affect the receptive field.

- First convolution ($k=3$, $s=1$): $r=1+2\cdot 1=3$, $j=1$.
- First max-pooling ($k=2$, $s=2$): $r=3+1\cdot 1=4$, $j=2$.
- Second convolution ($k=5$, $s=2$): $r=4+4\cdot 2=12$, $j=4$.
- Second max-pooling ($k=2$, $s=2$): $r=12+1\cdot 4=16$, $j=8$.
- Third convolution ($k=7$, $d=2$, so $k_\text{eff}=(7-1)\cdot 2+1=13$): $r=16+12\cdot 8=112$, $j=8$.

Therefore each "pixel" in the output tensor has a receptive field of 112x112 input pixels.
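
The arithmetic for this question can be checked with a short helper. This is a minimal sketch, assuming the standard recurrence in which each layer enlarges the receptive field by (effective kernel size - 1) times the current spacing between adjacent outputs; the function name `receptive_field` and the tuple layout are illustrative choices, not part of the assignment.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, ordered input to output."""
    r, j = 1, 1  # receptive field and jump (output spacing in input pixels) at the input
    for k, s, d in layers:
        k_eff = (k - 1) * d + 1  # extent of the (possibly dilated) kernel
        r += (k_eff - 1) * j     # extra taps, each spaced j input pixels apart
        j *= s                   # striding spreads adjacent outputs further apart
    return r


# (kernel_size, stride, dilation) for the conv and pooling layers of `cnn`;
# ReLU and padding do not change the receptive field.
cnn_layers = [
    (3, 1, 1),  # Conv2d(kernel_size=3)
    (2, 2, 1),  # MaxPool2d(2)
    (5, 2, 1),  # Conv2d(kernel_size=5, stride=2)
    (2, 2, 1),  # MaxPool2d(2)
    (7, 1, 2),  # Conv2d(kernel_size=7, dilation=2)
]
rf = receptive_field(cnn_layers)
print(f"receptive field: {rf}x{rf}")
```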

4. You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).

  After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead: $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.

  However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.

**Answer:**
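In the residual network each layer computes $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, so to produce the same output as before, the convolution only has to learn the residual $f_l(\vec{x};\vec{\theta}_l)=\vec{y}_l-\vec{x}$, i.e. the difference between the desired output and the input, rather than the full mapping. The filters $\vec{\theta}_l$ therefore parametrize a different function in the two networks: in the original network they must encode the whole transformation, while in the residual network filters close to zero already give the identity mapping and the filters only encode corrections to it.

The skip connections also change the optimization itself. Residual networks were designed to address the vanishing gradients that arise in deep convolutional networks: the input is propagated directly to the output of each layer, so during backpropagation gradients can flow through these connections without being attenuated by the convolutions. This changes the optimization landscape and the way gradients are propagated, which is a second reason the learned filters differ.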

### Dropout

1. Consider the following neural network:

In [2]:
import torch.nn as nn

p1, p2 = 0.1, 0.2
nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout(p=p1),
    nn.Dropout(p=p2),
)

Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)

If we want to replace the two consecutive dropout layers with a single one defined as follows:
```python
nn.Dropout(p=q)
```
what would the value of `q` need to be? Write an expression for `q` in terms of `p1` and `p2`.

**Answer:**
The two consecutive dropout layers can be replaced with: $q = 1 - (1-p_1)(1-p_2)$
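
The expression can be sanity-checked empirically. The sketch below assumes the two dropout masks are independent (as they are for two separate `nn.Dropout` layers in training mode) and uses `torch.nn.functional.dropout` on a tensor of ones; the tensor size and seed are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p1, p2 = 0.1, 0.2
q = 1 - (1 - p1) * (1 - p2)  # an element survives only if both layers keep it

x = torch.ones(1_000_000)
# Two consecutive dropout layers in training mode
y = F.dropout(F.dropout(x, p=p1, training=True), p=p2, training=True)

drop_fraction = (y == 0).float().mean().item()
surviving_value = y[y != 0].double().mean().item()

print(f"q = {q:.2f}, empirical drop fraction = {drop_fraction:.3f}")
print(f"surviving value = {surviving_value:.4f}, 1/(1-q) = {1 / (1 - q):.4f}")
```

Both the fraction of zeroed elements and the scale of the surviving elements should match a single dropout layer with probability `q`.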

2. **True or false**: dropout must be placed only after the activation function.

**Answer:**
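False. Dropout can be placed either before or after the activation function. Its purpose is to randomly zero a fraction of the activations during training (it has no learnable parameters), and this works in both positions. For ReLU the two orderings are exactly equivalent, because $\mathrm{ReLU}(0)=0$ and scaling by the positive factor $1/(1-p)$ commutes with ReLU. For other activations the result can differ (e.g. $\sigma(0)=0.5\neq 0$, so dropout placed before a sigmoid does not zero the unit's output), which is why dropout is commonly placed after the activation.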


3. After applying dropout with a drop-probability of $p$, the activations are scaled by $1/(1-p)$. Prove that this scaling is required in order to maintain the value of each activation unchanged in expectation.

**Answer:**
With a drop-probability of $p$, each activation is kept with probability $1-p$ and set to zero with probability $p$. The scaling factor $1/(1-p)$ compensates for the dropped fraction so that the expected value of each activation is unchanged.

Consider an activation $x$ and an independent mask $m\sim\mathrm{Bernoulli}(1-p)$, so that the activation after dropout, before any scaling, is $x_{dropout}=m\,x$. Its expectation is

$E[x_{dropout}] = (1 - p)\cdot x+p\cdot 0=(1 - p)x$

which is smaller than $x$ by a factor of $(1-p)$. If we scale the activation by a constant $c$, i.e. $x_{scaled}=c\,x_{dropout}$, then by linearity of expectation

$E[x_{scaled}] = c\,E[x_{dropout}] = c\,(1-p)\,x$

Requiring $E[x_{scaled}]=x$ for every $x$ gives $c=\frac{1}{1-p}$. Hence scaling by $1/(1-p)$ is exactly what is needed to keep each activation unchanged in expectation.
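
The same result can be observed numerically. This is a small sketch, assuming a constant activation value of 2.0 and `p = 0.3` (both arbitrary); it relies on the fact that `torch.nn.functional.dropout` already applies the $1/(1-p)$ scaling in training mode, and undoes that scaling by hand for comparison.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = 0.3
x = torch.full((1_000_000,), 2.0)  # a constant activation value

scaled = F.dropout(x, p=p, training=True)  # PyTorch applies the 1/(1-p) scaling
unscaled = scaled * (1 - p)                # undo the scaling to see its effect

scaled_mean = scaled.mean().item()
unscaled_mean = unscaled.mean().item()
print(f"mean with scaling:    {scaled_mean:.3f}")
print(f"mean without scaling: {unscaled_mean:.3f}")
```

With scaling the empirical mean stays close to the original value, while without it the mean shrinks by roughly a factor of $(1-p)$.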

### Losses and Activation functions

1. You're training an image classifier that, given an image, needs to classify it as either a dog (output 0) or a hotdog (output 1). Would you train this model with an L2 loss? If so, why? If not, demonstrate with a numerical example. What would you use instead?

**Answer:**
No. L2 loss, also known as Mean Squared Error (MSE), is a poor fit for binary classification tasks like distinguishing between images of dogs and hotdogs. It is natural for regression, where the outputs are real numbers, but for a classifier that outputs a probability it penalizes confident mistakes too weakly.

For instance, consider an image of a dog (label 0). A prediction of 0.1 gives an L2 loss of $(0-0.1)^2 = 0.01$, and a wrong prediction of 0.9 gives $(0-0.9)^2 = 0.81$. But an almost-certain wrong prediction of 0.999 gives only $(0-0.999)^2 \approx 0.998$: the loss is bounded by 1, so it barely distinguishes "wrong" from "confidently wrong". Binary cross-entropy gives $-\log(1-0.9)\approx 2.30$ and $-\log(1-0.999)\approx 6.91$ for the same two predictions, and grows without bound as the prediction approaches the wrong label.

The gradients show the same problem. If the prediction is produced by a sigmoid, $\hat{y}=\sigma(z)$, the gradient of the L2 loss w.r.t. $z$ is $2(\hat{y}-y)\hat{y}(1-\hat{y})$, which for $\hat{y}=0.999$, $y=0$ is about $0.002$, so a confidently wrong model hardly learns. For binary cross-entropy the gradient is $\hat{y}-y=0.999$.

Therefore we would use the binary cross-entropy (log) loss instead. It treats the output as a probability and corresponds to maximum-likelihood estimation under a Bernoulli model.
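
To make the comparison concrete, the following sketch evaluates both losses and their gradients with autograd for a dog image (label 0). It assumes the prediction comes from a sigmoid applied to a logit `z`; the probabilities 0.1, 0.9 and 0.999 and the `results` dictionary are illustrative choices.

```python
import torch

y = torch.tensor(0.0, dtype=torch.float64)  # true label: dog
results = {}

for prob in (0.1, 0.9, 0.999):
    # Logit z chosen so that sigmoid(z) == prob
    z = torch.logit(torch.tensor(prob, dtype=torch.float64)).requires_grad_()
    y_hat = torch.sigmoid(z)

    l2 = (y_hat - y) ** 2
    bce = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))

    # Gradients of each loss w.r.t. the logit
    (g_l2,) = torch.autograd.grad(l2, z, retain_graph=True)
    (g_bce,) = torch.autograd.grad(bce, z)

    results[prob] = (l2.item(), bce.item(), g_l2.item(), g_bce.item())
    print(f"y_hat={prob}: L2={l2.item():.4f} (grad {g_l2.item():.4f}), "
          f"BCE={bce.item():.4f} (grad {g_bce.item():.4f})")
```

For a confidently wrong prediction, the L2 loss stays below 1 and its gradient w.r.t. the logit becomes small, while the cross-entropy and its gradient remain large.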

2. After months of research into the origins of climate change, you observe the following result:

<center><img src="https://sparrowism.soc.srcf.net/home/piratesarecool4.gif" /></center>

You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in `N` locations around the globe.
You define your model as follows:

In [3]:
import torch.nn as nn

N = 42  # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
    nn.Linear(in_features=N, out_features=H),
    nn.Sigmoid(),
    *[
        nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
    ]*24,
    nn.Linear(in_features=H, out_features=1),
)

While training your model you notice that the loss reaches a plateau after only a few iterations.
It seems that your model is no longer training.
What is the most likely cause?

**Answer:**
The most likely cause is the vanishing gradient problem. The model stacks 25 sigmoid activations, and the derivative of the sigmoid is at most $0.25$ (and close to zero when the unit saturates). During backpropagation the gradient reaching the first layers includes one such factor per layer, so the activation derivatives alone scale it by at most $0.25^{25}\approx 9\cdot 10^{-16}$. The gradients of the loss with respect to the weights of the early layers therefore become vanishingly small and those layers effectively stop learning.
The depth of the network compounds the issue: the more sigmoid layers the gradient must pass through, the smaller it becomes. This combination of a saturating activation and a very deep network is most likely what halts training.
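
The effect can be observed by comparing gradient magnitudes at the two ends of such a network. This is a rough sketch rather than a reproduction of the training run: it assumes PyTorch's default initialization, a random input batch, and a sum over the outputs as a stand-in loss. Unlike the list-repetition in `mlpirate`, each hidden layer here is created as a separate module.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
N, H, depth = 42, 128, 24

# Same layout as `mlpirate`, but every hidden layer is a distinct module.
layers = [nn.Linear(N, H), nn.Sigmoid()]
for _ in range(depth):
    layers += [nn.Linear(H, H), nn.Sigmoid()]
layers.append(nn.Linear(H, 1))
model = nn.Sequential(*layers)

x = torch.rand(16, N)
model(x).sum().backward()  # any scalar works; only gradient sizes matter here

first = model[0].weight.grad.norm().item()   # layer closest to the input
last = model[-1].weight.grad.norm().item()   # layer closest to the output
print(f"first-layer grad norm: {first:.3e}")
print(f"last-layer grad norm:  {last:.3e}")
```

The gradient norm at the first layer is expected to be many orders of magnitude smaller than at the last layer.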

3. Referring to question 2 above: A friend suggests that if you replace the `sigmoid` activations with `tanh`, it will solve your problem. Is he correct? Explain why or why not.

**Answer:**
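No, replacing the sigmoid activations with tanh is unlikely to solve the problem. tanh is less prone to vanishing gradients than sigmoid, since its derivative reaches $1$ at zero (compared to at most $0.25$ for the sigmoid) and its output is zero-centered. However, it is still a saturating function: its derivative is strictly smaller than $1$ away from zero and approaches zero for inputs of large magnitude. With 25 tanh layers in sequence, the product of these derivatives can still shrink the gradient to almost nothing, so the early layers would still learn very slowly. A non-saturating activation such as ReLU is a more reliable fix.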

4. Regarding the ReLU activation, state whether the following sentences are **true or false** and explain:
  1. In a model using exclusively ReLU activations, there can be no vanishing gradients.
  1. The gradient of ReLU is linear with its input when the input is positive.
  1. ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.

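**Answer 4:**

1. False. ReLU is less prone to vanishing gradients than sigmoid and tanh because its derivative is exactly 1 for positive inputs, but gradients can still vanish. The derivative is zero for every unit whose input is negative, and the backpropagated gradient is still multiplied by the weight matrices of each layer, so with small weights or many inactive units it can shrink towards zero with depth.

2. False. For positive inputs the output of ReLU is linear in its input, but its gradient is constant: the derivative equals 1 regardless of the input value.

3. True. When the input to a ReLU is negative, both its output and its gradient are zero. If a neuron's pre-activation is negative for all training samples (e.g. after a large update to its weights or bias), it outputs zero everywhere and receives no gradient, so its weights are never updated and it stays "dead".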
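
The derivative of ReLU can be inspected directly with autograd. The input values below are arbitrary; zero itself is avoided since ReLU is not differentiable there.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
torch.relu(x).sum().backward()

relu_grad = x.grad.tolist()
print(relu_grad)  # derivative of ReLU at each input: 0 for negative, 1 for positive
```

The derivative is 0 for the negative inputs and 1 for the positive ones, independent of their magnitude.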

### Optimization

1. Explain the difference between: stochastic gradient descent (SGD), mini-batch SGD and regular gradient descent (GD).

**Answer:**
**Gradient Descent (GD)**: Uses all training data for each gradient computation, making it accurate but computationally demanding, particularly for large datasets.
**Stochastic Gradient Descent (SGD)**: Computes gradient using a single random data point per step. This improves speed and scalability but might lead to unstable convergence due to its reliance on individual data points.
**Mini-batch Stochastic Gradient Descent**: Combines the above methods, using a small random subset of data for each step. This provides faster convergence than GD and less noise than SGD, with performance influenced by the mini-batch size, often a power of 2 for computational ease.

Essentially, these methods differ in the volume of data they utilize for gradient computation, impacting their speed, efficiency, and stability.
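
The three variants differ only in which samples enter the gradient computation. The sketch below illustrates this on a toy least-squares problem; the data, the model (`w`), and the helper name `grad` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)
N, d, batch_size = 256, 4, 32
X = torch.randn(N, d, dtype=torch.float64)
y = torch.randn(N, dtype=torch.float64)
w = torch.zeros(d, dtype=torch.float64)


def grad(idx):
    """Gradient of the mean squared error over the samples in `idx`, w.r.t. w."""
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)


full = grad(torch.arange(N))                 # GD: all samples
single = grad(torch.randint(0, N, (1,)))     # SGD: one random sample
mini = grad(torch.randperm(N)[:batch_size])  # mini-batch SGD: a random subset

# Averaging the single-sample gradients recovers the full gradient,
# i.e. the SGD gradient is an unbiased but noisy estimate of the GD gradient.
per_sample = torch.stack([grad(torch.tensor([i])) for i in range(N)])

print(f"|single - full| = {(single - full).norm().item():.3f}")
print(f"|mini   - full| = {(mini - full).norm().item():.3f}")
```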

2. Regarding SGD and GD:
  1. Provide at least two reasons for why SGD is used more often in practice compared to GD.
  2. In what cases can GD not be used at all?

**Answer 2:**

1. SGD is generally preferred over GD for the following reasons:
**Firstly**, efficiency and scalability. SGD uses a single sample (or a small batch) for each gradient computation, so each update is far cheaper than a full pass over the data, and the model starts improving long before GD would complete a single step on a large dataset.
**Secondly**, the noise in its gradient estimates. Because each update is based on a random sample, the direction of the step fluctuates, which helps SGD escape shallow local minima and saddle points during optimization.

2. GD cannot be used when the full gradient cannot be computed in practice: for example, when the dataset is too large to process within the available memory or time per update, or when the data is not all available in advance, such as in online learning where samples arrive as a stream.

3. You have trained a deep resnet to obtain SoTA results on ImageNet.
While training using mini-batch SGD with a batch size of $B$, you noticed that your model converged to a loss value of $l_0$ within $n$ iterations (batches across all epochs) on average.
Thanks to your amazing results, you secure funding for a new high-powered server with GPUs containing twice the amount of RAM.
You're now considering to increase the mini-batch size from $B$ to $2B$.
Would you expect the number of iterations required to converge to $l_0$ to decrease or increase when using the new batch size? Explain in detail.

**Answer:**
We would expect the number of iterations to decrease, though not necessarily by half. With a batch size of $2B$, each gradient estimate averages twice as many samples, so its variance is halved and each update follows the true gradient more closely (which also permits a larger learning rate). Fewer updates are then needed to reach the target loss $l_0$.
However, the relation between batch size and the number of iterations is not linear. The returns diminish once the gradient estimate is already accurate, and the reduced noise can make it harder to escape saddle points and poor local minima, which in some cases slows convergence. Larger batches are also associated with worse performance on unseen data, known as the "generalization gap".
Note also that each iteration now processes twice as much data, so fewer iterations does not necessarily mean less total computation.

4. For each of the following statements, state whether they're **true or false** and explain why.
  1. When training a neural network with SGD, every epoch we perform an optimization step for each sample in our dataset.
  1. Gradients obtained with SGD have less variance and lead to quicker convergence compared to GD.
  1. SGD is less likely to get stuck in local minima, compared to GD.
  1. Training  with SGD requires more memory than with GD.
  1. Assuming appropriate learning rates, SGD is guaranteed to converge to a local minimum, while GD is guaranteed to converge to the global minimum.
  1. Given a loss surface with a narrow ravine (high curvature in one direction): SGD with momentum will converge more quickly than Newton's method which doesn't have momentum.

**Answer 4:**

1. True. In SGD each update uses a single sample, so one epoch (a full pass over the dataset) consists of one optimization step per sample. In contrast, GD uses all data points for each update and therefore performs a single step per epoch.

2. False. SGD yields gradients with higher variance than GD, because each gradient is estimated from a single sample rather than from the whole dataset. This noise generally means that more iterations are needed to converge than with GD, although each iteration is much cheaper.

3. True. SGD, compared to GD, is more likely to bypass local minima. This advantage arises from the noise in SGD's gradient approximation, which can help the method move out of shallow local minima.

4. False. SGD requires less memory than GD. SGD computes the gradient from one sample at a time, whereas GD must process the entire dataset for every gradient computation.

5. False. Neither method is guaranteed to find the global minimum of a non-convex loss. Both SGD and GD follow the local gradient, so with appropriate learning rates they can at best be expected to converge to a stationary point such as a local minimum, which depends on the initialization. Only for convex losses does this imply the global minimum.

6. False. Momentum does help SGD in a narrow ravine, since it damps the oscillations across the ravine and accumulates velocity along it. Newton's method, however, does not need momentum: it rescales the gradient by the inverse Hessian, which corrects for the different curvatures directly, and on a quadratic ravine it reaches the minimum in a single step. It is therefore expected to converge in fewer iterations, although each iteration is much more expensive.

5. **Bonus** (we didn't discuss this at class):  We can use bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network.
  **True or false**: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc).
  Provide a mathematical justification for your answer.

**Answer:**


6. You have trained a neural network, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$ for some arbitrary parametrized functions $f_l(\cdot;\vec{\theta}_l)$.
  Unfortunately while trying to break the record for the world's deepest network, you discover that you are unable to train your network with more than $L$ layers.
  1. Explain the concepts of "vanishing gradients", and "exploding gradients".
  2. How can each of these problems be caused by increased depth?
  3. Provide a numerical example demonstrating each.
  4. Assuming your problem is either of these, how can you tell which of them it is without looking at the gradient tensor(s)?

**Answer 6:**

1. "Vanishing gradients" is a phenomenon where the gradients of the loss w.r.t. the parameters of the early layers become negligibly small, so these layers learn very slowly or not at all. Conversely, "exploding gradients" occur when these gradients grow excessively large, producing huge updates that destabilize the learning process.

2. Both problems are amplified by depth. By the chain rule, the gradient reaching layer $l$ is a product of one factor (weights times activation derivative) per subsequent layer. If these factors are typically smaller than 1 in magnitude the product shrinks exponentially with depth, and if they are larger than 1 it grows exponentially.

3. For vanishing gradients, consider a chain of scalar layers with weights of 1 and sigmoid activations. The derivative of the sigmoid is at most 0.25, so each layer multiplies the gradient by at most 0.25, and after 20 layers the gradient is scaled by at most $0.25^{20}\approx 9\cdot 10^{-13}$. A saturated unit is even worse: the derivative of the sigmoid at an input of 10 is only about 0.000045.
For exploding gradients, consider the same chain with weights of 10 and no saturating activation. With a gradient of 0.1 at the output layer, backpropagation multiplies the gradient by the weight at each layer, yielding 1.0 at the previous layer, 10 at the one before it, and so on. After 20 layers the gradient is $0.1\cdot 10^{20}=10^{19}$.

4. If the model suffers from vanishing gradients, learning will slow down or halt, evidenced by minimal to no change in the loss over many epochs. If exploding gradients are the issue, the loss often becomes NaN or Inf, or fluctuates dramatically from one iteration to the next, indicating that training has become unstable.
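
Both effects can be reproduced with a chain of scalar layers. This is an illustrative sketch: the helper `input_gradient`, the starting input of 0.5, and the depth of 20 are assumptions made for the example.

```python
import torch


def input_gradient(weight, depth, activation):
    """d(output)/d(input) for a chain of `depth` scalar layers x -> activation(weight * x)."""
    x = torch.tensor(0.5, dtype=torch.float64, requires_grad=True)
    out = x
    for _ in range(depth):
        out = activation(weight * out)
    (g,) = torch.autograd.grad(out, x)
    return g.item()


# Sigmoid layers with weight 1: every layer scales the gradient by at most 0.25.
vanishing = input_gradient(weight=1.0, depth=20, activation=torch.sigmoid)
# Linear layers with weight 10: every layer scales the gradient by 10.
exploding = input_gradient(weight=10.0, depth=20, activation=lambda t: t)

print(f"sigmoid chain, w=1:  {vanishing:.3e}")
print(f"linear chain,  w=10: {exploding:.3e}")
```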

### Backpropagation

1. You wish to train the following 2-layer MLP for a binary classification task:
  $$
  \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2
  $$
  You wish to minimize the in-sample loss function, defined as
  $$
  L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right)
  $$
  Where the pointwise loss is binary cross-entropy:
  $$
  \ell(y, \hat{y}) =  - y \log(\hat{y}) - (1-y) \log(1-\hat{y})
  $$
  
  Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.

**Answer:**
Calculating the derivatives of the loss function $L_{\mathcal{S}}$ with respect to the parameters and the input involves applying the chain rule of differentiation, i.e. backpropagation. Below are the expressions for the derivatives.

Define some intermediate variables that we'll use:

$h^{(i)} = \mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1$

$z^{(i)} = \varphi(h^{(i)})$

$\hat{y}^{(i)} =\mat{W}_2~z^{(i)} + \vec{b}_2$

And, the derivative of binary cross-entropy loss:

$\frac{\partial \ell(y^{(i)}, \hat{y}^{(i)})}{\partial \hat{y}^{(i)}} = -\frac{y^{(i)}}{\hat{y}^{(i)}} + \frac{1-y^{(i)}}{1-\hat{y}^{(i)}}$

Derivative of $L_{\mathcal{S}}$ w.r.t. $\mat{W}_2$:
$\frac{\partial L_{\mathcal{S}}}{\partial \mat{W}_2} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \ell(y^{(i)}, \hat{y}^{(i)})}{\partial \hat{y}^{(i)}} z^{(i)T} + \lambda \mat{W}_2$

Derivative of $L_{\mathcal{S}}$ w.r.t. $\mat{b}_2$:
$\frac{\partial L_{\mathcal{S}}}{\partial \mat{b}_2} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \ell(y^{(i)}, \hat{y}^{(i)})}{\partial \hat{y}^{(i)}}$

Derivative of $L_{\mathcal{S}}$ w.r.t. $\mat{W}_1$:
$\frac{\partial L_{\mathcal{S}}}{\partial \mat{W}_1} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \ell(y^{(i)}, \hat{y}^{(i)})}{\partial \hat{y}^{(i)}} \left(\mat{W}_2^T \odot \varphi'(h^{(i)})\right) \vec{x}^{(i)T} + \lambda \mat{W}_1$

Here, $\varphi'$ represents the derivative of the activation function, applied element-wise, and $\odot$ denotes element-wise multiplication. Since $\hat{y}^{(i)}$ is a scalar, $\mat{W}_2^T$ is a column vector with the same shape as $h^{(i)}$.

Derivative of $L_{\mathcal{S}}$ w.r.t. $\mat{b}_1$:
$\frac{\partial L_{\mathcal{S}}}{\partial \mat{b}_1} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \ell(y^{(i)}, \hat{y}^{(i)})}{\partial \hat{y}^{(i)}} \left(\mat{W}_2^T \odot \varphi'(h^{(i)})\right)$

Derivative of $L_{\mathcal{S}}$ w.r.t. $\mat{x}$ (only the $i$-th term of the sum depends on $\vec{x}^{(i)}$):
$\frac{\partial L_{\mathcal{S}}}{\partial \vec{x}^{(i)}} = \frac{1}{N}\frac{\partial \ell(y^{(i)}, \hat{y}^{(i)})}{\partial \hat{y}^{(i)}} \mat{W}_1^T \left(\mat{W}_2^T \odot \varphi'(h^{(i)})\right)$
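
Derivatives of this kind can be checked against autograd. The sketch below makes several assumptions that are not part of the question: $\varphi$ is taken to be the sigmoid, the output is a scalar per sample, and $\mat{W}_2$, $\vec{b}_2$ are chosen so that $\hat{y}$ stays inside $(0,1)$ and the cross-entropy is well defined. The analytic gradients follow the chain rule with $\delta^{(i)}=\frac{\partial \ell}{\partial \hat{y}^{(i)}}\left(\mat{W}_2^T\odot\varphi'(h^{(i)})\right)$.

```python
import torch

torch.manual_seed(0)
N, d, H, lam = 8, 3, 4, 0.1
dt = torch.float64

X = torch.rand(N, d, dtype=dt, requires_grad=True)
y = torch.randint(0, 2, (N,)).to(dt)
W1 = torch.rand(H, d, dtype=dt, requires_grad=True)
b1 = torch.rand(H, dtype=dt, requires_grad=True)
# Assumption: small positive W2 and b2 = 0.25 keep y_hat inside (0.25, 0.75)
W2 = (0.5 * torch.rand(1, H, dtype=dt) / H).requires_grad_()
b2 = torch.full((1,), 0.25, dtype=dt, requires_grad=True)

# Forward pass (phi = sigmoid is an assumption for this example)
h = X @ W1.T + b1                   # (N, H)
z = torch.sigmoid(h)                # (N, H)
y_hat = (z @ W2.T + b2).squeeze(1)  # (N,)
bce = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))
loss = bce.mean() + lam / 2 * (W1.pow(2).sum() + W2.pow(2).sum())
loss.backward()

# Analytic gradients
with torch.no_grad():
    dl = (-y / y_hat + (1 - y) / (1 - y_hat)).unsqueeze(1)  # dl/dy_hat, (N, 1)
    delta = dl * W2 * z * (1 - z)                           # dl/dh per sample, (N, H)
    gW2 = (dl * z).mean(dim=0, keepdim=True) + lam * W2
    gb2 = dl.mean(dim=0)
    gW1 = delta.T @ X / N + lam * W1
    gb1 = delta.mean(dim=0)
    gX = delta @ W1 / N                                     # row i is dL/dx^(i)

for name, analytic, auto in [("W1", gW1, W1.grad), ("b1", gb1, b1.grad),
                             ("W2", gW2, W2.grad), ("b2", gb2, b2.grad),
                             ("x", gX, X.grad)]:
    print(name, torch.allclose(analytic, auto))
```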

2. The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is
  $$
  f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}}
  $$
  
  1. Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
  
  2. What are the drawbacks of this approach? List at least two drawbacks compared to AD.

**Answer:**
1. This formula represents the foundation of the finite difference method, a strategy used to calculate gradients numerically without the use of automatic differentiation. It modifies a neural network parameter minutely and monitors the ensuing change in the function, approximating the derivative of the loss function. This process is repeated for each network parameter.
2. This method has certain drawbacks compared to automatic differentiation. Firstly, it's computationally costly: every parameter needs at least one additional forward pass (two for central differences), so the cost grows linearly with the number of parameters. Automatic differentiation, on the other hand, computes all gradients in one forward and one backward pass. Secondly, the accuracy of the finite difference method relies heavily on the chosen increment, $\Delta \vec{x}$. Large values give a poor approximation of the limit (truncation error), while very small values suffer from floating-point round-off error. Automatic differentiation avoids both issues by computing derivatives that are exact up to machine precision.

3. Given the following code snippet:
  1. Write a short snippet that calculates the gradient of `loss` w.r.t. `W` and `b` using the approach of numerical gradients from the previous question.
  2. Calculate the same derivatives with autograd.
  3. Show, by calling `torch.allclose()` that your numerical gradient is close to autograd's gradient.

In [4]:
import torch

N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)

def foo(W, b):
    return torch.mean(X @ W + b)

loss = foo(W, b)
print(f"{loss=}")

# TODO: Calculate gradients numerically for W and b
# Central differences: df/dp ~= (f(p + h) - f(p - h)) / (2h), one entry at a time.
h = 1e-6
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)

# The perturbed evaluations are plain function calls; autograd need not track them.
with torch.no_grad():
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W_perturbed = W.clone()

            W_perturbed[i, j] += h      # W[i, j] + h
            loss_plus = foo(W_perturbed, b)

            W_perturbed[i, j] -= 2 * h  # W[i, j] - h
            loss_minus = foo(W_perturbed, b)

            grad_W[i, j] = (loss_plus - loss_minus) / (2 * h)

    for i in range(b.shape[0]):
        b_perturbed = b.clone()

        b_perturbed[i] += h             # b[i] + h
        loss_plus = foo(W, b_perturbed)

        b_perturbed[i] -= 2 * h         # b[i] - h
        loss_minus = foo(W, b_perturbed)

        grad_b[i] = (loss_plus - loss_minus) / (2 * h)

# TODO: Compare with autograd using torch.allclose()
loss.backward()
autograd_W = W.grad
autograd_b = b.grad
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)

loss=tensor(1.8946, dtype=torch.float64, grad_fn=<MeanBackward0>)


### Sequence models

1. Regarding word embeddings:
  1. Explain this term and why it's used in the context of a language model.
  1. Can a language model like the sentiment analysis example from the tutorials be trained without an embedding (i.e. trained directly on sequences of tokens)? If yes, what would be the consequence for the trained model? if no, why not?

**Answer 1:**

1. Word embeddings are dense vector representations of words (tokens), learned so that words with similar meaning or usage get similar representations. They're vital in language models because tokens are discrete symbols with no inherent notion of similarity: the embedding maps each token into a continuous space of relatively low dimension (compared to the vocabulary size) in which geometric relations reflect relations between words. This allows the model to generalize across related words and learn from the data more effectively.

2. Although feasible, training a language model like the sentiment analysis example directly on token sequences without embeddings would result in a less effective model. This approach necessitates one-hot encoding, which produces sparse vectors as high-dimensional as the vocabulary, in which all words are equally distant from each other, so the input carries no information about word relationships. The model would also be more computationally demanding and would struggle to generalize to new texts.
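
The relation between the two input representations can be seen directly: an embedding lookup gives the same result as multiplying one-hot vectors by the embedding matrix, only without materializing the sparse vectors. The vocabulary size, embedding dimension and token values below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)
tokens = torch.tensor([[1, 5, 5, 9], [0, 2, 7, 3]])  # (batch, sequence)

# Direct lookup of one row of the weight matrix per token
looked_up = embedding(tokens)                                 # (batch, seq, dim)

# Equivalent computation through explicit one-hot vectors
one_hot = F.one_hot(tokens, num_classes=vocab_size).float()  # (batch, seq, vocab)
multiplied = one_hot @ embedding.weight                       # (batch, seq, dim)

print(torch.allclose(looked_up, multiplied))
```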

2. Considering the following snippet, explain:
  1. What does `Y` contain? why this output shape?
  2. **Bonus**: How would you implement `nn.Embedding` yourself using only torch tensors?

In [5]:
import torch.nn as nn

X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")

Y.shape=torch.Size([5, 6, 7, 8, 42000])


**Answer:**
1. `Y` is a tensor that contains the embeddings for the input `X`. In the context of the code snippet, `X` is a tensor of size `(5, 6, 7, 8)` containing integers between `0` and `41` (inclusive), and each integer is an index that corresponds to a `42000`-dimensional embedding. The `nn.Embedding` layer maps each index in `X` to a `42000`-dimensional vector, so the output `Y` is a tensor of size `(5, 6, 7, 8, 42000)`. This output shape is due to the fact that an additional dimension for the embedding vector is added to the shape of `X`.

In [6]:
# 2. Bonus answer:

class MyEmbedding:
    def __init__(self, num_embeddings, embedding_dim):
        self.weights = torch.randn(num_embeddings, embedding_dim, requires_grad=True)

    def forward(self, X):
        return self.weights[X]

my_embedding = MyEmbedding(num_embeddings=42, embedding_dim=42000)
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
Y_custom = my_embedding.forward(X)

print(f"{Y_custom.shape=}")

Y_custom.shape=torch.Size([5, 6, 7, 8, 42000])


3. Regarding truncated backpropagation through time (TBPTT) with a sequence length of S: State whether the following sentences are **true or false**, and explain.
  1. TBPTT uses a modified version of the backpropagation algorithm.
  2. To implement TBPTT we only need to limit the length of the sequence provided to the model to length S.
  3. TBPTT allows the model to learn relations between input that are at most S timesteps apart.

**Answer 3:**

1. True, in the sense that TBPTT is a variant of backpropagation through time tailored to long sequences: the error is backpropagated over at most S timesteps at a time, which makes training recurrent neural networks on long sequences efficient. Within each segment the gradients are still computed with the regular chain rule.

2. False. Implementing TBPTT isn't simply about restricting the sequence length provided to the model. The sequence is split into chunks of S steps, but the hidden state at the end of each chunk must be carried over as the initial state of the next chunk, and it must be detached from the computation graph so that backpropagation is confined to S timesteps.

3. True with respect to what the gradient can capture. Since error backpropagation is limited to S steps, the model receives no gradient signal for dependencies that span more than S timesteps. The hidden state carried between chunks still passes earlier information forward, but the model is not explicitly trained to use it.
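
A minimal sketch of such a training loop is shown below, assuming a toy regression task with random data, an `nn.RNN` with a linear read-out, and plain SGD; all sizes and names are illustrative. The key steps are carrying the hidden state from one chunk to the next and detaching it so that gradients do not flow past the chunk boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, T, batch, d_in, d_h = 5, 20, 2, 3, 8  # chunk length S, full sequence length T

rnn = nn.RNN(d_in, d_h, batch_first=True)
head = nn.Linear(d_h, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(batch, T, d_in)
y = torch.randn(batch, T, 1)

h = None         # initial hidden state (zeros)
num_updates = 0
for t in range(0, T, S):
    x_chunk, y_chunk = x[:, t:t + S], y[:, t:t + S]

    out, h = rnn(x_chunk, h)  # forward pass continues from the previous chunk's state
    loss = nn.functional.mse_loss(head(out), y_chunk)

    opt.zero_grad()
    loss.backward()           # backpropagates through at most S timesteps
    opt.step()
    num_updates += 1

    h = h.detach()            # cut the graph: no gradient flows into earlier chunks

print(f"{num_updates} updates for a sequence of length {T}")
```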

### Attention

1. In tutorial 5 we learned how to use attention to perform alignment between a source and target sequence in machine translation.
  1. Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
  
  2. After learning that self-attention is gaining popularity thanks to the transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections, like in the tutorial). What influence do you expect this will have on the learned hidden states?


**Answer:**
1. The addition of an attention mechanism changes what the hidden states need to encode. Without attention, the decoder sees the source sequence only through the encoder's final hidden state, so this single fixed-size vector must summarize the entire source sequence. With attention, the decoder computes, for each target token, a weighted combination of all encoder hidden states. Each encoder state can therefore focus on representing its own position and its local context, and the decoder states learn to act as queries that select the relevant portions of the source sequence for each target token, which produces the alignment.
2. Switching to self-attention, where keys, queries, and values are all derived from the encoder's hidden states, means the attention weights no longer depend on the decoder state. We would expect the encoder states to capture interactions within the source sequence, since each state is enriched with context from other source positions. However, the attention output can no longer be aligned with the current target token, so the decoder's hidden states would have to encode by themselves which part of the source is currently relevant, much like in the model without attention.

### Unsupervised learning

1. As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and  a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term.
What would be the qualitative effect of this on:

  1. Images reconstructed by the model during training ($x\to z \to x'$)?
  1. Images generated by the model ($z \to x'$)?

**Answer 1:**

1. Reconstructed images during training ($x\to z \to x'$):
Dropping the KL-divergence term from the VAE loss leaves only the reconstruction term, so the model is trained essentially as a plain autoencoder. Reconstructions during training would therefore still be of good quality, and possibly sharper than with the KL term, because the encoder is free to place each image anywhere in the latent space and can shrink the predicted variance so that sampling adds almost no noise. The price is that the latent space is no longer regularized toward the prior.

2. Generated images ($z \to x'$):
The KL-divergence term of the VAE loss pushes the latent distributions toward a standard Gaussian, which is what makes it possible to generate new images by sampling from that Gaussian. Without it, the latent space can be unorganized and follow no known distribution. When new latent vectors are then sampled from a standard Gaussian, many of them fall in regions the decoder never saw during training, so the generated images are likely to be nonsensical or unpredictable.
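
As a rough illustration of what the missing term measures, the sketch below evaluates the closed-form KL divergence between a diagonal Gaussian and the standard normal prior. The function names `kl_to_standard_normal` and `vae_loss`, the `use_kl` flag and the example encoder outputs are all hypothetical and chosen only for this example.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vae_loss(recon_error, mu, log_var, use_kl=True):
    """Reconstruction error plus (optionally) the KL regularizer."""
    kl = kl_to_standard_normal(mu, log_var) if use_kl else 0.0
    return recon_error + kl

# A posterior that matches the prior costs nothing.
kl_prior_like = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# A posterior far from the origin with a tiny variance, which is the kind of
# encoding an unregularized encoder is free to produce, is heavily penalized.
kl_far = kl_to_standard_normal([5.0, -5.0], [-6.0, -6.0])

# With use_kl=False this penalty silently disappears from the loss.
loss_with_kl = vae_loss(0.1, [5.0, -5.0], [-6.0, -6.0], use_kl=True)
loss_without_kl = vae_loss(0.1, [5.0, -5.0], [-6.0, -6.0], use_kl=False)
print(kl_prior_like, kl_far, loss_with_kl, loss_without_kl)
```

When the term is forgotten, nothing discourages encodings like the second one, which is why samples drawn from the standard normal may land far from any region the decoder was trained on.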

2. Regarding VAEs, state whether each of the following statements is **true or false**, and explain:
  1. The latent-space distribution generated by the model for a specific input image is $\mathcal{N}(\vec{0},\vec{I})$.
  2. If we feed the same image to the encoder multiple times, then decode each result, we'll get the same reconstruction.
  3. Since the real VAE loss term is intractable, what we actually minimize instead is its upper bound, in the hope that the bound is tight.

**Answer:**

1. False. For a specific input image the encoder does not produce $\mathcal{N}(\vec{0},\vec{I})$. It outputs a mean vector $\vec{\mu}$ and a standard-deviation vector $\vec{\sigma}$ tailored to that image, which define the distribution $\mathcal{N}(\vec{\mu},\diag(\vec{\sigma}^2))$. The standard normal $\mathcal{N}(\vec{0},\vec{I})$ is the prior that the KL term pulls these distributions toward, not the distribution for any single image.

2. False. The latent vector is sampled from the distribution that the encoder outputs, so encoding the same image several times generally yields different latent vectors, and decoding them yields slightly different reconstructions.

3. True. The quantity we would like to minimize is the negative log-likelihood of the data, which is intractable because it requires the true posterior distribution over the latent variables. Instead we minimize an upper bound on it, namely the negative of the evidence lower bound (ELBO); this is equivalent to maximizing the ELBO. The bound is tight when the approximate posterior produced by the encoder matches the true posterior.
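
The randomness behind the second answer can be sketched with the reparameterization trick. The helper name `sample_latent` and the encoder outputs `mu` and `sigma` below are hypothetical values used only for illustration.

```python
import random

def sample_latent(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)  # seeded so the example is reproducible

# Hypothetical encoder output for one image.
mu, sigma = [0.5, -1.0], [0.1, 0.2]

# Two passes of the same image through the sampling step give different
# latent vectors, hence different reconstructions after decoding.
z1 = sample_latent(mu, sigma, rng)
z2 = sample_latent(mu, sigma, rng)

# Only if sigma were exactly zero would the encoding be deterministic.
z_det = sample_latent(mu, [0.0, 0.0], rng)
print(z1, z2, z_det)
```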

3. Regarding GANs, state whether each of the following statements is **true or false**, and explain:
  1. Ideally, we want the generator's loss to be low, and the discriminator's loss to be high so that it's fooled well by the generator.
  2. It's crucial to backpropagate into the generator when training the discriminator.
  3. To generate a new image, we can sample a latent-space vector from $\mathcal{N}(\vec{0},\vec{I})$.
  4. It can be beneficial for training the generator if the discriminator is trained for a few epochs first, so that its output isn't arbitrary.
  5. If the generator is generating plausible images and the discriminator reaches a stable state where it has 50% accuracy (for both image types), training the generator more will further improve the generated images.

**Answer:**

1. True. A low generator loss means the discriminator classifies generated images as real, and a high discriminator loss means it is being fooled, which suggests that the generated images are hard to distinguish from real ones. Note that this only indicates good images if the discriminator is itself competent: a poorly trained discriminator also has a high loss, and fooling it says little about image quality.

2. False. When training the discriminator, the generator is treated as fixed: the generated images are detached from the computation graph, so no gradients flow back into the generator. Only the discriminator's parameters are updated in this step, so computing gradients for the generator would be wasted work, and applying them would push the generator to lower the discriminator's loss, i.e. to produce images that are easier to tell apart from real ones.

3. True. To create a new image, a latent vector is sampled from the prior, typically the standard normal distribution $\mathcal{N}(\vec{0},\vec{I})$, and passed through the generator.

4. True. It can be beneficial to train the discriminator for a few epochs before starting to train the generator. If the discriminator's output is arbitrary, the gradients it passes to the generator carry no useful signal. The discriminator should not get too far ahead, though, since a discriminator that rejects every generated image with high confidence can leave the generator with very small gradients.

5. False. If the generator already produces plausible images and the discriminator is right only about 50% of the time on both real and generated images, it is doing no better than chance. At this stage the GAN has likely reached an equilibrium where the discriminator's output no longer tells the generator how to improve, so training the generator further is unlikely to improve the images and may destabilize training.
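
The loss values mentioned in these answers can be made concrete with a small numeric sketch. It assumes the standard binary cross-entropy discriminator loss and the non-saturating generator loss; the function names and the probability values are illustrative, not taken from the course code.

```python
import math

def discriminator_loss(d_real, d_fake):
    """-log D(x) - log(1 - D(G(z))), each term averaged over its batch."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: -log D(G(z)), averaged over the batch."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# A confident discriminator: real images scored near 1, fakes near 0.
d_loss_confident = discriminator_loss([0.99, 0.99], [0.01, 0.01])
g_loss_confident = generator_loss([0.01, 0.01])

# The 50% equilibrium: the discriminator outputs 0.5 for everything.
d_loss_equilibrium = discriminator_loss([0.5, 0.5], [0.5, 0.5])
g_loss_equilibrium = generator_loss([0.5, 0.5])

print(d_loss_confident, g_loss_confident)
print(d_loss_equilibrium, g_loss_equilibrium)
```

At the equilibrium the discriminator loss is $2\log 2$ and the generator loss is $\log 2$, regardless of which image is fed in, which is why the discriminator's output stops being informative for the generator.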

### Detection and Segmentation 

1. What is the difference between IoU and Dice score? What is the difference between IoU and mAP?
    Briefly explain when you would use each evaluation metric.

**Answer:**
Intersection over Union (IoU) and the Dice score both measure the overlap between a predicted region $A$ and a ground-truth region $B$, and both range from 0 (no overlap) to 1 (perfect match).
IoU is $|A \cap B| / |A \cup B|$, while the Dice score is $2|A \cap B| / (|A| + |B|)$, i.e. the shared area relative to the total number of pixels in both regions. The two are monotonically related through $\text{Dice} = 2\,\text{IoU}/(1+\text{IoU})$, and Dice is never lower than IoU, so it penalizes partial overlaps less harshly.
Mean Average Precision (mAP) is not an overlap measure for a single prediction but a dataset-level metric for object detection: a detection counts as a true positive if its IoU with a ground-truth box exceeds a threshold, the average precision (area under the precision-recall curve) is computed per class, and the result is averaged over the classes.
IoU is the natural choice for scoring an individual box or mask, mAP for evaluating a full detector including class labels and confidence ranking, and the Dice score is commonly used for segmentation tasks, where the emphasis is on the spatial overlap of masks.
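
A minimal sketch of the two overlap metrics, with masks represented as sets of pixel coordinates. The function names `iou` and `dice`, the convention of returning 1.0 for two empty masks, and the example masks are assumptions made for this illustration.

```python
def iou(pred, target):
    """Intersection over Union of two sets of pixel coordinates."""
    union = len(pred | target)
    return len(pred & target) / union if union else 1.0

def dice(pred, target):
    """Dice score: twice the intersection over the sum of both sizes."""
    total = len(pred) + len(target)
    return 2 * len(pred & target) / total if total else 1.0

# Two 2x2 masks that share one column of pixels.
pred_mask = {(0, 0), (0, 1), (1, 0), (1, 1)}
true_mask = {(0, 1), (1, 1), (0, 2), (1, 2)}

iou_score = iou(pred_mask, true_mask)    # intersection 2, union 6
dice_score = dice(pred_mask, true_mask)  # 2 * 2 / (4 + 4)
print(iou_score, dice_score)
```

For the same pair of masks the Dice score (0.5) is higher than the IoU (1/3), consistent with the relation $\text{Dice} = 2\,\text{IoU}/(1+\text{IoU})$.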

2. Regarding YOLO and Mask R-CNN, which one is a one-stage detector? Describe the RPN outputs and the YOLO output, addressing how the network produces each output and the shape of each output.

**Answer:**
YOLO (You Only Look Once) is the one-stage detector. A single forward pass through the network predicts the bounding box positions, the confidence scores and the class probabilities together. The shape of YOLO's output depends on the grid size and on the number of bounding boxes each grid cell predicts. In YOLOv1, for example, the image is divided into an $S \times S$ grid, and each cell predicts $B$ boxes (four coordinates plus a confidence score each) and $C$ class probabilities, so the output is a tensor of shape $S \times S \times (5B + C)$.

Mask R-CNN, in contrast, is a two-stage detector. In the first stage, a Region Proposal Network (RPN) slides a small convolutional head over the backbone's feature map. At each of the $H \times W$ feature-map locations, and for each of $k$ predefined anchor boxes, it outputs an objectness score and four box-regression offsets relative to the anchor, giving outputs of shape $H \times W \times k$ (or $2k$ in the original two-class softmax formulation) and $H \times W \times 4k$. The highest-scoring proposals are passed to the second stage, which refines them into more precise bounding boxes, class labels, and pixel-level segmentation masks.

In a nutshell, YOLO serves as a fast, one-stage detector outputting bounding boxes and class probabilities simultaneously. In contrast, Mask R-CNN operates as a detailed two-stage detector using an RPN for initial object proposal generation, followed by a refinement stage for precise detection and segmentation.
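
The output shapes can be worked out with a short sketch. It assumes the YOLOv1 layout, where an $S \times S$ grid predicts $B$ boxes and $C$ class probabilities per cell, and an RPN head with $k$ anchors per location and one sigmoid objectness score per anchor. The function names and the channels-last ordering are choices made for this illustration; actual implementations differ in layout and in later YOLO versions.

```python
def yolo_v1_output_shape(grid_size, boxes_per_cell, num_classes):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores."""
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)

def rpn_output_shapes(feat_h, feat_w, anchors_per_location):
    """Per feature-map location: k objectness scores and 4k box offsets."""
    objectness = (feat_h, feat_w, anchors_per_location)
    box_deltas = (feat_h, feat_w, 4 * anchors_per_location)
    return objectness, box_deltas

# YOLOv1 settings from the original paper: S=7, B=2, C=20.
yolo_shape = yolo_v1_output_shape(7, 2, 20)

# An example 50x38 feature map with 9 anchors per location (sizes assumed).
rpn_scores_shape, rpn_deltas_shape = rpn_output_shapes(50, 38, 9)

print(yolo_shape)
print(rpn_scores_shape, rpn_deltas_shape)
```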