In [1]:
import torch
import warnings
import sys
sys.path.append('/home/jovyan/work/d2l_solutions/notebooks/exercises/d2l_utils/')
import d2l
warnings.filterwarnings('ignore')


class Classifier(d2l.Module):
    def validation_step(self, batch):
        y_hat = self(*batch[:-1])
        self.plot('loss', self.loss(y_hat, batch[-1]), train=False)
        self.plot('acc', self.accuracy(y_hat, batch[-1]), train=False)
        
    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)
    
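    def accuracy(self, y_hat, y, averaged=True):
        """Compute the fraction (or per-example indicator) of correct predictions."""
        # Flatten all leading dimensions so each row holds one example's class scores.
        y_hat = y_hat.reshape((-1, y_hat.shape[-1]))
        # The predicted class is the index of the largest score.
        preds = y_hat.argmax(axis=1).type(y.dtype)
        comp = (preds == y.reshape(-1)).type(torch.float32)
        # Call mean() so a scalar tensor is returned, not the bound method.
        return comp.mean() if averaged else comp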



# 4.3.4. Exercises

## 1. Denote by $L_v$ the validation loss, and let $L_v^q$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_v^b$ the loss on the last minibatch. Express $L_v$ in terms of $L_v^q$, $l_v^b$, and the sample and minibatch sizes.

Let the validation dataset contain \(N\) samples, split into minibatches of size \(M\). There are \(n = \lceil N/M \rceil\) minibatches in total. The first \(n-1\) are full, and the last one holds the remaining \(r = N - (n-1)M\) samples, with \(1 \le r \le M\).

Denote by \(l_v^{(i)}\) the mean loss on minibatch \(i\), so that \(l_v^b = l_v^{(n)}\) is the loss on the last minibatch. The quick and dirty estimate \(L_v^q\) averages the minibatch losses with equal weight:

\[L_v^q = \frac{1}{n} \sum_{i=1}^{n} l_v^{(i)}\]

\(L_v\) is the true validation loss, computed as the average over all \(N\) samples. Each minibatch loss must therefore be weighted by the number of samples it contains:

\[L_v = \frac{1}{N} \left( M \sum_{i=1}^{n-1} l_v^{(i)} + r \, l_v^b \right)\]

From the definition of \(L_v^q\), the sum over the full minibatches is \(\sum_{i=1}^{n-1} l_v^{(i)} = n L_v^q - l_v^b\). Substituting this and using \(M - r = nM - N\):

\[L_v = \frac{1}{N} \left( M \left( n L_v^q - l_v^b \right) + r \, l_v^b \right) = \frac{nM}{N} L_v^q - \frac{nM - N}{N} l_v^b\]

So, the final expression is:

\[L_v = \frac{nM}{N} L_v^q - \frac{nM - N}{N} l_v^b, \qquad n = \left\lceil \frac{N}{M} \right\rceil\]

In other words, the true validation loss \(L_v\) equals the quick and dirty estimate \(L_v^q\) only when \(M\) divides \(N\), because then \(nM = N\) and the correction term vanishes. Otherwise the last minibatch is smaller than the others, \(L_v^q\) gives each of its samples too much weight, and the term in \(l_v^b\) removes that excess.
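
A small numerical check can make the role of the last minibatch concrete. The sketch below uses made-up per-sample losses and hypothetical names (`validation_losses`, `recovered`). It assumes each minibatch loss is the mean over its samples and that only the last minibatch may be smaller. It compares the exact mean over all samples with the value reconstructed from \(L_v^q\), \(l_v^b\), \(N\), and \(M\) via \((nM L_v^q - (nM - N) l_v^b)/N\), where \(n = \lceil N/M \rceil\).

```python
import math


def validation_losses(per_sample_losses, batch_size):
    """Return (L_v, L_v^q, l_v^b, n) for a list of per-sample losses."""
    # Split into consecutive minibatches; the last one may be smaller.
    batches = [per_sample_losses[i:i + batch_size]
               for i in range(0, len(per_sample_losses), batch_size)]
    batch_means = [sum(b) / len(b) for b in batches]
    exact = sum(per_sample_losses) / len(per_sample_losses)  # L_v
    quick = sum(batch_means) / len(batch_means)              # L_v^q
    return exact, quick, batch_means[-1], len(batch_means)


# Made-up per-sample losses (illustration only): 0.1, 0.2, ..., 1.0
losses = [0.1 * i for i in range(1, 11)]
N, M = len(losses), 4  # 10 samples, so the minibatch sizes are 4, 4, 2

L_v, L_v_q, l_v_b, n = validation_losses(losses, M)

# Undo the over-weighting of the short last minibatch.
recovered = (n * M * L_v_q - (n * M - N) * l_v_b) / N

print(f"L_v = {L_v:.4f}, L_v^q = {L_v_q:.4f}, recovered = {recovered:.4f}")
print("match:", math.isclose(L_v, recovered))
```

Setting `M = 5` makes every minibatch full, in which case the two estimates coincide and the correction term is zero.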

## 2. Show that the quick and dirty estimate $L_v^q$ is unbiased. That is, show that  $E[L_v]=E[L_v^q]$. Why would you still want to use $L_v$ instead?
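
One way to approach this exercise, assuming the validation samples are drawn i.i.d.: every minibatch loss is a mean of samples that share the same expected loss, so each minibatch loss has the same expectation as \(L_v\), and so does their equally weighted average \(L_v^q\). The two estimates differ in variance rather than in mean. When the last minibatch is smaller, \(L_v^q\) overweights its few samples and fluctuates more, which is a reason to prefer \(L_v\). The simulation below is a minimal sketch of this argument, using uniform random numbers as stand-in losses and hypothetical names (`exact_and_quick`, `trials`).

```python
import random
import statistics


def exact_and_quick(losses, batch_size):
    """Return (L_v, L_v^q) for a list of per-sample losses."""
    batches = [losses[i:i + batch_size]
               for i in range(0, len(losses), batch_size)]
    batch_means = [sum(b) / len(b) for b in batches]
    return sum(losses) / len(losses), sum(batch_means) / len(batch_means)


random.seed(0)
N, M, trials = 9, 8, 5000  # minibatch sizes are 8 and 1: a very short last batch

exact_vals, quick_vals = [], []
for _ in range(trials):
    # Stand-in i.i.d. per-sample losses, uniform on [0, 1) with mean 0.5.
    losses = [random.random() for _ in range(N)]
    exact, quick = exact_and_quick(losses, M)
    exact_vals.append(exact)
    quick_vals.append(quick)

mean_exact = statistics.mean(exact_vals)
mean_quick = statistics.mean(quick_vals)
var_exact = statistics.variance(exact_vals)
var_quick = statistics.variance(quick_vals)

print(f"L_v  : mean = {mean_exact:.3f}, variance = {var_exact:.4f}")
print(f"L_v^q: mean = {mean_quick:.3f}, variance = {var_quick:.4f}")
```

Both means should land close to 0.5, while the variance of \(L_v^q\) should come out noticeably larger, since the single sample in the last minibatch carries half of the total weight.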

## 3. Given a multiclass classification loss, denoting by $l(y,y^\prime)$ the penalty of estimating $y^\prime$ when we see $y$ and given a probability $p(y|x)$, formulate the rule for an optimal selection of $y^\prime$.
Hint: express the expected loss, using $l$ and $p(y|x)$.
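
Following the hint, the expected loss of predicting \(y'\) given \(x\) is \(\sum_y l(y, y') \, p(y|x)\), and a natural rule is to pick the \(y'\) that minimizes it. The sketch below illustrates that rule on made-up numbers. The function name `bayes_optimal_prediction`, the loss tables, and the posterior are all hypothetical. It assumes a finite set of classes, with `loss[y][y_prime]` holding \(l(y, y')\).

```python
def bayes_optimal_prediction(loss, posterior):
    """Pick the prediction y' that minimizes the expected loss.

    loss[y][y_prime] is the penalty l(y, y'); posterior[y] is p(y | x).
    Returns (best y', list of expected losses for every y').
    """
    num_classes = len(posterior)
    expected = [
        sum(posterior[y] * loss[y][y_prime] for y in range(num_classes))
        for y_prime in range(num_classes)
    ]
    best = min(range(num_classes), key=lambda y_prime: expected[y_prime])
    return best, expected


posterior = [0.5, 0.3, 0.2]  # made-up p(y | x) for three classes

# With 0-1 loss, the rule reduces to choosing the most probable class.
zero_one = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
pred_zero_one, _ = bayes_optimal_prediction(zero_one, posterior)

# If missing true class 2 costs 10, class 2 is chosen despite p = 0.2.
asymmetric = [[0, 1, 1],
              [1, 0, 1],
              [10, 10, 0]]
pred_asymmetric, expected_asymmetric = bayes_optimal_prediction(asymmetric, posterior)

print(pred_zero_one, pred_asymmetric, expected_asymmetric)
```

The second case shows why the rule is stated in terms of expected loss rather than the largest probability: a costly mistake can outweigh a low posterior.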