<!-- Assignment 3 - SS 2024 -->

# Monitoring, Hyperparameters and efficient CNNs  (15 points)

This notebook contains one of the assignments for the exercises in Deep Learning and Neural Nets 2.
It provides a skeleton, i.e. code with gaps, that will be filled out by you in different exercises.
All exercise descriptions are visually annotated by a vertical bar on the left and some extra indentation,
unless you already messed with your jupyter notebook configuration.
Any questions that are not part of the exercise statement do not need to be answered,
but should rather be interpreted as triggers to guide your thought process.

**Note**: The cells in the introductory part (before the first subtitle)
perform all necessary imports and provide utility functions that should work without (too much) problems.
Please, do not alter this code or add extra import statements in your submission, unless explicitly allowed!

<span style="color:#d95c4c">**IMPORTANT:**</span> Please, change the name of your submission file so that it contains your student ID!

In this assignment, the main goal is to get familiar with neural network hyperparameter search.
More specifically, you will perform hyperparameter search on some real-world data.
To prepare you for the search, we will first look at how you can monitor the training progress.

In [1]:
import random
from pathlib import Path

import torch
import torchvision
from torch import nn, optim
from tqdm.notebook import tqdm
from torch.utils.data import DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter

torch.manual_seed(1806)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

%load_ext tensorboard

cpu


In [2]:
# google colab data management
import os.path

try:
    from google.colab import drive
    drive.mount('/content/gdrive')
    _home = 'gdrive/MyDrive/'
except ImportError:
    _home = '~'
finally:
    data_root = os.path.join(_home, '.pytorch')

print(data_root)

~/.pytorch


## Tracking Progress

Training a deep neural network with millions of parameters can cost quite some time.
E.g. Alexnet already requires roughly [225 hours][alexnet] (>1 week) of compute on a single GPU.
In order to make sure that the network is training as expected,
it is crucial to get some insights into how training progresses.
After all, you do not want to waste hundreds of hours of compute to find out 
that training had already diverged in the first few minutes.
Therefore it is important to be able to monitor the training process.

[alexnet]: https://arxiv.org/abs/1404.5997

As a matter of fact, the `update` and `evaluate` functions 
already implement some sort of ad hoc monitoring by providing the list of errors in a batch.
This list can be used to print the mean loss after every epoch
and can therefore be used to get an idea of how learning is progressing.
This specific implementation of monitoring the loss is not very flexible, however,
since it is not possible to access the information before the epoch has finished.

Before we start, we will tackle the flexibility of monitoring the loss
by creating a separate `Tracker` class to keep track
of important steps and results during training.

In [3]:
class Tracker:
    """ Tracks useful information as learning progresses. """

    def __init__(self, *loggers: "Logger"):
        """
        Parameters
        ----------
        logger0, logger1, ... loggerN : Logger
            One or more loggers for logging training information.
        """
        self.epoch = 0
        self.update = 0
        self._tag = None
        self._losses = []
        self._summary = {}

        self.loggers = list(loggers)

    def start_epoch(self, count: bool = True):
        """ Start one iteration of updates over the training data. """
        if count:
            self.epoch += 1
        
        self._summary.clear()
        for logger in self.loggers:
            logger.on_epoch_start(self.epoch)

    def end_epoch(self):
        """ Wrap up one iteration of updates over the training data. """
        for logger in self.loggers:
            logger.on_epoch_end(self.epoch, **self._summary)

        return dict(self._summary)

    def start(self, tag: str, num_batches: int = None):
        """ Start a loop over mini-batches. """
        self._tag = tag
        self._losses.clear()
        for logger in self.loggers:
            logger.on_iter_start(self.epoch, self.update, self._tag, num_steps_expected=num_batches)
    
    def step(self, loss: float):
        """ Register the loss of a single mini-batch. """
        self._losses.append(loss)
        for logger in self.loggers:
            logger.on_iter_update(self.epoch, self.update, self._tag, loss=loss)  

    def summary(self):
        """ Wrap up and summarise a loop over mini-batches. """
        losses = self._losses
        avg_loss = float("nan") if len(losses) == 0 else sum(losses) / len(losses)
        self._summary[self._tag] = avg_loss
        for logger in self.loggers:
            logger.on_iter_end(self.epoch, self.update, self._tag, avg_loss=avg_loss)

        return avg_loss

    def count_update(self):
        """ Increase the update counter. """
        self.update += 1
        for logger in self.loggers:
            logger.on_update(self.epoch, self.update)

This class provides the same functionality as the list that
you might have used in the current `update` and `evaluate` functions.
However, it also makes it possible to extend the functionality
of both functions without the need to interfere with existing code.

Note that there are libraries and frameworks out there that provide
(parts of) the functionality we will implement in what follows.
Two example frameworks that directly build on pytorch are
[pytorch-lightning](https://www.pytorchlightning.ai/)
and [pytorch ignite](https://pytorch.org/ignite/).

### Exercise 1: Combining Classes for Tracking (3 points)

You might not have noticed yet, but in assignment 2, a `Trainer` class was introduced.
The goal of this exercise is to extend this `Trainer` class to make use of the `Tracker`.

 > Update the `Trainer` class to make use of the `tracker` attribute (see `__init__`).
 > The functionality and outputs of the current implementation should be preserved.
 > Also, make sure to offload as much as possible to the `tracker`.
 > You will want to use every method of the `Tracker` class.

In [25]:
class Trainer:
    """ Class to organise learning and monitoring. """

    def __init__(self,
         model: nn.Module,
         criterion: nn.Module,
         optimiser: optim.Optimizer,
         tracker: Tracker = None,
    ):
        """
        Parameters
        ----------
        model : torch.nn.Module
            Neural Network that will be trained.
        criterion : torch.nn.Module
            Loss function to use for training.
        optimiser : torch.optim.Optimizer
            Optimisation strategy for training.
        tracker : Tracker, optional
            Tracker to keep track of training progress.
        """
        if tracker is None:
            tracker = Tracker()

        self.model = model
        self.criterion = criterion
        self.optimiser = optimiser

        self.tracker = tracker

    def state_dict(self):
        """ Current state of learning. """
        return {
            "model": self.model.state_dict(),
            "objective": self.criterion.state_dict(),
            "optimiser": self.optimiser.state_dict(),
            "num_epochs": self.tracker.epoch,
            "num_updates": self.tracker.update,
        }

    @property
    def device(self):
        """ Device of the (first) model parameters. """
        return next(self.model.parameters()).device

    @torch.no_grad()
    def evaluate(self, batches: DataLoader, tag: str = None):
        """
        One epoch of evaluating the network.

        Parameters
        ----------
        batches : DataLoader
            An iterator over mini-batches of data to use for updating.
        tag : str, optional
            Identification tag for tracking loss values.

        Returns
        -------
        avg_loss : float
            The average loss over all mini-batches.
        """
        self.model.eval()
        device = self.device
        
        # YOUR CODE HERE
        self.tracker.start('eval', num_batches=batches.batch_size) # start the tracker for the evaluation 
        
        losses = []
        for x, y in batches:
            x, y = x.to(device), y.to(device)
            logits = self.model(x)
            loss = self.criterion(logits, y)
            losses.append(loss.item())
            self.tracker.step(loss.item()) # track the loss of one epoch
        avg_loss = sum(losses) / len(losses)
        self.tracker.summary() # summary of the loop over the batches
        self.tracker.count_update() # increase the update counter
        return avg_loss

    @torch.enable_grad()
    def update(self, batches: DataLoader, tag: str = None):
        """
        One epoch of updating the network.

        Parameters
        ----------
        batches : DataLoader
            An iterator over mini-batches of data to use for updating.
        tag : str, optional
            Identification tag for tracking loss values.

        Returns
        -------
        avg_loss : float
            The average loss over all mini-batches.
        """
        self.model.train()
        device = self.device
        
        # YOUR CODE HERE
        self.tracker.start('train', num_batches=batches.batch_size) # start the tracker for the training epch
        
        losses = []
        for x, y in batches:
            x, y = x.to(device), y.to(device)
            logits = self.model(x)
            loss = self.criterion(logits, y)
            losses.append(loss.item())

            self.tracker.step(loss=loss.item()) # track the loss of one epoch
            
            self.optimiser.zero_grad()
            loss.backward()
            self.optimiser.step()

        self.tracker.summary() # summary of the loop over the batches
        self.tracker.count_update() # increase the update counter
        avg_loss = sum(losses) / len(losses)
        return avg_loss

    def train(self, train_batches, valid_batches=None, num_epochs: int = 1):
        """
        Train the network for multiple epochs.

        Parameters
        ----------
        train_batches : DataLoader
            The training data for updating the network.
        valid_batches : DataLoader, optional
            The validation data for estimating the generalisation performance.
        num_epochs : int, optional
            The number of epochs to train.

        Returns
        -------
        results : dict
            The average loss estimates after `num_epochs` epochs.
            
        """
        if valid_batches is None:
            valid_batches = ()

        # YOUR CODE HERE
        
        
        train_loss = self.evaluate(train_batches)
        valid_loss = self.evaluate(valid_batches)
        for _ in range(num_epochs):
            self.tracker.start_epoch()
            train_loss = self.update(train_batches)
            valid_loss = self.evaluate(valid_batches)
            self.tracker.end_epoch()
        return {"train": train_loss, "valid": valid_loss}

In [5]:
# sanity check (and test setup)
from torchvision import transforms
mean, std = .1307, .3081
normalise = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((mean, ), (std, ))
])

dataset = torchvision.datasets.FashionMNIST(data_root, train=False, transform=normalise, download=True)
loader = DataLoader(dataset, batch_size=1024, shuffle=True, num_workers=2)
    
conv_net = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.MaxPool2d(3), nn.ELU(),
    nn.Conv2d(8, 16, 7), nn.ELU(),
    nn.Flatten(),
    nn.Linear(64, 10),
)

trainer = Trainer(
    model=conv_net.to(device),
    criterion=nn.CrossEntropyLoss(reduction="sum"),
    optimiser=optim.Adam(conv_net.parameters(), lr=1e-2),
)

results = trainer.train(loader, loader)
assert "train" in results, "ex1: could not find training loss in results"
assert "valid" in results, "ex1: could not find validation loss in results"
assert isinstance(results["train"], float), (
    f"ex1: expected training loss to be of type 'float', but found '{type(results['train'])}'"
)
assert isinstance(results["valid"], float), (
    f"ex1: expected validation loss to be of type 'float', but found '{type(results['valid'])}'"
)

In [6]:
# Test Cell: do not edit or delete!

In [7]:
# Test Cell: do not edit or delete!
trainer = Trainer(
    model=conv_net.to(device),
    criterion=nn.CrossEntropyLoss(reduction="sum"),
    optimiser=optim.Adam(conv_net.parameters(), lr=1e-2),
)
results = trainer.train(loader, loader, num_epochs=1)
assert trainer.tracker.epoch == 1, (
    f"ex1: expected tracker to have counted 1 epoch, but found {trainer.tracker.epoch} "
)

In [8]:
# Test Cell: do not edit or delete!
assert "train" in results, "ex1: could not find training loss in results"
assert "valid" in results, "ex1: could not find validation loss in results"

In [9]:
# Test Cell: do not edit or delete!
results = trainer.evaluate(loader, tag="extra")

## Logging Tracked Information

In its simplest form, a `Tracker` only keeps track of what happens in an epoch.
It knows about the loss values for each mini-batch,
but also how many epochs and updates already happened.
However, as mentioned earlier, a lot of features can be added to the `Tracker`.

Most notably, we can use the `Tracker` to store certain information during training.
Thus far, loss information has been collected to compute the average and is then discarded.
In order to revisit this information later, it can be written to a file, or _logged_.

For this purpose, we will use the interface provided by the `Logger` class (below).
This way, different types of information can be logged in a flexible way.
Luckily the `Tracker` class already provides everything that is necessary
to work with loggers to monitor whatever we need during learning.

In [10]:
class Logger:
    """ Extracts and/or persists tracker information. """

    def __init__(self, path: str = None):
        """
        Parameters
        ----------
        path : str or Path, optional
            Path to where data will be stored.
        """
        path = Path("run") if path is None else Path(path)
        self.path = path.expanduser().resolve()

    def on_epoch_start(self, epoch: int, **kwargs):
        """Actions to take on the start of an epoch."""
        pass

    def on_epoch_end(self, epoch: int, **kwargs):
        """Actions to take on the end of an epoch."""
        pass

    def on_iter_start(self, epoch: int, update: int, tag: str, **kwargs):
        """Actions to take on the start of an iteration."""
        pass

    def on_iter_update(self, epoch: int, update: int, tag: str, **kwargs):
        """Actions to take when an update has occurred."""
        pass

    def on_iter_end(self, epoch: int, update: int, tag: str, **kwargs):
        """Actions to take on the end of an iteration."""
        pass
    
    def on_update(self, epoch: int, update: int):
        """Actions to take when the model is updated."""
        pass

### Exercise 2: Progress bar (1 point)

Monitoring the loss early on during training can be useful
to check whether things are working as expected.
In combination with an indication of progress in training,
expectations can be properly managed early on.

 > Create a logger that produces some sort of progress bar for each epoch.
 > The progress bar should show the current epoch, the current trainnig stage (tag) and the current loss value.
 > Moreover, it should print a short summary after each epoch, including the average loss for each tag.
 > Note that most of this information is passed through the `kwargs` in the `Logger` methods.

**Hint:** You probably want to make use of the [`tqdm` library](https://tqdm.github.io/docs/tqdm/) to manage the progress bar.

In [None]:
class ProgressBar(Logger):
    """Log progress of epoch using a progress bar."""

    def __init__(self):
        super().__init__()
        # YOUR CODE HERE
        self.progress_bar = None
        self.track_loss = {}

    
    # Init a progress bar for a epoch
    #logger.on_iter_start(self.epoch, self.update, self._tag, num_steps_expected=num_batches)
    def on_iter_start(self, epoch, update,  tag : str, num_steps_expected : int):
        self.progress_bar = tqdm(total=epoch, desc=f'{tag} progress')

    # Update the progress bar with the loss of the current iteration
    # logger.on_iter_update(self.epoch, self.update, self._tag, loss=loss)  
    def on_iter_update(self, epoch, update, tag, loss):
        if(epoch not in self.track_loss):
            self.track_loss[epoch] = []
        self.track_loss[epoch].append(update)
        self.progress_bar.set_postfix_str({'Epoch= ': epoch, 'Loss= ': update})
        self.progress_bar.update(1)

    # End the progress bar for the epoch and print a summary 
    # ogger.on_iter_end(self.epoch, self.update, self._tag, avg_loss=avg_loss)
    def on_iter_end(self, epoch, update, tag, avg_loss):
        average_loss = {epoch: sum(losses)/len(losses) for epoch, losses in self.track_loss.items()}
        print(f'Summary for {tag}: Epoch: {epoch}, Average loss: {avg_loss}')
        self.progress_bar.close()

# TODO: avg loss is totaly high sth is wrong here

In [12]:
# sanity check (and test setup)
progress = ProgressBar()
trainer.tracker.loggers = [progress]
trainer.train(loader, loader, num_epochs=5)

eval progress:   0%|          | 0/1 [00:00<?, ?it/s]

Summary for eval: Epoch: 1, Average loss: 665.5848999023438


eval progress:   0%|          | 0/1 [00:00<?, ?it/s]

Summary for eval: Epoch: 1, Average loss: 665.5848937988281


update progress:   0%|          | 0/2 [00:00<?, ?it/s]

Summary for update: Epoch: 2, Average loss: 617.1163665771485


eval progress:   0%|          | 0/2 [00:00<?, ?it/s]

Summary for eval: Epoch: 2, Average loss: 549.9794769287109


update progress:   0%|          | 0/3 [00:00<?, ?it/s]

Summary for update: Epoch: 3, Average loss: 524.6403076171875


eval progress:   0%|          | 0/3 [00:00<?, ?it/s]

Summary for eval: Epoch: 3, Average loss: 490.3679718017578


update progress:   0%|          | 0/4 [00:00<?, ?it/s]

Summary for update: Epoch: 4, Average loss: 470.1817108154297


eval progress:   0%|          | 0/4 [00:00<?, ?it/s]

Summary for eval: Epoch: 4, Average loss: 433.6232849121094


update progress:   0%|          | 0/5 [00:00<?, ?it/s]

Summary for update: Epoch: 5, Average loss: 430.7246520996094


eval progress:   0%|          | 0/5 [00:00<?, ?it/s]

Summary for eval: Epoch: 5, Average loss: 406.626904296875


update progress:   0%|          | 0/6 [00:00<?, ?it/s]

Summary for update: Epoch: 6, Average loss: 402.27284545898436


eval progress:   0%|          | 0/6 [00:00<?, ?it/s]

Summary for eval: Epoch: 6, Average loss: 402.0369384765625


{'train': 402.27284545898436, 'valid': 402.0369384765625}

In [13]:
# Test Cell: do not edit or delete!

In [14]:
# Test Cell: do not edit or delete!

### Exercise 3: Tensorboard (2 points)

[Tensorboard](https://www.tensorflow.org/tensorboard) 
is a library that allows to track and visualise data during and after training.
Apart from scalar metrics, tensorboard can process distributions, images and much more.
It started as a part of tensorflow, but was then made available as a standalone library.
This makes it possible to use tensorboard for visualising pytorch data.
As a matter of fact, tensorboard is readily available in pytorch.
From [`torch.utils.tensorboard`](https://pytorch.org/docs/stable/tensorboard.html),
the `SummaryWriter` class can be used to track various types of data.

 > Create a Logger that makes use of the `Summarywriter` to monitor the loss with tensorboard.
 > On one hand, it should monitor the loss for every batch and both modes using `<tag>/loss` as tag.
 > On the other hand, it should monitor the average losses after every stage, using `'<tag>/avg_loss'`.

In [None]:
class TensorBoard(Logger):
    """Log loss values to tensorboard."""

    def __init__(self, path: Path = None, every: int = 1):
        super().__init__(path)
        self.every = every
        # YOUR CODE HERE
        self.path = Path('runs') #TODO: check if this is correct normalerweise muss ich den path hier nicht angeben!
        self.writer = SummaryWriter(log_dir=self.path)

    def on_iter_update(self, epoch, update, tag, loss):
        tag = f'{tag}/loss'
        self.writer.add_scalar(tag, loss, epoch)

    def on_iter_end(self, epoch, update, tag, avg_loss):
        tag = f'{tag}/avg_loss'
        self.writer.add_scalar(tag, avg_loss, epoch)

        

In [16]:
# run in the following in the command line if it does not work in jupyter
%tensorboard --logdir runs

In [26]:
# sanity check (and test setup)
tb = TensorBoard()
trainer.tracker.loggers = [tb]
trainer.train(loader, loader, num_epochs=5)

{'train': 227.17418670654297, 'valid': 216.96737060546874}

In [27]:
# Test Cell: do not edit or delete!
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
path = next(tb.path.glob("events.out.tfevents.*"))
tb_data = EventAccumulator(str(path)).Reload()
tags = tb_data.Tags()["scalars"]

assert "train/loss" in tags, "ex3: could not find training loss"
assert "valid/loss" in tags, "ex3: could not find validation loss"

AssertionError: ex3: could not find training loss

In [None]:
# Test Cell: do not edit or delete!
assert "train/avg_loss" in tags, "ex3: could not find avg training loss"
assert "valid/avg_loss" in tags, "ex3: could not find avg validation loss"

NameError: name 'tags' is not defined

### Exercise 4: Always have a Backup-plan (1 point)

Apart from logging metrics like e.g. loss and accuracy,
it can often be useful to create a backup (or checkpoint) of training progress.
After all, you do not want hours of training to get lost
due to a programming error in a print statement at the end of your code.
This idea can also be useful to implement some form of early-stopping.
However, we will ignore that for now.

 > Implement a logger that saves the state of the trainer every few epochs.
 > For the sake of convention, use the `.pth` extension for storing these backups.

**Hint:** you may want to raise a [`warning`](https://docs.python.org/3/library/warnings.html#available-functions) if no trainer has been attached.

In [29]:
class Backup(Logger):
    
    DEFAULT_FILE = "backup.pth"
    
    def __init__(self, path: Path = None, every: int = 1):
        super().__init__(path)
        self.trainer = None
        self.every = every
        
        if self.path.is_dir() or not self.path.suffix:
            self.path = self.path / self.DEFAULT_FILE
        
        self.path.parent.mkdir(exist_ok=True, parents=True)
    
    def attach_trainer(self, trainer: Trainer):
        self.trainer = trainer
    
    # YOUR CODE HERE
    def on_iter_end(self, epoch, update, tag, avg_loss):
        if epoch % self.every == 0:
            print(f"Backup at epoch {epoch}")
            if self.trainer is not None:
                torch.save(self.trainer.state_dict(), self.path)
            else:
                print("No trainer attached to the logger")
                                                

In [30]:
# sanity check (and test setup)
checkpoints = Backup(every=2)
trainer.tracker.loggers = [checkpoints]
checkpoints.attach_trainer(trainer)
trainer.train(loader, loader, num_epochs=4)
trainer.tracker.epoch

Backup at epoch 22
Backup at epoch 22
Backup at epoch 24
Backup at epoch 24


25

In [31]:
# Test Cell: do not edit or delete!
print(torch.load(checkpoints.path)["num_epochs"])

24


In [32]:
# clean up checkpoints and tensorboard logs
! rm -r run

## Hyperparameter Search

Finding good hyperparameters for a model is a general problem in machine learning (or even statistics).
However, neural networks are (in)famous for their large number of hyperparameters.
To list a few: learning rate, batch size, epochs, pre-processing, layer count, neurons for each layer, 
activation function, initialisation, normalisation, layer type, skip connections, regularisation, ...
Moreover, it is often not possible to theoretically justify a particular choice for a hyperparameter.
E.g. there is no way to tell whether $N$ or $N + 1$ neurons in a layer would be better, without trying it out.
Therefore, hyperparameter search for neural networks is an especially tricky problem to solve.

###### Manual Search

The most straightforward approach to finding good hyperparameters is to just 
try out *reasonable* combinations of hyperparameters and pick the best model (using e.g. the validation set).
The first problem with this approach is that it requires a gut feeling as to what *reasonable* combinations are.
Moreover, it is often unclear how different hyperparameters interact with each other,
which can make an irrelevant hyperparameter look more important than it actually is or vice versa.
Finally, manual hyperparameter search is time consuming, since it is generally not possible to automate.

###### Grid Search

Getting a feeling for combinations of hyperparameters is often much harder than for individual hyperparameters.
The idea of grid search is to get a set of *reasonable* values for each hyperparameter individually
and organise these sets in a grid that represents all possible combinations of these values.
Each combinations of hyperparameters in the grid can then be run simultaneously,
assuming that so much hardware is available, which can speed up the search significantly.

###### Random Search

Since there are plenty of hyperparameters and each hyperparameters can have multiple *reasonable* values,
it is often not feasible to try out every possible combination in the grid.
On top of that, most of the models will be thrown away anyway because only the best model is of interest,
even though they might achieve similar performance.
The idea of random search is to randomly sample configurations, rather than choosing from pre-defined choices.
This can be interpreted as setting up an infinite grid and trying only a few --- rather than all --- possibilities.
Under the assumption that there are a lot of configurations with similarly good performance as the best model,
this should provide a model that performs very good with high probability for a fraction of the compute.

###### Bayesian Optimisation 

Rather than picking configurations completely at random, 
it is also possible to guide the random search.
This is essentially the premise of Bayesian optimisation:
sample inputs and evaluate the objective to find which parameters are likely to give good performance.

Bayesian optimisation uses a function approximator for the objective 
and what is known as an *acquisition* function.
The function approximator, or *surrogate*, 
has to be able to model a distribution over function values, e.g. a Gaussian Process.
The acquisition function then uses these distributions
to find where the largest improvements can be made, e.g. using the cdf.
For a more elaborate explanation of Bayesian optimisation, 
see e.g. [this tutorial](https://arxiv.org/abs/1807.02811)

This approach is less parallellisable than grid or random search,
since it uses the information from previous runs to find good sampling regions.
However, often there are more configurations to be tried out than there are computing devices
and it is still possible to sample multiple configurations at each step with Bayesian Optimisation.
Also consider [this paper](https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms) in this regard.

###### Neural Architecture Search

Instead of using Bayesian optimisation, 
the problem of hyperparameter search can also be tackled by other optimisation algorithms.
This approach is also known as *Neural Architecture Search* (NAS).
There are different optimisation strategies that can be used for NAS,
but the most common are evolutionary algorithms and (deep) reinforcement learning.
Consider reading [this survey](http://jmlr.org/papers/v20/18-598.html) 
to get an overview of how NAS can be used to construct neural networks.

## Efficient CNNs

In recent times CNNs have become more computationally efficient. Traditional convolutional layers apply filters across the entire depth of the input volume, mixing all the input channels to produce a single output channel. Depthwise separable convolutions, introduced as a key innovation in architectures like Xception, are a more efficient variant of the standard convolution operation. This process is divided into two layers: the depthwise convolution and the pointwise convolution. In the depthwise convolution, a single filter is applied per input channel, which significantly reduces the computational cost. Following this, a 1x1 convolution (pointwise convolution) is applied to combine the outputs of the depthwise layer, creating a new set of feature maps. This approach drastically reduces the number of parameters and computations, making the network more efficient and faster, which is especially beneficial for mobile and embedded devices.

<img src="https://www.researchgate.net/publication/358585116/figure/fig1/AS:1127546112487425@1645839350616/Depthwise-separable-convolutions.png" />

Squeeze-and-Excitation layers introduce an additional level of adaptivity in CNNs, enabling the network to perform dynamic channel-wise feature recalibration. Squeeze-and-Exitation blocks are usually executed after a convolutional layer or block
and before the residual connection by a series of relatively inexpensive computations

1. A three dimensional input consisting of different channels and the two spati l
dimensions is compressed into one dimension by global aver ge pooling. As a res lt
the spatial information is squeezed into one descriptor per channel.
2. The squeezed data is transformed by a two layer feed-forward neural network.  fter
the first linear layer ReLU is used as activation functi n and after the se ond a
sigmoid function is applied. This normalizes the output between 0 and 1 and can be
interpreted as the significance per channel.
3. The result is used to scale the input of the Squeeze-and-Exitation block by an element-
wise multiplication.

<img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*bmObF5Tibc58iE9iOu327w.png" />



### Exercise 5: Create an efficient CNN (4 points)

Today, neural networks frequently have millions or billions of parameters. However, CNNs have become more computationally efficient over the years. How far can you get with a limited amount of compute?

> Create an efficient CNN with less than 30.000 parameters.
> Use at least one depthwise separable or groupwise convolution or apply at least one squeeze-and-exitation layer after a convolution.

Hint: Skip-connections and Normalization layers are frequently used to stabilize the training behavoir of deep CNNs.

In [None]:
class EfficientCNN(nn.Module):
    def __init__(self, in_channels, num_classes):
        # TODO: implement __init__
        raise NotImplementedError()
    def forward():
        # TODO: implement forward
        raise NotImplementedError()

# YOUR CODE HERE
raise NotImplementedError()

        

In [None]:
# sanity-check
model = EfficientCNN(in_channels=3, num_classes=10)
model(torch.zeros((1, 3, 32, 32)))
print("number of parameters: ", sum([p.numel() for p in model.parameters()]))

In [None]:
# Test Cell: do not edit or delete!

In [None]:
# Test Cell: do not edit or delete!

### Exercise 6: Training (4 points)

In order to get a feeling for hyperparameter search, you have to try it out on some example. You can use the monitoring tools from previous exercises to log performance and get a feeling for which hyperparameters work well. 

> Train your EfficientCNN on CIFAR10 using the Trainer class. Use hyperparameter search for the learning rate, optimizer and maybe even the model architecture to get a CrossEntropyLoss < 1.5 within 10 epochs of training. 

In [None]:
# TODO: Cell for Hyperparameter search, you can freely edit or delete this code
train_dataset = torchvision.datasets.CIFAR10(data_root, train=True, transform=transforms.ToTensor(), download=True)
test_dataset = torchvision.datasets.CIFAR10(data_root, train=False, transform=transforms.ToTensor(), download=True)
train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=True, num_workers=2)
model = EfficientCNN(in_channels=3, num_classes=10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()
trainer = Trainer(model, 
                  criterion, 
                  optimizer)
trainer.train(train_loader, test_loader, num_epochs=10)

In [None]:
# Test Cell: do not edit or delete!