[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lorenzobasile/DeepLearning2022/blob/main/5_autoencoder.ipynb)

# Lab 5

In [None]:
from google.colab import drive
import matplotlib.pyplot as plt
import torch
import torchvision


drive.mount('/content/drive')
%cd drive/MyDrive/DeepLearning2022

# Autoencoders

Autoencoders are neural models that aim at **reconstructing** their input. An autoencoder is made of an encoder, which maps the input into a latent (or hidden) representation, and a decoder, which maps this representation back into the input space.

This is not a requirement, but in most cases the hidden space is much lower-dimensional than the input (and output) space, meaning that the autoencoder performs **dimensionality reduction**: it learns a low-dimensional code that captures a substantial fraction of the variability of the input data. If trained with MSE loss and without nonlinearities, an autoencoder with a $d$-dimensional code can be shown to learn the same subspace as Principal Component Analysis (PCA) with $d$ components.
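
The PCA connection can be stated a bit more precisely. The notation below ($x_i$, $N$, $n$, $W_e$, $W_d$) is introduced here only for illustration and does not appear in the lab code; it assumes centered data and purely linear encoder and decoder without biases:

$$
\min_{W_e \in \mathbb{R}^{d \times n},\; W_d \in \mathbb{R}^{n \times d}} \; \sum_{i=1}^{N} \left\lVert x_i - W_d W_e x_i \right\rVert_2^2
$$

Provided the $d$-th and $(d+1)$-th eigenvalues of the data covariance differ, at a global minimum the product $W_d W_e$ is the orthogonal projection onto the subspace spanned by the first $d$ principal directions, so the reconstructions coincide with those of PCA. The individual code coordinates need not match the principal components, because any invertible linear map can be absorbed between encoder and decoder.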

![](https://miro.medium.com/max/1400/1*44eDEuZBEsmG_TCAKRI3Kw@2x.png)

## Handling RGB images and hardware acceleration

We will see an application of an autoencoder to color images. The dataset we will use is CIFAR10 (available in `torchvision`). CIFAR10 is one of the most famous benchmarks for neural networks, and in a sense it is the color equivalent of MNIST, as it contains 10 classes of 32x32 RGB images.

In [None]:
batch_size = 256

trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=torchvision.transforms.ToTensor())
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=torchvision.transforms.ToTensor())

trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=True)

We define a function to show some images and their reconstructions. Please pay attention to the transposition step. Our batch of data has shape [batch, channels, height, width], but matplotlib expects the channels last, so we need to convert it to [batch, height, width, channels]. We can do that with the NumPy `transpose` method, passing the new axis order `(0,2,3,1)`.

In [None]:
def display_images(input, output):
    # Show the first 4 input images (if given) and the first 4 reconstructions.
    if input is not None:
        # [batch, channels, height, width] -> [batch, height, width, channels]
        input_pics = input.detach().cpu().numpy().transpose((0,2,3,1))
        plt.figure(figsize=(18, 4))
        for i in range(4):
            plt.subplot(1,4,i+1)
            plt.imshow(input_pics[i])
    plt.figure(figsize=(18, 4))
    output_pics = output.detach().cpu().numpy().transpose((0,2,3,1))
    for i in range(4):
        plt.subplot(1,4,i+1)
        plt.imshow(output_pics[i])
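
The transposition step can be checked in isolation. The snippet below is a minimal sketch on a random tensor (the `batch` variable is a hypothetical stand-in for a CIFAR10 batch); it also shows `permute`, which performs the same reordering directly on a tensor.

```python
import torch

# Hypothetical batch of 4 RGB images in PyTorch layout: [batch, channels, height, width].
batch = torch.rand(4, 3, 32, 32)

# NumPy route, as in display_images: reorder the axes to [batch, height, width, channels].
pics = batch.numpy().transpose((0, 2, 3, 1))
print(pics.shape)

# The same reordering done directly on the tensor.
pics_torch = batch.permute(0, 2, 3, 1)
print(tuple(pics_torch.shape))

# The value of channel 2 at (row 5, column 7) of image 0 is now found at [0, 5, 7, 2].
assert pics[0, 5, 7, 2] == batch[0, 2, 5, 7].item()
```

Only the order of the axes changes: no pixel value is modified.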

We are using a GPU hardware accelerator, so we need to inform torch of this resource by defining a device variable. We will later load our model and our data on the GPU to exploit its capabilities. The following line guarantees that if the GPU is not available the code below can still work without any issue on a CPU.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## A convolutional autoencoder architecture

Since we are dealing with image data, it makes sense to exploit the implicit bias introduced by convolutions, making our Autoencoder a Convolutional Autoencoder.

For the encoder part, this simply translates into having a CNN (similar to a classifier, such as the one we saw during the last lecture) which, given an image, returns a hidden representation of size $d$.

Then, this representation is fed into the decoder network, which employs transpose convolutional layers.

### Transpose Convolutions

Transpose Convolutions follow a working principle that is very similar to "normal" Convolutions, but they are used to enlarge an image instead of reducing its size.

The parameters to construct a Transpose Conv layer are exactly the same:
```
ConvTranspose2d(in_channels: int, out_channels: int, kernel_size: Union[int, Tuple[int, int]], stride: Union[int, Tuple[int, int]] = 1, padding: Union[int, Tuple[int, int]] = 0)
```
However, note that the meaning of some of these parameters may be counterintuitive. The whole process works like this:

- We start from an image;
- We space the pixels apart so that the step between two pixels is equal to the `stride` (that is, we insert `stride - 1` zeros between neighbouring pixels);
- We pad the image with as many rows and columns of 0s as we can, while making sure that the kernel always contains at least one "meaningful" pixel (that is, `kernel_size - 1` rows and columns on each side);
- We remove external rows and columns according to the `padding` parameter. Pay attention: here this parameter is used to **remove** lines, not to add them;
- Finally, we move our filter through the image using a step size of 1.

[Example with `stride=2`, `kernel_size=3`, `padding=0`.](https://github.com/vdumoulin/conv_arithmetic/blob/master/gif/no_padding_strides_transposed.gif)

[Example with `stride=2`, `kernel_size=3`, `padding=1`.](https://github.com/vdumoulin/conv_arithmetic/blob/master/gif/padding_strides_transposed.gif)
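
The steps listed above can be reproduced by hand and compared with PyTorch. The following is a minimal single-channel sketch, not the way PyTorch implements the layer: the helper `transpose_conv_by_hand` is hypothetical, it assumes `padding <= kernel_size - 1`, and it flips the kernel because `conv2d` computes a cross-correlation.

```python
import torch
import torch.nn.functional as F

def transpose_conv_by_hand(x, weight, stride, padding):
    """Single-channel transpose convolution, x: [1, 1, H, W], weight: [1, 1, k, k]."""
    k = weight.shape[-1]
    h, w = x.shape[-2:]
    # 1. Space the pixels apart: (stride - 1) zeros between neighbouring pixels.
    spaced = torch.zeros(1, 1, (h - 1) * stride + 1, (w - 1) * stride + 1)
    spaced[..., ::stride, ::stride] = x
    # 2. Pad with k - 1 zeros on each side.
    padded = F.pad(spaced, (k - 1, k - 1, k - 1, k - 1))
    # 3. Remove `padding` rows and columns from each side.
    if padding > 0:
        padded = padded[..., padding:-padding, padding:-padding]
    # 4. Slide the (flipped) kernel over the image with a step size of 1.
    return F.conv2d(padded, torch.flip(weight, dims=(-2, -1)))

torch.manual_seed(0)
weight = torch.rand(1, 1, 3, 3)
results = []
for size, padding in [(2, 0), (3, 1)]:
    x = torch.rand(1, 1, size, size)
    by_hand = transpose_conv_by_hand(x, weight, stride=2, padding=padding)
    reference = F.conv_transpose2d(x, weight, stride=2, padding=padding)
    results.append((tuple(by_hand.shape), torch.allclose(by_hand, reference, atol=1e-6)))
print(results)
```

In both cases the output side follows the formula `(size - 1) * stride - 2 * padding + kernel_size`, which gives 5 for a 2x2 input with `padding=0` and for a 3x3 input with `padding=1`.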

In [None]:
d = 50

class AE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        
        self.encoder = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5, stride=1, padding=0),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2),
            torch.nn.Dropout(p=0.2),
            torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1),
            torch.nn.ReLU(),
            torch.nn.AvgPool2d(kernel_size=2),
            torch.nn.Flatten(),
            torch.nn.Linear(128*7*7, d)
        )
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(d, 128*8*8),
            torch.nn.ReLU(),
            torch.nn.Unflatten(1, torch.Size([128, 8, 8])),
            torch.nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=2, stride=2),
            torch.nn.ReLU(),
            torch.nn.ConvTranspose2d(in_channels=64, out_channels=3, kernel_size=2, stride=2),
            torch.nn.Sigmoid()
            
        )    

    def forward(self, x):
        out = self.encoder(x)
        out = self.decoder(out)
        return out

model = AE().to(device)
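
The sizes hard-coded in the two `Linear` layers follow from the usual output-size formulas. As a quick check, the sketch below traces the side length of a 32x32 image through the layers of `AE`; the helpers `conv_out` and `conv_transpose_out` are hypothetical names introduced here, not `torch` functions.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    # Output side length of a convolution or pooling layer.
    return (size + 2 * padding - kernel_size) // stride + 1

def conv_transpose_out(size, kernel_size, stride=1, padding=0):
    # Output side length of a transpose convolution.
    return (size - 1) * stride - 2 * padding + kernel_size

# Encoder: pooling layers use stride = kernel_size by default.
size = 32
size = conv_out(size, kernel_size=5)             # Conv2d, kernel 5, no padding -> 28
size = conv_out(size, kernel_size=2, stride=2)   # MaxPool2d(2)                 -> 14
size = conv_out(size, kernel_size=3, padding=1)  # Conv2d, kernel 3, padding 1  -> 14
size = conv_out(size, kernel_size=2, stride=2)   # AvgPool2d(2)                 -> 7
encoder_side = size
print("Encoder feature map:", 128, "x", encoder_side, "x", encoder_side)

# Decoder: starts from the [128, 8, 8] tensor produced by Unflatten.
size = 8
size = conv_transpose_out(size, kernel_size=2, stride=2)  # -> 16
size = conv_transpose_out(size, kernel_size=2, stride=2)  # -> 32
decoder_side = size
print("Decoder output side:", decoder_side)
```

This is why the encoder's linear layer takes `128*7*7` input features, and why two transpose convolutions with `kernel_size=2` and `stride=2` bring an 8x8 map back to 32x32.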

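Training an Autoencoder amounts to regressing the output onto the input, so we use MSE loss. Note that with `reduction='sum'` the squared errors are summed over all the pixels of the batch instead of being averaged.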

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = torch.nn.MSELoss(reduction='sum')
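
To see what `reduction='sum'` changes in practice, here is a minimal sketch on random tensors (`x` and `x_hat` are hypothetical stand-ins for a batch and its reconstruction). Dividing the summed loss by the number of images gives a per-image loss, which equals the default mean-reduced loss multiplied by the number of values in one image.

```python
import torch

torch.manual_seed(0)
x = torch.rand(8, 3, 32, 32)      # hypothetical batch of 8 images
x_hat = torch.rand(8, 3, 32, 32)  # hypothetical reconstructions

sum_loss = torch.nn.MSELoss(reduction='sum')(x_hat, x)
mean_loss = torch.nn.MSELoss(reduction='mean')(x_hat, x)

# Summed squared error per image.
per_image = sum_loss / x.shape[0]
# Number of values in one image: channels * height * width.
pixels_per_image = x[0].numel()

print(per_image.item(), (mean_loss * pixels_per_image).item())
```

The two printed values agree up to floating point error, so the two reductions differ only by a constant factor, which rescales the gradients and therefore interacts with the learning rate.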

We run only 5 epochs just to show that everything works. To obtain better performance, you can try training for longer.

In [None]:
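epochs = 5
for epoch in range(epochs):
    model.train()
    train_loss = 0
    # The labels are ignored: the reconstruction target is the input itself.
    for x, _ in trainloader:
        x = x.to(device)
        x_hat = model(x)
        l = loss(x_hat, x)
        train_loss += l.item()
        optimizer.zero_grad()
        l.backward()
        optimizer.step()

    # The loss is summed over the batch, so we divide by the number of images.
    print(f"Epoch {epoch+1} Training loss: {train_loss / len(trainloader.dataset)}")


# Evaluate on the test set without tracking gradients.
with torch.no_grad():
    model.eval()
    test_loss = 0
    for x, _ in testloader:
        x = x.to(device)
        x_hat = model(x)
        l = loss(x_hat, x)
        test_loss += l.item()
    test_loss /= len(testloader.dataset)
    print(f"Test set loss: {test_loss}")
    # Show the first 4 images of the last test batch and their reconstructions.
    display_images(x, x_hat)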

# Second assignment (deadline 11 December)



- 1. Compute the Intrinsic Dimension of the KMNIST dataset using [TwoNN](https://www.nature.com/articles/s41598-017-11873-y). There is no need to implement TwoNN from scratch: you can use the implementation in the [`dadapy` package](https://dadapy.readthedocs.io/en/latest/jupyter_example_3.html), which you can install via `pip install dadapy`.
- 2. Build a Convolutional Autoencoder to reduce the dimensionality of KMNIST to a value $d$ that is compatible with the ID estimate you found in step 1. Once you have trained the model (until you notice some stability in the loss, likely not before 20 epochs of training), plot some examples of reconstructed images to qualitatively assess how well the model is performing. For this step, you can use the following setup (or modify it if you prefer):


```
Encoder:
    Convolutional layer with 32 filters, kernel size=3, stride=1, padding=0
    Max pooling layer with kernel size=2
    Convolutional layer with 64 filters, kernel size=3, stride=2, padding=1
    Max pooling layer with kernel size=2, stride=1
    Flattening layer
    Linear layer with input size=6x6x64, output size=d
    ReLU activations wherever needed

Decoder:
    Linear layer with input size=d, output size=64x8x8
    Unflattening layer to obtain [64x8x8] data
    Transpose Convolution with 32 filters, kernel size=2, stride=2, padding=1
    Transpose Convolution with 1 filter, kernel size=2, stride=2, padding=0
    ReLU activations wherever needed
```

- 3. If the encoder is performing sufficiently well, you should be able to perform almost any task of your choice on the original data by only looking at the internal hidden representations that it extracts. To see if this is the case, build a new model made of the encoder you have already trained (frozen) and a classification head (a trainable fully connected layer with `in_features=d` and `out_features=10`). Train this new model for classification and assess the final accuracy. Don't expect perfect accuracy: anything substantially higher than random classification is fine.
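
For step 3, one possible way to freeze a module in PyTorch is to disable gradients on its parameters and pass only the head's parameters to the optimizer. The sketch below shows just this mechanism: the `encoder` here is a hypothetical untrained stand-in (a single linear layer), the value of `d` is arbitrary, and the data are random.

```python
import torch

torch.manual_seed(0)
d = 16  # hypothetical latent size, not a suggested value

# Stand-in for an already trained encoder (assumption: yours would be convolutional).
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, d))

# Freeze the encoder: no gradients are computed for its parameters.
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()
frozen_weight = encoder[1].weight.clone()

# Trainable classification head on top of the frozen representation.
head = torch.nn.Linear(in_features=d, out_features=10)
classifier = torch.nn.Sequential(encoder, head)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# One training step on a random batch of 8 fake 28x28 grayscale images.
x = torch.rand(8, 1, 28, 28)
y = torch.randint(0, 10, (8,))
logits = classifier(x)
loss_value = torch.nn.CrossEntropyLoss()(logits, y)
optimizer.zero_grad()
loss_value.backward()
optimizer.step()

print(logits.shape, torch.equal(frozen_weight, encoder[1].weight))
```

Keep in mind that calling `classifier.train()` would also switch the encoder back to training mode, which matters if it contains layers such as dropout.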

You can send your work as a jupyter notebook in any format you prefer (`ipynb`, `pdf` or `html`) to lore.basile@outlook.com by 23.59, 11/12/2022. Please name the file as `NameSurname.<format>`.

As always, please do not hesitate to reach out if there are doubts or difficulties.

## Note after lab (**VERY** optional part of the assignment)

As pointed out by prof. Ansuini during the lab, it may be interesting to visualize the representations that your Autoencoder learns while training. Since your hidden dimension $d$ will likely be much higher than 2 or 3, it would be impossible to simply plot the representations and see where they end up being.

To visualize them, you could use another dimensionality reduction technique on top of your Encoder, such as t-SNE, to further reduce your $d$ features to 2 or 3, just like what is shown for example in this nice [github repo](https://github.com/ncampost/vis-autoencoder-tsne) and in this great [lecture](https://atcold.github.io/NYU-DLSP21/en/week09/09-3/) by Alfredo Canziani, which is also a good reference to study Autoencoders in general.

What is shown in this repo (and hopefully this may be visible also in this assignment, on KMNIST) is that after the Autoencoder is trained, it learns to somehow map the images belonging to the same class in a cluster, meaning that it could really capture features that are meaningful to a certain class. In a sense, such a finding would also be reassuring when moving to the classification task (clustered representations are much more likely to be easily separable).
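
As a rough idea of how such a visualization could be set up, the sketch below runs scikit-learn's `TSNE` on synthetic codes. Everything about the data is an assumption made for illustration: `codes` stands in for the encoder outputs, with 3 artificial classes and $d = 50$, and the perplexity value is just a common default.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for encoder outputs: 300 codes of size 50 from 3 fake classes.
labels = np.repeat(np.arange(3), 100)
centers = rng.normal(scale=5.0, size=(3, 50))
codes = centers[labels] + rng.normal(size=(300, 50))

# Reduce the 50-dimensional codes to 2 dimensions for plotting.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(codes)
print(embedding.shape)
```

The resulting `embedding` can then be drawn with `plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)` to color each point by its class.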

**Note**: in the lecture by Alfredo Canziani the t-SNE plot is in the section on Variational Autoencoders, a topic that we will cover in the next lectures, but in principle this kind of clustering behaviour may appear also in normal Autoencoders like the ones we are working on.

If you want, you are welcome to try this kind of approach in your work. However, please note that this step is not a requirement.