# Generative Modelling with VAEs and GANs

From a practical perspective, neural networks have primarily been used for supervised learning, with applications ranging from activity recognition in videos to language translation. However, labelled data is scarce or nonexistent in many situations, so we'd like to use unsupervised learning to make sense of the vast amount of unlabelled data in the world. Here we'll look at two common types of deep generative model - variational autoencoders (VAEs) and generative adversarial networks (GANs) - that can learn to generate new samples resembling those in a dataset.

## Data

We'll use the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset, which comprises 70,000 28x28 grayscale images of handwritten digits. Apart from the underlying classes (0-9), there are several factors of variation, such as line thickness and slant:

![MNIST samples](https://qph.fs.quoracdn.net/main-qimg-d01751bdf7dab3d9a5949f226a35b7ba)

This dataset is relatively easy to model, so we'll train on just the 60,000 samples in the training set.

In [1]:
import os
import time
from matplotlib import pyplot as plt
import torch
from torch import nn, optim
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from IPython.display import clear_output, display
%matplotlib inline

In [3]:
batch_size = 64

data_path = os.path.join(os.path.expanduser('~'), '.torch', 'datasets', 'mnist')
train_data = datasets.MNIST(data_path, train=True, download=True, transform=transforms.ToTensor())
test_data = datasets.MNIST(data_path, train=False, transform=transforms.ToTensor())

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, num_workers=4)
test_loader = DataLoader(test_data, batch_size=batch_size, num_workers=4)

## VAE

### Model

The VAE (somewhat misnamed) consists of an autoencoder with a stochastic bottleneck layer, trained via variational inference. The encoder and decoder will make use of convolutions and transposed (also known as fractionally-strided) convolutions, respectively. The latent encoding in the middle of the autoencoder - the variational posterior - will be a simple diagonal covariance Gaussian. In addition, we'll make use of the "reparameterisation trick", where the stochastic latents are decomposed into a combination of deterministic variables and another source of noise - so there is no longer a need to devise a gradient estimator for backpropagating through a sampling process.
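
As a minimal sketch of the reparameterisation trick described above (the function name `reparameterise` and the toy shapes are illustrative assumptions, not part of the notebook), a sample from a diagonal Gaussian can be written as a deterministic function of the mean, the log-variance and independent unit Gaussian noise. Gradients then reach both parameters through ordinary backpropagation:

```python
import torch


def reparameterise(mu, log_var):
    """Draw z ~ N(mu, diag(exp(log_var))) as mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * log_var)   # standard deviation from the log-variance
    eps = torch.randn_like(std)      # the only source of randomness; no gradient needed
    return mu + eps * std


# Toy posterior parameters for a batch of 4 inputs with a 10-dimensional latent code
mu = torch.zeros(4, 10, requires_grad=True)
log_var = torch.zeros(4, 10, requires_grad=True)

z = reparameterise(mu, log_var)
z.sum().backward()                   # gradients flow to mu and log_var through the sample
```

Because `eps` is sampled independently of the parameters, the derivative of `z` with respect to `mu` is exactly one, and the derivative with respect to `log_var` scales with the noise that was drawn.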

In [17]:
class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 1x28x28 -> 32x7x7 via two stride-2 convolutions
        self.conv1 = nn.Conv2d(1, 8, 5, padding=2)
        self.conv2 = nn.Conv2d(8, 16, 3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(16, 16, 3, padding=1)
        self.conv4 = nn.Conv2d(16, 32, 3, stride=2, padding=1)
        # Mean and log-variance of the diagonal Gaussian posterior
        self.fc_mu = nn.Linear(32 * 7 * 7, 10)
        self.fc_log_var = nn.Linear(32 * 7 * 7, 10)
        # Decoder: 10-d latent code -> 32x7x7 -> 1x28x28
        self.fc_dec = nn.Linear(10, 32 * 7 * 7)
        self.conv5 = nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1)
        self.conv6 = nn.Conv2d(16, 16, 3, padding=1)
        self.conv7 = nn.ConvTranspose2d(16, 8, 3, stride=2, padding=1, output_padding=1)
        self.conv8 = nn.Conv2d(8, 1, 5, padding=2)

    def encode(self, x):
        x = F.leaky_relu(self.conv1(x))
        x = F.leaky_relu(self.conv2(x))
        x = F.leaky_relu(self.conv3(x))
        x = F.leaky_relu(self.conv4(x))
        x = x.view(-1, 32 * 7 * 7)
        return self.fc_mu(x), self.fc_log_var(x)

    def reparameterise(self, mu, log_var):
        # z = mu + sigma * eps with eps ~ N(0, I), so gradients reach mu and log_var
        return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    def decode(self, z):
        x = self.fc_dec(z)
        x = x.view(-1, 32, 7, 7)
        x = F.leaky_relu(self.conv5(x))
        x = F.leaky_relu(self.conv6(x))
        x = F.leaky_relu(self.conv7(x))
        return torch.sigmoid(self.conv8(x))

    def forward(self, x):
        mu, log_var = self.encode(x)
        z = self.reparameterise(mu, log_var)
        return self.decode(z), mu, log_var

### Training, Sampling and Interpolating

To train the model we will maximise the evidence lower bound (ELBO), also known as the variational lower bound. Equivalently, we minimise its negative, which is the sum of the reconstruction error of the input sample (as with a normal autoencoder) and the Kullback-Leibler (KL) divergence between the variational posterior and the prior - which we set to a unit Gaussian. Because both our prior and posterior are Gaussian, the KL term has an analytical form that we can minimise directly.

Once the model is trained we can sample a latent code from our prior and pass it through the decoder alone to form a sample of the data. This assumes that the latent codes produced by the encoder are distributed like the prior; the match is not exact, but it is a reasonable assumption for low-dimensional latent codes. Other than picking random samples, we can also pick a dimension of the latent code to interpolate along (keeping all other dimensions fixed), which reveals what that dimension encodes.

In [18]:
model = VAE()
optimiser = optim.Adam(model.parameters(), lr=1e-3)
epochs = 3

def train():
    model.train()
    for i, (x, _) in enumerate(train_loader):
        optimiser.zero_grad()
        x_hat, mu, log_var = model(x)
        # Reconstruction term: Bernoulli negative log-likelihood, summed over pixels
        recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
        # Analytical KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        # Negative ELBO, averaged over the batch
        loss = (recon + kl) / x.size(0)
        loss.backward()
        optimiser.step()


def sample():
    model.eval()
    with torch.no_grad():
        # Draw latent codes from the unit Gaussian prior and run the decoder only
        z = torch.randn(64, 10)
        x_hat = model.decode(z)
    return x_hat

for _ in range(epochs):
    train()
    samples = sample()
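
The single-dimension interpolation described above can be sketched as follows. This is an illustrative sketch rather than the notebook's own code: `interpolate_dimension` and `toy_decoder` are hypothetical names, the sweep range of -3 to 3 is an assumption, and the stand-in decoder is untrained, so its outputs are meaningless. With a trained VAE, you would pass in whatever function maps a batch of 10-dimensional latent codes to images.

```python
import torch
from torch import nn


def interpolate_dimension(decode, dim, latent_size=10, steps=8, low=-3.0, high=3.0):
    """Decode a row of latent codes that differ only in coordinate `dim`."""
    base = torch.randn(1, latent_size)             # one code drawn from the unit Gaussian prior
    z = base.repeat(steps, 1)                      # copy it `steps` times
    z[:, dim] = torch.linspace(low, high, steps)   # sweep the chosen dimension, keep the rest fixed
    with torch.no_grad():
        return z, decode(z)


# Hypothetical stand-in decoder so the sketch runs on its own (untrained)
toy_decoder = nn.Sequential(nn.Linear(10, 28 * 28), nn.Sigmoid())

z, images = interpolate_dimension(toy_decoder, dim=3)
images = images.view(-1, 1, 28, 28)                # one 28x28 image per interpolation step
```

Plotting the resulting images side by side shows how the output changes as the chosen latent dimension moves from `low` to `high` while every other dimension stays fixed.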