# GauGAN

You should already be familiar with:
 - Pix2PixHD, from [High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs](https://arxiv.org/abs/1711.11585) (Wang et al. 2018)
 - Synchronized batch norm. See PyTorch's [SyncBatchNorm](https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html) documentation.
 - Kullback-Leibler divergence

### Goals

In this notebook, you will learn about GauGAN, which synthesizes high-resolution images from semantic label maps, and you will implement and train it yourself. GauGAN is built around a special denormalization technique proposed in [Semantic Image Synthesis with Spatially-Adaptive Normalization](https://arxiv.org/abs/1903.07291) (Park et al. 2019).

### Background
GauGAN builds on Pix2PixHD but simplifies the overall network by adding spatially adaptive denormalization layers. Because it learns its denormalization parameters by convolving the segmentation map, it is also better suited to multi-modal synthesis, since all the generator needs as input is a random noise vector. Later in the notebook, you will see how the authors further control diversity with the noise vector.

### Synchronized BatchNorm

You have already heard of batch norm, a normalization technique that tries to normalize the statistics of activations to those of a standard Gaussian distribution (zero mean, unit variance).

Batch norm, however, performs poorly with small batch sizes. This becomes a problem when training large models that can only fit small batches on each GPU. Training on multiple GPUs increases the effective batch size, but vanilla batch norm computes and updates its statistics independently on each GPU. In other words, if you train on 2 GPUs with `nn.BatchNorm2d`, the two batch norm modules will keep different running averages of their statistics, and batch norm gains no stability from the larger effective batch size.

Synchronized batch norm ([nn.SyncBatchNorm](https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html)) does exactly what its name suggests: it synchronizes batch statistics across multiple processes, so that each normalization step and each running average update uses the statistics of all your minibatches.

The authors report slightly better scores with synchronized batch norm than with regular (unsynchronized) batch norm. Since you will likely be running this on one machine, this notebook will stick to regular `nn.BatchNorm2d` modules.
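If you later move to multi-GPU training, the switch can be sketched as follows. This is a minimal illustration, not part of the notebook's training code: the small `model` is a hypothetical stand-in, and the conversion uses PyTorch's `nn.SyncBatchNorm.convert_sync_batchnorm` helper.

```python
import torch
from torch import nn

# Hypothetical toy model, used only to illustrate the conversion.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm*d module in the model with nn.SyncBatchNorm.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# The batch norm layer has been swapped; the other layers are untouched.
print(type(sync_model[1]).__name__)
```

Note that only the conversion is shown here. Running the converted model in training mode assumes a distributed setup (an initialized process group, typically with `DistributedDataParallel`), which this notebook does not use.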

### Spatially Adaptive Denormalization (SPADE)

Recall that normalization layers are formulated as

\begin{align*}
    y &= \dfrac{x - \hat{\mu}}{\hat{\sigma}} * \gamma + \beta
\end{align*}

where $\hat{\mu}$ and $\hat{\sigma}$ are estimates of the mean and standard deviation used to normalize the input activation $x$. In batch norm, these are the statistics of the current minibatch during training and exponential moving averages of the minibatch statistics at inference. The parameters $\gamma$ and $\beta$ apply "denormalization," essentially allowing the model to invert the normalization if necessary.

In GauGAN, batch norm is the preferred normalization scheme. Recall that batch norm can be formulated for each input neuron as

\begin{align*}
    y_{c,h,w} &= \dfrac{x_{c,h,w} - \hat{\mu}_c}{\hat{\sigma}_c} * \gamma_c + \beta_c
\end{align*}

where $\hat{\mu}_c$ and $\hat{\sigma}_c$ are per-channel statistics computed across the batch and spatial dimensions. Similarly, $\gamma_c$ and $\beta_c$ are per-channel denormalization parameters.
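To make the per-channel formulation concrete, here is a small sketch that computes the statistics by hand and compares the result with `nn.BatchNorm2d` in training mode. The tensor sizes are arbitrary, and the `eps` term is an implementation detail for numerical stability that the formula above omits.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(4, 3, 5, 5)  # (batch, channels, height, width)

# Per-channel statistics, computed across the batch and spatial dimensions.
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

eps = 1e-5                       # matches the nn.BatchNorm2d default
gamma = torch.ones(1, 3, 1, 1)   # per-channel scale, at its initial value
beta = torch.zeros(1, 3, 1, 1)   # per-channel shift, at its initial value

# One (gamma_c, beta_c) pair is broadcast over every spatial position.
y_manual = (x - mu) / torch.sqrt(var + eps) * gamma + beta

bn = nn.BatchNorm2d(3)  # a freshly built module is in training mode
y_bn = bn(x)

matches = torch.allclose(y_manual, y_bn, atol=1e-5)
print(matches)
```

The shapes `(1, 3, 1, 1)` of `gamma` and `beta` are what make these parameters spatially invariant.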

With vanilla batch norm, these denormalization parameters are spatially invariant: the same values are applied to every position in the input activation. As you may imagine, this could be limiting for the model. It often helps the model to learn denormalization parameters for each position.

The authors address this with **SPatially Adaptive DEnormalization (SPADE)**. They compute denormalization parameters $\gamma$ and $\beta$ by convolving the input segmentation masks and apply these elementwise. SPADE can therefore be formulated as

\begin{align*}
    y_{c,h,w} &= \dfrac{x_{c,h,w} - \hat{\mu}_c}{\hat{\sigma}_c} * \gamma_{c,h,w} + \beta_{c,h,w}
\end{align*}
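Before building the full module, the shapes involved can be sketched in a few lines. This is a simplified illustration of the formula above, not the authors' layer: the sizes, the single convolutions `conv_gamma` and `conv_beta`, and the nearest-neighbor resizing of the segmentation map are all assumptions made for the example.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, channels = 5, 8  # hypothetical sizes

x = torch.randn(2, channels, 16, 16)  # input activation

# Hypothetical one-hot segmentation map at a higher resolution than x.
labels = torch.randint(0, n_classes, (2, 64, 64))
seg = F.one_hot(labels, n_classes).permute(0, 3, 1, 2).float()

# Normalization without learned per-channel gamma and beta.
norm = nn.BatchNorm2d(channels, affine=False)

# Convolutions that map the segmentation map to gamma and beta.
conv_gamma = nn.Conv2d(n_classes, channels, kernel_size=3, padding=1)
conv_beta = nn.Conv2d(n_classes, channels, kernel_size=3, padding=1)

# Resize the segmentation map to the spatial size of the activation.
seg_resized = F.interpolate(seg, size=x.shape[-2:], mode='nearest')

gamma = conv_gamma(seg_resized)  # shape (N, C, H, W): one value per position
beta = conv_beta(seg_resized)    # shape (N, C, H, W)

# Elementwise denormalization, following the SPADE formula.
y = norm(x) * gamma + beta
print(gamma.shape, y.shape)
```

Compared with vanilla batch norm, the only change is that `gamma` and `beta` now have the same shape as the activation and depend on the segmentation map.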

Now let's implement SPADE!

Note: the authors use spectral norm in all convolutional layers in the generator and discriminator, but the official code omits spectral norm for SPADE layers.