# StyleGAN

Generative adversarial networks (GANs) have been around since [2014](https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf), and have shown steady progress in developing models that can generate realistic images. GANs work by creating two competing networks - a Generator and a Discriminator. The Generator maps a latent vector to an image. The Discriminator evaluates the quality of the generated image relative to some existing dataset. Over the training process, the generated images converge from random noise to images that resemble the target dataset.

GANs ideally learn to generate images that *look like* the target dataset without actually *copying* the target dataset. GANs learn some probability distribution over the target dataset and generate images that look like they *could have* come from that dataset, without memorizing and regurgitating images from it. In a perfect world, each different latent vector maps to a different output, allowing for infinitely many generated images. Latent inputs that are close in vector space generate images that are similar but not exactly the same. In practice, however, GANs can suffer from what is known as mode collapse, where the Generator converges to generating only a small, finite set of outputs regardless of the input vector.

GANs also suffer from stability problems and are notoriously hard to train. Historically, generating larger images while maintaining the quality and diversity possible at lower scales has proven challenging. Various techniques like [Progressive Growing](https://arxiv.org/abs/1710.10196), careful [weight clipping](https://arxiv.org/abs/1701.07875), [regularization strategies](https://arxiv.org/abs/1704.00028) and [normalization approaches](https://arxiv.org/abs/1802.05957) have been developed to improve the stability of training GANs.

Another outstanding mystery in GAN-world is understanding how the latent vector maps to the output. Specifically, which parts of the latent input control which aspects of the generated image? How can we break apart the latent vector and disentangle the latent representation to have finer control over the resulting image?

In 2018, Nvidia released [StyleGAN](https://arxiv.org/pdf/1812.04948.pdf), a new type of GAN architecture that builds on the history of GAN research to create a fundamentally different kind of generator that tackles the problem of understanding the properties of the latent vector.

Instead of feeding the latent vector directly into the generator, StyleGAN first sends it through a mapping network consisting of several fully connected layers. This allows the generator to learn an intermediate latent space that is better suited to generating images. The intermediate latent vector is then injected into the generator at different layers, rather than just being a starting template.

![](media/stylegan.png)

The key finding of StyleGAN was that adding the intermediate latent vector at different layers of the generator caused different effects on the output image. Injection at early layers of the generator influenced major details of the image, while injection at later layers influenced finer details. The result is that the StyleGAN generator can be used to blend different latent vectors. The exact method by which the intermediate latent vector is added to the generator is discussed later.

![](media/stylegan_mixing.jpg)
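
As a rough sketch of how this blending can work: because every generator layer receives its own copy of an intermediate latent vector, two vectors can be mixed by switching from one to the other at a chosen layer. The function name `mix_styles`, the `crossover` parameter, the layer count, and the tiny 4-dimensional vectors are illustrative assumptions and not part of the code in this notebook.

```python
def mix_styles(w_a, w_b, num_layers, crossover):
    """Return one style vector per generator layer.

    Layers before `crossover` receive w_a (early layers, major details);
    layers from `crossover` onward receive w_b (later layers, finer details).
    """
    if not 0 <= crossover <= num_layers:
        raise ValueError("crossover must be between 0 and num_layers")
    return [w_a if layer < crossover else w_b for layer in range(num_layers)]


# Hypothetical intermediate latent vectors, shortened for illustration.
w_a = [0.1, 0.2, 0.3, 0.4]
w_b = [0.9, 0.8, 0.7, 0.6]

styles = mix_styles(w_a, w_b, num_layers=6, crossover=2)
for layer, style in enumerate(styles):
    source = "A" if style is w_a else "B"
    print(f"layer {layer}: style from {source}")
```

Moving `crossover` toward 0 lets `w_b` control more of the image, while moving it toward `num_layers` restricts `w_b` to the finest details.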

StyleGAN trains stably on images as large as 1024x1024 by using [progressive growing](https://arxiv.org/pdf/1710.10196.pdf). We start training on 4x4 images. After some time, we grow to 8x8, then 16x16 and so on. Growing the model raises a small issue: we now have a new layer that converts a tensor of activations to a 3 channel RGB image. Since this new layer is untrained, it likely produces poor quality images. To prevent this from disrupting training, the new RGB layer is phased in gradually.

![](media/rgb.png)
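
The phase-in can be sketched as a weighted average between the old, upsampled RGB output and the new RGB output, with the weight ramping from 0 to 1. This is a minimal illustration of the fade-in idea from the Progressive GAN paper; the helper names, the linear schedule length, and the scalar pixel values are assumptions made for the example, not the notebook's implementation.

```python
def fade_in_alpha(step, fade_steps):
    """Blend weight that ramps linearly from 0 to 1 over `fade_steps` steps."""
    if fade_steps <= 0:
        raise ValueError("fade_steps must be positive")
    return min(1.0, max(0.0, step / fade_steps))


def blend(old_rgb, new_rgb, alpha):
    """Mix the upsampled old RGB value with the new layer's RGB value."""
    return (1.0 - alpha) * old_rgb + alpha * new_rgb


# Hypothetical single pixel value from each path, for illustration.
old_pixel, new_pixel = 0.2, 0.8

for step in (0, 250, 500, 1000, 2000):
    alpha = fade_in_alpha(step, fade_steps=1000)
    print(f"step {step}: alpha={alpha:.2f}, pixel={blend(old_pixel, new_pixel, alpha):.2f}")
```

At `alpha = 0` the output depends only on the already-trained path, so the untrained layer cannot disturb training. By `alpha = 1` the new layer has taken over completely.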

So that's a quick overview. Let's start looking at the model.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai import *
from fastai.vision import *
from fastai.vision.gan import *
from fastai.callbacks import *
import matplotlib.animation as animation
from IPython.display import HTML

In [3]:
path = Path('G:/FFHQ/images')

In [4]:
def get_data(bs, size, path, num_workers, noise_size=512):
    return (GANItemList.from_folder(path, noise_sz=noise_size)
               .split_none()
               .label_from_func(noop)
               .transform(tfms=[[crop_pad(size=size, row_pct=(0,1), col_pct=(0,1))], []], size=size, tfm_y=True)
               .databunch(bs=bs, num_workers=num_workers)
               .normalize(stats = [torch.tensor([0.5,0.5,0.5]), torch.tensor([0.5,0.5,0.5])], do_x=False, do_y=True))

In [5]:
# I resized the main dataset into individual sets for each image size
# This has a noticeable effect on training speed
data_4 = get_data(256, 4, path/'images_4', 8)
data_8 = get_data(256, 8, path/'images_8', 8)
data_16 = get_data(128, 16, path/'images_16', 8)
data_32 = get_data(128, 32, path/'images_32', 8)
data_64 = get_data(48, 64, path/'images_64', 8)
data_128 = get_data(22, 128, path/'images_128', 8)
data_256 = get_data(10, 256, path/'images_256', 8)
data_512 = get_data(4, 512, path/'images_512', 8)

## StyleGAN Fundamentals

First we start with the basic building blocks of the StyleGAN. StyleGAN uses Equalized Learning Rates for all layers. This is a technique from the [Progressive GAN](https://arxiv.org/pdf/1710.10196.pdf) paper.

Layers are typically initialized from a carefully chosen distribution so that the initial output of the untrained layer has approximately mean zero and standard deviation one. This avoids having activations or gradients vanish or explode early in training. Kaiming initialization scales all values by a factor of $c = \sqrt{\frac{2}{W}}$, where $W$ is the fan-in of the layer: the number of input values feeding each output unit (the input features of a linear layer, or the input channels times the kernel area of a convolution).
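
As a small worked example, the Kaiming constant can be computed directly from a weight's shape as $\sqrt{2 / \text{fan-in}}$, the same quantity the `EqualLR` code in this notebook computes. The helper name `kaiming_constant` and the layer shapes are assumptions chosen for illustration.

```python
import math


def kaiming_constant(weight_shape):
    """Return sqrt(2 / fan_in) for a weight of shape (out, in, *kernel_dims)."""
    fan_in = 1
    for dim in weight_shape[1:]:   # every dimension except the output one
        fan_in *= dim
    return math.sqrt(2.0 / fan_in)


# Hypothetical layer shapes.
linear_c = kaiming_constant((512, 512))        # fan_in = 512
conv_c = kaiming_constant((256, 512, 3, 3))    # fan_in = 512 * 3 * 3 = 4608

print(f"linear: {linear_c:.4f}")
print(f"conv:   {conv_c:.4f}")
```

Layers with a larger fan-in get a smaller constant, which keeps the variance of their outputs comparable to that of narrower layers.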

The equalized learning rate technique applies the Kaiming constant dynamically at runtime, rather than only at initialization. The benefit of doing this is, to quote the paper, "somewhat subtle". Many optimization algorithms, such as Adam and RMSProp, normalize gradients in such a way that the size of the update is independent of the scale of the parameter being updated. When parameters differ widely in scale, small parameters receive updates that are too large relative to their size, and large parameters receive updates that are too small. The effect is that the learning rate is too high for some parameters and too low for others, leading to instability. Equalized learning rate keeps the stored weights at a similar scale across the model and applies the Kaiming scaling on every forward pass instead.
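
To make the normalization argument concrete, here is a minimal sketch of the first update step of the standard Adam rule for a single scalar parameter. The function name and the example gradients and weights are assumptions for illustration; the point is only that the step size is roughly the learning rate no matter how large the gradient is.

```python
import math


def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the first Adam update for a single scalar parameter."""
    m = (1 - beta1) * grad            # first-moment estimate
    v = (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1)           # bias correction for step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# The step stays close to lr across six orders of gradient magnitude.
steps = [adam_first_step(g) for g in (1e-3, 1.0, 1e3)]
print(steps)

# The same absolute step is a very different relative change
# for a small weight than for a large one.
relative_change = [steps[1] / abs(w) for w in (0.01, 1.0)]
print(relative_change)
```

With weights of magnitude 0.01 and 1.0, the same step changes the small weight about a hundred times more in relative terms. That mismatch is what keeping all stored weights in a similar range is meant to avoid.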

In [6]:
class EqualLR:
    def __init__(self, name):
        self.name = name

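    def compute_weight(self, module):
        # Read the unscaled, trainable copy of the weight
        weight = getattr(module, self.name + '_orig')
        # fan_in = input channels (or features) * kernel area (1 for linear layers)
        fan_in = weight.data.size(1) * weight.data[0][0].numel()

        # Apply the Kaiming constant at runtime
        return weight * math.sqrt(2 / fan_in)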

    @staticmethod
    def apply(module, name):
        fn = EqualLR(name)

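        weight = getattr(module, name)
        # Swap the original parameter for an unscaled copy named '<name>_orig';
        # the forward pre-hook rebuilds the scaled '<name>' before each forward pass
        del module._parameters[name]
        module.register_parameter(name + '_orig', nn.Parameter(weight.data))
        module.register_forward_pre_hook(fn)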

        return fn

    def __call__(self, module, input):
        weight = self.compute_weight(module)
        setattr(module, self.name, weight)


def equal_lr(module, name='weight'):
    EqualLR.apply(module, name)

    return module

In [7]:
class EqualConv2d(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()

        conv = nn.Conv2d(*args, **kwargs)
        conv.weight.data.normal_()
        conv.bias.data.zero_()
        self.conv = equal_lr(conv)

    def forward(self, input):
        return self.conv(input)


class EqualLinear(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()

        linear = nn.Linear(in_dim, out_dim)
        linear.weight.data.normal_()
        linear.bias.data.zero_()

        self.linear = equal_lr(linear)

    def forward(self, input):
        return self.linear(input)

This is the standard convolutional block for the model. It consists of two `EqualConv2d` layers, each followed by a leaky ReLU activation with a negative slope of 0.2. This sort of convolutional block will be used in the discriminator.

In [8]:
class ConvBlock(nn.Module):
    def __init__(
        self,
        in_channel,
        out_channel,
        kernel_size,
        padding,
        kernel_size2=None,
        padding2=None,
        pixel_norm=True,  # accepted but not used in this block
        spectral_norm=False,  # accepted but not used in this block
    ):
        super().__init__()

        pad1 = padding
        pad2 = padding
        if padding2 is not None:
            pad2 = padding2

        kernel1 = kernel_size
        kernel2 = kernel_size
        if kernel_size2 is not None:
            kernel2 = kernel_size2

        self.conv = nn.Sequential(
            EqualConv2d(in_channel, out_channel, kernel1, padding=pad1),
            nn.LeakyReLU(0.2),
            EqualConv2d(out_channel, out_channel, kernel2, padding=pad2),
            nn.LeakyReLU(0.2),
        )

    def forward(self, input):
        out = self.conv(input)

        return out
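
The `kernel_size`/`padding` and `kernel_size2`/`padding2` arguments determine how each convolution changes the spatial size of its input. A minimal sketch using the standard convolution output-size formula (stride 1 and no dilation, which are the `nn.Conv2d` defaults used above) is shown here. The helper name and the example settings are assumptions for illustration.

```python
def conv_out_size(size, kernel_size, padding, stride=1):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


def conv_block_out_size(size, kernel_size, padding, kernel_size2=None, padding2=None):
    """Output size after two convolutions, mirroring ConvBlock's defaults:
    the second convolution reuses the first one's settings unless overridden."""
    k2 = kernel_size if kernel_size2 is None else kernel_size2
    p2 = padding if padding2 is None else padding2
    after_first = conv_out_size(size, kernel_size, padding)
    return conv_out_size(after_first, k2, p2)


# A 3x3 kernel with padding 1 preserves the resolution.
print(conv_block_out_size(8, kernel_size=3, padding=1))

# Hypothetical override: a 4x4 second kernel with no padding collapses 4x4 to 1x1.
print(conv_block_out_size(4, kernel_size=3, padding=1, kernel_size2=4, padding2=0))
```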