
StyleGAN2

Jimena Lozano edited this page Oct 4, 2021 · 1 revision

A Style-Based Generator Architecture for Generative Adversarial Networks

Today, GANs come in a variety of forms: ProGAN, DCGAN, CycleGAN, SAGAN… Among them, StyleGAN makes substantial changes to the original generator architecture: it generates high-resolution images and allows control over features ranging from coarse (pose, face shape) to fine (hair color) when generating artificial faces. This is done using a mapping network, which maps points in the latent space to an intermediate latent space that the generator uses to control the style at each layer. In addition, noise is introduced as a source of stochastic variation in the output images.

StyleGAN builds on the multilayer generator architecture of ProGAN to allow control of visual features. The higher the layer (and therefore the higher its resolution), the finer the features it affects; the lower the resolution, the larger and coarser the change it produces in the image. The authors categorize the effects on features according to resolution:

  1. Coarse resolutions (4x4 – 8x8): high-level aspects such as pose, general hair style, face shape, and eyeglasses.
  2. Middle resolutions (16x16 – 32x32): smaller scale facial features such as hair style, eyes open/closed.
  3. Fine resolutions (64x64 – 1024x1024): color scheme and microstructures.

The innovations introduced by StyleGAN mostly concern the generator network, and the picture below illustrates how it differs from a traditional GAN generator network. The major changes made to the architecture are described in more detail in the following sections.

image

Mapping network

Controlling visual features is difficult in any GAN architecture because the result depends heavily on the training dataset. A traditional GAN does not allow control over the finer styling of the image because its latent space has to follow the probability distribution of the training data. For example, if the dataset contains a large number of facial images with red hair, there is a high probability that input values will be mapped to that feature. The only control a user has over the visual features is to change the input value (latent vector z, as seen in the image above) and obtain a different generated image. Therefore, if the dataset contains mostly males with short hair and females with long hair, changing the input to obtain a female with short hair would likely also change the gender, because a male with short hair is the most likely output. To obtain more control over features and styles, StyleGAN introduces another network that makes them less dependent on the training dataset's probability distribution and produces a vector whose elements are not tied to the correlations between features in the dataset.

image

This is the key variation in the GAN architecture introduced by StyleGAN. The mapping network maps points in the latent space (input) to another latent space of the same size (output), which the generator uses to control style at each resolution layer. It encodes the input vector into an intermediate vector whose different elements control different visual features. While a traditional generator feeds the latent code through the input layer only, StyleGAN first maps the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer of the synthesis network.
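The mapping from the input latent space to W can be sketched as a small stack of fully connected layers. The following is a minimal NumPy illustration, not the actual StyleGAN implementation: the layer count (8), the width (512), the leaky ReLU slope, and the input normalization are assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np

def mapping_network(z, weights, biases, alpha=0.2):
    """Map latent codes z of shape [batch, dim] to intermediate codes w of the same shape."""
    # Assumed preprocessing: scale each latent vector to unit mean squared magnitude.
    x = z / np.sqrt(np.mean(z ** 2, axis=1, keepdims=True) + 1e-8)
    for W, b in zip(weights, biases):
        x = x @ W + b                      # fully connected layer
        x = np.where(x > 0, x, alpha * x)  # leaky ReLU
    return x

rng = np.random.RandomState(0)
dim, num_layers = 512, 8  # assumed sizes, for illustration only
weights = [rng.randn(dim, dim) * np.sqrt(2.0 / dim) for _ in range(num_layers)]
biases = [np.zeros(dim) for _ in range(num_layers)]

z = rng.randn(4, dim)                    # batch of 4 input latent codes
w = mapping_network(z, weights, biases)  # intermediate codes, same shape as z
```

The only property the sketch is meant to show is that the input and output have the same size, while the output no longer has to follow the fixed distribution from which z was sampled.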

Synthesis network

image

The synthesis network works like a decoder: it converts the information obtained from the mapping network into the generated image, which is the actual output of the generator network. Let us take a look at how this transformation takes place. In the image above, the letter A marks a connection from the mapping network to a layer of the synthesis network. It stands for a learned affine transform that turns the intermediate vector W into a scale and a bias for each channel of the convolutional layer. In other words, it specializes the latent code W to a style Y = (Y_s, Y_b) that controls adaptive instance normalization (AdaIN). The AdaIN module receives a content input (b in the picture below) and a style input Y (a in the picture below). Each feature map (channel) x_i of the content input is first normalized separately; the result is then scaled by Y_s and shifted by Y_b, so that the channel-wise mean and variance of (b) are aligned with those given by (a). AdaIN thus computes its affine parameters adaptively from the style input, using the formula shown in the image above. As a result, a visual representation of the style information Y carried by the vector W is obtained.
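The AdaIN operation is usually written as AdaIN(x_i, Y) = Y_s,i · (x_i − μ(x_i)) / σ(x_i) + Y_b,i, where μ and σ are the mean and standard deviation of feature map x_i. A minimal NumPy sketch of that formula follows; the function name, the array layout [batch, channels, height, width], and the small epsilon are choices made for this example, and in the real network the scale and bias would come from the learned affine transform A rather than being set by hand.

```python
import numpy as np

def adain(x, y_s, y_b, eps=1e-8):
    """x: [batch, channels, height, width]; y_s, y_b: [batch, channels]."""
    # Normalize each feature map (channel) separately.
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    # Apply the per-channel scale and bias taken from the style.
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]

rng = np.random.RandomState(0)
x = rng.randn(2, 3, 8, 8)   # content input: 2 images, 3 channels
y_s = np.full((2, 3), 2.0)  # style scale (hand-picked for the example)
y_b = np.full((2, 3), 0.5)  # style bias (hand-picked for the example)
out = adain(x, y_s, y_b)
```

After the call, every channel of `out` has a mean close to its bias and a standard deviation close to its scale, regardless of the statistics of `x`.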

image

Noise

At each resolution level of the synthesis network another variation takes place in the architecture:

image

Noise is injected in a similar way to the styles: noise is added to each channel before the AdaIN transformation takes place. This is meant to change, at a small scale and at the resolution level where it is applied, the visual representation produced by the synthesis network. These noise inputs are stochastic latent variables that account for visual details on a facial image such as freckles, wrinkles, or the exact direction of individual hairs, which are noticeable at high resolution and help to generate a more realistic output. The noise inputs are single-channel images consisting of uncorrelated Gaussian noise, and the network feeds a dedicated noise image to each layer of the synthesis network. The noise image is broadcast to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution.
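The broadcasting step can be sketched in a few lines of NumPy. This is only an illustration: the function name and array layout are assumptions, and the per-channel scaling factors, which are learned in the real network, are set by hand here.

```python
import numpy as np

def add_noise(features, noise_scale, rng):
    """features: [batch, channels, height, width]; noise_scale: [channels]."""
    batch, _, height, width = features.shape
    # One single-channel Gaussian noise image per sample.
    noise = rng.randn(batch, 1, height, width)
    # Broadcast the noise image to all channels with per-channel scaling.
    return features + noise * noise_scale[None, :, None, None], noise

rng = np.random.RandomState(0)
features = np.zeros((1, 4, 8, 8))             # stand-in for a convolution output
noise_scale = np.array([0.0, 0.1, 0.5, 1.0])  # hand-picked instead of learned
out, noise = add_noise(features, noise_scale, rng)
```

A channel with a scaling factor of 0 ignores the noise entirely, which is how the network can decide per feature map how much stochastic variation to let through.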

An example of a stochastic variable like hair placement can be seen below:

image

Style mixing

StyleGAN employs mixing regularization to decorrelate neighboring styles and enable more fine-grained control over the generated imagery. During training, a given percentage of images is generated using two latent codes instead of one. When generating such an image, the generator starts with one latent code and switches to the other at a randomly selected point in the synthesis network. This operation is known as style mixing. This regularization technique prevents the network from assuming that adjacent styles are correlated.
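As a rough sketch of the operation, style mixing amounts to taking the per-layer styles from one code up to a crossover layer and from the other code afterwards. The example below assumes 18 synthesis layers and 512-dimensional codes; the function name is hypothetical, and random vectors stand in for the outputs of the mapping network.

```python
import numpy as np

def mix_styles(dlatents_a, dlatents_b, crossover):
    """dlatents: [layers, component]. Layers before crossover use a, the rest use b."""
    mixed = dlatents_a.copy()
    mixed[crossover:] = dlatents_b[crossover:]
    return mixed

rng = np.random.RandomState(0)
num_layers, dim = 18, 512                       # assumed sizes
w_a = np.tile(rng.randn(dim), (num_layers, 1))  # first code, copied to every layer
w_b = np.tile(rng.randn(dim), (num_layers, 1))  # second code, copied to every layer
crossover = rng.randint(1, num_layers)          # randomly selected switching point
mixed = mix_styles(w_a, w_b, crossover)
```

An early crossover hands most layers, including the coarse ones, to the second code, while a late crossover only swaps the fine styles.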

As a result, multiple images can be combined in a coherent way. When two images i1 and i2 are combined, the result takes some features from i1 and the rest from i2. For example, the figure below presents images synthesized by mixing two latent codes at various scales. It can be seen that each subset of styles controls meaningful high-level attributes of the image.

image

Analyzing and Improving the Image Quality of StyleGAN

A second version of StyleGAN includes changes to the architecture.

Implementation

Initiate TensorFlow session

In order for pickle.load() to work, you will need to have the dnnlib source directory in your PYTHONPATH and a tf.Session set as default. The session can be initialized by calling dnnlib.tflib.init_tf().

dnnlib.tflib.init_tf()

Loading the network

The pre-trained networks are stored as standard pickle files on Google Drive. To load the pre-trained StyleGAN2 network for the FFHQ dataset at 1024×1024:

network_pkl = 'gdrive:networks/stylegan2-ffhq-config-f.pkl'
_G, _D, Gs = pretrained_networks.load_networks(network_pkl)

_G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.

_D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.

Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.

The above code downloads the file and unpickles it to yield 3 instances of dnnlib.tflib.Network. To generate images, you will typically want to use Gs – the other two networks are provided for completeness.

Generate random images

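import numpy as np
import dnnlib
import dnnlib.tflib as tflib

# Noise inputs of the synthesis network (one tf.Variable per layer).
noise_vars = [var for name, var in Gs.components.synthesis.vars.items() if name.startswith('noise')]

Gs_kwargs = dnnlib.EasyDict()
Gs_kwargs.output_transform = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
Gs_kwargs.randomize_noise = False
Gs_kwargs.truncation_psi = truncation_psi  # chosen by the caller, e.g. 0.5

rnd = np.random.RandomState(seed)      # seed is assumed to be an integer here
z = rnd.randn(1, *Gs.input_shape[1:])  # latent vector, [minibatch, component]
tflib.set_vars({var: rnd.randn(*var.shape.as_list()) for var in noise_vars})

images = Gs.run(z, None, **Gs_kwargs)  # [minibatch, height, width, channel]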

Keyword arguments

  • truncation_psi and truncation_cutoff control the truncation trick that is performed by default when using Gs (ψ=0.7, cutoff=8). It can be disabled by setting truncation_psi=1 or is_validation=True, and the image quality can be further improved at the cost of variation by setting e.g. truncation_psi=0.5. Note that truncation is always disabled when using the sub-networks directly. The average w needed to manually perform the truncation trick can be looked up using Gs.get_var('dlatent_avg').

  • randomize_noise determines whether to re-randomize the noise inputs for each generated image (True, default) or whether to use specific noise values for the entire minibatch (False). The specific values can be accessed via the tf.Variable instances that are found using [var for name, var in Gs.components.synthesis.vars.items() if name.startswith('noise')].

  • When using the mapping network directly, you can specify dlatent_broadcast=None to disable the automatic duplication of dlatents over the layers of the synthesis network.

  • Runtime performance can be fine-tuned via structure='fixed' and dtype='float16'. The former disables support for progressive growing, which is not needed for a fully-trained generator, and the latter performs all computation using half-precision floating point arithmetic.
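The truncation trick controlled by truncation_psi and truncation_cutoff pulls each w towards the average w: w' = w_avg + ψ · (w − w_avg), applied only to the layers below the cutoff. The NumPy sketch below illustrates that arithmetic under stated assumptions: the function name is hypothetical, the arrays are tiny placeholders, and in practice the average would be the value returned by Gs.get_var('dlatent_avg').

```python
import numpy as np

def truncate(dlatents, dlatent_avg, psi=0.7, cutoff=8):
    """dlatents: [batch, layers, component]; dlatent_avg: [component]."""
    layer_idx = np.arange(dlatents.shape[1])[None, :, None]
    # psi applies to layers below the cutoff; the remaining layers keep psi = 1.
    coefs = np.where(layer_idx < cutoff, psi, 1.0)
    return dlatent_avg + (dlatents - dlatent_avg) * coefs

dlatent_avg = np.zeros(4)        # placeholder for the average w
dlatents = np.ones((1, 18, 4))   # placeholder dlatents, 18 layers
truncated = truncate(dlatents, dlatent_avg, psi=0.7, cutoff=8)
```

With ψ=1 the function returns its input unchanged, which matches the statement above that truncation_psi=1 disables the trick.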

Accessing sub-networks of the generator

Look up Gs.components.mapping and Gs.components.synthesis to access individual sub-networks of the generator. Similar to Gs, the sub-networks are represented as independent instances of dnnlib.tflib.Network:

# src_seeds (integer seeds) and synthesis_kwargs (keyword arguments for the synthesis network) are defined elsewhere.
src_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in src_seeds])
src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)

The above code is from generate_figures.py. It first transforms a batch of latent vectors into the intermediate W space using the mapping network and then turns these vectors into a batch of images using the synthesis network. The dlatents array stores a separate copy of the same w vector for each layer of the synthesis network to facilitate style mixing.
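The layout of the dlatents array can be illustrated without loading a network. The sketch below assumes 18 synthesis layers and 512 components, and uses random values in place of real mapping network outputs; it only shows how one w vector per image is duplicated along a layer axis.

```python
import numpy as np

num_layers, dim = 18, 512                   # assumed sizes
w = np.random.RandomState(0).randn(2, dim)  # one w vector per image, [batch, component]
# Duplicate each w along a new layer axis: [batch, layer, component].
dlatents = np.tile(w[:, np.newaxis, :], (1, num_layers, 1))
```

Because every layer has its own copy, individual rows can later be overwritten for specific layers, which is what style mixing relies on.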

The exact details of the generator are defined in training/networks_stylegan.py (see G_style, G_mapping, and G_synthesis).

Generating a series of images that show a transition between the two generated images

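vector_size = Gs.input_shape[1]
# expand_seed is a helper that is not shown here; it is assumed to return
# the latent vectors that correspond to seed_from and seed_to.
seeds = expand_seed(seed_from, seed_to, vector_size)
diff = seeds[1] - seeds[0]
step = diff / steps
current = seeds[0]
transition_seeds = []
for i in range(steps):
    transition_seeds.append(current)
    current = current + step
for seed in transition_seeds:
    Gs.run(seed, None, **Gs_kwargs)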

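The above code shows how a transition between the two images generated from seed_from and seed_to can be achieved: the latent vector is moved from the first towards the second in equal increments, and an image is generated at each step. Each seed in the loop represents one latent vector along that path.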
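The same kind of linear path can be written more compactly with np.linspace. This is a sketch with a hypothetical helper name and random stand-in latent vectors; unlike the loop above, this variant includes both endpoints in the returned sequence.

```python
import numpy as np

def interpolate_latents(z_from, z_to, steps):
    """Return `steps` latent vectors evenly spaced from z_from to z_to, inclusive."""
    t = np.linspace(0.0, 1.0, steps)[:, None]  # interpolation weights, [steps, 1]
    return (1.0 - t) * z_from + t * z_to

z_from = np.random.RandomState(1).randn(512)  # stand-in for the first latent vector
z_to = np.random.RandomState(2).randn(512)    # stand-in for the second latent vector
frames = interpolate_latents(z_from, z_to, steps=5)  # [5, 512]
```

Each row of `frames` would then be passed to the generator, one image per row, to render the transition.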

How to make changes to the latent vector in order to obtain small perceptual changes in the generated image

Latent vectors (or directions) that represent changes to certain features in facial images were obtained for each of the following attributes:

  • Age
  • Beauty
  • Emotion: angry
  • Emotion: disgust
  • Emotion: easy
  • Emotion: fear
  • Emotion: happy
  • Emotion: sad
  • Emotion: surprise
  • Eyes open
  • Gender
  • Smile

These attributes are stored as numpy files, each containing a latent direction. The latent vector of a facial image moves along that direction when the following arithmetic operations are applied:

  1. The latent direction of the attribute is multiplied by a scalar value representing the desired intensity of the attribute.
  2. The scaled latent direction is added to the latent vector.

new_latent_vector = latent_vector.copy()
new_latent_vector[0][:8] = (latent_vector[0] + intensity * direction)[:8]

The above code shows how the operation is done: latent_vector is the latent vector of the encoded facial image, direction is the attribute expressed as another latent vector, and intensity is the scalar that sets the strength of the directional change. The slice [:8] restricts the change to the first eight rows of latent_vector[0] and leaves the remaining rows untouched.
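The operation can be tried on placeholder arrays without loading a network or a direction file. In the sketch below, the shapes ([1, 18, 512] for the latent vector and [18, 512] for the direction) are assumptions, the function name is hypothetical, and the all-zero and all-one arrays merely make the effect easy to see.

```python
import numpy as np

def move_latent(latent_vector, direction, intensity, num_layers=8):
    """latent_vector: [1, layers, component]; direction: [layers, component]."""
    new_latent_vector = latent_vector.copy()
    # Only the first num_layers rows are moved along the direction.
    new_latent_vector[0][:num_layers] = (latent_vector[0] + intensity * direction)[:num_layers]
    return new_latent_vector

latent_vector = np.zeros((1, 18, 512))  # placeholder for an encoded face
direction = np.ones((18, 512))          # placeholder for an attribute direction
moved = move_latent(latent_vector, direction, intensity=1.5)
```

A negative intensity moves the latent vector the opposite way along the same direction, and the original latent_vector is left unmodified because the function works on a copy.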
