StyleGAN2
Today, GANs come in a variety of forms: ProGAN, DCGAN, CycleGAN, SAGAN… Among them, StyleGAN makes substantial changes to the original generator architecture: it generates high-resolution images and exposes control over features ranging from coarse (pose, face shape) to fine (hair color) when generating artificial faces. It does this with a mapping network that maps points in the latent space to an intermediate latent space, which the generator uses to control the style at each layer, and with noise inputs that introduce stochastic variation into the output images.
StyleGAN builds on the multilayer architecture of the ProGAN generator to allow control of visual features. The later the layer (and the higher its resolution), the finer the features it affects; at lower resolutions, a change affects larger, coarser aspects of the image. The authors group the effects on features by resolution:
- Coarse resolutions (4x4 – 8x8): high-level aspects such as pose, general hair style, face shape, and eyeglasses.
- Middle resolutions (16x16 – 32x32): smaller scale facial features such as hair style, eyes open/closed.
- Fine resolutions (64x64 – 1024x1024): color scheme and microstructures.
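The grouping above can be tied to style-layer indices with a short sketch. It assumes a 1024x1024 generator with two style inputs per resolution (18 in total); that layer count and the helper names `style_layer_resolutions` and `group_of` are illustrative assumptions, not taken from the text above.

```python
# Assumption: two style inputs per resolution, from 4x4 up to max_resolution.
def style_layer_resolutions(max_resolution=1024):
    """Return the resolution each style input acts on, in layer order."""
    resolutions = []
    res = 4
    while res <= max_resolution:
        resolutions.extend([res, res])  # two style inputs per resolution
        res *= 2
    return resolutions


def group_of(res):
    """Map a resolution to the coarse/middle/fine grouping listed above."""
    if res <= 8:
        return "coarse"
    if res <= 32:
        return "middle"
    return "fine"


layers = style_layer_resolutions()
groups = [group_of(r) for r in layers]
for index, (res, group) in enumerate(zip(layers, groups)):
    print(index, f"{res}x{res}", group)
```

Under these assumptions, the first four layers are coarse, the next four are middle, and the remaining ten are fine.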
The innovations introduced by StyleGAN mostly concern the generator network, and the picture below illustrates how it differs from a traditional GAN generator network. The major architectural changes are described in more detail in the following sections.

Controlling visual features is difficult in any GAN architecture because the output depends heavily on the training dataset. A traditional GAN does not allow control over the finer styling of the image because its latent space has to follow the probability distribution of the training data. For example, if most of the faces in the dataset have red hair, input values are likely to be mapped to that feature. The only control a user has over the visual features is changing the input value (the latent vector z, as seen in the image above) and obtaining a different generated image. Therefore, if the dataset contains mostly males with short hair and females with long hair, changing the input to obtain a female with short hair is likely to change the gender as well, because a male with short hair is more likely to be generated. To obtain more control over features and styles, StyleGAN introduces another network that makes the generator's input independent of the training dataset's probability distribution and produces a vector whose elements are not tied to the correlations between the dataset's features.

This is the key change to the GAN architecture introduced by StyleGAN. The mapping network maps points in the input latent space to an intermediate latent space of the same size, which the generator uses to control style at each resolution layer. It encodes the input vector into an intermediate vector whose different elements control different visual features. While a traditional generator feeds the latent code through the input layer only, StyleGAN first maps the input to an intermediate latent space W, which then controls the generator through adaptive instance normalization (AdaIN) at each convolution layer of the synthesis network.
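As a rough illustration of the mapping described above, the following NumPy sketch pushes latent codes through a stack of fully connected layers and returns intermediate codes of the same size. The layer count, the width of 512, the leaky ReLU slope, the input normalization, and the random (untrained) weights are all assumptions made for the sketch; it is not the trained network.

```python
import numpy as np


def leaky_relu(x, alpha=0.2):
    """Leaky ReLU activation (slope chosen for illustration)."""
    return np.where(x > 0, x, alpha * x)


def mapping_network(z, weights, biases):
    """Map latent codes z [batch, dim] to intermediate codes w [batch, dim]."""
    # Assumed preprocessing: scale each latent code to unit second moment.
    x = z / np.sqrt(np.mean(z ** 2, axis=1, keepdims=True) + 1e-8)
    for W, b in zip(weights, biases):
        x = leaky_relu(x @ W + b)
    return x


rng = np.random.RandomState(0)
dim, num_layers = 512, 8  # illustrative sizes
weights = [rng.randn(dim, dim) / np.sqrt(dim) for _ in range(num_layers)]
biases = [np.zeros(dim) for _ in range(num_layers)]

z = rng.randn(4, dim)                    # batch of input latent codes
w = mapping_network(z, weights, biases)  # intermediate latent codes
print(z.shape, w.shape)
```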

The synthesis network works like a decoder: it converts the information obtained from the mapping network into the generated image, which is the actual output of the generator network. Let us take a look at how this transformation takes place. In the image above, the letter A can be seen as a layer between the mapping network and the synthesis network. It denotes a learned affine transform that converts the intermediate vector W into a scale and a bias for each channel of the convolutional layer. It specializes the latent code W to a style Y = (Y_s, Y_b) that controls the adaptive instance normalization (AdaIN). The AdaIN module then receives a content input (b in the picture below) and a style input Y (a in the picture below) and aligns the channel-wise mean and variance of (b) to match those of (a) using the scale (Y_s) and bias (Y_b), scaling and shifting each channel of the convolutional output. As a result, a visual representation of the information (Y, or style) from vector W can be obtained. Before the transformation takes place, each feature map (channel) x_i is normalized separately, and AdaIN adaptively computes the affine parameters from the style input using the formula shown in the above image, AdaIN(x_i, Y) = Y_s,i * (x_i - μ(x_i)) / σ(x_i) + Y_b,i.
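The AdaIN step described above can be sketched in NumPy as follows. The tensor layout ([batch, channels, height, width]), the helper `affine_style` standing in for the learned transform A, and the random weights are assumptions for illustration only.

```python
import numpy as np


def affine_style(w, A, bias):
    """Hypothetical stand-in for the learned affine transform "A".

    w: [batch, dim] -> (y_s, y_b), each of shape [batch, channels].
    """
    y = w @ A + bias
    y_s, y_b = np.split(y, 2, axis=1)
    return y_s, y_b


def adain(x, y_s, y_b, eps=1e-8):
    """Normalize each feature map of x, then apply the per-channel scale and bias."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]


rng = np.random.RandomState(0)
batch, channels, dim = 2, 3, 16
x = rng.randn(batch, channels, 8, 8)   # content input (convolution output)
w = rng.randn(batch, dim)              # intermediate latent code
A = rng.randn(dim, 2 * channels)       # untrained weights, for illustration
bias = np.concatenate([np.ones(channels), np.zeros(channels)])

y_s, y_b = affine_style(w, A, bias)
out = adain(x, y_s, y_b)
```

After the call, each output channel has mean `y_b` and standard deviation `|y_s|`, which is the sense in which the style sets the per-channel statistics.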

At each resolution level of the synthesis network another variation takes place in the architecture:

Noise is introduced in a similar way to the AdaIN process: it is added to each channel before the AdaIN transformation takes place. This changes the visual representation produced by the synthesis network on a small scale, at the resolution level where the noise is applied. The noise inputs are stochastic latent variables that account for visual details (stochastic variation) in a facial image, such as freckles, wrinkles, and the exact direction of hairs, which are visible at high resolution and help generate a more realistic output. They are single-channel images consisting of uncorrelated Gaussian noise, and the network feeds a dedicated noise image to each layer of the synthesis network. The noise image is broadcast to all feature maps using learned per-feature scaling factors and then added to the output of the corresponding convolution.
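A minimal NumPy sketch of this noise injection is shown below. The function name `add_noise`, the tensor layout, and the scaling values (which stand in for learned factors) are assumptions for illustration.

```python
import numpy as np


def add_noise(x, noise_scale, rng):
    """Add one single-channel noise image to every feature map of x.

    x: [batch, channels, height, width]
    noise_scale: [channels], stands in for the learned per-feature factors.
    """
    batch, _, height, width = x.shape
    noise = rng.randn(batch, 1, height, width)  # uncorrelated Gaussian noise
    # Broadcasting applies the same noise image to all channels, scaled per channel.
    return x + noise * noise_scale[None, :, None, None], noise


rng = np.random.RandomState(0)
x = np.zeros((1, 4, 8, 8))                    # convolution output (zeros for clarity)
noise_scale = np.array([0.0, 0.5, 1.0, 2.0])  # illustrative values
y, noise = add_noise(x, noise_scale, rng)
```

A channel whose scaling factor is zero is left untouched, so the network can decide per feature map how much stochastic variation to use.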
An example of a stochastic variable like hair placement can be seen below:

StyleGAN employs mixing regularization to decorrelate neighboring styles and enable more fine-grained control over the generated imagery. During training, a number of images are generated using two latent codes instead of one. For such images, the generator starts with one latent code and switches to the other at a randomly selected point in the synthesis network. This operation is known as style mixing. This regularization technique prevents the network from assuming that adjacent styles are correlated.
As a result, multiple images can be combined in a coherent way. When two images i1 and i2 are combined, the result takes some features from i1 and the rest from i2. For example, the figure below shows images synthesized by mixing two latent codes at various scales. It can be seen that each subset of styles controls meaningful high-level attributes of the image.
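The switch between two latent codes can be sketched on the per-layer style array. The sketch assumes 18 style layers of size 512 and a hypothetical helper `mix_styles`; random vectors stand in for the intermediate codes of two images.

```python
import numpy as np


def mix_styles(dlatents_a, dlatents_b, crossover):
    """Use styles from a for layers before `crossover` and from b afterwards.

    dlatents_a, dlatents_b: [num_layers, dim]
    """
    mixed = dlatents_b.copy()
    mixed[:crossover] = dlatents_a[:crossover]
    return mixed


num_layers, dim = 18, 512  # illustrative sizes
rng = np.random.RandomState(0)
# One intermediate code per image, copied to every layer of the synthesis network.
w_a = np.tile(rng.randn(1, dim), (num_layers, 1))
w_b = np.tile(rng.randn(1, dim), (num_layers, 1))

crossover = rng.randint(1, num_layers)  # randomly selected switching point
mixed = mix_styles(w_a, w_b, crossover)
```

A small `crossover` takes only the coarse styles from the first image, while a large one takes everything except the finest styles.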

A second version of StyleGAN includes changes to the architecture.
In order for pickle.load() to work, you will need to have the dnnlib source directory in your PYTHONPATH and a tf.Session set as default. The session can be initialized by calling dnnlib.tflib.init_tf().

import dnnlib.tflib
dnnlib.tflib.init_tf()

The pre-trained networks are stored as standard pickle files on Google Drive. To load the pre-trained StyleGAN2 network for the FFHQ dataset at 1024×1024:

import pretrained_networks
network_pkl = 'gdrive:networks/stylegan2-ffhq-config-f.pkl'
_G, _D, Gs = pretrained_networks.load_networks(network_pkl)

- _G = Instantaneous snapshot of the generator. Mainly useful for resuming a previous training run.
- _D = Instantaneous snapshot of the discriminator. Mainly useful for resuming a previous training run.
- Gs = Long-term average of the generator. Yields higher-quality results than the instantaneous snapshot.

The above code downloads the file and unpickles it to yield 3 instances of dnnlib.tflib.Network. To generate images, you will typically want to use Gs – the other two networks are provided for completeness.
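
import numpy as np
import dnnlib
import dnnlib.tflib as tflib

# Noise inputs of the synthesis network (see randomize_noise below).
noise_vars = [var for name, var in Gs.components.synthesis.vars.items() if name.startswith('noise')]

Gs_kwargs = dnnlib.EasyDict()
Gs_kwargs.output_transform = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
Gs_kwargs.randomize_noise = False
Gs_kwargs.truncation_psi = truncation_psi  # set by the caller, e.g. 0.5
rnd = np.random.RandomState()
tflib.set_vars({var: rnd.randn(*var.shape.as_list()) for var in noise_vars})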
images = Gs.run(seed, None, **Gs_kwargs)  # seed: latent vectors, shape [minibatch, component]

- truncation_psi and truncation_cutoff control the truncation trick that is performed by default when using Gs (ψ=0.7, cutoff=8). It can be disabled by setting truncation_psi=1 or is_validation=True, and the image quality can be further improved at the cost of variation by setting e.g. truncation_psi=0.5. Note that truncation is always disabled when using the sub-networks directly. The average w needed to manually perform the truncation trick can be looked up using Gs.get_var('dlatent_avg').
- randomize_noise determines whether to re-randomize the noise inputs for each generated image (True, default) or whether to use specific noise values for the entire minibatch (False). The specific values can be accessed via the tf.Variable instances that are found using [var for name, var in Gs.components.synthesis.vars.items() if name.startswith('noise')].
- When using the mapping network directly, you can specify dlatent_broadcast=None to disable the automatic duplication of dlatents over the layers of the synthesis network.
- Runtime performance can be fine-tuned via structure='fixed' and dtype='float16'. The former disables support for progressive growing, which is not needed for a fully-trained generator, and the latter performs all computation using half-precision floating point arithmetic.
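Performing the truncation trick manually, as mentioned in the list above, amounts to pulling each intermediate vector toward the average w. The NumPy sketch below assumes dlatents of shape [batch, layers, dim] and applies ψ only to layers below the cutoff; the function name `truncate` and the random stand-in data are illustrative, and in practice the average would come from Gs.get_var('dlatent_avg').

```python
import numpy as np


def truncate(dlatents, dlatent_avg, psi=0.7, cutoff=8):
    """Interpolate dlatents toward dlatent_avg for layers below `cutoff`.

    dlatents: [batch, num_layers, dim], dlatent_avg: [dim]
    """
    layer_idx = np.arange(dlatents.shape[1])[None, :, None]
    coefs = np.where(layer_idx < cutoff, psi, 1.0)  # 1.0 leaves a layer unchanged
    return dlatent_avg + (dlatents - dlatent_avg) * coefs


rng = np.random.RandomState(0)
dlatents = rng.randn(2, 18, 512)   # stand-in for mapping network output
dlatent_avg = rng.randn(512)       # stand-in for the average w
truncated = truncate(dlatents, dlatent_avg, psi=0.5, cutoff=8)
```

With psi=1 the input is returned unchanged, and smaller values trade variation for proximity to the average.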
Look up Gs.components.mapping and Gs.components.synthesis to access individual sub-networks of the generator. Similar to Gs, the sub-networks are represented as independent instances of dnnlib.tflib.Network:

# src_seeds (integer seeds) and synthesis_kwargs (keyword arguments) are defined by the caller.
src_latents = np.stack([np.random.RandomState(seed).randn(Gs.input_shape[1]) for seed in src_seeds])
src_dlatents = Gs.components.mapping.run(src_latents, None) # [seed, layer, component]
src_images = Gs.components.synthesis.run(src_dlatents, randomize_noise=False, **synthesis_kwargs)

The above code is from generate_figures.py. It first transforms a batch of latent vectors into the intermediate W space using the mapping network and then turns these vectors into a batch of images using the synthesis network. The dlatents array stores a separate copy of the same w vector for each layer of the synthesis network to facilitate style mixing.
The exact details of the generator are defined in training/networks_stylegan.py (see G_style, G_mapping, and G_synthesis).

# expand_seed, seed_from, seed_to and steps are assumed to be defined elsewhere.
vector_size = Gs.input_shape[1:][0]
seeds = expand_seed(seed_from, seed_to, vector_size)
diff = seeds[1] - seeds[0]
step = diff / steps
current = seeds[0]
transition_seeds = []
for i in range(steps):
    transition_seeds.append(current)
    current = current + step
for seed in transition_seeds:
    Gs.run(seed, None, **Gs_kwargs)

The above code shows how a transition between two images seed_from and seed_to can be achieved. Each seed represents
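
The same kind of transition can be sketched without the generator. The NumPy example below interpolates linearly between two latent vectors; the function name `interpolate_latents` and the vector size of 512 are assumptions, and random vectors stand in for the two endpoints.

```python
import numpy as np


def interpolate_latents(z_from, z_to, steps):
    """Return `steps` latent vectors evenly spaced from z_from to z_to (inclusive)."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_from[None, :] + t * z_to[None, :]


rng = np.random.RandomState(0)
z_from, z_to = rng.randn(512), rng.randn(512)
frames = interpolate_latents(z_from, z_to, steps=5)
print(frames.shape)
```

Unlike a loop that appends the current vector before adding the step, np.linspace includes both endpoints, so the last frame is exactly the target vector.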
How to make changes to the latent vector in order to obtain small perceptual changes in the generated image
Latent vectors (or directions) that represent changes to certain features in facial images were obtained for each of the following attributes:
- Age
- Beauty
- Emotion: angry
- Emotion: disgust
- Emotion: easy
- Emotion: fear
- Emotion: happy
- Emotion: sad
- Emotion: surprise
- Eyes open
- Gender
- Smile
These attributes are stored as NumPy files containing latent directions. The latent vector representing a facial image is moved along a direction by applying the following arithmetic operations:
- The latent direction of the attribute is multiplied by a scalar value representing the desired intensity of the attribute.
- The scaled latent direction is added to the latent vector.

new_latent_vector = latent_vector.copy()
# Only the first 8 rows of latent_vector[0] are changed; the rest stay as they were.
new_latent_vector[0][:8] = (latent_vector[0] + intensity * direction)[:8]

The above code shows how the operation is done, with latent_vector being the latent vector of the encoded facial image, direction the latent direction of the attribute, and intensity the scalar that sets the strength of the change.
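
To make the operation easy to try in isolation, here is a self-contained sketch that wraps it in a function. The shapes ([1, 18, 512] for the latent vector, [18, 512] for the direction), the function name `move_latent`, and the random stand-in data are assumptions; a real direction would be loaded from one of the NumPy files.

```python
import numpy as np


def move_latent(latent_vector, direction, intensity, num_layers=8):
    """Shift the first `num_layers` rows of the latent vector along `direction`.

    latent_vector: [1, layers, dim], direction: [layers, dim]
    """
    new_latent_vector = latent_vector.copy()
    shifted = latent_vector[0] + intensity * direction
    new_latent_vector[0][:num_layers] = shifted[:num_layers]
    return new_latent_vector


rng = np.random.RandomState(0)
latent_vector = rng.randn(1, 18, 512)  # stand-in for an encoded facial image
direction = rng.randn(18, 512)         # stand-in for a loaded latent direction

edited = move_latent(latent_vector, direction, intensity=2.0)
```

An intensity of zero returns the original vector, and a negative intensity moves the attribute in the opposite direction.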