# **Semantic Image Synthesis with Spatially-Adaptive Normalization**

**Authors: Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu**

**Original Paper**: https://arxiv.org/pdf/1903.07291.pdf

**Official Github**: https://github.com/NVlabs/SPADE

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with these scripts, please open a PR in the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 16 2022

---

<img src="./imgs/figure1.png" />

### **Abstract**
<p>
We propose spatially-adaptive normalization, a simple
but effective layer for synthesizing photorealistic images
given an input semantic layout. Previous methods directly
feed the semantic layout as input to the deep network, which
is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to “wash away” semantic information. To address the issue, we propose using
the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets
demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user
control over both semantic and style. Code is available at https://github.com/NVlabs/SPADE.
</p>


### **1. Introduction**
<p>
Conditional image synthesis refers to the task of generating photorealistic images conditioning on certain input data. Seminal work computes the output image by
stitching pieces from a single image (e.g., Image Analogies [16]) or using an image collection [7, 14, 23, 30, 35].
Recent methods directly learn the mapping using neural networks [3, 6, 22, 47, 48, 54, 55, 56]. The latter methods are
faster and require no external database of images.
</p>
<p>
    We are interested in a specific form of conditional image synthesis, which is converting a semantic segmentation
mask to a photorealistic image. This form has a wide range
of applications such as content generation and image editing [6, 22, 48]. We refer to this form as semantic image
synthesis. In this paper, we show that the conventional network architecture [22, 48], which is built by stacking convolutional, normalization, and nonlinearity layers, is at best
suboptimal because their normalization layers tend to “wash
away” information contained in the input semantic masks.
To address the issue, we propose spatially-adaptive normalization, a conditional normalization layer that modulates the
activations using input semantic layouts through a spatially-adaptive, learned transformation and can effectively propagate the semantic information throughout the network.
</p>
<p>
    We conduct experiments on several challenging datasets
including the COCO-Stuff [4, 32], the ADE20K [58], and
the Cityscapes [9]. We show that with the help of our
spatially-adaptive normalization layer, a compact network
can synthesize significantly better results compared to several state-of-the-art methods. Additionally, an extensive ablation study demonstrates the effectiveness of the proposed
normalization layer against several variants for the semantic
image synthesis task. Finally, our method supports multimodal and style-guided image synthesis, enabling controllable, diverse outputs, as shown in Figure 1. Also, please
see our SIGGRAPH 2019 Real-Time Live demo and try our
online demo by yourself.
</p>

### **2. Related Work**
<p>
<strong>Deep generative models</strong> can learn to synthesize images.
Recent methods include generative adversarial networks
(GANs) [13] and variational autoencoder (VAE) [28]. Our
work is built on GANs but aims for the conditional image
synthesis task. The GANs consist of a generator and a discriminator where the goal of the generator is to produce realistic images so that the discriminator cannot tell the synthesized images apart from the real ones.
</p>
<p>
<strong>Conditional image synthesis</strong> exists in many forms that differ in the type of input data. For example, class-conditional
models [3, 36, 37, 39, 41] learn to synthesize images given
category labels. Researchers have explored various models
for generating images based on text [18,44,52,55]. Another
widely-used form is image-to-image translation based on a
type of conditional GANs [20, 22, 24, 25, 33, 57, 59, 60],
where both input and output are images. Compared to
earlier non-parametric methods [7, 16, 23], learning-based
methods typically run faster during test time and produce
more realistic results. In this work, we focus on converting
segmentation masks to photorealistic images. We assume
the training dataset contains registered segmentation masks
and images. With the proposed spatially-adaptive normalization, our compact network achieves better results compared to leading methods.
</p>
<p>
<strong>Unconditional normalization layers</strong> have been an important component in modern deep networks and can be found
in various classifiers, including the Local Response Normalization in the AlexNet [29] and the Batch Normalization (BatchNorm) in the Inception-v2 network [21]. Other
popular normalization layers include the Instance Normalization (InstanceNorm) [46], the Layer Normalization [2],
the Group Normalization [50], and the Weight Normalization [45]. We label these normalization layers as unconditional as they do not depend on external data in contrast to
the conditional normalization layers discussed below.
</p>
<p>
<strong>Conditional normalization layers</strong> include the Conditional
Batch Normalization (Conditional BatchNorm) [11] and
Adaptive Instance Normalization (AdaIN) [19]. Both were
first used in the style transfer task and later adopted in various vision tasks [3, 8, 10, 20, 26, 36, 39, 42, 49, 54]. Different from the earlier normalization techniques, conditional normalization layers require external data and generally operate as follows. First, layer activations are normalized to zero mean and unit deviation. Then the normalized activations are denormalized by modulating the
activation using a learned affine transformation whose parameters are inferred from external data. For style transfer tasks [11, 19], the affine parameters are used to control
the global style of the output, and hence are uniform across
spatial coordinates. In contrast, our proposed normalization
layer applies a spatially-varying affine transformation, making it suitable for image synthesis from semantic masks.
Wang et al. proposed a closely related method for image
super-resolution [49]. Both methods are built on spatially-adaptive modulation layers that condition on semantic inputs. While they aim to incorporate semantic information
into super-resolution, our goal is to design a generator for
style and semantics disentanglement. We focus on providing the semantic information in the context of modulating
normalized activations. We use semantic maps in different
scales, which enables coarse-to-fine generation. The reader
is encouraged to review their work for more details.
</p>


### **3. Semantic Image Synthesis**
<p>
Let m ∈ L<sup>H×W</sup> be a semantic segmentation mask
where L is a set of integers denoting the semantic labels,
and H and W are the image height and width. Each entry
in m denotes the semantic label of a pixel. We aim to learn
a mapping function that can convert an input segmentation
mask m to a photorealistic image.
</p>
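
<p>
As a small illustration of the notation above, the sketch below builds a toy mask m with integer labels and converts it to a one-hot array. The one-hot encoding, the helper name <code>one_hot_mask</code>, and the toy label values are assumptions made for illustration; the text above only defines m as an H×W array of integer labels.
</p>

```python
import numpy as np

def one_hot_mask(m, num_labels):
    """Convert an (H, W) integer label mask into a (num_labels, H, W) one-hot array.

    Hypothetical helper: the one-hot layout is an assumption, not part of the text above.
    """
    m = np.asarray(m)
    labels = np.arange(num_labels)[:, None, None]   # (num_labels, 1, 1)
    return (labels == m[None]).astype(np.float32)   # broadcast against (1, H, W)

# Toy mask with H = 2, W = 3 and labels drawn from L = {0, 1, 2}
m = np.array([[0, 0, 2],
              [1, 2, 2]])
one_hot = one_hot_mask(m, num_labels=3)
print(one_hot.shape)  # (3, 2, 3)
```
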
<p>
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/figure2.png" width="400" />
            </td>
        </tr>
    </tbody>
</table>
<strong>Spatially-adaptive denormalization.</strong> Let h<sup>i</sup> denote the activations of the i-th layer of a deep convolutional network
for a batch of N samples. Let C<sup>i</sup> be the number of channels in the layer. Let H<sup>i</sup>
and W<sup>i</sup> be the height and width
of the activation map in the layer. We propose a new conditional normalization method called the SPatially-Adaptive
(DE)normalization<sup>(1)</sup> (SPADE). Similar to the Batch Normalization [21], the activation is normalized in the channel-wise manner and then modulated with learned scale and
bias. Figure 2 illustrates the SPADE design. The activation value at site (n ∈ N, c ∈ C<sup>i</sup>, y ∈ H<sup>i</sup>, x ∈ W<sup>i</sup>) is
</p>
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation1.png" width="400" />
            </td>
        </tr>
    </tbody>
</table>
<p>
where h<sup>i</sup><sub>n,c,y,x</sub> is the activation at the site before normalization and µ<sup>i</sup><sub>c</sub>
and σ<sup>i</sup><sub>c</sub> are the mean and standard deviation of
the activations in channel c:
</p>
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation2.png" width="400" />
            </td>
            <td>
                <img src="./imgs/equation3.png" width="400" />
            </td>
        </tr>
    </tbody>
</table>
<p>
The variables γ<sup>i</sup><sub>c,y,x</sub>(m) and β<sup>i</sup><sub>c,y,x</sub>(m) in (1) are the
learned modulation parameters of the normalization layer.
In contrast to the BatchNorm [21], they depend on the input segmentation mask and vary with respect to the location
(y, x). We use the symbol γ<sup>i</sup><sub>c,y,x</sub> and β<sup>i</sup><sub>c,y,x</sub> to denote the
functions that convert m to the scaling and bias values at
the site (c, y, x) in the i-th activation map. We implement
the functions γ<sup>i</sup><sub>c,y,x</sub> and β<sup>i</sup><sub>c,y,x</sub> using a simple two-layer convolutional network, whose design is in the appendix.
</p>
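
<p>
The normalize-then-modulate step described above can be sketched in NumPy as follows. This is a minimal sketch, not the official implementation: the <code>eps</code> term is an added assumption for numerical stability, and the per-label lookup tables (<code>gamma_table</code>, <code>beta_table</code>, <code>lookup_modulation</code>) are a hypothetical stand-in for the two-layer convolutional network that produces γ and β in the paper.
</p>

```python
import numpy as np

def spade(h, gamma, beta, eps=1e-5):
    """Normalize h per channel over (N, H, W), then modulate per site.

    h, gamma, beta: arrays of shape (N, C, H, W).
    eps is an assumption added for numerical stability.
    """
    mu = h.mean(axis=(0, 2, 3), keepdims=True)                    # per-channel mean
    sigma = np.sqrt(h.var(axis=(0, 2, 3), keepdims=True) + eps)   # per-channel std
    return gamma * (h - mu) / sigma + beta

def lookup_modulation(table, m):
    """Hypothetical stand-in for the learned functions gamma(m) and beta(m).

    table: (C, num_labels), m: (N, H, W) integer labels -> (N, C, H, W).
    """
    return np.transpose(table[:, m], (1, 0, 2, 3))

rng = np.random.default_rng(0)
N, C, H, W, num_labels = 2, 4, 8, 8, 3
h = rng.normal(size=(N, C, H, W))                  # toy activations
m = rng.integers(0, num_labels, size=(N, H, W))    # toy segmentation masks
gamma_table = rng.normal(size=(C, num_labels))
beta_table = rng.normal(size=(C, num_labels))

out = spade(h, lookup_modulation(gamma_table, m), lookup_modulation(beta_table, m))
print(out.shape)  # (2, 4, 8, 8)
```
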
<p>
In fact, SPADE is related to, and is a generalization
of several existing normalization layers. First, replacing
the segmentation mask m with the image class label and
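making the modulation parameters spatially-invariant (i.e.,
γ<sup>i</sup><sub>c,y<sub>1</sub>,x<sub>1</sub></sub> ≡ γ<sup>i</sup><sub>c,y<sub>2</sub>,x<sub>2</sub></sub> and β<sup>i</sup><sub>c,y<sub>1</sub>,x<sub>1</sub></sub> ≡ β<sup>i</sup><sub>c,y<sub>2</sub>,x<sub>2</sub></sub> for any y<sub>1</sub>, y<sub>2</sub> ∈
{1, 2, ..., H<sup>i</sup>} and x<sub>1</sub>, x<sub>2</sub> ∈ {1, 2, ..., W<sup>i</sup>}), we arrive at the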
form of the Conditional BatchNorm [11]. Indeed, for any
spatially-invariant conditional data, our method reduces to
the Conditional BatchNorm. Similarly, we can arrive at
the AdaIN [19] by replacing m with a real image, making the modulation parameters spatially-invariant, and setting N = 1. As the modulation parameters are adaptive to
the input segmentation mask, the proposed SPADE is better
suited for semantic image synthesis.
</p>
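
<p>
The two special cases above can be checked numerically. The sketch below is illustrative only, with random values standing in for learned parameters and an assumed <code>eps</code> term: it shows that spatially-invariant modulation collapses to one scale and bias per sample and channel (the Conditional BatchNorm form), and that with N = 1 the batch statistics coincide with per-instance statistics (the AdaIN setting).
</p>

```python
import numpy as np

EPS = 1e-5  # assumed stability constant

def batch_normalize(h):
    """Per-channel normalization over (N, H, W), as in the SPADE formulation."""
    mu = h.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt(h.var(axis=(0, 2, 3), keepdims=True) + EPS)
    return (h - mu) / sigma

rng = np.random.default_rng(0)
N, C, H, W = 2, 3, 4, 4
h = rng.normal(size=(N, C, H, W))

# Spatially-invariant modulation: the same value at every (y, x).
gamma_vec = rng.normal(size=(N, C))   # stand-in for parameters inferred from a class label
beta_vec = rng.normal(size=(N, C))
gamma_map = np.broadcast_to(gamma_vec[:, :, None, None], h.shape)
beta_map = np.broadcast_to(beta_vec[:, :, None, None], h.shape)
spade_invariant = gamma_map * batch_normalize(h) + beta_map

# Conditional BatchNorm form: one scale and bias per sample and channel.
cond_bn = gamma_vec[:, :, None, None] * batch_normalize(h) + beta_vec[:, :, None, None]
print(np.allclose(spade_invariant, cond_bn))  # True

# With N = 1, per-channel batch statistics equal per-instance statistics.
single = h[:1]
inst_mu = single.mean(axis=(2, 3), keepdims=True)
inst_sigma = np.sqrt(single.var(axis=(2, 3), keepdims=True) + EPS)
print(np.allclose(batch_normalize(single), (single - inst_mu) / inst_sigma))  # True
```
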
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/figure3.png" width="400" />
            </td>
        </tr>
    </tbody>
</table>
<p>
<strong>SPADE generator.</strong> With the SPADE, there is no need to
feed the segmentation map to the first layer of the generator, since the learned modulation parameters have encoded
enough information about the label layout. Therefore, we
discard the encoder part of the generator, which is commonly
used in recent architectures [22, 48]. This simplification results in a more lightweight network. Furthermore, similarly
to existing class-conditional generators [36,39,54], the new
generator can take a random vector as input, enabling a simple and natural way for multi-modal synthesis [20, 60].
</p>
<p>
Figure 4 illustrates our generator architecture, which employs several ResNet blocks [15] with upsampling layers.
The modulation parameters of all the normalization layers
are learned using the SPADE. Since each residual block
operates at a different scale, we downsample the semantic
mask to match the spatial resolution.
</p>
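
<p>
Downsampling the mask to each block's resolution could look like the sketch below. Nearest-neighbor sampling is an assumption here, chosen because it keeps every output entry a valid integer label; the text above does not name the resampling method, and <code>downsample_mask</code> is a hypothetical helper.
</p>

```python
import numpy as np

def downsample_mask(m, out_h, out_w):
    """Nearest-neighbor resize of an (H, W) integer label mask (hypothetical helper)."""
    in_h, in_w = m.shape
    rows = (np.arange(out_h) * in_h) // out_h   # source row for each output row
    cols = (np.arange(out_w) * in_w) // out_w   # source column for each output column
    return m[rows[:, None], cols[None, :]]

# Toy 8x8 mask made of four labeled quadrants
m = np.zeros((8, 8), dtype=np.int64)
m[:4, 4:] = 1
m[4:, :4] = 2
m[4:, 4:] = 3

# One mask per (assumed) block resolution, from coarse to fine
pyramid = {size: downsample_mask(m, size, size) for size in (2, 4, 8)}
print(pyramid[2])
```
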
<p>
We train the generator with the same multi-scale discriminator and loss function used in pix2pixHD [48] except that
we replace the least squared loss term [34] with the hinge
loss term [31,38,54]. We test several ResNet-based discriminators used in recent unconditional GANs [1, 36, 39] but
observe similar results at the cost of a higher GPU memory requirement. Adding the SPADE to the discriminator
also yields a similar performance. For the loss function, we
observe that removing any loss term in the pix2pixHD loss
function leads to degraded generation results.
</p>
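
<p>
For reference, the hinge loss term in its commonly used GAN form can be sketched as below. This is an assumption about the exact formulation, since the text above only names the loss. The remaining terms of the pix2pixHD objective and the multi-scale averaging are not shown, and the function names are hypothetical.
</p>

```python
import numpy as np

def d_hinge_loss(real_logits, fake_logits):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    loss_real = np.mean(np.maximum(0.0, 1.0 - real_logits))
    loss_fake = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return loss_real + loss_fake

def g_hinge_loss(fake_logits):
    """Generator adversarial term: raise the discriminator score on synthesized images."""
    return -np.mean(fake_logits)

real_logits = np.array([2.0, 0.5])    # toy discriminator outputs on real images
fake_logits = np.array([-2.0, 0.0])   # toy discriminator outputs on synthesized images
print(d_hinge_loss(real_logits, fake_logits))  # 0.75
print(g_hinge_loss(fake_logits))               # 1.0
```
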
<p>
<strong>Why does the SPADE work better?</strong> A short answer is that
it can better preserve semantic information against common
normalization layers. Specifically, while normalization layers such as the InstanceNorm [46] are essential pieces in
almost all the state-of-the-art conditional image synthesis
models [48], they tend to wash away semantic information
when applied to uniform or flat segmentation masks.
</p>
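
<p>
The "washing away" effect can be reproduced with a toy example. The sketch below assumes that activations computed from a uniform mask are themselves uniform (a convolution over a constant input, ignoring borders, gives a constant output). The label names and the γ and β values are hypothetical. After InstanceNorm the two labels become indistinguishable, while a label-dependent modulation keeps them apart.
</p>

```python
import numpy as np

def instance_norm(h, eps=1e-5):
    """Normalize each (sample, channel) map over its spatial dimensions."""
    mu = h.mean(axis=(2, 3), keepdims=True)
    var = h.var(axis=(2, 3), keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

# Uniform activations standing in for two different single-label masks
act_sky = np.full((1, 1, 4, 4), 3.0)
act_grass = np.full((1, 1, 4, 4), 7.0)

norm_sky = instance_norm(act_sky)
norm_grass = instance_norm(act_grass)
print(np.allclose(norm_sky, norm_grass))  # True: both maps collapse to zeros

# Label-dependent modulation (hypothetical values) restores the distinction
gamma = {"sky": 1.0, "grass": 2.0}
beta = {"sky": 0.5, "grass": -1.2}
spade_sky = gamma["sky"] * norm_sky + beta["sky"]
spade_grass = gamma["grass"] * norm_grass + beta["grass"]
print(np.allclose(spade_sky, spade_grass))  # False
```
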
---
(1) Conditional normalization [11, 19] uses external data to denormalize
the normalized activations; i.e., the denormalization part is conditional.
