# **Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization**

**Authors: Xun Huang, Serge Belongie - Department of Computer Science & Cornell Tech, Cornell University, {xh258,sjb344}@cornell.edu**

**Official Github**: https://github.com/xunhuang1995/AdaIN-style

**Original Paper**: https://arxiv.org/pdf/1703.06868.pdf

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with these scripts, please open a PR in the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 16 2022

---

### **Abstract**


<p><i>Gatys et al. recently introduced a neural algorithm that
renders a content image in the style of another image,
achieving so-called style transfer. However, their framework requires a slow iterative optimization process, which
limits its practical application. Fast approximations with
feed-forward neural networks have been proposed to speed
up neural style transfer. Unfortunately, the speed improvement comes at a cost: the network is usually tied to a fixed
set of styles and cannot adapt to arbitrary new styles. In this
paper, we present a simple yet effective approach that for the
first time enables arbitrary style transfer in real-time. At the
heart of our method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the
content features with those of the style features. Our method
achieves speed comparable to the fastest existing approach,
without the restriction to a pre-defined set of styles. In addition, our approach allows flexible user controls such as
content-style trade-off, style interpolation, color & spatial
controls, all using a single feed-forward neural network.
</i>
</p>


### **1. Introduction**
<p>
The seminal work of Gatys et al. [16] showed that deep
neural networks (DNNs) encode not only the content but
also the style information of an image. Moreover, the image style and content are somewhat separable: it is possible
to change the style of an image while preserving its content. The style transfer method of [16] is flexible enough to
combine content and style of arbitrary images. However, it
relies on an optimization process that is prohibitively slow.
</p>
<p>
Significant effort has been devoted to accelerating neural
style transfer. [24, 51, 31] attempted to train feed-forward
neural networks that perform stylization with a single forward pass. A major limitation of most feed-forward methods is that each network is restricted to a single style. There
are some recent works addressing this problem, but they are
either still limited to a finite set of styles [11, 32, 55, 5], or
much slower than the single-style transfer methods [6].
</p>
<p>
    In this work, we present the first neural style transfer
algorithm that resolves this fundamental flexibility-speed
dilemma. Our approach can transfer arbitrary new styles
in real-time, combining the flexibility of the optimization-based framework [16] and the speed similar to the fastest
feed-forward approaches [24, 52]. Our method is inspired
by the instance normalization (IN) [52, 11] layer, which
is surprisingly effective in feed-forward style transfer. To
explain the success of instance normalization, we propose
a new interpretation that instance normalization performs
style normalization by normalizing feature statistics, which
have been found to carry the style information of an image [16, 30, 33]. Motivated by our interpretation, we introduce a simple extension to IN, namely adaptive instance
normalization (AdaIN). Given a content input and a style
input, AdaIN simply adjusts the mean and variance of the
content input to match those of the style input. Through
experiments, we find AdaIN effectively combines the content of the former and the style of the latter by transferring feature
statistics. A decoder network is then learned to generate the
final stylized image by inverting the AdaIN output back to
the image space. Our method is nearly three orders of magnitude faster than [16], without sacrificing the flexibility of
transferring inputs to arbitrary new styles. Furthermore, our
approach provides abundant user controls at runtime, without any modification to the training process.
</p>


### **2. Related Work**

<p><strong>Style transfer.</strong> The problem of style transfer has its origin
from non-photo-realistic rendering [28], and is closely related to texture synthesis and transfer [13, 12, 14]. Some
early approaches include histogram matching on linear filter responses [19] and non-parametric sampling [12, 15].
These methods typically rely on low-level statistics and often fail to capture semantic structures. Gatys et al. [16] for
the first time demonstrated impressive style transfer results
by matching feature statistics in convolutional layers of a
DNN. Recently, several improvements to [16] have been
proposed. Li and Wand [30] introduced a framework based
on a Markov random field (MRF) in the deep feature space to
enforce local patterns. Gatys et al. [17] proposed ways to
control the color preservation, the spatial location, and the
scale of style transfer. Ruder et al. [45] improved the quality of video style transfer by imposing temporal constraints.
</p>
<p>
The framework of Gatys et al. [16] is based on a slow
optimization process that iteratively updates the image to
minimize a content loss and a style loss computed by a loss
network. It can take minutes to converge even with modern GPUs. On-device processing in mobile applications is
therefore too slow to be practical. A common workaround
is to replace the optimization process with a feed-forward
neural network that is trained to minimize the same objective [24, 51, 31]. These feed-forward style transfer approaches are about three orders of magnitude faster than
the optimization-based alternative, opening the door to real-time applications. Wang et al. [53] enhanced the granularity
of feed-forward style transfer with a multi-resolution architecture. Ulyanov et al. [52] proposed ways to improve the
quality and diversity of the generated samples. However,
the above feed-forward methods are limited in the sense that
each network is tied to a fixed style. To address this problem, Dumoulin et al. [11] introduced a single network that
is able to encode 32 styles and their interpolations. Concurrent to our work, Li et al. [32] proposed a feed-forward
architecture that can synthesize up to 300 textures and transfer 16 styles. Still, the two methods above cannot adapt to
arbitrary styles that are not observed during training.
</p>
<p>
Very recently, Chen and Schmidt [6] introduced a feed-forward method that can transfer arbitrary styles thanks to
a style swap layer. Given feature activations of the content
and style images, the style swap layer replaces the content
features with the closest-matching style features in a patch-by-patch manner. Nevertheless, their style swap layer creates a new computational bottleneck: more than 95% of the
computation is spent on the style swap for 512 × 512 input
images. Our approach also permits arbitrary style transfer,
while being 1-2 orders of magnitude faster than [6].
</p>
<p>
Another central problem in style transfer is which style
loss function to use. The original framework of Gatys et
al. [16] matches styles by matching the second-order statistics between feature activations, captured by the Gram matrix. Other effective loss functions have been proposed,
such as MRF loss [30], adversarial loss [31], histogram
loss [54], CORAL loss [41], MMD loss [33], and distance
between channel-wise mean and variance [33]. Note that all
the above loss functions aim to match some feature statistics
between the style image and the synthesized image.
</p>
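
The two kinds of style statistics mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not the loss code of any cited paper: the function names, the `1 / (H * W)` Gram normalization, and the random stand-in feature maps are all assumptions made for the example. It also shows that both losses ignore spatial layout, since shuffling pixel positions leaves them unchanged.

```python
import numpy as np

def gram_matrix(feat):
    """Second-order statistics of a feature map of shape (C, H, W)."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    # Dividing by H * W is an assumed normalization; papers differ on this.
    return f @ f.T / (h * w)

def gram_style_loss(feat_a, feat_b):
    """Mean squared difference between the two Gram matrices."""
    return np.mean((gram_matrix(feat_a) - gram_matrix(feat_b)) ** 2)

def mean_std_style_loss(feat_a, feat_b):
    """Squared distance between channel-wise means and standard deviations."""
    mu_a, mu_b = feat_a.mean(axis=(1, 2)), feat_b.mean(axis=(1, 2))
    sd_a, sd_b = feat_a.std(axis=(1, 2)), feat_b.std(axis=(1, 2))
    return np.sum((mu_a - mu_b) ** 2) + np.sum((sd_a - sd_b) ** 2)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 6, 6))          # stand-in for DNN activations
perm = rng.permutation(36)
# Same activations, scrambled spatial positions (same permutation per channel).
shuffled = feat.reshape(8, 36)[:, perm].reshape(8, 6, 6)
# A feature map with clearly different statistics.
other = 2.0 * rng.normal(size=(8, 6, 6)) + 1.0

loss_shuffled_gram = gram_style_loss(feat, shuffled)
loss_other_gram = gram_style_loss(feat, other)
loss_shuffled_stats = mean_std_style_loss(feat, shuffled)
loss_other_stats = mean_std_style_loss(feat, other)
```

Both losses are close to zero for the shuffled copy and clearly positive for the map with different statistics.
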
<p>
<strong>Deep generative image modeling.</strong> There are several alternative frameworks for image generation, including variational auto-encoders [27], auto-regressive models [40], and
generative adversarial networks (GANs) [18]. Remarkably,
GANs have achieved the most impressive visual quality.
Various improvements to the GAN framework have been
proposed, such as conditional generation [43, 23], multi-stage processing [9, 20], and better training objectives [46,
1]. GANs have also been applied to style transfer [31] and cross-domain image generation [50, 3, 23, 38, 37, 25].
</p>

### **3. Background**

#### 3.1. Batch Normalization

<p>
The seminal work of Ioffe and Szegedy [22] introduced
a batch normalization (BN) layer that significantly eases the
training of feed-forward networks by normalizing feature
statistics. BN layers were originally designed to accelerate training of discriminative networks, but have also been
found effective in generative image modeling [42]. Given
an input batch x ∈ ℝ<sup>N×C×H×W</sup>, BN normalizes the mean
and standard deviation for each individual feature channel:
</p>
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation1.png" width="300" />
            </td>
        </tr>
    </tbody>
</table>
<p>
    where γ, β ∈ ℝ<sup>C</sup> are affine parameters learned from data;
µ(x), σ(x) ∈ ℝ<sup>C</sup> are the mean and standard deviation,
computed across batch size and spatial dimensions independently for each feature channel:
</p>
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation2.png" width="350" />
            </td>
            <td>
                <img src="./imgs/equation3.png" width="300" />
            </td>
        </tr>
    </tbody>
</table>
<p>
BN uses mini-batch statistics during training and replaces
them with population statistics during inference, introducing
a discrepancy between training and inference. Batch renormalization [21] was recently proposed to address this issue
by gradually using population statistics during training. As
another interesting application of BN, Li et al. [34] found
that BN can alleviate domain shifts by recomputing population
statistics in the target domain. Recently, several alternative
normalization schemes have been proposed to extend BN’s
effectiveness to recurrent architectures [35, 2, 47, 8, 29, 44].
</p>
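
As a rough illustration of the BN equations above, the following NumPy sketch computes the training-mode statistics over the batch and spatial axes. The function name `batch_norm`, the `eps` value, and the random input are assumptions for the example; population (running) statistics for inference are deliberately left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode BN for x of shape (N, C, H, W); gamma and beta have shape (C,)."""
    # One mean and one standard deviation per channel,
    # shared by every sample in the batch: reduce over N, H and W.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)                      # (1, C, 1, 1)
    sigma = np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)     # (1, C, 1, 1)
    return gamma.reshape(1, -1, 1, 1) * (x - mu) / sigma + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 8, 8))
out = batch_norm(x, gamma=np.ones(3), beta=np.zeros(3))

# With gamma = 1 and beta = 0, each channel is standardized over the whole batch.
bn_channel_mean = out.mean(axis=(0, 2, 3))
bn_channel_std = out.std(axis=(0, 2, 3))
```
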

#### 3.2. Instance Normalization
<p>
In the original feed-forward stylization method [51], the
style transfer network contains a BN layer after each convolutional layer. Surprisingly, Ulyanov et al. [52] found
that significant improvement could be achieved simply by
replacing BN layers with IN layers:
</p>
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation4.png" width="300" />
            </td>
        </tr>
    </tbody>
</table>
<p>
Different from BN layers, here µ(x) and σ(x) are computed across spatial dimensions independently for each
channel <i>and each sample:</i>
</p>
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation5.png" width="300" />
            </td>
        </tr>
    </tbody>
</table>
<img src="./imgs/figure1.png" />
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation6.png" width="300" />
            </td>
        </tr>
    </tbody>
</table>
<p>
Another difference is that IN layers are applied at test
time unchanged, whereas BN layers usually replace mini-batch statistics with population statistics.
</p>
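
A minimal NumPy sketch of IN, under the same assumptions as any hand-written example (the name `instance_norm` and the `eps` value are illustrative, not taken from the official code). Compared with BN, only the reduction axes change. Because each sample is normalized on its own, the result for one image does not depend on what else is in the batch, which is why the layer can be used unchanged at test time.

```python
import numpy as np

def instance_norm(x, gamma, beta, eps=1e-5):
    """IN for x of shape (N, C, H, W); gamma and beta have shape (C,)."""
    # One mean and one standard deviation per sample AND per channel:
    # reduce over the spatial axes H and W only.
    mu = x.mean(axis=(2, 3), keepdims=True)                      # (N, C, 1, 1)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)     # (N, C, 1, 1)
    return gamma.reshape(1, -1, 1, 1) * (x - mu) / sigma + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 8, 8))
gamma, beta = np.ones(3), np.zeros(3)

out_batch = instance_norm(x, gamma, beta)        # first sample processed inside a batch
out_single = instance_norm(x[:1], gamma, beta)   # first sample processed alone
```

Here `out_batch[:1]` and `out_single` agree, whereas the same comparison with training-mode BN would generally differ.
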

#### 3.3. Conditional Instance Normalization
Instead of learning a single set of affine parameters γ
and β, Dumoulin et al. [11] proposed a conditional instance
normalization (CIN) layer that learns a different set of parameters γ<sup>s</sup> and β<sup>s</sup> for each style s:
<table>
    <tbody>
        <tr>
            <td>
                <img src="./imgs/equation7.png" width="300" />
            </td>
        </tr>
    </tbody>
</table>
<p>
During training, a style image together with its index
s are randomly chosen from a fixed set of styles s ∈
{1, 2, ..., S} (S = 32 in their experiments). The content image is then processed by a style transfer network
in which the corresponding γ<sup>s</sup> and β<sup>s</sup> are used in the CIN
layers. Surprisingly, the network can generate images in
completely different styles by using the same convolutional
parameters but <i>different</i> affine parameters in IN layers.
</p>
<p>
Compared with a network without normalization layers,
a network with CIN layers requires 2FS additional parameters, where F is the total number of feature maps in the
network [11]. Since the number of additional parameters
scales linearly with the number of styles, it is challenging to
extend their method to model a large number of styles (e.g.,
tens of thousands). Also, their approach cannot adapt to
arbitrary new styles without re-training the network.
</p>
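
The idea of one (γ, β) pair per style can be sketched as a lookup table indexed by the style id. The class below is a hypothetical illustration, not the implementation of [11]: the class name, the initial values, and the hand-set shift for style 5 are assumptions. It covers a single layer with F = 64 feature maps and S = 32 styles, so it holds 2FS parameters for that layer.

```python
import numpy as np

class ConditionalInstanceNorm:
    """Hypothetical CIN layer: instance normalization with per-style affine parameters."""

    def __init__(self, num_channels, num_styles, eps=1e-5):
        # One row of gamma and one row of beta per style s.
        self.gamma = np.ones((num_styles, num_channels))
        self.beta = np.zeros((num_styles, num_channels))
        self.eps = eps

    def __call__(self, x, s):
        # Same statistics as IN: per sample and per channel.
        mu = x.mean(axis=(2, 3), keepdims=True)
        sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + self.eps)
        g = self.gamma[s].reshape(1, -1, 1, 1)
        b = self.beta[s].reshape(1, -1, 1, 1)
        return g * (x - mu) / sigma + b

    def num_parameters(self):
        return self.gamma.size + self.beta.size

cin = ConditionalInstanceNorm(num_channels=64, num_styles=32)
cin.beta[5] = 2.0   # pretend that style 5 has learned a different shift

x = np.random.default_rng(0).normal(size=(2, 64, 16, 16))
out_style0 = cin(x, 0)
out_style5 = cin(x, 5)   # same input and same statistics, different affine parameters
```

The parameter count grows linearly with `num_styles`, and a style that has no row in the table cannot be applied without training new parameters.
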

### **4. Interpreting Instance Normalization**

<p>
Despite the great success of (conditional) instance normalization, the reason why they work particularly well for
style transfer remains elusive. Ulyanov <i>et al</i>. [52] attribute the success of IN to its invariance to the contrast of the content image. However, IN takes place in the feature space,
therefore it should have more profound impacts than a simple contrast normalization in the pixel space. Perhaps even
more surprising is the fact that the affine parameters in IN
can completely change the style of the output image.
</p>
<p>
It has been known that the convolutional feature statistics
of a DNN can capture the style of an image [16, 30, 33].
While Gatys <i>et al</i>. [16] use the second-order statistics as
their optimization objective, Li <i>et al</i>. [33] recently showed
that matching many other statistics, including channel-wise
mean and variance, is also effective for style transfer. Motivated by these observations, we argue that instance normalization performs a form of <i>style normalization</i> by normalizing feature statistics, namely the mean and variance.
Although a DNN serves as an image <i>descriptor</i> in [16, 33], we
believe that the feature statistics of a <i>generator</i> network can
also control the style of the generated image.
</p>
<p>
We run the code of improved texture networks [52] to
perform single-style transfer, with IN or BN layers. As
expected, the model with IN converges faster than the BN
model (Fig. 1 (a)). To test the explanation in [52], we then
normalize all the training images to the same contrast by
performing histogram equalization on the luminance channel. As shown in Fig. 1 (b), IN remains effective, suggesting the explanation in [52] to be incomplete. To verify our hypothesis, we normalize all the training images to
the same style (different from the target style) using a pre-trained style transfer network provided by [24]. According
to Fig. 1 (c), the improvement brought by IN becomes much
smaller when images are already style normalized. The remaining gap can be explained by the fact that the style normalization with [24] is not perfect. Also, models with BN
trained on style normalized images can converge as fast as
models with IN trained on the original images. Our results
indicate that IN does perform a kind of style normalization.
</p>
<p>
Since BN normalizes the feature statistics of a batch of
samples instead of a single sample, it can be intuitively
understood as normalizing a batch of samples to be centered around a single style. Each single sample, however,
may still have different styles. This is undesirable when we
want to transfer all images to the same style, as is the case
in the original feed-forward style transfer algorithm [51].
Although the convolutional layers might learn to compensate the intra-batch style difference, it poses additional challenges for training. On the other hand, IN can normalize the
style of each individual sample to the target style. Training
is facilitated because the rest of the network can focus on
content manipulation while discarding the original style information. The reason behind the success of CIN also becomes clear: different affine parameters can normalize the
feature statistics to different values, thereby normalizing the
output image to different styles.
</p>
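
The difference described above can be seen in a toy experiment. This is only a sketch: the "styles" are simulated as per-sample offsets and contrasts of random feature maps, which is an assumption made for illustration and not the experiment of Fig. 1. Affine parameters are omitted, so only the normalization step is compared.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Standardize x using the mean and variance taken over the given axes."""
    mu = x.mean(axis=axes, keepdims=True)
    sigma = np.sqrt(x.var(axis=axes, keepdims=True) + eps)
    return (x - mu) / sigma

rng = np.random.default_rng(0)
# Two simulated "styles": samples 0-1 are dark and low-contrast,
# samples 2-3 are bright and high-contrast.
offsets = np.array([-2.0, -2.0, 4.0, 4.0]).reshape(4, 1, 1, 1)
scales = np.array([0.5, 0.5, 3.0, 3.0]).reshape(4, 1, 1, 1)
x = offsets + scales * rng.normal(size=(4, 2, 16, 16))

bn_out = normalize(x, axes=(0, 2, 3))   # BN-style: statistics shared across the batch
in_out = normalize(x, axes=(2, 3))      # IN-style: statistics per sample

# Per-sample, per-channel means after normalization, shape (N, C).
bn_sample_means = bn_out.mean(axis=(2, 3))
in_sample_means = in_out.mean(axis=(2, 3))

# How far apart the individual samples still are.
bn_spread = bn_sample_means.max() - bn_sample_means.min()
in_spread = in_sample_means.max() - in_sample_means.min()
```

After BN-style normalization the batch as a whole is centered, but the individual samples keep clearly different means (`bn_spread` is large). After IN-style normalization every sample has the same statistics (`in_spread` is close to zero).
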

### **5. Adaptive Instance Normalization**

<p>
If IN normalizes the input to a single style specified by
the affine parameters, is it possible to adapt it to arbitrarily
given styles by using adaptive affine transformations? Here,
we propose a simple extension to IN, which we call adaptive
instance normalization (AdaIN). AdaIN receives a content
input x and a style input y, and simply aligns the channel-wise mean and variance of x to match those of y. Unlike
BN, IN or CIN, AdaIN has no learnable affine parameters.
Instead, it adaptively computes the affine parameters from
the style input:
</p>
<img src="./imgs/equation8.png" width="300" />
<p>
in which we simply scale the normalized content input
with σ(y), and shift it with µ(y). Similar to IN, these statistics are computed across spatial locations.
</p>
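
The AdaIN operation described above can be sketched directly in NumPy. This is a minimal illustration rather than the official implementation: the function name `adain`, the `eps` value inside the square root, and the random arrays standing in for encoder features are assumptions. The content and style inputs may have different spatial sizes, because only their per-channel statistics interact.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """AdaIN sketch. content: (N, C, Hc, Wc); style: (N, C, Hs, Ws)."""
    # Per-sample, per-channel statistics over spatial locations, as in IN.
    mu_c = content.mean(axis=(2, 3), keepdims=True)
    sigma_c = np.sqrt(content.var(axis=(2, 3), keepdims=True) + eps)
    mu_s = style.mean(axis=(2, 3), keepdims=True)
    sigma_s = np.sqrt(style.var(axis=(2, 3), keepdims=True) + eps)
    # Normalize the content features, then scale by sigma(y) and shift by mu(y).
    # There are no learnable affine parameters.
    return sigma_s * (content - mu_c) / sigma_c + mu_s

rng = np.random.default_rng(0)
content = rng.normal(loc=0.0, scale=1.0, size=(1, 4, 16, 16))   # stand-in content features
style = rng.normal(loc=5.0, scale=3.0, size=(1, 4, 12, 20))     # stand-in style features

out = adain(content, style)
out_mean, out_std = out.mean(axis=(2, 3)), out.std(axis=(2, 3))
style_mean, style_std = style.mean(axis=(2, 3)), style.std(axis=(2, 3))
```

The output keeps the shape and spatial pattern of the content features, while its channel-wise mean and standard deviation match those of the style features (up to the small effect of `eps`).
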
<p>
Intuitively, let us consider a feature channel that detects
brushstrokes of a certain style. A style image with this kind
of strokes will produce a high average activation for this
feature. The output produced by AdaIN will have the same
high average activation for this feature, while preserving the
spatial structure of the content image. The brushstroke feature can be inverted to the image space with a feed-forward
decoder, similar to [10]. The variance of this feature channel can encode more subtle style information, which is also
transferred to the AdaIN output and the final output image.
</p>
<p>
    In short, AdaIN performs style transfer in the feature space by transferring feature statistics, specifically the
channel-wise mean and variance. Our AdaIN layer plays
a similar role as the style swap layer proposed in [6].
While the style swap operation is very time-consuming and
memory-consuming, our AdaIN layer is as simple as an IN
layer, adding almost no computational cost.
</p>
<img src="./imgs/figure2.png" width="500" />