# **Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization**

**Authors: Xun Huang, Serge Belongie - Department of Computer Science & Cornell Tech, Cornell University, {xh258,sjb344}@cornell.edu**

**Official Github**: https://github.com/xunhuang1995/AdaIN-style

**Original Paper**: https://arxiv.org/pdf/1703.06868.pdf

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you find any issues with these scripts, please open a PR on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 16 2022

---

### **Abstract**


<p><i>Gatys et al. recently introduced a neural algorithm that
renders a content image in the style of another image,
achieving so-called style transfer. However, their framework requires a slow iterative optimization process, which
limits its practical application. Fast approximations with
feed-forward neural networks have been proposed to speed
up neural style transfer. Unfortunately, the speed improvement comes at a cost: the network is usually tied to a fixed
set of styles and cannot adapt to arbitrary new styles. In this
paper, we present a simple yet effective approach that for the
first time enables arbitrary style transfer in real-time. At the
heart of our method is a novel adaptive instance normalization (AdaIN) layer that aligns the mean and variance of the
content features with those of the style features. Our method
achieves speed comparable to the fastest existing approach,
without the restriction to a pre-defined set of styles. In addition, our approach allows flexible user controls such as
content-style trade-off, style interpolation, color & spatial
controls, all using a single feed-forward neural network.
</i>
</p>


### **1. Introduction**
<p>
The seminal work of Gatys et al. [16] showed that deep
neural networks (DNNs) encode not only the content but
also the style information of an image. Moreover, the image style and content are somewhat separable: it is possible
to change the style of an image while preserving its content. The style transfer method of [16] is flexible enough to
combine content and style of arbitrary images. However, it
relies on an optimization process that is prohibitively slow.
</p>
<p>
Significant effort has been devoted to accelerating neural
style transfer. [24, 51, 31] attempted to train feed-forward
neural networks that perform stylization with a single forward pass. A major limitation of most feed-forward methods is that each network is restricted to a single style. There
are some recent works addressing this problem, but they are
either still limited to a finite set of styles [11, 32, 55, 5], or
much slower than the single-style transfer methods [6].
</p>
<p>
    In this work, we present the first neural style transfer
algorithm that resolves this fundamental flexibility-speed
dilemma. Our approach can transfer arbitrary new styles
in real-time, combining the flexibility of the optimization-based framework [16] and the speed similar to the fastest
feed-forward approaches [24, 52]. Our method is inspired
by the instance normalization (IN) [52, 11] layer, which
is surprisingly effective in feed-forward style transfer. To
explain the success of instance normalization, we propose
a new interpretation that instance normalization performs
style normalization by normalizing feature statistics, which
have been found to carry the style information of an image [16, 30, 33]. Motivated by our interpretation, we introduce a simple extension to IN, namely adaptive instance
normalization (AdaIN). Given a content input and a style
input, AdaIN simply adjusts the mean and variance of the
content input to match those of the style input. Through
experiments, we find AdaIN effectively combines the content of the former and the style of the latter by transferring feature
statistics. A decoder network is then learned to generate the
final stylized image by inverting the AdaIN output back to
the image space. Our method is nearly three orders of magnitude faster than [16], without sacrificing the flexibility of
transferring inputs to arbitrary new styles. Furthermore, our
approach provides abundant user controls at runtime, without any modification to the training process.
</p>
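
To make the AdaIN idea described above concrete, here is a minimal NumPy sketch. It assumes a single feature map of shape (C, H, W) rather than a batch, and it operates on random arrays instead of real encoder features. The function name `adain`, the `eps` stabilizer, and the demo inputs are illustrative assumptions and are not taken from the official implementation.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Align the per-channel mean and std of `content` with those of `style`.

    Both inputs are feature maps of shape (C, H, W); their spatial sizes may differ.
    `eps` is an assumed stabilizer that avoids division by zero.
    """
    # Per-channel statistics, computed over the spatial dimensions only.
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Normalize the content features, then rescale and shift with the style statistics.
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical stand-ins for encoder features.
rng = np.random.default_rng(0)
content_feat = rng.normal(0.0, 1.0, size=(4, 8, 8))
style_feat = rng.normal(3.0, 2.0, size=(4, 6, 6))
stylized = adain(content_feat, style_feat)
```

After the call, each channel of `stylized` keeps the spatial layout of `content_feat` but carries the mean and (up to `eps`) the standard deviation of the matching channel in `style_feat`.
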


### **2. Related Work**

<p><strong>Style transfer.</strong> The problem of style transfer has its origins
in non-photo-realistic rendering [28], and is closely related to texture synthesis and transfer [13, 12, 14]. Some
early approaches include histogram matching on linear filter responses [19] and non-parametric sampling [12, 15].
These methods typically rely on low-level statistics and often fail to capture semantic structures. Gatys et al. [16] for
the first time demonstrated impressive style transfer results
by matching feature statistics in convolutional layers of a
DNN. Recently, several improvements to [16] have been
proposed. Li and Wand [30] introduced a framework based
on Markov random field (MRF) in the deep feature space to
enforce local patterns. Gatys et al. [17] proposed ways to
control the color preservation, the spatial location, and the
scale of style transfer. Ruder et al. [45] improved the quality of video style transfer by imposing temporal constraints.
</p>
<p>
The framework of Gatys et al. [16] is based on a slow
optimization process that iteratively updates the image to
minimize a content loss and a style loss computed by a loss
network. It can take minutes to converge even with modern GPUs. On-device processing in mobile applications is
therefore too slow to be practical. A common workaround
is to replace the optimization process with a feed-forward
neural network that is trained to minimize the same objective [24, 51, 31]. These feed-forward style transfer approaches are about three orders of magnitude faster than
the optimization-based alternative, opening the door to real-time applications. Wang et al. [53] enhanced the granularity
of feed-forward style transfer with a multi-resolution architecture. Ulyanov et al. [52] proposed ways to improve the
quality and diversity of the generated samples. However,
the above feed-forward methods are limited in the sense that
each network is tied to a fixed style. To address this problem, Dumoulin et al. [11] introduced a single network that
is able to encode 32 styles and their interpolations. Concurrent to our work, Li et al. [32] proposed a feed-forward
architecture that can synthesize up to 300 textures and transfer 16 styles. Still, the two methods above cannot adapt to
arbitrary styles that are not observed during training.
</p>
<p>
Very recently, Chen and Schmidt [6] introduced a feed-forward method that can transfer arbitrary styles thanks to
a style swap layer. Given feature activations of the content
and style images, the style swap layer replaces the content
features with the closest-matching style features in a patch-by-patch manner. Nevertheless, their style swap layer creates a new computational bottleneck: more than 95% of the
computation is spent on the style swap for 512 × 512 input
images. Our approach also permits arbitrary style transfer,
while being 1-2 orders of magnitude faster than [6].
</p>
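
The patch-by-patch replacement described above can be sketched roughly as follows. This is a simplified, loop-based illustration and not the implementation of [6]: the cosine-style similarity measure, the averaging of overlapping patches, the patch size `k=3`, and all function names are assumptions made for the example.

```python
import numpy as np

def extract_patches(feat, k):
    """Collect every k x k patch of a (C, H, W) feature map as a flat vector."""
    C, H, W = feat.shape
    patches, coords = [], []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patches.append(feat[:, i:i + k, j:j + k].ravel())
            coords.append((i, j))
    return np.stack(patches), coords

def style_swap(content, style, k=3, eps=1e-8):
    """Replace each content patch with its closest-matching style patch.

    Assumes both feature maps have the same channel count and are at least k x k.
    """
    C, H, W = content.shape
    c_patches, c_coords = extract_patches(content, k)
    s_patches, _ = extract_patches(style, k)
    # Normalize style patches so the dot product ranks them by cosine similarity.
    s_unit = s_patches / (np.linalg.norm(s_patches, axis=1, keepdims=True) + eps)
    best = (c_patches @ s_unit.T).argmax(axis=1)
    # Paste the chosen style patches back and average where patches overlap.
    out = np.zeros_like(content, dtype=float)
    count = np.zeros((1, H, W))
    for (i, j), idx in zip(c_coords, best):
        out[:, i:i + k, j:j + k] += s_patches[idx].reshape(C, k, k)
        count[:, i:i + k, j:j + k] += 1
    return out / count

# Hypothetical stand-ins for content and style feature activations.
rng = np.random.default_rng(0)
content_feat = rng.normal(size=(2, 5, 5))
style_feat = rng.normal(size=(2, 6, 6))
swapped = style_swap(content_feat, style_feat)
```

Even in this toy form, the similarity matrix has one entry per (content patch, style patch) pair, which hints at why the matching step can dominate the computation for large inputs.
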
<p>
Another central problem in style transfer is which style
loss function to use. The original framework of Gatys et
al. [16] matches styles by matching the second-order statistics between feature activations, captured by the Gram matrix. Other effective loss functions have been proposed,
such as MRF loss [30], adversarial loss [31], histogram
loss [54], CORAL loss [41], MMD loss [33], and distance
between channel-wise mean and variance [33]. Note that all
the above loss functions aim to match some feature statistics
between the style image and the synthesized image.
</p>
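
As an illustration of matching second-order statistics with the Gram matrix, here is a small NumPy sketch. The normalization by the number of spatial positions, the sum-of-squares distance, and the function names are assumptions chosen for the example; published implementations differ in these details.

```python
import numpy as np

def gram_matrix(feat):
    """Channel-by-channel correlations of a (C, H, W) feature map."""
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)
    # Dividing by H * W is an assumed normalization so maps of different sizes are comparable.
    return flat @ flat.T / (H * W)

def gram_style_loss(generated, style):
    """Squared distance between the Gram matrices of two feature maps."""
    diff = gram_matrix(generated) - gram_matrix(style)
    return float(np.sum(diff ** 2))

# Hypothetical feature maps standing in for loss-network activations.
rng = np.random.default_rng(0)
style_feat = rng.normal(size=(3, 4, 4))
other_feat = rng.normal(size=(3, 4, 4))

# Shuffling spatial positions leaves the Gram matrix unchanged (up to rounding).
perm = rng.permutation(16)
shuffled_feat = style_feat.reshape(3, 16)[:, perm].reshape(3, 4, 4)
```

Because the Gram matrix sums over spatial positions, `gram_style_loss(style_feat, shuffled_feat)` is essentially zero: the loss sees only feature statistics and not spatial arrangement, which is consistent with the observation above that these losses all match feature statistics.
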
<p>
<strong>Deep generative image modeling.</strong> There are several alternative frameworks for image generation, including variational auto-encoders [27], auto-regressive models [40], and
generative adversarial networks (GANs) [18]. Remarkably,
GANs have achieved the most impressive visual quality.
Various improvements to the GAN framework have been
proposed, such as conditional generation [43, 23], multi-stage processing [9, 20], and better training objectives [46,
1]. GANs have also been applied to style transfer [31] and cross-domain image generation [50, 3, 23, 38, 37, 25].
</p>

### **3. Background**

#### 3.1. Batch Normalization

<p>
The seminal work of Ioffe and Szegedy [22] introduced
a batch normalization (BN) layer that significantly eases the
training of feed-forward networks by normalizing feature
statistics. BN layers were originally designed to accelerate training of discriminative networks, but have also been
found effective in generative image modeling [42]. Given
an input batch x ∈ ℝ<sup>N×C×H×W</sup>, BN normalizes the mean
and standard deviation for each individual feature channel:
</p>