# **Instance Normalization: The Missing Ingredient for Fast Stylization**

**Authors: Dmitry Ulyanov [dmitry.ulyanov@skoltech.ru], Andrea Vedaldi [vedaldi@robots.ox.ac.uk], Victor Lempitsky [lempitsky@skoltech.ru]**

**Original Paper**: https://arxiv.org/pdf/1607.08022.pdf

**Official Github**: https://github.com/DmitryUlyanov/texture_nets

---

**Edited By Su Hyung Choi (Key Summary & Code Practice)**

If you have any issues with these scripts, please open a PR on the repository below.

**[Github: @JonyChoi - Computer Vision Paper Reviews]** https://github.com/jonychoi/Computer-Vision-Paper-Reviews

Edited Jan 15 2022

---

### **Abstract**

In this paper we revisit the fast stylization method introduced in Ulyanov et al.
(2016). We show how a small change in the stylization architecture results in a
significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normalization, and applying
the latter both at training and testing times. The resulting method can be used to
train high-performance architectures for real-time image generation. The code is
available at https://github.com/DmitryUlyanov/texture_nets. Full paper can be found at https://arxiv.org/abs/1701.02096.

### **1 Introduction**
<p>
The recent work of Gatys et al. (2016) introduced a method for transferring a style from an image
onto another one, as demonstrated in fig. 1. The stylized image matches simultaneously selected
statistics of the style image and of the content image. Both style and content statistics are obtained
from a deep convolutional network pre-trained for image classification. The style statistics are extracted from shallower layers and averaged across spatial locations whereas the content statistics are
extracted from deeper layers and preserve spatial information. In this manner, the style statistics
capture the “texture” of the style image whereas the content statistics capture the “structure” of the
content image.
</p>
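
<p>
To make the distinction between the two kinds of statistics concrete, here is a minimal sketch in plain Python. It assumes the style statistic is a Gram matrix of channel correlations averaged over spatial positions, as in Gatys et al. The function name <code>gram_matrix</code>, the toy numbers, and the division by the number of positions are illustrative assumptions, not code from the paper.
</p>

```python
def gram_matrix(features):
    """Channel-by-channel correlations, averaged over spatial positions.

    features is a list of C channels, each a flat list of H*W activations.
    """
    num_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / num_positions
         for ch_j in features]
        for ch_i in features
    ]


# Toy feature map: 2 channels over 4 spatial positions (made-up numbers).
features = [[1.0, 2.0, 3.0, 4.0],
            [0.0, 1.0, 0.0, 1.0]]

# Move the spatial positions around, identically in every channel.
order = [2, 0, 3, 1]
shuffled = [[channel[p] for p in order] for channel in features]

style_original = gram_matrix(features)
style_shuffled = gram_matrix(shuffled)

print(style_original == style_shuffled)  # style statistic ignores the layout
print(features == shuffled)              # raw (content-like) features do not
```

<p>
Averaging over positions discards where each activation occurred, which is why such a statistic describes texture, while the raw feature map keeps the layout and therefore describes structure.
</p>
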
<p>
Although the method of Gatys et al. produces remarkably good results, it is computationally inefficient. The stylized image is, in fact, obtained by iterative optimization until it matches the desired
statistics. In practice, it takes several minutes to stylize an image of size 512 × 512. Two recent
works, Ulyanov et al. (2016) and Johnson et al. (2016), sought to address this problem by learning
equivalent feed-forward generator networks that can generate the stylized image in a single pass.
These two methods differ mainly by the details of the generator architecture and produce results of
a comparable quality; however, neither achieved as good results as the slower optimization-based
method of Gatys et al.
</p>
<img src="./imgs/figure1.png" />
<img src="./imgs/figure2.png" />
<p>
In this paper we revisit the method for feed-forward stylization of Ulyanov et al. (2016) and show
that a small change in the generator architecture leads to much improved results. The results are in
fact comparable in quality to the slow optimization method of Gatys et al. but can be obtained in
real time on standard GPU hardware. The key idea (section 2) is to replace batch normalization layers in the generator architecture with instance normalization layers, and to keep them at test
time (as opposed to freezing and simplifying them out, as is done for batch normalization). Intuitively,
the normalization process makes it possible to remove instance-specific contrast information from the content
image, which simplifies generation. In practice, this results in vastly improved images (section 3).
</p>
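
<p>
The intuition about contrast removal can be illustrated with a small sketch in plain Python. It assumes the usual definitions: instance normalization computes a mean and variance per image and per channel over spatial positions, while batch normalization pools them over the whole batch. The function names, the <code>eps</code> value, and the toy inputs are illustrative assumptions, and learnable scale and shift parameters are left out.
</p>

```python
import math


def mean_and_variance(values):
    """Return the mean and the (biased) variance of a flat list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def normalize(values, mean, variance, eps):
    """Shift and scale values using the given mean and variance."""
    scale = math.sqrt(variance + eps)
    return [(v - mean) / scale for v in values]


def instance_norm(batch, eps=1e-5):
    """Normalize every (image, channel) pair with its own spatial statistics.

    batch[n][c] is a flat list holding the H*W activations of channel c
    in image n.
    """
    return [
        [normalize(channel, *mean_and_variance(channel), eps) for channel in image]
        for image in batch
    ]


def batch_norm(batch, eps=1e-5):
    """Normalize each channel with statistics pooled over the whole batch."""
    num_channels = len(batch[0])
    stats = []
    for c in range(num_channels):
        pooled = [v for image in batch for v in image[c]]
        stats.append(mean_and_variance(pooled))
    return [
        [normalize(image[c], *stats[c], eps) for c in range(num_channels)]
        for image in batch
    ]


# Two single-channel "images" with the same pattern but different contrast
# and brightness (made-up numbers): high = 4 * low + 10.
low_contrast = [[0.0, 1.0, 2.0, 3.0]]
high_contrast = [[10.0, 14.0, 18.0, 22.0]]
batch = [low_contrast, high_contrast]

inst = instance_norm(batch)
bn = batch_norm(batch)

print(inst[0][0])  # instance norm: both images map to nearly the same values
print(inst[1][0])
print(bn[0][0])    # batch norm: the contrast difference is still visible
print(bn[1][0])
```

<p>
In this toy case the two instance-normalized outputs agree up to the small effect of <code>eps</code>, so a generator placed after the layer would no longer see the contrast difference between the inputs. The batch-normalized outputs still depend on which other images share the batch.
</p>
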