# Introduction

The task is to produce a new image that has the content of image X and the style of a reference image Y.


- **Content** is the high-level semantic content of an image (what is in the picture)


- **Style** is the low-level texture of an image (how it is drawn)

<img src="img/setting.png" width=500>

Neural Style Transfer (NST) - the subclass of style transfer (ST) methods that use deep neural networks to accomplish this


- Texture synthesis - generation of textures. Naturally, it is a part of the style transfer task.


- Image analogy - we want to get image A' based on image A so that A relates to A' the same way that B relates to B'


- Photorealistic textures - 


- Non-photorealistic textures -




Good reference:
- https://arxiv.org/pdf/1705.04058.pdf


According to the taxonomy in this survey, methods are grouped into two classes:
- online - the target image is iteratively optimized, starting from an initial image
- offline - a generative model is trained in advance and then produces the target image in a single forward pass

<img src="img/taxonomy.png">


# A Neural Algorithm of Artistic Style

[[arxiv]](https://arxiv.org/pdf/1508.06576.pdf)

This is the central work by Gatys et al., released in 2015 (University of Tübingen and Max Planck Institute for Biological Cybernetics), that set the standard architecture for neural style transfer.

There is one pretrained backbone network (for example VGG) whose weights stay fixed.


From the network's point of view there are 3 inputs. One is optimized: our target image (of the original size). The other two are constants: the image whose content we want to copy (content) and the image whose style we want to copy (style).

<img src="img/architecture.png" width=500>


We define the content loss and the style loss separately:

### How do we define content?
It is simply the feature maps from a layer that is close to the output of the network.
* This way we can also search for "semantic synonyms" among images, that is, images with similar content.

### How we define Loss(content)
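$L_{content}$ is the sum of squared errors, taken over every pixel of every feature map, between the feature maps of the generated image and of the content image at a chosen layer $l$:

$$L_{content} = \frac{1}{2} \sum_{i,j} \big( F^l_{ij} - P^l_{ij} \big)^2$$

where $F^l$ and $P^l$ are the feature maps of the generated image and of the content image (notation of Gatys et al.). Note that here $i$ - feature, $j$ - pixel, $l$ - layer.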
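
As a rough illustration of the content loss, here is a minimal NumPy sketch. It assumes the feature maps of one layer have already been extracted and flattened to shape (features, pixels); the function name `content_loss` and the toy arrays are made up for this example, and the factor 1/2 follows the convention of Gatys et al.

```python
import numpy as np

def content_loss(generated_features, content_features):
    """Half the sum of squared differences between two feature maps of one layer.

    Both arguments are arrays of shape (num_features, num_pixels),
    indexed as [i, j] = [feature, pixel].
    """
    diff = generated_features - content_features
    return 0.5 * np.sum(diff ** 2)

# Toy feature maps (hypothetical values, 2 features x 2 pixels).
F = np.array([[1.0, 2.0], [3.0, 4.0]])   # generated image
P = np.array([[1.0, 0.0], [3.0, 1.0]])   # content image
loss = content_loss(F, P)                # 0.5 * (2^2 + 3^2) = 6.5
```

In an actual optimization only the generated image changes, so `content_features` would be computed once and kept constant.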

### How we define style
We define style as the matrix of correlations between features from some intermediate layer (or several intermediate layers). Strictly speaking, it is an uncentered covariance matrix, since the feature means are not subtracted.

Those matrices are also called Gram matrices, since each entry is a scalar product between a pair of features. If we flatten each 2D feature map into a 1D vector, the correlation between two features becomes the scalar product of the two vectors.
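
The flatten-then-dot-product description can be sketched in a few lines of NumPy. This is an illustration only: it assumes the features of one layer are given as an array of shape (channels, height, width), and `gram_matrix` is a name chosen for this example.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a stack of feature maps.

    features: array of shape (C, H, W).
    Returns a (C, C) matrix whose entry [i, j] is the scalar product
    of the flattened feature maps i and j.
    """
    c = features.shape[0]
    flat = features.reshape(c, -1)   # each 2D feature map becomes a 1D vector
    return flat @ flat.T

# Two toy 2x2 feature maps (hypothetical values).
features = np.array([[[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 1.0], [1.0, 1.0]]])
G = gram_matrix(features)            # [[2, 2], [2, 4]]
```

The result no longer depends on where in the image a pattern occurs, only on how strongly features respond and how often they respond together.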

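### How we define Loss(style)
$L_{style}$ is the sum of squared errors between two Gram matrices, that of the style image and that of the generated image. For a single layer $l$:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \big( G^l_{ij} - A^l_{ij} \big)^2, \qquad G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

where $G^l$ is the Gram matrix of the generated image and $A^l$ is the Gram matrix of the style image, $F^l$ is the feature map of the generated image, $N_l$ is the number of features and $M_l$ is the number of pixels per feature map ($i$, $j$ - features, $k$ - pixel, $l$ - layer).

$L_{style}$ is usually summed over multiple layers with weights $w_l$:

$$L_{style} = \sum_l w_l E_l$$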
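
A minimal NumPy sketch of the style loss summed over layers. It assumes the features of each layer are already available as (channels, height, width) arrays of equal shape for both images, and it uses the per-layer normalization by the number of features and pixels from Gatys et al.; the function names and toy data are illustrative.

```python
import numpy as np

def gram_matrix(features):
    """(C, H, W) feature maps -> (C, C) matrix of scalar products."""
    flat = features.reshape(features.shape[0], -1)
    return flat @ flat.T

def style_loss(generated_layers, style_layers, layer_weights):
    """Weighted sum over layers of squared Gram matrix differences."""
    total = 0.0
    for feats, style_feats, weight in zip(generated_layers, style_layers, layer_weights):
        n = feats.shape[0]      # number of features in this layer
        m = feats[0].size       # number of pixels per feature map
        diff = gram_matrix(feats) - gram_matrix(style_feats)
        total += weight * np.sum(diff ** 2) / (4.0 * n ** 2 * m ** 2)
    return total

# One toy layer with 2 features of size 2x2 (hypothetical values).
generated = [np.array([[[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 1.0], [1.0, 1.0]]])]
style = [np.zeros((2, 2, 2))]
value = style_loss(generated, style, [1.0])   # 28 / 256 = 0.109375
```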


Why are we allowed to compare Gram (covariance) matrices computed from different images?
Because the network is the same for the generated and the reference image, the order of channels is fixed and each channel responds to the same pattern in both images. So, if two images have the same style, their Gram matrices must be close.

Why does the Gram (covariance) matrix describe style?
What can indicate that an image has a distinctive style:
- some pattern occurs unusually often (for example, high-contrast lines); this shows up as a large diagonal element of the Gram matrix
- some patterns frequently occur together (for example, horizontal red lines plus diagonal yellow lines); this shows up as a large off-diagonal element


What else can we do to make the synthesized image more accurate?

## Laplacian-steered neural style transfer

[[arxiv]](https://arxiv.org/pdf/1707.01253.pdf) (2017)

-----

The Laplacian is the sum of the second-order partial derivatives.
$$L = \sum_i \frac{\partial^2 f}{\partial x_i^2} $$

It measures the amount of curvature of the function at a point.

For example, for a 2D Gaussian the Laplacian looks like this:
<img src="img/laplacian.png" width=500>

That is, 
- the lowest (most negative) value is at the center of the Gaussian, where the surface is a concave peak. 
- the highest (positive) values are near the foot of the Gaussian, where the surface bends upward. In between, the Laplacian crosses zero where the curvature changes sign.
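
The shape of this curve can be checked numerically. The sketch below evaluates the closed-form Laplacian of the unnormalized 2D Gaussian $\exp(-r^2 / 2\sigma^2)$ along the radius $r$; the function name and the choice $\sigma = 1$ are assumptions made for illustration.

```python
import numpy as np

def gaussian_laplacian(r, sigma=1.0):
    """Laplacian of exp(-r^2 / (2 sigma^2)) in 2D, as a function of the radius r."""
    g = np.exp(-r ** 2 / (2.0 * sigma ** 2))
    return (r ** 2 - 2.0 * sigma ** 2) / sigma ** 4 * g

r = np.linspace(0.0, 4.0, 401)           # radii from the center outwards
values = gaussian_laplacian(r)

r_min = r[values.argmin()]               # most negative value: at the center
r_max = r[values.argmax()]               # most positive value: at r = 2 * sigma
zero_crossing = np.sqrt(2.0)             # sign change at r = sqrt(2) * sigma
```

For $\sigma = 1$ the minimum is $-2$ at the center, the sign changes at $r = \sqrt{2}$, and the positive maximum lies at $r = 2$, on the lower flank of the bell.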

------

In computer vision, Laplacians are commonly used as edge detectors. Examples of discrete Laplacian filters:

<img src="img/laplacian_filter_examples.png" width=200>

Li et al. added the squared error between the Laplacians of the content image and the generated image as an additional loss.

$$L_{Lap} =  \sum \big( L(x) - L(x_c) \big) ^2$$

where $x_c$ is the content image and $x$ is the image currently being optimized.

This makes the result retain more of the edges of the content image.
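
A minimal NumPy sketch of such a Laplacian loss. It assumes single-channel (grayscale) images given as 2D arrays and uses the common 4-neighbour discrete Laplacian, evaluated only where the full 3x3 stencil fits; the paper's exact filter and colour handling may differ, and the function names are illustrative.

```python
import numpy as np

def laplacian(image):
    """4-neighbour discrete Laplacian of a 2D array (interior pixels only)."""
    center = image[1:-1, 1:-1]
    return (image[:-2, 1:-1] + image[2:, 1:-1]      # pixels above and below
            + image[1:-1, :-2] + image[1:-1, 2:]    # pixels left and right
            - 4.0 * center)

def laplacian_loss(x, x_c):
    """Sum of squared differences between the Laplacians of x and x_c."""
    return np.sum((laplacian(x) - laplacian(x_c)) ** 2)

# A single bright pixel versus a flat image (hypothetical toy data).
x = np.zeros((3, 3))
x[1, 1] = 1.0
x_c = np.zeros((3, 3))
loss = laplacian_loss(x, x_c)            # Laplacian at the center is -4, so 16
```

Flat regions and linear ramps have zero Laplacian, so this term only reacts to edges and other abrupt intensity changes.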

So the total loss becomes

$$L_{total} = \alpha L_{content} + \beta L_{style} + \gamma L_{Lap} $$

They also proposed using Laplacians computed at several resolutions (similar to the style matrices taken from several layers):

$$L_{total} = \alpha L_{content} + \beta L_{style} + \sum_k \gamma_k L^k_{Lap} $$
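
The multi-resolution term can be sketched as follows. This is an illustration under assumptions: the lower resolutions are obtained here by average pooling with window size $k$, images are 2D grayscale arrays, and `gammas` is a hypothetical dictionary mapping each pooling size to its weight $\gamma_k$.

```python
import numpy as np

def average_pool(image, k):
    """Downsample a 2D array by averaging non-overlapping k x k blocks."""
    h, w = image.shape
    h, w = h - h % k, w - w % k          # crop so both sides divide by k
    return image[:h, :w].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def laplacian(image):
    """4-neighbour discrete Laplacian of a 2D array (interior pixels only)."""
    return (image[:-2, 1:-1] + image[2:, 1:-1]
            + image[1:-1, :-2] + image[1:-1, 2:]
            - 4.0 * image[1:-1, 1:-1])

def multiscale_laplacian_loss(x, x_c, gammas):
    """Sum over pooling sizes k of gamma_k times the Laplacian loss at that scale."""
    total = 0.0
    for k, gamma in gammas.items():
        diff = laplacian(average_pool(x, k)) - laplacian(average_pool(x_c, k))
        total += gamma * np.sum(diff ** 2)
    return total

# A bright 2x2 block in the middle of a 6x6 image (hypothetical toy data).
x = np.zeros((6, 6))
x[2:4, 2:4] = 1.0
x_c = np.zeros((6, 6))
loss = multiscale_laplacian_loss(x, x_c, {2: 0.5})   # 0.5 * (-4)^2 = 8
```

Coarser scales respond to large structures, while finer scales respond to thin edges.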



## Stable and Controllable Neural Texture Synthesis and Style Transfer Using Histogram Losses


[[arxiv]](https://arxiv.org/abs/1701.08893) (2017), University of Virginia

The authors investigated the approach of Gatys et al. and ran into several problems.

<img src="img/gatys_drawbacks.png" width = 500>

Among them:
- unstable histogram (left)
- ghosting artifacts (right)

In order to achieve more stable results they proposed two additions to the model:
- a total variation loss
- a histogram loss

#### Total variation

[Total variation](https://en.wikipedia.org/wiki/Total_variation_denoising) is a measure, popular in computer vision, of the amount of noise in an image.

<img src="img/tv.svg">

It simply sums up the total amount of "jitter" between neighboring pixels of the image. So the loss becomes

$$L_{total} = L_{style} + L_{tv}$$
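
A minimal NumPy sketch of a total variation term. It assumes the anisotropic variant (sum of absolute differences between horizontal and vertical neighbours) and a 2D grayscale image; the formula in the figure above may use a different variant, and the function name is illustrative.

```python
import numpy as np

def total_variation(image):
    """Anisotropic total variation of a 2D array."""
    horizontal = np.abs(image[:, 1:] - image[:, :-1]).sum()   # left-right neighbours
    vertical = np.abs(image[1:, :] - image[:-1, :]).sum()     # up-down neighbours
    return horizontal + vertical

flat = np.ones((4, 4))                       # no jitter at all
checker = np.array([[0.0, 1.0],
                    [1.0, 0.0]])             # maximal jitter for a 2x2 image
tv_flat = total_variation(flat)              # 0
tv_checker = total_variation(checker)        # 2 horizontal + 2 vertical = 4
```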

#### Histogram loss
To make $x$ (the generated image) look more like $x_s$ (the style image) they added a comparison of the activation distributions (for each feature separately).

Directly comparing two histograms is not a good idea: a histogram is a coarse structure that barely changes when the image changes slightly, so it is unsuitable as an optimization target.

Instead:
1. They remapped each feature map so that its histogram matches the histogram of the corresponding feature map of the style image
2. They measured how much the remapping changed the feature map

So they define the histogram loss as:
$$L_{hist} = \sum_i \gamma_i || F_i - R(F_i) ||^2$$
where 
- $F_i$ - i-th feature map, 
- $R(F_i)$ - the same feature map remapped to the histogram of the style image
- $\gamma_i$ - weight for each feature map
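
The remap-then-compare idea can be sketched with rank-based histogram matching in NumPy. This is an illustration, not the authors' implementation: `remap_histogram` sorts the activations and replaces them with the reference values at the same quantiles, and all names and toy values are assumptions.

```python
import numpy as np

def remap_histogram(feature, reference):
    """Return `feature` with its values replaced so that its histogram
    matches the histogram of `reference` (rank-based matching)."""
    flat = feature.ravel()
    order = np.argsort(flat, kind="stable")           # ranks of the activations
    ranks = np.linspace(0.0, 1.0, flat.size)
    ref_sorted = np.sort(reference.ravel())
    ref_quantiles = np.linspace(0.0, 1.0, ref_sorted.size)
    remapped = np.empty(flat.shape, dtype=float)
    # The value with rank q receives the reference value at quantile q.
    remapped[order] = np.interp(ranks, ref_quantiles, ref_sorted)
    return remapped.reshape(feature.shape)

def histogram_loss(features, references, gammas):
    """Sum over feature maps of gamma_i * ||F_i - R(F_i)||^2."""
    return sum(g * np.sum((f - remap_histogram(f, r)) ** 2)
               for f, r, g in zip(features, references, gammas))

# Toy feature maps (hypothetical values).
F = np.array([[3.0, 1.0], [2.0, 0.0]])
S = np.array([[10.0, 20.0], [30.0, 40.0]])
R = remap_histogram(F, S)                 # [[40, 20], [30, 10]]
loss = histogram_loss([F], [S], [1.0])
```

In an optimization loop the remapped map $R(F_i)$ would typically be recomputed at each step and treated as a constant target, so the gradient flows only through $F_i$.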


So the total loss becomes

$$L_{total} = L_{style} + L_{tv} + L_{hist}$$
