In [1]:
%run Latex_macros.ipynb

# Neural Style Transfer: Feature extractor, Training Loop


The objective of *Neural Style Transfer*:
- Given Content Image $C$
- Given Style Image $S$
- Create Generated Image $G$ that is the Content image re-drawn in the "style" of the Style image

<table>
    <tr>
         <td><center><img src=images/starry_night_crop.jpg width=60%></center></td>
        <td><strong>+</strong></td>
        <td><center><img src=images/chicago.jpg width=80%></center></td>
        <td><strong>=</strong></td>
        <td><center><img src=images/chicago_starry_night.jpg width=100%></center></td>
    </tr>
    <tr>
        <td><center><strong>Style image S</strong></center></td>
        <td></td>
        <td><center><strong>Content image C</strong></center></td>
        <td></td>
        <td><center><strong>Generated image G</strong></center></td>
    </tr>
</table>

We used this example to
preview the concept that Deep Learning is all about defining a Loss Function
that captures the semantics of the task.

## Content Loss and Style Loss

Neural Style Transfer is solved, like most other Machine Learning tasks, by minimizing a loss

$$
G = \argmin{I} \loss
$$
- where $I$ is an image.
- $\loss = \loss_\text{content} + \loss_\text{style}$
    - where
        - $\loss_\text{content}$ measures the dissimilarity of the "content" of $G$ and the "content" of $C$
        - $\loss_\text{style}$ measures the dissimilarity of the "style" of $G$ and the "style" of $S$
        
That is: the "weights" we are optimizing are the *pixels of image* $I$.

How do we measure the dissimilarity of the "content"?

We can't just use plain MSE of the pixel-wise differences
- $G$ necessarily differs from $C$ pixel by pixel: it is drawn in a different "style"

And how do we define what the "style" of an image is?
- And how do we measure dissimilarity of the "style"?

$$
\newcommand{\ICL}{\mathbb{C}}
\newcommand{\GM}{\mathbb{G}}
$$

Recall that each layer in a multi-layer Neural Network 
is creating an *alternate representation* of the input.

Rather than directly comparing $G$ with $C$ (and $G$ with $S$) our dissimilarity will be measured
- Not on raw images as seen by the human eye
- But on their alternate representations as created at some layer of a multi-layer Neural Network

We will
- Use a pre-trained multi-layer Image Classifier $\ICL$ (e.g., VGG19)
- Define some layer $\ll_c$ to be the "content" layer
- Define some layer $\ll_s$ to be the "style" layer
- And measure the dissimilarity via the alternate representations created at the respective layers


Suppose $\ICL$ consists of a sequence of CNN Layers

Let $\ICL_\llp$ denote the set of $n_\llp$ feature maps produced at layer $\ll$
- Feature map: value of one feature, at each spatial location

We choose 
- One layer $\ll_c$ of $\ICL$ and call it the "content representation" layer
    - Will tend to be deep: closer to the output
    - Features of deep layers will be more "semantics" than "syntax"
- One layer $\ll_s$ of $\ICL$ and call it the "style representation" layer
    - Will tend to be shallow: closer to the input
    - Features of shallow layers will be more "syntax" than "semantics"
    - In practice, style is often measured at several layers of varying depth rather than at one
    

For arbitrary image $I$, let
- $\ICL_{(\ll_c)}(I)$ 
    - denote the feature maps of the Classifier $\ICL$, on image $I$,  at the "content representation" layer
- $\ICL_{(\ll_s)}(I)$
    - denote the feature maps of the Classifier $\ICL$, on image $I$, at the "style representation" layer

We can now define the dissimilarity of the "content" of Content Image $C$ and "content" of Generated Image $G$
- by comparing $\ICL_{(\ll_c)}(C)$ and $\ICL_{(\ll_c)}(G)$

Similarly, we can define the dissimilarity of the "style" of Style Image $S$ and "style" of Generated Image $G$
- by comparing $\ICL_{(\ll_s)}(S)$ and $\ICL_{(\ll_s)}(G)$

For any image $I$: $\ICL_{(\ll)}(I)$ consists of $n_\llp$ feature maps.

We need to define what it means to compare  $\ICL_{(\ll)}(I)$ and  $\ICL_{(\ll)}(I')$.

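The *Gram Matrix $\GM(I)$* of $\ICL_{(\ll)}(I)$ 
- Has shape ($n_\llp \times n_\llp$)
- $\GM_{j,j'}(I) = \text{correlation}( \text{flatten}(\ICL_{(\ll),j}(I)), \text{flatten}(\ICL_{(\ll),j'}(I)) )$
    - the correlation of feature map $j$ of $\ICL_{(\ll)}(I)$ with feature map $j'$ of the same $\ICL_{(\ll)}(I)$
    - "correlation" is used loosely here: it is typically computed as the dot product of the two flattened feature maps, averaged over spatial locations
    
Intuitively, the Gram Matrix 
- measures, for each pair of feature maps of image $I$, how strongly their values co-vary across spatial locations
- records *which* features occur together, but not *where* in the image they occur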
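
The Gram Matrix computation can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own code: the function name `gram_matrix`, the `(height, width, n_features)` layout, and the averaging over spatial locations are assumptions made for the example.

```python
import numpy as np

def gram_matrix(feature_maps):
    """Gram matrix of one image's feature maps at a single layer.

    feature_maps: array of shape (height, width, n_features).
    Returns an array of shape (n_features, n_features).
    """
    h, w, n = feature_maps.shape
    # Flatten the spatial dimensions: one row per location, one column per feature map
    flat = feature_maps.reshape(h * w, n)
    # Entry (j, j') is the dot product of flattened maps j and j',
    # averaged over locations (the averaging is an assumption of this sketch)
    return flat.T @ flat / (h * w)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 5, 3))   # hypothetical: 4x5 spatial grid, 3 feature maps
gram = gram_matrix(features)
print(gram.shape)   # (3, 3)
```

Because the sum runs over all spatial locations, rearranging the locations (for example, flipping the feature maps top to bottom) leaves the Gram Matrix unchanged.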

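We can now define the dissimilarity of $\ICL_{(\ll)}(I)$ and  $\ICL_{(\ll)}(I')$ in two ways
- As the MSE of the feature maps themselves
    - sensitive to *where* in the image each feature occurs
- As the MSE of $\GM(I)$ and $\GM(I')$
    - insensitive to where features occur: only their co-occurrence matters

Using these dissimilarity measures, we can define the
- $\loss_\text{content}$ as the feature-map dissimilarity of $\ICL_{(\ll_c)}(C)$ and  $\ICL_{(\ll_c)}(G)$
- $\loss_\text{style}$ as the Gram Matrix dissimilarity of $\ICL_{(\ll_s)}(S)$ and  $\ICL_{(\ll_s)}(G)$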
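
A minimal NumPy sketch of the two losses follows. It makes one assumption that should be stated loudly: as in the linked TensorFlow tutorial, content features are compared directly (element-wise MSE), while style features are compared through their Gram Matrices. All function names and array shapes are hypothetical, and random arrays stand in for real feature maps.

```python
import numpy as np

def gram_matrix(features):
    # features: (height, width, n_features) -> (n_features, n_features)
    h, w, n = features.shape
    flat = features.reshape(h * w, n)
    return flat.T @ flat / (h * w)

def content_loss(content_features, generated_features):
    # Assumption: MSE on the feature maps themselves (keeps spatial layout)
    return np.mean((generated_features - content_features) ** 2)

def style_loss(style_features, generated_features):
    # MSE between Gram matrices (ignores spatial layout)
    diff = gram_matrix(generated_features) - gram_matrix(style_features)
    return np.mean(diff ** 2)

def total_loss(c_feats, s_feats, g_feats_at_content, g_feats_at_style):
    # Unweighted sum, matching the definition of the loss in the text
    return content_loss(c_feats, g_feats_at_content) + style_loss(s_feats, g_feats_at_style)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6, 4))   # hypothetical feature maps of some image
flipped = x[::-1]                # same features, rearranged spatially

same_content = content_loss(x, x)          # identical maps: zero content loss
moved_content = content_loss(x, flipped)   # layout changed: positive content loss
moved_style = style_loss(x, flipped)       # layout changed: style loss stays (numerically) zero
```

The flipped example shows why the two measures play different roles: moving features around the image changes the content loss but not the style loss.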

## Gradient descent: generating $G$

We can find image $G$ via Gradient Descent on the pixels
- Initialize $G$ to noise
- Repeatedly update each pixel $G_{i, i', k}$ by $- \alpha \frac{\partial \loss}{\partial G_{i, i', k}}$, for some learning rate $\alpha$
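
The update rule can be sketched on a toy problem. The assumptions here are significant: a fixed random matrix `W` stands in for the frozen classifier (purely hypothetical), only a single feature-matching loss is used, and the gradient is written out by hand. In a real implementation, a framework's automatic differentiation would supply the gradient. The learning rate and step count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_features = 12, 8

# Frozen stand-in for the classifier: maps flattened pixels to features (hypothetical)
W = rng.normal(size=(n_features, n_pixels))
# Features of a hypothetical target image
target = W @ rng.uniform(size=n_pixels)

def loss_and_grad(pixels):
    residual = W @ pixels - target
    loss = np.mean(residual ** 2)
    # Derivative of the loss with respect to each pixel
    grad = 2.0 * W.T @ residual / n_features
    return loss, grad

pixels = rng.uniform(size=n_pixels)   # initialize the generated image to noise
learning_rate = 0.02
losses = []
for step in range(200):
    loss, grad = loss_and_grad(pixels)
    losses.append(loss)
    # The "weights" being updated are the pixels; W never changes
    pixels = pixels - learning_rate * grad
```

Note that the only quantity that changes from step to step is `pixels`; the stand-in classifier `W` is never updated.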

## Feature extractor

One key coding trick that we will illustrate
- Obtaining the feature maps of the Classifier $\ICL$, on image $I$,  at an arbitrary layer

We will call this tool the *feature extractor*
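
One common way to build such a feature extractor in Keras is to wrap the pre-trained classifier in a new model whose outputs are the outputs of the chosen layers. The sketch below assumes TensorFlow's bundled VGG19; the function name `build_feature_extractor` and the `weights` parameter are illustrative and not taken from the notebook.

```python
import tensorflow as tf

def build_feature_extractor(layer_names, weights="imagenet"):
    """Return a model mapping a batch of images to the feature maps at `layer_names`."""
    # Pre-trained classifier without its classification head
    vgg = tf.keras.applications.VGG19(include_top=False, weights=weights)
    # The classifier stays frozen; only the generated image's pixels are optimized
    vgg.trainable = False
    # Symbolic outputs of the requested intermediate layers
    outputs = [vgg.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=vgg.input, outputs=outputs)
```

For example, `build_feature_extractor(["block1_conv1", "block5_conv2"])` would return a model that, given a batch of shape `(batch, height, width, 3)`, yields one tensor of feature maps per named layer. The layer names are VGG19's own and are chosen here only for illustration. Images would normally be passed through `tf.keras.applications.vgg19.preprocess_input` first.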

[Here](https://www.tensorflow.org/tutorials/generative/style_transfer) is a tutorial view of the notebook.

In [2]:
print("Done")

Done
