References:

https://github.com/keras-team/keras/blob/master/examples/neural_style_transfer.py

https://github.com/robertomest/neural-style-keras

Original paper: https://arxiv.org/abs/1508.06576

tv loss: https://arxiv.org/pdf/1412.0035.pdf

In [1]:
from keras.preprocessing.image import load_img, img_to_array
import numpy as np
from PIL import Image

# pretrained model
from keras.applications import vgg19
from keras.layers import Input
from keras import backend as K
from keras.optimizers import Nadam, Adam

Using TensorFlow backend.


In [2]:
def preprocess_image(image_path, img_size=None):
    if img_size:
        img = load_img(image_path, target_size = img_size)
    else:
        img = load_img(image_path)
    img_arr = img_to_array(img)
    img_arr = np.expand_dims(img_arr, axis=0)
    # applies normalization only (RGB -> BGR, subtract the ImageNet channel means), no scaling
    # (mode='tf' would instead scale pixels to [-1, 1])
    img_arr = vgg19.preprocess_input(img_arr)
    return img, img_arr

def deprocess_image(x):
    x = x[0]
    # add back the ImageNet channel means subtracted by the VGG19 preprocessing (BGR order)
    x[:, :, 0] += 103.939
    x[:, :, 1] += 116.779
    x[:, :, 2] += 123.68
    # 'BGR'->'RGB'
    x = x[:, :, ::-1]
    x = np.clip(x, 0, 255).astype('uint8')
    return x
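
To make the two helpers above more concrete, here is a minimal NumPy sketch of the same normalization and its inverse. The names `caffe_preprocess` and `caffe_deprocess` are hypothetical, and the sketch rounds before casting to `uint8` so that the round trip is exact, which the notebook's `deprocess_image` does not do.

```python
import numpy as np

# ImageNet channel means in BGR order, the same constants used in deprocess_image
BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float64)

def caffe_preprocess(rgb_batch):
    """RGB -> BGR, then subtract the per-channel mean; no rescaling."""
    bgr = rgb_batch[..., ::-1].astype(np.float64)
    return bgr - BGR_MEAN

def caffe_deprocess(bgr_batch):
    """Inverse: add the mean back, BGR -> RGB, round and clip to [0, 255]."""
    x = bgr_batch[0] + BGR_MEAN        # creates a new array, the input is untouched
    x = x[:, :, ::-1]
    return np.clip(np.rint(x), 0, 255).astype('uint8')

rng = np.random.RandomState(0)
rgb = rng.randint(0, 256, size=(1, 4, 5, 3)).astype(np.float64)
restored = caffe_deprocess(caffe_preprocess(rgb))
round_trip_ok = np.array_equal(restored, rgb[0].astype('uint8'))
```

The preprocessed values are centred around zero but keep the 0-255 scale, which is why `deprocess_image` only needs to add the means back and flip the channel order.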

In [3]:
# dim 3 -> 4 (batch_size, rows, cols, channels)
base_img, base_img_arr = preprocess_image('dog.jpeg')
width, height = base_img_arr.shape[2], base_img_arr.shape[1]
img_nrows, img_ncols = height, width

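# style reference image
ref_img, ref_img_arr = preprocess_image('style_3.png')

# the variable that gets updated; this is the output we want
# goal: the overall picture looks like base, but the style resembles ref
# std = 0.001
# starting from random noise gave noticeably worse results
# generated_img_var = K.variable(std * np.random.randn(*base_img_arr.shape))
# start directly from the base image (stored as a backend variable)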
generated_img_var = K.variable(base_img_arr)

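Next we define the loss function. There are two losses: the content loss measures how closely the generated image resembles the base image, and the style loss measures how closely the style of the generated image resembles that of ref.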

First, the content loss is defined as follows:

<img style="float:center; width:310px" src="content_loss.png"/>

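$p$: base_img

$x$: the generated image

$F^l$ and $P^l$ are $N \times M$ matrices holding the feature representations of $x$ and $p$, respectively, in layer $l$. $N$ is the number of kernels (filters) in that layer and $M$ is the size of each feature map, i.e. its height times its width (for example, a 5×7 feature map gives $M = 35$).

The formula above therefore measures the difference between the base image and the generated image in feature space, and it can be minimized with backpropagation.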
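
As a rough illustration, the content loss can be sketched in NumPy on two small feature matrices. The function name `content_loss_np` is hypothetical, and the factor 1/2 assumes the formula matches the one in the paper; the notebook's own `content_loss` further down omits that factor, which only rescales the loss.

```python
import numpy as np

def content_loss_np(F, P):
    """0.5 * sum of squared differences between two (N, M) feature matrices."""
    return 0.5 * np.sum((F - P) ** 2)

# N = 2 filters, each with a 2x3 feature map flattened to M = 6 values
P = np.arange(12, dtype=np.float64).reshape(2, 6)   # features of the base image
F = P + 1.0                                         # features of the generated image
loss = content_loss_np(F, P)                        # 12 entries, each off by 1
identical = content_loss_np(P, P)                   # no difference, no loss
```

The loss is zero only when the two feature matrices agree entry by entry, which is what pushes the generated image towards the content of the base image.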

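Next we define the style loss. Here we do not use the difference between feature map values directly as the loss, as above, because we do not want the generated image to match ref numerically; we want it to match in style. The paper uses the Gram matrix as the measure of style. We first define the Gram matrix $G$, whose elements are defined as follows: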
<table>
    <tr>
        <td><img alt="Drawing" style="width:150px" src="gram_matrix.png"/></td>
        <td><img alt="Drawing" style="width:150px" src="gram_matrix_1.png"/></td>
    </tr>
</table>
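Entry $G_{ij}$ is the inner product of row $i$ and row $j$, that is, of the flattened feature maps of filters $i$ and $j$. In other words, the similarity between the responses of different kernels is taken as the overall style of the image. If we think of each feature map as carrying some style information, two feature maps that carry similar style information give a large inner product, so $G$ is large at that entry. A feature map whose values are small to begin with indicates that the corresponding style is present but not strong.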
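
A small NumPy sketch may help here. The name `gram_matrix_np` and the toy feature maps are made up for illustration; the sketch leaves out the optional normalization used in the notebook's Keras version.

```python
import numpy as np

def gram_matrix_np(feature_maps):
    """feature_maps: (rows, cols, channels) -> (channels, channels) Gram matrix."""
    rows, cols, channels = feature_maps.shape
    # Each row of F is one flattened feature map (the response of one filter)
    F = feature_maps.transpose(2, 0, 1).reshape(channels, rows * cols)
    return F @ F.T

fmap = np.zeros((2, 2, 3))
fmap[..., 0] = [[1, 0], [0, 1]]
fmap[..., 1] = [[1, 0], [0, 1]]   # same pattern as channel 0
fmap[..., 2] = [[0, 1], [1, 0]]   # active only where channel 0 is not
G = gram_matrix_np(fmap)
```

Channels 0 and 1 fire at the same positions, so `G[0, 1]` is as large as the diagonal entries, while channel 2 never overlaps with them and `G[0, 2]` is zero. The spatial layout itself is discarded, which is why the Gram matrix captures style and not content.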

With $G$ defined, the loss for layer $l$ is:

<img style="float:center; width:300px" src="style_layer_loss.png"/>

$A$ and $G$ are the Gram matrices of ref and of the generated image, respectively; $N$ and $M$ are defined as above.

The total style loss is then defined as:

<img style="float:center; width:300px" src="style_loss.png"/>

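Here $w_l$ is a manually chosen weight, i.e. a hyperparameter. Each layer's loss can be given a different weight, depending on which layers you consider more important. The derivative of this term is: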

![style gradient](style_gradient.png)
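
The per-layer loss and the weighted sum can be sketched in NumPy as below. All function names are hypothetical, and the sketch assumes the normalization $1/(4N^2M^2)$ from the paper, with $N$ and $M$ read from the feature maps themselves; the notebook's `style_loss` instead fixes `channels = 3` and uses the image size.

```python
import numpy as np

def gram(f):
    """(rows, cols, channels) feature maps -> (channels, channels) Gram matrix."""
    c = f.shape[-1]
    flat = f.reshape(-1, c)            # (M, N): one column per filter
    return flat.T @ flat

def layer_style_loss(style_feat, gen_feat):
    """E_l = sum((G - A)^2) / (4 * N^2 * M^2) for a single layer."""
    rows, cols, n = gen_feat.shape
    m = rows * cols
    A, G = gram(style_feat), gram(gen_feat)
    return np.sum((G - A) ** 2) / (4.0 * n ** 2 * m ** 2)

def total_style_loss(style_feats, gen_feats, weights):
    """Weighted sum of the per-layer losses; `weights` plays the role of w_l."""
    return sum(w * layer_style_loss(s, g)
               for w, s, g in zip(weights, style_feats, gen_feats))

rng = np.random.RandomState(0)
style_feats = [rng.rand(4, 4, 2), rng.rand(2, 2, 3)]   # two toy "layers"
gen_feats = [rng.rand(4, 4, 2), rng.rand(2, 2, 3)]
same = total_style_loss(style_feats, style_feats, [0.5, 0.5])
diff = total_style_loss(style_feats, gen_feats, [0.5, 0.5])
```

Raising a layer's weight makes mismatches in that layer's Gram matrix count more, which is one way to emphasise finer or coarser texture.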

Finally, the overall loss is:
![loss](loss_func.png)

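The design in the paper stops here, but in practice a total variation error is often added to get better results. It has the form:
![tv_loss](tv_loss.png)
This term asks the differences between neighbouring pixels to be as small as possible, i.e. it encourages a smooth result.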
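
To see the smoothing effect numerically, here is a NumPy sketch of the same quantity on two toy images. `total_variation_np` is a hypothetical name; `beta=2.5` mirrors the default used in the notebook's Keras version.

```python
import numpy as np

def total_variation_np(x, beta=2.5):
    """x: (rows, cols, channels) image; sums (dx^2 + dy^2)^(beta/2) over pixels."""
    dx = x[:-1, 1:, :] - x[:-1, :-1, :]   # difference to the right neighbour
    dy = x[1:, :-1, :] - x[:-1, :-1, :]   # difference to the neighbour below
    return np.sum((dx ** 2 + dy ** 2) ** (beta / 2))

flat = np.full((4, 4, 3), 7.0)                          # perfectly smooth image
checker = np.indices((4, 4)).sum(axis=0) % 2            # 0/1 checkerboard
noisy = np.repeat(checker[..., None], 3, axis=-1).astype(np.float64)
tv_flat = total_variation_np(flat)
tv_noisy = total_variation_np(noisy)
```

A constant image has zero total variation, while the checkerboard, where every neighbouring pair differs, is penalized at every position.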

Putting this together, we can write the code as follows:

In [4]:
def gram_matrix(x):
    # x is (rows, cols, channels); move channels first and flatten to (nb_kernels, rows * cols)
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    # inner products between all pairs of flattened feature maps
    gram = K.dot(features, K.transpose(features))
    # normalize for better result, optional
    # (for a 3-D x, shape[1] * shape[2] is cols * channels)
    norm = x.shape
    normalize_term = norm[1] * norm[2]
    gram = gram / K.cast(normalize_term, x.dtype)
    return gram

def style_loss(style, generated):
    # the style loss compares the Gram matrices of the two images
    A = gram_matrix(style)
    G = gram_matrix(generated)
    channels = 3
    size = img_nrows * img_ncols
    return K.sum(K.square(G - A)) / (4. * (channels ** 2) * (size ** 2))

def content_loss(base, generated):
    # the content loss compares the feature maps of the two images directly
    return K.sum(K.square(generated - base))

def total_variation_loss(x, beta=2.5):
    assert K.ndim(x) == 3
    a = K.square(x[:img_nrows-1, 1:img_ncols, :] - x[:img_nrows-1, :img_ncols-1, :]) # first term of the TV loss above
    b = K.square(x[1:img_nrows, :img_ncols-1, :] - x[:img_nrows-1, :img_ncols-1, :]) # second term of the TV loss
    return K.sum(K.pow(a+b, beta/2))

In [5]:
content_weight = 0.001 # alpha
style_weight = 50000 # beta
total_variation_weight = 1e-4 # tv loss

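The network uses VGG19 to extract image features.

The VGG19 architecture is shown below:

<img style="float:center; width:200px" src="vgg19.png"/>

Each group of consecutive convolutional layers followed by a pooling layer is called a block, so there are 5 blocks before the FC layers.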
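
The block structure maps directly onto the layer names used in the next cell. The sketch below rebuilds those names in plain Python, assuming the `block{b}_conv{c}` naming used by `keras.applications`; the helper name is hypothetical.

```python
# Number of 3x3 convolution layers in each of the five VGG19 blocks
CONVS_PER_BLOCK = [2, 2, 4, 4, 4]

def vgg19_conv_layer_names():
    """Conv layer names in the 'block{b}_conv{c}' scheme."""
    return ['block{}_conv{}'.format(b, c)
            for b, n in enumerate(CONVS_PER_BLOCK, start=1)
            for c in range(1, n + 1)]

names = vgg19_conv_layer_names()
first_of_each_block = [n for n in names if n.endswith('_conv1')]
print(len(names), first_of_each_block)
```

The 16 convolution layers plus the 3 fully connected layers give VGG19 its name; with `include_top=False` only the convolution and pooling layers are loaded.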

In [6]:
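# include_top: whether to load the fully connected layers at the top of the network
# we only need the feature maps, so leave them out
# first compute the outputs for base and ref; these do not change during optimization, hence "static"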
static_model = vgg19.VGG19(weights='imagenet', include_top=False)
outputs_dict = {layer.name: layer.output for layer in static_model.layers}

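# Q: Why block5_conv2 rather than block5_conv4?
# A: Not explained, and I cannot think of a reason; it seems to be purely empirical.
#    Going by the paper, the most abstract features should be preferred.
# Q: Why only one layer instead of all of them?
# A: The paper does not seem to explain this either; presumably this layer is abstract enough
#    (which brings back the question of why not block5_conv4)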
content_feature_layers = ['block5_conv2']
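# Q: Why conv1 in every block?
# A: Shallow layers are usually thought to keep more local information, while deeper layers hold more abstract information,
#    so the shallower layers should represent style better (my guess)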
style_feature_layers = ['block1_conv1', 'block2_conv1',
                        'block3_conv1', 'block4_conv1',
                        'block5_conv1']

content_features = [outputs_dict[layer] for layer in content_feature_layers]
style_features = [outputs_dict[layer] for layer in style_feature_layers]

# feed an input to the model and get the outputs of the chosen layers (these outputs depend on the input)
get_content_fun = K.function([static_model.input], content_features)
get_style_fun = K.function([static_model.input], style_features)
# feed in the content and style images to get the target activations used in the losses
content_targets = get_content_fun([base_img_arr])
style_targets = get_style_fun([ref_img_arr])

# wrap the outputs in backend variables so they can be used in the loss expressions
content_targets_dict = {k: K.variable(v) for k, v in zip(content_feature_layers, content_targets)}
style_targets_dict = {k: K.variable(v) for k, v in zip(style_feature_layers, style_targets)}
# the network the generated image is fed through; only generated_img_var is optimized
trainable_model = vgg19.VGG19(weights='imagenet', include_top=False,
                              input_tensor=Input(tensor=generated_img_var))
generated_outputs_dict = {layer.name: layer.output for layer in trainable_model.layers}

total_loss = K.variable(0.)
total_content_loss = K.variable(0.)
total_style_loss = K.variable(0.)
total_tv_loss = K.variable(0.)
# add content loss
for layer_name in content_feature_layers:
    layer_feature = content_targets_dict[layer_name]
    gen_feature = generated_outputs_dict[layer_name]
    l = content_loss(layer_feature, gen_feature)
    weighted_content_loss = content_weight * l
    total_content_loss = total_content_loss + weighted_content_loss
    total_loss = total_loss + weighted_content_loss
# add style loss
for layer_name in style_feature_layers:
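    # [0]: working with 3-D tensors is easier to follow (the formulas above apply directly)
    # dim 0 is the batch dimension; to handle several styles at once,
    # this part would need to change to produce one style error per style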
    layer_feature = style_targets_dict[layer_name][0]
    gen_feature = generated_outputs_dict[layer_name][0]
    l = style_loss(layer_feature, gen_feature)
    weighted_style_loss = style_weight * l
    total_style_loss = total_style_loss + weighted_style_loss
    total_loss = total_loss + weighted_style_loss
# add tv loss
weighted_tv_loss = total_variation_weight * total_variation_loss(generated_img_var[0])
total_tv_loss = total_tv_loss + weighted_tv_loss
total_loss = total_loss + weighted_tv_loss

In [None]:
opt = Nadam(10)
#opt = Adam(10)
updates = opt.get_updates([generated_img_var], {}, total_loss)
# List of outputs
outputs = [total_loss, total_content_loss, total_style_loss, total_tv_loss]
# Function that makes a step after backpropping to the image
make_step = K.function([], outputs, updates)
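
The key point of the cell above is that the optimizer updates the pixels of the image, not any network weights. The toy NumPy sketch below shows the same idea. It swaps in plain gradient descent for Nadam and a squared distance to a fixed target for `total_loss`; both are stand-ins chosen for illustration, and all names are hypothetical.

```python
import numpy as np

def loss_and_grad(image, target):
    """Toy stand-in for total_loss: squared distance to a fixed target."""
    diff = image - target
    return np.sum(diff ** 2), 2.0 * diff

def optimize_image(image, target, lr=0.1, steps=50):
    """Plain gradient descent on the pixels; no network weights are involved."""
    image = image.copy()
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(image, target)
        history.append(loss)
        image -= lr * grad             # the "make_step" of this toy setup
    return image, history

rng = np.random.RandomState(0)
target = rng.rand(8, 8, 3)
start = np.zeros_like(target)
result, history = optimize_image(start, target)
```

In the notebook, the gradient comes from backpropagating `total_loss` through VGG19 to `generated_img_var`, and each call to `make_step` applies one such update.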

In [None]:
# if the result is poor, adjust the alpha / beta / TV weights according to the loss values
max_iters = 1000
for i in range(max_iters):
    out = make_step([])
    if (i+1) % 10 == 0:
        print(' Content loss at pass{}: %g'.format(i+1) %out[1])
        print(' Style loss at pass{}: %g'.format(i+1) %out[2])
        print(' TV loss at pass{}: %g'.format(i+1) %out[3])
        print(' Total loss at pass{}: %g'.format(i+1) %out[0])
        x = K.get_value(generated_img_var)        
        fname = 'results/{}.png'.format(i+1)
        img = deprocess_image(x)
        img = Image.fromarray(img)
        img.save(fname)

 Content loss at pass10: 521307
 Style loss at pass10: 2.1361e+06
 TV loss at pass10: 247952
 Total loss at pass10: 2.90536e+06
 Content loss at pass20: 539696
 Style loss at pass20: 2.75188e+06
 TV loss at pass20: 271838
 Total loss at pass20: 3.56342e+06
 Content loss at pass30: 422104
 Style loss at pass30: 2.45959e+06
 TV loss at pass30: 205711
 Total loss at pass30: 3.08741e+06
 Content loss at pass40: 403624
 Style loss at pass40: 2.03165e+06
 TV loss at pass40: 229554
 Total loss at pass40: 2.66483e+06
 Content loss at pass50: 332097
 Style loss at pass50: 947925
 TV loss at pass50: 187509
 Total loss at pass50: 1.46753e+06
 Content loss at pass60: 319558
 Style loss at pass60: 1.16925e+06
 TV loss at pass60: 157461
 Total loss at pass60: 1.64626e+06
 Content loss at pass70: 699168
 Style loss at pass70: 7.26295e+06
 TV loss at pass70: 417066
 Total loss at pass70: 8.37919e+06
 Content loss at pass80: 466268
 Style loss at pass80: 2.17011e+06
 TV loss at pass80: 362382
 Total lo

The experimental results are shown below; base is on the left and ref on the right.
<table>
    <tr>
        <td> <img style="float:left; width:310px" src="dog.jpeg"/> </td>
        <td> <img style="float:right; width:310px" src="style_3.png"/> </td> 
    </tr>
</table>
With the settings above, after 250 iterations we get:

<img style="float:center; width:310px" src="250.png"/>