# Deep Pensieve™
An Enhanced Deep Residual (EDSR) Information Maximizing Variational Auto-Encoder (InfoVAE) with Group Normalization (GN), Residual Bottleneck Attention Modules (RBAM), Efficient Sub-Pixel Convolution Super-Resolution (ESPCN), and Perceptual Similarity Loss (SSIM)

<table><tr>
<td><img src='https://s3.amazonaws.com/neurokinetikz/latent-animation-1540551741.4923084-final.gif'></td>
<td><img src='https://s3.amazonaws.com/neurokinetikz/latent-animation-1540552110.470284-final.gif'></td>
<td><img src="https://s3.amazonaws.com/neurokinetikz/1542113809.6163912-080.gif"></td>
<td><img src='https://s3.amazonaws.com/neurokinetikz/latent-animation-1540552122.395882-final.gif'></td>
<td><img src='https://s3.amazonaws.com/neurokinetikz/latent-animation-1540551578.5925505-final.gif'></td>
</tr></table>

## Multi-Stage Variational Auto-Encoders for Coarse-to-Fine Image Generation 

https://arxiv.org/abs/1705.07202

<img src="https://s3.amazonaws.com/neurokinetikz/download-1.png">

Variational auto-encoder (VAE) is a powerful unsupervised learning framework for image generation. One drawback of VAE is that it generates blurry images due to its Gaussianity assumption and thus L2 loss. To allow the generation of high quality images by VAE, we increase the capacity of decoder network by employing residual blocks and skip connections, which also enable efficient optimization. To overcome the limitation of L2 loss, we propose to generate images in a multi-stage manner from coarse to fine. In the simplest case, the proposed multi-stage VAE divides the decoder into two components in which the second component generates refined images based on the coarse images generated by the first component. Since the second component is independent of the VAE model, it can employ other loss functions beyond the L2 loss and different model architectures. The proposed framework can be easily generalized to contain more than two components. Experiment results on the MNIST and CelebA datasets demonstrate that the proposed multi-stage VAE can generate sharper images as compared to those from the original VAE.

## Imports

In [1]:
import time
import json
import random
import numpy as np
import tensorflow as tf

from libs import utils, gif
from libs.group_norm import GroupNormalization
from libs.variance_pooling import GlobalVariancePooling2D

from keras.models import Model, load_model, model_from_json
from keras.layers import Input, Flatten, Reshape, Add, Multiply, Activation, Lambda
from keras.layers import Dense, Conv2D, DepthwiseConv2D, SeparableConv2D
from keras.layers import MaxPooling2D, UpSampling2D, GlobalAveragePooling2D
from keras.callbacks import LambdaCallback

from keras import optimizers
from keras import backend as K

from keras_contrib.losses import DSSIMObjective
from keras_contrib.layers.convolutional import SubPixelUpscaling

Using TensorFlow backend.


## Load Images

In [2]:
DIRECTORY = 'roadtrip'

SIZE = 256
CHANNELS = 3

SCALE_FACTOR = 2

FEATURES = SIZE*SIZE*CHANNELS
FEATURES_2X = SCALE_FACTOR*SIZE*SCALE_FACTOR*SIZE*CHANNELS

MODEL_NAME = DIRECTORY+'-'+str(SIZE)+'-'+str(time.time())

In [3]:
# load images
imgs, xs, ys  = utils.load_images(directory="imgs/"+DIRECTORY,rx=SIZE,ry=SIZE)
imgs_2x, xs_2x, ys_2x = utils.load_images(directory="imgs/"+DIRECTORY,rx=SCALE_FACTOR*SIZE,ry=SCALE_FACTOR*SIZE)

# normalize pixels
IMGS = imgs/127.5 - 1
FLAT = np.reshape(IMGS,(-1,FEATURES))

IMGS_2X = imgs_2x/127.5 - 1
FLAT_2X = np.reshape(IMGS_2X,(-1,FEATURES_2X)) 

SAMPLES =  np.random.permutation(FLAT)[:9]
SAMPLES_2X =  np.random.permutation(FLAT_2X)[:9]

TOTAL_BATCH = IMGS.shape[0]

# print shapes
print("MODEL: ",MODEL_NAME)
print("IMGS: ",IMGS.shape,IMGS_2X.shape)
print("FLAT: ",FLAT.shape,FLAT_2X.shape)
print("SAMPLES: ",SAMPLES.shape,SAMPLES_2X.shape)

Loading images:	184
Loading images:	184
MODEL:  roadtrip-256-1545488800.133229
IMGS:  (184, 256, 256, 3) (184, 512, 512, 3)
FLAT:  (184, 196608) (184, 786432)
SAMPLES:  (9, 196608) (9, 786432)


### Very Deep Convolutional Networks for Large-Scale Image Recognition 

https://arxiv.org/abs/1409.1556

<img src="https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/image_folder_3/CascadingConvolutions.png">

In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.

First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.

Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by $3(3^2C^2) = 27C^2$ weights; at the same time, a single 7 × 7 conv. layer would require $7^2C^2 = 49C^2$ parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).
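
The 81% figure can be checked by counting weights: a k × k convolution with C input and C output channels has k²C² weights (biases ignored). The sketch below is only that arithmetic; the helper name `conv_stack_weights` and the channel count `C = 64` are illustrative choices, not part of the notebook.

```python
def conv_stack_weights(kernel_size, channels, layers=1):
    """Weights (biases ignored) in `layers` stacked k x k convolutions
    whose input and output both have `channels` channels."""
    return layers * (kernel_size ** 2) * channels ** 2

C = 64  # illustrative; the ratio below does not depend on C

stacked = conv_stack_weights(3, C, layers=3)  # three 3x3 layers: 27 * C^2
single = conv_stack_weights(7, C)             # one 7x7 layer:    49 * C^2
extra_percent = round(100 * (single / stacked - 1))

print(stacked, single, extra_percent)
```

Both options cover the same 7 × 7 receptive field, so the stacked version gets the extra non-linearities with fewer weights.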

### Group Normalization

https://medium.com/syncedreview/facebook-ai-proposes-group-normalization-alternative-to-batch-normalization-fb0699bffae7

<img src="https://s3.amazonaws.com/neurokinetikz/groupnorm.png">


The mainstream normalization technique for almost all convolutional neural networks today is Batch Normalization (BN), which has been widely adopted in the development of deep learning. Proposed by Google in 2015, BN can not only accelerate a model’s converging speed, but also alleviate problems such as Gradient Dispersion in the deep neural network, making it easier to train models.

Dr. Wu and Dr. He however argue in their paper Group Normalization that normalizing with batch size has limitations, as BN cannot ensure the model accuracy rate when the batch size becomes smaller. As a result, researchers today are normalizing with large batches, which is very memory intensive, and are avoiding using limited memory to explore higher-capacity models.

Dr. Wu and Dr. He believe their new GN technique is a simple but effective alternative to BN. Specifically, GN divides channels — also referred to as feature maps that look like 3D chunks of data — into groups and normalizes the features within each group. GN only exploits the layer dimensions, and its computation is independent of batch sizes.

The paper reports that GN had a 10.6% lower error rate than its BN counterpart for ResNet-50 in ImageNet with a batch size of 2 samples; and matched BN performance while outperforming other normalization techniques with a regular batch size.
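
To make the "normalize within each group of channels" step concrete, here is a minimal NumPy sketch for channels-last data. It is not the `GroupNormalization` layer imported above: the learned per-channel scale and offset are left out, and the function name `group_norm` and the test shapes are assumptions made for illustration.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Normalize an (N, H, W, C) array over each group of channels, per sample.
    The learned scale/offset of a real GN layer is omitted."""
    n, h, w, c = x.shape
    assert c % groups == 0, "channels must divide evenly into groups"
    g = x.reshape(n, h, w, groups, c // groups)
    # statistics are taken over height, width and the channels inside a group
    mean = g.mean(axis=(1, 2, 4), keepdims=True)
    var = g.var(axis=(1, 2, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, h, w, c)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 8, 16))

y = group_norm(x, groups=4)
# no statistic crosses the batch axis, so a batch of one gives the same values
y_single = group_norm(x[:1], groups=4)
print(np.allclose(y[:1], y_single))
```

Because nothing is averaged across samples, the result is the same at any batch size, which is the property the excerpt highlights. When `groups` equals the number of channels, as in the `GroupNormalization(groups=n_filters, ...)` calls in this notebook, each channel forms its own group.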

## Encoder

<img width="800" src="https://s3.amazonaws.com/neurokinetikz/Asset+4.png">

<br>

In [4]:
def encode(x):
    # set current layer
    current_layer = Reshape((SIZE,SIZE,CHANNELS))(x)
    
    # convolution layers
    for layer, n_filters in enumerate(FILTERS):

        # stacked 3x3 convolutions with group normalization + activation
        current_layer = Conv2D(n_filters,3,padding='SAME',kernel_initializer=INITIALIZER)(current_layer)
        current_layer = GroupNormalization(groups=n_filters,axis=-1)(current_layer)
        current_layer = Activation(ACTIVATION)(current_layer)

        current_layer = Conv2D(n_filters,3,padding='SAME',kernel_initializer=INITIALIZER)(current_layer)
        current_layer = GroupNormalization(groups=n_filters,axis=-1)(current_layer)
        current_layer = Activation(ACTIVATION)(current_layer)
         
        # max pooling
        current_layer = MaxPooling2D()(current_layer)
    
    # grab the last shape for reconstruction
    shape = current_layer.get_shape().as_list()
    
    # flatten
    flat = Flatten()(current_layer)
    
    # latent vector
    z = Dense(LATENT_DIM,name='encoder')(flat)
    
    return z, (shape[1],shape[2],shape[3])

### InfoVAE: Information Maximizing Variational Autoencoders

https://arxiv.org/abs/1706.02262v3

https://ermongroup.github.io/blog/a-tutorial-on-mmd-variational-autoencoders/

<table><tr>
<td><img src="https://s3.amazonaws.com/neurokinetikz/kl_latent.gif" ></td>
<td><img src="https://s3.amazonaws.com/neurokinetikz/mmd_latent.gif" ></td>
</tr></table>

Maximum mean discrepancy (MMD, (Gretton et al. 2007)) is based on the idea that two distributions are identical if and only if all their moments are the same. Therefore, we can define a divergence by measuring how “different” the moments of two distributions p(z) and q(z) are. MMD can accomplish this efficiently via the kernel embedding trick:
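
Written out for a kernel $k$, in the form that `compute_mmd` below estimates with sample means:

$$
\mathrm{MMD}(p \,\|\, q) = \mathbb{E}_{p(z),\,p(z')}\big[k(z, z')\big] + \mathbb{E}_{q(z),\,q(z')}\big[k(z, z')\big] - 2\,\mathbb{E}_{p(z),\,q(z')}\big[k(z, z')\big]
$$

The three expectations correspond to the three `tf.reduce_mean` terms in `compute_mmd`.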

A kernel can be intuitively interpreted as a function that measures the “similarity” of two samples. It has a large value when two samples are similar, and small when they are different. For example, the Gaussian kernel considers points that are close in Euclidean space to be “similar”. A rough intuition of MMD, then, is that if two distributions are identical, then the average “similarity” between samples from each distribution, should be identical to the average “similarity” between mixed samples from both distributions.
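
The intuition can be tried out on toy data. The following is a NumPy sketch that mirrors the structure of `compute_kernel` and `compute_mmd` below (same kernel scaling), not a replacement for them; the sample sizes, the 2-D latent space and the shift of 3.0 are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kernel(x, y):
    """Pairwise Gaussian kernel between rows of x (n, d) and y (m, d)."""
    dim = x.shape[1]
    sq_dist = np.mean((x[:, None, :] - y[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dist / dim)

def mmd(x, y):
    """Average within-set similarity minus twice the cross-set similarity."""
    return (gaussian_kernel(x, x).mean()
            + gaussian_kernel(y, y).mean()
            - 2 * gaussian_kernel(x, y).mean())

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 2))             # samples from N(0, I)
b = rng.normal(size=(200, 2))             # more samples from N(0, I)
c = rng.normal(loc=3.0, size=(200, 2))    # samples from a shifted Gaussian

same = mmd(a, b)
shifted = mmd(a, c)
print(same, shifted)
```

Two sets drawn from the same distribution give a value close to zero, while the shifted set gives a clearly larger one; this is the quantity `vae_loss` drives down between the encoder output and samples from the prior.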



In [5]:
def compute_kernel(x, y):
    x_size = tf.shape(x)[0]
    y_size = tf.shape(y)[0]
    dim = tf.shape(x)[1]
    tiled_x = tf.tile(tf.reshape(x, tf.stack([x_size, 1, dim])), tf.stack([1, y_size, 1]))
    tiled_y = tf.tile(tf.reshape(y, tf.stack([1, y_size, dim])), tf.stack([x_size, 1, 1]))
    return tf.exp(-tf.reduce_mean(tf.square(tiled_x - tiled_y), axis=2) / tf.cast(dim, tf.float32))

def compute_mmd(x, y):
    x_kernel = compute_kernel(x, x)
    y_kernel = compute_kernel(y, y)
    xy_kernel = compute_kernel(x, y)
    return tf.reduce_mean(x_kernel) + tf.reduce_mean(y_kernel) - 2 * tf.reduce_mean(xy_kernel)

In [6]:
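def vae_loss(y_true,y_pred):
    # y_true is ignored: the loss only compares the encoded batch (y_pred)
    # with samples drawn from the standard normal prior
    # sample one prior vector per encoded vector, so the final batch may be smaller than BATCH_SIZE
    epsilon = tf.random_normal(tf.shape(y_pred))
    latent_loss = compute_mmd(epsilon, y_pred)
    return latent_loss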

## Decoder

<img width="800" src="https://s3.amazonaws.com/neurokinetikz/Asset+5.png">

<br>

In [7]:
def decode(z,z_g,shape=None):
    
    # reverse the encoder
    filters = FILTERS[::-1]

    # inflate
    inflated = shape[0]*shape[1]*shape[2]
    inflate = Dense(inflated,name='generator')
    current_layer = inflate(z) ; generator = inflate(z_g)
    
    # reshape
    reshape = Reshape(shape)
    current_layer = reshape(current_layer) ; generator = reshape(generator)
    
    # build layers
    for layer, n_filters in enumerate(filters):
        
        # upsample
        u = UpSampling2D()
        current_layer = u(current_layer) ; generator = u(generator)

        # stacked 3x3 convolutions with group normalization + activation
        c1 = Conv2D(n_filters,3,padding='SAME',kernel_initializer=INITIALIZER)
        b1 = GroupNormalization(groups=n_filters,axis=-1)
        a1 = Activation(ACTIVATION)

        current_layer = c1(current_layer) ; generator = c1(generator)
        current_layer = b1(current_layer) ; generator = b1(generator)
        current_layer = a1(current_layer) ; generator = a1(generator)

        c2 = Conv2D(n_filters,3,padding='SAME',kernel_initializer=INITIALIZER)
        b2 = GroupNormalization(groups=n_filters,axis=-1)
        a2 = Activation(ACTIVATION)

        current_layer = c2(current_layer) ; generator = c2(generator)
        current_layer = b2(current_layer) ; generator = b2(generator)
        current_layer = a2(current_layer) ; generator = a2(generator)
    
    # output convolution + activation
    conv = Conv2D(CHANNELS,1,padding='SAME')
    activation = Activation('tanh',name='decoder_dssim')
    
    current_layer = conv(current_layer)       ; generator = conv(generator)
    current_layer = activation(current_layer) ; generator = activation(generator)
    
    flatten = Flatten(name='decoder')
    decoder_loss = flatten(current_layer)
    
    return current_layer, generator, decoder_loss

## Deconvolution and Checkerboard Artifacts 

https://distill.pub/2016/deconv-checkerboard/

When we have neural networks generate images, we often have them build them up from low resolution, high-level descriptions. This allows the network to describe the rough image and then fill in the details.

In order to do this, we need some way to go from a lower resolution image to a higher one. We generally do this with the deconvolution operation. Roughly, deconvolution layers allow the model to use every point in the small image to “paint” a square in the larger one.

Unfortunately, deconvolution can easily have “uneven overlap,” putting more of the metaphorical paint in some places than others. In particular, deconvolution has uneven overlap when the kernel size (the output window size) is not divisible by the stride (the spacing between points on the top). While the network could, in principle, carefully learn weights to avoid this  — as we’ll discuss in more detail later — in practice neural networks struggle to avoid it completely.

<img src="https://s3.amazonaws.com/neurokinetikz/download-2.png">
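
The uneven overlap described above can be counted directly. This is a small 1-D sketch, not code from the article or this notebook: for a transposed convolution, each input position writes to `kernel_size` consecutive outputs spaced `stride` apart, and the helper `overlap_counts` (a name chosen here) tallies how many inputs land on each output.

```python
import numpy as np

def overlap_counts(n_in, kernel_size, stride):
    """How many input positions write to each output position
    of a 1-D transposed convolution (no padding)."""
    counts = np.zeros((n_in - 1) * stride + kernel_size, dtype=int)
    for i in range(n_in):
        counts[i * stride : i * stride + kernel_size] += 1
    return counts

# kernel size 3 is not divisible by stride 2: counts alternate between 1 and 2
uneven = overlap_counts(n_in=5, kernel_size=3, stride=2)
# kernel size 4 is divisible by stride 2: the interior is covered evenly
even = overlap_counts(n_in=5, kernel_size=4, stride=2)

print(uneven)
print(even)
```

In two dimensions the alternating counts multiply along both axes, which produces the checkerboard pattern.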

To avoid these artifacts, we’d like an alternative to regular deconvolution (“transposed convolution”). Unlike deconvolution, this approach to upsampling shouldn’t have artifacts as its default behavior. Ideally, it would go further, and be biased against such artifacts.

One approach is to separate out upsampling to a higher resolution from convolution to compute features. For example, you might resize the image (using nearest-neighbor interpolation or bilinear interpolation) and then do a convolutional layer. This seems like a natural approach, and roughly similar methods have worked well in image super-resolution.

Our experience has been that nearest-neighbor resize followed by a convolution works very well, in a wide variety of contexts.

## Refiner

<img  src="https://s3.amazonaws.com/neurokinetikz/Asset+7.png">

<br>

In [8]:
def refine(x,x_g,name='refiner'):
    # 1x1 channel convolution
    c1 = Conv2D(CHANNELS,1,padding='SAME')
    current_layer = c1(x) ; generator = c1(x_g)
    
    # shortcut
    shortcut = current_layer; shortcut_g = generator
    
    # reshape convolution
    c2 = Conv2D(R_FILTERS,3,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c2(x) ; generator = c2(x_g)

    # residual layers
    for i in range(R_LAYERS):
        current_layer, generator = residual(current_layer, generator, R_ATTENTION)
    
    # output convolution
    c3 = Conv2D(CHANNELS,1,padding='SAME')
    current_layer = c3(current_layer); generator = c3(generator)
    
    # merge shortcut
    merge = Add()
    current_layer = merge([current_layer, shortcut]) ; generator = merge([generator, shortcut_g])
    
    #activate
    activation = Activation('tanh',name=name+'_dssim')
    current_layer = activation(current_layer) ; generator = activation(generator)
    
    #flatten
    flatten = Flatten(name=name)
    refiner_loss = flatten(current_layer)
    
    return current_layer, generator, refiner_loss

### Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR)

https://arxiv.org/abs/1707.02921

<img width=500 src="https://s3.amazonaws.com/neurokinetikz/Screen+Shot+2018-11-18+at+8.54.43+AM.png">

Recently, the powerful capability of deep neural networks has led to dramatic improvements in SR. Since Dong et al. [4, 5] first proposed a deep learning-based SR method, various CNN architectures have been studied for SR. Kim et al. [11, 12] first introduced the residual network for training much deeper network architectures and achieved superior performance. In particular, they showed that skip connection and recursive convolution alleviate the burden of carrying identity information in the super-resolution network. Similarly to [20], Mao et al. [16] tackled the general image restoration problem with encoder-decoder networks and symmetric skip connections. In [16], they argue that those nested skip connections provide fast and improved convergence.

https://arxiv.org/abs/1707.02921

<img src="https://s3.amazonaws.com/neurokinetikz/download-3.png">

Recently, residual networks exhibit excellent performance in computer vision problems from the low-level to high-level tasks. Although Ledig et al. successfully applied the ResNet architecture to the super-resolution problem with SRResNet, we further improve the performance by employing better ResNet structure.

We remove the batch normalization layers from our network as Nah et al.[19] presented in their image deblurring work. Since batch normalization layers normalize the features, they get rid of range flexibility from networks by normalizing the features, it is better to remove them. We experimentally show that this simple modification increases the performance substantially as detailed in

Furthermore, GPU memory usage is also sufficiently reduced since the batch normalization layers consume the same amount of memory as the preceding convolutional layers. Our baseline model without batch normalization layer saves approximately 40% of memory usage during training, compared to SRResNet. Consequently, we can build up a larger model that has better performance than conventional ResNet structure under limited computational resources.

In [9]:
def residual(x,x_g,attention=False):
    # current layer
    current_layer = x ; generator = x_g

    # shortcuts
    shortcut = current_layer ; shortcut_g = generator

    # conv 1
    c1 = Conv2D(R_FILTERS,3,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c1(current_layer) ; generator = c1(generator)
    
    # activation 1
    a1 = Activation(ACTIVATION)
    current_layer = a1(current_layer) ; generator = a1(generator)

    # conv 2
    c2 = Conv2D(R_FILTERS,3,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c2(current_layer) ; generator = c2(generator)
    
    # residual scaling
    scale = Lambda(lambda x: x * R_SCALING)
    current_layer = scale(current_layer) ; generator = scale(generator)
    
    # residual attention
    if(attention):
        current_layer, generator = residual_attention(current_layer,generator)
    
    # merge shortcut
    merge = Add()
    current_layer = merge([current_layer, shortcut]) ; generator = merge([generator, shortcut_g])

    return current_layer, generator

## Residual Attention

<img width='800' src="https://s3.amazonaws.com/neurokinetikz/Screen+Shot+2018-12-15+at+5.14.32+PM.png">

### Image Super-Resolution Using Very Deep Residual Channel Attention Networks (RCAB)

https://arxiv.org/abs/1807.02758v2

<img width="600" src="https://s3.amazonaws.com/neurokinetikz/Screen+Shot+2018-12-16+at+11.12.08+AM.png">

To make a further step, we propose channel attention (CA) mechanism to adaptively rescale each channel-wise feature by modeling the interdependencies across feature channels. Such CA mechanism allows our proposed network to concentrate on more useful channels and enhance discriminative learning ability.
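
A rough NumPy sketch of channel attention in the squeeze-and-rescale style the excerpt describes: pool each channel to one number, pass it through a small bottleneck, and use the result to rescale the channels. The weight matrices are random stand-ins for learned 1x1 convolutions, and the shapes and reduction ratio are assumptions for illustration. The notebook's own `channel_attention` further down differs in detail (it pools variance rather than the mean, uses ELU, and leaves the sigmoid to the fusion step).

```python
import numpy as np

def channel_attention(features, w_down, w_up):
    """features: (H, W, C); w_down: (C, C // r); w_up: (C // r, C)."""
    squeezed = features.mean(axis=(0, 1))            # global average pooling -> (C,)
    hidden = np.maximum(squeezed @ w_down, 0.0)      # channel reduction + ReLU
    scale = 1.0 / (1.0 + np.exp(-(hidden @ w_up)))   # expansion + sigmoid, one weight per channel
    return features * scale                          # rescale every channel

rng = np.random.default_rng(0)
channels, reduction = 16, 4
features = rng.normal(size=(4, 4, channels))
w_down = rng.normal(size=(channels, channels // reduction))
w_up = rng.normal(size=(channels // reduction, channels))

out = channel_attention(features, w_down, w_up)
print(out.shape)
```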

### Residual Attention Module for Single Image Super-Resolution (RAM)

https://arxiv.org/abs/1811.12043

<img width="500" src="https://s3.amazonaws.com/neurokinetikz/Screen+Shot+2018-12-07+at+8.51.39+AM.png">

In this paper, we propose a new attention method, which is composed of new channel-wise and spatial attention mechanisms optimized for SR and a new fused attention to combine them. Based on this, we propose a new residual attention module (RAM) and a SR network using RAM (SRRAM). We provide in-depth experimental analysis of different attention mechanisms in SR. It is shown that the proposed method can construct both deep and lightweight SR networks showing improved performance in comparison to existing state-of-the-art methods.

### BAM: Bottleneck Attention Module 

https://arxiv.org/abs/1807.06514v2

<img width="600" src="https://s3.amazonaws.com/neurokinetikz/Screen+Shot+2018-12-16+at+10.15.09+AM.png">

In this work, we focus on the effect of attention in general deep neural networks. We propose a simple and effective attention module, named Bottleneck Attention Module (BAM), that can be integrated with any feed-forward convolutional neural networks. Our module infers an attention map along two separate pathways, channel and spatial. We place our module at each bottleneck of models where the downsampling of feature maps occurs. Our module constructs a hierarchical attention at bottlenecks with a number of parameters and it is trainable in an end-to-end manner jointly with any feed-forward models.

As the channels of feature maps can be regarded as feature detectors, the two branches (spatial and channel) explicitly learn ‘what’ and ‘where’ to focus on.
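
The two branches produce maps of different shapes, one value per channel and one value per pixel, and combining them relies on broadcasting. The sketch below shows only that fusion step in NumPy, with random arrays standing in for the branch outputs; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_attention(features, channel_logits, spatial_logits):
    """features: (H, W, C); channel_logits: (1, 1, C); spatial_logits: (H, W, 1)."""
    # the sum broadcasts to (H, W, C): 'what' and 'where' combined per element
    attention = sigmoid(channel_logits + spatial_logits)
    return features * attention

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
channel_logits = rng.normal(size=(1, 1, 8))   # stand-in for the channel branch
spatial_logits = rng.normal(size=(4, 4, 1))   # stand-in for the spatial branch

out = fuse_attention(features, channel_logits, spatial_logits)
print(out.shape)
```

This matches the order of operations in `residual_attention` below (add, sigmoid, multiply); there the skip connection is added afterwards in `residual`.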

<img src="https://s3.amazonaws.com/neurokinetikz/Asset+8.png">

<br>

In [10]:
def residual_attention(x,x_g):
    # current layer
    current_layer = x ; generator = x_g
    
    # shortcuts
    shortcut = current_layer ; shortcut_g = generator
    
    # channel attention
    ca, ca_g = channel_attention(current_layer, generator)
    
    # spatial attention
    sa, sa_g = spatial_attention(current_layer, generator)
    
    # fuse channel and spatial attention
    fuse = Add()
    current_layer = fuse([ca,sa]); generator = fuse([ca_g,sa_g])
    
    # sigmoid activation
    s = Activation("sigmoid")
    current_layer = s(current_layer) ; generator = s(generator)
    
    # merge fused attention with shortcut
    m = Multiply()
    current_layer = m([current_layer,shortcut]) ; generator = m([generator, shortcut_g])
    
    return current_layer, generator

In [11]:
def channel_attention(x,x_g):
    # current layer
    current_layer = x ; generator = x_g
    
    # global variance pooling
    gvp = GlobalVariancePooling2D(); reshape = Reshape((1,1,R_FILTERS))
    current_layer = reshape(gvp(current_layer)); generator = reshape(gvp(generator))
    
    # squeeze
    squeeze = Conv2D(int(R_FILTERS/R_REDUCTION),1,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = squeeze(current_layer); generator = squeeze(generator)
    
    # excitation
    a1 = Activation(ACTIVATION)
    current_layer = a1(current_layer); generator = a1(generator)
    
    # scaling
    c2 = Conv2D(R_FILTERS,1,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c2(current_layer) ; generator = c2(generator)
        
    return current_layer, generator

In [12]:
def spatial_attention(x,x_g):
    # current layer
    current_layer = x ; generator = x_g
    
    # 1x1 convolution
    c1 = Conv2D(int(R_FILTERS/R_REDUCTION),1,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c1(current_layer); generator = c1(generator)

    # dilated convolution
    c2 = Conv2D(int(R_FILTERS/R_REDUCTION),3,dilation_rate=R_DILATION,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c2(current_layer) ; generator = c2(generator)

    # dilated convolution
    c3 = Conv2D(int(R_FILTERS/R_REDUCTION),3,dilation_rate=R_DILATION,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c3(current_layer) ; generator = c3(generator)

    # 1x1 convolution
    c4 = Conv2D(1,1,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c4(current_layer); generator = c4(generator)
    
    # group normalization
    gn = GroupNormalization(groups=1,axis=-1)
    current_layer = gn(current_layer); generator = gn(generator)
    
    return current_layer, generator

### Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network 

https://arxiv.org/abs/1609.05158

<img width="75%" src="https://s3.amazonaws.com/neurokinetikz/Screen+Shot+2018-11-23+at+10.26.45+AM.png">

Recently, several models based on deep neural networks have achieved great success in terms of both reconstruction accuracy and computational performance for single image super-resolution. In these methods, the low resolution (LR) input image is upscaled to the high resolution (HR) space using a single filter, commonly bicubic interpolation, before reconstruction. This means that the super-resolution (SR) operation is performed in HR space. We demonstrate that this is sub-optimal and adds computational complexity. 

In this paper, we present the first convolutional neural network (CNN) capable of real-time SR of 1080p videos on a single K2 GPU. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space. In addition, we introduce an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. By doing so, we effectively replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map, whilst also reducing the computational complexity of the overall SR operation. 

We evaluate the proposed approach using images and videos from publicly available datasets and show that it performs significantly better (+0.15dB on Images and +0.39dB on Videos) and is an order of magnitude faster than previous CNN-based methods.
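
The rearrangement at the heart of the sub-pixel layer can be sketched in NumPy: a low-resolution map with `C * r * r` channels is reshuffled into a map that is `r` times larger in each spatial dimension with `C` channels. The channel ordering below follows the depth-to-space convention; that the `SubPixelUpscaling` layer used in this notebook orders channels the same way is an assumption, and `pixel_shuffle` is a name chosen for this sketch.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (H, W, C * r * r) into (H * r, W * r, C)."""
    h, w, c_in = x.shape
    c = c_in // (r * r)
    x = x.reshape(h, w, r, r, c)       # split channels into an r x r block per pixel
    x = x.transpose(0, 2, 1, 3, 4)     # (h, r, w, r, c): interleave blocks with pixels
    return x.reshape(h * r, w * r, c)

lr = np.arange(2 * 2 * 4).reshape(2, 2, 4)   # 2x2 map, 4 channels = 1 channel * 2 * 2
hr = pixel_shuffle(lr, r=2)                  # 4x4 map, 1 channel

print(hr[:, :, 0])
```

No values are created or discarded; the convolution before the shuffle is what learns the upscaling filters. In `upsample` below, the first convolution outputs `R_FILTERS*SCALE_FACTOR*SCALE_FACTOR` channels for exactly this reason.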

In [13]:
def upsample(x,x_g,img_g,name='super'):
    
    # current layer
    current_layer = x ; generator = x_g; img_generator = img_g
    
    # convolution
    c1 = Conv2D(R_FILTERS*SCALE_FACTOR*SCALE_FACTOR,3,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c1(current_layer) ; generator = c1(generator) ; img_generator = c1(img_generator)
    
    # activation
    a1 = Activation(ACTIVATION)
    current_layer = a1(current_layer) ; generator = a1(generator); img_generator = a1(img_generator)
    
    # sub-pixel upscaling
    upscale = SubPixelUpscaling(scale_factor=SCALE_FACTOR)
    current_layer = upscale(current_layer); generator = upscale(generator); img_generator = upscale(img_generator)
    
    # In practice, it is useful to have a second convolution layer after the 
    # SubPixelUpscaling layer to speed up the learning process.
    c2 = Conv2D(R_FILTERS*SCALE_FACTOR*SCALE_FACTOR,3,padding='SAME',kernel_initializer=INITIALIZER)
    current_layer = c2(current_layer) ; generator = c2(generator); img_generator = c2(img_generator)
    
    # activation
    a2 = Activation(ACTIVATION)
    current_layer = a2(current_layer) ; generator = a2(generator); img_generator = a2(img_generator)
    
    # convolution
    c3 = Conv2D(CHANNELS,1,padding='SAME')
    current_layer = c3(current_layer); generator = c3(generator); img_generator = c3(img_generator)
    
    # activation
    a3 = Activation('tanh',name=name+"_dssim")
    current_layer = a3(current_layer) ; generator = a3(generator); img_generator = a3(img_generator)
    
    # flatten
    flatten = Flatten(name=name)
    upscale_loss = flatten(current_layer)
    
    return current_layer, generator, img_generator, upscale_loss

## Model

### Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) 

https://arxiv.org/abs/1511.07289

<img width="300" style="float:left;" src="https://blogs.mathworks.com/deep-learning/files/2017/12/defining_elu_layer_01.png">

We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However, ELUs have improved learning characteristics compared to the units with other activation functions. 

In contrast to ReLUs, ELUs have negative values which allows them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information. 

Therefore, ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers. 
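
For reference, the ELU itself is the identity for positive inputs and `alpha * (exp(x) - 1)` otherwise, so it saturates at `-alpha` for very negative inputs. A minimal NumPy sketch, assuming the common default `alpha = 1.0`:

```python
import numpy as np

def elu(x, alpha=1.0):
    """Identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps expm1 from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

values = elu(np.array([-100.0, -1.0, 0.0, 2.0]))
print(values)
```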

In [14]:
# default activation
ACTIVATION  = 'elu'

### Hyper-parameters in Action! Part II — Weight Initializers 

https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404

<img width="300" style="float:left;" src="https://cdn-images-1.medium.com/max/1600/1*WLUL_bcjsNK9sXNw6nC-cg.png">

If you dug a little bit deeper, you’ve likely also found out that one should use Xavier / Glorot initialization if the activation function is a Tanh, and that He initialization is the recommended one if the activation function is a ReLU.

In summary, for a ReLU activated network, the He initialization scheme using an Uniform distribution is a pretty good choice.
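
As a sketch of what He uniform initialization does: weights are drawn uniformly from `[-limit, limit]` with `limit = sqrt(6 / fan_in)`, which gives a variance of `2 / fan_in`. The helper below is a NumPy illustration, not the Keras implementation; the kernel shape is an example matching a 3x3 convolution with 64 input and 64 output channels.

```python
import numpy as np

def he_uniform(shape, rng):
    """shape: (kernel_h, kernel_w, in_channels, out_channels)."""
    fan_in = int(np.prod(shape[:-1]))   # inputs feeding each output unit
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=shape), limit, fan_in

rng = np.random.default_rng(0)
weights, limit, fan_in = he_uniform((3, 3, 64, 64), rng)

print(fan_in, limit, weights.var(), 2.0 / fan_in)
```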



In [15]:
# initializers
INITIALIZER = 'he_uniform'

## Learning to Generate Images with Perceptual Similarity Metrics

https://arxiv.org/abs/1511.06409

<img width="200" style="float:left;" src="https://s3.amazonaws.com/neurokinetikz/download-4.png">

In this paper, we explore loss functions that, unlike MSE, MAE, and likelihoods, are grounded in human perceptual judgments. We show that these perceptual losses lead to representations that are superior to other methods, both with respect to reconstructing given images, and generating novel ones. This superiority is demonstrated both in quantitative studies and human judgements ... We (also) demonstrate that perceptual losses yield a convincing win when applied to a state-of-the-art architecture for single image super-resolution.

As observed in the deterministic case, MS-SSIM is better at capturing fine details than either MSE or MAE.
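
To give a feel for what a structural-similarity loss measures, here is a deliberately simplified NumPy sketch. It computes SSIM from whole-image statistics in a single window, whereas practical implementations (including the `DSSIMObjective` used later in this notebook) slide a local window over the image, and MS-SSIM additionally works at several scales. DSSIM is taken here as `(1 - SSIM) / 2`; the constants `k1`, `k2` follow the usual SSIM defaults and `data_range=2.0` assumes pixels scaled to [-1, 1].

```python
import numpy as np

def global_ssim(a, b, data_range=2.0, k1=0.01, k2=0.03):
    """SSIM from whole-image means, variances and covariance (single window)."""
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def dssim(a, b):
    """Structural dissimilarity: 0 for identical images, larger when structure differs."""
    return (1.0 - global_ssim(a, b)) / 2.0

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(32, 32))
noisy = np.clip(image + rng.normal(scale=0.3, size=image.shape), -1.0, 1.0)

print(dssim(image, image), dssim(image, noisy))
```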

## Training

In [17]:
EPOCHS      = 12501
BATCH_SIZE  = 4

MODEL_STEPS = 50
GIF_STEPS   = 10

SAMPLES =  np.random.permutation(FLAT)[:9]

### Callbacks

In [16]:
def gifit(epoch=None):
    if (epoch % GIF_STEPS == 0):
        print('saving gif ...')
        z,y,yc,i,ic,s,sc = AUTOENCODER.predict_on_batch(SAMPLES)
        img = np.clip(127.5*(s+1).reshape((-1, SCALE_FACTOR*SIZE, SCALE_FACTOR*SIZE, CHANNELS)), 0, 255)
        RECONS.append(utils.montage(img).astype(np.uint8))
        
def saveit(epoch=None):
    if ((epoch > 0) and (epoch % MODEL_STEPS == 0)):
        print('saving model ...')
        AUTOENCODER.save(MODEL_NAME+'-autoencoder-model.h5')
        ENCODER.save(MODEL_NAME+'-encoder-model.h5')
        DECODER.save(MODEL_NAME+'-generator-model.h5')
        SUPER.save(MODEL_NAME+'-super-model.h5')
        SUPERSIZER.save(MODEL_NAME+'-supersizer-model.h5')
        print('done')
       
        
# callbacks
giffer = LambdaCallback(on_epoch_end=lambda epoch, logs: gifit(epoch))
saver = LambdaCallback(on_epoch_end=lambda epoch, logs: saveit(epoch))

### Model Architecture 

<img width ="1024" src="https://s3.amazonaws.com/neurokinetikz/Layer+1.png">

In [18]:
# Encoder/Decoder
if (SIZE == 256):
    FILTERS = [64,80,96,128,160,192]
    
elif (SIZE == 128):
    FILTERS = [64,96,128,160,192]
    
elif (SIZE == 64):
    FILTERS = [64,96,128,160]
    
elif (SIZE == 32):
    FILTERS = [64,96,128]

# Residuals
R_LAYERS  = 16
R_FILTERS = 64
R_SCALING = 0.01

# Attention Modules
R_ATTENTION = True
R_REDUCTION = 16
R_DILATION = 4

# Latent dimension size
LATENT_DIM = 1024
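
The filter lists above appear to be chosen so that every supported `SIZE` reaches the same 4x4 spatial bottleneck: `encode` applies one 2x2 max pool per entry in `FILTERS`, halving height and width each time. The sketch below checks that arithmetic; `FILTER_TABLE` and `bottleneck_shape` are names introduced here for illustration and simply restate the `if`/`elif` chain.

```python
FILTER_TABLE = {
    256: [64, 80, 96, 128, 160, 192],
    128: [64, 96, 128, 160, 192],
    64: [64, 96, 128, 160],
    32: [64, 96, 128],
}

def bottleneck_shape(size, filters):
    """Shape of the last encoder feature map: one 2x2 max pool per stage."""
    side = size // (2 ** len(filters))
    return (side, side, filters[-1])

shapes = {size: bottleneck_shape(size, f) for size, f in FILTER_TABLE.items()}
for size, shape in shapes.items():
    # flattened length is what feeds the Dense(LATENT_DIM) layer in encode()
    print(size, shape, shape[0] * shape[1] * shape[2])
```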

### Model Definition 

In [19]:
# input
X = Input(shape=(FEATURES,))

# encode
Z, shape = encode(X)

# decoder input
Z_G = Input(shape=(LATENT_DIM,))

# decode
Y, Y_G, Y_F = decode(Z,Z_G,shape)

# refine
IMG, IMG_G, IMG_F = refine(Y,Y_G)

# supersizer input
X_G = Input(shape=(SIZE,SIZE,CHANNELS))

# supersize
IMG_S, IMG_G_S, X_G_S, IMG_S_F = upsample(IMG,IMG_G,X_G)


# model definitions
ENCODER = Model(inputs=[X], outputs=[Z])
DECODER = Model(inputs=[Z_G], outputs=[IMG_G])
SUPER = Model(inputs=[Z_G], outputs=[IMG_G_S])
SUPERSIZER = Model(inputs=[X_G], outputs=[X_G_S])

# define optimizer
ADAM = optimizers.Adam(amsgrad=True)

# compile models
ENCODER.compile(optimizer=ADAM,loss='mse')
DECODER.compile(optimizer=ADAM,loss='mae')
SUPER.compile(optimizer=ADAM,loss='mae')
SUPERSIZER.compile(optimizer=ADAM,loss='mae')


# define autoencoder
AUTOENCODER = Model(inputs=[X], outputs=[Z,Y,Y_F,IMG,IMG_F,IMG_S,IMG_S_F])

# define losses
losses = {'encoder':vae_loss,
          'decoder':'mse',
          'decoder_dssim':DSSIMObjective(),
          'refiner':'mae',
          'refiner_dssim':DSSIMObjective(),
          'super':'mae',
          'super_dssim':DSSIMObjective()}

# compile model
AUTOENCODER.compile(optimizer=ADAM,loss=losses)

# print summary
AUTOENCODER.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 196608)       0                                            
__________________________________________________________________________________________________
reshape_1 (Reshape)             (None, 256, 256, 3)  0           input_1[0][0]                    
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 256, 256, 64) 1792        reshape_1[0][0]                  
__________________________________________________________________________________________________
group_normalization_1 (GroupNor (None, 256, 256, 64) 128         conv2d_1[0][0]                   
__________________________________________________________________________________________________
activation

## Training

In [None]:
RECONS = []

# fit model
AUTOENCODER.fit(x=FLAT,
                y=[FLAT,IMGS,FLAT,IMGS,FLAT,IMGS_2X,FLAT_2X],
                batch_size=BATCH_SIZE,
                epochs=EPOCHS,
                callbacks=[giffer,saver])

# save training gif
gif.build_gif(RECONS, saveto=MODEL_NAME+'-final'+ "-"+str(time.time())+'.gif')

print("done")

## Load Models

In [None]:
MODEL_NAME = 'roadtrip-64-1545070403.8379235'

print('loading encoder ...', MODEL_NAME)
ENCODER = load_model(MODEL_NAME+'-encoder-model.h5')

print('loading decoder ...')
DECODER = load_model(MODEL_NAME+'-generator-model.h5', custom_objects={'R_SCALING':R_SCALING,'GlobalVariancePooling2D':GlobalVariancePooling2D})

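print('loading supersizer ...')
SUPERSIZER = load_model(MODEL_NAME+'-supersizer-model.h5')

# reconstruct() below also calls SUPER; load it from the file written by saveit()
# (assumes it needs the same custom objects as the decoder it contains)
print('loading super ...')
SUPER = load_model(MODEL_NAME+'-super-model.h5', custom_objects={'R_SCALING':R_SCALING,'GlobalVariancePooling2D':GlobalVariancePooling2D})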

print("done")

## Reconstruction 


<img width=600 src="https://s3.amazonaws.com/neurokinetikz/Screen+Shot+2018-11-13+at+12.17.10+PM.png">


In [None]:
def reconstruct(index=0):
    
    # input
    x = np.reshape(FLAT[index],(-1,FEATURES))
    z = ENCODER.predict_on_batch(x)
    
    # output
    img = np.reshape(DECODER.predict_on_batch(z),(-1,FEATURES))
    img_s = np.reshape(SUPER.predict_on_batch(z),(-1,FEATURES*SCALE_FACTOR*SCALE_FACTOR))
    
    # reference
    ref = IMGS[index]/2 + .5
    ref_s = IMGS_2X[index]/2 + .5
    
    # denormalize
    img = np.reshape(img/2 + .5,(SIZE,SIZE,CHANNELS))
    img_s= np.reshape(img_s/2 + .5,(SCALE_FACTOR*SIZE,SCALE_FACTOR*SIZE,CHANNELS))
    
    # print scores (PSNR and MS-SSIM at native and super-resolved scale)
    shape = (1,SIZE,SIZE,CHANNELS)
    shape_s = (1,SCALE_FACTOR*SIZE,SCALE_FACTOR*SIZE,CHANNELS)
    msssim = utils.MultiScaleSSIM(np.reshape(ref,shape), np.reshape(img,shape), max_val=1.)
    msssim_s = utils.MultiScaleSSIM(np.reshape(ref_s,shape_s), np.reshape(img_s,shape_s), max_val=1.)
    print("PSNR: %.3f %.3f <> MS-SSIM: %.3f %.3f" % (utils.psnr(ref,img), utils.psnr(ref_s,img_s), msssim, msssim_s))
    
    # show images
    utils.showImagesHorizontally(images=[ref,img])
    utils.showImagesHorizontally(images=[ref_s,img_s])
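
`reconstruct` maps images from `[-1, 1]` to `[0, 1]` before scoring them. The body of `utils.psnr` is not shown in this notebook, so the sketch below uses the standard PSNR definition, `10 * log10(max_val^2 / MSE)`, as an assumption about what it computes. The names `denormalize` and `psnr` are illustrative only.

```python
import numpy as np


def denormalize(x):
    """Map values from [-1, 1] to [0, 1], like `x/2 + .5` in reconstruct()."""
    return np.asarray(x, dtype=np.float64) / 2 + 0.5


def psnr(ref, img, max_val=1.0):
    """Standard peak signal-to-noise ratio in dB (assumed, not taken from utils)."""
    ref = np.asarray(ref, dtype=np.float64)
    img = np.asarray(img, dtype=np.float64)
    mse = np.mean((ref - img) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)


# toy data standing in for a reference image and a noisy reconstruction
rng = np.random.default_rng(0)
ref = rng.uniform(-1, 1, size=(8, 8, 3))
noisy = np.clip(ref + rng.normal(0, 0.1, size=ref.shape), -1, 1)

print("PSNR: %.3f" % psnr(denormalize(ref), denormalize(noisy)))
```

Higher values mean the reconstruction is closer to the reference. Under this definition, a uniform error of 0.1 on a `[0, 1]` image gives 20 dB.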

In [None]:
reconstruct(random.randint(0,TOTAL_BATCH-1))

## Latent Animation

http://www.youtube.com/watch?v=grEi3uRlSb4
<table><tr>
<td><img src="https://s3.amazonaws.com/neurokinetikz/1542113809.6163912-109.gif"></td>
<td><img src="https://s3.amazonaws.com/neurokinetikz/1542113809.6163912-100.gif"></td>
<td><img src="https://s3.amazonaws.com/neurokinetikz/1542113809.6163912-080.gif"></td>
</tr></table>


In [None]:
def random_latents(n_imgs=3,steps=30):
    rimgs = np.random.permutation(FLAT)[:n_imgs]
    rimgs = np.append(rimgs, [rimgs[0]],axis=0)
    latent_animation(rimgs,steps,filename=str(time.time()))

def latent_animation(imgs=None,steps=None,filename="latent-animation"):
    animate(generate(get_latents(imgs,steps),filename),filename)
    
def get_latents(imgs,steps):
    # get latent encodings for images
#     print('getting latent vectors ...')
    latents = []
    for index,img in enumerate(imgs):
        img = np.reshape(img,(-1,FEATURES))
        latent = ENCODER.predict_on_batch(img)
        latents.append(latent)

    # calculate latent path
#     print('calculating latent path ...')
    latent_path = []
    for i in range(len(latents)-1):
        # get latent vectors
        l1 = latents[i] ; l2 = latents[i+1]

        # calculate latent distance
        image_distance = l2 - l1

        # create the latent path
        for j in range(steps):
            latent_path.append(l1 + j*image_distance/steps)
        latent_path.append(l2)
    
    return latent_path
       
    
def generate(latent_path,filename=None):
     # reconstruct images along the path
#     print('reconstructing latent paths... ')
    latent_path = np.reshape(latent_path,(-1,LATENT_DIM))
    
#     print('decoding ...')
    recons = DECODER.predict_on_batch(latent_path)
    
    if filename is not None:
        print('saving decoder gif')
        build_gif(np.asarray(recons),SIZE,filename)
    
    return recons
    
def animate(recons,filename):
    print('supersizing ...')
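    chunks = SUPERSIZER.predict_on_batch(recons[:10])
    # walk through every remaining chunk of 10 frames, including a short final one,
    # so the tail of the animation is not dropped
    for i in range(10,len(recons),10):
        s2 = SUPERSIZER.predict_on_batch(recons[i:i+10])
        chunks = np.concatenate([chunks,s2])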
    
    print('saving supersizer gif')
    build_gif(chunks,SIZE*SCALE_FACTOR,filename+"-"+str(SCALE_FACTOR)+"x")
   
    # done
    print(filename)
    
def build_gif(recons,size,filename='latent-animation'):
    final = np.clip((127.5*(recons+1)).reshape((-1,size,size,CHANNELS)),0,255)
    gif.build_gif([utils.montage([r]).astype(np.uint8) for r in final], saveto=filename+".gif",dpi=72)
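
The animation functions above do two things: linearly interpolate between encoded keyframes, then push the decoded frames through the supersizer in chunks of 10. The following is a minimal NumPy-only sketch of both steps, with a toy function standing in for the Keras models. `interpolate_path` and `predict_in_chunks` are hypothetical names. Unlike `get_latents`, this variant does not repeat the keyframe shared by two neighbouring segments.

```python
import numpy as np


def interpolate_path(latents, steps):
    """Linear path through `latents`: `steps` frames per segment, then the last keyframe."""
    latents = [np.asarray(l, dtype=np.float64).reshape(-1) for l in latents]
    path = []
    for start, end in zip(latents[:-1], latents[1:]):
        # endpoint=False avoids emitting the shared keyframe twice
        for t in np.linspace(0.0, 1.0, steps, endpoint=False):
            path.append((1.0 - t) * start + t * end)
    path.append(latents[-1])
    return np.stack(path)


def predict_in_chunks(predict, frames, chunk=10):
    """Apply `predict` to consecutive slices, including a short final slice."""
    outputs = [predict(frames[i:i + chunk]) for i in range(0, len(frames), chunk)]
    return np.concatenate(outputs)


# three keyframes forming a loop, like random_latents() appending rimgs[0] at the end
keys = [np.zeros(4), np.ones(4), np.zeros(4)]
path = interpolate_path(keys, steps=5)

# toy "model" that doubles its input, standing in for SUPERSIZER.predict_on_batch
out = predict_in_chunks(lambda batch: batch * 2.0, path, chunk=4)

print(path.shape, out.shape)
```

Stepping the chunk loop over `range(0, len(frames), chunk)` keeps the output the same length as the input, even when the frame count is not a multiple of the chunk size.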

In [None]:
LATENT_DIM=1024
for i in range(100):
    random_latents(3,50)

In [None]:
grid=[]
GRID = 100
N_IMGS = 4
STEPS = 50

for i in range(GRID):
    imgs = np.random.permutation(FLAT)[:N_IMGS]
    imgs = np.append(imgs, [imgs[0]],axis=0)
    
    latent_path = get_latents(imgs,STEPS)
    recons = generate(latent_path)
    final = np.clip((127.5*(recons+1)).reshape((-1,SIZE,SIZE,CHANNELS)),0,255)
    grid.append(final)
    
print(np.asarray(grid).shape)


# reorder from (grid cell, frame, ...) to (frame, grid cell, ...) so that
# each GIF frame is a montage of all grid cells at the same time step
rs = np.moveaxis(grid,1,0)

print(np.asarray(rs).shape)
    
gif.build_gif([utils.montage(r).astype(np.uint8) for r in rs], saveto=str(time.time())+".gif",dpi=72)

print('done')
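
The grid cell above collects one animation per grid position and then swaps the first two axes, so each GIF frame can be built as a montage of every position at the same time step. Here is a small sketch of that `np.moveaxis` call on toy data. The lowercase dimension names are placeholders and do not match the notebook's constants.

```python
import numpy as np

# toy dimensions: 3 grid cells, 5 frames each, 2x2 single-channel images
n_cells, n_frames, size, channels = 3, 5, 2, 1

# shape (cell, frame, height, width, channel), like np.asarray(grid) above
grid = np.arange(n_cells * n_frames * size * size * channels).reshape(
    n_cells, n_frames, size, size, channels)

# move the frame axis to the front: (frame, cell, height, width, channel)
by_frame = np.moveaxis(grid, 1, 0)

print(grid.shape, '->', by_frame.shape)

# frame 2 of cell 1 is the same image in both layouts
assert np.array_equal(by_frame[2][1], grid[1][2])
```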

In [None]:
imgs =  np.random.permutation(FLAT)
t = str(time.time())
# stop one short of the end so that imgs[i+1] stays in range
for i in range(len(imgs)-1):
    print(i)
    latent_animation([imgs[i],imgs[i+1]],25,filename=t+'-'+ ('%03d' % i))

## Load Model & Continue Training 

In [None]:
print('load model ...')
AUTOENCODER = load_model(MODEL_NAME+'-autoencoder-model.h5',
                         custom_objects={'vae_loss': vae_loss, 
                                         'R_SCALING':R_SCALING, 
                                         'DSSIMObjective':DSSIMObjective()})

# define encoder
ENCODER = Model(inputs=[AUTOENCODER.input], outputs=[AUTOENCODER.get_layer("encoder").output])

# define generator
Z = Input(shape=(LATENT_DIM,))
GENERATOR = Model(inputs=[Z], outputs=[AUTOENCODER.get_layer("generator")(Z)])

ENCODER.compile(optimizer=ADAM,loss='mse')
GENERATOR.compile(optimizer=ADAM,loss='mse')

print('resume training ...')
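# the saved autoencoder has seven outputs, so pass the same seven targets as in Training
AUTOENCODER.fit(x=FLAT,
                y=[FLAT,IMGS,FLAT,IMGS,FLAT,IMGS_2X,FLAT_2X],
                batch_size=BATCH_SIZE,
                epochs=EPOCHS,
                callbacks=[giffer,saver])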