# Unsupervised Learning for Physical Interaction through Video Prediction

- Authors: C. Finn, I. Goodfellow, S. Levine.
- Year: 2016

## Abstract

Given an initial state (e.g., the position of the robot) and an action to execute (e.g., push a block from x1, y1 to x2, y2), the model predicts the physical interaction before the action is executed.

## Models

The models learn the physics of motion rather than object appearance, which lets them generalize to unseen objects.

Three models are proposed in this paper:
    1. Dynamic Neural Advection (DNA)
        - For pixels whose motion is constrained to a local region;
        - Outputs a distribution over locations in the previous frame for each pixel in the new frame;
        - The predicted pixel value is the expectation under that distribution.
    2. Convolutional Dynamic Neural Advection (CDNA)
        - Variant of DNA;
        - Outputs multiple normalized convolution kernels to apply to the previous image to compute new pixel values.
    3. Spatial Transformer Predictors (STP)
        - Outputs the parameters of multiple affine transformations to apply to the previous image;
        - The predicted transformations handle separate objects.
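
As a rough illustration of the DNA idea, the NumPy sketch below computes each new pixel as the expectation of previous-frame pixels under a per-pixel distribution over a local 5 x 5 neighbourhood. The function name `dna_predict`, the zero padding, and the single-channel image are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def dna_predict(prev_frame, kernels):
    """prev_frame: (H, W). kernels: (H, W, k, k), each k x k slice
    non-negative and summing to 1 (one distribution per output pixel)."""
    height, width = prev_frame.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(prev_frame, pad, mode="constant")  # zero padding (assumption)
    out = np.zeros((height, width), dtype=np.float64)
    for y in range(height):
        for x in range(width):
            patch = padded[y:y + k, x:x + k]          # local neighbourhood
            out[y, x] = np.sum(patch * kernels[y, x])  # expectation
    return out

prev_frame = np.arange(16, dtype=np.float64).reshape(4, 4)
# Put all probability mass on the left neighbour of every pixel:
# the content of the frame moves one pixel to the right.
kernels = np.zeros((4, 4, 5, 5))
kernels[:, :, 2, 1] = 1.0
next_frame = dna_predict(prev_frame, kernels)
```

With this one-hot distribution the prediction is a pure shift; a softer distribution would blend neighbouring pixels instead.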

## Architecture

The core trunk of each model is made of:
    1. One 5x5 convolution with a stride of 2;
    2. Seven convolutional LSTMs;
    3. A full-resolution mask for compositing the various transformed predictions (CDNA and STP only);
    4. Two skip connections that preserve high-resolution information:
        - LSTM 1 to convolution 2;
        - LSTM 3 to LSTM 7.
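
The compositing step in item 3 can be sketched in NumPy as a per-pixel weighted sum of the transformed images. The helper name `composite` and the use of a softmax across the candidate images (so that the masks sum to 1 at every pixel) are assumptions for this sketch.

```python
import numpy as np

def composite(transformed, mask_logits):
    """transformed: (N, H, W, C) candidate images.
    mask_logits: (N, H, W) unnormalized mask values, one mask per candidate."""
    # Softmax over the N candidates, computed in a numerically stable way
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)
    masks = np.exp(shifted)
    masks /= masks.sum(axis=0, keepdims=True)
    # Weighted sum of the candidates at every pixel
    image = np.sum(masks[..., None] * transformed, axis=0)
    return image, masks

# Two candidates: an all-0 image and an all-2 image, with equal mask weights
transformed = np.stack([np.zeros((2, 2, 3)), np.full((2, 2, 3), 2.0)])
mask_logits = np.zeros((2, 2, 2))
image, masks = composite(transformed, mask_logits)
```

Equal logits give each candidate a weight of 0.5, so the composite here is the average of the two images.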

The main differences between the three models are:
    1. CDNA:
        - Ten filters of size 5 x 5 are created in the transformation area and normalized via a spatial softmax;
        - The spatial softmax turns each filter into a distribution over locations in the previous image, from which the expected pixel value of the new image is computed;
        - The transformation corresponds to a convolution.
    2. STP:
        - Ten affine transformation matrices of size 3 x 2 (applied with a [spatial transformer](https://arxiv.org/abs/1506.02025));
        - The transformations are applied to the preceding image to create 10 separate transformed images;
        - The transformation corresponds to an affine transformation.
    3. DNA:
        - No filters for the transformation;
        - The transformation parameters are output at the last layer, in the same place as the mask;
        - The transformation corresponds to a 5 x 5 convolutional kernel.
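
To make the CDNA normalization concrete, here is a NumPy sketch that mirrors the `reduce_sum` over axes `[1, 2, 3]` used in the TensorFlow snippet of these notes. Dividing positive values by their sum is assumed here in place of a full softmax, and `apply_kernel` is a hypothetical helper written for the example.

```python
import numpy as np

rng = np.random.RandomState(0)
# (batch, height, width, in_channels, num_kernels), kept strictly positive
cdna_kerns = rng.rand(32, 5, 5, 1, 10) + 1e-3
# Sum over the spatial and channel axes, one factor per (sample, kernel)
norm_factor = cdna_kerns.sum(axis=(1, 2, 3), keepdims=True)
cdna_kerns = cdna_kerns / norm_factor  # every kernel now sums to 1

def apply_kernel(image, kernel):
    """Cross-correlate a 2D image with a 2D kernel (no padding)."""
    k = kernel.shape[0]
    h, w = image.shape
    out = np.empty((h - k + 1, w - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + k, x:x + k] * kernel)
    return out

# A normalized kernel leaves a constant image unchanged
image = np.full((8, 8), 3.0)
filtered = apply_kernel(image, cdna_kerns[0, :, :, 0, 0])
```

Because each kernel sums to 1, applying it moves intensity around without changing the overall brightness.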

## Non-deterministic behaviour in TensorFlow

For some operations, TensorFlow is non-deterministic (the same result is not guaranteed on every run) on both the GPU and the CPU. In particular, in our case, the function `reduce_sum` may return a different result on each execution. To test that, the code below was written.

In [None]:
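def debug(tensor, session, feed_dict):
    # Evaluate the tensor and return its first element
    arr = tensor.eval(session=session, feed_dict=feed_dict)
    return arr[0][0][0][0][0]


stop = 1500
i = 0
while i < stop:
    # keep_dims is the TensorFlow 1.x argument name
    norm_factor = tf.reduce_sum(cdna_kerns, [1, 2, 3], keep_dims=True)
    # Debug.push and deb are project helpers that are not shown in this notebook
    Debug.push(norm_factor, "Norm factor 1", deb)
    i += 1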
    
# cdna_kerns is an array of shape (32, 5, 5, 1, 10)
# The average result at arr[0][0][0][0][0] is "", min is "" and max is ""
# The result is supposed to be 19.780447
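
A plausible explanation, assumed here rather than verified in these notes, is that floating-point addition is not associative, so a reduction that adds the same values in a different order (for example across parallel threads) can round differently. The minimal NumPy sketch below uses a hypothetical helper, `sum_in_order`, to show the effect in `float32`.

```python
import numpy as np

def sum_in_order(values, order):
    """Accumulate float32 values one by one, in the given index order."""
    total = np.float32(0.0)
    for idx in order:
        total = np.float32(total + values[idx])
    return total

values = np.array([1e8, 1.0, -1e8], dtype=np.float32)
forward = sum_in_order(values, [0, 1, 2])    # (1e8 + 1) - 1e8
reordered = sum_in_order(values, [0, 2, 1])  # (1e8 - 1e8) + 1
print(forward, reordered)
```

In `float32` the spacing between representable numbers near 1e8 is 8, so adding 1 to 1e8 is lost to rounding in the first order but survives in the second.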

## TensorFlow to Chainer and Numpy conversion

Below are the conversions from TensorFlow functions to their Chainer and NumPy equivalents.

In [47]:
import numpy as np
import tensorflow as tf
from tensorflow.contrib.layers.python import layers as tf_layers
import chainer

def print_tf_shape(tensor):
    print("[TF] Shape is {}".format(tensor.get_shape()))

def print_ch_shape(variable):
    print("[Chainer] Shape is {}".format(variable.shape))

def print_np_shape(array):
    print("[Numpy] Shape is {}".format(array.shape))

In [14]:
# Create a Tensor/Variable
x = np.arange(9.0)
print(x)
tf_res = tf.constant(x)
ch_res = chainer.variable.Variable(x)
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[ 0.  1.  2.  3.  4.  5.  6.  7.  8.]
[TF] Shape is (9,)
[Chainer] Shape is (9,)


In [21]:
# Split an array into multiple sub-arrays
x = np.random.randint(0, 255,(2,6))
tf_res = tf.split(axis=1, num_or_size_splits=2, value=x)
ch_res = chainer.functions.split_axis(chainer.variable.Variable(x), indices_or_sections=2, axis=1)
print(len(tf_res))
print_tf_shape(tf_res[0])
print_tf_shape(tf_res[1])
print(len(ch_res))
print_ch_shape(ch_res[0])
print_ch_shape(ch_res[1])

2
[TF] Shape is (2, 3)
[TF] Shape is (2, 3)
2
[Chainer] Shape is (2, 3)
[Chainer] Shape is (2, 3)


In [33]:
# Join a sequence of arrays along an existing axis
x = np.random.randint(0,255, (32, 32, 32, 32))
y = np.random.randint(0,255, (32, 32, 32, 32))
tf_res = tf.concat(axis=0, values=[tf.constant(x), tf.constant(y)])
ch_res = chainer.functions.concat((chainer.variable.Variable(x), chainer.variable.Variable(y)), axis=0)
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (64, 32, 32, 32)
[Chainer] Shape is (64, 32, 32, 32)


In [50]:
# Gives a new shape to an array without changing its data
x = np.random.randint(0, 255, (32, 32, 32, 32))
tf_res = tf.reshape(tf.constant(x), [x.shape[0], -1])
ch_res = chainer.functions.reshape(chainer.variable.Variable(x), (x.shape[0], -1))
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (32, 32768)
[Chainer] Shape is (32, 32768)


In [61]:
# Construct an array by repeating the number of times given by reps
x = np.random.randint(0, 255, (1, 1, 1, 1))
tf_res = tf.tile(x, [2,2,2,2])
ch_res = chainer.functions.tile(x, (2,2,2,2))
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (2, 2, 2, 2)
[Chainer] Shape is (2, 2, 2, 2)
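
For reference, the four array operations above have direct NumPy counterparts. The sketch below uses small illustrative shapes rather than the ones from the cells above.

```python
import numpy as np

x = np.arange(12).reshape(2, 6)

# Split into 2 equal parts along axis 1 (tf.split / chainer.functions.split_axis)
left, right = np.split(x, 2, axis=1)

# Join along an existing axis (tf.concat / chainer.functions.concat)
joined = np.concatenate([left, right], axis=1)

# Flatten everything except the first axis (tf.reshape / chainer.functions.reshape)
flat = np.reshape(np.zeros((4, 3, 2, 2)), (4, -1))

# Repeat the array along every axis (tf.tile / chainer.functions.tile)
tiled = np.tile(np.ones((1, 1, 1, 1)), (2, 2, 2, 2))

print(left.shape, joined.shape, flat.shape, tiled.shape)
```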


In [27]:
# AdamOptimizer
from chainer.links.model.vision import resnet
learning_rate = 0.01
# TensorFlow needs a loss that depends on at least one trainable variable;
# minimizing a constant raises "ValueError: No variables to optimize."
tf_var = tf.Variable(1.0)  # illustrative variable
tf_res = tf.train.AdamOptimizer(learning_rate).minimize(tf.square(tf_var))
ch_res = chainer.optimizers.Adam(alpha=learning_rate)
model = resnet.ResNet50Layers()
ch_res.setup(model)
# ...
ch_res.update()

In [43]:
# 2D convolution
tf_image = np.float32(np.random.randint(0, 255, (32, 64, 64, 3)))       # NHWC layout
chainer_image = np.float32(np.random.randint(0, 255, (32, 3, 64, 64)))  # NCHW layout
tf_res = tf.contrib.slim.layers.conv2d(tf_image, 32, [5, 5], stride=2, normalizer_fn=None)
# The padding must be an integer: 5 // 2 == 2, whereas 5 / 2 is 2.5 under Python 3
ch_res = chainer.links.Convolution2D(in_channels=3, out_channels=32, ksize=(5, 5), stride=2, pad=5 // 2)(chainer_image)
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (32, 32, 32, 32)
[Chainer] Shape is (32, 32, 32, 32)
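
The matching output shapes follow from the standard convolution output-size formula. The short sketch below works through the arithmetic for the 64-pixel input; `conv_out_size` is a helper name introduced for this example.

```python
def conv_out_size(size, ksize, stride, pad):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * pad - ksize) // stride + 1

with_pad = conv_out_size(64, ksize=5, stride=2, pad=2)     # (64 + 4 - 5) // 2 + 1
without_pad = conv_out_size(64, ksize=5, stride=2, pad=0)  # (64 - 5) // 2 + 1
print(with_pad, without_pad)
```

A padding of 2 gives 32 outputs per axis, the same size TensorFlow produces with its default `SAME` padding, while no padding would give 30. The shapes agree, but the two frameworks may place the padding differently, so the output values are not guaranteed to line up pixel for pixel.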


In [49]:
# Layer normalization
tf_image = np.float32(np.random.randint(0, 255, (32, 64, 64, 3)))
chainer_image = np.float32(np.random.randint(0, 255, (32, 3, 64, 64)))
tf_res = tf_layers.layer_norm(tf_image)

# Chainer's LayerNormalization works on 2D input: flatten, normalize, restore the shape
ch_res = chainer.functions.reshape(chainer_image, (chainer_image.shape[0], -1))
ch_res = chainer.links.LayerNormalization()(ch_res)
ch_res = chainer.functions.reshape(ch_res, chainer_image.shape)
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (32, 64, 64, 3)
[Chainer] Shape is (32, 3, 64, 64)
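
A NumPy version of the same flatten, normalize, and reshape pattern might look like the sketch below. It assumes normalization over all non-batch axes and leaves out the learned scale and shift parameters; the `eps` value is an arbitrary choice for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each sample to zero mean and unit variance over all non-batch axes."""
    flat = x.reshape(x.shape[0], -1)
    mean = flat.mean(axis=1, keepdims=True)
    var = flat.var(axis=1, keepdims=True)
    normed = (flat - mean) / np.sqrt(var + eps)
    return normed.reshape(x.shape)

rng = np.random.RandomState(0)
np_image = rng.randint(0, 255, (4, 3, 8, 8)).astype(np.float64)
np_res = layer_norm(np_image)
print(np_res.shape)
```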




In [55]:
# 2D Deconvolution
tf_image = np.float32(np.random.randint(0, 255, (32, 64, 64, 3)))
chainer_image = np.float32(np.random.randint(0, 255, (32, 3, 64, 64)))
tf_res = tf.contrib.slim.layers.conv2d_transpose(tf.constant(tf_image), tf_image.shape[3], 3, stride=2)
# The padding must be an integer: 3 // 2 == 1, whereas 3 / 2 is 1.5 under Python 3
ch_res = chainer.links.Deconvolution2D(in_channels=chainer_image.shape[1],
                                       out_channels=chainer_image.shape[1],
                                       ksize=(3, 3),
                                       stride=2,
                                       outsize=(chainer_image.shape[2] * 2, chainer_image.shape[3] * 2),
                                       pad=3 // 2)(
                                            chainer.variable.Variable(chainer_image)
                                       )
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (32, 128, 128, 3)
[Chainer] Shape is (32, 3, 128, 128)
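
The explicit `outsize` in the Chainer call can be motivated with the usual output-size formulas. The sketch below works through the numbers; both helper names are introduced for this example.

```python
def deconv_out_size(size, ksize, stride, pad):
    """Default output length of a transposed convolution along one axis."""
    return stride * (size - 1) + ksize - 2 * pad

def conv_out_size(size, ksize, stride, pad):
    """Output length of the corresponding forward convolution."""
    return (size + 2 * pad - ksize) // stride + 1

default_size = deconv_out_size(64, ksize=3, stride=2, pad=1)  # 2 * 63 + 3 - 2
round_trip = conv_out_size(128, ksize=3, stride=2, pad=1)     # (128 + 2 - 3) // 2 + 1
print(default_size, round_trip)
```

Without `outsize` the formula gives 127 rather than 128. A 128-pixel output is still consistent with a 64-pixel input, because the forward convolution maps 128 back to 64, which is why `outsize` can be used to request it.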


In [58]:
# Softmax
tf_image = np.float32(np.random.randint(0, 255, (32, 64, 64, 3)))
chainer_image = np.float32(np.random.randint(0, 255, (32, 3, 64, 64)))
tf_res = tf.nn.softmax(tf.constant(tf_image))
ch_res = chainer.functions.softmax(chainer.variable.Variable(chainer_image))
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (32, 64, 64, 3)
[Chainer] Shape is (32, 3, 64, 64)
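
Both calls rely on their default axis: `tf.nn.softmax` normalizes over the last axis and `chainer.functions.softmax` over axis 1, which in both layouts is the channel axis. The NumPy sketch below, with a hand-written `softmax` helper, checks that the two conventions agree once the layouts are aligned.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.RandomState(0)
nhwc = rng.rand(2, 4, 4, 3)          # TensorFlow layout: channels last
nchw = nhwc.transpose(0, 3, 1, 2)    # Chainer layout: channels first

tf_like = softmax(nhwc, axis=-1)     # softmax over channels (last axis)
ch_like = softmax(nchw, axis=1)      # softmax over channels (axis 1)
same = np.allclose(tf_like.transpose(0, 3, 1, 2), ch_like)
```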


In [60]:
# Relu
tf_image = np.float32(np.random.randint(0, 255, (32, 64, 64, 3)))
chainer_image = np.float32(np.random.randint(0, 255, (32, 3, 64, 64)))
tf_res = tf.nn.relu(tf.constant(tf_image))
ch_res = chainer.functions.relu(chainer_image)
print_tf_shape(tf_res)
print_ch_shape(ch_res)

[TF] Shape is (32, 64, 64, 3)
[Chainer] Shape is (32, 3, 64, 64)
