# Exercise 8 - Recurrent Neural Networks


In this exercise we use recurrent neural networks to classify the sentiment of movie reviews in the IMDB dataset.

This exercise is based on https://github.com/leriomaggio/deep-learning-keras-tensorflow and https://github.com/keras-team/keras/blob/master/examples/imdb_lstm.py and https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

<img src="https://austingwalters.com/wp-content/uploads/2019/01/rnn-multihidden.png" width="40%">

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behavior.

```python
tensorflow.keras.layers.SimpleRNN(units, activation='tanh', use_bias=True,
                                 kernel_initializer='glorot_uniform',
                                 recurrent_initializer='orthogonal',
                                 bias_initializer='zeros',
                                 kernel_regularizer=None,
                                 recurrent_regularizer=None,
                                 bias_regularizer=None,
                                 activity_regularizer=None,
                                 kernel_constraint=None, recurrent_constraint=None,
                                 bias_constraint=None, dropout=0.0, recurrent_dropout=0.0)
```

#### Arguments:

<ul>
<li><strong>units</strong>: Positive integer, dimensionality of the output space.</li>
<li><strong>activation</strong>: Activation function to use
    (see <a href="http://keras.io/activations/">activations</a>).
    If you pass None, no activation is applied
    (ie. "linear" activation: <code>a(x) = x</code>).</li>
<li><strong>use_bias</strong>: Boolean, whether the layer uses a bias vector.</li>
<li><strong>kernel_initializer</strong>: Initializer for the <code>kernel</code> weights matrix,
    used for the linear transformation of the inputs.
    (see <a href="https://keras.io/initializers/">initializers</a>).</li>
<li><strong>recurrent_initializer</strong>: Initializer for the <code>recurrent_kernel</code>
    weights matrix,
    used for the linear transformation of the recurrent state.
    (see <a href="https://keras.io/initializers/">initializers</a>).</li>
<li><strong>bias_initializer</strong>: Initializer for the bias vector
    (see <a href="https://keras.io/initializers/">initializers</a>).</li>
<li><strong>kernel_regularizer</strong>: Regularizer function applied to
    the <code>kernel</code> weights matrix
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>recurrent_regularizer</strong>: Regularizer function applied to
    the <code>recurrent_kernel</code> weights matrix
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>bias_regularizer</strong>: Regularizer function applied to the bias vector
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>activity_regularizer</strong>: Regularizer function applied to
    the output of the layer (its "activation").
    (see <a href="https://keras.io/regularizers/">regularizer</a>).</li>
<li><strong>kernel_constraint</strong>: Constraint function applied to
    the <code>kernel</code> weights matrix
    (see <a href="https://keras.io/constraints/">constraints</a>).</li>
<li><strong>recurrent_constraint</strong>: Constraint function applied to
    the <code>recurrent_kernel</code> weights matrix
    (see <a href="https://keras.io/constraints/">constraints</a>).</li>
<li><strong>bias_constraint</strong>: Constraint function applied to the bias vector
    (see <a href="https://keras.io/constraints/">constraints</a>).</li>
<li><strong>dropout</strong>: Float between 0 and 1.
    Fraction of the units to drop for
    the linear transformation of the inputs.</li>
<li><strong>recurrent_dropout</strong>: Float between 0 and 1.
    Fraction of the units to drop for
    the linear transformation of the recurrent state.</li>
</ul>
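
How the `kernel`, `recurrent_kernel` and bias described above combine at each time step can be sketched in NumPy. This is a minimal illustration of the standard tanh recurrence, not the Keras implementation; the function name `simple_rnn_forward` and the toy sizes are assumptions made for this example.

```python
import numpy as np

def simple_rnn_forward(x_seq, kernel, recurrent_kernel, bias):
    """Run a tanh RNN over one sequence and return the final hidden state."""
    units = recurrent_kernel.shape[0]
    h = np.zeros(units)                      # initial state h_0 = 0
    for x_t in x_seq:                        # one update per time step
        # input transformation + recurrent transformation + bias
        h = np.tanh(x_t @ kernel + h @ recurrent_kernel + bias)
    return h

rng = np.random.default_rng(0)
input_dim, units, timesteps = 4, 3, 5        # toy sizes, chosen for illustration
kernel = rng.normal(size=(input_dim, units))
recurrent_kernel = rng.normal(size=(units, units))
bias = np.zeros(units)
x_seq = rng.normal(size=(timesteps, input_dim))

h_final = simple_rnn_forward(x_seq, kernel, recurrent_kernel, bias)
print(h_final.shape)  # (3,)
```

The final hidden state has one entry per unit, which is why `units` is described as the dimensionality of the output space.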

#### Backpropagation Through Time

Unlike feed-forward neural networks, an RNN can encode information from earlier inputs in its internal state, which makes it well suited to sequential data. Backpropagation through time (BPTT) extends the ordinary backpropagation (BP) algorithm to the recurrent architecture by unrolling the network over the time steps of the input sequence.

<img src="https://upload.wikimedia.org/wikipedia/commons/e/ee/Unfold_through_time.png" width="45%">

**Reference**: [Backpropagation through Time](http://ir.hit.edu.cn/~jguo/docs/notes/bptt.pdf)
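
To make the idea concrete, here is a small sketch of BPTT for a scalar RNN, checked against a numerical gradient. The model `h_t = tanh(w*h_{t-1} + u*x_t)`, the squared-error loss on the last state, and all names and values are assumptions chosen for illustration.

```python
import math

def forward(w, u, xs):
    """Hidden states h_0..h_T of the scalar RNN h_t = tanh(w*h_{t-1} + u*x_t)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, y):
    """Squared error between the last hidden state and the target y."""
    return 0.5 * (forward(w, u, xs)[-1] - y) ** 2

def bptt_grad_w(w, u, xs, y):
    """Gradient of the loss with respect to the recurrent weight w."""
    hs = forward(w, u, xs)
    dh = hs[-1] - y                       # dL/dh_T
    grad_w = 0.0
    for t in range(len(xs), 0, -1):       # walk backwards through time
        da = dh * (1.0 - hs[t] ** 2)      # back through tanh
        grad_w += da * hs[t - 1]          # w is shared, so contributions add up
        dh = da * w                       # pass the gradient on to h_{t-1}
    return grad_w

xs, y, w, u = [0.5, -1.0, 0.3, 0.8], 0.2, 0.7, -0.4
eps = 1e-6
numeric = (loss(w + eps, u, xs, y) - loss(w - eps, u, xs, y)) / (2 * eps)
analytic = bptt_grad_w(w, u, xs, y)
print(analytic, numeric)
```

Because the same weight is used at every time step, its gradient is the sum of the contributions from all unrolled steps.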

## IMDB sentiment classification task

The problem that we will use to demonstrate sequence learning in this tutorial is the IMDB movie review sentiment classification problem. Each movie review is a variable-length sequence of words, and the sentiment of each movie review must be classified.

The Large Movie Review Dataset (often referred to as the IMDB dataset) contains 25,000 highly-polar movie reviews (good or bad) for training and the same amount again for testing. The problem is to determine whether a given movie review has a positive or negative sentiment.
http://ai.stanford.edu/~amaas/data/sentiment/

The data was collected by Stanford researchers and was used in a 2011 paper (http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) in which the data was split 50-50 into training and test sets. An accuracy of 88.89% was achieved.

Keras provides built-in access to the IMDB dataset. The `imdb.load_data()` function allows you to load the dataset in a format that is ready for use in neural network and deep learning models.

The words have been replaced by integers that indicate the frequency rank of each word in the dataset. Each review is therefore a sequence of integers.



### Word Embedding
We will map each movie review into a real vector domain, a popular technique when working with text called word embedding. This is a technique where words are encoded as real-valued vectors in a high dimensional space, where the similarity between words in terms of meaning translates to closeness in the vector space.

Keras provides a convenient way to convert positive integer representations of words into a word embedding by an Embedding layer.

We will map each word onto a real-valued vector of length 32. We will also limit the total number of words that we are interested in modeling to the 10000 most frequent words and replace all other words with a single out-of-vocabulary index. Finally, the sequence length (number of words) in each review varies, so we will constrain each review to 500 words, truncating longer reviews and padding shorter reviews with zero values.
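
Conceptually, an embedding layer is a lookup table with one row per word index. The following NumPy sketch illustrates this with a randomly initialised table; in the real layer the rows are trained, and the names `embedding_matrix` and `review` as well as the toy review are assumptions made for this example.

```python
import numpy as np

vocab_size, embedding_dim = 10000, 32        # sizes used in this exercise
rng = np.random.default_rng(42)

# One row of length 32 per word index (random here, learned in a real model)
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embedding_dim))

# A toy, already padded review given as word indices (hypothetical values)
review = np.array([0, 0, 1, 14, 22, 16])

# Looking up the embedding is plain integer indexing into the table
embedded = embedding_matrix[review]
print(embedded.shape)  # (6, 32)
```

Each word index is replaced by its row, so a review of 500 indices becomes a `500 x 32` matrix that is fed to the recurrent layer.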

Now that we have defined our problem and how the data will be prepared and modeled, we are ready to develop an LSTM model to classify the sentiment of movie reviews.

### Data Preparation - IMDB

As usual we will load TensorFlow 2 and make sure we use Python 3:

In [None]:
#Check if colab is running
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
  %tensorflow_version 2.x

#import TF
import tensorflow as tf
from platform import python_version
print("Tensorflow version", tf.__version__)
print("Python version =",python_version())

Python version = 3.6.9


Importing everything we need...

In [None]:
import numpy
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import LSTM, SimpleRNN
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence

from sklearn.model_selection import train_test_split

# fix random seed for reproducibility
numpy.random.seed(42)

We need to load the IMDB dataset. We are constraining the dataset to the top 10,000 most commonly used words. We also split the dataset into train (70%) and validation (30%) sets.

In [None]:
# load the dataset but only keep the top n words; rarer words are replaced
# by the out-of-vocabulary index (2 by default)
top_words = 10000
print("Loading data...")
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

#Split data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42, stratify=y_train)

# Map for readable classnames
class_names = ["Negative", "Positive"]


Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


Let's inspect the data a little bit. How many sequences do we have and what do they look like?

In [None]:
print(len(X_test), 'test sequences')
print(len(X_train), 'train sequences')
print(len(X_val), 'validation sequences')
print('Example: ',X_train[1])

#### Create map for converting IMDB dataset to readable reviews

Apparently, reviews in the IMDB dataset have been encoded as sequences of integers. Luckily, the dataset also contains an index for converting the reviews back into human-readable form.
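
By default, `imdb.load_data()` reserves the indices 0, 1 and 2 for padding, start-of-sequence and out-of-vocabulary markers and shifts all word indices by 3. The effect of this offset can be illustrated with a toy vocabulary; the three-word index and the names prefixed with `toy_` are hypothetical and only serve this example.

```python
# Hypothetical frequency-ranked vocabulary (rank 1 = most frequent word)
toy_word_index = {"the": 1, "movie": 2, "great": 3}

# Shift every rank by 3 to make room for the reserved indices
toy_shifted = {word: rank + 3 for word, rank in toy_word_index.items()}
toy_shifted["<PAD>"] = 0
toy_shifted["<START>"] = 1
toy_shifted["<UNKNOWN>"] = 2
toy_shifted["<UNUSED>"] = 3

# Invert the mapping so that indices can be turned back into words
toy_reverse = {index: word for word, index in toy_shifted.items()}

def toy_decode(encoded):
    """Turn a list of indices back into a space-separated string."""
    return " ".join(toy_reverse.get(i, "?") for i in encoded)

print(toy_decode([1, 4, 5, 6, 2]))  # <START> the movie great <UNKNOWN>
```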

In [None]:
# Get the word index from the dataset
word_index = tf.keras.datasets.imdb.get_word_index()

# Ensure that "special" words are mapped into human readable terms
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNKNOWN>"] = 2
word_index["<UNUSED>"] = 3

# Perform reverse word lookup and make it callable
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


Let's have a closer look at our data. How many words do our reviews contain? And what does a review look like in human-readable form?

In [None]:
import numpy as np

# Concatenate the training, validation and test datasets
allreviews = np.concatenate((X_train, X_val, X_test), axis=0)

# Review lengths across the whole dataset
print("Maximum review length: {}".format(len(max((allreviews), key=len))))
print("Minimum review length: {}".format(len(min((allreviews), key=len))))
result = [len(x) for x in allreviews]
print("Mean review length: {}".format(np.mean(result)))

# Print a review and its class as stored in the dataset. Replace the number
# to select a different review.
print("")
print("Machine readable Review")
print("  Review Text: " + str(X_train[1]))
print("  Review Sentiment: " + str(y_train[1]))

# Print a review and its class in human-readable format. Replace the number
# to select a different review.
print("")
print("Human Readable Review")
print("  Review Text: " + decode_review(X_train[1]))
print("  Review Sentiment: " + class_names[y_train[1]])

### Pre-processing Data

We need to make sure that all reviews have the same length, because Keras requires fixed-length input vectors to perform the computation. Longer reviews are truncated and shorter ones are padded with zeros. The sequences still differ in length in terms of content, but the model can learn that the zero values carry no information.
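
The truncation and padding behaviour can be sketched in plain Python. The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation; it assumes the `pad_sequences` defaults, where zeros are added at the start of a short sequence and excess words are also dropped from the start of a long one.

```python
def pad_pre(seq, maxlen, value=0):
    """Return seq as a list of exactly maxlen items (pre-truncate, pre-pad)."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]        # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + seq

print(pad_pre([1, 2, 3], maxlen=5))           # [0, 0, 1, 2, 3]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

With pre-padding, the actual words sit at the end of the sequence, so they are the last inputs the recurrent layer sees before producing its output.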

In [None]:
# truncate and pad input sequences
max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
X_val = sequence.pad_sequences(X_val, maxlen=max_review_length)

# Check the size of our datasets. After the 70/30 split, the training review
# data should contain 17500 reviews of 500 integers and the test review data
# 25000 reviews of 500 integers. The class data should contain one value per
# review. Class values are 0 or 1, indicating a negative or positive review.
print("Shape Training Review Data: " + str(X_train.shape))
print("Shape Training Class Data: " + str(y_train.shape))
print("Shape Test Review Data: " + str(X_test.shape))
print("Shape Test Class Data: " + str(y_test.shape))

# Note padding is added to start of review, not the end
print("")
print("Machine readable Review Text (post padding):\n " + str(X_train[1]))
print("")
print("Human Readable Review Text (post padding):\n " + decode_review(X_train[1]))


## A simple RNN model
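
Before building the model, it can help to estimate how many trainable parameters to expect in the summary. The following arithmetic is a sketch that assumes the configuration used in this section (vocabulary of 10000 words, embedding length 32, 128 recurrent units with bias, one sigmoid output unit); the variable names are chosen for this example.

```python
top_words = 10000      # vocabulary size
embedding_dim = 32     # length of each word vector
rnn_units = 128        # number of SimpleRNN units

# Embedding: one vector of length 32 per word
embedding_params = top_words * embedding_dim

# SimpleRNN: input weights + recurrent weights + bias, for each unit
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense output: one weight per RNN unit plus one bias
dense_params = rnn_units * 1 + 1

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 320000 20608 129 340737
```

Dropout layers add no parameters, so most of the model's capacity sits in the embedding table.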

In [None]:
print('Build model...')
# create the model
embedding_vecor_length = 32
model = Sequential()

# The Embedding layer provides a spatial mapping (or word embedding) of all the
# individual words in our training set. Words close to one another share context
# and/or meaning. This spatial mapping is learned during the training process.
model.add(
    tf.keras.layers.Embedding(
        input_dim = top_words, # The size of our vocabulary
        output_dim = embedding_vecor_length, # Dimension of the vector each word is mapped to
        input_length = max_review_length # Length of input sequences
    )
)
#Dropout for regularization
model.add(Dropout(0.4))

#We are using the simplest RNN model for now with 128 units
model.add(SimpleRNN(128))

#A second dropout layer
model.add(Dropout(0.4))

#Connect the output to a single dense output unit
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

### Training

Let's train the model. Training for 20 epochs with a relatively large batch size of 1024 already takes around 3-5 minutes on a GPU, so this is a good time to get a coffee.

In [None]:
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=1024)

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
def plot_history(network_history):
    plt.figure()
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.plot(network_history.history['loss'])
    plt.plot(network_history.history['val_loss'])
    plt.legend(['Training', 'Validation'])

    plt.figure()
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.plot(network_history.history['accuracy'])
    plt.plot(network_history.history['val_accuracy'])
    plt.legend(['Training', 'Validation'], loc='lower right')
    plt.show()

In [None]:
plot_history(history)

### Evaluation

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix
import numpy as np


def evaluate(X_test, Y_test, X_train, Y_train, model):

    ## Evaluate loss and metrics, then predict probabilities and classes
    loss, accuracy = model.evaluate(X_test, Y_test, verbose=0)
    Y_pred = model.predict(X_test, batch_size=1024)
    # Threshold the predicted probabilities at 0.5 instead of predicting twice
    Y_cls  = (Y_pred > 0.5).astype("int32")

    print('Test Loss:', loss)
    print('Accuracy: %.2f' % accuracy_score(Y_test, Y_cls))
    print("Precision: %.2f" % precision_score(Y_test, Y_cls, average='weighted'))
    print("Recall: %.2f" % recall_score(Y_test, Y_cls, average='weighted'))
    print('Classification Report:\n', classification_report(Y_test, Y_cls))

    ## Plot the predicted probabilities for test and training data (overtraining check)
    plt.figure(figsize=(8,8))

    label=1
    #Test prediction
    plt.hist(Y_pred[Y_test == label], alpha=0.5, color='red', range=[0, 1], bins=10)
    plt.hist(Y_pred[Y_test != label], alpha=0.5, color='blue', range=[0, 1], bins=10)

    #Train prediction
    Y_train_pred = model.predict(X_train)
    plt.hist(Y_train_pred[Y_train == label], alpha=0.5, color='red', range=[0, 1], bins=10, histtype='step', linewidth=2)
    plt.hist(Y_train_pred[Y_train != label], alpha=0.5, color='blue', range=[0, 1], bins=10, histtype='step', linewidth=2)

    plt.legend(['train == 1', 'train == 0', 'test == 1', 'test == 0'], loc='upper right')
    plt.xlabel('Probability of being a good review')
    plt.ylabel('Number of entries')
    plt.show()

    # Let's have a look at some of the incorrectly classified reviews. For readability we remove the padding.
    predicted_classes_reshaped = np.reshape(Y_cls, Y_test.shape)
    incorrect = np.nonzero(predicted_classes_reshaped != Y_test)[0]

    # We select the first 10 incorrectly classified reviews
    for j, idx in enumerate(incorrect[:10]):

        predicted = class_names[predicted_classes_reshaped[idx]]
        actual = class_names[Y_test[idx]]
        human_readable_review = decode_review(X_test[idx])

        print("Incorrectly classified Test Review [" + str(j+1) + "]")
        print("Test Review #" + str(idx) + ": Predicted [" + predicted + "] Actual [" + actual + "]")
        print("Test Review Text: " + human_readable_review.replace("<PAD> ", ""))
        print("")


In order to run a bit faster, we will only evaluate the first 1000 reviews.

In [None]:
evaluate(X_test[:1000], y_test[:1000],X_train[:1000], y_train[:1000], model)

OK, this wasn't too bad, but can we do better with a more advanced recurrent model like the LSTM?

## LSTM model

An LSTM network is an artificial neural network that contains LSTM blocks instead of, or in addition to, regular network units. An LSTM block may be described as a "smart" network unit that can remember a value for an arbitrary length of time.

Unlike traditional RNNs, a long short-term memory (LSTM) network is well suited to learn from experience to classify, process and predict time series when there are very long time lags of unknown size between important events.

```python
tensorflow.keras.layers.LSTM(units, activation='tanh', recurrent_activation='sigmoid', use_bias=True,
                            kernel_initializer='glorot_uniform', recurrent_initializer='orthogonal',
                            bias_initializer='zeros', unit_forget_bias=True, kernel_regularizer=None,
                            recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
                            kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
                            dropout=0.0, recurrent_dropout=0.0)
```

#### Arguments

<ul>
<li><strong>units</strong>: Positive integer, dimensionality of the output space.</li>
<li><strong>activation</strong>: Activation function to use.
    If you pass None, no activation is applied
    (i.e. "linear" activation: <code>a(x) = x</code>).</li>
<li><strong>recurrent_activation</strong>: Activation function to use
    for the recurrent step.</li>
<li><strong>use_bias</strong>: Boolean, whether the layer uses a bias vector.</li>
<li><strong>kernel_initializer</strong>: Initializer for the <code>kernel</code> weights matrix,
    used for the linear transformation of the inputs.</li>
<li><strong>recurrent_initializer</strong>: Initializer for the <code>recurrent_kernel</code>
    weights matrix,
    used for the linear transformation of the recurrent state.</li>
<li><strong>bias_initializer</strong>: Initializer for the bias vector.</li>
<li><strong>unit_forget_bias</strong>: Boolean.
    If True, add 1 to the bias of the forget gate at initialization.
    Setting it to true will also force <code>bias_initializer="zeros"</code>.
    This is recommended in <a href="http://www.jmlr.org/proceedings/papers/v37/jozefowicz15.pdf">Jozefowicz et al.</a></li>
<li><strong>kernel_regularizer</strong>: Regularizer function applied to
    the <code>kernel</code> weights matrix.</li>
<li><strong>recurrent_regularizer</strong>: Regularizer function applied to
    the <code>recurrent_kernel</code> weights matrix.</li>
<li><strong>bias_regularizer</strong>: Regularizer function applied to the bias vector.</li>
<li><strong>activity_regularizer</strong>: Regularizer function applied to
    the output of the layer (its "activation").</li>
<li><strong>kernel_constraint</strong>: Constraint function applied to
    the <code>kernel</code> weights matrix.</li>
<li><strong>recurrent_constraint</strong>: Constraint function applied to
    the <code>recurrent_kernel</code> weights matrix.</li>
<li><strong>bias_constraint</strong>: Constraint function applied to the bias vector.</li>
<li><strong>dropout</strong>: Float between 0 and 1.
    Fraction of the units to drop for
    the linear transformation of the inputs.</li>
<li><strong>recurrent_dropout</strong>: Float between 0 and 1.
    Fraction of the units to drop for
    the linear transformation of the recurrent state.</li>
</ul>
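
How the gates of an LSTM block use these weights can be sketched as a single time step in NumPy. This is an illustration of the standard LSTM equations, not the Keras implementation; the function names, the toy sizes and the gate ordering inside the weight matrices (input, forget, candidate, output) are assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, kernel, recurrent_kernel, bias):
    """One LSTM time step; returns the new hidden state and cell state."""
    units = h_prev.shape[0]
    # All four gates are computed in one matrix product, shape (4 * units,)
    z = x_t @ kernel + h_prev @ recurrent_kernel + bias
    i = sigmoid(z[:units])                   # input gate
    f = sigmoid(z[units:2 * units])          # forget gate
    g = np.tanh(z[2 * units:3 * units])      # candidate cell values
    o = sigmoid(z[3 * units:])               # output gate
    c = f * c_prev + i * g                   # keep part of the old memory, add new
    h = o * np.tanh(c)                       # expose part of the memory as output
    return h, c

rng = np.random.default_rng(0)
input_dim, units = 4, 3                      # toy sizes, chosen for illustration
kernel = rng.normal(size=(input_dim, 4 * units))
recurrent_kernel = rng.normal(size=(units, 4 * units))
bias = np.zeros(4 * units)
bias[units:2 * units] = 1.0                  # forget-gate bias of 1, as with unit_forget_bias

h, c = lstm_step(rng.normal(size=input_dim), np.zeros(units), np.zeros(units),
                 kernel, recurrent_kernel, bias)
n_params = kernel.size + recurrent_kernel.size + bias.size
print(h.shape, c.shape, n_params)  # (3,) (3,) 96
```

The separate cell state `c` is what lets the block carry a value across many time steps: as long as the forget gate stays close to 1, the old memory passes through almost unchanged.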

## Task 1: Train and evaluate an LSTM model
* Build a model using again one embedding layer and one dense output node but with an LSTM layer with 128 units instead of the RNN layer
* Use a dropout layer between the embedding and LSTM layer and between the LSTM and the dense layer
* Train the model and plot the loss and accuracy over epochs
* Evaluate the performance of the model and compare it with the RNN model

We can now define, compile and fit our LSTM model.

The first layer is the Embedding layer, which uses vectors of length 32 to represent each word. The next layer is the LSTM layer with 128 memory units (smart neurons). Finally, because this is a classification problem, we use a Dense output layer with a single neuron and a sigmoid activation function to make 0 or 1 predictions for the two classes (good and bad) in the problem.

Because it is a binary classification problem, log loss is used as the loss function (binary_crossentropy in Keras). The efficient ADAM optimization algorithm is used. The model is fit for 20 epochs. A large batch size of 128 reviews is used to space out weight updates.

In [None]:
# build a model

### Training

In [None]:
# train a model

### Evaluation

In [None]:
# evaluate the model

## LSTM with dropout

Recurrent neural networks like the LSTM are generally prone to overfitting.

Dropout can be applied between layers using the Dropout Keras layer. We have done this easily by adding new Dropout layers between the Embedding and LSTM layers and the LSTM and Dense output layers.

Alternatively, dropout can be applied separately to the input and recurrent connections of the LSTM memory units.
Keras provides this capability through two parameters of the LSTM layer: `dropout` configures the input dropout and `recurrent_dropout` configures the recurrent dropout. For example, we can modify the first example to add dropout to the input and recurrent connections as follows:

`model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))`
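
What a dropout rate of 0.2 does to a batch of activations during training can be sketched in NumPy. This shows the common "inverted dropout" formulation, where the surviving values are scaled up so that the expected activation stays the same; the helper `inverted_dropout` is hypothetical and not the Keras implementation.

```python
import numpy as np

def inverted_dropout(x, rate, rng):
    """Zero a random fraction `rate` of x and rescale the remaining values."""
    keep_mask = rng.random(x.shape) >= rate   # True for values that survive
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 128))                      # toy activations, all equal to 1
dropped = inverted_dropout(x, rate=0.2, rng=rng)

print((dropped == 0).mean())   # fraction of zeroed values, close to 0.2
print(dropped.mean())          # mean activation, close to 1.0
```

At inference time no values are dropped, which is why the rescaling is done during training.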

## Task 2: Train and evaluate an LSTM model with dropout
* Instead of using two dropout layers, apply dropout to the input and recurrent connections of the LSTM model
* Train the model over 20 epochs with a batch size of 2048 and plot the loss and accuracy over epochs
* Evaluate the performance of the model and compare it with the previous models

In [None]:
# create the model

### Training

In [None]:
# train the model

### Evaluation

In [None]:
# evaluate the model

## Convolutional LSTM

Convolutional neural networks excel at learning the spatial structure in input data.

The IMDB review data does have a one-dimensional spatial structure in the sequence of words in each review, and the CNN may be able to pick out invariant features for good and bad sentiment. These learned spatial features may then be learned as sequences by an LSTM layer.

We can easily add a one-dimensional CNN and max pooling layers after the Embedding layer which then feed the consolidated features to the LSTM. We can use a smallish set of 32 features with a small filter length of 3. The pooling layer can use the standard length of 2 to halve the feature map size.
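
How the convolution and pooling layers change the sequence length seen by the LSTM can be worked out with a few lines of arithmetic. The helper functions below are hypothetical and written for this example; they assume a stride of 1 for the convolution and a pooling stride equal to the pool size, and they compare the two common padding modes.

```python
import math

def conv1d_output_length(length, kernel_size, padding="valid", stride=1):
    """Number of time steps after a 1D convolution."""
    if padding == "same":
        return math.ceil(length / stride)
    return (length - kernel_size) // stride + 1

def max_pool1d_output_length(length, pool_size):
    """Number of time steps after max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

seq_len = 500   # review length after padding
for padding in ("valid", "same"):
    after_conv = conv1d_output_length(seq_len, kernel_size=3, padding=padding)
    after_pool = max_pool1d_output_length(after_conv, pool_size=2)
    print(padding, after_conv, after_pool)
# valid 498 249
# same 500 250
```

In both cases the LSTM processes roughly half as many time steps as before, which shortens training, and each time step now carries 32 convolutional features.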


## Task 3: Train and evaluate an LSTM model with a convolutional layer
* Add one convolutional layer and one maxpooling layer before the LSTM layer
* Train the model for 10 epochs with a batch size of 2048 and plot the loss and accuracy over epochs
* Evaluate the performance of the model and compare it with the previous models

In [None]:
from tensorflow.keras.layers import Conv1D,MaxPooling1D
# create the model

### Training

In [None]:
# train the model

### Evaluation

In [None]:
# evaluate the model

---

## Bonus: LSTM with convolutional input & recurrent transformation

[Convolutional LSTM Network: A Machine Learning Approach for
Precipitation Nowcasting](http://arxiv.org/abs/1506.04214v1)

Based on https://github.com/keras-team/keras/blob/master/examples/conv_lstm.py

This network is used to predict the next frame of an artificially generated movie which contains moving squares.

#### Artificial Data Generation

Generate movies with `3` to `7` moving squares inside.

The squares have a side length of 4 or 6 pixels (the half-width `w` in the code is 2 or 3) and move linearly over time.

For convenience we first create movies with a bigger width and height (`80x80`) and at the end we select a $40 \times 40$ window.

In [None]:
# Artificial Data Generation
def generate_movies(n_samples=1200, n_frames=15):
    row = 80
    col = 80
    # np.float was removed in NumPy 1.24; the builtin float gives the same dtype (float64)
    noisy_movies = np.zeros((n_samples, n_frames, row, col, 1), dtype=float)
    shifted_movies = np.zeros((n_samples, n_frames, row, col, 1),
                              dtype=float)

    for i in range(n_samples):
        # Add 3 to 7 moving squares
        n = np.random.randint(3, 8)

        for j in range(n):
            # Initial position
            xstart = np.random.randint(20, 60)
            ystart = np.random.randint(20, 60)
            # Direction of motion
            directionx = np.random.randint(0, 3) - 1
            directiony = np.random.randint(0, 3) - 1

            # Size of the square
            w = np.random.randint(2, 4)

            for t in range(n_frames):
                x_shift = xstart + directionx * t
                y_shift = ystart + directiony * t
                noisy_movies[i, t, x_shift - w: x_shift + w,
                             y_shift - w: y_shift + w, 0] += 1

                # Make it more robust by adding noise.
                # The idea is that if during inference,
                # the value of the pixel is not exactly one,
                # we need to train the network to be robust and still
                # consider it as a pixel belonging to a square.
                if np.random.randint(0, 2):
                    noise_f = (-1)**np.random.randint(0, 2)
                    noisy_movies[i, t,
                                 x_shift - w - 1: x_shift + w + 1,
                                 y_shift - w - 1: y_shift + w + 1,
                                 0] += noise_f * 0.1

                # Shift the ground truth by 1
                x_shift = xstart + directionx * (t + 1)
                y_shift = ystart + directiony * (t + 1)
                shifted_movies[i, t, x_shift - w: x_shift + w,
                               y_shift - w: y_shift + w, 0] += 1

    # Cut to a 40x40 window
    noisy_movies = noisy_movies[::, ::, 20:60, 20:60, ::]
    shifted_movies = shifted_movies[::, ::, 20:60, 20:60, ::]
    noisy_movies[noisy_movies >= 1] = 1
    shifted_movies[shifted_movies >= 1] = 1
    return noisy_movies, shifted_movies

### Model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, ConvLSTM2D, BatchNormalization

import numpy as np
from matplotlib import pyplot as plt

%matplotlib inline

We create a model which takes as input movies of shape `(n_frames, width, height, channels)` and returns a movie of identical shape.

In [None]:
seq = Sequential()
seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   input_shape=(None, 40, 40, 1),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(ConvLSTM2D(filters=40, kernel_size=(3, 3),
                   padding='same', return_sequences=True))
seq.add(BatchNormalization())

seq.add(Conv3D(filters=1, kernel_size=(3, 3, 3),
               activation='sigmoid',
               padding='same', data_format='channels_last'))
seq.compile(loss='binary_crossentropy', optimizer='adam')

### Train the Network

#### Beware: This takes time

In [None]:
# Train the network
noisy_movies, shifted_movies = generate_movies(n_samples=1200)
seq.fit(noisy_movies[:1000], shifted_movies[:1000], batch_size=10,
        epochs=20, validation_split=0.05)

### Test the Network

In [None]:
# Testing the network on one movie
# feed it with the first 7 positions and then
# predict the new positions
which = 1004
track = noisy_movies[which][:7, ::, ::, ::]

for j in range(16):
    new_pos = seq.predict(track[np.newaxis, ::, ::, ::, ::])
    new = new_pos[::, -1, ::, ::, ::]
    track = np.concatenate((track, new), axis=0)

In [None]:
# And then compare the predictions
# to the ground truth
track2 = noisy_movies[which][::, ::, ::, ::]
for i in range(15):
    fig = plt.figure(figsize=(10, 5))

    ax = fig.add_subplot(121)

    if i >= 7:
        ax.text(1, 3, 'Predictions !', fontsize=20, color='w')
    else:
        ax.text(1, 3, 'Initial trajectory', fontsize=20)

    toplot = track[i, ::, ::, 0]

    plt.imshow(toplot)
    ax = fig.add_subplot(122)
    plt.text(1, 3, 'Ground truth', fontsize=20)

    toplot = track2[i, ::, ::, 0]
    if i >= 2:
        toplot = shifted_movies[which][i - 1, ::, ::, 0]

    plt.imshow(toplot)
    plt.savefig('%i_animate.png' % (i + 1))