# Sequence learning with Keras

In recent years, sequence modelling has attracted considerably more attention in Deep Learning. In this chapter, we will delve into the exciting topic of recurrent neural networks. We will build upon the POS-Chunk data which we loaded in the previous chapter. Importantly, we will demonstrate that a keras model (like any Theano or TensorFlow graph) can have multiple inputs and outputs. We will show, for instance, that it is perfectly possible to train a model that **simultaneously** learns to POS tag and chunk. As always, we first set up our imports:

In [None]:
from __future__ import print_function

import codecs
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

import numpy as np

We load the CONLL train data again, but this time, we also load the chunk labels:

In [None]:
def load_data(path):
    data = []
    for line in codecs.open(path, 'r', 'utf8'):
        line = line.strip()
        if line:
            # Keep only well-formed lines with exactly three fields
            fields = line.split()
            if len(fields) == 3:
                data.append(tuple(fields))
    return data
        
train_data = load_data('data/seq/train.txt')

print(len(train_data))
for i in train_data[:10]:
    print(' - '.join(i))
    
train_tokens, train_pos, train_chunk = zip(*train_data)
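
To make the loading step concrete, here is a minimal sketch that parses a few hand-written lines in the same `token POS chunk` layout and then splits them into three parallel columns with `zip(*...)`. The sample lines and the helper name `parse_lines` are made up for illustration; they are not taken from the CONLL file.

```python
# Hypothetical in-memory sample in the "token POS chunk" layout.
sample_lines = [
    "Confidence NN B-NP",
    "in IN B-PP",
    "",                                     # blank line (sentence boundary)
    "malformed line with too many fields",  # skipped: not exactly 3 fields
    "the DT B-NP",
]

def parse_lines(lines):
    """Keep only lines that split into exactly three fields."""
    rows = []
    for line in lines:
        fields = line.strip().split()
        if len(fields) == 3:
            rows.append(tuple(fields))
    return rows

rows = parse_lines(sample_lines)

# zip(*rows) turns a list of (token, pos, chunk) rows into three columns.
tokens, pos_tags, chunk_tags = zip(*rows)
print(tokens)
print(pos_tags)
print(chunk_tags)
```

Blank and malformed lines are silently dropped here, which is also what the loader above does.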

Let us start with the POS labels, which we encode as before:

In [None]:
tag_encoder = LabelEncoder()
tag_encoder.fit(train_pos)
print('Total nb POS tags:', len(tag_encoder.classes_))

y_train_pos = tag_encoder.transform(train_pos)

Y_train_pos = np_utils.to_categorical(y_train_pos,
                                  nb_classes=len(tag_encoder.classes_))
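
If the two encoding steps feel opaque, the following pure-Python sketch mimics them on a toy label list: first every label is mapped to an integer (the classes are sorted, as `LabelEncoder` does), then every integer becomes a one-hot row. The labels and the helper `one_hot` are illustrative assumptions, not part of scikit-learn or keras.

```python
labels = ["NN", "DT", "NN", "VB"]   # toy labels, not from the corpus

# Step 1: label -> integer, with classes in sorted order.
classes = sorted(set(labels))
label_to_int = {label: idx for idx, label in enumerate(classes)}
ints = [label_to_int[label] for label in labels]

# Step 2: integer -> one-hot row of length nb_classes.
def one_hot(index, nb_classes):
    row = [0.0] * nb_classes
    row[index] = 1.0
    return row

Y = [one_hot(i, len(classes)) for i in ints]

print(classes)
print(ints)
for row in Y:
    print(row)
```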

Likewise, we vectorize our training instances as before:

In [None]:
from collections import Counter
vocab = Counter(train_tokens)
indexer = {'<unk>': 0}  # index 0 is reserved for padding and unknown tokens

for k, v in vocab.most_common():
    indexer[k] = len(indexer)

nb_left, nb_right = 2, 1

def vectorize(tokens):
    sequences = []
    for curr_idx, token in enumerate(tokens):
        # max(0, ...) prevents a negative start index from wrapping around
        left_context = tokens[max(0, curr_idx - nb_left) : curr_idx]
        while len(left_context) < nb_left:
            left_context = ['<unk>'] + left_context

        right_context = tokens[curr_idx + 1 : curr_idx + 1 + nb_right]
        while len(right_context) < nb_right:
            right_context += ['<unk>']

        seq = left_context + [token] + right_context

        ints = [indexer[t] if t in indexer else 0 for t in seq]

        sequences.append(ints)
    
    # int32, because vocabulary indices quickly exceed the int8 range (max 127)
    return np.array(sequences, dtype='int32')

X_train = vectorize(list(train_tokens))
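
To see what the windowing produces, here is a small sketch of the same idea on a toy sentence, leaving out the conversion to integers. The function name `context_windows` is hypothetical, and the `max(0, ...)` guard at the start of the sentence is a choice made in this sketch.

```python
nb_left, nb_right = 2, 1
PAD = "<unk>"   # padding symbol for positions outside the token list

def context_windows(tokens):
    windows = []
    for i, token in enumerate(tokens):
        # Up to nb_left tokens before the focus token, padded on the left.
        left = tokens[max(0, i - nb_left):i]
        left = [PAD] * (nb_left - len(left)) + left
        # Up to nb_right tokens after the focus token, padded on the right.
        right = tokens[i + 1:i + 1 + nb_right]
        right = right + [PAD] * (nb_right - len(right))
        windows.append(left + [token] + right)
    return windows

windows = context_windows(["the", "cat", "sat", "down"])
for w in windows:
    print(w)
```

Every window has `nb_left + 1 + nb_right` items, which is why the networks below expect sequences of length 4.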

Let's start building our model:

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, Activation

from keras.layers.recurrent import LSTM

model = Sequential()
model.add(Embedding(input_dim=len(indexer), output_dim=150,
                    input_length=nb_left + 1 + nb_right))
model.add(LSTM(100, return_sequences=False, activation='tanh'))
model.add(Dense(len(tag_encoder.classes_)))
model.add(Activation('softmax'))

Interestingly, if you compare this model to the one we had before, you see that only a single line has changed: instead of collapsing our 4 embedding vectors into a single, flat vector, we now have a **recurrent layer** loop over the embeddings and produce a single vector representation of the sequence at the end (hence `return_sequences=False`). The recurrent layer we use is a [Long Short-Term Memory](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) (LSTM) layer: such a layer will loop through our embedding vectors and model them *as a sequence* from left to right. While the exact workings of an LSTM are well outside the scope of this tutorial, the main advantage is that such a layer can remember information from previous timesteps for a fairly long time and have this information affect the way it processes the vectors it sees along the way. Let us test the model:

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train_pos, batch_size=10, nb_epoch=10)
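
To give some intuition for what 'looping over the embeddings' means, the sketch below runs a deliberately simplified recurrence over four toy vectors and keeps only the final state. Note that this is *not* an LSTM: it has no gates and no memory cell, and the fixed weights `w_in` and `w_rec` are arbitrary assumptions instead of learned parameters. It only illustrates how one vector can summarize a sequence.

```python
import math

def simple_rnn_last_state(vectors, w_in=0.5, w_rec=0.8):
    """Elementwise recurrence h_t = tanh(w_in * x_t + w_rec * h_{t-1}).

    Returns only the final state, which is the behaviour that
    return_sequences=False asks for.
    """
    state = [0.0] * len(vectors[0])
    for x in vectors:
        state = [math.tanh(w_in * xi + w_rec * hi)
                 for xi, hi in zip(x, state)]
    return state

# Four toy 'embedding' vectors of size 2 (one per position in the window).
embeddings = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.9], [0.4, 0.4]]

summary = simple_rnn_last_state(embeddings)
print(len(summary))   # the whole sequence is summarized in one vector
```

Because every state depends on the previous one, the order of the vectors matters, which a flattened representation cannot express in the same way.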

## The functional API

So far we have been working with keras's 'vanilla' model, i.e. the simple `Sequential` model where we plainly stack a series of layers on top of each other. There are many cases, however, where we would like to have a more flexible way of constructing a layer graph. Below, we will first work with a simple, yet highly relevant example: we will combine a left-to-right LSTM with a right-to-left LSTM. To this end, we will make use of keras's extremely powerful `Model`, which is part of its so-called 'functional' API. When working with the `Model`, we first need to tell keras what our input data will look like. In our case, it primarily needs to know the length of the integer sequences we will be passing it later. Note, by the way, that we never have to specify anything about the size of the batches themselves, because the model is largely agnostic of this until we actually start fitting it.

In [None]:
from keras.layers import Input
context_input = Input(shape=(nb_left + 1 + nb_right,), dtype='int32')

We assigned our `Input()` object to a variable called `context_input`. If we would like to feed this layer into our embeddings layer, we can now use the following, highly functional syntax:

In [None]:
embedding = Embedding(input_dim=len(indexer), output_dim=150)(context_input)
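
Conceptually, an embedding layer is little more than a lookup table with one row of weights per vocabulary index. The sketch below shows that lookup in plain Python; the sizes, the random initialization range and the helper `embed` are assumptions chosen for illustration, and a real layer would of course update the rows during training.

```python
import random

random.seed(0)
vocab_size, dim = 5, 3   # toy sizes (the text uses len(indexer) and 150)

# One row of weights per vocabulary index.
weights = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
           for _ in range(vocab_size)]

def embed(indices):
    """Replace every integer index by its row in the weight table."""
    return [weights[i] for i in indices]

window = [0, 3, 1, 3]    # one integer sequence of length 4
vectors = embed(window)

print(len(vectors), len(vectors[0]))   # sequence length, embedding size
```

The same index always yields the same row, so the two occurrences of index 3 above share one vector.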

The output of a single layer can be fed to multiple new layers. This is exactly what we do in the next code block. After retrieving the relevant series of embedding vectors, we pass them to both a left-to-right LSTM and a right-to-left LSTM (cf. `go_backwards=True`). Both recurrent layers take the same input, but will produce a different result.

In [None]:
left_to_right = LSTM(100, return_sequences=False, activation='tanh')(embedding)
right_to_left = LSTM(100, return_sequences=False, activation='tanh', go_backwards=True)(embedding)

With the functional API, it is also no problem to have multiple incoming layers. In the following code block, we combine the results of our two LSTMs into a single layer by summing them. This is where the `merge` function comes in handy.

In [None]:
from keras.layers import merge
merged = merge([left_to_right, right_to_left], mode='sum')
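
The bidirectional idea can be sketched without keras: read the same window once from left to right and once from right to left, then add the two final states element by element. The recurrence used here is a toy stand-in with fixed weights (an assumption of this sketch, not an LSTM), and `last_state` is a hypothetical helper.

```python
import math

def last_state(vectors):
    # Toy recurrence with fixed weights: h_t = tanh(0.5 * x_t + 0.8 * h_{t-1})
    h = [0.0] * len(vectors[0])
    for x in vectors:
        h = [math.tanh(0.5 * xi + 0.8 * hi) for xi, hi in zip(x, h)]
    return h

window = [[0.1, -0.3], [0.7, 0.2], [-0.5, 0.9], [0.4, 0.4]]

forward = last_state(window)          # reads the window left to right
backward = last_state(window[::-1])   # reads it right to left

# Elementwise sum of the two summaries, comparable to mode='sum'.
summed = [f + b for f, b in zip(forward, backward)]

print(forward)
print(backward)
print(summed)
```

Summing only works because both branches produce vectors of the same size.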

We can now top off the model with a plain output layer to predict our class labels:

In [None]:
output = Dense(len(tag_encoder.classes_), activation='softmax')(merged)

Now that we have defined our full graph, we can instantiate our actual model by specifying the start point and end point of our graph:

In [None]:
from keras.models import Model
model = Model(input=context_input, output=output)

Our model architecture neatly reflects the branching of our graph tree:

In [None]:
model.summary()

Let us compile and test:

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train_pos, batch_size=50, nb_epoch=25)

As you will have seen, each time we add more candy to our network, it becomes slower to train. (Luckily, we have GPUs to take care of that, but more about that later.)

## Multiple inputs and outputs

Now that we have seen that it is easy to branch and merge layers using the functional API, it also becomes easy to understand that models can have multiple inputs and outputs. One additional input which we might want is the following: right now, our model uses the same embedding matrix for both the focus word and the context words. Intuitively, however, it would make sense to reserve a second weight matrix for the focus token only. This is easy to achieve.

### Multiple inputs

First, we engineer a new input feature for our data outside keras. We separately encode the focus token again, using our vocabulary index:

In [None]:
X_train_focus = [indexer[focus] if focus in indexer else 0 for focus in train_tokens]
X_train_focus = np.array(X_train_focus, dtype='int32')
print(len(X_train_focus))

Now, we start building our graph, specifying *two* `Input` objects instead of a single one -- pay attention to the difference in the `shape` argument specified. To be able to disentangle both inputs later, it is useful to provide them with a name in the constructor:

In [None]:
context_input = Input(shape=(4,), dtype='int32', name='context')
focus_input = Input(shape=(1,), dtype='int32', name='focus')

Now, we can create two separate embedding layers, which will have two completely independent weight matrices, so that we can train different representations for the focus and context tokens:

In [None]:
context_embedding = Embedding(input_dim=len(indexer), output_dim=150)(context_input)
focus_embedding = Embedding(input_dim=len(indexer), output_dim=150)(focus_input)

For the `context_embedding`, we can proceed as we did before, with two LSTM branches, that get merged afterwards:

In [None]:
left_to_right = LSTM(100, return_sequences=False, activation='tanh')(context_embedding)
right_to_left = LSTM(100, return_sequences=False, activation='tanh', go_backwards=True)(context_embedding)
merged1 = merge([left_to_right, right_to_left], mode='sum')

Silly enough, the embedding of our focus token is a 'matrix' consisting of a single vector, which means that we can safely flatten it into a plain vector, without losing any relevant spatio-temporal information.

In [None]:
from keras.layers.core import Flatten
flat_context = Flatten()(focus_embedding)

We can now use the `merge` trick again to combine our two 'heads'. However, instead of summing them, we now simply **concatenate** the two vectors into a single long vector, using the `mode` argument:

In [None]:
merged2 = merge([merged1, flat_context], mode='concat')
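
The difference between the two merge modes is easiest to see on plain lists. In the sketch below, `merge_sum` and `merge_concat` are hypothetical helpers (not keras functions) and the short vectors stand in for the 100-dimensional LSTM summary and the 150-dimensional focus embedding.

```python
def merge_sum(a, b):
    """Elementwise sum: both vectors must have the same size."""
    if len(a) != len(b):
        raise ValueError("sum needs vectors of equal size")
    return [x + y for x, y in zip(a, b)]

def merge_concat(a, b):
    """Concatenation: sizes may differ, the result has len(a) + len(b) items."""
    return list(a) + list(b)

lstm_summary = [0.2, -0.1, 0.5]   # stand-in for the summed LSTM branches
focus_vector = [0.9, 0.3]         # stand-in for the flattened focus embedding

combined = merge_concat(lstm_summary, focus_vector)
print(len(combined))   # 3 + 2 items
```

Concatenation is the natural choice here, because the two heads have different sizes and summing them is not defined.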

On top of this result, we can place our output layer again:

In [None]:
pos_output = Dense(len(tag_encoder.classes_), activation='softmax', name='pos')(merged2)

When we now instantiate our `Model`, we make sure to specify a list of inputs (but a single output):

In [None]:
from keras.models import Model
model = Model(input=[context_input, focus_input], output=pos_output)

It's fitting time! Note that, because we gave our layers names above, we can feed in the data using dictionaries keyed by those names:

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit({'context':X_train,
           'focus': X_train_focus},
          {'pos': Y_train_pos},
          batch_size=50, nb_epoch=3, shuffle=True)

Again, this is embarrassingly slow... Notice, however, that separating the embedding matrices for the focus and context features leads to quite a dramatic improvement in the fitting capacity of our model!

### Multiple outputs

Finally, let us go *completely* nuts, and also add multiple outputs to our model. Interestingly, keras makes it really easy to learn several tasks simultaneously. In our case, we could try to predict both POS tags and chunking labels at the same time. Let us encode the chunk labels first:

In [None]:
chunk_encoder = LabelEncoder()
chunk_encoder.fit(train_chunk)
print('Total nb chunk labels:', len(chunk_encoder.classes_))

y_train_chunk = chunk_encoder.transform(train_chunk)

Y_train_chunk = np_utils.to_categorical(y_train_chunk,
                                  nb_classes=len(chunk_encoder.classes_))

Adding a second output is as simple as this:

In [None]:
context_input = Input(shape=(4,), dtype='int32', name='context')
focus_input = Input(shape=(1,), dtype='int32', name='focus')

context_embedding = Embedding(input_dim=len(indexer), output_dim=150)(context_input)
left_to_right = LSTM(100, return_sequences=False, activation='tanh')(context_embedding)
right_to_left = LSTM(100, return_sequences=False, activation='tanh', go_backwards=True)(context_embedding)
merged1 = merge([left_to_right, right_to_left], mode='sum')

focus_embedding = Embedding(input_dim=len(indexer), output_dim=150)(focus_input)
flat_context = Flatten()(focus_embedding)

merged2 = merge([merged1, flat_context], mode='concat')

pos_output = Dense(len(tag_encoder.classes_), activation='softmax', name='pos')(merged2)
chunk_output = Dense(len(chunk_encoder.classes_), activation='softmax', name='chunk')(merged2)

model = Model(input=[context_input, focus_input], output=[pos_output, chunk_output])

The only thing we add is another 'outgoing' softmax layer. Apart from that, when instantiating our model, we also use a list of outputs. Let us train this horse:

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit({'context': X_train,
           'focus': X_train_focus},
          {'pos': Y_train_pos,
           'chunk': Y_train_chunk},
          batch_size=100, nb_epoch=10, shuffle=True)
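
With two softmax outputs, the model needs a single number to minimize. As far as we know, keras applies the loss to every output and minimizes the (by default unweighted) sum of the results. The sketch below works this out by hand for one training item; the probabilities are invented and `categorical_crossentropy` is a small helper written for this sketch, not the keras function itself.

```python
import math

def categorical_crossentropy(y_true, y_pred):
    """Negative log-probability assigned to the true class (one-hot y_true)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

# One toy item: a one-hot target and a predicted distribution per output.
pos_true, pos_pred = [0, 1, 0], [0.2, 0.7, 0.1]
chunk_true, chunk_pred = [1, 0], [0.6, 0.4]

pos_loss = categorical_crossentropy(pos_true, pos_pred)
chunk_loss = categorical_crossentropy(chunk_true, chunk_pred)

# Joint objective: both tasks pull on the shared layers at the same time.
total_loss = pos_loss + chunk_loss

print(pos_loss, chunk_loss, total_loss)
```

Because both output layers sit on top of the same merged representation, an update that lowers the total loss has to serve the POS tagger and the chunker together.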
