# https://danijar.com/variable-sequence-lengths-in-tensorflow/

# Variable Sequence Lengths in TensorFlow
I recently wrote a guide on recurrent networks in TensorFlow: https://danijar.com/introduction-to-recurrent-networks-in-tensorflow/
That covered the basics, but we often want to learn on sequences of variable length, possibly even within the same batch of training examples.
In this post, I will explain how to use variable-length sequences in TensorFlow and what implications they have for your model.

# Computing the Sequence Length
Since TensorFlow unfolds our recurrent network for a given number of steps, we can only feed sequences of that shape to the network.
We also want the input to have a fixed size so that we can represent a training batch as a single tensor of shape `batch size x max length x features`.
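
As a rough illustration of such a fixed-size batch, here is a minimal NumPy sketch that stacks sequences of different lengths into one zero-padded array. The helper name `pad_batch` is made up for this example and is not part of TensorFlow or of the listing at the end of this post.

```python
import numpy as np


def pad_batch(sequences):
    """Stack variable-length sequences into one zero-padded batch array."""
    max_length = max(len(seq) for seq in sequences)
    features = sequences[0].shape[1]
    batch = np.zeros((len(sequences), max_length, features), dtype=np.float32)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq  # Frames after the sequence end stay zero.
    return batch


# Three sequences with 3, 5 and 1 frames of 2 features each.
sequences = [np.ones((3, 2)), np.ones((5, 2)), np.ones((1, 2))]
batch = pad_batch(sequences)
print(batch.shape)  # (3, 5, 2)
```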

I will assume that the sequences are padded with zero vectors to fill up the remaining time steps in the batch.
To pass sequence lengths to TensorFlow, we have to compute them from the batch.
While we could do this in NumPy in a pre-processing step, let’s do it on the fly as part of the compute graph!

We first collapse the frame vectors (the third dimension of a batch) into scalars by taking the maximum of their absolute values.
Each sequence is now a vector of scalars that are zero for the padded frames at the end.
We then use `tf.sign()` to convert the values of the actual frames to ones.
This gives us a binary mask of ones for used frames and zeros for unused frames that we can just sum to get the sequence length.
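
The steps above can be sketched in NumPy to check the arithmetic. This is only an eager stand-in: in the graph, the corresponding ops are `tf.abs`, `tf.reduce_max`, `tf.sign`, and `tf.reduce_sum`, as in the `length` property of the listing at the end of this post. The function name `sequence_length` is chosen just for this example.

```python
import numpy as np


def sequence_length(batch):
    # batch: batch_size x max_length x features, zero-padded at the end.
    used = np.sign(np.max(np.abs(batch), axis=2))  # 1 for real frames, 0 for padding.
    return np.sum(used, axis=1).astype(np.int32)


batch = np.zeros((2, 4, 3), dtype=np.float32)
batch[0, :2] = [[0.5, -1.0, 0.0], [0.0, 0.0, 2.0]]  # Two real frames.
batch[1, :4] = -0.1                                  # Four real frames.
lengths = sequence_length(batch)
print(lengths)  # [2 4]
```

Note that a real frame consisting only of zeros is indistinguishable from padding with this approach and would not be counted.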

# Using the Length Information
Now that we have a vector holding the sequence lengths, we can pass that to `dynamic_rnn()`, the function that unfolds our network, using the optional `sequence_length` parameter.
When running the model later, TensorFlow will return zero vectors for the outputs after these sequence lengths and copy the state through from the last valid step.
Therefore, weights will not affect those outputs and don’t get trained on them.
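
To make the effect of `sequence_length` more concrete, here is a small NumPy sketch of how I understand the behavior: past the end of a sequence the output is zeroed and the state is carried along unchanged. It uses a plain tanh cell instead of a GRU, and `masked_rnn`, `w_in`, and `w_state` are hypothetical names for this illustration, not TensorFlow API.

```python
import numpy as np


def masked_rnn(inputs, lengths, w_in, w_state):
    batch_size, max_length, _ = inputs.shape
    num_hidden = w_state.shape[0]
    state = np.zeros((batch_size, num_hidden))
    outputs = np.zeros((batch_size, max_length, num_hidden))
    for t in range(max_length):
        new_state = np.tanh(inputs[:, t] @ w_in + state @ w_state)
        active = (t < lengths)[:, None]  # Sequences that still have real frames.
        outputs[:, t] = np.where(active, new_state, 0.0)  # Zero output after the end.
        state = np.where(active, new_state, state)        # Keep the last real state.
    return outputs, state


rng = np.random.RandomState(0)
inputs = rng.randn(2, 4, 3)
lengths = np.array([2, 4])
inputs[0, 2:] = 0.0  # Zero padding for the shorter sequence.
w_in, w_state = rng.randn(3, 5), rng.randn(5, 5)
outputs, state = masked_rnn(inputs, lengths, w_in, w_state)
```

In this sketch, `outputs[0, 2:]` is all zeros and `state[0]` equals `outputs[0, 1]`, the output at the last real frame of the first sequence.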

# Masking the Cost Function
Note that our output will still be of size `batch_size x max_length x out_size`, but with the trailing frames being zero vectors for sequences shorter than the maximum length.
When we use the outputs at each time step, as in sequence labeling, we don’t want to consider the padded frames in our cost function.
We mask out the unused frames and compute the mean error over each sequence by dividing by its actual length.
Using `tf.reduce_mean()` does not work here because it would divide by the maximum sequence length.
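
A minimal NumPy sketch of such a masked cost, assuming a cross-entropy loss and one-hot targets that are all-zero vectors for the padding frames. The name `masked_cross_entropy` is made up for this example; in the graph you would use the corresponding TensorFlow ops.

```python
import numpy as np


def masked_cross_entropy(prediction, target):
    # prediction, target: batch_size x max_length x num_classes.
    cross_entropy = -np.sum(target * np.log(prediction), axis=2)
    mask = np.sign(np.max(np.abs(target), axis=2))  # 1 for real frames.
    cross_entropy = cross_entropy * mask
    # Average over the real frames of each sequence, then over the batch.
    per_sequence = np.sum(cross_entropy, axis=1) / np.sum(mask, axis=1)
    return np.mean(per_sequence)


# Two sequences of length 1 and 3, uniform predictions over two classes.
prediction = np.full((2, 3, 2), 0.5)
target = np.zeros((2, 3, 2))
target[0, 0, 1] = 1.0
target[1, :, 0] = 1.0
cost = masked_cross_entropy(prediction, target)
# A plain mean over all frames also divides by the padded frames.
naive = np.mean(-np.sum(target * np.log(prediction), axis=2))
```

Here every real frame has a cross entropy of `log(2)`, so the masked cost is `log(2)`, while the plain mean comes out smaller because it also counts the two padded frames. A sequence of length zero would cause a division by zero in this sketch.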

You can compute the average of your error function the same way.
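
For instance, a masked error rate could look like the following NumPy sketch, again assuming all-zero target vectors for the padding frames. The name `masked_error` is only used for this illustration.

```python
import numpy as np


def masked_error(prediction, target):
    # prediction, target: batch_size x max_length x num_classes.
    mistakes = np.not_equal(
        np.argmax(target, axis=2), np.argmax(prediction, axis=2))
    mask = np.sign(np.max(np.abs(target), axis=2))  # 1 for real frames.
    mistakes = mistakes.astype(np.float32) * mask
    # Fraction of wrong frames per sequence, averaged over the batch.
    per_sequence = np.sum(mistakes, axis=1) / np.sum(mask, axis=1)
    return np.mean(per_sequence)


# Sequence 0 has one frame (misclassified), sequence 1 has three (all correct).
target = np.zeros((2, 3, 2))
target[0, 0, 1] = 1.0
target[1, :, 0] = 1.0
prediction = np.zeros((2, 3, 2))
prediction[0, 0, 0] = 1.0
prediction[1, :, 0] = 1.0
error = masked_error(prediction, target)
print(error)  # 0.5
```
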
Actually, we wouldn’t have to do the masking for the cost and error functions because both prediction and target are zero vectors for the padding frames so they are perfect predictions.
Anyway, it’s nice to be explicit in code.
Here is a full example of variable-length sequence labeling.

# Select the Last Relevant Output
For sequence classification, we want to feed the last output of the recurrent network into a predictor, e.g. a softmax layer.
While taking the last frame worked well for fixed-size sequences, we now have to select the last relevant frame.
This is a bit cumbersome in TensorFlow since it doesn’t support advanced slicing yet.
In NumPy this would just be `output[np.arange(batch_size), length - 1]`.
But we need the indexing to be part of the compute graph in order to train the whole system end-to-end.
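
The following NumPy sketch mirrors the flatten-and-gather trick used by the `_last_relevant` helper in the listing at the end of this post. In the graph, `tf.range`, `tf.reshape`, and `tf.gather` take the place of `np.arange`, `reshape`, and the index lookup.

```python
import numpy as np


def last_relevant(output, length):
    batch_size, max_length, output_size = output.shape
    # Start offset of each example in the flattened array, plus the
    # position of its last real frame.
    index = np.arange(batch_size) * max_length + (length - 1)
    flat = output.reshape(-1, output_size)
    return flat[index]


output = np.arange(2 * 3 * 2).reshape(2, 3, 2)
length = np.array([1, 3])
relevant = last_relevant(output, length)
# Row 0 is output[0, 0] and row 1 is output[1, 2].
print(relevant)
```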

What happens here?
We flatten the output tensor to shape `frames in all examples x output size`.
Then we construct an index into that by creating a tensor with the start indices for each example, `tf.range(0, batch_size) * max_length`, and adding the individual sequence lengths minus one to it.
`tf.gather()` then performs the actual indexing.
Let’s hope the TensorFlow guys can provide proper indexing soon so this gets much easier.

On a side note: A one-layer GRU network outputs its full state.
In that case, we can use the state returned by `tf.nn.dynamic_rnn()` directly.
Similarly, we can use `state.h` for a one-layer LSTM network.
For more complex architectures, that doesn’t work or at least results in a large number of parameters.

We got the last relevant output and can feed that into a simple softmax layer to predict the class of each sequence:
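
As a rough NumPy sketch of that step, with randomly initialized stand-ins for the trained weights (the sizes and variable names below are arbitrary choices for the example; the graph version is the `prediction` property in the listing at the end of this post):

```python
import numpy as np


def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)


rng = np.random.RandomState(0)
num_hidden, num_classes = 4, 3
last = rng.randn(2, num_hidden)  # Last relevant output per sequence.
weight = rng.randn(num_hidden, num_classes) * 0.01
bias = np.full(num_classes, 0.1)
prediction = softmax(last @ weight + bias)  # batch_size x num_classes
```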

You can of course use more complex predictors with multiple layers as well.
Here is the working example for variable-length sequence classification.

I explained how to use recurrent networks on variable-length sequences and how to use their outputs. Feel free to comment with questions and remarks.

```python
# Updated to work with TF 1.4: https://gist.github.com/abaybektursun/98656e483ec6e918c26235b47f3f5d60
# Working example for my blog post at:
# http://danijar.com/variable-sequence-lengths-in-tensorflow/
import functools
import sets  # Third-party dataset helper that provides Mnist, not the old stdlib module.
import tensorflow as tf
from tensorflow import nn


def lazy_property(function):
    # Evaluate the decorated method once and cache the result on the instance,
    # so that each part of the graph is only built a single time.
    attribute = '_' + function.__name__

    @property
    @functools.wraps(function)
    def wrapper(self):
        if not hasattr(self, attribute):
            setattr(self, attribute, function(self))
        return getattr(self, attribute)
    return wrapper


class VariableSequenceClassification:

    def __init__(self, data, target, num_hidden=200, num_layers=2):
        self.data = data
        self.target = target
        self._num_hidden = num_hidden
        self._num_layers = num_layers
        # Access the properties once to add their ops to the graph.
        self.prediction
        self.error
        self.optimize

    @lazy_property
    def length(self):
        # Binary mask of used frames, summed over time to get the lengths.
        used = tf.sign(tf.reduce_max(tf.abs(self.data), axis=2))
        length = tf.reduce_sum(used, axis=1)
        length = tf.cast(length, tf.int32)
        return length

    @lazy_property
    def prediction(self):
        # Recurrent network.
        output, _ = nn.dynamic_rnn(
            nn.rnn_cell.GRUCell(self._num_hidden),
            self.data,
            dtype=tf.float32,
            sequence_length=self.length,
        )
        last = self._last_relevant(output, self.length)
        # Softmax layer.
        weight, bias = self._weight_and_bias(
            self._num_hidden, int(self.target.get_shape()[1]))
        prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
        return prediction

    @lazy_property
    def cost(self):
        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
        return cross_entropy

    @lazy_property
    def optimize(self):
        learning_rate = 0.003
        optimizer = tf.train.RMSPropOptimizer(learning_rate)
        return optimizer.minimize(self.cost)

    @lazy_property
    def error(self):
        mistakes = tf.not_equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))

    @staticmethod
    def _weight_and_bias(in_size, out_size):
        weight = tf.truncated_normal([in_size, out_size], stddev=0.01)
        bias = tf.constant(0.1, shape=[out_size])
        return tf.Variable(weight), tf.Variable(bias)

    @staticmethod
    def _last_relevant(output, length):
        # Flatten the output and gather the last real frame of each example.
        batch_size = tf.shape(output)[0]
        max_length = int(output.get_shape()[1])
        output_size = int(output.get_shape()[2])
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(output, [-1, output_size])
        relevant = tf.gather(flat, index)
        return relevant


if __name__ == '__main__':
    # We treat images as sequences of pixel rows.
    train, test = sets.Mnist()
    _, rows, row_size = train.data.shape
    num_classes = train.target.shape[1]

    data = tf.placeholder(tf.float32, [None, rows, row_size])
    target = tf.placeholder(tf.float32, [None, num_classes])

    model = VariableSequenceClassification(data, target)

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    for epoch in range(10):
        for _ in range(100):
            batch = train.sample(10)
            sess.run(model.optimize, {data: batch.data, target: batch.target})
        error = sess.run(model.error, {data: test.data, target: test.target})
        print('Epoch {:2d} error {:3.1f}%'.format(epoch + 1, 100 * error))
```

```
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Epoch  1 error 45.2%
Epoch  2 error 25.0%
Epoch  3 error 15.6%
Epoch  4 error 14.9%
Epoch  5 error 10.1%
Epoch  6 error 11.4%
Epoch  7 error 9.7%
Epoch  8 error 7.8%
Epoch  9 error 6.5%
Epoch 10 error 6.5%
```
