# Dealing with variable-length sequences efficiently

This notebook is meant to be run using a GPU.

RNNs can work with variable-length sequences. This notebook shows how to do this efficiently with `keras`.

In [1]:
import tensorflow as tf
import numpy as np
import wget
import pandas

from os.path import exists
from tqdm.auto import tqdm



For this hands-on we will use the [Top Tagging dataset](https://arxiv.org/abs/1707.08966). This simplified Monte Carlo (MC) dataset consists of top-quark signal jets and background jets from other quarks and gluons. For each jet, up to 200 constituents are identified and ranked according to their $p_T$. The four-momenta of these ranked constituents serve as the input for our RNN. Note that the number of constituents varies between jets.



First, we download the data and take a look at it.

In [12]:
if not exists("test.h5"):
    wget.download('https://desycloud.desy.de/index.php/s/llbX3zpLhazgPJ6/download?path=%2F&files=test.h5')

In [13]:
store = pandas.HDFStore("test.h5")
data = store.select("table", stop=20) # Read the first 20 events

data = data.iloc[: , :-6] # drop the last six columns with truth information
data.head()

Unnamed: 0,E_0,PX_0,PY_0,PZ_0,E_1,PX_1,PY_1,PZ_1,E_2,PX_2,...,PY_197,PZ_197,E_198,PX_198,PY_198,PZ_198,E_199,PX_199,PY_199,PZ_199
436,218.364243,-172.341858,110.129105,-76.503624,153.661118,-111.320465,93.167969,-50.390713,76.708054,-56.523701,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
440,122.238762,26.738468,-91.613998,76.382225,121.227135,17.644758,-93.01545,75.715302,90.420105,21.377417,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
441,383.772308,-97.906456,79.640709,-362.426361,200.625992,-54.921326,37.994343,-189.184753,123.247223,-33.828953,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
444,132.492752,-77.763947,-87.322601,-62.3046,83.946594,-49.450481,-53.823605,-41.28801,28.072624,-19.964916,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
445,730.786987,-209.12001,-193.454315,-672.973877,225.477325,-75.36335,-66.22699,-201.926651,217.040192,-63.698189,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


To obtain a dataset that looks as if we had measured it, we remove the zero padding from each jet.

In [14]:
data = np.array(data)
data = [np.trim_zeros(data[i,:]) for i in range(data.shape[0])] # strip the zero padding and collect the jets in a list
data = [np.array([dat[0::4], dat[1::4], dat[2::4], dat[3::4]]).T for dat in data] # regroup into arrays of shape (constituent, feature)

Our dataset is now a list of measured jets, each an array of shape (constituent, feature). More generally, it is a list (batch) of arrays of shape (sequence, feature). The number of constituents varies between the jets.
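To make the two list comprehensions above more concrete, here is a minimal sketch on a single made-up row. The numbers are invented for illustration and are not taken from the dataset. It uses a plain `reshape`, which should give the same layout as the strided slicing above as long as the trimmed row length is a multiple of four. That holds unless the last real constituent happens to end in an exact zero.

```python
import numpy as np

# Toy row: two constituents (E, PX, PY, PZ each) followed by zero padding.
# These values are invented for illustration only.
row = np.array([10.0, 1.0, 2.0, 3.0,
                5.0, 0.5, 1.5, 2.5,
                0.0, 0.0, 0.0, 0.0])

# Strip the trailing padding; trim="b" only removes zeros at the back.
trimmed = np.trim_zeros(row, trim="b")

# Every group of four consecutive entries is one constituent, so a reshape
# recovers the (constituent, feature) layout.
jet = trimmed.reshape(-1, 4)
print(jet.shape)  # (2, 4)
```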

In [15]:
for jet in range(5):
    print(data[jet].shape[0], 'constituents for jet nr.', jet)

17 constituents for jet nr. 0
61 constituents for jet nr. 1
45 constituents for jet nr. 2
74 constituents for jet nr. 3
30 constituents for jet nr. 4


We want to feed this through the following 2-layer LSTM model - it takes a batch of sequences of arbitrary length and outputs a batch of numbers:

In [16]:
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(None, 4), return_sequences=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(1),
])

In [28]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_2 (LSTM)               (None, None, 128)         68096     
                                                                 
 lstm_3 (LSTM)               (None, 128)               131584    
                                                                 
 dense_1 (Dense)             (None, 1)                 129       
                                                                 
Total params: 199,809
Trainable params: 199,809
Non-trainable params: 0
_________________________________________________________________


How do we feed in the variable number of constituents? Since the first two input dimensions of our model, `(jets, constituents)`, are unspecified, we can pass each jet separately as a batch of size one. Let's see how fast this is:

In [32]:
for jet in tqdm(data):
    model(jet[np.newaxis, :])

100%|██████████| 20/20 [00:02<00:00,  9.69it/s]


Doesn't seem that bad, does it?

Wait! We haven't seen yet how fast it could be ...

If you look at the GPU utilization (e.g. with `nvidia-smi`) while this is running you will see it is rather low. That's because RNNs are inherently sequential - we can't process the different steps of a sequence (constituents in jet) in parallel.

But what we can do is process each step of the sequence (each constituent in jet) in parallel across all jets in our batch!

Keras will do this if we provide batches that are Tensors of fixed length.
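The idea of stepping through time once for the whole batch can be sketched in plain NumPy. This is a toy recurrent cell with random weights, not the LSTM used in this notebook, and all names (`run_one`, `run_batch`, `W`, `U`) are made up for the illustration. The loop over time steps stays sequential, but each step becomes a single matrix product over all sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, features, units = 8, 5, 4, 3
x = rng.normal(size=(batch, steps, features))  # equal-length toy sequences
W = rng.normal(size=(features, units))         # input weights (hypothetical)
U = rng.normal(size=(units, units))            # recurrent weights (hypothetical)

def run_one(seq):
    """Process a single sequence step by step."""
    h = np.zeros(units)
    for x_t in seq:
        h = np.tanh(x_t @ W + h @ U)
    return h

def run_batch(xs):
    """Process all sequences together: one matrix product per time step."""
    h = np.zeros((xs.shape[0], units))
    for t in range(xs.shape[1]):
        h = np.tanh(xs[:, t] @ W + h @ U)
    return h

looped = np.stack([run_one(seq) for seq in x])  # batch-many separate passes
batched = run_batch(x)                          # one pass over the time axis
print(np.allclose(looped, batched))
```

Both variants compute the same final states, but the batched one needs a rectangular array, which is why all sequences must share the same length.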

To try this out, let's enlarge the sequences to a fixed length and fill missing values with 0:

In [33]:
padded_data = tf.keras.preprocessing.sequence.pad_sequences(data, padding="post", dtype="float32")

Now we have a dataset with a uniform number of constituents (again).

In [20]:
padded_data.shape # (batch, constituents, features)

(20, 94, 4)

In [26]:
padded_data[0,:3,:] # four-momenta of the first three constituents in the first jet

array([[ 218.36424 , -172.34186 ,  110.129105,  -76.503624],
       [ 153.66112 , -111.320465,   93.16797 ,  -50.390713],
       [  76.70805 ,  -56.5237  ,   46.127293,  -23.695349]],
      dtype=float32)
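As a rough picture of what post-padding does, here is a NumPy-only sketch. The helper `pad_post` is hypothetical and written for this illustration; it is not the Keras implementation, and it assumes every sequence has the same number of features.

```python
import numpy as np

def pad_post(sequences, value=0.0):
    """Pad a list of (length, features) arrays at the end to a common length."""
    max_len = max(seq.shape[0] for seq in sequences)
    n_features = sequences[0].shape[1]
    out = np.full((len(sequences), max_len, n_features), value, dtype="float32")
    for i, seq in enumerate(sequences):
        out[i, :seq.shape[0]] = seq  # real data first, padding stays behind it
    return out

# Three toy "jets" with 2, 5 and 3 constituents of 4 features each.
toy = [np.ones((2, 4)), np.ones((5, 4)), np.ones((3, 4))]
padded = pad_post(toy)
print(padded.shape)  # (3, 5, 4)
```

The common length is set by the longest sequence in the list, which matches the shape `(20, 94, 4)` seen above if the longest of the 20 jets has 94 constituents.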

In [None]:
model.predict(padded_data, batch_size=256, verbose=True)



array([[-1.6438955e-13],
       [-9.1246635e-11],
       [-9.8816380e-12],
       [-4.3296894e-10],
       [-1.4093572e-12]], dtype=float32)

That should have been **much** faster.

But now the model has also processed the zero-padded values. As a result, the first output, for example, can differ from what we get when passing in the first sequence on its own:

In [None]:
model(data[0][np.newaxis, :])

<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[-1.6438963e-13]], dtype=float32)>

In Keras we can solve this with a `Masking` layer: it flags every time step whose features all equal `mask_value`, and subsequent RNN layers skip the flagged steps instead of processing them.

For more info, see https://keras.io/guides/understanding_masking_and_padding/
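The effect of masking can be sketched with a toy recurrent cell in NumPy. This is an illustration under simplifying assumptions, not the Keras code: the cell, the weights and the helper names are invented, and the sketch assumes that a masked step simply carries the previous state forward.

```python
import numpy as np

rng = np.random.default_rng(1)
features, units = 4, 3
W = rng.normal(size=(features, units))  # input weights (hypothetical)
U = rng.normal(size=(units, units))     # recurrent weights (hypothetical)

# Two toy sequences of length 2 and 4, post-padded with zeros to length 4.
seqs = [rng.normal(size=(2, features)), rng.normal(size=(4, features))]
padded = np.zeros((2, 4, features))
for i, s in enumerate(seqs):
    padded[i, :len(s)] = s

# A step counts as real data if any of its features differs from the mask value.
mask = np.any(padded != 0.0, axis=-1)   # shape (batch, steps)

def run_one(seq):
    """Reference: process one unpadded sequence on its own."""
    h = np.zeros(units)
    for x_t in seq:
        h = np.tanh(x_t @ W + h @ U)
    return h

def run_masked(xs, mask):
    """Batched pass that keeps the previous state on masked steps."""
    h = np.zeros((xs.shape[0], units))
    for t in range(xs.shape[1]):
        h_new = np.tanh(xs[:, t] @ W + h @ U)
        h = np.where(mask[:, t, None], h_new, h)
    return h

reference = np.stack([run_one(s) for s in seqs])
masked = run_masked(padded, mask)
unmasked = run_masked(padded, np.ones_like(mask))  # padding treated as data
print(np.allclose(reference, masked), np.allclose(reference, unmasked))
```

With the mask, the batched result matches the one-by-one reference. Without it, the padded steps keep updating the state of the shorter sequence.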

In [None]:
masked_model = tf.keras.Sequential([
    # the input shape belongs on the first layer of the model
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 4)),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(1),
])

In [None]:
masked_model.build(input_shape=(None, None, 4))

In [None]:
# set the weights such that we can compare the outputs of both models
masked_model.set_weights(model.get_weights())

In [None]:
masked_model.predict(padded_data, batch_size=256, verbose=True)



array([[ 0.17153287],
       [-0.02316774],
       [ 0.3422022 ],
       [ 0.10857085],
       [ 0.31025657]], dtype=float32)

This time the output is consistent with the one-by-one processing.