# Setup

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Introduction
- **Masking** is a way to tell sequence-processing layers that certain timesteps in an inputs are missing and thus should be skipped when processing the data.
- **Padding** is a special form of masking where the masked steps are at the start or the beggining of the sequence.
    - Padding comes from the need to encode sequence data into contiguous batches: in order to make all sequences in a batch fit a given standard length, it is necessary to pad or truncate some sequences.

# Padding sequence data
- When processing sequence data, it is very common for individual samples to have different lengths.
- Consider the following example.

In [2]:
[
    ["Hello", "world", "!"],
    ["How", "are", "you", "doing", "today"],
    ["The", "weather", "will", "be", "nice", "tomorrow"]
]

[['Hello', 'world', '!'],
 ['How', 'are', 'you', 'doing', 'today'],
 ['The', 'weather', 'will', 'be', 'nice', 'tomorrow']]

- After vocabulary lookup, the data might be vectorized as integers.

In [3]:
[
  [71, 1331, 4231],
  [73, 8, 3215, 55, 927],
  [83, 91, 1, 645, 1253, 927]
]

[[71, 1331, 4231], [73, 8, 3215, 55, 927], [83, 91, 1, 645, 1253, 927]]

- The data is a nested list with individual samples with length 3, 5, and 6.
- Since the input data for a deep learning model must be a single tensor, samples that are shorter than the longest item need to be **padded with some placeholder value**.
- Alternatively, one might also **truncate long samples before padding short samples**.
- Keras provides a utility function to truncate and pad Python lists to a common length: `tf.keras.preprocessing.sequence.pad_sequence`.
    - By default, the function will pad with 0s, but the placeholder value can be configured using the `value` parameter.
    - Note that you could `pre` padding (at the beginning) or `post` padding (at the end).
        - We recommend using `post` padding when working with RNN layers in order to use the CuDNN of the layer.

In [4]:
raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

In [5]:
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs, padding='post')

In [6]:
padded_inputs

array([[ 711,  632,   71,    0,    0,    0],
       [  73,    8, 3215,   55,  927,    0],
       [  83,   91,    1,  645, 1253,  927]], dtype=int32)

# Masking
- Now that all samples have a uniform length, the model must be informed that some part of the data is actually padded and should be ignored.
- This machanism is **masking**.
- There are three ways to introduce masks in Keras models:
    - Add a `keras.layers.Masking` layer
    - Configure a `keras.layers.Embedding` layer with `mask_zero=True`
        - The `keras.layers.Embedding` layer turns positive integers (indexes) into dense vectors of fixed size.
    - Pass a `mask` argument manually when calling layers that support this argument (e.g. RNN layers).

## Masking-generating layers: `Embedding` and `Masking`
- Under the hood, these layers will create a mask tensor (2D tensor with shape `(batch, sequence_length)`), and attach it to the tensor output returned by the `Masking` or `Embedding` layer.
    - Each individual `False` entry indicates that the corresponding timestep should be ignored during training.

In [7]:
embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padded_inputs)

print(masked_output._keras_mask)

tf.Tensor(
[[ True  True  True False False False]
 [ True  True  True  True  True False]
 [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)


In [8]:
masking_layer = layers.Masking()

# Simulate the embedding lookup by expanding the 2D input to 3D with embedding dimension of 10
unmasked_embedding = tf.cast(
    tf.tile(tf.expand_dims(padded_inputs, axis=-1), [1,1,10]), tf.float32
)

masked_embedding = masking_layer(unmasked_embedding)
print(masked_embedding._keras_mask)

tf.Tensor(
[[ True  True  True False False False]
 [ True  True  True  True  True False]
 [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)


## Mask propagation in the Functional API and Sequential API

In [9]:
# Sequential API
model = keras.Sequential([
    layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True),
    layers.LSTM(32)
])

In [10]:
# Functional API
inputs = keras.Input(shape=(None, ), dtype='int32')
X = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)(inputs)
outputs = layers.LSTM(32)(X)

model = keras.Model(inputs, outputs)

## Passing mask tensors directly to layers
- Layers that can handle masks (such as the `LSTM` layer) have a `mask` argument in their `__call__` method.
- Meanwhile, layers that produce a mask (e.g. `Embedding`) expose a `compute_mask(input, previous_mask)` method which you can call.
- Thus, you can pass the output of the **`compute_mask()` method of a mask-producing layer** to the **`__call__` method of a mask-consuming layer**, like the following.

In [11]:
class MyLayer(layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
        self.lstm = layers.LSTM(32)
        
    def call(self, inputs):
        X = self.embedding(inputs)
        mask = self.embedding.compute_mask(inputs) # Note that you can also prepare a mask tensor manually, which is a boolean tensor with the right shape
        output = self.lstm(X, mask=mask)
        return output

In [12]:
layer = MyLayer()

X = np.random.random((32, 10)) * 100
X = X.astype("int32")
layer(X)

<tf.Tensor: id=4712, shape=(32, 32), dtype=float32, numpy=
array([[ 0.0039643 , -0.00655454, -0.00223116, ..., -0.00368348,
         0.00161009, -0.00469382],
       [-0.00226298,  0.00542886,  0.00200883, ..., -0.00303189,
         0.00095562, -0.0027541 ],
       [ 0.00613192, -0.00525401,  0.00232902, ...,  0.00764465,
         0.00270565,  0.00929067],
       ...,
       [ 0.00265467, -0.0041955 ,  0.00143585, ..., -0.00253374,
         0.00506226,  0.00651153],
       [-0.00052386, -0.00336463, -0.00240109, ...,  0.00329279,
        -0.00707992, -0.00113707],
       [ 0.00523299, -0.00444659,  0.00104043, ..., -0.00032395,
        -0.00407169,  0.00483833]], dtype=float32)>

## Supporting masking in your custom layers
- Sometimes, you may need to write layers that generate a mask (like `Embedding`), or layers that need to modify the current mask.
    - For instance, any layer that produces a tensor with a different time dimension than its input, such as a Concatenate layer that concatenates on the time dimension, will need to modify the current mask so that downstream layers will be able to properly take masked timesteps into account.
- To do this, your layer should implement the `layer.compute_mask()` method, which produces a new mask given the input and the current mask.
- Here is an example of a TemporalSplit layer that needs to modify the current mask.

In [13]:
# This layer splits the input tensor into 2 tensors along the time dimension
class TemporalSplit(keras.layers.Layer):
    def call(self, inputs):
        return tf.split(inputs, 2, axis=1)
    
    def compute_mask(self, inputs, mask=None):
        if mask is None:
            return None
        else:
            return tf.split(mask, 2, axis=1)

In [14]:
first_half, second_half = TemporalSplit()(masked_embedding)
print(first_half._keras_mask)
print(second_half._keras_mask)

tf.Tensor(
[[ True  True  True]
 [ True  True  True]
 [ True  True  True]], shape=(3, 3), dtype=bool)
tf.Tensor(
[[False False False]
 [ True  True False]
 [ True  True  True]], shape=(3, 3), dtype=bool)


- Here's another example of a `CustomEmbedding` layer that is capable of generating a mask from input values.

In [15]:
class CustomEmbedding(keras.layers.Layer):
    def __init__(self, input_dim, output_dim, mask_zero=False, **kwargs):
        super().__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.mask_zero = mask_zero
        
    def build(self, input_shape):
        self.embeddings = self.add_weight(
            shape=(self.input_dim, self.output_dim),
            initializer='random_normal',
            dtype='float32'
        )
        
    def call(self, inputs):
        return tf.nn.embedding_lookup(self.embeddings, inputs)
    
    def compute_mask(self, inputs, mask=None):
        if not self.mask_zero:
            return None
        else:
            return tf.not_equal(inputs, 0)

In [16]:
layer = CustomEmbedding(10, 32, mask_zero=True)

X = np.random.random((3, 10)) * 9
X = X.astype("int32")

y = layer(X)
mask = layer.compute_mask(X)

print(mask)

tf.Tensor(
[[ True False  True  True  True  True  True  True False  True]
 [ True False False False  True  True  True  True  True  True]
 [ True  True  True  True False  True  True  True  True  True]], shape=(3, 10), dtype=bool)


## Opting-in to mask propagation on compatible layers
- Most layers don't modify the time dimension, so don't need to modify the current mask. 
- However, they may still want to be able to propagate the current mask, unchanged, to the next layer. This is an **opt-in behavior**. 
    - By default, a custom layer will destroy the current mask (since the framework has no way to tell whether propagating the mask is safe to do).
- If you have a custom layer that does not modify the time dimension, and if you want it to be able to propagate the current input mask, you should set `self.supports_masking = True` in the layer constructor. 
    - In this case, the default behavior of `compute_mask()` is just pass the current mask through.
- Here's an example of a layer that is whitelisted for mask propagation.

In [17]:
class MyActivation(keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        
        # Signal that the layer is safe for mask propagation
        self.supports_masking = True
        
    def call(self, inputs):
        return tf.nn.relu(inputs)

- You can now use this custom layer in-between a mask-generating layer (like `Embedding`) and a mask-consuming layer (like `LSTM`), and it will **pass the mask along** so that it reachs the mask-consuming layer.

In [18]:
inputs = keras.Input(shape=(None,), dtype='int32')
X = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)(inputs)
X = MyActivation()(X) # Will pass the mask along
print("Mask found", X._keras_mask)
outputs = layers.LSTM(32)(X) # Will receive the mask

model = keras.Model(inputs, outputs)

Mask found Tensor("embedding_4/NotEqual:0", shape=(None, None), dtype=bool)


# Writing layers that need mask information
- Some layers are **mask consumers**: they accept a mask argument in call and use it to determine whether to skip certain time steps.
- To write such a layer, you can simply add a `mask=None` argument in your `call` signature. 
    - The mask associated with the inputs will be passed to your layer whenever it is available.
- Here's a simple example below: a layer that computes a softmax over the time dimension (axis 1) of an input sequence, while discarding masked timesteps.

In [19]:
class TemporalSoftmax(keras.layers.Layer):
    def call(self, inputs, mask=None):
        broad_float_mask = tf.expand_dims(tf.cast(mask, 'float32'), -1)
        inputs_exp = tf.exp(inputs) * broad_float_mask
        inputs_sum = tf.reduce_sum(inputs * broad_float_mask, axis=1, keepdims=True)
        return inputs_exp / inputs_sum

In [20]:
inputs = keras.Input(shape=(None,), dtype="int32")
X = layers.Embedding(input_dim=10, output_dim=32, mask_zero=True)(inputs)
X = layers.Dense(1)(X)
outputs = TemporalSoftmax()(X)

In [21]:
model = keras.Model(inputs, outputs)
y = model(np.random.randint(0, 10, size=(32, 100)), np.random.random((32, 100, 1)))

In [22]:
y

<tf.Tensor: id=6936, shape=(32, 100, 1), dtype=float32, numpy=
array([[[-0.5138908 ],
        [-0.52876   ],
        [-0.54227686],
        ...,
        [-0.54227686],
        [-0.5360632 ],
        [-0.5231054 ]],

       [[-0.33242685],
        [-0.34676975],
        [-0.33242685],
        ...,
        [-0.33542195],
        [-0.        ],
        [-0.32429668]],

       [[-0.49944207],
        [-0.47864464],
        [-0.4950649 ],
        ...,
        [-0.47864464],
        [-0.5048409 ],
        [-0.50174344]],

       ...,

       [[-0.385618  ],
        [-0.37282786],
        [-0.37870866],
        ...,
        [-0.39866406],
        [-0.37870866],
        [-0.38902748]],

       [[-0.61954373],
        [-0.626725  ],
        [-0.        ],
        ...,
        [-0.61954373],
        [-0.61954373],
        [-0.6073538 ]],

       [[-0.50378186],
        [-0.        ],
        [-0.5166602 ],
        ...,
        [-0.5166602 ],
        [-0.51074004],
        [-0.47764057]]], dtype=

# Summary
- This is all you need to know about padding & masking in Keras.
    - **Masking** is how layers are able to know when to skip/ignore certain timesteps in sequence inputs.
    - Some layers are **mask-generators**.
        - `Embedding` can generate a mask from input values (if `mask_zero=True`), and so can the `Mask` layer.
    - Some layers are **mask-consumers**.
        - They expose a `mask` argument in their `__call__` method (e.g. RNN layers).
    - In the Functional API and Sequential API, mask information is propagated automatically.
    - When using layers in a standalone way, you can pass the `mask` arguments to layers manually.
    - You can easily write layers that modify the current mask, that generate a new mask, or that consume the mask associated with the inputs.