- Title: Pad a Sequence in Python
- Slug: python-pad-sequence
- Date: 2020-03-05 11:45:08
- Category: Computer Science
- Tags: programming, Python, AI, data science, machine learning, deep learning, PyTorch, Keras, pad, sequence, numpy, array, TensorFlow
- Author: Ben Du
- Modified: 2020-03-05 11:45:08


In [36]:
import numpy as np
import torch
import tensorflow as tf

In [23]:
x = torch.tensor([
    [1., 2, 3, 4, 5],
    [6., 7, 8, 9, 10],
])
x

tensor([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.]])

## Tips

1. `numpy.pad` and `torch.nn.utils.rnn.pad_sequence` can only increase the length of a sequence (NumPy array, list or tensor),
    while `tf.keras.preprocessing.sequence.pad_sequences` can both increase and decrease the length of a sequence.
    
2. `numpy.pad` implements many different ways 
    (constant, edge, linear_ramp, maximum, mean, median, minimum, reflect, symmetric, wrap, empty and an arbitrary padding function) 
    to pad a sequence,
    while `torch.nn.utils.rnn.pad_sequence` and `tf.keras.preprocessing.sequence.pad_sequences` only support padding with a constant value
    (as this is the typical use case in NLP).
    
3. You can easily control the final length (after padding) 
    with `numpy.pad` and `tf.keras.preprocessing.sequence.pad_sequences`. 
    `torch.nn.utils.rnn.pad_sequence` pads each tensor to the maximum length among all the tensors passed in. 
    You cannot easily use `torch.nn.utils.rnn.pad_sequence` to pad sequences to an arbitrary length. 
    
4. `numpy.pad` pads a single array-like object (NumPy array, list or Tensor),
    `torch.nn.utils.rnn.pad_sequence` pads a sequence of Tensors,
    and `tf.keras.preprocessing.sequence.pad_sequences` pads a sequence of iterable objects 
    (NumPy arrays, lists or Tensors).
    
    
Overall, 
`tf.keras.preprocessing.sequence.pad_sequences` is the most useful for NLP.
`torch.nn.utils.rnn.pad_sequence` seems to be quite limited. 
`numpy.pad` can be used to easily implement a customized padding strategy.
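
As a minimal sketch of such a customized strategy, the helper below uses `numpy.pad` to bring every sequence in a batch to a fixed length: longer sequences are cut at the end and shorter ones are padded at the end with a constant. The name `pad_or_truncate` and its parameters are hypothetical (they are not part of NumPy), and the sketch assumes 1-D sequences of numbers.

```python
import numpy as np


def pad_or_truncate(sequences, maxlen, value=0):
    """Hypothetical helper: force every 1-D sequence to length `maxlen`."""
    rows = []
    for seq in sequences:
        arr = np.asarray(seq)[:maxlen]  # drop anything beyond maxlen
        missing = maxlen - len(arr)     # number of slots still to fill
        rows.append(np.pad(arr, (0, missing), mode="constant", constant_values=value))
    return np.stack(rows)


batch = pad_or_truncate([[1, 2, 3, 4, 5], [6, 7]], maxlen=4)
print(batch)
# [[1 2 3 4]
#  [6 7 0 0]]
```

Padding or truncating at the start instead would only require changing the slice and the `(0, missing)` pad widths.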

## [numpy.pad](https://docs.scipy.org/doc/numpy/reference/generated/numpy.pad.html)



In [24]:
a = [1, 2, 3, 4, 5]
np.pad(a, (2, 3), 'constant', constant_values=(4, 6))

array([4, 4, 1, 2, 3, 4, 5, 6, 6, 6])
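
Besides constant padding, `numpy.pad` supports the other modes listed in the tips. The short sketch below shows two of them (`edge` and `reflect`) on the same list, plus the shorthand of passing a single integer as the pad width. The variable names are only for illustration.

```python
import numpy as np

a = [1, 2, 3, 4, 5]

# "edge" repeats the first and last elements.
edge = np.pad(a, (2, 3), mode="edge")

# "reflect" mirrors the values around the first and last elements.
reflect = np.pad(a, (2, 2), mode="reflect")

# A single integer pads both ends by the same amount (with 0 by default).
zeros = np.pad(a, 1, mode="constant")

print(edge)     # [1 1 1 2 3 4 5 5 5 5]
print(reflect)  # [3 2 1 2 3 4 5 4 3]
print(zeros)    # [0 1 2 3 4 5 0]
```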

## [torch.nn.utils.rnn.pad_sequence](https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_sequence)

In [27]:
t = torch.nn.utils.rnn.pad_sequence(
    [
        torch.tensor([1, 2, 3]),
        torch.tensor([1, 2, 3, 4]),
    ]
)
t

tensor([[1, 1],
        [2, 2],
        [3, 3],
        [0, 4]])

In [28]:
t[0]

tensor([1, 1])

In [25]:
torch.nn.utils.rnn.pad_sequence([
    [1, 2, 3],
    [1, 2, 3, 4],
])

AttributeError: 'list' object has no attribute 'size'
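
`pad_sequence` needs tensors rather than plain lists, and it always pads to the longest sequence in the batch. The sketch below shows two variations that may be useful: the `batch_first` and `padding_value` arguments of `pad_sequence`, and one possible workaround for padding to a chosen length with `torch.nn.functional.pad`. The `target_len` variable is a hypothetical value, assumed to be at least as long as the longest sequence.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([1, 2, 3, 4])]

# batch_first=True returns shape (batch, max_len) instead of (max_len, batch).
batch = pad_sequence(seqs, batch_first=True, padding_value=-1)

# Workaround sketch: pad each 1-D tensor on the right to a chosen length.
target_len = 6  # hypothetical target length
fixed = torch.stack([F.pad(s, (0, target_len - s.size(0)), value=0) for s in seqs])

print(batch)
# tensor([[ 1,  2,  3, -1],
#         [ 1,  2,  3,  4]])
print(fixed)
# tensor([[1, 2, 3, 0, 0, 0],
#         [1, 2, 3, 4, 0, 0]])
```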

## [tf.keras.preprocessing.sequence.pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

In [34]:
tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=3,
    dtype="long",
    value=0,
    truncating="post",
    padding="post"
)

array([[1, 2, 3]])

In [35]:
tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=9,
    dtype="long",
    value=0,
    truncating="post",
    padding="post"
)

array([[1, 2, 3, 4, 5, 0, 0, 0, 0]])

## Reference

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
    
https://docs.scipy.org/doc/numpy/reference/generated/numpy.pad.html
    
https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_sequence