
Is padding necessary for LSTM network? #2375

Closed
gbezerra opened this Issue Apr 18, 2016 · 9 comments

@gbezerra

gbezerra commented Apr 18, 2016

Hi,

I would like to input sequences of different length into an LSTM network without having to pad them (thus reducing the huge waste of memory). Is this possible?

I'm getting the following error when I try to input a list of lists or an array of arrays:

ValueError: ('Bad input argument to theano function with name "/opt/anaconda2/lib/python2.7/site-packages/keras/backend/theano_backend.py:484" at index 0(0-based)', 'setting an array element with a sequence.')

However, I have no problems when I use pad_sequences before passing the input. Below is my code:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import LSTM
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences
import pandas as pd
import config
from numpy import array, random
import pdb

def load_data(test_split = 0.2):
    print 'Loading data...'
    df = pd.read_csv(config.Data.feature_matrix)
    df['encoded_msg'] = df['encoded_msg'].apply(lambda x: \
        [int(e) for e in x.split()])
    df = df.reindex(random.permutation(df.index))

    X_train = df['encoded_msg'].values[:int(len(df) * (1 - test_split))]
    y_train = array(df['was_blocked'].values[:int(len(df) * (1 - test_split))])

    X_test = array(df['encoded_msg'].values[int(len(df) * (1 - test_split))])
    y_test = array(df['was_blocked'].values[int(len(df) * (1 - test_split))])

    #return pad_sequences(X_train) ,y_train, X_test, y_test
    return X_train ,y_train, X_test, y_test


def create_model(input_length):
    print 'Creating model...'
    model = Sequential()
    model.add(Embedding(input_dim = 188, output_dim = 50, input_length = input_length))
    model.add(LSTM(output_dim=256, activation='sigmoid', inner_activation='hard_sigmoid',
        return_sequences = True))
    model.add(Dropout(0.5))
    model.add(LSTM(output_dim=256, activation='sigmoid', inner_activation='hard_sigmoid'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation = 'sigmoid'))

    print 'Compiling...'
    model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    return model


X_train, y_train, X_test, y_test = load_data()
model = create_model(len(X_train[0]))

print 'Fitting model...'
hist = model.fit(X_train, y_train, batch_size=64, nb_epoch=10, validation_split = 0.1,
    verbose = 1)

Any help is highly appreciated.

Thanks!

Please make sure that the boxes below are checked before you submit your issue. Thank you!

  • [x] Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps
  • [x] If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
    pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps
  • [x] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).
@philipperemy


philipperemy commented Apr 18, 2016

Alternatives to padding

Padding is useful when you batch your sequences. If you don't want to pad and mask, you have a couple of options:

1 - batch_size=1. You feed the sequences one by one, in which case they don't need to be the same length. Something like (from memory):

for seq, label in zip(sequences, y):
    model.train_on_batch(np.array([seq]), np.array([label]))

2 - Grouping samples by length (all sequences of length 5 together and all sequences of length 4 together)

I have tried both of these methods before. A batch size of 1 is indeed too slow (you lose the parallelism within a batch), and grouping samples by length is something I don't find very elegant.

So I would go for the masking.

Explanation of your error

It seems like you are giving an array/matrix where the function expects a list (of lists).

keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32')
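
For example, a minimal sketch of the padded path, reusing the variable names from the script above: pad_sequences turns the variable-length integer sequences into one rectangular int matrix, which is what the Theano function in the traceback expects.

from keras.preprocessing.sequence import pad_sequences

# X_train holds variable-length lists of ints; model.fit needs one rectangular array.
X_train_padded = pad_sequences(X_train, maxlen=None, dtype='int32')
model = create_model(X_train_padded.shape[1])
hist = model.fit(X_train_padded, y_train, batch_size=64, nb_epoch=10, validation_split=0.1)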
@gbezerra


gbezerra commented Apr 18, 2016

Thanks @philipperemy , that's very helpful.

I'm not sure masking will work for me. Currently, my problem is that the GPU runs out of memory for batches larger than 64. I'd like to reduce the memory footprint by not having to pad the sequences so that I could run larger batches.

However, from what I read about masking, I would still have to send the full padded sequence into the network, which wouldn't reduce memory use.

Am I understanding it right?

Also, how did you manage to train the network by grouping sequences of same size? Do you run all epochs for a given size, then you move on to the next size?

Thanks again.

@philipperemy


philipperemy commented Apr 18, 2016

I'd like to reduce the memory footprint by not having to pad the sequences so that I could run larger batches.
=> If you use batch mode, all sequences inside a batch must be of the same length; you can't avoid that. You could, however, truncate your sequences if you have a large number of short sequences and very few long ones, because when you pad, you always pad up to the longest sequence, so your matrix can end up very sparse. Whether this works depends on the length distribution of your sequences (check whether the right tail is thin or fat).
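
A hedged sketch of the truncation idea (the maxlen value here is arbitrary; pick it from your own length distribution):

from keras.preprocessing.sequence import pad_sequences

maxlen = 200   # e.g. roughly the 95th percentile of sequence lengths
X_train_capped = pad_sequences(X_train, maxlen=maxlen, truncating='post')   # longer sequences are cut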

Also, how did you manage to train the network by grouping sequences of same size? Do you run all epochs for a given size, then you move on to the next size?
=> Yes. One epoch means you run through your entire training set, and the epoch ends once you have processed all the possible lengths (you could start with the sequences of length=1, then length=2, ...). But I would advise you to shuffle this order (by default, the order of the batches is random), as in the sketch below.
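
A minimal sketch of that per-length grouping, assuming a model built without a fixed input_length and the X_train / y_train arrays from the original script (in practice you would also split large groups into mini-batches):

import numpy as np
from collections import defaultdict

buckets = defaultdict(list)
for seq, label in zip(X_train, y_train):
    buckets[len(seq)].append((seq, label))   # group sequences by length

for epoch in range(10):
    lengths = list(buckets.keys())
    np.random.shuffle(lengths)               # shuffle the order of the length groups
    for length in lengths:
        pairs = buckets[length]
        X = np.array([s for s, _ in pairs])  # all rows share the same length: no padding needed
        y = np.array([l for _, l in pairs])
        model.train_on_batch(X, y)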

@braingineer

Contributor

braingineer commented Apr 18, 2016

I haven't tried grouping, but I have heard of approaches that group by value when building batches, doing a stratification of sorts. DeepMind uses this in their improved experience replay: they stratify the experiences by TD error so they can sample batches that are relatively uniform across all TD errors.

Maybe one thing you could try, as a middle ground, is to sort your sequences by length and then group them within +- some tolerance. You then need to find a balance: the groups should be small enough that your stochastic sampling isn't ruined (correlated errors are bad and will result in a bad model), but also large enough that you still get the parallelization gains.

@philipperemy - when you have different sequence lengths, do you have to recompile the model? Or is the model happy not knowing the sequence length ahead of time? I usually have issues with those types of parameters so I over-do it and provide every size ahead of time.

@codekansas

Contributor

codekansas commented Apr 18, 2016

I'd like to reduce the memory footprint by not having to pad the sequences so that I could run larger batches.

You can try consume_less='mem' on the RNN layers; it should help on your GPU and might actually speed up computation. I think most of the memory issues I had came from the RNN's hidden states: if you're using an embedding, the RNN part is where you get the blowup (masking will help reduce this, I believe, although I'm not sure).
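
For reference, in the Keras 1.x API this is just a constructor argument on the recurrent layer; a minimal sketch against the model defined in the original script:

model.add(LSTM(256, activation='sigmoid', inner_activation='hard_sigmoid',
               consume_less='mem',       # more but smaller matrix products: slower, less memory
               return_sequences=True))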

To skip over padding for your RNNs, add a masking layer as suggested by @gbezerra, although it doesn't work if you want to merge layers later on.

Someone really needs to get masking working for merge layers, at least for certain merge types.

@voletiv


voletiv commented Apr 25, 2017

Sorry for the late reply, shouldn't:

X_test = array(df['encoded_msg'].values[int(len(df) * (1 - test_split))])
y_test = array(df['was_blocked'].values[int(len(df) * (1 - test_split))])

instead be:

X_test = array(df['encoded_msg'].values[int(len(df) * (1 - test_split)):])
y_test = array(df['was_blocked'].values[int(len(df) * (1 - test_split)):])

?

I added a ':' after ... * (1 - test_split)) in each line, so your training data is data[:int(length*0.8)] and your test data is data[int(length*0.8):].

@stale stale bot added the stale label Jul 24, 2017

@stale


stale bot commented Jul 24, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot closed this Aug 24, 2017

@xu-song


xu-song commented Jan 15, 2018

Is padding necessary for LSTM?

No.
But it is a common strategy for batch optimization.

LSTM with zero-padding

The LSTM implementation in the Keras example adopts the padding approach. It treats the padding value as an ordinary word, with a non-zero embedding. However, I don't think that is a good strategy.

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Embedding, LSTM
from keras.preprocessing import sequence

# max_features and maxlen are defined in the example
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)   # pre-padding with 0

model = Sequential()
model.add(Embedding(max_features, 128))  # non-zero embedding for the zero padding
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))

LSTM with no padding

pytorch

It is easy to implement an LSTM with no padding in a dynamic graph. Here is an example in PyTorch:
optimize the sequences one by one, that is batch_size = 1 #40 (comment)
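
A minimal sketch of that idea (the layer sizes here are made up, not taken from the linked comment): with a dynamic graph you can simply feed each variable-length sequence as a batch of one.

import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 50)
lstm = nn.LSTM(input_size=50, hidden_size=128, batch_first=True)

for seq in sequences:                        # each seq is a list of word indices
    x = torch.LongTensor(seq).unsqueeze(0)   # shape (1, seq_len), no padding needed
    output, (h, c) = lstm(embedding(x))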

tensorflow

https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn
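
A rough sketch of how dynamic_rnn handles variable lengths (TF 1.x API; the placeholder shapes are made up): sequences are still padded within a batch, but sequence_length stops the computation at each sequence's true end.

import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, None, 50])   # (batch, max_len, embedding_size)
seq_len = tf.placeholder(tf.int32, [None])              # true length of each sequence

cell = tf.nn.rnn_cell.LSTMCell(128)
outputs, state = tf.nn.dynamic_rnn(cell, inputs,
                                   sequence_length=seq_len,  # timesteps past seq_len are skipped
                                   dtype=tf.float32)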

LSTM with mask

mask in Embedding Layer

model = Sequential()
model.add(Embedding(max_features, 128, mask_zero = True))  # zero embedding for zero_padding
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))

mask in Masking Layer

model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(32))

The two methods above are equivalent. As the Keras docs say:

If mask_zero is set to True, the input value 0 will be a special "padding" that should be masked out.
Index 0 cannot be used in the vocabulary. (Embedding Layer doc)

Masks a sequence by using a mask value to skip timesteps. (Masking Layer doc)

By default (mask_zero=False), 0 is considered an ordinary word in the vocabulary.
If we set mask_zero=True, the padding timesteps are given a zero embedding,
and all padding timesteps are then skipped by the LSTM layer.

Will LSTM with mask reduce memory? Reduce computation?

I don't know.
This question may come down to "static computation graph" vs. "dynamic computation graph".
Although TensorFlow and Theano support masking in LSTM, the implementation ....

@eromoe


eromoe commented May 7, 2018

I have a padding question, which seems different:

For example:

I want to train a model to detect wrong words.

  1. I have 1 million sentences of different lengths.
  2. If a sentence's length is 10, I want the LSTM to learn from the 2 words surrounding each word (so each word becomes a window of 5 words). So I need to transform the sentence to shape [10, 5].
  3. But the first words have no words to their left, and the last words have no words to their right, so I need to pad them.
  4. Then I use word2vec to transform the sentence to [10, 5, embedding_size], and do the same for the other sentences.
  5. Training (feed the model batches of shape [batch_size, 5, embedding_size]).

Are my steps right? It looks like Keras doesn't have this feature built-in.
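
A minimal numpy sketch of steps 2-3 above (the function name and pad value are just illustrative):

import numpy as np

def context_windows(sentence, radius=2, pad_value=0):
    # Pad both ends so every word gets `radius` neighbours on each side.
    padded = [pad_value] * radius + list(sentence) + [pad_value] * radius
    return np.array([padded[i:i + 2 * radius + 1] for i in range(len(sentence))])

windows = context_windows([4, 7, 9, 2, 5])   # shape (5, 5): one 5-word window per word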

PS:
I am a bit confused about LSTM input.

  1. Some people pad all sentences to the same length.
  2. Some split sentences into fixed-size word windows (as I do above).

I am not sure when to use 1 and when to use 2.
I only understand that 2 can be used for sequence tagging.
