
Masks for RNNs #176

Closed
elanmart opened this issue May 30, 2015 · 26 comments
@elanmart

Hey,

I think it would be cool if we could specify when the recurrent network should stop updating its hidden state. For example, if my sequences have a max length of 100 and a particular example has a length of only 10, the network will update its hidden state 90 times before returning the final vector, which is not necessarily desirable.

@fchollet
Member

How would you suggest masks are implemented?

For now, you could simply group your samples into batches where all samples have the same length, or, even simpler (but slower), use a batch size of 1 (and no zero-padding).
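
For reference, a rough sketch of the grouping-by-length workaround in plain Python (the helper name and layout are illustrative only, not part of Keras):

from collections import defaultdict

def batches_by_length(sequences, batch_size):
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)  # group sequences of identical length together
    for same_length in buckets.values():
        for i in range(0, len(same_length), batch_size):
            yield same_length[i:i + batch_size]  # each batch needs no zero-padding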

@elanmart
Author

Well, I've tried both of these methods before. A batch size of 1 is indeed too slow, and grouping samples by length is something I don't find very elegant. I'm also not sure that it doesn't hurt performance, since one is no longer able to sample the data fully randomly.

I'm not really an expert when it comes to implementing stuff in Theano, but I think people from Lasagne have something like this:

https://github.com/craffel/nntools/blob/master/lasagne/layers/recurrent.py

@charlesollion
Contributor

I've worked a bit with masks for RNNs; they can be implemented in many different ways, and I think they can be quite useful.

If you're interested in the last output only, one easy way is to pass the mask to the step function so that it doesn't compute anything when the mask is 0 (the state and output stay the same):
def step(x_t, h_tm1, mask_t):
    # ... the usual recurrent computation here ...
    tmp_h_t = ...
    # when mask_t is 0, keep the previous state; when it is 1, take the newly computed one
    h_t = (1 - mask_t) * h_tm1 + mask_t * tmp_h_t

The input to the whole model would be [sequences, masks]; the mask could also be computed in Theano.
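
To make the batched version concrete, a tiny numpy sketch of the update above (shapes and names are illustrative): with mask_t of shape (batch, 1) it broadcasts over the hidden dimension, so masked samples simply keep their previous state.

import numpy as np

batch, hidden = 4, 8
h_tm1 = np.random.randn(batch, hidden)
tmp_h_t = np.random.randn(batch, hidden)       # stand-in for the recurrent computation
mask_t = np.array([[1.], [1.], [0.], [1.]])    # the third sample is past its true length at this timestep
h_t = (1 - mask_t) * h_tm1 + mask_t * tmp_h_t  # masked samples keep their previous hidden state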

If you're interested in the whole output sequence, you also need to compute a masked loss, which can be tricky.

@fchollet
Member

If you're interested in the whole output sequence, you also need to compute a masked loss, which can be tricky.

The current layers can output either the last output or the entire sequence. Masking needs to be compatible with both.

I wonder how much of a bad practice it would be not to keep a separate mask variable, and instead just stop the iteration when an all-0 input is found at a certain timestep. It would make things much easier. What do you guys think?

@elanmart
Author

I thought about it, but it will only work with one example per batch, or with all examples of the same length, right?

@charlesollion
Contributor

If we're sure to always pad with zeros, and that no input is all-0 before the end of the sequence, that would be OK.
Still, you need to carry on the computation for the inputs in the batch that are longer, while keeping the result for the 'stopped' ones. That can be done in the step function.

If outputting the whole sequence is enabled, you get a batch of output sequences, some of which are padded, which is not easy to deal with.
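
A hypothetical numpy sketch of such a masked loss (the names are illustrative, not from Keras): padded timesteps contribute nothing, and the average is taken over real timesteps only.

import numpy as np

def masked_mse(y_true, y_pred, mask):
    # y_true, y_pred: (batch, time, features); mask: (batch, time), 1 for real timesteps, 0 for padding
    per_step = ((y_true - y_pred) ** 2).mean(axis=-1)  # squared error per timestep: (batch, time)
    return (per_step * mask).sum() / mask.sum()        # average only over unmasked timesteps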

@wxs
Contributor

wxs commented Jun 17, 2015

I took a stab at this in #239. I'm still massaging it a bit, and it's just in the SimpleRNN for the moment, but I'd be interested to get your feedback.

My issue now is how best to get the mask input into the SimpleRNN (my inputs come after an Embedding layer, so I need to use Merge to merge the mask back in; I'm working on that now). @fchollet this would be an issue with what you describe of not keeping a separate mask variable, since after an embedding there would be no all-0 input.

I suppose another option would be to put a constraint on the Embedding layer, forcing it not to learn a representation for the "pad" value.

@fchollet
Member

this would be an issue with what you describe of not keeping a separate mask variable, since after an embedding there would be no all-0 input.

Correct, but how would masking work with Embedding layers in the case of a separate mask parameter?

It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs. After the embedding stage, just go over the input indices and when a zero is encountered, set the corresponding feature vector to 0.

This would be compatible with our text preprocessing utils, which assume that 0 is a non-character.
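
As a rough illustration of that idea in plain numpy (hypothetical function name, not the actual Embedding layer code):

import numpy as np

def embed_with_zero_pad(W, indices):
    # W: (vocab_size, output_dim) embedding matrix; indices: (batch, time) integer ids, 0 = non-character
    out = W[indices]        # ordinary embedding lookup: (batch, time, output_dim)
    out[indices == 0] = 0.  # force an all-zero feature vector wherever the index is 0
    return out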

@wxs
Contributor

wxs commented Jun 17, 2015

how would masking work with Embedding layers in the case of a separate mask parameter?

I was thinking of either modifying Embedding to optionally pass through a mask (following the convention that masks are always concatenated along the time dimension), or else using a Merge to concatenate the embedding with the mask.

It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs

Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a value ever "happens" to hit 0, its properties suddenly change.

Perhaps it would be safer to use, e.g., NaN or -Inf, but I don't know how those interact with the GPU.

@wxs
Contributor

wxs commented Jun 17, 2015

Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?

@fchollet
Member

Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?

Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors.

Regarding the Embedding layer, the fix could be done by adding one line:

self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector

Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?
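
For concreteness, a minimal Theano sketch of such a check inside a step function, assuming a simple tanh recurrence (W, U and all names here are illustrative, not the actual Keras implementation):

import numpy as np
import theano
import theano.tensor as T

W = theano.shared(np.random.randn(8, 16).astype('float32'))   # input-to-hidden weights
U = theano.shared(np.random.randn(16, 16).astype('float32'))  # hidden-to-hidden weights

def step(x_t, h_tm1):
    is_pad = T.eq(T.sum(T.abs_(x_t), axis=-1, keepdims=True), 0)     # 1 where the input vector is all-zero
    h_candidate = T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))            # the usual recurrent computation
    return T.switch(is_pad, T.zeros_like(h_candidate), h_candidate)  # zero output on padded timesteps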

@fchollet
Member

Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a value ever "happens" to hit 0, its properties suddenly change.

I think that's statistically impossible because every value in the feature vector would need to reach exactly zero, starting from a random initialization. Even if all-zero happened to be an optimum in the context of some task, the learned value could end up epsilon-close to all-zero but likely never all-zero.

@wxs
Contributor

wxs commented Jun 18, 2015

Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case

I think you'd prefer to return h_tm1 here, since in your examples and utilities you post-pad shorter sequences with 0 (or change the examples to pre-pad, I suppose; I think pre-padding makes a bit more sense anyway).

I guess this is a bit easier to understand than concatenating the mask to the input, but it is potentially more prone to "accidental" bugs, where the user passes in some zero data without understanding this effect and gets strange behaviour. For example, if this becomes the standard behaviour in all layer types, what if I'm doing a CNN on an image and I happen to have a patch of black pixels?

@fchollet
Member

what if I'm doing a CNN on an image and I happen to have a patch of black pixels?

Typically you'll first run your input through a conv2D layer, then run a sequence of output vectors through a recurrent layer. Again, it will be statistically impossible for the processed vectors to be all zero.

I agree that the behavior seems "dirty", but as long as it is clearly documented we should be fine. And accidental bugs will be so improbable as to be practically impossible.

The main argument for this setup is that it introduces no architecture issues (the nature and shape of the data being passed around is unchanged) and it is very easy to implement / simple to understand.

@fchollet
Member

I think pre-padding makes a bit more sense anyway

Agreed on that.

@charlesollion
Contributor

If you pre-pad, you could even mask only the 0s at the beginning: once a non-zero entry appears in the sequence, every following entry is considered and computed, even all-0 ones. I think that's the cleanest approach!

You could compute this mask in the layer computation and pass it to the step function; when I get a bit of time I'll try to write that.
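
A small numpy sketch of that idea (illustrative names only): the mask is derived from the pre-padded input itself, switching on at the first non-zero timestep and staying on for every timestep after it.

import numpy as np

def mask_from_prepadded(X):
    # X: (batch, time, features), pre-padded with zeros at the start of each sequence
    has_data = np.abs(X).sum(axis=-1) > 0                     # (batch, time): True where the timestep is non-zero
    return (np.cumsum(has_data, axis=1) > 0).astype(X.dtype)  # stays 1 from the first non-zero timestep onwards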

@wxs
Contributor

wxs commented Jun 18, 2015

I think that's statistically impossible because every value in the feature vector would need to reach exactly zero

Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)

@fchollet
Member

We'll definitely switch to pre-padding (it's a trivial change).

Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)

That's right. I think a good solution would be to make the mask value configurable in the Embedding layer and the recurrent layers, much like what XGBoost does. The default could be, for instance, -999.0.

model.add(Embedding(indim, outdim, mask_value=-999.))  # replaces index 0 with all-(-999.) vectors
model.add(SimpleRNN(outdim, outdim, mask_value=-999.)) # skips all-(-999.) vectors

@wxs
Contributor

wxs commented Jun 18, 2015

OK @fchollet, sounds like you're pretty set on the mask_value approach, which seems fine; you're right that it will be simpler to implement everywhere. It feels slightly "wrong" to me, but that's just aesthetics.

I'm happy to implement this, but let me know if you're doing it so we don't duplicate work.

Is it confusing that the Embedding input expects 0 as a pad, while everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how the pad is represented in the output.

@fchollet
Member

Is it confusing that the Embedding input expects 0 as a pad, while everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how the pad is represented in the output.

The reason for the discrepancy is that the input of an Embedding is a tensor of indices, which are positive integers. The default convention for the non-character index is 0.

The rest of the network uses an arbitrary mask value (float).

@wxs
Contributor

wxs commented Jun 18, 2015

OK, I've put up a preliminary implementation at #244; I'd love some review before I dive into getting more of the recurrent types supported.

@wxs
Contributor

wxs commented Jun 19, 2015

Btw, it looks like Bricks (from Blocks) takes the approach I took initially, of having a separate channel over which the mask is sent:

http://blocks.readthedocs.org/en/latest/api/bricks.html

@wxs
Contributor

wxs commented Jun 29, 2015

The PR implementing masks has now been merged, for those of you watching this issue.

@mbchang

mbchang commented Dec 17, 2016

Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors.
Regarding the Embedding layer, the fix could be done by adding one line:
self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector
Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?

Has this been implemented? I looked in the source code but couldn't find it.

@braingineer
Contributor

braingineer commented Dec 17, 2016

Are you referring to the recurrent pass-through, @mbchang? If so, check the backend code; e.g., in the Theano backend there is a switch over 0 for the next hidden state.

@wxs
Contributor

wxs commented Dec 19, 2016

@mbchang in general, after this discussion, Keras ended up moving to a separate, explicitly passed mask after all, rather than a special masking value.

Embedding takes a mask_zero boolean parameter, which can generate that mask automatically wherever there's a 0 in the input.
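
For example, a minimal sketch of that usage (layer sizes here are just placeholders):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128, mask_zero=True))  # index 0 is treated as padding
model.add(LSTM(64))                        # the mask is propagated, so padded timesteps are skipped
model.add(Dense(1, activation='sigmoid'))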
