
Masks for RNNs #176

Closed
elanmart opened this issue May 30, 2015 · 26 comments
@elanmart

Hey,

I think it would be cool if we could specify when the recurrent network should stop updating its hidden state. For example, if my sequences have a max length of 100 and a particular example has a length of only 10, the network will update its hidden state 90 times before returning the final vector, which is not necessarily desirable.

@fchollet
Member

How would you suggest masks are implemented?

For now, you could simply group your samples into batches where all samples have the same length, or, even simpler (but slower), use a batch size of 1 (and no zero-padding).
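
For reference, a rough sketch of the grouping-by-length workaround in plain Python (the helper name and layout are illustrative only, not part of Keras):

from collections import defaultdict

def batches_by_length(sequences, batch_size):
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)  # group sequences of identical length together
    for same_length in buckets.values():
        for i in range(0, len(same_length), batch_size):
            yield same_length[i:i + batch_size]  # each batch needs no zero-padding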

@elanmart
Author

Well, I've tried both of these methods before. A batch size of 1 is indeed too slow, and grouping samples by length is something I don't find very elegant. I'm also not sure that it doesn't hurt performance, since one is no longer able to sample the data fully randomly.

I'm not really an expert when it comes to implementing stuff in Theano, but I think people from Lasagne have something like this:

https://github.com/craffel/nntools/blob/master/lasagne/layers/recurrent.py

@charlesollion
Contributor

I've worked a bit with masks for RNNs; they can be implemented in many different ways, and I think they can be quite useful.

If you're interested in the last output only, one easy way is to pass the mask to the step function so that it doesn't compute anything when the mask is 0 (the state and output stay the same):
def step(x_t, h_tm1, mask_t):
    # ... the usual recurrent computation here ...
    tmp_h_t = ...
    # when mask_t is 0, keep the previous state; when it is 1, take the newly computed one
    h_t = (1 - mask_t) * h_tm1 + mask_t * tmp_h_t

The input to the whole model would be [sequences, masks]; the mask could also be computed in Theano.
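
To make the batched version concrete, a tiny numpy sketch of the update above (shapes and names are illustrative): with mask_t of shape (batch, 1) it broadcasts over the hidden dimension, so masked samples simply keep their previous state.

import numpy as np

batch, hidden = 4, 8
h_tm1 = np.random.randn(batch, hidden)
tmp_h_t = np.random.randn(batch, hidden)       # stand-in for the recurrent computation
mask_t = np.array([[1.], [1.], [0.], [1.]])    # the third sample is past its true length at this timestep
h_t = (1 - mask_t) * h_tm1 + mask_t * tmp_h_t  # masked samples keep their previous hidden state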

If you're interested in the whole output sequence, you also need to compute a masked loss, which can be tricky.

@fchollet
Member

If you're interested in the whole output sequence, you also need to compute a masked loss, which can be tricky.

The current layers can output either the last output or the entire sequence. Masking needs to be compatible with both.

I wonder how much of a bad practice it would be not to keep a separate mask variable, and instead just stop the iteration when an all-0 input is found at a certain timestep. It would make things much easier. What do you guys think?

@elanmart
Author

I thought about it, but it will only work with one example per batch, or with all examples of the same length, right?

@charlesollion
Contributor

If we're sure to always pad with zeros, and that no input is all-0 before the end of the sequence, that would be OK.
Still, you need to carry on the computation for the inputs in the batch that are longer, while keeping the result for the 'stopped' ones. That can be done in the step function.

If outputting the whole sequence is enabled, you get a batch of output sequences, some of which are padded, which is not easy to deal with.
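
A hypothetical numpy sketch of such a masked loss (the names are illustrative, not from Keras): padded timesteps contribute nothing, and the average is taken over real timesteps only.

import numpy as np

def masked_mse(y_true, y_pred, mask):
    # y_true, y_pred: (batch, time, features); mask: (batch, time), 1 for real timesteps, 0 for padding
    per_step = ((y_true - y_pred) ** 2).mean(axis=-1)  # squared error per timestep: (batch, time)
    return (per_step * mask).sum() / mask.sum()        # average only over unmasked timesteps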

@wxs
Contributor

wxs commented Jun 17, 2015

I took a stab at this in #239. I'm still massaging it a bit, and it's just in the SimpleRNN for the moment, but I'd be interested to get your feedback.

My issue now is how best to get the mask input into the SimpleRNN (my inputs come after an Embedding layer, so I need to use Merge to merge the mask back in; I'm working on that now). @fchollet this would be an issue with what you describe of not keeping a separate mask variable, since after an embedding there would be no all-0 input.

I suppose another option would be to put a constraint on the Embedding layer, forcing it not to learn a representation for the "pad" value.

@fchollet
Member

this would be an issue with what you describe of not keeping a separate mask variable, since after an embedding there would be no all-0 input.

Correct, but how would masking work with Embedding layers in the case of a separate mask parameter?

It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs. After the embedding stage, just go over the input indices and when a zero is encountered, set the corresponding feature vector to 0.

This would be compatible with our text preprocessing utils, which assume that 0 is a non-character.
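
As a rough illustration of that idea in plain numpy (hypothetical function name, not the actual Embedding layer code):

import numpy as np

def embed_with_zero_pad(W, indices):
    # W: (vocab_size, output_dim) embedding matrix; indices: (batch, time) integer ids, 0 = non-character
    out = W[indices]        # ordinary embedding lookup: (batch, time, output_dim)
    out[indices == 0] = 0.  # force an all-zero feature vector wherever the index is 0
    return out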

@wxs
Contributor

wxs commented Jun 17, 2015

how would masking work with Embedding layers in the case of a separate mask parameter?

I was thinking of either modifying Embedding to optionally pass through a mask (following the convention that masks are always concatenated along the time dimension), or else using a Merge to concatenate the embedding with the mask.

It would be very easy to rectify the Embedding layer to output all-0 feature vectors for 0-inputs

Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a value ever "happens" to hit 0, its properties suddenly change.

Perhaps it would be safer to use, e.g., NaN or -Inf, but I don't know how those interact with the GPU.

@wxs
Contributor

wxs commented Jun 17, 2015

Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?

@fchollet
Member

Also: for a large feature vector isn't it quite inefficient to iterate over the entire vector just to check if it's masked?

Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors.

Regarding the Embedding layer, the fix could be done by adding one line:

self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector

Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?
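
For concreteness, a minimal Theano sketch of such a check inside a step function, assuming a simple tanh recurrence (W, U and all names here are illustrative, not the actual Keras implementation):

import numpy as np
import theano
import theano.tensor as T

W = theano.shared(np.random.randn(8, 16).astype('float32'))   # input-to-hidden weights
U = theano.shared(np.random.randn(16, 16).astype('float32'))  # hidden-to-hidden weights

def step(x_t, h_tm1):
    is_pad = T.eq(T.sum(T.abs_(x_t), axis=-1, keepdims=True), 0)     # 1 where the input vector is all-zero
    h_candidate = T.tanh(T.dot(x_t, W) + T.dot(h_tm1, U))            # the usual recurrent computation
    return T.switch(is_pad, T.zeros_like(h_candidate), h_candidate)  # zero output on padded timesteps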

@fchollet
Member

Hmm, doesn't this introduce a small probability that, for instance, an all-0 vector is learned by the embedding layer, which then gets "stuck"? I suppose for high-dimensional vectors that's pretty unlikely. But this could happen at any stage of the network: if a value ever "happens" to hit 0, its properties suddenly change.

I think that's statistically impossible because every value in the feature vector would need to reach exactly zero, starting from a random initialization. Even if all-zero happened to be an optimum in the context of some task, the learned value could end up epsilon-close to all-zero but likely never all-zero.

@wxs
Contributor

wxs commented Jun 18, 2015

Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case

I think you'd prefer to return h_tm1 here, since in your examples and utilities you post-pad shorter sequences with 0 (or change the examples to pre-pad, I suppose; I think pre-padding makes a bit more sense anyway).

I guess this is a bit easier to understand than concatenating the mask to the input, but it is potentially more prone to "accidental" bugs, where the user passes in some zero data without understanding this effect and gets strange behaviour. For example, if this becomes the standard behaviour in all layer types, what if I'm doing a CNN on an image and I happen to have a patch of black pixels?

@fchollet
Member

what if I'm doing a CNN on an image and I happen to have a patch of black pixels?

Typically you'll first run your input through a conv2D layer, then run a sequence of output vectors through a recurrent layer. Again, it will be statistically impossible for the processed vectors to be all zero.

I agree that the behavior seems "dirty", but as long as it is clearly documented we should be fine. And accidental bugs will be so improbable as to be practically impossible.

The main argument for this setup is that it introduces no architecture issues (the nature and shape of the data being passed around is unchanged) and it is very easy to implement / simple to understand.

@fchollet
Member

I think pre-padding makes a bit more sense anyway

Agreed on that.

@charlesollion
Contributor

If you pre-pad, you could even mask only the 0s at the beginning: once a non-zero entry appears in the sequence, every following entry is considered and computed, even all-0 ones. I think that's the cleanest approach!

You could compute this mask in the layer computation and pass it to the step function; when I get a bit of time I'll try to write that.
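
A small numpy sketch of that idea (illustrative names only): the mask is derived from the pre-padded input itself, switching on at the first non-zero timestep and staying on for every timestep after it.

import numpy as np

def mask_from_prepadded(X):
    # X: (batch, time, features), pre-padded with zeros at the start of each sequence
    has_data = np.abs(X).sum(axis=-1) > 0                     # (batch, time): True where the timestep is non-zero
    return (np.cumsum(has_data, axis=1) > 0).astype(X.dtype)  # stays 1 from the first non-zero timestep onwards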

@wxs
Contributor

wxs commented Jun 18, 2015

I think that's statistically impossible because every value in the feature vector would need to reach exactly zero

Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)

@fchollet
Member

We'll definitely switch to pre-padding (it's a trivial change).

Ah wait: what about after ReLU activation? Suddenly getting all 0 becomes significantly more likely (i.e. 1/2^n where n is the feature vector dimension)

That's right. I think a good solution would be to make the mask value configurable in the Embedding layer and the recurrent layers, much like what XGBoost does. The default could be, for instance, -999.0.

model.add(Embedding(indim, outdim, mask_value=-999.))  # replaces index 0 with all-(-999.) vectors
model.add(SimpleRNN(outdim, outdim, mask_value=-999.)) # skips all-(-999.) vectors

@wxs
Contributor

wxs commented Jun 18, 2015

OK @fchollet, sounds like you're pretty set on the mask_value approach, which seems fine; you're right that it will be simpler to implement everywhere. It feels slightly "wrong" to me, but that's just aesthetics.

I'm happy to implement this, but let me know if you're doing it so we don't duplicate work.

Is it confusing that the Embedding input expects 0 as a pad, while everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how the pad is represented in the output.

@fchollet
Member

Is it confusing that the Embedding input expects 0 as a pad, while everywhere else expects -999 (or whatever) as a pad? It seems a bit inconsistent for the API that on SimpleRNN mask_value would indicate which inputs are masked, while on Embedding it would determine how the pad is represented in the output.

The reason for the discrepancy is that the input of an Embedding is a tensor of indices, which are positive integers. The default convention for the non-character index is 0.

The rest of the network uses an arbitrary mask value (float).

@wxs
Contributor

wxs commented Jun 18, 2015

OK, I've put up a preliminary implementation at #244; I'd love some review before I dive into getting more of the recurrent types supported.

@wxs
Contributor

wxs commented Jun 19, 2015

Btw, it looks like Bricks (from Blocks) takes the approach I took initially, of having a separate channel over which the mask is sent:

http://blocks.readthedocs.org/en/latest/api/bricks.html

@wxs
Contributor

wxs commented Jun 29, 2015

The PR implementing masks has now been merged, for those of you watching this issue.

@mbchang

mbchang commented Dec 17, 2016

Yes, but that should still be negligible compared to the matrix multiplications for non-zero vectors.
Regarding the Embedding layer, the fix could be done by adding one line:
self.W = self.init((self.input_dim, self.output_dim))
self.W[0] *= 0. # the embedding of index 0 (non-character) will be an all-zero vector
Then every recurrent layer would check if the current input is all 0, and return a 0 output if that's the case (in the case of LSTM it would also return the previous memory unchanged). Thoughts?

Has this been implemented? I looked in the source code but couldn't find it.

@braingineer
Contributor

braingineer commented Dec 17, 2016

Are you referring to the recurrent pass-through, @mbchang? If so, check the backend code; e.g., in the Theano backend there is a switch over 0 for the next hidden state.

@wxs
Contributor

wxs commented Dec 19, 2016

@mbchang in general, after this discussion, Keras ended up moving to a separate, explicitly passed mask after all, rather than a special masking value.

Embedding takes a mask_zero boolean parameter, which can generate that mask automatically wherever there's a 0 in the input.
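
For example, a minimal sketch of that usage (layer sizes here are just placeholders):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=128, mask_zero=True))  # index 0 is treated as padding
model.add(LSTM(64))                        # the mask is propagated, so padded timesteps are skipped
model.add(Dense(1, activation='sigmoid'))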
