
Merge doesn't support masking #2393

Closed
jiumem opened this issue Apr 19, 2016 · 16 comments

@jiumem

jiumem commented Apr 19, 2016

from keras.layers import Input, Masking, GRU, merge

input = Input(shape=(100, 64))  # illustrative (timesteps, features) shape
x = Masking(mask_value=0.)(input)
x_forward = GRU(128, return_sequences=True)(x)
x_backward = GRU(128, return_sequences=True, go_backwards=True)(x)
rnn = merge([x_forward, x_backward], mode='concat')

It seems the two inputs' masks get corrupted in concat mode. Will merge support masks soon?

@braingineer
Contributor

+1 for merged mask.

Also, interesting use case.

How should a backwards-running mask be pushed forward in this case?

Let's say the mask is 110. m_fwd = 110 still, but m_back = 011, right? So then, when they concat, should the new mask be 110011? Is the backwards RNN outputting the mask backward? I don't think it is. It looks like it's just pushing the mask forward if returning sequences, else None.

In the concat mode, you would just then concat the incoming masks, right?

@codekansas
Contributor

I opened a pull request here #2413

@jiumem
Author

jiumem commented Apr 20, 2016

The forward mask is 110 and the backward mask is 011; when they are concatenated on axis -1, the new mask is [(1, 0), (1, 1), (0, 1)], which is wrong.

I think first we would pad the two input masks (actually pad the two input sequences):
the forward sequence becomes 0110 and the backward sequence becomes 0110.
Then concatenate the two sequences on axis -1, giving: [(0, 0), (1, 1), (1, 1), (0, 0)].
Last, slice the concatenated sequence to [(1, 1), (1, 1), (0, 0)], whose mask is equal to the forward mask 110.

Same process for the 'sum' and 'ave' merge modes, right?
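
If it helps, here is a rough numpy illustration of that pad/concat/slice idea with the 110 example (plain numpy, not Keras; purely illustrative):

import numpy as np

fwd_mask = np.array([1, 1, 0])   # forward mask 110
bwd_mask = np.array([0, 1, 1])   # backward mask 011

# pad the forward side on the left and the backward side on the right -> both become 0110
fwd_padded = np.concatenate([[0], fwd_mask])
bwd_padded = np.concatenate([bwd_mask, [0]])

# concat on the last axis
merged = np.stack([fwd_padded, bwd_padded], axis=-1)
print(merged.tolist())      # [[0, 0], [1, 1], [1, 1], [0, 0]]

# slice back to the original length
print(merged[1:].tolist())  # [[1, 1], [1, 1], [0, 0]] -> mask 110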

@codekansas
Contributor

I believe if you do concat_axis=2 above it concatenates the RNN outputs at each time step. I think if you just concatenate the masks in the same direction as the outputs of the RNN it does what you want. I'm pretty sure you want the mask to be [[1, 1, 0], [0, 1, 1]]. You don't want to pad it like that, because you end up changing the amount of padding depending on the mask.

If you used mode='sum' for the merge layer, should the mask be [1, 1, 1] or [0, 1, 0]?

@braingineer
Contributor

I don't think I'm on the same page as you guys, so I'll try to be super explicit.

I'm also changing the example slightly to make it clearer what the shapes will be.

When I write "Mask @ J", I mean the mask output from step J.

# assumes shape variables: 
#    batch, time, feat_size, emb_size, rnn_size
#### 1. input is (batch, time)
####  Mask @ 1: None
input = Input(batch_shape=(batch, time), dtype='int32')  # a feature index at every (b, t)

#### 2. embedding is (batch, time, emb_size)
#### Mask @ 2: (batch, time)  from K.not_equal(x, 0)
x = Embedding(input_dim=feat_size, output_dim=emb_size, mask_zero=True)(input)

#### 3. return_sequences makes this (batch, time, rnn_size).
#### Mask @ 3: (batch, time) from just pushing mask through
x_forward = GRU(rnn_size, return_sequences=True)(x)

#### 4. return_sequences makes this (batch, time, rnn_size) and should reverse the mask
#### Mask @ 4: (batch, time) but should reverse; probably with mask[:,::-1]
x_backward = GRU(rnn_size, return_sequences=True, go_backwards=True)(x)
#### 5. Concatenation on the -1 dimension should result in (batch, time, rnn_size * 2)
#### Mask @ 5: (batch, time)..???? see below
rnn = merge([x_forward, x_backward], mode='concat')

So, step 5 is odd, and it presents an issue that isn't addressed in your PR: the time dimension was reversed by the backward RNN, not the feature dimension, so it is the time dimension that carries the reversed mask.

But the methodological question is, when masking, how do you align two sequences where one has been reversed?

@braingineer
Contributor

I think I get @jiumem's comment now. Your recommendation was to use padding to realign the two RNNs. I think this is troublesome in general, though, because you might have different-length sequences along the batch dimension.

@codekansas If the mask is over the time dimension, the mask shouldn't change depending on the merge type, right? So if you're merging on the feature dimension in whatever way, the time dimension is untouched. But if you're merging on the time dimension, then you'd need to manage the mask.

@codekansas
Contributor

In the PR I increased the mask dimensions to match the input, and concatenated the masks, here. So it outputs a mask with dimensions (batch, time, rnn_size * 2). I may be misunderstanding masking... We could reduce this mask to (batch, time) with K.any() or K.all(), which would be like doing a logical OR or a logical AND, respectively, between the two time masks. There is a similar problem for summing the inputs. Concatenating over the time dimension would simply involve concatenating the masks, while concatenating over the feature dimension would involve logical operations on the mask.
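
For what it's worth, a quick numpy check of what those two reductions would do on the 110/011 example (np.any / np.all standing in for K.any / K.all; illustrative only):

import numpy as np

fwd_mask = np.array([1, 1, 0], dtype=bool)
bwd_mask = np.array([0, 1, 1], dtype=bool)

stacked = np.stack([fwd_mask, bwd_mask], axis=-1)   # (time, 2), like a feature-axis concat

print(np.any(stacked, axis=-1).astype(int))  # [1 1 1]  logical OR
print(np.all(stacked, axis=-1).astype(int))  # [0 1 0]  logical AND

which is exactly the [1, 1, 1] vs. [0, 1, 0] choice raised above for mode='sum'.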

@codekansas
Contributor

Wait, I think I understand the issue now. If you have another RNN on top of the merged layer:

rnn = merge([x_forward, x_backward], mode='concat') # output (batch, time, rnn_size * 2)
rnn_2 = GRU(rnn_size, return_sequences=True)(rnn) # output (batch, time, rnn_size)

You would want rnn_2 to take the first output of x_forward and the last output of x_backward at the same time. This seems like more of an issue with the go_backwards part of the RNN... It would make sense to keep the mask the same, but if go_backwards is set, only run it on the non-masked part. So you would end up with the non-masked stuff all at the beginning of the sequence.

It seems like this may have been the intent for go_backwards since they didn't reverse the mask. However, it would mean that this line should be changed to only reverse non-masked elements.

@braingineer
Contributor

Concatenating over the time dimension would simply involve concatenating the masks, while concatenating over the feature dimensions would involve logical operations on the mask.

Ya. That seems more accurate than what I said. I don't think it's the full story though. That's good for concatenation, but what about summing, etc?

Slightly more complicated use case: I want to allow mixed data, where one sequence has an entry and the other doesn't. In this case, you mask on input to the merge, not on output. Currently, RNNs just push forward state_tm1 (t minus 1) when they encounter a 0 in the mask. So you wouldn't have a 0 entry where you want to allow the other sequence to give you data.

I may be misunderstanding masking...

Masks are frustrating..

You would want rnn_2 to take the first output of x_forward and the last output of x_backward at the same time.

Yes. The Bi-Directional RNN paper has the sequences aligned at all times. You could, in theory, even specify an offset, but that sounds insane in the current implementation.

This seems like more of an issue with the go_backwards part of the RNN... It would make sense to keep the mask the same, but if go_backwards is set, only run it on the non-masked part. So you would end up with the non-masked stuff all at the beginning of the sequence.

Ya, that is the thing that is coming to light from this conversation, I think.

However, it would mean that this line should be changed to only reverse non-masked elements.

It's an issue for more than just the unrolled RNNs. For example, the Theano implementation:

In [38]: import numpy as np, theano

In [39]: x = theano.shared(np.arange(10))

In [40]: print(theano.scan(lambda x: x**2, sequences=x, go_backwards=True)[0].eval())
[81 64 49 36 25 16  9  4  1  0]

In [41]: np.arange(10)
Out[41]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

For alignment:
I'm not good with theano.scan, so I won't attempt to write one now, but you could use the scan operation. It would involve something along the lines of stepping through the mask while keeping an indexing vector: any time the mask hits 1, you use the indexing vector to grab another value. You could use this to step through all time steps and the whole batch at the same time.
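
To make that concrete, here is a rough numpy version of the idea (realign_indices is a hypothetical helper, and the Python loop stands in for what a theano.scan would do; illustrative only, not Keras code):

import numpy as np

def realign_indices(mask):
    # For each row, build gather indices that reverse only the unmasked timesteps.
    batch, time = mask.shape
    idx = np.tile(np.arange(time), (batch, 1))
    for b in range(batch):                  # a scan would replace this loop
        valid = np.where(mask[b])[0]
        idx[b, valid] = valid[::-1]
    return idx

mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]], dtype=bool)
seq = np.array([[10, 11, 12, 0],
                [20, 21, 0, 0]])

print(seq[np.arange(len(seq))[:, None], realign_indices(mask)])
# [[12 11 10  0]
#  [21 20  0  0]]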

@codekansas
Contributor

Masks are frustrating

Agreed.

I want to allow mixed data

It makes sense to just mask anything that's masked in any of the masks, e.g. do K.all(masks) (element-wise all, like a bitwise AND).

Suppose you have some mask like [1, 1, 0, 1, 0]. For the general case we should take the unmasked indices, reverse them, and put them back in the correct locations according to the mask. For example, applying that mask to indices [1, 2, 3, 4, 5] should give you [4, 2, 3, 1, 5], i.e. it reverses [1, 2, 4] and puts the values back in the correct locations. However, it may make more sense in this situation to give indices [4, 2, 1, 3, 5] with [3, 5] masked. This is consistent with the example from earlier, where you want to reverse the first, unmasked part of the input.
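
A tiny numpy check of those two variants (illustrative only):

import numpy as np

mask = np.array([1, 1, 0, 1, 0], dtype=bool)
vals = np.array([1, 2, 3, 4, 5])

# variant 1: reverse the unmasked values in place
in_place = vals.copy()
in_place[mask] = vals[mask][::-1]
print(in_place)                                         # [4 2 3 1 5]

# variant 2: push the reversed unmasked values to the front
print(np.concatenate([vals[mask][::-1], vals[~mask]]))  # [4 2 1 3 5]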

I was thinking instead of the line here it could be something like

indices = indices[K.any(mask)][::-1]  # keep only the unmasked indices, reversed
indices = indices + [-1] * (input_length - len(indices) + 1)  # pad with -1 (masked) up to the input length

and then further down, instead of doing T.switch with the mask, just check if i == -1 (-1 indicating that the value is masked).

This would also mean changing the output mask in compute_mask so that it pushes all the mask values to the front. It would also mean changing the behavior of regular masking so that it pushes all masked values to the back.

I'm not sure if this would work. I will try it later and see.

@codekansas
Contributor

stepping through the mask while keeping an indexing vector: any time the mask hits 1, you use the indexing vector to grab another value. You could use this to step through all time steps and the whole batch at the same time

Now that I reread what you said, this seems like the best way to do it.

@jiumem
Author

jiumem commented Apr 21, 2016

@codekansas Elegant code for K.any(mask)[::-1], thanks a lot!

@RaffEdwardBAH

I have a similar use case/need for merging masked inputs. I'm combining the outputs of multiple LSTM layers that may have to deal with variable lengths, but will all have a final result in the end. Currently these can't be merged into a single input for the classification step, though.

@codekansas
Contributor

codekansas commented May 5, 2016

This is only an issue if you need to combine them at each time step. I think if you end up dropping the time dimension at some point there is a work-around, albeit not a very clean one. In my case I was doing max-pooling over the feature dimension after the LSTM, so the work-around was to merge the max-pooling layers instead of the LSTM layers. So in general the work-around is to apply whatever you'd normally apply to both the forward and backward parts, and then merge them after the time dimension has been dropped.
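
For reference, a structural sketch of that work-around in the Keras 1.x functional API (the shapes and the GlobalMaxPooling1D step are illustrative choices; if your branches still carry a mask at that point you may need to strip it first, since the pooling layers don't support masking):

from keras.layers import Input, GRU, GlobalMaxPooling1D, merge
from keras.models import Model

inp = Input(shape=(None, 32))

fwd = GRU(128, return_sequences=True)(inp)
bwd = GRU(128, return_sequences=True, go_backwards=True)(inp)

# collapse the time dimension in each branch first...
fwd_vec = GlobalMaxPooling1D()(fwd)   # (batch, 128)
bwd_vec = GlobalMaxPooling1D()(bwd)   # (batch, 128)

# ...then merge; with no time dimension left there is no mask to reconcile
features = merge([fwd_vec, bwd_vec], mode='concat')   # (batch, 256)
model = Model(input=inp, output=features)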

@RaffEdwardBAH

In my current use case I'm dropping the time dimension, but the merge still errors about the masking.

@codekansas
Contributor

You can fork my PR here

stale bot added the stale label May 23, 2017
stale bot closed this as completed Jun 22, 2017