How does Masking work? #3086

poyuwu · 2016-06-27T19:34:16Z

I'm wondering how the Masking layer works.
I wrote a simple model to test Masking on an Activation layer:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input
a = np.array([[3.,1.,2.,2.,0.,0.]])

inputs = Input(shape=(6,))
mask = Masking(mask_value=0.0)(inputs)
softmax = Activation('softmax')(mask)
model = Model(input=inputs,output=softmax)
model.predict(a)

and the result of prediction is

array([[ 0.50744212,  0.06867483,  0.18667753,  0.18667753,  0.02526405,
         0.02526405]])

Is this the correct behavior?
My keras version is 1.0.5
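
For comparison, here is a minimal pure-Python sketch (no Keras involved) of what a plain softmax and a mask-aware softmax would each return for the input above. `masked_softmax` is a hypothetical helper written for illustration, not a Keras function. The plain version reproduces the numbers reported above, which suggests the Activation layer does not use the mask here.

```python
import math

def softmax(values):
    """Plain softmax over every entry."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax(values, mask_value=0.0):
    """Softmax over entries that differ from mask_value; masked slots get 0.

    Hypothetical helper; assumes at least one entry is not masked.
    """
    keep = [v != mask_value for v in values]
    exps = [math.exp(v) if k else 0.0 for v, k in zip(values, keep)]
    total = sum(exps)
    return [e / total for e in exps]

a = [3.0, 1.0, 2.0, 2.0, 0.0, 0.0]
plain = softmax(a)
masked = masked_softmax(a)
print([round(p, 4) for p in plain])   # [0.5074, 0.0687, 0.1867, 0.1867, 0.0253, 0.0253]
print([round(p, 4) for p in masked])  # [0.5344, 0.0723, 0.1966, 0.1966, 0.0, 0.0]
```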


ipoletaev · 2016-06-28T08:05:11Z

I'm also interested in this question. It seems to me you expected to get something like the following:
[[ 0.53444666, 0.07232948, 0.19661194, 0.19661194]] ?
On the other hand, according to the explanation in core.py (class Masking(Layer)), masking doesn't work with 1D input data. So, if you try this, for example:

from keras.models import Model
import numpy as np
from keras.layers import Masking,Input,TimeDistributed,Dense
a = np.array([[[3,1,2,2,0,0],[0,0,0,0,0,0],[2,1,1,2,0,0]]])

input = Input(shape=(3,6))
mask = Masking(mask_value=0)(input)
out = TimeDistributed(Dense(1,activation='linear'))(mask)
model = Model(input=input,output=out)
q = model.predict(a)
print (q[0])

...you will get [[-0.20101213],[ 0. ],[-0.51546627]] as expected.
But I think that, most likely, there's something wrong in my understanding.

poyuwu · 2016-06-28T08:45:16Z

@ipoletaev
yes, sure. However keras can also return [[ 0.53444666, 0.07232948, 0.19661194, 0.19661194, 0.0, 0.0 ]] by padding it to keep shape.
Here is another example, using Masking with a bi-LSTM whose two directions are summed:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input, LSTM, merge
a = np.array([[[.3,.1,.2,.2,.1,.1],[.2,.3,.3,.3,.3,.1],[0,0,0,0,0,0]]])

inputs = Input(shape=(3,6))
mask = Masking(mask_value=0.0)(inputs)
fw = LSTM(1,return_sequences=True)(mask)
bw = LSTM(1,return_sequences=True,go_backwards=True)(mask)
merged = merge([fw,bw],mode='sum')
model = Model(input=inputs,output=fw)
model2 = Model(input=inputs,output=bw)
model3 = Model(input=inputs,output=merged)

the fw's output is
array([[[-0.07041532], [-0.12203699], [-0.12203699]]])
the bw's output is
array([[[ 0. ], [-0.03112165], [ 0.02271803]]])
the merge's output is
array([[[-0.07041532], [-0.15315863], [-0.09931896]]])
but I think it should be (here it could also pad with 0 to keep the shape)
array([[[-0.10153697], [-0.09931896]]])
where -0.10153697 = (-0.07041532) + (-0.03112165) and -0.09931896 = -0.12203699 + 0.02271803.
Is anything wrong in Keras?
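
One possible reading of these numbers, sketched in plain Python: assuming that a layer with go_backwards=True returns its outputs in reversed time order (the leading 0. in the bw output is consistent with the padded step being processed first), the backward sequence would need to be flipped before it is added to the forward one. The mask list and the zeroing of the padded step are illustrative choices, not something Keras does in the merge.

```python
# Outputs copied from the post above (one unit, three timesteps).
fw = [-0.07041532, -0.12203699, -0.12203699]
bw = [0.0, -0.03112165, 0.02271803]  # assumed to be in reversed time order
mask = [True, True, False]           # third timestep is all zeros

# Flip the backward outputs so index t refers to the same timestep in both lists.
bw_aligned = list(reversed(bw))

# Sum per timestep and write 0 at the padded position to keep the shape.
merged = [(f + b) if keep else 0.0 for f, b, keep in zip(fw, bw_aligned, mask)]
print(merged)
```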

ipoletaev · 2016-06-28T09:04:22Z

@poyuwu

However keras can also return [[ 0.53444666, 0.07232948, 0.19661194, 0.19661194, 0.0, 0.0 ]] by padding it to keep shape.

Hmm... I don't know how to produce such output using only Keras.

About your example: I think it's similar to the example above, so you should get array([[[-0.07041532], [-0.15315863], [0.02271803]]]). And it's really strange that bw works correctly but fw doesn't: its third output is not zero, but it should be...

lomizandtyd · 2016-06-28T09:06:43Z

Hi guys, I have this question too, especially for LSTM (BRNN).

The Masking layer gives a masked vector; it only works on the inputs, not on the inner states.
So in @poyuwu 's example, the fw's output still has value in step 3.

This might be correct, because the inputs are certainly masked.
While, I want to find a way to skip the calculation step when coming masked value, like some special tags.

However, I think using the Masking layer in a bidirectional RNN for sequences of different lengths may be totally wrong.

ipoletaev · 2016-06-28T09:14:32Z

@lomizandtyd

So in @poyuwu 's example, the fw's output still has value in step 3.

Yes, that's logical, but in any case we want to get zero in the third position, don't we?

While, I want to find a way to skip the calculation step when coming masked value.

I think it doesn't matter because, as I understand it, you should specify output sample_weights in fit() in order to skip the timesteps whose feature vectors are all zeros (I have already asked about this in #3023). But if that is so, it is not clear why we need masking at all, if we can specify in fit() which steps in the examples are used for training and which are not. I mean, it is not important whether the network processes these "empty" vectors; what is important is to train the network without backpropagating errors calculated on such vectors.

Maybe there is some way to use batch_size=1 and not bother with padding?
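
As a rough sketch of the sample-weight idea, the per-timestep weights can be derived from the padded input itself. `temporal_weights` is a hypothetical helper; the assumption is that such an array would then be passed as sample_weight to fit() on a model compiled with sample_weight_mode='temporal'.

```python
def temporal_weights(batch, pad_value=0.0):
    """One weight per (sample, timestep): 0.0 for all-padding steps, else 1.0."""
    return [
        [0.0 if all(f == pad_value for f in step) else 1.0 for step in sample]
        for sample in batch
    ]

# Same toy batch as in the earlier example: the middle timestep is padding.
batch = [
    [[3, 1, 2, 2, 0, 0], [0, 0, 0, 0, 0, 0], [2, 1, 1, 2, 0, 0]],
]
weights = temporal_weights(batch)
print(weights)  # [[1.0, 0.0, 1.0]]
```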

lomizandtyd · 2016-06-28T09:34:08Z

@ipoletaev Wow, thanks a lot for this!

Yes, we want to get zero at the masked position.
The problem is we also want to keep the inner states across the masked step.

Maybe we can deliver another sample_weights in the predict() function?.
If do so, BRNN is still wrong...

ipoletaev · 2016-06-28T09:45:32Z

@lomizandtyd

...keep the inner states across the masked step.

I think it's not necessary, because the network shouldn't have to remember what responses it needs to produce for empty vectors...

Maybe we can deliver another sample_weights in the predict() function?.

I don't understand what task you want to use it for. After all, you always know in advance what data you are processing, so you know which outputs of the network correspond to the empty vectors and can just skip those positions in the output, I guess.

If do so, BRNN is still wrong...

As far as I understand, Keras has been "fighting" with the RNN masking problem for about a year :)

poyuwu · 2016-06-28T09:53:18Z

from keras.models import Model
import numpy as np
from keras.layers import Masking,Input,TimeDistributed,Dense
a = np.array([[[3,1,2,2,0,0],[0,0,0,0,0,0],[2,1,1,2,0,0]]])
input = Input(shape=(3,6))
mask = Masking(mask_value=0)(input)
out = TimeDistributed(Dense(1,activation='linear'))(mask)
model = Model(input=input,output=out)
q = model.predict(a)
print (q[0])

@ipoletaev I think it's just that the Dense layer has zero inputs, so its output is 0. If you change the activation function to softmax, you will get the wrong answer.
Besides, setting the time steps to None will raise other errors in some cases (especially on a merge layer).

In Lasagne, it seems a mask matrix is used to deal with padding. (I have not tested its accuracy.)

ipoletaev · 2016-06-28T09:57:30Z

@poyuwu : yes, I checked it, and you are right. It means, as I understand it, that even a simple Dense layer doesn't handle masked values the way we want...

Let me list again what does not match my expectations:

Forward RNN doesn't keep mask values, backward does it. It's strange.
Is this task solving with batch_size = 1?
How to specify correctly what timesteps the network should to skip.
And it's not clear at which moment BiLSTM does reset_state: only at the end of the timesteps in the current sample, or whenever the network encounters an empty vector?

poyuwu · 2016-06-29T19:06:37Z

@ipoletaev I don't think

Forward RNN doesn't keep mask values, backward does it. It's strange.

this statement is true.
That's because the padding is 'post', not 'pre'. Hence, the reason is the same as for the Dense layer, as I said.

How to specify correctly what timesteps the network should to skip.

As I said, in Lasagne we provide a mask numpy.array (with the same shape as the input) to deal with it. If go_backwards=True, the padding argument needs to stay the same.

Besides, the Embedding layer's mask_zero option seems to behave the same way.

ipoletaev · 2016-06-30T05:05:26Z

@poyuwu so you want to say that now, there's no way to solve this issue with Keras?
I mean is it necessary to use masking if we use sample weights?

xuewei4d · 2016-11-22T16:40:27Z

Same here. It seems the masking mechanism in Keras is not fully supported.

fferroni · 2016-12-16T18:52:18Z

I don't think Masking masks input values (either during forward or back-propagation). It just skips a time-step where all features are equal to the mask value (i.e. when you pad a sequence). You can confirm this by:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input, TimeDistributed, Dense

if __name__ == "__main__":
    a = np.array([[[3,1,2,2,0.1,0.1],[0,0,0,0,0,0],[2,1,1,2,0.1,0.1]]])
    print('Input array:')
    print(a)
    print('')
    input = Input(shape=(3,6))
    mask = Masking(mask_value=0.1)(input)
    out = TimeDistributed(Dense(1, activation='linear'))(mask)
    model = Model(input=input, output=out)

    # Unit weights and zero bias, so each output is the sum of that timestep's features.
    model.set_weights([np.ones((6, 1), dtype=np.float32),
                       np.array([0.], dtype=np.float32)])

    print('Weights')
    print(model.get_weights())
    q = model.predict(a)
    print(q)

The answer is:

Input array:
[[[ 3.   1.   2.   2.   0.1  0.1]
  [ 0.   0.   0.   0.   0.   0. ]
  [ 2.   1.   1.   2.   0.1  0.1]]]

Weights
[array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]], dtype=float32), array([ 0.], dtype=float32)]
[[[ 8.20000076]
  [ 0.        ]
  [ 6.19999981]]]

If it masked the inputs of value 0.1, you would expect result to be

[[[ 8.       ]
  [ 0.        ]
  [ 6.        ]]]
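
The result above is consistent with a mask that is computed per timestep rather than per value. A minimal plain-Python sketch of that rule, assuming a timestep is masked only when every feature equals mask_value (both helpers are hypothetical, and the second array `b` is a made-up variant for contrast):

```python
def timestep_mask(sample, mask_value):
    """True for timesteps to keep: at least one feature differs from mask_value."""
    return [any(f != mask_value for f in step) for step in sample]

def masked_dense_sum(sample, mask_value):
    """Unit-weight, zero-bias Dense(1) per timestep, with masked steps zeroed."""
    keep = timestep_mask(sample, mask_value)
    return [sum(step) if k else 0.0 for step, k in zip(sample, keep)]

a = [[3, 1, 2, 2, 0.1, 0.1], [0, 0, 0, 0, 0, 0], [2, 1, 1, 2, 0.1, 0.1]]
print(timestep_mask(a, 0.1))     # [True, True, True]: no timestep is all 0.1
print(masked_dense_sum(a, 0.1))  # roughly 8.2, 0 and 6.2, as in the output above

# Variant where the last timestep consists only of the mask value.
b = [[3, 1, 2, 2, 0.1, 0.1], [0, 0, 0, 0, 0, 0], [0.1] * 6]
print(timestep_mask(b, 0.1))     # [True, True, False]
```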

GPaolo · 2017-01-17T14:17:42Z

Actually Masking works exactly as expected.
The problem is that you are working with the wrong dimension order: in input = (3,6) the 3 is the time dimension and the Masking layer masks only along that dimension, making the net ignore a time sample if that sample is composed of all elements equal to the masked value.

import numpy as np
import keras
from keras.utils.visualize_util import plot
from keras.layers import *
from keras.models import Model

net_input = Input(shape = ( 3, 10))
mask = Masking(mask_value = 0.5)(net_input)
conv = TimeDistributed(Dense(1, activation = 'linear', init='one'))(mask)
out = LSTM(1, init='one', inner_init='one',activation='tanh', inner_activation='tanh',)(conv)
model = Model(net_input, out)

print('W: ' + str(model.get_weights()))

net_in = np.ones((1,3, 10))
val = 0.5
net_in[0, 2, :] = val
out = model.predict(net_in)
print('Input: ' + str(net_in))
print('Output: ' + str(out))

In this case the answers are:

mask = 0.5, val = 0.0 : 0.73566443
mask = 0.0, val = 0.0 : 0.96402758
mask = 0.0, val = 0.5 : 0.99504161
mask = 0.5, val = 0.5 : 0.96402758

so from here you can see that when we mask val we get the same result, while when we mask something else, even if val = 0, we get a different result.

GPaolo · 2017-01-17T14:46:02Z

Moreover, I just tested this: if you have a multi-input net (with multiple input branches) and a Masking layer on each branch, it is enough for just one of the inputs at time step t to equal the mask value for the whole time step to be skipped.

I guess that if one wants to skip the time step only when all the inputs are equal to the mask value, the branches need to be merged first, right?
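
The two behaviours described here can be written as two ways of combining per-branch masks. This is only a sketch of the logic in plain Python, assuming each branch yields one boolean per timestep (True means "keep"); the function names are hypothetical.

```python
def combine_all(masks):
    """Keep a timestep only if every branch keeps it (logical AND)."""
    return [all(step) for step in zip(*masks)]

def combine_any(masks):
    """Keep a timestep if at least one branch keeps it (logical OR)."""
    return [any(step) for step in zip(*masks)]

branch_a = [True, True, False, False]
branch_b = [True, False, True, False]

# Observed behaviour: one masked branch is enough to skip the timestep.
print(combine_all([branch_a, branch_b]))  # [True, False, False, False]
# Desired behaviour: skip only when all branches are masked.
print(combine_any([branch_a, branch_b]))  # [True, True, True, False]
```

Masking after concatenating the branches corresponds to the second rule, because a concatenated timestep equals the mask value everywhere only when each branch does.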

irrationalagent · 2017-01-19T17:59:55Z

Hi Fragore, I have a similar question to you about masking with multiple inputs. I have two input branches and all I want to do is mask 0 from both. Am I right in thinking that adding a mask to the end of each branch is equivalent to adding a single mask AFTER the inputs are merged? here's my example

input1 = Sequential()
input1.add(TimeDistributed(Dense(50), input_shape=(MAX_SEQUENCE_LENGTH,48)))
input2 = Sequential()
input2.add(Embedding(nb_words+2,EMBEDDING_DIM,weights=[embedding_matrix],trainable=False,input_length=MAX_SEQUENCE_LENGTH))

model = Sequential()
model.add(keras.engine.topology.Merge([input1,input2],mode='concat',concat_axis=-1))
model.add(keras.layers.core.Masking(mask_value=0.0))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(512,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(TimeDistributed(Dense(nb_words + 1)))
model.add(Activation('softmax'))

or version with a mask after each branch prior to merging

input1 = Sequential()
input1.add(TimeDistributed(Dense(50), input_shape=(MAX_SEQUENCE_LENGTH,48)))
input1.add(keras.layers.core.Masking(mask_value=0.0))
input2 = Sequential()
input2.add(Embedding(nb_words+2,EMBEDDING_DIM,weights=[embedding_matrix],trainable=False,input_length=MAX_SEQUENCE_LENGTH,mask_zero=True))

model = Sequential()
model.add(keras.engine.topology.Merge([input1,input2],mode='concat',concat_axis=-1))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(512,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(TimeDistributed(Dense(nb_words + 1)))
model.add(Activation('softmax'))

GPaolo · 2017-01-24T06:52:32Z

Wait, you want to mask the output of the branches that are 0? In that case both of your approaches should give you the same result. But usually you mask inputs, this means to put the mask layer as input of the net.
Ps it may also be more convenient to use the functional API :)
PPS the last dense layer doesn't need TimeDistributed anymore cause the LSTM removes the time dimension.

slaterb1 · 2017-03-17T01:27:23Z

I've been experimenting with and without masking for a little bit now and I have finally figured out what the Masking layer actually does. It doesn't actually "skip" the timepoint that has all masked values, it just forces all the values for that timepoint to be equal to 0... So effectively Masking(mask_value=0.) does nothing. That is why in the example provided by @GPaolo above the results for mask_value=0 and mask_value=0.5 are the same when val matches them.

Here is some easy code to demonstrate what I mean.

Model:

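from keras import backend as K
from keras.layers import Input, Masking, Dense
from keras.models import Model
import numpy as np

input1 = Input(batch_shape=(1,1,10))
mask1 = Masking(mask_value=2)(input1)
dense_layer1 = Dense(1, activation='sigmoid')
# let the Dense layer accept an incoming mask
dense_layer1.supports_masking = True
output1 = dense_layer1(mask1)

model = Model(input1, output1)
model.compile(optimizer='adam', loss='binary_crossentropy')
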
Data:

data = np.ones((10, 1, 10), dtype='float32')
# set half of the data equal to the mask value
for index in range(5,10):
    data[index,0,:] = 2

# set one value of the first data point to the mask value to show that this line is unaffected
data[0,0,0] = 2

print outputs:

get_mask_output = K.function([model.layers[0].input], [model.layers[1].output])
mask_output = get_mask_output([data])[0]

print(data)
print(mask_output)

data:

[[[ 2. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]]

mask_output:

[[[ 2. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]

Predictions:

test_data = np.ones((5,1,10))
test_data[1,0,:] = 2
test_data[2,0,:] = 0
predictions = model.predict(test_data, batch_size=1)

print(test_data)
print(predictions)


Results:

test_data:
[[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]

[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]

[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]]

predictions:
[[[ 0.09200736]]

[[ 0.5 ]]

[[ 0.09200736]]

[[ 0.09200736]]]

As you can imagine, "masking" values by setting them to 0 and still calculating the results for those lines in layers causes some mistakes from backpropagation (treating unknown values as a real result) as well as added unneeded computation time. I'm going to try to rework how masking is done in Keras a bit...

Edit: I did a little bit of digging into the training.py code and found that the "masking" information (even with mask_value = 0.) does get incorporated into the training of the weights. The masked lines effectively get ignored after the calculation is done (which is good!). The problem I am encountering in my actual network is that although "masked lines" are ignored during weight training, they are still evaluated by the network going forward, which affects the outputs of later layers based on false information. To build a network that handles variably sized inputs (not all have the maximum number of timepoints), I want to ignore the masked lines entirely... I'm going to try to work that out.

ragulpr · 2017-03-20T22:47:36Z

Building on @slaterb1 and @GPaolo 's snippets I tried digging around to see the benefits of masking but haven't found it yet. It feels like I'm missing something.

It does not seem to propagate numerically sound values through time
It propagates np.nan, see gist
Feels (TODO:test) quite numerically unstable to propagate possibly absurd values down the network? Like mask output 0 may not always be in place.
It has to test each input
Quick testing (see gist) seem to show that there's no immediate performance gains

Does anyone have an idea about if/when it gives performance gains? I didn't have time to run for long/deep/wide and I'm not comfortable about how Python/Keras/Tensorflow/Theano compiles

Is mask an intricate way of doing what I think weights should be doing? I.e. multiplying with the loss and dividing by the sum of weights in the batch?
It's literally what seems to be done here anyway:
https://github.com/fchollet/keras/blob/master/keras/engine/training.py#L453

Does it actually halt any execution (yet)?

carlthome · 2017-03-21T00:38:40Z

@ragulpr, it's my understanding that masking does more than just loss scaling. If a timestep has been masked, the previous output and state will be reused. See here and here.
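
A toy illustration of that carry-over rule in plain Python (not the backend code itself): on a masked step the loop keeps the previous output and state instead of calling the step function. `masked_rnn` and `running_sum` are hypothetical names, and the assumption that the "previous output" starts at zero is part of the sketch.

```python
def masked_rnn(inputs, mask, step_fn, initial_state=0.0):
    """Run step_fn over time; on masked steps reuse the previous output and state."""
    state = initial_state
    output = 0.0  # assumed initial "previous output"
    outputs = []
    for x, keep in zip(inputs, mask):
        if keep:
            output, state = step_fn(x, state)
        outputs.append(output)
    return outputs

def running_sum(x, state):
    """Stand-in for a recurrent cell: output and new state are the running total."""
    new_state = state + x
    return new_state, new_state

inputs = [5.0, 5.0, 5.0, 1.0, 2.0, 5.0, 5.0, 5.0]
mask = [False, False, False, True, True, False, False, False]
outputs = masked_rnn(inputs, mask, running_sum)
print(outputs)  # [0.0, 0.0, 0.0, 1.0, 3.0, 3.0, 3.0, 3.0]
```

Leading masked steps stay at the initial output, while trailing masked steps repeat the last unmasked output.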

slaterb1 · 2017-03-22T01:47:14Z

@ragulpr, I'm not sure about performance gains but Theano is pretty smart about knowing what it needs to hang on to and what it doesn't (based on the API doc: http://deeplearning.net/software/theano/library/scan.html)

More specifically this line: "Note that there is an optimization, that at compile time will detect that you are using just the last value of the result and ensure that scan does not store all the intermediate values that are used. So do not worry if A and k are large."

So after compiling the model it might pass over the masked values (or at least not hold them in memory as long), but that is pure speculation based on similarities in the underlying code.

@carlthome, I came across the mask snippet in the "theano_backend.py" as well and you are right that the masking has a direct effect on how the states are evaluated and passed on (T.switch). Maybe this is too general a question but how does this layer accept the mask? Just to give an example, if I have a model with multiple layers, defined as so:

model = Model(input1, output1)

I understand that Theano wraps this up as a mathematical equation to calculate:

output1 = input1 -> [ layers[0] -> layers[1] -> ... layers[N] ]

but if I have somewhere in the middle:

prev_layer -> Masking_layer -> RNN_layer

The output from the Masking_layer gets put into the RNN_layer as input ("x"). Does the "supports_masking" attribute tell the RNN_layer to figure out the mask? I could not find anywhere in the code where the mask is evaluated or interpreted by the RNN_layer, except that I can pass in a "mask" variable via the call() method of the Recurrent(Layer) object.

I tried calling RNN_layer(prev_layer, mask=Masking_layer) but it didn't do anything different. The last comment in the thread, #176 suggests that it has to be called with a mask but I'm not sure how to do that... Any thoughts?

carlthome · 2017-03-22T07:02:05Z

I could not find anywhere in the code where the mask is evaluated or interpreted by the RNN_layer

Each Keras layer declares if it supports masking. Each layer is also responsible for using the mask in a sensible way (which I believe is the primary source of confusion: that the masking functionality is implemented across a bunch of different classes). For RNN layers in particular, they rely on the fact that the underlying K.rnn operation has mask support so if you're looking for where precisely the logic is, you'll note that the RNN layers simply pass the mask argument into the backend, where the magic happens.
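
To make the division of labour concrete, here is a toy model of the idea in plain Python. The classes and the `run` helper are hypothetical stand-ins, not Keras code: each layer gets the incoming mask, decides how to use it, and reports the mask it hands to the next layer.

```python
class ToyMasking:
    """Produces a per-timestep mask and zeroes masked timesteps."""
    def __init__(self, mask_value):
        self.mask_value = mask_value

    def compute_mask(self, x, mask=None):
        return [any(f != self.mask_value for f in step) for step in x]

    def call(self, x, mask=None):
        keep = self.compute_mask(x)
        return [step if k else [0.0] * len(step) for step, k in zip(x, keep)]


class ToySumRNN:
    """Consumes the mask: masked steps repeat the previous output."""
    def compute_mask(self, x, mask=None):
        return mask  # pass the mask on unchanged

    def call(self, x, mask=None):
        total, outputs = 0.0, []
        for t, step in enumerate(x):
            if mask is None or mask[t]:
                total += sum(step)
            outputs.append(total)
        return outputs


def run(layers, x):
    """Thread both the data and the mask through a stack of layers."""
    mask = None
    for layer in layers:
        new_mask = layer.compute_mask(x, mask)
        x = layer.call(x, mask)
        mask = new_mask
    return x, mask

x = [[1.0, 2.0], [9.0, 9.0], [3.0, 4.0]]
out, out_mask = run([ToyMasking(mask_value=9.0), ToySumRNN()], x)
print(out)       # [3.0, 3.0, 10.0]
print(out_mask)  # [True, False, True]
```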

slaterb1 · 2017-03-22T19:57:12Z

@carlthome, I saw that in the code but was not able to get the mask to work in my RNN network. For clarity I was trying to rework stuff in RecurrentShop to setup an encoder decoder network that adjusts the next input based on a prediction made on the previous state from both the encoder and the decoder (a custom RNN that uses a .single_step_rnn() instead of the regular .rnn() ).

But based on your advice, I tried to just build a basic LSTM network to act as a NOT Gate (pointless but simple) and it does interpret the mask correctly, when it is passed a mask mid network! I'm including the gist. It shows that masking works for both return_sequences=True and return_sequences=False. It also shows that if you train the network with data that does not have 'masked' input, 'masked' lines in the test data will still get masked appropriately. Hope that helps people understand the masking stuff better!

This is the gist

Seanny123 · 2017-05-30T11:07:25Z

@fferroni @GPaolo apparently the TimeDistributed layer didn't support masking at that time, since this feature was only added in pull request #6401?

mehrdadscomputer · 2017-06-05T12:17:48Z

Hey guys, there is a seq2seq example whose input is a string (sequence) like '5+9' and whose output is another string, '14'.
The author used pre-padding to give the input sequences the same length, but he didn't use masking.
I added a single line to add masking to his model, and accuracy improved by about 8 percent.
Is my case a correct use of masking?

This is the main code:

from random import seed
from random import randint
from numpy import array
from math import ceil
from math import log10
from math import sqrt
from numpy import argmax
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import TimeDistributed
from keras.layers import RepeatVector

def random_sum_pairs(n_examples, n_numbers, largest):
    X, y = list(), list()
    for i in range(n_examples):
	    in_pattern = [randint(1,largest) for _ in range(n_numbers)]
	    out_pattern = sum(in_pattern)
	    X.append(in_pattern)
	    y.append(out_pattern)
    return X, y

def to_string(X, y, n_numbers, largest):
    max_length = n_numbers * ceil(log10(largest+1)) + n_numbers - 1
    Xstr = list()
    for pattern in X:
	    strp = '+'.join([str(n) for n in pattern])
	    strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
	    Xstr.append(strp)
    max_length = ceil(log10(n_numbers * (largest+1)))
    ystr = list()
    for pattern in y:
	    strp = str(pattern)
	    strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
	    ystr.append(strp)
    return Xstr, ystr

def integer_encode(X, y, alphabet):
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    Xenc = list()
    for pattern in X:
	    integer_encoded = [char_to_int[char] for char in pattern]
	    Xenc.append(integer_encoded)
    yenc = list()
    for pattern in y:
	    integer_encoded = [char_to_int[char] for char in pattern]
	    yenc.append(integer_encoded)
    return Xenc, yenc

def one_hot_encode(X, y, max_int):
    Xenc = list()
    for seq in X:
	    pattern = list()
	    for index in seq:
		    vector = [0 for _ in range(max_int)]
		    vector[index] = 1
		    pattern.append(vector)
	    Xenc.append(pattern)
    yenc = list()
    for seq in y:
	    pattern = list()
	    for index in seq:
		    vector = [0 for _ in range(max_int)]
		    vector[index] = 1
		    pattern.append(vector)
	    yenc.append(pattern)
    return Xenc, yenc

def generate_data(n_samples, n_numbers, largest, alphabet):
    X, y = random_sum_pairs(n_samples, n_numbers, largest)
    X, y = to_string(X, y, n_numbers, largest)
    X, y = integer_encode(X, y, alphabet)
    X, y = one_hot_encode(X, y, len(alphabet))
    X, y = array(X), array(y)
    return X, y

def invert(seq, alphabet):
    int_to_char = dict((i, c) for i, c in enumerate(alphabet))
    strings = list()
    for pattern in seq:
	    string = int_to_char[argmax(pattern)]
	    strings.append(string)
    return ''.join(strings)

seed(1)
n_samples = 1000
n_numbers = 2
largest = 10
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']
n_chars = len(alphabet)
n_in_seq_length = n_numbers * ceil(log10(largest+1)) + n_numbers - 1
n_out_seq_length = ceil(log10(n_numbers * (largest+1)))
n_batch = 10
n_epoch = 10
model = Sequential()
model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))
model.add(RepeatVector(n_out_seq_length))
model.add(LSTM(50, return_sequences=True))
model.add(TimeDistributed(Dense(n_chars, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

for i in range(n_epoch):
    X, y = generate_data(n_samples, n_numbers, largest, alphabet)
    print(i)
    model.fit(X, y, epochs=1, batch_size=n_batch)

X, y = generate_data(n_samples, n_numbers, largest, alphabet)
result = model.predict(X, batch_size=n_batch, verbose=0)
expected = [invert(x, alphabet) for x in y]
predicted = [invert(x, alphabet) for x in result]
for i in range(20):
    print('Expected=%s, Predicted=%s' % (expected[i], predicted[i]))

and I just change this part:

model = Sequential()
model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))

to this part:

from keras.layers import Masking
model = Sequential()
model.add(Masking(mask_value = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], input_shape=(n_in_seq_length, n_chars)))
model.add(LSTM(100))

sources:
http://machinelearningmastery.com/learn-add-numbers-seq2seq-recurrent-neural-networks/#comment-400854
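
For reference, a plain-Python sketch of which timesteps such a vector-valued mask_value would mark as padding, assuming the comparison is done feature by feature against the one-hot vector for the space character (how a list-valued mask_value is handled may differ between Keras versions). `one_hot` and `timestep_mask` are hypothetical helpers.

```python
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']

def one_hot(char):
    """One-hot vector for a character of the alphabet."""
    return [1 if c == char else 0 for c in alphabet]

def timestep_mask(sequence, mask_value):
    """Keep a timestep unless every feature equals the matching mask_value entry."""
    return [any(f != m for f, m in zip(step, mask_value)) for step in sequence]

padded = '  5+9'  # '5+9' pre-padded with spaces to length 5
encoded = [one_hot(ch) for ch in padded]
mask = timestep_mask(encoded, one_hot(' '))
print(mask)  # [False, False, True, True, True]
```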

stale · 2017-09-03T12:46:20Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

MeloMing · 2019-08-29T07:14:42Z

I don't think Masking masks input values (neither during forward or back-propagation). It just skips a time-step where all features are equal to the mask value (i.e. when you pad a sequence). You can confirm this by:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input, TimeDistributed, Dense

if __name__ == "__main__":
	a = np.array([[[3,1,2,2,0.1,0.1],[0,0,0,0,0,0],[2,1,1,2,0.1,0.1]]])
	print 'Input array:'
	print a
	print ''
	input = Input(shape=(3,6))
	mask = Masking(mask_value=0.1)(input)
	out = TimeDistributed(Dense(1, activation='linear'))(mask)
	model = Model(input=input, output=out)

	model.set_weights([np.array([[ 1. ],
							     [ 1. ],
							     [ 1. ],
							     [ 1. ],
							     [ 1. ],
							     [ 1.]], dtype=np.float32), 
	                   np.array([ 0.], dtype=np.float32)])

	print 'Weights'
	print model.get_weights()
	q = model.predict(a)
	print q

The answer is:

Input array:
[[[ 3.   1.   2.   2.   0.1  0.1]
  [ 0.   0.   0.   0.   0.   0. ]
  [ 2.   1.   1.   2.   0.1  0.1]]]

Weights
[array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]], dtype=float32), array([ 0.], dtype=float32)]
[[[ 8.20000076]
  [ 0.        ]
  [ 6.19999981]]]

If it masked the inputs of value 0.1, you would expect result to be

[[[ 8.       ]
  [ 0.        ]
  [ 6.        ]]]

The Masking layer only takes effect when all features of a timestep equal the mask value. In your case, the input a is a 3D array with shape (1,3,6): 1 is the batch size, 3 is the number of timesteps, and 6 is the number of features per timestep. The mask takes effect when all features of a timestep equal 0.1. If you change a to:
a = np.array([[[3,1,2,2,0.1,0.1],[0,0,0,0,0,0],[0.1,0.1,0.1,0.1,0.1,0.1]]])

you will get the output like:

[[[8.200001] [0. ] [0. ]]]

hoangcuong2011 · 2019-11-29T07:36:10Z

Hi,

I struggled a lot with this recently, and here is some experience I learnt. I hope it would be useful for people.

Masking is extremely powerful. I found it perhaps the only way to deal with several "hard" problems involving sequences with missing inputs or missing outputs. My notes are as follows.
Masking is not that complicated once we understand how the loss is computed with masking. For instance, assume we have a sequence of length 256 and a mask in which only 4 elements are 1 (the others are 0). I thought the loss would be computed as the average over these 4 elements. Guess what: it is not! The loss is divided by 256 instead. For this reason the loss will sometimes be extremely small (0.0something) if we have only a few 1 elements and a long sequence.
Does it matter? I guess not, as what we need is the gradient of the loss rather than the loss itself.
When we use softmax as the last layer, the denominator is the sum of the exponentials of all elements, regardless of whether their mask is 1 or 0.
I thought the output for masked inputs would always be zero in an LSTM. But that is not the case. Assume we have the mask:

0 0 0 1 1 0 0 0

In this case, the first three elements, with mask 0, have an output of 0. However, the last three zeros have the same output as the last element with mask 1.

Meanwhile, Keras is very convenient in the sense that the loss it computes is based only on elements with a mask of 1. I found this a big plus of using Keras, something a bit too good to be true, as I guess implementing this is not that easy.
However, the accuracy in Keras is not computed that way, so it is not trivial in Keras to write a custom metric (for fit). There is something very mysterious to me here: I am pretty sure my code for the custom metric is correct, but somehow it does not give me an accurate result. Because of this, I think it is much easier to write such an accuracy function in a custom callback class.

That is it, I hope it is helpful!
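
The loss normalization described in the first point can be checked with a few lines of plain Python. This is only a sketch of the arithmetic with made-up per-step losses, not the Keras training code; `masked_losses` is a hypothetical helper.

```python
def masked_losses(per_step_loss, mask):
    """Return (kept loss / all steps, kept loss / kept steps)."""
    kept = [l for l, m in zip(per_step_loss, mask) if m]
    total = sum(kept)
    return total / len(per_step_loss), total / len(kept)

length = 256
mask = [1 if t < 4 else 0 for t in range(length)]  # only 4 unmasked steps
per_step_loss = [2.0] * length                     # made-up constant loss

over_all, over_kept = masked_losses(per_step_loss, mask)
print(over_all)   # 0.03125
print(over_kept)  # 2.0
```

For this sequence the two values differ by the constant factor 256 / 4 = 64, which rescales the gradient but does not change its direction.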

zhanjiezhu · 2019-12-07T17:03:03Z

Hi @hoangcuong2011, thanks for your explanations. I've validated your second point, and it is exactly as you said. I'm currently trying to implement an LSTM autoencoder that encodes a sequence into a sequence. It uses an LSTM layer with return_sequences = False, followed by a RepeatVector layer to copy the result back along the timestep dimension. However, the mask gets lost right after the LSTM because return_sequences = False (if True, it returns the input_mask). How can I get the mask back so that the loss also ignores the padded timesteps? Thanks!

hoangcuong2011 · 2019-12-08T17:23:56Z

@zhangwj618 I am not really sure what your question is about. I guess you would like to write a custom masking layer. If you explain the question in more detail, I think I can help. Thx!

cbdbdd mentioned this issue Sep 22, 2016: LSTM CudaNdarrayType(float32, col)' and 'CudaNdarrayType(float32, matrix) error #3641 (Closed)

zumpchke mentioned this issue Nov 17, 2016: Recurrent Models with sequences of mixed length #40 (Closed)

stale bot added the stale label Sep 3, 2017

stale bot closed this as completed Oct 3, 2017

andersjohanandreassen mentioned this issue Mar 16, 2019: TimeDistributed(Dense) with Masking not masking bias #12495 (Closed)

hoangcuong2011 mentioned this issue Dec 8, 2019: tf.keras.layers.Softmax does not support masking? tensorflow/tensorflow#27010 (Closed)

sushreebarsa mentioned this issue Oct 15, 2021: Masking layer does not work after training #14108 (Closed)
