How does Masking work? #3086

Closed
poyuwu opened this Issue Jun 27, 2016 · 26 comments

poyuwu commented Jun 27, 2016

I'm wondering how the Masking layer works.
I tried writing a simple model to test Masking on an Activation layer:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input
a = np.array([[3.,1.,2.,2.,0.,0.]])

inputs = Input(shape=(6,))
mask = Masking(mask_value=0.0)(inputs)
softmax = Activation('softmax')(mask)
model = Model(input=inputs,output=softmax)
model.predict(a)

and the result of prediction is

array([[ 0.50744212,  0.06867483,  0.18667753,  0.18667753,  0.02526405,
         0.02526405]])

Is this the correct behavior?
My Keras version is 1.0.5.
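For reference, the two candidate answers can be computed by hand. The sketch below is plain Python, not Keras; the `softmax` helper and its `keep` argument are hypothetical names introduced only for illustration. It compares a softmax over all six values with one that excludes the zeros, and the first result should agree with the prediction printed above up to float32 rounding.

```python
import math

def softmax(values, keep=None):
    """Softmax over `values`; positions where `keep` is False get probability 0."""
    if keep is None:
        keep = [True] * len(values)
    exps = [math.exp(v) if k else 0.0 for v, k in zip(values, keep)]
    total = sum(exps)
    return [e / total for e in exps]

a = [3.0, 1.0, 2.0, 2.0, 0.0, 0.0]
plain = softmax(a)                               # zeros take part in the normalisation
masked = softmax(a, keep=[v != 0.0 for v in a])  # zeros are excluded

print([round(p, 8) for p in plain])
print([round(p, 8) for p in masked])
```

The only difference between the two is whether the padded zeros contribute `exp(0) = 1` each to the denominator.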

ipoletaev Jun 28, 2016

I'm also interested in this question. It seems you expected to get something like the following:
[[ 0.53444666, 0.07232948, 0.19661194, 0.19661194]] ?
On the other hand, according to the explanation in core.py (class Masking(Layer)), masking doesn't work with 1D input data. So if you try this, for example:

from keras.models import Model
import numpy as np
from keras.layers import Masking,Input,TimeDistributed,Dense
a = np.array([[[3,1,2,2,0,0],[0,0,0,0,0,0],[2,1,1,2,0,0]]])

input = Input(shape=(3,6))
mask = Masking(mask_value=0)(input)
out = TimeDistributed(Dense(1,activation='linear'))(mask)
model = Model(input=input,output=out)
q = model.predict(a)
print (q[0])

...you will get [[-0.20101213],[ 0. ],[-0.51546627]] as expected.
But most likely there is something wrong in my understanding.

poyuwu Jun 28, 2016

@ipoletaev
Yes, sure. However, Keras could also return [[ 0.53444666, 0.07232948, 0.19661194, 0.19661194, 0.0, 0.0 ]], padding the result to keep the shape.
Here is another example of Masking, this time on a bi-LSTM whose two directions are summed:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input, LSTM, merge
a = np.array([[[.3,.1,.2,.2,.1,.1],[.2,.3,.3,.3,.3,.1],[0,0,0,0,0,0]]])

inputs = Input(shape=(3,6))
mask = Masking(mask_value=0.0)(inputs)
fw = LSTM(1,return_sequences=True)(mask)
bw = LSTM(1,return_sequences=True,go_backwards=True)(mask)
merged = merge([fw,bw],mode='sum')
model = Model(input=inputs,output=fw)
model2 = Model(input=inputs,output=bw)
model3 = Model(input=inputs,output=merged)

The fw output is
array([[[-0.07041532], [-0.12203699], [-0.12203699]]])
the bw output is
array([[[ 0. ], [-0.03112165], [ 0.02271803]]])
and the merged output is
array([[[-0.07041532], [-0.15315863], [-0.09931896]]])
but I think it should be (here it could also be padded with 0 to keep the shape)
array([[[-0.10153697], [-0.09931896]]])
where -0.10153697 = (-0.07041532) + (-0.03112165) and -0.09931896 = (-0.12203699) + 0.02271803.
Is there anything wrong in Keras?
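The arithmetic behind the two readings can be checked directly. This is a sketch in plain Python using only the numbers reported above. It assumes (as the zero in the first backward position suggests, though the thread does not confirm it) that the padded step is the last forward entry and the first backward entry.

```python
fw = [-0.07041532, -0.12203699, -0.12203699]  # forward LSTM output reported above
bw = [0.0, -0.03112165, 0.02271803]           # backward LSTM output reported above

# Position-by-position sum, which matches the merged output printed by Keras.
positional = [f + b for f, b in zip(fw, bw)]

# The sum the poster expected: drop the assumed padded step from each sequence
# (last forward entry, first backward entry), then add what is left.
expected = [f + b for f, b in zip(fw[:2], bw[1:])]

print([round(v, 8) for v in positional])
print([round(v, 8) for v in expected])
```

So the merge simply adds entries with the same index; it does not realign the two sequences around the masked step.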

ipoletaev Jun 28, 2016

@poyuwu

However keras can also return [[ 0.53444666, 0.07232948, 0.19661194, 0.19661194, 0.0, 0.0 ]] by padding it to keep shape.

Hmm... I don't know how to produce such output using Keras alone.

About your example: I think it's similar to the earlier one, so you should get array([[[-0.07041532], [-0.15315863], [0.02271803]]]). And it's really strange that bw works correctly but fw doesn't: its third output is not zero, but it should be...

lomizandtyd Jun 28, 2016

Hi guys, I have this question too, especially for LSTM (BRNN).

The Masking layer produces a masked vector; it only acts on the inputs, not on the inner states.
So in @poyuwu's example, the fw output still has a value at step 3.

This might be correct, because the inputs are certainly masked.
Still, I want to find a way to skip the calculation step when a masked value comes in, as with a special tag.

However, I think using a Masking layer in a bidirectional RNN for sequences of different lengths may be totally wrong.
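One way to picture the behaviour described here is a recurrence that freezes its state on masked steps and repeats its previous output. The sketch below is a toy model in plain Python. `run_masked_rnn` and its update rule are hypothetical, not an LSTM and not Keras code, and the assumption that Keras treats masked steps this way is based only on the outputs reported in this thread.

```python
def run_masked_rnn(inputs, mask, go_backwards=False):
    """Toy recurrence h = 0.5 * h + x that freezes its state on masked steps."""
    steps = list(zip(inputs, mask))
    if go_backwards:
        steps.reverse()
    state, outputs = 0.0, []
    for x, keep in steps:
        if keep:
            state = 0.5 * state + x  # hypothetical update rule, not an LSTM
        outputs.append(state)        # masked step: the previous state is repeated
    return outputs

inputs = [1.0, 3.0, 0.0]      # the last step is post-padding
mask = [True, True, False]

forward = run_masked_rnn(inputs, mask)
backward = run_masked_rnn(inputs, mask, go_backwards=True)
print(forward)    # [1.0, 3.5, 3.5]
print(backward)   # [0.0, 3.0, 2.5]
```

In the forward direction the padded step repeats the last real output, while in the backward direction the padded step comes first and yields the initial zero state. This is the same pattern as the fw and bw outputs above.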

ipoletaev Jun 28, 2016

@lomizandtyd

So in @poyuwu 's example, the fw's output still has value in step 3.

Yes, that's logical, but in any case we want to get zero in the third position, don't we?

While, I want to find a way to skip the calculation step when coming masked value.

I think it doesn't matter because, as I understand it, you should pass sample_weights to fit() in order to skip the timesteps whose feature vector is all zeros (I have already asked about this in #3023). But if that is so, it is not clear why we need masking at all when fit() can be told which timesteps of each example are used for training and which are not. I mean that it is not important whether the network processes these "empty" vectors; what is important is to train the network without backpropagating errors computed on such vectors.

Maybe there is some way to use batch_size=1 and not bother with padding?
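The effect of per-timestep weights on the loss can be sketched in a few lines. This is plain Python under stated assumptions: `weighted_mean_loss` is a hypothetical helper, and the normalisation (mean over steps with non-zero weight) is chosen for illustration and may differ from what Keras does internally. To my knowledge, Keras 1.x asks for this kind of weighting with `sample_weight_mode='temporal'` in `compile()` and a 2D `sample_weight` array in `fit()`.

```python
def weighted_mean_loss(y_true, y_pred, weights):
    """Mean squared error over timesteps, counting only steps with non-zero weight."""
    per_step = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    weighted = [w * l for w, l in zip(weights, per_step)]
    active = sum(1 for w in weights if w != 0)
    return sum(weighted) / active

y_true = [1.0, 0.0, 0.0]
y_pred = [0.5, 0.5, 0.9]    # the last step is padding; its prediction is arbitrary
weights = [1.0, 1.0, 0.0]   # zero weight removes the padded step from the loss

print(weighted_mean_loss(y_true, y_pred, weights))          # 0.25
print(weighted_mean_loss(y_true, y_pred, [1.0, 1.0, 1.0]))  # padded step included
```

With the zero weight, the error at the padded step never reaches the gradient, even though the network still produces a prediction there.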

lomizandtyd Jun 28, 2016

@ipoletaev Wow, thanks a lot for this!

Yes, we want to get zero at the masked position.
The problem is that we also want to keep the inner states across the masked step.

Maybe we could pass another sample_weights to the predict() function?
If so, BRNN is still wrong...

ipoletaev Jun 28, 2016

@lomizandtyd

...keep the inner states across the masked step.

I think it's not necessary, because the network shouldn't remember what responses it needs to produce for empty vectors...

Maybe we can deliver another sample_weights in the predict() function?.

I don't understand what task you want to use it for. After all, you always know in advance what data you are processing, so you know which outputs of the network correspond to the empty vectors, and you can just skip those positions in the output, I guess.

If do so, BRNN is still wrong...

As far as I understand, Keras has been "fighting" with the RNN masking task for about a year :)

poyuwu Jun 28, 2016

from keras.models import Model
import numpy as np
from keras.layers import Masking,Input,TimeDistributed,Dense
a = np.array([[[3,1,2,2,0,0],[0,0,0,0,0,0],[2,1,1,2,0,0]]])
input = Input(shape=(3,6))
mask = Masking(mask_value=0)(input)
out = TimeDistributed(Dense(1,activation='linear'))(mask)
model = Model(input=input,output=out)
q = model.predict(a)
print (q[0])

@ipoletaev I think it's just the Dense layer receiving zero inputs, so its output is 0. If you change the activation function to softmax, you will get the wrong answer.
Besides, setting None for the timestep dimension will raise other errors in some cases (especially with a merge layer).

In Lasagne, a masking matrix seems to be used to deal with padding. (I have not tested its accuracy.)
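The point about the Dense layer can be illustrated without Keras. In this sketch the `dense` helper and the weights are hypothetical. An all-zero row passed through a single-unit layer yields just the bias, which is zero in a freshly initialised layer, so a zero output is not by itself evidence that the timestep was skipped.

```python
import math

def dense(x, w, b):
    """Single-unit Dense layer: weighted sum of the features plus a bias."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b

w = [0.3, -0.2, 0.1, 0.4, -0.5, 0.2]  # hypothetical weights
b = 0.0                               # biases start at zero in a fresh layer

zero_row = [0.0] * 6
z = dense(zero_row, w, b)

linear_out = z                            # linear activation: just the bias
softmax_out = math.exp(z) / math.exp(z)   # softmax over a single unit is always 1

print(linear_out, softmax_out)   # 0.0 1.0
```

With a softmax activation the same all-zero row produces 1 rather than 0, which is why changing the activation exposes the issue.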

ipoletaev Jun 28, 2016

@poyuwu: yes, I checked it, and you are right. As I understand it, this means that even a simple Dense layer doesn't keep masked values the way we want...

To sum up again what does not match expectations:

  • The forward RNN doesn't keep mask values, but the backward one does. It's strange.
  • Can this task be solved with batch_size = 1?
  • How do we correctly specify which timesteps the network should skip?
  • It's not clear at which moment the BiLSTM resets its state: only at the end of the timesteps of the current sample, or whenever the network meets an empty vector?

poyuwu Jun 29, 2016

@ipoletaev I don't think the statement

  • The forward RNN doesn't keep mask values, but the backward one does. It's strange.

is true.
That's because the padding argument is 'post', not 'pre'. Hence the reason is the same as for the Dense layer, as I said.

  • How do we correctly specify which timesteps the network should skip?

As I said, in Lasagne we provide a mask numpy.array (the same shape as the input) to deal with it. If go_backwards=True, the padding argument needs to be kept the same.

Besides, the Embedding layer's mask_zero seems to behave the same way.
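The difference between 'post' and 'pre' padding is easy to see on a toy sequence. The `pad` helper below is a hypothetical stand-in written for this illustration; it is not the Keras padding utility.

```python
def pad(seq, maxlen, padding="post", value=0):
    """Pad `seq` with `value` up to `maxlen`, before ("pre") or after ("post") the data."""
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

seq = [7, 8]
post = pad(seq, 3, "post")   # [7, 8, 0]
pre = pad(seq, 3, "pre")     # [0, 7, 8]

# A forward pass meets the padding last with "post" and first with "pre";
# reversing the sequence (as go_backwards does) swaps the two situations.
print(post, post[::-1])
print(pre, pre[::-1])
```

A layer that reads a post-padded sequence backwards therefore sees the padded step before any real data, when its state is still at the initial value.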

ipoletaev Jun 30, 2016

@poyuwu so you mean that there is currently no way to solve this issue with Keras?
I mean, is it necessary to use masking if we use sample weights?

xuewei4d Nov 22, 2016

Same here. It seems the masking mechanism in Keras is not fully supported.

fferroni Dec 16, 2016

I don't think Masking masks input values (neither during forward nor back-propagation). It just skips a time-step where all features are equal to the mask value (i.e. when you pad a sequence). You can confirm this with:

from keras.models import Model
import numpy as np
from keras.layers import Masking, Activation, Input, TimeDistributed, Dense

if __name__ == "__main__":
    a = np.array([[[3,1,2,2,0.1,0.1],[0,0,0,0,0,0],[2,1,1,2,0.1,0.1]]])
    print('Input array:')
    print(a)
    print('')
    input = Input(shape=(3,6))
    mask = Masking(mask_value=0.1)(input)
    out = TimeDistributed(Dense(1, activation='linear'))(mask)
    model = Model(input=input, output=out)

    # Set all six kernel weights to 1 and the bias to 0,
    # so each output is simply the sum of that timestep's features.
    model.set_weights([np.ones((6, 1), dtype=np.float32),
                       np.zeros(1, dtype=np.float32)])

    print('Weights')
    print(model.get_weights())
    q = model.predict(a)
    print(q)

The answer is:

Input array:
[[[ 3.   1.   2.   2.   0.1  0.1]
  [ 0.   0.   0.   0.   0.   0. ]
  [ 2.   1.   1.   2.   0.1  0.1]]]

Weights
[array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]], dtype=float32), array([ 0.], dtype=float32)]
[[[ 8.20000076]
  [ 0.        ]
  [ 6.19999981]]]

If it masked the inputs with value 0.1, you would expect the result to be

[[[ 8.       ]
  [ 0.        ]
  [ 6.        ]]]
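The rule described here (a timestep is masked only when all of its features equal the mask value) can be written out as a small check. This is a plain-Python sketch and `timestep_mask` is a hypothetical helper, not the Keras implementation.

```python
def timestep_mask(sequence, mask_value):
    """True for timesteps that are kept: at least one feature differs from mask_value."""
    return [any(v != mask_value for v in step) for step in sequence]

seq = [[3, 1, 2, 2, 0.1, 0.1],
       [0, 0, 0, 0, 0, 0],
       [2, 1, 1, 2, 0.1, 0.1]]

print(timestep_mask(seq, 0.1))  # [True, True, True]: no row is entirely 0.1
print(timestep_mask(seq, 0.0))  # [True, False, True]: only the all-zero row is masked

# With all-ones weights and zero bias, the Dense output is just the row sum.
print([sum(step) for step in seq])
```

With `mask_value=0.1` nothing is masked in this input, so the 0.1 features still contribute to the row sums of roughly 8.2 and 6.2.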

GPaolo Jan 17, 2017

Actually, Masking works exactly as expected.
The problem is that you are working with the wrong dimension order: in input = (3,6), the 3 is the time dimension, and the Masking layer masks only along that dimension, making the net ignore a time sample if all elements of that sample are equal to the mask value.

import numpy as np
import keras
from keras.layers import *
from keras.models import Model

net_input = Input(shape = ( 3, 10))
mask = Masking(mask_value = 0.5)(net_input)
conv = TimeDistributed(Dense(1, activation = 'linear', init='one'))(mask)
out = LSTM(1, init='one', inner_init='one',activation='tanh', inner_activation='tanh',)(conv)
model = Model(net_input, out)

print('W: ' + str(model.get_weights()))

net_in = np.ones((1,3, 10))
val = 0.5
net_in[0, 2, :] = val
out = model.predict(net_in)
print('Input: ' + str(net_in))
print('Output: ' + str(out))

In this case the answers are:

mask = 0.5, val = 0.0 : 0.73566443
mask = 0.0, val = 0.0 : 0.96402758
mask = 0.0, val = 0.5 : 0.99504161
mask = 0.5, val = 0.5 : 0.96402758

So you can see that whenever the mask value equals val we get the same result (0.96402758), while when we mask something else, even if val = 0, we get a different result.

GPaolo Jan 17, 2017

Moreover, I just tested this: if you have a multi-input net (with multiple input branches) and a Masking layer on each branch, it is enough for just one of the inputs at time step t to be equal to the mask value for the whole time step to be skipped.

I guess that if one wants to skip the time step only when all the inputs are equal to the mask value, the branches need to be merged first, right?
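The two options can be compared on a toy input. This plain-Python sketch assumes, in line with the observation above but without checking the Keras source, that per-branch masks are combined with a logical AND. `keep_mask` and the branch data are hypothetical.

```python
def keep_mask(sequence, mask_value=0.0):
    """Per-timestep keep flag: True unless every feature equals mask_value."""
    return [any(v != mask_value for v in step) for step in sequence]

branch_a = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
branch_b = [[5.0], [6.0], [0.0]]

# Mask each branch, then combine: a step survives only if every branch keeps it.
per_branch = [a and b for a, b in zip(keep_mask(branch_a), keep_mask(branch_b))]

# Concatenate the features first, then mask once: a step is dropped only if
# all features of all branches equal the mask value.
merged = [a + b for a, b in zip(branch_a, branch_b)]
after_merge = keep_mask(merged)

print(per_branch)   # [True, False, False]
print(after_merge)  # [True, True, False]
```

The second timestep is where the two approaches disagree: one branch is empty there, so per-branch masking drops it, while masking after the merge keeps it.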

irrationalagent Jan 19, 2017

Hi Fragore, I have a question similar to yours about masking with multiple inputs. I have two input branches, and all I want to do is mask 0 from both. Am I right in thinking that adding a mask to the end of each branch is equivalent to adding a single mask AFTER the inputs are merged? Here's my example:

input1 = Sequential()
input1.add(TimeDistributed(Dense(50), input_shape=(MAX_SEQUENCE_LENGTH,48)))
input2 = Sequential()
input2.add(Embedding(nb_words+2,EMBEDDING_DIM,weights=[embedding_matrix],trainable=False,input_length=MAX_SEQUENCE_LENGTH))

model = Sequential()
model.add(keras.engine.topology.Merge([input1,input2],mode='concat',concat_axis=-1))
model.add(keras.layers.core.Masking(mask_value=0.0))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(512,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(TimeDistributed(Dense(nb_words + 1)))
model.add(Activation('softmax'))

Or the version with a mask after each branch, prior to merging:

input1 = Sequential()
input1.add(TimeDistributed(Dense(50), input_shape=(MAX_SEQUENCE_LENGTH,48)))
input1.add(keras.layers.core.Masking(mask_value=0.0))
input2 = Sequential()
input2.add(Embedding(nb_words+2,EMBEDDING_DIM,weights=[embedding_matrix],trainable=False,input_length=MAX_SEQUENCE_LENGTH,mask_zero=True))

model = Sequential()
model.add(keras.engine.topology.Merge([input1,input2],mode='concat',concat_axis=-1))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(1024,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(LSTM(512,dropout_W=.2,dropout_U=.2,return_sequences=True))
model.add(TimeDistributed(Dense(nb_words + 1)))
model.add(Activation('softmax'))


GPaolo Jan 24, 2017

Wait, you want to mask the branch outputs that are 0? In that case both of your approaches should give you the same result. But usually you mask the inputs, which means putting the Masking layer at the input of the net.
PS: it may also be more convenient to use the functional API :)
PPS: the last Dense layer doesn't need TimeDistributed anymore, because the LSTM removes the time dimension.

slaterb1 Mar 17, 2017

Contributor

I've been experimenting with and without masking for a little while now, and I have finally figured out what the Masking layer actually does. It doesn't actually "skip" the timepoint that has all masked values; it just forces all the values for that timepoint to be equal to 0... So effectively Masking(mask_value=0.) does nothing. That is why, in the example provided by @GPaolo above, the results for mask_value=0 and mask_value=0.5 are the same when val matches them.

Here is some easy code to demonstrate what I mean.

Model:

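`import numpy as np
from keras import backend as K
from keras.layers import Input, Masking, Dense
from keras.models import Model

input1 = Input(batch_shape=(1,1,10))
mask1 = Masking(mask_value=2)(input1)
dense_layer1 = Dense(1, activation='sigmoid')
# Dense does not accept a mask by default, so flag it as mask-aware
dense_layer1.supports_masking = True
output1 = dense_layer1(mask1)

model = Model(input1, output1)
model.compile(optimizer='adam', loss='binary_crossentropy')
`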
Data:

`data = np.ones((10, 1, 10), dtype='float32')
# set half of the data equal to the mask value
for index in range(5,10):
    data[index,0,:] = 2

# set one feature of the first data point equal to the mask value to show that this line is unaffected
data[0,0,0] = 2`

print outputs:

`get_mask_output = K.function([model.layers[0].input], [model.layers[1].output])
mask_output = get_mask_output([data])[0]

print(data)
print(mask_output)

data:

[[[ 2. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]
 [[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]
 [[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]
 [[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]
 [[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]]

mask_output:

[[[ 2. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]`

Predictions:

`test_data = np.ones((5,1,10))
test_data[1,0,:] = 2
test_data[2,0,:] = 0
predictions = model.predict(test_data, batch_size=1)

print(test_data)
print(predictions)
`

Results:

test_data:
`[[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]]
 [[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
 [[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]]

predictions:
[[[ 0.09200736]]
 [[ 0.5 ]]
 [[ 0.5 ]]
 [[ 0.09200736]]
 [[ 0.09200736]]]`

As you can imagine, "masking" values by setting them to 0 and still calculating the results for those lines in later layers causes some mistakes in backpropagation (treating unknown values as a real result), as well as added, unneeded computation time. I'm going to try to rework how masking is done in Keras a bit...

Edit: I did a little digging into the training.py code and found that the "masking" information (even with mask_value = 0.) does get incorporated into the training of the weights. The masked lines effectively get ignored after the calculation is done (which is good!). The problem I am encountering in my actual network is that although "masked lines" are ignored during weight training, they are still evaluated by the network going forward, which affects the outputs of later layers based on false information. To be able to build a network that handles variably sized inputs (not all have the maximum number of timepoints), I want to ignore the masked lines entirely... I'm going to try to work that out.
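The forward-pass part of this finding can be mimicked in plain Python. In the sketch below, `masking` and `dense_sigmoid` are hypothetical helpers and the weights are made up (they are not the trained weights from the run above). It only illustrates why a masked row and an all-zero row end up with the same prediction of 0.5.

```python
import math

def masking(sequence, mask_value):
    """Zero out every timestep whose features all equal mask_value."""
    return [[0.0] * len(step) if all(v == mask_value for v in step) else list(step)
            for step in sequence]

def dense_sigmoid(step, w, b):
    """Single-unit Dense layer followed by a sigmoid."""
    z = sum(x * wi for x, wi in zip(step, w)) + b
    return 1.0 / (1.0 + math.exp(-z))

w = [-0.2] * 10   # hypothetical weights, not the ones from the run above
b = 0.0

rows = [[1.0] * 10, [2.0] * 10, [0.0] * 10]
masked_rows = masking(rows, mask_value=2)
predictions = [dense_sigmoid(step, w, b) for step in masked_rows]

print(masked_rows[1])   # the all-2 row has become all zeros
print(predictions)
```

Because the masked row is replaced by zeros before the Dense layer sees it, its pre-activation is just the bias, and `sigmoid(0) = 0.5` regardless of the weights.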

ragulpr Mar 20, 2017

Building on @slaterb1's and @GPaolo's snippets, I tried digging around to see the benefits of masking but haven't found any yet. It feels like I'm missing something.

  • It does not seem to propagate numerically sound values through time
  • It propagates np.nan, see gist
  • It feels (TODO: test) quite numerically unstable to propagate possibly absurd values down the network. For example, the masked output 0 may not always be in place.
  • It has to test each input
  • Quick testing (see gist) seems to show that there are no immediate performance gains

Does anyone have an idea about if/when it gives performance gains? I didn't have time to run long/deep/wide experiments, and I'm not comfortable with how Python/Keras/TensorFlow/Theano compiles things.

Is the mask an intricate way of doing what I think weights should be doing, i.e. multiplying with the loss and dividing by the sum of weights in the batch?
That is literally what seems to be done here anyway:
https://github.com/fchollet/keras/blob/master/keras/engine/training.py#L453

Does it actually halt any execution (yet)?
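
For reference, the mask-as-weights idea in the question above can be sketched in plain NumPy. This is a minimal illustration, not the Keras source: the function name `masked_mean_loss` and the toy arrays are made up for the example, and it assumes the per-timestep loss is multiplied by a 0/1 mask and then rescaled by the mean of the mask, so that masked timesteps contribute nothing to the average.

```python
import numpy as np

def masked_mean_loss(per_step_loss, mask):
    """Average a per-timestep loss while ignoring masked timesteps.

    per_step_loss: array of shape (batch, timesteps)
    mask: array of the same shape, 1.0 = keep, 0.0 = masked
    (hypothetical helper, written only for this illustration)
    """
    mask = mask.astype(per_step_loss.dtype)
    weighted = per_step_loss * mask      # masked steps contribute 0
    weighted = weighted / mask.mean()    # rescale so the mean covers kept steps only
    return weighted.mean()

loss = np.array([[1.0, 3.0, 100.0, 100.0]])  # last two steps are padding
mask = np.array([[1.0, 1.0, 0.0, 0.0]])
result = masked_mean_loss(loss, mask)
print(result)
```

With these toy numbers the result equals the plain mean of the two unmasked losses, so the large values on the padded steps never reach the reported loss. Note that this only changes the loss; it says nothing about whether the forward computation on masked steps is skipped.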

carlthome Mar 21, 2017

Contributor

@ragulpr, it's my understanding that masking does more than just loss scaling. If a timestep has been masked, the previous output and state will be reused. See here and here.
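
To make the "previous output and state will be reused" behaviour concrete, here is a small NumPy sketch of a masked recurrent loop. It is a toy under stated assumptions, not the backend code: `masked_rnn`, the tanh cell, and the random weights are invented for illustration, and the only point is the `np.where` switch that keeps the old state when a timestep is masked.

```python
import numpy as np

def masked_rnn(inputs, mask, w_in, w_rec):
    """Run a toy tanh RNN over one sequence, carrying state across masked steps.

    inputs: (timesteps, features); mask: (timesteps,) booleans, True = real data.
    Hypothetical helper for illustration only.
    """
    state = np.zeros(w_rec.shape[0])
    outputs = []
    for x_t, keep in zip(inputs, mask):
        candidate = np.tanh(x_t @ w_in + state @ w_rec)
        # Switch: take the new state on real steps, reuse the old one on masked steps.
        state = np.where(keep, candidate, state)
        outputs.append(state)
    return np.stack(outputs)

rng = np.random.default_rng(0)
w_in = rng.normal(size=(3, 4))
w_rec = rng.normal(size=(4, 4))
inputs = rng.normal(size=(5, 3))
mask = np.array([True, True, False, False, True])

outputs = masked_rnn(inputs, mask, w_in, w_rec)
```

Rows 2 and 3 of `outputs` repeat row 1, because the state is carried over unchanged across the masked steps. In this sketch the candidate state is still computed for masked steps and then discarded, so the switch by itself does not save any computation.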

slaterb1 Mar 22, 2017

Contributor

@ragulpr, I'm not sure about performance gains, but Theano is pretty smart about knowing what it needs to hang on to and what it doesn't (based on the API docs: http://deeplearning.net/software/theano/library/scan.html).

More specifically, this line: "Note that there is an optimization, that at compile time will detect that you are using just the last value of the result and ensure that scan does not store all the intermediate values that are used. So do not worry if A and k are large."

So after compiling the model it might pass over the masked values (or at least not hold them in memory as long), but that is pure speculation based on similarities in the underlying code.

@carlthome, I came across the mask snippet in "theano_backend.py" as well, and you are right that masking has a direct effect on how the states are evaluated and passed on (T.switch). Maybe this is too general a question, but how does this layer accept the mask? To give an example, if I have a model with multiple layers, defined like so:

model = Model(input1, output1)

I understand that Theano wraps this up as a mathematical equation to calculate:

output1 = input1 -> [ layers[0] -> layers[1] -> ... layers[N] ]

but if I have somewhere in the middle:

prev_layer -> Masking_layer -> RNN_layer

The output from the Masking_layer is fed into the RNN_layer as input ("x"). Does the "supports_masking" attribute tell the RNN_layer to figure out the mask? I could not find anywhere in the code where the mask is evaluated or interpreted by the RNN_layer, except that I can pass in a "mask" variable via the call() method of the Recurrent(Layer) object.

I tried calling RNN_layer(prev_layer, mask=Masking_layer), but it didn't do anything different. The last comment in thread #176 suggests that it has to be called with a mask, but I'm not sure how to do that... Any thoughts?

carlthome Mar 22, 2017

Contributor

"I could not find anywhere in the code where the mask is evaluated or interpreted by the RNN_layer"

Each Keras layer declares whether it supports masking. Each layer is also responsible for using the mask in a sensible way (which I believe is the primary source of confusion: the masking functionality is implemented across a bunch of different classes). RNN layers in particular rely on the fact that the underlying K.rnn operation has mask support, so if you're looking for where precisely the logic is, you'll note that the RNN layers simply pass the mask argument into the backend, where the magic happens.
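
A rough NumPy sketch of this division of labour, under the assumption that the masking layer marks a timestep as masked only when every feature equals mask_value and zeroes that timestep in its output. The names used here (`MaskingSketch`, `masked_sum_over_time`) are hypothetical stand-ins, not Keras classes.

```python
import numpy as np

class MaskingSketch:
    """Toy stand-in for a masking layer (hypothetical, not the Keras class)."""

    def __init__(self, mask_value=0.0):
        self.mask_value = mask_value

    def compute_mask(self, inputs):
        # A timestep is kept if ANY feature differs from mask_value.
        return np.any(inputs != self.mask_value, axis=-1)

    def call(self, inputs):
        # Fully masked timesteps are zeroed; everything else passes through.
        keep = self.compute_mask(inputs)[..., np.newaxis]
        return inputs * keep

def masked_sum_over_time(inputs, mask):
    """A downstream 'layer' that chooses to use the mask: it sums only kept steps."""
    return (inputs * mask[..., np.newaxis]).sum(axis=1)

# Shape (batch=1, timesteps=3, features=2)
x = np.array([[[1.0, 2.0], [2.0, 2.0], [2.0, 5.0]]])
layer = MaskingSketch(mask_value=2.0)
mask = layer.compute_mask(x)
out = layer.call(x)
total = masked_sum_over_time(out, mask)
```

Only the middle timestep is masked here, because it is the only one where all features equal 2.0; a single matching feature (as in the first and last timesteps) is not enough. The mask travels alongside the data as a separate array, and it is up to each consuming layer to do something with it.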

slaterb1 Mar 22, 2017

Contributor

@carlthome, I saw that in the code but was not able to get the mask to work in my RNN network. For clarity, I was trying to rework parts of RecurrentShop to set up an encoder-decoder network that adjusts the next input based on a prediction made from the previous state of both the encoder and the decoder (a custom RNN that uses .single_step_rnn() instead of the regular .rnn()).

But based on your advice, I tried building a basic LSTM network to act as a NOT gate (pointless but simple), and it does interpret the mask correctly when it is passed a mask mid-network! I'm including the gist. It shows that masking works for both return_sequences=True and return_sequences=False. It also shows that if you train the network with data that has no 'masked' input, 'masked' lines in the test data will still get masked appropriately. Hope that helps people understand the masking stuff better!

This is the gist

Seanny123 May 30, 2017

@fferroni @GPaolo apparently the TimeDistributed layer didn't support masking at the time, since that feature was only added in pull request #6401?

mehrdadscomputer Jun 5, 2017

Hey guys, there is a seq2seq example whose input is a string (sequence) like '5+9' and whose output is another string, '14'.
The author used pre-padding so that all input sequences have the same length, but he didn't use masking.
I added a single line to add masking to his model, and there is about an 8 percent improvement in accuracy.
Is my case a correct use of masking?

This is the main code:

from random import seed
from random import randint
from numpy import array
from math import ceil
from math import log10
from math import sqrt
from numpy import argmax
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import TimeDistributed
from keras.layers import RepeatVector

def random_sum_pairs(n_examples, n_numbers, largest):
    X, y = list(), list()
    for i in range(n_examples):
        in_pattern = [randint(1,largest) for _ in range(n_numbers)]
        out_pattern = sum(in_pattern)
        X.append(in_pattern)
        y.append(out_pattern)
    return X, y

def to_string(X, y, n_numbers, largest):
    max_length = n_numbers * ceil(log10(largest+1)) + n_numbers - 1
    Xstr = list()
    for pattern in X:
        strp = '+'.join([str(n) for n in pattern])
        strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
        Xstr.append(strp)
    max_length = ceil(log10(n_numbers * (largest+1)))
    ystr = list()
    for pattern in y:
        strp = str(pattern)
        strp = ''.join([' ' for _ in range(max_length-len(strp))]) + strp
        ystr.append(strp)
    return Xstr, ystr

def integer_encode(X, y, alphabet):
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    Xenc = list()
    for pattern in X:
        integer_encoded = [char_to_int[char] for char in pattern]
        Xenc.append(integer_encoded)
    yenc = list()
    for pattern in y:
        integer_encoded = [char_to_int[char] for char in pattern]
        yenc.append(integer_encoded)
    return Xenc, yenc

def one_hot_encode(X, y, max_int):
    Xenc = list()
    for seq in X:
        pattern = list()
        for index in seq:
            vector = [0 for _ in range(max_int)]
            vector[index] = 1
            pattern.append(vector)
        Xenc.append(pattern)
    yenc = list()
    for seq in y:
        pattern = list()
        for index in seq:
            vector = [0 for _ in range(max_int)]
            vector[index] = 1
            pattern.append(vector)
        yenc.append(pattern)
    return Xenc, yenc

def generate_data(n_samples, n_numbers, largest, alphabet):
    X, y = random_sum_pairs(n_samples, n_numbers, largest)
    X, y = to_string(X, y, n_numbers, largest)
    X, y = integer_encode(X, y, alphabet)
    X, y = one_hot_encode(X, y, len(alphabet))
    X, y = array(X), array(y)
    return X, y

def invert(seq, alphabet):
    int_to_char = dict((i, c) for i, c in enumerate(alphabet))
    strings = list()
    for pattern in seq:
        string = int_to_char[argmax(pattern)]
        strings.append(string)
    return ''.join(strings)

seed(1)
n_samples = 1000
n_numbers = 2
largest = 10
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']
n_chars = len(alphabet)
n_in_seq_length = n_numbers * ceil(log10(largest+1)) + n_numbers - 1
n_out_seq_length = ceil(log10(n_numbers * (largest+1)))
n_batch = 10
n_epoch = 10
model = Sequential()
model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))
model.add(RepeatVector(n_out_seq_length))
model.add(LSTM(50, return_sequences=True))
model.add(TimeDistributed(Dense(n_chars, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

for i in range(n_epoch):
    X, y = generate_data(n_samples, n_numbers, largest, alphabet)
    print(i)
    model.fit(X, y, epochs=1, batch_size=n_batch)

X, y = generate_data(n_samples, n_numbers, largest, alphabet)
result = model.predict(X, batch_size=n_batch, verbose=0)
expected = [invert(x, alphabet) for x in y]
predicted = [invert(x, alphabet) for x in result]
for i in range(20):
    print('Expected=%s, Predicted=%s' % (expected[i], predicted[i]))

and I just changed this part:

model = Sequential()
model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))

to this part:

from keras.layers import Masking
model = Sequential()
# mask timesteps whose one-hot vector encodes the padding character ' '
model.add(Masking(mask_value=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], input_shape=(n_in_seq_length, n_chars)))
model.add(LSTM(100))

sources:
http://machinelearningmastery.com/learn-add-numbers-seq2seq-recurrent-neural-networks/#comment-400854
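
As a sanity check on what that mask_value selects, the comparison can be reproduced in NumPy. This sketch assumes a timestep is masked only when its whole one-hot vector equals mask_value, i.e. when it encodes the padding character ' ' (the last alphabet entry); `encode` is a hypothetical helper written for this illustration and is not part of the original script.

```python
import numpy as np

alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']

def encode(text):
    """One-hot encode a string over the alphabet (hypothetical helper)."""
    vectors = np.zeros((len(text), len(alphabet)))
    for position, char in enumerate(text):
        vectors[position, alphabet.index(char)] = 1
    return vectors

mask_value = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
x = encode('  5+9')  # pre-padded with two spaces
# A timestep is kept when at least one feature differs from mask_value.
keep = np.any(x != mask_value, axis=-1)
print(keep)
```

Under that assumption, the two leading padding positions are flagged as masked and the three positions holding '5', '+', and '9' are kept, which is the behaviour one would want for pre-padded input.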

@stale stale bot added the stale label Sep 3, 2017

stale bot Sep 3, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

@stale stale bot closed this Oct 3, 2017
