Attention Mechanism Implementation Issue #1472

Closed · zhzou2020 opened this issue Jan 15, 2016 · 25 comments
@zhzou2020

The problem is that I have an output a from an LSTM layer with shape (batch, step, hidden), and an output b from another layer, called the weights (or attention), with shape (batch, step). I have no idea how to compute the weighted sum of the two outputs, like this:

a0, a1, a2 = a.shape
b0, b1 = b.shape
c = np.zeros((a0, a2))  # accumulate the weighted sum here
for i in range(a0):
    for j in range(a1):
        for k in range(a2):
            c[i][k] += a[i][j][k] * b[i][j]

Can this be done in keras?
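For reference, that loop is just a weighted sum over the step axis; a NumPy sketch of the same computation (assuming a and b are arrays of the shapes above):

import numpy as np

# a: (batch, step, hidden), b: (batch, step)
c = (a * b[:, :, None]).sum(axis=1)   # c: (batch, hidden)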

@jfsantos
Contributor

Wouldn't a TimeDistributedMerge layer work for you?

@farizrahman4u
Contributor

@zzjin13 Here you go..

from keras.layers.core import *
from keras.layers.recurrent import LSTM
from keras.models import Sequential

input_dim = 32
hidden = 32
step = 10  # length of the input sequences (any fixed value)

#The LSTM model - output_shape = (batch, step, hidden)
model1 = Sequential()
model1.add(LSTM(input_dim=input_dim, output_dim=hidden, input_length=step, return_sequences=True))

#The weight model  - actual output shape  = (batch, step)
# after reshape : output_shape = (batch, step,  hidden)
model2 = Sequential()
model2.add(Dense(input_dim=input_dim, output_dim=step))
model2.add(Activation('softmax')) # Learn a probability distribution over each  step.
#Reshape to match LSTM's output shape, so that we can do element-wise multiplication.
model2.add(RepeatVector(hidden))
model2.add(Permute(2, 1))

#The final model which gives the weighted sum:
model = Sequential()
model.add(Merge([model1, model2], 'mul'))  # Multiply each element with corresponding weight a[i][j][k] * b[i][j]
model.add(TimeDistributedMerge('sum')) # Sum the weighted elements.

model.compile(loss='mse', optimizer='sgd')

Hope it helps.

@zhzou2020
Author

I found that this code does not work properly when the input to the LSTM is masked.
How can I make it work with masked input?

@philipperemy

@zzjin13 Can you clarify? Do you pad with 0 and pass mask=True to all your LSTM layers? Is that what you mean by masking?
Because the logic here is exactly the same, masked or not.

@ylqfp

ylqfp commented Apr 28, 2016

@farizrahman4u Thanks so much! I'll have a try.

@philipperemy

philipperemy commented May 23, 2017

I've just written a very simple Hello world for attention with visualisations (with the new Keras syntax)

Have a look: https://github.com/philipperemy/keras-simple-attention-mechanism

It might help you :)

@abali96

abali96 commented May 24, 2017

@philipperemy which form of attention is this? Is there a specific paper you referenced in developing your attention model? Thanks for open sourcing by the way!

@philipperemy

philipperemy commented May 25, 2017

Thanks for your feedback! It's the basic attention mechanism where you derive a probability distribution over your time states for an n-D time series (no encoder-decoder here).

@abali96 I didn't have any papers in mind when I implemented it.

But a good paper you can have a look at is this one:

@nnulcm

nnulcm commented Jul 19, 2017

Can I get pre-trained attention layer weights, such as the probability distribution over each word?

@xiaoleihuang

@philipperemy check out Bengio's paper.
I think some steps are missing from your implementation.

@philipperemy

@xiaoleihuang this one is for Neural Machine Translation, basically sequence to sequence attention. My implementation does not deal with this.

@xiaoleihuang

@philipperemy Hi, I am not sure which formulas you based this on. Looking at the formulas in the paper, I don't see the following steps in your attention_3d_block function (the formulas are sketched after this list for reference):

  1. the dot product for the scores e_ij is not computed;
  2. the weights a_ij are not normalized;
  3. there is a multiplication step in your merge, but shouldn't it be followed by a sum?
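For reference, the corresponding steps in the Bahdanau et al. attention formulation are roughly:

e_{ij} = a(s_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})}, \qquad
c_i = \sum_{j} \alpha_{ij} \, h_j

i.e. score each annotation h_j against the previous decoder state, normalize the scores with a softmax, and take the weighted sum of the annotations as the context vector.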

@philipperemy

@xiaoleihuang I didn't base my implementation on any known paper. My attention here is just a softmax mask inside the network. It basically gives you a normalized distribution of the importance of each time step (or unit) regarding an input.

Intrinsically, it should not help the model perform better but it should help the user understand which time steps contribute to the prediction of the model.
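A minimal sketch of this kind of softmax mask in the Keras 2 functional API (the dimensions and layer arrangement below are illustrative assumptions, not taken verbatim from the linked repository):

from keras.layers import Input, LSTM, Dense, Flatten, Activation, RepeatVector, Permute, Multiply, Lambda
from keras.models import Model
from keras import backend as K

time_steps, input_dim, units = 20, 8, 32                    # assumed toy dimensions

inputs = Input(shape=(time_steps, input_dim))
activations = LSTM(units, return_sequences=True)(inputs)    # (batch, time_steps, units)

# Score each time step, normalize with a softmax, then reweight the LSTM output.
scores = Dense(1)(activations)                              # (batch, time_steps, 1)
scores = Flatten()(scores)                                  # (batch, time_steps)
weights = Activation('softmax')(scores)                     # attention distribution over time steps
weights = RepeatVector(units)(weights)                      # (batch, units, time_steps)
weights = Permute((2, 1))(weights)                          # (batch, time_steps, units)

weighted = Multiply()([activations, weights])               # element-wise reweighting
context = Lambda(lambda x: K.sum(x, axis=1))(weighted)      # weighted sum -> (batch, units)

outputs = Dense(1, activation='sigmoid')(context)
model = Model(inputs=inputs, outputs=outputs)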

@xiaoleihuang

Hi @philipperemy, I see, and I can understand your intuition. You take the input, compute a kind of "weights" from it, and let the neural network optimize them automatically (Permute -> Reshape -> Dense). In order to do the matrix multiplication, you repeat the output from the Dense layer. But there is an issue with your implementation: the attention defined in the paper is a dot product, whereas yours is a vector. What theory supports such an operation? I am a little confused, but I think yours is a good idea. Additionally, I found there might be some issue with the K.function part in get_activations.

@bicepjai

bicepjai commented Sep 9, 2017

Is this issue/discussion related to #4962?

@v1nc3nt27

@xiaoleihuang There is a dot product inside the dense, isn't there? The permuted Input is multiplied with the weight matrix in the dense layer.

@philipperemy Did you manage to use your attention mechanism successfully in a real project? I've tested it, but the score doesn't really change and the words highlighted look rather random from what I can see. It would be nice to see it working in a bigger context. By the way, what is the Reshape layer for?

@xu-song

xu-song commented Jan 16, 2018

@philipperemy @farizrahman4u
There is a shape mistake in farizrahman4u's code above; I would revise it as follows:

from keras import backend as K

#The weight model  - actual output shape  = (batch, step)
# after reshape : output_shape = (batch, step, hidden)
model2 = Sequential()                            # input_shape  = (batch, step, input_dim)
model2.add(Lambda(lambda x: K.mean(x, axis=2)))  # output_shape = (batch, step)
model2.add(Activation('softmax'))                # output_shape = (batch, step)
model2.add(RepeatVector(hidden))                 # output_shape = (batch, hidden, step)
model2.add(Permute((2, 1)))                      # output_shape = (batch, step, hidden)

@Ashima16

Hi @farizrahman4u, I am getting an "__init__() takes 2 positional arguments but 3 were given" error in the code you posted above. Could you please help? I am new to all this.

@caugusta

Hi @Ashima16, that usually means you've provided too many arguments to a function or method. Make sure, for example, that you're passing only model1 and model2.

@Ashima16

Hi @caugusta, thanks for the response.

I tried executing the same code as above with step value = 1, and I am getting the following error:

TypeError                                 Traceback (most recent call last)
in ()
     17 #Reshape to match LSTM's output shape, so that we can do element-wise multiplication.
     18 model2.add(RepeatVector(hidden))
---> 19 model2.add(Permute(2, 1))
     20
     21 #The final model which gives the weighted sum:

TypeError: __init__() takes 2 positional arguments but 3 were given

@caugusta

Hi @Ashima16, can you run the original code? If not, then it might be that Keras has updated the API since this code was written. Permute() might no longer work the way the original code expects.

@likejazz

@Ashima16
It needs another pair of brackets. Try this:

model2.add(Permute((2, 1)))

@seanxiangct

I found that this code does not work properly when the input to the LSTM is masked.
How can I make it work with masked input?

Hi @zzjin13,

I encountered the same issue with masked input, but I don't think it is related to the implementation of the attention model.

Since the attention model tries to learn a weighting of the inputs from the input itself, a masked input with leading/trailing 0s in the sequence will trick the attention model into thinking that this unique pattern is what it should pay attention to, so your model ends up paying attention to the mask rather than to the actual inputs.

I would recommend applying sample weighting to your training sequences so that the mask doesn't contribute to the gradient update. You can pass a weighting matrix to the sample_weight argument of model.fit(); a sketch follows.
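A rough sketch of that suggestion, assuming a model with one target per time step (sample_weight_mode='temporal' needs per-timestep targets); X_train, y_train and the zero-padding convention here are hypothetical:

import numpy as np

# Weight 0 for padded (all-zero) time steps, 1 elsewhere -- assumes X_train has shape (samples, time_steps, features)
sample_weights = (np.abs(X_train).sum(axis=-1) > 0).astype('float32')   # (samples, time_steps)

model.compile(loss='categorical_crossentropy', optimizer='adam',
              sample_weight_mode='temporal')
model.fit(X_train, y_train, sample_weight=sample_weights)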

@kzhang123

for i in range(a0):
    for k in range(a2):
        c[i][k] = (a[i, :, k] * b[i]).sum()

So the three loops can be reduced to two; a fully vectorized version is sketched below.
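Both remaining loops can be dropped as well; a sketch of the fully vectorized NumPy equivalent:

import numpy as np

# a: (batch, step, hidden), b: (batch, step)
c = np.einsum('ijk,ij->ik', a, b)   # c: (batch, hidden)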

@siddhartha-mukherjee-india

I wrote the following code, inspired by the discussion above.

print('Build model...')
_input = Input(shape=[maxlen],)
embedded = Embedding(
    input_dim=27,
    output_dim=embedding_size,
    input_length=maxlen,
    trainable=True,
    mask_zero=False
)(_input)

activations = LSTM(units, dropout=0.4, return_sequences=True)(embedded)
activations = Dropout(0.4)(activations)
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# sent_representation = merge([input, attention], mode='mul')  # compilation error, so changed to Multiply()
sent_representation = Multiply()([activations, attention])
sent_representation = K.sum(sent_representation, axis=2)  # whatever value I put in axis, I get an error. Please help.

sent_representation = Dropout(0.4)(sent_representation)

prediction = Dense(numclasses, activation='softmax')(sent_representation)

model = Model(inputs=_input, outputs=prediction)
# Checkpoint to pick the best model.
checkpoint = ModelCheckpoint(filepath=CHECK_PT_FILE, verbose=1, monitor='val_loss',
                             save_best_only=True, mode='auto')
model.compile(loss='categorical_crossentropy',
              optimizer='adamax',
              metrics=['accuracy'])

I need help to resolve the issue with K.sum(). If I remove it, I am able to start training, but I believe the sum is needed for attention.
Can anyone please help me resolve this issue?
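One possible fix, assuming the standard Keras functional API: K.sum returns a raw backend tensor that the following layers cannot consume, so the reduction can be wrapped in a Lambda layer (axis=1 sums over the time dimension here):

from keras.layers import Lambda, Multiply
from keras import backend as K

sent_representation = Multiply()([activations, attention])
# Wrap the backend sum in a Lambda layer so the result stays a Keras tensor.
sent_representation = Lambda(lambda x: K.sum(x, axis=1))(sent_representation)   # (batch, units)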
