How to add Attention on top of a Recurrent Layer (Text Classification) #4962

Closed
cbaziotis opened this Issue Jan 7, 2017 · 110 comments


cbaziotis commented Jan 7, 2017

I am doing text classification. I am using my pre-trained word embeddings, with an LSTM layer on top and a softmax at the end.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import activity_l2

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

Pretty simple. Now I want to add attention to the model, but I don't know how to do it.

My understanding is that I have to set return_sequences=True so that the attention layer can weigh each timestep accordingly. This way the LSTM will return a 3D tensor, right?
After that, what do I have to do?
Is there a way to easily implement a model with attention using Keras layers, or do I have to write my own custom layer?

If this can be done with the available Keras layers, I would really appreciate an example.

patyork commented Jan 7, 2017

It's been a while since I've used attention, so take this with a grain of salt.

return_sequences does not necessarily need to be True for attention to work; the underlying computation is the same, and this flag should be used only based on whether you need one output or an output for each timestep.

As for implementing attention in Keras, there are two possible methods: a) add a hidden Activation layer for the softmax, or b) change the recurrent unit to have a softmax.

On option a): this would apply attention to the output of the recurrent unit, but not to the output/input passed to the next timestep. I don't think this is what is desired. In this case, the LSTM should have a squashing function applied, as LSTMs don't do too well with linear/ReLU-style activations.

On option b): this would apply attention to the output of the recurrency, and also to the output/input passed to the next timestep. I think this is what is desired, but I could be wrong. In this case, the linear output of the neurons would be squashed directly by the softmax; if you wish to apply a pre-squashing such as sigmoid or tanh before the softmax calculation, you would need a custom activation that does both in one step.

I could draw a diagram if necessary, and I should probably read the attention papers again..

cbaziotis commented Jan 7, 2017

@patyork Thanks for the reply.
Do you have a good paper (or papers) in mind on attention? I am reading a lot about attention and I want to try it out, because I really like the idea. But even though I think I understand the concept, I don't have a clear understanding of how it works and how to implement it.

If it is possible, I would like someone to offer an example in Keras.

PS: is this the correct place to ask such a question, or should I do it at https://groups.google.com/d/forum/keras-users?

patyork commented Jan 7, 2017

@cbaziotis This area is supposed to be more for bugs, as opposed to "how to implement" questions. I admit I don't often look at the Google group, but that is a valid place to ask these questions, as well as the Slack channel.

Bengio et al. have a pretty good paper on attention (soft attention is the softmax attention).

An example of method a) I described:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Activation('softmax'))  # this guy here
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

Example b), with a simple activation:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation='softmax'))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

Example b), with a squashing (tanh) and then softmax (non-working, but it conveys the idea):

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

from keras import backend as K

def myAct(out):
    return K.softmax(K.tanh(out))

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation=myAct))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

In addition, I should say that my notes about whether a) or b) above is what you probably need are based on your example, where you want one output (making option b probably the correct way). Attention is often used in spaces like caption generation, where there is more than one output, such as when setting return_sequences=True. For those cases, I think option a) is the described usage, such that the recurrency keeps all the information passing forward, and it's just the higher layers that utilize the attention.

cbaziotis commented Jan 7, 2017

@patyork Thanks for the examples and for the paper. I knew that posting here would get more attention :P

I will try them and post back.

mbollmann commented Jan 10, 2017

@patyork, I'm sorry, but I don't see how this implements attention at all?

From my understanding, the softmax in the Bengio et al. paper is not applied over the LSTM output, but over the output of an attention model, which is calculated from the LSTM's hidden state at a given timestep. The output of the softmax is then used to modify the LSTM's internal state. Essentially, attention is something that happens within an LSTM since it is both based on and modifies its internal states.
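To make the distinction concrete, here is a minimal NumPy sketch (not Keras, and not from any particular implementation) of one step of the soft attention described above: the scores are computed from both the encoder hidden states and the current decoder state, and the softmax output weights the states. All names (`W`, `U`, `v`) are illustrative stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

T, hidden = 5, 4                     # timesteps, hidden size
H = rng.normal(size=(T, hidden))     # encoder hidden states h_1..h_T
s = rng.normal(size=hidden)          # current decoder state
W = rng.normal(size=(hidden, hidden))
U = rng.normal(size=(hidden, hidden))
v = rng.normal(size=hidden)

# scores depend on BOTH the hidden states and the current state s
e = np.tanh(H @ W + s @ U) @ v       # one scalar score per timestep, shape (T,)
alpha = np.exp(e) / np.exp(e).sum()  # softmax over timesteps
context = alpha @ H                  # weighted sum of hidden states, shape (hidden,)
```

The key point is that `alpha` changes whenever `s` changes, so each decoding step can attend to different input timesteps.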

I actually made my own attempt to create an attentional LSTM in Keras, based on the very same paper you cited, which I've shared here:

https://gist.github.com/mbollmann/ccc735366221e4dba9f89d2aab86da1e

There are several different ways to incorporate attention into an LSTM, and I won't claim 100% correctness of my implementation (though I'd appreciate any hints if something seems terribly wrong!), but I'd be surprised if it was as simple as adding a softmax activation.

cbaziotis commented Jan 10, 2017

@mbollmann You are correct that none of the solutions @patyork suggested is what I want. I want to get a weight distribution (importance) over the outputs from each timestep of the RNN, like in the paper "Hierarchical Attention Networks for Document Classification", but in my case I just want the representation of a sentence. I am trying to implement this using the available Keras layers.

A similar idea is in this paper.

mbollmann commented Jan 10, 2017

@cbaziotis That indeed looks conceptually much simpler. I could only take a very short glance right now, but is there a specific point where you got stuck?

cbaziotis commented Jan 10, 2017

@mbollmann Please do if you can.
I am trying to implement it right now, and trying to understand the Keras API.

I don't have a working solution, but I think I should set return_sequences=True in the RNN in order to get the intermediate outputs, and masking=False.
On top of that, I am thinking I should put a TimeDistributed(Dense(1)) with a softmax activation. But I haven't figured out how to put everything together.

Also, I think that setting masking=False won't affect performance, as the attention layer will assign the correct weights to the padded words. Am I right?

Edit: to clarify, I want to implement an attention mechanism like the one in [1]:
[image: attention mechanism diagram]

  1. Zhou, Peng, et al. "Attention-based bidirectional long short-term memory networks for relation classification." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016.
cbaziotis commented Jan 10, 2017

I tried this:

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = embeddings_layer(embeddings=embeddings_matrix,
                            trainable=False, masking=False, scale=False, normalize=False)(_input)

activations = LSTM(64, return_sequences=True)(embedded)

# attention
attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)

activations = Merge([activations, attention], mode='mul')

probabilities = Dense(3, activation='softmax')(activations)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

and I get the following error:

  File "...\keras\engine\topology.py", line 1170, in __init__
    node_indices, tensor_indices)
  File "...\keras\engine\topology.py", line 1193, in _arguments_validation
    layer_output_shape = layer.get_output_shape_at(node_indices[i])
AttributeError: 'TensorVariable' object has no attribute 'get_output_shape_at'
mbollmann commented Jan 10, 2017

@cbaziotis The cause of the error is probably that you need to use the merge function (lowercase), not the Merge layer (uppercase).

Apart from that, as far as I understood it:

The part with the tanh activation (Equation 5 in Yang et al., Equation 9 in Zhou et al.) comes before the multiplication with a trained context vector/parameter vector which reduces the dimensionality to "one scalar per timestep". For Yang et al., that seems to be a Dense layer which doesn't yet reduce the dimensionality (though this is a little unclear to me), so I'd expect TimeDistributed(Dense(64, activation='tanh')). For Zhou et al., they just write "tanh", so you'd probably not even need a Dense layer, just the tanh activation after the LSTM.

For the multiplication with a trained context vector/parameter vector, I believe (no longer -- see EDIT) this might be a simple Dense(1) in Keras, without the TimeDistributed wrapper, since we want to have individual weights for each timestep, but I'm not totally sure about this and haven't tested it. I'd imagine something like this, but take this with a grain of salt:

    # attention after Zhou et al.
    attention = Activation('tanh')(activations)    # Eq. 9
    attention = Dense(1)(attention)                # Eq. 10
    attention = Flatten()(attention)               # Eq. 10
    attention = Activation('softmax')(attention)   # Eq. 10
    activations = merge([activations, attention], mode='mul')  # Eq. 11

(EDIT: Nope, doesn't seem that way, they train a parameter vector with dimensionality of the embedding, not a matrix with a timestep dimension.)
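As a plain-NumPy sketch of how Equations 9-11 of Zhou et al. read (this is an interpretation of the description above, not the authors' code; the parameter name `w` is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 50, 64                  # timesteps, LSTM units
H = rng.normal(size=(T, d))    # LSTM outputs, one row per timestep

w = rng.normal(size=d)         # trained parameter vector (dimensionality d, no timestep axis)

M = np.tanh(H)                 # Eq. 9
scores = M @ w                 # Eq. 10: one scalar score per timestep, shape (T,)
alpha = np.exp(scores)
alpha /= alpha.sum()           # softmax over the T timesteps
r = alpha @ H                  # Eq. 11: weighted sum of the timestep outputs, shape (d,)
```

Note that `w` has the dimensionality of the LSTM output, not of the timestep axis, which matches the EDIT above.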

patyork commented Jan 10, 2017

My apologies; this would explain why I was not impressed with the results from my "attention" implementation.

There is an implementation here that seems to be working for people.

cbaziotis commented Jan 10, 2017

@mbollmann you were right about the merge; it is different from Merge (see #2467).

I think this is really close:

units = 64
max_length = 50

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = embeddings_layer(embeddings=embeddings_matrix,
							trainable=False, masking=False, scale=False, normalize=False)(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = TimeDistributed(Dense(1, activation='tanh'))(activations) 
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# apply the attention
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=0))(sent_representation)
sent_representation = Flatten()(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

but I get an error because Lambda doesn't output the right dimensions. I should be getting [1, units], right?
What am I doing wrong?


Update: I tried explicitly passing the output_shape to Lambda, and the model compiles:

sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=0), output_shape=(units, ))(sent_representation)
# sent_representation = Flatten()(sent_representation)

but now I get the following error:

ValueError: Input dimension mis-match. (input[0].shape[0] = 128, input[1].shape[0] = 50)
Apply node that caused the error: Elemwise{Composite{(i0 * log(i1))}}(dense_2_target, Elemwise{Clip}[(0, 0)].0)
Toposort index: 155
Inputs types: [TensorType(float32, matrix), TensorType(float32, matrix)]
Inputs shapes: [(128, 3), (50, 3)]
Inputs strides: [(12, 4), (12, 4)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Sum{axis=[1], acc_dtype=float64}(Elemwise{Composite{(i0 * log(i1))}}.0)]]
cbaziotis commented Jan 10, 2017

Well, I found out why it wasn't working. I was expecting the input to Lambda to be (max_length, units), but it was (None, max_length, units), so I just had to change the axis to 1. This now works.

units = 64
max_length = 50
vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]


_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=trainable,
        mask_zero=masking,
        weights=[embeddings]
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = TimeDistributed(Dense(1, activation='tanh'))(activations) 
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# apply the attention
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

I would appreciate it if someone could verify that this implementation is correct.
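One way to sanity-check the wiring is to replay the RepeatVector/Permute/multiply/sum pipeline in plain NumPy and compare it against a direct batched weighted sum (a sketch with random weights standing in for the softmax output):

```python
import numpy as np

rng = np.random.default_rng(2)
batch, T, units = 2, 50, 64
activations = rng.normal(size=(batch, T, units))   # LSTM(return_sequences=True) output
alpha = rng.random(size=(batch, T))
alpha /= alpha.sum(axis=1, keepdims=True)          # softmax-like weights, one per timestep

# RepeatVector(units) + Permute([2, 1]) broadcasts alpha to (batch, T, units)
attention = np.repeat(alpha[:, None, :], units, axis=1).transpose(0, 2, 1)

# merge(mode='mul') followed by K.sum(axis=1)
sent = (activations * attention).sum(axis=1)       # shape (batch, units)

# the same thing written as a batched weighted sum over timesteps
expected = np.einsum('bt,btu->bu', alpha, activations)
assert np.allclose(sent, expected)
```

So the model above does compute "sum over timesteps of (attention weight x timestep output)", at least shape- and math-wise.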

mbollmann commented Jan 11, 2017

@cbaziotis Looks good to me. I re-read the description in Zhou et al., and the code looks like it does what they describe. I no longer understand how what they're doing does anything useful, since the attention model only depends on the input and applies the same weights at every timestep, but... that's probably just my insufficient understanding (I'm used to slightly different types of attention). :)

cbaziotis commented Jan 11, 2017

@mbollmann I am confused about the same thing. Can you give an example of the type of attention that you have in mind? I think that I have to put the word (embedding) into the calculation of the attention.

From what I understand, the Dense layer:

  1. assigns a different weight (importance) to each timestep
  2. BUT the importance is static. Essentially this means that each word position in a sentence has a different importance, but the importance comes from the position of the word and not the word itself.

I plotted the weights of TimeDistributed(Dense(1, activation='tanh'))(activations) in a heatmap:

[image: heatmap of the attention weights]
My interpretation is that the positions with big weights play a more important role, so the output of the RNN at those steps will have a bigger impact on the final representation of the sentence.

The problem is that this is static. If an important word happens to occur in a position with a small weight, then the representation of the sentence won't be good enough.

I would like some feedback on this, and preferably a good paper with a better attention mechanism.

mbollmann commented Jan 11, 2017

@cbaziotis Are you sure you don't have it the wrong way around?

The Dense layer takes the output of the LSTM at one timestep and transforms it. The TimeDistributed wrapper applies the same Dense layer with the same weights to each timestep -- which means the output of the calculation cannot depend on the position/timestep since the Dense layer doesn't even know about it.

So my confusion seems to be of a different nature than yours. :)

(In short: I don't see what calculating a softmax and multiplying the original vector by that gets you that a plain TimeDistributed(Dense(...)) couldn't already learn. However, I work on attentional models where the output is also a time-series, which means that I have multiple output timesteps for which the model should learn to attend to different input timesteps. I think that's not directly comparable to your situation, since you only have one output.)
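To illustrate the shared-weights point: what TimeDistributed(Dense(1, activation='tanh')) computes can be written in NumPy as one kernel applied to every row, so the score of a timestep depends only on its content, never on its position (a sketch; `W` and `b` stand in for the layer's learned parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 6, 4
X = rng.normal(size=(T, d))   # one sequence of LSTM outputs, one row per timestep
W = rng.normal(size=(d, 1))   # the single Dense(1) kernel shared across timesteps
b = rng.normal(size=1)

out = np.tanh(X @ W + b)      # what TimeDistributed(Dense(1, activation='tanh')) computes

# shuffling the timesteps just shuffles the outputs: the score of a timestep
# depends only on its content, not on where in the sequence it occurs
perm = rng.permutation(T)
assert np.allclose(np.tanh(X[perm] @ W + b), out[perm])
```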

patyork commented Jan 11, 2017

@mbollmann I'm also a bit confused (but I have been from the get-go). I think this blog post is fairly informative, or at least has some decent pictures.

So, @cbaziotis is using time series with multiple output steps (LSTM with return_sequences=True). The first Dense layer is applying weights over each individual timestep output from the LSTM, which I'm not sure accomplishes the intended behavior of looking at all the past activations and assigning weights to those, as in this picture:
[image: attention diagram from the blog post]

I'm thinking the code above is just the line a_{t,T} feeding into the attention layer at each timestep. The fallout of this is that the attention is just determining which activations are important, not which timesteps are important.

cbaziotis commented Jan 11, 2017

@mbollmann I thought that TimeDistributed applies different weights to each timestep...
In that case everything is wrong.
How can I make it so that I can apply different weights to each timestep?
Can this be done with the available Keras layers? Any hint?

patyork commented Jan 11, 2017

TimeDistributed applies the same weight set across every timestep.

You'd need to set up a standard Dense layer as a matrix, e.g. Dense(20) where 20 is the lookback length. You'd then feed examples of 20 timesteps to train. This is where I'm quite confused about implementing attention, as in theory the lookback looks like it should be infinite, not fixed at a certain length.

@cbaziotis cbaziotis closed this Jan 11, 2017

@cbaziotis cbaziotis reopened this Jan 11, 2017

cbaziotis commented Jan 11, 2017

Sorry for the mis-click.
So if I have inputs of constant length, let's say 50, then is this what I have to do?

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(50, activation='tanh')(activations)
attention = Flatten()(attention)
patyork commented Jan 11, 2017

Actually, no, I think you would just remove the TimeDistributed wrapper and keep Dense(1) - I need to implement it real quick and check some shapes though.

patyork commented Jan 11, 2017

So I guess that is what you are looking for.

  • 50 timesteps
  • Feeds into a regular Dense(1), which provides separate weights for the 50 timesteps
  • Calculates attention and multiplies against the 50 timesteps to apply attention
  • Sums (this reduces the 50 timesteps to 1 output; this is where this attention implementation differs from what most of what I've read describes)
  • Dense layer that produces output of shape (None, 3)

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=False
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)


sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

I think this (ugly) chart maps the above out pretty well; it's up to you to determine whether it makes sense for what you are doing:
[image: flow chart of the attention model]

cbaziotis commented Jan 11, 2017

@patyork Thanks! I think this is what is described in the paper.
What they are trying to do, from what I understand, is this: instead of using just the last output of the RNN, they use a weighted sum of all the intermediate outputs.

I have a question about this line:

sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

Why axis=-2? How does this sum the tensors? I am using axis=1.

cbaziotis commented Jan 11, 2017

Continuing from my last comment: this is what is described in the blog post that you mentioned. See after the image that you posted...

The y‘s are our translated words produced by the decoder, and the x‘s are our source sentence words. The above illustration uses a bidirectional recurrent network, but that’s not important and you can just ignore the inverse direction. The important part is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state. The a‘s are weights that define in how much of each input state should be considered for each output. So, if a_{3,2} is a large number, this would mean that the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence. The a's are typically normalized to sum to 1 (so they are a distribution over the input states).

What different kind of attention do you have in mind? In the article, attention is described in the context of machine translation. In my case (classification), I just want a better representation of the sentence.

patyork commented Jan 11, 2017

Yeah, after thinking about this, it makes sense. The softmax multiplication will weight the timestep outputs (most will be near zero, some nearer to 1), and so the sum of those will be close to the outputs of the "near to 1" timesteps - pretty clever.

In this case, axis=-2 is equivalent to axis=1; I use reverse indexing all the time so that I never have to remember that Keras includes the batch size (the None aspect) in those shapes. You ran into this gotcha earlier; using reverse indexing means I never have to think about that aspect - and you'll see that form of indexing throughout the actual Keras code for this reason.

I just mean that the implementation seems a little limiting - you have to set T=50 or another limit; it can't be infinite or undefined, which means you have to throw away the first T-1 (49) outputs/training outputs. As that image leads me to believe, T should be infinite/undefined/variable, something like what the TimeDistributed wrapper could provide. Perhaps this is a good thing, perhaps not - I haven't tried both ways (obviously).
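The axis equivalence is easy to check in NumPy: for a (batch, timesteps, units) tensor, axis=-2 and axis=1 address the same (timestep) axis, so the two Lambda variants in the thread compute the same sum.

```python
import numpy as np

# a (batch, timesteps, units) tensor, like the LSTM output in the thread
x = np.arange(2 * 50 * 64, dtype=float).reshape(2, 50, 64)

# summing over axis=-2 and axis=1 collapses the same (timestep) axis
assert np.allclose(x.sum(axis=-2), x.sum(axis=1))
assert x.sum(axis=-2).shape == (2, 64)   # (batch, units)
```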

mbollmann commented Jan 11, 2017

Phew, a lot happened here, and I think I agree with most of what was written. Using Dense(1) without the TimeDistributed wrapper was what I was already trying to argue for yesterday, some dozen posts above, so that seems correct to me as well in this scenario.

patyork commented Jan 11, 2017

@mbollmann I read that - it seems like you talked yourself out of it at some point, though, based on the edit. I was confusing/arguing with myself to no end throughout this entire issue as well.

I learned quite a bit though, at least.

cbaziotis commented Jan 11, 2017

@patyork @mbollmann Thank you both! I learned a lot.

Btw, after running some tests, I am not impressed. I see no obvious improvement compared to the classic scenario (using just the last timestep). But the idea is interesting...

@patyork This may be stupid, but what do you mean by saying:

it can't be an infinite or undefined variable, which means you have to throw away the first T-1 (49) outputs/training outputs.

Why are they thrown away? They are used in the weighted sum, aren't they?*
I agree that this is limiting, as it won't work with masking (series of varying length).

*Do you mean the timesteps that are padded to keep a constant length?

OptimusCrime commented Jan 30, 2017

Hey @cbaziotis :) Thank you so much for your work. This is just what I was looking for. I'm having some trouble getting your layers to work on my machine. Using Theano they compile, but in my model and system, using Theano as a backend is too slow.

Trying to run it with Tensorflow results in the following crash:

  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/models.py", line 327, in add
    output_tensor = layer(self.outputs[0])
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/engine/topology.py", line 569, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/engine/topology.py", line 632, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/engine/topology.py", line 164, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "~/code/rorschach/prediction/layer/attention_layer.py", line 66, in call
    eij = K.tanh(K.dot(x, self.W))
  File "~/.virtualenvs/master/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 799, in dot
    y_permute_dim = [y_permute_dim.pop(-2)] + y_permute_dim
IndexError: pop index out of range

Currently using Keras 1.2.0 and I've tried both Tensorflow 0.11.0 and 0.12.1 without luck.

cbaziotis commented Jan 31, 2017

@OptimusCrime I am using Theano as a backend and have not experienced any slowdowns. Are you sure the attention layers are the reason for the slowdowns?

BTW, in TensorFlow, if you are using the AttentionWithContext layer, the dot doesn't work, as pointed out here, so what you have to do is:

Replace this:

uit = K.dot(x, self.W)

if self.bias:
    uit += self.b

uit = K.tanh(uit)
ait = K.dot(uit, self.u) # replace this

a = K.exp(ait)

With this:

uit = K.dot(x, self.W)

if self.bias:
    uit += self.b

uit = K.tanh(uit)

mul_a = uit * self.u  # with this
ait = K.sum(mul_a, axis=2)  # and this

a = K.exp(ait)

Also, please look at the updated gists, as I have updated them with a fix.
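The replacement works because a dot product with a vector is just a broadcasted elementwise multiply followed by a sum over the feature axis. A NumPy sketch of the equivalence (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
batch, T, d = 2, 5, 8
uit = rng.normal(size=(batch, T, d))   # per-timestep hidden representations
u = rng.normal(size=d)                 # the layer's context vector

# the workaround: broadcasted multiply, then sum over the feature axis
ait_workaround = (uit * u).sum(axis=2)  # shape (batch, T)

# equivalent to the original K.dot(uit, self.u)
ait_dot = uit @ u
assert np.allclose(ait_workaround, ait_dot)
```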

miguelwon commented Feb 24, 2017

Hello. Thanks for the code, @cbaziotis.
I was having the same problem, and now it works with no errors. But there is still a problem with the output dimensions. I tried this:

inputs = [[[0,0,0],[0,0,0],[0,0,0],[0,0,0]],[[1,2,3],[4,5,6],[7,8,9],[10,11,12]],[[10,20,30],[40,50,60],[70,80,90],[100,110,120]]]

hidden_size = 6
sent_size = 4
doc_size = 3

model = Sequential()
model.add(LSTM(hidden_size,input_shape = (sent_size,doc_size),return_sequences = True))
model.add(AttentionWithContext())

print "First layer:"
intermediate_layer_model = Model(input=model.input,output=model.layers[0].output)
print intermediate_layer_model.predict(inputs)
print ""
print "Second layer:"
intermediate_layer_model = Model(input=model.input,output=model.layers[1].output)
print intermediate_layer_model.predict(inputs)

and it is giving me this result:

First layer:
[[[ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]
  [ 0.          0.          0.          0.          0.          0.        ]]

 [[ 0.04093511 -0.00982957 -0.          0.25834009 -0.39604828 -0.169927  ]
  [ 0.         -0.         -0.          0.68305802 -0.73000526 -0.1271846 ]
  [ 0.         -0.         -0.          0.79648596 -0.83882242 -0.        ]
  [ 0.         -0.         -0.          0.79895407 -0.79928428 -0.        ]]

 [[ 0.          0.         -0.          0.23120573 -0.76159418 -0.32464135]
  [ 0.          0.         -0.          0.76159418 -0.76159418 -0.        ]
  [ 0.          0.         -0.          0.76159418 -0.76159418 -0.        ]
  [ 0.          0.         -0.          0.76159418 -0.76159418 -0.        ]]]

Second layer:
[[ 0.          0.          0.          0.          0.          0.        ]
 [ 0.00770082 -0.00184916  0.          0.66687739 -0.71645236 -0.06456213]
 [ 0.          0.          0.          0.68619043 -0.76159418 -0.04615331]]

Shouldn't the attention output have dimensions (samples, features), which in this case should be (3, 4)?

cbaziotis commented Feb 24, 2017

No. All the attention layer does is compute a weighted sum of the outputs of the RNN.
In your case, for example:

  1. the first layer outputs 3 tensors of shape (4, 6).
  2. The weighted sum of a (4, 6) tensor is a (1, 6) tensor (a 6-dimensional vector). We compress each column, not each row.
  3. Then at the second layer you have a (3, 6) tensor, which is correct.
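In NumPy terms, the shape bookkeeping for this example looks like the following (random weights standing in for the learned attention; this is only a shape sketch, not the layer's code):

```python
import numpy as np

rng = np.random.default_rng(5)
samples, timesteps, hidden = 3, 4, 6
H = rng.normal(size=(samples, timesteps, hidden))  # the LSTM output from the example above
alpha = rng.random(size=(samples, timesteps))
alpha /= alpha.sum(axis=1, keepdims=True)          # one weight per timestep, per sample

out = np.einsum('st,sth->sh', alpha, H)            # weighted sum over the timestep axis
assert out.shape == (3, 6)                         # (samples, hidden), not (samples, timesteps)
```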
dupsys commented Mar 17, 2017

Hi guys,
I have the following model, which takes an input sentence as one-hot vectors and corrects words that are not in the standard English vocabulary. How can I introduce an attention mechanism into the model, so that the output attends to the relevant information that gives the sentence its meaning?

hiddenStateSize = 256
hiddenLayerSize = 256
model = Sequential()

The output of the LSTM layer are the hidden states of the LSTM for every time step.

model.add(GRU(hiddenStateSize, return_sequences = True, input_shape=(maxSequenceLength, len(char_2_id))))
model.add(Dense(1, activation='tanh'))
model.add(Flatten())
model.add(Activation('softmax'))

I got stuck from this moment

model.add(TimeDistributed(Dense(hiddenLayerSize)))
model.add(TimeDistributed(Activation('relu')))
model.add(TimeDistributed(Dense(len(char_2_id))))
model.add(TimeDistributed(Activation('softmax')))

----SGD-------

sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

define model

%time model.compile(loss='categorical_crossentropy', optimizer = sgd , metrics=['accuracy'])

@nigeljyng

This comment has been minimized.

Copy link
Contributor

nigeljyng commented Apr 6, 2017

@cbaziotis I've been using your AttentionWithContext code at https://gist.github.com/cbaziotis/7ef97ccf71cbc14366835198c09809d2

For some reason the output shape is wrong. See the model.summary() output below:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
text_input (InputLayer)      (None, 100)               0
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 100)          2361000
_________________________________________________________________
masking_1 (Masking)          (None, 100, 100)          0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 256)          175872
_________________________________________________________________
attention_with_context_1 (At (None, 100, 256)          66048
_________________________________________________________________
output (Dense)               (None, 100, 34)           8738
=================================================================

Shouldn't attention_with_context_1 have an output shape of (None, 256), as listed in the documentation for your function? It should output a 2D tensor of shape (samples, features). The peculiar thing is that when I retrieve the layer's output directly, it shows the correct shape:

>>> att_layer.output
<tf.Tensor 'attention_with_context_1/Sum_2:0' shape=(?, 256) dtype=float32>
>>> # but this returns the wrong shape
>>> att_layer.output_shape
(None, 100, 256)

Any ideas?

@nigeljyng

This comment has been minimized.

Copy link
Contributor

nigeljyng commented Apr 6, 2017

@cbaziotis Found the issue. Turns out if you write a custom layer and it modifies the input shape, you need a compute_output_shape method. See here for a fork that now works.

>>> model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
text_input (InputLayer)      (None, 100)               0
_________________________________________________________________
embedding_2 (Embedding)      (None, 100, 100)          2361000
_________________________________________________________________
masking_4 (Masking)          (None, 100, 100)          0
_________________________________________________________________
bidirectional_5 (Bidirection (None, 100, 256)          175872
_________________________________________________________________
attention_with_context_4 (At (None, 256)               66048
_________________________________________________________________
output (Dense)               (None, 34)                8738
=================================================================
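The compute_output_shape override mentioned above can be sketched in isolation. The skeleton below is a hypothetical minimal stand-in (not the actual layer from the gist), shown only to illustrate the shape contract:

```python
class Layer:
    """Stand-in base class so this sketch runs without Keras installed."""
    pass

class Attention(Layer):
    # Hypothetical skeleton: only the shape logic is shown.
    def compute_output_shape(self, input_shape):
        # The layer sums over the time axis, so (samples, steps, features)
        # collapses to (samples, features).
        return (input_shape[0], input_shape[-1])

# The summary above reports (None, 100, 256) -> (None, 256):
assert Attention().compute_output_shape((None, 100, 256)) == (None, 256)
```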
@ylmeng

This comment has been minimized.

Copy link

ylmeng commented May 17, 2017

OK, I did not read the whole discussion, but Zhou, Peng, et al. say H is a matrix where every column has the dimensionality of a word vector. Why is that? I think it should be the number of units of the LSTM layer, which can of course be chosen equal to the word-vector dimensionality, but it does not have to be?

@philipperemy

This comment has been minimized.

Copy link

philipperemy commented Jun 6, 2017

Hey, have a look at this repo:

https://github.com/philipperemy/keras-attention-mechanism

It shows how to build an attention module on top of a recurrent layer.

Thanks

@jbrownlee

This comment has been minimized.

Copy link

jbrownlee commented Jun 29, 2017

@philipperemy I tested your approach. Indeed you can learn an attention vector, but testing across a suite of contrived problems, I see the model is just as skillful as a plain Dense + LSTM combination. Attention is an optimization that should lift skill or decrease training time for the same skill. Perhaps you have an example where your approach is more skillful than a straight Dense + LSTM setup with the same resources?

@cbaziotis After testing, I believe your attention method is something new/different inspired by Bahdanau, et al. [1]. It does not appear skillful on contrived problems either. Perhaps you have a good demonstration of where it does do well?

@mbollmann is correct as far as I can tell. The attention approach of Bahdanau, et al. requires access to the decoder hidden state (decoder output) of the last time step in order to compute the current time step (s_i-1 in the paper). This is unavailable unless you write your own layer and access it.

[1] https://arxiv.org/pdf/1409.0473.pdf

@NMRobert

This comment has been minimized.

Copy link

NMRobert commented Jul 7, 2017

@jbrownlee
Would it be possible to share some of these 'test' case contrived problems? It would be extremely helpful in terms of debugging and evaluating the efficacy of various attention implementations.

@dudeperf3ct

This comment has been minimized.

Copy link

dudeperf3ct commented Jul 8, 2017

@cbaziotis, how will the above attention mechanism work for the IMDB example in Keras? The input size is (5000, 80) (max_length=80) and the output is (5000,). This is the model for training:

	input_ = Input(shape=(80,), dtype='float32')
	print (input_.get_shape())                       #(?, 80)
	input_embed = Embedding(max_features, 128 ,input_length=80)(input_)
	print (input_embed.get_shape())                  #(?, 80, 128)

	activations = LSTM(64, return_sequences=True)(input_embed)
	attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
	attention = Flatten()(attention)
	attention = Activation('softmax')(attention)
	attention = RepeatVector(64)(attention)
	attention = Permute([2, 1])(attention)	
	print (activations.get_shape())                   #(?, ?, 64)
	print (attention.get_shape())                     #(?, ?, 64)

	sent_representation = merge([activations, attention], mode='mul')
	sent_representation = Lambda(lambda x_train: K.sum(x_train, axis=1), output_shape=(5000,))(sent_representation)
	print (sent_representation.get_shape())           #(?, 64)
	probabilities = Dense(1, activation='softmax')(sent_representation)      #Expected (5000,)
	model = Model(inputs=input_, outputs=probabilities)
	model.summary()

 Error : ValueError: Dimensions must be equal, but are 64 and 5000 for 'dense_2/MatMul' (op: 'MatMul') with input shapes: [?,64], [5000,1].
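One plausible reading of that error, sketched in numpy (a guess from the shapes involved, not a confirmed diagnosis): the Lambda actually produces a (batch, 64) tensor, so its declared output_shape should be (64,) rather than (5000,), and the following Dense then sees 64 input features.

```python
import numpy as np

# Shape walk-through of the snippet above, with the same illustrative sizes:
batch, timesteps, units = 5000, 80, 64
sent = np.zeros((batch, timesteps, units), dtype=np.float32)  # output of the multiply merge
summed = sent.sum(axis=1)                                     # what the Lambda computes
assert summed.shape == (batch, units)  # (5000, 64): output_shape should be (64,), not (5000,)
```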
@danieljf24

This comment has been minimized.

Copy link

danieljf24 commented Jul 20, 2017

Hi, @cbaziotis Thanks for your code.
As you did not treat the padded words specially, I am wondering whether the attention mechanism will assign the correct weights (close to zero) to the padded words.

@cbaziotis

This comment has been minimized.

Copy link

cbaziotis commented Jul 29, 2017

If you read the thread carefully you will see that I have posted the updated versions of the layers. Here you go:

model.add(LSTM(64, return_sequences=True))
model.add(AttentionWithContext())
# next add a Dense layer (for classification/regression) or whatever...
model.add(LSTM(64, return_sequences=True))
model.add(Attention())
# next add a Dense layer (for classification/regression) or whatever...

And as I said, the layers take the mask into account.
Edit: also note that I have not tested them with Keras 2, but I imagine you will only need to make some minor syntactic changes.
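The mask handling can be illustrated in isolation. The usual trick (a sketch of the general technique, not the exact code of these layers) is to push the scores of padded timesteps to -inf before the softmax, so they receive zero weight:

```python
import numpy as np

# Pre-softmax attention scores for 5 timesteps; the last two are padding.
scores = np.array([2.0, 1.0, 0.5, 0.0, 0.0])
mask = np.array([1, 1, 1, 0, 0], dtype=bool)
masked = np.where(mask, scores, -np.inf)        # suppress padded positions
weights = np.exp(masked - masked[mask].max())   # numerically stable softmax numerator
weights /= weights.sum()
assert np.allclose(weights[~mask], 0.0)         # padded steps get zero weight
assert np.isclose(weights.sum(), 1.0)
```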

@jiangchao123

This comment has been minimized.

Copy link

jiangchao123 commented Sep 26, 2017

Does attention + LSTM improve accuracy in text classification? On my dataset, I find no difference compared to mean pooling + LSTM.

@shaifugpt

This comment has been minimized.

Copy link

shaifugpt commented Nov 21, 2017

@cbaziotis I have a query regarding the attention:
activations = LSTM(neu, activation='relu', return_sequences=True, return_state=True)(inputs)
This applies attention to the output of the LSTM. Does that mean it operates on h (the hidden state), where h = o_t * tanh(c_t)?

I read somewhere that in
activations, hh, cc = LSTM(neu, activation='relu', return_sequences=True, return_state=True)(inputs)

hh is the hidden state and cc is the cell state. Are hh and cc the final hidden and cell states?

Also, what is the difference between Attention and AttentionWithContext?

@Ravin0512

This comment has been minimized.

Copy link

Ravin0512 commented Mar 7, 2018

Hi @cbaziotis, thank you very much for the attention code!
A question about it: how can I get the attention weight of each sentence and each word, in order to visualize the work like this?
[attention weight heatmap visualization image]

@spate141

This comment has been minimized.

Copy link

spate141 commented May 3, 2018

@Ravin0512 Any updates?

@cbaziotis

This comment has been minimized.

Copy link

cbaziotis commented May 5, 2018

@Ravin0512 i recently made this tool https://github.com/cbaziotis/neat-vision

Just make sure to return the attention scores besides the final representation of the sentence from the attention layer.

@saxenarohit

This comment has been minimized.

Copy link

saxenarohit commented Jun 20, 2018

@cbaziotis As for sharing the weights across time steps, I think it is fine. Even Andrew Ng's Sequence Models course uses a shared-weight implementation.

@deltaxp

This comment has been minimized.

Copy link

deltaxp commented Jun 21, 2018

  1. Can one make the attention model shorter by using the dot function from keras.layers?

inputs=Input(shape=(input_len,))
embedded=Embedding(input_dim, embedding_dim)(inputs)
activation=LSTM(hidden_dim, return_sequences=True)(embedded)
attention=TimeDistributed(Dense(1,use_bias=False, activation='linear'))(activation)
attention=Flatten()(attention)
attention=Activation('softmax')(attention)
representation=dot([attention,activation],axes=1)

Isn't it the same as the long version?
attention=TimeDistributed(Dense(1,use_bias=False, activation='linear'))(activation)
attention=Flatten()(attention)
attention=Activation('softmax')(attention)
attention=RepeatVector(hidden_dim)(attention)
attention=Permute([2,1])(attention)
activation=multiply([attention,activation])
representation=Lambda(lambda x: K.sum(x,axis=1))(activation)

The dot function contracts the tensors along axis 1 (the time axis): r = sum_t a_t * h_t.

The Dense layer that produces the attention scores shouldn't have a bias, since according to Zhou et al. the weights act only on the components of the hidden states; furthermore, in Zhou's model a linear activation is enough.

As far as I understand, the attention Dense layer has to be time-distributed: because its weights act on the hidden-state components, they play more or less the same role as the matrices inside the recurrent layer, which all share their weights over time.

The time dependence of the attention factors arises from the differences between the hidden states: the components differ, and therefore alpha(t) = softmax(w^T * h_t) differs per timestep.
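The claimed equivalence between the dot version and the long version can be verified numerically with random data (shapes as in the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
att = rng.random((3, 20))                       # attention scores, (batch, timesteps)
att /= att.sum(axis=1, keepdims=True)           # normalize like a softmax
act = rng.random((3, 20, 128))                  # RNN hidden states, (batch, timesteps, hidden_dim)

# dot version: contract the time axis directly
dot_version = np.einsum('bt,bth->bh', att, act)

# long version: repeat weights across hidden_dim, multiply, sum over time
rep = np.repeat(att[:, :, None], 128, axis=2)   # RepeatVector + Permute
long_version = (rep * act).sum(axis=1)          # multiply + Lambda(K.sum)

assert np.allclose(dot_version, long_version)
```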

@stevewyl

This comment has been minimized.

Copy link

stevewyl commented Jun 22, 2018

@Ravin0512 I just found an ugly method.
First you need a function that runs the network up to the layer before your attention layer (here the attention layer is the fourth layer):
sent_before_att = K.function([sent_model.layers[0].input, K.learning_phase()], [sent_model.layers[2].output])
Then take out the attention layer's weights:
sent_att_w = sent_model.layers[3].get_weights()
And use the sent_before_att function to get the vectors output by the layer before the attention layer:
sent_each_att = sent_before_att([sentence, 0])
In addition, you need to define a function to calculate the attention weights; here it is named cal_att_weights — you can use numpy to reproduce the same computation you defined in the attention layer.
Finally sent_each_att holds the attention weights you want:
sent_each_att = cal_att_weights(sent_each_att, sent_att_w)
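For reference, a hypothetical cal_att_weights in plain numpy, assuming a Yang-style attention-with-context layer whose get_weights() returns [W, b, u] (the names, shapes, and weight order here are assumptions for illustration, not the actual gist code):

```python
import numpy as np

def cal_att_weights(hidden, W, b, u):
    # Hypothetical numpy mirror of an attention-with-context forward pass:
    # u_it = tanh(h_it W + b); a_it = softmax(u_it . u)
    uit = np.tanh(hidden @ W + b)        # (timesteps, attention_dim)
    ait = uit @ u                        # (timesteps,)
    ait = np.exp(ait - ait.max())        # numerically stable softmax
    return ait / ait.sum()

rng = np.random.default_rng(1)
hidden = rng.normal(size=(20, 256))             # pre-attention vectors, one sentence
W = rng.normal(scale=0.05, size=(256, 256))     # assumed layer weights
b = np.zeros(256)
u = rng.normal(scale=0.05, size=256)            # context vector
w = cal_att_weights(hidden, W, b, u)
assert w.shape == (20,) and np.isclose(w.sum(), 1.0)
```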

@stevewyl

This comment has been minimized.

Copy link

stevewyl commented Jun 22, 2018

@cbaziotis the best attention visualization tools I have ever seen 👍

@fuchami

This comment has been minimized.

Copy link

fuchami commented Jul 11, 2018

I want to produce a regression output with an attention LSTM.

I tried this:
def Attention_LSTM(self):

    _input = Input(shape=(self.seq_length, self.feature_length,))

    LSTM_layer = LSTM(self.n_hidden, return_sequences=True)(_input)

    # Attention layer
    attention = TimeDistributed(Dense(1, activation='tanh'))(LSTM_layer)
    attention = Flatten()(attention)
    attention = Activation('softmax')(attention)
    attention = RepeatVector(self.n_hidden)(attention)
    attention = Permute([2,1])(attention)

    #sent_representation = merge([LSTM_layer, attention], mode='mul')
    sent_representation = multiply([LSTM_layer, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=-1))(sent_representation)

    probabilities = TimeDistributed(Dense(1, activation='sigmoid'))(sent_representation)
    
    model = Model(inputs=_input, outputs=probabilities)
    return model

but it gives the following error:

    assert len(input_shape) >= 3
    AssertionError

my understanding may be inadequate...

@deltaxp

This comment has been minimized.

Copy link

deltaxp commented Jul 18, 2018

Sorry, I made an error: Activation and Flatten had to be swapped (first Flatten, then Activation('softmax')); that fixed it.

I tested my version, and it worked, as far as I could see.

Here is the graph of an example with a 1-layer GRU and next-word prediction with attention, including shapes for clarification:
sequence_length = 20,
hidden_dim = 128,
embedding_dim = 32,
vocabulary_size = 397

(For real language processing, stacked LSTMs rather than GRUs, and larger hidden_dims and embedding_dims, are typically used; this is only a toy example.)

[model graph image: 1-layer GRU with attention, next-word prediction]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment