
How to add Attention on top of a Recurrent Layer (Text Classification) #4962

Closed
cbaziotis opened this issue Jan 7, 2017 · 116 comments

@cbaziotis

cbaziotis commented Jan 7, 2017

I am doing text classification. I am using my pre-trained word embeddings, with an LSTM layer on top and a softmax at the end.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense
from keras.regularizers import activity_l2

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

Pretty simple. Now I want to add attention to the model, but I don't know how to do it.

My understanding is that I have to set return_sequences=True so that the attention layer can weigh each timestep accordingly. This way the LSTM will return a 3D tensor, right?
After that, what do I have to do?
Is there a way to easily implement a model with attention using the available Keras layers, or do I have to write my own custom layer?

If this can be done with the available Keras layers, I would really appreciate an example.

@patyork
Contributor

patyork commented Jan 7, 2017

It's been a while since I've used attention, so take this with a grain of salt.

return_sequences does not necessarily need to be True for attention to work; the underlying computation is the same, and this flag should be used only based on whether you need 1 output or an output for each timestep.

As for implementing attention in Keras: there are two possible methods: a) add a hidden Activation layer for the softmax, or b) change the recurrent unit to have a softmax.

On option a): this would apply attention to the output of the recurrent unit but not to the output/input passed to the next time step. I don't think this is what is desired. In this case, the LSTM should have a squashing function applied, as LSTMs don't do too well with linear/relu-style activations.

On option b): this would apply attention to the output of the recurrency, and also to the output/input passed to the next timestep. I think that this is what is desired, but I could be wrong. In this case, the linear output of the neurons would be squashed directly by the softmax; if you wish to apply a pre-squashing such as sigmoid or tanh before the softmax calculation, you would need a custom activation that does both in one step.

I could draw a diagram if necessary, and I should probably read the attention papers again..

@cbaziotis
Author

cbaziotis commented Jan 7, 2017

@patyork Thanks for the reply.
Do you have a good paper (or papers) in mind for attention? I am reading a lot about attention and I want to try it out, because I really like the idea. But even though I think I understand the concept, I don't have a clear understanding of how it works or how to implement it.

If it is possible, I would like someone to offer an example in Keras.

P.S. Is this the correct place to ask such a question, or should I do it at https://groups.google.com/d/forum/keras-users?

@patyork
Contributor

patyork commented Jan 7, 2017

@cbaziotis This area is supposed to be more for bugs as opposed to "how to implement" questions. I admit I don't often look at the Google group, but that is a valid place to ask these questions, as well as the Slack channel.

Bengio et al. have a pretty good paper on attention (soft attention is the softmax attention).

An example of method a) I described:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False))
model.add(Activation('softmax')) #this guy here
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b), with simple activation:

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation='softmax'))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

example b) with a squashing function (tanh) and then softmax (non-working, but the idea):

vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]

from keras import backend as K

def myAct(out):
    return K.softmax(K.tanh(out))

model = Sequential()

model.add(Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=True,
        weights=[embeddings]
    ))

model.add(LSTM(200, return_sequences=False, activation=myAct))
model.add(Dropout(0.5))

model.add(Dense(3, activation='softmax', activity_regularizer=activity_l2(0.0001)))

In addition, I should say that my notes about whether a) or b) above is what you probably need are based on your example, where you want one output (making option b probably the correct way). Attention is often used in spaces like caption generation, where there is more than one output, such as when setting return_sequences=True. For those cases, I think that option a) is the described usage, such that the recurrency keeps all the information passing forward, and it's just the higher layers that utilize the attention.

@cbaziotis
Author

@patyork Thanks for the examples and for the paper. I knew that posting here would get more attention :P

I will try them and post back.

@mbollmann

@patyork, I'm sorry, but I don't see how this implements attention at all?

From my understanding, the softmax in the Bengio et al. paper is not applied over the LSTM output, but over the output of an attention model, which is calculated from the LSTM's hidden state at a given timestep. The output of the softmax is then used to modify the LSTM's internal state. Essentially, attention is something that happens within an LSTM since it is both based on and modifies its internal states.

I actually made my own attempt to create an attentional LSTM in Keras, based on the very same paper you cited, which I've shared here:

https://gist.github.com/mbollmann/ccc735366221e4dba9f89d2aab86da1e

There are several different ways to incorporate attention into an LSTM, and I won't claim 100% correctness of my implementation (though I'd appreciate any hints if something seems terribly wrong!), but I'd be surprised if it was as simple as adding a softmax activation.

@cbaziotis
Author

cbaziotis commented Jan 10, 2017

@mbollmann You are correct that none of the solutions @patyork posted is what I want. I want to get a weight distribution (importance) over the outputs from each timestep of the RNN, like in the paper "Hierarchical Attention Networks for Document Classification", but in my case I want just the representation of a sentence. I am trying to implement this using the available Keras layers.

Similar idea in this paper.

@mbollmann

@cbaziotis That indeed looks conceptually much simpler. I could only take a very short glance right now, but is there a specific point where you got stuck?

@cbaziotis
Author

cbaziotis commented Jan 10, 2017

@mbollmann Please do if you can.
I am trying to implement it right now and trying to understand the Keras API.

I don't have a working solution, but I think I should set return_sequences=True in the RNN in order to get the intermediate outputs, and masking=False.
On top of that I am thinking I should put a TimeDistributed(Dense(1)) with a softmax activation. But I haven't figured out how to put everything together.

Also, I think that setting masking=False won't affect the performance, as the attention layer will assign the correct weights to the padded words. Am I right?

Edit: to clarify, I want to implement an attention mechanism like the one in [1] (sketched below).
[diagram of the attention mechanism from [1]]

  1. Zhou, Peng, et al. "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
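
For reference, my reading of the mechanism in [1] (take it as a paraphrase, not the paper's exact notation): H collects the T timestep outputs of the (Bi)LSTM and w is a trained parameter vector, then

M = tanh(H)                 # same shape as H
alpha = softmax(w^T M)      # one weight per timestep, summing to 1
r = H alpha^T               # weighted sum of the timestep outputs
h* = tanh(r)                # final sentence representation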

@cbaziotis
Author

cbaziotis commented Jan 10, 2017

I tried this:

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = embeddings_layer(embeddings=embeddings_matrix,
                            trainable=False, masking=False, scale=False, normalize=False)(_input)

activations = LSTM(64, return_sequences=True)(embedded)

# attention
attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)

activations = Merge([activations, attention], mode='mul')

probabilities = Dense(3, activation='softmax')(activations)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

and I get the following error:

  File "...\keras\engine\topology.py", line 1170, in __init__
    node_indices, tensor_indices)
  File "...\keras\engine\topology.py", line 1193, in _arguments_validation
    layer_output_shape = layer.get_output_shape_at(node_indices[i])
AttributeError: 'TensorVariable' object has no attribute 'get_output_shape_at'

@mbollmann

mbollmann commented Jan 10, 2017

@cbaziotis The cause of the error is probably that you need to use the merge function (lowercase), not the Merge layer (uppercase).

Apart from that, as far as I understood it:

The part with the tanh activation (Equation 5 in Yang et al., Equation 9 in Zhou et al.) comes before the multiplication with a trained context vector/parameter vector which reduces the dimensionality to "one scalar per timestep". For Yang et al., that seems to be a Dense layer which doesn't yet reduce the dimensionality (though this is a little unclear to me), so I'd expect TimeDistributed(Dense(64, activation='tanh')). For Zhou et al., they just write "tanh", so you'd probably not even need a Dense layer, just the tanh activation after the LSTM.

For the multiplication with a trained context vector/parameter vector, I believe (no longer -- see EDIT) this might be a simple Dense(1) in Keras, without the TimeDistributed wrapper, since we want to have individual weights for each timestep, but I'm not totally sure about this and haven't tested it. I'd imagine something like this, but take this with a grain of salt:

    # attention after Zhou et al.
    attention = Activation('tanh')(activations)    # Eq. 9
    attention = Dense(1)(attention)                # Eq. 10
    attention = Flatten()(attention)               # Eq. 10
    attention = Activation('softmax')(attention)   # Eq. 10
    activations = merge([activations, attention], mode='mul')  # Eq. 11

(EDIT: Nope, doesn't seem that way, they train a parameter vector with dimensionality of the embedding, not a matrix with a timestep dimension.)

@patyork
Contributor

patyork commented Jan 10, 2017

My apologies; this would explain why I was not impressed with the results from my "attention" implementation.

There is an implementation here that seems to be working for people.

@cbaziotis
Author

cbaziotis commented Jan 10, 2017

@mbollmann You were right about the merge; it is different from Merge (see #2467).

I think this is really close:

units = 64
max_length = 50

_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = embeddings_layer(embeddings=embeddings_matrix,
                            trainable=False, masking=False, scale=False, normalize=False)(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = TimeDistributed(Dense(1, activation='tanh'))(activations) 
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# apply the attention
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=0))(sent_representation)
sent_representation = Flatten()(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

but I get an error because Lambda doesn't output the right dimensions. I should be getting [1, units], right?
What am I doing wrong?


Update: I tried explicitly passing the output_shape for Lambda and the model compiles:

sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=0), output_shape=(units, ))(sent_representation)
# sent_representation = Flatten()(sent_representation)

but now I get the following error:

ValueError: Input dimension mis-match. (input[0].shape[0] = 128, input[1].shape[0] = 50)
Apply node that caused the error: Elemwise{Composite{(i0 * log(i1))}}(dense_2_target, Elemwise{Clip}[(0, 0)].0)
Toposort index: 155
Inputs types: [TensorType(float32, matrix), TensorType(float32, matrix)]
Inputs shapes: [(128, 3), (50, 3)]
Inputs strides: [(12, 4), (12, 4)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[Sum{axis=[1], acc_dtype=float64}(Elemwise{Composite{(i0 * log(i1))}}.0)]]

@cbaziotis
Author

cbaziotis commented Jan 10, 2017

Well, I found out why it wasn't working. I was expecting the input to Lambda to be (max_length, units) but it was (None, max_length, units), so I just had to change the axis to 1. This now works.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, TimeDistributed, Flatten, Activation, RepeatVector, Permute, Lambda, merge
from keras import backend as K

units = 64
max_length = 50
vocab_size = embeddings.shape[0]
embedding_size = embeddings.shape[1]


_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=trainable,
        mask_zero=masking,
        weights=[embeddings]
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = TimeDistributed(Dense(1, activation='tanh'))(activations) 
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)

# apply the attention
sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

model = Model(input=_input, output=probabilities)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=[])

I would appreciate it if someone could verify that this implementation is correct.

@mbollmann

@cbaziotis Looks good to me. I re-read the description in Zhou et al., and the code looks like it does what they describe. I no longer understand how what they're doing does anything useful, since the attention model only depends on the input and applies the same weights at every timestep, but... that's probably just my insufficient understanding (I'm used to slightly different types of attention). :)

@cbaziotis
Author

cbaziotis commented Jan 11, 2017

@mbollmann I am confused about the same thing. Can you give an example of the type of attention that you have in mind? I think that I have to put the word (embedding) into the calculation of the attention.

From what I understand, the Dense layer:

  1. assigns a different weight (importance) to each timestep
  2. BUT the importance is static. Essentially this means that each word position in a sentence has a different importance, but the importance comes from the position of the word and not from the word itself.

I plotted the weights of the TimeDistributed(Dense(1, activation='tanh'))(activations) in a heatmap:

[heatmap of the learned attention weights per position]
My interpretation is that the positions with big weights play a more important role, so the output of the RNN for those steps will have a bigger impact on the final representation of the sentence.

The problem is that this is static. If an important word happens to occur in a position with a small weight, then the representation of the sentence won't be good enough.

I would like some feedback on this, and preferably a good paper with a better attention mechanism.

@mbollmann

@cbaziotis Are you sure you don't have it the wrong way around?

The Dense layer takes the output of the LSTM at one timestep and transforms it. The TimeDistributed wrapper applies the same Dense layer with the same weights to each timestep -- which means the output of the calculation cannot depend on the position/timestep, since the Dense layer doesn't even know about it.

So my confusion seems to be of a different nature than yours. :)

(In short: I don't see what calculating a softmax and multiplying the original vector by that gets you that a plain TimeDistributed(Dense(...)) couldn't already learn. However, I work on attentional models where the output is also a time series, which means that I have multiple output timesteps for which the model should learn to attend to different input timesteps. I think that's not directly comparable to your situation, since you only have one output.)
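
A quick way to see the weight sharing (a minimal sketch; the parameter count makes the point):

from keras.layers import Input, Dense, TimeDistributed
from keras.models import Model

inp = Input(shape=(50, 64))           # 50 timesteps, 64 features each
out = TimeDistributed(Dense(1))(inp)
Model(inp, out).summary()
# The TimeDistributed(Dense(1)) layer reports 64 + 1 = 65 parameters,
# independent of the 50 timesteps: one kernel, shared across every step.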

@patyork
Contributor

patyork commented Jan 11, 2017

@mbollmann I'm also a bit confused (but I have been from the get-go). I think this blog post is fairly informative, or at least has some decent pictures.

So, @cbaziotis is using a time series with multiple output steps (LSTM with return_sequences=True). The first Dense layer is applying weights over each individual timestep output from the LSTM, which I'm not sure accomplishes the intended behavior of looking at all the past activations and assigning weights to those, as in this picture:
[attention diagram from the blog post, with weights a_{t,T} over the encoder states]

I'm thinking the code above is just the line a_{t,T} feeding into the attention layer at each timestep. The fallout of this is that the attention is just determining which activations are important, not which timesteps are important.

@cbaziotis
Author

@mbollmann I thought that TimeDistributed applies different weights to each timestep...
In that case everything is wrong.
How can I make it so that I can apply different weights to each timestep?
Can this be done with the available Keras layers? Any hint?

@patyork
Contributor

patyork commented Jan 11, 2017

TimeDistributed applies the same weight set across every timestep.

You'd need to set up a standard Dense layer as a matrix, e.g. Dense(20) where 20 is the lookback length. You'd then feed examples of 20 timesteps to train. This is where I'm quite confused about implementing attention, as in theory it looks like this lookback is infinite, not fixed at a certain length.

@cbaziotis cbaziotis reopened this Jan 11, 2017
@cbaziotis
Author

cbaziotis commented Jan 11, 2017

Sorry for the mis-click.
So if I have inputs of constant length, let's say 50, then is this what I have to do?

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(50 , activation='tanh')(activations) 
attention = Flatten()(attention)

@patyork
Contributor

patyork commented Jan 11, 2017

Actually, no, I think you would just remove the TimeDistributed wrapper and keep Dense(1) - I need to implement it real quick and check some shapes though.

@patyork
Contributor

patyork commented Jan 11, 2017

So I guess that is what you are looking for.

  • 50 timesteps
  • Feeds into a regular Dense(1), which provides separate weights for the 50 timesteps
  • Calculates attention and multiplies against the 50 timesteps to apply attention
  • Sums (this reduces the 50 timesteps to 1 output; this is where this attention implementation differs from most of what I've read)
  • Dense layer that produces output of shape (None, 3)
_input = Input(shape=[max_length], dtype='int32')

# get the embedding layer
embedded = Embedding(
        input_dim=vocab_size,
        output_dim=embedding_size,
        input_length=max_length,
        trainable=False,
        mask_zero=False
    )(_input)

activations = LSTM(units, return_sequences=True)(embedded)

# compute importance for each step
attention = Dense(1, activation='tanh')(activations)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(units)(attention)
attention = Permute([2, 1])(attention)


sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

probabilities = Dense(3, activation='softmax')(sent_representation)

I think this (ugly) chart maps the above out pretty well; it's up to you to determine if it makes sense for what you are doing:
[diagram mapping the code above: LSTM timestep outputs → Dense(1) → softmax → RepeatVector/Permute → multiply → sum → Dense(3)]

@cbaziotis
Author

@patyork Thanks! I think this is what is described in the paper.
What they are trying to do, from what I understand, is: instead of using just the last output of the RNN, they use a weighted sum of all the intermediate outputs.

I have a question about this line:

sent_representation = Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units,))(sent_representation)

Why axis=-2? How does this sum the tensors? I am using axis=1.

@cbaziotis
Author

cbaziotis commented Jan 11, 2017

Continuing from my last comment, this is what is described in the blog post that you mentioned. See after the image that you posted...

The y‘s are our translated words produced by the decoder, and the x‘s are our source sentence words. The above illustration uses a bidirectional recurrent network, but that’s not important and you can just ignore the inverse direction. The important part is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state. The a‘s are weights that define in how much of each input state should be considered for each output. So, if a_{3,2} is a large number, this would mean that the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence. The a's are typically normalized to sum to 1 (so they are a distribution over the input states).

What different kind of attention do you have in mind? In the article, attention is described in the context of machine translation. In my case (classification), I just want a better representation for the sentence.

@patyork
Contributor

patyork commented Jan 11, 2017

Yeah, after thinking about this, it makes sense. The softmax multiplication will weight the timestep outputs (most will be near zero, some nearer to 1), and so the sum of those will be close to the outputs of the "near to 1" timesteps - pretty clever.

In this case, axis=-2 is equivalent to axis=1; I use the reverse indexing all the time so that I never have to remember that Keras includes the batch_size (the None aspect) in those shapes. You ran into this gotcha earlier; using the reverse indexing means I never have to think about that aspect - and you'll see that form of indexing throughout the actual Keras code for this reason.

I just mean that this implementation seems a little limiting - you have to set T=50 or another limit; it can't be an infinite or undefined variable, which means you have to throw away the first T-1 (49) outputs/training outputs. As that image leads me to believe, T should be infinite/undefined/variable, something the TimeDistributed wrapper could provide. Perhaps this is a good thing, perhaps not - I haven't tried both ways (obviously).
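
A quick numerical check of the axis equivalence (a sketch with numpy; shapes follow the (batch, timesteps, units) convention used above):

import numpy as np

x = np.random.rand(128, 50, 64)    # (batch, timesteps, units)
s1 = x.sum(axis=1)                 # sum over the 50 timesteps
s2 = x.sum(axis=-2)                # same axis, counted from the end
print(s1.shape, s2.shape)          # (128, 64) (128, 64)
print(np.allclose(s1, s2))         # True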

@mbollmann

Phew, a lot happened here, and I think I agree with most of what was written. Using Dense(1) without the TimeDistributed wrapper was what I was already trying to argue for yesterday, a few dozen posts above, so that does seem correct to me as well in this scenario.

@patyork
Contributor

patyork commented Jan 11, 2017

@mbollmann I read that - it seems like you talked yourself out of that at some point though, based on the edit. I was confusing/arguing with myself to no end throughout this entire issue as well.

I learned quite a bit though, at least.

@cbaziotis
Author

cbaziotis commented Jan 11, 2017

@patyork @mbollmann Thank you both! I learned a lot.

Btw, after running some tests, I am not impressed. I see no obvious improvement compared to the classic scenario (using just the last timestep). But the idea is interesting...

@patyork This may be stupid, but what do you mean by saying:

it can't be an infinite or undefined variable, which means you have to throw away the first T-1 (49) outputs/training outputs.

Why are they thrown away? They are used in the weighted sum, aren't they? *
I agree that this is limiting, as it won't work with masking (series of varying length).

*Do you mean the timesteps that are padded to keep a constant length?

@philipperemy

Hey, have a look at this repo:

https://github.com/philipperemy/keras-attention-mechanism

It shows how to build an attention module on top of a recurrent layer.

Thanks

@jbrownlee

@philipperemy I tested your approach. Indeed you can learn an attention vector, but testing across a suite of contrived problems, I see the model is just as skillful as a plain Dense + LSTM combination. Attention is an optimization that should lift skill or decrease training time for the same skill. Perhaps you have an example where your approach is more skillful than a straight Dense + LSTM setup with the same resources?

@cbaziotis After testing, I believe your attention method is something new/different, inspired by Bahdanau et al. [1]. It does not appear skillful on contrived problems either. Perhaps you have a good demonstration of where it does do well?

@mbollmann is correct as far as I can tell. The attention approach of Bahdanau et al. requires access to the decoder hidden state (decoder output) of the previous time step in order to compute the current time step (s_{i-1} in the paper). This is unavailable unless you write your own layer and access it.

[1] https://arxiv.org/pdf/1409.0473.pdf
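
To make that concrete, here is a rough sketch of one decoder step of the Bahdanau-style alignment in backend ops (W_a, U_a and v_a are made-up names for the trained weights; this is meant to show why s_{i-1} is needed, not to be a drop-in layer):

from keras import backend as K

def bahdanau_context(s_prev, H, W_a, U_a, v_a):
    # s_prev: (batch, dec_units)    decoder state from the previous step, s_{i-1}
    # H:      (batch, T, enc_units) encoder outputs for all timesteps
    # W_a: (dec_units, att_units), U_a: (enc_units, att_units), v_a: (att_units, 1)
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), one score per encoder step j
    scores = K.dot(K.tanh(K.expand_dims(K.dot(s_prev, W_a), 1) + K.dot(H, U_a)), v_a)  # (batch, T, 1)
    alphas = K.softmax(K.squeeze(scores, -1))               # (batch, T), sums to 1 over T
    context = K.sum(K.expand_dims(alphas, -1) * H, axis=1)  # weighted sum over time
    return context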

@NMRobert

NMRobert commented Jul 7, 2017

@jbrownlee
Would it be possible to share some of these contrived 'test case' problems? It would be extremely helpful for debugging and evaluating the efficacy of various attention implementations.

@dudeperf3ct

@cbaziotis, how will the above attention mechanism work for the IMDB example in Keras? The input size is (5000, 80) (max_length=80) and the output is (5000,). This is the model for training:

	input_ = Input(shape=(80,), dtype='float32')
	print (input_.get_shape())                       #(?, 80)
	input_embed = Embedding(max_features, 128 ,input_length=80)(input_)
	print (input_embed.get_shape())                  #(?, 80, 128)

	activations = LSTM(64, return_sequences=True)(input_embed)
	attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
	attention = Flatten()(attention)
	attention = Activation('softmax')(attention)
	attention = RepeatVector(64)(attention)
	attention = Permute([2, 1])(attention)	
	print (activations.get_shape())                   #(?, ?, 64)
	print (attention.get_shape())                     #(?, ?, 64)

	sent_representation = merge([activations, attention], mode='mul')
	sent_representation = Lambda(lambda x_train: K.sum(x_train, axis=1), output_shape=(5000,))(sent_representation)
	print (sent_representation.get_shape())           #(?, 64)
	probabilities = Dense(1, activation='softmax')(sent_representation)      #Expected (5000,)
	model = Model(inputs=input_, outputs=probabilities)
	model.summary()

 Error : ValueError: Dimensions must be equal, but are 64 and 5000 for 'dense_2/MatMul' (op: 'MatMul') with input shapes: [?,64], [5000,1].

@danieljf24

Hi @cbaziotis, thanks for your code.
Since you did not apply any special treatment to the padded words, I am wondering whether the attention mechanism will assign the correct weights (close to zero) to them.

@cbaziotis
Author

cbaziotis commented Jul 29, 2017

If you read carefully you will see that I have posted the updated versions of the layers. Here you go:

# either:
model.add(LSTM(64, return_sequences=True))
model.add(AttentionWithContext())
# next add a Dense layer (for classification/regression) or whatever...

# or:
model.add(LSTM(64, return_sequences=True))
model.add(Attention())
# next add a Dense layer (for classification/regression) or whatever...

And as I said, the layers take the mask into account.
Edit: also note that I have not tested them with Keras 2, but I imagine that you will need to make some minor syntactic changes.
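
For context, here is a minimal, mask-aware sketch of what a layer like this can look like (my own simplification against the Keras 2 layer API; it is not the exact code behind Attention() or AttentionWithContext()):

from keras import backend as K
from keras.engine.topology import Layer

class SimpleAttention(Layer):
    """Scores each timestep with a trained vector, softmaxes the scores
    (respecting the mask) and returns the weighted sum of the inputs.
    Input: (batch, timesteps, units) -> Output: (batch, units)."""

    def __init__(self, **kwargs):
        self.supports_masking = True
        super(SimpleAttention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.w = self.add_weight(name='att_weight', shape=(input_shape[-1],),
                                 initializer='glorot_uniform')
        super(SimpleAttention, self).build(input_shape)

    def compute_mask(self, inputs, mask=None):
        return None  # the time dimension is collapsed, so drop the mask

    def call(self, x, mask=None):
        e = K.squeeze(K.tanh(K.dot(x, K.expand_dims(self.w))), -1)  # (batch, T) scores
        a = K.exp(e - K.max(e, axis=1, keepdims=True))              # numerically stable softmax
        if mask is not None:
            a *= K.cast(mask, K.floatx())                           # zero out padded steps
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        return K.sum(x * K.expand_dims(a), axis=1)                  # (batch, units)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[-1])

With mask_zero=True on the Embedding, the padded timesteps then receive (near-)zero attention weights.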

@jiangchao123

Does attention + LSTM improve the accuracy in text classification? On my dataset, I find that there is no difference compared to mean pooling + LSTM.

@shaifugpt

shaifugpt commented Nov 21, 2017

@cbaziotis I have a query regarding the attention:
activations = LSTM(neu, activation='relu', return_sequences=True, return_state=True)(inputs)
This statement applies attention to the output of the LSTM. Does this apply to h (the hidden state), where h_t = o_t * tanh(c_t)?

I read somewhere that in
activations, hh, cc = LSTM(neu, activation='relu', return_sequences=True, return_state=True)(inputs)

hh is the hidden state and cc is the cell state. Are hh and cc the final hidden and cell states?

Also, what is the difference between Attention and AttentionWithContext?

@spate141

spate141 commented May 3, 2018

@Ravin0512 Any updates?

@cbaziotis
Author

@Ravin0512 I recently made this tool: https://github.com/cbaziotis/neat-vision

Just make sure to return the attention scores, besides the final representation of the sentence, from the attention layer.
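
With the functional-API version from earlier in this thread, one way to expose the scores is to keep a handle on the softmax tensor and build a second model that also outputs it (a sketch; att_scores is just a renamed intermediate from the code above):

attention = TimeDistributed(Dense(1, activation='tanh'))(activations)
attention = Flatten()(attention)
att_scores = Activation('softmax')(attention)            # (batch, max_length) attention weights
attention = RepeatVector(units)(att_scores)
attention = Permute([2, 1])(attention)

sent_representation = merge([activations, attention], mode='mul')
sent_representation = Lambda(lambda xin: K.sum(xin, axis=1))(sent_representation)
probabilities = Dense(3, activation='softmax')(sent_representation)

model = Model(input=_input, output=probabilities)                     # train this one
viz_model = Model(input=_input, output=[probabilities, att_scores])   # probe this one
# viz_model.predict(x) returns both the class probabilities and the per-token scores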

@saxenarohit

saxenarohit commented Jun 20, 2018

@cbaziotis As for sharing the weights across timesteps, I think it is fine. Even Andrew Ng's Sequence Models course has a shared-weight implementation.

@deltaxp

deltaxp commented Jun 21, 2018

  1. Can one make the attention model shorter by using the dot function from keras.layers?

inputs = Input(shape=(input_len,))
embedded = Embedding(input_dim, embedding_dim)(inputs)
activation = LSTM(hidden_dim, return_sequences=True)(embedded)
attention = TimeDistributed(Dense(1, use_bias=False, activation='linear'))(activation)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
representation = dot([attention, activation], axes=1)

Isn't it the same as the long version?

attention = TimeDistributed(Dense(1, use_bias=False, activation='linear'))(activation)
attention = Flatten()(attention)
attention = Activation('softmax')(attention)
attention = RepeatVector(hidden_dim)(attention)
attention = Permute([2, 1])(attention)
activation = multiply([attention, activation])
representation = Lambda(lambda x: K.sum(x, axis=1))(activation)

The dot function contracts the tensors along axis 1: sum_t a_t * h_t = h.

The Dense layer for the attention shouldn't have a bias, since the weights, according to Zhou, act only on the components of the hidden states. Furthermore, in Zhou's model a linear activation is enough.

As far as I understood, the attention Dense layer has to be time-distributed: because the weights act on the hidden state components, they play more or less the same role as all the matrices in the recurrent layer, which share their weights over time.

The time dependence of the attention factors arises from the differences between the hidden states (the components differ, and therefore alpha_t = softmax(w^T h_t) differs).
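
The algebra does check out; a quick numpy sanity check of the contraction (random values standing in for the softmax weights and the hidden states):

import numpy as np

batch, T, hidden = 4, 20, 128
a = np.random.rand(batch, T)              # attention weights
h = np.random.rand(batch, T, hidden)      # recurrent layer outputs

long_version = (h * a[:, :, None]).sum(axis=1)     # RepeatVector/Permute/multiply/sum
short_version = np.einsum('bt,bth->bh', a, h)      # what dot(..., axes=1) computes

print(np.allclose(long_version, short_version))    # True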

@stevewyl

@Ravin0512 I just found an ugly method.
First you need to define a function over the network up to the layer before your attention layer (here the attention layer is the fourth layer):
sent_before_att = K.function([sent_model.layers[0].input, K.learning_phase()], [sent_model.layers[2].output])
Then you take out the attention layer's weights:
sent_att_w = sent_model.layers[3].get_weights()
And use the sent_before_att function to get the output of the layer before the attention layer:
sent_each_att = sent_before_att([sentence, 0])
In addition, you need to define a function to calculate the attention weights -- here it is the function named cal_att_weights; you can use numpy to reproduce the same computation you defined in the attention layer.
Finally, sent_each_att holds the attention weights you want:
sent_each_att = cal_att_weights(sent_each_att, sent_att_w)
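
Put together, the whole procedure looks roughly like this (a sketch; sent_model, sentence and cal_att_weights stand for your own trained model, input batch and numpy re-implementation of the layer's math):

from keras import backend as K

# output of the layer feeding the attention layer (indices 2/3 match this example's layout)
sent_before_att = K.function([sent_model.layers[0].input, K.learning_phase()],
                             [sent_model.layers[2].output])

sent_att_w = sent_model.layers[3].get_weights()           # trained attention parameters
sent_each_att = sent_before_att([sentence, 0])            # 0 = test phase
att_weights = cal_att_weights(sent_each_att, sent_att_w)  # per-word attention weights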

@stevewyl

@cbaziotis The best attention visualization tool I have ever seen 👍

@fuchami

fuchami commented Jul 11, 2018

I want to get a regression output with an attention LSTM.

I tried this:

def Attention_LSTM(self):

    _input = Input(shape=(self.seq_length, self.feature_length,))

    LSTM_layer = LSTM(self.n_hidden, return_sequences=True)(_input)

    # Attention layer
    attention = TimeDistributed(Dense(1, activation='tanh'))(LSTM_layer)
    attention = Flatten()(attention)
    attention = Activation('softmax')(attention)
    attention = RepeatVector(self.n_hidden)(attention)
    attention = Permute([2,1])(attention)

    #sent_representation = merge([LSTM_layer, attention], mode='mul')
    sent_representation = multiply([LSTM_layer, attention])
    sent_representation = Lambda(lambda xin: K.sum(xin, axis=-1))(sent_representation)

    probabilities = TimeDistributed(Dense(1, activation='sigmoid'))(sent_representation)

    model = Model(inputs=_input, outputs=probabilities)
    return model

but it gives the following error:

assert len(input_shape) >= 3
AssertionError

My understanding may be inadequate...

@deltaxp

deltaxp commented Jul 18, 2018

Sorry, I made an error: the Activation and Flatten layers had to be swapped (first Flatten, then Activation('softmax')); that fixed it.

I tested my version, and it worked, as far as I could see.

Here is the graph of an example with a 1-layer GRU and next-word prediction with attention, including shapes for clarification:
sequence_length=20,
hidden_dim=128,
embedding_dim=32,
vocabulary_size=397

(For real language processing, stacked LSTMs instead of GRUs and higher hidden_dims and embedding_dims are typically used; this is only a toy example.)

[model graph image: next-word prediction with a 1-layer GRU and attention, showing the shapes of each layer]

@timschott

timschott commented Mar 20, 2019

Hi, @stevewyl -- what is inside that cal_att_weights call?
I'm following this post to detect the weights per word in an inputted test text. It implements the attentive layer from @cbaziotis and then tacks on that cal_att_weights method to inspect the weights per word.
The dimensions of the weight array I get back are correct, but the weights themselves are crazy small -- all of them hover around 0.0000009.
Does this calculation step look correct to you?

def cal_att_weights(output, att_w):
    eij = np.tanh(np.dot(output[0], att_w[0]) + att_w[1])
    eij = np.dot(eij, att_w[2])
    eij = eij.reshape((eij.shape[0], eij.shape[1]))
    ai = np.exp(eij)
    weights = ai / np.sum(ai)
    return weights

@Ritaprava95

attention = Flatten()(attention)
For this line I am getting the error:
Layer flatten_4 does not support masking, but was passed an input_mask: Tensor("time_distributed_6/Reshape_3:0", shape=(None, None), dtype=bool)

@EvgeniaChroni

Hello all,

I am trying to use attention on top of a BiLSTM in TensorFlow 2.
Also, I am using pretrained word embeddings.

my model is the following:

units=250
EMBEDDING_DIM=310
MAX_LENGTH_PER_SENTENCE=65
encoder_input = keras.Input(shape=(MAX_LENGTH_PER_SENTENCE))
x =layers.Embedding(input_dim=len(embedding_matrix), output_dim=EMBEDDING_DIM, input_length=MAX_LENGTH_PER_SENTENCE,
                              weights=[embedding_matrix],
                              trainable=False)(encoder_input)
                              
activations =layers.Bidirectional(tf.keras.layers.LSTM(units))(x)
activations = layers.Dropout(0.5)(activations)

attention=layers.Dense(1, activation='tanh')(activations)
attention=layers.Flatten()(attention)
attention=layers.Activation('softmax')(attention)
attention=layers.RepeatVector(units*2)(attention)
attention=layers.Permute((2, 1))(attention)

sent_representation = layers.Multiply()([activations, attention])
sent_representation = layers.Lambda(lambda xin: K.sum(xin, axis=-2), output_shape=(units*2,))(sent_representation)

sent_representation = layers.Dropout(0.5)(sent_representation)

probabilities = layers.Dense(4, activation='softmax')(sent_representation)


encoder = keras.Model(inputs=[encoder_input], outputs=[probabilities],name='encoder')
encoder.summary()

Could you please let me know if my implementation is correct?
What worries me is that the results with the attention model show no improvement.

Thanks in advance.

@cerlymarco

Here is a simple solution for adding attention to your network:

https://stackoverflow.com/questions/62948332/how-to-add-attention-layer-to-a-bi-lstm/62949137#62949137

@thisisdhruvagarwal

Hey everyone. I saw that everyone adds a Dense() layer in their custom attention layer, which I think isn't needed.

[attention diagram from the tutorial]

This is an image from a tutorial here. Here, we are just multiplying 2 vectors and then doing several operations on those vectors only. So what is the need for a Dense() layer? Is the tutorial on 'how attention works' wrong?

@suncrown

Hey, have a look at this repo:

https://github.com/philipperemy/keras-attention-mechanism

It shows how to build an attention module on top of a recurrent layer.

Thanks

Thanks Philip. Your implementation is clean and easy to follow.

@sandeepbhutani304

Hey everyone. I saw that everyone adds a Dense() layer in their custom attention layer, which I think isn't needed.

[attention diagram from the tutorial]

This is an image from a tutorial here. Here, we are just multiplying 2 vectors and then doing several operations on those vectors only. So what is the need for a Dense() layer? Is the tutorial on 'how attention works' wrong?

I have the same question.
