
Questions on implementation details #14

Closed
felixhao28 opened this issue Mar 5, 2018 · 53 comments

@felixhao28

felixhao28 commented Mar 5, 2018

Update on 2019/2/14, nearly one year later:

The implementation in this repo is definitely bugged. Please refer to my implementation in a reply below for the corrected version. My version has been running in our product since this thread, and it outperforms both a vanilla LSTM without attention and the incorrect version in this repo by a significant margin. I am not the only one raising this question.

Both this repo and my version of attention are intended for sequence-to-one networks (although it can be easily tweaked for seq2seq by replacing h_t with the current state of the decoder step). If you are looking for a ready-to-use attention layer for sequence-to-sequence networks, check this out: https://github.com/farizrahman4u/seq2seq.

============Original answer==============

I am currently working on a text generation task and learned attention from the TensorFlow tutorials. The implementation details seem quite different from your code.

This is how the TensorFlow tutorial describes the process:

[images: the attention equations from the TensorFlow tutorial: attention weights alpha_ts = softmax(score(h_t, h_s)) (1), context vector c_t = sum_s alpha_ts * h_s (2), attention vector a_t = tanh(W_c [c_t; h_t]) (3), with Luong's multiplicative score score(h_t, h_s) = h_t^T W h_s]

If I am understanding it correctly, all learnable parameters in the attention mechanism are stored in W, which has a shape of (rnn_size, rnn_size) (rnn_size is the size of the hidden state). So first you need to use W to calculate the score of each hidden state from both h_t and h_s, but I am not seeing h_t anywhere in your code. Instead, you applied a dense layer to all h_s, which means pre_act (Edit: h_t should be h_s in this equation) becomes the score in the paper. This seems wrong.
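
For concreteness, here is a shape-only NumPy sketch of the process as I understand it from the tutorial (variable names are my own):

import numpy as np

rnn_size, time_steps = 32, 20
h_s = np.random.randn(time_steps, rnn_size)  # source hidden states, one per time step
h_t = np.random.randn(rnn_size)              # current target/decoder hidden state
W = np.random.randn(rnn_size, rnn_size)      # the only learnable attention parameters

score = h_s @ W @ h_t                        # (time_steps,): one score per source state
alpha = np.exp(score) / np.exp(score).sum()  # softmax -> attention weights, eq. (1)
context = alpha @ h_s                        # (rnn_size,): weighted sum of h_s, eq. (2)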

In the next step you element-wise multiply the attention weights with the hidden states, as in equation (2). Then equation (3) is somehow missing.

I noticed the tutorial is about a Seq2Seq (encoder-decoder) model and your code is a plain RNN. Maybe that is why your code is different. Do you have any source on how attention is applied to a non-Seq2Seq network?

Here is your code:

def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)
    a = Reshape((input_dim, TIME_STEPS))(a) # this line is not useful. It's just to know which dimension is what.
    a = Dense(TIME_STEPS, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')
    return output_attention_mul


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model
@felixhao28
Author

I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.

INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False

ATTENTION_SIZE = 128

def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # _t stands for transpose
    hidden_states_t = Permute((2, 1), name='attention_input_t')(hidden_states)
    # hidden_states_t.shape = (batch_size, hidden_size, time_steps)
    # this line is not useful. It's just to know which dimension is what.
    hidden_states_t = Reshape((hidden_size, TIME_STEPS), name='attention_input_reshape')(hidden_states_t)
    # Inside dense layer
    # a (batch_size, hidden_size, time_steps) dot W (time_steps, time_steps) => (batch_size, hidden_size, time_steps)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(TIME_STEPS, use_bias=False, name='attention_score_vec')(hidden_states_t)
    score_first_part_t = Permute((2, 1), name='attention_score_vec_t')(score_first_part)
    #            score_first_part_t         dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size, 1) => (batch_size, time_steps, 1)
    h_t = Lambda(lambda x: x[:, :, -1], output_shape=(hidden_size, 1), name='last_hidden_state')(hidden_states_t)
    score = dot([score_first_part_t, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # if SINGLE_ATTENTION_VECTOR:
    #     a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
    #     a = RepeatVector(hidden_size)(a)
    # (batch_size, hidden_size, time_steps) dot (batch_size, time_steps, 1) => (batch_size, hidden_size, 1)
    context_vector = dot([hidden_states_t, attention_weights], [2, 1], name='context_vector')
    context_vector = Reshape((hidden_size,))(context_vector)
    h_t = Reshape((hidden_size,))(h_t)
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(ATTENTION_SIZE, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector

The interface remained the same, except you don't need the Flatten layer anymore:

def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

The results seem even better than with your original implementation:

[image: training results with the new attention block]

The process of building attention myself has brought me more questions than answers:

  1. What is SINGLE_ATTENTION_VECTOR for? And how can you use K.mean as dimension reduction while all the parameters of a are produced by a Dense layer? Doesn't that just mean all weight parameters get the same gradient for each batch, behave like a single parameter vector, and waste GPU memory storing the full matrix?
  2. I understand your intuition behind APPLY_ATTENTION_BEFORE_LSTM, but that is not what attention is for, and you can achieve pretty much the same result by feeding the fixed-length input into a fully-connected layer and using its output as the input of an LSTM layer. "The data at index 10 being important" is not a good feature to learn through attention; the exact timestep index should be transparent to the attention mechanism.

P.S. I have modified the get_data_recurrent function a little bit to produce one-hot data, as that is closer to my actual needs.

def get_data_recurrent(n, time_steps, input_dim, attention_column=10):
    """
    Data generation. x is purely random except that its value at attention_column equals the target y.
    In practice, the network should learn that the target = x[attention_column].
    Therefore, most of its attention should be focused on the value addressed by attention_column.
    :param n: the number of samples to retrieve.
    :param time_steps: the number of time steps of your series.
    :param input_dim: the number of dimensions of each element in the series.
    :param attention_column: the column linked to the target. Everything else is purely random.
    :return: x: model inputs, y: model targets
    """
    x = np.random.randint(input_dim, size=(n, time_steps))
    x = np.eye(input_dim)[x]
    y = x[:, attention_column, :]
    return x, y
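
For reference, a quick shape check with the constants above (TIME_STEPS = 20, INPUT_DIM = 100); this is just a sanity-check snippet of my own, not part of the script:

x, y = get_data_recurrent(4, time_steps=20, input_dim=100)
print(x.shape)  # (4, 20, 100): each time step is a one-hot vector of size input_dim
print(y.shape)  # (4, 100): the one-hot vector taken from x at attention_column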

@felixhao28
Author

felixhao28 commented Mar 7, 2018

Being confused about why attention can learn information about a specific index in the input sequence, I went on and read the code of the official TensorFlow implementation. I was wrong about the attention_score_vec dense layer, a.k.a. the "memory layer" in the TF implementation. The weight matrix W is not sized (time_steps, time_steps) but rather (hidden_size, hidden_size), as shown here. The correct implementation should be:

def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # Inside dense layer
    #              hidden_states            dot               W            =>           score_first_part
    # (batch_size, time_steps, hidden_size) dot (hidden_size, hidden_size) => (batch_size, time_steps, hidden_size)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)
    #            score_first_part           dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot   (batch_size, hidden_size)  => (batch_size, time_steps)
    h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(
        hidden_states)
    score = dot([score_first_part, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # (batch_size, time_steps, hidden_size) dot (batch_size, time_steps) => (batch_size, hidden_size)
    context_vector = dot([hidden_states, attention_weights], [1, 1], name='context_vector')
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(128, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector

score_first_part is named that way because it is only the first part of the score computation (the hidden states multiplied by W).

Surprisingly, even without any hard information about the position within the sequence, the attention model still managed to learn the importance of the 10th element. Now I am super confused.

[image: plot of the learned attention weights per time step]

My guess is that the LSTM somehow learned to "count" to 10 in its hidden state, and that "count" is picked up by the attention. I will need to visualize the inner parameters of the LSTM to be sure.

An interesting finding I made is how the attention is learned over the course of training:

[animation: the attention weights evolving during training]

Full code (except attention_3d_block), shown here just for reference:

from keras.layers import concatenate, dot
from keras.layers.core import *
from keras.layers.recurrent import LSTM
from keras.models import *

from attention_utils import get_activations, get_data_recurrent

INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False


def attention_3d_block(hidden_states):
    ...  # same as above


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid', name='output')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

if __name__ == '__main__':

    N = 300000
    # N = 300 -> too few = no training
    inputs_1, outputs = get_data_recurrent(N, TIME_STEPS, INPUT_DIM)

    if APPLY_ATTENTION_BEFORE_LSTM:
        m = model_attention_applied_before_lstm()
    else:
        m = model_attention_applied_after_lstm()

    m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    print(m.summary())

    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)

    attention_vectors = []
    for i in range(10):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')
        attention_vec = np.mean(activations[0], axis=0).squeeze()
        print('attention =', attention_vec)
        assert (np.sum(attention_vec) - 1.0) < 1e-5
        attention_vectors.append(attention_vec)

    attention_vector_final = np.mean(np.array(attention_vectors), axis=0)
    # plot part.
    import matplotlib.pyplot as plt
    import pandas as pd

    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()

@felixhao28
Author

I actually think you were trying to implement self-attention, which is used in text classification. But nonetheless, the weight matrix should be sized (hidden_size, hidden_size) instead of (time_steps, time_steps).

@Wangzihaooooo

@felixhao28 why do you use the layer named "last_hidden_state"?

@felixhao28
Author

@Wangzihaooooo Because attention was first introduced in a sequence-to-sequence model, where the attention score is computed from both h_t and all h_s. In a language/classification model (sequence-to-one), we don't have an h_t representing the information of the Y currently being output, so I just used the last hidden state as h_t.

To be fair, you can remove h_t from the score computation entirely, which then just becomes score = W * h_s, and that is essentially self-attention. It differs from traditional attention in that self-attention only scores how globally important a hidden state is, without the information of the current state of the LSTM.
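
For illustration, here is a minimal sketch of such a self-attention block in the same Keras style. The per-step score is produced by a learned projection of each hidden state on its own; the names and the exact scoring function are my own choices, not code from this repo:

from keras.layers import Dense, Activation, Reshape, dot

def self_attention_block(hidden_states, time_steps):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # score each hidden state on its own: tanh(h_s * W), then a learned scoring vector
    projected = Dense(hidden_size, use_bias=False, activation='tanh', name='self_attention_proj')(hidden_states)
    score = Dense(1, use_bias=False, name='self_attention_score')(projected)   # (batch_size, time_steps, 1)
    score = Reshape((time_steps,), name='self_attention_score_flat')(score)    # (batch_size, time_steps)
    attention_weights = Activation('softmax', name='self_attention_weight')(score)
    # weighted sum of the hidden states over the time axis
    context_vector = dot([hidden_states, attention_weights], axes=[1, 1], name='self_attention_context')
    return context_vector  # (batch_size, hidden_size)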

@Wangzihaooooo

@felixhao28 thank you, I learned a lot from your code.

@rajeev-samalkha

@felixhao28 thank you very much. This is very well explained and removes the complexity around the attention layer. I implemented the code inline for a Seq2Seq model and am able to grab the attention matrix directly. Thanks once again for your help.

Regards
Rajeev

@michetonu

@felixhao28 I'm a bit confused about this part of the code:

    attention_vectors = []
    for i in range(10):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')

get_activations() effectively passes testing_inputs_1 through the layer 'attention_weight' and outputs the softmax probabilities for each. However, you are passing the raw inputs without making them pass through the LSTM first; is that on purpose? If so, can you explain why? Since in the model the inputs to the attention layer are the outputs of the LSTM layer(s), I would expect to have to do the same here.

Thanks!

@felixhao28
Author

felixhao28 commented Aug 24, 2018

you are passing the raw input without making them pass through the LSTM first

The input does pass through the LSTM first. A layer is an abstract description of how a tensor should be computed, not the actual tensor being computed. The relationship is more like "class" and "instance", if you are familiar with OOP.

The output here is the actual tensor (instance) of the attention_weight layer, which has already been connected to the previous tensors (the computational graph) by attention_weights = Activation('softmax', name='attention_weight')(score). It is not this specific tensor that takes testing_inputs_1; it is the computational graph, which begins at inputs = Input(shape=(TIME_STEPS, INPUT_DIM,)).
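
In other words, fetching the attention_weight activations just builds a backend function from the model input to that layer's output tensor. A helper like get_activations is typically implemented roughly like this (my own sketch, not necessarily the exact code in attention_utils):

from keras import backend as K

def get_layer_activations(model, model_inputs, layer_name):
    # build a function: model input (+ learning phase) -> output tensor of the named layer
    layer_output = model.get_layer(layer_name).output
    fetch = K.function([model.input, K.learning_phase()], [layer_output])
    return fetch([model_inputs, 0])[0]  # 0 = test phase (disables dropout, etc.)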

@michetonu

@felixhao28 I see, thanks for the explanation!

@farahshamout

Hi, can you clarify what you mean by "Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine"?

@felixhao28
Author

@farahshamout Here is a rather complete explanation of attention over a sequence-to-sequence model. The original idea of attention uses the output of the decoder as h_t, representing the "current decoding state". If you think of the "many-to-one" problem as a special case of the "many-to-many" problem, h_t becomes the last hidden state of the encoder.

@farahshamout

@felixhao28 I see, thanks!

@Bertorob

Hi, I was trying to use your implementation, but I would like to save an attention heat map during training (once per epoch). I tried to add return attention_vector, attention_weights but it is not what I wanted.
Do you have any suggestions?

@felixhao28
Author

@Bertorob I assume you added attention_weights to the outputs of the model. Sadly there is a limitation in Keras that every output needs to be paired with a "ground-truth y" and evaluated by a loss function. So if you intend to collect attention_weights for every batch, you need to provide a dummy, same-sized numpy array as the second "ground-truth y" in model.fit, and a custom loss function for attention_weights that always returns 0.

If you only need the attention heat map once per epoch instead of once per batch, replacing model.fit with model.train_on_batch (or fitting one epoch at a time) is all you need.
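
A minimal sketch of the dummy-output trick described above; it assumes the attention block is modified so that the attention_weights tensor is also available, and the variable names are mine:

import numpy as np
from keras import backend as K
from keras.models import Model

# hypothetical: build the model with attention_weights as a second output
model = Model(inputs=[inputs], outputs=[output, attention_weights])

def zero_loss(y_true, y_pred):
    # always zero, so the second output never affects training
    return 0.0 * K.mean(y_pred)

model.compile(optimizer='adam', loss=['categorical_crossentropy', zero_loss])

# dummy "ground-truth y" for the attention output, same shape as attention_weights
dummy_attention = np.zeros((inputs_1.shape[0], TIME_STEPS))
model.fit([inputs_1], [outputs, dummy_attention], epochs=1, batch_size=64)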

@Bertorob

@felixhao28 Thank you for the answer. However, if I want to plot the attention after training, I suppose I don't need to add the second "ground-truth y", but I don't get how you are able to do it. Could you please explain how you do that?

@felixhao28
Author

felixhao28 commented Feb 12, 2019

@Bertorob

This part of the code calculates the attention heat map:

    attention_vectors = []
    for i in range(10):
        ... # lines omitted
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()

The attention_weights are not fetched during training; that code does not run until after model.fit.

m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)

You can see that the line above runs just one epoch. If you create a loop around it and change plt.show to plt.savefig, you get a series of images of the attention weights. Ultimately the code looks like this:

for epoch_i in range(n_epochs):
    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)
    attention_vectors = []
    for i in range(10):
        ... # lines omitted
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.savefig(f'attention-weights-{epoch_i}.png')

Edit: here I am still using model.fit instead of model.train_on_batch because the data here is really small and constant across epochs. In reality, though, you might want to use model.train_on_batch for more flexibility.
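
For reference, a per-batch variant with model.train_on_batch could look roughly like this (a sketch, reusing get_activations as above):

batch_size = 64
for epoch_i in range(n_epochs):
    for start in range(0, len(inputs_1), batch_size):
        x_batch = inputs_1[start:start + batch_size]
        y_batch = outputs[start:start + batch_size]
        m.train_on_batch(x_batch, y_batch)
        # fetch the attention weights for this batch if needed
        activations = get_activations(m, x_batch, print_shape_only=True, layer_name='attention_weight')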

@Bertorob

Bertorob commented Feb 12, 2019

OK, I'm figuring something out. One last question: now I tried something like this:

att_weights = []
for i in range(10):
    activations = get_activations(mymodel, np.reshape(x, (1, 100, 30)), print_shape_only=True, layer_name='attention_weight')
    attention_vec = np.mean(activations[0], axis=0).squeeze()
    print('attention =', attention_vec)
    assert (np.sum(attention_vec) - 1.0) < 1e-5
    att_weights.append(attention_vec)
attention_vector_final = np.mean(np.array(att_weights), axis=0)

where x is my input. I actually get my attention vector, but it is filled with ones; maybe I'm still doing something wrong. Why is there a 10 in the for loop?

EDIT: sorry, I had underestimated the relevance of return_sequences=True in the LSTM. Now I'm able to plot the attention map. @felixhao28 thank you!

@LZQthePlane

LZQthePlane commented Mar 18, 2019

@felixhao28 Both this repo and my version of attention are intended for sequence-to-one networks (although it can be easily tweaked for seq2seq by replacing h_t with the current state of the decoder step).
Could you please show the details of implementing a seq2seq network with it? I would so appreciate that. Is it just a matter of setting return_sequences=True?

@felixhao28
Author

@LZQthePlane No, it is more complicated than that. The basic idea is to replace h_t with the current state of the decoder at each decoding step. You might want to find a ready-to-use seq2seq attention implementation instead.
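
A rough sketch of that idea (my own simplification, not a full seq2seq implementation): the block takes the decoder state as an extra input instead of slicing the last encoder state.

from keras.layers import Dense, Activation, dot, concatenate

def attention_block(encoder_states, decoder_state, hidden_size):
    # encoder_states: (batch_size, time_steps, hidden_size)
    # decoder_state:  (batch_size, hidden_size) -- the current decoder hidden state, playing the role of h_t
    # note: in a real decoder loop the Dense layers should be created once and reused across steps
    score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(encoder_states)
    # (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size) => (batch_size, time_steps)
    score = dot([score_first_part, decoder_state], axes=[2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # (batch_size, time_steps, hidden_size) dot (batch_size, time_steps) => (batch_size, hidden_size)
    context_vector = dot([encoder_states, attention_weights], axes=[1, 1], name='context_vector')
    pre_activation = concatenate([context_vector, decoder_state], name='attention_output')
    return Dense(hidden_size, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)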

@OmniaZayed

OmniaZayed commented May 2, 2019

Hi @felixhao28, thank you so much for your code and the explanations above.

I am new to attention and I want to use it after an LSTM for a classification problem. I understood the concepts of attention from this presentation [1] by Sujit Pal:
[1] https://www.slideshare.net/PyData/sujit-pal-applying-the-fourstep-embed-encode-attend-predict-framework-to-predict-document-similarity

After reading your code, I got confused about the type of attention (the theory behind it and what it is called in papers). Does it compute an attention vector over an incoming matrix using a learned context vector?

Hope you could help!

@Goofy321

Goofy321 commented May 8, 2019

@felixhao28
Thank you so much for your code and explanation. I think it is mostly right, except for a slight problem. In my opinion, score_first_part shouldn't depend on h_t, which means the input of the attention_score_vec layer shouldn't include h_t (the last hidden state). What do you think?

@felixhao28
Author

@Goofy321 How do you calculate the attention score then?

@felixhao28
Author

@OmniaZayed My implementation is similar to AttentionMV in Sujit Pal's code except that ctx is the last hidden state.

@Goofy321

Goofy321 commented May 8, 2019

@Goofy321 How do you calculate the attention score then?

I mean the input of the attention_score_vec layer changes to hidden_states[:, :-1, :], and the calculation of the attention score stays the same as yours.

@felixhao28
Author

@Goofy321 I think that works too.

@patebel

patebel commented May 24, 2019

@felixhao28: When I try to run your code I get the following error when calculating the score:

score = dot([score_first_part, h_t], axes=[2, 1], name='attention_score')

ValueError: Shape must be rank 2 but is rank 3 for 'attention_score/MatMul' (op: 'MatMul') with input shapes: [?,20,32], [?,32]

Currently I can't figure out why the dimensions don't match; any ideas? Did anyone else experience the same issue?

@felixhao28
Author

felixhao28 commented Jun 5, 2019

@patebel The shape of h_t should be (batch_size, hidden_size, 1); you are missing the final "1" dimension. Keras used to reshape the output of a Lambda layer to the declared output shape; maybe adding h_t = Reshape((hidden_size, 1))(h_t) will fix it.
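
In context, that fix for the score computation would look something like this (just a sketch of that one change, nothing else):

h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
h_t = Reshape((hidden_size, 1))(h_t)  # make the trailing "1" dimension explicit
score = dot([score_first_part, h_t], axes=[2, 1], name='attention_score')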

@patebel

patebel commented Jun 5, 2019

@felixhao28 Oh yes, I hadn't noticed, thank you!

@uzaymacar

uzaymacar commented Jun 15, 2019

Hi @felixhao28, thanks for your insights and helpfulness in this issue! Reading the original paper by Bahdanau et al. and comparing the operations to this repository, I was really confused until I saw this.
I have a question for you and other people on this thread. I have a language model that gets fed a sequence of length 50 in batches of 32 and tries to predict the next token, where the vocabulary size is 35. Hence, it is a many-to-one application for text generation. Below is the version that generates logical output.

[screenshot: the model version that generates logical output]

However, when I apply the attention layer as you have suggested, before the final dense layer for prediction and with an attention size of 256, I get extremely gibberish output, with certain letters repeated back to back in a nonsensical way. Below is that version.

[screenshot: the model version with the attention layer]

Any ideas why this approach fails? I have also tried without stacking LSTM layers, and it still fails. The only thing I can think of is that the token level for this language model is characters, whereas I have mostly seen attention applied to word-level language models. Any help will be appreciated!

UPDATE: Solved it; it turns out I hadn't set one of the Dense layers to be trainable.

@junhuang-ifast

@felixhao28 thanks for the quick response. I have one other question regarding

I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.

which many have already asked you about.

If we take only the last hidden state, isn't that in a way saying that we are focusing on one specific part (the last part, in this case) of the LSTM output for the many-to-one problem? What if, however, the intuition is that the whole input sequence is important for predicting the one output; would it then be more suitable to use the mean along the time axis instead?

so something like

h_t = Lambda(lambda x: tf.reduce_mean(x, axis=1), output_shape=(unit,), name='mean_hidden_state')

PS: using the mean is just an example; it could be any other function depending on the problem.

@philipperemy
Owner

@felixhao28 thanks a ton for your useful comments! I haven't had time to work on this repo since then. I was pretty new to deep learning when I wrote it. I'm going to invest some time to integrate some of your suggestions and fix the things that need to be fixed :)

@felixhao28
Author

@junhuang-ifast In my application I was using attention in a sequence prediction model, which focuses on just the very next token in the sequence. Taking only the last hidden state worked fine due to the local nature of sequences.

I am not an expert on applications other than sequence prediction. But if I had to guess, you can omit h_t altogether (for example, h_t = I, the identity matrix). This will produce a self-attention vector.

Averaging all hidden states feels strange, because by using attention you are assuming that not all elements in the sequence are equal. It is attention's job to figure out which ones are more important and by how much; using the mean of all states erases that difference. Unless there is global information that differs per sequence, hidden in each element, and you want to sum it up, I don't feel averaging is the way to go. I might be wrong though.

@felixhao28
Author

@philipperemy No problem. We are all learning it as we discuss it.

@junhuang-ifast

@felixhao28 just to be clear, when you say

h_t = I, identity matrix

would that be equivalent to not calculating h_t or the first dot product, i.e.

h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
score = dot([score_first_part, h_t], [2, 1], name='attention_score')

and just letting score = score_first_part?

@felixhao28
Author

@junhuang-ifast yes

@philipperemy
Owner

@felixhao28 Do you have a link to the paper for the attention that was described in the TensorFlow tutorial?

@felixhao28
Author

@philipperemy the original link is gone, but I think these are the papers:
https://arxiv.org/abs/1409.0473
and
https://arxiv.org/abs/1508.04025

@Hessen525

Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.

@philipperemy
Owner

I updated the repo with all the comments of this thread. Thank you all!

@dolevelbaz

Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.

Do you know a good implementation for local attention?

@raghavgurbaxani

raghavgurbaxani commented Sep 3, 2020

@philipperemy @felixhao28

Do you know how I can apply the attention module to a 2D-shaped input? I would like to apply attention after the LSTM layer:

Layer (type)                    Output Shape         Param #     Connected to                     
features (InputLayer)           (None, 16, 1816)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 2048)         31662080    features[0][0]                   
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1024)         2098176     lstm_1[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 1024)         0           dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 120)          123000      leaky_re_lu_2[0][0]              
__________________________________________________________________________________________________
feature_weights (InputLayer)    (None, 120)          0                                            
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 120)          0           dense_3[0][0]                    
                                                                 feature_weights[0][0]            

Total params: 33,883,256
Trainable params: 33,883,256
Non-trainable params: 0
__________________________________________________________________________________________________

I would really appreciate your suggestion on how to modify the attention_3d_block to make it work for a 2D input as well. Thanks.

@philipperemy
Owner

@raghavgurbaxani I answered you in your thread.

@AnanyaO

AnanyaO commented Nov 16, 2020

Hi @philipperemy and @felixhao28. I am trying to apply an attention model on top of an LSTM, where my input training data is an nd array. How should I fit my model in this case? I get the following error because my data is an nd array:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

What changes should I make? I would appreciate your help! Thank you.

@philipperemy
Owner

@AnanyaO did you have a look at the examples here: https://github.com/philipperemy/keras-attention-mechanism/tree/master/examples?

@BuddhsitL

Hi, thanks for all of the comments here; I have learned a lot from them. But may I ask a question? If we use an RNN (or some variant of it), we get a hidden state at each time step, which can then be used to compute the score. But if I don't use an LSTM as the encoder and instead use a 1D CNN as the encoder, what should I do to apply attention? For example, I would like to handle some textual messages, so I first used an embedding layer and then a Conv1D layer. Is there a method I can use to apply the attention mechanism to my model? Thanks so much.
