
Questions on implementation details #14

Closed
felixhao28 opened this issue Mar 5, 2018 · 53 comments

@felixhao28

felixhao28 commented Mar 5, 2018

Update on 2019/2/14, nearly one year later:

The implementation in this repo is definitely bugged. Please refer to my implementation in a reply below for the corrected version. My version has been running in our product since this thread, and it outperforms both a vanilla LSTM without attention and the incorrect version in this repo by a significant margin. I am not the only one raising this question.

Both this repo and my version of attention are intended for sequence-to-one networks (although it can be easily tweaked for seq2seq by replacing h_t with the current state of the decoder step). If you are looking for a ready-to-use attention layer for sequence-to-sequence networks, check this out: https://github.com/farizrahman4u/seq2seq.

============Original answer==============

I am currently working on a text generation task and learned attention from the TensorFlow tutorials. The implementation details seem quite different from your code.

This is how the TensorFlow tutorial describes the process:

[images: the attention equations from the TensorFlow tutorial: attention weights alpha_ts = softmax(score(h_t, h_s)) (1), context vector c_t = sum_s alpha_ts * h_s (2), attention vector a_t = tanh(W_c [c_t; h_t]) (3), with Luong's multiplicative score score(h_t, h_s) = h_t^T W h_s]

If I am understanding it correctly, all learnable parameters in the attention mechanism are stored in W, which has a shape of (rnn_size, rnn_size) (rnn_size is the size of the hidden state). So first you need to use W to calculate the score of each hidden state from both h_t and h_s, but I am not seeing h_t anywhere in your code. Instead, you applied a dense layer to all h_s, which means pre_act (Edit: h_t should be h_s in this equation) becomes the score in the paper. This seems wrong.
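
For concreteness, here is a shape-only NumPy sketch of the process as I understand it from the tutorial (variable names are my own):

import numpy as np

rnn_size, time_steps = 32, 20
h_s = np.random.randn(time_steps, rnn_size)  # source hidden states, one per time step
h_t = np.random.randn(rnn_size)              # current target/decoder hidden state
W = np.random.randn(rnn_size, rnn_size)      # the only learnable attention parameters

score = h_s @ W @ h_t                        # (time_steps,): one score per source state
alpha = np.exp(score) / np.exp(score).sum()  # softmax -> attention weights, eq. (1)
context = alpha @ h_s                        # (rnn_size,): weighted sum of h_s, eq. (2)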

In the next step you element-wise multiply the attention weights with the hidden states, as in equation (2). Then equation (3) is somehow missing.

I noticed the tutorial is about a Seq2Seq (encoder-decoder) model and your code is a plain RNN. Maybe that is why your code is different. Do you have any source on how attention is applied to a non-Seq2Seq network?

Here is your code:

def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)
    a = Reshape((input_dim, TIME_STEPS))(a) # this line is not useful. It's just to know which dimension is what.
    a = Dense(TIME_STEPS, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')
    return output_attention_mul


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model
@felixhao28
Author

I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.

INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False

ATTENTION_SIZE = 128

def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # _t stands for transpose
    hidden_states_t = Permute((2, 1), name='attention_input_t')(hidden_states)
    # hidden_states_t.shape = (batch_size, hidden_size, time_steps)
    # this line is not useful. It's just to know which dimension is what.
    hidden_states_t = Reshape((hidden_size, TIME_STEPS), name='attention_input_reshape')(hidden_states_t)
    # Inside dense layer
    # a (batch_size, hidden_size, time_steps) dot W (time_steps, time_steps) => (batch_size, hidden_size, time_steps)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(TIME_STEPS, use_bias=False, name='attention_score_vec')(hidden_states_t)
    score_first_part_t = Permute((2, 1), name='attention_score_vec_t')(score_first_part)
    #            score_first_part_t         dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size, 1) => (batch_size, time_steps, 1)
    h_t = Lambda(lambda x: x[:, :, -1], output_shape=(hidden_size, 1), name='last_hidden_state')(hidden_states_t)
    score = dot([score_first_part_t, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # if SINGLE_ATTENTION_VECTOR:
    #     a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
    #     a = RepeatVector(hidden_size)(a)
    # (batch_size, hidden_size, time_steps) dot (batch_size, time_steps, 1) => (batch_size, hidden_size, 1)
    context_vector = dot([hidden_states_t, attention_weights], [2, 1], name='context_vector')
    context_vector = Reshape((hidden_size,))(context_vector)
    h_t = Reshape((hidden_size,))(h_t)
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(ATTENTION_SIZE, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector

The interface remained the same, except you don't need the Flatten layer anymore:

def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

The results seem even better than with your original implementation:

[image: training results with the new attention block]

The process of building attention myself has brought me more questions than answers:

  1. What is SINGLE_ATTENTION_VECTOR for? And how can you use K.mean as dimension reduction while all the parameters of a are produced by a Dense layer? Doesn't that just mean all weight parameters get the same gradient for each batch, behave like a single parameter vector, and waste GPU memory storing the full matrix?
  2. I understand your intuition behind APPLY_ATTENTION_BEFORE_LSTM, but that is not what attention is for, and you can achieve pretty much the same result by feeding the fixed-length input into a fully-connected layer and using its output as the input of an LSTM layer. "The data at index 10 being important" is not a good feature to learn through attention; the exact timestep index should be transparent to the attention mechanism.

P.S. I have modified the get_data_recurrent function a little bit to produce one-hot data, as that is closer to my actual needs.

def get_data_recurrent(n, time_steps, input_dim, attention_column=10):
    """
    Data generation. x is purely random except that its value at attention_column equals the target y.
    In practice, the network should learn that the target = x[attention_column].
    Therefore, most of its attention should be focused on the value addressed by attention_column.
    :param n: the number of samples to retrieve.
    :param time_steps: the number of time steps of your series.
    :param input_dim: the number of dimensions of each element in the series.
    :param attention_column: the column linked to the target. Everything else is purely random.
    :return: x: model inputs, y: model targets
    """
    x = np.random.randint(input_dim, size=(n, time_steps))
    x = np.eye(input_dim)[x]
    y = x[:, attention_column, :]
    return x, y
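
For reference, a quick shape check with the constants above (TIME_STEPS = 20, INPUT_DIM = 100); this is just a sanity-check snippet of my own, not part of the script:

x, y = get_data_recurrent(4, time_steps=20, input_dim=100)
print(x.shape)  # (4, 20, 100): each time step is a one-hot vector of size input_dim
print(y.shape)  # (4, 100): the one-hot vector taken from x at attention_column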

@felixhao28
Author

felixhao28 commented Mar 7, 2018

Being confused about why attention can learn information about a specific index in the input sequence, I went on and read the code of the official TensorFlow implementation. I was wrong about the attention_score_vec dense layer, a.k.a. the "memory layer" in the TF implementation. The weight matrix W is not sized (time_steps, time_steps) but rather (hidden_size, hidden_size), as shown here. The correct implementation should be:

def attention_3d_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # Inside dense layer
    #              hidden_states            dot               W            =>           score_first_part
    # (batch_size, time_steps, hidden_size) dot (hidden_size, hidden_size) => (batch_size, time_steps, hidden_size)
    # W is the trainable weight matrix of attention
    # Luong's multiplicative style score
    score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)
    #            score_first_part           dot        last_hidden_state     => attention_weights
    # (batch_size, time_steps, hidden_size) dot   (batch_size, hidden_size)  => (batch_size, time_steps)
    h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(
        hidden_states)
    score = dot([score_first_part, h_t], [2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # (batch_size, time_steps, hidden_size) dot (batch_size, time_steps) => (batch_size, hidden_size)
    context_vector = dot([hidden_states, attention_weights], [1, 1], name='context_vector')
    pre_activation = concatenate([context_vector, h_t], name='attention_output')
    attention_vector = Dense(128, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)
    return attention_vector

score_first_part is named that way because it is only the first part of the score computation (the hidden states multiplied by W).

Surprisingly, even without any hard information about the position within the sequence, the attention model still managed to learn the importance of the 10th element. Now I am super confused.

[image: plot of the learned attention weights per time step]

My guess is that the LSTM somehow learned to "count" to 10 in its hidden state, and that "count" is picked up by the attention. I will need to visualize the inner parameters of the LSTM to be sure.

An interesting finding I made is how the attention is learned over the course of training:

[animation: the attention weights evolving during training]

Full code (except attention_3d_block), shown here just for reference:

from keras.layers import concatenate, dot
from keras.layers.core import *
from keras.layers.recurrent import LSTM
from keras.models import *

from attention_utils import get_activations, get_data_recurrent

INPUT_DIM = 100
TIME_STEPS = 20
# if True, the attention vector is shared across the input_dimensions where the attention is applied.
SINGLE_ATTENTION_VECTOR = True
APPLY_ATTENTION_BEFORE_LSTM = False


def attention_3d_block(hidden_states):
    ...  # same as above


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    # attention_mul = Flatten()(attention_mul)
    output = Dense(INPUT_DIM, activation='sigmoid', name='output')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

if __name__ == '__main__':

    N = 300000
    # N = 300 -> too few = no training
    inputs_1, outputs = get_data_recurrent(N, TIME_STEPS, INPUT_DIM)

    if APPLY_ATTENTION_BEFORE_LSTM:
        m = model_attention_applied_before_lstm()
    else:
        m = model_attention_applied_after_lstm()

    m.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    print(m.summary())

    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)

    attention_vectors = []
    for i in range(10):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')
        attention_vec = np.mean(activations[0], axis=0).squeeze()
        print('attention =', attention_vec)
        assert (np.sum(attention_vec) - 1.0) < 1e-5
        attention_vectors.append(attention_vec)

    attention_vector_final = np.mean(np.array(attention_vectors), axis=0)
    # plot part.
    import matplotlib.pyplot as plt
    import pandas as pd

    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()

@felixhao28
Author

I actually think you were trying to implement self-attention, which is used in text classification. But nonetheless, the weight matrix should be sized (hidden_size, hidden_size) instead of (time_steps, time_steps).

@Wangzihaooooo

@felixhao28 why do you use the layer named "last_hidden_state"?

@felixhao28
Author

@Wangzihaooooo Because attention was first introduced in a sequence-to-sequence model, where the attention score is computed from both h_t and all h_s. In a language/classification model (sequence-to-one), we don't have an h_t representing the information of the Y currently being output, so I just used the last hidden state as h_t.

To be fair, you can remove h_t from the score computation entirely, which then just becomes score = W * h_s, and that is essentially self-attention. It differs from traditional attention in that self-attention only scores how globally important a hidden state is, without the information of the current state of the LSTM.
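
For illustration, here is a minimal sketch of such a self-attention block in the same Keras style. The per-step score is produced by a learned projection of each hidden state on its own; the names and the exact scoring function are my own choices, not code from this repo:

from keras.layers import Dense, Activation, Reshape, dot

def self_attention_block(hidden_states, time_steps):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # score each hidden state on its own: tanh(h_s * W), then a learned scoring vector
    projected = Dense(hidden_size, use_bias=False, activation='tanh', name='self_attention_proj')(hidden_states)
    score = Dense(1, use_bias=False, name='self_attention_score')(projected)   # (batch_size, time_steps, 1)
    score = Reshape((time_steps,), name='self_attention_score_flat')(score)    # (batch_size, time_steps)
    attention_weights = Activation('softmax', name='self_attention_weight')(score)
    # weighted sum of the hidden states over the time axis
    context_vector = dot([hidden_states, attention_weights], axes=[1, 1], name='self_attention_context')
    return context_vector  # (batch_size, hidden_size)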

@Wangzihaooooo

@felixhao28 thank you, I learned a lot from your code.

@rajeev-samalkha

@felixhao28 thank you very much. This is very well explained and removes the complexity around the attention layer. I implemented the code inline for a Seq2Seq model and am able to grab the attention matrix directly. Thanks once again for your help.

Regards
Rajeev

@michetonu

@felixhao28 I'm a bit confused about this part of the code:

    attention_vectors = []
    for i in range(10):
        testing_inputs_1, testing_outputs = get_data_recurrent(1, TIME_STEPS, INPUT_DIM)
        activations = get_activations(m, testing_inputs_1, print_shape_only=True, layer_name='attention_weight')

get_activations() effectively passes testing_inputs_1 through the layer 'attention_weight' and outputs the softmax probabilities for each. However, you are passing the raw inputs without making them pass through the LSTM first; is that on purpose? If so, can you explain why? Since in the model the inputs to the attention layer are the outputs of the LSTM layer(s), I would expect to have to do the same here.

Thanks!

@felixhao28
Author

felixhao28 commented Aug 24, 2018

you are passing the raw input without making them pass through the LSTM first

The input does pass through the LSTM first. A layer is an abstract description of how a tensor should be computed, not the actual tensor being computed. The relationship is more like "class" and "instance", if you are familiar with OOP.

The output here is the actual tensor (instance) of the attention_weight layer, which has already been connected to the previous tensors (the computational graph) by attention_weights = Activation('softmax', name='attention_weight')(score). It is not this specific tensor that takes testing_inputs_1; it is the computational graph, which begins at inputs = Input(shape=(TIME_STEPS, INPUT_DIM,)).
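
In other words, fetching the attention_weight activations just builds a backend function from the model input to that layer's output tensor. A helper like get_activations is typically implemented roughly like this (my own sketch, not necessarily the exact code in attention_utils):

from keras import backend as K

def get_layer_activations(model, model_inputs, layer_name):
    # build a function: model input (+ learning phase) -> output tensor of the named layer
    layer_output = model.get_layer(layer_name).output
    fetch = K.function([model.input, K.learning_phase()], [layer_output])
    return fetch([model_inputs, 0])[0]  # 0 = test phase (disables dropout, etc.)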

@michetonu

@felixhao28 I see, thanks for the explanation!

@farahshamout

Hi, can you clarify what you mean by "Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine"?

@felixhao28
Author

@farahshamout Here is a rather complete explanation of attention over a sequence-to-sequence model. The original idea of attention uses the output of the decoder as h_t, representing the "current decoding state". If you think of the "many-to-one" problem as a special case of the "many-to-many" problem, h_t becomes the last hidden state of the encoder.

@farahshamout

@felixhao28 I see, thanks!

@Bertorob

Hi, I was trying to use your implementation, but I would like to save an attention heat map during training (once per epoch). I tried to add return attention_vector, attention_weights but it is not what I wanted.
Do you have any suggestions?

@felixhao28
Author

@Bertorob I assume you added attention_weights to the outputs of the model. Sadly there is a limitation in Keras that every output needs to be paired with a "ground-truth y" and evaluated by a loss function. So if you intend to collect attention_weights for every batch, you need to provide a dummy, same-sized numpy array as the second "ground-truth y" in model.fit, and a custom loss function for attention_weights that always returns 0.

If you only need the attention heat map once per epoch instead of once per batch, replacing model.fit with model.train_on_batch (or fitting one epoch at a time) is all you need.
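
A minimal sketch of the dummy-output trick described above; it assumes the attention block is modified so that the attention_weights tensor is also available, and the variable names are mine:

import numpy as np
from keras import backend as K
from keras.models import Model

# hypothetical: build the model with attention_weights as a second output
model = Model(inputs=[inputs], outputs=[output, attention_weights])

def zero_loss(y_true, y_pred):
    # always zero, so the second output never affects training
    return 0.0 * K.mean(y_pred)

model.compile(optimizer='adam', loss=['categorical_crossentropy', zero_loss])

# dummy "ground-truth y" for the attention output, same shape as attention_weights
dummy_attention = np.zeros((inputs_1.shape[0], TIME_STEPS))
model.fit([inputs_1], [outputs, dummy_attention], epochs=1, batch_size=64)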

@Bertorob

@felixhao28 Thank you for the answer. However, if I want to plot the attention after training, I suppose I don't need to add the second "ground-truth y", but I don't get how you are able to do it. Could you please explain how you do that?

@felixhao28
Author

felixhao28 commented Feb 12, 2019

@Bertorob

This part of the code calculates the attention heat map:

    attention_vectors = []
    for i in range(10):
        ... # lines omitted
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.show()

The attention_weights are not fetched during training; that code does not run until after model.fit.

m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)

You can see that the line above runs just one epoch. If you create a loop around it and change plt.show to plt.savefig, you get a series of images of the attention weights. Ultimately the code looks like this:

for epoch_i in range(n_epochs):
    m.fit([inputs_1], outputs, epochs=1, batch_size=64, validation_split=0)
    attention_vectors = []
    for i in range(10):
        ... # lines omitted
    pd.DataFrame(attention_vector_final, columns=['attention (%)']).plot(kind='bar',
                                                                         title='Attention Mechanism as '
                                                                               'a function of input'
                                                                               ' dimensions.')
    plt.savefig(f'attention-weights-{epoch_i}.png')

Edit: here I am still using model.fit instead of model.train_on_batch because the data here is really small and constant across epochs. In reality, though, you might want to use model.train_on_batch for more flexibility.
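
For reference, a per-batch variant with model.train_on_batch could look roughly like this (a sketch, reusing get_activations as above):

batch_size = 64
for epoch_i in range(n_epochs):
    for start in range(0, len(inputs_1), batch_size):
        x_batch = inputs_1[start:start + batch_size]
        y_batch = outputs[start:start + batch_size]
        m.train_on_batch(x_batch, y_batch)
        # fetch the attention weights for this batch if needed
        activations = get_activations(m, x_batch, print_shape_only=True, layer_name='attention_weight')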

@Bertorob

Bertorob commented Feb 12, 2019

OK, I'm figuring something out. One last question: now I tried something like this:

att_weights = []
for i in range(10):
    activations = get_activations(mymodel, np.reshape(x, (1, 100, 30)), print_shape_only=True, layer_name='attention_weight')
    attention_vec = np.mean(activations[0], axis=0).squeeze()
    print('attention =', attention_vec)
    assert (np.sum(attention_vec) - 1.0) < 1e-5
    att_weights.append(attention_vec)
attention_vector_final = np.mean(np.array(att_weights), axis=0)

where x is my input. I actually get my attention vector, but it is filled with ones; maybe I'm still doing something wrong. Why is there a 10 in the for loop?

EDIT: sorry, I had underestimated the relevance of return_sequences=True in the LSTM. Now I'm able to plot the attention map. @felixhao28 thank you!

@LZQthePlane

LZQthePlane commented Mar 18, 2019

@felixhao28 Both this repo and my version of attention are intended for sequence-to-one networks (although it can be easily tweaked for seq2seq by replacing h_t with the current state of the decoder step).
Could you please show the details of implementing a seq2seq network with it? I would so appreciate that. Is it just a matter of setting return_sequences=True?

@felixhao28
Author

@LZQthePlane No, it is more complicated than that. The basic idea is to replace h_t with the current state of the decoder at each decoding step. You might want to find a ready-to-use seq2seq attention implementation instead.
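
A rough sketch of that idea (my own simplification, not a full seq2seq implementation): the block takes the decoder state as an extra input instead of slicing the last encoder state.

from keras.layers import Dense, Activation, dot, concatenate

def attention_block(encoder_states, decoder_state, hidden_size):
    # encoder_states: (batch_size, time_steps, hidden_size)
    # decoder_state:  (batch_size, hidden_size) -- the current decoder hidden state, playing the role of h_t
    # note: in a real decoder loop the Dense layers should be created once and reused across steps
    score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(encoder_states)
    # (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size) => (batch_size, time_steps)
    score = dot([score_first_part, decoder_state], axes=[2, 1], name='attention_score')
    attention_weights = Activation('softmax', name='attention_weight')(score)
    # (batch_size, time_steps, hidden_size) dot (batch_size, time_steps) => (batch_size, hidden_size)
    context_vector = dot([encoder_states, attention_weights], axes=[1, 1], name='context_vector')
    pre_activation = concatenate([context_vector, decoder_state], name='attention_output')
    return Dense(hidden_size, use_bias=False, activation='tanh', name='attention_vector')(pre_activation)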

@OmniaZayed

OmniaZayed commented May 2, 2019

Hi @felixhao28, thank you so much for your code and the explanations above.

I am new to attention and I want to use it after an LSTM for a classification problem. I understood the concepts of attention from this presentation [1] by Sujit Pal:
[1] https://www.slideshare.net/PyData/sujit-pal-applying-the-fourstep-embed-encode-attend-predict-framework-to-predict-document-similarity

After reading your code, I got confused about the type of attention (the theory behind it and what it is called in papers). Does it compute an attention vector over an incoming matrix using a learned context vector?

Hope you could help!

@Goofy321

Goofy321 commented May 8, 2019

@felixhao28
Thank you so much for your code and explanation. I think it is mostly right, except for a slight problem. In my opinion, score_first_part shouldn't depend on h_t, which means the input of the attention_score_vec layer shouldn't include h_t (the last hidden state). What do you think?

@felixhao28
Author

@Goofy321 How do you calculate the attention score then?

@felixhao28
Author

@OmniaZayed My implementation is similar to AttentionMV in Sujit Pal's code except that ctx is the last hidden state.

@Goofy321

Goofy321 commented May 8, 2019

@Goofy321 How do you calculate the attention score then?

I mean the input of the attention_score_vec layer changes to hidden_states[:, :-1, :], and the calculation of the attention score stays the same as yours.

@felixhao28
Author

@Goofy321 I think that works too.

@patebel

patebel commented May 24, 2019

@felixhao28: When I try to run your code I get the following error when calculating the score:

score = dot([score_first_part, h_t], axes=[2, 1], name='attention_score')

ValueError: Shape must be rank 2 but is rank 3 for 'attention_score/MatMul' (op: 'MatMul') with input shapes: [?,20,32], [?,32]

Currently I can't figure out why the dimensions don't match; any ideas? Did anyone else experience the same issue?

@felixhao28
Author

felixhao28 commented Jun 5, 2019

@patebel The shape of h_t should be (batch_size, hidden_size, 1); you are missing the final "1" dimension. Keras used to reshape the output of a Lambda layer to the declared output shape; maybe adding h_t = Reshape((hidden_size, 1))(h_t) will fix it.
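
In context, that fix for the score computation would look something like this (just a sketch of that one change, nothing else):

h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
h_t = Reshape((hidden_size, 1))(h_t)  # make the trailing "1" dimension explicit
score = dot([score_first_part, h_t], axes=[2, 1], name='attention_score')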

@patebel

patebel commented Jun 5, 2019

@felixhao28 Oh yes, I hadn't noticed, thank you!

@uzaymacar

uzaymacar commented Jun 15, 2019

Hi @felixhao28, thanks for your insights and helpfulness in this issue! Reading the original paper by Bahdanau et al. and comparing the operations to this repository, I was really confused until I saw this.
I have a question for you and other people on this thread. I have a language model that gets fed a sequence of length 50 in batches of 32 and tries to predict the next token, where the vocabulary size is 35. Hence, it is a many-to-one application for text generation. Below is the version that generates logical output.

[screenshot: the model version that generates logical output]

However, when I apply the attention layer as you have suggested, before the final dense layer for prediction and with an attention size of 256, I get extremely gibberish output, with certain letters repeated back to back in a nonsensical way. Below is that version.

[screenshot: the model version with the attention layer]

Any ideas why this approach fails? I have also tried without stacking LSTM layers, and it still fails. The only thing I can think of is that the token level for this language model is characters, whereas I have mostly seen attention applied to word-level language models. Any help will be appreciated!

UPDATE: Solved it; it turns out I hadn't set one of the Dense layers to be trainable.

@junhuang-ifast

@felixhao28 thanks for the quick response. I have one other question regarding

I implemented my own version of attention + LSTM. Since we don't have h_t in a regular RNN, I just used the last hidden state as h_t, which works just fine.

which many have already asked you about.

If we take only the last hidden state, isn't that in a way saying that we are focusing on one specific part (the last part, in this case) of the LSTM output for the many-to-one problem? What if, however, the intuition is that the whole input sequence is important for predicting the one output; would it then be more suitable to use the mean along the time axis instead?

so something like

h_t = Lambda(lambda x: tf.reduce_mean(x, axis=1), output_shape=(unit,), name='mean_hidden_state')

PS: using the mean is just an example; it could be any other function depending on the problem.

@philipperemy
Owner

@felixhao28 thanks a ton for your useful comments! I haven't had time to work on this repo since then. I was pretty new to deep learning when I wrote it. I'm going to invest some time to integrate some of your suggestions and fix the things that need to be fixed :)

@felixhao28
Author

@junhuang-ifast In my application I was using attention in a sequence prediction model, which focuses on just the very next token in the sequence. Taking only the last hidden state worked fine due to the local nature of sequences.

I am not an expert on applications other than sequence prediction. But if I had to guess, you can omit h_t altogether (for example, h_t = I, the identity matrix). This will produce a self-attention vector.

Averaging all hidden states feels strange, because by using attention you are assuming that not all elements in the sequence are equal. It is attention's job to figure out which ones are more important and by how much; using the mean of all states erases that difference. Unless there is global information that differs per sequence, hidden in each element, and you want to sum it up, I don't feel averaging is the way to go. I might be wrong though.

@felixhao28
Author

@philipperemy No problem. We are all learning it as we discuss it.

@junhuang-ifast

@felixhao28 just to be clear, when you say

h_t = I, identity matrix

would that be equivalent to not calculating h_t or the first dot product, i.e.

h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
score = dot([score_first_part, h_t], [2, 1], name='attention_score')

and just letting score = score_first_part?

@felixhao28
Author

@junhuang-ifast yes

@philipperemy
Owner

@felixhao28 Do you have a link to the paper for the attention that was described in the TensorFlow tutorial?

@felixhao28
Author

@philipperemy the original link is gone, but I think these are the papers:
https://arxiv.org/abs/1409.0473
and
https://arxiv.org/abs/1508.04025

@Hessen525

Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.

@philipperemy
Owner

I updated the repo with all the comments of this thread. Thank you all!

@dolevelbaz

Actually, there are three different versions of attention. felixhao28's version is called global attention and philipperemy's version is called self-attention. The remaining one is called local attention, which is a little different from global attention.

Do you know a good implementation for local attention?

@raghavgurbaxani

raghavgurbaxani commented Sep 3, 2020

@philipperemy @felixhao28

Do you know how I can apply the attention module to a 2D-shaped input? I would like to apply attention after the LSTM layer:

Layer (type)                    Output Shape         Param #     Connected to                     
features (InputLayer)           (None, 16, 1816)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 2048)         31662080    features[0][0]                   
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1024)         2098176     lstm_1[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 1024)         0           dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 120)          123000      leaky_re_lu_2[0][0]              
__________________________________________________________________________________________________
feature_weights (InputLayer)    (None, 120)          0                                            
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 120)          0           dense_3[0][0]                    
                                                                 feature_weights[0][0]            

Total params: 33,883,256
Trainable params: 33,883,256
Non-trainable params: 0
__________________________________________________________________________________________________

I would really appreciate your suggestion on how to modify the attention_3d_block to make it work for a 2D input as well. Thanks.

@philipperemy
Owner

@raghavgurbaxani I answered you in your thread.

@AnanyaO

AnanyaO commented Nov 16, 2020

Hi @philipperemy and @felixhao28. I am trying to apply an attention model on top of an LSTM, where my input training data is an nd array. How should I fit my model in this case? I get the following error because my data is an nd array:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type numpy.ndarray).

What changes should I make? I would appreciate your help! Thank you.

@philipperemy
Owner

@AnanyaO did you have a look at the examples here: https://github.com/philipperemy/keras-attention-mechanism/tree/master/examples?

@BuddhsitL

Hi, thanks for all of the comments here; I have learned a lot from them. But may I ask a question? If we use an RNN (or some variant of it), we get a hidden state at each time step, which can then be used to compute the score. But if I don't use an LSTM as the encoder and instead use a 1D CNN as the encoder, what should I do to apply attention? For example, I would like to handle some textual messages, so I first used an embedding layer and then a Conv1D layer. Is there a method I can use to apply the attention mechanism to my model? Thanks so much.
