attention layer requires another PR #1094

Closed
volkancirik opened this Issue Nov 27, 2015 · 16 comments


volkancirik commented Nov 27, 2015

Hello all,

I implemented a soft-attention layer (both dense and time-distributed). However, it depends on another PR. What's the best practice for creating the new PR?


farizrahman4u commented Nov 27, 2015

Can we see your implementation?


EderSantana commented Nov 27, 2015

@fchollet declared a hold on all PRs that do not address the backend abstraction. We will be working solely on getting Keras up to date and working with both TensorFlow and Theano. So if we accept PRs to master right now, we will have to change them later anyway. I recommend you wait a bit, because there may be changes (although very small, perhaps a single character) in the way we write layers in Keras. For now, I recommend developing your model in parallel in your own GitHub repository. Also, if you can get help from @farizrahman4u, I'm sure the result will be awesome.


volkancirik commented Nov 27, 2015

@farizrahman4u Here is the attention layer. It only works for Graph, though.

It requires a small hack in the core layer, and the naming convention is a bit off, which is where I need help from the Keras community. It will be much better in a couple of commits.

An encoder-decoder with attention would be like this:

model = Graph()
model.add_input(name='input', input_shape=(None, len(chars)))
model.add_node(RNN(128), name='encoder_rnn', input='input')
model.add_node(RepeatVector(MAXLEN), name='recurrent_context', input='encoder_rnn')
model.add_node(RNN(256, return_sequences=True), name='encoder_context', input='input')
model.add_node(TimeDistributedAttention(prev_dim=128, att_dim=64, return_sequences=True), name='attention', inputs=['encoder_context', 'recurrent_context'], merge_mode='join_att')
model.add_node(TimeDistributedDense(len(chars)), name='tdd', input='attention')
model.add_node(Activation('softmax'), name='softmax', input='tdd')
model.add_output(name='output', input='softmax')
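
For context, a graph like this would presumably then be compiled and trained through the dictionary-based Graph interface of that Keras version; a minimal usage sketch (my own, not part of the original comment), assuming hypothetical one-hot arrays X_in and Y_out of shape (n_samples, timesteps, len(chars)):

# minimal usage sketch; X_in and Y_out are hypothetical one-hot arrays
model.compile(optimizer='adam', loss={'output': 'categorical_crossentropy'})
model.fit({'input': X_in, 'output': Y_out}, batch_size=128, nb_epoch=10)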

And a visual-attention model would look like this:

image_model = Graph()
image_model.add_input(name='input', input_shape=Ximages[0].shape)
image_model.add_node(Convolution2D(12, 3, 3, border_mode='full'), name='c1', input='input')
image_model.add_node(Activation('relu'), name='a1', input='c1')
image_model.add_node(Convolution2D(12, 3, 3), name='c2', input='a1')
image_model.add_node(Activation('relu'), name='a2', input='c2')
image_model.add_node(MaxPooling2D(pool_size=(2, 2)), name='p1', input='a2')
image_model.add_node(Convolution2D(10, 3, 3, border_mode='full'), name='c3', input='p1')
image_model.add_node(Activation('relu'), name='a3', input='c3')
image_model.add_node(Convolution2D(10, 3, 3), name='c4', input='a3')
image_model.add_node(Activation('relu'), name='a4', input='c4')
image_model.add_node(PreAttention(), name='pre_attention', input='c4')
image_model.add_node(DenseAttention(att_dim=128), name='dense_attention', input='pre_attention')
image_model.add_node(Dense(answer_size), name='d', input='dense_attention')
image_model.add_node(Activation('softmax'), name='softmax', input='d')
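
(The visual-attention snippet stops before declaring an output node; presumably it would be closed out the same way as the first example, e.g.:)

# assumed completion, mirroring the output declaration of the first example
image_model.add_output(name='output', input='softmax')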

farizrahman4u commented Nov 28, 2015

@wolet You should be using the LambdaMerge layer instead of the 'hack'. That way, this could be merged easily into Keras without changing the core layers (after we are done with TensorFlow, of course), and it would work seamlessly in both Sequential and Graph models.


volkancirik commented Nov 29, 2015

@farizrahman4u I did not know about LambdaMerge, thanks for pointing it out!


elanmart commented Nov 29, 2015

You could also see #1051


volkancirik commented Dec 1, 2015

@farizrahman4u I could not find a way to use LambdaMerge in Graph models. Would you mind giving me a simple example?


farizrahman4u commented Dec 1, 2015

(Not tested)

def func(X):
    # your merge function here. X is a list of input tensors.
    # this function should output the merged tensor
    pass

def output_shape(shapes):
    # shapes = list of output shapes of the input tensors
    # this function should output the shape of the merged tensor
    pass

input1 = Dense(....)
input2 = Dense(....)

lambda_merge = LambdaMerge([input1, input2], func, output_shape)

graph = Graph()
graph.add_input(input1, name='input1')
graph.add_input(input2, name='input2')

graph.add_node(lambda_merge, name='lambda_merge')
graph.add_node(Dense(....), name='dense1', input='lambda_merge')
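
To make that skeleton concrete, here is a minimal sketch of the two callbacks (my own illustration, not from the comment above), assuming the Theano backend of that era and a simple feature-wise concatenation of two 2D (batch, features) tensors:

import theano.tensor as T

def func(X):
    # X is a list of 2D input tensors (batch, features); concatenate them feature-wise
    return T.concatenate(X, axis=1)

def output_shape(shapes):
    # shapes is a list of input shapes, e.g. [(None, 64), (None, 64)]
    return (shapes[0][0], shapes[0][1] + shapes[1][1])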

jfsantos commented Jan 8, 2016

@wolet do you plan to convert this code to the generic Keras backend (which supports both TensorFlow and Theano)? I am going to need an attention layer in the near future, and you have already put so much work into this, so I don't see a reason to implement it from scratch :)


volkancirik commented Jan 14, 2016

@jfsantos I haven't used the new API since the TensorFlow changes. I will check the new API and see how I can contribute.


niitsuma commented Jan 29, 2016

In my environment

graph.add_input(input1, name='input1')

causes

TypeError: add_input() got multiple values for keyword argument 'name'
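
That error is presumably because, in the Graph API of that Keras version, add_input() takes the node name (and an input shape) as its first arguments rather than a layer object. A hedged correction of the earlier sketch (the 64/32 dimensions are made up) might look like:

# assumed correction of the earlier sketch; dimensions are hypothetical
graph = Graph()
graph.add_input(name='input1', input_shape=(64,))
graph.add_input(name='input2', input_shape=(64,))
graph.add_node(Dense(32), name='dense1', input='input1')
graph.add_node(Dense(32), name='dense2', input='input2')
# LambdaMerge wraps existing graph nodes, as in the usage further below
lambda_merge = LambdaMerge([graph.nodes['dense1'], graph.nodes['dense2']], func, output_shape)
graph.add_node(lambda_merge, name='lambda_merge')
graph.add_node(Dense(32), name='dense3', input='lambda_merge')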

pasky commented Feb 7, 2016

I figured that someone might find an example of an attention layer like in 1506.03340 or 1511.04108 useful, so here is mine.

The setup: transforming a sequence of embeddings e1s into e1sm by multiplying it with a per-token attention. The attention is determined by similarity with another embedding e0a, and is focused on a single point or a few points in the sequence by a softmax, as in the papers above. The token attention scalar can be generated in a couple of ways; the original papers use w*tanh(e0a + W*e1s).

    model.add_node(name='e1sa', input='e1s',  # consider another nonlinearity here
                   layer=TimeDistributedDense(input_dim=int(N*sdim), output_dim=int(N*adim), W_regularizer=l2(l2reg)))
    model.add_node(name='e0sa', input='e0a',
                   layer=RepeatVector(s1pad))
    model.add_node(name='esa[0]', inputs=['e0sa', 'e1sa'], merge_mode='sum',
                   layer=Activation(T.tanh))  # T is theano.tensor
    model.add_node(name='esa[1]', input='esa[0]',
                   layer=TimeDistributedDense(input_dim=int(N*adim), output_dim=1, W_regularizer=l2(l2reg)))
    model.add_node(name='esa[2]', input='esa[1]',
                   layer=Flatten(input_shape=(s1pad, 1)))
    model.add_node(name='esa[3]', input='esa[2]',
                   layer=Activation('softmax'))
    # and now just multiply timewise
    model.add_node(name='esa[4]', input='esa[3]',
                   layer=RepeatVector(int(N*sdim)))
    model.add_node(name='esa', input='esa[4]',
                   layer=Permute((2,1)))
    model.add_node(name='e1sm', inputs=['e1s', 'esa'], merge_mode='mul',
                   layer=Activation('linear'))

Posting it here as it was a bit difficult for me to figure out as a Keras/Theano newbie. I don't know if it's worth making a dedicated layer in the Keras API for this, though. For one, because in my experiments I've found dot-product similarity to work a lot better than the weighted sum (but I'm still researching):

    def batched_batched_dot(s):
        """ from (x,y,z)-shaped pair, produce (x,y)-shaped pair that replaces the z-vector pairs by their dot-products """
        import theano
        import theano.tensor as T
        return theano.scan(fn=lambda xm, ym: T.batched_dot(xm, ym),
                           outputs_info=None, sequences=s, non_sequences=None)[0]
    model.add_node(name='esa[0]',  # nested batched_dot
               layer=LambdaMerge([model.nodes['e0sa'], model.nodes['e1sa']],
                                 batched_batched_dot,
                                 lambda s: (s[1][0], s[1][1])))
    model.add_node(name='esa[3]', input='esa[0]',
                   layer=Activation('softmax'))

I hope to soon submit a PR adding an example that does some serious stuff with NLP embedding sequences and includes the attention mechanism, though!


pasky commented Feb 12, 2016

I just wanted to add a link to a standalone example of that, which may be also easier to read:

https://github.com/brmson/dataset-sts/blob/master/examples/anssel_attn.py

(after a little more work, I intend to contribute a simplified version as an example in Keras itself)


ylqfp commented Apr 30, 2016

@pasky Thanks a lot!


thomasjungblut commented May 16, 2017

@pasky / @wolet did you ever port this to the generic Keras backend?

stale bot added the stale label Aug 15, 2017


stale bot commented Aug 15, 2017

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
