
THIS IMPLEMENTATION's GOT at least 2 apparent MISTAKEs! #5

Open
SimZhou opened this issue Apr 2, 2020 · 2 comments

Comments

SimZhou commented Apr 2, 2020

As far as I've read, the implementations of attention at both the word and sentence level are WRONG:

## The word RNN model for generating a sentence vector
class WordRNN(nn.Module):
    def __init__(self, vocab_size,embedsize, batch_size, hid_size):
        super(WordRNN, self).__init__()
        self.batch_size = batch_size
        self.embedsize = embedsize
        self.hid_size = hid_size
        ## Word Encoder
        self.embed = nn.Embedding(vocab_size, embedsize)
        self.wordRNN = nn.GRU(embedsize, hid_size, bidirectional=True)
        ## Word Attention
        self.wordattn = nn.Linear(2*hid_size, 2*hid_size)
        self.attn_combine = nn.Linear(2*hid_size, 2*hid_size,bias=False)
    def forward(self,inp, hid_state):
        emb_out  = self.embed(inp)

        out_state, hid_state = self.wordRNN(emb_out, hid_state)

        word_annotation = self.wordattn(out_state)
        attn = F.softmax(self.attn_combine(word_annotation),dim=1)

        sent = attention_mul(out_state,attn)
        return sent, hid_state

Look at the fourth line from the bottom: attn = F.softmax(self.attn_combine(word_annotation),dim=1).

By PyTorch convention, if you don't pass batch_first=True to the GRU, out_state has shape (n_steps, batch_size, out_dims).
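
(For reference, a quick standalone check of this shape convention; this snippet is mine, not from the repo:)

import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=4, bidirectional=True)  # batch_first defaults to False
x = torch.randn(10, 3, 8)       # (n_steps=10, batch_size=3, input_size=8)
out_state, hid_state = gru(x)
print(out_state.shape)          # torch.Size([10, 3, 8]) -> (n_steps, batch_size, 2*hid_size)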

As the paper states, the softmax should be applied across time steps (so that the attention weights over all time steps of a sequence sum to 1), whereas THE IMPLEMENTATION applies F.softmax ACROSS THE BATCH (dim=1), which is incorrect! It should be changed to dim=0.

The same mistake appears in the sentence-level attention.
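
To make the fix concrete, here is a minimal sketch of the word-level module as I would correct it (my own reading, not the repo author's code): the softmax is taken over dim=0, and I replace the repo's attention_mul helper with an explicit weighted sum over time steps, which is what I assume it is meant to compute.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WordRNNFixed(nn.Module):
    def __init__(self, vocab_size, embedsize, batch_size, hid_size):
        super(WordRNNFixed, self).__init__()
        ## Word Encoder
        self.embed = nn.Embedding(vocab_size, embedsize)
        self.wordRNN = nn.GRU(embedsize, hid_size, bidirectional=True)   # batch_first=False
        ## Word Attention
        self.wordattn = nn.Linear(2*hid_size, 2*hid_size)
        self.attn_combine = nn.Linear(2*hid_size, 2*hid_size, bias=False)

    def forward(self, inp, hid_state):
        emb_out = self.embed(inp)                                   # (n_steps, batch, embedsize)
        out_state, hid_state = self.wordRNN(emb_out, hid_state)     # (n_steps, batch, 2*hid_size)
        word_annotation = self.wordattn(out_state)
        # softmax over the time-step dimension (dim=0), so the weights for each
        # sequence sum to 1 across its words
        attn = F.softmax(self.attn_combine(word_annotation), dim=0)
        # weighted sum over time steps -> one sentence vector per batch element
        # (standing in for the repo's attention_mul helper)
        sent = (attn * out_state).sum(dim=0, keepdim=True)          # (1, batch, 2*hid_size)
        return sent, hid_state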

Maybe this is one reason for the non-convergent, fluctuating test accuracy.
I am reading through the code and trying to put together a corrected version of this implementation; I will report back later.

SimZhou changed the title from "This implementation's got at least 2 apparent MISTAKEs!" to "THIS IMPLEMENTATION's GOT at least 2 apparent MISTAKEs!" on Apr 2, 2020

SimZhou commented Apr 3, 2020

Sorry, I'm giving up on correcting the code; it's a bit redundant...

@sumba101

> Sorry, I'm giving up on correcting the code; it's a bit redundant...

Do you have a correction to the code that you can provide?
