In AutoregressiveWrapper, if no attention mask is supplied, create a lower triangular one #74
Hi @lucidrains
Thank you for your amazing work with all of your repositories!
I don't know if this behavior fits the minimal philosophy of this implementation, but when training in an autoregressive fashion the future tokens are usually masked to prevent the transformer from "seeing into the future". I added a default lower triangular attention mask to the `AutoregressiveWrapper` forward logic to implement this idea. I tested it in a decoder-only architecture like the one from the `enwik8` example and it works. Reading the code, it should also work in an encoder-decoder architecture with `cross_attend = True`, but I haven't tested it.
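For reference, here is a minimal sketch of the idea in PyTorch, not the exact diff from this PR: if the caller supplies no attention mask, build a boolean lower triangular one so each token can attend only to itself and earlier positions. The `self.net` attribute, the `mask` keyword, and the mask's shape are assumptions for illustration; the real `AutoregressiveWrapper` signature may differ.

```python
import torch

def default_causal_mask(seq_len: int, device=None) -> torch.Tensor:
    # True where attention is allowed: position i may attend to position j only if j <= i
    return torch.ones((seq_len, seq_len), dtype=torch.bool, device=device).tril()

# Illustrative forward logic (hypothetical names, not the repo's exact signature):
def forward(self, x, mask=None, **kwargs):
    if mask is None:
        # No mask supplied: default to a lower triangular (causal) mask
        # so tokens cannot "see into the future" during training.
        mask = default_causal_mask(x.shape[1], device=x.device)
    return self.net(x, mask=mask, **kwargs)
```

For a sequence length of 4, `default_causal_mask(4)` yields a 4×4 boolean matrix where row `i` is `True` only in columns `0..i`, which is exactly the lower triangular pattern described above.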