In AutoregressiveWrapper, if no attention mask is supplied, create a lower triangular one #74
Hi @lucidrains
Thank you for your amazing work with all of your repositories!
I don't know if this behavior fits the minimal philosophy of this implementation, but when training in an autoregressive fashion the future tokens are usually masked to prevent the transformer from "seeing into the future". I added a default lower triangular attention mask to the `AutoregressiveWrapper` forward logic to implement this idea. I tested it in a decoder-only architecture like the one from the `enwik8` example and it works. Reading the code, it should also work in an encoder-decoder architecture with `cross_attend = True`, but I haven't tested it.
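For reference, here is a minimal sketch of the idea in PyTorch, not the exact diff from this PR: if the caller supplies no attention mask, build a boolean lower triangular one so each token can attend only to itself and earlier positions. The `self.net` attribute, the `mask` keyword, and the mask's shape are assumptions for illustration; the real `AutoregressiveWrapper` signature may differ.

```python
import torch

def default_causal_mask(seq_len: int, device=None) -> torch.Tensor:
    # True where attention is allowed: position i may attend to position j only if j <= i
    return torch.ones((seq_len, seq_len), dtype=torch.bool, device=device).tril()

# Illustrative forward logic (hypothetical names, not the repo's exact signature):
def forward(self, x, mask=None, **kwargs):
    if mask is None:
        # No mask supplied: default to a lower triangular (causal) mask
        # so tokens cannot "see into the future" during training.
        mask = default_causal_mask(x.shape[1], device=x.device)
    return self.net(x, mask=mask, **kwargs)
```

For a sequence length of 4, `default_causal_mask(4)` yields a 4×4 boolean matrix where row `i` is `True` only in columns `0..i`, which is exactly the lower triangular pattern described above.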