Further improvements to attention backward #170
Merged
Backward kernel where threads reuse data in registers to reduce memory transfers.
This PR is built on top of my previous PRs, which should be merged first. Once that is done, I'll rebase and remove the draft status here. I do need the changes to backward pass memory allocation, though: without them I get OOMs and cannot profile the backward pass. I also need to be able to assume that the kernel writes (=) its gradients instead of accumulating (+=) them.
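To illustrate the write (=) versus accumulate (+=) distinction, here is a minimal CPU sketch in plain C. It is not the kernel from this PR: the function names and the toy operation `y = scale * x` are hypothetical and chosen only to show why the assumption matters. An accumulating backward needs its output buffer zeroed first, while a writing backward can be handed a reused buffer with stale contents.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a backward pass: input gradient of y = scale * x. */

/* Accumulating variant: adds into dx, so the result is only correct if dx
   was zeroed (or intentionally holds a partial sum) before the call. */
void backward_accumulate(float *dx, const float *dy, float scale, size_t n) {
    assert(dx != NULL && dy != NULL);
    for (size_t i = 0; i < n; i++) {
        dx[i] += scale * dy[i];
    }
}

/* Writing variant: overwrites dx, so stale contents of a reused scratch
   buffer do not affect the result and no zeroing pass is needed. */
void backward_write(float *dx, const float *dy, float scale, size_t n) {
    assert(dx != NULL && dy != NULL);
    for (size_t i = 0; i < n; i++) {
        dx[i] = scale * dy[i];
    }
}
```

With a buffer that still holds old values, `backward_write` produces the gradient directly, whereas `backward_accumulate` adds the gradient on top of whatever was already there.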