Making this work with relative position bias from XTransformers #5

Open
pfeatherstone opened this issue Dec 2, 2022 · 5 comments

pfeatherstone commented Dec 2, 2022

Is there a way to make this work with RelativePositionBias? Currently it produces an attention bias of size $B \times H \times N^2$ (i.e. shape [B, H, N, N]), where B is the batch size, H is the number of heads and N is the sequence length. Can this bias be chunked and computed per chunk instead?
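For reference, here is a minimal sketch of what computing the bias per (query chunk, key chunk) tile could look like. It assumes a T5-style learned bucket table (roughly what RelativePositionBias does), with a deliberately simplified bucketing scheme; it is not the actual x-transformers code.

```python
import torch

# Hypothetical sketch: compute a T5-style relative position bias for a single
# (query chunk, key chunk) tile, so the full B x H x N x N bias is never built.
# `rel_emb` is assumed to be a torch.nn.Embedding(num_buckets, heads) bias table;
# the bucketing below is a naive clamp-and-scale, not the real T5/x-transformers scheme.
def chunked_rel_pos_bias(rel_emb, q_start, q_len, k_start, k_len,
                         num_buckets=32, max_distance=128):
    q_pos = torch.arange(q_start, q_start + q_len)
    k_pos = torch.arange(k_start, k_start + k_len)
    rel = k_pos[None, :] - q_pos[:, None]                      # [q_len, k_len] offsets
    rel = rel.clamp(-max_distance, max_distance) + max_distance
    bucket = (rel * (num_buckets - 1)) // (2 * max_distance)   # [q_len, k_len] bucket indices
    bias = rel_emb(bucket)                                      # [q_len, k_len, heads]
    return bias.permute(2, 0, 1).unsqueeze(0)                   # [1, heads, q_len, k_len]

# Usage sketch: inside a chunked attention loop, add this tile to the chunk's
# attention logits instead of slicing a precomputed full-size [B, H, N, N] bias.
# rel_emb = torch.nn.Embedding(32, heads)
# tile = chunked_rel_pos_bias(rel_emb, q_start=0, q_len=64, k_start=0, k_len=64)
```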

lucidrains (Owner) commented Dec 5, 2022

@pfeatherstone if you are working with 1d sequences, the best approach would be dynamic positional bias (https://github.com/lucidrains/x-transformers#dynamic-positional-bias), which is O(n)

the other alternative is the ALiBi positional embedding, which only needs to be materialized within each block, but may come with some limitations (unidirectional, forced local attending, etc.)
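a rough sketch of the per-block idea for ALiBi (using the common powers-of-two slope heuristic; not code from this repo or x-transformers): the bias depends only on relative distance, so each tile can be built on the fly

```python
import torch

# Sketch: an ALiBi-style causal bias for one (query chunk, key chunk) tile.
# Slopes follow the usual 2^(-8 * h / heads) heuristic, which may differ in
# detail from actual implementations.
def alibi_block_bias(heads, q_start, q_len, k_start, k_len):
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / heads) for h in range(heads)])
    q_pos = torch.arange(q_start, q_start + q_len)
    k_pos = torch.arange(k_start, k_start + k_len)
    dist = (q_pos[:, None] - k_pos[None, :]).clamp(min=0)   # causal distance; 0 for future keys
    return -slopes[:, None, None] * dist                    # [heads, q_len, k_len]
```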

lucidrains (Owner) commented:

@pfeatherstone which module are you using from this repository?

you should be using the CUDA implementation from here

pfeatherstone (Author) commented:

@lucidrains Actually, I've just realized that you can pass attn_bias to both the normal and the memory-efficient attention, and it can have dimensions up to [B, H, L, S], where L is the target length and S is the context length. So you can use that for any additional masking (by filling with -float('inf')) or positional encoding. Correct?
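If so, a sketch of that usage (the `attention` call below is a placeholder for whichever implementation accepts attn_bias, not a real function from this repo):

```python
import torch

# Sketch: fold a positional bias and a key padding mask into one additive
# attn_bias of shape [B, H, L, S]; masked keys get -inf so they vanish after
# the softmax. `attention` is a placeholder for either the normal or the
# memory-efficient implementation that accepts an attn_bias argument.
B, H, L, S, D = 2, 8, 128, 128, 64
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

pos_bias = torch.randn(1, H, L, S)                    # e.g. a relative position bias
key_mask = torch.ones(B, 1, 1, S, dtype=torch.bool)   # True = attend, False = masked out
attn_bias = pos_bias.expand(B, H, L, S).masked_fill(~key_mask, -float('inf'))

# out = attention(q, k, v, attn_bias=attn_bias)       # placeholder call
```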

pfeatherstone (Author) commented Dec 6, 2022

I need to use something that can be exported to ONNX. I don't think https://github.com/hazyResearch/flash-attention will work through torch.onnx.export().

Memory-efficient attention is great because it yields exactly the same result as normal attention, so I can train with the memory-efficient option turned on, then export to ONNX using normal attention.

Correct me if I'm wrong, but I don't think this will work with flash attention?
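If that is right, the workflow could look roughly like this (a sketch; the module and argument names are assumptions, not this repo's actual API):

```python
import torch

# Sketch: train with a memory-efficient attention path, then flip to plain
# attention before torch.onnx.export, relying on the two paths producing
# numerically identical outputs. Both attention callables are assumed to
# share the signature (q, k, v, attn_bias=None).
class SwitchableAttention(torch.nn.Module):
    def __init__(self, attend_efficient, attend_plain):
        super().__init__()
        self.attend_efficient = attend_efficient   # chunked / memory-efficient attention
        self.attend_plain = attend_plain           # standard softmax(QK^T / sqrt(d)) V attention
        self.use_efficient = True

    def forward(self, q, k, v, attn_bias=None):
        attend = self.attend_efficient if self.use_efficient else self.attend_plain
        return attend(q, k, v, attn_bias=attn_bias)

# During training: model.use_efficient = True
# Before export:   model.use_efficient = False
#                  torch.onnx.export(model, (q, k, v), "model.onnx")
```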

pfeatherstone (Author) commented:

I've also kind of given up on the memory-efficient implementation; it is cripplingly slow to train.
