
Added dropout support to memory efficient variant #6

Merged
merged 1 commit into lucidrains:main on Dec 30, 2022

Conversation

usryokousha
Contributor

Hey Phil,

I have been using this repository for a project and wanted to add dropout for completeness. I checked for consistency with the perceiver-ar implementation. I hope this is helpful.

-Matt
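
For readers landing here later, a minimal sketch (not the actual diff in this PR) of where conventional attention dropout sits, namely on the normalized attention weights after the softmax; in the chunked, memory-efficient variant the same mask would need to be applied per key/value chunk so the result matches the non-chunked computation:

```python
# minimal sketch of standard (unstructured) attention dropout
# not the PR's actual code; shapes and argument names are illustrative
import torch
import torch.nn.functional as F

def attention_with_dropout(q, k, v, dropout_p = 0., training = True):
    # q, k, v: (batch, heads, seq, dim_head)
    scale = q.shape[-1] ** -0.5
    sim = torch.einsum('b h i d, b h j d -> b h i j', q, k) * scale
    attn = sim.softmax(dim = -1)
    # dropout is applied to the attention matrix, after the softmax
    attn = F.dropout(attn, p = dropout_p, training = training)
    return torch.einsum('b h i j, b h j d -> b h i d', attn, v)
```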

@lucidrains
Owner

@usryokousha oh sure, thanks Matt! just for your information, the field is slowly starting to realize that traditional dropout is pretty useless

however, structured dropout, like https://github.com/lucidrains/x-transformers#forgetful-causal-mask or https://arxiv.org/abs/2206.00826, can still be used, but it would not need to exist within the attention operation
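
A hedged sketch of the structured alternative mentioned above (forgetful causal masking, as described in the x-transformers README): the dropout is expressed as a random key mask generated outside the attention operation, so it composes with any attention kernel instead of living inside it. Function and argument names here are illustrative.

```python
import torch

def forgetful_causal_mask(batch, seq_len, mask_prob = 0.1, device = None):
    # during training, randomly drop a fraction of past tokens from the keys
    # True = keep, False = masked out (same convention as a key padding mask)
    return torch.rand(batch, seq_len, device = device) >= mask_prob

# usage: pass this as the key mask to any attention implementation,
# alongside (not instead of) the usual causal mask
mask = forgetful_causal_mask(2, 1024, mask_prob = 0.15)
```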

@lucidrains
Owner

let's merge it for completeness' sake though! hope rabe or flash attention is working well for your project! just one more note: you should use the CUDA implementation here for optimal performance
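
For illustration only, and assuming PyTorch >= 2.0 rather than the specific CUDA package referenced above: the built-in scaled_dot_product_attention dispatches to a fused flash-attention CUDA kernel when the inputs allow it, and already accepts a dropout_p argument, so attention dropout does not have to be hand-rolled.

```python
# illustrative usage of PyTorch's fused attention (not the package referenced above)
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 1024, 64, device = 'cuda', dtype = torch.float16)
k = torch.randn(2, 8, 1024, 64, device = 'cuda', dtype = torch.float16)
v = torch.randn(2, 8, 1024, 64, device = 'cuda', dtype = torch.float16)

# dropout_p applies standard attention dropout inside the fused kernel
out = F.scaled_dot_product_attention(q, k, v, dropout_p = 0.1, is_causal = True)
```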

@lucidrains lucidrains merged commit c37fbd2 into lucidrains:main Dec 30, 2022
@usryokousha
Contributor Author

Phil, thanks for pointing out the two papers on dropout! I wonder how the Bayesformer paper's proposed dropout holds up in non-causal attention. In my own experiments I have always turned dropout off because I found it hurt training. The CUDA-optimized flash attention package looks very appealing, and it is going to help in my future projects for sure!
