
Added packed layernorm_forward #513

Open · wants to merge 2 commits into master
Conversation

@ChrisDryden (Contributor) commented Jun 2, 2024

This implements packed data types for the layernorm forward kernel, with an associated speedup of around 50% for this kernel in the dev files. It is waiting on the PR that converts this kernel's data types to floatX to merge before the test kernels for this can be merged.

Co-authored-by: @JaneIllario
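
The PR diff itself is not shown on this page, so for reference, here is a minimal sketch of what a packed (vectorized) layernorm forward kernel can look like. It is not the PR's actual code: it uses plain float with float4 as the packed type, assumes the channel count C is a multiple of 4 and a power-of-two block size, and the kernel name and launch configuration are illustrative. Per the description, the real kernel targets floatX with 128-bit packed accesses.

```cuda
// Sketch only, not the PR's code: layernorm forward with 128-bit (float4)
// loads/stores. One block per row; assumes C % 4 == 0 and blockDim.x is a
// power of two. Launch (illustrative):
//   layernorm_forward_packed<<<N, 256, 256 * sizeof(float)>>>(out, inp, weight, bias, N, C);
__global__ void layernorm_forward_packed(float* out, const float* inp,
                                         const float* weight, const float* bias,
                                         int N, int C) {
    extern __shared__ float shared[];  // blockDim.x floats for the reductions
    int row = blockIdx.x;
    int tid = threadIdx.x;
    const float4* x4 = reinterpret_cast<const float4*>(inp + (size_t)row * C);
    const float4* w4 = reinterpret_cast<const float4*>(weight);
    const float4* b4 = reinterpret_cast<const float4*>(bias);
    float4* y4 = reinterpret_cast<float4*>(out + (size_t)row * C);
    int C4 = C / 4;  // number of float4 packs per row

    // pass 1: mean; each thread sums its strided float4 chunks
    float sum = 0.0f;
    for (int i = tid; i < C4; i += blockDim.x) {
        float4 v = x4[i];  // one 128-bit load instead of four 32-bit loads
        sum += v.x + v.y + v.z + v.w;
    }
    shared[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shared[tid] += shared[tid + s];
        __syncthreads();
    }
    float mean = shared[0] / C;
    __syncthreads();

    // pass 2: variance, same reduction pattern
    float vsum = 0.0f;
    for (int i = tid; i < C4; i += blockDim.x) {
        float4 v = x4[i];
        float dx = v.x - mean, dy = v.y - mean, dz = v.z - mean, dw = v.w - mean;
        vsum += dx * dx + dy * dy + dz * dz + dw * dw;
    }
    shared[tid] = vsum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shared[tid] += shared[tid + s];
        __syncthreads();
    }
    float rstd = rsqrtf(shared[0] / C + 1e-5f);

    // pass 3: normalize, scale, shift; one 128-bit store per pack
    for (int i = tid; i < C4; i += blockDim.x) {
        float4 v = x4[i], w = w4[i], b = b4[i];
        float4 o;
        o.x = (v.x - mean) * rstd * w.x + b.x;
        o.y = (v.y - mean) * rstd * w.y + b.y;
        o.z = (v.z - mean) * rstd * w.z + b.z;
        o.w = (v.w - mean) * rstd * w.w + b.w;
        y4[i] = o;
    }
}
```

The point of the float4 reinterpret is that the compiler can emit 128-bit load/store instructions, quartering the instruction count on the memory path of this bandwidth-bound kernel; that is typically where this kind of speedup comes from.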

@ChrisDryden (Contributor, Author):

#319 is the PR that adds floatX to the dev CUDA kernel for this.

@ChrisDryden (Contributor, Author) commented Jun 3, 2024

I am embarrassed: when I ran this yesterday I was getting numbers closer to 600 GB/s throughput for both kernel 6 and kernel 9, and around 900 GB/s for kernel 8, but when I run those kernels now I get around 950 GB/s, and 1050 GB/s for kernel 8. If that really is the case and you'd rather keep the simplicity and not have the packing, there is also the option of going with kernel 9, which is now the same speed as kernel 6: it gives the same performance as before but splits the work into two simpler kernels.
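
For context on the "two simpler kernels" option: a split version computes the per-row statistics in one kernel and applies the normalization in a second. The sketch below is illustrative only; the kernel names, launch shapes, and the naive per-row stats loop are assumptions, not the repo's actual kernel 9.

```cuda
// Illustrative two-kernel split (not the repo's actual kernel 9):
// kernel A computes per-row mean and rstd, kernel B applies the affine normalization.
__global__ void layernorm_stats(float* mean, float* rstd,
                                const float* inp, int N, int C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row (naive)
    if (row >= N) return;
    const float* x = inp + (size_t)row * C;
    float m = 0.0f;
    for (int i = 0; i < C; i++) m += x[i];
    m /= C;
    float v = 0.0f;
    for (int i = 0; i < C; i++) { float d = x[i] - m; v += d * d; }
    mean[row] = m;
    rstd[row] = rsqrtf(v / C + 1e-5f);
}

__global__ void layernorm_apply(float* out, const float* inp,
                                const float* mean, const float* rstd,
                                const float* weight, const float* bias,
                                int N, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (idx >= N * C) return;
    int row = idx / C, col = idx % C;
    out[idx] = (inp[idx] - mean[row]) * rstd[row] * weight[col] + bias[col];
}

// launch sketch:
//   layernorm_stats<<<(N + 255) / 256, 256>>>(mean, rstd, inp, N, C);
//   layernorm_apply<<<(N * C + 255) / 256, 256>>>(out, inp, mean, rstd, weight, bias, N, C);
```

The second kernel is a trivially parallel elementwise pass, so the complexity concentrates in the small stats kernel, which is the appeal of the split when the fused version's gains are in doubt.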
