Modified version of ademeure's fused gelu_forward kernel #363

Open

wants to merge 1 commit into master
Conversation

ChrisDryden (Contributor)

I was experimenting with the fused gelu kernel, combining it with the previously built code path for non-gelu matmuls. When running it locally it appeared to give a performance benefit and produced the correct losses for bf16.

I was hoping to get a second opinion on whether my observations about this change are correct.
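(For context, here is a minimal sketch of what the fused path is doing, under the assumption that the matmul goes through cuBLASLt: one attribute on the matmul descriptor selects a bias-only versus bias+GELU epilogue, and a separate attribute supplies the bias pointer. The function name and setup below are illustrative, not the actual diff in this PR.)

```c
// Illustrative only: configure a cuBLASLt matmul descriptor so the bias add
// (and optionally the GELU) is fused into the matmul epilogue instead of
// running as a separate kernel afterwards.
#include <cublasLt.h>

void set_matmul_epilogue(cublasLtMatmulDesc_t desc, const float* bias, int fuse_gelu) {
    // Bias-only epilogue for plain matmuls, bias + GELU for the projection
    // that feeds a GELU activation.
    cublasLtEpilogue_t epilogue = fuse_gelu ? CUBLASLT_EPILOGUE_GELU_BIAS
                                            : CUBLASLT_EPILOGUE_BIAS;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));
    // The bias pointer must be set explicitly; if this attribute is missing,
    // the bias is not applied by the epilogue (the oversight discussed below).
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                   &bias, sizeof(bias));
}
```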

ademeure (Contributor) commented May 6, 2024

You're right that I removed the code to set the bias pointer by mistake, oops! But on my end I still see the same lower performance, caused by an inefficient non-fused epilogue kernel.

Can you try running `python profile_gpt2cu.py` and see if you can find a line like this?

08 fwd×12 cublasLt::epilogue::impl::globalKernel 5.74 871.3 0.0 1.81 3.19 1.70 3.38 798.18

If you don't see one, that's really interesting, because it would mean this is GPU/CUDA/driver version dependent. In that case it'd be great if you could copy-paste the outputs of `nvcc --version` and `nvidia-smi` (and possibly of `profile_gpt2cu.py` as well).
