Modified version of ademeure's fused gelu_forward kernel #363

Open

wants to merge 1 commit into master
Conversation

ChrisDryden (Contributor)

I was experimenting with the fused gelu kernel, combining it with the previously built code path for non-gelu matmuls. When running it locally it appeared to give a performance benefit and produced the correct losses for bf16.

I was hoping to get a second opinion on whether my observations about this change are correct.
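(For context, here is a minimal sketch of what the fused path is doing, under the assumption that the matmul goes through cuBLASLt: one attribute on the matmul descriptor selects a bias-only versus bias+GELU epilogue, and a separate attribute supplies the bias pointer. The function name and setup below are illustrative, not the actual diff in this PR.)

```c
// Illustrative only: configure a cuBLASLt matmul descriptor so the bias add
// (and optionally the GELU) is fused into the matmul epilogue instead of
// running as a separate kernel afterwards.
#include <cublasLt.h>

void set_matmul_epilogue(cublasLtMatmulDesc_t desc, const float* bias, int fuse_gelu) {
    // Bias-only epilogue for plain matmuls, bias + GELU for the projection
    // that feeds a GELU activation.
    cublasLtEpilogue_t epilogue = fuse_gelu ? CUBLASLT_EPILOGUE_GELU_BIAS
                                            : CUBLASLT_EPILOGUE_BIAS;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));
    // The bias pointer must be set explicitly; if this attribute is missing,
    // the bias is not applied by the epilogue (the oversight discussed below).
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                   &bias, sizeof(bias));
}
```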

ademeure (Contributor) commented May 6, 2024

You're right that I removed the code to set the bias pointer by mistake, oops! But on my end I still see the same lower performance, caused by an inefficient non-fused epilogue kernel.

Can you try running `python profile_gpt2cu.py` and see if you can find a line like this?

08 fwd×12 cublasLt::epilogue::impl::globalKernel 5.74 871.3 0.0 1.81 3.19 1.70 3.38 798.18

If you don't see one, that's really interesting, because it would mean this is GPU/CUDA/driver version dependent. In that case it'd be great if you could copy-paste the outputs of `nvcc --version` and `nvidia-smi` (and possibly of `profile_gpt2cu.py` as well).
