
Added packed layernorm_forward #513

Open · wants to merge 2 commits into master
Conversation

@ChrisDryden (Contributor) commented Jun 2, 2024

This implements packed data types for the layernorm forward kernel, with an associated speedup of around 50% for this kernel in the dev files. It is waiting on the PR that converts this kernel's data types to floatX to merge before the test kernels for this can be merged.

Co-authored-by: @JaneIllario
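
The PR diff itself is not shown on this page, so for reference, here is a minimal sketch of what a packed (vectorized) layernorm forward kernel can look like. It is not the PR's actual code: it uses plain float with float4 as the packed type, assumes the channel count C is a multiple of 4 and a power-of-two block size, and the kernel name and launch configuration are illustrative. Per the description, the real kernel targets floatX with 128-bit packed accesses.

```cuda
// Sketch only, not the PR's code: layernorm forward with 128-bit (float4)
// loads/stores. One block per row; assumes C % 4 == 0 and blockDim.x is a
// power of two. Launch (illustrative):
//   layernorm_forward_packed<<<N, 256, 256 * sizeof(float)>>>(out, inp, weight, bias, N, C);
__global__ void layernorm_forward_packed(float* out, const float* inp,
                                         const float* weight, const float* bias,
                                         int N, int C) {
    extern __shared__ float shared[];  // blockDim.x floats for the reductions
    int row = blockIdx.x;
    int tid = threadIdx.x;
    const float4* x4 = reinterpret_cast<const float4*>(inp + (size_t)row * C);
    const float4* w4 = reinterpret_cast<const float4*>(weight);
    const float4* b4 = reinterpret_cast<const float4*>(bias);
    float4* y4 = reinterpret_cast<float4*>(out + (size_t)row * C);
    int C4 = C / 4;  // number of float4 packs per row

    // pass 1: mean; each thread sums its strided float4 chunks
    float sum = 0.0f;
    for (int i = tid; i < C4; i += blockDim.x) {
        float4 v = x4[i];  // one 128-bit load instead of four 32-bit loads
        sum += v.x + v.y + v.z + v.w;
    }
    shared[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shared[tid] += shared[tid + s];
        __syncthreads();
    }
    float mean = shared[0] / C;
    __syncthreads();

    // pass 2: variance, same reduction pattern
    float vsum = 0.0f;
    for (int i = tid; i < C4; i += blockDim.x) {
        float4 v = x4[i];
        float dx = v.x - mean, dy = v.y - mean, dz = v.z - mean, dw = v.w - mean;
        vsum += dx * dx + dy * dy + dz * dz + dw * dw;
    }
    shared[tid] = vsum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) shared[tid] += shared[tid + s];
        __syncthreads();
    }
    float rstd = rsqrtf(shared[0] / C + 1e-5f);

    // pass 3: normalize, scale, shift; one 128-bit store per pack
    for (int i = tid; i < C4; i += blockDim.x) {
        float4 v = x4[i], w = w4[i], b = b4[i];
        float4 o;
        o.x = (v.x - mean) * rstd * w.x + b.x;
        o.y = (v.y - mean) * rstd * w.y + b.y;
        o.z = (v.z - mean) * rstd * w.z + b.z;
        o.w = (v.w - mean) * rstd * w.w + b.w;
        y4[i] = o;
    }
}
```

The point of the float4 reinterpret is that the compiler can emit 128-bit load/store instructions, quartering the instruction count on the memory path of this bandwidth-bound kernel; that is typically where this kind of speedup comes from.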

@ChrisDryden (Contributor, Author):

#319 is the PR that adds floatX to the dev CUDA kernel for this.

@ChrisDryden (Contributor, Author) commented Jun 3, 2024

I am embarrassed: when I ran this yesterday I was getting numbers closer to 600 GB/s throughput for both kernel 6 and kernel 9, and around 900 GB/s for kernel 8, but when I run those kernels now I get around 950 GB/s, and 1050 GB/s for kernel 8. If that really is the case and you'd rather keep the simplicity and not have the packing, there is also the option of going with kernel 9, which is now the same speed as kernel 6: it gives the same performance as before but splits the work into two simpler kernels.
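
For context on the "two simpler kernels" option: a split version computes the per-row statistics in one kernel and applies the normalization in a second. The sketch below is illustrative only; the kernel names, launch shapes, and the naive per-row stats loop are assumptions, not the repo's actual kernel 9.

```cuda
// Illustrative two-kernel split (not the repo's actual kernel 9):
// kernel A computes per-row mean and rstd, kernel B applies the affine normalization.
__global__ void layernorm_stats(float* mean, float* rstd,
                                const float* inp, int N, int C) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row (naive)
    if (row >= N) return;
    const float* x = inp + (size_t)row * C;
    float m = 0.0f;
    for (int i = 0; i < C; i++) m += x[i];
    m /= C;
    float v = 0.0f;
    for (int i = 0; i < C; i++) { float d = x[i] - m; v += d * d; }
    mean[row] = m;
    rstd[row] = rsqrtf(v / C + 1e-5f);
}

__global__ void layernorm_apply(float* out, const float* inp,
                                const float* mean, const float* rstd,
                                const float* weight, const float* bias,
                                int N, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (idx >= N * C) return;
    int row = idx / C, col = idx % C;
    out[idx] = (inp[idx] - mean[row]) * rstd[row] * weight[col] + bias[col];
}

// launch sketch:
//   layernorm_stats<<<(N + 255) / 256, 256>>>(mean, rstd, inp, N, C);
//   layernorm_apply<<<(N * C + 255) / 256, 256>>>(out, inp, mean, rstd, weight, bias, N, C);
```

The second kernel is a trivially parallel elementwise pass, so the complexity concentrates in the small stats kernel, which is the appeal of the split when the fused version's gains are in doubt.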
