
Conversation

@barronalex (Contributor) commented Feb 18, 2025

Add support for large Hadamard transforms on the GPU.

For $N=2^{24}$ the GPU version is about 50x faster than the CPU:

Timing hadamard_transform (GPU) ... 2.32494 msec
Timing hadamard_transform (CPU) ... 123.39948 msec
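
For context, a minimal sketch of how such a comparison could be reproduced from Python, assuming the public `mx.hadamard_transform` API and device switching via `mx.set_default_device` (the actual benchmark harness used for these numbers is not shown in the PR):

```python
import time
import mlx.core as mx

def time_hadamard(device, n=2**24, reps=10):
    # Hypothetical harness; mimics the "Timing ..." output format above.
    mx.set_default_device(device)
    x = mx.random.normal((n,))
    mx.eval(mx.hadamard_transform(x))  # warmup
    tic = time.perf_counter()
    for _ in range(reps):
        y = mx.hadamard_transform(x)
        mx.eval(y)
    msec = (time.perf_counter() - tic) / reps * 1e3
    print(f"Timing hadamard_transform ... {msec:.5f} msec")

time_hadamard(mx.gpu)
time_hadamard(mx.cpu)
```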

@angeloskath (Member) commented:

Looks great! As per offline discussion, let's move this into the primitive instead.

@angeloskath (Member) commented Apr 29, 2025

This should be fine to review and merge now.

The kernel is a bit faster than copying and calling the contiguous one, and it has the added benefit of being completely in-place, which the transpose-copy approach can't be.
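
As a sanity check on the kernel's output, one way to verify it against an explicit Sylvester-constructed Hadamard matrix (a sketch; it assumes `mx.hadamard_transform` defaults to a 1/sqrt(n) scale, so `scale=1.0` is passed to match the unnormalized matrix):

```python
import mlx.core as mx
import numpy as np

def sylvester(n):
    # Build the n x n Hadamard matrix via the Sylvester recursion
    # H_{2m} = [[H_m, H_m], [H_m, -H_m]]; n must be a power of two.
    H = np.array([[1.0]], dtype=np.float32)
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 1 << 10
x = mx.random.normal((8, n))
out = mx.hadamard_transform(x, scale=1.0)  # unnormalized transform
ref = np.asarray(x) @ sylvester(n)         # H is symmetric, so H == H.T
assert np.allclose(np.asarray(out), ref, atol=1e-2)
```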

[Plot: hadamard_transform benchmark, throughput in GB/s]

The 16K-element in-place transform was not launching correctly (max threads per threadgroup was 832), so I reduced the limit to 8K. @barronalex, if you remember encountering the same issue before, let me know how you fixed it 🤔.

Edit: The plot is in GB/s, not MB/s.
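
A back-of-the-envelope check of the launch failure described above (a sketch; the 16-elements-per-thread figure is an assumption for illustration, not the kernel's actual value):

```python
# If each thread owns a fixed chunk of the in-place transform, the required
# threadgroup size grows with N and can exceed the pipeline's reported limit.
ELEMS_PER_THREAD = 16   # hypothetical; the real kernel may differ
MAX_THREADS = 832       # maxTotalThreadsPerThreadgroup reported in the PR

for n in (1 << 13, 1 << 14):  # 8K and 16K elements
    threads = n // ELEMS_PER_THREAD
    status = "ok" if threads <= MAX_THREADS else "exceeds limit"
    print(f"N={n}: {threads} threads per threadgroup -> {status}")
```

Under these assumptions, 8K needs 512 threads (fine) while 16K needs 1024, which is over the 832-thread limit, consistent with reducing the in-place cutoff to 8K.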

@angeloskath requested a review from @awni on April 29, 2025 at 16:55.
@awni (Member) left a comment:

Awesome!!

@angeloskath force-pushed the big-gpu-hadamard branch 2 times, most recently from fb6f761 to 61ffdf0 on May 1, 2025 at 22:56.
@angeloskath added a commit that referenced this pull request on May 2, 2025.
@angeloskath merged commit 4813494 into main on May 2, 2025 (0 of 3 checks passed).
@angeloskath deleted the big-gpu-hadamard branch on May 2, 2025 at 00:19.
faisalmemon pushed a commit to faisalmemon/mlx that referenced this pull request on Oct 30, 2025.