M3 Performance #11
Apple rewrote the M3's memory architecture from scratch. This may or may not have obviated the SIMD async copy instructions; I don't know. However, MPS looks like it's not reaching optimal performance. Here is a graph from @liuliu for M3 Pro. I do have an Apple9 device right now (iPhone 15 Pro Max), but the tests can't run on iOS yet; they depend on Python for NumPy benchmarking and Matplotlib. Next summer, I'll need something else to visualize matmul kernel bugs and performance charts. Maybe export the data to JSON and graph it externally, or use SwiftUI plus some Metal graphics shaders on-device. For more context, I raised an issue on the ISA repository before M3 was released, about some strange structure in MPS kernels for the never-released A16 raytracing GPU: dougallj/applegpu#44. I might suggest increasing the accumulator size to something absurd, although that probably won't fully fix the problem.
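The JSON-export route mentioned above could look like the minimal sketch below. The file name and record fields are hypothetical, not anything from the repo; the point is just that the benchmark data round-trips cleanly so it can be graphed by any external tool.

```python
import json

# Hypothetical benchmark records: (M, N, K) problem size -> achieved GFLOPS.
results = [
    {"M": 512, "N": 512, "K": 512, "gflops": 3100.0},
    {"M": 1024, "N": 1024, "K": 1024, "gflops": 4800.0},
]

# Dump to JSON so the data can be graphed off-device
# (e.g. with Matplotlib on a Mac, or any external plotting tool).
with open("gemm_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Reload and verify the round trip.
with open("gemm_results.json") as f:
    loaded = json.load(f)
```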
Here are mine. I tried fiddling with cache sizes, but as suspected it didn't yield anything significant. One surprising thing to me was the effect of reducing the number of splits. Happy to provide further tests, or even to try live debugging if you want indirect access to an M3. Cheers. For completeness:
I think the MPS kernel is structured so a single SIMD group accesses an entire accumulator in isolation. The reason MFA uses splits is so 4 SIMD groups can amortize the overhead of the async copy.
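To make the amortization concrete, here is a sketch in my own framing (not code from the repo): with a 2x2 split, the threadgroup still performs one async copy of the full tile, but four simdgroups share that one copy, each accumulating only a quadrant.

```python
def split_layout(block_m: int, block_n: int, m_splits: int, n_splits: int):
    """Sketch of what an M x N split buys: one threadgroup-wide tile load
    is shared by m_splits * n_splits simdgroups, each of which only
    accumulates a (block_m / m_splits) x (block_n / n_splits) quadrant."""
    simdgroups = m_splits * n_splits
    quadrant = (block_m // m_splits, block_n // n_splits)
    copy_share = 1.0 / simdgroups  # fraction of tile-load cost per simdgroup
    return simdgroups, quadrant, copy_share

simdgroups, quadrant, copy_share = split_layout(64, 64, 2, 2)
print(simdgroups, quadrant, copy_share)  # 4 (32, 32) 0.25
```

With no splits (1x1), a single simdgroup pays the entire copy overhead itself, which matches the description of the MPS kernel above.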
You're talking about this, right? https://github.com/philipturner/metal-flash-attention/blob/main/Sources/GEMM.metal#L203-L216
Yeah. The 64-bit pointer is only temporarily materialized, recycling registers allocated for other data during the GEMM part.
If I understand the 64x64 accumulator idea correctly, you mean increasing this buffer to 4096: https://github.com/philipturner/metal-flash-attention/blob/main/Sources/GEMM.metal#L275, and updating the MNK_simd sizes here: metal-flash-attention/Tests/Operations/GEMM.swift, lines 124 to 139 at commit 32592c9.
I'm not super familiar with the codebase, just trying simple ideas here. If they've thrown the async SIMD primitives you found in 14.2 out of the window, it'd be a shame, and I'm not sure how I could help at this point.
That 1024 is just to fool the compiler. It doesn't allocate any blocks. You can increase the accumulator size by trying 64x64, 80x80, or 96x96 instead of 48x48.
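To see why the accumulator size can't grow without bound, here is back-of-the-envelope arithmetic under my own assumptions (FP32 elements, a 32-wide simdgroup, 4-byte registers, the tile spread evenly across the group): register pressure grows quadratically with the block edge.

```python
def regs_per_thread(block: int, bytes_per_elem: int = 4,
                    simd_width: int = 32, reg_bytes: int = 4) -> int:
    """32-bit registers each thread spends holding one block x block FP32
    accumulator, assuming it is distributed over a 32-wide simdgroup."""
    total_bytes = block * block * bytes_per_elem
    return total_bytes // (simd_width * reg_bytes)

for b in (48, 64, 80, 96):
    print(f"{b}x{b}: {regs_per_thread(b)} registers/thread")
# 48x48: 72, 64x64: 128, 80x80: 200, 96x96: 288
```

So going from 48x48 to 96x96 quadruples the per-thread register footprint, which is presumably where "something absurd" stops paying off.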
If they threw it out of the window, it's simply unnecessary. Removing it might require some simplifications to the GPU kernel, which shouldn't be super hard, just more than changing a single parameter: more than a few hours of time, some familiarity with Metal, and a new function constant to take the simplified M3 branch on newer devices while still supporting older ones. Attention is a more complex kernel, but the advantage over MPS is already enormous there. So the biggest issue is just fixing GEMM, which should be simpler.
Out of curiosity, what is MPS performance for Float16? |
Also, if it's just 1 split, you could fine-tune (40x40, 56x56, 72x72). I think I know at least how to get performance equal to MPS, if you use a very large kernel ensemble. Make the K dimension very small and ramp up the M and N dimensions. Generate graphs for each multiple of 8 from MN=16 to MN=96. Freeze K at 16 while searching the solution space for M/N; later, I'll instruct you to unfreeze it and try K=8/16/24/32. Use a 1x1 split. Either use the dropdown feature in GitHub or attach the images as a PDF. Better yet, create a GitHub Markdown gist and link it from this comment. I don't want the thread polluted with many large images.
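The sweep described above can be enumerated as follows. This is only a sketch of the search grid; the tuple layout is my own, not the test suite's API.

```python
def search_grid(k: int = 16, mn_start: int = 16, mn_stop: int = 96,
                step: int = 8):
    """Enumerate (M_block, N_block, K_block) candidates: K frozen,
    M and N swept over every multiple of 8 in [mn_start, mn_stop]."""
    sizes = range(mn_start, mn_stop + 1, step)
    return [(m, n, k) for m in sizes for n in sizes]

grid = search_grid()
print(len(grid))  # 11 sizes per axis -> 121 candidates
```

Unfreezing K later just means repeating the sweep with `k` set to 8, 16, 24, and 32.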
I wonder if we're in for another memory architecture rewrite with M4. |
I doubt it. Also, M3 is not that hard, compared to M1. I just haven’t had any factor motivating me to patch up MFA yet. |
FP32 is the first step before delving into more advanced types, such as truncated FP32 (brain float) and FP32 with reduced dynamic range (half precision). https://gist.github.com/philipturner/3bda14e876a635e73745c42f2eb240c8 Optimal block sizes: MPS uses 64x64xK with async copy on M1-M2.
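For reference, "truncated FP32" is literal: bfloat16 keeps just the top 16 bits of an IEEE 754 single, preserving FP32's exponent range while dropping mantissa precision. A quick sketch using only the standard library:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """bfloat16 is the top 16 bits of an FP32 value (truncation, no
    rounding here), so it keeps FP32's dynamic range with less precision."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    """Widen a bfloat16 bit pattern back to FP32 by zero-filling the
    low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

x = 3.14159
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))  # 3.140625, small truncation error
```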
Hi, thanks a lot for this really cool library.
I've taken it for a spin on M3, and saw that GEMM seemed to perform better on MPS.
Do you have any suggestions as to why? Any way we could help?