M3 Performance #11

Open
Narsil opened this issue Dec 9, 2023 · 14 comments

Comments

@Narsil

Narsil commented Dec 9, 2023

Hi, thanks a lot for this really cool library.

I've taken it for a spin on an M3 and saw that GEMM seemed to perform better on MPS. Do you have any suggestions as to why? Any way we could help?

@philipturner
Owner

Apple rewrote the M3's memory architecture from scratch. This may have obviated the SIMD async copy instructions, or not; I don't know. However, MPS looks like it's not reaching optimal performance. Here is a graph from @liuliu for M3 Pro.

[Figure: GEMM performance on M3 Pro, from @liuliu]

I do have an Apple9 device (iPhone 15 Pro Max), but the tests can't run on iOS right now. They have a dependency on Python for NumPy benchmarking and Matplotlib. Next summer, I'll need to use something else to visualize matmul kernel bugs and perf charts: maybe export the data to JSON and graph it externally, or use SwiftUI plus some Metal graphics shaders on-device.
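
For what it's worth, the JSON route could be as simple as one Codable struct per data point. This is just a minimal sketch; GEMMSample, exportSamples, and the field names are made up for illustration, not part of the test suite.

import Foundation

// Minimal sketch of exporting benchmark data for external graphing.
// GEMMSample and exportSamples are illustrative names, not MFA APIs.
struct GEMMSample: Codable {
  let matrixSize: Int   // square problem size (M = N = K)
  let gflops: Double    // measured throughput
}

func exportSamples(_ samples: [GEMMSample], to url: URL) throws {
  let encoder = JSONEncoder()
  encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
  try encoder.encode(samples).write(to: url)
}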

For more context, I raised an issue on the ISA repository before M3 was released. It referred to some strange structure in the MPS kernels for the never-released A16 raytracing GPU: dougallj/applegpu#44. I might suggest increasing the accumulator size to something absurd, although that probably won't fully fix the problem.

@Narsil
Author

Narsil commented Dec 9, 2023

Here are mine.

I tried fiddling around with cache sizes, but as suspected it didn't yield anything significant.

[Figures: F16 and float32 GEMM performance on M3 Max]

One surprising thing to me was that reducing the number of splits to 1 for M and N gave a much more consistent throughput for the GEMM op.

Happy to provide further tests, or even try live debugging if you want indirect access to the M3.

Cheers.

For completeness:

GPU name: Apple M3 Max
GPU vendor: Apple
GPU core count: 40
GPU clock frequency: 1.398 GHz
GPU bandwidth: 409.6 GB/s
GPU FLOPS: 14.316 TFLOPS
GPU IPS: 7.158 TIPS
GPU system level cache: 48 MB
GPU memory: 128 GB
GPU family: Apple 8

@philipturner
Owner

For reference, I superimposed the M1 Max graph on Nick's M3 Max graph. The supermassive dips in MPS performance on M1 came from misuse of SIMD async copy instructions. Since the M3 MPS shaders don't use SIMD async copy, the performance is more consistent for uneven matrix sizes. However, MPS is nowhere near optimal, as shown by M1 MFA outperforming M3 MPS for M=200-600.

[Screenshot: M1 Max results superimposed on the M3 Max graph]

Perhaps the SIMD async copy instructions were removed from M3, and the runtime AIR compiler replaces them with a slow emulated version. My first suggestion would be modifying the kernel to load directly from device memory, which is non-trivial (although the Metal source code is so elegant, it's easier to modify than any other implementation). Using SIMD async copies allows the kernel to use the 16-bit address range for threadgroup memory pointers, only temporarily materializing a 64-bit device pointer.

@philipturner
Owner

One surprising thing to me was that reducing the number of splits to 1 for M and N gave a much more consistent throughput for the GEMM op.

I think the MPS kernel is structured so a single SIMD group accesses an entire accumulator in isolation. The reason MFA uses splits is so 4 SIMD groups can amortize the overhead of the device -> threadgroup transfer by 2 times. Also, with the M1 architecture, a single SIMD group can't hold an accumulator as large as 64 x 64 x FP32 (16 KB). Getting close (48 x 48) would decrease occupancy too much. Perhaps M3 is different.
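
To put numbers on that, here is a quick back-of-the-envelope sketch (plain arithmetic, not code from the repo):

// Per-simdgroup accumulator footprint for a few candidate block sizes.
let bytesPerFP32 = 4
for (m, n) in [(32, 32), (48, 48), (64, 64)] {
  let accumulatorBytes = m * n * bytesPerFP32
  print("\(m)x\(n) FP32 accumulator: \(accumulatorBytes / 1024) KB per simdgroup")
}
// 64x64 -> 16 KB, too much for one M1 simdgroup's registers without hurting
// occupancy; a 2x2 split across 4 simdgroups keeps each accumulator smaller
// while amortizing the device -> threadgroup transfer.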

@Narsil
Author

Narsil commented Dec 9, 2023

@philipturner
Owner

Yeah. The 64-bit pointer is only temporarily materialized, recycling registers allocated for other data during the GEMM part.

@Narsil
Author

Narsil commented Dec 9, 2023

If I understand the 64x64 accumulator idea correctly, you mean increasing this buffer to 4096: https://github.com/philipturner/metal-flash-attention/blob/main/Sources/GEMM.metal#L275

And updating the M/N/K simd sizes here:

let A_block_length = M_group * K_simd
let B_block_length = K_simd * N_group

var blockElements = A_block_length + B_block_length
if (pcopy.M % 8 != 0) && (pcopy.N % 8 != 0) {
  let C_block_length = M_group * N_group
  blockElements = max(C_block_length, blockElements)
}
if parameters.fused_bias {
  if parameters.D_trans {
    blockElements = max(blockElements, M_group)
  } else {
    blockElements = max(blockElements, N_group)
  }
}
let blockBytes = blockElements * UInt16(dataType.size)

in order to increase the threadgroup memory size?

I'm not super familiar with the codebase, just trying simple ideas here. If they've thrown the async SIMD primitives you found in 14.2 out of the window, it'd be a shame, and I'm not sure how I could help at this point.

@philipturner
Owner

That 1024 is just to fool the compiler. It doesn't allocate any blocks. You can increase the accumulator size by trying 64x64, 80x80, or 96x96 instead of 48x48.

I'm not super familiar with the codebase, just trying simple ideas here. If they've thrown the async SIMD primitives you found in 14.2 out of the window, it'd be a shame, and I'm not sure how I could help at this point.

If they threw it out of the window, it's simply unnecessary. It might require some simplifications to the GPU kernel, which might not be super hard, just more than changing a single parameter. It requires more than a few hours of time and some familiarity with Metal, plus a new function constant to take the simplified M3 branch on newer devices while still supporting older ones.
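
As a rough illustration of the function-constant idea, the host side could look something like this sketch; the constant name use_async_copy and the "hgemm" kernel name are assumptions for illustration, not necessarily the library's actual interface.

import Metal

// Hypothetical sketch: compile the GEMM kernel with a boolean function
// constant that selects the simplified M3 path (direct device loads,
// no SIMD async copy). Names here are illustrative.
func makeGEMMPipeline(device: MTLDevice, library: MTLLibrary) throws -> MTLComputePipelineState {
  let constants = MTLFunctionConstantValues()

  // Assume the path with async copy only applies to M1/M2 (Apple 7/8).
  var useAsyncCopy = !device.supportsFamily(.apple9)
  constants.setConstantValue(&useAsyncCopy, type: .bool, withName: "use_async_copy")

  let function = try library.makeFunction(name: "hgemm", constantValues: constants)
  return try device.makeComputePipelineState(function: function)
}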

Attention is a more complex kernel, but the advantage over MPS is already enormous there. So the biggest issue is just fixing GEMM, which should be simpler.

@philipturner
Owner

Out of curiosity, what is MPS performance for Float16?

@philipturner
Owner

philipturner commented Dec 9, 2023

Also, if it's just 1 split, you could fine-tune (40x40, 56x56, 72x72). I think I at least know how to get performance equal to MPS, if you use a very large kernel ensemble.

Make the K block dimension very small and ramp up the M and N block sizes. Generate graphs for each multiple of 8 from MN=16 to MN=96. Freeze K at 16 while searching the solution space for M/N; later, I'll instruct you to unfreeze it and try K=8/16/24/32. Use a 1x1 split.
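
In case it helps, here is a sketch of that search loop. runGEMMBenchmark is a stand-in for whatever the test harness actually calls, and the parameter names are only illustrative.

// Stand-in for the real benchmark entry point (not an MFA function).
func runGEMMBenchmark(mSimd: Int, nSimd: Int, kSimd: Int,
                      mSplits: Int, nSplits: Int) -> Double {
  // ... dispatch the kernel with these block sizes and return GFLOPS ...
  return 0
}

// Freeze K at 16, sweep M/N over multiples of 8 from 16 to 96, 1x1 split.
var results: [(blockMN: Int, gflops: Double)] = []
for blockMN in stride(from: 16, through: 96, by: 8) {
  let gflops = runGEMMBenchmark(mSimd: blockMN, nSimd: blockMN, kSimd: 16,
                                mSplits: 1, nSplits: 1)
  results.append((blockMN, gflops))
}
// Later: unfreeze K and retry the best M/N candidates with K = 8/16/24/32.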

Either use the dropdown feature in GitHub or attach the images as a PDF. Better yet, create a GitHub Markdown gist and link to this comment. I don't want the thread polluted with many large images.

@philipturner
Owner

philipturner commented Dec 9, 2023

[Screenshot]

@DigitalSolomon

Apple rewrote the M3's memory architecture from scratch. This may have obviated the SIMD async copy instructions, or not; I don't know. However, MPS looks like it's not reaching optimal performance. Here is a graph from @liuliu for M3 Pro.

I wonder if we're in for another memory architecture rewrite with M4.

@philipturner
Owner

I doubt it. Also, M3 is not that hard, compared to M1. I just haven’t had any factor motivating me to patch up MFA yet.

@philipturner
Owner

philipturner commented May 24, 2024

FP32 is the first step before delving into more advanced types, such as truncated FP32 (brain float) and FP32 with reduced dynamic range (half precision).

https://gist.github.com/philipturner/3bda14e876a635e73745c42f2eb240c8

Optimal block sizes:
32x32x32 and 48x48x24 with async copy, otherwise
32x32x8 and 48x48x8 for M1-M2
32x32x8 for M3+

MPS uses 64x64xK with async copy on M1-M2
MPS uses 32x32x8 on M3+
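
Translated into a host-side selection, that summary would look roughly like this sketch; the (M, N, K) tuple and function structure are illustrative, not the library's actual API.

import Metal

// Sketch of picking a GEMM block size from the summary above.
func gemmBlockSize(device: MTLDevice, useAsyncCopy: Bool,
                   largeTile: Bool) -> (m: Int, n: Int, k: Int) {
  if device.supportsFamily(.apple9) {
    return (32, 32, 8)                                  // M3 and later
  } else if useAsyncCopy {
    return largeTile ? (48, 48, 24) : (32, 32, 32)      // M1-M2, async copy
  } else {
    return largeTile ? (48, 48, 8) : (32, 32, 8)        // M1-M2, no async copy
  }
}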
