M3 Performance #11

Open
Narsil opened this issue Dec 9, 2023 · 14 comments

Comments

@Narsil

Narsil commented Dec 9, 2023

Hi, thanks a lot for this really cool library.

I've taken it for a spin on an M3 and saw that GEMM seemed to perform better on MPS. Do you have any suggestions as to why? Any way we could help?

@philipturner
Owner

Apple rewrote the M3's memory architecture from scratch. This may have obviated the SIMD async copy instructions, or not; I don't know. However, MPS looks like it's not reaching optimal performance. Here is a graph from @liuliu for M3 Pro.

[Figure: GEMM performance on M3 Pro, from @liuliu]

I do have an Apple9 device (iPhone 15 Pro Max), but the tests can't run on iOS right now. They have a dependency on Python for NumPy benchmarking and Matplotlib. Next summer, I'll need to use something else to visualize matmul kernel bugs and perf charts: maybe export the data to JSON and graph it externally, or use SwiftUI plus some Metal graphics shaders on-device.
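
For what it's worth, the JSON route could be as simple as one Codable struct per data point. This is just a minimal sketch; GEMMSample, exportSamples, and the field names are made up for illustration, not part of the test suite.

import Foundation

// Minimal sketch of exporting benchmark data for external graphing.
// GEMMSample and exportSamples are illustrative names, not MFA APIs.
struct GEMMSample: Codable {
  let matrixSize: Int   // square problem size (M = N = K)
  let gflops: Double    // measured throughput
}

func exportSamples(_ samples: [GEMMSample], to url: URL) throws {
  let encoder = JSONEncoder()
  encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
  try encoder.encode(samples).write(to: url)
}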

For more context, I raised an issue on the ISA repository before M3 was released. It referred to some strange structure in the MPS kernels for the never-released A16 raytracing GPU: dougallj/applegpu#44. I might suggest increasing the accumulator size to something absurd, although that probably won't fully fix the problem.

@Narsil
Author

Narsil commented Dec 9, 2023

Here are mine.

I tried fiddling around with cache sizes, but as suspected it didn't yield anything significant.

[Figures: F16 and float32 GEMM performance on M3 Max]

One surprising thing to me was that reducing the number of splits to 1 for M and N gave a much more consistent throughput for the GEMM op.

Happy to provide further tests, or even try live debugging if you want indirect access to the M3.

Cheers.

For completeness:

GPU name: Apple M3 Max
GPU vendor: Apple
GPU core count: 40
GPU clock frequency: 1.398 GHz
GPU bandwidth: 409.6 GB/s
GPU FLOPS: 14.316 TFLOPS
GPU IPS: 7.158 TIPS
GPU system level cache: 48 MB
GPU memory: 128 GB
GPU family: Apple 8

@philipturner
Owner

For reference, I superimposed the M1 Max graph on Nick's M3 Max graph. The supermassive dips in MPS performance on M1 came from misuse of SIMD async copy instructions. Since the M3 MPS shaders don't use SIMD async copy, the performance is more consistent for uneven matrix sizes. However, MPS is nowhere near optimal, as shown by M1 MFA outperforming M3 MPS for M=200-600.

[Screenshot: M1 Max results superimposed on the M3 Max graph]

Perhaps the SIMD async copy instructions were removed from M3, and the runtime AIR compiler replaces them with a slow emulated version. My first suggestion would be modifying the kernel to load directly from device memory, which is non-trivial (although the Metal source code is so elegant, it's easier to modify than any other implementation). Using SIMD async copies allows the kernel to use the 16-bit address range for threadgroup memory pointers, only temporarily materializing a 64-bit device pointer.

@philipturner
Owner

One surprising thing to me was that reducing the number of splits to 1 for M and N gave a much more consistent throughput for the GEMM op.

I think the MPS kernel is structured so a single SIMD group accesses an entire accumulator in isolation. The reason MFA uses splits is so 4 SIMD groups can amortize the overhead of the device -> threadgroup transfer by 2 times. Also, with the M1 architecture, a single SIMD group can't hold an accumulator as large as 64 x 64 x FP32 (16 KB). Getting close (48 x 48) would decrease occupancy too much. Perhaps M3 is different.
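
To put numbers on that, here is a quick back-of-the-envelope sketch (plain arithmetic, not code from the repo):

// Per-simdgroup accumulator footprint for a few candidate block sizes.
let bytesPerFP32 = 4
for (m, n) in [(32, 32), (48, 48), (64, 64)] {
  let accumulatorBytes = m * n * bytesPerFP32
  print("\(m)x\(n) FP32 accumulator: \(accumulatorBytes / 1024) KB per simdgroup")
}
// 64x64 -> 16 KB, too much for one M1 simdgroup's registers without hurting
// occupancy; a 2x2 split across 4 simdgroups keeps each accumulator smaller
// while amortizing the device -> threadgroup transfer.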

@Narsil
Author

Narsil commented Dec 9, 2023

@philipturner
Owner

Yeah. The 64-bit pointer is only temporarily materialized, recycling registers allocated for other data during the GEMM part.

@Narsil
Author

Narsil commented Dec 9, 2023

If I understand the 64x64 accumulator idea correctly, you mean increasing this buffer to 4096: https://github.com/philipturner/metal-flash-attention/blob/main/Sources/GEMM.metal#L275

And updating the M/N/K simd sizes here:

let A_block_length = M_group * K_simd
let B_block_length = K_simd * N_group

var blockElements = A_block_length + B_block_length
if (pcopy.M % 8 != 0) && (pcopy.N % 8 != 0) {
  let C_block_length = M_group * N_group
  blockElements = max(C_block_length, blockElements)
}
if parameters.fused_bias {
  if parameters.D_trans {
    blockElements = max(blockElements, M_group)
  } else {
    blockElements = max(blockElements, N_group)
  }
}
let blockBytes = blockElements * UInt16(dataType.size)

in order to increase the threadgroup memory size?

I'm not super familiar with the codebase, just trying simple ideas here. If they've thrown the async SIMD primitives you found in 14.2 out of the window, it'd be a shame, and I'm not sure how I could help at this point.

@philipturner
Owner

That 1024 is just to fool the compiler. It doesn't allocate any blocks. You can increase the accumulator size by trying 64x64, 80x80, or 96x96 instead of 48x48.

I'm not super familiar with the codebase, just trying simple ideas here. If they've thrown the async SIMD primitives you found in 14.2 out of the window, it'd be a shame, and I'm not sure how I could help at this point.

If they threw it out of the window, it's simply unnecessary. It might require some simplifications to the GPU kernel, which might not be super hard, just more than changing a single parameter. It requires more than a few hours of time and some familiarity with Metal, plus a new function constant to take the simplified M3 branch on newer devices while still supporting older ones.
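
As a rough illustration of the function-constant idea, the host side could look something like this sketch; the constant name use_async_copy and the "hgemm" kernel name are assumptions for illustration, not necessarily the library's actual interface.

import Metal

// Hypothetical sketch: compile the GEMM kernel with a boolean function
// constant that selects the simplified M3 path (direct device loads,
// no SIMD async copy). Names here are illustrative.
func makeGEMMPipeline(device: MTLDevice, library: MTLLibrary) throws -> MTLComputePipelineState {
  let constants = MTLFunctionConstantValues()

  // Assume the path with async copy only applies to M1/M2 (Apple 7/8).
  var useAsyncCopy = !device.supportsFamily(.apple9)
  constants.setConstantValue(&useAsyncCopy, type: .bool, withName: "use_async_copy")

  let function = try library.makeFunction(name: "hgemm", constantValues: constants)
  return try device.makeComputePipelineState(function: function)
}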

Attention is a more complex kernel, but the advantage over MPS is already enormous there. So the biggest issue is just fixing GEMM, which should be simpler.

@philipturner
Owner

Out of curiosity, what is MPS performance for Float16?

@philipturner
Owner

philipturner commented Dec 9, 2023

Also, if it's just 1 split, you could fine-tune (40x40, 56x56, 72x72). I think I at least know how to get performance equal to MPS, if you use a very large kernel ensemble.

Make the K block dimension very small and ramp up the M and N block sizes. Generate graphs for each multiple of 8 from MN=16 to MN=96. Freeze K at 16 while searching the solution space for M/N; later, I'll instruct you to unfreeze it and try K=8/16/24/32. Use a 1x1 split.
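
In case it helps, here is a sketch of that search loop. runGEMMBenchmark is a stand-in for whatever the test harness actually calls, and the parameter names are only illustrative.

// Stand-in for the real benchmark entry point (not an MFA function).
func runGEMMBenchmark(mSimd: Int, nSimd: Int, kSimd: Int,
                      mSplits: Int, nSplits: Int) -> Double {
  // ... dispatch the kernel with these block sizes and return GFLOPS ...
  return 0
}

// Freeze K at 16, sweep M/N over multiples of 8 from 16 to 96, 1x1 split.
var results: [(blockMN: Int, gflops: Double)] = []
for blockMN in stride(from: 16, through: 96, by: 8) {
  let gflops = runGEMMBenchmark(mSimd: blockMN, nSimd: blockMN, kSimd: 16,
                                mSplits: 1, nSplits: 1)
  results.append((blockMN, gflops))
}
// Later: unfreeze K and retry the best M/N candidates with K = 8/16/24/32.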

Either use the dropdown feature in GitHub or attach the images as a PDF. Better yet, create a GitHub Markdown gist and link to this comment. I don't want the thread polluted with many large images.

@philipturner
Owner

philipturner commented Dec 9, 2023

[Screenshot]

@DigitalSolomon

Apple rewrote the M3's memory architecture from scratch. This may have obviated the SIMD async copy instructions, or not; I don't know. However, MPS looks like it's not reaching optimal performance. Here is a graph from @liuliu for M3 Pro.

I wonder if we're in for another memory architecture rewrite with M4.

@philipturner
Owner

I doubt it. Also, M3 is not that hard, compared to M1. I just haven’t had any factor motivating me to patch up MFA yet.

@philipturner
Owner

philipturner commented May 24, 2024

FP32 is the first step before delving into more advanced types, such as truncated FP32 (brain float) and FP32 with reduced dynamic range (half precision).

https://gist.github.com/philipturner/3bda14e876a635e73745c42f2eb240c8

Optimal block sizes:
32x32x32 and 48x48x24 with async copy, otherwise
32x32x8 and 48x48x8 for M1-M2
32x32x8 for M3+

MPS uses 64x64xK with async copy on M1-M2
MPS uses 32x32x8 on M3+
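
Translated into a host-side selection, that summary would look roughly like this sketch; the (M, N, K) tuple and function structure are illustrative, not the library's actual API.

import Metal

// Sketch of picking a GEMM block size from the summary above.
func gemmBlockSize(device: MTLDevice, useAsyncCopy: Bool,
                   largeTile: Bool) -> (m: Int, n: Int, k: Int) {
  if device.supportsFamily(.apple9) {
    return (32, 32, 8)                                  // M3 and later
  } else if useAsyncCopy {
    return largeTile ? (48, 48, 24) : (32, 32, 32)      // M1-M2, async copy
  } else {
    return largeTile ? (48, 48, 8) : (32, 32, 8)        // M1-M2, no async copy
  }
}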
