This document coalesces data about real-world performance of the Apple AMX coprocessor. The motivating use case is electronic structure calculations (DFT simulations), which use complex-valued matrix multiplications and eigendecompositions. Interleaved complex numbers incur additional overhead compared to split complex numbers, but BLAS only accepts the interleaved format. This format underutilizes both NEON and AMX units.
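For illustration (not part of the benchmark suite), Accelerate's vDSP can de-interleave complex numbers into the split format with `vDSP_ctoz`; the arrays and sizes below are arbitrary:

```swift
import Accelerate

// Convert interleaved complex numbers to split format with vDSP.
// vDSP_ctoz reads (real, imag) pairs and writes two separate arrays.
let interleaved: [Float] = [1, 2, 3, 4, 5, 6] // (1+2i), (3+4i), (5+6i)
var real = [Float](repeating: 0, count: 3)
var imag = [Float](repeating: 0, count: 3)

interleaved.withUnsafeBufferPointer { buffer in
  buffer.baseAddress!.withMemoryRebound(to: DSPComplex.self, capacity: 3) { ptr in
    real.withUnsafeMutableBufferPointer { r in
      imag.withUnsafeMutableBufferPointer { i in
        var split = DSPSplitComplex(realp: r.baseAddress!, imagp: i.baseAddress!)
        vDSP_ctoz(ptr, 2, &split, 1, 3)
      }
    }
  }
}
// real == [1, 3, 5], imag == [2, 4, 6]
```

BLAS entry points like CGEMM still require the interleaved layout, so a split-format pipeline must convert at the boundaries.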
TODO: Rewrite this, now that I understand what's happening in more detail. Real-space algorithms may be the only feasible ones.
According to a recent research paper (2023), LOBPCG iterations can be partitioned into stages with different precisions. The first iterations use single precision, while the last iterations use double precision. The researchers used an RTX A4000 with a 1:32 ratio of FP64:FP32 compute power. They achieved a 6-8x speedup over GPU FP64 with negligible accuracy loss. A host CPU was used for ZHEEV, but that operation consumed only 3-10% of the total time.
For Apple silicon, the best adaptation of this algorithm would use CPU and GPU simultaneously. The AMX would not perform the majority of operations, but its presence would still be important. Below is a tentative illustration of the scheme*:
- 65% of iterations: GPU FP32 (CHEMM) + GPU FP32 (CHEEV)
- 15% of iterations: GPU FP32 (CHEMM) + GPU double-single (ZHEEV)
- 15% of iterations: AMX FP32 (CHEMM) + NEON FP64 (ZHEEV)
- 5% of iterations: AMX FP64 (ZHEMM) + NEON FP64 (ZHEEV)
*De-interleaves the complex multiplications (CHEMM, ZHEMM) into four separate multiplications of their real and imaginary parts (SGEMM, DGEMM). This improves ALU utilization on both the AMX and `simdgroup_matrix`; see the sketch below.
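A minimal sketch of that decomposition, assuming n×n row-major matrices already stored in split format. The function name `splitComplexGEMM` is hypothetical; only `cblas_sgemm` is real Accelerate API.

```swift
import Accelerate

// Sketch: C = A * B for split-format complex matrices, mapped onto
// four real SGEMM calls:
//   C_re = A_re*B_re - A_im*B_im
//   C_im = A_re*B_im + A_im*B_re
func splitComplexGEMM(
  n: Int32,
  aRe: [Float], aIm: [Float],
  bRe: [Float], bIm: [Float],
  cRe: inout [Float], cIm: inout [Float]
) {
  func sgemm(_ a: [Float], _ b: [Float], _ c: inout [Float],
             _ alpha: Float, _ beta: Float) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, a, n, b, n, beta, &c, n)
  }
  sgemm(aRe, bRe, &cRe,  1, 0) // C_re  = A_re * B_re
  sgemm(aIm, bIm, &cRe, -1, 1) // C_re -= A_im * B_im
  sgemm(aRe, bIm, &cIm,  1, 0) // C_im  = A_re * B_im
  sgemm(aIm, bRe, &cIm,  1, 1) // C_im += A_im * B_re
}
```

Each call is a dense real GEMM, so the matrix hardware runs at its real-valued utilization; the cost is extra memory traffic for the de-interleaved operands.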
Using 75% of the performance cores' NEON, all of the AMX's FP64 GEMM compute, and all of the GPU's eFP64, the M1 Max could reach 1658 GFLOPS FP64. This is 4.3x faster than 100% of the performance cores' NEON alone and 2.8x faster than the GPU's eFP64 alone. However, using all of that simultaneously may cause thermal throttling, decreasing performance by up to 1.5x.
In another scheme, the AMX would perform most of the computations. The matrix sizes used for GEMM exceed the matrix sizes used for ZHEEV. ZHEEV costs O(kn^3), where n is the number of valence electrons, while GEMM costs O(kLn^2), where L is the number of grid cells. There are multiple orders of magnitude more grid cells than valence electrons, so GEMM dominates by roughly a factor of L/n (see the sketch after this list).
- 65% of iterations: AMX FP32 (CHEMM) + NEON FP32 (CHEEV)
- 30% of iterations: AMX FP32 (CHEMM) + NEON FP64 (ZHEEV)
- 5% of iterations: AMX FP64 (ZHEMM) + NEON FP64 (ZHEEV)
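A back-of-the-envelope check of that L/n ratio. The sizes below are illustrative assumptions, not measured values:

```swift
// Per-iteration FLOP counts, ignoring the shared constant k.
// n and L are illustrative placeholders, not values from any benchmark.
let n = 1_000.0      // valence electrons (eigenproblem dimension)
let L = 1_000_000.0  // real-space grid cells
let gemmFLOPs = L * n * n   // O(kLn^2)
let zheevFLOPs = n * n * n  // O(kn^3)
print(gemmFLOPs / zheevFLOPs) // 1000.0 == L / n: GEMM dominates
```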
New scheme:
| | AMX Vector | NEON Vector | GPU Matrix | GPU Vector | AMX Matrix |
| --- | --- | --- | --- | --- | --- |
| Max Clock @ Full Utilization | 3.228 GHz | 3.132 GHz | 1.296 GHz | 1.296 GHz | 3.228 GHz |
| Max Observed Power | ~12 W ??? | 43.9 W | 52 W | 51.1 W | ~12 W ??? |
| Max Observed GFLOPS F32 | TBD | 655 | 9258 | 10400 | 2746 |
| Max Observed GFLOPS F64 | TBD | 352 | 0 | 0 | 700 |
| Max Theoretical GFLOPS FFMA32 | 413 | 801 | 9437 | 10617 | 3305 |
| Max Theoretical GFLOPS FDIV32 | 0 | 200 | 0 | 884 | 0 |
| Max Theoretical GFLOPS FSQRT32 | 0 | 200 | 0 | 663 | 0 |
| Max Theoretical GFLOPS FFMA64 | 206 | 400 | TBD | 589 | 826 |
| Max Theoretical GFLOPS FDIV64 | 0 | 100 | 0 | 183 | 0 |
| Max Theoretical GFLOPS FSQRT64 | 0 | 100 | 0 | 189 | 0 |
GFLOPS is not a plural noun. GFLOPS is a rate: (G) billion (FL) floating-point (OP) operations per (S) second. The term GFLOPS/second is often used to remove ambiguity, except that it expands to GFLOP/second/second. Data throughput is a measure of speed, and speed requires units of velocity, not acceleration. Therefore, this repository uses the original term GFLOPS.

GFLOPs is a plural noun. Occasionally, I use GFLOPs to specify the number of floating-point operations required for a linear algebra operation. The capitalization of the `s` distinguishes the quantity (GFLOPs) from the rate (GFLOPS). There are not many other concise, consistent ways to describe both terms.
TODO: Explain O(kn^3), uncertainty in computational complexity, universal measure of time-to-solution (agnostic of precision or algorithm), why I used GFLOPS/0.25k for complex-valued operations to normalize for ALU utilization
GFLOPS/k = (matrix dimension)^3 / (time to solution)
Imagine a processor that has 1000 GFLOPS of peak compute and draws 10 watts.
OpenBLAS GEMM: real GFLOPS/k = 800, but complex GFLOPS/k = 190.
Real runs at 80% ALU utilization / 8.0 watts.
Complex runs at 76% ALU / 7.6 watts (each complex FMA is four real FMAs), not 19% ALU / 1.9 watts.
Both operations have ~80% ALU utilization and ~8 watts.
However, GFLOPS/0.25k = 4 * (GFLOPS/k) ≈ 760.
76% ALU is much closer to 80%, and shows that complex is only 4% slower,
but not because it requires more computations. If anything, you would expect
it to be slightly **faster**, because it has **more** arithmetic intensity.
GFLOPS/0.25k is a fairer, more insightful comparison.
k_complex = 4 * k_real
k_real = 0.25 * k_complex
Real: GFLOPS = (GFLOPS/k) * k_real
Complex: GFLOPS = (GFLOPS/0.25k) * (0.25 * k_complex)
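As a sketch of how GFLOPS/k could be measured for SGEMM; the matrix size, single-run timing, and lack of warm-up are simplifications, not the methodology behind the tables below:

```swift
import Accelerate
import Foundation

// Measure GFLOPS/k = (matrix dimension)^3 / (time to solution) for one
// real SGEMM. A single un-warmed run is shown for brevity; a real
// benchmark should use repeated trials.
let n = 1024
let a = [Float](repeating: 1, count: n * n)
let b = [Float](repeating: 1, count: n * n)
var c = [Float](repeating: 0, count: n * n)

let start = DispatchTime.now()
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            Int32(n), Int32(n), Int32(n),
            1.0, a, Int32(n), b, Int32(n), 0.0, &c, Int32(n))
let seconds = Double(DispatchTime.now().uptimeNanoseconds -
                     start.uptimeNanoseconds) / 1e9

let gflopsPerK = Double(n * n * n) / seconds / 1e9
print("GFLOPS/k:", gflopsPerK)
// For a complex GEMM of the same dimension, report GFLOPS/0.25k, which
// equals 4 * (GFLOPS/k), so ALU utilization is comparable to real.
```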
Non-hybrid algorithms (all on one processor, either the CPU cores, AMX units, or GPU cores)
| Operation | k_real | OpenBLAS GFLOPS/k | Accelerate GFLOPS/k | Metal GFLOPS/k | NEON % | AMX % | GPU % | Max GFLOPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SGEMM | 2 | 362.2 | 1327.4 | 4629.0 | 84.4% | 85.4% | 87.2% | 9258.0 |
| DGEMM | 2 | 176.2 | 337.9 | - | 90.7% | 87.0% | - | 675.8 |
| ZGEMM | 2 | 148.4 | 223.6 | - | 76.4% | 57.6% | - | 447.2 |
| SSYEV | TBD | 4.54 | 12.9 | - | TBD | TBD | - | TBD |
| DSYEV | TBD | 4.57 | 7.74 | - | TBD | TBD | - | TBD |
| ZHEEV | TBD | 6.76 | 5.48 | - | TBD | TBD | - | TBD |
| SPOTRF | | | | | | | | |
| DPOTRF | | | | | | | | |
| ZPOTRF | | | | | | | | |
| STRSM | | | | | | | | |
| DTRSM | | | | | | | | |
| ZTRSM | | | | | | | | |
GFLOPS/k for each operation used in quantum chemistry. This metric compares each operation's execution speed regardless of the algorithm used to perform it, or the formula used to estimate GFLOPs. Complex-valued operations use GFLOPS/0.25k to directly compare their ALU utilization against real-valued operations. For every operation listed so far, the complex-valued versions are slower because they must de-interleave the numbers before processing them.
ZHEEV achieved maximum performance on Accelerate with MRRR (the ZHEEVR driver). All other eigendecompositions use the divide-and-conquer algorithm (the xSYEVD/ZHEEVD drivers). Although OpenBLAS beats Accelerate at asymptotically large matrix sizes, Accelerate is faster for the matrix sizes typically encountered in DFT.
| | ISA Documentation | Performance Documentation | OSS GEMM Libraries |
| --- | --- | --- | --- |
| Apple AMX | corsix/amx | philipturner/amx-benchmarks | xrq-phys/blis_apple |
| Apple GPU | dougallj/applegpu | philipturner/metal-benchmarks | philipturner/metal-flash-attention |