Apple silicon (MPS backends) support? #313

overheat · 2023-08-03T00:46:33Z

Support running on Macbook?

Narsil · 2023-08-03T07:56:36Z

It should work out of the box !

Performance is not optimized, we haven't even started metal backend either but it's on the roadmap !
We still did one round to optimize before even using accelerate, and we're ~1.5,2x slower than accelerate with purely custom code. I hope we can replicate the same performance for Q4 (which is the end target for M1 imo).

LaurentMazare/gemm#5
LaurentMazare/gemm#6

Original repo and credits for the actual performance tricks: https://github.com/sarah-ek/gemm (We forked with sarah-ek 's blessing, we'll hopefully merge upstream no too far in the future).

okpatil4u · 2023-08-09T05:32:55Z

This would be a huge addition. Have you thought about https://github.com/philipturner/metal-flash-attention for the flash attention alternative ?

Narsil · 2023-08-09T08:58:24Z

I wasn't aware of that project.

In general for adding backends the biggest thing we can do is adding support for custom ops from day1, wether it's metal webgpu or rocm. That's because we cannot build ALL the ops for all the backends in a timely fashion.

However having used quite a bit of frameworks, usually you need only 1 op for your particular use case (like f16 conv2d, or flash, or GPTQ), and you don't care about making sure it works with backprop and 200 other ops and op order.

okpatil4u · 2023-08-09T09:06:25Z

Thanks, that makes sense. Are you planning to add apple MPS support in version 2.0 ?

Narsil · 2023-08-09T10:43:06Z

#360

Here is the current ideas floating around so not really.
But it might happen sooner than expected sometimes we start checking out something and find out it's relatively simple so we just go on and add it.

ioma8 · 2023-08-30T08:21:44Z

I am also for MPS support.

okpatil4u · 2023-10-01T05:23:19Z

Any idea when MPS support will be launched ?

minghuaw · 2023-11-16T10:06:11Z

Would a Vulkan compute shader backend get a close performance to MPS?

okpatil4u · 2023-11-16T17:10:43Z

Someone like @philipturner could comment on this.

philipturner · 2023-11-16T19:58:52Z

Would a Vulkan compute shader backend get a close performance to MPS?

Both Vulkan and MPS would have terrible compute performance. Both less than 50% of the max utilization for some important use cases. Vulkan like 20%, as slow as just running on CPU/Accelerate. GEMM and related ops are very difficult to get right with custom shaders. LLaMA.cpp is an exception, they can make custom Metal shaders because the compute bottleneck isn't the type of operations that require a lot of expertise.

To use Vulkan shaders, you either need GLSL (archaic) or WGSL. The latter is more advanced, but still lacks some very important features to reach maximum GPU performance. We're talking a factor of 3 times slower in the parts that matter in some instances. You need SIMD reductions, SIMD matrix multiplications, etc. None of those are available in any API except Metal (though I did backdoor the macOS-only OpenCL framework for a portion of those, but OpenCL doesn't work on iOS). For MPS, the lack of control over CPU-side overhead can cause major issues for small data sets.

Also, people generally don't consider CPU-GPU communication latency or GPU command encoding overhead as something to optimize for. They write a CPU-side wrapper in a host language that eagerly dispatches the GPU commands one-at-a-time. CUDA was optimized for this, but Metal was optimized for a completely different usage pattern. However, Apple's OpenCL driver interestingly does have the automatic batching present in CUDA and ROCm.

okpatil4u · 2023-11-16T20:03:23Z

Does accelerate run on ANE ? I have observed that the performance does not scale for more than a single rayon thread when using accelerate. Wouldn’t MPS be better in this case as ANE is smaller compared to the apple GPU ?

…

On Fri, 17 Nov 2023 at 1:29 AM, Philip Turner ***@***.***> wrote: Would a Vulkan compute shader backend get a close performance to MPS? Both Vulkan and MPS would have terrible compute performance. Both less than 50% of the max utilization for some important use cases. Vulkan like 20%, as slow as just running on CPU/Accelerate. GEMM and related ops are very difficult to get right with custom shaders. LLaMA.cpp is an exception, they can make custom Metal shaders because the compute bottleneck isn't the type of operations that require a lot of expertise. To use Vulkan shaders, you either need GLSL (archaic) or WGSL. The latter is more advanced, but still lacks some very important features to reach maximum GPU performance. We're talking a factor of 3 times slower in the parts that matter in some instances. You need SIMD reductions, SIMD matrix multiplications, etc. None of those are available in any API except Metal (though I did backdoor the macOS-only OpenCL framework for a portion of those, but OpenCL doesn't work on iOS). For MPS, the lack of control over CPU-side overhead can cause major issues for small data sets. Also, people generally don't consider CPU-GPU communication latency or GPU command encoding overhead as something to optimize for. They write a CPU-side wrapper in a host language that eagerly dispatches the GPU commands one-at-a-time. CUDA was optimized for this, but Metal was optimized for a completely different usage pattern. However, Apple's OpenCL driver interestingly does have the automatic batching present in CUDA and ROCm. — Reply to this email directly, view it on GitHub <#313 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAXGU4FCVKJ6C2FQ67ZOL2DYEZWARAVCNFSM6AAAAAA3CAKKACVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJVGIZDIMRWHE> . You are receiving this because you commented.Message ID: ***@***.***>

philipturner · 2023-11-17T01:11:35Z

I put in some performance metrics for the M1 Max chip, to give some context.

Python:

1 GFLOPS FP64

"Fast" programming languages (e.g. C, C++, multithreaded Rust):

400 GFLOPS FP64

800 GFLOPS FP32

1600 GFLOPS FP16

Accelerate runs on the CPU. It uses the AMX coprocessor, a hardware-accelerated matrix multiplier for the CPU. Its primary benefit is low latency and usage in very difficult-to-parallelize tasks, like matrix factorization. The way this hardware works, every group of 4-6 CPU cores comes with a single accelerator. In many cases, multithreading or using multiple accelerators (e.g. M3 Max, with two P-blocks) actually harms performance. Accelerate automatically promotes supermassive tasks to 2 threads when appropriate.

AMX via Accelerate:

800 GFLOPS FP64

3000 GFLOPS FP32

6000 GFLOPS FP16 (not exposed by Accelerate)

Metal runs on the GPU. MPS is a proprietary closed-source library with some GPU kernels. MPSGraph is a domain-specific language for Swift, which automatically uses the ANE on certain devices. On iPhone and M1 (non-Pro), you can sometimes activate ANE by using FP16 and a supermassive matrix (4000 x 4000 x 4000). Generally, it's only 2x as fast as the GPU for GEMM. I wrote a library that's an alternative to MPS, and only uses the GPU. But it gives you more programmatic access to the GPU and implements GEMM algorithms better than Apple's MPS team.

GPU via Vulkan

3000 GFLOPS FP32

6000 GFLOPS FP16

GPU via OpenCL

3000 GFLOPS FP32

0 GFLOPS FP16

GPU via MPS

7500 GFLOPS FP32

7000 GFLOPS FP16

GPU via MFA

8500 GFLOPS FP32

10000 GFLOPS FP16

ANE is very hard to get full programmatic access to. Behind the scenes, it's a programmable fused multiply-add engine. You have to hard-code the weights into the assembly language, and it only supports FP16, so it can't be used for AI training. Not very useful for general-purpose compute, just for AI inferencing through CoreML.

ANE via MPSGraph: 4000 GFLOPS FP16
ANE advertised perf: 16000 GFLOPS FP16

okpatil4u · 2023-11-17T06:20:41Z

Thanks. This is super useful. Can MFA support quantized operations as well (4 bit, 5 bit) ? If yes, then what kind of benchmarks should one expect ? What would be a good starting point for a rust developer ?

FYI, significant metal development is already in progress for candle framework.

minghuaw · 2023-11-17T07:59:17Z

I just realized that pytorch has got some Metal shaders (https://github.com/pytorch/pytorch/tree/c233cef8fd648157fbaf85e1f2493562316a1ec4/aten/src/ATen/native/metal/ops). Might be useful for the candle Metal backend

philipturner · 2023-11-17T12:37:11Z

a001089

This is not really Metal, just some CPU code that calls into MPS.

I just realized that pytorch has got some Metal shaders

PyTorch's shaders are very basic elementwise ones. Similar to LLaMA.cpp in scope. The difficult/important one is GEMM-like operations, which are non-trivial to optimize.

Thanks. This is super useful. Can MFA support quantized operations as well (4 bit, 5 bit) ? If yes, then what kind of benchmarks should one expect ? What would be a good starting point for a rust developer ?

I made MFA so that you can modify it yourself. You can make a fork, tweak the shaders, support different quantizations if you want. Although at least for SDXL, it's advantageous to dequantize into a separate buffer before running GEMM. Less effort to write a new shader, potentially faster (due to less redundant compute).

okpatil4u · 2023-11-17T15:13:10Z

I have had a chance to look into MFA earlier. To be honest the project is intimidating to get started with. The documentation is scarce. A quick comparison would be https://github.com/Dao-AILab/flash-attention. If this entry barrier is removed, I am sure I could introduce MFA to my group of developers.

philipturner · 2023-11-17T15:25:48Z

It's ~500-1000 lines of shader code for GEMM and another ~500-1000 for Attention. The source files are mostly self-documenting, with a lot of Metal function constants. You can enable or disable each function constant while creating the MTLComputePipelineState. There's also two reference implementations of how to use MFA from the Metal API. One in Swift, another in C++.

The matrix dimensions need to be known at pipeline creation time. That way, some clever constant folding magic in the backend compiler will happen. It's extremely simple, but extremely powerful code generation. Executed much more effectively and robustly than MPS. The function constants describing matrix dimensions are usually capital letters. Examples: M, N, K for a classic M x N x K matrix multiply. The constant names are designed to be concise and simple to understand.

I pre-compiled some Metal binaries and hosted them on the GitHub releases page. That removes the need to download esoteric Xcode versions and go through the complex build process. Just copy the binary file and load the .metallib. I can make a ~100 line Swift script that does this and executes a GEMM. I could even dig up one from a SwiftUI iOS app from a while ago.

philipturner · 2023-11-17T19:51:18Z

@ivarflakstad any interest in commenting? This is a Rust framework trying to use Metal for matrix multiplication.

ivarflakstad · 2023-11-17T20:22:20Z

Yes! I'm actually involved in this work already working with @Narsil.
#1230 contains the initial work, which has been split into several PRs (#1309, #1323, etc).

We've also been working on adding mps support to metal-rs here

I began working on a wrapper around metal flash attention here, but haven't had time to complete it. GEMM works though😊
It's a bit too opinionated at the moment so a bit of refactoring would be good as well.

ivarflakstad · 2023-11-17T20:50:31Z

As a comment on documentation, dauntingness, etc I found the metal code itself fairly straightforward (if we take into consideration that we're talking about GPGPU GEMM, flash attention, and pushing the limits of what is possible with Metal).

It was the actually the swift tests orchestrating the metal execution that took me a while to understand. Calculating the correct values for func constants, async pipeline, cached operations etc.
Finding the minimal working example was key. Often is😊

(Perhaps most importantly @philipturner kindly answered all my dumb questions😉)

philipturner · 2023-11-17T23:28:02Z

Also, MFA is currently not running optimal on the new A17/M3 architecture. Requires some major rewriting, which should get funded next summer. In the meantime, use MPS for FP32 GEMM and MFA for Attention (on MTLGPUFamily.apple9).

As a comment on documentation, dauntingness

My primary interest is in a different field. I do document stuff very well when I want to (example). Just doing MFA as work during the summers.

It was the actually the swift tests orchestrating the metal execution that took me a while to understand. Calculating the correct values for func constants, async pipeline, cached operations etc.

When you write your own codebase, and write both the Swift and C++ impl, this is the result. I do try to make the code as legible as possible, using modern coding practices. The unit tests got really unwieldy, just enough to "do the job" given time constraints.

The tensor library, I wrote from scratch. I needed to bypass issues with CPU overhead to properly benchmark performance. Batching almost 100 GPU commands into a single command buffer, yet using an API that's semantically eager. A bit of compiler/DSL engineering.

I found the metal code itself fairly straightforward (if we take into consideration that we're talking about GPGPU GEMM, flash attention, and pushing the limits of what is possible with Metal).

I remember boasting to Tristan Dao about having the shortest, most elegant implementation on his list (inspired by Tinygrad). Other codebases are at least tens of thousands of lines, took >1 month to fully comprehend them. Also I came up with FlashAttention-2 independently before Tristan released the algorithm.

okpatil4u · 2023-11-18T07:52:17Z

Thanks for clearing it out. This looks very impressive from where I stand. I come from Molecular Simulations background as well, I worked on OpenMM in 2011-15, it's early days.

Finding a minimal working example was the issue for me when I briefly went through MFA. @philipturner a working 100 lines swift example would be great, if you could spare some time. Thank you.

Edit: Never mind. Just saw this https://github.com/ivarflakstad/metal-flash-attention-rs. This is a good starting point.

ivarflakstad · 2023-11-18T09:37:01Z

If you're interested contributions are very welcome, @okpatil4u 😊

philipturner · 2023-11-18T13:48:56Z

It may be an older iteration of the Metal function constants API, but here (200 lines): https://gist.github.com/philipturner/60c9b196a2e9361f1ec15a99a9267268

Edit: Yeah, this seems old because there’s no function constants for explicitly setting the block sizes.

okpatil4u · 2023-11-21T10:12:47Z

Thanks @philipturner @ivarflakstad, I will look into it.

jk2K · 2023-11-27T12:18:08Z

Is there anything I can do ? I want to work with you

Narsil · 2023-11-27T12:35:13Z

This PR implements some basic metal.

#1318

It seems to work ok (tried M1 + M3). Although speedups are only available on M3 for Phi and larger models.
Implementation is still pretty rough, and might be modified quite significantly.

philipturner · 2023-11-27T12:50:11Z

I wonder if Apple fixed the sequential throughput bottleneck in their drivers with the M3 generation. I'll have to benchmark my A17 Pro when I have the time.

okpatil4u · 2023-11-27T12:54:17Z

Just had a quick check, both metal and accelerate are yielding the same speedup.

Narsil · 2023-11-27T13:32:34Z

Quantized is still using CPU as CPU is hardcoded for now.

okpatil4u · 2023-12-07T20:07:14Z

Hey @Narsil, just checking if there has been an update on this.

ivarflakstad · 2023-12-07T20:11:48Z

On quantized or metal support?
Metal support is progressing, but I've been stuck on a weird bug where there are NaNs in the buffer on lhs of matmul. To make it even more fun it only happens with M1/M1 pro and it's non-deterministic (same input does not always reproduce the bug, happens at some point after 40k operations).
Testing out alternatives though.

Can't comment on quantized.

okpatil4u · 2023-12-07T20:46:03Z

I was checking for metal support. Apple recently released MLX. Not sure if you had any chance to look at it. Mistral 16bit example is pretty fast and prompt evaluation time is almost non existent. Which removes the need of flash attention. Something that even llama.cpp lacked for apple silicon architecture.

It even has c++ api that could be readily used. Would this be useful for what you are working on ?

ivarflakstad · 2023-12-07T21:01:35Z

Yes, I saw it :) Very cool stuff.
Taking inspiration from the kernels is especially interesting to me.

philipturner · 2023-12-07T21:35:39Z

Their GEMM kernels are slower than MPS because they don't use SIMD-group async copy instructions. This is the reason MFA was so finicky. You had to use an older Xcode toolchain to access the hardware instructions that Apple hid from the official shading language. I doubt MLX did rigorous benchmarks of how well their code performs across all possible matrix sizes.

LaurentMazare · 2024-03-02T20:21:29Z

Metal/MPS support has been added for a couple months now, let's close this issue and open new ones if new problems arise.

alexcardo · 2024-05-17T10:40:18Z

Point me out please what should I do to use Candle with Mac M1 GPU support. I was unable to find it reading the README file. In my case, it utilize only CPU and maybe 10% GPU.

philipturner · 2024-05-17T11:04:12Z

It already utilizes the GPU. GPU utilization depends on several factors. Note that not all operations are able to make use of GPU. This goes for all computations in general, not just AI. To utilize the GPU, the number of computations to memory transfers must be extremely high. For example, LLMs make poor use of GPUs and excellent use of CPUs because they are memory bandwidth bound.

alexcardo · 2024-05-17T11:14:11Z

My goal is to load MADLAD-400 on my Macbook M1 8GB (the only hardware I have). I'm forced to use a quantized .gguf version of this model. The only way to achieve this goal is via Candle (as described in the official Google MADLAD-400's HF repo).

Regularly, I'm using LLAMA CPP to run the quantized models. And, in case of 4bit quantization of 7B and even 8B models (latest LLAMA 3), it performs quite fast on GPU.

Meanwhile, when I'm trying to run MADLAD-400 7B or even 3B with candle, I see that the GPU resources are not used at all. Looks like even CPU resources are not used in a full power. And I see only 1 token/sec maximum, which make it useless for me.

I suppose that I need to use some additional parameter to run it or compile it another way.

I'm using this approach from the documentation:

cargo run --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0
...
 Wie geht es dir, mein Freund?

But I was unable to find any other extra parameters or whatever to run this package with GPU support.

LaurentMazare · 2024-05-17T11:15:43Z

You will want to use --features metal to turn on gpu support on apple silicon.

alexcardo · 2024-05-17T11:21:06Z

--features metal

Can you please suggest me the exact command I need to execute?

As follows from one video I've seen on Youtube, this parameter should be used during the cargo build. I'm not experienced in Rust at all. And I was unable to find any text instruction where there is at least something about this parameter (--features metal). And the only result I see in Google when I'm trying to search for this issue is this discussion.

alexcardo · 2024-05-17T11:55:44Z

I reinstalled cargo, candle, and everything. I executed cargo build --features metal in my Candle directory. No, Candle uses CPU instead of GPU for sure. It doesn't utilize GPU at all. Point me out what am I doing wrong.

ivarflakstad · 2024-05-17T17:34:37Z

cargo run is basically cargo build + execute the binary just built. Arguments to build and run are identical (exceptions exist but you don’t need to worry about that).

In other words when you call cargo run with different arguments than what you just used for cargo build it will build a different binary.

In summary you want something like

cargo run —features metal --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0

cloneable · 2024-05-21T10:21:29Z

Looks like CPU is hard-coded?

candle/candle-examples/examples/quantized-t5/main.rs

Line 85 in 7ebc354

let device = Device::Cpu;

ivarflakstad · 2024-05-21T17:34:32Z

Hah. Nice catch.
It should ideally have been something like
let device = candle_examples::device(args.cpu)?; - is that correct @LaurentMazare ?
(cpu is not an argument atm)

LaurentMazare · 2024-05-21T19:06:34Z

Yes candle_examples::device would be the first thing to try, it might fail if there are other hardcoded cpu in the model code but these should be easy fixes.

philipturner · 2024-06-15T00:01:25Z

I take it Candle is using MFA for the GEMM kernel (not the attention kernel)? If so, you can obsolete the current metallib file and translate this to Rust.

https://gist.github.com/philipturner/84f613a5cc745460a914d2c6ad226131

Optimized for M3 and BF16

rightsum · 2024-06-18T22:04:27Z

Hey folks, I made 2 small changes as suggested above to the code and used:

let device = candle_examples::device(false)?;

for both load and build methods and when I run

cargo run --features metal --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2fa> I love pizza!" \
  --temperature 0

I still receive a weird error on M1 Utltra:
Error: device mismatch in index-select, lhs: Metal { gpu_id: 4294971352 }, rhs: Metal { gpu_id: 4294971352 }

bruceunx · 2024-07-24T03:04:54Z

let device = Device::new_metal(0)?;

use this to replace

let device = Device::Cpu;

But even with metal, the rate in M2, 2-3 tokens/sec, it's still a little slow

Hey folks, I made 2 small changes as suggested above to the code and used:
let device = candle_examples::device(false)?;
for both load and build methods and when I run
cargo run --features metal --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2fa> I love pizza!" \
  --temperature 0
I still receive a weird error on M1 Utltra: Error: device mismatch in index-select, lhs: Metal { gpu_id: 4294971352 }, rhs: Metal { gpu_id: 4294971352 }

philipturner · 2024-07-24T03:41:18Z

“a little slow” is a bit subjective. Not a lot of context to evaluate how to improve on the speed.

Have you made a roofline model of the minimum latency per token, by dividing the model size (in GB) by the hardware bandwidth (in GB/s)? Does the model use speculative execution to amplify sequential throughput? Are there any major compute-bound components?

okpatil4u mentioned this issue Aug 14, 2023

Bigram Model #406

Closed

philpax mentioned this issue Oct 26, 2023

Future of candle-transformers / long-term plans #1186

Open

okpatil4u mentioned this issue Jan 15, 2024

Metal Backend not properly loading large models at 16GB of RAM #1568

Open

LaurentMazare closed this as completed Mar 2, 2024

overheat mentioned this issue Jun 13, 2024

Status of Apple silicon M3 GPU support? #2265

Open

Apple silicon (MPS backends) support? #313

Apple silicon (MPS backends) support? #313

Comments

overheat commented Aug 3, 2023

Narsil commented Aug 3, 2023

okpatil4u commented Aug 9, 2023

Narsil commented Aug 9, 2023

okpatil4u commented Aug 9, 2023

Narsil commented Aug 9, 2023

ioma8 commented Aug 30, 2023

okpatil4u commented Oct 1, 2023

minghuaw commented Nov 16, 2023

okpatil4u commented Nov 16, 2023

philipturner commented Nov 16, 2023

okpatil4u commented Nov 16, 2023 via email

philipturner commented Nov 17, 2023 • edited Loading

okpatil4u commented Nov 17, 2023

minghuaw commented Nov 17, 2023

philipturner commented Nov 17, 2023

okpatil4u commented Nov 17, 2023

philipturner commented Nov 17, 2023

philipturner commented Nov 17, 2023

ivarflakstad commented Nov 17, 2023

ivarflakstad commented Nov 17, 2023 • edited Loading

philipturner commented Nov 17, 2023 • edited Loading

okpatil4u commented Nov 18, 2023 • edited Loading

ivarflakstad commented Nov 18, 2023

philipturner commented Nov 18, 2023 • edited Loading

okpatil4u commented Nov 21, 2023

jk2K commented Nov 27, 2023

Narsil commented Nov 27, 2023

philipturner commented Nov 27, 2023

okpatil4u commented Nov 27, 2023

Narsil commented Nov 27, 2023

okpatil4u commented Dec 7, 2023

ivarflakstad commented Dec 7, 2023 • edited Loading

okpatil4u commented Dec 7, 2023

ivarflakstad commented Dec 7, 2023

philipturner commented Dec 7, 2023

LaurentMazare commented Mar 2, 2024

alexcardo commented May 17, 2024 • edited Loading

philipturner commented May 17, 2024

alexcardo commented May 17, 2024

LaurentMazare commented May 17, 2024

alexcardo commented May 17, 2024

alexcardo commented May 17, 2024

ivarflakstad commented May 17, 2024 • edited Loading

cloneable commented May 21, 2024

ivarflakstad commented May 21, 2024

LaurentMazare commented May 21, 2024

philipturner commented Jun 15, 2024

rightsum commented Jun 18, 2024 • edited Loading

bruceunx commented Jul 24, 2024

philipturner commented Jul 24, 2024 • edited Loading

philipturner commented Nov 17, 2023 •

edited

Loading

ivarflakstad commented Nov 17, 2023 •

edited

Loading

philipturner commented Nov 17, 2023 •

edited

Loading

okpatil4u commented Nov 18, 2023 •

edited

Loading

philipturner commented Nov 18, 2023 •

edited

Loading

ivarflakstad commented Dec 7, 2023 •

edited

Loading

alexcardo commented May 17, 2024 •

edited

Loading

ivarflakstad commented May 17, 2024 •

edited

Loading

rightsum commented Jun 18, 2024 •

edited

Loading

philipturner commented Jul 24, 2024 •

edited

Loading