Apple silicon (MPS backends) support? #313

Closed
overheat opened this issue Aug 3, 2023 · 50 comments

@overheat

overheat commented Aug 3, 2023

Does this support running on a MacBook?

@Narsil
Collaborator

Narsil commented Aug 3, 2023

It should work out of the box!

Performance is not optimized; we haven't even started the Metal backend, but it's on the roadmap!
We did one round of optimization before even using Accelerate, and we're ~1.5-2x slower than Accelerate with purely custom code. I hope we can replicate the same performance for Q4 (which is the end target for M1, imo).

LaurentMazare/gemm#5
LaurentMazare/gemm#6

Original repo and credits for the actual performance tricks: https://github.com/sarah-ek/gemm (we forked with sarah-ek's blessing; we'll hopefully merge upstream not too far in the future).

@okpatil4u

This would be a huge addition. Have you thought about https://github.com/philipturner/metal-flash-attention as a flash attention alternative?

@Narsil
Collaborator

Narsil commented Aug 9, 2023

I wasn't aware of that project.

In general, for adding backends, the biggest thing we can do is add support for custom ops from day 1, whether it's Metal, WebGPU, or ROCm. That's because we cannot build ALL the ops for all the backends in a timely fashion.

However, having used quite a few frameworks, usually you need only one op for your particular use case (like f16 conv2d, flash attention, or GPTQ), and you don't care about making sure it works with backprop and 200 other ops and op orderings.
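
To illustrate the idea, a custom-op hook can be as small as a single trait that users implement for the one kernel they need. This is a hypothetical sketch in Rust; the trait and method names are invented for illustration and are not candle's actual API.

// Hypothetical sketch of a user-supplied op hook (not candle's actual API):
// the framework asks only for a CPU fallback plus optional per-backend kernels,
// so a user can ship the one op they need (f16 conv2d, flash attention, GPTQ)
// without implementing backprop or the rest of the op set.
trait CustomOp {
    fn name(&self) -> &'static str;
    /// CPU reference implementation, always required.
    fn cpu_fwd(&self, input: &[f32]) -> Result<Vec<f32>, String>;
    /// Optional Metal implementation; `None` means "fall back to CPU".
    fn metal_fwd(&self, _input: &[f32]) -> Option<Result<Vec<f32>, String>> {
        None
    }
}

struct Silu;

impl CustomOp for Silu {
    fn name(&self) -> &'static str {
        "silu"
    }
    fn cpu_fwd(&self, input: &[f32]) -> Result<Vec<f32>, String> {
        // silu(x) = x * sigmoid(x)
        Ok(input.iter().map(|&x| x / (1.0 + (-x).exp())).collect())
    }
}

fn main() {
    let op = Silu;
    println!("{} -> {:?}", op.name(), op.cpu_fwd(&[0.0, 1.0, -1.0]).unwrap());
}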

@okpatil4u

Thanks, that makes sense. Are you planning to add Apple MPS support in version 2.0?

@Narsil
Collaborator

Narsil commented Aug 9, 2023

#360

These are the current ideas floating around, so not really.
But it might happen sooner than expected; sometimes we start checking something out, find it's relatively simple, and just go ahead and add it.

@okpatil4u okpatil4u mentioned this issue Aug 14, 2023
@ioma8

ioma8 commented Aug 30, 2023

I am also for MPS support.

@okpatil4u

Any idea when MPS support will be launched?

@minghuaw

Would a Vulkan compute shader backend get performance close to MPS?

@okpatil4u

Someone like @philipturner could comment on this.

@philipturner

Would a Vulkan compute shader backend get performance close to MPS?

Both Vulkan and MPS would have terrible compute performance: both less than 50% of max utilization for some important use cases, and Vulkan more like 20%, as slow as just running on CPU/Accelerate. GEMM and related ops are very difficult to get right with custom shaders. LLaMA.cpp is an exception; it can get away with custom Metal shaders because its compute bottleneck isn't the kind of operation that requires a lot of expertise.

To use Vulkan shaders, you need either GLSL (archaic) or WGSL. The latter is more advanced, but still lacks some very important features needed to reach maximum GPU performance. We're talking a factor of 3 slower in the parts that matter, in some instances. You need SIMD reductions, SIMD matrix multiplications, etc. None of those are available in any API except Metal (though I did backdoor the macOS-only OpenCL framework for a portion of them, but OpenCL doesn't work on iOS). For MPS, the lack of control over CPU-side overhead can cause major issues for small data sets.

Also, people generally don't consider CPU-GPU communication latency or GPU command encoding overhead as something to optimize for. They write a CPU-side wrapper in a host language that eagerly dispatches the GPU commands one at a time. CUDA was optimized for this, but Metal was optimized for a completely different usage pattern. Interestingly, though, Apple's OpenCL driver does have the automatic batching present in CUDA and ROCm.
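
To make the encoding-overhead point concrete, here is a rough sketch of batching many dispatches into a single command buffer using the metal-rs crate. The kernels.metallib file and the fill_zero kernel are placeholders for this example, and exact metal-rs signatures may vary between versions.

// Sketch: encode many GPU dispatches into one command buffer so the CPU-side
// encoding and commit overhead is paid once, not once per op.
// `kernels.metallib` and the `fill_zero` kernel are placeholders.
use metal::{Device, MTLResourceOptions, MTLSize};

fn main() {
    let device = Device::system_default().expect("no Metal device");
    let library = device
        .new_library_with_file("kernels.metallib")
        .expect("failed to load metallib");
    let function = library.get_function("fill_zero", None).expect("missing kernel");
    let pipeline = device
        .new_compute_pipeline_state_with_function(&function)
        .expect("pipeline creation failed");

    // 4096 f32 values in a shared-memory buffer.
    let buffer = device.new_buffer(4096 * 4, MTLResourceOptions::StorageModeShared);

    let queue = device.new_command_queue();
    let command_buffer = queue.new_command_buffer();
    let encoder = command_buffer.new_compute_command_encoder();
    encoder.set_compute_pipeline_state(&pipeline);
    encoder.set_buffer(0, Some(&buffer), 0);

    // Encode ~100 dispatches back to back; nothing runs until commit().
    for _ in 0..100 {
        encoder.dispatch_thread_groups(
            MTLSize { width: 4096 / 256, height: 1, depth: 1 },
            MTLSize { width: 256, height: 1, depth: 1 },
        );
    }
    encoder.end_encoding();

    command_buffer.commit();
    command_buffer.wait_until_completed();
}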

@okpatil4u

okpatil4u commented Nov 16, 2023 via email

@philipturner

philipturner commented Nov 17, 2023

I put in some performance metrics for the M1 Max chip, to give some context.

Python:

  • 1 GFLOPS FP64

"Fast" programming languages (e.g. C, C++, multithreaded Rust):

  • 400 GFLOPS FP64
  • 800 GFLOPS FP32
  • 1600 GFLOPS FP16

Accelerate runs on the CPU. It uses the AMX coprocessor, a hardware-accelerated matrix multiplier for the CPU. Its primary benefit is low latency and usage in very difficult-to-parallelize tasks, like matrix factorization. The way this hardware works, every group of 4-6 CPU cores comes with a single accelerator. In many cases, multithreading or using multiple accelerators (e.g. M3 Max, with two P-blocks) actually harms performance. Accelerate automatically promotes supermassive tasks to 2 threads when appropriate.

AMX via Accelerate:

  • 800 GFLOPS FP64
  • 3000 GFLOPS FP32
  • 6000 GFLOPS FP16 (not exposed by Accelerate)

Metal runs on the GPU. MPS is a proprietary closed-source library with some GPU kernels. MPSGraph is a domain-specific language for Swift, which automatically uses the ANE on certain devices. On iPhone and M1 (non-Pro), you can sometimes activate ANE by using FP16 and a supermassive matrix (4000 x 4000 x 4000). Generally, it's only 2x as fast as the GPU for GEMM. I wrote a library that's an alternative to MPS, and only uses the GPU. But it gives you more programmatic access to the GPU and implements GEMM algorithms better than Apple's MPS team.

GPU via Vulkan:

  • 3000 GFLOPS FP32
  • 6000 GFLOPS FP16

GPU via OpenCL:

  • 3000 GFLOPS FP32
  • 0 GFLOPS FP16

GPU via MPS:

  • 7500 GFLOPS FP32
  • 7000 GFLOPS FP16

GPU via MFA:

  • 8500 GFLOPS FP32
  • 10000 GFLOPS FP16

ANE is very hard to get full programmatic access to. Behind the scenes, it's a programmable fused multiply-add engine. You have to hard-code the weights into the assembly language, and it only supports FP16, so it can't be used for AI training. Not very useful for general-purpose compute, just for AI inferencing through CoreML.

ANE via MPSGraph: 4000 GFLOPS FP16
ANE advertised perf: 16000 GFLOPS FP16

@okpatil4u

Thanks. This is super useful. Can MFA support quantized operations as well (4-bit, 5-bit)? If yes, then what kind of benchmarks should one expect? What would be a good starting point for a Rust developer?

FYI, significant Metal development is already in progress for the candle framework.

@minghuaw

I just realized that PyTorch has some Metal shaders (https://github.com/pytorch/pytorch/tree/c233cef8fd648157fbaf85e1f2493562316a1ec4/aten/src/ATen/native/metal/ops). They might be useful for the candle Metal backend.

@philipturner


This is not really Metal, just some CPU code that calls into MPS.

I just realized that PyTorch has some Metal shaders

PyTorch's shaders are very basic elementwise ones, similar to LLaMA.cpp in scope. The difficult/important ones are GEMM-like operations, which are non-trivial to optimize.

Thanks. This is super useful. Can MFA support quantized operations as well (4-bit, 5-bit)? If yes, then what kind of benchmarks should one expect? What would be a good starting point for a Rust developer?

I made MFA so that you can modify it yourself. You can make a fork, tweak the shaders, and support different quantizations if you want. Although, at least for SDXL, it's advantageous to dequantize into a separate buffer before running GEMM: less effort than writing a new shader, and potentially faster (due to less redundant compute).
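
As a rough illustration of that dequantize-then-GEMM route, here is a sketch that assumes a Q4_0-style block layout (32 weights sharing one f16 scale, values stored as 4-bit offsets around 8) and the half crate for f16. This is not MFA or candle code, and real GGUF layouts differ in the details.

// Sketch: expand 4-bit quantized blocks into a plain f32 buffer, then hand the
// dense buffer to an ordinary GEMM/GEMV kernel instead of writing a quantized GEMM.
// Block layout assumed here: one f16 scale + 16 bytes holding 32 packed nibbles.
use half::f16;

const BLOCK: usize = 32;

struct Q4Block {
    scale: f16,       // per-block scale factor
    quants: [u8; 16], // 32 packed 4-bit values, two per byte
}

fn dequantize(blocks: &[Q4Block]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * BLOCK);
    for block in blocks {
        let d = block.scale.to_f32();
        for byte in block.quants {
            // Low nibble first, then high nibble; both re-centered around 8.
            out.push(d * ((byte & 0x0F) as f32 - 8.0));
            out.push(d * ((byte >> 4) as f32 - 8.0));
        }
    }
    out
}

fn main() {
    let blocks = vec![Q4Block { scale: f16::from_f32(0.5), quants: [0x80; 16] }];
    let weights = dequantize(&blocks);
    // `weights` is now a dense row that a standard GEMM kernel can consume.
    println!("{:?}", &weights[..4]);
}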

@okpatil4u

I had a chance to look into MFA earlier. To be honest, the project is intimidating to get started with, and the documentation is scarce. A quick comparison would be https://github.com/Dao-AILab/flash-attention. If this entry barrier were removed, I am sure I could introduce MFA to my group of developers.

@philipturner

It's ~500-1000 lines of shader code for GEMM and another ~500-1000 for attention. The source files are mostly self-documenting, with a lot of Metal function constants. You can enable or disable each function constant while creating the MTLComputePipelineState. There are also two reference implementations of how to use MFA from the Metal API: one in Swift, another in C++.

The matrix dimensions need to be known at pipeline creation time. That way, the backend compiler can do some clever constant-folding magic. It's extremely simple but extremely powerful code generation, executed much more effectively and robustly than in MPS. The function constants describing matrix dimensions are usually capital letters, for example M, N, K for a classic M x N x K matrix multiply. The constant names are designed to be concise and simple to understand.

I pre-compiled some Metal binaries and hosted them on the GitHub releases page. That removes the need to download esoteric Xcode versions and go through the complex build process. Just copy the binary file and load the .metallib. I can make a ~100-line Swift script that does this and executes a GEMM. I could even dig up one from a SwiftUI iOS app from a while ago.
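
For a sense of what consuming such a prebuilt .metallib looks like from Rust, here is a rough sketch using the metal-rs crate. The gemm.metallib name, the hgemm kernel name, and the constant indices are illustrative guesses rather than MFA's actual layout, and metal-rs signatures can differ between versions.

// Sketch: load a precompiled .metallib and specialize a GEMM kernel with
// function constants for M, N, K at pipeline-creation time, letting the backend
// compiler constant-fold the matrix dimensions.
// File name, kernel name, and constant indices are placeholders.
use metal::{Device, FunctionConstantValues, MTLDataType};
use std::ffi::c_void;

fn main() {
    let device = Device::system_default().expect("no Metal device");
    let library = device
        .new_library_with_file("gemm.metallib")
        .expect("failed to load metallib");

    // Dimensions of a classic M x N x K multiply, baked in as function constants
    // (argument order mirrors Objective-C's setConstantValue:type:atIndex:).
    let (m, n, k): (u32, u32, u32) = (1024, 1024, 1024);
    let constants = FunctionConstantValues::new();
    constants.set_constant_value_at_index(&m as *const u32 as *const c_void, MTLDataType::UInt, 0);
    constants.set_constant_value_at_index(&n as *const u32 as *const c_void, MTLDataType::UInt, 1);
    constants.set_constant_value_at_index(&k as *const u32 as *const c_void, MTLDataType::UInt, 2);

    let function = library
        .get_function("hgemm", Some(constants))
        .expect("missing kernel");
    let pipeline = device
        .new_compute_pipeline_state_with_function(&function)
        .expect("pipeline creation failed");

    println!(
        "max threads per threadgroup: {}, SIMD width: {}",
        pipeline.max_total_threads_per_threadgroup(),
        pipeline.thread_execution_width()
    );
}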

@philipturner

@ivarflakstad any interest in commenting? This is a Rust framework trying to use Metal for matrix multiplication.

@ivarflakstad
Member

Yes! I'm already involved in this work, working with @Narsil.
#1230 contains the initial work, which has been split into several PRs (#1309, #1323, etc.).

We've also been working on adding MPS support to metal-rs here.

I began working on a wrapper around metal-flash-attention here, but haven't had time to complete it. GEMM works though 😊
It's a bit too opinionated at the moment, so a bit of refactoring would be good as well.

@ivarflakstad
Member

ivarflakstad commented Nov 17, 2023

As a comment on documentation, dauntingness, etc.: I found the Metal code itself fairly straightforward (if we take into consideration that we're talking about GPGPU GEMM, flash attention, and pushing the limits of what is possible with Metal).

It was actually the Swift tests orchestrating the Metal execution that took me a while to understand: calculating the correct values for function constants, the async pipeline, cached operations, etc.
Finding the minimal working example was key. It often is 😊

(Perhaps most importantly @philipturner kindly answered all my dumb questions😉)

@philipturner

philipturner commented Nov 17, 2023

Also, MFA is currently not running optimally on the new A17/M3 architecture. It requires some major rewriting, which should get funded next summer. In the meantime, use MPS for FP32 GEMM and MFA for attention (on MTLGPUFamily.apple9).

(attachment: Figure_1)

As a comment on documentation, dauntingness

My primary interest is in a different field. I do document stuff very well when I want to (example). I'm just doing MFA as work during the summers.

It was actually the Swift tests orchestrating the Metal execution that took me a while to understand: calculating the correct values for function constants, the async pipeline, cached operations, etc.

When you write your own codebase, and write both the Swift and C++ implementations, this is the result. I do try to make the code as legible as possible, using modern coding practices. The unit tests got really unwieldy; they're just enough to "do the job" given time constraints.

The tensor library I wrote from scratch. I needed to bypass issues with CPU overhead to properly benchmark performance: batching almost 100 GPU commands into a single command buffer while using an API that's semantically eager. A bit of compiler/DSL engineering.

I found the Metal code itself fairly straightforward (if we take into consideration that we're talking about GPGPU GEMM, flash attention, and pushing the limits of what is possible with Metal).

I remember boasting to Tristan Dao about having the shortest, most elegant implementation on his list (inspired by Tinygrad). Other codebases are at least tens of thousands of lines and took >1 month to fully comprehend. Also, I came up with FlashAttention-2 independently, before he released the algorithm.

@okpatil4u

okpatil4u commented Nov 18, 2023

Thanks for clearing that up. This looks very impressive from where I stand. I come from a molecular simulations background as well; I worked on OpenMM in 2011-15, in its early days.

Finding a minimal working example was the issue for me when I briefly went through MFA. @philipturner, a working ~100-line Swift example would be great, if you could spare some time. Thank you.

Edit: Never mind. Just saw this https://github.com/ivarflakstad/metal-flash-attention-rs. This is a good starting point.

@ivarflakstad
Member

If you're interested, contributions are very welcome, @okpatil4u 😊

@philipturner

philipturner commented Nov 18, 2023

It may be an older iteration of the Metal function constants API, but here it is (200 lines): https://gist.github.com/philipturner/60c9b196a2e9361f1ec15a99a9267268

Edit: Yeah, this seems old, because there are no function constants for explicitly setting the block sizes.

@okpatil4u

Thanks @philipturner @ivarflakstad, I will look into it.

@jk2K

jk2K commented Nov 27, 2023

Is there anything I can do? I want to work with you.

@Narsil
Collaborator

Narsil commented Nov 27, 2023

This PR implements some basic Metal support:

#1318

It seems to work OK (tried M1 + M3), although speedups are only available on M3 for Phi and larger models.
The implementation is still pretty rough and might be modified quite significantly.

@philipturner

I wonder if Apple fixed the sequential throughput bottleneck in their drivers with the M3 generation. I'll have to benchmark my A17 Pro when I have the time.

@okpatil4u

Just had a quick check: both Metal and Accelerate are yielding the same speedup.

(attachment: Screenshot 2023-11-27 at 6.23.16 PM)

@Narsil
Collaborator

Narsil commented Nov 27, 2023

Quantized is still using the CPU, as CPU is hardcoded for now.

@okpatil4u

Hey @Narsil, just checking if there has been an update on this.

@ivarflakstad
Member

ivarflakstad commented Dec 7, 2023

On quantized or Metal support?
Metal support is progressing, but I've been stuck on a weird bug where there are NaNs in the buffer on the lhs of a matmul. To make it even more fun, it only happens on M1/M1 Pro and it's non-deterministic (the same input does not always reproduce the bug; it happens at some point after 40k operations).
I'm testing out alternatives though.

Can't comment on quantized.

@okpatil4u

I was asking about Metal support. Apple recently released MLX; not sure if you've had a chance to look at it. The Mistral 16-bit example is pretty fast, and prompt evaluation time is almost nonexistent, which removes the need for flash attention, something that even llama.cpp lacked on the Apple silicon architecture.

It even has a C++ API that could be readily used. Would this be useful for what you are working on?

@ivarflakstad
Member

Yes, I saw it :) Very cool stuff.
Taking inspiration from the kernels is especially interesting to me.

@philipturner

Their GEMM kernels are slower than MPS because they don't use SIMD-group async copy instructions. This is the reason MFA was so finicky: you had to use an older Xcode toolchain to access the hardware instructions that Apple hid from the official shading language. I doubt MLX did rigorous benchmarks of how well their code performs across all possible matrix sizes.

@LaurentMazare
Collaborator

Metal/MPS support has been in place for a couple of months now; let's close this issue and open new ones if new problems arise.

@alexcardo

alexcardo commented May 17, 2024

Please point me to what I should do to use Candle with Mac M1 GPU support. I was unable to find it in the README file. In my case, it utilizes only the CPU and maybe 10% of the GPU.

@philipturner

It already utilizes the GPU. GPU utilization depends on several factors. Note that not all operations are able to make use of the GPU; this goes for computation in general, not just AI. To utilize the GPU well, the ratio of computation to memory transfers must be extremely high. For example, LLMs make poor use of GPUs and excellent use of CPUs because they are memory-bandwidth bound.
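
As a back-of-the-envelope illustration of that ratio, here is a sketch with assumed, representative numbers, not a measurement of Candle:

// Rough arithmetic-intensity estimate for one decode step of a 7B-parameter
// model in 4-bit quantization. Numbers are illustrative, not measured.
fn main() {
    let params: f64 = 7e9;
    let bytes_per_weight: f64 = 0.5; // ~4 bits per weight
    let weight_bytes = params * bytes_per_weight; // ~3.5 GB read per token

    // Each weight contributes one multiply-add during a matrix-vector product.
    let flops = 2.0 * params;

    // FLOPs per byte of memory traffic: far below what saturates a GPU.
    println!("arithmetic intensity ~ {:.1} FLOP/byte", flops / weight_bytes);
    // GPUs typically need tens of FLOPs per byte of traffic to be compute-bound,
    // so single-token decoding is limited by memory bandwidth instead.
}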

@alexcardo

My goal is to load MADLAD-400 on my MacBook M1 8GB (the only hardware I have). I'm forced to use a quantized .gguf version of this model. The only way to achieve this is via Candle (as described in the official Google MADLAD-400 HF repo).

Normally, I use llama.cpp to run quantized models, and with 4-bit quantization of 7B and even 8B models (the latest Llama 3), it performs quite fast on the GPU.

Meanwhile, when I try to run MADLAD-400 7B or even 3B with Candle, I see that GPU resources are not used at all. It looks like even the CPU is not used at full power, and I see only 1 token/sec at most, which makes it useless for me.

I suppose that I need to use some additional parameter to run it or compile it another way.

I'm using this approach from the documentation:

cargo run --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0
...
 Wie geht es dir, mein Freund?

But I was unable to find any extra parameters, or any other way, to run this package with GPU support.

@LaurentMazare
Collaborator

You will want to use --features metal to turn on GPU support on Apple silicon.

@alexcardo

--features metal

Can you please suggest the exact command I need to execute?

From a video I've seen on YouTube, this flag should be used during cargo build. I'm not experienced in Rust at all, and I was unable to find any written instructions that mention this flag (--features metal). The only result I see on Google when I search for this is this discussion.

@alexcardo

I reinstalled cargo, Candle, and everything. I executed cargo build --features metal in my Candle directory. Candle definitely uses the CPU instead of the GPU; it doesn't utilize the GPU at all. Please point out what I am doing wrong.

@ivarflakstad
Member

ivarflakstad commented May 17, 2024

cargo run is basically cargo build plus executing the binary just built. Arguments to build and run are identical (exceptions exist, but you don't need to worry about that).

In other words, when you call cargo run with different arguments than what you just used for cargo build, it will build a different binary.

In summary you want something like

cargo run --features metal --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2de> How are you, my friend?" \
  --temperature 0

@cloneable

Looks like CPU is hard-coded?

let device = Device::Cpu;

@ivarflakstad
Member

Hah. Nice catch.
It should ideally have been something like
let device = candle_examples::device(args.cpu)?; - is that correct, @LaurentMazare?
(cpu is not an argument atm)

@LaurentMazare
Collaborator

Yes, candle_examples::device would be the first thing to try. It might fail if there are other hardcoded CPU devices in the model code, but those should be easy fixes.
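
For reference, the helper being discussed boils down to something like the following sketch. The fallback logic here is illustrative; see candle_examples for the real implementation.

// Sketch of the device-selection pattern under discussion: prefer the Metal GPU,
// fall back to the CPU. This mirrors the idea behind candle_examples::device,
// not its exact source.
use candle_core::{DType, Device, Result, Tensor};

fn device(force_cpu: bool) -> Result<Device> {
    if force_cpu {
        return Ok(Device::Cpu);
    }
    // If a Metal device can't be created (e.g. the binary wasn't built with
    // `--features metal`), fall back to the CPU instead of erroring out.
    Ok(Device::new_metal(0).unwrap_or(Device::Cpu))
}

fn main() -> Result<()> {
    let device = device(false)?;
    // Allocate a small tensor to confirm the chosen device works.
    let t = Tensor::zeros((2, 2), DType::F32, &device)?;
    println!("{t}");
    Ok(())
}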

@philipturner

I take it Candle is using MFA for the GEMM kernel (not the attention kernel)? If so, you can obsolete the current metallib file and translate this to Rust.

https://gist.github.com/philipturner/84f613a5cc745460a914d2c6ad226131

Optimized for M3 and BF16

@rightsum

rightsum commented Jun 18, 2024

Hey folks, I made the two small changes to the code suggested above and used:

let device = candle_examples::device(false)?;

for both load and build methods and when I run

cargo run --features metal --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2fa> I love pizza!" \
  --temperature 0

I still receive a weird error on an M1 Ultra:
Error: device mismatch in index-select, lhs: Metal { gpu_id: 4294971352 }, rhs: Metal { gpu_id: 4294971352 }

@bruceunx

let device = Device::new_metal(0)?;

Use this to replace:

let device = Device::Cpu;

But even with Metal, the rate on an M2 is 2-3 tokens/sec; it's still a little slow.

Hey folks, I made the two small changes to the code suggested above and used:

let device = candle_examples::device(false)?;

for both load and build methods and when I run

cargo run --features metal --example quantized-t5 --release  -- \
  --model-id "jbochi/madlad400-3b-mt" --weight-file "model-q4k.gguf" \
  --prompt "<2fa> I love pizza!" \
  --temperature 0

I still receive a weird error on an M1 Ultra: Error: device mismatch in index-select, lhs: Metal { gpu_id: 4294971352 }, rhs: Metal { gpu_id: 4294971352 }

@philipturner

philipturner commented Jul 24, 2024

“A little slow” is a bit subjective; there's not a lot of context to evaluate how to improve the speed.

Have you made a roofline model of the minimum latency per token, by dividing the model size (in GB) by the hardware bandwidth (in GB/s)? Does the model use speculative execution to amplify sequential throughput? Are there any major compute-bound components?
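
For reference, that roofline estimate is just a division. Here is a sketch with assumed numbers (a ~1.7 GB 4-bit 3B model and ~100 GB/s of unified memory bandwidth on a base M2), not measurements:

// Roofline sketch: the floor on per-token latency for a memory-bandwidth-bound
// decoder is (bytes read per token) / (memory bandwidth). Numbers below are
// assumptions for illustration only.
fn main() {
    let model_gb: f64 = 1.7;         // approx. size of a 4-bit 3B-parameter model
    let bandwidth_gb_s: f64 = 100.0; // approx. unified memory bandwidth of a base M2

    let min_latency_s = model_gb / bandwidth_gb_s;
    let max_tokens_per_s = 1.0 / min_latency_s;

    println!("minimum latency per token ~ {:.1} ms", min_latency_s * 1e3);
    println!("bandwidth-limited ceiling ~ {:.0} tokens/s", max_tokens_per_s);
    // If observed throughput (2-3 tokens/s) is far below this ceiling,
    // the bottleneck is overhead or compute, not memory bandwidth.
}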
