
Refactor the metal backend to always reuse command encoders/buffers unless a shared memory access is requested #2037

Draft · wants to merge 44 commits into main
Conversation

@tomsanbear (Collaborator) commented Apr 10, 2024

This change replaces the pattern where each tensor operation provisions its own command buffer and encoder for every kernel invocation with a pattern where a shared encoder is handed to each kernel to set up its operations.

Prior to this change, we relied on command buffers executing sequentially to order operations and to ensure that the output of one operation was complete before being consumed downstream. We now leverage Metal's resource-based memory barriers to do this more efficiently, blocking an operation only until the operations producing its input dependencies have completed.
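
For illustration, the encoding pattern described above looks roughly like the sketch below, written against the metal crate directly. This is not code from this PR: pipeline and buffer setup are taken as parameters, and it assumes the crate exposes Metal's memoryBarrierWithResources: as memory_barrier_with_resources.

    use metal::{Buffer, CommandQueue, ComputePipelineState, MTLSize, ResourceRef};

    // Sketch only: encode two dependent kernels into ONE command buffer and ONE
    // compute encoder, ordering them with a resource barrier instead of
    // splitting them across command buffers.
    fn encode_dependent_ops(
        queue: &CommandQueue,
        op_a: &ComputePipelineState,
        op_b: &ComputePipelineState,
        intermediate: &Buffer, // written by op_a, read by op_b
        grid: MTLSize,
        group: MTLSize,
    ) {
        let command_buffer = queue.new_command_buffer();
        let encoder = command_buffer.new_compute_command_encoder();

        // First kernel writes `intermediate`.
        encoder.set_compute_pipeline_state(op_a);
        encoder.set_buffer(0, Some(intermediate), 0);
        encoder.dispatch_thread_groups(grid, group);

        // Assumed binding for memoryBarrierWithResources:. Only work touching
        // `intermediate` is blocked, rather than waiting on a whole command buffer.
        let resources: [&ResourceRef; 1] = [intermediate];
        encoder.memory_barrier_with_resources(&resources);

        // Second kernel reads `intermediate`.
        encoder.set_compute_pipeline_state(op_b);
        encoder.set_buffer(0, Some(intermediate), 0);
        encoder.dispatch_thread_groups(grid, group);

        encoder.end_encoding();
        command_buffer.commit();
    }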

What are the outcomes in terms of performance? In summary, not much: in theory I would have expected a minor gain from the improved parallelization and the lower overhead of using only a few command buffers and encoders, but in practice the example models show no measurable change.

Although performance is unchanged on our models, this change makes the gputrace output much more stable. On main, generating a gputrace would often cause OOM errors due to the number of extra recorded command buffers/encoders; we now get a much leaner recording of the models, and the memory required to load these traces has been cut down drastically.

The major changes to note here are the following (a rough sketch of the lazy buffer/encoder accessors follows the list):

  • command buffers are lazily initialized on calls to MetalDevice::command_buffer
  • command encoders are lazily initialized on calls to MetalDevice::command_encoder
  • a new RwLock is introduced for tracking the command encoder
  • the device struct keeps the command buffer and encoder fields private, so users always go through the helper methods to access them
  • the method for copying data to the CPU now closes the current compute encoder and initializes a fresh one afterwards
  • usage of the "wait_until_completed" instruction for copies has been reduced; we now only rely on it when resources must be synchronized between the CPU and GPU
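
A rough sketch of the lazy initialization and RwLock bookkeeping described above (the struct, the closure-based accessor, and the flush helper are illustrative only, not the PR's actual API):

    use std::sync::RwLock;
    use metal::{CommandBuffer, CommandQueue, ComputeCommandEncoder, ComputeCommandEncoderRef};

    // Illustrative device state: both handles start as None and are created on
    // first use, then reused by every subsequent op.
    struct SketchDevice {
        command_queue: CommandQueue,
        command_buffer: RwLock<Option<CommandBuffer>>,
        command_encoder: RwLock<Option<ComputeCommandEncoder>>,
    }

    impl SketchDevice {
        // Run `f` against the shared compute encoder, creating the command
        // buffer and encoder lazily instead of once per tensor operation.
        fn with_encoder<T>(&self, f: impl FnOnce(&ComputeCommandEncoderRef) -> T) -> T {
            let mut buffer_guard = self.command_buffer.write().unwrap();
            let buffer = buffer_guard
                .get_or_insert_with(|| self.command_queue.new_command_buffer().to_owned());
            let mut encoder_guard = self.command_encoder.write().unwrap();
            let encoder = encoder_guard
                .get_or_insert_with(|| buffer.new_compute_command_encoder().to_owned());
            f(encoder)
        }

        // Only a CPU-visible copy needs to block: end encoding, commit, wait,
        // and drop both handles so the next op lazily starts a fresh pair.
        fn flush_for_cpu_read(&self) {
            if let Some(encoder) = self.command_encoder.write().unwrap().take() {
                encoder.end_encoding();
            }
            if let Some(buffer) = self.command_buffer.write().unwrap().take() {
                buffer.commit();
                buffer.wait_until_completed();
            }
        }
    }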

@tomsanbear (Collaborator, Author)

Hoping this branch will address some portion of #1939

@tomsanbear changed the title from "Refactor the metal implementation to reuse command encoders/buffers always unless a shared memory access is requested" to "Refactor the metal backend to reuse command encoders/buffers always unless a shared memory access is requested" on Apr 10, 2024
@LaurentMazare LaurentMazare self-assigned this Apr 10, 2024
@tomsanbear tomsanbear marked this pull request as ready for review April 12, 2024 15:18
@tomsanbear (Collaborator, Author) commented Apr 12, 2024

Okay, I'm happy with the PR in its current state!

@ivarflakstad it'd be great to get your eyes on this change as well!

Feel free to reach out on discord if you want to go through parts of the change together and avoid back and forth 👍

@tomsanbear changed the title from "Refactor the metal backend to reuse command encoders/buffers always unless a shared memory access is requested" to "Refactor the metal backend to always reuse command encoders/buffers unless a shared memory access is requested" on Apr 12, 2024
@tomsanbear (Collaborator, Author)

@LaurentMazare I did some cleanup of the areas commented on above; let me know if you have any other ideas for improving the docs.

    if args.metal_tracing {
        use candle::Device;
        if let Device::Metal(metal_device) = device.clone() {
            metal_device.capture("/tmp/candle.gputrace")?;
        }
    }
Collaborator:
Let's make the metal-tracing arg an Option<String> so that it specifies the filename for the trace rather than hardcoding it?
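
For illustration, the suggested argument shape might look something like this (a sketch assuming the example uses clap's derive API; not code from this PR):

    use clap::Parser;

    #[derive(Parser)]
    struct Args {
        /// When set, write a Metal gputrace to this path instead of a hardcoded one.
        #[arg(long)]
        metal_tracing: Option<String>,
    }

    fn main() {
        let args = Args::parse();
        if let Some(path) = &args.metal_tracing {
            // Hypothetical wiring: pass the path through to metal_device.capture(path).
            println!("metal tracing enabled, writing to {path}");
        }
    }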

@ivarflakstad (Collaborator) commented Apr 13, 2024
Agreed. Adding to this, does it make sense to have this as an env var? In other words, let gpu tracing be independent of what type of model you're running.

Collaborator:
In which case we should log that gpu tracing is enabled

Collaborator:
I would rather keep it as an explicit argument, as we try not to rely on env variables (though there might be one remaining). The gist of it is that env variables are hard to discover, whereas arguments are documented. This way it also doesn't depend on whether the env variable gets changed by the user code calling candle, etc.
(And on another note, candle actively tries to be log-free outside of main.rs.)

Collaborator:
Right. I was thinking it would only take effect with the debug or release-with-debug profiles. I imagine no one is doing gpu tracing in a production environment.
Debug logging is compiled away in release builds.

@tomsanbear (Collaborator, Author) commented Apr 13, 2024

Just an update: I've run through most of the examples and they work fine. The exception is stable diffusion (wuerstchen is fine, though), which outputs garbage in the generated photo compared to main. I'll be looking into this and will tag when I find the root cause; it's not immediately obvious so far.

@LaurentMazare (Collaborator):
> Just an update: I've run through most of the examples and they work fine. The exception is stable diffusion (wuerstchen is fine, though), which outputs garbage in the generated photo compared to main. I'll be looking into this and will tag when I find the root cause; it's not immediately obvious so far.

Gave it a quick look and it indeed looks pretty weird. When enabling intermediary-images (to make it use the VAE earlier) and comparing with the base version via some print statements, I saw the first difference on the last layer of the VAE, but I also got a run where it didn't seem to happen. So it may well be some form of race condition; I'm not sure whether it's between kernels or when fetching the data back from the GPU. Also worth pointing out that my Mac gets very low on memory at that point (with 16GB), but I would expect an actual error rather than data corruption if that were the issue.
Two other things that may be worth trying if easy: disabling the buffer reuse and using freshly allocated buffers each time, and limiting the maximum number of encoded ops per command buffer.
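
For the second idea, here is a library-agnostic sketch of the counting logic (names are illustrative; in the real backend commit_batch would end the encoder and commit the Metal command buffer):

    // Sketch only: cap how many ops are encoded before the current command
    // buffer is committed and a fresh one is started.
    struct OpBatcher {
        ops_in_current_batch: usize,
        max_ops_per_batch: usize,
        committed_batches: usize,
    }

    impl OpBatcher {
        fn new(max_ops_per_batch: usize) -> Self {
            Self { ops_in_current_batch: 0, max_ops_per_batch, committed_batches: 0 }
        }

        // Call once per encoded kernel; flushes the batch when the cap is hit.
        fn record_op(&mut self) {
            if self.ops_in_current_batch >= self.max_ops_per_batch {
                self.commit_batch();
            }
            self.ops_in_current_batch += 1;
        }

        fn commit_batch(&mut self) {
            // Real backend: end_encoding() + commit() here, then lazily recreate.
            self.committed_batches += 1;
            self.ops_in_current_batch = 0;
        }
    }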

@LaurentMazare (Collaborator):
Any luck with this? It would be great to have this for analyzing performance on metal, so I'm pretty keen on it if it still works with stable diffusion etc.

@tomsanbear (Collaborator, Author):
> Any luck with this? It would be great to have this for analyzing performance on metal, so I'm pretty keen on it if it still works with stable diffusion etc.

Unfortunately I haven't had the bandwidth the last two weeks, but I should have more time coming up to investigate. I was going back through my implementation trying to hunt down where the change was introduced; so far nothing seems apparent, so now I'm running through the operations performed by the stable diffusion model to understand what's going on.

@tomsanbear (Collaborator, Author):
Updated the branch here with main so we don't drift too far out of sync.

@tomsanbear (Collaborator, Author):
Just an update, @LaurentMazare:

I did finally trace this down to something odd going on in the memory of certain tensors. At some point the buffer for the "encoder_hidden_states" is getting overwritten with zeros, and this also seems to happen to other buffers, which is why generated images are currently coming out black. I'm not sure where this is coming from; I initially thought it might be the buffer reuse, but the issue is still reproducible after disabling that.

As for next steps, I'm not quite sure. If anyone has a clue or inkling why this may be occurring for the stable diffusion example but not for any other example, I would love to know; I'm just not familiar enough to have a clear "aha" moment on this yet.
