
Add tracking of memory allocations to Device #1412

Closed
wants to merge 4 commits

Conversation

EricLBuehler
Member

Hello everybody,

This PR adds automatic tracking of the number of bytes allocated on each device, which is needed for the immediate further development of candle-vllm.

Why this is necessary

This addition allows candle-vllm to profile the amount of memory it has allocated. That profiling is essential when a candle-vllm instance is initialized, which makes this change a prerequisite for the immediate further development of candle-vllm.

Why this works

For each CPU, CUDA, or Metal allocation, mutable state tracks the number of bytes allocated per device ordinal. Specifically, for the CUDA and Metal backends, a table maps each device ordinal to its current allocation level. The state is kept behind a mutex so that the shared mutation stays sound.

Because allocations should of course not happen frequently in performance-critical sections, locking the mutex should induce only a negligible performance regression.
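To make the mechanism concrete, here is a minimal sketch of the idea (the names and types are illustrative, not the PR's actual code): a mutex-guarded table from device ordinal to current and peak byte counts, updated at every allocation and deallocation site.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sketch only; the PR's actual types and names may differ.
// Maps a device ordinal to (current, peak) allocated byte counts.
static ALLOCATION_STATS: Mutex<Option<HashMap<usize, (usize, usize)>>> = Mutex::new(None);

/// Record an allocation of `bytes` on device `ordinal`.
fn track_alloc(ordinal: usize, bytes: usize) {
    let mut guard = ALLOCATION_STATS.lock().unwrap();
    let map = guard.get_or_insert_with(HashMap::new);
    let entry = map.entry(ordinal).or_insert((0, 0));
    entry.0 += bytes; // current usage
    entry.1 = entry.1.max(entry.0); // peak, i.e. the high-water mark
}

/// Record a deallocation of `bytes` on device `ordinal`.
fn track_dealloc(ordinal: usize, bytes: usize) {
    let mut guard = ALLOCATION_STATS.lock().unwrap();
    if let Some(map) = guard.as_mut() {
        if let Some(entry) = map.get_mut(&ordinal) {
            entry.0 = entry.0.saturating_sub(bytes);
        }
    }
}
```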

API addition

I added the reset_peak_memory_stats and max_memory_allocated methods to Device. These allow the tracked state to be reset or read, respectively, for a specific device ordinal.
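A hypothetical usage sketch; only the two method names come from this PR, while the exact signatures and the Result-based error handling are assumptions:

```rust
use candle_core::{Device, Result};

fn measure_peak_usage(device: &Device) -> Result<()> {
    // Clear the recorded high-water mark for this device.
    device.reset_peak_memory_stats()?;

    // ... run model initialization / allocations here ...

    // Read the peak number of bytes allocated since the reset.
    let peak = device.max_memory_allocated()?;
    println!("peak allocation: {peak} bytes");
    Ok(())
}
```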

@EricLBuehler
Member Author

@LaurentMazare, what are your thoughts on this PR?

@mokeyish
Contributor

I think statistics via VarBuilder would be better.

@LaurentMazare
Collaborator

I don't think we would want something like this baked into candle core. This will be imprecise and error prone for GPUs because of cuda's shenanigans, and that's the place where it would be the most helpful. Instead, I would recommend using an external profiler like nvprof, or something like bytehound on the cpu side. These wouldn't give candle-specific details, but they are battle-tested tools which are good to know across the board when investigating memory issues.

@EricLBuehler
Member Author

I see how adding this type of metric is not ideal, but I think it would be much better to integrate the functionality into candle rather than forcing candle-vllm and future applications to depend on nvprof. Ultimately, candle-specific information is required to calculate the exact number of blocks to allocate - my intended use case. Additionally, the profiler docs show a C++ API to start and stop profiling and read profiling results.

Perhaps this can go behind a feature flag?

@LaurentMazare
Collaborator

Feature flags are hard to maintain over the long run, and this seems like introducing a lot of code for a use case that is not very common, so I would rather avoid it.
Also, nvprof is the only accurate way to monitor memory usage on cuda: the cuda api is lazy, so the memory consumption reported by these measures in candle might well not be accurate. In the same way, using tracing on models running on cuda is not accurate at all.

@EricLBuehler
Member Author

Ok, do you know of a way to run a Candle model and get the nvprof results?

@EricLBuehler
Member Author

Hi @LaurentMazare, do you think you could take another look at this? I just wanted to add that this change is absolutely necessary for candle-vllm to work, and if you could consider merging this, it would be very helpful!

> Feature flags are hard to maintain over the long run, and this seems like introducing a lot of code for a use case that is not very common, so I would rather avoid it.

Generally I agree; however, adding this code is unfortunately the only way forward for candle-vllm.

> In the same way, using tracing on models running on cuda is not accurate at all.

This implementation precisely tracks the bytes allocated at every allocation site. The purpose of this new API is not to trace the model, but to track the allocation high-water mark.

Thank you!

@guoqingbao
Contributor

> Hi @LaurentMazare, do you think you could take another look at this? I just wanted to add that this change is absolutely necessary for candle-vllm to work, and if you could consider merging this, it would be very helpful!

I think you could add this feature (tracking of memory usage) to cudarc if they don't want to take it here. Candle calls cudarc for GPU memory allocation, and you can access cudarc directly from candle-vllm.
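A rough sketch of that suggestion, assuming a thin tracking wrapper in candle-vllm around cudarc's allocation calls (the wrapper names are hypothetical, and cudarc's exact API surface varies by version, so treat the calls as illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical wrapper: counts bytes requested through it rather than
// patching cudarc itself. Names here are illustrative, not cudarc's API.
pub struct TrackingAllocator {
    dev: Arc<cudarc::driver::CudaDevice>,
    allocated: AtomicUsize,
    peak: AtomicUsize,
}

impl TrackingAllocator {
    pub fn alloc_zeros_f32(
        &self,
        len: usize,
    ) -> Result<cudarc::driver::CudaSlice<f32>, cudarc::driver::DriverError> {
        let bytes = len * std::mem::size_of::<f32>();
        // Update the running total, then fold it into the high-water mark.
        let cur = self.allocated.fetch_add(bytes, Ordering::SeqCst) + bytes;
        self.peak.fetch_max(cur, Ordering::SeqCst);
        self.dev.alloc_zeros::<f32>(len)
    }
}
```

Atomics are used here instead of a mutex since each counter is updated independently; that is one design choice, not necessarily what a cudarc-level change would look like.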

@EricLBuehler
Member Author

That sounds like a great idea! I will open a PR.

EricLBuehler deleted the get_max_allocated branch on March 11, 2024.