
Add tracking of memory allocations to Device #1412

Closed
wants to merge 4 commits

Conversation

EricLBuehler
Member

Hello everybody,

This PR adds automatic tracking of the number of bytes allocated on each device, which is needed for the immediate further development of candle-vllm.

Why this is necessary

This addition allows candle-vllm to profile the amount of memory it has allocated. That profiling is essential when a candle-vllm instance is initialized, which makes this change a prerequisite for the immediate further development of candle-vllm.

Why this works

For each CPU, CUDA, or Metal allocation, mutable state tracks the number of bytes allocated per device ordinal. Specifically, for the CUDA and Metal backends, a table maps each device ordinal to its current allocation level. The state is kept behind a mutex so that the shared mutation stays sound.

Because allocations should of course not happen frequently in performance-critical sections, locking the mutex should induce only a negligible performance regression.
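To make the mechanism concrete, here is a minimal sketch of the idea (the names and types are illustrative, not the PR's actual code): a mutex-guarded table from device ordinal to current and peak byte counts, updated at every allocation and deallocation site.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sketch only; the PR's actual types and names may differ.
// Maps a device ordinal to (current, peak) allocated byte counts.
static ALLOCATION_STATS: Mutex<Option<HashMap<usize, (usize, usize)>>> = Mutex::new(None);

/// Record an allocation of `bytes` on device `ordinal`.
fn track_alloc(ordinal: usize, bytes: usize) {
    let mut guard = ALLOCATION_STATS.lock().unwrap();
    let map = guard.get_or_insert_with(HashMap::new);
    let entry = map.entry(ordinal).or_insert((0, 0));
    entry.0 += bytes; // current usage
    entry.1 = entry.1.max(entry.0); // peak, i.e. the high-water mark
}

/// Record a deallocation of `bytes` on device `ordinal`.
fn track_dealloc(ordinal: usize, bytes: usize) {
    let mut guard = ALLOCATION_STATS.lock().unwrap();
    if let Some(map) = guard.as_mut() {
        if let Some(entry) = map.get_mut(&ordinal) {
            entry.0 = entry.0.saturating_sub(bytes);
        }
    }
}
```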

API addition

I added the reset_peak_memory_stats and max_memory_allocated methods to Device. These allow the tracked state to be reset or read, respectively, for a specific device ordinal.
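A hypothetical usage sketch; only the two method names come from this PR, while the exact signatures and the Result-based error handling are assumptions:

```rust
use candle_core::{Device, Result};

fn measure_peak_usage(device: &Device) -> Result<()> {
    // Clear the recorded high-water mark for this device.
    device.reset_peak_memory_stats()?;

    // ... run model initialization / allocations here ...

    // Read the peak number of bytes allocated since the reset.
    let peak = device.max_memory_allocated()?;
    println!("peak allocation: {peak} bytes");
    Ok(())
}
```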

@EricLBuehler
Member Author

@LaurentMazare, what are your thoughts on this PR?

@mokeyish
Contributor

I think statistics via VarBuilder would be better.

@LaurentMazare
Collaborator

I don't think we would want something like this baked into candle core. This will be imprecise and error prone for GPUs because of cuda's shenanigans, and that's the place where it would be the most helpful. Instead, I would recommend using an external profiler like nvprof, or something like bytehound on the cpu side. These wouldn't give candle-specific details, but they are battle-tested tools which are good to know across the board when investigating memory issues.

@EricLBuehler
Member Author

I see how adding this type of metric is not ideal, but I think it would be much better to integrate the functionality into candle rather than forcing candle-vllm and future applications to depend on nvprof. Ultimately, candle-specific information is required to calculate the exact number of blocks to allocate - my intended use case. Additionally, the profiler docs show a C++ API to start and stop profiling and read profiling results.

Perhaps this can go behind a feature flag?

@LaurentMazare
Collaborator

Feature flags are hard to maintain over the long run, and this seems like introducing a lot of code for a use case that is not very common, so I would rather avoid it.
Also, nvprof is the only accurate way to monitor memory usage on cuda: the cuda api is lazy, so the memory consumption reported by these measures in candle might well not be accurate. In the same way, using tracing on models running on cuda is not accurate at all.

@EricLBuehler
Member Author

Ok, do you know of a way to run a Candle model and get the nvprof results?

@EricLBuehler
Member Author

Hi @LaurentMazare, do you think you could take another look at this? I just wanted to add that this change is absolutely necessary for candle-vllm to work, and if you could consider merging this, it would be very helpful!

> Feature flags are hard to maintain over the long run, and this seems like introducing a lot of code for a use case that is not very common, so I would rather avoid it.

Generally I agree; however, adding this code is unfortunately the only way forward for candle-vllm.

> In the same way, using tracing on models running on cuda is not accurate at all.

This implementation precisely tracks the bytes allocated at every allocation site. The purpose of this new API is not to trace the model, but to track the allocation high-water mark.

Thank you!

@guoqingbao
Contributor

> Hi @LaurentMazare, do you think you could take another look at this? I just wanted to add that this change is absolutely necessary for candle-vllm to work, and if you could consider merging this, it would be very helpful!

I think you could add this feature (tracking of memory usage) to cudarc if they don't want to take it here. Candle calls cudarc for GPU memory allocation, and you can access cudarc directly from candle-vllm.
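A rough sketch of that suggestion, assuming a thin tracking wrapper in candle-vllm around cudarc's allocation calls (the wrapper names are hypothetical, and cudarc's exact API surface varies by version, so treat the calls as illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical wrapper: counts bytes requested through it rather than
// patching cudarc itself. Names here are illustrative, not cudarc's API.
pub struct TrackingAllocator {
    dev: Arc<cudarc::driver::CudaDevice>,
    allocated: AtomicUsize,
    peak: AtomicUsize,
}

impl TrackingAllocator {
    pub fn alloc_zeros_f32(
        &self,
        len: usize,
    ) -> Result<cudarc::driver::CudaSlice<f32>, cudarc::driver::DriverError> {
        let bytes = len * std::mem::size_of::<f32>();
        // Update the running total, then fold it into the high-water mark.
        let cur = self.allocated.fetch_add(bytes, Ordering::SeqCst) + bytes;
        self.peak.fetch_max(cur, Ordering::SeqCst);
        self.dev.alloc_zeros::<f32>(len)
    }
}
```

Atomics are used here instead of a mutex since each counter is updated independently; that is one design choice, not necessarily what a cudarc-level change would look like.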

@EricLBuehler
Member Author

That sounds like a great idea! I will open a PR.

EricLBuehler deleted the get_max_allocated branch on March 11, 2024.