Add tracking of memory allocations to Device #1412
Conversation
@LaurentMazare, what are your thoughts on this PR?
I think statistics via VarBuilder would be better.
I don't think we would want something like this baked into candle core. It will be imprecise and error prone for GPUs because of CUDA's shenanigans, and that's exactly where it would be the most helpful - instead I would recommend using an external profiler.
I see how adding this type of metric is not ideal, but I think it would be much better to integrate the functionality into candle rather than forcing candle-vllm and future applications to depend on an external profiler. Perhaps this could be behind a feature flag?
Feature flags are hard to maintain over the long run, and this seems like a lot of code to introduce for a use case that is not very common, so I would rather avoid it.
Ok, do you know of a way to run a Candle model and get the nvprof results?
Hi @LaurentMazare, do you think you could take another look at this? I just wanted to add that this change is absolutely necessary for candle-vllm.
Generally I agree, however, adding this code is unfortunately the only way forward for candle-vllm.
This implementation precisely tracks the bytes allocated at all allocation sites. The purpose of this new API is not to trace the model, but to track the allocation high-water mark. Thank you!
I think you could add this feature (tracking of memory usage) to cudarc if candle doesn't want to take it. Candle calls cudarc for GPU memory allocation, and you can access cudarc from your candle-vllm.
That sounds like a great idea! I will open a PR. |
Hello everybody,
This PR adds automatic tracking of the number of bytes allocated by each device, which is necessary for the immediate further development of candle-vllm.
Why this is necessary
This addition allows candle-vllm to profile the amount of memory a model allocates. Knowing the peak allocation is essential when initializing a candle-vllm instance, which makes this change necessary for the immediate further development of candle-vllm.
Why this works
For each CPU, CUDA, or Metal allocation, a mutable state tracks the number of bytes allocated per device ordinal. Specifically, for CUDA and Metal kernels, a table mapping device ordinals to the current allocation level is used. The state is behind a mutex for soundness.
Because allocations should not occur frequently in performance-critical sections, locking the mutex induces only a negligible performance overhead.
API addition
I add the `reset_peak_memory_stats` and `max_memory_allocated` methods to `Device`. These respectively allow the state to be reset or read for a specific device ordinal.
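The per-ordinal tracking described above can be sketched roughly as follows. This is an illustrative standalone example, not the PR's actual code: the `MemoryStats` struct, `record_alloc`, and `record_free` are hypothetical names, and only the two public method names (`max_memory_allocated`, `reset_peak_memory_stats`) come from the PR.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical sketch of the tracking state: a mutex-guarded table mapping
// device ordinals to (current, peak) allocated byte counts.
struct MemoryStats {
    per_device: Mutex<HashMap<usize, (usize, usize)>>, // ordinal -> (current, peak)
}

impl MemoryStats {
    fn new() -> Self {
        Self { per_device: Mutex::new(HashMap::new()) }
    }

    // Called at every allocation site; updates the high-water mark.
    fn record_alloc(&self, ordinal: usize, bytes: usize) {
        let mut table = self.per_device.lock().unwrap();
        let entry = table.entry(ordinal).or_insert((0, 0));
        entry.0 += bytes;
        entry.1 = entry.1.max(entry.0);
    }

    // Called when a buffer is freed; only lowers the current level.
    fn record_free(&self, ordinal: usize, bytes: usize) {
        let mut table = self.per_device.lock().unwrap();
        if let Some(entry) = table.get_mut(&ordinal) {
            entry.0 = entry.0.saturating_sub(bytes);
        }
    }

    // Returns the high-water mark for a device ordinal.
    fn max_memory_allocated(&self, ordinal: usize) -> usize {
        self.per_device.lock().unwrap().get(&ordinal).map_or(0, |e| e.1)
    }

    // Resets the peak down to the current allocation level.
    fn reset_peak_memory_stats(&self, ordinal: usize) {
        let mut table = self.per_device.lock().unwrap();
        if let Some(entry) = table.get_mut(&ordinal) {
            entry.1 = entry.0;
        }
    }
}

fn main() {
    let stats = MemoryStats::new();
    stats.record_alloc(0, 1024);
    stats.record_alloc(0, 2048); // current = 3072, peak = 3072
    stats.record_free(0, 1024);  // current = 2048, peak stays at 3072
    assert_eq!(stats.max_memory_allocated(0), 3072);
    stats.reset_peak_memory_stats(0);
    assert_eq!(stats.max_memory_allocated(0), 2048);
    println!("peak after reset: {}", stats.max_memory_allocated(0));
}
```

Resetting the peak to the current level (rather than zero) mirrors the common profiler convention, so a caller can bracket a region of interest with a reset followed by a read.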