[Enhancements] add a vcuda device to help mitigate compile time GPU memory usage #302
Conversation
Hi @xinli-git, good work!
I left some alternative implementation suggestions; let's discuss them.
```python
has_vcuda = len(inputs) > 0 and all(inp.device.is_vcuda() for inp in inputs)
if has_vcuda:
    for inp in inputs:
        inp.move_from_vcuda()

candidate = self.candidates[self.pick_best_candidate(inputs, outputs)]
candidate(*inputs, *outputs)

if has_vcuda:
    for inp in inputs:
        inp.move_to_vcuda()
    for outp in outputs:
        outp.move_to_vcuda()
```
There are two ways to implement the "vcuda" device:
- [The current version]: copy the tensor to cuda before running the operator and copy it back to cpu once the operator finishes.
- We can also rely on CUDA's unified addressing (see here and here). Essentially, unified addressing allows a CUDA kernel to access page-locked host memory directly, so we do not need to copy to and from CUDA device memory.
I prefer the second way. With it, we do not need to update our run_async function.
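For reference, here is a minimal sketch of the unified-addressing idea at the CUDA runtime level, written with ctypes against libcudart. This is an illustration rather than hidet code; the library name and error handling will differ in practice. The host buffer is allocated page-locked and mapped, and the device pointer obtained for it can be passed to a kernel in place of a cudaMalloc'ed pointer.

```python
# Minimal sketch: mapped, page-locked host memory that a CUDA kernel can
# access directly (no explicit host<->device copies). Illustration only.
import ctypes

# library name varies by installation (e.g. libcudart.so.12); adjust as needed
cudart = ctypes.CDLL("libcudart.so")

cudaHostAllocMapped = 0x02  # map the allocation into the device address space

def alloc_mapped_host(nbytes: int):
    host_ptr = ctypes.c_void_p()
    err = cudart.cudaHostAlloc(ctypes.byref(host_ptr), ctypes.c_size_t(nbytes),
                               ctypes.c_uint(cudaHostAllocMapped))
    assert err == 0, f"cudaHostAlloc failed with error {err}"

    dev_ptr = ctypes.c_void_p()
    err = cudart.cudaHostGetDevicePointer(ctypes.byref(dev_ptr), host_ptr, 0)
    assert err == 0, f"cudaHostGetDevicePointer failed with error {err}"

    # dev_ptr can now be passed to a kernel launch instead of a cudaMalloc'ed
    # pointer; the kernel reads and writes the page-locked host buffer directly.
    return host_ptr, dev_ptr
```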
This is a great suggestion, and it works! Thank you.
One minor issue is that a cpu tensor allocated by another library (e.g., pytorch or numpy) will not be allocated with cudaMallocHost, so it is not page-locked. Thus, when we convert tensors from those libraries, we should allocate the tensor with hidet instead of sharing the weights (on platforms where cuda is available).
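To illustrate the point (a sketch only, not hidet's actual conversion path): when importing a cpu tensor from numpy or pytorch on a CUDA-capable machine, allocate a fresh page-locked buffer with cudaMallocHost and copy the data into it, instead of aliasing the foreign buffer.

```python
# Sketch: copy an externally allocated CPU array into a page-locked buffer.
# Illustration only; real code would also pair the allocation with cudaFreeHost.
import ctypes
import numpy as np

cudart = ctypes.CDLL("libcudart.so")  # library name varies by installation

def to_pinned_copy(array: np.ndarray) -> np.ndarray:
    nbytes = array.nbytes
    host_ptr = ctypes.c_void_p()
    err = cudart.cudaMallocHost(ctypes.byref(host_ptr), ctypes.c_size_t(nbytes))
    assert err == 0, f"cudaMallocHost failed with error {err}"
    # view the pinned allocation as a numpy array and copy the data over
    buf = (ctypes.c_byte * nbytes).from_address(host_ptr.value)
    pinned = np.frombuffer(buf, dtype=array.dtype).reshape(array.shape)
    np.copyto(pinned, array)
    return pinned
```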
I think this case is fine: if a user passes a cpu tensor from another library and expects it to work on a CUDA device, that is already wrong.
A runtime check should catch this.
Makes sense.
Thanks @xinli-git!
The problem
During compilation, some passes need to replace certain operators with one or more new operators, for example resolve_variants and subgraph_rewrite. In doing so, hidet relies on imperative_run to generate the new operators. This creates many intermediate tensors whose memory footprint can exceed the actual run-time consumption, which makes larger models (such as Llama-7B) impossible to compile in fp16 even on a GPU with 24 GB of memory.

Fix
We introduce a vcuda device that allows GPU tensors to be stored on the CPU and transferred to the GPU only on demand. With this change, any additional GPU memory usage during compilation is offloaded to the CPU. This may incur some compilation overhead when enabled, but given the time-consuming nature of compiling such large models, the increase in compile time is negligible.
Now, on an RTX 3090, this is the GPU memory consumption for running the llama test: