
[Enhancements] add a vcude device to help mitigate compile time GPU memory usage #302

Merged
merged 13 commits into hidet-org:main from vcuda on Jul 5, 2023

Conversation

xinli-git (Collaborator)

The problem

During compilation, some passes need to replace certain operators with one or more new operators, for example resolve_variants and subgraph_rewrite. In doing so, hidet relies on imperative_run to generate the new operators. This creates many intermediate tensors whose footprint can exceed the actual run-time memory consumption, which makes larger models (such as Llama-7B) impossible to compile in fp16 even on a GPU with 24 GB of memory.

Fix

We introduce a vcuda device that allows GPU tensors to be stored on the CPU and transferred to the GPU only on demand. With this change, any additional GPU memory used during compilation is offloaded to the CPU.

This incurs some compilation overhead when enabled, but given how long such large models take to compile anyway, the increase is negligible.
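
For illustration, a minimal usage sketch of the idea (hypothetical: the exact user-facing API, including whether tensor creation functions accept a 'vcuda' device string, is an assumption rather than something stated in this PR):

import hidet

# Nominally a cuda tensor, but its storage stays in host memory until an
# operator actually needs it on the device (device string is an assumption).
w = hidet.randn([4096, 4096], device='vcuda')
y = hidet.ops.matmul(w, w)  # data becomes visible to the GPU only while the kernel runs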

With this change, this is the memory consumption when running the llama test on an RTX 3090:

Status of cuda:0 memory pool
   Allocated: 14081 MiB
        Peak: 14081 MiB
    Reserved: 1196 MiB
      Active: 12884 MiB
Status of cpu memory pool
   Allocated: 4 KiB
        Peak: 26022 MiB
    Reserved: 4 KiB
      Active: 0 Bytes
Status of vcuda:0 memory pool
   Allocated: 0 Bytes
        Peak: 25486 MiB
    Reserved: 0 Bytes
      Active: 0 Bytes

@xinli-git changed the title from "[enhancements] add a vcude device to help mitigate compile time GPU memory usage" to "[Enhancements] add a vcude device to help mitigate compile time GPU memory usage" on Jul 3, 2023
@yaoyaoding (Member) left a comment:

Hi @xinli-git, good work!

I left some alternative implementation suggestions; let's discuss.

Resolved review threads (outdated): python/hidet/graph/flow_graph.py, python/hidet/graph/operator.py, python/hidet/graph/tensor.py
Comment on lines 179 to 193

has_vcuda = len(inputs) > 0 and all(inp.device.is_vcuda() for inp in inputs)
if has_vcuda:
    # materialize the inputs on the real GPU before launching the kernel
    for inp in inputs:
        inp.move_from_vcuda()

candidate = self.candidates[self.pick_best_candidate(inputs, outputs)]
candidate(*inputs, *outputs)

if has_vcuda:
    # move inputs and outputs back to CPU-backed (vcuda) storage
    for inp in inputs:
        inp.move_to_vcuda()
    for outp in outputs:
        outp.move_to_vcuda()

yaoyaoding (Member):

There are two ways to implement the "vcuda" device:

  1. [The current version]: we copy the tensor to cuda before running the operator and copy it back to cpu when we finish.
  2. We can also rely on the unified addressing of cuda (see here and here). Essentially, unified addressing allows a cuda kernel to access (page-locked) host memory directly, so we do not need to copy to and from cuda device memory.

I prefer the second way. With it, we do not need to update our run_async function.
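
For illustration, a minimal sketch of the zero-copy idea (not part of the PR; it calls the CUDA runtime directly through ctypes and assumes libcudart.so is on the library path):

import ctypes

cudart = ctypes.CDLL('libcudart.so')

nbytes = 16 << 20
host_ptr = ctypes.c_void_p()
# cudaHostAllocMapped (0x02): page-locked host memory mapped into the device address space
err = cudart.cudaHostAlloc(ctypes.byref(host_ptr), ctypes.c_size_t(nbytes), ctypes.c_uint(0x02))
assert err == 0, f'cudaHostAlloc failed with error {err}'

# With unified virtual addressing, host_ptr can be passed to a kernel as-is;
# the GPU reads and writes the host memory over PCIe, with no explicit cudaMemcpy.
cudart.cudaFreeHost(host_ptr)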

xinli-git (Collaborator, Author):

This is a great suggestion and it works! Thank you.

yaoyaoding (Member):

One minor issue is that cpu tensors allocated by other libraries (e.g., pytorch and numpy) are not allocated with cudaMallocHost, so they are not page-locked. Thus, when we convert tensors from those libraries, we should allocate the tensor with hidet instead of sharing the weights (on platforms where cuda is available).
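
For illustration, a sketch of that conversion policy (hypothetical helper, not hidet's actual code; it again uses the CUDA runtime through ctypes and copies an external numpy array into a freshly allocated page-locked buffer instead of aliasing its pageable memory):

import ctypes
import numpy as np

cudart = ctypes.CDLL('libcudart.so')

def to_pinned_copy(arr: np.ndarray) -> np.ndarray:
    """Copy an externally allocated (pageable) array into page-locked memory."""
    arr = np.ascontiguousarray(arr)
    ptr = ctypes.c_void_p()
    err = cudart.cudaMallocHost(ctypes.byref(ptr), ctypes.c_size_t(arr.nbytes))
    assert err == 0, f'cudaMallocHost failed with error {err}'
    # wrap the pinned buffer as a numpy array and copy the data in;
    # the caller owns the buffer and would release it with cudaFreeHost
    buf = (ctypes.c_uint8 * arr.nbytes).from_address(ptr.value)
    pinned = np.frombuffer(buf, dtype=arr.dtype).reshape(arr.shape)
    np.copyto(pinned, arr)
    return pinned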

xinli-git (Collaborator, Author):

I think this case is fine, because if users provide a cpu tensor from another library and expect it to work on a CUDA device, that is already wrong.

xinli-git (Collaborator, Author):

A runtime check should catch this.

yaoyaoding (Member):

Makes sense.

Resolved review threads (outdated): python/hidet/runtime/compiled_task.py, python/hidet/testing/models/llama.py (3 threads)
@yaoyaoding (Member) left a comment:

Thanks @xinli-git !

@yaoyaoding merged commit a15f5c0 into hidet-org:main Jul 5, 2023
2 checks passed
@xinli-git deleted the vcuda branch July 5, 2023 21:07