[Enhancements] add a vcuda device to help mitigate compile time GPU memory usage #302
Conversation
Hi @xinli-git, good work!
I left some alternative implementation suggestions; let's discuss them.
```python
has_vcuda = len(inputs) > 0 and all(inp.device.is_vcuda() for inp in inputs)
if has_vcuda:
    for inp in inputs:
        inp.move_from_vcuda()

candidate = self.candidates[self.pick_best_candidate(inputs, outputs)]
candidate(*inputs, *outputs)

if has_vcuda:
    for inp in inputs:
        inp.move_to_vcuda()
    for outp in outputs:
        outp.move_to_vcuda()
```
There are two ways to implement the "vcuda" device:
- [The current version]: copy the tensor to cuda before running the operator and copy it back to cpu once the operator finishes.
- We can also rely on CUDA's unified addressing (see here and here). Essentially, unified addressing allows a CUDA kernel to access page-locked host memory directly, so we do not need to copy to and from CUDA device memory.
I prefer the second way. With it, we do not need to update our run_async function.
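For reference, here is a minimal sketch of the unified-addressing idea at the CUDA runtime level, written with ctypes against libcudart. This is an illustration rather than hidet code; the library name and error handling will differ in practice. The host buffer is allocated page-locked and mapped, and the device pointer obtained for it can be passed to a kernel in place of a cudaMalloc'ed pointer.

```python
# Minimal sketch: mapped, page-locked host memory that a CUDA kernel can
# access directly (no explicit host<->device copies). Illustration only.
import ctypes

# library name varies by installation (e.g. libcudart.so.12); adjust as needed
cudart = ctypes.CDLL("libcudart.so")

cudaHostAllocMapped = 0x02  # map the allocation into the device address space

def alloc_mapped_host(nbytes: int):
    host_ptr = ctypes.c_void_p()
    err = cudart.cudaHostAlloc(ctypes.byref(host_ptr), ctypes.c_size_t(nbytes),
                               ctypes.c_uint(cudaHostAllocMapped))
    assert err == 0, f"cudaHostAlloc failed with error {err}"

    dev_ptr = ctypes.c_void_p()
    err = cudart.cudaHostGetDevicePointer(ctypes.byref(dev_ptr), host_ptr, 0)
    assert err == 0, f"cudaHostGetDevicePointer failed with error {err}"

    # dev_ptr can now be passed to a kernel launch instead of a cudaMalloc'ed
    # pointer; the kernel reads and writes the page-locked host buffer directly.
    return host_ptr, dev_ptr
```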
This is a great suggestion, and it works! Thank you.
One minor issue is that a cpu tensor allocated by another library (e.g., pytorch or numpy) will not be allocated with cudaMallocHost, so it is not page-locked. Thus, when we convert tensors from those libraries, we should allocate the tensor with hidet instead of sharing the weights (on platforms where cuda is available).
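To illustrate the point (a sketch only, not hidet's actual conversion path): when importing a cpu tensor from numpy or pytorch on a CUDA-capable machine, allocate a fresh page-locked buffer with cudaMallocHost and copy the data into it, instead of aliasing the foreign buffer.

```python
# Sketch: copy an externally allocated CPU array into a page-locked buffer.
# Illustration only; real code would also pair the allocation with cudaFreeHost.
import ctypes
import numpy as np

cudart = ctypes.CDLL("libcudart.so")  # library name varies by installation

def to_pinned_copy(array: np.ndarray) -> np.ndarray:
    nbytes = array.nbytes
    host_ptr = ctypes.c_void_p()
    err = cudart.cudaMallocHost(ctypes.byref(host_ptr), ctypes.c_size_t(nbytes))
    assert err == 0, f"cudaMallocHost failed with error {err}"
    # view the pinned allocation as a numpy array and copy the data over
    buf = (ctypes.c_byte * nbytes).from_address(host_ptr.value)
    pinned = np.frombuffer(buf, dtype=array.dtype).reshape(array.shape)
    np.copyto(pinned, array)
    return pinned
```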
I think this case is fine: if a user passes a cpu tensor from another library and expects it to work on a CUDA device, that is already wrong.
A runtime check should catch this.
Makes sense.
Thanks @xinli-git!
The problem
During compilation, some passes need to replace certain operators with one or more new operators, for example resolve_variants and subgraph_rewrite. In doing so, hidet relies on imperative_run to generate the new operators. This creates many intermediate tensors whose memory footprint can exceed the actual run-time consumption, which makes larger models (such as Llama-7B) impossible to compile in fp16 even on a GPU with 24 GB of memory.

Fix
We introduce a vcuda device that allows GPU tensors to be stored on the CPU and transferred to the GPU only on demand. With this change, any additional GPU memory usage during compilation is offloaded to the CPU. This may incur some compilation overhead when enabled, but given the time-consuming nature of compiling such large models, the increase in compile time is negligible.
Now, on an RTX 3090, this is the GPU memory consumption for running the llama test: