[Feature] Export optimized cuda graph #130

Closed
digital-nomad-cheng opened this issue Mar 7, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@digital-nomad-cheng
Contributor

Is there a way to export the optimized CUDA graph as a runtime library / CUDA kernel and call that runtime library from the host?
If so, which APIs should I look at to do this?

@digital-nomad-cheng digital-nomad-cheng added the enhancement New feature or request label Mar 7, 2023
@yaoyaoding
Member

yaoyaoding commented Mar 7, 2023

Hi @digital-nomad-cheng,

Hidet supports dumping and loading a model (in the form of a FlowGraph, via its save and load methods). The dumped model stores all the information (e.g., model structure, weights) but leaves the compiled kernels in the hidet cache.

Currently, Hidet only has a Python-based runtime to execute a flow graph. We are still working on a VM-based runtime (e.g., TVM and Raf have such VM runtimes). In the future, the VM-based runtime will execute a hidet-compiled model without a Python environment, but for now Python is still required to run the model. Because we use CUDA graphs, the Python-based runtime has negligible overhead.
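The save/load round-trip described above might look roughly like the following sketch. It assumes the `hidet.save_graph` / `hidet.load_graph` helpers from the hidet documentation; `graph_path` is an illustrative convention of this sketch, not part of hidet's API.

```python
# Sketch: persisting and reloading a hidet FlowGraph.
# Assumes hidet.save_graph / hidet.load_graph (per hidet docs); exact names
# may differ between hidet versions. graph_path is a hypothetical helper.
from pathlib import Path


def graph_path(cache_dir: str, name: str) -> str:
    """Illustrative convention for where the serialized FlowGraph is stored."""
    return str(Path(cache_dir) / f"{name}.hidet")


def save_and_reload(flow_graph, path: str):
    import hidet  # imported lazily; requires hidet to be installed

    hidet.save_graph(flow_graph, path)  # stores graph structure and weights
    # compiled kernels are not in the file; they come from hidet's local cache
    return hidet.load_graph(path)
```

Note that, as the comment explains, only the graph structure and weights travel with the file: the compiled kernels are recompiled or fetched from the hidet cache on the machine that loads the graph.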

Let me know if you have any other questions about Hidet.

@digital-nomad-cheng
Contributor Author

Thanks for the clarification, looking forward to the VM-based runtime.

KTong821 pushed a commit to KTong821/hidet that referenced this issue Apr 24, 2024
**Context:**
I made these changes to help with debugging Gemma: the dump produces
many operators, and these changes make it easier to find, for example,
which operators involve the input IDs / position IDs / KV-cache.

**Summary of changes:**
- Add missing dump_op parameter to ctx.debug()
- Dump input indices (e.g. @23) in operator dump
- Prevent dump_op and dump_outputs from overriding each other in the
single-output case

This is an example `41_Concat_def.txt` taken from my Gemma
implementation, which corresponds to concatenating past keys in the
KV-cache with the current keys. The `Inputs` field shows the indices of
the operator inputs, which might be another operator output `@n` or some
graph input `@in:n`.

```
Operator:
Concat(0: float32(bs, 1, past_seq_len, 256), 1: float32(bs, 1, seq_len, 256), axis=2)
Inputs:
	0 <- @in:2
	1 <- @40
Task:
Task(
  name: concat
  parameters: 
    x0: tensor(float32, [bs, 1, past_seq_len, 256])
    x1: tensor(float32, [bs, 1, seq_len, 256])
    out: tensor(float32, [bs, 1, (past_seq_len + seq_len), 256])
  inputs: [x0, x1]
  outputs: [out]
  computations: 
    out: float32[bs, 1, (past_seq_len + seq_len), 256] where out[v, v_1, v_2, v_3] = ((v_2 < past_seq_len) ? x0[v, v_1, v_2, v_3] : x1[v, v_1, (v_2 - past_seq_len), v_3])
  attributes: {}
)
```
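The `@n` / `@in:n` notation above can be summarized in a small sketch: an operator input is either another operator's output (`@n`) or a graph input (`@in:n`). The function name here is illustrative, not hidet's actual code.

```python
# Hypothetical sketch of the input-reference notation used in the dump:
# "@n" refers to the output of operator n, "@in:n" to graph input n.


def format_input_ref(index: int, is_graph_input: bool) -> str:
    """Render an operator-input reference in the dump's notation."""
    return f"@in:{index}" if is_graph_input else f"@{index}"


# The Concat example above: input 0 comes from graph input 2 (the past
# keys in the KV-cache) and input 1 from operator 40's output.
refs = [format_input_ref(2, True), format_input_ref(40, False)]
print(refs)  # -> ['@in:2', '@40']
```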
vadiklyutiy pushed a commit that referenced this issue Jul 22, 2024
vadiklyutiy pushed a commit that referenced this issue Jul 23, 2024