[Feature] Export optimized cuda graph #130
Hidet supports dumping and loading a model (in the form of a FlowGraph, via its save and load methods). The dumped model stores all the information (e.g., model structure and weights) but leaves the compiled kernels in the hidet cache. Currently, Hidet only has a Python-based runtime to execute a flow graph. We are still working on a VM-based runtime (e.g., TVM and RAF have such a VM runtime). In the future, the VM-based runtime will execute a hidet-compiled model without a Python environment, but for now Python is still needed to run the model. Because we use CUDA graphs, the Python-based runtime has negligible overhead. Let me know if you have any other questions about Hidet.
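For anyone reading along, here is a minimal sketch of the save/load round trip described above. `graph.save` and `FlowGraph.load` follow the comment; `cuda_graph()` and `run()` are assumptions about the CUDA-graph helper and may differ between Hidet versions:

```python
import hidet
from hidet.graph import FlowGraph

# Trace a tiny model into a FlowGraph (symbolic input on CUDA).
x = hidet.symbol([1, 16], dtype='float32', device='cuda')
y = hidet.ops.relu(x)
graph: FlowGraph = hidet.trace_from(y, inputs=[x])

# Dump the model: structure and weights go into the file, while
# compiled kernels stay in the hidet cache (as noted above).
graph.save('model.hidet')

# Load it back and run it through a CUDA graph, so the Python-side
# launch overhead is negligible.
loaded = FlowGraph.load('model.hidet')
cuda_graph = loaded.cuda_graph()   # assumed helper that captures the CUDA graph
outputs = cuda_graph.run([hidet.randn([1, 16], device='cuda')])
```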
Thanks for the clarification, looking forward to the VM-based runtime.
**Context:** I made these changes to help with debugging Gemma. The dump produces many operators, and this makes it easier, for example, to find which operators involve the input IDs / position IDs / KV-cache.

**Summary of changes:**

- Add the missing `dump_op` parameter to `ctx.debug()`
- Dump input indices (e.g. `@23`) in the operator dump
- Prevent `dump_op` and `dump_outputs` from overriding each other in the single-output case

Below is an example `41_Concat_def.txt` taken from my Gemma implementation, which corresponds to concatenating the past keys in the KV-cache with the current keys. The `Inputs` field shows the indices of the operator inputs, which may be another operator's output (`@n`) or a graph input (`@in:n`).

```
Operator: Concat(0: float32(bs, 1, past_seq_len, 256), 1: float32(bs, 1, seq_len, 256), axis=2)
Inputs:
    0 <- @in:2
    1 <- @40
Task: Task(
  name: concat
  parameters:
    x0: tensor(float32, [bs, 1, past_seq_len, 256])
    x1: tensor(float32, [bs, 1, seq_len, 256])
    out: tensor(float32, [bs, 1, (past_seq_len + seq_len), 256])
  inputs: [x0, x1]
  outputs: [out]
  computations:
    out: float32[bs, 1, (past_seq_len + seq_len), 256] where out[v, v_1, v_2, v_3] = ((v_2 < past_seq_len) ? x0[v, v_1, v_2, v_3] : x1[v, v_1, (v_2 - past_seq_len), v_3])
  attributes: {}
)
```
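To show how the dump is triggered, here is a minimal sketch. Only the `dump_op` and `dump_outputs` flags come from this PR; the `hidet.graph.forward_context()` entry point and the `output_dir` parameter are assumptions and may differ in your Hidet version:

```python
import hidet

# Build a tiny graph to run under the debug instrument.
x = hidet.symbol([1, 16], dtype='float32', device='cuda')
graph = hidet.trace_from(hidet.ops.relu(x), inputs=[x])

# Enable the per-operator dump while executing the graph.
with hidet.graph.forward_context() as ctx:      # assumed entry point
    ctx.debug(output_dir='./debug', dump_op=True, dump_outputs=True)
    outputs = graph(hidet.randn([1, 16], device='cuda'))

# ./debug should then contain files like 41_Concat_def.txt with the
# new Inputs section (e.g. `0 <- @in:2`, `1 <- @40`).
```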
Is there a way to export the optimized CUDA graph to a runtime library / CUDA kernels and call that runtime library from the host?
If there is, which APIs should I look at to do this?