[Feature] Export optimized cuda graph #130
Hidet supports dumping and loading a model (in the form of a FlowGraph, via its save and load methods). The dumped model stores all the information (e.g., model structure and weights) but leaves the compiled kernels in the hidet cache. Currently, Hidet only has a Python-based runtime to execute a flow graph. We are still working on a VM-based runtime (e.g., TVM and RAF have such a VM runtime). In the future, the VM-based runtime will execute a hidet-compiled model without a Python environment, but for now Python is still needed to run the model. Because we use CUDA graphs, the Python-based runtime has negligible overhead. Let me know if you have any other questions about Hidet.
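For anyone reading along, here is a minimal sketch of the save/load round trip described above. `graph.save` and `FlowGraph.load` follow the comment; `cuda_graph()` and `run()` are assumptions about the CUDA-graph helper and may differ between Hidet versions:

```python
import hidet
from hidet.graph import FlowGraph

# Trace a tiny model into a FlowGraph (symbolic input on CUDA).
x = hidet.symbol([1, 16], dtype='float32', device='cuda')
y = hidet.ops.relu(x)
graph: FlowGraph = hidet.trace_from(y, inputs=[x])

# Dump the model: structure and weights go into the file, while
# compiled kernels stay in the hidet cache (as noted above).
graph.save('model.hidet')

# Load it back and run it through a CUDA graph, so the Python-side
# launch overhead is negligible.
loaded = FlowGraph.load('model.hidet')
cuda_graph = loaded.cuda_graph()   # assumed helper that captures the CUDA graph
outputs = cuda_graph.run([hidet.randn([1, 16], device='cuda')])
```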
Thanks for the clarification, looking forward to the VM-based runtime.
**Context:** I made these changes to help with debugging Gemma. The dump produces many operators, and this makes it easier, for example, to find which operators involve the input IDs / position IDs / KV-cache.

**Summary of changes:**

- Add the missing `dump_op` parameter to `ctx.debug()`
- Dump input indices (e.g. `@23`) in the operator dump
- Prevent `dump_op` and `dump_outputs` from overriding each other in the single-output case

Below is an example `41_Concat_def.txt` taken from my Gemma implementation, which corresponds to concatenating the past keys in the KV-cache with the current keys. The `Inputs` field shows the indices of the operator inputs, which may be another operator's output (`@n`) or a graph input (`@in:n`).

```
Operator: Concat(0: float32(bs, 1, past_seq_len, 256), 1: float32(bs, 1, seq_len, 256), axis=2)
Inputs:
    0 <- @in:2
    1 <- @40
Task: Task(
  name: concat
  parameters:
    x0: tensor(float32, [bs, 1, past_seq_len, 256])
    x1: tensor(float32, [bs, 1, seq_len, 256])
    out: tensor(float32, [bs, 1, (past_seq_len + seq_len), 256])
  inputs: [x0, x1]
  outputs: [out]
  computations:
    out: float32[bs, 1, (past_seq_len + seq_len), 256] where out[v, v_1, v_2, v_3] = ((v_2 < past_seq_len) ? x0[v, v_1, v_2, v_3] : x1[v, v_1, (v_2 - past_seq_len), v_3])
  attributes: {}
)
```
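To show how the dump is triggered, here is a minimal sketch. Only the `dump_op` and `dump_outputs` flags come from this PR; the `hidet.graph.forward_context()` entry point and the `output_dir` parameter are assumptions and may differ in your Hidet version:

```python
import hidet

# Build a tiny graph to run under the debug instrument.
x = hidet.symbol([1, 16], dtype='float32', device='cuda')
graph = hidet.trace_from(hidet.ops.relu(x), inputs=[x])

# Enable the per-operator dump while executing the graph.
with hidet.graph.forward_context() as ctx:      # assumed entry point
    ctx.debug(output_dir='./debug', dump_op=True, dump_outputs=True)
    outputs = graph(hidet.randn([1, 16], device='cuda'))

# ./debug should then contain files like 41_Concat_def.txt with the
# new Inputs section (e.g. `0 <- @in:2`, `1 <- @40`).
```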
Is there a way to export the optimized CUDA graph to a runtime library / CUDA kernels and call that runtime library from the host?
If there is, which APIs should I look at to do this?