Force cudaGraphExec reinstantiation when clusters are used #2813

andportnoy · 2025-11-21T20:32:07Z

Proposed changes

Thread block cluster dimensions are not correctly updated by cudaGraphExecUpdate. Therefore, when clusters are used, we reinstantiate a cudaGraphExec rather than updating it.
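For readers unfamiliar with the two paths, here is a minimal sketch of the policy described above. It is not the code in this PR. `Graph`, `GraphExec`, and `prepare_exec` are hypothetical stand-ins (a real backend would hold `cudaGraph_t` / `cudaGraphExec_t` handles and call the CUDA runtime), and treating a cluster in either the cached executable or the new graph as "clusters are used" is an assumption.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for a captured graph and its instantiated executable.
struct Graph {
  unsigned cluster_dim_x = 1; // 1 means no thread block cluster
};
struct GraphExec {
  unsigned cluster_dim_x = 1;
};

enum class Action { Updated, Reinstantiated };

// Decide how to obtain an executable for `g`, given a possibly empty cache slot.
Action prepare_exec(std::unique_ptr<GraphExec>& cached, const Graph& g) {
  assert(g.cluster_dim_x >= 1);
  // Assumption: a cluster on either side makes an in-place update unsafe.
  bool clusters_involved =
      g.cluster_dim_x > 1 || (cached && cached->cluster_dim_x > 1);
  if (cached && !clusters_involved) {
    // In-place update path (where a real backend would call cudaGraphExecUpdate).
    return Action::Updated;
  }
  // Reinstantiation path (where a real backend would instantiate a new exec).
  cached.reset(new GraphExec{g.cluster_dim_x});
  return Action::Reinstantiated;
}
```

With an empty slot the first call always reinstantiates; later calls take the update path only while no clusters are involved.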

Checklist

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
The change fixes existing tests.
I have updated the necessary documentation (if needed)


awni · 2025-11-22T14:54:23Z

Some benchmarks on B200

Inference with meta-llama/Meta-Llama-3.1-8B

Pre: prompt_tps=59185.146
Post: prompt_tps=59359.110

Pre: generation_tps=282.392
Post: generation_tps=282.671

Pre: toks_per_sec: 97961.0518
Post: toks_per_sec: 97452.2300

awni · 2025-11-22T14:56:14Z

@andportnoy I changed this a bit more than I expected. Basically three changes:

We don't bother caching the graph exec if the graph is not updatable, since there is no point.
The conditions for allowing the update are less conservative (if there is a non-singleton cluster in the x dimension of a subgraph with a single kernel node, we can still update the graph).
To achieve the above, I encoded the cluster x dimension in the graph key, which took a bit of rearranging to make it work nicely.
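A rough sketch of how those three changes could fit together, under stated assumptions. None of these names come from the MLX source: `KernelNode`, `graph_key`, `is_updatable`, and `ExecCache` are hypothetical, the key format is invented for illustration, and the updatability rule is one reading of the second point above (a single-kernel graph may carry a cluster; a multi-node graph may not).

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical description of one kernel node in a captured graph.
struct KernelNode {
  char type_tag;          // e.g. 'K' for a kernel node
  unsigned cluster_dim_x; // 1 means no cluster
};

// Encode node types and the cluster x dimension into the key, so graphs that
// differ only in cluster size never share a cached executable.
std::string graph_key(const std::vector<KernelNode>& nodes) {
  std::string key;
  for (const auto& n : nodes) {
    assert(n.cluster_dim_x >= 1);
    key += n.type_tag;
    key += ':';
    key += std::to_string(n.cluster_dim_x);
    key += ';';
  }
  return key;
}

// Assumed rule: a single-kernel graph is updatable even with a cluster;
// otherwise every node must have a singleton cluster.
bool is_updatable(const std::vector<KernelNode>& nodes) {
  if (nodes.size() == 1) return true;
  for (const auto& n : nodes) {
    if (n.cluster_dim_x > 1) return false;
  }
  return true;
}

// Cache of executables, modeled here as integer handles.
struct ExecCache {
  std::unordered_map<std::string, int> execs;
  int next_id = 0;
  int instantiations = 0;

  int get(const std::vector<KernelNode>& nodes) {
    if (!is_updatable(nodes)) {
      // Not updatable: instantiate fresh and skip the cache entirely.
      ++instantiations;
      return next_id++;
    }
    const std::string key = graph_key(nodes);
    auto it = execs.find(key);
    if (it != execs.end()) return it->second; // reuse, then update in place
    ++instantiations;
    int id = next_id++;
    execs.emplace(key, id);
    return id;
  }
};
```

Because the cluster x dimension is part of the key, a cached exec is only ever updated with a graph whose cluster x dimensions match, which is what makes the relaxed update condition safe in this sketch.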

awni

This resolves outstanding test failures! Thanks for the fix!!


andportnoy and others added 2 commits November 21, 2025 15:22

Force cudaGraphExec reinstantiation when clusters are used

f2f2430

Thread block cluster dimensions are not correctly updated by cudaGraphExecUpdate. Therefore, when clusters are used, we reinstantiate a cudaGraphExec rather than updating it.

update fix for clusters

9886dbf

awni force-pushed the fix-cuda-graphs-update-clusters branch from f54794d to 9886dbf on November 22, 2025 14:15

awni approved these changes Nov 22, 2025


awni merged commit 3e05cea into ml-explore:main Nov 22, 2025
10 checks passed

awni mentioned this pull request Nov 22, 2025

[BUG] [CUDA] Blas tests failing on B200 #2748

Closed

Jckwind pushed a commit to TheProxyCompany/mlx that referenced this pull request Dec 5, 2025

Force cudaGraphExec reinstantiation when clusters are used (ml-explor…

a480c27

…e#2813) Co-authored-by: Awni Hannun <awni@apple.com>

BrewTestBot mentioned this pull request Dec 18, 2025

mlx 0.30.1 Homebrew/homebrew-core#259125

Merged

1 task
