TensorCircuit-NG vs cuQuantum on H200: JIT compilation beats the "magic GPU library" assumption #118
refraction-ray
started this conversation in
Show and tell
NVIDIA cuQuantum has a strong reputation as the natural high-performance baseline for GPU quantum simulation. That reputation is understandable: cuQuantum contains serious low-level GPU libraries such as cuStateVec and cuTensorNet, and NVIDIA itself builds both the GPUs and CUDA.
But in an end-to-end differentiable VQE workload, the result is more nuanced. On our H200 GPU benchmark, TensorCircuit-NG was substantially faster after compilation, while also offering a much higher-level and user-friendly programming model.
The short version:
Benchmark setup
We used the same workload as in the script for the 1D TFIM VQE task:
Hardware and software:
1.6.00.7.226.3.214.1.12.11.0+cu128
We measured one warmup/compile call and then the mean of five later value-and-gradient calls.
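The measurement protocol described here (one warmup/compile call, then the mean of five later calls) can be sketched as a small timing helper. This is a minimal illustration, not the benchmark script itself: the function name `time_first_and_repeated` and the `sync` parameter are hypothetical. The `sync` hook matters for asynchronous backends; with JAX, for example, one would typically pass `jax.block_until_ready` so that the timer waits for the GPU work to finish.

```python
import time
from statistics import mean


def time_first_and_repeated(fn, *args, repeats=5, sync=lambda out: out):
    """Time fn(*args) once (warmup/compile), then `repeats` more times.

    Returns (first_call_seconds, mean_later_seconds).
    `sync` is a hypothetical hook that blocks until the result is ready;
    the default is a no-op, which is only correct for synchronous functions.
    """
    # First call: includes tracing/compilation for JIT-based frameworks.
    start = time.perf_counter()
    sync(fn(*args))
    first = time.perf_counter() - start

    # Later calls: steady-state cost of one value-and-gradient evaluation.
    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        sync(fn(*args))
        later.append(time.perf_counter() - start)
    return first, mean(later)


if __name__ == "__main__":
    # Stand-in workload; a real run would pass the value-and-gradient function.
    first, steady = time_first_and_repeated(sum, range(100_000))
    print(f"first call: {first:.6f} s, mean of later calls: {steady:.6f} s")
```

Reporting the two numbers separately keeps compile cost and steady-state cost from being averaged together.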
Implementations compared
We tested two TensorCircuit-NG modes:
One of them uses scan over VQE layers to reduce JAX compilation/staging time; the other leaves the layers unrolled.
We also tested two direct cuQuantum routes:
The cuTensorNet path is intentionally not the obviously bad version where every Pauli term gets a separate tensor-network path search. We first tried that more "TN-native" observable-contraction style, but for this workload it spent too much time in repeated graph/path overhead. The final version is closer to the state-vector expectation workflow used by the TensorCircuit-NG and MindQuantum benchmark.
Repeated value-and-gradient runtime
The table below reports the post-warmup runtime. This is the relevant metric for VQE-style optimization, where the same circuit structure is evaluated many times.
In repeated value-and-gradient calls, TensorCircuit-NG is faster than cuStateVec:
The gap is much larger against the cuTensorNet route for this particular state-vector expectation plus autograd workflow:
These numbers are the main point: cuQuantum is not a magic speed button. A library being close to CUDA, or being written by a GPU vendor, does not automatically make it the fastest end-to-end implementation for a differentiable quantum algorithm.
First-call cost and amortization
cuQuantum has much lower first-call overhead. This is expected: TensorCircuit-NG uses JAX JIT compilation, and that first call can be expensive.
So if the task is a single one-off circuit evaluation, cuQuantum's low startup cost is attractive. But VQE is usually not a one-off workload. It repeatedly evaluates the same circuit structure for many optimizer steps and often across multiple random initializations. In that regime, TensorCircuit-NG's first-call cost is easily amortized, and the much faster post-compilation runtime becomes the dominant factor.
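The amortization argument can be made concrete with a break-even count: the number of post-warmup calls after which the implementation with the heavier first call has the lower total time. The sketch below is only an illustration of that arithmetic. The function name `break_even_calls` is hypothetical, and the numbers in the usage example are made-up placeholders, not measurements from this benchmark.

```python
import math


def break_even_calls(first_a, step_a, first_b, step_b):
    """Smallest n with first_a + n*step_a <= first_b + n*step_b.

    A is the implementation with the expensive first call (e.g. JIT compiled),
    B the one with the cheap first call. All times are in seconds.
    Returns None if A never catches up.
    """
    if first_a <= first_b:
        return 0  # A is not behind after the first call
    if step_a >= step_b:
        return None  # A is slower per step too, so it never catches up
    return math.ceil((first_a - first_b) / (step_b - step_a))


if __name__ == "__main__":
    # Placeholder timings (NOT benchmark results): A compiles for 65 s and
    # then takes 0.25 s per step; B starts in 1 s and takes 0.75 s per step.
    print(break_even_calls(65.0, 0.25, 1.0, 0.75))  # -> 128
```

With a few hundred optimizer steps per run and several random initializations, a break-even point of this order is passed early in the optimization.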
There is also a useful TensorCircuit-NG tradeoff:
At 24 qubits, unrolled TensorCircuit-NG is about 2.50x faster than scan mode after compilation, but the first call is about 9x heavier.
Programming model
Performance is only half of the story. The programming model matters.
In TensorCircuit-NG, the benchmark is expressed as circuit code:
With direct cuQuantum, the user has to manually manage much lower-level details:
cuQuantum is valuable, but it is closer to a low-level engine than a high-level quantum algorithm framework. For a researcher, that difference is very real.
Takeaway
This benchmark does not prove that cuQuantum is slow for every task. What this benchmark does show is narrower and more practical:
For this VQE workload, direct cuQuantum was not the fastest end-to-end route. TensorCircuit-NG provided a much simpler programming interface and substantially faster repeated value-and-gradient evaluations after JAX compilation.
The common assumption that "NVIDIA controls CUDA, therefore cuQuantum must be the fastest implementation" is too simplistic. Raw GPU kernels matter, but so do JIT compilation, autodiff integration, graph-level optimization, and the abstraction level exposed to users.
TensorCircuit-NG's advantage is that it lets users write concise quantum-program code while still compiling to high-performance backend-native tensor programs. For repeated VQE-style workloads, that combination can beat direct cuQuantum both in usability and in runtime.