This repository was archived by the owner on Jan 12, 2024. It is now read-only.
By default (at least, on Windows 10) OpenMP seems to be configured to use spin-locks on the worker threads. This can make high-throughput workloads faster, since the workers are always ready, but it can also waste CPU cycles spinning while waiting for real work, with no benefit to the wall-time performance of the scenario. Profiling the QML benchmarks (see attached) at 20 qubits shows the worker threads spending about two thirds of their time in SwitchToThread, wasting power and hogging CPU resources on the machine.
We should investigate what kinds of workloads are typical and consider setting OMP_WAIT_POLICY=passive. Another way to tackle this is to understand why we are not achieving the desired load on the worker threads. I ran the benchmark on 8 threads (rather than the 16 the simulator allocates by default on my 16-core/32-thread machine); there was a small regression in wall time per gate with a slightly increased load per thread, but it was still below full capacity.
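A quick way to experiment with this locally (a sketch for Windows cmd; the benchmark binary name below is a placeholder, not the actual executable):

```shell
REM OMP_WAIT_POLICY=passive makes idle OpenMP workers block in the OS
REM instead of spin-waiting (the SwitchToThread time seen in the profile).
set OMP_WAIT_POLICY=passive
REM Optionally fix the team size explicitly, e.g. to compare 8 vs 16 threads:
set OMP_NUM_THREADS=16
qml_benchmark.exe
```

Both variables are read when the OpenMP runtime initializes, so they must be set before the process starts; whether passive waiting costs throughput on our workloads is exactly what the profiling should tell us.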
Note 1: discussed with @thomashaener; he suggested profiling at 24+ qubits, since at 20 qubits the problem might still be too small to load all 16 threads.
Note 2: we should also look into profiling cache accesses.
QML_benchmark.zip