This repository was archived by the owner on Jan 12, 2024. It is now read-only.
By default (at least, on Windows 10) OpenMP seems to be configured to use spin-locks on the worker threads. This can make high-throughput workloads faster, since the workers are always ready, but it can also waste CPU cycles spinning while waiting for real work, with no benefit to the wall-time performance of the scenario. Profiling the QML benchmarks (see attached) at 20 qubits shows the worker threads spending about two thirds of their time in SwitchToThread, wasting power and hogging CPU resources on the machine.
We should investigate what kinds of workloads are typical and consider setting OMP_WAIT_POLICY=passive. Another way to tackle this is to understand why we are not achieving the desired load on the worker threads. I ran the benchmark on 8 threads (rather than the 16 the simulator allocates by default on my 16-core/32-thread machine); there was a small regression in wall time per gate with a slightly increased load per thread, but it was still below full capacity.
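A quick way to experiment with this locally (a sketch for Windows cmd; the benchmark binary name below is a placeholder, not the actual executable):

```shell
REM OMP_WAIT_POLICY=passive makes idle OpenMP workers block in the OS
REM instead of spin-waiting (the SwitchToThread time seen in the profile).
set OMP_WAIT_POLICY=passive
REM Optionally fix the team size explicitly, e.g. to compare 8 vs 16 threads:
set OMP_NUM_THREADS=16
qml_benchmark.exe
```

Both variables are read when the OpenMP runtime initializes, so they must be set before the process starts; whether passive waiting costs throughput on our workloads is exactly what the profiling should tell us.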
Note 1: discussed with @thomashaener; he suggested profiling at 24+ qubits, since at 20 qubits the problem might still be too small to load all 16 threads.
Note 2: we should also look into profiling cache accesses.
QML_benchmark.zip