
Implement mutex-free spin lock for task queue #14834

Merged · 19 commits into main from rashuai/SpinLock · May 19, 2023

Conversation

RandySheriffH
Contributor

@RandySheriffH RandySheriffH commented Feb 27, 2023

Implemented "lock-free" spinlock to save CPU usage on context switching.
The change has been tested on queene service of Ads team, the lock-free version of ort (40 threads) saves CPU usage on gen8 (128 logical processors on 8 numa nodes) windows by nearly half, from 65% to 35%.

On 32 cores, the latency curve is flat:

Anubis, 32 vCPU, Windows, Hugging Face models,
95th-percentile E2E latency in ms:

model | mutex (ms) | mutex-free (ms)
--- | --- | ---
alvert_base_v2 | 34.21 | 34.09
bert_large_uncased | 116.27 | 117.84
bart_base | 72.06 | 71.99
distilgpt2 | 25.43 | 25.02
vit_base_patch16_224 | 37.33 | 37.76

Anubis, 32 vCPU, Linux, 1st-party models,
95th-percentile E2E latency in ms:

model | mutex (ms) | mutex-free (ms)
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds | 36.96 | 36.48
deep_writes | 14.46 | 14.32
keypoints | 9.34 | 7.69
model11 | 1.71 | 1.66
model12 | 1.82 | 1.44
model2 | 4.21 | 3.95
model6 | 1.08 | 1.05
agiencoder | 0.99 | 0.93
geminet_transformer | 5.32 | 5.24
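
For readers unfamiliar with the pattern, here is a minimal sketch of a test-and-test-and-set spinlock along the lines this PR describes. This is an assumed shape for illustration only, not the actual code in include/onnxruntime/core/platform/ort_mutex.h:

```cpp
#include <atomic>
#include <thread>

class SpinLock {
 public:
  void lock() noexcept {
    for (;;) {
      // Try to take the lock: acquire ordering ensures the critical
      // section cannot be reordered before the acquisition.
      if (!flag_.exchange(true, std::memory_order_acquire)) {
        return;
      }
      // Spin on plain loads first, to avoid hammering the cache line
      // with writes while another thread holds the lock.
      while (flag_.load(std::memory_order_relaxed)) {
        std::this_thread::yield();  // be polite if CPUs are over-subscribed
      }
    }
  }

  void unlock() noexcept {
    // Release ordering publishes the critical section's writes.
    flag_.store(false, std::memory_order_release);
  }

 private:
  std::atomic<bool> flag_{false};
};
```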

@RandySheriffH RandySheriffH marked this pull request as ready for review March 16, 2023 22:05
@RandySheriffH RandySheriffH changed the title from "Implement lock-free mutex for task queue for main" to "Implement mutex-free spin lock for task queue" Mar 16, 2023
Three resolved (now outdated) review threads on include/onnxruntime/core/platform/ort_mutex.h.
@yuslepukhin
Member

yuslepukhin commented Apr 17, 2023

I have not looked into this in depth. However, spin locks that never go to sleep do not do well in all scenarios; case in point, our spinning in the threadpools. If you just want to increase the spin count, there are official ways of doing so with CriticalSection on Windows and pthread mutexes. The Windows read/write lock delivers great performance and may offer further options.

Echoing Pranav's point about memory_order: it is hard to get right. I recommend watching the relevant talk from CppCon.
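
For reference, a sketch of those "official ways" to raise a lock's spin count. The spin count of 4000 is an illustrative value, and PTHREAD_MUTEX_ADAPTIVE_NP is a glibc-specific extension:

```cpp
#ifdef _WIN32
#include <windows.h>

CRITICAL_SECTION g_cs;

void init_lock() {
  // Spin up to 4000 iterations in user mode before falling back
  // to a kernel wait.
  InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}
#else
#include <pthread.h>

pthread_mutex_t g_mu;

void init_lock() {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  // glibc extension: spin briefly in user mode before blocking.
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
  pthread_mutex_init(&g_mu, &attr);
}
#endif
```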

@pranavsharma pranavsharma requested a review from tlh20 April 17, 2023 23:03
@tlh20
Contributor

tlh20 commented Apr 26, 2023

Thanks for the profiling and exploration here! I believe the spinlock implementation is correct. I share Dmitri's and Pranav's concerns on generality, though. The risk is that a spinlock can look good in standalone microbenchmarks where the number of threads >= the number of CPUs, and then performance can collapse in a more complex setting where the CPU is over-subscribed: if a lock holder is preempted, the other threads spin while holding on to CPUs. For that reason, mutex libraries tend to implement a spin-then-block approach, so that the fast path of lock acquire/release stays in user mode and kernel work is only incurred after some spinning delay.
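
A minimal sketch of that spin-then-block pattern. This is hypothetical: std::mutex as the blocking fallback and the 64-iteration bound are assumptions for illustration, not how any particular library does it:

```cpp
#include <mutex>
#include <thread>

class SpinThenBlockLock {
 public:
  void lock() {
    // Fast path: stay in user mode for a bounded number of attempts.
    for (int i = 0; i < 64; ++i) {
      if (mu_.try_lock()) return;
      std::this_thread::yield();
    }
    // Slow path: block in the kernel so a preempted holder can run.
    mu_.lock();
  }

  void unlock() { mu_.unlock(); }

 private:
  std::mutex mu_;
};
```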

For Linux, we use nsync, which IIRC provides that kind of behavior.

For Windows, it sounds from Dmitri's comment like we may be able to configure the existing lock to perform better?

For the original 8-NUMA-node workload, do we know how other configurations behave? For instance, running 8 * 1-NUMA-node ORT instances, each pinned to a separate socket? Also, if we are running concurrent inference requests over a single ORT instance for the whole 8-NUMA machine, is the workload configured with a limit on the maximum degree of parallelism for each request? Very few operators scale cross-NUMA up to 128 threads... so I am worried we may be adding overhead on work distribution and on the locks here without much benefit.

@RandySheriffH RandySheriffH requested a review from a team as a code owner May 17, 2023 21:38
@RandySheriffH RandySheriffH self-assigned this May 17, 2023
@snnn
Member

snnn commented May 17, 2023

To me, this is a naive implementation of TAS or TTAS: https://disco.ethz.ch/courses/hs15/distsys/lecture/chapter11.pdf. In general, these are not any better than modern queue locks, except that they are easy to understand and implement. I think they are not good for general use cases; they only make sense in some special scenarios.
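
For illustration, the TAS and TTAS acquire loops from those lecture notes look roughly like this (a sketch; `flag` is an assumed shared atomic, not ORT code):

```cpp
#include <atomic>

std::atomic<bool> flag{false};

// TAS: every iteration is a write (exchange), so all waiters keep
// ping-ponging the cache line between cores.
void tas_lock() {
  while (flag.exchange(true, std::memory_order_acquire)) {
  }
}

// TTAS: spin on local reads and attempt the write only when the lock
// looks free, keeping the cache line shared while waiting.
void ttas_lock() {
  while (flag.exchange(true, std::memory_order_acquire)) {
    while (flag.load(std::memory_order_relaxed)) {
    }
  }
}

void unlock() {
  flag.store(false, std::memory_order_release);
}
```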

@snnn snnn added the triage:approved (Approved for cherrypicks for release) label May 18, 2023
@RandySheriffH RandySheriffH merged commit 4dfb89b into main May 19, 2023
@RandySheriffH RandySheriffH deleted the rashuai/SpinLock branch May 19, 2023 17:12
snnn pushed a commit that referenced this pull request May 19, 2023
Implemented "lock-free" spinlock to save CPU usage on context switching.
The change has been tested on queene service of Ads team, the lock-free
version of ort (40 threads) saves CPU usage on gen8 (128 logical
processors on 8 numa nodes) windows by nearly half, from 65% to 35%.

For 32 cores, the curve is flat:

Anubis, 32 vCPU, windows, hugging face models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
 alvert_base_v2 | 34.21 | 34.09
 bert_large_uncased | 116.27| 117.84
 bart_base | 72.06 | 71.99
 distilgpt2 | 25.43 | 25.02
 vit_base_patch16_224 | 37.33 | 37.76

Anubis, 32 vCPU win, Linux, 1st party models,
95 percentile E2E latency in ms:

model | mutex(ms) | mutex-free
--- | --- | ---
deepthink_v2 | 24.35 | 22.95
bing_feeds |  36.96 | 36.48
deep_writes |  14.46 | 14.32
keypoints |  9.34 | 7.69
model11 |  1.71 | 1.66
model12 |  1.82 | 1.44
model2 |  4.21 | 3.95
model6 |  1.08 | 1.05
agiencoder |  0.99 | 0.93
geminet_transformer |  5.32 | 5.24

---------

Co-authored-by: Randy Shuai <rashuai@microsoft.com>
snnn added a commit that referenced this pull request May 20, 2023
### Description

Cherry-pick 4 commits to rel-1.15.0 branch

#14834 
#15727
#16010
#16011
@snnn snnn removed the triage:approved and release:1.15 labels May 22, 2023
preetha-intel pushed a commit to intel/onnxruntime that referenced this pull request Jun 7, 2023