Avoid cudaStreamSync at the end of Forward/Backward #9470
Merged
SherlockNoMad merged 2 commits into master on Oct 21, 2021
Conversation
weixingzhang approved these changes on Oct 21, 2021
As ORTModule should match the behavior of nn.Module, we don't need to explicitly introduce a cudaStreamSync at the end of each subgraph execution.
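For context, ORTModule is meant to be a drop-in wrapper around an nn.Module, so a training loop like the minimal sketch below (the import path and toy model are illustrative, not from this PR) should behave the same with or without the wrapper, including relying only on PyTorch's implicit device synchronization:

```python
import torch
from onnxruntime.training.ortmodule import ORTModule  # import path may differ by release

# A toy model; ORTModule wraps it without changing how it is called.
model = torch.nn.Linear(128, 10).cuda()
model = ORTModule(model)

x = torch.randn(32, 128, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

loss = torch.nn.functional.cross_entropy(model(x), target)
loss.backward()
# No explicit torch.cuda.synchronize() here; synchronization happens
# implicitly only when the CPU actually consumes a GPU value.
```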
Additional cudaStreamSync at the end of forward
As shown in the profiling result below, an ORTModule run has an extra "cudaStreamSync" call at the end of the forward section. It was introduced as the finalizing step of InferenceSession::PartialRun(); the behavior was copied from the original InferenceSession::Run() code when we implemented the PartialRun executor.
However, PyTorch automatically introduces a "cudaStreamSync" whenever subsequent CPU computation depends on a GPU tensor. In other words, ORT doesn't need to introduce this call explicitly.
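As an illustration of that implicit synchronization (a standalone snippet, not code from this PR), reading a CUDA tensor's value on the CPU blocks until the compute stream has produced it:

```python
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# The matmul is enqueued asynchronously on the current CUDA stream;
# control returns to Python before the kernel finishes.
c = a @ b

# .item() copies a scalar back to the host, which forces PyTorch to wait
# for the stream, so no explicit cudaStreamSync is needed beforehand.
total = c.sum().item()
print(total)
```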
Warmup Patterns after cudaStreamSync
Zooming in to the time segment following the cudaStreamSync call, we can see a window of ~4 ms in which the GPU is barely utilized. Because the cudaStreamSync call drains the compute stream, the CPU has to refill it from scratch. This results in GPU starvation: the CPU cannot launch kernels fast enough, made worse by the fact that the scheduled kernels complete quickly (<10 us). The starvation is eventually relieved when a larger kernel (>100 us) kicks in, giving the CPU time to catch up with the scheduling work.
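A sketch of how such a launch-bound gap can be observed with torch.profiler (the model, sizes, and iteration count are illustrative, not the profile referenced in this PR):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Many small layers produce many short kernels, which makes launch-bound
# gaps easier to see in the trace.
model = torch.nn.Sequential(
    *[torch.nn.Linear(256, 256) for _ in range(32)]
).cuda()
x = torch.randn(64, 256, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x).sum().backward()

# Gaps between consecutive CUDA kernels in the exported trace indicate the
# GPU waiting on kernel launches from the CPU (the warmup/starvation pattern).
prof.export_chrome_trace("trace.json")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```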