Add CUDA option to run copy in default stream by ke1337 · Pull Request #5445 · microsoft/onnxruntime

ke1337 · 2020-10-10T05:07:53Z

This change fixes #4829. Thanks @maherzog for providing the repro!

The bug is caused by memory reuse in the BFC arena, which exposes a
race condition between the copy stream and the compute stream in CUDA.

The BFC arena is an arena allocator on top of cudaMalloc/cudaFree that
reduces the cost of synchronizing the CPU and GPU on every alloc/free.
This means that when the CPU allocates or frees memory, the GPU might
not have finished its previous work on that memory, which is what lets
the CPU and GPU run asynchronously.

This is OK if there's only one stream, because the execution order on
the CPU and on the GPU is then consistent. For example, if we have two
kernels A and B, and the CPU runs
allocA->computeA->freeA->allocB->computeB->freeB,
A and B can share the same memory, since computeA and computeB will
not race as long as they run in the same GPU compute stream.

However, if the CPU runs
allocA->copyA->freeA->allocB->computeB->freeB,
the GPU could execute copyA after computeB if copy and compute run in
different GPU streams, even though both use the same reused memory.

This change makes copies run in the default compute stream, and adds
an option to fall back to the previous behavior in case there is a perf
hit. This is a short-term fix until the BFC arena can support multiple
streams.
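
The ordering argument above can be checked with a small toy model. This is only a sketch, not ONNX Runtime code: ToyArena, legal_orders, enqueue_work and can_race are hypothetical names made up for illustration. It assumes that a stream is a FIFO queue, that the GPU may interleave different streams in any order, and that the arena hands a freed block back immediately on the CPU side without waiting for the GPU.

```python
class ToyArena:
    """Hypothetical stand-in for an arena allocator: free() returns the
    block to the free list immediately, without any GPU synchronization."""

    def __init__(self):
        self.free_blocks = []
        self.next_id = 0

    def alloc(self):
        if self.free_blocks:
            return self.free_blocks.pop()  # reuse a freed block
        block = self.next_id
        self.next_id += 1
        return block

    def free(self, block):
        self.free_blocks.append(block)


def legal_orders(streams):
    """All GPU execution orders that keep each stream's FIFO order."""
    streams = [s for s in streams if s]
    if not streams:
        return [[]]
    orders = []
    for i, s in enumerate(streams):
        rest = streams[:i] + [s[1:]] + streams[i + 1:]
        for tail in legal_orders(rest):
            orders.append([s[0]] + tail)
    return orders


def enqueue_work(do_copy_in_default_stream):
    """CPU side: allocA -> copyA -> freeA -> allocB -> computeB -> freeB."""
    arena = ToyArena()
    compute_stream = []
    # With the flag set, copies are queued on the compute stream itself.
    copy_stream = compute_stream if do_copy_in_default_stream else []
    a = arena.alloc()
    copy_stream.append("copyA")
    arena.free(a)
    b = arena.alloc()  # gets the block that A just released
    compute_stream.append("computeB")
    arena.free(b)
    if copy_stream is compute_stream:
        return a, b, [compute_stream]
    return a, b, [compute_stream, copy_stream]


def can_race(do_copy_in_default_stream):
    """True if some legal GPU order runs computeB before copyA on shared memory."""
    a, b, streams = enqueue_work(do_copy_in_default_stream)
    if a != b:
        return False  # different blocks, nothing to race on
    return any(
        order.index("computeB") < order.index("copyA")
        for order in legal_orders(streams)
    )


if __name__ == "__main__":
    print("separate copy stream can race:", can_race(False))
    print("copy in default stream can race:", can_race(True))
```

In this model the separate copy stream admits an order where computeB runs before copyA on the reused block, while queuing the copy on the compute stream leaves only the CPU's own order.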

Users may use the following options to revert to the previous behavior:

C API:
    struct OrtCUDAProviderOptions cudaProviderOpt;
    cudaProviderOpt.do_copy_in_default_stream = false;

C++ API:
    CUDAExecutionProviderInfo cudaEPInfo;
    cudaEPInfo.do_copy_in_default_stream = false;

C# API:
    pending...

Python:
    import onnxruntime
    onnxruntime.capi._pybind_state.set_do_copy_in_default_stream(False)

ke1337 requested a review from a team as a code owner October 10, 2020 05:07

ke1337 commented Oct 10, 2020

Comment thread onnxruntime/test/python/onnxruntime_test_python.py Outdated

KeDengMS added 2 commits October 10, 2020 18:38

Confirmed the test fails in CI when doing copy in a separate stream

25d6fc2

Revert the test to get CI pass now

Fix Windows test

0854872

HectorSVC reviewed Oct 12, 2020

Comment thread onnxruntime/python/onnxruntime_pybind_state.cc Outdated

HectorSVC reviewed Oct 12, 2020

Comment thread onnxruntime/core/providers/cuda/cuda_provider_factory.cc

Address CR

66e09d3

HectorSVC approved these changes Oct 12, 2020

ke1337 merged commit c444b9d into master Oct 13, 2020

ke1337 deleted the kedeng/stream branch October 13, 2020 05:12

ashbhandare mentioned this pull request Oct 22, 2020

Move Memcopy to and from host to default stream. #5342

Closed

hariharans29 mentioned this pull request Oct 30, 2020

Import some global methods in the ORT module's init script #5641

Closed

hariharans29 mentioned this pull request Dec 29, 2020

The same input, sometimes the output is different #5769

Closed

hariharans29 mentioned this pull request Sep 20, 2021

Fix default initialization value in C API header #9126

Merged

ke1337 merged 4 commits into master from kedeng/stream
