Enable ORT with CUDA 11 toolkit by weixingzhang · Pull Request #4168 · microsoft/onnxruntime

weixingzhang · 2020-06-09T06:56:16Z

The CUDA 11 toolkit was released on June 5.

This PR makes the following changes:

1. Separate HOROVOD and MPI.
2. Separate NCCL from HOROVOD in CMakeLists.txt.
3. Remove the dependency on external cub for CUDA 11.
4. Adapt to the cudnnSetRNNDescriptor change in cuDNN 8.0.
5. Disable sm_30 and sm_50 for CUDA 11, since they are deprecated there.

TODO:
Ampere (sm_80) support will be added later, because compiling the CUDA code for it currently fails due to an NVIDIA compiler issue.
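For readers skimming the list, item 5 amounts to a version-dependent filter on the list of target GPU architectures. The following is only an illustrative C++ sketch of that rule; the actual change lives in CMakeLists.txt, and the helper name `FilterArchs` and any baseline architecture list passed to it are hypothetical, not taken from the PR.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not part of ONNX Runtime): given a list of SM
// architectures and the CUDA toolkit major version, drop the architectures
// that this PR disables when building with CUDA 11 (sm_30 and sm_50).
std::vector<int> FilterArchs(std::vector<int> archs, int cuda_major) {
  if (cuda_major >= 11) {
    archs.erase(std::remove_if(archs.begin(), archs.end(),
                               [](int sm) { return sm == 30 || sm == 50; }),
                archs.end());
  }
  // For older toolkits the list is returned unchanged.
  return archs;
}
```

With an assumed baseline of 30, 50, 52, 60 and 70, a CUDA 11 build would keep only 52, 60 and 70, while a CUDA 10 build would keep all five.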

Tixxx · 2020-06-09T17:49:20Z

+    if (MPI_FOUND)
+      set(MPI_HEADER_FILE "${MPI_INCLUDE_DIR}/mpi.h")
+      message( STATUS "Determining MPI version from the header file: ${MPI_HEADER_FILE}" )
+      file (STRINGS ${MPI_HEADER_FILE} MPI_MAJOR_VERSION_DEFINED


I think Horovod used to use this logic to get the version, which proved unreliable because there are multiple MPI implementations out there. We can simply use something like:
# OUTPUT_VARIABLE captures what mpirun prints; RESULT_VARIABLE would only hold the exit code.
execute_process(COMMAND mpirun --version OUTPUT_VARIABLE mpirun_output)
message("mpi version='${mpirun_output}'")

ok, will update the PR.
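To make the reviewer's point about differing MPI implementations concrete, here is a minimal C++ sketch that pulls the first dotted version number out of whatever text `mpirun --version` prints. The function name `ExtractVersion` is hypothetical, and the sample strings in the usage note are assumptions about typical output, not something taken from this PR.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first "major.minor" or "major.minor.patch"
// token found in `text`, or an empty string when no such token exists.
std::string ExtractVersion(const std::string& text) {
  static const std::regex pattern(R"((\d+)\.(\d+)(\.\d+)?)");
  std::smatch match;
  if (std::regex_search(text, match, pattern)) {
    return match.str(0);
  }
  return std::string();
}
```

Assuming an Open MPI style first line such as `mpirun (Open MPI) 4.0.3`, this would yield `4.0.3`; output laid out differently by another implementation is handled by the same search, because it does not depend on a fixed header macro or line position.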

HectorSVC · 2020-06-09T18:15:09Z

      CUDNN_RETURN_IF_ERROR(cudnnCreateRNNDescriptor(&cudnn_rnn_desc_));

-    CUDNN_RETURN_IF_ERROR(cudnnSetRNNDescriptor(cudnnHandle,
+    CUDNN_RETURN_IF_ERROR(cudnnSetRNNDescriptor_v6(cudnnHandle,


Should we still support the older cuDNN v7 here?

No. The deprecation policy changed in cuDNN 8. Here are the details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#backward-compatibility

weixingzhang added the "training" label (issues related to ONNX Runtime training; typically submitted using template) Jun 9, 2020

weixingzhang requested review from HectorSVC, SherlockNoMad, ke1337 and snnn June 9, 2020 06:56

weixingzhang requested a review from a team as a code owner June 9, 2020 06:56

weixingzhang added the "core runtime" label (issues related to core runtime) Jun 9, 2020

Tixxx reviewed Jun 9, 2020

Comment thread orttraining/orttraining/core/framework/mpi_setup.cc Outdated

Tixxx reviewed Jun 9, 2020

Comment thread cmake/CMakeLists.txt Outdated

Tixxx reviewed Jun 9, 2020

HectorSVC reviewed Jun 9, 2020

snnn reviewed Jun 9, 2020

Comment thread cmake/CMakeLists.txt

weixingzhang added 8 commits June 12, 2020 17:54

ORT on CUDA 11

ddb2a12

1. Separate HOROVOD and MPI 2. Separate NCCL from HOROVOD in CMakeLists.txt 3. Remove dependency on external cub 4. cudnnSetRNNDescriptor is changed in cuDNN 8.0

polish the code about MPI/NCCL in CMakeLists.txt and build.py

9589196

check CUDA version

2d1a9d0

${MPI_INCLUDE_DIRS} should be PUBLIC

0a97ad1

sm30, sm50 are deprecated in CUDA 11 Toolkit

8a60331

update change based on code review feedback.

f005c1f

add sm_52

9d7d254

improve MPI/NCCL build path

fe6b081

weixingzhang force-pushed the wezhan/cuda11 branch from 5804660 to fe6b081 Compare June 12, 2020 17:54

snnn approved these changes Jun 15, 2020

Tixxx approved these changes Jun 15, 2020

weixingzhang merged commit b4b1c64 into master Jun 15, 2020

weixingzhang deleted the wezhan/cuda11 branch June 15, 2020 15:47

manashgoswami mentioned this pull request Sep 29, 2020

Release supporting CUDA 11 #5267

Open
