Enable ORT with CUDA 11 toolkit #4168

Merged
weixingzhang merged 8 commits into master from wezhan/cuda11 on Jun 15, 2020

Contributor

@weixingzhang weixingzhang commented Jun 9, 2020

The CUDA 11 toolkit was released on June 5.

This PR makes the following changes:

1. Separate HOROVOD and MPI.
2. Separate NCCL from HOROVOD in CMakeLists.txt.
3. Remove the dependency on external cub for CUDA 11.
4. cudnnSetRNNDescriptor changed in cuDNN 8.0, so the call now uses cudnnSetRNNDescriptor_v6.
5. Disable sm_30 and sm_50 for CUDA 11, since they are deprecated in CUDA 11.

TODO:
Ampere (sm_80) support will be added later, because some CUDA code currently fails to compile for it due to an NVIDIA compiler issue.
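
For readers following along, items 3 and 5 could be gated on the toolkit version roughly as sketched below. This is not the code from this PR: the variable `ORT_CUB_DIR`, its path, and the list of architectures are assumptions made for illustration, and the sketch assumes the CUDA language is already enabled so that `CMAKE_CUDA_COMPILER_VERSION` is set.

```cmake
# Sketch only: ORT_CUB_DIR and the architecture list are illustrative
# assumptions, not the values used in this PR.
set(ORT_CUB_DIR "${PROJECT_SOURCE_DIR}/external/cub")  # hypothetical location

if (CMAKE_CUDA_COMPILER_VERSION VERSION_LESS 11.0)
  # Older toolkits: add the external cub checkout to the include path
  # and keep building for the older targets.
  include_directories(${ORT_CUB_DIR})
  string(APPEND CMAKE_CUDA_FLAGS " -gencode=arch=compute_30,code=sm_30")
  string(APPEND CMAKE_CUDA_FLAGS " -gencode=arch=compute_50,code=sm_50")
endif()

# Targets built regardless of toolkit version (illustrative choice).
string(APPEND CMAKE_CUDA_FLAGS " -gencode=arch=compute_60,code=sm_60")
string(APPEND CMAKE_CUDA_FLAGS " -gencode=arch=compute_70,code=sm_70")
```

With CUDA 11 the external include path is skipped because the toolkit ships its own copy of cub, and sm_80 is left out here to match the TODO above.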

@weixingzhang added the "training" label (issues related to ONNX Runtime training; typically submitted using template) on Jun 9, 2020
@weixingzhang requested a review from a team as a code owner on June 9, 2020 06:56
@weixingzhang added the "core runtime" label (issues related to core runtime) on Jun 9, 2020
Comment thread orttraining/orttraining/core/framework/mpi_setup.cc Outdated
Comment thread cmake/CMakeLists.txt Outdated
Comment thread cmake/CMakeLists.txt Outdated
if (MPI_FOUND)
set(MPI_HEADER_FILE "${MPI_INCLUDE_DIR}/mpi.h")
message( STATUS "Determining MPI version from the header file: ${MPI_HEADER_FILE}" )
file (STRINGS ${MPI_HEADER_FILE} MPI_MAJOR_VERSION_DEFINED
Contributor

I think Horovod used to use this logic to get the version, which proved unreliable because there are multiple MPI implementations out there. We can simply use something like:
execute_process(COMMAND mpirun --version OUTPUT_VARIABLE mpirun_output)
message("mpi version='${mpirun_output}'")
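
A slightly fuller sketch of the same idea is shown below, under a few assumptions: `MPIRUN_EXECUTABLE` and the `mpirun_*` variable names are illustrative, and the launcher is assumed to accept `--version`. In `execute_process`, `RESULT_VARIABLE` receives the exit status, while `OUTPUT_VARIABLE` and `ERROR_VARIABLE` receive the text written to stdout and stderr.

```cmake
# Sketch only: MPIRUN_EXECUTABLE and the mpirun_* names are illustrative.
find_program(MPIRUN_EXECUTABLE mpirun)

if (MPIRUN_EXECUTABLE)
  execute_process(
    COMMAND ${MPIRUN_EXECUTABLE} --version
    RESULT_VARIABLE mpirun_status   # exit status of the command
    OUTPUT_VARIABLE mpirun_output   # text written to stdout
    ERROR_VARIABLE  mpirun_error    # captured in case the banner goes to stderr
    OUTPUT_STRIP_TRAILING_WHITESPACE
    ERROR_STRIP_TRAILING_WHITESPACE)

  if (mpirun_status EQUAL 0)
    message(STATUS "mpi version='${mpirun_output}${mpirun_error}'")
  else()
    message(WARNING "mpirun --version failed with status ${mpirun_status}")
  endif()
else()
  message(STATUS "mpirun not found; skipping MPI version report")
endif()
```

Reporting the launcher's own banner avoids parsing `mpi.h`, whose layout differs between MPI implementations.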

Contributor Author

OK, will update the PR.

CUDNN_RETURN_IF_ERROR(cudnnCreateRNNDescriptor(&cudnn_rnn_desc_));

- CUDNN_RETURN_IF_ERROR(cudnnSetRNNDescriptor(cudnnHandle,
+ CUDNN_RETURN_IF_ERROR(cudnnSetRNNDescriptor_v6(cudnnHandle,
Contributor

@HectorSVC HectorSVC Jun 9, 2020

cudnnSetRNNDescriptor_v6

Should we still support the older cuDNN v7 here?

Contributor Author

@weixingzhang weixingzhang Jun 9, 2020

No. The deprecation policy changed in cuDNN 8. Here are the details: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#backward-compatibility

Comment thread cmake/CMakeLists.txt
@weixingzhang weixingzhang merged commit b4b1c64 into master Jun 15, 2020
@weixingzhang weixingzhang deleted the wezhan/cuda11 branch June 15, 2020 15:47