[build] make builder smarter and configurable wrt compute capabilities + docs #578
This PR fixes an error resulting from this build: `multi_tensor_adam.cu` and `fused_lamb_cuda_kernel.cu` were getting only the `-gencode=arch=compute_80,code=sm_80` flag and missing all the rest: `-gencode arch=compute_60,code=compute_60 -gencode arch=compute_61,code=compute_61 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_80,code=compute_80 -gencode arch=compute_86,code=compute_86`.
The lone `-gencode=arch=compute_80,code=sm_80` comes from `CUDAExtension`, which currently checks the compute capability of the 1st card only and assumes that the other cards are the same. Moreover, it clamps the number down to the minimum within the same first digit: so, for example, `sm_86` becomes `sm_80`. I'm pretty sure this is wrong, though, since it's the card with `compute_61` that was getting this error, so these 2 libs weren't built to support `compute_61`. This is with pytorch-nightly.

The 2 cards I have right now are:
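The clamping behavior described above can be sketched roughly as follows. This is an illustration of the behavior as I understand it, not `CUDAExtension`'s actual code; the supported-arch list here is just an example.

```python
# Illustrative list of archs a builder might support (an assumption,
# not CUDAExtension's real table).
SUPPORTED_ARCHS = ["6.0", "6.1", "7.0", "8.0", "8.6"]

def clamp_arch(capability: str) -> str:
    """Clamp a detected capability (e.g. '8.6') down to the minimum
    supported arch sharing the same first digit, so '8.6' -> '8.0'.

    This mimics the clamping described above, where sm_86 becomes sm_80.
    """
    major = capability.split(".")[0]
    same_major = [a for a in SUPPORTED_ARCHS if a.split(".")[0] == major]
    return min(same_major)

# e.g. clamp_arch("8.6") yields "8.0", which is why only
# -gencode=arch=compute_80,code=sm_80 showed up in the flags.
```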
Note: A PR has been proposed to fix this problem transparently for `CUDAExtension` users, but it won't be available until a future version of pytorch, if it's accepted: pytorch/pytorch#48891

The cost of this deepspeed PR is that the build process is now slightly slower, as it has to build 4-5 kernels x 2 instead of just 2 (assuming all features are enabled to be compiled). So perhaps down the road we can fix that by conditioning on the pytorch version and building fewer kernels. Alternatively, you could copy the loop that derives just the required archs: https://github.com/pytorch/pytorch/blob/b8f90d778d4c0739c7c07fe2b2fb0aef5e7c77e7/torch/utils/cpp_extension.py#L1531-L1545
edit: After understanding how `CUDAExtension` sorts out its archs, I have made further improvements to this PR: it now supports `TORCH_CUDA_ARCH_LIST`, in exactly the same manner as `CUDAExtension` does. So now I can build deepspeed much faster by specifying only the archs that I need:

```
TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6" DS_BUILD_OPS=1 pip install --no-clean --no-cache -v --disable-pip-version-check -e .
```
`TORCH_CUDA_ARCH_LIST` overrides the `CUDAOpBuilder`'s `cross_compile_archs` arg.

CUDAOpBuilder nuances:

A related PR: #577 proposed support for compute_86, to include the full capabilities of the rtx-30* cards.
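The override could look something like the following sketch. The `cross_compile_archs` name comes from this PR; the parsing logic and function name are assumptions for illustration, not the actual implementation. The env var uses the same `;`-separated format that `CUDAExtension` accepts, e.g. `"6.1;7.5;8.6"`.

```python
import os

def resolve_archs(cross_compile_archs):
    """Sketch: let TORCH_CUDA_ARCH_LIST override the builder's default
    cross-compile arch list (illustrative, not DeepSpeed's actual code)."""
    env = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if env:
        # e.g. "6.1;7.5;8.6" -> ["6.1", "7.5", "8.6"]
        return [a.strip() for a in env.split(";") if a.strip()]
    return cross_compile_archs

os.environ["TORCH_CUDA_ARCH_LIST"] = "6.1;7.5;8.6"
print(resolve_archs(["6.0", "6.1", "7.0", "8.0"]))
```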
This might be related to #95