
[build] make builder smarter and configurable wrt compute capabilities + docs #578

Merged
merged 4 commits into microsoft:master from stas00:correct-compute-capabilities on Dec 7, 2020

Conversation

@stas00 (Contributor) commented Dec 5, 2020

This PR fixes:

RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at ...

error, resulting from this build:

DS_BUILD_OPS=1 pip install --no-clean --no-cache -v --disable-pip-version-check -e . 

multi_tensor_adam.cu and fused_lamb_cuda_kernel.cu were being compiled with only:

-gencode=arch=compute_80,code=sm_80

and were missing all the rest:

-gencode arch=compute_60,code=compute_60
-gencode arch=compute_61,code=compute_61
-gencode arch=compute_70,code=compute_70
-gencode arch=compute_80,code=compute_80
-gencode arch=compute_86,code=compute_86

The lone -gencode=arch=compute_80,code=sm_80 comes from CUDAExtension, which currently checks the capability of the 1st card only and assumes that the other cards are the same. Moreover, it clamps the arch down to the minimum supported arch with the same first digit (major version) found in:

python -c "import torch; print(torch.cuda.get_arch_list())" 
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80']

so, for example, sm_86 becomes sm_80.
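
To make the clamping concrete, here is a minimal sketch of the behavior described above (my approximation, not CUDAExtension's literal code) showing how an 8.6 card ends up with the lone sm_80 flag:

import torch

# Approximation of the clamping behavior described above, not
# CUDAExtension's literal code: take device 0's capability; if that exact
# arch is not in torch.cuda.get_arch_list(), fall back to the minimum
# listed arch with the same major version, e.g. 8.6 -> sm_80.
major, minor = torch.cuda.get_device_capability(0)  # e.g. (8, 6)
supported = torch.cuda.get_arch_list()              # e.g. [..., 'sm_75', 'sm_80']
arch = f"sm_{major}{minor}"
if arch not in supported:
    arch = min(a for a in supported if a.startswith(f"sm_{major}"))  # -> 'sm_80'
print(f"-gencode=arch=compute_{arch[3:]},code={arch}")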

I'm pretty sure this behavior is wrong, though, since it's the card with compute_61 that was getting this error, meaning these 2 libs weren't built to support compute_61. This is with pytorch-nightly.

The 2 cards I have right now are:

$ CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
(8, 6)
$ CUDA_VISIBLE_DEVICES=1 python -c "import torch; print(torch.cuda.get_device_capability())"
(6, 1)

Note: A PR (pytorch/pytorch#48891) has been proposed to fix this problem transparently for CUDAExtension users, but even if accepted it won't be available until a future version of pytorch.

The cost of this deepspeed PR is that the build process is now slightly slower, as it now has to build 4-5 arch variants for each of the 2 kernels instead of just the 2 kernels (assuming all features are enabled to be compiled). So perhaps down the road we can fix that by conditioning on the pytorch version and building fewer kernels. Alternatively, you could copy the loop that derives just the required archs: https://github.com/pytorch/pytorch/blob/b8f90d778d4c0739c7c07fe2b2fb0aef5e7c77e7/torch/utils/cpp_extension.py#L1531-L1545


edit: After understanding how CUDAExtension sorts out its archs, I have made further improvements to this PR:

  • by default, build the cuda extensions for all supported archs (see the discussion above)
  • change jit_mode to check the archs of all visible cards, not just the first one (similar to the PR I submitted to pytorch)
  • add support for the env var TORCH_CUDA_ARCH_LIST, in exactly the same manner as CUDAExtension does it. So now I can build deepspeed much faster by specifying only the archs that I need (a sketch of this selection logic follows this list): TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6" DS_BUILD_OPS=1 pip install --no-clean --no-cache -v --disable-pip-version-check -e .
    TORCH_CUDA_ARCH_LIST overrides CUDAOpBuilder's cross_compile_archs arg
  • refactor the code
  • document the CUDAOpBuilder nuances
  • extend the advanced installation tutorial to document the arch-mismatch error, its cause, and how to build the most efficient version of deepspeed for exactly the desired archs
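
For illustration, here is a minimal sketch of the resulting arch-selection logic: honor TORCH_CUDA_ARCH_LIST when set, otherwise enumerate the capabilities of all visible cards. The helper name here is hypothetical, not the actual CUDAOpBuilder API:

import os
import torch

# Sketch only: compute_capability_args is a hypothetical helper, not the
# actual CUDAOpBuilder API.
def compute_capability_args():
    env_list = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if env_list:
        # e.g. TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6"
        ccs = env_list.replace(" ", ";").split(";")
    else:
        # check every visible card, not just device 0
        ccs = []
        for i in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(i)
            cc = f"{major}.{minor}"
            if cc not in ccs:
                ccs.append(cc)
    args = []
    for cc in ccs:
        num = cc.replace(".", "")
        args.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    return args

With TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6" this yields -gencode flags for exactly those three archs, which is what makes the faster build above possible.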

A related PR: #577 proposes support for compute_86, to include the full capabilities of the rtx-30* cards.

This might be related to #95

@ghost commented Dec 5, 2020

CLA assistant check: all CLA requirements met.

@stas00 changed the title from "[build] fix compilation flags to support the full range of compute capabilities" to "[build] make builder smarter and configurable wrt compute capabilities + docs" on Dec 6, 2020
@jeffra (Contributor) left a comment


Thank you for this PR, this is really a great contribution!

@jeffra jeffra merged commit ce363d0 into microsoft:master Dec 7, 2020
@stas00 stas00 deleted the correct-compute-capabilities branch December 7, 2020 21:35