[build] make builder smarter and configurable wrt compute capabilities + docs #578
This PR fixes an error resulting from this build: `multi_tensor_adam.cu` and `fused_lamb_cuda_kernel.cu` were getting only the `-gencode=arch=compute_80,code=sm_80` flag and missing all the rest: `-gencode arch=compute_60,code=compute_60 -gencode arch=compute_61,code=compute_61 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_80,code=compute_80 -gencode arch=compute_86,code=compute_86`.
The lone `-gencode=arch=compute_80,code=sm_80` comes from `CUDAExtension`, which currently checks the compute capability of the 1st card only and assumes that the other cards are the same. Moreover, it clamps the number down to the minimum within the same first digit: so, for example, `sm_86` becomes `sm_80`. I'm pretty sure this is wrong, though, since it's the card with `compute_61` that was getting this error, so these 2 libs weren't built to support `compute_61`. This is with pytorch-nightly.

The 2 cards I have right now are:
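The clamping behavior described above can be sketched roughly as follows. This is an illustration of the behavior as I understand it, not `CUDAExtension`'s actual code; the supported-arch list here is just an example.

```python
# Illustrative list of archs a builder might support (an assumption,
# not CUDAExtension's real table).
SUPPORTED_ARCHS = ["6.0", "6.1", "7.0", "8.0", "8.6"]

def clamp_arch(capability: str) -> str:
    """Clamp a detected capability (e.g. '8.6') down to the minimum
    supported arch sharing the same first digit, so '8.6' -> '8.0'.

    This mimics the clamping described above, where sm_86 becomes sm_80.
    """
    major = capability.split(".")[0]
    same_major = [a for a in SUPPORTED_ARCHS if a.split(".")[0] == major]
    return min(same_major)

# e.g. clamp_arch("8.6") yields "8.0", which is why only
# -gencode=arch=compute_80,code=sm_80 showed up in the flags.
```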
Note: A PR has been proposed to fix this problem transparently for `CUDAExtension` users, but it won't be available until a future version of pytorch, if it's accepted: pytorch/pytorch#48891

The cost of this deepspeed PR is that the build process is now slightly slower, as it has to build 4-5 kernels x 2 instead of just 2 (assuming all features are enabled to be compiled). So perhaps down the road we can fix that by conditioning on the pytorch version and building fewer kernels. Alternatively, you could copy the loop that derives just the required archs: https://github.com/pytorch/pytorch/blob/b8f90d778d4c0739c7c07fe2b2fb0aef5e7c77e7/torch/utils/cpp_extension.py#L1531-L1545
edit: After understanding how `CUDAExtension` sorts out its archs, I have made further improvements to this PR: it now supports `TORCH_CUDA_ARCH_LIST`, in exactly the same manner as `CUDAExtension` does. So now I can build deepspeed much faster by specifying only the archs that I need:

```
TORCH_CUDA_ARCH_LIST="6.1;7.5;8.6" DS_BUILD_OPS=1 pip install --no-clean --no-cache -v --disable-pip-version-check -e .
```
`TORCH_CUDA_ARCH_LIST` overrides the `CUDAOpBuilder`'s `cross_compile_archs` arg.

CUDAOpBuilder nuances:

A related PR: #577 proposed support for compute_86, to include the full capabilities of the rtx-30* cards.
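The override could look something like the following sketch. The `cross_compile_archs` name comes from this PR; the parsing logic and function name are assumptions for illustration, not the actual implementation. The env var uses the same `;`-separated format that `CUDAExtension` accepts, e.g. `"6.1;7.5;8.6"`.

```python
import os

def resolve_archs(cross_compile_archs):
    """Sketch: let TORCH_CUDA_ARCH_LIST override the builder's default
    cross-compile arch list (illustrative, not DeepSpeed's actual code)."""
    env = os.environ.get("TORCH_CUDA_ARCH_LIST")
    if env:
        # e.g. "6.1;7.5;8.6" -> ["6.1", "7.5", "8.6"]
        return [a.strip() for a in env.split(";") if a.strip()]
    return cross_compile_archs

os.environ["TORCH_CUDA_ARCH_LIST"] = "6.1;7.5;8.6"
print(resolve_archs(["6.0", "6.1", "7.0", "8.0"]))
```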
This might be related to #95