
Allow users to use mis-matched CUDA versions #3436

Merged · 2 commits · May 3, 2023

Conversation

mrwyattii (Contributor):

We currently enforce that the major version of the installed CUDA toolkit matches the major version of the CUDA that torch was compiled with. However, in some cases a mismatch between major versions is OK. Rather than iterating over and testing all possible mismatches, I'm adding an environment variable that can be set to skip this check and allow compilation of our kernels: DS_SKIP_CUDA_CHECK=1.

print(
    f"{WARNING} DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} does not match the "
    f"version torch was compiled with {torch.version.cuda}. "
    "Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior."
)
Contributor:

Is it worth printing the versions that mismatch even when the check is skipped?

mrwyattii (Contributor, Author):

I think it will still be useful to indicate the mismatched versions, especially if users run into errors from that mismatch!

Contributor:

Oh, I meant: should the versions be added to the warning string for the case where they're overriding the check? I agree they belong in the other message too.

FarzanT (Contributor) commented May 26, 2023:

Hi,
I don't think this has completely resolved the issue. I have cloned the repo, and this is the command I'm using to install deepspeed via pip while selecting certain ops to pre-compile:

(autopath) ftaj@ucn103-53:~/Github/DeepSpeed$ DS_SKIP_CUDA_CHECK=1 DCMAKE_CUDA_STANDARD=14 TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_FUSED_LAMB=1 DS_BUILD_QUANTIZER=1 DS_BUILD_RANDOM_LTD=1 DS_BUILD_TRANSFORMER=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_SPATIAL_INFERENCE=1 DS_BUILD_TRANSFORMER_INFERENCE=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=0 pip install . --global-option="build_ext" --global-option="-j8"

Even though I am getting the warning:

[WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.

I still get an error:

  RuntimeError:
  The detected CUDA version (12.1) mismatches the version that was used to compile
  PyTorch (11.8). Please make sure to use the same CUDA versions.

  [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

Encountered error while trying to install package.
Full stack trace:
(autopath) ftaj@ucn103-53:~/Github/DeepSpeed$ DS_SKIP_CUDA_CHECK=1 DCMAKE_CUDA_STANDARD=14 TORCH_CUDA_ARCH_LIST="7.0" DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_FUSED_LAMB=1 DS_BUILD_QUANTIZER=1 DS_BUILD_RANDOM_LTD=1 DS_BUILD_TRANSFORMER=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_SPATIAL_INFERENCE=1 DS_BUILD_TRANSFORMER_INFERENCE=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=0 pip install . --global-option="build_ext" --global-option="-j8"
WARNING: Ignoring invalid distribution -orch (/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages)
WARNING: Implying --no-binary=:all: due to the presence of --build-option / --global-option / --install-option. Consider using --config-settings for more flexibility.
DEPRECATION: --no-binary currently disables reading from the cache of locally built wheels. In the future --no-binary will not influence the wheel cache. pip 23.1 will enforce this behaviour change. A possible replacement is to use the --no-cache-dir option. You can use the flag --use-feature=no-binary-enable-wheel-cache to test the upcoming behaviour. Discussion can be found at https://github.com/pypa/pip/issues/11453
WARNING: Ignoring invalid distribution -orch (/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages)
Processing /u/ftaj/Github/DeepSpeed
  Preparing metadata (setup.py) ... done
Requirement already satisfied: hjson in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (3.1.0)
Requirement already satisfied: ninja in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (1.11.1)
Requirement already satisfied: numpy in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (1.24.3)
Requirement already satisfied: packaging>=20.0 in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (23.1)
Requirement already satisfied: psutil in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (5.9.5)
Requirement already satisfied: py-cpuinfo in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (9.0.0)
Requirement already satisfied: pydantic<2.0.0 in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (1.10.8)
Requirement already satisfied: torch in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (2.0.1+cu118)
Requirement already satisfied: tqdm in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from deepspeed==0.9.3+0411a9f8) (4.65.0)
Requirement already satisfied: typing-extensions>=4.2.0 in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from pydantic<2.0.0->deepspeed==0.9.3+0411a9f8) (4.5.0)
Requirement already satisfied: jinja2 in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from torch->deepspeed==0.9.3+0411a9f8) (3.1.2)
Requirement already satisfied: filelock in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from torch->deepspeed==0.9.3+0411a9f8) (3.9.0)
Requirement already satisfied: networkx in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from torch->deepspeed==0.9.3+0411a9f8) (2.8.4)
Requirement already satisfied: triton==2.0.0 in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from torch->deepspeed==0.9.3+0411a9f8) (2.0.0)
Requirement already satisfied: sympy in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from torch->deepspeed==0.9.3+0411a9f8) (1.11.1)
Requirement already satisfied: lit in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from triton==2.0.0->torch->deepspeed==0.9.3+0411a9f8) (15.0.7)
Requirement already satisfied: cmake in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from triton==2.0.0->torch->deepspeed==0.9.3+0411a9f8) (3.25.0)
Requirement already satisfied: MarkupSafe>=2.0 in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages (from jinja2->torch->deepspeed==0.9.3+0411a9f8) (2.1.1)
Requirement already satisfied: mpmath>=0.19 in /u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/mpmath-1.2.1-py3.10.egg (from sympy->torch->deepspeed==0.9.3+0411a9f8) (1.2.1)
WARNING: Ignoring invalid distribution -orch (/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages)
Installing collected packages: deepspeed
  DEPRECATION: deepspeed is being installed using the legacy 'setup.py install' method, because the '--no-binary' option was enabled for it and this currently disables local wheel building for projects that don't have a 'pyproject.toml' file. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/11451
  Running setup.py install for deepspeed ... error
  error: subprocess-exited-with-error

  Running setup.py install for deepspeed did not run successfully.
  exit code: 1

  [53 lines of output]
  Setting ds_accelerator to cuda (auto detect)
  Setting ds_accelerator to cuda (auto detect)
  DS_BUILD_OPS=0
   [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
   [WARNING]  async_io: please install the libaio-dev package with apt
   [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
   [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
   [WARNING]  DeepSpeed Op Builder: Installed CUDA version 12.1 does not match the version torch was compiled with 11.8.Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior.
  Install Ops={'async_io': False, 'cpu_adagrad': False, 'cpu_adam': 1, 'fused_adam': 1, 'fused_lamb': 1, 'quantizer': 1, 'random_ltd': 1, 'sparse_attn': False, 'spatial_inference': 1, 'transformer': 1, 'stochastic_transformer': 1, 'transformer_inference': 1, 'utils': 1}
  version=0.9.3+0411a9f8, git_hash=0411a9f8, git_branch=master
  install_requires=['hjson', 'ninja', 'numpy', 'packaging>=20.0', 'psutil', 'py-cpuinfo', 'pydantic<2.0.0', 'torch', 'tqdm']
  compatible_ops={'async_io': False, 'cpu_adagrad': True, 'cpu_adam': True, 'fused_adam': True, 'fused_lamb': True, 'quantizer': True, 'random_ltd': True, 'sparse_attn': False, 'spatial_inference': True, 'transformer': True, 'stochastic_transformer': True, 'transformer_inference': True, 'utils': True}
  ext_modules=[<setuptools.extension.Extension('deepspeed.ops.adam.cpu_adam_op') at 0x7f1cb7f9e8c0>, <setuptools.extension.Extension('deepspeed.ops.adam.fused_adam_op') at 0x7f1c033c0a60>, <setuptools.extension.Extension('deepspeed.ops.lamb.fused_lamb_op') at 0x7f1c033c1930>, <setuptools.extension.Extension('deepspeed.ops.quantizer.quantizer_op') at 0x7f1c033c0df0>, <setuptools.extension.Extension('deepspeed.ops.random_ltd_op') at 0x7f1c033c0dc0>, <setuptools.extension.Extension('deepspeed.ops.spatial.spatial_inference_op') at 0x7f1c035879d0>, <setuptools.extension.Extension('deepspeed.ops.transformer.transformer_op') at 0x7f1c03587fa0>, <setuptools.extension.Extension('deepspeed.ops.transformer.stochastic_transformer_op') at 0x7f1c03587760>, <setuptools.extension.Extension('deepspeed.ops.transformer.inference.transformer_inference_op') at 0x7f1c0339f460>, <setuptools.extension.Extension('deepspeed.ops.utils_op') at 0x7f1c0339f490>]
  running build_ext
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/u/ftaj/Github/DeepSpeed/setup.py", line 277, in <module>
      setup(name='deepspeed',
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/__init__.py", line 87, in setup
      return distutils.core.setup(**attrs)
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 185, in setup
      return run_commands(dist)
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
      dist.run_commands()
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
      self.run_command(cmd)
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/dist.py", line 1208, in run_command
      super().run_command(command)
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
      cmd_obj.run()
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/setuptools/_distutils/command/build_ext.py", line 346, in run
      self.build_extensions()
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 499, in build_extensions
      _check_cuda_version(compiler_name, compiler_version)
    File "/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 387, in _check_cuda_version
      raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
  RuntimeError:
  The detected CUDA version (12.1) mismatches the version that was used to compile
  PyTorch (11.8). Please make sure to use the same CUDA versions.

  [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

Encountered error while trying to install package.

deepspeed

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.
WARNING: Ignoring invalid distribution -orch (/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages)
WARNING: Ignoring invalid distribution -orch (/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages)
WARNING: Ignoring invalid distribution -orch (/u/ftaj/.conda/envs/autopath/lib/python3.10/site-packages)

I should note that installing without pre-compiling ops works normally.
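
For what it's worth, the traceback suggests the failure comes from torch's own CUDA version check in torch.utils.cpp_extension (hit during build_ext), which DS_SKIP_CUDA_CHECK doesn't bypass. Roughly, based on the frames above and not the exact torch source (the real _check_cuda_version also inspects the compiler and CUDA_HOME):

    import torch

    def check_cuda_version_sketch(detected_cuda_version: str) -> None:
        # Paraphrase of the check that raises above: torch compares the local
        # CUDA toolkit version against the CUDA it was compiled with.
        torch_cuda_version = torch.version.cuda  # e.g. "11.8"
        if detected_cuda_version.split(".")[0] != torch_cuda_version.split(".")[0]:
            raise RuntimeError(
                f"The detected CUDA version ({detected_cuda_version}) mismatches the "
                f"version that was used to compile PyTorch ({torch_cuda_version}). "
                "Please make sure to use the same CUDA versions."
            )

    check_cuda_version_sketch("12.1")  # reproduces the RuntimeError above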

Thanks!

loadams (Contributor) commented Jun 6, 2023:

Hi @FarzanT - are you still seeing this error? If so, can you please open a new bug so we can track it there rather than on the PR?

mrwyattii deleted the mrwyattii/allow-cuda-mismatch branch on July 7, 2023 at 02:36.