Conda Environment Install Issue #95

kleingeo · 2020-02-20T18:37:18Z

Trying to get DeepSpeed installed for local use with a Conda environment, but it seems that DeepSpeed is not installing to the environment itself. After building the wheel, DeepSpeed does not install into the proper Conda environment location, while Apex does. Unclear why DeepSpeed is not working but Apex is.

ShadenSmith · 2020-02-21T14:37:00Z

Hi there! We're in the process of rewriting our installation scripts (which were previously only used within Docker containers) and hope to also release a conda package shortly. These sorts of issues should be fixed at that point.

msdejong · 2020-02-22T04:10:39Z

The installation scripts perform sudo -H pip installs, which install system wide. I replaced those with normal pip installs and it installed into the current environment without problems.
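A quick way to see which environment a plain pip install would land in is to ask the interpreter itself. The following is a minimal sketch, not part of the DeepSpeed install scripts; the function names are made up for illustration, and it assumes `CONDA_PREFIX` is set by `conda activate` as usual.

```python
import os
import sys
import sysconfig


def describe_install_target():
    """Report where a plain `python -m pip install` would place packages."""
    return {
        # Prefix of the running interpreter (the env root when a conda env is active).
        "sys_prefix": sys.prefix,
        # Directory that pure-Python packages are installed into for this interpreter.
        "site_packages": sysconfig.get_paths()["purelib"],
        # Set by `conda activate`; None outside an activated conda environment.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


def interpreter_matches_conda_env(info):
    """True when the running interpreter lives inside the active conda env."""
    conda_prefix = info["conda_prefix"]
    if conda_prefix is None:
        return False
    return os.path.realpath(info["sys_prefix"]) == os.path.realpath(conda_prefix)


if __name__ == "__main__":
    info = describe_install_target()
    for key, value in info.items():
        print(f"{key}: {value}")
    print("interpreter matches active conda env:", interpreter_matches_conda_env(info))
```

Running this once normally and once under sudo should show whether the two invocations resolve to different interpreters and site-packages directories.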

kleingeo · 2020-02-26T15:51:33Z

When I installed it, I made that change in the install script (for DeepSpeed, Apex, and the requirements). However, DeepSpeed still would not install to the right environment location. Looking at the installation a little more, this seemed more likely to be an issue with the wheel created for DeepSpeed in the install.sh file. I was able to get it working by forcing pip to install DeepSpeed into the correct location (the same location that Apex was correctly installed to).
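The comment above does not say how the install location was forced. One possible way, sketched here as an assumption, is to invoke pip through the environment's own interpreter and pass pip's `--target` option. The helper name and the wheel path are hypothetical.

```python
import sys
import sysconfig


def build_pip_command(package, target_dir=None):
    """Build a pip invocation bound to the running interpreter."""
    # `python -m pip` uses the pip that belongs to this interpreter, not
    # whichever `pip` executable happens to be first on PATH.
    command = [sys.executable, "-m", "pip", "install", package]
    if target_dir is not None:
        # --target makes pip place the package in an explicit directory.
        command += ["--target", target_dir]
    return command


if __name__ == "__main__":
    # Hypothetical path; substitute the wheel that install.sh actually built.
    wheel = "path/to/deepspeed.whl"
    site_packages = sysconfig.get_paths()["purelib"]
    print(" ".join(build_pip_command(wheel, target_dir=site_packages)))
    # To run the install for real:
    #   import subprocess; subprocess.run(build_pip_command(wheel), check=True)
```

In most cases the `python -m pip` form alone should be enough, and `--target` is only a fallback when the package still ends up elsewhere.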

ShadenSmith · 2020-03-09T17:09:12Z

We now have a conda package uploaded and we appreciate any feedback!

We have versions compiled for cudatoolkit versions 10.0 and 10.1. To install along with pytorch and other dependencies that are in the conda-forge channel:

conda install deepspeed cudatoolkit=10.1 -c deepspeed -c pytorch -c conda-forge

ShadenSmith · 2020-03-23T02:09:13Z

The repo's install.sh should respect the environment by default now (sudo is opt-in). Please let me know if the issue persists.

kleingeo · 2020-04-13T17:46:37Z

Using the conda install, deepspeed shows up when I run conda list but it is not available when trying to import in python.

ShadenSmith · 2020-04-20T17:18:41Z

Hi @kleingeo, thanks for the report. I can see that on my end now as well. Not sure what happened...I'm looking into it.

Interestingly, the deepspeed entry point looks fine and is found in my $PATH after installation. And I can see the DeepSpeed library installed under ~/miniconda3/envs/test/lib/python3.7/site-packages/deepspeed/ (where test is my conda environment name), and also see the expected ~/miniconda3/lib/python3.7/site-packages in my sys.path...so I'm not sure why the deepspeed library is not importable.
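One thing worth checking in a situation like this is whether the site-packages directory that holds the package is actually one of the entries on `sys.path` for the interpreter being run. The sketch below is a generic diagnostic with made-up function names, not something from the DeepSpeed repo.

```python
import importlib.util
import os
import sys


def find_package_dirs(name, search_roots):
    """Return the roots that contain a package directory called `name`."""
    return [root for root in search_roots if os.path.isdir(os.path.join(root, name))]


def diagnose_import(name):
    """Summarise whether top-level package `name` is importable and where it lives."""
    spec = importlib.util.find_spec(name)
    return {
        "importable": spec is not None,
        # Path of the module file the import system would load, if any.
        "origin": spec.origin if spec is not None else None,
        # sys.path entries that physically contain a directory with this name.
        "found_on_sys_path": find_package_dirs(name, [p for p in sys.path if p]),
    }


if __name__ == "__main__":
    report = diagnose_import("deepspeed")
    print("interpreter:", sys.executable)
    for key, value in report.items():
        print(f"{key}: {value}")
```

If `found_on_sys_path` is empty while the package directory exists under the environment's site-packages, the interpreter is probably not the one that belongs to that environment.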

kleingeo · 2020-04-23T15:26:27Z

Yes, I remember having this problem a lot when trying to install deepspeed normally with the install.sh file. With a normal Python virtual env it works, but for some reason with Conda it consistently tries to install to another location. The only thing I found to work (when using conda) was to force pip to use the install location where the install.sh file installs Apex.

jdongca2003 · 2020-05-21T16:21:17Z

@ShadenSmith, it is easier to install deepspeed via your conda command than via 'install.sh' (which is prone to fail). However, the deepspeed channel only has an early version of deepspeed.

conda search -f deepspeed -c deepspeed
Loading channels: done
deepspeed 0.1.0 py3.6_cuda10.0.130_0 deepspeed
deepspeed 0.1.0 py3.6_cuda10.1.243_0 deepspeed
deepspeed 0.1.0 py3.7_cuda10.0.130_0 deepspeed
deepspeed 0.1.0 py3.7_cuda10.1.243_0 deepspeed

When do you plan to release a new conda version of deepspeed with Zero2?

Thanks

ShadenSmith · 2020-05-27T15:28:35Z

Hi @jdongca2003, I have some time to dedicate to DeepSpeed's conda infrastructure now that the v0.2 release is complete. I'm looking at improved packages (per the above bug report) and automating the package build process.

jdongca2003 · 2020-05-27T23:20:59Z

@ShadenSmith Thanks. I tested your conda deepspeed package on https://github.com/microsoft/DeepSpeedExamples/tree/master/cifar.
It failed on a Tesla K80 and I got the following error message:
"
THCudaCheck FAIL file=csrc/fused_adam_cuda_kernel.cu line=135 error=209 : no kernel image is available for execution on the device
Traceback (most recent call last):
File "cifar10_deepspeed.py", line 178, in
model_engine.step()
File "/home/dong/miniconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed/pt/deepspeed_light.py", line 692, in step
self.optimizer.step()
File "/home/dong/miniconda3/envs/deepspeed/lib/python3.7/site-packages/apex/optimizers/fused_adam.py", line 146, in step
group['weight_decay'])
RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at csrc/fused_adam_cuda_kernel.cu:135"

But it worked well on Tesla P4. Probably deepspeed does not support old GPU architecture.
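The "no kernel image is available" error generally means the compiled extension contains no code the GPU can run. The Tesla K80 has compute capability 3.7, the P4 6.1, and the V100 7.0. Below is a rough sketch of that compatibility rule. The architecture list is hypothetical, since the thread does not say which architectures the conda package was built for, and the function names are made up.

```python
def arch_tag(capability):
    """Convert a (major, minor) compute capability into an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def can_run(capability, compiled_archs):
    """Approximate check that some compiled image can execute on the device.

    'sm_XY' binaries run on devices with the same major version and an equal
    or higher minor version; 'compute_XY' PTX can be JIT-compiled for any
    device of equal or higher capability.
    """
    for arch in compiled_archs:
        kind, _, digits = arch.partition("_")
        if not digits.isdigit():
            continue  # skip entries this sketch does not understand
        built = (int(digits[:-1]), int(digits[-1]))
        if kind == "sm" and built[0] == capability[0] and built[1] <= capability[1]:
            return True
        if kind == "compute" and built <= tuple(capability):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical build targets, NOT the actual list used for the package.
    hypothetical_archs = ["sm_60", "sm_61", "sm_70"]
    for name, cap in [("K80", (3, 7)), ("P4", (6, 1)), ("V100", (7, 0))]:
        print(name, arch_tag(cap), can_run(cap, hypothetical_archs))
    try:
        import torch
        if torch.cuda.is_available():
            print("local device:", arch_tag(torch.cuda.get_device_capability(0)))
    except ImportError:
        pass  # PyTorch not installed; the pure-Python check above still works
```

Under that hypothetical list, the K80 would be the only one of the three cards without a usable kernel image, which matches the K80 and P4 results reported above.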

analog75 · 2020-05-31T11:03:39Z

On a V100, the same THCudaCheck error happens!

ConnollyLeon · 2020-11-11T03:47:34Z


Hi @jdongca2003,
I encountered the same problem you describe when using a Tesla K80, and I found that it works normally on a Tesla V100. Have you solved this problem?

@ShadenSmith Could you please explain why this happens? Does deepspeed not support the Tesla K80?

Thanks.


loadams · 2023-08-18T17:10:43Z

Hi, closing this issue as it is stale with respect to CUDA/Torch/DeepSpeed versions. However, we now provide an environment.yml at the root of our repo for ease of building in conda!

ShadenSmith closed this as completed Mar 23, 2020

ShadenSmith reopened this Apr 20, 2020

ShadenSmith added the bug label Apr 20, 2020

ShadenSmith self-assigned this Apr 20, 2020

stas00 mentioned this issue Dec 5, 2020

[build] make builder smarter and configurable wrt compute capabilities + docs #578

Merged

6 tasks

loadams closed this as completed Aug 18, 2023
