Getting CUDA to work in Conda environment #6

Closed
stephane-caron opened this issue Mar 1, 2023 · 5 comments

@stephane-caron
Contributor

Running from the latest commit on xpag's master, in an environment freshly created from environment.yaml. The GPU memory error from #5 is gone 🤔 but now a new error appears.

Reproduction code

import jax
import xpag

assert jax.lib.xla_bridge.get_backend().platform == "gpu"

nb_envs = 1  # the number of rollouts in parallel during training
env, eval_env, env_info = xpag.wrappers.gym_vec_env("HalfCheetah-v4", nb_envs)
agent = xpag.agents.SAC(
    env_info["observation_dim"]
    if not env_info["is_goalenv"]
    else env_info["observation_dim"] + env_info["desired_goal_dim"],
    env_info["action_dim"],
    {"actor_lr": 3e-3, "critic_lr": 3e-3, "tau": 5e-2, "seed": 0},
)

Outcome

First, the following warning is repeated about sixty times 😅

2023-03-01 13:46:52.972844: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:114] *** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

But that may not be related to the issue at hand.

The error is:

Traceback (most recent call last):
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 462, in close_extras
AttributeError: 'NoneType' object has no attribute 'TimeoutError'
Exception ignored in: <function AsyncVectorEnv.__del__ at 0x7fc0204c8e50>
Traceback (most recent call last):
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 462, in close_extras
AttributeError: 'NoneType' object has no attribute 'TimeoutError'
@stephane-caron stephane-caron changed the title No timeout error in SAC agent No TimeoutError in SAC agent Mar 1, 2023
@stephane-caron
Contributor Author

OK, sorry for the copious log; this issue comes from upstream. It is only related to xpag insofar as my goal is to get xpag to work 😉

No TimeoutError

This second error is actually benign. Something goes wrong in gym when exiting the script (at destruction time, it tries to read from an imported module that has already been freed), but it is not blocking. Since gym is discontinued, I guess we won't get that one fixed. How about switching to Gymnasium? It is a drop-in replacement.
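As a rough sketch of what the switch would look like (assuming gymnasium is installed; not tested against xpag's wrappers), the visible change is mostly the import, since Gymnasium keeps the gym API:

import gymnasium as gym  # drop-in replacement for the discontinued gym package

env = gym.make("HalfCheetah-v4")
observation, info = env.reset(seed=0)
# Gymnasium keeps the 5-tuple step API: (obs, reward, terminated, truncated, info)
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()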

ptxas warning

You are using ptxas 10.1.243, which is older than 11.1.
To get a newer ptxas, I installed CUDA from the nvidia channel:

$ micromamba install -c nvidia cuda=11.4

Here I specified cuda=11.4 because nvidia-smi tells me my CUDA version is 11.4. (The expression "CUDA version" is actually ambiguous, as the curious or exhausted traveler will hear 😉 there are two versions it can refer to, and the one reported by nvidia-smi is the CUDA driver version.) This install generated a number of warnings, since cudatoolkit was already in the environment and seems to conflict with cuda-toolkit:

warning  libmamba [cuda-cudart-dev-12.1.55-0] The following files were already present in the environment:
    - lib/libcudart.so

These conflicts could not simply be ignored, as they resulted in:

jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: jaxlib/gpu/solver_kernels.cc:45: operation gpusolverDnCreate(&handle) failed: cuSolver has not been initialized

But things eventually worked out after removing cudatoolkit (which also removes jax), reinstalling CUDA from the nvidia channel, and finally reinstalling jax:

$ micromamba remove cudatoolkit  # removes jax
$ micromamba install -c nvidia cuda=11.4
$ micromamba install jax
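
As a sanity check (hypothetical commands, assuming the environment is activated and its ptxas comes first on the PATH), the toolkit and driver versions can be compared:

$ ptxas --version                    # toolkit side: should now report a release >= 11.1
$ nvidia-smi | grep "CUDA Version"   # driver side: the 11.4 mentioned above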

Now I can run the SAC example locally:

~/.micromamba/envs/xpag4/lib/python3.10/site-packages/gym/utils/passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  if not isinstance(terminated, (bool, np.bool8)):
Logging in /home/nelson/results/xpag/train_mujoco
[           0 steps] [training time (ms) += 0         ] [ep reward: -7.059         ] 
~/.micromamba/envs/xpag4/lib/python3.10/site-packages/gym/utils/passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  if not isinstance(terminated, (bool, np.bool8)):
[        5000 steps] [training time (ms) += 1749      ] [ep reward: 1.439          ] 
[       10000 steps] [training time (ms) += 1734      ] [ep reward: 0.777          ] 
[       15000 steps] [training time (ms) += 15582     ] [ep reward: 2934.318       ] 
[       20000 steps] [training time (ms) += 9367      ] [ep reward: 3892.575       ] 

Hope this helps 😃

@stephane-caron stephane-caron changed the title No TimeoutError in SAC agent Getting CUDA to work in Conda environment Mar 1, 2023
@stephane-caron
Contributor Author

For reference, the workaround above (installing CUDA from the nvidia channel) is the alternative proposed in conda-forge/cudatoolkit-feedstock#62.

Following that thread to its latest developments, things seem to be on a path to improve with the integration of CUDA directly into conda-forge: conda-forge/staged-recipes#21382

@perrin-isir
Owner

Great that it worked out!
I guess that ptxas will always remain proprietary (see discussion here for instance), so using the NVIDIA channel or NVIDIA's recommended install is probably the only way to go.

About the gym bool8 warning: it will soon disappear, because I should make the move towards gymnasium instead of gym (everyone should, see: https://github.com/openai/gym#important-notice). In gymnasium, the deprecated bool8 has been replaced (see passive_env_checker.py).
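For illustration only (not gymnasium's actual source), the gist of the change behind that warning:

import numpy as np

terminated = np.bool_(True)  # the kind of value an env returns for termination

# gym's passive_env_checker uses the deprecated alias, hence the warning on NumPy >= 1.24:
isinstance(terminated, (bool, np.bool8))  # DeprecationWarning: `np.bool8` is a deprecated alias
# the non-deprecated spelling, which gymnasium uses instead:
isinstance(terminated, (bool, np.bool_))  # no warning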

Out of curiosity, which NVIDIA GPU are you using for these results?
Here's what I get on my laptop with a Quadro P3000:

[        5000 steps] [training time (ms) += 503       ] [ep reward: -0.266         ] 
[       10000 steps] [training time (ms) += 487       ] [ep reward: 0.080          ] 
[       15000 steps] [training time (ms) += 13307     ] [ep reward: 1941.168       ] 
[       20000 steps] [training time (ms) += 6849      ] [ep reward: 960.468        ] 
[       25000 steps] [training time (ms) += 6748      ] [ep reward: 3722.916       ] 

@stephane-caron
Contributor Author

stephane-caron commented Mar 1, 2023

Out of curiosity, which NVIDIA GPU are you using for these results?

This one is a good ol' GeForce GTX 1070 🤠 (Also I should mention I reduced the number of vectorized environments to one for testing.)

@perrin-isir
Owner

Ah ok ok, so I think you'll be a bit faster than the P3000. Here is what I get with num_envs = 1:

[           0 steps] [training time (ms) += 0         ] [ep reward: -7.059         ] 
[        5000 steps] [training time (ms) += 1681      ] [ep reward: 1.439          ] 
[       10000 steps] [training time (ms) += 1678      ] [ep reward: 0.777          ] 
[       15000 steps] [training time (ms) += 16560     ] [ep reward: 2565.911       ] 
[       20000 steps] [training time (ms) += 10555     ] [ep reward: 4170.352       ] 
