Getting CUDA to work in Conda environment #6

Closed
stephane-caron opened this issue Mar 1, 2023 · 5 comments

@stephane-caron
Contributor

Running from the latest commit on xpag's master, in an environment freshly created from environment.yaml. The GPU memory error from #5 is gone 🤔 but now a new error appears.

Reproduction code

import jax
import xpag

assert jax.lib.xla_bridge.get_backend().platform == "gpu"

nb_envs = 1  # the number of rollouts in parallel during training
env, eval_env, env_info = xpag.wrappers.gym_vec_env("HalfCheetah-v4", nb_envs)
agent = xpag.agents.SAC(
    env_info["observation_dim"]
    if not env_info["is_goalenv"]
    else env_info["observation_dim"] + env_info["desired_goal_dim"],
    env_info["action_dim"],
    {"actor_lr": 3e-3, "critic_lr": 3e-3, "tau": 5e-2, "seed": 0},
)

Outcome

First, the following warning is repeated about sixty times 😅

2023-03-01 13:46:52.972844: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:114] *** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

But that may not be related to the issue at hand.

The error is:

Traceback (most recent call last):
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 462, in close_extras
AttributeError: 'NoneType' object has no attribute 'TimeoutError'
Exception ignored in: <function AsyncVectorEnv.__del__ at 0x7fc0204c8e50>
Traceback (most recent call last):
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/.../.micromamba/envs/xpag2/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 462, in close_extras
AttributeError: 'NoneType' object has no attribute 'TimeoutError'
@stephane-caron stephane-caron changed the title No timeout error in SAC agent No TimeoutError in SAC agent Mar 1, 2023
@stephane-caron
Contributor Author

OK, sorry for the copious log; this issue comes from upstream. It is only related to xpag insofar as my goal is to get xpag to work 😉

No TimeoutError

This second error is actually benign. Something goes wrong in gym when exiting the script (at destruction time, it tries to read from an imported module that has already been freed), but it is not blocking. Since gym is discontinued, I guess we won't get that one fixed. How about switching to Gymnasium? It is a drop-in replacement.
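As a rough sketch of what the switch would look like (assuming gymnasium is installed; not tested against xpag's wrappers), the visible change is mostly the import, since Gymnasium keeps the gym API:

import gymnasium as gym  # drop-in replacement for the discontinued gym package

env = gym.make("HalfCheetah-v4")
observation, info = env.reset(seed=0)
# Gymnasium keeps the 5-tuple step API: (obs, reward, terminated, truncated, info)
observation, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()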

ptxas warning

You are using ptxas 10.1.243, which is older than 11.1.
To get a newer ptxas, I installed CUDA from the nvidia channel:

$ micromamba install -c nvidia cuda=11.4

Here I specified cuda=11.4 because nvidia-smi tells me my CUDA version is 11.4. (The expression "CUDA version" is actually ambiguous, as the curious or exhausted traveler will hear 😉 there are two versions it can refer to, and the one reported by nvidia-smi is the CUDA driver version.) This install generated a number of warnings, since cudatoolkit was already in the environment and seems to conflict with cuda-toolkit:

warning  libmamba [cuda-cudart-dev-12.1.55-0] The following files were already present in the environment:
    - lib/libcudart.so

These conflicts could not simply be ignored, as they resulted in:

jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: jaxlib/gpu/solver_kernels.cc:45: operation gpusolverDnCreate(&handle) failed: cuSolver has not been initialized

But things eventually worked out after removing cudatoolkit (which also removes jax), reinstalling CUDA from the nvidia channel, and finally reinstalling jax:

$ micromamba remove cudatoolkit  # removes jax
$ micromamba install -c nvidia cuda=11.4
$ micromamba install jax
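
As a sanity check (hypothetical commands, assuming the environment is activated and its ptxas comes first on the PATH), the toolkit and driver versions can be compared:

$ ptxas --version                    # toolkit side: should now report a release >= 11.1
$ nvidia-smi | grep "CUDA Version"   # driver side: the 11.4 mentioned above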

Now I can run the SAC example locally:

~/.micromamba/envs/xpag4/lib/python3.10/site-packages/gym/utils/passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  if not isinstance(terminated, (bool, np.bool8)):
Logging in /home/nelson/results/xpag/train_mujoco
[           0 steps] [training time (ms) += 0         ] [ep reward: -7.059         ] 
~/.micromamba/envs/xpag4/lib/python3.10/site-packages/gym/utils/passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  if not isinstance(terminated, (bool, np.bool8)):
[        5000 steps] [training time (ms) += 1749      ] [ep reward: 1.439          ] 
[       10000 steps] [training time (ms) += 1734      ] [ep reward: 0.777          ] 
[       15000 steps] [training time (ms) += 15582     ] [ep reward: 2934.318       ] 
[       20000 steps] [training time (ms) += 9367      ] [ep reward: 3892.575       ] 

Hope this helps 😃

@stephane-caron stephane-caron changed the title No TimeoutError in SAC agent Getting CUDA to work in Conda environment Mar 1, 2023
@stephane-caron
Contributor Author

For reference, the workaround above (installing CUDA from the nvidia channel) is the alternative proposed in conda-forge/cudatoolkit-feedstock#62.

Following that thread to its latest developments, things seem to be on a path to improve with the integration of CUDA directly into conda-forge: conda-forge/staged-recipes#21382

@perrin-isir
Owner

Great that it worked out!
I guess that ptxas will always remain proprietary (see discussion here for instance), so using the NVIDIA channel or NVIDIA's recommended install is probably the only way to go.

About the gym bool8 warning: it will soon disappear, because I should make the move towards gymnasium instead of gym (everyone should, see: https://github.com/openai/gym#important-notice). In gymnasium, the deprecated bool8 has been replaced (see passive_env_checker.py).
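For illustration only (not gymnasium's actual source), the gist of the change behind that warning:

import numpy as np

terminated = np.bool_(True)  # the kind of value an env returns for termination

# gym's passive_env_checker uses the deprecated alias, hence the warning on NumPy >= 1.24:
isinstance(terminated, (bool, np.bool8))  # DeprecationWarning: `np.bool8` is a deprecated alias
# the non-deprecated spelling, which gymnasium uses instead:
isinstance(terminated, (bool, np.bool_))  # no warning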

Out of curiosity, which NVIDIA GPU are you using for these results?
Here's what I get on my laptop with a Quadro P3000:

[        5000 steps] [training time (ms) += 503       ] [ep reward: -0.266         ] 
[       10000 steps] [training time (ms) += 487       ] [ep reward: 0.080          ] 
[       15000 steps] [training time (ms) += 13307     ] [ep reward: 1941.168       ] 
[       20000 steps] [training time (ms) += 6849      ] [ep reward: 960.468        ] 
[       25000 steps] [training time (ms) += 6748      ] [ep reward: 3722.916       ] 

@stephane-caron
Contributor Author

stephane-caron commented Mar 1, 2023

Out of curiosity, which NVIDIA GPU are you using for these results?

This one is a good ol' GeForce GTX 1070 🤠 (Also I should mention I reduced the number of vectorized environments to one for testing.)

@perrin-isir
Owner

Ah ok ok, so I think you'll be a bit faster than the P3000. Here is what I get with num_envs = 1:

[           0 steps] [training time (ms) += 0         ] [ep reward: -7.059         ] 
[        5000 steps] [training time (ms) += 1681      ] [ep reward: 1.439          ] 
[       10000 steps] [training time (ms) += 1678      ] [ep reward: 0.777          ] 
[       15000 steps] [training time (ms) += 16560     ] [ep reward: 2565.911       ] 
[       20000 steps] [training time (ms) += 10555     ] [ep reward: 4170.352       ] 
