Conda tensorflow adventure (cont'd) #5

stephane-caron · 2023-02-28T17:10:18Z

This issue follows the conda installation instructions from #4, but it happens post-installation. Everything imports well, but trying to create a SAC agent fails.

Reproduction code

import jax
import xpag

assert jax.lib.xla_bridge.get_backend().platform == "gpu"
nb_envs = 10  # the number of rollouts in parallel during training
env, eval_env, env_info = xpag.wrappers.gym_vec_env("HalfCheetah-v4", nb_envs)
agent = xpag.agents.SAC(
    env_info["observation_dim"]
    if not env_info["is_goalenv"]
    else env_info["observation_dim"] + env_info["desired_goal_dim"],
    env_info["action_dim"],
    {"actor_lr": 3e-3, "critic_lr": 3e-3, "tau": 5e-2, "seed": 0},
)

Outcome

First, a not-so-inviting warning:

2023-02-28 18:06:42.237635: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:114] *** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

Followed by:

2023-02-28 18:06:44.866666: E external/org_tensorflow/tensorflow/compiler/xla/python/pjit.cc:476] cache miss fail: XlaRuntimeError: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.custom_call' failed: jaxlib/gpu/prng_kernels.cc:33: operation gpuGetLastError() failed: out of memory.

stephane-caron · 2023-02-28T17:18:30Z

This answer suggests installing CUDA from Conda as well, as otherwise it will use the old version in my /usr/bin. I'm trying this now (it will take a while, bad connection).

stephane-caron · 2023-02-28T18:09:50Z

The answer did remove the first error, but the gist of the second error remains:

2023-02-28 19:08:03.177444: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2410] Execution of replica 0 failed: INTERNAL: Failed to execute XLA Runtime executable: run time error: custom call 'xla.gpu.custom_call' failed: jaxlib/gpu/prng_kernels.cc:33: operation gpuGetLastError() failed: out of memory.

I reduced the number of environments to 1 → same outcome.

Do you know how much GPU memory a SAC agent for HalfCheetah-v4 should require?

stephane-caron · 2023-02-28T19:23:26Z

OK, never mind, this is not an xpag-related issue. I just need to figure out a way to make all the dependencies work together on my machine 😅

perrin-isir · 2023-02-28T21:06:06Z

The amount of GPU memory is rarely an issue I think (the default SAC agents in the examples use relatively small networks), but JAX tries to preallocate 90% of the total GPU memory, and I have encountered various issues at this step. One command that has proven useful in some specific cases is the following, before executing the code:
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.7
(Of course you can test with other values < 0.9)
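For scripts and notebooks, the same setting can be applied from Python. The following is a minimal sketch, assuming the variable is set before JAX initializes its GPU backend (setting it before the first `import jax` is the simplest way to be sure). The helper name `limit_jax_gpu_memory` is hypothetical, not part of JAX or xpag.

```python
import os
import sys
import warnings


def limit_jax_gpu_memory(fraction: float = 0.7) -> None:
    """Cap the share of GPU memory that JAX preallocates (hypothetical helper)."""
    if not 0.0 < fraction < 1.0:
        raise ValueError(f"fraction must be strictly between 0 and 1, got {fraction}")
    if "jax" in sys.modules:
        # Assumption: the variable is read when the GPU backend starts,
        # so setting it after that point may have no effect.
        warnings.warn("jax is already imported; the new fraction may be ignored")
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = str(fraction)


limit_jax_gpu_memory(0.7)
# import jax  # import JAX only after the variable has been set
```

JAX also documents `XLA_PYTHON_CLIENT_PREALLOCATE=false`, which disables preallocation entirely and may be worth trying if lowering the fraction is not enough.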

By the way do you manage to use JAX for simple operations (sth like jax.numpy.sqrt(4.)) on your machine? Is it failing specifically when creating a SAC agent? In this case, could you try also with a TD3 agent (there are fewer dependencies, in particular TD3 doesn't use tensorflow_probability)?

stephane-caron · 2023-03-01T12:55:27Z

Thanks for suggesting the memory fraction setting. I started from a clean slate (thanks conda); what follows is in a new environment created from xpag's environment.yaml.

In this new setting, gpuGetLastError() failed: out of memory is gone and it seems the script is executing further → #6 (to another error 😅).

By the way do you manage to use JAX for simple operations (sth like jax.numpy.sqrt(4.)) on your machine?

Yes, this works, although I probably should look into that ptxas warning:

Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:20:04) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.numpy.sqrt(4.)
2023-03-01 13:47:16.968355: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:114] *** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

Array(2., dtype=float32, weak_type=True)
>>> jax.numpy.sqrt(4.)
Array(2., dtype=float32, weak_type=True)
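The interactive check above can also be scripted, which makes it easier to compare environments. This is a rough sketch, not an official diagnostic: the function name `jax_sanity_check` is made up for illustration, and it only reports the backend JAX selected, the visible devices, and the result of the same `sqrt` call.

```python
def jax_sanity_check():
    """Return basic facts about the JAX install, or None if JAX is not importable."""
    try:
        import jax
        import jax.numpy as jnp
    except ImportError:
        return None
    return {
        "backend": jax.default_backend(),            # e.g. "cpu" or "gpu"
        "devices": [str(d) for d in jax.devices()],  # devices JAX can see
        "sqrt4": float(jnp.sqrt(4.0)),               # forces a small computation
    }


report = jax_sanity_check()
print(report)
```

If the reported backend is "cpu" on a machine with a GPU, the CUDA setup is the more likely culprit than the agent code.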

perrin-isir · 2023-03-01T14:21:31Z

On my main computer the ptxas version is 11.5.119.
Here it is mentioned that JAX requires CUDA >= 11.4, so indeed it may cause more serious bugs.
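To see which ptxas an environment exposes, something like the sketch below may help. It assumes that `ptxas --version` prints a line such as "Cuda compilation tools, release 11.5, V11.5.119" and that the binary found on `PATH` is the one XLA ends up using, which is not guaranteed. The helper names are hypothetical; the 11.1 threshold is the one quoted in the warning.

```python
import re
import shutil
import subprocess

MIN_PTXAS = (11, 1)  # threshold quoted in the XLA warning


def parse_ptxas_version(text):
    """Extract (major, minor, patch) from `ptxas --version` output, or None."""
    match = re.search(r"V(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def find_ptxas_version():
    """Query the first ptxas on PATH; return None if there is none."""
    path = shutil.which("ptxas")
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return parse_ptxas_version(result.stdout)


def is_too_old(version):
    """True if the (major, minor) part is below the warning's threshold."""
    return version[:2] < MIN_PTXAS


# Sample line in the assumed output format, using the version from the warning.
sample = "Cuda compilation tools, release 10.1, V10.1.243"
print(parse_ptxas_version(sample), is_too_old(parse_ptxas_version(sample)))
print(find_ptxas_version())
```

Running this inside and outside the conda environment shows whether the environment's ptxas or the system one comes first on `PATH`.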

stephane-caron · 2023-03-01T14:25:55Z

Yes, this is now solved by #6 (comment). I could upgrade to ptxas >= 11.1 by installing CUDA toolkit == 11.4 (the one corresponding to my GPU driver) from the nvidia conda channel.

stephane-caron closed this as completed Feb 28, 2023

This was referenced Mar 1, 2023

Build error for pytinyrenderer wheel #4

Closed

Getting CUDA to work in Conda environment #6

Closed
