Error when running distributed_train_acme_qrdqn.py #2

Open
SaundersJE97 opened this issue Sep 8, 2022 · 5 comments

@SaundersJE97

Hi, I have been trying to run 'distributed_train_acme_qrdqn.py' with only a few agents, and I'm getting the error below. I think it might be a compatibility issue between jax, dm-acme, and dm-launchpad.

I did some digging and traced the failure to acme/agents/jax/actors.

This is where I get stuck, as I'm not entirely sure how the QR-DQN agent is built with jax and passed to launchpad. I would really appreciate any thoughts on this issue.

Environment

  • Python 3.9.13
  • Ubuntu 20.04

Error

/usr/local/lib/python3.9/dist-packages/haiku/_src/data_structures.py:37: FutureWarning: jax.tree_structure is deprecated, and will be removed in a future release. Use jax.tree_util.tree_structure instead.
PyTreeDef = type(jax.tree_structure(None))
I0908 13:09:34.228399 140062111078208 xla_bridge.py:345] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0908 13:09:34.228528 140062111078208 xla_bridge.py:345] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I0908 13:09:34.228579 140062111078208 xla_bridge.py:345] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I0908 13:09:34.229399 140062111078208 xla_bridge.py:345] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
W0908 13:09:34.229537 140062111078208 xla_bridge.py:352] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0908 13:09:34.483748 140052206171904 courier_utils.py:120] Binding: run
I0908 13:09:34.487003 140052206171904 lp_utils.py:87] StepsLimiter: Starting with max_steps = 9600000 (actor_steps)
I0908 13:09:34.487962 140050360694528 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:34.488504 140052214564608 savers.py:164] Attempting to restore checkpoint: None
I0908 13:09:35.382974 140050352301824 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.442431 140046896195328 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.453237 140046232733440 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.453534 140052214564608 courier_utils.py:120] Binding: get_counts
I0908 13:09:35.463889 140046132836096 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.473653 140046098515712 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.482568 140052214564608 courier_utils.py:120] Binding: get_directory
I0908 13:09:35.483737 140045998618368 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.503851 140045981832960 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.504534 140045923084032 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.504815 140052214564608 courier_utils.py:120] Binding: get_steps_key
I0908 13:09:35.524922 140045914691328 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.525063 140052214564608 courier_utils.py:120] Binding: increment
I0908 13:09:35.525359 140045822371584 node.py:61] Reverb client connecting to: localhost:33011
I0908 13:09:35.526567 140052214564608 courier_utils.py:120] Binding: restore
I0908 13:09:35.533543 140052214564608 courier_utils.py:120] Binding: save
I0908 13:09:35.542086 140052214564608 savers.py:155] Saving checkpoint: /root/acme/20220908-130931/checkpoints/counter
I0908 13:09:36.944851 140052206171904 lp_utils.py:95] StepsLimiter: Reached 0 recorded steps
Node ThreadWorker(thread=<Thread(actor, stopped daemon 140045923084032)>, future=<Future at 0x7f61f80a19a0 state=finished raised AttributeError>) crashed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/launchpad/launch/worker_manager.py", line 474, in _check_workers
    worker.future.result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/dist-packages/launchpad/launch/worker_manager.py", line 250, in run_inner
    future.set_result(f())
  File "/usr/local/lib/python3.9/dist-packages/launchpad/nodes/python/node.py", line 75, in _construct_function
    return functools.partial(self._function, *args, **kwargs)()
  File "/usr/local/lib/python3.9/dist-packages/launchpad/nodes/courier/node.py", line 113, in run
    instance = self._construct_instance() # pytype:disable=wrong-arg-types
  File "/usr/local/lib/python3.9/dist-packages/launchpad/nodes/python/node.py", line 180, in _construct_instance
    self._instance = self._constructor(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/acme/jax/experiments/make_distributed_experiment.py", line 169, in build_actor
    actor = experiment.builder.make_actor(actor_key, policy_network,
  File "/usr/local/lib/python3.9/dist-packages/acme/agents/jax/dqn/builder.py", line 99, in make_actor
    return actors.GenericActor(
  File "/usr/local/lib/python3.9/dist-packages/acme/agents/jax/actors.py", line 67, in __init__
    self._init = jax.jit(actor.init, backend=backend)
AttributeError: 'function' object has no attribute 'init'
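
The last frame of the traceback reads `actor.init`, which suggests the actor code expects an object with an `init` attribute but received a plain function. The sketch below reproduces that mismatch with stand-ins only; it does not import Acme or jax. `FakeActorCore`, `build_actor`, `policy`, and `wrap_policy` are hypothetical names invented for illustration, and the real Acme types may look different.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeActorCore:
    """Hypothetical stand-in for an object that exposes `init`."""
    init: Callable[[int], dict]
    select_action: Callable[[dict, float], int]


def build_actor(actor):
    """Mimics only the attribute access from the traceback (no jax.jit)."""
    return actor.init


def policy(observation: float) -> int:
    """A plain policy function, the calling convention assumed to be outdated."""
    return int(observation > 0)


def wrap_policy(fn: Callable[[float], int]) -> FakeActorCore:
    """Adapts a plain function into the object shape that `build_actor` reads."""
    return FakeActorCore(
        init=lambda seed: {"seed": seed},
        select_action=lambda state, obs: fn(obs),
    )


# Passing the bare function fails the same way as in the log above.
try:
    build_actor(policy)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Wrapping the function first gives the expected attributes.
core = wrap_policy(policy)
state = build_actor(core)(0)
action = core.select_action(state, 1.5)
print(action)
```

If this reading is right, the fix belongs wherever the training script hands its policy to the builder, not in launchpad itself.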

Python Packages

absl-py 0.15.0
ale-py 0.7.3
astunparse 1.6.3
async-generator 1.10
atari-py 0.2.9
attrs 22.1.0
bsuite 0.3.5
cached-property 1.5.2
cachetools 4.2.4
certifi 2021.10.8
chardet 3.0.4
charset-normalizer 2.0.7
chex 0.1.4
clang 5.0
cloudpickle 2.0.0
colorama 0.4.5
commonmark 0.9.1
cycler 0.11.0
dbus-python 1.2.16
decorator 5.1.0
dill 0.3.5.1
distrax 0.1.2
dm-acme 0.4.1
dm-control 0.0.364896371
dm-env 1.5
dm-haiku 0.0.7
dm-launchpad 0.5.2
dm-reverb 0.7.2
dm-sonnet 2.0.0
dm-tree 0.1.6
docker 6.0.0
dopamine-rl 4.0.0
etils 0.7.1
execnet 1.9.0
flatbuffers 1.12
flax 0.5.3
fonttools 4.37.1
frozendict 2.3.4
future 0.18.2
gast 0.4.0
gin 0.1.6
gin-config 0.5.0
glfw 2.5.4
google-api-core 2.8.2
google-api-python-client 2.58.0
google-auth 1.35.0
google-auth-httplib2 0.1.0
google-auth-oauthlib 0.4.6
google-cloud-aiplatform 1.16.1
google-cloud-bigquery 2.34.4
google-cloud-core 2.3.2
google-cloud-resource-manager 1.6.1
google-cloud-storage 2.5.0
google-crc32c 1.3.0
google-pasta 0.2.0
google-resumable-media 2.3.3
googleapis-common-protos 1.56.4
grpc-google-iam-v1 0.12.4
grpcio 1.47.0
grpcio-status 1.47.0
gym 0.21.0
h5py 3.1.0
httplib2 0.20.4
humanize 4.3.0
idna 3.3
imageio 2.21.2
immutabledict 2.2.1
importlab 0.7
importlib-metadata 4.8.1
importlib-resources 5.4.0
iniconfig 1.1.1
jax 0.3.16
jaxlib 0.3.14
jmp 0.0.2
joblib 1.1.0
keras 2.8.0
Keras-Preprocessing 1.1.2
kiwisolver 1.3.2
kubernetes 24.2.0
labmaze 1.0.5
libclang 12.0.0
libcst 0.4.7
lxml 4.9.1
Markdown 3.3.4
matplotlib 3.5.3
mizani 0.7.4
mock 4.0.3
msgpack 1.0.2
mypy-extensions 0.4.3
networkx 2.8.6
ninja 1.10.2.3
numpy 1.22.4
oauthlib 3.1.1
opencv-python 4.5.4.58
opensimplex 0.3
opt-einsum 3.3.0
optax 0.0.9
packaging 21.3
palettable 3.3.0
pandas 1.4.4
patsy 0.5.2
Pillow 8.4.0
pip 22.2.2
plotnine 0.9.0
pluggy 1.0.0
portpicker 1.5.2
promise 2.3
proto-plus 1.22.1
protobuf 3.19.1
psutil 5.9.1
py 1.11.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pygame 2.1.0
Pygments 2.13.0
PyGObject 3.36.0
PyOpenGL 3.1.6
pyparsing 3.0.4
pytest 7.1.2
pytest-forked 1.4.0
pytest-xdist 2.5.0
python-apt 2.0.0+ubuntu0.20.4.8
python-dateutil 2.8.2
pytype 2021.8.11
pytz 2021.3
PyWavelets 1.3.0
PyYAML 6.0
requests 2.26.0
requests-oauthlib 1.3.0
requests-unixsocket 0.2.0
rich 11.2.0
rlax 0.1.4
rlds 0.1.5
rsa 4.7.2
s2sphere 0.2.5
scikit-image 0.19.3
scikit-learn 1.0.1
scipy 1.7.1
setuptools 45.2.0
six 1.15.0
sklearn 0.0
SQLAlchemy 1.2.19
statsmodels 0.13.2
tabulate 0.8.10
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow 2.8.0
tensorflow-datasets 4.5.2
tensorflow-estimator 2.8.0
tensorflow-io-gcs-filesystem 0.26.0
tensorflow-metadata 1.10.0
tensorflow-probability 0.15.0
tensorstore 0.1.23
termcolor 1.1.0
tf-estimator-nightly 2.8.0.dev2021122109
tf-slim 1.1.0
tfp-nightly 0.15.0.dev20211104
threadpoolctl 3.0.0
tifffile 2022.8.12
toml 0.10.2
tomli 2.0.1
toolz 0.11.1
tqdm 4.64.0
transitions 0.8.10
trfl 1.2.0
typed-ast 1.5.4
typing_extensions 4.3.0
typing-inspect 0.8.0
uritemplate 4.1.1
urllib3 1.26.7
websocket-client 1.4.0
Werkzeug 2.0.2
wheel 0.34.2
wrapt 1.12.1
xmanager 0.2.0
zipp 3.6.0

@joshgreaves
Collaborator

Hi,

Thanks for reporting this issue. After playing around with it myself, it looks likely that Acme's API has changed. I'll update here once I've made progress.

@SaundersJE97
Author

Which versions of Acme and launchpad did you use? I can downgrade to get it working.

@joshgreaves
Collaborator

I'm not sure which version would work, but it's probably worth trying version 0.4.0 (https://pypi.org/project/dm-acme/0.4.0/#history), since it was released just before the BLE and we tested the agents at release.

I'm looking at upgrading this today. I'll respond here once I've tested and pushed the changes.
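
Before and after a downgrade, it may help to confirm which versions are actually installed in the environment that runs the script. A small sketch using only the standard library; the package names are taken from the list in the original report, and `installed_versions` is a hypothetical helper, not part of Acme or launchpad.

```python
from importlib import metadata


def installed_versions(packages):
    """Returns {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(
    ["dm-acme", "dm-launchpad", "dm-reverb", "jax", "jaxlib"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside the same interpreter that launches the training script avoids reading versions from a different environment.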

@joshgreaves
Collaborator

Here's the status update:

  • I have code ready to check in to upgrade the Acme agents to Acme's new experiments API, but Acme hasn't been pushed to PyPI in a while. I'm checking in with them to see if they will be pushing in the near future. Once they push their code, I will submit my changes.
  • Realizing that Acme hasn't pushed in a while, I double-checked that Acme 0.4.0 still works with the current BLE version on PyPI and found that it does, as long as launchpad is not used.

Were you able to get it working with Acme 0.4.0?

@SaundersJE97
Author

SaundersJE97 commented Oct 22, 2022

Thank you for updating the repository. I wasn't able to get it working with launchpad 0.4.0 either. I believe they removed the caching node at some point, which might be causing compatibility issues.
