Multiprocessing EOF error #2

jluo-bgl · 2019-02-21T02:07:22Z

Hi, I'm trying to replicate your results, but the code does not run well for me on Python 3.5 + macOS. For example, I got an EOFError from multiprocessing. I have fixed several errors of this kind, but I'm not sure how many more I will run into, so knowing your tested environment would help me a lot. Thanks.

danijar · 2019-02-26T19:07:28Z

Hi @JamesLuoau, the code works for us on Debian with Python 2.7 and Python 3.5. I'm not sure why there should be a multiprocessing error here though -- the code is not parallelized and TensorFlow only uses threads as far as I know. Maybe try commenting out the ExternalProcess class in wrappers.py and make sure that it isn't used.
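
As background on the EOFError: in Python's multiprocessing module, Connection.recv() raises EOFError once the other end of a pipe has been closed and nothing is left to read, which is typically what a parent sees when its worker process exits early. The sketch below reproduces this inside a single process. That ExternalProcess talks to its worker over such a pipe is an assumption here, and recv_or_report is a hypothetical helper name.

```python
import multiprocessing


def recv_or_report(conn):
  """Return the next message, or a marker string if the peer has hung up."""
  try:
    return conn.recv()
  except EOFError:
    return 'peer closed the connection'


parent_end, child_end = multiprocessing.Pipe()
child_end.send('ready')
child_end.close()  # Stands in for a worker process that exited early.

first = recv_or_report(parent_end)   # The message sent before closing.
second = recv_or_report(parent_end)  # Nothing left to read: EOFError path.
print(first, '/', second)
```

If the real error has the same cause, the interesting traceback is the one printed by the worker before it exits, not the EOFError in the parent.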

2877992943 · 2019-02-28T07:45:34Z

scripts/tasks.py

  from dm_control import suite
  def env_ctor():
    env = control.wrappers.DeepMindWrapper(suite.load(domain, task), (64, 64))
    env = control.wrappers.ActionRepeat(env, action_repeat)
    env = control.wrappers.LimitDuration(env, max_length)
    env = control.wrappers.PixelObservations(env, (64, 64), np.uint8, 'image')
    env = control.wrappers.ConvertTo32Bit(env)
    return env
  # env = control.wrappers.ExternalProcess(env_ctor)  # Changed here: build the environment in-process instead.
  env = env_ctor()
  return env

It seems that this works on macOS.

However, I see "nan" in the log output:

INFO:tensorflow:Graph contains 5144438 trainable variables.
2019-02-28 15:31:46.425061: E tensorflow/core/common_runtime/session.cc:75] Not found: No session factory registered for the given session options: {target: "local" config: gpu_options { allow_growth: true }} Registered factories are {DIRECT_SESSION, GRPC_SESSION}.
2019-02-28 15:31:46.425243: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO:tensorflow:
--------------------------------------------------
Epoch 1 phase train (phase step 0, global step 0).
step/score/loss/zs_entropy/zs_divergence =  [0, nan, 11855.6396, 35.8041, 3.36079955]

danijar · 2019-03-01T01:38:03Z

Yes, that's correct. The nan is intentional and shows up as the mean planning score for steps in which no planning simulation happens.
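
A mean over zero collected values has no defined result, which is why it is commonly reported as nan. The following is only a minimal sketch of that idea in plain Python; mean_score is a hypothetical name and this is not the PlaNet summary code.

```python
import math


def mean_score(scores):
  """Mean of the collected planning scores, or nan if none were collected."""
  if not scores:
    return float('nan')
  return sum(scores) / len(scores)


print(mean_score([]))          # No planning simulation happened in this step.
print(mean_score([1.0, 3.0]))  # Scores are available, so the mean is defined.
```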

jluo-bgl · 2019-03-01T05:29:38Z

Hi, I can confirm that TensorFlow 1.13 and TensorFlow Probability 0.6.0 do not work: the script tools/test_overshooting.py does not pass and throws the exception "tensorflow attributeerror: 'template' object has no attribute 'updates'".

However, if I downgrade to tensorflow 1.12.0 and tensorflow-probability 0.5.0, test_overshooting.py passes.

Could you please provide a requirements.txt file for your environment? Thanks a lot.

I'm now getting the error below and would appreciate your help.

UnknownError (see above for traceback): RuntimeError: Cannot make context <dm_control._render.glfw_renderer.GLFWContext object at 0x13574b630> current on thread <_DummyThread(Dummy-5, started daemon 123145467191296)>: this context is already current on another thread <_DummyThread(Dummy-4, started daemon 123145466118144)>.
Traceback (most recent call last):

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)

  File "/Users/user_name/git/planet/planet/control/in_graph_batch_env.py", line 95, in <lambda>
    lambda a: self._batch_env.step(a)[:3], [action],

  File "/Users/user_name/git/planet/planet/control/batch_env.py", line 86, in step
    for env, action in zip(self._envs, actions)]

  File "/Users/user_name/git/planet/planet/control/batch_env.py", line 86, in <listcomp>
    for env, action in zip(self._envs, actions)]

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 90, in step
    obs, reward, done, info = self._env.step(action)

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 367, in step
    transition = self._env.step(action, *args, **kwargs)

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 445, in step
    observ, reward, done, info = self._env.step(action)

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 156, in step
    obs[self._key] = self._render_image()

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 165, in _render_image
    image = self._env.render('rgb_array')

  File "/Users/user_name/git/planet/planet/control/wrappers.py", line 261, in render
    *self._render_size, camera_id=self._camera_id)

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/dm_control/mujoco/engine.py", line 171, in render
    physics=self, height=height, width=width, camera_id=camera_id)

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/dm_control/mujoco/engine.py", line 574, in __init__
    with self._physics.contexts.gl.make_current() as ctx:

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/contextlib.py", line 59, in __enter__
    return next(self.gen)

  File "/Users/user_name/anaconda/envs/planet/lib/python3.5/site-packages/dm_control/_render/base.py", line 116, in make_current
    _CURRENT_THREAD_FOR_CONTEXT[id(self)]))

RuntimeError: Cannot make context <dm_control._render.glfw_renderer.GLFWContext object at 0x13574b630> current on thread <_DummyThread(Dummy-5, started daemon 123145467191296)>: this context is already current on another thread <_DummyThread(Dummy-4, started daemon 123145466118144)>.


	 [[node graph/collection/should_collect_cartpole_balance/simulate-1/train-cartpole_balance-cem-12/scan/while/simulate/environment/simulate/step (defined at /Users/user_name/git/planet/planet/control/in_graph_batch_env.py:96)  = PyFunc[Tin=[DT_FLOAT], Tout=[DT_UINT8, DT_FLOAT, DT_BOOL], token="pyfunc_7", _device="/job:localhost/replica:0/task:0/device:CPU:0"](graph/collection/should_collect_cartpole_balance/simulate-1/train-cartpole_balance-cem-12/scan/while/simulate/Identity_5)]]

danijar · 2019-03-02T03:07:08Z

Hi @JamesLuoau, these both sound like issues with other libraries. Please ask about the AttributeError on the TensorFlow Probability repo and for the multi-threaded rendering error on the dm_control repo. Neither of these happen for me under Python 3.5, TensorFlow 1.12.0, and TensorFlow Probability 0.5.0. If many people are experiencing this, please upvote the comment above this one.
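
Until a requirements file exists, the version pair reported as working in this thread can be checked with a few lines of Python. This is a sketch only: find_mismatches and KNOWN_GOOD are hypothetical names, and the installed versions are passed in as a plain dict (for example copied from the output of pip freeze) so that the check itself does not need TensorFlow.

```python
# Versions reported as working in this thread.
KNOWN_GOOD = {
    'tensorflow': '1.12.0',
    'tensorflow-probability': '0.5.0',
}


def find_mismatches(installed, expected=KNOWN_GOOD):
  """Map package name to (installed, expected) for every version that differs."""
  return {
      name: (installed.get(name), version)
      for name, version in expected.items()
      if installed.get(name) != version
  }


# The combination reported as failing earlier in this thread.
reported = {'tensorflow': '1.13', 'tensorflow-probability': '0.6.0'}
for name, (have, want) in sorted(find_mismatches(reported).items()):
  print('{}: installed {}, expected {}'.format(name, have, want))
```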

astronautas · 2019-03-10T17:04:37Z

@danijar, thanks to you and your team for such an interesting contribution to RL!

I am planning to scale up this implementation to a multi-agent environment to see how well it performs. I am facing the same problem as @JamesLuoau though. Here's the excerpt from the logs:

Traceback (most recent call last):

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/tensorflow/python/ops/script_ops.py", line 206, in __call__
    ret = func(*args)

  File "planet/control/in_graph_batch_env.py", line 95, in <lambda>
    lambda a: self._batch_env.step(a)[:3], [action],

  File "planet/control/batch_env.py", line 86, in step
    for env, action in zip(self._envs, actions)]

  File "planet/control/wrappers.py", line 90, in step
    obs, reward, done, info = self._env.step(action)

  File "planet/control/wrappers.py", line 367, in step
    transition = self._env.step(action, *args, **kwargs)

  File "planet/control/wrappers.py", line 445, in step
    observ, reward, done, info = self._env.step(action)

  File "planet/control/wrappers.py", line 156, in step
    obs[self._key] = self._render_image()

  File "planet/control/wrappers.py", line 165, in _render_image
    image = self._env.render('rgb_array')

  File "planet/control/wrappers.py", line 261, in render
    *self._render_size, camera_id=self._camera_id)

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/dm_control/mujoco/engine.py", line 171, in render
    physics=self, height=height, width=width, camera_id=camera_id)

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/dm_control/mujoco/engine.py", line 574, in __init__
    with self._physics.contexts.gl.make_current() as ctx:

  File "/home/username/miniconda3/envs/planet/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()

  File "/home/username/miniconda3/envs/planet/lib/python2.7/site-packages/dm_control/_render/base.py", line 116, in make_current
    _CURRENT_THREAD_FOR_CONTEXT[id(self)]))

RuntimeError: Cannot make context <dm_control._render.glfw_renderer.GLFWContext object at 0x7f7b417a0550> current on thread <_DummyThread(Dummy-5, started daemon 140166375134976)>: this context is already current on another thread <_DummyThread(Dummy-4, started daemon 140166358349568)>.


	 [[node graph/collection/should_collect_cheetah_run/simulate-1/train-cheetah_run-cem-12/scan/while/simulate/environment/simulate/step (defined at planet/control/in_graph_batch_env.py:96)  = PyFunc[Tin=[DT_FLOAT], Tout=[DT_UINT8, DT_FLOAT, DT_BOOL], token="pyfunc_7", _device="/job:localhost/replica:0/task:0/device:CPU:0"](graph/collection/should_collect_cheetah_run/simulate-1/train-cheetah_run-cem-12/scan/while/simulate/Identity_5/_847)]]
	 [[{{node graph/summaries/general/sub/ReadVariableOp/_441}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4356_graph/summaries/general/sub/ReadVariableOp", tensor_type=DT_DOUBLE, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

It seems the visualizations cannot start. With the debug configuration, it runs until the 15th step and then crashes. I am running with --config debug, so this kicks in when testing starts. Maybe it has something to do with the workers? Or maybe with the transition from train to test?

Could you specify these things:

All the versions of "install_requires" dependencies.
Used dmcontrol rendering option (https://github.com/deepmind/dm_control#rendering)
[if nvidia] Nvidia drivers version.
OS and its version.
Mujoco Pro version.

Thank you :)

@JamesLuoau have you managed to fix this?

danijar · 2019-03-12T19:54:50Z

Thanks for letting me know. I will look into this but it will take a couple of days before I get to it. For now, I think everything works using the previous version of TensorFlow and TensorFlow Probability. I mentioned the versions for this above. I'm using the egl rendering option for dm_control.

astronautas · 2019-03-12T21:16:00Z

Thanks @danijar 🥇. I use TensorFlow 1.12.0 and TensorFlow Probability 0.5.0 as well. I am starting to think this might be an OS configuration issue, who knows 🤷‍♂️. Please share your MuJoCo Pro, dm_control, and mujoco-py versions too, since the mujoco-py repo states that it needs mjpro 1.5.0, yet dm_control depends on version 2.0.0. Knowing your Nvidia driver version would be nice as well, as EGL needs recent Nvidia drivers to work properly.

jluo-bgl · 2019-03-13T01:23:11Z

Hi @astronautas, I haven't found a good way to run it yet. As you mentioned, I have to use MuJoCo 2.0.0.

danijar · 2019-03-13T01:34:04Z

@astronautas and @JamesLuoau To debug this further, could you please confirm that you can create a dm_control environment and call render on it (outside of the PlaNet code)? I have both mjpro150 and mjpro200_linux installed on my machine but I think only the latter is used by dm_control. The PlaNet code is independent of the dm_control render option and should work with all of them as long as they support multi-threading -- I've used multiple options at some point.

lunar24 · 2019-03-14T07:59:37Z

Hi @danijar. As for configuration, I first installed MuJoCo 150 and then mujoco-py. After that I installed MuJoCo 200 and dm_control. This is the only way I could think of, and I don't know whether it will cause any problems.
In addition, when I run test_planet with PyCharm, I get the error "AttributeError: 'PlanetTest' object has no attribute 'create_tempdir'". I haven't found a suitable solution online. Do you know where the problem is?
Thanks very much.

danijar · 2019-03-14T16:23:15Z

@lunar24 You can install multiple MuJoCo versions by placing them into ~/.mujoco/. You also don't need mujoco-py to run the code as dm_control comes with its own bindings. To see if the code works, just run the command provided in the readme. If you do want to run the tests you need to call them as e.g. python3 -m scripts.test_planet.

astronautas · 2019-03-15T07:10:05Z

I'll have some time this weekend for this. I'll post back the results.

lunar24 · 2019-03-16T11:06:08Z

@danijar
Hi, when I was running the program, I encountered an error about the process. Here are some error hints.

I have read a lot about this error but have not been able to solve it. I am also concerned that changing the code that launches the process may cause other problems, so I hope to get your advice. Thank you very much for your help.

astronautas · 2019-03-16T11:10:12Z

@danijar I can confirm these things at the moment:

dmcontrol.viewer works (https://github.com/deepmind/dm_control/blob/master/dm_control/viewer/README.md). I use the glfw and glew rendering option.
Rendering does work. Here are the screenshots, an environment snapshot, and some of the printed-out rewards. The code I've used is as follows:

from dm_control import suite
import numpy as np
from dm_control import viewer
from threading import Thread
import cv2

def rewards(env):
  # Step through an episode and print out reward, discount and observation.
  action_spec = env.action_spec()
  time_step = env.reset()

  while not time_step.last():
    action = np.random.uniform(action_spec.minimum,
                              action_spec.maximum,
                              size=action_spec.shape)

    time_step = env.step(action)
    img = env.physics.render()

    cv2.imshow("img", img)
    cv2.waitKey(0)

    print(time_step.reward)

  print("END")

# Load one task:
env = suite.load(domain_name="cartpole", task_name="swingup")

# Iterate over a task set:
for domain_name, task_name in suite.BENCHMARKING:
  env = suite.load(domain_name, task_name)

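# Run the episode in a background thread. Thread.start() returns None,
# so keep a reference to the Thread object itself.
thread = Thread(target=rewards, args=(env,))
thread.start()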

# viewer.launch(env)

Environment:

Ubuntu 16.04
Hardware rendering with a windowing system is supported via GLFW and GLEW
Mujoco Pro 2.0.0
Everything else is the same as yours.

@danijar, could you please verify again that the code works both with ExternalProcess and without it for the environment? I suspect that launching the environment in a separate process alleviates the rendering problem, since the logs suggest that the same context is made current on multiple threads within one process. However, neither I nor @JamesLuoau can successfully launch the environment in a separate process.
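
One way to picture the thread-affinity constraint described above is a wrapper that funnels every environment call through a single dedicated thread, so a thread-bound resource such as a GL context is only ever touched from that thread. This is a minimal sketch under that assumption, not part of PlaNet or dm_control, and whether it resolves the error above is untested. SingleThreadEnv and FakeEnv are hypothetical names; FakeEnv only records which thread runs each call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SingleThreadEnv(object):
  """Runs the constructor and every method call on one dedicated thread."""

  def __init__(self, env_ctor):
    self._executor = ThreadPoolExecutor(max_workers=1)
    # Build the env on the worker thread so thread-bound resources live there.
    self._env = self._call(env_ctor)

  def _call(self, function, *args):
    # Blocks the calling thread until the worker thread has finished the call.
    return self._executor.submit(function, *args).result()

  def reset(self):
    return self._call(self._env.reset)

  def step(self, action):
    return self._call(self._env.step, action)

  def close(self):
    self._executor.shutdown(wait=True)


class FakeEnv(object):
  """Stand-in environment that records which thread executes each call."""

  def __init__(self):
    self.thread_ids = [threading.get_ident()]

  def reset(self):
    self.thread_ids.append(threading.get_ident())
    return 0

  def step(self, action):
    self.thread_ids.append(threading.get_ident())
    return action


env = SingleThreadEnv(FakeEnv)
env.reset()
# Step the environment from several different caller threads.
callers = [threading.Thread(target=env.step, args=(i,)) for i in range(4)]
for caller in callers:
  caller.start()
for caller in callers:
  caller.join()
thread_ids = env._env.thread_ids
env.close()
print(len(set(thread_ids)))  # Number of distinct threads that touched the env.
```

The trade-off is that calls are serialized, so this gives no parallelism across steps of the same environment.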

EDIT: correct me if I'm wrong, that's how I see the current implementation:

There are 2 processes communicating with each other: training_process <-----> worker (environment).

Problem:
It seems that when the episodes are collected, the environment process never receives the last reset message (the one just before the close message). It is sent and received when training starts, yet sent and never received when the epoch is about to end.

Maybe there's something incorrect with how the external methods on the environment get called? I am not sure whether that's the case but could you verify whether the environment process always writes to its end of pipe while the reinforcement learning process always writes to its own end of pipe?
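
For reference, the request and reply pattern described above can be sketched as follows. This is an illustration under assumptions, not the ExternalProcess implementation: the message names CALL, RESULT, and CLOSE, the helper call, and CountingEnv are hypothetical, and the worker runs in a thread only to keep the example self-contained (the same worker_loop could be used as the target of a multiprocessing.Process). Each side writes only to its own end of the pipe.

```python
import multiprocessing
import threading

CALL, RESULT, CLOSE = 'call', 'result', 'close'


def worker_loop(env_ctor, conn):
  """Serve method calls on the environment until CLOSE or a hang-up arrives."""
  env = env_ctor()
  try:
    while True:
      try:
        kind, payload = conn.recv()
      except EOFError:  # The other end went away without sending CLOSE.
        break
      if kind == CLOSE:
        break
      name, args = payload
      conn.send((RESULT, getattr(env, name)(*args)))
  finally:
    conn.close()


class CountingEnv(object):
  """Stand-in environment that counts how often reset was called."""

  def __init__(self):
    self.resets = 0

  def reset(self):
    self.resets += 1
    return self.resets

  def step(self, action):
    return action


def call(conn, name, *args):
  """Send one request and block until the matching reply is received."""
  conn.send((CALL, (name, args)))
  kind, value = conn.recv()
  assert kind == RESULT
  return value


trainer_end, env_end = multiprocessing.Pipe()
worker = threading.Thread(target=worker_loop, args=(CountingEnv, env_end))
worker.start()
replies = [
    call(trainer_end, 'reset'),
    call(trainer_end, 'step', 0.5),
    call(trainer_end, 'reset'),  # The final reset just before closing.
]
trainer_end.send((CLOSE, None))
worker.join()
print(replies)
```

Logging every received message inside worker_loop is a cheap way to check whether the last reset actually arrives on the environment side.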

danijar · 2019-03-17T16:47:21Z

@astronautas and @JamesLuoau Let's move this conversation over to #5 since the thread here got a bit confusing. I've responded to your questions there.

@lunar24 Thanks for reporting this. To keep the threads focused, I started a new ticket for your issue: #6. Please provide the details I asked for there so we can try to resolve this.

danijar changed the title ~~Hi, Could you please state what's your testing environment?~~ Test system environment Feb 26, 2019

danijar closed this as completed Mar 1, 2019

danijar reopened this Mar 12, 2019

This was referenced Mar 17, 2019

ConnectionResetError: Connection reset by peer #6

Closed

Cannot make context current on thread #5

Closed

danijar closed this as completed Mar 17, 2019

danijar changed the title ~~Test system environment~~ Multiprocessing EOF error Mar 17, 2019
