segmentation fault #66

Open
samarth-robo opened this issue Aug 19, 2021 · 23 comments

@samarth-robo

samarth-robo commented Aug 19, 2021

Hello,

I am using reverb 0.4.0 in tf-agents 0.9.0 through the ReverbReplayBuffer and ReverbAddTrajectoryObserver. Ray actors push experience to the reverb server. Currently, though, it is configured to have just one actor, and experience pushing completes in a blocking manner before agent training in the main loop.
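Roughly, the wiring looks like this (a simplified sketch of my setup; the table name, server address, and sequence length are placeholders for the real values):

```python
import reverb
from tf_agents.replay_buffers import reverb_replay_buffer, reverb_utils

TABLE = 'uniform_table'      # placeholder table name
SERVER = 'localhost:8000'    # placeholder address of the reverb server

# Inside the (single) Ray actor: the driver pushes trajectories through this
# observer, and the actor blocks until pushing for the iteration is done.
observer = reverb_utils.ReverbAddTrajectoryObserver(
    reverb.Client(SERVER), TABLE, sequence_length=2)

# In the main process: sample from the same table to train the agent.
replay_buffer = reverb_replay_buffer.ReverbReplayBuffer(
    agent.collect_data_spec,   # collect_data_spec of the tf-agents agent (not shown)
    table_name=TABLE,
    sequence_length=2,
    server_address=SERVER)
dataset = replay_buffer.as_dataset(sample_batch_size=256, num_steps=2)
```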

I am seeing segmentation faults happening at random times, always within the main process that samples from reverb to train the agent.

I was wondering if you have some hints about why this might be happening, or where I can start debugging?

*** SIGSEGV received at time=1629367150 on cpu 4 ***
PC: @     0x7f328092219f  (unknown)  deepmind::reverb::(anonymous namespace)::LocalSamplerWorker::FetchSamples()
    @     0x7f32d0810980       2320  (unknown)
    @     0x7f328091a14f        144  deepmind::reverb::Sampler::RunWorker()
    @     0x7f32cd7cd039  (unknown)  execute_native_thread_routine
    @     0x7f2be000ec70  (unknown)  (unknown)
    @     0x7f3280965ca0  (unknown)  (unknown)
    @ 0x75058b4808ec8348  (unknown)  (unknown)
Segmentation fault (core dumped)
@elhamAm

elhamAm commented Aug 30, 2021

same here

@acassirer
Collaborator

@ebrevdo Any ideas here? I don't know anything about the details of tf-agents, but the only thing I can think of that could cause a segmentation fault would be if the shape of the data doesn't match expectations.

WDYT?

@ebrevdo
Collaborator

ebrevdo commented Aug 31, 2021

@samarth-robo When you create the Table, do you pass it a signature kwarg (here's one example)? If you provide a signature, some of those segfaults may turn into proper errors.
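Something along these lines (a minimal sketch; the table parameters are placeholders, and `agent` stands in for your tf-agents agent):

```python
import reverb
from tf_agents.specs import tensor_spec

# Derive the per-item signature from the agent's collect_data_spec and add the
# outer (time) dimension carried by each written sequence.
signature = tensor_spec.add_outer_dim(
    tensor_spec.from_spec(agent.collect_data_spec))

table = reverb.Table(
    'uniform_table',                          # placeholder table name
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=100_000,                         # placeholder capacity
    rate_limiter=reverb.rate_limiters.MinSize(1),
    signature=signature)                      # lets the server validate inserted data
```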

@tfboyd I wonder if part of our release/nightly build could be to create a debug version of reverb and upload it to a GCS bucket, so when people get segfaults like this we can point them to a debug build they can use? Debug builds come out to ~200MB each. Perhaps we can just do it for release versions?

@samarth-robo @elhamAm do you have a small repro we can use to try and debug?

@samarth-robo
Author

@ebrevdo yes, I give it the signature, following this example. Unfortunately I cannot share the env, but the rest should be OK. I will work on putting an example together. Repro is made harder by the fact that the segfault happens randomly after a few hours of training.

@ebrevdo
Collaborator

ebrevdo commented Aug 31, 2021 via email

@samarth-robo
Author

@ebrevdo I made this repository to reproduce the issue. It uses a random environment, but (un)fortunately I have not seen a segfault with that code yet.

When I examined the segfaults with GDB, the backtraces pointed to reverb. But this new information suggests the issue might be in the env (which uses MuJoCo and robosuite) or its interaction with tf-agents? I am not sure now.

I have not worked with Bazel before, so I would appreciate a pip wheel. Thanks!

@samarth-robo
Author

Oh, actually I ran the above repository again, and it crashed. Unfortunately I was not running it under GDB this time. Here is the error message:

INFO:__main__:Train 4789/15000: reward=1.1835, episode_length=125.0000, steps=598625.0000, episodes=4789.0000, collect_speed=619.5698, train_speed=68.3351, loss=0.9424
Traceback (most recent call last):
  File "/home/xxx/research/reverb_segfault_repro/trainer.py", line 238, in <module>
    trainer.train()
  File "/home/xxx/research/reverb_segfault_repro/trainer.py", line 183, in train
    losses: LossInfo = learner.run(iterations=n_sgd_steps)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tf_agents/train/learner.py", line 246, in run
    loss_info = self._train(iterations, iterator, parallel_iterations)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 924, in _call
    results = self._stateful_fn(*args, **kwds)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 3039, in __call__
    return graph_function._call_flat(
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 1963, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Cannot unpack column 4097 in chunk 2341602335940456937 which has 6 columns.
         [[{{node while/body/_103/while/IteratorGetNext}}]] [Op:__inference__train_327771]

Function call stack:
_train

[reverb/cc/platform/default/server.cc:84] Shutting down replay server
Segmentation fault (core dumped)

This is not the same error as the one I mentioned at the top, but I have experienced this one before.

I will run it with GDB again and report a stack trace if possible. @ebrevdo is that repository useful to you?

@ebrevdo
Collaborator

ebrevdo commented Sep 6, 2021 via email

@samarth-robo
Author

samarth-robo commented Sep 6, 2021

@ebrevdo it is not crashing after I removed the asynchronicity, i.e. ensured that data was not being written to the reverb replay buffer while it was being sampled from.
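Concretely, the loop now alternates strictly between the two phases, roughly like this (a sketch; `collector` stands in for my Ray actor and `learner` for the tf-agents Learner):

```python
for it in range(num_iterations):
    # Collection phase: block until the Ray actor has finished pushing
    # experience to the reverb server.
    ray.get(collector.collect.remote(steps_per_iteration))
    # Only then sample from the replay buffer and run SGD steps.
    losses = learner.run(iterations=n_sgd_steps)
```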
So I think this issue can be closed now.
I will leave that repository up in case anyone wants to use it.
Thanks for your help!

@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021

I'll leave this open until we can figure out what's going on or move everyone over to TrajectoryWriter. Thanks for the report and the repro, and for the additional details about parallel write/read (that should work just fine).
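For anyone who wants to try it, the TrajectoryWriter pattern looks roughly like this (a sketch; the table name, columns, and shapes are placeholders):

```python
import numpy as np
import reverb

client = reverb.Client('localhost:8000')  # placeholder address

# Keep references to the last 2 steps so each item can span a 2-step window.
with client.trajectory_writer(num_keep_alive_refs=2) as writer:
    for step in range(100):
        writer.append({'observation': np.zeros(17, np.float32),
                       'action': np.zeros(6, np.float32)})
        if step >= 1:
            writer.create_item(
                table='uniform_table',    # placeholder table name
                priority=1.0,
                trajectory={
                    'observation': writer.history['observation'][-2:],
                    'action': writer.history['action'][-2:],
                })
    writer.flush()
```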

@ebrevdo ebrevdo reopened this Sep 7, 2021
@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021

I think the error there is that we're reading from bad memory, and it's likely related to the segfaults. The 4097 column number is extremely suspicious.

@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021

@qstanczyk could this be related to your PR "Batch sample responses"?

@samarth-robo
Author

> I'll leave this open until we can figure out what's going on or move everyone over to TrajectoryWriter. Thanks for the report and the repro, and for the additional details about parallel write/read (that should work just fine).

The reason I thought concurrent write/read would not work is this blue note in the documentation of ReverbReplayBuffer.as_dataset():

[Screenshot: the note from the ReverbReplayBuffer.as_dataset() documentation]

If you want to test concurrent write/read, revert commit ebfba0ab7c474b3831279a07e5e65e8af98f4269.

@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021 via email

@ebrevdo
Collaborator

ebrevdo commented Sep 8, 2021

I checked out your repo (the initial check-in, not the subsequent one), installed the same version of TF-Agents, and built my own Reverb 0.4.0, using TF 2.6.0 and Python 3.9. I'm running trainer.py and so far I haven't seen any errors or segfaults; I'm at train step 1695/15000. How long until you saw the error?

@samarth-robo
Author

Thanks, I also have Python 3.9.5 and TF 2.6.0. I'm assuming tf-agents 0.9.0 pulls in reverb 0.4.0? It is difficult to find out the version of pip-installed reverb.

I usually saw the segfault a little later, around iteration 5000; for example, the one above was at 4789. It does not happen every time, but I have seen it at least twice.

@qstanczyk
Collaborator

> @qstanczyk could this be related to your PR "Batch sample responses"?

This change is two months old and I haven't seen any segfaults when running quite a few tests/benchmarks with Reverb. You never know... but it seems low probability.

@samarth-robo
Author

@ebrevdo I was wondering whether you have been able to reproduce this? I do see these segfaults intermittently. Here is another one, which provides some more information; this one had non-concurrent read/write.

*** SIGSEGV received at time=1631229295 on cpu 0 ***
PC: @     0x7fcd62ab6a8f  (unknown)  deepmind::reverb::internal::UnpackChunkColumn()
    @     0x7fcdb2966980       1792  (unknown)
    @     0x7fcd62ab6e41        320  deepmind::reverb::internal::UnpackChunkColumnAndSlice()
    @     0x7fcd62ab7274         32  deepmind::reverb::internal::UnpackChunkColumnAndSlice()
    @     0x7fcd62a78165        880  deepmind::reverb::(anonymous namespace)::LocalSamplerWorker::FetchSamples()
    @     0x7fcd62a7014f        144  deepmind::reverb::Sampler::RunWorker()
    @     0x7fcdaf923039  (unknown)  execute_native_thread_routine
    @     0x7fc66800f0a0  (unknown)  (unknown)
    @     0x7fcd62abbca0  (unknown)  (unknown)
    @ 0x75058b4808ec8348  (unknown)  (unknown)
Segmentation fault (core dumped)

@ebrevdo
Collaborator

ebrevdo commented Sep 29, 2021

@samarth-robo are you running with a -g2 compiled reverb? Any chance you could run this in gdb and get a full stack trace? I wonder if that would give more info. @tfboyd do we have debug-build pip packages available now?

@ebrevdo
Collaborator

ebrevdo commented Sep 29, 2021

Looks like you're trying to access something that's been freed. gdb may help us identify what object it is. My guess is that either chunk_data disappears out from under you, or the output tensor out is nullptr. Here's the code.

@samarth-robo
Author

@ebrevdo here is a GDB session I had copy-pasted into a Google doc some time ago. It is not the exact same session mentioned in my last comment, but the errors are in the same UnpackChunkColumn, so maybe it will be helpful: https://docs.google.com/document/d/1SgplUHFUleQncjRc-aZRRJfrQBp9SxyJ7F7jdL8sGqA/edit?usp=sharing.

@samarth-robo
Author

To answer your other question, I get reverb from pip. I am not sure if that one has been compiled with -g2.

@samarth-robo
Author

Checkpointing + daemontools/supervise is an effective workaround for such crashes; supervise automatically restarts the training if it crashes.
A working example building on TF-Agents' example distributed learning code is available here: https://github.com/samarth-robo/sac_utils.
