segmentation fault #66

Open
samarth-robo opened this issue Aug 19, 2021 · 23 comments

@samarth-robo

samarth-robo commented Aug 19, 2021

Hello,

I am using reverb 0.4.0 in tf-agents 0.9.0 through the ReverbReplayBuffer and ReverbAddTrajectoryObserver. Ray actors push experience to the reverb server. Currently, though, it is configured to have just one actor, and experience pushing completes in a blocking manner before agent training in the main loop.
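Roughly, the wiring looks like this (a simplified sketch of my setup; the table name, server address, and sequence length are placeholders for the real values):

```python
import reverb
from tf_agents.replay_buffers import reverb_replay_buffer, reverb_utils

TABLE = 'uniform_table'      # placeholder table name
SERVER = 'localhost:8000'    # placeholder address of the reverb server

# Inside the (single) Ray actor: the driver pushes trajectories through this
# observer, and the actor blocks until pushing for the iteration is done.
observer = reverb_utils.ReverbAddTrajectoryObserver(
    reverb.Client(SERVER), TABLE, sequence_length=2)

# In the main process: sample from the same table to train the agent.
replay_buffer = reverb_replay_buffer.ReverbReplayBuffer(
    agent.collect_data_spec,   # collect_data_spec of the tf-agents agent (not shown)
    table_name=TABLE,
    sequence_length=2,
    server_address=SERVER)
dataset = replay_buffer.as_dataset(sample_batch_size=256, num_steps=2)
```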

I am seeing segmentation faults happening at random times, always within the main process that samples from reverb to train the agent.

I was wondering if you have some hints about why this might be happening, or where I can start debugging?

*** SIGSEGV received at time=1629367150 on cpu 4 ***
PC: @     0x7f328092219f  (unknown)  deepmind::reverb::(anonymous namespace)::LocalSamplerWorker::FetchSamples()
    @     0x7f32d0810980       2320  (unknown)
    @     0x7f328091a14f        144  deepmind::reverb::Sampler::RunWorker()
    @     0x7f32cd7cd039  (unknown)  execute_native_thread_routine
    @     0x7f2be000ec70  (unknown)  (unknown)
    @     0x7f3280965ca0  (unknown)  (unknown)
    @ 0x75058b4808ec8348  (unknown)  (unknown)
Segmentation fault (core dumped)
@elhamAm

elhamAm commented Aug 30, 2021

same here

@acassirer
Collaborator

@ebrevdo Any ideas here? I don't know anything about the details of tf-agents, but the only thing I can think of that could cause a segmentation fault would be if the shape of the data doesn't match expectations.

WDYT?

@ebrevdo
Collaborator

ebrevdo commented Aug 31, 2021

@samarth-robo When you create the Table, do you pass it a signature kwarg (here's one example)? If you provide a signature, some of those segfaults may turn into proper errors.
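Something along these lines (a minimal sketch; the table parameters are placeholders, and `agent` stands in for your tf-agents agent):

```python
import reverb
from tf_agents.specs import tensor_spec

# Derive the per-item signature from the agent's collect_data_spec and add the
# outer (time) dimension carried by each written sequence.
signature = tensor_spec.add_outer_dim(
    tensor_spec.from_spec(agent.collect_data_spec))

table = reverb.Table(
    'uniform_table',                          # placeholder table name
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    max_size=100_000,                         # placeholder capacity
    rate_limiter=reverb.rate_limiters.MinSize(1),
    signature=signature)                      # lets the server validate inserted data
```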

@tfboyd I wonder if part of our release/nightly build could be to create a debug version of reverb and upload it to a GCS bucket, so when people get segfaults like this we can point them to a debug build they can use? Debug builds come out to ~200MB each. Perhaps we can just do it for release versions?

@samarth-robo @elhamAm do you have a small repro we can use to try and debug?

@samarth-robo
Author

@ebrevdo yes, I give it the signature, following this example. Unfortunately I cannot share the env, but the rest should be OK. I will work on putting an example together. Repro is made harder by the fact that the segfault happens randomly after a few hours of training.

@ebrevdo
Collaborator

ebrevdo commented Aug 31, 2021 via email

@samarth-robo
Author

@ebrevdo I made this repository to reproduce the issue. It uses a random environment, but (un)fortunately I have not seen a segfault with that code yet.

When I examined the segfaults with GDB, the backtraces pointed to reverb. But this new information suggests the issue might be in the env (which uses MuJoCo and robosuite) or its interaction with tf-agents? I am not sure now.

I have not worked with Bazel before, so I would appreciate a pip wheel. Thanks!

@samarth-robo
Author

Oh, actually I ran the above repository again, and it crashed. Unfortunately I was not running it under GDB this time. Here is the error message:

INFO:__main__:Train 4789/15000: reward=1.1835, episode_length=125.0000, steps=598625.0000, episodes=4789.0000, collect_speed=619.5698, train_speed=68.3351, loss=0.9424
Traceback (most recent call last):
  File "/home/xxx/research/reverb_segfault_repro/trainer.py", line 238, in <module>
    trainer.train()
  File "/home/xxx/research/reverb_segfault_repro/trainer.py", line 183, in train
    losses: LossInfo = learner.run(iterations=n_sgd_steps)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tf_agents/train/learner.py", line 246, in run
    loss_info = self._train(iterations, iterator, parallel_iterations)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/def_function.py", line 924, in _call
    results = self._stateful_fn(*args, **kwds)
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 3039, in __call__
    return graph_function._call_flat(
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 1963, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/function.py", line 591, in call
    outputs = execute.execute(
  File "/home/xxx/miniconda3/envs/reverb_segfault_repro/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Cannot unpack column 4097 in chunk 2341602335940456937 which has 6 columns.
         [[{{node while/body/_103/while/IteratorGetNext}}]] [Op:__inference__train_327771]

Function call stack:
_train

[reverb/cc/platform/default/server.cc:84] Shutting down replay server
Segmentation fault (core dumped)

This is not the same error as the one I mentioned at the top, but I have experienced this one before.

I will run it with GDB again and report a stack trace if possible. @ebrevdo is that repository useful to you?

@ebrevdo
Collaborator

ebrevdo commented Sep 6, 2021 via email

@samarth-robo
Author

samarth-robo commented Sep 6, 2021

@ebrevdo it is not crashing after I removed the asynchronicity, i.e. ensured that data was not being written to the reverb replay buffer while it was being sampled from.
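Concretely, the loop now alternates strictly between the two phases, roughly like this (a sketch; `collector` stands in for my Ray actor and `learner` for the tf-agents Learner):

```python
for it in range(num_iterations):
    # Collection phase: block until the Ray actor has finished pushing
    # experience to the reverb server.
    ray.get(collector.collect.remote(steps_per_iteration))
    # Only then sample from the replay buffer and run SGD steps.
    losses = learner.run(iterations=n_sgd_steps)
```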
So I think this issue can be closed now.
I will leave that repository up in case anyone wants to use it.
Thanks for your help!

@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021

I'll leave this open until we can figure out what's going on or move everyone over to TrajectoryWriter. Thanks for the report and the repro, and for the additional details about parallel write/read (that should work just fine).
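For anyone who wants to try it, the TrajectoryWriter pattern looks roughly like this (a sketch; the table name, columns, and shapes are placeholders):

```python
import numpy as np
import reverb

client = reverb.Client('localhost:8000')  # placeholder address

# Keep references to the last 2 steps so each item can span a 2-step window.
with client.trajectory_writer(num_keep_alive_refs=2) as writer:
    for step in range(100):
        writer.append({'observation': np.zeros(17, np.float32),
                       'action': np.zeros(6, np.float32)})
        if step >= 1:
            writer.create_item(
                table='uniform_table',    # placeholder table name
                priority=1.0,
                trajectory={
                    'observation': writer.history['observation'][-2:],
                    'action': writer.history['action'][-2:],
                })
    writer.flush()
```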

@ebrevdo ebrevdo reopened this Sep 7, 2021
@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021

I think the error there is that we're reading from bad memory, and it's likely related to the segfaults. The 4097 column number is extremely suspicious.

@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021

@qstanczyk could this be related to your PR "Batch sample responses"?

@samarth-robo
Author

> I'll leave this open until we can figure out what's going on or move everyone over to TrajectoryWriter. Thanks for the report and the repro, and for the additional details about parallel write/read (that should work just fine).

The reason I thought concurrent write/read would not work is this blue note in the documentation of ReverbReplayBuffer.as_dataset():

[Screenshot: the note from the ReverbReplayBuffer.as_dataset() documentation]

If you want to test concurrent write/read, revert commit ebfba0ab7c474b3831279a07e5e65e8af98f4269.

@ebrevdo
Collaborator

ebrevdo commented Sep 7, 2021 via email

@ebrevdo
Collaborator

ebrevdo commented Sep 8, 2021

I checked out your repo (the initial check-in, not the subsequent one), installed the same version of TF-Agents, and built my own Reverb 0.4.0, using TF 2.6.0 and Python 3.9. I'm running trainer.py and so far I haven't seen any errors or segfaults; I'm at train step 1695/15000. How long until you saw the error?

@samarth-robo
Author

Thanks, I also have Python 3.9.5 and TF 2.6.0. I'm assuming tf-agents 0.9.0 pulls in reverb 0.4.0? It is difficult to find out the version of pip-installed reverb.

I usually saw the segfault a little later, around iteration 5000; for example, the one above was at 4789. It does not happen every time, but I have seen it at least twice.

@qstanczyk
Collaborator

> @qstanczyk could this be related to your PR "Batch sample responses"?

This change is two months old and I haven't seen any segfaults when running quite a few tests/benchmarks with Reverb. You never know... but it seems low probability.

@samarth-robo
Author

@ebrevdo I was wondering whether you have been able to reproduce this? I do see these segfaults intermittently. Here is another one, which provides some more information; this one had non-concurrent read/write.

*** SIGSEGV received at time=1631229295 on cpu 0 ***
PC: @     0x7fcd62ab6a8f  (unknown)  deepmind::reverb::internal::UnpackChunkColumn()
    @     0x7fcdb2966980       1792  (unknown)
    @     0x7fcd62ab6e41        320  deepmind::reverb::internal::UnpackChunkColumnAndSlice()
    @     0x7fcd62ab7274         32  deepmind::reverb::internal::UnpackChunkColumnAndSlice()
    @     0x7fcd62a78165        880  deepmind::reverb::(anonymous namespace)::LocalSamplerWorker::FetchSamples()
    @     0x7fcd62a7014f        144  deepmind::reverb::Sampler::RunWorker()
    @     0x7fcdaf923039  (unknown)  execute_native_thread_routine
    @     0x7fc66800f0a0  (unknown)  (unknown)
    @     0x7fcd62abbca0  (unknown)  (unknown)
    @ 0x75058b4808ec8348  (unknown)  (unknown)
Segmentation fault (core dumped)

@ebrevdo
Collaborator

ebrevdo commented Sep 29, 2021

@samarth-robo are you running with a -g2 compiled reverb? Any chance you could run this in gdb and get a full stack trace? I wonder if that would give more info. @tfboyd do we have debug-build pip packages available now?

@ebrevdo
Collaborator

ebrevdo commented Sep 29, 2021

Looks like you're trying to access something that's been freed. gdb may help us identify what object it is. My guess is that either chunk_data disappears out from under you, or the output tensor out is nullptr. Here's the code.

@samarth-robo
Author

@ebrevdo here is a GDB session I had copy-pasted into a Google doc some time ago. It is not the exact same session mentioned in my last comment, but the errors are in the same UnpackChunkColumn, so maybe it will be helpful: https://docs.google.com/document/d/1SgplUHFUleQncjRc-aZRRJfrQBp9SxyJ7F7jdL8sGqA/edit?usp=sharing.

@samarth-robo
Author

To answer your other question, I get reverb from pip. I am not sure if that one has been compiled with -g2.

@samarth-robo
Author

Checkpointing + daemontools/supervise is an effective workaround for such crashes; supervise automatically restarts the training if it crashes.
A working example building on TF-Agents' example distributed learning code is available here: https://github.com/samarth-robo/sac_utils.
