[TSAN] pthread_atfork callbacks are causing deadlocks. #1116

Open
tau0 opened this issue Jul 2, 2019 · 5 comments

@tau0

tau0 commented Jul 2, 2019

Some context: we have an alarm that kills the binary after 240s with std::exit(1), but the binary seems to get stuck during this kill.

It looks very suspicious that we are stuck trying to acquire a lock in ReportRace; this may even be the reason we hit the timeout. See the trace:

  1. std::exit(1) after 240s
  2. Wait for some time ~400s
  3. get a stack trace and kill it with SIGKILL.
  Timing out. Killing processes in state: S (sleeping)
   Attempting to pull stack traces: 
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/usr/local/fbcode/platform007/lib/libthread_db.so.1".
   0x000056399741cdd6 in __sanitizer::BlockingMutex::Lock() ()

   Thread 1 (Thread 0x7fb1459001c0 (LWP 2314618)):
   #0  0x000056399741cdd6 in __sanitizer::BlockingMutex::Lock() ()
   #1  0x0000563997487350 in __tsan::ReportRace(__tsan::ThreadState*) ()
   #2  0x000056399748c22d in __tsan_report_race_thunk ()
   #3  0x000056399747ff11 in __tsan_write8 ()
   #4  0x00005639958f2868 in folly::threadlocal_detail::StaticMeta<void, void>::onForkChild() ()
   #5  0x00005639958e7aac in void folly::detail::function::FunctionTraits<void ()>::callSmall<void (*)()>(folly::detail::function::Data&) ()
   #6  0x00005639970e120c in folly::detail::(anonymous namespace)::AtForkList::child() ()
   #7  0x00007fb14333a602 in fork () from /usr/local/fbcode/platform007/lib/libc.so.6
   #8  0x00005639974339b3 in fork ()
   #9  0x000056399704c90a in folly::Subprocess::spawnInternal(std::unique_ptr<char const* [], std::default_delete<char const* []> >, char const*, folly::Subprocess::Options&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, int) ()
   #10 0x000056399704bb96 in folly::Subprocess::spawn(std::unique_ptr<char const* [], std::default_delete<char const* []> >, char const*, folly::Subprocess::Options const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*) ()
   #11 0x000056399704b9e4 in folly::Subprocess::Subprocess(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, folly::Subprocess::Options const&, char const*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*) ()
   #12 0x0000563995e90162 in facebook::logdevice::IntegrationTestUtils::Node::start() ()
   #13 0x0000563995e8562e in facebook::logdevice::IntegrationTestUtils::Cluster::start(std::vector<short, std::allocator<short> >) ()
   #14 0x0000563995e4ecde in (anonymous namespace)::TailRecordIntegrationTest_SequencerFailOver_Test::TestBody() ()
   #15 0x00005639974b8ebf in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
   #16 0x00005639974aa6ab in testing::Test::Run() ()
   #17 0x00005639974aa801 in testing::TestInfo::Run() ()
   #18 0x00005639974aa914 in testing::TestCase::Run() ()
   #19 0x00005639974ab034 in testing::internal::UnitTestImpl::RunAllTests() ()
   #20 0x00005639974ab2b5 in testing::UnitTest::Run() ()
   #21 0x0000563996f94c97 in main ()

PS: Compiled with clang 8.0.
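
Frames #4 to #7 in the trace show a fork child callback running while fork() is still on the stack. The sketch below illustrates that general behaviour with plain pthread_atfork, assuming Linux with glibc. The names atfork_demo, OnForkChild and ChildHandlerRunsInsideFork are hypothetical and are not folly or TSan code; the sketch only shows that a child handler has already run by the time fork() returns in the child.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>

namespace atfork_demo {  // hypothetical names, for illustration only

// Written by the child handler, read in the child right after fork() returns.
volatile int child_handler_ran = 0;

// Child-side handler: runs in the new process before fork() returns there.
void OnForkChild() { child_handler_ran = 1; }

// Returns 1 if the child saw the handler's write as soon as fork() returned,
// 0 if it did not, and -1 if a system call failed.
int ChildHandlerRunsInsideFork() {
  // Register exactly once; pthread_atfork handlers cannot be unregistered.
  static const int registered = pthread_atfork(nullptr, nullptr, OnForkChild);
  if (registered != 0) return -1;

  child_handler_ran = 0;
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    // Child: report through the exit status whether the handler already ran.
    _exit(child_handler_ran ? 0 : 1);
  }

  // Parent: its copy of the flag is untouched; only the child's status matters.
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  if (!WIFEXITED(status)) return -1;
  return WEXITSTATUS(status) == 0 ? 1 : 0;
}

}  // namespace atfork_demo
```

Calling atfork_demo::ChildHandlerRunsInsideFork() from a test should return 1 on a conforming system. If the handler were instrumented and touched racy state, that access would happen at the same point, inside the fork call.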

@tau0
Author

tau0 commented Jul 2, 2019

To me this seems very similar to #788, but I'm not sure. That is, we killed the binary while a data race report was in progress.

@tau0 tau0 changed the title [TSAN] The binary is beeing stuck after std::exit(1) call in __sanitizer::BlockingMutex::Lock() [TSAN] The binary is beeing stuck in __sanitizer::BlockingMutex::Lock() after std::exit(1) call Jul 2, 2019
@dvyukov
Contributor

dvyukov commented Jul 4, 2019

I think I see a potential problem in the stack trace.

In the fork interceptor tsan locks report_mtx:

TSAN_INTERCEPTOR(int, fork, int fake) {
  if (in_symbolizer())
    return REAL(fork)(fake);
  SCOPED_INTERCEPTOR_RAW(fork, fake);
  ForkBefore(thr, pc);
  ...

void ForkBefore(ThreadState *thr, uptr pc) {
  ctx->thread_registry->Lock();
  ctx->report_mtx.Lock();
}

It was assumed that fork does not call any instrumented user code.

But in this case fork calls folly::detail::(anonymous namespace)::AtForkList::child(), which is instrumented and triggers a race report. The report path tries to lock report_mtx again, which deadlocks.
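
To make the re-entrancy concrete, here is a small model of the sequence described above. It is a sketch under stated assumptions, not TSan code: NonRecursiveLock, ModelReportRace and ModelFork are hypothetical stand-ins, and the model uses TryLock so that the point where the real Lock() would block forever shows up as a false return instead of a hang.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

namespace deadlock_model {  // hypothetical names, not TSan internals

// Stand-in for a non-recursive lock such as report_mtx.
class NonRecursiveLock {
 public:
  // Returns true if the lock was free and is now held by the caller.
  bool TryLock() { return !locked_.exchange(true, std::memory_order_acquire); }
  void Unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

NonRecursiveLock report_lock;

// Models the race report path, which needs the report lock.
// Returns false where a blocking Lock() would never return.
bool ModelReportRace() {
  if (!report_lock.TryLock()) return false;
  report_lock.Unlock();
  return true;
}

// Models the fork path: take the lock, run the child callback, release.
// Returns whatever the callback returned.
bool ModelFork(const std::function<bool()>& child_callback) {
  const bool locked = report_lock.TryLock();  // models ForkBefore
  assert(locked);
  (void)locked;
  const bool callback_ok = child_callback();  // callback runs with lock held
  report_lock.Unlock();
  return callback_ok;
}

}  // namespace deadlock_model
```

In this model, deadlock_model::ModelFork(deadlock_model::ModelReportRace) returns false, because the callback needs a lock that the fork path already holds on the same thread. Calling ModelReportRace() outside of ModelFork returns true.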

There may be several potential solutions.
We may try to intercept pthread_atfork and run the callbacks ourselves.
Or we may register our own callback and try to make the tsan wrappers around fork the innermost ones.
Or we may try to set after_multithreaded_fork and other ignores earlier, so that we don't try to report the race in the folly callbacks.
I am not sure which one is better.
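
As a rough illustration of the third option only, the model below sets an ignore flag before the callbacks run, so a race seen inside a callback is dropped instead of reaching the report lock. Everything here is a hypothetical sketch: ModelRuntime, ignore_reports and the counters are invented for the example, and the real after_multithreaded_fork state and ignore machinery in TSan are not modeled.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

namespace ignore_model {  // hypothetical names, not TSan internals

// Stand-in for a non-recursive lock such as report_mtx.
class NonRecursiveLock {
 public:
  bool TryLock() { return !locked_.exchange(true, std::memory_order_acquire); }
  void Unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};

struct ModelRuntime {
  NonRecursiveLock report_lock;
  bool ignore_reports = false;  // assumed stand-in for an "ignore" state
  int reports_emitted = 0;      // reports that took the lock normally
  int reports_suppressed = 0;   // races dropped because ignores were set
  int would_deadlock = 0;       // points where a blocking Lock() would hang

  // Models an instrumented access on which a race is detected.
  void OnRaceDetected() {
    if (ignore_reports) {
      ++reports_suppressed;
      return;
    }
    if (!report_lock.TryLock()) {
      ++would_deadlock;
      return;
    }
    ++reports_emitted;
    report_lock.Unlock();
  }

  // Models the fork path. With set_ignores_early, the ignore flag is raised
  // before the child callback runs and cleared afterwards.
  void Fork(const std::function<void()>& child_callback,
            bool set_ignores_early) {
    const bool locked = report_lock.TryLock();
    assert(locked);
    (void)locked;
    ignore_reports = set_ignores_early;
    child_callback();
    ignore_reports = false;
    report_lock.Unlock();
  }
};

}  // namespace ignore_model
```

With set_ignores_early set to false, a callback that calls OnRaceDetected() increments would_deadlock. With it set to true, the same callback increments reports_suppressed instead. The cost in this model is that races inside fork callbacks go unreported.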

@tau0
Author

tau0 commented Jul 5, 2019

EDIT: Ignore this comment. Incorrectly read the code.

@tau0 tau0 changed the title [TSAN] The binary is beeing stuck in __sanitizer::BlockingMutex::Lock() after std::exit(1) call [TSAN] pthread_atfork handlers are causing deadlocks. Jul 10, 2019
@tau0 tau0 changed the title [TSAN] pthread_atfork handlers are causing deadlocks. [TSAN] pthread_atfork callbacks are causing deadlocks. Jul 10, 2019
@matt-c-clark

I am seeing this issue as well. @dvyukov do you have any ideas on how to proceed? Is there any data I could provide to help?

@dvyukov
Contributor

dvyukov commented Apr 7, 2020

Maybe this llvm/llvm-project@be41a98 fixes it? Looks similar.

stsquad pushed a commit to stsquad/qemu that referenced this issue Jun 8, 2020
Disable a few tests under CONFIG_TSAN, which
run into a known TSan issue that results in a hang.
google/sanitizers#1116

The disabled tests under TSan include all the qtests as well as
the test-char, test-qga, and test-qdev-global-props.

Signed-off-by: Robert Foley <robert.foley@linaro.org>
Message-Id: <20200605173422.1490-14-robert.foley@linaro.org>
rf972 pushed a commit to rf972/qemu that referenced this issue Jun 8, 2020
Disable a few tests under CONFIG_TSAN, which
run into a known TSan issue that results in a hang.
google/sanitizers#1116

The disabled tests under TSan include all the qtests as well as
the test-char, test-qga, and test-qdev-global-props.

Signed-off-by: Robert Foley <robert.foley@linaro.org>
rf972 pushed a commit to rf972/qemu that referenced this issue Jun 9, 2020
Disable a few tests under CONFIG_TSAN, which
run into a known TSan issue that results in a hang.
google/sanitizers#1116

The disabled tests under TSan include all the qtests as well as
the test-char, test-qga, and test-qdev-global-props.

Signed-off-by: Robert Foley <robert.foley@linaro.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
stsquad pushed a commit to stsquad/qemu that referenced this issue Jun 15, 2020
Disable a few tests under CONFIG_TSAN, which
run into a known TSan issue that results in a hang.
google/sanitizers#1116

The disabled tests under TSan include all the qtests as well as
the test-char, test-qga, and test-qdev-global-props.

Signed-off-by: Robert Foley <robert.foley@linaro.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20200609200738.445-14-robert.foley@linaro.org>
stsquad pushed a commit to stsquad/qemu that referenced this issue Jun 16, 2020
Disable a few tests under CONFIG_TSAN, which
run into a known TSan issue that results in a hang.
google/sanitizers#1116

The disabled tests under TSan include all the qtests as well as
the test-char, test-qga, and test-qdev-global-props.

Signed-off-by: Robert Foley <robert.foley@linaro.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-Id: <20200609200738.445-14-robert.foley@linaro.org>
Message-Id: <20200612190237.30436-17-alex.bennee@linaro.org>
arcadia-devtools pushed a commit to catboost/catboost that referenced this issue Jul 15, 2020
google/sanitizers#1116

ref:a150ea15f1fcb6baf3d4ad282698e73b55bbb587
facebook-github-bot pushed a commit to facebook/folly that referenced this issue Sep 3, 2020
Summary:
Because of google/sanitizers#1116, any races detected during pthread_atfork deadlock the process.
For forks issued from folly we can use an instrumented fork version to minimize the amount of code running via pthread_atfork.

Reviewed By: yfeldblum

Differential Revision: D23281093

fbshipit-source-id: fd4d3bf06b7992ff314f631ed899854a8b3f6c4b
dotconnor pushed a commit to 5448C8/folly that referenced this issue Mar 19, 2021
Summary:
Because of google/sanitizers#1116, any races detected during pthread_atfork deadlock the process.
For forks issued from folly we can use an instrumented fork version to minimize the amount of code running via pthread_atfork.

Reviewed By: yfeldblum

Differential Revision: D23281093

fbshipit-source-id: fd4d3bf06b7992ff314f631ed899854a8b3f6c4b
arcadia-devtools pushed a commit to catboost/catboost that referenced this issue Jan 18, 2022
google/sanitizers#1116

ref:a150ea15f1fcb6baf3d4ad282698e73b55bbb587
robot-piglet pushed a commit to catboost/catboost that referenced this issue Jan 15, 2023
google/sanitizers#1116

ref:a150ea15f1fcb6baf3d4ad282698e73b55bbb587

3 participants