FATAL: ThreadSanitizer CHECK failed #950

Closed · WallStProg opened this issue May 3, 2018 · 5 comments

@WallStProg

I'm getting the following error consistently with several of my test programs when built with TSAN.

FATAL: ThreadSanitizer CHECK failed: /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:69 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
    #0 __tsan::TsanCheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_report.cc:48 (ITimerTest+0x492ad3)
    #1 __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_termination.cc:79 (ITimerTest+0x4ae005)
    #2 __sanitizer::DeadlockDetectorTLS<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >::addLock(unsigned long, unsigned long, unsigned int) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:69 (ITimerTest+0x4a246d)
    #3 __sanitizer::DeadlockDetector<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >::onLockAfter(__sanitizer::DeadlockDetectorTLS<__sanitizer::TwoLevelBitVector<1ul, __sanitizer::BasicBitVector<unsigned long> > >*, unsigned long, unsigned int) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:220 (ITimerTest+0x4a246d)
    #4 __sanitizer::DD::MutexAfterLock(__sanitizer::DDCallback*, __sanitizer::DDMutex*, bool, bool) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector1.cc:170 (ITimerTest+0x4a246d)
    #5 __tsan::MutexPostLock(__tsan::ThreadState*, unsigned long, unsigned long, unsigned int, int) /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cc:200 (ITimerTest+0x4911cf)
    #6 pthread_mutex_lock /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:4033 (ITimerTest+0x440cf0)
    #7 lockTimerHeap /shared/work/OpenMAMA/6.2.1/common/c_cpp/src/c/timers.c:411:5 (libmamazmqimpl.so+0x2e047)
    #8 zmqBridgeMamaTimer_destroy /home/btorpey/work/OpenMAMA-zmq/2.0/src/timer.c:168:4 (libmamazmqimpl.so+0x13d21)
    #9 mamaTimer_destroy /shared/work/OpenMAMA/6.2.1/mama/c_cpp/src/c/timer.c:229:37 (libmama.so+0x92968)
    #10 mamaEnvTimer_destroy /home/btorpey/work/transact/4.0.0/src/common/Middleware/MamaAdapter/mme/mamaEnvTimer.c:91:19 (libmme.so+0x5a94)
    #11 mamaEnvTimer_onTimerDestroy /home/btorpey/work/transact/4.0.0/src/common/Middleware/MamaAdapter/mme/mamaEnvTimer.c:131 (libmme.so+0x5a94)
    #12 wombatQueue_dispatchInt /shared/work/OpenMAMA/6.2.1/common/c_cpp/src/c/queue.c:326:9 (libmama.so+0xe0930)
    #13 wombatQueue_timedDispatch /shared/work/OpenMAMA/6.2.1/common/c_cpp/src/c/queue.c:342:12 (libmama.so+0xe09bd)
    #14 zmqBridgeMamaQueue_dispatch /home/btorpey/work/OpenMAMA-zmq/2.0/src/queue.c:253:16 (libmamazmqimpl.so+0x103a3)
    #15 mamaQueue_dispatch /shared/work/OpenMAMA/6.2.1/mama/c_cpp/src/c/queue.c:825:12 (libmama.so+0x8cb86)
    #16 dispatchThreadProc /shared/work/OpenMAMA/6.2.1/mama/c_cpp/src/c/queue.c:1303:30 (libmama.so+0x8e63a)
    #17 __tsan_thread_start_func /shared/buildtest/clang/trunk/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:955 (ITimerTest+0x4247fd)
    #18 start_thread <null> (libpthread.so.0+0x33c0607aa0)
    #19 clone <null> (libc.so.6+0x33bfee8bcc)

Any ideas on how to work around this would be much appreciated -- thanks in advance!

@kcc (Contributor) commented May 3, 2018

At the very least, you can disable the deadlock detector (TSAN_OPTIONS=detect_deadlocks=0).

If you have a reasonably small reproducer, we may try to get this fixed.
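A minimal sketch of how the option can be supplied, assuming a C program built with -fsanitize=thread (ITimerTest is just the binary name from the log above): either set it in the environment, e.g. TSAN_OPTIONS=detect_deadlocks=0 ./ITimerTest, or compile the default into the binary by defining __tsan_default_options():

    /* Sketch: bake the TSan runtime option into the binary instead of
     * setting TSAN_OPTIONS in the environment. The TSan runtime calls this
     * hook at startup to obtain its default flags. */
    const char *__tsan_default_options(void) {
        /* Turn off only the deadlock detector; other TSan checks stay on. */
        return "detect_deadlocks=0";
    }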

@WallStProg (Author)

Thanks Kostya.

I've already disabled the deadlock detector, but I'd like to re-enable it if possible.

It looks like the problem may be too many mutexes? That's quite possible, as the code that triggers this is an internal stress test program that creates many objects. If so, I could try building locally with a different size for all_locks_with_contexts_ -- do you know where that is set?

Also, on a related note, it looks like TSAN reports what appear to be false positives with recursive mutexes -- is that true?

Thanks again!

@kcc (Contributor) commented May 4, 2018

The code in question is in lib/sanitizer_common/sanitizer_deadlock_detector.h.
It limits the number of simultaneously held locks in a given thread to an arbitrarily chosen large number, 64.
If you hold 65 locks in one thread at once, this will fail.

I have to admit that I don't remember the fine details of this code any more (I haven't touched it since 2014).
Recursive mutexes should work, but OTOH we don't have too many of them, which means their support is not well tested.
If you can provide a minimal repro, please open a separate bug.
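For context, a minimal sketch of the kind of standalone reproducer that trips this CHECK (the count and names are illustrative; the point is simply that a single thread holds more than 64 locks at once):

    /* Sketch of a reproducer: one thread holds 65 pthread mutexes at once.
     * Build: clang -g -fsanitize=thread repro.c -o repro
     * Run:   TSAN_OPTIONS=detect_deadlocks=1 ./repro */
    #include <pthread.h>

    #define NLOCKS 65   /* one more than the detector's 64-slot per-thread table */

    int main(void) {
        static pthread_mutex_t locks[NLOCKS];
        for (int i = 0; i < NLOCKS; i++) {
            pthread_mutex_init(&locks[i], NULL);
            pthread_mutex_lock(&locks[i]);   /* held, not yet released: the
                                                CHECK fires on the 65th lock */
        }
        for (int i = NLOCKS - 1; i >= 0; i--) {
            pthread_mutex_unlock(&locks[i]);
            pthread_mutex_destroy(&locks[i]);
        }
        return 0;
    }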

@dvyukov (Contributor) commented May 7, 2018

@WallStProg you showed some "destroy of a locked mutex" reports. I suspect they are what causes the unbounded number of mutexes held by a thread.

POSIX is very clear on this:

Attempting to destroy a locked mutex results in undefined behavior.

http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_destroy.html
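In other words, the lock has to be released before the mutex is destroyed. A small illustration with hypothetical names (not the actual OpenMAMA code):

    #include <pthread.h>

    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

    void destroy_heap_bad(void) {
        pthread_mutex_lock(&heap_lock);
        pthread_mutex_destroy(&heap_lock);   /* UB: mutex is still locked */
    }

    void destroy_heap_ok(void) {
        pthread_mutex_lock(&heap_lock);
        /* ... tear down the state guarded by the lock ... */
        pthread_mutex_unlock(&heap_lock);    /* release before destroying */
        pthread_mutex_destroy(&heap_lock);
    }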

@WallStProg (Author)

It appears that the CHECK failure is in fact caused by holding more than 64 locked mutexes in a single thread -- unusual, but this code is a stress test that does that deliberately.

I hear you about the "destroy locked" reports, and I'm in the process of fixing them. (I inherited this code, which has been running for quite some time with no apparent issues, but I agree that UB is not OK.)

For now, I've disabled the problematic tests when running with "detect_deadlocks=1".

Thanks!

asfgit pushed a commit to apache/kudu that referenced this issue Nov 7, 2018
TSAN limits the number of simultaneous lock acquisitions in a single
thread to 64 when using the deadlock detector[1]. However, compaction
can select up to 128 (128MB budget / 1MB min rowset size) rowsets in a
single op. kudu-tool-test's TestNonRandomWorkloadLoadgen almost always
hits TSAN's limit when the KUDU-1400 changes following this patch are
applied. This patch prevents this by limiting the number of rowsets
selected for a compaction to 32 when running under TSAN.

I ran the test with the KUDU-1400 changes on top and saw 97/100
failures. With the change, I saw 100 successes.

[1]: google/sanitizers#950

Change-Id: I01ad4ba3a13995c194c3308d72c1eb9b611ef766
Reviewed-on: http://gerrit.cloudera.org:8080/11885
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <adar@cloudera.com>
Reviewed-by: Andrew Wong <awong@cloudera.com>
acelyc111 pushed a commit to acelyc111/kudu that referenced this issue Jan 23, 2019 (same commit message and Change-Id as above)
wanchaol added a commit to pytorch/pytorch that referenced this issue Apr 16, 2020
As we hold a mutex for our custom C++ Node, when calling reentrant
backward from a custom C++ function we will be concurrently holding many
mutexes, up to MAX_DEPTH. TSAN only allows up to 64 mutexes to be held at
once, otherwise it will complain. This PR lowers the limit accordingly for
TSAN.

TSAN Reference: google/sanitizers#950
wanchaol added a commit to pytorch/pytorch that referenced this issue Apr 16, 2020 (same commit message as above; ghstack-source-id: de61c260ea671025b486c0118af782efdf07aab3, Pull Request resolved: #36745)
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this issue Apr 17, 2020
Summary: Pull Request resolved: #36745 (same commit message as above)

Test Plan: Imported from OSS

Differential Revision: D21072604

Pulled By: wanchaol

fbshipit-source-id: 99cd1acab41a203d834fa4947f4e6f0ffd2e70f2
azat added a commit to azat/ClickHouse that referenced this issue Jan 10, 2021 (…utexes)

Under TSan you can lock no more than 64 mutexes from one thread at
once [1] [2], while RESTART REPLICAS can acquire more (it depends on the
number of replicated tables).

  [1]: google/sanitizers#950 (comment)
  [2]: https://github.com/llvm/llvm-project/blob/b02eab9058e58782fca32dd8b1e53c27ed93f866/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h#L67

And since stress tests run tests in parallel, you can have more than 64
ReplicatedMergeTree tables at once (even though it is unlikely).

Fix this by using RESTART REPLICA <table> instead of RESTART REPLICAS.