
Make systhread mutexes errorcheck #9757

Closed · wants to merge 1 commit

Conversation

@gadmm (Contributor) commented Jul 12, 2020:

From the commit log:

This means that an exception is raised when attempting to lock a mutex
locked from the same thread, e.g. from an asynchronous callback.

This changes the behaviour on Windows where mutexes were recursive.

Add test for deadlock inside asynchronous callbacks.


The main motivation is the detection of deadlocks caused by locking from an asynchronous callback, since this is a "bug at a distance" that can surprise users and be hard to debug. This has occurred many times: #5141, #5299, #7503, #8794...

This PR changes both Mutex and the channel locks. It is orthogonal to #9722, which aims to unlock channels before running asynchronous callbacks, and can be considered separately.

For POSIX, this converts a deadlock into a Sys_error exception, so this preserves compatibility.
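
For reference, here is a minimal C sketch (not the PR's code) of what "errorcheck" means at the POSIX level: relocking from the owning thread fails with EDEADLK instead of deadlocking, and the runtime can map that error code to a Sys_error exception.

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
  pthread_mutexattr_t attr;
  pthread_mutex_t m;

  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  pthread_mutex_init(&m, &attr);
  pthread_mutexattr_destroy(&attr);

  pthread_mutex_lock(&m);
  /* With a default mutex this second lock would deadlock; an
     errorcheck mutex returns EDEADLK instead. */
  if (pthread_mutex_lock(&m) == EDEADLK)
    printf("relock refused (EDEADLK)\n");

  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  return 0;
}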

There is a danger of breaking code for people who wrote Win32-only programs and relied on the recursive behaviour of Win32 mutexes. Unfortunately, the documentation of Mutex is ambiguous about the non-recursive nature of mutexes. If there is a risk of breaking programs, it may be preferable to leave the Windows behaviour unchanged; otherwise, to clarify the documentation.

Optionally, I can make the runtime lock errorcheck too. Making it recursive was discussed at #5299 and decided against; in particular, the current Windows behaviour is incorrect. This is easy to fix in the same way if desired.

For obvious reasons the Win32 code has not yet been tested. I will rely on CI for compilation and further help is welcome.

(I believe this might interest @xavierleroy and @stedolan.)

@gadmm mentioned this pull request on Jul 12, 2020.
@gadmm force-pushed the errorcheck-mutexes branch 2 times, most recently from 56fdcb7 to 243346c on July 12, 2020 at 18:18.
@jhjourdan (Contributor) commented:
I have not reviewed the code, but I see two issues with this PR:

1. Performance: pthread's errorcheck mutexes have a cost, and some people do (or did?) care about the performance of systhreads (cf. #4351). So, at the very least, we should benchmark this PR.

2. Releasing a mutex owned by another thread is now explicitly forbidden and checked by pthread. In the previous code, this was theoretically undefined behaviour inherited from pthread, but my gut feeling is that it was harmless, because there has always been a happens-before relationship between the locker and the unlocker thanks to the master lock. So this PR changes the behaviour of mutexes in this case (if we assume that the UB I just described is benign). On the other hand, this PR fixes a potential UB for well-typed programs, which is good.
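
To illustrate the second point, a sketch (not code from the PR): with an errorcheck mutex, an unlock from a non-owning thread is reported as EPERM rather than being undefined behaviour.

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

static pthread_mutex_t m;

static void *try_unlock(void *arg)
{
  (void)arg;
  /* m is held by the main thread; an errorcheck mutex reports this
     instead of invoking undefined behaviour. */
  if (pthread_mutex_unlock(&m) == EPERM)
    printf("cross-thread unlock refused (EPERM)\n");
  return NULL;
}

int main(void)
{
  pthread_mutexattr_t attr;
  pthread_t t;

  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  pthread_mutex_init(&m, &attr);

  pthread_mutex_lock(&m);
  pthread_create(&t, NULL, try_unlock, NULL);
  pthread_join(t, NULL);
  pthread_mutex_unlock(&m);
  return 0;
}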

@@ -0,0 +1,3 @@
start
Sys_error("Mutex.lock: Resource deadlock avoided")
A reviewer (Contributor) commented on this diff:
The exact string reported here is libc- and locale-dependent. This test shouldn't fail if it changes.

@stedolan (Contributor) commented:
I think this is a good change, independent of the discussion at #9722. The Posix code looks good to me, but I haven't read the Windows side.

@jhjourdan releasing a mutex from the wrong thread seems like it could break on many libcs? Are there programs that do this?

@jhjourdan (Contributor) commented:
> @jhjourdan releasing a mutex from the wrong thread seems like it could break on many libcs

Really? I doubt pthread_mutex implementations use thread-local variables in practice. I don't see why they would do that, and TLS is rather costly.

@stedolan (Contributor) commented:
> Really? I doubt pthread_mutex implementations use thread-local variables in practice. I don't see why they would do that, and TLS is rather costly.

TLS is cheap, especially from within libc (there's generally a register reserved), and many implementations use it to determine the current thread ID. See, e.g., glibc's implementation.

It's convenient to have thread IDs around when you're writing a queued sleeping lock. (They're not needed for a simple spinlock, though).
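
A hypothetical sketch of that pattern (not glibc's actual code): a lock records the acquiring thread's identity on acquisition, which is cheap when the thread ID lives in thread-local storage. The acquire here is simplified to a spinning loop; a real queued lock would sleep instead of yielding.

#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical owner-tracking lock; fields assumed zero-initialized
   (e.g. static storage). */
typedef struct {
  atomic_int held;               /* 0 = free, 1 = held */
  pthread_t owner;               /* meaningful only while held */
} owner_lock;

static void owner_lock_acquire(owner_lock *l)
{
  int expected = 0;
  while (!atomic_compare_exchange_weak_explicit(
             &l->held, &expected, 1,
             memory_order_acquire, memory_order_relaxed)) {
    expected = 0;
    sched_yield();               /* a real queued lock would sleep here */
  }
  l->owner = pthread_self();     /* typically a single TLS register read */
}

static int owner_lock_release(owner_lock *l)
{
  /* Errorcheck-style release: refuse if we are not the recorded owner.
     pthread_t need not be an integer, so compare with pthread_equal. */
  if (!pthread_equal(l->owner, pthread_self()))
    return EPERM;
  atomic_store_explicit(&l->held, 0, memory_order_release);
  return 0;
}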

@xavierleroy (Contributor) commented:
Before we argue on the relative merits of the three flavors of mutexes, let's keep in mind that there are at least three different uses of mutexes in OCaml's systhreads: (1) the masterlock, (2) implementing the abstract type Mutex.t, and (3) protecting I/O buffers.

To me, it could make sense to have reentrant mutexes protecting I/O buffers, for instance if we want to support functionality equivalent to flockfile/funlockfile in POSIX threads.
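
For instance, such flockfile-style nesting could rely on a recursive POSIX mutex; a minimal sketch (not proposed code):

#include <pthread.h>

int main(void)
{
  pthread_mutexattr_t attr;
  pthread_mutex_t io_lock;

  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
  pthread_mutex_init(&io_lock, &attr);
  pthread_mutexattr_destroy(&attr);

  /* The same thread can nest lock/unlock pairs; the mutex is only
     released once unlock has been called as many times as lock. */
  pthread_mutex_lock(&io_lock);    /* user-level flockfile-style lock */
  pthread_mutex_lock(&io_lock);    /* an I/O primitive locking again  */
  pthread_mutex_unlock(&io_lock);
  pthread_mutex_unlock(&io_lock);

  pthread_mutex_destroy(&io_lock);
  return 0;
}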

I could agree with Mutex.t being of the error-checking kind, but I'm sorry to note that the documentation for Mutex.unlock doesn't say that it must be executed by the thread that locked the mutex last!

Finally, for the masterlock, speed is crucial, so maybe we don't want any checking there.

@gadmm (Contributor, Author) commented Jul 14, 2020:

Thanks for the discussion, this is interesting.

I agree that the status quo is the safest bet in terms of compatibility. It is important to care about emergent properties, so @jhjourdan's and @xavierleroy's points tend to convince me. However, unlocking from another thread is unsupported at least on Windows, so this will have to remain unspecified. Please decide for me whether the current behaviour has to be preserved.

If so, before closing the PR we can try to recycle it. @xavierleroy suggests making channel locks recursive so that the user is able to lock channels. I can look into this, but I see a difficulty: it does not protect against interference from asynchronous callbacks. It might be possible to work around this if we are careful to maintain only 3 states, and to adapt the unlocking code in the loops at #9722 accordingly if the lock has already been taken twice. Did you have something else in mind?

It is also possible to add an optional argument to Mutex.create to choose between the default platform-specific behaviour, errorcheck mutexes, and recursive mutexes, and to use this opportunity to document the situation (in particular, that the latter two are portable).

@gadmm (Contributor, Author) commented Jul 14, 2020:

Another issue is that pthread_mutexattr_settype is from POSIX.1-2017, and I wonder if this is too recent. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_settype.html.

@stedolan (Contributor) commented:
> I'm sorry to note that the documentation for Mutex.unlock doesn't say that it must be executed by the thread that locked the mutex last!

Is this expected to work? If so, should we implement Mutex.unlock using something other than pthread_mutex_unlock?

@xavierleroy (Contributor) commented:
> Is this expected to work?

I am tempted to say "no". In my mind, a mutex must be used in a well-bracketed way, thus be unlocked by the thread that locked it. Otherwise, it's a 0-1 semaphore :-) But it is troubling that we never got to document this assumption!

> If so, should we implement Mutex.unlock using something other than pthread_mutex_unlock?

We can implement Mutex.t as a mutex + a condition variable + a Boolean, effectively implementing a 0-1 semaphore... But this is not good for performance, and feels overkill.
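
For concreteness, the "mutex + condition variable + Boolean" idea could be sketched as follows (an illustration, not proposed code); unlike a mutex, any thread may release it:

#include <pthread.h>

/* Sketch of a 0-1 (binary) semaphore built from a mutex, a condition
   variable and a Boolean. Any thread may release it. */
typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  int taken;                     /* 0 = available, 1 = taken */
} binary_sem;

static void binary_sem_acquire(binary_sem *s)
{
  pthread_mutex_lock(&s->mutex);
  while (s->taken)
    pthread_cond_wait(&s->cond, &s->mutex);
  s->taken = 1;
  pthread_mutex_unlock(&s->mutex);
}

static void binary_sem_release(binary_sem *s)
{
  pthread_mutex_lock(&s->mutex);
  s->taken = 0;
  pthread_cond_signal(&s->cond); /* wake at most one waiter */
  pthread_mutex_unlock(&s->mutex);
}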

@dra27 (Member) commented Jul 15, 2020:

The Windows implementation is not strong enough: calling LeaveCriticalSection from another thread that doesn't hold the mutex is undefined behaviour (which tends to work, AFAICT). In particular, consider this dummy snippet (which should probably form the basis of another test for this PR):

let msg s = print_endline s; flush stdout in
let m = Mutex.create () in
Mutex.lock m;
let thread1 = Thread.create (fun m -> Mutex.lock m; msg "thread1 locked m") m in
msg "master release m";
Mutex.unlock m;                  (* legal: the master thread holds m *)
Thread.join thread1;             (* thread1 exits while still holding m *)
msg "thread1 is dead";
let thread2 = Thread.create (fun m -> Mutex.lock m; msg "thread2 locked m") m in
msg "master release m";
Mutex.unlock m;                  (* illegal: m is held by the dead thread1 *)
Thread.join thread2;
msg "thread2 is dead"

succeeds on trunk on both Windows 10 and Ubuntu 18.04. With this PR it still incorrectly succeeds on Windows 10, but on Ubuntu it has the expected new behaviour:

master release m
thread1 locked m
thread1 is dead
master release m
Fatal error: exception Sys_error("Mutex.unlock: Operation not permitted")

Instead of using m->taken = 1, it should record, and check, the Thread ID. This is loosely how the mingw-w64 implementation of pthread mutexes works, except they have a slightly cleverer fast path which we might one day borrow.
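
A hypothetical sketch of that suggestion (not the mingw-w64 implementation): record the owner's thread ID next to the critical section and check it in both lock and unlock.

#include <windows.h>

/* Hypothetical owner-checking wrapper; owner is 0 while the mutex is
   free. Assumes the struct is zero-initialized and that
   InitializeCriticalSection has been called on crit. */
typedef struct {
  CRITICAL_SECTION crit;
  DWORD owner;
} win_mutex;

static int win_mutex_lock(win_mutex *m)
{
  DWORD self = GetCurrentThreadId();
  /* Only this thread can have stored its own ID, so this read is
     safe; EnterCriticalSection alone would just recurse silently. */
  if (m->owner == self)
    return 1;                    /* EDEADLK-style error */
  EnterCriticalSection(&m->crit);
  m->owner = self;
  return 0;
}

static int win_mutex_unlock(win_mutex *m)
{
  if (m->owner != GetCurrentThreadId())
    return 1;                    /* EPERM-style error */
  m->owner = 0;
  LeaveCriticalSection(&m->crit);
  return 0;
}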

@stedolan (Contributor) commented:
> If so, should we implement Mutex.unlock using something other than pthread_mutex_unlock?

> We can implement Mutex.t as a mutex + a condition variable + a Boolean, effectively implementing a 0-1 semaphore... But this is not good for performance, and feels overkill.

Right, but pthread_mutex_unlock from a thread other than the one that locked it is undefined behaviour, and I think there have been implementations that blow up on this (glibc lock elision?).

@xavierleroy (Contributor) commented:
> Right, but pthread_mutex_unlock from a thread other than the one that locked it is undefined behaviour, and I think there have been implementations that blow up on this (glibc lock elision?).

Then I'd rather use errorcheck mutexes, as proposed in this PR, at least to implement Mutex.t, and document the restriction on Mutex.unlock. It would be more efficient than implementing our own 0-1 semaphores.

@gadmm (Contributor, Author) commented Jul 15, 2020:

@dra27, thanks. It also seems that the mutex implementation can differ a lot between Windows versions, so we cannot really conclude that this accidentally worked in the past. I'll have a look at the mingw implementation if we go in this direction.

@stedolan (Contributor) commented:
Replying to @gadmm:

> Another issue is that pthread_mutexattr_settype is from POSIX.1-2017, and I wonder if this is too recent. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_settype.html.

I think it should be fine. This function was part of SUSv2 in 1997, and was included in POSIX.1-2001 as part of the XSI extension. The recent change is that it was made mandatory rather than part of an extension, but I'm not aware of any pthread implementations that lack it.

@gadmm (Contributor, Author) commented Aug 10, 2020:

FTR, https://discuss.ocaml.org/t/mutex-lock-resource-deadlock-avoided-on-freebsd-12-1-ocaml-4-09-1-lwt-4-2-1/6206 provides an example where 1) unlock is called from a different thread, and 2) this fails on some platform.

@xavierleroy (Contributor) commented:
I made a different Win32 implementation; see PR #9846. Review welcome.

I also have a variant where channels get recursive mutexes, but decided to keep it in my freezer until we want to add the equivalent of flockfile/funlockfile from POSIX threads.

@gadmm (Contributor, Author) commented Aug 19, 2020:

The other version looks better (in the direction of what I had in mind). Therefore I am closing this one. Thanks for taking this off my plate.

@gadmm closed this on Aug 19, 2020.
xavierleroy referenced this pull request on Aug 24, 2020:
To preserve behaviour, explicit polls are added:

   - in caml_raise, to raise the right exception when a system
     call is interrupted by a signal.

   - in sigprocmask, to ensure that signals are handled as soon
     as they are unmasked.