
Make systhread mutexes errorcheck #9757

Closed · wants to merge 1 commit

Conversation

@gadmm (Contributor) commented Jul 12, 2020:

From the commit log:

This means that an exception is raised when attempting to lock a mutex
locked from the same thread, e.g. from an asynchronous callback.

This changes the behaviour on Windows where mutexes were recursive.

Add test for deadlock inside asynchronous callbacks.


The main motivation is the detection of deadlocks caused by locking from an asynchronous callback, since this is a "bug at a distance" that can surprise users and be hard to debug. This has occurred many times: #5141, #5299, #7503, #8794...

This PR changes both Mutex and the channel locks. It is orthogonal to #9722, which aims to unlock channels before running asynchronous callbacks, and can be considered separately.

For POSIX, this converts a deadlock into a Sys_error exception, so this preserves compatibility.
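
For reference, here is a minimal C sketch (not the PR's code) of what "errorcheck" means at the POSIX level: relocking from the owning thread fails with EDEADLK instead of deadlocking, and the runtime can map that error code to a Sys_error exception.

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
  pthread_mutexattr_t attr;
  pthread_mutex_t m;

  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  pthread_mutex_init(&m, &attr);
  pthread_mutexattr_destroy(&attr);

  pthread_mutex_lock(&m);
  /* With a default mutex this second lock would deadlock; an
     errorcheck mutex returns EDEADLK instead. */
  if (pthread_mutex_lock(&m) == EDEADLK)
    printf("relock refused (EDEADLK)\n");

  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  return 0;
}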

There is a danger of breaking code for people who wrote Win32-only programs and relied on the recursive behaviour of Win32 mutexes. Unfortunately, the documentation of Mutex is ambiguous about the non-recursive nature of mutexes. If there is a risk of breaking programs, it may be preferable to leave the Windows behaviour unchanged; otherwise, to clarify the documentation.

Optionally, I can make the runtime lock errorcheck too. Making it recursive was discussed at #5299 and decided against; in particular, the current Windows behaviour is incorrect. This is easy to fix in the same way if desired.

For obvious reasons the Win32 code has not yet been tested. I will rely on CI for compilation and further help is welcome.

(I believe this might interest @xavierleroy and @stedolan.)

@gadmm mentioned this pull request on Jul 12, 2020.
@gadmm force-pushed the errorcheck-mutexes branch 2 times, most recently from 56fdcb7 to 243346c on July 12, 2020 at 18:18.
@jhjourdan (Contributor) commented:
I have not reviewed the code, but I see two issues with this PR:

1. Performance: pthread's errorcheck mutexes have a cost, and some people do (or did?) care about the performance of systhreads (cf. #4351). So, at the very least, we should benchmark this PR.

2. Releasing a mutex owned by another thread is now explicitly forbidden and checked by pthread. In the previous code, this was theoretically undefined behaviour inherited from pthread, but my gut feeling is that it was harmless, because there has always been a happens-before relationship between the locker and the unlocker thanks to the master lock. So this PR changes the behaviour of mutexes in this case (if we assume that the UB I just described is benign). On the other hand, this PR fixes a potential UB for well-typed programs, which is good.
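
To illustrate the second point, a sketch (not code from the PR): with an errorcheck mutex, an unlock from a non-owning thread is reported as EPERM rather than being undefined behaviour.

#include <pthread.h>
#include <errno.h>
#include <stdio.h>

static pthread_mutex_t m;

static void *try_unlock(void *arg)
{
  (void)arg;
  /* m is held by the main thread; an errorcheck mutex reports this
     instead of invoking undefined behaviour. */
  if (pthread_mutex_unlock(&m) == EPERM)
    printf("cross-thread unlock refused (EPERM)\n");
  return NULL;
}

int main(void)
{
  pthread_mutexattr_t attr;
  pthread_t t;

  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  pthread_mutex_init(&m, &attr);

  pthread_mutex_lock(&m);
  pthread_create(&t, NULL, try_unlock, NULL);
  pthread_join(t, NULL);
  pthread_mutex_unlock(&m);
  return 0;
}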

@@ -0,0 +1,3 @@
start
Sys_error("Mutex.lock: Resource deadlock avoided")
A reviewer (Contributor) commented on this diff:
The exact string reported here is libc- and locale-dependent. This test shouldn't fail if it changes.

@stedolan (Contributor) commented:
I think this is a good change, independent of the discussion at #9722. The Posix code looks good to me, but I haven't read the Windows side.

@jhjourdan releasing a mutex from the wrong thread seems like it could break on many libcs? Are there programs that do this?

@jhjourdan (Contributor) commented:
> @jhjourdan releasing a mutex from the wrong thread seems like it could break on many libcs

Really? I doubt pthread_mutex implementations use thread-local variables in practice. I don't see why they would do that, and TLS is rather costly.

@stedolan (Contributor) commented:
> Really? I doubt pthread_mutex implementations use thread-local variables in practice. I don't see why they would do that, and TLS is rather costly.

TLS is cheap, especially from within libc (there's generally a register reserved), and many implementations use it to determine the current thread ID. See, e.g., glibc's implementation.

It's convenient to have thread IDs around when you're writing a queued sleeping lock. (They're not needed for a simple spinlock, though).
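
A hypothetical sketch of that pattern (not glibc's actual code): a lock records the acquiring thread's identity on acquisition, which is cheap when the thread ID lives in thread-local storage. The acquire here is simplified to a spinning loop; a real queued lock would sleep instead of yielding.

#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical owner-tracking lock; fields assumed zero-initialized
   (e.g. static storage). */
typedef struct {
  atomic_int held;               /* 0 = free, 1 = held */
  pthread_t owner;               /* meaningful only while held */
} owner_lock;

static void owner_lock_acquire(owner_lock *l)
{
  int expected = 0;
  while (!atomic_compare_exchange_weak_explicit(
             &l->held, &expected, 1,
             memory_order_acquire, memory_order_relaxed)) {
    expected = 0;
    sched_yield();               /* a real queued lock would sleep here */
  }
  l->owner = pthread_self();     /* typically a single TLS register read */
}

static int owner_lock_release(owner_lock *l)
{
  /* Errorcheck-style release: refuse if we are not the recorded owner.
     pthread_t need not be an integer, so compare with pthread_equal. */
  if (!pthread_equal(l->owner, pthread_self()))
    return EPERM;
  atomic_store_explicit(&l->held, 0, memory_order_release);
  return 0;
}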

@xavierleroy (Contributor) commented:
Before we argue on the relative merits of the three flavors of mutexes, let's keep in mind that there are at least three different uses of mutexes in OCaml's systhreads: (1) the masterlock, (2) implementing the abstract type Mutex.t, and (3) protecting I/O buffers.

To me, it could make sense to have reentrant mutexes protecting I/O buffers, for instance if we want to support functionality equivalent to flockfile/funlockfile in POSIX threads.
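
For instance, such flockfile-style nesting could rely on a recursive POSIX mutex; a minimal sketch (not proposed code):

#include <pthread.h>

int main(void)
{
  pthread_mutexattr_t attr;
  pthread_mutex_t io_lock;

  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
  pthread_mutex_init(&io_lock, &attr);
  pthread_mutexattr_destroy(&attr);

  /* The same thread can nest lock/unlock pairs; the mutex is only
     released once unlock has been called as many times as lock. */
  pthread_mutex_lock(&io_lock);    /* user-level flockfile-style lock */
  pthread_mutex_lock(&io_lock);    /* an I/O primitive locking again  */
  pthread_mutex_unlock(&io_lock);
  pthread_mutex_unlock(&io_lock);

  pthread_mutex_destroy(&io_lock);
  return 0;
}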

I could agree with Mutex.t being of the error-checking kind, but I'm sorry to note that the documentation for Mutex.unlock doesn't say that it must be executed by the thread that locked the mutex last!

Finally, for the masterlock, speed is crucial, so maybe we don't want any checking there.

@gadmm (Contributor, Author) commented Jul 14, 2020:

Thanks for the discussion, this is interesting.

I agree that the status quo is the safest bet in terms of compatibility. It is important to care about emergent properties, so @jhjourdan's and @xavierleroy's points tend to convince me. However, unlocking from another thread is unsupported at least on Windows, so this will have to remain unspecified. Please decide for me whether the current behaviour has to be preserved.

If so, before closing the PR we can try to recycle it. @xavierleroy suggests making channel locks recursive so that the user is able to lock channels. I can look into this, but I see a difficulty: it does not protect against interference from asynchronous callbacks. It might be possible to work around this if we are careful to maintain only 3 states, and to adapt the unlocking code in the loops at #9722 accordingly if the lock has already been taken twice. Did you have something else in mind?

It is also possible to add an optional argument to Mutex.create to choose between the default platform-specific behaviour, errorcheck mutexes, and recursive mutexes, and to use this opportunity to document the situation (in particular, that the latter two are portable).

@gadmm (Contributor, Author) commented Jul 14, 2020:

Another issue is that pthread_mutexattr_settype is from POSIX.1-2017, and I wonder if this is too recent. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_settype.html.

@stedolan (Contributor) commented:
> I'm sorry to note that the documentation for Mutex.unlock doesn't say that it must be executed by the thread that locked the mutex last!

Is this expected to work? If so, should we implement Mutex.unlock using something other than pthread_mutex_unlock?

@xavierleroy (Contributor) commented:
> Is this expected to work?

I am tempted to say "no". In my mind, a mutex must be used in a well-bracketed way, thus be unlocked by the thread that locked it. Otherwise, it's a 0-1 semaphore :-) But it is troubling that we never got to document this assumption!

> If so, should we implement Mutex.unlock using something other than pthread_mutex_unlock?

We can implement Mutex.t as a mutex + a condition variable + a Boolean, effectively implementing a 0-1 semaphore... But this is not good for performance, and feels overkill.
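
For concreteness, the "mutex + condition variable + Boolean" idea could be sketched as follows (an illustration, not proposed code); unlike a mutex, any thread may release it:

#include <pthread.h>

/* Sketch of a 0-1 (binary) semaphore built from a mutex, a condition
   variable and a Boolean. Any thread may release it. */
typedef struct {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  int taken;                     /* 0 = available, 1 = taken */
} binary_sem;

static void binary_sem_acquire(binary_sem *s)
{
  pthread_mutex_lock(&s->mutex);
  while (s->taken)
    pthread_cond_wait(&s->cond, &s->mutex);
  s->taken = 1;
  pthread_mutex_unlock(&s->mutex);
}

static void binary_sem_release(binary_sem *s)
{
  pthread_mutex_lock(&s->mutex);
  s->taken = 0;
  pthread_cond_signal(&s->cond); /* wake at most one waiter */
  pthread_mutex_unlock(&s->mutex);
}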

@dra27 (Member) commented Jul 15, 2020:

The Windows implementation is not strong enough: calling LeaveCriticalSection from another thread that doesn't hold the mutex is undefined behaviour (which tends to work, AFAICT). In particular, consider this dummy snippet (which should probably form the basis of another test for this PR):

let msg s = print_endline s; flush stdout in
let m = Mutex.create () in
Mutex.lock m;
let thread1 = Thread.create (fun m -> Mutex.lock m; msg "thread1 locked m") m in
msg "master release m";
Mutex.unlock m;                  (* legal: the master thread holds m *)
Thread.join thread1;             (* thread1 exits while still holding m *)
msg "thread1 is dead";
let thread2 = Thread.create (fun m -> Mutex.lock m; msg "thread2 locked m") m in
msg "master release m";
Mutex.unlock m;                  (* illegal: m is held by the dead thread1 *)
Thread.join thread2;
msg "thread2 is dead"

succeeds on trunk on both Windows 10 and Ubuntu 18.04. With this PR it still incorrectly succeeds on Windows 10, but on Ubuntu it has the expected new behaviour:

master release m
thread1 locked m
thread1 is dead
master release m
Fatal error: exception Sys_error("Mutex.unlock: Operation not permitted")

Instead of using m->taken = 1, it should record, and check, the Thread ID. This is loosely how the mingw-w64 implementation of pthread mutexes works, except they have a slightly cleverer fast path which we might one day borrow.
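
A hypothetical sketch of that suggestion (not the mingw-w64 implementation): record the owner's thread ID next to the critical section and check it in both lock and unlock.

#include <windows.h>

/* Hypothetical owner-checking wrapper; owner is 0 while the mutex is
   free. Assumes the struct is zero-initialized and that
   InitializeCriticalSection has been called on crit. */
typedef struct {
  CRITICAL_SECTION crit;
  DWORD owner;
} win_mutex;

static int win_mutex_lock(win_mutex *m)
{
  DWORD self = GetCurrentThreadId();
  /* Only this thread can have stored its own ID, so this read is
     safe; EnterCriticalSection alone would just recurse silently. */
  if (m->owner == self)
    return 1;                    /* EDEADLK-style error */
  EnterCriticalSection(&m->crit);
  m->owner = self;
  return 0;
}

static int win_mutex_unlock(win_mutex *m)
{
  if (m->owner != GetCurrentThreadId())
    return 1;                    /* EPERM-style error */
  m->owner = 0;
  LeaveCriticalSection(&m->crit);
  return 0;
}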

@stedolan (Contributor) commented:
> If so, should we implement Mutex.unlock using something other than pthread_mutex_unlock?

> We can implement Mutex.t as a mutex + a condition variable + a Boolean, effectively implementing a 0-1 semaphore... But this is not good for performance, and feels overkill.

Right, but pthread_mutex_unlock from a thread other than the one that locked it is undefined behaviour, and I think there have been implementations that blow up on this (glibc lock elision?).

@xavierleroy (Contributor) commented:
> Right, but pthread_mutex_unlock from a thread other than the one that locked it is undefined behaviour, and I think there have been implementations that blow up on this (glibc lock elision?).

Then I'd rather use errorcheck mutexes, as proposed in this PR, at least to implement Mutex.t, and document the restriction on Mutex.unlock. It would be more efficient than implementing our own 0-1 semaphores.

@gadmm (Contributor, Author) commented Jul 15, 2020:

@dra27, thanks. It also seems that the mutex implementation can differ a lot between Windows versions, so we cannot really conclude that this accidentally worked in the past. I'll have a look at the mingw implementation if we go in this direction.

@stedolan (Contributor) commented:
Replying to @gadmm:

> Another issue is that pthread_mutexattr_settype is from POSIX.1-2017, and I wonder if this is too recent. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_settype.html.

I think it should be fine. This function was part of SUSv2 in 1997, and was included in POSIX.1-2001 as part of the XSI extension. The recent change is that it was made mandatory rather than part of an extension, but I'm not aware of any pthread implementations that lack it.

@gadmm (Contributor, Author) commented Aug 10, 2020:

FTR, https://discuss.ocaml.org/t/mutex-lock-resource-deadlock-avoided-on-freebsd-12-1-ocaml-4-09-1-lwt-4-2-1/6206 provides an example where 1) unlock is called from a different thread, and 2) this fails on some platform.

@xavierleroy (Contributor) commented:
I made a different Win32 implementation; see PR #9846. Review welcome.

I also have a variant where channels get recursive mutexes, but decided to keep it in my freezer until we want to add the equivalent of flockfile/funlockfile from POSIX threads.

@gadmm (Contributor, Author) commented Aug 19, 2020:

The other version looks better (in the direction of what I had in mind). Therefore I am closing this one. Thanks for taking this off my plate.

@gadmm closed this on Aug 19, 2020.
xavierleroy referenced this pull request on Aug 24, 2020:
To preserve behaviour, explicit polls are added:

   - in caml_raise, to raise the right exception when a system
     call is interrupted by a signal.

   - in sigprocmask, to ensure that signals are handled as soon
     as they are unmasked.