sync: inconsistent mutex state #736

Closed
canning-duck opened this Issue Jan 3, 2018 · 3 comments

@canning-duck
Contributor

canning-duck commented Jan 3, 2018

Hello,

We have a piece of code with a few goroutines and mutexes. Given a "certain" number of concurrent operations, the runtime panics with "sync: inconsistent mutex state" under GopherJS 1.9 (this one: https://golang.org/src/sync/mutex.go#L143). The issue does not occur with GopherJS 1.8, or on any other platform with Go 1.9 (android/ios/macos).

Looking at the Go and GopherJS source code, and as far as I understand it (i.e. not very much), it seems that the Go 1.9 mutex starvation changes added lifo and handoff parameters to the runtime semaphore functions, and the GopherJS overrides currently ignore them.

I'm not yet comfortable isolating the issue in a unit test, but I'm ready to try patching our local copy of GopherJS and see how it goes. If I get it right, lifo is about adding ch at the head of semWaiters instead of the tail (see the sketch below). But what about handoff? Does anyone have a clue?
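
For context, here is a simplified sketch of how I picture the acquire side in GopherJS (paraphrased from memory, not the exact source); if lifo were honoured, the waiter's channel would go to the head of semWaiters instead of the tail:

// Simplified sketch of my reading of the GopherJS sync overrides, not the
// exact source. semWaiters maps a semaphore address to the channels of the
// goroutines currently blocked on it.
var semWaiters = make(map[*uint32][]chan bool)

func runtime_SemacquireMutex(s *uint32, lifo bool) {
	if *s == 0 {
		ch := make(chan bool)
		if lifo {
			// head of the queue: woken first on the next release (proposed)
			semWaiters[s] = append([]chan bool{ch}, semWaiters[s]...)
		} else {
			// tail of the queue: current (FIFO) behaviour
			semWaiters[s] = append(semWaiters[s], ch)
		}
		<-ch // block until runtime_Semrelease sends us a wakeup
	}
	*s--
}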

@canning-duck

Contributor

canning-duck commented Jan 3, 2018

I've found a scenario:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mx sync.Mutex
	var wg sync.WaitGroup
	work := func(i int) {
		defer func() {
			err := recover()
			if err != nil {
				fmt.Println(i, "recover", err)
			}
		}()

		// Sleep while holding the lock so that waiters are blocked long
		// enough for the mutex to enter starvation mode.
		mx.Lock()
		time.Sleep(2 * time.Millisecond)
		mx.Unlock()

		// Locking a second time from the same goroutine is what triggers
		// the panic under GopherJS 1.9 (see below).
		mx.Lock()
		time.Sleep(2 * time.Millisecond)
		mx.Unlock()

		fmt.Println(i, "work done")
	}
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			work(i)
		}(1 + i)
	}
	wg.Wait()
	fmt.Println("all work done")
}

https://gopherjs.github.io/playground/#/uWcv2N7D2L

Locking the mutex a second time in the same goroutine seems to be the trigger. Note that time.Sleep(2 * time.Millisecond) ensures that the mutex enters "starvation" mode, since waiters end up blocked for longer than the 1ms starvationThresholdNs (https://github.com/golang/go/blob/77ea9f9f315befca503b55bc3b62fcb2d8cb9b32/src/sync/mutex.go#L135).

EDIT: TestMutexFairness also triggers the panic (and is actually skipped because of that).

@canning-duck

Contributor

canning-duck commented Jan 5, 2018

OK, the real culprit is the channel communication between runtime_Semrelease and runtime_SemacquireMutex. If runtime_SemacquireMutex is called right after ch <- true (in runtime_Semrelease) but before the corresponding <-ch reception (from a previous call to runtime_SemacquireMutex), the method acquires the semaphore because it is != 0 (it has not been decremented yet), so two goroutines end up acquiring the same semaphore at the same time. The panic is raised in Go 1.9 because the new runtime checks for consistency and (rightfully) detects that one of the two acquisitions is bogus. Outside this check, the effect is probably smoothed out by the fact that the JS runtime is single-threaded.
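
To make the window concrete, here is an annotated, simplified sketch of the two functions (again paraphrased, not the exact source). Say G1 is a goroutine blocked in runtime_SemacquireMutex, G2 unlocks the mutex, and G3 tries to lock it; the numbers mark the order in which things happen when the race hits:

var semWaiters = make(map[*uint32][]chan bool)

func runtime_Semrelease(s *uint32, handoff bool) {
	*s++ // (1) G2: the semaphore becomes non-zero
	w := semWaiters[s]
	if len(w) == 0 {
		return
	}
	ch := w[0]
	semWaiters[s] = w[1:]
	ch <- true // (2) G2: wakes G1, but G1 does not run immediately
}

func runtime_SemacquireMutex(s *uint32, lifo bool) {
	if *s == 0 { // (3) G3: runs here, sees *s == 1, skips waiting entirely
		ch := make(chan bool)
		semWaiters[s] = append(semWaiters[s], ch)
		<-ch
	}
	*s-- // (4) G1: only decrements now; both G1 and G3 believe they hold the semaphore
}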

I'm preparing a pull request to fix this and implement the lifo parameter (because this one is easy); a rough sketch of the idea is below. handoff is still beyond my understanding (and probably requires passing the parameter through to the channel handling code in GopherJS; I'm not ready to look at that).
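
The rough shape of the fix I have in mind (a sketch only; semAwoken is my own working name, and the final PR may look different): keep track of wakeups that have been sent but not yet consumed, and treat those units of the semaphore as already reserved so that a fresh caller cannot steal them:

var (
	semWaiters = make(map[*uint32][]chan bool)
	semAwoken  = make(map[*uint32]uint32) // woken by Semrelease, not yet resumed
)

func runtime_SemacquireMutex(s *uint32, lifo bool) {
	if *s-semAwoken[s] == 0 { // ignore units reserved for already-woken waiters
		ch := make(chan bool)
		if lifo {
			semWaiters[s] = append([]chan bool{ch}, semWaiters[s]...)
		} else {
			semWaiters[s] = append(semWaiters[s], ch)
		}
		<-ch
		semAwoken[s]-- // our reservation is consumed here
	}
	*s--
}

func runtime_Semrelease(s *uint32, handoff bool) {
	// handoff is still ignored for now.
	*s++
	w := semWaiters[s]
	if len(w) == 0 {
		return
	}
	ch := w[0]
	semWaiters[s] = w[1:]
	if len(semWaiters[s]) == 0 {
		delete(semWaiters, s)
	}
	semAwoken[s]++ // reserve one unit for the goroutine we are about to wake
	ch <- true
}

With that bookkeeping, a goroutine that calls runtime_SemacquireMutex between the ch <- true and the woken goroutine's resumption sees the semaphore as unavailable and queues up instead of acquiring it a second time.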

canning-duck added a commit to canning-duck/gopherjs that referenced this issue Jan 5, 2018

r-l-x added a commit to r-l-x/gopherjs that referenced this issue Jan 5, 2018

@dmitshur

Member

dmitshur commented Jan 9, 2018

Hey @canning-duck.

Thanks for filing this issue and investigating it!

When implementing the changes for Go 1.9, I ran into this issue and didn't have enough time to investigate it fully; that's why it was left as a TODO. See the commit message of d8a12c0. It didn't seem critical enough at the time, but of course a fix is very welcome.

The explanation you posted here is fantastic, thanks again.

@dmitshur dmitshur closed this in 9b73d05 Jan 18, 2018
