
runtime: scheduler stuck on Go 1.14.1 #38223

Closed
muirdm opened this issue Apr 2, 2020 · 2 comments
@muirdm muirdm commented Apr 2, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14.1 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

darwin/amd64 in dev, linux/amd64 in production

What did you do?

Upgraded to Go 1.14.1.

What did you expect to see?

Systems nominal.

What did you see instead?

Some services in production got stuck after a few hours. They were pinning a single core at 100% CPU, with strace activity like:

$ sudo strace -p 7781 -f
strace: Process 7781 attached with 12 threads
[pid 13328] futex(0xc0016d7d48, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7888] futex(0xc002def2c8, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7810] futex(0xc000566bc8, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7814] futex(0xc002112148, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7800] futex(0x582cf88, FUTEX_WAIT_PRIVATE, 0, {0, 100000} 
[pid  7788] futex(0x586ac18, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7787] futex(0xc000100148, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7786] futex(0x586ad20, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7785] sched_yield( 
[pid  7784] futex(0xc00006a848, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7785] <... sched_yield resumed> ) = 0
[pid  7783] nanosleep({0, 20000},  
[pid  7785] sched_yield( 
[pid  7800] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  7785] <... sched_yield resumed> ) = 0
[pid  7800] futex(0x582cf88, FUTEX_WAIT_PRIVATE, 0, {0, 100000} 
[pid  7785] sched_yield()               = 0
[pid  7781] futex(0x582dc28, FUTEX_WAIT_PRIVATE, 0, NULL 
[pid  7785] sched_yield( 
[pid  7783] <... nanosleep resumed> NULL) = 0
[pid  7785] <... sched_yield resumed> ) = 0
[pid  7783] nanosleep({0, 20000},  
[pid  7785] sched_yield()               = 0
[pid  7785] sched_yield( 
[pid  7800] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  7785] <... sched_yield resumed> ) = 0
[pid  7800] futex(0x582cf88, FUTEX_WAIT_PRIVATE, 0, {0, 100000} 
[pid  7785] sched_yield()               = 0
[pid  7783] <... nanosleep resumed> NULL) = 0
[pid  7785] sched_yield()               = 0
[pid  7785] sched_yield( 
[pid  7783] nanosleep({0, 20000},  
[pid  7785] <... sched_yield resumed> ) = 0
[pid  7800] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  7785] sched_yield( 

It looked like no goroutines were getting scheduled. Even our internal debug HTTP server was not responding, which I'd never seen before.

I think I have been able to reproduce it locally on macOS and Linux (Docker). On macOS it seems to use 50% CPU instead of 100%. Unfortunately, the repro involves running queries against a Cassandra database using the gocql driver.

  1. Get Cassandra running: docker run -p 9042:9042/tcp --name cassandra-repro -d cassandra:latest
  2. Install gocql: go get github.com/gocql/gocql@v0.0.0-20200221113847-372a19b1a852 (I haven't tried other versions of gocql yet).
  3. Run the program below:
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	c := gocql.NewCluster("127.0.0.1")
	c.Timeout = 5 * time.Second
	c.NumConns = 30

	sess, err := c.CreateSession()
	if err != nil {
		panic(err)
	}

	// Heartbeat: prints every five seconds so a hang is visible.
	go func() {
		for {
			fmt.Println("still running")
			time.Sleep(5 * time.Second)
		}
	}()

	var iter int
	for {
		var wg sync.WaitGroup

		// Fire off 5000 concurrent queries per iteration.
		for i := 0; i < 5000; i++ {
			wg.Add(1)

			go func() {
				defer wg.Done()

				err := sess.Query("select * from system.peers").Exec()
				if err != nil {
					panic(err)
				}
			}()
		}

		wg.Wait()

		fmt.Println(iter)

		iter++
	}
}

It usually reproduces for me within 60 seconds, but sometimes it takes longer. It may be easier to reproduce on Linux.

@ianlancetaylor ianlancetaylor commented Apr 2, 2020

Could be a dup of #38023/#38070. Those issues will be fixed in the upcoming 1.14.2 release. In the meantime, can you try tip?

@muirdm muirdm commented Apr 2, 2020

Seems to be the same issue. Not sure how I missed those (well, I am sure: I didn't search closed issues).
