runtime: possible goroutine starvation with Linux FIFO scheduler and limited CPU cores #76353

@nanrong-zhang

Description

Due to specific environmental constraints, I had to run Docker (built with Go toolchain version 1.20.5) on a system with only four CPU cores, with the processes configured to use the Linux FIFO (SCHED_FIFO) scheduling policy. The daemon exhibited occasional freezes, and I've identified a reliable reproduction method: performing highly concurrent docker load operations on images.

Upon investigation, all four dockerd processes were busy-looping inside the runtime's bgsweep function. Based on the function's implementation and the hotspot instructions reported by perf top, the livelock can be summarized as follows:

func bgsweep(c chan int) {
    // ...
    for {
        const sweepBatchSize = 10
        nSwept := 0
        for sweepone() != ^uintptr(0) {    // when nothing is left to sweep, sweepone returns ^uintptr(0)
            sweep.nbgsweep++
            nSwept++
            if nSwept%sweepBatchSize == 0 {
                goschedIfBusy()
            }
        }
        for freeSomeWbufs(true) {    // when no workbufs remain to free, freeSomeWbufs returns false
            goschedIfBusy()
        }
        lock(&sweep.lock)
        if !isSweepDone() {    // isSweepDone returns false, indicating other sweepers haven't finished
            unlock(&sweep.lock)
            // Go 1.25 adds a goschedIfBusy() call here
            continue    // spin until other sweepers complete, busy-looping without yielding
        }
        sweep.parked = true
        goparkunlock(&sweep.lock, waitReasonGCSweepWait, traceEvGoBlock, 1)
    }
}

Under SCHED_FIFO, a running thread is never preempted by threads of equal or lower priority, so the OS threads carrying the other sweepers and the sysmon thread never get CPU time, and the busy loop never ends. After switching to the Round-Robin (SCHED_RR) scheduling policy, the issue no longer reproduces.

I noticed that in Go 1.25, goschedIfBusy() is explicitly called when other sweepers haven't completed their work, which should resolve this particular issue. However, Go's goroutine scheduling mechanism appears particularly susceptible to similar problems when combined with FIFO scheduling policy, especially under CPU-constrained conditions.

I am uncertain whether this is a concern that needs broader attention, or whether I should simply avoid the FIFO scheduling policy altogether. I'm looking forward to the community's feedback, thanks.

Metadata

Labels:
    BugReport: Issues describing a possible bug in the Go implementation.
    NeedsInvestigation: Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
    compiler/runtime: Issues related to the Go compiler and/or runtime.
