proposal: testing: prevent worker starvation via ordered scheduling using barrier tasks #73106


Open
jsm opened this issue Mar 31, 2025 · 2 comments
Labels: Implementation (issues describing a semantics-preserving change to the Go implementation), Proposal
Comments

jsm commented Mar 31, 2025

Overview

This proposal addresses a scheduling issue in the test execution system (#61233) where a run task can be scheduled while an earlier run task is still waiting on its compile dependencies. The issue was introduced by a change that enforces tests starting in package order. This situation can leave a worker goroutine blocked on a channel, which in turn can starve the worker pool. My proposed solution is to introduce chained barrier tasks that ensure run tasks are scheduled in order. Any channel lockup then resolves immediately, while the start order of package tests is preserved.

Context

In our current DAG structure:

  • We have one run task per test package.
  • Each run task has a corresponding compile task as its child.
  • All tasks are given a priority based on their insertion order into the DAG.
  • Run tasks use an internal channel-based locking mechanism to enforce that they start in insertion order.

When scheduling work off this DAG structure, we do the following:

  • Initialize a worker pool of goroutines (sized by the -p flag)
  • Create a set of all leaf nodes (tasks with no dependencies)
  • Have each worker pick the leaf node with the highest priority (as defined in the last section)
  • Upon a task's completion, signal its completion to all of its parents
  • Add to the set of leaf nodes any nodes all of whose children have completed

Issue

Due to varying compile task dependency chains (which may have different depths and breadths), it’s possible for a later compile task (say, compile_8) to finish before an earlier one (such as compile_0). Without additional coordination, the run_8 task can be scheduled while run_0 is still waiting on its compile dependencies. The work scheduler will have a goroutine execute run_8, but that goroutine will lock up, waiting for all preceding run tasks to start.

This can leave every worker goroutine but one blocked in a channel receive, effectively reducing concurrency to a single thread.

We at Samsara saw some of our larger test suites go from 15m to 90m runtimes when running with a pool of 16 (-p 16): task concurrency collapsed, serializing to only one task running at a time.

Proposed Solution

I propose adding a barrier task for each run task to enforce scheduling order. Each barrier task ensures that:

  1. Its associated compile task is complete.
  2. It does not clear until the previous barrier task has cleared.

By making each run task depend on its corresponding barrier task, we ensure that run tasks are scheduled strictly in insertion order, even if compile tasks finish out of order. This extra layer of coordination prevents a run task from being scheduled prematurely and thus avoids worker lock-ups.

Design Details

Barrier Task per Run Task

For every run task run_n, add a barrier task barrier_n.

Barrier-to-Compile:

Each barrier task (barrier_n) depends on its corresponding compile task (compile_n), ensuring that the compile phase is complete before proceeding.

Chained Barrier Dependency:

For every barrier task with n > 0, add a dependency on the previous barrier task (barrier_n depends on barrier_(n-1)). This horizontal dependency guarantees that a run task is not scheduled until all earlier run tasks have cleared their barriers and, by extension, their compile tasks have completed.

Run Task Dependency

Each run task is modified to depend on its corresponding barrier task. This makes sure that even if a compile task completes early, its run task won’t be scheduled until all prior run tasks have been processed.

Text Diagrams

Before (Original Graph)

         root
           │
  ┌────────┼────────┐
  │        │        │
 run_0   run_1   run_2   ... run_n
   │        │        │
compile_0 compile_1 compile_2 ... compile_n

Explanation

  • The root node spawns run tasks (run_0, run_1, …).
  • Each run_n has a single child, compile_n.
  • Run tasks use internal channel-based locking to enforce start order.

After (With Chained Barrier Tasks)


         root
           │
  ┌────────┼────────┐
  │        │        │
 run_0   run_1   run_2   ... run_n
   │        │        │
barrier_0 barrier_1 ... barrier_n
   │        │        │
   │ <───── │ <───── │   (each barrier_n depends on barrier_(n-1))
   │        │        │
compile_0 compile_1 ... compile_n

Explanation

  • The root node spawns run tasks (run_0, run_1, …).
  • Each run_n now depends on a corresponding barrier_n.
  • Each barrier_n depends on its associated compile_n task.
  • Additionally, each barrier (starting with barrier_1) has a horizontal dependency on the previous barrier (e.g., barrier_1 depends on barrier_0).
  • This chain ensures that a run task (e.g., run_1) is scheduled only after its compile task is complete and after the previous run task (via barrier_0) has been fully processed.

Proof: Barriers Enforce Scheduling Order

Goal

Prove that run_n cannot be scheduled until all run_{<n} are also schedulable.

Base Case (n = 0)

  • run_0 depends on barrier_0
  • barrier_0 depends on compile_0
  • So run_0 becomes a DAG leaf once compile_0 is done

Inductive Step (n > 0)

  • run_n depends on barrier_n
  • barrier_n depends on:
    • compile_n
    • barrier_{n-1}
  • Therefore:
    • run_n is not schedulable until barrier_{n-1} is complete
    • By induction, all barrier_{<n} must be complete first
    • So all run_{<n} must have already been schedulable

Conclusion

The chain of barriers ensures that run tasks are only scheduled once all earlier run tasks are ready. This avoids scheduling tasks that will block on internal locks and restores proper utilization of the worker pool.

@jsm jsm added the Proposal label Mar 31, 2025
@gopherbot gopherbot added this to the Proposal milestone Mar 31, 2025
@gabyhelp gabyhelp added the Implementation Issues describing a semantics-preserving change to the Go implementation. label Mar 31, 2025
@seankhliao seankhliao changed the title proposal: go/src/internal/test: prevent worker starvation via ordered scheduling using barrier tasks proposal: testing: prevent worker starvation via ordered scheduling using barrier tasks Apr 1, 2025
jsm commented Apr 14, 2025

I'll add that I do think this approach is pretty heavy-handed. I think the better solution would be to remove the start-in-order guarantee entirely, so we can drop the channel locking completely.

jsm commented Apr 14, 2025

cc: @rsc as the person who introduced start-in-order behavior with https://go-review.googlesource.com/c/go/+/448357
