You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This proposal aims to address the scheduling issue in our test execution system (#61233) where a run task might be scheduled while an earlier run task is still waiting on its compile dependencies. This was introduced in a change to enforce tests start in package order. This situation can lead to a worker goroutine being blocked on a channel, which in turn may starve the worker pool. My proposed solution is to introduce chained barrier tasks that ensure run tasks are scheduled in order. This would ensure that any channel lockups resolve immediately, while maintaining start order of package tests.
Context
In our current DAG structure:
We have run tasks per test package
Each run task has a corresponding compile task as its child.
All tasks are given a priority, based off their insertion order into the DAG
Run tasks use an internal channel-based locking mechanism to enforce that they start in insertion order.
When scheduling work off this DAG structure, we do the following:
Initialize a worker pool of goroutines (the -p flag)
Create a set of of all leaf nodes (tasks with no dependencies)
Pick the leaf node with the highest priority (as defined in the last section)
Upon completion, signal to all parents of its completion
Add to the set of leaf nodes, any nodes that no longer have an children that have not been completed
Issue
Due to varying compile task dependency chains (which may have different depths and breadths), it’s possible for a later compile task (say, compile_8) to finish before an earlier one (such as compile_0). Without additional coordination, the run_8 task could be scheduled while run_0 is still waiting on its compile dependencies. The work scheduler will have a goroutine execute run_8, but will lockup, waiting for all preceding run tasks to start.
This can cause situations where the entire worker pool of goroutines get locked in channel receives, save for one. This effectively reduces the concurrency to become single-threaded.
We at Samsara saw some of our larger test suites go from 15m => 90m runtimes when running with a pool of 16 (-p 16) and saw task concurrency collapse and become serialized to only running one task at a time.
Proposed Solution
I propose adding a barrier task for each run task to enforce scheduling order. Each barrier task ensures that:
Its associated compile task is complete.
It does not clear until the previous barrier task has cleared.
By making each run task depend on its corresponding barrier task, we ensure that run tasks are scheduled strictly in insertion order—even if compile tasks finish out of order. This extra layer of coordination prevents a run task from being scheduled prematurely and thus avoids worker lock-ups.
Design Details
Barrier Task per Run Task
For every run task run_n, add a barrier task barrier_n.
Barrier-to-Compile:
Each barrier task (barrier_n) depends on its corresponding compile task (compile_n), ensuring that the compile phase is complete before proceeding.
Chained Barrier Dependency:
For every barrier task with n > 0, add a dependency on the previous barrier task (barrier_n depends on barrier_(n-1)). This horizontal dependency guarantees that a run task is not scheduled until all earlier run tasks have completed their barriers, and thus compile tasks.
Run Task Dependency
Each run task is modified to depend on its corresponding barrier task. This makes sure that even if a compile task completes early, its run task won’t be scheduled until all prior run tasks have been processed.
Each run_n now depends on a corresponding barrier_n.
Each barrier_n depends on its associated compile_n task.
Additionally, each barrier (starting with barrier_1) has a horizontal dependency on the previous barrier (e.g., barrier_1 depends on barrier_0).
This chain ensures that a run task (e.g., run_1) is scheduled only after its compile task is complete and after the previous run task (via barrier_0) has been fully processed.
Proof: Barriers Enforce Scheduling Order
Goal
Prove that run_n cannot be scheduled until all run_{<n} are also schedulable.
Base Case (n = 0)
run_0 depends on barrier_0
barrier_0 depends on compile_0
So run_0 becomes a DAG leaf once compile_0 is done
Inductive Step (n > 0)
run_n depends on barrier_n
barrier_n depends on:
compile_n
barrier_{n-1}
Therefore:
run_n is not schedulable until barrier_{n-1} is complete
By induction, all barrier_{<n} must be complete first
So all run_{<n} must have already been schedulable
Conclusion
The chain of barriers ensures that run tasks are only scheduled once all earlier run tasks are ready. This avoids scheduling tasks that will block on internal locks and restores proper utilization of the worker pool.
The text was updated successfully, but these errors were encountered:
seankhliao
changed the title
proposal: go/src/internal/test: prevent worker starvation via ordered scheduling using barrier tasks
proposal: testing: prevent worker starvation via ordered scheduling using barrier tasks
Apr 1, 2025
I'll add that I do think this approach is pretty heavy handed. I think the better solution would be to remove the start-in-order guarantee, so we can just remove the channel locking completely.
Overview
This proposal aims to address the scheduling issue in our test execution system (#61233) where a run task might be scheduled while an earlier run task is still waiting on its compile dependencies. This was introduced in a change to enforce tests start in package order. This situation can lead to a worker goroutine being blocked on a channel, which in turn may starve the worker pool. My proposed solution is to introduce chained barrier tasks that ensure run tasks are scheduled in order. This would ensure that any channel lockups resolve immediately, while maintaining start order of package tests.
Context
In our current DAG structure:
When scheduling work off this DAG structure, we do the following:
-p
flag)Issue
Due to varying compile task dependency chains (which may have different depths and breadths), it’s possible for a later compile task (say, compile_8) to finish before an earlier one (such as compile_0). Without additional coordination, the run_8 task could be scheduled while run_0 is still waiting on its compile dependencies. The work scheduler will have a goroutine execute run_8, but will lockup, waiting for all preceding run tasks to start.
This can cause situations where the entire worker pool of goroutines get locked in channel receives, save for one. This effectively reduces the concurrency to become single-threaded.
We at Samsara saw some of our larger test suites go from 15m => 90m runtimes when running with a pool of 16 (
-p 16
) and saw task concurrency collapse and become serialized to only running one task at a time.Proposed Solution
I propose adding a barrier task for each run task to enforce scheduling order. Each barrier task ensures that:
By making each run task depend on its corresponding barrier task, we ensure that run tasks are scheduled strictly in insertion order—even if compile tasks finish out of order. This extra layer of coordination prevents a run task from being scheduled prematurely and thus avoids worker lock-ups.
Design Details
Barrier Task per Run Task
For every run task run_n, add a barrier task barrier_n.
Barrier-to-Compile:
Each barrier task (barrier_n) depends on its corresponding compile task (compile_n), ensuring that the compile phase is complete before proceeding.
Chained Barrier Dependency:
For every barrier task with n > 0, add a dependency on the previous barrier task (barrier_n depends on barrier_(n-1)). This horizontal dependency guarantees that a run task is not scheduled until all earlier run tasks have completed their barriers, and thus compile tasks.
Run Task Dependency
Each run task is modified to depend on its corresponding barrier task. This makes sure that even if a compile task completes early, its run task won’t be scheduled until all prior run tasks have been processed.
Text Diagrams
Before (Original Graph)
Explanation
After (With Chained Barrier Tasks)
Explanation
Proof: Barriers Enforce Scheduling Order
Goal
Prove that run_n cannot be scheduled until all run_{<n} are also schedulable.
Base Case (n = 0)
Inductive Step (n > 0)
Conclusion
The chain of barriers ensures that run tasks are only scheduled once all earlier run tasks are ready. This avoids scheduling tasks that will block on internal locks and restores proper utilization of the worker pool.
The text was updated successfully, but these errors were encountered: