@ethanburrelldd
Contributor

@ethanburrelldd ethanburrelldd commented Sep 12, 2025

Summary

Adds allowOversubscription property to control whether weighted operations can exceed concurrency limits.

Details

Previously, weighted operations could be started in a case where the sum of all running operations would exceed the concurrency limit.

This PR adds an allowOversubscription property that allows more control over concurrency behavior while maintaining backward compatibility with the current behavior. The existing behavior remains the default because it lets the CPU sit idle for fewer cycles.

  • allowOversubscription = true (default): keeps the existing behavior; the sum of running operation weights can exceed the limit
  • allowOversubscription = false: enforces the concurrency limit, preventing the operation from starting until enough concurrency is available for its weight

In my company's repo, we're running into resource issues where many small projects get scheduled alongside a large project that should be running alone. In these cases the expensive operation eats up CPU resources, causing the smaller operations to time out. For this expensive project we will now use allowOversubscription: false, weight: 8, where 8 is the parallelism set when running rush build -p 8. This should allow the expensive project to run with a higher level of isolation.

Example

With maxConcurrency = 8 and concurrentUnitsInProgress = 4 after the last task exited

  • Previously: An operation of weight 5 could start
  • Now (allowOversubscription=true): An operation of weight 5 could start
  • Now (allowOversubscription=false): An operation of weight 5 will wait to start until concurrentUnitsInProgress <= 3
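
The arithmetic in this example can be checked with a small predicate. This is a minimal sketch, not the code in the PR: `canStart` and its parameter names are hypothetical, and it assumes (based on the description above) that the existing behavior admits an operation whenever any capacity remains, while the strict mode requires the whole weight to fit under the limit.

```typescript
// Hypothetical admission check; the name and signature are illustrative,
// not part of the library's API.
function canStart(
  weight: number,
  concurrentUnitsInProgress: number,
  maxConcurrency: number,
  allowOversubscription: boolean
): boolean {
  if (allowOversubscription) {
    // Assumed existing rule: start as long as at least one unit is free,
    // even if the operation's weight pushes the total past the limit.
    return concurrentUnitsInProgress < maxConcurrency;
  }
  // Strict rule: the operation's entire weight must fit under the limit.
  return concurrentUnitsInProgress + weight <= maxConcurrency;
}

const maxConcurrency: number = 8;

// 4 + 5 = 9 exceeds 8, but oversubscription is allowed.
console.log(canStart(5, 4, maxConcurrency, true)); // true
// 4 + 5 = 9 exceeds 8, so the operation waits.
console.log(canStart(5, 4, maxConcurrency, false)); // false
// 3 + 5 = 8 fits exactly, so the operation can start.
console.log(canStart(5, 3, maxConcurrency, false)); // true
```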

Changes:

  • Added allowOversubscription option to command-line.schema.json (defaults to true)
  • Updated CommandLineJson.ts class to support this option
  • Propagated allowOversubscription through the operation lifecycle
  • Updated _forEachWeightedAsync to handle the cases when this is set
  • Added test coverage for this new option
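
To illustrate the effect of the flag on scheduling, here is a self-contained sketch of a weighted runner. It is a simplified stand-in written for this example, not the actual `_forEachWeightedAsync` implementation; `runWeighted`, `IWeightedTask`, and the simulated durations are all hypothetical. It reports the peak number of concurrency units in use under each setting.

```typescript
// Hypothetical task shape for this sketch only.
interface IWeightedTask {
  name: string;
  weight: number;
  durationMs: number;
}

// Simplified stand-in for a weighted forEach; NOT the real library code.
// Resolves to the peak number of concurrency units that were in use.
async function runWeighted(
  tasks: IWeightedTask[],
  concurrency: number,
  allowOversubscription: boolean
): Promise<number> {
  let unitsInProgress: number = 0;
  let peakUnits: number = 0;
  let nextIndex: number = 0;
  const running: Set<Promise<void>> = new Set();

  while (nextIndex < tasks.length || running.size > 0) {
    // Start queued tasks in order for as long as the admission rule allows.
    while (nextIndex < tasks.length) {
      const task: IWeightedTask = tasks[nextIndex];
      // Clip the weight so a task heavier than the limit can still run alone.
      const weight: number = Math.min(Math.max(0, task.weight), concurrency);
      const canStart: boolean = allowOversubscription
        ? unitsInProgress < concurrency
        : unitsInProgress + weight <= concurrency;
      if (!canStart) {
        break;
      }

      nextIndex++;
      unitsInProgress += weight;
      peakUnits = Math.max(peakUnits, unitsInProgress);

      // Simulate the work with a timer, then release the task's units.
      const promise: Promise<void> = new Promise<void>((resolve) =>
        setTimeout(resolve, task.durationMs)
      ).then(() => {
        unitsInProgress -= weight;
        running.delete(promise);
      });
      running.add(promise);
    }

    // Wait for any running task to finish before trying to start more.
    if (running.size > 0) {
      await Promise.race(running);
    }
  }

  return peakUnits;
}

const demoTasks: IWeightedTask[] = [
  { name: "small-a", weight: 1, durationMs: 20 },
  { name: "small-b", weight: 1, durationMs: 20 },
  { name: "expensive", weight: 8, durationMs: 20 }
];

// Oversubscribed: all three start together, so the peak is 1 + 1 + 8 = 10.
runWeighted(demoTasks, 8, true).then((peak) => console.log("oversubscribed peak:", peak));
// Strict: the expensive task waits for both small tasks, so the peak is 8.
runWeighted(demoTasks, 8, false).then((peak) => console.log("strict peak:", peak));
```

In the strict run the expensive task executes alone, which is the isolation the PR is after, at the cost of the idle capacity left while the small tasks drain.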

How it was tested

Can I please get a pre-release so that I can test this version on our repo?

  • integration testing by patching the library in production repo
  • testing in rush-redis-cobuild-plugin-integration-test
  • unit tests

Impacted documentation

@ethanburrelldd
Contributor Author

@microsoft-github-policy-service agree company="DoorDash"

Contributor

@D4N14L D4N14L left a comment

Overall looks good, though would like @dmichon-msft to take a look here.

@dmichon-msft
Contributor

dmichon-msft commented Sep 12, 2025

For clarity, this isn't a bugfix, it's a behavior change. Exceeding the concurrency was the original intended design for how to handle large tasks. Waiting for sufficient capacity to take the entire job results in the CPUs spending more time idling and in general is expected to slow down overall completion.

@dmichon-msft
Contributor

I think the safest way to address the competing priorities would be to add an extra option maxOversubscription or similar (with a default of 0), which affects how far a large operation is allowed to push the max concurrency over the limit temporarily.

@ethanburrelldd
Contributor Author

ethanburrelldd commented Sep 15, 2025

@dmichon-msft I'd like to get clarity on the intended behavior for the concurrency parameter to check that I'm understanding this library's expected behavior correctly.

The JSDoc for concurrency states it should "limit the maximum number of concurrent promises to the specified number." My change enforces this as a strict limit, but I wanted to understand if the previous behavior that allowed tasks to exceed this limit was intentional (and the docs need updating) or if it was a bug.

Here are the options we have:

  1. Allow oversubscription: Let large tasks exceed the limit (previous behavior)
  2. Strict limit: Wait until sufficient capacity is available (current approach)
  3. Task-level control: Let individual tasks opt into exceeding limits

If we want to go ahead with #3, I'd suggest using a boolean allowExceedingConcurrency (default = true) to configure this behavior at the task level. I think the numeric approach creates unpredictable and hard-to-configure behavior. Example: if running a cobuild on 2 agents with concurrency=4 and tasks [1,2,3,4], depending on scheduling you might get:

  • 1, 3 and 2, 4 (second agent exceeding limit by 2)
    • if maxOversubscription = 1 then 4 would wait for 2 to finish before executing
  • 1, 4 and 2, 3 (both agents exceeding limit by 1)
    • if maxOversubscription = 1 then 4 would execute despite being over concurrency of 4
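
The two scheduling outcomes above can be reproduced with a short sketch. `maxOversubscription` is only a proposal in this thread, so the function below is hypothetical: it assumes the numeric option would admit an operation when the resulting total stays within `concurrency + maxOversubscription`.

```typescript
// Hypothetical check for the proposed numeric option; this is an assumed
// interpretation of the proposal, not an existing API.
function canStartWithOverage(
  unitsInProgress: number,
  weight: number,
  concurrency: number,
  maxOversubscription: number
): boolean {
  return unitsInProgress + weight <= concurrency + maxOversubscription;
}

const concurrency: number = 4;
const maxOversubscription: number = 1;

// Agent already running the weight-2 task: 2 + 4 = 6 > 5, so weight 4 waits.
console.log(canStartWithOverage(2, 4, concurrency, maxOversubscription)); // false
// Agent already running the weight-1 task: 1 + 4 = 5 <= 5, so weight 4 runs
// even though the total exceeds the concurrency of 4.
console.log(canStartWithOverage(1, 4, concurrency, maxOversubscription)); // true
```

The same weight-4 task is admitted or delayed depending on which task happened to land on the agent first, which is the unpredictability described above.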

I think a boolean keeps the behavior deterministic and simpler for developers to reason about, whereas an overage amount is difficult to reason about. I agree that changing this behavior could affect build times of existing repos. Let me know the best way to land this safely while exposing this isolation logic to project maintainers.

Please let me know which approach we'd like to go forward with and I can update this PR.

@dmichon-msft
Contributor

I can work with a boolean control, seems simple enough.

Oversubscription was supported in the original design to deal with the scenario of "what happens if you specify a max concurrency of less than the largest operation weight", but arguably that gets handled by clipping the weights to the max concurrency. The other consideration is that if you have 16 cores, are running a long operation that takes 1, and have a queued operation that can use 16 cores (but in practice uses up to that, whatever it can get), then having to wait for that long running operation is wasteful.
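
Both points in this comment can be sketched in a few lines. The helper names are hypothetical and the strict rule is the assumed "whole weight must fit" check; the sketch shows that clipping lets an over-weight operation start on an idle scheduler, and that a strict limit makes a full-width operation wait behind a single small one.

```typescript
// Hypothetical helper: clamp a weight into the range [0, concurrency].
function clipWeight(weight: number, concurrency: number): number {
  return Math.min(Math.max(0, weight), concurrency);
}

// Assumed strict rule: the (clipped) weight must fit under the limit.
function fitsStrictly(unitsInProgress: number, weight: number, concurrency: number): boolean {
  return unitsInProgress + clipWeight(weight, concurrency) <= concurrency;
}

// Max concurrency below the largest weight: without clipping, 0 + 16 <= 8 is
// never true, so the operation could never start under a strict limit.
console.log(0 + 16 <= 8); // false
// With clipping, the weight becomes 8 and the operation starts once idle.
console.log(fitsStrictly(0, 16, 8)); // true

// 16 cores, one long-running weight-1 operation in progress: a queued
// weight-16 operation has to wait, leaving 15 units idle.
console.log(fitsStrictly(1, 16, 16)); // false
```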

Arguably the best way to handle heavy jobs is probably to tune your configured operation weights to take better advantage of the hardware (or better yet, to shard that expensive operation so that it can be scheduled more easily).

@ethanburrelldd ethanburrelldd changed the title from "[node-core-library] Fix weighted oversubscription" to "[node-core-library] Add allowOversubscription option" Sep 16, 2025
@dmichon-msft
Contributor

I apologize for the miscommunication; I think allowOversubscription should be a flag in the options to Async.forEachAsync, not something we try to specify for individual tasks. If you try to do it on individual tasks the algorithm gets really confusing, because theoretically you should only engage in oversubscription if all currently executing tasks allow it.

@aramissennyeydd
Contributor

@dmichon-msft Chiming in here a little late (I've been sick the past few days), hopefully adding a little more context on how we ended up here. We've been investigating a whole slew of unit test flakes recently that are pretty easy to track down to "this test phase ran with our other expensive unit test phase" or "this test phase ran with our expensive NextJS app build phase". In those cases the test phase is weight 1 and the expensive phases are already weight 8. (the unit tests we're running have already been sharded and the NextJS app build can't be :( )

To address the flakes, we have a few options:

  1. Try to drop rush parallelism to 1 so that all phases are run in isolation and don't impact other phases. We've tested this and it's caused a very significant slowdown in CI execution (upwards of 3-4x slower).
  2. Update all test phase weights to 8. Basically treating all tests as the problem and ensuring only 1 runs at a time. Also causes a significant slowdown.
  3. This PR to try and isolate the known offending phases so that they don't impact the rest of the executing operations.

@ethanburrelldd
Contributor Author

@dmichon-msft

Thanks for the feedback. I've added the allowOversubscription option to command-line.json, which gives users more granular control over how their parallelism works.

I'd appreciate another review on this. Whenever this is close to approval, it would be great to have a preview release so I can test on my team's repo.

@ethanburrelldd
Contributor Author

ethanburrelldd commented Sep 19, 2025

@iclanton is this PR good to merge? I can test via preview in our repo if you'd like more testing before this goes in.

@ethanburrelldd
Contributor Author

@dmichon-msft @iclanton @D4N14L

Hey Team, this PR seems to be a solid fix for the test flakes and performance issues we're experiencing when several expensive projects run on the same build agent. Could you provide a timeline for a merge or a dev preview so we can test it out? Thanks for your patience with my pings, we're just really excited to get this resolved. 😃

Collaborator

  /**
   * Controls whether operations can start even if doing so would exceed the total concurrency limit.
   * If true (default), will start operations even when they would exceed the limit.
   * If false, waits until sufficient capacity is available.
   */
  allowOversubscription?: boolean;

Rush Stack's convention is that optional booleans always default to false.

The allowOversubscription=false behavior seems like a more natural/intuitive operation, so maybe we should make false the default?

Although that's technically a "breaking" change, a bit less parallelism in an edge case is unlikely to break anyone's existing code. In fact, it's arguably a bugfix.

@dmichon-msft

Collaborator

I've updated the comments and changed the default to false for Async but not for Rush.

Collaborator

@ethanburrelldd Do you think we should change the default for Rush as well?

Contributor Author

Changing the default for Async to false makes sense, but I'm worried about changing the default for Rush. It might slow things down for repo maintainers who update their version. Since the previous behavior allowed for oversubscription, keeping true as the default for Rush seems like the right move to avoid breaking things for other maintainers.

@octogonz
Collaborator

🚀 @microsoft/rush version 5.158.1-pr5355.0 has been published.

@ethanburrelldd Let us know if it solves your problem.

@iclanton
Member

iclanton commented Oct 6, 2025

@ethanburrelldd - Did that release work for you?

@iclanton iclanton moved this from Needs triage to In Progress in Bug Triage Oct 6, 2025
@ethanburrelldd
Contributor Author

Thanks for generating the pre-release! We're seeing decreased parallelism alongside tasks that have weight = concurrency limit (e.g., @org/main-web-app (cotest) - shard 1/4). This allows a small subset of expensive tests from projects with weight < concurrency limit to perform better, since they no longer run at the same time as the long-running expensive test cases.

Here's a snapshot comparing build plans before / after the preview release.

rushVersion = 5.157.0:
   @org/lib-ui-components (build-storybook) ----------------####----------------------------------------------------------------  59.4s
                @org/service-web-app (test) -----------------#######------------------------------------------------------------  78.6s
                      @org/chat-app (build) -----------------###----------------------------------------------------------------  28.0s
     @org/main-web-app (cotest) - shard 1/4 -----------------#####################################------------------------------ 632.7s

rushVersion = 5.158.1-pr5355.0:
                @org/small-utility (build) --#---------------------------------------------------------------------------------    0.1s
    @org/main-web-app (cotest) - shard 1/4 ---#####################################################---------------------------- 643.7s

@octogonz
Collaborator

octogonz commented Oct 7, 2025

Great, thanks for following up!

@octogonz octogonz merged commit 3a5cc0e into microsoft:main Oct 7, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Closed in Bug Triage Oct 7, 2025