ci: split DDS coverage shards and parallelize merge-tree farm tests #26586

Closed
frankmueller-msft wants to merge 1 commit into microsoft:main from frankmueller-msft:ci/combined-pipeline-parallelization

Conversation

@frankmueller-msft (Contributor) commented Feb 27, 2026

Summary

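Split the single DDS mocha coverage shard (17m 04s) into 4 parallel jobs (5 mocha coverage jobs in total, counting the non-DDS shard), cutting the longest coverage shard from 17m 04s to 8m 08s, a 52% reduction. The DDS shard was the primary bottleneck on the build-client pipeline's wall-clock time.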

Changes:

  • Split the mocha coverage run into 5 shards: four DDS shards that replace ci:test:mocha:dds (tree, merge-tree:farm, merge-tree:unit, other) plus non-dds
  • Add mocha --parallel to the merge-tree farm shard (farm/fuzz tests with 5-minute timeouts dominate execution time)
  • Add dedicated .mocharc.farm.cjs and .mocharc.unit.cjs configs in packages/dds/merge-tree/
  • Add timeoutInMinutes and timing budget enforcement to coverage jobs (consistent with test jobs)
  • Skip coverage artifact publishing and Merge Coverage Reports job in the Internal project (where testCoverage is disabled)

Performance Results

Pipeline total: 41m 19s → 36m 15s (12% faster)

Baseline pipeline (critical-path jobs are marked crit)

gantt
    title Baseline — 41m 19s
    dateFormat mm-ss
    axisFormat %M:%S

    section Build
    Build ~18m                       :done, b, 00-00, 18m

    section Coverage
    MochaTestDds 17m 04s             :crit, done, c1, after b, 17m
    MochaTestNonDds 9m 08s           :done, c2, after b, 9m
    RealsvcLocalTest 6m 08s          :done, c3, after b, 6m
    Merge Coverage 5m 27s            :crit, done, m, after c1, 5m

    section Task Tests
    RealsvcTinyliciousTest 16m 47s   :done, t1, after b, 17m
    JestTest ~5m 30s                 :done, t2, after b, 6m
    StressTinyliciousTest ~5m        :done, t3, after b, 5m

Optimized pipeline (build #380986, critical-path jobs are marked crit)

gantt
    title Optimized — 36m 15s
    dateFormat mm-ss
    axisFormat %M:%S

    section Build
    Build 17m 48s                         :done, b, 00-00, 18m

    section Coverage (changed)
    MochaTestDdsMergeTreeUnit 8m 08s      :crit, done, c1, after b, 8m
    MochaTestDdsTree 6m 51s               :done, c2, after b, 7m
    MochaTestNonDds 6m 41s                :done, c3, after b, 7m
    MochaTestDdsMergeTreeFarm 6m 26s      :done, c4, after b, 6m
    RealsvcLocalTest 6m 08s               :done, c5, after b, 6m
    MochaTestDdsOther 4m 45s              :done, c6, after b, 5m
    Merge Coverage 5m 37s                 :crit, done, m, after c1, 6m

    section Task Tests (unchanged)
    RealsvcTinyliciousTest 14m 42s        :done, t1, after b, 15m
    JestTest 4m 52s                       :done, t2, after b, 5m
    StressTinyliciousTest 3m 29s          :done, t3, after b, 4m

Coverage shard breakdown

| Baseline Shard | Time | Optimized Shard(s) | Time |
| --- | --- | --- | --- |
| MochaTestDds (all 16 DDS packages) | 17m 04s | MochaTestDdsMergeTreeUnit | 8m 08s |
| | | MochaTestDdsTree | 6m 51s |
| | | MochaTestDdsMergeTreeFarm (--parallel) | 6m 26s |
| | | MochaTestDdsOther (13 packages) | 4m 45s |
| MochaTestNonDds | 9m 08s | MochaTestNonDds | 6m 41s |
| RealsvcLocalTest | 6m 08s | RealsvcLocalTest | 6m 08s |
| Merge Coverage Reports | 5m 27s | Merge Coverage Reports | 5m 37s |
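
The headline percentages follow directly from the durations in this breakdown. The sketch below recomputes them; the helper names toSeconds and reductionPercent are illustrative and not part of the PR.

```javascript
"use strict";

// Convert a duration such as "17m 04s" into seconds.
function toSeconds(duration) {
  const match = /^(\d+)m\s+(\d+)s$/.exec(duration.trim());
  if (match === null) {
    throw new Error(`Unrecognized duration: ${duration}`);
  }
  return Number(match[1]) * 60 + Number(match[2]);
}

// Percentage reduction going from `before` to `after`, rounded to a whole percent.
function reductionPercent(before, after) {
  const b = toSeconds(before);
  const a = toSeconds(after);
  return Math.round(((b - a) / b) * 100);
}

// Longest coverage shard: MochaTestDds vs. MochaTestDdsMergeTreeUnit.
console.log(reductionPercent("17m 04s", "8m 08s")); // 52
// Whole pipeline, baseline vs. optimized.
console.log(reductionPercent("41m 19s", "36m 15s")); // 12
```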

Why not split tinylicious tests?

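The single tinylicious shard (14m 42s) runs in parallel with the coverage path (8m 08s + 5m 37s merge = 13m 45s after build) and, by these timings, finishes 57 seconds after it. Splitting it could save at most that one minute before the coverage path becomes the limit again, so it would add complexity without meaningfully reducing pipeline time.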
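
The post-build comparison can be reproduced with a little arithmetic. This sketch uses the durations reported for the optimized run and assumes zero queueing and agent start-up time, which the real pipeline does not guarantee; the helper names are illustrative.

```javascript
"use strict";

const sec = (minutes, seconds) => minutes * 60 + seconds;

// Durations from the optimized run (build #380986).
const coverageShards = {
  MochaTestDdsMergeTreeUnit: sec(8, 8),
  MochaTestDdsTree: sec(6, 51),
  MochaTestNonDds: sec(6, 41),
  MochaTestDdsMergeTreeFarm: sec(6, 26),
  RealsvcLocalTest: sec(6, 8),
  MochaTestDdsOther: sec(4, 45),
};
const mergeCoverage = sec(5, 37);
const tinylicious = sec(14, 42);

// The merge job depends on every coverage job, so the slowest shard gates it.
const coveragePath = Math.max(...Object.values(coverageShards)) + mergeCoverage;

const format = (s) => `${Math.floor(s / 60)}m ${String(s % 60).padStart(2, "0")}s`;

console.log(`coverage path: ${format(coveragePath)}`);
console.log(`tinylicious:   ${format(tinylicious)}`);
console.log(`difference:    ${format(Math.abs(tinylicious - coveragePath))}`);
```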

Internal project behavior

This pipeline runs in both the Public and Internal ADO projects. Coverage instrumentation (c8) is only enabled in Public (testCoverage: ${{ eq(variables['System.TeamProject'], 'public') }}).

In the Internal project:

  • The 5 coverage test jobs still run their tests, but without c8 instrumentation, so they finish faster
  • Coverage artifact publishing is skipped (no nyc/.nyc_output data to publish)
  • The Merge Coverage Reports job is skipped entirely (no coverage data to merge)

This avoids wasting an agent slot on the Merge Coverage job in Internal, where it would only do setup and then fail on empty data.

Files changed

| File | Change |
| --- | --- |
| package.json | Add DDS shard scripts |
| packages/dds/merge-tree/package.json | Add farm/unit mocha scripts |
| packages/dds/merge-tree/.mocharc.farm.cjs | New: farm/fuzz test config with parallel: true |
| packages/dds/merge-tree/.mocharc.unit.cjs | New: unit test config (excludes farm tests) |
| tools/pipelines/build-client.yml | 1 coverage entry → 5 entries |
| tools/pipelines/templates/build-npm-client-package.yml | Add timeouts, timing budgets; condition coverage artifacts + merge job on testCoverage |

Coverage verification

The shard split is purely organizational — it changes which CI job runs which tests, not which tests are run. Every test that ran before still runs exactly once:

  • All 16 DDS packages are covered across the 4 DDS shards with no gaps or overlaps:
    • tree shard: @fluidframework/tree (184 test files — the single largest DDS package)
    • merge-tree:farm shard: 9 farm/fuzz test files (*Farm*, beastTest*) run with --parallel
    • merge-tree:unit shard: 47 unit test files (everything in merge-tree except farm tests)
    • other shard: remaining 13 DDS packages (cell, counter, map, matrix, sequence, etc.)
  • Non-DDS packages (non-dds shard) and real-service local tests are unchanged
  • The pnpm --filter expressions are complementary: ./packages/dds/tree + ./packages/dds/merge-tree (farm + unit configs) + ./packages/dds/** !tree !merge-tree = all of ./packages/dds/**
  • Merge Coverage Reports successfully merges all 5 shard artifacts (confirmed in build #380986)
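
The "no gaps or overlaps" property of the package-level filters can be checked mechanically. The sketch below is one way to do that, not the verification used in the PR: the package list is a hypothetical subset, and the predicates only approximate the pnpm --filter expressions described above.

```javascript
"use strict";

// Hypothetical subset of package directories; the real repo has more DDS packages.
const ddsPackages = [
  "packages/dds/tree",
  "packages/dds/merge-tree",
  "packages/dds/cell",
  "packages/dds/counter",
  "packages/dds/map",
  "packages/dds/matrix",
  "packages/dds/sequence",
];

// One predicate per package-level filter (merge-tree is later split by mocha config).
const shardFilters = {
  tree: (dir) => dir === "packages/dds/tree",
  "merge-tree": (dir) => dir === "packages/dds/merge-tree",
  other: (dir) =>
    dir.startsWith("packages/dds/") &&
    dir !== "packages/dds/tree" &&
    dir !== "packages/dds/merge-tree",
};

// Return the packages matched by zero filters (a gap) or by several (an overlap).
function findPartitionErrors(packages, filters) {
  const errors = [];
  for (const dir of packages) {
    const matches = Object.keys(filters).filter((name) => filters[name](dir));
    if (matches.length !== 1) {
      errors.push({ dir, matches });
    }
  }
  return errors;
}

console.log(findPartitionErrors(ddsPackages, shardFilters)); // []
```

An empty result means every listed package lands in exactly one shard. The farm/unit split inside merge-tree happens at the file level and would need a separate check on spec file names.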

Test plan

  • CI pipeline passes with all 5 coverage shards running in parallel
  • Merge Coverage Reports successfully merges all shard artifacts
  • Pipeline wall-clock time reduced vs baseline (41m 19s → 36m 15s)
  • No test coverage gaps — all 16 DDS packages covered exactly once across shards
  • Verify Merge Coverage Reports job is skipped in Internal project

Supersedes #26559, #26562, #26571.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings February 27, 2026 17:12

Copilot AI (Contributor) left a comment

Pull request overview

This PR combines three pipeline parallelization optimizations (#26559, #26562, #26571) to significantly reduce the client build pipeline execution time. The changes introduce parallel coverage test jobs, shard mocha tests into DDS and non-DDS groups, move post-build work (docs, bundle analysis, devtools) into parallel jobs, and add timing budget enforcement to catch performance regressions.

Changes:

  • Parallelized coverage tests with individual jobs for each test type and a merge job to combine results
  • Sharded mocha tests into DDS (packages/dds/**) and non-DDS (!packages/dds/**) groups using pnpm filters
  • Extracted docs build, bundle analysis, and devtools build from the main build job into parallel post-build jobs that run concurrently with test jobs
  • Folded AreTheTypesWrong check into the build job (eliminating a separate test job)
  • Added timing budget enforcement template that warns when jobs exceed their expected duration
  • Increased npm pack concurrency from 1 to 4

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tools/pipelines/templates/include-steps-timing-budget.yml | New template for recording job start time and checking if elapsed time exceeds budget, emitting ADO warnings for performance regressions |
| tools/pipelines/templates/build-npm-client-package.yml | Major refactoring: parallelized coverage jobs with merge step; extracted docs/bundle/devtools into parallel jobs; added timing budgets to build, merge_coverage, test, and post-build jobs; integrated AreTheTypesWrong into build job |
| tools/pipelines/build-client.yml | Updated coverage test configuration to use sharded mocha tests (dds/non-dds); enabled taskCheckAreTheTypesWrong parameter |
| scripts/pack-packages.sh | Increased flub exec concurrency from 1 to 4 for pack operations to speed up npm pack step |
| package.json | Added 6 new scripts for DDS/non-DDS mocha test variants (test:mocha:dds, test:mocha:non-dds, and their CI/coverage equivalents) |

Comment on lines +585 to +591
    - job: Merge_coverage
      displayName: "Merge Coverage Reports"
      dependsOn:
        - build
        - ${{ each test in parameters.coverageTests }}:
          - Coverage_${{ test.jobName }}
      condition: succeededOrFailed()

Copilot AI Feb 27, 2026

The Merge_coverage job is missing a timeoutInMinutes setting. Other jobs in the pipeline have explicit timeouts: the build job has 120 minutes, Test jobs have 45 minutes, and parallel post-build jobs have 30 minutes. Consider adding timeoutInMinutes: 45 to ensure the merge job doesn't hang indefinitely if something goes wrong.

Contributor Author

Fixed — added timeoutInMinutes: 45 to the Merge_coverage job.

    - ${{ each test in parameters.coverageTests }}:
      - job: Coverage_${{ test.jobName }}
        displayName: "Coverage ${{ test.jobName }}"
        dependsOn: build

Copilot AI Feb 27, 2026

The individual Coverage jobs are missing a timeoutInMinutes setting. The Test jobs have timeoutInMinutes: 45 (line 777), and the parallel post-build jobs have timeoutInMinutes: 30. Coverage jobs should also have an explicit timeout to prevent them from running indefinitely if something goes wrong. Based on the PR description's timing budget table showing coverage tests at 35 minutes, consider adding timeoutInMinutes: 45 to align with other test jobs.

Suggested change
-       dependsOn: build
+       dependsOn: build
+       timeoutInMinutes: 45

Contributor Author

Fixed — added timeoutInMinutes: 45 to the Coverage jobs.

Comment on lines +515 to +572
    steps:
    # Setup
    - checkout: self
      path: $(FluidFrameworkDirectory)
      clean: true
      lfs: '${{ parameters.checkoutSubmodules }}'
      submodules: '${{ parameters.checkoutSubmodules }}'

    - script: |
        echo "commit: $(COMMIT_SHA)"
        git fetch origin $(COMMIT_SHA)
        git checkout $(COMMIT_SHA)
      displayName: "Checkout build commit"

    - template: /tools/pipelines/templates/include-use-node-version.yml@self

    - template: /tools/pipelines/templates/include-install.yml@self
      parameters:
        packageManager: '${{ parameters.packageManager }}'
        buildDirectory: '${{ parameters.buildDirectory }}'
        packageManagerInstallCommand: '${{ parameters.packageManagerInstallCommand }}'

    - task: DownloadPipelineArtifact@2
      inputs:
        artifact: build_output_archive
        targetPath: $(Build.StagingDirectory)

    - script: |
        echo "Extracting build output archive contents..."
        tar --extract --gzip --file $(Build.StagingDirectory)/build_output_archive.tar.gz --directory $(Pipeline.Workspace)/${{ parameters.buildDirectory }}
      displayName: Extract Build Output Contents

    # Set variable startTest if everything is good so far and we'll start running tests,
    # so that the steps to process/upload test coverage results only run if we got to the point of actually running tests.
    - script: |
        echo "##vso[task.setvariable variable=startTest]true"
      displayName: Start Test

    - template: /tools/pipelines/templates/include-test-task.yml@self
      parameters:
        taskTestStep: '${{ test.name }}'
        buildDirectory: '${{ parameters.buildDirectory }}'
        testCoverage: '${{ parameters.testCoverage }}'

    - task: Npm@1
      displayName: 'npm run test:copyresults'
      condition: and(succeededOrFailed(), eq(variables['startTest'], 'true'))
      inputs:
        command: custom
        workingDir: '$(Pipeline.Workspace)/${{ parameters.buildDirectory }}'
        customCommand: 'run test:copyresults'

    # Process test result, include publishing and logging
    - template: /tools/pipelines/templates/include-process-test-results.yml@self
      parameters:
        buildDirectory: '${{ parameters.buildDirectory }}'
        testResultDirs: '${{ parameters.testResultDirs }}'

Copilot AI Feb 27, 2026

The individual Coverage jobs are missing timing budget enforcement steps. The Merge_coverage job has timing budget checks (lines 609-613 for start, 755-759 for check), and all Test jobs have them (lines 794-798 for start, 914-918 for check). For consistency and to catch performance regressions in individual coverage shards, consider adding timing budget steps to the Coverage jobs. Based on the PR description, a budget of 35 minutes would be appropriate.

Contributor Author

Fixed — added timing budget start/check steps to Coverage jobs with a 25-minute budget, matching the Test jobs.
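
The timing budget template itself is not shown in this thread. As a rough sketch of the start/check idea only, the logic could look like the following; the function name checkTimingBudget, the message wording, and the job name are assumptions, while `##vso[task.logissue type=warning]` is the Azure Pipelines logging command for raising a warning on a job.

```javascript
"use strict";

// Return the warning line a budget check would print, or null when within budget.
// Hypothetical helper: the real template may compute and report this differently.
function checkTimingBudget(jobName, startMs, nowMs, budgetMinutes) {
  const elapsedMinutes = (nowMs - startMs) / 60000;
  if (elapsedMinutes <= budgetMinutes) {
    return null;
  }
  return (
    `##vso[task.logissue type=warning]${jobName} took ` +
    `${elapsedMinutes.toFixed(1)} min, over its ${budgetMinutes} min budget`
  );
}

// Example: a job that started 27 minutes ago, checked against a 25-minute budget.
const start = Date.UTC(2026, 1, 27, 17, 0, 0);
const now = start + 27 * 60000;
const warning = checkTimingBudget("Coverage_MochaTestDdsTree", start, now, 25);
if (warning !== null) {
  console.log(warning);
}
```

A check like this only warns. The separate timeoutInMinutes setting is what actually stops a hung job.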

@frankmueller-msft force-pushed the ci/combined-pipeline-parallelization branch 7 times, most recently from 5924b5c to 65dc86f, on February 28, 2026 06:00
@frankmueller-msft changed the title from "ci: combined pipeline parallelization (#26559 + #26562 + #26571)" to "ci: split DDS coverage shards and parallelize merge-tree farm tests" on Feb 28, 2026
@frankmueller-msft force-pushed the ci/combined-pipeline-parallelization branch 2 times, most recently from 1cf21d3 to 3e4234a, on February 28, 2026 06:56
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@frankmueller-msft
Contributor Author

Closing in favor of #26624, which takes a simpler approach: enabling mocha parallel mode on the merge-tree suite (4 lines) instead of splitting CI jobs. Achieved a 36% reduction in coverage test time (22 min → 14 min) without adding pipeline complexity.
