
Testing reliability & perf 2026 — workstream #25978

@eleanorjboyd

Tracking the user-reported testing regressions and performance complaints that emerged after the project-structure / [test-by-project] rework, plus the older follow-ups that ride alongside them.

The work is grouped into four problem areas. Each item below points at the existing user-filed issue (or a new engineering issue where there is no good match) so progress is visible in one place.


Problem areas (what users are reporting)

  1. Test discovery is much slower than CLI pytest.
    Multiple users with large parametrized suites report the gap: 30k tests take 40s in the Test Explorer vs ~2s on the CLI; 328k tests take 66s vs ~10s. Profiling traced the slowdown to O(n²) list scans plus an oversized JSON payload.

  2. Test tree is rebuilt from scratch on every change.
    Saving any .py file re-discovers the whole workspace and wipes the existing tree. While re-discovery is in flight, users can't re-run or debug a test because the items have been cleared. "Debug: Restart" breaks for the same reason.

  3. Run / debug pipeline regressions.
    Tests appear as "skipped" even though they ran; debug runs lose results because the result pipe is cancelled the moment the subprocess exits; pytest-subtests failures get reported as success; the env selected via the Python Environments API is not always honored.

  4. Hard discovery failures still open.
    Smaller correctness bugs in the pytest plugin (HIDDEN_PARAM, pipe writer broken by mock.patch("builtins.open")).
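
To make the O(n²) list-scan diagnosis in item 1 concrete, here is a minimal, generic sketch. It is not the extension's or the pytest plugin's actual code: the function names and the idea that the hot spot is an order-preserving de-duplication of test IDs are assumptions made purely for illustration. The first function does a linear membership check against a list for every ID (quadratic overall); the second keeps a set alongside the list so each check is O(1) on average.

```python
from typing import Iterable, List


def dedupe_with_list_scan(test_ids: Iterable[str]) -> List[str]:
    """Quadratic: `tid not in seen` walks the whole list on every iteration."""
    seen: List[str] = []
    for tid in test_ids:
        if tid not in seen:  # O(n) scan per ID, so O(n^2) in total
            seen.append(tid)
    return seen


def dedupe_with_set(test_ids: Iterable[str]) -> List[str]:
    """Linear: a set answers membership in O(1) on average; the list keeps order."""
    seen = set()
    ordered: List[str] = []
    for tid in test_ids:
        if tid not in seen:
            seen.add(tid)
            ordered.append(tid)
    return ordered


if __name__ == "__main__":
    # Hypothetical parametrized IDs, with duplicates to exercise the check.
    ids = [f"test_mod.py::test_case[{i % 500}]" for i in range(2000)]
    assert dedupe_with_list_scan(ids) == dedupe_with_set(ids)
```

Both functions return the same ordered result; only the cost of the membership check differs, which is the kind of change that matters at 30k to 328k tests.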


Order of execution (impact × effort)


Closing as already resolved


How we measure progress

Baseline telemetry has already been wired up (TS-side). Each fix above has a corresponding metric so dashboards can verify the change actually moves the needle (and catch any unintended regressions):

| Area | Primary metric | What "fixed" looks like |
| --- | --- | --- |
| Discovery perf | UNITTEST.DISCOVERY.DONE.totalDurationMs p50/p90 sliced by testCount bucket × mode | Large-suite p90 drops by an order of magnitude; mode='project' converges to mode='legacy'. |
| Tree rebuilt every save | UNITTEST.DISCOVERY.TRIGGER.fileKind='non-test' share; UNITTEST.TREE.UPDATE.rebuiltFromScratch share; msSinceLastTrigger p50 | non-test share drops to ~0%; rebuiltFromScratch=false share grows as incremental updates land. |
| Run / debug pipeline | UNITTEST.RUN.DONE.missingCount > 0 share; pipeClosedEarly share; failureCategory distribution | missingCount>0 and pipeClosedEarly shares drop to near-0 on mode='project' and debugging=true. |
| Discovery hard failures | UNITTEST.DISCOVERY.DONE.failureCategory distribution | Each individual fix shrinks its corresponding bucket. |

Per-area success criteria are checked off as each fix ships and the telemetry confirms the change.
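
As a rough illustration of the "p50/p90 sliced by testCount bucket × mode" view for discovery perf, the sketch below groups events and computes nearest-rank percentiles. The field names testCount, mode, and totalDurationMs come from the metrics above; the event dictionaries, the bucket edges, and the helper names are assumptions, and the real dashboards may compute percentiles differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bucket_for(test_count: int) -> str:
    """Hypothetical bucket edges; the real telemetry buckets are not given here."""
    if test_count < 1_000:
        return "<1k"
    if test_count < 10_000:
        return "1k-10k"
    if test_count < 100_000:
        return "10k-100k"
    return ">=100k"


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile using integer arithmetic to avoid float rounding."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def summarize(events: List[dict]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Group durations by (testCount bucket, mode) and report p50/p90 per group."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for event in events:
        key = (bucket_for(event["testCount"]), event["mode"])
        groups[key].append(event["totalDurationMs"])
    return {
        key: {"p50": percentile(durs, 50), "p90": percentile(durs, 90)}
        for key, durs in groups.items()
    }


if __name__ == "__main__":
    # Made-up sample data, for demonstration only.
    sample = [
        {"testCount": 30_000, "mode": "project", "totalDurationMs": ms}
        for ms in range(100, 1100, 100)
    ]
    print(summarize(sample))
```

Comparing the ("10k-100k", "project") and ("10k-100k", "legacy") rows of such a summary is one way to check whether mode='project' is converging to mode='legacy'.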


Out of scope (deliberately, for now)

  • Per-test or per-file names in telemetry — privacy-sensitive, not needed for the questions above.
  • True added / removed counts in UNITTEST.TREE.UPDATE — needs an O(n) set diff per discovery; revisit only if beforeCount/afterCount + rebuiltFromScratch aren't enough signal.
  • Migrating off named pipes entirely — out of scope for this workstream.
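
For context on the deferred added / removed counts, this is a minimal sketch of the O(n) set diff mentioned in the list above. The function name and the use of plain test-ID strings are assumptions; beforeCount, afterCount, and the added / removed idea come from the list. It also shows the case where the counts alone hide a change: one test renamed leaves beforeCount equal to afterCount while added and removed are both 1.

```python
from typing import Dict, Iterable


def diff_test_ids(before: Iterable[str], after: Iterable[str]) -> Dict[str, int]:
    """Count added/removed IDs between two discoveries in O(n) average time."""
    before_set = set(before)
    after_set = set(after)
    return {
        "beforeCount": len(before_set),
        "afterCount": len(after_set),
        "added": len(after_set - before_set),    # IDs present only after
        "removed": len(before_set - after_set),  # IDs present only before
    }


if __name__ == "__main__":
    # Hypothetical IDs: test_b is renamed to test_c between discoveries.
    old = ["t.py::test_a", "t.py::test_b"]
    new = ["t.py::test_a", "t.py::test_c"]
    print(diff_test_ids(old, new))
```

Only aggregate counts leave this function, so no per-test names would need to be sent, in keeping with the privacy constraint in the first bullet.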
