Fix TaskManager frozen container validation in stress tests#26422
Fix TaskManager frozen container validation in stress tests#26422anthony-murphy-agent wants to merge 25 commits intomicrosoft:mainfrom
Conversation
…d ops - Add storeHandle field to createDataStore and createChannel ops that stores the handle in root, making objects collaboratively reachable - Increase detached phase from 20 to 100 ops (20 creation + 80 DDS) - Increase generator ops from 100 to 200 Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select DDS op targets by picking a channel type first across all datastores globally, then picking a channel of that type. This eliminates directory dominance (80% → 19%) and distributes ops more evenly across all 10 DDS types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…then channel Split the detached creation phase into two sub-phases: datastores first (ops 0-9), then channels (ops 10-19). This ensures multiple datastores exist before channels are created, so the global type-first selection distributes channels across datastores instead of concentrating them in datastore-0 (64% → 26%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…nimization comment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ory state tracking Introduces ContainerStateTracker to maintain channel type metadata in memory, avoiding repeated queries to the system under test. Moves global type-first channel selection into the tracker class and adds datastoreTag to LocalServerStressState for cleaner reducer access. Also skips seed 54 (pre-existing ConsensusOrderedCollection bug) and updates storeHandle comments per review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Fix misleading comment in mixinClientSelection: clarify that only type
metadata lookup is in-memory, not channel/datastore discovery
- Update SelectedClientSpec.channelTag type to "root" | `channel-${number}`
to accurately reflect possible values and remove unsafe cast
- Rename datastoreCreationPhaseOps/channelCreationPhaseOps to
datastoreCreationPhaseEndOp/channelCreationPhaseEndOp to clarify
they are end thresholds, not counts
- Format test results JSON file with biome
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Add per-client caching of resolved IChannel and StressDataObject instances in ContainerStateTracker. Channel type selection is now fully in-memory using an inverse type index, and resolved channels are cached on first access per client. The reducer also uses the cache via resolveChannel() instead of calling getContainerObjects()/getChannels() directly. Also removes accidentally committed test results file (54.json). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Remove channelNameMap SharedMap from StressDataObject; channel tracking
is now fully handled by ContainerStateTracker
- Replace getChannels() with getChannel(name) for single-channel resolution
- Use state tracker for channel names in ddsOperations.ts
- Simplify channelTag typing to `channel-${number}` (ignore root)
- Pass stateTracker to validateConsistency for channel enumeration
- Skip seeds 45, 155 (pre-existing ConsensusOrderedCollection bugs)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…from DefaultStressDataObject Replace the Fluid SharedMap used for cross-client object discovery with in-memory tracking in ContainerStateTracker. This eliminates the circular dependency where the test harness relied on the system under test (Fluid DDSes) for its own bookkeeping. Key changes: - ContainerStateTracker now stores handle absolute paths and resolves objects via resolveHandle instead of getContainerObjects() - Added registerBlob, resolveContainerObject, resolveAllContainerObjects, and getAllHandles methods to ContainerStateTracker - Gutted DefaultStressDataObject: removed containerObjectMap, locally created objects tracking, pending registrations, and staging mode deferral logic - Validation uses intersection-based channel comparison, fixing 7 previously-failing seeds where frozen containers couldn't resolve all objects Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… into StressDataObject DefaultStressDataObject only added staging mode and an alias after gutting the containerObjectMap. Move staging mode methods and the "default" alias onto StressDataObject, delete the subclass, and simplify createRuntimeFactory to use a single factory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…hes from state tracker Remove resolvedChannelCache and resolvedObjectCache. Resolving existing Fluid objects is fast; caching only helps avoid redundant calls for objects that already exist. The complexity isn't worth the ~1 minute speedup across 200 seeds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…results
Address PR feedback:
- Type channelTag as `channel-${number}` in registerChannel
- Type resolveContainerObject tag as `datastore-${number}` | `blob-${number}`
- Type objectPaths map key and ResolvedContainerObject.tag accordingly
- Add src/test/results/ to .gitignore to prevent committing failure files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ressor and clusterSize These comments from the original codebase explain important design decisions about the idCompressor instanceof check and the cluster size override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… StressDataObject - Replace containerRuntimeForTest with resolveByAbsolutePath method that encapsulates resolveHandle logic with local caching - Rename alias to defaultInstanceAlias per review feedback - Remove unused imports from ContainerStateTracker The state tracker no longer needs to know about container runtime internals; resolution is now delegated to StressDataObject. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… handling - Replace intersection-based channel validation with strict key-set equality so real divergence (missing channels) is caught as an error - Add .catch() to blob upload promise chain to prevent unhandled rejections - Eagerly cache created datastores in resolvedObjectCache on the creating client - Add cacheResolvedObject() method to StressDataObject for pre-populating cache - Group skip seeds by failure reason using sub-arrays - Fix stale comment about per-client IChannel caching - Document staging mode phantom entry tradeoff in state tracker Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Resolve merge conflicts in 4 files, preserving PR's state tracker approach while incorporating upstream changes (storeHandle, naming conventions, dataStoreOperations.ts). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
- Cache newly created datastores on state.client.entryPoint (not state.datastore) so ContainerStateTracker resolution hits the cache. - Widen resolveByAbsolutePath typing to not imply StressDataObject is always present (blobs don't have it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the constant renaming (use upstream's datastoreCreationPhaseEnd / channelCreationPhaseEnd) and remove the eager cache logic from the createDataStore reducer -- resolveByAbsolutePath already caches on first resolution, making eager caching redundant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Per review feedback, don't gitignore the test results directory so accidentally committed result files remain visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Convert the per-instance resolvedObjectCache to a static class member shared across all StressDataObject instances. Remove the public cacheResolvedObject method; add a static clearCache() method that the harness can call between test runs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
The static resolvedObjectCache was incorrect because different clients have different container runtimes, and handles resolved in one client's runtime are not valid for another. Converting to per-instance caching naturally scopes the cache to each client's StressDataObject entrypoint, since resolveByAbsolutePath is always called on client.entryPoint. No clearCache() needed -- each test seed creates fresh containers with fresh StressDataObject instances, so the cache resets naturally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… remove stale comment - Use factory.attributes.type instead of factory.type in registerDatastore for consistency with ddsModelMap keying - Replace manual Fisher-Yates shuffle with random.shuffle and simplify selectChannelForOperation to shuffle all channel types in a single loop, eliminating the separate fallback path - Remove incorrect staging mode comment (channels/datastores created in staging mode remain valid on the creating client even after discard) - Update skip list for seeds affected by shuffle order change Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ve redundant .StressDataObject access - Rename ResolvedContainerObject.stressDataObject field to datastore - Access getChannel() directly on StressDataObject instance instead of going through the .StressDataObject member (which is only needed for anonymous object discovery) - Update all usages in containerStateTracker.ts, ddsOperations.ts, and dataStoreOperations.ts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
Revert the changes below line 225 in ddsOperations.ts as the original validation logic (assert size equality + iterate aMap.keys()) is sufficient and doesn't need to change. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
| import { makeUnreachableCodePathProxy } from "./utils.js"; | ||
|
|
||
| /** | ||
| * DDS types whose in-memory state depends on quorum membership events (e.g., |
There was a problem hiding this comment.
this shouldn't be true. however, the quorum can change before a dds is loaded
100c173 to
231a541
Compare
| // Stashed ops (from applyStashedOp) aren't tracked in latestPendingOps. | ||
| // Just resubmit them with the same sentinel metadata. | ||
| if (localOpMetadata === stashedOpMetadata) { | ||
| this.submitLocalMessage(content, stashedOpMetadata); |
There was a problem hiding this comment.
we don't want to resubmit stashed ops. so just return
There was a problem hiding this comment.
Pull request overview
Updates the local-server stress test harness to correctly validate “frozen containers” against live clients by skipping quorum-dependent DDS types (notably TaskManager) during consistency checks, and refactors stress-test handle/channel tracking to use an in-memory ContainerStateTracker.
Changes:
- Add
skipChannelTypessupport tovalidateConsistencyOfAllDDSand use it for frozen-container validation (skip TaskManager). - Replace SharedMap-based container object/channel discovery with a new in-memory
ContainerStateTracker+ per-entrypoint resolve caching. - Fix/adjust DDS behavior for frozen containers: TaskManager stashed-op handling and ConsensusOrderedCollection quorum scrubbing.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/test/local-server-stress-tests/src/test/localServerStress.spec.ts | Updates validation callback signature and modifies the seed skip list. |
| packages/test/local-server-stress-tests/src/stressDataObject.ts | Removes DefaultStressDataObject/SharedMap registry; adds resolve-by-absolute-path caching and simplifies blob/channel APIs. |
| packages/test/local-server-stress-tests/src/localServerStressHarness.ts | Integrates ContainerStateTracker, adds frozen-container skip types, and threads tracker into validation + channel selection. |
| packages/test/local-server-stress-tests/src/ddsOperations.ts | Extends validateConsistencyOfAllDDS to accept a tracker and optional channel-type skip set; switches handle discovery to tracker. |
| packages/test/local-server-stress-tests/src/dataStoreOperations.ts | Uses tracker-based datastore enumeration for dirty-state checks. |
| packages/test/local-server-stress-tests/src/containerStateTracker.ts | New in-memory registry for datastores/blobs/channels, plus type-first channel selection and resolution helpers. |
| packages/test/local-server-stress-tests/src/baseModel.ts | Registers created datastores/channels/blobs with the tracker during reducer operations. |
| packages/dds/task-manager/src/taskManager.ts | Changes stashed-op behavior to resubmit with sentinel metadata and adjusts pending-op tracking accordingly. |
| packages/dds/ordered-collection/src/consensusOrderedCollection.ts | Scrubs stale jobTracking entries based on quorum during load/connect and handles acquire-after-leave ordering. |
| packages/dds/ordered-collection/api-report/ordered-collection.legacy.beta.api.md | Updates API report to reflect newly surfaced onConnect() method. |
| // Pre-existing DDS bugs (not introduced by this PR): | ||
| skip: [ | ||
| ...[10, 63, 125, 180], // ConsensusOrderedCollection live-client ordering divergence | ||
| ], |
There was a problem hiding this comment.
The skip list here doesn’t match the PR description. The description says several previously-skipped seeds were removed (and implies only the remaining known-bad seeds are skipped), but this file now skips seeds 10, 63, 125, and 180. Please either update the PR description/test plan to reflect the new skip set, or adjust the skip list to match the intended behavior.
| // where the snapshot includes an acquire for a client whose ClientLeave was | ||
| // sequenced before the acquire (so the snapshot's quorum excludes the client, | ||
| // but jobTracking still has the entry). The onConnect scrub handles entries | ||
| // added during savedOps replay, but this catch entries already in the snapshot |
There was a problem hiding this comment.
Typo/grammar in this comment: "but this catch entries" should be "but this catches entries".
| // added during savedOps replay, but this catch entries already in the snapshot | |
| // added during savedOps replay, but this catches entries already in the snapshot |
| } | ||
| } | ||
| // Raise events after all state changes, matching removeClient pattern. | ||
| added.map((value) => this.emit("add", value, false /* newlyAdded */)); |
There was a problem hiding this comment.
added.map(...) is being used only for side effects. Using forEach/a for...of loop would avoid allocating an unused array and makes the intent clearer.
| added.map((value) => this.emit("add", value, false /* newlyAdded */)); | |
| added.forEach((value) => this.emit("add", value, false /* newlyAdded */)); |
| protected applyStashedOp(content: unknown): void { | ||
| // We don't apply any stashed ops since during the rehydration process. Since we lose any assigned tasks | ||
| // during rehydration we cannot be assigned any tasks. Additionally, without the in-memory state of the | ||
| // previous dds, we also cannot re-volunteer based on a previous subscribeToTask() call. Since we are | ||
| // unable to be assigned to any tasks, there is no reason to process abandon/complete ops either. | ||
| assertIsTaskManagerOperation(content); | ||
| // Resubmit the op so it can be matched on ack. Use stashedOpMetadata so the | ||
| // opWatcher handlers skip latestPendingOps tracking for these resubmitted ops. | ||
| this.submitLocalMessage(content, stashedOpMetadata); | ||
| } |
There was a problem hiding this comment.
This changes TaskManager’s stashed-op behavior from “ignore” to “resubmit with sentinel localOpMetadata”, and also alters opWatcher/latestPendingOps handling based on that sentinel. There don’t appear to be existing unit tests covering stashed-op rehydration paths in this package; please add coverage to ensure applyStashedOp/resubmit/rollback don’t mutate latestPendingOps and don’t assert when acks arrive for stashed ops.
231a541 to
ac74400
Compare
Implement applyStashedOp to resubmit stashed ops with a sentinel metadata value (-1), allowing opWatcher handlers to skip latestPendingOps tracking for rehydrated ops. Handle stashed ops in reSubmitCore and rollback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
ac74400 to
c1fff32
Compare
Summary
removeClientreducer. Frozen containers don't process quorum events (removeMember), so TaskManager'staskQueues(which depend on quorum membership) will inevitably diverge.skipChannelTypesparameter tovalidateConsistencyOfAllDDSto support this.Depends on #26381.
Test plan
🤖 Generated with Claude Code