Skip to content

Normalize merge-tree's segment order on rebase of an insert op#11946

Merged
Abe27342 merged 18 commits into
microsoft:nextfrom
Abe27342:merge-tree-reconnect-normalize
Oct 11, 2022
Merged

Normalize merge-tree's segment order on rebase of an insert op#11946
Abe27342 merged 18 commits into
microsoft:nextfrom
Abe27342:merge-tree-reconnect-normalize

Conversation

@Abe27342
Copy link
Copy Markdown
Contributor

@Abe27342 Abe27342 commented Sep 13, 2022

Description

This fixes an eventual consistency issue exposed by fuzz testing.
The core problem is that clients that are reconnecting may end up in a state in which they don't resolve tiebreaks of inserted segments the same way as other clients.
This is because the tiebreaking logic makes assumptions about the relative ordering of inserted/removed segments in the collab window.
Specifically, in the absence of reconnect it's never the case that a segment inserted at seq A appears immediately after a segment removed at seq B, where:

  • both A and B in the collab window
  • A > B
  • the ops with seq A and seq B are concurrent

(In this case, the insertion logic would have placed the inserted segment before the removed segment on all clients).

However, this invariant was no longer true with reconnect due to rebasing ops.
The regression test demonstrates a case where this happens.
The proposed fix implemented in this PR is to have the merge-tree update its internal state when one of its pending segment groups is rebased to instead be in terms of the current sequence number.

AB#1675

@github-actions github-actions Bot added area: dds Issues related to distributed data structures public api change Changes to a public API base: next PRs targeted against next branch labels Sep 13, 2022
Comment thread packages/dds/merge-tree/src/mergeTree.ts Outdated
Comment thread packages/dds/merge-tree/src/test/client.applyMsg.spec.ts
Comment thread packages/dds/merge-tree/src/mergeTree.ts Outdated
}),
);

const newOrder = [...consecutiveLocalSegments, ...adjacentRemovedSegs];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems off. i don't think removed segs will always just come after local segs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, this doesn't seem idempotent, like what happens if there are multiple local inserts?

would it make more sense to just rebase everything at once in some kind of fix up pass before regenerate.

Copy link
Copy Markdown
Contributor Author

@Abe27342 Abe27342 Sep 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed an approach which I think is nicer that's more along these lines. I'm no longer seeing any fuzz failures on the reconnect tests for interval collection if I modify the operations to not include intervals (so it's just basic add/remove text), whereas previously there was a ~2.5% failure rate over 1k tests, with all of those failures fundamentally reducing to this issue of mismatched segment order causing inconsistent conflict resolution.

I was mistaken in Standup the other day, fixing up the callback invocation on local references doesn't fix interval collections stuff. But the core of that problem is that the reconnect strategy used for interval collection requires rebasing character positions from a previous localSeq, which normalizing the segment order upon reconnect completely messes up. I think a reasonable enough way to solve that might be for SharedSegmentSequence to give interval collection a chance to query the underlying merge-tree pre-normalization, and then IntervalCollection's pending state tracking can store the character position + refSeq/localSeq combos it cares about.

I also like the idea to modify merge-tree reconnect farm to repro this issue, I'll do that in this PR before checkin.

Comment thread packages/dds/merge-tree/src/mergeTree.ts Outdated
@anthony-murphy
Copy link
Copy Markdown
Contributor

a thought i had on this area, could we augment the reconnect farm to hit it? that test is pretty simplistic right now.

Abram Sanderson added 2 commits September 27, 2022 17:50
…, simplifying rebase strategy along the way. Also re-link tracking groups and check that in the test
@Abe27342
Copy link
Copy Markdown
Contributor Author

a thought i had on this area, could we augment the reconnect farm to hit it? that test is pretty simplistic right now.

Changing it to have both a client that always disconnects on application and a different one that only sometimes disconnects seems to suffice to make it hit the issue. There are still some failures on the long version of that test after doing so, but ~80% less (and not on new tests), so I'm reasonably comfortable checking this in.

@Abe27342 Abe27342 marked this pull request as ready for review September 29, 2022 16:21
@Abe27342 Abe27342 requested review from a team and msfluid-bot as code owners September 29, 2022 16:21
Comment thread packages/dds/merge-tree/src/client.ts Outdated
Comment thread packages/dds/merge-tree/src/mergeTree.ts
@msfluid-bot
Copy link
Copy Markdown
Collaborator

msfluid-bot commented Oct 4, 2022

@fluid-example/bundle-size-tests: +7.94 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 401.71 KB 404.61 KB +2.89 KB
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 200.17 KB 200.17 KB No change
loader.js 154.17 KB 154.17 KB No change
map.js 47.66 KB 47.66 KB No change
matrix.js 138.44 KB 140.59 KB +2.15 KB
odspDriver.js 153.15 KB 153.15 KB No change
odspPrefetchSnapshot.js 40.53 KB 40.53 KB No change
sharedString.js 158.57 KB 161.47 KB +2.89 KB
Total Size 1.3 MB 1.3 MB +7.94 KB

Baseline commit: 151ffe3

Generated by 🚫 dangerJS against 31821a8

Comment thread packages/dds/merge-tree/src/client.ts Outdated
let op: IMergeTreeOp | undefined;
if (len === 0 || len < minLength) {
const text = client.longClientId!.repeat(random.integer(1, 3)(mt));
if (len === 0 || len < minLength || (len < minLength * 3 && random.bool(1 / 3))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add the insert condition? can't the test just add an insert operation if it wants it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just meant to be a fallback case, that handles empty/min len, as we usually want data in the tree.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that seems better. I'll convert tests that seem like they should have an insert op to inject one in config. The lack of inserts was one reason the reconnect farm didn't repro this issue (text was only inserted when the string dipped below min length, so it was pretty unlikely to have a lot of adjacent surviving segments that would end up in a different order on rebase)

Copy link
Copy Markdown
Contributor

@anthony-murphy anthony-murphy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple small comments, but otherwise looks good.

@Abe27342 Abe27342 merged commit 45a2620 into microsoft:next Oct 11, 2022
@Abe27342 Abe27342 deleted the merge-tree-reconnect-normalize branch October 11, 2022 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: dds: sharedstring area: dds Issues related to distributed data structures base: next PRs targeted against next branch public api change Changes to a public API

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants