Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matrix: Fix incorrect row/col adjustment w/multiple reconnections #5905

Merged
merged 1 commit into from Apr 22, 2021

Conversation

DLehenbauer
Copy link
Contributor

@DLehenbauer DLehenbauer commented Apr 22, 2021

This fixes a data loss issue during reconnection that reproduces when the initial submission fails and the 1st reconnection attempt also fails. (Fixes #5808).

The cause of the issue is that SharedMatrix previously advanced 'localSeq' when resubmitting ops the 1st time. During the 2nd reconnection attempt, the runtime replays the ops that were resubmitted during the 1st attempt (as opposed to the originally submitted ops), which now have higher 'localSeq' than those stored in the MergeTree segments.

This results in incorrect adjustment of row/col when resubmitting 'setCell()' ops, as the higher 'localSeqs' cause adjustment to behave as if all 'setCell()' calls occurred after all row/col insertions and removals (i.e., no adjustment takes place.)

const localSeq = this.nextLocalSeq();
this.sendSetCellOp(row, col, value, rowHandle, colHandle);
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. What an awful diff for a simple change:

  1. The 'isAttached()' check is hoisted from 'sendSetCellOp()' to the caller. (above ---^)
  2. 'sendSetCellOp()' now takes 'localSeq' as an argument. (below ----v)
  3. ...which forced me to reformat due to line length (below ---v)

The motivation for #1 is to pedantically avoid advancing 'nextLocalSeq' when no op will be sent.
The motivation for #2 is to enable 'reSubmitCore' to preserve the 'localSeq' from the original op.

@@ -511,6 +515,7 @@ export class SharedMatrix<T extends Serializable = Serializable>
setOp.value,
rowHandle,
colHandle,
localSeq,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and here's the only "real" change: preserving 'localSeq' during resubmission.

[undefined, undefined, undefined, "A"],
]);
});

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New test case to cover the missing multiple reconnection case.

trace?.push(`containerRuntime${matrixIndex + 1}.connected = true;`);
runtimes[matrixIndex].connected = true;
}

trace?.push(`containerRuntime${matrixIndex + 1}.connected = false;`);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to stress to cover multiple reconnections as well.

@msfluid-bot
Copy link
Collaborator

@fluidframework/base-host: No change
Metric NameBaseline SizeCompare SizeSize Diff
main.js 175.45 KB 175.45 KB No change
Total Size 175.45 KB 175.45 KB No change
@fluid-example/bundle-size-tests: +71 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
container.js 205.46 KB 205.46 KB No change
map.js 49.53 KB 49.53 KB No change
matrix.js 144.06 KB 144.13 KB +71 Bytes
odspDriver.js 210.15 KB 210.15 KB No change
sharedString.js 158.98 KB 158.98 KB No change
Total Size 768.18 KB 768.25 KB +71 Bytes

Baseline commit: 6974711

Generated by 🚫 dangerJS against 72352e1

@@ -269,7 +276,7 @@ describe("Matrix", () => {
{ numClients: 2, numOps: 200, syncProbability: 0.3, disconnectProbability: 0, seed: 0x84d43a0a },
{ numClients: 3, numOps: 200, syncProbability: 0.1, disconnectProbability: 0, seed: 0x655c763b },
{ numClients: 5, numOps: 200, syncProbability: 0.0, disconnectProbability: 0, seed: 0x2f98736d },
{ numClients: 2, numOps: 200, syncProbability: 0.3, disconnectProbability: 1, seed: 0x84d43a0a },
{ numClients: 2, numOps: 200, syncProbability: 0.2, disconnectProbability: 0.4, seed: 0x84d43a0a },
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted the mix a bit. Previously, this stress config would force reconnection after every op, but it's more interesting to have a handful of ops pending during resubmission.

(There's also a pre-existing "long haul" stress that hits reconnection with hundreds of ops pending, but I only run that on my box.)

// length = 3
// end = -1 + 3 = 2
//
// In which case, pass the empty segment into 'findReconnectionPostition()'.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^--- This logic was already correct. Just preserving a comment from a similar test.

// the original 'localSeq' or caused by state mutations during reconnection.
containerRuntime1.connected = false;
containerRuntime1.connected = true;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^--- Reconnecting twice is the secret to repro'ing this bug.

process.stdout.write(
`Stress loop: ${++iterations} iterations completed - Total Elapsed: ${
((Date.now() - start) / 1000).toFixed(2)
}s\n`);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^-- The above console output only appears in the "long haul" stress. (i.e., not during 'npm t', in the lab, etc.)

@DLehenbauer DLehenbauer changed the title Matrix: Fix incorrect row/col adjustments during 2nd reconnection attempt Matrix: Fix incorrect row/col adjustment w/multiple reconnections Apr 22, 2021
Copy link
Contributor

@anthony-murphy anthony-murphy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@anthony-murphy
Copy link
Contributor

should this get back ported to release 0.38?

@DLehenbauer
Copy link
Contributor Author

Thanks, @anthony-murphy!

We're okay waiting for 0.39 to release. The issue was found by refreshing a document while pasting. That scenario is a bit fringe and the app has already mitigated by changing the row/col insertion order during paste.

@DLehenbauer DLehenbauer merged commit 9576558 into main Apr 22, 2021
@DLehenbauer DLehenbauer deleted the matrix branch April 22, 2021 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SharedMatrix: Ops transmitted during reconnection phase are adjusted incorrectly
3 participants