New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Matrix: Fix incorrect row/col adjustment w/multiple reconnections #5905
Conversation
const localSeq = this.nextLocalSeq(); | ||
this.sendSetCellOp(row, col, value, rowHandle, colHandle); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. What an awful diff for a simple change:
- The 'isAttached()' check is hoisted from 'sendSetCellOp()' to the caller. (above ---^)
- 'sendSetCellOp()' now takes 'localSeq' as an argument. (below ----v)
- ...which forced me to reformat due to line length (below ---v)
The motivation for #1 is to pedantically avoid advancing 'nextLocalSeq' when no op will be sent.
The motivation for #2 is to enable 'reSubmitCore' to preserve the 'localSeq' from the original op.
@@ -511,6 +515,7 @@ export class SharedMatrix<T extends Serializable = Serializable> | |||
setOp.value, | |||
rowHandle, | |||
colHandle, | |||
localSeq, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and here's the only "real" change: preserving 'localSeq' during resubmission.
[undefined, undefined, undefined, "A"], | ||
]); | ||
}); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New test case to cover the missing multiple reconnection case.
trace?.push(`containerRuntime${matrixIndex + 1}.connected = true;`); | ||
runtimes[matrixIndex].connected = true; | ||
} | ||
|
||
trace?.push(`containerRuntime${matrixIndex + 1}.connected = false;`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change to stress to cover multiple reconnections as well.
■ @fluidframework/base-host: No change
⯅ @fluid-example/bundle-size-tests: +71 Bytes
Baseline commit: 6974711 |
@@ -269,7 +276,7 @@ describe("Matrix", () => { | |||
{ numClients: 2, numOps: 200, syncProbability: 0.3, disconnectProbability: 0, seed: 0x84d43a0a }, | |||
{ numClients: 3, numOps: 200, syncProbability: 0.1, disconnectProbability: 0, seed: 0x655c763b }, | |||
{ numClients: 5, numOps: 200, syncProbability: 0.0, disconnectProbability: 0, seed: 0x2f98736d }, | |||
{ numClients: 2, numOps: 200, syncProbability: 0.3, disconnectProbability: 1, seed: 0x84d43a0a }, | |||
{ numClients: 2, numOps: 200, syncProbability: 0.2, disconnectProbability: 0.4, seed: 0x84d43a0a }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted the mix a bit. Previously, this stress config would force reconnection after every op, but it's more interesting to have a handful of ops pending during resubmission.
(There's also a pre-existing "long haul" stress that hits reconnection with hundreds of ops pending, but I only run that on my box.)
// length = 3 | ||
// end = -1 + 3 = 2 | ||
// | ||
// In which case, pass the empty segment into 'findReconnectionPostition()'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^--- This logic was already correct. Just preserving a comment from a similar test.
// the original 'localSeq' or caused by state mutations during reconnection. | ||
containerRuntime1.connected = false; | ||
containerRuntime1.connected = true; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^--- Reconnecting twice is the secret to repro'ing this bug.
process.stdout.write( | ||
`Stress loop: ${++iterations} iterations completed - Total Elapsed: ${ | ||
((Date.now() - start) / 1000).toFixed(2) | ||
}s\n`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^-- The above console output only appears in the "long haul" stress. (i.e., not during 'npm t', in the lab, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this get back ported to release 0.38? |
Thanks, @anthony-murphy! We're okay waiting for 0.39 to release. The issue was found by refreshing a document while pasting. That scenario is a bit fringe and the app has already mitigated by changing the row/col insertion order during paste. |
This fixes a data loss issue during reconnection that reproduces when the initial submission fails and the 1st reconnection attempt also fails. (Fixes #5808).
The cause of the issue is that SharedMatrix previously advanced 'localSeq' when resubmitting ops the 1st time. During the 2nd reconnection attempt, the runtime replays the ops that were resubmitted during the 1st attempt (as opposed to the originally submitted ops), which now have higher 'localSeq' than those stored in the MergeTree segments.
This results in incorrect adjustment of row/col when resubmitting 'setCell()' ops, as the higher 'localSeqs' cause adjustment to behave as if all 'setCell()' calls occurred after all row/col insertions and removals (i.e., no adjustment takes place.)