Matrix: Fix incorrect row/col adjustment w/multiple reconnections #5905

DLehenbauer · 2021-04-22T18:48:38Z

This fixes a data loss issue during reconnection that reproduces when the initial submission fails and the 1st reconnection attempt also fails. (Fixes #5808).

The cause of the issue is that SharedMatrix previously advanced 'localSeq' when resubmitting ops the 1st time. During the 2nd reconnection attempt, the runtime replays the ops that were resubmitted during the 1st attempt (as opposed to the originally submitted ops), and these now carry higher 'localSeq' values than those stored in the MergeTree segments.

This results in incorrect adjustment of row/col when resubmitting 'setCell()' ops, as the higher 'localSeqs' cause adjustment to behave as if all 'setCell()' calls occurred after all row/col insertions and removals (i.e., no adjustment takes place.)
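
To make the failure mode concrete, here is a minimal sketch using a toy model rather than the real SharedMatrix/MergeTree code. Every name in it ('Row', 'SetCellOp', 'rowAsOf', 'Resubmitter', 'reconnectTwice') is hypothetical, and the position lookup is a simplified stand-in for how a row index might be recomputed as of a given 'localSeq'. It assumes three pending local ops: insert row "A", set a cell in "A", then insert row "B" above "A".

```typescript
// Toy model only: none of these names are the real SharedMatrix / MergeTree API.
interface Row {
    handle: string;
    // localSeq of the still-unacked local insert that created this row.
    insertLocalSeq: number;
}

interface SetCellOp {
    rowHandle: string;
    row: number;      // row index carried by the op
    localSeq: number; // localSeq the op is currently stamped with
}

// Index of `handle` as seen by an op stamped with `localSeq`: rows whose
// pending insert is newer than `localSeq` are not counted.
function rowAsOf(rows: Row[], handle: string, localSeq: number): number {
    let index = 0;
    for (const row of rows) {
        if (row.handle === handle) {
            return index;
        }
        if (row.insertLocalSeq <= localSeq) {
            index++;
        }
    }
    throw new Error(`unknown row handle: ${handle}`);
}

class Resubmitter {
    private nextLocalSeq: number;
    private readonly preserveLocalSeq: boolean;

    constructor(nextLocalSeq: number, preserveLocalSeq: boolean) {
        this.nextLocalSeq = nextLocalSeq;
        this.preserveLocalSeq = preserveLocalSeq;
    }

    // Recompute the row index, then either keep the op's localSeq (fixed
    // behavior) or stamp a fresh, higher one (old behavior).
    resubmit(rows: Row[], op: SetCellOp): SetCellOp {
        const row = rowAsOf(rows, op.rowHandle, op.localSeq);
        const localSeq = this.preserveLocalSeq ? op.localSeq : this.nextLocalSeq++;
        return { rowHandle: op.rowHandle, row, localSeq };
    }
}

// Local history, nothing acked yet:
//   localSeq 1: insert row "A"
//   localSeq 2: setCell on row "A"
//   localSeq 3: insert row "B" above "A"
const rows: Row[] = [
    { handle: "B", insertLocalSeq: 3 },
    { handle: "A", insertLocalSeq: 1 },
];
const original: SetCellOp = { rowHandle: "A", row: 0, localSeq: 2 };

// Returns the row index sent on the 1st and on the 2nd reconnection.
function reconnectTwice(preserveLocalSeq: boolean): number[] {
    const resubmitter = new Resubmitter(4, preserveLocalSeq);
    const first = resubmitter.resubmit(rows, original);
    const second = resubmitter.resubmit(rows, first); // replays the resubmitted op
    return [first.row, second.row];
}

const buggy = reconnectTwice(false); // [0, 1]
const fixed = reconnectTwice(true);  // [0, 0]
```

In this model the first reconnection is correct either way, because the original 'localSeq' is still attached to the op. Only the second replay goes wrong: the re-stamped op looks newer than the insertion of "B", so it targets row 1 even though it is replayed ahead of that insertion.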

DLehenbauer · 2021-04-22T18:56:40Z

packages/dds/matrix/src/matrix.ts

-            const localSeq = this.nextLocalSeq();
+            this.sendSetCellOp(row, col, value, rowHandle, colHandle);
+        }
+    }


Ugh. What an awful diff for a simple change:

1. The 'isAttached()' check is hoisted from 'sendSetCellOp()' to the caller. (above ---^)
2. 'sendSetCellOp()' now takes 'localSeq' as an argument. (below ----v)
3. ...which forced me to reformat due to line length (below ---v)

The motivation for #1 is to pedantically avoid advancing 'nextLocalSeq' when no op will be sent.
The motivation for #2 is to enable 'reSubmitCore' to preserve the 'localSeq' from the original op.
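
The shape of these two changes can be sketched roughly as follows. This is a simplified illustration, not the code in matrix.ts: 'ToyMatrix', 'SetCellMessage', 'resubmit', and 'lastLocalSeq' are hypothetical names, and the row/col handles from the real signature are left out.

```typescript
// Hypothetical stand-in for the message a matrix would submit.
interface SetCellMessage {
    row: number;
    col: number;
    value: string;
    localSeq: number;
}

class ToyMatrix {
    public readonly sent: SetCellMessage[] = [];
    private readonly attached: boolean;
    private localSeq = 0;

    constructor(attached: boolean) {
        this.attached = attached;
    }

    public get lastLocalSeq(): number {
        return this.localSeq;
    }

    public setCell(row: number, col: number, value: string): void {
        // Change 1: the attachment check lives in the caller, so the counter
        // only advances when an op is actually going to be sent.
        if (this.attached) {
            this.sendSetCellOp(row, col, value, this.nextLocalSeq());
        }
    }

    public resubmit(message: SetCellMessage): void {
        // Change 2: resubmission passes the original localSeq through
        // instead of allocating a new one.
        this.sendSetCellOp(message.row, message.col, message.value, message.localSeq);
    }

    private nextLocalSeq(): number {
        return ++this.localSeq;
    }

    private sendSetCellOp(row: number, col: number, value: string, localSeq: number): void {
        this.sent.push({ row, col, value, localSeq });
    }
}

// Detached: nothing is sent and the counter does not move.
const detached = new ToyMatrix(false);
detached.setCell(0, 0, "x");

// Attached: the op keeps localSeq 1 through two resubmissions.
const attached = new ToyMatrix(true);
attached.setCell(0, 0, "A");
attached.resubmit(attached.sent[0]);
attached.resubmit(attached.sent[1]);
const stamps = attached.sent.map((message) => message.localSeq);
```

Passing 'localSeq' explicitly from both call sites is a choice made for this sketch; the real method may obtain it differently on the first submission.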

DLehenbauer · 2021-04-22T18:57:43Z

packages/dds/matrix/src/matrix.ts

@@ -511,6 +515,7 @@ export class SharedMatrix<T extends Serializable = Serializable>
                            setOp.value,
                            rowHandle,
                            colHandle,
+                            localSeq,


...and here's the only "real" change: preserving 'localSeq' during resubmission.

DLehenbauer · 2021-04-22T18:58:32Z

packages/dds/matrix/src/test/matrix.spec.ts

+                [undefined, undefined, undefined, "A"],
+            ]);
+        });
+


New test case to cover the missing multiple reconnection case.

DLehenbauer · 2021-04-22T18:58:57Z

packages/dds/matrix/src/test/matrix.stress.spec.ts

+                            trace?.push(`containerRuntime${matrixIndex + 1}.connected = true;`);
+                            runtimes[matrixIndex].connected = true;
+                        }
+
                        trace?.push(`containerRuntime${matrixIndex + 1}.connected = false;`);


Change to stress to cover multiple reconnections as well.

msfluid-bot · 2021-04-22T19:03:03Z

■ @fluidframework/base-host: No change

Metric Name	Baseline Size	Compare Size	Size Diff
main.js	175.45 KB	175.45 KB	■ No change
Total Size	175.45 KB	175.45 KB	■ No change

⯅ @fluid-example/bundle-size-tests: +71 Bytes

Metric Name	Baseline Size	Compare Size	Size Diff
container.js	205.46 KB	205.46 KB	■ No change
map.js	49.53 KB	49.53 KB	■ No change
matrix.js	144.06 KB	144.13 KB	⯅ +71 Bytes
odspDriver.js	210.15 KB	210.15 KB	■ No change
sharedString.js	158.98 KB	158.98 KB	■ No change
Total Size	768.18 KB	768.25 KB	⯅ +71 Bytes

Baseline commit: 6974711

Generated by 🚫 dangerJS against 72352e1

DLehenbauer · 2021-04-22T19:05:04Z

packages/dds/matrix/src/test/matrix.stress.spec.ts

@@ -269,7 +276,7 @@ describe("Matrix", () => {
            { numClients: 2, numOps: 200, syncProbability: 0.3, disconnectProbability: 0, seed: 0x84d43a0a },
            { numClients: 3, numOps: 200, syncProbability: 0.1, disconnectProbability: 0, seed: 0x655c763b },
            { numClients: 5, numOps: 200, syncProbability: 0.0, disconnectProbability: 0, seed: 0x2f98736d },
-            { numClients: 2, numOps: 200, syncProbability: 0.3, disconnectProbability: 1, seed: 0x84d43a0a },
+            { numClients: 2, numOps: 200, syncProbability: 0.2, disconnectProbability: 0.4, seed: 0x84d43a0a },


Adjusted the mix a bit. Previously, this stress config would force reconnection after every op, but it's more interesting to have a handful of ops pending during resubmission.

(There's also a pre-existing "long haul" stress that hits reconnection with hundreds of ops pending, but I only run that on my box.)
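
As a rough illustration of why the lower probability is more interesting, the sketch below counts how many ops are pending each time a reconnection is triggered. It is a toy loop, not the stress test itself: 'pendingAtReconnect' is a hypothetical helper, 'mulberry32' is a generic seeded PRNG standing in for whatever generator the test uses, and syncing (which would also drain pending ops) is ignored.

```typescript
// Small seeded PRNG (mulberry32) returning floats in [0, 1).
// Used here only as a stand-in random source.
function mulberry32(seed: number): () => number {
    let a = seed >>> 0;
    return () => {
        a = (a + 0x6d2b79f5) >>> 0;
        let t = a;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// For each simulated reconnection, record how many ops were pending.
function pendingAtReconnect(numOps: number, disconnectProbability: number, seed: number): number[] {
    const random = mulberry32(seed);
    const batches: number[] = [];
    let pending = 0;
    for (let i = 0; i < numOps; i++) {
        pending++; // one more unacked local op
        if (random() < disconnectProbability) {
            batches.push(pending); // this many ops get resubmitted together
            pending = 0;
        }
    }
    return batches;
}

// disconnectProbability = 1: every op triggers a reconnection, so each
// resubmission carries exactly one op.
const always = pendingAtReconnect(200, 1, 0x84d43a0a);

// disconnectProbability = 0.4: reconnections are less frequent, so several
// ops can accumulate before one happens.
const mixed = pendingAtReconnect(200, 0.4, 0x84d43a0a);

// disconnectProbability = 0: no reconnections at all.
const never = pendingAtReconnect(200, 0, 0x84d43a0a);
```

With a probability of 1 every entry in 'always' is 1, which is the degenerate case described above. Batches larger than 1 are what exercise the ordering between pending 'setCell()' ops and pending row/col insertions.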

DLehenbauer · 2021-04-22T19:06:58Z

packages/dds/matrix/src/test/matrix.spec.ts

+            //           length = 3
+            //           end    = -1 + 3 = 2
+            //
+            //       In which case, pass the empty segment into 'findReconnectionPostition()'.


^--- This logic was already correct. Just preserving a comment from a similar test.

DLehenbauer · 2021-04-22T19:07:22Z

packages/dds/matrix/src/test/matrix.spec.ts

+            // the original 'localSeq' or caused by state mutations during reconnection.
+            containerRuntime1.connected = false;
+            containerRuntime1.connected = true;
+


^--- Reconnecting twice is the secret to repro'ing this bug.

DLehenbauer · 2021-04-22T19:09:12Z

packages/dds/matrix/src/test/matrix.stress.spec.ts

+                    process.stdout.write(
+                        `Stress loop: ${++iterations} iterations completed - Total Elapsed: ${
+                            ((Date.now() - start) / 1000).toFixed(2)
+                        }s\n`);


^-- The above console output only appears in the "long haul" stress. (i.e., not during 'npm t', in the lab, etc.)

anthony-murphy · 2021-04-22T19:54:14Z

Should this get backported to release 0.38?

DLehenbauer · 2021-04-22T20:16:53Z

Thanks, @anthony-murphy!

We're okay waiting for 0.39 to release. The issue was found by refreshing a document while pasting. That scenario is a bit fringe, and the app has already mitigated it by changing the row/col insertion order during paste.

Matrix: Preserve 'localSeq' across multiple reconnections

72352e1

DLehenbauer requested a review from anthony-murphy April 22, 2021 18:48

github-actions bot requested a review from vladsud April 22, 2021 18:48

DLehenbauer changed the title ~~Matrix: Fix incorrect row/col adjustments during 2nd reconnection attempt~~ Matrix: Fix incorrect row/col adjustment w/multiple reconnections Apr 22, 2021

anthony-murphy approved these changes Apr 22, 2021

DLehenbauer merged commit 9576558 into main Apr 22, 2021

DLehenbauer deleted the matrix branch April 22, 2021 20:17
