New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refresh immediately after summary ack #18968
Refresh immediately after summary ack #18968
Conversation
packages/runtime/container-runtime/src/summary/runningSummarizer.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/summaryGenerator.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/runningSummarizer.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/runningSummarizer.ts
Outdated
Show resolved
Hide resolved
| await waitForContainerConnection(container); | ||
| await waitForContainerConnection(container2); | ||
|
|
||
| // The first summary needs to be generated otherwise the summarizer will only generate one summary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, I was running the test without generating the first summary and it would not work. I'll have to dig into this to answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically this doesn't work because the retry logic based on the call to trySummarize requires the summarizerLock to be defined. Because it's undefined the heuristics runner doesn't run in the afterSummaryAction. So essentially the first summary needs to occur to enable the summary lock to be locked, and then a refresh needs to happen in order to enable the lock. It's a race condition.
| dataObject._root.set(`first`, "op"); | ||
| await provider.ensureSynchronized(); | ||
|
|
||
| const array: string[] = []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this array doing? And why does container3 load from the first ack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I was curious as to what the behavior of the container would be, whether or not the 2nd op would be summarized or not. Based on the tests, with the local server, the 2nd op would be summarized in 1st the summary. With the tinylicious server, I think the 2nd op would not be summarized until the 2nd summary. It's interesting behavior that I wasn't sure if it was important to note. The summaries generated would be different, but the document should recover/become eventually consistent. This is a rare edge case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another note, the array should contain 2 acks, and I should check for that at the end of the test. Another note, if we update the summary heuristics runner in any fashion, this test will likely start failing.
| } | ||
| }); | ||
|
|
||
| // Sending ops from two different clients/containers causes the heuristics summarizer to handle ops twice, which causes the summarizer to summarize before acking its first summary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no validation that the second summarize is triggered before the first one is acked. You should add that validation somehow otherwise, it makes the test undetereministic. It may not be testing what is claimed here as everything is async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know exactly how to add in that validation without just starting to cast to the actual internal runtime implementation. What needs to happen is that the running summarizer listens to every op and figures out if it needs to run the summary heuristics runner, so I set the max ops threshold very low (1 op). Then during the summary itself, there need to be unprocessed ops that sit in the inbound queue that again trigger the heuristics runner from the afterSummaryAction (or a timer, but ops is simpler). I have two containers send different ops, because when I try to do it all in one container, maybe all the ops get batched into one op (or are processed together) which doesn't trigger the summary. I'm not exactly sure if what's the case, but when I did try it with 1 container I could not effect the same result. The second op from the second container should trigger the 2nd summary as it sits in the queue unprocessed as the run summary is triggered synchronously. I'm not in love with my repro, but if there's a better way to implement it and write it so it's more testable, I'm all ears.
| }); | ||
|
|
||
| const dataObject3 = (await container3.getEntryPoint()) as ITestDataObject; | ||
| assert(dataObject3._root.get(`a`) === "op1"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this validation needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helps test eventual consistency. There's actually some interesting behavior here that may be worth investigating as this is an edge case.
packages/test/test-end-to-end-tests/src/test/summarizeRefreshLatestSummary.spec.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/runningSummarizer.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/runningSummarizer.ts
Outdated
Show resolved
Hide resolved
| await waitForContainerConnection(container); | ||
| await waitForContainerConnection(container2); | ||
|
|
||
| // The first summary needs to be generated otherwise the summarizer will only generate one summary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not understand what this comment means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added additional clarifying comments. This is to put the summarizer client in the "right" state for conditions of two summaries occurring simultaneously one after another.
| // The first summary needs to be generated otherwise the summarizer will only generate one summary. | ||
| dataObject._root.set(`first`, "op"); | ||
| await provider.ensureSynchronized(); | ||
| mockLogger.assertMatch([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just waiting for the op to be processed is not enough. It doesn't guarantee that the summary would have been generated and can result in this test being flaky.
Is it possible to reliably test this in an end-to-end test? I am not sure what you could wait on to validate that things happen as expected.
I would do something like just call the summarize function on summary generator and validate that refresh happens. Maybe in a unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I limited this to just local server, do you think it'll still be flakey here?
| const twoSummariesDeferred = new Deferred<void>(); | ||
| const summaryVersions: string[] = []; | ||
| let summariesSeen = 0; | ||
| container.on("op", (op) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot going on here. Can you add a comment explaining what's going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to locally cause two summaries to be generated in a row, I have to do this weird dance, I'm wondering if this is such a freak scenario that it's not happening at any decent level in production.
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400) Updated from previous PR: #18754 as it was getting stale. ## Description After ensuring that we've refreshed our latest summary state: #18968, this PR adds validation that before we start a summary, that all the summary sequence numbers are as expected from the `RunningSummarizer` down to the `SummarizerNode` system itself. This is to prevent any sort of data corruption/data loss from accidentally occurring. This check used to run in the two-commit summary system, which would prevent us from submitting a bad summary. Now that we've moved to the one-commit summary system, this should keep us safe from potentially negatively impacting documents. ## What's in the PR This PR adds validation that the summary system from the RunningSummarizer down to the SummarizerNode system that the starting sequence numbers are correct. The code is behind a feature flag. There is telemetry that's added so that we can see before we even turn on the feature what the impact to the entire system will look like. --------- Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
[AB#6490](https://dev.azure.com/fluidframework/internal/_workitems/edit/6490) ## Background Attempting [AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400) to add validation to the summarizer before summaries occurred to verify that the summary the running summarizer is tracking is the same as the summarizer node system, we discovered that the validation was failing. The validation caught locally that a single summarizer would start a second summary before processing the ack of the first summary it created. ## Solution This PR attempts to address that issue. Instead of waiting for the summary lock to be released, the ack is immediately refreshed. This may result in multiple refreshes. The `ContainerRuntime`'s refresh logic is resilient and should be able to handle refreshing multiple of the same acks. ## Further Work The `RunningSummarizer` really only expects its own acks. Any foreign acks should close/restart the summarizer. The code in the `RunningSummarizer` should reflect that reality.
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400) Updated from previous PR: microsoft#18754 as it was getting stale. ## Description After ensuring that we've refreshed our latest summary state: microsoft#18968, this PR adds validation that before we start a summary, that all the summary sequence numbers are as expected from the `RunningSummarizer` down to the `SummarizerNode` system itself. This is to prevent any sort of data corruption/data loss from accidentally occurring. This check used to run in the two-commit summary system, which would prevent us from submitting a bad summary. Now that we've moved to the one-commit summary system, this should keep us safe from potentially negatively impacting documents. ## What's in the PR This PR adds validation that the summary system from the RunningSummarizer down to the SummarizerNode system that the starting sequence numbers are correct. The code is behind a feature flag. There is telemetry that's added so that we can see before we even turn on the feature what the impact to the entire system will look like. --------- Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
AB#6490
Background
Attempting AB#3400 to add validation to the summarizer before summaries occurred to verify that the summary the running summarizer is tracking is the same as the summarizer node system, we discovered that the validation was failing. The validation caught locally that a single summarizer would start a second summary before processing the ack of the first summary it created.
Solution
This PR attempts to address that issue. Instead of waiting for the summary lock to be released, the ack is immediately refreshed. This may result in multiple refreshes. The
ContainerRuntime's refresh logic is resilient and should be able to handle refreshing multiple of the same acks.Further Work
The
RunningSummarizerreally only expects its own acks. Any foreign acks should close/restart the summarizer. The code in theRunningSummarizershould reflect that reality.