New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Summarizer correctness validation #18754
Summarizer correctness validation #18754
Conversation
⯅ @fluid-example/bundle-size-tests: +2.33 KB
Baseline commit: 7620034 |
packages/runtime/container-runtime/src/summary/summarizerNode/summarizerNodeUtils.ts
Outdated
Show resolved
Hide resolved
packages/runtime/container-runtime/src/summary/summarizerTypes.ts
Outdated
Show resolved
Hide resolved
| @@ -14,6 +14,15 @@ export interface IRefreshSummaryResult { | |||
| isSummaryNewer: boolean; | |||
| } | |||
|
|
|||
| export interface IStartSummaryResult { | |||
| /** The number of summarizerNodes at the start of the summary. */ | |||
| nodes: number; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there value in tracking this number? May be useful for % of invalid nodes, but not too sure (since any number of invalid nodes is bad)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be useful to see if it's a partial node tree issue or a whole node tree issue. I'm not sure what I'll see, but I'd like to know if part of the tree is updated, or none of the tree is updated.
…summarizerNodeUtils.ts Co-authored-by: Kian Thompson <102998837+kian-thompson@users.noreply.github.com>
| this.generator = new SummaryGenerator( | ||
| this.pendingAckTimer, | ||
| this.heuristicData, | ||
| this.submitSummaryCallback, | ||
| () => { | ||
| async (options: IRefreshSummaryAckOptions) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you separate this logic into a separate PR. This will make it easier to review and also help back out or pick this part in case of any requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
| () => { | ||
| async (options: IRefreshSummaryAckOptions) => { | ||
| if (immediatelyRefreshLatestSummaryAck) { | ||
| await this.refreshLatestSummaryAckCallback(options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This fix seems fine but it introduces another checkpoint into the system when we have other patterns to already handle these, i.e., summarizingLock.
The summarizingLock exists to ensure that summary submission and refreshing an ack are mutually exclusive and they happen in correct orders. I believe the real bug here is that in the afterSummaryAction function, there is an attempt to run the summaryHeuristics synchronously holding the same summarizingLock that was used to run the summary. That logic should be async to give any other process holding the lock to go through first.
Even with this fix, that bug exists where summary submission can hold the lock indefinitely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean. Ok that is really helpful. I'll look into this deeper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest trying that fix out. I am not 100% sure that will fix everything. Every time I look at the logic around summarizing + refreshing, I always learn something that I did not expect :-).
|
Migrated changes to this PR as this pr is old #19488 |
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400) Updated from previous PR: #18754 as it was getting stale. ## Description After ensuring that we've refreshed our latest summary state: #18968, this PR adds validation that before we start a summary, that all the summary sequence numbers are as expected from the `RunningSummarizer` down to the `SummarizerNode` system itself. This is to prevent any sort of data corruption/data loss from accidentally occurring. This check used to run in the two-commit summary system, which would prevent us from submitting a bad summary. Now that we've moved to the one-commit summary system, this should keep us safe from potentially negatively impacting documents. ## What's in the PR This PR adds validation that the summary system from the RunningSummarizer down to the SummarizerNode system that the starting sequence numbers are correct. The code is behind a feature flag. There is telemetry that's added so that we can see before we even turn on the feature what the impact to the entire system will look like. --------- Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400) Updated from previous PR: microsoft#18754 as it was getting stale. ## Description After ensuring that we've refreshed our latest summary state: microsoft#18968, this PR adds validation that before we start a summary, that all the summary sequence numbers are as expected from the `RunningSummarizer` down to the `SummarizerNode` system itself. This is to prevent any sort of data corruption/data loss from accidentally occurring. This check used to run in the two-commit summary system, which would prevent us from submitting a bad summary. Now that we've moved to the one-commit summary system, this should keep us safe from potentially negatively impacting documents. ## What's in the PR This PR adds validation that the summary system from the RunningSummarizer down to the SummarizerNode system that the starting sequence numbers are correct. The code is behind a feature flag. There is telemetry that's added so that we can see before we even turn on the feature what the impact to the entire system will look like. --------- Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
AB#6490 and AB#3400
Description
Doing the work for adding validation to check to see that the latest summary reference sequence number is the same when we summarize from the
RunningSummarizerdown to theSummarizerNodesystem, we ran into a bug in which the summary reference sequence numbers did not match. What was happening was for a single summarizer would submit a summary (S1), receive that summary's ack (S1Ack) in theSummaryCollection, but it wouldn't update theSummarizerNodesystem with that information as there was a summary lock. It would then start a new summary (S2) without processing (S1Ack) and thus we would be in an inconsitent state.For AB#6490, some logic was added to the
SummaryGeneratorwhich was already waiting for the summarize ack to actually process the summarize ack before considering its summary successfully completed.What's in the PR
This PR adds validation that the summary system from the
RunningSummarizerdown to theSummarizerNodesystem that the starting sequence numbers are correct.This PR also forces the
SummaryGeneratorto immediately update theSummarizerNodesystem with the summary ack it had already waited for.