Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summarizer correctness validation #18754

Conversation

tyler-cai-microsoft
Copy link
Contributor

@tyler-cai-microsoft tyler-cai-microsoft commented Dec 11, 2023

AB#6490 and AB#3400

Description

Doing the work for adding validation to check to see that the latest summary reference sequence number is the same when we summarize from the RunningSummarizer down to the SummarizerNode system, we ran into a bug in which the summary reference sequence numbers did not match. What was happening was for a single summarizer would submit a summary (S1), receive that summary's ack (S1Ack) in the SummaryCollection, but it wouldn't update the SummarizerNode system with that information as there was a summary lock. It would then start a new summary (S2) without processing (S1Ack) and thus we would be in an inconsitent state.

For AB#6490, some logic was added to the SummaryGenerator which was already waiting for the summarize ack to actually process the summarize ack before considering its summary successfully completed.

What's in the PR

This PR adds validation that the summary system from the RunningSummarizer down to the SummarizerNode system that the starting sequence numbers are correct.

This PR also forces the SummaryGenerator to immediately update the SummarizerNode system with the summary ack it had already waited for.

@github-actions github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc public api change Changes to a public API base: main PRs targeted against main branch labels Dec 11, 2023
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Dec 12, 2023

@fluid-example/bundle-size-tests: +2.33 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 598.52 KB 599.67 KB +1.16 KB
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 293.66 KB 294.82 KB +1.16 KB
loader.js 188.15 KB 188.15 KB +2 Bytes
map.js 99.29 KB 99.29 KB +2 Bytes
matrix.js 242.82 KB 242.82 KB +2 Bytes
odspDriver.js 112 KB 112.01 KB +2 Bytes
odspPrefetchSnapshot.js 67.4 KB 67.4 KB +2 Bytes
sharedString.js 307.67 KB 307.67 KB No change
sharedTree2.js 323.16 KB 323.16 KB No change
Total Size 2.29 MB 2.29 MB +2.33 KB

Baseline commit: 7620034

Generated by 🚫 dangerJS against 396caba

@@ -14,6 +14,15 @@ export interface IRefreshSummaryResult {
isSummaryNewer: boolean;
}

export interface IStartSummaryResult {
/** The number of summarizerNodes at the start of the summary. */
nodes: number;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there value in tracking this number? May be useful for % of invalid nodes, but not too sure (since any number of invalid nodes is bad)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be useful to see if it's a partial node tree issue or a whole node tree issue. I'm not sure what I'll see, but I'd like to know if part of the tree is updated, or none of the tree is updated.

this.generator = new SummaryGenerator(
this.pendingAckTimer,
this.heuristicData,
this.submitSummaryCallback,
() => {
async (options: IRefreshSummaryAckOptions) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you separate this logic into a separate PR. This will make it easier to review and also help back out or pick this part in case of any requirements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

() => {
async (options: IRefreshSummaryAckOptions) => {
if (immediatelyRefreshLatestSummaryAck) {
await this.refreshLatestSummaryAckCallback(options);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix seems fine but it introduces another checkpoint into the system when we have other patterns to already handle these, i.e., summarizingLock.
The summarizingLock exists to ensure that summary submission and refreshing an ack are mutually exclusive and they happen in correct orders. I believe the real bug here is that in the afterSummaryAction function, there is an attempt to run the summaryHeuristics synchronously holding the same summarizingLock that was used to run the summary. That logic should be async to give any other process holding the lock to go through first.

Even with this fix, that bug exists where summary submission can hold the lock indefinitely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. Ok that is really helpful. I'll look into this deeper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest trying that fix out. I am not 100% sure that will fix everything. Every time I look at the logic around summarizing + refreshing, I always learn something that I did not expect :-).

@tyler-cai-microsoft
Copy link
Contributor Author

Migrated changes to this PR as this pr is old #19488

tyler-cai-microsoft added a commit that referenced this pull request Feb 12, 2024
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400)

Updated from previous PR:
#18754 as it was getting
stale.

## Description
After ensuring that we've refreshed our latest summary state:
#18968, this PR adds
validation that before we start a summary, that all the summary sequence
numbers are as expected from the `RunningSummarizer` down to the
`SummarizerNode` system itself. This is to prevent any sort of data
corruption/data loss from accidentally occurring. This check used to run
in the two-commit summary system, which would prevent us from submitting
a bad summary. Now that we've moved to the one-commit summary system,
this should keep us safe from potentially negatively impacting
documents.

## What's in the PR
This PR adds validation that the summary system from the
RunningSummarizer down to the SummarizerNode system that the starting
sequence numbers are correct.

The code is behind a feature flag. There is telemetry that's added so
that we can see before we even turn on the feature what the impact to
the entire system will look like.

---------

Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
kian-thompson pushed a commit to kian-thompson/FluidFramework that referenced this pull request Feb 13, 2024
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400)

Updated from previous PR:
microsoft#18754 as it was getting
stale.

## Description
After ensuring that we've refreshed our latest summary state:
microsoft#18968, this PR adds
validation that before we start a summary, that all the summary sequence
numbers are as expected from the `RunningSummarizer` down to the
`SummarizerNode` system itself. This is to prevent any sort of data
corruption/data loss from accidentally occurring. This check used to run
in the two-commit summary system, which would prevent us from submitting
a bad summary. Now that we've moved to the one-commit summary system,
this should keep us safe from potentially negatively impacting
documents.

## What's in the PR
This PR adds validation that the summary system from the
RunningSummarizer down to the SummarizerNode system that the starting
sequence numbers are correct.

The code is behind a feature flag. There is telemetry that's added so
that we can see before we even turn on the feature what the impact to
the entire system will look like.

---------

Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch public api change Changes to a public API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants