Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refresh immediately after summary ack #18968

Conversation

tyler-cai-microsoft
Copy link
Contributor

@tyler-cai-microsoft tyler-cai-microsoft commented Dec 21, 2023

AB#6490

Background

Attempting AB#3400 to add validation to the summarizer before summaries occurred to verify that the summary the running summarizer is tracking is the same as the summarizer node system, we discovered that the validation was failing. The validation caught locally that a single summarizer would start a second summary before processing the ack of the first summary it created.

Solution

This PR attempts to address that issue. Instead of waiting for the summary lock to be released, the ack is immediately refreshed. This may result in multiple refreshes. The ContainerRuntime's refresh logic is resilient and should be able to handle refreshing multiple of the same acks.

Further Work

The RunningSummarizer really only expects its own acks. Any foreign acks should close/restart the summarizer. The code in the RunningSummarizer should reflect that reality.

@github-actions github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch labels Dec 21, 2023
await waitForContainerConnection(container);
await waitForContainerConnection(container2);

// The first summary needs to be generated otherwise the summarizer will only generate one summary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, I was running the test without generating the first summary and it would not work. I'll have to dig into this to answer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically this doesn't work because the retry logic based on the call to trySummarize requires the summarizerLock to be defined. Because it's undefined the heuristics runner doesn't run in the afterSummaryAction. So essentially the first summary needs to occur to enable the summary lock to be locked, and then a refresh needs to happen in order to enable the lock. It's a race condition.

dataObject._root.set(`first`, "op");
await provider.ensureSynchronized();

const array: string[] = [];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this array doing? And why does container3 load from the first ack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I was curious as to what the behavior of the container would be, whether or not the 2nd op would be summarized or not. Based on the tests, with the local server, the 2nd op would be summarized in 1st the summary. With the tinylicious server, I think the 2nd op would not be summarized until the 2nd summary. It's interesting behavior that I wasn't sure if it was important to note. The summaries generated would be different, but the document should recover/become eventually consistent. This is a rare edge case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note, the array should contain 2 acks, and I should check for that at the end of the test. Another note, if we update the summary heuristics runner in any fashion, this test will likely start failing.

}
});

// Sending ops from two different clients/containers causes the heuristics summarizer to handle ops twice, which causes the summarizer to summarize before acking its first summary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no validation that the second summarize is triggered before the first one is acked. You should add that validation somehow otherwise, it makes the test undetereministic. It may not be testing what is claimed here as everything is async.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know exactly how to add in that validation without just starting to cast to the actual internal runtime implementation. What needs to happen is that the running summarizer listens to every op and figures out if it needs to run the summary heuristics runner, so I set the max ops threshold very low (1 op). Then during the summary itself, there need to be unprocessed ops that sit in the inbound queue that again trigger the heuristics runner from the afterSummaryAction (or a timer, but ops is simpler). I have two containers send different ops, because when I try to do it all in one container, maybe all the ops get batched into one op (or are processed together) which doesn't trigger the summary. I'm not exactly sure if what's the case, but when I did try it with 1 container I could not effect the same result. The second op from the second container should trigger the 2nd summary as it sits in the queue unprocessed as the run summary is triggered synchronously. I'm not in love with my repro, but if there's a better way to implement it and write it so it's more testable, I'm all ears.

});

const dataObject3 = (await container3.getEntryPoint()) as ITestDataObject;
assert(dataObject3._root.get(`a`) === "op1");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this validation needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helps test eventual consistency. There's actually some interesting behavior here that may be worth investigating as this is an edge case.

@rajatch-ff rajatch-ff requested a review from a team December 29, 2023 21:21
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Jan 4, 2024

Baseline CI build failed, cannot generate bundle analysis at this time


Baseline commit: 22904a9

Generated by 🚫 dangerJS against f469edd

await waitForContainerConnection(container);
await waitForContainerConnection(container2);

// The first summary needs to be generated otherwise the summarizer will only generate one summary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not understand what this comment means.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added additional clarifying comments. This is to put the summarizer client in the "right" state for conditions of two summaries occurring simultaneously one after another.

// The first summary needs to be generated otherwise the summarizer will only generate one summary.
dataObject._root.set(`first`, "op");
await provider.ensureSynchronized();
mockLogger.assertMatch([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just waiting for the op to be processed is not enough. It doesn't guarantee that the summary would have been generated and can result in this test being flaky.

Is it possible to reliably test this in an end-to-end test? I am not sure what you could wait on to validate that things happen as expected.
I would do something like just call the summarize function on summary generator and validate that refresh happens. Maybe in a unit test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I limited this to just local server, do you think it'll still be flakey here?

const twoSummariesDeferred = new Deferred<void>();
const summaryVersions: string[] = [];
let summariesSeen = 0;
container.on("op", (op) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot going on here. Can you add a comment explaining what's going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to locally cause two summaries to be generated in a row, I have to do this weird dance, I'm wondering if this is such a freak scenario that it's not happening at any decent level in production.

@tyler-cai-microsoft tyler-cai-microsoft merged commit a940e0b into microsoft:main Feb 1, 2024
30 checks passed
tyler-cai-microsoft added a commit that referenced this pull request Feb 12, 2024
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400)

Updated from previous PR:
#18754 as it was getting
stale.

## Description
After ensuring that we've refreshed our latest summary state:
#18968, this PR adds
validation that before we start a summary, that all the summary sequence
numbers are as expected from the `RunningSummarizer` down to the
`SummarizerNode` system itself. This is to prevent any sort of data
corruption/data loss from accidentally occurring. This check used to run
in the two-commit summary system, which would prevent us from submitting
a bad summary. Now that we've moved to the one-commit summary system,
this should keep us safe from potentially negatively impacting
documents.

## What's in the PR
This PR adds validation that the summary system from the
RunningSummarizer down to the SummarizerNode system that the starting
sequence numbers are correct.

The code is behind a feature flag. There is telemetry that's added so
that we can see before we even turn on the feature what the impact to
the entire system will look like.

---------

Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
alexvy86 pushed a commit to alexvy86/FluidFramework that referenced this pull request Feb 13, 2024
[AB#6490](https://dev.azure.com/fluidframework/internal/_workitems/edit/6490)

## Background
Attempting
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400)
to add validation to the summarizer before summaries occurred to verify
that the summary the running summarizer is tracking is the same as the
summarizer node system, we discovered that the validation was failing.
The validation caught locally that a single summarizer would start a
second summary before processing the ack of the first summary it
created.

## Solution
This PR attempts to address that issue. Instead of waiting for the
summary lock to be released, the ack is immediately refreshed. This may
result in multiple refreshes. The `ContainerRuntime`'s refresh logic is
resilient and should be able to handle refreshing multiple of the same
acks.

## Further Work
The `RunningSummarizer` really only expects its own acks. Any foreign
acks should close/restart the summarizer. The code in the
`RunningSummarizer` should reflect that reality.
kian-thompson pushed a commit to kian-thompson/FluidFramework that referenced this pull request Feb 13, 2024
[AB#3400](https://dev.azure.com/fluidframework/internal/_workitems/edit/3400)

Updated from previous PR:
microsoft#18754 as it was getting
stale.

## Description
After ensuring that we've refreshed our latest summary state:
microsoft#18968, this PR adds
validation that before we start a summary, that all the summary sequence
numbers are as expected from the `RunningSummarizer` down to the
`SummarizerNode` system itself. This is to prevent any sort of data
corruption/data loss from accidentally occurring. This check used to run
in the two-commit summary system, which would prevent us from submitting
a bad summary. Now that we've moved to the one-commit summary system,
this should keep us safe from potentially negatively impacting
documents.

## What's in the PR
This PR adds validation that the summary system from the
RunningSummarizer down to the SummarizerNode system that the starting
sequence numbers are correct.

The code is behind a feature flag. There is telemetry that's added so
that we can see before we even turn on the feature what the impact to
the entire system will look like.

---------

Co-authored-by: Mark Fields <markfields@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants