Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal - refreshLatestSummaryAck's fetchSnapshotFromStorage always retrieving the latest summary #12313

Merged
merged 13 commits into from
Nov 1, 2022

Conversation

NicholasCouri
Copy link
Contributor

Description

The refreshLatestSummaryAck currently retrieves a version of a summary based on the summaryAckHandle received. The proposal is to ignore that handle and always retrieve the latest version (explicitly ignoring the Ack Handle).
In case the delta manager is behind the latest summary's ref seq number we would wait for it to catch up and proceed with the same existing logic.

As noted during the meeting, I'm still in the process of validating the change and the end-to-end test was an initial attempt to validate it but I'd love to get feedback in case there are scenarios that I might be breaking or something I'm missing

@NicholasCouri NicholasCouri requested review from a team as code owners October 6, 2022 18:51
@github-actions github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch labels Oct 6, 2022
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Oct 6, 2022

@fluid-example/bundle-size-tests: +672 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 403.94 KB 404.27 KB +336 Bytes
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 201.1 KB 201.43 KB +336 Bytes
loader.js 153.8 KB 153.8 KB No change
map.js 47.39 KB 47.39 KB No change
matrix.js 138.69 KB 138.69 KB No change
odspDriver.js 118.76 KB 118.76 KB No change
odspPrefetchSnapshot.js 40.54 KB 40.54 KB No change
sharedString.js 158.29 KB 158.29 KB No change
Total Size 1.3 MB 1.3 MB +672 Bytes

Baseline commit: 4fadd10

Generated by 🚫 dangerJS against 5f43213

@vladsud
Copy link
Contributor

vladsud commented Oct 7, 2022

I do not think your code changes that, but do we test and does it work correctly IF summarizer has pending summary, but we get an ack about some other summary first, followed by nack for our summary? Or maybe even no nack, if PUSH crashed somewhere in between?

@vladsud
Copy link
Contributor

vladsud commented Oct 7, 2022

Can you please share a bit more info when we retrieve summary by handle? It feels like with summarizer always starting to wait for summary ops/acks after fully being connected, it should happen ONLY if we have 2 summarizers in the session. While we know this is what actually happens (effectively) with FRS service summaries, where can it happen for SPO? It clearly happens as I hear feedback about it :) I want to make sure we walk that scenarios explicitly - mentally through the code, but also replicate in end-to-end tests, and ensure that it continues to work correctly

@NicholasCouri
Copy link
Contributor Author

NicholasCouri commented Oct 7, 2022

I have added telemetry fluid:telemetry:SummarizerNode:PendingSummaryNotFound to find out when/if it happens and it only happened 6 times (FRS) out of 15k hits in the test db. I'll investigate those. The majority of the time it is a summary ack and we have no pending summaries. Here is a query I was looking at:

union database("Office Fluid Test").office_fluid_ff*
| where Data_hostName == "@fluid-internal/test-service-load"
| where Data_eventName == "fluid:telemetry:SummarizerNode:PendingSummaryNotFound"
| extend propJson = parse_json(Data_details)
| extend summaryRefSeq = propJson.summaryRefSeq
| extend pendingSize = propJson.pendingSize
| summarize Total=count(), PendingSizeNotZero=countif(pendingSize > 0),
PendingSizeNotZeroAndREfSeq0=countif(pendingSize > 0 and Data_referenceSequenceNumber == 0),
RefSeqNumber0=countif(Data_referenceSequenceNumber == 0),
NoUpdate=countif(Data_referenceSequenceNumber == summaryRefSeq),
DidFetchSnapshot=countif(Data_referenceSequenceNumber < summaryRefSeq) by Data_driverType, Data_driverEndpointName


In reply to: 1270858639

@NicholasCouri
Copy link
Contributor Author

I suppose we do try to get summaries by handle whenever we get the latest snapshot from cache and while trying to catch up identify the summary acks among the ops and handlesummaryAcks is called on to refresh the state. The query I shared previously shows how often it happens (pendingSize == 0) on our Office Fluid Test db.


In reply to: 1270860197

@vladsud
Copy link
Contributor

vladsud commented Oct 14, 2022

I suppose we do try to get summaries by handle whenever we get the latest snapshot from cache and while trying to catch up identify the summary acks among the ops and handlesummaryAcks is called on to refresh the state. The query I shared previously shows how often it happens (pendingSize == 0) on our Office Fluid Test db.

I'd love to confirm that. We should not used cached snapshots for summarizer today, but there still might be a race condition while loading those results in that flow.
Conforming for sure is important because it helps us answer two key questions:

  1. With these changes, are we ready to enable loading summarizer from cache? And if so, is it efficient? I.e. I'd think that sequence results in a bunch of calls to getLatest in a row very tight in time, and maybe we should find a way not to do some of these calls (summarizer loading from cache will trigger refresh of cache. When summarizer detects it's not current, it will ask for refresh, which could return results from cache (cache is almost new), and not do another network call).
  2. Understanding if this is strictly boot problem, and if so, considering making this code run only on boot (not having a loop) and push all other cases on "regular" recovery path that involves reboot of summarizer.

@NicholasCouri
Copy link
Contributor Author

NicholasCouri commented Oct 14, 2022

Can you please share a bit more info when we retrieve summary by handle? It feels like with summarizer always starting to wait for summary ops/acks after fully being connected, it should happen ONLY if we have 2 summarizers in the session. While we know this is what actually happens (effectively) with FRS service summaries, where can it happen for SPO? It clearly happens as I hear feedback about it :) I want to make sure we walk that scenarios explicitly - mentally through the code, but also replicate in end-to-end tests, and ensure that it continues to work correctly

I've added some information to Task 444 - it seems the first time the summarizer is launched, after catching up, the handleSummaryAcks does receive the last ack from the summaryCollection before its processing has started. Its local reference number is still 0 and we query the snapshot.

@@ -2379,17 +2380,8 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents>

if (latestSnapshotRefSeq > this.deltaManager.lastSequenceNumber) {
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Remove this check since its done in waitForDeltaManagerToCatchup #Resolved

@@ -2850,24 +2842,61 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents>
}
}

private async waitForDeltaManagerToCatchup(latestSnapshotRefSeq: number,
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: formatting of params should be consistent. #Resolved

summaryLogger: ITelemetryLogger,
): Promise<void> {
if (latestSnapshotRefSeq > this.deltaManager.lastSequenceNumber) {
// We need to catch up to the latest summary's reference sequence number before pausing.
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this comment - it's not before pausing anymore since this is a generic function. #Resolved

/** Implementation of ISummarizerInternalsProvider.refreshLatestSummaryAck */
public async refreshLatestSummaryAck(options: IRefreshSummaryAckOptions) {
const { proposalHandle, ackHandle, summaryRefSeq, summaryLogger } = options;
const readAndParseBlob = async <T>(id: string) => readAndParse<T>(this.storage, id);
// The call to fetch the snapshot is very expensive and not always needed.
// It should only be done by the summarizerNode, if required.
const snapshotTreeFetcher = async () => {
// When fetching from storage we will always get the latest version and do not use the ackHandle.
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Move this comment outside this function with the outer comment. #Resolved


const latestSnapshotRefSeq = await seqFromTree(fetchResult.snapshotTree, readAndParseBlob);
assert(latestSnapshotRefSeq >= summaryRefSeq,
"Latest snapshot reference number greater than summaryRefSeq");
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Latest snapshot reference number greater than summaryRefSeq");
"Latest snapshot reference number should be greater than summaryRefSeq");
``` #Resolved

{
eventName: "LatestSummaryRetrieved",
lastSequenceNumber: latestSnapshotRefSeq,
targetSequenceNumber: summaryRefSeq,
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also log the ackHandle so that it helps correlating with which ack this is done for? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good.

return container;
};

it("Summarizer does not fetch during first summary", async () => {
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What scenario is this testing? Summaries do not result in fetch right? #Resolved


const summarizer2 = await createSummarizer(provider, mainContainer);
versionWrap = await incrementCellValueAndRunSummary(summarizer2, 4 /* expectedMatrixCellValue */);
assert(versionWrap.fetchCount === 0, "There should be one fetch after second summary");
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message says there should be one fetch but the assert is that fetchCount should be 0? #Resolved


const containerRuntime = (summarizer as any).runtime as ContainerRuntime;
let getVersionsFunc = containerRuntime.storage.getVersions;
const funcSummary1 = async (versionId: string | null,
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Rename this to getVersionsOverride so that its clear what it is doing. #Resolved

const value = getAndIncrementCellValue(mainDataStore.matrix, 0, 0, "1");
assert(value === 3, "Value matches expected");

const summarizer2 = await createSummarizer(provider, mainContainer);
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no guarantee that this will load from latest summary. It can also load from the first summary created when the detached container attaches. #Resolved

secondSummarizer.close();

// Load summarizer from previous version triggers fetch.
const newContainer = await loadContainer(summaryVersion);
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why create a container as well as summarizer? Why not just the summarizer? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason - trying to make it a little different.

assert(summaryVersion, "Summary version should be defined");
summarizerClient.close();

const secondSummarizer = await createSummarizer(provider, mainContainer);
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Add a comment that explains why couple more summaries are done, i.e., so that there are more recent summaries. #Resolved

};
export const TestDataObjectType1 = "@fluid-example/test-dataStore1";

class TestDataObject1 extends DataObject {
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using the default data object provided by the TestObjectProvider. It has a root map that you can use to send ops. That will make the test simpler and also you don't have to worry about back-compat as it will be tested automatically. Here, the data store runtime will always be of the current version (N) and you lose compat testing. #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AS I plan to add additional tests, it will give me more flexibility to do that.


// Wait for the summarizer to run again.
const summaryVersion3 = await waitForSummary(summarizerClient2);
assert(summaryVersion3, "Summary version should be defined");
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to validate that the fetch was for the latest snapshot? #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find an easy way - will proceed with this check in so unblock Jatin and take a look at it later on.

return container;
};

it("Loading summarizer from old snapshot", async () => {
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear what this test is doing and its hard to follow. Can you please update the test name to reflect it? And also add either a description here or add comments in the test that explain what is happening. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at this test and I realized I do not need it anymore - the FetchValidation does what I need.

@@ -0,0 +1,224 @@
/*!
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is a separate file needed for this single test? Can't it be part of the above file? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the Latest as it does not add much value anymore.

/** Implementation of ISummarizerInternalsProvider.refreshLatestSummaryAck */
public async refreshLatestSummaryAck(options: IRefreshSummaryAckOptions) {
const { proposalHandle, ackHandle, summaryRefSeq, summaryLogger } = options;
const readAndParseBlob = async <T>(id: string) => readAndParse<T>(this.storage, id);
// The call to fetch the snapshot is very expensive and not always needed.
// It should only be done by the summarizerNode, if required.
// When fetching from storage we will always get the latest version and do not use the ackHandle.
Copy link
Contributor

@agarwal-navin agarwal-navin Oct 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should handle scenarios where this can be called when a refresh is already in progress. We should not rely on the caller doing the right thing.
At least, add asserts or telemetry when this happens so we know someone screwed up. The telemetry will tell us if our assumption is wrong. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have a lock in place for this not to happen ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah but the lock is in summarizer classes. What if there is a bug and it ends up calling it multiple times. Adding an assert or log will help identify these scenarios rather than trying to figure it from other logs which has turned out to be quite cumbersome.

@NicholasCouri NicholasCouri merged commit e28fb39 into microsoft:main Nov 1, 2022
@@ -3091,11 +3117,16 @@ const waitForSeq = async (
// TODO: remove cast to any when actual event is determined
deltaManager.on("closed" as any, reject);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kian-thompson - FYI, as you work on dispose/close on Container, we should have guidance for developers how to change their code, and ideally scan repo and fix our usage, if usage needs to be changed. I think there is no changes, but let's double check :)

agarwal-navin added a commit that referenced this pull request Feb 8, 2023
…reference sequence number (#14052)

## Bug
When refreshLatestSummaryAck is called, if pending state corresponding
to the ack is not found, the latest summary is downloaded. This change
was made fairly recently - [Proposal - refreshLatestSummaryAck's
fetchSnapshotFromStorage always retrieving the latest summary by
NicholasCouri · Pull Request #12313 · microsoft/FluidFramework
(github.com)](#12313).

This means that the snapshot that is downloaded may have a reference
sequence number (fetchedSummaryRefSeq) that is greater than the ack's
(summaryRefSeq). However, the latest summary reference sequence number
of summarizer nodes and GC is updated to summaryRefSeq which can lead to
inconsitent state and future summaries may be incorrect.

## Fix
After the latest snapshot is downloaded, use the reference sequence
number of this snapshot to update the state of the summarizer nodes and
GC.

## Telemetry Improvements
Made a couple of improvements to the refresh latest summary telemetry:
- Renamed the telemetry to distinguish between refreshLatestSummaryAck
and refreshLatestSummaryAckFromServer. This will help in analyzing logs.
- Added the reference sequence number and version of the downloaded
summary which was missing when snapshot was downloaded for
refreshLatestSummaryAckFromServer.
markfields pushed a commit to markfields/FluidFramework that referenced this pull request Feb 9, 2023
…changes)

Fixed bug - RefreshLatestSummaryAck can end up updating wrong latest reference sequence number (microsoft#14052)

When refreshLatestSummaryAck is called, if pending state corresponding
to the ack is not found, the latest summary is downloaded. This change
was made fairly recently - [Proposal - refreshLatestSummaryAck's
fetchSnapshotFromStorage always retrieving the latest summary by
NicholasCouri · Pull Request microsoft#12313 · microsoft/FluidFramework
(github.com)](microsoft#12313).

This means that the snapshot that is downloaded may have a reference
sequence number (fetchedSummaryRefSeq) that is greater than the ack's
(summaryRefSeq). However, the latest summary reference sequence number
of summarizer nodes and GC is updated to summaryRefSeq which can lead to
inconsitent state and future summaries may be incorrect.

After the latest snapshot is downloaded, use the reference sequence
number of this snapshot to update the state of the summarizer nodes and
GC.

Made a couple of improvements to the refresh latest summary telemetry:
- Renamed the telemetry to distinguish between refreshLatestSummaryAck
and refreshLatestSummaryAckFromServer. This will help in analyzing logs.
- Added the reference sequence number and version of the downloaded
summary which was missing when snapshot was downloaded for
refreshLatestSummaryAckFromServer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants