-
Notifications
You must be signed in to change notification settings - Fork 531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal - refreshLatestSummaryAck's fetchSnapshotFromStorage always retrieving the latest summary #12313
Proposal - refreshLatestSummaryAck's fetchSnapshotFromStorage always retrieving the latest summary #12313
Conversation
# Conflicts: # packages/runtime/container-runtime/src/containerRuntime.ts
⯅ @fluid-example/bundle-size-tests: +672 Bytes
Baseline commit: 4fadd10 |
I do not think your code changes that, but do we test and does it work correctly IF summarizer has pending summary, but we get an ack about some other summary first, followed by nack for our summary? Or maybe even no nack, if PUSH crashed somewhere in between? |
Can you please share a bit more info when we retrieve summary by handle? It feels like with summarizer always starting to wait for summary ops/acks after fully being connected, it should happen ONLY if we have 2 summarizers in the session. While we know this is what actually happens (effectively) with FRS service summaries, where can it happen for SPO? It clearly happens as I hear feedback about it :) I want to make sure we walk that scenarios explicitly - mentally through the code, but also replicate in end-to-end tests, and ensure that it continues to work correctly |
I have added telemetry fluid:telemetry:SummarizerNode:PendingSummaryNotFound to find out when/if it happens and it only happened 6 times (FRS) out of 15k hits in the test db. I'll investigate those. The majority of the time it is a summary ack and we have no pending summaries. Here is a query I was looking at: union database("Office Fluid Test").office_fluid_ff* In reply to: 1270858639 |
I suppose we do try to get summaries by handle whenever we get the latest snapshot from cache and while trying to catch up identify the summary acks among the ops and handlesummaryAcks is called on to refresh the state. The query I shared previously shows how often it happens (pendingSize == 0) on our Office Fluid Test db. In reply to: 1270860197 |
I'd love to confirm that. We should not used cached snapshots for summarizer today, but there still might be a race condition while loading those results in that flow.
|
packages/test/test-end-to-end-tests/src/test/SummarizeFromLatest.spec.ts
Outdated
Show resolved
Hide resolved
I've added some information to Task 444 - it seems the first time the summarizer is launched, after catching up, the handleSummaryAcks does receive the last ack from the summaryCollection before its processing has started. Its local reference number is still 0 and we query the snapshot. |
@@ -2379,17 +2380,8 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents> | |||
|
|||
if (latestSnapshotRefSeq > this.deltaManager.lastSequenceNumber) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Remove this check since its done in waitForDeltaManagerToCatchup
#Resolved
@@ -2850,24 +2842,61 @@ export class ContainerRuntime extends TypedEventEmitter<IContainerRuntimeEvents> | |||
} | |||
} | |||
|
|||
private async waitForDeltaManagerToCatchup(latestSnapshotRefSeq: number, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: formatting of params should be consistent. #Resolved
summaryLogger: ITelemetryLogger, | ||
): Promise<void> { | ||
if (latestSnapshotRefSeq > this.deltaManager.lastSequenceNumber) { | ||
// We need to catch up to the latest summary's reference sequence number before pausing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this comment - it's not before pausing anymore since this is a generic function. #Resolved
/** Implementation of ISummarizerInternalsProvider.refreshLatestSummaryAck */ | ||
public async refreshLatestSummaryAck(options: IRefreshSummaryAckOptions) { | ||
const { proposalHandle, ackHandle, summaryRefSeq, summaryLogger } = options; | ||
const readAndParseBlob = async <T>(id: string) => readAndParse<T>(this.storage, id); | ||
// The call to fetch the snapshot is very expensive and not always needed. | ||
// It should only be done by the summarizerNode, if required. | ||
const snapshotTreeFetcher = async () => { | ||
// When fetching from storage we will always get the latest version and do not use the ackHandle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Move this comment outside this function with the outer comment. #Resolved
|
||
const latestSnapshotRefSeq = await seqFromTree(fetchResult.snapshotTree, readAndParseBlob); | ||
assert(latestSnapshotRefSeq >= summaryRefSeq, | ||
"Latest snapshot reference number greater than summaryRefSeq"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Latest snapshot reference number greater than summaryRefSeq"); | |
"Latest snapshot reference number should be greater than summaryRefSeq"); | |
``` #Resolved |
{ | ||
eventName: "LatestSummaryRetrieved", | ||
lastSequenceNumber: latestSnapshotRefSeq, | ||
targetSequenceNumber: summaryRefSeq, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also log the ackHandle so that it helps correlating with which ack this is done for? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good.
return container; | ||
}; | ||
|
||
it("Summarizer does not fetch during first summary", async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What scenario is this testing? Summaries do not result in fetch right? #Resolved
|
||
const summarizer2 = await createSummarizer(provider, mainContainer); | ||
versionWrap = await incrementCellValueAndRunSummary(summarizer2, 4 /* expectedMatrixCellValue */); | ||
assert(versionWrap.fetchCount === 0, "There should be one fetch after second summary"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message says there should be one fetch but the assert is that fetchCount should be 0? #Resolved
|
||
const containerRuntime = (summarizer as any).runtime as ContainerRuntime; | ||
let getVersionsFunc = containerRuntime.storage.getVersions; | ||
const funcSummary1 = async (versionId: string | null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Rename this to getVersionsOverride so that its clear what it is doing. #Resolved
const value = getAndIncrementCellValue(mainDataStore.matrix, 0, 0, "1"); | ||
assert(value === 3, "Value matches expected"); | ||
|
||
const summarizer2 = await createSummarizer(provider, mainContainer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no guarantee that this will load from latest summary. It can also load from the first summary created when the detached container attaches. #Resolved
secondSummarizer.close(); | ||
|
||
// Load summarizer from previous version triggers fetch. | ||
const newContainer = await loadContainer(summaryVersion); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why create a container as well as summarizer? Why not just the summarizer? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No particular reason - trying to make it a little different.
assert(summaryVersion, "Summary version should be defined"); | ||
summarizerClient.close(); | ||
|
||
const secondSummarizer = await createSummarizer(provider, mainContainer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Add a comment that explains why couple more summaries are done, i.e., so that there are more recent summaries. #Resolved
}; | ||
export const TestDataObjectType1 = "@fluid-example/test-dataStore1"; | ||
|
||
class TestDataObject1 extends DataObject { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest using the default data object provided by the TestObjectProvider. It has a root map that you can use to send ops. That will make the test simpler and also you don't have to worry about back-compat as it will be tested automatically. Here, the data store runtime will always be of the current version (N) and you lose compat testing. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AS I plan to add additional tests, it will give me more flexibility to do that.
|
||
// Wait for the summarizer to run again. | ||
const summaryVersion3 = await waitForSummary(summarizerClient2); | ||
assert(summaryVersion3, "Summary version should be defined"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to validate that the fetch was for the latest snapshot? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find an easy way - will proceed with this check in so unblock Jatin and take a look at it later on.
return container; | ||
}; | ||
|
||
it("Loading summarizer from old snapshot", async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not clear what this test is doing and its hard to follow. Can you please update the test name to reflect it? And also add either a description here or add comments in the test that explain what is happening. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at this test and I realized I do not need it anymore - the FetchValidation does what I need.
@@ -0,0 +1,224 @@ | |||
/*! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is a separate file needed for this single test? Can't it be part of the above file? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the Latest as it does not add much value anymore.
/** Implementation of ISummarizerInternalsProvider.refreshLatestSummaryAck */ | ||
public async refreshLatestSummaryAck(options: IRefreshSummaryAckOptions) { | ||
const { proposalHandle, ackHandle, summaryRefSeq, summaryLogger } = options; | ||
const readAndParseBlob = async <T>(id: string) => readAndParse<T>(this.storage, id); | ||
// The call to fetch the snapshot is very expensive and not always needed. | ||
// It should only be done by the summarizerNode, if required. | ||
// When fetching from storage we will always get the latest version and do not use the ackHandle. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should handle scenarios where this can be called when a refresh is already in progress. We should not rely on the caller doing the right thing.
At least, add asserts or telemetry when this happens so we know someone screwed up. The telemetry will tell us if our assumption is wrong. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have a lock in place for this not to happen ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah but the lock is in summarizer classes. What if there is a bug and it ends up calling it multiple times. Adding an assert or log will help identify these scenarios rather than trying to figure it from other logs which has turned out to be quite cumbersome.
@@ -3091,11 +3117,16 @@ const waitForSeq = async ( | |||
// TODO: remove cast to any when actual event is determined | |||
deltaManager.on("closed" as any, reject); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kian-thompson - FYI, as you work on dispose/close on Container, we should have guidance for developers how to change their code, and ideally scan repo and fix our usage, if usage needs to be changed. I think there is no changes, but let's double check :)
…reference sequence number (#14052) ## Bug When refreshLatestSummaryAck is called, if pending state corresponding to the ack is not found, the latest summary is downloaded. This change was made fairly recently - [Proposal - refreshLatestSummaryAck's fetchSnapshotFromStorage always retrieving the latest summary by NicholasCouri · Pull Request #12313 · microsoft/FluidFramework (github.com)](#12313). This means that the snapshot that is downloaded may have a reference sequence number (fetchedSummaryRefSeq) that is greater than the ack's (summaryRefSeq). However, the latest summary reference sequence number of summarizer nodes and GC is updated to summaryRefSeq which can lead to inconsitent state and future summaries may be incorrect. ## Fix After the latest snapshot is downloaded, use the reference sequence number of this snapshot to update the state of the summarizer nodes and GC. ## Telemetry Improvements Made a couple of improvements to the refresh latest summary telemetry: - Renamed the telemetry to distinguish between refreshLatestSummaryAck and refreshLatestSummaryAckFromServer. This will help in analyzing logs. - Added the reference sequence number and version of the downloaded summary which was missing when snapshot was downloaded for refreshLatestSummaryAckFromServer.
…changes) Fixed bug - RefreshLatestSummaryAck can end up updating wrong latest reference sequence number (microsoft#14052) When refreshLatestSummaryAck is called, if pending state corresponding to the ack is not found, the latest summary is downloaded. This change was made fairly recently - [Proposal - refreshLatestSummaryAck's fetchSnapshotFromStorage always retrieving the latest summary by NicholasCouri · Pull Request microsoft#12313 · microsoft/FluidFramework (github.com)](microsoft#12313). This means that the snapshot that is downloaded may have a reference sequence number (fetchedSummaryRefSeq) that is greater than the ack's (summaryRefSeq). However, the latest summary reference sequence number of summarizer nodes and GC is updated to summaryRefSeq which can lead to inconsitent state and future summaries may be incorrect. After the latest snapshot is downloaded, use the reference sequence number of this snapshot to update the state of the summarizer nodes and GC. Made a couple of improvements to the refresh latest summary telemetry: - Renamed the telemetry to distinguish between refreshLatestSummaryAck and refreshLatestSummaryAckFromServer. This will help in analyzing logs. - Added the reference sequence number and version of the downloaded summary which was missing when snapshot was downloaded for refreshLatestSummaryAckFromServer.
Description
The refreshLatestSummaryAck currently retrieves a version of a summary based on the summaryAckHandle received. The proposal is to ignore that handle and always retrieve the latest version (explicitly ignoring the Ack Handle).
In case the delta manager is behind the latest summary's ref seq number we would wait for it to catch up and proceed with the same existing logic.
As noted during the meeting, I'm still in the process of validating the change and the end-to-end test was an initial attempt to validate it but I'd love to get feedback in case there are scenarios that I might be breaking or something I'm missing