Skip to content

GC: Persist SweepTimeout to summary and support using a shorter duration for testing#12331

Closed
markfields wants to merge 5 commits into
microsoft:mainfrom
markfields:gc-sweep-timeout-testoverrides
Closed

GC: Persist SweepTimeout to summary and support using a shorter duration for testing#12331
markfields wants to merge 5 commits into
microsoft:mainfrom
markfields:gc-sweep-timeout-testoverrides

Conversation

@markfields
Copy link
Copy Markdown
Member

@markfields markfields commented Oct 7, 2022

Important update

This PR seems to have a fatal problem: The initial summary (created while detached) cannot include the driver's snapshot cache policy. So we have to persist some placeholder value and depend on a later summary to fill it in. But this means that there is no authoritative value, as different sessions could have different policies in effect.

Discussing this with Navin, we'll see what the outcome is. One possibility is to close this in favor of #12419

Description

This does a few key things:

  • Plumbs the Snapshot Cache Expiry value in from the driver
  • Persists that value, along with the value used for the buffer, so all 3 Sweep Timeout constituents are persisted
  • Ensures Snapshot Cache policy is consistent with other config state

It does not change the logic around overriding Session Expiry, and uses a very simple scheme for overriding buffer, tying it to whether SnapshotCache is 0 or not.

We can iterate on strategy for setting these values for new containers over time - the important part is to get the values persisted and always use those values for existing containers.

See previous PR #12044

Reviewer Guidance

  • Still need to update unit tests
  • Still need to write some e2e tests to try out shrinking the timeout (now we can test sweep in e2e tests!)
  • We may consider porting part of this back to internal.1.4 so that release uses the persisted values if present. I think this is optional though. Or maybe we port the whole change. 🤷‍♂️

Does this introduce a breaking change?

If this introduces a breaking change, please describe the impact and migration path for existing applications below.
See Breaking-vs-Non-breaking-Changes for details.
If there are no breaking changes, delete this section.

Any relevant logs or outputs

  • Use this section to provide either screenshots or output of logs as code snippets

Other information or known dependencies

  • Any other information or known dependencies that is important to this PR.
  • Links to internal bug/work tracker items.
  • TODO that are to be done after this PR.

@markfields markfields requested review from a team as code owners October 7, 2022 14:10
@github-actions github-actions Bot added area: loader Loader related issues area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc labels Oct 7, 2022
@github-actions github-actions Bot added the base: main PRs targeted against main branch label Oct 7, 2022
const pendingRuntimeState = context.pendingLocalState as IPendingRuntimeState | undefined;
const baseSnapshot: ISnapshotTree | undefined = pendingRuntimeState?.baseSnapshot ?? context.baseSnapshot;

//* Test to prove this doesn't get persisted as undefined?
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a todo, might or might not do it, it would be an e2e test. When this is undefined due to storage not existing yet, it shouldn't be persisted because only Summarizer persists these values.

Comment thread packages/runtime/container-runtime/src/garbageCollection.ts Outdated
// If we're using TestOverrides to achieve a short Sweep Timeout, lock the buffer to 1ms. Otherwise 1 day.
this.sweepTimeoutBufferMs =
(this.maxSnapshotCacheDurationMs === 0)
? 1
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this 1?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just felt like it should be non-zero 🤷‍♂️

/** If this is present, the session for this container will expire after this time and the container will close */
readonly sessionExpiryTimeoutMs?: number;
/** Maximum duration a snapshot may be cached and then loaded from for this container */
readonly maxSnapshotCacheDurationMs?: number | "none";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there two different types?

Copy link
Copy Markdown
Member Author

@markfields markfields Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 actually:

  • undefined - old files created when the value was hardcoded to 5 days (or 2 days for even older files)
  • none - files created at a hypothetical future time where the driver didn't implement the policy so GC shouldn't run
  • number - files created by this code where the driver does implement the policy and passed in the policy's value

@markfields markfields requested review from a team as code owners October 12, 2022 13:50
@github-actions github-actions Bot added the area: driver Driver related issues label Oct 12, 2022
@markfields markfields marked this pull request as draft October 12, 2022 13:51
* Validates this scenario: When a GC node (data store or attachment blob) becomes inactive, i.e, it has been
* unreferenced for a certain amount of time, using the node results in an error telemetry.
*/
describeNoCompat("GC inactive nodes tests", (getTestObjectProvider) => {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is WIP - I will not be replacing the existing tests, but rather copying them first or updating them to run in 2 modes.

@markfields
Copy link
Copy Markdown
Member Author

Replaced by #12419

@markfields markfields closed this Oct 14, 2022
markfields added a commit that referenced this pull request Oct 15, 2022
## Background

As-is, the Sweep Timeout will always be at least 6 days due to hardcoded
Snapshot Cache limit (5 days) and Buffer (1 day). This means it's
impossible to test the Sweep Timer and the code that runs after
something is SweepReady!

There have been multiple attempts on my part to address this: #11084,
#12044, #12331

Much of the difficulty has arisen from the challenge of aligning driver
policy with GC config. I'm finally punting on that and tracking it as a
separate item
([AB#2238](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/2238)).

## This change

It's pretty simple:

* Persist SweepTimeoutMs in the summary
* For new containers, that can be overridden via the
`"Fluid.GarbageCollection.TestOverride.SweepTimeoutMs"` config setting
* Default behavior for new containers (and for backfilling existing
containers with no value persisted) is SweepTimeout + 6 days (same as it
has been)
sharptrip pushed a commit to sharptrip/FluidFramework that referenced this pull request Oct 17, 2022
## Background

As-is, the Sweep Timeout will always be at least 6 days due to hardcoded
Snapshot Cache limit (5 days) and Buffer (1 day). This means it's
impossible to test the Sweep Timer and the code that runs after
something is SweepReady!

There have been multiple attempts on my part to address this: microsoft#11084,
microsoft#12044, microsoft#12331

Much of the difficulty has arisen from the challenge of aligning driver
policy with GC config. I'm finally punting on that and tracking it as a
separate item
([AB#2238](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/2238)).

## This change

It's pretty simple:

* Persist SweepTimeoutMs in the summary
* For new containers, that can be overridden via the
`"Fluid.GarbageCollection.TestOverride.SweepTimeoutMs"` config setting
* Default behavior for new containers (and for backfilling existing
containers with no value persisted) is SweepTimeout + 6 days (same as it
has been)
markfields added a commit to markfields/FluidFramework that referenced this pull request Oct 17, 2022
## Background

As-is, the Sweep Timeout will always be at least 6 days due to hardcoded
Snapshot Cache limit (5 days) and Buffer (1 day). This means it's
impossible to test the Sweep Timer and the code that runs after
something is SweepReady!

There have been multiple attempts on my part to address this: microsoft#11084,
microsoft#12044, microsoft#12331

Much of the difficulty has arisen from the challenge of aligning driver
policy with GC config. I'm finally punting on that and tracking it as a
separate item
([AB#2238](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/2238)).

## This change

It's pretty simple:

* Persist SweepTimeoutMs in the summary
* For new containers, that can be overridden via the
`"Fluid.GarbageCollection.TestOverride.SweepTimeoutMs"` config setting
* Default behavior for new containers (and for backfilling existing
containers with no value persisted) is SweepTimeout + 6 days (same as it
has been)
markfields added a commit that referenced this pull request Oct 17, 2022
_back-port of #12419 to internal.2.0 release_

## Background

As-is, the Sweep Timeout will always be at least 6 days due to hardcoded
Snapshot Cache limit (5 days) and Buffer (1 day). This means it's
impossible to test the Sweep Timer and the code that runs after
something is SweepReady!

There have been multiple attempts on my part to address this: #11084,
#12044, #12331

Much of the difficulty has arisen from the challenge of aligning driver
policy with GC config. I'm finally punting on that and tracking it as a
separate item

([AB#2238](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/2238)).

## This change

It's pretty simple:

* Persist SweepTimeoutMs in the summary
* For new containers, that can be overridden via the
`"Fluid.GarbageCollection.TestOverride.SweepTimeoutMs"` config setting
* Default behavior for new containers (and for backfilling existing
containers with no value persisted) is SweepTimeout + 6 days (same as it
has been)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: driver Driver related issues area: loader Loader related issues area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants