
Add configuration for consumer handling of retry on container creation#26698

Merged
kian-thompson merged 5 commits into microsoft:main from kian-thompson:57593-create-retry-count
Mar 23, 2026

Conversation


@kian-thompson kian-thompson commented Mar 11, 2026

Currently, we continue to retry retriable errors indefinitely on container creation. However, for certain consumers that have a limited time window (30/60 seconds) for the create, we should not retry indefinitely even on retriable errors.

If the consumer of attach wants to handle the retry mechanism themselves, they can enable the "Fluid.Container.DisableCloseOnAttachFailure" flag. When this flag is enabled, no internal retries occur and any error surfaces to the consumer of attach.

AB#57593
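As a sketch of what consumer-owned retry could look like once the flag surfaces errors from attach (the helper name, deadline handling, and backoff policy here are illustrative, not part of this PR):

```typescript
// Hypothetical consumer-side retry loop for use when
// "Fluid.Container.DisableCloseOnAttachFailure" is enabled, so attach()
// surfaces retriable errors instead of retrying internally.
async function attachWithDeadline(
  attach: () => Promise<void>,
  deadlineMs: number,
): Promise<void> {
  const start = Date.now();
  let delayMs = 500;
  for (;;) {
    try {
      return await attach();
    } catch (error) {
      // Give up once the next wait would exceed the time budget.
      if (Date.now() - start + delayMs > deadlineMs) {
        throw error;
      }
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      delayMs = Math.min(delayMs * 2, 5000); // capped exponential backoff
    }
  }
}
```

A consumer with a 30/60 second budget can pass that budget as the deadline and stop retrying once it is exhausted.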

@kian-thompson kian-thompson requested a review from a team as a code owner March 11, 2026 01:10
Copilot AI review requested due to automatic review settings March 11, 2026 01:10

Copilot AI left a comment


Pull request overview

Adds a configurable cap on retries for retriable errors during container creation/attach flows, addressing scenarios where hosts have a bounded time window for creation.

Changes:

  • Extend runWithRetry to support an optional maxRetries limit and throw a non-retriable wrapped error when exceeded.
  • Plumb a new maxCreateRetries option through Container.attach() (with config override via Fluid.Container.CreateMaxRetries) into the attach-time runWithRetry call.
  • Add unit tests for the new retry-limit behavior and update the legacy beta API report.
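As a rough illustration of the bounded-retry behavior described above (simplified; the real runWithRetry also handles retry-after delays, cancellation, and telemetry, and the names here are not the actual signatures):

```typescript
// Minimal sketch of a retry helper with an optional cap, in the spirit of
// the PR's runWithRetry change. `maxRetries === undefined` keeps the
// previous unlimited-retry behavior.
async function runWithRetrySketch<T>(
  api: () => Promise<T>,
  maxRetries?: number,
): Promise<T> {
  let numRetries = 0;
  for (;;) {
    try {
      return await api();
    } catch (error) {
      numRetries++;
      // Once the cap is exceeded, surface the failure to the caller
      // instead of retrying again.
      if (maxRetries !== undefined && numRetries > maxRetries) {
        throw error;
      }
    }
  }
}
```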

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

File summaries:

  • packages/loader/driver-utils/src/test/runWithRetry.spec.ts: Adds coverage for bounded retries, success-before-limit, unlimited default, and 0-retries behavior.
  • packages/loader/driver-utils/src/runWithRetry.ts: Implements maxRetries support and telemetry when the retry cap is exceeded.
  • packages/loader/container-loader/src/container.ts: Adds the maxCreateRetries attach option and wires the config/attach override into attach-time retries.
  • packages/common/container-definitions/src/loader.ts: Updates IContainer.attach() typing/docs to expose maxCreateRetries.
  • packages/common/container-definitions/api-report/container-definitions.legacy.beta.api.md: Updates the generated legacy beta API report to reflect the new attach option.

Comment on lines +133 to +135
// Check if max retries limit has been reached
if (progress.maxRetries !== undefined && numRetries > progress.maxRetries) {
logger.sendTelemetryEvent(

Copilot AI Mar 11, 2026


progress.maxRetries is used without validation. If a caller passes NaN, Infinity, a negative value, or a non-integer, the comparison numRetries > progress.maxRetries can behave unexpectedly (e.g., NaN makes the limit never trigger, reintroducing infinite retries). Consider normalizing the option (e.g., require a finite, non-negative integer; otherwise treat as undefined or throw a non-retriable UsageError).
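A minimal sketch of the normalization this comment suggests (the function name is illustrative, not the actual Fluid code), treating anything that is not a finite non-negative integer as "no limit":

```typescript
// Normalize a user-supplied retry cap: NaN, Infinity, negative values,
// and non-integers are all coerced to undefined (unlimited retries),
// so the `numRetries > maxRetries` comparison behaves predictably.
function normalizeMaxRetries(maxRetries: unknown): number | undefined {
  if (
    typeof maxRetries === "number" &&
    Number.isInteger(maxRetries) && // rejects NaN, Infinity, fractions
    maxRetries >= 0
  ) {
    return maxRetries;
  }
  return undefined;
}
```

Alternatively, as the comment notes, an invalid value could instead throw a non-retriable UsageError.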


anthony-murphy commented Mar 11, 2026

we built the ability to allow customers to retry attach if it should fail. did we consider moving retry for attach/create failures to the consumer when retriable attach is enabled ("Fluid.Container.RetryOnAttachFailure"), with the eventual goal of enabling it by default, as the current behavior of just retrying forever isn't great for customers? Basically, we can just set retries to 0 when Fluid.Container.RetryOnAttachFailure is on, and then the consumer owns retry and whatever policy they want.

what i like about this approach is it lets us move away from exposing complex policy in our apis, and lets consumers have control.

@kian-thompson kian-thompson changed the title Add configuration for retry on container creation Add configuration for consumer handling of retry on container creation Mar 19, 2026

@shlevari shlevari left a comment


Looks good in general, just some light comments about the naming and exposure surface for controlling the retries.

// When RetryOnAttachFailure is enabled, use no internal retries
// The consumer will own the retry policy
const retryOnAttachFailure =
this.mc.config.getBoolean("Fluid.Container.RetryOnAttachFailure") === true;


Is this flag name already locked in? To me its naming is unintuitive -- it sounds like it's telling the Fluid Container to do retries, but instead it's saying the caller will do the retries.


Also, does it make sense to give the caller control over how many retries to do? If they want to disable it, they can set it to zero; that way they have the opportunity to optimize for their specific startup-time requirements.


+1, MaxRetriesOnAttachFailure seems nice, otherwise DisableRetryOnAttachFailure maybe


@kian-thompson kian-thompson Mar 19, 2026


@anthony-murphy Thoughts on changing the name to something like Fluid.Container.DisableRetryOnAttachFailure?

> Also, does it make sense to give the caller control over how many retries to do?

The important thing here is that the errors are bubbled up to the consumer of attach. I don't have any strong preference on how consumers can implement a retry mechanism, but retries are not necessary for this ask.


the flag primarily allows attach to be retried by the consumer, as without it the container will close when attach fails


@anthony-murphy anthony-murphy Mar 23, 2026


the goal is to get out of the container-level retry game here, and let the consumer decide what they want to do: retry the call, serialize the container and save it somewhere, or dispose the container and throw it away


I changed the flag name to DisableCloseOnAttachFailure to be extra clear about what it does


@kian-thompson kian-thompson merged commit f84aa34 into microsoft:main Mar 23, 2026
32 checks passed
@kian-thompson kian-thompson deleted the 57593-create-retry-count branch March 23, 2026 19:51
