@DeagleGross DeagleGross commented Aug 12, 2025

Add configurable retry on Cosmos container creation.

Also includes #425.

Fixes #307
Fixes #305

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • I didn't break anyone 😄

@DeagleGross DeagleGross self-assigned this Aug 12, 2025
@github-actions github-actions bot changed the title from "chore: support retries on Cosmos storage creation" to ".NET: chore: support retries on Cosmos storage creation" Aug 12, 2025
@DeagleGross DeagleGross marked this pull request as ready for review August 15, 2025 11:09
Copilot AI review requested due to automatic review settings August 15, 2025 11:09

Copilot AI left a comment

Pull Request Overview

This PR adds configurable retry functionality to Cosmos DB container creation operations. The implementation addresses transient failures that can occur during container initialization by introducing exponential backoff retry logic with customizable parameters.

Key changes:

  • Introduces a new options class for configuring retry behavior with exponential backoff
  • Modifies the LazyCosmosContainer to support retry logic during container initialization
  • Adds comprehensive test coverage for the new retry functionality
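As a rough illustration of the first two bullets, the sketch below shows how an options class might turn a max-attempt count, an initial delay, a delay cap, and a backoff multiplier into per-attempt delays. The type name `RetryOptionsSketch`, its property names, and its default values are assumptions made for this example; they are not taken from `CosmosActorStateStorageOptions.cs`.

```csharp
using System;

// Hypothetical options type for illustration only; names and defaults are assumptions,
// not the PR's actual API.
public sealed class RetryOptionsSketch
{
    public int MaxAttempts { get; init; } = 5;
    public TimeSpan InitialDelay { get; init; } = TimeSpan.FromSeconds(1);
    public TimeSpan MaxDelay { get; init; } = TimeSpan.FromSeconds(30);
    public double BackoffMultiplier { get; init; } = 2.0;

    // Delay to wait after the given failed attempt (1-based).
    // Grows as InitialDelay * BackoffMultiplier^(attempt - 1), capped at MaxDelay.
    public TimeSpan DelayForAttempt(int attempt)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        double factor = Math.Pow(BackoffMultiplier, attempt - 1);
        double ms = Math.Min(InitialDelay.TotalMilliseconds * factor, MaxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }
}
```

With these assumed defaults, the delays after attempts 1, 2, and 3 are 1, 2, and 4 seconds, and later attempts are held at the 30-second cap.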

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| CosmosActorStateStorageOptions.cs | Defines configuration options for retry behavior including max attempts, delays, and backoff multiplier |
| LazyCosmosContainer.cs | Implements retry logic with exponential backoff for container initialization operations |
| ServiceCollectionExtensions.cs | Updates dependency injection to pass retry options to LazyCosmosContainer |
| LazyCosmosContainerTests.cs | Adds integration test to verify retry configuration is properly applied |
| Microsoft.Extensions.AI.Agents.Runtime.Storage.CosmosDB.csproj | Adds Microsoft.Extensions.Options package reference |
| Directory.Packages.props | Defines version for Microsoft.Extensions.Options package |

@adityamandaleeka

This is an improvement, but it seems like it can still leave the Lazy in a permanently busted state if all of the bounded retries fail.

One way to avoid that is to move to an IAsyncDisposable pattern with an internal retry loop and a CTS. That way the initialization will keep retrying until it succeeds or is canceled.

Example:

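using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

internal sealed class LazyCosmosContainer : IAsyncDisposable
{
    private readonly CancellationTokenSource _cts = new();
    private Task<Container>? _initTask;

    // Starts initialization on the first call and hands the same task to later callers.
    // Note: ??= is not atomic, so concurrent first calls could start more than one loop.
    public Task<Container> GetContainerAsync()
        => _initTask ??= InitializeWithRetryAsync(_cts.Token);

    // Retries until initialization succeeds or the token is canceled.
    // InitializeContainerAsync and IsTransient are not shown in this example.
    private async Task<Container> InitializeWithRetryAsync(CancellationToken ct)
    {
        var delay = TimeSpan.FromSeconds(1);
        while (true)
        {
            ct.ThrowIfCancellationRequested();
            try { return await InitializeContainerAsync(); }
            catch (CosmosException ex) when (IsTransient(ex))
            {
                // Back off, doubling the delay each time up to a 30-second cap.
                await Task.Delay(delay, ct);
                delay = TimeSpan.FromSeconds(Math.Min(delay.TotalSeconds * 2, 30));
            }
        }
    }

    // Canceling the token stops a retry loop that is still running.
    public ValueTask DisposeAsync()
    {
        _cts.Cancel();
        _cts.Dispose();
        return default;
    }
}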

@adityamandaleeka

BTW it might be good to add some jitter in the backoff too, so that if multiple instances start at the same time they don't all hammer Cosmos in sync.
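One common way to do this is "full jitter": pick a random delay between zero and the capped exponential backoff value, so instances that start together spread their retries out. The sketch below is an assumption about how that could look; `JitteredBackoffSketch` and `NextDelay` are hypothetical names and are not part of the PR.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the PR.
public static class JitteredBackoffSketch
{
    // Returns a random delay in [0, cap), where
    // cap = min(maxDelay, baseDelay * 2^(attempt - 1)) and attempt is 1-based.
    public static TimeSpan NextDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay, Random rng)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        double cap = Math.Min(
            maxDelay.TotalMilliseconds,
            baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));
        // Random.NextDouble() returns a value in [0.0, 1.0).
        return TimeSpan.FromMilliseconds(rng.NextDouble() * cap);
    }
}
```

The `Random` instance is passed in so tests can use a fixed seed. `Random` is not thread-safe, so code that calls this from several threads would need one instance per caller or, on .NET 6 and later, `Random.Shared`.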

@adityamandaleeka adityamandaleeka left a comment

Looks good. We'll need to keep an eye on the CI for a while to make sure it's stable.

@DeagleGross DeagleGross added this pull request to the merge queue Aug 21, 2025
Merged via the queue into main with commit 25291de Aug 21, 2025
15 checks passed
@DeagleGross DeagleGross deleted the dmkorolev/cosmos-retries branch August 21, 2025 18:40
ReubenBond pushed a commit to ReubenBond/agent-framework that referenced this pull request Oct 28, 2025
* support retries

* tests + registration options

* fix ordering ..

* HK + update packages

* fix paths

* Update dotnet/tests/CosmosDB.IntegrationTests/Microsoft.Extensions.AI.Agents.Runtime.Storage.CosmosDB.Tests/CosmosTestFixture.cs

* re create project and fix some pk usage

* fix all tests

* try workflow?

* wip 1

* fix definition

* try with cosmos_use_emulator env?

* try ignore SSL errors?

* other cert verifications

* hardcode to 8081?

* proper valuation of ENV

* logging

* ensure db exists for CI

* bump

* cleanup

* fix usage

* nit comment

* try only release for stability?

* try skip some flaky tests

* merge fixes + rollback container

* reimplement with iasyncdisposable pattern

* remove example doc struct

Successfully merging this pull request may close these issues.

  • Add retries with backoff to Cosmos container initialization
  • .NET: Implement hierarchical partition keys for Cosmos storage impl

5 participants