Enable programmatic DCP shut down by karolz-ms · Pull Request #7098 · microsoft/aspire

karolz-ms · 2025-01-14T01:37:41Z

Description

This change takes advantage of the just-introduced DCP capability for API-triggered shutdown. This should significantly reduce resource consumption when running tests. Also removes some dead code and increases timeouts for tests using the app orchestrator which were unrealistically short (the latter applies to development test run only).

Note: if the programmatic shutdown fails for some reason (unlikely), the existing mechanism of DCP watching the AppHost process will still kick in. Exactly the same resource cleanup code is executed in either case.

Fixes # (issue)
#6678
#6561

Checklist

Is this feature complete?
- Yes. Ready to ship.
- No. Follow-up changes expected.
Are you including unit tests for the changes and scenario tests if relevant?
- Yes
- No
Did you add public API?
- Yes
  - If yes, did you have an API Review for it?
    - Yes
    - No
  - Did you add <remarks /> and <code /> elements on your triple slash comments?
    - Yes
    - No
- No
Does the change make any security assumptions or guarantees?
- Yes
  - If yes, have you done a threat model and had a security review?
    - Yes
    - No
- No
Does the change require an update in our Aspire docs?
- Yes
  - Is this introducing a breaking change?
    - Yes
      - Link to aspire-docs issue (please use this breaking-change template):
    - No
      - Link to aspire-docs issue (please use this doc-idea template):
- No

karolz-ms · 2025-01-14T01:39:03Z

        return suffix;
    }

-    public async Task DeleteResourcesAsync(CancellationToken cancellationToken = default)


This is dead code, remnant from early Aspire lifetime. Never worked properly.

JamesNK · 2025-01-14T02:21:40Z

        {
            await app.StartAsync();
-        }).DefaultTimeout();
+        }).DefaultTimeout(TestConstants.DefaultOrchestratorTestTimeout);


Are extra constants needed? LongTimeoutDuration is 20 seconds locally and 120 seconds in CI. That seams like it is enough. What takes over 20 seconds locally? We could make the long duration longer if needed.

I don't know what is the right thing for Default, Long (and a new bigger value if required?) timeout consts vs having values explicitly for one product area. I worry that someone will go though and place orchestrator timeouts everywhere just because they're writing code in that area vs using the defaults which suit them.

I want a situation where DefaultTimeout() works for 99% of code, and the other 1% can be handled by specifying a long timeout. Keep things simple.

the right timeout when debugging is none 😄

https://github.com/dotnet/aspire/blob/0f8b4c1ad94d1a1973269abf5a9bc4db34c73f6e/tests/Shared/AsyncTestHelpers.cs#L98-L103

😄

Lets jut increase the default to something high

The default timeout being high on the server is fine, and the long timeout being high is fine locally and on the server, but when running tests you don't want to wait 60 seconds because something will never happen. It's a productivity sink.

Default timeout should fail fast in 5 seconds to not waste the developers time.

No it should waste developers time. It should “never” happened locally and when it does there are lots tools you can use (like a debugger)

if you don’t make it high then we won’t be using it in certain classes of tests and it’ll defeat the entire purpose

The primary goal of this change is to eliminate false negatives when I am verifying a change by running multiple tests from the IDE, via TestExplorer. With the current timeout values, on my daily go-to dev machine (64 GB RAM, SSD drive, not too old) running DistributedApplicationTests results in a handful of tests timing out pretty much on every run. I am wasting a lot of time re-running these failed tests.

Actual test debugging is not an issue with longer timeouts in my experience. When I need a tight inner loop for debugging a particular test, I can just lower/extend individual timeouts as necessary; that scenario happens much less often than "run all test to verify nothing is broken", so I want to optimize for the latter.

There are many reasons why 5 seconds for "short" and 20 seconds for "long" may not be enough for tests that involve DCP. For example, during startup DCP is invoked not 1, but 3 times and does probing for container runtime. If container runtime is "sleeping" that probe may take a while. Anything involving a container is subject to image download/image build delays, and involves retries that DCP does to work-around container orchestrator flakiness. Tests running in parallel affecting each other. Antivirus having to scan new build of Aspire/DCP. Etc. etc.

No it should waste developers time. It should “never” happened locally and when it does there are lots tools you can use (like a debugger)

if you don’t make it high then we won’t be using it in certain classes of tests and it’ll defeat the entire purpose

Yeah I know that timeouts should never happen when tests are in a good state, but deving code can break things.

When you run tests locally to double check things are good, you want broken tests to fail in 5 seconds with a timeout rather than a minute.

davidfowl · 2025-01-14T07:23:29Z

@karolz-ms can you remove these in this PR:

https://github.com/dotnet/aspire/blob/66465d1e19f20616be209d4d77a019aa4b68ea75/tests/helix/send-to-helix-inner.proj#L125-L139

@joperezr can you review those changes please? I want to make sure we can run without cleaning up to see if things are working well with the latest DCP changes plus these new app model changes.

@davidfowl

(suggestion from @davidfowl)

davidfowl

Lets see if this destabilizes the CI 😄

cc @eerhardt

eerhardt · 2025-01-15T15:32:51Z

-    <HelixPreCommand Include="$(_DeleteDockerVolumesCommand)" />
-    <HelixPreCommand Include="docker network prune -f" />
-
-    <HelixPostCommand Include="docker container ls --all" />


This is useful in situations where the container crashes, to see what is happening on the machine. It is also nice to guarantee somewhat stable machines.

Do we have to delete this?

cc @radical

While that is true, it's nicer if we can actually find bugs instead of having the CI cover up dcp problems 😄

Can we at least add back the diagnostics lines? If you don't want the prune/clean calls for now, that's fine.

Yep that's reasonable

@karolz-ms can you add the diagnostic ones back?

Working on it

davidfowl · 2025-01-17T09:43:48Z

Since this change went in I'm seeing containers stuck in the starting state and a large set of new tests failing as a result. Lets add the diagnostics back so we can see if things are not being cleaned up properly in some cases. I wonder if some tests are not calling StopAsync properly and things still aren't being cleaned up on a per test basis.

joperezr · 2025-01-18T00:02:21Z

@karolz-ms can you remove these in this PR:

https://github.com/dotnet/aspire/blob/66465d1e19f20616be209d4d77a019aa4b68ea75/tests/helix/send-to-helix-inner.proj#L125-L139

@joperezr can you review those changes please? I want to make sure we can run without cleaning up to see if things are working well with the latest DCP changes plus these new app model changes.

Sorry @davidfowl just saw this. Changes look good other than @eerhardt comments about it being nice to have the diagnostic lines.

davidfowl · 2025-01-18T04:42:47Z

We added them back, well @karolz-ms did 😄

davidfowl · 2025-01-18T06:20:12Z

@karolz-ms Can confirm now that with the diagnostic messages added back, we didn't fully fix the issue with dcp. Containers now don't start on the CI. I'll submit a change that does cleanup on start, that way we can track if there are leaking networks still but tests won't fail. We might need more logs here to understand why this is still happening.

cc @dbreshears

karolz-ms added 3 commits January 13, 2025 15:15

Stop app orchestrator when ApplicationExecutor is stopped

9a41ea7

Update minimum DCP version

36dc3bd

Increase timeouts for tests using app orchestrator (debug run)

850de08

karolz-ms requested review from JamesNK, danegsta and davidfowl January 14, 2025 01:37

karolz-ms requested review from eerhardt, mitchdenny and radical as code owners January 14, 2025 01:37

karolz-ms commented Jan 14, 2025

View reviewed changes

davidfowl reviewed Jan 14, 2025

View reviewed changes

Comment thread src/Aspire.Hosting/Dcp/ApplicationExecutor.cs

JamesNK reviewed Jan 14, 2025

View reviewed changes

Comment thread tests/Aspire.Hosting.Tests/Dcp/TestKubernetesService.cs Outdated

JamesNK reviewed Jan 14, 2025

View reviewed changes

Comment thread src/Aspire.Hosting/Dcp/DcpKubernetesClient.cs Outdated

JamesNK reviewed Jan 14, 2025

View reviewed changes

Comment thread src/Aspire.Hosting/Dcp/DcpKubernetesClient.cs Outdated

davidfowl reviewed Jan 14, 2025

View reviewed changes

Comment thread src/Aspire.Hosting/OtlpConfigurationExtensions.cs

karolz-ms added 2 commits January 14, 2025 10:49

Improve code formatting

18f2fe6

Remove container cleanup for Helix pipelines

150ed79

(suggestion from @davidfowl)

davidfowl approved these changes Jan 14, 2025

View reviewed changes

karolz-ms mentioned this pull request Jan 14, 2025

Can't WaitForResourceAsync KnownResourceStates.Finished #4878

Closed

1 task

karolz-ms merged commit b02fa42 into main Jan 14, 2025

karolz-ms deleted the dev/karolz/dcp-test-use branch January 14, 2025 22:33

This was referenced Jan 15, 2025

Tests keep DCP instances running until the whole test run ends #6561

Closed

Improve test stability issues caused by resources failing to start due to "address in use" errors #6678

Closed

eerhardt reviewed Jan 15, 2025

View reviewed changes

This was referenced Jan 18, 2025

Clean up networks, volumes and containers on start of the test. #7145

Merged

Clean up processes on start, but wait on shutdown #7185

Merged

ReubenBond mentioned this pull request Feb 2, 2025

Remove non-existent property from DistributedApplicationFactory #7377

Merged

18 tasks

github-actions Bot locked and limited conversation to collaborators Feb 17, 2025

github-actions Bot added the area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication label Mar 10, 2025

Conversation

karolz-ms commented Jan 14, 2025

Description

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

JamesNK Jan 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidfowl Jan 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

davidfowl commented Jan 14, 2025

Uh oh!

davidfowl left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidfowl commented Jan 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

joperezr commented Jan 18, 2025

Uh oh!

davidfowl commented Jan 18, 2025

Uh oh!

davidfowl commented Jan 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

JamesNK Jan 14, 2025 •

edited

Loading

davidfowl Jan 14, 2025 •

edited

Loading

davidfowl commented Jan 17, 2025 •

edited

Loading

davidfowl commented Jan 18, 2025 •

edited

Loading