Skip to content

Enable programmatic DCP shut down#7098

Merged
karolz-ms merged 5 commits intomainfrom
dev/karolz/dcp-test-use
Jan 14, 2025
Merged

Enable programmatic DCP shut down#7098
karolz-ms merged 5 commits intomainfrom
dev/karolz/dcp-test-use

Conversation

@karolz-ms
Copy link
Copy Markdown
Contributor

Description

This change takes advantage of the just-introduced DCP capability for API-triggered shutdown. This should significantly reduce resource consumption when running tests. Also removes some dead code and increases timeouts for tests using the app orchestrator which were unrealistically short (the latter applies to development test run only).

Note: if the programmatic shutdown fails for some reason (unlikely), the existing mechanism of DCP watching the AppHost process will still kick in. Exactly the same resource cleanup code is executed in either case.

Fixes # (issue)
#6678
#6561

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?

return suffix;
}

public async Task DeleteResourcesAsync(CancellationToken cancellationToken = default)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead code, remnant from early Aspire lifetime. Never worked properly.

Comment thread src/Aspire.Hosting/Dcp/ApplicationExecutor.cs
{
await app.StartAsync();
}).DefaultTimeout();
}).DefaultTimeout(TestConstants.DefaultOrchestratorTestTimeout);
Copy link
Copy Markdown
Member

@JamesNK JamesNK Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are extra constants needed? LongTimeoutDuration is 20 seconds locally and 120 seconds in CI. That seams like it is enough. What takes over 20 seconds locally? We could make the long duration longer if needed.

I don't know what is the right thing for Default, Long (and a new bigger value if required?) timeout consts vs having values explicitly for one product area. I worry that someone will go though and place orchestrator timeouts everywhere just because they're writing code in that area vs using the defaults which suit them.

I want a situation where DefaultTimeout() works for 99% of code, and the other 1% can be handled by specifying a long timeout. Keep things simple.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the right timeout when debugging is none 😄

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets jut increase the default to something high

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default timeout being high on the server is fine, and the long timeout being high is fine locally and on the server, but when running tests you don't want to wait 60 seconds because something will never happen. It's a productivity sink.

Default timeout should fail fast in 5 seconds to not waste the developers time.

Copy link
Copy Markdown
Contributor

@davidfowl davidfowl Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it should waste developers time. It should “never” happened locally and when it does there are lots tools you can use (like a debugger)

if you don’t make it high then we won’t be using it in certain classes of tests and it’ll defeat the entire purpose

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The primary goal of this change is to eliminate false negatives when I am verifying a change by running multiple tests from the IDE, via TestExplorer. With the current timeout values, on my daily go-to dev machine (64 GB RAM, SSD drive, not too old) running DistributedApplicationTests results in a handful of tests timing out pretty much on every run. I am wasting a lot of time re-running these failed tests.

Actual test debugging is not an issue with longer timeouts in my experience. When I need a tight inner loop for debugging a particular test, I can just lower/extend individual timeouts as necessary; that scenario happens much less often than "run all test to verify nothing is broken", so I want to optimize for the latter.

There are many reasons why 5 seconds for "short" and 20 seconds for "long" may not be enough for tests that involve DCP. For example, during startup DCP is invoked not 1, but 3 times and does probing for container runtime. If container runtime is "sleeping" that probe may take a while. Anything involving a container is subject to image download/image build delays, and involves retries that DCP does to work-around container orchestrator flakiness. Tests running in parallel affecting each other. Antivirus having to scan new build of Aspire/DCP. Etc. etc.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it should waste developers time. It should “never” happened locally and when it does there are lots tools you can use (like a debugger)

if you don’t make it high then we won’t be using it in certain classes of tests and it’ll defeat the entire purpose

Yeah I know that timeouts should never happen when tests are in a good state, but deving code can break things.

When you run tests locally to double check things are good, you want broken tests to fail in 5 seconds with a timeout rather than a minute.

Comment thread tests/Aspire.Hosting.Tests/Dcp/TestKubernetesService.cs Outdated
Comment thread src/Aspire.Hosting/Dcp/DcpKubernetesClient.cs Outdated
Comment thread src/Aspire.Hosting/Dcp/DcpKubernetesClient.cs Outdated
Comment thread src/Aspire.Hosting/OtlpConfigurationExtensions.cs
@davidfowl
Copy link
Copy Markdown
Contributor

@karolz-ms can you remove these in this PR:

https://github.com/dotnet/aspire/blob/66465d1e19f20616be209d4d77a019aa4b68ea75/tests/helix/send-to-helix-inner.proj#L125-L139

@joperezr can you review those changes please? I want to make sure we can run without cleaning up to see if things are working well with the latest DCP changes plus these new app model changes.

Copy link
Copy Markdown
Contributor

@davidfowl davidfowl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets see if this destabilizes the CI 😄

cc @eerhardt

<HelixPreCommand Include="$(_DeleteDockerVolumesCommand)" />
<HelixPreCommand Include="docker network prune -f" />

<HelixPostCommand Include="docker container ls --all" />
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is useful in situations where the container crashes, to see what is happening on the machine. It is also nice to guarantee somewhat stable machines.

Do we have to delete this?

cc @radical

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that is true, it's nicer if we can actually find bugs instead of having the CI cover up dcp problems 😄

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we at least add back the diagnostics lines? If you don't want the prune/clean calls for now, that's fine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's reasonable

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karolz-ms can you add the diagnostic ones back?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working on it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidfowl
Copy link
Copy Markdown
Contributor

davidfowl commented Jan 17, 2025

Since this change went in I'm seeing containers stuck in the starting state and a large set of new tests failing as a result. Lets add the diagnostics back so we can see if things are not being cleaned up properly in some cases. I wonder if some tests are not calling StopAsync properly and things still aren't being cleaned up on a per test basis.

@joperezr
Copy link
Copy Markdown
Member

@karolz-ms can you remove these in this PR:

https://github.com/dotnet/aspire/blob/66465d1e19f20616be209d4d77a019aa4b68ea75/tests/helix/send-to-helix-inner.proj#L125-L139

@joperezr can you review those changes please? I want to make sure we can run without cleaning up to see if things are working well with the latest DCP changes plus these new app model changes.

Sorry @davidfowl just saw this. Changes look good other than @eerhardt comments about it being nice to have the diagnostic lines.

@davidfowl
Copy link
Copy Markdown
Contributor

We added them back, well @karolz-ms did 😄

@davidfowl
Copy link
Copy Markdown
Contributor

davidfowl commented Jan 18, 2025

@karolz-ms Can confirm now that with the diagnostic messages added back, we didn't fully fix the issue with dcp. Containers now don't start on the CI. I'll submit a change that does cleanup on start, that way we can track if there are leaking networks still but tests won't fail. We might need more logs here to understand why this is still happening.

cc @dbreshears

@github-actions github-actions Bot locked and limited conversation to collaborators Feb 17, 2025
@github-actions github-actions Bot added the area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication label Mar 10, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants