Clean up processes on start, but wait on shutdown#7185

Merged
davidfowl merged 12 commits into main from davidfowl/cleanup-on-start-again
Jan 24, 2025

@davidfowl
Contributor

Trying to narrow down what might be causing #7098 (comment). Just saw a flaky test #7184 on a PR that had 2 containers hanging around and 6 networks:

+ docker container ls --all
CONTAINER ID   IMAGE                                        COMMAND                  CREATED         STATUS         PORTS                       NAMES
f7075871afca   mcr.microsoft.com/mssql/server:2022-latest   "/opt/mssql/bin/perm…"   2 minutes ago   Up 2 minutes   127.0.0.1:32773->1433/tcp   resource-qfczqtcf-44975b
1135e744ae8c   mcr.microsoft.com/mssql/server:2022-latest   "/opt/mssql/bin/perm…"   3 minutes ago   Up 3 minutes   127.0.0.1:32772->1433/tcp   sqlserver-jfhfxdez-73fe4195
+ docker volume ls
DRIVER    VOLUME NAME
+ docker network ls
NETWORK ID     NAME                                DRIVER    SCOPE
fb2665b475fc   bridge                              bridge    local
581e22aa3346   default-aspire-network-1kehlug2b8   bridge    local
be06bf54b5e1   default-aspire-network-8t4322bof0   bridge    local
40fe071b8ca3   default-aspire-network-elqgk6njg4   bridge    local
651ed8f1543e   default-aspire-network-fb93578n1o   bridge    local
c68e2fb49e65   default-aspire-network-kitbd1l6c0   bridge    local
f4b8495d6db3   default-aspire-network-pgrkmjt3g0   bridge    local
d4e6971983fe   host                                host      local
c5c27b532323   none                                null      local
+ pgrep -lf dotnet-tests|dcp.exe|dcpctrl.exe
+ awk {print ; system("kill -9 "$1)}
+ exit 1
['Aspire.Hosting.SqlServer.Tests' END OF WORK ITEM LOG: Command exited with 1]

It's unclear whether dcp would have cleaned up some of these, because the test infrastructure kills it ungracefully. Instead, clean up at the start of the test. After the test runs, wait up to 60 seconds for dcp to quit, and fail if it hasn't.
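
A minimal sketch of the two halves described above, clean up before the test and wait (with a timeout) afterwards. This is not the content of send-to-helix-inner.proj; the function names, the choice to remove every container on the machine, and the use of `pgrep -x` (exact process-name match) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: names and defaults are illustrative, not the PR's actual script.

# Print only the Aspire-created networks from a list of network names on stdin.
filter_aspire_networks() {
  grep '^default-aspire-network-' || true
}

# Before the test: remove leftover containers and Aspire networks.
# Assumes the machine is dedicated to test runs, so removing ALL containers is acceptable.
cleanup_on_start() {
  local ids
  ids=$(docker ps --all --quiet)
  if [ -n "$ids" ]; then
    # Word splitting is intended: one argument per container ID.
    docker rm --force $ids
  fi
  docker network ls --format '{{.Name}}' | filter_aspire_networks |
    while read -r name; do docker network rm "$name"; done
}

# After the test: wait up to $2 seconds (default 60) for every process
# named exactly $1 to exit. Returns 1 if one is still running at the deadline.
wait_for_exit() {
  local name="$1" timeout="${2:-60}" waited=0
  while pgrep -x "$name" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "$name still running after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

A test wrapper could call `cleanup_on_start` first, run the tests, and then call something like `wait_for_exit dcp 60 || exit 1`. The process name `dcp` here is a guess based on the `dcp.exe` and `dcpctrl.exe` patterns in the log above.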

@davidfowl
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@davidfowl
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@davidfowl
Contributor Author

@karolz-ms is there any way to associate this:

start-apiserver --monitor 12233 --detach --kubeconfig "/datadisks/disk1/work/ADFF09A7/t/aspire.YAFf8k/kubeconfig"

with the logs that come out of dcp? I have all of the dcp logs being extracted now, but it's hard to correlate.

@davidfowl
Contributor Author

Also I'm seeing this repeated in the dcp logs:

{"level":"debug","ts":"2025-01-23T05:37:11.087Z","logger":"dcpctrl.ContainerOrchestrator","msg":"Running Docker command","ContainerRuntime":"","Command":"/usr/bin/docker network rm --force 76dddaa083d5b207e8187fc1f9af0974b27a842c1fa9b78e0c999a4c833c671b"}
{"level":"debug","ts":"2025-01-23T05:37:11.090Z","logger":"dcpctrl.os-executor","msg":"starting waiting for process to exit","pid":28066}
{"level":"debug","ts":"2025-01-23T05:37:11.139Z","logger":"dcpctrl.os-executor","msg":"process wait ended","pid":28066,"Error":"exit status 125"}
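
To see how often this pattern repeats, a rough filter over the extracted logs can help. The sketch below assumes one JSON object per line with the field names shown in the excerpt above (`msg`, `pid`, `Error`, `Command`) and string values without escaped quotes. The function names are hypothetical, and a real JSON parser would be more robust than `sed`.

```shell
#!/usr/bin/env bash
# Sketch only: relies on the field order and names seen in the log excerpt.

# Read log lines on stdin; print "<pid> <error>" for each
# "process wait ended" entry that carries an Error field.
failed_waits() {
  grep '"msg":"process wait ended"' |
    sed -n 's/.*"pid":\([0-9][0-9]*\).*"Error":"\([^"]*\)".*/\1 \2/p'
}

# Read log lines on stdin; print the Command value of each
# "Running Docker command" entry.
docker_commands() {
  sed -n 's/.*"msg":"Running Docker command".*"Command":"\([^"]*\)".*/\1/p'
}
```

For example, `failed_waits < some-dcpctrl.log | sort | uniq -c` would count the distinct failures, where `some-dcpctrl.log` stands in for whichever extracted log file is being inspected.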

@davidfowl
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@davidfowl
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@karolz-ms
Contributor

> @karolz-ms is there any way to associate this start-apiserver --monitor 12233 --detach --kubeconfig "/datadisks/disk1/work/ADFF09A7/t/aspire.YAFf8k/kubeconfig"
>
> With the logs that come out of dcp? I have all of the dcp logs being extracted now but its hard to correlate.

That path to the kubeconfig file should show up in the logs from the dcpctrl process, although I think Aspire tests share the same session folder for everything, so that might not help.
What will probably help is that the argument to the --monitor flag is the PID of the dcp process that started the dcpctrl process, and that PID should also appear in the dcpctrl logs.
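
The correlation suggested here could be scripted roughly as follows. This is a sketch under assumptions: the exact form in which the PID appears in the dcpctrl logs is not shown in this thread, so the search simply looks for the number as a whole word, which can produce false positives. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: how the PID is written in the dcpctrl logs is an assumption.

# Print the numeric value that follows --monitor in a command line passed as $1.
monitor_pid() {
  printf '%s\n' "$1" | sed -n 's/.*--monitor[ =]\([0-9][0-9]*\).*/\1/p'
}

# List the files under directory $2 that mention PID $1 as a whole word.
logs_mentioning_pid() {
  grep -rlw -- "$1" "$2"
}
```

With the command line from the question above, `monitor_pid` yields `12233`, and `logs_mentioning_pid 12233 <log directory>` then narrows the extracted logs down to the files that mention that dcp process.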

@davidfowl davidfowl merged commit ff4368e into main Jan 24, 2025
@davidfowl davidfowl deleted the davidfowl/cleanup-on-start-again branch January 24, 2025 01:19
@davidfowl
Contributor Author

davidfowl commented Jan 24, 2025

@karolz-ms do I want to share the session folder? If it makes sense, maybe dcp could tweak the output file names for this scenario?

@karolz-ms
Contributor

karolz-ms commented Jan 24, 2025

@davidfowl up until now we have been using one session folder per DCP invocation. Safest bet would be to have one folder per test suite run and within it, one folder per DCP invocation (or per invocation of a test that is using DCP). But sharing a session folder should be fine as long as each DCP instance is instructed to preserve the session folder.

We can tweak the file naming as necessary, e.g. add a prefix that is per-DCP invocation etc. (e.g. that prefix could be associated with test name)
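
As an illustration of the layout described above (one folder per test suite run, and within it one folder per DCP invocation keyed by test name), the following sketch builds such a path. It does not show how DCP is told which session folder to use, since that is not covered in this thread. The function names and the sanitizing rule are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: illustrates the folder layout, not any DCP option.

# Turn a test name into a string that is safe to use as a file or folder name
# by replacing every character outside A-Z, a-z, 0-9, '.', '_' and '-' with '_'.
safe_prefix() {
  printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
}

# Create and print <root>/<run id>/<sanitized test name>.
# $1 = root folder, $2 = suite run identifier, $3 = test name.
make_session_dir() {
  local dir
  dir="$1/$2/$(safe_prefix "$3")"
  mkdir -p "$dir" && printf '%s\n' "$dir"
}
```

The same `safe_prefix` output could serve as a per-invocation file name prefix instead of a folder name, which is the other option raised in this comment.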

@davidfowl
Contributor Author

> We can tweak the file naming as necessary, e.g. add a prefix that is per-DCP invocation etc. (e.g. that prefix could be associated with test name)

Can you add this? The current output looks a little crazy, as it's hard to tell what is going on. We could also consider making that prefix a folder.

@github-actions github-actions Bot locked and limited conversation to collaborators Feb 24, 2025
@github-actions github-actions Bot added the area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication label Mar 10, 2025