Logs aggregator occasionally fails to spin up #1311
Comments
This has hit multiple users now and is not a great first contact experience with Kurtosis, even though a
+1 I ran into this today!
After debugging with @chunha-park when he encountered this error, we were able to see that the logs aggregator container exited with status code 255. Vector logs of the exited container don't show any
PR #1371 configures the log aggregator container to be restarted upon failure. This should make it so that users no longer encounter this issue, but in case they do, the PR also improves the logging to instruct users to restart the engine. I'm leaving this issue open to continue investigating why the log aggregator is occasionally exiting. It's important that the log aggregator is live 100% of the time so that no service logs are dropped, potentially impacting a user's ability to debug using cc. @leeederek
## Description:
This change adjusts the log aggregator container configuration so that Docker restarts the container whenever it detects that the container has exited with a non-zero exit code. It also improves how we handle errors related to the log aggregator not existing when creating an enclave.

This is a band-aid that should address #1311 but doesn't address the issue entirely. More context in the issue's discussion.

## Is this change user facing?
YES (users should not experience #1311 anymore)

## References (if applicable):
#1311
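For readers unfamiliar with Docker restart policies, here is a minimal sketch of how the documented policy names (`no`, `on-failure`, `always`, `unless-stopped`) treat a container that exits on its own. The `shouldRestart` helper and the `RestartPolicy` type are hypothetical and are not Kurtosis or Docker SDK code. The model is deliberately simplified: it ignores manual stops, daemon restarts, and retry limits.

```go
package main

import "fmt"

// RestartPolicy holds one of the restart policy names documented by Docker.
// The type itself is a hypothetical illustration, not part of any SDK.
type RestartPolicy string

const (
	PolicyNo            RestartPolicy = "no"
	PolicyOnFailure     RestartPolicy = "on-failure"
	PolicyAlways        RestartPolicy = "always"
	PolicyUnlessStopped RestartPolicy = "unless-stopped"
)

// shouldRestart is a simplified model of whether Docker restarts a container
// that exited on its own with the given exit code. It does not model
// containers that were stopped manually or daemon restarts.
func shouldRestart(policy RestartPolicy, exitCode int) bool {
	switch policy {
	case PolicyOnFailure:
		// Only a non-zero exit code counts as a failure.
		return exitCode != 0
	case PolicyAlways, PolicyUnlessStopped:
		// Restart regardless of the exit code.
		return true
	default:
		// "no" (and anything unrecognized) never restarts.
		return false
	}
}

func main() {
	policies := []RestartPolicy{PolicyNo, PolicyOnFailure, PolicyAlways}
	for _, policy := range policies {
		for _, code := range []int{0, 255} {
			fmt.Printf("policy=%s exit=%d restart=%v\n", policy, code, shouldRestart(policy, code))
		}
	}
}
```

Under this model, the exit code 255 seen earlier in this thread would trigger a restart with `on-failure`, which is the behavior this PR relies on.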
Thank you @tedim52!
Ran into this issue again here #1832
For some reason, even though the logs aggregator container is set to be restarted on failure, it wasn't restarted in this case.
## Description:
This change strengthens the restart policy for the logs aggregator. Prior to this, the restart only occurred on failure. Now, we make Docker always attempt to restart the logs aggregator. This should help address #1832, where the logs aggregator was stopped with a `137` status code but wasn't restarted.

This change also fixes a `Propagate must be provided with a cause` panic that occurred in #1832. It was caused by nil errors being propagated in the create logs collector code.

## Is this change user facing?
NO

## References:
#1832
#1311
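The nil-error panic described above can be illustrated with a minimal sketch. Everything here is hypothetical: `propagate` is a standard-library stand-in for an error-wrapping helper that rejects a nil cause, and `createLogsCollector`, its parameters, and its error messages are invented for illustration rather than taken from the Kurtosis codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// propagate is a hypothetical stand-in for a wrapping helper that panics
// when handed a nil cause, mirroring the panic message reported in #1832.
func propagate(cause error, msg string) error {
	if cause == nil {
		panic("Propagate must be provided with a cause")
	}
	return fmt.Errorf("%s: %w", msg, cause)
}

// createLogsCollector is a hypothetical caller. aggregatorExists and
// lookupErr stand in for the result of looking up the logs aggregator.
func createLogsCollector(aggregatorExists bool, lookupErr error) error {
	if lookupErr != nil {
		// There is a real cause, so wrapping it is safe.
		return propagate(lookupErr, "an error occurred looking up the logs aggregator")
	}
	if !aggregatorExists {
		// The buggy pattern would be propagate(lookupErr, ...) here, where
		// lookupErr is nil. Creating a fresh error avoids the panic.
		return errors.New("the logs aggregator does not exist; try restarting the engine")
	}
	return nil
}

func main() {
	fmt.Println(createLogsCollector(false, nil))
	fmt.Println(createLogsCollector(false, errors.New("lookup failed")))
}
```

The design point is that a "not found" condition without an underlying error needs a newly created error, since there is nothing to wrap.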
@tedim52 did this guy ever get resolved?
We haven't had issues with this for a few months, so I'm closing it out. We can reopen if we see it again.
What's your CLI version?
0.82.24
Description & steps to reproduce
When starting an engine and then starting an enclave, users occasionally get this error:
Desired behavior
The logs collector (and then the enclave) should spin up successfully, because the logs aggregator should already exist. For now, users can run `kurtosis service restart`, and this will usually fix the issue. Ideally, there would be an availability check when the engine is created, so that engine creation fails if the logs aggregator doesn't exist.
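A minimal sketch of what such an availability check could look like, assuming the engine has some way to ask whether the logs aggregator container is running. The `aggregatorStatusFunc` probe, the `waitForLogsAggregator` function, and the error messages are all hypothetical names for illustration, not existing Kurtosis APIs.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// aggregatorStatusFunc is a hypothetical probe that reports whether the
// logs aggregator container is currently running.
type aggregatorStatusFunc func() (running bool, err error)

// waitForLogsAggregator polls the probe up to maxAttempts times, sleeping
// for interval between attempts. It returns nil as soon as the aggregator is
// running, and an error otherwise so that engine creation can fail early.
func waitForLogsAggregator(isRunning aggregatorStatusFunc, maxAttempts int, interval time.Duration) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		running, err := isRunning()
		if err == nil && running {
			return nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = errors.New("logs aggregator container is not running")
		}
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	if lastErr == nil {
		// Only reachable when maxAttempts is zero or negative.
		lastErr = errors.New("no availability check attempts were made")
	}
	return fmt.Errorf("logs aggregator unavailable after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// Fake probe for demonstration: reports running on the third call.
	probe := func() (bool, error) {
		calls++
		return calls >= 3, nil
	}
	if err := waitForLogsAggregator(probe, 5, 10*time.Millisecond); err != nil {
		fmt.Println("engine creation should fail:", err)
		return
	}
	fmt.Println("logs aggregator became available after", calls, "checks")
}
```

In a real engine the probe would inspect the container's state. A bounded number of attempts keeps engine startup from hanging indefinitely when the aggregator never comes up.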
What is the severity of this bug?
Critical; I am blocked and Kurtosis is unusable for me because of this bug.