
Logs aggregator occasionally fails to spin up #1311

Closed
tedim52 opened this issue Sep 15, 2023 · 9 comments

Labels: bug (Something isn't working), critical (Critical bug or feature)

@tedim52 (Contributor) commented Sep 15, 2023

What's your CLI version?

0.82.24

Description & steps to reproduce

When starting an engine and then starting an enclave, users occasionally get this error:

INFO[2023-09-09T20:55:08+02:00] Creating a new enclave for Starlark to run inside... 
Error:  An error occurred running command 'run'
  Caused by: An error occurred calling the run function for command 'run'
  Caused by: An error occurred getting the enclave context for enclave ''
  Caused by: Unable to create new enclave with name ''
  Caused by: An error occurred creating an enclave with name ''
  Caused by: rpc error: code = Unknown desc = An error occurred creating new enclave with name ''
  Caused by: An error occurred creating new enclave with name 'waning-glade' using api container image version '' and api container log level 'debug'
  Caused by: An error occurred creating enclave with name waning-glade and uuid '2d42458ee5dc4ab6b8eb51bbdb5b2f82'
  Caused by: An error occurred creating enclave with UUID '2d42458ee5dc4ab6b8eb51bbdb5b2f82'
  Caused by: An error occurred creating the logs collector with TCP port number '9713' and HTTP port number '9712'
  Caused by: The logs aggregator is not running; the logs collector cannot be run without a running logs aggregator

Desired behavior

The logs collector (and then the enclave) should spin up successfully, because the logs aggregator should already exist. For now, users can run `kurtosis engine restart` and this will usually fix the issue.

Ideally, there would be an availability check when the engine is created, so that engine creation fails if the logs aggregator doesn't exist.
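
A rough sketch of what such a check could look like, using the Docker Go SDK (the package, function, and parameter names are illustrative, not actual Kurtosis code):

```go
// Hypothetical availability check: after starting the logs aggregator, poll the
// container's state and fail engine creation if it never reaches "running".
package engine

import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/client"
)

func waitForLogsAggregator(ctx context.Context, cli *client.Client, containerId string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		info, err := cli.ContainerInspect(ctx, containerId)
		if err != nil {
			return fmt.Errorf("inspecting logs aggregator container '%v': %w", containerId, err)
		}
		if info.State != nil && info.State.Running {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("logs aggregator container '%v' did not reach a running state within %v", containerId, timeout)
}
```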

What is the severity of this bug?

Critical; I am blocked and Kurtosis is unusable for me because of this bug.

tedim52 added the "bug: Something isn't working" label Sep 15, 2023
tedim52 self-assigned this Sep 15, 2023
github-actions bot added the "critical: Critical bug or feature" label Sep 15, 2023
@leeederek (Contributor) commented:

This has hit multiple users now and is not a great first-contact experience with Kurtosis, even though a `kurtosis engine restart` usually solves the issue.

@chunha-park (Collaborator) commented:

+1 I ran into this today!

@tedim52 (Contributor, Author) commented Sep 25, 2023

After debugging with @chunha-park when he encountered this error, we were able to see that the logs aggregator container exited with status code 255. The Vector logs of the exited container don't show any ERROR or WARN messages, so it's hard to tell exactly why the container exited.

[Two screenshots: the exited logs aggregator container showing exit status 255, and its Vector log output]

@tedim52 (Contributor, Author) commented Sep 25, 2023

PR #1371 configures the logs aggregator container to be restarted upon failure. This should make it so that users no longer encounter this issue, but in case they do, the PR also improves the logging to instruct users to restart the engine.

I'm leaving this issue open to continue investigating why the logs aggregator is occasionally exiting. It's important that the logs aggregator is live 100% of the time so that no service logs are dropped, which could impact a user's ability to debug using `kurtosis service logs`.

cc. @leeederek
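
For reference, a minimal sketch of the kind of restart-policy configuration the PR adds, using the Docker Go SDK (package, function, and container names are illustrative, not the actual Kurtosis backend code):

```go
// Hypothetical sketch: create the logs aggregator container with an
// "on-failure" restart policy so Docker restarts it whenever it exits
// with a non-zero status code.
package logsaggregator

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func createLogsAggregatorContainer(ctx context.Context, cli *client.Client, image string) (string, error) {
	hostConfig := &container.HostConfig{
		RestartPolicy: container.RestartPolicy{Name: "on-failure"},
	}
	resp, err := cli.ContainerCreate(ctx, &container.Config{Image: image}, hostConfig, nil, nil, "kurtosis-logs-aggregator")
	if err != nil {
		return "", err
	}
	return resp.ID, nil
}
```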

tedim52 added a commit that referenced this issue Sep 26, 2023
## Description:
This change adjusts the logs aggregator container configuration so that
Docker always restarts the container when it detects that it has exited
with a non-zero exit code. It also improves how we handle errors related
to the logs aggregator not existing when creating an enclave.

This is a band-aid that should address #1311 but doesn't address the issue entirely. More context is in the issue's discussion.

## Is this change user facing?
YES (users should not experience #1311 anymore)

## References (if applicable):
#1311
@leeederek (Contributor) commented:

Thank you @tedim52!

@tedim52 (Contributor, Author) commented Nov 22, 2023

Ran into this issue again here: #1832

2023-11-19 13:17:17 WARN[2023-11-19T16:17:17Z][docker_kurtosis_backend.go:CreateLogsCollectorForEnclave] Logs aggregator exists but is not running. Instead container status is 'STOPPED'. This is unexpected as docker should have restarted the container automatically. 
2023-11-19 13:17:17 WARN[2023-11-19T16:17:17Z][docker_kurtosis_backend.go:CreateLogsCollectorForEnclave] This can be fixed by restarting the engine using `kurtosis engine restart` and attempting to create the enclave again. 

For some reason, even though the logs aggregator container is set to be restarted on failure, it wasn't restarted in this case.
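
A quick way to dig into cases like this is to inspect the container and confirm which restart policy Docker actually applied and what exit code it last recorded. A minimal sketch using the Docker Go SDK (the container name is illustrative):

```go
// Hypothetical debugging helper (not part of Kurtosis): print the restart
// policy and last recorded state of the logs aggregator container.
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	// Replace with the real logs aggregator container name or ID.
	info, err := cli.ContainerInspect(ctx, "kurtosis-logs-aggregator")
	if err != nil {
		panic(err)
	}
	fmt.Printf("restart policy: %s (max retries: %d)\n",
		info.HostConfig.RestartPolicy.Name, info.HostConfig.RestartPolicy.MaximumRetryCount)
	fmt.Printf("status: %s, exit code: %d, finished at: %s\n",
		info.State.Status, info.State.ExitCode, info.State.FinishedAt)
}
```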

github-merge-queue bot pushed a commit that referenced this issue Nov 22, 2023
## Description:
This change strengthens the restart policy for the logs aggregator.
Prior to this, the restart only occurred on failure. Now, we make Docker
attempt to always restart the logs aggregator. This should help address
#1832 where the logs
aggregator was stopped with a `137` status code but wasn't restarted.

This change also addresses a `Propagate must be provided with a cause`
panic that occurred here: #1832. This was caused by nil errors being
propagated in the create-logs-collector code. This change fixes that issue.

## Is this change user facing?
NO

## References:
#1832
#1311
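
The restart-policy half of this change follows the same pattern as the earlier on-failure sketch, with the policy name set to "always". The nil-cause half roughly amounts to the guard below, assuming the `github.com/kurtosis-tech/stacktrace` package the panic message comes from (the function names are illustrative, not the actual backend code):

```go
// Hypothetical illustration of the nil-cause guard: stacktrace.Propagate
// panics with "Propagate must be provided with a cause" when err is nil,
// so it must only be called when an error actually occurred.
package logscollector

import "github.com/kurtosis-tech/stacktrace"

func createLogsCollector(enclaveUuid string, doCreate func(enclaveUuid string) error) error {
	if err := doCreate(enclaveUuid); err != nil {
		return stacktrace.Propagate(err, "An error occurred creating the logs collector for enclave '%v'", enclaveUuid)
	}
	// Success: return nil directly instead of propagating a nil error.
	return nil
}
```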
tedim52 reopened this Nov 30, 2023
@tedim52 (Contributor, Author) commented Nov 30, 2023

Failure on 0.85.36:

➜  ~ kurtosis run github.com/kurtosis-tech/llm-package '{"model": "llama2"}'
INFO[2023-11-30T16:11:08-05:00] No Kurtosis engine was found; attempting to start one...
INFO[2023-11-30T16:11:08-05:00] Starting the centralized logs components...
INFO[2023-11-30T16:11:08-05:00] Centralized logs components started.
INFO[2023-11-30T16:11:08-05:00] Pulling image 'kurtosistech/engine:0.85.36'
INFO[2023-11-30T16:11:12-05:00] Successfully started Kurtosis engine
INFO[2023-11-30T16:11:12-05:00] Creating a new enclave for Starlark to run inside...
Error:  An error occurred running command 'run'
  Caused by: An error occurred calling the run function for command 'run'
  Caused by: An error occurred getting the enclave context for enclave ''
  Caused by: Unable to create new enclave with name ''
  Caused by: An error occurred creating an enclave with name ''
  Caused by: rpc error: code = Unknown desc = An error occurred creating new enclave with name '0x4000216340'
  Caused by: An error occurred creating new enclave with name 'cool-tundra' using api container image version '' and api container log level 'debug'
  Caused by: An error occurred creating enclave with name `cool-tundra` and uuid '6bb097aca38049b59c0b87dcf1c24e0a'
  Caused by: An error occurred creating enclave with UUID '6bb097aca38049b59c0b87dcf1c24e0a'
  Caused by: An error occurred creating the logs collector with TCP port number '9713' and HTTP port number '9712'
  Caused by: The logs aggregator container exists but is not running. Instead logs aggregator container status is 'STOPPED'. The logs collector cannot be run without a logs aggregator.

@mieubrisse (Member) commented:

@tedim52 did this guy ever get resolved?

@tedim52 (Contributor, Author) commented Feb 27, 2024

We haven't seen this for a few months, so I'm closing it out; we can reopen if it comes up again.

tedim52 closed this as completed Feb 27, 2024