
Logs aggregator occasionally fails to spin up #1311

Closed
tedim52 opened this issue Sep 15, 2023 · 9 comments

Labels: bug (Something isn't working), critical (Critical bug or feature)

@tedim52 (Contributor) commented Sep 15, 2023

What's your CLI version?

0.82.24

Description & steps to reproduce

When starting an engine and then starting an enclave, users occasionally get this error:

INFO[2023-09-09T20:55:08+02:00] Creating a new enclave for Starlark to run inside... 
Error:  An error occurred running command 'run'
  Caused by: An error occurred calling the run function for command 'run'
  Caused by: An error occurred getting the enclave context for enclave ''
  Caused by: Unable to create new enclave with name ''
  Caused by: An error occurred creating an enclave with name ''
  Caused by: rpc error: code = Unknown desc = An error occurred creating new enclave with name ''
  Caused by: An error occurred creating new enclave with name 'waning-glade' using api container image version '' and api container log level 'debug'
  Caused by: An error occurred creating enclave with name waning-glade and uuid '2d42458ee5dc4ab6b8eb51bbdb5b2f82'
  Caused by: An error occurred creating enclave with UUID '2d42458ee5dc4ab6b8eb51bbdb5b2f82'
  Caused by: An error occurred creating the logs collector with TCP port number '9713' and HTTP port number '9712'
  Caused by: The logs aggregator is not running; the logs collector cannot be run without a running logs aggregator

Desired behavior

The logs collector (and then the enclave) should spin up successfully, because the logs aggregator should already exist. For now, users can run `kurtosis engine restart` and this will usually fix the issue.

Ideally, there would be an availability check when the engine is created, so that engine creation fails if the logs aggregator doesn't exist.
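
A rough sketch of what such a check could look like, using the Docker Go SDK (the package, function, and parameter names are illustrative, not actual Kurtosis code):

```go
// Hypothetical availability check: after starting the logs aggregator, poll the
// container's state and fail engine creation if it never reaches "running".
package engine

import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/client"
)

func waitForLogsAggregator(ctx context.Context, cli *client.Client, containerId string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		info, err := cli.ContainerInspect(ctx, containerId)
		if err != nil {
			return fmt.Errorf("inspecting logs aggregator container '%v': %w", containerId, err)
		}
		if info.State != nil && info.State.Running {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("logs aggregator container '%v' did not reach a running state within %v", containerId, timeout)
}
```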

What is the severity of this bug?

Critical; I am blocked and Kurtosis is unusable for me because of this bug.

tedim52 added the "bug: Something isn't working" label Sep 15, 2023
tedim52 self-assigned this Sep 15, 2023
github-actions bot added the "critical: Critical bug or feature" label Sep 15, 2023
@leeederek (Contributor) commented:

This has hit multiple users now and is not a great first-contact experience with Kurtosis, even though a `kurtosis engine restart` usually solves the issue.

@chunha-park (Collaborator) commented:

+1 I ran into this today!

@tedim52 (Contributor, Author) commented Sep 25, 2023

After debugging with @chunha-park when he encountered this error, we were able to see that the logs aggregator container exited with status code 255. The Vector logs of the exited container don't show any ERROR or WARN messages, so it's hard to tell exactly why the container exited.

[Two screenshots: the exited logs aggregator container showing exit status 255, and its Vector log output]

@tedim52 (Contributor, Author) commented Sep 25, 2023

PR #1371 configures the logs aggregator container to be restarted upon failure. This should make it so that users no longer encounter this issue, but in case they do, the PR also improves the logging to instruct users to restart the engine.

I'm leaving this issue open to continue investigating why the logs aggregator is occasionally exiting. It's important that the logs aggregator is live 100% of the time so that no service logs are dropped, which could impact a user's ability to debug using `kurtosis service logs`.

cc. @leeederek
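
For reference, a minimal sketch of the kind of restart-policy configuration the PR adds, using the Docker Go SDK (package, function, and container names are illustrative, not the actual Kurtosis backend code):

```go
// Hypothetical sketch: create the logs aggregator container with an
// "on-failure" restart policy so Docker restarts it whenever it exits
// with a non-zero status code.
package logsaggregator

import (
	"context"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func createLogsAggregatorContainer(ctx context.Context, cli *client.Client, image string) (string, error) {
	hostConfig := &container.HostConfig{
		RestartPolicy: container.RestartPolicy{Name: "on-failure"},
	}
	resp, err := cli.ContainerCreate(ctx, &container.Config{Image: image}, hostConfig, nil, nil, "kurtosis-logs-aggregator")
	if err != nil {
		return "", err
	}
	return resp.ID, nil
}
```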

tedim52 added a commit that referenced this issue Sep 26, 2023
## Description:
This change adjusts the logs aggregator container configuration so that
Docker always restarts the container when it detects that it has exited
with a non-zero exit code. It also improves how we handle errors related
to the logs aggregator not existing when creating an enclave.

This is a band-aid that should address #1311 but doesn't address the issue entirely. More context is in the issue's discussion.

## Is this change user facing?
YES (users should not experience #1311 anymore)

## References (if applicable):
#1311
@leeederek (Contributor) commented:

Thank you @tedim52!

@tedim52 (Contributor, Author) commented Nov 22, 2023

Ran into this issue again here: #1832

2023-11-19 13:17:17 WARN[2023-11-19T16:17:17Z][docker_kurtosis_backend.go:CreateLogsCollectorForEnclave] Logs aggregator exists but is not running. Instead container status is 'STOPPED'. This is unexpected as docker should have restarted the container automatically. 
2023-11-19 13:17:17 WARN[2023-11-19T16:17:17Z][docker_kurtosis_backend.go:CreateLogsCollectorForEnclave] This can be fixed by restarting the engine using `kurtosis engine restart` and attempting to create the enclave again. 

For some reason, even though the logs aggregator container is set to be restarted on failure, it wasn't restarted in this case.
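
A quick way to dig into cases like this is to inspect the container and confirm which restart policy Docker actually applied and what exit code it last recorded. A minimal sketch using the Docker Go SDK (the container name is illustrative):

```go
// Hypothetical debugging helper (not part of Kurtosis): print the restart
// policy and last recorded state of the logs aggregator container.
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}
	// Replace with the real logs aggregator container name or ID.
	info, err := cli.ContainerInspect(ctx, "kurtosis-logs-aggregator")
	if err != nil {
		panic(err)
	}
	fmt.Printf("restart policy: %s (max retries: %d)\n",
		info.HostConfig.RestartPolicy.Name, info.HostConfig.RestartPolicy.MaximumRetryCount)
	fmt.Printf("status: %s, exit code: %d, finished at: %s\n",
		info.State.Status, info.State.ExitCode, info.State.FinishedAt)
}
```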

github-merge-queue bot pushed a commit that referenced this issue Nov 22, 2023
## Description:
This change strengthens the restart policy for the logs aggregator.
Prior to this, the restart only occurred on failure. Now, we make Docker
attempt to always restart the logs aggregator. This should help address
#1832 where the logs
aggregator was stopped with a `137` status code but wasn't restarted.

This change also addresses a `Propagate must be provided with a cause`
panic that occurred here: #1832. This was caused by nil errors being
propagated in the create-logs-collector code. This change fixes that issue.

## Is this change user facing?
NO

## References:
#1832
#1311
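
The restart-policy half of this change follows the same pattern as the earlier on-failure sketch, with the policy name set to "always". The nil-cause half roughly amounts to the guard below, assuming the `github.com/kurtosis-tech/stacktrace` package the panic message comes from (the function names are illustrative, not the actual backend code):

```go
// Hypothetical illustration of the nil-cause guard: stacktrace.Propagate
// panics with "Propagate must be provided with a cause" when err is nil,
// so it must only be called when an error actually occurred.
package logscollector

import "github.com/kurtosis-tech/stacktrace"

func createLogsCollector(enclaveUuid string, doCreate func(enclaveUuid string) error) error {
	if err := doCreate(enclaveUuid); err != nil {
		return stacktrace.Propagate(err, "An error occurred creating the logs collector for enclave '%v'", enclaveUuid)
	}
	// Success: return nil directly instead of propagating a nil error.
	return nil
}
```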
tedim52 reopened this Nov 30, 2023
@tedim52 (Contributor, Author) commented Nov 30, 2023

Failure on 0.85.36:

➜  ~ kurtosis run github.com/kurtosis-tech/llm-package '{"model": "llama2"}'
INFO[2023-11-30T16:11:08-05:00] No Kurtosis engine was found; attempting to start one...
INFO[2023-11-30T16:11:08-05:00] Starting the centralized logs components...
INFO[2023-11-30T16:11:08-05:00] Centralized logs components started.
INFO[2023-11-30T16:11:08-05:00] Pulling image 'kurtosistech/engine:0.85.36'
INFO[2023-11-30T16:11:12-05:00] Successfully started Kurtosis engine
INFO[2023-11-30T16:11:12-05:00] Creating a new enclave for Starlark to run inside...
Error:  An error occurred running command 'run'
  Caused by: An error occurred calling the run function for command 'run'
  Caused by: An error occurred getting the enclave context for enclave ''
  Caused by: Unable to create new enclave with name ''
  Caused by: An error occurred creating an enclave with name ''
  Caused by: rpc error: code = Unknown desc = An error occurred creating new enclave with name '0x4000216340'
  Caused by: An error occurred creating new enclave with name 'cool-tundra' using api container image version '' and api container log level 'debug'
  Caused by: An error occurred creating enclave with name `cool-tundra` and uuid '6bb097aca38049b59c0b87dcf1c24e0a'
  Caused by: An error occurred creating enclave with UUID '6bb097aca38049b59c0b87dcf1c24e0a'
  Caused by: An error occurred creating the logs collector with TCP port number '9713' and HTTP port number '9712'
  Caused by: The logs aggregator container exists but is not running. Instead logs aggregator container status is 'STOPPED'. The logs collector cannot be run without a logs aggregator.

@mieubrisse (Member) commented:

@tedim52 did this guy ever get resolved?

@tedim52 (Contributor, Author) commented Feb 27, 2024

We haven't seen this for a few months, so I'm closing it out; we can reopen if it comes up again.

tedim52 closed this as completed Feb 27, 2024