
Flow: services terminate too early during process shutdown #266

Open
rfratto opened this issue Jan 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@rfratto
Member

rfratto commented Jan 24, 2024

What's wrong?

When Flow is shutting down, it does the following:

  • Cancel the running context associated with all running services and components
  • Wait for all running services and components to finish terminating
  • Exit the process

Because some components take a while to terminate (for example, prometheus.remote_write waits to flush buffered metrics), the process may keep running for some time with no services available.

This has a few implications, but the main issue is that the HTTP service is terminated almost immediately, preventing users from collecting metrics during the shutdown process. This shows up as scrape failures: the agent is still running but can no longer be scraped.
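
To make the ordering concrete, here is a minimal Go sketch of the shutdown sequence described above. It is not the actual Flow controller code (the `Component` interface and all names are invented for illustration), but it shows why the HTTP endpoint disappears while the process keeps running:

```go
// Simplified illustration only; not the real Flow controller.
package main

import (
	"context"
	"log"
	"net/http"
	"sync"
	"time"
)

// Component stands in for anything the controller schedules, e.g. a
// prometheus.remote_write component that drains buffered samples on shutdown.
type Component interface{ Run(ctx context.Context) }

type slowFlusher struct{}

func (slowFlusher) Run(ctx context.Context) {
	<-ctx.Done()
	time.Sleep(30 * time.Second) // simulate retrying/flushing buffered samples
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// The HTTP service (which serves /metrics) shares the same lifecycle
	// as every component.
	srv := &http.Server{Addr: "127.0.0.1:12345"}
	go srv.ListenAndServe()

	var wg sync.WaitGroup
	for _, c := range []Component{slowFlusher{}} {
		wg.Add(1)
		go func(c Component) { defer wg.Done(); c.Run(ctx) }(c)
	}

	// Shutdown, as described above:
	cancel() // 1. cancel the context shared by services and components

	// The HTTP service stops serving almost immediately, so /metrics
	// becomes unreachable from here on.
	srv.Shutdown(context.Background())

	// 2. Wait for everything to finish terminating. Slow components (like
	// the remote_write drain simulated above) keep the process alive.
	wg.Wait()

	// 3. Exit the process.
	log.Println("exiting")
}
```

During step 2 the agent is alive but unscrapable, which is exactly the window described above.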

Steps to reproduce

  1. Run Flow mode with a Prometheus metrics pipeline that sends metrics to an address/port combination where nothing is listening (forcing metrics to buffer in memory and sends to be retried)
  2. Request a graceful termination of the agent process.
  3. Observe that the HTTP server becomes unavailable while the process is still running, and that the process takes a noticeable amount of time before it finally terminates gracefully.

System information

N/A

Software version

v0.39.1

Configuration

No response

Logs

No response

rfratto added the bug (Something isn't working) label on Jan 24, 2024
@rfratto
Member Author

rfratto commented Jan 24, 2024

I'll also add that it's not obvious what a clean fix for this problem looks like. Maybe the Flow controller needs a dedicated scheduler just for services, to allow them to be terminated later?
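
As a rough illustration of that direction, components and services could be run by separate schedulers with independent contexts, so services are only stopped after every component has drained. All of the names below are invented; this is a sketch of the idea, not a concrete design for the controller:

```go
// Sketch only: two independently-stoppable groups, shut down in phases.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type Runnable interface{ Run(ctx context.Context) }

// group runs Runnables against its own context so it can be stopped
// independently of other groups.
type group struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func newGroup() *group {
	ctx, cancel := context.WithCancel(context.Background())
	return &group{ctx: ctx, cancel: cancel}
}

func (g *group) run(r Runnable) {
	g.wg.Add(1)
	go func() { defer g.wg.Done(); r.Run(g.ctx) }()
}

func (g *group) stop() {
	g.cancel()
	g.wg.Wait()
}

type fake struct {
	name  string
	drain time.Duration
}

func (f fake) Run(ctx context.Context) {
	<-ctx.Done()
	time.Sleep(f.drain) // simulate flushing buffered data after cancellation
	fmt.Println(f.name, "stopped")
}

func main() {
	components, services := newGroup(), newGroup()
	components.run(fake{name: "prometheus.remote_write", drain: 2 * time.Second})
	services.run(fake{name: "http service", drain: 0})

	// Two-phase shutdown: components drain first, and the HTTP service
	// stays up (and scrapable) until they are done.
	components.stop()
	services.stop()
}
```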

Contributor

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@rfratto
Member Author

rfratto commented Apr 11, 2024

Hi there 👋

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto transferred this issue from grafana/agent on Apr 11, 2024
@ptodev
Contributor

ptodev commented Apr 26, 2024

I'm also not sure what a clean solution would look like. It gets complicated when we consider a cluster of collectors doing a rolling restart. There are a few obvious options, each with drawbacks:

  • If the collector which is shutting down continues scraping, it may never be able to shut down.
  • On the other hand, if the shutting-down collector is allowed to stop scraping and remote write the rest of its samples, Mimir will report out-of-order errors when another collector is brought up to scrape in place of the first.
  • Alternatively, if the shutting-down collector is "killed" without allowing it to replay its WAL, then those samples will be lost.

Probably the only realistic solution is to "not fall behind" 😄. For users affected by this, in the short term it might be best to look into why the collector is taking so long to send its samples.

Some other, "cleaner", solutions could be:

  • If we have "replay the WAL" functionality, then a collector could shut down abruptly without remote writing everything from its WAL, and the new collector which starts in its place could pick up where the first left off. It would be interesting to see whether this works with clustering, where "ownership" of a series may move from one collector to another during those restarts.
  • Or, the collector which is starting up could hold off on remote writing until the shutting-down collector has finished.
