
Flow: services terminate too early during process shutdown #266

Open
rfratto opened this issue Jan 24, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@rfratto
Member

rfratto commented Jan 24, 2024

What's wrong?

When Flow is shutting down, it does the following:

  • Cancel the running context associated with all running services and components
  • Wait for all running services and components to finish terminating
  • Exit the process

Because some components take a while to terminate (for example, prometheus.remote_write waits to flush buffered metrics), the process may keep running for some time with no services available.

This has a few implications, but the main issue is that the HTTP service is terminated almost immediately, preventing users from collecting metrics during the shutdown process. This shows up as scrape failures: the agent is still running but can no longer be scraped.
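
To make the ordering concrete, here is a minimal Go sketch of the shutdown sequence described above. It is not the actual Flow controller code (the `Component` interface and all names are invented for illustration), but it shows why the HTTP endpoint disappears while the process keeps running:

```go
// Simplified illustration only; not the real Flow controller.
package main

import (
	"context"
	"log"
	"net/http"
	"sync"
	"time"
)

// Component stands in for anything the controller schedules, e.g. a
// prometheus.remote_write component that drains buffered samples on shutdown.
type Component interface{ Run(ctx context.Context) }

type slowFlusher struct{}

func (slowFlusher) Run(ctx context.Context) {
	<-ctx.Done()
	time.Sleep(30 * time.Second) // simulate retrying/flushing buffered samples
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// The HTTP service (which serves /metrics) shares the same lifecycle
	// as every component.
	srv := &http.Server{Addr: "127.0.0.1:12345"}
	go srv.ListenAndServe()

	var wg sync.WaitGroup
	for _, c := range []Component{slowFlusher{}} {
		wg.Add(1)
		go func(c Component) { defer wg.Done(); c.Run(ctx) }(c)
	}

	// Shutdown, as described above:
	cancel() // 1. cancel the context shared by services and components

	// The HTTP service stops serving almost immediately, so /metrics
	// becomes unreachable from here on.
	srv.Shutdown(context.Background())

	// 2. Wait for everything to finish terminating. Slow components (like
	// the remote_write drain simulated above) keep the process alive.
	wg.Wait()

	// 3. Exit the process.
	log.Println("exiting")
}
```

During step 2 the agent is alive but unscrapable, which is exactly the window described above.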

Steps to reproduce

  1. Run Flow mode with a Prometheus metrics pipeline that sends metrics to an address/port combination where nothing is listening (forcing metrics to buffer in memory and sends to be retried)
  2. Request a graceful termination of the agent process.
  3. Observe that the HTTP server becomes unavailable while the process is still running, and that the process takes a noticeable amount of time before it finally terminates gracefully.

System information

N/A

Software version

v0.39.1

Configuration

No response

Logs

No response

rfratto added the bug (Something isn't working) label on Jan 24, 2024
@rfratto
Member Author

rfratto commented Jan 24, 2024

I'll also add that it's not obvious what a clean fix for this problem looks like. Maybe the Flow controller needs a dedicated scheduler just for services, to allow them to be terminated later?
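
As a rough illustration of that direction, components and services could be run by separate schedulers with independent contexts, so services are only stopped after every component has drained. All of the names below are invented; this is a sketch of the idea, not a concrete design for the controller:

```go
// Sketch only: two independently-stoppable groups, shut down in phases.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type Runnable interface{ Run(ctx context.Context) }

// group runs Runnables against its own context so it can be stopped
// independently of other groups.
type group struct {
	ctx    context.Context
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

func newGroup() *group {
	ctx, cancel := context.WithCancel(context.Background())
	return &group{ctx: ctx, cancel: cancel}
}

func (g *group) run(r Runnable) {
	g.wg.Add(1)
	go func() { defer g.wg.Done(); r.Run(g.ctx) }()
}

func (g *group) stop() {
	g.cancel()
	g.wg.Wait()
}

type fake struct {
	name  string
	drain time.Duration
}

func (f fake) Run(ctx context.Context) {
	<-ctx.Done()
	time.Sleep(f.drain) // simulate flushing buffered data after cancellation
	fmt.Println(f.name, "stopped")
}

func main() {
	components, services := newGroup(), newGroup()
	components.run(fake{name: "prometheus.remote_write", drain: 2 * time.Second})
	services.run(fake{name: "http service", drain: 0})

	// Two-phase shutdown: components drain first, and the HTTP service
	// stays up (and scrapable) until they are done.
	components.stop()
	services.stop()
}
```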

Contributor

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@rfratto
Member Author

rfratto commented Apr 11, 2024

Hi there 👋

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto transferred this issue from grafana/agent on Apr 11, 2024
@ptodev
Contributor

ptodev commented Apr 26, 2024

I'm also not sure what a clean solution would look like. It gets complicated when we consider a cluster of collectors doing a rolling restart. There are a few obvious options, each with drawbacks:

  • If the collector which is shutting down continues scraping, it may never be able to shut down.
  • On the other hand, if the shutting-down collector is allowed to stop scraping and remote write the rest of its samples, Mimir will report out-of-order errors when another collector is brought up to scrape in place of the first.
  • Alternatively, if the shutting-down collector is "killed" without allowing it to replay its WAL, then those samples will be lost.

Probably the only realistic solution is to "not fall behind" 😄. For users affected by this, in the short term it might be best to look into why the collector is taking so long to send its samples.

Some other, "cleaner", solutions could be:

  • If we have "replay the WAL" functionality, then a collector could shut down abruptly without remote writing everything from its WAL, and the new collector which starts in its place could pick up where the first left off. It would be interesting to see whether this works with clustering, where "ownership" of a series may move from one collector to another during those restarts.
  • Or, the collector which is starting up could hold off on remote writing until the shutting-down collector has finished.
