
Drop chunks at the beginning of persistence queue #1119

Closed
fabxc opened this Issue Sep 27, 2015 · 12 comments

@fabxc
Member

fabxc commented Sep 27, 2015

We just had a case of suspected disk failure. Chunks were piling up waiting for persistence. When the capacity was reached, no new chunks were created and ingestion essentially stopped.

As sample data becomes exponentially less important over time, the oldest chunks waiting for persistence should be dropped. This way the most recent data is at least in memory and the current state of the world remains queryable.
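
A rough sketch of the intended drop-oldest behavior, assuming a simplified bounded queue (the types and names below are purely illustrative, not the actual persistence queue):

```go
package main

import "fmt"

// chunk stands in for a chunk waiting to be persisted; the real type
// in storage/local carries far more state.
type chunk struct {
	seriesID int
}

// boundedQueue drops its oldest entry when full instead of blocking
// the producer, so the most recent data always stays in memory.
type boundedQueue struct {
	buf     []chunk
	maxSize int
}

// enqueue adds c, evicting and returning the oldest chunk if the
// queue is at capacity.
func (q *boundedQueue) enqueue(c chunk) (dropped *chunk) {
	if len(q.buf) == q.maxSize {
		oldest := q.buf[0]
		q.buf = q.buf[1:]
		dropped = &oldest
	}
	q.buf = append(q.buf, c)
	return dropped
}

func main() {
	q := &boundedQueue{maxSize: 2}
	for i := 0; i < 4; i++ {
		if d := q.enqueue(chunk{seriesID: i}); d != nil {
			fmt.Println("dropped unpersisted chunk of series", d.seriesID)
		}
	}
}
```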

@beorn7
Member

beorn7 commented Sep 28, 2015

I'd be very cautious here. "Sample data becomes exponentially less important over time" is a big assumption. Throwing data away to ingest new data could be very surprising to some users, and adverse to their use case. If implemented at all, this needs to be configurable via (yet another) flag.

In general, I'd set this to low priority, as you are usually in very bad shape anyway if you run into this situation. You are either consistently overloading your server or you have a broken disk (as in this case). Both cases will most likely trigger other problems anyway, so I doubt you would be able to continue monitoring based on the most recent data for much longer even with the change suggested here.

Whoever wants to implement this: it will break invariants of the storage model because it will create holes in the middle of a time series. A dropped chunk needs to get its chunk descriptor removed, too, and offsets (like the persist watermark and the chunk desc offset) need to be adjusted. Dropping chunks waiting to be persisted will most certainly render the time series "dirty" (triggering more frequent checkpointing, which doesn't work anyway because of the overloaded/broken disk... but then we are only trying to keep monitoring running for a bit longer here...).
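
To illustrate the kind of bookkeeping involved, here is a heavily simplified sketch; the struct and field names are assumptions loosely modeled on the storage layer, not the real memorySeries implementation:

```go
package storage

// chunkDesc stands in for the real chunk descriptor in storage/local.
type chunkDesc struct{}

// series is a heavily simplified stand-in for an in-memory series.
// The real implementation tracks far more state than shown here.
type series struct {
	chunkDescs       []*chunkDesc // descriptors for all chunks of the series
	persistWatermark int          // chunks before this index are persisted
	dirty            bool         // needs checkpoint/recovery attention
}

// dropOldestUnpersisted drops up to n of the oldest chunks that are
// still waiting for persistence. The chunk descriptors have to go as
// well, and the series must be marked dirty because it now has a hole
// in the middle that the usual invariants do not account for.
func (s *series) dropOldestUnpersisted(n int) {
	start := s.persistWatermark // first chunk not yet on disk
	if start+n > len(s.chunkDescs) {
		n = len(s.chunkDescs) - start
	}
	if n <= 0 {
		return
	}
	s.chunkDescs = append(s.chunkDescs[:start], s.chunkDescs[start+n:]...)
	s.dirty = true
}
```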

@fabxc
Member Author

fabxc commented Sep 28, 2015

If the persistence queue is full, Prometheus is already running in degraded mode. Something is broken, and either way we are going to lose data. I don't see how this breaks invariants. If samples are dropped a bit further ahead, it causes holes in the time series in the same manner.

I see no reason for this to be configurable. If the choice is between what my CPU load was 45 minutes ago and what it is now, for monitoring the answer is simply "now".

@juliusv
Member

juliusv commented Sep 28, 2015

Trying to resolve some confusion here, I guess @fabxc is talking about changing the behavior of this bit of code: https://github.com/prometheus/prometheus/blob/master/storage/local/storage.go#L540-L549

I.e. when samples haven't even been added to a chunk yet, but are waiting to be appended and are blocking newer values. Is that correct?

That piece of code could instead drop the current samples upon backlog and favor new ones.

I agree though that in this degraded mode, things are likely screwed up enough to not matter much either way.
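
A rough sketch of what that drop-on-backlog behavior could look like, assuming ingestion feeds a bounded channel (the names are made up for illustration, not the actual storage internals):

```go
package main

import "fmt"

// sample is a minimal stand-in for an ingested sample.
type sample struct {
	ts    int64
	value float64
}

// appendFavoringNew hands a sample to a bounded ingestion buffer.
// Instead of blocking when the buffer is full, it discards the oldest
// waiting sample so the most recent one always gets through.
func appendFavoringNew(buf chan sample, s sample, dropped *int) {
	for {
		select {
		case buf <- s:
			return
		default:
			select {
			case <-buf: // buffer full: drop the oldest queued sample
				*dropped++
			default:
			}
		}
	}
}

func main() {
	buf := make(chan sample, 2)
	var dropped int
	for i := 0; i < 5; i++ {
		appendFavoringNew(buf, sample{ts: int64(i), value: float64(i)}, &dropped)
	}
	fmt.Println("dropped samples:", dropped) // 3, with a buffer of 2
}
```

Dropping the oldest queued sample instead of the incoming one keeps the most recent data flowing, which matches the "favor new ones" idea above.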

@brian-brazil
Member

brian-brazil commented Sep 28, 2015

I share Beorn's concerns here.

@beorn7
Member

beorn7 commented Sep 28, 2015

The invariant I was speaking about is something the internal implementation of the storage layer depends on, not an external assumption about storage behavior. It's just a heads-up for whoever implements this that something has to change in many places to avoid creating a lot of inconsistencies.

@juliusv The code you have referenced is the code that appends the most recent sample. With the suggestion here, we would not block there but drop the oldest chunks of each time series that are waiting for persistence. Semantically, that's pretty clear. It's just that the internal implementation is not good at creating holes in the middle of a time series (while simply not letting samples enter the storage layer does not cause any trouble for the storage layer's consistency - it's not a "hole", it's more like "it has never happened").

But to be clear: I'm talking about two completely different aspects:

  1. The effort to implement this without introducing bugs. Certainly doable, but the effort needs to be weighed against the benefits when choosing to work on this compared to other features.
  2. The question of whether it is what the user wants. I'd be happy to have the behavior suggested here as the default, but I see the reasons stated above as important enough to make it configurable.

In any case, I believe we are discussing a fringe case of little relevance here.

@barkerd427

barkerd427 commented Sep 28, 2015

What does the HA story look like for Prometheus? This seems like an issue where this instance should be removed from load rather than adding code to handle a degradation more gracefully until failure. I'd rather have HA capabilities that can compensate for these issues than have some unknown number of samples missing. Of course, this may be entering the backup storage discussion again.

@beorn7
Member

beorn7 commented Sep 28, 2015

Yes, another reason why this is an issue of low relevance. To truly defend against single-host failure, you would run two Prometheis in parallel, scraping the same targets.

@brian-brazil
Member

brian-brazil commented Sep 28, 2015

The usual cause of failure is going to be too many samples, or rules that are too expensive. Running another identical Prometheus isn't going to help you there; the best you can do is keep an eye on the capacity of your monitoring and, when things go bad, detect the situation with metamonitoring and get a human in to fix it.

I'm not sure it's worth doing anything here.

@barkerd427

barkerd427 commented Sep 28, 2015

From an operations standpoint, it would be ideal to run two Prometheis with one as a hot standby. As long as I can determine that this failure is occurring, I can properly react to it.

@beorn7
Member

beorn7 commented Sep 28, 2015

As a reminder, this issue was triggered by a disk failure. That's the case I was referring to when describing the HA setup with 2 (or more) identically configured Prometheis.

@fabxc
Member Author

fabxc commented Sep 28, 2015

I agree that in the context of the implementation details this is not feasible. Had it just been about dequeuing a single object, I would have preferred that behavior, though.


@fabxc fabxc closed this Sep 29, 2015
