Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series lastSampleValue{,Set} is not set upon checkpoint recovery #2602

Closed
juliusv opened this Issue Apr 7, 2017 · 3 comments

Comments

Projects
None yet
2 participants
@juliusv
Copy link
Member

juliusv commented Apr 7, 2017

We do not set the series' lastSampleValue and lastSampleValueSet when recovering a series from the checkpoint upon startup: https://sourcegraph.com/github.com/prometheus/prometheus@acd72ae1a7bf62ef37f69830531666548ce40907/-/blob/storage/local/heads.go#L211:1-221:1

But we only do duplicate sample value detection correctly if this field is set: https://github.com/prometheus/prometheus/blob/master/storage/local/storage.go#L919-L920

Thus, when one restarts a Prometheus, a duplicate sample (that would normally be a noop) is now counted as a sample with equal timestamp, but different value.

/cc @beorn7

juliusv added a commit to cortexproject/cortex that referenced this issue Apr 7, 2017

Fix tracking of last sample time and value
To be totally correct, we would also need to set seriesLastSampleValue,
but there's no good way to get at it, and apparently we have had the
same problem after unarchiving series or after restoring from a
checkpoint at startup:
prometheus/prometheus#2602

But that is only a minor problem, as it only means that we will
misreport duplicate samples (same timestamp *and* same value) as
duplicate timestamp, but different value (error instead of silent noop).
Before trying to fix that here, I'd want to see what the reaction to the
upstream issue is, and whether that results in us adding an efficient
way to get the last sample value of a restored series in general.

Fixes #355

juliusv added a commit to cortexproject/cortex that referenced this issue Apr 7, 2017

Fix tracking of last sample time and value
This fixes the tracking of the last sample time and value, which is
needed to detect duplicate or out-of-order samples upon ingestion.

To be totally correct, we would also need to set seriesLastSampleValue,
but there's no good way to get at it, and apparently we have had the
same problem after unarchiving series or after restoring from a
checkpoint at startup:
prometheus/prometheus#2602

But that is only a minor problem, as it only means that we will
misreport duplicate samples (same timestamp *and* same value) as
duplicate timestamp, but different value (error instead of silent noop).
Before trying to fix that here, I'd want to see what the reaction to the
upstream issue is, and whether that results in us adding an efficient
way to get the last sample value of a restored series in general.

So for now, this just removes the "lastSampleValueSet = true" after a
transfer from a different ingester, because that is totally a lie.

Fixes #355

juliusv added a commit to cortexproject/cortex that referenced this issue Apr 7, 2017

Fix tracking of last sample time and value
This fixes the tracking of the last sample time and value, which is
needed to detect duplicate or out-of-order samples upon ingestion.

To be totally correct, we would also need to set seriesLastSampleValue,
but there's no good way to get at it, and apparently we have had the
same problem after unarchiving series or after restoring from a
checkpoint at startup:
prometheus/prometheus#2602

But that is only a minor problem, as it only means that we will
misreport duplicate samples (same timestamp *and* same value) as
duplicate timestamp, but different value (error instead of silent noop).
Before trying to fix that here, I'd want to see what the reaction to the
upstream issue is, and whether that results in us adding an efficient
way to get the last sample value of a restored series in general.

So for now, this just removes the "lastSampleValueSet = true" after a
transfer from a different ingester, because that is totally a lie.

Fixes #355

juliusv added a commit to cortexproject/cortex that referenced this issue Apr 7, 2017

Fix tracking of last sample time and value
This fixes the tracking of the last sample time and value, which is
needed to detect duplicate or out-of-order samples upon ingestion.

To be totally correct, we would also need to set seriesLastSampleValue
in the transfer receiver code, but there's no good way to get at it, and
apparently we have had the same problem after unarchiving series or
after restoring from a checkpoint at startup:
prometheus/prometheus#2602

But that is only a minor problem, as it only means that we will
misreport duplicate samples (same timestamp *and* same value) as
duplicate timestamp, but different value (error instead of silent noop).
Before trying to fix that here, I'd want to see what the reaction to the
upstream issue is, and whether that results in us adding an efficient
way to get the last sample value of a restored series in general.

So for now, this just removes the "lastSampleValueSet = true" after a
transfer from a different ingester, because that is totally a lie.

Fixes #355
@beorn7

This comment has been minimized.

Copy link
Member

beorn7 commented Apr 7, 2017

Yes, that was deliberate back then, as we didn't want to change the checkpoint format just to enable this check. Explicitly retrieving the last sample from the last chunk of each series would be quite expensive and increase startup time. (At least that was the rationale back then.)

tomwilkie added a commit to cortexproject/cortex that referenced this issue Apr 8, 2017

Fix tracking of last sample time and value (#393)
This fixes the tracking of the last sample time and value, which is
needed to detect duplicate or out-of-order samples upon ingestion.

To be totally correct, we would also need to set seriesLastSampleValue
in the transfer receiver code, but there's no good way to get at it, and
apparently we have had the same problem after unarchiving series or
after restoring from a checkpoint at startup:
prometheus/prometheus#2602

But that is only a minor problem, as it only means that we will
misreport duplicate samples (same timestamp *and* same value) as
duplicate timestamp, but different value (error instead of silent noop).
Before trying to fix that here, I'd want to see what the reaction to the
upstream issue is, and whether that results in us adding an efficient
way to get the last sample value of a restored series in general.

So for now, this just removes the "lastSampleValueSet = true" after a
transfer from a different ingester, because that is totally a lie.

Fixes #355
@juliusv

This comment has been minimized.

Copy link
Member Author

juliusv commented Apr 8, 2017

Ah, good to know. I bet we're not aiming to fix this before the new storage then, so closing.

@juliusv juliusv closed this Apr 8, 2017

@lock

This comment has been minimized.

Copy link

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.