
write error, out-of-order series added with label set #5617

Closed
whaike opened this issue May 31, 2019 · 50 comments

Comments
@whaike commented May 31, 2019

1. Prometheus 2.6.0 (in Docker), remote_write + InfluxDB (in Docker).
2. Prometheus kept running out of memory (OOM). Using the query `count by (__name__)({__name__=~".+"}) > 10000` I found the high-cardinality metric node_cpu_seconds_total.
3. I stopped Prometheus, logged in to InfluxDB, ran `drop measurement node_cpu_seconds_total`, and added the following to prometheus.yml (see the sketch after this list for the surrounding scrape config):

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'node_cpu_seconds_total'
        action: drop

4. I started Prometheus and saw:

    prometheus | level=error ts=2019-05-31T08:36:52.501Z caller=db.go:363 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{}\""

5. I updated Prometheus to version 2.10.0; still the same error.
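For reference, a minimal sketch of where such a rule sits in a full scrape configuration; the job name and target below are placeholders, not taken from this setup:

    scrape_configs:
      - job_name: node                          # placeholder job name
        static_configs:
          - targets: ['node-exporter:9100']     # placeholder target
        metric_relabel_configs:
          # Drop the high-cardinality metric before it is written to the TSDB.
          - source_labels: [__name__]
            regex: 'node_cpu_seconds_total'
            action: drop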


@wu0407 commented Sep 6, 2019

Using metric_relabel_configs to drop a label gives the same error:

level=error ts=2019-09-06T06:17:20.778Z caller=db.go:561 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{}\""

metric_relabel_configs:

  - regex: "^instance$"
    action: labeldrop

@wu0407 commented Oct 5, 2019

Also, --storage.tsdb.retention.time=10d does not work: the old data is not cleaned up.

@dolgovas commented Oct 15, 2019

I have the same error with action: labeldrop in metric_relabel_configs.

@dolgovas commented Oct 16, 2019

I just updated to the latest version:

# prometheus --version
prometheus, version 2.13.0 (branch: HEAD, revision: 6ea4252299f542669aca11860abc2192bdc7bede)
  build user:       root@f30bdad2c3fd
  build date:       20191004-11:25:34
  go version:       go1.13.1

The error has not gone away.

@dolgovas

@whaike have you fixed this problem?

@pvanderlinden

I have this error as well. Is there any solution?

@roidelapluie (Member)

@pvanderlinden this is probably an issue with your relabel config. Can you reach out to our prometheus-users mailing list with more details about your config and Prometheus version?

Thanks!

@roidelapluie (Member)

It looks like at some point your metric relabel config produced completely empty labelsets.

It is hard to know which values produced that, because there are no labels left ... and we cannot use PromQL against empty labelsets.

I have opened #6891 so that if that happens, at least the target will be seen as down.
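As a hypothetical illustration (not a config from this thread): in metric_relabel_configs a labeldrop regex that is too broad also matches __name__, so a series with no other labels ends up as the empty labelset {}. Relabel regexes are fully anchored, so naming the exact label avoids this:

    metric_relabel_configs:
      # Too broad: matches every label, including __name__,
      # so a series with no other labels ends up as {}.
      - regex: '.*'
        action: labeldrop

      # Safer: match only the one label you mean to drop.
      - regex: 'instance'
        action: labeldrop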

@roidelapluie (Member)

@brian-brazil Do you want me to add:

diff --git a/tsdb/compact.go b/tsdb/compact.go
index 70aa9eee0..ab88983dd 100644
--- a/tsdb/compact.go
+++ b/tsdb/compact.go
@@ -740,6 +740,12 @@ func (c *LeveledCompactor) populateBlock(blocks []BlockReader, meta *BlockMeta,
 		}
 
 		lset, chks, dranges := set.At() // The chunks here are not fully deleted.
+
+		// Previous versions of Prometheus allowed ingesting empty labelsets.
+		if len(lset) == 0 {
+			continue
+		}
+
 		if overlapping {
 			// If blocks are overlapping, it is possible to have unsorted chunks.
 			sort.Slice(chks, func(i, j int) bool {

with #6891? Or should we recommend that users delete their WAL?

@pvanderlinden

I might have made a mistake in the labeldrop config, which I fixed later on.

Is there currently no workaround other than deleting the WAL? And if I delete the WAL, will I lose all data from when the issue first happened?
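As I understand it, the workaround would be something like the following, assuming a systemd-managed Prometheus and the default data directory layout; the path and service name are placeholders, and anything in the WAL that has not yet been compacted into a block would be lost:

    # Stop Prometheus so the WAL is no longer being written to.
    systemctl stop prometheus
    # Remove the write-ahead log under the TSDB data directory.
    rm -rf /var/lib/prometheus/data/wal
    systemctl start prometheus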

@pvanderlinden

Unfortunately this started to cause big problems and I had to dump the WAL, losing all data from the last week. It would be good to address this issue once and for all; a configuration error should not corrupt the datastore.

@roidelapluie (Member) commented Mar 3, 2020

@pvanderlinden v2.17.0 will not let you ingest those bad series in the first place: #6891.

@pvanderlinden

I had just hoped for an answer to my question about whether there was another option than deleting the WAL, and what I would lose if I did that. But thanks for the update 👍

@roidelapluie (Member)

The WAL issues will be discussed tomorrow in the community call.

@pvanderlinden

OK, unfortunately I can't join (and I already had to remove the data in production to be able to access Prometheus again).

@pvanderlinden

Unfortunately the error is still there even after removing the suspected wrong labeldrop config. Also the new version is not out, so it is still dumping all the data into something I need to delete. Is there any way to figure out where it's coming from?

@roidelapluie (Member) commented Mar 4, 2020 via email

@pvanderlinden

> On 04 Mar 07:49, Paul van der Linden wrote: Unfortunately the error is still there even after removing the suspected wrong labeldrop config. Also the new version is not out, so it is still dumping all the data into something I need to delete. Is there any way to figure out where it's coming from?
>
> Can you share your full config? Thanks
>
> -- Julien Pivotto, Open-Source Consultant, Inuits - https://www.inuits.eu

Sure:
config.yaml.txt

@Misosooup

We are experiencing the same issue as well. We had to dump the WAL and lose all the data from the last 3 days.

@roidelapluie (Member) commented Mar 16, 2020 via email

@Misosooup commented Mar 16, 2020

Sure.

prometheus.yaml.txt

[screenshot: CloudWatch Management Console, ap-southeast-2]

When this occurred, the WAL stopped getting written into blocks and we got a backlog of WAL segments. I have tried deleting the series using the delete-series API, but that didn't work either. This was pretty random, as the metric is different between staging and production.

I have updated to 2.17-rc0 as of yesterday.
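For reference, the delete-series API mentioned above is the TSDB admin endpoint, typically invoked along these lines, assuming Prometheus was started with --web.enable-admin-api (the matcher is a placeholder):

    # Mark matching series for deletion in the TSDB (-g turns off curl's URL globbing).
    curl -g -X POST \
      'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=some_metric{job="example"}'
    # Deleted data is only removed from disk once tombstones are cleaned.
    curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'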

@roidelapluie (Member)

Can you test your config with 2.17.0-rc.4?

@roidelapluie (Member)

> Sure.
>
> prometheus.yaml.txt
>
> [screenshot: CloudWatch Management Console, ap-southeast-2]
>
> When this occurred, the WAL stopped getting written into blocks and we got a backlog of WAL segments. I have tried deleting the series using the delete-series API, but that didn't work either. This was pretty random, as the metric is different between staging and production.
>
> I have updated to 2.17-rc0 as of yesterday.

Wait, I did not notice that this time the labelset is NOT empty.

@roidelapluie (Member)

Can you please tell us if you still see this issue with Prometheus 2.17.2? Thanks.

@Misosooup commented May 5, 2020

> Can you please tell us if you still see this issue with Prometheus 2.17.2? Thanks.

For me, I don't see this anymore in 2.17.2.

@codesome (Member) commented Jul 8, 2020

It can be reproduced by running this test (remove the t.Skip() before running; run it multiple times if it doesn't fail on the first attempt):

func TestHeadCompactionRace(t *testing.T) {

I plan to investigate this next week.
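For anyone trying it locally, the invocation is roughly the following from the repository root after removing the t.Skip(); the -race and -count flags are just one way to shake the race out:

    go test ./tsdb/ -run TestHeadCompactionRace -race -count=20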

@roidelapluie (Member)

@codesome your test is indeed failing, but it means that somehow we try to compact time series which have not been committed yet, therefore potentially breaking isolation in head compaction?

@beorn7 it seems that we have no isolation test with a head compaction in the middle. I am not familiar with head compaction. Can it compact recent data (< 5 min)?

@mohitevishal

We hit this issue with version 2.19.2.

$ prometheus --version
prometheus, version 2.19.2 (branch: HEAD, revision: c448ada63d83002e9c1d2c9f84e09f55a61f0ff7)
  build user:       root@dd72efe1549d
  build date:       20200626-09:02:20
  go version:       go1.14.4

We started seeing this issue yesterday, and Prometheus has been generating the compaction failed error ever since. It is also increasing memory usage, eventually causing the pod to go OOM.

On restart we see the following errors:


level=info ts=2020-07-09T03:43:15.568Z caller=head.go:645 component=tsdb msg="Replaying WAL and on-disk memory mappable chunks if any, this may take a while"
level=error ts=2020-07-09T03:43:18.425Z caller=head.go:650 component=tsdb msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 186078539"
level=info ts=2020-07-09T03:43:18.425Z caller=head.go:745 component=tsdb msg="Deleting mmapped chunk files"
level=info ts=2020-07-09T03:43:18.425Z caller=head.go:748 component=tsdb msg="Deletion of mmap chunk files failed, discarding chunk files completely" err="cannot handle error: iterate on on-disk chunks: out of sequence m-mapped chunk for series ref 186078539"
level=info ts=2020-07-09T03:43:46.449Z caller=head.go:682 component=tsdb msg="WAL checkpoint loaded"

The Prometheus pod restarts many times; here is a snippet of the last logs before it terminates. Note that the label set is not empty.

level=info ts=2020-07-09T04:18:20.170Z caller=main.go:646 msg="Server is ready to receive web requests."
ts=2020-07-09T04:18:20.672Z caller=dedupe.go:112 component=remote level=info remote_name=d1086f url=http://prometheus-storage-adapter.mynamespace.svc.cluster.local/receive msg="Replaying WAL" queue=d1086f
level=error ts=2020-07-09T04:19:36.646Z caller=federate.go:192 component=web msg="federation failed" err="write tcp 172.34.1.15:9090->172.34.2.1:42662: write: broken pipe"
level=error ts=2020-07-09T04:20:05.000Z caller=federate.go:192 component=web msg="federation failed" err="write tcp 172.34.1.15:9090->172.34.3.1:34092: write: broken pipe"
level=error ts=2020-07-09T04:20:22.696Z caller=db.go:675 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"ALERTS\\\", alertname=\\\"PodFrequentlyRestarting\\\", alertstate=\\\"firing\\\", alerttype=\\\"namespace\\\", clustername=\\\"staging\\\", container=\\\"mycontainer\\\", instance=\\\"172.34.4.77:8443\\\", job=\\\"kube-state-metrics\\\", namespace=\\\"other-namespace\\\", node=\\\"10.20.0.12\\\", pod=\\\"mycontainer-6b65bd93d5-sctp2\\\", resourcekey=\\\"pod\\\", severity=\\\"high\\\"}\""

I've deleted the pod mycontainer-6b65bd93d5-sctp2, assuming it might be causing issues, but that hasn't changed anything; neither has cleaning up tmp files.

Like others, our WAL is filling up and blocks are not getting written.

Is there any way to clear up just the corrupt portion of the data without losing other data that hasn't been written to a block yet? Has anyone else seen this issue in 2.19.2?

Any help is really appreciated. Thanks!

@mohitevishal commented Jul 10, 2020

After cleaning up the wal dir and restarting Prometheus, the out-of-order series error is gone and Prometheus is scraping metrics again, but I also see the following error in the logs:

level=info ts=2020-07-10T00:27:16.928Z caller=head.go:792 component=tsdb msg="Head GC completed" duration=2.917190228s
level=error ts=2020-07-10T00:27:16.928Z caller=db.go:675 component=tsdb msg="compaction failed" err="head truncate failed (in compact): truncate chunks.HeadReadWriter: maxt of the files are not set"

@roidelapluie @codesome Do you guys have any advice?

@Misosooup

We had this error as well in the past, but after following a strict process our Prometheus has been running with 0 corrupted blocks. That has lasted for 4 months now in production. We keep 40 days of TSDB blocks stored on NFS as long-term storage (I know it's risky and Prometheus does not support NFS well).

We also upgraded from 2.17 to 2.19 with no issues.

Do you have two Prometheus instances running and writing to the same file system?

This was the case for us at the start: trying to ensure that Prometheus has no downtime, we ended up allowing two Prometheus instances to write to the same NFS, which caused a lot of compaction errors.

The lock file is really important as well, since different Prometheus instances index metrics differently, so you have to make sure that no two Prometheus instances are running at the same time, even when restarting the container.

The downside of this is that we cannot scale Prometheus at this time, hence we are looking into Thanos and Cortex.

@mohitevishal

@Misosooup thank you so much for the tips! We too use NFS for storage and have over 6 months' worth of retention, and this is the first time we have seen this error. Ours is a multi-tenant environment: we scrape metrics for all our tenants, though lately we've allowed some tenants to run their own Prometheus pods in their respective namespaces to scrape specific metrics for their workloads. To your point about running multiple Prometheus instances: although they don't write to the same filesystem, I guess we need to look into whether this affects our use case in any way.

To update on the "maxt of the files are not set" error: it went away for a few minutes, the files from the WAL were written to a block, and a few minutes later it was back again.

@mohitevishal

Just to update on the issue we were facing: the compaction errors, both out-of-order series and maxt of the files are not set, as well as the out of sequence m-mapped chunk for series error, are gone with v2.18.1.

Not sure if these 2 commits (d4b9fe8, f4dd456), which were part of v2.19.0 and up, introduced these issues. I tried v2.19.2 and v2.19.1 and the errors persisted on both versions, even after cleaning up the wal dir. v2.18.1 seemed to solve all these issues.

@beorn7 (Member) commented Jul 14, 2020

Note that the original issue filed here is quite old, but we are now talking about an issue apparently introduced in 2.19.

@codesome What's the state of your investigation? It would be great to get a fix in before we release 2.20.

@mohitevishal

Yes, it seems this got introduced with 2.19. Let me know if you folks need a separate issue for this. Thanks!

@codesome (Member)

> @codesome What's the state of your investigation? It would be great to get a fix in before we release 2.20.

Halted, and I won't be able to look at it before 2.20. This issue is old and was not introduced in 2.19, as the TSDB changes in 2.19 do not interact with this part.

> Just to update on the issue we were facing: the compaction errors, both out-of-order series and maxt of the files are not set, as well as the out of sequence m-mapped chunk for series error, are gone with v2.18.1.

The maxt of the files are not set and out of sequence m-mapped chunk errors are unrelated to this issue; they are a different problem and need investigation. Can you open issues for them (separate issues for both) with logs? Thanks!

@beorn7 (Member) commented Jul 15, 2020

Thanks, @codesome.

@mohitevishal it would be great indeed if you could file separate issues with all your current state, one each for the maxt of the files are not set issue and the out of sequence m-mapped chunk for series issue.

@roidelapluie (Member)

@codesome Yet another user has been reporting this error: #7679. It would be great to find the cause.

@brian-brazil (Contributor)

#7679 is using EFS, which is not supported, as are other users above.

@apr1809 commented Jul 28, 2020

> #7679 is using EFS, which is not supported, as are other users above.

@brian-brazil As per the Prometheus documentation, non-POSIX-compliant filesystems are not supported by Prometheus's local storage. But EFS is POSIX compliant.

@brian-brazil (Contributor)

EFS is NFS, and NFS is not supported. Multiple users have reported issues which turned out to be related to using EFS.

@trallnag

Just ran into the same problem, most probably due to NFS. I guess I'll move my Prometheus stack to a dedicated single-node ECS cluster so that I can use EBS volumes instead, or look into Prometheus remote read. Thanos is just too big for my current use case.

@surajnarwade

I faced the same issue, but it was intermittent; after restarting Prometheus, the error was gone.

prometheus version: 2.23.0

error:

level=error ts=2020-12-01T10:13:38.791Z caller=head.go:650 component=tsdb msg="Loading on-disk chunks failed" err="iterate on on-disk chunks: corruption in head chunk file /prometheus/chunks_head/043704: out of sequence m-mapped chunk for series ref 1634617"


@hf-cjharries commented Dec 17, 2020

> EFS is NFS, and NFS is not supported. Multiple users have reported issues which turned out to be related to using EFS.

@brian-brazil Trying to get some clarification. Elsewhere the docs say POSIX compliance is the important thing. Here you're saying that POSIX compliance doesn't actually matter if it's some sort of network filesystem (even if it's not NFS specifically). I know this sounds pedantic, but if you confirm POSIX compliance doesn't matter, I'll update the language in the docs with a PR to say something like "local-only POSIX-compliant." My team has been going in circles on this.

@brian-brazil (Contributor)

As far as we are aware, we require POSIX compliance, which exceedingly few network filesystems have. At this point we've had reports of basically every form of networked filesystem causing an issue for someone - however, that's not to say that some NFS implementation couldn't actually be sufficiently correct to not cause issues (and I suspect at least one is).

If you can show that some networked filesystem is not disobeying POSIX semantics or otherwise doing odd stuff (e.g. creating files without being asked to) but still having issues, then we should update the wording accordingly.

@hf-cjharries

I really appreciate the blazing fast reply! What kind of metrics would be useful to you, one way or another? We're specifically looking at EFS, which, as several issues here and the newest docs call out, is not supported. However, their copy says they're POSIX-compliant. I see two easy solutions:

  1. If you have some metrics we can kick over to AWS to have them prove it works, we'll do that (although who knows the turnaround). If they pass, we'll submit a PR on the docs that covers that.
  2. We can submit a PR that changes the language to something like I mentioned above. Again, it's pedantic, but so are devs. There are a ton (well, at least three) threads on reddit and SO that waffle on this too, but they all predate the doc PR you worked on that explicitly calls out AWS EFS.

Again, thanks for the response! I really appreciate the extra details.

@brian-brazil (Contributor)

We know that EFS doesn't work with Prometheus, based on numerous user reports. Seeing exactly what is going wrong in syscall terms would be good.
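As a sketch, one way to capture that in syscall terms is to trace the file-related syscalls Prometheus makes against the mounted volume, e.g. with strace; the PID and output path below are placeholders:

    # Follow forks, timestamp each call, and trace file- and descriptor-related syscalls.
    strace -f -tt -e trace=file,desc -o /tmp/prometheus-syscalls.log -p <prometheus-pid>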

EFS claiming POSIX compliance is also new to me; previously it was only claimed for permissions.

In any case, this is the wrong issue to discuss any findings as this is about bad data resulting from a bug in Prometheus which is fixed.

@brian-brazil (Contributor)

@brokencode64 That is not what this issue is about, if you can reproduce on the latest version please open a new issue.

@brian-brazil (Contributor)

We looked at this in the bug scrub. This relates to a fixed bug, and the issue was kept open in case we wished to do something about bad blocks produced by that bug. Given the lack of confirmed reports in supported settings in over a year, such handling does not seem required. In addition, this issue has been derailed by unrelated TSDB support questions. Accordingly, we're going to close this.

prometheus locked as resolved and limited conversation to collaborators on Nov 16, 2021