write error, out-of-order series added with label set #5617
Using `metric_relabel_configs` to drop a label gives the same error.
Also, `--storage.tsdb.retention.time=10d` does not work; the old data is not cleaned up.
I have the same error.
I just updated to the latest version, but the error is not going away ((
@whaike have you fixed this problem?
I have this error as well. Is there any solution?
@pvanderlinden this is probably an issue with your relabel config. Can you share it? Thanks!
It looks like at some point your metric relabel config produced completely empty labelsets. It is hard to know which values produced that because there are no labels left... and we cannot use PromQL against empty labelsets. I have opened #6891 so that if that happens, at least the target will be seen as down.
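For illustration, a minimal sketch of a relabel rule that can end up producing an empty labelset (the rule shown is hypothetical, not the reporter's actual config):

```yaml
# Hypothetical metric_relabel_configs entry: labeldrop removes every
# label whose *name* matches the regex. If the regex matches all labels
# of a series (including __name__), the series is left with the empty
# labelset {} — which is what got written into the TSDB here.
metric_relabel_configs:
  - regex: ".*"
    action: labeldrop
```

Newer Prometheus versions (see #6891) reject such empty labelsets at scrape time instead of ingesting them.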
@brian-brazil Do you want me to add:

```diff
diff --git a/tsdb/compact.go b/tsdb/compact.go
index 70aa9eee0..ab88983dd 100644
--- a/tsdb/compact.go
+++ b/tsdb/compact.go
@@ -740,6 +740,12 @@ func (c *LeveledCompactor) populateBlock(blocks []BlockReader, meta *BlockMeta,
 		}
 		lset, chks, dranges := set.At() // The chunks here are not fully deleted.
+
+		// Previous versions of Prometheus allowed ingesting empty labelsets.
+		if len(lset) == 0 {
+			continue
+		}
+
 		if overlapping {
 			// If blocks are overlapping, it is possible to have unsorted chunks.
 			sort.Slice(chks, func(i, j int) bool {
```

with #6891? Or should we recommend users to delete their WAL?
I might have made a mistake in the labeldrop config, which I fixed later on. Is there no workaround for this currently, besides deleting the WAL? And if I delete the WAL, will I lose all data from when the issue first happened?
Unfortunately this started to cause big problems and I had to dump the WAL and lose all data from the last week. It would be good to address this issue once and for all; a configuration error should not corrupt the datastore.
@pvanderlinden v2.17.0 will not allow you to ingest such series in the first place: #6891.
I just hoped there was an answer to my question whether there was another option than deleting the WAL, and what I would lose if I did that. But thanks for the update 👍
The WAL issues will be discussed tomorrow in the community call. |
OK, unfortunately I can't join (and I already had to remove the data on production to be able to access Prometheus again).
Unfortunately the error is still there even after removing the suspected wrong labeldrop config. Also the new version is not out, so it is still dumping all the data into something I need to delete. Is there any way to figure out where it's coming from? |
Can you share your full config?
Thanks
Sure.
We are experiencing the same issue as well. We had to dump the WAL and lost all the data from the last 3 days.
Can you share your relabel configs?
Sure. When this occurred, the WAL stopped getting written into blocks and we got a backlog of them. I tried deleting the series using the delete-series API, but that didn't work either. This was pretty random, as the metric differs between staging and production. I updated to 2.17.0-rc.0 as of yesterday.
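For reference, a sketch of the delete-series call (assuming Prometheus was started with `--web.enable-admin-api`; the matcher below is illustrative, not the reporter's actual series):

```sh
# Delete samples for the matched series via the TSDB admin API
# (illustrative matcher — substitute the offending series).
curl -X POST \
  'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=node_cpu_seconds_total'

# Deleted data is only tombstoned; remove it from disk explicitly:
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'
```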
Can you test your config with 2.17.0-rc.4?
Wait, I did not notice that this time the labelset is NOT empty.
Can you please tell us if you still see this issue with Prometheus 2.17.2? Thanks. |
For me, I don't see this anymore in 2.17.2.
It can be reproduced by running this test (remove line 1879 in 30505a2).
I plan to investigate this next week. |
@codesome your test is indeed failing, but it means that somehow we try to compact time series which have not been committed yet, therefore potentially breaking isolation in head compaction? @beorn7 it seems that we have no isolation test with a head compaction in the middle. I am not familiar with head compaction. Can it compact recent data (< 5 min)?
We hit this issue with version 2.19.2.
Started seeing this issue yesterday and Prometheus has been generating errors. On restart we see the following error:
And the Prometheus pod restarts many times. Here is the error snippet from the last logs before it terminates; note that the label set is not empty.
I've deleted the pod. Like others, the WAL is filling up and blocks are not getting written. Is there any way to clear up just the corrupt data portion without losing other data that hasn't been written to a block? Has anyone else seen this issue in 2.19.2? Any help is really appreciated. Thanks!
After cleaning up the WAL dir and restarting Prometheus, the out-of-order series error is gone and Prometheus is scraping metrics again, but I also see the following error in the logs:
@roidelapluie @codesome Do you have any advice?
We had this error as well in the past, but after following a strict process, our Prometheus has been running with 0 corrupted blocks. This has lasted for 4 months now in production. We have 40 days of TSDB blocks stored in NFS as long-term storage (I know it's risky and Prometheus does not support NFS too well). Upgraded from 2.17 to 2.19 with no issues as well. Do you have 2 Prometheus instances running and writing to the same file system? This was the case for us at the start: as we were trying to ensure that Prometheus has no downtime, we ended up allowing 2 Prometheus instances to write to the same NFS, which caused a lot of compaction errors. The lock file is really important as well, and different Prometheus instances index metrics differently, so you have to make sure that no 2 Prometheus instances are running at the same time, even while restarting the container. The downside of this is that we cannot scale Prometheus at this time, hence we are looking into Thanos and Cortex.
@Misosooup thank you so much for the tips! We too use NFS for storage and have over 6 months' worth of retention. This is the first time we saw this error. Ours is a multi-tenancy environment; we scrape metrics for all our tenants, though of late we've allowed some of our tenants to run their own Prometheus pods in their respective namespaces to scrape specific metrics for their workloads. To your point about running multiple Prometheus instances: they don't write to the same filesystem, but I guess we need to look into whether this affects us with our use case in any way.
Just to update on the issue we were facing with the compaction errors: not sure if these 2 commits (d4b9fe8, f4dd456), which were part of 2.19, introduced these issues.
Note that the original issue filed here is quite old, but we are now talking about an issue apparently introduced in 2.19. @codesome What's the state of your investigation? It would be great to get a fix in before we release 2.20. |
Yes, it seems this got introduced with 2.19. Let me know if you folks need a separate issue for this. Thanks!
Halted; I won't be able to look at it before 2.20. This issue is old and not introduced in 2.19, as the TSDB changes in 2.19 do not interact with this part.
The `maxt` of files not set and out-of-sequence m-mapped chunk errors are unrelated to this issue; they are different issues and need investigation. Can you open separate issues for both of them, with logs? Thanks!
Thanks, @codesome. @mohitevishal it would be great indeed if you could file separate issues with all your current state, one each for the two errors mentioned above.
#7679 is using EFS, which is not supported, as are other users above.
@brian-brazil As per the Prometheus documentation, non-POSIX-compliant filesystems are not supported by Prometheus's local storage. But EFS is POSIX compliant.
EFS is NFS, and NFS is not supported. Multiple users have reported issues which turned out to be related to using EFS. |
Just ran into the same problem, most probably due to NFS. I guess I'll move my Prometheus stack to a dedicated single-node ECS cluster so that I can use EBS volumes instead, or look into Prometheus remote read. Thanos is just too big for my current use case.
I faced the same issue, but it was intermittent; after restarting Prometheus, the error was gone. Prometheus version: 2.23.0. Error:
@brian-brazil Trying to get some clarification. Elsewhere the docs say POSIX compliance is the important thing. Here you're saying that POSIX compliance doesn't actually matter if it's some sort of network filesystem (even if it's not NFS specifically). I know this sounds pedantic, but if you confirm POSIX compliance doesn't matter, I'll update the language in the docs with a PR to say something like "local-only POSIX-compliant." My team has been going in circles on this.
As far as we are aware, we require POSIX compliance, which exceedingly few network filesystems have. At this point we've had reports of basically every form of networked filesystem having an issue for someone; however, that's not to say that e.g. some NFS implementation may actually be sufficiently correct not to cause issues (and I suspect at least one is). If you can show that some networked filesystem is not disobeying POSIX semantics or otherwise doing odd stuff (e.g. creating files without being asked to) but is still having issues, then we should update the wording accordingly.
I really appreciate the blazing fast reply! What kind of metrics would be useful to you, one way or another? We're specifically looking at EFS, which, as several issues here and the newest docs call out, is not supported. However, their copy says they're POSIX-compliant. I see two easy solutions:
Again, thanks for the response! I really appreciate the extra details. |
We know that EFS doesn't work with Prometheus, based on numerous user reports. Seeing exactly what is going wrong in syscall terms would be good. EFS claiming POSIX compliance is also new to me; previously it was only claimed for permissions. In any case, this is the wrong issue to discuss any findings, as this is about bad data resulting from a bug in Prometheus which is fixed.
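As a starting point for the syscall-level evidence mentioned above, a sketch using strace on Linux (the PID lookup and syscall filter are illustrative; adjust to taste):

```sh
# Attach to a running Prometheus and log file-related syscalls,
# following child threads. Failures (EIO, ESTALE, etc.) on the
# networked mount would be the interesting signal here.
strace -f -tt -e trace=file,fsync,fdatasync,write \
  -p "$(pidof prometheus)" -o /tmp/prometheus-syscalls.log
```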
@brokencode64 That is not what this issue is about, if you can reproduce on the latest version please open a new issue. |
We looked at this in the bug scrub. This relates to a fixed bug, and this issue was kept open in case we wished to do something about bad blocks produced by that bug. Given the lack of confirmed reports in supported settings in over a year, such handling does not seem required. In addition, this issue has been derailed by unrelated TSDB support questions. Accordingly, we're going to close this.
1. Prometheus 2.6.0 (in Docker), remote_write + InfluxDB (in Docker).
2. Prometheus always OOMs. Using `count by (__name__)({__name__=~".+"}) > 10000` we found the metric `node_cpu_seconds_total`.
3. Stopped Prometheus, logged into InfluxDB, ran `drop measurement node_cpu_seconds_total`, and in prometheus.yaml added a `metric_relabel_configs:` drop rule (sketched below).
4. Started Prometheus and saw:

   ```
   prometheus | level=error ts=2019-05-31T08:36:52.501Z caller=db.go:363 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set "{}""
   ```

5. Updated Prometheus to version 2.10.0; still the same error.
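For completeness, a hedged sketch of what the truncated relabel rule in step 3 presumably looked like — the exact config was not included in the report:

```yaml
# Hypothetical reconstruction: drop the high-cardinality metric by name.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: node_cpu_seconds_total
    action: drop
```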