
query shows bogus results #3776

Closed
auhlig opened this Issue Feb 1, 2018 · 18 comments

auhlig commented Feb 1, 2018

What did you do?
We're running two Prometheus servers in a federation setup: a "collector" and a "frontend". The collector scrapes the metrics from, in this case, cAdvisor. The frontend federates aggregated data from the collector (config). While querying for container_cpu_system_seconds_total on the collector returns the expected results, the same query on the frontend also shows other metrics. See the screenshot below.
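
(For illustration only: the series the collector exposes for a given selector can be checked directly against its /federate endpoint; the host below is a placeholder, not our actual address.)

# Hypothetical check against the collector's federation endpoint;
# collector.example.com:9090 is a placeholder.
curl -G 'http://collector.example.com:9090/federate' \
  --data-urlencode 'match[]=container_cpu_system_seconds_total' | head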

What did you expect to see?
only container_cpu_system_seconds_total metrics

What did you see instead? Under which circumstances?
As shown in the screenshot below, the query for container_cpu_system_seconds_total also shows unrelated metrics.
(screenshot: query)

Environment

  • System information:

Linux 4.14.11-coreos x86_64

  • Prometheus version:

prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
build user: root@6e784304d3ff
build date: 20180119-12:01:23
go version: go1.9.2

  • Prometheus configuration file:

Can be found here.

brian-brazil (Member) commented Feb 1, 2018

That's not good. What versions of Prometheus have been run against the same storage?

RichiH (Member) commented Feb 1, 2018

Unlikely, but do you have a proxy or cache in front of the Prometheus instance you're querying?

auhlig (Author) commented Feb 1, 2018

The same storage was used by the previous version, Prometheus 2.0.0. We have this setup in 16 regions, but only one of them showed the bogus results above. No proxy or cache is involved.

brian-brazil (Member) commented Feb 1, 2018

Are there any indications of disk data corruption on the machine? We have checksums, so we should catch that, but you might have gotten really unlucky.

auhlig (Author) commented Feb 1, 2018

There were issues in the past related to the retention fix (#3534). However, Prometheus 2.0.0 showed the correct results.

fabxc (Member) commented Feb 9, 2018

@auhlig sorry about the delay on this, as it is quite severe.

If you roll back to Prometheus 2.0.0, do the results become correct again? How big are your index files in the storage directory?

auhlig (Author) commented Feb 14, 2018

Hi @fabxc,
Thanks for the reply.
Since this issue was quite visible (it messed up our Grafana dashboards), we had to wipe the data in the affected Prometheus as a quick fix, so I cannot test this anymore. The fresh Prometheus works as expected.
EDIT: If it helps, the index file sizes after one week on Prometheus 2.1 are as follows:

/prometheus # for dir in /prometheus/* ;do dir=${dir%*/};ls -lah $dir/index | awk '{print $5" "$9}'; done
34.9M /prometheus/01C5R5MB22Z0FDMZJF24BWAQS9/index
38.2M /prometheus/01C5RT7G6C2HTGHZZ3DMCKMX7Q/index
32.6M /prometheus/01C5SETMP56MRF0JH1WAMN61WY/index
32.5M /prometheus/01C5T3DVGR739FA5H02X1PZE7Y/index
39.6M /prometheus/01C5TR11FPBXYSM7MQW1AT7BFG/index
37.5M /prometheus/01C5VCM6GMVVXENJM533XH3NNJ/index
34.3M /prometheus/01C5W17BX7G6A4W1S1V0G65NVK/index
33.1M /prometheus/01C5WNTHKW86MQH82AQT1RWBS2/index
43.4M /prometheus/01C5XADT38E2JQDT2X1M8P1P6H/index
40.5M /prometheus/01C5XZ0WCJA34Y6VFXR688MRCC/index
34.1M /prometheus/01C5YKM1WJ1A0QKKK5X3MER1GP/index
32.6M /prometheus/01C5Z877JVMKD7575HSJY6Y5CB/index
32.7M /prometheus/01C5ZWTD18VJD5HRXZY98QHCYW/index
32.8M /prometheus/01C60HDKABAZ7NGN8RHJWE812B/index
32.5M /prometheus/01C6160RQEMBVH3W5N9BZH24ZX/index
32.5M /prometheus/01C61TKYV0MM5FJHZ61ZSJ8AN8/index
32.6M /prometheus/01C62F74AM0YMNS12H7KH3HZ33/index
32.6M /prometheus/01C633TAAA3SFTP7PFA9CCTBAZ/index
32.6M /prometheus/01C63RDFZDQS6RD8J7TMAVAEG6/index
32.7M /prometheus/01C64D0NEEEG60M9KVZRHYFHJ2/index
34.9M /prometheus/01C651KV8GDSWR8BGM54PB64ZX/index
33.6M /prometheus/01C65P71DT709EVKYJ2NE7AGJR/index
33.5M /prometheus/01C66AT6R5PDY0HDHPCG1JBHRV/index
32.9M /prometheus/01C66ZDCJX1138G451X85RR78K/index
35.5M /prometheus/01C67M0KEYNMH0VXHA4DDCFZXK/index
35.8M /prometheus/01C688KS05VKCSJWE3ENSDZ2KJ/index
34.5M /prometheus/01C68X6XZDF6BVFZKM7J0J614V/index
33.1M /prometheus/01C69HT3REJZ17V4DKYRKA5M74/index
34.1M /prometheus/01C6A6DAA19Z4FBKCMX1BHQQKE/index

brian-brazil (Member) commented Mar 8, 2018

Without the data, we can only presume that this is fixed unless someone else reports it.

auhlig (Author) commented Mar 22, 2018

Another instance: a query for go_goroutines also shows go_memstats_* series.
(screenshot: prom_foo)

I can see about a trillion *.tmp folders containing only data (no index, no meta.json).
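
(A quick, generic way to count them, for reference:)

# Count leftover temporary block directories in the data directory
find /prometheus -maxdepth 1 -type d -name '*.tmp' | wc -l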

fabxc (Member) commented Mar 22, 2018

Nothing in the logs? Leftover .tmp folders would usually indicate some kind of failed compaction.

fabxc (Member) commented Mar 22, 2018

Can you provide us with one or more blocks for the affected time ranges before wiping this time? It's probably fine to skip the chunk files, at least initially.
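
(For example, a block could be packaged without its chunk files along these lines; the ULID below is a placeholder, not one of the actual blocks.)

BLOCK=01ABCDEFGHJKMNPQRSTVWXYZ00   # placeholder block ULID
mkdir -p /tmp/$BLOCK
# copy everything except the chunks/ directory
cp /prometheus/$BLOCK/index /prometheus/$BLOCK/meta.json /tmp/$BLOCK/
tar czf /tmp/$BLOCK.tar.gz -C /tmp $BLOCK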

auhlig (Author) commented Mar 22, 2018

Grepping the logs for component=tsdb shows three types of messages:

(1) level=error ts=2018-03-22T15:35:09.928277674Z caller=db.go:265 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set ...
(2) level=info ts=2018-03-22T15:35:04.79684528Z caller=compact.go:387 component=tsdb msg="compact blocks" count=1 mint=1520755200000 maxt=1520762400000
(3) level=error ts=2018-03-22T15:35:09.928191279Z caller=compact.go:397 component=tsdb msg="removed tmp folder after failed compaction" err="remove /prometheus/01C975JSTWD7EGP3GPM1S005MV.tmp/chunks: directory not empty"

auhlig (Author) commented Mar 22, 2018

Something like the following?

/prometheus # du -ha 01C6SMSMDYM5M8GVSQG8X6J50S.tmp
168.7M	01C6SMSMDYM5M8GVSQG8X6J50S.tmp/chunks/000001
168.7M	01C6SMSMDYM5M8GVSQG8X6J50S.tmp/chunks
32.1M	01C6SMSMDYM5M8GVSQG8X6J50S.tmp/index
200.8M	01C6SMSMDYM5M8GVSQG8X6J50S.tmp

Most of the 14221 *.tmp folders look like:

prometheus # du -ha 01C87J6S6RNDN9ZWSYJGB34PNC.tmp
4.0K	01C87J6S6RNDN9ZWSYJGB34PNC.tmp/chunks
8.0K	01C87J6S6RNDN9ZWSYJGB34PNC.tmp
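
(If useful, the total space taken up by the leftovers can be summed with something like:)

# Sum the disk space used by all leftover *.tmp directories
du -shc /prometheus/*.tmp | tail -1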

fabxc (Member) commented Mar 22, 2018

The .tmp folders are just leftovers from failed compactions. We don't really care about them. From the logs I take it that it never actually manages to write a block with the .tmp suffix? This definitely means this is not a disk corruption issue.

That's pretty unfortunate for remote debugging. In log line (1), is it always the same series shown where you substituted the label set with ..., or do they differ? Full examples would help.

Log message (3) confuses me, as it apparently cannot remove the chunks directory. I thought it might be an issue in the code, but we are actually calling os.RemoveAll, which deletes directories recursively. What filesystem are you using?
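
(A generic way to check which filesystem backs the data directory, assuming it is mounted as its own volume:)

# Show the mount (and filesystem type) backing the data directory
mount | grep prometheus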

Which Prometheus version are you running by now? Note that 2.1.0 and 2.2.0 both have issues, and you should definitely upgrade to 2.2.1.

fabxc reopened this Mar 22, 2018

auhlig (Author) commented Mar 22, 2018

Regarding the blocks:

/prometheus # find . -name index -type f -exec ls -lh {} \;
-rw-r--r--    1 root     root       32.0M Feb 20 13:00 ./01C6SMSMDYM5M8GVSQG8X6J50S.tmp/index
-rw-r--r--    1 root     root        4.5M Mar 11 09:00 ./01C8A4KG1KBVYG3X7N41WPXDAY/index
-rw-r--r--    1 root     root       40.7M Mar  9 13:00 ./01C85DHQJVMJ4WSBKXWKCBBG5D/index
-rw-r--r--    1 root     root       44.3M Mar  7 19:00 ./01C80XBF9P97ZH113END42ZZ3H/index
-rw-r--r--    1 root     root       40.2M Mar  8 01:00 ./01C81HYN45GJHS40YC62NWV7SJ/index
-rw-r--r--    1 root     root       40.2M Mar  8 07:00 ./01C826HV2ARSMRSST9M72VFH1P/index
-rw-r--r--    1 root     root       40.3M Mar  8 13:00 ./01C82V50X7EHK065M28QW1B2WA/index
-rw-r--r--    1 root     root       42.0M Mar  8 19:00 ./01C83FR6JY4140V5XE92EAGXZG/index
-rw-r--r--    1 root     root       40.3M Mar  9 01:00 ./01C844BC57HF5GSGM9Z8YVK0T4/index
-rw-r--r--    1 root     root       40.5M Mar  9 07:00 ./01C84RYJ33CRVY5TEDMSAMFPXF/index
-rw-r--r--    1 root     root       40.5M Mar  9 19:00 ./01C8624X5EB1YQE4JPRP6QVSG0/index
-rw-r--r--    1 root     root       40.6M Mar 10 01:00 ./01C86PR3CMXAVJ7D23CWQRYHWG/index
-rw-r--r--    1 root     root       40.7M Mar 10 07:00 ./01C87BB95Q3CZ98P1QRVSE1YHB/index
-rw-r--r--    1 root     root       44.2M Mar 11 01:00 ./01C8994QGGH9J01P09B6QMX0YN/index
-rw-r--r--    1 root     root       44.1M Mar 11 07:00 ./01C89XQXM877C5HHJTEV5WCT60/index
-rw-r--r--    1 root     root       39.7M Mar  4 13:00 ./01C7RHJ440HKK4KFJKR5ZYRS31/index
-rw-r--r--    1 root     root       39.8M Mar  4 19:00 ./01C7S65AMP0MSH22FJSKMBNFWK/index
-rw-r--r--    1 root     root       39.7M Mar  5 01:00 ./01C7STRG4BVHQ80EQPP8NSPHDR/index
-rw-r--r--    1 root     root       40.0M Mar  5 07:00 ./01C7TFBNCQ7WSG57XE0WR876V1/index
-rw-r--r--    1 root     root       41.5M Mar  5 13:00 ./01C7V3YVMQ0H2X8QRTTTPRY053/index
-rw-r--r--    1 root     root       40.6M Mar  5 19:00 ./01C7VRJ1CFFN7MMDW4G9RT9747/index
-rw-r--r--    1 root     root       40.6M Mar  7 13:00 ./01C808R9VNT5ASJZKXF1B35RC5/index
-rw-r--r--    1 root     root       40.3M Mar  6 01:00 ./01C7WD573G8QFZEJ1JEW7NHV2X/index
-rw-r--r--    1 root     root       40.2M Mar  6 07:00 ./01C7X1RCQYXGKS2THYG2PH51BG/index
-rw-r--r--    1 root     root       40.9M Mar  6 13:00 ./01C7XPBJHTKK4MTVQMZDZFBEMN/index
-rw-r--r--    1 root     root       40.4M Mar  6 19:00 ./01C7YAYRT12W1FR3J6EEBB97DT/index
-rw-r--r--    1 root     root       40.1M Mar  7 01:00 ./01C7YZHY33A4EMYTTZB82B7CWS/index
-rw-r--r--    1 root     root       40.1M Mar  7 07:00 ./01C7ZM53JTX2YCGNY5DCRCYYQJ/index
-rw-r--r--    1 root     root       43.2M Mar 10 19:00 ./01C88MHHGTPC9XP72TQVVKCS6V/index
-rw-r--r--    1 root     root       20.0M Mar 10 13:00 ./01C87ZYERSPXWZ5350Y9NZYHGG.tmp/index
-rw-r--r--    1 root     root       49.8M Mar 10 13:00 ./01C87ZYC6R6KRTSHM3ER1GNNGN/index

Regarding (1): It's always the same series as shown below.

level=error ts=2018-03-22T16:05:13.960244727Z caller=db.go:265 component=tsdb msg="compaction failed" err="persist head block: write compaction: add series: out-of-order series added with label set \"{__name__=\\\"apiserver_request_count\\\",client=\\\"Swagger-Codegen/1.0.0-alpha/python\\\",code=\\\"200\\\",component=\\\"apiserver\\\",contentType=\\\"application/json\\\",instance=\\\"<removedTheNodeName>\\\",job=\\\"kube-system/apiserver\\\",region=\\\"ap-cn-1\\\",resource=\\\"deployments\\\",verb=\\\"PATCH\\\"}\""

Regarding os.RemoveAll: could the directory we're trying to delete be in use by another goroutine? That could potentially lead to the error.

We're still on the same Prometheus version:

/prometheus # prometheus --version
prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d82a045d103ea7f3c89a91fba4a93e6367a)
  build user:       root@6e784304d3ff
  build date:       20180119-12:01:23
  go version:       go1.9.2

I'll update staging today and continue with our production regions next week.

brian-brazil (Member) commented Jun 15, 2018

We believe this was fixed in 2.3.0 as part of the big PromQL changes. If you still see this after that version, please let us know.

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
