prometheus skips scrapes due to throttling #2222
Did you recently increase the load on Prometheus, e.g. with a higher number of targets? It's very unlikely that this has anything to do with 1.4, and even less likely that the downgrade caused a previously working version to stop working. Can you read through the storage documentation and verify whether you have to adjust any of the flags for your setup?
Debug info that one of our colleagues managed to capture today: Let me know if I can help with anything else, or whether we are doing something really wrong.
One thing that line tells me is that you have configured a decent amount of memoryChunks (~3M) but you have left the
It may well be that your server was quite close to running into persistence problems, and only after the restart did it have to catch up on things like the deletion of old samples. There is a bit of hysteresis, so it might never recover to the healthy state even under the same load. But as I said, with the setting suggested above, your server should be able to handle much more.
I'm really sorry that flag tweaking is something of a dark art right now. I'm working on making this all automagic, but that's not trivial to do.
My current command line arguments look like this:
We need around 24 hours to check whether Prometheus behaves better.
We had 24 hours of normal operation and I believe the issue is solved. Thank you for your help.
onorua closed this on Nov 29, 2016
That is kind of weird: we could sustain fewer than 300 nodes with the config mentioned above, but the problem came back as soon as we reached 300+ nodes.
onorua reopened this on Nov 30, 2016
The log message about the throttling is usually quite helpful. Also, all the
onorua changed the title from "prometheus skips scrapes every ~2 hours for 10-20 minutes" to "prometheus crash on load" on Dec 1, 2016
onorua changed the title from "prometheus crash on load" to "prometheus skips scrapes due to throttling" on Dec 1, 2016
The throttling happens now because you have too many chunks in memory:
Check
Each time series in memory needs at least one chunk. Most of them are probably being actively appended to, so their one chunk is still open and cannot be persisted. Thus, you have about 2.1M open chunks (which cannot be persisted) plus 1.3M chunks that can be persisted. Those are actively being persisted, but you cannot persist them all in one go (see the linked talk); 1.3M chunks waiting for persistence is a reasonable number if you have 2.1M time series in memory. In sum, you have 3.4M chunks in memory right now that cannot be evicted. That's more than you have configured ( The straightforward remedy is to increase
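The arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not anything Prometheus itself runs: the function names are made up for this example. The configured limit is an assumption taken from the "~3M" memoryChunks figure mentioned earlier in the thread, because the exact value configured at this point is not shown.

```python
def min_non_evictable_chunks(series_in_memory, chunks_waiting_for_persistence):
    """Lower bound on the chunks that must stay in memory.

    Each in-memory series has at least one chunk, assumed here to be open
    (still being appended to), plus the chunks queued for persistence.
    """
    open_chunks = series_in_memory  # at least one open chunk per series
    return open_chunks + chunks_waiting_for_persistence


def is_over_limit(series_in_memory, chunks_waiting_for_persistence, configured_chunks):
    """True if the non-evictable chunks alone exceed the configured limit."""
    total = min_non_evictable_chunks(series_in_memory, chunks_waiting_for_persistence)
    return total > configured_chunks


# Figures quoted in the comment above.
SERIES_IN_MEMORY = 2_100_000
CHUNKS_WAITING = 1_300_000

# Assumption: roughly the "~3M" memory chunks mentioned earlier in the thread.
ASSUMED_CONFIGURED_CHUNKS = 3_000_000

total = min_non_evictable_chunks(SERIES_IN_MEMORY, CHUNKS_WAITING)
print(total)  # 3400000
print(is_over_limit(SERIES_IN_MEMORY, CHUNKS_WAITING, ASSUMED_CONFIGURED_CHUNKS))  # True
```

With these inputs the non-evictable chunks alone exceed the assumed limit by about 0.4M, which matches the explanation of why throttling kicks in.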
Yes, that is the conclusion I've come to as well: we need more Prometheus servers :)
brian-brazil closed this on Mar 27, 2017
Alexvianet referenced this issue on Mar 29, 2018: "prometheus max-chunks-to-persist paramether" #177 (closed)
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

onorua commented Nov 27, 2016
What did you do?
Our Prometheus was updated to version 1.4, which did not work well (#2220), so we rolled back to version 1.3.1, and then this issue started to happen.
What did you expect to see?
Scrapes should not be skipped every 2 hours or so.
What did you see instead? Under which circumstances?

Environment
We run Prometheus with the following flags:
The config follows. We don't do anything fancy; it is just a DO machine that had been working fine for about a month until that upgrade happened.
System information:
Linux 4.4.0-45-generic x86_64
Prometheus version:
I'll grab /debug/pprof/goroutine?debug=2 as soon as this happens while I'm online, but as it is the holidays, that is getting tricky.
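For anyone who wants to capture the same goroutine dump, here is a minimal Python sketch that fetches the /debug/pprof/goroutine?debug=2 endpoint mentioned above. The helper names are hypothetical, and the base URL in the usage note (http://localhost:9090) is an assumption about where the server listens; adjust it for your setup.

```python
from urllib.request import urlopen

GOROUTINE_PATH = "/debug/pprof/goroutine?debug=2"


def goroutine_dump_url(base_url):
    """Build the goroutine dump URL for a server reachable at base_url."""
    return base_url.rstrip("/") + GOROUTINE_PATH


def fetch_goroutine_dump(base_url, timeout=10.0):
    """Download the plain-text goroutine dump and return it as a string."""
    with urlopen(goroutine_dump_url(base_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def save_goroutine_dump(base_url, path, timeout=10.0):
    """Fetch the dump and write it to a file so it can be attached to an issue."""
    dump = fetch_goroutine_dump(base_url, timeout=timeout)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(dump)
    return len(dump)
```

A call such as save_goroutine_dump("http://localhost:9090", "goroutines.txt") would write the dump to a local file. Running it from a cron job or a small loop during the window when scrapes are skipped avoids having to be online at that moment.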