Prometheus storage considered dirty on every startup #985
Comments
How are you shutting down Prometheus? You should send a SIGTERM and then wait for it to terminate.
Indeed, that is what upstart is doing. Upstart will wait some amount of time and then send SIGKILL, but I confirmed via upstart debug logs that Prometheus is exiting by itself.
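For anyone debugging the same thing, a quick way to take upstart out of the picture is to send the signal by hand and watch whether the process exits on its own (the process name below is an assumption; adjust it to your setup):

```
# Send a plain SIGTERM directly, bypassing upstart.
PID=$(pgrep -x prometheus)
kill -TERM "$PID"

# The process should disappear on its own within a few seconds;
# if it only dies once something sends SIGKILL, the shutdown is not clean.
while kill -0 "$PID" 2>/dev/null; do sleep 1; done
echo "prometheus exited"
```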
That's very odd. Can you post the logs of Prometheus at shutdown?
Is it possible that some of this is due to interactions with OpenTSDB? I have a ridiculous number (hundreds) of goroutines that have been pending for 300+ minutes waiting for remote storage. The service also dies occasionally due to OOM, on a machine with 180 GB of RAM. This server is ingesting node_exporter metrics from ~200 machines, plus metrics from three custom exporters, of which at most one of each type is installed on each machine that runs a node_exporter. One wrinkle is that I wrote one of the custom exporters badly, so it crashes every so often on each machine, causing Prometheus to consider that endpoint UNHEALTHY, which might also add load? I'm not sure. Are there docs anywhere describing the storage system architecture?
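On the goroutine question: if your Prometheus binary exposes Go's net/http/pprof handlers (that's an assumption here, and the path may simply 404 on some builds), you can dump the blocked goroutines without restarting anything:

```
# Summary of goroutines grouped by identical stack.
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=1' | head -n 40

# Full stack dump; look for goroutines blocked in the remote-storage send path.
curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' | less
```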
What storage flags are you running your Prometheus with? With that load, you likely have to do at least some basic tuning. Scraping the unhealthy targets should reduce the load, if anything, as we don't have to ingest their samples. There are no detailed docs about the storage architecture (there are vague plans to write something up eventually). If you are interested, I'm happy to do a video call to give you an overview. Just ping me at fabian.reinartz@soundcloud.com.
I pass four flags: 'alertmanager.url', 'config.file', 'storage.local.path', and 'storage.remote.opentsdb-url', so yeah, I'm sure some tuning is needed. What do you suggest? I'll ping you off-thread as well.
http://prometheus.io/docs/operating/storage/ provides some information about the storage flags. Have a look at the flags described there.
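To make that concrete, tuning the 0.1x local storage mostly comes down to a couple of flags like the ones below; the names and values here are illustrative, so check prometheus -h for your version before copying anything:

```
# Keep more chunks in memory so hot series aren't constantly evicted
# (roughly a few chunks per active series, bounded by available RAM),
# and control how long samples are kept on disk.
prometheus \
  -config.file=/etc/prometheus/prometheus.yml \
  -storage.local.path=/var/lib/prometheus \
  -storage.local.memory-chunks=3000000 \
  -storage.local.retention=360h
```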
The remote storage is isolated from the local storage, so if the remote storage is full it shouldn't affect things (and you'd see lots of log messages about it).
That'll be ~80k time series from the node exporters, and I'd guess a similar number from the custom exporters. That shouldn't present a problem for Prometheus on what I presume is a very beefy machine. What's your scrape interval, how many samples/s are you ingesting, and are you on SSD or HDD? It might be worthwhile to look at the node exporter consoles and the Prometheus console for the Prometheus machine.
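A low-tech way to answer the samples/s and disk questions from the Prometheus host itself (the metric name is what I believe the 0.1x series exposes; grep for ingested_samples if yours differs):

```
# Counter of ingested samples; sample it twice, 60s apart, for a rough rate.
curl -s http://localhost:9090/metrics | grep ingested_samples_total
sleep 60
curl -s http://localhost:9090/metrics | grep ingested_samples_total
# (second value - first value) / 60 = samples per second

# ROTA=1 means a spinning disk, ROTA=0 means an SSD.
lsblk -d -o NAME,ROTA
```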
@fabxc The relevant numbers from my /metrics are below. @brian-brazil I do indeed see many of those messages. My scrape interval is 15s, and I'm not sure how to get to the built-in consoles (/consoles/prometheus_overview{.html,} doesn't work on my install, at least). I've copied the /metrics output from the Prometheus server below, just for reference.
That's 10x the samples/s of the next biggest HDD Prometheus we're aware of. What does disk I/O utilisation look like?
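For the disk question, the sysstat tools give a quick read on how busy the disks actually are while Prometheus is running:

```
# Extended device stats every 5 seconds; watch the %util column.
iostat -x 5

# Per-process disk I/O for the prometheus process itself.
pidstat -d -p "$(pgrep -x prometheus)" 5
```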
Also, that consoles link doesn't work.
Disk seems fine, and you're using around one core. Have you set GOMAXPROCS? Yes, you need to copy the consoles and console_libraries directories.
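A sketch of both fixes, assuming an upstart job and the console flag names I believe this generation of Prometheus uses (all paths below are placeholders):

```
# upstart stanza: let the Go runtime use more than one core.
env GOMAXPROCS=16

# Copy the console templates out of the release tarball...
cp -r ./consoles ./console_libraries /opt/prometheus/

# ...and point Prometheus at them with the matching flags:
#   -web.console.templates=/opt/prometheus/consoles
#   -web.console.libraries=/opt/prometheus/console_libraries
```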
OMG, I entirely forgot about GOMAXPROCS. Everything works waaaaay better now :) |
I'm still getting the issue where storage is marked dirty every time, but since that seems more likely to be a problem with my specific setup, I'm closing this issue. Thanks for the help!
swsnider closed this Aug 17, 2015
lock bot commented Mar 24, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
swsnider commented Aug 13, 2015
I'm using an upstart script to launch Prometheus (Prometheus is invoked via cronolog inside the script section, with no real content other than that), and that works well for us (upstart commands perform as advertised).
However, every single time that Prometheus starts up after being shut down by an upstart command, storage is considered DIRTY.
Is this because #591 was closed without actually implementing sync()? Or something else?
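For anyone hitting the same thing, here is roughly what such an upstart job looks like; every path and URL below is a placeholder standing in for the setup described above, not the actual config from this report:

```
# /etc/init/prometheus.conf
description "Prometheus"

start on runlevel [2345]
stop on runlevel [!2345]

# upstart sends SIGTERM on stop/restart; give Prometheus time to
# checkpoint before upstart falls back to SIGKILL.
kill timeout 60

script
  /opt/prometheus/prometheus \
    -config.file=/etc/prometheus/prometheus.yml \
    -storage.local.path=/var/lib/prometheus \
    -alertmanager.url=http://localhost:9093 \
    -storage.remote.opentsdb-url=http://localhost:4242 \
    2>&1 | cronolog /var/log/prometheus/prometheus.%Y%m%d.log
end script
```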