Prometheus storage considered dirty on every startup #985

Closed · swsnider opened this issue Aug 13, 2015 · 17 comments

swsnider commented Aug 13, 2015

I'm using an upstart script to launch Prometheus (inside the script section, Prometheus is simply invoked with its output piped through cronolog; there's no other real content), and that works well for us (upstart commands perform as advertised).
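For reference, the job is roughly of the following shape. Everything below is a generic sketch rather than the actual config; the paths, runlevels, and kill timeout are placeholders:

# /etc/init/prometheus.conf (sketch; paths and values are placeholders)
description "Prometheus"

start on runlevel [2345]
stop on runlevel [016]

# give Prometheus time to shut down cleanly before upstart escalates to SIGKILL
kill timeout 120

script
  /usr/local/bin/prometheus \
    -config.file=/etc/prometheus/prometheus.yaml \
    -storage.local.path=/var/lib/prometheus \
    2>&1 | cronolog /var/log/prometheus/prometheus.%Y%m%d.log
end script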

However, every single time Prometheus starts up after being shut down by an upstart command, storage is considered DIRTY:

prometheus, version 0.15.1 (branch: master, revision: 17eebbc)
  build user:       @19c92959409e
  build date:       20150729-21:28:47
  go version:       1.4.2
time="2015-08-13T17:32:32Z" level=info msg="Loading configuration file /etc/prometheus/prometheus.yaml" file=main.go line=173
time="2015-08-13T17:32:32Z" level=info msg="Loading series map and head chunks..." file=storage.go line=263
time="2015-08-13T17:32:49Z" level=warning msg="Persistence layer appears dirty." file=persistence.go line=689
time="2015-08-13T17:32:49Z" level=warning msg="Starting crash recovery. Prometheus is inoperational until complete." file=crashrecovery.go line=39
time="2015-08-13T17:32:49Z" level=info msg="Scanning files." file=crashrecovery.go line=51

Is this because #591 was closed without actually implementing sync()? Or something else?

brian-brazil commented Aug 13, 2015

How are you shutting down prometheus? You should send a SIGTERM, and then wait for it to terminate.
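For example, a clean shutdown from a shell is roughly the following (a minimal sketch, assuming a single prometheus process and no pidfile):

kill -TERM "$(pidof prometheus)"                       # ask Prometheus to shut down cleanly
while pidof prometheus > /dev/null; do sleep 1; done   # wait until it has actually exited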

swsnider commented Aug 13, 2015

Indeed, that is what upstart is doing. Upstart will wait some amount of time and then send SIGKILL, but I confirmed via upstart's debug logs that Prometheus is exiting on its own.

brian-brazil commented Aug 14, 2015

That's very odd. Can you post the logs of Prometheus at shutdown?

swsnider commented Aug 14, 2015

time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:23:46Z" level=warning msg="Sample ingestion resumed." file=storage.go line=546
time="2015-08-14T20:24:56Z" level=info msg="Checkpointing in-memory metrics and chunks..." file=persistence.go line=541
swsnider commented Aug 14, 2015

Is it possible that some of this is due to interactions with OpenTSDB? I have a ridiculous number (many hundreds) of goroutines that have been pending for 300+ minutes waiting for remote storage.

The service also dies occasionally due to OOM, on a machine with 180 GB of RAM. This server is ingesting node_exporter metrics from ~200 machines, plus metrics from three custom exporters, with at most one exporter of each type installed on each machine that runs node_exporter.

One wrinkle: I wrote one of the custom exporters badly, so it crashes every so often on each machine, which causes Prometheus to consider that endpoint UNHEALTHY. Could that also add load? I'm not sure.

Are there docs anywhere describing the storage system architecture?

fabxc commented Aug 14, 2015

What storage flags are you running your Prometheus with? With that load you will likely have to do at least some basic tuning.

If anything, the unhealthy targets should reduce the load, as we don't have to ingest their samples.

There are no detailed docs about the storage architecture. (There are vague plans to write something up eventually.) If you are interested, I'm happy to do a video call to give you an overview. Just ping me at fabian.reinartz@soundcloud.com.

swsnider commented Aug 14, 2015

I pass 4 flags: 'alertmanager.url', 'config.file', 'storage.local.path' and 'storage.remote.opentsdb-url', so yeah, I'm sure some tuning is needed. What do you suggest?

I'll ping you off-thread as well.

fabxc commented Aug 14, 2015

http://prometheus.io/docs/operating/storage/ provides some information about storage flags.

Look at storage.local.memory-chunks and storage.local.max-chunks-to-persist. The latter is likely the reason you are seeing those warnings: they basically mean that the storage is not ready for appending, so samples are piling up. Eventually, new samples are dropped.
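As a very rough sketch, the invocation might then grow to something like the following. The numbers are purely illustrative and depend on your series count and available memory, and the URLs and paths are placeholders:

prometheus \
  -config.file=/etc/prometheus/prometheus.yaml \
  -storage.local.path=/var/lib/prometheus \
  -alertmanager.url=http://localhost:9093 \
  -storage.remote.opentsdb-url=http://opentsdb.example.com:4242 \
  -storage.local.memory-chunks=3000000 \
  -storage.local.max-chunks-to-persist=1500000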

brian-brazil commented Aug 16, 2015

The remote storage is isolated from the local storage, so even if the remote storage is full it shouldn't affect things (and you'd see lots of "Remote storage queue full, discarding sample." messages).

> The service also dies occasionally due to OOM, on a machine with 180 GB of RAM. This server is ingesting node_exporter metrics from ~200 machines, plus metrics from three custom exporters, with at most one exporter of each type installed on each machine that runs node_exporter.

That'll be ~80k time series from the node exporters, and I'd guess a similar number from the custom exporters. That shouldn't present a problem for Prometheus on what I presume is a very beefy machine. What's your scrape interval, how many samples/s are you ingesting, and are you on SSD or HDD? It might be worthwhile to look at the node exporter consoles and the Prometheus console for the Prometheus machine.
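For instance, these expressions against Prometheus's own metrics (the names are taken from its /metrics output) should give a rough picture:

rate(prometheus_local_storage_ingested_samples_total[5m])   # samples ingested per second
prometheus_local_storage_memory_series                       # series currently held in memory
prometheus_local_storage_chunks_to_persist                   # chunks waiting for persistence
prometheus_local_storage_max_chunks_to_persist               # the configured persistence limit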

swsnider commented Aug 17, 2015

@fabxc According to my /metrics, prometheus_local_storage_memory_series was 500,000+, so this makes sense. I've seen vast improvement from tuning these flags; thanks for the pointer.

@brian-brazil I do indeed see many of those messages. My scrape interval is 15s, and I'm not sure how to get to the built-in consoles (/consoles/prometheus_overview{.html,} doesn't work on my install, at least). However, rate(prometheus_local_storage_ingested_samples_total[5m]) == 39674.668288367975, and I'm unfortunately on an HDD because of the hardware I was given for testing.

I've copied my /metrics on the prometheus server below, just for reference:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.00023809000000000002
go_gc_duration_seconds{quantile="0.25"} 0.203586453
go_gc_duration_seconds{quantile="0.5"} 22.265724745
go_gc_duration_seconds{quantile="0.75"} 73.957165681
go_gc_duration_seconds{quantile="1"} 134.876760354
go_gc_duration_seconds_sum 2461.877243197
go_gc_duration_seconds_count 61
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 1.668616e+06
# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="alerts",quantile="0.5"} NaN
http_request_duration_microseconds{handler="alerts",quantile="0.9"} NaN
http_request_duration_microseconds{handler="alerts",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="alerts"} 0
http_request_duration_microseconds_count{handler="alerts"} 0
http_request_duration_microseconds{handler="consoles",quantile="0.5"} NaN
http_request_duration_microseconds{handler="consoles",quantile="0.9"} NaN
http_request_duration_microseconds{handler="consoles",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="consoles"} 0
http_request_duration_microseconds_count{handler="consoles"} 0
http_request_duration_microseconds{handler="drop_series",quantile="0.5"} NaN
http_request_duration_microseconds{handler="drop_series",quantile="0.9"} NaN
http_request_duration_microseconds{handler="drop_series",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="drop_series"} 0
http_request_duration_microseconds_count{handler="drop_series"} 0
http_request_duration_microseconds{handler="federate",quantile="0.5"} NaN
http_request_duration_microseconds{handler="federate",quantile="0.9"} NaN
http_request_duration_microseconds{handler="federate",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="federate"} 0
http_request_duration_microseconds_count{handler="federate"} 0
http_request_duration_microseconds{handler="graph",quantile="0.5"} NaN
http_request_duration_microseconds{handler="graph",quantile="0.9"} NaN
http_request_duration_microseconds{handler="graph",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="graph"} 0
http_request_duration_microseconds_count{handler="graph"} 0
http_request_duration_microseconds{handler="heap",quantile="0.5"} NaN
http_request_duration_microseconds{handler="heap",quantile="0.9"} NaN
http_request_duration_microseconds{handler="heap",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="heap"} 0
http_request_duration_microseconds_count{handler="heap"} 0
http_request_duration_microseconds{handler="label_values",quantile="0.5"} NaN
http_request_duration_microseconds{handler="label_values",quantile="0.9"} NaN
http_request_duration_microseconds{handler="label_values",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="label_values"} 0
http_request_duration_microseconds_count{handler="label_values"} 0
http_request_duration_microseconds{handler="metrics",quantile="0.5"} NaN
http_request_duration_microseconds{handler="metrics",quantile="0.9"} NaN
http_request_duration_microseconds{handler="metrics",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="metrics"} 0
http_request_duration_microseconds_count{handler="metrics"} 0
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 10179.848
http_request_duration_microseconds{handler="prometheus",quantile="0.9"} 23372.595
http_request_duration_microseconds{handler="prometheus",quantile="0.99"} 26023.137
http_request_duration_microseconds_sum{handler="prometheus"} 1.2615919335999992e+07
http_request_duration_microseconds_count{handler="prometheus"} 294
http_request_duration_microseconds{handler="query",quantile="0.5"} NaN
http_request_duration_microseconds{handler="query",quantile="0.9"} NaN
http_request_duration_microseconds{handler="query",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="query"} 0
http_request_duration_microseconds_count{handler="query"} 0
http_request_duration_microseconds{handler="query_range",quantile="0.5"} NaN
http_request_duration_microseconds{handler="query_range",quantile="0.9"} NaN
http_request_duration_microseconds{handler="query_range",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="query_range"} 0
http_request_duration_microseconds_count{handler="query_range"} 0
http_request_duration_microseconds{handler="series",quantile="0.5"} NaN
http_request_duration_microseconds{handler="series",quantile="0.9"} NaN
http_request_duration_microseconds{handler="series",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="series"} 0
http_request_duration_microseconds_count{handler="series"} 0
http_request_duration_microseconds{handler="static",quantile="0.5"} NaN
http_request_duration_microseconds{handler="static",quantile="0.9"} NaN
http_request_duration_microseconds{handler="static",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="static"} 0
http_request_duration_microseconds_count{handler="static"} 0
http_request_duration_microseconds{handler="status",quantile="0.5"} NaN
http_request_duration_microseconds{handler="status",quantile="0.9"} NaN
http_request_duration_microseconds{handler="status",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="status"} 4.277188049000001e+06
http_request_duration_microseconds_count{handler="status"} 2
http_request_duration_microseconds{handler="version",quantile="0.5"} NaN
http_request_duration_microseconds{handler="version",quantile="0.9"} NaN
http_request_duration_microseconds{handler="version",quantile="0.99"} NaN
http_request_duration_microseconds_sum{handler="version"} 0
http_request_duration_microseconds_count{handler="version"} 0
# HELP http_request_size_bytes The HTTP request sizes in bytes.
# TYPE http_request_size_bytes summary
http_request_size_bytes{handler="alerts",quantile="0.5"} NaN
http_request_size_bytes{handler="alerts",quantile="0.9"} NaN
http_request_size_bytes{handler="alerts",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="alerts"} 0
http_request_size_bytes_count{handler="alerts"} 0
http_request_size_bytes{handler="consoles",quantile="0.5"} NaN
http_request_size_bytes{handler="consoles",quantile="0.9"} NaN
http_request_size_bytes{handler="consoles",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="consoles"} 0
http_request_size_bytes_count{handler="consoles"} 0
http_request_size_bytes{handler="drop_series",quantile="0.5"} NaN
http_request_size_bytes{handler="drop_series",quantile="0.9"} NaN
http_request_size_bytes{handler="drop_series",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="drop_series"} 0
http_request_size_bytes_count{handler="drop_series"} 0
http_request_size_bytes{handler="federate",quantile="0.5"} NaN
http_request_size_bytes{handler="federate",quantile="0.9"} NaN
http_request_size_bytes{handler="federate",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="federate"} 0
http_request_size_bytes_count{handler="federate"} 0
http_request_size_bytes{handler="graph",quantile="0.5"} NaN
http_request_size_bytes{handler="graph",quantile="0.9"} NaN
http_request_size_bytes{handler="graph",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="graph"} 0
http_request_size_bytes_count{handler="graph"} 0
http_request_size_bytes{handler="heap",quantile="0.5"} NaN
http_request_size_bytes{handler="heap",quantile="0.9"} NaN
http_request_size_bytes{handler="heap",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="heap"} 0
http_request_size_bytes_count{handler="heap"} 0
http_request_size_bytes{handler="label_values",quantile="0.5"} NaN
http_request_size_bytes{handler="label_values",quantile="0.9"} NaN
http_request_size_bytes{handler="label_values",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="label_values"} 0
http_request_size_bytes_count{handler="label_values"} 0
http_request_size_bytes{handler="metrics",quantile="0.5"} NaN
http_request_size_bytes{handler="metrics",quantile="0.9"} NaN
http_request_size_bytes{handler="metrics",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="metrics"} 0
http_request_size_bytes_count{handler="metrics"} 0
http_request_size_bytes{handler="prometheus",quantile="0.5"} 291
http_request_size_bytes{handler="prometheus",quantile="0.9"} 291
http_request_size_bytes{handler="prometheus",quantile="0.99"} 291
http_request_size_bytes_sum{handler="prometheus"} 85554
http_request_size_bytes_count{handler="prometheus"} 294
http_request_size_bytes{handler="query",quantile="0.5"} NaN
http_request_size_bytes{handler="query",quantile="0.9"} NaN
http_request_size_bytes{handler="query",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="query"} 0
http_request_size_bytes_count{handler="query"} 0
http_request_size_bytes{handler="query_range",quantile="0.5"} NaN
http_request_size_bytes{handler="query_range",quantile="0.9"} NaN
http_request_size_bytes{handler="query_range",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="query_range"} 0
http_request_size_bytes_count{handler="query_range"} 0
http_request_size_bytes{handler="series",quantile="0.5"} NaN
http_request_size_bytes{handler="series",quantile="0.9"} NaN
http_request_size_bytes{handler="series",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="series"} 0
http_request_size_bytes_count{handler="series"} 0
http_request_size_bytes{handler="static",quantile="0.5"} NaN
http_request_size_bytes{handler="static",quantile="0.9"} NaN
http_request_size_bytes{handler="static",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="static"} 0
http_request_size_bytes_count{handler="static"} 0
http_request_size_bytes{handler="status",quantile="0.5"} NaN
http_request_size_bytes{handler="status",quantile="0.9"} NaN
http_request_size_bytes{handler="status",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="status"} 1126
http_request_size_bytes_count{handler="status"} 2
http_request_size_bytes{handler="version",quantile="0.5"} NaN
http_request_size_bytes{handler="version",quantile="0.9"} NaN
http_request_size_bytes{handler="version",quantile="0.99"} NaN
http_request_size_bytes_sum{handler="version"} 0
http_request_size_bytes_count{handler="version"} 0
# HELP http_requests_total Total number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="prometheus",method="get"} 294
http_requests_total{code="200",handler="status",method="get"} 2
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes summary
http_response_size_bytes{handler="alerts",quantile="0.5"} NaN
http_response_size_bytes{handler="alerts",quantile="0.9"} NaN
http_response_size_bytes{handler="alerts",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="alerts"} 0
http_response_size_bytes_count{handler="alerts"} 0
http_response_size_bytes{handler="consoles",quantile="0.5"} NaN
http_response_size_bytes{handler="consoles",quantile="0.9"} NaN
http_response_size_bytes{handler="consoles",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="consoles"} 0
http_response_size_bytes_count{handler="consoles"} 0
http_response_size_bytes{handler="drop_series",quantile="0.5"} NaN
http_response_size_bytes{handler="drop_series",quantile="0.9"} NaN
http_response_size_bytes{handler="drop_series",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="drop_series"} 0
http_response_size_bytes_count{handler="drop_series"} 0
http_response_size_bytes{handler="federate",quantile="0.5"} NaN
http_response_size_bytes{handler="federate",quantile="0.9"} NaN
http_response_size_bytes{handler="federate",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="federate"} 0
http_response_size_bytes_count{handler="federate"} 0
http_response_size_bytes{handler="graph",quantile="0.5"} NaN
http_response_size_bytes{handler="graph",quantile="0.9"} NaN
http_response_size_bytes{handler="graph",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="graph"} 0
http_response_size_bytes_count{handler="graph"} 0
http_response_size_bytes{handler="heap",quantile="0.5"} NaN
http_response_size_bytes{handler="heap",quantile="0.9"} NaN
http_response_size_bytes{handler="heap",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="heap"} 0
http_response_size_bytes_count{handler="heap"} 0
http_response_size_bytes{handler="label_values",quantile="0.5"} NaN
http_response_size_bytes{handler="label_values",quantile="0.9"} NaN
http_response_size_bytes{handler="label_values",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="label_values"} 0
http_response_size_bytes_count{handler="label_values"} 0
http_response_size_bytes{handler="metrics",quantile="0.5"} NaN
http_response_size_bytes{handler="metrics",quantile="0.9"} NaN
http_response_size_bytes{handler="metrics",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="metrics"} 0
http_response_size_bytes_count{handler="metrics"} 0
http_response_size_bytes{handler="prometheus",quantile="0.5"} 3057
http_response_size_bytes{handler="prometheus",quantile="0.9"} 3096
http_response_size_bytes{handler="prometheus",quantile="0.99"} 3102
http_response_size_bytes_sum{handler="prometheus"} 877660
http_response_size_bytes_count{handler="prometheus"} 294
http_response_size_bytes{handler="query",quantile="0.5"} NaN
http_response_size_bytes{handler="query",quantile="0.9"} NaN
http_response_size_bytes{handler="query",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="query"} 0
http_response_size_bytes_count{handler="query"} 0
http_response_size_bytes{handler="query_range",quantile="0.5"} NaN
http_response_size_bytes{handler="query_range",quantile="0.9"} NaN
http_response_size_bytes{handler="query_range",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="query_range"} 0
http_response_size_bytes_count{handler="query_range"} 0
http_response_size_bytes{handler="series",quantile="0.5"} NaN
http_response_size_bytes{handler="series",quantile="0.9"} NaN
http_response_size_bytes{handler="series",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="series"} 0
http_response_size_bytes_count{handler="series"} 0
http_response_size_bytes{handler="static",quantile="0.5"} NaN
http_response_size_bytes{handler="static",quantile="0.9"} NaN
http_response_size_bytes{handler="static",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="static"} 0
http_response_size_bytes_count{handler="static"} 0
http_response_size_bytes{handler="status",quantile="0.5"} NaN
http_response_size_bytes{handler="status",quantile="0.9"} NaN
http_response_size_bytes{handler="status",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="status"} 745415
http_response_size_bytes_count{handler="status"} 2
http_response_size_bytes{handler="version",quantile="0.5"} NaN
http_response_size_bytes{handler="version",quantile="0.9"} NaN
http_response_size_bytes{handler="version",quantile="0.99"} NaN
http_response_size_bytes_sum{handler="version"} 0
http_response_size_bytes_count{handler="version"} 0
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 7151.74
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 153
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.05398337536e+11
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.4398246244e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.06608111616e+11
# HELP prometheus_build_info A metric with a constant '1' value labeled by version, revision, and branch from which Prometheus was built.
# TYPE prometheus_build_info gauge
prometheus_build_info{branch="master",revision="7f9a8de",version="0.15.1"} 1
# HELP prometheus_configuration_attempts number of times a configuration reload was attempted, failed or succeeded.
# TYPE prometheus_configuration_attempts counter
prometheus_configuration_attempts{state="attempted"} 1
prometheus_configuration_attempts{state="succeeded"} 1
# HELP prometheus_dns_sd_lookup_failures_total The number of DNS-SD lookup failures.
# TYPE prometheus_dns_sd_lookup_failures_total counter
prometheus_dns_sd_lookup_failures_total 0
# HELP prometheus_dns_sd_lookups_total The number of DNS-SD lookups.
# TYPE prometheus_dns_sd_lookups_total counter
prometheus_dns_sd_lookups_total 0
# HELP prometheus_evaluator_duration_milliseconds The duration for all evaluations to execute.
# TYPE prometheus_evaluator_duration_milliseconds summary
prometheus_evaluator_duration_milliseconds{quantile="0.01"} 9860
prometheus_evaluator_duration_milliseconds{quantile="0.05"} 9860
prometheus_evaluator_duration_milliseconds{quantile="0.5"} 17521
prometheus_evaluator_duration_milliseconds{quantile="0.9"} 21991
prometheus_evaluator_duration_milliseconds{quantile="0.99"} 26634
prometheus_evaluator_duration_milliseconds_sum 6.509127e+06
prometheus_evaluator_duration_milliseconds_count 245
# HELP prometheus_local_storage_checkpoint_duration_milliseconds The duration (in milliseconds) it took to checkpoint in-memory metrics and head chunks.
# TYPE prometheus_local_storage_checkpoint_duration_milliseconds gauge
prometheus_local_storage_checkpoint_duration_milliseconds 1.011040272601e+06
# HELP prometheus_local_storage_chunk_ops_total The total number of chunk operations by their type.
# TYPE prometheus_local_storage_chunk_ops_total counter
prometheus_local_storage_chunk_ops_total{type="clone"} 725
prometheus_local_storage_chunk_ops_total{type="create"} 3.217324e+06
prometheus_local_storage_chunk_ops_total{type="load"} 138
prometheus_local_storage_chunk_ops_total{type="persist"} 637281
prometheus_local_storage_chunk_ops_total{type="pin"} 71882
prometheus_local_storage_chunk_ops_total{type="transcode"} 1.547023e+06
prometheus_local_storage_chunk_ops_total{type="unpin"} 71882
# HELP prometheus_local_storage_chunkdesc_ops_total The total number of chunk descriptor operations by their type.
# TYPE prometheus_local_storage_chunkdesc_ops_total counter
prometheus_local_storage_chunkdesc_ops_total{type="evict"} 841470
prometheus_local_storage_chunkdesc_ops_total{type="load"} 1.010165e+06
# HELP prometheus_local_storage_chunks_to_persist The current number of chunks waiting for persistence.
# TYPE prometheus_local_storage_chunks_to_persist gauge
prometheus_local_storage_chunks_to_persist 2.573101e+06
# HELP prometheus_local_storage_fingerprint_mappings_total The total number of fingerprints being mapped to avoid collisions.
# TYPE prometheus_local_storage_fingerprint_mappings_total counter
prometheus_local_storage_fingerprint_mappings_total 0
# HELP prometheus_local_storage_inconsistencies_total A counter incremented each time an inconsistency in the local storage is detected. If this is greater zero, restart the server as soon as possible.
# TYPE prometheus_local_storage_inconsistencies_total counter
prometheus_local_storage_inconsistencies_total 0
# HELP prometheus_local_storage_indexing_batch_duration_milliseconds Quantiles for batch indexing duration in milliseconds.
# TYPE prometheus_local_storage_indexing_batch_duration_milliseconds summary
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.5"} 16448.477141
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.9"} 17207.785174
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.99"} 17207.785174
prometheus_local_storage_indexing_batch_duration_milliseconds_sum 1.4983375939149999e+06
prometheus_local_storage_indexing_batch_duration_milliseconds_count 71
# HELP prometheus_local_storage_indexing_batch_sizes Quantiles for indexing batch sizes (number of metrics per batch).
# TYPE prometheus_local_storage_indexing_batch_sizes summary
prometheus_local_storage_indexing_batch_sizes{quantile="0.5"} 44
prometheus_local_storage_indexing_batch_sizes{quantile="0.9"} 70
prometheus_local_storage_indexing_batch_sizes{quantile="0.99"} 70
prometheus_local_storage_indexing_batch_sizes_sum 1.571379e+06
prometheus_local_storage_indexing_batch_sizes_count 71
# HELP prometheus_local_storage_indexing_queue_capacity The capacity of the indexing queue.
# TYPE prometheus_local_storage_indexing_queue_capacity gauge
prometheus_local_storage_indexing_queue_capacity 16384
# HELP prometheus_local_storage_indexing_queue_length The number of metrics waiting to be indexed.
# TYPE prometheus_local_storage_indexing_queue_length gauge
prometheus_local_storage_indexing_queue_length 55
# HELP prometheus_local_storage_ingested_samples_total The total number of samples ingested.
# TYPE prometheus_local_storage_ingested_samples_total counter
prometheus_local_storage_ingested_samples_total 1.78882882e+08
# HELP prometheus_local_storage_invalid_preload_requests_total The total number of preload requests referring to a non-existent series. This is an indication of outdated label indexes.
# TYPE prometheus_local_storage_invalid_preload_requests_total counter
prometheus_local_storage_invalid_preload_requests_total 0
# HELP prometheus_local_storage_maintain_series_duration_milliseconds The duration (in milliseconds) it took to perform maintenance on a series.
# TYPE prometheus_local_storage_maintain_series_duration_milliseconds summary
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.5"} NaN
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.9"} NaN
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.99"} NaN
prometheus_local_storage_maintain_series_duration_milliseconds_sum{location="memory"} 237788.0250960018
prometheus_local_storage_maintain_series_duration_milliseconds_count{location="memory"} 204810
# HELP prometheus_local_storage_max_chunks_to_persist The maximum number of chunks that can be waiting for persistence before sample ingestion will stop.
# TYPE prometheus_local_storage_max_chunks_to_persist gauge
prometheus_local_storage_max_chunks_to_persist 3.146466e+06
# HELP prometheus_local_storage_memory_chunkdescs The current number of chunk descriptors in memory.
# TYPE prometheus_local_storage_memory_chunkdescs gauge
prometheus_local_storage_memory_chunkdescs 1.9584554e+07
# HELP prometheus_local_storage_memory_chunks The current number of chunks in memory, excluding cloned chunks (i.e. chunks without a descriptor).
# TYPE prometheus_local_storage_memory_chunks gauge
prometheus_local_storage_memory_chunks 3.109625e+06
# HELP prometheus_local_storage_memory_series The current number of series in memory.
# TYPE prometheus_local_storage_memory_series gauge
prometheus_local_storage_memory_series 691046
# HELP prometheus_local_storage_out_of_order_samples_total The total number of samples that were discarded because their timestamps were at or before the last received sample for a series.
# TYPE prometheus_local_storage_out_of_order_samples_total counter
prometheus_local_storage_out_of_order_samples_total 0
# HELP prometheus_local_storage_persist_errors_total The total number of errors while persisting chunks.
# TYPE prometheus_local_storage_persist_errors_total counter
prometheus_local_storage_persist_errors_total 0
# HELP prometheus_local_storage_series_ops_total The total number of series operations by their type.
# TYPE prometheus_local_storage_series_ops_total counter
prometheus_local_storage_series_ops_total{type="archive"} 10
prometheus_local_storage_series_ops_total{type="create"} 4071
prometheus_local_storage_series_ops_total{type="maintenance_in_memory"} 204810
prometheus_local_storage_series_ops_total{type="unarchive"} 8
# HELP prometheus_notifications_latency_milliseconds Latency quantiles for sending alert notifications (not including dropped notifications).
# TYPE prometheus_notifications_latency_milliseconds summary
prometheus_notifications_latency_milliseconds{quantile="0.5"} 812
prometheus_notifications_latency_milliseconds{quantile="0.9"} 1201
prometheus_notifications_latency_milliseconds{quantile="0.99"} 1338
prometheus_notifications_latency_milliseconds_sum 246021
prometheus_notifications_latency_milliseconds_count 244
# HELP prometheus_notifications_queue_capacity The capacity of the alert notifications queue.
# TYPE prometheus_notifications_queue_capacity gauge
prometheus_notifications_queue_capacity 100
# HELP prometheus_notifications_queue_length The number of alert notifications in the queue.
# TYPE prometheus_notifications_queue_length gauge
prometheus_notifications_queue_length 0
# HELP prometheus_remote_storage_queue_capacity The capacity of the queue of samples to be sent to the remote storage.
# TYPE prometheus_remote_storage_queue_capacity gauge
prometheus_remote_storage_queue_capacity{type="opentsdb"} 102400
# HELP prometheus_remote_storage_queue_length The number of processed samples queued to be sent to the remote storage.
# TYPE prometheus_remote_storage_queue_length gauge
prometheus_remote_storage_queue_length{type="opentsdb"} 0
# HELP prometheus_remote_storage_sent_latency_milliseconds Latency quantiles for sending sample batches to the remote storage.
# TYPE prometheus_remote_storage_sent_latency_milliseconds summary
prometheus_remote_storage_sent_latency_milliseconds{type="opentsdb",quantile="0.5"} 738
prometheus_remote_storage_sent_latency_milliseconds{type="opentsdb",quantile="0.9"} 1326
prometheus_remote_storage_sent_latency_milliseconds{type="opentsdb",quantile="0.99"} 10252
prometheus_remote_storage_sent_latency_milliseconds_sum{type="opentsdb"} 6.8996348e+07
prometheus_remote_storage_sent_latency_milliseconds_count{type="opentsdb"} 50551
# HELP prometheus_remote_storage_sent_samples_total Total number of processed samples to be sent to remote storage.
# TYPE prometheus_remote_storage_sent_samples_total counter
prometheus_remote_storage_sent_samples_total{result="dropped",type="opentsdb"} 7.011685e+06
prometheus_remote_storage_sent_samples_total{result="failure",type="opentsdb"} 13200
prometheus_remote_storage_sent_samples_total{result="success",type="opentsdb"} 5.0419e+06
# HELP prometheus_rule_evaluation_duration_milliseconds The duration for a rule to execute.
# TYPE prometheus_rule_evaluation_duration_milliseconds summary
prometheus_rule_evaluation_duration_milliseconds{rule_type="alerting",quantile="0.5"} 13812
prometheus_rule_evaluation_duration_milliseconds{rule_type="alerting",quantile="0.9"} 18568
prometheus_rule_evaluation_duration_milliseconds{rule_type="alerting",quantile="0.99"} 19260
prometheus_rule_evaluation_duration_milliseconds_sum{rule_type="alerting"} 5.335308e+06
prometheus_rule_evaluation_duration_milliseconds_count{rule_type="alerting"} 245
# HELP prometheus_rule_evaluation_failures_total The total number of rule evaluation failures.
# TYPE prometheus_rule_evaluation_failures_total counter
prometheus_rule_evaluation_failures_total 0
# HELP prometheus_target_interval_length_seconds Actual intervals between scrapes.
# TYPE prometheus_target_interval_length_seconds summary
prometheus_target_interval_length_seconds{interval="15s",quantile="0.01"} 5.092907957
prometheus_target_interval_length_seconds{interval="15s",quantile="0.05"} 7.742574289
prometheus_target_interval_length_seconds{interval="15s",quantile="0.5"} 19.019487486
prometheus_target_interval_length_seconds{interval="15s",quantile="0.9"} 154.68425185
prometheus_target_interval_length_seconds{interval="15s",quantile="0.99"} 170.651805527
prometheus_target_interval_length_seconds_sum{interval="15s"} 1.9645108370758814e+06
prometheus_target_interval_length_seconds_count{interval="15s"} 80409
brian-brazil commented Aug 17, 2015

That's 10x the samples/s of the next biggest HDD-backed Prometheus we're aware of. What does disk I/O utilisation look like (iostat -x)?

It's /consoles/prometheus.html

swsnider commented Aug 17, 2015

Linux 3.19.5-145.el6REDACTED (REDACTED)     08/17/2015  _x86_64_    (24 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.96    0.00    0.71    0.07    0.00   95.26

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.03    85.94    4.04   76.01    42.26  5907.25    74.32     1.84   22.94   0.28   2.23
dm-0              0.00     0.00    4.07  162.35    42.24  5907.25    35.75     7.89   47.42   0.13   2.23
dm-1              0.00     0.00    0.00    0.00     0.01     0.00     8.84     0.00    0.39   0.38   0.00
swsnider commented Aug 17, 2015

Also, that consoles link doesn't work (open consoles/prometheus.html: no such file or directory). I compiled with just make; did I need to bundle the consoles directory alongside the binary, or should it just work via embedding?

brian-brazil commented Aug 17, 2015

Disk seems fine, and you're using around one core. Have you set GOMAXPROCS?

Yes, you need to copy the consoles and console_libraries directories.
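For example (with Go 1.4 the runtime defaults GOMAXPROCS to 1 unless it is set explicitly; the value and destination path below are placeholders):

export GOMAXPROCS=24                                       # e.g. the number of cores; in an upstart job: env GOMAXPROCS=24
cp -r consoles console_libraries /usr/local/prometheus/    # ship the console templates from the source tree alongside the binary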

swsnider commented Aug 17, 2015

OMG, I entirely forgot about GOMAXPROCS.

Everything works waaaaay better now :)

swsnider commented Aug 17, 2015

I'm still getting the issue where storage is marked dirty every time, but since that seems more likely to be a problem with my specific setup, I'm closing this issue. Thanks for the help!

swsnider closed this Aug 17, 2015

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
