Currently exposed prometheus metrics are not enough for problem analysis #123

onorua · 2017-08-19T14:13:47Z

I tried to work out why the Prometheus server is using more than 128GB of RAM (which gets it killed by the OOM killer), but I could not find the cause, because the time series count and samples/sec metrics are no longer exposed. samples/sec may be Prometheus-specific, but the number of series is definitely TSDB-specific.

I believe we need something like this:
https://github.com/prometheus/tsdb/blob/master/head.go#L219-L221
but on "global" level.

If you think the 1.x-era metrics are no longer applicable, could you please provide a list of metrics to pay attention to, along with some performance indicators?

fabxc · 2017-08-19T15:04:29Z

You can still get samples per second via rate(tsdb_samples_appended_total[5m]).
The total number of series in the DB is actually non-trivial to even compute, but the currently active series can be queried via sum(scrape_samples_scraped).

(I already told you this out of band, but I'm repeating it here for public reference.)

We should definitely have some docs on how to analyze 2.0.
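Until such docs exist, here is a minimal sketch of how the two queries above could be run against the Prometheus HTTP API (`/api/v1/query`). The base URL `http://localhost:9090` and the helper names are assumptions for illustration, and `first_value` only looks at the first series in the result, which may not be what you want if the expression returns several.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at its default local address.
BASE_URL = "http://localhost:9090"

# The two expressions quoted in this comment.
QUERIES = {
    "samples_per_second": "rate(tsdb_samples_appended_total[5m])",
    "active_series": "sum(scrape_samples_scraped)",
}


def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def first_value(payload):
    """Extract the first sample value from an instant-query response, or None."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error"))
    result = payload["data"]["result"]
    if not result:
        return None
    # Each vector element carries value = [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def run_query(base_url, expr):
    """Send the query over HTTP and return the first value (needs a live server)."""
    with urllib.request.urlopen(build_query_url(base_url, expr)) as resp:
        return first_value(json.load(resp))
```

With a server running, `run_query(BASE_URL, QUERIES["active_series"])` should return the current sum of scraped samples as a float.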

fabxc · 2017-09-11T08:09:41Z

The number of metrics has increased significantly in master, giving detailed overviews of active series, in-memory chunks, and more. They are all available under tsdb_*.
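To get a quick look at what is exposed under that prefix, something like the following could be used on the text scraped from the server's own `/metrics` endpoint. This is only a sketch: `filter_metrics` is a hypothetical helper, and it assumes each sample line ends with the value (no trailing timestamp), which is a simplification of the text exposition format.

```python
def filter_metrics(exposition_text, prefix="tsdb_"):
    """Return {series: value} for sample lines whose name starts with prefix."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(prefix):
            continue
        # Assumption: the last space-separated field is the value,
        # i.e. the line carries no optional timestamp.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples
```

The input text can be fetched with any HTTP client; the returned dict keeps label sets as part of the key, so each labeled series stays separate.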

fabxc closed this as completed Sep 11, 2017
