An instant query that matches many older (often archived) time series spends a lot of time in index lookups although data from the older series is irrelevant for the instant query. #1264
Comments
Is it just that one query that takes very long, or all of them? The SVG graph shows that most of the time is spent within rule evaluation and index lookups.
Without having looked into all the details: if you have a lot of archived time series matching the query, they will slow down the index lookup, although they will not be counted in the final result.
I'd like to just drop the "data" directory for this instance, but if this bug is useful we can see what happened. Is it helpful to look into this?

It seems any query that covers ~16000 time series takes 30000ms+ on this instance. prometheus_evaluator_duration_milliseconds is hovering around 20000 here. The evaluation interval is 60 seconds, and this instance only has two recording rules and four alerting rules. They're fairly simple; I can provide sanitized versions if you like.

When I filed this bug, I started a mirror Prometheus with much of the same data (a subset of metric names; 182232 series). It currently has only 4 days of history instead of 20 days. The mirror instance takes only 350ms for the same query and result (the exact same dataset), and its prometheus_evaluator_duration_milliseconds at quantile=0.99 is 457 with even more rules(!)
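A minimal query sketch for reading the evaluator-duration summary mentioned above, assuming the server scrapes itself and uses the quantile label quoted in the comment:

```
prometheus_evaluator_duration_milliseconds{quantile="0.99"}
```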
My assumption is that your query covers more than the counted number of time series; most of them are just too old to be included in the final count. That would be consistent with the observed behavior that a server with less history is much faster. The size of archived_fingerprint_to_metric implies quite a number of archivals over the lifetime of the storage. You can query the number of archive ops over the last 20 days with a query like the one below (in Console view, not Graph):
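The original query was not preserved in this thread. One plausible formulation, assuming the server scrapes itself and that this Prometheus version exposes prometheus_local_storage_series_ops_total with a type="archive" label, would be:

```
increase(prometheus_local_storage_series_ops_total{type="archive"}[20d])
```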
That still doesn't tell you the total number of time series for "metric_name", but it gives you an idea how much series churn you have. You can play tricks like the one sketched below (but the query will take quite long, so definitely run it in Console view only):
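Again, the original expression was not captured. One trick along these lines, assuming the count_over_time function is available in the Prometheus version in use, counts every series of metric_name that had any sample in the last 20 days:

```
count(count_over_time(metric_name[20d]))
```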
Looking up hundreds of thousands of entries in LevelDB takes time, even if everything is already in the page cache. There is no way around it, except picking a faster implementation (like the C++ implementation, but then we would need cgo again...).
Thanks for your input! There was definitely churn at some point due to all the "instance" labels changing (I'm thinking about dropping these for my use case). I don't intend to use Prometheus for long-term storage, so perhaps I can just lower the retention period.
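For reference, retention in this generation of Prometheus local storage was controlled by a startup flag along these lines (the flag name is an assumption based on the 0.16-era storage; the value here is only an example, not the reporter's setting):

```
prometheus -storage.local.retention=96h
```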
brian-brazil added the bug label on Dec 16, 2015
Or just don't have too much churn. Sometimes there is no work-around, and that's a real problem then, but unfortunately the index lookups are by design. If you only want to access 10,000 series but have 100,000 archived series that the query will need to touch (as in: look up in the indices, only to find out they are too old to be considered), the query will be slow by design. We have a similar problem at SoundCloud for servers with a lot of series churn.

Currently, I could imagine maintaining an index just for "fresh" series, which could be used for instant queries (rule evaluation). Needs more thinking. Perhaps the solution to the staleness problem is related.

I'll rename this issue so that it represents the actual problem better.
beorn7 changed the title from "count(metric_name) takes 35 seconds to return 13660" to "An instant query that matches many older (often archived) time series spends a lot of time in index lookups although data from the older series is irrelevant for the instant query." on Dec 17, 2015
beorn7 added the enhancement label and removed the bug label on Dec 17, 2015
We have a similar problem. Our use case has massive churn in labels due to dynamic jobs.
beorn7 added a commit that referenced this issue on Feb 18, 2016
beorn7 added a commit that referenced this issue on Feb 19, 2016
beorn7 closed this on Mar 18, 2016
Fixed by #1477
jamessanford commented on Dec 11, 2015
Any ideas what might be happening here? Hopefully I'm just doing something wrong.
prometheus release-0.16.1
prometheus has 385126 prometheus_local_storage_memory_series
metric_name has 16338 series
count(metric_name) takes 35 seconds to evaluate (not graph!)

pprof data captured while executing count(metric_name) is here:

https://froop.com/tmp/prom_pprof/pprof.prometheus.samples.cpu.003.pb.gz
https://froop.com/tmp/prom_pprof/pprof.prometheus.samples.cpu.003.svg
https://froop.com/tmp/prom_pprof/pprof.prometheus.samples.cpu.003.txt
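The profiles above can be inspected with the standard Go pprof tooling, e.g. (assuming a local copy of the prometheus binary alongside the downloaded profile):

```
go tool pprof prometheus pprof.prometheus.samples.cpu.003.pb.gz
```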
size of historic data:
size of leveldbs:
I have some flags set to increase leveldb cache size -- without the index-cache-size flags, the query is even slower.
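The exact flag values from this report were not captured above. For illustration only, the 0.16-era local storage exposed per-index cache-size flags of roughly this form (flag names are an assumption based on that version; the byte values here are hypothetical, not the reporter's settings):

```
prometheus \
  -storage.local.index-cache-size.fingerprint-to-metric=52428800 \
  -storage.local.index-cache-size.fingerprint-to-timerange=52428800 \
  -storage.local.index-cache-size.label-name-to-label-values=52428800 \
  -storage.local.index-cache-size.label-pair-to-fingerprints=104857600
```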
All of the leveldb data is already cached in RAM (it's not just slow disk):