An instant query that matches many older (often archived) time series spends a lot of time in index lookups although data from the older series is irrelevant for the instant query. #1264

Closed
jamessanford opened this Issue Dec 11, 2015 · 10 comments

jamessanford commented Dec 11, 2015

Any ideas what might be happening here? Hopefully I'm just doing something wrong.

prometheus release-0.16.1
prometheus has 385126 prometheus_local_storage_memory_series
metric_name has 16338 series
count(metric_name) takes 35 seconds to evaluate (not graph!)

pprof data captured while executing count(metric_name) is here:
https://froop.com/tmp/prom_pprof/pprof.prometheus.samples.cpu.003.pb.gz
https://froop.com/tmp/prom_pprof/pprof.prometheus.samples.cpu.003.svg
https://froop.com/tmp/prom_pprof/pprof.prometheus.samples.cpu.003.txt

size of historic data:

48G data

size of leveldbs:

267M    archived_fingerprint_to_metric
101M    archived_fingerprint_to_timerange
2.6M    labelname_to_labelvalues
430M    labelpair_to_fingerprints

I have some flags set to increase leveldb cache size -- without the index-cache-size flags, the query is even slower.

-storage.local.max-chunks-to-persist=3145728 \
-storage.local.memory-chunks=3145728 \
-storage.local.index-cache-size.fingerprint-to-timerange=51214400 \
-storage.local.index-cache-size.fingerprint-to-metric=102428800 \
-storage.local.index-cache-size.label-name-to-label-values=102428800 \
-storage.local.index-cache-size.label-pair-to-fingerprints=209715200 \

All of the leveldb data is already cached in RAM (it's not just slow disk):

data% time cat *_to_*/* >/dev/null
real    0m0.147s
user    0m0.004s
sys 0m0.142s

fabxc commented Dec 14, 2015

Is it just that one query that takes very long or all of them?

The SVG graph shows that most of the time is spent within rule evaluation and index lookups.
Can you show us the rules you have configured for that server and their evaluation interval?

beorn7 commented Dec 14, 2015

Without having looked into all the details: if you have a lot of archived time series matching the query, they will slow down the index lookup even though they will not be counted in the final result.

jamessanford commented Dec 14, 2015

I'd like to just drop the "data" directory for this instance, but if it's useful for this bug we can dig into what happened. Is it helpful to look into this?

It seems any query that covers ~16000 time series takes 30000ms+ on this instance; prometheus_evaluator_duration_milliseconds is hovering around 20000.

The evaluation interval is 60 seconds, and this instance has only two recording rules and four alerting rules. They're fairly simple; I can provide sanitized versions if you like.

When I filed this bug, I started a mirror Prometheus with much of the same data (a subset of metric names; 182232 series). It currently has only 4 days of history instead of 20 days.

The mirror instance takes only 350ms for the same query and result (the exact same dataset). Its prometheus_evaluator_duration_milliseconds quantile=0.99 is at 457, with even more rules(!)

beorn7 commented Dec 15, 2015

My assumption is that your query covers far more time series than the counted number; most of them are just too old to be included in the final count. That would be consistent with the observed behavior that a server with less history is much faster.

The size of archived_fingerprint_to_metric implies quite a number of archivals over the lifetime of the storage. You can query the number of archive ops over the last 20 days with this query (in Console view, not Graph):

increase(prometheus_local_storage_series_ops_total{type=~"^archive$"}[20d])

That still doesn't tell you the total number of time series for "metric_name", but it gives you an idea of how much series churn you have.

You can play tricks like the following (but the query will take quite a long time, so definitely run it in Console view only):

count(max_over_time(metric_name[20d]))

Looking up hundreds of thousands of entries in leveldb takes time, even if everything is in the page cache already. There is no way around it, except picking a faster implementation. (Like the C++ implementation, but then we need cgo again...)

jamessanford commented Dec 15, 2015

Thanks for your input!

There was definitely churn at some point due to all the "instance" labels changing (I'm thinking about dropping these for my use case).

increase(prometheus_local_storage_series_ops_total{type=~"^archive$"}[20d])
Element Value
{instance="localhost:9240",job="prometheus",type="archive"} 2509304.5114708142
{instance="localhost:8052",job="prometheus",type="archive"} 20445408.48219974
{instance="localhost:8234",job="prometheus",type="archive"} 34654746.73629243

I can't get count(max_over_time(metric_name[20d])) to complete, but I can for metric_name{label="one_node"}[20d], and it looks like 20 days covers about 9 * metric_count series, or roughly 120000 for all of metric_name[20d].

I don't intend to use Prometheus for long term storage, so perhaps I can just lower -storage.local.retention to a week.

brian-brazil added the bug label Dec 16, 2015

beorn7 commented Dec 17, 2015

Or just don't have too much churn. Sometimes there is no workaround, and that's a real problem, but unfortunately the index lookups behave this way by design. If you only want to access 10,000 series but have 100,000 archived series that the query still needs to touch (as in: look up in the indices, only to find out they are too old to be considered), the query will be slow by design.

We have a similar problem at SoundCloud for servers with a lot of series churn. Currently, I could imagine maintaining an index just for "fresh" series, which could be used for instant queries (rule evaluation). Needs more thinking. Perhaps the solution to the staleness problem is related.
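
To make the idea a bit more concrete, here is a rough sketch of what a "fresh only" index could look like (purely illustrative Go; freshIndex and all other names are made up and not Prometheus code). Instant queries would consult only this map, so archived series never trigger a leveldb lookup:

```go
package main

import "fmt"

// Fingerprint identifies a series; LabelPair is a single label name/value.
type (
	Fingerprint uint64
	LabelPair   struct{ Name, Value string }
)

// freshIndex is an illustrative in-memory index containing only series
// that are not archived.
type freshIndex struct {
	byLabelPair map[LabelPair]map[Fingerprint]struct{}
}

func newFreshIndex() *freshIndex {
	return &freshIndex{byLabelPair: map[LabelPair]map[Fingerprint]struct{}{}}
}

// add registers a non-archived series under one of its label pairs.
func (ix *freshIndex) add(lp LabelPair, fp Fingerprint) {
	fps, ok := ix.byLabelPair[lp]
	if !ok {
		fps = map[Fingerprint]struct{}{}
		ix.byLabelPair[lp] = fps
	}
	fps[fp] = struct{}{}
}

// archive removes a series once it is archived, keeping the index "fresh".
func (ix *freshIndex) archive(lp LabelPair, fp Fingerprint) {
	delete(ix.byLabelPair[lp], fp)
}

// lookup returns only fresh series matching a label pair, so an instant
// query never has to consider archived series at all.
func (ix *freshIndex) lookup(lp LabelPair) []Fingerprint {
	var out []Fingerprint
	for fp := range ix.byLabelPair[lp] {
		out = append(out, fp)
	}
	return out
}

func main() {
	ix := newFreshIndex()
	lp := LabelPair{Name: "__name__", Value: "metric_name"}
	ix.add(lp, 1)
	ix.add(lp, 2)
	ix.archive(lp, 1) // series 1 got archived; instant queries no longer see it
	fmt.Println(ix.lookup(lp))
}
```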

I'll rename this issue so that it represents the actual problem better.

beorn7 changed the title from "count(metric_name) takes 35 seconds to return 13660" to "An instant query that matches many older (often archived) time series spends a lot of time in index lookups although data from the older series is irrelevant for the instant query." on Dec 17, 2015

beorn7 added the enhancement label and removed the bug label Dec 17, 2015

mwitkow commented Feb 10, 2016

We have a similar problem. Our use case has massive churn in labels due to dynamic jobs.
We run a federation setup, where leaf servers scrape and aggregate jobs, with a small storage.local.retention=2h. Even though most time series end within the 2h time frame, they still impact the performance of rule evaluation and federation scrapes to the point where both exceed the 30s interval for our datapoints.

beorn7 added a commit that referenced this issue Feb 18, 2016

Streamline series iterator creation
This will fix issue #1035 and will also help to make issue #1264 less
bad.

The fundamental problem in the current code:

In the preload phase, we quite accurately determine which chunks will
be used for the query being executed. However, in the subsequent step
of creating series iterators, the created iterators are referencing
_all_ in-memory chunks in their series, even the un-pinned ones. In
iterator creation, we copy a pointer to each in-memory chunk of a
series into the iterator. While this creates a certain amount of
allocation churn, the worst thing about it is that copying the chunk
pointer out of the chunkDesc requires a mutex acquisition. (Remember
that the iterator will also reference un-pinned chunks, so we need to
acquire the mutex to protect against concurrent eviction.) The worst
case happens if a series doesn't even contain any relevant samples for
the query time range. We notice that during preloading but then we
will still create a series iterator for it. But even for series that
do contain relevant samples, the overhead is quite bad for instant
queries that retrieve a single sample from each series, but still go
through all the effort of series iterator creation. All of that is
particularly bad if a series has many in-memory chunks.

This commit addresses the problem from two sides:

First, it merges preloading and iterator creation into one step,
i.e. the preload call returns an iterator for exactly the preloaded
chunks.

Second, the required mutex acquisition in chunkDesc has been greatly
reduced. That was enabled by a side effect of the first step, which is
that the iterator is only referencing pinned chunks, so there is no
risk of concurrent eviction anymore, and chunks can be accessed
without mutex acquisition.
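
Very roughly, the shape of the change is the following (an illustrative sketch with made-up types, not the actual storage/local code):

```go
package main

import (
	"fmt"
	"sync"
)

// chunk stands in for one chunk of samples of a series.
type chunk struct {
	firstTime, lastTime int64
}

// chunkDesc guards access to a chunk that may be evicted while unpinned.
type chunkDesc struct {
	mu     sync.Mutex
	c      *chunk
	pinned bool
}

// oldStyleIterator mirrors the previous behavior: it copies pointers to
// all in-memory chunks of the series, taking the mutex for each one
// because unpinned chunks could be evicted concurrently.
func oldStyleIterator(descs []*chunkDesc) []*chunk {
	chunks := make([]*chunk, 0, len(descs))
	for _, cd := range descs {
		cd.mu.Lock() // paid even for chunks the query never reads
		if cd.c != nil {
			chunks = append(chunks, cd.c)
		}
		cd.mu.Unlock()
	}
	return chunks
}

// newStyleIterator mirrors the behavior after this change: preloading has
// already pinned exactly the chunks in the query range, so only those are
// referenced and no per-chunk locking is needed.
func newStyleIterator(pinned []*chunkDesc) []*chunk {
	chunks := make([]*chunk, 0, len(pinned))
	for _, cd := range pinned {
		chunks = append(chunks, cd.c) // safe: pinned chunks cannot be evicted
	}
	return chunks
}

func main() {
	series := []*chunkDesc{
		{c: &chunk{firstTime: 0, lastTime: 99}},
		{c: &chunk{firstTime: 100, lastTime: 199}},
		{c: &chunk{firstTime: 200, lastTime: 299}, pinned: true}, // only this one overlaps the query
	}
	fmt.Println("old iterator references", len(oldStyleIterator(series)), "chunks")

	var preloaded []*chunkDesc
	for _, cd := range series {
		if cd.pinned {
			preloaded = append(preloaded, cd)
		}
	}
	fmt.Println("new iterator references", len(newStyleIterator(preloaded)), "chunks")
}
```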

To simplify the code changes for the above, the long-planned change of
ValueAtTime to ValueAtOrBeforeTime was performed at the same
time. (It should have been done first, but it kind of accidentally
happened while I was in the middle of writing the series iterator
changes. Sorry for that.) So far, we actively filtered the up to two
values that were returned by ValueAtTime, i.e. we invested work to
retrieve up to two values, and then we invested more work to throw one
of them away.
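
Paraphrased with stand-in types (not the exact upstream signatures), the method change looks like this:

```go
package main

import "fmt"

// Stand-in types so the sketch is self-contained; upstream these are
// model.Time and model.SamplePair.
type Time int64
type SamplePair struct {
	Timestamp Time
	Value     float64
}

// Before (paraphrased): up to two values around t, filtered by the caller.
type oldSeriesIterator interface {
	ValueAtTime(t Time) []SamplePair
}

// After (paraphrased): exactly the most recent value at or before t.
type newSeriesIterator interface {
	ValueAtOrBeforeTime(t Time) SamplePair
}

// A toy iterator over sorted samples implementing the new shape.
type sliceIterator []SamplePair

func (it sliceIterator) ValueAtOrBeforeTime(t Time) SamplePair {
	result := SamplePair{Timestamp: -1} // sentinel for "no value"
	for _, sp := range it {
		if sp.Timestamp > t {
			break
		}
		result = sp
	}
	return result
}

func main() {
	var it newSeriesIterator = sliceIterator{{10, 1}, {20, 2}, {30, 3}}
	fmt.Println(it.ValueAtOrBeforeTime(25)) // {20 2}
}
```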

The SeriesIterator.BoundaryValues method can be removed once #1401 is
fixed. But I really didn't want to load even more changes into this
PR.

Benchmarks:

The BenchmarkFuzz.* benchmarks run 83% faster (i.e. about six times
faster) and allocate 95% fewer bytes. The reason for that is that the
benchmark reads one sample after another from the time series and
creates a new series iterator for each sample read.

To find out how much these improvements matter in practice, I have
mirrored a beefy Prometheus server at SoundCloud that suffers from
both issues #1035 and #1264. To reach steady state that would be
comparable, the server needs to run for 15d. So far, it has run for
1d. The test server currently has only half as many memory time series
and 60% of the memory chunks the main server has. The 90th percentile
rule evaluation cycle time is ~11s on the main server and only ~3s on
the test server. However, these numbers might get much closer over
time.

In addition to performance improvements, this commit removes about 150
LOC.

beorn7 added a commit that referenced this issue Feb 19, 2016

Streamline series iterator creation
(Same commit message as above.)

beorn7 closed this Mar 18, 2016

beorn7 commented Mar 18, 2016

Fixed by #1477

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
