
store: extremely slow and CPU pegged at 100% (lru.RemoveOldest) #955

Closed
ottoyiu opened this issue Mar 21, 2019 · 5 comments

Comments

ottoyiu commented Mar 21, 2019

Thanos, Prometheus and Golang version used

  • thanos 0.3.2
    image: improbable/thanos:v0.3.2

Only reproducible in 0.3.2 and not 0.3.1.

What happened

  • store does not respond to any queries; clients like Grafana all time out.
  • CPU is pegged at 100% usage.
  • profiling CPU usage using:
# grab a 30s CPU profile from the store's HTTP port (saved to a local file named "profile")
curl http://xxx:10902/debug/pprof/profile -O
# render a flame graph and an SVG call graph from the downloaded profile
go-torch -f "flame.svg" thanos profile
go tool pprof -svg thanos profile > profile.svg

shows all the CPU time being spent on 'RemoveOldest':
https://github.com/GiedriusS/thanos/blob/9679a193f433353287ea3052320dbc9e46bc3e9e/pkg/store/cache.go#L131

[Screenshots: pprof call graph (profile003) and flame graph (flame)]
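For a quicker look, go tool pprof can also fetch the profile straight from the endpoint and print the hottest functions; a minimal sketch, reusing the placeholder host from the commands above:

# download the CPU profile over HTTP and list the top consumers in the terminal
go tool pprof -top http://xxx:10902/debug/pprof/profile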

What you expected to happen

  • CPU not to be pegged at 100%

How to reproduce it (as minimally and precisely as possible):

  • I don't know how to reproduce this, but it happens only on our largest Prometheus instances, with 1.5+ million head time series. Restarting store and making a few queries leads to the CPU being pegged at 100% again.
  • Edit: This problem does not occur in thanos-store 0.3.1.

Maybe relevant:
#873

Full logs to relevant components

Anything else we need to know
Linux 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

ottoyiu (Author) commented Mar 21, 2019

As the changelog says, I can also bump index-cache-size up to something very large to revert to 0.3.1's behaviour.

⚠️ The #873 fix fixes the actual handling of index-cache-size. Handling of the limit for this cache was broken, so it was unbounded all the time. From this release the actual value matters and is extremely low by default. To "revert" the old behaviour (no boundary), use a large enough value.
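For illustration, a minimal sketch of raising the limit when launching the store component; apart from --index-cache-size, which is the flag under discussion, the other flags, paths, and the size value are illustrative assumptions:

# illustrative invocation only; tune the cache size to the host's available memory
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=bucket.yml \
  --index-cache-size=16GB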

FUSAKLA (Member) commented Mar 23, 2019

Yes, you should definitely try that. What is index-cache-size set to now?

The store possibly does not have enough space in the cache, so it has to remove the oldest entries all the time.

ottoyiu (Author) commented Mar 23, 2019

@FUSAKLA thanks for the reply! I have it set to 16GB now and don't have any more issues related to this.

Is there a formula to compute how big index-cache-size needs to be, relative to the number and size of the blocks?

The store possibly does not have enough space in the cache, so it has to remove the oldest entries all the time.

If that's the case, should store report a warning when the set being cached is too large for the configured LRU size?

FUSAKLA (Member) commented Mar 23, 2019

Great to hear. Well, it depends on the queries you send, how long the time range is, how many series... there are a lot of factors. I'm not sure there is one universal formula.

Hmm... not sure about the warning. It's still valid behaviour, aligned with the configured cache size, but I see the motivation.
I think it would suit you better to watch the cache metrics. There are:

  • thanos_store_index_cache_items_size_bytes
  • thanos_store_index_cache_items
  • thanos_store_index_cache_hits_total
  • thanos_store_index_cache_items_overflowed_total
  • thanos_store_index_cache_requests_total
  • thanos_store_index_cache_items_added_total
  • thanos_store_index_cache_items_evicted_total

I think those should hopefully tell you how big it should be.
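As a quick check, these can be read straight off the store's metrics endpoint; a minimal sketch, reusing the placeholder host from the report above:

# the store serves Prometheus metrics on the same HTTP port used for pprof above
curl -s http://xxx:10902/metrics | grep thanos_store_index_cache

If thanos_store_index_cache_items_size_bytes sits flat at the configured limit while thanos_store_index_cache_items_evicted_total keeps climbing, the cache is too small for the working set, and constant eviction is the likely cause of the churn described above.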

ottoyiu (Author) commented Mar 23, 2019

@FUSAKLA those metrics are definitely useful. Thank you!

I can definitely see the 250MB plateau that caused the constant CPU churn while the store kept evicting index entries. Going to close this issue now :)

[Graph: index cache size metric plateauing at 250MB]

ottoyiu closed this as completed Mar 23, 2019