
store: extremely slow and CPU pegged at 100% (lru.RemoveOldest) #955

Closed
ottoyiu opened this issue Mar 21, 2019 · 5 comments

Comments

ottoyiu commented Mar 21, 2019

Thanos, Prometheus and Golang version used

  • thanos 0.3.2
    image: improbable/thanos:v0.3.2

Only reproducible in 0.3.2 and not 0.3.1.

What happened

  • store does not respond to any queries; clients like Grafana all time out.
  • CPU is pegged at 100% usage.
  • profiling CPU usage using:
# grab a 30s CPU profile from the store's HTTP port (saved to a local file named "profile")
curl http://xxx:10902/debug/pprof/profile -O
# render a flame graph and an SVG call graph from the downloaded profile
go-torch -f "flame.svg" thanos profile
go tool pprof -svg thanos profile > profile.svg

shows all the CPU time being spent on 'RemoveOldest':
https://github.com/GiedriusS/thanos/blob/9679a193f433353287ea3052320dbc9e46bc3e9e/pkg/store/cache.go#L131

[Screenshots: pprof call graph (profile003) and flame graph (flame)]
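For a quicker look, go tool pprof can also fetch the profile straight from the endpoint and print the hottest functions; a minimal sketch, reusing the placeholder host from the commands above:

# download the CPU profile over HTTP and list the top consumers in the terminal
go tool pprof -top http://xxx:10902/debug/pprof/profile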

What you expected to happen

  • CPU not to be pegged at 100%

How to reproduce it (as minimally and precisely as possible):

  • I don't know how to reproduce this, but it happens only on our largest Prometheus instances, with 1.5+ million head time series. Restarting store and making a few queries leads to the CPU being pegged at 100% again.
  • Edit: This problem does not occur in thanos-store 0.3.1.

Maybe relevant:
#873

Full logs to relevant components

Anything else we need to know
Linux 4.9.0-6-amd64 #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02) x86_64 GNU/Linux

ottoyiu (Author) commented Mar 21, 2019

As the changelog says, I can also bump index-cache-size up to something very large to revert to 0.3.1's behaviour.

⚠️ The #873 fix fixes the actual handling of index-cache-size. Handling of the limit for this cache was broken, so it was unbounded all the time. From this release the actual value matters and is extremely low by default. To "revert" the old behaviour (no boundary), use a large enough value.
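For illustration, a minimal sketch of raising the limit when launching the store component; apart from --index-cache-size, which is the flag under discussion, the other flags, paths, and the size value are illustrative assumptions:

# illustrative invocation only; tune the cache size to the host's available memory
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=bucket.yml \
  --index-cache-size=16GB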

FUSAKLA (Member) commented Mar 23, 2019

Yes, you should definitely try that. What is index-cache-size set to now?

The store possibly does not have enough space in the cache, so it has to remove the oldest entries all the time.

ottoyiu (Author) commented Mar 23, 2019

@FUSAKLA thanks for the reply! I have it set to 16GB now and don't have any more issues related to this.

Is there a formula to compute how big index-cache-size needs to be, relative to the number and size of the blocks?

The store possibly does not have enough space in the cache, so it has to remove the oldest entries all the time.

If that's the case, should store report a warning when the set being cached is too large for the configured LRU size?

FUSAKLA (Member) commented Mar 23, 2019

Great to hear. Well, it depends on the queries you send, how long the time range is, how many series... there are a lot of factors. I'm not sure there is one universal formula.

Hmm... not sure about the warning. It's still valid behaviour, aligned with the configured cache size, but I see the motivation.
I think it would suit you better to watch the cache metrics. There are:

  • thanos_store_index_cache_items_size_bytes
  • thanos_store_index_cache_items
  • thanos_store_index_cache_hits_total
  • thanos_store_index_cache_items_overflowed_total
  • thanos_store_index_cache_requests_total
  • thanos_store_index_cache_items_added_total
  • thanos_store_index_cache_items_evicted_total

I think those should hopefully tell you how big it should be.
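As a quick check, these can be read straight off the store's metrics endpoint; a minimal sketch, reusing the placeholder host from the report above:

# the store serves Prometheus metrics on the same HTTP port used for pprof above
curl -s http://xxx:10902/metrics | grep thanos_store_index_cache

If thanos_store_index_cache_items_size_bytes sits flat at the configured limit while thanos_store_index_cache_items_evicted_total keeps climbing, the cache is too small for the working set, and constant eviction is the likely cause of the churn described above.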

ottoyiu (Author) commented Mar 23, 2019

@FUSAKLA those metrics are definitely useful. Thank you!

I can definitely see the 250MB plateau that caused the constant CPU churn while the store kept evicting index entries. Going to close this issue now :)

[Graph: index cache size metric plateauing at 250MB]

ottoyiu closed this as completed Mar 23, 2019