
Information needed #2142

Closed
AsenZahariev opened this issue Dec 5, 2017 · 9 comments

@AsenZahariev

AsenZahariev commented Dec 5, 2017

Hello folks,
I saw the new option REMOTE_BUFFER_SIZE.
Can we have more information about this? What exactly is this 1024 * 1024 value?

Thank you!
Asen Z

@deniszh
Member

deniszh commented Dec 5, 2017

Hello @AsenZahariev ,
This is part of the rather new PR #2136.
Citing it here:

Currently pickle & msgpack use lots of small reads to decode the responses from remote hosts, which causes slowdowns for the remote host and can drastically increase the time taken to receive large responses.
This PR adds a new BufferedHTTPReader class that can be used to wrap the result before passing it to load(). It reads from the underlying response object in chunks to keep memory usage reasonable without slowing down the producer.

REMOTE_BUFFER_SIZE is the default buffer size for HTTP calls to remote hosts in the cluster. The default is 1024 * 1024 = 1048576 bytes, i.e. 1 MB, which should be fine for most clusters.
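To illustrate the idea behind the BufferedHTTPReader (a simplified sketch, not the actual graphite-web implementation): the decoder's many tiny read() calls are served from an in-memory buffer that is refilled from the socket in large chunks.

```python
import io

CHUNK_SIZE = 1024 * 1024  # 1 MB, mirroring the REMOTE_BUFFER_SIZE default

class BufferedReader:
    """Read from `raw` in CHUNK_SIZE pieces; serve small reads from memory."""
    def __init__(self, raw, chunk_size=CHUNK_SIZE):
        self.raw = raw
        self.chunk_size = chunk_size
        self.buf = b""

    def read(self, n=-1):
        if n < 0:
            # Drain the buffer plus whatever remains in the source
            return self.buf + self.raw.read()
        # Refill the buffer in large chunks until we can satisfy the request
        while len(self.buf) < n:
            chunk = self.raw.read(self.chunk_size)
            if not chunk:
                break
            self.buf += chunk
        out, self.buf = self.buf[:n], self.buf[n:]
        return out

# A pickle/msgpack decoder issuing many tiny reads now only triggers
# one large read per megabyte of payload.
reader = BufferedReader(io.BytesIO(b"abcdef" * 10))
print(reader.read(4))  # → b'abcd'
```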

@deniszh
Member

deniszh commented Dec 5, 2017

Are you experiencing any issues, or just curious?

@AsenZahariev
Author

AsenZahariev commented Dec 5, 2017

First, thank you for the quick answer. Just curious, because we have a large cluster and very heavy queries.
Let me briefly describe what we have and what we have done so far, so you can get the picture behind my curiosity.

We have 4 graphite clusters running on 0.9.15 (I may be slightly off about the exact commit version); combined, these servers receive around 4 million metrics per minute. All carbon components run on these nodes (5 relays and 10 caches per node), together with the graphite webapp using memcached for 5 minutes (600 seconds). In front of everything there is a load balancer. The relays use consistent hashing to distribute the metrics, and each relay can send metrics to all caches (10 caches per node, 40 caches combined). No aggregation. Everything runs on PyPy, which, by the way, helped a lot in reducing CPU and RAM utilization for carbon's relay and cache. Each carbonlink_host points to its local caches and instances.

What we have done to increase read speed for Grafana and overall stability is build a separate box with only the latest graphite webapp from master.
For the configuration of that box, we run a local memcached with a configured cache policy (e.g. 0,60; 7200,120; 21600,180; and so on).
In the cluster_servers directive we put the IP addresses of the 4 nodes with their respective ports.
The first issue we encountered was that queries started to time out, so we set large timeout values and increased the retries.
We started to use POST requests (REMOTE_STORE_USE_POST = True).
We use the new option (REMOTE_BUFFER_SIZE = 1024 * 1024), and yes, it helped a lot!
On the graphite nodes we only enabled REMOTE_PREFETCH_DATA.
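For illustration, the front-end box settings described above might look roughly like this in local_settings.py. This is a sketch only: the host addresses and ports are placeholders, and option names can differ between graphite-web releases.

```python
# Sketch of a local_settings.py for a front-end webapp box, assuming
# graphite-web 1.x option names; values are illustrative, not recommendations.

# The four backend cluster nodes (placeholder addresses)
CLUSTER_SERVERS = [
    "10.0.0.1:8080",
    "10.0.0.2:8080",
    "10.0.0.3:8080",
    "10.0.0.4:8080",
]

# Local memcached with a tiered cache policy:
# (query time range in seconds, cache TTL in seconds)
MEMCACHE_HOSTS = ["127.0.0.1:11211"]
DEFAULT_CACHE_POLICY = [(0, 60), (7200, 120), (21600, 180)]

# Use POST for remote fetches (no URL-length limit, unlike GET)
REMOTE_STORE_USE_POST = True

# 1 MB read buffer for responses from remote hosts
REMOTE_BUFFER_SIZE = 1024 * 1024
```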

I would be forever grateful if this thread could become a guide for building a sustainable graphite cluster, because we searched and experimented a lot before even thinking of starting a thread here.

P.S.

  1. This is one of the clusters; the other one is even ...heavier.
  2. We have two instances of Grafana (2.5 and 4.4.6), both with 700 dashboards and around 5k-6k graphs, with crazy queries using multiple wildcards (*).
  3. Seyren v1.5 with nearly 400 checks.
  4. We can't update the graphite nodes to a newer version.

Kind regards,
Asen Z.

@deniszh
Member

deniszh commented Dec 5, 2017

Cool, thanks for sharing!
Please note that (IIRC) this behavior is now enabled by default and doesn't require any additional variables.
REMOTE_PREFETCH_DATA and REMOTE_STORE_USE_POST are of course useful for cluster tuning.

@AsenZahariev
Author

AsenZahariev commented Dec 5, 2017

Sorry, IIRC? I'm a little bit new :)
Anyway, do you have any other recommendations/best practices for a setup like mine?

@deniszh
Member

deniszh commented Dec 5, 2017

IIRC is an acronym for "If I Recall (or Remember) Correctly".
And it looks like I don't, because REMOTE_PREFETCH_DATA is not used anymore (after #2093). REMOTE_STORE_USE_POST can be useful but is not mandatory for the features above; POST has no limit on request size, contrary to GET.
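To illustrate the GET size limitation with a rough sketch (the metric names here are made up): a query whose wildcards expand to hundreds of targets produces a query string far longer than common URL-length limits, while the same parameters fit fine in a POST body.

```python
from urllib.parse import urlencode

# Hypothetical wildcard expansion: 500 matched series
targets = [("target", "servers.host%04d.cpu.load" % i) for i in range(500)]
query = urlencode(targets)

# Over 16 KB of query string -- beyond the ~8 KB URL limit many
# servers enforce for GET, but unproblematic as a POST request body.
print(len(query))
```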

@AsenZahariev
Author

AsenZahariev commented Dec 6, 2017

Hey Denis,
We switched back to POST, since GET has some limitations: in general it is perfect for small queries, but not so much for something with multiple wildcards.
About REMOTE_PREFETCH_DATA: yes, we saw that change, and we only use it on our graphite nodes (version 0.9.15). Do you think there is some limitation, given that we use 0.9.15 (back-end graphite webapp nodes) and 1.1.0 (front-end graphite webapp)?

@deniszh
Member

deniszh commented Dec 6, 2017

For 0.9.15 PREFETCH is still valid, of course.

@AsenZahariev
Author

AsenZahariev commented Dec 15, 2017

@deniszh Thank you for your feedback! I can confirm that putting a graphite webapp in front of your graphite cluster with the settings we have (of course, once again, it depends on your environment/infrastructure) makes reading better. I have a couple more questions, but they will go in a separate thread. Cheers!
