proposal: increase default value for `max_samples_per_send` #5166

Open
valyala opened this Issue Jan 31, 2019 · 9 comments

valyala commented Jan 31, 2019

Proposal

The default value for max_samples_per_send (100) is too low for any non-idle Prometheus setup with remote_write enabled. It results in overly frequent requests to remote storage when Prometheus scrapes more than a few hundred samples per second. The high request rate wastes resources on both the Prometheus and the remote storage side, so users have to increase max_samples_per_send after their first attempt to write metrics to remote storage.

It would be great if the default value for max_samples_per_send were increased from 100 to 1000 or even 10,000. This would simplify remote_write configuration for the majority of users.
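
For illustration, this is the kind of override users typically end up adding by hand today (a sketch only: the endpoint URL is a placeholder, and 1000 is just one commonly used value):

    remote_write:
      - url: "http://remote-storage.example.com/api/v1/write"
        queue_config:
          max_samples_per_send: 1000

Making something in that range the default would let most users drop the queue_config override entirely.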

cstyan commented Feb 1, 2019

> The default value for max_samples_per_send (100) is too low for any non-idle Prometheus setup with remote_write enabled. It results in overly frequent requests to remote storage when Prometheus scrapes more than a few hundred samples per second.

Can you elaborate? I don't think we're seeing any issues like this with any of our Prometheus instances.

valyala commented Feb 1, 2019

  • GitHub users commonly set high values for max_samples_per_send in their configs, which suggests the default is too low.
  • See this and this issue; both suggest increasing max_samples_per_send to 1000 in order to fix performance issues.
juliusv commented Feb 4, 2019

Yeah, 100 seems a bit low.

bboreham commented Feb 4, 2019

Another data point: the Weaveworks customer config sets it to 1000.

@valyala a lot of those configs on GitHub also have max_shards: 10000, which suggests they haven't thought this through...

valyala commented Feb 4, 2019

Yeah, the default max_shards should be lowered to an appropriate value when the default max_samples_per_send is increased.

beorn- commented Apr 17, 2019

System CPU usage was quite high after we added a significant number of metrics.

We were about to scale the platform when we noticed that system CPU usage was high. After profiling, we saw that the kernel was spending a very significant amount of time handling lookups in its hashtable of established TCP connections.

After checking, I found that our setup was pushing 150k datapoints/s, which at 100 samples per send meant 1500 new TCP connections/s, and hence a serious number of connections in TIME_WAIT (perfectly normal under those conditions).

Load was through the roof (more than 30 on a 12-core server) and rule evaluation time exploded (30s instead of the usual milliseconds).

We ended up with:

    queue_config:
      capacity: 300000
      max_shards: 100
      max_samples_per_send: 10000

It now works fine, with a load of 2-3 and very good rule evaluation times.
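
The arithmetic behind that improvement is simple; a rough illustration in Go with the numbers above (illustrative only, not real measurement code):

    package main

    import "fmt"

    func main() {
        const samplesPerSecond = 150000.0 // ingestion rate reported above

        // requests/s = samples/s divided by max_samples_per_send
        fmt.Printf("before: ~%.0f requests/s\n", samplesPerSecond/100)   // with the default of 100
        fmt.Printf("after:  ~%.0f requests/s\n", samplesPerSecond/10000) // with max_samples_per_send: 10000
    }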

One problem remains: it eats up more memory, and if I'm not mistaken the Prometheus code is not really meant for big max_samples_per_send values. Maybe fixing keep-alive over HTTP/1.1 would be a quick win too?

beorn- commented Apr 17, 2019

About keep-alive connections: @elwinar and I came up with elwinar@a153ee9.

It seems to fix the issue.
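
For reference, the general pattern (a minimal sketch of the idea, not the exact change in that commit) is to read the response body to completion before closing it, so that net/http can return the connection to its keep-alive pool instead of opening a new one for every send:

    package remote // illustrative package name

    import (
        "io"
        "io/ioutil"
        "net/http"
    )

    // drainAndClose reads resp.Body to completion and closes it so the
    // http.Transport can reuse the keep-alive TCP connection for the next send.
    // (Hypothetical helper shown for illustration only.)
    func drainAndClose(resp *http.Response) {
        io.Copy(ioutil.Discard, resp.Body)
        resp.Body.Close()
    }

Without the drain, each remote_write request can leave its connection unreusable and force a fresh TCP handshake, which is what produced the TIME_WAIT pile-up described above.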

bboreham commented Apr 17, 2019

FYI, “HTTP pipelining” means something different from keep-alive, and is not relevant here. See https://en.m.wikipedia.org/wiki/HTTP_pipelining

This doesn’t impact your suggestion; I just like to keep the terminology clear.

beorn- commented Apr 17, 2019

I stand corrected. To avoid any unneeded confusion, I have edited my past comments. Thanks @bboreham.

elwinar added a commit to elwinar/prometheus that referenced this issue Apr 18, 2019

Exhaust every request body before closing it (prometheus#5166)
From the documentation:
> The default HTTP client's Transport may not
> reuse HTTP/1.x "keep-alive" TCP connections if the Body is
> not read to completion and closed.

This effectively enables keep-alive for the fixed requests.

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>

brian-brazil added a commit that referenced this issue Apr 18, 2019

Exhaust every request body before closing it (#5166) (#5479)
From the documentation:
> The default HTTP client's Transport may not
> reuse HTTP/1.x "keep-alive" TCP connections if the Body is
> not read to completion and closed.

This effectively enables keep-alive for the fixed requests.

Signed-off-by: Romain Baugue <romain.baugue@elwinar.com>