prometheus 2.4.3 not increasing shards to match remote write capacity #4771

Closed
BertHartm opened this Issue Oct 22, 2018 · 6 comments

BertHartm commented Oct 22, 2018

Bug Report

What did you do?
I upgraded from prometheus 2.2.1 to 2.4.3 while using the remote write endpoint

What did you expect to see?
I was using 3 remote_write shards before and not dropping metrics. I expected that to continue. I write about 30k metrics/s, and the scrape interval is 15s.

What did you see instead? Under which circumstances?
The number of remote storage shards goes to 3 on the new version by default, but it's dropping about 40k samples per minute. I would expect it to increase the number of shards to compensate, but that doesn't seem to happen. Taking down my remote_write endpoint causes the number of shards to increase, which resolves the issue until a restart.

Environment

  • System information:

Linux 4.15.0-1023-azure x86_64

  • Prometheus version:

prometheus, version 2.4.3 (branch: HEAD, revision: 167a4b4)
build user: root@1e42b46043e9
build date: 20181004-08:42:02
go version: go1.11.1

  • Prometheus configuration file:
global:
  scrape_interval:     15s # original default 1m
  scrape_timeout:      10s # original default 10s
  evaluation_interval: 15s # original default 1m evaluating rules

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    environment: dev

rule_files:
  - /etc/prometheus/rules/platform/*

... lots of scrape configs ...

remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
  • Logs:
ts=2018-10-22T20:43:17.202625602Z caller=queue_manager.go:340 component=remote queue=0:http://m3coordinator:7201/api/v1/prom/remote/write msg="Remote storage resharding" from=1 to=2
ts=2018-10-22T20:45:17.20249228Z caller=queue_manager.go:340 component=remote queue=0:http://m3coordinator:7201/api/v1/prom/remote/write msg="Remote storage resharding" from=2 to=3

then lots of:

ts=2018-10-22T20:49:10.917752342Z caller=queue_manager.go:230 component=remote queue=0:http://m3coordinator:7201/api/v1/prom/remote/write msg="Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed."

I briefly take down m3coordinator:

ts=2018-10-22T20:51:47.202538115Z caller=queue_manager.go:340 component=remote queue=0:http://m3coordinator:7201/api/v1/prom/remote/write msg="Remote storage resharding" from=3 to=5

and fewer of the "remote storage queue full" messages

It seems like the dropped samples are happening roughly every minute (we have a few scrape jobs that run every minute), and max_over_time(prometheus_remote_storage_queue_length[1m]) only shows values around 40k sometimes, usually under 10k. Also, 5 shards doesn't completely solve the dropped samples, only mostly.
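
A back-of-the-envelope check under the defaults reported below: if the queue buffers roughly 10k samples per shard, 3 shards hold on the order of 30k samples, about one second of ingestion at 30k samples/s, so a burst from the 1-minute scrape jobs could plausibly overflow the queue before resharding catches up. A few PromQL queries along these lines can show whether resharding keeps pace (the shard and dropped-sample metric names are assumptions based on the 2.x remote-write queue manager and may differ between versions):

# shards currently in use (assumed metric name)
prometheus_remote_storage_shards

# samples discarded because the queue was full (assumed metric name)
rate(prometheus_remote_storage_dropped_samples_total[5m])

# queue depth, as referenced above
max_over_time(prometheus_remote_storage_queue_length[1m])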

BertHartm commented Oct 23, 2018

I just noticed that 2.4.3 seems to have changed the default for remote_write -> queue_config -> capacity from 100k to 10k. It also seems to have changed max_retries from 10 to 3.
This doesn't jibe with the documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cremote_write%3E

Bumping my 2.4.3 instance to the higher capacity and retry count seems to fix my issue (though it also causes a longer stay at 2 shards, which isn't quite enough either).

I'm not sure if the documentation is out of date, or if this is a bug in prometheus.

As an aside, is there a way to hint at the number of shards Prometheus should use at startup? It takes about 5 minutes to reach a stable value now, and it took 3 hours to get there with the default values.
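
For anyone hitting the same thing, a minimal sketch of the override described above, restoring the larger capacity and retry count via queue_config (values taken from this thread; check the queue_config documentation for your version), would look roughly like:

remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 100000   # 2.4.x default reportedly 10000
      max_retries: 10    # 2.4.x default reportedly 3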

simonpasquier commented Oct 25, 2018

I'm not sure if the documentation is out of date

It was and has been fixed by #4715. It will appear correctly once v2.5.0 is out.

BertHartm commented Oct 25, 2018

Thanks,

So my original issue comes down to a lack of tuning of the remote storage queue on my part. I'm happy to close the issue on that basis, or leave it open if we believe the default behavior should be better.

simonpasquier commented Oct 25, 2018

I think we can close this issue. The rationale for changing the defaults was detailed here: the new defaults minimize the risk of Prometheus being OOM-killed when the remote write endpoint is down.

BertHartm closed this Oct 25, 2018

ntindall commented Jan 3, 2019

@brian-brazil @tomwilkie @gouthamve - we should document changes to default configuration options in the release notes. I don't see anything about it in the v2.5.0 notes...

We just got a bit burned by this upgrading from 2.3.x to 2.6.0.

simonpasquier commented Jan 4, 2019

The change was documented in the changelog in the v2.4.0 section.
