Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log flooding with "skipping resharding" warning (remote_write with write_relabel_configs) #7124

Closed
Brightside56 opened this issue Apr 14, 2020 · 5 comments

Comments

@Brightside56
Copy link

Brightside56 commented Apr 14, 2020

What did you do?
Installed prometheus which scrapes some targets and writes metrics with label longterm=true (with the help of write_relabel_configs) to remote (InfluxDB v1.7.10). Prometheus is up and running for 5 minutes.

What did you expect to see?
Samples of metrics with label longterm=true in remote storage and crystal clean prometheus log without errors/warnings.

What did you see instead? Under which circumstances?
Remote write is working and samples I need are writing into remote storage, however prometheus is constantly flooding to log with Skipping resharding, last successful send was beyond threshold. I believe it's not normal for production use. When I disable write_relabel_configs and prometheus starts to write all metrics to remote - I don't see warnings anymore, however writing all metrics to remote is not my case)

Environment

  • System information:

    Prometheus is run in Docker container prom/prometeus:v2.17.1 (on Kubernetes 1.11) with Linux 4.19.88 x86_64

  • Prometheus version:

prometheus, version 2.17.1 (branch: HEAD, revision: ae041f97cfc6f43494bed65ec4ea4e3a0cf2ac69)
  build user:       root@806b02dfe114
  build date:       20200326-16:18:19
  go version:       go1.13.9

I've also tried prometheus docker image with version v2.15.1 which I have on another cluster. Result is same.

  • Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 15s
  evaluation_interval: 15s
scrape_configs:
- job_name: prometheus
  static_configs:
    - targets: ['localhost:9090']
- job_name: 'backup'
  static_configs:
    - targets: ['xx.xx.xx.xx:9100']
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: job_name
    target_label: longterm
    replacement: "true"
    action: replace
remote_write:
- url: "http://influxdb:8086/api/v1/prom/write?db=prometheus"
  write_relabel_configs:
  - source_labels: [longterm]
    regex: "true"
    action: keep

Also I've tried to use VictoriaMetrics v1.34.7 as remote storage. Result is same. I believe it's Prometheus issue and it's not remote storage related. When I disable write_relabel_configs there are no warnings anymore, but I don't want to write all metrics from Prometheus to remote, only part of them.

  • Logs (with debug enabled):
ts=2020-04-14T17:54:02.434Z caller=dedupe.go:112 component=remote level=warn remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1586886827 minSendTimestamp=1586886832
ts=2020-04-14T17:54:12.433Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.calculateDesiredShards samplesInRate=40.70478667535873 samplesOutRate=0.40327868852482673 samplesKeptRatio=0.009895446214914182 samplesPendingRate=-0.00048666128925889396 samplesPending=3.62512824512011 samplesOutDuration=0.00011184780134993974 timePerSample=0.00027734617407895615 desiredShards=0.00012176698219658768 highestSent=1.586886842e+09 highestRecv=1.586886851e+09
ts=2020-04-14T17:54:12.434Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.00012176698219658768 upperBound=1.3
ts=2020-04-14T17:54:22.434Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.calculateDesiredShards samplesInRate=45.04382934028698 samplesOutRate=0.44262295081986136 samplesKeptRatio=0.00981791909153685 samplesPendingRate=-0.00038627878393004966 samplesPending=2.2111833601796564 samplesOutDuration=0.00011988218107995179 timePerSample=0.00027084492762496047 desiredShards=0.00012576643740220896 highestSent=1.586886857e+09 highestRecv=1.586886862e+09
ts=2020-04-14T17:54:22.434Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.00012576643740220896 upperBound=1.3
ts=2020-04-14T17:54:32.433Z caller=dedupe.go:112 component=remote level=warn remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1586886857 minSendTimestamp=1586886862
ts=2020-04-14T17:54:32.750Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg="runShard timer ticked, sending samples" samples=6 shard=0
ts=2020-04-14T17:54:42.433Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.calculateDesiredShards samplesInRate=40.70805077778367 samplesOutRate=0.40327868852471127 samplesKeptRatio=0.009894653723637794 samplesPendingRate=-0.0004866223142777071 samplesPending=3.6251285958939024 samplesOutDuration=0.00010140815589116914 timePerSample=0.00025145924834794555 desiredShards=0.00011040151132867278 highestSent=1.586886872e+09 highestRecv=1.586886881e+09
ts=2020-04-14T17:54:42.433Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.00011040151132867278 upperBound=1.3
ts=2020-04-14T17:54:52.433Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.calculateDesiredShards samplesInRate=45.04644062222693 samplesOutRate=0.442622950819769 samplesKeptRatio=0.009817350456545372 samplesPendingRate=-0.000386256411405439 samplesPending=2.2111834720418178 samplesOutDuration=0.00011178064471293532 timePerSample=0.0002525414565736134 desiredShards=0.00011726725390399748 highestSent=1.586886887e+09 highestRecv=1.586886892e+09
ts=2020-04-14T17:54:52.433Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.00011726725390399748 upperBound=1.3
ts=2020-04-14T17:55:02.433Z caller=dedupe.go:112 component=remote level=warn remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1586886887 minSendTimestamp=1586886892
ts=2020-04-14T17:55:12.434Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.calculateDesiredShards samplesInRate=40.709721998225234 samplesOutRate=0.40327868852465215 samplesKeptRatio=0.0098942480172355 samplesPendingRate=-0.000486602361503663 samplesPending=3.6251287754683363 samplesOutDuration=9.668225261627861e-05 timePerSample=0.00023974054510536946 desiredShards=0.00010525649778796156 highestSent=1.586886902e+09 highestRecv=1.586886911e+09
ts=2020-04-14T17:55:12.434Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.00010525649778796156 upperBound=1.3
ts=2020-04-14T17:55:22.434Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.calculateDesiredShards samplesInRate=45.04777759858019 samplesOutRate=0.4426229508197217 samplesKeptRatio=0.009817059340924824 samplesPendingRate=-0.0003862449576759741 samplesPending=2.2111835293102287 samplesOutDuration=0.00011750858209302289 timePerSample=0.000265482352136059 desiredShards=0.0001232763429168163 highestSent=1.586886917e+09 highestRecv=1.586886922e+09
ts=2020-04-14T17:55:22.434Z caller=dedupe.go:112 component=remote level=debug remote_name=5cdc99 url="http://influxdb:8086/api/v1/prom/write?db=prometheus" msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.0001232763429168163 upperBound=1.3

@csmarchbanks
Copy link
Member

csmarchbanks commented Apr 14, 2020

Thank you for your report, looks like there are a couple things happening here:

  1. That warning should never be printed as your debug logs indicate that resharding is not necessary. The check to skip resharding should only be called if resharding is actually going to take place.
  2. At very low remote write volumes it is very easy to go through multiple batch send durations without new samples coming in which causes the skipping resharding case to be erroneously triggered.

Fixing the first case would also make it almost impossible to trigger the second case as low volumes are unlikely to ever reshard above the minimum anyway, so perhaps let's just start with that.

@thaniri
Copy link
Contributor

thaniri commented Apr 19, 2020

Hello, this may be an overly simple attempt at fixing this problem, I am still learning. Please let me know if I am heading in the correct direction on my fork: https://github.com/thaniri/prometheus/pull/2/files

If I understand what is going on here, resharding will be skipped if this if statement is evaluated to true.

Copy and pasting this if statement here means that the warning is not printed when we evaluate calculateDesiredShards(), but rather after we evaluate calculateDesiredShards() we later decide to not reshard because we have not written to remote storage in the configured time frame.

What I would like to know is what we should do with the failing test cases in queue_manager_test.go expected calculateDesiredShards to not want to reshard, wants to change from 10 to 1 shards. The test case looks like:

		{
			// Test that ensures that if we haven't successfully sent a
			// sample recently the queue will not reshard.
			startingShards: 10,
			reshard:        false,
			samplesIn:      1000,
			samplesOut:     10,
		},

The test case is correctly saying this is wrong, because we have moved the logic that "decides" to not reshard to a higher level in the code.

I cannot find an existing test suite for updateShardsLoop() where we moved the aforementioned if statement into.

Can you please offer up any advice on how we should handle this test case? It seems like an actually important case to capture, but I am not sure where to move this test to (or if I need to create a completely new test suite).

@csmarchbanks
Copy link
Member

Thanks for looking into this @thaniri! The PR you linked looks like a good approach to me, I like that calculateDesiredShards() would now only deal with the math of calculating desired shards.

As far as testing goes, I would say you should change that test to focus on testing updating sharding at the higher level. Perhaps break the inner part of updateShardsLoop() into an updateShards() function that would be easier to test?

@thaniri
Copy link
Contributor

thaniri commented Apr 19, 2020

Thank you @csmarchbanks.

I have now put up a pull request: #7143

As for the testing, I realized that the test case that was failing actually just needed to be tweaked a little bit, not that we needed an entirely new test suite.

@csmarchbanks
Copy link
Member

Fixed in #7143

@prometheus prometheus locked as resolved and limited conversation to collaborators Nov 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants