Upgrade from 2.7.1 to 2.8.0 consistently provokes a ~30% loss of metrics passing through remote_write #5389

Closed
beorn- opened this Issue Mar 21, 2019 · 11 comments

beorn- commented Mar 21, 2019

Bug Report

What did you do?
After an upgrade from Prometheus 2.7.1 to Prometheus 2.8.0 I faced a large metric loss on the remote_write path.

What did you expect to see?
Nothing special.

What did you see instead? Under which circumstances?
Losing roughly 30% of metrics, with the log message "dropped sample for series that was not explicitly dropped via relabelling".

Environment
Debian stretch with the GitHub release binary.

  • System information:

Linux 4.18.0-0.bpo.1-amd64 x86_64

  • Prometheus version:

prometheus, version 2.8.0 (branch: HEAD, revision: 5936949)
build user: root@4c4d5c29b71f
build date: 20190312-07:46:58
go version: go1.11.5

  • Prometheus configuration file:
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  external_labels:
      monitor: 'prometheus'

rule_files:
  - "rules/*.yml"
  - "rules/app/*.yml"

scrape_configs:

[..]

remote_write:
  - url: "http://<redacted>:8000/write"
    queue_config:
      capacity: 1000000
      max_shards: 2000
  • Logs:

prom28.txt

  • Rollback:

Went fine; back to normal with version 2.7.1.

beorn- (Author) commented Mar 21, 2019

(1.7M time series in this instance; tried on different Prometheus instances and got the same results with different setups.) Since samples are dropped before the metrics are actually pushed, I don't think it's related to the software receiving the remote_write stream. Here it's Metrictank.

krasi-georgiev (Member) commented Mar 21, 2019

@tomwilkie

tomwilkie (Member) commented Mar 21, 2019

Thanks for the report @beorn-!

Looks like you have a corrupt WAL:

level=warn ts=2019-03-13T14:06:47.52121461Z caller=head.go:450 component=tsdb msg="unknown series references" count=98915

Can you grab screenshots of the following queries please?

  • rate(prometheus_tsdb_head_samples_appended_total[1m])
  • rate(prometheus_remote_storage_samples_in_total[1m])
  • rate(prometheus_remote_storage_dropped_samples_total[1m])

(edit: added one more)

Those missing series records from the WAL explain why the remote write code is dropping samples for those series. When the next checkpoint hits, the remote write code will pick this up and start sending entries for those series. I'm surprised this doesn't happen earlier TBH - I will investigate.
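
To make the checkpoint remark concrete, here is a minimal, hypothetical sketch in Go (invented names, not Prometheus's actual code). The assumption it illustrates: a checkpoint re-emits series records for every series still alive in the head, so a remote-write WAL reader that replays the checkpoint re-learns ref IDs whose original series records it missed and stops dropping the corresponding samples.

package main

import "fmt"

// checkpointRefs stands in for the series records (ref -> labels) that a
// checkpoint would re-emit for every series still alive in the head.
// Hypothetical helper for illustration only.
func checkpointRefs(live map[uint64]string) map[uint64]string {
	out := make(map[uint64]string, len(live))
	for ref, labels := range live {
		out[ref] = labels
	}
	return out
}

func main() {
	// The reader missed the series record for ref 42 (e.g. a damaged WAL
	// segment), so its samples are currently being dropped.
	known := map[uint64]string{7: `up{job="node"}`}
	head := map[uint64]string{7: `up{job="node"}`, 42: `up{job="app"}`}

	// Replaying the checkpoint fills the gap; samples for ref 42 can be
	// forwarded again instead of being dropped.
	for ref, labels := range checkpointRefs(head) {
		known[ref] = labels
	}
	fmt.Println(len(known)) // 2
}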

krasi-georgiev (Member) commented Mar 21, 2019

By the way, we have been getting "unknown series references" for quite a while and never found the real culprit for that.
prometheus/tsdb#21

beorn- (Author) commented Mar 21, 2019

[screenshot: prometheus_2_8_issue_5389_v2]

Here you go @tomwilkie

The green graph is 2.7.1; the others are 2.8.0, with a few restarts along the way.

cstyan (Contributor) commented Mar 21, 2019

Unfortunately there isn't much we can do in 2.8 regarding remote write if the WAL is corrupt.

In prior versions remote write got the samples to send by copying them from scrapes, but now we read them from the WAL. Within the WAL there are series records (telling us the metric name and labels) and sample records (carrying a ref ID that points at a series record, plus the timestamp and value for that sample).

In 2.8 this means we cache the series records as we read them. "Unknown series references" means we are seeing sample records whose ref ID we have never seen or cached, so remote write won't be able to send those samples on. That's the dropped-samples metric you're seeing when you're running 2.8.
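
A hedged sketch of that mechanism, with made-up Go type and field names rather than Prometheus's real ones: series records populate a ref-to-labels cache, and any sample record whose ref is not in that cache has to be dropped, which is what the dropped-samples counter records.

package main

import "fmt"

// A series record maps a reference ID to the metric name and labels.
type seriesRecord struct {
	ref    uint64
	labels map[string]string
}

// A sample record carries only a ref ID plus the timestamp and value.
type sampleRecord struct {
	ref uint64
	t   int64
	v   float64
}

// walReader caches series records so later sample records can be resolved.
type walReader struct {
	series  map[uint64]seriesRecord
	dropped int // analogous to prometheus_remote_storage_dropped_samples_total
}

func (r *walReader) readSeries(recs ...seriesRecord) {
	for _, s := range recs {
		r.series[s.ref] = s
	}
}

func (r *walReader) readSamples(recs ...sampleRecord) (send []sampleRecord) {
	for _, s := range recs {
		if _, ok := r.series[s.ref]; !ok {
			// The series record was never seen (e.g. missing from a damaged
			// WAL), so the sample cannot be labelled and has to be dropped.
			r.dropped++
			continue
		}
		send = append(send, s)
	}
	return send
}

func main() {
	r := &walReader{series: map[uint64]seriesRecord{}}
	r.readSeries(seriesRecord{ref: 1, labels: map[string]string{"__name__": "up"}})
	out := r.readSamples(
		sampleRecord{ref: 1, t: 1553172000000, v: 1},
		sampleRecord{ref: 2, t: 1553172000000, v: 1}, // ref 2 was never announced
	)
	fmt.Printf("forwarded=%d dropped=%d\n", len(out), r.dropped) // forwarded=1 dropped=1
}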

beorn- (Author) commented Mar 22, 2019

So @cstyan, somehow flushing the WAL should solve my issue, right? Is there some fsck-ish tool somewhere? Can it be safely reset?

tomwilkie (Member) commented Mar 25, 2019

QQ - where is the Prometheus data stored? Is this a local or a network disk? Are you using a particular cloud provider?

tomwilkie (Member) commented Mar 25, 2019

Also, I suspect if you leave it running for long enough the errors will subside - if possible it would be worth a try.

beorn- (Author) commented Mar 25, 2019

It is stored locally on baremetal servers.

I've checked: the WAL keeps something like 5 hours of data. During the reload I still got

prometheus[13613]: level=warn ts=2019-03-25T14:39:53.746021713Z caller=head.go:440 component=tsdb msg="unknown series references" count=99630

It's been days since my last upgrade test, so the WAL segments are brand new.

According to the explanations/assumptions given here I expected to be out of the woods, but I've had the very same problem. A rollback instantly fixed the issue.

brian-brazil added a commit that referenced this issue Apr 2, 2019

Fixes #5424 #5389
Signed-off-by: Brian Brazil <brian.brazil@robustperception.io>

beorn- (Author) commented Apr 5, 2019

The fix works.
