
error tailing WAL remote_write #5347

Open
andridzi opened this issue Mar 12, 2019 · 12 comments

@andridzi commented Mar 12, 2019


Bug Report

What did you do?
After upgrading from 2.7.0 to 2.8.0, I started getting WAL tailing errors.

  • Logs:
msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
msg="error tailing WAL" err="unexpected checksum f4c57f67, expected dd5fb88e"
msg="error tailing WAL" err="record length greater than a single page: 59671 > 32768"
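
For context on the third message (as I understand the TSDB WAL format, not quoted from the Prometheus source): the WAL is written in fixed-size 32 KiB pages, and no single record fragment should claim a length larger than a page, so the tailing reader treats 59671 > 32768 as corruption. A minimal sketch of that kind of check, with made-up names:

```go
// Hypothetical sketch of the length check behind the third error above.
// pageSize matches the 32 KiB WAL page size; the function name is made up
// for illustration and is not the actual Prometheus identifier.
package main

import "fmt"

const pageSize = 32 * 1024 // the TSDB WAL is written in 32 KiB pages

func validateRecordLength(length int) error {
	if length > pageSize {
		return fmt.Errorf("record length greater than a single page: %d > %d", length, pageSize)
	}
	return nil
}

func main() {
	// Reproduces the shape of the error reported above.
	fmt.Println(validateRecordLength(59671))
}
```
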
@simonpasquier (Member) commented Mar 13, 2019

This comment has been minimized.

@andridzi (Author) commented Mar 13, 2019

I've also tried deleting the WAL records and restarting Prometheus, and even after that I still get the same errors.

@tomwilkie (Member) commented Mar 13, 2019

Sorry you hit an issue, @andridzi. Can you post the logs from your Prometheus at startup, please? Also, it might be worth grabbing a copy of the WAL if it's not too late.

@andridzi (Author) commented Mar 13, 2019

Logs from startup:

```
level=info ts=2019-03-13T13:26:07.777064243Z caller=main.go:322 build_context="(go=go1.11.5, user=root@4c4d5c29b71f, date=20190312-07:46:58)"
level=info ts=2019-03-13T13:26:07.777090826Z caller=main.go:323 host_details="(Linux 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 prometheus-785c5d89d5-6lbxc example.com)"
level=info ts=2019-03-13T13:26:07.777113029Z caller=main.go:324 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-03-13T13:26:07.777138614Z caller=main.go:325 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-03-13T13:26:07.777964478Z caller=main.go:640 msg="Starting TSDB ..."
level=info ts=2019-03-13T13:26:07.778024951Z caller=web.go:418 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-03-13T13:26:07.81206711Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1549756800000 maxt=1549951200000 ulid=01D3GF02HYK23SAKD453055FVF
level=info ts=2019-03-13T13:26:07.815692495Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1549951200000 maxt=1550145600000 ulid=01D3P8C519VZKRRY9C5TBCACB7
level=info ts=2019-03-13T13:26:07.818478255Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550145600000 maxt=1550340000000 ulid=01D3W1RJA1JQ8N8QWMBH51ZG22
level=info ts=2019-03-13T13:26:07.82103567Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550340000000 maxt=1550534400000 ulid=01D41V51067TW456XPZDV9C3ED
level=info ts=2019-03-13T13:26:07.824171801Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550534400000 maxt=1550728800000 ulid=01D47MJCABT01HJER7W83PP0ZC
level=info ts=2019-03-13T13:26:07.826735363Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550728800000 maxt=1550923200000 ulid=01D4DDYAP7TKM6HAECGRV424Q8
level=info ts=2019-03-13T13:26:07.829232544Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550923200000 maxt=1551117600000 ulid=01D4K7AX1GZ85V84ANZFMAH8BD
level=info ts=2019-03-13T13:26:07.832015781Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551117600000 maxt=1551312000000 ulid=01D4S0QPBH05NMGDHB5AAXYJFY
level=info ts=2019-03-13T13:26:07.834685258Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551312000000 maxt=1551506400000 ulid=01D4YT52HH975XF77MXYR847NY
level=info ts=2019-03-13T13:26:07.836631027Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551700800000 ulid=01D54KHJVARF0N1TDG0FHT2KJ7
level=info ts=2019-03-13T13:26:07.838478687Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551895200000 ulid=01D5ACXP41DQQF82H4ABYWHH6G
level=info ts=2019-03-13T13:26:07.840387528Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1552089600000 ulid=01D5G6A55JERPHWY8A6VDKYNDD
level=info ts=2019-03-13T13:26:07.842182655Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552089600000 maxt=1552284000000 ulid=01D5NZR4X6A19K7WV5WEQ5C3ZZ
level=info ts=2019-03-13T13:26:07.844110695Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552284000000 maxt=1552348800000 ulid=01D5QXG8XBHE3F31F84B3T4MWE
level=info ts=2019-03-13T13:26:07.846244934Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552348800000 maxt=1552413600000 ulid=01D5SV9MEZ8QKZ909DN2PHGPEC
level=info ts=2019-03-13T13:26:07.847386486Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552413600000 maxt=1552435200000 ulid=01D5TFWDD8567YNQ5SQT9HKWTR
level=info ts=2019-03-13T13:26:07.848354129Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552456800000 maxt=1552464000000 ulid=01D5V4FAGFM90R0XTZM7JYM3WH
level=info ts=2019-03-13T13:26:07.849304108Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552435200000 maxt=1552456800000 ulid=01D5V4FJNVTZKYWX25G1Q86614
level=info ts=2019-03-13T13:26:07.850217438Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552464000000 maxt=1552471200000 ulid=01D5VBB1QSWPF0ANBC3AMGMA1J
level=info ts=2019-03-13T13:26:07.851150379Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552471200000 maxt=1552478400000 ulid=01D5VJ6S07AWEXDGB6BT1YA06S
level=warn ts=2019-03-13T13:26:20.684901032Z caller=head.go:450 component=tsdb msg="unknown series references" count=4
level=info ts=2019-03-13T13:26:20.797806959Z caller=main.go:655 msg="TSDB started"
level=info ts=2019-03-13T13:26:20.79788232Z caller=main.go:724 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
ts=2019-03-13T13:26:20.799957248Z caller=dedupe.go:111 component=remote level=info queue=0:http://prometheus-postgresql-adapter.monitoring.svc.lu-production.cluster:9201/write msg="starting WAL watcher" queue=0:http://prometheus-postgresql-adapter.monitoring.svc.example.cluster:9201/write
level=info ts=2019-03-13T13:26:20.800342801Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T13:26:20.801429085Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T13:26:20.80246004Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T13:26:20.803462918Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-03-13T13:26:20.832454514Z caller=main.go:751 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-03-13T13:26:20.832502549Z caller=main.go:609 msg="Server is ready to receive web requests."
```

andridzi closed this Mar 13, 2019

@andridzi (Author) commented Mar 13, 2019

And then the error messages:

```
ts=2019-03-13T13:26:48.653714952Z caller=dedupe.go:111 component=remote level=error queue=0:http://prometheus-postgresql-adapter.monitoring.svc.example.cluster:9201/write msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
ts=2019-03-13T13:27:00.800420896Z caller=dedupe.go:111 component=remote level=info queue=0:http://prometheus-postgresql-adapter.monitoring.svc.example.cluster:9201/write msg="Remote storage resharding" from=1 to=4
ts=2019-03-13T13:27:10.800449729Z caller=dedupe.go:111 component=remote level=info queue=0:http://prometheus-postgresql-adapter.monitoring.svc.example.cluster:9201/write msg="Remote storage resharding" from=4 to=1
ts=2019-03-13T13:27:30.800464227Z caller=dedupe.go:111 component=remote level=info queue=0:http://prometheus-postgresql-adapter.monitoring.svc.example.cluster:9201/write msg="Remote storage resharding" from=1 to=2
ts=2019-03-13T13:27:40.800438851Z caller=dedupe.go:111 component=remote level=info queue=0:http://prometheus-postgresql-adapter.monitoring.svc.example.cluster:9201/write msg="Remote storage resharding" from=2 to=1
ts=2019-03-13T13:28:42.872995251Z caller=dedupe.go:111 component=remote level=error queue=0:http://prometheus-postgresql-adapter.monitoring.svc.example.cluster:9201/write msg="error tailing WAL" err="unexpected last record, dropping buffer"
```

andridzi reopened this Mar 13, 2019

@tomwilkie (Member) commented Mar 14, 2019

Is this still happening? When we hit a corrupt WAL in the remote_write code, we back off and try to replay the WAL from the last checkpoint. Hopefully by then a new checkpoint has been created and the WAL can be tailed successfully.

It's strange that startup didn't detect any corruption (although I'm not sure about msg="unknown series references" count=4). Could you grab a copy of the WAL and share it with me?
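
For anyone following along, here is a rough, simplified sketch of the back-off-and-replay behaviour described above. The names (tailWAL, lastCheckpointSegment) and the fixed 5-second backoff are illustrative assumptions, not the actual watcher code:

```go
// Simplified, hypothetical sketch of "back off and replay from the last
// checkpoint" when the remote_write tailer hits a corrupt WAL segment.
// Not the actual Prometheus implementation.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errCorruption = errors.New("unexpected non-zero byte in page term bytes")

var attempts int

// tailWAL stands in for reading records from a segment onwards; here it
// fails once with a corruption error and then succeeds.
func tailWAL(fromSegment int) error {
	attempts++
	fmt.Printf("tailing WAL from segment %d (attempt %d)\n", fromSegment, attempts)
	if attempts == 1 {
		return errCorruption
	}
	return nil
}

// lastCheckpointSegment stands in for finding the newest checkpoint on disk;
// by the time we retry, a newer, clean checkpoint may exist.
func lastCheckpointSegment() int { return 42 + attempts }

func main() {
	backoff := 5 * time.Second
	for {
		err := tailWAL(lastCheckpointSegment())
		if err == nil {
			fmt.Println("tailing resumed successfully")
			return
		}
		fmt.Printf("error tailing WAL: %v; retrying from the last checkpoint in %s\n", err, backoff)
		time.Sleep(backoff)
	}
}
```
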

@andridzi (Author) commented Mar 14, 2019

Unfortunately, that is not possible due to security reasons.
And yes, I'm still periodically getting these errors.

@tomwilkie (Member) commented Mar 18, 2019

@andridzi I understand. Can you post a screenshot of prometheus_wal_watcher_current_segment and rate(prometheus_wal_watcher_records_read_total[1m]) over the past couple of days?
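
If pulling raw numbers is easier than screenshots, a sketch like the one below fetches both expressions over the last two days through the standard /api/v1/query_range endpoint; the localhost:9090 address and the 60-second step are assumptions to adjust for your setup:

```go
// Fetch the two series above from the Prometheus HTTP API over the last
// two days. The Prometheus address and step are assumptions; adjust as needed.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

func queryRange(expr string) (string, error) {
	end := time.Now()
	start := end.Add(-48 * time.Hour)

	params := url.Values{}
	params.Set("query", expr)
	params.Set("start", strconv.FormatInt(start.Unix(), 10))
	params.Set("end", strconv.FormatInt(end.Unix(), 10))
	params.Set("step", "60")

	resp, err := http.Get("http://localhost:9090/api/v1/query_range?" + params.Encode())
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	for _, expr := range []string{
		"prometheus_wal_watcher_current_segment",
		"rate(prometheus_wal_watcher_records_read_total[1m])",
	} {
		out, err := queryRange(expr)
		if err != nil {
			fmt.Println("query failed:", err)
			continue
		}
		fmt.Println(out)
	}
}
```
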

@andridzi (Author) commented Mar 18, 2019

[Screen Shot 2019-03-18 at 10 42 09 PM]

[Screen Shot 2019-03-18 at 10 43 00 PM]

@tomwilkie (Member) commented Mar 19, 2019

Yeah, that doesn't look very happy. The large spikes in records read show the retries replaying the WAL when it hits corruption.

Can you tell me a bit about the system you're running on? What OS and disk are you using?

@andridzi (Author) commented Mar 19, 2019

It is running in a Kubernetes cluster, with an image based on Alpine 3.8.
The volume is mounted directly from the node, and on the node it is GlusterFS (I guess that could be the cause of the issue).

@tomwilkie (Member) commented Mar 25, 2019

> and on the node it is GlusterFS (I guess that could be the cause of the issue)

Oh, that's interesting. While I doubt we're seeing corruption from that, I could imagine a possibility where we don't play well with Gluster from an fsync point of view. Unfortunately, I can't think of a way to test this. Perhaps you could spin up another Prometheus using local disks?

If you restart this Prometheus, does the issue persist?
