Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WAL remote_write starts falling behind after config reload #5299

Closed
tomwilkie opened this Issue Mar 4, 2019 · 8 comments

Comments

Projects
None yet
3 participants
@tomwilkie
Copy link
Member

tomwilkie commented Mar 4, 2019

This is running master plus #5286

screenshot 2019-03-04 at 14 39 16

@tomwilkie

This comment has been minimized.

Copy link
Member Author

tomwilkie commented Mar 4, 2019

From our logs, we can see the remote write code thinks the WAL is corrupt:

ts=2019-03-04T15:21:52.090182044Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=18688 end=25026 pageSize=32768
ts=2019-03-04T15:21:52.090180554Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=18688 end=25026 pageSize=32768
ts=2019-03-04T15:21:52.090231087Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=30650 end=4164 pageSize=32768
ts=2019-03-04T15:21:52.333399078Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6854 err="unexpected
 non-zero byte in page term bytes"
ts=2019-03-04T15:21:52.398811546Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=31130 end=4760 pageSize=32768
ts=2019-03-04T15:21:52.398912414Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=31693 end=2868 pageSize=32768
ts=2019-03-04T15:21:52.398951932Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=31398 end=7205 pageSize=32768
ts=2019-03-04T15:21:52.398810478Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=31130 end=4760 pageSize=32768
ts=2019-03-04T15:21:52.39898598Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=29884 end=2896 pageSize=32768
ts=2019-03-04T15:21:52.399111896Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=29842 end=13040 pageSize=32768
ts=2019-03-04T15:21:52.39903856Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=32743 end=500 pageSize=32768
ts=2019-03-04T15:21:52.399254047Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=31535 end=6334 pageSize=32768
ts=2019-03-04T15:21:52.399307028Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=32724 end=1819 pageSize=32768
ts=2019-03-04T15:21:52.399333042Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=27867 end=9862 pageSize=32768
ts=2019-03-04T15:21:52.399366897Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=19546 end=16723 pageSize=32768
ts=2019-03-04T15:21:52.399872841Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=32440 end=9271 pageSize=32768
ts=2019-03-04T15:21:52.892213392Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6855 err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T15:21:52.892213511Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6855 err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T15:21:53.0368095Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=26255 end=8890 pageSize=32768
ts=2019-03-04T15:21:53.0368838Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=30692 end=6531 pageSize=32768
ts=2019-03-04T15:21:53.036808293Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=26255 end=8890 pageSize=32768
ts=2019-03-04T15:21:53.036946628Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=21446 end=16723 pageSize=32768
ts=2019-03-04T15:21:53.036972913Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=29751 end=7396 pageSize=32768
ts=2019-03-04T15:21:53.037005076Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=29586 end=9671 pageSize=32768
ts=2019-03-04T15:21:53.037025685Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=28487 end=5617 pageSize=32768
ts=2019-03-04T15:21:53.036808239Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=26255 end=8890 pageSize=32768
ts=2019-03-04T15:21:53.050387701Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6856 err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T15:21:53.050387037Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6856 err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T15:21:53.114975486Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=30188 end=3411 pageSize=32768
ts=2019-03-04T15:21:53.115013876Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=31686 end=4493 pageSize=32768
ts=2019-03-04T15:21:53.115079307Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=11093 end=25026 pageSize=32768
ts=2019-03-04T15:21:53.115108672Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=26168 end=8534 pageSize=32768
ts=2019-03-04T15:21:53.186848072Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6857 err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T15:21:53.186849897Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6857 err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T15:21:53.271489549Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=30515 end=7774 pageSize=32768
ts=2019-03-04T15:21:53.271575534Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=24624 end=12203 pageSize=32768
ts=2019-03-04T15:21:53.271661571Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=28495 end=6586 pageSize=32768
ts=2019-03-04T15:21:53.271725348Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=24998 end=8430 pageSize=32768
ts=2019-03-04T15:21:53.27281998Z caller=dedupe.go:111 component=remote level=warn msg="record spans page boundaries" start=31178 end=9091 pageSize=32768
ts=2019-03-04T15:21:54.672889124Z caller=dedupe.go:111 component=remote level=warn msg="ignoring error reading to end of segment, may have dropped data" segment=6858 err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T15:21:55.06590825Z caller=dedupe.go:111 component=remote level=error msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
@tomwilkie

This comment has been minimized.

Copy link
Member Author

tomwilkie commented Mar 4, 2019

Current theory: either (a) the WAL gets corrupted when we reload config or (b) the wal watcher gets out of sync somehow when we reload config.

/cc @cstyan @codesome - might have to hold 2.8 for this.

@tomwilkie

This comment has been minimized.

Copy link
Member Author

tomwilkie commented Mar 4, 2019

The second change here might help: #5300

@tomwilkie

This comment has been minimized.

Copy link
Member Author

tomwilkie commented Mar 4, 2019

We grabbed a copy of the WAL and tried to replay it, multiple segments were corrupt. I struggle to see how this has anything to do with the WAL code. When I restarted the Prometheus server, it also found corruption and repaired the WAL:

level=warn ts=2019-03-04T16:21:31.434945995Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/prometheus/data/wal/00006861
level=warn ts=2019-03-04T16:22:05.078776022Z caller=head.go:497 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment /prometheus/da
ta/wal/00006854 at 44040192: unexpected non-zero byte in padded page"
level=warn ts=2019-03-04T16:22:05.086185462Z caller=wal.go:291 component=tsdb msg="starting corruption repair" segment=6854 offset=44040192
level=warn ts=2019-03-04T16:22:05.086365462Z caller=wal.go:299 component=tsdb msg="deleting all segments newer than corrupted segment" segment=6854
level=warn ts=2019-03-04T16:22:05.10135186Z caller=wal.go:321 component=tsdb msg="rewrite corrupted segment" segment=6854
level=info ts=2019-03-04T16:22:06.322044547Z caller=main.go:647 msg="TSDB started"
level=debug ts=2019-03-04T16:22:06.322132089Z caller=main.go:648 msg="TSDB options" MinBlockDuration=2h MaxBlockDuration=36h MaxBytes=0B NoLockfile=false RetentionDuration=15d WALSe
gmentSize=0B

Even after this, remote_write was still broken:

ts=2019-03-04T16:22:30.764244826Z caller=dedupe.go:111 component=remote level=error msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T16:23:17.974594694Z caller=dedupe.go:111 component=remote level=debug msg="reading checkpoint" dir=/prometheus/data/wal/checkpoint.006849
ts=2019-03-04T16:23:20.361749597Z caller=dedupe.go:111 component=remote level=debug msg="read series references from checkpoint" checkpoint=/prometheus/data/wal/checkpoint.006849
ts=2019-03-04T16:23:20.361909694Z caller=dedupe.go:111 component=remote level=debug msg="tailing WAL" lastCheckpoint=/prometheus/data/wal/checkpoint.006849 checkpointIndex=6849 currentSegment=6850 lastSegment=6855
ts=2019-03-04T16:23:32.968556035Z caller=dedupe.go:111 component=remote level=error msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T16:24:18.272611834Z caller=dedupe.go:111 component=remote level=debug msg="reading checkpoint" dir=/prometheus/data/wal/checkpoint.006849
ts=2019-03-04T16:24:20.664579524Z caller=dedupe.go:111 component=remote level=debug msg="read series references from checkpoint" checkpoint=/prometheus/data/wal/checkpoint.006849
ts=2019-03-04T16:24:20.664699339Z caller=dedupe.go:111 component=remote level=debug msg="tailing WAL" lastCheckpoint=/prometheus/data/wal/checkpoint.006849 checkpointIndex=6849 currentSegment=6850 lastSegment=6855
ts=2019-03-04T16:24:33.069504794Z caller=dedupe.go:111 component=remote level=error msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
ts=2019-03-04T16:25:25.465273626Z caller=dedupe.go:111 component=remote level=debug msg="reading checkpoint" dir=/prometheus/data/wal/checkpoint.006849
ts=2019-03-04T16:25:28.979608068Z caller=dedupe.go:111 component=remote level=debug msg="read series references from checkpoint" checkpoint=/prometheus/data/wal/checkpoint.006849
ts=2019-03-04T16:25:28.979737683Z caller=dedupe.go:111 component=remote level=debug msg="tailing WAL" lastCheckpoint=/prometheus/data/wal/checkpoint.006849 checkpointIndex=6849 currentSegment=6850 lastSegment=6855
ts=2019-03-04T16:25:33.868108983Z caller=dedupe.go:111 component=remote level=error msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
@tomwilkie

This comment has been minimized.

Copy link
Member Author

tomwilkie commented Mar 5, 2019

I think #5300 fixed this.

@tomwilkie tomwilkie closed this Mar 5, 2019

@andridzi

This comment has been minimized.

Copy link

andridzi commented Mar 12, 2019

hi @tomwilkie
after upgrading from 2.7.0 to 2.8.0 I've started facing the same issue
msg="error tailing WAL" err="unexpected non-zero byte in page term bytes"
and at the same time I'm getting other errors related to WAL
error tailing WAL" err="unexpected checksum f4c57f67, expected dd5fb88e"
error tailing WAL" err="record length greater than a single page: 59671 > 32768
could you please advise me how to deal with this?
thanks

@cstyan

This comment has been minimized.

Copy link
Contributor

cstyan commented Mar 13, 2019

@andridzi when you upgraded did prometheus repair the WAL when it started on v2.8? If you copy your current WAL dir to somewhere else and start a new prometheus using that copied WAL, does it attempt to repair the WAL?

Those messages are indicative of a corruption in the WAL.

@andridzi

This comment has been minimized.

Copy link

andridzi commented Mar 13, 2019

I've also tried to delete wal records and start prometheus and even after that still get the same errors
So seems like the problem is with data based on which WAL records are created. Or creation mechanism.

P.S.: after copying current wal dir and start new instance - no attempts to correct wal records

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.