fix wal panic when page flush fails. #582

krasi-georgiev · 2019-04-15T13:26:47Z

New records should be added to the page only when the last flush
succeeded. Otherwise the page stays full and adding a new record
panics.

To replicate the bug:

dd if=/dev/zero of=/tmp/data.img bs=1M count=20 # create a 20 MB disk image
mkfs.ext4 /tmp/data.img
mkdir /tmp/prometheus/
sudo mount -t ext4 -o loop /tmp/data.img /tmp/prometheus/
GO111MODULE=on go run cmd/prometheus/main.go --config.file=.local/simpleMany.yaml --storage.tsdb.path=/tmp/prometheus --storage.tsdb.wal-segment-size=1MB

Before the fix, this panics once the disk image is full, because the WAL keeps trying to add records to a page that is already full.
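To make the failure mode concrete, here is a minimal Go sketch of the idea behind the fix. It is not the code from wal/wal.go: `toyWAL`, `logRecord`, the 16-byte `pageSize`, and the injected `write` function are all hypothetical simplifications. It only illustrates that when flushing a full page fails, the error is returned before anything else is copied into the page.

```go
package main

import (
	"errors"
	"fmt"
)

// pageSize is deliberately tiny for illustration (assumption, not the real value).
const pageSize = 16

// page is a simplified, hypothetical stand-in for the WAL page buffer.
type page struct {
	buf     [pageSize]byte
	alloc   int // bytes appended so far
	flushed int // bytes already written out
}

func (p *page) full() bool { return p.alloc == pageSize }

// toyWAL appends records to a single page and flushes through a pluggable
// write function so that a failing disk can be simulated.
type toyWAL struct {
	page  page
	write func([]byte) (int, error)
}

// flushPage writes the unflushed bytes and resets the page only on success.
func (w *toyWAL) flushPage() error {
	p := &w.page
	n, err := w.write(p.buf[p.flushed:p.alloc])
	if err != nil {
		return err
	}
	p.flushed += n
	if p.full() {
		*p = page{}
	}
	return nil
}

// logRecord appends rec. If rec does not fit, the page is padded and flushed
// first; a failed flush returns early and leaves the page untouched.
func (w *toyWAL) logRecord(rec []byte) error {
	if len(rec) > pageSize {
		return errors.New("record larger than page")
	}
	p := &w.page
	if len(rec) > pageSize-p.alloc {
		p.alloc = pageSize // write till end of page
		if err := w.flushPage(); err != nil {
			return err
		}
	}
	p.alloc += copy(p.buf[p.alloc:], rec)
	return nil
}

// demo simulates a full disk: every write fails.
func demo() (first, second error) {
	diskFull := errors.New("no space left on device")
	w := &toyWAL{write: func([]byte) (int, error) { return 0, diskFull }}
	_ = w.logRecord(make([]byte, 12))     // fits into the empty page, no flush needed
	first = w.logRecord(make([]byte, 8))  // does not fit, flush fails
	second = w.logRecord(make([]byte, 8)) // page is still full, flush fails again
	return first, second
}

func main() {
	first, second := demo()
	fmt.Println(first, second)
}
```

In this sketch both failing calls return the write error instead of indexing past the end of the page buffer, which is the behaviour the description above asks for.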

At the moment I can't think of a simple test for this, but I will add one to https://github.com/prometheus/tsdb/issues/579 if we figure out how to inject file system faults.

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

bwplotka

I think it makes sense, but I'm not fully approving, as it's the first time I've seen this code. Some suggestions, though. Let me know if I really need to dive into this flow.

bwplotka · 2019-04-27T19:51:28Z

wal/wal.go

@@ -429,7 +429,6 @@ func (w *WAL) flushPage(clear bool) error {
 	// No more data will fit into the page. Enqueue and clear it.
 	if clear {
 		p.alloc = pageSize // Write till end of page.
-		w.pageCompletions.Inc()
 	}
 	n, err := w.segment.Write(p.buf[p.flushed:p.alloc])
 	if err != nil {


Line 437: What if clear is true and Write returns an n smaller than the number of bytes we passed in? I think we would then drop some bytes, right? The question is whether the segment.Write implementation can ever do a partial write. Worth checking?

I know it's not your bug/issue but: https://www.oreilly.com/library/view/97-things-every/9780596809515/ch08.html

krasi-georgiev · 2019-04-29

> I think we would then drop some bytes, right?

I'm not sure what you mean by this. Can you add more details?

	n, err := w.segment.Write(p.buf[p.flushed:p.alloc])
	if err != nil {
		return err
	}
	p.flushed += n

When Write returns an error, tsdb will exit (leaving the WAL in a corrupted state) and run a repair on the next startup.

krasi-georgiev · 2019-05-06

ping @bwplotka

gouthamve

LGTM!

krasi-georgiev · 2019-05-16T13:40:26Z

Merging as is.
@bwplotka, if you still think there is a problem, please let me know and I will open another PR.

This was referenced Apr 15, 2019

Fatal error handling (when writes to wal file fail) #247

Closed

Prometheus 2.0.0-beta5 doesn't recover nicely when running out of disk space prometheus/prometheus#3283

Closed

krasi-georgiev requested review from gouthamve and codesome April 15, 2019 14:14

fix wal panic when page flush fails.

4198a24

krasi-georgiev force-pushed the wal-no-space-error branch from e4400c9 to 4198a24 Compare April 15, 2019 14:49

krasi-georgiev mentioned this pull request Apr 16, 2019

add-go-fuse-to-inject-filesystem-error #583

Open

bwplotka reviewed Apr 27, 2019

comment nits

19accad

Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>

gouthamve approved these changes May 15, 2019

krasi-georgiev merged commit 96a8784 into prometheus-junkyard:master May 16, 2019

krasi-georgiev deleted the wal-no-space-error branch May 16, 2019 13:40

krasi-georgiev mentioned this pull request Jun 5, 2019

WAL log samples: log series: write metrics/wal/000315: file already closed #317

Closed

krasi-georgiev mentioned this pull request Aug 13, 2019

tsdb: Tests with injected file system errors. prometheus/prometheus#5866

Closed
