fix wal panic when page flush fails. #582
fix wal panic when page flush fails. #582
Conversation
New records should be added to the page only when the last flush succeeded. Otherwise the page would be full and panics when trying to add a new record. Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
e4400c9
to
4198a24
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense, but not fully approving, as it's the first time I see this code. Some suggestions though. Let me know if I need to really really dive into this flow.
@@ -429,7 +429,6 @@ func (w *WAL) flushPage(clear bool) error { | |||
// No more data will fit into the page. Enqueue and clear it. | |||
if clear { | |||
p.alloc = pageSize // Write till end of page. | |||
w.pageCompletions.Inc() | |||
} | |||
n, err := w.segment.Write(p.buf[p.flushed:p.alloc]) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 437: What if the clear is true, and we do Write
and n
is not actually p.alloc
? I think we would then drop some bytes, right? Now is the question if the implementation of segment.Write
does partial writes in any case (n
). Worth to check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not your bug/issue but: https://www.oreilly.com/library/view/97-things-every/9780596809515/ch08.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would then drop some bytes, right?
Not sure what you mean by this. Can you add more details.
n, err := w.segment.Write(p.buf[p.flushed:p.alloc])
if err != nil {
return err
}
p.flushed += n
When Write
returns an error tsdb will exit(leaving the WAL in corrupted state) and will run a repair on the next start up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @bwplotka
Signed-off-by: Krasi Georgiev <kgeorgie@redhat.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
merging as is. |
fixes: prometheus/prometheus#3283
New records should be added to the page only when the last flush
succeeded. Otherwise the page would be full and panics when trying to
add a new record.
to replicate the bug:
before the fix this will panic when the disk image is full because it is trying to add more records to the page when it is already full.
At the moment can't think of simple test to test for this, but will add it to https://github.com/prometheus/tsdb/issues/579 if we figure out how to inject file system faults.
Signed-off-by: Krasi Georgiev kgeorgie@redhat.com