This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Use concurrent.Writer in place of bufio.Writer #149

Open · wants to merge 9 commits into master

Conversation

alin-amana
Contributor

Use concurrent.Writer in place of bufio.Writer and remove all Flush() calls from under mutex protection, to reduce the likelihood of commits blocking on writes to slow/busy disks.

concurrent.Writer is a highly concurrent, thread-safe drop-in replacement for bufio.Writer. I wrote and tested it over the past week and have been running a Prometheus instance using it for WAL writes for the past couple of days. Essentially it allows Flush() calls to safely proceed in parallel with writes, and it isolates concurrent writes from one another (isolation in the ACID sense, i.e. writes are serialized).
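For context, here is a minimal sketch of the buffer-swap idea behind such a writer. This is not the actual concurrent.Writer code (the real writer also bounds the buffer, serializes concurrent flushes, and propagates flush errors); it only illustrates why a Flush() stalled on a slow disk does not block subsequent Write() calls:

```go
package main

import (
	"io"
	"os"
	"sync"
)

// swapWriter: Write only holds the mutex for an in-memory append, while
// Flush swaps in a fresh buffer and drains the old one to the underlying
// writer outside the lock, so a stalled disk does not block new writes.
type swapWriter struct {
	mu  sync.Mutex
	buf []byte
	out io.Writer // underlying (possibly slow) writer
}

func newSwapWriter(w io.Writer, size int) *swapWriter {
	return &swapWriter{out: w, buf: make([]byte, 0, size)}
}

// Write appends to the in-memory buffer; it never touches the disk.
func (w *swapWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	w.buf = append(w.buf, p...)
	w.mu.Unlock()
	return len(p), nil
}

// Flush swaps buffers under the lock and performs the (potentially slow)
// underlying write without holding it, so concurrent Writes proceed.
func (w *swapWriter) Flush() error {
	w.mu.Lock()
	full := w.buf
	w.buf = make([]byte, 0, cap(full))
	w.mu.Unlock()

	_, err := w.out.Write(full)
	return err
}

func main() {
	w := newSwapWriter(os.Stdout, 1<<20)
	w.Write([]byte("hello WAL\n")) // in-memory append only
	w.Flush()                      // drains to the underlying writer
}
```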

This is a partial solution to scrapes and rule evals blocking on disk writes. It works perfectly as long as there is buffer space in the writer, but will still block if the buffer fills up. A complete solution that would avoid blocking on disk altogether would periodically scan the memory for new data points and log them to disk completely asynchronously (with regard to scrape and eval execution). Unfortunately I don't have the necessary understanding of Prometheus' internals to implement that instead.

I have been running a patched Prometheus beta.4 instance alongside a stock beta.4 instance on a particularly disk-challenged machine and have seen zero instances of scrape or eval blocking, as opposed to virtually hourly occurrences for the stock instance. I have also restarted the instance a number of times to check for possible WAL corruption and saw no log messages indicating any. The writer itself has excellent test coverage, from single-threaded tests to concurrent writes, flushes, errors and resets. It adds very little CPU overhead (less than 2x for in-memory writes compared to a naive mutex-protected implementation) and will flush as little as possible, except for explicit Flush() calls (which always flush).

… calls from under mutex protection to reduce the likelihood of commits blocking on slow disk.
@fabxc
Contributor

fabxc commented Sep 18, 2017

Thanks for putting in all the effort. Sounds very interesting. Will take it for a spin :)

@alin-amana
Contributor Author

I've just looked at the test failure: it's caused by my moving the Writer.Flush() call in SegmentWAL.cut() from the synchronous path into the asynchronous goroutine. I didn't think it would cause issues in Prometheus (I might be wrong), but it definitely makes the test flaky, possibly because the flush may not actually complete before the bulk of the test executes. That would explain why the first sample it reads is (3, 4), the first entry in the second segment. Maybe closing the WAL would ensure the flush completes, as Close() does a flush of its own and the two would be serialized.

@alin-amana
Contributor Author

Please ignore my suggestion to use Close(); it would only flush the last file and have no effect on the flakiness.

I can either put the Writer.Flush() call back into the synchronous path (which would be a shame, as I don't think it's necessary outside of tests, though I might be wrong) or have SegmentWAL.cut() return a channel or some other synchronization mechanism so the caller may optionally wait for the full disk sync.
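To illustrate the second option, here is a small, self-contained sketch (the names are illustrative, not the actual tsdb API) of returning a channel that is closed once the asynchronous flush has finished, so a caller or test can optionally wait for it:

```go
package main

import (
	"fmt"
	"time"
)

// cutSketch kicks off the flush of the old segment asynchronously and
// returns a channel that is closed when the flush finishes, so the caller
// (e.g. a test or a shutdown path) may optionally wait for durability.
func cutSketch(flushOldSegment func() error) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := flushOldSegment(); err != nil {
			fmt.Println("async flush of old segment failed:", err)
		}
	}()
	return done
}

func main() {
	slowFlush := func() error {
		time.Sleep(100 * time.Millisecond) // pretend the disk is busy
		return nil
	}

	done := cutSketch(slowFlush)
	// ... the WAL keeps accepting writes to the new segment here ...
	<-done // only callers that need the data on disk have to wait
	fmt.Println("old segment flushed")
}
```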

@alin-amana
Contributor Author

Oh BTW: I added the option to have concurrent.Writer automatically trigger an asynchronous flush whenever a specified fraction of the buffer is in use.

This could be used instead of, or in addition to, the periodic SegmentWAL flushes, and it should further help prevent blocking on write, particularly for large Prometheus instances, which are more likely to fill the 8 MB buffer within the 5-second interval between flushes and then have to wait synchronously for a flush. I have not included it in this PR (even though it would only be a one-line change) because it's a separate feature and should be committed separately.
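As a rough illustration of the idea (this is a toy, not the concurrent.Writer API; the type and field names are made up), a writer can check the fill fraction on each Write and kick off an asynchronous flush once it crosses the threshold:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// thresholdWriter triggers an asynchronous flush as soon as a configured
// fraction of the buffer is in use, instead of waiting for a periodic flush
// or for the buffer to fill up completely.
type thresholdWriter struct {
	mu        sync.Mutex
	buf       []byte
	capacity  int
	threshold float64       // e.g. 0.75 -> flush once 75% of the buffer is used
	flushing  bool          // true while an async flush is in flight
	sink      *bytes.Buffer // stands in for the slow underlying writer
}

func (w *thresholdWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	w.buf = append(w.buf, p...)
	trigger := !w.flushing && float64(len(w.buf)) >= w.threshold*float64(w.capacity)
	if trigger {
		w.flushing = true
	}
	w.mu.Unlock()

	if trigger {
		go w.flush() // asynchronous: this Write returns immediately
	}
	return len(p), nil
}

func (w *thresholdWriter) flush() {
	w.mu.Lock()
	data := w.buf
	w.buf = nil
	w.mu.Unlock()

	w.sink.Write(data) // may stall on a slow disk; Writes keep buffering meanwhile

	w.mu.Lock()
	w.flushing = false
	w.mu.Unlock()
}

func main() {
	w := &thresholdWriter{capacity: 16, threshold: 0.5, sink: &bytes.Buffer{}}
	w.Write([]byte("0123456789")) // crosses 50% of 16 bytes -> async flush kicks in
	fmt.Println("Write returned without waiting for the flush")
}
```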

@alin-amana
Contributor Author

Also BTW: if you prefer having the code in prometheus/tsdb or prometheus/common I'd be happy to copy it over. I have no strong feelings about doing it one way or the other, just thought it may be useful for others too so I started off with a separate repo.

@alin-amana
Contributor Author

I have been running one patched and one non-patched Prometheus instance side by side for over a week now. I have noticed no issues, apart from some WAL corruption before beta5, which occurred for both the patched and the original builds.

Regarding consistency, i.e. reliably scraping and evaluating at the configured intervals, it's a night-and-day difference. Here is the actual measured scrape interval of the official beta5 build:

[Figure: scrape interval, original build]

And here it is for the patched build:

[Figure: scrape interval, patched build]

The difference is orders of magnitude (spikes/troughs of 1500% vs. 2%). Again, this only holds for small to medium Prometheus deployments that don't fill the 8 MB WAL writer buffer within 5 seconds, but it's much better than nothing.

@alin-amana
Contributor Author

I've merged in the latest updates so there are no conflicts. @fabxc, have you had a chance to take a look at it?

@krasi-georgiev
Contributor

@alin-amana is this still relevant after the WAL rewrite?

@alin-amana
Contributor Author

[Sorry, missed your comment in the mass of email.]

I have not had the time to look at the WAL rewrite, TBH, so I don't know what changes it includes that might make this one unnecessary.

But unless:
(a) WAL writing has become completely asynchronous, meaning that writing samples to the TSDB will not block when the disk is 100% busy/unavailable for a couple of minutes; or
(b) flushing WAL writes never happens under mutex lock (including flushing the writer, not merely the fileutil.Fdatasync call);
I think this change is still pertinent.

IIRC (this was already a year ago), the problem was that bufio.Writer flushes would at times block on disk, and on the disk-challenged VM I was playing with this condition could last for tens of seconds. Unfortunately bufio.Writer is not thread-safe, so flushes and writes have to be protected by a mutex, meaning that while a flush was blocked, no new samples could be committed to the TSDB, even though the buffer was mostly empty.
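A toy, self-contained reproduction of that failure mode (the 2-second stall, the 8 MB buffer size and all names here are illustrative, not the actual tsdb code):

```go
package main

import (
	"bufio"
	"fmt"
	"sync"
	"time"
)

// slowDisk simulates a disk that stalls on every write.
type slowDisk struct{}

func (slowDisk) Write(p []byte) (int, error) {
	time.Sleep(2 * time.Second) // disk busy/unavailable
	return len(p), nil
}

func main() {
	var mu sync.Mutex
	w := bufio.NewWriterSize(slowDisk{}, 8*1024*1024) // large buffer, mostly empty

	w.Write([]byte("buffered WAL entries")) // something for Flush to push out

	// Periodic flush, analogous to the WAL flushing every few seconds.
	go func() {
		mu.Lock()
		w.Flush() // holds the mutex for the whole 2 s disk stall
		mu.Unlock()
	}()

	time.Sleep(10 * time.Millisecond) // let the flush grab the mutex first

	start := time.Now()
	mu.Lock()
	w.Write([]byte("new sample")) // a pure in-memory append, yet it waits ~2 s
	mu.Unlock()
	fmt.Println("Write blocked for", time.Since(start))
}
```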

As noted before, this is not a perfect solution: it's still possible for scrapes and rule evals to block on disk given enough throughput (i.e. enough to fill the buffer before the flush completes). Fully preventing blocking could be achieved either via dynamically sized buffers or by converting the WAL into a write-behind log.

@krasi-georgiev
Contributor

This sounds like a good improvement.
When you find the time, I would really appreciate it if you could have a look at the new WAL; if these changes are still relevant, please rebase and I will spend the time to review it.
