
whisper: disable buffering for update operations #181

Merged
merged 1 commit into from Sep 27, 2016

Conversation

iksaif
Member

@iksaif iksaif commented Sep 26, 2016

We found out that 200GiB of memory isn't enough for 8M metrics; past this limit, whisper starts reading from disk at every update.

After this patch, updates get a 100% hit rate on the page cache with only 100GiB of memory.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 88.372% when pulling a3efa84 on criteo-forks:criteo-dev into 78f761c on graphite-project:master.

3 similar comments

@@ -95,6 +95,9 @@ def _py_fallocate(fd, offset, len_):
CACHE_HEADERS = False
AUTOFLUSH = False
FADVISE_RANDOM = False
# Buffering setting applied to all operations that do *not* require
# a full scan of the file in order to minimize cache thrashing.
BUFFERING = 0
Member

Should it be "-1" by default?
According to docs: "The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used."

Member Author

@iksaif iksaif Sep 26, 2016

The default is -1, which means the system default, usually 8k on Linux. That's super wasteful for operations such as update_many which only read the header and a few points.
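
To illustrate the difference (a minimal standalone sketch, not whisper's actual code):

# With the default buffering (-1), reading 16 header bytes makes Python's
# buffered reader pull a full buffer (typically 8k) from the kernel.
with open("foo.wsp", "rb") as f:        # buffered (default)
    header = f.read(16)                 # strace shows read(fd, ..., 8192)

# With buffering set to 0 we get a raw, unbuffered file object and the
# syscall is exactly the size we asked for.
with open("foo.wsp", "rb", 0) as f:     # unbuffered
    header = f.read(16)                 # strace shows read(fd, ..., 16)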

Member

Well, I'm just not sure that 0 is a good default. OTOH there's no easy way to change that programmatically...

Member

After thinking about it, 0 might be more of a "performance-related bugfix" than a "weird tunable". If it's always the case that we read extra data and we never need that data, then there's no point in reading it in the first place. That's not something you'd tune.

That said, I'd like to see some before/after data (graphs!) showing the effect of this change to confirm my assumptions.

@bmhatfield
Member

This PR seems HUGE to me. I'm really excited about the performance implications on a large system. Can you share any before/after graphs or metrics for how you validated this?

@iksaif
Member Author

iksaif commented Sep 26, 2016

Some before/after data:
[graphs: server-metrology 3, server-metrology 2]

  • Reads going down after the fix
  • Way less memory used for cache

The time ranges of the two graphs don't match because they come from two different machines; I can get clean, comparable graphs if needed. Our configuration: 24 cores, 256GiB of RAM, 2TiB of SSD, ~10M metrics.

In some use cases you may want to tune that, or go back to buffered reads (if you read a lot), which is why I made it somewhat tunable. But in most cases it should clearly not be buffered.
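
For example, something along these lines should work (a rough sketch; it assumes the module-level BUFFERING constant from the diff above is read at call time):

import whisper

# Hypothetical override: go back to buffered reads for a read-heavy workload
# by changing the module-level constant before any whisper calls are made.
whisper.BUFFERING = -1  # -1 = system default (usually ~8k); 0 = unbuffered

whisper.update("foo.wsp", 42.0)  # subsequent operations use the chosen buffering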

@deniszh
Member

deniszh commented Sep 26, 2016

I tried the same on my test cluster, but I have no big read load there. I can confirm that you will get some free RAM after a restart, though. :)

@iksaif
Member Author

iksaif commented Sep 26, 2016

Note that you need to drop the caches before and after to get a sense of how much cache is actually used :)

The most important thing here is that I'm not reading 1GB/s of data from my disk anymore!

@iksaif iksaif closed this Sep 26, 2016
@iksaif iksaif reopened this Sep 26, 2016
@coveralls

Coverage Status

Coverage increased (+0.01%) to 88.372% when pulling a3efa84 on criteo-forks:criteo-dev into 78f761c on graphite-project:master.

3 similar comments

@bmhatfield
Member

I can try this out on my cluster to confirm your observations.

One thing I am having a little trouble finding a reference on, though, is what effect exactly this change has on the file interactions. Under the covers, Python is calling setvbuf, I think, but I can't find information on what effect setting that to unbuffered has on reading regular files. Do you have a source that guided your thinking on this?

@iksaif
Member Author

iksaif commented Sep 26, 2016

http://stackoverflow.com/questions/18194374/default-buffer-size-for-a-file?answertab=votes#tab-top has some information. My thinking is that Linux will probably buffer 4k anyway (the size of a page), and unbuffered reads will simply cause more syscalls at worst.

@mleinart
Member

mleinart commented Sep 26, 2016

Hm, several people are referring to setvbuf(), but it looks like (at least in 2.7) Python's io module only sets that on stdin, stdout, and stderr. For file buffering it appears to do its own buffering, though it doesn't appear to do any readahead when in binary mode (maybe someone could read this a little closer? https://github.com/python/cpython/blob/2.7/Modules/_io/bufferedio.c#L1432-L1529).

Based on this, though, I'm pretty surprised (and a little skeptical) to see a huge difference in read behavior. My expectation would be that we reduce memory usage somewhat and maybe a bit of CPU, but that reads would stay constant unless FADVISE_RANDOM were flipped as well.

I think I'm 👍 on this, including changing the default, but I'd like to see more people test it in prod and get someone else's eyes on the Python implementation to verify my conclusion.

@iksaif
Member Author

iksaif commented Sep 26, 2016

Note that I'm running with PyPy, which might behave differently. But the documentation doesn't say that this applies only to stdin/stdout/stderr, and I also validated with strace that my reads were issued with 8k before the change.

Reads are constant; it's just that before the change they were completely thrashing the page cache:

Before

# ./bin/cachestat 
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
   49514    85660    24737    36.6%         8262     223187
   58743   209480    40935    21.9%         8265     223112
   29536   219433    37805    11.9%         8272     223270
   79205   189157    49085    29.5%         8273     223281
   36131   230937    38658    13.5%         8273     223123
   59280   204204    39298    22.5%         8271     223022
   70966    87138    27158    44.9%         8270     222998
   32321    87837    22470    26.9%         8271     223181
   44200   171163    33245    20.5%         8270     222998
   52094   222617    40092    19.0%         8268     222992
   26670   226120    36743    10.6%         8268     222986
   58898   220398    40836    21.1%         8272     223025

After

# ./perf-tools/bin/cachestat 
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
  222662        0   108501   100.0%         9978     214294
  211045        0   116320   100.0%         9978     214295
  204292        0   114225   100.0%         9978     214296
  177172        0   104677   100.0%         9978     214296
  211580        0   115365   100.0%         9978     214297
  241099        0   115595   100.0%         9978     214298
  237547        0   117324   100.0%         9978     214299
  226525        0   120717   100.0%         9978     214300
  222532        0   118824   100.0%         9978     214300
  217729        0   117804   100.0%         9978     214301

@bmhatfield
Member

bmhatfield commented Sep 26, 2016

@iksaif One thing that's kind of tricky with Graphite is that the memory pressure the carbon daemons (especially carbon-cache) put on the system after running for long periods of time can eventually make the page cache thrash. In your environment, do you see the level of thrashing you posted above after a clean restart without your changes?

EDIT: For context, I am studying my own environment prior to deploying your changes to observe the effects. I noticed that after a clean restart, my ratio was 100% but has subsequently been degrading as the carbon-caches start filling with buffered data.

@iksaif
Member Author

iksaif commented Sep 26, 2016

Yes, this would consistently become apparent after a clean restart and reboot once we reached a big enough number of metrics updated every minute. Even worse, there was no way to fix it (apart from my patch).

@bmhatfield
Member

bmhatfield commented Sep 26, 2016

I restarted things a few hours ago to get to a clean-ish starting state to evaluate the impact of this change, and unfortunately I have not seen any page-cache thrashing in a meaningful way since then (large queries drop it for a moment, but the ratio goes back to 100% promptly). This is somewhat intentional, as my systems are tuned to make good use of the page cache already.

I'd personally like to get a sense of a few things before merging this.

  1. Can we quantify exactly the behavior change as a result of setting this? It seems that I'm not alone in being a bit confused by what it changes "under the covers". Do you have data on "bytes read per update", before and after?
  2. I would love to see a reproduction case. If someone else can get to one faster, that'd be sweet. I think my turnaround cycle, unless I force memory pressure, is going to be long.

Additionally, can you share some data about your setup? Are you doing 8M metrics per host? At what time interval?

@bmhatfield
Member

It's also probably worth noting that I am able to gain significant improvements in read-vs-write ratios simply by restarting the carbon daemons (and evicting the bloated-over-time cached points):

[graphs: cache-size, disk-cache, disk-reads]

@bmhatfield
Member

Also, just to add, the intent of posting those graphs was to illustrate that Graphite can be an interesting tool to scale, and that there can be confusing behavior changes from something as simple as restarting its daemons.

That said, I don't mean to wet-blanket your change by asking lots of questions. I'm really glad you're doing performance research and opening PRs against this project to push things forward. I personally like to truly understand how changes like this work before merging them, as I have made errors in the past where I've associated performance improvements with the wrong change.

Thanks again for your work so far, and I hope we can demonstrate how exactly this change works and get it merged in!

@iksaif
Member Author

iksaif commented Sep 27, 2016

Update with the patch

stat("foo.wsp", {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
open("foo.wsp", O_RDWR) = 144
fstat(144, {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
flock(144, LOCK_EX)                     = 0
lseek(144, 0, SEEK_CUR)                 = 0
lseek(144, 0, SEEK_SET)                 = 0
read(144, "\0\0\0\1\1\3413\200?\0\0\0\0\0\0\3", 16) = 16
read(144, "\0\0\0004\0\0\0<\0\0-\0", 12) = 12
read(144, "\0\2\0344\0\0\16\20\0\0\2\320", 12) = 12
read(144, "\0\2=\364\0\1Q\200\0\0\1m", 12) = 12
lseek(144, 0, SEEK_SET)                 = 0
lseek(144, 52, SEEK_SET)                = 52
read(144, "W\346\334l@)\237\317\303\370\21\365", 12) = 12
lseek(144, 41944, SEEK_SET)             = 41944
write(144, "W\352\16\240@%\322\364'\204\251H", 12) = 12
lseek(144, 52, SEEK_SET)                = 52
read(144, "W\346\334l@)\237\317\303\370\21\365", 12) = 12
lseek(144, 41752, SEEK_SET)             = 41752
read(144, "W\352\n\340@$\224\364\203\312\373;W\352\v\34@#\237\322\351F\200\27W\352\vX@%\34#"..., 720) = 720
close(144)                              = 0

Update without the patch

stat("foo.wsp", {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
open("foo.wsp", O_RDWR) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
flock(3, LOCK_EX)                       = 0
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "\0\0\0\1\1\3413\200?\0\0\0\0\0\0\3\0\0\0004\0\0\0<\0\0-\0\0\2\0344"..., 8192) = 8192
lseek(3, 0, SEEK_CUR)                   = 8192
lseek(3, 0, SEEK_CUR)                   = 8192
lseek(3, 0, SEEK_CUR)                   = 8192
lseek(3, 66964, SEEK_SET)               = 66964
lseek(3, 0, SEEK_CUR)                   = 66964
write(3, "W\352\17\30\0\0\0\0\0\0\0\0", 12) = 12
lseek(3, 52, SEEK_SET)                  = 52
read(3, "W\332h8\0\0\0\0\0\0\0\0W\332ht\0\0\0\0\0\0\0\0W\332h\260\0\0\0\0"..., 8192) = 8192
lseek(3, 0, SEEK_CUR)                   = 8244
lseek(3, 66748, SEEK_SET)               = 66748
read(3, "W\352\n\340?\204z\341G\256\24{W\352\v\34?\204z\341G\256\24{W\352\vX\0\0\0\0"..., 8192) = 8192
close(3)                                = 0

With the patch, the syscalls ask the kernel to read only what we need instead of buffering 8k more. The amounts actually read from disk probably differ because of the disk's minimum I/O size and the page size, but both should be smaller than 8k.

Based on the behavior seen on my system, I would say that without the patch we read at least twice as much data as with it, and that half of the data being read was never used.

If you want to see the effect on your setup, simply run echo 3 > /proc/sys/vm/drop_caches on a running system and check how long it takes for the cache size to stabilize and how much cache you actually need.
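
For instance, a rough sketch like this (Linux-only, needs root; the interval and duration are arbitrary) can log how the cache refills afterwards:

import time

# Drop the page cache, dentries and inodes (requires root).
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")

# Sample the "Cached:" value from /proc/meminfo once a minute for an hour.
for _ in range(60):
    with open("/proc/meminfo") as f:
        cached_kb = next(int(line.split()[1]) for line in f
                         if line.startswith("Cached:"))
    print(time.strftime("%H:%M:%S"), cached_kb // 1024, "MiB cached")
    time.sleep(60)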

I'm running that now on two machines doing exactly the same thing (replicated clusters, with and without the patch). I'll upload the results soon.

@iksaif
Member Author

iksaif commented Sep 27, 2016

Dropping caches on live instances (exactly the same traffic).

Patched server:
[graphs: io-good, ram-good]

Vanilla server:
[graphs: io-bad, ram-bad]

@bmhatfield
Member

That's phenomenal. Thank you very much for the extra information. I'm convinced, and happy to merge this. I'll give it an hour for @mleinart, @obfuscurity, or @deniszh to weigh in with anything else; otherwise, I will merge this around 10:30am ET.

@obfuscurity
Member

👍 good find, @iksaif

@bmhatfield bmhatfield merged commit 7627589 into graphite-project:master Sep 27, 2016
@bmhatfield
Member

Thanks!

@deniszh
Member

deniszh commented Sep 27, 2016

No objections here. Under moderate load I see no regression, and the free RAM is good. :)

@mleinart
Member

A late 👍, thanks for the research and testing, all.


@piotr1212
Member

Experienced the same problem. Didn't realize it could be so simple to fix!! This is amazing.
