
whisper: disable buffering for update operations #181

Merged
merged 1 commit into from Sep 27, 2016

Conversation

iksaif
Member

@iksaif iksaif commented Sep 26, 2016

We found out that 200GiB of memory isn't enough for 8M metrics; past this limit, whisper starts reading from disk at every update.

After this patch, updates get a 100% hit rate on the page cache with only 100GiB of memory.

@coveralls

Coverage Status

Coverage increased (+0.01%) to 88.372% when pulling a3efa84 on criteo-forks:criteo-dev into 78f761c on graphite-project:master.

3 similar comments

@@ -95,6 +95,9 @@ def _py_fallocate(fd, offset, len_):
CACHE_HEADERS = False
AUTOFLUSH = False
FADVISE_RANDOM = False
# Buffering setting applied to all operations that do *not* require
# a full scan of the file in order to minimize cache thrashing.
BUFFERING = 0
Member

Should it be "-1" by default?
According to docs: "The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used."

Member Author

@iksaif iksaif Sep 26, 2016

The default is -1, which means the system default, usually 8k on Linux. That's super wasteful for operations such as update_many which only read the header and a few points.
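
To illustrate the difference (a minimal standalone sketch, not whisper's actual code):

# With the default buffering (-1), reading 16 header bytes makes Python's
# buffered reader pull a full buffer (typically 8k) from the kernel.
with open("foo.wsp", "rb") as f:        # buffered (default)
    header = f.read(16)                 # strace shows read(fd, ..., 8192)

# With buffering set to 0 we get a raw, unbuffered file object and the
# syscall is exactly the size we asked for.
with open("foo.wsp", "rb", 0) as f:     # unbuffered
    header = f.read(16)                 # strace shows read(fd, ..., 16)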

Member

Well, I'm just not sure that 0 is a good default. OTOH there's no easy way to change that programmatically...

Member

After thinking about it, 0 might be more of a "performance-related bugfix" than a "weird tunable". If it's always the case that we read extra data and we never need that data, then there's no point in reading it in the first place. That's not something you'd tune.

That said, I'd like to see some before/after data (graphs!) showing the effect of this change to confirm my assumptions.

@bmhatfield
Member

This PR seems HUGE to me. I'm really excited about the performance implications on a large system. Can you share any before/after graphs or metrics for how you validated this?

@iksaif
Member Author

iksaif commented Sep 26, 2016

Some before/after data:
[graphs: server-metrology 3, server-metrology 2]

  • Reads going down after the fix
  • Way less memory used for cache

The time ranges of the two graphs don't match because they come from two different machines; I can get clean, comparable graphs if needed. Our configuration: 24 cores, 256GiB of RAM, 2TiB of SSD, ~10M metrics.

In some use cases you may want to tune that, or go back to buffered reads (if you read a lot), which is why I made it somewhat tunable. But in most cases it should clearly not be buffered.
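
For example, something along these lines should work (a rough sketch; it assumes the module-level BUFFERING constant from the diff above is read at call time):

import whisper

# Hypothetical override: go back to buffered reads for a read-heavy workload
# by changing the module-level constant before any whisper calls are made.
whisper.BUFFERING = -1  # -1 = system default (usually ~8k); 0 = unbuffered

whisper.update("foo.wsp", 42.0)  # subsequent operations use the chosen buffering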

@deniszh
Member

deniszh commented Sep 26, 2016

I tried the same on my test cluster, but I have no big read load there. I can confirm that you will get some free RAM after a restart, though. :)

@iksaif
Member Author

iksaif commented Sep 26, 2016

Note that you need to drop the caches before and after to get a sense of how much cache is actually used :)

The most important thing here is that I'm not reading 1GB/s of data from my disk anymore!

@iksaif iksaif closed this Sep 26, 2016
@iksaif iksaif reopened this Sep 26, 2016
@coveralls

Coverage Status

Coverage increased (+0.01%) to 88.372% when pulling a3efa84 on criteo-forks:criteo-dev into 78f761c on graphite-project:master.

3 similar comments

@bmhatfield
Member

I can try this out on my cluster to confirm your observations.

One thing I am having a little trouble finding a reference on, though, is what effect exactly this change has on the file interactions. Under the covers, Python is calling setvbuf, I think, but I can't find information on what effect setting that to unbuffered has on reading regular files. Do you have a source that guided your thinking on this?

@iksaif
Member Author

iksaif commented Sep 26, 2016

http://stackoverflow.com/questions/18194374/default-buffer-size-for-a-file?answertab=votes#tab-top has some information. My thinking is that Linux will probably buffer 4k anyway (the size of a page), and unbuffered reads will simply cause more syscalls at worst.

@mleinart
Member

mleinart commented Sep 26, 2016

Hm, several people are referring to setvbuf(), but it looks like (at least in 2.7) Python's io module only sets that on stdin, stdout, and stderr. For file buffering it appears to do its own buffering, though it doesn't appear to do any readahead when in binary mode (maybe someone could read this a little closer? https://github.com/python/cpython/blob/2.7/Modules/_io/bufferedio.c#L1432-L1529).

Based on this, though, I'm pretty surprised (and a little skeptical) to see a huge difference in read behavior. My expectation would be that we reduce memory usage somewhat and maybe a bit of CPU, but that reads would stay constant unless FADVISE_RANDOM were flipped as well.

I think I'm 👍 on this, including changing the default, but I'd like to see more people test it in prod and get someone else's eyes on the Python implementation to verify my conclusion.

@iksaif
Member Author

iksaif commented Sep 26, 2016

Note that I'm running with PyPy, which might behave differently. But the documentation doesn't say that this applies only to stdin/stdout/stderr, and I also validated with strace that my reads were issued with 8k before the change.

Reads are constant; it's just that before the change they were completely thrashing the page cache:

Before

# ./bin/cachestat 
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
   49514    85660    24737    36.6%         8262     223187
   58743   209480    40935    21.9%         8265     223112
   29536   219433    37805    11.9%         8272     223270
   79205   189157    49085    29.5%         8273     223281
   36131   230937    38658    13.5%         8273     223123
   59280   204204    39298    22.5%         8271     223022
   70966    87138    27158    44.9%         8270     222998
   32321    87837    22470    26.9%         8271     223181
   44200   171163    33245    20.5%         8270     222998
   52094   222617    40092    19.0%         8268     222992
   26670   226120    36743    10.6%         8268     222986
   58898   220398    40836    21.1%         8272     223025

After

# ./perf-tools/bin/cachestat 
Counting cache functions... Output every 1 seconds.
    HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
  222662        0   108501   100.0%         9978     214294
  211045        0   116320   100.0%         9978     214295
  204292        0   114225   100.0%         9978     214296
  177172        0   104677   100.0%         9978     214296
  211580        0   115365   100.0%         9978     214297
  241099        0   115595   100.0%         9978     214298
  237547        0   117324   100.0%         9978     214299
  226525        0   120717   100.0%         9978     214300
  222532        0   118824   100.0%         9978     214300
  217729        0   117804   100.0%         9978     214301

@bmhatfield
Member

bmhatfield commented Sep 26, 2016

@iksaif One thing that's kind of tricky with Graphite is that the memory pressure the carbon daemons (especially carbon-cache) put on the system after running for long periods of time can eventually make the page cache thrash. In your environment, do you see the level of thrashing you posted above after a clean restart without your changes?

EDIT: For context, I am studying my own environment prior to deploying your changes to observe the effects. I noticed that after a clean restart, my ratio was 100% but has subsequently been degrading as the carbon-caches start filling with buffered data.

@iksaif
Member Author

iksaif commented Sep 26, 2016

Yes, this would consistently become apparent after a clean restart and reboot once we reached a big enough number of metrics updated every minute. Even worse, there was no way to fix it (apart from my patch).

@bmhatfield
Member

bmhatfield commented Sep 26, 2016

I restarted things a few hours ago to get to a clean-ish starting state to evaluate the impact of this change, and unfortunately I have not seen any page-cache thrashing in a meaningful way since then (large queries drop it for a moment, but the ratio goes back to 100% promptly). This is somewhat intentional, as my systems are tuned to make good use of the page cache already.

I'd personally like to get a sense of a few things before merging this.

  1. Can we quantify exactly the behavior change as a result of setting this? It seems that I'm not alone in being a bit confused by what it changes "under the covers". Do you have data on "bytes read per update", before and after?
  2. I would love to see a reproduction case. If someone else can get to one faster, that'd be sweet. I think my turnaround cycle, unless I force memory pressure, is going to be long.

Additionally, can you share some data about your setup? Are you doing 8M metrics per host? At what time interval?

@bmhatfield
Member

It's also probably worth noting that I am able to gain significant improvements in read-vs-write ratios simply by restarting the carbon daemons (and evicting the bloated-over-time cached points):

[graphs: cache-size, disk-cache, disk-reads]

@bmhatfield
Member

Also, just to add, the intent of posting those graphs was to illustrate that Graphite can be an interesting tool to scale, and that there can be confusing behavior changes from something as simple as restarting its daemons.

That said, I don't mean to wet-blanket your change by asking lots of questions. I'm really glad you're doing performance research and opening PRs against this project to push things forward. I personally like to truly understand how changes like this work before merging them, as I have made errors in the past where I've associated performance improvements with the wrong change.

Thanks again for your work so far, and I hope we can demonstrate how exactly this change works and get it merged in!

@iksaif
Member Author

iksaif commented Sep 27, 2016

Update with the patch

stat("foo.wsp", {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
open("foo.wsp", O_RDWR) = 144
fstat(144, {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
flock(144, LOCK_EX)                     = 0
lseek(144, 0, SEEK_CUR)                 = 0
lseek(144, 0, SEEK_SET)                 = 0
read(144, "\0\0\0\1\1\3413\200?\0\0\0\0\0\0\3", 16) = 16
read(144, "\0\0\0004\0\0\0<\0\0-\0", 12) = 12
read(144, "\0\2\0344\0\0\16\20\0\0\2\320", 12) = 12
read(144, "\0\2=\364\0\1Q\200\0\0\1m", 12) = 12
lseek(144, 0, SEEK_SET)                 = 0
lseek(144, 52, SEEK_SET)                = 52
read(144, "W\346\334l@)\237\317\303\370\21\365", 12) = 12
lseek(144, 41944, SEEK_SET)             = 41944
write(144, "W\352\16\240@%\322\364'\204\251H", 12) = 12
lseek(144, 52, SEEK_SET)                = 52
read(144, "W\346\334l@)\237\317\303\370\21\365", 12) = 12
lseek(144, 41752, SEEK_SET)             = 41752
read(144, "W\352\n\340@$\224\364\203\312\373;W\352\v\34@#\237\322\351F\200\27W\352\vX@%\34#"..., 720) = 720
close(144)                              = 0

Update without the patch

stat("foo.wsp", {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
open("foo.wsp", O_RDWR) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=151312, ...}) = 0
flock(3, LOCK_EX)                       = 0
lseek(3, 0, SEEK_CUR)                   = 0
lseek(3, 0, SEEK_CUR)                   = 0
read(3, "\0\0\0\1\1\3413\200?\0\0\0\0\0\0\3\0\0\0004\0\0\0<\0\0-\0\0\2\0344"..., 8192) = 8192
lseek(3, 0, SEEK_CUR)                   = 8192
lseek(3, 0, SEEK_CUR)                   = 8192
lseek(3, 0, SEEK_CUR)                   = 8192
lseek(3, 66964, SEEK_SET)               = 66964
lseek(3, 0, SEEK_CUR)                   = 66964
write(3, "W\352\17\30\0\0\0\0\0\0\0\0", 12) = 12
lseek(3, 52, SEEK_SET)                  = 52
read(3, "W\332h8\0\0\0\0\0\0\0\0W\332ht\0\0\0\0\0\0\0\0W\332h\260\0\0\0\0"..., 8192) = 8192
lseek(3, 0, SEEK_CUR)                   = 8244
lseek(3, 66748, SEEK_SET)               = 66748
read(3, "W\352\n\340?\204z\341G\256\24{W\352\v\34?\204z\341G\256\24{W\352\vX\0\0\0\0"..., 8192) = 8192
close(3)                                = 0

With the patch, the syscalls ask the kernel to read only what we need instead of buffering 8k more. The amounts actually read from disk probably differ because of the disk's minimum I/O size and the page size, but both should be smaller than 8k.

Based on the behavior seen on my system, I would say that without the patch we read at least twice as much data as with it, and that half of the data being read was never used.

If you want to see the effect on your setup, simply run echo 3 > /proc/sys/vm/drop_caches on a running system and check how long it takes for the cache size to stabilize and how much cache you actually need.
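
For instance, a rough sketch like this (Linux-only, needs root; the interval and duration are arbitrary) can log how the cache refills afterwards:

import time

# Drop the page cache, dentries and inodes (requires root).
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")

# Sample the "Cached:" value from /proc/meminfo once a minute for an hour.
for _ in range(60):
    with open("/proc/meminfo") as f:
        cached_kb = next(int(line.split()[1]) for line in f
                         if line.startswith("Cached:"))
    print(time.strftime("%H:%M:%S"), cached_kb // 1024, "MiB cached")
    time.sleep(60)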

I'm running that now on two machines doing exactly the same thing (replicated clusters, with and without the patch). I'll upload the results soon.

@iksaif
Member Author

iksaif commented Sep 27, 2016

Dropping caches on live instances (exactly the same traffic).

Patched server:
[graphs: io-good, ram-good]

Vanilla server:
[graphs: io-bad, ram-bad]

@bmhatfield
Member

That's phenomenal. Thank you very much for the extra information. I'm convinced, and happy to merge this. I'll give it an hour for @mleinart, @obfuscurity, or @deniszh to weigh in with anything else; otherwise, I will merge this around 10:30am ET.

@obfuscurity
Member

👍 good find, @iksaif

@bmhatfield bmhatfield merged commit 7627589 into graphite-project:master Sep 27, 2016
@bmhatfield
Member

Thanks!

@deniszh
Member

deniszh commented Sep 27, 2016

No objections here. Under moderate load I see no regression, and the free RAM is good. :)

@mleinart
Member

A late 👍, thanks for the research and testing, all.


@piotr1212
Member

Experienced the same problem. Didn't realize it could be so simple to fix!! This is amazing.
