whisper: disable buffering for update operations #181
Conversation
We found out that 200GiB of memory isn't enough to keep 8M metrics in the page cache, and past this limit whisper starts reading from disk on every update. After this patch, updates get a 100% hit rate on the page cache with only 100GiB of memory.
```diff
@@ -95,6 +95,9 @@ def _py_fallocate(fd, offset, len_):
 CACHE_HEADERS = False
 AUTOFLUSH = False
 FADVISE_RANDOM = False
+# Buffering setting applied to all operations that do *not* require
+# a full scan of the file in order to minimize cache thrashing.
+BUFFERING = 0
```
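For context, this is roughly how a module-level constant like this gets threaded into the file opens (a minimal sketch; `open_for_update` is an illustrative name, not whisper's actual API):

```python
BUFFERING = 0  # 0 = unbuffered: each read()/write() becomes one syscall

def open_for_update(path):
    # Illustrative helper: operations that only touch the header and a
    # few datapoints open the file unbuffered, so the kernel fetches
    # exactly the bytes requested instead of an 8 KiB readahead block.
    return open(path, 'r+b', BUFFERING)
```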
Should it be "-1" by default?
According to the docs: "The optional buffering argument specifies the file’s desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size (in bytes). A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files. If omitted, the system default is used."
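For reference, the regimes the docs describe look like this in practice (a quick sketch; note that in Python 3 `buffering=0` is only accepted for binary-mode files):

```python
import os
import tempfile

# Scratch file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.wsp')
with open(path, 'wb') as fh:
    fh.write(b'\x00' * 1024)

buffered = open(path, 'rb')       # -1/omitted: system default (~8 KiB block buffer)
unbuffered = open(path, 'rb', 0)  # 0: unbuffered, each read maps to one syscall
custom = open(path, 'rb', 4096)   # >1: buffer of approximately this many bytes

for fh in (buffered, unbuffered, custom):
    fh.close()
```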
The default is -1, which means the system default, which is usually 8k on Linux. That's super wasteful for things such as update_many, which only reads the header and a few points.
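To make the waste concrete, here is a sketch of what an update-style operation actually touches on disk (the struct formats match whisper's on-disk layout as I read it; the file path and offset are illustrative):

```python
import struct

metadataFormat = "!2LfL"  # aggregationType, maxRetention, xFilesFactor, archiveCount
pointFormat = "!Ld"       # timestamp, value: 12 bytes per datapoint

with open('metric.wsp', 'rb', 0) as fh:  # unbuffered, assumes an existing whisper file
    # An update reads the 16-byte metadata header...
    header = struct.unpack(metadataFormat, fh.read(struct.calcsize(metadataFormat)))
    # ...then seeks into an archive and reads a 12-byte point. With the
    # default buffer, each of these tiny reads would instead pull in a
    # full 8 KiB of mostly-unused neighboring data.
    fh.seek(40)  # illustrative offset
    point = struct.unpack(pointFormat, fh.read(struct.calcsize(pointFormat)))
```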
Well, I'm just not sure that 0 is a good default. OTOH there's no easy way to change that programmatically...
After thinking about it, 0 feels more like a "performance-related bugfix" than a "weird tunable". If it's always the case that we read extra data, and we never need that data, then there's no point in reading it. That's not something you'd tune.
That said, I'd like to see some before/after data (graphs!) showing the effect of this change to confirm my assumptions.
This PR seems HUGE to me. I'm really excited about the performance implications on a large system. Can you share any before/after graphs or metrics for how you validated this?
The times on the two graphs don't match because they're from two different machines; I can get clean, comparable graphs if needed. Our configuration is 24 cores, 256GiB of RAM, 2TiB of SSD, ~10M metrics. In some use cases you may want to tune that, or set it back to -1 (if you read a lot), which is why I made it somewhat tunable. But in most cases it should clearly not be buffered.
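Since it sits next to the other module-level knobs in the diff, tuning it would presumably follow the same pattern as whisper's existing settings (a sketch, assuming the module exposes `BUFFERING` as shown above):

```python
import whisper

# Module-level knob, same pattern as whisper.AUTOFLUSH: set it before
# calling update()/update_many() to restore buffered reads if your
# workload is read-heavy.
whisper.BUFFERING = -1
```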
I tried the same on my test cluster, but I have no big read load there. I can confirm that you will get some free RAM after a restart, though. :)
Note that you need to drop the caches before and after to get a sense of the cache actually in use. :) The most important thing here is that I'm not reading 1GB/s of data from my disk anymore!
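For anyone reproducing this, dropping the page cache on Linux goes through the drop_caches sysctl (a sketch; needs root, and writing 3 drops the page cache plus dentries and inodes):

```python
import os

# Sync first so dirty pages can be written back and reclaimed, then
# evict the page cache plus dentries/inodes by writing '3'.
os.system('sync')
with open('/proc/sys/vm/drop_caches', 'w') as fh:
    fh.write('3\n')
```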
I can try this out on my cluster to confirm your observations. One thing I am having a little trouble finding a reference on, though, is what effect exactly this change has on the file interactions. Under the covers, python is calling […]
http://stackoverflow.com/questions/18194374/default-buffer-size-for-a-file?answertab=votes#tab-top has some information. My thinking is that Linux will probably buffer 4k anyway (the size of a page), and unbuffered reads will simply cause more syscalls at worst.
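An easy way to check the syscall claim directly (a sketch; run it under `strace -e trace=read` and compare the read sizes, the scratch path is illustrative):

```python
import os

path = '/tmp/strace-demo.dat'  # illustrative scratch file
with open(path, 'wb') as fh:
    fh.write(os.urandom(64 * 1024))

fh = open(path, 'rb')       # default buffering
fh.read(16)                 # strace shows a read() requesting ~8192 bytes
fh.close()

fh = open(path, 'rb', 0)    # buffering disabled
fh.read(16)                 # strace shows a read() of exactly 16 bytes
fh.close()
```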
Hm, several people refer to setvbuf(), but it looks like (at least in 2.7) python's io module only sets that on stdin, stdout, and stderr. For file buffering it appears to do its own buffering, though it doesn't appear to do any readahead in binary mode (maybe someone could read this a little closer: https://github.com/python/cpython/blob/2.7/Modules/_io/bufferedio.c#L1432-L1529). Based on this, though, I'm pretty surprised (and a little skeptical) to see a huge difference in read behavior. My expectation would be that we reduce memory usage somewhat and maybe a tad of CPU, but that reads would be constant unless FADVISE_RANDOM were flipped as well. I think I'm 👍 on this, including changing the default, but I'd like to see more people test it in prod and get someone else's eyes on the Python implementation to verify my conclusion.
Note that I'm running with PyPy, which might have different behavior. But the documentation doesn't say that this applies only to stdin/stdout/stderr. I also validated with strace that my reads were issued with 8k sizes before the change. Reads are constant; it's just that before the change they were completely thrashing the page cache:

Before: [graph: page-cache behavior before the change]

After: [graph: page-cache behavior after the change]
@iksaif One thing that's kind of tricky with Graphite is that the memory pressure the carbon-daemons (especially […]

EDIT: For context, I am studying my own environment prior to deploying your changes to observe the effects. I noticed that after a clean restart my ratio was 100%, but it has subsequently been degrading as the […]
Yes, this would consistently become apparent after a clean restart or reboot once we reached a big enough number of metrics updated every minute. Even worse, there was no way to fix it (apart from my patch).
I restarted things a few hours ago to get to a clean-ish starting state to evaluate the impact of this change, and unfortunately I have not seen any page-cache thrashing in a meaningful way since then (large queries drop it for a moment, but the ratio goes back to 100% promptly). This is somewhat intentional, as my systems are tuned to make good use of the page cache already. I'd personally like to get a sense of a few things before merging this.
Additionally, can you share some data about your setup? Are you doing 8M metrics per host? Per what time interval?
Also, just to add, the intent of me posting those graphs was to illustrate that Graphite can be an interesting tool to scale, and that there can be confusing behavior changes just from something as simple as restarting its daemons. That said, I don't mean to wet-blanket your change by asking lots of questions. I'm really glad you're doing performance research and opening PRs against this project to push things forward. I personally like to truly understand how changes like this work before merging them, as I have made errors in the past where I've associated performance improvements with the wrong change. Thanks again for your work so far, and I hope we can demonstrate how exactly this change works and get it merged in!
Update with the patch: [output omitted]

Update without the patch: [output omitted]
With the update, the syscalls ask the kernel to read only what we need instead of buffering 8k more. The actual numbers are probably different because of the minimal io size of the disk and the page size, but both of these numbers should be smaller than 8k. Based on the behavior seen on my system, I would say that without the patch we read at least two times more data than with the patch, and that half of the data being read was never used. If you want to see the effect on your setup, simply use […]. I'm running that now on two machines doing exactly the same thing (replicated clusters, with and without the patch). I'll upload the results soon.
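One way to measure this on Linux, independent of the exact tooling used above, is the kernel's per-process I/O accounting (a sketch, assuming `/proc/<pid>/io` is available):

```python
def disk_read_bytes():
    # 'read_bytes' in /proc/self/io counts bytes this process caused to
    # be fetched from the storage layer, i.e. reads that missed the
    # page cache.
    with open('/proc/self/io') as fh:
        for line in fh:
            if line.startswith('read_bytes:'):
                return int(line.split()[1])

before = disk_read_bytes()
# ... perform a batch of whisper updates here ...
after = disk_read_bytes()
print('bytes read from disk: %d' % (after - before))
```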
That's phenomenal. Thank you very much for the extra information. I'm convinced, and happy to merge this. I'll give it an hour for @mleinart, @obfuscurity or @deniszh to weigh in with anything else; otherwise, I will merge this ~10:30am ET.
👍 good find, @iksaif
Thanks!
No objections here. On moderate load I see no regression, and free RAM is good. :)
A late 👍, thanks for the research and testing all.
Experienced the same problem. Didn't realize it could be so simple to fix!! This is amazing.