Network and Disk I/O blocking and file handles : optimizations #14
Now using readahead().
According to the Linux kernel implementation, readahead() simply walks all pages in the requested file range, looks each page up in the mapping's RBT and, if it is not already there (i.e. not cached), schedules a read-ahead for it. This means the cost is minimal beyond iterating the pages and looking each one up in the RBT, which seems like a good trade-off.
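For reference, a minimal sketch of how such a read-ahead hint could be issued before streaming a segment range; the helper name and error handling are illustrative, not Tank's actual code:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h> // readahead()
#include <cstddef>
#include <cstdio>

// Ask the kernel to start paging-in `len` bytes at `offset` of `fd` in the
// background. readahead() returns once the request has been queued; it does
// not wait for the actual disk reads.
static void hint_readahead(int fd, off_t offset, size_t len) {
        if (readahead(fd, offset, len) == -1) {
                // Not fatal: we only lose the warm-up; a later sendfile()
                // will simply fault the pages in synchronously.
                perror("readahead");
        }
}
```

Since the call only schedules background reads, it is cheap to issue it as soon as a consume request arrives.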
More measurements:
writev(): it takes 0.081s to writev() 95MBs.
readahead() for 32MBs:
sendfile():
See also:
This doesn't affect Tank because of its single-threaded design, but it's good to keep in mind. From code comments:
// The Linux XFS implementation is challenged wrt. append: a write that changes
// eof will be blocked by any other concurrent AIO operation to the same file, whether
// it changes file size or not. Furthermore, ftruncate() will also block and be blocked
// by AIO, so attempts to game the system and call ftruncate() have to be done very carefully.
//
// Other Linux filesystems may have different locking rules, so this may need to be
// adjusted for them.
As of 0925e89, Tank will be fairer to clients by trying to reduce the time it blocks in sendfile(), which matters when a consumer requests a very large payload (e.g. dozens of MBs) and/or reading from the underlying filesystem is very slow (low transfer rates). Instead of executing a single sendfile() for, say, 32MBs, it will break this down into 512K requests, and once it has transferred 4MBs it will return control to the main I/O loop, thereby giving other connections/clients a chance, and then resume the transfer from where it left off. By breaking the single sendfile() call down into multiple calls, there is a better chance the kernel has already paged in the contents in time for subsequent sendfile() calls -- we readahead() when we receive a consume request, instructing the kernel to page in data in the background, before we commence streaming data. This helps a lot when you have lots of connections/clients and you want to be fair to them, so that 1 or 2 clients asking for dozens of MBs won't block processing of incoming requests or transfer of outgoing responses.

There is no perfect solution though. We could dedicate one thread per connection, so that when the kernel puts a connection/client thread to sleep because it blocks on I/O, another would be scheduled in its place, but that'd result in other problems. You may want to read SeaStar's tutorial for why this is not a good idea.

We could have used AIO, and it could have worked, except that there are caveats. In practice, we'd be required to use O_DIRECT access and XFS (though other filesystems are catching up). That means we'd bypass the kernel cache and would need our own cache. Furthermore, the more data you need to read asynchronously, the longer it takes to set up the request (the submission itself gets more expensive).

We could have used a thread pool and managed multiple clients per thread, which could mitigate the effect, but at the cost of complexity and state-serialization overhead.

Ultimately, short of the Linux kernel introducing new APIs that in effect signal the userspace application before a thread is about to block, thereby giving it a chance to e.g. yield to an application-managed fiber/green-thread, and conversely signal it again when the blocking operation has completed and the thread is about to be made runnable again, the best way overall would be to just port Netflix's sendfile() improvements to Linux (see earlier comments).
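As a rough illustration of the chunking scheme described above (the struct and function names are hypothetical; the 512K/4MB constants are the values mentioned in this comment):

```cpp
#include <sys/sendfile.h>
#include <sys/types.h>
#include <algorithm>
#include <cerrno>
#include <cstddef>

// Illustrative per-connection transfer state (names are hypothetical).
struct FileTransfer {
        int    file_fd;
        int    sock_fd;
        off_t  offset;    // where the next sendfile() resumes from
        size_t remaining; // bytes left to satisfy the consume request
};

// Stream at most `quantum` bytes in `chunk`-sized sendfile() calls, then
// return to the event loop so other connections get a turn. Returns false
// only on a fatal error; EAGAIN or an exhausted quantum simply means the
// event loop should schedule us again later.
static bool pump_transfer(FileTransfer &t,
                          const size_t chunk = 512 * 1024,
                          const size_t quantum = 4 * 1024 * 1024) {
        size_t sent_now = 0;

        while (t.remaining && sent_now < quantum) {
                const auto n = std::min(chunk, t.remaining);
                const auto r = sendfile(t.sock_fd, t.file_fd, &t.offset, n);

                if (r == -1)
                        return errno == EAGAIN || errno == EINTR;
                if (r == 0)
                        break; // unexpected EOF on the source file

                t.remaining -= static_cast<size_t>(r);
                sent_now += static_cast<size_t>(r);
        }
        return true;
}
```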
Now using another heuristic: if the total time spent in try_send() (specifically, we keep track of the total time spent in sendfile()) exceeds a fixed limit, we abort early. Also, we initially sendfile() 128K and then switch to 640K per iteration, and the transmit threshold (the total number of bytes that can be sent in the current try_send() call) has been raised to 24MBs. It turns out that if the data to be accessed is missing from the kernel VM caches, e.g. after free && sync && echo 3 > /proc/sys/vm/drop_caches && free, it takes thousands of microseconds to sendfile() a few KBs worth of data, whereas if the data is in cache, it takes no more than 250-300 microseconds. So having a fixed upper limit (currently 3k microseconds) helps identify situations where the data is not in cache and sending however much was requested would require spending too long in try_send(), at the expense of all other active requests.
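A sketch of what such a time budget could look like; the 128K/640K chunk sizes and the 3k-microsecond limit are from this comment, but the function itself is illustrative rather than the actual try_send() implementation:

```cpp
#include <sys/sendfile.h>
#include <sys/types.h>
#include <algorithm>
#include <cerrno>
#include <chrono>

// Abort the current send pass early once more than `budget` has been spent
// inside sendfile(): a strong hint that the data is not in the page cache
// and that continuing would stall every other connection.
static bool send_with_budget(int sock_fd, int file_fd, off_t &off, size_t total,
                             const std::chrono::microseconds budget =
                                 std::chrono::microseconds(3000)) {
        using clock = std::chrono::steady_clock;
        std::chrono::microseconds spent{0};
        size_t chunk = 128 * 1024; // first call is small, later calls are larger

        while (total) {
                const auto before = clock::now();
                const auto r = sendfile(sock_fd, file_fd, &off, std::min(chunk, total));
                spent += std::chrono::duration_cast<std::chrono::microseconds>(clock::now() - before);

                if (r <= 0)
                        return r == -1 && (errno == EAGAIN || errno == EINTR);

                total -= static_cast<size_t>(r);
                chunk = 640 * 1024; // switch to larger chunks after the first call

                if (spent > budget)
                        return true; // over budget: yield now, resume on the next pass
        }
        return true;
}
```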
Thread pools in NGINX: NGINX is now using thread pools so that sendfile() won't block the I/O thread, because they too figured out that when it needs to block to page in blocks, it can kill performance. I should consider this, and expose it as a Tank daemon option.
Serving 100 Gbps from an Open Connect Appliance: a great write-up, mostly specific to the FreeBSD sendfile() implementation and kernel semantics, but likely relevant to what we do here as well.
The new feature discussed here will probably be supported soon: https://news.ycombinator.com/item?id=15412534
It was suggested that sendfile() would return EAGAIN/EWOULDBLOCK if the file was opened with O_NONBLOCK and sendfile() would otherwise need to block. Alas, sendfile() still blocks. It would have been fantastic if this worked.
The silver bullet in terms of async file I/O on Linux nowadays appears to be io_uring.
@giampaolo I am familiar with io_uring; in fact, I've been following its development since it was announced, I have been experimenting with it for some time, and I have discussed ways to implement a zero-copy sendfile alternative via io_uring (for TANK, and for other projects). I suspect TANK will support io_uring for disk (and later, network) I/O soon. I wasn't familiar with KTLS -- this looks great. Currently, TANK does not support TLS connections (because we have no need for it, and no one has requested it either), but it should be pretty trivial to support TLS connections if such a need arises.
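For anyone unfamiliar with io_uring, a minimal liburing-based read looks roughly like this (link with -luring; io_uring_prep_read() needs a reasonably recent kernel; the file name is a placeholder):

```cpp
#include <liburing.h>
#include <fcntl.h>
#include <cstdio>

int main() {
        struct io_uring ring;
        if (io_uring_queue_init(64, &ring, 0) < 0) {
                fprintf(stderr, "io_uring_queue_init failed\n");
                return 1;
        }

        const int fd = open("segment.log", O_RDONLY); // placeholder file name
        if (fd == -1) {
                perror("open");
                return 1;
        }

        static char buf[64 * 1024];

        // Queue one buffered read of the first 64KB of the file ...
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0 /* offset */);
        io_uring_submit(&ring);

        // ... and reap its completion. A real server would poll the ring from
        // its event loop instead of blocking here.
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                printf("read returned %d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return 0;
}
```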
Is that public? I'd be interested in taking a look at it, as I want to try to integrate it into an async FTP server. Extra: I'm not sure if this is useful for TANK (note: I'm not a user of TANK, I just ended up here by accident), but FYI with
@giampaolo no, the discussions are private. The ideas discussed didn't pan out, but I will consider other alternatives. sys_sendfile will eventually be supported (or a similar zero-copy opcode, anyway) in io_uring. I considered splice(), but in the end it wasn't worth it. You may want to study haproxy's implementation if you are interested in similar ideas.
No, this is not possible as of today. |
Streaming from broker
We are using sendfile() to stream data from segment logs to clients (or brokers acting as followers). This works great, and it's what Kafka does, but maybe we can do better, considering that sendfile() can block if the data is neither in the disk cache nor on fast SSD storage, which will in turn affect other producers and consumers because of the current single-threaded design. Even if we do wind up using multiple threads on the server, that still won't guarantee mostly block-free operation.

NGINX and Netflix contributed an excellent new sendfile implementation for FreeBSD, which supports AIO, which is exactly what we'd love to be able to use. Specifically, that new sys call adds 2 new flags and refines an existing flag (SF_NOCACHE, SF_READAHEAD, SF_NODISKAIO). Unfortunately, this won't become available on Linux anytime soon.
We could consider Linux AIO (use of libaio, with -laio and libaio.h, io_submit() etc.), but that'd require opening files with O_DIRECT, which comes with a whole lot of restrictions, and even then, we'd have to transfer from the file to user space memory and then use write() to stream to the socket, or use a fairly elaborate scheme with pipes and the various *splice()/tee() methods. I am not sure the complexity is going to be worth it, or that we'd necessarily get more performance out of it, given the need for more sys calls and the need to copy or shuffle around more data.
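For completeness, a minimal libaio read looks roughly like this; note the O_DIRECT requirement and the alignment constraints it drags in (file name, buffer size and alignment are placeholders):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <libaio.h> // link with -laio
#include <fcntl.h>
#include <cstdio>
#include <cstdlib>

int main() {
        io_context_t ctx = 0;
        if (io_setup(32, &ctx) < 0) {
                fprintf(stderr, "io_setup failed\n");
                return 1;
        }

        // O_DIRECT is effectively mandatory for truly asynchronous behaviour,
        // and it requires block-aligned buffers, offsets and lengths.
        const int fd = open("segment.log", O_RDONLY | O_DIRECT); // placeholder
        if (fd == -1) {
                perror("open");
                return 1;
        }

        void *buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0)
                return 1;

        struct iocb cb;
        struct iocb *cbs[1] = {&cb};
        io_prep_pread(&cb, fd, buf, 4096, 0 /* offset */);

        if (io_submit(ctx, 1, cbs) != 1) {
                fprintf(stderr, "io_submit failed\n");
                return 1;
        }

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, nullptr); // blocks here; a server would poll
        printf("read returned %ld\n", (long)ev.res);

        io_destroy(ctx);
        return 0;
}
```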
Another alternative is the use of mmap() and then the *splice() methods to transfer mmap()ed file data to the socket.
Many of those sys calls accept flags, and SPLICE_F_MOVE|SPLICE_F_NONBLOCK may come in handy. We still need to resort to pipe trickery, but again, this may be worth it.
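A sketch of the pipe trickery involved, assuming a regular file on one side and a socket on the other (the helper name is made up; a real implementation would keep the pipe around and handle EAGAIN/partial progress via the event loop):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h> // splice(), SPLICE_F_*
#include <unistd.h>
#include <cerrno>

// Move up to `len` bytes from `file_fd` (starting at *off) into `sock_fd`
// through a pipe; the payload never enters userspace.
static ssize_t splice_to_socket(int file_fd, loff_t *off, int sock_fd, size_t len) {
        int p[2];
        if (pipe(p) == -1)
                return -1;

        ssize_t total = 0;
        while (len) {
                // file -> pipe
                const ssize_t n = splice(file_fd, off, p[1], nullptr, len,
                                         SPLICE_F_MOVE | SPLICE_F_MORE);
                if (n <= 0)
                        break;

                // pipe -> socket (short writes are possible, so drain the pipe)
                ssize_t left = n;
                while (left > 0) {
                        const ssize_t m = splice(p[0], nullptr, sock_fd, nullptr, left,
                                                 SPLICE_F_MOVE | SPLICE_F_NONBLOCK);
                        if (m <= 0)
                                goto out; // EAGAIN etc.: the caller resumes later
                        left -= m;
                        total += m;
                }
                len -= static_cast<size_t>(n);
        }
out:
        close(p[0]);
        close(p[1]);
        return total;
}
```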
We should also consider LightHTTPD’s ‘asynchronous’ sendfile hack. Effectively what they do is:
Indeed, the data is never copied to userspace; it is not shuttled between kernel and user space. It requires use of AIO (or POSIX AIO, or some other userspace-threads I/O handoff scheme). The implementation can be found here.
All told, there are other options to consider, especially if we are going to support other OSes and platforms. This all comes down to reducing or even eliminating the likelihood of blocking sendfile() operations, so that other consumers/producers won't block waiting for them. It may not really be worth it for now, but we should come back to this if and when it is.
Appending bundles to segments
We are using writev() to append data to segment log files, which is always going to be fast because it's an append operation (although there are edge cases where it may not work like that). This should almost never block, but it might.

We can, again, rely on AIO (specifically, Linux AIO) for this, in order to minimize or eliminate the likelihood of a blocking writev(). The problem, again, is that it requires opening files with O_DIRECT, and the underlying filesystem must properly support AIO semantics. XFS seems to be the only safe choice; in fact, only Linux kernel 3.16+ includes an XFS implementation that properly deals with appends.
We could take the OS/architecture and filesystem into account, and optionally use AIO to do this.
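For reference, the plain (potentially blocking) append path is just a writev() of the bundle's iovecs to the current segment; a minimal sketch, with illustrative names, assuming the segment fd was opened with O_APPEND:

```cpp
#include <sys/types.h>
#include <sys/uio.h>
#include <cstddef>

// Append a bundle (header + payload) to the current segment log with one
// writev(). `segment_fd` is assumed to have been opened with O_APPEND, so the
// kernel positions the write at EOF. A production version would also handle
// short writes by retrying with adjusted iovecs.
static bool append_bundle(int segment_fd,
                          void *hdr, size_t hdr_len,
                          void *payload, size_t payload_len) {
        struct iovec iov[2];
        iov[0].iov_base = hdr;
        iov[0].iov_len = hdr_len;
        iov[1].iov_base = payload;
        iov[1].iov_len = payload_len;

        const ssize_t r = writev(segment_fd, iov, 2);
        return r == static_cast<ssize_t>(hdr_len + payload_len);
}
```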
File handles
If we are going to support many thousands of partitions, we need to consider the requirements. Specifically, we currently need 2 FDs for each partition (for the current segment's log and index), and 1 FD for each immutable segment. So for a partition with 5 immutable segments, we'd need 5 + 2 = 7 FDs. Furthermore, we need to mmap() all index files, although those are fairly small.
We could maintain a simple LRU cache (or maybe look into alternative replacement policies) of all FDs for opened segment files and bound its size based on e.g. getrlimit() with RLIMIT_NOFILE. So whenever we get EMFILE from accept4(), open(), socket(), etc., we'd ask the cache to close FDs. If we need to open a file and we get EMFILE, we'd ask the cache to close FDs so that we can open the file; if the cache is empty, it means we have used all FDs for sockets and we should perhaps try to use setrlimit() to adjust RLIMIT_NOFILE.

We are not going to need to solve this problem yet, but we should consider it, both for performance reasons and for efficient support of thousands or even millions of partitions.
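A toy sketch of such an LRU FD cache (illustrative only; the real thing would also need to cooperate with in-flight sendfile() transfers so it never closes an fd that is still being streamed):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <list>
#include <string>
#include <unordered_map>

// Toy LRU cache of file descriptors for immutable segment files.
// On EMFILE, acquire() evicts the least recently used fd and retries.
class FDCache {
        std::list<std::pair<std::string, int>> lru; // front = most recently used
        std::unordered_map<std::string, decltype(lru)::iterator> index;

public:
        // Returns a cached fd, or opens (and caches) the file; -1 on failure.
        int acquire(const std::string &path) {
                if (const auto it = index.find(path); it != index.end()) {
                        lru.splice(lru.begin(), lru, it->second); // bump to MRU
                        return it->second->second;
                }

                int fd;
                while ((fd = open(path.c_str(), O_RDONLY)) == -1 && errno == EMFILE)
                        if (!evict_one())
                                return -1; // all fds are held by sockets etc.

                if (fd != -1) {
                        lru.emplace_front(path, fd);
                        index[path] = lru.begin();
                }
                return fd;
        }

        // Close the least recently used descriptor; returns false if empty.
        bool evict_one() {
                if (lru.empty())
                        return false;
                close(lru.back().second);
                index.erase(lru.back().first);
                lru.pop_back();
                return true;
        }
};
```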
Warming up disk pages
We can use mincore(2) to determine which segment log pages are not currently in memory (block/file caches) and then 'touch' them so that they are paged in prior to accessing them. We should also look into the fcntl(fd, F_NOCACHE), posix_fadvise(), readahead(), fadvise(), posix_fallocate() and fallocate() calls, and use them when and where appropriate.
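A sketch of that warm-up pass, assuming the segment (or index) region is already mmap()ed starting at file offset 0; instead of literally touching the pages, it hints the non-resident runs in with posix_fadvise(POSIX_FADV_WILLNEED), which avoids faulting in the I/O thread:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

// Find the pages of an mmap()ed segment/index region that are not resident
// and hint the kernel to fault them in asynchronously. Assumes `addr` maps
// `fd` starting at file offset 0.
static void warm_region(int fd, void *addr, size_t len) {
        const long page = sysconf(_SC_PAGESIZE);
        std::vector<unsigned char> vec((len + page - 1) / page);

        if (mincore(addr, len, vec.data()) == -1)
                return; // best-effort only

        for (size_t i = 0; i < vec.size(); ++i) {
                if (vec[i] & 1)
                        continue; // page already resident

                // Extend to the end of this non-resident run, then hint it in.
                size_t j = i + 1;
                while (j < vec.size() && !(vec[j] & 1))
                        ++j;

                posix_fadvise(fd, static_cast<off_t>(i) * page,
                              static_cast<off_t>(j - i) * page, POSIX_FADV_WILLNEED);
                i = j;
        }
}
```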