Commits
Commits on Mar 29, 2023
-
null_blk: add support for copy offload
Implementation is based on the existing read and write infrastructure. Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
-
dm: Enable copy offload for dm-linear target
Set the copy_offload_supported flag to enable offload. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
-
dm: Add support for copy offload.
Before enabling copy for a dm target, check whether the underlying devices and the dm target support copy. Avoid splits happening inside the dm target: fail early if the request needs a split, since splitting copy requests is currently not supported. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
-
nvmet: add copy command support for bdev and file ns
Add support for handling the copy command on the target. For a bdev-ns we call into blkdev_issue_copy, which the block layer completes either by an offloaded copy request to the backend bdev or by emulating the request. For a file-ns we call vfs_copy_file_range to service the request. Currently the target always advertises copy capability by setting NVME_CTRL_ONCS_COPY in the controller ONCS. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
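A rough sketch of the file-ns path described above; the helper name and range layout are hypothetical, only vfs_copy_file_range() is the stock kernel API:

```c
/*
 * Hypothetical sketch of the file-ns path: service one copy range by
 * calling vfs_copy_file_range() on the namespace's backing file. The
 * function name and argument layout are illustrative, not the patch.
 */
#include <linux/fs.h>

static ssize_t nvmet_file_copy_one(struct file *file, loff_t src_off,
				   loff_t dst_off, size_t len)
{
	/* Source and destination ranges live on the same backing file. */
	return vfs_copy_file_range(file, src_off, file, dst_off, len, 0);
}
```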
-
nvme: add copy offload support
For a device supporting native copy, the nvme driver receives read and write requests with the BLK_COPY op flags. For the read request the nvme driver populates the payload with source information. For the write request the driver converts it to an nvme copy command using the source information in the payload and submits it to the device. The current design only supports a single source range. This design is courtesy of Mikulas Patocka's token-based copy. Trace event support is added for nvme_copy_cmd. The device copy limits are set as queue limits. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Javier González <javier.gonz@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
-
fs, block: copy_file_range for def_blk_ops for direct block device.
For a block device opened with O_DIRECT, use copy_file_range to issue device copy offload, and fall back to generic_copy_file_range in case the device copy offload capability is absent. Modify checks to allow bdevs to use copy_file_range. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
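A minimal user-space illustration of this path, assuming a kernel carrying this series; the device path, offsets, and length are examples only:

```c
/*
 * Copy one region of a raw block device to another via copy_file_range(2)
 * on an O_DIRECT fd. On a kernel with this series the call can be served
 * by device copy offload; otherwise it falls back or fails gracefully.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);  /* example device */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	off64_t src = 0;             /* source offset on the device   */
	off64_t dst = 1 << 20;       /* destination offset (1 MiB in) */
	ssize_t ret = copy_file_range(fd, &src, fd, &dst, 64 * 1024, 0);
	if (ret < 0)
		perror("copy_file_range");
	else
		printf("copied %zd bytes\n", ret);

	close(fd);
	return 0;
}
```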
-
block: add emulation for copy
For devices which do not support copy, copy emulation is added. It is required for in-kernel users like fabrics, where a file descriptor is not available and hence they can't use copy_file_range. Copy emulation is implemented by reading from the source into memory and writing to the corresponding destination asynchronously. Emulation is also used if copy offload fails or partially completes. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Vincent Fu <vincent.fu@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
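The in-kernel emulation is asynchronous and bio-based; the sketch below only mirrors the same read-into-memory-then-write idea from user space, with an assumed bounce-buffer size:

```c
/*
 * User-space sketch of the read-then-write idea behind copy emulation:
 * pull the source range into a bounce buffer, then write it out to the
 * destination offset. Chunk size and offsets are illustrative.
 */
#include <stdlib.h>
#include <unistd.h>

static int emulate_copy(int fd, off_t src, off_t dst, size_t len)
{
	size_t chunk = 256 * 1024;           /* bounce buffer size */
	char *buf = malloc(chunk);
	if (!buf)
		return -1;

	while (len) {
		size_t n = len < chunk ? len : chunk;
		ssize_t rd = pread(fd, buf, n, src);
		if (rd <= 0)
			break;
		if (pwrite(fd, buf, rd, dst) != rd)
			break;
		src += rd;
		dst += rd;
		len -= rd;
	}

	free(buf);
	return len ? -1 : 0;
}
```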
-
block: Add copy offload support infrastructure
Introduce blkdev_issue_copy, which takes similar arguments as copy_file_range and performs copy offload between two bdevs. Introduce the REQ_COPY copy offload operation flag. Create a read-write bio pair with a token as payload, submitted to the device in order. The read request populates the token with source-specific information, which is then passed along with the write request. This design is courtesy of Mikulas Patocka's token-based copy. Larger copies are divided based on the max_copy_sectors limit. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
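A heavily simplified sketch of the token scheme for a single chunk, with splitting, chunk-length handling, and error paths omitted; REQ_COPY comes from this series, the helper name is hypothetical, and this is not the actual blkdev_issue_copy() implementation:

```c
/*
 * Sketch only: one REQ_COPY read bio fills a token page with source
 * information, then one REQ_COPY write bio carrying the same token asks
 * the driver to perform the copy. All bio APIs used here are stock.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

static int copy_one_chunk(struct block_device *bdev, sector_t src, sector_t dst)
{
	struct page *token = alloc_page(GFP_KERNEL);
	struct bio *bio;
	int ret;

	if (!token)
		return -ENOMEM;

	/* READ with REQ_COPY: the driver fills the token with source info. */
	bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_COPY, GFP_KERNEL);
	bio->bi_iter.bi_sector = src;
	__bio_add_page(bio, token, PAGE_SIZE, 0);
	ret = submit_bio_wait(bio);
	bio_put(bio);
	if (ret)
		goto out;

	/* WRITE with REQ_COPY: the driver consumes the token and copies. */
	bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_COPY, GFP_KERNEL);
	bio->bi_iter.bi_sector = dst;
	__bio_add_page(bio, token, PAGE_SIZE, 0);
	ret = submit_bio_wait(bio);
	bio_put(bio);
out:
	__free_page(token);
	return ret;
}
```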
-
block: Introduce queue limits for copy-offload support
Add device limits as sysfs entries:
- copy_offload (RW): set copy offload (1) or emulation (0).
- copy_max_bytes (RW): maximum total length of a copy in a single payload.
- copy_max_bytes_hw (RO): reflects the device supported maximum limit.
The above limits help to split the copy payload in the block layer. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
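A small user-space helper that exercises the sysfs entries listed above; the device name is an example and the attributes exist only on a kernel carrying this series:

```c
/*
 * Read copy_max_bytes_hw and enable offload by writing 1 to copy_offload,
 * following the sysfs layout described in the commit message.
 */
#include <stdio.h>

int main(void)
{
	const char *base = "/sys/block/nvme0n1/queue";   /* example device */
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "%s/copy_max_bytes_hw", base);
	f = fopen(path, "r");
	if (f && fgets(buf, sizeof(buf), f))
		printf("device copy limit: %s", buf);
	if (f)
		fclose(f);

	snprintf(path, sizeof(path), "%s/copy_offload", base);
	f = fopen(path, "w");
	if (f) {
		fputs("1\n", f);   /* 1 = offload, 0 = emulation */
		fclose(f);
	}
	return 0;
}
```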
Commits on Mar 28, 2023
-
Merge branch 'iter-ubuf' into for-next
* iter-ubuf:
  iov_iter: import single vector iovecs as ITER_UBUF
  iov_iter: convert import_single_range() to ITER_UBUF
  IB/qib: make qib_write_iter() deal with ITER_UBUF iov_iter
  IB/hfi1: make hfi1_write_iter() deal with ITER_UBUF iov_iter
  snd: make snd_map_bufs() deal with ITER_UBUF
  snd: move mapping an iov_iter to user bufs into a helper
  iov_iter: add iovec_nr_user_vecs() helper
  iov_iter: teach iov_iter_iovec() to deal with ITER_UBUF
-
Merge branch 'for-6.4/block' into for-next
* for-6.4/block:
  block: open code __blk_account_io_done()
  block: open code __blk_account_io_start()
-
iov_iter: import single vector iovecs as ITER_UBUF
Add a special case to __import_iovec(), which imports a single segment iovec as an ITER_UBUF rather than an ITER_IOVEC. ITER_UBUF is cheaper to iterate than ITER_IOVEC, and for a single segment iovec, there's no point in using a segmented iterator. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
iov_iter: convert import_single_range() to ITER_UBUF
Since we're just importing a single vector, we don't have to turn it into an ITER_IOVEC. Instead turn it into an ITER_UBUF, which is cheaper to iterate. Signed-off-by: Jens Axboe <axboe@kernel.dk>
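To illustrate the conversion, a minimal sketch using the stock iov_iter_ubuf() helper; the wrapper function itself is hypothetical, not the import_single_range() change:

```c
/*
 * A single contiguous user buffer does not need a segmented ITER_IOVEC;
 * an ITER_UBUF describes it directly and is cheaper to iterate.
 */
#include <linux/uio.h>

static void setup_single_buf_iter(struct iov_iter *iter,
				  void __user *buf, size_t len)
{
	/* Build an ITER_UBUF over one user buffer instead of a 1-entry iovec. */
	iov_iter_ubuf(iter, ITER_DEST, buf, len);
}
```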
-
IB/qib: make qib_write_iter() deal with ITER_UBUF iov_iter
Don't assume that a user backed iterator is always of the type ITER_IOVEC. Handle the single segment case separately, then we can use the same logic for ITER_UBUF and ITER_IOVEC. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
IB/hfi1: make hfi1_write_iter() deal with ITER_UBUF iov_iter
Don't assume that a user backed iterator is always of the type ITER_IOVEC. Handle the single segment case separately, then we can use the same logic for ITER_UBUF and ITER_IOVEC. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
snd: make snd_map_bufs() deal with ITER_UBUF
This probably doesn't make any sense, as it's reliant on passing in different things in multiple segments. Most likely we can just make this go away as it's already checking for ITER_IOVEC upon entry, and it looks like nr_segments == 2 is the smallest legal value. IOW, any attempt to readv/writev with 1 segment would fail with -EINVAL already. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
snd: move mapping an iov_iter to user bufs into a helper
snd_pcm_{readv,writev} both do the same mapping of a struct iov_iter into an array of buffers. Move this into a helper. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
iov_iter: add iovec_nr_user_vecs() helper
This returns the number of user segments in an iov_iter. The input can either be an ITER_IOVEC, where it'll return the number of iovecs. Or it can be an ITER_UBUF, in which case the number of segments is always 1. Outside of those two, no user backed iterators exist. Just return 0 for those. Signed-off-by: Jens Axboe <axboe@kernel.dk>
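A sketch of the helper based only on the semantics described above; the in-tree implementation may differ:

```c
/*
 * ITER_IOVEC reports its segment count, ITER_UBUF is always a single
 * segment, anything else is not user backed and returns 0.
 */
#include <linux/uio.h>

static inline unsigned long iovec_nr_user_vecs(const struct iov_iter *i)
{
	if (iter_is_iovec(i))
		return i->nr_segs;
	if (iter_is_ubuf(i))
		return 1;
	return 0;
}
```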
Commits on Mar 27, 2023
-
block: open code __blk_account_io_done()
There is only one caller of __blk_account_io_done(), and the function is small enough to fit in its caller blk_account_io_done(). Remove the function and open-code it in its caller blk_account_io_done(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
block: open code __blk_account_io_start()
There is only one caller of __blk_account_io_start(), and the function is small enough to fit in its caller blk_account_io_start(). Remove the function and open-code it in its caller blk_account_io_start(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
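A generic before/after illustration of this kind of refactor, using hypothetical names rather than the actual block-layer accounting code:

```c
/* Before: a small static helper with exactly one caller. */
static void __account_done(unsigned long *stat)
{
	(*stat)++;
}

static void account_done_before(unsigned long *stat, int counted)
{
	if (counted)
		__account_done(stat);
}

/* After: the helper body is open-coded in its only caller. */
static void account_done_after(unsigned long *stat, int counted)
{
	if (counted)
		(*stat)++;
}
```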
-
Merge branch 'for-6.4/io_uring' into for-next
* for-6.4/io_uring:
  io_uring: encapsulate task_work state
  io_uring: remove extra tw trylocks
  io_uring/io-wq: drop outdated comment
  io_uring: kill unused notif declarations
  io_uring/rw: transform single vector readv/writev into ubuf
  io-wq: Drop struct io_wqe
  io-wq: Move wq accounting to io_wq
  io_uring/kbuf: disallow mapping a badly aligned provided ring buffer
  io_uring: Add KASAN support for alloc_caches
  io_uring: Move from hlist to io_wq_work_node
  io_uring: One wqe per wq
  io_uring: add support for user mapped provided buffer ring
  io_uring/kbuf: rename struct io_uring_buf_reg 'pad' to 'flags'
  io_uring/kbuf: add buffer_list->is_mapped member
  io_uring/kbuf: move pinning of provided buffer ring into helper
  io_uring: Adjust mapping wrt architecture aliasing requirements
  io_uring: avoid hashing O_DIRECT writes if the filesystem doesn't need it
  fs: add FMODE_DIO_PARALLEL_WRITE flag
-
Merge branch 'for-6.4/block' into for-next
* for-6.4/block:
  blk-mq: remove hybrid polling
  blk-crypto: drop the NULL check from blk_crypto_put_keyslot()
  blk-mq: return actual keyslot error in blk_insert_cloned_request()
  blk-crypto: remove blk_crypto_insert_cloned_request()
  blk-crypto: make blk_crypto_evict_key() more robust
  blk-crypto: make blk_crypto_evict_key() return void
  blk-mq: release crypto keyslot before reporting I/O complete
  nbd: use the structured req attr check
  nbd: allow genl access outside init_net
-
Merge branch 'for-6.4/splice' into for-next
* for-6.4/splice:
  block: convert bio_map_user_iov to use iov_iter_extract_pages
  block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  block: Add BIO_PAGE_PINNED and associated infrastructure
  block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  block: Fix bio_flagged() so that gcc can better optimise it
  iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
  iov_iter: Kill ITER_PIPE
  cifs: Use generic_file_splice_read()
  splice: Do splice read from a file without using ITER_PIPE
  tty, proc, kernfs, random: Use direct_splice_read()
  coda: Implement splice-read
  overlayfs: Implement splice-read
  shmem: Implement splice-read
  splice: Make do_splice_to() generic and export it
  splice: Clean up direct_splice_read() a bit
-
io_uring: encapsulate task_work state
For task work we're passing around a bool pointer for whether the current ring is locked or not; let's wrap it in a structure. That will make it more opaque, preventing abuse, and will also help us pass more info in the future if needed. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: remove extra tw trylocks
Before cond_resched()'ing in handle_tw_list() we also drop the current ring context, and so the next loop iteration will need to pick/pin a new context and do trylock. The chunk removed by this patch was intended to be an optimisation covering exactly this case, i.e. retaking the lock after reschedule, but in reality it's skipped for the first iteration after resched as described and will keep hammering the lock if it's contended. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1ecec9483d58696e248d1bfd52cf62b04442df1d.1679931367.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring/io-wq: drop outdated comment
Since the move to PF_IO_WORKER, we don't juggle memory context manually anymore. Remove that outdated part of the comment for __io_worker_idle(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: kill unused notif declarations
There are two leftover structures from the notification registration mechanism that was never released; kill them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/f05f65aebaf8b1b5bf28519a8fdb350e3e7c9ad0.1679924536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring/rw: transform single vector readv/writev into ubuf
It's very common to have applications that use vectored reads or writes, even if they only pass in a single segment. Obviously they should be using read/write at that point, but... Vectored IO comes with the downside of needing to retain iovec state, and hence requires an allocation and state copy if the request ends up getting deferred. Additionally, extra cleanup is needed on completion, as the allocated state memory has to be freed. Automatically transform single segment IORING_OP_{READV,WRITEV} into IORING_OP_{READ,WRITE}, and hence into an ITER_UBUF. Outside of being more efficient if deferral is needed, ITER_UBUF is also more efficient for normal processing compared to ITER_IOVEC, as it doesn't require iteration. The latter is apparent when running peak testing, where using IORING_OP_READV to randomly read 24 drives previously scored:
IOPS=72.54M, BW=35.42GiB/s, IOS/call=32/31
IOPS=73.35M, BW=35.81GiB/s, IOS/call=32/31
IOPS=72.71M, BW=35.50GiB/s, IOS/call=32/31
IOPS=73.29M, BW=35.78GiB/s, IOS/call=32/32
IOPS=73.45M, BW=35.86GiB/s, IOS/call=32/32
IOPS=73.19M, BW=35.74GiB/s, IOS/call=31/32
IOPS=72.89M, BW=35.59GiB/s, IOS/call=32/31
IOPS=73.07M, BW=35.68GiB/s, IOS/call=32/32
and after the change we get:
IOPS=77.31M, BW=37.75GiB/s, IOS/call=32/31
IOPS=77.32M, BW=37.75GiB/s, IOS/call=32/32
IOPS=77.45M, BW=37.81GiB/s, IOS/call=31/31
IOPS=77.47M, BW=37.83GiB/s, IOS/call=32/32
IOPS=77.14M, BW=37.67GiB/s, IOS/call=32/32
IOPS=77.14M, BW=37.66GiB/s, IOS/call=31/31
IOPS=77.37M, BW=37.78GiB/s, IOS/call=32/32
IOPS=77.25M, BW=37.72GiB/s, IOS/call=32/32
which is a nice win as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
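From user space the transformation looks like this: a READV sqe carrying exactly one iovec describes the same I/O as a plain READ sqe, which is what the kernel now converts it to internally. liburing prep calls are shown for comparison; fd and buffer setup are illustrative:

```c
/* Compare a single-segment vectored read with the equivalent plain read. */
#include <liburing.h>

static void prep_examples(struct io_uring *ring, int fd, char *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_sqe *sqe;

	/* Single-segment vectored read ... */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);

	/* ... is equivalent to a plain read of the same buffer. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd, buf, len, 0);
}
```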
-
io-wq: Drop struct io_wqe
Since commit 0654b05 ("io_uring: One wqe per wq"), we have just a single io_wqe instance embedded per io_wq. Drop the extra structure in favor of accessing struct io_wq directly, cleaning up quite a few dereferences and backpointers. No functional changes intended. Tested with liburing's testsuite and mmtests performance microbenchmarks; I didn't observe any performance regressions. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io-wq: Move wq accounting to io_wq
Since we now have a single io_wqe per io_wq instead of one per node, and in preparation for its removal, move the accounting into the parent structure. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20230322011628.23359-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring/kbuf: disallow mapping a badly aligned provided ring buffer
On at least parisc, we have strict requirements on how we virtually map an address that is shared between the application and the kernel. On these platforms, IOU_PBUF_RING_MMAP should be used when setting up a shared ring buffer for provided buffers. If the application is mapping these pages and asking the kernel to pin+map them as well, then we have no control over what virtual address we get in the kernel. For that case, do a sanity check if SHM_COLOUR is defined, and disallow the mapping request. The application must fall back to using IOU_PBUF_RING_MMAP for this case, and liburing will do that transparently with the set of helpers that it has. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: Add KASAN support for alloc_caches
Add support for KASAN in the alloc caches (apoll and netmsg_cache). Thus, if something touches an unused cache entry, it will raise a KASAN warning/exception. The object is poisoned when it is put into the cache, and unpoisoned when it is retrieved or freed. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: Move from hlist to io_wq_work_node
Having cache entries linked using the hlist format brings no benefit, and also requires an unnecessary extra pointer per cache entry. Use the internal io_wq_work_node single-linked list for the internal alloc caches (async_msghdr and async_poll). This is required to be able to use KASAN on cache entries, since we do not need to touch unused (and poisoned) cache entries when adding more entries to the list. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20230223164353.2839177-2-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
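A minimal generic illustration of the data-structure change: a singly linked cache needs only one embedded next pointer per entry, and put/get never touch any other (possibly poisoned) entry. The types here are stand-ins, not the io_uring alloc-cache code itself:

```c
struct cache_node {
	struct cache_node *next;   /* single embedded pointer per entry */
};

struct simple_cache {
	struct cache_node *head;
	unsigned int nr;
};

static void cache_put(struct simple_cache *c, struct cache_node *n)
{
	n->next = c->head;         /* push onto the singly linked free list */
	c->head = n;
	c->nr++;
}

static struct cache_node *cache_get(struct simple_cache *c)
{
	struct cache_node *n = c->head;

	if (n) {                   /* pop without touching other entries */
		c->head = n->next;
		c->nr--;
	}
	return n;
}
```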
-
io_uring: One wqe per wq
Right now io_wq allocates one io_wqe per NUMA node. As io_wq is now bound to a task, the task basically uses only the NUMA-local io_wqe and almost never changes NUMA nodes; thus, the other wqes are mostly unused. Allocate just one io_wqe embedded into io_wq, and use all possible CPUs (cpu_possible_mask) in io_wqe->cpumask. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20230310201107.4020580-1-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
io_uring: add support for user mapped provided buffer ring
The ring mapped provided buffer rings rely on the application allocating the memory for the ring, which the kernel will then map. This generally works fine, but runs into issues on some architectures where we need to be able to ensure that the kernel and application virtual addresses for the ring play nicely together. This at least impacts architectures that set SHM_COLOUR, but potentially also anyone setting SHMLBA. To use this variant of ring provided buffers, the application need not allocate any memory for the ring. Instead the kernel will do so, and the application must subsequently call mmap(2) on the ring with the offset set to: IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT) to get a virtual address for the buffer ring. Normally the application would allocate a suitable (and correctly aligned) piece of memory and simply pass that in via io_uring_buf_reg.ring_addr, and the kernel would map it. Outside of the setup differences, the kernel-allocated + user-mapped provided buffer ring works exactly the same. Acked-by: Helge Deller <deller@gmx.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
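A sketch of the user-mapped variant using the raw io_uring_register(2) interface; the constants and struct io_uring_buf_reg come from the io_uring uapi header this series touches, while ring_fd, bgid, and the entry count are illustrative (liburing also provides helpers for this):

```c
/*
 * Register a kernel-allocated provided buffer ring and mmap it using the
 * offset encoding given in the commit message above.
 */
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *map_kernel_pbuf_ring(int ring_fd, unsigned short bgid,
				  unsigned int entries)
{
	struct io_uring_buf_reg reg;

	memset(&reg, 0, sizeof(reg));
	reg.ring_entries = entries;        /* assumed to be a power of two */
	reg.bgid = bgid;
	reg.flags = IOU_PBUF_RING_MMAP;    /* kernel allocates the ring */

	if (syscall(__NR_io_uring_register, ring_fd, IORING_REGISTER_PBUF_RING,
		    &reg, 1) < 0)
		return NULL;

	/* mmap offset: IORING_OFF_PBUF_RING | (bgid << IORING_OFF_PBUF_SHIFT) */
	off_t off = IORING_OFF_PBUF_RING |
		    ((off_t)bgid << IORING_OFF_PBUF_SHIFT);
	return mmap(NULL, entries * sizeof(struct io_uring_buf),
		    PROT_READ | PROT_WRITE, MAP_SHARED, ring_fd, off);
}
```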