Direct IO Support #10018
base: master
Conversation
force-pushed from 21eddef to 3464559
I'd like to understand the use cases for the various property values. Could we do something simpler like:
directio=standard | always | disabled
where standard means: if you request DIRECTIO, we’ll do it directly if we think it’s a good idea (e.g. writes are recordsize-aligned), and otherwise we'll do the i/o non-directly (we won't fail it for poor alignment). This is the default.
always means act like DIRECTIO was always requested (may be actually direct or indirect depending on i/o alignment, won't fail for poor alignment).
disabled means act like DIRECTIO was never requested (which is the current behavior).
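To make the proposed semantics concrete, here is a minimal C sketch of that decision table; the enum and helper names are hypothetical and are not code from this PR:

#include <stdbool.h>

/* Hypothetical property values mirroring the proposal above. */
typedef enum {
	DIRECTIO_STANDARD,	/* honor O_DIRECT only when it looks worthwhile */
	DIRECTIO_ALWAYS,	/* behave as if O_DIRECT were always requested */
	DIRECTIO_DISABLED	/* ignore O_DIRECT entirely (current behavior) */
} directio_prop_t;

/*
 * Decide whether an I/O should be issued directly.  Under this proposal poor
 * alignment never fails the request; it only falls back to the buffered path.
 */
static bool
directio_should_go_direct(directio_prop_t prop, bool requested, bool aligned)
{
	switch (prop) {
	case DIRECTIO_DISABLED:
		return (false);
	case DIRECTIO_ALWAYS:
		return (aligned);
	case DIRECTIO_STANDARD:
	default:
		return (requested && aligned);
	}
}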
Couple quick comments.
force-pushed from 865bfc2 to 428962a
Codecov Report

@@            Coverage Diff             @@
##           master   #10018       +/-  ##
===========================================
- Coverage   75.17%   61.94%    -13.24%
===========================================
  Files         402      260       -142
  Lines      128071    73582     -54489
===========================================
- Hits        96283    45578     -50705
+ Misses      31788    28004      -3784
force-pushed from f405fed to a6894d1
module/zfs/dmu.c (outdated)

	zio = zio_write(pio, os->os_spa, txg, bp, data,
	    db->db.db_size, db->db.db_size, &zp,
	    dmu_write_direct_ready, NULL, NULL, dmu_write_direct_done, dsa,
	    ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, &zb);
I'm a little concerned about bypassing zio_dva_throttle(). Background: slides 11-16 and video from my talk at BSDCan 2016.
This means that DIRECTIO writes will be spread out among the vdevs using the old round-robin algorithm. This could potentially result in poor performance due to allocating from the slowest / most fragmented vdev, and we could potentially make the vdevs even more imbalanced (at least in terms of performance/fragmentation). @grwilson do you have any thoughts on this? How big could the impact be, and what are potential ways to mitigate it? Could we make this use the throttle?
force-pushed from a6894d1 to 04e3a35
Is such a large reorganization really needed? Couldn't things be solved by adding more prototypes at the beginning of the file or in the header files? I'm just asking because this will make debugging via git blame more difficult.
Overall, I have to say, thanks for taking this one on! This looks like it wasn't trivial to figure out. I'm excited about this PR; it looks to be a solid basis for this support.
force-pushed from 7240ecf to 7089709
force-pushed from cf998e3 to 9c4d98e
force-pushed from 84b3be8 to 0bfa246
force-pushed from 961a851 to 9ec7126
force-pushed from 68ee040 to ca88e6a
force-pushed from ca88e6a to b320dcf
force-pushed from fd6429a to 4154057
force-pushed from 4154057 to 304fb88
force-pushed from af59a63 to 7bb313f
force-pushed from 7bb313f to 2d0b46a
My system is crashing when I use ZFS with this Direct IO patch applied. GPT-4 analysis of the crash logs:

Kernel and System Information: The system was running Oracle Linux Server with a 5.15.0-101.103.2.1.el9uek.x86_64 kernel, on a QEMU Standard PC (i440FX + PIIX, 1996) virtualized environment.

Crash Details: A NULL pointer dereference occurred, a type of error where the program attempted to read or write memory that it should not have access to. This often indicates a bug in the kernel or in a kernel module.

Specific Error: The error occurred in the function abd_bio_map_off within the ZFS module. The stack trace indicates functions related to disk I/O operations within the ZFS file system were involved (__vdev_disk_physio, vdev_disk_io_start, zio_vdev_io_start).

Tainted Kernel: The kernel was tainted with the 'P' flag, indicating that proprietary modules were loaded, and 'O', which denotes that an externally-built module was loaded. Specifically, the ZFS module is mentioned, which is known to be outside of the mainline Linux kernel source.

Bug Location: The actual bug that caused the crash is reported at abd_bio_map_off+0xf2/0x2c0 [zfs], indicating an issue within the ZFS filesystem module.
force-pushed from 2d0b46a to 0d98c9a
I apologize for taking so long to reply to this. By any chance, do you happen to know the IO workload that was taking place when this happened? In particular, what IO size was being issued? Also, what was the recordsize set to, as well as the ashift? I would like to replicate the IO workload to see if I can trigger this myself. Thank you for providing the trace dump. I am hoping that with more details on the IO workload and ZFS configuration I will be able to replicate this bug and get it fixed.
force-pushed from 0d98c9a to 3c72dea
force-pushed from c28b199 to caee5c9
force-pushed from caee5c9 to 17b0a79
force-pushed from bb63fda to d90e7b5
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request sizes
must, at a minimum, be PAGE_SIZE aligned. If they are not, EINVAL is
returned unless the direct property is set to always (see below).
For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write request
will take the normal (buffered) write path. If the request is block
aligned and a cached copy of the buffer exists in the ARC, that copy will
be discarded from the ARC, forcing all further reads to retrieve the data
from disk.
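To make the alignment rules above concrete, here is a small, hedged user-space sketch; the file path and the 128K recordsize are assumptions. It allocates a page-aligned buffer and issues a recordsize-aligned O_DIRECT write, with an explicit fsync() since O_DIRECT does not imply O_SYNC:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	/* Assumption: the dataset recordsize is 128K. */
	const size_t recordsize = 128 * 1024;
	void *buf;

	/* O_DIRECT requires a PAGE_SIZE-aligned buffer, offset, and length. */
	if (posix_memalign(&buf, (size_t)sysconf(_SC_PAGESIZE), recordsize) != 0)
		return (1);
	memset(buf, 0xab, recordsize);

	/* Hypothetical file on a dataset where the direct property is in effect. */
	int fd = open("/tank/fs/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0)
		return (1);

	/* A recordsize-aligned offset and length keep the write on the direct path. */
	if (pwrite(fd, buf, recordsize, 0) != (ssize_t)recordsize)
		return (1);

	/* O_DIRECT does not imply O_SYNC; metadata syncs with the TXG unless we fsync. */
	fsync(fd);
	close(fd);
	free(buf);
	return (0);
}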
For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. If the requested
data is buffered (in the ARC), it will simply be copied from the ARC into
the user buffer.
For both O_DIRECT writes and reads, the O_DIRECT flag will be ignored if
the file's contents are mmap'ed. In this case, all requests that are at
least PAGE_SIZE aligned will simply fall back to the buffered paths. If
the request is not PAGE_SIZE aligned, however, EINVAL will be returned as
usual, regardless of whether the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Dedup
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the operating systems supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection, so
          any data in the user buffers that is written directly down to
          the VDEV disks is guaranteed not to change. There is no data
          integrity concern with O_DIRECT writes.
Linux -   Linux is not able to place anonymous user pages under write
          protection. Because of this, if the user decides to manipulate
          the page contents while the write operation is occurring, data
          integrity can not be guaranteed. However, there is a module
          parameter `zfs_vdev_direct_write_verify_pct` that controls the
          percentage of O_DIRECT writes to a top-level VDEV for which a
          checksum verification is run before the contents of the user
          buffers are committed to disk. In the event of a checksum
          verification failure the write will be redirected through the
          ARC. The default value for `zfs_vdev_direct_write_verify_pct`
          is 2 percent of Direct I/O writes to a top-level VDEV. The
          number of O_DIRECT write checksum verification errors can be
          observed by doing `zpool status -d`, which will list all
          verification errors that have occurred on a top-level VDEV.
          Along with `zpool status`, a ZED event will be issued as
          `dio_verify` when a checksum verification error occurs.
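As a rough illustration of how a percentage-based verification gate of this sort could work; the helper, counter, and reset behavior below are hypothetical and are not the PR's implementation:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunable standing in for zfs_vdev_direct_write_verify_pct (default 2). */
static uint32_t direct_write_verify_pct = 2;

/* Hypothetical per-top-level-VDEV count of Direct I/O writes since the last verify. */
static uint64_t dio_writes_since_verify;

/*
 * Return true when this Direct I/O write should have its checksum re-verified
 * against the user buffer before the data is committed to disk.
 */
static bool
dio_should_verify_write(void)
{
	if (direct_write_verify_pct == 0)
		return (false);

	/* Verify roughly direct_write_verify_pct out of every 100 writes per VDEV. */
	if (++dio_writes_since_verify * direct_write_verify_pct >= 100) {
		dio_writes_since_verify = 0;
		return (true);
	}
	return (false);
}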
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
         O_DIRECT and will perform O_DIRECT IO if the alignment
         restrictions are met; otherwise the request is redirected
         through the ARC. This property value will not allow a
         request to fail.
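As a hedged usage sketch (the helper name is illustrative): because standard fails misaligned requests with EINVAL while always never fails, an application that cannot guarantee alignment under direct=standard can clear O_DIRECT on the descriptor and retry the request buffered:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Illustrative helper: attempt the O_DIRECT write and, if it is rejected
 * with EINVAL (e.g. a misaligned request under direct=standard), clear
 * O_DIRECT on the file descriptor and retry through the buffered path.
 */
static ssize_t
pwrite_with_buffered_fallback(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t n = pwrite(fd, buf, len, off);

	if (n < 0 && errno == EINVAL) {
		int flags = fcntl(fd, F_GETFL);
		if (flags >= 0 && fcntl(fd, F_SETFL, flags & ~O_DIRECT) == 0)
			n = pwrite(fd, buf, len, off);
	}
	return (n);
}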
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
force-pushed from d90e7b5 to 161492f
Adding O_DIRECT support to ZFS.
Motivation and Context
By adding Direct IO support to ZFS, the ARC can be bypassed when issuing reads/writes.
There are certain cases where caching data in the ARC can decrease overall performance.
In particular, zpools composed of NVMe devices displayed poor read/write performance
due to the extra overhead of the memcpy's issued to the ARC.
There are also cases where caching in the ARC may not make sense such as when data
will not be referenced later. By using the O_DIRECT flag, unnecessary data copies to the
ARC can be avoided.
Closes Issue: #8381
Description
O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests.
This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just
as in other filesystems, O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will
not be synced until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request sizes must, at a minimum, be PAGE_SIZE aligned.
If they are not, EINVAL is returned unless the direct property is set to always.
For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write request will take the normal (buffered) write path.
If the request is block aligned and a cached copy of the buffer exists in the ARC, that copy will be discarded
from the ARC, forcing all further reads to retrieve the data from disk.
For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. If the requested data is buffered
(in the ARC), it will simply be copied from the ARC into the user buffer.
To ensure data integrity for all data written using O_DIRECT, all user pages are made stable whenever one
of the following is required:
Checksum
Compression
Encryption
Parity
By making the user pages stable, we ensure the contents of the user-provided buffer cannot be changed after
any of the above operations have taken place.
A new dataset property direct has been added with the following 3 allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed O_DIRECT. In the event the request is not page aligned, it will be redirected through the ARC. All other alignment restrictions are followed.
Direct IO does not bypass the ZIO pipeline, so checksums, compression, etc. are still supported with Direct IO.
Some issues that still need to be addressed:
How Has This Been Tested?
Testing was primarily done using FIO and XDD with stripe, mirror, raidz, and dRAID VDEV zpools.
Tests were performed on CentOS using various kernels, including 3.10, 4.18, and 4.20.