Direct IO Support #10018

Open · wants to merge 1 commit into master from direct_page_aligned

Conversation

Contributor

@bwatkinson bwatkinson commented Feb 18, 2020

Adding O_DIRECT support to ZFS.

Motivation and Context

By adding Direct IO support to ZFS, the ARC can be bypassed when issuing reads/writes.
There are certain cases where caching data in the ARC can decrease overall performance.
In particular, zpools composed of NVMe devices displayed poor read/write performance
due to the extra overhead of the memcpy() into the ARC.

There are also cases where caching in the ARC may not make sense such as when data
will not be referenced later. By using the O_DIRECT flag, unnecessary data copies to the
ARC can be avoided.

Closes Issue: #8381

Description

O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests.
This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just
as in other filesystems, O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will
not be synced until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request sizes, at a minimum, must be PAGE_SIZE aligned.
If they are not, EINVAL is returned, unless the direct property is set to always.

For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write request will take the normal (buffered) write path.
If the request is block aligned and a cached copy of the buffer exists in the ARC, the buffer will be discarded
from the ARC, forcing all further reads to retrieve the data from disk.

For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. If the requested data is already cached
in the ARC, it will simply be copied from the ARC into the user buffer.

To ensure data integrity for all data written using O_DIRECT, all user pages are made stable whenever one
of the following is required:

  • Checksum
  • Compression
  • Encryption
  • Parity

By making the user pages stable, we ensure the contents of the user-provided buffer cannot be changed after
any of the above operations have taken place.
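Because Linux cannot write-protect anonymous user pages, the commit message below describes a `zfs_vdev_direct_write_verify_pct` module parameter and a `zpool status -d` view of verification errors. A hypothetical way to inspect them (the sysfs path follows the usual ZFS module-parameter convention, and the pool name is a placeholder; neither is quoted from this PR):

```shell
# Hypothetical: inspect and tune the O_DIRECT write-verify percentage.
cat /sys/module/zfs/parameters/zfs_vdev_direct_write_verify_pct
echo 5 > /sys/module/zfs/parameters/zfs_vdev_direct_write_verify_pct

# Show O_DIRECT checksum verification errors per top-level VDEV.
zpool status -d tank
```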

A new dataset property direct has been added with the following 3
allowable values:

  • disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request.

  • standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used.

  • always - Treats every write/read IO request as though the O_DIRECT flag was passed. If the request is not page aligned, it will be redirected through the ARC. All other alignment restrictions are followed.

Direct IO does not bypass the ZIO pipeline, so checksums, compression, etc. are still
supported with Direct IO.
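Assuming the usual `zfs set`/`zfs get` interface, using the new property might look like this (pool and dataset names are placeholders):

```shell
# Honor O_DIRECT, applying the alignment restrictions above (the default).
zfs set direct=standard tank/fs01
zfs get direct tank/fs01

# Treat every read/write as though O_DIRECT was passed; never fail a request.
zfs set direct=always tank/scratch

# Accept but silently ignore O_DIRECT; all IO stays buffered.
zfs set direct=disabled tank/home
```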

Some issues that still need to be addressed:

  • Create ZTS tests for O_DIRECT
  • Possibly allow for DVA throttle with O_DIRECT writes
  • Further testing/verification of FreeBSD (majority of debugging has been on Linux)
  • Possibly allow for O_DIRECT with zvols
  • Address race conditions in dbuf code with O_DIRECT

How Has This Been Tested?

Testing was primarily done using FIO and XDD with striped, mirror, raidz, and dRAID VDEV zpools.

Tests were performed on CentOS using various kernels, including 3.10, 4.18, and 4.20.
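An FIO run of the kind described might look like the following; the exact job parameters used for this PR's testing are not given, so these flags (block size, file size, job count, target path) are illustrative only:

```shell
# Illustrative O_DIRECT write workload against a dataset on the test pool.
fio --name=dio-write --filename=/tank/fs01/fio.dat \
    --rw=write --bs=1m --size=8g \
    --ioengine=psync --direct=1 \
    --numjobs=4 --group_reporting
```

With --direct=1, fio opens the file with O_DIRECT, so the alignment and recordsize rules above determine whether the writes actually take the direct path.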

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@behlendorf behlendorf self-requested a review February 18, 2020 17:17
@behlendorf behlendorf added the Type: Feature Feature request or new feature label Feb 18, 2020
@behlendorf behlendorf added this to New OpenZFS 2.0 Features (0.8->2.0) in OpenZFS 2.0 Feb 18, 2020
@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Feb 18, 2020
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from 21eddef to 3464559 Compare February 19, 2020 20:06
Member

@ahrens ahrens left a comment


I'd like to understand the use cases for the various property values. Could we do something simpler like:

directio=standard | always | disabled

where standard means: if you request DIRECTIO, we’ll do it directly if we think it’s a good idea (e.g. writes are recordsize-aligned), and otherwise we'll do the i/o non-directly (we won't fail it for poor alignment). This is the default.

always means act like DIRECTIO was always requested (may be actually direct or indirect depending on i/o alignment, won't fail for poor alignment).

disabled means act like DIRECTIO was never requested (which is the current behavior).

man/man8/zfsprops.8
man/man8/zfsprops.8
module/zfs/dbuf.c
module/zfs/vdev_queue.c
Contributor

@behlendorf behlendorf left a comment


Couple quick comments.

config/kernel-get-user-pages.m4
config/kernel-get-user-pages.m4
config/kernel-get-user-pages.m4
include/os/linux/kernel/linux/kmap_compat.h
include/os/linux/spl/sys/mutex.h
include/os/linux/spl/sys/uio.h
include/sys/abd.h
module/os/linux/zfs/abd.c
module/os/linux/zfs/abd.c
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 3 times, most recently from 865bfc2 to 428962a Compare February 25, 2020 00:25
@codecov

codecov bot commented Feb 25, 2020

Codecov Report

Attention: 309 lines in your changes are missing coverage. Please review.

Comparison is base (161ed82) 75.17% compared to head (04e3a35) 61.94%.
Report is 2308 commits behind head on master.

❗ Current head 04e3a35 differs from pull request most recent head 161492f. Consider uploading reports for the commit 161492f to get more accurate results

Files Patch % Lines
module/zfs/dmu.c 51.01% 265 Missing ⚠️
module/os/linux/zfs/abd.c 88.30% 20 Missing ⚠️
module/zfs/dbuf.c 75.71% 17 Missing ⚠️
lib/libzpool/kernel.c 0.00% 5 Missing ⚠️
include/sys/abd.h 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10018       +/-   ##
===========================================
- Coverage   75.17%   61.94%   -13.24%     
===========================================
  Files         402      260      -142     
  Lines      128071    73582    -54489     
===========================================
- Hits        96283    45578    -50705     
+ Misses      31788    28004     -3784     
Flag Coverage Δ
kernel 51.01% <43.78%> (-27.75%) ⬇️
user 59.10% <59.33%> (+11.67%) ⬆️


@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 3 times, most recently from f405fed to a6894d1 Compare February 28, 2020 00:27
module/zfs/dmu.c
module/zfs/dmu.c
Comment on lines 1761 to 1799
zio = zio_write(pio, os->os_spa, txg, bp, data,
db->db.db_size, db->db.db_size, &zp,
dmu_write_direct_ready, NULL, NULL, dmu_write_direct_done, dsa,
ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, &zb);
Member

I'm a little concerned about bypassing zio_dva_throttle(). Background: slides 11-16 and video from my talk at BSDCAN 2016.

This means that DIRECTIO writes will be spread out among the vdevs using the old round-robin algorithm. This could potentially result in poor performance due to allocating from the slowest / most fragmented vdev, and we could potentially make the vdevs even more imbalanced (at least in terms of performance/fragmentation). @grwilson do you have any thoughts on this? How big the impact could be, and potential ways to mitigate? Could we make this use the throttle?

Contributor

@snajpa snajpa left a comment


Is such a large reorganization really needed? Couldn't things be solved with more prototypes at the beginning/in the header files? I'm just asking, because this will make debugging via git blame more difficult.

@snajpa
Contributor

snajpa commented Jun 20, 2020

Overall, I have to say, thanks for taking this one on! This looks like it wasn't trivial to figure out.

With regards to zio_dva_throttle() and performance, I'd like to point to an older PR here: #7560 - so it looks like skipping it might have some justification. Ideally, IMHO, it'd be best to leave it up to the user (i.e. configurable).

I'm excited about this PR, it looks to be a solid basis for support of .splice_read()/.splice_write() in order to support IO to/from pipes. I was looking at it this week because of vpsfreecz/linux@1a980b8 - with OverlayFS on top of ZFS, this patch makes all apps using sendfile(2) go tra-la. Issue about that one: #1156

@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 3 times, most recently from cf998e3 to 9c4d98e Compare September 12, 2023 17:06
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from 84b3be8 to 0bfa246 Compare September 19, 2023 22:16
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from 961a851 to 9ec7126 Compare September 22, 2023 23:15
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from 68ee040 to ca88e6a Compare October 5, 2023 23:35
@amotin amotin mentioned this pull request Oct 19, 2023
13 tasks
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 3 times, most recently from fd6429a to 4154057 Compare October 25, 2023 20:34
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from af59a63 to 7bb313f Compare November 10, 2023 00:40
@swamikevala

My system is crashing when I use zfs with this direct io patch applied.

GPT4 analysis:
The provided vmcore-dmesg.txt file contains system log messages that can be used to analyze the state of the system leading up to and during the crash of your virtual machine (VM).

The following information has been extracted from the crash logs:

Kernel and System Information: The system was running Oracle Linux Server with a 5.15.0-101.103.2.1.el9uek.x86_64 kernel version, on a QEMU Standard PC (i440FX + PIIX, 1996) virtualized environment.

Crash Details: A NULL pointer dereference occurred, which is a type of error where the program attempted to read or write memory that it should not have access to. This often indicates a bug in the kernel or in a kernel module.

Specific Error: The error occurred in a function abd_bio_map_off within the ZFS module. The stack trace indicates functions related to disk I/O operations within the ZFS file system were involved (__vdev_disk_physio, vdev_disk_io_start, zio_vdev_io_start).

Tainted Kernel: The kernel was tainted with the 'P' flag, indicating that proprietary modules were loaded, and 'O', which denotes that an externally-built module was loaded. Specifically, the ZFS module is mentioned, which is known to be outside of the mainline Linux kernel source.

Bug Location: The actual bug that caused the crash is reported at abd_bio_map_off+0xf2/0x2c0 [zfs] indicating an issue within the ZFS filesystem module.

vmcore-dmesg.txt

@bwatkinson
Contributor Author

> My system is crashing when I use zfs with this direct io patch applied.
> GPT4 analysis: [...]
> vmcore-dmesg.txt

I apologize for taking so long to reply to this. By any chance, do you happen to know the IO workload that was taking place when this happened? In particular, what was the IO size being issued?

Also, what was the recordsize set to, as well as the ashift? I would like to replicate the IO workload to see if I can trigger this myself.

Thank you for providing the trace dump. I am hoping that with more details on the IO workload and ZFS configuration I will be able to replicate this bug and get it fixed.

@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from c28b199 to caee5c9 Compare January 3, 2024 19:09
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from bb63fda to d90e7b5 Compare February 1, 2024 21:46
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC.
While data is written directly to VDEV disks, metadata will not be
synced until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request
sizes, at a minimum, must be PAGE_SIZE aligned. If they are not,
then EINVAL is returned unless the direct property is set to always
(see below).

For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write
request will take the normal (buffered) write path. If the request
is block aligned and a cached copy of the buffer exists in the ARC,
the buffer will be discarded from the ARC, forcing all further reads
to retrieve the data from disk.

For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. If the
requested data is already cached in the ARC, it will simply be
copied from the ARC into the user buffer.

For both O_DIRECT writes and reads, the O_DIRECT flag will be ignored
if the file contents are mmap'ed. In this case, all requests that are
at least PAGE_SIZE aligned will just fall back to the buffered paths.
If the request is not PAGE_SIZE aligned, EINVAL will be returned as
always, regardless of whether the file's contents are mmap'ed.

Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered
writes:
Checksum
Compression
Dedup
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection,
          so any data in the user buffers written directly down to the
          VDEV disks is guaranteed not to change. There is no concern
          with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
        protection. Because of this, if the user decides to manipulate
        the page contents while the write operation is occurring, data
        integrity cannot be guaranteed. However, there is a module
        parameter `zfs_vdev_direct_write_verify_pct` that controls the
        percentage of O_DIRECT writes to a top-level VDEV that have a
        checksum verify run before the contents of the user buffers
        are committed to disk. In the event of a checksum verification
        failure, the write will be redirected through the ARC. The
        default value for `zfs_vdev_direct_write_verify_pct` is 2
        percent of Direct I/O writes to a top-level VDEV. The number
        of O_DIRECT write checksum verification errors can be observed
        by running `zpool status -d`, which will list all verification
        errors that have occurred on a top-level VDEV. Along with
        `zpool status`, a ZED event will be issued as `dio_verify`
        when a checksum verification error occurs.

A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts the O_DIRECT flag, but silently ignores it and
           treats the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
           write/read IO requests when the O_DIRECT flag is used.
always   - Treats every write/read IO request as though it passed
           O_DIRECT, and will do O_DIRECT if the alignment
           restrictions are met; otherwise the request is redirected
           through the ARC. This property will not allow a request
           to fail.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Labels
Status: Code Review Needed Ready for review and testing Type: Feature Feature request or new feature