New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add statechange event when drive is replaced #9513
Conversation
* zfs redact error messages do not end with newline character * 30af21b inadvertently removed some ZFS_PROP comments * man/zfs: zfs redact <redaction_snapshot> is not optional Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes openzfs#8988
We return ENOSPC in metaslab_activate if the metaslab has weight 0, to avoid activating a metaslab with no space available. For sanity checking, we also assert that there is no free space in the range tree in that case. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes openzfs#8968
This patch fixes an issue where dsl_dataset_crypt_stats() would VERIFY that it was able to hold the encryption root. This function should instead silently continue without populating the related field in the nvlist, as is the convention for this code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes openzfs#8976
This reverts commit aa7aab6. The change is not compatible with CentOS 6's 2.6.32 based kernel due to differnces in the bio layer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#8961
ZFS tracing efforts are hampered by the inability to access zfs static probes(probes using DTRACE_PROBE macros). The probes are available via tracepoints for GPL modules only. The build could be modified to generate a function for each unique DTRACE_PROBE invocation. These could be then accessed via kprobes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Brad Lewis <brad.lewis@delphix.com> Closes openzfs#8659 Closes openzfs#8663
Currently, sequential async write workloads spend a lot of time contending on the dn_struct_rwlock. This lock is responsible for protecting the entire block tree below it; this naturally results in some serialization during heavy write workloads. This can be resolved by having per-dbuf locking, which will allow multiple writers in the same object at the same time. We introduce a new rwlock, the db_rwlock. This lock is responsible for protecting the contents of the dbuf that it is a part of; when reading a block pointer from a dbuf, you hold the lock as a reader. When writing data to a dbuf, you hold it as a writer. This allows multiple threads to write to different parts of a file at the same time. Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens matt@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-52564 External-issue: DLPX-53085 External-issue: DLPX-57384 Closes openzfs#8946
Due to some changes introduced in 30af21b 'zfs send' can crash when provided with invalid inputs: this change attempts to add more checks to the affected code paths. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes openzfs#9001
This commit ensures make(1) targets that build .deb packages fail if alien(1) can't convert all .rpm files; additionally it also updates the zfs-dracut package name which was changed to "noarch" in ca4e5a7. Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes openzfs#8990 Closes openzfs#8991
Strategy of parallel mount is as follows. 1) Initial thread dispatching is to select sets of mount points that don't have dependencies on other sets, hence threads can/should run lock-less and shouldn't race with other threads for other sets. Each thread dispatched corresponds to top level directory which may or may not have datasets to be mounted on sub directories. 2) Subsequent recursive thread dispatching for each thread from 1) is to mount datasets for each set of mount points. The mount points within each set have dependencies (i.e. child directories), so child directories are processed only after parent directory completes. The problem is that the initial thread dispatching in zfs_foreach_mountpoint() can be multi-threaded when it needs to be single-threaded, and this puts threads under race condition. This race appeared as mount/unmount issues on ZoL for ZoL having different timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8). `zfs unmount -a` which expects proper mount order can't unmount if the mounts were reordered by the race condition. There are currently two known patterns of input list `handles` in `zfs_foreach_mountpoint(..,handles,..)` which cause the race condition. 1) openzfs#8833 case where input is `/a /a /a/b` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list with two same top level directories. There is a race between two POSIX threads A and B, * ThreadA for "/a" for test1 and "/a/b" * ThreadB for "/a" for test0/a and in case of openzfs#8833, ThreadA won the race. Two threads were created because "/a" wasn't considered as `"/a" contains "/a"`. 2) openzfs#8450 case where input is `/ /var/data /var/data/test` after sorting. The problem is that libzfs_path_contains() can't correctly handle an input list containing "/". There is a race between two POSIX threads A and B, * ThreadA for "/" and "/var/data/test" * ThreadB for "/var/data" and in case of openzfs#8450, ThreadA won the race. Two threads were created because "/var/data" wasn't considered as `"/" contains "/var/data"`. In other words, if there is (at least one) "/" in the input list, the initial thread dispatching must be single-threaded since every directory is a child of "/", meaning they all directly or indirectly depend on "/". In both cases, the first non_descendant_idx() call fails to correctly determine "path1-contains-path2", and as a result the initial thread dispatching creates another thread when it needs to be single-threaded. Fix a conditional in libzfs_path_contains() to consider above two. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes openzfs#8450 Closes openzfs#8833 Closes openzfs#8878
Use python -Esc to set __python_sitelib. Reviewed-by: Neal Gompa <ngompa@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Shaun Tancheff <stancheff@cray.com> Closes openzfs#8969
log_neg_expect was using the wrong exit status to detect if a process got killed by SIGSEGV or SIGBUS, resulting in false positives. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes openzfs#9003
Large allocation over the spl_kmem_alloc_warn value was being performed. Switched to vmem_alloc interface as specified for large allocations. Changed the subsequent frees to match. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: nmattis <nickm970@gmail.com> Closes openzfs#8934 Closes openzfs#9011
Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS,
and 5.0 and newer kernels. This is accomplished by leveraging
the fact that by definition dedicated kernel threads never need
to concern themselves with saving and restoring the user FPU state.
Therefore, they may use the FPU as long as we can guarantee user
tasks always restore their FPU state before context switching back
to user space.
For the 5.0 and 5.1 kernels disabling preemption and local
interrupts is sufficient to allow the FPU to be used. All non-kernel
threads will restore the preserved user FPU state.
For 5.2 and latter kernels the user FPU state restoration will be
skipped if the kernel determines the registers have not changed.
Therefore, for these kernels we need to perform the additional
step of saving and restoring the FPU registers. Invalidating the
per-cpu global tracking the FPU state would force a restore but
that functionality is private to the core x86 FPU implementation
and unavailable.
In practice, restricting SIMD to kernel threads is not a major
restriction for ZFS. The vast majority of SIMD operations are
already performed by the IO pipeline. The remaining cases are
relatively infrequent and can be handled by the generic code
without significant impact. The two most noteworthy cases are:
1) Decrypting the wrapping key for an encrypted dataset,
i.e. `zfs load-key`. All other encryption and decryption
operations will use the SIMD optimized implementations.
2) Generating the payload checksums for a `zfs send` stream.
In order to avoid making any changes to the higher layers of ZFS
all of the `*_get_ops()` functions were updated to take in to
consideration the calling context. This allows for the fastest
implementation to be used as appropriate (see kfpu_allowed()).
The only other notable instance of SIMD operations being used
outside a kernel thread was at module load time. This code
was moved in to a taskq in order to accommodate the new kernel
thread restriction.
Finally, a few other modifications were made in order to further
harden this code and facilitate testing. They include updating
each implementations operations structure to be declared as a
constant. And allowing "cycle" to be set when selecting the
preferred ops in the kernel as well as user space.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#8754
Closes openzfs#8793
Closes openzfs#8965
struct pathname is originally from Solaris VFS, and it has been used in ZoL to merely call VOP from Linux VFS interface without API change, therefore pathname::pn_path* are unused and unneeded. Technically, struct pathname is a wrapper for C string in ZoL. Saves stack a bit on lookup and unlink. (#if0'd members instead of removing since comments refer to them.) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes openzfs#9025
This patch corrects a small issue where the dsl_destroy_head() code that runs when the async_destroy feature is disabled would not properly decrypt the dataset before beginning processing. If the dataset is not able to be decrypted, the optimization code now simply does not run and the dataset is completely destroyed in the DSL sync task. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes openzfs#9021
External consumers such as Lustre require access to the dnode interfaces in order to correctly manipulate dnodes. Reviewed-by: James Simmons <uja.ornl@yahoo.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#8994 Closes openzfs#9027
ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(), so no need to test again on POSIX ACL access. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes openzfs#9009
Modify zfs-mount-generator to produce a dependency on new zfs-import-key-*.service units, dynamically created at boot to call zfs load-key for the encryption root, before attempting to mount any encrypted datasets. These units are created by zfs-mount-generator, and RequiresMountsFor on the keyfile, if present, or call systemd-ask-password if a passphrase is requested. This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and @rlaager, as well an adaptation of @rlaager's script to retry on incorrect password entry. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes openzfs#8750 Closes openzfs#8848
Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com> Closes openzfs#8750 Closes openzfs#8848
= Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent db58794 Update vdev_is_spacemap_addressable() for new spacemap encoding 419ba59 Simplify spa_sync by breaking it up to smaller functions 8dc2197 Factor metaslab_load_wait() in metaslab_load() b194fab Rename range_tree_verify to range_tree_verify_not_present df72b8b Change target size of metaslabs from 256GB to 16GB c853f38 zdb -L should skip leak detection altogether 21e7cf5 vs_alloc can underflow in L2ARC vdevs 7558997 Simplify log vdev removal code 6c926f4 Get rid of space_map_update() for ms_synced_length 425d323 Introduce auxiliary metaslab histograms 928e8ad Error path in metaslab_load_impl() forgets to drop ms_sync_lock 8eef997 = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes openzfs#8442
Adds the ability to sanity check zfs create arguments and to see the value of any additional properties that will local to the dataset. For example, automation that may need to adjust quota on a parent filesystem before creating a volume may call `zfs create -nP -V <size> <volume>` to obtain the value of refreservation. This adds the following options to zfs create: - -n dry-run (no-op) - -v verbose - -P parseable (implies verbose) Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Jerry Jelinek <jerry.jelinek@joyent.com> Signed-off-by: Mike Gerdts <mike.gerdts@joyent.com> Closes openzfs#8974
The cast of the size_t returned by strlcpy() to a uint64_t by the VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE is set. This is due to the additional hardening. Since the token is expected to always fit in strval the VERIFY3U has been removed. If somehow it doesn't, it will still be safely truncated. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#8999 Closes openzfs#9020
Resolve an assortment of style inconsistencies including use of white space, typos, capitalization, and line wrapping. There is no functional change. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#9030
zfs_refcount_*() are to be wrapped by zfsctl_snapshot_*() in this file. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes openzfs#9039
Make use of __GFP_HIGHMEM flag in vmem_alloc, which is required for some 32-bit systems to make use of full available memory. While kernel versions >=4.12-rc1 add this flag implicitly, older kernels do not. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes openzfs#9031
When CONFIG_X86_DEBUG_FPU is defined the alternatives_patched symbol is pulled in as a dependency which results in a build failure. To prevent this undefine CONFIG_X86_DEBUG_FPU to disable the WARN_ON_FPU() macro and rely on WARN_ON_ONCE debugging checks which were previously added. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#9041 Closes openzfs#9049
lockdep reports a possible recursive lock in dbuf_destroy. It is true that dbuf_destroy is acquiring the dn_dbufs_mtx on one dnode while holding it on another dnode. However, it is impossible for these to be the same dnode because, among other things,dbuf_destroy checks MUTEX_HELD before acquiring the mutex. This fix defines a class NESTED_SINGLE == 1 and changes that lock to call mutex_enter_nested with a subclass of NESTED_SINGLE. In order to make the userspace code compile, include/sys/zfs_context.h now defines mutex_enter_nested and NESTED_SINGLE. This is the lockdep report: [ 122.950921] ============================================ [ 122.950921] WARNING: possible recursive locking detected [ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 openzfs#1 Tainted: G O [ 122.950921] -------------------------------------------- [ 122.950921] dbu_evict/1457 is trying to acquire lock: [ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] but task is already holding lock: [ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] other info that might help us debug this: [ 122.950921] Possible unsafe locking scenario: [ 122.950921] CPU0 [ 122.950921] ---- [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] *** DEADLOCK *** [ 122.950921] May be due to missing lock nesting notation [ 122.950921] 1 lock held by dbu_evict/1457: [ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] stack backtrace: [ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 openzfs#1 [ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009 [ 122.950921] Call Trace: [ 122.950921] dump_stack+0x91/0xeb [ 122.950921] __lock_acquire+0x2ca7/0x4f10 [ 122.950921] lock_acquire+0x153/0x330 [ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs] [ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs] [ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs] [ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs] [ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs] [ 122.950921] taskq_thread+0x979/0x1480 [spl] [ 122.950921] kthread+0x2e7/0x3e0 [ 122.950921] ret_from_fork+0x27/0x50 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jeff Dike <jdike@akamai.com> Closes openzfs#8984
Commit torvalds/linux@94a9717b updated the rwsem's owner field to contain additional flags describing the rwsem's state. Rather then update the wrappers to mask out these bits, the code no longer relies on the owner stored by the kernel. This does increase the size of a krwlock_t but it makes the implementation less sensitive to future kernel changes. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#9029
The Linux kernel's rwsem's have never provided an interface to allow a reader to be upgraded to a writer. Historically, this functionality has been implemented by a SPL wrapper function. However, this approach depends on internal knowledge of the rw_semaphore and is therefore rather brittle. Since the ZFS code must always be able to fallback to rw_exit() and rw_enter() when an rw_tryupgrade() fails; this functionality isn't critical. Furthermore, the only potentially performance sensitive consumer is dmu_zfetch() and no decrease in performance was observed with this change applied. See the PR comments for additional testing details. Therefore, it is being retired to make the build more robust and to simplify the rwlock implementation. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#9029
These functions are unused and can be removed along with the spl-mutex.c and spl-rwlock.c source files. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#9029
Factor Linux specific memory pressure handling out of ARC. Each platform will have different available interfaces for managing memory pressure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9472
Currently, for certain sizes and classes of allocations we use SPL caches that are backed by caches in the Linux Slab allocator to reduce fragmentation and increase utilization of memory. The way things are implemented for these caches as of now though is that we don't keep any statistics of the allocations that we make from these caches. This patch enables the tracking of allocated objects in those SPL caches by making the trade-off of grabbing the cache lock at every object allocation and free to update the respective counter. Additionally, this patch makes those caches visible in the /proc/spl/kmem/slab special file. As a side note, enabling the specific counter for those caches enables SDB to create a more user-friendly interface than /proc/spl/kmem/slab that can also cross-reference data from slabinfo. Here is for example the output of one of those caches in SDB that outputs the name of the underlying Linux cache, the memory of SPL objects allocated in that cache, and the percentage of those objects compared to all the objects in it: ``` > spl_kmem_caches | filter obj.skc_name == "zio_buf_512" | pp name ... source total_memory util ----------- ... ----------------- ------------ ---- zio_buf_512 ... kmalloc-512[SLUB] 16.9MB 8 ``` Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes openzfs#9474
Giving a name to this enum makes it discoverable from
debugging tools like DRGN and SDB. For example, with
the name proposed on this patch we can iterate over
these values in DRGN:
```
>>> prog.type('enum kmc_bit').enumerators
(('KMC_BIT_NOTOUCH', 0), ('KMC_BIT_NODEBUG', 1),
('KMC_BIT_NOMAGAZINE', 2), ('KMC_BIT_NOHASH', 3),
('KMC_BIT_QCACHE', 4), ('KMC_BIT_KMEM', 5),
('KMC_BIT_VMEM', 6), ('KMC_BIT_SLAB', 7),
...
```
This enables SDB to easily pretty-print the flags of
the spl_kmem_caches in the system like this:
```
> spl_kmem_caches -o "name,flags,total_memory"
name flags total_memory
------------------------ ----------------------- ------------
abd_t KMC_NOMAGAZINE|KMC_SLAB 4.5MB
arc_buf_hdr_t_full KMC_NOMAGAZINE|KMC_SLAB 12.3MB
... <cropped> ...
ddt_cache KMC_VMEM 583.7KB
ddt_entry_cache KMC_NOMAGAZINE|KMC_SLAB 0.0B
... <cropped> ...
zio_buf_1048576 KMC_NODEBUG|KMC_VMEM 0.0B
... <cropped> ...
```
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes openzfs#9478
With 4k disks, this test will fail in the last section because the expected human readable value of 20.0M is reported as 20.1M. Rather than use the human readable property, switch to the parsable property and verify that the values are reasonably close. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: John Kennedy <john.kennedy@delphix.com> Closes openzfs#9477
We don't need to include stdio_ext.h Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9483
Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9486
Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9493
Fixes an obvious issue of calling arc_buf_destroy() on an unallocated arc_buf. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes openzfs#9453
Consistently use the `zfs_ioctl()` wrapper since `ioctl()` cannot be called directly due to differing semantics between platforms. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9492
Contrary to initial testing we cannot rely on these kernels to invalidate the per-cpu FPU state and restore the FPU registers. Nor can we guarantee that the kernel won't modify the FPU state which we saved in the task struck. Therefore, the kfpu_begin() and kfpu_end() functions have been updated to save and restore the FPU state using our own dedicated per-cpu FPU state variables. This has the additional advantage of allowing us to use the FPU again in user threads. So we remove the code which was added to use task queues to ensure some functions ran in kernel threads. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs#9346 Closes openzfs#9403
* Use .ksh extension for ksh scripts, not .sh * Remove .ksh extension from tests in common.run Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes openzfs#9502
O_TMPFILE is not available on FreeBSD. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes openzfs#9503
Currently, incremental recursive encrypted receives fail to work for any snapshot after the first. The reason for this is because the check in zfs_setup_cmdline_props() did not properly realize that when the user attempts to use '-x encryption' in this situation, they are not really overriding the existing encryption property and instead are attempting to prevent it from changing. This resulted in an error message stating: "encryption property 'encryption' cannot be set or excluded for raw or incremental streams". This problem is fixed by updating the logic to expect this use case. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes openzfs#9494
It's mostly a noop on ZoL and it conflicts with platforms that support dtrace. Remove this header to resolve the conflict. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9497
EBADE, EBADR, and ENOANO do not exist on FreeBSD The libspl errno.h is similarly platform dependent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9498
This logic is not platform dependent and should reside in the common code. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9505
This assert makes non portable assumptions about the state of memory returned by the memory allocator. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9506
Consistently use the `zfs_ioctl()` wrapper since `ioctl()` cannot be called directly due to differing semantics between platforms. Follow up PR to openzfs#9492. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9507
The malloc.h include is gratuitous and runs in to the following error on FreeBSD: Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9509
|
@cvoltz thanks! Would you just mind rebasing this PR against the master branch and then force updating the PR. This code appears unchanged between 0.7 and master and it'll resolve the CI failure. This fix itself, looks reasonable to me but I'll take a closer look next week. |
Tests that rely on special filesystems that are specific to Linux should only be run on Linux. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Signed-off-by: Ryan Moeller <ryan@ixsystems.com> Closes openzfs#9512
Fix header guard typo accidentally introduced by openzfs#9497. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9514
The sys/signal.h header doesn't exist on FreeBSD, nor is it needed on Linux. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matt Macy <mmacy@FreeBSD.org> Closes openzfs#9510
This change leverage module_param_call() to run arc_tuning_update() immediately after the ARC tunable has been updated as suggested in cffa837 code review. A simple test case is added to the ZFS Test Suite to prevent future regressions in functionality. Reviewed-by: Matt Macy <mmacy@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes openzfs#9487 Closes openzfs#9489
When the "zpool online" command is used bring a faulted or
offline drive back online, a resource.fs.zfs.statechange event
is generated. When the "zpool replace" command is used to bring
a faulted or offline drive back online, a statechange event is
not generated. Add the missing statechange event after
resilvering has finished. The new sequence of events looks like
this:
sysevent.fs.zfs.vdev_attach
sysevent.fs.zfs.resilver_start
sysevent.fs.zfs.history_event (scan setup)
sysevent.fs.zfs.history_event (scan done)
sysevent.fs.zfs.resilver_finish
sysevent.fs.zfs.config_sync
+ resource.fs.zfs.statechange
sysevent.fs.zfs.vdev_remove
sysevent.fs.zfs.history_event (vdev attach)
sysevent.fs.zfs.config_sync
sysevent.fs.zfs.history_event (detach)
Signed-off-by: Christopher Voltz <christopher.voltz@hpe.com>
External-issue: LU-12836
Closes openzfs#9437
|
I'm not sure what went wrong with the forced update but it seems to have included a lot of changes which aren't in my patch set. On my machine, it shows: So it looks like I rebased it properly. I'll just close this pull request and open another with the correct branch. |
Motivation and Context
Fix #9437
Fix LU-12836
Description
When the
zpool onlinecommand is used to bring a faulted or offline drive back online, aresource.fs.zfs.statechangeevent is generated. When thezpool replacecommand is used to bring a faulted or offline drive back online, astatechangeevent is not generated. Add the missingstatechangeevent after resilvering has finished. The new sequence of events looks likethis:
The
statechangeevent looks like this:How Has This Been Tested?
Updated test
events_001_pos.kshto verify thestatechangeevent is generated. Ran the full set of functional tests using the updated ZFS Test suite on CentOS 7.6 using the CentOS 7 v1905.1 Vagrant image. See cvoltz/zfs-test-suite for scripts which use Ansible and Vagrant to provision a Vagrant VM, using libvirt, to build and run the ZFS test suite on a specified branch of a specified repository.Also tested the patch with Lustre 2.12.2-1 and ZFS 0.7.13-1 using the test-degraded-drive test script. See the LU-12836 ticket for details.
Types of changes
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)
My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain
Signed-off-by.