
OpenZFS - 6363 Add UNMAP/TRIM functionality (v2) #7363

Closed
wants to merge 23 commits into from

Conversation

tuxoko
Contributor

@tuxoko tuxoko commented Mar 28, 2018

From #5925

Rebased from @dweeezil 's ntrim2-next
Added trim related manpage changes.

Description

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@tuxoko
Contributor Author

tuxoko commented Mar 29, 2018

This happens a lot in ztest.
WARNING: Pool 'ztest' has encountered an uncorrectable I/O failure and has been suspended.

Also manualtrim_002_pos failed with
TRIM interrupted it was expected to complete.

@behlendorf
Contributor

@kpande, @tuxoko was nice enough to offer to help address some of the remaining issues with TRIM. Thanks! Step one is getting all the old and new tests passing.

This happens a lot in ztest.
WARNING: Pool 'ztest' has encountered an uncorrectable I/O failure and has been suspended.

This looks like one of those remaining issues which needs to be understood and fixed. In the master branch we aren't seeing anything like this so it's likely TRIM related.

Also manualtrim_002_pos failed with
TRIM interrupted it was expected to complete.

It also looks like several of the ZTS runs resulted in kernel panics. So there's definitely some work to do root causing these issues.


	/* trim I/Os have no single meaningful offset */
	if (zio->io_priority != ZIO_PRIORITY_AUTO_TRIM ||
	    zio->io_priority != ZIO_PRIORITY_MAN_TRIM)
Contributor


As per previous review, the comparison should be &&.

Contributor Author


Right, || doesn't make sense.
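To make the reviewer's point concrete, here is a minimal sketch (the enum values are stand-ins, not the actual ZFS `zio_priority_t` definitions) showing why the `||` form is a tautology while the `&&` form behaves as intended:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in priority values; the real ones live in ZFS's zio_priority_t. */
enum {
	ZIO_PRIORITY_SYNC_READ,
	ZIO_PRIORITY_AUTO_TRIM,
	ZIO_PRIORITY_MAN_TRIM
};

/*
 * Buggy form from the hunk above: with "!=" comparisons, "||" is true
 * for every priority, because no value can equal both constants at once.
 */
static bool
not_trim_buggy(int p)
{
	return (p != ZIO_PRIORITY_AUTO_TRIM || p != ZIO_PRIORITY_MAN_TRIM);
}

/* Corrected form per the review: only non-TRIM priorities pass. */
static bool
not_trim_fixed(int p)
{
	return (p != ZIO_PRIORITY_AUTO_TRIM && p != ZIO_PRIORITY_MAN_TRIM);
}
```

Equivalently, by De Morgan's law, `!(p == AUTO_TRIM || p == MAN_TRIM)` is `p != AUTO_TRIM && p != MAN_TRIM`.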

dweeezil and others added 23 commits April 11, 2018 11:55
Update dkio.h from Nexenta's version to pick up DKIOCFREE and add their
dkioc_free_util.h header for TRIM support.
Ported by: Tim Chase <tim@chase2k.com>

Porting notes:
    The trim kstats are in zfs/<pool> along with the other per-pool stats.
    The kstats can be cleared by writing to the kstat file.

    Null format parameters to strftime() were replaced with "%c".

    Added vdev trace support.

    New dfl_alloc() function in the SPL is used to allocate arrays of
    dkioc_free_list_t objects since they may be large enough to require
    virtual memory.

Other changes:
    Suppressed kstat creation for pools with "$" names.

    The changes to vdev_raidz_map_alloc() have been minimized in order to allow
    more conflict-free merging with future changes (ABD).

Added the following module parameters:
    zfs_trim - Enable TRIM
    zfs_trim_min_ext_sz - Minimum size to trim
    zfs_txgs_per_trim - Transaction groups over which to batch trims
The extended zpool iostat options -wlqr will display information about
automatic and manual TRIMs.

This commit also fixes a completely unrelated bug in which the IOS_LATENCY
row in the vsx_type_to_nvlist array was missing an entry for the scrub
nvlist.
The blkdev_issue_discard() function has long been available in
the kernel, but it only supports synchronous discards. The
__blkdev_issue_discard() function provides an asynchronous
interface but was only added in the 4.6 kernel.

Only supporting synchronous discards can potentially limit
performance when processing a large number of small extents.
To avoid this, an asynchronous discard implementation has been
added to vdev_disk.c which builds on existing functionality.

The kernel provided synchronous version remains the default
pending additional functional and performance testing.

Due to the different mechanism used for submitting TRIM commands,
they were not being properly accounted for in the extended
statistics.  Resolve this by allowing aggregated stats to
be returned as part of the TRIM zio.  This allows for far
better visibility into the discard request sizes.

Minor documentation updates.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
Man page changes dropped for the moment.  This can be reconciled
when the final version is merged to OpenZFS.  They are accurate
now, only worded a little differently.
1) Removed the first-fit allocator.
2) Moved the autotrim metaslab scheduling logic into vdev_auto_trim.
2a) As a consequence of (2), metaslab_trimset_t was rendered superfluous. New
   trimsets are simple range_tree_t's.
3) Made ms_trimming_ts remove extents it is working on from ms_tree and then
   add them back in.
3a) As a consequence of (3), undid all the direct changes to the allocators and
   removed metaslab_check_trim_conflict and range_tree_find_gap.

Porting Notes:
* Removed WITH_*_ALLOCATOR macros and aligned remaining allocations
  with OpenZFS.  Unused variable warnings resolved with the gcc
  __attribute__ ((unused)) keyword.
* Added missing calls for ms_condensing_cv.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
* metaslab_sync changes already applied.
* resync of test cases needed
1) Simplified the SM_FREE spacemap writing while a trim is active.
2) Simplified the range_tree_verify in metaslab_check_free.
3) Clarified comment above metaslab_trim_all.
4) Substituted 'flust out' with 'drop' in comment in metaslab_trim_all.
5) Moved ms_prev_ts clearing up to ms_cur_ts clearing in metaslab_trim_all.
6) Added recomputation of metaslab weight when metaslab is loaded.
7) Moved dmu_tx_commit inside of spa_trim_update_time.
8) Made the smallest allowable manual trim rate 1/1000th of a metaslab size.
9) Switched to using hrtime_t in manual trim timing logic.
10) Changed "limited" to "preserve_spilled" in vdev_auto_trim.
11) Moved vdev_notrim setting into zio_vdev_io_assess.

Porting Notes:
  * vdev_disk.c and zio.c hunks already applied.
  * nsec_per_tick -> MSEC2NSEC(1)
Some storage backends such as large thinly-provisioned SANs are very slow
for large trims.  Manual trim now supports "zpool trim -p" (partial trim)
to skip metaslabs for which there is no spacemap.
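The partial-trim skip logic described above can be sketched as follows; `trim_should_visit` is a hypothetical helper, and the opaque `ms_sm` pointer stands in for the metaslab's spacemap (NULL meaning the metaslab was never allocated from):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the 'zpool trim -p' filter.  In the real code the test is
 * against the metaslab's on-disk spacemap; here ms_sm is just an opaque
 * pointer, with NULL meaning "this metaslab was never written to".
 */
static bool
trim_should_visit(const void *ms_sm, bool partial)
{
	/* Partial trim skips metaslabs that have no spacemap at all. */
	if (partial && ms_sm == NULL)
		return (false);
	return (true);
}
```

On a thinly-provisioned SAN this avoids issuing discards for regions the pool has never touched, which is where the bulk of the time goes for a full manual trim.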
The existing test cases were split into multiple test cases and
refactored.  There are now test cases for the following:

zpool_trim_001_pos - Verify manual TRIM
zpool_trim_002_pos - Verify manual trim can be interrupted
zpool_trim_003_pos - Verify 'zpool trim -s' rate limiting
zpool_trim_004_pos - Verify 'zpool trim -p' partial TRIM works
zpool_trim_005_neg - Verify bad parameters to 'zpool trim'
zpool_trim_006_neg - Verify bad parameters to 'zpool trim -r'
autotrim_001_pos   - Verify 'autotrim=on' pool data integrity
autotrim_002_pos   - Verify various pool geometries
manualtrim_001_pos - Verify manual trim pool data integrity
manualtrim_002_pos - Verify various pool geometries
manualtrim_003_pos - Verify 'zpool import|export'
manualtrim_004_pos - Verify 'zpool online|offline|replace'
manualtrim_005_pos - Verify TRIM and scrub run concurrently

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
* Rename TRIM taskq threads to be more concise for Linux.
* Fix divide by zero panic

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Rather than hacking `vdev_raidz_map_alloc()` to get the child
offsets, calculate the values directly.

Signed-off-by: Isaac Huang <he.huang@intel.com>
* Fixed missing taskq_destroy when exporting a pool which is
  being actively trimmed.
* Add auto/manual TRIM coverage to ztest.
* Temporarily disable manualtrim_004_pos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
@KernelKyupCom

Hi there,
I am also experiencing the "WARNING: Pool 'ztest3' has encountered an uncorrectable I/O failure and has been suspended." and the txg_sync_thread is blocked for more than 120 seconds (a hung task).

After some investigation it turns out there is ENOSPC returned by metaslab_alloc.
The actual call stack is:
#10 [ffffc90003c17bc0] metaslab_alloc at ffffffffa0508fdb [zfs]
#11 [ffffc90003c17cb0] zio_dva_allocate at ffffffffa058f470 [zfs]
#12 [ffffc90003c17da8] zio_execute at ffffffffa058b28a [zfs]
#13 [ffffc90003c17dd8] taskq_thread at ffffffffa04113d9 [spl]
#14 [ffffc90003c17f00] kthread at ffffffff8107ae2d
#15 [ffffc90003c17f50] ret_from_fork at ffffffff818001fa

I am creating the ZFS pool in a sparse file image.

I can easily reproduce it by copying two sets of small files - the linux kernel tree (including its git) and my ccache directory.
Basically I am importing the pool, copying the first set, creating snapshot1, deleting the files and then trimming. Then the same for the second set of files in another snapshot2. Then exporting the pool, reimporting it, destroying both snapshots, then trimming and exporting again.
I am attaching my test script (just edit the paths): zfs_test.sh.txt
As for the hung task, it seems we are hitting an A-B/B-A deadlock. See the attached log:
lockdep_4.16.7.log

I hope this helps.
Any ideas? Is there other useful data I could capture to help resolve this?

Best regards,
Angel

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label May 16, 2018
@ahrens ahrens added the Type: Feature Feature request or new feature label Jun 7, 2018
VERIFY(!msp->ms_condensing);
if (msp->ms_loaded) {
range_tree_walk(msp->ms_trimming_ts, range_tree_add,
msp->ms_tree);
Contributor


Sorry if this has already been addressed elsewhere. While the trimming is being done (initiated by metaslab_exec_trim), ms_trimming_ts is removed from msp->ms_tree. If an allocation request occurs on such an mg, isn't it possible that mg->mg_no_free_space could be set (because of the unavailable range, i.e. the range being trimmed)?

And if so, is there code that would trigger a re-evaluation of the validity of that mg_no_free_space flag (now that said extent is again available for allocation)? If not, that could explain @KernelKyupCom's ENOSPC issue.

@KernelKyupCom

Hi guys,
I've been running the code rebased to zfs-0.7.9-11 for a month. I haven't experienced the ENOSPC issue anymore, but we hit another problem: within a few days our machines run out of PIDs because of leaked threads.
I've spent some time investigating it and found a way to reproduce it relatively easily (not every time, but within several minutes).
It is just import && trim && export in an endless loop like this:

while [ true ]; do
    zpool import -d /root/ssd/zpool_test2/r ztest2 && zpool trim ztest2 && zpool export ztest2
done

The important part here is to export the pool immediately after the trim is issued. The issue occurs when the trim-complete event is raised and SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY is dispatched while the spa_async_thread is already suspended, or not yet scheduled before it gets suspended, by the export ioctl. This way the export path never destroys the man_trim taskq and it remains there forever.
Currently I am running totally out of tree, but I guess this should apply, or at least could be easily reused:

diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index 9702df0c26f3..f609d834acb1 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -4624,7 +4624,34 @@ spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
         */
        spa_open_ref(spa, FTAG);
        mutex_exit(&spa_namespace_lock);
+
+       /*
+        * Before we suspend the async thread, wait for any ongoing trim to
+        * finish. Otherwise we may never execute the man_trim_taskq_destroy
+        * scheduled by man_trim_done, because the async thread is suspended;
+        * this results in a taskq thread leak. Drop the mutex after the trim
+        * completes and reacquire it later, with the async thread suspended,
+        * when checking for a still-existing taskq. This way the done-callback
+        * async request gets a chance to clean it up regularly if the async
+        * thread is still running. If the request has not been executed before
+        * we suspend the async thread, explicitly clean up the man_trim taskq.
+        */
+       mutex_enter(&spa->spa_man_trim_lock);
+       while (spa->spa_num_man_trimming > 0)
+               cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
+       mutex_exit(&spa->spa_man_trim_lock);
+
        spa_async_suspend(spa);
+       /*
+        * After the suspend, check whether there is still a man trim taskq
+        * about to leak, and destroy it. This way its work will be
+        * committed by the txg sync work flushed below.
+        */
+       mutex_enter(&spa->spa_man_trim_lock);
+       if (spa->spa_man_trim_taskq)
+               spa_man_trim_taskq_destroy(spa);
+       mutex_exit(&spa->spa_man_trim_lock);
+
        if (spa->spa_zvol_taskq) {
                zvol_remove_minors(spa, spa_name(spa), B_TRUE);
                taskq_wait(spa->spa_zvol_taskq);

Also, I suppose the same should be done for auto trim?
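For readers unfamiliar with the pattern, the wait loop in the patch above is a standard condition-variable drain. A minimal userspace sketch with pthreads (the names echo `spa_num_man_trimming` and `spa_man_trim_done_cv` but are hypothetical):

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-ins for the SPA fields used by the patch. */
static pthread_mutex_t trim_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t trim_done_cv = PTHREAD_COND_INITIALIZER;
static int num_trimming;	/* count of manual trims in flight */

static void
trim_begin(void)
{
	pthread_mutex_lock(&trim_lock);
	num_trimming++;
	pthread_mutex_unlock(&trim_lock);
}

static void
trim_done(void)
{
	pthread_mutex_lock(&trim_lock);
	num_trimming--;
	pthread_cond_broadcast(&trim_done_cv);	/* wake any waiting exporter */
	pthread_mutex_unlock(&trim_lock);
}

/* Export path: block until no manual trim is in flight. */
static void
export_wait_for_trims(void)
{
	pthread_mutex_lock(&trim_lock);
	while (num_trimming > 0)
		pthread_cond_wait(&trim_done_cv, &trim_lock);
	pthread_mutex_unlock(&trim_lock);
}

static int
trims_in_flight(void)
{
	int n;

	pthread_mutex_lock(&trim_lock);
	n = num_trimming;
	pthread_mutex_unlock(&trim_lock);
	return (n);
}
```

In the kernel patch the equivalent drain runs in spa_export_common() before spa_async_suspend(), so the taskq-destroy request can no longer be stranded on a suspended async thread.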

I have also changed the trim taskq thread pool name formatting in order to get a clearer picture of what is going on. That way it is much easier to identify which thread belongs to which pool, since task->comm is truncated to 16 characters by default and the "_man_trim"/"_auto_trim" suffix gets truncated entirely (at least in my case). That's just a suggestion about the naming convention.

diff --git a/module/zfs/spa_misc.c b/module/zfs/spa_misc.c
index 8a91b40510da..ab0443f5537e 100644
--- a/module/zfs/spa_misc.c
+++ b/module/zfs/spa_misc.c
@@ -2257,7 +2257,7 @@ spa_auto_trim_taskq_create(spa_t *spa)
 
        ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
        ASSERT(spa->spa_auto_trim_taskq == NULL);
-       (void) snprintf(name, MAXPATHLEN, "%s_auto_trim", spa->spa_name);
+       (void) snprintf(name, MAXPATHLEN, "atrim_%s", spa->spa_name);
        spa->spa_auto_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
            spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
        VERIFY(spa->spa_auto_trim_taskq != NULL);
@@ -2283,7 +2283,7 @@ spa_man_trim_taskq_create(spa_t *spa)
                 */
                return;
        }
-       (void) snprintf(name, MAXPATHLEN, "%s_man_trim", spa->spa_name);
+       (void) snprintf(name, MAXPATHLEN, "mtrim_%s", spa->spa_name);
        spa->spa_man_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
            spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
        VERIFY(spa->spa_man_trim_taskq != NULL);
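The truncation being described is easy to demonstrate in userspace. Assuming the kernel's TASK_COMM_LEN of 16 (15 visible characters plus the terminating NUL) and an illustrative pool name, a suffix-style taskq name loses its distinguishing part while a prefix-style name keeps it:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16	/* kernel task->comm size, including NUL */

/*
 * Mimic how a taskq thread name lands in task->comm: anything past
 * the first 15 characters is cut off.  Pool names are illustrative.
 */
static const char *
as_comm(const char *taskq_name)
{
	static char comm[TASK_COMM_LEN];

	(void) snprintf(comm, sizeof (comm), "%s", taskq_name);
	return (comm);
}
```

With the prefix scheme, `ps` output still shows which kind of trim taskq a thread belongs to even for long pool names.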

I am open to any criticism of this fix.

Best regards,
Angel Shtilianov

@aerusso
Contributor

aerusso commented Jul 15, 2018

@tuxoko Would it be possible to update this PR to @dweeezil's post-removal rebase? It might be helpful to see the test suite results.

@tuxoko
Contributor Author

tuxoko commented Jul 16, 2018

@aerusso
I'll look into it. Unfortunately, my priorities have changed recently; I'll try to find time to do it.

@dweeezil
Contributor

@aerusso FYI, my post-removal-rebase branch, which I'd not yet widely publicized, is in fact based on @tuxoko's "v2" work. I've not yet had the time to test it, short of making sure it compiles. I hope to get it tested in the next couple of days.

@mailinglists35

is there any chance this will be ready for 0.8.0 release? or 0.8.x?

@behlendorf behlendorf added Status: Inactive Not being actively updated and removed Status: Work in Progress Not yet ready for general review labels Sep 25, 2018
@ahrens ahrens added the Status: Revision Needed Changes are required for the PR to be accepted label Sep 27, 2018
@RJVB

RJVB commented Oct 24, 2018

Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

@mailinglists35

Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

switch to freebsd https://blog.plein.org/2014/04/15/freebsd-9-2-supports-zfs-trimunmap/

@RJVB

RJVB commented Oct 25, 2018 via email

@drescherjm

How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

It would be only the non-ZFS part.

@RJVB

RJVB commented Oct 25, 2018 via email

@mailinglists35

It would be only the non-ZFS part.
Well, that's not a big deal if I can do the ZFS part by importing the pool under a FreeBSD clone. Would importing the pool be enough, or would I have to do a scrub or something similar?

I suggest we stop discussing this here and now before a mod will lock this thread. Use the mailing lists for this.

@dweeezil
Contributor

dweeezil commented Jan 8, 2019

Replaced with #8255.

@dweeezil dweeezil closed this Jan 8, 2019