
OpenZFS - 6363 Add UNMAP/TRIM functionality (v2) #7363

Closed
wants to merge 23 commits into from

Conversation

tuxoko
Contributor

@tuxoko tuxoko commented Mar 28, 2018

From #5925

Rebased from @dweeezil 's ntrim2-next
Added trim related manpage changes.

Description

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@tuxoko
Contributor Author

tuxoko commented Mar 29, 2018

This happens a lot in ztest.
WARNING: Pool 'ztest' has encountered an uncorrectable I/O failure and has been suspended.

Also manualtrim_002_pos failed with
TRIM interrupted it was expected to complete.

@behlendorf
Contributor

@kpande, @tuxoko was nice enough to offer to help address some of the remaining issues with TRIM. Thanks! Step one is getting all the old and new tests passing.

This happens a lot in ztest.
WARNING: Pool 'ztest' has encountered an uncorrectable I/O failure and has been suspended.

This looks like one of those remaining issues which needs to be understood and fixed. In the master branch we aren't seeing anything like this so it's likely TRIM related.

Also manualtrim_002_pos failed with
TRIM interrupted it was expected to complete.

It also looks like several of the ZTS runs resulted in kernel panics. So there's definitely some work to do root causing these issues.


	/* trim I/Os have no single meaningful offset */
	if (zio->io_priority != ZIO_PRIORITY_AUTO_TRIM ||
	    zio->io_priority != ZIO_PRIORITY_MAN_TRIM)
Contributor


As per previous review, the comparison should be &&.

Contributor Author


Right, || doesn't make sense.
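To make the reviewer's point concrete, here is a minimal sketch (the enum values are stand-ins, not the actual ZFS `zio_priority_t` definitions) showing why the `||` form is a tautology while the `&&` form behaves as intended:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in priority values; the real ones live in ZFS's zio_priority_t. */
enum {
	ZIO_PRIORITY_SYNC_READ,
	ZIO_PRIORITY_AUTO_TRIM,
	ZIO_PRIORITY_MAN_TRIM
};

/*
 * Buggy form from the hunk above: with "!=" comparisons, "||" is true
 * for every priority, because no value can equal both constants at once.
 */
static bool
not_trim_buggy(int p)
{
	return (p != ZIO_PRIORITY_AUTO_TRIM || p != ZIO_PRIORITY_MAN_TRIM);
}

/* Corrected form per the review: only non-TRIM priorities pass. */
static bool
not_trim_fixed(int p)
{
	return (p != ZIO_PRIORITY_AUTO_TRIM && p != ZIO_PRIORITY_MAN_TRIM);
}
```

Equivalently, by De Morgan's law, `!(p == AUTO_TRIM || p == MAN_TRIM)` is `p != AUTO_TRIM && p != MAN_TRIM`.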

dweeezil and others added 23 commits April 11, 2018 11:55
Update dkio.h from Nexenta's version to pick up DKIOCFREE and add their
dkioc_free_util.h header for TRIM support.
Ported by: Tim Chase <tim@chase2k.com>

Porting notes:
    The trim kstats are in zfs/<pool> along with the other per-pool stats.
    The kstats can be cleared by writing to the kstat file.

    Null format parameters to strftime() were replaced with "%c".

    Added vdev trace support.

    New dfl_alloc() function in the SPL is used to allocate arrays of
    dkioc_free_list_t objects since they may be large enough to require
    virtual memory.

Other changes:
    Suppressed kstat creation for pools with "$" names.

    The changes to vdev_raidz_map_alloc() have been minimized in order to allow
    more conflict-free merging with future changes (ABD).

Added the following module parameters:
    zfs_trim - Enable TRIM
    zfs_trim_min_ext_sz - Minimum size to trim
    zfs_txgs_per_trim - Transaction groups over which to batch trims
The extended zpool iostat options -wlqr will display information about
automatic and manual TRIMs.

This commit also fixes a completely unrelated bug in which the IOS_LATENCY
row in the vsx_type_to_nvlist array was missing an entry for the scrub
nvlist.
The blkdev_issue_discard() function has long been available in
the kernel, but it only supports synchronous discards. The
__blkdev_issue_discard() function provides an asynchronous
interface but was only added in the 4.6 kernel.

Only supporting synchronous discards can potentially limit
performance when processing a large number of small extents.
To avoid this, an asynchronous discard implementation has been
added to vdev_disk.c which builds on existing functionality.

The kernel provided synchronous version remains the default
pending additional functional and performance testing.

Due to the different mechanism used for submitting TRIM commands,
they were not being properly accounted for in the extended
statistics.  Resolve this by allowing aggregated stats to
be returned as part of the TRIM zio.  This allows for far
better visibility into the discard request sizes.

Minor documentation updates.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
Man page changes dropped for the moment.  This can be reconciled
when the final version is merged to OpenZFS.  They are accurate
now, only worded a little differently.
1) Removed the first-fit allocator.
2) Moved the autotrim metaslab scheduling logic into vdev_auto_trim.
2a) As a consequence of (2), metaslab_trimset_t was rendered superfluous. New
   trimsets are simple range_tree_t's.
3) Made ms_trimming_ts remove extents it is working on from ms_tree and then
   add them back in.
3a) As a consequence of (3), undid all the direct changes to the allocators and
   removed metaslab_check_trim_conflict and range_tree_find_gap.

Porting Notes:
* Removed WITH_*_ALLOCATOR macros and aligned remaining allocations
  with OpenZFS.  Unused variable warnings resolved with the gcc
  __attribute__ ((unused)) keyword.
* Added missing calls for ms_condensing_cv.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
* metaslab_sync changes already applied.
* resync of test cases needed
1) Simplified the SM_FREE spacemap writing while a trim is active.
2) Simplified the range_tree_verify in metaslab_check_free.
3) Clarified comment above metaslab_trim_all.
4) Substituted 'flust out' with 'drop' in comment in metaslab_trim_all.
5) Moved ms_prev_ts clearing up to ms_cur_ts clearing in metaslab_trim_all.
6) Added recomputation of metaslab weight when metaslab is loaded.
7) Moved dmu_tx_commit inside of spa_trim_update_time.
8) Made the smallest allowable manual trim rate 1/1000th of a metaslab size.
9) Switched to using hrtime_t in manual trim timing logic.
10) Changed "limited" to "preserve_spilled" in vdev_auto_trim.
11) Moved vdev_notrim setting into zio_vdev_io_assess.

Porting Notes:
  * vdev_disk.c and zio.c hunks already applied.
  * nsec_per_tick -> MSEC2NSEC(1)
Some storage backends such as large thinly-provisioned SANs are very slow
for large trims.  Manual trim now supports "zpool trim -p" (partial trim)
to skip metaslabs for which there is no spacemap.
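The partial-trim skip logic described above can be sketched as follows; `trim_should_visit` is a hypothetical helper, and the opaque `ms_sm` pointer stands in for the metaslab's spacemap (NULL meaning the metaslab was never allocated from):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the 'zpool trim -p' filter.  In the real code the test is
 * against the metaslab's on-disk spacemap; here ms_sm is just an opaque
 * pointer, with NULL meaning "this metaslab was never written to".
 */
static bool
trim_should_visit(const void *ms_sm, bool partial)
{
	/* Partial trim skips metaslabs that have no spacemap at all. */
	if (partial && ms_sm == NULL)
		return (false);
	return (true);
}
```

On a thinly-provisioned SAN this avoids issuing discards for regions the pool has never touched, which is where the bulk of the time goes for a full manual trim.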
The existing test cases were split into multiple test cases and
refactored.  There are now test cases for the following:

zpool_trim_001_pos - Verify manual TRIM
zpool_trim_002_pos - Verify manual trim can be interrupted
zpool_trim_003_pos - Verify 'zpool trim -s' rate limiting
zpool_trim_004_pos - Verify 'zpool trim -p' partial TRIM works
zpool_trim_005_neg - Verify bad parameters to 'zpool trim'
zpool_trim_006_neg - Verify bad parameters to 'zpool trim -r'
autotrim_001_pos   - Verify 'autotrim=on' pool data integrity
autotrim_002_pos   - Verify various pool geometries
manualtrim_001_pos - Verify manual trim pool data integrity
manualtrim_002_pos - Verify various pool geometries
manualtrim_003_pos - Verify 'zpool import|export'
manualtrim_004_pos - Verify 'zpool online|offline|replace'
manualtrim_005_pos - Verify TRIM and scrub run concurrently

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
* Rename TRIM taskq threads to be more concise for Linux.
* Fix divide by zero panic

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Rather than hacking `vdev_raidz_map_alloc()` to get the child
offsets, calculate the values directly.

Signed-off-by: Isaac Huang <he.huang@intel.com>
* Fixed missing taskq_destroy when exporting a pool which is
  being actively trimmed.
* Add auto/manual TRIM coverage to ztest.
* Temporarily disable manualtrim_004_pos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
@KernelKyupCom

Hi there,
I am also experiencing the "WARNING: Pool 'ztest3' has encountered an uncorrectable I/O failure and has been suspended." and the txg_sync_thread is blocked for more than 120 seconds (a hung task).

After some investigation it turns out there is ENOSPC returned by metaslab_alloc.
The actual call stack is:
#10 [ffffc90003c17bc0] metaslab_alloc at ffffffffa0508fdb [zfs]
#11 [ffffc90003c17cb0] zio_dva_allocate at ffffffffa058f470 [zfs]
#12 [ffffc90003c17da8] zio_execute at ffffffffa058b28a [zfs]
#13 [ffffc90003c17dd8] taskq_thread at ffffffffa04113d9 [spl]
#14 [ffffc90003c17f00] kthread at ffffffff8107ae2d
#15 [ffffc90003c17f50] ret_from_fork at ffffffff818001fa

I am creating the ZFS pool in a sparse file image.

I can easily reproduce it by copying two sets of small files - the linux kernel tree (including its git) and my ccache directory.
Basically I am importing the pool, copying the first set, creating snapshot1, deleting the files and then trimming. Then the same for the second set of files in another snapshot2. Then exporting the pool, reimporting it, destroying both snapshots, then trimming and exporting again.
I am attaching my test script (just edit the paths): zfs_test.sh.txt
As for the hung task, it seems we are hitting an A-B/B-A deadlock. See the attached log:
lockdep_4.16.7.log

I hope this helps.
Any ideas? Is there other useful data I could capture to help resolve this?

Best regards,
Angel

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label May 16, 2018
@ahrens ahrens added the Type: Feature Feature request or new feature label Jun 7, 2018
VERIFY(!msp->ms_condensing);
if (msp->ms_loaded) {
range_tree_walk(msp->ms_trimming_ts, range_tree_add,
msp->ms_tree);
Contributor


Sorry if this has already been addressed elsewhere. While the trimming is being done (initiated by metaslab_exec_trim), ms_trimming_ts is removed from msp->ms_tree. If an allocation request occurs on such an mg, isn't it possible that mg->mg_no_free_space could be set (because of the unavailable range, i.e. the range being trimmed)?

And if so, is there code that would trigger a re-evaluation of the validity of that mg_no_free_space flag (now that said extent is again available for allocation)? If not, that could explain @KernelKyupCom's ENOSPC issue.

@KernelKyupCom

Hi guys,
I've been running the code rebased to zfs-0.7.9-11 for a month. I haven't experienced the ENOSPC issue anymore, but we hit another problem: within a few days our machines run out of PIDs because of leaked threads.
I've spent some time investigating it and found a way to reproduce it relatively easily (not every time, but within several minutes).
It is just import && trim && export in an endless loop like this:

while [ true ]; do
    zpool import -d /root/ssd/zpool_test2/r ztest2 && zpool trim ztest2 && zpool export ztest2
done

The important part here is to export the pool immediately after the trim is issued. The issue occurs when the trim-complete event is raised and SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY is dispatched while the spa_async_thread is already suspended, or not yet scheduled before it gets suspended, by the export ioctl. This way the export path never destroys the man_trim taskq and it remains there forever.
Currently I am running totally out of tree, but I guess this should apply, or at least could be easily reused:

diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index 9702df0c26f3..f609d834acb1 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -4624,7 +4624,34 @@ spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
         */
        spa_open_ref(spa, FTAG);
        mutex_exit(&spa_namespace_lock);
+
+       /*
+        * Before we suspend the async thread, wait for any ongoing trim to
+        * finish. Otherwise we may never execute the man_trim_taskq_destroy
+        * scheduled by man_trim_done, because the async thread is suspended;
+        * this results in a taskq thread leak. Drop the mutex after the trim
+        * completes and reacquire it later, with the async thread suspended,
+        * when checking for a still-existing taskq. This way the done-callback
+        * async request gets a chance to clean it up regularly if the async
+        * thread is still running. If the request has not been executed before
+        * we suspend the async thread, explicitly clean up the man_trim taskq.
+        */
+       mutex_enter(&spa->spa_man_trim_lock);
+       while (spa->spa_num_man_trimming > 0)
+               cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
+       mutex_exit(&spa->spa_man_trim_lock);
+
        spa_async_suspend(spa);
+       /*
+        * After the suspend, check whether there is still a man trim taskq
+        * about to leak, and destroy it. This way its work will be
+        * committed by the txg sync work flushed below.
+        */
+       mutex_enter(&spa->spa_man_trim_lock);
+       if (spa->spa_man_trim_taskq)
+               spa_man_trim_taskq_destroy(spa);
+       mutex_exit(&spa->spa_man_trim_lock);
+
        if (spa->spa_zvol_taskq) {
                zvol_remove_minors(spa, spa_name(spa), B_TRUE);
                taskq_wait(spa->spa_zvol_taskq);

Also, I suppose the same should be done for auto trim?
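For readers unfamiliar with the pattern, the wait loop in the patch above is a standard condition-variable drain. A minimal userspace sketch with pthreads (the names echo `spa_num_man_trimming` and `spa_man_trim_done_cv` but are hypothetical):

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-ins for the SPA fields used by the patch. */
static pthread_mutex_t trim_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t trim_done_cv = PTHREAD_COND_INITIALIZER;
static int num_trimming;	/* count of manual trims in flight */

static void
trim_begin(void)
{
	pthread_mutex_lock(&trim_lock);
	num_trimming++;
	pthread_mutex_unlock(&trim_lock);
}

static void
trim_done(void)
{
	pthread_mutex_lock(&trim_lock);
	num_trimming--;
	pthread_cond_broadcast(&trim_done_cv);	/* wake any waiting exporter */
	pthread_mutex_unlock(&trim_lock);
}

/* Export path: block until no manual trim is in flight. */
static void
export_wait_for_trims(void)
{
	pthread_mutex_lock(&trim_lock);
	while (num_trimming > 0)
		pthread_cond_wait(&trim_done_cv, &trim_lock);
	pthread_mutex_unlock(&trim_lock);
}

static int
trims_in_flight(void)
{
	int n;

	pthread_mutex_lock(&trim_lock);
	n = num_trimming;
	pthread_mutex_unlock(&trim_lock);
	return (n);
}
```

In the kernel patch the equivalent drain runs in spa_export_common() before spa_async_suspend(), so the taskq-destroy request can no longer be stranded on a suspended async thread.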

I have also changed the trim taskq thread pool name formatting in order to get a clearer picture of what is going on. That way it is much easier to identify which thread belongs to which pool, since task->comm is truncated to 16 characters by default and the "_man_trim"/"_auto_trim" suffix gets truncated entirely (at least in my case). That's just a suggestion about the naming convention.

diff --git a/module/zfs/spa_misc.c b/module/zfs/spa_misc.c
index 8a91b40510da..ab0443f5537e 100644
--- a/module/zfs/spa_misc.c
+++ b/module/zfs/spa_misc.c
@@ -2257,7 +2257,7 @@ spa_auto_trim_taskq_create(spa_t *spa)
 
        ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
        ASSERT(spa->spa_auto_trim_taskq == NULL);
-       (void) snprintf(name, MAXPATHLEN, "%s_auto_trim", spa->spa_name);
+       (void) snprintf(name, MAXPATHLEN, "atrim_%s", spa->spa_name);
        spa->spa_auto_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
            spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
        VERIFY(spa->spa_auto_trim_taskq != NULL);
@@ -2283,7 +2283,7 @@ spa_man_trim_taskq_create(spa_t *spa)
                 */
                return;
        }
-       (void) snprintf(name, MAXPATHLEN, "%s_man_trim", spa->spa_name);
+       (void) snprintf(name, MAXPATHLEN, "mtrim_%s", spa->spa_name);
        spa->spa_man_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
            spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
        VERIFY(spa->spa_man_trim_taskq != NULL);
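The truncation being described is easy to demonstrate in userspace. Assuming the kernel's TASK_COMM_LEN of 16 (15 visible characters plus the terminating NUL) and an illustrative pool name, a suffix-style taskq name loses its distinguishing part while a prefix-style name keeps it:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TASK_COMM_LEN 16	/* kernel task->comm size, including NUL */

/*
 * Mimic how a taskq thread name lands in task->comm: anything past
 * the first 15 characters is cut off.  Pool names are illustrative.
 */
static const char *
as_comm(const char *taskq_name)
{
	static char comm[TASK_COMM_LEN];

	(void) snprintf(comm, sizeof (comm), "%s", taskq_name);
	return (comm);
}
```

With the prefix scheme, `ps` output still shows which kind of trim taskq a thread belongs to even for long pool names.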

I am open to any criticism of this fix.

Best regards,
Angel Shtilianov

@aerusso
Contributor

aerusso commented Jul 15, 2018

@tuxoko Would it be possible to update this PR to @dweeezil's post-removal rebase? It might be helpful to see the test suite results.

@tuxoko
Contributor Author

tuxoko commented Jul 16, 2018

@aerusso
I'll look into it. Unfortunately, my priorities have changed recently; I'll try to find time to do it.

@dweeezil
Contributor

@aerusso FYI, my post-removal-rebase branch, which I'd not yet widely publicized, is in fact based on @tuxoko's "v2" work. I've not yet had the time to test it, short of making sure it compiles. I hope to get it tested in the next couple of days.

@mailinglists35

is there any chance this will be ready for 0.8.0 release? or 0.8.x?

@behlendorf behlendorf added Status: Inactive Not being actively updated and removed Status: Work in Progress Not yet ready for general review labels Sep 25, 2018
@ahrens ahrens added the Status: Revision Needed Changes are required for the PR to be accepted label Sep 27, 2018
@RJVB

RJVB commented Oct 24, 2018

Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

@mailinglists35

Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

switch to freebsd https://blog.plein.org/2014/04/15/freebsd-9-2-supports-zfs-trimunmap/

@RJVB

RJVB commented Oct 25, 2018 via email

@drescherjm

How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

It would be only the non-ZFS part.

@RJVB

RJVB commented Oct 25, 2018 via email

@mailinglists35

It would be only the non-ZFS part.
Well, that's not a big deal if I can do the ZFS part by importing the pool under a FreeBSD clone. Would importing the pool be enough, or would I have to do a scrub or something similar?

I suggest we stop discussing this here and now before a mod will lock this thread. Use the mailing lists for this.

@dweeezil
Contributor

dweeezil commented Jan 8, 2019

Replaced with #8255.

@dweeezil dweeezil closed this Jan 8, 2019