
Openzfs compressedarc abd patchset WIP #5009

Closed
wants to merge 11 commits into from

Conversation

dpquigl
Contributor

@dpquigl dpquigl commented Aug 22, 2016

This is a WIP PR for the ABD work. It follows the path outlined by Dan Kimmel at Delphix and contains multiple features, including compressed ARC, compressed send/recv, a reworking of ARC internals, and the ABD code with some Linux-specific patches.

@ironMann
Contributor

@dpquigl I've been rebasing the SIMDized raidz patches on a couple of other ABD patch sets. The last attempt is on the raidz-abd branch.

However, that requires a slight change in the ABD patch, most notably making abd_get_offset() accept a size parameter (commit 84adc45). Is there a way to push this change upstream?
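For reference, the change under discussion is roughly the following before/after sketch of the prototype; the authoritative version is commit 84adc45:

/*
 * Before: the overlay ABD inherits its end from the parent,
 * so its size is implicitly (parent_size - off).
 */
abd_t *abd_get_offset(abd_t *sabd, size_t off);

/*
 * After (sketch of the proposed change): the caller bounds the
 * overlay explicitly with a size argument.
 */
abd_t *abd_get_offset(abd_t *sabd, size_t off, size_t size);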

@dpquigl
Contributor Author

dpquigl commented Aug 23, 2016

@ironMann It looks like your work is against David Chen's version of ABD. The version we have here is based on the Delphix-OS version worked on by Dan Kimmel. I'll look to see what it's trying to return with that size parameter, but it's possible it doesn't make sense with the other ABD implementation.

@behlendorf
Contributor

behlendorf commented Aug 23, 2016

@ironMann that slight change to abd_get_offset() looks entirely reasonable to me. I can definitely see how it would make adding ABD to the vectorized raidz simpler. Let's get @dankimmel's thoughts on this too, to make sure it's not going to cause problems elsewhere.

I should add that it would be awesome to get an updated version of your raidz-abd branch against this PR. That's still a big piece we need which is missing. This change still needs more polish to pass all the testing, but that's being worked on.

@ironMann
Contributor

ironMann commented Aug 23, 2016

@dpquigl To clarify the size parameter: when a new overlay abd is created with abd_get_offset() it inherits the buffer end from the host, meaning its size is (host_size - offset). For raidz, it is beneficial to split up one I/O abd into several consecutive abds. By providing both offset and size, those can behave as regular abds under the iterator methods.
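A minimal sketch of that raidz use case, assuming the size-augmented abd_get_offset() from commit 84adc45 (the helper name here is hypothetical):

/*
 * Hypothetical helper: carve one raidz I/O ABD into nsegs consecutive
 * per-device segments. With an explicit size each child behaves as a
 * regular, self-contained ABD under the iterator methods; without it,
 * every child would extend to the end of the parent buffer.
 */
static void
split_io_abd(abd_t *io_abd, abd_t **segs, size_t nsegs, size_t seg_size)
{
	for (size_t i = 0; i < nsegs; i++)
		segs[i] = abd_get_offset(io_abd, i * seg_size, seg_size);
}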

@dankimmel
Contributor

dankimmel commented Aug 23, 2016

@ironMann: neat, this optimization looks really cool!

@dpquigl is right, the new ABD code is a significant departure from the version you patched. The iterator functions have all been rewritten in the new version to use abd_iterate_func() and abd_iterate_func2() under the hood, but obviously your version is implementing some new primitives in abd.c to handle iterating over groups of >2 ABDs. Is it plausible that your abd_raidz_gen_iterate() and abd_raidz_rec_iterate() functions could be rewritten using those, or maybe that abd_iterate_func2() could be rewritten to take a variable number of ABDs? I don't know enough about the func_raidz_gen/func_raidz_rec implementations you're passing into those to know for sure what the best path might be.

Why do you need the size stored for offset ABDs -- just because it would be easier to track the lengths of the segments to compare? It shouldn't be hard to add that; I'm just trying to understand the use case better, since I think you could achieve the same thing by passing the length to abd_iterate_func* instead.
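For context, the iterator entry points being discussed have roughly these shapes in the new ABD code (prototypes paraphrased; treat the exact signatures as approximate):

typedef int abd_iter_func_t(void *buf, size_t len, void *private);
typedef int abd_iter_func2_t(void *bufa, void *bufb, size_t len,
    void *private);

/* Both already take an explicit size bounding the iteration. */
int abd_iterate_func(abd_t *abd, size_t off, size_t size,
    abd_iter_func_t *func, void *private);
int abd_iterate_func2(abd_t *dabd, abd_t *sabd, size_t doff,
    size_t soff, size_t size, abd_iter_func2_t *func, void *private);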

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported by: David Quigley <david.quigley@intel.com>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the L1 arc hdr called b_pdata. The b_pdata always
hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical
block for that DVA. The physical block may or may not be compressed. If
compressed arc is enabled and the block on-disk is compressed, then the b_pdata
will match the block on-disk and remain compressed in memory. If the block
on-disk is not compressed, then neither will the b_pdata be. Lastly, if
compressed arc is disabled, then b_pdata will always be an uncompressed version
of the on-disk block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks, as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
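A simplified sketch of that read path (all helper names here are illustrative, not the actual OpenZFS functions):

static arc_buf_t *
read_cached_block(arc_buf_hdr_t *hdr, uint64_t size)
{
	arc_buf_t *buf = allocate_arc_buf(hdr, size);	/* hypothetical */
	arc_buf_t *first = first_arc_buf(hdr);		/* hypothetical */

	if (first != NULL) {
		/* An uncompressed arc_buf_t already exists: duplicate it. */
		bcopy(first->b_data, buf->b_data, size);
	} else if (hdr_is_compressed(hdr)) {
		/* Decompress the physical block into the new buffer. */
		decompress(hdr->b_pdata, buf->b_data, size);
	} else {
		/* b_pdata is already uncompressed. */
		bcopy(hdr->b_pdata, buf->b_data, size);
	}
	return (buf);
}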

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is in use it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled, the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining whether it's valid.

OpenZFS Issue: https://www.illumos.org/issues/6950
@ironMann
Contributor

@dankimmel @behlendorf I've ported raidz onto this abd stack, PR #5020.

2d660f3 is the adaptation of ABD needed for the rest of the RAIDZ code. There are a few minor changes, and two special iterator methods.

Is it plausible that your abd_raidz_gen_iterate() and abd_raidz_rec_iterate() functions could be rewritten...

I'm already using abd_iterate_func() and abd_iterate_func2(), but without access to more buffers at once performance would suffer. Ideally, I would simultaneously access as many abd buffers as there are devices in the raidz map. I know that this breaks layering, but the methods are somewhat special, and if need be, I can reimplement them elsewhere. The problem is, the ABD iterator interface is not exported.
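For reference, the two special iterator methods in 2d660f3 have roughly this shape (paraphrased; see PR #5020 for the exact prototypes):

/* Generate parity into (parity) code ABDs from one data ABD. */
void abd_raidz_gen_iterate(abd_t **cabds, abd_t *dabd,
    ssize_t csize, ssize_t dsize, const unsigned parity,
    void (*func_raidz_gen)(void **cbufs, const void *dbuf,
    size_t size, size_t dsize));

/*
 * Reconstruct target ABDs from code ABDs using precomputed
 * multiplication coefficients.
 */
void abd_raidz_rec_iterate(abd_t **cabds, abd_t **tabds,
    ssize_t tsize, const unsigned parity,
    void (*func_raidz_rec)(void **tbufs, const size_t tsize,
    void **cbufs, const unsigned *mul),
    const unsigned *mul);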

Why do you need the size stored for offset ABDs

For iterators (created from offset ABDs) in abd_raidz_gen_iterate() and abd_raidz_rec_iterate() to work correctly :)
Please also review this change for the iter_mapsize calculation of scatter buffers.

#define	kpm_enable		1
#define	abd_alloc_chunk() \
	((struct page *)kmem_alloc(PAGESIZE, KM_SLEEP))
#define	abd_free_chunk(chunk)	kmem_free(chunk, PAGESIZE)
@behlendorf
Contributor

Prior to this change PAGESIZE wasn't used extensively in userspace, so it was implemented in lib/libspl/include/sys/param.h just as sysconf(_SC_PAGESIZE). That's not going to work well if we use PAGESIZE everywhere. We're either going to want to implement PAGESIZE differently, or alternately preserve the zfs_abd_chunk_size and initialize it once in abd_init().

Potentially we could also preserve the kmem cache chunk case for use solely by user space. That would help minimize our delta with upstream.
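For reference, the userspace definition in question amounts to the following (per lib/libspl/include/sys/param.h, as described above):

/* lib/libspl/include/sys/param.h (userspace) */
#include <unistd.h>
#define	PAGESIZE	(sysconf(_SC_PAGESIZE))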

@ironMann
Contributor

ironMann commented Aug 25, 2016

@behlendorf Good point. How reliant is userspace on the actual running system's PAGE_SIZE? Can we get it in the configure script and define PAGE_SIZE then?

I'm now more worried about the kernel's PAGESIZE value using the build system's kernel value. PAGE_SHIFT is configurable on some architectures, e.g. arm64 and ia64.

@behlendorf
Contributor

Getting PAGE_SIZE wrong in user space shouldn't be terrible. I could see it perhaps causing some memory alignment issues, but I suspect that would be the worst of it. That said, I've never tried.

On the kernel side we should always be using PAGE_SIZE from the kernel headers. I don't recall why we ever added the #ifdef around PAGESIZE; it seems like we'd want it to fail if PAGESIZE was ever defined.

@ironMann
Contributor

Potential PAGE_SHIFT discrepancies will be resolved by DKMS. I was thrown off by a colleague's story of debugging a driver on two arm64 systems with different page sizes.

@dankimmel
Contributor

dankimmel commented Aug 25, 2016

Ideally, I would simultaneously access as many abd buffers as there are devices in raidz map

Yeah, I agree that this is such a special case that adding them to abd.c is fine. If we accrue more callers like this in the future then we can think harder about whether there's a way to abstract the underlying ABD iteration, or whether we should just expose that externally like you suggested.

For iterators (created from offset ABDs) in abd_raidz_gen_iterate() and abd_raidz_rec_iterate() to work correctly :)

Heh. I see what you're saying now.

Please also review this change for iter_mapsize calculation of scatter buffers.

The iter_mapsize calculation looks right to me, and adding the size with the _impl function looks good too.

@behlendorf
Contributor

@dpquigl the patch in #5024 resolves the Ubuntu and Debian buildbot failures. You'll want to pull it into this branch at least until it's merged to master.

@@ -577,7 +577,6 @@ extern int spa_get_stats(const char *pool, nvlist_t **config, char *altroot,
size_t buflen);
extern int spa_create(const char *pool, nvlist_t *config, nvlist_t *props,
nvlist_t *zplprops);
extern int spa_import_rootpool(char *devpath, char *devid);
Contributor

Good find, we accidentally missed this declaration.

dpquigl and others added 10 commits September 6, 2016 06:30
In receive_object() a new tx is created in order to create an
object.  While the tx is assigned and not yet committed, direct
reclaim occurs via the kmem_cache_alloc() in dnode_create().
This results in an inode being evicted which needs to be removed.

The removal process creates another new tx which can't be assigned
to the closed txg, and the receive_writer thread blocks waiting
for the next txg.  However, since it's still holding a tx open
in the previous txg the entire txg_sync process deadlocks.

This is prevented by calling spl_fstrans_mark(), which disables
direct reclaim from occurring in all memory allocations performed
by this process.  This covers the offending kmem_cache_alloc()
and any other kmem_alloc()'s which occur in the critical region.

An analogous situation can occur in arc_write_ready() and is
resolved the same way.

Finally, abd_alloc_chunk() is changed to use kmem_flags_convert(),
which detects if the PF_FSTRANS flag is set for the process.  When
set, the __GFP_IO and __GFP_FS bits are cleared to prevent reclaim.

 INFO: task receive_writer:25773 blocked for more than 120 seconds.
 Call Trace:
  [<ffffffff8163b809>] schedule+0x29/0x70
  [<ffffffffa04ba1ad>] cv_wait_common+0x10d/0x130 [spl]
  [<ffffffffa04ba1e5>] __cv_wait+0x15/0x20 [spl]
  [<ffffffffa062f3f3>] txg_wait_open+0xe3/0x1b0 [zfs]
  [<ffffffffa05d53a5>] dmu_tx_wait+0x465/0x4d0 [zfs]
  [<ffffffffa05d54d9>] dmu_tx_assign+0xc9/0x700 [zfs]
  [<ffffffffa0662826>] zfs_rmnode+0x136/0x440 [zfs]
  [<ffffffffa06892f8>] zfs_zinactive+0xd8/0x140 [zfs]
  [<ffffffffa067f5ab>] zfs_inactive+0x8b/0x300 [zfs]
  [<ffffffffa06a3773>] zpl_evict_inode+0x43/0xa0 [zfs]
  [<ffffffff811fa6e7>] evict+0xa7/0x170
  [<ffffffff811fa7ee>] dispose_list+0x3e/0x50
  [<ffffffff811fb6c3>] prune_icache_sb+0x163/0x320
  [<ffffffff811e12d3>] prune_super+0x143/0x170
  [<ffffffff8117c8f5>] shrink_slab+0x165/0x300
  [<ffffffff8117fa72>] do_try_to_free_pages+0x3c2/0x4e0
  [<ffffffff8117fc8c>] try_to_free_pages+0xfc/0x180
  [<ffffffff81173988>] __alloc_pages_nodemask+0x808/0xba0
  [<ffffffff811b49ea>] alloc_pages_current+0xaa/0x170
  [<ffffffff811bef45>] new_slab+0x275/0x300
  [<ffffffff8163306c>] __slab_alloc+0x315/0x48f
  [<ffffffff811c1583>] kmem_cache_alloc+0x193/0x1d0
  [<ffffffffa04b4339>] spl_kmem_cache_alloc+0x99/0x150 [spl]
  [<ffffffffa05d7864>] dnode_create+0x34/0x280 [zfs]
  [<ffffffffa05d9ae0>] dnode_hold_impl+0x7e0/0x9d0 [zfs]
  [<ffffffffa05c3292>] dmu_object_claim_dnsize+0xa2/0x180 [zfs]
  [<ffffffffa05ce1bd>] receive_object.isra.8+0x3ed/0x400 [zfs]
  [<ffffffffa05ce578>] receive_writer_thread+0x2a8/0x760 [zfs]
  [<ffffffffa04b4f01>] thread_generic_wrapper+0x71/0x80 [spl]
  [<ffffffff810a5aef>] kthread+0xcf/0xe0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
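A minimal kernel-side sketch of the SPL pattern this fix applies around the critical region (an illustrative wrapper, not the actual receive_writer_thread() or arc_write_ready() code):

/*
 * While the cookie is held, PF_FSTRANS is set for the process and
 * kmem_flags_convert() clears the reclaim bits, so allocations made
 * here cannot recurse into the filesystem via direct reclaim.
 */
static void
do_work_without_fs_reclaim(void)
{
	fstrans_cookie_t cookie = spl_fstrans_mark();

	/* assign the tx and perform allocations here */

	spl_fstrans_unmark(cookie);
}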
The user space implementation of cv_timedwait_hires() was always passing
a relative time to pthread_cond_timedwait() when an absolute time is
expected.  This was accidentally introduced in commit 206971d.

Replace two magic values with their corresponding preprocessor macro.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
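The essence of the fix: pthread_cond_timedwait() takes an absolute deadline, so a relative timeout must be converted first. A self-contained sketch of the idea (not the actual libzpool code):

#include <pthread.h>
#include <time.h>

#define	NSEC_PER_SEC	1000000000LL

static int
cond_timedwait_rel(pthread_cond_t *cv, pthread_mutex_t *mp, long long rel_ns)
{
	struct timespec ts;

	/* Convert the relative timeout into an absolute deadline. */
	(void) clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += rel_ns / NSEC_PER_SEC;
	ts.tv_nsec += rel_ns % NSEC_PER_SEC;
	if (ts.tv_nsec >= NSEC_PER_SEC) {
		ts.tv_sec += 1;
		ts.tv_nsec -= NSEC_PER_SEC;
	}
	return (pthread_cond_timedwait(cv, mp, &ts));
}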
@behlendorf
Contributor

Closing. To be replaced by a new rebased PR with these changes now that compressed ARC and compressed send/recv have been merged.

@behlendorf behlendorf closed this Sep 13, 2016
@dpquigl dpquigl deleted the openzfs-compressedarc-abd-patchset branch September 13, 2016 20:34
@mmatuska
Contributor

MRU/MFU pressure fix:
#10548
#10618
