
Openzfs compressedarc abd patchset WIP #5009

Closed
wants to merge 11 commits into from

Conversation

dpquigl
Contributor

@dpquigl dpquigl commented Aug 22, 2016

This is a WIP PR for the ABD work. It follows the path outlined by Dan Kimmel at Delphix and contains multiple features, including compressed ARC, compressed send/recv, a reworking of ARC internals, and the ABD code with some Linux-specific patches.

@ironMann
Contributor

@dpquigl I've been rebasing the SIMDized raidz patches on a couple of other ABD patch sets. The last attempt is on the raidz-abd branch.

However, that requires a slight change in the ABD patch, most notably making abd_get_offset() accept a size parameter (commit 84adc45). Is there a way to push this change upstream?
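For reference, the change under discussion is roughly the following before/after sketch of the prototype; the authoritative version is commit 84adc45:

/*
 * Before: the overlay ABD inherits its end from the parent,
 * so its size is implicitly (parent_size - off).
 */
abd_t *abd_get_offset(abd_t *sabd, size_t off);

/*
 * After (sketch of the proposed change): the caller bounds the
 * overlay explicitly with a size argument.
 */
abd_t *abd_get_offset(abd_t *sabd, size_t off, size_t size);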

@dpquigl
Contributor Author

dpquigl commented Aug 23, 2016

@ironMann It looks like your work is against David Chen's version of ABD. The version we have here is based on the Delphix-OS version worked on by Dan Kimmel. I'll look to see what it's trying to return with that size parameter, but it's possible it doesn't make sense with the other ABD implementation.

@behlendorf
Contributor

behlendorf commented Aug 23, 2016

@ironMann that slight change to abd_get_offset() looks entirely reasonable to me. I can definitely see how it would make adding ABD to the vectorized raidz simpler. Let's get @dankimmel's thoughts on this too, to make sure it's not going to cause problems elsewhere.

I should add that it would be awesome to get an updated version of your raidz-abd branch against this PR. That's still a big piece we need which is missing. This change still needs more polish to pass all the testing, but that's being worked on.

@ironMann
Contributor

ironMann commented Aug 23, 2016

@dpquigl To clarify the size parameter: when a new overlay abd is created with abd_get_offset() it inherits the buffer end from the host, meaning its size is (host_size - offset). For raidz, it is beneficial to split up one I/O abd into several consecutive abds. By providing both offset and size, those can behave as regular abds under the iterator methods.
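A minimal sketch of that raidz use case, assuming the size-augmented abd_get_offset() from commit 84adc45 (the helper name here is hypothetical):

/*
 * Hypothetical helper: carve one raidz I/O ABD into nsegs consecutive
 * per-device segments. With an explicit size each child behaves as a
 * regular, self-contained ABD under the iterator methods; without it,
 * every child would extend to the end of the parent buffer.
 */
static void
split_io_abd(abd_t *io_abd, abd_t **segs, size_t nsegs, size_t seg_size)
{
	for (size_t i = 0; i < nsegs; i++)
		segs[i] = abd_get_offset(io_abd, i * seg_size, seg_size);
}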

@dankimmel
Contributor

dankimmel commented Aug 23, 2016

@ironMann: neat, this optimization looks really cool!

@dpquigl is right, the new ABD code is a significant departure from the version you patched. The iterator functions have all been rewritten in the new version to use abd_iterate_func() and abd_iterate_func2() under the hood, but obviously your version is implementing some new primitives in abd.c to handle iterating over groups of >2 ABDs. Is it plausible that your abd_raidz_gen_iterate() and abd_raidz_rec_iterate() functions could be rewritten using those, or maybe that abd_iterate_func2() could be rewritten to take a variable number of ABDs? I don't know enough about the func_raidz_gen/func_raidz_rec implementations you're passing into those to know for sure what the best path might be.

Why do you need the size stored for offset ABDs -- just because it would be easier to track the lengths of the segments to compare? It shouldn't be hard to add that; I'm just trying to understand the use case better, since I think you could achieve the same thing by passing the length to abd_iterate_func* instead.
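For context, the iterator entry points being discussed have roughly these shapes in the new ABD code (prototypes paraphrased; treat the exact signatures as approximate):

typedef int abd_iter_func_t(void *buf, size_t len, void *private);
typedef int abd_iter_func2_t(void *bufa, void *bufb, size_t len,
    void *private);

/* Both already take an explicit size bounding the iteration. */
int abd_iterate_func(abd_t *abd, size_t off, size_t size,
    abd_iter_func_t *func, void *private);
int abd_iterate_func2(abd_t *dabd, abd_t *sabd, size_t doff,
    size_t soff, size_t size, abd_iter_func2_t *func, void *private);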

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported by: David Quigley <david.quigley@intel.com>

This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently accessed data uncompressed.

I've added a new member to the L1 arc hdr called b_pdata. The b_pdata always
hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical
block for that DVA. The physical block may or may not be compressed. If
compressed arc is enabled and the block on-disk is compressed, then the b_pdata
will match the block on-disk and remain compressed in memory. If the block
on-disk is not compressed, then neither will the b_pdata be. Lastly, if
compressed arc is disabled, then b_pdata will always be an uncompressed version
of the on-disk block.

Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks, as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
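A simplified sketch of that read path (all helper names here are illustrative, not the actual OpenZFS functions):

static arc_buf_t *
read_cached_block(arc_buf_hdr_t *hdr, uint64_t size)
{
	arc_buf_t *buf = allocate_arc_buf(hdr, size);	/* hypothetical */
	arc_buf_t *first = first_arc_buf(hdr);		/* hypothetical */

	if (first != NULL) {
		/* An uncompressed arc_buf_t already exists: duplicate it. */
		bcopy(first->b_data, buf->b_data, size);
	} else if (hdr_is_compressed(hdr)) {
		/* Decompress the physical block into the new buffer. */
		decompress(hdr->b_pdata, buf->b_data, size);
	} else {
		/* b_pdata is already uncompressed. */
		bcopy(hdr->b_pdata, buf->b_data, size);
	}
	return (buf);
}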

Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.

When an l2arc is in use it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled, the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining whether it's valid.

OpenZFS Issue: https://www.illumos.org/issues/6950
@ironMann
Contributor

@dankimmel @behlendorf I've ported raidz onto this abd stack, PR #5020.

2d660f3 is the adaptation of ABD needed for the rest of the RAIDZ code. There are a few minor changes, and two special iterator methods.

Is it plausible that your abd_raidz_gen_iterate() and abd_raidz_rec_iterate() functions could be rewritten...

I'm already using abd_iterate_func() and abd_iterate_func2(), but without access to more buffers at once performance would suffer. Ideally, I would simultaneously access as many abd buffers as there are devices in the raidz map. I know that this breaks layering, but the methods are somewhat special, and if need be, I can reimplement them elsewhere. The problem is, the ABD iterator interface is not exported.
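For reference, the two special iterator methods in 2d660f3 have roughly this shape (paraphrased; see PR #5020 for the exact prototypes):

/* Generate parity into (parity) code ABDs from one data ABD. */
void abd_raidz_gen_iterate(abd_t **cabds, abd_t *dabd,
    ssize_t csize, ssize_t dsize, const unsigned parity,
    void (*func_raidz_gen)(void **cbufs, const void *dbuf,
    size_t size, size_t dsize));

/*
 * Reconstruct target ABDs from code ABDs using precomputed
 * multiplication coefficients.
 */
void abd_raidz_rec_iterate(abd_t **cabds, abd_t **tabds,
    ssize_t tsize, const unsigned parity,
    void (*func_raidz_rec)(void **tbufs, const size_t tsize,
    void **cbufs, const unsigned *mul),
    const unsigned *mul);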

Why do you need the size stored for offset ABDs

For iterators (created from offset ABDs) in abd_raidz_gen_iterate() and abd_raidz_rec_iterate() to work correctly :)
Please also review this change for the iter_mapsize calculation of scatter buffers.

#define	kpm_enable		1
#define	abd_alloc_chunk() \
	((struct page *)kmem_alloc(PAGESIZE, KM_SLEEP))
#define	abd_free_chunk(chunk)	kmem_free(chunk, PAGESIZE)
@behlendorf
Contributor

Prior to this change PAGESIZE wasn't used extensively in userspace, so it was implemented in lib/libspl/include/sys/param.h just as sysconf(_SC_PAGESIZE). That's not going to work well if we use PAGESIZE everywhere. We're either going to want to implement PAGESIZE differently, or alternately preserve the zfs_abd_chunk_size and initialize it once in abd_init().

Potentially we could also preserve the kmem cache chunk case for use solely by user space. That would help minimize our delta with upstream.
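For reference, the userspace definition in question amounts to the following (per lib/libspl/include/sys/param.h, as described above):

/* lib/libspl/include/sys/param.h (userspace) */
#include <unistd.h>
#define	PAGESIZE	(sysconf(_SC_PAGESIZE))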

@ironMann
Contributor

ironMann commented Aug 25, 2016

@behlendorf Good point. How reliant is userspace on the actual running system's PAGE_SIZE? Can we get it in the configure script and define PAGE_SIZE then?

I'm now more worried about the kernel's PAGESIZE value using the build system's kernel value. PAGE_SHIFT is configurable on some architectures, e.g. arm64 and ia64.

@behlendorf
Contributor

Getting PAGE_SIZE wrong in user space shouldn't be terrible. I could see it perhaps causing some memory alignment issues, but I suspect that would be the worst of it. That said, I've never tried.

On the kernel side we should always be using PAGE_SIZE from the kernel headers. I don't recall why we ever added the #ifdef around PAGESIZE; it seems like we'd want it to fail if PAGESIZE was ever defined.

@ironMann
Contributor

Potential PAGE_SHIFT discrepancies will be resolved by DKMS. I was thrown off by a colleague's story of debugging a driver on two arm64 systems with different page sizes.

@dankimmel
Contributor

dankimmel commented Aug 25, 2016

Ideally, I would simultaneously access as many abd buffers as there are devices in raidz map

Yeah, I agree that this is such a special case that adding them to abd.c is fine. If we accrue more callers like this in the future then we can think harder about whether there's a way to abstract the underlying ABD iteration, or whether we should just expose that externally like you suggested.

For iterators (created from offset ABDs) in abd_raidz_gen_iterate() and abd_raidz_rec_iterate() to work correctly :)

Heh. I see what you're saying now.

Please also review this change for iter_mapsize calculation of scatter buffers.

The iter_mapsize calculation looks right to me, and adding the size with the _impl function looks good too.

@behlendorf
Contributor

@dpquigl the patch in #5024 resolves the Ubuntu and Debian buildbot failures. You'll want to pull it into this branch at least until it's merged to master.

@@ -577,7 +577,6 @@ extern int spa_get_stats(const char *pool, nvlist_t **config, char *altroot,
size_t buflen);
extern int spa_create(const char *pool, nvlist_t *config, nvlist_t *props,
nvlist_t *zplprops);
extern int spa_import_rootpool(char *devpath, char *devid);
Contributor

Good find, we accidentally missed this declaration.

dpquigl and others added 10 commits September 6, 2016 06:30
In receive_object() a new tx is created in order to create an
object.  While the tx is assigned and not yet committed, direct
reclaim occurs via the kmem_cache_alloc() in dnode_create().
This results in an inode being evicted which needs to be removed.

The removal process creates another new tx which can't be assigned
to the closed txg, and the receive_writer thread blocks waiting
for the next txg.  However, since it's still holding a tx open
in the previous txg the entire txg_sync process deadlocks.

This is prevented by calling spl_fstrans_mark(), which disables
direct reclaim from occurring in all memory allocations performed
by this process.  This covers the offending kmem_cache_alloc()
and any other kmem_alloc()'s which occur in the critical region.

An analogous situation can occur in arc_write_ready() and is
resolved the same way.

Finally, abd_alloc_chunk() is changed to use kmem_flags_convert(),
which detects if the PF_FSTRANS flag is set for the process.  When
set, the __GFP_IO and __GFP_FS bits are cleared to prevent reclaim.

 INFO: task receive_writer:25773 blocked for more than 120 seconds.
 Call Trace:
  [<ffffffff8163b809>] schedule+0x29/0x70
  [<ffffffffa04ba1ad>] cv_wait_common+0x10d/0x130 [spl]
  [<ffffffffa04ba1e5>] __cv_wait+0x15/0x20 [spl]
  [<ffffffffa062f3f3>] txg_wait_open+0xe3/0x1b0 [zfs]
  [<ffffffffa05d53a5>] dmu_tx_wait+0x465/0x4d0 [zfs]
  [<ffffffffa05d54d9>] dmu_tx_assign+0xc9/0x700 [zfs]
  [<ffffffffa0662826>] zfs_rmnode+0x136/0x440 [zfs]
  [<ffffffffa06892f8>] zfs_zinactive+0xd8/0x140 [zfs]
  [<ffffffffa067f5ab>] zfs_inactive+0x8b/0x300 [zfs]
  [<ffffffffa06a3773>] zpl_evict_inode+0x43/0xa0 [zfs]
  [<ffffffff811fa6e7>] evict+0xa7/0x170
  [<ffffffff811fa7ee>] dispose_list+0x3e/0x50
  [<ffffffff811fb6c3>] prune_icache_sb+0x163/0x320
  [<ffffffff811e12d3>] prune_super+0x143/0x170
  [<ffffffff8117c8f5>] shrink_slab+0x165/0x300
  [<ffffffff8117fa72>] do_try_to_free_pages+0x3c2/0x4e0
  [<ffffffff8117fc8c>] try_to_free_pages+0xfc/0x180
  [<ffffffff81173988>] __alloc_pages_nodemask+0x808/0xba0
  [<ffffffff811b49ea>] alloc_pages_current+0xaa/0x170
  [<ffffffff811bef45>] new_slab+0x275/0x300
  [<ffffffff8163306c>] __slab_alloc+0x315/0x48f
  [<ffffffff811c1583>] kmem_cache_alloc+0x193/0x1d0
  [<ffffffffa04b4339>] spl_kmem_cache_alloc+0x99/0x150 [spl]
  [<ffffffffa05d7864>] dnode_create+0x34/0x280 [zfs]
  [<ffffffffa05d9ae0>] dnode_hold_impl+0x7e0/0x9d0 [zfs]
  [<ffffffffa05c3292>] dmu_object_claim_dnsize+0xa2/0x180 [zfs]
  [<ffffffffa05ce1bd>] receive_object.isra.8+0x3ed/0x400 [zfs]
  [<ffffffffa05ce578>] receive_writer_thread+0x2a8/0x760 [zfs]
  [<ffffffffa04b4f01>] thread_generic_wrapper+0x71/0x80 [spl]
  [<ffffffff810a5aef>] kthread+0xcf/0xe0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
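A minimal kernel-side sketch of the SPL pattern this fix applies around the critical region (an illustrative wrapper, not the actual receive_writer_thread() or arc_write_ready() code):

/*
 * While the cookie is held, PF_FSTRANS is set for the process and
 * kmem_flags_convert() clears the reclaim bits, so allocations made
 * here cannot recurse into the filesystem via direct reclaim.
 */
static void
do_work_without_fs_reclaim(void)
{
	fstrans_cookie_t cookie = spl_fstrans_mark();

	/* assign the tx and perform allocations here */

	spl_fstrans_unmark(cookie);
}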
The user space implementation of cv_timedwait_hires() was always passing
a relative time to pthread_cond_timedwait() when an absolute time is
expected.  This was accidentally introduced in commit 206971d.

Replace two magic values with their corresponding preprocessor macro.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
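The essence of the fix: pthread_cond_timedwait() takes an absolute deadline, so a relative timeout must be converted first. A self-contained sketch of the idea (not the actual libzpool code):

#include <pthread.h>
#include <time.h>

#define	NSEC_PER_SEC	1000000000LL

static int
cond_timedwait_rel(pthread_cond_t *cv, pthread_mutex_t *mp, long long rel_ns)
{
	struct timespec ts;

	/* Convert the relative timeout into an absolute deadline. */
	(void) clock_gettime(CLOCK_REALTIME, &ts);
	ts.tv_sec += rel_ns / NSEC_PER_SEC;
	ts.tv_nsec += rel_ns % NSEC_PER_SEC;
	if (ts.tv_nsec >= NSEC_PER_SEC) {
		ts.tv_sec += 1;
		ts.tv_nsec -= NSEC_PER_SEC;
	}
	return (pthread_cond_timedwait(cv, mp, &ts));
}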
@behlendorf
Contributor

Closing. To be replaced by a new rebased PR with these changes now that compressed ARC and compressed send/recv have been merged.

@behlendorf behlendorf closed this Sep 13, 2016
@dpquigl dpquigl deleted the openzfs-compressedarc-abd-patchset branch September 13, 2016 20:34
@mmatuska
Contributor

MRU/MFU pressure fix:
#10548
#10618
