
Move ARC data buffers out of vmalloc #2129

Closed · wants to merge 39 commits

Conversation

@tuxoko (Contributor) commented Feb 17, 2014

_Warning_
This patchset requires further testing and review.
DO NOT USE IN PRODUCTION.


Introduce ABD: linear/scatter dual typed buffer for ARC

zfsonlinux currently uses vmalloc-backed slab for ARC buffers. There are some
major problems with this approach. One is that 32-bit systems have only a
handful of vmalloc address space. Another is that fragmentation in the slab
will easily trigger OOM on a busy system.

With ABD, we use scatterlist to allocate data buffers. In this approach we can
allocate in HIGHMEM, which alleviates vmalloc space pressure on 32-bit. Also,
we don't have to rely on slab, so there's no fragmentation issue.

But for metadata buffers, we still use linear buffers from the slab. The reason
for this is that there are a lot of *_phys pointers that point directly into
metadata buffers, so it's kind of impractical to change all of that code. And
metadata should only be a small percentage, so we probably don't have to worry
about the 32-bit and fragmentation issues, right?
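The linear/scatter split can be pictured with a small userspace toy (all names here are made up for illustration; the real abd_t in the patch is laid out differently, and real scatter buffers hold struct page entries from a scatterlist, not malloc'd chunks):

```c
#include <stdlib.h>
#include <string.h>

#define	TOY_PAGE_SIZE	4096

/*
 * Toy dual-typed buffer: metadata stays linear (one contiguous
 * allocation), data goes scatter (an array of page-sized chunks,
 * standing in for a kernel scatterlist).
 */
typedef struct toy_abd {
	int	is_linear;
	size_t	size;
	void	*linear_buf;	/* valid when is_linear */
	void	**pages;	/* valid when !is_linear */
	size_t	n_pages;
} toy_abd_t;

static toy_abd_t *
toy_abd_alloc(size_t size, int linear)
{
	toy_abd_t *abd = calloc(1, sizeof (*abd));

	abd->is_linear = linear;
	abd->size = size;
	if (linear) {
		abd->linear_buf = malloc(size);
	} else {
		abd->n_pages = (size + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
		abd->pages = calloc(abd->n_pages, sizeof (void *));
		for (size_t i = 0; i < abd->n_pages; i++)
			abd->pages[i] = malloc(TOY_PAGE_SIZE);
	}
	return (abd);
}

/*
 * Every consumer of a scatter buffer has to walk it chunk by chunk,
 * which is why so many call sites need modification.
 */
static void
toy_abd_copy_from(toy_abd_t *abd, const void *src, size_t len)
{
	if (abd->is_linear) {
		memcpy(abd->linear_buf, src, len);
		return;
	}
	for (size_t off = 0; off < len; off += TOY_PAGE_SIZE) {
		size_t n = len - off < TOY_PAGE_SIZE ?
		    len - off : TOY_PAGE_SIZE;
		memcpy(abd->pages[off / TOY_PAGE_SIZE],
		    (const char *)src + off, n);
	}
}

static void
toy_abd_copy_to(const toy_abd_t *abd, void *dst, size_t len)
{
	if (abd->is_linear) {
		memcpy(dst, abd->linear_buf, len);
		return;
	}
	for (size_t off = 0; off < len; off += TOY_PAGE_SIZE) {
		size_t n = len - off < TOY_PAGE_SIZE ?
		    len - off : TOY_PAGE_SIZE;
		memcpy((char *)dst + off,
		    abd->pages[off / TOY_PAGE_SIZE], n);
	}
}
```

The point of the sketch is the dual code path: linear consumers keep the old one-memcpy shape, while scatter consumers must loop over segments.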


This patchset is divided into the following parts:

  1. Add compatibility layer for {kmap,kunmap}_atomic
  2. Introduce ABD, but with scatter type disabled
  3. Modify checksum implementation for ABD
  4. Modify every file which touches abd_t field
  5. Enable scatter type ABD

Edit:
I forgot to mention...
There are some parts that can't handle scatter buffers yet,
compression and raidz Gaussian elimination to name a few.
Enabling them will cost you an extra copy compared to without this patch set.
Another thing is the xuio stuff, which I skipped entirely.

Todo:

  1. Clean up the implementation and documentation
  2. Make compression and raidz Gaussian elimination take scatter buffers
  3. Probably change vdev_cache and vdev_queue to scatter buffers
  4. vdev_file should be able to submit an iovec.

@DeHackEd (Contributor)

Call me intrigued.

How much testing have you done yourself?

@tuxoko (Contributor, Author) commented Feb 17, 2014

@DeHackEd Not much actually.
I created pools with no raid, mirror, and raidz123, and did some file copying.
Haven't done any serious torture testing.
I uploaded this early just to get some comments.
There are some hacky things, like using the lowest bit in the pointer to determine linear/scatter. Not sure you'll be comfortable with that. :)
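For readers unfamiliar with the trick: because allocations are aligned to at least two bytes, bit 0 of a valid buffer pointer is always zero, so it can be borrowed as a linear/scatter flag. A standalone illustration (these macros are written from scratch for this comment, not lifted from the patch):

```c
#include <stdint.h>

/*
 * Tag bit 0 to mark a pointer as scatter; untagged means linear.
 * Only safe because malloc/slab results are at least 2-byte aligned,
 * so bit 0 of a real pointer is otherwise always 0.
 */
#define	ABD_MARK_SCATTER(p)	((void *)((uintptr_t)(p) | 1))
#define	ABD_IS_LINEAR(p)	((((uintptr_t)(p)) & 1) == 0)
#define	ABD_UNTAG(p)		((void *)((uintptr_t)(p) & ~(uintptr_t)1))
```

The hack is compact but, as noted later in the thread, easy to misuse: any code path that dereferences the pointer without first checking ABD_IS_LINEAR() will fault or read garbage, which is part of why it was eventually replaced with a proper structure.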

@ryao (Contributor) commented Feb 17, 2014

@tuxoko Excellent work. I had started working on something called sgbuf that was intended to replace the zio buffers entirely, but that stalled as higher-priority things took time away from completion. Targeting just the arc buffers like you have done is not only simpler, but strikes at the heart of the problem. I will review what you have done over the next day to provide code comments.

@DeHackEd (Contributor)

I'll be taking a closer look as well.

Just before going into detail, I want to say that metadata buffers really should be considered for scatter/gathering as well. A lot of people (myself included) have metadata-heavy workloads. I very nearly allocate all of my ARC to metadata on my backup target. I understand the need for contiguous buffers for things like ZAP data structures but it may be possible to find a happy middle ground.

@ryao (Contributor) commented Feb 17, 2014

@DeHackEd If @tuxoko's approach stands up to review, then the key metadata structures could be given their own SLAB caches.

@ryao (Contributor) commented Feb 17, 2014

@tuxoko I had a brief chat with @ahrens about this in #zfsonlinux on freenode. Since it was a public conversation, I have reproduced it below, with noise from parallel conversations omitted:

12:24 -!- mahrens [~mahrens@openzfs/founder] has joined #zfsonlinux
12:25 <+ryao> mahrens: This could use review: https://github.com/zfsonlinux/zfs/pull/2129
12:26 <+ryao> mahrens: Someone took a low dangling fruit approach to the SLAB fragmentation problem. I have already begun reviewing the code. It looks very promising.
12:26 < mahrens> great. i'll take a look.  may not get to it until Wednesday though.
12:26 <+ryao> mahrens: I imagine behlendorf and I will have beaten it to death by then, but better late than never. :)
12:39 < mahrens> ryao: the general idea of that patch seems good.  There are several issues with the implementation that will make it hard to share that code with other platforms.
12:41 < mahrens> ryao: IMO there are too many tricks with the preprocessor (declaring the functions), pointers (changing the low bit), and the compiler (GCC-isms like ({ bla }) )
12:47 <+ryao> mahrens: I will agree with that. Anyway, the idea of tackling just the ARC buffers and leaving the ZIO buffers alone is interesting. We are basically trading metadata copies for data copies with that, but if it works well in practice, that could be all that we need.
12:49 <+ryao> mahrens: That is after the implementation details are fixed.
12:49 < mahrens> yes, i'd definitely be interested in how it works in practice

@tuxoko (Contributor, Author) commented Feb 18, 2014

Thanks for your time.
I'm prepared to be bashed. :)

P.S. Didn't notice ({ bla }) is GCC only though

@behlendorf (Contributor)

@tuxoko Nice work! I've only skimmed the patches thus far but this looks like a very workable approach in general. In particular, leaving the metadata linear was a nice way to simplify the problem and minimize the required changes. Since the vast majority of metadata is small (16k or less) we could just shift it to the Linux slab or kmalloc it for now. Further optimization may be needed but that can certainly happen in a second or third step.

One other thing I like about this proposed change is that it lays some of the groundwork which is needed to integrate the ARC with the page cache. That's still going to be a complicated job and tricky to get right but having everything already tracked in the ARC as pages helps considerably.

I'd like to dig in to these further but to be honest it may be a while before I'm able to give these changes the detailed attention they deserve. I need to focus on getting the next tag finalized. However, when that's done memory management was the next area I wanted to focus on so your timing is good. In the meanwhile, if others such as @ryao and @ahrens have the time to weigh in on the changes that would be helpful!

@behlendorf behlendorf added this to the 0.7.0 milestone Feb 19, 2014
@DeHackEd (Contributor)

So I just installed this onto my workstation. I told it to use 3/4 of my RAM (6 GB of 8 total) as ARC. I did:

  • Basic workstation stuff (firefox, IRC, etc)
  • 4 copies of gzip -dc running on a gzip-9 filesystem (yes that's a bit weird, but I did it anyway)
  • Fired up a Windows 7 VM with 1 GB of memory

IO from the gzip operation was pretty bad, and this is just a simple mirror of two 7200 RPM 1 TB disks, so the system had it pretty rough, but it was stable and the ARC shrank cleanly for the Windows VM to launch.

First impressions actually running the patch are good! More to come later.

@DeHackEd (Contributor)

One big problem I've had so far is trying to test it. Since the code is kernel-space only (see include/sys/abd.h), a build of ztest is effectively no different than before the patch.

I wrote some glue to provide the kernel scatterlist functionality in userspace and tried forcing use of ABD at all times, but now it's faulting because the pointer bit trick appears to be causing scatter abd_t to be matched as linear when they are not.

Option no 2 is to just build a kernel driver with all the assertions enabled and play with it. I've tried that as well, and so far so good, but it's not as thorough as ztest.

I still plan to try making this work. Just wanted to provide an update.


Slight edit: experimenting with github issue references

@DeHackEd (Contributor)

Update: I added my own testing layer which SEEMS to work out. If you're interested my patch is (deleted) -- it comes with no warranty and the duct tape is a design decision.

I've had a stack trace involving abd itself

#4  0x00007f9cd5dc8048 in dbuf_sync_leaf (dr=0x7f9ca4094bf0, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2464
2464            ASSERT(ABD_TO_LINEAR(db->db.db_data) != dr->dt.dl.dr_data);
...
#28 0x00007f9cd5dc7d31 in dbuf_sync_indirect (dr=0x7f9ca4059e90, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2430
#29 0x00007f9cd5dc89d1 in dbuf_sync_list (list=0x7f9ca40a7ed8, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2600
#30 0x00007f9cd5dc7d31 in dbuf_sync_indirect (dr=0x7f9ca40a7e60, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2430
#31 0x00007f9cd5dc89d1 in dbuf_sync_list (list=0x7f9ca40beb10, tx=0x7f9c9c3ecf30) at ../../module/zfs/dbuf.c:2600
#32 0x00007f9cd5df7fb9 in dnode_sync (dn=0x7f9ca40be950, tx=0x7f9c9c3ecf30) at ../../module/zfs/dnode_sync.c:705
#33 0x00007f9cd5dd845e in dmu_objset_sync_dnodes (list=0x1353830, newlist=0x0, tx=0x7f9c9c3ecf30) at ../../module/zfs/dmu_objset.c:947
#34 0x00007f9cd5dd8d6f in dmu_objset_sync (os=0x1353460, pio=0x7f9c9c414800, tx=0x7f9c9c3ecf30) at ../../module/zfs/dmu_objset.c:1067
#35 0x00007f9cd5dfd939 in dsl_dataset_sync (ds=0x1353010, zio=0x7f9c9c414800, tx=0x7f9c9c3ecf30) at ../../module/zfs/dsl_dataset.c:1353
#36 0x00007f9cd5e0f3ea in dsl_pool_sync (dp=0x12463a0, txg=123) at ../../module/zfs/dsl_pool.c:471
#37 0x00007f9cd5e44c2e in spa_sync (spa=0x1243ad0, txg=123) at ../../module/zfs/spa.c:6234
#38 0x00007f9cd5e52a78 in txg_sync_thread (dp=0x12463a0) at ../../module/zfs/txg.c:562
..
(gdb)

... some checksum errors as reported by (unpatched) ZDB and some "buffer modified while frozen" errors.

Anyone else want to try?

@tuxoko (Contributor, Author) commented Feb 26, 2014

@DeHackEd I didn't use ABD in userspace because I was lazy and didn't bother to dig into the userspace code. But if you think it would help debugging, then it would certainly be worth coming up with a userspace ABD.

ASSERT(ABD_TO_LINEAR(db->db.db_data) != dr->dt.dl.dr_data);
This looks a bit dangerous. Should be
ASSERT(!ABD_IS_LINEAR(db->db.db_data) || ABD_TO_LINEAR(db->db.db_data) != dr->dt.dl.dr_data);

And lastly, the pointer trick was also a product of my laziness, because I didn't want to touch too many places where only linear buffers exist. I'll clean it up and use a proper structure to hold both linear and scatter buffers when I have time...

Thanks

@DeHackEd (Contributor)

Thanks for the feedback. I've updated that line in my own code.

One other little thing I should mention. Right about the same time you posted your pull request, an update was pushed to the main repository that changes arc.c fairly extensively, to the point that your tree doesn't apply cleanly anymore. Hand merging is straightforward, but you may want to consider rebasing your tree if you have the time to do so.

My ztest runs have been encouraging enough that I'm going to try this patch on one or two more (real/bare metal) machines.

@behlendorf (Contributor)

There were some ARC changes so it would be helpful if you refreshed these patches. We also absolutely want this code to be stressed in the userspace builds if possible. Being able to generate random workloads with ztest has proven to be a very good way to catch subtle issues which might otherwise be missed.

@tuxoko (Contributor, Author) commented Feb 28, 2014

Rebased and add userspace scatter ABD

@tuxoko (Contributor, Author) commented Feb 28, 2014

Remove container_of kernel code

@DeHackEd (Contributor)

Sorry about that...

@DeHackEd (Contributor) commented Apr 4, 2014

I think this last update (tuxoko/zfs@b003f01) fixed one ztest error I was getting in which zdb checking the pool gives checksum errors on some blocks.

There are a few more suspicious ztest errors though. For example:

lt-ztest: ../../cmd/ztest/ztest.c:2263: Assertion `dmu_read(zd->zd_os, object, offset, blocksize, data, 1) == 0 (0x34 == 0x0)' failed.

(0x34 == 52 == EBADE == checksum error)

It's hard to tell if this patch is responsible or if it's something in ZFS itself. There are a few ztest bugs open.

@tuxoko (Contributor, Author) commented Apr 4, 2014

Hi @DeHackEd. What's the difference between lt-ztest and ztest?

I keep getting this when running ztest:
ztest: ../../cmd/ztest/ztest.c:1310: ztest_bt_verify: Assertion `bt->bt_object == object' failed.

@DeHackEd (Contributor) commented Apr 4, 2014

That assertion is a known issue ( #1964 )

lt-ztest means that the built binary was re-linked to support running from the build directory instead of being installed into /usr/local/sbin or whatever prefix you used.

@behlendorf (Contributor)

@tuxoko There are a few known ztest failures which need to be resolved. You hit one of them, it would be great if someone had the time to run these to ground.

@tuxoko (Contributor, Author) commented Apr 11, 2014

Hi @behlendorf
I'm currently working on moving the linear type ABD into struct abd to eliminate the pointer bit trick.
However, I'm consistently hitting a dangling (freed) pointer bug.

dbuf_evict_user(dmu_buf_impl_t *db)
{
...
        if (db->db_user_data_ptr_ptr)
                *db->db_user_data_ptr_ptr = ABD_BUF(db->db.db_data);
...

Here db->db.db_data has already been freed.
I'm not sure if this is a problem for the original code, since it only does a pointer assignment and never dereferences it.
However, it will be a headache for me, because if we move the linear buffer into struct abd, then ABD_BUF must dereference the structure to retrieve the buffer.

Could you shed some light on this matter?
Thanks

@behlendorf (Contributor)

@tuxoko To my knowledge there aren't any outstanding issues regarding this. Although, very occasionally, I have seen people report non-reproducible crashes in the user evict code, so there may be something subtle lurking.

@tuxoko (Contributor, Author) commented Apr 24, 2014

Rebase to master and add two patch for missing things

There're still a lot of checksum error when running ztest, and all of them are data blocks.
I still haven't figure out why.

@DeHackEd (Contributor)

I've seen that, and have been trying to diagnose them myself. (This is the sort of reason why I wanted ztest to run in the first place).

Thanks for the rebase. I'm going to be giving it a try as well.

@tuxoko (Contributor, Author) commented Apr 25, 2014

So the original sha256 implementation cannot be used as-is for scatter abd.
This should work now.
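The shape of the fix: a one-shot checksum routine assumes one contiguous buffer, while a scatter ABD has to be walked segment by segment through an incremental update function. A toy sketch of that restructuring (using a trivial stand-in checksum, not the actual SHA256 code from the patch):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Trivial incremental checksum standing in for SHA256_Update();
 * what matters here is the calling pattern, not the hash itself.
 */
static void
toy_update(uint64_t *state, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		*state = *state * 31 + buf[i];
}

/*
 * Scatter-aware variant: feed each segment (as if it were a
 * kmap_atomic()'d page) into the same running state. Splitting the
 * input must not change the result, which only holds for checksums
 * with a proper incremental-update form.
 */
static uint64_t
toy_checksum_scatter(uint8_t *segs[], size_t nsegs, size_t seg_size,
    size_t total)
{
	uint64_t state = 0;

	for (size_t i = 0; i < nsegs && total > 0; i++) {
		size_t n = total < seg_size ? total : seg_size;

		toy_update(&state, segs[i], n);
		total -= n;
	}
	return (state);
}
```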

@DeHackEd (Contributor)

Nice find. Glad I pressed for ztest support. SHA256 problems would probably bite anyone using dedup.

I've pulled a copy and am running ztest on it extensively right now. Will report back in a few hours if anything turns up.

@DeHackEd (Contributor)

Though I got side-tracked quite a bit yesterday I want to say that testing results have been largely positive and I intend to continue using this patch on my own systems.

@behlendorf (Contributor)

@tuxoko the large block patches have been merged to master. If you could refresh this patch stack to apply cleanly, and handle the larger block size, we can focus on getting this in next. I'd really like to push to get this in the next tag.

@tuxoko (Contributor, Author) commented May 12, 2015

@behlendorf
I'm working on it. Also, I'm doing a full review, which I haven't done since the first version of ABD, so it will take some time. But I should be able to finish it in a week or two.

@don-brady (Contributor)

@tuxoko
I've just started to review this (huge thanks for taking on the work). Hope to test as well soon. Some initial comments:

  1. In _abd_alloc_scatter(), when checking for the minimum threshold, can we check instead for less than PAGE_SIZE*2 to also avoid the 5120, 6144 and 7168 bufsize cases?
  2. The abd_borrow_buf() macro makes it difficult to uniquely trace that function since it turns into a generic zio_buf_free() call. Could this be made into an inline function instead? (Even inline functions can be traced.) Same for abd_return_buf().

@tuxoko (Contributor, Author) commented May 13, 2015

@don-brady

  1. There's nothing preventing us from doing that. I chose PAGE_SIZE because it would limit the wasted memory to less than 100%, like kmalloc. But if no one objects, I will change it to 2*PAGE_SIZE.
  2. I was planning to change it anyway, because it uses a non-standard statement expression.
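The "less than 100% waste" argument can be spelled out: a scatter allocation rounds the request up to whole pages, so at most PAGE_SIZE - 1 bytes are wasted, which is under 100% of the request for any request of at least one page. A quick userspace check (assuming 4K pages purely for illustration, which, as noted below, is not safe to assume in the real code):

```c
#include <stddef.h>

#define	TOY_PAGE_SIZE	4096UL

/*
 * Bytes actually consumed when a request is rounded up to whole
 * pages, as a scatter allocation would do.
 */
static size_t
pages_alloc(size_t req)
{
	return (((req + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE) *
	    TOY_PAGE_SIZE);
}
```

For the sizes under discussion: 5120, 6144, and 7168 all round up to two pages (8192 bytes), wasting 3072, 2048, and 1024 bytes respectively; raising the scatter threshold to 2*PAGE_SIZE would send those requests to linear allocations instead.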

@behlendorf (Contributor)

@tuxoko @don-brady

  1. We need to be careful to avoid making any assumptions about PAGE_SIZE. We can't count on it being 4k since we want this code to work on aarch64 and ppc architectures where alternate page sizes are entirely valid. I'd suggest that if there's some doubt about the right cutoff for this optimization, we make it a module option for now until the best value can be settled on.

@don-brady (Contributor)

@tuxoko @behlendorf

Overall it seems to be working fine alongside the large block changes (local manual merge).

Side note: when observing _abd_alloc_scatter() calls while using large record sizes (16M), in addition to the expected large allocations, I see an unexpected number of 12K and 16K scatter allocations coming through. These appear to be mostly coming from vdev I/O aggregation of 4K metadata (note my vdev ashift is 12). Just curious whether we want/need these transient metadata aggregation buffers to be scatter? I don't think it's a problem, I just wasn't expecting to see them.

@behlendorf (Contributor)

@don-brady I think this is best. If we're going to allow the aggregation then my gut feeling is that it makes the most sense to use scatter lists for this. Allocating pages and mapping them atomically to be zeroed/copied (as is done) should be very efficient and avoids any potential lock contention in a shared allocator like the slab. We can always revisit this if that turns out not to be the case!

As another related aside, it's never been 100% clear that the existing vdev queue aggregation under Linux is always a win. In many cases this might be wasted work which the block device elevator would have efficiently taken care of. Although there are certainly exceptions, like the gap spanning, which may make the whole thing worthwhile. But that's neither here nor there for this issue!

@tuxoko (Contributor, Author) commented May 14, 2015

@don-brady
If you're using the newest version, what you see could be the indirect blocks. In my tests, I see a lot of them being 16K, and I added support for using scatter buffers for them.

@behlendorf
Yes, I've always wondered why we would need aggregation when we already have the Linux elevator.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Merge upto 5dc8b73
Illumos 5765 - add support for estimating send stream size with lzc_s...

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>

Conflicts:
	module/zfs/arc.c
	module/zfs/dmu.c
	module/zfs/dmu_send.c
	module/zfs/vdev_disk.c
	module/zfs/zap_micro.c
	module/zfs/zil.c
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
@sempervictus (Contributor)

@tuxoko: it appears that this stack is somehow affecting space calculation for L2ARC devices. Could you take a look at #3400 and weigh in? I'm now seeing the problem off of abd_next without any of the alterations brought by #3115 or additional changes from @kernelOfTruth's work.

Also use the same optimization of abd_buf_segment in abd_get_offset.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
In dsl_scan_scrub_cb and spa_load_verify_cb, we originally always allocated
linear ABD. Now we try to allocate scatter ABD according to the BUFC type of
the blkptr to reduce unnecessary spl slab allocation.

Also in zio_ddt_read_start, we match the parent zio->io_data ABD type for the
same reason.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
@tuxoko (Contributor, Author) commented May 19, 2015

Hi guys, I've just updated this pull request.
It adds support for large blocks by using chained scatterlists, plus a bunch of other misc stuff. The commit history might look a bit messy, because I wanted to retain what I added relative to the old version in case @ahrens wants to know about it.

I'll submit a new, rebased pull request after I see some results from the buildbot.

@behlendorf (Contributor)

@tuxoko nice work. Hopefully later this week I'll have some time to start on a careful review. In the meanwhile I'll instruct the buildbot to go through and test all the patches in this stack.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request May 21, 2015
adaption to openzfs#2129 in
@ l2arc_compress_buf(l2arc_buf_hdr_t *l2hdr)

 		/*
 		 * Compression succeeded, we'll keep the cdata around for
 		 * writing and release it afterwards.
 		 */
+		if (rounded > csize) {
+			bzero((char *)cdata + csize, rounded - csize);
+			csize = rounded;
+		}

to

		/*
		 * Compression succeeded, we'll keep the cdata around for
		 * writing and release it afterwards.
		 */
		if (rounded > csize) {
			abd_zero_off(cdata, rounded - csize, csize);
			csize = rounded;
		}
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request May 21, 2015
adapted to openzfs#3216,

ZFSonLinux:
openzfs#3114
openzfs#3400
openzfs#3433
@DeHackEd (Contributor)

From the current head I ended up needing this in order to build it properly:
DeHackEd/zfs@092bab5

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
@behlendorf (Contributor)

Closing, refreshed as pull request #3441. Let's move additional review and comments to the updated version.

@behlendorf behlendorf closed this May 28, 2015
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this pull request Jun 8, 2015
…to be written to l2arc device

If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual increment
to l2ad_hand was introduced in
commit e14bb3258d05c1b1077e2db7cf77088924e56919

Also, consistently use asize / a_sz for the allocated size and psize / p_sz
for the physical size, where the latter accounts for possible size
reduction because of compression and the former accounts for possible
size expansion because of alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical I/O
has a suitable size.

modified to account for the changes of

openzfs#2129 (ABD) ,

openzfs#3115 (Illumos - 5408 managing ZFS cache devices requires lots of RAM)

and

Illumos - 5701 zpool list reports incorrect "alloc" value for cache devices
Labels: Type: Feature (feature request or new feature)