Sequential Scrubs and Resilvers #6256
Conversation
I had done this port and was holding off on making a PR from it because @skiselkov indicated that sweeping changes to the range tree structure were still planned. If that's no longer true, then it's probably time to get this merged. :) In my testing (which is what prompted #6209), I found it much better to raise scn_ddt_class_max to DDT_CLASS_UNIQUE and to disable prefetching entirely. You'll note that ddt_class_contains, in module/zfs/ddt.c, is particularly simple when we're scanning the entire DDT, avoiding any need to pull DDT ZAPs into the ARC.
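For reference, here is roughly the shape of that check, paraphrased from memory of module/zfs/ddt.c rather than copied verbatim; when the scan covers every class, the function can answer without touching any DDT object:

boolean_t
ddt_class_contains(spa_t *spa, enum ddt_class max_class, const blkptr_t *bp)
{
	ddt_t *ddt;
	ddt_entry_t dde;

	if (!BP_GET_DEDUP(bp))
		return (B_FALSE);

	/*
	 * Scanning the whole DDT: every dedup block is trivially covered,
	 * so no DDT ZAP ever needs to be pulled into the ARC.
	 */
	if (max_class == DDT_CLASS_UNIQUE)
		return (B_TRUE);

	/*
	 * Otherwise the key must be looked up in each on-disk DDT object
	 * of class <= max_class, which does touch the DDT ZAPs.
	 */
	ddt = ddt_select(spa, bp);
	ddt_key_fill(&dde.dde_key, bp);

	for (enum ddt_type type = 0; type < DDT_TYPES; type++) {
		for (enum ddt_class class = 0; class <= max_class; class++) {
			if (ddt_object_lookup(ddt, type, class, &dde) == 0)
				return (B_TRUE);
		}
	}

	return (B_FALSE);
}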
module/zfs/dsl_scan.c
Outdated
/*
 * Given a set of I/O parameters as discovered by the metadata traversal
 * process, attempts to place the I/O into the sorted queues (if allowed),
 * or immediately executes the I/O. The dummy flag can be set to
Set to... what?
Sorry. Rebase bug...
/*
* Given a set of I/O parameters as discovered by the metadata traversal
* process, attempts to place the I/O into the sorted queues (if allowed),
* or immediately executes the I/O. The dummy flag can be set to indicate
* this IO has already been done and a placeholder should be used instead.
*/
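To illustrate what that comment describes, here is a minimal sketch of the dispatch; the helper names and the scn_is_sorted field are placeholders inferred from context, not the PR's exact code:

static void
scan_dispatch_io(dsl_scan_t *scn, const blkptr_t *bp,
    const zbookmark_phys_t *zb, boolean_t dummy)
{
	if (dummy) {
		/*
		 * The I/O has already been done elsewhere; only record a
		 * placeholder so queue accounting and progress stay right.
		 */
		scan_record_placeholder(scn, bp, zb);
	} else if (scn->scn_is_sorted) {
		/*
		 * Sorted scan: defer the read by inserting it into the
		 * per-vdev queue, keyed by on-disk offset.
		 */
		scan_enqueue_io(scn, bp, zb);
	} else {
		/* Legacy behavior: issue the scrub read immediately. */
		scan_issue_io(scn, bp, zb);
	}
}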
module/zfs/dsl_scan.c
Outdated
@@ -1104,7 +1212,7 @@ dsl_scan_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
 	SET_BOOKMARK(&zb, zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET],
 	    ZB_ZIL_OBJECT, ZB_ZIL_LEVEL, bp->blk_cksum.zc_word[ZIL_ZC_SEQ]);

-	VERIFY(0 == scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb));
+	scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb, B_FALSE);
Why not VERIFY?
There is currently only one scan_func_t: dsl_scan_scrub_cb(), which has always (as far as I can tell) returned zero. The only info I could find about its purpose is a comment indicating that returning zero means /* do not relocate this block */, which doesn't seem to make sense anymore. Many of the callers didn't check the return code anyway, and the only ones that did simply did a VERIFY(0 == ...) as you see here.

As part of the cleanup I mentioned, I changed the function to return void.
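For context, the shape of that cleanup might look something like the following; the signature is inferred from the call site above, not copied from the PR:

typedef void scan_func_t(dsl_pool_t *dp, const blkptr_t *bp,
    const zbookmark_phys_t *zb, boolean_t dummy);

so callers simply invoke scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb, B_FALSE) with no return value left to VERIFY.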
Hah. "did not relocate this block" sounds like some of bprewrite slipping in. OK.
Force-pushed from 6f49f97 to 59a74df.
Just did an initial test on our 80-drive HDD JBOD. The test pool was set up as 8x 10-disk raidz2 vdevs. The dataset was various copies of /usr, /boot, /bin, /opt, and /lib.

Before:

After:

So it looks like a 3.25x speed-up. Very nice!
@kpande I'll fix the wording soon. It's trying to say too many things using the old text ATM. Right now I just wanted to see how happy buildbot is with it.
Well, I can say that it took the scrub time on one of our systems down from 20 hours 25 minutes to 8 hours 41 minutes. Two other systems had their scrub times remain roughly the same, but they hadn't been scanned in a while and their space usage had significantly increased since the last scrub, so the rates at which they were scrubbed increased.

I know this doesn't have anything to do with this patch, but it would be nice if the amount of data that was scrubbed was reported when running zpool status, in addition to just how much was repaired.
@angstymeat I'm not 100% sure what you mean by this. During the scrub,
I thought that was a pretty good speed increase. We handle real-time seismic data from around the world. This particular system receives data from a number of reference stations and devices that are being tested. We can get thousands of files a day, between 56k and 8m each, from each station, and we have data going back 10 years. We don't need speed for the most part, and the average daily data rate isn't high, but we need to hold onto the data for a long time (currently around 10 years), and the data has to be kept safe. The system is built more for redundancy and safety than it is for speed, including multiple backups to on-site and off-site storage each night.

The system itself is running under VMware vSphere 5.0, and its VM has 6 CPUs and 24GB of RAM. The pool is set up as follows, with ashift=12:
and
The drives are some older 2TB Seagate enterprise drives running off a Sun X4500 that is configured with OpenIndiana's comstar software to provide individual disks remotely as block storage over a 4Gb fibre channel connection. The connection itself is aggregated over 4 links across two cards for increased performance and redundancy. The VM is mounting the disks as raw devices (no VMware filesystem is involved).

We run another 50 virtual machines across 2 servers, each accessing disks from 1 of 3 sets of JBODs. Some are 4Gbps and others are 8Gbps. It's not blazing fast, but they're not terribly slow either. Tests a few years ago showed that I could sustain 6Gb/s read speeds off a ZFS pool if I set it up for it.

The disks are under a constant write load. Nothing too high, maybe a few megabytes a second, but the system is also used to examine the data files, which can cause a lot of them to be loaded at once, so it can have large bursts of reads. I'm using LZ4, which gives us an overall 1.39 compression ratio. There are 4 file systems:
I'm using zfs-auto-snapshot to make snapshots. I keep 36 hourly snapshots, 14 daily, 10 weekly, and 6 monthly. There are currently 264 snapshots.

We have a couple of systems set up this way. This one is used for reference and testing. The others collect more data from more stations and perform processing before the data is sent off to be archived. This system can be taken down for updates, testing, or maintenance more easily than the others.

As for this:
It will show you how much is being scrubbed while it is running:
but once it is completed you only see how much was repaired.
It has nothing to do with your patch. I should probably just open a feature request someday.
I just pushed a commit for buildbot to test. I would not try this commit out, since it is very WIP. I will clean everything up and rebase once it looks like everything is working.
module/zfs/dsl_scan.c
Outdated
@@ -3449,8 +3446,7 @@ scan_io_queue_vacate_cb(range_tree_t *rt, void *arg)
 {
 	dsl_scan_io_queue_t *queue = arg;
 	void *cookie = NULL;
-	while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL)
-		;
+	while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL);
Oof. Is that really the preferred style?
It's my preferred style, but I'm not looking at style right now :). I will rebase everything and clean it up once I'm sure the new range tree isn't causing any more problems.
Force-pushed from 0789f28 to 248573a.
Just to make it clear before I get into the error I just found: I'm using the version of this patch from 3 days ago (before e121c36, the one before you posted that you were making some updates to it and not to use that version).

zpool scrub is reporting errors in files when running with this patch. I've tried it twice now, and on one of our machines I get errors on both drives in the same mirror. I don't get these errors, and the scrub completes without error, when I use the current git master.
@angstymeat Thanks for bringing this up. The bug I am trying to nail down currently may be contributing to this (although it sounds unlikely from your description). Can you give me any more info about this? What kind of workload is running during the scrub? What kind of data is in the pool? If you could find a consistent way to reproduce it that would be wonderful (although that might be asking for a bit much).
It's the same kind of data that's in the system I was previously testing on: seismic waveform data, on a filesystem using lz4 compression. The data doesn't really compress much, but a lot of the output we generate from it does compress. It's under a constant write load, but not much; iostat is showing around 450k/s being written.

I'll say that this pool has a bit of a history; it was one of the first I used ZFS on back under Solaris 10 when it first came out, and was migrated to ZoL once we dropped Sun hardware. It's been running under ZoL since the first month it was able to mount file systems, and has been updated as ZoL has updated. The pool has been recreated several times, but the data was transferred via zfs send/receive since the Solaris days. One thing is that this pool has always been slow. Even now it takes 31+ hours to scrub, even though it is running the same kinds of disks from the same storage servers the previous system is using. The configuration is the same, other than it uses 5 sets of mirrors instead of 4, and the scrub time is almost 50% longer.

It's happening every time I scan with patch #6256, and it is the same set of files each time. I haven't let it scrub through to completion, though. The errors go away when I scrub it with the current master. I'm not sure what else I can provide. I can give you some zdb output, if you can think of anything that would be helpful. I'm currently running master, but I can switch back to the patched version if you need me to.

I've been testing this on 5 systems (2 personal), and this is the only one I've seen a scrub issue on.
include/sys/range_tree.h
Outdated
	range_tree_ops_t *rt_ops;

	/* rt_avl_compare should only be set it rt_arg is an AVL tree */
it -> if?
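For context, a rough sketch of the relationship that (corrected) comment documents; the field layout here is illustrative, not the exact header:

typedef struct range_tree {
	avl_tree_t	rt_root;	/* segments, indexed by offset */
	uint64_t	rt_space;	/* total bytes tracked by the tree */
	range_tree_ops_t *rt_ops;	/* callbacks invoked on add/remove */
	void		*rt_arg;	/* opaque argument passed to rt_ops */
	/* rt_avl_compare should only be set if rt_arg is an AVL tree */
	int		(*rt_avl_compare)(const void *, const void *);
} range_tree_t;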
man/man5/zfs-module-parameters.5
Outdated
.ad
.RS 12n
Minimum time "long-lived" blocks are locked in the ARC, specified in jiffies.
A value of 0 will default to 6 seconds.
can we specify this in milliseconds instead, please?
Brian and I talked about this. I think we decided it would be best to make a pass through the code and clean up all tunables that use jiffies in a later PR. This one simply works the same as zfs_arc_min_prefetch_lifespan for the moment.
@tcaputi you've already fixed this in the code, but zfs-module-parameters.5 wasn't updated. There are two tunables: arc_min_prefetch_ms, which was previously called zfs_arc_min_prefetch_lifespan, and arc_min_prescient_prefetch_ms, which is what the "long-lived" tunable was renamed to.
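For anyone following along, a hedged illustration of what the millisecond form looks like in use; the MSEC_TO_TICK() conversion is assumed from the SPL, and this is not the PR's exact code:

/* Illustrative only: 0 means "use the built-in 6 second default". */
static int arc_min_prescient_prefetch_ms = 0;

static clock_t
arc_prescient_prefetch_lifespan(void)
{
	uint64_t ms = arc_min_prescient_prefetch_ms;

	if (ms == 0)
		ms = 6000;	/* fall back to 6 seconds, per the man page */

	return (MSEC_TO_TICK(ms));	/* milliseconds -> clock ticks */
}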
man/man5/zfs-module-parameters.5
Outdated
.RS 12n
Strategy used to determine how data verification should be done during scrubs.
If set to 1, I/Os will always be issued sequentially. This may help performance
if the pool's data is very disperate. If set to 2, I/Os will be always be issued
It would be nice if we never needed to change the default behavior. But assuming that we need to document this, can I suggest an alternate wording:
Determines the order that data will be verified while scrubbing or resilvering. If set to 1, data will be verified as sequentially as possible, given the amount of memory reserved for scrubbing (see zfs_scan_mem_lim_fact). This may improve scrub performance if the pool's data is very fragmented. If set to 2, the largest mostly-contiguous chunk of found data will be verified first. By deferring scrubbing of small segments, we may later find adjacent data to coalesce and increase the segment size.
I will take your wording for now. I found this tunable very useful for playing with performance on different pools, so I'd feel bad removing it.
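To make the two strategies concrete, here is an illustrative sketch of how extent selection might differ; the helper names are placeholders, not the PR's code:

static range_seg_t *
scan_pick_next_extent(dsl_scan_io_queue_t *queue, int issue_strategy)
{
	if (issue_strategy == 1) {
		/*
		 * Strategy 1: take the lowest-offset extent, so reads are
		 * issued as sequentially as the memory limit allows.
		 */
		return (scan_lowest_offset_extent(queue));
	}
	/*
	 * Strategy 2: take the largest mostly-contiguous extent first,
	 * deferring small segments so they can coalesce with neighbors.
	 */
	return (scan_largest_extent(queue));
}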
Testing results from a real-world worst-case pool containing ~97G of small files. TLDR: performance can be improved 3x by changing the default values for two tunables (detailed below).

The test pool is a copy of a 3-month-old filesystem containing a few million small files. The workload for those 3 months involved frequently creating and unlinking small files, resulting in a fragmented (37%) but relatively empty (6% full) pool.
The test case was to measure the efficiency of resilvering both mirrors and verify that sequential resilver does behave as designed. This included measuring the total resilver time and checking for the expected IO pattern.
Surprisingly, when comparing this PR (5d6e310) with master, resilver time increased by 14% as shown in the first two columns below. Not good, so what's going on? Profiling with ...

Additional performance gains are possible by increasing ...

According to my testing the sweet spot is to update the default values for these two tunables as follows.
The above observations also suggest a potential future optimization. Since we know scrub I/Os will be largely sequential, it would be reasonable to submit them from the start as delegated I/Os. This would remove the need for the low-level aggregation logic to detect that they're sequential and perform the aggregation. This might further improve performance, but obviously we'd want detailed performance data to confirm it actually helps.
Force-pushed from aab6610 to fe52369.
@behlendorf Haha, it seems that we've made scans so fast with your tunables that we've broken 3 of the tests (...).
Just to throw in some more performance numbers... My home system has a 3-disk raidz1 array that is 3TB in size. It started out at over 11 hours to scrub before this patch (#6256). With the patch applied back in August, it took around 4 hours to scrub with roughly 1.1TB of free space left. After the latest patch I applied yesterday, it took 5 hours and 2 minutes, with 478GB free (it took longer, but it was scrubbing more data than in August). After I adjusted the tunables above, my scrub time on the same amount of data dropped to 3.5 hours.
man/man5/zfs-module-parameters.5
Outdated
@@ -1780,18 +1798,81 @@ Default value: \fB75\fR.
 .sp
 .ne 2
 .na
-\fBzfs_top_maxinflight\fR (int)
+\fBzfs_scan_top_maxinflight\fR (int)
Changed to zfs_scan_top_bytes, needs to be updated. Testing shows that we should default to 4M.
man/man5/zfs-module-parameters.5
Outdated
 .RE

 .sp
 .ne 2
 .na
-\fBzfs_scan_idle\fR (int)
+\fBzfs_scan_mem_lim_fact\fR (int)
Can you also make sure all the zfs_scan_* entries are listed together in alphabetical order.
man/man5/zfs-module-parameters.5
Outdated
To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verifications I/Os to disk.
The frequency of this flushing is determined by the
\fBfBzfs_scan_checkpoint_intval\fR tunable.
Extra fB here needs to be removed.
man/man5/zfs-module-parameters.5
Outdated
a non-scrub or non-resilver I/O operation has occurred within the past
\fBzfs_scan_idle\fR ticks.
To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verifications I/Os to disk.
s/verifications/verification
Force-pushed from d735014 to 756f774.
Currently, scrubs and resilvers can take an extremely long time to complete. This is largely due to the fact that zfs scans process pools in logical order, as determined by each block's bookmark. This makes sense from a simplicity perspective, but blocks in zfs are often scattered randomly across disks, particularly due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs and resilvers into a metadata scanning phase and an IO issuing phase. The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.

This patch also updates and cleans up some of the scan code which has not been updated in several years.

Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Force-pushed from 756f774 to 1c6275b.
@behlendorf is this scheduled for 0.8?
@angstymeat when you say 'tunables above', do you refer to what Brian said earlier? "Testing results from a real world worst case pool containing ~97G of small files. TLDR, performance can be improved 3x by changing the default values for ..."
@pruiz yes.
Motivation and Context
Currently, scrubs and resilvers can take an extremely long time to complete. This is largely due to the fact that zfs scans process pools in logical order, as determined by each block's bookmark. This makes sense from a simplicity perspective, but blocks in zfs are often scattered randomly across disks, particularly due to zfs's copy-on-write mechanisms.
At the 2016 ZFS developer summit @skiselkov presented some experimental work he had done to greatly improve performance by breaking scrubs into 2 parts: metadata scanning and block scrubbing. The metadata scanning would sort blocks into sequential chunks which could be issued to disk much more efficiently by the scrubbing code. This patch is mostly his (WIP) work, ported to Linux and with a few small improvements.
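In rough pseudocode (the function names here are illustrative placeholders, not the PR's actual entry points), the sync-time flow becomes:

static void
scan_sync_sketch(dsl_scan_t *scn, dmu_tx_t *tx)
{
	/*
	 * Phase 1: metadata scanning. Walk the pool in logical (bookmark)
	 * order, but rather than reading data blocks immediately, insert
	 * their block pointers into per-vdev queues sorted by offset.
	 */
	while (!scan_metadata_complete(scn) && !scan_memory_limit_reached(scn))
		scan_visit_metadata(scn, tx);

	/*
	 * Phase 2: I/O issuing. Drain the sorted queues, reading extents
	 * in near-LBA order so disks see large, mostly sequential reads.
	 */
	scan_issue_queued_io(scn);
}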
This PR is currently being made so that we can use buildbot to detect bugs.
Description
This patch essentially consists of 4 pieces (at the moment).
How Has This Been Tested?
Initial performance tests show scrubs taking 5-6x less time with this patch on a 47TB pool consisting of data mimicking our production environment. On zfs version 0.6.8, the scrub took a little over 125 hours, while it took only 22.5 hours with this patch. @skiselkov's presentation at the developer summit cited a 16x performance improvement for a worst-case scenario.
As this patch solidifies we will add more formalized automated tests.
Types of changes
Checklist:
Signed-off-by.

Special thanks to @alek-p and @skiselkov for doing most of the work.