
Sequential Scrubs and Resilvers #6256

Merged (1 commit) on Nov 16, 2017

Conversation

@tcaputi (Contributor) commented Jun 21, 2017

Motivation and Context

Currently, scrubs and resilvers can take an extremely long time to complete. This is largely due to the fact that zfs scans process pools in logical order, as determined by each block's bookmark. This makes sense from a simplicity perspective, but blocks in zfs are often scattered randomly across disks, particularly due to zfs's copy-on-write mechanisms.

At the 2016 ZFS developer summit @skiselkov presented some experimental work he had done to greatly improve performance by breaking scrubs into 2 parts: metadata scanning and block scrubbing. The metadata scanning would sort blocks into sequential chunks which could be issued to disk much more efficiently by the scrubbing code. This patch is mostly his (WIP) work, ported to Linux and with a few small improvements.
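
To illustrate the core idea, here is a toy sketch only (not the PR's dsl_scan.c code; all names below are made up): once the metadata scan has queued the discovered block pointers, sorting that queue by on-disk offset is what turns a random read pattern into a sequential one before the I/Os are issued.

/* Toy illustration: sort queued scan I/Os by offset before issuing them. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
	unsigned long long offset;	/* on-disk offset */
	unsigned long long size;
} scan_io;

static int
by_offset(const void *a, const void *b)
{
	const scan_io *x = a, *y = b;
	return ((x->offset > y->offset) - (x->offset < y->offset));
}

int
main(void)
{
	/* Phase 1 (metadata scan) fills the queue in logical/bookmark order. */
	scan_io q[] = { { 900, 128 }, { 10, 128 }, { 400, 128 }, { 20, 128 } };
	size_t n = sizeof (q) / sizeof (q[0]);

	/* Phase 2 issues the queued I/Os in on-disk order instead. */
	qsort(q, n, sizeof (q[0]), by_offset);
	for (size_t i = 0; i < n; i++)
		printf("issue offset=%llu size=%llu\n", q[i].offset, q[i].size);
	return (0);
}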

This PR is currently being made so that we can use buildbot to detect bugs.

Description

This patch essentially consists of 4 pieces (at the moment).

  1. @skiselkov 's sequential scrubbing algorithm as described here and here
  2. improvements to the scrubbing algorithm so that its effects on txg_sync time are more consistent.
  3. performance improvements to the existing prefetching algorithm
  4. some code reorganization (since much of the existing scrub code is very old at this point)

How Has This Been Tested?

Initial performance tests show scrubs completing 5-6x faster with this patch on a 47TB pool consisting of data mimicking our production environment. On zfs version 0.6.8, the scrub took a little over 125 hours, while it took only 22.5 hours with this patch. @skiselkov 's presentation at the developer summit cited a 16x performance improvement for a worst-case scenario.

As this patch solidifies, we will add more formal automated tests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

Special thanks to @alek-p and @skiselkov for doing most of the work.

@nwf (Contributor) commented Jun 21, 2017

I had done this port and was holding off on making a PR from it because @skiselkov indicated that sweeping changes to the rangetree structure were still planned. If that's no longer true, then it's probably time to get this merged. :)

In my testing (which is what prompted #6209), I found it much better to raise scn_ddt_class_max to DDT_CLASS_UNIQUE and to disable prefetching entirely. You'll note that ddt_class_contains, in module/zfs/ddt.c, is particularly simple when we're scanning the entire DDT, avoiding any need to pull DDT ZAPs into the ARC.
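
As a rough sketch of that point (illustrative only; the real ddt_class_contains() in module/zfs/ddt.c is different code, and the names below are stand-ins): when the scan's maximum DDT class already covers every class, the containment check can answer immediately, with no on-disk DDT ZAP lookup.

/* Toy model of the "scan the whole DDT" shortcut; not ZFS code. */
#include <stdbool.h>
#include <stdio.h>

enum ddt_class { DDT_CLASS_DITTO, DDT_CLASS_DUPLICATE, DDT_CLASS_UNIQUE };

static bool
scan_covers_entry(enum ddt_class scan_class_max, enum ddt_class entry_class)
{
	/* Scanning up to UNIQUE means every deduped block is covered. */
	if (scan_class_max == DDT_CLASS_UNIQUE)
		return (true);
	/* Otherwise a real implementation must look the entry up on disk. */
	return (entry_class <= scan_class_max);
}

int
main(void)
{
	printf("%d\n", scan_covers_entry(DDT_CLASS_UNIQUE, DDT_CLASS_DITTO));
	printf("%d\n", scan_covers_entry(DDT_CLASS_DUPLICATE, DDT_CLASS_UNIQUE));
	return (0);
}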

/*
* Given a set of I/O parameters as discovered by the metadata traversal
* process, attempts to place the I/O into the sorted queues (if allowed),
* or immediately executes the I/O. The dummy flag can be set to

Contributor:

Set to... what?

Contributor Author (@tcaputi):

Sorry. Rebase bug...

/*
 * Given a set of I/O parameters as discovered by the metadata traversal
 * process, attempts to place the I/O into the sorted queues (if allowed),
 * or immediately executes the I/O. The dummy flag can be set to indicate
 * this IO has already been done and a placeholder should be used instead.
 */

@@ -1104,7 +1212,7 @@ dsl_scan_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
SET_BOOKMARK(&zb, zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET],
ZB_ZIL_OBJECT, ZB_ZIL_LEVEL, bp->blk_cksum.zc_word[ZIL_ZC_SEQ]);

VERIFY(0 == scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb));
scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb, B_FALSE);

Contributor:

Why not VERIFY?

Contributor Author (@tcaputi):

There is currently only one scan_func_t: dsl_scan_scrub_cb(), which has always (as far as I can tell) returned zero. The only info I could find about its purpose is a comment indicating that returning zero means /* do not relocate this block */, which doesn't seem to make sense anymore. Many of the callers didn't check the return code anyway, and the only ones that did simply did a VERIFY(0 == ...) as you see here.

As part of the cleanup I mentioned, I changed the function to return void.

Contributor:

Hah. "did not relocate this block" sounds like some of bprewrite slipping in. OK.

@tcaputi force-pushed the better_scrub branch 3 times, most recently from 6f49f97 to 59a74df on June 22, 2017 at 21:07
@tonyhutter (Contributor) commented:

Just did an initial test on our 80-drive HDD JBOD. The test pool was set up as an 8x 10-disk raidz2:

	NAME        STATE     READ WRITE CKSUM
	mypool      ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    U0      ONLINE       0     0     0
	    U1      ONLINE       0     0     0
	    U14     ONLINE       0     0     0
	    U15     ONLINE       0     0     0
	    U28     ONLINE       0     0     0
	    U29     ONLINE       0     0     0
	    U42     ONLINE       0     0     0
	    U43     ONLINE       0     0     0
	    U56     ONLINE       0     0     0
	    U70     ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
...
	  raidz2-7  ONLINE       0     0     0
...

$ zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mypool   580T   439G   580T         -     0%     0%  1.00x  ONLINE  -

The dataset was various copies of /usr, /boot, /bin, /opt, and /lib.

Before

  pool: mypool
 state: ONLINE
  scan: scrub in progress since Thu Jun 22 17:17:26 2017
	437G scanned out of 439G at 559M/s, 0h0m to go
	0B repaired, 99.56% done
config:

...

  scan: scrub repaired 0B in 0h13m with 0 errors on Thu Jun 22 17:30:49 2017

After

  pool: mypool
 state: ONLINE
  scan: scrub in progress since Thu Jun 22 17:33:55 2017
    439G scanned at 1.60G/s, 438G verified out of 439G at 1.60G/s, 99.85% done
    0B repaired, 0h0m to go
config:

...

  scan: scrub repaired 0B in 0h4m with 0 errors on Thu Jun 22 17:38:30 2017

So it looks like a 3.25x speed-up. Very nice!

@tcaputi (Contributor, Author) commented Jun 23, 2017

@kpande I'll fix the wording soon. It's trying to say too many things using the old text ATM. Right now I just wanted to see how happy buildbot is with it.

@angstymeat commented:

Well, I can say that it took the scrub time on one of our systems down from 20 hours 25 minutes to 8 hours 41 minutes.

Two other systems had their scrub times remain roughly the same, but they hadn't been scanned in a while and their space usage had significantly increased since the last scrub, so the rates at which they were scrubbed increased.

I know this doesn't have anything to do with this patch, but it would be nice if the amount of data that was scrubbed was reported when running zpool status, in addition to just how much was repaired.

@tcaputi (Contributor, Author) commented Jun 27, 2017

@angstymeat
Can I ask what kind of data you have in your pool? (number of snapshots / datasets, size of files, recordsize, etc.). I'm kind of surprised that you and @tonyhutter aren't seeing more dramatic speed increases.

I know this doesn't have anything to do with this patch, but it would be nice if the amount of data that was scrubbed was reported when running zpool status, in addition to just how much was repaired.

I'm not 100% sure what you mean by this. During the scrub, zpool status does report how much data has currently been scanned and verified. Afterwards, the amount scanned should just be the amount of data in your pool.

@angstymeat commented:

I thought that was a pretty good speed increase.

We handle real-time seismic data from around the world. This particular system receives data from a number of reference stations and devices that are being tested. We can get thousands of files a day, between 56k and 8m each, from each station, and we have data going back 10 years.

We don't need speed for the most part, and the average daily data rate isn't high, but we need to hold onto the data for a long time (currently around 10 years), and the data has to be kept safe. The system is built more for redundancy and safety than for speed, including multiple backups to on-site and off-site storage each night.

The system itself is running under VMWare vSphere 5.0, and its vm has 6 CPUs with 24GB of RAM.

The pool is set up as follows, with ashift=12:

  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 8h41m with 0 errors on Tue Jun 27 00:22:11 2017
config:
	NAME                                        STATE     READ WRITE CKSUM
	storage                                     ONLINE       0     0     0
	  mirror-0                                  ONLINE       0     0     0
	    scsi-3600144f0e632cf000000519eb372002b  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051a5380b0007  ONLINE       0     0     0
	  mirror-1                                  ONLINE       0     0     0
	    scsi-3600144f0e632cf000000519eb3fc002c  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051a538100008  ONLINE       0     0     0
	  mirror-2                                  ONLINE       0     0     0
	    scsi-3600144f0e632cf000000519eb400002d  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051a538140009  ONLINE       0     0     0

errors: No known data errors

and

NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  5.44T  4.43T  1.01T         -    36%    81%  1.00x  ONLINE  -

The drives are some older 2TB Seagate enterprise drives running off a Sun X4500 that is configured with OpenIndiana's comstar software to provide individual disks remotely as block storage over a 4gb fibre channel connection. The connection itself is aggregated over 4 links across two cards for increased performance and redundancy.

The vm is mounting the disks as raw devices (no vmware filesystem is involved). We run another 50 virtual machines across 2 servers, each accessing disks from 1 of 3 sets of JBODs. Some are 4Gbps and others are 8Gbps.

It's not blazing fast, but it's not terribly slow either. Tests a few years ago showed that I could sustain 6Gb/s read speeds off a ZFS pool if I set it up for it.

The disks are under a constant write load. Nothing too high, maybe a few megabytes a second, but it is also used to examine the data files which can cause a lot of them to be loaded at once, so it can have large bursts of data reads.

I'm using LZ4, which gives us an overall 1.39 compression ratio.

There are 4 file systems:

NAME              USED  AVAIL  REFER  MOUNTPOINT
storage          4.43T   863G   160K  none
storage/asl      8.96G   863G  8.96G  /export/data/asl
storage/data     6.88G   863G  6.88G  /export/data/data
storage/newpier  19.9M   863G  19.9M  /export/home/newpier
storage/rt       4.41T   863G  4.29T  /export/home/rt

I'm using zfs-auto-snapshot to make snapshots. I keep 36 hourly snapshots, 14 daily, 10 weekly, and 6 monthly. There are currently 264 snapshots.

We have a couple of systems set up this way. This one is used for reference and testing. The others collect more data from more stations and perform processing before they are sent off to be archived.

This system can be taken down for updates, testing, or maintenance more easily than the others.

As for this:

I know this doesn't have anything to do with this patch, but it would be nice if the amount of data that was scrubbed was reported when running zpool status, in addition to just how much was repaired

It will show you how much is being scrubbed while it is running:

  scan: scrub in progress since Mon Jun 26 15:41:09 2017
    606G scanned at 1.43G/s, 2.57M verified out of 4.42T at 6.18K/s, 0.00% done
    0B repaired, (scan is slow, no estimated time)

but once it is completed you only see how much was repaired.

  scan: scrub repaired 0B in 8h41m with 0 errors on Tue Jun 27 00:22:11 2017

It has nothing to do with your patch. I should probably just open a feature request someday.

@tcaputi (Contributor, Author) commented Jun 28, 2017

I just pushed a commit for buildbot to test. I would not try this commit out, since it is very WIP. I will clean everything up and rebase once it looks like everything is working.

@@ -3449,8 +3446,7 @@ scan_io_queue_vacate_cb(range_tree_t *rt, void *arg)
{
dsl_scan_io_queue_t *queue = arg;
void *cookie = NULL;
while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL)
;
while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL);

Contributor:

Oof. Is that really the preferred style?

Contributor Author (@tcaputi):

It's my preferred style, but I'm not looking at style right now :). I will rebase everything and clean it up once I'm sure the new range tree isn't causing any more problems.

@tcaputi force-pushed the better_scrub branch 10 times, most recently from 0789f28 to 248573a on June 29, 2017 at 21:14
@angstymeat commented:

Just to make it clear before I get into the error I just found: I'm using the version of this patch from 3 days ago (before e121c36, i.e. from before you posted that you were making some updates and not to use that version).

zpool scrub is reporting errors in files when running with this patch. I've tried it twice now, and on one of our machines I get errors on both drives in the same mirror. I don't get these errors, and the scrub completes without error, when I use the current git master.

  pool: kea
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Jun 29 15:13:18 2017
    618G scanned at 182M/s, 527G verified out of 3.91T at 155M/s, 13.18% done
    0B repaired, 6h22m to go
config:

	NAME                                        STATE     READ WRITE CKSUM
	kea                                         DEGRADED     0     0   276
	  mirror-0                                  ONLINE       0     0     4
	    scsi-3600144f0967d8e00000051b65e610001  ONLINE       0     0     4
	    scsi-3600144f0e632cf00000051b65fae0001  ONLINE       0     0     4
	  mirror-1                                  DEGRADED     0     0   270
	    scsi-3600144f0967d8e00000051b65e730002  DEGRADED     0     0   270  too many errors
	    scsi-3600144f0e632cf00000051b65fb30002  DEGRADED     0     0   270  too many errors
	  mirror-2                                  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051b65e780003  ONLINE       0     0     0
	    scsi-3600144f0e632cf0000005445398e0003  ONLINE       0     0     0
	  mirror-3                                  ONLINE       0     0     2
	    scsi-3600144f0967d8e00000051b65e7c0004  ONLINE       0     0     2
	    scsi-3600144f0e632cf00000051b65fb80004  ONLINE       0     0     2
	logs
	  mirror-4                                  ONLINE       0     0     0
	    scsi-3600c0ff000130d30d6552c5301000000  ONLINE       0     0     0
	    scsi-3600c0ff000130e6fb15e2c5301000000  ONLINE       0     0     0

@tcaputi (Contributor, Author) commented Jun 29, 2017

@angstymeat Thanks for bringing this up. The bug I am trying to nail down currently may be contributing to this (although it sounds unlikely from your description). Can you give me any more info about this? What kind of workload is running during the scrub? What kind of data is in the pool? If you could find a consistent way to reproduce it that would be wonderful (although that might be asking for a bit much).

@angstymeat commented:

It's the same kind of data that's in the system I was previously testing on. Seismic waveform data, filesystem using lz4 compression. The data doesn't really compress much, but a lot of the output we generate from it does compress.

It's under a constant write load, but not much. iostat is showing around 450k/s being written.

I'll say that this pool has a bit of a history; it was one of the first I used ZFS on back under Solaris 10 when it first came out, and was migrated to ZoL once we dropped Sun hardware. It's been running under ZoL since the first month it was able to mount file systems, and has been updated as ZoL has updated.

The pool has been recreated several times, but the data was transferred via zfs send/receive since the Solaris days.

One thing is that this pool has always been slow. Even now it takes 31+ hours to scrub, even though it is running the same kinds of disks from the same storage servers as the system I described previously. The configuration is the same, other than it uses 5 sets of mirrors instead of 4, and the scrub time is almost 50% longer.

It's happening every time I scan with patch #6256, and it is the same set of files each time. I haven't let it scrub through to completion, though. The errors go away when I scrub with the current master.

I'm not sure what else I can provide. I can give you some zdb output, if you can think of anything that would be helpful. I'm currently running master, but I can switch back to the patched version if you need me to.

I've been testing this on 5 systems (2 personal), and this is the only one I've seen a scrub issue on.

range_tree_ops_t *rt_ops;

/* rt_avl_compare should only be set it rt_arg is an AVL tree */

Member:

it ->if ?

.ad
.RS 12n
Minimum time "long-lived" blocks are locked in the ARC, specified in jiffies.
A value of 0 will default to 6 seconds.

Member:

can we specify this in milliseconds instead, please?

Contributor Author (@tcaputi):

Brian and I talked about this. I think we decided it would be best to make a pass through the code and clean up all the tunables that use jiffies in a later PR. This one simply works the same as zfs_arc_min_prefetch_lifespan for the moment.
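
For context, a minimal sketch (not ZFS code; HZ=250 is an assumed kernel tick rate) of how a jiffies-valued tunable maps to wall-clock time:

/* Toy conversion between jiffies and milliseconds; HZ is assumed. */
#include <stdio.h>

#define HZ 250	/* assumed tick rate; the real value is a kernel config option */

int
main(void)
{
	unsigned long six_seconds = 6UL * HZ;	/* the "6 seconds" default above */
	printf("%lu jiffies = %lu ms\n", six_seconds, six_seconds * 1000UL / HZ);
	return (0);
}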

@behlendorf (Contributor) commented Nov 9, 2017:

@tcaputi you've already fixed this in the code, but zfs-module-parameters.5 wasn't updated. There are two tunables: arc_min_prefetch_ms, which was previously called zfs_arc_min_prefetch_lifespan, and arc_min_prescient_prefetch_ms, which is what the "long-lived" tunable was renamed to.

.RS 12n
Strategy used to determine how data verification should be done during scrubs.
If set to 1, I/Os will always be issued sequentially. This may help performance
if the pool's data is very disparate. If set to 2, I/Os will always be issued

Member:

It would be nice if we never needed to change the default behavior. But assuming that we need to document this, can I suggest an alternate wording:

Determines the order that data will be verified while scrubbing or resilvering. If set to 1, data will be verified as sequentially as possible, given the amount of memory reserved for scrubbing (see zfs_scan_mem_lim_fact). This may improve scrub performance if the pool's data is very fragmented. If set to 2, the largest mostly-contiguous chunk of found data will be verified first. By deferring scrubbing of small segments, we may later find adjacent data to coalesce and increase the segment size.

Contributor Author (@tcaputi):

I will take your wording for now. I found this tunable very useful for playing with performance on different pools, so I'd feel bad removing it.

@behlendorf (Contributor) commented Nov 9, 2017

Testing results from a real world worst case pool containing ~97G of small files.

TLDR, performance can be improved 3x by changing the default values for zfs_scan_top_maxinflight to 1024 and zfs_vdev_aggregation_limit to 1048576. Without tuning this PR degrades performance compared to master for my test pool.

The test pool is a copy of a 3 month old filesystem containing a few million small files. The workload for those 3 months involved frequently creating and unlinking small files resulting in a fragmented (37%) but relatively empty (6% full) pool.

$ zpool list -v
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
jet1      1.45T  97.4G  1.36T     13.1T    37%     6%  1.00x  ONLINE  -
  mirror   744G  48.7G   695G     6.55T    37%     6%
    L1      -      -      -         -      -      -
    L2      -      -      -         -      -      -
  mirror   744G  48.7G   695G     6.55T    38%     6%
    L3      -      -      -         -      -      -
    L4      -      -      -         -      -      -

The test case was to measure the efficiency of resilvering both mirrors and verify that sequential resilver does behave as designed. This included measuring the total resilver time and checking for the expected IO pattern.

$ zpool status -v
...
	NAME             STATE     READ WRITE CKSUM
	jet1             DEGRADED     0     0     0
	  mirror-0       DEGRADED     0     0     0
	    replacing-0  DEGRADED     0     0     0
	      L1        FAULTED      0     0     0  external device fault
	      L5        ONLINE       0     0     0  (resilvering)
	    L2          ONLINE       0     0     0
	  mirror-1       DEGRADED     0     0     0
	    replacing-0  DEGRADED     0     0     0
	      L3        FAULTED      0     0     0  external device fault
	      L6        ONLINE       0     0     0  (resilvering)
	    L4  

Surprisingly, when comparing this PR (5d6e310) with master, resilver time increased by 14%, as shown in the first two columns below. Not good, so what's going on?

[image: resilver time comparison, master vs. this PR]

Profiling with zpool iostat -r to get a request size histogram showed this was due to zfs_scan_top_maxinflight being set to 32 by default. The IO aggregation code depends on there being pending zios in the queue which can be aggregated. By trickling them out slowly we're effectively defeating the aggregation logic, which results in the resilver issuing lots of small, but sequential, IOs. Increasing zfs_scan_top_maxinflight to 256 resolves the aggregation issue; as you can see, fewer, larger IOs are now being issued. This change won't result in the devices being overwhelmed, as cautioned in the comment above zfs_scan_top_maxinflight, since zfs_vdev_scrub_max_active further limits the number of inflight IOs.

[image: request size histogram with zfs_scan_top_maxinflight=256]

Additional performance gains are possible by increasing zfs_vdev_aggregation_limit from 128K to 1M. Since we know scrub IOs are going to be sequential, and thus easily aggregated, we should allow them to be merged into larger IOs. As shown in the stacked bar graph below, by increasing both zfs_vdev_aggregation_limit and zfs_scan_top_maxinflight we can rebuild 60% of this fragmented pool using only 128K or larger disk IOs, compared with <1% with the default tunings and ~40% with zfs_scan_top_maxinflight=256.

[image: stacked bar graph of disk I/O sizes for each tuning]

According to my testing, the sweet spot is to update the default values for these two tunings as follows:

zfs_scan_top_maxinflight=1024
zfs_vdev_aggregation_limit=1048576

The above observations also suggest a potential future optimization. Since we know scrub IOs will be largely sequential, it would be reasonable to submit them from the start as delegated IOs. This would remove the need for the low-level aggregation logic to detect that they're sequential and perform the aggregation. This might further improve performance, but obviously we'd want detailed performance data to confirm it actually helps.
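
To make the aggregation behavior concrete, here is a minimal sketch (not the vdev queue code; names and sizes are illustrative): with enough sorted I/Os pending at once, contiguous neighbors can be merged into a single larger I/O up to the aggregation limit, which is why a deeper scan queue produces fewer, larger disk I/Os.

/* Toy illustration of merging contiguous pending I/Os; not ZFS code. */
#include <stdio.h>

#define AGG_LIMIT	(1024 * 1024)	/* cf. zfs_vdev_aggregation_limit=1048576 */

typedef struct {
	unsigned long long off;
	unsigned long long size;
} io_t;

int
main(void)
{
	/* Already sorted by offset, as the sequential scrub queue would be. */
	io_t q[] = {
		{ 0, 131072 }, { 131072, 131072 }, { 262144, 131072 },
		{ 1048576, 131072 }
	};
	size_t n = sizeof (q) / sizeof (q[0]);
	io_t agg = q[0];

	for (size_t i = 1; i <= n; i++) {
		if (i < n && q[i].off == agg.off + agg.size &&
		    agg.size + q[i].size <= AGG_LIMIT) {
			agg.size += q[i].size;	/* merge contiguous neighbor */
		} else {
			printf("issue off=%llu size=%llu\n", agg.off, agg.size);
			if (i < n)
				agg = q[i];
		}
	}
	return (0);
}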

@tcaputi force-pushed the better_scrub branch 2 times, most recently from aab6610 to fe52369 on November 10, 2017 at 04:26
@tcaputi (Contributor, Author) commented Nov 10, 2017

@behlendorf Haha, it seems that we've made scans so fast with your tunables that we've broken 3 of the tests (zpool_scrub_002_pos, zpool_reopen_003_pos, zpool_reopen_004_pos) that expect to be able to run a command or 2 before a scrub or resilver completes. I'll try increasing the amount of data in the pool to see if that helps.

@angstymeat commented Nov 10, 2017

Just to throw in some more performance numbers...

My home system has a 3-disk raidz1 array that is 3TB in size.

Before this PR (#6256), it took over 11 hours to scrub.

With the patch applied back in August, it took around 4 hours to scrub with roughly 1.1TB of free space left.

After the latest patch, which I applied yesterday, it took 5 hours and 2 minutes with 478GB free (it took longer, but it was scrubbing more data than in August).

After I adjusted the tunables above, my scrub time on the same amount of data dropped to 3.5 hours.

@@ -1780,18 +1798,81 @@ Default value: \fB75\fR.
.sp
.ne 2
.na
\fBzfs_top_maxinflight\fR (int)
\fBzfs_scan_top_maxinflight\fR (int)

Contributor:

This was changed to zfs_scan_top_bytes, so this entry needs to be updated. Testing shows that we should default to 4M.

.RE

.sp
.ne 2
.na
\fBzfs_scan_idle\fR (int)
\fBzfs_scan_mem_lim_fact\fR (int)

Contributor:

Can you also make sure all the zfs_scan_* entries are listed together in alphabetical order?

To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verifications I/Os to disk.
The frequency of this flushing is determined by the
\fBfBzfs_scan_checkpoint_intval\fR tunable.

Contributor:

Extra fB here needs to be removed.

a non-scrub or non-resilver I/O operation has occurred within the past
\fBzfs_scan_idle\fR ticks.
To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verifications I/Os to disk.

Contributor:

s/verifications/verification

@tcaputi force-pushed the better_scrub branch 13 times, most recently from d735014 to 756f774 on November 14, 2017 at 17:16
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf merged commit d4a72f2 into openzfs:master on Nov 16, 2017
@pruiz commented Jan 4, 2018

@behlendorf is this scheduled for 0.8?

@mailinglists35 commented:

@angstymeat when you say 'tunables above', do you refer to what Brian said earlier?

"Testing results from a real world worst case pool containing ~97G of small files. TLDR, performance can be improved 3x by changing the default values for zfs_scan_top_maxinflight to 1024 and zfs_vdev_aggregation_limit to 1048576."

@behlendorf (Contributor) commented:

@pruiz yes.
