Highly inefficient use of space observed when using raidz2 with ashift=12 #548
Comments
rlaager and I were able to narrow things down in IRC. It seems that a single disk pool is fine with ashift=9 and ashift=12. raidz2 is also fine with ashift=9, but when ashift=12, space requirements explode. I did an unpack of the portage tree on a raidz2 ashift=9 pool that I made on my VM host. It used only 436MB:
rpool 437M 3.49G 436M /rpool
I also tried a 1GB zvol formatted ext4 with 2^20 inodes:
/dev/zd0 786112 743000 0 100% /rpool/portage
That is consistent with the host usage, and it seems that the 5% space reserved for root enabled the extraction to run to completion. The explosion in disk usage seems to be caused by a bad interaction between ashift=12 and raidz. |
You're absolutely right, this is something of a known, although not widely discussed, issue for ZFS and it's one of the reasons why we've left ashift=9 the default. Small files will balloon the storage requirements; for large files things should be much more reasonable. |
behlendorf, would you clarify why this affects not only small files on a ZFS dataset, but also small files stored on a zvol formatted with a completely different filesystem? My understanding of a zvol was that it should reserve all of the space that it would ever use and never grow past that unless explicitly resized. |
You see the impact when using a zvol because they default to an 8k block size. If you increase the zvol block size the overhead will decrease. It's analogous to creating files in a zfs filesystem with an 8k block size, which is what happens when you create a small file. As for zvols, they don't reserve their space at creation time. While you do set a maximum volume size, they should only allocate space as they are written to, like any other object. If you want the behavior you're describing for your zvol, you need to set a reservation. |
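For reference, a minimal sketch of the two mitigations mentioned above, using a hypothetical pool/volume name (tank/vol01); the properties are standard zfs ones, but the sizes are illustrative only:
zfs create -V 10G -o volblocksize=64K tank/vol01    # a larger volblocksize reduces per-block overhead
zfs set refreservation=10G tank/vol01               # explicitly reserve the full volume size up front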
That would explain why there appears to be a factor of 2 discrepancy between what the filesystem on the zvol reported and what the zvol actually used. I suspect that when ashift=12, the zvol will allocate two blocks and only use one (i.e. zero pad it), as opposed to the typical unaligned write behavior where a 4KB logical sector would map to either the upper or lower half of an 8KB physical sector. |
behlendorf: When creating a zvol, a refreservation (a reservation on zpool versions <= 8) is created by default. This is covered in the zfs man page and matches my experience with Solaris 11 Express as well as a quick test just now with ZFS on Linux. To get the behavior you describe, you need to add the -s option: zfs create -s -V <size> <dataset_name> |
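A quick illustration of the sparse (-s) behavior described above, with a hypothetical volume name; on a non-sparse zvol the refreservation is set automatically, while with -s it should report none:
zfs create -s -V 10G tank/sparsevol
zfs get reservation,refreservation tank/sparsevol   # expect 'none' for a sparse volume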
gentoofan: Are you seeing writes to a zvol (with the default refreservation) fail with ENOSPC? If so, that seems like a bug (i.e. the reservation isn't properly accounting for the worst-case overhead). |
I did some more tests. The following is on ashift=12, and I can't see any difference in terms of reported available space when setting a reservation:
localhost ~ # zfs create -V 400M -o reservation=400M rpool/test
I also tested making a zvol on ashift=9 and I observed the 1.5GB space usage that I had seen on ashift=12. I then repeated with ext4 on top of 'zfs create -V 1G -o volblocksize=4K rpool/ROOT/portage' and the space usage of the zvol correlated to the space usage reported by ext4. Specifically, accounting for FS overhead, 2^20 KB - 229332 KB = 819244 KB, or approximately 800M. Allowing for ZFS' internal book-keeping, 17MB of overhead seems reasonable. I will retest with ashift=12 soon, although given that I had reproduced the 1.5GB zvol usage with ashift=9, I suspect that toggling this switch will fix things. There is still the issue of why a 1GB zvol with ashift=9 uses 1.5GB to store an ext4 filesystem containing files that a ZFS dataset with ashift=9 only needs 347MB to store. |
Dagger2, rlaager and dajhorn worked this out in IRC. The issue is that ashift=12 enforces a 4KB minimum allocation. The two parity blocks required by raidz2 are each 4KB in size because of ashift=12. The zvol has a default block size of 8KB, so 2x4KB are written as data with 2x4KB parity. Since the corresponding parity blocks have been consumed, the other two data blocks are marked as in use, even though they aren't doing anything. The consequence is that the smallest amount of data that can be written to a raidz pool is (disks - raidz level) * 2^ashift, which in my situation is 16KB. This explains why filesystems on the zvol and on a ZFS dataset would require roughly the same amount of space, despite requiring much less on a single physical disk. |
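A small worked example of that formula, assuming the 6-disk raidz2 ashift=12 pool described in this issue (plain shell arithmetic, nothing ZFS-specific):
disks=6; parity=2; ashift=12
echo $(( (disks - parity) * (1 << ashift) ))   # 16384 bytes, i.e. a 16KB minimum data allocation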
rlaager, I have not observed any write failures, although I imagine one would occur if I made a 3GB zvol in my configuration with the default 8KB volblocksize and then proceeded to fill it. By the way, I had been sitting on the comment I made immediately after yours while I was running tests, so I didn't see your comment until now. |
I just tested this scenario. I created a zpool with ashift=9 on six 300 GiB files. I ran dd, and it completed without error. So, as far as I can tell, there are two bugs here:
|
rlaager, I believe your issue is what I described earlier, which involves a bad configuration. Once a zvol with such a configuration exists, there is not much that the code can do about it. If the actual volblocksize < (disks - raidz level) * 2^ashift, things like this can happen when you try to fill the zvol. It might be worthwhile to make the default volblocksize vary depending on the raidz level and number of disks to prevent users from hitting these configurations by default. It might also be worthwhile to refuse to make zvols that violate this constraint. Of course, that won't help people who have pre-existing zvols that were made by older modules or other implementations. |
After talking about this with rlaager in IRC for a bit, I would like to suggest that the zvol code be patched to accomplish 4 things, where formula = (disks - raidz_level) * 2^ashift:
That would prevent the issues we encountered from occurring. |
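As a sketch of the "make the default depend on the geometry" idea (not actual ZFS behavior; the pool/volume names are hypothetical), an admin could compute the formula value by hand today and pass it as volblocksize:
disks=6; raidz_level=2; ashift=12
vbs=$(( (disks - raidz_level) * (1 << ashift) ))   # 16384
zfs create -V 10G -o volblocksize=${vbs} tank/vol02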
Assuming that our idea of "formula" is correct (which probably needs more testing):
3: we should print a kernel message. Also, we should implement the "readonly" part by setting readonly=on on the zvol, which would allow the admin to override it. Imagine, "I upgraded ZFS on Linux and rebooted. All of my virtual machines failed because their zvols went readonly." They can continue at the same risk as before (though now it's known) until they have time to recreate the zvols with a different volblocksize and transfer the data.
4: If you want this to be constant-time (and not have to iterate over the zvols to check), then make the condition "if it would raise the default volblocksize". This is just as safe, but may have false positives (i.e. no zvols exist with a volblocksize less than the new value).
We might want to relax the "zvol" conditions to "non-sparse zvols", where "non-sparse" means "with no reservation or refreservation". If I have a 6 disk raidz ashift=12 pool with a volblocksize=4k zvol for a VM's swap or volblocksize=8k for a database, I might want to waste space in exchange for the performance advantage of avoiding read-modify-write at the zvol level. Sparse zvols are already subject to failure if the pool fills up (and thus discouraged by the man page), so while the increased disk consumption might be a surprise, it's not violating any guarantee that ZFS made. |
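For clarity, the readonly override suggested above maps onto the existing readonly property (volume name hypothetical):
zfs set readonly=on tank/vol02    # flag the risky zvol instead of silently letting it balloon
zfs set readonly=off tank/vol02   # the admin can accept the risk and re-enable writes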
Is there going to be a bug fix for this issue? I am being affected by this, in such a way that in a pool with compression (1.7x) and dedupe (1.55x) enabled, the storage size is about THE SAME as it was on the old NetApp Filer (which is outrageously big). Six-disk RAIDZ2 pool here. |
Also I see a discrepancy between the total pool size in zpool list and zfs list (used + avail). |
No work is currently planned to address this issue. |
@behlendorf |
I thought that if a process wrote a 16K buffer to a zvol with volblocksize=4K, it would be considered a single block and spread over 4 drives if available. My testing shows something else: it seems to split the data into 4K blocks and use short stripes even when writing 16K buffers. E.g. on a 5 disk raidz, create 3 test volumes; one with volblocksize=16K and 2 with volblocksize=4K, then use dd to write in 4K or 16K blocks:
zfs create -V 40G -o volblocksize=16K ypool/test_vb16_dd16
dd if=/dev/urandom of=/dev/zvol/ypool/test_vb16_dd16 bs=16K
zfs create -V 40G -o volblocksize=4K ypool/test_vb4_dd16
dd if=/dev/urandom of=/dev/zvol/ypool/test_vb4_dd16 bs=16K
zfs create -V 40G -o volblocksize=4K ypool/test_vb4_dd4
dd if=/dev/urandom of=/dev/zvol/ypool/test_vb4_dd4 bs=4K
zfs list
ypool/test_vb16_dd16 48.4G 2.56T 48.4G -
This may be what is expected, but then it seems bad to use the default volblocksize of 8K. I get 30% better random 4K read IOPS when using volblocksize=4K instead of 16K, so there is an argument for using a small volblocksize, but the hybrid approach of combining large write buffers seems better if possible. |
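To see the allocation overhead directly, one can compare the allocated space against the volume size for each test zvol (a brief example; the volume name is taken from the test above):
zfs get volsize,volblocksize,used ypool/test_vb16_dd16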
Previous patches have allowed you to set an increased ashift to avoid doing 512b IO with 4k sector devices. However, it was not possible to set the ashift lower than the reported physical sector size even when a smaller logical size was supported. In practice, there are several cases where setting a lower ashift is useful:
* Most modern drives now correctly report their physical sector size as 4k. This causes zfs to correctly default to using a 4k sector size (ashift=12). However, for some usage models this new default ashift value causes an unacceptable increase in space usage. Filesystems with many small files may see the total available space reduced to 30-40% which is unacceptable.
* When replacing a drive in an existing pool which was created with ashift=9 a modern 4k sector drive cannot be used. The 'zpool replace' command will issue an error that the new drive has an 'incompatible sector alignment'. However, by allowing the ashift to be manually specified as a smaller, non-optimal value the device may still be safely used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#1381 Closes openzfs#1328 Issue openzfs#967 Issue openzfs#548
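Hedged examples of the two use cases in that commit message, with hypothetical pool and device names; -o ashift is the override in both cases:
zpool create -o ashift=9 tank raidz2 sda sdb sdc sdd sde sdf   # force 512B allocation units at creation
zpool replace -o ashift=9 tank sdc sdg                         # allow a 4k sector drive into an ashift=9 pool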
Last weekend I created a 12x4T raidz2 array and streamed across the contents of my old Nexenta fs to test the viability of zfs on Linux. Imagine my surprise when I noticed the fs size jumped from 3.27T to 5.73T in the transition! The combo of 128k blocks, ashift=12 and raidz2 meant a 75% space overhead, almost entirely eliminating any space savings from using raidz2 rather than mirroring. This is an issue. |
Testing today on the latest zfs release on CentOS 6 shows the same space usage problem on raidz3 with 4k drives. FYI |
This is just something people need to be aware of. ZoL behaves the same in this regard as all the other OpenZFS implementations. The only real difference is that ZoL is much more likely to default to ashift=12. However, the default ashift can always be overridden at pool creation time if this is an issue. Since no work is planned to change this behavior I'm closing the issue. |
Even though this bug has been closed, could someone please share recommendations for creating an ext4 zvol with a pool that has either |
Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#548
Is this still an issue for zfs 2.1.1? |
As I understand it, this has the same problem as raidz1. I have copied comments about this from somewhere (specifically about VMs in datasets and ZVOLs):
"This is the problem, volblocksize=8K. When using 8K, you will have many padding blocks. At a time you will write an 8K block on your pool. For a 6 disk pool (raidz2), this will be 8K / 4 data disks = 2K per disk. But for each disk, you can write at minimum 4K (ashift 12), so in reality you will write 4 blocks x 4K = 16K (so it is double). So from this perspective (space usage), you will need at least volblocksize=16K"
"please add -o volblocksize= while creating the volume. If you have 16 disks with RAIDZ3 and ashift=12 => x=(16-3)=13 =>"
"I found this chart that showed that the default 8k volblocksize was indeed a problem. For Raidz1 with ashift of 12 (4K LBA) you need at least:"
"The key insight is that normally, datasets are used for files of varying sizes. As such, when you write a small 16k file, ZFS can use a small 16k record. recordsize is a limit on the max record size. Smaller records are allowed. That said, to correctly compare zvols vs datasets, I suggest you test the following three configurations: zvol virtual machine: default zvol parameters, disk configured with cache=none (to bypass the pagecache, the hypervisor must issue O_DIRECT writes); dataset virtual machine: set recordsize=8K, atime=off, xattr=off, use a raw file disk image with cache=writeback (note: datasets do not engage the linux pagecache, nor do they support O_DIRECT - unless you are using zfs 0.8.x, where a "fake" support for direct writes was added)" |
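A sketch of the two quoted configurations (dataset names are hypothetical; the property values come straight from the quoted advice):
# zvol-backed VM disk for a 6-disk raidz2 ashift=12 pool, per the volblocksize=16K recommendation
zfs create -V 40G -o volblocksize=16K tank/vm01
# dataset-backed raw image, per the recordsize=8K suggestion (hypervisor uses cache=writeback)
zfs create -o recordsize=8K -o atime=off -o xattr=off tank/vm02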
I am using the latest GIT code, e29be02, on a VMware Player VM. I booted the VM using the Ubuntu Linux 11.10 LiveCD, with Linux 3.0. The VM contains 6x1GB disks in a raidz2 pool with ashift=12. The pool reports 4GB of space formatted. I unpacked a copy of the portage tree, which requires about 672M on ext4 with a 4K block size, onto a ZFS dataset in the pool, where it required 1.5GB of space.
I then tried using zvolumes, and I had similar results. The storage requirements on the pool are consistently double that of the storage requirements of the actual hosted filesystems:
(df reported usage, KB) - (zfs list reported size)/(size provided at creation time) - (filesystem, mkfs options)
743688 - 1.58G/1G - ext4 (normal, zvol) -E discard -N 1048576
688912 - 1.58G/1G - ext4 (extra options, zvol) -E discard -N 1048576 -I 128 -m 0 -O ^ext_attr,^resize_inode,^has_journal,^large_file,^huge_file,^dir_nlink,^extra_isize
687856 - 1.40G/1G - ext4 (extra options, zvol) -E discard -N 262144 -I 128 -m 0 -O ^ext_attr,^resize_inode,^has_journal,^large_file,^huge_file,^dir_nlink,^extra_isize
301684 - 607M/1G - reiserfs (default options, zvol)
You can obtain a snapshot of the portage tree to verify my results from the following link:
http://mirrors.rit.edu/gentoo/snapshots/portage-latest.tar.xz
I am linking to the latest tarball rather than the current one, mostly because dated tarballs are not hosted particularly long. I expect that others can reproduce my findings regardless of whether or not the same exact snapshot is used.
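A hedged reproduction sketch of the scenario in this report, using file-backed vdevs instead of real disks (the pool name, file paths, and mountpoint are assumptions; the mkfs options are the "normal" ones from the table above):
truncate -s 1G /tmp/d1.img /tmp/d2.img /tmp/d3.img /tmp/d4.img /tmp/d5.img /tmp/d6.img
zpool create -o ashift=12 testpool raidz2 /tmp/d1.img /tmp/d2.img /tmp/d3.img /tmp/d4.img /tmp/d5.img /tmp/d6.img
zfs create -V 1G testpool/portage
mkfs.ext4 -E discard -N 1048576 /dev/zvol/testpool/portage
mount /dev/zvol/testpool/portage /mnt
tar -xJf portage-latest.tar.xz -C /mnt
zfs list testpool/portage   # compare USED here against the df output for /mnt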
I also tried making a 512MB file. I formatted it as reiserfs, mounted it via the loopback device, and extracted the portage tree into it. Afterward, I examined the disk space used, and it was 513MB.
Lastly, I tried this on my physical system with a 2GB sparse file on ext4, formatted as a single-disk ZFS pool with ashift=12 and without any raidz, mirroring or striping. The occupied space reported by df was 830208 KB, which is a dramatic improvement over raidz2.
I thought I paid the price of parity at the beginning, when 1/3 of my array's space was missing, but it seems that I am paying for it twice, even when using zvols, which I would expect to be the space-wise equivalent of a giant file. I pay once at pool creation and then again when I do many small writes. Does anyone have any idea why?