disk usage wrong when using larger recordsize, raidz and ashift=12 #4599
Comments
I'd call it an undocumented feature personally. First off, I've confirmed that it affects all OpenZFS platforms, including Illumos distros. But think about what the alternative is; I don't see how it can easily be "fixed". It seems you already understand this, since you mentioned that free space, or "capacity", is calculated assuming 128KiB records. If the disk usage were shown 8% larger (in your vdev config), as it "correctly" should be, then what happens when we get close to full? The capacity would no longer be an accurate representation, because we would be able to fit more data on the zpool than the capacity says we can.

So why don't we make the capacity larger instead? Well, if we did that, what happens when you make a 128KiB dataset and write a bunch of data to it? Now we actually can't store as much as the capacity says we can. I don't think raising the capacity would be a good idea, because a user with 128KiB datasets would then run out of space before the pool reports "full". Since the pool needs to show a "worst case", smaller capacity (so the user doesn't run out of space before it's full), what other choice is there for showing the USAGE of more efficient 1MiB datasets on the same pool, besides showing the USAGE as smaller than the actual file size, so that it still represents an accurate used percentage of the pool?
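To make that concrete, here is a minimal sketch of the accounting, assuming a 12-wide raidz2 with ashift=12. The allocation rounding below mirrors, as I understand it, what OpenZFS's raidz asize calculation does (round the block up to 4KiB sectors, add parity per stripe of data columns, pad to a multiple of nparity+1), and the 128KiB-based conversion stands in for the deflate ratio ZFS derives from a 128KiB block; treat the exact numbers as illustrative rather than authoritative.

```python
def raidz_asize_sectors(psize, ndisks=12, nparity=2, ashift=12):
    """Raw 4KiB sectors allocated for one block of psize bytes on raidz:
    data sectors + parity per stripe, padded to a multiple of nparity + 1."""
    ndata = ndisks - nparity
    data = -(-psize // (1 << ashift))           # ceil(psize / sector_size)
    parity = nparity * -(-data // ndata)        # one parity set per stripe of ndata
    total = data + parity
    return total + (-total % (nparity + 1))     # skip-sector padding

raw_128k = raidz_asize_sectors(128 * 1024)      # 42 sectors for a 128KiB record
raw_1m   = raidz_asize_sectors(1024 * 1024)     # 309 sectors for a 1MiB record

# Reported usage converts raw space with a factor computed for 128KiB blocks,
# so data written as 1MiB records shows up smaller than its logical size:
deflate = (128 * 1024) / (raw_128k * 4096)      # ~0.762
print(raw_1m * 4096 * deflate / (1024 * 1024))  # ~0.92 MiB reported per 1MiB written
```

That ~0.92 factor is, in this sketch, where the roughly 8% discrepancy in the reported usage comes from.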
For raidz, "zfs list" will always show less space, both used and available, than "zpool list".
Unfortunately, this is not a definitive reference and appears to contain some inconsistencies. For your case:
then the model predicts:
Pardon me if I just show my ignorance. This is how I see it: it's all about the usage pattern. Tons of small files vs. a few large files vs. a mix. Ideally, free space reporting would be based on the actual usage pattern, but that may not be feasible. Fixed reporting based on 128k is just as unhelpful for people who store tons of small files (smaller than 128k) and could cause them just as much trouble. Would it make sense to always calibrate on the largest record size? That seems simple and flexible. People with tons of small files could choose to lower the max record size to better reflect their usage pattern, but this would hurt mixed users. So an alternative would be to make the df/du behaviour tunable? Sorry if this doesn't make sense.
For files, recordsize represents a maximum block size. The actual use will depend on the size of the data written: a file smaller than the recordsize is stored in a correspondingly smaller block.
In ZFS the free space is an estimate, because until you allocate the data, you don't know how much space it will actually consume (compression, metadata, and raidz parity and padding all play a part).
For those that stumble across this in the future:
That URL goes nowhere useful now. Archive.org has a copy, and Delphix has one with different, slightly broken formatting.
@richardelling May I ask about your computation of block usage? A 5k file = 2 data + 2 parity + 2 padding = 6 blocks total; where do the other 6 blocks come from? A 120k file = 30 data + 6 parity + no padding = 36 blocks total; again, where do the extra 6 blocks come from?
I’ve not looked at this for a number of years, but I did author a related fix. This comment describes raidz space accounting. If you prefer to spend 30 minutes listening to me talk instead of reading a comment, you can do that too. |
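For what it's worth, here is a quick check of those two figures using the same back-of-the-envelope rules as the sketch earlier in the thread (again assuming a 12-wide raidz2 with ashift=12, so 10 data columns and padding to a multiple of 3). Both the 6-sector and the 36-sector results fall out of that arithmetic:

```python
import math

ashift, ndisks, nparity = 12, 12, 2
ndata = ndisks - nparity                        # 10 data columns

for size in (5 * 1024, 120 * 1024):
    data = math.ceil(size / (1 << ashift))      # file data in 4KiB sectors
    parity = nparity * math.ceil(data / ndata)  # parity sectors per stripe
    total = data + parity
    total += -total % (nparity + 1)             # pad to a multiple of 3
    print(f"{size // 1024}KiB -> {total} sectors")

# 5KiB   -> 2 data + 2 parity + 2 padding = 6 sectors
# 120KiB -> 30 data + 6 parity + 0 padding = 36 sectors
```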
I found a potential bug (or undocumented feature).
Short description:
When a file is created on a dataset with large records enabled, located on a raidz pool with ashift=12, the usage column in zfs list shows less than the actual file size on disk.
Long description:
When creating a raidz pool with ashift=12, a certain amount of disk space is lost to padding, because the 128k recordsize is divided into 4k sectors instead of 512-byte ones.
The raidz capacity loss on 4k disks due to padding is described in detail here:
http://blog.gjpvanwesten.nl/2014/08/part-iv-how-much-space-do-you-lose-with.html
Now that larger record sizes are available, I tried using recordsize=1M on my pool.
My raidz2 pool consists of 12 members; this means I lose about 8% of the space to padding with 128k records.
When I create a file on a dataset with large records enabled, the usage column in zfs list shows less space used than the actual file size.
Now the interesting part: it is off by exactly the 8% that the large recordsize saves me.
For testing purposes I created a 500MiB file on a freshly created, file-based raidz2 zpool backed by 12 100MiB files.
The curious thing is that, in the context of zfs list, this sort of makes sense: the free space is calculated assuming 128k records, but we now use less space due to 1M records, so the usage is correct in that context.
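Putting rough numbers on that, with the same hypothetical 12-wide raidz2 / ashift=12 arithmetic sketched earlier in the thread (illustrative only): a 1MiB record allocates 309 raw 4KiB sectors, a 128KiB record allocates 42, and the reported usage converts raw space with the 128KiB-based factor, so a 500MiB file written as 1MiB records would be expected to show roughly 8% under its size:

```python
raw_1m, raw_128k, sector = 309, 42, 4096      # sectors per record (from the sketch above)
deflate = (128 * 1024) / (raw_128k * sector)  # ~0.762, the 128KiB-based conversion
reported = 500 * raw_1m * sector * deflate    # 500 records of 1MiB each
print(reported / 2**20)                       # ~460 (MiB) reported for a 500MiB file
```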
So is this a bug or just missing documentation?
(all using ZoL v0.6.5.6, kernel 4.4.0-21)