disk usage wrong when using larger recordsize, raidz and ashift=12 #4599

Closed
burnin opened this issue May 5, 2016 · 8 comments
Labels
Status: Inactive Not being actively updated

Comments

@burnin

burnin commented May 5, 2016

I found a potential bug (or undocumented feature).
Short description:
When a file is created on a dataset with large records enabled, located on a raidz pool with ashift=12, the usage column in zfs list shows less than the actual file size on disk.

Long description:
When creating a raidz pool with ashift=12, a certain amount of disk space is lost to padding because the 128k recordsize is divided into 4k sectors instead of 512-byte sectors.
The raidz capacity loss on 4k disks due to padding is described in detail here:
http://blog.gjpvanwesten.nl/2014/08/part-iv-how-much-space-do-you-lose-with.html
Now that larger record sizes are available, I tried using recordsize=1M on my pool.
My raidz2 pool consists of 12 members, which means I will lose about 8% of space to padding with 128k records.

When I create a file on a dataset with large records enabled, the usage column in zfs list shows less space used than the actual file size.
Now the interesting part: it is off by exactly the 8% that the large recordsize saves me.
For testing purposes I created a 500MiB file on a freshly created file-based raidz2 zpool backed by 12 100MiB files.

```
dd if=/dev/urandom of=/tank/test/testfile bs=1M count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 105,889 s, 5,0 MB/s

du  /tank/test
470908  /tank/test

ls -l /tank/test
total 470907
-rw-r--r-- 1 root root 524288000 Mai  5 01:23 testfile.1

zfs list
tank             461M   372M   219K  /tank
tank/test        460M   372M   460M  /tank/test

zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank    1,11G   606M   530M         -    26%    53%  1.00x  ONLINE  -
```

The curious thing is that, in the context of zfs list, this sort of makes sense: the free space is calculated assuming 128k records, but we now use less space thanks to 1M records, so the usage is correct in that context.
So is this a bug or just missing documentation?
(all using ZoL v0.6.5.6, kernel 4.4.0-21)

@nicko88

nicko88 commented May 5, 2016

I'd call it an undocumented feature personally.

First off, I've confirmed that it affects all OpenZFS platforms including Illumos distros.

But think about what the alternative is. I don't see how it can easily be "fixed".

It seems you already understand this since you mentioned freespace or "capacity" is calculated assuming 128KiB records.

If the disk usage were 8% larger (in your vdev config), as it "correctly" should be, then what happens when we get close to full? It won't be an accurate representation of the capacity, since we will be able to fit more data on the zpool than the capacity says we can.

So why don't we make the capacity larger? Well, if we do that, what happens if you make a 128KiB dataset and write a bunch of data to it? Now we actually can't store as much as the capacity says we can.

I don't think it would be a good idea to raise the capacity, since if the user uses 128KiB datasets then their pool will run out of space before it's "full".

So the pool needs to show a "worst case" smaller capacity (so the user doesn't run out of space before it's full). What other choice is there for showing the USAGE of more efficient 1MiB datasets on the same pool, besides showing the USAGE as smaller than the actual filesize so that it accurately represents the used % of the pool?
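
To make that concrete, here is a rough back-of-the-envelope sketch (illustrative Python, not the actual OpenZFS accounting code; the helper `raidz_alloc_sectors` and the 128 KiB-based "deflation" factor are my assumptions based on the worst-case logic described above) that happens to reproduce the numbers from the original report:

```python
# Rough model of one 12-wide raidz2 vdev with ashift=12 (4 KiB sectors).
# Illustrative only -- names and structure are mine, not OpenZFS code.
def raidz_alloc_sectors(logical_bytes, ndisks=12, nparity=2, sector=4096):
    ndata = ndisks - nparity
    data = -(-logical_bytes // sector)           # data sectors (ceil)
    parity = nparity * -(-data // ndata)         # parity sectors, per data stripe
    alloc = data + parity
    alloc += -alloc % (nparity + 1)              # pad up to a multiple of parity + 1
    return alloc

# Assumed worst-case "deflation" factor: logical vs. allocated for a 128 KiB block.
deflate = (128 * 1024 // 4096) / raidz_alloc_sectors(128 * 1024)      # 32 / 42

alloc_bytes = 500 * raidz_alloc_sectors(1024 * 1024) * 4096           # 500 x 1 MiB records
print(f"allocated on disk: {alloc_bytes / 2**20:.0f} MiB")            # ~604 MiB (zpool list: 606M)
print(f"reported usage:    {alloc_bytes * deflate / 2**20:.0f} MiB")  # ~460 MiB (zfs list: 460M)
```

With those assumptions, ~604 MiB actually allocated and ~460 MiB reported line up with the zpool list and zfs list output above.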

@GeLiXin
Contributor

GeLiXin commented May 5, 2016

@burnin @nicko88 this may be the same as #4544

@richardelling
Contributor

On May 4, 2016, at 6:16 PM, burnin notifications@github.com wrote:

> I found a potential bug (or undocumented feature).
> Short description:
> When a file is created on a dataset with large records enabled, located on a raidz pool with ashift=12, the usage column in zfs list shows less than the actual file size on disk.

For raidz, "zfs list" will always show less space, both used and available, than "zpool list". This is intentional and a FAQ.

> Long description:
> When creating a raidz pool with ashift=12, a certain amount of disk space is lost to padding because the 128k recordsize is divided into 4k sectors instead of 512-byte sectors.
> The raidz capacity loss on 4k disks due to padding is described in detail here:
> http://blog.gjpvanwesten.nl/2014/08/part-iv-how-much-space-do-you-lose-with.html

Unfortunately, this is not a definitive reference and appears to contain some inconsistencies. The authoritative reference is Matt's blog and corresponding spreadsheet. This sheet calculates the cost for parity and padding. It does not calculate metadata overhead, but for larger recordsizes the metadata overhead becomes negligible as a percentage of the dataset's use.
http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

For your case:

  1. raidz2
  2. 12-drives
  3. physical block size = 4k

then the model predicts:
128k recordsize ==> block size in sectors = 32 ==> space used for parity and padding = 24%
1M recordsize ==> block size in sectors = 256 ==> space used for parity and padding = 17%
-- richard
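
For a quick sanity check, the model's two predictions can be reproduced with a few lines of arithmetic (a sketch of the parity-plus-padding calculation, not the ZFS allocator itself; the pad-to-a-multiple-of-3 rule is the usual raidz2 assumption):

```python
# 12-wide raidz2 with 4 KiB sectors: 10 data disks, 2 parity sectors per data stripe,
# each allocation padded up to a multiple of (parity + 1) = 3 sectors.
for name, data_sectors in (("128k", 32), ("1M", 256)):
    parity = 2 * -(-data_sectors // 10)          # 2 parity sectors per ceil(data/10) stripe
    alloc = data_sectors + parity
    alloc += -alloc % 3                          # padding
    print(f"{name}: {(alloc - data_sectors) / alloc:.0%} parity + padding")
# 128k: 24% parity + padding
# 1M: 17% parity + padding
```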


@louwrentius
Contributor

Pardon me if I just show my ignorance. This is how I see it: it's all about the usage pattern. Tons of small files vs. a few large files vs. a mix.

Ideally, free space reporting would be based on the actual usage pattern, but that may not be feasible.

Fixed reporting based on 128k is just as unhelpful for people who store tons of small files (smaller than 128k) and could cause them just as much trouble.

Would it make sense to always calibrate on the largest record size? That seems simple and flexible. People with tons of small files could choose to lower the max record size to better reflect their usage pattern, but this would hurt mixed users.

So an alternative would be to make the df/du behaviour tunable? Sorry if this doesn't make sense.

@richardelling
Contributor

On May 5, 2016, at 1:06 PM, louwrentius notifications@github.com wrote:

> Pardon me if I just show my ignorance. This is how I see it: it's all about the usage pattern. Tons of small files vs. a few large files vs. a mix.

For files, recordsize represents a maximum block size. The actual use will depend on the size of the file. For example, in your raidz2 case with 4k physical block size and a 12-wide set, a 5k file consumes 12 physical blocks, a 200% hit for parity and padding (6 blocks for data + parity + padding), plus two copies of the metadata, each consuming 3 physical blocks (2 * 3 = 6 blocks). By comparison, a 120k file consumes 42 blocks, plus two copies of the metadata, each consuming 3 physical blocks = 48 total blocks.
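
For what it's worth, the data + parity + padding part of those two examples can be sketched as follows (illustrative only; the metadata copies mentioned above are extra and not modeled, and the pad-to-a-multiple-of-parity+1 rule is an assumption):

```python
# Data + parity + padding sectors for one small file on the 12-wide raidz2, ashift=12.
# Metadata copies are not counted here.
for size_k in (5, 120):
    data = -(-(size_k * 1024) // 4096)           # data sectors (ceil)
    parity = 2 * -(-data // 10)                  # raidz2 parity per 10-sector data stripe
    alloc = data + parity
    alloc += -alloc % 3                          # pad to a multiple of parity + 1
    print(f"{size_k}k file: {alloc} sectors ({data} data + {parity} parity + {alloc - data - parity} pad)")
# 5k file: 6 sectors (2 data + 2 parity + 2 pad)
# 120k file: 36 sectors (30 data + 6 parity + 0 pad)
```

Note that this sketch gives 36 sectors for the 120k file rather than the 42 quoted above; the later question in this thread asks about exactly that gap.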

> Ideally, free space reporting would be based on the actual usage pattern, but that may not be feasible.

In ZFS the free space is an estimate, because until you allocate the data, you don't know how much space it will consume. This is because of the other features: compression, dedup, copies, and snapshots. Yes, this does make the case that for lots of tiny files and 4k physical block sizes, maybe you're better off using mirrors. From a more pragmatic viewpoint, prediction of space consumption is just a prediction -- view the "zfs free" space with a grain of salt.
-- richard


@mgerdts
Contributor

mgerdts commented Jun 7, 2019

For those who stumble across this in the future:

> Unfortunately, this is not a definitive reference and appears to contain some inconsistencies. The authoritative reference is Matt's blog and corresponding spreadsheet. This sheet calculates the cost for parity and padding. It does not calculate metadata overhead, but for larger recordsizes the metadata overhead becomes negligible as a percentage of the dataset's use.
> http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

That URL goes nowhere useful now. Archive.org has a copy, and Delphix has one with different, slightly broken formatting.

https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz

@iBug

iBug commented Oct 24, 2023

@richardelling May I ask about your computation of block usage?

5k file = 2 data + 2 parity + 2 padding = 6 blocks total. Where do the other 6 blocks come from?

120k file = 30 data + 6 parity + no padding = 36 blocks total. Again, where do the extra 6 blocks come from?

@mgerdts
Contributor

mgerdts commented Oct 25, 2023

I’ve not looked at this for a number of years, but I did author a related fix. This comment describes raidz space accounting.

https://github.com/illumos/illumos-gate/blob/b73ccab03ec36581b1ae5945ef1fee1d06c79ccf/usr/src/lib/libzfs/common/libzfs_dataset.c#L5116

If you prefer to spend 30 minutes listening to me talk instead of reading a comment, you can do that too.

https://www.youtube.com/watch?v=sTvVIF5v2dw
