ZFS Fragmentation: Long-term Solutions #3582
Hey guys. This is somewhat of a question, an issue, and a "viability of a feature request" all rolled into one.
First, ZFS is awesome, and I use it to host VMs. I am looking seriously at ZFS as a solution for enterprise OpenStack deployments.
My question is about fragmentation (I think it is a problem, but I'm no expert in this like you guys). Here are a few links that I've been reading on ZFS fragmentation:
Links to data which may indicate problems with ZFS fragmentation
Is it correct that ZFS fragmentation appears to be a significant issue under certain workloads? What's the best way to avoid it, using the latest zfsonlinux codebase?
Here are mentions I found of fixing fragmentation with the proposed BPR (block-pointer rewrite) code:
BPR answer here: https://www.youtube.com/watch?v=G2vIdPmsnTI#t=44m53s
On Sat, Jul 11, 2015 at 6:36 PM, Bronek Kozicki wrote:
OK, wow, this guy is awesome. Thanks for the video; I love it, and I've transcribed it here.
TL;DR: "It's like changing your pants while you're running... deleting snapshots, creating snapshots, while you're changing what those snapshots are trying to reference."
Matt Ahrens on ZFS / "Block-Pointer Rewrite project for ZFS Fragmentation"
"So BP Rewrite is a project I was working on at Sun. And the idea was... uh... very all-encompassing. We would be able to take... any... block on disk and be able to modify it any way we need to. Allocate it somewhere else, or change the compression, change the checksum, de-dup the block, or not de-dup it... and, uh... keep track of that change.
It was called BP Rewrite because we need to change the block pointer... to point to some new block. So, uh... the tricky part... Well, the straightforward implementation is to traverse the blocks... the tricky thing about doing this on ZFS...
(1) One is that there can be many pointers to a block, because of snapshots and clones. One block on disk can be pointed to by ten different clones. That creates a problem: when [we] traverse all the block pointers, I'm going to visit that old block pointer several times. We need to REMEMBER that we changed from this old block pointer to the new block pointer; in other words, that we moved this particular block from place A to place B.
So that if I see another pointer to place A, then I know to change it to place B. This creates a performance problem, because you end up having to have a giant hash table that maps from the old location to the new location. If you're familiar with ZFS dedup, then you're aware that it also involves a giant hash table, mapping from a block's checksum to the location on disk where it's stored, and, uh... a refcount. And if you've ever used dedup in practice on very large data sets, then you're probably aware that its performance is not very great. So there are similar performance problems with BP Rewrite.
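To make the traversal problem above concrete, here is a toy sketch (not real ZFS code; the block-pointer and allocator representations are invented for illustration) of why the remap table is needed: a shared block must be relocated exactly once, and every later pointer to the old location must be repointed via the table.

```python
# Toy model: a "block pointer" is just a (vdev, offset) tuple, and each
# clone/snapshot is a list of such pointers. Several clones may point at
# the same on-disk location.

def bp_rewrite(trees, relocate):
    """Walk every block pointer in every clone, moving each referenced
    block once and remembering old -> new in a remap table."""
    remap = {}  # the "giant hash table": old location -> new location
    for tree in trees:
        for i, old_loc in enumerate(tree):
            if old_loc in remap:
                # Block was already moved while walking another clone:
                # just repoint this pointer, do NOT relocate again.
                tree[i] = remap[old_loc]
            else:
                new_loc = relocate(old_loc)
                remap[old_loc] = new_loc
                tree[i] = new_loc
    return remap

# Two clones share the block at ("vdev0", 100).
clone_a = [("vdev0", 100), ("vdev0", 200)]
clone_b = [("vdev0", 100), ("vdev0", 300)]
moves = []

def relocate(loc):
    moves.append(loc)
    return (loc[0], loc[1] + 10_000)  # pretend allocator

bp_rewrite([clone_a, clone_b], relocate)
assert clone_a[0] == clone_b[0] == ("vdev0", 10_100)
assert moves.count(("vdev0", 100)) == 1  # moved once, repointed twice
```

The table entry count is proportional to the number of moved blocks, which is why, like the dedup table, it becomes enormous on large pools.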
And there's also some additional trickiness, because ZFS is very full-featured. The space used by a given block is accounted for in many different places, in many different layers of ZFS. So for example, [space used] is counted in the dnode, so that each file knows how much space it's using. So [that would be] like ls -al or df. That counts up the amount of space. It's also accounted for in the DSL layer in a bunch of different places... so, you know, in zfs list, the space used by each filesystem. Which impacts all the snapshots, and all the parent filesystems, because the space is inherited up the tree, in terms of the space used.
There are a bunch of different places where space accounting needs to happen... Making sure all those [numbers] get updated accurately when a block changes size is very tricky. So for all those reasons, as I was working on this... I was very concerned that this would be the last feature ever implemented in ZFS, because... uhm... as most programmers know, magic does not layer well.
And BP Rewrite was definitely magic, and it definitely broke a lot of the layering in ZFS. It needed code in several different layers to have intimate knowledge of how this all worked.
On top of all that, we wanted to be able to do this live, on a live file system. This is very important if [running BP Rewrite] is going to take weeks, because of the performance issues. It's like changing your pants while you're running... deleting snapshots, creating snapshots, while you're changing what those snapshots are trying to reference.
So those are some of the issues with that. Uhm, the, uh... because of my concerns about the layering, I'm actually kind of glad that project was not completed, because I think it would have had some big implications for the difficulty of adding other features after it.
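The "space inherited up the tree" point can be illustrated with a toy model (an assumed simplification, not real ZFS internals): when BP Rewrite changes a block's size, say by recompressing it, the delta has to be applied to the owning dataset and to every ancestor, which is roughly what keeps `zfs list` consistent.

```python
# Hypothetical dataset tree: child -> parent (None marks the pool root).
parents = {"pool/vm/disk1": "pool/vm", "pool/vm": "pool", "pool": None}
# Bytes "used" as reported per dataset, as in `zfs list`.
used = {"pool/vm/disk1": 4096, "pool/vm": 4096, "pool": 4096}

def apply_size_delta(dataset, delta):
    """Propagate one block's size change up through every ancestor."""
    while dataset is not None:
        used[dataset] += delta
        dataset = parents[dataset]

# Recompressing one block shrinks it from 4096 to 1024 bytes; three
# separate accounting entries must all change by the same delta.
apply_size_delta("pool/vm/disk1", 1024 - 4096)
assert used == {"pool/vm/disk1": 1024, "pool/vm": 1024, "pool": 1024}
```

In real ZFS the same delta also touches the dnode and snapshot/clone accounting, which is the multi-layer bookkeeping Ahrens is describing.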
I don't think anyone is attempting a full-on, BP-Rewrite-everything kind of implementation. Some people have looked at it from a restricted standpoint... it would depend on the type of implementation.
A separate utility based on libzpool, without adding code to it, would be great. I would very much welcome a separate utility that lets you off-line BP Rewrite your stuff. The issue would be how deeply the utility's fingers would be stuck into that [library] code.
But you still have the issue of the performance of having this giant hash table, and you have to update the accounting at every [ZFS] layer."
Well... what about putting the giant hash table (the block-pointer remap table, like the one ZFS dedup uses) in a network hash table?
For example: I have a 250 GB disk that needs ZFS BP Rewrite. Why can't a standalone utility build the multi-gigabyte table of block mappings and then upload it to the network (to something like Redis or S3), in roughly 1 GB chunks?
Then the standalone utility downloads about 1 GB of the giant hash table at a time and holds it in memory, and processes whichever blocks it can with the portion of the hash table that is in memory.
Alternatively, it could keep nothing in RAM and just fetch everything it needs over the network. What's the problem with assuming 'unlimited' free space for the BP Rewrite hash table using cloud-based storage?
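The chunked idea above can be sketched as follows. This is a hypothetical design, not an existing tool: the remap table is partitioned by a hash of the old location, pointers are grouped by partition, and only one partition is resident in memory at a time (a plain dict stands in for the Redis/S3 store in the proposal).

```python
NUM_CHUNKS = 4

def chunk_of(loc):
    """Which partition of the remap table a block pointer belongs to."""
    return hash(loc) % NUM_CHUNKS

# Stand-in for external storage (Redis/S3 in the proposal), keyed by chunk.
external_store = {c: {} for c in range(NUM_CHUNKS)}

def build_remap(moves):
    """Upload old -> new mappings, partitioned so chunks stay bounded."""
    for old, new in moves:
        external_store[chunk_of(old)][old] = new

def repoint(pointers):
    """Rewrite pointers chunk by chunk, so only ~one chunk is in RAM."""
    by_chunk = {}
    for i, loc in enumerate(pointers):
        by_chunk.setdefault(chunk_of(loc), []).append(i)
    for c, idxs in by_chunk.items():
        resident = external_store[c]  # "download" this chunk into memory
        for i in idxs:
            pointers[i] = resident.get(pointers[i], pointers[i])

moves = [(("vdev0", o), ("vdev0", o + 10_000)) for o in (100, 200, 300)]
build_remap(moves)
ptrs = [("vdev0", 100), ("vdev0", 300), ("vdev0", 999)]
repoint(ptrs)
assert ptrs == [("vdev0", 10_100), ("vdev0", 10_300), ("vdev0", 999)]
```

The trade-off the thread is debating shows up directly here: grouping lookups by chunk trades random network round trips for batched fetches, but every pointer rewrite still pays for remote table access one way or another.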
No need for the hash table. Another option would be to do this offline (no need for step 3 or 4); that would be better than send/receive, because it would not require temporarily putting many terabytes of data somewhere.