
Metadata Allocation Class #3779

Closed
don-brady opened this issue Sep 15, 2015 · 17 comments
Labels: Type: Feature (Feature request or new feature)

@don-brady
Contributor

Intel is working on ways to isolate large-block file data from metadata for ZFS on Linux. In addition to the size discrepancy with file data, metadata often has a more transient lifecycle and additional redundancy requirements (ditto blocks). Metadata is also often a poor match for a RAIDZ tier, since its small blocks cannot be dispersed across a full stripe and the relative parity overhead is high. Mirrored redundancy is a better choice for metadata.

A metadata-only allocation tier is being added to the existing storage pool allocation class mechanism and serves as the primary source for metadata allocations; file data remains in the normal class. Each designated top-level metadata VDEV is tagged as belonging to the metadata allocation class and, at runtime, is associated with the pool's metadata allocation class. The remaining (i.e. non-designated) top-level VDEVs default to the normal allocation class. In addition to generic metadata, the performance-sensitive deduplication table (DDT) data can also benefit from having its own separate allocation class.

| Allocation Class | Purpose |
| --- | --- |
| Normal | Default source for all allocations |
| Log | ZIL records only |
| Metadata (new) | All metadata blocks (non-level-0 file blocks) |
| Dedup Table (new) | Deduplication table data (DDT) |

More details to follow.
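For illustration, here is a rough sketch of how dedicated metadata and DDT VDEVs might be attached to a pool. The syntax follows what eventually landed via PR #5182 (which uses the class name `special` rather than `metadata`); the pool and device names are placeholders.

```sh
# Create a pool with a RAIDZ2 data vdev plus a mirrored SSD pair
# dedicated to the metadata ("special") allocation class.
zpool create tank raidz2 sda sdb sdc sdd sde sdf \
    special mirror nvme0n1 nvme1n1

# Add a mirrored vdev dedicated to the deduplication table (DDT) class.
zpool add tank dedup mirror nvme2n1 nvme3n1

# Optionally allocate small file blocks (here <= 32K) from the special class too.
zfs set special_small_blocks=32K tank
```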

@DeHackEd
Contributor

I swear someone else was working on a metadata-specific vdev for much the same purpose.

@grahamperrin
Contributor

@DeHackEd I have seen writings about the concept but, FWIW, I have no recollection of anyone else working on it in an open-source project.

Found today: #1071 (comment) mentioning Tegile, and Google finds e.g.:

@nwf
Contributor

nwf commented Sep 25, 2015

It might be nice to allow (optionally!) one (or two?) of the metadata ditto blocks to reside in the ordinary pool, as well, making the metadata vdevs a kind of dedicated, write-through cache. (Different from L2ARC with secondarycache=metadata because they really would be holding authoritative copies of the metadata, but in dire straits could still be removed from the pool or used at lower fault tolerance -- one SSD instead of two in a mirror, etc.)

@lschweiss-wustl

I just watched the livestream presentation on this. This is definitely a feature ZFS needs. I've struggled to keep metadata cached: the only way I've been able to get a decent amount of metadata cached is to set the L2ARC to metadata only, but it takes many passes and still probably misses a significant amount. I would love to build pools with metadata on SSD tiers. Metadata typically accounts for about half of my disk I/O load.
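For reference, the L2ARC metadata-only setting mentioned above is controlled by the `secondarycache` property; a minimal example (the pool name `tank` is a placeholder):

```sh
# Cache only metadata, not file data, in the L2ARC for this pool/dataset.
zfs set secondarycache=metadata tank
```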

There was a presentation from Nexenta a couple of years back about tiered storage pools with similar goals: http://www.open-zfs.org/w/images/7/71/ZFS_tiering.pdf. I haven't heard anything about that effort since.

@tuxoko
Contributor

tuxoko commented Jun 2, 2016

Hi @don-brady
May I ask what the current status of this feature is? Is there anything I can help with?
Thanks.

@don-brady
Contributor Author

@tuxoko I'm hoping to post a public WIP branch soon. The creation/addition of VDEVs dedicated to specific metadata classes is functional. I'm currently working out accounting issues in the metaslab layer. We just started running ztest with metadata-only classes to help shake out edge cases (and found a few). Let me come up with a to-do list so others can help.

@tuxoko
Contributor

tuxoko commented Jun 7, 2016

Sounds great!!

@inkdot7
Contributor

inkdot7 commented Sep 2, 2016

Hi @don-brady
I did some more work on, and measurements for, the #4365 implementation. It would be interesting to test this version as well.

@adilger
Contributor

adilger commented Feb 15, 2017

It seems that the WIP pull request #5182 was never referenced here, or vice versa...

@pashford
Contributor

@don-brady,

It's been a year since the last update and I was wondering how this is progressing.

Specifically, I'm interested in DDT devices. Moving the DDT onto dedicated high-speed devices should allow dedup to function nearly as fast as the current memory-only implementation while requiring much less memory.

For storage of VMs, dedupe could easily save much more space than compression, but the current memory requirements usually make it too costly.
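As a rough way to gauge how large such a dedicated DDT device would need to be, the standard tooling can report or simulate the table; a sketch (the pool name is a placeholder):

```sh
# Simulate deduplication on an existing pool and print a DDT histogram
# (read-intensive; can take a long time on large pools).
zdb -S tank

# On a pool that already has dedup enabled, show DDT statistics.
zpool status -D tank
```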

@DeHackEd
Contributor

See #5182 for the WIP. It's gone through a few iterations but I'm running an (old) version here. Very satisfied thus far. (No dedup, just regular metadata)

@nwf
Contributor

nwf commented Aug 23, 2017

@pashford So far as I know, DDT metadata can reside on L2ARC devices. The only thing this would change is to permit writebacks to go to faster media, rather than the primary (spinning rust) storage. That seems unlikely to be a huge improvement versus just having the DDT hang around in the L2ARC.

@pashford
Contributor

@DeHackEd,

Thanks for the information.

@nwf,

> The only thing this would change is to permit writebacks to go to faster media

If the writeback goes to faster media, then a future DDT miss in the ARC/L2ARC will also be served from that faster media, which is a performance win. As an example, if you have a 2 PB pool of 7200 RPM storage and a few fast SSDs (SATA, SAS, or NVMe) as DDT devices, DDT performance WILL be better, especially if only a portion of the DDT is kept in memory.
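To put the memory argument in rough numbers: assuming a 128K average block size for the 2 PB example and the commonly quoted figure of roughly 320 bytes of in-core DDT per unique block (both assumptions, not measurements), the table alone is on the order of terabytes:

```sh
# ~2 PiB of unique 128K blocks at ~320 bytes of DDT per entry -> ~5 TiB of DDT,
# far more than is practical to keep in RAM, but feasible on dedicated SSDs.
echo "$(( 2 * 2**50 / (128 * 2**10) * 320 / 2**40 )) TiB of DDT"
```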

@gf-mse

gf-mse commented Sep 24, 2017

Hi @don-brady, may I ask a silly question: what are the redundancy requirements for DDT storage? That is, would it be possible to reconstruct the deduplication table if the existing DDT data were lost?

@gf-mse

gf-mse commented Sep 24, 2017

(@don-brady) To add to that: we are currently looking into enabling deduplication on our 0.5 PB research storage cluster, and I'd be very interested in testing this feature. We are running zfsonlinux 0.6.5 (Ubuntu 16.04 LTS), but if you could point me toward the most recent update (#5182 (comment)?), I can start with build tests, etc.
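In case it helps others who want to try the WIP, the usual ZFS-on-Linux build-from-source steps are sketched below; the branch name is a placeholder, and on older branches the separate SPL repository may need to be built first.

```sh
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout <wip-branch>   # placeholder for the PR #5182 branch
sh autogen.sh
./configure
make -s -j"$(nproc)"
sudo make install
```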

@adilger
Contributor

adilger commented Oct 12, 2018

Is there anything left to do in this ticket, or should it be closed now that PR #5182 landed?

@behlendorf
Contributor

Yup, we can close this. Thanks.
