
Metadata Allocation Class #3779

Closed
don-brady opened this issue Sep 15, 2015 · 17 comments
Labels: Type: Feature (Feature request or new feature)

@don-brady
Contributor

Intel is working on ways to isolate large-block file data from metadata for ZFS on Linux. In addition to the size discrepancy with file data, metadata often has a more transient lifecycle and additional redundancy requirements (ditto blocks). Metadata is also often a poor match for a RAIDZ tier, since its small blocks cannot be dispersed across a full stripe and the relative parity overhead is high. Mirrored redundancy is a better choice for metadata.

A metadata-only allocation tier is being added to the existing storage pool allocation class mechanism and serves as the primary source for metadata allocations; file data remains in the normal class. Each designated top-level metadata VDEV is tagged as belonging to the metadata allocation class and, at runtime, is associated with the pool's metadata allocation class. The remaining (i.e. non-designated) top-level VDEVs default to the normal allocation class. In addition to generic metadata, the performance-sensitive deduplication table (DDT) data can also benefit from having its own separate allocation class.

| Allocation Class | Purpose |
| --- | --- |
| Normal | Default source for all allocations |
| Log | ZIL records only |
| Metadata (new) | All metadata blocks (non-level-0 file blocks) |
| Dedup Table (new) | Deduplication table data (DDT) |

More details to follow.
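For illustration, here is a rough sketch of how dedicated metadata and DDT VDEVs might be attached to a pool. The syntax follows what eventually landed via PR #5182 (which uses the class name `special` rather than `metadata`); the pool and device names are placeholders.

```sh
# Create a pool with a RAIDZ2 data vdev plus a mirrored SSD pair
# dedicated to the metadata ("special") allocation class.
zpool create tank raidz2 sda sdb sdc sdd sde sdf \
    special mirror nvme0n1 nvme1n1

# Add a mirrored vdev dedicated to the deduplication table (DDT) class.
zpool add tank dedup mirror nvme2n1 nvme3n1

# Optionally allocate small file blocks (here <= 32K) from the special class too.
zfs set special_small_blocks=32K tank
```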

@DeHackEd
Contributor

I swear someone else was working on a metadata-specific vdev for much the same purpose.

@grahamperrin
Contributor

@DeHackEd I have seen writings about the concept but, FWIW, I have no recollection of anyone else working on it in an open-source project.

Found today: #1071 (comment) mentioning Tegile, and Google finds e.g.:

@nwf
Contributor

nwf commented Sep 25, 2015

It might be nice to allow (optionally!) one (or two?) of the metadata ditto blocks to reside in the ordinary pool, as well, making the metadata vdevs a kind of dedicated, write-through cache. (Different from L2ARC with secondarycache=metadata because they really would be holding authoritative copies of the metadata, but in dire straits could still be removed from the pool or used at lower fault tolerance -- one SSD instead of two in a mirror, etc.)

@lschweiss-wustl

I just watched the livestream presentation on this. This is definitely a feature ZFS needs. I've struggled to keep metadata cached: the only way I've been able to get a decent amount of metadata cached is to set the L2ARC to metadata only, but it takes many passes and still probably misses a significant amount. I would love to build pools with metadata on SSD tiers. Metadata typically accounts for about half of my disk I/O load.
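For reference, the L2ARC metadata-only setting mentioned above is controlled by the `secondarycache` property; a minimal example (the pool name `tank` is a placeholder):

```sh
# Cache only metadata, not file data, in the L2ARC for this pool/dataset.
zfs set secondarycache=metadata tank
```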

There was a presentation from Nexenta a couple of years back about tiered storage pools with similar goals: http://www.open-zfs.org/w/images/7/71/ZFS_tiering.pdf. I haven't heard anything about that effort since.

@tuxoko
Contributor

tuxoko commented Jun 2, 2016

Hi @don-brady
May I ask what the current status of this feature is? Is there anything I can help with?
Thanks.

@don-brady
Contributor Author

@tuxoko I'm hoping to post a public WIP branch soon. The creation/addition of VDEVs dedicated to specific metadata classes is functional. I'm currently working out accounting issues in the metaslab layer. We just started running ztest with metadata-only classes to help shake out edge cases (and found a few). Let me come up with a to-do list so others can help.

@tuxoko
Contributor

tuxoko commented Jun 7, 2016

Sounds great!!

@inkdot7
Contributor

inkdot7 commented Sep 2, 2016

Hi @don-brady
I did some more work on, and measurements for, the #4365 implementation. It would be interesting to test this version as well.

@adilger
Contributor

adilger commented Feb 15, 2017

It seems that the WIP pull request #5182 was never referenced here, or vice versa...

@pashford
Contributor

@don-brady,

It's been a year since the last update and I was wondering how this is progressing.

Specifically, I'm interested in DDT devices. Moving the DDT onto dedicated high-speed devices should allow dedup to function nearly as fast as the current memory-only implementation while requiring much less memory.

For storage of VMs, dedupe could easily save much more space than compression, but the current memory requirements usually make it too costly.
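As a rough way to gauge how large such a dedicated DDT device would need to be, the standard tooling can report or simulate the table; a sketch (the pool name is a placeholder):

```sh
# Simulate deduplication on an existing pool and print a DDT histogram
# (read-intensive; can take a long time on large pools).
zdb -S tank

# On a pool that already has dedup enabled, show DDT statistics.
zpool status -D tank
```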

@DeHackEd
Contributor

See #5182 for the WIP. It's gone through a few iterations but I'm running an (old) version here. Very satisfied thus far. (No dedup, just regular metadata)

@nwf
Contributor

nwf commented Aug 23, 2017

@pashford So far as I know, DDT metadata can reside on L2ARC devices. The only thing this would change is to permit writebacks to go to faster media, rather than the primary (spinning rust) storage. That seems unlikely to be a huge improvement versus just having the DDT hang around in the L2ARC.

@pashford
Contributor

@DeHackEd,

Thanks for the information.

@nwf,

> The only thing this would change is to permit writebacks to go to faster media

If the writeback goes to faster media, then a future DDT miss in the ARC/L2ARC will also be served from that faster media, which is a performance win. As an example, if you have a 2 PB pool of 7200 RPM storage and a few fast SSDs (SATA, SAS, or NVMe) as DDT devices, DDT performance WILL be better, especially if only a portion of the DDT is kept in memory.
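To put the memory argument in rough numbers: assuming a 128K average block size for the 2 PB example and the commonly quoted figure of roughly 320 bytes of in-core DDT per unique block (both assumptions, not measurements), the table alone is on the order of terabytes:

```sh
# ~2 PiB of unique 128K blocks at ~320 bytes of DDT per entry -> ~5 TiB of DDT,
# far more than is practical to keep in RAM, but feasible on dedicated SSDs.
echo "$(( 2 * 2**50 / (128 * 2**10) * 320 / 2**40 )) TiB of DDT"
```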

@gf-mse

gf-mse commented Sep 24, 2017

Hi @don-brady, may I ask a silly question: what are the redundancy requirements for DDT storage? That is, would it be possible to reconstruct the deduplication table if the existing DDT data were lost?

@gf-mse

gf-mse commented Sep 24, 2017

(@don-brady) To add to that: we are currently looking into enabling deduplication on our 0.5 PB research storage cluster, and I'd be very interested in testing this feature. We are running zfsonlinux 0.6.5 (Ubuntu 16.04 LTS), but if you could point me toward the most recent update (#5182 (comment)?), I can start with build tests, etc.
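In case it helps others who want to try the WIP, the usual ZFS-on-Linux build-from-source steps are sketched below; the branch name is a placeholder, and on older branches the separate SPL repository may need to be built first.

```sh
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout <wip-branch>   # placeholder for the PR #5182 branch
sh autogen.sh
./configure
make -s -j"$(nproc)"
sudo make install
```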

@adilger
Contributor

adilger commented Oct 12, 2018

Is there anything left to do in this ticket, or should it be closed now that PR #5182 landed?

@behlendorf
Contributor

Yup, we can close this. Thanks.
