Use ZFS metadata to compare checksums #40

one-github · 2019-11-14T20:04:02Z

I use ZFS to store all my files in one big pool. Sometimes I have duplicates, which I identify with rdfind in a dry run; I check results.txt manually and, if I'm happy with it, re-run rdfind to actually delete the duplicate files. So rdfind not only has to read the remaining files in full at least twice to compute the checksums, it also does not leverage the per-block checksums that are an inherent feature of ZFS (and are calculated anyway, at write time).

The ZFS command zdb gives an indication of how this could work. To list the files (and their ZFS object IDs) on a given ZFS file system (here minitank/fw/video/gopro):

$ zdb -vvv minitank/fw/video/gopro | grep \(type:\ Regular | head -n 100
(...)
		.VolumeIcon.icns = 7 (type: Regular File)
		com.apple.FinderInfo = 9 (type: Regular File)
		VolumeConfiguration.plist = 16 (type: Regular File)
(...)
		GOPR0410.MP4 = 110 (type: Regular File)
		GOPR0428.LRV = 109 (type: Regular File)
		G0030427.JPG = 108 (type: Regular File)
		GOPR0434.MP4 = 121 (type: Regular File)
		G0010416.JPG = 120 (type: Regular File)
		GOPR0433.MP4 = 116 (type: Regular File)
		G0020421.JPG = 115 (type: Regular File)
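For a proof of concept, the listing above could be turned into (name, object ID) pairs. This is only a sketch: the function name `parse_regular_files` and the regular expression are my own assumptions, and they rely on the line shape shown in the output above, which zdb does not guarantee to keep stable.

```python
import re

# Matches directory-entry lines such as "GOPR0410.MP4 = 110 (type: Regular File)".
# Assumption: the line shape is exactly the one shown in the zdb listing above.
ENTRY_RE = re.compile(r"^\s*(?P<name>.+?) = (?P<obj>\d+) \(type: Regular File\)\s*$")


def parse_regular_files(zdb_text):
    """Return a list of (file name, object ID) for every regular-file entry."""
    entries = []
    for line in zdb_text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            entries.append((match.group("name"), int(match.group("obj"))))
    return entries


# Two lines taken from the listing above, plus an elided line that is skipped.
sample = """\
\t\tGOPR0410.MP4 = 110 (type: Regular File)
(...)
\t\tG0020421.JPG = 115 (type: Regular File)
"""

for name, obj in parse_regular_files(sample):
    print(obj, name)
```

A list of pairs is used instead of a dict because the same file name can appear in more than one directory of the file system.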

This makes it possible to query, for example, the checksum of each block of ZFS object 115 (file G0020421.JPG):

$ zdb -vvvvv minitank/fw/video/gopro 115 | grep "L0 ZFS plain file" | grep -o cksum=.*
cksum=2b0ad5f93fb4:7298cf4a6946d86:9121b965256654b8:1bc1f6f4a2a099d8
cksum=4097453f3d6c:103cb4291526d015:8afcb0a258e95b77:a9e71a3e8f82f19d
cksum=4022edcd7609:100fea58fa9ac828:2ca5db609ae0a854:faea7f497f7daf2c
cksum=404ea3b38c21:1015c2c361081102:bcc31ed291dca604:4b126a2d4052739b
cksum=3f598a2ab67a:fde544c32e664c2:abd956aeed12b9db:7cf2e4d2a6c9bd4f
cksum=3f675000fe35:fd98072b0095081:5a98cae21efdd757:f7fbf8cd6fa73304
cksum=3ef7cee7b23e:fcc06f4ad0f2b00:9588581ab10b1fe6:c6ee212e369f0f79
cksum=3e4bcddec672:f94b6b8790382e5:628b25de42d08d2a:b563887a5e7995a8
cksum=3d43f9295614:f4fc8a2c9ced025:9bf45e95e78116e6:97f66dbb9bcc3da7
cksum=3db1b8dcb8ea:f6898164ef8961d:717da2e5f2a8ae7c:7d1ff9f6e079a349
cksum=3e8278d5f3eb:f969929d7718668:cc9012a3cdbee3ab:be677c0b5973d2e
cksum=3edd74e3736a:fb28bbd504ea3ed:29d1881b6fdf791f:d88937b1fa135fe6
cksum=3e6d420fa43c:f925adf1e0beeae:e6ec2c707a56118d:2b704c513ab8e7a4
cksum=3df8ee864c70:f7c1a08ad5774eb:9f92915c010af70:37f591cd192339ae
cksum=3d95d67f89ef:f60c81b1766b80c:a5ffc708f633a88f:a4fa96a5806af28d
cksum=3e191cd0dcf0:f7a200c5a8e12ce:dae62952d2d776eb:a932d6341e2bf523
cksum=409332d2588d:ffff751b79ce0c4:13eb43c9f54317cc:352b881aa6025856
cksum=53e67ac1208:283c714f9b3a2a1:78069759bb33e8b1:1e280f42ef6e382a
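A PoC could collect these values per file in Python instead of with grep. The sketch below assumes, as the grep pipeline above does, that every data block line contains the text "L0 ZFS plain file" and a `cksum=` field with four colon-separated hex words. The helper name `parse_block_checksums` is hypothetical, and the `...` parts of the sample lines are placeholders for zdb fields that are not reproduced here.

```python
import re

# Four colon-separated hex words, as in the cksum= values shown above.
CKSUM_RE = re.compile(r"cksum=([0-9a-f]+(?::[0-9a-f]+){3})")


def parse_block_checksums(zdb_text):
    """Return the checksums of all L0 data blocks, in the order they appear.

    Each checksum becomes a tuple of four integers, so hex words printed
    without leading zeros still compare correctly.
    """
    checksums = []
    for line in zdb_text.splitlines():
        if "L0 ZFS plain file" not in line:
            continue  # skip indirect blocks and unrelated lines
        match = CKSUM_RE.search(line)
        if match:
            words = match.group(1).split(":")
            checksums.append(tuple(int(word, 16) for word in words))
    return checksums


# Placeholder lines: only the matched text and the cksum field are meaningful.
sample = (
    "... L1 ZFS plain file ... cksum=1:2:3:4\n"
    "... L0 ZFS plain file ... cksum=2b0ad5f93fb4:7298cf4a6946d86:9121b965256654b8:1bc1f6f4a2a099d8\n"
    "... L0 ZFS plain file ... cksum=53e67ac1208:283c714f9b3a2a1:78069759bb33e8b1:1e280f42ef6e382a\n"
)

blocks = parse_block_checksums(sample)
print(len(blocks), "data block checksums")
```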

Having rdfind call the zdb command is not a very good idea (except maybe for a PoC), as the output format is clearly not meant for automatic processing (it is also said not to be backward compatible across ZFS releases). But zdb has no real magic: AFAIK it just queries the ZFS API and prints the resulting information as text.

So when using the ZFS block checksums, reading the actual files to check for equality would be unnecessary, which would make rdfind even quicker.
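To illustrate the idea: once a list of block checksums is known for each file, duplicate candidates can be grouped without reading any file data. This is a sketch with made-up names and values (`group_duplicate_candidates`, the file names and the short checksum tuples are all hypothetical). It also assumes the files were written with the same checksum algorithm, record size and compression settings, since otherwise identical contents would not be expected to produce identical block checksums.

```python
from collections import defaultdict


def group_duplicate_candidates(file_checksums):
    """Group file names whose block-checksum sequences are identical.

    file_checksums maps a file name to its ordered list of block checksums.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for name, checksums in file_checksums.items():
        groups[tuple(checksums)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Hypothetical input: two files share all block checksums, one does not.
example = {
    "a.jpg": [(1, 2, 3, 4), (5, 6, 7, 8)],
    "copy_of_a.jpg": [(1, 2, 3, 4), (5, 6, 7, 8)],
    "other.jpg": [(1, 2, 3, 4), (9, 9, 9, 9)],
}

candidates = group_duplicate_candidates(example)
print(candidates)
```

Whether a match should count as proof of equality or only as a candidate for a final byte comparison would depend on how strong the configured checksum is.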

In a small experiment, I could confirm that the cksum values are stable across different ZFS file systems (bar the checksum of the last block, because of a shorter block length on one of the file systems) and also within one file system for a simple copy of the file (with a different name, access times, etc.).
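Given that observation, a comparison across file systems might need to leave out the last block and verify the tail of the file some other way. A minimal sketch, where the function name `checksums_match`, the `ignore_last_block` parameter and the sample values are assumptions for illustration only:

```python
def checksums_match(a, b, ignore_last_block=False):
    """Compare two ordered lists of block checksums.

    With ignore_last_block=True the final block is left out of the
    comparison, so the caller still has to compare the file tails itself.
    """
    if len(a) != len(b):
        return False  # a different block count cannot be the same layout
    if ignore_last_block:
        a, b = a[:-1], b[:-1]
    return a == b


# Hypothetical checksum lists that differ only in the last block.
on_fs_one = ["aa", "bb", "cc"]
on_fs_two = ["aa", "bb", "dd"]

print(checksums_match(on_fs_one, on_fs_two))
print(checksums_match(on_fs_one, on_fs_two, ignore_last_block=True))
```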

pauldreik added the enhancement New feature or request label Nov 15, 2019

matuusu mentioned this issue Nov 24, 2021: Use ZFS checksums for faster comparison pkolaczk/fclones#94 (Open)