Use ZFS metadata to compare checksums #40

Open
one-github opened this issue Nov 14, 2019 · 0 comments
Labels
enhancement New feature or request


I use ZFS to store all my files in one big pool. Sometimes I have duplicates, which I identify with rdfind in a dry run; I then check results.txt manually and, if I am happy with it, re-run rdfind to actually delete the duplicate files. This means rdfind has to read the remaining files in full at least twice to compute the checksums, and it does not take advantage of the per-block checksums that are an inherent feature of ZFS (they are calculated anyway, at write time of the file).

The ZFS command zdb gives an indication of how this could work. To query which files (and their ZFS object IDs) are on a given ZFS file system (here minitank/fw/video/gopro):

$ zdb -vvv minitank/fw/video/gopro | grep \(type:\ Regular | head -n 100
(...)
		.VolumeIcon.icns = 7 (type: Regular File)
		com.apple.FinderInfo = 9 (type: Regular File)
		VolumeConfiguration.plist = 16 (type: Regular File)
(...)
		GOPR0410.MP4 = 110 (type: Regular File)
		GOPR0428.LRV = 109 (type: Regular File)
		G0030427.JPG = 108 (type: Regular File)
		GOPR0434.MP4 = 121 (type: Regular File)
		G0010416.JPG = 120 (type: Regular File)
		GOPR0433.MP4 = 116 (type: Regular File)
		G0020421.JPG = 115 (type: Regular File)
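
For a PoC, the listing above could be turned into machine-friendly "object ID, file name" pairs. This is only a sketch: the function name list_regular_files is made up for illustration, and it assumes the directory entries are formatted exactly as in the output above, which other zdb versions may not do.

```shell
#!/bin/sh
# list_regular_files (hypothetical helper): read `zdb -vvv <dataset>` output
# on stdin and print one "<object-id> <file-name>" line per regular file.
# Assumption: entries look like "NAME = ID (type: Regular File)", as in the
# listing above. All other lines are ignored.
list_regular_files() {
    sed -n 's/^[[:space:]]*\(.*\) = \([0-9][0-9]*\) (type: Regular File)$/\2 \1/p'
}

# Usage (dataset name taken from the example above):
#   zdb -vvv minitank/fw/video/gopro | list_regular_files
```

Putting the object ID first keeps file names with spaces in one piece, since everything after the first space is the name.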

This makes it possible to query, for example, the checksum of each block of ZFS object 115 (file G0020421.JPG):

$ zdb -vvvvv minitank/fw/video/gopro 115 | grep "L0 ZFS plain file" | grep -o cksum=.*
cksum=2b0ad5f93fb4:7298cf4a6946d86:9121b965256654b8:1bc1f6f4a2a099d8
cksum=4097453f3d6c:103cb4291526d015:8afcb0a258e95b77:a9e71a3e8f82f19d
cksum=4022edcd7609:100fea58fa9ac828:2ca5db609ae0a854:faea7f497f7daf2c
cksum=404ea3b38c21:1015c2c361081102:bcc31ed291dca604:4b126a2d4052739b
cksum=3f598a2ab67a:fde544c32e664c2:abd956aeed12b9db:7cf2e4d2a6c9bd4f
cksum=3f675000fe35:fd98072b0095081:5a98cae21efdd757:f7fbf8cd6fa73304
cksum=3ef7cee7b23e:fcc06f4ad0f2b00:9588581ab10b1fe6:c6ee212e369f0f79
cksum=3e4bcddec672:f94b6b8790382e5:628b25de42d08d2a:b563887a5e7995a8
cksum=3d43f9295614:f4fc8a2c9ced025:9bf45e95e78116e6:97f66dbb9bcc3da7
cksum=3db1b8dcb8ea:f6898164ef8961d:717da2e5f2a8ae7c:7d1ff9f6e079a349
cksum=3e8278d5f3eb:f969929d7718668:cc9012a3cdbee3ab:be677c0b5973d2e
cksum=3edd74e3736a:fb28bbd504ea3ed:29d1881b6fdf791f:d88937b1fa135fe6
cksum=3e6d420fa43c:f925adf1e0beeae:e6ec2c707a56118d:2b704c513ab8e7a4
cksum=3df8ee864c70:f7c1a08ad5774eb:9f92915c010af70:37f591cd192339ae
cksum=3d95d67f89ef:f60c81b1766b80c:a5ffc708f633a88f:a4fa96a5806af28d
cksum=3e191cd0dcf0:f7a200c5a8e12ce:dae62952d2d776eb:a932d6341e2bf523
cksum=409332d2588d:ffff751b79ce0c4:13eb43c9f54317cc:352b881aa6025856
cksum=53e67ac1208:283c714f9b3a2a1:78069759bb33e8b1:1e280f42ef6e382a
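
To compare many files, a per-block list like the one above could be collapsed into a single short fingerprint per file. The sketch below assumes the zdb line format shown here (lines containing "L0 ZFS plain file" and a cksum= field); the helper names block_cksums and file_fingerprint are invented for illustration. As far as I understand, the block checksums are only comparable between files written with the same recordsize, checksum and compression settings, and a match should be treated as a candidate rather than as proof of equality.

```shell
#!/bin/sh
# block_cksums (hypothetical helper): read `zdb -vvvvv <dataset> <object-id>`
# output on stdin and print the checksum of each level-0 data block,
# one per line, in the order zdb lists them.
block_cksums() {
    sed -n '/L0 ZFS plain file/s/.*cksum=\([0-9a-f:]*\).*/\1/p'
}

# file_fingerprint (hypothetical helper): collapse the block checksum list
# into one line "<block-count> <sha256-of-the-list>".
# Assumption: sha256sum or shasum is installed.
file_fingerprint() {
    list=$(block_cksums)
    count=$(printf '%s\n' "$list" | awk 'NF { n++ } END { print n + 0 }')
    if command -v sha256sum >/dev/null 2>&1; then
        digest=$(printf '%s\n' "$list" | sha256sum)
    else
        digest=$(printf '%s\n' "$list" | shasum -a 256)
    fi
    # Both tools print "<hash>  -"; keep only the hash.
    printf '%s %s\n' "$count" "${digest%% *}"
}

# Usage:
#   zdb -vvvvv minitank/fw/video/gopro 115 | file_fingerprint
```

Two objects with the same fingerprint have the same number of blocks and the same block checksums in the same order, so sorting the fingerprints groups duplicate candidates without reading any file data.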

Now, having rdfind call the zdb command is not a very good idea (except maybe for a PoC), as the output format is clearly not meant for automatic processing (it is also said not to be backward compatible across ZFS releases). But zdb has no real magic; AFAIK it just queries the ZFS API and prints the resulting information as text.

So with these ZFS block checksums, reading the actual files to check for equality would be unnecessary, which would make rdfind even quicker.

In a small experiment I could confirm that the cksum values are stable across different ZFS file systems (bar the checksum of the last block, because of a shorter block length on one of the ZFS file systems) and also within a file system for a simple copy of the file (with a different name, access times, etc.).
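
The cross-file-system comparison could be sketched roughly like this, ignoring the last block as described. The function name same_leading_blocks is hypothetical, and it expects two files that each hold one checksum per line, such as the saved output of the zdb/grep pipeline shown earlier. It is a sketch of the experiment, not a safe equality test: the last block still has to be checked some other way.

```shell
#!/bin/sh
# same_leading_blocks A B (hypothetical helper): succeed if the checksum lists
# in files A and B have the same number of blocks and agree on every block
# except the last one.
same_leading_blocks() {
    na=$(grep -c '' "$1")   # number of blocks listed in A
    nb=$(grep -c '' "$2")   # number of blocks listed in B
    # With fewer than two blocks there is nothing left to compare once the
    # last block is dropped, so treat that as "no match".
    [ "$na" -ge 2 ] || return 1
    [ "$na" -eq "$nb" ] || return 1
    a=$(sed '$d' "$1")      # all lines of A except the last
    b=$(sed '$d' "$2")      # all lines of B except the last
    [ "$a" = "$b" ]
}

# Usage (file names are placeholders):
#   same_leading_blocks cksums_fs1.txt cksums_fs2.txt && echo "candidate match"
```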
