Mac ZFS triggers duplicate-finding programs to fail randomly #808

Open
mauricev opened this issue Apr 21, 2023 · 9 comments

Comments

@mauricev

There is a serious bug in Mac ZFS: programs that find duplicate files often fail, seemingly at random, to detect duplicates.

Two such programs are Gemini and Duplicate File Finder.

These programs work perfectly on APFS disks, but on Mac ZFS they randomly generate false negatives. There is no rhyme or reason as to which duplicated file is wrongly reported as not being a duplicate of its twin. The md5 hash of the files in question is always the same, so Mac ZFS is not altering the contents of any file. It seems more likely to be a problem with iterating over the file/folder directory tree, where ZFS is somehow not returning every file that is present. It's not clear whether this occurs in the source directory, the destination directory, or both.
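One way to check the iteration theory independently of either app is a small script that walks both trees and groups files by md5. This is only a sketch: `md5_of` and `find_duplicates` are hypothetical helper names, and it says nothing about how Gemini or Duplicate File Finder actually enumerate or compare files.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(*roots):
    """Walk every root and group regular files by MD5 digest."""
    by_digest = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks and anything that is not a regular file.
                if os.path.isfile(path) and not os.path.islink(path):
                    by_digest[md5_of(path)].append(path)
    # Keep only digests shared by more than one path.
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}
```

If a pair that this reports as duplicates is missing from an app's results on ZFS, that would at least show the files are reachable through an ordinary directory walk.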

This bug is present in 2.16 (Monterey Intel) and at least some earlier builds.

@jawbroken

Do you actually mean it fails randomly, i.e. that you get different results every time you run a scan? Or do you mean you get the same results every time but you don't understand why you are getting false negatives?

@mauricev
Author

The second one. For example, in Gemini, I had it scan duplicate Word, PowerPoint and Excel files, and it would consistently act as if the Excel file was not a duplicate, even though the md5 says otherwise. Duplicate File Finder was consistently skipping over two JMP files (JMP is a statistics package). Once again, md5 says they're identical to their originals. So my guess is that there is a problem with file iteration, not with the files themselves. But this is hard to explain, since file iteration obviously works in the Finder, and with rsync and FreeFileSync.

@lundman
Contributor

lundman commented Apr 22, 2023

I'd be curious how it is implemented in these apps.

They might be referring to hard links, which are hard to detect once you make them. macOS does have fcntls for "next" and "prev" links, which is something we had to attempt to implement. There's been no good way to test if they work. So there could be a bug in here.

There is also the fact that macOS has "file_id" (like POSIX) but also "link_id". The file_id should be the same for all hard links to a file, but each link should have a unique "link_id". Could be a bug in here.
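The shared file ID can be checked from a script. The sketch below assumes that `st_ino` from `os.stat` corresponds to the "file_id" described here; as far as I know Python's `os.stat` does not expose the per-link "link_id", so that part still needs a getattrlist tool. `link_summary` and `demo` are made-up names for illustration.

```python
import os
import tempfile


def link_summary(paths):
    """Map each path to (device, inode, link count) from os.stat."""
    summary = {}
    for path in paths:
        st = os.stat(path)
        summary[path] = (st.st_dev, st.st_ino, st.st_nlink)
    return summary


def demo():
    """Create one file plus two hard links and summarise all three names."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "source.bin")
        with open(source, "wb") as handle:
            handle.write(b"example data")
        links = [os.path.join(tmp, n) for n in ("linkA.bin", "linkB.bin")]
        for link in links:
            os.link(source, link)
        # All three names should report the same device and inode.
        return list(link_summary([source, *links]).values())
```

To try this on a ZFS dataset rather than the default temp location, `tempfile.TemporaryDirectory(dir=...)` could be pointed at a directory on that volume.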

Then if you query /.vol/123456789/1234, which is a secret way to look up files by file ID or link ID, it should return the correct name (if you use the link ID). Could be a bug in here.
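For reference, a path of that shape can be built from an ordinary `os.stat` call. This is a sketch that assumes `st_dev` is the file-system ID and `st_ino` the file ID used in the /.vol path; `vol_path` is a hypothetical helper, not part of any tool mentioned here.

```python
import os


def vol_path(path):
    """Build the /.vol/<device>/<inode> path for an existing file."""
    st = os.stat(path)
    return f"/.vol/{st.st_dev}/{st.st_ino}"
```

On macOS, calling `os.stat(vol_path(p))` and comparing `st_ino` with the original would be one way to see whether the lookup resolves to the same file. On other systems the path simply will not exist.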

The last two can be tested with any getattrlist program, like FSMegaInfo.

@lundman
Contributor

lundman commented Apr 22, 2023

Last two seem OK

# Start with a source file to link against
$ cp macos_zfs /Volumes/BOOM/source.bin

# Create some links to point to it
$ ln source.bin linkA.bin
$ ln source.bin linkB.bin

$ stat /Volumes/BOOM/source.bin
939524107 692 -rwxr-xr-x 3 root wheel 0 5035736 "Apr 22 15:24:39 2023" "Apr 22 15:24:39 2023" "Apr 22 15:25:02 2023" "Apr 22 15:24:39 2023" 131072 4846 0 source.bin

# So the file-system ID is "939524107" and the files all have fileID "692"

$ ~/FSMegaInfo getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID source.bin 

getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /Volumes/BOOM/source.bin
    ATTR_CMN_NAME   = 'source.bin'
    ATTR_CMN_OBJID  = (objno = 0x80000001, generation = 0x00000000)
    ATTR_CMN_FILEID = 692

# So "source.bin" has file ID 692 and link ID "0x80000001"

$ ~/FSMegaInfo getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID linkA.bin

getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /Volumes/BOOM/linkA.bin
    ATTR_CMN_NAME   = 'linkA.bin'
    ATTR_CMN_OBJID  = (objno = 0x80000002, generation = 0x00000000)
    ATTR_CMN_FILEID = 692

# So "linkA.bin" has file ID 692 and link ID "0x80000002"

$ ~/FSMegaInfo getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID linkB.bin

getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /Volumes/BOOM/linkB.bin
    ATTR_CMN_NAME   = 'linkB.bin'
    ATTR_CMN_OBJID  = (objno = 0x80000003, generation = 0x00000000)
    ATTR_CMN_FILEID = 692

# So "linkB.bin" has file ID 692 and link ID "0x80000003"

# So it seems the unique linkID is working fine, and returns the correct name.

# Now to try looking up by ID:

$ ~/FSMegaInfo getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/692   

getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/692
    ATTR_CMN_NAME   = 'linkB.bin' 
    ATTR_CMN_OBJID  = (objno = 0x80000003, generation = 0x00000000)
    ATTR_CMN_FILEID = 692

# So using 692 will return last used name. Is that what apfs does?

# Note that 0x80000003 == 2147483651

$ ~/FSMegaInfo getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/2147483651

getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/2147483651
    ATTR_CMN_NAME   = 'linkB.bin'
    ATTR_CMN_OBJID  = (objno = 0x80000003, generation = 0x00000000)
    ATTR_CMN_FILEID = 692


$ ~/FSMegaInfo getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/2147483650

getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/2147483650
    ATTR_CMN_NAME   = 'linkA.bin'
    ATTR_CMN_OBJID  = (objno = 0x80000002, generation = 0x00000000)
    ATTR_CMN_FILEID = 692

$ ~/FSMegaInfo getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/2147483649

getattrlist -ATTR_CMN_NAME,ATTR_CMN_FILEID,ATTR_CMN_OBJID /.vol/939524107/2147483649
    ATTR_CMN_NAME   = 'source.bin'
    ATTR_CMN_OBJID  = (objno = 0x80000001, generation = 0x00000000)
    ATTR_CMN_FILEID = 692

# So reverse names also appear correct, and working. 

It could be worth checking whether stat'ing by 692 (the fileID) behaves the same as on APFS.

Then check that link-next and link-prev work the same. I'm not sure if there is a test tool for that.
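The link-ID lookups in the transcript above use the decimal form of each hexadecimal objno. A tiny sketch of that conversion, using the file-system ID and objno values shown in the transcript (`link_id_paths` is a hypothetical helper name):

```python
def link_id_paths(fsid, objnos):
    """Turn hexadecimal objno strings into /.vol/<fsid>/<decimal id> paths."""
    return [f"/.vol/{fsid}/{int(objno, 16)}" for objno in objnos]


# Values taken from the transcript above.
paths = link_id_paths(939524107, ["0x80000001", "0x80000002", "0x80000003"])
```

The resulting paths match the three link-ID paths queried with FSMegaInfo above.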

@jawbroken

Any chance there are extended attributes or similar involved? I'm only guessing, but files having the same contents and the same hash doesn't necessarily mean these tools consider them identical.

@mauricev
Author

These are all regular files. I do believe they're essentially using hashes, though Gemini says they have a "proprietary" algorithm. Duplicate File Finder has an option for a "slower" check, and I suspect that is where it computes hashes.

It's more likely that files are being skipped during iteration. For example, when I tested Duplicate File Finder, I had two parent folders to check, each with a subfolder containing a file. When I scanned the subfolders, it declared this file a duplicate, but when I told it to scan the parent folders, it didn't. So it sounds like something is throwing off the iteration and it just didn't see this file when starting from the parent folder.
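That symptom (a file seen when scanning the subfolder but not the parent) can be tested with a short script. This is a sketch with made-up helper names, and both walks go through Python's usual directory-reading path, so a clean result would not rule out a problem in whatever enumeration API the app uses.

```python
import os


def walk_files(root):
    """Return the set of file paths found by os.walk starting at root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.join(dirpath, name))
    return found


def missing_from_parent_scan(parent):
    """Return files seen when walking each subfolder but not the parent."""
    from_parent = walk_files(parent)
    missing = set()
    with os.scandir(parent) as entries:
        for entry in entries:
            # Walk each immediate subfolder on its own and compare.
            if entry.is_dir(follow_symlinks=False):
                missing |= walk_files(entry.path) - from_parent
    return missing
```

An empty set from `missing_from_parent_scan` on the affected folders would suggest that plain iteration is consistent there.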

@jawbroken

jawbroken commented Apr 24, 2023

It just seems weird because there are many other pieces of software, from Finder to command line tools, that have no trouble listing the contents of a directory. Why are these two duplicate-finders the exception?

@mauricev
Author

I agree. The Finder works. rsync works, FreeFileSync works. The file navigation dialogs work. But I had a Mac VM with two ZFS disks, and Duplicate File Finder was missing a whole slew of duplicate files. When I reformatted the disks as APFS, the program worked great. The way forward is for you guys to test the programs yourselves and confirm that my machine isn't possessed and that I'm not crazy.

@mauricev
Author

mauricev commented May 5, 2023

I discovered that Gemini also fails sporadically on APFS, so right now it's only Duplicate File Finder that fails specifically on Mac ZFS.


3 participants