Repair to merge tags of identical files #20
Comments
Hi there. Thanks for reporting this issue. You're right: when a repair is made, the tags should be merged in from the other file so that the remaining file carries the superset. I'll look at this as part of the 0.5.0 release. In the meantime, TMSU has a workaround:
When you identify a duplicate, the easiest way forward is to choose which of the two files you wish to keep and then tag it with all of the other file's tags:
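For example (a sketch, not from the original thread: 'keep.txt' and 'other.txt' are placeholders, and the pipeline assumes `tmsu tags FILE` prints the file name, a colon, then its tags on one line):

```
$ tmsu tag keep.txt $(tmsu tags other.txt | sed 's/^[^:]*: *//')
```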
Then you can safely delete the second file:
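For instance, with 'other.txt' standing in for the discarded duplicate:

```
$ rm other.txt    # its tags now live on the file you kept
```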
I think this will have to wait until after 0.5.0. Something about 'repair' automatically synchronizing tags across duplicates doesn't sit right with me, and I can't quite put it into words. Part of the worry is that the operation could be slow: identifying all sets of duplicates and then comparing their tags. Or it could be that a user might have multiple files with the same fingerprint yet consider them separate files, though I can't come up with a convincing case to illustrate what I mean. Let me think about this some more.
Let me argue the case for this issue more strongly. :-) From the user's perspective, this is a bug: if duplicate files have been merged, then the tags should follow. When I repair anything, I expect it may be slow, but it has to be right. I don't mind waiting longer for the process to fix everything correctly. Most importantly, information (tags) should not be lost. The "same fingerprint but considered separate files" case doesn't hold: either the fingerprint algorithm is not strong enough to identify uniqueness, or you are trying to trick the application. If it's the latter, then all bets are off.
I'll look at this issue soon. It had fallen off my radar.
@limelime: I want to point out that generic hashing algorithms can only make probabilistic guarantees about their behaviour, not absolute ones. It might be unlikely for two files with different content to share a fingerprint, but it is perfectly possible no matter the strength of your fingerprinting: no hash, however strong, can absolutely guarantee that two files with different content will never share a hash. I have personally encountered collisions with both MD5 and SHA1. (This is one of the reasons I'm extremely cautious about …)
I don't see how you could think this is a bug. TMSU does not combine tagged files during a repair operation: it will only repair a moved file by identifying an untagged file with the same fingerprint. If TMSU cannot find an untagged file with the same fingerprint then it will, instead, report the file as missing. At no point could any tag information be lost, because files are simply not merged by 'repair' at this time.

You seem to be saying that 'if TMSU implemented my original suggestion there would be a bug in the implementation', which doesn't make sense, as you have no idea how it would be implemented. Unless I'm missing something? Are you saying you have identified a bug in the 'repair' subcommand where missing files are repaired by merging them with an already tagged file? I have performed a couple of tests and everything seems to be working as expected:
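A hypothetical session illustrating the behaviour just described (file names, tags, and outcomes are placeholders, not the actual transcript):

```
$ echo hello > a.txt
$ tmsu tag a.txt greeting
$ mv a.txt b.txt          # b.txt: untagged, same fingerprint as a.txt
$ tmsu repair             # expected: a.txt identified as moved to b.txt
$ cp b.txt c.txt
$ tmsu tag c.txt other    # c.txt is tagged, so it is not a repair candidate
$ rm b.txt
$ tmsu repair             # expected: b.txt reported as missing, not merged
```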
As you can see, 'repair' only identified the file as moved when there was an untagged candidate. No tagging information was lost at any time.
If a user has two files on their disk with the same contents, why should TMSU be so arrogant as to assume these files share the same destiny and treat them identically, even though they are separate filesystem entities? Just because two files are the same now does not mean they will be the same in the future: the user might plan on editing one, or both, so that they serve different purposes. The tool should not assume.

The better question may be 'why would a user want these duplicate files on their disk in the first place?', and that's a good one: they likely don't, and so they can use the available tooling to remove the duplicates they do not want. TMSU helps here by providing a 'dupes' subcommand. This duplicate-detection functionality is already outside its remit (as a file tagging utility), but I included it because, since TMSU has the fingerprints anyway (for moved-file detection), it may as well leverage this information. If a user wants to remove duplicates from their filesystem they can do this:
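A sketch of the kind of workflow meant here (file names are placeholders; the exact 'untag' flag may differ between versions):

```
$ tmsu dupes              # list the sets of files sharing a fingerprint
$ tmsu untag --all b.txt  # clear b.txt's tags from the database first
$ rm b.txt                # then remove the copy you do not want
```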
Now, I agree that's a bit long-winded and not entirely intuitive, so perhaps there could be a facility to do all of this. However, such a facility would alter the filesystem, and I've been reluctant to add such functionality: right now TMSU does not alter your files, and its only write access is to the database. This is actually one of the 'features' I put on the website at http://tmsu.org/. I've resisted adding operations that modify files, as I believe this would cause a trust issue for new (actually, all) users. So my recommendation is to create a shell function to do this:
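Such a function might look like the following. This is a hypothetical sketch, not code shipped with TMSU, and it assumes `tmsu tags FILE` prints the file name followed by a colon and its tags on one line (check the output format of your version):

```shell
# Hypothetical helper function, not part of TMSU itself.
# Assumes `tmsu tags FILE` prints one line: "FILE: tag1 tag2 ...".
tmsu_merge_dupe() {
    keep=$1
    dupe=$2
    # Copy the duplicate's tags onto the kept file (word-splitting the
    # tag list is intentional here), then delete the duplicate.
    tmsu tag "$keep" $(tmsu tags "$dupe" | sed 's/^[^:]*: *//') &&
        rm -- "$dupe"
}
```

Tags containing spaces or tag=value pairs may need extra quoting care; treat this as a starting point rather than a finished tool.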
In fact, in another issue we had a discussion about a tmsu-rm type command. Perhaps the most pragmatic thing to do would be to include such scripts with TMSU. That way you get the functionality you want, TMSU itself stays purely read-only with respect to your files, and everybody is theoretically happy. Plus, this feels a lot more Unix-esque than building this simple stuff into TMSU.
I've opened issue #35 to include some scripts for performing filesystem operations whilst maintaining the tag information. Merging files, I feel, should be left to the user, with the help of 'dupes' as necessary.
Can you make TMSU merge the tags of identical files when performing the 'repair' operation?
Here is a test case:
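A hypothetical reconstruction of the kind of scenario described (file names and tags are illustrative only):

```
$ echo "same content" > a.txt
$ cp a.txt b.txt          # identical fingerprints
$ tmsu tag a.txt alpha
$ tmsu tag b.txt beta
$ rm a.txt
$ tmsu repair             # requested: b.txt gains 'alpha' as well, rather
                          # than a.txt being reported as missing
```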