
Repair to merge tags of identical files #20

Closed · xuanngo2001 opened this issue Jan 29, 2015 · 8 comments

@xuanngo2001

Can you make TMSU merge tags of identical files when performing the 'repair' operation?
Here is a test case:

### Prepare data: Duplicate file.txt to 2 different directories.
mkdir test/
mkdir another/

echo "Test Repair Merge Tags" > file.txt
cp file.txt test/
mv file.txt another/

### Tag each file with a different tag
tmsu tag --tags "repair" test/file.txt
tmsu tag --tags "merge" another/file.txt

tmsu files repair
tmsu files merge

### Show that the duplicate files are identical (time and hash).
ls -al test/file.txt
ls -al another/file.txt
md5sum test/file.txt
md5sum another/file.txt


### Realize there is a duplicate file. Therefore, delete one of them.
rm -f another/file.txt

### Repair and expect the tags to be merged (i.e. test/file.txt should be tagged with both 'repair' and 'merge', but it is not)
tmsu repair . .
tmsu tags test/file.txt

oniony commented Jan 29, 2015

Hi there. Thanks for reporting this issue.

You're right: when a repair is made, the tags should be pulled across from the other file so that the remaining file ends up with the superset. I'll look at this as part of the 0.5.0 release.

In the mean time, TMSU has a dupes command that can be used to identify duplicate files in the database (or for a specific file):

$ tmsu dupes
Set of 2 duplicates:
    ./test/file.txt
    ./another/file.txt

When you identify a duplicate, the easiest way forward is to choose which of the two files you wish to keep and then tag it with all of the other file's tags:

$ tmsu tag --from another/file.txt test/file.txt

Then you can safely delete the second file:

$ tmsu untag --all another/file.txt
$ rm another/file.txt

oniony added the bug label Jan 30, 2015

oniony commented Jan 30, 2015

I think this will have to wait until after 0.5.0. Something about the idea of 'repair' automatically synchronizing tags across duplicates bothers me, and I can't quite put it into words. Part of the worry is that the operation could be slow: identifying every set of duplicates and then comparing tags. Or it could be that a user might have multiple files with the same fingerprint yet still consider them separate files, though I can't come up with a convincing case to illustrate what I mean. Let me think about this some more.
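
To make that cost concrete, here is a rough, user-side sketch (not anything TMSU does today) of what a merge pass over every duplicate set amounts to, using only commands already shown in this thread. The parsing assumes the "Set of N duplicates:" output format of tmsu dupes shown above, and it blindly keeps the first path of each set as the survivor, so treat it as an illustration of the work involved rather than something to run as-is:

# Sketch only: for each duplicate set, merge the other members' tags onto
# the first listed path, then clear the other members' tags.
tmsu dupes | while IFS= read -r line; do
    case "$line" in
        "Set of"*|"")                            # new set (or blank separator)
            keep="" ;;
        *)
            file=$(printf '%s\n' "$line" | sed 's/^[[:space:]]*//')
            if [ -z "$keep" ]; then
                keep="$file"                     # first path of the set survives
            else
                tmsu tag --from "$file" "$keep"  # copy this duplicate's tags across
                tmsu untag --all "$file"         # then drop them from the duplicate
            fi ;;
    esac
done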

oniony removed the bug label Feb 6, 2015
@xuanngo2001 (author)

Let me make a stronger case for this issue. :-)

From the user's perspective, this is a bug. If duplicate files have been merged, then the tags should follow.

When I repair anything, I expect it to be slow, but it has to be right. I don't mind waiting longer for the process to fix everything correctly. Most importantly, information (tags) should not be lost.

The argument that a user with files sharing the "same fingerprint" might "consider them separate files" doesn't hold. Either the fingerprint algorithm is not strong enough to identify uniqueness, or you are trying to trick the application. If it is the latter, then all bets are off.


oniony commented Apr 11, 2015

I'll look at this issue soon. It had fallen off my radar.


0ion9 commented Apr 11, 2015

@limelime:
"Either the fingerprint algorithm is not strong enough to identify uniqueness or you are trying to trick the application. If it is the latter, then all bets are off."

I want to point out that generic hashing algorithms can only make probabilistic guarantees about their behaviour, not absolute ones. It may be unlikely for two files with different content to end up with the same fingerprint, but it is perfectly possible no matter how strong the hash: no hash can absolutely guarantee that two files with different content will never collide. I have personally encountered collisions with both MD5 and SHA1.
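
To put a number on "unlikely but possible" (my own illustration, not a figure from the thread): under the standard birthday approximation, the chance of at least one accidental collision among n distinct files hashed with an ideal b-bit fingerprint is roughly

% Birthday bound: probability of at least one collision among n distinct
% inputs under an ideal b-bit hash.
P(\text{collision}) \approx 1 - \exp\!\left(-\frac{n(n-1)}{2^{\,b+1}}\right)

which is astronomically small for realistic n with a 128- or 160-bit hash, but never zero, which is the point being made here.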

(This is one of the reasons I'm extremely cautious about tmsu repair, although IIRC tmsu repair also compares other metadata, such as file size, after the hash, which further reduces the risk. It may be a little off-topic, but I think the approach of generating an editable script, as rmlint does, is generally more sound, because it lets the user deal correctly with exceptional cases. Undeniably slower, but IMO that extra level of control is necessary when performing potentially destructive actions.)


oniony commented Apr 14, 2015

@limelime:

"From the user's perspective, this is a bug. When I repair anything, I expect it to be slow, but it has to be right. I don't mind waiting longer for the process to fix everything correctly. Most importantly, information (tags) should not be lost."

I don't see how you could think this is a bug. TMSU does not combine tagged files during a repair operation: it will only repair a moved file by identifying an untagged file with the same fingerprint. If TMSU cannot find an untagged file with the same fingerprint then it will, instead, report the file as missing. At no point could any tag information be lost, as files are simply not merged by 'repair' at this time.

You seem to be saying that 'if TMSU implemented my original suggestion there would be a bug in the implementation', which doesn't make sense as you have no idea how it would be implemented.

Unless I'm missing something? Are you saying you have identified a bug in the 'repair' subcommand where missing files are repaired by merging them with an already tagged file? I have performed a couple of tests and everything seems to be working as expected:

$ echo "hello" >file1
$ cp file1 file2
$ tmsu tag file1 tag1
tmsu: New tag 'tag1'.
$ tmsu tag file2 tag2
tmsu: New tag 'tag2'.
$ tmsu dupes
Set of 2 duplicates:
    ./file1
    ./file2
$ rm file1
$ tmsu repair .
/home/paul/test/file1: missing
$ cp file2 file3
$ tmsu repair .
/home/paul/test/file1: updated path to /home/paul/test/file3

As you can see, 'repair' only identified the file as moved when there was an untagged candidate. No tagging information was lost at any time.


oniony commented Apr 14, 2015

"The argument that a user with files sharing the 'same fingerprint' might 'consider them separate files' doesn't hold. Either the fingerprint algorithm is not strong enough to identify uniqueness, or you are trying to trick the application. If it is the latter, then all bets are off."

If a user has two files on their disk with the same contents, why should TMSU be so arrogant as to assume that these files share the same destiny and treat them identically, even though they are separate filesystem entities?

Just because two files are the same now, does not mean they will be the same in the future. The user might plan on editing one, or both, such that they serve different purposes: the tool should not assume.

The better argument may be 'why would a user want these duplicate files on their disk in the first place?', and that's a good question: they likely don't, and so they can use the available tooling to remove the duplicates they do not want. TMSU helps here by providing a 'dupes' subcommand. This duplicate functionality is already outside its remit (as a file tagging utility), but I included it because I figured that if TMSU has the fingerprints (for moved-file detection) it may as well leverage that information.

If a user wants to remove duplicates from their filesystem they can do this:

$ tmsu dupes
Set of 2 duplicates:
  ./file1
  ./file2
$ tmsu tag --from file2 file1
$ tmsu untag --all file2
$ rm file2

Now, I agree that's a bit long-winded and not entirely intuitive, so perhaps there could be a facility to do all of this. However, such a facility would alter the filesystem, and I've been reluctant to add that kind of functionality: right now TMSU does not alter your files; its only write access is to the database. This is actually one of the 'features' I put on the website at http://tmsu.org/. I've been resisting adding operations that modify the files, as I believe this would cause a trust issue for new (actually all) users.

So, my recommendation is to create a shell function to do this:

# Copy tags from the first file onto the second, then untag and delete the first.
tmsu-mergefiles() {
    tmsu tag --from "$1" "$2" && tmsu untag --all "$1" && rm "$1"
}
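
As a possible invocation (using the paths from the original report and keeping test/file.txt):

$ tmsu-mergefiles another/file.txt test/file.txt
$ tmsu tags test/file.txt    # should now list both 'repair' and 'merge'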

In fact, in another issue we discussed a tmsu-rm type command. Perhaps the most pragmatic thing to do would be to include such scripts with TMSU. That way you get the functionality you want, TMSU itself stays purely read-only with respect to your files, and everybody is theoretically happy. Plus, this feels a lot more Unixesque than putting this simple stuff into TMSU itself.


oniony commented Apr 14, 2015

I've opened issue #35 to include some scripts for performing filesystem operations whilst maintaining the tag information.

Merging files I feel should be left to the user with the help of 'dupes' as necessary.

oniony closed this as completed Apr 14, 2015