Fingerprint Algorithms

Paul Ruane edited this page Jan 9, 2017 · 3 revisions

What's a Fingerprint?

When you tag a file, TMSU creates a fingerprint of the file and stores it in the database. Just as a person's fingerprint can uniquely identify them within the global population, a file fingerprint can uniquely identify a file among all the files in the world. It is extremely unlikely that two different files will generate the same fingerprint.

(You may also see file fingerprints referred to as 'hashes' or 'digests'.)
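As a rough illustration of the idea, here is a minimal Python sketch that computes a whole-file fingerprint with the standard `hashlib` module. The function name `file_fingerprint` and its parameters are made up for this example, and this is not TMSU's own code.

```python
import hashlib


def file_fingerprint(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file's entire contents.

    Hypothetical helper for illustration only; not part of TMSU.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical contents produce the same digest regardless of their names or locations, which is the property the rest of this page relies on.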

Use In TMSU

TMSU uses the saved file fingerprints for two purposes:

  • Database repairs
  • Duplicate file identification

TMSU can find the new path of a moved or renamed file by looking for a file with the same fingerprint. Duplicate files can be identified within the database as having the same fingerprint and therefore the same contents.
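Both uses come down to looking files up by fingerprint. The Python sketch below shows the idea using plain dictionaries that map paths to fingerprints. The function names are hypothetical, and the choice to give up when more than one candidate matches is an assumption of this sketch, not a description of what TMSU does.

```python
from collections import defaultdict


def find_duplicates(fingerprints):
    """Group paths that share a fingerprint.

    `fingerprints` maps path -> fingerprint. Returns only the groups
    containing more than one path.
    """
    groups = defaultdict(list)
    for path, fingerprint in fingerprints.items():
        groups[fingerprint].append(path)
    return {fp: sorted(paths) for fp, paths in groups.items() if len(paths) > 1}


def find_new_path(missing_fingerprint, candidates):
    """Find where a missing file went by matching its stored fingerprint.

    `candidates` maps path -> fingerprint for files found on disk.
    Assumption for this sketch: an ambiguous match (several candidates
    with the same fingerprint) is reported as None.
    """
    matches = [p for p, fp in candidates.items() if fp == missing_fingerprint]
    return matches[0] if len(matches) == 1 else None
```

For example, `find_new_path("abc", {"/new/song.mp3": "abc", "/new/other.mp3": "def"})` returns `"/new/song.mp3"`.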

Algorithms

By default, TMSU uses a modified version of the SHA256 algorithm called 'dynamic:SHA256'. This algorithm was selected as it will provide excellent security and adequate performance for a wide range of uses.

Several different file algorithms are supported:

  • none
  • SHA256
  • MD5
  • SHA1
  • dynamic:SHA256
  • dynamic:MD5
  • dynamic:SHA1

SHA1, SHA256 and MD5 are well-known cryptographic hash functions, all of which are adequate for uniquely identifying tagged files. MD5 and SHA1 are both known to be vulnerable to collision attacks, but for tagging purposes this is unlikely to pose a problem.

The dynamic versions of the above algorithms behave differently for files larger than 5MB. Rather than calculating a fingerprint from the whole file's contents, they create a fingerprint from three 512kB portions of the file, taken from its beginning, middle and end. This dramatically improves performance on slow filesystems, such as remote filesystems, but at the risk of not detecting all file modifications. This does not normally cause a problem as larger files tend to be music or video files, which are rarely if ever modified. If you do modify larger files and would like TMSU to be able to identify them properly as moved, use one of the non-dynamic algorithms.
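The sampling approach can be sketched in Python as follows. This is an illustration only and should not be expected to reproduce TMSU's fingerprints. It assumes that "5MB" and "512kB" mean 5 MiB and 512 KiB, that the middle portion is centred in the file, and that the three portions are fed to the hash in order; none of these details are confirmed by this page.

```python
import hashlib
import os

THRESHOLD = 5 * 1024 * 1024  # assumption: "5MB" read as 5 MiB
PORTION = 512 * 1024         # assumption: "512kB" read as 512 KiB


def dynamic_fingerprint(path, algorithm="sha256"):
    """Hash the whole file if small, else three sampled portions.

    Hypothetical sketch; offsets and thresholds are assumptions.
    """
    size = os.path.getsize(path)
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        if size <= THRESHOLD:
            # Small file: fingerprint the entire contents.
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                digest.update(chunk)
        else:
            # Large file: sample the beginning, middle and end only.
            for offset in (0, (size - PORTION) // 2, size - PORTION):
                f.seek(offset)
                digest.update(f.read(PORTION))
    return digest.hexdigest()
```

A change that falls outside the three sampled portions leaves the fingerprint unchanged. That is the trade-off described above: far less data is read, but some modifications go undetected.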

Which File Algorithm Should I Use?

Use the following as a guide to which algorithm you should use:

  • If you do not care about being able to repair moved files or detect duplicates then none will give the very best performance.
  • For optimum performance whilst maintaining the ability to repair moved files and detect duplicates, choose dynamic:MD5.
  • For maximum compatibility with other tooling, choose SHA256, although this may perform badly with larger files, especially on a remote filesystem.
  • dynamic:SHA256 is the default because it provides acceptable performance and compatibility with other tooling (except on large files), and avoids performance problems with large files or on remote filesystems.

Changing Algorithm

Before changing the algorithm, run 'tmsu repair' to fix any file modifications and moves; otherwise it will not be possible to make these repairs once the new algorithm is in operation.

To change the algorithm, use the config subcommand. For example, to use the MD5 algorithm for files:

tmsu config fileFingerprintAlgorithm=MD5

To have TMSU recalculate the fingerprints in the database with the new algorithm, issue the following repair command:

tmsu repair --unmodified

Symbolic Link Algorithms

If you are using a version of TMSU before v0.7 then please see the page history for more details.

As of TMSU v0.7.0, it is possible to separately configure the algorithm used for symbolic links. Usually there is no benefit in doing so, but for people using Git Annex or other similar tools there are options available to use the target file's name instead of its digest. The supported options are:

  • none
  • follow
  • targetName
  • targetNameNoExt

The follow option, which is the default, uses the fingerprint of the target file/directory as per the configured file/directory fingerprint algorithm, respectively.

targetName and targetNameNoExt use the name of the symbolic link's target file or directory as the fingerprint. This is beneficial if the target file's name is some kind of identifier, as is the case when integrating with Git Annex. The targetNameNoExt variant strips the extension from the target filename.
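The two name-based options can be sketched in Python as below. The function name is hypothetical, and using `os.path.splitext` (which removes only the final extension) is an assumption about how the extension is stripped, not a statement of TMSU's exact behaviour.

```python
import os


def symlink_name_fingerprint(link_path, strip_extension=False):
    """Use the name of a symbolic link's target as its fingerprint.

    Hypothetical sketch of the targetName / targetNameNoExt idea.
    """
    # readlink returns the link's target without following it further,
    # so this works even if the target does not currently exist.
    target = os.readlink(link_path)
    name = os.path.basename(target)
    if strip_extension:
        # Assumption: only the last extension is removed.
        name, _extension = os.path.splitext(name)
    return name
```

For a link pointing at a made-up Git Annex style target such as `.git/annex/objects/SHA256E-s10--abc.mp3`, the sketch yields `SHA256E-s10--abc.mp3` by default and `SHA256E-s10--abc` with `strip_extension=True`.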

Directories

I would like to reintroduce directory fingerprints but it is difficult to do this in a way which is both useful and performant.