- data loss :)
How to walk the directory tree and find changes?
File name and mtime might not be enough; see how the git index does it. Don't use tv_nsec! Or maybe that issue is fixed in Linux? Try this reproduction. Looks like it's fine for UDF at least.
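A minimal sketch of the change-detection idea, assuming a snapshot that records more than just name and mtime per file. The field set (size, whole-second mtime and ctime, inode, mode) is loosely modeled on the kind of stat data the git index keeps and is an assumption, not a finished design. Timestamps are truncated to whole seconds because of the tv_nsec concern above. The names `StatKey`, `snapshot` and `diff_snapshots` are hypothetical.

```python
import os
from typing import Dict, NamedTuple, Set, Tuple


class StatKey(NamedTuple):
    """Stat fields compared between two runs (assumed field set)."""
    size: int
    mtime_s: int   # whole seconds only, nanoseconds deliberately dropped
    ctime_s: int
    ino: int
    mode: int


def snapshot(root: str) -> Dict[str, StatKey]:
    """Walk `root` without following symlinks and record stat data per file."""
    entries: Dict[str, StatKey] = {}
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                    continue
                st = entry.stat(follow_symlinks=False)
                rel = os.path.relpath(entry.path, root)
                entries[rel] = StatKey(
                    st.st_size, int(st.st_mtime), int(st.st_ctime),
                    st.st_ino, st.st_mode,
                )
    return entries


def diff_snapshots(
    old: Dict[str, StatKey], new: Dict[str, StatKey]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (added, removed, possibly_changed) relative paths."""
    added = set(new) - set(old)
    removed = set(old) - set(new)
    changed = {p for p in set(old) & set(new) if old[p] != new[p]}
    return added, removed, changed
```

A file whose stat data is unchanged can still have different content (same size, same second), so a real implementation would probably need a content hash as a fallback for recently modified files.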
-
Store a "cost" variable per object that records how much effort it takes to restore that object (by recursively resolving deltas).
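A rough sketch of how such a cost could be computed and cached. The cost metric used here (sum of the stored sizes along the delta chain) is an assumption; real effort might also depend on the delta algorithm and I/O. `StoredObject` and `restore_cost` are hypothetical names.

```python
from typing import Dict, Optional


class StoredObject:
    def __init__(self, size: int, base: Optional[str] = None):
        self.size = size                  # stored bytes (full data or delta)
        self.base = base                  # id of the delta base, None if stored in full
        self.cost: Optional[int] = None   # cached restore cost


def restore_cost(objects: Dict[str, StoredObject], oid: str) -> int:
    """Resolve the delta chain of `oid` and cache the cost on every link."""
    chain = []
    cur: Optional[str] = oid
    # Follow bases until a full object or an already cached cost is reached.
    while cur is not None and objects[cur].cost is None:
        chain.append(cur)
        cur = objects[cur].base
    cost = objects[cur].cost if cur is not None else 0
    # Unwind from the base towards the requested object.
    for link in reversed(chain):
        cost += objects[link].size
        objects[link].cost = cost
    return objects[oid].cost
```

The loop is iterative rather than recursive so that very long delta chains do not hit the interpreter's recursion limit.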
-
Global ref count: if only a few files changed, increase a global ref count instead of the per-object ones (combine with commit bloom filters?) https://blogs.msdn.microsoft.com/devops/2018/07/16/super-charging-the-git-commit-graph-iv-bloom-filters/
http://manpages.ubuntu.com/manpages/precise/man1/simhash.1.html
minhash http://matpalm.com/resemblance/simhash/
http://roussev.net/sdhash/sdhash.html
https://ssdeep-project.github.io/ssdeep/
http://neoscientists.org/~tmueller/binsort/
https://www.dasec.h-da.de/wp-content/uploads/2012/06/adfsl-bbhash.pdf
https://en.wikipedia.org/wiki/Nearest_neighbor_search#Methods
https://www.sciencedirect.com/science/article/pii/S1742287614000097
$ time php -r 'ini_set("memory_limit", "-1");xdiff_file_rabdiff("inp", "oup", "xdiff-patch");'
php -r 31,57s user 3,33s system 63% cpu 55,009 total
$ time rdiff signature inp sig && time rdiff delta sig oup rdiff-patch
rdiff signature inp sig 2,74s user 0,23s system 99% cpu 2,977 total
rdiff delta sig oup rdiff-patch 38,25s user 0,78s system 99% cpu 39,134 total
$ time xdelta3 -e -s inp oup xdelta-patch
xdelta3 -e -s inp oup xdelta-patch 411,30s user 2,02s system 97% cpu 7:05,27 total
1953349632 inp (ubuntu-18.04.1-desktop-amd64.iso)
1502576640 oup (ubuntu-17.10.1-desktop-amd64.iso)
1090069571 xdiff-patch
1102816581 rdiff-patch
1414343627 xdelta-patch
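To make the numbers above easier to compare, here is a small sketch that relates each patch size to the size of the target file (oup) and lists the wall-clock totals. All values are copied from the output above; for rdiff the signature and delta steps are added together. The function name `patch_ratios` is made up for this example.

```python
TARGET_SIZE = 1502576640  # oup (ubuntu-17.10.1-desktop-amd64.iso)

PATCH_SIZES = {
    "xdiff-patch": 1090069571,
    "rdiff-patch": 1102816581,
    "xdelta-patch": 1414343627,
}

# Wall-clock "total" column from the timings above, in seconds.
TOTAL_SECONDS = {
    "xdiff-patch": 55.009,
    "rdiff-patch": 2.977 + 39.134,   # signature + delta
    "xdelta-patch": 7 * 60 + 5.27,   # 7:05,27
}


def patch_ratios(patch_sizes, target_size):
    """Patch size as a fraction of the file it reconstructs."""
    return {name: size / target_size for name, size in patch_sizes.items()}


if __name__ == "__main__":
    ratios = patch_ratios(PATCH_SIZES, TARGET_SIZE)
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name:13s} {ratio:6.1%} of target, {TOTAL_SECONDS[name]:7.1f} s")
```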
There are some things that sadly make git itself unsuitable for this task. Interestingly, many of these problems can probably be solved within git itself without breaking backwards compatibility.
-
Hardcoded zlib compression
Every object (esp. file) in git is immediately compressed with zlib. This cannot be turned off. The pack format etc. would need some changes to be able to use other or no compression.
-
Inflexible delta compression
Delta compression can be turned off per file using gitattributes. But the way delta compression works is not flexible: For every pack file, the objects are ordered heuristically (using time, basename, etc) and then each object is delta compressed using a fixed number of surrounding objects.
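The windowing behaviour described above can be sketched roughly as follows. This is a simplified illustration, not git's actual code: the sort key (basename, then size descending) and the choice to look only at preceding objects are assumptions, and `Obj` and `delta_candidates` are hypothetical names.

```python
import os
from typing import List, NamedTuple, Tuple


class Obj(NamedTuple):
    oid: str
    path: str
    size: int


def delta_candidates(objects: List[Obj], window: int = 10) -> List[Tuple[str, List[str]]]:
    """For each object, list the ids it would be tried against as a delta base."""
    # Heuristic ordering: same basename first, larger objects before smaller ones.
    ordered = sorted(objects, key=lambda o: (os.path.basename(o.path), -o.size))
    result = []
    for i, obj in enumerate(ordered):
        # Only a fixed number of neighbours in the ordering is ever considered.
        neighbours = ordered[max(0, i - window):i]
        result.append((obj.oid, [n.oid for n in neighbours]))
    return result
```

Two objects with very similar content but unrelated names can end up far apart in this ordering and are then never compared, which is the inflexibility meant here.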
-
Removing / reducing historical data is not possible
Git enforces the existence of 100% of all data from a given point in time onward. There were proposed patches to allow cloning only the blobs from a subdirectory, but as it stands git will immediately throw errors if any blobs are missing.
This program allows removing intermediate versions of files, without having to change history. You can have hourly snapshots for a week and then only daily snapshots for older data. Only metadata changes (esp. file names) have to be retained.
The only option for reducing history size in git is a fixed historical cutoff (a so-called shallow repository).
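The thinning policy mentioned above (hourly snapshots for a week, daily after that) could look roughly like this. It is a sketch under assumptions: snapshots are identified by their timestamps only, and for older days the earliest snapshot of each calendar day is the one that survives. `thin_snapshots` is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def thin_snapshots(
    times: Iterable[datetime],
    now: datetime,
    keep_all_for: timedelta = timedelta(days=7),
) -> List[datetime]:
    """Return the snapshot times to keep; everything else may be dropped."""
    keep: List[datetime] = []
    seen_days = set()
    for t in sorted(times):
        if now - t <= keep_all_for:
            # Recent snapshots are kept at full (e.g. hourly) resolution.
            keep.append(t)
        elif t.date() not in seen_days:
            # Older snapshots: keep only the first one of each day.
            seen_days.add(t.date())
            keep.append(t)
    return keep
```

Only file contents of the dropped snapshots would be removed; metadata changes such as renames still have to be retained, as noted above.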
-
Empty trees
Git does not track empty directories. I don't think there's an actual reason for this since the internal format can easily handle an empty tree. In fact, the empty tree object does exist but it's only used in special cases.
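To illustrate that the format handles an empty tree just fine, the id of the empty tree object can be computed by hand. Git object ids (SHA-1 repositories) are the hash of a header of the form "<type> <length>\0" followed by the payload; the helper name `git_object_id` is made up for this example.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """SHA-1 object id as git computes it: header + NUL + payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


# A tree with no entries has an empty payload.
EMPTY_TREE = git_object_id("tree", b"")

if __name__ == "__main__":
    print(EMPTY_TREE)  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```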
-
Missing metadata
Apart from the executable bit, git does not store any file permissions, ownership, modified time, extended attributes, etc.
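A sketch of the kind of per-file metadata record a backup tool would need to capture in addition to the content. The dictionary layout and the name `capture_metadata` are assumptions. Extended attributes are read with `os.listxattr` / `os.getxattr`, which Python only provides on Linux, so that part is skipped elsewhere.

```python
import os
import stat
from typing import Any, Dict


def capture_metadata(path: str) -> Dict[str, Any]:
    """Collect permissions, ownership, mtime and xattrs without following symlinks."""
    st = os.lstat(path)
    meta: Dict[str, Any] = {
        "mode": stat.S_IMODE(st.st_mode),   # permission bits only
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mtime_ns": st.st_mtime_ns,
        "xattrs": {},
    }
    if hasattr(os, "listxattr"):  # Linux only
        try:
            for name in os.listxattr(path, follow_symlinks=False):
                meta["xattrs"][name] = os.getxattr(path, name, follow_symlinks=False)
        except OSError:
            # Filesystem without xattr support or insufficient permissions.
            pass
    return meta
```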
-
Git does not store the depth to the root commit in the commit object (easy to calculate, needed for optimizations).
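A sketch of how that depth could be calculated from the parent links alone. For merge commits there are several paths to a root; this example assumes the longest path is wanted (root commits get depth 0). The input format, a mapping from commit id to its parent ids, and the name `commit_depths` are assumptions.

```python
from typing import Dict, List


def commit_depths(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Longest distance from each commit to a root commit (root = 0)."""
    depth: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in depth:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in depth]
            if missing:
                # Resolve parents first, then come back to this commit.
                stack.extend(missing)
            else:
                depth[commit] = 1 + max(
                    (depth[p] for p in parents[commit]), default=-1
                )
                stack.pop()
    return depth
```

One optimization this enables: an ancestor always has a strictly smaller depth than its descendants, so if depth[x] >= depth[y], x cannot be an ancestor of y and the history walk can be skipped.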