phiresky/gitty-backup-rs

Blocking problems for actual use

  1. data loss :)

problems

how to walk dir tree and find changes?

File name and mtime might not be enough to detect changes; see how the git index handles this. Don't use tv_nsec! Or maybe that issue is fixed in Linux? Try this reproduction. It looks like it's fine for UDF at least.
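
A minimal sketch of such a change check, assuming a stored index that maps each path to a file size and an mtime truncated to whole seconds, so tv_nsec is never compared. `FileStamp`, `stamp`, `is_changed` and `changed_files` are hypothetical names for illustration, not part of this repository, and size plus mtime can still miss some edits, as noted above.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::time::UNIX_EPOCH;

/// Hypothetical index entry: file size plus mtime in whole seconds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FileStamp {
    pub size: u64,
    pub mtime_secs: u64,
}

/// Read the stamp of one file; sub-second precision is dropped on purpose.
pub fn stamp(path: &Path) -> io::Result<FileStamp> {
    let meta = fs::metadata(path)?;
    let mtime_secs = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0); // assumption: mtimes before 1970 are treated as 0
    Ok(FileStamp { size: meta.len(), mtime_secs })
}

/// A file counts as changed if it is new or its stamp differs from the index.
pub fn is_changed(index: &HashMap<PathBuf, FileStamp>, path: &Path, now: FileStamp) -> bool {
    index.get(path) != Some(&now)
}

/// Recursively walk `dir` and collect regular files whose stamp differs
/// from the index. Symlinks and special files are skipped in this sketch.
pub fn changed_files(
    dir: &Path,
    index: &HashMap<PathBuf, FileStamp>,
    out: &mut Vec<PathBuf>,
) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        let file_type = entry.file_type()?;
        if file_type.is_dir() {
            changed_files(&path, index, out)?;
        } else if file_type.is_file() && is_changed(index, &path, stamp(&path)?) {
            out.push(path);
        }
    }
    Ok(())
}
```

A file that is rewritten within the same second with the same size would go unnoticed here, which is the kind of case the git index has extra handling for.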

ideas

similarity hashing

http://manpages.ubuntu.com/manpages/precise/man1/simhash.1.html

minhash http://matpalm.com/resemblance/simhash/

http://roussev.net/sdhash/sdhash.html

https://ssdeep-project.github.io/ssdeep/

http://neoscientists.org/~tmueller/binsort/

https://www.dasec.h-da.de/wp-content/uploads/2012/06/adfsl-bbhash.pdf

https://en.wikipedia.org/wiki/Nearest_neighbor_search#Methods

https://www.sciencedirect.com/science/article/pii/S1742287614000097
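
To make the idea concrete, here is a rough sketch of a 64-bit simhash over overlapping byte windows, with Hamming distance as the similarity measure. It is not taken from any of the tools linked above; the function names and the shingle size are assumptions, and `DefaultHasher` is used only to keep the example free of dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// 64-bit simhash over overlapping byte windows ("shingles") of `shingle` bytes.
/// Note: the algorithm behind DefaultHasher is not guaranteed to stay the same
/// across Rust releases, so fingerprints stored on disk would need a fixed hash.
pub fn simhash(data: &[u8], shingle: usize) -> u64 {
    let mut counts = [0i64; 64];
    // `windows` panics on a size of 0, so clamp to at least 1.
    for window in data.windows(shingle.max(1)) {
        let mut hasher = DefaultHasher::new();
        window.hash(&mut hasher);
        let h = hasher.finish();
        // Each shingle votes +1 or -1 on every bit position.
        for (bit, count) in counts.iter_mut().enumerate() {
            if (h >> bit) & 1 == 1 {
                *count += 1;
            } else {
                *count -= 1;
            }
        }
    }
    // A fingerprint bit is set where the positive votes won.
    let mut fingerprint = 0u64;
    for (bit, count) in counts.iter().enumerate() {
        if *count > 0 {
            fingerprint |= 1u64 << bit;
        }
    }
    fingerprint
}

/// Number of differing bits; smaller means the inputs are likely more similar.
pub fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

Files whose fingerprints are a small Hamming distance apart would be candidates for delta compression against each other.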

$ time php -r 'ini_set("memory_limit", "-1");xdiff_file_rabdiff("inp", "oup", "xdiff-patch");'
php -r   31,57s user 3,33s system 63% cpu 55,009 total

$ time rdiff signature inp sig && time rdiff delta sig oup rdiff-patch
rdiff signature inp sig  2,74s user 0,23s system 99% cpu 2,977 total
rdiff delta sig oup rdiff-patch  38,25s user 0,78s system 99% cpu 39,134 total

$ time xdelta3 -e -s inp oup xdelta-patch
xdelta3 -e -s inp oup xdelta-patch 411,30s user 2,02s system 97% cpu 7:05,27 total

1953349632      inp (ubuntu-18.04.1-desktop-amd64.iso)
1502576640      oup (ubuntu-17.10.1-desktop-amd64.iso)
1090069571      xdiff-patch
1102816581      rdiff-patch
1414343627      xdelta-patch

Why not just use Git

There are some things that sadly make git itself unsuitable for this task. Interestingly, many of these problems can probably be solved within git itself without breaking backwards compatibility.

  1. Hardcoded zlib compression

    Every object (esp. file) in git is immediately compressed with zlib. This cannot be turned off. The pack format etc. would need some changes to be able to use other or no compression.

  2. Inflexible delta compression

    Delta compression can be turned off per file using gitattributes. But the way delta compression works is not flexible: for every pack file, the objects are ordered heuristically (using time, basename, etc.) and then each object is delta compressed against a fixed number of surrounding objects.

  3. Removing / reducing historical data is not possible

    Git enforces the existence of 100% of all data in the history it holds. There were proposed patches to allow cloning only the blobs of a subdirectory, but as is, git will immediately throw errors if any blobs are missing.

    This program allows removing intermediate versions of files, without having to change history. You can have hourly snapshots for a week and then only daily snapshots for older data. Only metadata changes (esp. file names) have to be retained.

    The only option for reducing history size in git is a fixed historical cutoff (a so-called shallow repository).

  4. Empty trees

    Git does not track empty directories. I don't think there's an actual reason for this since the internal format can easily handle an empty tree. In fact, the empty tree object does exist but it's only used in special cases.

  5. Missing metadata

    Apart from the executable bit, Git does not store file permissions, ownership, modified time, extended attributes, etc.

  6. git does not store depth to root commit in commit object (easy to calculate, needed for optimizations)
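
The retention scheme from point 3 (hourly snapshots for a week, then only daily ones) could be sketched roughly as follows. This is an illustration under assumptions, not this program's actual implementation: the function name, the bucket boundaries and the rule that the oldest snapshot in each bucket wins are all made up here. It only selects which content versions to keep; metadata changes would still have to be retained separately.

```rust
const HOUR: u64 = 3_600;
const DAY: u64 = 86_400;
const WEEK: u64 = 7 * DAY;

/// Pick the snapshot timestamps (Unix seconds) to keep.
/// Snapshots younger than a week: at most one per hour.
/// Older snapshots: at most one per day.
/// Within each bucket the oldest snapshot is kept (an arbitrary choice).
pub fn snapshots_to_keep(snapshots: &[u64], now: u64) -> Vec<u64> {
    let mut sorted = snapshots.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut keep = Vec::new();
    let mut last_bucket: Option<(bool, u64)> = None;
    for ts in sorted {
        let recent = now.saturating_sub(ts) < WEEK;
        // Bucket index: hour number for recent snapshots, day number otherwise.
        let bucket = (recent, if recent { ts / HOUR } else { ts / DAY });
        if last_bucket != Some(bucket) {
            keep.push(ts);
            last_bucket = Some(bucket);
        }
    }
    keep
}
```

Every snapshot not in the returned list is an intermediate version that could be dropped without rewriting history.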
