Merge pull request #195 from dbaarda/opt/sigargs1
Make default block_len a multiple of blake2 blocksize and tidy docs.
dbaarda committed May 16, 2020
2 parents 31a8fdf + 99e015b, commit b005057
Showing 15 changed files with 353 additions and 261 deletions.
NEWS.md: 18 additions & 1 deletion
@@ -4,6 +4,23 @@

NOT RELEASED YET

* Change default block_len to always be a multiple of the blake2b 128-byte
blocksize for efficiency (see the rs_sig_args() sketch below). Tidy and update
docs to explain using rs_sig_args() and rs_build_hash_table(), add
rs_file_*() utils, and document new magic types. Remove long-obsolete entries
in TODO.md. Update to Doxygen 1.8.16. (dbaarda,
https://github.com/librsync/librsync/pull/195)

* Improve hashtable performance by adding a small optional bloom filter and
reducing the max loadfactor from 80% to 70%. Fix hashcmp_count stats to
include comparisons against empty buckets. Together this speeds up deltas by
20%~50% (see the bloom-filter sketch below).
(dbaarda, https://github.com/librsync/librsync/pull/192,
https://github.com/librsync/librsync/pull/193,
https://github.com/librsync/librsync/pull/196)

* Optimize rabinkarp_update() by correctly using unsigned constants and
manually unrolling the loop for best performance (see the sketch below).
(dbaarda, https://github.com/librsync/librsync/pull/191)
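
The first entry above can be exercised through the public API: rs_sig_args()
fills in recommended signature parameters for a given basis-file size, and the
block_len it returns is now always a multiple of blake2b's 128-byte blocksize.
A minimal sketch (error handling elided; the exact recommended values are
implementation details):

```c
#include <stdio.h>
#include <librsync.h>

int main(void) {
    rs_magic_number magic = 0; /* 0 = let librsync pick the default magic */
    size_t block_len = 0;      /* 0 = ask for the recommended block length */
    size_t strong_len = 0;     /* 0 = ask for the full strong-sum length */
    rs_long_t old_fsize = 1 << 20;  /* basis file size, if known */

    if (rs_sig_args(old_fsize, &magic, &block_len, &strong_len) != RS_DONE)
        return 1;
    /* block_len comes back as a multiple of the blake2b 128-byte blocksize. */
    printf("magic=%#x block_len=%zu strong_len=%zu\n",
           (unsigned)magic, block_len, strong_len);
    return 0;
}
```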
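
For the hashtable change, the idea is a small bitmap consulted before any
bucket probe, so lookups of absent keys (the common case when scanning a new
file) usually cost a single bit test. An illustrative sketch of the technique,
not the actual hashtable.c code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t mask;      /* bucket count (a power of two) minus one */
    uint8_t *bloom;   /* one bit per possible bucket index */
    /* ... bucket array, entry count, stats, etc. ... */
} table_t;

/* Set the bloom bit whenever an entry is inserted. */
static inline void bloom_add(table_t *t, unsigned hash) {
    size_t i = hash & t->mask;
    t->bloom[i >> 3] |= (uint8_t)(1u << (i & 7u));
}

/* Cheap pre-check before probing: if the bit is clear, the key cannot
 * be present, so the bucket scan and strong-sum compare are skipped. */
static inline bool bloom_maybe(const table_t *t, unsigned hash) {
    size_t i = hash & t->mask;
    return (t->bloom[i >> 3] >> (i & 7u)) & 1u;
}
```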
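
The rabinkarp_update() change is about keeping the hash arithmetic in unsigned
32-bit types (signed overflow is undefined behaviour in C) and manually
unrolling the per-byte multiply-accumulate. A simplified sketch of the shape of
such a loop, using an example multiplier rather than librsync's tuned
constants:

```c
#include <stddef.h>
#include <stdint.h>

#define MULT 0x08104225u    /* example odd multiplier, unsigned on purpose */

/* Processing 4 bytes per iteration:
 * hash' = hash*m^4 + b0*m^3 + b1*m^2 + b2*m + b3, all mod 2^32.
 * The m^2..m^4 powers fold to compile-time constants. */
static uint32_t rk_update(uint32_t hash, const unsigned char *buf, size_t len) {
    while (len >= 4) {          /* manually unrolled main loop */
        hash = hash * (MULT * MULT * MULT * MULT)
             + buf[0] * (MULT * MULT * MULT)
             + buf[1] * (MULT * MULT)
             + buf[2] * MULT
             + buf[3];
        buf += 4;
        len -= 4;
    }
    while (len--)               /* tail: one byte at a time */
        hash = hash * MULT + *buf++;
    return hash;
}
```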

## librsync 2.3.0

Released 2020-04-07
@@ -31,7 +48,7 @@ Released 2020-04-07

* Improved C99 compatibility. Add `-std=c99 -pedantic` to `CMAKE_C_FLAGS` for
gcc and clang. Fix all C99 warnings by making all code C99 compliant. Tidy
- all CMake checks, #cmakedefines, and #includes. Fix 64bit support for
+ all CMake checks, `#cmakedefines`, and `#includes`. Fix 64bit support for
mdfour checksums (texierp, dbaarda,
https://github.com/librsync/librsync/pull/181,
https://github.com/librsync/librsync/pull/182)
README.md: 6 additions & 6 deletions
@@ -42,10 +42,10 @@ as rsync. You can use librsync in a program you write to do backups,
distribute binary patches to programs, or sync directories to a server
or between peers.

- This tree also produces the @ref rdiff command-line tool that exposes the key
- operations of librsync: generating file signatures, generating the delta from a
- signature to a new file, and applying the delta to regenerate the new file
- given the old file.
+ This tree also produces the \ref page_rdiff that exposes the key operations of
+ librsync: generating file signatures, generating the delta from a signature to
+ a new file, and applying the delta to regenerate the new file given the old
+ file.
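
Those three operations map directly onto librsync's whole-file API. A minimal
sketch of the round trip (error handling elided; "old", "new", "sig", and
"delta" are placeholder file names):

```c
#include <stdio.h>
#include <librsync.h>

int main(void) {
    /* 1. Signature of the old file (0s mean "use recommended defaults"). */
    FILE *old_f = fopen("old", "rb"), *sig_f = fopen("sig", "wb");
    rs_sig_file(old_f, sig_f, 0, 0, RS_BLAKE2_SIG_MAGIC, NULL);
    fclose(old_f); fclose(sig_f);

    /* 2. Delta from the signature to the new file. */
    rs_signature_t *sig;
    sig_f = fopen("sig", "rb");
    FILE *new_f = fopen("new", "rb"), *delta_f = fopen("delta", "wb");
    rs_loadsig_file(sig_f, &sig, NULL);
    rs_build_hash_table(sig);   /* required before rs_delta_file() */
    rs_delta_file(sig, new_f, delta_f, NULL);
    rs_free_sumset(sig);
    fclose(sig_f); fclose(new_f); fclose(delta_f);

    /* 3. Apply the delta to the old file to regenerate the new file. */
    old_f = fopen("old", "rb"); delta_f = fopen("delta", "rb");
    FILE *out_f = fopen("new.copy", "wb");
    rs_patch_file(old_f, delta_f, out_f, NULL);
    fclose(old_f); fclose(delta_f); fclose(out_f);
    return 0;
}
```

This is the same signature, delta, patch sequence the rdiff tool drives from
the command line.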

librsync was originally written for the rproxy experiment in
delta-compression for HTTP.
@@ -75,9 +75,9 @@ your own code or make use of some other virtual filesystem layer.
* \ref page_downloads
* \ref versioning
* \ref page_install
+ * \ref page_rdiff
* \ref page_api
+ * \ref page_formats
* \ref page_support
* \ref page_contributing
- * \ref rdiff command line interface
* \ref NEWS.md
- * \ref page_formats
TODO.md: 6 additions & 83 deletions
@@ -1,30 +1,22 @@
# librsync TODO

* We have a few functions to do with reading a netint, stashing
it somewhere, then moving into a different state. Is it worth
writing generic functions for that, or would it be too confusing?

* Duplicate block handling. Currently duplicate blocks are included in
the signature, but we only put the first duplicate block in the
hashtable so the delta only includes references to the first block.
This can result in sub-optimal copy commands, breaking single large
copies with duplicate blocks into multiple copies referencing the
earlier copy of the block. However, this could also make patching use
the disk cache more effectively. This solution is probably fine,
particularly given how small copy instructions are, but there might be
solutions for improving copy commands for long runs of duplicate blocks.

* Optimisations and code cleanups:

scoop.c: Scoop needs major refactor. Perhaps the API needs
tweaking?

rsync.h: rs_buffers_s and rs_buffers_t should be one typedef?

+ * Just how useful is rs_job_drive anyway?

mdfour.c: This code has a different API to the RSA code in libmd
and is coupled with librsync in unhealthy ways (trace?). Recommend
changing to RSA API?

- * Just how useful is rs_job_drive anyway?

* Don't use the rs_buffers_t structure.

There's something confusing about the existence of this structure.
@@ -77,10 +69,6 @@
Some are more likely to change than others. We need a chart
showing which source files depend on which variable.

* Encoding implementation

* Join up signature commands

* Encoding algorithm

* Self-referential copy commands
@@ -99,48 +87,14 @@
However, I don't see many files which have repeated 1kB chunks,
so I don't know if it would be worthwhile.

* Extended files

Suppose the new file just has data added to the end. At the
moment, we'll match everything but the last block of the old
file. It won't match, because at the moment the search block
size is only reduced at the end of the *new* file. This is a
little inefficient, because ideally we'd know to look for the
last block using the shortened length.

This is a little hard to implement, though perhaps not
impossible. The current rolling search algorithm can only look
for one block size at any time. Can we do better? Can we look
for all block lengths that could match anything?

Remember also that at the moment we don't send the block length
in the signature; it's implied by the length of the new block
that it matches. This is kind of cute, and importantly helps
reduce the length of the signature.

* State-machine searching

Building a state machine from a regular expression is a brilliant
idea. (I think *The Practice of Programming* walks through the
construction of this at a fairly simple level.)

In particular, we can search for any of a large number of
alternatives in a very efficient way, with much less effort than
it would take to search for each one the hard way. Remember also the
string-searching algorithms and how much time they can take.

I wonder if we can use similar principles here rather than the
current simple rolling-sum mechanism? Could it let us match
variable-length signatures?

- * Support gzip compression of the difference stream. Does this
+ * Support compression of the difference stream. Does this
belong here, or should it be in the client and librsync just have
an interface that lets it cleanly plug in?

I think if we're going to just do plain gzip, rather than
rsync-gzip, then it might as well be external.

- * rsync-gzip: preload with the omitted text so as to get better
+ rsync-gzip: preload with the omitted text so as to get better
compression. Abo thinks this gets significantly better
compression. On the other hand we have to import and maintain
our own zlib fork, at least until we can persuade the upstream to
@@ -167,20 +121,6 @@
Will the GNU Lesser GPL work? Specifically, will it be a problem
in distributing this with Mozilla or Apache?

* Checksums

* Do we really need to require that signatures arrive after the
data they describe? Does it make sense in HTTP to resume an
interrupted transfer?

I hope we can do this. If we can't, however, then we should
relax this constraint and allow signatures to arrive before the
data they describe. (Really? Do we care?)

* Allow variable-length checksums in the signature; the signature
will have to describe the length of the sums and we must compare
them taking this into account.

* Testing

* Just more testing in general.
@@ -197,28 +137,11 @@

* Generate random data; do random mutations.

* Try different block lengths.

* Tests should fail if they can't find their inputs, or have zero
inputs: at present they tend to succeed by default.

* Test varying strong-sum inputs: default, short, long.

* Security audit

* If this code was to read differences or sums from random machines
on the network, then it's a security boundary. Make sure that
corrupt input data can't make the program crash or misbehave.

* Long files

* How do we handle the large signatures required to support large
files? In particular, how do we choose an appropriate block size
when the length is unknown? Perhaps we should allow a way for
the signature to scale up as it grows (see the sketch below).

* Perhaps make extracted signatures still be wrapped in commands.
What would this lead to?

* We'd know how much signature data we expect to read, rather than
requiring it to be terminated by the caller.
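
The "Long files" concern above is now largely addressed by rs_sig_args(),
which scales the recommended block size with the basis file size when it is
known. As a worked illustration of one plausible square-root heuristic (a
hypothetical helper, not necessarily the exact formula librsync uses):

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical: grow block_len roughly as sqrt(file size), rounded up to a
 * multiple of the blake2b 128-byte blocksize, so the signature size also
 * grows only as sqrt(file size). */
static size_t pick_block_len(double fsize) {
    size_t b = ((size_t)sqrt(fsize) + 127) / 128 * 128;
    return b < 2048 ? 2048 : b;   /* keep at least RS_DEFAULT_BLOCK_LEN */
}
/* e.g. a 1 GiB basis: sqrt(2^30) = 32768, already a multiple of 128, giving
 * 32768 blocks and a signature of roughly a megabyte, versus ~19 MB at a
 * fixed 2048-byte block length. */
```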
