Make delta directly process the input stream if it has enough data. #234

dbaarda · 2021-09-14T03:48:46Z

With this change, delta only accumulates data into the scoop buffer when the input stream is too small; otherwise it processes data directly from the input stream. If the input stream always has enough data to make progress when rs_job_iter() is called, delta leaves an unprocessed "tail" fragment in the input buffer that the caller must shuffle and refill. If there is not enough input data, it starts accumulating the data into the internal scoop buffer instead.

This makes delta calculations 5%~15% faster by avoiding all the data copying into the scoop buffer.
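As a rough illustration of the decision described above, here is a minimal self-contained sketch. It does not use librsync's real types or functions: `demo_input_t`, `demo_scoop_t` and `demo_readahead()` are hypothetical stand-ins, and the fixed 64-byte scoop is an assumption made only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT librsync's real types. */
typedef struct {
    const char *next_in; /* next unread byte supplied by the caller */
    size_t avail_in;     /* bytes available at next_in */
} demo_input_t;

typedef struct {
    char buf[64]; /* internal accumulation ("scoop") buffer */
    size_t len;   /* bytes currently accumulated */
} demo_scoop_t;

/* Return a pointer to at least `need` contiguous bytes, or NULL if more
 * input is required. *copied reports how many bytes had to be copied. */
static const char *demo_readahead(demo_scoop_t *scoop, demo_input_t *in,
                                  size_t need, size_t *copied)
{
    assert(need <= sizeof(scoop->buf));
    *copied = 0;
    /* Fast path: nothing accumulated and the input alone is big enough,
     * so process directly from the input with no copy. */
    if (scoop->len == 0 && in->avail_in >= need)
        return in->next_in;
    /* Slow path: top up the scoop from whatever input is available. */
    size_t want = need > scoop->len ? need - scoop->len : 0;
    size_t n = want < in->avail_in ? want : in->avail_in;
    memcpy(scoop->buf + scoop->len, in->next_in, n);
    scoop->len += n;
    in->next_in += n;
    in->avail_in -= n;
    *copied = n;
    return scoop->len >= need ? scoop->buf : NULL;
}
```

In this sketch a caller that always supplies at least `need` bytes never triggers a copy, which is the case the speedup above comes from. A caller that supplies small fragments pays for a copy on each call until enough data has accumulated.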

Make rs_infilebuf_fill() shuffle and top-up input buffers that are more than half empty. Also tidy and tighten assert() statements in rs_infilebuf_fill() and rs_outfilebuf_drain() to require the input/output buffers be fully contained in the rs_filebuf_t buffer.
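The shuffle and top-up policy can be sketched roughly as follows. This is not the code in buf.c: `demo_filebuf_t` and `demo_fill()` are hypothetical, the 16-byte buffer is an assumption for brevity, and the data source is a memory block rather than a file so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a file input buffer; NOT librsync's rs_filebuf_t. */
typedef struct {
    char buf[16];    /* backing storage */
    char *next_in;   /* start of unprocessed data, inside buf */
    size_t avail_in; /* unprocessed bytes at next_in */
} demo_filebuf_t;

/* Shuffle the unprocessed tail to the start of buf and top it up from src,
 * but only when the buffer is more than half empty. Returns bytes added. */
static size_t demo_fill(demo_filebuf_t *fb, const char *src, size_t src_len)
{
    const size_t cap = sizeof(fb->buf);
    /* The unprocessed region must be fully contained in the buffer. */
    assert(fb->next_in >= fb->buf);
    assert(fb->next_in + fb->avail_in <= fb->buf + cap);
    if (fb->avail_in >= cap / 2)
        return 0; /* at least half full: leave it alone */
    /* memmove, not memcpy, because source and destination may overlap. */
    memmove(fb->buf, fb->next_in, fb->avail_in);
    fb->next_in = fb->buf;
    size_t space = cap - fb->avail_in;
    size_t n = src_len < space ? src_len : space;
    memcpy(fb->buf + fb->avail_in, src, n);
    fb->avail_in += n;
    return n;
}
```

Skipping the shuffle while the buffer is still at least half full avoids moving a large tail just to append a small amount of new data.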

Rename rs_scoop_total_avail() to rs_scoop_avail() and make it a static inline in stream.h. Remove rs_job_input_is_ending() from job.[hc] and replace it with a rs_scoop_eof() static inline in stream.h. In scoop.c make rs_scoop_input() only shuffle data to the start of the buffer if necessary to free up space. Also make an assert() check more strict about the data being within the buffer. Slightly tidy up rs_scoop_read_rest(). In delta.c make rs_delta_s_slack() neater by using the new rs_scoop_*() functions.

…oop. Add static inline functions to stream.h for getting and iterating through contiguous data buffers from the scoop. In tube.c remove rs_tube_copy_from_scoop() and rs_tube_copy_from_stream() and just make rs_tube_catchup_copy() iterate through contiguous buffers from the scoop. In rs_tube_catchup() use rs_scoop_eof() to check for eof instead of checking the scoop and stream directly. Remove undefined/unused rs_buffers_copy() from stream.h.
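To show the idea of iterating through contiguous buffers, here is a small sketch in which logically continuous data is split between an accumulated buffer and the caller's input buffer. The names `demo_src_t`, `demo_chunk()` and `demo_copy()` are hypothetical and do not correspond to the inline functions actually added in this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of the data: accumulated bytes first, then the
 * caller's input buffer. NOT librsync's real structures. */
typedef struct {
    const char *scoop; /* previously accumulated data */
    size_t scoop_len;
    const char *in;    /* caller's input buffer */
    size_t in_len;
} demo_src_t;

/* Return the contiguous chunk starting at logical offset pos, holding at
 * most `want` bytes; *got receives the chunk length. */
static const char *demo_chunk(const demo_src_t *s, size_t pos, size_t want,
                              size_t *got)
{
    assert(pos + want <= s->scoop_len + s->in_len);
    if (pos < s->scoop_len) {
        size_t left = s->scoop_len - pos;
        *got = want < left ? want : left;
        return s->scoop + pos;
    }
    *got = want;
    return s->in + (pos - s->scoop_len);
}

/* Copy len bytes to out one contiguous chunk at a time.
 * Returns the number of chunks visited. */
static size_t demo_copy(const demo_src_t *s, size_t len, char *out)
{
    size_t pos = 0, chunks = 0;
    while (pos < len) {
        size_t got;
        const char *p = demo_chunk(s, pos, len - pos, &got);
        memcpy(out + pos, p, got);
        pos += got;
        chunks++;
    }
    return chunks;
}
```

A single loop over chunks replaces separate copy routines for each source. All lengths are plain `size_t`, since no negative error values are needed.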

This means we only accumulate data into the scoop buffer if the input stream is too small, otherwise we process directly from the input stream. In job.h rename scoop_pos to scan_pos and add scan_buf and scan_len for pointing at the current scan data, which can be either in the input stream or the scoop buffer. Also give the scoop and scan fields proper doxygen comments. In delta.c use scan_pos, scan_buf, and scan_len instead of scoop_pos, scoop_next, and scoop_avail respectively. Change rs_getinput() to return an rs_result_t and take the block_len as an argument, and have it set scan_buf and scan_len using rs_scoop_readahead() to get at least enough data to scan and emit a full miss literal command. Change rs_delta_s_scan() and rs_delta_s_flush() to do rs_tube_catchup() before rs_getinput() to consume any literal data off the scoop buffer before refilling it. Change MAX_MISS_LEN from 32K to 64K - 3 cmd bytes. In whole.c change rs_delta_file() to use buffers large enough for 4x 64K literal commands, which is large enough to scan without copying into the scoop buffer.

This gives us a single point for defining the size used for delta commands and streaming buffers. In job.h define MAX_DELTA_CMD to be the maximum size of a single delta command at 64K. In delta.c use MAX_DELTA_CMD to define MAX_MISS_LEN and use it to get the minimum readahead size in rs_getinput(). In whole.c use MAX_DELTA_CMD for defining the buffer sizes used for delta and patch operations.
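The size relationships described above can be written out as a short sketch. MAX_DELTA_CMD and MAX_MISS_LEN are named in this PR, but the definitions below are reconstructed from the prose rather than copied from job.h or delta.c. `DEMO_CMD_BYTES`, `DEMO_DELTA_BUF_LEN` and `demo_literal_cmd_size()` are hypothetical, and reading the "3 cmd bytes" as one opcode byte plus a two-byte length is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum size of a single delta command: 64K (value from the PR text). */
#define MAX_DELTA_CMD ((size_t)1 << 16)

/* Assumption: the 3 command bytes are 1 opcode byte + 2 length bytes. */
#define DEMO_CMD_BYTES ((size_t)3)

/* Longest literal ("miss") run that still fits in one command. */
#define MAX_MISS_LEN (MAX_DELTA_CMD - DEMO_CMD_BYTES)

/* Hypothetical name: a buffer large enough for 4 maximum-size commands. */
#define DEMO_DELTA_BUF_LEN (4 * MAX_DELTA_CMD)

/* Total encoded size of a literal command carrying miss_len data bytes. */
static size_t demo_literal_cmd_size(size_t miss_len)
{
    assert(miss_len <= MAX_MISS_LEN);
    return DEMO_CMD_BYTES + miss_len;
}
```

Under these assumptions MAX_MISS_LEN is 65533 bytes, so a maximum-length literal command is exactly 64K, and the 4x buffer is 256K.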

It turns out ssize_t is a POSIX thing that doesn't exist on Windows. I originally started using it in stream.h so that negative values could be used to indicate errors when iterating through buffers. We don't use or need that, so we can just use size_t instead.

In stream.h remove rs_scoop_input() and in scoop.c make it static inline. This function no longer needs to be called directly anywhere outside scoop.c.

…fers. This points out that large buffers can be processed directly and can leave a tail of data behind in the input buffer. Using large buffers avoids data copies and can be much faster.

This matches the name of scoop.c, so tools like iwyu correctly find and check it.

This makes it iwyu clean.

…our. Update the docstring so it correctly describes how data is processed directly from the input stream if there is sufficient data there.

dbaarda · 2021-09-16T01:06:06Z

I think this is good enough to merge... so I'm going to merge it now.

dbaarda added 14 commits September 11, 2021 13:59

Make rs_scoop_input() only visible inside scoop.c.

b7787a2

In stream.h remove rs_scoop_input() and in scoop.c make it static inline. This function no longer needs to be called directly anywhere outside scoop.c.

Update librsync.h docs about processing directly from large input buf…

b05d78b

…fers. This points out that large buffers can be processed directly and can leave a tail of data behind in the input buffer. Using large buffers avoids data copies and can be much faster.

Add rs_trace call in buf.c to indicate when moving data in buffers.

50eae28

Rename stream.h to scoop.h.

58214f8

This matches the name of scoop.c, so tools like iwyu correctly find and check it.

Fix the includes and include guard for scoop.h.

5c2f3a4

This makes it iwyu clean.

Update scoop.c documentation to reflect the new scoop buffering behav…

4929960

…our. Update the docstring so it correctly describes how data is processed directly from the input stream if there is sufficient data there.

Update NEWS.md with changes so far.

29a3b11

Improve documentation in scoop.h.

c50bd63

dbaarda mentioned this pull request Sep 16, 2021

Add support for delta callback API. #209

Open

dbaarda merged commit d202e4e into librsync:master Sep 16, 2021

dbaarda deleted the dev/scoop1 branch September 16, 2021 01:37
