Make delta directly process the input stream if it has enough data. #234

Merged
dbaarda merged 14 commits into librsync:master from dev/scoop1 on Sep 16, 2021

dbaarda commented Sep 14, 2021

With this change, delta only accumulates data into the scoop buffer when the input stream holds too little data; otherwise it processes the data directly from the input stream. If the input stream always has enough data to make progress when rs_job_iter() is called, the job leaves an unprocessed "tail" fragment in the input buffer, which the caller must shuffle to the start of the buffer and refill. If there is not enough input data, the job accumulates it into the internal scoop buffer instead.

This makes delta calculations 5% to 15% faster by avoiding the data copies into the scoop buffer.
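
The "shuffle and refill" step a caller performs on the leftover tail can be sketched as follows. This is a minimal illustration under stated assumptions, not librsync code: `in_buf_t`, `in_buf_shuffle()` and `in_buf_refill()` are hypothetical names, and the fields only loosely mirror the idea of a next-input pointer plus an available-byte count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical caller-side input buffer; not a librsync type. */
typedef struct {
    char *base;      /* start of the backing storage */
    size_t size;     /* total capacity of the backing storage */
    char *next_in;   /* first unprocessed byte (start of the "tail") */
    size_t avail_in; /* number of unprocessed bytes */
} in_buf_t;

/* Move the unprocessed tail to the start of the buffer and return the
   number of free bytes that now follow it. */
static size_t in_buf_shuffle(in_buf_t *b)
{
    /* The unprocessed data must lie entirely inside the backing storage. */
    assert(b->next_in >= b->base);
    assert(b->next_in + b->avail_in <= b->base + b->size);
    if (b->next_in != b->base) {
        /* memmove, not memcpy: source and destination may overlap. */
        memmove(b->base, b->next_in, b->avail_in);
        b->next_in = b->base;
    }
    return b->size - b->avail_in;
}

/* Shuffle, then append up to len bytes of new data after the tail.
   Returns the number of bytes actually copied in. */
static size_t in_buf_refill(in_buf_t *b, const char *src, size_t len)
{
    size_t space = in_buf_shuffle(b);
    if (len > space)
        len = space;
    if (len > 0)
        memcpy(b->base + b->avail_in, src, len);
    b->avail_in += len;
    return len;
}
```

A caller reading from a file would typically replace the `memcpy()` in `in_buf_refill()` with an `fread()` into the free space after the tail.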

Make rs_infilebuf_fill() shuffle and top-up input buffers that are more than
half empty.

Also tidy and tighten the assert() statements in rs_infilebuf_fill() and
rs_outfilebuf_drain() to require that the input/output buffers be fully
contained in the rs_filebuf_t buffer.

Rename rs_scoop_total_avail() to rs_scoop_avail() and make it a static inline
in stream.h.

Remove rs_job_input_is_ending() from job.[hc] and replace it with a
rs_scoop_eof() static inline in stream.h.

In scoop.c make rs_scoop_input() only shuffle data to the start of the buffer
if necessary to free up space. Also make an assert() check more strict about
the data being within the buffer. Slightly tidy up rs_scoop_read_rest().

In delta.c make rs_delta_s_slack() neater by using the new rs_scoop_*()
functions.
…oop.

Add static inline functions to stream.h for getting and iterating through
contiguous data buffers from the scoop.

In tube.c remove rs_tube_copy_from_scoop() and rs_tube_copy_from_stream() and
just make rs_tube_catchup_copy() iterate through contiguous buffers from the
scoop. In rs_tube_catchup() use rs_scoop_eof() to check for eof instead of
checking the scoop and stream directly.
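
Iterating through contiguous buffers, as described above, might look roughly like the sketch below. It assumes pending data is split into bytes already accumulated in an internal buffer followed by bytes still in the caller's stream. The names `pending_t`, `pending_chunk()` and `pending_copy()` are hypothetical and are not the inline functions this PR adds to stream.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of pending input: bytes held in an internal buffer,
   logically followed by bytes still in the caller's input stream. */
typedef struct {
    const char *scoop;  /* accumulated data */
    size_t scoop_len;
    const char *stream; /* data still in the input stream */
    size_t stream_len;
} pending_t;

/* Return a pointer to the contiguous run that contains logical offset pos
   and store the run's remaining length in *len. At the end of the data,
   return NULL and set *len to 0. */
static const char *pending_chunk(const pending_t *p, size_t pos, size_t *len)
{
    if (pos < p->scoop_len) {
        *len = p->scoop_len - pos;
        return p->scoop + pos;
    }
    pos -= p->scoop_len;
    if (pos < p->stream_len) {
        *len = p->stream_len - pos;
        return p->stream + pos;
    }
    *len = 0;
    return NULL;
}

/* Copy up to max bytes of pending data into out, one contiguous run at a
   time. Returns the number of bytes copied. */
static size_t pending_copy(const pending_t *p, char *out, size_t max)
{
    size_t done = 0, len;
    const char *ptr;

    while (done < max && (ptr = pending_chunk(p, done, &len)) != NULL) {
        if (len > max - done)
            len = max - done;
        memcpy(out + done, ptr, len);
        done += len;
    }
    return done;
}
```

The copy loop never needs to know which buffer a run came from, which is the property that lets one loop replace separate copy-from-scoop and copy-from-stream helpers.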

Remove undefined/unused rs_buffers_copy() from stream.h.

This means we only accumulate data into the scoop buffer if the input stream
is too small, otherwise we process directly from the input stream.

In job.h rename scoop_pos to scan_pos and add scan_buf and scan_len for
pointing at the current scan data, which can be either in the input stream or
the scoop buffer. Also give the scoop and scan fields proper doxygen comments.

In delta.c use scan_pos, scan_buf, and scan_len instead of scoop_pos,
scoop_next, and scoop_avail respectively. Change rs_getinput() to return an
rs_result_t and take the block_len as an argument, and have it set scan_buf
and scan_len using rs_scoop_readahead() to get at least enough data to scan and
emit a full miss literal command. Change rs_delta_s_scan() and
rs_delta_s_flush() to do rs_tube_catchup() before rs_getinput() to consume any
literal data off the scoop buffer before refilling it. Change MAX_MISS_LEN
from 32K to 64K - 3 cmd bytes.
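
The choice between scanning directly from the input stream and accumulating into the scoop can be illustrated with a simplified sketch. It is not the real rs_getinput(): `job_t`, `get_input()`, `SCOOP_CAP` and the `eof` flag are hypothetical stand-ins, and only the field names scan_buf and scan_len are borrowed from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCOOP_CAP 64 /* hypothetical capacity, kept tiny for illustration */

/* Hypothetical, heavily simplified job state. */
typedef struct {
    char scoop[SCOOP_CAP]; /* internal accumulation buffer */
    size_t scoop_len;      /* bytes currently held in the scoop */
    const char *in;        /* caller's input stream data */
    size_t in_len;         /* bytes available in the input stream */
    const char *scan_buf;  /* where scanning should read from */
    size_t scan_len;       /* bytes available at scan_buf */
} job_t;

/* Point scan_buf/scan_len at no less than `need` bytes (or whatever is
   left at eof). Returns 1 if scan data is ready, 0 if more input is
   required. */
static int get_input(job_t *j, size_t need, int eof)
{
    assert(need <= SCOOP_CAP);

    /* Fast path: nothing accumulated and the stream alone is enough, so
       scan in place without copying. */
    if (j->scoop_len == 0 && (j->in_len >= need || eof)) {
        j->scan_buf = j->in;
        j->scan_len = j->in_len;
        return 1;
    }

    /* Slow path: top the scoop up towards `need` bytes from the stream. */
    size_t want = need > j->scoop_len ? need - j->scoop_len : 0;
    size_t n = want < j->in_len ? want : j->in_len;
    if (n > 0) {
        memcpy(j->scoop + j->scoop_len, j->in, n);
        j->scoop_len += n;
        j->in += n;
        j->in_len -= n;
    }
    if (j->scoop_len >= need || eof) {
        j->scan_buf = j->scoop;
        j->scan_len = j->scoop_len;
        return 1;
    }
    return 0; /* blocked until the caller supplies more input */
}
```

With a large enough input buffer every call takes the fast path, so no bytes are copied into the scoop at all.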

In whole.c change rs_delta_file() to use buffers large enough for 4x 64K
literal commands, which is large enough to scan without copying into the scoop
buffer.

This gives us a single point for defining the size used for delta commands and
streaming buffers.

In job.h define MAX_DELTA_CMD to be the maximum size of a single delta command
at 64K.

In delta.c use MAX_DELTA_CMD to define MAX_MISS_LEN and use it to get the
minimum readahead size in rs_getinput().
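
The size arithmetic in these commit messages can be written out as a small sketch. The names MAX_DELTA_CMD and MAX_MISS_LEN and the 64K, "3 cmd bytes" and 4x figures come from the text above. Reading the three bytes as one command byte plus two length bytes is an assumption, and `LITERAL_CMD_OVERHEAD`, `DELTA_FILE_BUF_LEN` and `literal_cmd_size()` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum size of a single delta command: 64K. */
#define MAX_DELTA_CMD ((size_t)1 << 16)

/* Assumption: the "3 cmd bytes" are 1 command byte + 2 length bytes. */
#define LITERAL_CMD_OVERHEAD ((size_t)3)

/* Largest literal payload that still fits in one 64K command. */
#define MAX_MISS_LEN (MAX_DELTA_CMD - LITERAL_CMD_OVERHEAD)

/* Hypothetical name: a file buffer sized for four maximum-size commands. */
#define DELTA_FILE_BUF_LEN (4 * MAX_DELTA_CMD)

/* Total encoded size of a literal command carrying data_len bytes. */
static size_t literal_cmd_size(size_t data_len)
{
    return LITERAL_CMD_OVERHEAD + data_len;
}
```

Deriving every size from one constant means a change to the maximum command size propagates to the miss length and the buffer sizes without further edits.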

In whole.c use MAX_DELTA_CMD for defining the buffer sizes used for delta and
patch operations.

It turns out ssize_t is a POSIX type that doesn't exist on Windows. I
originally started using it in stream.h so that negative values could be used
to indicate errors when iterating through buffers. We don't use or need that,
so we can just use size_t instead.

In stream.h remove rs_scoop_input() and in scoop.c make it static inline.

This function no longer needs to be called directly anywhere outside scoop.c.
…fers.

This points out that large buffers can be processed directly and can leave a
tail of data behind in the input buffer. Using large buffers avoids data
copies and can be much faster.

This means it matches the name of scoop.c which means tools like iwyu
correctly find and check it.
…our.

Update the docstring so it correctly describes how data is processed directly
from the input stream if there is sufficient data there.

dbaarda commented Sep 16, 2021

I think this is good enough to merge... so I'm going to merge it now.

dbaarda merged commit d202e4e into librsync:master on Sep 16, 2021
dbaarda deleted the dev/scoop1 branch on September 16, 2021 01:37