Make delta directly process the input stream if it has enough data. #234
Merged
Make rs_infilebuf_fill() shuffle and top up input buffers that are more than half empty. Also tidy and tighten the assert() statements in rs_infilebuf_fill() and rs_outfilebuf_drain() to require that the input/output buffers be fully contained in the rs_filebuf_t buffer.
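As a rough illustration of the shuffle-and-top-up idea, here is a minimal sketch. The `demo_inbuf` type and `demo_inbuf_fill()` are hypothetical stand-ins, not librsync's `rs_filebuf_t` or `rs_infilebuf_fill()`, and a memory source is assumed in place of a file read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an input file buffer (not rs_filebuf_t). */
typedef struct demo_inbuf {
    char *buf;        /* start of the backing storage */
    size_t buf_len;   /* size of the backing storage */
    char *next_in;    /* first unconsumed byte, inside buf */
    size_t avail_in;  /* unconsumed bytes starting at next_in */
} demo_inbuf;

/* Top up the buffer from src (standing in for a file read).
 * Returns the number of bytes taken from src. */
static size_t demo_inbuf_fill(demo_inbuf *b, const char *src, size_t src_len)
{
    /* The unconsumed data must lie fully inside the backing buffer. */
    assert(b->next_in >= b->buf);
    assert(b->next_in + b->avail_in <= b->buf + b->buf_len);

    /* Leave the buffer alone unless it is more than half empty. */
    if (b->avail_in * 2 >= b->buf_len)
        return 0;

    /* Shuffle the unconsumed tail to the start of the buffer. */
    if (b->next_in != b->buf) {
        memmove(b->buf, b->next_in, b->avail_in);
        b->next_in = b->buf;
    }

    /* Top up the free space after the tail. */
    size_t space = b->buf_len - b->avail_in;
    size_t n = src_len < space ? src_len : space;
    if (n > 0)
        memcpy(b->buf + b->avail_in, src, n);
    b->avail_in += n;
    return n;
}
```

Skipping the refill while the buffer is still at least half full keeps the number of memmove() calls low when the consumer only takes small bites.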
Rename rs_scoop_total_avail() to rs_scoop_avail() and make it a static inline in stream.h. Remove rs_job_input_is_ending() from job.[hc] and replace it with a rs_scoop_eof() static inline in stream.h. In scoop.c make rs_scoop_input() only shuffle data to the start of the buffer if necessary to free up space. Also make an assert() check stricter about the data being within the buffer. Slightly tidy up rs_scoop_read_rest(). In delta.c make rs_delta_s_slack() neater by using the new rs_scoop_*() functions.
…oop. Add static inline functions to stream.h for getting and iterating through contiguous data buffers from the scoop. In tube.c remove rs_tube_copy_from_scoop() and rs_tube_copy_from_stream() and just make rs_tube_catchup_copy() iterate through contiguous buffers from the scoop. In rs_tube_catchup() use rs_scoop_eof() to check for eof instead of checking the scoop and stream directly. Remove undefined/unused rs_buffers_copy() from stream.h.
This means we only accumulate data into the scoop buffer if the input stream is too small, otherwise we process directly from the input stream. In job.h rename scoop_pos to scan_pos and add scan_buf and scan_len for pointing at the current scan data, which can be either in the input stream or the scoop buffer. Also give the scoop and scan fields proper doxygen comments. In delta.c use scan_pos, scan_buf, and scan_len instead of scoop_pos, scoop_next, and scoop_avail respectively. Change rs_getinput() to return an rs_result_t and take the block_len as an argument, and have it set scan_buf and scan_len using rs_scoop_readhead() to get at least enough data to scan and emit a full miss literal command. Change rs_delta_s_scan() and rs_delta_s_flush() to do rs_tube_catchup() before rs_getinput() to consume any literal data off the scoop buffer before refilling it. Change the MAX_MISS_LEN to 64K - 3 cmd bytes from 32K. In whole.c change rs_delta_file() to use buffers large enough for 4x 64K literal commands, which is large enough to scan without copying into the scoop buffer.
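The choice between scanning in place and accumulating into the scoop can be sketched roughly as follows. Everything here is hypothetical: `demo_job`, `demo_getinput()`, and the fixed 64-byte scoop are illustration only and do not reproduce librsync's rs_job_t, rs_getinput(), or its return codes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical job state (not rs_job_t). */
typedef struct demo_job {
    const char *next_in;   /* caller's input stream */
    size_t avail_in;
    char scoop[64];        /* internal accumulation buffer */
    size_t scoop_avail;
    const char *scan_buf;  /* where the scan currently reads from */
    size_t scan_len;
} demo_job;

/* Point scan_buf/scan_len at `need` or more bytes if possible.
 * Returns 1 when scanning can proceed, 0 when more input is required. */
static int demo_getinput(demo_job *job, size_t need)
{
    assert(need <= sizeof(job->scoop));

    if (job->scoop_avail == 0 && job->avail_in >= need) {
        /* Enough data in the input stream: scan it in place, no copy. */
        job->scan_buf = job->next_in;
        job->scan_len = job->avail_in;
        return 1;
    }

    /* Otherwise accumulate into the scoop until it holds `need` bytes. */
    size_t want = need > job->scoop_avail ? need - job->scoop_avail : 0;
    size_t n = want < job->avail_in ? want : job->avail_in;
    if (n > 0) {
        memcpy(job->scoop + job->scoop_avail, job->next_in, n);
        job->scoop_avail += n;
        job->next_in += n;
        job->avail_in -= n;
    }
    job->scan_buf = job->scoop;
    job->scan_len = job->scoop_avail;
    return job->scoop_avail >= need;
}
```

With a large caller buffer the first branch is taken on every call, so no bytes are copied before scanning.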
This gives us a single point for defining the size used for delta commands and streaming buffers. In job.h define MAX_DELTA_CMD to be the maximum size of a single delta command at 64K. In delta.c use MAX_DELTA_CMD to define MAX_MISS_LEN and use it to get the minimum readahead size in rs_getinput(). In whole.c use MAX_DELTA_CMD for defining the buffer sizes used for delta and patch operations.
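The size relationships described in these commits can be written down as a small sketch. The `DEMO_` names and the `demo_literal_cmds()` helper are hypothetical; only the values (64K per command, 3 command bytes, 4x buffers) come from the commit messages.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum size of a single delta command: 64K (value from the commit message). */
#define DEMO_MAX_DELTA_CMD ((size_t)1 << 16)
/* Longest literal payload: one full command minus 3 command bytes. */
#define DEMO_MAX_MISS_LEN (DEMO_MAX_DELTA_CMD - 3)
/* Streaming buffer sized for 4 full literal commands. */
#define DEMO_DELTA_BUF_LEN (4 * DEMO_MAX_DELTA_CMD)

/* Hypothetical helper: how many literal commands a run of `len`
 * unmatched bytes needs when each carries at most DEMO_MAX_MISS_LEN. */
static size_t demo_literal_cmds(size_t len)
{
    return (len + DEMO_MAX_MISS_LEN - 1) / DEMO_MAX_MISS_LEN;
}
```

Deriving every size from one constant means a change to the command size automatically propagates to the literal limit and the buffer sizes.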
It turns out ssize_t is a POSIX type that doesn't exist on Windows. I originally started using it in stream.h so that negative values could be used to indicate errors when iterating through buffers. We don't use or need that, so we can just use size_t instead.
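A minimal sketch of iterating through contiguous buffers with only size_t, where a length of 0 signals the end and no negative sentinel is needed. The `demo_scoop` type and both functions are hypothetical and are not the static inlines added to stream.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of pending data: scoop bytes first, then stream bytes. */
typedef struct demo_scoop {
    const char *scoop_next;
    size_t scoop_avail;
    const char *next_in;
    size_t avail_in;
} demo_scoop;

/* Find the contiguous piece that starts `done` bytes into the data.
 * Sets *buf and returns its length; returns 0 when no data is left. */
static size_t demo_next_piece(const demo_scoop *s, size_t done,
                              const char **buf)
{
    if (done < s->scoop_avail) {
        *buf = s->scoop_next + done;
        return s->scoop_avail - done;
    }
    done -= s->scoop_avail;
    if (done < s->avail_in) {
        *buf = s->next_in + done;
        return s->avail_in - done;
    }
    *buf = NULL;
    return 0;
}

/* Copy up to out_len bytes by walking the contiguous pieces in order.
 * Returns the number of bytes copied. */
static size_t demo_copy_all(const demo_scoop *s, char *out, size_t out_len)
{
    size_t done = 0, n;
    const char *buf;

    while (done < out_len && (n = demo_next_piece(s, done, &buf)) > 0) {
        if (n > out_len - done)
            n = out_len - done;
        memcpy(out + done, buf, n);
        done += n;
    }
    return done;
}
```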
In stream.h remove rs_scoop_input() and in scoop.c make it static inline. This function no longer needs to be called directly anywhere outside scoop.c.
…fers. This points out that large buffers can be processed directly and can leave a tail of data behind in the input buffer. Using large buffers avoids data copies and can be much faster.
This means it matches the name of scoop.c, so tools like iwyu correctly find and check it.
This makes it iwyu clean.
…our. Update the docstring so it correctly describes how data is processed directly from the input stream if there is sufficient data there.
I think this is good enough to merge... so I'm going to merge it now.
This means delta only accumulates data into the scoop buffer if the input stream is too small; otherwise it processes directly from the input stream. If the input stream always has enough input data to progress when calling rs_job_iter(), delta will leave an unprocessed "tail" fragment in the input buffer that the caller must shuffle and refill. If there is not enough input data, it will start accumulating the data into the internal scoop buffer instead.
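The caller's side of this contract, shuffling the unprocessed tail to the front and refilling behind it, might look roughly like the sketch below. `demo_buffers`, `demo_iter()`, and `demo_run()` are hypothetical; `demo_iter()` merely stands in for rs_job_iter() by consuming whole 4-byte records and leaving any shorter tail until end of input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stream state (not rs_buffers_t). */
typedef struct demo_buffers {
    const char *next_in;
    size_t avail_in;
    int eof_in;
} demo_buffers;

/* Stand-in consumer: takes whole 4-byte records and leaves a shorter
 * tail behind unless eof_in is set. Copies what it consumed to out. */
static size_t demo_iter(demo_buffers *b, char *out)
{
    size_t n = b->eof_in ? b->avail_in : b->avail_in - b->avail_in % 4;
    if (n > 0)
        memcpy(out, b->next_in, n);
    b->next_in += n;
    b->avail_in -= n;
    return n;
}

/* Feed src through demo_iter() using the caller-owned buffer buf.
 * Returns the total number of bytes consumed. */
static size_t demo_run(const char *src, size_t src_len,
                       char *buf, size_t buf_len, char *out)
{
    demo_buffers b = { buf, 0, 0 };
    size_t total = 0;

    assert(buf_len >= 4); /* must hold at least one whole record */
    while (!b.eof_in || b.avail_in > 0) {
        /* Shuffle the unprocessed tail to the start of the buffer. */
        if (b.avail_in > 0 && b.next_in != buf)
            memmove(buf, b.next_in, b.avail_in);
        b.next_in = buf;

        /* Refill the free space behind the tail. */
        size_t space = buf_len - b.avail_in;
        size_t n = src_len < space ? src_len : space;
        if (n > 0)
            memcpy(buf + b.avail_in, src, n);
        src += n;
        src_len -= n;
        b.avail_in += n;
        if (src_len == 0)
            b.eof_in = 1;

        total += demo_iter(&b, out + total);
    }
    return total;
}
```

The larger the caller's buffer relative to the tail, the smaller the share of bytes that ever gets moved by memmove().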
This makes delta calculations 5%~15% faster by avoiding all the data copying into the scoop buffer.