Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not merging COPY changes for duplicate blocks #4

Closed
therealmik opened this issue Jul 16, 2014 · 4 comments
Closed

Not merging COPY changes for duplicate blocks #4

therealmik opened this issue Jul 16, 2014 · 4 comments

Comments

@therealmik
Copy link
Contributor

If you create a file that's (for example) all-sparse, the signatures will all have the same hash.

The delta file for a (slightly) changed version (eg. you append some data to the end) will contain a COPY of the first block repeated for each block, rather than a merged COPY.

The most efficient representation of the delta would by in a single merged COPY (which would also produce correct results).

@dbaarda dbaarda changed the title Not merging COPY changes for repeated blocks Not merging COPY changes for duplicate blocks Jun 2, 2017
@dbaarda
Copy link
Member

dbaarda commented Jun 2, 2017

I've seen discussion of this somewhere else. The problem is when the src contains duplicate blocks the first found match in the hashtable is not necessarily the best block to use for merging the COPY commands.

The new hashtable we use supports a custom match check function defined in sumset.c as rs_block_match_cmp(). We could extend this function and the corresponding rs_block_match_t to also take into account the block position, returning the match if it's in the ideal position, otherwise accumulate a list of possible matching blocks and continue, if at the end we never found an ideal match, we grab the first match in the list of candidates.

Note this would have a terrible impact on the hashtable performance, since it would result in exhaustive searches of the hashtable nearly every time instead of exiting on the first match. It would be particularly bad with the massive collisions we have when there are a huge number of identical blocks.

To avoid that degenerate case, we probably need to add special handling for identical blocks.

@dbaarda
Copy link
Member

dbaarda commented Jun 3, 2017

Thinking about this, the current Open Addressing hashtable performance will degenerate badly for large numbers of duplicate blocks. Currently we add all these blocks to the hashtable and only return the first match. However, with Open Addressing these duplicates cascade into the adjacent hashtable entries, which adversely affects all the adjacent hash values.

I think this means it will have degenerate behavior for the common case of a large number of duplicate zero blocks.

I think we need to fix this in the short term by not adding duplicate blocks into the hashtable at all, and only using the first one.

Whether we take the next step of keeping all duplicates and picking the best match is debatable. It might be that disk caching wins would mean it's better to re-use the first match anyway.

@dbaarda
Copy link
Member

dbaarda commented Aug 29, 2017

Pull request #112 fixes the degenerate hashtable performance for duplicate blocks, but pretty much removes any chance of merging long sequences of duplicate copies into a single long copy.

However, disk caching might mean re-copying the first block multiple times is cheaper when applying the patch any way. The slightly larger delta file might be worth it.

Another way to reduce the size of the delta file would be to add "run-length-encoding" into the delta file cmnds, so we could say "copy block x N times". However, this is a pretty specialized optimization with a fairly disruptive change so I don't think it would be worth it.

IMHO after #112 is in we can close this bug.

@dbaarda
Copy link
Member

dbaarda commented Oct 11, 2017

#112 is in. I don't think the optimization of extending the delta commands to include run-length-encoding of duplicate blocks is worth it. I'm closing this.

@dbaarda dbaarda closed this as completed Oct 11, 2017
genail pushed a commit to patchkit-net/librsync that referenced this issue Oct 24, 2023
Add static config.h for aarch64-unknown-linux-musl
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants