-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refdb_fs: enhance performance of globbing #4629
Conversation
d7f4689
to
68be55e
Compare
This is now ready for review. I did some preliminary measurements of cases with few references in the repository. The performance was tested by timing 10 consecutive runs of the test-suite:
Obviously, the methodology is not quite what you'd want for a real measurement. The results with the changes:
The results for the commit this PR is based on (d906a87):
As you can see, the tests took a bit more time with the changes. This was to be somewhat expected, since in the tests, the globs usually select a significant portion of the references. Hence, the additional, albeit superficial, parse of the glob gains more weight. However, I also saw spurious segfaults for both the master and the feature-branch. Those hit once while running the test for the master and, I think, not for any of the runs for the feature-branch. So maybe the difference is just one run ending earlier. Anyways, I will repeat those tests again with a script which gives me more informative statistics. I will also do some tests with one of those repos I have with lots of refs (the target scenario for the optimization). Btw, you could have a look at the durations reported by travis. But I'm too lazy to compare them by hand. |
I did some measurements using a little test program. It queries the number of references matching a glob and prints some counters ("real" time, user and kernel ticks):
I ran the program with three sets of repository and glob:
Sidenote: git-dit is a distributed issue manager of which I am a co-author. We store messages as commits and use references simply for keeping them alive. For each "issue", we have a sub-directory containing those references. The two scenarios involving git-dit fetch those refs for two separate issues. Each scenario was run 20 times with both the program linked against the feature-branch version of libgit2 ("enhanced") and the merge base ("master", d906a87). I found a slight improved performance even for the "libgit2" scenario, although the number of references omitted should be quite low (there are not tags and only a few remote references in addition to the two references I had in the repo). For my target scenarios ("gitdit1" and "gitdit2"), I found an improvement of factors >10. TL;DR: the patch-set makes globbing refs faster. For those interested, here are the numbers I got: "master-libgit2":
"enhanced-libgit2":
"master-gitdit1"
enhanced-gitdit1:
master-gitdit2:
enhanced-gitdit2:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution!
This optimization makes a lot of sense to me, the benefits could be huge when there's a lot of deeply nested references. Code looks good to me, except for two small nits which should be fixed.
src/refdb_fs.c
Outdated
} | ||
|
||
if ((error = git_buf_printf(&path, "%s/", backend->commonpath)) < 0 || | ||
(error = git_buf_put(&path, ref_prefix, ref_prefix_len)) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The < 0
comparison is missing
src/refdb_fs.c
Outdated
@@ -505,26 +505,53 @@ static int iter_load_loose_paths(refdb_fs_backend *backend, refdb_fs_iter *iter) | |||
git_iterator *fsit = NULL; | |||
git_iterator_options fsit_opts = GIT_ITERATOR_OPTIONS_INIT; | |||
const git_index_entry *entry = NULL; | |||
const char *ref_prefix = GIT_REFS_DIR; | |||
int ref_prefix_len = strlen(ref_prefix); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strlen
returns size_t
break; | ||
case '/': | ||
last_sep = pos; | ||
/* FALLTHROUGH */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're looking for the first non-literal character here, using the previous literal directories as our base for looking for references. Makes a lot of sense
Those two things should have been obvious to me. Guess I'm too used to I'm ready for squashing as soon as I get an approval. |
Instead of a hardcoded "refs", we may choose a different directory within the git directory as the root from which we look for references.
A glob used for iteration may start with an entire path containing no special characters. If we start scanning for references within that path rather than in `refs/`, we may end up scanning only a small fraction of all references.
6b82bc1
to
20a2b02
Compare
I decided to squash without an additional review, in order to save a round-trip. |
Thanks, looks good to me. @carlosmn: you've got additional comments? |
Thanks again for your contribution! |
This patch-set addresses the performance of reference iteration with a glob specified. The enhancement is specific to FS-based repositories.
libgit2 provides facilities for iterating over references of a repository. The references may be filtered using a glob, e.g. using an iterator created via
git_reference_iterator_glob_new()
. At least for FS-based repositories, all references are visited during the iteration and each of the references encountered is matches against the glob. For the general case, this is the only option.However, a glob may also start with a literal path. In this case, we may scan only the corresponding subdirectory of
refs
rather than the entire reference space. The cost of this optimization is a single scan of the glob.Targets #4619.
TODO: