refdb_fs: enhance performance of globbing #4629

neithernut · 2018-04-18T18:43:24Z

This patch-set addresses the performance of reference iteration with a glob specified. The enhancement is specific to FS-based repositories.

libgit2 provides facilities for iterating over references of a repository. The references may be filtered using a glob, e.g. using an iterator created via git_reference_iterator_glob_new(). At least for FS-based repositories, all references are visited during the iteration and each of the references encountered is matched against the glob. For the general case, this is the only option.

However, a glob may also start with a literal path. In this case, we may scan only the corresponding subdirectory of refs rather than the entire reference space. The cost of this optimization is a single scan of the glob.
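The idea above can be sketched in a few lines of C. This is only an illustration, not the code from the patch: the helper name `literal_dir_prefix_len` is hypothetical, and the set of characters treated as special (`*`, `?`, `[`, `\`) is an assumption about the glob syntax.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libgit2 function): return the length of the
 * longest leading part of `glob` that ends in '/' and contains no glob
 * metacharacters. Returns 0 if there is no such literal directory prefix.
 */
size_t literal_dir_prefix_len(const char *glob)
{
	size_t pos, prefix_len = 0;

	assert(glob != NULL);

	for (pos = 0; glob[pos] != '\0'; ++pos) {
		char c = glob[pos];

		/* Stop at the first character with special meaning in a glob. */
		if (c == '*' || c == '?' || c == '[' || c == '\\')
			break;

		/* Remember the end of the last fully literal directory. */
		if (c == '/')
			prefix_len = pos + 1;
	}

	return prefix_len;
}
```

For a glob such as refs/heads/*, the helper reports the prefix refs/heads/, so only that subdirectory would need to be scanned. A glob that starts with a special character yields a length of 0, which corresponds to the general case of scanning everything.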

Targets #4619.

TODO:

Fix errors shown by the test-suite. I am doing something wrong, probably during path assembly.
Measure the performance enhancement/impact for both the "usual" case and repositories with lots of refs.

neithernut · 2018-04-21T11:14:47Z

This is now ready for review.

I did some preliminary measurements of cases with few references in the repository. The performance was tested by timing 10 consecutive runs of the test-suite:

time for i in {1..10}; do ./libgit2_clar; done

Obviously, the methodology is not quite what you'd want for a real measurement.

The results with the changes:

real	2m44.829s
user	1m4.892s
sys	1m40.423s

The results for the commit this PR is based on (d906a87):

real	2m41.803s
user	1m3.824s
sys	1m38.538s

As you can see, the tests took a bit more time with the changes. This was to be somewhat expected, since in the tests, the globs usually select a significant portion of the references. Hence, the additional, albeit superficial, parse of the glob gains more weight.

However, I also saw spurious segfaults for both the master and the feature-branch. Those hit once while running the test for the master and, I think, not for any of the runs for the feature-branch. So maybe the difference is just one run ending earlier.

Anyways, I will repeat those tests again with a script which gives me more informative statistics. I will also do some tests with one of those repos I have with lots of refs (the target scenario for the optimization).

Btw, you could have a look at the durations reported by travis. But I'm too lazy to compare them by hand.

neithernut · 2018-04-21T16:30:28Z

I did some measurements using a little test program. It queries the number of references matching a glob and prints some counters ("real" time, user and kernel ticks):

#include <stdio.h>
#include <sys/times.h>

#include <git2.h>

static const unsigned int runs = 1000;

/* Callback for git_reference_foreach_glob(): count each matching reference. */
static int countref(const char* name, void* payload) {
    (void) name; /* the name itself is not needed */
    ++*((unsigned int*) payload);
    return 0;
}

/* Usage: <program> <path to repository> <glob> */
int main(int argc, char* argv[]) {
    if (argc < 3)
        return 1;

    struct tms res;
    clock_t t1 = times(&res);

    // Run test
    int err;
    git_libgit2_init();

    git_repository* repo;
    err = git_repository_open(&repo, argv[1]);
    if (err != 0)
        return err;

    unsigned int refs = 0;
    for (unsigned int i = 0; i < runs; ++i) {
        err = git_reference_foreach_glob(repo, argv[2], countref, &refs);
        if (err != 0)
            return err;
    }

    git_repository_free(repo);
    git_libgit2_shutdown();

    // Timing: elapsed "real" ticks, user ticks, kernel ticks, refs per run
    clock_t t2 = times(&res);
    printf("%ld\t%ld\t%ld\t%u\n", (long) (t2 - t1), (long) res.tms_utime,
           (long) res.tms_stime, refs / runs);
    return 0;
}
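The counters printed by the program are clock ticks as reported by times(). To read them as seconds, they can be divided by the tick rate obtained from sysconf(_SC_CLK_TCK). The following is a minimal sketch; the helper names `ticks_to_seconds` and `clock_ticks_per_second` are hypothetical and not part of the test program above.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical helper: query the number of clock ticks per second. */
long clock_ticks_per_second(void)
{
	return sysconf(_SC_CLK_TCK);
}

/* Hypothetical helper: convert ticks reported by times() to seconds. */
double ticks_to_seconds(clock_t ticks, long ticks_per_second)
{
	assert(ticks_per_second > 0);
	return (double) ticks / (double) ticks_per_second;
}
```

The tick rate is system-dependent, so the raw numbers are only comparable between runs on the same machine.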

I ran the program with three sets of repository and glob:

query refs/heads/* for the libgit2 repository ("libgit2")
query refs/remotes/origin/dit/1f280d33e42df23110c74efcf63d2989d653b3fe/* for the git-dit repo ("gitdit1")
query refs/remotes/origin/dit/022fb5e39d1b6a292ee3e7ca375fb6e4f0997382/* for the git-dit repo ("gitdit2")

Sidenote: git-dit is a distributed issue manager of which I am a co-author. We store messages as commits and use references simply for keeping them alive. For each "issue", we have a sub-directory containing those references. The two scenarios involving git-dit fetch those refs for two separate issues.

Each scenario was run 20 times with both the program linked against the feature-branch version of libgit2 ("enhanced") and the merge base ("master", d906a87).

I found slightly improved performance even for the "libgit2" scenario, although the number of references omitted should be quite low (there are no tags and only a few remote references in addition to the two references I had in the repo). For my target scenarios ("gitdit1" and "gitdit2"), I found an improvement by a factor of more than 10.

TL;DR: the patch-set makes globbing refs faster.

For those interested, here are the numbers I got:

"master-libgit2":

"enhanced-libgit2":

"master-gitdit1":

110	36	72	6
111	41	69	6
110	39	70	6
110	37	72	6
110	39	69	6
111	41	69	6
111	39	71	6
111	39	72	6
111	41	70	6
115	41	73	6
111	39	71	6
111	38	71	6
110	40	69	6
111	44	66	6
109	36	73	6
110	38	70	6
110	36	73	6
110	37	71	6
109	35	74	6
110	38	71	6

enhanced-gitdit1:

master-gitdit2:

109	44	64	1
107	35	71	1
108	40	67	1
107	35	71	1
106	43	62	1
110	41	68	1
108	37	70	1
107	34	71	1
107	38	67	1
107	37	69	1
108	41	65	1
109	41	67	1
107	37	69	1
107	38	67	1
107	38	68	1
107	36	70	1
106	39	65	1
108	36	71	1
107	38	68	1
108	36	71	1

enhanced-gitdit2:

pks-t

Thanks for your contribution!

This optimization makes a lot of sense to me; the benefits could be huge when there are a lot of deeply nested references. The code looks good to me, except for two small nits which should be fixed.

pks-t · 2018-04-26T11:28:56Z

src/refdb_fs.c

+	}
+
+	if ((error = git_buf_printf(&path, "%s/", backend->commonpath)) < 0 ||
+		(error = git_buf_put(&path, ref_prefix, ref_prefix_len)) ||


The < 0 comparison is missing
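To illustrate why the comparison matters in such a chain, here is a minimal sketch using hypothetical stand-in functions (`step_ok`, `step_positive`, `step_fail`), assuming the convention that only negative return values signal an error. It does not claim anything about what git_buf_put() actually returns; it only shows how the two forms of the check differ.

```c
#include <assert.h>

/* Hypothetical steps following a "negative means error" convention. */
int step_ok(void)       { return 0; }
int step_positive(void) { return 1; }  /* success, but non-zero */
int step_fail(void)     { return -1; }

/* Every step is compared against 0: only negative values abort the chain. */
int chain_checked(int (*first)(void), int (*second)(void))
{
	int error;

	if ((error = first()) < 0 ||
	    (error = second()) < 0)
		return error;
	return 0;
}

/* The second step lacks "< 0": any non-zero value aborts the chain. */
int chain_unchecked(int (*first)(void), int (*second)(void))
{
	int error;

	if ((error = first()) < 0 ||
	    (error = second()))
		return error;
	return 0;
}
```

With a step that returns a positive value on success, chain_unchecked() treats the result as a failure while chain_checked() does not. Using the same comparison for every step keeps the chain consistent regardless of what each callee returns.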

pks-t · 2018-04-26T11:29:35Z

src/refdb_fs.c

@@ -505,26 +505,53 @@ static int iter_load_loose_paths(refdb_fs_backend *backend, refdb_fs_iter *iter)
 	git_iterator *fsit = NULL;
 	git_iterator_options fsit_opts = GIT_ITERATOR_OPTIONS_INIT;
 	const git_index_entry *entry = NULL;
+	const char *ref_prefix = GIT_REFS_DIR;
+	int ref_prefix_len = strlen(ref_prefix);


strlen returns size_t

pks-t · 2018-04-26T11:33:07Z

src/refdb_fs.c

+				break;
+			case '/':
+				last_sep = pos;
+				/* FALLTHROUGH */


So we're looking for the first non-literal character here, using the preceding literal directories as our base for looking for references. Makes a lot of sense.

neithernut · 2018-04-26T12:07:39Z

Those two things should have been obvious to me. Guess I'm too used to auto and paranoid -W-flags by now...

I'm ready for squashing as soon as I get an approval.

Instead of a hardcoded "refs", we may choose a different directory within the git directory as the root from which we look for references.

A glob used for iteration may start with an entire path containing no special characters. If we start scanning for references within that path rather than in `refs/`, we may end up scanning only a small fraction of all references.

neithernut · 2018-04-27T14:33:49Z

I decided to squash without an additional review, in order to save a round-trip.

pks-t · 2018-04-30T10:20:13Z

Thanks, looks good to me.

@carlosmn: you've got additional comments?

pks-t · 2018-05-09T12:14:22Z

Thanks again for your contribution!

neithernut force-pushed the enhance-glob-perf branch from d7f4689 to 68be55e Compare April 21, 2018 10:48

neithernut changed the title ~~[WIP] Enhance performance of globbing through references for FS-based repos~~ Enhance performance of globbing through references for FS-based repos Apr 21, 2018

neithernut changed the title ~~Enhance performance of globbing through references for FS-based repos~~ refdb_fs: enhance performance of globbing through references for FS-based repos Apr 25, 2018

neithernut changed the title ~~refdb_fs: enhance performance of globbing through references for FS-based repos~~ refdb_fs: enhance performance of globbing Apr 25, 2018

pks-t requested changes Apr 26, 2018

neithernut added 2 commits April 27, 2018 16:30

refdb_fs: prepare arbitration of the root used for ref iteration

27e98cf

Instead of a hardcoded "refs", we may choose a different directory within the git directory as the root from which we look for references.

neithernut force-pushed the enhance-glob-perf branch from 6b82bc1 to 20a2b02 Compare April 27, 2018 14:31

pks-t approved these changes Apr 30, 2018

neithernut mentioned this pull request Apr 30, 2018

git-dit gc is slow neithernut/git-dit#174

Closed

pks-t merged commit 0a19c15 into libgit2:master May 9, 2018

neithernut deleted the enhance-glob-perf branch May 9, 2018 12:56
