refdb_fs: enhance performance of globbing #4629

Merged
merged 2 commits into from
May 9, 2018

@neithernut neithernut commented Apr 18, 2018

This patch-set addresses the performance of reference iteration with a glob specified. The enhancement is specific to FS-based repositories.

libgit2 provides facilities for iterating over the references of a repository. The references may be filtered using a glob, e.g. using an iterator created via git_reference_iterator_glob_new(). At least for FS-based repositories, all references are visited during the iteration and each reference encountered is matched against the glob. For the general case, this is the only option.
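To make the baseline behaviour concrete, here is a minimal sketch of "visit everything, then match". It is not libgit2's code: POSIX fnmatch() stands in for whatever matcher the library uses internally, and the reference names and the function name `count_matching_refs` are made up for illustration.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/*
 * Count the names in `refs` that match `glob`.
 * Every name is visited, no matter how selective the glob is.
 * fnmatch() is a stand-in here, not necessarily libgit2's own matcher.
 */
size_t count_matching_refs(const char *const *refs, size_t nrefs,
                           const char *glob)
{
    size_t i, count = 0;

    assert(refs != NULL && glob != NULL);

    for (i = 0; i < nrefs; ++i) {
        /* fnmatch() returns 0 on a match. With flags == 0, '*' also
         * matches '/' characters. */
        if (fnmatch(glob, refs[i], 0) == 0)
            ++count;
    }
    return count;
}

/* Hypothetical reference names, for illustration only. */
static const char *const example_refs[] = {
    "refs/heads/master",
    "refs/heads/feature/x",
    "refs/tags/v1",
    "refs/remotes/origin/master",
};
```

The cost of this loop grows with the total number of references, not with the number of matches, which is the case the patch-set tries to improve.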

However, a glob may also start with a literal path. In this case, we may scan only the corresponding subdirectory of refs rather than the entire reference space. The cost of this optimization is a single scan of the glob.
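The "single scan of the glob" could look roughly like the sketch below. It is an illustration under assumptions, not the code from the patch: the function name `literal_dir_prefix_len` is hypothetical, and the set of characters treated as special (`?`, `*`, `[`, `\`) is my assumption about what a glob matcher would consider non-literal.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length of the longest leading part of `glob` that consists only
 * of complete, literal directory components, including the trailing '/'.
 * Returns 0 if the glob has no such prefix.
 */
size_t literal_dir_prefix_len(const char *glob)
{
    size_t pos, last_sep = 0;

    assert(glob != NULL);

    for (pos = 0; glob[pos] != '\0'; ++pos) {
        switch (glob[pos]) {
        case '?':
        case '*':
        case '[':
        case '\\':
            /* First special character: stop scanning. */
            return last_sep;
        case '/':
            /* Remember the end of the last complete directory. */
            last_sep = pos + 1;
            break;
        default:
            break;
        }
    }
    /* No special character at all: the last component may be a file name,
     * so only the directories before it are used. */
    return last_sep;
}
```

For a glob such as refs/heads/*, the prefix covers refs/heads/, so only that directory would need to be scanned. A glob starting with a special character yields an empty prefix, which corresponds to the general case of scanning everything.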

Targets #4619.

TODO:

  • Fix errors shown by the test-suite. I am doing something wrong, probably during path assembly.
  • Measure the performance enhancement/impact for both the "usual" case and repositories with lots of refs.


neithernut commented Apr 21, 2018

This is now ready for review.


I did some preliminary measurements of cases with few references in the repository. The performance was tested by timing 10 consecutive runs of the test-suite:

time for i in {1..10}; do ./libgit2_clar; done

Obviously, the methodology is not quite what you'd want for a real measurement.

The results with the changes:

real	2m44.829s
user	1m4.892s
sys	1m40.423s

The results for the commit this PR is based on (d906a87):

real	2m41.803s
user	1m3.824s
sys	1m38.538s

As you can see, the tests took a bit more time with the changes. This was to be somewhat expected, since in the tests, the globs usually select a significant portion of the references. Hence, the additional, albeit superficial, parse of the glob gains more weight.

However, I also saw spurious segfaults for both the master and the feature-branch. Those hit once while running the test for the master and, I think, not for any of the runs for the feature-branch. So maybe the difference is just one run ending earlier.

Anyways, I will repeat those tests again with a script which gives me more informative statistics. I will also do some tests with one of those repos I have with lots of refs (the target scenario for the optimization).

Btw, you could have a look at the durations reported by travis. But I'm too lazy to compare them by hand.

@neithernut changed the title from "[WIP] Enhance performance of globbing through references for FS-based repos" to "Enhance performance of globbing through references for FS-based repos" on Apr 21, 2018
@neithernut

I did some measurements using a little test program. It queries the number of references matching a glob and prints some counters ("real" time, user and kernel ticks):

#include <stdio.h>
#include <sys/times.h>

#include <git2.h>

static const unsigned int runs = 1000;

int countref(const char* name, void* dummy) {
    ++*((unsigned int*) dummy);
    return 0;
}

int main(int argc, char* argv[]) {
    if (argc < 3)
        return 1;

    struct tms res;
    clock_t t1 = times(&res);

    // Run test
    int err;
    git_libgit2_init();

    git_repository* repo;
    err = git_repository_open(&repo, argv[1]);
    if (err != 0)
        return err;

    unsigned int refs = 0;
    for (unsigned int i = 0; i < runs; ++i) {
        err = git_reference_foreach_glob(repo, argv[2], countref, &refs);
        if (err != 0)
            return err;
    }

    git_repository_free(repo);
    git_libgit2_shutdown();

    // Timing
    clock_t t2 = times(&res);
    printf("%ld\t%ld\t%ld\t%u\n", (long) (t2 - t1), (long) res.tms_utime, (long) res.tms_stime, refs / runs);
    return 0;
}
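Note that the counters printed by this program are clock ticks as returned by times(), not seconds. A small sketch of how they could be converted, assuming a POSIX system; the helper names `ticks_to_ms` and `system_ticks_per_second` are hypothetical and not part of the test program:

```c
#include <assert.h>
#include <unistd.h>

/* Convert a tick count, as reported by times(), to milliseconds. */
double ticks_to_ms(long ticks, long ticks_per_second)
{
    assert(ticks_per_second > 0);
    return (double) ticks * 1000.0 / (double) ticks_per_second;
}

/* Ticks per second on the running system, queried via sysconf(). */
long system_ticks_per_second(void)
{
    return sysconf(_SC_CLK_TCK);
}
```

The tick rate is system-dependent, so absolute durations should only be compared between runs on the same machine.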

I ran the program with three sets of repository and glob:

  • query refs/heads/* for the libgit2 repository ("libgit2")
  • query refs/remotes/origin/dit/1f280d33e42df23110c74efcf63d2989d653b3fe/* for the git-dit repo ("gitdit1")
  • query refs/remotes/origin/dit/022fb5e39d1b6a292ee3e7ca375fb6e4f0997382/* for the git-dit repo ("gitdit2")

Sidenote: git-dit is a distributed issue manager of which I am a co-author. We store messages as commits and use references simply for keeping them alive. For each "issue", we have a sub-directory containing those references. The two scenarios involving git-dit fetch those refs for two separate issues.

Each scenario was run 20 times with both the program linked against the feature-branch version of libgit2 ("enhanced") and the merge base ("master", d906a87).

I found slightly improved performance even for the "libgit2" scenario, although the number of references omitted should be quite low (there are no tags and only a few remote references in addition to the two references I had in the repo). For my target scenarios ("gitdit1" and "gitdit2"), I found an improvement by a factor of more than 10.

TL;DR: the patch-set makes globbing refs faster.


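For those interested, here are the numbers I got. The columns are, in order: elapsed ("real") time in clock ticks, user ticks, kernel ticks, and the number of matching references per run: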

"master-libgit2":

30	25	5	2
28	22	4	2
27	22	4	2
28	22	5	2
28	23	4	2
28	23	5	2
28	22	4	2
28	22	5	2
28	22	5	2
28	21	5	2
28	21	5	2
28	22	5	2
28	23	4	2
28	22	5	2
28	21	5	2
27	23	4	2
27	22	5	2
28	22	4	2
28	23	4	2
29	23	5	2

"enhanced-libgit2":

25	23	1	2
22	20	1	2
22	20	1	2
22	20	1	2
23	21	1	2
22	20	1	2
22	20	1	2
22	20	1	2
22	20	1	2
22	20	1	2
22	20	1	2
22	21	0	2
23	21	1	2
22	20	1	2
22	19	3	2
23	21	1	2
22	20	1	2
22	20	1	2
22	20	1	2
23	21	1	2

"master-gitdit1":

110	36	72	6
111	41	69	6
110	39	70	6
110	37	72	6
110	39	69	6
111	41	69	6
111	39	71	6
111	39	72	6
111	41	70	6
115	41	73	6
111	39	71	6
111	38	71	6
110	40	69	6
111	44	66	6
109	36	73	6
110	38	70	6
110	36	73	6
110	37	71	6
109	35	74	6
110	38	71	6

"enhanced-gitdit1":

7	4	3	6
8	2	4	6
7	2	4	6
7	2	4	6
7	2	4	6
8	2	5	6
7	1	5	6
7	2	4	6
7	2	4	6
7	2	5	6
7	3	4	6
7	2	4	6
7	2	4	6
7	3	3	6
8	2	4	6
7	2	4	6
7	2	5	6
8	2	4	6
7	2	5	6
7	2	4	6

"master-gitdit2":

109	44	64	1
107	35	71	1
108	40	67	1
107	35	71	1
106	43	62	1
110	41	68	1
108	37	70	1
107	34	71	1
107	38	67	1
107	37	69	1
108	41	65	1
109	41	67	1
107	37	69	1
107	38	67	1
107	38	68	1
107	36	70	1
106	39	65	1
108	36	71	1
107	38	68	1
108	36	71	1

"enhanced-gitdit2":

2	0	1	1
3	1	1	1
2	1	0	1
2	1	0	1
2	1	1	1
2	1	1	1
2	0	1	1
2	1	1	1
2	1	1	1
2	1	0	1
2	1	0	1
2	1	1	1
3	0	1	1
2	0	1	1
2	0	1	1
2	1	0	1
3	1	0	1
2	0	1	1
2	1	1	1
2	0	1	1
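As a rough cross-check of the "factor of more than 10" claim, the sketch below averages the first column of the two gitdit1 series and divides the means. The sample values are copied from the listings above; the function names are hypothetical, and a plain ratio of means is only one of several ways to summarize such data.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of `n` samples. */
double mean_ticks(const int *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    assert(samples != NULL && n > 0);
    for (i = 0; i < n; ++i)
        sum += samples[i];
    return sum / (double) n;
}

/* First column ("real" ticks) of the gitdit1 runs listed above. */
static const int master_gitdit1[20] = {
    110, 111, 110, 110, 110, 111, 111, 111, 111, 115,
    111, 111, 110, 111, 109, 110, 110, 110, 109, 110,
};
static const int enhanced_gitdit1[20] = {
    7, 8, 7, 7, 7, 8, 7, 7, 7, 7,
    7, 7, 7, 7, 8, 7, 7, 8, 7, 7,
};

/* Ratio of the mean elapsed ticks: master over enhanced. */
double gitdit1_speedup(void)
{
    return mean_ticks(master_gitdit1, 20) / mean_ticks(enhanced_gitdit1, 20);
}
```

For these samples the ratio comes out at roughly 15, which is consistent with the stated factor. With values as small as 7 ticks, the resolution of the tick counter limits how precise such a ratio can be.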

@neithernut changed the title from "Enhance performance of globbing through references for FS-based repos" to "refdb_fs: enhance performance of globbing through references for FS-based repos" on Apr 25, 2018
@neithernut changed the title from "refdb_fs: enhance performance of globbing through references for FS-based repos" to "refdb_fs: enhance performance of globbing" on Apr 25, 2018

@pks-t pks-t left a comment


Thanks for your contribution!

This optimization makes a lot of sense to me; the benefits could be huge when there are a lot of deeply nested references. The code looks good to me, except for two small nits which should be fixed.

src/refdb_fs.c Outdated
}

if ((error = git_buf_printf(&path, "%s/", backend->commonpath)) < 0 ||
(error = git_buf_put(&path, ref_prefix, ref_prefix_len)) ||

The < 0 comparison is missing
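To illustrate why the comparison matters in a chain like this, here is a sketch with hypothetical steps (none of these functions exist in libgit2). It assumes the usual convention that a negative return value signals an error and anything else signals success; whether the missing comparison changes behaviour for git_buf_put specifically is not shown here.

```c
#include <assert.h>

/* Hypothetical steps following the convention that a negative return value
 * signals an error and anything else signals success. */
static int step_ok(void)       { return 0; }
static int step_positive(void) { return 3; }  /* success, e.g. a count */

/* Consistent chain: each step is compared against 0 explicitly. */
int run_checked(void)
{
    int error;

    if ((error = step_ok()) < 0 ||
        (error = step_positive()) < 0)
        return error;
    return 0;
}

/* Chain with a missing "< 0": any non-zero result aborts the chain. */
int run_unchecked(void)
{
    int error;

    assert(step_ok() == 0);

    if ((error = step_ok()) < 0 ||
        (error = step_positive()))
        return error;
    return 0;
}
```

In the second variant a successful, positive result is treated like a failure and returned to the caller, so keeping the comparison on every step avoids depending on each callee returning exactly 0 on success.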

src/refdb_fs.c Outdated
@@ -505,26 +505,53 @@ static int iter_load_loose_paths(refdb_fs_backend *backend, refdb_fs_iter *iter)
git_iterator *fsit = NULL;
git_iterator_options fsit_opts = GIT_ITERATOR_OPTIONS_INIT;
const git_index_entry *entry = NULL;
const char *ref_prefix = GIT_REFS_DIR;
int ref_prefix_len = strlen(ref_prefix);

strlen returns size_t

break;
case '/':
last_sep = pos;
/* FALLTHROUGH */

So we're looking for the first non-literal character here, using the previous literal directories as our base for looking for references. Makes a lot of sense

@neithernut

Those two things should have been obvious to me. Guess I'm too used to auto and paranoid -W-flags by now...

I'm ready for squashing as soon as I get an approval.

Instead of a hardcoded "refs", we may choose a different directory
within the git directory as the root from which we look for references.
A glob used for iteration may start with an entire path containing no
special characters. If we start scanning for references within that path
rather than in `refs/`, we may end up scanning only a small fraction of
all references.
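The path assembly this commit message describes could be sketched as follows. This is a plain-C illustration using snprintf() rather than libgit2's git_buf API, and the function name `build_scan_root` and the example paths are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/*
 * Write "<commonpath>/<first prefix_len bytes of glob>" into `out`.
 * Returns 0 on success, -1 if the result does not fit.
 */
int build_scan_root(char *out, size_t out_size, const char *commonpath,
                    const char *glob, size_t prefix_len)
{
    int n;

    assert(out != NULL && commonpath != NULL && glob != NULL);
    assert(prefix_len <= strlen(glob));

    /* "%.*s" copies at most prefix_len bytes of the glob. */
    n = snprintf(out, out_size, "%s/%.*s", commonpath, (int) prefix_len, glob);
    if (n < 0 || (size_t) n >= out_size)
        return -1;
    return 0;
}
```

With a common path of /repo/.git and the literal prefix refs/heads/ of the glob refs/heads/*, the directory to scan would be /repo/.git/refs/heads/ instead of the whole of /repo/.git/refs/.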
@neithernut

I decided to squash without an additional review, in order to save a round-trip.


pks-t commented Apr 30, 2018

Thanks, looks good to me.

@carlosmn: you've got additional comments?

@pks-t pks-t merged commit 0a19c15 into libgit2:master May 9, 2018

pks-t commented May 9, 2018

Thanks again for your contribution!
