Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression when reading files from the tree in parallel #2076

Open
ggrossetie opened this issue Feb 15, 2024 · 6 comments
Open

Comments

@ggrossetie
Copy link

ggrossetie commented Feb 15, 2024

Reproduction steps

  1. Add the following test cases in CommitFixture.cs:
[Fact]
public void CanReadCommit()
{
    var timer = new Stopwatch();
    var fileContents = new ConcurrentStack<string>();
    var path = SandboxStandardTestRepo();
    using (var repo = new Repository(path))
    {
        var latestCommit = repo.Head.Tip;
        var tree = latestCommit.Tree;
        timer.Start();
        for (var i = 0; i < 100000; i++)
        {
            fileContents.Push(ReadEntry("1.txt", tree));
            fileContents.Push(ReadEntry("README", tree));
            fileContents.Push(ReadEntry("new.txt", tree));
        }
        timer.Stop();
        testOutputHelper.WriteLine($"Took: {timer.ElapsedMilliseconds.ToString()}ms");
        Assert.Equal(300000, fileContents.Count);
    }
}

[Fact]
public void CanReadCommitParallel()
{
    var timer = new Stopwatch();
    var fileContents = new ConcurrentStack<string>();
    var path = SandboxStandardTestRepo();
    using (var repo = new Repository(path))
    {
        var latestCommit = repo.Head.Tip;
        var tree = latestCommit.Tree;
        timer.Start();
        var fileNames = new List<string>() {"1.txt", "README", "new.txt"}.AsEnumerable();
        Parallel.ForEach(fileNames, (fileName) =>
        {
            for (var i = 0; i < 100000; i++)
            {
                fileContents.Push(ReadEntry(fileName, tree));
            }
        });
        timer.Stop();
        testOutputHelper.WriteLine($"Took: {timer.ElapsedMilliseconds.ToString()}ms");
        Assert.Equal(300000, fileContents.Count);
    }
}

private static string ReadEntry(string name, Tree tree)
{
    var treeEntry = tree[name];
    if (treeEntry != null && treeEntry.Target is Blob blob)
    {
        return blob.GetContentText();
    }

    throw new InvalidOperationException($"{name} must be a Blob");
}
  1. Run against version 0.27.0-preview-0119 (commit: 6329bea). On my machine, CanReadCommitParallel takes 1417ms and CanReadCommit takes 2971ms (which is fine)
  2. Checkout the latest version or the latest release
  3. Run the test again. On my machine, CanReadCommitParallel takes 3731ms and CanReadCommit takes 3250ms!

Expected behavior

Reading files in parallel should be faster than reading files sequentially from the git tree.

Actual behavior

It seems that reading files from the tree in parallel (multi-thread) is not faster. I did a git bisect and it seems that this regression was introduced in 21d4f13

Version of LibGit2Sharp (release number or SHA1)

Versions after 21d4f13

Operating system(s) tested; .NET runtime tested

.NET 6 and .NET 7.

@ggrossetie
Copy link
Author

I did another git bisect (while making sure that bin was deleted) and I can reproduce it with 21d4f13 where libgit2 was updated to 1.2.0

@ethomson
Copy link
Member

That was two and a half years ago. A lot has changed since then. It's going to be tricky to isolate the problem.

One question to start - what's the perf if you have one repository per thread instead of sharing the repository across threads?

@ggrossetie
Copy link
Author

Thanks for your reply!

One question to start - what's the perf if you have one repository per thread instead of sharing the repository across threads?

You mean one Repository object (using the same path) per thread? I can try that 👍🏻
Also, I did try to implement a similar benchmark test in libgit2 but my skills in C are insufficient 😅

@ggrossetie
Copy link
Author

ggrossetie commented Feb 19, 2024

One question to start - what's the perf if you have one repository per thread instead of sharing the repository across threads?

Spot on! Using one Repository instance per thread does the trick 🙌🏻
Should we document how to properly use libgit2(sharp) in a multi-thread environment? Or maybe it's possible to (re)enable concurrent reads on a single Repository instance?

@ethomson
Copy link
Member

Right — really good question. I was mostly hoping to stem the bleeding and get you to a performant situation. Digging in to the why it's slow would be really interesting.

https://github.com/libgit2/libgit2/blob/main/docs/threading.md#sharing-objects has a bit of a discussion about libgit2's threading policy (which LibGit2Sharp should also document as it should be identical — I don't think that there's anything in LibGit2Sharp that makes threading any different).

Without actually doing a serious investigation: probably you're hitting a lock on an object cache. Why that got worse, I don't know, and we should 👀 that. But the most performant way to do what you're doing is multiple Repository instances (although there will be a bit of a memory usage increase, since you won't have a cache shared across a single repository).

@ggrossetie
Copy link
Author

Thanks again for your insight.

libgit2/libgit2@main/docs/threading.md#sharing-objects has a bit of a discussion about libgit2's threading policy (which LibGit2Sharp should also document as it should be identical

👍🏻

I don't think that there's anything in LibGit2Sharp that makes threading any different
Without actually doing a serious investigation: probably you're hitting a lock on an object cache

Should we move this issue to https://github.com/libgit2/libgit2?

But the most performant way to do what you're doing is multiple Repository instances (although there will be a bit of a memory usage increase, since you won't have a cache shared across a single repository).

Alright!
I wasn't sure that it was indeed the most performant solution. I will update my code base accordingly. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants