Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Awesome work! 馃槏 #1

Closed
jairbubbles opened this issue Apr 22, 2023 · 16 comments 路 Fixed by #2
Closed

Awesome work! 馃槏 #1

jairbubbles opened this issue Apr 22, 2023 · 16 comments 路 Fixed by #2
Labels
question Further information is requested

Comments

@jairbubbles
Copy link
Contributor

Having a managed library to read .git repository is a plus for the .NET ecosystem.

I quickly looked at how libGit2Sharp is used in our application and I'm wondering if you have in mind to support more things in this library or keep it super minimalistic.

Here is the main struct that we fill when we inspect a repo:

internal struct GitInfos
    {
        public string RemoteUrl;
        public IGitCommit HeadCommit;
        public IGitBranch CurrentBranch;
        public IReadOnlyList<GitBranch> Branches;
        public IGitTag[] Tags;
        public IGitRemote[] Remotes;
        public IGitStash[] Stashes;
        public DateTimeOffset LastCommitDate;
        public int BehindDefaultBranch;
        public IGitBranch DefaultRemoteBranch;
        public IGitOperation CurrentOperation;
}

Your lib seems to support already the main things but is lacking things like remotes or stashes.

Let me know, I'd be much interested to contribute. Cheers!

@kekyo kekyo added the question Further information is requested label Apr 22, 2023
@kekyo
Copy link
Owner

kekyo commented Apr 22, 2023

Thanks your comment!

Interesting information, I wrote GitReader to remove libgit2sharp from RelaxVersioner, but of course for general purpose. From your code, I've tried to map the functionality you want GitReader to have:

  • string RemoteUrl : GitReader does not have remote fetch ability, so I do not plan to make it possible to retrieve URLs at this time. However, it is not difficult to make URL acquisition possible, and I will consider it if necessary.
  • IGitCommit HeadCommit : Already can get it.
  • IGitBranch CurrentBranch : Already can get it.
  • IReadOnlyList<GitBranch> Branches : Already can get it.
  • IGitTag[] Tags : Already can get it.
  • IGitRemote[] Remotes : GitReader does not have remote fetch ability, so I do not plan to make it.
  • IGitStash[] Stashes : Can not get it, but I will consider it positively.
  • DateTimeOffset LastCommitDate : Already can get it from head commit.
  • int BehindDefaultBranch : (?) Required manually calculation.
  • IGitBranch DefaultRemoteBranch : Already can get it from remote branch list.
  • IGitOperation CurrentOperation : (?)

@jairbubbles
Copy link
Contributor Author

Happy to know you have great plans for your lib. We use libGit2Sharp for other things and totally replacing it with GitReader is probably a long shot (our application is basically a full featured git GUI). We use git.exe for commands (including fetch and status) and libGit2Sharp to introspect the repository, build the graph, compute diff between commits, edit remotes...

@kekyo
Copy link
Owner

kekyo commented Apr 22, 2023

Well, I have often wanted to analyze Git commit graphs (I don't do it full time now, but in the past I have held both CI and progress analysis maintainer roles). One of my motivations is that it would be useful to have such a library as an infrastructure that can be easily handled for such purposes ;)

@kekyo kekyo linked a pull request May 18, 2023 that will close this issue
@kekyo
Copy link
Owner

kekyo commented May 18, 2023

@jairbubbles 0.10.0 released.

After the merged, I did some tweaking for consistency. If you have any problems, please throw them here or create a separate issue if appropriate.

@jairbubbles
Copy link
Contributor Author

@kekyo Cool! I have started some work to benchmark GitReader vs LibGit2Sharp. In a nusthell, what I see is that it's faster when using the "primitive" open (but we're not getting all info that LibGit2Sharp is providing) but it's a lot slower when using the "structure" open which is too bad as the data structures are a lot user friendly.

@jairbubbles
Copy link
Contributor Author

Method Mean Error StdDev
GitReader 4.604 ms 0.0415 ms 0.0347 ms
GitReaderStructured 187.221 ms 17.3900 ms 51.0018 ms
LibGit2sharp 20.431 ms 0.4045 ms 0.4496 ms

@kekyo
Copy link
Owner

kekyo commented May 18, 2023

Primitive access has been able to reduce latency more than expected 馃槃

The difficulty is that when we open the repository in Structures interface, it is loading packed indexes, branches, tags, stashes, and many other things...
However, since these are asynchronous operations, it is difficult to say that they cannot be done on demand when accessing Repository.Branches.get() its non-awaitable.

At first, I thought about designing a method like Repository.GetBranchesAsync(), but then there is the problem of what to do with Commit.Branches.get(). There is a way to make everything including these methods awaitable, but that would affect convenience, so I left it out of the basic design of Structures interface.

@jairbubbles
Copy link
Contributor Author

jairbubbles commented May 18, 2023

Yest but in my benchmark I'm also getting the references in the "primitive" mode. My guess is that it's the commit resolving which is taking a lot of time. I'm wondering if a lazy evaluation approach like in LibGit2sharp wouldn't be better.

@jairbubbles
Copy link
Contributor Author

jairbubbles commented May 18, 2023

I quickly test to remove Commit / Tag resolving for each reference and as expected the Primitive / Structured have similar performance:

Method Mean Error StdDev
GitReader 4.562 ms 0.0652 ms 0.0544 ms
GitReaderStructured 4.143 ms 0.0820 ms 0.0976 ms
LibGit2sharp 24.052 ms 0.4013 ms 0.3557 ms

@kekyo
Copy link
Owner

kekyo commented May 19, 2023

For example, even in the Structures interface, we may be able to use the idea to stop reading Branches, Tags, etc. in bulk when they are opened, and instead have them call an asynchronous method that explicitly reads them. Suppose we could control what information to read with FillFlags like the following:

[Flags]
enum FillFlags
{
  None = 0x00,
  Branches = 0x01,
  RemoteBranches = 0x02,
  Tags = 0x04,
  Stashes = 0x08,
  All = 0x0f,
}

// (Defaulted: FillFlags.All)
using var repository = await Repository.Factory.OpenStructureAsync(FillFlags.None);

// All refernces are NOT loaded.
Trace.Assert(repository.Branches.Count == 0);
Trace.Assert(repository.RemoteBranches.Count == 0);
Trace.Assert(repository.Tags.Count == 0);
Trace.Assert(repository.Stashes.Count == 0);

// The commit doesn't fixup any additional informations.
var commit = await repository.GetCommitAsync("....");

Trace.Assert(commit.Branches.Count == 0);
Trace.Assert(commit.RemoteBranches.Count == 0);
Trace.Assert(commit.Tags.Count == 0);

// After delayed but explicitly reading:
await repository.FillImmediateAsync(FillFlags.Branches | FillFlags.Tags);

Trace.Assert(repository.Branches.Count >= 1);
Trace.Assert(repository.Tags.Count >= 1);

// (this may require careful implementation of the process in Commit to make this possible)
Trace.Assert(commit.Branches.Count >= 1);
Trace.Assert(commit.Tags.Count >= 1);

By explicitly calling FillImmediateAsync(), users can control the timing of time-consuming tasks themselves. And by default, everything is read automatically, so the convenience of the current Structures interface is not lost.

@jairbubbles
Copy link
Contributor Author

@kekyo I agree that we need control but reading references is not really slow when they are packed. We also need to control commits / tag resolving, it would be some kind of prefetch option. Do you want to pay the price of resolving right away when you open the repository or when you access objects later on?

For instance if you have 490 packed branches, 10 branches in refs/heads/. The cost would be in the commits resolving as you'll have to resolve 500 commits.

Moreover, I feel like it's mostly useless to resolve all branches or tags, it's unlikely that you need that info for all them, at least for most common scenarios.

As for controlling references retrieval why not exposing directly the methods on the repository?

// Method for each types?
public class Respository
{
  IReadonLyDictionary<string, Branch> GetBranchesAsync(ResolvingFlags ...)
  IReadonLyDictionary<string, Branch> GetRemoteBranchesAsync(ResolvingFlags ...)
  IReadonLyCollection<Stash> GetStashesAsync(ResolvingFlags ...)
  ...
}
// Or more generic?
public class Respository
{
  IReadonLyDictionary<string, Branch> GetReferencesAsync(ReferenceTypes...)
  ...
}

public enum ReferenceTypes
{
  Branch,
  RemoteBranch,
  Stash,
  Tag 
}

If we provide enough control through these methods we wouldn't need structures vs primitives anymore which would make the code a lot simpler / easier to consume and it would cover a lot a different use cases.

@jairbubbles
Copy link
Contributor Author

In my use case, we open the repository to get its info as soon as the file watcher detects a change so we want this to be as fast as possible. I was thinking that it would be interesting to be able to keep the object cache between several repository opening.

// Persistent cache that we would be kept in memory
static ObjectsCache  cache = new ObjectsCache();

// When we refresh we would pass the cache
var repository = Factory.OpenRepository(cache);
var branch = await repository.GetHeadBranchAsync();
var commit = await branch.GetCommitAsync(); // If the commit didn't change we didn't price to look for commits in the disk, it's already in the cache

@kekyo
Copy link
Owner

kekyo commented May 19, 2023

I see, so you are saying that you would eliminate property accesses such as Repository.Branches and switch to an awaitable asynchronous method like Repository.GetBranchesAsync(). I thought about that too, but I figured that making it a method would be a bad debugging experience.

An immediate example is the test result of Verify(model) in GitReader.Tests, which is property-accessible, so the test is easy to write. If access to branches and tags were only possible via asynchronous method calls, we need to write code to retrieve this information every time with asynchoronous method calling.

Since this example is test code for GitReader, it is fine to write labor-intensive asynchronous method call code, but it is easy to imagine that this kind of labor would be required in general use. Since the Structures interface is a high-level interface, I thought it would be desirable to make it easier to use, even at the compromise of performance loss.

(I think it would be better if there was something in between the Structures interface and the Primitive interface, but I also think that having too many options is a problem...)

@jairbubbles
Copy link
Contributor Author

but I figured that making it a method would be a bad debugging experience.

How so? I mean you have one method call for one what you need, it's pretty straight forward.

so the test is easy to write

Well for Verify creating a wrapper class is probably the best approach, it gives you control on what you want to test:

internal class RepositoryWrapper
{
  async Task<RepositoryWrapper> InitAsync(string gitPath)
  {
    var repository = Factory.OpenRepository(cache);
    Branches  = await GetBranchesAsync();
    RemoteBranches  = await GetBranchesAsync();
  }
  
   IReadonlyDictionary<string, Branch> Branches { get; }
   IReadonlyDictionary<string, Branch> RemoteBranches { get; }
   ...
}

I feel like the high level interface should:

  • expose user friendly classes like Branch, Tag and so on
  • as performant as possible 馃槏

But having a low level is also super interesting for more advanced scenarios but I would expose things like:

  • ReadPackedRefs
  • ReadReferences
  • ReadGitConfig
  • ....

It wouldn't expose a class Repository for that API it could be only static methods that takes a .git path.

@jairbubbles
Copy link
Contributor Author

In #3 I'm not resolving anymore the commits, it's still slower because of the tags resolving (I have many in the repo I'm benchmarking). I see an optimisation by treating the info about peeled tags in packed-refs, it's currently ignored.

Method Mean Error StdDev
GitReader 3.833 ms 0.0397 ms 0.0310 ms
GitReaderStructured 97.224 ms 6.3351 ms 18.6791 ms
LibGit2sharp 19.009 ms 0.2665 ms 0.2492 ms

@kekyo kekyo mentioned this issue May 19, 2023
@kekyo
Copy link
Owner

kekyo commented Jun 25, 2023

Thank you again your suggestions, GitReader reached 1.0.0!

This issue is closed, please open new issue when you want to.

@kekyo kekyo closed this as completed Jun 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants