Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rugged::Walker vs git log performance #343

Closed
LightGuard opened this issue Mar 26, 2014 · 14 comments
Closed

Rugged::Walker vs git log performance #343

LightGuard opened this issue Mar 26, 2014 · 14 comments

Comments

@LightGuard
Copy link

I realize Rugged probably won't be as fast as git log, but currently the performance for the log on a particular file is horrible. Compare on your machine, for me (in a ruby benchmark 1000 times) I'm seeing differences between code to be double the time in rugged as shelling out to git log:

require 'rugged'
require 'yaml'
require 'open3'
require 'benchmark'

class GitCommands
  def commit_info(repo_root, file_path)
    o, _ = Open3.capture2(%Q[git --git-dir=#{repo_root}/.git log --date=short --format='- { !ruby/symbol author: %an, !ruby/symbol date: %ad, !ruby/symbol hash: %h, !ruby/symbol subject: %f }' -- #{file_path.to_s.sub(/#{repo_root}\//, '')}])
    YAML.parse o
  end

  def commit_info_rugged(repo_root, file_path)
    repo = Rugged::Repository.new(repo_root)
    walker = Rugged::Walker.new(repo)
    walker.sorting(Rugged::SORT_DATE)
    walker.push(repo.last_commit)
    walker.inject([]) do |a, c|
      if (c.diff(paths: [file_path.to_s.sub(/#{repo_root}\//, '')]).size > 0)
        a << {author: c.author, date: c.time, hash: c.oid, subject: c.message.split("\n").first}
      end
      a
    end
  end
end

git = GitCommands.new
repo = '';
file = ''
amount = 1000

Benchmark.bmbm(22) do |x|
  x.report('rugged commit_info') { amount.times do ; git.commit_info_rugged(repo, file) end }

  x.report('git commit_info') { amount.times do ; git.commit_info(repo, file) end }
end

obviously replace repo and file :)

Numbers from my machine:

Rehearsal ----------------------------------------------------------
rugged commit_info      23.840000   2.550000  26.390000 ( 26.793017)
git commit_info          1.680000   1.290000  15.260000 ( 15.066757)
------------------------------------------------ total: 41.650000sec

                             user     system      total        real
rugged commit_info      23.890000   2.580000  26.470000 ( 26.858275)
git commit_info          1.700000   1.340000  16.260000 ( 15.967883)
@arthurschreiber
Copy link
Member

The numbers on my machine are even worse...

Rehearsal ----------------------------------------------------------
rugged commit_info      12.340000   1.700000  14.040000 ( 14.384049)
git commit_info          0.080000   0.110000   2.080000 (  2.236715)
------------------------------------------------ total: 16.120000sec

                             user     system      total        real
rugged commit_info      12.580000   1.730000  14.310000 ( 14.950873)
git commit_info          0.060000   0.100000   2.040000 (  2.205928)

This is multiple orders of magnitudes slower.

@cmn @arrbee Any idea what's going on? My first guess was that the #diff call is wasting a lot of time, but even if I remove that, the numbers are not really that much better:

Rehearsal ----------------------------------------------------------
rugged commit_info       3.210000   0.750000   3.960000 (  4.108511)
git commit_info          0.080000   0.110000   2.020000 (  2.135784)
------------------------------------------------- total: 5.980000sec

                             user     system      total        real
rugged commit_info       3.150000   0.730000   3.880000 (  3.966684)
git commit_info          0.050000   0.110000   2.000000 (  2.139848)

EDIT: I was running the tests against the Rugged repo and the Gemfile.

@LightGuard
Copy link
Author

Removing the diff looks like it does make a difference, but it also removes filtering for a specific file, though I'm guessing that's a side symptom of the root cause.

@arthurschreiber
Copy link
Member

Gah, I was looking at the wrong numbers. So, without the diff it's "only" around twice as slow, instead of seven times as slow.

@LightGuard
Copy link
Author

I'd doubt filtering would be that slow.

@vmg
Copy link
Member

vmg commented Mar 26, 2014

Actually, it is. As @carlosmn will probably point out to you in a snarky tone as soon as he wakes up, the two code snippets are not doing the same thing at all. Not even close. The git log call can optimize away the path filtering in a way that Rugged cannot (because you're actually doing the full walk, and then filtering based on path, instead of limiting the amount of walking you have to do beforehand).

Do you have plans to implement path limiting in the libgit2 walker, @carlosmn?

@LightGuard
Copy link
Author

Ah. That does make sense. Path limiting would be a great feature. If you remove the filter from both versions you see a much different result:

Rehearsal ----------------------------------------------------------
rugged commit_info       7.640000   1.340000   8.980000 (  9.114467)
git commit_info         11.370000   1.900000  27.650000 ( 26.525471)
------------------------------------------------ total: 36.630000sec

                             user     system      total        real
rugged commit_info       7.420000   1.370000   8.790000 (  8.930470)
git commit_info         11.250000   1.730000  27.470000 ( 26.313850)

@LightGuard
Copy link
Author

In that case, is there a simple way of filtering w/o doing the tree walk on the commit?

@vmg
Copy link
Member

vmg commented Mar 26, 2014

That's more in line with what we'd expect. phew

Let's see which of our friendly code monkeys can I whip to get path limiting implemented. I'm afraid we don't use that specific feature in GitHub (yet) so there hasn't been a pressing need to implement it.

Thanks for the report, though! I'll update this once we're Fast (TM).

@vmg
Copy link
Member

vmg commented Mar 26, 2014

In that case, is there a simple way of filtering w/o doing the tree walk on the commit?

The simple way is the one you posted in your original snippet, but it's simple and slow. The complex and fast way is limiting during the walk itself, but that needs to be implemented at the libgit2 level.

@LightGuard
Copy link
Author

Okay, thanks.

@carlosmn
Copy link
Member

I'll have you know I've been awake all day, my good sir (but I've spent most of it fighting with the python memory allocator, so I'm not sure it counts).

I'm not too familiar with the optimisations that #diff is able to perform, but you don't need a diff, just a straight comparison of two the hash of two tree entries (in the code you also keep calling #to_s and #sub which don't help performance)

The issue with doing the equivalent of git log -- path is not finding out which commits we should skip (which is fairly easy) but the history simplification itself; that is, what to do with the commits we don't want to skip. When you ask the revwalk/log/rev-list to only consider a path (or a set of paths), we need to change what a commit considers to be its parents. We don't currently have any way pass this information to the caller. I haven't tacked this yet, as I haven't managed to figure out how to do this yet. We would also need to keep a commit in the revwalk until we reach the next one, so that we know which commit should be its rewritten parent, which is something else that we don't have any way of doing either.

Long story short, I do plan to implement history simplification at some point, but I haven't figured out what the function signature would even be, so there hasn't been anything written yet (this is why I did implement git_revwalk_simplify_first_parent(), as that doesn't need any parent replacements).

@carlosmn
Copy link
Member

FWIW, I took a stab at using the tree entries directly instead of relying on diff to know to optimise the actual diff out. I'm testing with grabbing stuff out of libgit2's repo, and limiting the path to src/repository.c. I've reduced the repetitions to 10 because I'm impatient, so there is some variation in the times, but not too much. First, the results with the "vanilla" version of the script:

% be ruby bench_orig.rb 
Rehearsal ----------------------------------------------------------
rugged commit_info       8.440000   0.490000   8.930000 (  9.019762)
git commit_info          0.210000   0.090000   1.320000 (  1.311925)
------------------------------------------------ total: 10.250000sec

                             user     system      total        real
rugged commit_info       8.400000   0.480000   8.880000 (  8.912238)
git commit_info          0.150000   0.060000   1.250000 (  1.226189)

and with my changes

% be ruby bench.rb
Rehearsal ----------------------------------------------------------
rugged commit_info       1.960000   0.330000   2.290000 (  2.291742)
git commit_info          0.160000   0.060000   1.240000 (  1.250153)
------------------------------------------------- total: 3.530000sec

                             user     system      total        real
rugged commit_info       2.070000   0.260000   2.330000 (  2.339170)
git commit_info          0.110000   0.060000   1.180000 (  1.191480)

It's still slower, as expected, but we avoid a lot of the cost that #diff has in exchange for the convenience (and the repeated sub) which makes quite a difference. This is with rugged out of rubygems, which contains a nasty bug in the revwalk that walks over the entire graph once before doing any work, so some of this time is fixed in dev, but I get an odd error when I try to use rugged from dev at the moment.

This bug means that we waste a bunch of time not doing anything before doing any of the work, and the amount is directly proportional to the size of the history. In a small repository (carlosmn/git-httpd, 26 commits), the timings are 1.1s for rugged vs 11.2s for git (this time again with 1000 repetitions).

EDIT: got the rugged version from dev, and it's not any faster; looks like the case of the smaller repository is simply the git setup overhead that's taking us over the hump.

My changes to the script:

diff -u bench_orig.rb bench.rb 
--- bench_orig.rb       2014-03-27 04:50:46.788440621 +0100
+++ bench.rb    2014-03-27 04:30:35.556264199 +0100
@@ -9,13 +9,38 @@
     YAML.parse o
   end

+  def entry_changed?(commit, path)
+    parent = commit.parents[0]
+    entry = commit.tree[path]
+
+    # if at a root commit, consider it changed if we have this file;
+    # i.e. if we added it in the initial commit
+    if not parent
+      return entry != nil
+    end
+
+    parent_entry = parent.tree[path]
+
+    # does exist in either, no change
+    if not entry and not parent_entry
+      false
+    # only in one of them, change
+    elsif not entry or not parent_entry then
+      true
+    # otherwise it's changed if their ids arent' the same
+    else
+      entry[:oid] != parent_entry[:oid]
+    end
+  end
+
   def commit_info_rugged(repo_root, file_path)
     repo = Rugged::Repository.new(repo_root)
+    path = file_path.to_s.sub(/#{repo_root}\//, '')
     walker = Rugged::Walker.new(repo)
     walker.sorting(Rugged::SORT_DATE)
     walker.push(repo.last_commit)
     walker.inject([]) do |a, c|
-      if (c.diff(paths: [file_path.to_s.sub(/#{repo_root}\//, '')]).size > 0)
+      if entry_changed? c, path
         a << {author: c.author, date: c.time, hash: c.oid, subject: c.message.split("\n").first}
       end
       a

@connorshea
Copy link
Contributor

Was this ever fixed?

@carlosmn
Copy link
Member

There isn't really anything to fix. The revwalk and git-log with path simplification are different beasts. When you simplify by path, you use the diff information to tell you down which sub-histories you don't need to go down.

Implementing git log -- some/path via the revwalk is going to be slow and can give you different results, since git-log avoids going down merges when a particular file is the same in the merge commit, regardless of whether it was changed in the merged history, which the revwalk will not do.

There's libgit2/libgit2#3041tracking the features a git_log structure would have, but so far nobody has cared enough to build it. I'm going to close this since it' snot actually a bug that these two different things behave differently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants