Rugged::Walker vs git log performance #343

LightGuard · 2014-03-26T17:28:36Z

I realize Rugged probably won't be as fast as git log, but currently the performance for the log on a particular file is horrible. Compare on your machine, for me (in a ruby benchmark 1000 times) I'm seeing differences between code to be double the time in rugged as shelling out to git log:

require 'rugged'
require 'yaml'
require 'open3'
require 'benchmark'

class GitCommands
  def commit_info(repo_root, file_path)
    o, _ = Open3.capture2(%Q[git --git-dir=#{repo_root}/.git log --date=short --format='- { !ruby/symbol author: %an, !ruby/symbol date: %ad, !ruby/symbol hash: %h, !ruby/symbol subject: %f }' -- #{file_path.to_s.sub(/#{repo_root}\//, '')}])
    YAML.parse o
  end

  def commit_info_rugged(repo_root, file_path)
    repo = Rugged::Repository.new(repo_root)
    walker = Rugged::Walker.new(repo)
    walker.sorting(Rugged::SORT_DATE)
    walker.push(repo.last_commit)
    walker.inject([]) do |a, c|
      if (c.diff(paths: [file_path.to_s.sub(/#{repo_root}\//, '')]).size > 0)
        a << {author: c.author, date: c.time, hash: c.oid, subject: c.message.split("\n").first}
      end
      a
    end
  end
end

git = GitCommands.new
repo = '';
file = ''
amount = 1000

Benchmark.bmbm(22) do |x|
  x.report('rugged commit_info') { amount.times do ; git.commit_info_rugged(repo, file) end }

  x.report('git commit_info') { amount.times do ; git.commit_info(repo, file) end }
end

obviously replace repo and file :)

Numbers from my machine:

Rehearsal ----------------------------------------------------------
rugged commit_info      23.840000   2.550000  26.390000 ( 26.793017)
git commit_info          1.680000   1.290000  15.260000 ( 15.066757)
------------------------------------------------ total: 41.650000sec

                             user     system      total        real
rugged commit_info      23.890000   2.580000  26.470000 ( 26.858275)
git commit_info          1.700000   1.340000  16.260000 ( 15.967883)

The text was updated successfully, but these errors were encountered:

arthurschreiber · 2014-03-26T17:44:12Z

The numbers on my machine are even worse...

Rehearsal ----------------------------------------------------------
rugged commit_info      12.340000   1.700000  14.040000 ( 14.384049)
git commit_info          0.080000   0.110000   2.080000 (  2.236715)
------------------------------------------------ total: 16.120000sec

                             user     system      total        real
rugged commit_info      12.580000   1.730000  14.310000 ( 14.950873)
git commit_info          0.060000   0.100000   2.040000 (  2.205928)

This is multiple orders of magnitudes slower.

@cmn @arrbee Any idea what's going on? My first guess was that the #diff call is wasting a lot of time, but even if I remove that, the numbers are not really that much better:

Rehearsal ----------------------------------------------------------
rugged commit_info       3.210000   0.750000   3.960000 (  4.108511)
git commit_info          0.080000   0.110000   2.020000 (  2.135784)
------------------------------------------------- total: 5.980000sec

                             user     system      total        real
rugged commit_info       3.150000   0.730000   3.880000 (  3.966684)
git commit_info          0.050000   0.110000   2.000000 (  2.139848)

EDIT: I was running the tests against the Rugged repo and the Gemfile.

LightGuard · 2014-03-26T17:50:40Z

Removing the diff looks like it does make a difference, but it also removes filtering for a specific file, though I'm guessing that's a side symptom of the root cause.

arthurschreiber · 2014-03-26T17:59:18Z

Gah, I was looking at the wrong numbers. So, without the diff it's "only" around twice as slow, instead of seven times as slow.

LightGuard · 2014-03-26T18:04:55Z

I'd doubt filtering would be that slow.

vmg · 2014-03-26T18:09:47Z

Actually, it is. As @carlosmn will probably point out to you in a snarky tone as soon as he wakes up, the two code snippets are not doing the same thing at all. Not even close. The git log call can optimize away the path filtering in a way that Rugged cannot (because you're actually doing the full walk, and then filtering based on path, instead of limiting the amount of walking you have to do beforehand).

Do you have plans to implement path limiting in the libgit2 walker, @carlosmn?

LightGuard · 2014-03-26T18:15:27Z

Ah. That does make sense. Path limiting would be a great feature. If you remove the filter from both versions you see a much different result:

Rehearsal ----------------------------------------------------------
rugged commit_info       7.640000   1.340000   8.980000 (  9.114467)
git commit_info         11.370000   1.900000  27.650000 ( 26.525471)
------------------------------------------------ total: 36.630000sec

                             user     system      total        real
rugged commit_info       7.420000   1.370000   8.790000 (  8.930470)
git commit_info         11.250000   1.730000  27.470000 ( 26.313850)

LightGuard · 2014-03-26T18:17:59Z

In that case, is there a simple way of filtering w/o doing the tree walk on the commit?

vmg · 2014-03-26T18:18:10Z

That's more in line with what we'd expect. phew

Let's see which of our friendly code monkeys can I whip to get path limiting implemented. I'm afraid we don't use that specific feature in GitHub (yet) so there hasn't been a pressing need to implement it.

Thanks for the report, though! I'll update this once we're Fast (TM).

vmg · 2014-03-26T18:19:03Z

In that case, is there a simple way of filtering w/o doing the tree walk on the commit?

The simple way is the one you posted in your original snippet, but it's simple and slow. The complex and fast way is limiting during the walk itself, but that needs to be implemented at the libgit2 level.

LightGuard · 2014-03-26T18:21:00Z

Okay, thanks.

carlosmn · 2014-03-26T18:37:14Z

I'll have you know I've been awake all day, my good sir (but I've spent most of it fighting with the python memory allocator, so I'm not sure it counts).

I'm not too familiar with the optimisations that #diff is able to perform, but you don't need a diff, just a straight comparison of two the hash of two tree entries (in the code you also keep calling #to_s and #sub which don't help performance)

The issue with doing the equivalent of git log -- path is not finding out which commits we should skip (which is fairly easy) but the history simplification itself; that is, what to do with the commits we don't want to skip. When you ask the revwalk/log/rev-list to only consider a path (or a set of paths), we need to change what a commit considers to be its parents. We don't currently have any way pass this information to the caller. I haven't tacked this yet, as I haven't managed to figure out how to do this yet. We would also need to keep a commit in the revwalk until we reach the next one, so that we know which commit should be its rewritten parent, which is something else that we don't have any way of doing either.

Long story short, I do plan to implement history simplification at some point, but I haven't figured out what the function signature would even be, so there hasn't been anything written yet (this is why I did implement git_revwalk_simplify_first_parent(), as that doesn't need any parent replacements).

carlosmn · 2014-03-27T04:04:57Z

FWIW, I took a stab at using the tree entries directly instead of relying on diff to know to optimise the actual diff out. I'm testing with grabbing stuff out of libgit2's repo, and limiting the path to src/repository.c. I've reduced the repetitions to 10 because I'm impatient, so there is some variation in the times, but not too much. First, the results with the "vanilla" version of the script:

% be ruby bench_orig.rb 
Rehearsal ----------------------------------------------------------
rugged commit_info       8.440000   0.490000   8.930000 (  9.019762)
git commit_info          0.210000   0.090000   1.320000 (  1.311925)
------------------------------------------------ total: 10.250000sec

                             user     system      total        real
rugged commit_info       8.400000   0.480000   8.880000 (  8.912238)
git commit_info          0.150000   0.060000   1.250000 (  1.226189)

and with my changes

% be ruby bench.rb
Rehearsal ----------------------------------------------------------
rugged commit_info       1.960000   0.330000   2.290000 (  2.291742)
git commit_info          0.160000   0.060000   1.240000 (  1.250153)
------------------------------------------------- total: 3.530000sec

                             user     system      total        real
rugged commit_info       2.070000   0.260000   2.330000 (  2.339170)
git commit_info          0.110000   0.060000   1.180000 (  1.191480)

It's still slower, as expected, but we avoid a lot of the cost that #diff has in exchange for the convenience (and the repeated sub) which makes quite a difference. This is with rugged out of rubygems, which contains a nasty bug in the revwalk that walks over the entire graph once before doing any work, so some of this time is fixed in dev, but I get an odd error when I try to use rugged from dev at the moment.

This bug means that we waste a bunch of time not doing anything before doing any of the work, and the amount is directly proportional to the size of the history. In a small repository (carlosmn/git-httpd, 26 commits), the timings are 1.1s for rugged vs 11.2s for git (this time again with 1000 repetitions).

EDIT: got the rugged version from dev, and it's not any faster; looks like the case of the smaller repository is simply the git setup overhead that's taking us over the hump.

My changes to the script:

diff -u bench_orig.rb bench.rb 
--- bench_orig.rb       2014-03-27 04:50:46.788440621 +0100
+++ bench.rb    2014-03-27 04:30:35.556264199 +0100
@@ -9,13 +9,38 @@
     YAML.parse o
   end

+  def entry_changed?(commit, path)
+    parent = commit.parents[0]
+    entry = commit.tree[path]
+
+    # if at a root commit, consider it changed if we have this file;
+    # i.e. if we added it in the initial commit
+    if not parent
+      return entry != nil
+    end
+
+    parent_entry = parent.tree[path]
+
+    # does exist in either, no change
+    if not entry and not parent_entry
+      false
+    # only in one of them, change
+    elsif not entry or not parent_entry then
+      true
+    # otherwise it's changed if their ids arent' the same
+    else
+      entry[:oid] != parent_entry[:oid]
+    end
+  end
+
   def commit_info_rugged(repo_root, file_path)
     repo = Rugged::Repository.new(repo_root)
+    path = file_path.to_s.sub(/#{repo_root}\//, '')
     walker = Rugged::Walker.new(repo)
     walker.sorting(Rugged::SORT_DATE)
     walker.push(repo.last_commit)
     walker.inject([]) do |a, c|
-      if (c.diff(paths: [file_path.to_s.sub(/#{repo_root}\//, '')]).size > 0)
+      if entry_changed? c, path
         a << {author: c.author, date: c.time, hash: c.oid, subject: c.message.split("\n").first}
       end
       a

connorshea · 2016-04-25T01:55:31Z

Was this ever fixed?

carlosmn · 2016-04-25T08:17:19Z

There isn't really anything to fix. The revwalk and git-log with path simplification are different beasts. When you simplify by path, you use the diff information to tell you down which sub-histories you don't need to go down.

Implementing git log -- some/path via the revwalk is going to be slow and can give you different results, since git-log avoids going down merges when a particular file is the same in the merge commit, regardless of whether it was changed in the merged history, which the revwalk will not do.

There's libgit2/libgit2#3041tracking the features a git_log structure would have, but so far nobody has cared enough to build it. I'm going to close this since it' snot actually a bug that these two different things behave differently.

arthurschreiber added performance labels Apr 23, 2014

jiripospisil mentioned this issue Apr 26, 2014

Build libgit2 with optimizations #364

Merged

carlosmn mentioned this issue Jun 19, 2014

Add Rugged::Walker#each_oid with :limit and :offset options #386

Merged

mr-vinn mentioned this issue Oct 3, 2014

Improve Repository#branches_contains performance gitlabhq/gitlab_git#48

Merged

SoftwareDev mentioned this issue Nov 6, 2014

Potential Complexity Issue with finding last modified time of Git Files MarkUsProject/Markus#1866

Closed

arthurschreiber mentioned this issue Oct 5, 2015

Q: How to walk the history of a single file with rugged? #529

Open

matthiasbeyer mentioned this issue Oct 5, 2015

[WIP] Add method to get history of a specific file #531

Closed

carlosmn closed this as completed Apr 25, 2016

mwobst mentioned this issue Mar 3, 2023

Use the git class of the ruby git project to improve performance ytti/oxidized#2724

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rugged::Walker vs git log performance #343

Rugged::Walker vs git log performance #343

LightGuard commented Mar 26, 2014

arthurschreiber commented Mar 26, 2014

LightGuard commented Mar 26, 2014

arthurschreiber commented Mar 26, 2014

LightGuard commented Mar 26, 2014

vmg commented Mar 26, 2014

LightGuard commented Mar 26, 2014

LightGuard commented Mar 26, 2014

vmg commented Mar 26, 2014

vmg commented Mar 26, 2014

LightGuard commented Mar 26, 2014

carlosmn commented Mar 26, 2014

carlosmn commented Mar 27, 2014

connorshea commented Apr 25, 2016

carlosmn commented Apr 25, 2016

Rugged::Walker vs git log performance #343

Rugged::Walker vs git log performance #343

Comments

LightGuard commented Mar 26, 2014

arthurschreiber commented Mar 26, 2014

LightGuard commented Mar 26, 2014

arthurschreiber commented Mar 26, 2014

LightGuard commented Mar 26, 2014

vmg commented Mar 26, 2014

LightGuard commented Mar 26, 2014

LightGuard commented Mar 26, 2014

vmg commented Mar 26, 2014

vmg commented Mar 26, 2014

LightGuard commented Mar 26, 2014

carlosmn commented Mar 26, 2014

carlosmn commented Mar 27, 2014

connorshea commented Apr 25, 2016

carlosmn commented Apr 25, 2016