Rugged::Walker vs git log performance #343
Comments
The numbers on my machine are even worse...
This is multiple orders of magnitude slower. @cmn @arrbee Any idea what's going on? My first guess was that the
EDIT: I was running the tests against the Rugged repo and the |
Removing the diff looks like it does make a difference, but it also removes filtering for a specific file, though I'm guessing that's a side symptom of the root cause. |
Gah, I was looking at the wrong numbers. So, without the diff it's "only" around twice as slow, instead of seven times as slow. |
I'd doubt filtering would be that slow. |
Actually, it is. As @carlosmn will probably point out to you in a snarky tone as soon as he wakes up, the two code snippets are not doing the same thing at all. Not even close. The Do you have plans to implement path limiting in the libgit2 walker, @carlosmn? |
Ah. That does make sense. Path limiting would be a great feature. If you remove the filter from both versions you see a much different result:
|
In that case, is there a simple way of filtering w/o doing the tree walk on the commit? |
That's more in line with what we'd expect. Phew. Let's see which of our friendly code monkeys I can whip into implementing path limiting. I'm afraid we don't use that specific feature in GitHub (yet), so there hasn't been a pressing need to implement it. Thanks for the report, though! I'll update this once we're Fast (TM). |
The simple way is the one you posted in your original snippet, but it's simple and slow. The complex and fast way is limiting during the walk itself, but that needs to be implemented at the libgit2 level. |
Okay, thanks. |
I'll have you know I've been awake all day, my good sir (but I've spent most of it fighting with the python memory allocator, so I'm not sure it counts). I'm not too familiar with the optimisations that The issue with doing the equivalent of Long story short, I do plan to implement history simplification at some point, but I haven't figured out what the function signature would even be, so there hasn't been anything written yet (this is why I did implement |
FWIW, I took a stab at using the tree entries directly instead of relying on diff to know to optimise the actual diff out. I'm testing with grabbing stuff out of libgit2's repo, and limiting the path to
and with my changes
It's still slower, as expected, but we avoid a lot of the cost that This bug means that we waste a bunch of time not doing anything before doing any of the work, and the amount is directly proportional to the size of the history. In a small repository (carlosmn/git-httpd, 26 commits), the timings are 1.1s for rugged vs 11.2s for git (this time again with 1000 repetitions).
EDIT: got the rugged version from dev, and it's not any faster; looks like the case of the smaller repository is simply the git setup overhead that's taking us over the hump.
My changes to the script:
diff -u bench_orig.rb bench.rb
--- bench_orig.rb 2014-03-27 04:50:46.788440621 +0100
+++ bench.rb 2014-03-27 04:30:35.556264199 +0100
@@ -9,13 +9,38 @@
   YAML.parse o
 end
+def entry_changed?(commit, path)
+  parent = commit.parents[0]
+  entry = commit.tree[path]
+
+  # if at a root commit, consider it changed if we have this file;
+  # i.e. if we added it in the initial commit
+  if not parent
+    return entry != nil
+  end
+
+  parent_entry = parent.tree[path]
+
+  # doesn't exist in either, no change
+  if not entry and not parent_entry
+    false
+  # only in one of them, change
+  elsif not entry or not parent_entry then
+    true
+  # otherwise it's changed if their ids aren't the same
+  else
+    entry[:oid] != parent_entry[:oid]
+  end
+end
+
 def commit_info_rugged(repo_root, file_path)
   repo = Rugged::Repository.new(repo_root)
+  path = file_path.to_s.sub(/#{repo_root}\//, '')
   walker = Rugged::Walker.new(repo)
   walker.sorting(Rugged::SORT_DATE)
   walker.push(repo.last_commit)
   walker.inject([]) do |a, c|
-    if (c.diff(paths: [file_path.to_s.sub(/#{repo_root}\//, '')]).size > 0)
+    if entry_changed? c, path
       a << {author: c.author, date: c.time, hash: c.oid, subject: c.message.split("\n").first}
     end
     a |
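For anyone who wants to poke at the tree-entry comparison from the diff above without a repository at hand, here is a minimal sketch of the same idea. It does not use Rugged at all: `FakeCommit` is a hypothetical stand-in, and a tree is modelled as a plain Hash of path => `{ oid: ... }`, which is only an assumption about the shape of the data, chosen to mirror the `entry[:oid]` lookups in the diff.

```ruby
# Hypothetical stand-in for a commit; NOT a Rugged class.
# `tree` is a Hash of path => { oid: "..." }, `parents` is an Array of FakeCommit.
FakeCommit = Struct.new(:tree, :parents)

# True if `path` differs between `commit` and its first parent.
# Like the diff above, this only ever looks at the first parent.
def entry_changed?(commit, path)
  entry  = commit.tree[path]
  parent = commit.parents.first

  # Root commit: the path counts as changed only if this commit introduces it.
  return !entry.nil? if parent.nil?

  parent_entry = parent.tree[path]

  # Present on exactly one side: the path was added or removed.
  return true if entry.nil? != parent_entry.nil?
  # Absent on both sides: nothing to compare.
  return false if entry.nil?

  # Present on both sides: changed only if the ids differ.
  entry[:oid] != parent_entry[:oid]
end

root  = FakeCommit.new({ "README" => { oid: "a1" } }, [])
edit  = FakeCommit.new({ "README" => { oid: "b2" } }, [root])
other = FakeCommit.new({ "README" => { oid: "b2" }, "lib.rb" => { oid: "c3" } }, [edit])

entry_changed?(root,  "README")  # => true  (added in the root commit)
entry_changed?(edit,  "README")  # => true  (id changed)
entry_changed?(other, "README")  # => false (same id as in the parent)
```

The point of comparing ids instead of running a diff is that two lookups and one comparison per commit are enough to answer "did this path change?", so no diff machinery has to be set up for commits that never touched the file.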
Was this ever fixed? |
There isn't really anything to fix. The revwalk and git-log with path simplification are different beasts. When you simplify by path, you use the diff information to tell you which sub-histories you don't need to go down. Implementing There's libgit2/libgit2#3041 tracking the features a |
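To make the "different beasts" point concrete, here is a toy sketch of path-based pruning on a hand-built history. It is not libgit2 or Rugged code: `Commit`, `treesame?` and `simplified_log` are hypothetical names, commits are plain Structs, and the rule used (if a commit matches some parent for the path, follow only that parent) is a simplified reading of git's default history simplification, not a full implementation of it.

```ruby
# Hypothetical stand-in for a commit; `files` maps path => blob id.
Commit = Struct.new(:id, :files, :parents)

# A commit is "the same" as a parent for `path` when both hold the same blob
# id for it, or both lack the path entirely.
def treesame?(commit, parent, path)
  commit.files[path] == parent.files[path]
end

# Walk back from `tip` and return the ids of commits that changed `path`.
# If a commit matches one of its parents for `path`, only that parent is
# followed, so the other sub-histories are never visited.
def simplified_log(tip, path)
  result = []
  seen   = {}
  queue  = [tip]

  until queue.empty?
    commit = queue.shift
    next if seen[commit.id]
    seen[commit.id] = true

    if commit.parents.empty?
      # Root commit: report it only if it introduced the path.
      result << commit.id if commit.files.key?(path)
      next
    end

    same = commit.parents.find { |parent| treesame?(commit, parent, path) }
    if same
      queue << same                 # unchanged relative to this parent: prune the rest
    else
      result << commit.id           # differs from every parent: it touched the path
      queue.concat(commit.parents)
    end
  end

  result
end

# a: root, adds f. b: adds g on the main line. s1: side branch rewrites f.
# m: merge of b and s1 that keeps the main line's version of f.
a  = Commit.new("a",  { "f" => "v1" },              [])
b  = Commit.new("b",  { "f" => "v1", "g" => "v1" }, [a])
s1 = Commit.new("s1", { "f" => "v2" },              [a])
m  = Commit.new("m",  { "f" => "v1", "g" => "v1" }, [b, s1])

simplified_log(m, "f")   # => ["a"]; the side branch containing s1 is never walked
simplified_log(m, "g")   # => ["b"]
```

Walking every commit and filtering afterwards would still visit `s1` and report it as a change to `f`, which is why the two approaches can differ in results as well as in speed.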
I realize Rugged probably won't be as fast as git log, but currently the performance for the log on a particular file is horrible. Compare on your machine; for me (in a Ruby benchmark run 1000 times), Rugged takes about double the time of shelling out to git log:
obviously replace repo and file :)
Numbers from my machine: