How to work out the progress of filter-branch in large repos #144

Closed
robe070 opened this issue Feb 11, 2016 · 18 comments

Comments

@robe070

robe070 commented Feb 11, 2016

We have a master branch with 30,000 commits and are creating subrepos from it. The first subdir has few related commits (110) and took about 50 minutes to init and push. The second is the main subdir, which most commits touch. Its subrepo init has been running for 14 hours so far, and I have no idea how far it has progressed.

I presume there is a working directory somewhere. How may I work out the progress?

@grimmySwe
Collaborator

I think there should be a .git-rewrite directory with information.

I am looking at a solution for the long init process. Unfortunately, I think you can expect long run times for all git-subrepo operations on that repo, as filter-branch is used extensively. I have recently suggested some changes; see #142.

@grimmySwe
Collaborator

Notes:
in subrepo:init we use

  # filter-branch out the subdir
  o "Get commits specific to the subdir."
  FAIL=false RUN git filter-branch -f \
    --subdirectory-filter "$subdir" \
    HEAD

This takes the entire branch and rewrites it so that the subdir becomes the root. For init purposes we are only interested in the last commit, so we could use something better. My first hunch was to simply add -- HEAD^..HEAD to get the last commit, but that doesn't work here, as the last commit might not touch the subdir.

So first we need to find the latest commit that touched the subdir, with something like `git log --follow $subdir`. From that commit we can create a branch, and then perform

$ git checkout -b newbranch <latest_subdir_commit>
$ git filter-branch --subdirectory-filter $subdir -- HEAD^..HEAD

That would give us a commit that is the current state of the subdir. We can then create a new parentless commit from it with commit-tree and store a reference to it.

Note 2: Consider copying the project commit first with commit_tree to a node with no parent, then the filter-branch should always show the tree difference between current state and the current.
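
The "find the latest commit that touched the subdir" step described above could be sketched as follows. This is only an illustration, not what git-subrepo does: the function name `latest_subdir_commit` is made up, and it uses a plain path limiter on `git rev-list` rather than `--follow`, since `--follow` is documented as working only for a single file.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-subrepo).
# latest_subdir_commit SUBDIR [REV]: print the newest commit reachable from
# REV (default HEAD) that touched SUBDIR. Prints nothing if none did.
latest_subdir_commit() {
  git rev-list -1 "${2:-HEAD}" -- "$1"
}

# Possible use, run from the top of the parent repo:
#   git checkout -b newbranch "$(latest_subdir_commit "$subdir")"
```

The path limiter does not track renames of the directory, so a subdir that was moved at some point would need extra handling.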

@robe070
Author

robe070 commented Feb 12, 2016

That's a help, thanks.
I found .git-rewrite, and the map directory is having SHAs added to it. The commit, index and message directories are being updated frequently. But how far through is the process? What can I measure against what? I see that commit contains this:

tree 4aed92f47723b268d1fedfb40a05a5f50f395586
parent 02f2f1fec9d36a0b5f9c80d456ae14e3194f2285
author stewart <stewart@lansa.com.au> 1450046897 +1100
committer Rob Goodridge <rob.goodridge@lansa.com.au> 1450124295 +1100
[git-vault-id] lansa$/vl/Trunk@32069/112090

And that is what I need, as this repo is mirrored from Vault and so has a sequential number in the commit message, @32069.
In the general case, the parent is in the source branch and can be located in the log.
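
For repos without a sequential number in the commit message, one rough way to answer "what can I measure against what" is to compare the number of entries in .git-rewrite/map with the number of commits the rewrite has to visit. A minimal sketch, assuming filter-branch adds one file per processed commit under `map` (consistent with the SHAs appearing there) and that `git rev-list --count HEAD -- "$subdir"` approximates the commits a `--subdirectory-filter` run walks. The helper names `count_mapped` and `percent_done` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for estimating filter-branch progress.

# count_mapped DIR: number of entries in DIR/map (0 if it does not exist yet).
count_mapped() {
  if [ -d "$1/map" ]; then
    ls -A "$1/map" | wc -l | tr -d ' '
  else
    echo 0
  fi
}

# percent_done DONE TOTAL: integer percentage, guarding against TOTAL=0.
percent_done() {
  if [ "$2" -gt 0 ]; then
    echo $(( $1 * 100 / $2 ))
  else
    echo 0
  fi
}

# Possible use from a second terminal, in the directory holding .git-rewrite:
#   total=$(git rev-list --count HEAD -- "$subdir")
#   mapped=$(count_mapped .git-rewrite)
#   echo "$mapped/$total ($(percent_done "$mapped" "$total")%)"
```

The figure is an estimate only: the total is computed separately from the running rewrite, so the two counts may not line up exactly.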

@jrosdahl
Contributor

For init purpose we are only interested in the last commit, so we could use something better.

But that would change init's behavior into squashing all commits in the subdirectory instead of reconstructing the subdirectory's history, right?

Given that we would like to change init's behavior, or make it optional, here's an easier way of constructing a commit with the subdirectory's current state:

tree=$(git ls-tree HEAD^{tree} $subdir | awk '{print $3}')
commit=$(git commit-tree -m "..." $tree)
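
A variant of the same idea, as a sketch: `git rev-parse "HEAD:$subdir"` resolves the subdirectory's tree directly, which avoids parsing `ls-tree` output. The function name `subdir_snapshot_commit` and the commit message are placeholders, and the sketch assumes the path names a directory, not a file.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-subrepo).
# subdir_snapshot_commit SUBDIR MESSAGE: create a parentless commit whose tree
# is the content of SUBDIR at HEAD, and print its id. No ref is updated.
subdir_snapshot_commit() {
  tree=$(git rev-parse --verify -q "HEAD:$1") || return 1
  git commit-tree -m "$2" "$tree"
}

# Possible use:
#   commit=$(subdir_snapshot_commit "$subdir" "Snapshot of $subdir")
#   git branch snapshot "$commit"
```

Because no ref points at the new commit until one is created, it stays unreachable and is eventually garbage-collected unless a branch or tag is attached to it.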

@grimmySwe
Collaborator

Good catch. I do think we should at least consider what the implications are and what the best default is.
If you recreate the entire history, it might take a while to do the filter-branch on an unknown number of commits. Looking at the regular git init doesn't help much, as it only initializes a dir into a repo.

My opinion is that the default should be a fixed-time operation so that you don't get surprised. If you know what you are doing, you could add a --with-history option to recreate the history.

@grimmySwe
Collaborator

Clone actually squashes in content, so it might be good if init has the
same default behavior.


@ingydotnet
Owner

Just to weigh in a bit...

All subrepo clones and pulls squash down to a single commit in the parent repo.
But that commit points to a remote repo and a commit from which it was taken.
So therefore you always have access to the full history for as long as you can
fetch that remote content.

This is actually one of my favorite parts of subrepo. It doesn't clutter up
your history (possibly with the subrepo maintainer's terrible history
housekeeping) but doesn't lose any info either (unless the remote goes away
from access).

Now init should play along. After a commit, you are gonna want to push the
entire history tree off to some remote (as if it existed before and you subrepo
cloned it).

I guess the other side of the coin is that, in terms of info preservation, all
the commits are already in the repo it was inited from. So as long as the
commit msg from the init shows where it came from, somebody with enough time
or a fast enough computer could always retrieve it.

I would support a progress counter and also a way to limit the amount of time
spent on it. I'd probably do the limit with an env var, since it is a bit of
an edge case to warrant a CLI option.
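
As an illustration of the env var idea, here is a minimal sketch of a wrapper that aborts a long-running command after a limit. The variable name `SUBREPO_TIME_LIMIT` and the function `run_with_limit` are invented for this example and do not exist in git-subrepo. It relies on the `timeout` utility from GNU coreutils, which may not be installed on every system.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of git-subrepo).
# run_with_limit CMD...: run CMD, aborting it after $SUBREPO_TIME_LIMIT seconds
# when that (made-up) variable is set; run it unrestricted otherwise.
run_with_limit() {
  if [ -n "${SUBREPO_TIME_LIMIT:-}" ]; then
    timeout "$SUBREPO_TIME_LIMIT" "$@"
  else
    "$@"
  fi
}

# Possible use:
#   SUBREPO_TIME_LIMIT=3600
#   run_with_limit git filter-branch -f --subdirectory-filter "$subdir" HEAD
```

This sketch has abort semantics only: `timeout` exits with status 124 when the limit is hit, and a half-finished rewrite is not turned into a usable partial result.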

@robe070, I don't suppose this repo is publicly available, is it? It would make
a good test case.

@robe070
Author

robe070 commented Feb 15, 2016

No, it's not publicly available.

A comment on the progress messages - the original reason for the post - I've been performing filter-branch directly and it provides a very good progress message. Why does subrepo hide it?

@robe070
Author

robe070 commented Feb 15, 2016

My solution to #145 is much faster than git subrepo init followed by git subrepo push, though it does not care whether there is a common commit or not. It also copies tags to the new subrepo, and it presumes the subrepo is local, since it 'cd's to it. I think that could be an OK restriction. To me this task is one to perform locally: check it's all OK and then push the final result to a central repo.

{ bigrepo } master » git filter-branch --tag-name-filter cat --prune-empty --subdirectory-filter subdir -- --all
{ bigrepo } master » cd ../mysubrepo
{ mysubrepo} master » git pull ../bigrepo master
{ mysubrepo} master » git fetch ../bigrepo 'refs/tags/*:refs/tags/*' 

I estimate it's between 5 and 10 times faster. For the largest subdir in my bigrepo, that took 'just' 36 hours instead of maybe 2 weeks!

It provides progress messages from which you can estimate the completion time.
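
Estimating the completion time from a progress counter comes down to simple extrapolation. A minimal sketch, assuming the per-commit cost stays roughly constant over the run, which may not hold when commits vary a lot in size. The function name `eta_seconds` is made up.

```shell
#!/bin/sh
# Hypothetical helper.
# eta_seconds DONE TOTAL ELAPSED: remaining seconds by linear extrapolation,
# i.e. ELAPSED * (TOTAL - DONE) / DONE. Prints "unknown" before any progress.
eta_seconds() {
  if [ "$1" -gt 0 ]; then
    echo $(( $3 * ($2 - $1) / $1 ))
  else
    echo unknown
  fi
}

# Example: 100 of 400 commits rewritten after 60 seconds.
#   eta_seconds 100 400 60   # prints 180
```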

For my purposes, the subrepo created must retain the history directly. It must be visible in GUI tools. I actually think that conceptually this subrepo is to be THE repo for this part of the project. It's not good enough that it relies on another repo to retain its history. The history and the commits need to be together. When you pull from the subrepo it's a different story. You can do whatever fits your need, because the central subrepo will always exist and retain the history for as long as you need it.

@grimmySwe
Collaborator

@robe070

For my purposes, the subrepo created must retain the history directly. It must be visible in GUI tools. I actually think that conceptually this subrepo is to be THE repo for this part of the project. It's not good enough that it relies on another repo to retain its history. The history and the commits need to be together. When you pull from the subrepo it's a different story. You can do whatever fits your need, because the central subrepo will always exist and retain the history for as long as you need it.

I am not sure that I follow you here. In my world there is one and only one subrepo, and the subrepo history is recorded in that subrepo. Any number of parent repos can use the subrepo; inside the parent repos you only see squashed commits when you perform git subrepo pull. Doing git subrepo init will not create a repo; it only prepares your subdir changes to be pushed to a remote repo.

@ingydotnet

I would support a progress counter and also a way to limit the amount of time
spent on it. I'd probably do the limit with an env var, since it is a bit of
an edge case to warrant a CLI option.

By limiting the amount of time, do you mean to abort the operation or somehow "take what you got"?

@robe070
Regarding your suggested algorithm for init, I can only say that, if you look at the code, the current subrepo init mainly does the filter-branch operation. So I wonder where the overhead time comes from.

I personally think that git subrepo init by default should be a constant time operation. So when you run it, it doesn't matter where you are. If you add the --no-squash flag you will perform the filter-branch operation and get the complete history.

And as @robe070 says, it might be good to make the filter-branch progress visible in some way.

@robe070
Author

robe070 commented Feb 15, 2016

When I have referred to 'the subrepo' I am referring to the same term as @grimmySwe - the repo which is used in any number of parent repos.

@grimmySwe

Clone actually squashes in content, so it might be good if init has the same default behavior.

If the init/push squashes commits, then history would only be in the originating repo. That's not good in my mind. The subrepo is too important. It should contain all the commit history.

@grimmySwe

I personally think that git subrepo init by default should be a constant time operation. So when you run it doesn't matter where you are. If you add the --no-squash flag you will perform the filter-branch operation and get the complete history.

I have not seen git subrepo init initialise the subrepo. I thought you needed to git subrepo push too, so the time taken is for both operations. My filter-branch acts on the current branch and removes all other commits, whereas subrepo creates a separate branch. I actually don't understand how that can be done, as I can see no option on the filter-branch command to create a branch and not be destructive. So I presume it's a two-step process. And I'm pretty sure I've seen at least 2 filter-branch operations in git subrepo push, so that's at least 3 of them. So the time is adding up, and it's probably not hard to get to 5 times. Anyway, I've based it on actual timings of a subdir with a low percentage of the total commits, less than 1%. It may be very different when 97% of the commits are being rewritten.
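
On the question of a non-destructive filter-branch: one known two-step approach is to create a scratch branch first and name that branch on the filter-branch command line, so that only the scratch branch is rewritten. The sketch below illustrates this; it is not a claim about what git-subrepo does internally, and the function name `split_subdir_to_branch` is made up. It assumes a clean working tree and being run from the top level of the repo. `FILTER_BRANCH_SQUELCH_WARNING` only suppresses the warning that newer git versions print.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-subrepo).
# split_subdir_to_branch SUBDIR BRANCH: leave the current branch alone and make
# BRANCH hold only the history of SUBDIR, with SUBDIR as the project root.
split_subdir_to_branch() {
  # Step 1: scratch branch at the current commit.
  git branch "$2" HEAD || return 1
  # Step 2: rewrite only that branch; the checked-out branch is not named here.
  FILTER_BRANCH_SQUELCH_WARNING=1 \
    git filter-branch -f --prune-empty --subdirectory-filter "$1" -- "$2"
}

# Possible use:
#   split_subdir_to_branch subdir subdir-only
#   git log --oneline subdir-only
```

filter-branch also keeps a backup of the rewritten ref under refs/original/, which can be deleted once the result has been checked.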

And since I believe that having no history is not appropriate for the subrepo, squashing is not appropriate and a constant time is not achievable.

How prevalent are large repositories? Is this just an edge case for which a workaround like my steps could be documented, along with a progress message?

@grimmySwe
Collaborator

Ok, now I understand. As you say, to initialize a real subrepo you would need to both init and push.

And this workflow should support repos of arbitrary size; if something works for a huge repo, it should work for a small one as well. The current behavior seems to perform unnecessary steps in the push step.

@ingydotnet
Owner

I think we are misunderstanding and agreeing with each other (for the most
part) at the same time.

The intent of the init command is to create a repository from a subdir, with
its complete history; get it out to a remote location so it lives on its own;
then make it look like it was cloned/squashed back into the parent repo.

Basically it's admitting that a subdir should have been a separate project in
the first place; and then making it look like it was.


I like the idea of exposing the filter-branch output. Maybe with the --verbose
option.

I think this is the fast way to do an init+push:

git subrepo init --remote=$url foobar/
git subrepo branch foobar/
git push $url subrepo/foobar:master

The init command should really leave behind a subrepo/foobar branch, thus
making step 2 go away. I'll make that change on a branch. The subrepo push
command does extra work that is not necessary here, so a straight git push
is in order.

@ingydotnet
Owner

Pushed branch issue/144 and wrote a usage script:
https://gist.github.com/1b48de5ef8c3fd1fdfbe

The --debug option for init will now show progress directly. This was done
in a way that is easy to apply to other commands as needed. It also respects
piping output to other utils (it won't show progress in that case).

After you init foo you are left with a remote and branch, both named
subrepo/foo. You can use these to review and push directly.

Try it out in various situations and let me know what you think.

@ingydotnet
Owner

FWIW, here is the output of init on the builtin subdir of git.git:

https://gist.github.com/anonymous/67f5a814577495519bd3

4576 subdir commits only took 3 mins on a MacBook Pro.

A direct push of the subrepo/builtin branch only took a few seconds.

@robe070, the filter-branch output looks really nice. Thanks for the idea.

@robe070
Author

robe070 commented Feb 16, 2016

I've tested branch issue/144.
The timing has reduced from about 1 hour to 11 minutes and it tells you how far the filter-branch has progressed. Great, that works for me!

git reset --hard also takes some time with a large repo. It would be good to allow git to show its progress on that too, as a percentage complete.

My git init timings are here: https://gist.github.com/robe070/36da4b9ba2de74acc94d

In summary: a 32,000-commit repo with almost 400,000 files.
The Demo subdir is only referenced in 112 of those commits.
The filter-branch step took the longest, but the git reset --hard was not far behind, at maybe 3-4 minutes of the total 10.5 minutes. The filter-branch step takes proportionally longer for a greater number of commits. The reset would not get longer for more commits.

@robe070
Author

robe070 commented Feb 29, 2016

I'm happy with these changes. Progress displayed for git reset --hard could be added to a new issue and this issue closed.

Maybe this could be merged into the master branch?

@ingydotnet
Owner

Done. Closing issue.
