How to work out the progress of filter-branch in large repos #144
I think there should be a I am looking at a solution for the long init process. Unfortunately, I think you can expect all git-subrepo operations on that repo to take a long time, as filter-branch is used extensively. But I have recently suggested some changes; look in #142. |
Notes:
This will take the entire branch and convert it into a subdir. For init purposes we are only interested in the last commit, so we could use something better. My first hunch was to simply add `-- HEAD^..HEAD` to get the last commit, but that doesn't work here because the last commit might not change the subdir. So first we need to find the latest subrepo commit with something like `git log --follow $subdir`. From that commit we can create a branch, and then perform
That would give us a commit that is the current state of the subdir. So we can use that, create a new parentless commit from it with commit-tree, and store a reference to it. Note 2: Consider copying the project commit first with commit_tree to a node with no parent, then the filter-branch should always show the tree difference between current state and the current. |
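As a rough illustration of the "parentless commit" idea above, here is a minimal sketch using plain git plumbing. It is not what git-subrepo does internally; the function name `subdir_snapshot` and the commit message are made up for the example.

```shell
# Hypothetical helper (not part of git-subrepo): print the id of a new
# parentless commit whose tree is the current content of a subdirectory.
# Must be run inside a git work tree with at least one commit.
subdir_snapshot() {
  subdir=$1
  # "HEAD:<path>" names the tree object of that directory as of HEAD.
  tree=$(git rev-parse --verify "HEAD:$subdir") || return 1
  # commit-tree without any -p option creates a commit with no parent.
  git commit-tree -m "Snapshot of $subdir" "$tree"
}
```

The printed commit id could then be stored with `git update-ref` under a ref name of your choosing. This takes roughly the same time regardless of how long the subdirectory's history is, but it keeps none of that history.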
That's a help, thanks.
And that is what I need as this is a repo mirrored from Vault and so contains a sequential number in the commit message, @32069. |
But that would change init's behavior into squashing all commits in the subdirectory instead of reconstructing the subdirectory's history, right? Given that we would like to change init's behavior, or make it optional, here's an easier way of constructing a commit with the subdirectory's current state:
|
Good catch. I do think we should at least consider what the implications are and what the best default is. My opinion is that the default should be a fixed-time operation so that you don't get surprised. If you know what you are doing, you could add a --with-history option and then you will recreate history. |
Clone actually squashes in content, so it might be good if init has the On Saturday, 13 February 2016, Joel Rosdahl notifications@github.com wrote:
|
Just to weigh in a bit... All subrepo clones and pulls squash down to a single commit in the parent repo. This is actually one of my favorite parts of subrepo. It doesn't clutter up Now I guess the other side of the coin is that, in terms of info preservation, all I would support a progress counter and also a way to limit the amount of time @robe070, I don't suppose this repo is publicly available, is it? It would make |
No, it's not publicly available. A comment on the progress messages - the original reason for the post - I've been performing filter-branch directly and it provides a very good progress message. Why does subrepo hide it? |
My solution to #145 is much faster than git subrepo init, git subrepo push, though it does not care whether there is a common commit or not. It also copies tags to the new subrepo. It also presumes the subrepo is local - it
I estimate it's between 5 and 10 times faster. For the largest subdir in my bigrepo that took 'just' 36 hours, instead of maybe 2 weeks! It provides progress messages from which you can estimate the completion time. For my purposes, the subrepo created must retain the history directly. It must be visible in GUI tools. I actually think that conceptually this subrepo is to be THE repo for this part of the project. It's not good enough that it relies on another repo to retain its history. The history and the commits need to be together. When you pull from the subrepo it's a different story. You can do whatever fits your needs, because the central subrepo will always exist and retain the history for as long as you need it. |
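Estimating the completion time from a progress counter is simple arithmetic. The sketch below assumes the time per commit stays roughly constant, which may not hold for filter-branch on a real repo; the function name `estimate_remaining` is hypothetical.

```shell
# Hypothetical helper: estimate_remaining DONE TOTAL ELAPSED_SECONDS
# Prints the estimated remaining seconds by linear extrapolation
# (assumes each commit takes about the same time to rewrite).
estimate_remaining() {
  done_count=$1
  total=$2
  elapsed=$3
  # Avoid dividing by zero before the first commit has been processed.
  [ "$done_count" -gt 0 ] || return 1
  echo $(( elapsed * (total - done_count) / done_count ))
}
```

For example, `estimate_remaining 8000 32000 3600` prints `10800`: if 8,000 of 32,000 commits took one hour, about three more hours remain under that assumption.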
I am not sure that I follow you here. In my world there is one and only one subrepo, in that subrepo the subrepo history is recorded. Any number of parent repos can use the subrepo, inside the parent repos you only see squashed commits when you perform
By limiting the amount of time, do you mean to abort the operation or somehow "take what you got"? @robe070 I personally think that And as @robe070 says it might be good to make the filter-branch progress visible in some way. |
When I have referred to 'the subrepo' I am referring to the same term as @grimmySwe - the repo which is used in any number of parent repos.
If the init/push squashes commits, then history would only be in the originating repo. That's not good in my mind. The subrepo is too important. It should contain all the commit history.
I have not seen And since I believe that having no history is not appropriate for the subrepo, squashing is not appropriate either, and constant time is not achievable. How prevalent are large repositories? Is this just an edge case for which a workaround, like my steps, is documented and a progress message provided? |
Ok, now I understand. As you say, to initialize a real subrepo you would need to both init and push. And this workflow should support arbitrary size of repo, if something works for a huge repo it should work for a small one as well. Current behavior seems to perform unnecessary steps in the push step. |
I think we are misunderstanding and agreeing with each other (for the most The intent of the init command is to create a repository from a subdir, with Basically it's admitting that a subdir should have been a separate project in I like the idea of exposing the filter-branch output. Maybe with the --verbose I think this is the fast way to do an init+push:
The init command should really leave behind a subrepo/foobar branch, thus |
Pushed branch issue/144 and wrote a usage script: The --debug option for init will now show progress directly. This was done After you init foo you are left with a remote and branch, both named Try it out in various situations and let me know what you think. |
FWIW, Here is the output of init on the https://gist.github.com/anonymous/67f5a814577495519bd3 4576 subdir commits only took 3 mins on MacBook pro. A direct push of the subrepo/builtin branch only took a few seconds. @robe070, the filter-branch output looks really nice. Thanks for the idea. |
I've tested branch issue/144.
My git init timings are here: https://gist.github.com/robe070/36da4b9ba2de74acc94d In summary, 32,000 commit repo with almost 400,000 files. |
I'm happy with these changes. Progress displayed for Maybe this could be merged into the master branch? |
Done. Closing issue. |
We have a 30,000 commit master branch. We are creating subrepos from it. The first subdir has few commits (110) related to it and took about 50 minutes to init and push. The second one is the main subdir, to which most commits relate. The subrepo init has been running for 14 hours so far. I have no idea how far it has progressed.
I presume there is a working directory somewhere. How may I work out the progress?
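One rough way to get a sense of scale, assuming the init filters history down to the subdir as the comments in this thread describe: count the commits that touch that subdir. When filter-branch is run directly it prints a `Rewrite <sha> (n/total)` counter, and this count gives an approximate total to compare against. The function name `count_subdir_commits` is made up for this sketch.

```shell
# Hypothetical helper: count the commits reachable from HEAD that
# change anything under the given subdirectory. Run inside the repo.
count_subdir_commits() {
  git rev-list --count HEAD -- "$1"
}
```

Run as `count_subdir_commits path/to/subdir` from the top of the parent repo. The number is only an estimate of the work involved, since the time per commit can vary a lot.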