Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter-repo --path doesn't preserve history of content in path that earlier had been out of path (multi-path rename or copy detection) #25

Closed
nathanielmanistaatgoogle opened this issue Nov 29, 2019 · 1 comment

Comments

@nathanielmanistaatgoogle

STEPS TO REPRODUCE:

$ git clone https://fuchsia.googlesource.com/fuchsia
$ cd fuchsia
$ git --no-pager log --pretty=oneline --follow ee838c60b2ff5b957898323cf27f0bd92fa7c93d -- src/ledger/bin/app/app.cc|wc
$ git filter-repo --path=src/ledger/
$ git --no-pager log --pretty=oneline --follow 93946749a1 -- src/ledger/bin/app/app.cc|wc

(Here 93946749a1 comes from HEAD is now at 93946749a1 [ledger] Migrate ptr.h to ledger/lib in the output of git filter-repo.)

EXPECTED RESULTS:
The first number in the output of the two git log commands will be the same, indicating that all of the history of src/ledger/bin/app/app.cc that was available before filtering the repo remains available.

OBSERVED RESULTS:
The numbers are 144 and 25, indicating that a great deal of the history of src/ledger/bin/app/app.cc was not preserved. Additionally, browsing the new log for the file indicates that history for the file only goes back as far as when the file was moved into src/ledger/.

Am I holding it wrong? I would have thought from "extracting wanted paths and their history (stripping everything else)" that the full history of the content at the given path at HEAD would be retained, even if the content at the given path at HEAD had not always been located at the given path.

@newren
Copy link
Owner

newren commented Nov 30, 2019

Am I holding it wrong? I would have thought from "extracting wanted paths and their history (stripping everything else)" that the full history of the content at the given path at HEAD would be retained, even if the content at the given path at HEAD had not always been located at the given path.

Yeah, I can see how it'd be nice to have this different functionality, but it's not what was intended by the wording or what I even tried to implement. Note that the wording you quote was not for any specific command line flag, it was just generic what-can-the-tool-do wording. If you look at the actual documentation from the manual for the --path option, it states that it is for "Exact paths (files or directories) to include in filtered history.". The builtin documentation uses the same wording. I would have at a minimum left off the word "Exact" in both places if rename or copy detection was implied; the "Exact" is there just to reinforce that.

Also, this is consistent with how the rev-list, log, and fast-export git subcommands work. E.g. git log -- src/ledger/bin/app/app.cc won't show any history for other paths that this file was renamed or copied from (or for which parts of it came from). You used the --follow flag specifically, which is a big hack as even noted in the git log documentation (it mentions that it only works when a single file is specified). If rev-list/log/fast-export, etc. had a --follow option that followed renames, I could simply expose it from filter-repo, but despite the desire for such an option no one has implemented it in many years. There's some good challenges there too, e.g. we'd probably want to traverse in topological order and we may need two passes -- one to create the topological ordering, and the second to build up additional paths from renames. (A case where this might be necessary: some branch builds on top of 'master' and has some paths within the specified pathspec that came from a rename of something outside the pathspec at the time 'master' existed. If 'master' was traversed before the other branch, then we'd have already picked the more limited pathspec and miss the extra needed paths.)

But even if --follow implemented following of renames for multiple files or a directory or more, that still wouldn't necessarily be sufficient because perhaps the user needs copy detection (i.e. it wasn't a file renamed from somewhere else, rather it was copied). But with copy detection it's not as clear if you want the full history of the original; I can imagine that in some cases you would but not others.

And if we start doing either rename or copy detection, then we're moving from well-defined correct behavior to heuristics. For diffs or logs or even merges that's fine, because the results will be interpreted by a user (even in merge, if the detection is wrong, the user can fixup conflicts and make other edits). Here, we'd record the results of the heuristics in stone. That's a bit worrying to me...and it also means we'd have to open up a pile of knobs (at the very least a similarity percentage, and whether copies are wanted in addition to renames) for configuration.

All that said, I wanted something like that when I was using it too. The best compromise I came up with was to have people run 'git filter-repo --analyze' beforehand, look at the renames sub-report, and pick out additional paths by hand based on that to feed to their filter-repo run. The --analyze option still had a few caveats with the rename detection, but that was mostly fundamental to the problem. Providing it and letting the user decide what to include (though I didn't even bother with copy detection), seemed like the best option I had available.

I know that's probably not what you wanted to hear, but I hope the --analyze hint and the renames information it generates is at least helpful.

@newren newren changed the title filter-repo --path doesn't preserve history of content in path that earlier had been out of path filter-repo --path doesn't preserve history of content in path that earlier had been out of path (multi-path rename or copy detection) Nov 30, 2019
@newren newren closed this as completed Nov 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants