Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter branch #429

Merged
merged 1 commit into from Jun 15, 2013
Merged

Filter branch #429

merged 1 commit into from Jun 15, 2013

Conversation

ben
Copy link
Member

@ben ben commented May 11, 2013

This adds an idiomatic c-sharpy api for doing filter-branch-like operations. This works without ever touching the index or the working directory, so it should be wicked fast.

  • Header rewriting (author, message, etc)
  • Tree rewriting using TreeDefinition
  • Tag rewriting (like --tag-filter)
  • Parent-commit munging (like --parent-filter)
  • Back up original refs to refs/original
  • Name review
  • XML documenation
  • Squash

The API looks like this:

repo.Refs.RewriteHistory(
    // A collection of commits that should be rewritten
    repo.Head.Commits,

    // A function that returns a `TreeDefinition` for the new commit
    commitTreeRewriter: c =>
        {
            var td = TreeDefinition.From(c);
            td.Remove("README");
            return td;
        },

    // A function that returns header information to be used for the new commit
    commitHeaderRewriter: c =>
        {
            var ch = CommitRewriteInfo.From(c);
            ch.Message += "\n\nCleaned-by: The Cleaner";
            return ch;
        });

"refs/tags/lw",
"refs/tags/test",
};
Assert.Equal(expected, result.Select(x => x.CanonicalName).OrderBy(x => x));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally helpful to tack on a ToArray() after OrderBy() to get a better message on failure.

@ben
Copy link
Member Author

ben commented May 13, 2013

I just pushed up what I did on my flight. Let me try to synthesize the feedback here and match it up with where things stand now:

  1. CommitHeader.From with optional parameters turns out to have some difficulties, since a conversion to a Func<> doesn't work any more. I ended up doing From with optional parameters (which is great), and the simple duplication call is called SameAs.
  2. The reachability call is kind of awkward no matter where it lives. I moved it over to IEnumerable<Reference> and called it ReachableFrom, but it still seems clunky. The problem is that it needs a set of refs and two sets of commits. If it belonged to Repository, it would get the full commit graph for free, but I kind of don't like polluting that class with tons of calls.
  3. RewriteHistory moved to Repository. It affects commits, tags, and refs, so I can't imagine a better place for it than the class that ties all those together.
  4. The DTO class for rewriting metadata ended up being called CommitRewriteInfo. Does that work for everyone?
  5. I removed the CommitRewriteInfo.Encoding field, but now I'm wondering if that was a mistake. Do you think it's useful during a rewrite to know what the encoding was originally? We're not allowing writing of custom encodings.

Also, I checked off a few of the boxes, so this is a bit more powerful now. Let me know what you think.

repo.RewriteHistory(commits, commitHeaderRewriter: c => CommitRewriteInfo.From(c, message: "abc"));
Assert.Empty(repo.Refs.Where(x => x.CanonicalName.StartsWith("refs/original/original/")));
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The travis build is going to fail here. The question is, what should happen if we try to back up the refs to refs/original (or wherever), and it already exists? We could:

  • Pretend we're doing filter-branch -f, and overwrite them.
  • Throw an exception of some sort, though I'd want to do a checking pass first so we don't actually do a rewrite if it's going to fail.
  • All of the above, with some way of turning it on.
  • A callback for collisions? I only mention it because I've seen it done inside libgit2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throw an exception of some sort, though I'd want to do a checking pass first so we don't actually do a rewrite if it's going to fail.

👍

@dahlbyk
Copy link
Member

dahlbyk commented May 13, 2013

CommitHeader.From with optional parameters turns out to have some difficulties, since a conversion to a Func<> doesn't work any more. I ended up doing From with optional parameters (which is great), and the simple duplication call is called SameAs.

R# will complain, but it's perfectly legal to have overlapping methods with and without optional parameters so you can still use CommitHeader.From as a Func<Commit, CommitHeader>.

The reachability call is kind of awkward no matter where it lives. I moved it over to IEnumerable<Reference> and called it ReachableFrom, but it still seems clunky. The problem is that it needs a set of refs and two sets of commits. If it belonged to Repository, it would get the full commit graph for free, but I kind of don't like polluting that class with tons of calls.

I had overlooked the use of repo.Commits - you're right, it does feel clunky. It's cheating a bit, but exposing GitObject.repo internally would allow you to grab repo.Commits from one of the targets instead of requiring allCommits as an argument.

Unrelated microoptimization: you could capture targetsList in a HashSet<> for more efficient lookup, since we're checking for each commit. At least I expect it would perform better...

I removed the CommitRewriteInfo.Encoding field, but now I'm wondering if that was a mistake. Do you think it's useful during a rewrite to know what the encoding was originally? We're not allowing writing of custom encodings.

Encoding is available through the Commit passed to the callback - properties on CommitRewriteInfo are only useful as input back into the tree.



// This test should rewrite br2, but not packed-test:
// * a4a7dce (br2) Merge branch 'master' into br2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this comment rather decorate the test above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah. It should. 😊

@ben
Copy link
Member Author

ben commented May 13, 2013

R# will complain, but it's perfectly legal to have overlapping methods [...]

Hey, you're right! I was trusting the squigglies too much.

It's cheating a bit, but exposing GitObject.repo internally [...]

Yup. I like cheating. That makes it much nicer to use, too. Now I'm wondering if it makes sense to make GitObject.repo public, it could simplify some client code too.

/// <returns>A new <see cref="CommitRewriteInfo"/> object that matches the info for the <paramref name="commit"/>.</returns>
public static CommitRewriteInfo From(Commit commit)
{
return new CommitRewriteInfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just personal preference, but I'm inclined to capture the values from commit in the other overload, and just defer to that one here.

@nulltoken
Copy link
Member

RewriteHistory moved to Repository. It affects commits, tags, and refs, so I can't imagine a better place for it than the class that ties all those together.

@ben Could you please rather move it into ReferenceCollection?

var newName = referenceNameRewriter(reference.CanonicalName.Substring(5));
if (newName != null)
{
Refs.Add(newName, reference.TargetIdentifier, true, "rewrite history");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Maybe would the reflog also deserve some love 😜

More insight here

@ben
Copy link
Member Author

ben commented May 14, 2013

@nulltoken mentioned the reflog. Looking at git's code, it seems like it only generates a few distinct reflog messages.

  1. filter-branch: rewrite on a successful rewrite.
  2. filter-branch: backup when backing up.
  3. filter-branch: delete if the ref was deleted, maybe because of a --prune-empty or somesuch.
  4. filter-branch: rewrite to first if a ref was rewritten into multiple commits.

I've corrected 1 and 2; RewriteHistory now emits the same messages as git would have.

Cases 3 and 4 aren't directly supported, but you could simulate them with the parent-rewrite hook (by either omitting or inserting parent links). The problem is that, since RewriteHistory isn't aware that this is happening, it can't detect what happened and emit the right message. Any opinions on the right way to support this?

/// If this returns null, the tag will be deleted.
/// If it returns the empty string, the tag will not be changed.
/// If it returns the input, the tag will be moved.
/// Any other value results in a new tag.</param>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an emulation of what git provides, but now that I've written it I'm not totally convinced it's the right thing. There's a lot of behavior here that's keying off the value of a returned string. Would it make more sense to have this be an Action<Tag, GitObject> that receives the old tag and the new target? We could even call it in the right order for chained tags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this returns null, the tag will be deleted.

Can this handle the following use case?

initial:

refs/tags/my-tag -> TagA -> TagB -> TagC -> GitObject (either a Blob, Tree or a Commit)

expected output:

refs/original/tags/my-tag -> TagC -> GitObject (either a Blob, Tree or a Commit)
or
refs/original/tags/my-tag -> TagA -> GitObject (either a Blob, Tree or a Commit)
or
refs/original/tags/my-tag -> GitObject (either a Blob, Tree or a Commit)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any other value results in a new tag

Can this handle the following use case?

rewriter = x => { return x == "v0.0.1" ? "v0.0.1rc" : x }

initial:

refs/tags/my-tag -> TagA (name: test) -> TagB (name: v0.0.1) -> TagC (name: another) -> GitObject (either a Blob, Tree or a Commit)

expected output:

refs/original/tags/my-tag -> TagA (name: test) -> TagB (name: v0.0.1rc) -> TagC (name: another) -> GitObject (either a Blob, Tree or a Commit)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initial:

refs/tags/my-tag -> TagA -> TagB -> TagC -> GitObject (either a Blob, Tree or a Commit)

expected output:

refs/original/tags/my-tag -> TagC -> GitObject (either a Blob, Tree or a Commit)
or
refs/original/tags/my-tag -> TagA -> GitObject (either a Blob, Tree or a Commit)
or
refs/original/tags/my-tag -> GitObject (either a Blob, Tree or a Commit)

My gut feeling is that refs/original/tags/my-tag should now point to refs/original/tags/TagA – the tag-chain rewriting should act a lot more like the commit rewriting than it currently does. I'll add this case to the tests, but it may already work.

initial:

refs/tags/my-tag -> TagA (name: test) -> TagB (name: v0.0.1) -> TagC (name: another) -> GitObject (either a Blob, Tree or a Commit)

expected output:

refs/original/tags/my-tag -> TagA (name: test) -> TagB (name: v0.0.1rc) -> TagC (name: another) -> GitObject (either a Blob, Tree or a Commit)

This one probably already works. Whatever the result of the name callback, the code records the before and after states of every rewritten tag in tagMap. So when rewriting TagA, the code notices that it's pointing to a TagAnnotation. It tries to rewrite the target tag TagB first (using a cached result if that's already done), and creates a new TagA that points to TagB's annotation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm coming back to this, and more and more I'm thinking that tag handling should be left to the caller. The default behavior is to do nothing to tags (to preserve signing, etc) – not even backing them up. If the caller wants to rewrite them, they're completely able to.

@nulltoken
Copy link
Member

Cases 3 and 4 aren't directly supported, but you could simulate them with the parent-rewrite hook (by either omitting or inserting parent links). The problem is that, since RewriteHistory isn't aware that this is happening, it can't detect what happened and emit the right message. Any opinions on the right way to support this?

I think it'd be ok to keep those outside the scope of this PR. Could you please open an issue to keep track on those?

Regarding this, does git filter-branch is able to handle such rewrite?

initial:

A -> B -> C

expected output:

A -> B1 -> B2 -> B3 -> C
or
A -> B1 -> B2a -> B3 -> C
         \ B2b /

@ben
Copy link
Member Author

ben commented May 17, 2013

What do you think is the correct thing to do in this situation?

refA -> commit -> tree -> blob <- tag
(rewrite)
refA' -> commit' -> tree' -> blob'    |    blob <- tag (?)

The commit's tree has been rewritten to include a new blob in the place the old blob was. Should the tag now point to the new blob in the new tree? My gut feel is no; if you went out of your way to tag a blob, you probably want that to always point to that blob.

@ben
Copy link
Member Author

ben commented May 17, 2013

expected output:

A -> B1 -> B2 -> B3 -> C
or
A -> B1 -> B2a -> B3 -> C
         \ B2b /

It kind of looks that way. There's a reflog message for "rewritten to multiple", meaning the commit that refA pointed to before has been transformed into multiple commits. It seems to pick the first one, but it notes that this happened.

I'm comfortable passing this off to the caller. If she wants to split a commit into an entire topology (using some combination of the treeRewriter and parentRewriter), she can decide where the transformed ref points to also. By default, the call should choose the commit that is constructed from commitHeaderRewriter, commitTreeRewriter, and parentRewriter.

directReferenceRewriter:
(r, t) => repo.Refs.DefaultDirectReferenceRewriter(r, t, "refs/rewritten"));
Assert.NotEmpty(repo.Refs.Where(x => x.CanonicalName.StartsWith("refs/rewritten")));
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done this for tags yet, but this is what it looks like if you don't restrict the rewriting hooks to just naming the backups. I provide a default implementation that does the right things, but it's easy to override for just naming, or you can do all your own logic in there.

@nulltoken
Copy link
Member

@ben repo.ObjectDatabase.CreateTag() has been merged. This may be helpful to rewrite annotated tags:wink:

@ben
Copy link
Member Author

ben commented Jun 11, 2013

So now this properly rewrites chains of tag annotations without creating spurious tags in the process. Is there anything left, or should we ship it?

@nulltoken
Copy link
Member

@ben I found a couple of issues and eventually ended on a coding spree. Sorry 😁

I've tried to simplify the code a bit (mainly to help me get a better understanding of it).

Main changes are:

  • Fix a minor issue where the rewriters were applied on commits that weren't part of the initial user's list
  • Tags are now also backed up when rewritten
  • TagAnnotation can be renamed

this.backupRefsNamespace = backupRefsNamespace;
}

public void AmazeMe()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'm really that bad at naming...

Ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite? Execute?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Execute!

@dahlbyk Thanks. Fixed.

@nulltoken
Copy link
Member

I'm not sure I can go much farther for now.

@ben @dahlbyk @yorah Would you please sprinkle a bit of your review magic on those? I'm especially not in love the latest commit.

@ben
Copy link
Member Author

ben commented Jun 14, 2013

🔥 @yorah 🔥 ❗ Making a pull request to a pull request? 🙀

return td;
});

Assert.Empty(repo.Head.Commits.Where(c => c["README"] != null));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nitpicky: would this read better as Assert.True(repo.Head.Commits.All(c => c["README"] == null));?

@ben
Copy link
Member Author

ben commented Jun 14, 2013

I think it's ready! 💯 💴

@nulltoken nulltoken merged commit 39716ec into vNext Jun 15, 2013
@nulltoken
Copy link
Member

Awesome job @ben! I tweaked the skipped test a bit before merging it.

@yorah I squashed in your commit as well. The rollbackActions Queue is an amazing idea! Thanks a lot for your help ❤️

@ben
Copy link
Member Author

ben commented Jun 15, 2013

✨ ❗ ✨

@ben ben deleted the filter-branch branch June 15, 2013 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants