Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update checkout with new strategies & behavior #1016

Merged
merged 11 commits into from
Nov 13, 2012

Conversation

arrbee
Copy link
Member

@arrbee arrbee commented Oct 25, 2012

So, @nulltoken created a failing test case for checkout that proved to be particularly daunting. If checkout is given only a very limited strategy mask (e.g. just GIT_CHECKOUT_CREATE_MISSING) then it is possible for typechange/rename modifications to leave it unable to complete the request. That's okay, but the existing code did not have enough information not to generate an error (at least for tree/blob conflicts).

This led me down a long reworking of existing code. Initially I just approached this with a substantial reorganization of the existing code, and you can still see that in the first couple of commits, but at some point I realized that there were two issues:

  1. The checkout strategies allowed you to specify combinations that really didn't line up with any real world usage (e.g. only delete old files and create new ones, but don't update existing ones)
  2. The checkout code wasn't considering the contents of the current HEAD when deciding what it could do safely and what it couldn't. As such, it could not distinguish between files that had been removed from the index (but were in the old HEAD) and files that were truly untracked. Frequently when you check out a new commit, you want to remove files that were in the old commit and not in the new one, but you don't want to remove files that were totally untracked.

After much rewriting, this PR nows tries to deal with those cases correctly.

In addition to those core issues being fixed, there are a number of other things going on in this PR:

  1. A number of the file utilities are improved (particularly rmdir stuff)
  2. Moved the pathspec stuff out of diff.c and into a standalone file
  3. Greatly increased the accuracy of checkout progress values since checkout now builds a precise list of actions that it plans to take before executing any one them
  4. Moved ignore checks in workdir iterators to be handled lazily which should speed up diffs

I think the main facet of this PR that should be examined closely is the direction that I went with the checkout options structure and strategy flags. I'm worried that I went overboard and it is now going to be more complicated to use, but I suspect that most users will just want either the GIT_CHECKOUT_SAFE or GIT_CHECKOUT_FORCE flags (which are actually combinations of more granular flags).

BTW, I've editing this description of the PR to be more up to date with the current actual content of the branch...

@vmg
Copy link
Member

vmg commented Oct 25, 2012

I don't think that changes that improve performance without adding clutter can be "questionable". I don't care if accessing diff internals for checkout is not "elegant" -- it does seem to be faster and just as clean.

The fileops changes look great to me. Haven't looked at the actual checkout code.

@arrbee
Copy link
Member Author

arrbee commented Oct 25, 2012

Conflictio progress callbackio

@arrbee
Copy link
Member Author

arrbee commented Oct 29, 2012

Rebased

@arrbee
Copy link
Member Author

arrbee commented Oct 31, 2012

@nulltoken @ben Okay, this is just an idea about direction for checkout. I'm just looking for feedback - not that this is actually ready to move forward with...

Instead of just doing a diff between the index and the working directory, we would do something more like a merged status diff, where the HEAD, index, and working directory were all factored in. The code would be trying to make the working directory look like the index, but the differences with the HEAD would accounted for.

What do you think if the checkout flags became something like:

    /** A dry run with no changes applied */
    GIT_CHECKOUT_NONE = 0,

    /** Update files in working dir that match HEAD to match index;
     *
     * Possible errors: untracked file exists that is in new HEAD, modified
     * file exists that has to be created/deleted for new HEAD, unmerged
     * entries exist in index.
     */
    GIT_CHECKOUT_SAFE = (1u << 0),

    /** Update files to match index even if they don't match HEAD; ignores
     * unmerged entries in index; errors: untracked file exists that is in
     * new HEAD.
     */
    GIT_CHECKOUT_HARD = (1u << 1),

    /** Allow overwrite of existing working dir files even if not in HEAD */
    GIT_CHECKOUT_OVERWRITE_CONFLICTS = (1u << 2),

    /** Remove working dir files not in index (and not ignored) */
    GIT_CHECKOUT_REMOVE_UNTRACKED = (1u << 3),

    /** For unmerged files, checkout stage 2 from index */
    GIT_CHECKOUT_USE_OURS = (1u << 4),
    /** For unmerged files, checkout stage 3 from index */
    GIT_CHECKOUT_USE_THEIRS = (1u << 5),

    /** Checkout submodules if submodule HEAD moved in parent tree */
    GIT_CHECKOUT_UPDATE_SUBMODULES_IF_CHANGED = (1u << 6),

    /** Checkout submodule with same options; i.e. HARD checkout will do
     * HARD update of submodule, too
     */
    GIT_CHECKOUT_UPDATE_SUBMODULES = (1u << 7),

    /** Normal checkout will preflight the entire operation to make sure
     * there will are no conflicts before making any actual changes. This
     * flag skips the preflight and just starts doing the work
     */
    GIT_CHECKOUT_NON_ATOMIC = (1u << 10),

If the working directory had been modified from the HEAD, then with SAFE mode, the index version would not be allowed to overwrite the working directory. In SAFE mode, it would still be okay to remove a file that was in the HEAD if no longer in the index and to create a file in the index that is not in the HEAD (nor working dir) without any special extra flags.

In HARD mode, the index could overwrite the working directory changes to the HEAD. I'm not sure if the separate OVERWRITE_CONFLICTS flag is needed, but the idea there is to prevent a completely new file in the index (that isn't present in the HEAD) from overwriting a file in the working directory. REMOVE_UNTRACKED is a further escalation to delete files that are not in the index (and not ignored).

SAFE and HARD are pretty much mutually exclusive; we would probably want an error if those two values were both specified.

The USE_OURS and USE_THEIRS are just anticipated extensions, along with the two options for recursively updating submodules, but I might not implement them initially.

Lastly, the NON_ATOMIC option stems more from the observation that the current implementation is not atomic (i.e. it will incrementally make updates and may abort with a partial implementation), where it would be nice to prevent any updates if there are going to be conflicts discovered mid-update.

Again, I only post this speculatively based on some thinking about what I'd like to see in the API. Please feel free to tear it apart. I really want to get this right.

@ben
Copy link
Member

ben commented Nov 1, 2012

Let me see if I'm synthesizing this right:

git libgit2
checkout GIT_CHECKOUT_SAFE
checkout -f GIT_CHECKOUT_HARD
checkout --ours <path> GIT_CHECKOUT_USE_OURS
checkout --theirs <path> GIT_CHECKOUT_USE_THEIRS
checkout -m ???
checkout -p ???

I like this mapping a lot better than what we have now. We were tending to think of it as a copy (which made sense for cloning), but I guess when you start looking at it like a merge, the approach becomes more clear.

I don't think there are any other flags to core git's checkout that make sense here (--conflict? -b|-B?), and -p|-m will probably be best provided by the merge API.

tl;dr: 👍 This is way better.

@nulltoken
Copy link
Member

I agree with @ben. Those flags make much more sense. Thanks for having
worked this out!

Two questions:

  • When operating in GIT_CHECKOUT_NONE mode, how would the function
    report the conflictual paths (and their explanation/reason) to the user?
    something à la git_checkout_opts.skipped_notify_cb?
  • How would the checkout behave in GIT_CHECKOUT_NON_ATOMIC mode? Fail at
    the first conflict? Checkout as many files as possible?

@arrbee
Copy link
Member Author

arrbee commented Nov 5, 2012

@nulltoken Thanks for your thoughts!

  1. That's a good idea, To be useful, NONE would need some sort of way to meaningfully report what would have happened, I guess. Maybe another callback function or just use the skipped_notify_cb... I'll look at it and propose something specific.
  2. That makes me think that DRYRUN should be explicit instead of just the zero value, because the actual actions could depend on other flags that get passed in. What do you folks think of making SOFT the zero value?
  3. As I have sat down to start writing this, I'm feeling like NON-ATOMIC doesn't make sense. The old code was largely non-atomic, so I added this flag in case there was a significant cost associated with making things atomic, but now I feel like (a) I'm going to keep the structure that builds a list of actions before applying them anyhow, so there we can cheaply always detect potential conflicts first, and (b) it is deceptive to call this "atomic" because if there is a second process actively modifying the underlying filesystem, then this is not truly going to be an atomic modification.

@nulltoken
Copy link
Member

That makes me think that DRYRUN should be explicit instead of just the zero value, because the actual actions could depend on other flags that get passed in. What do you folks think of making SOFT the zero value?

If by SOFT you mean SAFE, yes 😉

As I have sat down to start writing this, I'm feeling like NON-ATOMIC doesn't make sense.

I agree. I'd even say that provided the process can notify the users of conflicts and checks out as much as it's allowed to (taking into account the flags), it could even be seen as a safe multi-pass/incremental operation.

@arrbee
Copy link
Member Author

arrbee commented Nov 9, 2012

Okay, so that was more complicated that I thought it would be.

I just force pushed a big update here with new checkout strategy flags that resemble the proposal. When I actually sat down to implement it I made a few changes, but I think it is similar to what we discussed.

There are still two test failures in the code. One related to submodules I need to investigate.

The other is related to reset behavior. The test test_reset_hard__resetting_culls_empty_directories is written with the assumption that reset will remove untracked directories, but that is not the behavior of git reset --hard and it is also no longer the behavior of the checkout code with options that emulate git reset. There is a flag to the new checkout code that would remove the untracked directories, but I hesitate to actually use it inside the git_reset API because then our HARD version will be much more dangers than actual git reset. I think that the test should probably just be reworked to not expect untracked directories to be removed, but I wanted to discuss it first.

There are a number of other refactorings embedded in this PR now that laid the groundwork for the rewrite of checkout. I tried to break them out into separate commits. We might want to cherry pick some of them into a separate PR and get them merged (e.g. lazy eval of ignores by iterators, moving pathspec code into a separate file) just to keep this PR focused, or we can just leave them in here and focus on getting this resolved.

At this point, I'd love it if folks could start by reviewing the updated include/git2/checkout.h and giving feedback on the reworked API. I'd love to hear opinions on that while I fix the remaining broken test and do some valgrinding.

@arrbee
Copy link
Member Author

arrbee commented Nov 9, 2012

Okay, so I fixed the one regression that I hadn't had time to analyze.

The second issue is with the reset test. I don't think the old behavior is correct. Unfortunately, I rewrote the reset hard test to check the behavior that I believe it should have, and we still don't match the results of git reset --hard on the same directory. Reading through the reset code, I believe that the reset implementation has the same problem as the checkout implementation (which I have remedied in this PR) - namely, the reset code overwrites the index with the content of the commit to which we are reseting before deciding what to do with the content of the working directory. As such, it cannot properly discern between a file that can safely be removed (e.g. a file that is present in the working directory and the index, but is not present in the commit being reset to) vs a file that should not be removed (e.g. an actual untracked file in the working directory).

With the rewritten checkout, it is probably not that much work to rearrange the reset code to do things in the right order, but this PR has already turned into a lot of code. I'll wait for some discussion before making any further changes (although I may push my extended reset test, even though it is still failing).

@nulltoken
Copy link
Member

@arrbee You're right, this test is incorrect. It lacks a commit and wrongly asserts that subdir should be removed. It should rather reflect the following

$ git init cull && cd cull
Initialized empty Git repository in d:/Temp/cull/.git/

$ echo "content" > test.txt && git add test.txt

$ git commit -m "Initial commit"
[master (root-commit) 7f2be72] Initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 test.txt

$ mkdir -p newdir/with/nested

$ mkdir subdir

$ echo "all anew..." > newdir/with/nested/file.txt

$ ls
newdir  subdir  test.txt

$ git add newdir/with/nested/file.txt

$ git commit -m "Add subtree"
[master 1af4912] Add subtree
 1 file changed, 1 insertion(+)
 create mode 100644 newdir/with/nested/file.txt

$ git reset HEAD^ --hard
HEAD is now at 7f2be72 Initial commit

$ ls
subdir  test.txt

@arrbee
Copy link
Member Author

arrbee commented Nov 9, 2012

Okay, so I updated the reset test in a slightly different manner, but my plan is to open a new PR shortly to fix the reset behavior. In that PR I will implement some new reset tests including the one you propose above @nulltoken

I think there are potentially two things that might be considered missing in the PR.

  1. We could use some extra tests for the checkout code, since I mostly just tweaked the old tests and now there are a lot of new corner cases to exercise with the new flags.
  2. Several of the new checkout flags related to unmerged entries in the index and recursively checkout out submodules are not actually implemented. I don't plan to implement them as part of this PR, but I should probably either comment them out in the header or at least note explicitly that they are unimplemented.

Does that sounds reasonable?

@arrbee
Copy link
Member Author

arrbee commented Nov 9, 2012

Travis fixes on the way...

@arrbee
Copy link
Member Author

arrbee commented Nov 9, 2012

There is still a memory leak in here. I'm working on it...

* Rework GIT_DIRREMOVAL values to GIT_RMDIR flags, allowing
  combinations of flags
* Add GIT_RMDIR_EMPTY_PARENTS flag to remove parent dirs that
  are left empty after removal
* Add GIT_MKDIR_VERIFY_DIR to give an error if item is a file,
  not a dir (previously an EEXISTS error was ignored, even for
  files) and enable this flag for git_futils_mkpath2file call
* Improve accuracy of error messages from git_futils_mkdir
So, @nulltoken created a failing test case for checkout that
proved to be particularly daunting.  If checkout is given only
a very limited strategy mask (e.g. just GIT_CHECKOUT_CREATE_MISSING)
then it is possible for typechange/rename modifications to leave it
unable to complete the request.  That's okay, but the existing code
did not have enough information not to generate an error (at least
for tree/blob conflicts).

This led me to a significant reorganization of the code to handle
the failing case, but it has three benefits:

1. The test case is handled correctly (I think)
2. The new code should actually be much faster than the old code
   since I decided to make checkout aware of diff list internals.
3. The progress value accuracy is hugely increased since I added
   a fourth pass which calculates exactly what work needs to be
   done before doing anything.
This makes it so that the check if a file is ignored will be
deferred until requested on the workdir iterator, instead of
aggressively evaluating the ignore rules for each entry.  This
should improve performance because there will be no need to
check ignore rules for files that are already in the index.
Diff uses a `git_strarray` of path specs to represent a subset
of all files to be processed.  It is useful to be able to reuse
this filtering in other places outside diff, so I've moved it
into a standalone set of utilities.
There are some diff functions that are useful in a rewritten
checkout and this lays some groundwork for that.  This contains
three main things:

1. Share the function diff uses to calculate the OID for a file
   in the working directory (now named `git_diff__oid_for_file`
2. Add a `git_diff__paired_foreach` function to iterator over
   two diff lists concurrently.  Convert status to use it.
3. Move all the string/prefix/index entry comparisons into
   function pointers inside the `git_diff_list` object so they
   can be switched between case sensitive and insensitive
   versions.  This makes them easier to reuse in various
   functions without replicating logic.  As part of this, move
   a couple of index functions out of diff.c and into index.c.
This is a major reworking of checkout strategy options.  The
checkout code is now sensitive to the contents of the HEAD tree
and the new options allow you to update the working tree so that
it will match the index content only when it previously matched
the contents of the HEAD.  This allows you to, for example, to
distinguish between removing files that are in the HEAD but not
in the index, vs just removing all untracked files.

Because of various corner cases that arise, etc., this required
some additional capabilities in rmdir and other utility functions.

This includes the beginnings of an implementation of code to read
a partial tree into the index based on a pathspec, but that is
not enabled because of the possibility of creating conflicting
index entries.
The `git_reset` API with the HARD option is still slightly
broken, but this test now does exercise the ability of the
command to revert modified files.
This fixes a number of warnings and problems with cross-platform
builds.  Among other things, it's not safe to name a member of a
structure "strcmp" because that may be #defined.
@arrbee
Copy link
Member Author

arrbee commented Nov 9, 2012

Okay, I think I've fixed everything. Since I was valgrinding, I rebased onto the latest development (even though there were no merge conflicts in this PR) so I would be testing against the latest... Let's see what Travis has to say.

This fixes some various warnings that showed up in Travis and
a couple uses of uninitialized memory and one memory leak.
@arrbee
Copy link
Member Author

arrbee commented Nov 9, 2012

By the way, this does not currently fix #1047. I started to lay the groundwork for doing so in git_index_read_tree_match() which allows you to load just the parts of a tree that match a given pathspec into the index, but there is a critical bug in the implementation (namely, if you load a blob "a/b" where the used to be an "a/b" implied tree (i.e. an "a/b/c" blob), then the code wasn't removing the conflicting old entry (as opposed to git_index_read_tree which has the luxury of clearing the index before loading the tree). I'd like to move ahead with this PR and fix #1047 separately, I think.

vmg pushed a commit that referenced this pull request Nov 13, 2012
Update checkout with new strategies & behavior
@vmg vmg merged commit aa1c3b5 into libgit2:development Nov 13, 2012
@ben
Copy link
Member

ben commented Nov 13, 2012

✨ 🏇 ✨

phatblat pushed a commit to phatblat/libgit2 that referenced this pull request Sep 13, 2014
Update checkout with new strategies & behavior
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants