Add a "dirty" state to the index when it has unsaved changes #4536

ethomson · 2018-02-19T00:34:44Z

Add a new "dirty" state to the index when it has unsaved changes.

At present, you can load the index and make changes to it without saving those changes. However various functions in the library behave different when the repository's index has unsaved changes. For example, checkout will reload the index (by default), but merge will not.

This asymmetry is obviously a problem.

It was noted during an attempt to standardize this behavior that some people may be relying on the prior behavior that merge does not reload the index.

The discussion from this was that we should add a warning in the unlikely event that someone does rely on this behavior. This PR adds a "dirty" state to the PR - now if a consumer tries to act on the index in a way that changes would be discarded, they are warned. As a result, consumers must explicitly either commit their changes (using git_index_write) or discard their changes (through any of a variety of mechanisms, including git_index_read) before executing a function that will otherwise mutate the index (eg git_checkout, git_merge, etc).

The first few commits in this PR update tests that are incorrect regardless of the changes to add these restrictions and include a "dirty" state to the index. The remainder add this change and update several tests to explicitly commit the index before further mutating it.

This should (finally) pave the way for #4407.

ethomson · 2018-04-18T11:13:26Z

Rebased.

ethomson · 2018-05-09T14:08:19Z

Any thoughts on this before I 🚢 or is it self-evident?

tiennou

Just a few high-level comments, as I don't feel competent enough with The Index to really know what is going on (apart from the must-not-lose-staged-changes part of it 😉).

tiennou · 2018-05-09T16:51:02Z

src/index.c

@@ -668,6 +674,9 @@ int git_index_read(git_index *index, int force)
 	if (!error)
 		git_futils_filestamp_set(&index->stamp, &stamp);

+	if (!error)


Maybe merge this with the if (!error) above ?

tiennou · 2018-05-09T16:53:12Z

src/index.c

@@ -527,6 +529,7 @@ int git_index_clear(git_index *index)

 	assert(index);

+	index->dirty = 1;


Unsure, but this seems the inverse of the commit's message rationale ?

Well. Clearing the index here without saving naturally has to leave the index dirty. I agree though that the commit message is confusing

Agreed; I think that originally I thought that clearing the index would make it not dirty and updated that later, but neglected the commit message. I've fixed it.

pks-t

Sorry for my late feedback 🙄

pks-t · 2018-05-18T10:09:12Z

tests/index/names.c

@@ -25,6 +25,42 @@ void test_index_names__cleanup(void)
 	cl_git_sandbox_cleanup();
 }

+static void index_add_main(void)


In the commit message you talk of "This test", but this commit doesn't do anything but add a new function which is never being called. The actual caller is only being added in the last commit of this series, which is a confusing when reading commits one by one

By the way, one more thing: I don't really get what "main" is referring to in this context.

Thanks; this got jumbled while rebasing. I fixed it, and I also changed the name to add_conflicts. Regarding "main", I was referring to the "main index" (as opposed to the other sections like the tree cache) but I updated it to avoid confusion.

pks-t · 2018-05-18T10:09:45Z

tests/index/names.c

+	const char **conflict;
+	size_t i;
+
+	for (i = 0; i < 3; i++) {


i < ARRAY_SIZE(paths)?

pks-t · 2018-05-18T10:19:01Z

src/checkout.c

@@ -2397,20 +2397,20 @@ static int checkout_data_init(
 				GIT_DIR_MODE, GIT_MKDIR_VERIFY_DIR)) < 0)
 		goto cleanup;

+	if ((error = git_repository_index(&data->index, data->repo)) < 0)
+		goto cleanup;


So, previously we were not loading the index into data->index when GIT_CHECKOUT_NO_REFRESH was set. This leaves me wondering about line 2441, where we unconditionally dereference data->index. Was this code broken previously and would segfault in case GIT_CHECKOUT_NO_REFRESH was set or is the index already loaded at this point in time, resulting in a memory leak due to an additional reference to the index?

Yes, it would have dereferenced that NULL index. the NO_REFRESH bit is really not a great pattern to use, so I'm not surprised that nobody's ever noticed this. However, I added a test to make sure that we don't regress again.

pks-t · 2018-05-18T10:21:03Z

src/stash.c

@@ -159,7 +164,37 @@ struct stash_update_rules {
 	bool include_ignored;
 };

+/* similar to git_index_add_bypath but able to operate on any


Multi-line comments should start with an empty /* line and upper-case

pks-t · 2018-05-18T10:23:59Z

src/stash.c

+	int error;
+
+	if (!git_repository_is_bare(repo) &&
+		(error = git_repository_index__weakptr(&repo_index, repo)) < 0)


I'd like to finally settle on a common style when it comes to multi-line conditionals. We currently have two different styles in our code base. The first one aligned with a tab

if (!git_repository_is_bare(repo) && (error = git_repository_index__weakptr(&repo_index, repo)) < 0) return error;

and the same thing aligned with four spaces instead

if (!git_repository_is_bare(repo) && (error = git_repository_index__weakptr(&repo_index, repo)) < 0) return error;

I'd say the second one is superior, as you can clearly distinguish conditionals from the body. Especially when there are multiple more than two lines in a conditional, this becomes rather valuable and makes code a lot easier to read, in my opinion.

I'd like to finally settle on a common style when it comes to multi-line conditionals. We currently have two different styles in our code base

I... guess? I agree that we shouldn't have two different styles, and yes, some things have slipped in here because it really doesn't matter that much (more on that below). But it feels like this was a solved decision. We took git's style, which, from a cursory grep, is typically indenting these with tabs. And we typically indent with tabs.

It seems like anything that's slipped in has done just that - slipped in.

I'd say the second one is superior, as you can clearly distinguish conditionals from the body. Especially when there are multiple more than two lines in a conditional, this becomes rather valuable and makes code a lot easier to read, in my opinion.

I agree, but that's why I have my tabs set to four spaces.

But for the broader point: it's incredibly frustrating for contributors when they have to wait several weeks for a code review and then have that code review consist primarily of complaints about whitespace. (This one was open three months ago! Three months is a long time, and yes, I realize that the complaints are not only about whitespace.)

Does style matter? Sure. But they don't matter nearly enough to stop the velocity of the project, or to turn people off from contributing.

Can you imagine waiting weeks to get a code review then it's picking nits at whitespace changes? This is not the sign of a healthy community, and that person will probably never contribute again.

We need to fix this when it comes to reviewing pull requests. This is not a complaint at you, individually; I do mean we, the project.

As for the whitespace issues: if you really care about this, then add an automated step to the CI builds and give people something easy that they can run locally. Otherwise, we need to relax, IMHO.

I'd like to finally settle on a common style when it comes to multi-line conditionals. We currently have two different styles in our code base
I... guess? I agree that we shouldn't have two different styles, and yes, some things have slipped in here because it really doesn't matter that much (more on that below). But it feels like this was a solved decision. We took git's style, which, from a cursory grep, is typically indenting these with tabs. And we typically indent with tabs.

It seems like anything that's slipped in has done just that - slipped in.

I might not have brought my intention along crlearly enough, but this was not a request for change but instead a clarification for what our agreed-upon style is. I had asked that specific question about how to indent multiline conditionals in the past and never got an answer (or at least I couldn't recall). So while I feel like the latter option is superior, I don't care which of both styles we use as long as it is consistent.

I agree, but that's why I have my tabs set to four spaces.

Weird. I always thought the projects recommendation was eight spaces, and that's why I always had a specific setting for libgit2 to display as eight spaces. But in fact, we do recommend to set it to four spaces.

But for the broader point: it's incredibly frustrating for contributors when they have to wait several weeks for a code review and then have that code review consist primarily of complaints about whitespace. (This one was open three months ago! Three months is a long time, and yes, I realize that the complaints are not only about whitespace.)

I know. And that's why I said in the past that I only add style-related critic in case there are other complaints about a PR.

Does style matter? Sure. But they don't matter nearly enough to stop the velocity of the project, or to turn people off from contributing.

Can you imagine waiting weeks to get a code review then it's picking nits at whitespace changes? This is not the sign of a healthy community, and that person will probably never contribute again.
We need to fix this when it comes to reviewing pull requests. This is not a complaint at you, individually; I do mean we, the project.

As for the whitespace issues: if you really care about this, then add an automated step to the CI builds and give people something easy that they can run locally. Otherwise, we need to relax, IMHO.

I'm definitely in favor of having an automated job to do this. While I agree that style shouldn't be the main concern, I think it in fact is important for a project. I've seen enough projects where I wanted to contribute to, but after taking a quick peek into the implementation immediately bailed out as the code was so inconsistently formatted that I had problems to see what was going on. So it in fact is a double-edged sword: you need to care enough about style to not let the code base deteriorate and get hard to read, but you shouldn't be as strict to drive away contributors.

As you mentioned, I think the best way forward would be to have an automated job that does this immediately after a PR was submitted. If properly implemented, this would help both roles: the contributor can quickly fix up stuff while he is still familiar with the topic, and the maintainer can focus on the important parts.

pks-t · 2018-05-18T10:44:13Z

src/index.c

@@ -3185,6 +3212,7 @@ static int git_index_read_iterator(

 	clear_uptodate(index);

+	index->dirty = 0;


Is this correct? I cannot see us saving the index to disk here, so I'd consider it dirty

Yes, I agree. I've changed it.

pks-t · 2018-05-18T10:53:54Z

src/index.c

@@ -3605,6 +3616,7 @@ int git_indexwriter_commit(git_indexwriter *writer)
 		return -1;
 	}

+	writer->index->dirty = 0;


This looks like it belongs to the previous commit, where you introduce the dirty bit?

pks-t · 2018-05-18T10:57:04Z

tests/checkout/tree.c

 	assert_status_entrycount(g_repo, 1);
-	cl_git_pass(git_checkout_tree(g_repo, obj, &g_opts));
+
+	cl_git_pass(git_reset(g_repo, obj, GIT_RESET_HARD, NULL));


Doesn't this change what we're actually testing? Previously, we've checked that git_checkout_tree resets the mode. Now we check whether git_reset resets the mode.

pks-t · 2018-05-18T11:00:42Z

tests/checkout/typechange.c

@@ -289,6 +290,10 @@ void test_checkout_typechange__checkout_with_conflicts(void)
 		cl_assert_equal_i(cts.updates, 0);
 		cl_assert_equal_i(cts.ignored, 0);

+		cl_git_pass(git_repository_index(&index, g_repo));
+		cl_git_pass(git_index_read(index, 1));


These interspersed changes of git_index_read leave me a bit uneasy. Does it mean that users of libgit2 may now face places where they have to add git_index_read or face errors otherwise? If so, we at least have to provide an update guide detailing in which exact situations this is now necessary, and even then people will complain.

This is a good question. This affects what happens when you are replacing the contents of the index (eg, checkout, merge) and you have unsaved changes, then you will need to save them (git_index_write) or revert them (git_index_read). If you do not, this new code path will raise an error.

I don't think that this is a common scenario (except in our tests, since they play fast and loose with things). But there is a very valid case that we would be breaking: checking out with the FORCE flag or a reset with the HARD flag when you have unsaved changes in the index. At the moment, you can do:

git_index_add(...); /* no save */ git_reset(..., GIT_RESET_HARD);

And your unsaved changes are ignored - the index is updated to match the checkout / reset target. With this patch, this will be an error.

Increasingly, I think this is incorrect and that we should allow git_checkout / git_reset to operate when forcing. I think that this would fix most of the intentional use cases of this sort of behavior for people who were in a bad place but reset their way out of it, knowingly.

I'd also be happy to add a configurable option here for people who need an escape hatch to opt-in to the prior behavior. It's also possible that this is too heavy handed and we should keep the default as-is.

I'll follow up a commit that updates the reset / checkout behavior to allow forcing in this instance.

pks-t · 2018-05-18T11:02:17Z

tests/cherrypick/workdir.c

@@ -276,6 +276,7 @@ void test_cherrypick_workdir__conflict_use_ours(void)
 	opts.merge_opts.file_favor = GIT_MERGE_FILE_FAVOR_OURS;

 	cl_git_pass(git_reset(repo, (git_object *)head, GIT_RESET_HARD, NULL));
+


Unrelated whitespace change

ethomson · 2018-06-18T17:31:18Z

I added a change so that when checking out with FORCE, we do not halt on a dirty index. This seems like a very correct change that is very much in keeping with the way arrbee had originally designed and specified GIT_CHECKOUT_FORCE, and how I would have expected things to work. That also simplifies the tests so that they now don't need to do the git_index_read dance to discard their changes.

ethomson · 2018-06-18T17:34:56Z

So this leaves one realistic use case that we're blocking here now, which is GIT_CHECKOUT_SAFE with a dirty index. Previously we would have silently allowed this (obliterating the unsaved changes in the index), now we error on this. I doubt anybody was doing this before, but I appreciate your concern.

Proposal: add a git_libgit2_opt that controls this behavior, default to the current behavior (silently discarding dirty indexes before safe checkout and friends), document that we'll upgrade this to a warning or error in the future?

pks-t · 2018-06-22T07:29:47Z

Proposal: add a git_libgit2_opt that controls this behavior, default to the current behavior (silently discarding dirty indexes before safe checkout and friends), document that we'll upgrade this to a warning or error in the future?

That sounds like a perfectly reasonable plan. I'm sold

pks-t

With the new option I'm feeling a lot safer. I like how we start to be a lot more reasonable when it comes to interface and behaviour changes. Thanks!

pks-t · 2018-06-22T10:29:42Z

include/git2/errors.h

@@ -55,6 +55,7 @@ typedef enum {
 	GIT_ITEROVER        = -31,	/**< Signals end of iteration with iterator */
 	GIT_RETRY           = -32,	/**< Internal only */
 	GIT_EMISMATCH       = -33,	/**< Hashsum mismatch in object */
+	GIT_EINDEXDIRTY     = -34,  /**< Unsaved changes in the index would be overwritten */


This time I'm fast to review, so I'm allowed to nag at the whitespace error here :P

pks-t · 2018-06-22T10:31:32Z

include/git2/common.h

+ *		> Ensure that there are no unsaved changes in the index before
+ *		> beginning any operation that reloads the index from disk (eg,
+ *		> checkout).  If there are unsaved changes, the instruction will
+ *		> fail.


Should we maybe mention that this does not apply when forcing operations?

pks-t · 2018-06-22T10:35:47Z

docs/changelog.md

+  index but you have not written the index to disk.
+
+  At present, this defaults to off, but we intend to enable this more
+  broadly in the future, as a warning or error.


Should we urge consumers to review how they handle dirty indices and then enable the option?

We add entries into the main index to correspond with the NAME entries that we're going to test. NAME entries store the results of conflicts occuring with rename detection during merge, and they must correspond to conflicts in the index. This test was mistakenly adding regular entries. The checkout validation failed, since it requires NAME entries to correspond to high-stage (conflict) entries. Correct the test to actually create conflicts.

The index::reuc tests must test that the checkout itself succeeds, otherwise subsequent tests are not valid. In fact, the checkouts were failing because when checking out `SAFE`, they cannot update the files that are in conflict. Change the checkout level to `FORCE` to ensure that they get updated correctly.

When running `git_index_add_all`, we should write the index to disk so that we can re-read it safely during status.

Always set the `index` in the `checkout_data`, even in the case that we are not reloading the index. Other functionality in checkout examines the index (for example: determining whether the workdir is modified) and we need it even in the (uncommon) case that we are not reloading.

ethomson · 2018-06-26T13:02:07Z

Cool. I've updated this with your suggestions, @pks-t. Thanks!

pks-t

I guess I don't have much to add to this PR. I didn't see anything that catched my eye in particular. I've been briefly asking myself whether git_index_read_safely should be exposed to users or whether it should be the default of git_index_read with force set to false. But to be honest I'm not particularly well versed in this area to have a real grasp of that.

pks-t · 2018-06-29T12:04:21Z

tests/checkout/tree.c

+	cl_git_pass(git_repository_open(&other_repo, git_repository_workdir(g_repo)));
+	cl_git_pass(git_repository_index(&other_index, other_repo));
+
+    cl_git_pass(git_oid_fromstr(&entry.id, "1385f264afb75a56a5bec74243be9b367ba4ca08"));


Just in case you're re-rolling: whitespace issue

Comments were adressed

Add tests that ensure that we re-read the on-disk image by default during checkout, but when the `GIT_CHECKOUT_NO_REFRESH` option is specified, we do _not_ re-read the index.

Don't manipulate the repository's index during stash; instead, manipulate a temporary index and check it out. This allows us to use the checkout mechanism to update the workdir and the repository's index, and allows checkout to use its common mechanisms to write data and handle errors.

Teach the index when it is "dirty", and has unsaved changes. Consider the index dirty whenever a caller has added or removed an entry from the main index, REUC or NAME section, including when the index is completely cleared. Similarly, consider the index _not_ dirty immediately after it is written, or when it is read from the on-disk index. This allows us to ensure that unsaved changes are not lost when we automatically refresh the index.

Test that any changes to the index will mark the index as dirty. Also ensure that when we initialize a new index, read the index contents from disk, or write the index contents to disk that we reset the dirty flag to zero. Further ensure that an unforced read with dirty contents (when the on-disk index has not changed) does _not_ reset the dirty flag as we have not updated the contents of our index and our unsaved contents remain intact.

Now that the index has a "dirty" state, where it has changes that have not yet been committed or rolled back, our tests need to be adapted to actually commit or rollback the changes instead of assuming that the index can be operated on in its indeterminate state.

If the index is dirty, allow `GIT_CHECKOUT_FORCE` to obliterate unsaved changes. This is in keeping with its name and description.

When the index is dirty, return GIT_EINDEXDIRTY so that consumers can identify the exact problem programatically.

Add the `GIT_OPT_ENABLE_UNSAVED_INDEX_SAFETY` option, which will cause commands that reload the on-disk index to fail if the current `git_index` has changed that have not been saved. This will prevent users from - for example - adding a file to the index then calling a function like `git_checkout` and having that file be silently removed from the index since it was re-read from disk. Now calls that would re-read the index will fail if the index is "dirty", meaning changes have been made to it but have not been written. Users can either `git_index_read` to discard those changes explicitly, or `git_index_write` to write them.

ethomson · 2018-06-29T13:56:14Z

I've been briefly asking myself whether git_index_read_safely should be exposed to users or whether it should be the default of git_index_read with force set to false.

I don't think so. But I'll continue to give this some thought and do some testing with this functionality enabled before we make change changes.

ethomson mentioned this pull request Feb 19, 2018

checkout: when examining index (instead of workdir), also examine mode #4535

Merged

ethomson force-pushed the ethomson/index_dirty branch from a07ddf9 to 1767023 Compare February 19, 2018 22:24

ethomson added the bloomberg label Feb 20, 2018

ethomson force-pushed the ethomson/index_dirty branch from 1767023 to 934486c Compare April 18, 2018 11:11

tiennou reviewed May 9, 2018

View reviewed changes

pks-t previously requested changes May 18, 2018

View reviewed changes

ethomson force-pushed the ethomson/index_dirty branch from 934486c to 5c8a730 Compare June 18, 2018 15:24

ethomson force-pushed the ethomson/index_dirty branch from 750c659 to 9ef37a4 Compare June 22, 2018 10:23

pks-t reviewed Jun 22, 2018

View reviewed changes

ethomson force-pushed the ethomson/index_dirty branch from 9ef37a4 to d619cce Compare June 22, 2018 11:00

ethomson added 4 commits June 26, 2018 09:52

index::addall tests: write the index

7330ae6

When running `git_index_add_all`, we should write the index to disk so that we can re-read it safely during status.

ethomson force-pushed the ethomson/index_dirty branch from d619cce to a670454 Compare June 26, 2018 11:43

pks-t reviewed Jun 29, 2018

View reviewed changes

ethomson added 7 commits June 29, 2018 14:54

checkout tests: validate GIT_CHECKOUT_NO_REFRESH

55a9660

Add tests that ensure that we re-read the on-disk image by default during checkout, but when the `GIT_CHECKOUT_NO_REFRESH` option is specified, we do _not_ re-read the index.

checkout: FORCE doesn't halt on dirty index

5e26391

If the index is dirty, allow `GIT_CHECKOUT_FORCE` to obliterate unsaved changes. This is in keeping with its name and description.

index: return a unique error code on dirty index

787768c

When the index is dirty, return GIT_EINDEXDIRTY so that consumers can identify the exact problem programatically.

ethomson added 2 commits June 29, 2018 14:54

docs: add unsaved index safety to changelog

243d40d

ethomson force-pushed the ethomson/index_dirty branch from a670454 to 243d40d Compare June 29, 2018 13:54

ethomson merged commit c43658f into master Jun 30, 2018

ethomson mentioned this pull request Jun 30, 2018

merge: reload index before git_merge #4407

Merged

ethomson deleted the ethomson/index_dirty branch October 26, 2018 13:37

snyk-bot mentioned this pull request Feb 23, 2020

[Snyk] Upgrade nodegit from 0.4.1 to 0.26.4 saurabharch/Breezeblocks#1

Open

snyk-bot mentioned this pull request Apr 22, 2020

[Snyk] Upgrade nodegit from 0.24.3 to 0.26.5 aminatakonate000/Graviton-App#4

Open

snyk-bot mentioned this pull request May 5, 2020

[Snyk] Upgrade nodegit from 0.24.3 to 0.26.5 Barnstorm-Online/ngp-openapi-generator#1

Open

		@@ -527,6 +529,7 @@ int git_index_clear(git_index *index)

		assert(index);

		index->dirty = 1;

		@@ -3185,6 +3212,7 @@ static int git_index_read_iterator(

		clear_uptodate(index);

		index->dirty = 0;

		@@ -276,6 +276,7 @@ void test_cherrypick_workdir__conflict_use_ours(void)
		opts.merge_opts.file_favor = GIT_MERGE_FILE_FAVOR_OURS;

		cl_git_pass(git_reset(repo, (git_object *)head, GIT_RESET_HARD, NULL));

Add a "dirty" state to the index when it has unsaved changes #4536

Add a "dirty" state to the index when it has unsaved changes #4536

Conversation

ethomson commented Feb 19, 2018

ethomson commented Apr 18, 2018

ethomson commented May 9, 2018

tiennou left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pks-t left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pks-t May 18, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ethomson commented Jun 18, 2018

ethomson commented Jun 18, 2018

pks-t commented Jun 22, 2018

pks-t left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ethomson commented Jun 26, 2018

pks-t left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ethomson commented Jun 29, 2018

pks-t May 18, 2018 •

edited

Loading