Configuration snapshotting #2188

carlosmn · 2014-03-15T03:04:47Z

Thinking about immutable data structures is a lot easier than dealing with locks that only affect certain parts of the system, so let's start tackling the config autorefresh situation this way. Introduce git_repository_snapshot() to create a read-only version of the configuration for things like remote and submodule loading which don't want to read half of a version of the file.

If we like this direction, I'll see what other things need converting. This version of the code creates a copy of everything, which is not terribly efficient. A later version will do refcounting or whatever.

jspahrsummers · 2014-03-17T16:55:54Z

💯

vmg · 2014-03-18T13:45:25Z

cc @mhagger who will probably try to do something similar in core Git someday

mhagger · 2014-03-18T17:08:45Z

Thanks for the ping.

See "git configuration API improvements" in the GSoC ideas page and the mailing list threads referenced therein for our ideas about changing config handling in core Git. At the moment we're really only talking about a caching layer, mostly to simplify access to the config and make it possible for later settings to unset earlier settings (see below). Presumably every time the config cache is read, it would check the timestamps on the files from which it was read and invalidate itself if they have been changed.

I didn't read this PR's patch in detail, but I don't understand what problem is solved by making a disk copy of the config. Isn't the copy process vulnerable to exactly the same race/inconsistency issues when it reads the "official" version of the config as the individual config readers were vulnerable to before? Plus you would have the new problem that the on-disk copy gets out of date with the original. And if you want a snapshot, how is a disk snapshot better than an in-memory cache? (Is it because the on-disk snapshot is meant to be usable by subprocesses? If so, how will subprocesses, who no longer have any connection to the original config, going to deal with simultaneous changes?) I'm not saying it's a bad idea, I'm just saying that I don't yet understand why it is a good idea.

If the GSoC idea gets implemented, then there might soon be a config file syntax for unsetting values. Please keep this in mind so that you can support it too (or if you don't like the idea, register your objections now, before it is too late!) In core Git, this seemingly little change requires the method of reading config files to change significantly. Currently, the config files are (re-)read whenever a value is needed, and while they are read all of the config settings are streamed past a callback that picks out the value(s) it wants. Since we don't want to have to change all of our callbacks immediately, we will have to read the whole config into memory (taking into account "unset" directives) and then push the surviving settings through the callback. I don't know whether libgit2 will require a similar architectural change.

carlosmn · 2014-03-18T18:03:15Z

I think we're both trying to move into roughly the same spot, but we're starting from opposite ends.

As you said, when something inside a git commands asks for the configuration, the config subsystem will parse whatever is on the filesystem at the time and pass every config option to the caller. This seemed like a rather silly way of doing things, so in libgit2 we went with keeping the configuration in memory and letting the user ask for specific keys.

git's version reloads on every call, we only reload when asked explicitly, because a configuration file might change between two calls, and if we were to reload every single time, I might get e.g. two partial views into a remote's configuration. git's pushing version does not have this issue, as it will send values for only a specific version of the config, which is what it read.

As git runs for a bit and then dies, it's clear where the a change in the configuration would or wouldn't affect what it does, so even if you only ever read one version of the file, the process is going to be done pretty quickly, and the next invocation will take it into account. But when should libgit2 (or a program using it) consider this boundary to be? In the middle of reading a remote's or branch's configuration is obviously a bad thing, but how do we know?

IIRC GHfW will issue a config refresh whenever they get the window focus, which seems like a nice way of solving it, but it still requires the application to choose these points even when these races don't affect it because it only ever cares about single config values. If an application itself only ever cares about log.date, it should be possible for it to read the most up to date version simply by asking for it. And libgit2 can't refresh willy-nilly because a different thread might be in the middle of loading some complex configuration.

So we want to behave more like git: give an atomic view into the state of the config files at a point in time, but still have a low-friction way of getting the newest version of the config. The callback-based push version is out of the question for us, and not just for stylistic reasons but because for integration with the different languages, the caller needs to be the one in charge of when things happen. Locking is its own can of worms and can make us do the opposite, where one thread stops everyone else from getting an update, and that's without bugs.

So I figure immutability is pretty neat (we've gone this direction with references as well). If someone needs to take a look at a particular version of the config because they want to read multiple keys (remotes, I think submodules, branch config...), they ask for an immutable version and read off of that (which is a lot like a pull version of what git does), while people who don't care can simply ask for a config variable from the real config and have it refresh if necessary (also much like what you get with git if you're only interested in the one config value). This is what this PR tries to do.

As we already keep config in an intermediary position, adding support for unsetting values would not be an issue for libgit2; and if git gets this intermediary layer and changes config to be pull-based, we're going to look very much alike, with the exception that git lives fast and dies young, while libgit2 needs to think about retirement.

carlosmn · 2014-03-18T18:14:47Z

So it looks like I didn't explicitly talk about why we don't see races with this.

The config files are written via the config.lock -> config mechanism, which means once we have our file descriptor, the data in the file is essentially immutable (barring misbehaved users, but there's not much we can do there). For snapshots, we care about how the config existed at some point (preferably just before we started asking for the configuration).

Concurrent changes will be picked up by concurrent users, and that's ok (much like a concurrent git would do too) because they're doing something else.

If you're running git remote add origin ... in one terminal and git remote -v in another at the same time, git remote -v will either see the remote or it won't, but it won't only see remote.origin.fetch and miss out on remote.origin.url. This atomicity is what we care about. Running git remote -v again will show the remote, and we want that behaviour as well.

Creating explicit snapshots for subsystems which care is how we simulate git's "read and die quickly".

mhagger · 2014-03-18T20:25:45Z

@carlosmn, thanks for the great explanation.

Yogu · 2014-03-19T15:40:47Z

Sorry for being a little off-topic, but

The config files are written via the config.lock -> config mechanism, which means once we have our file descriptor, the data in the file is essentially immutable

does this also work on Windows? I'd think that you either have to open config allowing deletion (shared-delete) and thus risking it being deleted in the middle of reading it, or not allowing deletion and thus preventing the writing operation from renaming config.lock to config.

carlosmn · 2014-03-19T21:01:06Z

IIRC Windows' VFS layer blocks writing to it, so we shouldn't get any modifications whilst we're writing either.

ethomson · 2014-03-20T16:09:05Z

It's true, @Yogu, a reader could have the file locked on win32 and the config update could fail.

carlosmn · 2014-03-31T08:22:38Z

This should be just about ready. We can create snapshots and getting or setting a value refreshes. Iterators make use of a snapshot. This version still creates a copy, but we might want to tackle that in a different PR, as this one is already fairly ambitious.

While we support multi-threading, this still only supports serialised access, as the values could vanish in the middle of parsing if we refresh while e.g. git_config_get_bool() is parsing.

carlosmn · 2014-04-09T11:34:21Z

Now with reference counting so we don't copy data, yay!

We're still restricted to serialised use, thought that's always really been the case, so we might want to see about merging this and handling that in a different PR, as it would involve a change in API (which this one, oddly enough, doesn't)?

On set, we set/add the value written to the config's internal values, but we do not refresh old values. Document this in a test in preparation for the refresh changes.

In order to have consistent views of the config files for remotes, submodules et al. and a configuration that represents what is currently stored on-disk, we need a way to provide a view of the configuration that does not change. The goal here is to provide the snapshotting part by creating a read-only copy of the state of the configuration at a particular point in time, which does not change when a repository's main config changes.

This way we can assume we have a consistent view of the config situation when we're looking up remote, branch, pack-objects, etc.

Current code sets the active map to a new one and builds it whilst it's active. This is a race condition with someone else trying to access the same config. Instead, let's build up our new map and swap the active and new one.

This will be used by the writing commands in a later step.

When writing out, parse the resulting file instead of adding or replacing the value locally. This has the effect of reading external changes as well.

With the isolation of complex reads, we can now try to refresh the on-disk file before reading a value from it. This changes the semantics a bit, as before we could be sure that a string we got from the configuration was valid until we wrote or refreshed. This is no longer the case, as a read can also invalidate the pointer.

When we delete an entry, we also want to refresh the configuration to catch any changes that happened externally. This allows us to simplify the logic, as we no longer need to delete these variables internally. The whole state will be refreshed and the deleted entries won't be there.

This is mostly groundwork to let us re-use the map in the snapshots.

Now that our strmap is no longer modified but replaced, we can use the same strmap for the snapshot's values and it will be freed when we don't need it anymore.

carlosmn · 2014-04-18T14:15:00Z

Rebased on top of current dev to deal with the stramp alloc change.

arrbee · 2014-04-21T23:50:56Z

Having read through much of this, I think it's ready to merge. I'd like to go through a couple parts one more time before I push the merge button myself, but if other folks have read it too, then I'll definitely add my 👍 for merging.

I do wonder if providing a config API that was the equivalent of git_attr_get_many (i.e. take an array of keys and return an array of consistently read values) would eliminate many of the cases where snapshots are needed. Obviously that has its own set of problems with object lifetimes and memory allocation, but it might be a nice addition...

carlosmn · 2014-04-22T00:13:52Z

I did think about a function for batch-lookup, but string arrays are really annoying to deal with and this approach is somewhat more generic and provides the atomicity guarantees without changing the code where we use it.

Returning a string array of values feels like it would be more awkward for the user, as they have to remember the order in which they asked. And then there's keeping those strings alive, though I guess git_strarrary would take care of that at the cost of copying.

Maybe at some point in the future we can offer this. We also need to figure out what to do for keeping strings alive in a concurrent environment, but that's for a different PR.

arrbee · 2014-05-07T04:36:33Z

It seems like the git_repository_config__weakptr + git_config_snapshot pair is such a common sequence now, maybe we should just a git_repository_config__snapshot helper that does them both. What do you think?

carlosmn · 2014-05-07T08:01:09Z

I did notice that, but never got around to implementing it. I was thinking more in the direction of git_repository_config_snapshot() as a public method, as this pattern isn't necessarily limited to the library internals.

Accessing the repository's config and immediately taking a snapshot of it is a common operation, so let's provide a convenience function for it.

Configuration snapshotting

Yogu mentioned this pull request Mar 26, 2014

Retry renaming config.lock to config when access denied #2207

Closed

carlosmn changed the title ~~[RFC] Configuration snapshotting~~ Configuration snapshotting Mar 31, 2014

carlosmn added 12 commits April 18, 2014 15:55

config: document current write behaviour in a test

36913b8

On set, we set/add the value written to the config's internal values, but we do not refresh old values. Document this in a test in preparation for the refresh changes.

Use config snapshotting

29c4cb0

This way we can assume we have a consistent view of the config situation when we're looking up remote, branch, pack-objects, etc.

config: make refresh atomic

c047317

Current code sets the active map to a new one and builds it whilst it's active. This is a race condition with someone else trying to access the same config. Instead, let's build up our new map and swap the active and new one.

config: split out the refresh step

bd95f83

This will be used by the writing commands in a later step.

config: use a snapshot for the iterator

0500a1e

config: refresh the values on write

eaf3703

When writing out, parse the resulting file instead of adding or replacing the value locally. This has the effect of reading external changes as well.

config: document the how long the pointers are valid for

c20d71e

config: refcount the values map

4b99b8f

This is mostly groundwork to let us re-use the map in the snapshots.

config: share the strmap on snapshot

2280b38

Now that our strmap is no longer modified but replaced, we can use the same strmap for the snapshot's values and it will be freed when we don't need it anymore.

repository: introduce a convenience config snapshot method

ac99d86

Accessing the repository's config and immediately taking a snapshot of it is a common operation, so let's provide a convenience function for it.

arrbee added a commit that referenced this pull request May 12, 2014

Merge pull request #2188 from libgit2/cmn/config-snapshot

d2c4d1c

Configuration snapshotting

arrbee merged commit d2c4d1c into development May 12, 2014

This was referenced May 12, 2014

git_repository_config does not invalidate a cached git_config based on last date modified #1022

Closed

On automagic config reloading #2152

Closed

carlosmn deleted the cmn/config-snapshot branch May 12, 2014 20:23

phatblat pushed a commit to phatblat/libgit2 that referenced this pull request Sep 13, 2014

Merge pull request libgit2#2188 from libgit2/cmn/config-snapshot

8b2c865

Configuration snapshotting

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Configuration snapshotting #2188

Configuration snapshotting #2188

carlosmn commented Mar 15, 2014

jspahrsummers commented Mar 17, 2014

vmg commented Mar 18, 2014

mhagger commented Mar 18, 2014

carlosmn commented Mar 18, 2014

carlosmn commented Mar 18, 2014

mhagger commented Mar 18, 2014

Yogu commented Mar 19, 2014

carlosmn commented Mar 19, 2014

ethomson commented Mar 20, 2014

carlosmn commented Mar 31, 2014

carlosmn commented Apr 9, 2014

carlosmn commented Apr 18, 2014

arrbee commented Apr 21, 2014

carlosmn commented Apr 22, 2014

arrbee commented May 7, 2014

carlosmn commented May 7, 2014

Configuration snapshotting #2188

Configuration snapshotting #2188

Conversation

carlosmn commented Mar 15, 2014

jspahrsummers commented Mar 17, 2014

vmg commented Mar 18, 2014

mhagger commented Mar 18, 2014

carlosmn commented Mar 18, 2014

carlosmn commented Mar 18, 2014

mhagger commented Mar 18, 2014

Yogu commented Mar 19, 2014

carlosmn commented Mar 19, 2014

ethomson commented Mar 20, 2014

carlosmn commented Mar 31, 2014

carlosmn commented Apr 9, 2014

carlosmn commented Apr 18, 2014

arrbee commented Apr 21, 2014

carlosmn commented Apr 22, 2014

arrbee commented May 7, 2014

carlosmn commented May 7, 2014