Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration snapshotting #2188

Merged
merged 13 commits into from
May 12, 2014
Merged

Configuration snapshotting #2188

merged 13 commits into from
May 12, 2014

Conversation

carlosmn
Copy link
Member

Thinking about immutable data structures is a lot easier than dealing with locks that only affect certain parts of the system, so let's start tackling the config autorefresh situation this way. Introduce git_repository_snapshot() to create a read-only version of the configuration for things like remote and submodule loading which don't want to read half of a version of the file.

If we like this direction, I'll see what other things need converting. This version of the code creates a copy of everything, which is not terribly efficient. A later version will do refcounting or whatever.

@jspahrsummers
Copy link
Contributor

💯

@vmg
Copy link
Member

vmg commented Mar 18, 2014

cc @mhagger who will probably try to do something similar in core Git someday

@mhagger
Copy link
Contributor

mhagger commented Mar 18, 2014

Thanks for the ping.

See "git configuration API improvements" in the GSoC ideas page and the mailing list threads referenced therein for our ideas about changing config handling in core Git. At the moment we're really only talking about a caching layer, mostly to simplify access to the config and make it possible for later settings to unset earlier settings (see below). Presumably every time the config cache is read, it would check the timestamps on the files from which it was read and invalidate itself if they have been changed.

I didn't read this PR's patch in detail, but I don't understand what problem is solved by making a disk copy of the config. Isn't the copy process vulnerable to exactly the same race/inconsistency issues when it reads the "official" version of the config as the individual config readers were vulnerable to before? Plus you would have the new problem that the on-disk copy gets out of date with the original. And if you want a snapshot, how is a disk snapshot better than an in-memory cache? (Is it because the on-disk snapshot is meant to be usable by subprocesses? If so, how will subprocesses, who no longer have any connection to the original config, going to deal with simultaneous changes?) I'm not saying it's a bad idea, I'm just saying that I don't yet understand why it is a good idea.

If the GSoC idea gets implemented, then there might soon be a config file syntax for unsetting values. Please keep this in mind so that you can support it too (or if you don't like the idea, register your objections now, before it is too late!) In core Git, this seemingly little change requires the method of reading config files to change significantly. Currently, the config files are (re-)read whenever a value is needed, and while they are read all of the config settings are streamed past a callback that picks out the value(s) it wants. Since we don't want to have to change all of our callbacks immediately, we will have to read the whole config into memory (taking into account "unset" directives) and then push the surviving settings through the callback. I don't know whether libgit2 will require a similar architectural change.

@carlosmn
Copy link
Member Author

I think we're both trying to move into roughly the same spot, but we're starting from opposite ends.

As you said, when something inside a git commands asks for the configuration, the config subsystem will parse whatever is on the filesystem at the time and pass every config option to the caller. This seemed like a rather silly way of doing things, so in libgit2 we went with keeping the configuration in memory and letting the user ask for specific keys.

git's version reloads on every call, we only reload when asked explicitly, because a configuration file might change between two calls, and if we were to reload every single time, I might get e.g. two partial views into a remote's configuration. git's pushing version does not have this issue, as it will send values for only a specific version of the config, which is what it read.

As git runs for a bit and then dies, it's clear where the a change in the configuration would or wouldn't affect what it does, so even if you only ever read one version of the file, the process is going to be done pretty quickly, and the next invocation will take it into account. But when should libgit2 (or a program using it) consider this boundary to be? In the middle of reading a remote's or branch's configuration is obviously a bad thing, but how do we know?

IIRC GHfW will issue a config refresh whenever they get the window focus, which seems like a nice way of solving it, but it still requires the application to choose these points even when these races don't affect it because it only ever cares about single config values. If an application itself only ever cares about log.date, it should be possible for it to read the most up to date version simply by asking for it. And libgit2 can't refresh willy-nilly because a different thread might be in the middle of loading some complex configuration.

So we want to behave more like git: give an atomic view into the state of the config files at a point in time, but still have a low-friction way of getting the newest version of the config. The callback-based push version is out of the question for us, and not just for stylistic reasons but because for integration with the different languages, the caller needs to be the one in charge of when things happen. Locking is its own can of worms and can make us do the opposite, where one thread stops everyone else from getting an update, and that's without bugs.

So I figure immutability is pretty neat (we've gone this direction with references as well). If someone needs to take a look at a particular version of the config because they want to read multiple keys (remotes, I think submodules, branch config...), they ask for an immutable version and read off of that (which is a lot like a pull version of what git does), while people who don't care can simply ask for a config variable from the real config and have it refresh if necessary (also much like what you get with git if you're only interested in the one config value). This is what this PR tries to do.

As we already keep config in an intermediary position, adding support for unsetting values would not be an issue for libgit2; and if git gets this intermediary layer and changes config to be pull-based, we're going to look very much alike, with the exception that git lives fast and dies young, while libgit2 needs to think about retirement.

@carlosmn
Copy link
Member Author

So it looks like I didn't explicitly talk about why we don't see races with this.

The config files are written via the config.lock -> config mechanism, which means once we have our file descriptor, the data in the file is essentially immutable (barring misbehaved users, but there's not much we can do there). For snapshots, we care about how the config existed at some point (preferably just before we started asking for the configuration).

Concurrent changes will be picked up by concurrent users, and that's ok (much like a concurrent git would do too) because they're doing something else.

If you're running git remote add origin ... in one terminal and git remote -v in another at the same time, git remote -v will either see the remote or it won't, but it won't only see remote.origin.fetch and miss out on remote.origin.url. This atomicity is what we care about. Running git remote -v again will show the remote, and we want that behaviour as well.

Creating explicit snapshots for subsystems which care is how we simulate git's "read and die quickly".

@mhagger
Copy link
Contributor

mhagger commented Mar 18, 2014

@carlosmn, thanks for the great explanation.

@Yogu
Copy link
Contributor

Yogu commented Mar 19, 2014

Sorry for being a little off-topic, but

The config files are written via the config.lock -> config mechanism, which means once we have our file descriptor, the data in the file is essentially immutable

does this also work on Windows? I'd think that you either have to open config allowing deletion (shared-delete) and thus risking it being deleted in the middle of reading it, or not allowing deletion and thus preventing the writing operation from renaming config.lock to config.

@carlosmn
Copy link
Member Author

IIRC Windows' VFS layer blocks writing to it, so we shouldn't get any modifications whilst we're writing either.

@ethomson
Copy link
Member

It's true, @Yogu, a reader could have the file locked on win32 and the config update could fail.

@carlosmn
Copy link
Member Author

This should be just about ready. We can create snapshots and getting or setting a value refreshes. Iterators make use of a snapshot. This version still creates a copy, but we might want to tackle that in a different PR, as this one is already fairly ambitious.

While we support multi-threading, this still only supports serialised access, as the values could vanish in the middle of parsing if we refresh while e.g. git_config_get_bool() is parsing.

@carlosmn carlosmn changed the title [RFC] Configuration snapshotting Configuration snapshotting Mar 31, 2014
@carlosmn
Copy link
Member Author

carlosmn commented Apr 9, 2014

Now with reference counting so we don't copy data, yay!

We're still restricted to serialised use, thought that's always really been the case, so we might want to see about merging this and handling that in a different PR, as it would involve a change in API (which this one, oddly enough, doesn't)?

On set, we set/add the value written to the config's internal values,
but we do not refresh old values.

Document this in a test in preparation for the refresh changes.
In order to have consistent views of the config files for remotes,
submodules et al. and a configuration that represents what is currently
stored on-disk, we need a way to provide a view of the configuration
that does not change.

The goal here is to provide the snapshotting part by creating a
read-only copy of the state of the configuration at a particular point
in time, which does not change when a repository's main config changes.
This way we can assume we have a consistent view of the config situation
when we're looking up remote, branch, pack-objects, etc.
Current code sets the active map to a new one and builds it whilst it's
active. This is a race condition with someone else trying to access the
same config.

Instead, let's build up our new map and swap the active and new one.
This will be used by the writing commands in a later step.
When writing out, parse the resulting file instead of adding or
replacing the value locally. This has the effect of reading external
changes as well.
With the isolation of complex reads, we can now try to refresh the
on-disk file before reading a value from it.

This changes the semantics a bit, as before we could be sure that a
string we got from the configuration was valid until we wrote or
refreshed. This is no longer the case, as a read can also invalidate the
pointer.
When we delete an entry, we also want to refresh the configuration to
catch any changes that happened externally.

This allows us to simplify the logic, as we no longer need to delete
these variables internally. The whole state will be refreshed and the
deleted entries won't be there.
This is mostly groundwork to let us re-use the map in the snapshots.
Now that our strmap is no longer modified but replaced, we can use the
same strmap for the snapshot's values and it will be freed when we don't
need it anymore.
@carlosmn
Copy link
Member Author

Rebased on top of current dev to deal with the stramp alloc change.

@arrbee
Copy link
Member

arrbee commented Apr 21, 2014

Having read through much of this, I think it's ready to merge. I'd like to go through a couple parts one more time before I push the merge button myself, but if other folks have read it too, then I'll definitely add my 👍 for merging.

I do wonder if providing a config API that was the equivalent of git_attr_get_many (i.e. take an array of keys and return an array of consistently read values) would eliminate many of the cases where snapshots are needed. Obviously that has its own set of problems with object lifetimes and memory allocation, but it might be a nice addition...

@carlosmn
Copy link
Member Author

I did think about a function for batch-lookup, but string arrays are really annoying to deal with and this approach is somewhat more generic and provides the atomicity guarantees without changing the code where we use it.

Returning a string array of values feels like it would be more awkward for the user, as they have to remember the order in which they asked. And then there's keeping those strings alive, though I guess git_strarrary would take care of that at the cost of copying.

Maybe at some point in the future we can offer this. We also need to figure out what to do for keeping strings alive in a concurrent environment, but that's for a different PR.

@arrbee
Copy link
Member

arrbee commented May 7, 2014

It seems like the git_repository_config__weakptr + git_config_snapshot pair is such a common sequence now, maybe we should just a git_repository_config__snapshot helper that does them both. What do you think?

@carlosmn
Copy link
Member Author

carlosmn commented May 7, 2014

I did notice that, but never got around to implementing it. I was thinking more in the direction of git_repository_config_snapshot() as a public method, as this pattern isn't necessarily limited to the library internals.

Accessing the repository's config and immediately taking a snapshot of
it is a common operation, so let's provide a convenience function for
it.
arrbee added a commit that referenced this pull request May 12, 2014
@arrbee arrbee merged commit d2c4d1c into development May 12, 2014
@carlosmn carlosmn deleted the cmn/config-snapshot branch May 12, 2014 20:23
phatblat pushed a commit to phatblat/libgit2 that referenced this pull request Sep 13, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants