This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

gps: source cache: add persistent BoltDB cache #1098

Merged: 6 commits merged into golang:master on Sep 5, 2017

Conversation

@jmank88 (Collaborator) commented Aug 30, 2017

What does this do / why do we need it?

This PR introduces a persistent singleSourceCache backed by a BoltDB file, as a step towards #431. It is currently exercised only by tests and is not yet used by any commands; it will be integrated into the rest of gps and cmd/dep in future PRs.

What should your reviewer look out for in this PR?

More edge cases to test. Try to break the encoding.
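
For instance, a round-trip exercise along these lines (a hedged sketch, not code from this PR; the newBoltCache signature appears in the diff below, while the setManifestAndLock counterpart and the close method are assumed):

package gps

import (
	"log"
	"os"
	"testing"
	"time"
)

// exerciseBoltCache writes a manifest/lock pair through the bolt-backed cache
// and reads it back, so the encoding can be compared field by field.
func exerciseBoltCache(t *testing.T, cacheDir string, pi ProjectIdentifier,
	rev Revision, ai ProjectAnalyzerInfo, m Manifest, l Lock) {

	c, err := newBoltCache(cacheDir, pi, time.Now().Unix(), log.New(os.Stderr, "", 0))
	if err != nil {
		t.Fatal(err)
	}
	defer c.close()

	// Write, then read back through the encoding.
	c.setManifestAndLock(rev, ai, m, l)
	gotM, gotL, ok := c.getManifestAndLock(rev, ai)
	if !ok {
		t.Fatal("manifest/lock missing after caching")
	}
	_, _ = gotM, gotL // deep-compare against m and l here
}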

Which issue(s) does this PR fix?

Towards #431

@sdboyer (Member) left a comment:

whew, ok! so awesome to see this becoming a real thing 😄

the lower-level mechanics of what's being executed seem in line with what i was imagining - that is, i did not see any obvious ways in which the public-facing interface wasn't being satisfied.

i have two main concerns, though:

  1. the hand-rolled encoding scheme. (this ends up being a pretty far-reaching thing)
  2. treating the meat of the data - that which is keyed by revision - as essentially permanent.

lots more detail in the itemized comments, let's discuss in there!


// cacheEncodeProjectProperties returns a key/value pair containing the encoded
// ProjectRoot and ProjectProperties.
func cacheEncodeProjectProperties(ip ProjectRoot, pp ProjectProperties) ([]byte, []byte, error) {
sdboyer (Member):

ahh, crap. i wished we'd talked about this a bit more in advance, but...

what's the reason to hand-roll an encoding scheme, vs. using an existing one - flatbuffers, protobuf, capn proto, msgpack, etc.?

really doing this right may entail pulling up the semver lib directly into dep and modifying it so that we can access its internal data. if it comes to that, we can do it...though we can probably also do these things in iterative stages.

sdboyer (Member):

to clarify: i'm imagining an encoding scheme where we are operating on concrete types (simpleManifest, simpleRootManifest, PackageTree, etc.), and using the encoding frameworks to generate an efficient byte-representation of the data they contain. ideally, the decoding side would then be as close to zero-copy as possible, producing the concrete types we need to return with high efficiency.

jmank88 (Collaborator, author):

Encoding scheme was my first question: #431 (comment) ;-)

There is no reason to hand-roll this way other than that it was a simple first step, and that dep's manifest/lock serialization could be used as a reference, so I could be confident about correctness.

really doing this right may entail pulling up the semver lib directly into dep and modifying it so that we can access its internal data. if it comes to that, we can do it...

We certainly need to access internal data, although we could get away with some other means of exporting the fields - we can always convert into an equivalent type with struct tags or custom serialization methods.

sdboyer (Member):

Encoding scheme was my first question: #431 (comment) ;-)

crap, it totally was. i'm sorry, i missed that. but, as you say:

There is no reason to hand-roll this way other than that it was a simple first step, and that dep's manifest/lock serialization could be used as a reference, so I could be confident about correctness.

i entirely agree. using primitives that we understand and can easily reason about as a first step to getting it working gives us a strong foundation to work from. 👍

i was just trying to approach this cautiously in the event that you were quite attached to the hand-rolled scheme, but it doesn't seem you are 🦄 💯 👍 😄

We certainly need to access internal data, although we could get away with some other means of exporting the fields - we can always convert into an equivalent type with struct tags or custom serialization methods.

we'd still have to convert back to the original types, though. and there's no mechanism that the semver lib provides for generating objects, apart from its string parser.

jmank88 (Collaborator, author):

we'd still have to convert back to the original types, though. and there's no mechanism that the semver lib provides for generating objects, apart from its string parser.

Yeah the public lib would still have to be augmented, but then we could avoid an internal fork. Any chance that some sort of strongly typed constructor would be generally useful as an addition to the lib?

sdboyer (Member):

Maybe? At the point where we have a constructor that's exposing all the fields, though, it starts seeming to me like it might just be better to export the fields. i feel like we've started exploring that path in the past, but i can't entirely recall.

@jmank88 (Collaborator, author), Aug 31, 2017:

Just poking around the code I see a few consequences of exporting the fields:

  • Making them mutable means the internal semver cache has to hand out copies, not immutable originals
  • Some 'getter' methods become obsolete, and their names clash with the newly exported fields (e.g. Version.Major()), so this is probably API-breaking

sdboyer (Member):

confirmed, i'm expecting both of those to be the case :)

though on the first one, i converted the semver lib a while ago to use values instead of pointers (with the goal of avoiding heap allocating them). possibly an unnecessary microoptimization, but they're generally of sufficiently small size that it may make sense. anyway, if we continue with that, we can just return the dereferenced pointer (assuming the serialization lib operates on a pointer)

return []byte(k), []byte{}, nil
}

if v, ok := pp.Constraint.(Version); ok {
sdboyer (Member):

when doing gps-internal type switches on Version like this, it's more typical to use a type switch that relies on the unexported types.
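
a minimal sketch of that style (not code from this PR), assuming the unexported version types are named branchVersion, plainVersion, and semVersion; the helper name is hypothetical and paired versions are omitted:

// in package gps
func cacheEncodeVersionKind(v Version) byte {
	switch v.(type) {
	case Revision:
		return 'r'
	case branchVersion:
		return 'b'
	case semVersion:
		return 's'
	case plainVersion:
		return 'v'
	default:
		return '?' // e.g. paired versions, handled separately
	}
}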

jmank88 (Collaborator, author):

Yeah, I modeled it after dep's serialization methods: https://github.com/golang/dep/blob/master/manifest.go#L300

I'd like to be able to just call a method on Constraint here. typedString() string is functionally close to what we need, but maybe we want a typedEncoding() []byte or typedEncoding() someProtobufOneofType

sdboyer (Member):

Yeah, I modeled it after dep's serialization methods:

oh for sure, i get why - i'm just personally loath to do that in gps because i'm accustomed to switching on the underlying types, and the .Type() system i wrote for it has always felt less expressive. i don't think it actually is, but it's a gut feel.

but maybe we want a typedEncoding() []byte or typedEncoding() someProtobufOneofType

i've not actually used protobufs before (though i have used others); my guess, though, is that we'd end up relying on generated code to create whatever methods we need, rather than writing them ourselves.


// cacheGetManifest returns a new RootManifest with the data retrieved from the bolt.Bucket.
func cacheGetManifest(b *bolt.Bucket) (RootManifest, error) {
	m := &cachedManifest{
sdboyer (Member):

i'm looking over the git history and only just now seeing that we added this type back in early June. maybe i asked then and forgot, but what's the value of having cachedManifest in addition to simpleRootManifest?

jmank88 (Collaborator, author):

simpleRootManifest and safeLock seem adequate. I may have just overlooked them.


case strings.HasPrefix(vs, ver):
	vs = strings.TrimPrefix(vs, ver)
	if c, err := NewSemverConstraint(vs); err != nil {
sdboyer (Member):

this, in particular, is one of the reasons why we may benefit from a highly efficient decoding scheme: each call to this (putting aside the crude literal string matching cache the semver lib uses) will trigger a regex match on the incoming string. for larger manifests, this cost could end up being non-negligible. semver lib benchmarks put it on the order of 5 microseconds per op. if there are 20 such constraints, that's 100 microseconds per manifest.

i know that sounds quite low, but that final cost ends up being paid twice per step (at every version we try, of every dependency) in the algorithm, and once more if we have to backtrack through a project. that means that any extra work we do here is going to be heavily amplified by the solver.
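
for scale, a hedged benchmark sketch (not part of the PR); it varies the input so the semver lib's literal parse cache doesn't mask the regex work:

package gps

import (
	"fmt"
	"testing"
)

func BenchmarkConstraintReparse(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// a distinct constraint string each iteration defeats the parse cache
		if _, err := NewSemverConstraint(fmt.Sprintf("^1.%d.0", i)); err != nil {
			b.Fatal(err)
		}
	}
}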


// newBoltCache returns a new singleSourceCacheBolt backed by a project's BoltDB file under the cache directory.
func newBoltCache(cd string, pi ProjectIdentifier, epoch int64, logger *log.Logger) (*singleSourceCacheBolt, error) {
	path := sourceCachePath(cd, pi.normalizedSource()) + ".db"
sdboyer (Member):

huh. interesting. i'd been imagining that we'd just have a single, consolidated boltdb file, and that we share prefixed handles into it (or something) on a per-singleSourceCacheBolt basis. but here there's one file per singleSourceCacheBolt instance.

i don't have an immediate, clear reason why this wouldn't be fine, though it does raise some vague yellow flags in my mind. i'll ruminate on it.

sdboyer (Member):

i mean, certainly this has the benefit that simultaneous actions on different sources will not result in a bunch of competing write transactions 🤔

@jmank88 (Collaborator, author), Aug 31, 2017:

I went into it assuming a single file, but given the current architecture it was cleaner/simpler to start this way, and there are definitely trade-offs to consider, so I didn't want to run ahead. My intuition is that bolt can probably juggle transactions to isolated top level buckets just fine, but I have no idea what the coordination costs are. Reducing the number of files could be beneficial to avoid limits, or just to speed up opening/closing.

It shouldn't be too hard to convert to a single file. Most of the work/mess would be in the integration. We'll just have to store a bolt reference (or a fancier cacheFactory sort of type) in the sourceCoordinator to pass to newSourceGateway calls or something like that.
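
For example, a hedged sketch of that single-file shape (not part of this PR; the boltCacheFactory type and its method are hypothetical): the sourceCoordinator would hold one shared *bolt.DB and hand each source a view scoped to its own top-level bucket.

package gps

import "github.com/boltdb/bolt"

type boltCacheFactory struct {
	db *bolt.DB // single cache file shared by all sources
}

// updateSourceBucket runs fn in a write transaction against the top-level
// bucket for one source, creating the bucket on first use.
func (f *boltCacheFactory) updateSourceBucket(pi ProjectIdentifier, fn func(*bolt.Bucket) error) error {
	return f.db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(pi.normalizedSource()))
		if err != nil {
			return err
		}
		return fn(b)
	})
}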

sdboyer (Member):

Reducing the number of files could be beneficial to avoid limits,

ah, yes, this is a good point, particularly given the issues we've been having with it lately.

i'm OK with sticking with multiple files for now, though. we can revisit before we make the code live. all we need to do now is keep an eye out for decisions that might make it harder to switch to a single file.

jmank88 (Collaborator, author):

Well, I can either work on this or on a different encoding next, unless you want to merge something intermediate first.

sdboyer (Member):

let's do this first, and we can leave the encoding as-is for this PR. the new encoding can be a follow-up, primarily once we're happy with tests. i haven't actually dug into those yet.

// updateBucket executes update with the named bucket, creating it first if necessary.
func (s *singleSourceCacheBolt) updateBucket(name string, update func(b *bolt.Bucket) error) error {
	return s.db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(name))
sdboyer (Member):

this is kind of a nit here, but i'm making it with the intent of it being a more general thing - we should probably mostly be working with []byte at this level, rather than string, and do the necessary conversions at an outermost point of ingress/egress so that we don't incur unnecessary conversion costs throughout.

(i'm not clear on whether this is actually a real problem, but it jumped out at me as a thought here, so i'm making the note)

jmank88 (Collaborator, author):

They are all currently derived from dynamic Revision strings:

s.updateBucket("rev:"+string(rev)...
s.viewBucket("rev:"+string(rev)...

There is still room for improvement though, I'll make the methods specific to revisions and build the keys from a single []byte("rev:") var.
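
Something along these lines (a hedged sketch; the names are hypothetical):

// in package gps
var cacheRevisionPrefix = []byte("rev:")

// cacheRevisionKey builds the bucket key for a revision without going through
// an intermediate string concatenation.
func cacheRevisionKey(rev Revision) []byte {
	key := make([]byte, 0, len(cacheRevisionPrefix)+len(rev))
	key = append(key, cacheRevisionPrefix...)
	return append(key, string(rev)...)
}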

}

// Manifest
mb := info.Bucket([]byte("manifest"))
sdboyer (Member):

i'm not sure if the go compiler optimizes out the string literal conversion here, but if we can't be sure, let's use package-level vars to avoid realloc, etc.

jmank88 (Collaborator, author):

I sure hope it does, although now I wonder if it would have to prove it's not mutated by the method in order to share a single reference. Regardless, this will likely go away anyways, between dropping the time-stamp and using an encoding like protobuf.


func (s *singleSourceCacheBolt) getManifestAndLock(rev Revision, ai ProjectAnalyzerInfo) (m Manifest, l Lock, ok bool) {
	err := s.viewBucket("rev:"+string(rev), func(b *bolt.Bucket) error {
		info := cacheFindLatestValid(b, "info:"+ai.String()+":", s.epoch)
sdboyer (Member):

i haven't fully grokked how we're using these epoch markers, but given that revisions - and therefore the data we derive from them - are immutable, what's the reason to use an epoch marker here at all? it seems like this should basically be a permanent cache.

maybe there's a spot somewhere else where, for the M&L and ptrees, we use a value for the epoch that has the effect of making this basically a permanent cache, unless some magic value is passed in to clear the bucket, basically for maintenance purposes? or something?

jmank88 (Collaborator, author):

There was a point when I was not time-stamping manifests/locks or package trees, but then I convinced myself it was necessary again. Now I can't think of any reason these wouldn't be immutable that wouldn't also warrant wiping the cache anyway (e.g. a new dep version produces a different package tree, or a ProjectAnalyzerInfo changes the way it parses the files without changing its Version).

Happy to pull it back out. Simplifies a lot of code.

sdboyer (Member):

e.g. a new dep version

yep yep yep - i've been contemplating that gps needs its own internal versioning number that we include in these indices, and bump whenever we make a change that could have any effect on how gps interprets data on disk, e.g. in ListPackages().

i believe what i settled on is what we see as the solver-version in Gopkg.lock, but that's a property of the Solver, not the SourceManager. maybe i made a mistake in not separating them out, but i think i probably did it that way because we effectively can't separate the two :/. still, not great. i'm going to have to think about this.

in any case, let's go with pulling the timestamp back out - i see this is changed, so perhaps you've already done as much :)
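
a hedged sketch of that internal-version idea (not from the PR): stamp the cache with a schema version and treat a mismatch as "wipe and rebuild". the meta bucket, key, and constant names are hypothetical:

package gps

import (
	"bytes"
	"encoding/binary"

	"github.com/boltdb/bolt"
)

// bump whenever gps changes how it interprets cached data on disk
const cacheSchemaVersion uint64 = 1

var cacheVersionKey = []byte("cache-version")

// checkCacheVersion reports whether the cache file matches the current schema
// version, stamping new files as it goes; a false result means the cache
// should be discarded and rebuilt.
func checkCacheVersion(db *bolt.DB) (matches bool, err error) {
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("meta"))
		if err != nil {
			return err
		}
		var want [8]byte
		binary.BigEndian.PutUint64(want[:], cacheSchemaVersion)
		got := b.Get(cacheVersionKey)
		if got == nil {
			matches = true // new file: record the current version
			return b.Put(cacheVersionKey, want[:])
		}
		matches = bytes.Equal(got, want[:])
		return nil
	})
	return matches, err
}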

@jmank88 (Collaborator, author) commented Sep 2, 2017:

@sdboyer I have prototyped protobuf integration, but with semver still being stringified/parsed.

Is this intermediate state worth committing/reviewing (assuming we're heading in that direction)?

How would you like to proceed with semver? Is there something I can prototype in a fork, or another way to go about this?

@sdboyer (Member) left a comment:

let's fix the one little nit, but then merge this. then we can iterate on serialization, etc. in subsequent PRs.

Gopkg.toml (outdated):

[[constraint]]
name = "github.com/boltdb/bolt"
version = "^1.0.0"
sdboyer (Member):

nit: don't use the caret. this is a confusing area already; we should be consistent in how we do it.

@jmank88 (Collaborator, author) commented Sep 5, 2017:

let's fix the one little nit, but then merge this. then we can iterate on serialization, etc. in subsequent PRs.

I was hoping to save you from reviewing the whole thing since so much is being replaced, but if it's done then separate PRs is fine with me.

@sdboyer (Member) commented Sep 5, 2017:

@jmank88 it's OK - i think i'll benefit from having the diff against this approach as well, for the same reasons of familiarity that you initially outlined.

@jmank88 (Collaborator, author) commented Sep 5, 2017:

Close/Open CI trigger due to unrelated fluke (hopefully?):

Signal received: waiting for 1 ops to complete...
--- FAIL: TestSignalHandling (0.04s)
manager_test.go:730: Releasing flag did not get set

Edit: I'm bad at this: 1) I don't need to close/open anymore, I have access to restart. 2) The first job still linked as a failure on the PR. 3) So I restarted the whole thing, instead of just the failed job. Hopefully this mess sorts itself out.

@jmank88 jmank88 closed this Sep 5, 2017
@jmank88 jmank88 reopened this Sep 5, 2017
@jmank88 jmank88 merged commit d3e738d into golang:master Sep 5, 2017
@jmank88 jmank88 deleted the bolt_cache branch September 5, 2017 14:45