This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

gps: source cache: add persistent BoltDB cache #1098

Merged: 6 commits merged into golang:master on Sep 5, 2017

Conversation

@jmank88 (Collaborator) commented Aug 30, 2017

What does this do / why do we need it?

This PR introduces a persistent singleSourceCache backed by a BoltDB file, as a step towards #431. It is currently exercised only by tests and is not yet used by any commands; it will be integrated into the rest of gps and cmd/dep in future PRs.

What should your reviewer look out for in this PR?

More edge cases to test. Try to break the encoding.
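
For instance, a round-trip exercise along these lines (a hedged sketch, not code from this PR; the newBoltCache signature appears in the diff below, while the setManifestAndLock counterpart and the close method are assumed):

package gps

import (
	"log"
	"os"
	"testing"
	"time"
)

// exerciseBoltCache writes a manifest/lock pair through the bolt-backed cache
// and reads it back, so the encoding can be compared field by field.
func exerciseBoltCache(t *testing.T, cacheDir string, pi ProjectIdentifier,
	rev Revision, ai ProjectAnalyzerInfo, m Manifest, l Lock) {

	c, err := newBoltCache(cacheDir, pi, time.Now().Unix(), log.New(os.Stderr, "", 0))
	if err != nil {
		t.Fatal(err)
	}
	defer c.close()

	// Write, then read back through the encoding.
	c.setManifestAndLock(rev, ai, m, l)
	gotM, gotL, ok := c.getManifestAndLock(rev, ai)
	if !ok {
		t.Fatal("manifest/lock missing after caching")
	}
	_, _ = gotM, gotL // deep-compare against m and l here
}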

Which issue(s) does this PR fix?

Towards #431

@sdboyer (Member) left a comment:

whew, ok! so awesome to see this becoming a real thing 😄

the lower-level mechanics of what's being executed seem in line with what i was imagining - that is, i did not see any obvious ways in which the public-facing interface wasn't being satisfied.

i have two main concerns, though:

  1. the hand-rolled encoding scheme. (this ends up being a pretty far-reaching thing)
  2. treating the meat of the data - that which is keyed by revision - as essentially permanent.

lots more detail in the itemized comments, let's discuss in there!


// cacheEncodeProjectProperties returns a key/value pair containing the encoded
// ProjectRoot and ProjectProperties.
func cacheEncodeProjectProperties(ip ProjectRoot, pp ProjectProperties) ([]byte, []byte, error) {
sdboyer (Member):

ahh, crap. i wished we'd talked about this a bit more in advance, but...

what's the reason to hand-roll an encoding scheme, vs. using an existing one - flatbuffers, protobuf, capn proto, msgpack, etc.?

really doing this right may entail pulling up the semver lib directly into dep and modifying it so that we can access its internal data. if it comes to that, we can do it...though we can probably also do these things in iterative stages.

sdboyer (Member):

to clarify: i'm imagining an encoding scheme where we are operating on concrete types (simpleManifest, simpleRootManifest, PackageTree, etc.), and using the encoding frameworks to generate an efficient byte-representation of the data they contain. ideally, the decoding side would then be as close to zero-copy as possible, producing the concrete types we need to return with high efficiency.

jmank88 (Collaborator, author):

Encoding scheme was my first question: #431 (comment) ;-)

There is no reason to hand-roll this way other than that it was a simple first step, and that dep's manifest/lock serialization could be used as a reference, so I could be confident about correctness.

really doing this right may entail pulling up the semver lib directly into dep and modifying it so that we can access its internal data. if it comes to that, we can do it...

We certainly need to access internal data, although we could get away with some other means of exporting the fields - we can always convert into an equivalent type with struct tags or custom serialization methods.

sdboyer (Member):

Encoding scheme was my first question: #431 (comment) ;-)

crap, it totally was. i'm sorry, i missed that. but, as you say:

There is no reason to hand-roll this way other than that it was a simple first step, and that dep's manifest/lock serialization could be used as a reference, so I could be confident about correctness.

i entirely agree. using primitives that we understand and can easily reason about as a first step to getting it working gives us a strong foundation to work from. 👍

i was just trying to approach this cautiously in the event that you were quite attached to the hand-rolled scheme, but it doesn't seem you are 🦄 💯 👍 😄

We certainly need to access internal data, although we could get away with some other means of exporting the fields - we can always convert into an equivalent type with struct tags or custom serialization methods.

we'd still have to convert back to the original types, though. and there's no mechanism that the semver lib provides for generating objects, apart from its string parser.

jmank88 (Collaborator, author):

we'd still have to convert back to the original types, though. and there's no mechanism that the semver lib provides for generating objects, apart from its string parser.

Yeah the public lib would still have to be augmented, but then we could avoid an internal fork. Any chance that some sort of strongly typed constructor would be generally useful as an addition to the lib?

sdboyer (Member):

Maybe? At the point where we have a constructor that's exposing all the fields, though, it starts seeming to me like it might just be better to export the fields. i feel like we've started exploring that path in the past, but i can't entirely recall.

@jmank88 (Collaborator, author), Aug 31, 2017:

Just poking around the code I see a few consequences of exporting the fields:

  • Making them mutable means the internal semver cache has to hand out copies, not immutable originals
  • Some 'getter' methods become obsolete, and their names clash with the newly exported fields (e.g. Version.Major()), so this is probably API-breaking

sdboyer (Member):

confirmed, i'm expecting both of those to be the case :)

though on the first one, i converted the semver lib a while ago to use values instead of pointers (with the goal of avoiding heap allocating them). possibly an unnecessary microoptimization, but they're generally of sufficiently small size that it may make sense. anyway, if we continue with that, we can just return the dereferenced pointer (assuming the serialization lib operates on a pointer)

return []byte(k), []byte{}, nil
}

if v, ok := pp.Constraint.(Version); ok {
sdboyer (Member):

when doing gps-internal type switches on Version like this, it's more typical to use a type switch that relies on the unexported types.
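
a minimal sketch of that style (not code from this PR), assuming the unexported version types are named branchVersion, plainVersion, and semVersion; the helper name is hypothetical and paired versions are omitted:

// in package gps
func cacheEncodeVersionKind(v Version) byte {
	switch v.(type) {
	case Revision:
		return 'r'
	case branchVersion:
		return 'b'
	case semVersion:
		return 's'
	case plainVersion:
		return 'v'
	default:
		return '?' // e.g. paired versions, handled separately
	}
}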

jmank88 (Collaborator, author):

Yeah, I modeled it after dep's serialization methods: https://github.com/golang/dep/blob/master/manifest.go#L300

I'd like to be able to just call a method on Constraint here. typedString() string is functionally close to what we need, but maybe we want a typedEncoding() []byte or typedEncoding() someProtobufOneofType

sdboyer (Member):

Yeah, I modeled it after dep's serialization methods:

oh for sure, i get why - i'm just personally loath to do that in gps because i'm accustomed to switching on the underlying types, and the .Type() system i wrote for it has always felt less expressive. i don't think it actually is, but it's a gut feel.

but maybe we want a typedEncoding() []byte or typedEncoding() someProtobufOneofType

i've not actually used protobufs before (though i have used others); my guess, though, is that we'd end up relying on generated code to create whatever methods we need, rather than writing them ourselves.


// cacheGetManifest returns a new RootManifest with the data retrieved from the bolt.Bucket.
func cacheGetManifest(b *bolt.Bucket) (RootManifest, error) {
	m := &cachedManifest{
sdboyer (Member):

i'm looking over the git history and only just now seeing that we added this type back in early June. maybe i asked then and forgot, but what's the value of having cachedManifest in addition to simpleRootManifest?

jmank88 (Collaborator, author):

simpleRootManifest and safeLock seem adequate. I may have just overlooked them.


case strings.HasPrefix(vs, ver):
	vs = strings.TrimPrefix(vs, ver)
	if c, err := NewSemverConstraint(vs); err != nil {
sdboyer (Member):

this, in particular, is one of the reasons why we may benefit from a highly efficient decoding scheme: each call to this (putting aside the crude literal string matching cache the semver lib uses) will trigger a regex match on the incoming string. for larger manifests, this cost could end up being non-negligible. semver lib benchmarks put it on the order of 5 microseconds per op. if there are 20 such constraints, that's 100 microseconds per manifest.

i know that sounds quite low, but that final cost ends up being paid twice per step (at every version we try, of every dependency) in the algorithm, and once more if we have to backtrack through a project. that means that any extra work we do here is going to be heavily amplified by the solver.
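
for scale, a hedged benchmark sketch (not part of the PR); it varies the input so the semver lib's literal parse cache doesn't mask the regex work:

package gps

import (
	"fmt"
	"testing"
)

func BenchmarkConstraintReparse(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// a distinct constraint string each iteration defeats the parse cache
		if _, err := NewSemverConstraint(fmt.Sprintf("^1.%d.0", i)); err != nil {
			b.Fatal(err)
		}
	}
}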


// newBoltCache returns a new singleSourceCacheBolt backed by a project's BoltDB file under the cache directory.
func newBoltCache(cd string, pi ProjectIdentifier, epoch int64, logger *log.Logger) (*singleSourceCacheBolt, error) {
	path := sourceCachePath(cd, pi.normalizedSource()) + ".db"
sdboyer (Member):

huh. interesting. i'd been imagining that we'd just have a single, consolidated boltdb file, and that we share prefixed handles into it (or something) on a per-singleSourceCacheBolt basis. but here there's one file per singleSourceCacheBolt instance.

i don't have an immediate, clear reason why this wouldn't be fine, though it does raise some vague yellow flags in my mind. i'll ruminate on it.

sdboyer (Member):

i mean, certainly this has the benefit that simultaneous actions on different sources will not result in a bunch of competing write transactions 🤔

@jmank88 (Collaborator, author), Aug 31, 2017:

I went into it assuming a single file, but given the current architecture it was cleaner/simpler to start this way, and there are definitely trade-offs to consider, so I didn't want to run ahead. My intuition is that bolt can probably juggle transactions to isolated top level buckets just fine, but I have no idea what the coordination costs are. Reducing the number of files could be beneficial to avoid limits, or just to speed up opening/closing.

It shouldn't be too hard to convert to a single file. Most of the work/mess would be in the integration. We'll just have to store a bolt reference (or a fancier cacheFactory sort of type) in the sourceCoordinator to pass to newSourceGateway calls or something like that.
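
For example, a hedged sketch of that single-file shape (not part of this PR; the boltCacheFactory type and its method are hypothetical): the sourceCoordinator would hold one shared *bolt.DB and hand each source a view scoped to its own top-level bucket.

package gps

import "github.com/boltdb/bolt"

type boltCacheFactory struct {
	db *bolt.DB // single cache file shared by all sources
}

// updateSourceBucket runs fn in a write transaction against the top-level
// bucket for one source, creating the bucket on first use.
func (f *boltCacheFactory) updateSourceBucket(pi ProjectIdentifier, fn func(*bolt.Bucket) error) error {
	return f.db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(pi.normalizedSource()))
		if err != nil {
			return err
		}
		return fn(b)
	})
}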

sdboyer (Member):

Reducing the number of files could be beneficial to avoid limits,

ah, yes, this is a good point, particularly given the issues we've been having with it lately.

i'm OK with sticking with multiple files for now, though. we can revisit before we make the code live. all we need to do now is keep an eye out for decisions that might make it harder to switch to a single file.

jmank88 (Collaborator, author):

Well, I can either work on this or on a different encoding next, unless you want to merge something intermediate first.

sdboyer (Member):

let's do this first, and we can leave the encoding as-is for this PR. the new encoding can be a follow-up, primarily once we're happy with tests. i haven't actually dug into those yet.

// updateBucket executes update with the named bucket, creating it first if necessary.
func (s *singleSourceCacheBolt) updateBucket(name string, update func(b *bolt.Bucket) error) error {
	return s.db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte(name))
sdboyer (Member):

this is kind of a nit here, but i'm making it with the intent of it being a more general thing - we should probably mostly be working with []byte at this level, rather than string, and do the necessary conversions at an outermost point of ingress/egress so that we don't incur unnecessary conversion costs throughout.

(i'm not clear on whether this is actually a real problem, but it jumped out at me as a thought here, so i'm making the note)

jmank88 (Collaborator, author):

They are all currently derived from dynamic Revision strings:

s.updateBucket("rev:"+string(rev)...
s.viewBucket("rev:"+string(rev)...

There is still room for improvement though, I'll make the methods specific to revisions and build the keys from a single []byte("rev:") var.
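
Something along these lines (a hedged sketch; the names are hypothetical):

// in package gps
var cacheRevisionPrefix = []byte("rev:")

// cacheRevisionKey builds the bucket key for a revision without going through
// an intermediate string concatenation.
func cacheRevisionKey(rev Revision) []byte {
	key := make([]byte, 0, len(cacheRevisionPrefix)+len(rev))
	key = append(key, cacheRevisionPrefix...)
	return append(key, string(rev)...)
}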

}

// Manifest
mb := info.Bucket([]byte("manifest"))
sdboyer (Member):

i'm not sure if the go compiler optimizes out the string literal conversion here, but if we can't be sure, let's use package-level vars to avoid realloc, etc.

jmank88 (Collaborator, author):

I sure hope it does, although now I wonder if it would have to prove it's not mutated by the method in order to share a single reference. Regardless, this will likely go away anyways, between dropping the time-stamp and using an encoding like protobuf.


func (s *singleSourceCacheBolt) getManifestAndLock(rev Revision, ai ProjectAnalyzerInfo) (m Manifest, l Lock, ok bool) {
	err := s.viewBucket("rev:"+string(rev), func(b *bolt.Bucket) error {
		info := cacheFindLatestValid(b, "info:"+ai.String()+":", s.epoch)
sdboyer (Member):

i haven't fully grokked how we're using these epoch markers, but given that revisions - and therefore the data we derive from them - are immutable, what's the reason to use an epoch marker here at all? it seems like this should basically be a permanent cache.

maybe there's a spot somewhere else where, for the M&L and ptrees, we use a value for the epoch that has the effect of making this basically a permanent cache, unless some magic value is passed in to clear the bucket, basically for maintenance purposes? or something?

jmank88 (Collaborator, author):

There was a point when I was not time-stamping manifests/locks or package trees, but then I convinced myself it was necessary again. Now I can't think of any reason these wouldn't be immutable that wouldn't also warrant wiping the cache anyway (e.g. a new dep version produces a different package tree, or a ProjectAnalyzerInfo changes the way it parses the files without changing its Version).

Happy to pull it back out. Simplifies a lot of code.

sdboyer (Member):

e.g. a new dep version

yep yep yep - i've been contemplating that gps needs its own internal versioning number that we include in these indices, and bump whenever we make a change that could have any effect on how gps interprets data on disk, e.g. in ListPackages().

i believe what i settled on is what we see as the solver-version in Gopkg.lock, but that's a property of the Solver, not the SourceManager. maybe i made a mistake in not separating them out, but i think i probably did it that way because we effectively can't separate the two :/. still, not great. i'm going to have to think about this.

in any case, let's go with pulling the timestamp back out - i see this is changed, so perhaps you've already done as much :)
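
a hedged sketch of that internal-version idea (not from the PR): stamp the cache with a schema version and treat a mismatch as "wipe and rebuild". the meta bucket, key, and constant names are hypothetical:

package gps

import (
	"bytes"
	"encoding/binary"

	"github.com/boltdb/bolt"
)

// bump whenever gps changes how it interprets cached data on disk
const cacheSchemaVersion uint64 = 1

var cacheVersionKey = []byte("cache-version")

// checkCacheVersion reports whether the cache file matches the current schema
// version, stamping new files as it goes; a false result means the cache
// should be discarded and rebuilt.
func checkCacheVersion(db *bolt.DB) (matches bool, err error) {
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("meta"))
		if err != nil {
			return err
		}
		var want [8]byte
		binary.BigEndian.PutUint64(want[:], cacheSchemaVersion)
		got := b.Get(cacheVersionKey)
		if got == nil {
			matches = true // new file: record the current version
			return b.Put(cacheVersionKey, want[:])
		}
		matches = bytes.Equal(got, want[:])
		return nil
	})
	return matches, err
}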

@jmank88 (Collaborator, author) commented Sep 2, 2017:

@sdboyer I have prototyped protobuf integration, but with semver still being stringified/parsed.

Is this intermediate state worth committing/reviewing (assuming we're heading in that direction)?

How would you like to proceed with semver? Is there something I can prototype in a fork, or another way to go about this?

@sdboyer (Member) left a comment:

let's fix the one little nit, but then merge this. then we can iterate on serialization, etc. in subsequent PRs.

Gopkg.toml (outdated):

[[constraint]]
name = "github.com/boltdb/bolt"
version = "^1.0.0"
sdboyer (Member):

nit: don't use the caret. this is a confusing area already; we should be consistent in how we do it.

@jmank88 (Collaborator, author) commented Sep 5, 2017:

let's fix the one little nit, but then merge this. then we can iterate on serialization, etc. in subsequent PRs.

I was hoping to save you from reviewing the whole thing since so much is being replaced, but if it's done then separate PRs is fine with me.

@sdboyer (Member) commented Sep 5, 2017:

@jmank88 it's OK - i think i'll benefit from having the diff against this approach as well, for the same reasons of familiarity that you initially outlined.

@jmank88 (Collaborator, author) commented Sep 5, 2017:

Close/Open CI trigger due to unrelated fluke (hopefully?):

Signal received: waiting for 1 ops to complete...
--- FAIL: TestSignalHandling (0.04s)
manager_test.go:730: Releasing flag did not get set

Edit: I'm bad at this: 1) I don't need to close/open anymore, I have access to restart. 2) The first job still linked as a failure on the PR. 3) So I restarted the whole thing, instead of just the failed job. Hopefully this mess sorts itself out.

@jmank88 jmank88 closed this Sep 5, 2017
@jmank88 jmank88 reopened this Sep 5, 2017
@jmank88 jmank88 merged commit d3e738d into golang:master Sep 5, 2017
@jmank88 jmank88 deleted the bolt_cache branch September 5, 2017 14:45