
RFC: Generic Record Sync #658

Open · wants to merge 3 commits into master from generic-sync-rfc
Conversation

@thomcc (Contributor) commented Feb 12, 2019

Rendered

For reasons described in slack, I've moved this to be a PR.

I don't necessarily think we should actually land this as a doc in docs (I'm open to it, but the amount of 'code inside markdown' is probably a bit much).

@thomcc (Contributor, Author) commented Feb 12, 2019

I forgot to finish a couple of points (use of local id, for example), but I'll get to them tomorrow. (Edit: got to them.)

@thomcc thomcc requested review from mhammond, linacambridge and ncalexan Feb 13, 2019

@thomcc (Contributor, Author) commented Feb 13, 2019

I've flagged people on review, but there's no urgency to review this, of course.

@rfk (Member) left a comment

Thanks for sharing this out, @thomcc! I probably need to read over it a few times to really understand it, but I've left a couple of preliminary comments based on my first read.

In general, I think it's fine to ask the server to be smarter here rather than trying to fit all the changes into the existing server API; we need to be in a position where we're OK with evolving our server infrastructure just as much as we do our client infrastructure.

docs/rfcs/generic-record-sync.md:
current client generates)

**Important**: This requires a change to current clients so that they know to ignore all items
in a collection whose ID starts with `__metadata__:`!
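(For illustration only — the helper below is not part of the RFC; it is just one guess at how a client might implement this requirement:)

```rust
/// Sketch: existing clients would need to skip reserved metadata records
/// when processing a collection. `record_id` is assumed to be the record's
/// ID as fetched from the sync server.
fn is_metadata_record(record_id: &str) -> bool {
    record_id.starts_with("__metadata__:")
}

/// Usage sketch: filter metadata records out before normal record processing.
fn non_metadata_ids(ids: Vec<String>) -> Vec<String> {
    ids.into_iter().filter(|id| !is_metadata_record(id)).collect()
}
```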

@rfk (Member) commented Feb 19, 2019
We should consider what it would take to move some of these smarts to the server. For example, maybe we could just introduce a new "v1.6" sync API that has extra endpoints/fields/whatever for the new bits of state here, and have new clients use that instead of the existing "v1.5" API. We could keep the existing "v1.5" API around as a kind of backwards-compatible view onto the same data, minus the extra bits that legacy clients don't understand.

@thomcc (Author, Contributor) commented Jun 27, 2019

I'd love to do this, but the timing seems tricky. It feels like the first version of this is likely to be ready before the Rust server is widely usable, so new server features are hard to rely on.

Also, we're no longer a unified sync server / client team :(

This part of the design has gotten a little worse, if I'm being honest: since I've moved it outside of this collection, transactional updates of the schema and records will unfortunately not be possible.

docs/rfcs/generic-record-sync.md:

- `"last_sync"`: The most recent X-Weave-Timestamp (as returned by e.g. the fetch to `info/collections` we do before the sync or something). This is for expiring records from this list.

### `$collection/__metadata__:schema`

@rfk (Member) commented Feb 19, 2019

Would we expect clients to push new versions of this schema as we deploy new versions of clients? If so, how do we protect the user's data against corruption by a buggy/malicious client? I wonder if there's an alternate approach here where we manually publish known-good schema records via some out-of-band mechanism, and just have the clients refer to them by number. For example, we might publish "logins schema v1.2.0" to some well-known location, and then each client can just say "I'm using logins schema X.Y.Z" and count on other clients being able to fetch it, rather than having to push the schema itself up into the remote database.

@thomcc (Author, Contributor) commented Feb 19, 2019

That makes sense and sounds reasonable. I didn't really consider very many server changes, and in general all the ones I considered were non-critical to the general functionality.

The main downside of this is that it's a bit of a hassle to make schema changes, in that it requires contacting us to update the server somehow, but maybe that's fine.

It might eliminate the need for `__metadata__:clients` as well, or at least simplify it; I have to think more about it.

@thomcc (Author, Contributor) commented Jun 27, 2019

If so, how do we protect the user's data against corruption by a buggy/malicious client

I don't think we can protect against corruption by malicious clients. Fundamentally, they'll always be able to corrupt any record on the server in any way they want.

For buggy ones, at the moment there's no fix beyond a release that updates the schema. Thankfully, that will work retroactively. I'll think more about how we could better support use cases like 'backing out' a schema.

I'm hesitant about an out-of-band mechanism for storing the schemas unless it could be made self-serve: if getting started with this requires cross-team communication, I think that makes it a lot less attractive [0]. Also, putting the schemas out of band means:

  1. Schemas are immediately live on all clients, or a system for them to "ride the trains" (or whatever) has to be built.
  2. Testing new schemas locally is more difficult, possibly much more.

OTOH it would allow for more control, validation, and easier hotfixing, and would make other things easier. Also, certain parts of the design could probably be simplified if it worked like this.

[0]: part of the motivation for remerge is that AFAICT other teams take the path of least resistance here, which tends not to be 'talk to the sync people'. (In the case of containers, they are planning on adding sync, but still never bothered talking to us, even though we explicitly asked them to reach out when they were ready to do this)

@mhammond mhammond added the backlog label Feb 19, 2019

@rfk rfk added the enhancement label Feb 19, 2019

@mhammond (Member) left a comment

I think this proposal is very solid and has obviously taken much thought and effort. It's also a very important discussion for our team - thanks! And sorry for the delay - I needed time to make my thoughts coherent - hopefully I haven't failed in that :)

However, I'm a little skeptical that the entirety of this will offer net value. I'm concerned it will implement something that has more complexity than we need, and even then, may fail to handle edge-cases we find ourselves facing. I suspect Mentat would have put us in the same boat too, but that's a different conversation. IOW, I fear more pain than gain for our foreseeable future.

Don't get me wrong - I do think that some of the ideas expressed here are valuable - particularly the local schema ideas. A simple way of "versioning" a schema and avoiding custom (but boilerplate) schema upgrades has value, as does a library which makes the implementation of new simple stores almost trivial.

My skepticism is mainly around server-based-data-driven merging and duping, particularly in a world where we end up with a single Rust implementation everywhere. In particular:

  • I believe we can come up with something that "roundtrips" most unknown data. If there's "important" new data, such as a new field used for de-duping, then I think this scheme still doesn't save us - we will know from the server data that the field exists and it is used for deduping, but we'll not have enough context to know what that value should be in our local store for our existing items.

  • With some care, and simple round tripping of "unknown" stuff (and possibly even including the version of the component which most recently wrote the server record) I believe we could evolve the data stored on the server for a collection in a way that slightly old versions of the component still work correctly. I think it's OK to assume that all "active" apps which embed our components are on a reasonable update schedule, and that it's OK to lock out older apps which use very old versions of our components. So while we'd want to have a policy of something like "support 2 ESR releases ago", we'd explicitly avoid a goal like "support a 2 year old product which we've stopped updating"

  • I fear the deduping and merging may end up with fancy "generic" definitions which are, in practice, used exactly once. For example, let's say some engine wanted to store an entire URL, but de-dupe based on (say) the URL minus the scheme, or minus the query string. Or let's say that payments wanted to take the expiry date of a card into account and prefer the latest date. While we could extend the definition to handle that, I'm not sure that we should.

With the correct library to leverage, I think it would be fine to require collection-specific code in some cases - most notably deduping and merging. Concepts like deprecated and "merge": "take_min" are simple to express as "plain old code" and not particularly onerous - but critically, they also support edge-cases in a much simpler way - the merge code parsing a URL and stripping query strings is much simpler to write in a Rust function than to express in this data-definition system.
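For comparison, a rough sketch of the kind of collection-specific "plain old code" being described; the function name and the naive query-string handling are illustrative, not an existing API:

```rust
/// Sketch: a collection-specific dedupe key — the URL minus its query string.
/// Two records would be treated as duplicates when their keys match.
/// A real implementation would probably use a proper URL parser (e.g. the
/// `url` crate) rather than naive string splitting.
fn dedupe_key(record_url: &str) -> &str {
    record_url.split('?').next().unwrap_or(record_url)
}
```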

I guess a reasonable tl;dr of my position is:

  • We should definitely work out a way to support new collections in a way that doesn't require a hand-written SQL schema and hand-written schema upgrades.

  • But these new collections do not need to be implemented in a "zero-code" way - it's fine for new collections to have a small bit of Rust code which contains a generic way to define a flexible and extensible schema, and a way to implement a handful of functions for doing collection-specific stuff, such as merging and deduping.

  • It's great that this has been brought up now. At some point in the future we will be asked to do (say) credit-cards and addresses, and we will be asked to do that ASAP. That will be too late to try something like this. I wonder if it is worth experimenting with moving the logins store to this model? That will give us a real use-case and it will not matter if the idea fails.

  • That said though, I don't think we've got space to do this in this half - but as soon as Fenix is released with everything we need to do for that, I think space could be made.

Again, thanks for starting this valuable discussion and for such a well thought out RFC.

@jdragojevic jdragojevic added the ↕ Sync label Mar 4, 2019

thomcc added some commits Feb 12, 2019

Numerous updates
- Change name to `remerge`.
- Get rid of foreign_guid
- Explicitly write out a reference guide for the schema, and move most
  of what was spread out in the inlined rust code / comments to it.
    - Note: The schema format is going to be YAML, not JSON, essentially
      so that comments are possible. This is also in line with what e.g.
      glean uses.
    - This includes adding support for many richer types and such.
- Replace the whole metaschema_version stuff with explicit feature flag
  lists.
- Redesign new metadata locations so as not to require changes to existing
  clients.
- Punt on the idea of prerelease versions since that was a lot of
  complexity and wonkiness for little gain.
- List out explicit design non-goals.
- List out things that this design does not yet address but
  could/should, but that for the most part don't need to be addressed
  soon.

@thomcc thomcc force-pushed the generic-sync-rfc branch from e3f3b0e to 1e2ba10 Jun 27, 2019

@thomcc (Contributor, Author) commented Jun 27, 2019

I've just pushed up a major change to this based on a bunch of thoughts I've had since the last time I wrote it. The expanded commit message at 1e2ba10 lists all of them, but essentially:

  • I've given this system a name: "Remerge"

    • If anybody feels this name is bad, I'm open to alternatives, but I like the name.
  • I've cut a bunch of things that are problematic from the design.

    • I've removed ForeignGuid: While this was part of the motivation behind this proposal, for most real cases there's a better option, and this would never be as robust as using e.g. URL or some more stable identifier.
      • This means that the required server changes vanish, as do several special cases.
    • The special cases for prerelease versions are also gone, but I'd like some way of allowing completely breaking scorched-earth schema changes eventually.
  • The notion of metaschema_version has been replaced by feature flags. This is cleaner, easier to work with, and makes it more obvious when a change is going to lock out old clients.

  • The schema is now going to be in YAML and not JSON.

    • Why not JSON: These will be written by hand, JSON has no comment support (serde_json doesn't), and IMO is annoying to write by hand. I'm open to alternatives but I don't think it matters much.
    • Why YAML: It's pretty common, has a first-class serde impl, and is what e.g. glean schemas use. It's widely used at Mozilla, and most languages can parse YAML easily. (A rough parsing sketch follows at the end of this list.)
  • I've split the document into two files, one for the schema format and one for more or less everything else (I was going to go further, but that split would be easy to make later).

    • Note: most of the schema format document was written shortly after writing the initial PR.
      • I also have some old code from the same time to parse and validate a schema (as JSON, not YAML yet), which I'll update and push up when I have time, since it might make some of this a little more concrete.
    • The schema is much more fleshed out, possibly too fleshed out; some more of the cases could probably be eliminated.
      • Note: When I wrote most of it, I hadn't fully come up with the plan for how supporting new data types would work, and so it felt more important to support anything we could add without too much effort. Now that I have a better plan there (see the new section about extending remerge), I can probably remove several things.
      • That said, for the most part, the field types / options present in the schema format are things we already store in one way or another in some record.
        • Big exceptions: untyped_map (which would be desirable for a kinto-like use case) and record_set (which would be desirable for something like syncing containers origin info or anything that wants an inner array a la history visits).
  • I've explicitly listed out some important things that were previously implicit:

    • Many of the design tradeoffs made.
    • The 'big ideas' behind remerge, sorta.
    • List of design issues/questions that are deferred as future work because I don't expect them to matter in the short term and think they won't change things meaningfully.
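To make the schema discussion above a bit more concrete, here is a hedged sketch of what parsing such a YAML schema could look like. Every field name below is an illustrative guess, not the actual remerge schema format:

```rust
use serde::Deserialize;

/// Illustrative only: the fields below are guesses at the shape of a remerge
/// schema (current/required versions, feature flags, typed fields), not the
/// real format.
#[derive(Debug, Deserialize)]
struct RemergeSchema {
    /// The schema version this client writes.
    version: String,
    /// The oldest schema version a client must understand to participate;
    /// bumping this is what deliberately locks out old clients.
    required_version: String,
    /// Feature flags replacing the earlier `metaschema_version` idea.
    #[serde(default)]
    remerge_features_used: Vec<String>,
    /// Field definitions, each with a type and an optional merge strategy.
    fields: Vec<FieldDef>,
}

#[derive(Debug, Deserialize)]
struct FieldDef {
    name: String,
    #[serde(rename = "type")]
    kind: String,
    /// e.g. "take_min" (other policy names would be defined by the schema format).
    #[serde(default)]
    merge: Option<String>,
}

fn parse_schema(yaml_text: &str) -> Result<RemergeSchema, serde_yaml::Error> {
    serde_yaml::from_str(yaml_text)
}
```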
@thomcc (Contributor, Author) commented Jun 27, 2019

@mhammond Responded inline:

However, I'm a little skeptical that the entirety of this will offer net value. I'm concerned it will implement something that has more complexity than we need, and even then, may fail to handle edge-cases we find ourselves facing. I suspect Mentat would have put us in the same boat too, but that's a different conversation. IOW, I fear more pain than gain for our foreseeable future.

For something like this, there's a balance to be struck between ease of implementation and ease of use. I agree that mentat is probably too far on the side of implementation complexity, but I feel that this design is much simpler than Mentat. I expect it to be around the size of pre-bookmarks places, maybe on the order of 10kloc or so. (Famous last words, but I'd be very surprised if it were larger than places-with-bookmarks)

I also think that this design document makes it come off as a bit more complex than it would end up being. That's mostly because I've made an effort to spell everything out explicitly and handwave minimally. I also wrote some of the design concurrently with a rough WIP prototype, which I'll try to push up eventually.

In my updated version of this, I've been more clear about things we're not doing, so maybe it's easier to see why it's much simpler than mentat.

I believe we can come up with something that "roundtrips" most unknown data. If there's "important" new data, such as a new field used for de-duping, then I think this scheme still doesn't save us - we will know from the server data that the field exists and it is used for deduping, but we'll not have enough context to know what that value should be in our local store for our existing items.

I've tried to come up with something that is flexible enough to support the use cases we've seen or wanted in practice. Most of these are too complex to do with naive roundtripping.

For example:

  • dateAdded would be frequently wrong if handled this way.
  • Information about the device that created an item: we can't do this now but have always wanted it.
    • In remerge this would be any field which has a created_at timestamp (or even 'prefer_min' number) as its composite root, and it would essentially just work (sketched after this list).
  • Syncing the logins meta-information requires 3WM (three-way merge), and so any naive solution is going to be broken.
    • More broadly, doing a naive round trip with 3WM seems very tricky to me, if it's possible at all.
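As a minimal sketch of the contrast in the first bullet (names are illustrative): a declarative "prefer the minimum" merge for a creation timestamp is deterministic, while naive round-tripping effectively lets whichever client wrote last clobber the value:

```rust
/// Sketch: a declarative "take_min"-style merge for a creation-timestamp
/// field (e.g. dateAdded). The earliest value wins regardless of which
/// client happened to write last.
fn merge_created_at(local_ms: u64, remote_ms: u64) -> u64 {
    local_ms.min(remote_ms)
}

/// Sketch: what naive last-writer-wins round-tripping effectively does —
/// the most recent writer's value replaces the field, so the true (older)
/// creation time can be lost.
fn naive_last_writer_wins(_local_ms: u64, remote_ms: u64) -> u64 {
    remote_ms
}
```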

With some care, and simple round tripping of "unknown" stuff (and possibly even including the version of the component which most recently wrote the server record) I believe we could evolve the data stored on the server for a collection in a way that slightly old versions of the component still work correctly.

I broadly agree that we could do it, but I don't think having the sync team implement these has been successful. As far as I can tell, teams are very reluctant to talk to us about this, and tend to just use easy off the shelf solutions instead (which also means they don't have to wait on us to get started).

I think it's OK to assume that all "active" apps which embed our components are on a reasonable update schedule, and that it's OK to lock out older apps which use very old versions of our components. So while we'd want to have a policy of something like "support 2 ESR releases ago", we'd explicitly avoid a goal like "support a 2 year old product which we've stopped updating"

I agree. That's more or less explicitly one of the main ideas in remerge. We want to be able to break old clients without breaking new ones. This is why there are two version numbers (the current schema version and the required schema version).
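To illustrate the idea (a minimal sketch with made-up names, using plain integers where the real design presumably uses richer version strings):

```rust
/// Sketch: the "two version numbers" idea. A client keeps syncing as long as
/// it understands at least the schema version that remote writers require;
/// bumping the required version is what deliberately locks out old clients.
/// Names and the use of plain integers are illustrative assumptions.
fn can_participate(local_schema_version: u32, remote_required_version: u32) -> bool {
    local_schema_version >= remote_required_version
}
```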

I fear the deduping and merging may end up with fancy "generic" definitions which are, in practice, used exactly once. For example, let's say some engine wanted to store an entire URL, but de-dupe based on (say) the URL minus the scheme, or minus the query string. Or let's say that payments wanted to take the expiry date of a card into account and prefer the latest date. While we could extend the definition to handle that, I'm not sure that we should.

This is probably a reasonable complaint, and it does seem likely that new features we add for teams would be used only once in practice; but given that in a custom-code world all the code is used exactly once anyway, I don't think this is that much of a problem so long as the feature is vaguely sensible.

The alternative is either to have us write it (which I know you're arguing for, but see below on why I don't think that's good), or to allow some sort of programmability here, which would make this substantially more complex, to the point where I'd probably flip back to your side on it being not worth the effort (not to mention the security concerns).

Additionally, given how reluctant teams are to talk to us now, it's possible that the only things we'd fail to cover would be blatantly missing features.

With the correct library to leverage, I think it would be fine to require collection-specific code in some cases - most notably deduping and merging. Concepts like deprecated and "merge": "take_min" are simple to express as "plain old code" and not particularly onerous - but critically, they also support edge-cases in a much simpler way - the merge code parsing a URL and stripping query strings is much simpler to write in a Rust function than to express in this data-definition system.

I think they're still relatively easy to support in a generic way like this. The main cost relative to the specific approach is performance, not the amount of effort it takes, IMO.

But these new collections do not need to be implemented in a "zero-code" way - it's fine for new collections to have a small bit of Rust code which contains a generic way to define a flexible and extensible schema, and a way to implement a handful of functions for doing collection-specific stuff, such as merging and deduping.

  1. This keeps the dependency on us, which I think is a non-starter. AFAICT, teams don't want to have a dependency on another team for this. "Just talk to the sync team" clearly failed. An easy-to-use, 'off the shelf'-esque solution that teams can include without needing to talk to us would hopefully make using sync for new synced data easy.

  2. In practice I don't think new collections would be "a small bit of Rust code". Logins is simple and is still not very small. Additionally, an issue like #650 would be mostly trivial for us to fix in a schema update.

    We could probably factor logins into something more reusable instead, but I don't think enabling our team to more easily implement engines is as valuable as enabling other teams to do the same without needing to involve us.

It's great that this has been brought up now. At some point in the future we will be asked to do (say) credit-cards and addresses, and we will be asked to do that ASAP. That will be too late to try something like this. I wonder if it is worth experimenting with moving the logins store to this model? That will give us a real use-case and it will not matter if the idea fails.

Yeah, logins is in a lot of ways a better candidate for it than forms, which is unlikely to evolve much. That said, both would work.

@eoger eoger removed request for ncalexan and linacambridge Aug 22, 2019

@eoger (Contributor) commented Aug 22, 2019

Please re-tag for review once this is ready!
