
Improving Sync latency with FxA push notifications #1316

Closed
eoger opened this issue Jun 29, 2016 · 18 comments

@eoger
Contributor

commented Jun 29, 2016

In order to improve the latency of Sync, we propose sending push notifications to the other clients on the same account.
Let's start simple and do client-driven push notifications after each Sync.
We could probably do better in the future, but this is a fast and easy way to validate that the Sync clients' behavior remains consistent after this change.
For starters, let's send this notification only for the tabs and the clients collections, which are less prone to conflicts.

[Diagram: sync_push_architecture]

We already have a push payload definition for the collectionChanged push message.
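For reference, roughly the shape of that message (sketched from memory; exact field names may differ):

```ts
// Illustrative only: the rough shape of a collectionChanged push message.
interface CollectionChangedPayload {
  version: number;          // payload schema version
  command: string;          // e.g. "sync:collection_changed"
  data: {
    collections: string[];  // e.g. ["clients", "tabs"]
  };
}

const example: CollectionChangedPayload = {
  version: 1,
  command: "sync:collection_changed",
  data: { collections: ["clients", "tabs"] },
};
```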

The endpoint path is pretty verbose, but I'd like to avoid using a generic /devices/notify endpoint and ending up with the same body as the push payloads. I'm open to suggestions.

cc @MarkH @rnewman @rfk @ckarlof @kitcambridge @vladikoff

@rnewman

commented Jun 29, 2016

Let me repeat this back to make sure I'm understanding you.

The client that just synced is the originating client. It makes sense for this client to trigger the notification, because only that client knows when it's done syncing and whether it has uploaded anything that's worth notifying about.

FxA's device manager keeps track of the set of clients. The originating client can ask an identity-attached service to notify other clients (which implies that it knows/sends along its own identifier).

(That service presumably has abuse-handling rate/payload/etc. limits.)

The notification service follows some notification scheme to decide how to dispatch the notifications to the other clients. The most trivial scheme is to immediately send the notification to all clients.

Does that make sense?

My concerns/observations/suggestions:

  • It seems possible that this will result in either (a) inconsistent behavior or (b) sync loops. Client A syncs tabs. Client B gets notified. Client B syncs tabs and notifies. Etc. — or B syncs and doesn't notify.
  • This could dramatically increase the rate at which devices sync. For example, a user bookmarking a bunch of tabs might sync up 10 times in 5 minutes, while her other devices sync down once. With notifications, her other devices will sync down 10 times, or have to decide whether to ignore some notifications. Rate limiting — on sender, receiver, and perhaps the notification server itself — must be an integral part of the design and implementation (a receiver-side sketch follows this list).
  • Furthermore, this could dramatically increase redundant data usage, particularly on mobile platforms. The tabs record format is not suited to incremental changes. If triggered syncs are bidirectional, then devices will exchange showers of redundant all-tab bundles as fast as the user can browse (and rate limiting allows). I would be very, very leery of deploying this into production without replacing the tabs object format.
  • Syncing all other clients at the same time will increase the likelihood that those clients will race. Syncing only tabs and clients will help, but even clients is a little uncertain — clients write each other's records in order to send tabs, and so a write should ideally result in a targeted notification to the receiving client, not a broadcast, and a race is still possible when multiple clients send commands to each other.
  • Client code must be prepared to handle the usual terrifying range of potential server states even if they come to sync only tabs. For example, they might find a changed syncID or a server reassignment, which would ordinarily result in a complete re-sync. It's important to think carefully about what exactly a client should do in this case, particularly given that the odds of arriving at that blank server at the same time as another device will be much higher with naïve notification schemes.
  • Consider special-case behavior for one- and two-device constellations — these are the most common, the safest, and efficiency wins here will be worthwhile. Doing this either puts logic in the notification system or requires clients to have the same device knowledge as the notification system.
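To make the receiver-side piece of the rate-limiting point concrete, here is a minimal sketch of a client throttling push-triggered syncs (the interval and names are made up):

```ts
// Hypothetical receiver-side throttle: ignore push-triggered syncs that
// arrive too soon after the last one; a timed sync will catch up later.
const MIN_PUSH_SYNC_INTERVAL_MS = 60 * 1000; // made-up value
let lastPushTriggeredSync = 0;

function onCollectionChangedPush(triggerSync: () => void): void {
  const now = Date.now();
  if (now - lastPushTriggeredSync < MIN_PUSH_SYNC_INTERVAL_MS) {
    return; // drop (or coalesce) this notification
  }
  lastPushTriggeredSync = now;
  triggerSync();
}
```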
@rfk

Member

commented Jul 5, 2016

The endpoint path is pretty verbose, but I'd like to avoid using a
generic /devices/notify endpoint and ending up with the same body as the push payloads.

@eoger out of curiosity, why do you want to avoid that?

Modulo @rnewman's concerns, the proposal here seems reasonable to me, and we could even go one step further and only notify of changes to the clients collection, while we get a feel for how the whole system will operate.

Rate limiting — on sender, receiver, and perhaps the notification server itself
— must be an integral part of the design and implementation.

Server-side, I think it's reasonable to limit each account to X notifications in Y time period, and return our existing "attempt limit exceeded" error code if this rate is exceeded.

I'm not generally in favour of building lots of special-case smarts into the server, e.g. trying to debounce certain types of messages or to cause devices to sync at slightly different times. We may find that it's the most convenient place to put them, but I'd like to avoid it if possible.

requires clients to have the same device knowledge as the notification system.

This seems a reasonable requirement to me, and they can trivially get this information from the FxA API.

@eoger

Contributor Author

commented Jul 6, 2016

OK for the clients collection only.

Some ideas I threw around this morning.

  • Rate limiting/data usage concerns
    • My idea is to do all of this on the receiving side depending on context
    • On server: Just enough to stop malicious usage
    • On desktop clients: No limitations
    • On mobile clients: Set a flag and sync only when the app is brought to the foreground? Sync depending on Wi-Fi and/or battery level/charging?
      We'll have to sync up with the mobile folks to learn about best practices.
  • Node reassignment problem
    • Sync after a random delay of 0 <= X <= 2 seconds after receiving a notification in order to reduce collisions (they will still happen). This will be mitigated in some cases by the next item.
  • Targeted notifications
    • Make the FxA endpoint a dumb push relay only? (almost no logic, just a generic /devices/notify and in the body a payload and a list of clients to send the message to)
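As a strawman, the request to such a relay could look something like this (the shape is a proposal, not an agreed API):

```ts
// Proposed (not final) body for a generic POST /devices/notify relay: the
// server validates the shape, rate-limits, and forwards the payload to the
// listed devices' push endpoints without interpreting it.
interface DevicesNotifyRequest {
  to: string[] | "all";   // target device ids, or every device on the account
  excluded?: string[];    // e.g. the sender's own device id
  payload: object;        // opaque to the server, e.g. a collectionChanged message
  TTL?: number;           // push time-to-live in seconds
}

const body: DevicesNotifyRequest = {
  to: "all",
  excluded: ["senderDeviceId"],
  payload: { version: 1, command: "sync:collection_changed", data: { collections: ["clients"] } },
  TTL: 60,
};
```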
@rnewman

commented Jul 6, 2016

IIRC APNS will start to silently drop your pushes on the floor if you send too many. And besides, push is a battery and bandwidth drain vector. I don't think it's sufficient to rate limit on the receiver: a bug in one client should not subject every other client to a firehose.

Neither is it really fair to assume that desktop clients have no limitations. My cellphone gets better data throughput, and has a higher data cap, than desktop users in many parts of the US, let alone the developing world.

(For the record, I am a mobile folk 😄)

For rate limiting I'd suggest:

  • Senders should try not to send a ridiculous number of pushes. If the user is sending tab after tab, after one or two syncs we should back off on our notifications to effectively coalesce them.
  • The server should have limits for malicious senders and also a sanity per-account or per-destination rate limit.
  • Clients should limit how often they take expensive actions like syncing in response to push notifications, just in case.
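As a sketch of the sender-side coalescing (the window length and names are made up):

```ts
// Hypothetical sender-side coalescing: the first sync in a burst notifies
// immediately; later syncs in the same window only mark a pending flag, and
// a single trailing notification is sent when the window closes.
const COALESCE_WINDOW_MS = 2 * 60 * 1000; // made-up value
let windowTimer: ReturnType<typeof setTimeout> | null = null;
let pendingNotify = false;

function notifyAfterSync(sendPush: () => void): void {
  if (windowTimer === null) {
    sendPush(); // first sync in a while: notify right away
    windowTimer = setTimeout(() => {
      windowTimer = null;
      if (pendingNotify) {
        pendingNotify = false;
        sendPush(); // one notification covers the whole burst
      }
    }, COALESCE_WINDOW_MS);
  } else {
    pendingNotify = true; // coalesce instead of pushing for every sync
  }
}
```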

A delay isn't really a solution to the node reassignment (blank server) problem. Indeed, I'd hesitate to say there's a bullet-point solution to this kind of thing. Sync is a complex system, and it requires systems thinking at each level.

The safest thing to do, perhaps, is to have a policy that ordinary push-triggered syncs aggressively abort if the server isn't in the same known-good state, and don't even start if we're not in a good state. For example, if we find meta/global to have changed when we are asked to sync the clients collection, we probably (a) ought to wait until the server stops churning, (b) ought to re-sync tabs, too, (c) likely need to take some kind of local reset action. It seems safe to just 'nope' right out of there and come back five minutes later.
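As a sketch of that policy (all names hypothetical), a push-triggered sync would bail out early unless the server still looks exactly as we last left it:

```ts
// Hypothetical guard: only run a push-triggered sync against a server in the
// last-known-good state; otherwise abort and let the next scheduled full
// sync handle the syncID change or node reassignment with the usual machinery.
interface KnownServerState {
  syncID: string;          // from meta/global
  storageVersion: number;  // from meta/global
  nodeURL: string;         // detects node reassignment
}

function shouldRunPushTriggeredSync(
  known: KnownServerState,
  current: KnownServerState
): boolean {
  return (
    known.syncID === current.syncID &&
    known.storageVersion === current.storageVersion &&
    known.nodeURL === current.nodeURL
  );
}
```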

Alternatively, it might be time to really consider heuristics for "should I sync now?" — this problem can and does arise during timed syncs, too.

@rfk

Member

commented Jul 7, 2016

Make the FxA endpoint a dumb push relay only? (almost no logic,
just a generic /devices/notify and in the body a payload and a list of clients to send the message to)

FWIW, this feels right to me - the server will check the message for well-formedness and apply some coarse rate-limiting to prevent accidental or malicious abuse, but otherwise not impose any semantics on the message.

The list of clients can be optional, defaulting to all connected devices.

The server should have limits for malicious senders and also
a sanity per-account or per-destination rate limit.

Server-side rules of the form "no more than X messages of type Y in any Z-second time period" seem useful and relatively clean for us to enforce; does that seem sufficient? We can tweak the values of X, Y and Z in config as needed.
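As an illustration of how those rules might be expressed and checked (the values and shapes are made up):

```ts
// Illustrative only: "no more than X messages of type Y per account in any
// Z-second window", with X/Y/Z coming from config. Real enforcement would
// sit alongside the auth server's existing rate-limiting machinery.
interface RateRule { messageType: string; max: number; windowSeconds: number; }

const rules: RateRule[] = [
  { messageType: "sync:collection_changed", max: 30, windowSeconds: 60 }, // made-up values
];

const counters = new Map<string, { windowStart: number; count: number }>();

function allowNotify(accountId: string, messageType: string, now = Date.now()): boolean {
  const rule = rules.find((r) => r.messageType === messageType);
  if (!rule) return true; // no rule configured for this message type
  const key = `${accountId}:${messageType}`;
  const entry = counters.get(key);
  if (!entry || now - entry.windowStart > rule.windowSeconds * 1000) {
    counters.set(key, { windowStart: now, count: 1 });
    return true;
  }
  if (entry.count >= rule.max) {
    return false; // caller maps this to the existing "limit exceeded" error
  }
  entry.count++;
  return true;
}
```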

this problem can and does arise during timed syncs, too.

Right, we're not creating a new problem here, we're just increasing the likelihood that it'll trigger, and trigger in some user-observable way.

@mhammond

Member

commented Jul 7, 2016

Sorry I'm late to the party here - I'm not getting emails for this issue - hopefully this comment will fix that :)

ISTM that many of these problems re sync-loops etc. could be mitigated by having clients only process incoming items when they receive the notification. I can't see a reason it is unsafe to skip the upload step (and if there is, I'm sure we'd already hit it fairly regularly when clients shut down at inopportune times). Clients would continue to use their existing schedule for the next "full" sync. Thus, receipt of a push message should never cause a write, and the amount of data transferred shouldn't increase dramatically (the incoming-only sync advances the last-modified time, so it's not as though that incoming data is going to be processed twice).
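A rough sketch of how a push handler might do that (the downloadOnly option is hypothetical; the existing engines don't expose it in this form):

```ts
// Hypothetical: react to a push by applying incoming records only, leaving
// uploads to the existing sync schedule so the push can never cause a write.
interface Engine {
  sync(options?: { downloadOnly?: boolean }): Promise<void>;
}

async function onPush(engines: Map<string, Engine>, collections: string[]): Promise<void> {
  for (const name of collections) {
    const engine = engines.get(name);
    if (!engine) continue;
    await engine.sync({ downloadOnly: true }); // no upload, so no notify loop
  }
}
```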

Is there any reason that wouldn't work and avoid many of the problems (apart from rate limiting and mobile device battery usage) here?

@rnewman

commented Jul 7, 2016

In the abstract, incoming-only syncs would be a solution.

There are a few reasons why they're less suitable in the concrete:

  1. For clients, as currently specified, there's no such thing as an incoming-only clients sync: when a client processes its own record due to a command, it needs to re-upload its record with any unprocessed commands. If it doesn't, it'll subsequently see duplicate commands, and do things like wipe twice, or open a sent tab twice. (I think we saw this recently after a patch broke reupload.)
  2. For other collections, behavior in the presence of multiple inbound syncs depends on the implementation, mostly on desktop and Android. iOS keeps more state around, which helps.

(a) There are likely to be some rough edges around advancing timestamps that we'd hit if changes weren't uploaded, particularly on Android. That's solvable, but might make this a more costly change.

(b) Any engine that keeps state in memory between download and upload, of course, is buggy in this case. (And buggy in the general case.)

(c) This would turn an edge-case merge flow into a common flow. Consider two clients, A and B, both working on a single record, starting in state T.

B: T -> U. Uploads U.
A: T -> V.
A: Downloads U. Merges U and V to produce W. Does not upload.
B: U -> X. Uploads.
A: Downloads X. Merges W (a local-only record) and X to produce Y.

You can imagine a worse variant, too, where another change occurs subsequently:

A: Y -> Z.

That is: A has produced a new state by mutating one that no other client even got to see, which itself was the result of a merge with a state that no other client got to see. That's unusual.

The common flow, which the engines are written to assume, is something like:

B: V -> W. Up.
A: V -> X.
A: Downloads W. Merges to Y. Up.
B: Downloads Y. Merges X (former shared head) with Y.  Up.

We assume that if an upload fails once, it's likely to succeed before another client makes a change to the same record. (This is the tower of "it'll be alright in the end" that Sync is built on!)

So usually a shared head is only one sync step away, a resolved conflict is always shared and applied on the other client, and a delta doesn't really stick around to accrue more local changes.

The more hops in this DAG, the more likely we are to get divergence and data loss: a subsequent merge step no longer has the true shared parent, and instead we're building on assumptions. It's like one of those draw-a-monster games where you fold the paper down as you draw each section of the body.

This is not to say that repeated incremental unidirectional syncs for most collections are impossible. I think they're relatively safe on iOS, and probably completely safe for history (for the usual values of 'safe' for that collection) elsewhere. But I wouldn't expect behavior to be the same in all cases, and I don't think we'd achieve quite the same level of robustness without further analysis and perhaps more changes.

@rfk

Member

commented Jul 8, 2016

The immediate driver for this work is send-tab-to-device, which means the minimal thing we can ship is "a push message causes a device to download and correctly process its sync client commands".

IIUC that means we have to resolve the thorny issue of re-uploading your clients collection record, but we can continue kicking the can of worms down the road for other datatypes at this stage. (Which I don't enjoy doing, but...)

@mhammond

Member

commented Jul 8, 2016

(I think we saw this recently after a patch broke reupload.)

FTR, the problem there was that we re-uploaded the commands we just processed without clearing them.

(b) Any engine that keeps state in memory between download and upload, of course, is
buggy in this case. (And buggy in the general case.)

Yeah - and actually that "clients" code is buggy in exactly that way :( Tangentially, in some cases that could be considered a feature for the clients collection (eg, incoming tab received as we shut down; re-processing that command for users without session restore makes sense) and arguably OK for others (resetting client state immediately before shutdown and immediately after next startup probably does no harm) - but it's poorly defined and thus quite fragile.

(c) This would turn an edge-case merge flow into a common flow.

Sync has enough scale that IMO edge-cases should still be considered common. If we do bad things when a Sync is interrupted between download and upload, those bad things probably happen many times per day.

B: T -> U. Uploads U.
A: T -> V.
A: Downloads U. Merges U and V to produce W. Does not upload.
B: U -> X. Uploads.
A: Downloads X. Merges W (a local-only record) and X to produce Y.

That sounds like what could be a common scenario now:
B: T -> U. Doesn't sync yet (small score increment, then laptop put to sleep)
A: T -> V. Syncs.
B: Downloads V. Merges U (a local-only record) and V to produce Y.

The common flow, which the engines are written to assume, is something like:

B: V -> W. Up.

Not really - the entire "score" mechanism means it's more like:

B: V -> W. Maybe upload, maybe not.

While clients and bookmarks try and mitigate this by having every change bump the score such that a sync should start immediately, there are plenty of scenarios where that's not going to happen in practice, and as above, this must therefore happen many times per day.

Anyway, this is getting a little abstract given:

IIUC that means we have to resolve the thorny issue of re-uploading your clients collection record, but we can continue kicking the can of worms down the road for other datatypes at this stage. (Which I don't enjoy doing, but...)

Agreed. Let's just focus on the client collection record here (while still taking the edge-cases into account)

Off the top of my head, that doesn't sound too bad: we only send the push notification when uploading a client record other than our own. Thus, the receiver of the push notification will not again re-send a push notification as it will only update its own record while processing the push. This in turn implies to me that (a) the message isn't a generic "I synced something" but a more specific "I wrote a new client record" and (b) the API doesn't offer "send this to every device" but instead insists on the specific device IDs. (Well, maybe (b) isn't strictly necessary, but the client would never make use of it.)
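A sketch of that rule (all names hypothetical): only notify the specific devices whose client records we just wrote.

```ts
// Hypothetical hook at the end of a clients-engine sync: if we uploaded
// records for other clients (e.g. to deliver commands), push only to them.
async function afterClientsSync(
  uploadedRecordIds: string[],
  ownClientId: string,
  notify: (deviceIds: string[]) => Promise<void>
): Promise<void> {
  const targets = uploadedRecordIds.filter((id) => id !== ownClientId);
  if (targets.length === 0) {
    return; // we only touched our own record (processed commands): no push
  }
  await notify(targets);
}
```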

@rnewman

commented Jul 8, 2016

… arguably OK for others (resetting client state immediately before shutdown and immediately after next startup probably does no harm) - but it's poorly defined and thus quite fragile.

Yeah. This is kind of inevitable when syncing and remote control get bolted on to a product very late in its life.

If we do bad things when a Sync is interrupted between download and upload, those bad things probably happen many times per day.

I quite agree… but I'm in favor of not upping the frequency of tickling that tiger without careful thought 😄

Thus, the receiver of the push notification will not again re-send a push notification as it will only update its own record while processing the push.

In most cases, yes. However, care must be taken in two respects:

Firstly, causing the clients engine to sync can also upload queued commands to other clients. That's probably okay, but a bug might cause this to not converge.

(Future repro: send a bunch of tabs from Android or iOS while offline. Reconnect to the internet. Send a push after sending a tab to that device.)

You could consider a download-only clients sync as the narrowest option.

Secondly, races in clients are likely uncommon right now: not only are there not many Send Tab users, but the windows of operation are typically non-overlapping. Client A sends three tabs to B, one at a time; client B syncs minutes later and reuploads its blank record without drama. Push-triggered syncs bring those windows into overlap: A's second tab send will occur while B is syncing. Clients should be safe here — I think they all use X-I-U-S for reuploading their record — but it would be worth checking what the behavior is. I imagine that desktop might drop outbound commands on the floor, and Android will open tabs multiple times. I don't know if failure will abort a sync.
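For reference, a sketch of what an X-I-U-S-guarded reupload looks like against the storage API (the header name is from the storage protocol; everything else here is illustrative):

```ts
// Illustrative conditional reupload of our own clients record: the write is
// rejected with HTTP 412 if someone else modified the record since we
// downloaded it, e.g. another device queued a new command for us mid-sync.
async function reuploadClientRecord(
  recordURL: string,    // e.g. ".../storage/clients/<ourClientId>"
  record: object,
  lastModified: string, // the timestamp we saw when we downloaded it
  authHeader: string
): Promise<"ok" | "conflict"> {
  const resp = await fetch(recordURL, {
    method: "PUT",
    headers: {
      Authorization: authHeader,
      "Content-Type": "application/json",
      "X-If-Unmodified-Since": lastModified,
    },
    body: JSON.stringify(record),
  });
  if (resp.status === 412) {
    return "conflict"; // re-download, merge commands, and try again
  }
  if (!resp.ok) {
    throw new Error(`upload failed: ${resp.status}`);
  }
  return "ok";
}
```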

@mhammond

Member

commented Jul 11, 2016

but I'm in favor of not upping the frequency of tickling that tiger without careful thought 😄

I'm a little more reckless than that - we know the tiger is being tickled but just never see it happen. I've no problem with changes we want to make having a side-effect of tickling him a little more - that's probably the only way we will actually see what he does :) I'll be even more gung-ho once we have telemetry reporting validation results ;)

(Future repro: send a bunch of tabs from Android or iOS while offline. Reconnect to the internet. Send a push after sending a tab to that device.)

Can you elaborate on what badness would happen there? Isn't that exactly what we want to happen?

Secondly, races in clients are likely uncommon right now:

Yes, I think that's probably correct. Not only will desktop fail to "track" command changes, it will indeed throw already-processed commands away if the upload fails (ie, I think Desktop will open tabs multiple times in this scenario.)

Sadly I think bad things still play out in a "download-only client sync" - client B downloads its record but doesn't update it on the server and waits for the next "scheduled" sync. If client B terminates or otherwise fails to complete that future Sync, bad things still happen (I think desktop will re-open those tabs next Sync.) So ISTM that either way, we need to make Desktop and Android more robust in this case (and further, doing so should mean we are robust in a download-only client sync, or a client sync that does re-write its record.)

Specifically for Desktop, I think it means we need to persist the local state of the local record, ensure that state is only cleared after a successful upload, and do something sensible WRT merging a client record changed remotely since we last read our version (eg, that "local state" should probably contain a full copy of the remote record we read before we processed and locally removed incoming commands.)
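A rough sketch of that discipline (all names hypothetical):

```ts
// Hypothetical flow: snapshot the downloaded record and the commands we are
// about to process, and only clear that state after a successful reupload.
interface PendingClientsState {
  remoteRecordSnapshot: object;   // the record exactly as we downloaded it
  processedCommandIds: string[];  // commands we have executed locally
}

async function processClientCommands(
  store: { save(s: PendingClientsState): Promise<void>; clear(): Promise<void> },
  download: () => Promise<{ record: object; commandIds: string[] }>,
  executeCommands: (ids: string[]) => Promise<void>,
  reupload: (record: object, withoutCommandIds: string[]) => Promise<boolean>
): Promise<void> {
  const { record, commandIds } = await download();
  await store.save({ remoteRecordSnapshot: record, processedCommandIds: commandIds });
  await executeCommands(commandIds);
  const uploaded = await reupload(record, commandIds);
  if (uploaded) {
    await store.clear(); // only now is it safe to forget; otherwise retry next sync
  }
}
```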

@vladikoff

Member

commented Jul 11, 2016

cc @rfk to triage this bug...

@rfk

Member

commented Jul 13, 2016

This is assigned to me to turn into a concrete feature proposal, and probably migrate to the new https://github.com/mozilla/fxa-features repo.

@rfk

Member

commented Jul 18, 2016

This is assigned to me to turn into a concrete feature proposal,

On reflection, I'm not sure this is big enough to warrant a full-blown feature card etc on the FxA side. I suggest that we scope this issue down to just "implement the device -> device notification pipe" which represents the FxA server's responsibilities here, and move the rest of the discussion above into e.g. the client-side metabug or similar.

@eoger are you planning on and/or interested in doing the server-side work in fxa-auth-server to implement the /devices/notify endpoint?

@rfk

Member

commented Jul 18, 2016

Also:

I'd like to avoid using a generic /devices/notify endpoint and ending up with
the same body as the push payloads.

It sounds like we actually came around to implementing something fairly close to this. @eoger, I'd love to hear your thoughts on potential downsides of doing it this way before we get too far down that road.

rfk assigned eoger and unassigned rfk on Jul 18, 2016

@eoger

Contributor Author

commented Jul 19, 2016

We discussed this with Chris today; let's ship this push-proxy version of /devices/notify. We can do payload schema validation so the server doesn't have to do anything really smart.
I will do the server-side work.
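A minimal sketch of the kind of structural validation I have in mind (it mirrors the request shape sketched earlier in the thread; not the final schema):

```ts
// Illustrative structural check for POST /devices/notify; anything that
// doesn't match the expected shape is rejected before any push is sent.
// Returns an error message, or null if the body is acceptable.
function validateNotifyBody(body: any): string | null {
  if (typeof body !== "object" || body === null) return "body must be an object";
  const toOk =
    body.to === "all" ||
    (Array.isArray(body.to) && body.to.every((id: unknown) => typeof id === "string"));
  if (!toOk) return "'to' must be \"all\" or an array of device ids";
  if (
    body.excluded !== undefined &&
    !(Array.isArray(body.excluded) && body.excluded.every((id: unknown) => typeof id === "string"))
  ) {
    return "'excluded' must be an array of device ids";
  }
  if (typeof body.payload !== "object" || body.payload === null) return "'payload' must be an object";
  if (body.TTL !== undefined && !(Number.isInteger(body.TTL) && body.TTL >= 0)) {
    return "'TTL' must be a non-negative integer";
  }
  return null;
}
```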

@rfk

Member

commented Jul 20, 2016

@eoger given #1357, what would you like to do with the rest of the discussion and context in this issue? I don't think it makes sense to keep it alive as an engineering issue in this repo, we could move it to a wiki or feature doc somewhere, or we could just close the issue and link back to it for reference. Thoughts?

@vladikoff

Member

commented Jul 25, 2016

from mtg: closing in favour of #1357
