Improving Sync latency with FxA push notifications #1316
In order to improve the latency of Sync, we propose sending push notifications to the other clients on the same account.
We already have a push payload definition for the collectionChanged push message.
The endpoint path is pretty verbose, but I'd like to avoid using a generic /devices/notify endpoint and ending up with the same body as the push payloads. I'm open to suggestions.
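For reference, the collectionChanged payload is roughly shaped like this (sketched as a TypeScript type; the exact field names are from memory and should be checked against the actual payload definition rather than taken from here):

```ts
// Approximate shape of the collectionChanged push message referenced above.
// Field names should be verified against the real schema.
interface CollectionChangedPayload {
  version: number; // payload schema version
  command: "sync:collection_changed";
  data: {
    collections: string[]; // which collections changed, e.g. ["clients"]
  };
}

const example: CollectionChangedPayload = {
  version: 1,
  command: "sync:collection_changed",
  data: { collections: ["clients"] },
};
```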
Let me repeat that back to make sure I'm understanding you.
The client that just synced is the originating client. It makes sense for this client to trigger the notification, because only that client knows when it's done syncing and whether it has uploaded anything that's worth notifying about.
FxA's device manager keeps track of the set of clients. The originating client can ask an identity-attached service to notify other clients (which implies that it knows/sends along its own identifier).
(That service presumably has abuse-handling rate/payload/etc. limits.)
The notification service follows some notification scheme to decide how to dispatch the notifications to the other clients. The most trivial scheme is to immediately send the notification to all clients.
Does that make sense?
@eoger out of curiosity, why do you want to avoid that?
Modulo @rnewman's concerns, the proposal here seems reasonable to me, and we could even go one step further and only notify of changes to the clients collection, while we get a feel for how the whole system will operate.
Server-side, I think it's reasonable to limit each account to X notifications in Y time period, and return our existing "attempt limit exceeded" error code if this rate is exceeded.
I'm not generally in favour of building lots of special-case smarts into the server, e.g. trying to debounce certain types of messages or to cause devices to sync at slightly different times. We may find that it's the most convenient place to put them, but I'd like to avoid it if possible.
This seems a reasonable requirement to me, and they can trivially get this information from the FxA API.
OK for the clients collection only.
Some ideas I threw around this morning.
IIRC APNS will start to silently drop your pushes on the floor if you send too many. And besides, push is a battery and bandwidth drain vector. I don't think it's sufficient to rate limit on the receiver: a bug in one client should not subject every other client to a firehose.
Neither is it really fair to assume that desktop clients have no limitations. My cellphone gets better data throughput, and has a higher data cap, than desktop users in many parts of the US, let alone the developing world.
(For the record, I am a mobile folk.)
For rate limiting I'd suggest:
A delay isn't really a solution to the node reassignment (blank server) problem. Indeed, I'd hesitate to say there's a bullet-point solution to this kind of thing. Sync is a complex system, and it requires systems thinking at each level.
The safest thing to do, perhaps, is to have a policy that ordinary push-triggered syncs aggressively abort if the server isn't in the same known-good state, and don't even start if we're not in a good state. For example, if we find mid-sync that the server has been wiped or we've been reassigned to a blank node, we should stop rather than plow ahead.
Alternatively, it might be time to really consider heuristics for "should I sync now?" — this problem can and does arise during timed syncs, too.
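To sketch what such a gate might look like (TypeScript; the state fields are assumptions about what a client already tracks, not an existing API):

```ts
// A minimal "should I sync now?" gate for push-triggered syncs. The fields
// are hypothetical stand-ins for state a client already keeps.
interface LocalSyncState {
  hasValidCredentials: boolean;    // we can authenticate without user action
  serverMatchesLastKnown: boolean; // e.g. meta/global syncID is what we last saw
  backoffUntilMs: number;          // epoch ms; 0 if no server-requested backoff
}

function shouldStartPushTriggeredSync(state: LocalSyncState, nowMs: number): boolean {
  if (!state.hasValidCredentials) return false;   // don't even start in a bad state
  if (nowMs < state.backoffUntilMs) return false; // honor server-requested backoff
  // Aggressively bail if the server isn't in the same known-good state, e.g.
  // after a node reassignment has left us pointing at a blank server.
  return state.serverMatchesLastKnown;
}
```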
FWIW, this feels right to me - the server will check for well-formedness of the message and apply some coarse rate-limiting to prevent accidental or malicious abuse, but otherwise not impose any semantics on the message.
The list of clients can be optional, defaulting to all connected devices.
Server-side rules of the form "no more than X messages of type Y in any Z-second time period" seem useful and relatively clean for us to enforce; does that seem sufficient? We can tweak the values of X, Y and Z in config as needed.
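To make that rule concrete, here's a minimal sketch of a per-account, per-message-type limiter (TypeScript; the class name and in-memory storage are illustrative, and a real deployment would presumably reuse the existing rate-limiting infrastructure and return the "attempt limit exceeded" error mentioned above):

```ts
// "No more than X messages of type Y in any Z seconds", keyed per account.
// X and Z would come from config, as suggested above.
class NotificationRateLimiter {
  private sent = new Map<string, number[]>(); // key -> send timestamps (ms)

  constructor(private maxMessages: number, private windowSeconds: number) {}

  allow(accountId: string, messageType: string, now = Date.now()): boolean {
    const key = `${accountId}:${messageType}`;
    const cutoff = now - this.windowSeconds * 1000;
    const recent = (this.sent.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.maxMessages) {
      this.sent.set(key, recent);
      return false; // caller maps this to the existing error code
    }
    recent.push(now);
    this.sent.set(key, recent);
    return true;
  }
}
```

Keying on both account and message type means a buggy client hammering one message type doesn't necessarily starve other types on the same account.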
Right, we're not creating a new problem here, we're just increasing the likelihood that it'll trigger, and trigger in some user-observable way.
Sorry I'm late to the party here - I'm not getting emails for this issue - hopefully this comment will fix that :)
ISTM that many of these problems re sync-loops etc. could be mitigated by having clients only process incoming items when they receive the notification - I can't see a reason it's unsafe to skip the upload step (and if there is one, I'm sure we'd already hit it fairly regularly when clients shut down at inopportune times). Clients would continue to use their existing schedule for the next "full" sync. Thus, receipt of a push message should never cause a write, and the amount of data transferred shouldn't increase dramatically (the incoming-only sync advances the last-modified time, so that incoming data isn't going to be processed twice).
Is there any reason that wouldn't work and avoid many of the problems (apart from rate limiting and mobile device battery usage) here?
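As a rough sketch of that incoming-only flow (TypeScript; the engine interface is hypothetical, not an existing API):

```ts
// Incoming-only sync as proposed above: apply remote records and advance the
// last-modified high-water mark, but skip the upload phase entirely.
interface Engine {
  lastModified: number; // server timestamp of the newest record we've applied
  fetchChangedSince(ts: number): Promise<{ records: object[]; newTimestamp: number }>;
  applyIncoming(records: object[]): Promise<void>;
}

async function incomingOnlySync(engine: Engine): Promise<void> {
  const { records, newTimestamp } = await engine.fetchChangedSince(engine.lastModified);
  await engine.applyIncoming(records);
  // Advancing lastModified means the next full sync won't download (or
  // process) these records a second time; outgoing changes simply wait for
  // the regular schedule.
  engine.lastModified = newTimestamp;
}
```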
In the abstract, incoming-only syncs would be a solution.
There are a few reasons why they're less suitable in the concrete:
(a) There are likely to be some rough edges around advancing timestamps that we'd hit if changes weren't uploaded, particularly on Android. That's solvable, but might make this a more costly change.
(b) Any engine that keeps state in memory between download and upload, of course, is buggy in this case. (And buggy in the general case.)
(c) This would turn an edge-case merge flow into a common flow. Consider two clients,
You can imagine a worse variant, too, where another change occurs subsequently:
The common flow, which the engines are written to assume, is something like: download remote changes, merge them with local state, upload the merged result, and end the sync with both clients on the same shared head.
We assume that if an upload fails once, it's likely to succeed before another client makes a change to the same record. (This is the tower of "it'll be alright in the end" that Sync is built on!)
So usually a shared head is only one sync step away, a resolved conflict is always shared and applied on the other client, and a delta doesn't really stick around to accrue more local changes.
The more hops in this DAG, the more likely we are to get divergence and data loss: a subsequent merge step no longer has the true shared parent, and instead we're building on assumptions. It's like one of those draw-a-monster games where you fold the paper down as you draw each section of the body.
This is not to say that repeated incremental unidirectional syncs for most collections are impossible. I think they're relatively safe on iOS, and probably completely safe for history (for the usual values of 'safe' for that collection) elsewhere. But I wouldn't expect behavior to be the same in all cases, and I don't think we'd achieve quite the same level of robustness without further analysis and perhaps more changes.
The immediate driver for this work is send-tab-to-device, which means the minimal thing we can ship is "a push message causes a device to download and correctly process its sync client commands".
IIUC that means we have to resolve the thorny issue of re-uploading your clients collection record, but we can continue kicking the can of worms down the road for other datatypes at this stage. (Which I don't enjoy doing, but...)
FTR, the problem there was that we re-uploaded the commands we just processed without clearing them.
Yeah - and actually that "clients" code is buggy in exactly that way :( Tangentially, in some cases that could be considered a feature for the clients collection (eg, an incoming tab received as we shut down; re-processing that command for users without session restore makes sense) and arguably OK for others (resetting client state immediately before shutdown and immediately after the next startup probably does no harm) - but it's poorly defined and thus quite fragile.
Sync has enough scale that IMO edge-cases should still be considered common. If we do bad things when a Sync is interrupted between download and upload, those bad things probably happen many times per day.
That sounds like it could be a common scenario now:
Not really - the entire "score" mechanism means it's more like:
While clients and bookmarks try to mitigate this by having every change bump the score so that a sync starts immediately, there are plenty of scenarios where that won't happen in practice, and as above, this must therefore happen many times per day.
Anyway, this is getting a little abstract given:
Agreed. Let's just focus on the client collection record here (while still taking the edge-cases into account)
Off the top of my head, that doesn't sound too bad: we only send the push notification when uploading a client record other than our own. Thus, the receiver of the push notification will not itself re-send a push notification, as it will only update its own record while processing the push. This in turn implies to me that (a) the message isn't a generic "I synced something" but a more specific "I wrote a new client record", and (b) the API doesn't offer "send this to every device" but instead insists on specific device IDs. (Well, maybe (b) isn't strictly necessary, but the client would never make use of it.)
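In code, that convergence rule might look something like this (TypeScript; the names and the notify call are illustrative, not an existing FxA API):

```ts
// Only notify when we wrote a client record other than our own. A device
// processing a push only rewrites its own record, so it never fans out
// another push and the system converges.
async function uploadClientRecords(
  records: { id: string }[],
  ownId: string,
  notifyDevices: (deviceIds: string[]) => Promise<void> // hypothetical FxA call
): Promise<void> {
  // ... upload `records` to the clients collection here ...
  const others = records.filter((r) => r.id !== ownId);
  if (others.length > 0) {
    // "I wrote a new client record", targeted at exactly the devices touched.
    await notifyDevices(others.map((r) => r.id));
  }
}
```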
Yeah. This is kind of inevitable when syncing and remote control get bolted on to a product very late in its life.
I quite agree… but I'm in favor of not upping the frequency of tickling that tiger without careful thought
In most cases, yes. However, care must be taken in two respects:
Firstly, causing the clients engine to sync can also upload queued commands to other clients. That's probably okay, but a bug might cause this to not converge.
(Future repro: send a bunch of tabs from Android or iOS while offline. Reconnect to the internet. Send a push after sending a tab to that device.)
You could consider a download-only clients sync as the narrowest option.
Secondly, races in clients are likely uncommon right now: not only are there not many Send Tab users, but the windows of operation are typically non-overlapping. Client A sends three tabs to B, one at a time; client B syncs minutes later and reuploads its blank record without drama. Push-triggered syncs bring those windows into overlap: A's second tab send will occur while B is syncing. Clients should be safe here (I think they all use X-I-U-S for reuploading their record), but it would be worth checking what the behavior is. I imagine that desktop might drop outbound commands on the floor, and Android will open tabs multiple times. I don't know if failure will abort a sync.
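For the record, the safe-reupload shape I have in mind (TypeScript; this assumes the storage server's X-If-Unmodified-Since / 412 behavior, which is worth double-checking for each client):

```ts
// Reupload our own client record conditionally: the storage server rejects
// the write with 412 if the record changed since we last downloaded it
// (e.g. client A landed a second tab-send while we were mid-sync).
async function reuploadOwnRecord(
  recordUrl: string,
  body: string,
  lastSeenModified: string // server-modified timestamp captured at download
): Promise<boolean> {
  const resp = await fetch(recordUrl, {
    method: "PUT",
    headers: {
      "Content-Type": "application/json",
      "X-If-Unmodified-Since": lastSeenModified,
    },
    body,
  });
  if (resp.status === 412) {
    return false; // we raced: re-download, re-merge commands, then retry
  }
  return resp.ok;
}
```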
I'm a little more reckless than that - we know the tiger is being tickled but just never see it happen. I've no problem with changes we want to make having a side-effect of tickling him a little more - that's probably the only way we will actually see what he does :) I'll be even more gung-ho once we have telemetry reporting validation results ;)
Can you elaborate on what badness would happen there? Isn't that exactly what we want to happen?
Yes, I think that's probably correct. Not only will desktop fail to "track" command changes, it will indeed throw already-processed commands away if the upload fails (ie, I think Desktop will open tabs multiple times in this scenario.)
Sadly I think bad things still play out in a "download-only client sync" - client B downloads its record but doesn't update it on the server and waits for the next "scheduled" sync. If client B terminates or otherwise fails to complete that future Sync, bad things still happen (I think desktop will re-open those tabs next Sync.) So ISTM that either way, we need to make Desktop and Android more robust in this case (and further, doing so should mean we are robust in a download-only client sync, or a client sync that does re-write its record.)
Specifically for Desktop, I think it means we need to persist the local state of the local record, ensure that state is only cleared after a successful upload, and do something sensible WRT merging a client record changed remotely since we last read our version (eg, that "local state" should probably contain a full copy of the remote record we read before we processed and locally removed incoming commands.)
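A sketch of that persistence flow (TypeScript; the store and engine hooks are hypothetical stand-ins for whatever storage the client actually has):

```ts
// Persist what we saw before processing commands, and only clear that state
// after the server confirms the cleared record.
interface PersistedClientState {
  // Full copy of the remote record as read, before we processed and locally
  // removed incoming commands; needed to merge if it changes again remotely.
  remoteSnapshot: { commands: { id: string }[] };
}

async function syncOwnClientRecord(
  store: { save(s: PersistedClientState): Promise<void>; clear(): Promise<void> },
  downloadOwnRecord: () => Promise<{ commands: { id: string }[] }>,
  processCommand: (c: { id: string }) => Promise<void>,
  uploadClearedRecord: () => Promise<boolean>
): Promise<void> {
  const record = await downloadOwnRecord();
  // Persist before processing, so a crash between download and upload leaves
  // a durable trace of what we saw.
  await store.save({ remoteSnapshot: record });
  for (const command of record.commands) {
    await processCommand(command); // handlers must tolerate re-processing
  }
  // Only forget the persisted state once the cleared record is on the server.
  if (await uploadClearedRecord()) {
    await store.clear();
  }
}
```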
On reflection, I'm not sure this is big enough to warrant a full-blown feature card etc on the FxA side. I suggest that we scope this issue down to just "implement the device -> device notification pipe" which represents the FxA server's responsibilities here, and move the rest of the discussion above into e.g. the client-side metabug or similar.
@eoger are you planning on and/or interested in doing the server-side work in fxa-auth-server to implement the device -> device notification pipe?
It sounds like we actually came around to implementing something fairly close to this, @eoger I'd love to hear your thoughts on potential downsides of doing it this way, before we get too far down that road.
@eoger given #1357, what would you like to do with the rest of the discussion and context in this issue? I don't think it makes sense to keep it alive as an engineering issue in this repo, we could move it to a wiki or feature doc somewhere, or we could just close the issue and link back to it for reference. Thoughts?