This repository has been archived by the owner on Jul 13, 2023. It is now read-only.

Confirm whether uaids are being dropped on 404/410 from bridged servers. #1444

Open
rfk opened this issue Nov 5, 2020 · 3 comments

Comments

@rfk

rfk commented Nov 5, 2020

Based on this comment, it's my understanding that when the FCM server responds with a 404 or 410 status code, the intended behavior of the autopush server is to drop the corresponding uaid record and all its subscriptions. The logic for doing so lives in _router_fail_err here:

if exc.status_code in [404, 410] and router_type in [
        'apns', 'fcm', 'adm']:
    self._base_tags.update({
        "platform": router_type,
        "reason": "unregistered",
    })
    self.metrics.increment(
        "notification.bridge.error",
        tags=self._base_tags)
    self.db.router.drop_user(uaid)
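
To make the condition easier to reason about, here is a minimal stand-alone sketch of the same check. The function name `should_drop_user` and the two constants are hypothetical, introduced only for illustration; they are not part of autopush.

```python
# Hypothetical restatement of the quoted condition as a pure function.
# The status codes and router types are copied from the snippet above.
DROP_STATUS_CODES = frozenset({404, 410})
DROP_ROUTER_TYPES = frozenset({"apns", "fcm", "adm"})


def should_drop_user(status_code: int, router_type: str) -> bool:
    """Return True when the quoted check would go on to call drop_user()."""
    return status_code in DROP_STATUS_CODES and router_type in DROP_ROUTER_TYPES


if __name__ == "__main__":
    # Print the outcome for a few combinations, including a router type
    # ("gcm") that is not in the quoted list.
    for router_type in ("apns", "fcm", "adm", "gcm"):
        for status_code in (404, 410, 500):
            print(router_type, status_code,
                  should_drop_user(status_code, router_type))
```

Note that this check only matters if an exception carrying one of those status codes actually reaches `_router_fail_err`; the sketch says nothing about whether that happens in practice.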

It's not clear whether this logic is triggering correctly.

Based on FxA server logs, we're definitely seeing 404 and/or 410 responses when trying to send push messages to mobile clients, since FxA logs a specific "subscription expired" event in this case.

I also took a look in grafana for events of type autopush.notification.bridge.error[reason:recipient_gone], which would correspond to the FCMNotFoundError error type:

if isinstance(err, FCMNotFoundError):
    self.log.debug("FCM Recipient not found: %s" % err)
    self.metrics.increment("notification.bridge.error",
                           tags=make_tags(
                               self._base_tags,
                               reason="recipient_gone"
                           ))

I am able to see a small but steady rate of such errors, so I think it's clear that they are in fact happening.

However...

If I look in grafana for events of type autopush.notification.bridge.error[reason:unregistered] as would be emitted alongside the drop_user call above, I do not see any events at all for platform:fcm. In fact the only instances of such an event are for platform:gcm, which may be coming from this different codepath that emits a similarly-named event.

I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server, since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem to be in such a state in the wild.

So I'm wondering if the drop_user logic linked above is working correctly, or whether it might be failing to trigger in practice. The observed behaviour of mobile push clients in the wild suggests some instances where the autopush server believes a subscription is valid but the FxA server does not, and a failure to drop subscriptions on 404/410 could explain that.
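
One way the metrics pattern above could arise, sketched under assumptions: if the FCM error handler counts `recipient_gone` and then returns a 404 response instead of raising, an exception-based handler like `_router_fail_err` would never run, and `drop_user` would never be called. This is a guess at a possible mechanism, not something confirmed in this issue. Every class and method name below (`RouterError`, `FakeRecipientGone`, `Harness`) is hypothetical and stands in for the real autopush code.

```python
from collections import Counter


class RouterError(Exception):
    """Hypothetical stand-in for the exception the failure handler inspects."""

    def __init__(self, status_code):
        super().__init__(status_code)
        self.status_code = status_code


class FakeRecipientGone(Exception):
    """Hypothetical stand-in for FCMNotFoundError."""


class Harness:
    """Toy model of the two quoted code paths, with in-memory metrics."""

    def __init__(self, reraise):
        self.metrics = Counter()   # reason tag -> count
        self.dropped = []          # uaids passed to the fake drop_user
        self.reraise = reraise     # assumption under test: raise vs. return

    def process_error(self, err):
        # Mirrors the second quoted snippet: count "recipient_gone".
        if isinstance(err, FakeRecipientGone):
            self.metrics["recipient_gone"] += 1
            if self.reraise:
                raise RouterError(404)
            return 404  # returned, so no exception reaches the handler below
        raise err

    def router_fail_err(self, exc, router_type, uaid):
        # Mirrors the first quoted snippet: count "unregistered" and drop.
        if exc.status_code in [404, 410] and router_type in [
                "apns", "fcm", "adm"]:
            self.metrics["unregistered"] += 1
            self.dropped.append(uaid)

    def send(self, uaid, router_type="fcm"):
        # Simulate a delivery attempt where the bridge reports the
        # recipient as gone.
        try:
            return self.process_error(FakeRecipientGone(uaid))
        except RouterError as exc:
            self.router_fail_err(exc, router_type, uaid)
            return exc.status_code
```

With `Harness(reraise=False)`, a call to `send("some-uaid")` still yields a 404 for the caller and increments `recipient_gone`, but `dropped` stays empty and `unregistered` is never counted, which matches what grafana shows for `platform:fcm`. With `reraise=True` the uaid is dropped. A check along these lines against the real router would be one way to confirm or rule out this explanation.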

@tublitzed tublitzed added this to Backlog: Misc in Services Engineering via automation Nov 5, 2020
@tublitzed tublitzed moved this from Backlog: Misc to Prioritized in Services Engineering Nov 5, 2020
@jrconlin jrconlin self-assigned this Nov 5, 2020
@jrconlin jrconlin added the 3 Estimate - m - This is a small change, but there's some uncertainty. label Dec 2, 2020
@rfk
Author

rfk commented Dec 2, 2020

I also believe that the current appservices push component would fail if its uaid record were to be discarded by the server,
since I can't find any codepaths that would recover from such a state. But we haven't observed any devices that seem
to be in such a state in the wild.

Update: #1445 seems to show evidence of what might be devices in such a state in the wild.

@jrconlin
Member

Looking at the autopush python code, it appears that we do not drop them.

@jrconlin
Member

It's worth noting that the newer rust version does drop these records.

@tublitzed tublitzed moved this from Prioritized to Backlog: Push in Services Engineering Feb 22, 2021