[meta] sync quota tracking bug #41
@jchaupitre I can help with the first item on behalf of @jvehent
Thanks, @ameihm0912, looking forward to working with you on this. As a first ask, would you be able to share an existing policy of ours that is similar in some way to what we are trying to do here? Otherwise, would you know who we could contact?
@jchaupitre if you don't mind I'll grab a quick time slot next week to make sure I understand what we need; then I can start gathering resources for you. I suspect what we may want here is a mix of some high-level policy statements and some of our lower-level control definitions and behavioral patterns.
@davismtl, here are the summary statistics for the different collections on the decommissioned sync node 725. Getting statistics across all nodes will be impossible, so I'm hoping we can use this one as a sample (hopefully representative) data set to drive our discussions:
[Table of summary statistics per collection; columns: Forms, Forms.Size, History, History.Size, Bookmarks, Bookmarks.Size, Prefs, Prefs.Size, Passwords, Passwords.Size, Addons, Addons.Size]
@davismtl, here are the quantile stats for the above data set (i.e. decommissioned sync node 725):
[Table of quantiles per collection; columns: %, Forms, Forms.Size, %, History, History.Size, %, Bookmarks, Bookmarks.Size, %, Prefs, Prefs.Size, %, Passwords, Passwords.Size, %, Addons, Addons.Size]
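For reference, a minimal sketch of how per-collection summary and quantile stats like the ones above could be produced from a per-user export of a single node. Only the collection names come from the comments; the CSV file name and its layout (one row per user) are assumptions, not the actual tooling used.

```python
import pandas as pd

# Column names follow the table headers above; the export file is hypothetical.
COLLECTIONS = ["Forms", "History", "Bookmarks", "Prefs", "Passwords", "Addons"]
QUANTILES = [0.25, 0.50, 0.75, 0.90, 0.95, 0.99, 1.00]

df = pd.read_csv("sync_node_725_per_user_stats.csv")  # one row per user (assumed)
cols = [c for name in COLLECTIONS for c in (name, f"{name}.Size")]

# Per-collection record-count and size quantiles across all users on the node.
print(df[cols].quantile(QUANTILES).to_string())
```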
We met with @ameihm0912 and @davismtl, and the following questions were raised:
We also discussed that if we are to land such a policy, there will be work to prevent these abusive accounts from being migrated, but also work on the client/service side to prevent users from breaching the policy after the migration and on a go-forward basis. Pinging @habib, @erkolson, and @Micheletto for visibility.
Statistics for newly decommissioned sync-node-560:
[Tables of per-collection statistics; columns: Forms, Forms.Size, History, History.Size, Bookmarks, Bookmarks.Size, Prefs, Prefs.Size, Passwords, Passwords.Size, Addons, Addons.Size]
Action items from meeting on 3/23:
1.a - Submit a new telemetry request to measure typical Sync usage, so we can determine who our Sync user outliers are and define what "normal" use of Sync is. We want to measure data usage, sync occurrences, time to sync, number of devices, whether clients are non-Firefox, and any other relevant insights (Leif is our POC) - Julien
Action items update:
Next update: We will follow up on the telemetry request with Data Org (not expected to be worked on immediately due to other priorities), and will assist them with setting it up once they can engage.
While testing whole node migrations of user data from sync-py to sync-rs, I've discovered some users take an exceptionally long time to migrate (2+ hours). I ran some queries to create a distribution of rows and users. I found that though the 99th percentile user on this particular node had ~110,000 rows, the 3 biggest users had more than 9 million rows. Migration time:
There are many processes working in parallel to transfer the data, but these extreme outlier users make it very difficult to create optimal workload allocations for the parallel processes. For example, if we have enough processes to migrate the entire node in 2.5 hours based on total rows, average throughput, and an even allotment of users across all processes, the presence of one of these users in a process's allocation will cause it to take closer to 5 hours.
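As an illustration of why an even allotment breaks down, here is a hedged sketch (not the actual migration tooling) of a longest-processing-time-first assignment that at least isolates the outliers onto their own workers. The row counts and worker count are made up for the example.

```python
import heapq

def allocate(users_rows, n_workers):
    """users_rows: dict of user_id -> row count. Returns list of (total_rows, [users])."""
    # Min-heap of (total_rows_assigned, worker_index, assigned_users).
    heap = [(0, i, []) for i in range(n_workers)]
    heapq.heapify(heap)
    # Assign the heaviest users first, always to the currently least-loaded worker.
    for user, rows in sorted(users_rows.items(), key=lambda kv: kv[1], reverse=True):
        total, idx, assigned = heapq.heappop(heap)
        assigned.append(user)
        heapq.heappush(heap, (total + rows, idx, assigned))
    return [(total, assigned) for total, _, assigned in sorted(heap)]

# Example: one 9M-row outlier among mostly ~110k-row users, split over 4 workers.
example = {"outlier": 9_000_000, **{f"user{i}": 110_000 for i in range(50)}}
for total, users in allocate(example, 4):
    print(total, len(users))
```

Even with an allocation like this, a single 9M-row user still puts a floor on the wall-clock time: no partitioning scheme can finish faster than the slowest single user migrates.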
Some food for thought as we work on a policy: we should align with client-side behavior, since the client doesn't sync all the data on a first sync. So even if we decide to keep 1 year of data or 100k records, it doesn't mean the client will ever use it.
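To make the "1 year or 100k records" idea concrete, a toy sketch of what such a server-side retention rule could look like. The record representation, thresholds, and the reading of the policy are all hypothetical; nothing here is a decided policy.

```python
import time

def ids_outside_policy(records, keep_newest=100_000, max_age_days=365):
    """records: iterable of (bso_id, modified_ts seconds).

    Return ids that are neither among the newest `keep_newest` records nor
    modified within the last `max_age_days` days -- one possible reading of
    "keep 1 year of data or 100k records".
    """
    cutoff = time.time() - max_age_days * 86400
    newest_first = sorted(records, key=lambda r: r[1], reverse=True)
    return [bso_id for rank, (bso_id, ts) in enumerate(newest_first)
            if rank >= keep_newest and ts < cutoff]
```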
I'm going to pick this one back up in light of the recently discovered Spanner limitation. Update to come after reviewing options with product.
Adjusting the scope of this ticket to specifically track handling the Spanner limitation.
From rbillings' load test:
This was with 0.6.0 with quota disabled. |
Ops was able to enable quota on storage. Load testing results:
This is in line with the prior results. She noticed no timing differences with the quota enabled.
I'm seeing a probably unexpected failure case of the quota check from Vasilica's latest log: https://bugzilla.mozilla.org/show_bug.cgi?id=1665917#c4
After a "successful" batch commit the sync client reports:
Looking at the Sync client, "Records that will be uploaded again because the server couldn't store them" seems to indicate the write was successful but included "failed" bsos. I believe the "failed" bucket was populated here because the quota check failed in batch append (or post_bsos). When this happens there's no indication to the client that the quota check failed (the "failed" bucket does include an error message, so the quota error might be included there, but it's likely ignored), which I'd consider a bug. We've already planned to deprecate populating this "failed" bucket entirely -- I think we likely need to remove it now to fix this issue (I've added it to the checklist).
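A rough client-side sketch of the gap being described: the POST response's "failed" bucket does carry per-record reasons, but clients treat the failures generically today. The success/failed keys follow the Sync storage protocol response shape; the exact quota reason string and the helper name are assumptions for illustration.

```python
import json

def classify_failed(post_response_body):
    """Split failed BSO ids into quota-related and other failures (hypothetical helper)."""
    body = json.loads(post_response_body)
    quota_failures, other_failures = [], []
    for bso_id, reason in body.get("failed", {}).items():
        if "quota" in str(reason).lower():
            quota_failures.append(bso_id)   # retrying these will never succeed
        else:
            other_failures.append(bso_id)   # likely transient; safe to retry next sync
    return quota_failures, other_failures

# Example response with one record rejected by the quota check (reason string assumed).
resp = '{"success": ["bso1"], "failed": {"bso2": "over quota"}, "modified": 1600000000.00}'
print(classify_failed(resp))
```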
While this isn't ideal, it's probably not too much of a problem TBH - the next sync will explicitly retry those failed items once. However, if it's failing due to the quota, then the clients are going to retry forever anyway. The only "bad" thing happening here that wouldn't happen if this were fixed is that Firefox will write a JSON file with the failed GUIDs.
Spanner has a limitation on how much data we can store per user collection. We need to find a way to move forward with the migration in a way that doesn't negatively impact Spanner.
Affected users tracked in 1653022
This will track:
[ ] client side changes - 1655531
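For context on what the quota enforcement in this issue boils down to, a simplified sketch of a per-user, per-collection size check at write time. This is not the syncstorage-rs code; the storage interface, limit value, and error type are assumptions for illustration.

```python
QUOTA_BYTES = 2 * 1024 * 1024 * 1024  # illustrative per-collection cap, not the real limit

class OverQuota(Exception):
    """Raised when an append would push a user's collection past the cap."""

def check_quota(current_usage_bytes, incoming_payloads):
    """Reject a batch append if the stored bytes for this collection would exceed the cap."""
    incoming = sum(len(p) for p in incoming_payloads)
    if current_usage_bytes + incoming > QUOTA_BYTES:
        raise OverQuota(
            f"collection would grow to {current_usage_bytes + incoming} bytes "
            f"(limit {QUOTA_BYTES})"
        )
```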