
[meta] sync quota tracking bug #41

Open · 7 of 9 tasks
tublitzed opened this issue Mar 10, 2020 · 19 comments
Labels: 3 Estimate - m - This is a small change, but there's some uncertainty.

Comments

@tublitzed (Contributor) commented Mar 10, 2020

Spanner has a limitation on how much data we can store per user collection. We need to find a way to move forward with the migration in a way that doesn't negatively impact Spanner.

Affected users tracked in 1653022

This will track:

@tublitzed tublitzed added this to Backlog: Misc in Services Engineering via automation Mar 10, 2020
@tublitzed tublitzed moved this from Backlog: Misc to Prioritized in Services Engineering Mar 10, 2020
@jchaupitre jchaupitre self-assigned this Mar 10, 2020
@jchaupitre

  • Consult @jvehent, specifically on what makes usage abusive
  • Consult @davismtl, specifically on whether we are authorized to study user behavior here, how much data is too much data, and whether users enter into an agreement with us when they consume the sync service

@ameihm0912

@jchaupitre I can help with the first item on behalf of @jvehent.

@tublitzed tublitzed moved this from Prioritized to Scheduled in Services Engineering Mar 11, 2020
@jchaupitre

Thanks, @ameihm0912, looking forward to working with you on this.

As a first ask, would you be able to share an existing policy of ours that is similar (in some way) to what we are trying to do here? Otherwise, would you know who we could contact?

@ameihm0912

@jchaupitre if you don't mind, I'll grab a quick time slot next week to make sure I understand what we need; then I can start gathering resources for you. I suspect what we may want here is a mix of some high-level policy statements and some of our lower-level control definitions and behavioral patterns.

@jchaupitre

@davismtl, here are the summary statistics for the different collections on the decommissioned sync node 725. Getting statistics across all nodes would be impossible, so I'm hoping we can use this node as a sample of hopefully representative data to drive our discussions:

| Collection | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| Forms | 1.0 | 9.0 | 115.0 | 1327.8 | 854.8 | 1482688.0 | 103478 |
| Forms.Size | 187 | 2060 | 25338 | 279384 | 184872 | 350418052 | 103478 |
| History | 1 | 71 | 606 | 2127 | 2534 | 267939 | 191629 |
| History.Size | 187 | 38832 | 321589 | 1093897 | 1326376 | 145262334 | 191629 |
| Bookmarks | 1.0 | 10.0 | 16.0 | 246.4 | 51.0 | 1573713.0 | 86994 |
| Bookmarks.Size | 405 | 3870 | 6485 | 111028 | 22453 | 380801236 | 86994 |
| Prefs | 1 | 1 | 1 | 1 | 1 | 1 | 109765 |
| Prefs.Size | 8123 | 12711 | 13499 | 13066 | 13607 | 370899 | 109765 |
| Passwords | 1.00 | 2.00 | 7.00 | 31.99 | 27.00 | 6126.00 | 110760 |
| Passwords.Size | 187 | 1122 | 3764 | 17631 | 14750 | 3231694 | 110760 |
| Addons | 1.00 | 1.00 | 2.00 | 3.83 | 4.00 | 401.00 | 180637 |
| Addons.Size | 251 | 359 | 674 | 1318 | 1456 | 121667 | 180637 |

@jchaupitre commented Mar 16, 2020

@davismtl, here are the quantile stats for the above data set (i.e. decommissioned sync node 725):

| Collection | 25% | 50% | 75% | 90% | 99% |
|---|---|---|---|---|---|
| Forms | 9.00 | 115.00 | 854.75 | 3357.00 | 17973.00 |
| Forms.Size | 2060.0 | 25338.5 | 184872.5 | 709535.0 | 3744529.4 |
| History | 71.00 | 606.00 | 2534.00 | 5889.00 | 18050.46 |
| History.Size | 38832 | 321589 | 1326376 | 3038440 | 9090346 |
| Bookmarks | 10.00 | 16.00 | 51.00 | 219.00 | 2510.11 |
| Bookmarks.Size | 3870 | 6485 | 22453 | 100038 | 1171953 |
| Prefs | 1 | 1 | 1 | 1 | 1 |
| Prefs.Size | 12711 | 13499 | 13607 | 14395 | 15291 |
| Passwords | 2 | 7 | 27 | 81 | 357 |
| Passwords.Size | 1122.00 | 3764.00 | 14750.25 | 44575.20 | 198256.64 |
| Addons | 1 | 2 | 4 | 9 | 27 |
| Addons.Size | 359 | 674 | 1456 | 2952 | 9190 |
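
For anyone who wants to reproduce these numbers on another node, here is a minimal sketch of how per-collection summary and quantile statistics like the above could be computed from a per-user dump. The CSV path, the column layout, and the use of pandas are assumptions for illustration; this is not the tooling actually used to produce the tables in this thread.

```python
# Hypothetical sketch: compute per-collection quantiles from a per-user CSV
# dump of a sync node. The file name and column names are assumed, not real.
import pandas as pd

df = pd.read_csv("node_725_per_user.csv")  # one row per user, columns like Forms, Forms.Size, ...

collections = ["Forms", "History", "Bookmarks", "Prefs", "Passwords", "Addons"]
quantiles = [0.25, 0.50, 0.75, 0.90, 0.99]

for name in collections:
    counts = df[name].dropna()            # records per user in this collection
    sizes = df[f"{name}.Size"].dropna()   # stored payload per user in this collection
    print(name)
    print(counts.quantile(quantiles))
    print(sizes.quantile(quantiles))
```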

@jchaupitre commented Mar 18, 2020

We met with @ameihm0912 and @davismtl, and the following questions were raised:

  • Can we determine how many devices the account has been syncing with?
  • Can we also determine when the last sync was?
  • Can we determine the cost per MB of storage? Are we charged by amount of storage and/or by reads/writes? In other words, what is the cost per sync and/or per device used to sync?

We also discussed that if we are to land such a policy, there will be work to prevent these abusive accounts from being migrated, but also work on the client/service side to prevent users from breaching the policy after the migration and on an ongoing basis.

Pinging @habib, @erkolson, and @Micheletto for visibility.

@jchaupitre commented Mar 20, 2020

Statistics for newly decommissioned sync-node-560:

| Collection | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
|---|---|---|---|---|---|---|---|
| Forms | 1 | 15 | 158 | 2082 | 1304 | 365491 | 94018 |
| Forms.Size | 187 | 3405 | 35076 | 421249 | 276591 | 81099789 | 94018 |
| History | 1 | 84 | 938 | 3070 | 3929 | 823982 | 204294 |
| History.Size | 187 | 45116 | 491200 | 1559658 | 2039947 | 385440602 | 204294 |
| Bookmarks | 1 | 14 | 35 | 771 | 156 | 9856717 | 73434 |
| Bookmarks.Size | 2.790e+02 | 5.523e+03 | 1.496e+04 | 2.616e+05 | 6.866e+04 | 2.472e+09 | 73434 |
| Prefs | 1 | 1 | 1 | 1 | 1 | 1 | 102756 |
| Prefs.Size | 8083 | 9895 | 10963 | 11626 | 12839 | 412371 | 102756 |
| Passwords | 1.00 | 3.00 | 12.00 | 54.35 | 57.00 | 4399.00 | 99509 |
| Passwords.Size | 187 | 1673 | 6716 | 29587 | 30752 | 2388249 | 99509 |
| Addons | 1.0 | 2.0 | 3.0 | 6.5 | 8.0 | 666.0 | 164683 |
| Addons.Size | 251 | 546 | 1117 | 2210 | 2612 | 203930 | 164683 |


And the quantile stats for the same node:

| Collection | 25% | 50% | 75% | 90% | 99% |
|---|---|---|---|---|---|
| Forms | 15.00 | 158.00 | 1304.00 | 5263.00 | 29184.56 |
| Forms.Size | 3405 | 35076 | 276591 | 1072015 | 5786539 |
| History | 84.00 | 938.00 | 3929.00 | 8689.40 | 23592.28 |
| History.Size | 45116 | 491200 | 2039947 | 4423990 | 11688947 |
| Bookmarks | 14.0 | 35.0 | 156.0 | 563.2 | 5543.0 |
| Bookmarks.Size | 5523.0 | 14959.0 | 68655.0 | 253166.2 | 2421994.6 |
| Prefs | 1 | 1 | 1 | 1 | 1 |
| Prefs.Size | 9895.0 | 10963.0 | 12839.0 | 13627.0 | 15833.8 |
| Passwords | 3 | 12 | 57 | 154 | 508 |
| Passwords.Size | 1673.00 | 6716.00 | 30752.25 | 84007.40 | 277247.78 |
| Addons | 2 | 3 | 8 | 15 | 44 |
| Addons.Size | 546.00 | 1117.00 | 2612.00 | 5120.00 | 14693.22 |

@jchaupitre

Action items from the meeting on 3/23:

1.a. Submit a new telemetry request to measure typical Sync usage, so we can determine what our sync user outliers are and define "normal" use of sync. We want to measure data usage, sync occurrences, time to sync, number of devices, whether clients are non-Firefox, and any other relevant insights (Leif is our POC). - Julien
1.b. Determine whether we capture the same usage between client-side telemetry and the service side, so we can determine the amount of non-Firefox client usage of sync.
2. Cost is definitely a reason to investigate abuse of sync, but we also want to determine how sync is used and abused (i.e. syncing between other browsers). What usage of sync is considered acceptable? - To be determined at a later time.
3. The Firefox client/service has a built-in quota (e.g. users get 1 MB of sync data) that was never used - is that something we could be using here? We could start with a very large number like 100 MB and measure how often users reach that limit (a rough sketch of the idea follows this list). - Need to confirm with Sync-client (Janet) - Julien
4. Should we consider whitelisting clients, i.e. 3rd-party browsers, Firefox forks, etc.? This would be based on the user agent, which can be spoofed. Dependent on 1.a and 1.b - need to confirm with Services-engineering and/or ops (Rachel) how to implement this. - Julien
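
To make item 3 concrete, here is a rough sketch of the kind of server-side check being proposed: reject (or flag) writes once a user's stored bytes would exceed a generous limit, and record how often that happens so the threshold can be tuned from telemetry. The function names and return shape are hypothetical; this is not the syncstorage-rs implementation (that work is tracked in mozilla-services/syncstorage-rs#120).

```python
# Hypothetical sketch of a per-user storage quota check. The 100 MB figure is
# the "very large number" suggested above; everything else is illustrative.
QUOTA_BYTES = 100 * 1024 * 1024

def check_quota(current_usage_bytes: int, incoming_payload_bytes: int) -> bool:
    """Return True if the write still fits within the user's quota."""
    return current_usage_bytes + incoming_payload_bytes <= QUOTA_BYTES

def handle_write(current_usage_bytes: int, payload_bytes: int) -> dict:
    if not check_quota(current_usage_bytes, payload_bytes):
        # In practice we would also emit a metric here, so we can measure how
        # often real users hit the limit before enforcing anything.
        return {"status": 403, "error": "quota-exceeded"}
    return {"status": 200}

print(handle_write(99 * 1024 * 1024, 2 * 1024 * 1024))  # {'status': 403, 'error': 'quota-exceeded'}
```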

@jchaupitre

Action items update:

1.a. Submitted a new request to Data Org: https://jira.mozilla.com/browse/DO-177
3. Sync quota implementation - the client and service engineering teams confirmed this was NOT implemented on either side. Specifically, the client team removed that UI a long time ago on Firefox Desktop because it wasn't used (https://bugzilla.mozilla.org/show_bug.cgi?id=1180587). The client team also assumes that neither the Android nor the iOS version supports this. The service team indicated they have an issue open to implement it: mozilla-services/syncstorage-rs#120.
4. Confirmed with Rachel that the Services-engineering team can be engaged to investigate how we could build the capability to block non-Firefox clients from using the sync service.

Next update: We will follow up on the telemetry request with Data Org (not expected to be worked on immediately due to other priorities), and will assist them with setting it up once they can engage.

@tublitzed tublitzed moved this from Scheduled to Backlog: Sync in Services Engineering Apr 1, 2020
@erkolson

While testing whole node migrations of user data from sync-py to sync-rs, I've discovered some users take an exceptionally long time to migrate (2+ hours). I ran some queries to create a distribution of rows and users. I found that though the 99th percentile user on this particular node had ~110,000 rows, the 3 biggest users had more than 9 million rows.

Migration time:

  1. mean user - ~4s
  2. 99th percentile user - 0:01:40 (100s)
  3. 9,000,000 row user - 2:06:21 (8181s)

There are many processes working in parallel to transfer the data, but these extreme outlier users make it very difficult to create optimal workload allocations for the parallel processes. For example, if we have enough processes to migrate the entire node in 2.5 hours based on total rows, average throughput, and an even allotment of users across all processes, the presence of one of these users in a process's allocation will cause it to take closer to 5 hours.
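
To illustrate the allocation problem, here is a small sketch (hypothetical data, not the actual migration tooling) of a greedy "largest user first" assignment of users to parallel workers. It balances the bulk of the load, but a single multi-million-row user still pins whichever worker receives it to roughly that one user's full migration time.

```python
import heapq

def assign_users(user_rows: dict[str, int], n_workers: int) -> list[list[str]]:
    """Greedy longest-first assignment: give each user to the least-loaded worker."""
    heap = [(0, i) for i in range(n_workers)]  # (assigned_rows, worker_index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(n_workers)]
    for user, rows in sorted(user_rows.items(), key=lambda kv: -kv[1]):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(user)
        heapq.heappush(heap, (load + rows, idx))
    return buckets

# Hypothetical example: one 9M-row outlier dominates whichever worker gets it,
# no matter how evenly the remaining users are spread.
users = {"outlier": 9_000_000, **{f"user{i}": 1_000 for i in range(1000)}}
print([sum(users[u] for u in bucket) for bucket in assign_users(users, 8)])
```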

@davismtl

Some food for thought as we work on a policy: we should align with client-side behavior, since the client doesn't sync all of the data on a first sync. So even if we decide to keep 1 year of data or 100k records, it doesn't mean that the client will ever use it.

@tublitzed tublitzed moved this from Backlog: Sync to Prioritized in Services Engineering Jul 21, 2020
@tublitzed tublitzed assigned tublitzed and unassigned jchaupitre Jul 21, 2020
@tublitzed (Contributor, Author)

I'm going to pick this one back up in light of the recently discovered spanner limitation. Update to come after reviewing options with product.

@tublitzed tublitzed moved this from Prioritized to In Progress in Services Engineering Jul 23, 2020
@tublitzed tublitzed added the 3 Estimate - m - This is a small change, but there's some uncertainty. label Jul 23, 2020
@tublitzed (Contributor, Author)

Adjusting the scope of this ticket to specifically track handling the spanner limitation.

@tublitzed tublitzed changed the title Sync usage policy Adjust sync to work within spanner quotas Jul 24, 2020
@tublitzed tublitzed changed the title Adjust sync to work within spanner quotas [meta] sync quota tracking bug Aug 17, 2020
@jrconlin jrconlin moved this from In Progress to Done in Services Engineering Sep 24, 2020
@jrconlin (Member) commented Sep 24, 2020

From rbillings' load test:

11:51 AM rbillings: @jrconlin results: SUCCESSES: 1027241 | FAILURES: 252 | WORKERS: 800

This was with 0.6.0 with quota disabled.

@jrconlin (Member) commented Sep 24, 2020

Unfortunately, rbillings will be out tomorrow and ops is currently dealing with a higher priority issue, so load testing 0.6.0 with quota enabled will have to wait for Monday, Sept 28.

@jrconlin (Member)

Ops was able to enable quota on storage. Load testing results:

3:35 PM rbillings SUCCESSES: 1027815 | FAILURES: 132 | WORKERS: 800 looks good!

This is in line with the prior results.

She noticed no timing differences with the quota enabled.

@tublitzed tublitzed moved this from Done to Archived in Services Engineering Sep 28, 2020
@pjenvey (Member) commented Oct 3, 2020

I'm seeing a probably unexpected failure case of the quota check from Vasilica's latest log:

https://bugzilla.mozilla.org/show_bug.cgi?id=1665917#c4

After a "successful" batch commit the sync client reports:

1601543067675	Sync.Collection	DEBUG	POST success 200 https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/<id>/storage/bookmarks?batch=<batch_id>&commit=true
1601543067676	Sync.Engine.Bookmarks	DEBUG	Records that will be uploaded again because the server couldn't store them: <bso_id>, <etc>

Looking at the Sync client, "Records that will be uploaded again because the server couldn't store them" seems to indicate the write was successful but included "failed" BSOs.

I believe the "failed" bucket was populated here due to the quota check failing in batch append (or post_bsos). When this happens there's no indication to the client that the quota check failed (the "failed" bucket does include an error message, so the quota error might be included there but it's likely ignored), which I'd consider a bug.

We've already planned to deprecate populating this "failed" bucket entirely - I think we likely need to remove it now to fix this issue (I've added it to the checklist).
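
To spell out the failure mode: a batch POST can come back HTTP 200 yet carry per-record errors in its "failed" bucket, and today the quota reason in that bucket is effectively dropped by the client. Below is a minimal sketch of the kind of client-side handling that would surface those errors; only the success/failed bucket shape comes from the discussion above, and the "quota-exceeded" error string is an assumption for illustration, not the server's actual message.

```python
# Sketch only: pull quota-related failures out of a batch POST response whose
# "failed" bucket maps BSO ids to error strings.
def quota_failures(post_response: dict) -> list[str]:
    failed = post_response.get("failed", {})
    return [bso_id for bso_id, reason in failed.items() if "quota" in str(reason).lower()]

# Hypothetical "successful" commit (HTTP 200) that silently dropped two records:
response = {"success": ["bso1"], "failed": {"bso2": "quota-exceeded", "bso3": "quota-exceeded"}}
print(quota_failures(response))  # ['bso2', 'bso3']
```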

@mhammond commented Oct 4, 2020

While this isn't ideal, it's probably not too much of a problem TBH - the next sync will explicitly retry those failed items once. However, if they're failing due to the quota, then the clients are going to retry forever anyway. The only "bad" thing happening here that wouldn't happen if this were fixed is that Firefox will write a JSON file with the failed GUIDs.
