
Verify that user data is not completely re-synced post migration. #45

Open
5 of 6 tasks
tublitzed opened this issue Apr 2, 2020 · 22 comments
Labels
2 Estimate - s - This is a small change with clearly defined parameters.

Comments

@tublitzed
Contributor

tublitzed commented Apr 2, 2020

Plan here is as follows:

  • 1. Find a single test user account still pointed to legacy sync nodes. (Sync id 128881199 - tublitzed@yahoo.com). Currently pointed to this legacy sync node.
  • 2. Enable local trace logging using about:sync. Enable success logging by setting services.sync.log.appender.file.logOnSuccess to true in about:config.
  • 3. Replace bookmarks collection with this collection via the bookmarks UI. This will force an entirely fresh upload of the collection.
  • 4. Collect the logs from this completely fresh upload, and save them. Logs live here
  • 5. On April 6th, this sync user will be migrated to spanner.
  • 6. Once step 5 is complete, test the sync service by adding a new bookmark. Collect and analyze logs to confirm that a) the user is pointed to durable sync and b) the user did NOT re-upload the entire sync collection after being migrated (see the log-scanning sketch below).
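For 6b, a rough sketch of how the saved success logs could be tallied per engine to spot a full re-upload (this assumes the usual tab-separated about:sync log line format of timestamp / module / level / message, as in the excerpts quoted further down; the names and exact format here are assumptions, not a supported tool):

```python
import re
import sys
from collections import defaultdict

# Count "outgoing items pre-reconciliation" per engine in a saved success log.
# A post-migration sync reporting large counts for every engine would indicate
# a full re-upload; a handful of items for tabs/clients only would not.
LINE = re.compile(r"^\d+\t(Sync\.Engine\.\w+)\t\w+\t(\d+) outgoing items pre-reconciliation")

outgoing = defaultdict(int)
with open(sys.argv[1]) as log:
    for line in log:
        match = LINE.match(line)
        if match:
            outgoing[match.group(1)] += int(match.group(2))

for engine, count in sorted(outgoing.items()):
    print(f"{engine}: {count} outgoing items")
```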
@tublitzed tublitzed added this to Backlog: Misc in Services Engineering via automation Apr 2, 2020
@tublitzed tublitzed added the 2 Estimate - s - This is a small change with clearly defined parameters. label Apr 2, 2020
@tublitzed tublitzed self-assigned this Apr 2, 2020
@tublitzed tublitzed moved this from Backlog: Misc to In Progress in Services Engineering Apr 2, 2020
@tublitzed
Contributor Author

tublitzed commented Apr 2, 2020

Just a note here for completeness - Erik moved my original sync user over to this test node here. I now have a new sync id for this test: 141727562.

Just verified that all content has been synced up to the new node; logs live here.

@tublitzed
Contributor Author

New sync id is 142084482. Unfortunately, the news is not good: we are in fact re-syncing everything. Logs live here

@rfk

rfk commented Apr 6, 2020

1586197410063 Sync.Resource DEBUG GET fail 404 https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142084482/storage/meta/global
1586197410063 Sync.Resource WARN GET request to https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142084482/storage/meta/global failed with status 404

Your client fails to find the meta/global record on the server, which would definitely be enough to trigger a full sync. If meta/global is missing then no other data on the server can be trusted to be in a good or consistent state.
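A rough pseudocode rendering of that client-side decision (illustrative only; the function and record shapes here are assumptions, not the actual Firefox Sync code):

```python
def needs_full_sync(server, local_sync_id):
    # The GET that returned 404 in the log above.
    meta = server.get_record("meta/global")
    if meta is None:
        # With no meta/global, engine versions and the global syncID are unknown,
        # so no other data on the server can be trusted: wipe and start fresh.
        return True
    # A changed syncID would likewise force a fresh start.
    return meta["syncID"] != local_sync_id
```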

@rfk

rfk commented Apr 6, 2020

Can we confirm from server-side migration logs whether the record 142084482/storage/meta/global was successfully copied over from the previous node?

@rfk

rfk commented Apr 6, 2020

My first theory here is that perhaps the migration script is calculating the (fxa_uid, fxa_kid) pair in a different way than tokenserver will after the migration. So e.g. we might copy all their data over to spanner keyed by (fxa_uid, fxa_kid1), but then when the client sees the node migration event and hits tokenserver to get a new token, they end up looking for data under (fxa_uid, fxa_kid2), and it appears the node is empty.
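A purely illustrative sketch of that theory (the field names and derivation below are assumptions for illustration, not the real tokenserver or migrate_node code):

```python
import hashlib

def fxa_kid(key_timestamp: int, kB: bytes) -> str:
    # Hypothetical kid derivation: timestamp prefix plus a hash of the account's
    # key material (client-state style).
    client_state = hashlib.sha256(kB).digest()[:16].hex()
    return f"{key_timestamp}-{client_state}"

kB = b"\x00" * 32                 # placeholder key material
generation = 1586197410           # timestamp source the migration script might use
keys_changed_at = 1586000000      # timestamp source tokenserver might use post-migration

# Data gets copied to Spanner under one kid...
kid_used_by_migration = fxa_kid(generation, kB)
# ...but after the node reassignment the client's new token points at another,
# so it looks in the "wrong" place and the node appears empty.
kid_used_by_tokenserver = fxa_kid(keys_changed_at, kB)
assert kid_used_by_migration != kid_used_by_tokenserver
```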

@erkolson

erkolson commented Apr 7, 2020

The client got a 404 when requesting the storage/meta/global. Here are the first 5 requests to the server:

timestamp,request,request_length,bytes_sent,status
2020-04-06 18:23:30.806275 UTC,GET /1.5/142084482/info/collections HTTP/1.1,1134.0,2.0,200
2020-04-06 18:23:30.895814 UTC,GET /1.5/142084482/info/configuration HTTP/1.1,1138.0,165.0,200
2020-04-06 18:23:30.985694 UTC,GET /1.5/142084482/storage/meta/global HTTP/1.1,1139.0,1.0,404
2020-04-06 18:23:31.077635 UTC,GET /1.5/142084482/storage/meta/global HTTP/1.1,1138.0,1.0,404
2020-04-06 18:23:31.195519 UTC,PUT /1.5/142084482/storage/meta/global HTTP/1.1,1333.0,13.0,200

@pjenvey
Member

pjenvey commented Apr 7, 2020

thanks @rfk. We've spotted some issues w/ migrate_node's individual user migration argument. We'll be testing a new user against the fixed version.

Erik's moved this new user (142188710) to this test node.

Restored the med.json & Sync'd, logs live here

@pjenvey
Member

pjenvey commented Apr 7, 2020

Erik migrated the user to 142198989/durable sync and made the Python node return 401s.

After the 401 error log, the client appears to have sync'd to the new durable sync node without a full reupload! Log

EDIT: It did reupload its tabs (it originally had 1 tab, which was reuploaded).

@pjenvey
Member

pjenvey commented Apr 7, 2020

Though I noticed one oddity: the Tabs collection in about:sync fails to load, reporting an error:

Error: HMACMismatch: Record SHA256 HMAC mismatch: should be 9946063866ac59e0006d0abccb0d9b37ec3216fe57eab97fc5cc0a1d0480bbfd, is ed307f5234e611eb05431ff6a979e2948f25b541595c46fb3f41182257140d36

Restarting the browser led to a sync error w/ the same HMACMismatch: Log

@pjenvey
Member

pjenvey commented Apr 7, 2020

Tabs are currently set to be stored solely in memcached on the server (cache_only_collections = tabs), so it makes sense that they would have been reuploaded. But this doesn't explain the HMACMismatch.

@rfk

rfk commented Apr 8, 2020

Hrm, the hmac mismatch is very strange indeed. If you use the about:sync addon, can you inspect the state of the "tabs" collection to make sure it only has a single record with id "CB3DaSRC1v6F", which is the one uploaded by your client?

@rfk

rfk commented Apr 8, 2020

I also wonder if we're seeing the HMAC error on the tabs collection because that's the only one for which we're trying to download any records - all other collections are (correctly) treated as not having any remote changes. Trying to add a second client to this cluster might result in additional HMAC errors being surfaced.

@rfk

rfk commented Apr 8, 2020

can you inspect the state of the "tabs" collection to make sure it only has a single record with id
"CB3DaSRC1v6F", which is the one uploaded by your client?

Actually, from the logs, I wonder if there is somehow a second item in the "tabs" collection. The logs around the HMAC error say this:

1586297619302	Sync.Engine.Tabs	DEBUG	First sync, uploading all items
1586297619302	Sync.Engine.Tabs	INFO	1 outgoing items pre-reconciliation
1586297619302	Sync.Engine.Tabs	TRACE	Downloading & applying server changes
1586297619302	FirefoxAccounts	TRACE	not checking freshness of profile as it remains recent
1586297619302	Sync.BrowserIDManager	TRACE	_ensureValidToken already has one
1586297619368	Sync.Collection	DEBUG	GET success 200 https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142198989/storage/tabs?full=1&limit=1000
1586297619369	Sync.Engine.Tabs	TRACE	Ignoring incoming tab item because of its id: CB3DaSRC1v6F
1586297619369	Sync.Engine.Tabs	TRACE	Skipping reconciled incoming item CB3DaSRC1v6F

So it skips the incoming tab item for itself, which is good.

1586297619369	Sync.Service	INFO	Bad HMAC event detected. Attempting recovery or signaling to other clients.
[... crypto/keys refetching blah-blah goes here ...]
1586297619511	Sync.Engine.Tabs	INFO	Records: 0 applied, 0 successfully, 1 failed to apply, 1 newly failed to apply, 1 reconciled.

But here we see that it tried to apply a record, and failed to do so due to the HMAC error. Does that mean there's a second BSO in the tabs collection, of unknown origin?

To check my understanding: we expect that the migration script will not populate any items into the "tabs" collection, because that collection does not exist in MySQL on the storage nodes. Is that accurate?

@pjenvey
Member

pjenvey commented Apr 8, 2020

If you use the about:sync addon, can you inspect the state of the "tabs" collection to make sure it only has a single record with id "CB3DaSRC1v6F", which is the one uploaded by your client?

about:sync isn't helpful here; it reports the HMACMismatch error and is stuck in the "Fetching records..." state for tabs, even after dismissing the error message.

@rfk

rfk commented Apr 8, 2020

Ugh, OK. Any chance you'd be willing to hit the server API directly and pull down the record(s) for inspection? :-)

@pjenvey
Member

pjenvey commented Apr 8, 2020

Ugh, OK. Any chance you'd be willing to hit the server API directly and pull down the record(s) for inspection? :-)

Someone would need to help me figure out how to get the appropriate Authorization header needed to query the backend. However, I was luckily able to grab this count from about:sync's logging of the /info/collection_counts response. Surprisingly, it reported tabs: 4.

Is it possible we have stale tabs data from mysql? I'm not familiar w/ the cache_only_collections implementation; does it ever back up any of its data to mysql?

I'm also wondering about the ramifications of migrated user data including entries for tabs in meta/global and, I think, /info/collections as well.

@pjenvey
Member

pjenvey commented Apr 8, 2020

Someone would need to help me figure out how to get the appropriate Authorization header needed to query the backend.

Never mind, I think I've got the hang of this.
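For reference, a minimal sketch of the kind of request this involves, assuming the id/key pair from the tokenserver response and the mohawk library (all values below are placeholders):

```python
import requests
from mohawk import Sender

# The id/key pair comes from the tokenserver response for this user (placeholders here).
credentials = {"id": "<token id>", "key": "<token key>", "algorithm": "sha256"}
url = "https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142198989/storage/tabs?full=1"

# Hawk-sign the request and pass the resulting header to the storage node.
sender = Sender(credentials, url, "GET", content="", content_type="")
resp = requests.get(url, headers={"Authorization": sender.request_header})
print(resp.status_code)
print(resp.json())
```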

@rfk

rfk commented Apr 8, 2020

Is it possible we have stale tabs data from mysql? I'm not familiar w/ the cache_only_collections
implementation, does it ever backup any of its data to mysql?

I don't think it will ever be written back to mysql. Can we look in the source tables on the old node to see if there were any there?
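A rough sketch of what that lookup might involve, assuming pymysql; the shard table name, numeric collection id, and user id below are guesses at the legacy schema and would need to be checked against the actual node:

```python
import pymysql

# All connection details and identifiers are placeholders.
conn = pymysql.connect(host="<old node db>", user="<user>", password="<password>",
                       database="sync")
legacy_userid = 142188710   # this test user's uid on the legacy node (verify)
tabs_collection_id = 9      # assumed numeric id of the "tabs" collection (verify)

with conn.cursor() as cur:
    # The legacy storage shards rows across bso0..bsoN tables; pick the right shard.
    cur.execute(
        "SELECT COUNT(*) FROM bso0 WHERE userid = %s AND collection = %s",
        (legacy_userid, tabs_collection_id),
    )
    print("tabs rows on old node:", cur.fetchone()[0])
```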

@pjenvey
Member

pjenvey commented Apr 17, 2020

@rfk thanks for the hints

Need to confirm some of these details w/ ops, but it seems this was due to stale tabs data in spanner.

For this migration test, the client was a new account:

  1. New accounts default to a Spanner node
  2. Then it was manually reassigned to a legacy Python node
  3. Test client would sync data there, then we'd manually migrate it from legacy back to Spanner. No tabs data would migrate since they're only in memcached.
  4. Then it was manually reassigned back to Spanner

If a client "Signs Out" of FxA/Sync while on the Python node, clients and forms collections perform a kind of "reset" on the client side and the accompanying records for the device are DELETEd on the server. In some cases we were doing a "Sign Out" to force the client to immediately catch the reassignment (restarting the browser seems to be another way to force it).

We had a couple problems here:

  • During 1) the new account made an initial sync to Spanner. It looks like we forgot to clean up this initial sync data. The account's user info/client state does not change during this test, so we eventually migrate the client back to the same location (primary keys) in Spanner.
  • The migration script should have skipped migrating this user due to this pre-existing data in Spanner, but a bug prevented that check from happening; instead we were overwriting everything there (see the sketch below).
  • Except tabs, since there were none to migrate.
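As a hypothetical sketch of the missing guard (the objects and method names here are stand-ins, not the real migrate_node code):

```python
import logging

log = logging.getLogger("migrate_node")

def migrate_user(user, spanner, legacy):
    # Guard that should have run: if this (fxa_uid, fxa_kid) already has rows in
    # Spanner, the user synced there before, so skip rather than overwrite and
    # risk leaving stale collections (like "tabs") behind.
    if spanner.has_existing_data(user.fxa_uid, user.fxa_kid):
        log.warning("skipping %s: pre-existing data in Spanner", user.fxa_uid)
        return
    legacy.copy_collections_to(spanner, user)
```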

So stale tabs data was left around, and I think combined w/ the "reset" -- this "reset" must change some state of the tabs collection used for HMAC validation. Does that make sense?

Another note: though we won't migrate tabs data, we'll migrate an entry for it in meta/global. Do you foresee any issues w/ that (we'll double check the behavior around this w/ more test runs)?

@rfk

rfk commented Apr 17, 2020

Aha, yes I think this would explain it!

At step (1) the client would have created new bulk storage keys in /crypto/keys on the spanner node, and uploaded data (including tabs) encrypted under those keys k1.

At step (2) the client would have seen that it had been re-assigned to a new, empty storage node, and would have re-initialized the sync data including /crypto/keys, then re-uploaded its sync data encrypted under those keys k2.

At step (3) you copy the new /crypto/keys over to spanner, and the client happily continues reading and writing data using the same bulk storage keys k2.

Except for the "tabs" collection, where it finds the old data that was encrypted using the old storage keys k1, which it's not able to decrypt and which produces an HMAC verification error.
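As a toy illustration of that failure mode (keys and ciphertext are random placeholders, not real Sync key derivation):

```python
import hashlib
import hmac
import os

record_ciphertext = b"<base64 ciphertext of the leftover tabs record>"  # placeholder

k1_hmac_key = os.urandom(32)  # bulk HMAC key from the first /crypto/keys (step 1)
k2_hmac_key = os.urandom(32)  # bulk HMAC key from the re-initialized /crypto/keys (step 2)

# The record left behind on Spanner carries an HMAC computed under k1...
stored_hmac = hmac.new(k1_hmac_key, record_ciphertext, hashlib.sha256).hexdigest()

# ...but the client now validates incoming records under k2, so verification fails,
# surfacing the "Record SHA256 HMAC mismatch" error seen in the logs.
expected_hmac = hmac.new(k2_hmac_key, record_ciphertext, hashlib.sha256).hexdigest()
assert stored_hmac != expected_hmac
```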

I'm not sure what effect the explicit signing out would have had - perhaps if you get a node re-assignment while you're signed in, Firefox is able to re-use the previous keys from /crypto/keys, but if you're not signed in then it has to create some new ones.

Regardless, I think this is a plausible explanation. Thanks for taking the time to dig into the details here!

Services Engineering automation moved this from Archived to Done Apr 21, 2020
@tublitzed tublitzed moved this from Done to Archived in Services Engineering Apr 27, 2020
@tublitzed tublitzed moved this from Archived to In Progress in Services Engineering Apr 29, 2020
@tublitzed
Contributor Author

Re-opening this one as it looks like we may have hit the issue again in yesterday's client testing.

@pjenvey pjenvey reopened this Apr 30, 2020
Services Engineering automation moved this from In Progress to Prioritized Apr 30, 2020
@pjenvey pjenvey assigned pjenvey and unassigned tublitzed Apr 30, 2020
@pjenvey
Member

pjenvey commented May 4, 2020

Logs for my last round of successful attempts (on stage):

pjenvey-stage-migrate1: Did a forceful "Sign out" of Sync to force a lookup back to the Spanner node after its migration
pjenvey-stage-migrate2
pjenvey-stage-migrate3

All 3 of these test accounts migrated successfully.

@tublitzed tublitzed moved this from Prioritized to Archived in Services Engineering May 26, 2020