
Verify that user data is not completely re-synced post migration. #45

Open
5 of 6 tasks
tublitzed opened this issue Apr 2, 2020 · 22 comments
Labels
2 Estimate - s - This is a small change with clearly defined parameters.

Comments

@tublitzed
Contributor

tublitzed commented Apr 2, 2020

Plan here is as follows:

  • 1. Find a single test user account still pointed to legacy sync nodes. (Sync id 128881199 - tublitzed@yahoo.com). Currently pointed to this legacy sync node.
  • 2. Enable local trace logging using about:sync. Enable success logging by setting services.sync.log.appender.file.logOnSuccess to true in about:config.
  • 3. Replace bookmarks collection with this collection via the bookmarks UI. This will force an entirely fresh upload of the collection.
  • 4. Collect the logs from this completely fresh upload, and save them. Logs live here
  • 5. On April 6th, this sync user will be migrated to spanner.
  • 6. Once step 5 is complete, test the sync service by adding a new bookmark. Collect and analyze logs to confirm that a) the user is pointed to durable sync and b) the user did NOT re-upload the entire sync collection after being migrated (see the log-scanning sketch below).
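For 6b, a rough sketch of how the saved success logs could be tallied per engine to spot a full re-upload (this assumes the usual tab-separated about:sync log line format of timestamp / module / level / message, as in the excerpts quoted further down; the names and exact format here are assumptions, not a supported tool):

```python
import re
import sys
from collections import defaultdict

# Count "outgoing items pre-reconciliation" per engine in a saved success log.
# A post-migration sync reporting large counts for every engine would indicate
# a full re-upload; a handful of items for tabs/clients only would not.
LINE = re.compile(r"^\d+\t(Sync\.Engine\.\w+)\t\w+\t(\d+) outgoing items pre-reconciliation")

outgoing = defaultdict(int)
with open(sys.argv[1]) as log:
    for line in log:
        match = LINE.match(line)
        if match:
            outgoing[match.group(1)] += int(match.group(2))

for engine, count in sorted(outgoing.items()):
    print(f"{engine}: {count} outgoing items")
```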
@tublitzed tublitzed added this to Backlog: Misc in Services Engineering via automation Apr 2, 2020
@tublitzed tublitzed added the 2 Estimate - s - This is a small change with clearly defined parameters. label Apr 2, 2020
@tublitzed tublitzed self-assigned this Apr 2, 2020
@tublitzed tublitzed moved this from Backlog: Misc to In Progress in Services Engineering Apr 2, 2020
@tublitzed
Contributor Author

tublitzed commented Apr 2, 2020

Just a note here for completeness - Erik moved my original sync user over to this test node here. I now have a new sync id for this test: 141727562.

Just verified that all content has been synced up to the new node; logs live here.

@tublitzed
Contributor Author

New sync id is 142084482. Unfortunately, the news is not good: we are in fact re-syncing everything. Logs live here

@rfk

rfk commented Apr 6, 2020

1586197410063 Sync.Resource DEBUG GET fail 404 https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142084482/storage/meta/global
1586197410063 Sync.Resource WARN GET request to https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142084482/storage/meta/global failed with status 404

Your client fails to find the meta/global record on the server, which would definitely be enough to trigger a full sync. If meta/global is missing then no other data on the server can be trusted to be in a good or consistent state.
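A rough pseudocode rendering of that client-side decision (illustrative only; the function and record shapes here are assumptions, not the actual Firefox Sync code):

```python
def needs_full_sync(server, local_sync_id):
    # The GET that returned 404 in the log above.
    meta = server.get_record("meta/global")
    if meta is None:
        # With no meta/global, engine versions and the global syncID are unknown,
        # so no other data on the server can be trusted: wipe and start fresh.
        return True
    # A changed syncID would likewise force a fresh start.
    return meta["syncID"] != local_sync_id
```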

@rfk

rfk commented Apr 6, 2020

Can we confirm from server-side migration logs whether the record 142084482/storage/meta/global was successfully copied over from the previous node?

@rfk

rfk commented Apr 6, 2020

My first theory here is that perhaps the migration script is calculating the (fxa_uid, fxa_kid) pair in a different way than tokenserver will after the migration. So e.g. we might copy all their data over to spanner keyed by (fxa_uid, fxa_kid1), but then when the client sees the node migration event and hits tokenserver to get a new token, they end up looking for data under (fxa_uid, fxa_kid2), and it appears the node is empty.
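A purely illustrative sketch of that theory (the field names and derivation below are assumptions for illustration, not the real tokenserver or migrate_node code):

```python
import hashlib

def fxa_kid(key_timestamp: int, kB: bytes) -> str:
    # Hypothetical kid derivation: timestamp prefix plus a hash of the account's
    # key material (client-state style).
    client_state = hashlib.sha256(kB).digest()[:16].hex()
    return f"{key_timestamp}-{client_state}"

kB = b"\x00" * 32                 # placeholder key material
generation = 1586197410           # timestamp source the migration script might use
keys_changed_at = 1586000000      # timestamp source tokenserver might use post-migration

# Data gets copied to Spanner under one kid...
kid_used_by_migration = fxa_kid(generation, kB)
# ...but after the node reassignment the client's new token points at another,
# so it looks in the "wrong" place and the node appears empty.
kid_used_by_tokenserver = fxa_kid(keys_changed_at, kB)
assert kid_used_by_migration != kid_used_by_tokenserver
```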

@erkolson

erkolson commented Apr 7, 2020

The client got a 404 when requesting the storage/meta/global. Here are the first 5 requests to the server:

timestamp,request,request_length,bytes_sent,status
2020-04-06 18:23:30.806275 UTC,GET /1.5/142084482/info/collections HTTP/1.1,1134.0,2.0,200
2020-04-06 18:23:30.895814 UTC,GET /1.5/142084482/info/configuration HTTP/1.1,1138.0,165.0,200
2020-04-06 18:23:30.985694 UTC,GET /1.5/142084482/storage/meta/global HTTP/1.1,1139.0,1.0,404
2020-04-06 18:23:31.077635 UTC,GET /1.5/142084482/storage/meta/global HTTP/1.1,1138.0,1.0,404
2020-04-06 18:23:31.195519 UTC,PUT /1.5/142084482/storage/meta/global HTTP/1.1,1333.0,13.0,200

@pjenvey
Member

pjenvey commented Apr 7, 2020

thanks @rfk. We've spotted some issues w/ migrate_node's individual user migration argument. We'll be testing a new user against the fixed version.

Erik's moved this new user (142188710) to this test node.

Restored the med.json & Sync'd, logs live here

@pjenvey
Member

pjenvey commented Apr 7, 2020

Erik migrated the user to 142198989/durable sync and made the Python node return 401s.

After the 401 error log, the client appears to have sync'd to the new durable sync node without a full reupload! Log

EDIT: It did reupload its tabs (it originally had 1 tab, which was reuploaded).

@pjenvey
Member

pjenvey commented Apr 7, 2020

Though I noticed one oddity: the Tabs collection in about:sync fails to load, reporting an error:

Error: HMACMismatch: Record SHA256 HMAC mismatch: should be 9946063866ac59e0006d0abccb0d9b37ec3216fe57eab97fc5cc0a1d0480bbfd, is ed307f5234e611eb05431ff6a979e2948f25b541595c46fb3f41182257140d36

Restarting the browser led to a sync error w/ the same HMACMismatch: Log

@pjenvey
Member

pjenvey commented Apr 7, 2020

Tabs are currently set to be stored solely in memcached on the server (cache_only_collections = tabs), so it makes sense that they would have been reuploaded. But this doesn't explain the HMACMismatch.

@rfk

rfk commented Apr 8, 2020

Hrm, the hmac mismatch is very strange indeed. If you use the about:sync addon, can you inspect the state of the "tabs" collection to make sure it only has a single record with id "CB3DaSRC1v6F", which is the one uploaded by your client?

@rfk

rfk commented Apr 8, 2020

I also wonder if we're seeing the HMAC error on the tabs collection because that's the only one for which we're trying to download any records - all other collections are (correctly) treated as not having any remote changes. Trying to add a second client to this cluster might result in additional HMAC errors being surfaced.

@rfk

rfk commented Apr 8, 2020

can you inspect the state of the "tabs" collection to make sure it only has a single record with id
"CB3DaSRC1v6F", which is the one uploaded by your client?

Actually, from the logs, I wonder if there is somehow a second item in the "tabs" collection. The logs around the HMAC error say this:

1586297619302	Sync.Engine.Tabs	DEBUG	First sync, uploading all items
1586297619302	Sync.Engine.Tabs	INFO	1 outgoing items pre-reconciliation
1586297619302	Sync.Engine.Tabs	TRACE	Downloading & applying server changes
1586297619302	FirefoxAccounts	TRACE	not checking freshness of profile as it remains recent
1586297619302	Sync.BrowserIDManager	TRACE	_ensureValidToken already has one
1586297619368	Sync.Collection	DEBUG	GET success 200 https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142198989/storage/tabs?full=1&limit=1000
1586297619369	Sync.Engine.Tabs	TRACE	Ignoring incoming tab item because of its id: CB3DaSRC1v6F
1586297619369	Sync.Engine.Tabs	TRACE	Skipping reconciled incoming item CB3DaSRC1v6F

So it skips the incoming tab item for itself, which is good.

1586297619369	Sync.Service	INFO	Bad HMAC event detected. Attempting recovery or signaling to other clients.
[... crypto/keys refetching blah-blah goes here ...]
1586297619511	Sync.Engine.Tabs	INFO	Records: 0 applied, 0 successfully, 1 failed to apply, 1 newly failed to apply, 1 reconciled.

But here we see that it tried to apply a record, and failed to do so due to the HMAC error. Does that mean there's a second BSO in the tabs collection, of unknown origin?

To check my understanding: we expect that the migration script will not populate any items into the "tabs" collection, because that collection does not exist in MySQL on the storage nodes. Is that accurate?

@pjenvey
Member

pjenvey commented Apr 8, 2020

If you use the about:sync addon, can you inspect the state of the "tabs" collection to make sure it only has a single record with id "CB3DaSRC1v6F", which is the one uploaded by your client?

about:sync isn't helpful here; it reports the HMACMismatch error and is stuck in the "Fetching records..." state for tabs, even after dismissing the error message.

@rfk

rfk commented Apr 8, 2020

Ugh, OK. Any chance you'd be willing to hit the server API directly and pull down the record(s) for inspection? :-)

@pjenvey
Member

pjenvey commented Apr 8, 2020

Ugh, OK. Any chance you'd be willing to hit the server API directly and pull down the record(s) for inspection? :-)

Someone would need to help me figure out how to get the appropriate Authorization header needed to query the backend. However, I was luckily able to grab this count from about:sync's logging of the /info/collection_counts response. Surprisingly, it reported tabs: 4.

Is it possible we have stale tabs data from mysql? I'm not familiar w/ the cache_only_collections implementation; does it ever back up any of its data to mysql?

I'm also wondering about the ramifications of migrated user data including entries for tabs in meta/global and, I think, /info/collections as well.

@pjenvey
Member

pjenvey commented Apr 8, 2020

Someone would need to help me figure out how to get the appropriate Authorization header needed to query the backend.

Never mind, I think I've got the hang of this.
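For reference, a minimal sketch of the kind of request this involves, assuming the id/key pair from the tokenserver response and the mohawk library (all values below are placeholders):

```python
import requests
from mohawk import Sender

# The id/key pair comes from the tokenserver response for this user (placeholders here).
credentials = {"id": "<token id>", "key": "<token key>", "algorithm": "sha256"}
url = "https://sync-1-us-west1-g.sync.services.mozilla.com/1.5/142198989/storage/tabs?full=1"

# Hawk-sign the request and pass the resulting header to the storage node.
sender = Sender(credentials, url, "GET", content="", content_type="")
resp = requests.get(url, headers={"Authorization": sender.request_header})
print(resp.status_code)
print(resp.json())
```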

@rfk

rfk commented Apr 8, 2020

Is it possible we have stale tabs data from mysql? I'm not familiar w/ the cache_only_collections
implementation, does it ever backup any of its data to mysql?

I don't think it will ever be written back to mysql. Can we look in the source tables on the old node to see if there were any there?
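A rough sketch of what that lookup might involve, assuming pymysql; the shard table name, numeric collection id, and user id below are guesses at the legacy schema and would need to be checked against the actual node:

```python
import pymysql

# All connection details and identifiers are placeholders.
conn = pymysql.connect(host="<old node db>", user="<user>", password="<password>",
                       database="sync")
legacy_userid = 142188710   # this test user's uid on the legacy node (verify)
tabs_collection_id = 9      # assumed numeric id of the "tabs" collection (verify)

with conn.cursor() as cur:
    # The legacy storage shards rows across bso0..bsoN tables; pick the right shard.
    cur.execute(
        "SELECT COUNT(*) FROM bso0 WHERE userid = %s AND collection = %s",
        (legacy_userid, tabs_collection_id),
    )
    print("tabs rows on old node:", cur.fetchone()[0])
```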

@pjenvey
Member

pjenvey commented Apr 17, 2020

@rfk thanks for the hints

Need to confirm some of these details w/ ops, but it seems this was due to stale tabs data in spanner.

For this migration test, the client was a new account:

  1. New accounts default to a Spanner node
  2. Then it was manually reassigned to a legacy Python node
  3. Test client would sync data there, then we'd manually migrate it from legacy back to Spanner. No tabs data would migrate since they're only in memcached.
  4. Then it was manually reassigned back to Spanner

If a client "Signs Out" of FxA/Sync while on the Python node, clients and forms collections perform a kind of "reset" on the client side and the accompanying records for the device are DELETEd on the server. In some cases we were doing a "Sign Out" to force the client to immediately catch the reassignment (restarting the browser seems to be another way to force it).

We had a couple problems here:

  • During 1) the new account made an initial sync to Spanner. It looks like we forgot to clean up this initial sync data. The account's user info/client state does not change during this test, so we eventually migrate the client back to the same location (primary keys) in Spanner.
  • The migration script should have skipped migrating this user due to this pre-existing data in Spanner, but a bug prevented that check from happening; instead we were overwriting everything there (see the sketch below).
  • Except tabs, since there were none to migrate.
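As a hypothetical sketch of the missing guard (the objects and method names here are stand-ins, not the real migrate_node code):

```python
import logging

log = logging.getLogger("migrate_node")

def migrate_user(user, spanner, legacy):
    # Guard that should have run: if this (fxa_uid, fxa_kid) already has rows in
    # Spanner, the user synced there before, so skip rather than overwrite and
    # risk leaving stale collections (like "tabs") behind.
    if spanner.has_existing_data(user.fxa_uid, user.fxa_kid):
        log.warning("skipping %s: pre-existing data in Spanner", user.fxa_uid)
        return
    legacy.copy_collections_to(spanner, user)
```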

So stale tabs data was left around, and I think combined w/ the "reset" -- this "reset" must change some state of the tabs collection used for HMAC validation. Does that make sense?

Another note: though we won't migrate tabs data, we'll migrate an entry for it in meta/global. Do you foresee any issues w/ that (we'll double check the behavior around this w/ more test runs)?

@rfk

rfk commented Apr 17, 2020

Aha, yes I think this would explain it!

At step (1) the client would have created new bulk storage keys in /crypto/keys on the spanner node, and uploaded data (including tabs) encrypted under those keys k1.

At step (2) the client would have seen that it had been re-assigned to a new, empty storage node, and would have re-initialized the sync data including /crypto/keys, then re-uploaded its sync data encrypted under those keys k2.

At step (3) you copy the new /crypto/keys over to spanner, and the client happily continues reading and writing data using the same bulk storage keys k2.

Except for the "tabs" collection, where it finds the old data that was encrypted using the old storage keys k1, which it's not able to decrypt and which produces an HMAC verification error.
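As a toy illustration of that failure mode (keys and ciphertext are random placeholders, not real Sync key derivation):

```python
import hashlib
import hmac
import os

record_ciphertext = b"<base64 ciphertext of the leftover tabs record>"  # placeholder

k1_hmac_key = os.urandom(32)  # bulk HMAC key from the first /crypto/keys (step 1)
k2_hmac_key = os.urandom(32)  # bulk HMAC key from the re-initialized /crypto/keys (step 2)

# The record left behind on Spanner carries an HMAC computed under k1...
stored_hmac = hmac.new(k1_hmac_key, record_ciphertext, hashlib.sha256).hexdigest()

# ...but the client now validates incoming records under k2, so verification fails,
# surfacing the "Record SHA256 HMAC mismatch" error seen in the logs.
expected_hmac = hmac.new(k2_hmac_key, record_ciphertext, hashlib.sha256).hexdigest()
assert stored_hmac != expected_hmac
```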

I'm not sure what effect the explicit signing out would have had - perhaps if you get a node re-assignment while you're signed in, Firefox is able to re-use the previous keys from /crypto/keys, but if you're not signed in then it has to create some new ones.

Regardless, I think this is a plausible explanation. Thanks for taking the time to dig into the details here!

Services Engineering automation moved this from Archived to Done Apr 21, 2020
@tublitzed tublitzed moved this from Done to Archived in Services Engineering Apr 27, 2020
@tublitzed tublitzed moved this from Archived to In Progress in Services Engineering Apr 29, 2020
@tublitzed
Contributor Author

Re-opening this one as it looks like we may have hit the issue again in yesterday's client testing.

@pjenvey pjenvey reopened this Apr 30, 2020
Services Engineering automation moved this from In Progress to Prioritized Apr 30, 2020
@pjenvey pjenvey assigned pjenvey and unassigned tublitzed Apr 30, 2020
@pjenvey
Member

pjenvey commented May 4, 2020

Logs for my last round of successful attempts (on stage):

pjenvey-stage-migrate1: Did a forceful "Sign out" of Sync to force a lookup back to the Spanner node after its migration
pjenvey-stage-migrate2
pjenvey-stage-migrate3

All 3 of these test accounts migrated successfully.

@tublitzed tublitzed moved this from Prioritized to Archived in Services Engineering May 26, 2020