
Handle Spanner Migration replaced_at records in the purge_old_records script #1538

Closed
data-sync-user opened this issue Apr 16, 2024 · 1 comment
data-sync-user commented Apr 16, 2024

The purge_old_records.py script was disabled around the same time (early 2020) that the syncstorage migration from AWS (MySQL-based syncstorage nodes) to GCP (Google Cloud Spanner-based syncstorage nodes) began. Users were migrated to Spanner by:

  1. disabling their MySQL syncstorage node and copying their data to Spanner
  2. marking their tokenserver record as "replaced" (setting a replaced_at value in their user record in the database), which signals that:
  3. upon the user's next request to tokenserver, tokenserver reassigns the user to the Spanner syncstorage node by creating a new active user record pointing at that node, with the auth metadata (used to index their data on the syncstorage node) copied from the replaced record.
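As a rough sketch, step 2 amounts to a single timestamp update. The helper name and schema details here are assumptions for illustration, not the migration tooling's actual code:

```python
import time

def mark_user_replaced(db, uid, timestamp_ms=None):
    """Mark a user's current tokenserver record as replaced (step 2 above).

    On the user's next request, tokenserver sees the replaced record and
    creates a fresh active record pointing at the Spanner node, copying
    the auth metadata from this one.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # tokenserver uses ms timestamps
    db.execute(
        "UPDATE users SET replaced_at = :replaced_at WHERE uid = :uid",
        {"replaced_at": timestamp_ms, "uid": uid},
    )
```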

The problem we've encountered with the purge_old_records script is due to:
A) All syncstorage data was copied to the Spanner database, including invalidated data pending deletion by the purge script, which was no longer running on either the MySQL or Spanner nodes. The script was modified in 2022 to detect invalidated data copied to Spanner and issue deletes for it (the "force" option).
B) Step #2 of the migration process did not take the purge script's actions into account:

  • Generally, users have their tokenserver record "replaced" during syncstorage node reassignments, typically due to password changes, which alter their sync encryption key and invalidate their syncstorage data
  • The purge script deletes that invalid syncstorage data: a "replaced" record indicates to the script that it should delete the associated syncstorage data
  • Combined with the modification to the purge script in A), marking those migrated user records as "replaced" with no associated password change or syncstorage data invalidation (the auth metadata is unchanged) poses a problem: those records signal deletion of potentially active user data. Some of these "replaced" records point to data that was never invalidated, only copied.

The work here is to modify the script to further analyze "replaced" records: the script's current logic assumes these records point to invalidated syncstorage data, which is not always the case.
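A minimal sketch of that extra analysis, assuming the distinguishing signal is unchanged auth metadata (a migration-replaced record carries the same generation/client_state as the active record, since the data was copied rather than invalidated; the function and field names are illustrative, not the script's actual code):

```python
def is_migration_artifact(replaced_record, active_record):
    """True if a "replaced" record likely points at still-live data.

    A genuine replacement (e.g. a password change) alters the auth
    metadata; the Spanner migration copied it verbatim.
    """
    return (
        replaced_record["generation"] == active_record["generation"]
        and replaced_record["client_state"] == active_record["client_state"]
    )

def should_purge(replaced_record, active_record):
    # Only delete syncstorage data for records whose auth metadata changed.
    return not is_migration_artifact(replaced_record, active_record)
```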

┆Issue is synchronized with this Jira Task


➤ JR Conlin commented:

Adding a bit of additional context to this bug.

We believe that not all records were properly migrated from AWS to GCP. An unknown number of user records were migrated by setting the user’s current node to Spanner, but also with the replaced_at field set.

The purge_old_records.py script looks to delete old user records from the database. The script determines a user record is "old" by querying the tokenserver database (via the _GET_OLD_USER_RECORDS_FOR_SERVICE SQL statement) for any record whose replaced_at field is set to a timestamp less than the current time minus a grace_period of between 24 hours (the default in purge_old_records.py) and a week (the default in database.py's get_old_user_records()). We will use the force option to ignore node state; the script then calls delete_service_data() to delete the service data associated with the old user record, and then deletes the token service record associated with it.

There may exist an edge case where a user’s assigned node was set to Spanner, and the replaced_at datestamp was also set. This would cause all records for a given user to be candidates for deletion.

In tokenserver's get_or_create_user_sync(), we fetch user records in descending order by creation time. For every user record other than the first result (for old_user in &raw_users[1..]), we force its replaced_at time to the latest record's creation time. We then capture the first time we've "seen" this user (the earliest record's created_at timestamp). We then check the most recent user record: if there's a value in its replaced_at field, or if the node is unassigned, we return a result (replaced_at: None, first_seen_at) that forces the replaced_at field to be unset. If replaced_at is set but the assigned node is also set, we likewise ignore the stored replaced_at value (again returning replaced_at: None with first_seen_at).

The product of this is that IF the most recent user record has a replaced_at value set, we may accidentally delete that user's record, even though they're already assigned to the Spanner node. Our purge script should avoid that situation; ideally, it should also remove the spurious replaced_at from the user's most recent record.
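One way that guard could look, as a hedged sketch: before deleting, check whether the candidate is the user's most recent record and is already assigned to Spanner, and if so repair it rather than purge. SPANNER_NODE and the helper are placeholders, not the script's real names:

```python
SPANNER_NODE = "spanner"  # placeholder for the real Spanner node identifier

def purge_candidate_action(record, user_records):
    """Decide what the purge logic should do with a candidate record.

    user_records: all of this user's tokenserver records.
    """
    newest = max(user_records, key=lambda r: r["created_at"])
    if (
        record is newest
        and record["node"] == SPANNER_NODE
        and record.get("replaced_at") is not None
    ):
        # Spurious replaced_at left over from the migration: clear it
        # and keep the (still-live) data.
        return "clear_replaced_at"
    return "delete"
```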
