
Handle Spanner Migration replaced_at records in the purge_old_records script #1538

Closed
data-sync-user opened this issue Apr 16, 2024 · 1 comment
data-sync-user commented Apr 16, 2024

The purge_old_records.py script was disabled around the same time (early 2020) that the syncstorage migration from AWS (MySQL-based syncstorage nodes) to GCP (Google Cloud Spanner-based syncstorage nodes) began. Users were migrated to Spanner by:

  1. disabling their MySQL syncstorage node and copying their data to Spanner
  2. marking their tokenserver record as "replaced" (setting a replaced_at value in their user record in the database), which signals that:
  3. upon the user's next request to tokenserver, tokenserver reassigns the user to the Spanner syncstorage node by creating a new active user record pointing at that node, with the auth metadata (used to index their data on the syncstorage node) copied from the replaced record.
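As a rough sketch, step 2 amounts to a single timestamp update. The helper name and schema details here are assumptions for illustration, not the migration tooling's actual code:

```python
import time

def mark_user_replaced(db, uid, timestamp_ms=None):
    """Mark a user's current tokenserver record as replaced (step 2 above).

    On the user's next request, tokenserver sees the replaced record and
    creates a fresh active record pointing at the Spanner node, copying
    the auth metadata from this one.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # tokenserver uses ms timestamps
    db.execute(
        "UPDATE users SET replaced_at = :replaced_at WHERE uid = :uid",
        {"replaced_at": timestamp_ms, "uid": uid},
    )
```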

The problem we've encountered with the purge_old_records script is due to:
A) All syncstorage data was copied to the Spanner database, including invalidated data pending deletion by the purge script, which was no longer running on either the MySQL or Spanner nodes. The script was modified in 2022 to detect invalidated data copied to Spanner and issue deletes for it (the "force" option).
B) Step #2 of the migration process did not take the purge script's actions into account:

  • Generally, users have their tokenserver record "replaced" during syncstorage node reassignments, typically due to password changes, which alter their sync encryption key and invalidate their syncstorage data
  • The purge script deletes that invalid syncstorage data: a "replaced" record indicates to the script that it should delete the associated syncstorage data
  • Combined with the modification to the purge script in A), marking those migrated user records as "replaced" with no associated password change or syncstorage data invalidation (the auth metadata is unchanged) poses a problem: those records signal deletion of potentially active user data. Some of these "replaced" records point to data that was never invalidated, only copied.

The work here is to modify the script to further analyze "replaced" records: the script's current logic assumes these records point to invalidated syncstorage data, which is not always the case.
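A minimal sketch of that extra analysis, assuming the distinguishing signal is unchanged auth metadata (a migration-replaced record carries the same generation/client_state as the active record, since the data was copied rather than invalidated; the function and field names are illustrative, not the script's actual code):

```python
def is_migration_artifact(replaced_record, active_record):
    """True if a "replaced" record likely points at still-live data.

    A genuine replacement (e.g. a password change) alters the auth
    metadata; the Spanner migration copied it verbatim.
    """
    return (
        replaced_record["generation"] == active_record["generation"]
        and replaced_record["client_state"] == active_record["client_state"]
    )

def should_purge(replaced_record, active_record):
    # Only delete syncstorage data for records whose auth metadata changed.
    return not is_migration_artifact(replaced_record, active_record)
```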

┆Issue is synchronized with this Jira Task


➤ JR Conlin commented:

Adding a bit of additional context to this bug.

We believe that not all records were properly migrated from AWS to GCP. An unknown number of user records were migrated by setting the user’s current node to Spanner, but also with the replaced_at field set.

The purge_old_records.py script looks to delete old user records from the database. The script determines a user record is "old" by querying the tokenserver database (via the _GET_OLD_USER_RECORDS_FOR_SERVICE SQL statement) for any record whose replaced_at field is set to a timestamp less than the current time minus a grace_period of between 24 hours (the default in purge_old_records.py) and a week (the default in database.py's get_old_user_records()). We will use the force option to ignore node state; the script then calls delete_service_data() to delete the service data associated with the old user record, and then deletes the token service record associated with it.

There may exist an edge case where a user’s assigned node was set to Spanner, and the replaced_at datestamp was also set. This would cause all records for a given user to be candidates for deletion.

In tokenserver's get_or_create_user_sync(), we fetch user records in descending order by creation time. For every user record other than the first result (for old_user in &raw_users[1..]), we force its replaced_at time to the latest record's creation time. We then capture the first time we've "seen" this user (the earliest record's created_at timestamp). We then check the most recent user record: if there's a value in its replaced_at field, or if the node is unassigned, we return a result (replaced_at: None, first_seen_at) that forces the replaced_at field to be unset. If replaced_at is set but the assigned node is also set, we likewise ignore the stored replaced_at value (again returning replaced_at: None with first_seen_at).

The product of this is that IF the most recent user record has a replaced_at value set, we may accidentally delete that user's record, even though they're already assigned to the Spanner node. Our purge script should avoid that situation; ideally, it should also remove the spurious replaced_at from the user's most recent record.
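One way that guard could look, as a hedged sketch: before deleting, check whether the candidate is the user's most recent record and is already assigned to Spanner, and if so repair it rather than purge. SPANNER_NODE and the helper are placeholders, not the script's real names:

```python
SPANNER_NODE = "spanner"  # placeholder for the real Spanner node identifier

def purge_candidate_action(record, user_records):
    """Decide what the purge logic should do with a candidate record.

    user_records: all of this user's tokenserver records.
    """
    newest = max(user_records, key=lambda r: r["created_at"])
    if (
        record is newest
        and record["node"] == SPANNER_NODE
        and record.get("replaced_at") is not None
    ):
        # Spurious replaced_at left over from the migration: clear it
        # and keep the (still-live) data.
        return "clear_replaced_at"
    return "delete"
```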
