Add migration state flag to data store #125
Comments
Doing a db migration on these nodes scares me. Possible approaches mentioned in the meeting:
Memcached is probably going to be the simplest option and gives the finest-grained control.
Agreed. That's why I left it open and nebulous as "data store". Memcached works just fine.
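For concreteness, here's a minimal sketch of what a memcached-backed flag could look like. The key scheme, state names, and use of pymemcache are assumptions for illustration, not the project's actual implementation:

```python
# Hypothetical sketch: a per-user migration state kept in memcached.
from pymemcache.client.base import Client

MIGRATION_STATES = ("LOCAL", "MIGRATING", "MIGRATED", "ERROR")

cache = Client(("localhost", 11211))

def set_migration_state(uid, state):
    assert state in MIGRATION_STATES
    # No expiry: the flag should persist for the whole migration window.
    cache.set(f"migration:{uid}", state)

def get_migration_state(uid):
    # Default to LOCAL if the flag was never set for this user.
    value = cache.get(f"migration:{uid}")
    return value.decode("utf-8") if value else "LOCAL"
```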
@jrconlin just to ensure we're all on the same page re: expectations, is this something your team is happy to take on, or would you like my input on implementation details?
FWIW, I think MIGRATING is what would return a 5XX and MIGRATED would return a 401. I'm not sure how you see ERROR being used, but from the client's POV, that should probably also return a 5XX, so maybe ERROR and MIGRATING are actually the same if you squint?
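Roughly, the mapping being suggested here; whether MIGRATING maps to 503 specifically (rather than some other 5xx) is an assumption:

```python
# Sketch of the status-code mapping suggested above (illustrative only).
STATE_TO_HTTP = {
    "LOCAL": 200,       # proceed normally
    "MIGRATING": 503,   # temporary failure; client retries later
    "MIGRATED": 401,    # token invalid; client re-fetches a token and lands on the new node
    "ERROR": 503,       # treated like MIGRATING from the client's point of view
}
```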
Our team can do it, but we definitely need to make sure we have all the requirements defined.
@rfk we'll take it on. It'd be great to work with your team, though, just to ensure we're all on the same page for requirements/approach.
@jrconlin - should this still be open, since it looks like you're tackling it as part of this PR?
Closing this. Ops identified that Node 800 is considered the "spanner" dedicated node and would suffice as an indicator of customer state for tokenserver.
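In other words, the check this implies on the tokenserver side might look like the sketch below; the constant name and record layout are assumptions, not the actual code:

```python
# Hypothetical check implied by the comment above: a user assigned to the
# dedicated Spanner node is treated as already migrated.
SPANNER_NODE_ID = 800

def is_migrated(user_record):
    return user_record["node_id"] == SPANNER_NODE_ID
```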
@jrconlin should we also update the related migration plan to reflect this?
Yes. I'm updating the document now. Thanks!
Looping back around after a cross-team meeting today. The client cannot tolerate writing data, getting a success response back, and then having the server potentially lose that information. During migration there is a chance (albeit very small) that a migration starts, the client writes to syncstorage, and then the migration completes and the user is "moved" to the new data store. This would result in the server "losing" the previous write after telling the client that it was successfully stored. The client would be MUCH HAPPIER if the server indicated that the write failed, for any reason, so that the client can try again. Database modification was ruled out as undesirable, so creating a distinct migration state flag to indicate the state of transition for the user is a bit harder. Alternate options:
Just fleshing out some of these options:
This implies that the legacy storage servers would need to check this node value before every write.
Similar to the above, but at least the read would be local. In practice, this sounds a lot like adding a "migrated" flag to the storage server rather than the tokenserver.
In both of these scenarios, we would probably just need to detect the situation and drop the Spanner data entirely; i.e., these users would, in effect, not be "migrated", and we would end up treating this as an old-school node migration. While this doesn't sound great, it might be OK because I'd expect that in practice very few users would be impacted. However, I think we should treat this as the last-resort option.
Another option we briefly discussed: assuming there's a brief (i.e., ~1s) period where there's a possibility of things not being in sync, we could (say) return a 500 for all users on that node. I don't know enough about the server architecture to know whether that makes sense. It's also worth reiterating that clients cache the tokenserver token, so any solution will need to handle the fact that until that token expires or they see a 401, the clients aren't going to ask what the storage server is. IOW, from the client's POV, the perfect scenario is that:
Right, likewise the other option would require a cross-check. They both feel messy compared with creating a new column, but would work. Setting a flag globally on the server is a bit tricky, because it means communicating with the various threads handling the links. There is also a user meta record on the storage server that might be extended to indicate that the data is being migrated.
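Whichever store the flag lives in (memcached, a node value, or an extended user meta record), the pre-write guard would look roughly like the sketch below. The function name and status codes are assumptions for illustration, not a settled design:

```python
# Sketch of the pre-write guard under discussion: before accepting a write,
# the legacy storage server consults the per-user migration state.
def write_guard_status(migration_state: str):
    """Return an HTTP status to send instead of accepting the write, or None."""
    if migration_state == "MIGRATING":
        # Refuse the write so the migration works on a consistent snapshot;
        # the client keeps its local changes and retries later.
        return 503
    if migration_state == "MIGRATED":
        # Force the client to fetch a fresh token and discover its new node.
        return 401
    return None  # LOCAL: accept the write as usual
```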
Sorry I missed the synchronous discussion, but why is it important that the client not see an error here? I think it's important that we don't accept client writes while a migration is in progress, in order to let the migration work on a consistent snapshot of data.
Ok, I get the feeling that we're exposing a much, much larger issue here. If I understand your point correctly, when a client discovers it's been assigned to a new node, it does a full reconcile sync write. This repopulates the new destination node with data. In that case it doesn't matter what was previously stored in the node, since the client will simply write new data to it. Correct? (We had discussions about this in several internal documents.) If that's the case, then migrating user data from existing MySQL nodes to the Spanner node is effectively busy-work. Now, if that's not quite correct, and there is some value in copying existing data from an old node to the new node, then we need to ensure that the snapshot is "clean". To that end, both nodes (old and new) should prevent accidental writes of sync data, to avoid potential corruption or loss of data state. That could happen if a client gets a 2xx response and updates what it thinks the server has. The current plan is not to update the tokenserver to indicate the new node is "available" until the data has been migrated over. Since that may take time, a client may try to update data in the meantime, hence the potential for lost data. Mind you, if the larger "migration doesn't matter" issue is in play, then you're right.
I don't believe it will clobber existing data in the node, but will instead try to fetch any existing data from the new node and locally reconcile that data with its own view of the world, then upload any changes. But I'm not the most appropriate authority on client behavior here; @mhammond or @linacambridge may want to weigh in.
Correct. The way Desktop decides to do this is via the (global and per-collection) sync IDs. If the IDs match, Desktop syncs as before. If they don't, it'll throw away all local Sync metadata—change counters, sync IDs, last sync times, and so on—pull down everything from the server, do a full reconcile, and upload any new changes. In most cases, that delta will be nothing, so there shouldn't be an increased write load on the server. But it is super read-heavy. However, Desktop will wipe all collections for a user if the server's global metadata record is missing. So I think there are a couple of approaches we can take:
In both cases, it's possible we'll do the migration just as a client is syncing. Since there's no way for the client to tell the server "hey, I'm syncing right now", and batch uploads aren't atomic across collections (and might even be split up for large collections), a client could successfully commit some writes, then get a 5xx. In this case, the client won't fast-forward its last sync time or update any Sync metadata. So we'll leave a partial write on the old server, which the client can fix up after the migration. But that gets tricky—if the user has multiple devices, and another one syncs with Spanner first, it'll see the result of the partial write, and try to fix it up before the interrupted device syncs again. Option (2) is better for working around this because it will make every client download and re-apply the partial state. It might be wrong, but we're guaranteed not to lose those partial writes, and it'll be consistently wrong everywhere (and the user can fix it up).
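For anyone skimming, the Desktop decision described above boils down to something like this sketch. This is not the actual client code; the function and field names are illustrative only:

```python
# Rough sketch of the sync-ID comparison described above.
def plan_sync(local_meta, server_meta):
    if local_meta.get("syncID") == server_meta.get("syncID"):
        return "incremental"  # IDs match: sync as before
    # IDs differ: discard local Sync metadata (change counters, sync IDs,
    # last sync times), download everything, reconcile locally, and upload
    # only the resulting delta; read-heavy, usually write-light.
    return "full_reconcile"
```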
This was not in my mental model of the system, thanks for clarifying!
What's the end result of this situation?
Closing as wontfix, based on the recent change of plans to migrate per node rather than per user.
For each user, we should add a state flag column in the data store to provide fine-grained control over migration (a rough schema sketch follows the list below).
e.g.
"LOCAL" - User should continue using this node.
"MIGRATING" - return error putting client in pending migration state
"MIGRATED" - return 503 to force client to fetch new token and migrate
"ERROR" - An error occurred, user requires special attention.