New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix eventual consistency issues in SharedDirectory #11629
Conversation
⯅ @fluid-example/bundle-size-tests: +2.51 KB
Baseline commit: 44bb24d |
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
|
@anthony-murphy Take a look when you get time. |
|
@vladsud Take a look at this when you get time. |
packages/dds/map/src/directory.ts
Outdated
| @@ -693,17 +698,34 @@ export class SharedDirectory extends SharedObject<ISharedDirectoryEvents> implem | |||
| return this.localValueMaker.fromSerializable(serializable); | |||
| } | |||
|
|
|||
| private getParentDirectory(relativePath: string): IDirectory | undefined { | |||
| const absolutePath = this.makeAbsolute(relativePath); | |||
| const subdirs = absolutePath.substr(1).split(posix.sep); | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do directory paths always start with /? Is this enforced somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that string you get at the end does not have starting "/". So feels a bit dangerous to assume so, as other code may pass string without leading "/" (similar to how this code does), and I assume getWorkingDirectory() and other APIs are not enforcing that paths should start with slash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, your code could be much more efficient if you use lastIndexOf() and substr(), without ever splitting into chunks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makeAbsolute handles that and make sure we start with "/". This is same logic as used in getWorkingDirectory. So when we call getWorkingDirectory, we don't need to make sure that it starts with "/".
Changed the code to use lastIndexOf.
| const parentSubDir = this.getParentDirectory(relativePath) as SubDirectory | undefined; | ||
| const index = relativePath.lastIndexOf(posix.sep); | ||
| const dirName = relativePath.substring(index + 1); | ||
| return !!(parentSubDir?.isSubDirectoryDeletePending(dirName)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this recursive? I.e. if any parent is pending delete, will it return true? If not, should it be? As any delete in the chain should be handled the same, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not recursive. This just checks in the immediate parent directory as we don't need or reach here if any parent has been deleted because getWorkingDirectory before where it is called would return empty result. However, in cases where this directory has again been created after deletion, then we will be able to reach here. So, I would say,
a.) It is maybe unlikely that if any ancestor is deleted and then entire structure will be created again from deleted parent to this child.
b.) I think if we need to check entire path from root -> this child, whether node is deleted or not then at that point in time the cons would outweigh the benefit of not applying the operation. So, I think just checking this for immediate parent should be ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if I got you right, you are saying that pending delete for any parent in the chain would result in this function returning false, right? Can you please add a comment on that in the header of the function?
I do not think #a is unlikely - we should handle it properly.
I'm not sure what you mean by cons would outweigh benefits. It all starts with clearly describing behaviors for users. If I delete /A, I expect all objects under it to be gone and any concurrent operations from other machines underneath /A to be dropped on the floor.
| const context = local ? undefined : this.makeLocal(op.key, op.path, op.value); | ||
| subdir.processSetMessage(op, context, local, localOpMetadata); | ||
| subdir.processSetMessage(msg, op, context, local, localOpMetadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels to me (on the first look) that all operations should be handled exactly the same in when pending delete is present for any parent directory (recursively). Not sure why some operations (above) have that special handling, and some (from here down) do not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just the type of op. The above 3 ops are set/delete/clear on keys while these 2 are create/delete subdirectory and we do check these things for these 2 ops in this function: needProcessSubDirectoryOperation() in detail and take action appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd think that we should handle all these operations uniformly. I.e. if any parent above has pending delete, then all the ops (on keys and sub-directories) should be ignored
| * Subdirectories that have been deleted locally but not yet ack'd from the server. This maintains the count | ||
| * of delete op that are pending or yet to be acked from server. | ||
| */ | ||
| private readonly pendingDeleteSubDirectoriesCount: Map<string, number> = new Map(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be fine, but can you please ensure all of this logic and new state tracking works Ok when we serialize container with pending deletes and later on rehydrate it? I think we need to have UTs that validate things work. I guess there is no special handling required, as we replay changes (i.e. essentially generate appropriate changes again), but better be tested :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In detached container or when we serialize it then there are no ops in that state. So, there will be no pending deletes, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serialized attached container (Container.closeAndGetPendingLocalState). There will be local non-acked ops serialized and rehydrated.
|
@vladsud Take another look when you get time. |
| } | ||
| } else if (op.type === "deleteSubDirectory") { | ||
| // If this is remote delete op and we have keys in this subDirectory, then we need to delete these | ||
| // keys except the pending ones as they will be sequenced after this delete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment does not match my mental model. I'd think that any operations for old instance of this sub-directory should be dropped on the floor, even if it gets sequences later
|
I'm not comfortable approving this review without the addition of non-deterministic stress tests. These have proven to be essential at finding complex interactions between clients where eventual consistency doesn't hold. we should check in test like this independently. ideally, they would be failing so skipped, as we know there are problems, and a then be enabled in this review showing they now succeed. This is probably not a small work item, and i'm happy to help write them. @Abe27342 can you point to any documentation that exist for building new tests based on your framework? |
Sure, the package readme here is a pretty good starting point. There are some existing examples in |
Thanks, I will take a look into these and try to build these first and then merge/fix this PR. |
|
This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework! |
Description
Thanks @anthony-murphy for the draft PR you raised earlier.
1.) Add a list of clientids which have created this directory. Add a directory seq number which is set to sequence number of
create op of that directory or 0 in case it is loaded frpm snapshot. Add a shouldProcessMessage which only process ops if the remote message is either from the creator of directory or this directory was created when container was detached or in case this directory is already live(known to other clients) and the op was created after the directory was created.
2.) If a delete sub directory op comes and there is already local pending create/delete subdirectory, then instead of ignoring the
delete op, clear the non-pending keys. Also reset the creator clientids and seq number to 0 as the new create op will reset them and we will also not process any set/clear/delete ops before the new create op.
3.) If there is pending delete op for this subDirectory, then don't apply the storage ops as we are going to delete this subDirectory. The problem it solves is that if client again created the sub directory after deleting, then we will end up applying that op without this change.