Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix eventual consistency issues in SharedDirectory #11629

Closed
wants to merge 14 commits into from

Conversation

jatgarg
Copy link
Contributor

@jatgarg jatgarg commented Aug 22, 2022

Description

Thanks @anthony-murphy for the draft PR you raised earlier.

1.) Add a list of clientids which have created this directory. Add a directory seq number which is set to sequence number of
create op of that directory or 0 in case it is loaded frpm snapshot. Add a shouldProcessMessage which only process ops if the remote message is either from the creator of directory or this directory was created when container was detached or in case this directory is already live(known to other clients) and the op was created after the directory was created.

2.) If a delete sub directory op comes and there is already local pending create/delete subdirectory, then instead of ignoring the
delete op, clear the non-pending keys. Also reset the creator clientids and seq number to 0 as the new create op will reset them and we will also not process any set/clear/delete ops before the new create op.

3.) If there is pending delete op for this subDirectory, then don't apply the storage ops as we are going to delete this subDirectory. The problem it solves is that if client again created the sub directory after deleting, then we will end up applying that op without this change.

@jatgarg jatgarg self-assigned this Aug 22, 2022
@jatgarg jatgarg requested a review from a team as a code owner August 22, 2022 23:38
@github-actions github-actions bot added area: dds Issues related to distributed data structures base: main PRs targeted against main branch labels Aug 22, 2022
@msfluid-bot
Copy link
Collaborator

msfluid-bot commented Aug 23, 2022

@fluid-example/bundle-size-tests: +2.51 KB
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 397.3 KB 399.81 KB +2.51 KB
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 196.25 KB 196.25 KB No change
loader.js 153.76 KB 153.76 KB No change
map.js 42.84 KB 42.84 KB No change
matrix.js 131.78 KB 131.78 KB No change
odspDriver.js 151.87 KB 151.87 KB No change
odspPrefetchSnapshot.js 39.77 KB 39.77 KB No change
sharedString.js 152.99 KB 152.99 KB No change
Total Size 1.27 MB 1.27 MB +2.51 KB

Baseline commit: 44bb24d

Generated by 🚫 dangerJS against 9ecd0f0

packages/dds/map/src/directory.ts Outdated Show resolved Hide resolved
packages/dds/map/src/directory.ts Outdated Show resolved Hide resolved
packages/dds/map/src/directory.ts Outdated Show resolved Hide resolved
packages/dds/map/src/directory.ts Outdated Show resolved Hide resolved
packages/dds/map/src/directory.ts Show resolved Hide resolved
packages/dds/map/src/directory.ts Outdated Show resolved Hide resolved
packages/dds/map/src/directory.ts Outdated Show resolved Hide resolved
packages/dds/map/src/test/directory.spec.ts Outdated Show resolved Hide resolved
packages/dds/map/src/directory.ts Show resolved Hide resolved
jatgarg and others added 9 commits August 24, 2022 12:41
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
Co-authored-by: Matt Rakow <ChumpChief@users.noreply.github.com>
@jatgarg
Copy link
Contributor Author

jatgarg commented Aug 26, 2022

@anthony-murphy Take a look when you get time.

@jatgarg
Copy link
Contributor Author

jatgarg commented Sep 12, 2022

@vladsud Take a look at this when you get time.

@@ -693,17 +698,34 @@ export class SharedDirectory extends SharedObject<ISharedDirectoryEvents> implem
return this.localValueMaker.fromSerializable(serializable);
}

private getParentDirectory(relativePath: string): IDirectory | undefined {
const absolutePath = this.makeAbsolute(relativePath);
const subdirs = absolutePath.substr(1).split(posix.sep);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do directory paths always start with /? Is this enforced somewhere?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that string you get at the end does not have starting "/". So feels a bit dangerous to assume so, as other code may pass string without leading "/" (similar to how this code does), and I assume getWorkingDirectory() and other APIs are not enforcing that paths should start with slash

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, your code could be much more efficient if you use lastIndexOf() and substr(), without ever splitting into chunks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makeAbsolute handles that and make sure we start with "/". This is same logic as used in getWorkingDirectory. So when we call getWorkingDirectory, we don't need to make sure that it starts with "/".
Changed the code to use lastIndexOf.

const parentSubDir = this.getParentDirectory(relativePath) as SubDirectory | undefined;
const index = relativePath.lastIndexOf(posix.sep);
const dirName = relativePath.substring(index + 1);
return !!(parentSubDir?.isSubDirectoryDeletePending(dirName));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this recursive? I.e. if any parent is pending delete, will it return true? If not, should it be? As any delete in the chain should be handled the same, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not recursive. This just checks in the immediate parent directory as we don't need or reach here if any parent has been deleted because getWorkingDirectory before where it is called would return empty result. However, in cases where this directory has again been created after deletion, then we will be able to reach here. So, I would say,
a.) It is maybe unlikely that if any ancestor is deleted and then entire structure will be created again from deleted parent to this child.
b.) I think if we need to check entire path from root -> this child, whether node is deleted or not then at that point in time the cons would outweigh the benefit of not applying the operation. So, I think just checking this for immediate parent should be ok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if I got you right, you are saying that pending delete for any parent in the chain would result in this function returning false, right? Can you please add a comment on that in the header of the function?
I do not think #a is unlikely - we should handle it properly.
I'm not sure what you mean by cons would outweigh benefits. It all starts with clearly describing behaviors for users. If I delete /A, I expect all objects under it to be gone and any concurrent operations from other machines underneath /A to be dropped on the floor.

const context = local ? undefined : this.makeLocal(op.key, op.path, op.value);
subdir.processSetMessage(op, context, local, localOpMetadata);
subdir.processSetMessage(msg, op, context, local, localOpMetadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels to me (on the first look) that all operations should be handled exactly the same in when pending delete is present for any parent directory (recursively). Not sure why some operations (above) have that special handling, and some (from here down) do not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just the type of op. The above 3 ops are set/delete/clear on keys while these 2 are create/delete subdirectory and we do check these things for these 2 ops in this function: needProcessSubDirectoryOperation() in detail and take action appropriately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think that we should handle all these operations uniformly. I.e. if any parent above has pending delete, then all the ops (on keys and sub-directories) should be ignored

* Subdirectories that have been deleted locally but not yet ack'd from the server. This maintains the count
* of delete op that are pending or yet to be acked from server.
*/
private readonly pendingDeleteSubDirectoriesCount: Map<string, number> = new Map();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine, but can you please ensure all of this logic and new state tracking works Ok when we serialize container with pending deletes and later on rehydrate it? I think we need to have UTs that validate things work. I guess there is no special handling required, as we replay changes (i.e. essentially generate appropriate changes again), but better be tested :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In detached container or when we serialize it then there are no ops in that state. So, there will be no pending deletes, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serialized attached container (Container.closeAndGetPendingLocalState). There will be local non-acked ops serialized and rehydrated.

@jatgarg jatgarg requested a review from vladsud October 3, 2022 07:45
@jatgarg
Copy link
Contributor Author

jatgarg commented Oct 6, 2022

@vladsud Take another look when you get time.

}
} else if (op.type === "deleteSubDirectory") {
// If this is remote delete op and we have keys in this subDirectory, then we need to delete these
// keys except the pending ones as they will be sequenced after this delete.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment does not match my mental model. I'd think that any operations for old instance of this sub-directory should be dropped on the floor, even if it gets sequences later

@anthony-murphy
Copy link
Contributor

anthony-murphy commented Oct 14, 2022

I'm not comfortable approving this review without the addition of non-deterministic stress tests. These have proven to be essential at finding complex interactions between clients where eventual consistency doesn't hold. we should check in test like this independently. ideally, they would be failing so skipped, as we know there are problems, and a then be enabled in this review showing they now succeed.

This is probably not a small work item, and i'm happy to help write them.

@Abe27342 can you point to any documentation that exist for building new tests based on your framework?

@Abe27342
Copy link
Contributor

I'm not comfortable approving this review without the addition of non-deterministic stress tests. These have proven to be essential at finding complex interactions between clients where eventual consistency doesn't hold. we should check in test like this independently. ideally, they would be failing so skipped, as we know there are problems, and a then be enabled in this review showing they now succeed.

This is probably not a small work item, and i'm happy to help write them.

@Abe27342 can you point to any documentation that exist for building new tests based on your framework?

Sure, the package readme here is a pretty good starting point. There are some existing examples in @fluid-experimental/tree and @fluidframework/sequence (SharedTreeFuzz.ts and intervalCollection.fuzz.spec.ts are the file names IIRC). Both of those use stochastic-test-utils with a full set of fluid mocks, but if SharedDirectory supports reasonable collaborative scenarios without such mocks (e.g. like merge-tree) you could get away with less. Happy to answer any other questions you have.

@jatgarg
Copy link
Contributor Author

jatgarg commented Oct 14, 2022

I'm not comfortable approving this review without the addition of non-deterministic stress tests. These have proven to be essential at finding complex interactions between clients where eventual consistency doesn't hold. we should check in test like this independently. ideally, they would be failing so skipped, as we know there are problems, and a then be enabled in this review showing they now succeed.
This is probably not a small work item, and i'm happy to help write them.
@Abe27342

        Abram Sanderson (HE/HIM)
        FTE can you point to any documentation that exist for building new tests based on your framework?

Sure, the package readme here is a pretty good starting point. There are some existing examples in @fluid-experimental/tree and @fluidframework/sequence (SharedTreeFuzz.ts and intervalCollection.fuzz.spec.ts are the file names IIRC). Both of those use stochastic-test-utils with a full set of fluid mocks, but if SharedDirectory supports reasonable collaborative scenarios without such mocks (e.g. like merge-tree) you could get away with less. Happy to answer any other questions you have.

Thanks, I will take a look into these and try to build these first and then merge/fix this PR.

@msftbot
Copy link
Contributor

msftbot bot commented Dec 14, 2022

This PR has been automatically marked as stale because it has had no activity for 60 days. It will be closed if no further activity occurs within 8 days of this comment. Thank you for your contributions to Fluid Framework!

@msftbot msftbot bot closed this Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: dds Issues related to distributed data structures base: main PRs targeted against main branch status: stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants