[WIP] server-side file metadata operations with rpc (mdhim alternative)#485

Closed
sandrain wants to merge 34 commits into llnl:dev from sandrain:server-collective

Conversation

@sandrain
Contributor

This is a PR draft for removing the mdhim dependency from our metadata management. The initial PRs were #427 and #466. This one is based particularly on #427, which has diverged quite a lot from the dev branch. Some notable changes are:

Group RPC operations for synchronizing file metadata across servers

#427 elaborates on this. The current group operations include:

  • filesize: collect file size from all servers (reduce)
  • metaset (file create): create a file in all servers
  • unlink: unlink a file from all servers
  • truncate: apply the new file size in all servers
  • broadcasting the extent tree: broadcast the local extent tree to all other servers

Command line option to specify the operation mode: mdhim or rpc

The current mdhim-based operations are all preserved for now. The runtime argument -z will specify whether UnifyFS server uses mdhim (-z mdhim) or the rpc alternative (-z rpc).

Inode abstraction in the server

When the server runs with the rpc mode (-z rpc), the new data structures (inode and inode tree) are initialized for managing file attributes (including the extent trees) for each file.

TODO

  • file read operations (read/mread) do not yet work in rpc mode.

More TODO

  • fsync: individual file sync (fsync(2))
  • laminate: possibly make the entire inode structure (including extent trees) immutable
  • directory operations

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

@tonyhutter
Collaborator

I've commented on this before in a separate PR here and here and here, but all these data structures...

extent_tree
int2void
unifyfs_inode_tree

...are basically 90% copy-n-pasted seg_tree.c code. For example:

localhost ~/server_collective $ fgrep -RIn 'This is meant to be called in a loop' 
server/src/unifyfs_inode_tree.c:149: * This is meant to be called in a loop, like:
server/src/unifyfs_inode_tree.h:85: * This is meant to be called in a loop, like:
common/src/int2void.c:194: * This is meant to be called in a loop, like:
common/src/seg_tree.h:64: * This is meant to be called in a loop, like:
common/src/extent_tree.c:431: * This is meant to be called in a loop, like:
common/src/extent_tree.h:72: * This is meant to be called in a loop, like:
common/src/seg_tree.c:399: * This is meant to be called in a loop, like:

It sounds like we need a generic tree library that abstracts away the ugliness of the low-level BSD tree.h that we have. Using seg_tree.c as a template, you can write a generic tree library that takes a void *data with a custom comparator function. You'd then cast the void *data to your custom data structure (like extent_tree_node, unifyfs_inode, int, etc.) so it works for whatever you want to put in the tree. We should even rewrite seg_tree to use the generic tree implementation. The generic tree would look something like this:

int unifyfs_tree_init(struct unifyfs_tree* tree, int (*compare_func)(void* data1, void* data2));
int unifyfs_tree_add(struct unifyfs_tree* tree, void *data);
struct unifyfs_tree_node* unifyfs_tree_find(struct unifyfs_tree* tree, void *data);
void unifyfs_tree_remove(struct unifyfs_tree* tree, void *data);
...

This will reduce the amount of code substantially. We would also be able to verify the generic tree library works simply by running the existing seg_tree tests (assuming we adapted seg_tree to use the generic tree underneath). That should exercise all the tree's codepaths.

@sandrain sandrain force-pushed the server-collective branch from af75dbe to 78e550e on April 21, 2020 12:06
@sandrain
Contributor Author

@tonyhutter Thanks for the feedback, and I agree that we need some refactoring there. I will address this after some other parts are done.


/* set return value
* TODO: check if we have an error and handle it */
ret = out.ret;
@adammoody
Collaborator


On these return values from the children, I think we need to compute an OR operation or something. This code will use the return value from the last child, but I think we want to report an error if any child reports an error.

@sandrain
Contributor Author


Thanks @adammoody. Somehow I wasn't able to check your reviews last week. I will fix this.

@adammoody
Collaborator

We could drop the int2void if that's no longer needed. I think it was just used to lookup a collective state structure given an integer tag, but that's not necessary with the improved collectives using non-blocking margo calls.

@adammoody
Collaborator

Do we still have the server-side local extent optimization? That was on the old branch in this commit:

99b0c04

@adammoody
Collaborator

> We could drop the int2void if that's no longer needed. I think it was just used to lookup a collective state structure given an integer tag, but that's not necessary with the improved collectives using non-blocking margo calls.

Oh, now I see this was dropped in a later commit. Nevermind.

@adammoody
Collaborator

@tonyhutter , refactoring is a good idea. It will be more involved than just those four functions, though.

In the segment/extent trees, adding a node is more like a merge operation where the node being inserted can merge with the node just before and just after if they line up just right. Whereas in the inode tree, adding a new node is just a simple insert operation. So we likely need to define an "add" function pointer in addition to the "compare" function.

Another wrinkle is that the client and server structures may require different locking. The seg_tree for the client already includes pthread locks. The server may need to deal with both pthread and margo locking, since it uses both pthreads and margo threads internally. It's not yet clear whether we need both kinds of locks or whether just one will suffice (and if so, which one). Anyway, I think that means we'll need some sort of "lock" and "unlock" function pointers.

Then there are a few big pieces missing on the server side that this PR is trying to address as it replaces MDHIM. We still need support for looking up extent info in the read path, and for distributed extent data and queries in addition to the broadcast extents that we have now. Once all of that is done, we'll have a big mess of potential deadlocks and race conditions to think through, which will drive us to figure out the pthread/margo locking requirements.

It's not clear how much the server-side data structures might diverge by the time we're done or whether we'll even need them at all. In fact, you can see that int2void was just tossed out completely since it's no longer used anywhere after the collective rewrite, so one of those three copy-and-paste structures is already gone. I think we'll have a better picture of how different/similar these structures are after the main work in the PR is done, then it should be obvious how best to refactor and clean up.

@sandrain sandrain force-pushed the server-collective branch from 2acfb37 to bb69277 on April 23, 2020 19:08
@adammoody
Collaborator

@sandrain , how hard would it be to reapply the sequence of commits from the original PR that you used to get to this PR? I started another temporary branch which is the original PR that is rebased on the current dev branch. I need to test that to check that it still works. I've only checked that it compiles so far. Anyway, I was hoping we could layer the commits you and @boehms made on the original PR if possible.

@sandrain
Contributor Author

@adammoody I see the margotree branch is synced with dev. I will apply updates here into the original margotree branch.

sandrain and others added 6 commits May 4, 2020 14:22
completely. Instead of using a distributed kv store (mdhim), each server daemon
maintains information of all files in memory. The file information includes
file attributes and extent information (extent tree). When the file information
should be shared, we use collective operations (broadcast/reduce).

- new data structures in server: unifyfs_inode_tree, unifyfs_inode
- new collective operations: unifyfs_collectives.h unifyfs_collectives.c
- connecting the previous operations with mdhim
- re-create fsync rpc
- connecting the collective operations (except read/mread)
- also, discarding int2void tree that is not used anymore
@sandrain sandrain force-pushed the server-collective branch from bb69277 to 095b94b on May 5, 2020 16:37
@sandrain
Contributor Author

sandrain commented May 7, 2020

I am closing this PR since these changes are now in margotreenew.

@sandrain sandrain closed this May 7, 2020
@sandrain sandrain deleted the server-collective branch June 10, 2020 20:14