[WIP] server-side file metadata operations with rpc (mdhim alternative)#485

Closed
sandrain wants to merge 34 commits into llnl:dev from sandrain:server-collective

Conversation

@sandrain
Contributor

This is a PR draft for removing the mdhim dependency from our metadata management. The initial PRs were #427 and #466. This one is based particularly on #427, which has diverged quite a lot from the dev branch. Some notable changes are:

Group RPC operations for synchronizing file metadata across servers

#427 elaborates on this. The current group operations include:

  • filesize: collect file size from all servers (reduce)
  • metaset (file create): create a file in all servers
  • unlink: unlink a file from all servers
  • truncate: apply the new file size in all servers
  • broadcasting the extent tree: broadcast the local extent tree to all other servers

Command line option to specify the operation mode: mdhim or rpc

The current mdhim-based operations are all preserved for now. The runtime argument -z will specify whether UnifyFS server uses mdhim (-z mdhim) or the rpc alternative (-z rpc).

Inode abstraction in the server

When the server runs with the rpc mode (-z rpc), the new data structures (inode and inode tree) are initialized for managing file attributes (including the extent trees) for each file.

TODO

  • file read operations (read/mread) do not yet work in rpc mode.

More TODO

  • fsync: individual file sync (fsync(2))
  • laminate: possibly make the entire inode structure (including extent trees) immutable
  • directory operations

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

@tonyhutter
Collaborator

I've commented on this before in a separate PR here and here and here, but all these data structures...

extent_tree
int2void
unifyfs_inode_tree

...are basically 90% copy-n-pasted seg_tree.c code. For example:

localhost ~/server_collective $ fgrep -RIn 'This is meant to be called in a loop' 
server/src/unifyfs_inode_tree.c:149: * This is meant to be called in a loop, like:
server/src/unifyfs_inode_tree.h:85: * This is meant to be called in a loop, like:
common/src/int2void.c:194: * This is meant to be called in a loop, like:
common/src/seg_tree.h:64: * This is meant to be called in a loop, like:
common/src/extent_tree.c:431: * This is meant to be called in a loop, like:
common/src/extent_tree.h:72: * This is meant to be called in a loop, like:
common/src/seg_tree.c:399: * This is meant to be called in a loop, like:

It sounds like we need a generic tree library that abstracts away the ugliness of the low-level BSD tree.h that we have. Using seg_tree.c as a template, you can write a generic tree library that takes a void *data with a custom comparator function. You'd then cast the void *data to your custom data structure (like extent_tree_node, unifyfs_inode, int, etc.) so it works for whatever you want to put in the tree. We should even rewrite seg_tree to use the generic tree implementation. The generic tree would look something like this:

int unifyfs_tree_init(struct unifyfs_tree* tree, int (*compare_func)(void* data1, void* data2));
int unifyfs_tree_add(struct unifyfs_tree* tree, void *data);
struct unifyfs_tree_node* unifyfs_tree_find(struct unifyfs_tree* tree, void *data);
void unifyfs_tree_remove(struct unifyfs_tree* tree, void *data);
...

This will reduce the amount of code substantially. We would also be able to verify the generic tree library works simply by running the existing seg_tree tests (assuming we adapted seg_tree to use the generic tree underneath). That should exercise all the tree's codepaths.

@sandrain sandrain force-pushed the server-collective branch from af75dbe to 78e550e on April 21, 2020 12:06
@sandrain
Contributor Author

@tonyhutter Thanks for the feedback, and I agree that we need some refactoring there. I will address this after some other parts are done.


/* set return value
* TODO: check if we have an error and handle it */
ret = out.ret;
@adammoody
Collaborator


On these return values from the children, I think we need to compute an OR operation or something. This code will use the return value from the last child, but I think we want to report an error if any child reports an error.

@sandrain
Contributor Author


Thanks @adammoody. Somehow I wasn't able to check your reviews last week. I will fix this.

@adammoody
Collaborator

We could drop the int2void if that's no longer needed. I think it was just used to lookup a collective state structure given an integer tag, but that's not necessary with the improved collectives using non-blocking margo calls.

@adammoody
Collaborator

Do we still have the server-side local extent optimization? That was on the old branch in this commit:

99b0c04

@adammoody
Collaborator

> We could drop the int2void if that's no longer needed. I think it was just used to lookup a collective state structure given an integer tag, but that's not necessary with the improved collectives using non-blocking margo calls.

Oh, now I see this was dropped in a later commit. Nevermind.

@adammoody
Collaborator

@tonyhutter , refactoring is a good idea. It will be more involved than just those four functions, though.

In the segment/extent trees, adding a node is more like a merge operation where the node being inserted can merge with the node just before and just after if they line up just right. Whereas in the inode tree, adding a new node is just a simple insert operation. So we likely need to define an "add" function pointer in addition to the "compare" function.

Another wrinkle is that the client and server structures may require different locking. The seg_tree for the client already includes pthread locks. The server may need to deal with both pthread and margo locking, since it uses both pthreads and margo threads internally. It's not yet clear whether we need both kinds of locks or whether just one will suffice (and if so, which one). Anyway, I think that means we'll need some sort of "lock" and "unlock" function pointers.

Then there are a few big pieces missing on the server side that this PR is trying to address as it replaces MDHIM. We still need support for looking up extent info in the read path, and for distributed extent data and queries in addition to the broadcast extents that we have now. Once all of that is done, we'll have a big mess of potential deadlocks and race conditions to think through, which will drive us to figure out the pthread/margo locking requirements.

It's not clear how much the server-side data structures might diverge by the time we're done or whether we'll even need them at all. In fact, you can see that int2void was just tossed out completely since it's no longer used anywhere after the collective rewrite, so one of those three copy-and-paste structures is already gone. I think we'll have a better picture of how different/similar these structures are after the main work in the PR is done, then it should be obvious how best to refactor and clean up.

@sandrain sandrain force-pushed the server-collective branch from 2acfb37 to bb69277 on April 23, 2020 19:08
@adammoody
Collaborator

@sandrain , how hard would it be to reapply the sequence of commits from the original PR that you used to get to this PR? I started another temporary branch which is the original PR that is rebased on the current dev branch. I need to test that to check that it still works. I've only checked that it compiles so far. Anyway, I was hoping we could layer the commits you and @boehms made on the original PR if possible.

@sandrain
Contributor Author

@adammoody I see the margotree branch is synced with dev. I will apply updates here into the original margotree branch.

sandrain and others added 6 commits May 4, 2020 14:22
completely. Instead of using a distributed kv store (mdhim), each server daemon
maintains information of all files in memory. The file information includes
file attributes and extent information (extent tree). When the file information
should be shared, we use collective operations (broadcast/reduce).

- new data structures in server: unifyfs_inode_tree, unifyfs_inode
- new collective operations: unifyfs_collectives.h unifyfs_collectives.c
- connecting the previous operations with mdhim
- re-create fsync rpc
- connecting the collective operations (except read/mread)
- also, discarding int2void tree that is not used anymore
@sandrain sandrain force-pushed the server-collective branch from bb69277 to 095b94b on May 5, 2020 16:37
@sandrain
Contributor Author

sandrain commented May 7, 2020

I am closing this PR since these changes are now in margotreenew.

@sandrain sandrain closed this May 7, 2020
@sandrain sandrain deleted the server-collective branch June 10, 2020 20:14