
server tree and collective filesize reduce #427

Closed
adammoody wants to merge 27 commits into dev from margotree

Conversation

@adammoody
Collaborator

@adammoody adammoody commented Jan 8, 2020

Using Margo RPCs between servers, this PR implements a tree-based reduction to compute the file size across servers.

One challenge here is that these operations are on-demand, non-blocking collectives. When a file size request comes in to a server, it allocates a temporary structure to track the state of the collective. It assigns a tag, unique to that local server, which the server will use to look up its state structure on later RPCs. It forwards the tag to any children, and the children include that tag in their reply.
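The tag-and-state bookkeeping above might look roughly like the following sketch. This is illustrative only: the structure and function names (`filesize_coll_state_t`, `coll_state_alloc`) are hypothetical stand-ins, not the actual UnifyFS types.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-request collective state, allocated when a file size
 * request arrives and freed once all children have replied. */
typedef struct filesize_coll_state {
    int32_t  tag;            /* server-local id; later RPCs carry it back so
                                this state can be looked up again */
    int      parent_rank;    /* who to reply to once children have answered */
    int      num_children;   /* child replies still outstanding */
    uint64_t max_offset;     /* running max of responses seen so far */
} filesize_coll_state_t;

/* Tags only need to be unique within one server, so a local counter
 * suffices; no global coordination is required. */
static int32_t next_coll_tag = 1;

static filesize_coll_state_t* coll_state_alloc(int parent_rank, int num_children)
{
    filesize_coll_state_t* st = calloc(1, sizeof(*st));
    st->tag          = next_coll_tag++;
    st->parent_rank  = parent_rank;
    st->num_children = num_children;
    st->max_offset   = 0;
    return st;
}
```

In practice the server would keep these states in a tag-indexed map so an incoming child reply can find its collective in O(1) or O(log n).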

To implement a file size lookup, the server invokes a file size request RPC on each of its children in the tree. Each child in turn forwards that request down the tree. At the leaves, the servers look up their local max write offset for the given file and respond to their parent. On each file size response, a server takes the max of the incoming value and its current max and saves that on the state structure. Once it has received replies from all children, it forwards a reply to its own parent and releases the state structure.
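The combine step on each response can be sketched as a small fold: take the max, count down outstanding replies, and signal when it is time to forward up the tree. Names here (`filesize_reduce_t`, `filesize_on_child_reply`) are hypothetical, not the PR's actual symbols.

```c
#include <stdint.h>

/* Illustrative state for one in-flight file-size reduction. */
typedef struct {
    uint64_t max_offset;     /* running max over this server and its children */
    int      replies_left;   /* child responses still outstanding */
} filesize_reduce_t;

/* Fold one child's reply into the running max. Returns 1 when every child
 * has answered, i.e. when the combined max should be sent to the parent
 * and the state structure released. */
static int filesize_on_child_reply(filesize_reduce_t* st, uint64_t child_max)
{
    if (child_max > st->max_offset) {
        st->max_offset = child_max;
    }
    st->replies_left--;
    return st->replies_left == 0;
}
```

Because max is associative and commutative, each server can fold replies in whatever order they arrive, which is what makes the non-blocking tree reduction safe.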

This type of tree-based bcast/reduce can be used for other operations like truncate and unlink. If we at some point opt to replicate the metadata info for each file on every server, this could be used for operations that act on the stat info too like chmod, chgrp. Large read operations could be done with tree-based scatter/bcast of extents, followed by tree-based gathers to collect the data replies.

The second commit in this PR adds server-based segment trees to track file extents on each server. These segment trees are updated when new file extents are fsync'd from the client. The file size reduction is modified to use these segment trees to get the max extent offset on each node as input to the reduction. Additionally, the segment trees define operations for file truncate and unlink to be used in a future commit.
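The per-server input to that reduction is just the max end offset over the extents the server holds for the file. A real segment tree answers this in O(log n); the linear scan below shows the same computation with hypothetical names (`extent_t`, `local_max_offset`), purely as a sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* A file extent as tracked per server (illustrative layout). */
typedef struct {
    uint64_t start;    /* byte offset where the extent begins */
    uint64_t length;   /* extent length in bytes */
} extent_t;

/* Max end offset over all local extents for one file; this value is what
 * the server would feed into the tree-based file size reduction. */
static uint64_t local_max_offset(const extent_t* exts, size_t n)
{
    uint64_t max_end = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t end = exts[i].start + exts[i].length;
        if (end > max_end) {
            max_end = end;
        }
    }
    return max_end;
}
```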

The last commit uses the new server-based segment tree to service client read requests. For each read request from a client, the server processes as much as it can from its local segment tree. If any request has leftover segments after this local processing, the server falls back to querying the global key/value store and requests data from other servers. This eliminates remote key/value lookups when a client is reading data local to the node, which should speed up those reads. It also flattens writes across fsync operations from a single process, as well as from processes on the same node. This optimization is not valid when a process on another node writes to the same byte, since the local server would then return its stale local value. It can be used with truncate, though, so long as the extents are truncated appropriately.
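The "leftover segments" step amounts to interval subtraction: take the requested byte range, remove whatever the local extents cover, and send only the gaps to the global key/value store. A minimal sketch, assuming sorted non-overlapping local extents (the types and `find_uncovered` are invented for illustration, not the UnifyFS API):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start; uint64_t end; } range_t;  /* [start, end) */

/* Compute the portions of a read request NOT covered by local extents.
 * 'local' must be sorted by start and non-overlapping. Gap ranges are
 * written to 'gaps' (caller-sized); returns how many were produced. */
static size_t find_uncovered(range_t req, const range_t* local, size_t n,
                             range_t* gaps)
{
    size_t ngaps = 0;
    uint64_t pos = req.start;          /* cursor over the request */
    for (size_t i = 0; i < n && pos < req.end; i++) {
        if (local[i].end <= pos) {
            continue;                  /* extent entirely before cursor */
        }
        if (local[i].start >= req.end) {
            break;                     /* extent entirely after request */
        }
        if (local[i].start > pos) {    /* uncovered gap before this extent */
            gaps[ngaps++] = (range_t){ pos, local[i].start };
        }
        pos = local[i].end;
    }
    if (pos < req.end) {               /* uncovered tail of the request */
        gaps[ngaps++] = (range_t){ pos, req.end };
    }
    return ngaps;
}
```

If `find_uncovered` returns zero, the read is satisfied entirely from the local tree and no remote lookup is needed.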

Description

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

@adammoody adammoody added the WIP label Jan 8, 2020
@adammoody adammoody changed the title from "tree and filesize reduce" to "server tree and collective filesize reduce" Jan 8, 2020

static void filesize_response_forward(unifyfs_state_filesize_t* st)
{
printf("%d: BUCKEYES response_forward\n", glb_pmi_rank); fflush(stdout);
Collaborator


I think we need to find a way to get a "BUCKEYES" printf in the final product ;)

Collaborator Author


Yeah, keen eye :-). Needed some way to find my new debug messages, and for some reason that name stands out to me. O-H!

@adammoody adammoody force-pushed the margotree branch 6 times, most recently from 9755176 to 5ed78f0 Compare January 13, 2020 19:08
@adammoody
Collaborator Author

@CamStan , I'll check this again. I see my other PRs are running the tests just fine, so it must be something I messed up.

@CamStan
Member

CamStan commented Jan 13, 2020

I cleared Travis' cache and ran this again. In watching it, it appears to hang after it gets to that last test that prints successfully, then eventually fails out. I believe the ERROR: 0100-sysio-gotcha.t - missing test plan line is getting printed because it's failing before reaching the done_testing(); step at the end of t/sys/sysio_suite.c, which is where the number of tests in the test plan is determined in our case.

I'm not seeing anything obvious, but I keep running into these situations where the unit tests hang in various places. Still haven't been able to figure out why, but there could be a bigger underlying issue that keeps getting bumped.

@adammoody
Collaborator Author

@CamStan , I did find a bug that I think was leading to a segfault. When I cleaned that up, the tests started passing. So it was my fault after all. Sorry for the false alarm, but thanks for helping me figure it out.

@tonyhutter
Collaborator

Is there any way you can extend seg_tree to get what you want, rather than copy-and-pasting that code into separate extent_tree/gfid2ext_tree/int2void implementations? Maybe add a void* node.data field to hold custom data, and allow the user to specify a custom comparator function?
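The generalization being suggested might look something like this: one node type carrying opaque caller data, with ordering supplied by a comparator callback. This is a sketch of the idea only; the names (`gnode_t`, `gtree_insert`) are hypothetical, and a plain unbalanced BST insert stands in for the balanced tree the real seg_tree uses.

```c
#include <stdlib.h>

/* User-supplied comparator: <0, 0, >0 like strcmp. */
typedef int (*tree_cmp_fn)(const void* a, const void* b);

/* One generic node: the payload is opaque, so extents, inodes, or
 * int-to-pointer mappings could all share this tree code. */
typedef struct gnode {
    void*         data;   /* caller-owned payload */
    struct gnode* left;
    struct gnode* right;
} gnode_t;

/* Comparator-driven insert (unbalanced, for illustration). */
static gnode_t* gtree_insert(gnode_t* root, void* data, tree_cmp_fn cmp)
{
    if (root == NULL) {
        gnode_t* n = calloc(1, sizeof(*n));
        n->data = data;
        return n;
    }
    if (cmp(data, root->data) < 0) {
        root->left = gtree_insert(root->left, data, cmp);
    } else {
        root->right = gtree_insert(root->right, data, cmp);
    }
    return root;
}

/* Example comparator ordering int payloads. */
static int cmp_int(const void* a, const void* b)
{
    int x = *(const int*)a;
    int y = *(const int*)b;
    return (x > y) - (x < y);
}
```

With the comparator and payload factored out this way, extent_tree, gfid2ext_tree, and int2void would each reduce to a comparator plus a payload struct.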

- merging the collective status into unifyfs_coll_state_t
}

/* Remove and free all nodes in the unifyfs_inode_tree. */
void unifyfs_inode_tree_destroy(
Collaborator Author


Are we also freeing the tree structure itself here, and if so, should we destroy the rwlock?

Contributor


Yes, you are right that we need to destroy the lock. But the structure is allocated by the caller for now.

@tonyhutter
Collaborator

Reiterating my earlier comment, are there any plans to extend seg_tree to get rid of the separate, but very similar tree implementations of extent_tree, int2void, unifyfs_inode_tree?

@adammoody
Collaborator Author

Reiterating my earlier comment, are there any plans to extend seg_tree to get rid of the separate, but very similar tree implementations of extent_tree, int2void, unifyfs_inode_tree?

That makes sense, but we're saving that step for later. It may be that the requirements of these different data structures diverge as more functionality is added. We've kept them separate to allow for that. We can revisit this idea down the road.

@adammoody
Collaborator Author

adammoody commented Feb 27, 2020

@sandrain and @boehms , here are the last bits we need to add so that we can replace our use of MDHIM for storing file attributes. Would you mind adding these in?

Considering how to code up a replacement for the extra logic we have in unifyfs_set_file_attribute(), it basically comes down to selectively updating the size and is_laminated fields in the metadata. I think we have the following cases:

client-to-server metaset_rpc with create==1:

  • entry must not already exist in the tree (error if it exists)
  • set all metadata fields with values from client, including size and is_laminated

client-to-server metaset_rpc with create==0:

  • entry must exist in the tree (error if it does not exist)
  • update all metadata fields with values from the client, except ensure that the existing size and is_laminated values are not changed (mask out those settings from the client)

truncate:

  • entry must exist in the tree (error if it does not exist)
  • update the size field in metadata, leave all other fields untouched

laminate:

  • entry must exist in the tree (error if it does not exist)
  • update the size and is_laminated fields in metadata, leave all other fields untouched

Once we have that, we should be able to replace all calls to unifyfs_set_file_attribute() and unifyfs_get_file_attribute().
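The four cases above boil down to one selective-update routine that masks fields depending on the operation. A minimal sketch, assuming a reduced attribute struct; every name here (`file_attr_t`, `apply_attr_update`, the `ATTR_OP_*` constants) is hypothetical, and the existence/error checks per case are omitted.

```c
#include <stdint.h>

/* Illustrative subset of the per-file attributes discussed above;
 * 'mode' stands in for "all other fields". */
typedef struct {
    uint64_t size;
    int      is_laminated;
    uint32_t mode;
} file_attr_t;

enum {
    ATTR_OP_CREATE,    /* metaset_rpc with create==1 */
    ATTR_OP_METASET,   /* metaset_rpc with create==0 */
    ATTR_OP_TRUNCATE,
    ATTR_OP_LAMINATE
};

static void apply_attr_update(file_attr_t* cur, const file_attr_t* in, int op)
{
    switch (op) {
    case ATTR_OP_CREATE:
        /* take every field from the client, including size/is_laminated */
        *cur = *in;
        break;
    case ATTR_OP_METASET: {
        /* take client fields but preserve existing size and is_laminated */
        uint64_t size = cur->size;
        int lam       = cur->is_laminated;
        *cur = *in;
        cur->size         = size;
        cur->is_laminated = lam;
        break;
    }
    case ATTR_OP_TRUNCATE:
        /* only the size changes */
        cur->size = in->size;
        break;
    case ATTR_OP_LAMINATE:
        /* size and is_laminated change; everything else untouched */
        cur->size         = in->size;
        cur->is_laminated = in->is_laminated;
        break;
    }
}
```

Centralizing the masking in one routine like this would keep the create/metaset/truncate/laminate rules from drifting apart across call sites.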

@tonyhutter
Collaborator

That makes sense, but we're saving that step for later. It may be that the requirements of these different data structures diverge as more functionality is added. We've kept them separate to allow for that. We can revisit this idea down the road.

Please include this step in this PR though; it's not worth racking up the Technical Debt, especially since this PR isn't time sensitive. The seg_tree code already has test cases, so we know it works. It should be extended in this PR to be a more general tree library that would accommodate these new use cases. It should be doable if you add a void * seg_tree.data pointer for the custom data, and allow the user to specify a custom comparator function.

@MichaelBrim
Collaborator

Closing as margotree work has been merged.

@MichaelBrim MichaelBrim deleted the margotree branch July 12, 2022 13:30