
server tree and collective filesize reduce #427

Closed
adammoody wants to merge 27 commits into dev from margotree

Conversation

@adammoody
Collaborator

@adammoody adammoody commented Jan 8, 2020

Using Margo RPCs between servers, this PR implements a tree-based reduction to compute the file size across servers.

One challenge here is that these operations are on-demand, non-blocking collectives. When a file size request comes in to a server, it allocates a temporary structure to track the state of the collective. It assigns a tag, unique to that local server, which the server will use to look up its state structure on later RPCs. It forwards the tag to any children, and the children include that tag in their reply.
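The tag-and-state bookkeeping above might look roughly like the following sketch. This is illustrative only: the structure and function names (`filesize_coll_state_t`, `coll_state_alloc`) are hypothetical stand-ins, not the actual UnifyFS types.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-request collective state, allocated when a file size
 * request arrives and freed once all children have replied. */
typedef struct filesize_coll_state {
    int32_t  tag;            /* server-local id; later RPCs carry it back so
                                this state can be looked up again */
    int      parent_rank;    /* who to reply to once children have answered */
    int      num_children;   /* child replies still outstanding */
    uint64_t max_offset;     /* running max of responses seen so far */
} filesize_coll_state_t;

/* Tags only need to be unique within one server, so a local counter
 * suffices; no global coordination is required. */
static int32_t next_coll_tag = 1;

static filesize_coll_state_t* coll_state_alloc(int parent_rank, int num_children)
{
    filesize_coll_state_t* st = calloc(1, sizeof(*st));
    st->tag          = next_coll_tag++;
    st->parent_rank  = parent_rank;
    st->num_children = num_children;
    st->max_offset   = 0;
    return st;
}
```

In practice the server would keep these states in a tag-indexed map so an incoming child reply can find its collective in O(1) or O(log n).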

To implement a file size lookup, the server invokes a file size request RPC on each of its children in the tree. Each child in turn forwards that request down the tree. At the leaves, the servers look up their local max write offset for the given file and respond to their parent. On each file size response, a server takes the max of the incoming value and its current max and saves that on the state structure. Once it has received replies from all children, it forwards a reply to its own parent and releases the state structure.
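The combine step on each response can be sketched as a small fold: take the max, count down outstanding replies, and signal when it is time to forward up the tree. Names here (`filesize_reduce_t`, `filesize_on_child_reply`) are hypothetical, not the PR's actual symbols.

```c
#include <stdint.h>

/* Illustrative state for one in-flight file-size reduction. */
typedef struct {
    uint64_t max_offset;     /* running max over this server and its children */
    int      replies_left;   /* child responses still outstanding */
} filesize_reduce_t;

/* Fold one child's reply into the running max. Returns 1 when every child
 * has answered, i.e. when the combined max should be sent to the parent
 * and the state structure released. */
static int filesize_on_child_reply(filesize_reduce_t* st, uint64_t child_max)
{
    if (child_max > st->max_offset) {
        st->max_offset = child_max;
    }
    st->replies_left--;
    return st->replies_left == 0;
}
```

Because max is associative and commutative, each server can fold replies in whatever order they arrive, which is what makes the non-blocking tree reduction safe.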

This type of tree-based bcast/reduce can be used for other operations like truncate and unlink. If we at some point opt to replicate the metadata info for each file on every server, this could be used for operations that act on the stat info too like chmod, chgrp. Large read operations could be done with tree-based scatter/bcast of extents, followed by tree-based gathers to collect the data replies.

The second commit in this PR adds server-based segment trees to track file extents on each server. These segment trees are updated when new file extents are fsync'd from the client. The file size reduction is modified to use these segment trees to get the max extent offset on each node as input to the reduction. Additionally, the segment trees define operations for file truncate and unlink to be used in a future commit.
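The per-server input to that reduction is just the max end offset over the extents the server holds for the file. A real segment tree answers this in O(log n); the linear scan below shows the same computation with hypothetical names (`extent_t`, `local_max_offset`), purely as a sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* A file extent as tracked per server (illustrative layout). */
typedef struct {
    uint64_t start;    /* byte offset where the extent begins */
    uint64_t length;   /* extent length in bytes */
} extent_t;

/* Max end offset over all local extents for one file; this value is what
 * the server would feed into the tree-based file size reduction. */
static uint64_t local_max_offset(const extent_t* exts, size_t n)
{
    uint64_t max_end = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t end = exts[i].start + exts[i].length;
        if (end > max_end) {
            max_end = end;
        }
    }
    return max_end;
}
```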

The last commit uses the new server-based segment tree to service client read requests. For each read request from a client, the server processes as much as it can from its local segment tree. If any request has leftover segments after this local processing, the server falls back to querying the global key/value store and requests data from other servers. This eliminates remote key/value lookups when a client is reading data local to the node, which should speed up those reads. It also flattens writes across fsync operations from a single process, as well as from processes on the same node. This optimization is not valid when a process on another node writes to the same byte, since the local server would then return its stale local value. It can be used with truncate, though, so long as the extents are truncated appropriately.
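The "leftover segments" step amounts to interval subtraction: take the requested byte range, remove whatever the local extents cover, and send only the gaps to the global key/value store. A minimal sketch, assuming sorted non-overlapping local extents (the types and `find_uncovered` are invented for illustration, not the UnifyFS API):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t start; uint64_t end; } range_t;  /* [start, end) */

/* Compute the portions of a read request NOT covered by local extents.
 * 'local' must be sorted by start and non-overlapping. Gap ranges are
 * written to 'gaps' (caller-sized); returns how many were produced. */
static size_t find_uncovered(range_t req, const range_t* local, size_t n,
                             range_t* gaps)
{
    size_t ngaps = 0;
    uint64_t pos = req.start;          /* cursor over the request */
    for (size_t i = 0; i < n && pos < req.end; i++) {
        if (local[i].end <= pos) {
            continue;                  /* extent entirely before cursor */
        }
        if (local[i].start >= req.end) {
            break;                     /* extent entirely after request */
        }
        if (local[i].start > pos) {    /* uncovered gap before this extent */
            gaps[ngaps++] = (range_t){ pos, local[i].start };
        }
        pos = local[i].end;
    }
    if (pos < req.end) {               /* uncovered tail of the request */
        gaps[ngaps++] = (range_t){ pos, req.end };
    }
    return ngaps;
}
```

If `find_uncovered` returns zero, the read is satisfied entirely from the local tree and no remote lookup is needed.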

Description

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the UnifyFS code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted.

@adammoody adammoody added the WIP label Jan 8, 2020
@adammoody adammoody changed the title from "tree and filesize reduce" to "server tree and collective filesize reduce" Jan 8, 2020

static void filesize_response_forward(unifyfs_state_filesize_t* st)
{
printf("%d: BUCKEYES response_forward\n", glb_pmi_rank); fflush(stdout);
Collaborator


I think we need to find a way to get a "BUCKEYES" printf in the final product ;)

Collaborator Author


Yeah, keen eye :-). Needed some way to find my new debug messages, and for some reason that name stands out to me. O-H!

@adammoody adammoody force-pushed the margotree branch 6 times, most recently from 9755176 to 5ed78f0 Compare January 13, 2020 19:08
@adammoody
Collaborator Author

@CamStan , I'll check this again. I see my other PRs are running the tests just fine, so it must be something I messed up.

@CamStan
Member

CamStan commented Jan 13, 2020

I cleared Travis' cache and ran this again. In watching it, it appears to hang after it gets to that last test that prints successfully, then eventually fails out. I believe the ERROR: 0100-sysio-gotcha.t - missing test plan line is getting printed because it's failing before reaching the done_testing(); step at the end of t/sys/sysio_suite.c, which is where the number of tests in the test plan is determined in our case.

I'm not seeing anything obvious, but I keep running into these situations where the unit tests hang in various places. Still haven't been able to figure out why, but there could be a bigger underlying issue that keeps getting bumped.

@adammoody
Collaborator Author

@CamStan , I did find a bug that I think was leading to a segfault. When I cleaned that up, the tests started passing. So it was my fault after all. Sorry for the false alarm, but thanks for helping me figure it out.

@tonyhutter
Collaborator

Is there any way you can extend seg_tree to get what you want, rather than copy-and-pasting that code into separate extent_tree/gfid2ext_tree/int2void implementations? Maybe add a void* node.data field to hold custom data, and allow the user to specify a custom comparator function?
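The generalization being suggested might look something like this: one node type carrying opaque caller data, with ordering supplied by a comparator callback. This is a sketch of the idea only; the names (`gnode_t`, `gtree_insert`) are hypothetical, and a plain unbalanced BST insert stands in for the balanced tree the real seg_tree uses.

```c
#include <stdlib.h>

/* User-supplied comparator: <0, 0, >0 like strcmp. */
typedef int (*tree_cmp_fn)(const void* a, const void* b);

/* One generic node: the payload is opaque, so extents, inodes, or
 * int-to-pointer mappings could all share this tree code. */
typedef struct gnode {
    void*         data;   /* caller-owned payload */
    struct gnode* left;
    struct gnode* right;
} gnode_t;

/* Comparator-driven insert (unbalanced, for illustration). */
static gnode_t* gtree_insert(gnode_t* root, void* data, tree_cmp_fn cmp)
{
    if (root == NULL) {
        gnode_t* n = calloc(1, sizeof(*n));
        n->data = data;
        return n;
    }
    if (cmp(data, root->data) < 0) {
        root->left = gtree_insert(root->left, data, cmp);
    } else {
        root->right = gtree_insert(root->right, data, cmp);
    }
    return root;
}

/* Example comparator ordering int payloads. */
static int cmp_int(const void* a, const void* b)
{
    int x = *(const int*)a;
    int y = *(const int*)b;
    return (x > y) - (x < y);
}
```

With the comparator and payload factored out this way, extent_tree, gfid2ext_tree, and int2void would each reduce to a comparator plus a payload struct.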

- merging the collective status into unifyfs_coll_state_t
}

/* Remove and free all nodes in the unifyfs_inode_tree. */
void unifyfs_inode_tree_destroy(
Collaborator Author


Are we also freeing the tree structure itself here, and if so, should we destroy the rwlock?

Contributor


Yes, you are right that we need to destroy the lock. But the structure is allocated by the caller for now.

@tonyhutter
Collaborator

Reiterating my earlier comment, are there any plans to extend seg_tree to get rid of the separate, but very similar tree implementations of extent_tree, int2void, unifyfs_inode_tree?

@adammoody
Collaborator Author

Reiterating my earlier comment, are there any plans to extend seg_tree to get rid of the separate, but very similar tree implementations of extent_tree, int2void, unifyfs_inode_tree?

That makes sense, but we're saving that step for later. It may be that the requirements of these different data structures diverge as more functionality is added. We've kept them separate to allow for that. We can revisit this idea down the road.

@adammoody
Collaborator Author

adammoody commented Feb 27, 2020

@sandrain and @boehms , here are the last bits we need to add so that we can replace our use of MDHIM for storing file attributes. Would you mind adding these in?

Considering how to code up a replacement for the extra logic we have in unifyfs_set_file_attribute(), it basically comes down to selectively updating the size and is_laminated fields in the metadata. I think we have the following cases:

client-to-server metaset_rpc with create==1:

  • entry must not already exist in the tree (error if it exists)
  • set all metadata fields with values from client, including size and is_laminated

client-to-server metaset_rpc with create==0:

  • entry must exist in the tree (error if it does not exist)
  • update all metadata fields with values from the client, except ensure that the existing size and is_laminated values are not changed (mask out those settings from the client)

truncate:

  • entry must exist in the tree (error if it does not exist)
  • update the size field in metadata, leave all other fields untouched

laminate:

  • entry must exist in the tree (error if it does not exist)
  • update the size and is_laminated fields in metadata, leave all other fields untouched

Once we have that, we should be able to replace all calls to unifyfs_set_file_attribute() and unifyfs_get_file_attribute().
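The four cases above boil down to one selective-update routine that masks fields depending on the operation. A minimal sketch, assuming a reduced attribute struct; every name here (`file_attr_t`, `apply_attr_update`, the `ATTR_OP_*` constants) is hypothetical, and the existence/error checks per case are omitted.

```c
#include <stdint.h>

/* Illustrative subset of the per-file attributes discussed above;
 * 'mode' stands in for "all other fields". */
typedef struct {
    uint64_t size;
    int      is_laminated;
    uint32_t mode;
} file_attr_t;

enum {
    ATTR_OP_CREATE,    /* metaset_rpc with create==1 */
    ATTR_OP_METASET,   /* metaset_rpc with create==0 */
    ATTR_OP_TRUNCATE,
    ATTR_OP_LAMINATE
};

static void apply_attr_update(file_attr_t* cur, const file_attr_t* in, int op)
{
    switch (op) {
    case ATTR_OP_CREATE:
        /* take every field from the client, including size/is_laminated */
        *cur = *in;
        break;
    case ATTR_OP_METASET: {
        /* take client fields but preserve existing size and is_laminated */
        uint64_t size = cur->size;
        int lam       = cur->is_laminated;
        *cur = *in;
        cur->size         = size;
        cur->is_laminated = lam;
        break;
    }
    case ATTR_OP_TRUNCATE:
        /* only the size changes */
        cur->size = in->size;
        break;
    case ATTR_OP_LAMINATE:
        /* size and is_laminated change; everything else untouched */
        cur->size         = in->size;
        cur->is_laminated = in->is_laminated;
        break;
    }
}
```

Centralizing the masking in one routine like this would keep the create/metaset/truncate/laminate rules from drifting apart across call sites.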

@tonyhutter
Collaborator

That makes sense, but we're saving that step for later. It may be that the requirements of these different data structures diverge as more functionality is added. We've kept them separate to allow for that. We can revisit this idea down the road.

Please include this step in this PR though; it's not worth racking up the Technical Debt, especially since this PR isn't time sensitive. The seg_tree code already has test cases, so we know it works. It should be extended in this PR to be a more general tree library that would accommodate these new use cases. It should be doable if you add a void * seg_tree.data pointer for the custom data, and allow the user to specify a custom comparator function.

@MichaelBrim
Collaborator

Closing as margotree work has been merged.

@MichaelBrim MichaelBrim deleted the margotree branch July 12, 2022 13:30