GlusterFS marker translator Enhancement #182
Comments
Could you please elaborate on the part about doing away with contri? Most of our recursive crawling code depends on contri. I am sorry, I really do not know, so I have one question here: are we going for DHT2? If so, the whole thing definitely needs to be altered (since a lot of updates need to happen between DS and MDS). And if we are going for DHT2, why do we need to alter so much in the current code?
@Manikandan-Selvaganesh Regarding doing away with contri: my understanding of the current design is that we are using a combination of the dirty xattr and contri in order to achieve crash consistency.
So in case of a crash (during a recursive update), an inode along the ancestry could be in one of the below states
Alternative method to achieve crash consistency.
Note that this was not possible earlier since we were marking the dirty xattr in the callback path.
The design choice we are proposing trades heal efficiency after a crash for better write performance.
In the new design, you don't have contri xattrs.
@VMallika, I agree that with few limits across the filesystem this could degenerate to a full filesystem crawl. In the event that the filesystem is flattish, say 2 levels with millions of files under each directory, Another point to note is that the current implementation achieves eventual accuracy. By marking the dirty flag in the forward call path, we would expedite such heals, as we realize the directory is dirty much sooner. I guess it's prudent to have a rough program that simulates the crawl (especially the benefit of tracking directories randomly) before stepping further. Please let me know if this addresses your question or adds even more :)
Thank you for your contributions.
Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.
Marker translator is responsible for accounting of size in GlusterFS.
Current accounting design.
Accounting happens using a set of contri and size xattrs on each inode.
Contri gives the inode's contribution towards its parent, while size gives the size of the directory.
In a quiesced state contri and size should be the same value.
When an operation happens that changes size of file/directory the update is first made to size xattr and subsequently to contri xattr.
Note: For a file the size is obtained from iatt, hence it is atomic with actual write syscall.
During an update we mark the inode dirty, then update the size of the parent inode and the contri of the current inode towards the parent, and finally clear the dirty flag.
So, in case of a crash (before the updates are propagated), we may end up in one of 2 situations:
a) the inode is dirty, or
b) contri and size mismatch.
In both these cases, the staleness can be identified on a subsequent lookup and healed.
However until such a lookup happens these updates are unaccounted at the top in the hierarchy.
(This is because we don't have any dirty flag set in the forward call path. We do all accounting work in the callback path.)
Heals: A directory is healed by summing the contri of all its dirents. A file is healed by matching its contri to its size.
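The two heal rules can be sketched as follows. This is a minimal in-memory model and not the marker translator's code: the `Entry` class and the `heal_file` / `heal_dir` names are hypothetical, xattrs are modelled as plain attributes, and clearing the dirty flag at the end of a directory heal is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """Hypothetical stand-in for an inode carrying marker xattrs."""
    name: str
    size: int = 0      # size xattr (for a file: the size reported by iatt)
    contri: int = 0    # contribution recorded towards the parent
    dirty: bool = False
    dirents: List["Entry"] = field(default_factory=list)


def heal_file(f: Entry) -> None:
    # A file is healed by matching its contri to its size.
    f.contri = f.size


def heal_dir(d: Entry) -> None:
    # A directory is healed by summing the contri of all its dirents.
    d.size = sum(child.contri for child in d.dirents)
    d.dirty = False  # assumption: the heal also clears the dirty flag


# A crash left b.txt with a stale contri and the directory marked dirty.
a = Entry("a.txt", size=10, contri=10)
b = Entry("b.txt", size=20, contri=5)
docs = Entry("docs", size=15, dirty=True, dirents=[a, b])

heal_file(b)
heal_dir(docs)
print(docs.size)  # 30
```

The sketch heals a single level only, matching the lookup-based heal described above; propagating the healed directory size further up the ancestry is left out.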
Below is a high-level workflow of a write for reference (some corner cases have been excluded).
a) set the dirty xattr and the in-core dirty flag.
b) update contri of the inode (on disk, in core)
c) update size of the parent (on disk, in core)
d) clear the dirty flag.
Proposed Design
We could make the following changes to the current design.
Workflow of a write call.
In the forward call path, do the steps below for every ancestor that has a limit set:
a) lock the inode context
b) if dirty_count is 0, do an xattr operation to set the dirty xattr (flag) on disk
c) increment dirty_count in the inode context
d) unlock the inode context
Do the write call.
In the callback path (after unwinding the write call), do the steps below for every ancestor that has a limit set:
a) lock the inode context
b) update the size in the inode context
c) decrement dirty_count in the inode context
d) if it falls to zero, update the on-disk size xattr with the in-core value and remove the dirty xattr on disk
e) unlock the inode context
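The forward and callback steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `TrackedDir`, `forward_path` and `callback_path` are hypothetical names, on-disk xattrs are plain attributes, and the size update and dirty removal are counted as two separate xattr operations.

```python
import threading


class TrackedDir:
    """Hypothetical stand-in for an ancestor directory that has a limit set."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()  # models the inode context lock
        self.dirty_count = 0          # fops currently in flight through this dir
        self.incore_size = 0          # size kept in the inode context
        self.disk_size = 0            # stand-in for the on-disk size xattr
        self.disk_dirty = False       # stand-in for the on-disk dirty xattr
        self.xattr_ops = 0            # number of simulated on-disk xattr operations


def forward_path(ancestors):
    """Run before winding the write: mark each tracked ancestor dirty."""
    for d in ancestors:
        with d.lock:
            if d.dirty_count == 0:
                d.disk_dirty = True   # first fop in flight sets the dirty xattr
                d.xattr_ops += 1
            d.dirty_count += 1


def callback_path(ancestors, delta):
    """Run after the write unwinds: account the size change."""
    for d in ancestors:
        with d.lock:
            d.incore_size += delta
            d.dirty_count -= 1
            if d.dirty_count == 0:
                # Directory is quiesced: flush size, then remove the dirty xattr.
                d.disk_size = d.incore_size
                d.disk_dirty = False
                d.xattr_ops += 2


# Two overlapping writes under the same tracked directory.
d = TrackedDir("/vol/projects")
forward_path([d])
forward_path([d])
callback_path([d], 100)   # one write still in flight, nothing flushed yet
callback_path([d], 50)    # last write completes, size is flushed
print(d.disk_size, d.disk_dirty, d.xattr_ops)  # 150 False 3
```

In this model the two overlapping writes cost three xattr operations in total, and the in-core size is current after every callback even while the on-disk size lags.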
Correctness.
In case of parallel fops, the on-disk xattr update is delayed until the dirty count reaches zero.
In-core sizes are always kept updated and we return these values for any lookup calls.
In essence we are absorbing parallel fops, and the xattr updates happen only when the directory is in a quiesced state.
Since we do not keep an updated size for all directories in the FS tree, we will need a crawl each time a new limit is set.
We will also need a crawl over the directories marked dirty following a brick crash (as opposed to the current single-level, lookup-based heal).
Advantages
So while in the current quota implementation the xattr update goes all the way along the ancestry path, this design will have fewer xattr operations.
Disadvantage
Work around.
Note that how many directories or which directories along the ancestry have xattr tracked has no implication on correctness of the tracking.
If a directory is tracked it only helps us during crawl by avoiding the crawl of its entire subtree.
So if, instead of tracking only the directories with a limit set, we also track a random subset of additional directories, we could leverage this to speed up the crawl.
Suppose every directory is tracked with probability 1/4 (except if it has a limit set, in which case it has to be tracked, i.e. p = 1).
The chance of having to go k levels deep during crawling = chance that none of the directories along the path is tracked = (1-1/4)^k = (3/4)^k.
e.g. the chance of having to crawl a directory that is 5 levels deep = (3/4)^5 ≈ 0.24
the chance of having to crawl 10 levels ≈ 0.056
So the chance of having to descend a given number of levels falls off geometrically with depth, which keeps the crawl efficient.
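The (3/4)^k figure can be checked with a few lines of Python. The function name `p_untracked_path` is made up for this example, and it assumes each directory along the path is tracked independently with the same probability.

```python
def p_untracked_path(depth, p_track=0.25):
    """Chance that none of `depth` directories along a path is tracked."""
    return (1.0 - p_track) ** depth


for k in (1, 5, 10):
    print(k, round(p_untracked_path(k), 3))
# 1 0.75
# 5 0.237
# 10 0.056
```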
To crawl in this manner, we will have to crawl based on whether an inode has marker xattrs.
So we cannot use
find | xargs stat
on the aux mount to crawl; we will have to write some code for that too.
Even this can be improved: instead of using a static 1/4 probability, we can choose the probability such that it improves logarithmically with depth
(akin to most randomized algorithms, e.g. skip lists).
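A rough simulation of such a crawl, which stops descending wherever it meets a tracked directory, might look like the sketch below. Everything in it is hypothetical: the `Dir` class, the uniform tree shape, the static tracking probability, and the assumption that every tracked subdirectory holds an up-to-date size xattr (that is, it is not itself dirty).

```python
import random


class Dir:
    """Hypothetical directory node for the simulation."""

    def __init__(self, file_bytes, tracked):
        self.file_bytes = file_bytes  # bytes in regular files directly inside
        self.tracked = tracked        # True if the directory has marker xattrs
        self.size_xattr = None        # only set when tracked
        self.subdirs = []


def build_tree(depth, fanout, p_track, rng):
    d = Dir(rng.randint(0, 1000), rng.random() < p_track)
    if depth > 0:
        d.subdirs = [build_tree(depth - 1, fanout, p_track, rng)
                     for _ in range(fanout)]
    return d


def populate_xattrs(d):
    """Give every tracked directory a correct size xattr; return subtree size."""
    total = d.file_bytes + sum(populate_xattrs(s) for s in d.subdirs)
    if d.tracked:
        d.size_xattr = total
    return total


def crawl(d, stats):
    """Recompute the size of d, reusing the size xattr of tracked subdirs."""
    stats["visited"] += 1
    total = d.file_bytes
    for s in d.subdirs:
        if s.tracked:
            total += s.size_xattr     # tracked: no need to descend
        else:
            total += crawl(s, stats)  # untracked: crawl the subtree
    return total


rng = random.Random(0)
root = build_tree(depth=4, fanout=3, p_track=0.25, rng=rng)
true_total = populate_xattrs(root)
stats = {"visited": 0}
assert crawl(root, stats) == true_total
print("directories visited:", stats["visited"], "of 121")
```

Varying `p_track`, `depth` and `fanout` gives a feel for how much of the tree a heal has to visit; a depth-dependent probability could be tried by changing the `rng.random() < p_track` test in `build_tree`.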