
GlusterFS marker translator Enhancement #182

Closed
sanoj-unnikrishnan opened this issue Apr 24, 2017 · 6 comments
Comments

@sanoj-unnikrishnan
Contributor

sanoj-unnikrishnan commented Apr 24, 2017

Marker translator is responsible for accounting of size in GlusterFS.

Current accounting design.

Accounting happens using a pair of contri and size xattrs on each inode.
Contri gives the inode's contribution towards its parent, while size gives the size of the directory.
In a quiesced state, contri and size should hold the same value.
When an operation changes the size of a file or directory, the update is first made to the size xattr and subsequently to the contri xattr.
Note: For a file, the size is obtained from the iatt, hence it is atomic with the actual write syscall.

During an update we mark the inode dirty, then update the size of the parent inode and the contri of the current inode towards its parent, and finally clear the dirty flag.

So, in case of a crash (before updates are propagated), we may end up in one of two situations:
a ) the inode is dirty, or
b ) contri and size mismatch.

In both these cases, the staleness can be identified on a subsequent lookup and healed.
However, until such a lookup happens, these updates are unaccounted at the top of the hierarchy.
(This is because we don't set any dirty flag in the forward call path; all accounting work happens in the callback path.)

Heals: Directories are healed by summing the contri of all their dirents. Files are healed by matching the contri to the size.
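These two heal rules could be sketched as follows (illustrative Python, not marker translator code; the dicts stand in for the on-disk xattrs):

```python
# Illustrative model of the heal rules; "contri"/"size" dict keys stand in
# for the on-disk xattrs and are not actual GlusterFS structures.

def heal_directory_size(dirents):
    """A directory's size is healed to the sum of its dirents' contri."""
    return sum(entry["contri"] for entry in dirents)

def heal_file(inode):
    """A file's contri is healed to match its size (taken from the iatt)."""
    inode["contri"] = inode["size"]
    return inode
```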

Below is a high-level workflow of a write, for reference (some corner cases have been excluded).

  1. set context update status in core on the inode
  2. take inode lock on the parent
  3. clear context update status
  4. get size, contri towards parent
  5. compute delta
  6. if delta != 0 then
    a) set the dirty xattr and in-core dirty flag
    b) update contri of the inode (on disk, in core)
    c) update size of the parent (on disk, in core)
    d) clear dirty flag
  7. release inode lock on the parent
  8. repeat steps 1..7 for the parent, until root
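The propagation loop above can be sketched single-threaded, with dicts standing in for the real xattr calls and the locks of steps 2/7 elided (all names are illustrative):

```python
# Simulated propagation loop for the current design. Locking (steps 2/7)
# and the in-core context status (steps 1/3) are elided; dicts stand in
# for the on-disk contri/size/dirty xattrs.

def propagate_update(inode, xattrs, parent_of):
    while inode in parent_of:                    # step 8: repeat until root
        parent = parent_of[inode]
        size = xattrs[inode]["size"]             # step 4
        contri = xattrs[inode]["contri"]         # step 4
        delta = size - contri                    # step 5
        if delta != 0:                           # step 6
            xattrs[parent]["dirty"] = True       # 6a: mark parent dirty first
            xattrs[inode]["contri"] += delta     # 6b: contri towards parent
            xattrs[parent]["size"] += delta      # 6c: size of the parent
            xattrs[parent]["dirty"] = False      # 6d: clear dirty
        inode = parent
```

A crash between 6a and 6d leaves a dirty directory; a crash after 6d but before the next iteration leaves an ancestor whose contri and size mismatch: the two stale states described above.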

Proposed Design

We could make the following changes to the current design.

  1. Maintain only the size xattr for accounting (do away with contri).
  2. Set the size xattr only on directories that have limits configured on them.
  3. Use a flag/xattr, in core and on disk, to identify whether a limit is set on a directory.
  4. Set a dirty flag in the forward call path, so we know whether the size value can be trusted (after a crash).

Workflow of the write call.

  1. In the forward call path, do the below steps for all ancestors that have a limit set:
    lock the inode context
    if the dirty_count is 0, do an xattr operation to set the dirty xattr (flag) on disk
    increment dirty_count in the inode ctx
    unlock the inode context

  2. do the write call.

  3. In the callback path (after unwinding the write call), do the below steps for all ancestors that have a limit set:
    lock the inode context
    update the size in the inode context
    decrement dirty_count in the inode context
    if it falls to zero, update the on-disk size xattr with the in-core value and remove the dirty xattr on disk
    unlock the inode context
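For a single ancestor, the two halves of this workflow amount to a reference-counted dirty flag. A minimal model (the class, field names, and simulated disk dict are assumptions, not marker code):

```python
import threading

class AncestorCtx:
    """One ancestor directory with a limit set; disk xattrs are simulated."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty_count = 0
        self.size = 0                               # in-core, always current
        self.disk = {"size": 0, "dirty": False}     # stand-in for on-disk xattrs

    def pre_write(self):
        """Forward call path: one dirty-xattr op per burst of parallel fops."""
        with self.lock:
            if self.dirty_count == 0:
                self.disk["dirty"] = True
            self.dirty_count += 1

    def post_write(self, delta):
        """Callback path: flush to disk only once the directory quiesces."""
        with self.lock:
            self.size += delta
            self.dirty_count -= 1
            if self.dirty_count == 0:
                self.disk["size"] = self.size
                self.disk["dirty"] = False
```

Two overlapping writes then cost one dirty-xattr set, one size-xattr update, and one dirty-xattr clear on disk, instead of a disk update per fop.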

Correctness.
In case of parallel fops, the on-disk xattr update is delayed until the dirty count reaches zero.
In-core sizes are always kept updated, and we return these values for any lookup calls.
In essence, we absorb parallel fops, and the xattr updates happen only when the directory is in a quiesced state.

Since we do not keep an updated size for all directories in the FS tree, we will need a crawl each time a new limit is set.
We will also need a crawl over the directories marked dirty following a brick crash (as opposed to the current single-level lookup-based heal).

Advantages

  1. Note that for most use cases we do not expect more than 2 directories along the ancestry of a path to have quota limits set on them.
    So while in the current quota implementation the xattr update goes all the way along the ancestry path, this design will have fewer xattr operations.
  2. Since we mark the dirty xattr flag in the forward call path, we can be sure that we don't have any staleness in accounting.

Disadvantages

  1. latencies in heal after a crash.
  2. an additional crawl while adding limits.

Workaround.

Note that how many directories, or which directories, along the ancestry have their xattr tracked has no bearing on the correctness of the tracking.
If a directory is tracked, it only helps us during a crawl by letting us avoid crawling its entire subtree.
So if, instead of tracking only the directories with a limit set, we also track a random subset of additional directories, we can leverage this to speed up the crawl.

Suppose every directory is tracked with probability 1/4 (except when it has a limit set, in which case it must be tracked, p = 1).
The chance of having to go k levels deep during crawling = the chance that none of the directories along the path is tracked = (1 - 1/4)^k = (3/4)^k.
e.g.: the chance of having to crawl a directory that is 5 levels deep = (3/4)^5 ≈ 0.24;
the chance of having to crawl 10 levels is ≈ 0.06.
So with increasing depth, the crawling becomes more efficient.
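The arithmetic above in runnable form (straightforward probability, nothing GlusterFS-specific):

```python
def crawl_depth_probability(k, p=0.25):
    """Chance that none of the k ancestor directories along a path is tracked."""
    return (1 - p) ** k

print(round(crawl_depth_probability(5), 2))    # (3/4)^5  -> 0.24
print(round(crawl_depth_probability(10), 2))   # (3/4)^10 -> 0.06
```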

To crawl in this manner, we will have to crawl based on whether an inode has the marker xattrs.
So we cannot use find | xargs stat on the aux mount to crawl; we will have to write some code for that too.

Even this can be improved: instead of using a static 1/4 probability, we can choose the probability such that it improves logarithmically with depth
(akin to most randomized algorithms, e.g. skip lists).

@Manikandan-Selvaganesh
Contributor

Could you please elaborate on the part about getting away with contri?
With the current design, we get the file size from stat (iatt), then calculate the contri every time an update is made, and recursively add this contri value up to the ancestor where the limit is set.

Most of our recursive crawling code depends on contri.

I am sorry, I really do not know; I have one question here: are we going for DHT2? If so, then definitely the whole thing needs to be altered (since a lot of updates need to happen between DS and MDS). And if we are going for DHT2, why do we need to alter so much in the current code?

@sanoj-unnikrishnan
Contributor Author

@Manikandan-Selvaganesh
Regarding DHT2: I checked with Shyam; it will start as experimental and take some time (2+ years) to become native (depending on acceptance).

Regarding doing away with contri

My understanding of the current design is that we are using a combination of the dirty xattr and contri in order to achieve crash consistency.
In the callback of the current writev:

  1. we set the dirty flag on the directory
  2. update the contri of the directory and the size of its parent
  3. clear dirty on the directory
  4. recurse the same steps upward till root

So in case of a crash (during a recursive update), an inode along the ancestry could be in one of the below states:

  1. a directory inode marked dirty (we will heal them)
  2. directories/files (not marked dirty) whose contri mismatches their size (they get healed as and when they are detected on lookup)
  3. directories at the top of the hierarchy could have no dirty flag and no change to contri/size.
    So we have no way of identifying the unaccounted IO until a lookup happens at the bottom of the hierarchy (where the update was happening at the time of the crash).
    Once such a lookup happens, the ancestry will get healed (till then, accounting for such directories is stale).

Alternative method to achieve crash consistency.

  1. We set the dirty xattr in the forward call path (before the write). So we no longer have to rely on a difference between the contri and size xattr values to identify directories with incorrect accounting;
    we simply consider all the ones with the dirty xattr as dirty, and the remaining as clean.

Note that this was not possible earlier, since we were marking the dirty xattr in the callback path.

  2. Also, note that updates to sizes could have been made independent of contri. In every fop, we know the change in size it makes from the iatt values in prebuf/postbuf. So it is only a matter of propagating this change upward.
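The second point in code form (a sketch; the dict-based stat buffers stand in for the real prebuf/postbuf iatts):

```python
def fop_size_delta(prebuf, postbuf):
    """Size change made by a fop, derived purely from pre-op and post-op
    stat data; no contri xattr is consulted."""
    return postbuf["ia_size"] - prebuf["ia_size"]

# A 4 KB append to a 100-byte file: only this delta needs to be
# propagated upward along the ancestry.
delta = fop_size_delta({"ia_size": 100}, {"ia_size": 100 + 4096})
```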

The design choice we are proposing is to let heal efficiency after a crash take a dip, in exchange for better write performance.

@vmallikarjuna

In the new design, you don't have contri xattrs.
Suppose the limit is set on the root directory, and updating the size xattr on it fails because of a brick crash.
As there are no contri xattrs set on the children of the root directory, how do you recover the quota size? You may have to crawl the entire filesystem.

@sanoj-unnikrishnan
Contributor Author

@VMallika,
After a crash, a directory heal would aggregate the size xattrs of all its dirents and record that as the directory size. We do have the size xattr on directories with a limit set. So we don't need to crawl the subtrees of directories that have a limit set (unless they are marked dirty).

I agree that with few limits across the filesystem, this could degenerate into a full filesystem crawl.
Hence the workaround of tracking additional directories (in other words, tracking each directory with probability 1/4, say). This scheme improves with depth, as pointed out before.

In the event that the filesystem is flattish, say 2 levels with millions of files under each directory, then even this would not fare well. But it is still similar to the current scheme (as in the current scheme we have to get the contris of all dirents to do a directory heal).

Another point to note is that the current implementation achieves eventual accuracy.
What I mean is: if a file 15 levels deep is modified and a crash happens before the contri update is percolated upward, we may have staleness in the size reported by marker (until a heal happens at the lower layers and gets percolated upward through the dirty transaction).

With the dirty flag marked in the forward call path, we would expedite such a heal, as we realize the directory is dirty much more quickly.

I guess it's prudent to have a rough program that simulates the crawl (especially the benefit of tracking directories randomly) before stepping further.
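As a starting point for such a simulation, here is a rough sketch (purely illustrative: a synthetic uniform tree, where a tracked directory terminates the crawl of its subtree):

```python
import random

def make_tree(depth, fanout):
    """A uniform directory tree as nested lists of children."""
    if depth == 0:
        return []
    return [make_tree(depth - 1, fanout) for _ in range(fanout)]

def crawl_cost(tree, p, rng):
    """Directories visited when the crawl can stop at tracked directories."""
    visited = 0
    for child in tree:
        visited += 1
        if rng.random() >= p:          # untracked: must descend further
            visited += crawl_cost(child, p, rng)
    return visited

rng = random.Random(42)
tree = make_tree(depth=6, fanout=3)
full = crawl_cost(tree, p=0.0, rng=rng)      # full filesystem crawl
tracked = crawl_cost(tree, p=0.25, rng=rng)  # 1/4 of directories tracked
```

With p = 0 every directory is visited (3 + 9 + ... + 3^6 = 1092 here); with p = 1/4 the expected cost drops, since each tracked directory prunes its whole subtree.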

Please let me know, if this addresses your question or adds even more :)

@amarts amarts changed the title GlusterFS maker translator Enhancement GlusterFS marker translator Enhancement May 7, 2017
@stale

stale bot commented Aug 13, 2020

Thank you for your contributions.
Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Aug 13, 2020
@stale

stale bot commented Aug 28, 2020

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

@stale stale bot closed this as completed Aug 28, 2020
4 participants