
GlusterFS marker translator Enhancement #182

Closed
sanoj-unnikrishnan opened this issue Apr 24, 2017 · 6 comments
Comments

@sanoj-unnikrishnan
Contributor

sanoj-unnikrishnan commented Apr 24, 2017

Marker translator is responsible for accounting of size in GlusterFS.

Current accounting design.

Accounting happens using a pair of contri and size xattrs on each inode.
Contri gives the inode's contribution towards its parent, while size gives the size of the directory.
In a quiesced state, contri and size should hold the same value.
When an operation changes the size of a file or directory, the update is first made to the size xattr and subsequently to the contri xattr.
Note: For a file, the size is obtained from the iatt, hence it is atomic with the actual write syscall.

During an update we mark the inode dirty, then update the size of the parent inode and the contri of the current inode towards its parent, and finally clear the dirty flag.

So, in case of a crash (before updates are propagated), we may end up in one of two situations:
a ) the inode is dirty, or
b ) contri and size mismatch.

In both these cases, the staleness can be identified on a subsequent lookup and healed.
However, until such a lookup happens, these updates are unaccounted at the top of the hierarchy.
(This is because we don't set any dirty flag in the forward call path; all accounting work happens in the callback path.)

Heals: Directories are healed by summing the contri of all their dirents. Files are healed by matching the contri to the size.
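These two heal rules could be sketched as follows (illustrative Python, not marker translator code; the dicts stand in for the on-disk xattrs):

```python
# Illustrative model of the heal rules; "contri"/"size" dict keys stand in
# for the on-disk xattrs and are not actual GlusterFS structures.

def heal_directory_size(dirents):
    """A directory's size is healed to the sum of its dirents' contri."""
    return sum(entry["contri"] for entry in dirents)

def heal_file(inode):
    """A file's contri is healed to match its size (taken from the iatt)."""
    inode["contri"] = inode["size"]
    return inode
```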

Below is a high-level workflow of a write, for reference (some corner cases have been excluded).

  1. set context update status in core on the inode
  2. take inode lock on the parent
  3. clear context update status
  4. get size, contri towards parent
  5. compute delta
  6. if delta != 0 then
    a) set the dirty xattr and in-core dirty flag
    b) update contri of the inode (on disk, in core)
    c) update size of the parent (on disk, in core)
    d) clear dirty flag
  7. release inode lock on the parent
  8. repeat steps 1..7 for the parent, until root
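The propagation loop above can be sketched single-threaded, with dicts standing in for the real xattr calls and the locks of steps 2/7 elided (all names are illustrative):

```python
# Simulated propagation loop for the current design. Locking (steps 2/7)
# and the in-core context status (steps 1/3) are elided; dicts stand in
# for the on-disk contri/size/dirty xattrs.

def propagate_update(inode, xattrs, parent_of):
    while inode in parent_of:                    # step 8: repeat until root
        parent = parent_of[inode]
        size = xattrs[inode]["size"]             # step 4
        contri = xattrs[inode]["contri"]         # step 4
        delta = size - contri                    # step 5
        if delta != 0:                           # step 6
            xattrs[parent]["dirty"] = True       # 6a: mark parent dirty first
            xattrs[inode]["contri"] += delta     # 6b: contri towards parent
            xattrs[parent]["size"] += delta      # 6c: size of the parent
            xattrs[parent]["dirty"] = False      # 6d: clear dirty
        inode = parent
```

A crash between 6a and 6d leaves a dirty directory; a crash after 6d but before the next iteration leaves an ancestor whose contri and size mismatch: the two stale states described above.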

Proposed Design

We could make the following changes to the current design.

  1. Maintain only the size xattr for accounting (do away with contri).
  2. Set the size xattr only on directories that have limits configured on them.
  3. Use a flag/xattr, in core and on disk, to identify whether a limit is set on a directory.
  4. Set a dirty flag in the forward call path, so we know whether the size value can be trusted (after a crash).

Workflow of the write call.

  1. In the forward call path, do the below steps for all ancestors that have a limit set:
    lock the inode context
    if the dirty_count is 0, do an xattr operation to set the dirty xattr (flag) on disk
    increment dirty_count in the inode ctx
    unlock the inode context

  2. do the write call.

  3. In the callback path (after unwinding the write call), do the below steps for all ancestors that have a limit set:
    lock the inode context
    update the size in the inode context
    decrement dirty_count in the inode context
    if it falls to zero, update the on-disk size xattr with the in-core value and remove the dirty xattr on disk
    unlock the inode context
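For a single ancestor, the two halves of this workflow amount to a reference-counted dirty flag. A minimal model (the class, field names, and simulated disk dict are assumptions, not marker code):

```python
import threading

class AncestorCtx:
    """One ancestor directory with a limit set; disk xattrs are simulated."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty_count = 0
        self.size = 0                               # in-core, always current
        self.disk = {"size": 0, "dirty": False}     # stand-in for on-disk xattrs

    def pre_write(self):
        """Forward call path: one dirty-xattr op per burst of parallel fops."""
        with self.lock:
            if self.dirty_count == 0:
                self.disk["dirty"] = True
            self.dirty_count += 1

    def post_write(self, delta):
        """Callback path: flush to disk only once the directory quiesces."""
        with self.lock:
            self.size += delta
            self.dirty_count -= 1
            if self.dirty_count == 0:
                self.disk["size"] = self.size
                self.disk["dirty"] = False
```

Two overlapping writes then cost one dirty-xattr set, one size-xattr update, and one dirty-xattr clear on disk, instead of a disk update per fop.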

Correctness.
In case of parallel fops, the on-disk xattr update is delayed until the dirty count reaches zero.
In-core sizes are always kept updated, and we return these values for any lookup calls.
In essence, we absorb parallel fops, and the xattr updates happen only when the directory is in a quiesced state.

Since we do not keep an updated size for all directories in the FS tree, we will need a crawl each time a new limit is set.
We will also need a crawl over the directories marked dirty following a brick crash (as opposed to the current single-level lookup-based heal).

Advantages

  1. Note that for most use cases we do not expect more than 2 directories along the ancestry of a path to have quota limits set on them.
    So while in the current quota implementation the xattr update goes all the way along the ancestry path, this design will have fewer xattr operations.
  2. Since we mark the dirty xattr flag in the forward call path, we can be sure that we don't have any staleness in accounting.

Disadvantages

  1. latencies in heal after a crash.
  2. an additional crawl while adding limits.

Workaround.

Note that how many directories, or which directories, along the ancestry have their xattr tracked has no bearing on the correctness of the tracking.
If a directory is tracked, it only helps us during a crawl by letting us avoid crawling its entire subtree.
So if, instead of tracking only the directories with a limit set, we also track a random subset of additional directories, we can leverage this to speed up the crawl.

Suppose every directory is tracked with probability 1/4 (except when it has a limit set, in which case it must be tracked, p = 1).
The chance of having to go k levels deep during crawling = the chance that none of the directories along the path is tracked = (1 - 1/4)^k = (3/4)^k.
e.g.: the chance of having to crawl a directory that is 5 levels deep = (3/4)^5 ≈ 0.24;
the chance of having to crawl 10 levels is ≈ 0.06.
So with increasing depth, the crawling becomes more efficient.
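The arithmetic above in runnable form (straightforward probability, nothing GlusterFS-specific):

```python
def crawl_depth_probability(k, p=0.25):
    """Chance that none of the k ancestor directories along a path is tracked."""
    return (1 - p) ** k

print(round(crawl_depth_probability(5), 2))    # (3/4)^5  -> 0.24
print(round(crawl_depth_probability(10), 2))   # (3/4)^10 -> 0.06
```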

To crawl in this manner, we will have to crawl based on whether an inode has the marker xattrs.
So we cannot use find | xargs stat on the aux mount to crawl; we will have to write some code for that too.

Even this can be improved: instead of using a static 1/4 probability, we can choose the probability such that it improves logarithmically with depth
(akin to most randomized algorithms, e.g. skip lists).

@Manikandan-Selvaganesh
Contributor

Could you please elaborate on the part about getting away with contri?
With the current design, we get the file size from stat (iatt), then calculate the contri every time an update is made, and recursively add this contri value up to the ancestor where the limit is set.

Most of our recursive crawling code depends on contri.

I am sorry, I really do not know; I have one question here: are we going for DHT2? If so, then definitely the whole thing needs to be altered (since a lot of updates need to happen between DS and MDS). And if we are going for DHT2, why do we need to alter so much in the current code?

@sanoj-unnikrishnan
Contributor Author

@Manikandan-Selvaganesh
Regarding DHT2: I checked with Shyam; it will start as experimental and take some time (2+ years) to become native (depending on acceptance).

Regarding doing away with contri

My understanding of the current design is that we are using a combination of the dirty xattr and contri in order to achieve crash consistency.
In the callback of the current writev:

  1. we set the dirty flag on the directory
  2. update the contri of the directory and the size of its parent
  3. clear dirty on the directory
  4. recurse the same steps upward till root

So in case of a crash (during a recursive update), an inode along the ancestry could be in one of the below states:

  1. a directory inode marked dirty (we will heal them)
  2. directories/files (not marked dirty) whose contri mismatches their size (they get healed as and when they are detected on lookup)
  3. directories at the top of the hierarchy could have no dirty flag and no change to contri/size.
    So we have no way of identifying the unaccounted IO until a lookup happens at the bottom of the hierarchy (where the update was happening at the time of the crash).
    Once such a lookup happens, the ancestry will get healed (till then, accounting for such directories is stale).

Alternative method to achieve crash consistency.

  1. We set the dirty xattr in the forward call path (before the write). So we no longer have to rely on a difference between the contri and size xattr values to identify directories with incorrect accounting;
    we simply consider all the ones with the dirty xattr as dirty, and the remaining as clean.

Note that this was not possible earlier, since we were marking the dirty xattr in the callback path.

  2. Also, note that updates to sizes could have been made independent of contri. In every fop, we know the change in size it makes from the iatt values in prebuf/postbuf. So it is only a matter of propagating this change upward.
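The second point in code form (a sketch; the dict-based stat buffers stand in for the real prebuf/postbuf iatts):

```python
def fop_size_delta(prebuf, postbuf):
    """Size change made by a fop, derived purely from pre-op and post-op
    stat data; no contri xattr is consulted."""
    return postbuf["ia_size"] - prebuf["ia_size"]

# A 4 KB append to a 100-byte file: only this delta needs to be
# propagated upward along the ancestry.
delta = fop_size_delta({"ia_size": 100}, {"ia_size": 100 + 4096})
```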

The design choice we are proposing is to let heal efficiency after a crash take a dip, in exchange for better write performance.

@vmallikarjuna

In the new design, you don't have contri xattrs.
Suppose the limit is set on the root directory, and updating the size xattr on it fails because of a brick crash.
As there are no contri xattrs set on the children of the root directory, how do you recover the quota size? You may have to crawl the entire filesystem.

@sanoj-unnikrishnan
Contributor Author

@VMallika,
After a crash, a directory heal would aggregate the size xattrs of all its dirents and record that as the directory size. We do have the size xattr on directories with a limit set. So we don't need to crawl the subtrees of directories that have a limit set (unless they are marked dirty).

I agree that with few limits across the filesystem, this could degenerate into a full filesystem crawl.
Hence the workaround of tracking additional directories (in other words, tracking each directory with probability 1/4, say). This scheme improves with depth, as pointed out before.

In the event that the filesystem is flattish, say 2 levels with millions of files under each directory, then even this would not fare well. But it is still similar to the current scheme (as in the current scheme we have to get the contris of all dirents to do a directory heal).

Another point to note is that the current implementation achieves eventual accuracy.
What I mean is: if a file 15 levels deep is modified and a crash happens before the contri update is percolated upward, we may have staleness in the size reported by marker (until a heal happens at the lower layers and gets percolated upward through the dirty transaction).

With the dirty flag marked in the forward call path, we would expedite such a heal, as we realize the directory is dirty much more quickly.

I guess it's prudent to have a rough program that simulates the crawl (especially the benefit of tracking directories randomly) before stepping further.
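As a starting point for such a simulation, here is a rough sketch (purely illustrative: a synthetic uniform tree, where a tracked directory terminates the crawl of its subtree):

```python
import random

def make_tree(depth, fanout):
    """A uniform directory tree as nested lists of children."""
    if depth == 0:
        return []
    return [make_tree(depth - 1, fanout) for _ in range(fanout)]

def crawl_cost(tree, p, rng):
    """Directories visited when the crawl can stop at tracked directories."""
    visited = 0
    for child in tree:
        visited += 1
        if rng.random() >= p:          # untracked: must descend further
            visited += crawl_cost(child, p, rng)
    return visited

rng = random.Random(42)
tree = make_tree(depth=6, fanout=3)
full = crawl_cost(tree, p=0.0, rng=rng)      # full filesystem crawl
tracked = crawl_cost(tree, p=0.25, rng=rng)  # 1/4 of directories tracked
```

With p = 0 every directory is visited (3 + 9 + ... + 3^6 = 1092 here); with p = 1/4 the expected cost drops, since each tracked directory prunes its whole subtree.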

Please let me know, if this addresses your question or adds even more :)

@amarts amarts changed the title GlusterFS maker translator Enhancement GlusterFS marker translator Enhancement May 7, 2017
@stale

stale bot commented Aug 13, 2020

Thank you for your contributions.
Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix Managed by stale[bot] label Aug 13, 2020
@stale

stale bot commented Aug 28, 2020

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

@stale stale bot closed this as completed Aug 28, 2020
4 participants