-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
inode: implement namespace at inode level #1763
Conversation
/run regression |
/run full regression |
/run regression |
1 similar comment
/run regression |
27a7e0d
to
5f3bba1
Compare
/run regression |
Some doubts: Who sets namespace xattr on the directory? |
Ideally, the management layer. Depending on the option (if it needs namespace or not). An example here: https://github.com/amarts/glusterfs_fork/blob/simple-quota_v2/extras/hook-scripts/set/post/S61simple-quota.sh As a check so just any random person doesn't set the flag, planning to check for
For now, the implementation I was thinking says it can be nested. Only thing is, performing a rename / hardlink across namespaces would be problematic, and not ideal. I guess if everyone agrees, we can just add that check in inode_link(), but I didn't want to bring that check now. |
I personally understood namespace + quota to be a private share from a big volume. Just like we carve out logical-disks using LVM from a pool of physical disks. Is that not the case?
|
Yes, it is the case. But there is one difference. GlusterFS never had concept of namespace, so everything below '/' was treated as single tree. With this introduction, because we keep namespace reference in every inode, it becomes like more of xfs project quota like feature, where within a single project any operations are permitted, but across project's no hardlink/renames are allowed. Even the accounting is separate, ie, assume below:
so technically, each of those shares work as a separate filesystem inside, instead of one. |
4226c40
to
fb5f1b2
Compare
/run regression |
@pranithk now, namespace gets set in client side also! |
fb5f1b2
to
4b95be0
Compare
/run regression |
1 similar comment
/run regression |
5d7af58
to
c1e5d88
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The management in the client side seems a bit weak to me. We may have multiple clients working at the same time, and the namespace can be changed by anyone of them at any time. This change is not consistently identified by the other clients, probably causing inconsistencies in the data related to namespaces.
Another issue that I see is that on graph switch we send lookups in the new graph in a random order (using gfid-based lookups). It seems quite hard to correctly set the namespaces for each inode in this scenario. It seems to me there could be a lot of races with the running requests (even assuming that the namespace is updated recursively, which is not right now).
/* pick the old dentry */ | ||
dentry = __inode_unlink(inode, srcdir, srcname); | ||
if (linked_inode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is NULL, shouldn't we return an error ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Been sometime I tested this. But without this fix, there was a crash when a cross namespace rename was attempted. (this may be due to the review comment you gave on line 212. Will retest). I also added this, because there are many cases of return NULL;
in __inode_link(), in which case, I felt we shouldn't proceed with __inode_unlink() of the source. (Can this be a different bug)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comment was not meant to remove this check, but if this check fails, can we continue without returning an error ?
However, after second thought I think it's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will keep it as is, but add a reference about this comment. as a comment in the code.
"dict set (namespace) failed (path: %s), continuing", | ||
state->loc.path); | ||
} | ||
if (state->loc.path && (state->loc.path[0] == '<')) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we shouldn't allow lookups based only on gfid. Kernel doesn't do that and it's a basic thing to keep any kind of hierarchical structure consistent. Last time I checked, I think the only place where we really need gfid-based lookups is on a graph switch, but even in this case we could reimplement it in a way that gfid lookups are not really needed.
Anyway this is another discussion and something for another patch.
/* This is a lookup on gfid : get full-path */ | ||
ret = dict_set_int32(xdata, "get-full-path", 1); | ||
if (ret) { | ||
gf_msg_debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we continue in case of failure ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned, its for information right now, and made the log as INFO for now. Working on a recursive lookup for handling gfid based lookups. Should be finished before next major release, so I believe this is good to get in.
leaving it as TODO. Good to have logic of resolving GFID only access | ||
to a path for many other features too. But initial version can just | ||
be knowning that we are hitting the scenario in certain usecases */ | ||
if ((op_ret == 0) && (dict_get_sizen(xdata, "get-full-path"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we reuse one of the already existing implementations ? I've found many ways to get something similar, like:
- trusted.glusterfs.pathinfo
- trusted.pathinfo
- glusterfs.gfid2path
- glusterfs.gfidtopath
- glusterfs.ancestry.path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That was the plan. I have implemented this to check how many this requests I receive in posix. In last ~8 months in pushing it out with kadalu releases, I have seen 2 instances of this log, meaning, we need to fix it. But the frequency is not high. We need to take a call about should we allow gfid based lookup at all and then proceed IMO. For this PR, shall I keep it as is, as this is not an implementation which is complete. Or will keep just a TODO for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine for now.
I agree with client side issues. I did it as a part of 'completion'. But this feature and need is very much on server side. If you agree, I can get the feature only on the server side, thus avoiding issues on client side. Shall I break this PR into 2 - one which handles it on server side and inode changes, another which takes care of client side initializations (which can be merged later depending on need). I will address other comments separately. About the gfid based lookup, yes, there is an issue so far. If we actually decide to keep the hierarchy even if a gfid based lookup is done (can achieve it by recursive lookups on the path (get it from gfid2path xattr). Can do that as a separate PR. |
As long as we don't create any new feature that depends on client side namespace, it's fine to keep it as it's now. I only wanted to note that we may have several issues with it in the client side. If we are aware of them and agree to fix them before using namespaces in the client side, that's perfectly fine with me.
I think it would be great to remove them because they cause a lot of troubles in thing like this. The gfid-lookups in fuse can be easily replaced by regular lookups. However I'm not 100% sure if we need them for anything else. @pranithk are you aware of other uses of gfid-lookups that cannot be replaced by regular lookups ? |
Change-Id: I453e15ec6e04fc88386ec6a479f0a1e12ea48d12 Signed-off-by: Amar Tumballi <amar@kadalu.io>
/run regression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patch looks good. However I have two points that I would like to consider:
-
In the brick side the namespace is only set for lookups and setxattrs. Shouldn't we do the same for create, mkdir, mknod and symlink ?
-
When sharding is used, all shards may belong to a different namespace than the base shard. Shouldn't we handle this case ?
The |
You are right for the create/mkdir/mknod/symlink part. I missed it... :-/ However that's not valid for shard files because the parent directory of a shard file is different than the parent directory of the base shard, so these files may belong to different namespaces (basically .shard and all its contents will belong to the root namespace, while the base shard can belong to any other namespace). I think all shards should belong to the same namespace, but I don't see how we can do that without adding "shard intelligence" into the brick side, which I don't like very much either... Do you have any solution in mind or shard won't be supported with namespaces for the time being ? |
Here are my thoughts on this:
Have some suggestions from @xhernandez on how we can go about it, and I also will think more on it and propose the design for next step in an issue and take it forward. This PR can go in so that many features which needs in-tree namespace features can utilize them immediately. |
@pranithk are you also ok with this PR ? if so, feel free to merge it. |
@@ -4079,6 +4126,21 @@ fuse_setxattr_resume(fuse_state_t *state) | |||
state->fd = fd_lookup(state->loc.inode, state->finh->pid); | |||
#endif /* GF_TEST_FFOP */ | |||
|
|||
if (dict_get_sizen(state->xattr, GF_NAMESPACE_KEY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check be in removexattr as well? Shouldn't this check be present in gfapi part too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't handled 'removexattr()' as removing namespace in a running process is tricky... ie, we have to recursively reset namespace pointer in whole inode tree under that namespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my idea was, even if there is gfapi client, the namespace setting always happens through a fuse client (that too a special one, ie, pid < 0). Hence haven't handled setxattr in gfapi.
state->loc.path); | ||
state->resolve.op_ret = -1; | ||
state->resolve.op_errno = ENOMEM; | ||
goto err; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xdata will leak here I think, add if (xdata) dict_unref() in err
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dict_get_sizen(state->dict, GF_NAMESPACE_KEY)) { | ||
gf_smsg("server", GF_LOG_ERROR, 0, PS_MSG_SETXATTR_INFO, "path=%s", | ||
state->loc.path, "key=%s", GF_NAMESPACE_KEY, NULL); | ||
ret = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove 'ret = -1` line, it is confusing, I had to check what SERVER_REQ_SET_ERROR() is doing to ret again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ret = -1; | ||
SERVER_REQ_SET_ERROR(req, ret); | ||
goto out; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should happen in removexattr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as explained in another comment, haven't done removexattr() as it involves doing things beyond just one inode scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, until we handle it, do you want to give some errno like ENOTSUP or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Please check if thats fine. added a test case for that. Will run the tests anyways.
@@ -362,6 +363,7 @@ __inode_ctx_free(inode_t *inode) | |||
static void | |||
__inode_destroy(inode_t *inode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should s/__inode_destroy/inode_destroy/g I thought there was a dead lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no inode_destroy()
. Hence will keep this part to later.
Change-Id: Iba16bf356909b6275af0956949c3d3d302d45081
So that we treat it as an error in removexattr Change-Id: I6579e268124bec01e89b2fe7cb31134c786bf88f Signed-off-by: Amar Tumballi <amar@kadalu.io>
/run regression |
@@ -2740,6 +2769,15 @@ server4_removexattr_resume(call_frame_t *frame, xlator_t *bound_xl) | |||
if (state->resolve.op_ret != 0) | |||
goto err; | |||
|
|||
if (dict_get_sizen(state->xdata, GF_NAMESPACE_KEY) || | |||
!strncmp(GF_NAMESPACE_KEY, state->name, sizeof(GF_NAMESPACE_KEY))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be strcmp. strncmp will only compare the prefix.
@@ -2760,6 +2798,15 @@ server4_fremovexattr_resume(call_frame_t *frame, xlator_t *bound_xl) | |||
if (state->resolve.op_ret != 0) | |||
goto err; | |||
|
|||
if (dict_get_sizen(state->xdata, GF_NAMESPACE_KEY) || | |||
!strncmp(GF_NAMESPACE_KEY, state->name, sizeof(GF_NAMESPACE_KEY))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be strcmp. strncmp will only compare the prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sorry you used sizeof, this should be fine.
/recheck smoke |
The brick process is getting crashed due to stack overflow while unref namespace inode, the ns inode was introduced by the patch ((gluster#1763) Solution: __inode_destroy is calling inode_unref that is again calling inode_unref become a recursive call and eventually a brick process is getting crashed. To avoid a crash for namespace inode call only __inode_ref. Fixes: gluster#4295 Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
The brick process is getting crashed due to stack overflow while unref namespace inode, the ns inode was introduced by the patch ((#1763) Solution: __inode_destroy is calling inode_unref that is again calling inode_unref become a recursive call and eventually a brick process is getting crashed. To avoid a crash for namespace inode call only __inode_ref. Fixes: #4295 Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
The brick process is getting crashed due to stack overflow while unref namespace inode, the ns inode was introduced by the patch ((gluster#1763) Solution: __inode_destroy is calling inode_unref that is again calling inode_unref become a recursive call and eventually a brick process is getting crashed. To avoid a crash for namespace inode call only __inode_ref. > Fixes: gluster#4295 > Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9 > Signed-off-by: Mohit Agrawal <moagrawa@redhat.com> > (Cherry picked from commit 80ecbba) > (Reviewed on upstream link gluster#4302) Fixes: gluster#4295 Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9 Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
With this PR, specially on the brick side graph, inode table would be
properly set with 'namespace' inode reference. It is not guaranteed
with fuse/client side graph due to subdir mount.
To get 'namespace' for a corresponding inode, all one needs to do is,
check
ns_inode
pointer in inode structure.Currently only special mounts with
PID < 0
can set the namespaceattribute, and in lookup, if namespace attribute is present, we set
the variable in inode.
By default, the ns_inode is set to 'root' inode when inode gets created.
Fixes: #1757
Change-Id: I69157e388538ea5d4b4e45d543575a04ee9ef221
Signed-off-by: Amar Tumballi amar@kadalu.io