Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inode: implement namespace at inode level #1763

Merged
merged 11 commits into from
Dec 10, 2021

Conversation

amarts
Copy link
Member

@amarts amarts commented Nov 6, 2020

With this PR, specially on the brick side graph, inode table would be
properly set with 'namespace' inode reference. It is not guaranteed
with fuse/client side graph due to subdir mount.

To get 'namespace' for a corresponding inode, all one needs to do is,
check ns_inode pointer in inode structure.

Currently only special mounts with PID < 0 can set the namespace
attribute, and in lookup, if namespace attribute is present, we set
the variable in inode.

By default, the ns_inode is set to 'root' inode when inode gets created.

Fixes: #1757
Change-Id: I69157e388538ea5d4b4e45d543575a04ee9ef221
Signed-off-by: Amar Tumballi amar@kadalu.io

@amarts
Copy link
Member Author

amarts commented Nov 8, 2020

/run regression

@amarts
Copy link
Member Author

amarts commented Nov 9, 2020

/run full regression

@amarts
Copy link
Member Author

amarts commented Nov 9, 2020

/run regression

1 similar comment
@amarts
Copy link
Member Author

amarts commented Nov 9, 2020

/run regression

@amarts
Copy link
Member Author

amarts commented Nov 10, 2020

/run regression

@amarts amarts added this to the Release 9 milestone Nov 14, 2020
@pranithk
Copy link
Member

Some doubts:

Who sets namespace xattr on the directory?
Can namespace be nested?

@amarts
Copy link
Member Author

amarts commented Nov 16, 2020

Who sets namespace xattr on the directory?

Ideally, the management layer. Depending on the option (if it needs namespace or not). An example here: https://github.com/amarts/glusterfs_fork/blob/simple-quota_v2/extras/hook-scripts/set/post/S61simple-quota.sh

As a check so just any random person doesn't set the flag, planning to check for pid < 0 as a limit.

Can namespace be nested?

For now, the implementation I was thinking says it can be nested. Only thing is, performing a rename / hardlink across namespaces would be problematic, and not ideal.

I guess if everyone agrees, we can just add that check in inode_link(), but I didn't want to bring that check now.

@pranithk
Copy link
Member

Who sets namespace xattr on the directory?

Ideally, the management layer. Depending on the option (if it needs namespace or not). An example here: https://github.com/amarts/glusterfs_fork/blob/simple-quota_v2/extras/hook-scripts/set/post/S61simple-quota.sh

As a check so just any random person doesn't set the flag, planning to check for pid < 0 as a limit.

Can namespace be nested?

For now, the implementation I was thinking says it can be nested. Only thing is, performing a rename / hardlink across namespaces would be problematic, and not ideal.

I personally understood namespace + quota to be a private share from a big volume. Just like we carve out logical-disks using LVM from a pool of physical disks. Is that not the case?

I guess if everyone agrees, we can just add that check in inode_link(), but I didn't want to bring that check now.

@amarts
Copy link
Member Author

amarts commented Nov 16, 2020

I personally understood namespace + quota to be a private share from a big volume. Just like we carve out logical-disks using LVM from a pool of physical disks. Is that not the case?

Yes, it is the case. But there is one difference. GlusterFS never had concept of namespace, so everything below '/' was treated as single tree. With this introduction, because we keep namespace reference in every inode, it becomes like more of xfs project quota like feature, where within a single project any operations are permitted, but across project's no hardlink/renames are allowed. Even the accounting is separate, ie, assume below:

/ and /abcd and /abcd/xyz (treat each of these as projects in xfs). If there is a 100MB file inside of /abcd/xyz it is not counted inside / and /abcd's accounting.

so technically, each of those shares work as a separate filesystem inside, instead of one.

@amarts amarts force-pushed the i1757_namespace branch 3 times, most recently from 4226c40 to fb5f1b2 Compare November 19, 2020 09:16
@amarts
Copy link
Member Author

amarts commented Nov 19, 2020

/run regression

@amarts
Copy link
Member Author

amarts commented Nov 20, 2020

@pranithk now, namespace gets set in client side also!

@amarts
Copy link
Member Author

amarts commented Feb 23, 2021

/run regression

1 similar comment
@amarts
Copy link
Member Author

amarts commented Feb 23, 2021

/run regression

@amarts amarts force-pushed the i1757_namespace branch 2 times, most recently from 5d7af58 to c1e5d88 Compare August 16, 2021 10:54
@msaju msaju modified the milestones: Release 9, Gluster 11 Nov 9, 2021
Copy link
Contributor

@xhernandez xhernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The management in the client side seems a bit weak to me. We may have multiple clients working at the same time, and the namespace can be changed by anyone of them at any time. This change is not consistently identified by the other clients, probably causing inconsistencies in the data related to namespaces.

Another issue that I see is that on graph switch we send lookups in the new graph in a random order (using gfid-based lookups). It seems quite hard to correctly set the namespaces for each inode in this scenario. It seems to me there could be a lot of races with the running requests (even assuming that the namespace is updated recursively, which is not right now).

libglusterfs/src/inode.c Show resolved Hide resolved
/* pick the old dentry */
dentry = __inode_unlink(inode, srcdir, srcname);
if (linked_inode) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is NULL, shouldn't we return an error ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Been sometime I tested this. But without this fix, there was a crash when a cross namespace rename was attempted. (this may be due to the review comment you gave on line 212. Will retest). I also added this, because there are many cases of return NULL; in __inode_link(), in which case, I felt we shouldn't proceed with __inode_unlink() of the source. (Can this be a different bug)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment was not meant to remove this check, but if this check fails, can we continue without returning an error ?

However, after second thought I think it's fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will keep it as is, but add a reference about this comment. as a comment in the code.

libglusterfs/src/inode.c Show resolved Hide resolved
xlators/mount/fuse/src/fuse-bridge.c Outdated Show resolved Hide resolved
xlators/mount/fuse/src/fuse-bridge.c Outdated Show resolved Hide resolved
xlators/protocol/server/src/server-rpc-fops_v2.c Outdated Show resolved Hide resolved
"dict set (namespace) failed (path: %s), continuing",
state->loc.path);
}
if (state->loc.path && (state->loc.path[0] == '<')) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't allow lookups based only on gfid. Kernel doesn't do that and it's a basic thing to keep any kind of hierarchical structure consistent. Last time I checked, I think the only place where we really need gfid-based lookups is on a graph switch, but even in this case we could reimplement it in a way that gfid lookups are not really needed.

Anyway this is another discussion and something for another patch.

/* This is a lookup on gfid : get full-path */
ret = dict_set_int32(xdata, "get-full-path", 1);
if (ret) {
gf_msg_debug(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we continue in case of failure ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned, its for information right now, and made the log as INFO for now. Working on a recursive lookup for handling gfid based lookups. Should be finished before next major release, so I believe this is good to get in.

leaving it as TODO. Good to have logic of resolving GFID only access
to a path for many other features too. But initial version can just
be knowning that we are hitting the scenario in certain usecases */
if ((op_ret == 0) && (dict_get_sizen(xdata, "get-full-path"))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we reuse one of the already existing implementations ? I've found many ways to get something similar, like:

  • trusted.glusterfs.pathinfo
  • trusted.pathinfo
  • glusterfs.gfid2path
  • glusterfs.gfidtopath
  • glusterfs.ancestry.path

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That was the plan. I have implemented this to check how many this requests I receive in posix. In last ~8 months in pushing it out with kadalu releases, I have seen 2 instances of this log, meaning, we need to fix it. But the frequency is not high. We need to take a call about should we allow gfid based lookup at all and then proceed IMO. For this PR, shall I keep it as is, as this is not an implementation which is complete. Or will keep just a TODO for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine for now.

@amarts
Copy link
Member Author

amarts commented Nov 13, 2021

The management in the client side seems a bit weak to me. We may have multiple clients working at the same time, and the namespace can be changed by anyone of them at any time. This change is not consistently identified by the other clients, probably causing inconsistencies in the data related to namespaces.

I agree with client side issues. I did it as a part of 'completion'. But this feature and need is very much on server side. If you agree, I can get the feature only on the server side, thus avoiding issues on client side. Shall I break this PR into 2 - one which handles it on server side and inode changes, another which takes care of client side initializations (which can be merged later depending on need).


I will address other comments separately. About the gfid based lookup, yes, there is an issue so far. If we actually decide to keep the hierarchy even if a gfid based lookup is done (can achieve it by recursive lookups on the path (get it from gfid2path xattr). Can do that as a separate PR.

@xhernandez
Copy link
Contributor

The management in the client side seems a bit weak to me. We may have multiple clients working at the same time, and the namespace can be changed by anyone of them at any time. This change is not consistently identified by the other clients, probably causing inconsistencies in the data related to namespaces.

I agree with client side issues. I did it as a part of 'completion'. But this feature and need is very much on server side. If you agree, I can get the feature only on the server side, thus avoiding issues on client side. Shall I break this PR into 2 - one which handles it on server side and inode changes, another which takes care of client side initializations (which can be merged later depending on need).

As long as we don't create any new feature that depends on client side namespace, it's fine to keep it as it's now. I only wanted to note that we may have several issues with it in the client side. If we are aware of them and agree to fix them before using namespaces in the client side, that's perfectly fine with me.

I will address other comments separately. About the gfid based lookup, yes, there is an issue so far. If we actually decide to keep the hierarchy even if a gfid based lookup is done (can achieve it by recursive lookups on the path (get it from gfid2path xattr). Can do that as a separate PR.

I think it would be great to remove them because they cause a lot of troubles in thing like this. The gfid-lookups in fuse can be easily replaced by regular lookups. However I'm not 100% sure if we need them for anything else. @pranithk are you aware of other uses of gfid-lookups that cannot be replaced by regular lookups ?

Change-Id: I453e15ec6e04fc88386ec6a479f0a1e12ea48d12
Signed-off-by: Amar Tumballi <amar@kadalu.io>
@amarts
Copy link
Member Author

amarts commented Nov 26, 2021

/run regression

Copy link
Contributor

@xhernandez xhernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch looks good. However I have two points that I would like to consider:

  1. In the brick side the namespace is only set for lookups and setxattrs. Shouldn't we do the same for create, mkdir, mknod and symlink ?

  2. When sharding is used, all shards may belong to a different namespace than the base shard. Shouldn't we handle this case ?

@amarts
Copy link
Member Author

amarts commented Nov 26, 2021

The patch looks good. However I have two points that I would like to consider:

  1. In the brick side the namespace is only set for lookups and setxattrs. Shouldn't we do the same for create, mkdir, mknod and symlink ?
  2. When sharding is used, all shards may belong to a different namespace than the base shard. Shouldn't we handle this case ?

The inode_link() is called during that time, so namespace for those inodes are set based on parent itself (would apply for all shard files too).

@xhernandez
Copy link
Contributor

The patch looks good. However I have two points that I would like to consider:

  1. In the brick side the namespace is only set for lookups and setxattrs. Shouldn't we do the same for create, mkdir, mknod and symlink ?
  2. When sharding is used, all shards may belong to a different namespace than the base shard. Shouldn't we handle this case ?

The inode_link() is called during that time, so namespace for those inodes are set based on parent itself (would apply for all shard files too).

You are right for the create/mkdir/mknod/symlink part. I missed it... :-/

However that's not valid for shard files because the parent directory of a shard file is different than the parent directory of the base shard, so these files may belong to different namespaces (basically .shard and all its contents will belong to the root namespace, while the base shard can belong to any other namespace). I think all shards should belong to the same namespace, but I don't see how we can do that without adding "shard intelligence" into the brick side, which I don't like very much either...

Do you have any solution in mind or shard won't be supported with namespaces for the time being ?

@amarts
Copy link
Member Author

amarts commented Nov 29, 2021

Do you have any solution in mind or shard won't be supported with namespaces for the time being ?

Here are my thoughts on this:

  1. We will not support 'shard' with namespace (for now, ie, in devel branch)
  2. Work on pending enhancements, namely:
  • support for complete inode tree (dentry) resolution of gfid on brick process. (server-resolve.c)
  • support for out of the tree namespaces (ie, any file can have namespace depending on what the 'create' fop requestor asks). - this should resolve shard issue.

Have some suggestions from @xhernandez on how we can go about it, and I also will think more on it and propose the design for next step in an issue and take it forward.

This PR can go in so that many features which needs in-tree namespace features can utilize them immediately.

xhernandez
xhernandez previously approved these changes Nov 29, 2021
@xhernandez
Copy link
Contributor

@pranithk are you also ok with this PR ? if so, feel free to merge it.

@@ -4079,6 +4126,21 @@ fuse_setxattr_resume(fuse_state_t *state)
state->fd = fd_lookup(state->loc.inode, state->finh->pid);
#endif /* GF_TEST_FFOP */

if (dict_get_sizen(state->xattr, GF_NAMESPACE_KEY)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check be in removexattr as well? Shouldn't this check be present in gfapi part too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't handled 'removexattr()' as removing namespace in a running process is tricky... ie, we have to recursively reset namespace pointer in whole inode tree under that namespace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my idea was, even if there is gfapi client, the namespace setting always happens through a fuse client (that too a special one, ie, pid < 0). Hence haven't handled setxattr in gfapi.

state->loc.path);
state->resolve.op_ret = -1;
state->resolve.op_errno = ENOMEM;
goto err;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xdata will leak here I think, add if (xdata) dict_unref() in err

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dict_get_sizen(state->dict, GF_NAMESPACE_KEY)) {
gf_smsg("server", GF_LOG_ERROR, 0, PS_MSG_SETXATTR_INFO, "path=%s",
state->loc.path, "key=%s", GF_NAMESPACE_KEY, NULL);
ret = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove 'ret = -1` line, it is confusing, I had to check what SERVER_REQ_SET_ERROR() is doing to ret again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ret = -1;
SERVER_REQ_SET_ERROR(req, ret);
goto out;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should happen in removexattr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as explained in another comment, haven't done removexattr() as it involves doing things beyond just one inode scope.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, until we handle it, do you want to give some errno like ENOTSUP or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Please check if thats fine. added a test case for that. Will run the tests anyways.

@@ -362,6 +363,7 @@ __inode_ctx_free(inode_t *inode)
static void
__inode_destroy(inode_t *inode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should s/__inode_destroy/inode_destroy/g I thought there was a dead lock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no inode_destroy(). Hence will keep this part to later.

Change-Id: Iba16bf356909b6275af0956949c3d3d302d45081
So that we treat it as an error in removexattr

Change-Id: I6579e268124bec01e89b2fe7cb31134c786bf88f
Signed-off-by: Amar Tumballi <amar@kadalu.io>
@amarts
Copy link
Member Author

amarts commented Dec 6, 2021

/run regression

@@ -2740,6 +2769,15 @@ server4_removexattr_resume(call_frame_t *frame, xlator_t *bound_xl)
if (state->resolve.op_ret != 0)
goto err;

if (dict_get_sizen(state->xdata, GF_NAMESPACE_KEY) ||
!strncmp(GF_NAMESPACE_KEY, state->name, sizeof(GF_NAMESPACE_KEY))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be strcmp. strncmp will only compare the prefix.

@@ -2760,6 +2798,15 @@ server4_fremovexattr_resume(call_frame_t *frame, xlator_t *bound_xl)
if (state->resolve.op_ret != 0)
goto err;

if (dict_get_sizen(state->xdata, GF_NAMESPACE_KEY) ||
!strncmp(GF_NAMESPACE_KEY, state->name, sizeof(GF_NAMESPACE_KEY))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be strcmp. strncmp will only compare the prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry you used sizeof, this should be fine.

@xhernandez
Copy link
Contributor

/recheck smoke

@xhernandez xhernandez merged commit 063720d into gluster:devel Dec 10, 2021
mohit84 added a commit to mohit84/glusterfs that referenced this pull request Jan 30, 2024
The brick process is getting crashed due to stack overflow
while unref namespace inode, the ns inode was introduced by
the patch ((gluster#1763)

Solution: __inode_destroy is calling inode_unref that is again
calling inode_unref become a recursive call and eventually a
brick process is getting crashed. To avoid a crash for namespace
inode call only __inode_ref.

Fixes: gluster#4295
Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mohit84 added a commit that referenced this pull request Feb 1, 2024
The brick process is getting crashed due to stack overflow
while unref namespace inode, the ns inode was introduced by
the patch ((#1763)

Solution: __inode_destroy is calling inode_unref that is again
calling inode_unref become a recursive call and eventually a
brick process is getting crashed. To avoid a crash for namespace
inode call only __inode_ref.

Fixes: #4295
Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9

Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
mohit84 added a commit to mohit84/glusterfs that referenced this pull request Feb 1, 2024
The brick process is getting crashed due to stack overflow
while unref namespace inode, the ns inode was introduced by
the patch ((gluster#1763)

Solution: __inode_destroy is calling inode_unref that is again
calling inode_unref become a recursive call and eventually a
brick process is getting crashed. To avoid a crash for namespace
inode call only __inode_ref.

> Fixes: gluster#4295
> Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9
> Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
> (Cherry picked from commit 80ecbba)
> (Reviewed on upstream link gluster#4302)

Fixes: gluster#4295
Change-Id: If5deb06b726a5e7dfedd2784bddcef81e6e5d7d9
Signed-off-by: Mohit Agrawal <moagrawa@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFE: improve namespace support in glusterfs
7 participants