Filter tasks by pid namespace #329
@brendangregg WDYT? |
I don't understand the pid() call. I haven't thought about it enough, but I would have guessed we could do something with positional parameters. Eg:
Where
So it becomes $1. Then the problem of parsing /proc/PID/ns/pid can be done at the shell. Adding nspid, and nscgroup, etc, as builtins seems a small change. I'd also check how bcc does this... |
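A minimal sketch of the shell-side step suggested above: read the pid namespace identifier from /proc/PID/ns/pid so it can be handed to a script as the first positional parameter. The helper names `ns_inode_from_link` and `pidns_inode` are made up for illustration, and the `nspid` builtin in the usage comment is only the proposal from this thread, not an existing bpftrace feature.

```shell
# Extract the inode number from a namespace link target such as "pid:[4026531836]".
# Prints nothing if the input does not have that shape.
ns_inode_from_link() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# Print the pid namespace inode of a process (defaults to the current shell).
pidns_inode() {
    ns_inode_from_link "$(readlink "/proc/${1:-$$}/ns/pid")"
}

# Hypothetical usage, assuming an nspid builtin were added as proposed:
#   bpftrace -e 'kprobe:do_sys_open /nspid == $1/ { printf("%s\n", comm); }' "$(pidns_inode 1234)"
```

Doing the parsing in the shell keeps the BPF side down to a single integer comparison, which matches the positional-parameter idea in the comment above.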
That makes sense @brendangregg , matching a parameter is really what we want. Are you referring to the mount ns guard in bcc? https://github.com/iovisor/bcc/blob/c2e2a26b8624492018a14d5eebd4a50b869c911f/src/cc/ns_guard.cc |
I've since remembered that we do have this interface:
So my nspid builtin suggestion I think is like that cgroup builtin, and so far I haven't suggested an equivalent of that cgroupid() function (I'm not sure we need one if we pass it in via positional parameters).
I haven't touched ns_guard.cc. It looks like it could fetch namespaces via some user-level library calls, which would be useful if we're doing a cgroupid()-equivalent function, but instead we're doing the cgroup builtin, which must be implemented in kernel BPF code. It can't call out and do fstat() etc.
Adding a builtin should be mostly easy work, although it touches a lot of files for tests and documentation. Recursive grep some existing ones, and also see the "2. Codegen: curtask" tutorial in docs/internals_development.md. The hard part will be what goes in ast/codegen_llvm.cpp and ast/semantic_analyser.cpp.
I don't see a BPF function for returning namespace information in include/uapi/linux/bpf.h, and that may be the way we want to do this in the future (adding kernel tag). If you want to do that work, great, but it's the kind of work where if no one has put their hand up for it, in six months from now it'll still not be done.
The other approach would be digging it out of the task struct, and we could write bpftrace code to do that, but I don't believe there's a way to have a builtin expand to some bpftrace code. Perhaps we need to add such a capability: bpftrace "aliases" or "macros", which expand before all the preprocessors. That will involve adding them to lexer.l, and inserting that extra step in the build path. Or you might find another, better, way... |
I'm exploring those possibilities, the road that seems reasonable to me is to add a Not yet sure how to get the pid namespace id from the
BPF_CALL_0(bpf_get_current_pid_ns)
{
	struct task_struct *task = current;

	if (unlikely(!task))
		return -EINVAL;
	/* Sketch: returns the namespace pointer cast to u64, not the ns inode number. */
	return (u64) task->nsproxy->pid_ns_for_children;
}

const struct bpf_func_proto bpf_get_current_pid_ns_proto = {
	.func = bpf_get_current_pid_ns,
	.gpl_only = false,
	.ret_type = RET_INTEGER,
};
Then use the value I get from that in bpftrace to compare with the extracted task's pid namespace and filter. |
There is code for this in bcc, it can probably be reused. I ran into a similar problem when I was trying to adapt
@fntlnz if you can figure out how to actually get the process namespace / implement a By the way, I also think you're on the right track by trying to fetch this out of the curtask struct and filtering, though I agree it would be very convenient (and is probably best solved) in the kernel as Brendan indicates above, as this seems generally useful for bpf tools to debug containers. A crazy idea that I had while thinking about this was to do some special handling in the codegen. If you look at https://github.com/iovisor/bpftrace/blob/master/src/ast/codegen_llvm.cpp#L99-L102, Through a few probe reads and additions of field offsets (ample examples of this in that file), you should be able to manually navigate structs inside this codegen context... I think. Then the read value is just bubbled up and implements a To prove if this is possible, you could just add includes for files that have both of these structs to a test script, or we could hack it into the standard definitions.h to ensure that these headers are always loaded. Then the codegen work to implement the manual dereferencing, and check to see if you ultimately read the pidns value you are expecting. I'm not sure how I feel about that from a design perspective (it's probably bad), but that's not a problem unless it works. If you get stuck, I might invest some time in proving/disproving this theory. |
I struggled with this for a bit, then found someone had already done this in bcc: This makes sense to me, I recall reading somewhere that the numbers associated with /proc/PID/ns/* were inode numbers. So this bpftrace should allow for the necessary struct navigation:
But I am consistently getting 0 for the pid ns, so something is up here. I'll try to find a bcc example that works for reading this and see if it's a bug in bpftrace or I'm just doing something wrong. Hope this helps @fntlnz |
So looking at the headers for the task_struct, I think that it's quite possible that bpftrace doesn't understand the In the case of @brendangregg 's bcc doesn't do the same sort of struct/field parsing that bpftrace does, so I suppose it is able to work around this problem in some other way. I think that would probably make parsing this out from the struct quite impractical in bpftrace (unless it can somehow leverage what bcc does here), as I don't think you can reliably get the offset of
Though likely with more null checks for safer struct navigation, not sure on the accepted way to do this in kernel-land. |
Yes @dalehamel that's very similar to what I have in mind, however I also wanted to explore a way to implement this completely in bpftrace since that is also how a number of bcc programs do that like: |
In the meantime, those who can enable cgroupv2 in their Kubernetes cluster can do the filtering using the
APP=front-end
POD=$(kubectl get pod -n sock-shop -l name=$APP -o jsonpath='{.items[0].metadata.name}')
NODE=$(kubectl get pod -n sock-shop -l name=$APP -o jsonpath='{.items[0].spec.nodeName}')
# Path of the pod's cgroup v2 entry (the "0::" line) under the unified hierarchy
CGR=/sys/fs/cgroup/unified$(kubectl exec -ti -n sock-shop $POD \
  cat /proc/1/cgroup|grep ^0:|cut -d: -f3|tr -d '\r\n')
kubectl trace run --attach --serviceaccount=kubectltrace \
  $NODE \
  -e 'kprobe:do_sys_open* /cgroup == cgroupid("'$CGR'")/
  { printf("%s: %s\n", comm, str(arg1)) }'
Kudos to @alban: he wrote that example down for his talk at FOSDEM, and he also has a guide on how to enable cgroupv2 in a Kubernetes cluster: https://gist.github.com/alban/6b6eee36e042d947c0c550b0dacced52 |
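The cgroup path extraction in the example above can be isolated into a small helper, sketched here. The function name `cgroup_v2_path` is made up for illustration. It relies on the cgroup v2 entry in /proc/PID/cgroup having hierarchy ID 0 and an empty controller list (`0::/path`), and it keeps everything after `0::` so a path that happens to contain a colon is not truncated.

```shell
# Read the content of /proc/<pid>/cgroup on stdin and print the cgroup v2 path.
# Carriage returns (added by "kubectl exec -t") are stripped first.
cgroup_v2_path() {
    tr -d '\r' | awk -F: '$1 == "0" && $2 == "" { print substr($0, 4); exit }'
}

# Possible usage with the variables from the example above:
#   CGR=/sys/fs/cgroup/unified$(kubectl exec -ti -n sock-shop $POD cat /proc/1/cgroup | cgroup_v2_path)
```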
Thanks for the tip regarding the cgroupid, but I was under the impression that cgroups mapped more or less 1:1 to container, and could (and often do) differ within a pod. I'll give this a shot though and check my assumption. I did a quick search of the docs:
But the wording there is "set of namespaces". From my poking around /proc I seem to recall that the pid namespace (and network, unless using host networking) was the most reliable shared namespace to map 1:1 to a pod. In your example, the inline call to I'm eager to try this out though, I recall from a bcc issue that Brendan Gregg suggested something similar (though he was talking about containers, and not pods), and I think it (at the minimum) can be used to build a reasonable workaround until something filtering directly on the pid namespace can be implemented. |
FWIW I did a spot check, two containers in the same pod, and they did use the same cgroup namespace. This could mean that as long as you default to the first container in the pod that you are ok to use cgroup as the identifier for the pod. I'm not sure if this is enforced, or is a coincidence / luck for the pod workload I examined. Either way, this seems more promising than I initially thought. And in fact, the pid namespace was different between these two pods : / perhaps cgroup is better for this after all |
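A spot check like the one above can be scripted by comparing the namespace links of two processes. This is only a sketch: `same_ns` is a made-up helper, and `PROC_ROOT` is a hypothetical override added so the function can be pointed at a fake directory tree instead of /proc.

```shell
# usage: same_ns PID_A PID_B [NS_TYPE]
# NS_TYPE is a name under /proc/<pid>/ns (pid, cgroup, net, ...); defaults to "pid".
# Returns 0 if both processes are in the same namespace, 1 if not,
# and 2 if either link cannot be read.
same_ns() {
    root=${PROC_ROOT:-/proc}   # PROC_ROOT: hypothetical override for testing
    a=$(readlink "$root/$1/ns/${3:-pid}") || return 2
    b=$(readlink "$root/$2/ns/${3:-pid}") || return 2
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example, run on the node with the host pids of two container processes:
#   same_ns 12345 12399 cgroup && echo "same cgroup namespace"
```

Reading another process's namespace links generally needs root or matching ownership, so this is meant to be run on the node rather than from inside a container.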
@fntlnz can this work on cgroups v1? |
Any progress here? I'm trying to use bpftrace to monitor activity inside containers but not on the host system. I've been able to achieve this with bcc by creating a bpfmap. Is it possible to do the same here? |
Following the idea of filtering tasks by cgroupv2, I'm writing this to explore adding to bpftrace the ability to filter tasks based on their belonging to a specific namespace.
The easiest use case that comes to mind is using a pid namespace to determine whether a task should or should not be considered. Let's see what it looks like when used in
bashreadline.t
What I expect as output from this is the actual list of executed bash commands in that container.
This should be doable on kernels >= 4.8 by leveraging bpf_get_current_task to obtain the children and then traverse that until we find the matching internal processes' struct task_struct to get the pids of the processes contained in our pid namespace, so that we can filter them and use the internal pid for the pid variable.
Adding this will also require a mechanism to figure out the binary to attach the uprobe to from a namespaced point of view.