Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter tasks by pid namespace #329

Open
fntlnz opened this issue Jan 10, 2019 · 14 comments
Open

Filter tasks by pid namespace #329

fntlnz opened this issue Jan 10, 2019 · 14 comments
Labels
enhancement New feature or request, changes on existing features kernel Issue may require kernel work priority: low

Comments

@fntlnz
Copy link

fntlnz commented Jan 10, 2019

Following the idea of filtering tasks by cgroupv2 I'm writing this to explore the idea to add to bpftrace the ability to filter tasks based on their belonging to a specific namespace.

The easiest use case that comes to mind is the ability to use a pid namespace to determine wether a task should or should not be considered, let's see how it looks like when used in bashreadline.t

BEGIN
{
	printf("Tracing bash commands... Hit Ctrl-C to end.\n");
	printf("%-9s %-6s %s\n", "TIME", "PID", "COMMAND");
}

uretprobe:/bin/bash:readline /nspid == nspid("/proc/1234/ns/pid")
{
	time("%H:%M:%S  ");
	printf("%-6d %s\n", pid, str(retval));
}

What I expect as output from this is the actual list of executed bash commands in that container.

This should be doable on kernels > 4.8 by leveraging bpf_get_current_task to obtain the children and then traverse that until we find the matching internal processes struct task_struct to get the pids of the processes contained in our pid namespace so that we can filter them and use the internal pid for the pid variable.

Adding this will also require to have a mechanism to figure out the binary to attach the uprobe from a namespaced point of view.

@fntlnz
Copy link
Author

fntlnz commented Jan 11, 2019

@brendangregg WDYT?

@brendangregg
Copy link
Contributor

I don't understand the pid() call.

I haven't thought about it enough, but I would have guessed we could do something with positional parameters. Eg:

uretprobe:/bin/bash:readline /nspid == $1/
{

Where nspid was a builtin that returned the ID, eg 4026531836, and then we can pass that in on the command line:

./bashreadline.py 4026531836

So it becomes $1. Then the problem of parsing /proc/PID/ns/pid can be done at the shell.

Adding nspid, and nscgroup, etc, as builtins seems a small change.

I'd also check how bcc does this...

@fntlnz
Copy link
Author

fntlnz commented Jan 11, 2019

That makes sense @brendangregg , matching a parameter is really what we want.
I'm upgrading the issue to reflect this and will write some code.

Are you referring to the mount ns guard in bcc? https://github.com/iovisor/bcc/blob/c2e2a26b8624492018a14d5eebd4a50b869c911f/src/cc/ns_guard.cc

@brendangregg brendangregg added the kernel Issue may require kernel work label Jan 11, 2019
@brendangregg
Copy link
Contributor

I've since remembered that we do have this interface:

# bpftrace -e 'tracepoint:syscalls:sys_enter_openat /cgroup == cgroupid("/sys/fs/cgroup/unified/mycg")/ { printf("%s\n", str(args->filename)); }':
Attaching 1 probe...

So my nspid builtin suggestion I think is like that cgroup builtin, and so far I haven't suggested an equivalent of that cgroupid() function (I'm not sure we need one if we pass it in via positional parameters).

I haven't touched ns_guard.cc. It looks like it could fetch namespaces via some user-level library calls, which would be useful if we're doing a cgroupid()-equivalent function, but instead we're doing the cgroup builtin -- which must be implemented in kernel BPF code. It can't call out and do fstat() etc.

Adding a builtin should be mostly easy work, although it touches a lot of files for tests and documentation. Recursive grep some existing ones, and also see the "2. Codegen: curtask" tutorial in the docs/internals_development.md. The hard part will be what goes in ast/codegen_llvm.cc and ast/semantic_analyser.cc. I don't see a BPF function for returning namespace information in include/uapi/linux/bpf.h, and that maybe the way we want to do this in the future (adding kernel tag). If you want to do that work, great, but it's the kind of work where if no one has put their hand up for it, in six months from now it'll still not be done.

The other approach would be digging it out of the task struct, and we could write bpftrace code to do that, but I don't believe there's a way to have a builtin expand to some bpftrace code. Perhaps we need to add such a capability: bpftrace "aliases" or "macros", which expand before all the preprocessors. Will involve adding them to lexer.l, and inserting that extra step in the build path.

Or you might find another, better, way...

@fntlnz
Copy link
Author

fntlnz commented Jan 13, 2019

I'm exploring those possibilities, the road that seems reasonable to me is to add a CallInst *IRBuilderBPF::CreateGetCurrentPidNamespace() in ast/irbuilderbpf.cpp and call in it a new exported bpf helper that should look like this.

Not yet sure how to get the pid namespace id from the pid_namespace struct in pid_namespace.h so here I'm using it directly for reference.

BPF_CALL_0(bpf_get_current_pid_ns)
{
	struct task_struct *task = current;

	if (unlikely(!task))
		return -EINVAL;

  return (u64) task->nsproxy->pid_ns_for_children

}

const struct bpf_func_proto bpf_get_current_pid_ns = {
	.func		= bpf_get_current_pid_ns,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

Then use the value I get from that in bpftrace to compare with the extracted tasks pid namespace and filter.

@dalehamel
Copy link
Contributor

Adding this will also require to have a mechanism to figure out the binary to attach the uprobe from a namespaced point of view.

There is code for this in bcc, it can probably be reused. I ran into a similar problem when I was trying to adapt gethostlatency.py to work for probing only a specific container, only to realize it already handled this if you specified a pid to it - it figures out that the path is relative to the mountns for the container pid you provided, because it enters that namespace before resolving the symbols in bcc_syms. This is what allows it to resolve a containerized libc correctly, in order to attach a uprobe to gethostbyname. Not totally sure how to handle this in bpftrace though, that could probably be it's own issue.

Adding a builtin should be mostly easy work, although it touches a lot of files for tests and documentation. Recursive grep some existing ones, and also see the "2. Codegen: curtask" tutorial in the docs/internals_development.md. The hard part will be what goes in ast/codegen_llvm.cc and ast/semantic_analyser.cc.

@fntlnz if you can figure out how to actually get the process namespace / implement a CreateGetCurrentPidNamespace() call, I'm familiar with the other bits to add a builtin (did this for #30). I really want to be able to filter by process namespace (hugely useful for debugging issues for a specific pod in kubernetes), and would be happy to assist in implementing the builtin that wraps this call if it would be helpful.

By the way, I also think you're on the right track by trying to fetch this out of the curtask struct and filtering, though I agree it would be very convenient (and is probably best solved) in the kernel as Brendan indicates above, as this seems generally useful for bpf tools to debug containers.

A crazy idea that I had while thinking about this was to do some special handling in the codegen. If you look at https://github.com/iovisor/bpftrace/blob/master/src/ast/codegen_llvm.cpp#L99-L102, _expr is what would hold the pointer to the struct object. You should have access to the sruct task_struct struct, via bpftrace_.structs_[type.cast_type], also available within that file. You should be able to (ab?)use this to get the offset of the nsproxy member, and then pass that in to another codegen call get a pointer to that, then use the same trick on the struct nsproxy struct to get the offset of pid_ns_for_children or any other member from here.

Through a few probe reads and additions of field offsets (ample examples of this in that file), you should be able to manually navigate structs inside this codegen context... I think. Then the read value is just bubbled up and implements a nspid call or something similar that gives you the integer result of the last proberead. Manually navigating a specific struct though is unprecedented and has a bit of a smell to it... but if it works it might be the best of a number of bad options?

To prove if this is possible, you could just add includes for files that have both of these structs to a test script, or we could hack it into the standard definitions.h to ensure that these headers are always loaded. Then the codegen work to implement the manual dereferencing, and check to see if you ultimately read the pidns value you are expecting. I'm not sure how I feel about that from a design perspective (it's probably bad), but that's not a problem unless it works. If you get stuck, I might invest some time in proving/disproving this theory.

@dalehamel
Copy link
Contributor

Not yet sure how to get the pid namespace id from the pid_namespace struct in pid_namespace.h so here I'm using it directly for reference.

I struggled with this for a bit, then found someone had already done this in bcc:
https://github.com/iovisor/bcc/blob/b0bf04ac4042a6c004b15dcc5e40e12ae78020f9/tools/runqlat.py#L238

This makes sense to me, I recall reading somewhere that the numbers associated with /proc/PID/ns/* were inode numbers.

So this bpftrace should allow for the necessary struct navigation:

#include <linux/nsproxy.h>
#include <linux/pid_namespace.h>
#include <linux/ns_common.h>

BEGIN
{
  printf("%-8s %-8s %-8s %-16s ", "TIME", "PID", "PIDNS", "COMM");
}

kprobe:wake_up_new_task
{
  $task = (task_struct *)curtask;
  $nsproxy = (nsproxy*) $task->nsproxy;
  $pidns = (pid_namespace*) $nsproxy->pid_ns_for_children;
  $ns = (ns_common*) $pidns->ns;

  time("%H:%M:%S ");
  printf("%-8d %-8d %-16s \n", pid, $ns->inum, comm);
}

But I am consistently getting 0 for the pid ns, so something is up here. I'll try to find a bcc example that works for reading this and see if it's a bug in bpftrace or i'm just doing something wrong.

hope this helps @fntlnz

@dalehamel
Copy link
Contributor

So looking at the headers for the task_struct, I think that it's quite possible that bpftrace doesn't understand the #ifdef macros (I don't think it parses these at all, and even if it did I'm not sure how it would know what kernel configs are applied). That would basically break how it calculates field offsets, which would make sense for why I am getting such strange values when I try to read even $task->pid to compare to the bpftrace builtin pid.

In the case of @brendangregg 's runqlen.bt, I believe that the use of $task->se is valid only because it happens to be defined near the top of the struct, where it's less likely that you'll run into a config option that isn't enabled, it depends only on CONFIG_THREAD_INFO_IN_TASK and CONFIG_SMP. For anything further down in the struct, it's more likely to have a kernel config option missing and thus have incorrect offsets.

bcc doesn't do the same sort of struct/field parsing that bpftrace does, so I suppose it is able to work around this problem in some other way.

I think that would probably make parsing this out from the struct quite impractical in bpftrace (unless it can somehow leverage what bcc does here), as I don't think you can reliably get the offset of nsproxy. For this reason, a bpftrace helper right in the kernel is probably best. Going off of @fntlnz's work, I suspect it'd be something like this:

BPF_CALL_0(bpf_get_current_pid_ns)
{
	struct task_struct *task = current;

	if (unlikely(!task))
		return -EINVAL;

  return (u64) task->nsproxy->pid_ns_for_children->ns.inum

}

const struct bpf_func_proto bpf_get_current_pid_ns = {
	.func		= bpf_get_current_pid_ns,
	.gpl_only	= false,
	.ret_type	= RET_INTEGER,
};

Though likely with more null checks for safer struct navigation, not sure on the accepted way to do this in kernel-land.

@fntlnz
Copy link
Author

fntlnz commented Feb 5, 2019

Yes @dalehamel that's very similar to what I have in mind, however I also wanted to explore a way to implement this completely in bpftrace since that is also how a number of bcc programs do that like:

@fntlnz
Copy link
Author

fntlnz commented Feb 5, 2019

In the meanwhile, those that can enable cgroupv2 in their kubernetes cluster can do the filtering using the cgroupid function:

APP=front-end
POD=$(kubectl get pod -n sock-shop -l name=$APP -o jsonpath='{.items[0].metadata.name}')
NODE=$(kubectl get pod -n sock-shop -l name=$APP -o jsonpath='{.items[0].spec.nodeName}')
CGR=/sys/fs/cgroup/unified$(kubectl exec -ti -n sock-shop $POD \
                              cat /proc/1/cgroup|grep ^0:|cut -d: -f3|tr -d '\r\n')

kubectl trace run --attach --serviceaccount=kubectltrace \
          $NODE 
          -e 'kprobe:do_sys_open* /cgroup == cgroupid("'$CGR'")/ \
                { printf("%s: %s\n", comm, str(arg1)) }'

kudos to @alban - he wrote that example down for his talk at fosdem, he also has a guide on how to enable cgroupv2 in a kubernetes cluster https://gist.github.com/alban/6b6eee36e042d947c0c550b0dacced52

@dalehamel
Copy link
Contributor

Thanks for the tip regarding the cgroupid, but I was under the impression that cgroups mapped more or less 1:1 to container, and could (and often do) differ within a pod. I'll give this a shot though and check my assumption. I did a quick search of the docs:

The shared context of a pod is a set of Linux namespaces, cgroups, and potentially other facets of isolation - the same things that isolate a Docker container.

But the wording there is "set of namespaces". From my poking around /proc I seem to recall that the pid namespace (and network, unless using host networking) was the most reliable shared namespace to map 1:1 to a pod.

In your example, the inline call to kubectl exec there would not work if there are multiple containers in the pod, you would need to specify which to probe. This is how other kubernetes things work, and that is totally reasonable (you would just need to specify which container to probe), but the subtlety is that I think there is a lot of applications where you would want to probe all containers in a pod, and only narrow down to a particular container later in debugging.

I'm eager to try this out though, I recall from a bcc issue that Brendan Gregg suggested something similar (though he was talking about containers, and not pods), and I think it (at the minimum) can be used to build a reasonable workaround until something filtering directly on the pid namespace can be implemented.

@dalehamel
Copy link
Contributor

under the impression that cgroups mapped more or less 1:1 to container, and could (and often do) differ within a pod

FWIW I did a spot check, two containers in the same pod, and they did use the same cgroup namespace. This could mean that as long as you default to the first container in the pod that you are ok to use cgroup as the identifier for the pod.

I'm not sure if this is enforced, or is a coincidence / luck for the pod workload I examined. Either way, this seems more promising than I initially thought.

And in fact, the pid namespace was different between these two pods : / perhaps cgroup is better for this after all

@tluo-github
Copy link

@fntlnz can work on cgroups v1?

@jordalgo jordalgo added enhancement New feature or request, changes on existing features priority: low labels Dec 13, 2023
@jeason81
Copy link

Any progress here? I'm trying to use bpftrace to monitor activity inside containers but not on the host system. I've been able to achieve this with bcc by creating a bpfmap. Is it possible to do the same here?

@bpftrace bpftrace deleted a comment from tluo-github Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request, changes on existing features kernel Issue may require kernel work priority: low
Projects
None yet
Development

No branches or pull requests

6 participants