usdt: Make namespace aware #1231

Merged
merged 2 commits into from
Jun 30, 2017


vmg
Contributor

vmg commented Jun 26, 2017

When trying to attach probes to a namespaced process (i.e. one inside a container), we directly access the /proc/$pid/maps file in our local procfs. The paths to the mapped files listed in the procfs, however, are relative to the chroot of the process instead of the global FS root.

To work around this, this PR makes the USDT code in BCC aware of namespaces, by using the ProcMountNS helpers that were previously introduced to the library.

We should now be able to attach and trace probes to processes namespaced inside a container.

Note that tracing using uprobes is only supported in Docker deployments when using the devicemapper (and most likely btrfs) storage drivers. The overlay filesystems (aufs, overlayfs, overlay2) do not seem to properly write the userspace breakpoints to the binaries, so the uprobes cannot trigger.
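
For readers unfamiliar with the approach, here is a minimal sketch of how a tracer can temporarily enter a target's mount namespace so that paths from /proc/$pid/maps resolve as the target sees them. It is not BCC's ProcMountNS code: the function names (mount_ns_path, same_inode, sketch_enter_mount_ns, sketch_leave_mount_ns) are hypothetical, and it assumes Linux, a single-threaded caller, and enough privileges for setns(2) (typically CAP_SYS_ADMIN).

```c
#define _GNU_SOURCE /* for setns() and CLONE_NEWNS */
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_SAME_NS (-2) /* hypothetical return code: nothing to switch */

/* Write "/proc/<pid>/ns/mnt" into buf; returns its length, or -1 if it does not fit. */
int mount_ns_path(pid_t pid, char *buf, size_t len) {
  int n = snprintf(buf, len, "/proc/%d/ns/mnt", (int)pid);
  return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Returns 1 if both paths resolve to the same inode, 0 if not, -1 on error. */
int same_inode(const char *a, const char *b) {
  struct stat sa, sb;
  if (stat(a, &sa) != 0 || stat(b, &sb) != 0) return -1;
  return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * Enter the mount namespace of `pid`.
 * Returns an fd referring to the caller's original namespace (hand it to
 * sketch_leave_mount_ns), SKETCH_SAME_NS if both already share a namespace,
 * or -1 on error.
 */
int sketch_enter_mount_ns(pid_t pid) {
  char target[64];
  int self_fd, target_fd;

  if (mount_ns_path(pid, target, sizeof(target)) < 0) return -1;
  if (same_inode("/proc/self/ns/mnt", target) == 1) return SKETCH_SAME_NS;

  self_fd = open("/proc/self/ns/mnt", O_RDONLY);
  if (self_fd < 0) return -1;
  target_fd = open(target, O_RDONLY);
  if (target_fd < 0) {
    close(self_fd);
    return -1;
  }
  if (setns(target_fd, CLONE_NEWNS) != 0) {
    close(target_fd);
    close(self_fd);
    return -1;
  }
  close(target_fd);
  return self_fd;
}

/* Switch back to the namespace saved by sketch_enter_mount_ns. */
int sketch_leave_mount_ns(int saved_fd) {
  int ret;
  if (saved_fd < 0) return 0; /* never switched, nothing to undo */
  ret = setns(saved_fd, CLONE_NEWNS);
  close(saved_fd);
  return ret;
}
```

A caller would wrap each file access in an enter/leave pair, for example opening the ELF file named in a maps entry between the two calls.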

@brendangregg
Member

Nice and simple.

[buildbot, ok to test]

@vmg
Contributor Author

vmg commented Jun 27, 2017

Ooops! This fix is not quite correct. It fixes the immediate issue, but accessing the probes through /proc/pid/root does not play nicely with the ProcMountNS code that the rest of BCC uses.

The proper fix implies making the USDT code aware of ProcMountNS. I'll be working on that today. 👌

Cleanup the `strncmp` code and add a few more ignored map names
@cfcs

cfcs commented Jun 27, 2017

Wouldn't you want to access the associated FD rather than the path?

@vmg
Contributor Author

vmg commented Jun 27, 2017

So, I've spent the afternoon messing around with this... I've added the proper ProcMountNS guards to the USDT code and now all the examples and tools are working.

And by working I mean that they execute without problems, but no events are being reported.

I've verified this manually with the simplest examples (i.e. by printing to trace_pipe from the probe) and the USDT probes are just not being triggered. I'll have to investigate further tomorrow, here's a list of possible leads:

  • Are we properly enabling the semaphores for the given probes? I'm using Ruby (mri) as my example USDT app, and all the probes here need manual activation. I gotta try an app with static probes that don't need activation and see if that triggers.

  • Is bpf_attach_uprobe working properly? @drzaeus77 implemented enter_mount_ns to handle namespaced binaries, but could it have a bug? I've tried a manual fix: attaching the probe to /proc/$pid/root/$path_to_binary, because I was concerned that the kernel wouldn't really play well when writing a namespaced path to /sys/kernel/debug/tracing/uprobe_events (even though the writing process was in the right namespace). The /root path should always be valid to the kernel -- and yet the USDT probe still doesn't trigger.
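
On the first lead: when a probe is guarded by a semaphore, the application only reaches the probe site while the 16-bit counter at the semaphore address is non-zero in its own memory. A minimal sketch of bumping that counter through /proc/<pid>/mem follows. The function names are hypothetical, and the sketch assumes runtime_addr is already the address inside the running process (for a position-independent executable that means the note's semaphore address adjusted by the load base, which is not computed here).

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Add `delta` to the 16-bit counter stored at byte offset `addr` of `mem_fd`.
 * Returns 0 on success, -1 if the counter could not be read or written.
 */
int semaphore_add(int mem_fd, off_t addr, int delta) {
  uint16_t count;

  if (pread(mem_fd, &count, sizeof(count), addr) != (ssize_t)sizeof(count))
    return -1;
  count = (uint16_t)(count + delta);
  if (pwrite(mem_fd, &count, sizeof(count), addr) != (ssize_t)sizeof(count))
    return -1;
  return 0;
}

/* Same operation on a live process, going through /proc/<pid>/mem. */
int semaphore_add_pid(pid_t pid, uintptr_t runtime_addr, int delta) {
  char path[64];
  int fd, ret;

  snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
  fd = open(path, O_RDWR);
  if (fd < 0) return -1;
  ret = semaphore_add(fd, (off_t)runtime_addr, delta);
  close(fd);
  return ret;
}
```

A tracer would call semaphore_add_pid(pid, addr, 1) when attaching and semaphore_add_pid(pid, addr, -1) when detaching, so that several tracers can share one probe.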

That's all I have for today. @drzaeus77 @brendangregg if you have any further insights or leads I could look into, all help is appreciated. Thank you!

@vmg
Contributor Author

vmg commented Jun 27, 2017

> Wouldn't you want to access the associated FD rather than the path?

Well, the fix wasn't quite right to begin with, but I don't understand what you mean by associated FD. Most of the code internally works by reading paths (particularly the Elf code), so either we have /proc/$pid/root-anchored paths, or we have a normal path + the proper ProcMountNS, which is what I've implemented in 9a24f15

Does that make sense?
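
The two alternatives differ only in how a path taken from /proc/$pid/maps is turned into something the tracer can open. Below is a minimal sketch of the first one, /proc/$pid/root-anchored paths, assuming the usual maps line layout (address range, permissions, offset, device, inode, pathname). The function names are hypothetical and this is not the code from the PR.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy the pathname field of one /proc/<pid>/maps line into `out`.
 * Returns 0 on success, -1 for anonymous or special mappings such as
 * "[stack]", for malformed lines, or if `out` is too small.
 */
int maps_line_path(const char *line, char *out, size_t len) {
  int pos = -1;
  size_t n;

  /* Skip the five leading fields; %n records where the pathname starts. */
  if (sscanf(line, "%*[0-9a-f]-%*[0-9a-f] %*s %*[0-9a-f] %*s %*[0-9] %n",
             &pos) < 0)
    return -1;
  if (pos < 0 || line[pos] != '/') return -1;

  n = strcspn(line + pos, "\n");
  if (n >= len) return -1;
  memcpy(out, line + pos, n);
  out[n] = '\0';
  return 0;
}

/* Prefix a path seen by `pid` so that it resolves from the tracer's side. */
int root_anchored_path(pid_t pid, const char *path, char *out, size_t len) {
  int n = snprintf(out, len, "/proc/%d/root%s", (int)pid, path);
  return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The second alternative keeps the path as it appears in the maps file and instead switches the tracer into the target's mount namespace around each file access.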

@vmg
Contributor Author

vmg commented Jun 28, 2017

Alright, so I've managed to manually verify that processes inside Docker containers cannot be traced using the uprobe kernel tooling.

Reproduction steps:

A binary usdt is compiled from the following source code:

#include <unistd.h>
#include <stdio.h>
#include <folly/tracing/StaticTracepoint.h>

int main() {
  char s[100];
  int i, a = 20, b = 40;
  for (i = 0; i < 100; i++) s[i] = (i & 7) + (i & 6);

  fprintf(stderr, "Running: %d\n", (int)getpid());

  while (1) {
    FOLLY_SDT(test, probe_point_1, s[7], b);
    FOLLY_SDT(test, probe_point_3, a, b);
    sleep(3);
    a++; b++;
    FOLLY_SDT(test, probe_point_1, s[4], a);
    FOLLY_SDT(test, probe_point_2, 5, s[10]);
    FOLLY_SDT(test, probe_point_3, s[4], s[7]);
  }
  return 1;
}

It has several tracepoints, all statically defined, no semaphores:

$ readelf -n ./usdt

Displaying notes found in: .note.ABI-tag
  Owner                 Data size       Description
  GNU                  0x00000010       NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.32

Displaying notes found in: .note.gnu.build-id
  Owner                 Data size       Description
  GNU                  0x00000014       NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: e2daa96a33480f2ee31a4ede48186aa82d861621

Displaying notes found in: .note.stapsdt
  Owner                 Data size       Description
  stapsdt              0x00000040       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: probe_point_1
    Location: 0x00000000000007d9, Base: 0x0000000000000000, Semaphore: 0x0000000000000000
    Arguments: -1@%al -4@-116(%rbp)
  stapsdt              0x00000047       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: probe_point_3
    Location: 0x00000000000007da, Base: 0x0000000000000000, Semaphore: 0x0000000000000000
    Arguments: -4@-120(%rbp) -4@-116(%rbp)
  stapsdt              0x00000040       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: probe_point_1
    Location: 0x00000000000007f1, Base: 0x0000000000000000, Semaphore: 0x0000000000000000
    Arguments: -1@%al -4@-120(%rbp)
  stapsdt              0x00000038       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: probe_point_2
    Location: 0x00000000000007f6, Base: 0x0000000000000000, Semaphore: 0x0000000000000000
    Arguments: -4@$5 -1@%al
  stapsdt              0x00000039       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: probe_point_3
    Location: 0x00000000000007ff, Base: 0x0000000000000000, Semaphore: 0x0000000000000000
    Arguments: -1@%al -1@%dl
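
For reference, the descriptor of each NT_STAPSDT note shown above is laid out as three addresses (location, base, semaphore) followed by three NUL-terminated strings (provider, name, argument string). Here is a minimal sketch of decoding one descriptor, assuming a 64-bit ELF file with the same byte order as the host. struct sdt_note and parse_sdt_desc are hypothetical names, not BCC's parser.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct sdt_note {
  uint64_t location;    /* address of the probe site */
  uint64_t base;        /* link-time address of the .stapsdt.base section */
  uint64_t semaphore;   /* 0 when the probe has no semaphore */
  const char *provider; /* the three strings point into the input buffer */
  const char *name;
  const char *args;
};

/*
 * Decode one note descriptor of `len` bytes.
 * Returns 0 on success, -1 if the descriptor is truncated.
 */
int parse_sdt_desc(const unsigned char *desc, size_t len, struct sdt_note *out) {
  uint64_t addrs[3];
  const char **fields[3] = {&out->provider, &out->name, &out->args};
  const char *p, *end;
  int i;

  if (len < sizeof(addrs)) return -1;
  memcpy(addrs, desc, sizeof(addrs)); /* host byte order assumed */
  out->location = addrs[0];
  out->base = addrs[1];
  out->semaphore = addrs[2];

  p = (const char *)desc + sizeof(addrs);
  end = (const char *)desc + len;
  for (i = 0; i < 3; i++) {
    const char *nul = memchr(p, '\0', (size_t)(end - p));
    if (nul == NULL) return -1; /* string runs past the descriptor */
    *fields[i] = p;
    p = nul + 1;
  }
  return 0;
}
```

The sizes line up with the readelf output: for the first probe_point_1 note, 24 bytes of addresses plus "test", "probe_point_1" and "-1@%al -4@-116(%rbp)" with their terminators (5 + 14 + 21 bytes) give the reported data size of 0x40.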

We now run this binary:

$ ./usdt
Running: 99539

And we attempt to manually trace it using the interfaces at /sys/kernel/debug/tracing from a root shell:

root@ubuntu:/sys/kernel/debug/tracing# echo 'p /proc/99539/exe:0x00000000000007da' > uprobe_events
root@ubuntu:/sys/kernel/debug/tracing# echo 1 > events/uprobes/enable
root@ubuntu:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           <...>-99539 [003] d... 518377.843996: p_exe_0x7da: (0x55ac586167da)

Note we're using /proc/$pid/exe to ensure we're tracing the right binary. The tracing works: we can see the traced PID in tracing/trace.
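
The same registration can be scripted. Below is a minimal sketch that builds the definition string used above ("p <path>:<offset>") and appends it to an events file. The function names are hypothetical, the events file path is supplied by the caller, and writing to the real uprobe_events file still requires root.

```c
#include <stdio.h>

/*
 * Format a uprobe definition such as "p /proc/99539/exe:0x7da".
 * Returns the length written, or -1 if `buf` is too small.
 */
int format_uprobe_event(char *buf, size_t len, const char *path,
                        unsigned long offset) {
  int n = snprintf(buf, len, "p %s:0x%lx", path, offset);
  return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Append one definition to `events_file`
 * (e.g. /sys/kernel/debug/tracing/uprobe_events).
 * Returns 0 on success, -1 on any formatting or I/O error.
 */
int register_uprobe(const char *events_file, const char *path,
                    unsigned long offset) {
  char def[512];
  FILE *f;
  int ok;

  if (format_uprobe_event(def, sizeof(def), path, offset) < 0) return -1;
  f = fopen(events_file, "a");
  if (f == NULL) return -1;
  ok = fprintf(f, "%s\n", def) > 0;
  /* The write is flushed on close, so a rejected definition shows up here. */
  return (fclose(f) == 0 && ok) ? 0 : -1;
}
```

Opening the file in append mode adds the probe without clearing the ones already registered, unlike the > redirection used above.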

We now attempt to run the ./usdt binary from inside a Docker image. The Dockerfile is really basic:

FROM ubuntu:17.04
ADD usdt /usr/bin/

We run the image like so:

$ docker run -it --rm usdt-test-image /usr/bin/usdt
Running: 1

Note that the process spawns as PID 1, but we can find it running on the host OS:

$ ps aux | grep usdt
vmg       99600  0.3  0.2 287980 18496 pts/1    Sl+  12:35   0:00 docker run -it --rm usdt-test-image /usr/bin/usdt
root      99655  0.6  0.0   4216   740 pts/7    Ss+  12:35   0:00 /usr/bin/usdt
vmg       99699  0.0  0.0  14248   984 pts/4    S+   12:35   0:00 grep usdt

We now attempt to perform the same tracing steps as earlier:

root@ubuntu:/sys/kernel/debug/tracing# echo 'p /proc/99655/exe:0x00000000000007da' > uprobe_events
root@ubuntu:/sys/kernel/debug/tracing# echo 1 > events/uprobes/enable
root@ubuntu:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |

Note again we're using /proc/$pid/exe to ensure we're tracing the right binary. This time, however, there is no tracing output.

Am I going fucking bananas here? This is very puzzling to me. Has anybody been able to reproduce the issue locally? Is there an issue in my local kernel or Docker version?

@vmg
Contributor Author

vmg commented Jun 28, 2017

Looking at the kernel sources for the trace procfs, I see nothing funky:

http://elixir.free-electrons.com/linux/v4.10/source/kernel/trace/trace_uprobe.c#L442

	filename = argv[1];
	ret = kern_path(filename, LOOKUP_FOLLOW, &path);
	if (ret)
		goto fail_address_parse;

	inode = igrab(d_inode(path.dentry));
	path_put(&path);

	if (!inode || !S_ISREG(inode->i_mode)) {
		ret = -EINVAL;
		goto fail_address_parse;
	}

If the path to the binary was successfully resolved (which it should have been, as the uprobes are showing up in the procfs and can be activated), it keeps a pointer to the inode of the executable, not to the path itself, so that should persist just fine, even if the tracer changes namespaces again. 😡
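
Because the kernel keys the probe on the inode rather than the path, one way to sanity-check the setup from user space is to confirm that two paths (say /proc/$pid/exe and the /proc/$pid/root-anchored path of the binary) resolve to the same regular file. Here is a minimal sketch using stat(2), which follows symlinks much like LOOKUP_FOLLOW. The function names are hypothetical.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * User-space analogue of the checks quoted above: resolve `path`, require a
 * regular file, and report the (device, inode) pair that identifies it.
 * Returns 0 on success or a negative errno value.
 */
int uprobe_target_identity(const char *path, dev_t *dev, ino_t *ino) {
  struct stat st;

  if (stat(path, &st) != 0) return -errno;
  if (!S_ISREG(st.st_mode)) return -EINVAL;
  *dev = st.st_dev;
  *ino = st.st_ino;
  return 0;
}

/* Returns 1 if both paths name the same regular file, 0 if not, <0 on error. */
int same_uprobe_target(const char *a, const char *b) {
  dev_t dev_a, dev_b;
  ino_t ino_a, ino_b;
  int ret;

  ret = uprobe_target_identity(a, &dev_a, &ino_a);
  if (ret != 0) return ret;
  ret = uprobe_target_identity(b, &dev_b, &ino_b);
  if (ret != 0) return ret;
  return dev_a == dev_b && ino_a == ino_b;
}
```

Note that this compares what the tracer's view of the filesystem reports. On a layered filesystem, that may not be the same object the kernel uses when the containerized process maps the file.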

@drzaeus77
Collaborator

I'm playing around with your code, and seeing a different set of problems. In particular, I get an error out of kernel/events/uprobes.c:

int uprobe_register():
...
        if (!inode->i_mapping->a_ops->readpage && !shmem_mapping(inode->i_mapping))
                return -EIO;

What file system are you using? My default docker install chose overlayfs which doesn't have the readpage() handler, hence the error when trying to do perf_event_open.

@drzaeus77
Collaborator

I was able to get a working example where I for instance did the following:

docker run -it --rm -v $(pwd)/usdt:/usr/bin/usdt ubuntu:17.04 /usr/bin/usdt

Attaching a bcc USDT object to the resulting pid gave coherent results. This showed me that the trace attach is able to work across mount namespaces, since there is no /usr/bin/usdt in the root mount ns. However, using the ADD usdt /usr/bin method gives an EIO return from perf_event_open, which is due to the missing readpage a_ops as mentioned in my previous comment. So, my theory is that uprobes won't work with overlayfs. Can you try with btrfs or some other fs?

@vmg
Contributor Author

vmg commented Jun 30, 2017

Ooh, interesting. Thanks for looking into this, @drzaeus77.

Yes, I can also verify that mounting with the -v flag lets tracing work. That's a good first step, and it means this PR's functionality works and should be ready to review/merge. So that's good news right there.

There's still a pretty big issue with files that are added using ADD or simply part of the original image. This is the case for all our container deployments. From what I can test locally:

  • aufs: this is the filesystem driver that Docker uses by default in Ubuntu. It lets us attach probes successfully, but the events do not trigger. We really ought to figure out exactly why, and see if it could potentially be patched in the kernel.

  • overlayfs, overlay2: I've tried switching to these drivers manually and I can reproduce your error. These drivers don't have a readpage callback, so we cannot attach uprobes. :/

  • btrfs: I'm going to test this today. Looking at the kernel sources, the readpage callback is set, so hopefully this one could actually work in practice.
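
The findings so far could be turned into an early warning in a tool: check which filesystem backs the target binary before attaching. Below is a minimal sketch using statfs(2). The enum and function names are hypothetical, the magic numbers are assumptions taken from the kernel's overlayfs definition and the out-of-tree aufs sources, and the classification only reflects what was observed in this thread, which may differ on other kernel versions.

```c
#include <sys/vfs.h>

/* What this thread reported for a given filesystem (hypothetical enum). */
enum fs_uprobe_report {
  FS_NO_REPORT = 0,         /* nothing negative reported here */
  FS_ATTACH_FAILED = 1,     /* overlayfs: attach returned EIO */
  FS_ATTACHED_NO_EVENTS = 2 /* aufs: attach worked, probes never fired */
};

#define SKETCH_OVERLAYFS_MAGIC 0x794c7630UL /* assumed: OVERLAYFS_SUPER_MAGIC */
#define SKETCH_AUFS_MAGIC 0x61756673UL      /* assumed: "aufs" in ASCII */

/* Map a statfs f_type value to the behaviour reported in this thread. */
enum fs_uprobe_report classify_fs_magic(unsigned long magic) {
  switch (magic & 0xffffffffUL) {
  case SKETCH_OVERLAYFS_MAGIC:
    return FS_ATTACH_FAILED;
  case SKETCH_AUFS_MAGIC:
    return FS_ATTACHED_NO_EVENTS;
  default:
    return FS_NO_REPORT;
  }
}

/* Classify the filesystem holding `path`; returns -1 if statfs fails. */
int classify_path(const char *path) {
  struct statfs sfs;

  if (statfs(path, &sfs) != 0) return -1;
  return (int)classify_fs_magic((unsigned long)sfs.f_type);
}
```

The path passed in should be one that resolves to the binary from the tracer's side, for example /proc/<pid>/exe.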

@vmg
Contributor Author

vmg commented Jun 30, 2017

Yey, good news! It seems like the tracing issues are specific to the aufs and overlayfs Docker drivers. I've managed to successfully trace processes using devicemapper, which is the driver we run in production.

I think this is good to review and merge: the PR fixes the issues the USDT code had when loading probes in mount namespaces. I'm going to update the title and body of the PR to accurately describe the changes.

@vmg changed the title from "proc: Access mapped files inside of namespaces" to "usdt: Make namespace aware" on Jun 30, 2017
@drzaeus77 merged commit 96c1b8e into iovisor:master on Jun 30, 2017
@ggaurav10 mentioned this pull request on Jan 8, 2018