Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bcc container improvements #1051

Merged
merged 4 commits into from Mar 30, 2017

Conversation

@kmjohansen
Copy link
Contributor

commented Mar 17, 2017

The following commits add functionality to make bcc tracing of containers work a bit better.

  • Resolve symbols in a container where the target process resides in a different mount namespace than the process running bcc.
  • Resolve perf-map files either in the host container using the host's view of the target pid, or in the target container using its pid namespace's identifier
  • Enable uprobes and USDT probes where the target process' files reside in a different mount namespace from the host.

The uprobe and USDT probe functionality only currently knows how to determine the target's namespcae if a pid is given to bpf_attach_uprobe. A future improvement would be to enhance this code path to let the tracing process specify a fd or pid from which namespace information should be gathered in situations where we want to trace all processes triggering a uprobe in a different namespace. If the maintainers have some suggestions about the best way to accomplish this, I'm happy to take a stab at that too.

@4ast

This comment has been minimized.

Copy link
Member

commented Mar 18, 2017

[buildbot, ok to test]
@markdrayton @palmtenor pls take a look

@markdrayton

This comment has been minimized.

Copy link
Contributor

commented Mar 18, 2017

Thanks for the PR! Making BCC work in container environments is great.

I'm a little confused about the second commit:

Allow perf-map files to exist either in the container under the pid
that's specific to the container's pid namespace, or in the host
container using the pid that's specific to the initial pid namespace.

I don't quite understand this — how do these changes allow us to find the map under either PID? It's very possible I'm just missing something about the mount namespace handling here.

Does (or can) any of this supersede the existing code to find maps inside different PID namespaces/chroots? (e.g. bcc_perf_map_nstgid and bcc_perf_map_path)

More generally the code looks fine bar a few nits but it would really benefit from some unit tests. I patched this locally and did some very basic testing but would feel more comfortable with unit tests.

@kmjohansen

This comment has been minimized.

Copy link
Contributor Author

commented Mar 22, 2017

Thanks for taking a look @markdrayton . I appreciate the feedback.

Regarding the question about the 2nd commit, I may have done a poor job of explaining what's going on there. That commit attempts to address the case where you have a process running in a different pid and mount namespace from the process where you're running the trace. In that situation, a container may have /tmp mounted on a tmpfs that's separate from the /tmp where you're running bcc. Without that 2nd commit, the code will always look in the target's /tmp using the target's pid. So if the target is pid 2 in its namespace, but 4037 in bcc's namespace, it will enter the target's mount namespace and look for /tmp/perf-2.map. If you managed to put the perf map in the /tmp that's accessible to bcc without changing mount namespaces, it won't look there by default. The 2nd patch checks /tmp/perf-2.map in the target's mountns, and then also checks for /tmp/perf-4037.map in the mountns that bcc is using. Does that explanation make more sense?

To the question about bcc_perf_map_nstgid and bcc_perf_map_path: I think these functions are complimentary, but it's also possible that I don't completely grasp their intent. The code I added still relies upon bcc_perf_map_nstgid in order to determine the pid of the perf map in the target's pid namespace. I'm less clear about bcc_perf_map_path. I don't think it's harmful, since it would address the case where somebody put a process in a container and then chroot'd it, which might be overkill but it still permitted.

Finally, about unit tests: I'll take a stab at putting something together and will update the pull request. Most namespace stuff requires at least CAP_SYS_ADMIN in order to make modifications. Do the tests expect to be running with elevated privileges, or is there something additional I need to do to make that happen?

@goldshtn

This comment has been minimized.

Copy link
Collaborator

commented Mar 22, 2017

The tests should already run with elevated privileges, most of them anyway.

@markdrayton

This comment has been minimized.

Copy link
Contributor

commented Mar 22, 2017

Thanks for the explanation. It makes more sense now. I need to review the code a bit more but I'm pushed for time at the moment — will try to do so after the tests are ready.

Also, I just realised I forgot to send the inline comments I made. I didn't realise they are submitted independently of regular comments. I'll hit go on that after this.

is_so_ = bcc_elf_is_shared_obj(name) == 1;
if (ns_switch)

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 22, 2017

Contributor

nit: errant space here :-)

if (bcc_elf_loadaddr(sym->module, &load_addr) < 0) {
sym->module = NULL;
if (ns_switch)
bcc_procutils_exit_mountns(&nsc);

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 22, 2017

Contributor

It'd be nice to reduce the number of bcc_procutils_exit_mountns call sites. How about:

  ...
  bool success = true;
  ns_switch = bcc_procutils_enter_mountns(pid, &nsc);

  if (bcc_elf_loadaddr(sym->module, &load_addr) < 0) {
    sym->module = NULL;
    success = false;
    goto exitns;
  }

  sym->name = symname;
  sym->offset = addr;

  if (sym->name && sym->offset == 0x0)
    if (bcc_find_symbol_addr(sym) < 0)
      success = false;

exitns:
  if (ns_switch)
    bcc_procutils_exit_mountns(&nsc);

  if (!success || sym->offset == 0x0)
    return -1;

  sym->offset = (sym->offset - load_addr);
  return 0;
  ...
it = ps->modules_.insert(ps->modules_.end(), Module(new_modname,
ps->pid_, false));
it->ranges_.push_back(ProcSyms::Module::Range(start, end));
return 0;

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 22, 2017

Contributor

Missing a space.

This comment has been minimized.

Copy link
@kmjohansen

kmjohansen Mar 29, 2017

Author Contributor

Sorry, struggling to see this one. Which line is problematic? When I pull this up in my editor, it looks aligned to me.


if (arc != 0) {
snprintf(new_modname, sizeof (new_modname), "/tmp/perf-%d.map",
ps->pid_);

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 22, 2017

Contributor

3 spaces :-)

if (it == ps->modules_.end()) {
// If modname references a perf-map, determine if we need to enter a mount
// namespace in order to read symbols from it later.
if (strstr(modname, ".map") != nullptr) {

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 22, 2017

Contributor

Can this check be made tighter? /\.map/ could give false positives fairly easily.

if (write(kfd, buf, strlen(buf)) < 0) {
if (errno == EINVAL)
fprintf(stderr, "check dmesg output for possible cause\n");
if (ns_switch)
bcc_procutils_exit_mountns(&nsc);

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 22, 2017

Contributor

Possible to deduplicate some of this code?

@kmjohansen kmjohansen force-pushed the kmjohansen:container branch from c962f6a to 105aa65 Mar 26, 2017

@kmjohansen

This comment has been minimized.

Copy link
Contributor Author

commented Mar 26, 2017

Sorry this took longer to get to than I had hoped. I've updated the original three patches with the cleanups that @markdrayton requested. I tried to tidy things up a bit more while I was there too. This latest update also includes a patch with unit tests. I'm not enamored with them, but they work and they fail without the leading three patches present too.

@markdrayton

This comment has been minimized.

Copy link
Contributor

commented Mar 27, 2017

Thanks @kmjohansen! I should be able to look at this tomorrow during a flight.

@goldshtn

This comment has been minimized.

Copy link
Collaborator

commented Mar 27, 2017

@kmjohansen There are two test failures on the Ubuntu buildbot. The first one is actually quite interesting because it looks like a USDT parsing error in libz, I'll try to take a look. The other one seems to be related to your new tests. Can you please review?

@goldshtn

This comment has been minimized.

Copy link
Collaborator

commented Mar 27, 2017

Sorry, scratch that -- it actually looks like /lib64/libz.so.1 doesn't exist. Indeed, looking at a clean Xenial install, I have libz.so.1 in /lib/x86_64-linux-gnu.

@markdrayton
Copy link
Contributor

left a comment

Thanks for the tests and improvements!

Is this change now intending to cope with any modules (not just perf maps) inside a mount namespace? I see there are bcc_procutils_enter/exit_mountns calls surrounding bcc_elf_is_shared_obj in the Module constructor now.

Overall the change seems to do the right thing but I wonder if we can consolidate the perf map logic a bit more. Specifically, bcc_perf_map.c deals with exposing the (possibly chrooted and possibly in a different PID namespace) target's perf map to a BCC process that is outside of that environment. Your change adds some more perf map code into ProcSyms::_add_module, and also adds another mapping for a perf map that will be used if the first one (which is in the BCC mount namespace) fails. I think this could be cleaner than it is now. I will spend a little time seeing if we can make this neater today.

}

// Copy a shared library from shared mntns to private /tmp
in_fd = open("/lib64/libz.so.1", O_RDONLY);

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 28, 2017

Contributor

Can we figure out where libz is from reading /proc/self/maps?

if (it == ps->modules_.end()) {
// If modname references a perf-map, determine if we need to enter a mount
// namespace in order to read symbols from it later.
if (strstr(modname, ".map") != nullptr) {

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 28, 2017

Contributor

is_perf_map()?

This comment has been minimized.

Copy link
@kmjohansen

kmjohansen Mar 29, 2017

Author Contributor

I would have, except that is_perf_map() is specific to the Module class. Since this is constructing the module, it's not yet available. I can add a separate helper function, if you think that would be useful.

bcc_elf_foreach_sym(name_.c_str(), _add_symbol, this);
}

bcc_procutils_exit_mountns(&nsc);

This comment has been minimized.

Copy link
@markdrayton

markdrayton Mar 28, 2017

Contributor

Can this just go in the if (in_ns_) block as we only have one case in which we need to enter and exit the namespace?

This comment has been minimized.

Copy link
@kmjohansen

kmjohansen Mar 29, 2017

Author Contributor

Ah, sorry. Two different things going on here. In the case of a perf-map, we want to try to look in the mount namespace if we've already determined that a perf.map is there. If not, we'll double-check on the host container. For standard ELF symbols, this code always attempts to look in the target namespace.

kmjohansen added some commits Mar 26, 2017

Bcc should look at mountns during symbol resolution.
Allow bcc to resolve symbols in processes that have mappings in a
different mount namespace.  This allows us to obtain stack traces from
the host when our target resides in a container.  With this change it's
possible to get stacks from targets that used to show up as unknown.
When searching for perf-map files look in container, and then host.
Allow perf-map files to exist either in the container under the pid
that's specific to the container's pid namespace, or in the host
container using the pid that's specific to the initial pid namespace.
This lets us store the perf-map either in the continer or on the host,
depending upon which is easier for the person performing the debugging.
Allow bcc to place uprobes and USDT probes in containers.
The uprobe/usdt mechanism uses the target's inode in order to determine
where to place the probe.  The inode lookup occurs at the time the file
path is written to uprobe_events.  If bpf_attach_uprobe() has been
passed a pid, and that pid is in a different mount namespace from the
caller, attempt to switch to the victim's mount namespace so that we can
select the correct inode for the probe.

@kmjohansen kmjohansen force-pushed the kmjohansen:container branch from 105aa65 to ed223be Mar 29, 2017

@kmjohansen

This comment has been minimized.

Copy link
Contributor Author

commented Mar 29, 2017

Thanks @markdrayton and @goldshtn for the feedback thus far.

I've uploaded a new version of the unit tests that should do a better job of attempting to locate libz.so.1. It asks ld.so to do the dirty work instead of trying to guess about the location.

To the question about non perf-map modules in a namespace: yes. One of the primary motivations was to be able to get stack traces of a nginx process that was running in a container. This change should work for both java and standard ELF binaries that have symbols (or debug information) available in a container.

As far as consolidating the code that looks for perf maps in different places, I'm happy to make further changes to tidy this up. I agree that it's not as sleek as perhaps it could be, but I was at a bit of a loss as to the best way to improve it. I'll respond to the specific outstanding code review comments above.

@goldshtn

This comment has been minimized.

Copy link
Collaborator

commented Mar 30, 2017

@markdrayton One more look before we merge this?

@markdrayton

This comment has been minimized.

Copy link
Contributor

commented Mar 30, 2017

Yep, was just looking at this now :-) I think it's more or less good to go at this point though — like @kmjohansen says we can probably consolidate things a bit more but I don't have specific recommendations now. We have good test coverage so refactoring down the road should be straightforward. I'm going to give it a last test in our environment just to make sure the perf map stuff we use works and, assuming that's fine, merging sounds good. Will report back soon!

@markdrayton

This comment has been minimized.

Copy link
Contributor

commented Mar 30, 2017

Yep, seems to work fine. I tested on processes in the same PID/mount namespace, different namespaces, and those in chroots. Thanks for the work @kmjohansen!

@goldshtn goldshtn merged commit 4b87af0 into iovisor:master Mar 30, 2017

1 check passed

default
Details
@kmjohansen

This comment has been minimized.

Copy link
Contributor Author

commented Apr 1, 2017

Totally welcome. Thanks @markdrayton and @goldshtn for the review and for being willing to pull the change. I'm a long time DTrace user and I'm really glad to have this project available for Linux. Thanks for all the work you're doing to move this forward.

@kmjohansen kmjohansen deleted the kmjohansen:container branch Apr 1, 2017

bcc_procutils_exit_mountns(&nsc);

if (arc != 0) {
snprintf(new_modname, sizeof (new_modname), "/tmp/perf-%d.map",

This comment has been minimized.

Copy link
@palmtenor

palmtenor Apr 27, 2017

Member

@kmjohansen: Quick question here. Is it possible that even in the new mount namespace, we should also use NS-PID instead of just ps->pid_?

Also, I don't get when would this part of logic actually be hit. i.e., access to modname fails while access /tmp/perf-PID.map works. In fact, to my understanding chroot can also happen in combine with mount namespace, so modname should actually be better since it is the absolute path taking the Process's root into consideration.

This comment has been minimized.

Copy link
@kmjohansen

kmjohansen May 12, 2017

Author Contributor

Sorry for the slow reply. Not 100% sure I understand the question. I believe this code was attempting to catch the case where the perf map was in the host system instead of a container, but the process that it referred to was inside a different PID namespace. In that case, we were attempting to resolve the perf map relative to the primordial mount namespace.

To make things more confusing, there's no canonical definition of what has to be in a linux container, so it's possible to get a process that's in a different mount namespace but shares a pid namespace with the host. Similarly, it's also possible that a process could be in a different pid namespace but share a mount namespace. Or they both could be disjoint. You're correct that chroot isn't out of scope after entering a mountns, but it is awfully redundant for a process to create a full chroot inside a container, which if properly scoped should be its own chroot on steroids. If you really wanted to be thorough, you could attempt to look up the file relative to the mount namespace, and then resolved against any chroot applied by the process after it entered the mount namespace. But again, I'm not quite sure I understand what's being asked.

The code above is just trying to locate a perf-map using a PID that would resolve on the host system.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.