tools: add filtering by mount namespace
In previous patches, I added the option --cgroupmap to filter events
belonging to a set of cgroup-v2. Although this approach works fine with
systemd services and containers when cgroup-v2 is enabled, it does not
work with containers when only cgroup-v1 is enabled because
bpf_get_current_cgroup_id() only works with cgroup-v2. It also requires
Linux 4.18 to get this bpf helper function.

This patch adds an additional way to filter by containers, using mount
namespaces.

Note that this does not help with systemd services since they normally
don't create a new mount namespace (unless you set some options like
'ReadOnlyPaths=', see "man 5 systemd.exec").

My goal with this patch is to filter Kubernetes pods, even on
distributions with an older kernel (<4.18) or without cgroup-v2 enabled.

- This is only implemented for tools that already support filtering by
  cgroup id (bindsnoop, capable, execsnoop, profile, tcpaccept, tcpconnect,
  tcptop and tcptracer).

- I picked the mount namespace because the other namespaces could be
  disabled in Kubernetes (e.g. HostNetwork, HostPID, HostIPC).

It can be tested by following the example in docs/special_filtering.md added
in this commit. To avoid compiling locally, the following commands can be used:

```
sudo bpftool map create /sys/fs/bpf/mnt_ns_set type hash key 8 value 4 \
  entries 128 name mnt_ns_set flags 0
docker run -ti --rm --privileged \
  -v /usr/src:/usr/src -v /lib/modules:/lib/modules \
  -v /sys/fs/bpf:/sys/fs/bpf --pid=host kinvolk/bcc:alban-containers-filters \
  /usr/share/bcc/tools/execsnoop --mntnsmap /sys/fs/bpf/mnt_ns_set

```

Co-authored-by: Alban Crequy <alban@kinvolk.io>
Co-authored-by: Mauricio Vásquez <mauricio@kinvolk.io>
2 people authored and yonghong-song committed May 22, 2020
1 parent 104a5b8 commit 32ab858
Showing 29 changed files with 322 additions and 219 deletions.
59 changes: 58 additions & 1 deletion docs/filtering_by_cgroups.md → docs/special_filtering.md
@@ -1,4 +1,10 @@
# Demonstrations of filtering by cgroups
# Special Filtering

Some tools have special filtering capabilities. The main use case is to trace
processes running in containers, but those mechanisms are generic and could
be used in other cases as well.

## Filtering by cgroups

Some tools have an option to filter by cgroup by referencing a pinned BPF hash
map managed externally.
@@ -66,3 +72,54 @@ map, bcc tools will display results from this shell. Cgroups can be added and
removed from the BPF hash map without restarting the bcc tool.

This feature is useful for integrating bcc tools in external projects.

## Filtering by mount namespace

The BPF hash map can be created by:

```
# bpftool map create /sys/fs/bpf/mnt_ns_set type hash key 8 value 4 entries 128 \
name mnt_ns_set flags 0
```
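The same map could also be created from a script. The following is a minimal sketch, assuming `bpftool` is installed and the caller has root privileges with bpffs mounted at `/sys/fs/bpf`; the helper names `bpftool_create_args` and `create_mntns_map` are hypothetical and not part of bcc. The sizes mirror the command above: an 8-byte key (the mount namespace ID) and a 4-byte value.

```python
import subprocess


def bpftool_create_args(pin_path, name, entries=128):
    """Build the argv for creating a pinned BPF hash map.

    Hypothetical helper: it only assembles the same arguments as the
    bpftool command shown above (8-byte key, 4-byte value).
    """
    return [
        "bpftool", "map", "create", pin_path,
        "type", "hash", "key", "8", "value", "4",
        "entries", str(entries), "name", name, "flags", "0",
    ]


def create_mntns_map(pin_path="/sys/fs/bpf/mnt_ns_set", name="mnt_ns_set"):
    """Run bpftool to create the map; raises CalledProcessError on failure."""
    subprocess.run(bpftool_create_args(pin_path, name), check=True)
```

Calling `create_mntns_map()` would run the equivalent of the shell command; building the argument list separately keeps the command easy to inspect before it is executed.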

Execute the `execsnoop` tool filtering only the mount namespaces
in `/sys/fs/bpf/mnt_ns_set`:

```
# tools/execsnoop.py --mntnsmap /sys/fs/bpf/mnt_ns_set
```

Start a terminal in a new mount namespace:

```
# unshare -m bash
```

Update the hash map with the mount namespace ID of the terminal above:

```
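FILE=/sys/fs/bpf/mnt_ns_set
# The map key is a u64 in host byte order: reverse the bytes on little-endian hosts.
if [ $(printf '\1' | od -dAn) -eq 1 ]; then
  HOST_ENDIAN_CMD=tac
else
  HOST_ENDIAN_CMD=cat
fi
# Mount namespace ID (inode number of /proc/self/ns/mnt) as 8 hex bytes, one per line.
NS_ID_HEX="$(printf '%016x' $(stat -Lc '%i' /proc/self/ns/mnt) | sed 's/.\{2\}/&\n/g' | $HOST_ENDIAN_CMD)"
# $NS_ID_HEX is left unquoted so that each byte becomes a separate argument.
bpftool map update pinned $FILE key hex $NS_ID_HEX value hex 00 00 00 00 any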
```
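The key computation in the shell snippet above can be sketched in Python as well. This is an illustration only, assuming a Linux `/proc` filesystem; the function names `mntns_id` and `key_hex_bytes` are hypothetical. The mount namespace ID is the inode number of `/proc/<pid>/ns/mnt`, and the key passed to `bpftool` is that number as 8 bytes in host byte order.

```python
import os
import sys


def mntns_id(pid="self"):
    """Return the mount namespace ID (inode number) of a process."""
    return os.stat(f"/proc/{pid}/ns/mnt").st_ino


def key_hex_bytes(ns_id, byteorder=sys.byteorder):
    """Format a namespace ID as 8 hex bytes, in host byte order by default."""
    return [f"{b:02x}" for b in ns_id.to_bytes(8, byteorder)]
```

Joining the result of `key_hex_bytes(mntns_id())` with spaces gives the byte list to place after `key hex` in the `bpftool map update` command.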

Execute a command in this terminal:

```
# ping kinvolk.io
```

In the `execsnoop` terminal started above, the call is logged:

```
# tools/execsnoop.py --mntnsmap /sys/fs/bpf/mnt_ns_set
[sudo] password for mvb:
PCOMM PID PPID RET ARGS
ping 8096 7970 0 /bin/ping kinvolk.io
```
7 changes: 6 additions & 1 deletion man/man8/bindsnoop.8
@@ -2,7 +2,7 @@
.SH NAME
bindsnoop \- Trace bind() system calls.
.SH SYNOPSIS
.B bindsnoop.py [\fB-h\fP] [\fB-w\fP] [\fB-t\fP] [\fB-p\fP PID] [\fB-P\fP PORT] [\fB-E\fP] [\fB-U\fP] [\fB-u\fP UID] [\fB--count\fP] [\fB--cgroupmap MAP\fP]
.B bindsnoop.py [\fB-h\fP] [\fB-w\fP] [\fB-t\fP] [\fB-p\fP PID] [\fB-P\fP PORT] [\fB-E\fP] [\fB-U\fP] [\fB-u\fP UID] [\fB--count\fP] [\fB--cgroupmap MAP\fP] [\fB--mntnsmap MNTNSMAP\fP]
.SH DESCRIPTION
bindsnoop reports socket options set before the bind call that would impact this system call behavior.
.PP
@@ -42,6 +42,11 @@ Trace cgroups in this BPF map:
.B
\fB--cgroupmap\fP MAP
.TP
Trace mount namespaces in this BPF map:
.TP
.B
\fB--mntnsmap\fP MNTNSMAP
.TP
Include errors in the output:
.TP
.B
7 changes: 5 additions & 2 deletions man/man8/capable.8
@@ -3,7 +3,7 @@
capable \- Trace security capability checks (cap_capable()).
.SH SYNOPSIS
.B capable [\-h] [\-v] [\-p PID] [\-K] [\-U] [\-x] [\-\-cgroupmap MAPPATH]
[--unique]
[\-\-mntnsmap MAPPATH] [--unique]
.SH DESCRIPTION
This traces security capability checks in the kernel, and prints details for
each call. This can be useful for general debugging, and also security
@@ -33,6 +33,9 @@ Show extra fields in TID and INSETID columns.
\-\-cgroupmap MAPPATH
Trace cgroups in this BPF map only (filtered in-kernel).
.TP
\-\-mntnsmap MAPPATH
Trace mount namespaces in this BPF map only (filtered in-kernel).
.TP
\-\-unique
Don't repeat stacks for the same PID or cgroup.
.SH EXAMPLES
@@ -45,7 +48,7 @@ Trace capability checks for PID 181:
#
.B capable \-p 181
.TP
Trace capability checks in a set of cgroups only (see filtering_by_cgroups.md
Trace capability checks in a set of cgroups only (see special_filtering.md
from bcc sources for more details):
#
.B capable \-\-cgroupmap /sys/fs/bpf/test01
14 changes: 9 additions & 5 deletions man/man8/execsnoop.8
@@ -2,8 +2,8 @@
.SH NAME
execsnoop \- Trace new processes via exec() syscalls. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B execsnoop [\-h] [\-T] [\-t] [\-x] [\-\-cgroupmap CGROUPMAP] [\-u USER]
.B [\-q] [\-n NAME] [\-l LINE] [\-U] [\-\-max-args MAX_ARGS]
.B execsnoop [\-h] [\-T] [\-t] [\-x] [\-\-cgroupmap CGROUPMAP] [\-\-mntnsmap MAPPATH]
.B [\-u USER] [\-q] [\-n NAME] [\-l LINE] [\-U] [\-\-max-args MAX_ARGS]
.SH DESCRIPTION
execsnoop traces new processes, showing the filename executed and argument
list.
@@ -42,7 +42,7 @@ Include failed exec()s
.TP
\-q
Add "quotemarks" around arguments. Escape quotemarks in arguments with a
backslash. For tracing empty arguments or arguments that contain whitespace.
backslash. For tracing empty arguments or arguments that contain whitespace.
.TP
\-n NAME
Only print command lines matching this name (regex)
@@ -55,6 +55,10 @@ Maximum number of arguments parsed and displayed, defaults to 20
.TP
\-\-cgroupmap MAPPATH
Trace cgroups in this BPF map only (filtered in-kernel).
.TP
\-\-mntnsmap MAPPATH
Trace mount namespaces in this BPF map only (filtered in-kernel).
.TP
.SH EXAMPLES
.TP
Trace all exec() syscalls:
@@ -81,7 +85,7 @@ Include failed exec()s:
#
.B execsnoop \-x
.TP
Put quotemarks around arguments.
Put quotemarks around arguments.
#
.B execsnoop \-q
.TP
@@ -93,7 +97,7 @@ Only trace exec()s where argument's line contains "testpkg":
#
.B execsnoop \-l testpkg
.TP
Trace a set of cgroups only (see filtering_by_cgroups.md from bcc sources for more details):
Trace a set of cgroups only (see special_filtering.md from bcc sources for more details):
#
.B execsnoop \-\-cgroupmap /sys/fs/bpf/test01
.SH FIELDS
7 changes: 5 additions & 2 deletions man/man8/opensnoop.8
@@ -4,7 +4,7 @@ opensnoop \- Trace open() syscalls. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B opensnoop.py [\-h] [\-T] [\-U] [\-x] [\-p PID] [\-t TID] [\-u UID]
[\-d DURATION] [\-n NAME] [\-e] [\-f FLAG_FILTER]
[--cgroupmap MAPPATH]
[--cgroupmap MAPPATH] [--mntnsmap MAPPATH]
.SH DESCRIPTION
opensnoop traces the open() syscall, showing which processes are attempting
to open which files. This can be useful for determining the location of config
@@ -58,6 +58,9 @@ Filter on open() flags, e.g., O_WRONLY.
.TP
\--cgroupmap MAPPATH
Trace cgroups in this BPF map only (filtered in-kernel).
.TP
\--mntnsmap MAPPATH
Trace mount namespaces in this BPF map only (filtered in-kernel).
.SH EXAMPLES
.TP
Trace all open() syscalls:
@@ -100,7 +103,7 @@ Only print calls for writing:
#
.B opensnoop \-f O_WRONLY \-f O_RDWR
.TP
Trace a set of cgroups only (see filtering_by_cgroups.md from bcc sources for more details):
Trace a set of cgroups only (see special_filtering.md from bcc sources for more details):
#
.B opensnoop \-\-cgroupmap /sys/fs/bpf/test01
.SH FIELDS
4 changes: 2 additions & 2 deletions man/man8/profile.8
@@ -3,7 +3,7 @@
profile \- Profile CPU usage by sampling stack traces. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B profile [\-adfh] [\-p PID | \-L TID] [\-U | \-K] [\-F FREQUENCY | \-c COUNT]
.B [\-\-stack\-storage\-size COUNT] [\-\-cgroupmap CGROUPMAP] [duration]
.B [\-\-stack\-storage\-size COUNT] [\-\-cgroupmap CGROUPMAP] [\-\-mntnsmap MAPPATH] [duration]
.SH DESCRIPTION
This is a CPU profiler. It works by taking samples of stack traces at timed
intervals. It will help you understand and quantify CPU usage: which code is
@@ -101,7 +101,7 @@ Profile kernel stacks only:
#
.B profile -K
.TP
Profile a set of cgroups only (see filtering_by_cgroups.md from bcc sources for more details):
Profile a set of cgroups only (see special_filtering.md from bcc sources for more details):
#
.B profile \-\-cgroupmap /sys/fs/bpf/test01
.SH DEBUGGING
7 changes: 5 additions & 2 deletions man/man8/tcpaccept.8
@@ -2,7 +2,7 @@
.SH NAME
tcpaccept \- Trace TCP passive connections (accept()). Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B tcpaccept [\-h] [\-T] [\-t] [\-p PID] [\-P PORTS] [\-\-cgroupmap MAPPATH]
.B tcpaccept [\-h] [\-T] [\-t] [\-p PID] [\-P PORTS] [\-\-cgroupmap MAPPATH] [\-\-mntnsmap MAPPATH]
.SH DESCRIPTION
This tool traces passive TCP connections (eg, via an accept() syscall;
connect() are active connections). This can be useful for general
@@ -36,6 +36,9 @@ Comma-separated list of local ports to trace (filtered in-kernel).
.TP
\-\-cgroupmap MAPPATH
Trace cgroups in this BPF map only (filtered in-kernel).
.TP
\-\-mntnsmap MAPPATH
Trace mount namespaces in this BPF map only (filtered in-kernel).
.SH EXAMPLES
.TP
Trace all passive TCP connections (accept()s):
@@ -54,7 +57,7 @@ Trace PID 181 only:
#
.B tcpaccept \-p 181
.TP
Trace a set of cgroups only (see filtering_by_cgroups.md from bcc sources for more details):
Trace a set of cgroups only (see special_filtering.md from bcc sources for more details):
#
.B tcpaccept \-\-cgroupmap /sys/fs/bpf/test01
.SH FIELDS
8 changes: 6 additions & 2 deletions man/man8/tcpconnect.8
@@ -2,7 +2,7 @@
.SH NAME
tcpconnect \- Trace TCP active connections (connect()). Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B tcpconnect [\-h] [\-c] [\-t] [\-x] [\-p PID] [-P PORT] [\-\-cgroupmap MAPPATH]
.B tcpconnect [\-h] [\-c] [\-t] [\-x] [\-p PID] [-P PORT] [\-\-cgroupmap MAPPATH] [\-\-mntnsmap MAPPATH]
.SH DESCRIPTION
This tool traces active TCP connections (eg, via a connect() syscall;
accept() are passive connections). This can be useful for general
@@ -72,9 +72,13 @@ Count connects per src ip and dest ip/port:
#
.B tcpconnect \-c
.TP
Trace a set of cgroups only (see filtering_by_cgroups.md from bcc sources for more details):
Trace a set of cgroups only (see special_filtering.md from bcc sources for more details):
#
.B tcpconnect \-\-cgroupmap /sys/fs/bpf/test01
.TP
Trace a set of mount namespaces only (see special_filtering.md from bcc sources for more details):
#
.B tcpconnect \-\-mntnsmap /sys/fs/bpf/mnt_ns_set
.SH FIELDS
.TP
TIME(s)
7 changes: 5 additions & 2 deletions man/man8/tcptop.8
@@ -3,7 +3,7 @@
tcptop \- Summarize TCP send/recv throughput by host. Top for TCP.
.SH SYNOPSIS
.B tcptop [\-h] [\-C] [\-S] [\-p PID] [\-\-cgroupmap MAPPATH]
[interval] [count]
[\-\-mntnsmap MAPPATH] [interval] [count]
.SH DESCRIPTION
This is top for TCP sessions.

@@ -39,6 +39,9 @@ Trace this PID only.
\-\-cgroupmap MAPPATH
Trace cgroups in this BPF map only (filtered in-kernel).
.TP
\-\-mntnsmap MAPPATH
Trace mount namespaces in this BPF map only (filtered in-kernel).
.TP
interval
Interval between updates, seconds (default 1).
.TP
@@ -58,7 +61,7 @@ Trace PID 181 only, and don't clear the screen:
#
.B tcptop \-Cp 181
.TP
Trace a set of cgroups only (see filtering_by_cgroups.md from bcc sources for more details):
Trace a set of cgroups only (see special_filtering.md from bcc sources for more details):
#
.B tcptop \-\-cgroupmap /sys/fs/bpf/test01
.SH FIELDS
7 changes: 5 additions & 2 deletions man/man8/tcptracer.8
@@ -2,7 +2,7 @@
.SH NAME
tcptracer \- Trace TCP established connections. Uses Linux eBPF/bcc.
.SH SYNOPSIS
.B tcptracer [\-h] [\-v] [\-p PID] [\-N NETNS] [\-\-cgroupmap MAPPATH]
.B tcptracer [\-h] [\-v] [\-p PID] [\-N NETNS] [\-\-cgroupmap MAPPATH] [\-\-mntnsmap MAPPATH]
.SH DESCRIPTION
This tool traces established TCP connections that open and close while tracing,
and prints a line of output per connect, accept and close events. This includes
@@ -31,6 +31,9 @@ Trace this network namespace only (filtered in-kernel).
.TP
\-\-cgroupmap MAPPATH
Trace cgroups in this BPF map only (filtered in-kernel).
.TP
\-\-mntnsmap MAPPATH
Trace mount namespaces in this BPF map only (filtered in-kernel).
.SH EXAMPLES
.TP
Trace all TCP established connections:
@@ -49,7 +52,7 @@ Trace connections in network namespace 4026531969 only:
#
.B tcptracer \-N 4026531969
.TP
Trace a set of cgroups only (see filtering_by_cgroups.md from bcc sources for more details):
Trace a set of cgroups only (see special_filtering.md from bcc sources for more details):
#
.B tcptracer \-\-cgroupmap /sys/fs/bpf/test01
.SH FIELDS
80 changes: 80 additions & 0 deletions src/python/bcc/containers.py
@@ -0,0 +1,80 @@
# Copyright 2020 Kinvolk GmbH
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

def _cgroup_filter_func_writer(cgroupmap):
    # Without a map, emit a pass-through filter that never drops events.
    if not cgroupmap:
        return """
    static inline int _cgroup_filter() {
        return 0;
    }
        """

    # With a map, drop events whose cgroup ID is not in the pinned hash map.
    text = """
    BPF_TABLE_PINNED("hash", u64, u64, cgroupset, 1024, "CGROUP_PATH");
    static inline int _cgroup_filter() {
        u64 cgroupid = bpf_get_current_cgroup_id();
        return cgroupset.lookup(&cgroupid) == NULL;
    }
    """

    return text.replace('CGROUP_PATH', cgroupmap)

def _mntns_filter_func_writer(mntnsmap):
    # Without a map, emit a pass-through filter that never drops events.
    if not mntnsmap:
        return """
    static inline int _mntns_filter() {
        return 0;
    }
        """
    # With a map, drop events whose mount namespace ID is not in the pinned hash map.
    text = """
    #include <linux/nsproxy.h>
    #include <linux/mount.h>
    #include <linux/ns_common.h>
    /* see mountsnoop.py:
     * XXX: struct mnt_namespace is defined in fs/mount.h, which is private
     * to the VFS and not installed in any kernel-devel packages. So, let's
     * duplicate the important part of the definition. There are actually
     * more members in the real struct, but we don't need them, and they're
     * more likely to change.
     */
    struct mnt_namespace {
        atomic_t count;
        struct ns_common ns;
    };
    BPF_TABLE_PINNED("hash", u64, u32, mount_ns_set, 1024, "MOUNT_NS_PATH");
    static inline int _mntns_filter() {
        struct task_struct *current_task;
        current_task = (struct task_struct *)bpf_get_current_task();
        u64 ns_id = current_task->nsproxy->mnt_ns->ns.inum;
        return mount_ns_set.lookup(&ns_id) == NULL;
    }
    """

    return text.replace('MOUNT_NS_PATH', mntnsmap)

def filter_by_containers(args):
    # An event is filtered out if either the cgroup or the mount namespace filter rejects it.
    filter_by_containers_text = """
    static inline int container_should_be_filtered() {
        return _cgroup_filter() || _mntns_filter();
    }
    """

    cgroupmap_text = _cgroup_filter_func_writer(args.cgroupmap)
    mntnsmap_text = _mntns_filter_func_writer(args.mntnsmap)

    return cgroupmap_text + mntnsmap_text + filter_by_containers_text
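
The semantics of the generated `container_should_be_filtered()` can be modelled in plain Python to make the combination rule explicit. This is only a sketch of the logic, not code from bcc: the function name `should_be_filtered` and its parameters are hypothetical, and Python sets stand in for the pinned BPF hash maps. A filter that has no map configured never rejects an event, and when both maps are configured an event must be present in both to pass.

```python
def should_be_filtered(cgroup_id, mntns_id, cgroup_set=None, mntns_set=None):
    """Model of _cgroup_filter() || _mntns_filter().

    A set of None models an option that was not passed on the command
    line, in which case that filter always returns 0 (keep the event).
    """
    cgroup_filtered = cgroup_set is not None and cgroup_id not in cgroup_set
    mntns_filtered = mntns_set is not None and mntns_id not in mntns_set
    return cgroup_filtered or mntns_filtered
```

For example, `should_be_filtered(1, 2, cgroup_set={1}, mntns_set={3})` is true: the cgroup matches, but the mount namespace is not in its set, so the event is dropped.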
