Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

privilege escalation via ptrace (CVE-2016-8659) #107

Closed
smcv opened this issue Oct 13, 2016 · 19 comments
Closed

privilege escalation via ptrace (CVE-2016-8659) #107

smcv opened this issue Oct 13, 2016 · 19 comments

Comments

@smcv
Copy link
Collaborator

smcv commented Oct 13, 2016

Sebastian Krahmer reported this to the oss-security mailing list.

There is a beautiful, easy to exploit, logical bug within the
bubblewrap program, thats part of the flatpak container-app framework,
but also used with other container solutions.

/usr/bin/bwrap may be installed mode 04755 or with cap_sys_admin and other
file caps. I dont know if there are any dists already shipping it that way,
but the Makefile and some RedHat spec files contain file caps for it.

bubblewrap's aim is to setup a container and seccomp sandbox for programs to be run
as user.

For some reason it sets the PR_SET_DUMPABLE flag, as seen below. The comment about
it looks strange to me. If thats really true, suid programs shouldn't
be forced to play with the dumpable flag to achieve their goal.

Once the dumpable flag is set, there is a chance we could attach to the process,
once the remaining caps are dropped and the whole process runs as user.

Luckily, that happens at line 1707, right after a PrivSep socket has been opened!

Once attached to the (now running as unprived user) process, we can inject
commands into that socket. We could do arbitrary mounts, but won't achieve much, since
the bwrap process is running in its own mount namespace. However, there is
a sethostname() OP, that we can use to affect the hostname of the entire system
(not restricted to UTS namespace). Now, just wait for root or other users to
login and execute bash to use one of the PS expansion bugs to execute code.

(I wonder that has been re-discovered recently, it was already part of the
CVE-2011-0966 attack vector.)


 383 acquire_caps (void)
 384 {

[...]

 422   /* We need the process to be dumpable, or we can't access /proc/self/uid_map */
 423   if (prctl (PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
 424     die_with_error ("prctl(PR_SET_DUMPABLE) failed");
 425 }


[...]


1422 int
1423 main (int    argc,
1424       char **argv)
1425 {

[...]

1440   /* Get the (optional) capabilities we need, drop root */
1441   acquire_caps ();

[...]

1692   if (is_privileged)
1693     {
1694       pid_t child;
1695       int privsep_sockets[2];
1696
1697       if (socketpair (AF_UNIX, SOCK_SEQPACKET | SOCK_CLOEXEC, 0, privsep_sockets) != 0)
1698         die_with_error ("Can't create privsep socket");
1699
1700       child = fork ();
1701       if (child == -1)
1702         die_with_error ("Can't fork unprivileged helper");
1703
1704       if (child == 0)
1705         {
1706           /* Unprivileged setup process */
1707           drop_caps ();                                                            // BOOM
1708           close (privsep_sockets[0]);
1709           setup_newroot (opt_unshare_pid, privsep_sockets[1]);
1710           exit (0);
1711         }
1712       else
1713         {
1714           uint32_t buffer[2048];  /* 8k, but is int32 to guarantee nice alignment */
1715           uint32_t op, flags;
1716           const char *arg1, *arg2;
1717           cleanup_fd int unpriv_socket = -1;
1718
1719           unpriv_socket = privsep_sockets[0];
1720           close (privsep_sockets[1]);
1721
1722           do
1723             {
1724               op = read_priv_sec_op (unpriv_socket, buffer, sizeof (buffer),
1725                                      &flags, &arg1, &arg2);
1726               privileged_op (-1, op, flags, arg1, arg2);
1727               if (write (unpriv_socket, buffer, 1) != 1)
1728                 die ("Can't write to op_socket");
1729             }
1730           while (op != PRIV_SEP_OP_DONE);
1731
1732           /* Continue post setup */
1733         }
1734     }
@smcv
Copy link
Collaborator Author

smcv commented Oct 13, 2016

As a short-term fix for Debian, I've reverted #94 so there is no way to induce the privileged process to issue the sethostname() syscall, but I suspect a better long-term answer is to isolate the use of PR_SET_DUMPABLE to precisely where it is needed.

See also http://www.openwall.com/lists/oss-security/2016/10/13/4

For some reason it sets the PR_SET_DUMPABLE flag, as seen below. The comment about
it looks strange to me. If thats really true, suid programs shouldn't
be forced to play with the dumpable flag to achieve their goal.

I assume the developers of Bubblewrap wouldn't have done this if the
kernel (or at least a kernel they care about) didn't require it.
But hopefully becoming dumpable can be restricted to smaller sections
of the code.

If it's only for write_uid_gid_map(), then one way would be for that
function to fork(), with the child making itself dumpable and writing
the map files, and the parent just waiting for the child?
...
I believe the intention is that none of the operations that can pass over
the privilege separation socket are problematic, because they only affect
the sandboxed processes, and if you can ptrace bwrap then you can certainly
ptrace and subvert the sandboxed processes too. SETUP_SET_HOSTNAME clearly
doesn't match that intention, if Bubblewrap didn't unshare the UTS namespace
beforehand; Bubblewrap does insist that --hostname can't be used without
--unshare-uts, but you're right that a user outside the sandbox can ptrace
it and bypass that check by making it behave as though --hostname had been
used.

As a quick temporary fix for Debian, I'm reverting the addition of
SETUP_SET_HOSTNAME.

Am I right in thinking that this pseudocode would fix it while reinstating
the feature?

case PRIV_SEP_OP_SET_HOSTNAME:
  if (!opt_unshare_uts)
    die_with_error ("Refusing to set hostname in original namespace");
  else if (!sethostname (... as before))
    die_with_error (... as before);

@alexlarsson
Copy link
Collaborator

The check described above should be fine for that exact exploit, but we should also try to figure out the minimal part that needs DUMPABLE.

alexlarsson added a commit to alexlarsson/bubblewrap that referenced this issue Oct 13, 2016
This is normally verified on argument validation, but it may happen
if someone managed to send custom priv-sep operations via e.g. ptrace.

See containers#107
rh-atomic-bot pushed a commit that referenced this issue Oct 13, 2016
This is normally verified on argument validation, but it may happen
if someone managed to send custom priv-sep operations via e.g. ptrace.

See #107

Closes: #108
Approved by: alexlarsson
@alexlarsson
Copy link
Collaborator

So, for the dumpable, if you remove the code that sets this, then you get:

 bwrap --unshare-user  --bind / / sh
setting up uid map: Permission denied

Because its trying to write to /proc/self/uid_map, which is root owned for a non-dumpable app.

cgwalters added a commit to cgwalters/bubblewrap that referenced this issue Oct 13, 2016
The root cause of
containers#107
aka CVE-2016-8659 is that we were explictly turning on the dumpable
flag, which allows the caller to `ptrace()` us.

In fact, Linux already introduced `setfsuid()` for the NFS server for
a very similar reason; see `man setfsuid`:

```
       At the time when this system call was introduced, one process
       could send a signal to another process with the same effective
       user ID.  This meant that if a privileged process changed its
       effective user ID for the purpose of file permission checking,
       then it could become vulnerable to receiving signals sent by
       another (unprivileged) process with the same user ID.
```

Let's make use of this, which makes us the same as other setuid
binaries, without introducing any additional risk from being
potentially `ptrace()able`.
@cgwalters
Copy link
Collaborator

See #109

@alexlarsson
Copy link
Collaborator

So, here is the issue. When we're using user namespaces we need to have the privileged parent set the uid mapping in the child user namespace, because there is not necessarily a 1-1 mapping. However, the child, as well as the parent are non-dumpable due to the initial setuid bit.

Non-dumpable processes have uid 0 owning /proc/pid/' files though, so the parent (running as the user uid) isn't allowed to set the uid map unless the child is made dumpable.

However, this is only needed in the case when we need user namespaces. If we're not, then being non-dumpable is fine. So, my solution is to only make the child non-dumpable in the case of using user namespaces.

But, i hear you, then we'll be able to ptrace the child in the case of using user namespaces, causing this bug to re-appear! That is not true though, because in a user namespace we lose all capabilities in the parent namespace, so we will not be able to do something affecting the host namespace anyway (i.e. sethostname gets EPERM unless you unshared UTS ns). In fact, there is not much we can do here, because any process in the parent user namespace (i.e. the regular host namespace) have all caps in the user namespace, including CAP_SYS_PTRACE, so we override the value of dumpable anyway.

@alexlarsson
Copy link
Collaborator

Here is my approach instead:
#110

@cgwalters
Copy link
Collaborator

Let's call #109 the "fsuid" patch and #110 the "move-dumpable" patch.

I'd argue that the fsuid patch is a lot shorter, and more obviously correct than move-dumpable. Why is the latter better?

@alexlarsson
Copy link
Collaborator

Because the fsuid patch gives somewhat increases privileges (fsuid 0) which are not needed.

@alexlarsson
Copy link
Collaborator

In particular it is doing file writes as uid 0.

@cgwalters
Copy link
Collaborator

So the risk with fsuid is things like - what if somehow the pid got reused and we happened to write to the uid map for another pid? Another aspect is that suddenly we're trusting /proc to be the real /proc. But the latter is true of all setuid binaries really.

The risk with dumpable is - now all of a sudden we have to be concerned about the fact that someone could ptrace us. And yes, we are in a user namespace...but in fact, your argument:

In fact any process from the parent user
namespace have ptrace access anyway, due to parent ns having
CAP_SYS_PTRACE (and all other caps) in the child ns.

appears to me to be wrong, because the dumpable flag isn't set, and we need that and the capability to ptrace. If I add in to your patch:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5c7999b..06bc029 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1652,11 +1652,14 @@ main (int    argc,

   if (is_privileged && opt_unshare_user)
     {
+      fprintf (stderr, "** BEFORE SETTING DUMPABLE; pid=%u **\n", getpid());
+      sleep (15);
       /* We have to be dumpable for the parent to be able to set the
          uid map for us. This enables ptracing for the child, but that
          is not really a major*/
       if (prctl (PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
         die_with_error ("prctl(PR_SET_DUMPABLE) failed");
+      fprintf (stderr, "** AFTER SETTING DUMPABLE **\n");

       /* Let parent continue to set the uid map */
       val = 1;

And try to strace before setting dumpable, I get EPERM as expected.

So the aspect of being in a user namespace obviously mitigates things a lot...but we need to consider risks such as having an open file descriptor outside of the namespace. I doubt we have anything open that's writable, but we'd have to be careful in the future.

If you're arguing that we can enable dumpable, it follows that everything else after that could be arbitrary user code, and we could do:

--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1728,7 +1728,7 @@ main (int    argc,
   if (chdir ("/") != 0)
     die_with_error ("chdir / (base path)");

-  if (is_privileged)
+  if (is_privileged && !opt_unshare_user)
     {
       pid_t child;
       int privsep_sockets[2];

Or actually, hypothetically:

@@ -1807,7 +1807,8 @@ main (int    argc,
     die_with_error ("chdir /");

   /* Now we have everything we need CAP_SYS_ADMIN for, so drop it */
-  drop_caps ();
+  if (opt_unshare_user && !opt_retain_caps)
+    drop_caps ();

   if (opt_block_fd != -1)
     {

Which I think is where some of the patches in #101 are going. Basically, in the userns path (whether privileged or not) we rely solely on the kernel for security post-CLONE_NEWUSER.

@cgwalters
Copy link
Collaborator

cgwalters commented Oct 13, 2016

Here's a good way to compare the patches - on kernels vulnerable to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3135 (the one we cite in README.md as an example userns vulnerability) - I'm pretty sure your patch exposes the vulnerability, since a caller could inject code to do the iptables syscalls, even though we don't expose that in the binary. But mine does not, because we drop CAP_NET_ADMIN (really all caps) in the namespace before executing the user code.

Basically, your patch makes bubblewrap's security equivalent to that of the full kernel user namespace feature set, whereas mine keeps things restricted to what we expose on the bwrap command line.

(But on the other hand, exposing the full userns feature set is where #101 is going)

@smcv
Copy link
Collaborator Author

smcv commented Oct 13, 2016

Basically, your patch makes bubblewrap's security equivalent to that of the full kernel user namespace feature set, whereas mine keeps things restricted to what we expose on the bwrap command line.
...
(But on the other hand, exposing the full userns feature set is where #101 is going)

This sounds like it's heading towards a sysadmin configuration mechanism for "how much exposure to kernel userns bugs am I willing to tolerate?"...

on kernels vulnerable to https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-3135 (the one we cite in README.md as an example userns vulnerability) - I'm pretty sure your patch exposes the vulnerability, since a caller could inject code to do the iptables syscalls, even though we don't expose that in the binary

This sounds to me like a good argument for taking the fsuid approach, at least short-term.

@alexlarsson
Copy link
Collaborator

@cgwalters Your test case is wrong, because getpid() is cached so returns the old value
Try it again with this instead:
fprintf (stderr, "** BEFORE SETTING DUMPABLE; pid=%u real pid=%ld **\n", getpid(), syscall(SYS_getpid));
sleep (15);

@alexlarsson
Copy link
Collaborator

alexlarsson commented Oct 13, 2016

So, yeah, i get your point, but even with the fsuid approach we are ptraceable.

However, i mentioned this to @eparis and eric biederman, and he said:

ebiederm: alex, eparis That corner case where we let user namespace caps override dumpable looks buggy. I am not certain how that got in there.

So, maybe this will change over time.

@cgwalters
Copy link
Collaborator

@alexlarsson
Copy link
Collaborator

Maybe the end result is that we just can't expose user namespaces securely (above full kernel access to user namespaces) at all?

@cgwalters
Copy link
Collaborator

OK, let's just paper over this rabbit hole for now and try to replace it after the release with something stronger. I'm fine going with #110 - I'll close #109 .

@cgwalters
Copy link
Collaborator

Closed via #110

@cgwalters
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants