72 changes: 72 additions & 0 deletions pocs/linux/kernelctf/CVE-2025-38618_lts_cos/docs/exploit.md
@@ -0,0 +1,72 @@
# CVE-2025-38618

## Overview
The vulnerability allows us to bind a vsock to the illegal port `VMADDR_PORT_ANY`. Opening a connection to this socket can create a peer vsock which is vulnerable to a refcount underflow, but successfully creating the connection involves a race condition. Most of the exploit consists of winning this race. Once the vulnerable vsock is created, it is freed and a cross-cache spray replaces it with a `msg_msgseg` containing a ROP chain, which we execute by closing the vsock.

## Binding to VMADDR_PORT_ANY

We want to increment the static variable `port` in `__vsock_bind_connectible()` until it reaches `VMADDR_PORT_ANY`. We first obtain the current value of `port` by using `getsockname()` on a newly autobound vsock. If this is the first time a vsock has been autobound, `port` will have been randomly initialized.

There is a time limit on how long the exploit can run, and chances are we cannot set `port` to `VMADDR_PORT_ANY` within it. Therefore we only proceed if the number of increments needed is less than `MAX_PORTS` and retry on a freshly booted instance otherwise. Eventually `port` will be initialized to a high enough value for the vulnerability to be triggered.
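The go/no-go decision described above comes down to a small piece of unsigned arithmetic. The following is a minimal sketch, not code taken from `exploit.c`: the helper names are hypothetical, `max_ports` stands in for `MAX_PORTS` (whose value is not given here), and it assumes that nothing else on the system autobinds a vsock in the meantime and that every autobind succeeds on its first candidate port.

```c
#include <stdint.h>

/* Same value as VMADDR_PORT_ANY (-1U) in <linux/vm_sockets.h>. */
#define VMADDR_PORT_ANY 0xFFFFFFFFu

/*
 * observed: the port reported by getsockname() on a freshly autobound vsock.
 * The kernel assigned `port++`, so the static counter now holds observed + 1.
 * Returns how many further autobinds are needed before the counter itself
 * equals VMADDR_PORT_ANY (hypothetical helper, for illustration only).
 */
static uint32_t increments_needed(uint32_t observed)
{
    uint32_t next = observed + 1u;   /* current value of the static counter */
    return VMADDR_PORT_ANY - next;   /* distance left to 0xFFFFFFFF */
}

/* max_ports is a stand-in for the write-up's MAX_PORTS (value assumed). */
static int worth_attempting(uint32_t observed, uint32_t max_ports)
{
    return increments_needed(observed) < max_ports;
}
```

For example, an observed port of `0xFFFFFFFD` leaves the counter at `0xFFFFFFFE`, so a single further autobind brings it to `VMADDR_PORT_ANY`.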

We set `port` to `VMADDR_PORT_ANY` using the `inc_port()` helper function. The process is sped up by calling `inc_port()` from multiple threads. We then create a new vsock and bind it to `VMADDR_PORT_ANY`.

## Creating vulnerable socket

As explained in [vulnerability.md](./vulnerability.md), we need to call `bind()` and `listen()` on the buggy socket in the window between when it is retrieved from the bound list and when its state is checked in the `virtio_transport_recv_pkt()` call scheduled by `connect()`. Binding the socket undoes the vulnerability, so we only have one try at winning the race.

The socket is locked before its state is checked, giving us an opportunity to lengthen the race window by holding its lock from another thread. Since `bind()` and `listen()` also take this lock, we will have to hope they run first once it is released. To increase the chances of this happening, we perform these syscalls from multiple threads.

We first take the lock in `setsockopt()`, which, while holding the lock, attempts to read from file-mapped memory that `fallocate()` is in the process of deallocating. This read faults and is only allowed to proceed once the entire file has been deallocated, giving us plenty of time to call `connect()` and `bind()`.

The race involves four types of thread:

- `falloc_pthread` calls `fallocate()` on a temporary file to deallocate it.
- `setsockopt_pthread` calls `setsockopt()` to take the buggy socket's lock.
- `connect_pthread` calls `connect()` to schedule `virtio_transport_recv_pkt()` with `VMADDR_PORT_ANY` as the target.
- The `bind_pthreads` call `bind()` and `listen()` on the buggy socket.

Here is the order in which events should happen:

1. `falloc_pthread` begins deallocating the temporary file on CPU 1. All other threads run on CPU 0 and immediately `usleep()` to allow the deallocation to start.
2. `setsockopt_pthread` wakes up first and calls `setsockopt()` with `optval` pointing to file-mapped memory that is currently being deallocated. When `setsockopt()` tries to read `optval` it will fault and go to sleep while holding the buggy socket's lock.
3. `connect_pthread` and the `bind_pthreads` wake up next. The `bind_pthreads` will wait on the buggy socket's lock in the `bind()` call. Meanwhile `connect_pthread` calls `connect()` and exits, having scheduled `virtio_transport_recv_pkt()`. This function will run on the vsock loopback workqueue, find the buggy socket on the list of bound sockets, and then wait on its lock.
4. `falloc_pthread` finishes deallocating the file.
5. `setsockopt_pthread` resumes execution and exits, releasing the lock and waking up the waiting threads.
6. Hopefully at least one of the `bind_pthreads` runs, binding the socket and setting its state to `TCP_LISTEN`. If the work queue thread runs before all of the `bind_pthreads`, the exploit fails.
7. `virtio_transport_recv_pkt()` resumes execution on the work queue and sees the buggy socket in state `TCP_LISTEN`. It calls `virtio_transport_recv_listen()` and creates the vulnerable peer vsock.

Once all threads have finished executing, we get the vulnerable vsock's fd by calling `accept()` on the buggy socket. If `accept()` times out, the race was lost and we cannot proceed.

## ROP

After obtaining the vulnerable vsock we close the connection by calling `shutdown()` on its peer, bringing the vulnerable vsock's refcount down to 1. Now the refcount decrement in `bind()` will free the vulnerable `vsock_sock`, leaving a dangling pointer from the corresponding `socket`.

We will replace the `vsock_sock` with a `msg_msgseg` containing a ROP chain which we can execute by closing the vulnerable vsock. This will cause `vsock_release()` to call `sk->sk_prot->close(sk, 0)` on an attacker-controlled `sk`. We make `sk_prot` overlap `drr_qdisc_ops` in the kernel image so that `sk_prot->close` is set to `qdisc_peek_dequeued()`:

```
static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
{
    struct sk_buff *skb = skb_peek(&sch->gso_skb);

    if (!skb) {
        skb = sch->dequeue(sch);
        /* ... */

We only need to ensure that `sch->gso_skb` is `NULL`; `sch->dequeue` will then be called on `sch`. Since `&sch->gso_skb` happens to be stored in `rbp`, we can set `sch->dequeue` to the stack pivot

```
mov rsp, rbp ; pop rbp ; ret
```
and then execute a ROP chain at `&sch->gso_skb + 8` which overwrites `core_pattern` and teleforks.

Thus we need to prepare the `msg_msgseg` with:

- 8 bytes for `sk_prot` at `offsetof(struct sock, sk_prot)`.
- 8 bytes for `dequeue` at `offsetof(struct Qdisc, dequeue)`.
- 72 bytes for the ROP chain at `offsetof(struct Qdisc, gso_skb)`.

A cross-cache spray is necessary to replace the vulnerable `vsock_sock` in cache `AF_VSOCK` with a `msg_msgseg` in `kmalloc-cg-1k`. Since the size of each slot in `AF_VSOCK` is 1280 bytes, there are four possible locations for the dangling pointer inside of the 1024-byte `msg_msgseg`. Fortunately there is enough space to place a copy of the necessary pointers and ROP chain at all four offsets simultaneously.
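The count of four follows from modular arithmetic on the two object sizes. The sketch below enumerates the distinct values of `(k * 1280) % 1024`. It is an illustration under stated assumptions rather than part of the exploit: the helper name is hypothetical, slots are assumed to be packed back to back starting at an address aligned to the 1024-byte chunk size, and any header bytes inside the replacing object are ignored.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collects, in order of first appearance, the distinct offsets at which a
 * slot of slot_size bytes can begin inside a chunk of chunk_size bytes.
 * Writes at most cap entries to out and returns the number written.
 * Hypothetical helper, for illustration only.
 */
static size_t distinct_offsets(size_t slot_size, size_t chunk_size,
                               size_t *out, size_t cap)
{
    size_t n = 0;

    assert(slot_size > 0 && chunk_size > 0);

    /* The sequence of offsets repeats after at most chunk_size steps. */
    for (size_t k = 0; k < chunk_size && n < cap; k++) {
        size_t off = (k * slot_size) % chunk_size;
        int seen = 0;

        for (size_t i = 0; i < n; i++) {
            if (out[i] == off) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            out[n++] = off;
    }
    return n;
}
```

With `slot_size = 1280` and `chunk_size = 1024` the function reports four offsets: 0, 256, 512 and 768, since each successive slot is shifted by a further 256 bytes relative to the 1024-byte grid.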

Only a kernel base leak is required. During reproduction on GitHub we can request one by forgoing the reliability bonus (the steps required to trigger the bug make that bonus unobtainable in any case). Against the live instance, a prefetch side channel was used.
59 changes: 59 additions & 0 deletions pocs/linux/kernelctf/CVE-2025-38618_lts_cos/docs/vulnerability.md
@@ -0,0 +1,59 @@
It is possible for `vsock_bind()` to bind a stream vsock socket to the port `VMADDR_PORT_ANY`, which is otherwise used to indicate that the socket is currently unbound. We can then successfully `connect()` to `VMADDR_PORT_ANY` from another vsock and cause `accept()` on the buggy vsock to return a peer vsock which also has port `VMADDR_PORT_ANY`. Calling `bind()` on this peer vsock results in an unnecessary refcount decrement, leading to a use-after-free on the underlying `vsock_sock`. This bug has been present since the introduction of VSOCK in version `3.9` and was fixed with commit `aba0c94f61ec ("vsock: Do not allow binding to VMADDR_PORT_ANY")` in version `6.16`.

The vulnerability occurs in the autobinding section of `__vsock_bind_connectible()`, which can be reached by setting `svm_port = VMADDR_PORT_ANY` in `bind()`:

```
static int __vsock_bind_connectible(struct vsock_sock *vsk,
                    struct sockaddr_vm *addr)
{
    static u32 port;
    struct sockaddr_vm new_addr;

    if (!port)
        port = get_random_u32_above(LAST_RESERVED_PORT);

    vsock_addr_init(&new_addr, addr->svm_cid, addr->svm_port);

    if (addr->svm_port == VMADDR_PORT_ANY) {
        bool found = false;
        unsigned int i;

        for (i = 0; i < MAX_PORT_RETRIES; i++) {
            if (port <= LAST_RESERVED_PORT)
                port = LAST_RESERVED_PORT + 1;

            new_addr.svm_port = port++;

            if (!__vsock_find_bound_socket(&new_addr)) {
                found = true;
                break;
            }
        }

        if (!found)
            return -EADDRNOTAVAIL;
    } else {
        /* ... */
    }

    vsock_addr_init(&vsk->local_addr, new_addr.svm_cid, new_addr.svm_port);

    /* Remove connection oriented sockets from the unbound list and add them
     * to the hash table for easy lookup by its address.  The unbound list
     * is simply an extra entry at the end of the hash table, a trick used
     * by AF_UNIX.
     */
    __vsock_remove_bound(vsk);
    __vsock_insert_bound(vsock_bound_sockets(&vsk->local_addr), vsk);

    return 0;
}
```
The current value of the static variable `port` is assigned to the socket (if that port is available), and `port` is then incremented. It is possible to increment `port` up to `VMADDR_PORT_ANY` by calling `bind()` enough times. The next socket we autobind will then be put on the bound list with port `VMADDR_PORT_ANY`.
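The counter behaviour can be reproduced in a small userspace model. This is a sketch only: `model_port` and `model_autobind()` are hypothetical names, every port is assumed to be free (so the retry loop and the bound-socket lookup are omitted), and the random initialization is left to the caller. The `guard_any` flag shows one possible guard in the spirit of the fix named above; it is an assumption and not a copy of the upstream patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VMADDR_PORT_ANY    0xFFFFFFFFu  /* -1U in <linux/vm_sockets.h> */
#define LAST_RESERVED_PORT 1023u

/* Userspace stand-in for the kernel's static `port` counter. */
static uint32_t model_port;

/*
 * One autobind step, with every candidate port assumed to be free.
 * guard_any == false mirrors the vulnerable logic shown above;
 * guard_any == true additionally skips VMADDR_PORT_ANY (assumed guard).
 */
static uint32_t model_autobind(bool guard_any)
{
    if (model_port <= LAST_RESERVED_PORT ||
        (guard_any && model_port == VMADDR_PORT_ANY))
        model_port = LAST_RESERVED_PORT + 1u;

    return model_port++;  /* assign the current value, then advance */
}
```

Starting the model at `0xFFFFFFFE` without the guard, the second call hands out `VMADDR_PORT_ANY` and wraps the counter to 0, after which the next call restarts at 1024. With the guard enabled, a counter equal to `VMADDR_PORT_ANY` is reset to 1024 before being assigned.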

To exploit the vulnerability, we call `connect()` on another vsock with target port `VMADDR_PORT_ANY` and target cid `VMADDR_CID_LOCAL`. This will result in `virtio_transport_recv_pkt()` being scheduled on the `vsock_loopback` work queue. As long as the buggy socket was bound with cid `VMADDR_CID_LOCAL`, `virtio_transport_recv_pkt()` will find it on the list of bound sockets. If the buggy socket is in state `TCP_LISTEN`, a peer vsock with port `VMADDR_PORT_ANY` will be created in `virtio_transport_recv_listen()` to accept the connection. This socket will be assigned to the file descriptor returned by calling `accept()` on the buggy socket.

Unlike vsocks created through `socket()`, the vsock returned by `accept()` is not put on the list of unbound sockets when created. Placing a socket on this list takes a reference count which is released when the socket is bound in `__vsock_bind_connectible()`. If we can successfully bind the peer vsock, there will be an unnecessary refcount decrement when `__vsock_bind_connectible()` attempts to remove it from the unbound list. Normally this would be impossible since it was already bound in `virtio_transport_recv_listen()`, but our vsock bypasses the `vsock_addr_bound()` check by having port `VMADDR_PORT_ANY`.
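The imbalance can be seen in a toy reference-count model. Everything in it is a simplification made for illustration: the `toy_*` names are hypothetical, the starting counts (2 for a socket from `socket()`, 1 for the accepted peer once its connection has been shut down) are assumptions chosen to match the description, and the model assumes that joining the bound table takes a reference of its own.

```c
#include <stdbool.h>

/* Toy stand-in for a vsock: a reference count and a "freed" marker. */
struct toy_sock {
    int  refs;
    bool freed;
};

/* Drop one reference; the object is freed when the count reaches zero. */
static void toy_put(struct toy_sock *s)
{
    if (--s->refs == 0)
        s->freed = true;
}

/* socket(): assumed one reference for the fd, one for the unbound list. */
static struct toy_sock toy_from_socket(void)
{
    return (struct toy_sock){ .refs = 2, .freed = false };
}

/* accept()ed peer after shutdown: never on the unbound list (assumed 1). */
static struct toy_sock toy_from_accept(void)
{
    return (struct toy_sock){ .refs = 1, .freed = false };
}

/* Bind: leave the unbound list (always drops a ref), then join the table. */
static void toy_bind(struct toy_sock *s)
{
    toy_put(s);       /* released even if the list reference never existed */
    if (s->freed)
        return;       /* the model stops here; the object is already gone */
    s->refs++;        /* reference held by the bound table */
}
```

In this model a socket from `socket()` ends at the same count it started with, because the list reference is exchanged for a table reference. The accepted peer has no list reference to give up, so the same decrement takes its only reference and frees it.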

The requirement that the buggy socket be in state `TCP_LISTEN` complicates exploitation as `listen()` will return an error when called on a vsock with port `VMADDR_PORT_ANY`. The vulnerability can still be exploited by rebinding the buggy socket to a legitimate port and calling `listen()` on it in the window between the socket being retrieved in `virtio_transport_recv_pkt()` and its state being checked.

@@ -0,0 +1,5 @@
CFLAGS = -Wno-incompatible-pointer-types -Wno-format -Wno-int-conversion -Wno-pointer-to-int-cast -D COS -static

exploit: exploit.c
	gcc $(CFLAGS) -o $@ $<
