
Q/BUG: process stuck in audit_backlog_wait for a long time #144

@tuxoko

System: CentOS 7
Kernel: Upstream 5.10.168 with some custom changes unrelated to audit
auditd: 2.8.5-4.el7.x86_64

Hi, we hit an issue while trying to kill a process so that we could umount a filesystem.
The process appeared to be stuck in audit_backlog_wait for more than 20 seconds.
Some of our own code then triggered a panic because it couldn't umount the filesystem.

Here's the stack trace

PID: 188152  TASK: ffff8f6e30ecdac0  CPU: 8   COMMAND: "rsync"
 #0 [ffffa6218ab6bc68] __schedule at ffffffff95a00af6
 #1 [ffffa6218ab6bcf8] schedule at ffffffff95a00f3f
 #2 [ffffa6218ab6bd18] schedule_timeout at ffffffff95a049bf
 #3 [ffffa6218ab6bd98] audit_log_start at ffffffff9516e338
 #4 [ffffa6218ab6be20] audit_log_start at ffffffff9516e5af
 #5 [ffffa6218ab6be48] audit_log_name.constprop.0 at ffffffff95174b00
 #6 [ffffa6218ab6be98] audit_log_exit at ffffffff951751db
 #7 [ffffa6218ab6bee0] __audit_syscall_exit at ffffffff95176ba0
 #8 [ffffa6218ab6bf18] syscall_exit_to_user_mode at ffffffff959f7c62
 #9 [ffffa6218ab6bf38] do_syscall_64 at ffffffff959f3ae5
#10 [ffffa6218ab6bf50] entry_SYSCALL_64_after_hwframe at ffffffff95c000a9
crash> ps -m | grep rsync
[0 00:00:21.270] [UN]  PID: 188152  TASK: ffff8f6e30ecdac0  CPU: 8   COMMAND: "rsync"

The audit_queue was nearly empty at the time of the panic:

crash> p audit_queue.qlen
$5 = 4
crash> p audit_backlog_limit
audit_backlog_limit = $6 = 640

I think there is a fairness issue at play here.
When a process enters audit_log_start and audit_queue.qlen is above audit_backlog_limit, it decides to wait.
Then, while kauditd is draining audit_queue, other threads entering audit_log_start may see a small audit_queue.qlen and bypass the wait. So there's no guarantee about when the waiting process will get to queue its record.
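To make the barging scenario above concrete, here is a toy, single-threaded C model. It is only a sketch of the reasoning, not kernel code: rounds_until_enqueued and its parameters are made-up names, and it assumes that kauditd fully drains the queue each round and that all newcomers run before the waiter gets to recheck the backlog.

```c
#include <assert.h>

/*
 * Toy model (hypothetical, not kernel code): one sleeping waiter, a
 * consumer that drains the queue completely each round, and `newcomers`
 * other tasks that arrive after the drain but before the waiter runs.
 * Returns the round in which the waiter enqueues, or -1 if it is still
 * waiting after max_rounds.
 */
int rounds_until_enqueued(int limit, int newcomers, int max_rounds)
{
	assert(limit > 0 && newcomers >= 0 && max_rounds > 0);

	for (int round = 1; round <= max_rounds; round++) {
		int qlen = 0;              /* consumer drained the queue */

		for (int i = 0; i < newcomers; i++) {
			if (qlen <= limit) /* newcomer sees a short queue */
				qlen++;    /* and enqueues without waiting */
		}

		if (qlen <= limit)         /* waiter rechecks the backlog */
			return round;      /* still room: it enqueues */
		/* queue is over the limit again, so the waiter sleeps again */
	}
	return -1;                         /* never got in */
}
```

With a limit of 640, 100 newcomers per round let the waiter in on round 1, while 641 or more push the queue back over the limit every round, so the waiter never gets in within max_rounds.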

Another part of this issue is that kauditd wakes up only one process per iteration, after it has processed the whole queue. The comment says "wake everyone", but the code uses wake_up rather than wake_up_all, and the waiters use add_wait_queue_exclusive, so only one of them is woken. If the intention is to wake everyone, should we change it to wake_up_all? I think with wake_up_all the chance of our process being stuck for 20 seconds would probably be lower.

static int kauditd_thread(void *dummy)
{
...
		/* we have processed all the queues so wake everyone */
		wake_up(&audit_backlog_wait);
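The difference between the two wake-up calls can be sketched with a small user-space C model. This is an illustration under assumptions, not the kernel implementation: model_waiter, model_wake and woken_by_one_call are hypothetical names, and the model assumes (as I understand the macros in include/linux/wait.h) that wake_up passes nr_exclusive = 1 and wake_up_all passes nr_exclusive = 0 to the common wake-up loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 64

/* Hypothetical stand-in for a wait queue entry. */
struct model_waiter {
	bool exclusive; /* queued as add_wait_queue_exclusive() would */
	bool woken;
};

/*
 * Walk the queue in order and wake entries. Stop once nr_exclusive
 * exclusive waiters have been woken; nr_exclusive == 0 never reaches
 * zero after the decrement, so it means "wake everyone".
 */
size_t model_wake(struct model_waiter *q, size_t n, int nr_exclusive)
{
	size_t woken = 0;

	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (q[i].woken)
			continue;
		q[i].woken = true;
		woken++;
		if (q[i].exclusive && --nr_exclusive == 0)
			break;
	}
	return woken;
}

/* Build a queue of n_waiters exclusive waiters and issue one wake-up. */
size_t woken_by_one_call(size_t n_waiters, int nr_exclusive)
{
	struct model_waiter q[MAX_WAITERS];

	assert(n_waiters <= MAX_WAITERS);
	for (size_t i = 0; i < n_waiters; i++)
		q[i] = (struct model_waiter){ .exclusive = true, .woken = false };
	return model_wake(q, n_waiters, nr_exclusive);
}
```

In this model, woken_by_one_call(5, 1) returns 1 and woken_by_one_call(5, 0) returns 5, which is the behaviour gap between the comment and the call in kauditd_thread.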
