Conversation

@kernel-patches-daemon-bpf

Pull request for series with
subject: bpf: Fix FIONREAD and copied_seq issues
version: 4
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335

@kernel-patches-daemon-bpf
Author

Upstream branch: 590699d
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

@kernel-patches-review-bot

commit 64968b4d8e0328739af5c0f7099d93efb5c018e5
Author: Jiayuan Chen <jiayuan.chen@linux.dev>

bpf, sockmap: Fix FIONREAD for sockmap

This patch adds an ingress_size field to track the total bytes in the
ingress_msg queue, since sockmap has its own receive queue independent
of the native TCP/UDP queues. It also implements new ioctl handlers for
TCP and UDP so that FIONREAD returns this tracked size.

> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
> index 0323a2b6c..2fccae0df 100644
> --- a/include/linux/skmsg.h
> +++ b/include/linux/skmsg.h
> @@ -97,6 +97,7 @@ struct sk_psock {
>  	struct sk_buff_head		ingress_skb;
>  	struct list_head		ingress_msg;
>  	spinlock_t			ingress_lock;
> +	ssize_t				ingress_size;
>  	unsigned long			state;
>  	struct list_head		link;
>  	spinlock_t			link_lock;

[ ... ]

> @@ -321,6 +322,16 @@ static inline void sock_drop(struct sock *sk, struct sk_buff *skb)
>  	kfree_skb(skb);
>  }
>
> +static inline ssize_t sk_psock_get_msg_size(struct sk_psock *psock)
> +{
> +	return READ_ONCE(psock->ingress_size);
> +}
> +
> +static inline void sk_psock_inc_msg_size(struct sk_psock *psock, ssize_t diff)
> +{
> +	WRITE_ONCE(psock->ingress_size, READ_ONCE(psock->ingress_size) + diff);
> +}
> +

Can sk_psock_inc_msg_size race with itself?  The function implements a
read-modify-write operation that isn't atomic.  Looking at the call sites:

- sk_psock_queue_msg() holds ingress_lock when calling this
- sk_psock_dequeue_msg() holds ingress_lock when calling this
- __sk_psock_purge_ingress_msg() holds ingress_lock when calling this
- __sk_msg_recvmsg() does NOT hold ingress_lock when calling this

The __sk_msg_recvmsg() path runs under the socket lock via
tcp_bpf_recvmsg()->lock_sock(), but sk_psock_queue_msg() can be called
from softirq context (network RX -> BPF verdict ->
sk_psock_verdict_apply() -> sk_psock_skb_ingress_self() ->
sk_psock_queue_msg()) without the socket lock.

This means concurrent updates from different lock contexts can race:

Thread A (softirq, ingress_lock): reads size=1000
Thread B (process, socket lock): reads size=1000, subtracts 50, writes 950
Thread A: adds 100 to stale 1000, writes 1100

Result: size=1100 but should be 1050 (lost the -50 update)

The ingress_size counter would drift from the actual queue size, breaking
the FIONREAD accuracy that this patch aims to provide.

>  static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>  				      struct sk_msg *msg)
>  {
> @@ -329,6 +340,7 @@ static inline bool sk_psock_queue_msg(struct sk_psock *psock,
>  	spin_lock_bh(&psock->ingress_lock);
>  	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
>  		list_add_tail(&msg->list, &psock->ingress_msg);
> +		sk_psock_inc_msg_size(psock, msg->sg.size);
>  		ret = true;
>  	} else {
>  		sk_msg_free(psock->sk, msg);

[ ... ]

> @@ -345,18 +357,25 @@ static inline struct sk_msg *sk_psock_dequeue_msg(struct sk_psock *psock)
>
>  	spin_lock_bh(&psock->ingress_lock);
>  	msg = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list);
> -	if (msg)
> +	if (msg) {
>  		list_del(&msg->list);
> +		sk_psock_inc_msg_size(psock, -msg->sg.size);
> +	}
>  	spin_unlock_bh(&psock->ingress_lock);
>  	return msg;
>  }

[ ... ]

> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index d73e03f77..c959d52a6 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -455,6 +455,7 @@ int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg
>  					atomic_sub(copy, &sk->sk_rmem_alloc);
>  				}
>  				msg_rx->sg.size -= copy;
> +				sk_psock_inc_msg_size(psock, -copy);

This call to sk_psock_inc_msg_size() happens without holding ingress_lock,
just the socket lock.  Can this race with sk_psock_queue_msg() which holds
ingress_lock but not the socket lock?

>
>  				if (!sge->length) {
>  					sk_msg_iter_var_next(i);
> @@ -819,9 +820,11 @@ static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
>  		list_del(&msg->list);
>  		if (!msg->skb)
>  			atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
> +		sk_psock_inc_msg_size(psock, -((ssize_t)msg->sg.size));
>  		sk_msg_free(psock->sk, msg);
>  		kfree(msg);
>  	}
> +	WARN_ON_ONCE(psock->ingress_size);
>  }

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf, sockmap: Fix FIONREAD for sockmap
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19669112811

@kernel-patches-daemon-bpf
Author

@kernel-patches-daemon-bpf
Author

Upstream branch: f2cb066
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

@kernel-patches-daemon-bpf
Author

Upstream branch: 8c868a3
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

@kernel-patches-daemon-bpf
Author

Upstream branch: 5262cb2
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

@kernel-patches-daemon-bpf
Author

Upstream branch: 688b745
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

@kernel-patches-daemon-bpf
Author

Upstream branch: bd5bdd2
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

@kernel-patches-daemon-bpf
Author

Upstream branch: 34235a3
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

@kernel-patches-daemon-bpf
Author

Upstream branch: 85bdeeb
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

The issue is that reading from ingress_msg updates tp->copied_seq by
default. However, if the data did not come from the socket's own protocol
stack, tp->rcv_nxt is not advanced. If the socket is later converted back
to a native socket, reads may fail because copied_seq can be significantly
larger than rcv_nxt.

This fix also addresses the syzkaller-reported bug referenced in the
Closes tag.

This patch marks the skmsg objects in ingress_msg with their origin. On
read, copied_seq is updated only when the data came from the socket's own
protocol stack.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983
Fixes: 04919be ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
A socket using sockmap has its own independent receive queue: ingress_msg.
This queue may contain data from its own protocol stack or from other
sockets.

Therefore, for sockmap, relying solely on copied_seq and rcv_nxt to
calculate FIONREAD is not enough.

This patch adds a new ingress_size field to the psock structure to record
the amount of data in ingress_msg, and implements new ioctl interfaces for
TCP and UDP to intercept FIONREAD. Unix and VSOCK sockets also support
sockmap and have similar FIONREAD calculation issues, but fixing them would
require more extensive changes (please let me know if those are wanted); I
don't think they belong in this fix patch.

John Fastabend's earlier commit e5c6de5 ("bpf, sockmap: Incorrectly
handling copied_seq") made some progress toward FIONREAD support. Although
this patch builds on that work, it is acceptable for the Fixes tag to point
to the same commit.

                                                     FD1:read()
                                                     --  FD1->copied_seq++
                                                         |  [read data]
                                                         |
                                [enqueue data]           v
                  [sockmap]     -> ingress to self ->  ingress_msg queue
FD1 native stack  ------>                                 ^
-- FD1->rcv_nxt++               -> redirect to other      | [enqueue data]
                                       |                  |
                                       |             ingress to FD1
                                       v                  ^
                                      ...                 |  [sockmap]
                                                     FD2 native stack

Fixes: 04919be ("tcp: Introduce tcp_read_skb()")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
@kernel-patches-daemon-bpf
Copy link
Author

Upstream branch: ff34657
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1027335
version: 4

This commit adds two new test functions: one to reproduce the bug reported
by syzkaller [1], and another to cover the calculation of copied_seq.

The tests primarily involve installing and uninstalling sockmap on
sockets, then reading data to verify correct behavior.

Additionally, extend the do_test_sockmap_skb_verdict_fionread() function
to support UDP FIONREAD testing.

[1] https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983

Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>