Skip to content
/ linux Public

Commit 4cb163e

Browse files
kuba-mooSasha Levin
authored andcommitted
net: consume xmit errors of GSO frames
[ Upstream commit 7aa767d ] udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests currently in NIPA. They fail in the same exact way, TCP GRO test stalls occasionally and the test gets killed after 10min. These tests use veth to simulate GRO. They attach a trivial ("return XDP_PASS;") XDP program to the veth to force TSO off and NAPI on. Digging into the failure mode we can see that the connection is completely stuck after a burst of drops. The sender's snd_nxt is at sequence number N [1], but the receiver claims to have received (rcv_nxt) up to N + 3 * MSS [2]. Last piece of the puzzle is that senders rtx queue is not empty (let's say the block in the rtx queue is at sequence number N - 4 * MSS [3]). In this state, sender sends a retransmission from the rtx queue with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3]. Receiver sees it and responds with an ACK all the way up to N + 3 * MSS [2]. But sender will reject this ack as TCP_ACK_UNSENT_DATA because it has no recollection of ever sending data that far out [1]. And we are stuck. The root cause is the mess of the xmit return codes. veth returns an error when it can't xmit a frame. We end up with a loss event like this: ------------------------------------------------- | GSO super frame 1 | GSO super frame 2 | |-----------------------------------------------| | seg | seg | seg | seg | seg | seg | seg | seg | | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | ------------------------------------------------- x ok ok <ok>| ok ok ok <x> \\ snd_nxt "x" means packet lost by veth, and "ok" means it went thru. Since veth has TSO disabled in this test it sees individual segments. Segment 1 is on the retransmit queue and will be resent. So why did the sender not advance snd_nxt even tho it clearly did send up to seg 8? tcp_write_xmit() interprets the return code from the core to mean that data has not been sent at all. Since TCP deals with GSO super frames, not individual segment the crux of the problem is that loss of a single segment can be interpreted as loss of all. TCP only sees the last return code for the last segment of the GSO frame (in <> brackets in the diagram above). Of course for the problem to occur we need a setup or a device without a Qdisc. Otherwise Qdisc layer disconnects the protocol layer from the device errors completely. We have multiple ways to fix this. 1) make veth not return an error when it lost a packet. While this is what I think we did in the past, the issue keeps reappearing and it's annoying to debug. The game of whack a mole is not great. 2) fix the damn return codes We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the documentation, so maybe we should make the return code from ndo_start_xmit() a boolean. I like that the most, but perhaps some ancient, not-really-networking protocol would suffer. 3) make TCP ignore the errors It is not entirely clear to me what benefit TCP gets from interpreting the result of ip_queue_xmit()? Specifically once the connection is established and we're pushing data - packet loss is just packet loss? 4) this fix Ignore the rc in the Qdisc-less+GSO case, since it's unreliable. We already always return OK in the TCQ_F_CAN_BYPASS case. In the Qdisc-less case let's be a bit more conservative and only mask the GSO errors. This path is taken by non-IP-"networks" like CAN, MCTP etc, so we could regress some ancient thing. This is the simplest, but also maybe the hackiest fix? Similar fix has been proposed by Eric in the past but never committed because original reporter was working with an OOT driver and wasn't providing feedback (see Link). Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com Fixes: 1f59533 ("qdisc: validate frames going through the direct_xmit path") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260223235100.108939-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
1 parent 57957bc commit 4cb163e

File tree

1 file changed

+18
-5
lines changed

1 file changed

+18
-5
lines changed

net/core/dev.c

Lines changed: 18 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -4762,6 +4762,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
47624762
* to -1 or to their cpu id, but not to our id.
47634763
*/
47644764
if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
4765+
bool is_list = false;
4766+
47654767
if (dev_xmit_recursion())
47664768
goto recursion_alert;
47674769

@@ -4772,28 +4774,39 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
47724774
HARD_TX_LOCK(dev, txq, cpu);
47734775

47744776
if (!netif_xmit_stopped(txq)) {
4777+
is_list = !!skb->next;
4778+
47754779
dev_xmit_recursion_inc();
47764780
skb = dev_hard_start_xmit(skb, dev, txq, &rc);
47774781
dev_xmit_recursion_dec();
4778-
if (dev_xmit_complete(rc)) {
4779-
HARD_TX_UNLOCK(dev, txq);
4780-
goto out;
4781-
}
4782+
4783+
/* GSO segments a single SKB into
4784+
* a list of frames. TCP expects error
4785+
* to mean none of the data was sent.
4786+
*/
4787+
if (is_list)
4788+
rc = NETDEV_TX_OK;
47824789
}
47834790
HARD_TX_UNLOCK(dev, txq);
4791+
if (!skb) /* xmit completed */
4792+
goto out;
4793+
47844794
net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
47854795
dev->name);
4796+
/* NETDEV_TX_BUSY or queue was stopped */
4797+
if (!is_list)
4798+
rc = -ENETDOWN;
47864799
} else {
47874800
/* Recursion is detected! It is possible,
47884801
* unfortunately
47894802
*/
47904803
recursion_alert:
47914804
net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
47924805
dev->name);
4806+
rc = -ENETDOWN;
47934807
}
47944808
}
47954809

4796-
rc = -ENETDOWN;
47974810
rcu_read_unlock_bh();
47984811

47994812
dev_core_stats_tx_dropped_inc(dev);

0 commit comments

Comments
 (0)