net can become wedged and stop receiving packets #750

Closed
jgallagher opened this issue Sep 2, 2022 · 15 comments

jgallagher commented Sep 2, 2022

Yesterday while testing host boot flash operations triggered by mgmt-gateway, @mkeeter and I ran into two bugs:

  1. HostFlash.bulk_erase calls poll_for_write_complete after issuing the command to the qspi driver, which busy waits. I've seen bulk erase take > 1 minute on a gimlet, and the comments say it can take as long as 8 minutes; during that time, the hf task starves all lower priority tasks (including net).
  2. After triggering one of these long busy waits, when it returned, net was in a strangely wedged state. Client tasks could still send packets (we were seeing udpbroadcast data), but as far as we could tell no packets were being received. udpecho was not receiving notifications of incoming packets, and net was not responding to NDP messages.

Fixing 1 should be straightforward, but I haven't done so yet because I wanted to address 2 first. I took a humility dump of the SP while net was in this state (attached to the first comment on this issue). After restarting net via jefe, it came back healthy.

I've pushed a debug-net-wedge branch that can relatively reliably wedge the net task in the same way on my gimletlet. It adds two tasks:

  1. busywait runs at a relatively high priority. When it receives a message, it busy waits for 5,000 ticks and then responds.
  2. udpecho-slowly acts like udpecho, unless the packet it receives begins with an ASCII 1. For such packets, it first messages busywait (producing the same situation we saw with hostflash starving net).

The branch also has a debug-net-client crate that spins messaging udpecho-slowly forever. To run it, cd into its directory and then

RUST_LOG=info cargo run -- '[fe80::c1d:7dff:feef:9f1d%2]:7777'

substituting the address of your gimletlet. When everything is working, the output looks similar to this:

[2022-09-02T20:16:00Z INFO  debug_net_client] sending packet to trigger high priority busy loop
[2022-09-02T20:16:00Z INFO  debug_net_client] sending followup packet 1
[2022-09-02T20:16:00Z INFO  debug_net_client] sending followup packet 2
[2022-09-02T20:16:00Z INFO  debug_net_client] sending followup packet 3
[2022-09-02T20:16:01Z INFO  debug_net_client] sending followup packet 4
[2022-09-02T20:16:01Z INFO  debug_net_client] sending followup packet 5
[2022-09-02T20:16:01Z INFO  debug_net_client] sending followup packet 6
[2022-09-02T20:16:02Z INFO  debug_net_client] sending followup packet 7
[2022-09-02T20:16:02Z INFO  debug_net_client] sending followup packet 8
[2022-09-02T20:16:02Z INFO  debug_net_client] sending followup packet 9
[2022-09-02T20:16:03Z INFO  debug_net_client] no response after 1.001421583s
[2022-09-02T20:16:04Z INFO  debug_net_client] no response after 2.002848957s
[2022-09-02T20:16:05Z INFO  debug_net_client] received response 1 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:05Z INFO  debug_net_client] received response 2 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:05Z INFO  debug_net_client] received response 3 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:05Z INFO  debug_net_client] received response 4 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:06Z INFO  debug_net_client] no response after 3.289892797s
[2022-09-02T20:16:06Z INFO  debug_net_client] sending packet to trigger high priority busy loop
[2022-09-02T20:16:06Z INFO  debug_net_client] sending followup packet 1
[2022-09-02T20:16:06Z INFO  debug_net_client] sending followup packet 2
[2022-09-02T20:16:06Z INFO  debug_net_client] sending followup packet 3
[2022-09-02T20:16:07Z INFO  debug_net_client] sending followup packet 4
[2022-09-02T20:16:07Z INFO  debug_net_client] sending followup packet 5
[2022-09-02T20:16:07Z INFO  debug_net_client] sending followup packet 6
[2022-09-02T20:16:08Z INFO  debug_net_client] sending followup packet 7
[2022-09-02T20:16:08Z INFO  debug_net_client] sending followup packet 8
[2022-09-02T20:16:08Z INFO  debug_net_client] sending followup packet 9
[2022-09-02T20:16:09Z INFO  debug_net_client] no response after 1.001229534s
[2022-09-02T20:16:10Z INFO  debug_net_client] no response after 2.003001465s
[2022-09-02T20:16:11Z INFO  debug_net_client] received response 1 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:11Z INFO  debug_net_client] received response 2 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:11Z INFO  debug_net_client] received response 3 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:11Z INFO  debug_net_client] received response 4 from [fe80::c1d:7dff:feef:9f1d%2]:7777
[2022-09-02T20:16:12Z INFO  debug_net_client] no response after 3.294712543s
[2022-09-02T20:16:12Z INFO  debug_net_client] sending packet to trigger high priority busy loop
... repeat ...

After somewhere between 1 and 15 minutes (usually around 5 for me), you should see fewer than 4 responses and an unending sequence of "no response" logs; e.g.,

[2022-09-02T19:56:55Z INFO  debug_net_client] sending packet to trigger high priority busy loop
[2022-09-02T19:56:55Z INFO  debug_net_client] sending followup packet 1
[2022-09-02T19:56:55Z INFO  debug_net_client] sending followup packet 2
[2022-09-02T19:56:56Z INFO  debug_net_client] sending followup packet 3
[2022-09-02T19:56:56Z INFO  debug_net_client] sending followup packet 4
[2022-09-02T19:56:56Z INFO  debug_net_client] sending followup packet 5
[2022-09-02T19:56:57Z INFO  debug_net_client] sending followup packet 6
[2022-09-02T19:56:57Z INFO  debug_net_client] sending followup packet 7
[2022-09-02T19:56:57Z INFO  debug_net_client] sending followup packet 8
[2022-09-02T19:56:57Z INFO  debug_net_client] sending followup packet 9
[2022-09-02T19:56:59Z INFO  debug_net_client] no response after 1.005248163s
[2022-09-02T19:57:00Z INFO  debug_net_client] no response after 2.006617007s
[2022-09-02T19:57:01Z INFO  debug_net_client] no response after 3.00801117s
[2022-09-02T19:57:02Z INFO  debug_net_client] no response after 4.009221014s
[2022-09-02T19:57:03Z INFO  debug_net_client] no response after 5.010172729s
[2022-09-02T19:57:04Z INFO  debug_net_client] no response after 6.011480753s
[2022-09-02T19:57:05Z INFO  debug_net_client] no response after 7.011830731s
[2022-09-02T19:57:06Z INFO  debug_net_client] no response after 8.013229815s
[2022-09-02T19:57:07Z INFO  debug_net_client] no response after 9.014601338s
[2022-09-02T19:57:08Z INFO  debug_net_client] no response after 10.015946541s
[2022-09-02T19:57:09Z INFO  debug_net_client] no response after 11.017274414s
[2022-09-02T19:57:10Z INFO  debug_net_client] no response after 12.01828268s

At this point, net is almost certainly wedged; you can confirm by trying to talk to the normal udpecho task on port 7.

Even after poking at this for the better part of a day and a half, I know frustratingly little. In both the gimlet dump and on my gimletlet, the four RDES3 values of RX_DESC are all 0xc1000000, which is the value we write to give ownership back to the hardware. This leads me to believe we think we've told the hardware to give us packets, but it doesn't realize it.
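For reference, that 0xc1000000 constant decomposes into the three descriptor bits the driver sets when handing a descriptor back. A minimal sketch, with bit positions taken from the RDES3 read format in RM0433 (treat them as assumptions if you're cross-checking):

```rust
// Sketch: the RDES3 value written to hand a descriptor back to hardware,
// reconstructed from the bit names used in the driver. Bit positions are
// from the STM32H7 RDES3 read format (RM0433).
const RDES3_OWN_BIT: u32 = 31; // descriptor is owned by the DMA engine
const RDES3_IOC_BIT: u32 = 30; // interrupt on completion
const RDES3_BUF1_VALID_BIT: u32 = 24; // buffer 1 address is valid

fn main() {
    let rdes3: u32 =
        1 << RDES3_OWN_BIT | 1 << RDES3_IOC_BIT | 1 << RDES3_BUF1_VALID_BIT;
    assert_eq!(rdes3, 0xc100_0000);
    println!("{rdes3:#010x}"); // prints "0xc1000000"
}
```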

Sometimes when I add ringbuf traces (depending on where they are), I'm no longer able to reproduce the wedge. One set of ringbuf traces I collected was aimed at figuring out whether we were somehow failing to call rx_notify() after clearing space in the RX buffer; based on the traces, we did indeed call rx_notify(), but still stopped receiving packets:

humility: ring buffer drv_stm32h7_eth::__RINGBUF in net:
 NDX LINE      GEN    COUNT PAYLOAD
  13  616       20        1 VlanIsNextFree { now: 0x2a12b, vid: 0x301, storage_slot: 0x2, rdes3: 0xc1000000 }
  14  470       20        1 VlanCanRecv { now: 0x2a12b, vid: 0x301, rx_notify: false }
  15  616       20        1 VlanIsNextFree { now: 0x2a514, vid: 0x302, storage_slot: 0x2, rdes3: 0xc1000000 }
  16  470       20        1 VlanCanRecv { now: 0x2a514, vid: 0x302, rx_notify: false }
  17  271       20        1 EthInterrupt { now: 0x2b1b7, dmacsr: 0x80c4 }
  18  616       20        1 VlanIsNextFree { now: 0x2b1b7, vid: 0x301, storage_slot: 0x2, rdes3: 0x3e040086 }
  19  470       20        1 VlanCanRecv { now: 0x2b1b7, vid: 0x301, rx_notify: false }
  20  449       20        1 VlanRecvRxNotify { now: 0x2b1b7, vid: 0x301 }
  21  616       20        1 VlanIsNextFree { now: 0x2b1b7, vid: 0x301, storage_slot: 0x3, rdes3: 0xc1000000 }
  22  470       20        1 VlanCanRecv { now: 0x2b1b7, vid: 0x301, rx_notify: false }
  23  616       20        1 VlanIsNextFree { now: 0x2b5a0, vid: 0x302, storage_slot: 0x3, rdes3: 0xc1000000 }
  24  470       20        1 VlanCanRecv { now: 0x2b5a0, vid: 0x302, rx_notify: false }
  25  271       20        1 EthInterrupt { now: 0x2b989, dmacsr: 0x80c4 }
  26  616       20        1 VlanIsNextFree { now: 0x2b989, vid: 0x301, storage_slot: 0x3, rdes3: 0x3e040086 }
  27  470       20        1 VlanCanRecv { now: 0x2b989, vid: 0x301, rx_notify: false }
  28  449       20        1 VlanRecvRxNotify { now: 0x2b989, vid: 0x301 }                                     // <--- rx_notify call after last packet
  29  616       20        1 VlanIsNextFree { now: 0x2b989, vid: 0x301, storage_slot: 0x0, rdes3: 0xc1000000 } // <--- never received another packet after this point
  30  470       20        1 VlanCanRecv { now: 0x2b989, vid: 0x301, rx_notify: false }
  31  616       20        1 VlanIsNextFree { now: 0x2bd72, vid: 0x302, storage_slot: 0x0, rdes3: 0xc1000000 }
   0  470       21        1 VlanCanRecv { now: 0x2bd72, vid: 0x302, rx_notify: false }
   1  616       21        1 VlanIsNextFree { now: 0x2c15b, vid: 0x301, storage_slot: 0x0, rdes3: 0xc1000000 }
   2  470       21        1 VlanCanRecv { now: 0x2c15b, vid: 0x301, rx_notify: false }
   3  616       21        1 VlanIsNextFree { now: 0x2c544, vid: 0x302, storage_slot: 0x0, rdes3: 0xc1000000 }
   4  470       21        1 VlanCanRecv { now: 0x2c544, vid: 0x302, rx_notify: false }
   5  616       21        1 VlanIsNextFree { now: 0x2d4e3, vid: 0x301, storage_slot: 0x0, rdes3: 0xc1000000 }
   6  470       21        1 VlanCanRecv { now: 0x2d4e3, vid: 0x301, rx_notify: false }

(I can send a full dump of these traces, but am not sure how useful the earlier portions are, so I've omitted them for now.) I have not been able to get crystal clear traces on our ethernet interrupts, but from what I have gathered, nothing seems amiss there. I don't want to discourage any particular line of thought since I still have no idea what the root cause is, but if there's something wrong with the interrupt handling it's extremely subtle.

At this point I suspect some kind of memory synchronization issue between hubris and the ethernet device, possibly related to #513? But I tried a few different things aimed at narrowing that down, and was still able to reliably wedge net. The things I've tried that still leave net reliably wedgeable are:

  1. Changing ServerImpl::poll() to repeat if an interface encounters a packet intended for another valid vid
  2. Changing all the memory orderings in the ethernet driver to Ordering::SeqCst
  3. Replacing the atomic fences in {rx,tx}_notify() with calls to cortex_m::asm::dsb() (a la https://github.com/stm32-rs/stm32h7xx-hal/blob/master/src/ethernet/eth.rs#L315-L317)
  4. Putting the ethernet descriptors into a region of sram marked as device memory (a la https://github.com/STMicroelectronics/STM32CubeH7/blob/c828320460548114b485ec8a6affaa950f51d12b/Projects/STM32H743I-EVAL/Applications/LwIP/LwIP_HTTP_Server_Netconn_RTOS/Src/main.c#L298-L326)

The things I've tried that appear to make the bug go away, or at least make it hard enough to reproduce that my current technique can't trigger it:

  1. Disabling the vlan feature. Update: I left this running for much longer, and I eventually reproduced the wedge on my gimletlet with vlan disabled. debug-net-client ran for 1 hr 8 minutes before triggering the wedge.
  2. Adding ringbuf traces in rx_notify() that record the four RDES3 words before the fence (NOTE: I have not let this run for an extended period of time; until seeing the vlan wedge after an hour, the longest I had had to wait was 15 minutes, and this ran for about 20 without wedging.)
@jgallagher

humility dump from the gimlet: net-task-stuck-is-mgmt-gateway-the-culprit.zip

If it would be helpful I can get a dump from my gimletlet when it's similarly wedged.

cbiffle commented Sep 12, 2022

The device memory mapping on ARMv7-M is generally more for memory mapped registers; the DMA attribute we set translates to inner/outer non-cacheable + shared which is likely what we want here.

I'm pleased to hear that changing the orderings to SeqCst didn't change the behavior; we worked pretty hard on getting the existing orderings right and there's a comment in there somewhere discussing the rationale. A bug there would signal some intrinsic problems with our approach to DMA here.

DSB should be overkill; the atomic fences generate DMB which is lighter weight (only synchronizes explicit memory accesses rather than also synchronizing the execution pipeline). DSB is mostly necessary in cases where you've changed aspects of, like, memory protection in ways that don't require a full-on prefetch-flushing ISB.

I suspect that one of two things is happening:

  1. We've had a race condition between descriptor updates and poking the Ethernet MAC's "hey here is your tail pointer" register, which is what causes it to rescan memory. This shouldn't be happening but that doesn't mean it isn't. :-) The poke in question is at drv/stm32h7-eth/src/lib.rs:288.
  2. Running out of RX descriptors has caused the hardware to get into a sticky error state that we don't handle, because we didn't aggressively test this case; the hardware is insisting that the error be cleared before it will resume normal receive operation. I think this is more likely.

Unfortunately the dump doesn't include the MAC register state (and rightfully so, as Humility has no idea which registers are read-sensitive!), which is what we'd need to diagnose case 2. So I'll have to try and repro this locally once I find my NIC.

@jgallagher

Humility memory dumps from a gimlet in this stuck state:

john@igor ~ $ pfexec humility -a ./build-gimlet-b.zip readmem -w 0x40029160 4
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:374e:003700313156501320323443 via ST-Link V3
                   \/        4        8        c
0x40029160 | 00201584

john@igor ~ $ pfexec humility -a ./build-gimlet-b.zip readmem -w 0x40029128 4
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:374e:003700313156501320323443 via ST-Link V3
                    0        4       \/        c
0x40029120 |                   300018c0          |         ...0


john@igor ~ $ pfexec humility -a ./build-gimlet-b.zip readmem -w 0x4002914c 4
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:374e:003700313156501320323443 via ST-Link V3
                    0        4        8       \/
0x40029140 |                            30001880 |             ...0


john@igor ~ $ pfexec humility -a ./build-gimlet-b.zip readmem -w 0x4002900c 4
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:374e:003700313156501320323443 via ST-Link V3
                    0        4        8       \/
0x40029000 |                            00006000 |             .`..

john@igor ~ $ pfexec humility -a ./build-gimlet-b.zip readmem -w 0x4002916c 4
humility: WARNING: archive on command-line overriding archive in environment file
humility: attached to 0483:374e:003700313156501320323443 via ST-Link V3
                    0        4        8       \/
0x40029160 |                            00000000 |             ....

Larger dumps attached:
eth-dump.txt
eth-dma-dump.txt

mkeeter commented Oct 24, 2022

#867 is a partial fix for this, but panics the net task. An intermediate fix would be to cleanly restart the net task (which would keep its callers happier), until we can track down the true root cause.

cbiffle commented Apr 27, 2023

Update from the future: the STM32H7 Ethernet MAC has a difficult-to-nail-down issue where incoming packets stop transferring from the MTL to MAC, with no apparent status bits to indicate this case and distinguish it from "the NIC is very busy." So the watchdog in #867 is a workaround.

It's currently not clear that we can do anything more without fabbing a chip, so I'm going to close this for now as worked-around.

cbiffle closed this as completed Apr 27, 2023
mkeeter commented Jul 14, 2023

The workaround is causing issues on the dogfood rack (https://github.com/oxidecomputer/meta/issues/225), which is understandable: the net watchdog timer is set at 60 seconds, which is too long for MGS, so MGS times out before net restarts and recovers.

Although we could work around this by tweaking watchdogs, I'm reopening this issue in hopes of making some progress towards a true root cause.

First, an observation: the previous implementation of debug_net_client has the possibility of false positives in run_one! If > 7 packets are dropped – but the net task isn't actually stuck – debug_net_client will still report "no responses" forever. In practice, I've seen this happen repeatedly on my bench.

I suspect that we were misled by this in the past, and spent time poking at systems that weren't actually stuck.

I've updated this check to be more conservative (require 10x consecutive failures) in the debug-net-wedge-redux branch, so it should only report true net-is-stuck events. This typically takes longer to reproduce (1-2 hours), running on a Gimletlet on my desk.
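The tightened check can be sketched as follows. This is a hypothetical simplification, not the real debug-net-client code: `record_round` stands in for one send/receive round of run_one, any response at all resets the counter, and only 10 all-fail rounds in a row count as a wedge.

```rust
// Sketch of the more conservative stuck-detector. A burst of >7 dropped
// packets (the old false positive) no longer latches forever, because a
// single successful round resets the counter.
struct WedgeDetector {
    consecutive_failures: u32,
}

impl WedgeDetector {
    const THRESHOLD: u32 = 10;

    fn new() -> Self {
        Self { consecutive_failures: 0 }
    }

    /// Record one send/receive round; returns true once we've seen
    /// THRESHOLD consecutive rounds with no responses at all.
    fn record_round(&mut self, round_got_any_response: bool) -> bool {
        if round_got_any_response {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        self.consecutive_failures >= Self::THRESHOLD
    }
}

fn main() {
    let mut d = WedgeDetector::new();
    // 9 failed rounds then a success: not stuck (the old check would latch).
    for _ in 0..9 {
        assert!(!d.record_round(false));
    }
    assert!(!d.record_round(true));
    // 10 consecutive failed rounds: report a true net-is-stuck event.
    for i in 0..10 {
        assert_eq!(d.record_round(false), i == 9);
    }
}
```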

There's a bright side: in a test with the conservative metric, the system – once stuck – did show errors!

  • ETH_MTLRXQDR is 00150010, indicating 21 stuck packets in the MTL queue.
  • ETH_DMACSR is 0x00201584
    • TBU: Transmit buffer unavailable (fine)
    • RBU: Receive buffer unavailable (?)
    • RPS: Receive process stopped (!)
    • ETI: Early Transmit interrupt (fine)
    • FBE: Fatal bus error (!)
    • REB: Error during data transfer by Rx DMA (!)
    • On a random (presumably-working Gimlet), I see 0x00000484
      • TBU: Transmit buffer unavailable
      • RBU: Receive buffer unavailable (?)
      • ETI: Early transmit interrupt
  • ETH_DMADSR is 00006000
    • Rx is stopped (!)
    • Tx is running
  • ETH_DMACRXCR is 00010c00
    • Receive programmable burst length = 16
    • receive buffer size is 1536
    • Note that SR is not set, because we've stopped!
  • ETH_DMACRXDLAR is 30001880
    • Same as on a working Gimlet
    • Seems legit: task_net::buf::claim_rx_statics::RX_DESC 0x30001880 64
  • ETH_DMACRXDTPR is 300018c0
    • Sure, I believe that's the end of the descriptor ring
  • ETH_DMACRXRLR is 3
    • 4 Rx descriptors in the ring (N - 1)
  • ETH_DMACCARXDR is 300018b0
    • Seems like a perfectly fine receive descriptor
  • ETH_DMACCARXBR is 0x00000301
    • This is the pointer to which the Rx DMA is trying to write
    • oh no
    • To quote James Mickens, "Despair is when you’re debugging a kernel driver and you look at a memory dump and you see that a pointer has a value of [0x301]"
    • On a working Gimlet, this is 0x30024c0

Despite my deep existential dread, this is a great result: if we can actually notice that the system is stuck, we can fix it.

Sure enough, writing 0x00010c01 to ETH_DMACRXCR (which sets the SR bit, enabling DMA Rx again) unsticks the system.
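The register reads above can be decoded mechanically. A sketch, with bit positions taken from RM0433's ETH_DMACSR and ETH_DMACRXCR descriptions (double-check them against the reference manual before relying on this):

```rust
// Sketch: decoding the stuck-state registers quoted above. Bit positions
// are assumptions drawn from RM0433 (ETH_DMACSR, ETH_DMACRXCR).
const TBU: u32 = 1 << 2;  // transmit buffer unavailable
const RBU: u32 = 1 << 7;  // receive buffer unavailable
const RPS: u32 = 1 << 8;  // receive process stopped
const ETI: u32 = 1 << 10; // early transmit interrupt
const FBE: u32 = 1 << 12; // fatal bus error
const REB_MASK: u32 = 0b111 << 19; // Rx DMA bus error code

const SR: u32 = 1; // ETH_DMACRXCR bit 0: start/stop receive

fn main() {
    let stuck: u32 = 0x0020_1584;   // ETH_DMACSR on the wedged Gimlet
    let healthy: u32 = 0x0000_0484; // ETH_DMACSR on a working Gimlet

    // All of the error bits called out above are set in the stuck state...
    let errs = TBU | RBU | RPS | ETI | FBE;
    assert_eq!(stuck & errs, errs);
    assert_ne!(stuck & REB_MASK, 0); // Rx DMA data-transfer error reported

    // ...while the healthy value has no stop/error bits, just TBU/RBU/ETI.
    assert_eq!(healthy & (RPS | FBE | REB_MASK), 0);

    // The unstick: ETH_DMACRXCR was 0x00010c00; setting SR restarts Rx DMA.
    assert_eq!(0x0001_0c00 | SR, 0x0001_0c01);
}
```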

Next steps are running a full dogfood rack update with the net-watchdog-debug branch, which has a few changes to net/src/main.rs:

  • Logs all of the registers above when a watchdog timeout occurs
  • Panics instead of calling restart_me, so that a dump is captured

If we get lucky, we'll see the usual 1-2 failures, and will be able to read those dumps and see if they show the same characteristic errors.

mkeeter reopened this Jul 14, 2023
mkeeter commented Jul 14, 2023

I have realized – to my great dismay – that 0x301 is our VLAN VID.

Changing the VLAN VID to a different value changes the value in ETH_DMACCARXBR (!!).

So, why the heck are we treating our VID as a DMA buffer address?

When we receive a packet, it's written to the DMA Rx descriptor in "write-back format" (see Fig 803). In this format, the first word (RDES[0]) is the VID; in read format, this word is the memory address.

This behavior is compatible with the DMA peripheral deciding to read the VID as its memory buffer. It's not clear why it would do such a thing.

    /// Programs the words in `d` to prepare to receive into `buffer` and sets
    /// `d` accessible to hardware. The final write to make it accessible is
    /// performed with Release ordering to get a barrier.
    fn set_descriptor(d: &RxDesc, buffer: *mut [u8; BUFSZ]) {
        d.rdes[0].store(buffer as u32, Ordering::Relaxed);
        d.rdes[1].store(0, Ordering::Relaxed);
        d.rdes[2].store(0, Ordering::Relaxed);
        let rdes3 =
            1 << RDES3_OWN_BIT | 1 << RDES3_IOC_BIT | 1 << RDES3_BUF1_VALID_BIT;
        d.rdes[3].store(rdes3, Ordering::Release); // <-- release
    }

This code assigns the buffer before setting RDES[3], as you'd expect.

Looking at the assembly, it also looks reasonable:

 80222c6:	9033      	str	r0, [sp, #204]	; 0xcc
 80222c8:	e002      	b.n	80222d0 <main+0x1700>
 80222ca:	2500      	movs	r5, #0
 80222cc:	f50d 6aff 	add.w	sl, sp, #2040	; 0x7f8
 80222d0:	f20d 794c 	addw	r9, sp, #1868	; 0x74c
 80222d4:	e006      	b.n	80222e4 <main+0x1714>
 80222d6:	f50d 60bf 	add.w	r0, sp, #1528	; 0x5f8
 80222da:	2194      	movs	r1, #148	; 0x94
 80222dc:	f00a fc62 	bl	802cba4 <__aeabi_memclr4>
 80222e0:	2001      	movs	r0, #1
 80222e2:	9033      	str	r0, [sp, #204]	; 0xcc
 80222e4:	f108 000c 	add.w	r0, r8, #12
 80222e8:	f8c8 b000 	str.w	fp, [r8]
 80222ec:	f8c8 5004 	str.w	r5, [r8, #4]
 80222f0:	f04f 4b41 	mov.w	fp, #3238002688	; 0xc1000000
 80222f4:	f8c8 5008 	str.w	r5, [r8, #8]
 80222f8:	f3bf 8f5f 	dmb	sy
 80222fc:	f8c0 b000 	str.w	fp, [r0]
 8022300:	6b30      	ldr	r0, [r6, #48]	; 0x30
 8022302:	3001      	adds	r0, #1
 8022304:	f082 82dc 	bcs.w	80248c0 <main+0x3cf0>
 8022308:	6a71      	ldr	r1, [r6, #36]	; 0x24
 802230a:	1a42      	subs	r2, r0, r1

The 0xc1000000 value is 1 << RDES3_OWN_BIT | 1 << RDES3_IOC_BIT | 1 << RDES3_BUF1_VALID_BIT; picking out just the important bits:

 80222ca:	2500      	movs	r5, #0
...
 80222e4:	f108 000c 	add.w	r0, r8, #12
 80222e8:	f8c8 b000 	str.w	fp, [r8]
 80222ec:	f8c8 5004 	str.w	r5, [r8, #4]
 80222f0:	f04f 4b41 	mov.w	fp, #3238002688	; 0xc1000000
 80222f4:	f8c8 5008 	str.w	r5, [r8, #8]
 80222f8:	f3bf 8f5f 	dmb	sy
 80222fc:	f8c0 b000 	str.w	fp, [r0]

It looks like r8 is a pointer to RDES, and there's a dmb instruction before writing to RDES[3]. This is all well and good, and doesn't explain the behavior.

However, if the DMA peripheral reads the Rx descriptor naively (in the order 0-3), then there's a sequence of operations that could lead to what we're seeing:

User code                    |  DMA
-----------------------------|--------------------------------------
                             |  read RDES[0] => 0x301
                             |  read RDES[1] => ...
                             |  read RDES[2] => ...
RDES[0] <= buffer            |
RDES[1] <= 0                 |
RDES[2] <= 0                 |
RDES[3] <= OWN | IOC | BUF1V |
                             |  read RDES[3] => OWN | IOC | BUF1V

The store to RDES[3] being Ordering::Release doesn't protect us here!
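The interleaving in that diagram can be replayed deterministically in software. A toy single-threaded model, under the assumptions stated above (the DMA reads the words in order 0 through 3, and only RDES[3]'s OWN bit gates acceptance):

```rust
// Toy model of the race in the diagram: the "DMA" snapshots RDES[0..=2]
// first, the "CPU" rewrites the descriptor in between, and the "DMA" then
// reads a fresh RDES[3]. Single-threaded and deterministic; it only
// illustrates the interleaving, not real hardware behavior.
const OWN_IOC_BUF1V: u32 = 0xc100_0000;

fn main() {
    // Descriptor still in write-back format from the previous packet:
    // RDES[0] holds the VLAN VID (0x301), OWN is clear.
    let mut rdes: [u32; 4] = [0x0000_0301, 0, 0, 0];

    // DMA reads the first word of the descriptor and latches it...
    let dma_buf_addr = rdes[0];

    // ...then the CPU runs set_descriptor(), pointing RDES[0] at a real
    // buffer and releasing ownership via RDES[3].
    let buffer_addr: u32 = 0x3000_24c0; // stand-in for a real Rx buffer
    rdes[0] = buffer_addr;
    rdes[1] = 0;
    rdes[2] = 0;
    rdes[3] = OWN_IOC_BUF1V; // Release-ordered in the real driver

    // ...and only now does the DMA read RDES[3], see OWN set, and accept
    // the descriptor, pairing a fresh RDES[3] with the stale RDES[0].
    assert_eq!(rdes[3], OWN_IOC_BUF1V);
    assert_ne!(dma_buf_addr, buffer_addr);
    assert_eq!(dma_buf_addr, 0x301); // exactly the bogus ETH_DMACCARXBR value
}
```

The Release store orders the CPU's own writes; it cannot retract a read the DMA engine already performed.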

jgallagher commented Jul 15, 2023

We hit this on sled 8 while mupdating dogfood this evening with your debug branch (#1459); I put the net dump at /staff/dock/rack2/mupdate-20230714-2/net-task-dump/hubris.core.net.0. The ringbuf appears to match your description almost exactly:

humility: ring buffer task_net::__RINGBUF in net:
 NDX LINE      GEN    COUNT PAYLOAD
   0  290        1        1 MtlRxQdr(0xd0010)
   1  292        1        1 DmaCsr(0x201584)
   2  294        1        1 DmaDsr(0x6000)
   3  296        1        1 DmaCrxCr(0x10c00)
   4  298        1        1 DmaCrxDlar(0x30001880)
   5  300        1        1 DmaCrxDtpr(0x300018c0)
   6  302        1        1 DmaCcarxDr(0x300018b0)
   7  304        1        1 DmaCcarxBr(0x301)

mkeeter commented Jul 15, 2023

Two more failures overnight, each of which has the same pattern in the logs.

I also ran a test where I changed the implementation of set_descriptor:

    fn set_descriptor(d: &RxDesc, buffer: *mut [u8; BUFSZ]) {
        d.rdes[0].store(0x123, Ordering::SeqCst); // <-- this is new!

        d.rdes[0].store(buffer as u32, Ordering::Relaxed);
        d.rdes[1].store(0, Ordering::Relaxed);
        d.rdes[2].store(0, Ordering::Relaxed);
        let rdes3 =
            1 << RDES3_OWN_BIT | 1 << RDES3_IOC_BIT | 1 << RDES3_BUF1_VALID_BIT;
        d.rdes[3].store(rdes3, Ordering::Release); // <-- release
    }

After running for a long time (5ish hours), this fails with ETH_DMACCARXBR = 0x123.

This failure only seems possible if (as we suspect) the DMA peripheral reads RDES[0] before RDES[3]. Here's a modified sequence diagram:

User code                    |  DMA
-----------------------------|--------------------------------------
RDES[0] <= 0x123             |
dmb                          |
                             |  read RDES[0] => 0x123
                             |  read RDES[1] => ...
                             |  read RDES[2] => ...
RDES[0] <= buffer            |
RDES[1] <= 0                 |
RDES[2] <= 0                 |
dmb                          |
RDES[3] <= OWN | IOC | BUF1V |
                             |  read RDES[3] => OWN | IOC | BUF1V

@lzrd suggests that the proper fix here is to use two DMA lists, i.e. manipulating the end pointer to serve as a stronger barrier.

In the meantime, I'm going to try patching net to detect and clear this error (and reset the SR flag to restart DMA).

mkeeter commented Jul 15, 2023

One more test: running with DCache entirely disabled

diff --git a/drv/stm32h7-startup/src/lib.rs b/drv/stm32h7-startup/src/lib.rs
index 29572cad..7adc0893 100644
--- a/drv/stm32h7-startup/src/lib.rs
+++ b/drv/stm32h7-startup/src/lib.rs
@@ -143,7 +143,7 @@ pub fn system_init_custom(
     // Turn on CPU I/D caches to improve performance at the higher clock speeds
     // we're about to enable.
     cp.SCB.enable_icache();
-    cp.SCB.enable_dcache(&mut cp.CPUID);
+    //cp.SCB.enable_dcache(&mut cp.CPUID);

     // The Flash controller comes out of reset configured for 3 wait states.
     // That's approximately correct for 64MHz at VOS3, which is fortunate, since

This still reproduces the issue, which continues to line up with my theory.

mkeeter commented Jul 16, 2023

Running with the patch from #1460 succeeded in updating every Gimlet on the dogfood rack; previously, we saw 1-2 failures per round of updates.

Here's a modified sequence diagram showing why this helps:

User code                    |  DMA
-----------------------------|--------------------------------------
                             |  read RDES[0] => 0x301
                             |  read RDES[1] => ...
                             |  read RDES[2] => ...
RDES[0] <= buffer            |
RDES[1] <= 0                 |
RDES[2] <= 0                 |
            ^                |
            |                |
            |                |
            delay            |
            |                | read RDES[3] => 0
            |                |
            V                |
RDES[3] <= OWN | IOC | BUF1V |
                             |
                             |  read RDES[0] => buffer
                             |  read RDES[1] => ...
                             |  read RDES[2] => ...
                             |  read RDES[3] => OWN | IOC | BUF1V

willglynn commented Jul 17, 2023

@mkeeter I think the driver is supposed to manipulate the tail pointer as descriptor ownership changes.

The stm32h7-eth receive tail pointer is presently constant:

    /// Returns a pointer to the byte just past the end of the `RxDesc` ring.
    /// This too gets loaded into the DMA controller, so that it knows what
    /// section of the ring is initialized and can be read. (The answer is "all
    /// of it.")
    pub fn tail_ptr(&self) -> *const RxDesc {
        self.storage.as_ptr_range().end
    }

However, the receive tail pointer register ETH_DMACRXDTPR description links the tail pointer to ownership:

Bits 31:0 RDT[31:0]: Receive Descriptor Tail Pointer
This field contains the tail pointer for the Rx descriptor ring. The software writes the tail pointer to add more descriptors to the Rx channel. The hardware tries to write all received packets to the descriptors referenced between the head and the tail pointer registers.

The DMA status register ETH_DMACSR has a note too:

Bit7 RBU: Receive Buffer Unavailable
This bit indicates that the application owns the next descriptor in the Receive list, and the DMA cannot acquire it. … In ring mode, the application should advance the Receive Descriptor Tail Pointer register of a channel. This bit is set only when the DMA owns the previous Rx descriptor.

RM0433 figure 809 illustrates the tail pointer pointing past the last OWNed descriptor, rather than the end of the ring.
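A sketch of what "tail pointer past the last OWNed descriptor" means in ring-index terms. This is a hypothetical helper (indices standing in for descriptor addresses), not the driver's actual code:

```rust
// Sketch: where the Rx tail pointer should point if it tracks ownership,
// per RM0433 figure 809, instead of the fixed end of the ring.
const RING_LEN: usize = 4;

/// Given which descriptors are currently OWNed by hardware (true = OWN
/// set), scanning from the oldest OWNed slot, return the index one past
/// the last contiguous OWNed descriptor (modulo the ring length).
fn tail_index(owned: &[bool; RING_LEN], first_owned: usize) -> usize {
    let mut idx = first_owned;
    while owned[idx] {
        idx = (idx + 1) % RING_LEN;
        if idx == first_owned {
            break; // every descriptor is OWNed; tail wraps all the way
        }
    }
    idx
}

fn main() {
    // Descriptors 1..=3 handed to hardware, 0 still held by software:
    // the tail should point at slot 0, one past the last OWNed slot.
    assert_eq!(tail_index(&[false, true, true, true], 1), 0);
    // Only descriptor 2 OWNed: tail points at slot 3.
    assert_eq!(tail_index(&[false, false, true, false], 2), 3);
}
```

The current driver instead always reports `storage.as_ptr_range().end`, which never advances as ownership changes hands.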

mkeeter commented Jul 17, 2023

Thanks for taking a look! I agree that it's maybe possible to avoid this situation by carefully manipulating the end pointer.

(I'd also argue that, in principle, it shouldn't be necessary: in a correct hardware implementation, the DMA peripheral would read RDES3 first with Acquire memory ordering, which would make an incorrect read of RDES0 impossible.)

The official drivers don't clarify the situation at all: take a look at ETH_UpdateDescriptor. It writes the Rx descriptor as you'd expect, with a memory barrier before writing RDES3 (which they call DESC3).

However, it then writes the tail pointer to 0 (!) if the available Rx descriptor count has changed 🤯

I suspect this is a dummy write – simply to trigger a Receive Poll demand – and they're not using the address-matching aspect of the DMA peripheral at all.

(As an aside, they do enough work between writing the buffer address and RDES3 that they're probably dodging the race condition window by accident)

willglynn commented Jul 17, 2023

I agree! The official HAL was useless for understanding the reference manual.

They make an off-handed comment in § 58.9.6, explaining how to switch descriptor lists, which makes me think the descriptors are loaded in a not-too-careful manner while RxDMA is running:

Switching to a new descriptor list is different in the Rx DMA compared to the Tx DMA. Switching to a new descriptor list is permitted when the RxDMA is in Suspend state, as explained below:

  • Generally, RxDMA prepares the descriptors in advance.

Presumably the RxDMA won't read descriptors past the tail pointer at all, allowing them to be sloppy with memory ordering as long as it's set.

I can understand why they'd do this at the hardware level: reading RDES[0…3] in that order is a single bus transaction, while reading RDES[3] first isn't. They'd rather suspend the RxDMA – reducing contention on the memory which contains those descriptors – than leave it free-running when it can't make progress. An extra fire-and-forget write to a peripheral register is cheap too, given the Cortex-M7's memory interface.

@mkeeter
Collaborator

mkeeter commented Jul 17, 2023

I opened case #00183800 with ST support, so we'll see what they say!

dancrossnyc added a commit that referenced this issue Aug 14, 2023
At its root, #750 is about a race condition between the DMA
engine and the STM32 Ethernet, and the host CPU.  We can
prevent this race by properly maintaining the Rx tail pointer.

Some background to contextualize this change:

The device and host participate in a simple protocol to drive
a state machine: the host gives descriptors to the device
that specify buffers to DMA incoming Ethernet frames into.
The host gives a descriptor to the device by setting the
ownership bit and setting the tail pointer: this driver
handles the former properly, but does not handle the latter
correctly.

The device consumes these descriptors as frames arrive and
it DMAs them into host memory.  In turn, the device lets the host
know that it is done with a descriptor by doing two things:

1. Resetting the ownership bit in the descriptor, and
2. Sending an interrupt to the host.

Internally, the device _also_ increments its "head" pointer in
the Rx ring (modulo the size of the ring).  If the device hits
the end of the ring (as specified by the value in the Rx tail
pointer register), it goes into a "Suspended" state and stops
DMAing packets to the host; it nominally resumes when the Rx
tail pointer register is written again (if the newly written
tail register value is the same as the old value, it doesn't
appear to continue: note this implies it is not possible to
transition from a _full_ ring directly to an _empty_ ring).

Both device and host maintain (or, perhaps more precisely,
should maintain) two pointers into the ring: a producer pointer,
and a consumer pointer.

One may think of this as the host producing available
descriptors that are consumed by the device, and the device
producing received frames that are consumed by the host.
The tail pointer sets a demarcation point between the host
and device: descriptors before the tail pointer are owned by
the device, those after are owned by the host.  The ownership
bit is used to communicate to the host that the device is
done with a given descriptor.

Put another way, the tail pointer represents the host's producer
pointer; the host's consumer pointer is simply the next
descriptor in the ring after the last one that it processed.
The device's consumer pointer is whatever the host set the tail
pointer to, and its producer pointer is its head pointer.  If
the device encounters a situation where _its_ head and tail
pointer are equal, the ring is full, and the DMA engine enters
the suspended state until the tail pointer is rewritten.

An invariant that both sides of the protocol must maintain is
that the host ensures that its producer pointer (that is, what
it sets the tail pointer to) is always equal to (i.e., the ring
is full) or greater than (i.e., there is space available in the
ring) the device's producer pointer (that is, the device's
"head" pointer).

Of course, we speak of an ordering relationship here, but all
of this is happening modulo the size of the ring; that's a
detail.

Note that there are two cases here:

1. The first time you write to the tail pointer to set
   things in motion at the beginning of time, and
2. Every other time.

For the first time, this change writes the address of the last
valid descriptor in the ring into the tail pointer register (not
one beyond the last!).  Then after consuming and resetting a
descriptor on the host, we write the address of the next valid
descriptor in the ring to the tail pointer, indicating that the
just-processed descriptor is available to the device.  As we do
this for every received frame, we can never starve the ring (at
least, not due to this).

In this scenario, the device follows the host around
the ring; the problem with the first write is solved since,
after the first frame reception, the device's head pointer will
be somewhere beyond the first descriptor in the ring.  Thus,
writing the tail pointer to the address of (say) the 0'th
element is fine, since that won't be equal to the head pointer
on the peripheral.  Similarly, all available descriptors can
be consumed by the device.

Finally, this allows for some simplification with respect to
barriers: since the tail pointer indicates which descriptors
are owned by the device and which by the host, the host can
batch up updates to the descriptors and do a single flush
at the end of its processing, right before writing to the
tail pointer.  This was already done in `rx_notify`.

Note that something similar should probably be done for the
transmission ring, which I suspect is vulnerable to a similar
race between the DMA engine on the device and writes on the
host.

I have validated that, whereas I could reproduce this issue in
a demo branch prior to this change, I cannot reproduce it with
this change.
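The tail-pointer discipline in the commit message above can be modeled on the host side roughly as follows (an illustrative sketch, not the actual Hubris driver; `RxRing` and its fields are invented for the example, with ring indices standing in for descriptor addresses):

```rust
/// Host-side model of the Rx ring protocol: the host consumes
/// descriptors at `next` and publishes free descriptors to the device
/// by writing `tail` to the tail pointer register.
struct RxRing {
    len: usize,  // number of descriptors in the ring
    next: usize, // host consumer pointer: next descriptor to process
    tail: usize, // index last written to the tail pointer register
}

impl RxRing {
    /// Initial tail-pointer write: point at the *last* valid
    /// descriptor, not one beyond it, so we never race the device on
    /// the first descriptor before the first frame arrives.
    fn start(len: usize) -> Self {
        RxRing { len, next: 0, tail: len - 1 }
    }

    /// After consuming and resetting descriptor `next` on the host,
    /// advance the tail pointer just past it, returning that
    /// descriptor to the device.
    fn consume_one(&mut self) {
        self.next = (self.next + 1) % self.len;
        self.tail = self.next;
    }
}
```

Because descriptors before the tail are device-owned, advancing the tail to the slot just past the consumed descriptor is what hands it back; doing this per received frame keeps the device trailing the host around the ring.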
@morlandi7 morlandi7 added this to the 1.0.2 milestone Aug 15, 2023
dancrossnyc added a commit that referenced this issue Aug 15, 2023
dancrossnyc added a commit that referenced this issue Aug 15, 2023
dancrossnyc added a commit that referenced this issue Aug 17, 2023
dancrossnyc added a commit that referenced this issue Aug 17, 2023
At its root, #750 is about a race condition between the DMA
engine and the STM32 Ethernet, and the host CPU.  We can
prevent this race by properly maintaining the Rx tail pointer.

Some background to contextualize this change:

The device and host participate in a simple protocol to drive
a state machine: the host gives descriptors to the device
that specify buffers to DMA incoming Ethernet frames into.
The host gives a descriptor to the device by setting the
ownership bit and setting the tail pointer: this driver
handles the former properly, but does not handle the latter
correctly.

The device consumes these descriptors as frames arrive and
it DMAs them into host memory.  In turn, the device lets the host
know that it is done with a descriptor by doing two things:

1. Resetting the ownership bit in the descriptor, and
2. Sending an interrupt to the host.

Internally, the device _also_ increments its "head" pointer in
the Rx ring (modulo the size of the ring).  If the device hits
the end of the ring (as specified by the value in the Rx tail
pointer register), it goes into a "Suspended" state and stops
DMAing packets to the host; it resumes when the Rx tail pointer
register is written again (note: writing a value that is the
same as the head value on the _first_ write doesn't seem to
start the device).

Both device and host maintain (or, perhaps more precisely,
should maintain) two pointers into the ring: a producer pointer,
and a consumer pointer.

One may think of this as the host producing available
descriptors that are consumed by the device, and the device
producing received frames that are consumed by the host.
The tail pointer sets a demarcation point between the host
and device: descriptors before the tail pointer are owned by
the device, those after are owned by the host.  The ownership
bit is used to communicate to the host that the device is
done with a given descriptor.

Put another way, the tail pointer represents the host's producer
pointer; the host's consumer pointer is simply the next
descriptor in the ring after the last one that it processed.
The device's consumer pointer is whatever the host set the tail
pointer to, and its producer pointer is its head pointer.  If
the device encounters a situation where _its_ head and tail
pointer are equal, the ring is full, and the DMA engine enters
the suspended state until the tail pointer is rewritten.

An invariant that both sides of the protocol must maintain is
that the host ensures that its producer pointer (that is, what
it sets the tail pointer to) is always equal to (i.e., the ring
is full) or greater than (i.e., there is space available in the
ring) the device's producer pointer (that is, the device's
"head" pointer).

Of course, we speak of an ordering relationship here, but all
of this is happening modulo the size of the ring; that's a
detail.

Note that there are two cases here:

1. The first time you write to the tail pointer to set
   things in motion at the beginning of time, and
2. Every other time.

For the first time, this change writes the address of the last
valid descriptor in the ring into the tail pointer register (not
one beyond the last!), as we have observed there is no non-racy
way to provide the full ring to the device immediately after
reset: writing the beginning of the ring to the tail pointer
does not start the peripheral, and writing just beyond the end
of it opens us up to a race against updating the first descriptor
(until the first time the tail pointer is reset).

After consuming and resetting a descriptor on the host, we write
the address of the next descriptor in the ring to the tail
pointer, indicating that the just-processed descriptor is
available to the device.  As we do this for every received frame,
we do not starve the ring.

In this scenario, the device follows the host around the ring;
the problem with the first write is solved since, after the
first frame reception, the device's head pointer will be
somewhere beyond the first descriptor in the ring.  Thus,
writing the tail pointer to the address of (say) the 0'th
element is fine, since that won't be equal to the head pointer
on the peripheral.  Similarly, all available descriptors can
be consumed by the device.

Finally, this allows for some simplification with respect to
barriers: since the tail pointer indicates which descriptors
are owned by the device and which by the host, the host can
batch updates to the descriptors and do a single flush at the
end of its processing, right before writing to the tail pointer.
This was already done in `rx_notify`.

A similar change for the Tx ring has also been sent.

I have validated that, whereas I could reproduce this issue in
a demo branch prior to this change, I cannot reproduce it with
this change.
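The batching point at the end of the commit message can be sketched as follows (hypothetical code, not the actual `rx_notify`; `write_tail_ptr` stands in for the MMIO tail-pointer register write, the fence for the descriptor cache flush, and only the OWN bit of RDES3 is modeled):

```rust
use std::sync::atomic::{fence, Ordering};

/// OWN bit in RDES3: set when the descriptor belongs to the DMA engine.
const OWN: u32 = 1 << 31;

/// Return a batch of consumed Rx descriptors to the device: update them
/// all, issue one Release fence, then make the single tail-pointer
/// write that publishes the whole batch to the device.
fn return_batch(
    descs: &mut [[u32; 4]],
    consumed: &[usize],
    write_tail_ptr: &mut dyn FnMut(usize),
) {
    let mut last = None;
    for &i in consumed {
        descs[i][3] = OWN; // re-arm: descriptor is device-owned again
        last = Some(i);
    }
    fence(Ordering::Release); // one flush for the whole batch
    if let Some(i) = last {
        write_tail_ptr((i + 1) % descs.len()); // tail just past the batch
    }
}
```

The single fence-then-tail-write at the end is what makes the batching safe: the device cannot look at any descriptor in the batch until the tail pointer moves past it.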
dancrossnyc added a commit that referenced this issue Aug 17, 2023
dancrossnyc added a commit that referenced this issue Aug 17, 2023
At its root, #750 is about a race condition between the DMA
engine and the STM32 Ethernet, and the host CPU.  We can
prevent this race by properly maintaining the Rx tail pointer.

Some background information to contextual this change:

The device and host participate in a simple protocol to drive
a state machine: the host gives descriptors to the device
that specify buffers to DMA incoming Ethernet frames into.
The host gives a descriptor to the device by setting the
ownership bit and setting the tail pointer: this driver
handles the former properly, but does not handle the latter
correctly.

The device consumes these descriptors as frames arrive and
it DMAs them host memory.  In turn, the device lets the host
know that it is done with a descriptor by doing two things:

1. Resetting the ownership bit in the descriptor, and
2. Sending an interrupt to the host.

Internally, the device _also_ increments its "head" pointer in
the Rx ring (modulo the size of the ring).  If the device hits
the end of the ring (as specified by the value in the Rx tail
pointer register), it goes into a "Suspended" state and stops
DMAing packets to the host; it resumes when the Rx tail pointer
register is written again (note: writing a value that is the
same as the head value on the _first_ write doesn't seem to
start the device).

Both device and host maintain (or, perhaps more precisely,
should maintain) two pointers into the ring: a producer pointer,
and a consumer pointer.

One may think of this as the host producing available
descriptors that are consumed by the device, and the device
producing received frames that are consumed by the host.
The tail pointer sets a demarkation point between the host
and device: descriptors before the tail pointer are owned by
the device, those after are owned by the host.  The ownership
bit is used to communicate to the host that the device is
done with a given descriptor.

Put another way, the tail pointer represents the host's producer
pointer; the host's consumer pointer is simply the next
descriptor in the ring after the last one that it processed.
The device's consumer pointer is whatever the host set the tail
pointer to, and its producer pointer is its head pointer.  If
the device encounters a situation where _its_ head and tail
pointer are equal, the ring is full, and the DMA engine enters
the suspended state until the tail pointer is rewritten.

An invariant that both sides of the protocol must maintain is
that host software ensures that its producer pointer (that
is, what it sets the tail pointer to) is always equal to
(ie, the ring is full) or greater than (ie, there is space
available in the ring) the device's producer pointer
(that is, the device's "head" pointer).

Of course, we speak of an ordering relationship here while
all of this is happening modulo the size of the ring, but
that's a detail.

Note that there are two cases here:

1. The first time you write to the tail pointer to set
   things in motion at the beginning of time, and
2. Every other time.

For the first time, this change writes the address of the last
valid descriptor in the ring into the tail pointer register (not
one beyond the last!), as we have observed there is no non-racy
way to provide the full ring to the device immediately after
reset: writing the beginning of the ring to the tail pointer
does not start the peripheral, and writing just beyond the end
of it opens us up to a race against updating the first descriptor
(until the first time the tail pointer is reset).

After consuming and resetting a descriptor on the host, we write
the address of the next descriptor in the ring to the tail
pointer, indicating that the just-processed descriptor is
available to the device.  As we do this for every received frame,
we do not starve the ring.

In this scenario, the device follows the host around the ring;
the problem with the first write is solved since, after the
first frame reception, the device's head pointer will be
somewhere beyond the first descriptor in the ring.  Thus,
writing the tail pointer to the address of (say) the 0'th
element is fine, since that won't be equal to the head pointer
on the peripheral.  Similarly, all available descriptors can
be consumed by the device.

Finally, this allows for some simplification with respect to
barriers: since the tail pointer indicates which descriptors
are owned by the device and which by the host, the host can
batch updates to the descriptors and do a single flush at the
end of its processing, right before writing to the tail pointer.
This was already done in `rx_notify`.

A similar change for the Tx ring has also been sent.

I have validated that, whereas I could reproduce this issue in
a demo branch prior to this change, I cannot reproduce it with
this change.
dancrossnyc added a commit that referenced this issue Aug 17, 2023
dancrossnyc added a commit that referenced this issue Aug 17, 2023
cbiffle added a commit that referenced this issue Feb 21, 2024
Now that #750 is closed and we're pretty confident in the fix, we can
stop rebooting the netstack periodically. This should save resources (by
not constantly dumping and restarting), and also get us back into the
state where any task's generation number advancing is Probably Bad.
cbiffle added a commit that referenced this issue Feb 21, 2024
cbiffle added a commit that referenced this issue Feb 22, 2024
cbiffle added a commit that referenced this issue Feb 22, 2024