net can become wedged and stop receiving packets (#750)

Comments
Humility dump from the gimlet: net-task-stuck-is-mgmt-gateway-the-culprit.zip. If it would be helpful, I can get a dump from my gimletlet when it's similarly wedged.
The device memory mapping on ARMv7-M is generally more for memory-mapped registers; the DMA attribute we set translates to inner/outer non-cacheable + shared, which is likely what we want here.

I'm pleased to hear that changing the orderings to SeqCst didn't change the behavior; we worked pretty hard on getting the existing orderings right, and there's a comment in there somewhere discussing the rationale. A bug there would signal some intrinsic problems with our approach to DMA here.

DSB should be overkill; the atomic fences generate DMB, which is lighter weight (it only synchronizes explicit memory accesses rather than also synchronizing the execution pipeline). DSB is mostly necessary in cases where you've changed aspects of, say, memory protection in ways that don't require a full-on prefetch-flushing ISB.

I suspect that one of two things is happening:
Unfortunately the dump doesn't include the MAC register state (and rightfully so, as Humility has no idea which registers are read-sensitive!), which is what we'd need to diagnose case 2. So I'll have to try to repro this locally once I find my NIC.
Humility memory dumps from a gimlet in this stuck state:
Larger dumps attached:
#867 is a partial fix for this, but panics the
Update from the future: the STM32H7 Ethernet MAC has a difficult-to-nail-down issue where incoming packets stop transferring from the MTL to MAC, with no apparent status bits to indicate this case and distinguish it from "the NIC is very busy." So the watchdog in #867 is a workaround. It's currently not clear that we can do anything more without fabbing a chip, so I'm going to close this for now as worked-around.
The workaround is causing issues on the dogfood rack (https://github.com/oxidecomputer/meta/issues/225), which is understandable: the

Although we could work around this by tweaking watchdogs, I'm reopening this issue in hopes of making some progress towards a true root cause.

First, an observation: the previous implementation of

I suspect that we were misled by this in the past, and spent time poking at systems that weren't actually stuck. I've updated this check to be more conservative (requiring 10x consecutive failures) in the

There's a bright side: in a test with the conservative metric, the system, once stuck, did show errors!
Despite my deep existential dread, this is a great result: if we can actually notice that the system is stuck, we can fix it. Sure enough, writing `0x00010c01` to

Next steps are running a full dogfood rack update with the

If we get lucky, we'll see the usual 1-2 failures, and will be able to read those dumps and see if they show the same characteristic errors.
I have realized, to my great dismay, that `0x301` is our VLAN VID. Changing the VLAN VID to a different value changes the value in

So, why the heck are we treating our VID as a DMA buffer address? When we receive a packet, it's written to the DMA Rx descriptor in "write-back format" (see Fig 803). In this format, the first word (

This behavior is compatible with the DMA peripheral deciding to read the VID as its memory buffer. It's not clear why it would do such a thing.

```rust
/// Programs the words in `d` to prepare to receive into `buffer` and sets
/// `d` accessible to hardware. The final write to make it accessible is
/// performed with Release ordering to get a barrier.
fn set_descriptor(d: &RxDesc, buffer: *mut [u8; BUFSZ]) {
    d.rdes[0].store(buffer as u32, Ordering::Relaxed);
    d.rdes[1].store(0, Ordering::Relaxed);
    d.rdes[2].store(0, Ordering::Relaxed);
    let rdes3 =
        1 << RDES3_OWN_BIT | 1 << RDES3_IOC_BIT | 1 << RDES3_BUF1_VALID_BIT;
    d.rdes[3].store(rdes3, Ordering::Release); // <-- release
}
```

This code assigns the buffer before setting

Looking at the assembly, it also looks reasonable:
The
It looks like

However, if the DMA peripheral reads the Rx descriptor naively (in the order 0-3), then there's a sequence of operations that could lead to what we're seeing:

The store to
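The naive in-order descriptor read described above can be re-enacted as a purely sequential sketch. This is a hypothesis, not real concurrency; the addresses and the two-word layout are illustrative stand-ins for RDES0/RDES3:

```rust
// Sequential re-enactment of the suspected interleaving: the DMA engine
// reads descriptor words in order 0..3, and its read of word 0 lands
// *before* the host rewrites it.
fn race_reenactment() -> u32 {
    // Descriptor left in write-back format by the previous reception:
    // RDES0 holds the VLAN tag (our VID, 0x301), not a buffer address.
    let mut rdes = [0x0000_0301u32, 0, 0, 0];

    // 1. The device starts re-reading the descriptor and latches RDES0.
    let device_rdes0 = rdes[0];

    // 2. The host now rewrites the descriptor, as set_descriptor() does:
    //    buffer address first, OWN bit last (Release-ordered on hardware).
    rdes[0] = 0x2400_1000; // new buffer address (illustrative)
    rdes[3] = 1 << 31; // OWN bit set

    // 3. The device reads RDES3, sees OWN set, and proceeds with the RDES0
    //    value it already latched: the stale VLAN tag.
    assert!(rdes[3] & (1 << 31) != 0);
    device_rdes0
}
```

Under this interleaving the device DMAs into address 0x301, which is exactly the symptom observed.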
We hit this on sled 8 while mupdating dogfood this evening with your debug branch (#1459); I put the
Two more failures overnight, each of which has the same pattern in the logs. I also ran a test where I changed the implementation of `set_descriptor`:

```rust
fn set_descriptor(d: &RxDesc, buffer: *mut [u8; BUFSZ]) {
    d.rdes[0].store(0x123, Ordering::SeqCst); // <-- this is new!
    d.rdes[0].store(buffer as u32, Ordering::Relaxed);
    d.rdes[1].store(0, Ordering::Relaxed);
    d.rdes[2].store(0, Ordering::Relaxed);
    let rdes3 =
        1 << RDES3_OWN_BIT | 1 << RDES3_IOC_BIT | 1 << RDES3_BUF1_VALID_BIT;
    d.rdes[3].store(rdes3, Ordering::Release); // <-- release
}
```

After running for a long time (5ish hours), this fails with

This failure only seems possible if (as we suspect) the DMA peripheral reads RDES[0] before RDES[3]. Here's a modified sequence diagram:
@lzrd suggests that the proper fix here is to use two DMA lists, i.e., manipulating the end pointer to serve as a stronger barrier. In the meantime, I'm going to try patching
One more test: running with DCache entirely disabled:

```diff
diff --git a/drv/stm32h7-startup/src/lib.rs b/drv/stm32h7-startup/src/lib.rs
index 29572cad..7adc0893 100644
--- a/drv/stm32h7-startup/src/lib.rs
+++ b/drv/stm32h7-startup/src/lib.rs
@@ -143,7 +143,7 @@ pub fn system_init_custom(
     // Turn on CPU I/D caches to improve performance at the higher clock speeds
     // we're about to enable.
     cp.SCB.enable_icache();
-    cp.SCB.enable_dcache(&mut cp.CPUID);
+    //cp.SCB.enable_dcache(&mut cp.CPUID);
     // The Flash controller comes out of reset configured for 3 wait states.
     // That's approximately correct for 64MHz at VOS3, which is fortunate, since
```

This still reproduces the issue, which continues to line up with my theory.
Running with the patch from #1460 succeeded in updating every Gimlet on the dogfood rack; previously, we saw 1-2 failures per round of updates. Here's a modified sequence diagram showing why this helps:
@mkeeter I think the driver is supposed to manipulate the tail pointer as descriptor ownership changes. The

```rust
/// Returns a pointer to the byte just past the end of the `RxDesc` ring.
/// This too gets loaded into the DMA controller, so that it knows what
/// section of the ring is initialized and can be read. (The answer is "all
/// of it.")
pub fn tail_ptr(&self) -> *const RxDesc {
    self.storage.as_ptr_range().end
}
```

However, the receive tail pointer register
The DMA status register
RM0433 figure 809 illustrates the tail pointer pointing past the last
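For concreteness, the tail-pointer address arithmetic can be sketched as follows. The geometry here is assumed purely for illustration (four-word, 16-byte descriptors; a 4-entry ring; an arbitrary base address), not the driver's actual configuration:

```rust
// Sketch of Rx ring address math. All constants are illustrative.
const DESC_BYTES: u32 = 16; // four u32 words per RDES descriptor
const RING_LEN: u32 = 4;
const RING_BASE: u32 = 0x3000_0000;

/// Address of descriptor `i`, wrapping around the ring.
fn desc_addr(i: u32) -> u32 {
    RING_BASE + (i % RING_LEN) * DESC_BYTES
}

/// The "one past the end" value that tail_ptr() currently computes,
/// matching the depiction of the tail pointer past the last descriptor.
fn tail_past_end() -> u32 {
    RING_BASE + RING_LEN * DESC_BYTES
}
```

The distinction under discussion is whether the register should hold `tail_past_end()` permanently, or `desc_addr(i)` for the descriptor most recently returned to the device.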
Thanks for taking a look! I agree that it's maybe possible to avoid this situation by carefully manipulating the end pointer. (I also object that, in principle, it shouldn't be necessary; in a correct hardware implementation, the DMA peripheral would read RDES3 first with Acquire memory ordering, which would make an incorrect read of RDES0 impossible.)

The official drivers don't clarify the situation at all: take a look at

However, it then writes the tail pointer to 0 (!) if the available Rx descriptor count has changed 🤯 I suspect this is a dummy write, simply to trigger a Receive Poll demand, and they're not using the address-matching aspect of the DMA peripheral at all. (As an aside, they do enough work between writing the buffer address and RDES3 that they're probably dodging the race-condition window by accident.)
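The Acquire-first read that a correct hardware implementation would perform can be modeled entirely on the host side in Rust. This is a sketch of the ordering argument, not driver code; the two-word "descriptor" and the address `0x2400_0000` are illustrative stand-ins for RDES0/RDES3:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Consumer polls the OWN word with Acquire and only then reads the
// buffer-address word; paired with the producer's Release store, a stale
// address can never be observed.
fn acquire_release_pairing() -> u32 {
    let desc = Arc::new([AtomicU32::new(0), AtomicU32::new(0)]);
    let d = Arc::clone(&desc);

    let device = thread::spawn(move || {
        // "Device": Acquire-load the OWN word (bit 31) first...
        while d[1].load(Ordering::Acquire) & (1 << 31) == 0 {
            std::hint::spin_loop();
        }
        // ...then the address; the Acquire/Release pair guarantees this
        // sees the producer's earlier Relaxed store.
        d[0].load(Ordering::Relaxed)
    });

    // "Host", as in set_descriptor(): address first (Relaxed), then the
    // OWN bit with Release ordering.
    desc[0].store(0x2400_0000, Ordering::Relaxed);
    desc[1].store(1 << 31, Ordering::Release);

    device.join().unwrap()
}
```

The host-side code already does its half of this pairing; the complaint above is that the peripheral does not do the consumer's half.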
I agree! The official HAL was useless in trying to understand the reference. They make an offhand comment in § 58.9.6, explaining how to switch descriptor lists, which makes me think the descriptors are loaded in a not-too-careful manner while RxDMA is running:

Presumably the RxDMA won't read descriptors past the tail pointer at all, allowing them to be sloppy with memory ordering as long as it's set. I can understand why they'd do this at the hardware level: reading
I opened case #00183800 with ST support, so we'll see what they say!
At its root, #750 is about a race condition between the DMA engine in the STM32 Ethernet peripheral and the host CPU. We can prevent this race by properly maintaining the Rx tail pointer. Some background to contextualize this change:

The device and host participate in a simple protocol to drive a state machine: the host gives descriptors to the device that specify buffers to DMA incoming Ethernet frames into. The host gives a descriptor to the device by setting the ownership bit and setting the tail pointer; this driver handles the former properly, but does not handle the latter correctly. The device consumes these descriptors as frames arrive and DMAs them to host memory. In turn, the device lets the host know that it is done with a descriptor by doing two things: 1. resetting the ownership bit in the descriptor, and 2. sending an interrupt to the host. Internally, the device _also_ increments its "head" pointer in the Rx ring (modulo the size of the ring). If the device hits the end of the ring (as specified by the value in the Rx tail pointer register), it goes into a "Suspended" state and stops DMAing packets to the host; it nominally resumes when the Rx tail pointer register is written again. (If the newly written tail value is the same as the old value, it doesn't appear to continue; note this implies it is not possible to transition from a _full_ ring directly to an _empty_ ring.)

Both device and host maintain (or, perhaps more precisely, should maintain) two pointers into the ring: a producer pointer and a consumer pointer. One may think of this as the host producing available descriptors that are consumed by the device, and the device producing received frames that are consumed by the host. The tail pointer sets a demarcation point between the host and device: descriptors before the tail pointer are owned by the device; those after are owned by the host. The ownership bit is used to communicate to the host that the device is done with a given descriptor.

Put another way, the tail pointer represents the host's producer pointer; the host's consumer pointer is simply the next descriptor in the ring after the last one that it processed. The device's consumer pointer is whatever the host set the tail pointer to, and its producer pointer is its head pointer. If the device encounters a situation where _its_ head and tail pointer are equal, the ring is full, and the DMA engine enters the suspended state until the tail pointer is rewritten. An invariant that both sides of the protocol must maintain is that host software ensures that its producer pointer (that is, what it sets the tail pointer to) is always equal to (i.e., the ring is full) or greater than (i.e., there is space available in the ring) the device's producer pointer (that is, the device's "head" pointer). Of course, we talk about an ordering relationship here even though all of this is happening modulo the size of the ring, but that's a detail.

Note that there are two cases here: 1. the first time you write to the tail pointer to set things in motion at the beginning of time, and 2. every other time. For the first time, this change writes the address of the last valid descriptor in the ring into the tail pointer register (not one beyond the last!). Then, after consuming and resetting a descriptor on the host, we write the address of the next valid descriptor in the ring to the tail pointer, indicating that the just-processed descriptor is available to the device. As we do this for every received frame, we can never starve the ring (at least, not due to this). In this scenario, the device follows the host around the ring; the problem with the first write is solved since, after the first frame reception, the device's head pointer will be somewhere beyond the first descriptor in the ring.

Thus, writing the tail pointer to the address of (say) the 0th element is fine, since that won't be equal to the head pointer on the peripheral. Similarly, all available descriptors can be consumed by the device. Finally, this allows for some simplification with respect to barriers: since the tail pointer indicates which descriptors are owned by the device and which by the host, the host can batch up updates to the descriptors and do a single flush at the end of its processing, right before writing to the tail pointer. This was already done in `rx_notify`. Note that something similar should probably be done for the transmission ring, which I suspect is vulnerable to a similar race between the DMA engine on the device and writes on the host.

I have validated that, whereas I could reproduce this issue in a demo branch prior to this change, I cannot reproduce it with this change.
At its root, #750 is about a race condition between the DMA engine in the STM32 Ethernet peripheral and the host CPU. We can prevent this race by properly maintaining the Rx tail pointer. Some background to contextualize this change:

The device and host participate in a simple protocol to drive a state machine: the host gives descriptors to the device that specify buffers to DMA incoming Ethernet frames into. The host gives a descriptor to the device by setting the ownership bit and setting the tail pointer; this driver handles the former properly, but does not handle the latter correctly. The device consumes these descriptors as frames arrive and DMAs them to host memory. In turn, the device lets the host know that it is done with a descriptor by doing two things: 1. resetting the ownership bit in the descriptor, and 2. sending an interrupt to the host. Internally, the device _also_ increments its "head" pointer in the Rx ring (modulo the size of the ring). If the device hits the end of the ring (as specified by the value in the Rx tail pointer register), it goes into a "Suspended" state and stops DMAing packets to the host; it resumes when the Rx tail pointer register is written again (note: writing a value that is the same as the head value on the _first_ write doesn't seem to start the device).

Both device and host maintain (or, perhaps more precisely, should maintain) two pointers into the ring: a producer pointer and a consumer pointer. One may think of this as the host producing available descriptors that are consumed by the device, and the device producing received frames that are consumed by the host. The tail pointer sets a demarcation point between the host and device: descriptors before the tail pointer are owned by the device; those after are owned by the host. The ownership bit is used to communicate to the host that the device is done with a given descriptor.

Put another way, the tail pointer represents the host's producer pointer; the host's consumer pointer is simply the next descriptor in the ring after the last one that it processed. The device's consumer pointer is whatever the host set the tail pointer to, and its producer pointer is its head pointer. If the device encounters a situation where _its_ head and tail pointer are equal, the ring is full, and the DMA engine enters the suspended state until the tail pointer is rewritten. An invariant that both sides of the protocol must maintain is that host software ensures that its producer pointer (that is, what it sets the tail pointer to) is always equal to (i.e., the ring is full) or greater than (i.e., there is space available in the ring) the device's producer pointer (that is, the device's "head" pointer). Of course, we talk about an ordering relationship here even though all of this is happening modulo the size of the ring, but that's a detail.

Note that there are two cases here: 1. the first time you write to the tail pointer to set things in motion at the beginning of time, and 2. every other time. For the first time, this change writes the address of the last valid descriptor in the ring into the tail pointer register (not one beyond the last!), as we have observed there is no non-racy way to provide the full ring to the device immediately after reset: writing the beginning of the ring to the tail pointer does not start the peripheral, and writing just beyond the end of it opens us up to a race against updating the first descriptor (until the first time the tail pointer is reset). After consuming and resetting a descriptor on the host, we write the address of the next descriptor in the ring to the tail pointer, indicating that the just-processed descriptor is available to the device. As we do this for every received frame, we do not starve the ring.

In this scenario, the device follows the host around the ring; the problem with the first write is solved since, after the first frame reception, the device's head pointer will be somewhere beyond the first descriptor in the ring. Thus, writing the tail pointer to the address of (say) the 0th element is fine, since that won't be equal to the head pointer on the peripheral. Similarly, all available descriptors can be consumed by the device. Finally, this allows for some simplification with respect to barriers: since the tail pointer indicates which descriptors are owned by the device and which by the host, the host can batch updates to the descriptors and do a single flush at the end of its processing, right before writing to the tail pointer. This was already done in `rx_notify`. A similar change for the Tx ring has also been sent.

I have validated that, whereas I could reproduce this issue in a demo branch prior to this change, I cannot reproduce it with this change.
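The head/tail protocol described above can be condensed into a toy model. It uses ring indices rather than addresses, and the 4-entry geometry in the usage below is illustrative, not the driver's actual types:

```rust
// Toy model of the Rx ring ownership protocol: the device fills descriptors
// up to, but not including, the tail, and suspends when head reaches tail.
struct RxRing {
    len: usize,
    head: usize, // device's producer pointer: next descriptor it will fill
    tail: usize, // host's producer pointer: first descriptor NOT yet given to the device
}

impl RxRing {
    fn suspended(&self) -> bool {
        self.head == self.tail
    }

    /// Device receives one frame into the descriptor at `head`.
    fn device_receive(&mut self) {
        assert!(!self.suspended(), "ring full: DMA engine is suspended");
        self.head = (self.head + 1) % self.len;
    }

    /// Host finishes processing one frame and hands its descriptor back by
    /// advancing the tail past it, as this change does per received frame.
    fn host_recycle(&mut self) {
        self.tail = (self.tail + 1) % self.len;
    }
}
```

Starting with `tail` at the last valid descriptor (index 3 of 4), the device can receive three frames, then suspends until the host recycles a descriptor and rewrites the tail; this mirrors the invariant that the host's producer pointer never falls behind the device's head.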
Now that #750 is closed and we're pretty confident in the fix, we can stop rebooting the netstack periodically. This should save resources (by not constantly dumping and restarting), and also get us back into the state where any task's generation number advancing is Probably Bad.
Yesterday while testing host boot flash operations triggered by `mgmt-gateway`, @mkeeter and I ran into two bugs:

1. `poll_for_write_complete` busy waits after issuing the command to the qspi driver. I've seen bulk erase take > 1 minute on a gimlet, and the comments say it can take as long as 8 minutes; during that time, the `hf` task starves all lower priority tasks (including `net`).
2. `net` was in a strangely wedged state. Client tasks could still send packets (we were seeing `udpbroadcast` data), but as far as we could tell no packets were being received. `udpecho` was not receiving notifications of incoming packets, and `net` was not responding to NDP messages.

Fixing 1 should be straightforward, but I haven't done so yet because I wanted to address 2 first. I took a humility dump of the SP while `net` was in this state (attached to the first comment on this issue). After restarting `net` via jefe, it came back healthy.

I've pushed a debug-net-wedge branch that can relatively reliably wedge the `net` task in the same way on my gimletlet. It adds two tasks:

- `busywait` runs at a relatively high priority. When it receives a message, it busy waits for 5,000 ticks and then responds.
- `udpecho-slowly` acts like `udpecho`, unless the packet it receives begins with an ASCII `1`. For such packets, it first messages `busywait` (producing the same situation we saw with hostflash starving net).

The branch also has a `debug-net-client` crate that spins messaging `udpecho-slowly` forever. To run it, cd into its directory and then run it, but with the address of your gimletlet. When everything is working, the output looks similar to this:
After somewhere between 1 and 15 minutes (usually around 5 for me), you should see fewer than 4 responses and an unending sequence of "no response" logs; e.g.,
At this point, `net` is almost certainly wedged; you can confirm by trying to talk to the normal `udpecho` task on port 7.

Even after poking at this for the better part of a day and a half, I know frustratingly little. In both the gimlet dump and on my gimletlet, the four `RDES3` values of `RX_DESC` are all `0xc1000000`, which is the value we set them to to give ownership back to the hardware. This leads me to believe we think we've told the hardware to give us packets, but it doesn't realize it.

Sometimes when I add ringbuf traces (depending on where they are) I'm no longer able to reproduce the wedge. One set of ringbuf traces I collected was aimed at figuring out if we were somehow failing to call `rx_notify()` after clearing space in the RX buffer, but based on the traces it looks like we did indeed call `rx_notify()`, yet still stopped receiving packets.

(I can send a full dump of these traces, but am not sure how useful the earlier portions are, so I've omitted them for now.) I have not been able to get crystal clear traces on our ethernet interrupts, but from what I have gathered, nothing seems amiss there. I don't want to discourage any particular line of thought since I still have no idea what the root cause is, but if there's something wrong with the interrupt handling it's extremely subtle.
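For what it's worth, the observed `0xc1000000` decomposes cleanly into descriptor flag bits. The bit positions below follow my reading of the STM32H7 Ethernet DMA normal read-descriptor layout; treat them as assumptions to be checked against the reference manual rather than driver source.

```rust
// Decomposing the observed RDES3 value. Bit positions are assumed from the
// STM32H7 Ethernet DMA "normal read descriptor" format (RM0433); verify
// against your reference manual.
const RDES3_OWN: u32 = 1 << 31;   // descriptor is owned by the DMA engine
const RDES3_IOC: u32 = 1 << 30;   // interrupt on completion
const RDES3_BUF1V: u32 = 1 << 24; // buffer 1 address is valid

fn main() {
    let observed: u32 = 0xc100_0000;
    // Every descriptor is marked hardware-owned with a valid buffer, so the
    // host side of the handshake looks right; the MAC just isn't consuming.
    assert_eq!(observed, RDES3_OWN | RDES3_IOC | RDES3_BUF1V);
    println!("{observed:#010x} = OWN | IOC | BUF1V");
}
```

If that decomposition is right, it supports the reading that the descriptors really were handed back and the DMA engine simply isn't advancing.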
At this point I suspect some kind of memory synchronization issue between hubris and the ethernet device, possibly related to #513? But I tried a few different things aimed at narrowing that down, and was still able to reliably wedge `net`. The things I've tried that still leave `net` reliably wedgeable are:

- Changing `ServerImpl::poll()` to repeat if an interface encounters a packet intended for another valid vid
- Changing the atomic orderings to `Ordering::SeqCst`
- Replacing the fences in `{rx,tx}_notify()` with calls to `cortex_m::asm::dsb()` (a la https://github.com/stm32-rs/stm32h7xx-hal/blob/master/src/ethernet/eth.rs#L315-L317)

The things I've tried that appear to make the bug go away, or at least make it hard enough to reproduce that my current technique isn't good enough:

- Disabling the `vlan` feature. Update: I left this running for much longer, and I eventually reproduced the wedge on my gimletlet with `vlan` disabled; `debug-net-client` ran for 1 hr 8 minutes before triggering the wedge.
- Adding ringbuf traces in `rx_notify()` that record the four RDES3 words before the fence (NOTE: I have not let this run for an extended period of time; until seeing the vlan wedge after an hour, the longest I had had to wait was 15 minutes, and this ran for about 20 without wedging.)
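The last diagnostic above could be modeled roughly like this. Everything here is a stand-in for the real driver: the ringbuf is replaced by a returned array, the descriptors by plain atomics, and the snapshot point (just before the release fence) is the only detail taken from the description.

```rust
use core::sync::atomic::{fence, AtomicU32, Ordering};

// Hypothetical host-runnable sketch of the diagnostic: snapshot the four
// RDES3 words immediately before the fence in an rx_notify()-like path.
fn snapshot_rdes3(ring: &[AtomicU32; 4]) -> [u32; 4] {
    let words = [
        ring[0].load(Ordering::Relaxed),
        ring[1].load(Ordering::Relaxed),
        ring[2].load(Ordering::Relaxed),
        ring[3].load(Ordering::Relaxed),
    ];
    // In the real driver these words would be recorded to a ringbuf entry
    // here, then followed by the existing fence and tail-pointer write.
    fence(Ordering::Release);
    words
}

fn main() {
    // All four descriptors in the observed wedged state.
    let ring = [
        AtomicU32::new(0xc100_0000),
        AtomicU32::new(0xc100_0000),
        AtomicU32::new(0xc100_0000),
        AtomicU32::new(0xc100_0000),
    ];
    let snap = snapshot_rdes3(&ring);
    assert!(snap.iter().all(|&w| w == 0xc100_0000));
}
```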