nrf5x: Fix EP OUT race conditions #1279

kasjer · 2022-01-14T09:03:35Z

Describe the PR
When dcd_edpt_xfer() starts a new transfer, two separate problems were observed.
In both cases the host was sending a continuous stream of OUT packets.

The first problem was that total_len and actual_len were not updated atomically.
When an incoming OUT packet is shorter (63 bytes) than the MPS (64), actual_len and total_len
are both set to 63.
The transfer-complete event is then processed by USBD, which schedules the next 64-byte transfer.
At that point an incoming packet would start DMA if there were room in RAM; normally
this does not happen because actual_len == total_len.
If a packet arrives and the interrupt is raised after total_len has been set (64) but while actual_len is still 63 from
the previous transfer, the interrupt code sees room in RAM (1 byte) and transfers that 1 byte
into a buffer that was already filled by the previous packet.
To remedy this, the USB interrupt is blocked during transfer setup.
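
The window described above can be reproduced in a small host-side model. This is only a sketch under stated assumptions: sim_xfer_t, sim_isr(), sim_packet_arrives() and sim_setup() are hypothetical names invented for illustration, and the "interrupt" is an ordinary function call placed at the worst-case point between the two stores. It is not the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of the race; not the nrf5x driver code. */
typedef struct {
  uint16_t total_len;   /* bytes requested for the current transfer */
  uint16_t actual_len;  /* bytes already moved to RAM */
} sim_xfer_t;

static bool     irq_masked;   /* stands in for the USBD interrupt being disabled */
static bool     irq_pending;  /* a packet arrived while the interrupt was masked */
static uint16_t dma_bytes;    /* bytes moved by the most recent simulated ISR run */

/* Simulated ISR: moves as much of the packet as the transfer still has room for. */
static void sim_isr(sim_xfer_t *x, uint16_t pkt_len) {
  assert(x->total_len >= x->actual_len);
  uint16_t room = (uint16_t)(x->total_len - x->actual_len);
  dma_bytes = (pkt_len < room) ? pkt_len : room;
  x->actual_len = (uint16_t)(x->actual_len + dma_bytes);
}

/* A packet arrives: run the ISR immediately, or latch it if masked. */
static void sim_packet_arrives(sim_xfer_t *x, uint16_t pkt_len) {
  if (irq_masked) { irq_pending = true; return; }
  sim_isr(x, pkt_len);
}

/* Start a new transfer of `len` bytes while a 64-byte packet arrives between
   the two stores. Returns the bytes moved while actual_len was still stale. */
static uint16_t sim_setup(sim_xfer_t *x, uint16_t len, bool guard) {
  dma_bytes   = 0;
  irq_pending = false;
  irq_masked  = guard;          /* guard == true models blocking the interrupt */

  x->total_len = len;           /* first store */
  sim_packet_arrives(x, 64);    /* worst-case arrival point */
  x->actual_len = 0;            /* second store */

  uint16_t stray = dma_bytes;   /* nonzero only if the ISR ran in the window */
  irq_masked = false;           /* models re-enabling the interrupt */
  if (irq_pending) {            /* deferred ISR now sees consistent state */
    irq_pending = false;
    sim_isr(x, 64);
  }
  return stray;
}
```

Starting from total_len == actual_len == 63, calling sim_setup(&x, 64, false) moves one stray byte, while sim_setup(&x, 64, true) moves none and the deferred packet is accounted for in full.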

The second problem can happen when dcd_edpt_xfer() sets up total_len and actual_len correctly,
but a context switch (or an interrupt) happens before xfer->data_received is checked.
If two packets arrive during this window, one is copied to RAM and the second stays in the endpoint with
data_received set to 1.
When dcd_edpt_xfer() then checks the data_received flag, it starts DMA again and overwrites the data.
To remedy this, data_received is checked together with whether the transfer has already received all of its data.
If the transfer is already complete, there is no need to start DMA yet.
In that case data_received is handled correctly in the same place by the next dcd_edpt_xfer() call.
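
The difference between the old and new check can be seen in a small model of the state after two packets have arrived. This is a sketch only: out_xfer_model_t and the three helper functions are hypothetical names, and only the condition in start_dma_new() is taken from the change in this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the transfer descriptor fields named in the PR. */
typedef struct {
  uint16_t total_len;
  uint16_t actual_len;
  bool     data_received;  /* a packet is waiting in the endpoint */
} out_xfer_model_t;

/* Check before the fix: any waiting packet starts DMA. */
static bool start_dma_old(const out_xfer_model_t *x) {
  return x->data_received;
}

/* Check after the fix: DMA starts only if the transfer still has room. */
static bool start_dma_new(const out_xfer_model_t *x) {
  return x->data_received && x->total_len > x->actual_len;
}

/* State seen by the flag check when two packets arrived before it ran. */
static out_xfer_model_t two_packets_before_check(uint16_t len) {
  assert(len > 0);         /* zero-length transfers are not the modelled case */
  out_xfer_model_t x = { len, 0, false };
  x.actual_len    = len;   /* first packet copied to RAM, transfer complete */
  x.data_received = true;  /* second packet still waits in the endpoint */
  return x;
}
```

With a 64-byte transfer, the old check would start DMA over the completed buffer, while the new check leaves the waiting packet for the next transfer.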

Additional context
Both problems were discovered while stress testing BTH with a massive amount of ACL data going
through the OUT endpoint.
It should be possible to reproduce them with any other source of a steady flow of non-MPS packets.
In my case it was triggered by sending a constant packet stream in a 64-64-64-63 packet sequence.

kasjer · 2022-01-15T18:15:57Z

It turns out that this is not enough to handle the races.
Another problem arises when two tasks are involved.
The first task blocks the USB interrupt to set total_len and clear actual_len, but between those two instructions
a second task enters the same function for another endpoint and, in doing so, enables the interrupt again
before the first task has cleared actual_len, so the first problem shows up again.

To fix this, one of the following is needed:

a CPU critical section should block interrupts during buffer, total_len and actual_len setup,
or disabling the interrupt should use reference counting,
or a critical section could guard the section that disables/enables the interrupt
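
As an illustration of the reference-counting option, here is a minimal single-threaded sketch. All names are hypothetical and the enable bit is a plain variable standing in for the real interrupt mask; the thread does not say which of the three options was finally adopted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; a single-threaded model of nested disable/enable. */
static bool     usb_irq_enabled = true;  /* stands in for the interrupt enable bit */
static unsigned usb_irq_disable_depth;   /* number of outstanding disables */

static void usb_irq_disable_nested(void) {
  usb_irq_enabled = false;               /* mask first, then record the nesting */
  usb_irq_disable_depth++;
}

static void usb_irq_enable_nested(void) {
  assert(usb_irq_disable_depth > 0);     /* an unbalanced enable is a bug */
  if (--usb_irq_disable_depth == 0) {
    usb_irq_enabled = true;              /* only the outermost enable unmasks */
  }
}
```

With this scheme a second task that disables and re-enables the interrupt while the first task is still in its setup window leaves the interrupt masked. On a real preemptive system the counter update itself would have to be atomic, which is where the third option (a critical section around the disable/enable code) comes in.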

hathach

Thank you very much for putting in the effort to fix this race condition. I have had a hard time troubleshooting this before as well and have done a few PRs to reduce the races. All the changes make sense; they look small but have a huge impact, and this kind of bug is very difficult to track down.

hathach · 2022-01-19T03:25:41Z

src/portable/nordic/nrf5x/dcd_nrf5x.c

@@ -453,9 +453,11 @@ bool dcd_edpt_xfer (uint8_t rhport, uint8_t ep_addr, uint8_t * buffer, uint16_t

  xfer_td_t* xfer = get_td(epnum, dir);

+  dcd_int_disable(rhport);


Oh, I think I did put the int disable/enable here before, but somehow it got reverted.

hathach · 2022-01-19T03:29:04Z

src/portable/nordic/nrf5x/dcd_nrf5x.c

@@ -476,7 +478,7 @@ bool dcd_edpt_xfer (uint8_t rhport, uint8_t ep_addr, uint8_t * buffer, uint16_t
      edpt_dma_start(&NRF_USBD->TASKS_EP0RCVOUT);
    }else
    {
-      if ( xfer->data_received )
+      if ( xfer->data_received && xfer->total_len > xfer->actual_len)


Yeah, right, I feel that data_received alone is not enough as well.

kasjer added the Port nRF label Jan 14, 2022

hathach approved these changes Jan 19, 2022

hathach merged commit 983abfd into hathach:master Jan 19, 2022

kasjer deleted the kasjer/nrf5x-int-race branch January 19, 2022 07:20
