Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nrf5x: Fix EP OUT race conditions #1279

Merged
merged 1 commit into from Jan 19, 2022

Conversation

kasjer
Copy link
Collaborator

@kasjer kasjer commented Jan 14, 2022

Describe the PR
When dcd_edpt_xfer() starts new transfer two separate problems were observed.
For both problems stream of OUT packets was pouring from host.

First problem was that total_len and actual_len were not atomic.
In case where incoming OUT packets are less (63) than MPS (64), actual_len and total_len
are set 63.
Then transfer complete from USBD is called that will schedule next 64 bytes transfer.
At that point incoming packet would start DMA if there is place in RAM, normally
it does not happen since actual_len == total_len.
If packets arrives and interrupt is raised after total_len is set (64) but actual_len is still 63 from
previous transfer, interrupt code sees that there is place in ram (1 byte) and transfer this 1 byte
to buffer that was already filled with previous packet.
To remedy this USB interrupt is blocked during transfer setup.

Second problem can happen when dcd_edpt_xfer setups total_len and actual_len correctly
but then context switch happens (or interrupts) before xfer->data_received is checked.
If during this time two packets arrive one will be copied to RAM second will stay in endpoint with
data_received set to 1.
Then when xfer_edpt_xfer() checks data_receive flag it starts DMA again overwriting data.
To remedy this, data_received is checked together with check if data was already transferred.
If transfer was complete, there is no need to start DMA yet.
In such case data_received will be handled in same place by next xfer_edpt_xfer() correctly.
Additional context
Both problems were discovered while stress testing BTH with massive amount of ACL data going
through OUT endpoint.
It should be possible to reach same result with other source of steady flow non-MPS packets.
In my mace it was triggers by sending constant packet stream of 64-64-64-63 packet sequence.

When dcd_edpt_xfer() starts new transfer two separate problems were observed.
For both problems stream of OUT packets was pouring from host.

First problem was that total_len and actual_len were not atomic.
In case where incoming OUT packets are less (63) than MPS (64), actual_len and total_len
are set 63.
Then transfer complete from USBD is called that will schedule next 64 bytes transfer.
At that point incoming packet would start DMA if there is place in RAM, normally
it does not happen since actual_len == total_len.
If packets arrives and interrupt is raised after total_len is set (64) but actual_len is still 63 from
previous transfer, interrupt code sees that there is place in ram (1 byte) and transfer this 1 byte
to buffer that was already filled with previous packet.
To remedy this USB interrupt is blocked during transfer setup.

Second problem can happen when dcd_edpt_xfer setups xfer->total_len and actual_len correctly
but then context switch happens before xfer->data_received is checked.
If during this time two packets arrive one will be copied to RAM second will stay in endpoint with
data_received set to 1.
Then when xfer_edpt_xfer() checks data_receive flag it starts DMA again overwriting data.
To remedy this, data_received is checked together with check if data was already transferred.
If transfer was complete, there is no need to start DMA yet.
In such case data_received will be handled in same place by next xfer_edpt_xfer() correctly.
@kasjer
Copy link
Collaborator Author

kasjer commented Jan 15, 2022

It turns out that its not enough to handle races.
Another problem arise when two tasks are involved.
First blocks USB interrupt to set total_len and clear actual_len but between those two instructions
second task enters same function for other endpoint and while doing this enables interrupt again
while first task did not cleared actual_len yet and first problem shows up again.

To fix this:

  • CPU critical section should block interrupts during buffer, total_len and actual_len setup
  • or disabling interrupt should have reference counting
  • or critical section could be used to guard section that disables/enables interrupt

Copy link
Owner

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you very much for putting effort to fix this race condition. I have been some hard time troubleshooting this before as well and have done a few PR to improve the race. All changes make sense, changes look small but have a huge impact and is very difficult to trace down.

@@ -453,9 +453,11 @@ bool dcd_edpt_xfer (uint8_t rhport, uint8_t ep_addr, uint8_t * buffer, uint16_t

xfer_td_t* xfer = get_td(epnum, dir);

dcd_int_disable(rhport);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I think I did put the int disable/enable here before, but somehow got reverted.

@@ -476,7 +478,7 @@ bool dcd_edpt_xfer (uint8_t rhport, uint8_t ep_addr, uint8_t * buffer, uint16_t
edpt_dma_start(&NRF_USBD->TASKS_EP0RCVOUT);
}else
{
if ( xfer->data_received )
if ( xfer->data_received && xfer->total_len > xfer->actual_len)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, right, I fell the data_received is not enough as well.

@hathach hathach merged commit 983abfd into hathach:master Jan 19, 2022
@kasjer kasjer deleted the kasjer/nrf5x-int-race branch January 19, 2022 07:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants