Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dcd_nrf5x: fix race condition #2626

Merged
merged 2 commits into from
May 15, 2024

Conversation

rgrr
Copy link
Contributor

@rgrr rgrr commented May 6, 2024

Describe the PR
The PR fixes two race conditions in the dcd_nrf5x driver. The first one fixes a function invocation from edpt_dma_start() which is called also from interrupt context, the second one (DI/EI) is required to protect a code section from interrupts.

Additional context
The actual problems (crashes/hangups) appeared during repeated connect / disconnect to a CDC-ACM on the nRF5x. The above changes fixed the situation.

@HiFiPhile HiFiPhile requested a review from kasjer May 8, 2024 13:13
@kasjer
Copy link
Collaborator

kasjer commented May 8, 2024

@rgrr could you please tell me more about how you tested this?
Was there any OS used or is it bare metal configuration?
Maybe you have some logs from USB analyzer that can be viewed to see exact sequence that triggered problems?

@rgrr
Copy link
Contributor Author

rgrr commented May 8, 2024

@rgrr could you please tell me more about how you tested this? Was there any OS used or is it bare metal configuration? Maybe you have some logs from USB analyzer that can be viewed to see exact sequence that triggered problems?

No, unfortunately I have no logs from a USB analyzer.

My testcase was a program under windows, USB nRF dongle connected to the PC with CDC-ACM in it. The dongle is more or less a gateway between a UART based packet protocol and BLE.

The test program basically did

  • open COM port to dongle
  • fetch device list from the dongle which the dongle received device list via BLE advertisements
  • open connection to a specific device via BLE
  • exchange some data with the device via BLE
  • close connection with the device
  • close COM port to the dongle

We have plenty of implementations of the Dongle firmware, e.g. with nRF52832 via UART or nRF52840 via UART or nRF52840 via CDC-ACM (Nordic stack) or nRF52840 via CDC-ACM with TinyUSB.

The dongle with TinyUSB showed the behavior that it sometimes (randomly) crashed during the COM port connect phase. Code inspection showed, that a parameter of one call was wrong.
A lot of more fiddling isolated the area where there are the EI/DI. I'm actually not happy with the DI/EI solution, but in the end the result counted.

Hope this helps a little bit.

@kasjer
Copy link
Collaborator

kasjer commented May 8, 2024

@rgrr so TinyUSB was build without any rtos (FreeRTOS/mynewt/other)?

@rgrr
Copy link
Contributor Author

rgrr commented May 8, 2024

@rgrr so TinyUSB was build without any rtos (FreeRTOS/mynewt/other)?

yes

@kasjer
Copy link
Collaborator

kasjer commented May 9, 2024

@rgrr when you use TinyUSB functions, do your code calls them from other interrupts (not just USB one)?
What compiler was used for building your project?

@rgrr
Copy link
Contributor Author

rgrr commented May 9, 2024

The application it self is mainloop based. The mainloop das the call to "tud_task()", everything else in the CDC-ACM handling is done via the callbacks. "proc_soc()" (some power handling of the nRF driver, part of the TinyUSB implementation) is also done from the mainloop.

The CDC handler itself use the regular callbacks.

So... no, I do no calls from interrupt context, everything is done from the mainloop.

@rgrr
Copy link
Contributor Author

rgrr commented May 9, 2024

forgot: compiler is clang 13.0.0, used in many projects in my company.

@kasjer
Copy link
Collaborator

kasjer commented May 9, 2024

@rgrr First change where true is changed to is_in_isr() is reasonable and to me, and does fix locking issue where FIFO used for queue in osal_none.h can be clobbered when xfer started from non-interrupt context tries to add message without disabling USB interrupt then then fires.

The second part that blocks all interrupts for duration of two calls is not clear. It's not obvious why it should fix anything.
The only way I can see that it could fix problem is when dcd_edpt_xfer() is called from other interrupt with lower priority then USB one. Then dcd_event_xfer_complete(..., is_in_isr()) could try to write message to the queue and then legitimately report that it is in interrupt; dcd_int_handler() could try posting another message possibly destroying queue.

@rgrr
Copy link
Contributor Author

rgrr commented May 12, 2024

Yes, I agree: the first case is clear, the second not so much.

That was a lot of try and guess: first made DI/EI over the whole function -> no more crashes. Then narrowed it down to the two calls. Never actually determined who were the (double) callers. If you want to, I will try to find out, but that will take some time because I'm currently very busy.

@kasjer
Copy link
Collaborator

kasjer commented May 13, 2024

Without convincing explanation I can't really give my blessing. I'm pretty sure that DE/EI will not break anything but I'm not sure it really fixes problem that may pop up in different place later.

@rgrr
Copy link
Contributor Author

rgrr commented May 13, 2024

I understand your position. I will revert the second case and commit the mini change.

@rgrr rgrr changed the title dcd_nrf5x: fix two race conditions dcd_nrf5x: fix race condition May 13, 2024
Copy link
Owner

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good, thank you and @kasjer for reviewing

@hathach hathach merged commit 5393f8d into hathach:master May 15, 2024
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants