Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nrf5x: Fix DMA access race condition #1280

Merged
merged 2 commits into from
Feb 22, 2022

Conversation

kasjer
Copy link
Collaborator

@kasjer kasjer commented Jan 14, 2022

Describe the PR
In multi-thread mode starting DMA in thread mode was
prone to race condition resulting in infinite loop.
It may happen on single core CPU with strict priority based
tasks scheduler where ready high prio task never yields to
ready low prio task (Mynewt).

Sequence that failed (T1 - low priority task, T2 - high priority task)

  • T1 called start_dma()
  • T1 set _dcd.dma_running (DMA not started yet, context switch happens)
  • T2 took CPU and saw that _dcd.dma_running is set, so waits for _dcd.dma_running to be 0
  • T1 never gets CPU again, DMA is not started T2 waits forever

OSAL mutex resolves problem of DMA starting from thread-context.

Additional context
Happened while stress testing Nimble stack on Mynewt with USB transport.

@kasjer kasjer force-pushed the kasjer/nrf5x-dma-race branch 2 times, most recently from bad5610 to 9085b34 Compare January 18, 2022 13:09
In multi-thread mode starting DMA in thread mode was
prone to race condition resulting in infinite loop.
It may happen on single core CPU with strict priority based
tasks scheduler where ready high prio task never yields to
ready low prio task (Mynewt).

Sequence that failed (T1 - low priority task, T2 - high priority task)
- T1 called start_dma()
- T1 set _dcd.dma_running (DMA not started yet, context switch happens)
- T2 took CPU and saw that _dcd.dma_running is set, so waits for _dcd.dma_running to be 0
- T1 never gets CPU again, DMA is not started T2 waits forever

OSAL mutex resolves problem of DMA starting from thread-context.
@kasjer
Copy link
Collaborator Author

kasjer commented Jan 19, 2022

Additional mutex lock added during transfer setup to prevent premature interrupt enable that could happen if two tasks started two separate transfers.

When two tasks entered dcd_edpt_xfer() it was possible that
first disabled interrupt to setup total_len and actual_len
but second task for another endpoint enabled interrupt
between total_len and actual_len resulting in race
condition with interrupt, hence mutex is added on top of interrupt being blocked.
Copy link
Owner

@hathach hathach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank your for the PR and sorry for late response, this somehow falled off my radar. The current implementation of nrf5x indeed has issue with race condition in preempted RTOS. I was hoping to use LDREX/STREX to have mutex, but couldn't get those to work. In the future, I think we could make use of semaphore as resource management instead of mutex which make a little bit more sense.

@hathach hathach merged commit e04f15f into hathach:master Feb 22, 2022
@kasjer kasjer deleted the kasjer/nrf5x-dma-race branch February 22, 2022 14:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants