Fix edpt xfer race condition by hathach · Pull Request #508 · hathach/tinyusb

hathach · 2020-09-09T09:09:30Z

Describe the PR

fix Race condition with cdc auto flush() on transfer completion #507 introduce optional usbd_edpt_claim, usbd_edpt_release which can be ued to gain exclusive access to usbd_edpt_xfer
With class driver API() that submit transfer to endpoint. Currently usbd_edpt_busy() is checked before calling usbd_edpt_xfer() e.g

uint32_t tud_cdc_n_write_flush (uint8_t itf)
{
  cdcd_interface_t* p_cdc = &_cdcd_itf[itf];

  // skip if previous transfer not complete yet
  TU_VERIFY( !usbd_edpt_busy(TUD_OPT_RHPORT, p_cdc->ep_in), 0 );

  uint16_t count = tu_fifo_read_n(&p_cdc->tx_ff, p_cdc->epin_buf, sizeof(p_cdc->epin_buf));
  if ( count )
  {
    TU_VERIFY( tud_cdc_n_connected(itf), 0 ); // fifo is empty if not connected
    TU_ASSERT( usbd_edpt_xfer(TUD_OPT_RHPORT, p_cdc->ep_in, p_cdc->epin_buf, count), 0 );
  }

  return count;
}

Although this is fine with noOS since tinyusb API() should always be called in thread mode (never ISR). However with an RTOS, between the edpt_busy() and the edpt_xfer() we can be preempted by higher prio task, which can also calling the same API e.g cdc_write_flush() (often after cdc_write()). This lead to the stack re-submit another transfer to an already busy endpoint after High prio task complete and the Low prio continue. Of course dcd detect and reject this and the call failed nicely. However as above write_flush(), the data that is pull out of the tx_fifo is not processed -> cause missing character with print(). Although we can put it back with fifo_write_front() in theory, however not all processed function is revertible.

In short, this PR introduce usbd_edpt_claim() and release() with the help of RTOS mutex mechanism, to prevent other thread to interfere between edpt_busy() and edpt_xfer(). The claim/release is OPTIONAL for now, since not all class driver need it. For example all MSC transfer is handled by the stack via callback, user never actively submit an transfer, therefore there is no racing. However, later on we can make it mandatory just to be consistent across the API.

Ideally it would be 1 mutex per endpiont, but that seems to be a lot of memory for edge case. Maybe only one general mutex is enough, and the edpt_claim()/edpt_release() is fast enough I guess. Let's me know what do you think about the PR and how we could improve it. I am open to any suggestion.

PS is still WIP, only applies to cdc driver first for feedback.

…ed to gain exclusive access to usbd_edpt_xfer

duempel · 2020-09-10T11:15:37Z

Ideally it would be 1 mutex per endpiont, but that seems to be a lot of memory for edge case. Maybe only one general mutex is enough, and the edpt_claim()/edpt_release() is fast enough I guess. Let's me know what do you think about the PR and how we could improve it. I am open to any suggestion.

Usually I'd say that one mutex for claiming all endpoints should be ok. There is not much processing and high priority tasks can only be blocked very shortly. But this short blocking is not guaranteed, as we can see with a setup like #507 . If the low priority task won't fall back to it's low priority after releasing the mutex other high prio tasks can be blocked for much longer than expected. This won't result in the error we saw before, but could cause problems to time critical tasks. I also don't like to create seperate mutex for each endpoint and don't see a more elegant solution right now. Just want to mention that the chance of blocking is higher because different endpoints are sharing only one ressource. Also in theory the risk of deadlocks will become higher.

Is it necessary to claim the data out endpoint? I think _prep_out_transaction is only called by tud_task and user tasks won't call a function which prepares incoming data. If this is necessary then I'm missing a release of this endpoint. Maybe cdcd_xfer_cb would be a good place for it. Looks like the _prep_out_transaction can not be triggered a 2nd time right now.

Also I'm not sure if we need to have both: busy and claimed flags. The busy flag is indicating that device controller driver are processing the endpoint at that moment. If it's busy we can't schedule more transfers to this endpoint. But right now we also don't prepare fifos or any other ressources with data to process later. From a functional point of view I'd say that an endpoint should be indicated as busy already if some class just plans to use it and prepares ressources (which is done before dcd processed and usbd sets it busy). This is pretty the same as we use the claim flag right now.

hathach · 2020-09-10T12:59:09Z

Usually I'd say that one mutex for claiming all endpoints should be ok. There is not much processing and high priority tasks can only be blocked very shortly. But this short blocking is not guaranteed, as we can see with a setup like #507 . If the low priority task won't fall back to it's low priority after releasing the mutex other high prio tasks can be blocked for much longer than expected. This won't result in the error we saw before, but could cause problems to time critical tasks. I also don't like to create seperate mutex for each endpoint and don't see a more elegant solution right now. Just want to mention that the chance of blocking is higher because different endpoints are sharing only one ressource. Also in theory the risk of deadlocks will become higher.

It only happens with FreeRTOS which is scenarios that almost never occur 😅 😅 . Surely using more mutex will increase the chance of HP being blocked, however it is really FreeRTOS issue, and isn't ours to fix. Eventually they will resolve this when more people start to complain :)

Is it necessary to claim the data out endpoint? I think _prep_out_transaction is only called by tud_task and user tasks won't call a function which prepares incoming data. If this is necessary then I'm missing a release of this endpoint. Maybe cdcd_xfer_cb would be a good place for it. Looks like the _prep_out_transaction can not be triggered a 2nd time right now.

_prep_out_transaction() is also invoked by user via cdc_read(). The flow is

tud_task() try to submit transfer as long as the FIFO can still has spaces for it. But will stop whenever there no more space (else we got discarded data).
When user read() from fifo and it got back the space, the read() will submit the transfer. There is still conflict with multiple read() from different thread. Though it happens much less than write() with printf()

Also I'm not sure if we need to have both: busy and claimed flags. The busy flag is indicating that device controller driver are processing the endpoint at that moment. If it's busy we can't schedule more transfers to this endpoint. But right now we also don't prepare fifos or any other ressources with data to process later. From a functional point of view I'd say that an endpoint should be indicated as busy already if some class just plans to use it and prepares ressources (which is done before dcd processed and usbd sets it busy). This is pretty the same as we use the claim flag right now.

Yeah, I am debating this myself as well, however there is scenario where the driver after claimed the endpoint, decide not to make an transfer due to its internal error, e.g out of data from the fifo, which could be another race issue with others. Then the driver must release the endpoint. I am not entirely sure if we could just simply mark busy = 0 in this case, since it can potentially mask an actually busy-transferring endpoint. However, the more I think about this now, I think we can actually do busy = 0 as release, since it is already claimed, there should be no transfer on it. If it does, then we have another issue to fix. Thanks for the feedback, I will update the PR to only use busy.

hathach · 2020-09-10T16:56:58Z

@duempel oh, while removing the claimed bit, I realize there may be an issue. Since currently the edpt_claim() is optional. Using only 1 busy bit, the edpt_xfer() won't be able to force and check to see if the endpoint is currently transferring or not. E.g an API can call edpt_xfer() 2 times, and the busy is set just like using the claim_edpt() hmm. Maybe we can leave that later when forcing all API to use claim, however, it is not really necessary for endpoint that isn't directly involved to public API().

hathach · 2020-09-11T05:52:34Z

@duempel Maybe TinyUSB could implement osal_mutex as normal semaphore in FreeRTOS instead 🤔 . Of course, we will fall into the priority inversion scenario, though I am not sure between the 2 cases, which is worst hmmm.

duempel · 2020-09-12T15:05:41Z

Surely using more mutex will increase the chance of HP being blocked

Yes using more mutex will increase the chance of blocking. But I actually wanted to say that using more mutex objects we can decrease the chance of blocking since EP3 would not try to get the same mutex as EP2. Well but one is just fine for now. Don't need to waste ressources on this.

_prep_out_transaction() is also invoked by user via cdc_read().

Oh yeah, my fault. I just wonder if we would really need it in user API but this can be discussed later 😀

Since currently the edpt_claim() is optional. Using only 1 busy bit, the edpt_xfer() won't be able to force and check to see if the endpoint is currently transferring or not. E.g an API can call edpt_xfer() 2 times, and the busy is set just like using the claim_edpt() hmm. Maybe we can leave that later when forcing all API to use claim, however, it is not really necessary for endpoint that isn't directly involved to public API().

This is a good point. usbd has to know if dcd is processing the data right now. Better to keep the claim bit instead of adding additional logic to handle those cases.

Maybe TinyUSB could implement osal_mutex as normal semaphore in FreeRTOS instead 🤔 . Of course, we will fall into the priority inversion scenario, though I am not sure between the 2 cases, which is worst hmmm.

I've also thought about this after you came up with #507 . This small change could have fixed the issue. But I would keep it as a mutex. We also should not care that much if priority inversion or priority inheritance is our way to go. If someone is designing a time critical application the developer has to plan the scheduling and ressource sharing anyway. In case of FreeRTOS maybe using both would be our way to go: mutex for fifo blocking and binary semaphore as endpoint claiming.
Our goal is that our API do not influence each other and we do not loose data or run into other errors. Timing is too complex to take care of different applications.

@hathach all in all I feel good with these changes. In _prep_out_transaction maybe claiming the endpoint would make more sense if we do this before availability check. In case tud_task has a higher priority than the user task which is calling a tud_cdc_n_read for example. The user task then processes _prep_out_transaction and reads the available bytes in rx fifo. If the higher priority task then invokes cdcd_xfer_cb and write data to rx fifo there might be less space. In this case the _prep_out_transaction called by tud_task notices there is not enough space and will return without claiming the endpoint. Because the endpoint is not claimed and the user task thinks it's enough space within rx fifo, it will continue with usbd_edpt_xfer and some bytes could get lost. Of course when changing the order there has to be a release call added in case of no available space.

hathach · 2020-09-13T07:33:17Z

Yes using more mutex will increase the chance of blocking. But I actually wanted to say that using more mutex objects we can decrease the chance of blocking since EP3 would not try to get the same mutex as EP2. Well but one is just fine for now. Don't need to waste ressources on this.

Right, I thought of this as well, 16 mutex is lots of resources. 1 Mutex can easily occupies 32 or 64 bytes.

Oh yeah, my fault. I just wonder if we would really need it in user API but this can be discussed later

Oh yeah, If you have idea to not include it in user API, then just submit PR, no need for discussion :D

This is a good point. usbd has to know if dcd is processing the data right now. Better to keep the claim bit instead of adding additional logic to handle those cases.

OK, for now we can leave it as it is for now, at least it shares storage with busy, and doesn't occupies any SRAM. We can always refactor later.

I've also thought about this after you came up with #507 . This small change could have fixed the issue. But I would keep it as a mutex. We also should not care that much if priority inversion or priority inheritance is our way to go. If someone is designing a time critical application the developer has to plan the scheduling and ressource sharing anyway. In case of FreeRTOS maybe using both would be our way to go: mutex for fifo blocking and binary semaphore as endpoint claiming.
Our goal is that our API do not influence each other and we do not loose data or run into other errors. Timing is too complex to take care of different applications.

I agree, it is a bit too much for a usb stack to worry about and it won't be able to solve it anyway. OSAL currently includes the semaphore API, though I have a plan to remove it to only keep queue and mutex to make it easier to port new RTOS. I am thinking about it, however, I am lean toward keeping both as mutex as it reflect the correct scenario we have mutual exclusive to access shared resource. Semaphore is more about synchronization.

@hathach all in all I feel good with these changes. In _prep_out_transaction maybe claiming the endpoint would make more sense if we do this before availability check. In case tud_task has a higher priority than the user task which is calling a tud_cdc_n_read for example. The user task then processes _prep_out_transaction and reads the available bytes in rx fifo. If the higher priority task then invokes cdcd_xfer_cb and write data to rx fifo there might be less space. In this case the _prep_out_transaction called by tud_task notices there is not enough space and will return without claiming the endpoint. Because the endpoint is not claimed and the user task thinks it's enough space within rx fifo, it will continue with usbd_edpt_xfer and some bytes could get lost. Of course when changing the order there has to be a release call added in case of no available space.

~~maybe we should include the edpt_busy() check in the edpt_claim() before attempting to call mutex_lock(). It seems to be the need at all time.~~

hathach · 2020-09-14T14:19:52Z

@hathach all in all I feel good with these changes. In _prep_out_transaction maybe claiming the endpoint would make more sense if we do this before availability check. In case tud_task has a higher priority than the user task which is calling a tud_cdc_n_read for example. The user task then processes _prep_out_transaction and reads the available bytes in rx fifo. If the higher priority task then invokes cdcd_xfer_cb and write data to rx fifo there might be less space. In this case the _prep_out_transaction called by tud_task notices there is not enough space and will return without claiming the endpoint. Because the endpoint is not claimed and the user task thinks it's enough space within rx fifo, it will continue with usbd_edpt_xfer and some bytes could get lost. Of course when changing the order there has to be a release call added in case of no available space.

Ah right, I think I misread this in previous comment. I will update the PR to claim the endpoint before checking the fifo space. Thanks for you analysis.

- add pre-check to reduce mutex lock in usbd_edpt_claim

hathach · 2020-09-14T16:56:17Z

merged now since it is ready and I need to move on with other works. Issues such as semaphore vs mutex can be done a follow-up PR if needed.

hathach added 2 commits September 9, 2020 15:48

introduce optional usbd_edpt_claim, usbd_edpt_release which can be us…

8b9893c

…ed to gain exclusive access to usbd_edpt_xfer

update cdc edpt read

33f0a18

hathach mentioned this pull request Sep 9, 2020

Data is lost when data is sent to the PC by Serial.println. adafruit/Adafruit_nRF52_Arduino#565

Closed

clean up

fe1b5df

hathach mentioned this pull request Sep 9, 2020

Race condition with cdc auto flush() on transfer completion #507

Closed

typo

ed6d48b

update claim edpt for hid and midi

801f8b5

clean up

ce4a9b9

typo

1804dba

hathach added 2 commits September 14, 2020 22:14

cdc device: claim endpoint before checking fifo availability

23e6ee2

- add pre-check to reduce mutex lock in usbd_edpt_claim

more const

9c0d15f

hathach marked this pull request as ready for review September 14, 2020 15:48

hathach merged commit 745c15d into master Sep 14, 2020

hathach deleted the fix-edpt-race branch September 14, 2020 16:56

hathach mentioned this pull request Sep 14, 2020

Fix:Data is lost when data is sent to the PC by Serial.println. #565 adafruit/Adafruit_TinyUSB_ArduinoCore#13

Closed

deymo mentioned this pull request Jan 20, 2022

cdcd_xfer_cb callback uses p_cdc->epout_buf after endpoint was released #1292

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix edpt xfer race condition#508

Fix edpt xfer race condition#508
hathach merged 9 commits into
masterfrom
fix-edpt-race

hathach commented Sep 9, 2020 •

edited

Loading

Uh oh!

duempel commented Sep 10, 2020

Uh oh!

hathach commented Sep 10, 2020

Uh oh!

hathach commented Sep 10, 2020

Uh oh!

hathach commented Sep 11, 2020

Uh oh!

duempel commented Sep 12, 2020

Uh oh!

hathach commented Sep 13, 2020 •

edited

Loading

Uh oh!

hathach commented Sep 14, 2020

Uh oh!

hathach commented Sep 14, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

hathach commented Sep 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

duempel commented Sep 10, 2020

Uh oh!

hathach commented Sep 10, 2020

Uh oh!

hathach commented Sep 10, 2020

Uh oh!

hathach commented Sep 11, 2020

Uh oh!

duempel commented Sep 12, 2020

Uh oh!

hathach commented Sep 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hathach commented Sep 14, 2020

Uh oh!

hathach commented Sep 14, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hathach commented Sep 9, 2020 •

edited

Loading

hathach commented Sep 13, 2020 •

edited

Loading