Fix edpt xfer race condition #508
…ed to gain exclusive access to usbd_edpt_xfer
Usually I'd say that one mutex for claiming all endpoints should be OK. There is not much processing, and high-priority tasks can only be blocked very briefly. But this short blocking is not guaranteed, as we can see with a setup like #507. If the low-priority task doesn't fall back to its low priority after releasing the mutex, other high-priority tasks can be blocked for much longer than expected. This won't result in the error we saw before, but it could cause problems for time-critical tasks. I also don't like creating a separate mutex for each endpoint, and I don't see a more elegant solution right now. I just want to mention that the chance of blocking is higher because different endpoints share only one resource. In theory the risk of deadlocks also becomes higher. Is it necessary to claim the data out endpoint? I think Also, I'm not sure we need both the busy and the claimed flags. The busy flag indicates that the device controller driver is processing the endpoint at that moment. If it's busy, we can't schedule more transfers to this endpoint. But right now we also don't prepare fifos or any other resources with data to process later. From a functional point of view, I'd say an endpoint should be marked busy as soon as some class plans to use it and prepares resources (which happens before the dcd processes the transfer and usbd sets it busy). This is pretty much how we use the claim flag right now.
It only happens with FreeRTOS which is
_prep_out_transaction() is also invoked by user via cdc_read(). The flow is
Yeah, I am debating this myself as well. However, there is a scenario where the driver, after claiming the endpoint, decides not to make a transfer due to an internal error, e.g. running out of data in the fifo, which could be another race issue with others. Then the driver must release the endpoint. I am not entirely sure we could simply set busy = 0 in this case, since that could mask an endpoint that is actually busy transferring. However, the more I think about this now, I think we can actually do
@duempel oh, while removing the
@duempel Maybe TinyUSB could implement osal_mutex as a normal semaphore in FreeRTOS instead 🤔. Of course, we will fall into the
Yes, using more mutexes will increase the chance of blocking. But I actually wanted to say that with more mutex objects we can decrease the chance of blocking, since EP3 would not try to take the same mutex as EP2. Well, one is just fine for now. No need to waste resources on this.
Oh yeah, my fault. I just wonder if we really need it in the user API, but this can be discussed later 😀
This is a good point. usbd has to know if dcd is processing the data right now. Better to keep the claim bit instead of adding additional logic to handle those cases.
I've also thought about this after you came up with #507. This small change could have fixed the issue. But I would keep it as a mutex. We also should not care that much whether priority inversion or priority inheritance is the way to go. If someone is designing a time-critical application, the developer has to plan the scheduling and resource sharing anyway. In the case of FreeRTOS, maybe using both would be the way to go: a mutex for fifo blocking and a binary semaphore for endpoint claiming. @hathach all in all I feel good with these changes. In
Right, I thought of this as well; 16 mutexes is a lot of resources. One mutex can easily occupy 32 or 64 bytes.
Oh yeah, if you have an idea for not including it in the user API, then just submit a PR, no need for discussion :D
OK, we can leave it as it is for now; at least it shares storage with busy and doesn't occupy any extra SRAM. We can always refactor later.
I agree, it is a bit too much for a USB stack to worry about, and it won't be able to solve it anyway. OSAL currently includes the semaphore API, though I plan to remove it and keep only queue and mutex to make it easier to port to a new RTOS. I am thinking about it; however, I lean toward keeping both as mutexes, as that reflects the actual scenario: we need mutually exclusive access to a shared resource. A semaphore is more about synchronization.
Ah right, I think I misread this in the previous comment. I will update the PR to claim the endpoint before checking the fifo space. Thanks for your analysis.
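The ordering discussed here (claim first, look at the fifo afterwards, release if there is nothing to send) could be sketched roughly as below. This is only an illustration under assumptions: `demo_itf_t`, `demo_claim()`, `demo_release()` and `demo_write_flush()` are hypothetical names, a pthread mutex stands in for the RTOS mutex, and the "transfer" is simulated by flipping flags. It is not the code from this PR.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical model of one CDC IN endpoint; not TinyUSB's real data structures.
typedef struct {
  bool   busy;      // controller is transferring
  bool   claimed;   // a task has reserved the endpoint
  size_t fifo_len;  // bytes waiting in the TX fifo
  size_t sent;      // bytes handed to the (simulated) controller
} demo_itf_t;

// Assumption: a pthread mutex stands in for the RTOS mutex.
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;

static bool demo_claim(demo_itf_t *itf) {
  pthread_mutex_lock(&demo_mutex);
  bool ok = !itf->busy && !itf->claimed;
  if (ok) itf->claimed = true;
  pthread_mutex_unlock(&demo_mutex);
  return ok;
}

static void demo_release(demo_itf_t *itf) {
  pthread_mutex_lock(&demo_mutex);
  itf->claimed = false;
  pthread_mutex_unlock(&demo_mutex);
}

// Returns the number of bytes submitted (0 if nothing was sent).
static size_t demo_write_flush(demo_itf_t *itf) {
  // Claim first: after this point no other task can start a transfer here.
  if (!demo_claim(itf)) return 0;

  // Only now inspect the fifo.
  size_t len = itf->fifo_len;
  if (len == 0) {
    demo_release(itf);  // nothing to send, hand the endpoint back
    return 0;
  }

  // Simulated submit: the claim turns into a busy endpoint.
  itf->fifo_len = 0;
  pthread_mutex_lock(&demo_mutex);
  itf->busy = true;
  itf->claimed = false;
  pthread_mutex_unlock(&demo_mutex);
  itf->sent += len;
  return len;
}
```

With this order, data is only pulled from the fifo by a task that already owns the endpoint, so a rejected claim leaves the fifo untouched.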
- add pre-check to reduce mutex lock in usbd_edpt_claim
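One plausible reading of this pre-check is a cheap unlocked peek at the flags before taking the mutex, followed by a re-check under the lock. The sketch below is an assumption about the idea, not the actual commit: `demo_ep_t`, `demo_claim()` and the `demo_lock_count` counter are hypothetical, a pthread mutex stands in for the RTOS mutex, and it assumes single-byte flag reads are atomic on the target.

```c
#include <pthread.h>
#include <stdbool.h>

// Hypothetical endpoint flags; volatile because they are peeked without the lock.
typedef struct {
  volatile bool busy;
  volatile bool claimed;
} demo_ep_t;

// Assumption: a pthread mutex stands in for the RTOS mutex.
static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned demo_lock_count = 0;  // instrumentation for this sketch only

static bool demo_claim(demo_ep_t *ep) {
  // Pre-check without the mutex: an endpoint that already looks taken
  // is rejected without paying for a lock/unlock pair.
  if (ep->busy || ep->claimed) return false;

  pthread_mutex_lock(&demo_mutex);
  demo_lock_count++;
  // Re-check under the lock: another task may have claimed it since the peek.
  bool ok = !ep->busy && !ep->claimed;
  if (ok) ep->claimed = true;
  pthread_mutex_unlock(&demo_mutex);
  return ok;
}
```

The pre-check is only an optimization; correctness still comes from the check performed while holding the mutex.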
Merged now since it is ready and I need to move on to other work. Issues such as semaphore vs mutex can be handled in a follow-up PR if needed.
Describe the PR
usbd_edpt_busy() is checked before calling usbd_edpt_xfer(). This is fine with no OS, since the tinyusb API should always be called in thread mode (never from an ISR). However, with an RTOS, between edpt_busy() and edpt_xfer() we can be preempted by a higher-priority task, which can also call the same API, e.g. cdc_write_flush() (often after cdc_write()). This leads the stack to submit another transfer to an already busy endpoint once the high-priority task completes and the low-priority task continues. Of course the dcd detects and rejects this, and the call fails gracefully. However, as with write_flush() above, the data that was pulled out of the tx_fifo is not processed, which causes missing characters with print(). In theory we could put it back with fifo_write_front(), but not every processing step is reversible.
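To make the interleaving concrete, here is a small single-threaded replay of the check-then-submit race. Everything in it is a hypothetical model (`demo_ep_t`, `demo_xfer()`, `DEMO_EP_SIZE` and the byte counts are made up for illustration); the preemption is replayed by hand rather than produced by a real scheduler.

```c
#include <stdbool.h>
#include <stddef.h>

enum { DEMO_EP_SIZE = 4 };  // hypothetical max bytes per transfer

// Hypothetical model of one IN endpoint plus its TX fifo.
typedef struct {
  bool   busy;      // set while the controller owns the endpoint
  size_t fifo_len;  // bytes waiting in the TX fifo
  size_t sent;      // bytes accepted by the simulated controller
  size_t dropped;   // bytes pulled from the fifo but never sent
} demo_ep_t;

// Simulated controller submit: rejects a transfer on a busy endpoint.
static bool demo_xfer(demo_ep_t *ep, size_t len) {
  if (ep->busy) return false;
  ep->busy = true;
  ep->sent += len;
  return true;
}

// Second half of a flush: pull up to one packet from the fifo and submit it.
static void demo_pull_and_submit(demo_ep_t *ep) {
  size_t len = ep->fifo_len < DEMO_EP_SIZE ? ep->fifo_len : DEMO_EP_SIZE;
  ep->fifo_len -= len;  // the data has left the fifo for good
  if (!demo_xfer(ep, len)) ep->dropped += len;
}

// Replays: low-prio checks busy, high-prio preempts and submits,
// low-prio resumes with a stale "not busy" result.
static size_t demo_race_dropped_bytes(void) {
  demo_ep_t ep = {0};

  ep.fifo_len += 3;              // low-prio task writes 3 bytes
  bool low_saw_idle = !ep.busy;  // low-prio: busy check passes

  ep.fifo_len += 3;              // high-prio task preempts and writes 3 bytes
  if (!ep.busy) demo_pull_and_submit(&ep);  // high-prio flush takes 4 bytes

  // Low-prio resumes and trusts its stale check: the submit is rejected.
  if (low_saw_idle) demo_pull_and_submit(&ep);
  return ep.dropped;
}
```

In this replay the low-priority task pulls the remaining 2 bytes from the fifo after the endpoint has already become busy, so those bytes are lost even though the rejected call itself fails cleanly.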
In short, this PR introduces usbd_edpt_claim() and release(), with the help of the RTOS mutex mechanism, to prevent other threads from interfering between edpt_busy() and edpt_xfer(). The claim/release is OPTIONAL for now, since not all class drivers need it. For example, all MSC transfers are handled by the stack via callbacks; the user never actively submits a transfer, so there is no race. However, later on we can make it mandatory just to be consistent across the API.
Ideally it would be 1 mutex per endpoint, but that seems like a lot of memory for an edge case. Maybe one general mutex is enough, and edpt_claim()/edpt_release() is fast enough, I guess. Let me know what you think about the PR and how we could improve it. I am open to any suggestions.
PS: this is still WIP and only applies to the cdc driver first, for feedback.
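A minimal sketch of the claim/release idea with one general mutex might look like the following. It rests on assumptions: the `demo_*` names and the status layout are hypothetical, a pthread mutex stands in for the RTOS mutex, and the transfer is simulated by toggling flags. It is meant to show the state transitions, not the real usbd implementation.

```c
#include <pthread.h>
#include <stdbool.h>

// Hypothetical per-endpoint status: both flags share one storage unit.
typedef struct {
  volatile unsigned int busy    : 1;  // controller is transferring
  volatile unsigned int claimed : 1;  // a task has reserved the endpoint
} demo_ep_status_t;

// One general mutex for all endpoints (assumption: pthread in place of the RTOS mutex).
static pthread_mutex_t demo_ep_mutex = PTHREAD_MUTEX_INITIALIZER;

// Reserve the endpoint; fails if it is busy or already claimed.
static bool demo_edpt_claim(demo_ep_status_t *ep) {
  pthread_mutex_lock(&demo_ep_mutex);
  bool ok = (ep->busy == 0) && (ep->claimed == 0);
  if (ok) ep->claimed = 1;
  pthread_mutex_unlock(&demo_ep_mutex);
  return ok;
}

// Give back a claim without making a transfer.
static bool demo_edpt_release(demo_ep_status_t *ep) {
  pthread_mutex_lock(&demo_ep_mutex);
  bool ok = (ep->busy == 0) && (ep->claimed == 1);
  if (ok) ep->claimed = 0;
  pthread_mutex_unlock(&demo_ep_mutex);
  return ok;
}

// Simulated submit: the claim is converted into a busy endpoint.
static bool demo_edpt_xfer(demo_ep_status_t *ep) {
  if (ep->busy) return false;
  ep->busy = 1;
  ep->claimed = 0;
  return true;
}

// Simulated transfer-complete event from the controller.
static void demo_xfer_complete(demo_ep_status_t *ep) {
  ep->busy = 0;
}
```

The mutex is held only for the few instructions that test and set the flags, which is why a single shared mutex may be acceptable in practice despite the blocking concerns raised in the discussion.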