New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
usb: Regression of the FIFO word loss issue. #873
Comments
Attempted fix for libopencm3#873 by moving SETUP handling to SETUP complete event. Note: This changes usbd API and st_usbfs driver is not yet updated.
After looking further into it, I've concluded that my preliminary analysis looks correct. The problem is that setting CNAK before the SETUP complete event is received causes a race condition. The SETUP callback is called when the SETUP packet event is received, which means that setting CNAK from the callback is too early. Originally the problem was that CNAK was set by ep_read() which is called by the callback. #672 solved this by moving CNAK out of ep_read() and calling it after the SETUP complete event is received instead. The regression by #785 is caused by the introduction of flow control calls into the SETUP callback. They also set CNAK. To solve this properly, I propose changing the event handling code to only call the SETUP callback after the SETUP complete event is received. Unfortunately, this implies that the callback can't call ep_read() itself anymore, because the packet has to be read out of the FIFO before the SETUP complete event arrives. This implies a change of the API between the hardware drivers and _usbd_control_setup(). I have pushed a quick test fix here: https://github.com/zyp/libopencm3/tree/test_fix_otg_word_loss_regression Please test with that branch and check if it appears to solve the problem. Note that I don't have access to hardware at the moment, so I haven't tested it myself. |
@zyp, with your test_fix the control control-loopback test runs through (with and without interrupts enabled). I also adde a pull request for the gadget-zero stuff to your test_fix_otg_word_loss_regression branch.
I have not tested the test_fix with one of my devices, but the word loss disappears after merging the test_fix into my ctch branch (https://github.com/H2OBrain/libopencm3/tree/ctch_test_fix_otg_word_loss_regression), where I hacked in multi packet support. |
@zyp unfortunately my devices still crash after some time. |
Misbehaving USB devices is a nice way to find bugs on the host… I've lost track of how many times I've made the Mac OS X host stack deadlock. :) If the device is still having sporadic problems, I think we need to collect more data to diagnose it. A test case that would reproduce it more often would be great. Maybe increasing the request rate to more than once per second would help? |
I've (finally) got the ctrl loopback tests, and can confirm that I see missing word problems on f4, but no problems on f3/f0 (with st_usbfs) |
I can confirm this patch works on STM32F407 (Discovery board). Without it, I would not receive the first 4 bytes of the control transfer. I would receive bytes from 4 onward. So if I send I had to adapt the code because the dwc otg code has been reworked in the meanwhile. I think this is an important fix and needs some more attention to apply correctly, since without it, control transfers don't work correctly. Need help making a proper patch. I can test on F407 and F103. |
See #873 Commentary describing this patch originally by zyp: ``` After looking further into it, I've concluded that my preliminary analysis looks correct. The problem is that setting CNAK before the SETUP complete event is received causes a race condition. The SETUP callback is called when the SETUP packet event is received, which means that setting CNAK from the callback is too early. Originally the problem was that CNAK was set by ep_read() which is called by the callback. #672 solved this by moving CNAK out of ep_read() and calling it after the SETUP complete event is received instead. The regression by #785 is caused by the introduction of flow control calls into the SETUP callback. They also set CNAK. To solve this properly, I propose changing the event handling code to only call the SETUP callback after the SETUP complete event is received. Unfortunately, this implies that the callback can't call ep_read() itself anymore, because the packet has to be read out of the FIFO before the SETUP complete event arrives. This implies a change of the API between the hardware drivers and _usbd_control_setup(). ``` L1 (st_usbfs) works and passes tests as before change F4 (dwc_otg_fs) works and now passes tests. (yay) LM4f still compiles, and has had the same style of implementation as st_usbfs, however has not been tested on any hardware.
This has merged now, and fixes the tests that were added to catch this. thanks all for pushing on this |
This is based on linux's gadget0 intel loopback tests, and also github pr: libopencm3#592 Note that this captures the currently broken control loopback issues on dwc_otg devices. See libopencm3#873 and all linked issues. Current status is passing on f3, f0, and failing on f4.
See libopencm3#873 Commentary describing this patch originally by zyp: ``` After looking further into it, I've concluded that my preliminary analysis looks correct. The problem is that setting CNAK before the SETUP complete event is received causes a race condition. The SETUP callback is called when the SETUP packet event is received, which means that setting CNAK from the callback is too early. Originally the problem was that CNAK was set by ep_read() which is called by the callback. libopencm3#672 solved this by moving CNAK out of ep_read() and calling it after the SETUP complete event is received instead. The regression by libopencm3#785 is caused by the introduction of flow control calls into the SETUP callback. They also set CNAK. To solve this properly, I propose changing the event handling code to only call the SETUP callback after the SETUP complete event is received. Unfortunately, this implies that the callback can't call ep_read() itself anymore, because the packet has to be read out of the FIFO before the SETUP complete event arrives. This implies a change of the API between the hardware drivers and _usbd_control_setup(). ``` L1 (st_usbfs) works and passes tests as before change F4 (dwc_otg_fs) works and now passes tests. (yay) LM4f still compiles, and has had the same style of implementation as st_usbfs, however has not been tested on any hardware.
This is based on linux's gadget0 intel loopback tests, and also github pr: libopencm3#592 Note that this captures the currently broken control loopback issues on dwc_otg devices. See libopencm3#873 and all linked issues. Current status is passing on f3, f0, and failing on f4.
See libopencm3#873 Commentary describing this patch originally by zyp: ``` After looking further into it, I've concluded that my preliminary analysis looks correct. The problem is that setting CNAK before the SETUP complete event is received causes a race condition. The SETUP callback is called when the SETUP packet event is received, which means that setting CNAK from the callback is too early. Originally the problem was that CNAK was set by ep_read() which is called by the callback. libopencm3#672 solved this by moving CNAK out of ep_read() and calling it after the SETUP complete event is received instead. The regression by libopencm3#785 is caused by the introduction of flow control calls into the SETUP callback. They also set CNAK. To solve this properly, I propose changing the event handling code to only call the SETUP callback after the SETUP complete event is received. Unfortunately, this implies that the callback can't call ep_read() itself anymore, because the packet has to be read out of the FIFO before the SETUP complete event arrives. This implies a change of the API between the hardware drivers and _usbd_control_setup(). ``` L1 (st_usbfs) works and passes tests as before change F4 (dwc_otg_fs) works and now passes tests. (yay) LM4f still compiles, and has had the same style of implementation as st_usbfs, however has not been tested on any hardware.
Both @h2obrain and @alibabashack are reporting (in comments on #785 and #600 respectively) that they are seeing a regression of the FIFO word loss issue originally reported as #600, caused by the introduction of #785.
I'm filing the regression as a new issue so we can keep all the discussion here, rather than spread around as comments on older tickets. Here's my preliminary analysis of the regression which I posted previously in a different comment:
The text was updated successfully, but these errors were encountered: