Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

usb: Regression of the FIFO word loss issue. #873

Closed
zyp opened this issue Jan 19, 2018 · 7 comments
Closed

usb: Regression of the FIFO word loss issue. #873

zyp opened this issue Jan 19, 2018 · 7 comments

Comments

@zyp
Copy link
Contributor

zyp commented Jan 19, 2018

Both @h2obrain and @alibabashack are reporting (in comments on #785 and #600 respectively) that they are seeing a regression of the FIFO word loss issue originally reported as #600, caused by the introduction of #785.

I'm filing the regression as a new issue so we can keep all the discussion here, rather than spread around as comments on older tickets. Here's my preliminary analysis of the regression which I posted previously in a different comment:

For reference, as far as I can recall at the moment, this issue were caused by a race condition in the usb hardware. It was fixed in 0c1ff68 which is rebased from #672, by rearranging the order of some operations related to flow control (CNAK/SNAK).

#785 enables the use of flow control on EP0 to ensure that control events are handled in order. It is probably adding flow control operations at a point that hits the race condition again.

zyp added a commit to zyp/libopencm3 that referenced this issue Jan 19, 2018
Attempted fix for libopencm3#873 by moving SETUP handling to SETUP complete event.

Note: This changes usbd API and st_usbfs driver is not yet updated.
@zyp
Copy link
Contributor Author

zyp commented Jan 19, 2018

After looking further into it, I've concluded that my preliminary analysis looks correct. The problem is that setting CNAK before the SETUP complete event is received causes a race condition. The SETUP callback is called when the SETUP packet event is received, which means that setting CNAK from the callback is too early.

Originally the problem was that CNAK was set by ep_read() which is called by the callback. #672 solved this by moving CNAK out of ep_read() and calling it after the SETUP complete event is received instead.

The regression by #785 is caused by the introduction of flow control calls into the SETUP callback. They also set CNAK.

To solve this properly, I propose changing the event handling code to only call the SETUP callback after the SETUP complete event is received. Unfortunately, this implies that the callback can't call ep_read() itself anymore, because the packet has to be read out of the FIFO before the SETUP complete event arrives. This implies a change of the API between the hardware drivers and _usbd_control_setup().

I have pushed a quick test fix here: https://github.com/zyp/libopencm3/tree/test_fix_otg_word_loss_regression

Please test with that branch and check if it appears to solve the problem. Note that I don't have access to hardware at the moment, so I haven't tested it myself.

@h2obrain
Copy link
Contributor

@zyp, with your test_fix the control control-loopback test runs through (with and without interrupts enabled). I also adde a pull request for the gadget-zero stuff to your test_fix_otg_word_loss_regression branch.

[ gadget-zero (test_fix_otg_word_loss_regression *)]$ python3 -m unittest  test_gadget0.py 
Running tests for DUT:  stm32f429i-disco
.........ss.................
----------------------------------------------------------------------
Ran 28 tests in 0.359s

OK (skipped=2)

I have not tested the test_fix with one of my devices, but the word loss disappears after merging the test_fix into my ctch branch (https://github.com/H2OBrain/libopencm3/tree/ctch_test_fix_otg_word_loss_regression), where I hacked in multi packet support.

@h2obrain
Copy link
Contributor

h2obrain commented Mar 15, 2018

@zyp unfortunately my devices still crash after some time.
The only thing needed is a request of some value over endpoint 0 every second.
The devices crash (do not respond to setup request) after 1min-1day. On windows, this means that the computer has to be rebooted after a day (sometimes unplugging/powering of the usb hub is enough). Libusb device finder guis (like zadig) cause a stack overflow, because they restart the setup requests indefinitely (after some timeout) and just forget about old transfer requests they added to the list...

@zyp
Copy link
Contributor Author

zyp commented Mar 16, 2018

Misbehaving USB devices is a nice way to find bugs on the host… I've lost track of how many times I've made the Mac OS X host stack deadlock. :)

If the device is still having sporadic problems, I think we need to collect more data to diagnose it. A test case that would reproduce it more often would be great. Maybe increasing the request rate to more than once per second would help?

@karlp
Copy link
Member

karlp commented May 2, 2018

I've (finally) got the ctrl loopback tests, and can confirm that I see missing word problems on f4, but no problems on f3/f0 (with st_usbfs)

@vampirefrog
Copy link

I can confirm this patch works on STM32F407 (Discovery board). Without it, I would not receive the first 4 bytes of the control transfer. I would receive bytes from 4 onward. So if I send 0x01 0x02 0x03 0x04 0x05 0x06, I only receive 0x05 0x06. After applying this patch, it seems to work correctly.

I had to adapt the code because the dwc otg code has been reworked in the meanwhile.

I think this is an important fix and needs some more attention to apply correctly, since without it, control transfers don't work correctly.

Need help making a proper patch. I can test on F407 and F103.

karlp added a commit that referenced this issue Aug 27, 2018
This is based on linux's gadget0 intel loopback tests, and also github
pr: #592

Note that this captures the currently broken control loopback issues on
dwc_otg devices.

See #873 and all linked
issues.

Current status is passing on f3, f0, and failing on f4.
karlp added a commit that referenced this issue Aug 27, 2018
See #873

Commentary describing this patch originally by zyp:

```
After looking further into it, I've concluded that my preliminary
analysis looks correct. The problem is that setting CNAK before
the SETUP complete event is received causes a race condition. The
SETUP callback is called when the SETUP packet event is received,
which means that setting CNAK from the callback is too early.

Originally the problem was that CNAK was set by ep_read() which is
called by the callback. #672 solved this by moving CNAK out of
ep_read() and calling it after the SETUP complete event is received
instead.

The regression by #785 is caused by the introduction of flow control
calls into the SETUP callback. They also set CNAK.

To solve this properly, I propose changing the event handling code
to only call the SETUP callback after the SETUP complete event is
received. Unfortunately, this implies that the callback can't call
ep_read() itself anymore, because the packet has to be read out of
the FIFO before the SETUP complete event arrives. This implies a
change of the API between the hardware drivers and _usbd_control_setup().
```

L1 (st_usbfs) works and passes tests as before change
F4 (dwc_otg_fs) works and now passes tests. (yay)
LM4f still compiles, and has had the same style of implementation as
st_usbfs, however has not been tested on any hardware.
@karlp
Copy link
Member

karlp commented Aug 27, 2018

This has merged now, and fixes the tests that were added to catch this. thanks all for pushing on this

@karlp karlp closed this as completed Aug 27, 2018
grevaillot pushed a commit to grevaillot/libopencm3 that referenced this issue Jan 11, 2019
This is based on linux's gadget0 intel loopback tests, and also github
pr: libopencm3#592

Note that this captures the currently broken control loopback issues on
dwc_otg devices.

See libopencm3#873 and all linked
issues.

Current status is passing on f3, f0, and failing on f4.
grevaillot pushed a commit to grevaillot/libopencm3 that referenced this issue Jan 11, 2019
See libopencm3#873

Commentary describing this patch originally by zyp:

```
After looking further into it, I've concluded that my preliminary
analysis looks correct. The problem is that setting CNAK before
the SETUP complete event is received causes a race condition. The
SETUP callback is called when the SETUP packet event is received,
which means that setting CNAK from the callback is too early.

Originally the problem was that CNAK was set by ep_read() which is
called by the callback. libopencm3#672 solved this by moving CNAK out of
ep_read() and calling it after the SETUP complete event is received
instead.

The regression by libopencm3#785 is caused by the introduction of flow control
calls into the SETUP callback. They also set CNAK.

To solve this properly, I propose changing the event handling code
to only call the SETUP callback after the SETUP complete event is
received. Unfortunately, this implies that the callback can't call
ep_read() itself anymore, because the packet has to be read out of
the FIFO before the SETUP complete event arrives. This implies a
change of the API between the hardware drivers and _usbd_control_setup().
```

L1 (st_usbfs) works and passes tests as before change
F4 (dwc_otg_fs) works and now passes tests. (yay)
LM4f still compiles, and has had the same style of implementation as
st_usbfs, however has not been tested on any hardware.
BOJIT pushed a commit to BOJIT/PlatformIO-libopencm3 that referenced this issue Jan 30, 2021
This is based on linux's gadget0 intel loopback tests, and also github
pr: libopencm3#592

Note that this captures the currently broken control loopback issues on
dwc_otg devices.

See libopencm3#873 and all linked
issues.

Current status is passing on f3, f0, and failing on f4.
BOJIT pushed a commit to BOJIT/PlatformIO-libopencm3 that referenced this issue Jan 30, 2021
See libopencm3#873

Commentary describing this patch originally by zyp:

```
After looking further into it, I've concluded that my preliminary
analysis looks correct. The problem is that setting CNAK before
the SETUP complete event is received causes a race condition. The
SETUP callback is called when the SETUP packet event is received,
which means that setting CNAK from the callback is too early.

Originally the problem was that CNAK was set by ep_read() which is
called by the callback. libopencm3#672 solved this by moving CNAK out of
ep_read() and calling it after the SETUP complete event is received
instead.

The regression by libopencm3#785 is caused by the introduction of flow control
calls into the SETUP callback. They also set CNAK.

To solve this properly, I propose changing the event handling code
to only call the SETUP callback after the SETUP complete event is
received. Unfortunately, this implies that the callback can't call
ep_read() itself anymore, because the packet has to be read out of
the FIFO before the SETUP complete event arrives. This implies a
change of the API between the hardware drivers and _usbd_control_setup().
```

L1 (st_usbfs) works and passes tests as before change
F4 (dwc_otg_fs) works and now passes tests. (yay)
LM4f still compiles, and has had the same style of implementation as
st_usbfs, however has not been tested on any hardware.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants