Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stmhal: ringbuffer support for USB VCP #1652

Closed
wants to merge 1 commit into from

Conversation

hoihu
Copy link
Sponsor Contributor

@hoihu hoihu commented Nov 27, 2015

  • adds ringbuffer for rx/tx over USB VCP
  • uses a "txComplete" callback rather than TIM3 so it frees up the timer

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

This is referring to a discussion on the forum:
http://forum.micropython.org/viewtopic.php?f=6&t=1150

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

@dhylands: Sure I can move it - I was unsure where to place it but I believe your location fits better

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

Oh ok I also forgot to update the commit history in HALCOMMITS...

@dhylands
Copy link
Contributor

Oh ok I also forgot to update the commit history in HALCOMMITS...

I didn't see any modified hal files...

If we do modify hal files, then we need to consider impacts to all of the hals

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

yes I was thinking about the USB stack files actually, not the HAL files.
usbd_cdc_msc_hid.h
usbd_cdc_interface.c

They are also part of the cube library

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

@dhylands I'd appreciate if you could test it on your bench - it works ok for me but as the change is quite large it may have side effects I was not aware of. Maybe by using rshell and forcing a lot of traffic?

I only did tests with the terminal REPL.

@dhylands
Copy link
Contributor

They are also part of the cube library

They're really just examples and not part of the hal proper. HALCOMMITs is just for the files in the hal subtree

@dhylands I'd appreciate if you could test it on your bench - it works ok for me but as the change is quite large it may have side effects I was not aware of. Maybe by using rshell and forcing a lot of traffic?

I only did tests with the terminal REPL.

  • sure I'll give it a go later tonight

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

Considering the CDC polling interval: currently the endpoint polling interval is set to 32msec. Is there a reason why this value is set so high? I haven't measured it but I'd assume that setting it to a value closer to 5msec would speed up data transfer rate and specifically lag-time..

I'm planning to do some measurements.

@dpgeorge
Copy link
Member

Thanks @hoihu, this is really great!

Considering the CDC polling interval: currently the endpoint polling interval is set to 32msec. Is there a reason why this value is set so high?

To keep interrupts to a minimum and not overload the CPU with USB processing. We can revisit this value. I'm not sure how it works: can the USB CDC device only send (back to the host) one 64-byte USB packet per polling interval, or can it send many?

* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't make cosmetic changes, that can be done separately.

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I'll fix this

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

To keep interrupts to a minimum and not overload the CPU with USB processing. We can revisit this value.

USB processing time will increase but i'd assume not by much. The USB backend should take care of the low level processing and only call the callback if data is available. Some MCU's even have this low-level processing separated from the main CPU so it doesn't add overhead. I'm not sure how it is done in the F4 though.

I'm not sure how it works: can the USB CDC device only send (back to the host) one 64-byte USB packet per polling interval, or can it send many?

I'm not sure either. I guess that should be easily verified using some measurements (e.g. data transfer rate). If every 32msec a packet of 64 Bytes can be sent, the speed would be low (app. 2kBytes/sec)

The issue is that the large time of 32msec will increase the buffers size, as pointed out in the comments in the source file. I believe that reducing this time will add some processing time, but the benefit in speed and less RAM size is quite sigificant.

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

So I did some measurement with reducing the polling interval to 5msec.

I toggled a pin in the CDC_Itf_TxFinished handler.

The testcase was measuring the output of "help()" from the REPL. This is app 2kB in size. It takes an astonishing 3msec to transfer, regardless of the polling interval. I'm surprised. It means that once the data transmission is active, the polling interval is close to the minimum (at least that's the repsonse on my iMac).

It would also mean that the net data rate would be close to 1MBytes/sec. This definitely sounds wrong - A wireshark USB log would be handy to have, but it's not availbable under OS X... Hmm maybe I have to activate my raspberry ;)

I also measured an example with buffer sizes of 256Bytes. Attached is the output of the logic analyser measuring the toggling
screen shot 2015-11-27 at 15 26 20

This in comparision with 1024 Bytes nuffer size:
screen shot 2015-11-27 at 15 26 57

The difference is that the TxComplete Handler is more active as it needs to setup more packets. But it look pretty clean to me.

@pfalcon
Copy link
Contributor

pfalcon commented Nov 27, 2015

I'd appreciate if you could test it on your bench - it works ok for me but as the change is quite large it may have side effects I was not aware of. Maybe by using rshell and forcing a lot of traffic?
I only did tests with the terminal REPL.

It would be nice if folks who propose such patches tested them on real use-cases, like e.g. lwip/slip.

@dhylands
Copy link
Contributor

rshell is quite a bit faster with the new code. using master, I get 4.19 seconds to copy timer.c (a 57K file) to /flash and 1.89 seconds to copy the file from the pyboard to the host.

With this patch applied, it only takes 3.19 seconds to copy the same file to the pyboard and 0.25 seconds to copy it back.

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

Thanks Dave, good to hear that it seems to work.

The asymmetry between the transfer times may be due to the flash handling perhaps?

@dhylands
Copy link
Contributor

yeah - I'm pretty sure that the flash writing time is what dominates. rshell also has overhead because it uses the raw repl, so it constantly switches in and out of raw-repl mode which actually does a soft reset on the board each time.

@dhylands
Copy link
Contributor

I tried using an sdcard and a much bigger file (946K) and got the following:

master: 20.2 secs copy host->pyboard, 28.6 seconds to copy pyboard->host
thispr: 8.9 secs copy host->pyboard, 3.2 seconds to copy pyboard->host

For each 512 bytes transferred, the code then waits for an ACK (this is a simple flow control), which artificially slows down the transfer rates.

I'm a little baffled by the 28 second pyboard->host time (I did repeat it several times and it was quite consistent). It could just be a case of bad timing in terms of the timeslots when the host polls the pyboard.

@dhylands
Copy link
Contributor

Oh yeah - you should squash your commits (look at git rebase -i)

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

3.2 seconds to copy pyboard->host

That's 2.4MBaud, I'm very pleased ;)

TIM3 was triggered in 10msec interval and apparently, based on your numbers, transferred app. 330Bytes during that time. I'd have expected to be around 1kByte (the size of the buffer), but it's in this range.

It means that TIM3 has delayed the transfer - the USB stack drained the buffer very quickly but the fixed 10msec intervals limited that. The new approach is avoiding this because the callback gets immediately called at TxComplete. I really don't know why ST hasn't exposed this callback in their stack.

It could just be a case of bad timing in terms of the timeslots when the host polls the pyboard.

I don't think it is bad timing - it's likely due to the TIM3 interval.

I'm also suprised by the speedup - the measurements with the logic analyser moreless match your numbers. So every 0.25msec a packet of 256 Bytes is apparently transferred....

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

you should squash your commits (look at git rebase -i)

yes I know - I thought I'd wait and just commit to input from reviews. At the end I can submit a proper patch.

Maybe @dpgeorge has some more inputs?

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

Thinking... I'd still favour lowering the poll interval to lets say 5 or 10msec for 2 reasons:

  1. The transfer rate is apparently high once the transmission has started. So an (initial) delay of max. 30msec doesn't really match the transfer speed. It's a bit like "I have a fast car but I can only press the gas pedal every second". Fast cars should react fast. (just to clarify - I own no car :) ).
  2. Less Buffer space / no more timeouts due to full buffers.

@dhylands
Copy link
Contributor

It I comment out the file I/O and just send empty buffers, then the times drop to 4.045 secs host-> pyboard and 2.196 sec pyboard -> host

That's still slowed down by the acks every 512 bytes

The filesize was 972028 bytes (I've been using firmware.map)

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 27, 2015

and 2.196 sec pyboard -> host

that's 3.55MBaud. Getting close to what I was measuring with the analyser (8 MBaud).

That's still slowed down by the acks every 512 bytes

out of curiosity - can you increase the ack to let's say 4kByte and check? Or does the 512 Bytes need to be that size?

Other than that did you experience any problems / hangs etc? Unfortunately I cannot replicate your test case using rshell as it appears not to work reliably under OS X using your latest version (can't enter raw repl mode although CTRLA works in normal repl).

@dhylands
Copy link
Contributor

The largest buffer size I could anything to work with was 960 bytes. Anything larger than 1024 and it just hangs. I think it winds up overwriting data and then the other side doesn't get as much as it expects which makes it hang (which is why I had to but the flow control logic in there).

I'm pretty sure that the pyboard->host side intentionally overwrites the buffer to allow for the case that USB isn't connected at all. Perhaps that behaviour can be improved.

So I bumped the compiled in buffer sizes to 8192 bytes, and ran rshell with a buffer size of 7936 (8192-256) and then I was able to get host->pyboard down to 2.27 sec and pyboard->host to 1.495 sec (for 972028 bytes). Thats 5.2 Mbit/sec with the ACK handshaking. That's approaching 50% of wirespeed, which is pretty decent in my books (especially since this test wasn't designed to optimize speed).

I haven't seen any hangs - other than the expected ones when rshell gets out of sync with the remote.
I'll have to try it on OSX (I have a Mini here).

rshell (well pyboard.py) has issues if your boot.py prints anything.

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 28, 2015

Thanks a lot for the feedback.

Well so the USB core runs at high speed? Because microframes (125usec) are part of the high speed spec. And interrupt transfer speed for full speed devices are way below the measured number (64kByte/sec = 0.5MBaud).

The F4 could run at high speed but the USB stack is configured to be full speed: From the STM manual:
("USE_USB_HS" and "USE_USB_HS_IN_FS" when using USB High Speed (HS) Core in FS mode)

Those 2 compiler switches are set. Maybe it's a mix between high speed and full speed.

Anyway. I believe the main point is that there is a great performance increase with the callback method - is there anything needed in order to make a proper pull request? Should I rebase?

@dhylands
Copy link
Contributor

That's running on the full speed interface. My recollection is that you can transfer more than 64 bytes at a time, but it needs to be done using multiple 64 byte packets.

I don't have access to a USB sniffer any more, but I could try wireshark on my laptop.

I'd recommend that you squash and rebase to the latest master and add a notice when you've done that (github doesn't notify when you push a new commit)

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 28, 2015

Ok I've rebased it on master.

I've left the deletion of the CDC_xxx constants @dpgeorge is that ok for you?

}

bool ringbuffer_putc(ringbuffer_t* rbuffer, uint8_t character) {
*(rbuffer->push_ptr) = character;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering if this should be made atomic?. If the ringbuffer_putc() is ever called twice when its just in the middle of inserting into the buffer it could be a hard problem to notice. Also the ringbuffer could be used for other ringbuffer operations that would benefit it to be re-entrant.

Copy link
Sponsor Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends how you are going to use the buffer.

From what I hear you would serialize from different sources? The current implementation is done so that only one putter/getter pair is allowed yes, but it's fast and doesn't need blocking code. For serializing an Uart/I2c/SPI input I believe that is fine.

But there may be use cases where you would "put" from different sources to the same buffer (maybe to serialize incoming packets from different uarts?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The normal case for multiple putters is having an irq and foreground both doing prints. I can't really imagine a case for multiple getters.

@pfalcon
Copy link
Contributor

pfalcon commented Nov 29, 2015

There're quite a few codestyle issues in this patch.

+    *(rbuffer->push_ptr) = character;
+    rbuffer->push_ptr++;

In C, this is written as:

+    *rbuffer->push_ptr++ = character;

@dpgeorge
Copy link
Member

I've left the deletion of the CDC_xxx constants @dpgeorge is that ok for you?

No, please keep them all and keep them in the .c file (they don't appear anywhere else under stmha/usbdev/ directory).

Also, there are still lots of cosmetic changes which aren't needed, mostly in stmhal/usbdev/class/inc/usbd_cdc_msc_hid.h

I'll need to check things properly, but I'm not sure about how the ringbuffer is implemented. Eg, we could re-use it for uart receive, but you'll notice that the uart rx circular buffer is implemented in a much more concise way.

@neilh10
Copy link
Contributor

neilh10 commented Nov 30, 2015

1+ for pattern re-usage.The ringbuffer could be used for Uart Tx (any serialization). I would strongly encourage that they be made safe for reentrancy for future proofing. I've implemented a Uart Tx buffering in a private build (not based on ringbuffer but similar)

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Nov 30, 2015

@dpgeorge: No problem I'll put them back. Would you prefer to split the ringbuffer commit and place a separate pull request for the USB callback? may be a bit of work though.

I had a look at the Uart implementation. There are IMO 2 differences, one is to separate 2byte entries vs 1 byte entries, the other is the modulo check on overflow.

I believe it is not a lot of differences, but I can include some of that in the current implementation if you wish.

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Dec 2, 2015

In C, this is written as:

  • *rbuffer->push_ptr++ = character;

yes - but:
http://denniskubes.com/2012/08/14/do-you-know-what-p-does-in-c/

I prefer to do pointer arithmetics (or in general any operator expressions) in separate steps. The hex output is the same anyway.

@pfalcon
Copy link
Contributor

pfalcon commented Dec 2, 2015

I prefer to do pointer arithmetics (or in general any operator expressions) in separate steps. The hex output is the same anyway.

Thanks for sharing your preferences with us, insightful. Now, MicroPython has its own code style, too, and if you would like to contribute, please follow it. Thanks.

@dpgeorge
Copy link
Member

dpgeorge commented Dec 2, 2015

Regarding the ringbuffer stuff. It doesn't feel right making it a global library in lib/utils because there are many ways to implement a ringbuffer and each has tradeoffs (eg RAM for speed), and the implementation is generally pretty minimal that you can do it inline (eg as per uart.c). I see that for the USB stuff you use a ringbuffer for tx and rx and so that's a good reason to have a ringbuffer object, but I don't think this particular implementation should be the one-and-only to be used by everyone.

So I'd suggest putting the ringbuffer code directly into usb_cdc_interface.c.

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Dec 3, 2015

So I'd suggest putting the ringbuffer code directly into usb_cdc_interface.c.

ok will move it. Regarding those cosmetic whitespaces... it turned out to be an issue with the Atom editor.

Thanks for sharing your preferences with us, insightful.

My comment was trying to express a personal opinion. I'm not an english native speaker so if my formulation sounded harsh I'm sorry that was not the intention. I'll address your input with the next pull request.

@pfalcon
Copy link
Contributor

pfalcon commented Dec 3, 2015

My comment was trying to express a personal opinion. I'm not an english native speaker so if my formulation sounded harsh I'm sorry that was not the intention. I'll address your input with the next pull request.

No, it didn't, and I really hope my response wasn't such either. Ironic, yes, because I tried to hint that arguments like "that's what I like" are not productive. Every person has own, usually different, likes, but if everyone will write code in different way, then a project will be a mess. So, a project adopts its own code style (essentially, its "like"), and kindly ask contributors to use it when contributing to project. MicroPython's code style is described in https://github.com/micropython/micropython/blob/master/CODECONVENTIONS.md . Of course, every aspect can't be covered there, and the best reference is MicroPython code itself - just try to write yours as a surrounding code looks. What's nice is that this approach works for any project (because indeed, like humans, different projects adopt different code styles). Thanks for considering these issues and looking forward to this patch being cleaned up and merged, as it's indeed solves real issues and provides useful functionality.

- ringbuffer for rx/tx of USB VCP communication
- uses a "txComplete" callback rather than TIM3 so it frees up the timer and provides a faster troughput
@dpgeorge
Copy link
Member

dpgeorge commented Dec 4, 2015

I did some extensive testing and this does not work reliably. There is some issue with the buffering, but it's not clear exactly what. To reproduce run all the tests on a pyboard: cd tests && ./run-tests --target pyboard. About 30% of them fail.

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Dec 5, 2015

About 30% of them fail.

yes I can reproduce this. I'll have a look at it.

@hoihu
Copy link
Sponsor Contributor Author

hoihu commented Dec 13, 2015

Sorry for the late reply. It's an issue with the ring buffer.

As @neilh10 has pointed out, the putc/getc operation must be atomic, otherwise it can cause raise conditions as the putc operation runs in foreground whereas getc gets called from PCD_irq.

So I've decided to make a new pull request and just apply the TIM3 / callback staff rather than including the ringbuffer which is more complex to implement properly. That pull request will also be much more compact. So I'm going to close this one. Thanks for all the feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants