stmhal: ringbuffer support for USB VCP #1652

hoihu · 2015-11-27T00:23:05Z

adds ringbuffer for rx/tx over USB VCP
uses a "txComplete" callback rather than TIM3 so it frees up the timer

hoihu · 2015-11-27T00:27:26Z

This is referring to a discussion on the forum:
http://forum.micropython.org/viewtopic.php?f=6&t=1150

hoihu · 2015-11-27T00:30:11Z

@dhylands: Sure I can move it - I was unsure where to place it but I believe your location fits better

hoihu · 2015-11-27T00:31:53Z

Oh ok I also forgot to update the commit history in HALCOMMITS...

dhylands · 2015-11-27T00:35:41Z

Oh ok I also forgot to update the commit history in HALCOMMITS...

I didn't see any modified hal files...

If we do modify hal files, then we need to consider impacts to all of the hals

hoihu · 2015-11-27T00:38:24Z

yes I was thinking about the USB stack files actually, not the HAL files.
usbd_cdc_msc_hid.h
usbd_cdc_interface.c

They are also part of the cube library

hoihu · 2015-11-27T00:47:29Z

@dhylands I'd appreciate if you could test it on your bench - it works ok for me but as the change is quite large it may have side effects I was not aware of. Maybe by using rshell and forcing a lot of traffic?

I only did tests with the terminal REPL.

dhylands · 2015-11-27T01:38:33Z

They are also part of the cube library

They're really just examples and not part of the hal proper. HALCOMMITs is just for the files in the hal subtree

@dhylands I'd appreciate if you could test it on your bench - it works ok for me but as the change is quite large it may have side effects I was not aware of. Maybe by using rshell and forcing a lot of traffic?

I only did tests with the terminal REPL.

sure I'll give it a go later tonight

hoihu · 2015-11-27T09:56:37Z

Considering the CDC polling interval: currently the endpoint polling interval is set to 32msec. Is there a reason why this value is set so high? I haven't measured it but I'd assume that setting it to a value closer to 5msec would speed up data transfer rate and specifically lag-time..

I'm planning to do some measurements.

dpgeorge · 2015-11-27T10:59:31Z

Thanks @hoihu, this is really great!

Considering the CDC polling interval: currently the endpoint polling interval is set to 32msec. Is there a reason why this value is set so high?

To keep interrupts to a minimum and not overload the CPU with USB processing. We can revisit this value. I'm not sure how it works: can the USB CDC device only send (back to the host) one 64-byte USB packet per polling interval, or can it send many?

dpgeorge · 2015-11-27T11:00:27Z

stmhal/usbd_cdc_interface.c

-  * Unless required by applicable law or agreed to in writing, software 
-  * distributed under the License is distributed on an "AS IS" BASIS, 
+  * Unless required by applicable law or agreed to in writing, software
+  * distributed under the License is distributed on an "AS IS" BASIS,


Don't make cosmetic changes, that can be done separately.

Sure I'll fix this

hoihu · 2015-11-27T12:57:17Z

To keep interrupts to a minimum and not overload the CPU with USB processing. We can revisit this value.

USB processing time will increase but i'd assume not by much. The USB backend should take care of the low level processing and only call the callback if data is available. Some MCU's even have this low-level processing separated from the main CPU so it doesn't add overhead. I'm not sure how it is done in the F4 though.

I'm not sure how it works: can the USB CDC device only send (back to the host) one 64-byte USB packet per polling interval, or can it send many?

I'm not sure either. I guess that should be easily verified using some measurements (e.g. data transfer rate). If every 32msec a packet of 64 Bytes can be sent, the speed would be low (app. 2kBytes/sec)

The issue is that the large time of 32msec will increase the buffers size, as pointed out in the comments in the source file. I believe that reducing this time will add some processing time, but the benefit in speed and less RAM size is quite sigificant.

hoihu · 2015-11-27T14:43:16Z

So I did some measurement with reducing the polling interval to 5msec.

I toggled a pin in the CDC_Itf_TxFinished handler.

The testcase was measuring the output of "help()" from the REPL. This is app 2kB in size. It takes an astonishing 3msec to transfer, regardless of the polling interval. I'm surprised. It means that once the data transmission is active, the polling interval is close to the minimum (at least that's the repsonse on my iMac).

It would also mean that the net data rate would be close to 1MBytes/sec. This definitely sounds wrong - A wireshark USB log would be handy to have, but it's not availbable under OS X... Hmm maybe I have to activate my raspberry ;)

I also measured an example with buffer sizes of 256Bytes. Attached is the output of the logic analyser measuring the toggling

This in comparision with 1024 Bytes nuffer size:

The difference is that the TxComplete Handler is more active as it needs to setup more packets. But it look pretty clean to me.

pfalcon · 2015-11-27T14:47:43Z

I'd appreciate if you could test it on your bench - it works ok for me but as the change is quite large it may have side effects I was not aware of. Maybe by using rshell and forcing a lot of traffic?
I only did tests with the terminal REPL.

It would be nice if folks who propose such patches tested them on real use-cases, like e.g. lwip/slip.

dhylands · 2015-11-27T17:20:25Z

rshell is quite a bit faster with the new code. using master, I get 4.19 seconds to copy timer.c (a 57K file) to /flash and 1.89 seconds to copy the file from the pyboard to the host.

With this patch applied, it only takes 3.19 seconds to copy the same file to the pyboard and 0.25 seconds to copy it back.

hoihu · 2015-11-27T17:35:42Z

Thanks Dave, good to hear that it seems to work.

The asymmetry between the transfer times may be due to the flash handling perhaps?

dhylands · 2015-11-27T17:40:01Z

yeah - I'm pretty sure that the flash writing time is what dominates. rshell also has overhead because it uses the raw repl, so it constantly switches in and out of raw-repl mode which actually does a soft reset on the board each time.

dhylands · 2015-11-27T19:04:56Z

I tried using an sdcard and a much bigger file (946K) and got the following:

master: 20.2 secs copy host->pyboard, 28.6 seconds to copy pyboard->host
thispr: 8.9 secs copy host->pyboard, 3.2 seconds to copy pyboard->host

For each 512 bytes transferred, the code then waits for an ACK (this is a simple flow control), which artificially slows down the transfer rates.

I'm a little baffled by the 28 second pyboard->host time (I did repeat it several times and it was quite consistent). It could just be a case of bad timing in terms of the timeslots when the host polls the pyboard.

dhylands · 2015-11-27T19:05:29Z

Oh yeah - you should squash your commits (look at git rebase -i)

hoihu · 2015-11-27T19:34:59Z

3.2 seconds to copy pyboard->host

That's 2.4MBaud, I'm very pleased ;)

TIM3 was triggered in 10msec interval and apparently, based on your numbers, transferred app. 330Bytes during that time. I'd have expected to be around 1kByte (the size of the buffer), but it's in this range.

It means that TIM3 has delayed the transfer - the USB stack drained the buffer very quickly but the fixed 10msec intervals limited that. The new approach is avoiding this because the callback gets immediately called at TxComplete. I really don't know why ST hasn't exposed this callback in their stack.

It could just be a case of bad timing in terms of the timeslots when the host polls the pyboard.

I don't think it is bad timing - it's likely due to the TIM3 interval.

I'm also suprised by the speedup - the measurements with the logic analyser moreless match your numbers. So every 0.25msec a packet of 256 Bytes is apparently transferred....

hoihu · 2015-11-27T19:39:09Z

you should squash your commits (look at git rebase -i)

yes I know - I thought I'd wait and just commit to input from reviews. At the end I can submit a proper patch.

Maybe @dpgeorge has some more inputs?

hoihu · 2015-11-27T19:50:22Z

Thinking... I'd still favour lowering the poll interval to lets say 5 or 10msec for 2 reasons:

The transfer rate is apparently high once the transmission has started. So an (initial) delay of max. 30msec doesn't really match the transfer speed. It's a bit like "I have a fast car but I can only press the gas pedal every second". Fast cars should react fast. (just to clarify - I own no car :) ).
Less Buffer space / no more timeouts due to full buffers.

dhylands · 2015-11-27T21:31:49Z

It I comment out the file I/O and just send empty buffers, then the times drop to 4.045 secs host-> pyboard and 2.196 sec pyboard -> host

That's still slowed down by the acks every 512 bytes

The filesize was 972028 bytes (I've been using firmware.map)

hoihu · 2015-11-27T22:42:12Z

and 2.196 sec pyboard -> host

that's 3.55MBaud. Getting close to what I was measuring with the analyser (8 MBaud).

That's still slowed down by the acks every 512 bytes

out of curiosity - can you increase the ack to let's say 4kByte and check? Or does the 512 Bytes need to be that size?

Other than that did you experience any problems / hangs etc? Unfortunately I cannot replicate your test case using rshell as it appears not to work reliably under OS X using your latest version (can't enter raw repl mode although CTRLA works in normal repl).

dhylands · 2015-11-27T23:56:32Z

The largest buffer size I could anything to work with was 960 bytes. Anything larger than 1024 and it just hangs. I think it winds up overwriting data and then the other side doesn't get as much as it expects which makes it hang (which is why I had to but the flow control logic in there).

I'm pretty sure that the pyboard->host side intentionally overwrites the buffer to allow for the case that USB isn't connected at all. Perhaps that behaviour can be improved.

So I bumped the compiled in buffer sizes to 8192 bytes, and ran rshell with a buffer size of 7936 (8192-256) and then I was able to get host->pyboard down to 2.27 sec and pyboard->host to 1.495 sec (for 972028 bytes). Thats 5.2 Mbit/sec with the ACK handshaking. That's approaching 50% of wirespeed, which is pretty decent in my books (especially since this test wasn't designed to optimize speed).

I haven't seen any hangs - other than the expected ones when rshell gets out of sync with the remote.
I'll have to try it on OSX (I have a Mini here).

rshell (well pyboard.py) has issues if your boot.py prints anything.

hoihu · 2015-11-28T21:26:36Z

Thanks a lot for the feedback.

Well so the USB core runs at high speed? Because microframes (125usec) are part of the high speed spec. And interrupt transfer speed for full speed devices are way below the measured number (64kByte/sec = 0.5MBaud).

The F4 could run at high speed but the USB stack is configured to be full speed: From the STM manual:
("USE_USB_HS" and "USE_USB_HS_IN_FS" when using USB High Speed (HS) Core in FS mode)

Those 2 compiler switches are set. Maybe it's a mix between high speed and full speed.

Anyway. I believe the main point is that there is a great performance increase with the callback method - is there anything needed in order to make a proper pull request? Should I rebase?

dhylands · 2015-11-28T22:32:41Z

That's running on the full speed interface. My recollection is that you can transfer more than 64 bytes at a time, but it needs to be done using multiple 64 byte packets.

I don't have access to a USB sniffer any more, but I could try wireshark on my laptop.

I'd recommend that you squash and rebase to the latest master and add a notice when you've done that (github doesn't notify when you push a new commit)

hoihu · 2015-11-28T23:04:54Z

Ok I've rebased it on master.

I've left the deletion of the CDC_xxx constants @dpgeorge is that ok for you?

neilh10 · 2015-11-29T00:56:32Z

lib/utils/ringbuffer.c

+}
+
+bool ringbuffer_putc(ringbuffer_t* rbuffer, uint8_t character) {
+    *(rbuffer->push_ptr) = character;


Just wondering if this should be made atomic?. If the ringbuffer_putc() is ever called twice when its just in the middle of inserting into the buffer it could be a hard problem to notice. Also the ringbuffer could be used for other ringbuffer operations that would benefit it to be re-entrant.

It depends how you are going to use the buffer.

From what I hear you would serialize from different sources? The current implementation is done so that only one putter/getter pair is allowed yes, but it's fast and doesn't need blocking code. For serializing an Uart/I2c/SPI input I believe that is fine.

But there may be use cases where you would "put" from different sources to the same buffer (maybe to serialize incoming packets from different uarts?)

The normal case for multiple putters is having an irq and foreground both doing prints. I can't really imagine a case for multiple getters.

pfalcon · 2015-11-29T02:52:36Z

There're quite a few codestyle issues in this patch.

+    *(rbuffer->push_ptr) = character;
+    rbuffer->push_ptr++;

In C, this is written as:

+    *rbuffer->push_ptr++ = character;

dpgeorge · 2015-11-30T00:03:33Z

I've left the deletion of the CDC_xxx constants @dpgeorge is that ok for you?

No, please keep them all and keep them in the .c file (they don't appear anywhere else under stmha/usbdev/ directory).

Also, there are still lots of cosmetic changes which aren't needed, mostly in stmhal/usbdev/class/inc/usbd_cdc_msc_hid.h

I'll need to check things properly, but I'm not sure about how the ringbuffer is implemented. Eg, we could re-use it for uart receive, but you'll notice that the uart rx circular buffer is implemented in a much more concise way.

neilh10 · 2015-11-30T02:12:09Z

1+ for pattern re-usage.The ringbuffer could be used for Uart Tx (any serialization). I would strongly encourage that they be made safe for reentrancy for future proofing. I've implemented a Uart Tx buffering in a private build (not based on ringbuffer but similar)

hoihu · 2015-11-30T07:31:06Z

@dpgeorge: No problem I'll put them back. Would you prefer to split the ringbuffer commit and place a separate pull request for the USB callback? may be a bit of work though.

I had a look at the Uart implementation. There are IMO 2 differences, one is to separate 2byte entries vs 1 byte entries, the other is the modulo check on overflow.

I believe it is not a lot of differences, but I can include some of that in the current implementation if you wish.

hoihu · 2015-12-02T19:49:00Z

In C, this is written as:

*rbuffer->push_ptr++ = character;

yes - but:
http://denniskubes.com/2012/08/14/do-you-know-what-p-does-in-c/

I prefer to do pointer arithmetics (or in general any operator expressions) in separate steps. The hex output is the same anyway.

pfalcon · 2015-12-02T20:01:39Z

I prefer to do pointer arithmetics (or in general any operator expressions) in separate steps. The hex output is the same anyway.

Thanks for sharing your preferences with us, insightful. Now, MicroPython has its own code style, too, and if you would like to contribute, please follow it. Thanks.

dpgeorge · 2015-12-02T20:01:50Z

Regarding the ringbuffer stuff. It doesn't feel right making it a global library in lib/utils because there are many ways to implement a ringbuffer and each has tradeoffs (eg RAM for speed), and the implementation is generally pretty minimal that you can do it inline (eg as per uart.c). I see that for the USB stuff you use a ringbuffer for tx and rx and so that's a good reason to have a ringbuffer object, but I don't think this particular implementation should be the one-and-only to be used by everyone.

So I'd suggest putting the ringbuffer code directly into usb_cdc_interface.c.

hoihu · 2015-12-03T20:54:28Z

So I'd suggest putting the ringbuffer code directly into usb_cdc_interface.c.

ok will move it. Regarding those cosmetic whitespaces... it turned out to be an issue with the Atom editor.

Thanks for sharing your preferences with us, insightful.

My comment was trying to express a personal opinion. I'm not an english native speaker so if my formulation sounded harsh I'm sorry that was not the intention. I'll address your input with the next pull request.

pfalcon · 2015-12-03T21:25:21Z

My comment was trying to express a personal opinion. I'm not an english native speaker so if my formulation sounded harsh I'm sorry that was not the intention. I'll address your input with the next pull request.

No, it didn't, and I really hope my response wasn't such either. Ironic, yes, because I tried to hint that arguments like "that's what I like" are not productive. Every person has own, usually different, likes, but if everyone will write code in different way, then a project will be a mess. So, a project adopts its own code style (essentially, its "like"), and kindly ask contributors to use it when contributing to project. MicroPython's code style is described in https://github.com/micropython/micropython/blob/master/CODECONVENTIONS.md . Of course, every aspect can't be covered there, and the best reference is MicroPython code itself - just try to write yours as a surrounding code looks. What's nice is that this approach works for any project (because indeed, like humans, different projects adopt different code styles). Thanks for considering these issues and looking forward to this patch being cleaned up and merged, as it's indeed solves real issues and provides useful functionality.

- ringbuffer for rx/tx of USB VCP communication - uses a "txComplete" callback rather than TIM3 so it frees up the timer and provides a faster troughput

dpgeorge · 2015-12-04T14:03:47Z

I did some extensive testing and this does not work reliably. There is some issue with the buffering, but it's not clear exactly what. To reproduce run all the tests on a pyboard: cd tests && ./run-tests --target pyboard. About 30% of them fail.

hoihu · 2015-12-05T19:02:26Z

About 30% of them fail.

yes I can reproduce this. I'll have a look at it.

hoihu · 2015-12-13T00:35:59Z

Sorry for the late reply. It's an issue with the ring buffer.

As @neilh10 has pointed out, the putc/getc operation must be atomic, otherwise it can cause raise conditions as the putc operation runs in foreground whereas getc gets called from PCD_irq.

So I've decided to make a new pull request and just apply the TIM3 / callback staff rather than including the ringbuffer which is more complex to implement properly. That pull request will also be much more compact. So I'm going to close this one. Thanks for all the feedback.

dpgeorge reviewed Nov 27, 2015
View reviewed changes

hoihu force-pushed the ringbuffer branch from 9bc8568 to a0cc66b Compare November 28, 2015 22:59

neilh10 reviewed Nov 29, 2015
View reviewed changes

hoihu force-pushed the ringbuffer branch from a0cc66b to 056e0d0 Compare December 2, 2015 19:53

stmhal: ringbuffer support for USB VCP

3be0314

- ringbuffer for rx/tx of USB VCP communication - uses a "txComplete" callback rather than TIM3 so it frees up the timer and provides a faster troughput

hoihu force-pushed the ringbuffer branch from 056e0d0 to 3be0314 Compare December 3, 2015 22:15

hoihu closed this Dec 13, 2015

hoihu mentioned this pull request Dec 13, 2015

stmhal: USB CDC speed improvements, TIM3 release #1713

Closed

stmhal: ringbuffer support for USB VCP #1652

stmhal: ringbuffer support for USB VCP #1652

Conversation

hoihu commented Nov 27, 2015

hoihu commented Nov 27, 2015

hoihu commented Nov 27, 2015

hoihu commented Nov 27, 2015

dhylands commented Nov 27, 2015

hoihu commented Nov 27, 2015

hoihu commented Nov 27, 2015

dhylands commented Nov 27, 2015

hoihu commented Nov 27, 2015

dpgeorge commented Nov 27, 2015

dpgeorge Nov 27, 2015

Choose a reason for hiding this comment

hoihu Nov 27, 2015

Choose a reason for hiding this comment

hoihu commented Nov 27, 2015

hoihu commented Nov 27, 2015

pfalcon commented Nov 27, 2015

dhylands commented Nov 27, 2015

hoihu commented Nov 27, 2015

dhylands commented Nov 27, 2015

dhylands commented Nov 27, 2015

dhylands commented Nov 27, 2015

hoihu commented Nov 27, 2015

hoihu commented Nov 27, 2015

hoihu commented Nov 27, 2015

dhylands commented Nov 27, 2015

hoihu commented Nov 27, 2015

dhylands commented Nov 27, 2015

hoihu commented Nov 28, 2015

dhylands commented Nov 28, 2015

hoihu commented Nov 28, 2015

neilh10 Nov 29, 2015

Choose a reason for hiding this comment

hoihu Nov 30, 2015

Choose a reason for hiding this comment

dhylands Nov 30, 2015

Choose a reason for hiding this comment

pfalcon commented Nov 29, 2015

dpgeorge commented Nov 30, 2015

neilh10 commented Nov 30, 2015

hoihu commented Nov 30, 2015

hoihu commented Dec 2, 2015

pfalcon commented Dec 2, 2015

dpgeorge commented Dec 2, 2015

hoihu commented Dec 3, 2015

pfalcon commented Dec 3, 2015

dpgeorge commented Dec 4, 2015

hoihu commented Dec 5, 2015

hoihu commented Dec 13, 2015