faster spiFlashWriteByte #1862
Replies: 17 comments
-
Posted at 2020-06-03 by @gfwilliams Hi, Honestly, I think the only way to really be sure with stuff like this is to benchmark (and also look at the disassembly (
And it seems reasonably tight - everything is already inlined. I'm not sure you're really going to see a noticeable difference in speed especially as I believe IO is stuck at 16MHz. If it did turn out that unrolling the loop was significantly faster then I'd be interested, but if at all possible I'd like to just get the compiler to do it so it wasn't such a maintenance headache: https://stackoverflow.com/questions/4071690/tell-gcc-to-specifically-unroll-a-loop I don't know for sure what the flash chip used in the Bangle is I'm afraid - as far as I can tell it's marked Honestly I developed the whole thing from scratch on one Bangle.js device which is still working fine, so I don't think you'll brick your Bangle if you mess with the flash code. The bootloader doesn't use flash at all so you should still get able to get to it regardless of what you change. I'd strongly suggest you open your Bangle and attach a nRF52DK debugger to it though - it'll make uploading firmware so much faster. Another option is to use something like an MDBT42 breakout, attach it to an external flash chip, and try using 'Inline C' in the Web IDE to see how fast you can get the IO going. Some things that I think really might increase speed though:
|
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-03 by FransM Hi Gordon, Thanks for the extensive reply. I was unaware of make lst, I used cc commands that I found by running make under sh -x Looking at the assembly unrolling the for loop would eliminate the branch at line 26. M4 has no instruction cache and no speculative execution or so, so that would help a bit. DMA might not be faster. I've seen situations where the setup time exceeded the time needed for bitbanging. And unfortunately I have no debugger. I think I didn't see that option when ordering my bangle (or should I have gotten it from somewhere else?) Note also that I am still learning about the ecosystem and software structure. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-03 by @gfwilliams Yes, it would eliminate the branch but I think in real terms that may not make a noticeable difference. The debugger isn't something I provide - you'd just have to wire up an nRF52832DK board. 3 wires basically, but you do have to pull back the LCD so it's not ideal if you plan on putting the watch back together again :) https://www.espruino.com/Bangle.js+Technical - there's a link to Macronix MX25 which seems to be a similar chip. I believe pretty much all SPI flash are driven the same way though |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-03 by FransM Eliminating the branch will save one clock cycle and up to 3 cycles to refill the pipeline.
I did not count all the instruction cycles, but the bcs instruction might well count for 10% of the time in the loop. The other suggestions also will help but gains in the inner loop are 8 times as effective as an optimisation in the outer loop. Afterthought: if we unroll the loop then the loading of r2 and the adds.w are also not needed. Edit: when unrolling it might also be that the asr.w can be simplified (and if not the decrement of r7 also will be kept. Wrt measuring: For measuring maybe the DWT CYCCNT register can be used. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-04 by @gfwilliams
Optimisations that end up not calling the function at all are probably going to be very significant :)
Maybe... IMO the best way to benchmark this is to write enough data that it takes a noticeable amount of time (you can do it with CS not asserted so it's ignored) and then measure that. But just in general, based on your other comments: I'm all for optimising flash memory access, graphics, and Espruino in general - but if this is specifically for a single-purpose function for blitting a small portion of screen it may not be something I'd want to pull into Bangle.js, given we're currently using pretty much all the available flash memory. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-04 by FransM
I tried unrolling with gcc 9.3.1 and with adding
before
The generated code:
As can be seen now each iteration saves 3 instructions. And of course this is not specific for blitting. Actually I did prototype this on flash write which is not relevant for blitting at all. (Oh and I can understand fully if you do not want to take this but it was fun researching this) |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by @gfwilliams Thanks - that's really interesting. Did you manage to get any real world performance stats though? It's fine in theory but I'm pretty sure the speed ends up limited by the IO bus speed. Since JS code saved to flash is executed from flash, you should see some improvements in actual JS execution speed. I tend to use this mandelbrot code as a speed test:
If you copy & paste this in the left hand size hopefully you'll see an improvement. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by @gfwilliams Just to add, I had a quick look at this, and I replaced the
|
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by @fanoush there is also unused SPI2 interface, if you are really going for speed, that could be even faster. spi is already compiled in so it should not enlarge code size as the driver code is already there. or it can be even written without using nordic drivers, without using interrupts it is just writing RXD,TXD registers and polling ready flag, this would be similar style to current bitbanging way but could be even shorter (and faster) |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by FransM I don't think I/O will be the bottleneck. This is bit-banged SPI so the clock is generated by software. Of course it could be that the chip introduces additional latency when writing to a GPIO pin, but generally that is pretty fast, just sending a bit to an output pin. Wrt the test: For the test: as I only updated spiFlashWrite I don't expect any gains as executing from flash is only read, but let me try. Copying all but the last three lines to the right pane and selecting execute from ram gives me 7.4944 seconds. Let's try to also unroll the loop for spiFlashReadWrite rerunning the exe now gives me 8.7609, 2nd run gives 8.7582 Observations/tentative conclusions
While looking at the code I noticed that some more optimization opportunities. I'll look into those as well and report back. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by @gfwilliams
Yes, that's what I mean. I believe the IO bus on-chip is 16MHz
Works for me. Ubuntu + Chrome too. Ctrl+V ? Actually I found maybe a better set of tests for you:
More info at espruino/Espruino#1849 |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by @gfwilliams Yes @fanoush - that was one of my initial suggestions - it could definitely be worth a try... On Bangle.js I don't even use SPI1 for anything :) The latest changes are no committed to Git so you should see some decent improvements. The speed difference could well be the new GCC - probably the optimise for size uses some slightly different heuristics. But yes, if you see other optimisation improvements I'm definitely interested! Especially in drawImage(s) we have some fasts paths for drawing, but if you don't hit them then the unpacking of bitmap data isn't that fast and could probably be improved using a pointer to a decode function. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by FransM
Triggered by the comments in your code snippet I decided to sync with the latest version and measure again Indeed now unrolling also fails for me. Looking at the assembly the whole function spiFlashRead is inlined in jshFlashRead.
I stripped off the source as they were misleading. After unrolling the following 5 instructions are executed to read a bit
After that there is another write to set the clk high (write to reg OUTSET (0x508) I suspect that things fail because the clock down time is too short. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by FransM One last observation to report.
there was only one caller left to spiFlashReadWrite, namely in spiFlashStatus. This works like a charm but degrades the performace a bit as jshFlashRead does not inline spiFlashRead any more. Without my change jshFlashRead was the only caller of spiFlashRead and the compiler or the post linker optimizer decided the function could be inlined. It might we worthwhile to mark spiFlashRead and spiFlashWrite as inline functions. Thatwill (marginally) speed up jshFlashRead as the call to spiFlashWrite will be gone. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by @gfwilliams
That's odd - actually when I tried it, However you're talking about pretty minimal clock cycle impacts since it's only called once per read (which'll be 16 bytes normally) - again, if it speeds things up noticeably I'm all for it - but I'm not convinced you will see a noticeable increase in read speeds |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by FransM Probably the function was inlined because I was using gcc 9 If you want to get rid of spiFlashReadWrite you can merge this pull request Edit: just force-pushed an update to the branch as I forgot to add an entry in Changelog |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-05 by @gfwilliams Thanks! Don't worry about the changelog - the commit comes after all the other stuff to change flash memory access so I don't think it's a big deal. |
Beta Was this translation helpful? Give feedback.
-
Posted at 2020-06-02 by FransM
Hi,
I authored a version of spiFlashWriteByte that is a bit more efficient (at the expense of using a bit more rom due to loop unrolling. Appreciate feedback on this.
I'm also interested on how best to test this (I want to avoid bricking my bangle.js)
Code is at:
https://github.com/FransM/Espruino/tree/spiFlashWriteByte
Something similar can be done for spiFlashReadWriteByte
(oh and the rationale for this, is that these two functions are very low level functions to access the spi flash so every clock cycle saved here pays off).
Edit: what flash chip is actually used in bangje.js?
Beta Was this translation helpful? Give feedback.
All reactions