Optimised software rasterizer / ImGui on Arduino #1613

Closed
LAK132 opened this issue Feb 13, 2018 · 37 comments

@LAK132

LAK132 commented Feb 13, 2018

Creating a new thread for this so I don't clutter the other sharing threads

Current performance figures:

so far so good

Slider 1: Loop time excluding raster and TFT draw time (4ms)

Slider 2: TFT draw time (248ms)

Slider 3: Raster time (187ms)

The main loop without rasterizing and drawing to the screen is running at ~240Hz which is crazy fast for such a small device (160MHz ESP32), no need for any more performance gains here.

The TFT library is pushed right to its limit; the only way to get it any faster will be to crank up the SPI speed or somehow work out how to draw only the parts of the screen that have actually changed between refreshes (this would be handy for other devices with less RAM).
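One way to approach the "only draw what changed" idea is to compare the previous and current frames and find the band of rows that differ. The following is a minimal sketch, not code from ImDuino: `DirtyRows` and `findDirtyRows` are hypothetical names, and it assumes there is enough RAM to keep two RGB565 framebuffers, which may not hold on smaller devices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: inclusive range of rows that differ between two frames.
struct DirtyRows {
    int first; // first changed row, or -1 if nothing changed
    int last;  // last changed row (inclusive), or -1 if nothing changed
};

// Compares two RGB565 framebuffers of identical size, one row at a time.
DirtyRows findDirtyRows(const uint16_t* prev, const uint16_t* curr, int width, int height)
{
    assert(prev && curr && width > 0 && height > 0);
    const std::size_t rowBytes = static_cast<std::size_t>(width) * sizeof(uint16_t);
    DirtyRows r{-1, -1};
    for (int y = 0; y < height; ++y) {
        const std::size_t offset = static_cast<std::size_t>(y) * width;
        if (std::memcmp(prev + offset, curr + offset, rowBytes) != 0) {
            if (r.first < 0) r.first = y; // remember the first row that differs
            r.last = y;                   // keep extending to the last one
        }
    }
    return r;
}
```

Because a band of full-width rows is contiguous in memory, the changed band could be sent to the display as a single bitmap of height `last - first + 1`, and skipped entirely when `first` is -1.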

The rasterizer is still running a little slow, but as mentioned before it can be optimised with a heap of special cases, some of which have been tested but are currently disabled for debugging purposes.

  1. Test if 2 triangles are actually a square

  2. Test if triangle is a flat colour with basically no UV map, if it is don't bother interpolating values

  3. Test if triangle is actually a line/single pixel

  4. Faster lerp factor calculation for grid aligned triangles/rectangles

  5. Remove rounding

  6. Special cases for alpha blending

  7. Less specific to the rasterizer but not rendering the window or background makes sense on a platform like this and should also give a performance boost

I'm going to continue to try and get as much performance as I can out of the software rasterizer for use both on Arduino and for general PC use, perhaps it will be a good base for an sr example or the regression testing system mentioned in an earlier thread.

If anyone has any working examples for optimisations that would certainly help speed me up

Code: https://github.com/LAK132/ImDuino
Only necessary modification to the base ImGui library was the removal of #include <memory.h> from stb_truetype.h

@ratchetfreak

Draw (a subsection of) the font bitmap directly when drawing characters.

@dpethes

dpethes commented Feb 13, 2018

Treat the texture as bitmap - if there's a white pixel, draw a pixel to your framebuffer, else do nothing. No need for alpha blending there.
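The two suggestions above could be combined roughly as follows. This is a sketch under assumptions, not code from the thread: `blitGlyph` and all its parameters are hypothetical, the glyph is assumed to be drawn 1:1 (no scaling), and the destination rectangle is assumed to lie fully on screen.

```cpp
#include <cassert>
#include <cstdint>

// Copies a glyph rectangle out of an 8-bit alpha atlas into an RGB565 framebuffer.
// Any texel at or above `threshold` is written as `colour`; everything else is
// skipped, so no per-pixel blending is needed.
void blitGlyph(const uint8_t* atlas, int atlasWidth,
               int srcX, int srcY, int glyphW, int glyphH,
               uint16_t* fb, int fbWidth, int dstX, int dstY,
               uint16_t colour, uint8_t threshold = 128)
{
    assert(atlas && fb && atlasWidth > 0 && fbWidth > 0);
    for (int y = 0; y < glyphH; ++y) {
        const uint8_t* src = atlas + (srcY + y) * atlasWidth + srcX;
        uint16_t* dst = fb + (dstY + y) * fbWidth + dstX;
        for (int x = 0; x < glyphW; ++x) {
            if (src[x] >= threshold) dst[x] = colour; // "white" texel: plot it
        }
    }
}
```

The threshold of 128 is an arbitrary choice for illustration; for a font with no anti-aliasing any non-zero cut-off should behave the same.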

@ocornut
Owner

ocornut commented Feb 13, 2018

Yes it is also worth noting the atlas texture can be output as Alpha8, so 1 byte per pixel without color information.

@LAK132
Author

LAK132 commented Feb 13, 2018

Interesting you should mention that. It is currently using Alpha8, but I had to immediately MemFree the pixel buffer it returns as it is way too big to fit in the ESP32's RAM. Luckily I managed to store it as a constant at the top of the ino so it is read from flash rather than RAM.
Might be able to get it into RAM if it was returned as a more space-efficient 2D array rather than one large 1D array.

@dpethes

dpethes commented Feb 13, 2018

If you're low on RAM, convert it to a 1-bit-per-pixel format; that should reduce it to 4kB at the cost of a few more bitops per texture access.
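A minimal sketch of that conversion, assuming an atlas whose width is a multiple of 8 and a simple alpha threshold (`packTo1bpp` and `texel1bpp` are hypothetical names, not part of Dear ImGui):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs an 8-bit alpha image into 1 bit per pixel, most significant bit first.
// A texel becomes 1 when its alpha is at least `threshold`.
std::vector<uint8_t> packTo1bpp(const uint8_t* alpha8, int width, int height,
                                uint8_t threshold = 128)
{
    assert(alpha8 && width > 0 && height > 0 && width % 8 == 0);
    const std::size_t count = static_cast<std::size_t>(width) * height;
    std::vector<uint8_t> bits(count / 8, 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (alpha8[i] >= threshold)
            bits[i >> 3] |= static_cast<uint8_t>(0x80u >> (i & 7));
    }
    return bits;
}

// Reads one texel back: these are the extra bitops paid on every texture access.
inline bool texel1bpp(const uint8_t* bits, int width, int x, int y)
{
    const std::size_t i = static_cast<std::size_t>(y) * width + x;
    return ((bits[i >> 3] >> (7 - (i & 7))) & 1u) != 0;
}
```

Packing divides the size by eight, so a 32768-byte Alpha8 atlas would come out at 4096 bytes.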

@LAK132
Author

LAK132 commented Feb 13, 2018

Reading flash is probably still faster than doing bitops, but I'd need to actually test that to be sure

ocornut added a commit that referenced this issue Feb 14, 2018
@ocornut
Owner

ocornut commented Feb 14, 2018

Yes in the case of the default font, even 1bpp would work.

By the way I just added two flags to ImFontAtlas which are helpful in that sort of situation:

enum ImFontAtlasFlags_
{
    ImFontAtlasFlags_NoPowerOfTwoHeight = 1 << 0,   // Don't round the height to next power of two
    ImFontAtlasFlags_NoMouseCursors     = 1 << 1    // Don't build software mouse cursors into the atlas
};
// Use with
io.Fonts->Flags |= ImFontAtlasFlags_NoPowerOfTwoHeight | ImFontAtlasFlags_NoMouseCursors;

ImFontAtlasFlags_NoPowerOfTwoHeight is probably usable with most backends, not sure how it may impact performance on modern GPUs?

With default flags: 32768 bytes
image

Without mouse cursors, without rounding height to next power-of-two: 13824 bytes
image

There's also a ProggyTiny font in misc/fonts you may use for that sort of screen.

@Pagghiu
Contributor

Pagghiu commented Feb 14, 2018

This thread is beautiful, I have some ESP32s here in the office gathering dust ;)

@dpethes

dpethes commented Feb 14, 2018

ImFontAtlasFlags_NoPowerOfTwoHeight is probably usable with most backends, not sure how it may impact performance on modern GPUs?

On one small texture, they won't even notice :) IIRC I used single npot screen-sized texture per frame on Radeon 9600 some 10 years ago (for a video player) and generally there was tiny perf difference (if any at all, and it was certainly faster than stretching the image to power of two dimensions before sending it to gpu).

@LAK132
Author

LAK132 commented Feb 15, 2018

First lot of optimizations more than halved the raster time (180ms -> 80ms). Roughly 11FPS excluding screen updates

80ms

I also added support for 8bit, 16bit, 24bit and 32bit textures. Might be able to speed the raster time up further if you only use one type, but potentially at the cost of space (which the ESP32 doesn't have much of)

@ocornut
Owner

ocornut commented Feb 15, 2018

It's a little curious how you are using SliderFloat to display times, instead of, say ImGui::Text("%f ms", time);
Which optimizations of the ones above have you applied?

@LAK132
Author

LAK132 commented Feb 15, 2018

Just removed rounding and added special cases for alpha blending (return if 0, don't blend if 255). Currently working on adding more
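The two alpha special cases mentioned here (return if 0, don't blend if 255) might look something like this for an RGB565 target. It is a sketch, not the ImDuino code: `blend565` is a hypothetical name and the general path uses a plain integer lerp per channel.

```cpp
#include <cassert>
#include <cstdint>

// Blends `src` over `dst` (both RGB565) with an 8-bit alpha.
// alpha == 0 and alpha == 255 are handled without any arithmetic.
inline uint16_t blend565(uint16_t dst, uint16_t src, uint8_t alpha)
{
    if (alpha == 0)   return dst; // fully transparent: keep the destination
    if (alpha == 255) return src; // fully opaque: overwrite, no blending

    // Split both colours into their 5/6/5-bit channels.
    const uint32_t sr = src >> 11, sg = (src >> 5) & 0x3F, sb = src & 0x1F;
    const uint32_t dr = dst >> 11, dg = (dst >> 5) & 0x3F, db = dst & 0x1F;

    // Integer lerp per channel: (s * a + d * (255 - a)) / 255.
    const uint32_t a = alpha, ia = 255u - alpha;
    const uint32_t r = (sr * a + dr * ia) / 255u;
    const uint32_t g = (sg * a + dg * ia) / 255u;
    const uint32_t b = (sb * a + db * ia) / 255u;

    return static_cast<uint16_t>((r << 11) | (g << 5) | b);
}
```

Whether the two early-outs actually pay off depends on the compiler and the target, so it is worth measuring on the device rather than assuming.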

@LAK132
Author

LAK132 commented Feb 15, 2018

Alright, I think this is about as good as I'm gonna get it

https://i.imgur.com/ML4A2ve.gifv

@ocornut
Owner

ocornut commented Feb 16, 2018

That needs to be at minimum 10 times faster to be usable, let's make it happen :)

You still have WindowRounding and borders visible in the video. The rounding will cause your window background to use large thin triangles instead of one rectangle. You'll probably double your speed for that given code just by disabling WindowRounding. Have you got anti-aliasing enabled? Between them, rounding and borders with AA cost you double the number of vertices in that shot.

I'm not sure I understand why you have those 8/16/24/32 paths, especially for textures as you know your texture is 1bpp or 8bpp?

You detect rectangle by comparing vertex contents whereas you could compare indices.
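A sketch of what comparing indices could look like. It assumes that quads arrive as six indices in the pattern (i, i+1, i+2), (i, i+2, i+3), which is the layout I believe Dear ImGui's rectangle primitives emit; that should be verified against the imgui version in use. `DrawIdx` and `isQuadByIndices` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

using DrawIdx = uint16_t; // stand-in for ImDrawIdx, which is 16-bit by default

// Returns true when six consecutive indices form the two-triangle pattern
// (i, i+1, i+2) (i, i+2, i+3), i.e. a quad built from four consecutive vertices.
// This only proves the quad layout; the caller still has to confirm that the
// four vertices are axis-aligned before treating it as a rectangle.
inline bool isQuadByIndices(const DrawIdx* idx)
{
    assert(idx != nullptr);
    const DrawIdx i = idx[0];
    return idx[1] == static_cast<DrawIdx>(i + 1) &&
           idx[2] == static_cast<DrawIdx>(i + 2) &&
           idx[3] == i &&
           idx[4] == static_cast<DrawIdx>(i + 2) &&
           idx[5] == static_cast<DrawIdx>(i + 3);
}
```

The appeal is that the check touches six small integers rather than two triangles' worth of vertex data, and the vertex comparison only runs for candidates that pass it.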

The triangle rasterization could be done much faster, maybe look up state-of-the-art triangle rasterization.

Not sure why you do all that extraction of colors when it's not necessary in the cases where we don't blend?

And you can switch to ProggyTiny (10 px) instead of ProggyClean (13 px) for that sort of screen.

I think I'm going to run a little bounty challenge for that tonight! It would be useful to have a good specialized software rasterizer available for imgui. Someone specialized in that sort of thing (not me) could probably get us 100 times faster. I guess using a lot of floating point on the ESP32 isn't exactly desirable?
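On the rasterization and floating-point points, one widely used integer-only approach is the edge-function (half-space) test over the triangle's bounding box. The sketch below is an illustration of that technique, not the rasterizer from this thread: `Pt`, `edge` and `fillTriangle` are hypothetical names, only a flat colour is handled, and no fill rule is applied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

struct Pt { int x, y; };

// Twice the signed area of triangle (a, b, c); the sign tells which side of
// edge a->b the point c lies on.
inline int64_t edge(Pt a, Pt b, Pt c)
{
    return int64_t(b.x - a.x) * (c.y - a.y) - int64_t(b.y - a.y) * (c.x - a.x);
}

// Fills a flat-coloured triangle into an RGB565 framebuffer using only integer
// math and returns the number of pixels written. Pixels exactly on an edge are
// included, so an edge shared by two triangles is drawn by both (harmless for
// opaque fills, wrong for blended ones).
int fillTriangle(uint16_t* fb, int fbW, int fbH, Pt v0, Pt v1, Pt v2, uint16_t colour)
{
    assert(fb && fbW > 0 && fbH > 0);
    const int64_t area = edge(v0, v1, v2);
    if (area == 0) return 0;           // degenerate triangle
    if (area < 0) std::swap(v1, v2);   // normalise the winding order

    // Bounding box of the triangle, clipped to the framebuffer.
    const int minX = std::max(0, std::min({v0.x, v1.x, v2.x}));
    const int maxX = std::min(fbW - 1, std::max({v0.x, v1.x, v2.x}));
    const int minY = std::max(0, std::min({v0.y, v1.y, v2.y}));
    const int maxY = std::min(fbH - 1, std::max({v0.y, v1.y, v2.y}));

    int written = 0;
    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            const Pt p{x, y};
            // Inside when the point is on the same side of all three edges.
            if (edge(v0, v1, p) >= 0 && edge(v1, v2, p) >= 0 && edge(v2, v0, p) >= 0) {
                fb[y * fbW + x] = colour;
                ++written;
            }
        }
    }
    return written;
}
```

The edge values are linear in x and y, so a faster version would update them with one addition per pixel instead of recomputing them; the sketch recomputes for clarity.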

EDIT Also added a link to my comment in the gallery thread: #1269 (comment) for people stumbling here.

@LAK132
Author

LAK132 commented Feb 16, 2018

Alright, that points me in the right direction for more optimisations at least. Currently the font atlas is 8bit, the screen is 16bit and ImGui seems to work in 24/32, and I didn't see any performance impact by having them all supported by texture_t. I also found that checking for the cases where it doesn't blend was actually slower than just blending. Might have something to do with the compiler's optimisations?
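For the 32-bit to 16-bit step specifically, the conversion itself is a handful of shifts. This sketch assumes the vertex colour is packed as 0xAABBGGRR (red in the low byte), which is Dear ImGui's default layout unless IMGUI_USE_BGRA_PACKED_COLOR is defined; `toRGB565` is a hypothetical name, and the display may additionally expect the two bytes swapped.

```cpp
#include <cassert>
#include <cstdint>

// Converts a 32-bit colour packed as 0xAABBGGRR to RGB565 by keeping the top
// 5/6/5 bits of each channel. The alpha byte is ignored here.
inline uint16_t toRGB565(uint32_t abgr)
{
    const uint32_t r = abgr & 0xFF;
    const uint32_t g = (abgr >> 8) & 0xFF;
    const uint32_t b = (abgr >> 16) & 0xFF;
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Doing this once per vertex or per flat-coloured primitive, rather than once per pixel, keeps the cost of supporting both formats small.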

@ocornut
Owner

ocornut commented Feb 16, 2018

As pointed out by Per on Twitter (I had dumbly overlooked the actual numbers), the raster cost is only a fifth of the total cost, so while we can ultimately drive that down, it should probably be tackled along with the final blitting, which is currently the slowest part.

Where is the drawBitmap() function you are calling in UpdateScreen?
https://github.com/LAK132/ImDuino/blob/master/ImDuino.ino#L30

If I look here there's no copy of drawBitmap() that matches your exact prototype
https://github.com/Nkawu/TFT_22_ILI9225/blob/master/src/TFT_22_ILI9225.cpp

The good news is that this TFT_22_ILI9225 code seemingly has immense room for optimization.

@LAK132
Author

LAK132 commented Feb 16, 2018

Check the PRs on that repo, my version is several times faster (4s vs 250ms)

EDIT https://github.com/LAK132/TFT_22_ILI9225/blob/922c08093fc05e3868b94be2dce03a16d6d564ea/src/TFT_22_ILI9225.cpp#L1013

@ocornut
Owner

ocornut commented Feb 16, 2018

Thanks!
I'll post it here Nkawu/TFT_22_ILI9225#23
I asked on Twitter for people to help solve it (with a bounty).

OK, so you've already done a good job optimizing that part from the original version, which leaves us with less obvious options.

@wizzard0

wizzard0 commented Feb 16, 2018

i had built a softrender for imgui too, but have no idea whether it will be faster on ESP32 (also, not interested in bounty, attribution is more than enough) - https://github.com/AlgoTradingHub/imgui_rt

@LAK132
Author

LAK132 commented Feb 16, 2018

@wizzard0 I'll see if I can get it running on my ESP tomorrow afternoon

@bkaradzic
Contributor

Here is an idea: instead of optimizing the rasterizer, ImGui should support terminal rendering (prototype is here: https://github.com/jonvaldes/tear_imgui, video https://www.youtube.com/watch?v=OEGb4HrMkDo). This way you don't have to optimize generalized polygon rendering; rather, you focus only on terminal text rendering.

@LAK132
Author

LAK132 commented Feb 18, 2018

Haven't started optimising yet, but I did add a screen clip. Worst case it's 4 if/pixel slower, best case it doesn't draw to the screen at all. Current test case is 2x faster:
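The comment describes a per-pixel test (four comparisons per pixel). A different way to get the same effect, shown here purely as a sketch, is to clip each primitive's bounding rectangle against the screen once, before the pixel loop. `Rect` and `clipToScreen` are hypothetical names, and the rectangle is half-open (x1 and y1 are exclusive).

```cpp
#include <algorithm>
#include <cassert>

struct Rect { int x0, y0, x1, y1; }; // half-open: [x0, x1) x [y0, y1)

// Intersects `r` with the screen in place. Returns false when nothing of the
// rectangle is visible, in which case the caller can skip drawing entirely.
inline bool clipToScreen(Rect& r, int screenW, int screenH)
{
    assert(screenW > 0 && screenH > 0);
    r.x0 = std::max(r.x0, 0);
    r.y0 = std::max(r.y0, 0);
    r.x1 = std::min(r.x1, screenW);
    r.y1 = std::min(r.y1, screenH);
    return r.x0 < r.x1 && r.y0 < r.y1;
}
```

With the rectangle clipped up front, the inner loop needs no bounds checks at all, and fully off-screen primitives cost only the four min/max operations.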

120ms

This version requires the testing branch of my fork of the TFT library https://github.com/LAK132/TFT_22_ILI9225/tree/testing

@LAK132
Author

LAK132 commented Feb 25, 2018

None of the triangle render functions are working yet, but the new rectangle functions seem to be a heap faster (down to 2~3ms)

a

EDIT: Current version no longer crashes on renderTri but it still isn't drawing correctly. Raster time is now at 13ms, a little over 10x faster than the first version

b

@LAK132
Author

LAK132 commented Feb 25, 2018

The rewrite has been successful (as far as I can tell), it's well over 10x faster with WindowRounding disabled

With WindowRounding:
with rounding

Without WindowRounding:
without rounding

@emilk

emilk commented Apr 8, 2018

I made a software rasterizer for Dear ImGui which is NOT made for Arduino (it relies heavily on floating point math), but it could maybe be a useful reference: https://github.com/emilk/imgui_software_renderer/blob/master/src/imgui_sw.cpp

@LAK132
Author

LAK132 commented May 1, 2019

I'm close to breaking that 10x faster threshold with some more modifications to the TFT library

My version now looks like this

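// Sends one 16-bit value, using the HSPI macro when it is available.
void TFT_22_ILI9225::_spiWrite16(uint16_t s)
{
#ifdef HSPI_WRITE16
    if (_clk < 0) { // presumably the hardware SPI configuration
        HSPI_WRITE16(s);
        return;
    }
#endif
    // Fallback: high byte first, then low byte
    _spiWrite((uint8_t)(s >> 8));
    _spiWrite((uint8_t)s);
}

// Draws a w x h block of 16-bit pixels with its top-left corner at (x1, y1).
void TFT_22_ILI9225::drawBitmap(uint16_t x1, uint16_t y1,
                                const uint16_t* bitmap, int16_t w, int16_t h)
{
    _setWindow(x1, y1, x1 + w - 1, y1 + h - 1, L2R_TopDown);
    startWrite();
    SPI_DC_HIGH();
    SPI_CS_LOW();
#ifdef HSPI_WRITE_PIXELS
    if (_clk < 0) {
        // Push the whole buffer in one bulk transfer
        HSPI_WRITE_PIXELS(bitmap, w * h * sizeof(uint16_t));
    } else
#endif
    // Otherwise send the pixels one at a time; a 32-bit counter avoids
    // wrap-around if w * h ever exceeds 65535
    for (uint32_t i = 0; i < (uint32_t)w * h; ++i) {
        _spiWrite16(bitmap[i]);
    }
    SPI_CS_HIGH();
    endWrite();
}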

This is with the hardware SPI clocked at 20MHz. The ESP32 can handle 40MHz (and even 80MHz iirc), but the cables I'm using aren't good enough for that kind of speed.
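To put the clock speeds in perspective, the raw transfer time for one full frame can be estimated from the pixel count. The sketch below assumes a 176x220 panel (the usual ILI9225 resolution, not stated in the thread) with 16 bits per pixel, and ignores command overhead and gaps between transfers, so it is a lower bound only. `frameTransferMs` is a hypothetical name.

```cpp
#include <cassert>

// Lower bound, in milliseconds, on the time needed to clock one full frame out
// over SPI: total bits divided by the SPI clock rate.
constexpr double frameTransferMs(int width, int height, int bitsPerPixel, double spiHz)
{
    return static_cast<double>(width) * height * bitsPerPixel / spiHz * 1000.0;
}

// 176 * 220 * 16 = 619520 bits per frame.
constexpr double kFrameMsAt20MHz = frameTransferMs(176, 220, 16, 20e6);
constexpr double kFrameMsAt40MHz = frameTransferMs(176, 220, 16, 40e6);
```

Under those assumptions a full frame takes roughly 31 ms at 20MHz and about half that at 40MHz, so the SPI clock sets a hard floor on full-screen refresh time regardless of how fast the rasterizer gets.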

https://www.youtube.com/watch?v=EiPNv7j-pTE

@LAK132
Author

LAK132 commented May 21, 2019

And there we have it, software rasteriser running the test code in under 10ms!

Rasteriser is roughly 20x faster than in the original post, full loop roughly 10x faster!

I have also moved some stuff around, softraster is now in the misc folder and there is an example impl for it:
https://github.com/LAK132/ImDuino/blob/master/ImDuino.ino
https://github.com/LAK132/ImDuino/blob/master/misc/softraster/softraster.h
https://github.com/LAK132/ImDuino/blob/master/examples/imgui_impl_softraster.h

https://www.youtube.com/watch?v=_yaSyCU3hZI

@TroyNeubauer

Great work! I’m excited to use this in a future project.

@LAK132
Author

LAK132 commented May 23, 2019

It looks like there's still a few more kinks to work out, mainly texture mapping and alpha blending, but performance looks rock solid even on PC!

@ocornut
Owner

ocornut commented May 23, 2019

Nice! Would you mind updating the wiki (root page and/or back-end page) with any useful applicable link? Thank you!

@JarrettR

JarrettR commented Jul 3, 2021

Any thoughts on turning this into an ESP-IDF component?

I hacked on it a little bit - Got something compiling, but crashing, and ran out of available time for the moment to really get into it.

@LAK132
Author

LAK132 commented Jul 4, 2021

there's an IDF version (with Dual Shock 3 support!) here https://github.com/LAK132/IM-ESP32-PS3

@LAK132
Author

LAK132 commented Jul 21, 2021

Is there any interest in having the software rasteriser fork merged into this repo?

@ocornut
Owner

ocornut commented Jul 26, 2021

Is there any interest in having the software rasteriser fork merged into this repo?

I don't have enough info to make that judgment right now (there are 2-3 rasterizers and I don't know the pros and cons of each).
What I think would be good, however, would be to move it to a more dedicated and clearly named repository, as the IM-ESP32-PS3 / ImDuino names may hinder discoverability.

@LAK132
Author

LAK132 commented Jul 31, 2021

The actual repository is https://github.com/LAK132/ImSoft, https://github.com/LAK132/IM-ESP32-PS3 and https://github.com/LAK132/ImDuino are just example projects

@ocornut
Owner

ocornut commented Jul 31, 2021

My bad, thanks!

@ocornut
Owner

ocornut commented May 17, 2024

Closing this as it doesn't seem to need to be open anymore! Thanks for sharing that fun project!

@ocornut ocornut closed this as completed May 17, 2024