Capture performance discussion from issue 918 #971

marcmerlin · 2020-01-24T20:32:04Z

Add performance tips and reasonable limits to expect.

marcmerlin · 2020-03-22T15:20:22Z

@hzeller do you feel this would be useful to include, so that people have a better idea of how far they can push all this now that the panels that let us hit those limits are readily available?
You can see the rendered version here if needed: https://github.com/marcmerlin/rpi-rgb-led-matrix/blob/patch-3/README.md

hzeller · 2020-03-22T16:19:33Z

The discussion here including the rules of thumb is very specific to your anecdotal experience with 64 pixel high displays (what is an expected maximum refresh etc). However, the range of panels and refresh rates etc. is vastly different and it is hard to suggest what a good expectation should be.

So keep it a bit more generic: don't describe what to expect, just include tips that describe which parameters can influence the refresh rate and what should be considered.

One wants to be about 80-100Hz. Ideally 180Hz and more.
Ways to achieve that: pwm bits, dither-bits. What are the disadvantages of these (pwm-bits: fewer colors; dither-bits: less brightness)
We are already at the limit of what the panels can do speed-wise, but we're still at the mercy of a non-realtime operating system, so an FPGA solution might not necessarily be faster, but it creates fewer flicker glitches.
Some panels can go fast enough, some need --led-slowdown-gpio.
other things that might be useful ...

A lot of these things are already described in the section with the relevant options, so your paragraph should emphasize a good systematic approach to working on the refresh rate that goes beyond the flag descriptions (switch on --led-show-refresh, and which options to twiddle; include --led-limit-refresh to counter flicker etc.)
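A systematic sweep of the options mentioned above could be sketched as follows. This is only an illustration: the helper name `refresh_experiments`, the chosen value ranges, and the base command (the `demo` example binary with a 64x64 panel) are assumptions, and the script only prints command lines rather than running them.

```python
from itertools import product


def refresh_experiments(base_cmd, pwm_bits=(11, 9, 7), dither_bits=(0, 1), slowdown=(1, 2)):
    """Yield one argv list per combination of refresh-related options to try."""
    for bits, dither, slow in product(pwm_bits, dither_bits, slowdown):
        yield list(base_cmd) + [
            "--led-show-refresh",               # print the measured refresh rate
            f"--led-pwm-bits={bits}",           # fewer bits: faster, fewer colors
            f"--led-pwm-dither-bits={dither}",  # more dither: faster, less brightness
            f"--led-slowdown-gpio={slow}",      # raise only if the panel glitches
        ]


# Hypothetical base command; adjust the binary and panel geometry to your setup.
base = ["sudo", "./demo", "-D0", "--led-rows=64", "--led-cols=64"]
commands = list(refresh_experiments(base))
for argv in commands:
    print(" ".join(argv))
```

Run each printed command on the Pi, note the rate reported by --led-show-refresh, and once the lowest rate seen under load is known, consider pinning the refresh slightly below it with --led-limit-refresh to counter flicker.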

marcmerlin · 2020-03-22T16:47:27Z

thanks for the review. Will rework this accordingly.

marcmerlin · 2020-03-23T19:36:32Z

@hzeller I had another shot.
Hopefully this is closer, but if it is still not quite what you wanted, do you mind merging and touching it up accordingly? (It will probably be quicker than asking me to do it via this issue and going back and forth.)
Don't worry, I won't take offense if you hack it up or remove stuff :)

marcmerlin · 2020-03-29T00:04:20Z

@hzeller gentle ping :)
If it's not exactly what you wanted, are you ok merging and putting your diff/change on top?

marcmerlin · 2020-04-17T02:26:22Z

Hi @hzeller, any time to review this?

hzeller · 2020-04-17T03:15:18Z

Sorry, I am currently involved in a ton of Covid-19 response projects; will get back to this once that load lightens.

marcmerlin · 2020-05-09T01:24:20Z

@hzeller Just found this abandoned tab in one of my browsers :) not sure if you're still working on response projects?

hzeller · 2020-05-09T01:26:24Z

I do ( part of https://www.covidshieldnexus.org/ ), but now the 3D printers are mostly humming away on their own.
Let me have a look this weekend.

marcmerlin · 2020-05-24T22:11:31Z

@hzeller Maybe now is a better time? :)

marcmerlin · 2020-07-18T22:15:15Z

Hi @hzeller how goes? :)

hzeller · 2020-07-19T01:48:58Z

Cool, thanks!
Sorry for the delay.

arahasya · 2020-07-22T18:02:16Z

Hi,
I am not sure how you have concluded that --led-multiplexing = 0 is the slowest

I am currently in process of manufacturing RGB led boards

I have made 16x48 matrices both with straight scan and interleaving at 48
i.e both --led-multiplexing = 0 , --led-multiplexing = 1
Case 1 requires 2 RGB data inputs, Case 2 requires 1 RGB data input
But there is additional track for interleaving at 48 or 32 in case of 16x32
So the tradeoff is double the data on 1 Data line or data on 2 Data lines
2 Data lines seems faster to me

arahasya · 2020-07-22T18:21:43Z

The refresh rate is dependent on scan rows and not on type of multiplexing

hzeller · 2020-08-07T16:32:50Z

The speed is set with the gpio slowdown, and with a current Pi4, we need to slow things down a lot to work with the panels and stay in the 20MHz range.

The color output is of course not done with PWM (which would be strongly limited by the clock speed) but with binary code modulation at 11-bit resolution. The clock speed of the panels still affects the few lower bits as a limiting factor. The 11-bit linear resolution is needed to get enough detail to satisfactorily convert an 8-bit value with an exponential luminance curve. So from a marketing perspective, we could say 33-bit color resolution ... but of course that translates to regular 24-bit color.

There are so many parameters that influence each other (e.g. the length of the chain determines how much we're waiting for the clocking vs. the Output Enable; as we clock things in in parallel, we're only limited by the clocking in the lower bits) that I usually suggest doing a dry run on a Pi with --led-show-refresh if someone wants to determine the refresh rate.

hzeller · 2020-08-07T17:20:15Z

Yes, the 710fps figure is right if you don't take OE time into account. But OE is the dominant part of the time spent for the higher bits.
So in practice you'd get maybe 400Hz if you choose low 50ns LSB OE and 11 bit.
If you choose to dither the lower bits you might get 600-800 fps with that setting.

Clock rate needs to be tweaked with the panels at hand, some are faster, some are slower; also how short the cables are etc.

Raw color palette is (2^11)^3.

hzeller · 2020-08-07T22:55:16Z

No, the --led-pwm-lsb-nanoseconds just influences the on-time for the LSB.

The 11 bits are bitplanes. Each bit plane is shifted out and left on with output enable a different time: (lsb-nanoseconds << plane#). So the lowest bit-plane stays on for, say, 100ns, the next for 200ns, then 400ns and so on...
That way, once we have done this for all bits, the sum of on-time represents the value 0..2047.
This is commonly referred to as binary code modulation (I've explained that in #1062 already).
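The bit-plane timing described above can be illustrated with a small sketch. The function names are made up for this example, and 100 ns is just the sample LSB on-time used in the explanation.

```python
def bcm_on_times(lsb_ns, planes=11):
    """On-time of each bit plane in nanoseconds: lsb_ns << plane number."""
    return [lsb_ns << plane for plane in range(planes)]


def on_time_for_value(value, lsb_ns, planes=11):
    """Total on-time for one pixel value: sum of the planes whose bit is set."""
    return sum(lsb_ns << plane for plane in range(planes) if (value >> plane) & 1)


times = bcm_on_times(100)
print(times[:3])                      # [100, 200, 400]
print(sum(times) // 100)              # 2047, the largest representable value
print(on_time_for_value(5, 100))      # 500: planes 0 and 2 are lit
```

Because the plane on-times are powers of two, the total on-time is simply the value multiplied by the LSB on-time, which is what makes the scheme equivalent to an 11-bit linear brightness.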

hzeller · 2020-08-08T00:28:11Z

The lowest bits indeed take longer to clock than the on-time which follows, so there is a bit of dark-time for these bits. However for the higher values, the next row can be clocked in completely while the on-time is still ongoing.
In the lower bits, clocking in the data dominates the time; in the higher bits, OE dominates. So for a refined calculation, you have to determine for each bit plane max(PWM_LSB_NS * 2^bit, serial_clock_cycle * columns) to find out how long each bit takes. Sum that up for all bits, then multiply by the multiplex ratio.
Not sure where you see that the rows are scanned first; each row goes through the full BCM cycle before the next row is tackled (scanning rows first would also not work, as LED matrices have quite some transition time between rows, which would be visible as significant flicker).
The realtime requirements are not that strict, as the OE timing is actually done by hardware (I am using a circuit in the BCM SoC that directly outputs to a pin), so it is independent of anything Linux can mess with. With increased pressure close to 100% CPU, though, this can still be interrupted visibly (which is why you use limit refresh then).
With naive PWM you would not even get near any useful refresh rate.
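The refined calculation described in this comment (per bit plane, the larger of on-time and clock-in time, summed over all planes and multiplied by the multiplex ratio) can be written as a short estimate. This is a sketch under stated assumptions: the function name and the example numbers (130 ns LSB, 50 ns clock cycle, 128 columns, 1:32 multiplexing) are hypothetical, and it ignores row-switch time and any scheduling overhead, so real measurements with --led-show-refresh will differ.

```python
def estimate_refresh_hz(lsb_ns, clock_ns, columns, multiplex, planes=11):
    """Rough refresh estimate from per-plane max(on-time, clock-in time)."""
    shift_ns = clock_ns * columns  # time to clock one row of data in
    row_ns = sum(max(lsb_ns << bit, shift_ns) for bit in range(planes))
    frame_ns = row_ns * multiplex  # every multiplexed row runs a full BCM cycle
    return 1e9 / frame_ns


# Hypothetical setup: two chained 64-wide panels with 1:32 scan, 20 MHz clock.
hz = estimate_refresh_hz(lsb_ns=130, clock_ns=50, columns=128, multiplex=32)
print(f"{hz:.1f} Hz")
```

With these example numbers the six lowest planes are limited by clocking and the five highest by OE time, giving an estimate of roughly 105 Hz.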

greatballoflazers · 2021-10-29T02:24:14Z

Adding to this conversation, it should be noted that the MSB is always sent along with its time multiple, which is currently 1024 times lsb-nanoseconds. If you only enable 3 bits of PWM, the on time is 1024 + 512 + 256 times lsb-nanoseconds. The off time, however, is the amount of time it takes to shift out three times.

For long chains this will be longer and thus decrease the display brightness, which will lower the average power. However, the peak power will still be unchanged.

Your refresh may be constrained by on time or shift time. The Linux scheduler does come into play, but less so on multicore with core reservation. However, there are other cases which can impact performance, such as memory operations and other processes using GPIO. These are arbitrated and can cause massive refresh drops randomly if not factored in.

To factor these in, you simply load the system in the worst-case configuration, then edit the settings to reach the desired minimum. The refresh rate will spike when unloaded and drop when loaded. Note that perfection is not likely possible here. There are high-priority events which cannot be blocked or filtered out.

GPIO speed has a significant impact on shift time. This should be kept around 15MHz, maybe up to 21MHz. The achievable speed depends on the version of Pi used. Higher refresh is possible with FPGAs using something like S-PWM without losing as much quality here. However, I put up a pull request that attempts to add it here. Note that it is not recommended in all cases.

BCM is used here to reduce the amount of memory and processing cycles required to divide the LED current to give color shades. CIE1931 is used to map color shades into current division steps. To get 256 color shades per color you need just over 11 bits of information. However, in some panels this may not be completely possible. You can still get decent color depth with fewer bits.
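The "just over 11 bits" figure can be checked with a sketch of the standard CIE 1931 lightness-to-luminance formula. This is not the library's code: the function names are made up, and the constants are the textbook ones, which may differ slightly from what the library uses.

```python
def cie1931_level(value, bits=11):
    """Map an 8-bit input to a linear output level with `bits` of resolution."""
    lightness = value * 100 / 255  # treat the input as CIE L* in 0..100
    if lightness <= 8:
        luminance = lightness / 903.3              # linear segment near black
    else:
        luminance = ((lightness + 16) / 116) ** 3  # cubic segment
    return round(luminance * ((1 << bits) - 1))


def distinct_levels(bits):
    """How many of the 256 input shades map to different output levels."""
    return len({cie1931_level(v, bits) for v in range(256)})


for bits in (8, 11, 12):
    print(bits, distinct_levels(bits))
```

Under these assumptions, 12 output bits keep all 256 shades distinct, while at 11 bits a few of the darkest shades collapse onto the same level, and with fewer bits more of them do.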

For long chains there is not enough time for the lower bit planes to really appear without collapsing the refresh. Therefore their brightness is reduced. At a certain point you might as well not send them. You can configure the library to send them but they are just overhead.
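How many of the lower bit planes are affected by chain length can be estimated with a small sketch. The model is an assumption: each plane occupies max(on-time, shift time), so a plane is "shift-bound" when clocking the row in takes longer than its on-time. The function names and the example numbers (130 ns LSB, 50 ns clock cycle) are hypothetical.

```python
def plane_duty(lsb_ns, clock_ns, columns, planes=11):
    """Fraction of each plane's time slot during which the LEDs are lit."""
    shift_ns = clock_ns * columns  # time to clock one row of data in
    return [min(1.0, (lsb_ns << plane) / shift_ns) for plane in range(planes)]


def shift_bound_planes(lsb_ns, clock_ns, columns, planes=11):
    """Number of planes whose slot is dominated by shifting, not on-time."""
    return sum(1 for duty in plane_duty(lsb_ns, clock_ns, columns, planes) if duty < 1.0)


for columns in (64, 512):
    print(columns, shift_bound_planes(130, 50, columns))
```

In this model a single 64-column panel has 5 shift-bound planes while a 512-column chain has 8, which illustrates why the lowest planes become mostly overhead on long chains.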

Fancy LED drivers with built in PWM do exist. However these are more expensive and are not supported. These require continuous IO updates which as mentioned with BCM can be problematic. It is technically supportable, however these are completely different. You get large amounts of bits for gamma correction, but the color depth may not improve significantly despite the increase in overall shift efficiency.

In BCM, the timer created by PinPulser serves another very important function. It prevents memory bus arbitration. Concurrency is achieved by shifting while the previous row is being shown. This allows for a larger window of time per bit to shift a bit plane. This reduces the amount of memory operations required. Therefore, if any instability exists, it will not cause a significant problem.

Hardware PinPulser is recommended because it shuts off the display instantly by driving the OE signal directly from the PWM timer, enabling much smoother operation and color consistency. This prevents a lower bit plane from outshining a higher bit plane. The software version is only for compatibility with different pin mappers.

The Linux scheduler is slow and cannot be trusted to multiplex the display. By default the Pi uses a 250Hz tick rate, which is meant for blocking operation on a single core. That is not enough here, although it is possible on some RTOS systems. Kernel threads will not likely fix this. Moving to RT threads will likely cause compatibility issues.

Using a background thread in the library is done to avoid certain things which may be viewed as hackish, most notably memory bus overhead, which can promote refresh turbulence. Any IO processing, including ping, is apparently capable of causing issues.

FPGA is capable of using PWM instead of BCM due to processing efficiency. It is also capable of doing S-PWM. However FPGA is a little harder to manage complexity with, depending on skill set. FPGA is also more expensive and would likely need its own memory interface. Further increasing the cost and complexity.

DMA is slow for a few reasons. DMA is not likely pipelined or optimized. It generally requires something like 5 cycles per single shot. Burst transfers are better but unsupported by GPIO. Memory locality does not exist for DMA. The main memory device is slow burst DDR, not SRAM. CPU and DMA are clocked from different clock domains. Memory and GPIO likely occupy two different buses in terms of clock and size. These are just guesses, but the L1 cache speeds up the CPU quite a bit.

Basically BCM is a time hack. This implementation uses lsb-nanoseconds as the time window. This implementation starts from the most significant bit and works down. This is required by CIE1931, which can be disabled via the direct mapping option. For higher quality you may need to adjust this. For better performance on longer chains you may need to adjust this. However, overall, don't.

Any support for additional LED drivers will likely be constrained not by the serial clock but by the memory operations, which, it would appear, are protected by worst-case arbitration to be more than enough, depending on Pi version. PWM on OE is likely possible for GCLK. However, there is more to it than that, and this library would need to be completely refactored for that to even be possible.

greatballoflazers · 2021-10-29T02:53:42Z

Note aggressive usage of set pixel may cause memory arbitration. However this is decoupled from the background thread when properly configured. Timing instability is protected via time windows. Again when properly configured.

Note CIE1931 is used which causes massive drops in average power for brightness reductions. Peak power is still not changed, but that may not matter in certain cases.

Capture performance discussion from hzeller#918

5703107

Add performance tips and reasonable limits to expect.

This was referenced Jan 24, 2020

Getting high refresh rates on a 192x128 matrix #918

Open

Maximum Refresh Rate #969

Open

marcmerlin mentioned this pull request Feb 27, 2020

DietPi Buster + Raspberry Pi 4 works well! #991

Open

updated with more current links/text

b1b7bd2

"Performance improvements and limits" updates

8085a09

hzeller merged commit d14160b into hzeller:master Jul 19, 2020

marcmerlin deleted the patch-3 branch July 19, 2020 01:52
