Rendering performance of chafa is very slow #410

miniksa · 2019-04-11T18:17:08Z

sudo apt install chafa
curl https://media.giphy.com/media/12UwsVgQCYL3H2/giphy.gif --output winanim.gif
chafa winanim.gif --font-ratio 1/3

Edit: On Ubuntu 18.04, follow directions here to get chafa sources and build 'em:
https://hpjansson.org/chafa/

Edit2: after doing a ./configure and sudo apt install loop as you realize stuff is missing, you'll get all the way through and it will whine about not being able to find the lib. Do ldconfig and it will shut up.

Edit3:

ubuntu
cd ~
mkdir chafa
cd chafa
curl https://hpjansson.org/chafa/releases/chafa-1.4.0.tar.xz --output chafa.tar.xz
tar xf chafa.tar.xz chafa-1.4.0/
cd chafa-1.4.0/
sudo apt install gcc pkg-config libgtk2.0-dev libmagickwand-dev
sudo ldconfig
./configure
make
sudo make install

The text was updated successfully, but these errors were encountered:

miniksa · 2019-04-11T18:35:56Z

WPR analysis:

11520ms in conhost.exe
3455ms in chafa

For conhost.exe...

5990ms on I/O thread (the likely culprit as it going slow will stop the other side from giving us data)
5183ms on Render thread (which might still be involved as it can lock the buffer when looking things up preventing I/O from writing)

On the I/O thread, hot areas include:

2007ms spent notifying accessibility eventing

Most of this (1405ms) spent inside user32.dll!NotifyWinEvent
This call causes a syscall/kernel transition which is SLOW
The best thing to do here is probably detect that no one needs the event and not transmit it OR
Transmit it less often by coalescing the accessibility events into frames much like the renderer

1196ms spent adjusting the cursor position

This is mostly attributable (823ms) to figuring out whether the cursor is sitting on top of a 2-column-wide character (so it can move 2 spaces right instead of 1). It looks like we're doing this the wrong way and wasting time here since _lookupIsWide is called below for another purpose and retrieving the same information and that is also taking a lot of time.
236ms also spent in kernelbase.dll!SetEvent to trigger the render thread (probably can't avoid, need kernel object to notify a potentially sleeping thread...)

877ms spent run-length-encoding colors

634ms spent in vector reallocation for holding the run-length-encoded colors. This could maybe be re-strategized to leave a bit of excess memory usage around in exchange for not reallocing so hard.

608ms spent looking up the narrow/wideness of characters (_lookupIsWide) during insertion

This is actually backending on gdi32full.dll!GetCharABCWidthsW versus the current font to figure out how wide it is going to be when we don't otherwise know because it's an ambiguous width character.

On the render thread, hot areas include:

4386ms in gdi32full.dll!PolyTextOutW

I don't think there's anything to be done here. I think this is just the consequence of trying to emit a ton of text really fast

I didn't do a wait chain analysis yet to see if the locking/threading was slowing things down because at this point, we have a few areas with obvious routes to improvement that might alleviate the whole deal:

Therefore, my conclusion is:

Cursor movement shouldn't be looking up the column count by the character, it should use the already known cell width value
The accessibility events need to be coalesced
We should be caching the queries to GDI for ambiguous character widths until the font changes
Investigate reducing vector reallocs in color run length encoding manipulation

And I have now filed MSFT:21167256 to do these things at some point and hopefully we'll have fixed the performance issue.

oising · 2019-04-11T21:18:15Z

Thanks Michael, that's a nice analysis. So, this is probably a moot question, but I'm assuming PolyTextOutW is called once per render frame and only emits text for invalidated regions?

miniksa · 2019-04-11T21:18:55Z

Correct.

oising · 2019-04-11T21:19:34Z

Okay, excuse the debugging by proxy ;) I'll get back to bundling up a PR for you... :)

miniksa · 2019-04-11T21:43:26Z

No problem. It was a fair question to ask. We make dumb mistakes all the time.

miniksa · 2019-04-17T16:46:39Z

@oising, somehow this morning when I'm looking at this, it's not as slow on my machine as it was in one of your videos. I implemented the 3rd thing above (GDI size measurement caching) quickly as I thought it was the best cost-benefit ratio and it improved things by 20-30%, but I want to make sure that I'm actually fixing your problem.

Can you possibly send me a WPR trace of your specific repro? (First Level Triage + CPU Usage Profiles? Let me know if you need help on how to do this.)

oising · 2019-04-17T18:50:24Z

@miniksa Sure, I've got the WPR trace now. Where shall I send it?

miniksa · 2019-04-17T18:58:45Z

Email me the attachment or a link to a share at Microsoft.com. My GitHub alias is unoriginal and is my work address too. Just don't sign me up for spam please.

oising · 2019-04-17T20:28:53Z

I'm sorry, I don't understand -- your github alias is unoriginal? I don't know what you mean.

miniksa · 2019-04-17T20:30:37Z

My e-mail is my Github alias @microsoft.com. Sorry for being obtuse, I'm trying to avoid spam bots picking it up if I write the real mailto:

miniksa · 2019-04-19T20:50:04Z

The major time spent on the WinEvent turns out to be only if Node.js is running on your system.

If Node.js is running, it registers for the WinEvent notifications for EVENT_CONSOLE_LAYOUT to know when the window size has changed. Given WinEvents require kernel work to broadcast and tend to be registered globally, this causes a system-wide slowdown of all of your consoles when it is listening here.

https://github.com/nodejs/node/blob/0109e121d3a2f87c4bad75ac05436b56c9fd3407/deps/uv/src/win/tty.c line 2294

If you kill all node.js runtimes (including the one that Visual Studio 2017 launches), that performance drag goes away.

I need to:

Coalesce MSAA events so they don't happen so often.
Follow up with the WinEvent team to see if they can tell me Node.js is only listening for EVENT_CONSOLE_LAYOUT and not all the messages (because the expensive ones aren't the layout messages, but because MSAA/WinEvent infrastructure is very old... registering for any one registers you for all of them.)
Probably formalize this into some sort of issue/bug on the Node team to stop doing this as they're shooting the entire system's performance in the foot to get a resize notification.

miniksa · 2019-04-19T20:58:13Z

I checked ReadConsoleInput queue is filled with a WINDOW_BUFFER_SIZE_RECORD at more or less the same time and circumstances as when EVENT_CONSOLE_LAYOUT is dispatched over NotifyWinEvent.

Given Node.js in the tty file is already reading through the queue with ReadConsoleInput and discarding all non-KEY_EVENT records... they could probably drop the whole MSAA hookup and just get the events there in a performant manner instead of hijacking an accessibility feature.

Of course, this also screams of #281 needing to implement a better way overall of receiving these sorts of events, but it's going to be a bit before we get to that.

oising · 2019-04-19T21:43:35Z

That rabbit hole is getting deep :D Very interesting to read.

miniksa · 2019-04-19T22:11:39Z

Yes... It is...
@DHowett-MSFT promised me he'd drive resolution with Node (or libuv upstream).

miniksa · 2019-04-23T20:29:17Z

Follow up with the WinEvent team to see if they can tell me Node.js is only listening for EVENT_CONSOLE_LAYOUT and not all the messages (because the expensive ones aren't the layout messages, but because MSAA/WinEvent infrastructure is very old... registering for any one registers you for all of them.)

The answer to this is "no". We'd have to make categories for each event that we wanted to register separately and I'm not sure there are enough category flags left. Also, no one wants to touch this given it's legacy tech. We will need to drive improvement of this through the other options.

oising · 2019-04-23T21:44:26Z

So, the upshot of this is that any time NodeJS is running (e.g. Visual Studio) then all console windows suffer an approximate 20% slowdown (or worse) due to accessibility eventing broadcasts. Urgh.

DHowett-MSFT · 2019-07-03T17:00:24Z

The first quick fix for this (GDI measurement caching) just went out with insider build 18932!

Continuing improvement of SIGWINCH from PR libuv#2308. Running SetWinEventHook without filtering for the specific PIDs has significant impact on the performance of the entire system. This PR changes the way SIGWINCH is handled. The SetWinEventHook callback now signals a separate thread, uv__tty_console_resize_watcher_thread. This thread calls uv__tty_console_signal_resize() which checks if the console was actually resized. The uv__tty_console_resize_watcher_thread makes sure to not to call the uv__tty_console_signal_resize function more than 30 times per second. The SetWinEventHook will not be installed, if the PID of the conhost.exe process that owns the console window cannot be determinated. This can happen when a 32bit libuv app is running on a 64bit Windows. For such cases PR libuv#1408 is partially reverted - when tty reads WINDOW_BUFFER_SIZE_EVENT, it will also trigger a call to uv__tty_console_signal_resize(). This will also help when the app is running under console emulators. Documentation was alos updated to reflect that. Refs: microsoft/terminal#1811 Refs: microsoft/terminal#410 Refs: libuv#2308

Continuing improvement of SIGWINCH from PR #2308. Running SetWinEventHook without filtering for the specific PIDs has significant impact on the performance of the entire system. This PR changes the way SIGWINCH is handled. The SetWinEventHook callback now signals a separate thread, uv__tty_console_resize_watcher_thread. This thread calls uv__tty_console_signal_resize() which checks if the console was actually resized. The uv__tty_console_resize_watcher_thread makes sure to not to call the uv__tty_console_signal_resize function more than 30 times per second. The SetWinEventHook will not be installed, if the PID of the conhost.exe process that owns the console window cannot be determinated. This can happen when a 32bit libuv app is running on a 64bit Windows. For such cases PR #1408 is partially reverted - when tty reads WINDOW_BUFFER_SIZE_EVENT, it will also trigger a call to uv__tty_console_signal_resize(). This will also help when the app is running under console emulators. Documentation was also updated to reflect that. Refs: microsoft/terminal#1811 Refs: microsoft/terminal#410 Refs: #2308 PR-URL: #2381 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl>

oising · 2021-03-27T03:40:29Z

I wonder how this issue looks now?

oising · 2021-03-27T03:45:24Z

Not too bad -- the slowdown at the start was when I enabled the win+g gamebar recording...

oisin@BEASTIEBOOK3_.2021-03-26.23-43-47.mp4

miniksa · 2021-03-29T16:08:19Z

Also @oising let's leave this here: https://github.com/hzeller/timg/

* #8000 - Supports buffer rewrite work. A re-use of `til::rle` will be useful as a column counter as we pursue NxM storage and presentation. * #3075 - The new iterators allow skipping forward by multiple units, which wasn't possible under `TextBuffer-/OutputCellIterator`. Additionally it also allows a bulk insertions. * #8787 and #410 - High probability this should be `pmr`-ified like `bitmap` for things like `chafa` and `cacafire` which are changing the run length frequently. * [x] Closes #8741 * [x] I work here. * [x] Tests added. * [x] Tests passed. - [x] Ran `cacafire` in `OpenConsole.exe` and it looked beautiful - [x] Ran new suite of `RunLengthEncodingTests.cpp` Co-authored-by: Michael Niksa <miniksa@microsoft.com>

## Summary of the Pull Request Introduces `til::rle`, a vector-like container which stores elements of type T in a run length encoded format. This allows efficient compaction of repeated elements within the vector. ## References * #8000 - Supports buffer rewrite work. A re-use of `til::rle` will be useful as a column counter as we pursue NxM storage and presentation. * #3075 - The new iterators allow skipping forward by multiple units, which wasn't possible under `TextBuffer-/OutputCellIterator`. Additionally it also allows a bulk insertions. * #8787 and #410 - High probability this should be `pmr`-ified like `bitmap` for things like `chafa` and `cacafire` which are changing the run length frequently. ## PR Checklist * [x] Closes #8741 * [x] I work here. * [x] Tests added. * [x] Tests passed. ## Validation Steps Performed * [x] Ran `cacafire` in `OpenConsole.exe` and it looked beautiful * [x] Ran new suite of `RunLengthEncodingTests.cpp` Co-authored-by: Michael Niksa <miniksa@microsoft.com>

commit 4b0eeef Author: Leonard Hecker <lhecker@microsoft.com> Date: Fri May 14 23:56:08 2021 +0200 Introduce til::rle - a run length encoded vector ## Summary of the Pull Request Introduces `til::rle`, a vector-like container which stores elements of type T in a run length encoded format. This allows efficient compaction of repeated elements within the vector. ## References * #8000 - Supports buffer rewrite work. A re-use of `til::rle` will be useful as a column counter as we pursue NxM storage and presentation. * #3075 - The new iterators allow skipping forward by multiple units, which wasn't possible under `TextBuffer-/OutputCellIterator`. Additionally it also allows a bulk insertions. * #8787 and #410 - High probability this should be `pmr`-ified like `bitmap` for things like `chafa` and `cacafire` which are changing the run length frequently. ## PR Checklist * [x] Closes #8741 * [x] I work here. * [x] Tests added. * [x] Tests passed. ## Validation Steps Performed * [x] Ran `cacafire` in `OpenConsole.exe` and it looked beautiful * [x] Ran new suite of `RunLengthEncodingTests.cpp` Co-authored-by: Michael Niksa <miniksa@microsoft.com>

## Summary of the Pull Request Introduces `til::rle`, a vector-like container which stores elements of type T in a run length encoded format. This allows efficient compaction of repeated elements within the vector. ## References * #8000 - Supports buffer rewrite work. A re-use of `til::rle` will be useful as a column counter as we pursue NxM storage and presentation. * #3075 - The new iterators allow skipping forward by multiple units, which wasn't possible under `TextBuffer-/OutputCellIterator`. Additionally it also allows a bulk insertions. * #8787 and #410 - High probability this should be `pmr`-ified like `bitmap` for things like `chafa` and `cacafire` which are changing the run length frequently. ## PR Checklist * [x] Closes #8741 * [x] I work here. * [x] Tests added. * [x] Tests passed. ## Validation Steps Performed * [x] Ran `cacafire` in `OpenConsole.exe` and it looked beautiful * [x] Ran new suite of `RunLengthEncodingTests.cpp` Co-authored-by: Michael Niksa <miniksa@microsoft.com>

lhecker · 2022-02-10T16:02:46Z

@zadjii-msft Should we close this issue in favor of #10462? The rendering performance was fixed with AtlasEngine, now only the VT performance is subpar and #10462 seems more "specific".

miniksa · 2022-02-10T19:19:01Z

@lhecker, if you profiled this scenario with chafa and all that's left is the VT engine being the super hot path, I'm OK with that. Chafa's just a good test and we shouldn't forget about using it in 10462 then.

zadjii-msft · 2022-02-10T21:28:10Z

Yea I dig it. Let's focus our efforts there for remaining improvements to make in the space.

miniksa mentioned this issue Apr 11, 2019

Sixel graphics should be ignored unless supported #120

Closed

miniksa self-assigned this Apr 11, 2019

miniksa added Area-Performance Performance-related issue Product-Conhost For issues in the Console codebase Work-Item It's being tracked by an actual work item internally. (to be removed soon) Issue-Bug It either shouldn't be doing this or needs an investigation. labels Apr 11, 2019

miniksa added Mass-Chaos and removed Mass-Chaos labels May 17, 2019

oising mentioned this issue Jun 20, 2019

Fundamental Accessibility: Modify conhost accessibility model #1352

Closed

3 tasks

oising mentioned this issue Jul 4, 2019

WinEvents and ConPTY question #1811

Closed

bzoz mentioned this issue Jul 18, 2019

win, tty: improve SIGWINCH performance libuv/libuv#2381

Closed

miniksa mentioned this issue Oct 11, 2019

Correct fill attributes when scrolling and erasing #3100

Merged

5 tasks

oising mentioned this issue Oct 18, 2019

nodejs: resize doesn't fire on Windows Terminal #3238

Closed

oising mentioned this issue Apr 1, 2020

Render row-by-row instead of invalidating entire screen #5185

Merged

5 tasks

miniksa mentioned this issue Jan 15, 2021

til::rle<T, S> - a run length encoded storage template #8794

Closed

6 tasks

lhecker mentioned this issue May 14, 2021

Introduce til::rle - a run length encoded vector #10099

Merged

6 tasks

DHowett mentioned this issue Jun 8, 2021

Extremely slow performance when processing virtual terminal sequences #10362

Closed

DHowett mentioned this issue Jun 29, 2021

WriteCharsLegacy performance issues caused by NotifyConsoleUpdateSimpleEvent #10528

Open

zadjii-msft added the Priority-1 A description (P1) label Jan 10, 2022

zadjii-msft added this to the 22H1 milestone Jan 10, 2022

zadjii-msft modified the milestones: 22H1, Terminal v1.14 Feb 2, 2022

zadjii-msft closed this as completed Feb 10, 2022

ghost added the Needs-Tag-Fix Doesn't match tag requirements label Feb 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rendering performance of chafa is very slow #410

Rendering performance of chafa is very slow #410

miniksa commented Apr 11, 2019 •

edited

Loading

miniksa commented Apr 11, 2019 •

edited by zadjii-msft

Loading

oising commented Apr 11, 2019 •

edited

Loading

miniksa commented Apr 11, 2019

oising commented Apr 11, 2019

miniksa commented Apr 11, 2019

miniksa commented Apr 17, 2019

oising commented Apr 17, 2019

miniksa commented Apr 17, 2019

oising commented Apr 17, 2019

miniksa commented Apr 17, 2019

miniksa commented Apr 19, 2019

miniksa commented Apr 19, 2019

oising commented Apr 19, 2019

miniksa commented Apr 19, 2019

miniksa commented Apr 23, 2019

oising commented Apr 23, 2019

DHowett-MSFT commented Jul 3, 2019

oising commented Mar 27, 2021

oising commented Mar 27, 2021

miniksa commented Mar 29, 2021

lhecker commented Feb 10, 2022

miniksa commented Feb 10, 2022

zadjii-msft commented Feb 10, 2022

Rendering performance of chafa is very slow #410

Rendering performance of chafa is very slow #410

Comments

miniksa commented Apr 11, 2019 • edited Loading

miniksa commented Apr 11, 2019 • edited by zadjii-msft Loading

oising commented Apr 11, 2019 • edited Loading

miniksa commented Apr 11, 2019

oising commented Apr 11, 2019

miniksa commented Apr 11, 2019

miniksa commented Apr 17, 2019

oising commented Apr 17, 2019

miniksa commented Apr 17, 2019

oising commented Apr 17, 2019

miniksa commented Apr 17, 2019

miniksa commented Apr 19, 2019

miniksa commented Apr 19, 2019

oising commented Apr 19, 2019

miniksa commented Apr 19, 2019

miniksa commented Apr 23, 2019

oising commented Apr 23, 2019

DHowett-MSFT commented Jul 3, 2019

oising commented Mar 27, 2021

oising commented Mar 27, 2021

miniksa commented Mar 29, 2021

lhecker commented Feb 10, 2022

miniksa commented Feb 10, 2022

zadjii-msft commented Feb 10, 2022

miniksa commented Apr 11, 2019 •

edited

Loading

miniksa commented Apr 11, 2019 •

edited by zadjii-msft

Loading

oising commented Apr 11, 2019 •

edited

Loading