[Epic] Text Buffer rewrite #8000

DHowett · 2020-10-22T00:25:06Z

This is the issue tracking the great buffer rewrite of 202x.

Aims

Refactor to remove the need for UnicodeStorage (which is a lookup table keyed on row+column)
- Removing this allows us to remove ROW::_id, ROW::_pParent, CharRow::_pParent
Reduce the fiddliness of the DBCS attribute APIs
- DBCS attributes are stored for every character when they could be easily inferred from column position
Add support for the storage of surrogate pairs
- Surrogate pairs work today as an accident of fate: a pair of UTF-16 code units encoding an EA=wide codepoint is seen as wide, which conveniently matches how many wchar_t it takes up.
- We have little to no proper support for a codepoint requiring two UTF-16 code units that is only seen as one column wide (Narrow emoji >=U+10000 make WriteCharsLegacy sad ("wrong character insertion when scrolling to bash history") #6555 (master issue); Extra space cell inserted after unicode character #6162; Links/URLs get offset when certain unicode characters are printed #8709)
Provide a platform on which to build full ZWJ support (Feature Request: Finish full unicode support (M:N cell rendering, ZWJ?) #1472)
Kill CharRow, CharRowCell, CharRowCellReference
Reduce the static storage required to store a row (eventually) by not storing space characters
- This should make MeasureRight faster, and therefore help fix Console doesn't handle colored regions when reflowed #32.

Notes

Surrogate Pairs

Work will be required to teach WriteCharsLegacy to measure UTF-16 codepoints in aggregate, rather than individual code units.
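To illustrate what measuring "in aggregate" means, here is a minimal sketch that groups UTF-16 code units into whole codepoints before anything is measured. It is not code from WriteCharsLegacy: `DecodedCodepoint` and `DecodeUtf16` are hypothetical names, and `char16_t` is used instead of `wchar_t` so the example behaves the same on every platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper type: one decoded codepoint plus the number of
// UTF-16 code units it occupied in the input.
struct DecodedCodepoint
{
    char32_t value;
    std::size_t units;
};

constexpr bool IsHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool IsLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Walks the string and combines each valid surrogate pair into one codepoint.
// Unpaired surrogates are reported as U+FFFD (replacement character).
inline std::vector<DecodedCodepoint> DecodeUtf16(std::u16string_view text)
{
    std::vector<DecodedCodepoint> out;
    for (std::size_t i = 0; i < text.size();)
    {
        const char16_t c = text[i];
        if (IsHighSurrogate(c) && i + 1 < text.size() && IsLowSurrogate(text[i + 1]))
        {
            const char32_t cp = 0x10000 + ((char32_t{ c } - 0xD800) << 10) + (char32_t{ text[i + 1] } - 0xDC00);
            out.push_back({ cp, 2 });
            i += 2;
        }
        else if (IsHighSurrogate(c) || IsLowSurrogate(c))
        {
            out.push_back({ 0xFFFD, 1 });
            i += 1;
        }
        else
        {
            out.push_back({ c, 1 });
            i += 1;
        }
    }
    return out;
}
```

A width measurer would then be asked about each `value` once, rather than about each code unit separately.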

I have done a small amount of work in WriteCharsLegacy. It is slow going.

Motivation

#8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.

Implementation

The buffer is currently stored as a vector (small_vector) of CharRowCell, each of which contains a DbcsAttribute and a wchar_t. Each cell takes 3 bytes (plus padding, if required.)

In the common case (all narrow text), this is terribly wasteful.

To better support characters that are represented by one or more code units, we are going to move to a single wchar string combined with a column count table. The column count table will be stored compressed by way of til::rle (#8741).
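As a rough illustration of why run-length encoding suits the column count table, here is a minimal sketch. It does not use the real `til::rle` API; `ColumnRun`, `CompressColumns` and `ColumnsAt` are hypothetical names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One run: `length` consecutive code units that share the same column count.
struct ColumnRun
{
    std::uint8_t columns;
    std::size_t length;
};

// Compresses a per-code-unit column table into runs of equal values.
inline std::vector<ColumnRun> CompressColumns(const std::vector<std::uint8_t>& columns)
{
    std::vector<ColumnRun> runs;
    for (const auto c : columns)
    {
        if (!runs.empty() && runs.back().columns == c)
        {
            ++runs.back().length;
        }
        else
        {
            runs.push_back({ c, 1 });
        }
    }
    return runs;
}

// Looks up the column count of one code unit without decompressing the table.
inline std::uint8_t ColumnsAt(const std::vector<ColumnRun>& runs, std::size_t index)
{
    for (const auto& run : runs)
    {
        if (index < run.length)
        {
            return run.columns;
        }
        index -= run.length;
    }
    assert(false && "index out of range");
    return 0;
}
```

An 80-column row of narrow text collapses into a single run in this sketch, whereas a row of surrogate pairs (column counts alternating between a width and 0) does not compress at all.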

Simple case - all glyphs narrow
 CHAR    A    B    C    D
UNITS 0041 0042 0043 0044
 COLS    1    1    1    1

Simple case - all glyphs wide
 CHAR   カ   タ   カ   ナ
UNITS 30ab 30bf 30ab 30ca
 COLS    2    2    2    2

Surrogate pair case - glyphs narrow
 CHAR         🕴        🕴        🕴
UNITS d83d dd74 d83d dd74 d83d dd74
 COLS    1    0    1    0    1    0

Surrogate pair case - glyphs wide
 CHAR        🥶        🥶        🥶
UNITS d83e dd76 d83e dd76 d83e dd76
 COLS    2    0    2    0    2    0

Representative complicated case
 CHAR        🥶    A    B         🕴
UNITS d83e dd76 0041 0042 d83d dd74
 COLS    2    0    1    1    1    0

Representative complicated case (huge character)
[FUTURE WORK]
 CHAR ﷽
UNITS         fdfd
 COLS           12

Representative complicated case (Emoji with skin tone variation)
[FUTURE WORK]
 CHAR 👍🏼
UNITS d83d dc4d d83c dffc
 COLS    2    0    0    0

A column count of zero indicates a code unit that is a continuation of an existing glyph.

Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.
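The summation can be sketched as follows. This is only an illustration with hypothetical function names, using a plain `std::vector<int>` for the column counts instead of the compressed table.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Returns the index of the first code unit of the glyph that covers `column`
// (0-based), or nullopt if the column lies outside the stored text.
inline std::optional<std::size_t> CodeUnitIndexForColumn(const std::vector<int>& columns, int column)
{
    if (column < 0)
    {
        return std::nullopt;
    }
    int start = 0; // first column covered by the glyph being examined
    for (std::size_t i = 0; i < columns.size(); ++i)
    {
        if (columns[i] == 0)
        {
            continue; // continuation unit, belongs to the previous glyph
        }
        if (column < start + columns[i])
        {
            return i;
        }
        start += columns[i];
    }
    return std::nullopt;
}

// Returns the first column (0-based) of the glyph that contains code unit `index`.
inline int ColumnForCodeUnitIndex(const std::vector<int>& columns, std::size_t index)
{
    assert(index < columns.size());
    // Step back from a continuation unit to the unit that starts the glyph.
    while (index > 0 && columns[index] == 0)
    {
        --index;
    }
    int column = 0;
    for (std::size_t i = 0; i < index; ++i)
    {
        column += columns[i];
    }
    return column;
}
```

With the column counts from the "Representative complicated case" (`2 0 1 1 1 0`), column 2 maps to code unit index 2 (the `A`), and code unit index 5 (the trailing surrogate of the last glyph) maps back to column 4.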

Work Log

Add tests for reflow so that we can rewrite it (Add some tests for TextBuffer::Reflow #8715)
Hide more of CharRow/AttrRow's implementation details inside Row (ROW: clean up in preparation to hide CharRow & AttrRow #8446)
(from Michael) til::rle<T, S> - a run length encoded storage template, which we will use to store column counts

Other issues that might just be fixed by this


DHowett · 2021-01-12T03:29:53Z

Alright, I've claimed this issue for the text buffer updates.

KalleOlaviNiemitalo · 2021-01-12T05:26:07Z

Representative complicated case (huge character)

TAB characters in #7810 could also be stored this way.

DHowett · 2021-01-12T21:23:10Z

@KalleOlaviNiemitalo Indeed! I'd thought about this when designing the new advances-based buffer, but I was a little worried about the implications. We don't have good coverage for what happens when the cursor moves into a column covered by a wide glyph, and I think nailing that down will be important for getting tab-in-buffer working.

DHowett · 2021-01-12T21:30:12Z

Actually, on second thought ... this isn't as bad as I'd expected. We already move into multi-cell glyphs, and we don't damage them properly when they're overstruck. We don't render the cursor with the correct width when we're inside such a glyph either...

With the new damage and column-to-region mapping algorithms, this may be a breeze.

naikel · 2021-01-13T00:27:03Z

Will this rewrite support a TextBuffer >= 32767? Currently ROW._id is a SHORT and TextBuffer size is a COORD, which is also composed of SHORTs.

Changing TextBuffer COORD references to til::point means a major rewrite of half the code, but it's needed if you want to support Infinite Scrollback (with files).

j4james · 2021-01-13T00:36:56Z

@DHowett I don't think this is as simple as you think. Let's say you write out the string ABCDEFGHIJ, move the cursor back to column 3 (on the character C), and then output a TAB. This moves the cursor to column 9 (on the character I), but there's no visible change to the content. In what way does the buffer now change to reflect that TAB?

Things like a zero-width-joiner are even more problematic. Let's say you write out a waving white flag (U+1F3F3), a space, and a rainbow (U+1F308), then go back and replace that space with a zero-width-joiner. The buffer sequence is now flag-zwj-rainbow, which Unicode defines as a rainbow flag. How would you alter the display to reflect those 3 characters collapsing into 1?

And whatever you do has to be compatible with the way other terminal emulators behave, and more importantly, the way apps expect the terminals to behave, otherwise they just won't work correctly in Windows Terminal. I suspect some of the things you're envisioning here are simply not possible.

DHowett · 2021-01-13T00:49:02Z

@j4james I'm not terribly concerned about tab - it's an auxiliary consideration for this spec. If we wanted to replicate iTerm's behavior here, it could only replace up to the next $TABSTOP characters with an N-column \t if the moved-over portion was otherwise stored as spaces.
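A minimal sketch of that rule, under loud assumptions: tab stops every 8 columns, 0-based columns, and a row containing only narrow characters so that one code unit equals one column. `NextTabStop` and `CanStoreTabAsGlyph` are hypothetical names, not functions from the Terminal codebase or iTerm.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the 0-based column of the next tab stop after `column`.
// Assumes evenly spaced tab stops and tabWidth > 0.
constexpr std::size_t NextTabStop(std::size_t column, std::size_t tabWidth = 8)
{
    return (column / tabWidth + 1) * tabWidth;
}

// True if every stored cell the TAB moves over is a space. Only then could
// the skipped cells be replaced by a single multi-column '\t' without
// changing what is visible. Cells past the end of `row` count as blank.
inline bool CanStoreTabAsGlyph(std::u16string_view row, std::size_t column, std::size_t tabWidth = 8)
{
    const auto end = std::min(NextTabStop(column, tabWidth), row.size());
    for (auto i = column; i < end; ++i)
    {
        if (row[i] != u' ')
        {
            return false;
        }
    }
    return true;
}
```

In the ABCDEFGHIJ example, the cursor on C is 0-based column 2 and the TAB lands on 0-based column 8 (the I). The cells in between hold text, so this sketch would leave the buffer untouched and only move the cursor.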

ZWJ is another "future consideration". I'm hoping that this is more robust for now, rather than fully scalable for the future. N:M glyph mapping, inserting ZWJs, etc. are even less feasible with what we have today than with this specification.

I wouldn't expect overstriking a non-joining space with a joining one to ever join adjacent columns once they're frozen in place in the text buffer. The behavior would, rather, be "ZWJ only attaches characters if they are written consecutively, and a cursor move breaks a character-in-composition" or "ZWJ is zero-width and always attaches to the character stored in the prior column." Either of those would be more possible than what we're working with today. It doesn't necessarily mean they're right, it just means that our horizon expands to better consider them.

All that, though, is not in scope right now. Incremental progress 😄 while retaining compatibility with what applications (using both the Win32 Console APIs and VT) do today is my goal.

DHowett · 2021-01-13T00:53:49Z

@naikel breaking any dependency we have on COORD internally is a significant bonus.

naikel · 2021-01-13T00:55:28Z

@DHowett I tried once to extend TextBuffer >= 32767 changing ROW._id from SHORT to size_t and TextBuffer COORD to til::point and after three days failed miserably. I think it's easier to just code a whole new terminal from scratch.

EDIT: I really hope you can do this!

DHowett · 2021-01-29T04:32:24Z

Thoughts for replacing the global function pointer that determines glyph width (which the Renderer uses to answer for ambiguous glyphs).

struct IMeasurer {
    virtual ~IMeasurer() = default;

    // Measures one codepoint, stored as UTF-16 code units
    virtual CodepointWidth MeasureCodepoint(std::wstring_view) = 0;

    // Measures multiple codepoints, stored as a string of UTF-16 code units.
    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.
    virtual std::vector<CodepointWidth> MeasureString(std::wstring_view) = 0;
};

struct UnicodeUCDMeasurer : public IMeasurer {
    // implements it using raw compiled-in unicode UCD, never asks anyone else, totally static
};

struct RendererFallbackMeasurer : public IMeasurer {
    IMeasurer* rootMeasurer;
    RenderEngine* renderEngine;

    // IFF rootMeasurer returns Ambiguous widths for any individual codepoints, ask the render engine
    // This is only required for conhost, where we **must** maintain compatibility with "font dictates display width"
};
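To make the fallback idea concrete, here is a self-contained sketch of how such a chain could behave. Everything in it is a stand-in: the `CodepointWidth` enum is redeclared locally with only three values, `IFontWidthSource` replaces the render engine, and the toy measurer hardcodes a tiny table (katakana as wide, `§` as ambiguous) instead of real UCD data. `MeasureString` is left out to keep it short.

```cpp
#include <cassert>
#include <string_view>

// Local stand-in; the real type may have different or additional values.
enum class CodepointWidth { Narrow, Wide, Ambiguous };

struct IMeasurer
{
    virtual ~IMeasurer() = default;
    virtual CodepointWidth MeasureCodepoint(std::wstring_view codepoint) = 0;
};

// Hypothetical stand-in for "ask the render engine".
struct IFontWidthSource
{
    virtual ~IFontWidthSource() = default;
    virtual bool IsGlyphWideInFont(std::wstring_view codepoint) = 0;
};

// Toy static measurer: a hardcoded table, not real Unicode data.
struct ToyStaticMeasurer final : IMeasurer
{
    CodepointWidth MeasureCodepoint(std::wstring_view codepoint) override
    {
        const wchar_t c = codepoint.empty() ? L'\0' : codepoint.front();
        if (c >= 0x30A0 && c <= 0x30FF)
        {
            return CodepointWidth::Wide; // katakana block
        }
        if (c == 0x00A7)
        {
            return CodepointWidth::Ambiguous; // section sign
        }
        return CodepointWidth::Narrow;
    }
};

// Toy font that gives the same answer for every glyph.
struct ToyFont final : IFontWidthSource
{
    explicit ToyFont(bool wide) : _wide{ wide } {}
    bool IsGlyphWideInFont(std::wstring_view) override { return _wide; }

private:
    bool _wide;
};

// Defers to the root measurer and only consults the font for Ambiguous results.
struct FallbackMeasurer final : IMeasurer
{
    FallbackMeasurer(IMeasurer& root, IFontWidthSource& font) : _root{ root }, _font{ font } {}

    CodepointWidth MeasureCodepoint(std::wstring_view codepoint) override
    {
        const auto width = _root.MeasureCodepoint(codepoint);
        if (width != CodepointWidth::Ambiguous)
        {
            return width;
        }
        return _font.IsGlyphWideInFont(codepoint) ? CodepointWidth::Wide : CodepointWidth::Narrow;
    }

private:
    IMeasurer& _root;
    IFontWidthSource& _font;
};
```

Callers only ever see `IMeasurer`, so a host that does not need the "font dictates display width" behavior could use the static measurer directly.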

j4james · 2021-01-29T21:48:32Z

    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.

Can I ask what Renzhi's algorithm is? My concern is if it doesn't match the behaviour of the wcwidth routines that Linux apps often use, then those apps are just not going to work correctly in WT, no matter how brilliantly we render zero-width joiners.

Also how does this relate to what is stored in the buffer? Is the intention to measure before deciding how something gets stored, or is the measuring to determine how the buffer is rendered? The reason I ask, is because if element N in the row buffer doesn't map to column N on the display, then some VT operations will also not work correctly.

This can even be a problem for client-side operations, like search and selection, although that at least is something we could maybe compensate for. And I know the DX renderer already has issues with the buffer to screen mapping, but I'd rather it didn't get worse.

skyline75489 · 2021-01-29T23:20:55Z

Renzhi is an MS employee like me who is moonlighting on this project. He is an expert on fonts, Unicode and layout. He has been working with us internally to improve the overall Unicode support by providing an algorithm for general text width measurement.

j4james · 2021-01-29T23:26:09Z

OK, thanks for the info. As long as you guys are confident that this is going to work out OK, feel free to ignore my questions - I don't need to know all the details. The main thing was I wanted to make sure you weren't overlooking something in the design that might later turn out to be a problem.

zadjii-msft · 2021-02-01T12:01:30Z

since we said his name 3 times: @reli-msft

This PR adds a couple foundational functions and classes to make our TextBuffer more performant and allow us to improve our Unicode correctness in the future, by getting rid of our dependence on `OutputCellIterator`. In the future we can then replace the simple UTF-16 code point iterator with a proper grapheme cluster iterator. While my focus is technically on Unicode correctness, the ~4x VT throughput increase in OpenConsole is pretty nice too.

This PR adds:
* A new, simpler ROW iterator (unused in this PR)
* Cursor movement functions (`NavigateToPrevious`, `NavigateToNext`)
  They're based on functions that align the cursor to the start/end of the _current_ cell, so such functions can be added as well.
* `ReplaceText` to write a raw string of text with the possibility to specify a right margin.
* `CopyRangeFrom` will allow us to make reflow much faster, as it's able to bulk-copy already measured strings without re-measuring them.

Related to #8000

## Validation Steps Performed
* enwik8.txt, zhwik8.txt, emoji-test.txt, all work with proper wide glyph reflow at the end of a row ✅
* This produces "a 咪" where only "a" has a white background:
  ```sh
  printf '\e7こん\e8\x1b[107ma\x1b[m\n'
  ```
* This produces "abん":
  ```sh
  stdbuf -o0 printf '\x1b7こん\x1b8a'; printf 'b\n'
  ```
* This produces "xy" at the end of the line:
  ```sh
  stdbuf -o0 printf '\e[999C\bこ\bx'; printf 'y\n'
  ```
* This produces red whitespace followed by "こ " in the default background color at the end of the line, and "ん" on the next line:
  ```sh
  printf '\e[41m\e[K\e[m\e[999C\e[2Dこん\n'
  ```

mominshaikhdevs · 2023-05-02T10:14:34Z

Why is this task marked as completed when the linked issue is open?

lhecker · 2023-05-02T12:03:38Z

The linked issue (#1472) is just a reference for what that bullet point is about and not the task itself. Thinking about it some more, I realize that this issue (#8000) is actually kind of done already. I rewrote the buffer in #13626 and the other commits I've done since aren't really related to the "Text Buffer" anymore and they're more related to #1472, hmm... Eh, I guess it's not really that important.

… Emoji support (#15567)

This is a complete rewrite of the old `WriteCharsLegacy` function which is used when VT mode is disabled as well as for all interactive console input handling on Windows. The previous code was almost horrifying in some aspects as it first wrote the incoming text into a local buffer, stripping/replacing any control characters. That's not particularly fast and never was. It's unknown why it was like that.

It also measured the width of each glyph to correctly determine the cursor position and line wrapping. Presumably this used to work quite well in the original console code, because it would then just copy that local buffer into the destination text buffer, but with the introduction of the broken and extremely slow `OutputCellIterator` abstraction this would end up measuring all text twice and cause disagreements between `WriteCharsLegacy`'s idea of the cursor position and `OutputCellIterator`'s cursor position. Emoji input was basically entirely broken.

This PR fixes it by passing any incoming text straight to the `TextBuffer` as well as by using its cursor positioning facilities to correctly implement wrapping and backspace handling. Backspacing over Emojis and an array of other aspects still don't work correctly thanks to cmdline.cpp, but it works quite a lot better now.

Related to #8000
Closes #8839
Closes #10808

## Validation Steps Performed
* Printing various Unicode text ✅
* On an fgets() input line
  * Typing text works ✅
  * Inserting text works anywhere ✅
  * Ctrl+X is translated to ^X ✅
  * Null is translated to ^@ ✅
    This was tested by hardcoding the `OutputMode` to 3 instead of 7.
  * Backspace only advances to start of the input ✅
  * Backspace deletes the entire preceding tab ✅
  * Backspace doesn't delete whitespace preceding a tab ✅
  * Backspacing a force-wrapped wide glyph unwraps the line break ✅
  * Backspacing ^X deletes both glyphs ✅
  * Backspacing a force-wrapped tab deletes trailing whitespace ✅
  * When executing
    ```cpp
    fputs("foo: ", stdout);
    fgets(buffer, sizeof(buffer), stdin);
    ```
    pressing tab and then backspace does not delete the whitespace that follows after the "foo:" string (= `sOriginalXPosition`).

`COOKED_READ_DATA` is a little special and requires cursor navigation based on the raw (buffered) text contents instead of what's in the text buffer. This requires the introduction of new helper functions to implement such cursor navigation. They're made part of `TextBuffer` as these helpers will get support for graphemes in the future. It also helps to keep them close to `TextBuffer`, as the cursor navigation should optimally behave identically between the two. Part of #8000.

The ultimate goal of this PR was to use ICU for text search to
* Improve Unicode support
  Previously we used `towlower` and only supported BMP glyphs.
* Improve search performance (10-100x)
  This allows us to search for all results in the entire text buffer at once without having to do so asynchronously.

Unfortunately, this required some significant changes too:
* ICU's search facilities operate on text positions which need to be mapped back to buffer coordinates. This required the introduction of `CharToColumnMapper` to implement sort of a reverse-`_charOffsets` mapping. It turns text (character) positions back into coordinates.
* Previously search restarted every time you clicked the search button. It used the current selection as the starting position for the new search. But since ICU's `uregex` cannot search backwards we're required to accumulate all results in a vector first and so we need to cache that vector in between searches.
* We need to know when the cached vector became invalid and so we have to track any changes made to `TextBuffer`. The way this commit solves it is by splitting `GetRowByOffset` into `GetRowByOffset` for `const ROW` access and `GetMutableRowByOffset` which increments a mutation counter on each call. The `Search` instance can then compare its cached mutation count against the previous mutation count.

Finally, this commit makes 2 semi-unrelated changes:
* URL search now also uses ICU, since it's closely related to regular text search anyways. This significantly improves performance at large window sizes.
* A few minor issues in `UiaTracing` were fixed. In particular 2 functions which passed strings as `wstring` by copy are now using `wstring_view` and `TraceLoggingCountedWideString`.

Related to #6319 and #8000

## Validation Steps Performed
* Search upward/downward in conhost ✅
* Search upward/downward in WT ✅
* Searching for any of ß, ẞ, ss or SS matches any of the other ✅
* Searching for any of Σ, σ, or ς matches any of the other ✅
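The mutation-counter idea from the PR text above can be sketched roughly like this. `MiniBuffer` and `CachedSearch` are hypothetical miniatures written for this example: they use plain `std::wstring` rows and substring search, not the real `TextBuffer`, `ROW`, `Search` or ICU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical miniature buffer: const access is free, mutable access bumps a counter.
class MiniBuffer
{
public:
    explicit MiniBuffer(std::size_t rows) : _rows(rows) {}

    const std::wstring& GetRow(std::size_t i) const { return _rows.at(i); }
    std::wstring& GetMutableRow(std::size_t i)
    {
        ++_mutations; // any caller that *could* write invalidates cached results
        return _rows.at(i);
    }
    std::size_t RowCount() const noexcept { return _rows.size(); }
    std::uint64_t Mutations() const noexcept { return _mutations; }

private:
    std::vector<std::wstring> _rows;
    std::uint64_t _mutations = 0;
};

// Caches all (row, offset) matches and recomputes only when needed.
class CachedSearch
{
public:
    // Returns true if the results had to be recomputed.
    bool Refresh(const MiniBuffer& buffer, std::wstring_view needle)
    {
        if (_valid && std::wstring_view{ _needle } == needle && _seenMutations == buffer.Mutations())
        {
            return false; // buffer untouched since the last search: reuse the cache
        }
        _results.clear();
        for (std::size_t row = 0; row < buffer.RowCount(); ++row)
        {
            const auto& text = buffer.GetRow(row);
            for (auto pos = text.find(needle); pos != std::wstring::npos; pos = text.find(needle, pos + 1))
            {
                _results.emplace_back(row, pos);
            }
        }
        _needle = std::wstring{ needle };
        _seenMutations = buffer.Mutations();
        _valid = true;
        return true;
    }

    const std::vector<std::pair<std::size_t, std::size_t>>& Results() const noexcept { return _results; }

private:
    std::vector<std::pair<std::size_t, std::size_t>> _results;
    std::wstring _needle;
    std::uint64_t _seenMutations = 0;
    bool _valid = false;
};
```

The counter is deliberately pessimistic in this sketch: handing out a mutable row counts as a change even if nothing is written, which keeps the check to a single integer comparison.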

Subjectively speaking, this commit makes 3 improvements:
* Most importantly, it now would work with arbitrary Unicode text. (No more `IsGlyphFullWidth` or DBCS handling during reflow.)
* Due to the simpler implementation it hopefully makes review of future changes and maintenance simpler. (~3x less LOC.)
* It improves perf. by 1-2 orders of magnitude. (At 120x9001 with a full buffer I get 60ms -> 2ms.)

Unfortunately, I'm not confident that the new code replicates the old code exactly, because I failed to understand it. During development I simply tried to match its behavior with what I think reflow should do.

Closes #797
Closes #3088
Closes #4968
Closes #6546
Closes #6901
Closes #15964
Closes MSFT:19446208

Related to #5800 and #8000

## Validation Steps Performed
* Unit tests ✅
* Feature tests ✅
* Reflow with a scrollback ✅
* Reflowing the cursor cell causes a forced line-wrap ✅ (Even at the end of the buffer. ✅)
* `color 8f` and reflowing retains the background color ✅
* Enter alt buffer, Resize window, Exit alt buffer ✅

(cherry picked from commit 7474839) Service-Card-Id: 90642727 Service-Version: 1.19

mominshaikhdevs · 2023-09-26T12:03:03Z

can someone update the issue description? #30 seems to be the only remaining open issue at this point.

lhecker · 2024-04-23T21:20:33Z

With only a few places left that rely on the old grid-based buffer layout, all of which require some long-term but focused work, I'll go ahead and close this more general issue.

ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Oct 22, 2020

DHowett added ⛺ Reserved For future use and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Oct 22, 2020

DHowett changed the title ~~<camper>~~ Epic: Buffer rewrite of 2021 Jan 12, 2021

DHowett changed the title ~~Epic: Buffer rewrite of 2021~~ Epic: Text Buffer rewrite of 2021 Jan 12, 2021

DHowett changed the title ~~Epic: Text Buffer rewrite of 2021~~ [Epic] Text Buffer rewrite of 2021 Jan 12, 2021

ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jan 12, 2021

DHowett added this to the Console Backlog milestone Jan 12, 2021

miniksa mentioned this issue Jan 15, 2021

til::rle<T, S> - a run length encoded storage template #8794

Closed

6 tasks

zadjii-msft mentioned this issue May 3, 2023

Support Right to Left (RTL) text #538

Open

DHowett mentioned this issue May 8, 2023

ReadConsoleW fails with non-BMP characters #4628

Closed

carlos-zamora mentioned this issue May 17, 2023

Nerd font icons with 5 digit hex codes are not displayed correctly #15369

Closed

lhecker mentioned this issue Jun 16, 2023

We've been trying to reach you about your WriteCharsLegacy's extended Emoji support #15567

Merged

lhecker mentioned this issue Jul 7, 2023

Replace IInputEvent with INPUT_RECORD #15673

Merged

XXXE88 mentioned this issue Jul 10, 2023

Show only one page #15669

Closed

triksterx mentioned this issue Jul 11, 2023

Quake hotkey creates new tab instead of hiding window #15688

Closed

lhecker mentioned this issue Jul 12, 2023

Reimplement TextBuffer::Reflow #15701

Merged

lhecker mentioned this issue Jul 31, 2023

Add text based cursor movement helpers #15779

Merged

lhecker mentioned this issue Aug 21, 2023

Use ICU for text search #15858

Merged

parasyte mentioned this issue Aug 29, 2023

Buffer: unicode-width and emojis ratatui-org/ratatui#75

Open

lhecker mentioned this issue Oct 24, 2023

Funadmental Flaws inherent in the design of AtlasEngine #16132

Closed

tangturtle mentioned this issue Feb 8, 2024

无法正常加载settings.json | Fails to load "settings.json" properly #16682

Closed

lhecker removed their assignment Apr 23, 2024

lhecker closed this as completed Apr 23, 2024

microsoft-github-policy-service bot added the Needs-Tag-Fix Doesn't match tag requirements label Apr 23, 2024

Anz-03 mentioned this issue Jun 12, 2024

No response when open windows terminal #17425

Open
