Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Epic] Text Buffer rewrite #8000

Closed
12 of 15 tasks
Tracked by #190
DHowett opened this issue Oct 22, 2020 · 33 comments
Closed
12 of 15 tasks
Tracked by #190

[Epic] Text Buffer rewrite #8000

DHowett opened this issue Oct 22, 2020 · 33 comments
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Scenario Needs-Tag-Fix Doesn't match tag requirements Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal.
Milestone

Comments

@DHowett
Copy link
Member

DHowett commented Oct 22, 2020

This is the issue tracking the great buffer rewrite of 202x.

Aims

Notes

Surrogate Pairs

Work will be required to teach WriteCharsLegacy to measure UTF-16 codepoints in aggregate, rather than individual code units.

I have done a small amount of work in WriteCharsLegacy. It is slow going.

Motivation

#8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.

Implementation

The buffer is currently stored as a vector (small_vector) of CharRowCell, each of which contains a DbcsAttribute and a wchar_t. Each cell takes 3 bytes (plus padding, if required.)

In the common case (all narrow text), this is terribly wasteful.

To better support codepoints requiring one or more code units representing a character, we are going to move to a single wchar string combined with a column count table. The column count table will be stored compressed by way of til::rle (#8741).

Simple case - all glyphs narrow
 CHAR    A    B    C    D
UNITS 0041 0042 0043 0044
 COLS    1    1    1    1

Simple case - all glyphs wide
 CHAR   カ   タ   カ   ナ
UNITS 30ab 30bf 30ab 30ca
 COLS    2    2    2    2

Surrogate pair case - glyphs narrow
 CHAR         🕴        🕴        🕴
UNITS d83d dd74 d83d dd74 d83d dd74
 COLS    1    0    1    0    1    0

Surrogate pair case - glyphs wide
 CHAR        🥶        🥶        🥶
UNITS d83e dd76 d83e dd76 d83e dd76
 COLS    2    0    2    0    2    0

Representative complicated case
 CHAR        🥶    A    B         🕴
UNITS d83e dd76 0041 0042 d83d dd74
 COLS    2    0    1    1    1    0

Representative complicated case (huge character)
[FUTURE WORK]
 CHAR ﷽
UNITS         fdfd
 COLS           12

Representative complicated case (Emoji with skin tone variation)
[FUTURE WORK]
 CHAR 👍🏼
UNITS d83d dc31 200d d83d dc64
 COLS    2    0    0    0    0

A column count of zero indicates a code unit that is a continuation of an existing glyph.

Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.

Work Log

Other issues that might just be fixed by this

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Oct 22, 2020
@DHowett DHowett added ⛺ Reserved For future use and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Oct 22, 2020
@DHowett DHowett changed the title <camper> Epic: Buffer rewrite of 2021 Jan 12, 2021
@DHowett DHowett changed the title Epic: Buffer rewrite of 2021 Epic: Text Buffer rewrite of 2021 Jan 12, 2021
@DHowett DHowett changed the title Epic: Text Buffer rewrite of 2021 [Epic] Text Buffer rewrite of 2021 Jan 12, 2021
@DHowett
Copy link
Member Author

DHowett commented Jan 12, 2021

Alright, I've claimed this issue for the text buffer updates.

@DHowett DHowett added Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Scenario Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal. and removed ⛺ Reserved For future use labels Jan 12, 2021
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jan 12, 2021
@DHowett DHowett added this to the Console Backlog milestone Jan 12, 2021
@KalleOlaviNiemitalo
Copy link

Representative complicated case (huge character)

TAB characters in #7810 could also be stored this way.

@DHowett
Copy link
Member Author

DHowett commented Jan 12, 2021

@KalleOlaviNiemitalo Indeed! I'd thought about this when designing the new advances-based buffer, but I was a little worried about the implications. We don't have good coverage for what happens when the cursor moves into a column covered by a wide glyph, and I think nailing that down will be important for getting tab-in-buffer working.

@DHowett
Copy link
Member Author

DHowett commented Jan 12, 2021

Actually, on second thought ... this isn't as bad as I'd expected. We already move into multi-cell glyphs, and we don't damage them properly when they're overstruck. We don't render the cursor with the correct width when we're inside such a glyph either...

When the new damage and column-to-region mapping algorithms, this may be a breeze.

@naikel
Copy link

naikel commented Jan 13, 2021

Will this rewrite support a TextBuffer >= 32767? Currently ROW._id is a SHORT and TextBuffer size is a COORD, which is also composed of SHORTs.

Changing TextBuffer COORD references to til::point mean a major rewrite of half the code, but it's needed if you want to support Infinite Scrollback (with files).

@j4james
Copy link
Collaborator

j4james commented Jan 13, 2021

@DHowett I don't think this is as simple as you think. Lets say you write out the string ABCDEFGHIJ, move the cursor back to column 3 (on the character C), and then output a TAB. This moves the cursor to column 9 (on the character I), but there's no visible change to the content. In what way does the buffer now change to reflect that TAB?

Things like a zero-width-joiner are even more problematic. Let's say you write out a waving white flag (U+1F3F3), a space, and a rainbow (U+1F308), then go back and replace that space with a zero-width-joiner. The buffer sequence is now flag-zwj-rainbow, which Unicode defines as a rainbow flag. How would you alter the display to reflect those 3 characters collapsing into 1?

And whatever you do has to be compatible with the way other terminal emulators behave, and more importantly, the way apps expect the terminals to behave, otherwise they just won't work correctly in Windows Terminal. I suspect some of the things you're envisioning here are simply not possible.

@DHowett
Copy link
Member Author

DHowett commented Jan 13, 2021

@j4james I'm not terribly concerned about tab - it's an auxiliary consideration for this spec. If we wanted to replicate iTerm's behavior here, it could only replace up to the next $TABSTOP characters with a N-column \t if the moved-over portion was otherwise stored as spaces.

ZWJ is another "future consideration". I'm hoping that this is more robust for now, rather than fully scalable for the future. N:M glyph mapping, inserting ZWJs, etc. is more impossible with what we have than this specification.

I wouldn't expect overstriking a non-joining space with a joining one to ever join adjacent columns once they're frozen in place in the text buffer. The behavior would, rather, be "ZWJ only attaches characters if they are written consecutively, and a cursor move breaks a character-in-composition" or "ZWJ is zero-width and always attaches to the character stored in the prior column." Either of those would be more possible than what we're working with today. It doesn't necessarily mean they're right, it just means that our horizon expands to better consider them.

All that, though, is not in scope right now. Incremental progress 😄 while retaining compatibility with what applications (using both the Win32 Console APIs and VT) do today are my goal.

@DHowett
Copy link
Member Author

DHowett commented Jan 13, 2021

@naikel breaking any dependency we have on COORD internally is a significant bonus.

@naikel
Copy link

naikel commented Jan 13, 2021

@DHowett I tried once to extend TextBuffer >= 32767 changing ROW._id from SHORT to size_t and TextBuffer COORD to til::point and after three days failed miserably. I think it's easier to just code a whole new terminal from scratch.

EDIT: I really hope you can do this!

@DHowett
Copy link
Member Author

DHowett commented Jan 29, 2021

Thoughts for replacing the global function pointer that determines glyph width (which the Renderer uses to answer for ambiguous glyphs).

struct IMeasurer {
    // Measures one codepoint, stored as UTF-16 code units
    virtual CodepointWidth MeasureCodepoint(std::wstring_view) = 0;

    // Measures multiple codepoints, stored as a string of UTF-16 code units.
    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.
	virtual std::vector<CodepointWidth> MeasureString(std::wstring_view) = 0;
}
struct UnicodeUCDMeasurer : public IMeasurer {
    // implements it using raw compiled-in unicode UCD, never asks anyone else, totally static
}
struct RendererFallbackMeasurer : public IMeasurer {
    IMeasurer* rootMeasurer;
    RenderEngine* renderEngine;

    // IFF rootMeasurer returns Ambiguous widths for any individual codepoints, ask the render engine
    // This is only required for conhost, where we **must** maintain compatibility with "font dictates display width"
}

@j4james
Copy link
Collaborator

j4james commented Jan 29, 2021

    // This function should (?) return 0-width code units for combining characters
    // if and only if they would actually combine. Use Renzhi's new measuring algorithm here.

Can I ask what Renzhi's algorithm is? My concern is if it doesn't match the behaviour of the wcwidth routines that Linux apps often use, then those apps are just not going to work correctly in WT, no matter how brilliantly we render zero-width joiners.

Also how does this relate to what is stored in the buffer? Is the intention to measure before deciding how something gets stored, or is the measuring to determine how the buffer is rendered? The reason I ask, is because if element N in the row buffer doesn't map to column N on the display, then some VT operations will also not work correctly.

This can even be a problem for client-side operations, like search and selection, although that at least is something we could maybe compensate for. And I know the DX renderer already has issues with the buffer to screen mapping, but I'd rather it didn't get worse.

@skyline75489
Copy link
Collaborator

skyline75489 commented Jan 29, 2021 via email

@j4james
Copy link
Collaborator

j4james commented Jan 29, 2021

OK, thanks for info. As long as you guys are confident that this is going work out OK, feel free to ignore my questions - I don't need to know all the details. The main thing was I wanted to make sure you weren't overlooking something in the design that might later turn out to be a problem.

@zadjii-msft
Copy link
Member

zadjii-msft commented Feb 1, 2021

since we said his name 3 times: @reli-msft

DHowett pushed a commit that referenced this issue Mar 24, 2023
This adds PR adds a couple foundational functions and classes to make
our TextBuffer more performant and allow us to improve our Unicode
correctness in the future, by getting rid of our dependence on
`OutputCellIterator`. In the future we can then replace the simple
UTF-16 code point iterator with a proper grapheme cluster iterator.

While my focus is technically on Unicode correctness, the ~4x VT
throughput increase in OpenConsole is pretty nice too.

This PR adds:
* A new, simpler ROW iterator (unused in this PR)
* Cursor movement functions (`NavigateToPrevious`, `NavigateToNext`)
  They're based on functions that align the cursor to the start/end
  of the _current_ cell, so such functions can be added as well.
* `ReplaceText` to write a raw string of text with the possibility to
  specify a right margin.
* `CopyRangeFrom` will allow us to make reflow much faster, as it's able
  to bulk-copy already measured strings without re-measuring them.

Related to #8000

## Validation Steps Performed
* enwik8.txt, zhwik8.txt, emoji-test.txt, all work with proper
  wide glyph reflow at the end of a row ✅
* This produces "a 咪" where only "a" has a white background:
  ```sh
  printf '\e7こん\e8\x1b[107ma\x1b[m\n'
  ```
* This produces "abん":
  ```sh
  stdbuf -o0 printf '\x1b7こん\x1b8a'; printf 'b\n'
  ```
* This produces "xy" at the end of the line:
  ```sh
  stdbuf -o0 printf '\e[999C\bこ\bx'; printf 'y\n'
  ```
* This produces red whitespace followed by "こ " in the default
  background color at the end of the line, and "ん" on the next line:
  ```sh
  printf '\e[41m\e[K\e[m\e[999C\e[2Dこん\n'
  ```
@mominshaikhdevs
Copy link

Screenshot 2023-05-02 161212
why this task is marked as completed when the linked issue is open?

@lhecker
Copy link
Member

lhecker commented May 2, 2023

The linked issue (#1472) is just a reference what that bullet point is for and not the task itself. Thinking about it some more, I realize that this issue (#8000) is actually kind of done already. I rewrote the buffer in #13626 and the other commits I've done since aren't really related to the "Text Buffer" anymore and they're more related to #1472, hmm... Eh, I guess it's not really that important.

microsoft-github-policy-service bot pushed a commit that referenced this issue Jun 30, 2023
… Emoji support (#15567)

This is a complete rewrite of the old `WriteCharsLegacy` function
which is used when VT mode is disabled as well as for all interactive
console input handling on Windows. The previous code was almost
horrifying in some aspects as it first wrote the incoming text into a
local buffer, stripping/replacing any control characters. That's not
particular fast and never was. It's unknown why it was like that.

It also measured the width of each glyph to correctly determine the
cursor position and line wrapping. Presumably this used to work quite
well in the original console code, because it would then just copy
that local buffer into the destination text buffer, but with the
introduction of the broken and extremely slow `OutputCellIterator`
abstraction this would end up measuring all text twice and cause
disagreements between `WriteCharsLegacy`'s idea of the cursor position
and `OutputCellIterator`'s cursor position. Emoji input was basically
entirely broken. This PR fixes it by passing any incoming text
straight to the `TextBuffer` as well as by using its cursor positioning
facilities to correctly implement wrapping and backspace handling.

Backspacing over Emojis and an array of other aspects still don't work
correctly thanks to cmdline.cpp, but it works quite a lot better now.

Related to #8000
Closes #8839
Closes #10808

## Validation Steps Performed
* Printing various Unicode text ✅
* On an fgets() input line
  * Typing text works ✅
  * Inserting text works anywhere ✅
  * Ctrl+X is translated to ^X ✅
  * Null is translated to ^@ ✅
    This was tested by hardcoding the `OutputMode` to 3 instead of 7.
  * Backspace only advances to start of the input ✅
  * Backspace deletes the entire preceding tab ✅
  * Backspace doesn't delete whitespace preceding a tab ✅
  * Backspacing a force-wrapped wide glyph unwraps the line break ✅
  * Backspacing ^X deletes both glyphs ✅
  * Backspacing a force-wrapped tab deletes trailing whitespace ✅
* When executing
  ```cpp
  fputs("foo: ", stdout);
  fgets(buffer, stdin);
  ```
  pressing tab and then backspace does not delete the whitespace
  that follows after the "foo:" string (= `sOriginalXPosition`).
microsoft-github-policy-service bot pushed a commit that referenced this issue Aug 1, 2023
`COOKED_READ_DATA` is a little special and requires cursor navigation
based on the raw (buffered) text contents instead of what's in the
text buffer. This requires the introduction of new helper functions
to implement such cursor navigation. They're made part of `TextBuffer`
as these helpers will get support graphemes in the future.
It also helps keeping it close to `TextBuffer` as the cursor
navigation should optimally behave identical between the two.

Part of #8000.
microsoft-github-policy-service bot pushed a commit that referenced this issue Aug 24, 2023
The ultimate goal of this PR was to use ICU for text search to
* Improve Unicode support
  Previously we used `towlower` and only supported BMP glphs.
* Improve search performance (10-100x)
  This allows us to search for all results in the entire text buffer
  at once without having to do so asynchronously.

Unfortunately, this required some significant changes too:
* ICU's search facilities operate on text positions which we need to be
  mapped back to buffer coordinates. This required the introduction of
  `CharToColumnMapper` to implement sort of a reverse-`_charOffsets`
  mapping. It turns text (character) positions back into coordinates.
* Previously search restarted every time you clicked the search button.
  It used the current selection as the starting position for the new
  search. But since ICU's `uregex` cannot search backwards we're
  required to accumulate all results in a vector first and so we
  need to cache that vector in between searches.
* We need to know when the cached vector became invalid and so we have
  to track any changes made to `TextBuffer`. The way this commit solves
  it is by splitting `GetRowByOffset` into `GetRowByOffset` for
  `const ROW` access and `GetMutableRowByOffset` which increments a
  mutation counter on each call. The `Search` instance can then compare
  its cached mutation count against the previous mutation count.

Finally, this commit makes 2 semi-unrelated changes:
* URL search now also uses ICU, since it's closely related to regular
  text search anyways. This significantly improves performance at
  large window sizes.
* A few minor issues in `UiaTracing` were fixed. In particular
  2 functions which passed strings as `wstring` by copy are now
  using `wstring_view` and `TraceLoggingCountedWideString`.

Related to #6319 and #8000

## Validation Steps Performed
* Search upward/downward in conhost ✅
* Search upward/downward in WT ✅
* Searching for any of ß, ẞ, ss or SS matches any of the other ✅
* Searching for any of Σ, σ, or ς matches any of the other ✅
DHowett pushed a commit that referenced this issue Sep 26, 2023
Subjectively speaking, this commit makes 3 improvements:
* Most importantly, it now would work with arbitrary Unicode text.
  (No more `IsGlyphFullWidth` or DBCS handling during reflow.)
* Due to the simpler implementation it hopefully makes review of
  future changes and maintenance simpler. (~3x less LOC.)
* It improves perf. by 1-2 orders of magnitude.
  (At 120x9001 with a full buffer I get 60ms -> 2ms.)

Unfortunately, I'm not confident that the new code replicates the old
code exactly, because I failed to understand it. During development
I simply tried to match its behavior with what I think reflow should do.

Closes #797
Closes #3088
Closes #4968
Closes #6546
Closes #6901
Closes #15964
Closes MSFT:19446208

Related to #5800 and #8000

## Validation Steps Performed
* Unit tests ✅
* Feature tests ✅
* Reflow with a scrollback ✅
* Reflowing the cursor cell causes a forced line-wrap ✅
  (Even at the end of the buffer. ✅)
* `color 8f` and reflowing retains the background color ✅
* Enter alt buffer, Resize window, Exit alt buffer ✅
DHowett pushed a commit that referenced this issue Sep 26, 2023
Subjectively speaking, this commit makes 3 improvements:
* Most importantly, it now would work with arbitrary Unicode text.
  (No more `IsGlyphFullWidth` or DBCS handling during reflow.)
* Due to the simpler implementation it hopefully makes review of
  future changes and maintenance simpler. (~3x less LOC.)
* It improves perf. by 1-2 orders of magnitude.
  (At 120x9001 with a full buffer I get 60ms -> 2ms.)

Unfortunately, I'm not confident that the new code replicates the old
code exactly, because I failed to understand it. During development
I simply tried to match its behavior with what I think reflow should do.

Closes #797
Closes #3088
Closes #4968
Closes #6546
Closes #6901
Closes #15964
Closes MSFT:19446208

Related to #5800 and #8000

## Validation Steps Performed
* Unit tests ✅
* Feature tests ✅
* Reflow with a scrollback ✅
* Reflowing the cursor cell causes a forced line-wrap ✅
  (Even at the end of the buffer. ✅)
* `color 8f` and reflowing retains the background color ✅
* Enter alt buffer, Resize window, Exit alt buffer ✅

(cherry picked from commit 7474839)
Service-Card-Id: 90642727
Service-Version: 1.19
@mominshaikhdevs
Copy link

can someone update the issue description? #30 seems to be the only remaining open issue at this point.

@lhecker
Copy link
Member

lhecker commented Apr 23, 2024

With only a few places left that rely on the old grid-based buffer layout, all of which require some long term, but focused work, I'll be going ahead and close this more general issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Scenario Needs-Tag-Fix Doesn't match tag requirements Product-Conhost For issues in the Console codebase Product-Terminal The new Windows Terminal.
Projects
None yet
Development

No branches or pull requests