Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite ROW to be Unicode capable #13626

Merged
merged 25 commits into from
Nov 11, 2022
Merged

Conversation

lhecker
Copy link
Member

@lhecker lhecker commented Jul 29, 2022

This commit is a from-scratch rewrite of ROW with the primary goal to get
rid of the rather bodgy UnicodeStorage class and improve Unicode support.

Previously a 120x9001 terminal buffer would store a vector of 9001 ROWs
where each ROW stored exactly 120 wchar_t. Glyphs exceeding their
allocated space would be stored in the UnicodeStorage which was basically
a hashmap<Coordinate, String>. Iterating over the text in a ROW would
require us to check each glyph and fetch it from the map conditionally.
On newlines we'd have to invalidate all map entries that are now gone,
so for every invalidated ROW we'd iterate through all glyphs again and if
a single one was stored in UnicodeStorage, we'd then iterate through the
entire hashmap to remove all coordinates that were residing on that ROW.
All in all, this wasn't the most robust nor performant code.

The new implementation is simple (from a design perspective):
Store all text in a ROW in a regular string. Grow the string if needed.
The association between columns and text works by storing character offsets
in a column-wide array. This algorithm is <100 LOC and removes ~1000.

As an aside this PR does a few more things that go hand in hand:

  • Remove most of ROW helper classes, which aren't needed anymore.
  • Allocate backing memory in a single VirtualAlloc call.
  • Rewrite IsCursorDoubleWidth to use DbcsAttrAt directly.
    Improves overall performance by 10-20% and makes this implementation
    faster than the previous NxM storage, despite the added complexity.

Part of #8000

Validation Steps Performed

  • Existing and new unit and feature tests complete ✅
  • Printing Unicode completes without crashing ✅
  • Resizing works without crashing ✅

@lhecker lhecker force-pushed the dev/lhecker/8000-text-buffer-rework branch from eb4a04e to 2f01956 Compare July 29, 2022 15:43
@DHowett
Copy link
Member

DHowett commented Jul 29, 2022

image

I love it.

void ROW::_init() noexcept
{
std::fill_n(_chars.begin(), _columnCount, UNICODE_SPACE);
iota_n(_charOffsets.begin(), _columnCount + 1u, uint16_t{ 0 });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we worried about the expense of having, say, 9001 copies of an array of all numbers between 0 and 120?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The space requirement and initialization cost is about the same as before when we used 16-bit sized DbcsAttribute instances for each wchar_t in a ROW. As such I'm not concerned about any regressions.
I would've liked to do lazy initialization for ROWs, but that would require us to remove the const variant of GetRowByOffset, which I wasn't willing to do in this already massive PR.

@DHowett
Copy link
Member

DHowett commented Jul 29, 2022

I love how removing UnicodeStorage obviates the need for rows to have IDs, which removes the entire machine built around making sure row IDs remain sane

@@ -343,68 +343,6 @@ class SelectionTests
VERIFY_ARE_EQUAL(srOriginal.Left + sDeltaLeft, srSelection.Left);
VERIFY_ARE_EQUAL(srOriginal.Right + sDeltaRight, srSelection.Right);
}

TEST_METHOD(TestBisectSelection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. For some tests, it may be good practice to leave a comment on the PR explaining why it's not necessary any more. :)

Copy link
Member Author

@lhecker lhecker Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Bisect" in this case stands for "trailing wide character in the first / leading wide character in the last cell of a row", which we don't support anymore, even with CHAR_INFOs. So this test was removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which we don't support anymore

As in, this PR breaks that functionality? Or we officially stopped supporting it a while back?
Why is it ok that we're dropping support for that? I seem to be missing some context here haha.

Copy link
Member Author

@lhecker lhecker Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in, this PR breaks that functionality? Or we officially stopped supporting it a while back?
Why is it ok that we're dropping support for that? I seem to be missing some context here haha.

Both. Kinda. This PR breaks it harder than we already broke it in the past.
I'm not sure when, but at some point, we stopped supporting storage of "half a wide glyph" in the first/last column of the console ("bisecting rows"). Instead it was supposed to either store a whole wide glyph or none at all. With any of the public console APIs it shouldn't have been possible to bring conhost into a state where it stores such bisecting glyphs in a long while.
But we still kind of supported it internally, for instance when you rapidly resized the console you can bring it into a state, where the reflow algorithm puts glyphs into the wrong columns. This PR puts the last nail into the coffin and ensures that no one can mess with our TextBuffer anymore, not even ourselves. And so this test can't possibly work anymore, just like you already couldn't do that via any of the public APIs. To be entirely honest, I'm not even 100% sure why TestBisectSelection still existed in the first place, and our git history didn't bring up any clues for that either.

BTW a bug caused by bisecting glyphs: #14275
I'm sure there's tons more of that lingering around in our code base.

@DHowett
Copy link
Member

DHowett commented Jul 29, 2022

I haven't found it yet, but how do you handle the case where a double-cell character is inserted on a line where only a single-cell character could fit on the right side? We used to keep track of whether we had "forced" a wide glyph on to the next line, and we used that to reconsitute something for the API i believe...

@lhecker
Copy link
Member Author

lhecker commented Jul 29, 2022

That's still all there in ROW::WriteCells which still has the SetDoubleBytePadded(true) logic.
This PR doesn't improve Unicode support just yet - it only makes it possible to support in the future once we get rid of ROW::WriteCells and until then, in this PR, all the existing code around it was kept intact.

Comment on lines -184 to -195
void TextBufferTests::TestBufferRowByOffset()
{
auto& textBuffer = GetTbi();
auto csBufferHeight = GetBufferHeight();

VERIFY_IS_TRUE(csBufferHeight > 20);

auto sId = csBufferHeight / 2 - 5;

const auto& row = textBuffer.GetRowByOffset(sId);
VERIFY_ARE_EQUAL(row.GetId(), sId);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test isn't helpful anymore, since ROWs don't have IDs anymore.

Comment on lines +88 to +99
void swap(ROW& lhs, ROW& rhs) noexcept
{
std::swap(lhs._charsBuffer, rhs._charsBuffer);
std::swap(lhs._charsHeap, rhs._charsHeap);
std::swap(lhs._chars, rhs._chars);
std::swap(lhs._charOffsets, rhs._charOffsets);
std::swap(lhs._attr, rhs._attr);
std::swap(lhs._columnCount, rhs._columnCount);
std::swap(lhs._lineRendition, rhs._lineRendition);
std::swap(lhs._wrapForced, rhs._wrapForced);
std::swap(lhs._doubleBytePadded, rhs._doubleBytePadded);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether I should keep thisstd::swap specialization... It helps the std::rotate in TextBuffer, which relies on std::swap internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I vote keep it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any way to guarantee that it swaps all contents?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No unfortunately not. The alternative is to implement a operator=(ROW&&) which of course would suffer from the same unsafety issue...

@lhecker lhecker force-pushed the dev/lhecker/8000-text-buffer-rework branch from 04dd61f to 1761d60 Compare July 31, 2022 00:30
@github-actions

This comment was marked as outdated.

@zadjii-msft zadjii-msft added this to the Terminal v1.17 milestone Aug 1, 2022
ghost pushed a commit that referenced this pull request Aug 2, 2022
This fixes an issue were overwriting parts of a row would only trigger
that specific portion of the row to be redrawn. This isn't just
problematic for combining characters, but also for things like
the new `TestDbcsBisectWriteCells` test introduced in #13626.

Benchmarks showed no impact on performance whatsoever.

## Validation Steps Performed
* Pick this commit into #13626
* Run the `TestDbcsBisectWriteCells` test and break before OpenConsole exits
* A correct "QいかなZYXWVUTに" output is visible ✅
src/buffer/out/Row.hpp Outdated Show resolved Hide resolved
src/buffer/out/Row.cpp Outdated Show resolved Hide resolved
@j4james
Copy link
Collaborator

j4james commented Aug 9, 2022

I love how removing UnicodeStorage obviates the need for rows to have IDs, which removes the entire machine built around making sure row IDs remain sane

I think it's worth mentioning that I was using the row IDs to track image segments in my sixel implementation. Not suggesting that's a good reason to keep them, since I'm not likely to do anything with that branch any time soon, but I do think there is value in having some mechanism via which additional metadata can be attached a row.

Another potential use case would be the scrollbar marks. I think we're currently tracking them with just a row number, but long term I'm hoping we'll change that, because I suspect the current approach will break if we ever get margins working in the terminal. That's perhaps a bit of an edge case, so not critical, but still.

@lhecker
Copy link
Member Author

lhecker commented Aug 9, 2022

[...] but I do think there is value in having some mechanism via which additional metadata can be attached a row.

Why not store or reference that data in ROW itself?
Can sixels be treated as (immutable) bitmaps? If so I can imagine it might be fairly performant and simple to store a vector of shared_ptr to the sixels/bitmaps on each row that they touch. We should then still be able to infer the exact pixels that need to be rendered on each row given the height stored in the referenced sixel/bitmap.

@j4james
Copy link
Collaborator

j4james commented Aug 9, 2022

Why not store or reference that data in ROW itself?

In the case of sixel, 99% of the time you're not likely to have any data attached to the row - most users will probably never use sixel - so I didn't want to introduce an additional field that would just be wasting space most of the time. And since the ID already existed, that provided an easy way to add sixel functionality without impacting anyone that wasn't using it.

Having some kind of pointer on the row itself would certainly be easier, but it does mean more overhead. And what happens when we have some other bit of metadata that's also rarely used, and also needs to be associated with a row? Do we add yet another field? Maybe some kind of linked list of extension points? I don't know. I just wanted to point out that this use case exists, and row IDs were quite a compact way of addressing that need.

Can sixels be treated as (immutable) bitmaps?

Not really. Sixel isn't so much an image format - it's more of a drawing protocol. So you need to think of it in terms of plotting arbitrary pixels at a point on the screen. The way I managed that was by creating a bitmap associated with each row that was touched by sixel output. Those bitmaps obviously need to be mutable so they can be updated when additional pixels are output.

But any kind of reference in the row would be fine for this - I was just looking for a way to minimize the overhead involved, and potentially share it with other row extensions.

Anyway, this isn't anything that has to be resolved now. I just wanted to the raise the issue as something to consider for the future.

@lhecker
Copy link
Member Author

lhecker commented Aug 9, 2022

Thanks for the feedback! I don't disagree with you and I'd be absolutely open to add back IDs even if for no reason at all. 🙂

But unrelated to this PR and any designs you have in mind (which I'd approve regardless), I'd like to mention an opposing view to the overhead of unused fields, just because...
Right now on x64/ARM64 a ROW is 584 bytes and this PR will shrink it down to just 84 bytes (with the option to shrink it further down to just 48 bytes). Even if we add a dozen pointers or so to ROW we'd still have way less overhead than before. Because we have many more potential areas were we can get drastic improvements like that, I strongly feel like we shouldn't worry at all about memory usage of a few unused fields and rather pick whatever is the easiest for us to write and the CPU to process. 🙂

@AtariDreams
Copy link
Contributor

Any updates on this

Copy link
Member

@carlos-zamora carlos-zamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pacha Perfect Meme from Emperor's New Groove

@lhecker
Copy link
Member Author

lhecker commented Nov 11, 2022

3 approvals

tom and jerry train

@lhecker lhecker merged commit a01500f into main Nov 11, 2022
@lhecker lhecker deleted the dev/lhecker/8000-text-buffer-rework branch November 11, 2022 19:35
Copy link
Contributor

@PankajBhojwani PankajBhojwani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just had a couple of questions!

Comment on lines +274 to +275
pRow->ReplaceCharacters(x, 2, t.string);
x += 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we need to loop through the characters in the string (like the way we need to below when the string isn't wide)? Is it guaranteed that if wide == true then the string is just 1 glyph that's 2-cells wide? I know that the strings we pass in conform to this, but in the context of within this function is that always true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote a different response, but you know what... I just realized, I should just remove the for loop instead and place every character individually. After all, there's no guarantee technically that narrow glyphs are always only 1 character long. There's also only a single caller of applyTestString, so I'm probably just going to inline the ReplaceCharacters calls to make this less confusing.

src/host/ut_host/ScreenBufferTests.cpp Show resolved Hide resolved
ghost pushed a commit that referenced this pull request Nov 12, 2022
## Summary of the Pull Request
Ensures that reading the buffer content actually returns the content.

## References
Regressed in #13626.

## PR Checklist
* [x] Closes #14378
* [x] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [x] Tests added/passed

## Validation Steps Performed
Added a test.
DHowett pushed a commit that referenced this pull request Dec 1, 2022
`TextBuffer` is buggy and allows a `Trailing` `DbcsAttribute` to be written
into the first column. Since other code then blindly assumes that there's a
preceding `Leading` character, we'll get called with a X coordinate of -1.
This issue will be fixed by #13626 and this commit fixes it in the meantime.

Additionally fixes an unimportant crash when the window height is 0px,
because it was annoying during testing and doesn't hurt to be fixed.

## Validation Steps Performed
* Run a stress test that prints random Unicode at random positions
* Resize the window furiously at the same time
* Doesn't crash / fail-fast ✅

(cherry picked from commit bb4711d)
Service-Card-Id: 86353783
Service-Version: 1.16
@zadjii-msft zadjii-msft mentioned this pull request Dec 12, 2022
15 tasks
ghost pushed a commit that referenced this pull request Jan 13, 2023
When the buffer contains wide characters that occupy more than one cell,
and those cells are scrolled horizontally by exactly one column, that
operation can result in the wide characters being completely erased.
This PR attempts to fix that bug, although it's not an ideal long term
solution.

Although not really to blame, it was PR #13626 that exposed this issue.

The root of the problem is that scrolling operations copy cells one by
one, but wide characters are written to the buffer two cells at a time.
So when you move a wide character one position to the left or right, it
can overwrite itself before it's finished copying, and the end result is
the whole character gets erased.

I've attempt to solve this by getting the affected operations to read
two cells in advance before they start writing, so there's no risk of
losing the source data before it's fully output. This may not work in
the long term, with characters wider than two cells, but it should at
least be good enough for now.

I've also changed the `TextBuffer::Write` call to a `WriteLine` call to
improve the handling of a wide character on the end of the line, where
moving it right by one column would place it half off screen. It should
just be dropped, but with the `Write` method, it would end up pushed
onto the following line.

## Validation Steps Performed

I've manually confirmed this fixes all the test cases described in
#14626, and also added some unit tests that replicate those scenarios.

Closes #14626
@ghost
Copy link

ghost commented Jan 24, 2023

🎉Windows Terminal Preview v1.17.1023 has been released which incorporates this pull request.:tada:

Handy links:

DHowett pushed a commit that referenced this pull request Feb 15, 2023
#13626 contains a small "regression" compared to #13321:
It now began to store trailers in the buffer wherever possible to allow
a region
of the buffer to be backed up and restored via Read/WriteConsoleOutput.
But we're unfortunately still ill-equipped to handle anything but UCS-2
via
WriteConsoleOutput, so it's best to again ignore trailers just like in
#13321.

## Validation Steps Performed
* Added unit test ✅
DHowett pushed a commit that referenced this pull request Mar 31, 2023
#13626 contains a small "regression" compared to #13321:
It now began to store trailers in the buffer wherever possible to allow
a region
of the buffer to be backed up and restored via Read/WriteConsoleOutput.
But we're unfortunately still ill-equipped to handle anything but UCS-2
via
WriteConsoleOutput, so it's best to again ignore trailers just like in
#13321.

## Validation Steps Performed
* Added unit test ✅

(cherry picked from commit 9dcdcac)
Service-Card-Id: 88719297
Service-Version: 1.17
DHowett pushed a commit that referenced this pull request Mar 20, 2024
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates
the number of columns that should be read. For single-column (narrow)
surrogate pairs this previously clipped a trailing character of the
returned string. In the major Unicode support update in #13626
surrogate pairs truly got stored as atomic units for the first time.
This now meant that a 120 column read with such codepoints resulted
in 121 characters. Other parts of conhost still assume UCS2 however,
and so this results in the entire read failing.

This fixes the issue by turning surrogate pairs into U+FFFD
which makes it UCS2 compatible.

Closes #16892

## Validation Steps Performed
* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅
DHowett pushed a commit that referenced this pull request Mar 20, 2024
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates
the number of columns that should be read. For single-column (narrow)
surrogate pairs this previously clipped a trailing character of the
returned string. In the major Unicode support update in #13626
surrogate pairs truly got stored as atomic units for the first time.
This now meant that a 120 column read with such codepoints resulted
in 121 characters. Other parts of conhost still assume UCS2 however,
and so this results in the entire read failing.

This fixes the issue by turning surrogate pairs into U+FFFD
which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92121912
Service-Version: 1.20
DHowett pushed a commit that referenced this pull request Mar 20, 2024
The `nLength` parameter of `ReadConsoleOutputCharacterW` indicates
the number of columns that should be read. For single-column (narrow)
surrogate pairs this previously clipped a trailing character of the
returned string. In the major Unicode support update in #13626
surrogate pairs truly got stored as atomic units for the first time.
This now meant that a 120 column read with such codepoints resulted
in 121 characters. Other parts of conhost still assume UCS2 however,
and so this results in the entire read failing.

This fixes the issue by turning surrogate pairs into U+FFFD
which makes it UCS2 compatible.

Closes #16892

* Write U+F15C0 and read it back with `ReadConsoleOutputCharacterW`
* Read succeeds with a single U+FFFD ✅

(cherry picked from commit 373faf0)
Service-Card-Id: 92129542
Service-Version: 1.19
github-merge-queue bot pushed a commit that referenced this pull request Jun 20, 2024
This is simply some unused code from the days before the big
text buffer rewrite in #13626.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Issue-Task It's a feature request, but it doesn't really need a major design. Product-Conhost For issues in the Console codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants