Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UNSAFE_TO_CONCAT [Was Clarify UNSAFE_TO_BREAK and expose OS2V2Tail::usMaxContext as context.] #1463

Open
bungeman opened this issue Dec 7, 2018 · 23 comments · May be fixed by #3297
Open

UNSAFE_TO_CONCAT [Was Clarify UNSAFE_TO_BREAK and expose OS2V2Tail::usMaxContext as context.] #1463

bungeman opened this issue Dec 7, 2018 · 23 comments · May be fixed by #3297
Labels

Comments

@bungeman
Copy link
Contributor

@bungeman bungeman commented Dec 7, 2018

The current documentation clearly states what UNSAFE_TO_BREAK (or it's absence) can be interpreted to mean. However, it should also be documented as to what it is not, since it's very easy to be lured into a false sense of security with it.

The current behavior (as I understand it) is that one can shape a sequence of code points into glyphs, pick any two safe to break locations in the resulting buffer, reshape the associated sub-sequence of code points in isolation and get the same buffer output as is seen between the original two safe to break locations. Essentially, any pair of safe to break locations is stating the the larger context of the original sequence of code points did not affect the shaping of that run. (This is actually a bit stronger of a statement than the current documentation implies, please correct if this is not true.)

However, it is important to note that the inverse is not true, it is not the case that such a run of code points will always shape this way; any modification to the original sequence of code points can arbitrarily modify the overall shaping. For instance I can make a font which creates ligatures of all the sentences in the works of Shakespeare such that "There is a written scroll" will come back with many safe to break locations, but "There is a written scroll!" will shape into a single glyph which visually represents the icon for the Prince of Morocco. In other words, adding to or modifying any of the original sequence of code points invalidates everything about the shaping before doing so (including any safe to breaks).

In order to avoid requiring complete re-shaping when editing (especially simply adding to the end of a paragraph), it is necessary to let the user know the maximum affected area. For OpenType this is the purpose of OS2V2Tail::usMaxContext, in theory even TrueType knows the answer to this question for at least appending (keep track of the last ground state). With this sort of information it should be possible for a sophisticated user to only re-shape a portion of the overall sequence of code points by dirtying the maximum context area and finding the safe to break locations surrounding that area. The pathological Shakespeare font I propose above would obviously have quite a large value for the maximum context, but for most fonts this value is three or less.

In between these two cases (breaking at safe to break locations and mutating the sequence of code points) there is the case of determining the extent of re-shaping if truncating the original sequence of code points. Is it enough to find some safe to break location before the truncation and shape only the remaining portion? I cannot construct an example to answer in the negative, but it is not obvious that this is always allowable either. I think this is equivalent to asking if any shaping operation can operate by negative look ahead in a way which doesn't affect safe to break. With the maximum context information it seems this could be at least bounded, but it would be nice to know if this common case can be handled even more efficiently (may be common when line breaking).

With all of that as background, the two actual requests here are:

Can general and/or end of shaping maximum context information be exposed to reduce re-shaping when editing?

Can anything be assumed about code point sequence truncation and safe to break information?

@khaledhosny
Copy link
Collaborator

@khaledhosny khaledhosny commented Dec 7, 2018

I’m not sure usMaxContext can e trusted, HarfBuzz does not use it and I don’t know any other open source library or application that makes use of it (the only application that I know that makes use of such information, LibreOffice, actually uses a hard-coded value for max context https://github.com/LibreOffice/core/blob/148beecad4ead6378e3c92c0b48d37ad2f0ecfc0/sw/source/core/text/itrform2.cxx#L2702-L2705). As such I wouldn’t trust fonts to be setting this value correctly, since any bugs in it would go unnoticed.

Loading

@bungeman
Copy link
Contributor Author

@bungeman bungeman commented Dec 8, 2018

Hmmm... yes, I see less than enthusiastic comments about its value, for example fonttools/fonttools#466 (comment). Of course, there's always the slow way of just calculating it, which may actually be fine for an editor where there aren't actually that many fonts in play but layout is going to be done many times.

Without some sort of value like this correct editing (and sometimes line breaking) is more like n^2 on the length. I happen to know the guy who wrote the first code for detecting ground states in the TrueType state machine shaping to determine how far back to re-shape when appending. Just picking a number like LibreOffice seems to do seems as bogus as using a semi bogus number from the font. Also, until UNSAFE_TO_BREAK was introduced in HarfBuzz there wasn't any reason to know about this value since it was necessary to re-shape the entire run anyway for correctness. This would explain why no open source project was ever interested in this value.

I am interested if there is a example where truncation of the of the initial code point sequence requires full re-shaping. It's super tempting on truncation to simply find the previous safe to break position and just re-shape from there to the truncation point. If this is always possible it would be great to state that, if not a counter example should be stated in the docs to scare implementors away from doing it.

Loading

@behdad
Copy link
Member

@behdad behdad commented Dec 8, 2018

Thanks Ben. You make the use-csae very clear.

  1. The current usMaxContext cannot be relied on. It's easy/fast enough to ask HarfBuzz to compute it. We need to implement that at some point anyway.

  2. usMaxContext cannot reliably be used however, given that lookups can skip marks, and HarfBuzz also skips default-ignorable-characters. So, to reliably use usMaxContext one also needs to duplicate that logic.

However, all of the above is irrelevant. You are assuming that appending a char to a buffer can affect at most usMaxContext chars before it, and as such one can find any unsafe-to-break point before that range and reshape from there forward?

There's at least one very easy reason why that's completely unsafe: ReverseChainContextSubst subtables. With those, eg. you can have a sequence of 'a's shape to themselves. But as soon as you append a 'b', all the 'a's before it also change to 'b'. The usMaxContext for that substitution is short (1 or 2), but the effects propoagate backwards.

'morx' table can also do this easily as morx subchains can process input in reverse.

Even without ReverseChainContextSubst subtables, the usMaxContext needs to be multiplied by the number of lookups, since having 100 lookups each of which can look forward 5 glyphs, when combined, can look forward 500 glyphs and substitute....

In summary, I suggest you don't try to get this information out of shaping / font. IMO the best way to handle this is to reshape from last line break.

Loading

@drott
Copy link
Collaborator

@drott drott commented Dec 10, 2018

Loading

@bungeman
Copy link
Contributor Author

@bungeman bungeman commented Dec 10, 2018

This is great information. I suppose that means https://cs.chromium.org/chromium/src/third_party/blink/renderer/core/layout/ng/inline/ng_line_breaker.cc?l=525&rcl=664f94acf2faf985bd33be531b5a9a0f8fcaff46 may need to be reconsidered?

So, restating as I understand it, safe to break is good only for partitioning an unchanging sequence of code points at safe to break points. Changing the sequence of code points in any way will require re-shaping any shaping runs which the change affects (often an entire paragraph). Attempting to partition at an unsafe to break point (e.g. while line breaking) will require reshaping from the beginning of the sequence of code points (which is often the beginning of the current partition / line). In theory a real maximum context could be computed from the font, but this number may be very large and generally not useful. At any given point a tighter maximum local context could be computed, but that may be more expensive that just re-shaping (and a lot of code).

Loading

@behdad
Copy link
Member

@behdad behdad commented Dec 10, 2018

I think the best way forward is for HarfBuzz to expose a new flag bit that does exactly what you want. Since most fonts don't have back-to-front lookups, this will be effective for most fonts. If font has ay back-to-front lookups, we'll mark the entire run unsafe. Let me think about exactly how to implement that efficiently.

Need a name for the new flag.

Loading

@behdad
Copy link
Member

@behdad behdad commented Dec 11, 2018

I think I know how to implement this. Next step is to prototype it. @kojiishi / @bungeman would you be able to test it in LayoutNG?

Loading

@kojiishi
Copy link
Contributor

@kojiishi kojiishi commented Dec 12, 2018

Like this example site, reported at crbug.com/479370?

I think I was under the assumption that UNSAFE_TO_BREAK includes these cases. If not, and if we're adding a new flag, it's easy for LayoutNG to switch to it. We could optimize further by distinguishing two cases as @bungeman pointed out, but I think I'd like to do it incrementally, when we know better if it's causing the perf problems. /cc @eaenet in case he has different opinions.

Loading

@behdad
Copy link
Member

@behdad behdad commented Dec 12, 2018

In order to avoid requiring complete re-shaping when editing (especially simply adding to the end of a paragraph) ...

While in theory true, I doubt that the implementation will use the flag I'm going to add properly to avoid reshaping when text changes. That requires a lot of change tracking....

Loading

@kojiishi
Copy link
Contributor

@kojiishi kojiishi commented Dec 13, 2018

Currently LayoutNG tracks changes in nodes, but not part of text nodes. In Phase 1, we will not ship editing part though, we switch to the current engine for editable elements. The flag may help fixing cases such as:

<div id=e>a</div>
e.appendChild(document.createTextNode(' b'));

though I'm not sure how common the problem is, and Blink will need some additional work in addition to using the new flag.

We will start working on editing after phase 1. At that point, we will know better how much text-offset-level change tracking is needed. /cc @yosinch

Loading

@bungeman
Copy link
Contributor Author

@bungeman bungeman commented Dec 14, 2018

As a potential counter-example without explicit backward running features, consider that I construct a font with a simple 1:1 mapping of characters to glyphs except for two features, one of which (say 'rlig') which has a lookup 1 which maps "ived" with a special glyph and another (say 'liga') which has lookup 2 which maps "tri" to a different special glyph and shape the string "contrived" with this font. (The same effect might happen with one feature with two lookups.)

As I understand it, the feature lookups run in the order of the lookup list (after determining which features to run in a given pass). As a result, "contrived" should map to [c o n t r ived] with 'ived' marked as safe to break. However, should there be a possible line breaking point found between "e" and "d" such that we should be shaping "contrive" we should end up with [c o n tri v e]. However, this would not be the result if going back to just before 'ived' and just re-shaping the "ive" into [i v e].

It seems like this is a general issue any time a prefix of an early lookup is a suffix of a later lookup. If this is a valid example, it seems that it would be more difficult to take this sort of behavior into account than simply looking to ensure no backward looking features are in the font. I suppose most fonts don't have this property, though now I think I'm going to create this one just to see the effect.

Loading

@kojiishi
Copy link
Contributor

@kojiishi kojiishi commented Dec 14, 2018

IIUC we all agree you're correct, and appreciate to bring this problem into our focus. It's even easier to hit than your example. Consider, user typed "f", shape, then type "i". If apps is smart enough to know "f" is unchanged, because after "f" is safe-to-break, the app will not reshape "f" and fail to get the ligature.

Behdad's point is that apps is not that smart today, and he's right on that regard. Blink reshapes all the text as long as they're in a same node. The Blink code you quoted above is only for removing trailing spaces of line end, so it should be ok.

Blink may want the smartness when we work on editing in post phase 1 if the current per-node change tracking hits the performance bad enough. If Blink switch to per-character change tracking, this case will be much more common.

Loading

@behdad
Copy link
Member

@behdad behdad commented Dec 18, 2018

Okay. I thought more about this. There are some things that I misunderstood before about our current unsafe-to-break flag. I like to correct that understanding, and together we can design a new flag for stronger guarantees.

The way I previously suggested the unsafe-to-break can be used was this: full paragraph shaped, then line-break position determined, and if line-break position is unsafe-to-break, then walk back from it to the first safe-to-break, and shape the tiny resulting piece and append it to the shaped results for the chunk of line before it. This, unfortunately doesn't work.

What safe-to-break guarantees, and only thing it guarantees, is that if you break at this point, and shape the two sides separately and append the results, it would be same as shaping the whole thing. But if you break here and then change one of the pieces, eg by truncating it, then there's no guarantee that shaping can be done in pieces anymore. Ie. if "abcd" as text results in glyphs "AB|CD" where "|" means safe-to-break, then it must be the case that "ab" shapes to "AB" and "cd" shapes to "CD". Let's imagine "c" also shapes to "C", but there's no guarantee from these that "abc" shapes to "ABC". It might shape to "XYZ".

Back to line-breaking. The optimization can be changed to this: if chosen line-break is safe-to-break, then good, just break the glyphstring and move on. Otherwise, the line needs to be reshaped. In a separate commit I'll talk about how to design a flag that can give stronger guarantees.

Loading

@behdad
Copy link
Member

@behdad behdad commented Dec 18, 2018

It seems like this is a general issue any time a prefix of an early lookup is a suffix of a later lookup. If this is a valid example, it seems that it would be more difficult to take this sort of behavior into account than simply looking to ensure no backward looking features are in the font. I suppose most fonts don't have this property, though now I think I'm going to create this one just to see the effect.

Oh I didn't mean that any font that doesn't have backward features is safe... I was designing a new flag that can be implemented to discover such break positions. I'll expand on it in more detail separately.

Loading

@kojiishi
Copy link
Contributor

@kojiishi kojiishi commented Dec 18, 2018

if "abcd" as text results in glyphs "AB|CD"..."abc"...might shape to "XYZ".

Novice question; doesn't OpenType match the longest prefix? I mean, if "ab" shapes to "AB" and "abc" shapes to "XYZ", shouldn't "abcd" shape to "XYZd" rather than "ABCD" in the first place?

Loading

@behdad
Copy link
Member

@behdad behdad commented Dec 18, 2018

Fonts can have complicated context-sensitive rules. In this case, can have rules such that abc shapes to XYZ only if not followed by another letter, or not followed by d.

Loading

@bungeman
Copy link
Contributor Author

@bungeman bungeman commented Feb 2, 2019

I now have a simple line breaker implementation (still working on it) at https://skia-review.googlesource.com/c/skia/+/156641/6/modules/skshaper/src/SkShaper_harfbuzz.cpp#715 which illustrates the issue.

First, any time an unsafe to break location is considered it is necessary to reshape from the beginning of the text under consideration (essentially this issue as described best above at #1463 (comment)).

Second the only way to know to stop looking at line breaking opportunities (instead of looking at all of them) is to stop looking at safe to break opportunities after the first safe to break opportunity which is too long (later unsafe opportunities would still need to be looked at). Note that it is possible to use the heuristic of "just stop looking after the first break opportunity which is too long" (I think this is effectively what layoutNG currently does), but this gives up the nice trait of always fitting the most characters onto a single line. If it were known ahead of time that all breaking opportunities were either safe to break or the kind where you can just go back to the previous safe to break and re-shape the little bit then we know all breaking opportunities are monotonic and the heuristic is correct. Otherwise... I suppose some heuristic would probably still need to be used in practice since the unoptimized 'correct' algorithm one would need to fall back to is so slow.

In any event, any information which can be used to speed up these cases is of interest, and I now have some example code around to tweak to test out proposed changes (which, depending on the change, may be easier to reason about than all of layoutNG). Note that both of these cases are seen most easily in practice with Times New Roman which likes to kern [space 'T]' (I forget if it's actually [period space 'T']).

Loading

@behdad
Copy link
Member

@behdad behdad commented Mar 26, 2019

I thought more about this. It is possible to add a flag to HB such that after, say, shaping "ab" and getting "AB" as results, it will signal that no matter what you add after "ab", the shape for it will not change and stay "AB". However, this itself is not enough either. Because if you want to shape "abc", you still have to shape the whole thing to get the part after "AB". You cannot assume it's the same as shaping "c" alone.

That said, if we expose the same kind of flag on both side of the string, then we can tell whether concatenating is safe. So, that one would be UNSAFE_TO_CONCAT unlike UNSAFE_TO_BREAK.

If we do have the above flag exposed in the middle of the shaping result as well, then one can shape a paragraph, then look for fragments between UNSAFE_TO_CONCAT flags and cache the shaping result for those (a trie fits more naturally, but since these are short, quite possibly a hash table works best in practice) and "shape" new text by concating fragments as found.

The above is certainly possible, though I'm not sure whether benefits are justified. I'd rather see actual numbers before we engineer too many things. In particular, this UNSAFE_TO_CONCAT flag would be more expensive than UNSAFE_TO_BREAK to compute.

Loading

@behdad behdad changed the title Clarify UNSAFE_TO_BREAK and expose OS2V2Tail::usMaxContext as context. UNSAFE_TO_CONCAT [Was Clarify UNSAFE_TO_BREAK and expose OS2V2Tail::usMaxContext as context.] Mar 26, 2019
@behdad
Copy link
Member

@behdad behdad commented May 6, 2019

Or should that be UNSAFE_TO_UNBREAK? :D

Loading

@SergeyMalkin
Copy link

@SergeyMalkin SergeyMalkin commented Jun 25, 2019

We discussed this with Behdad couple of weeks ago and he asked me to explain how we solved this problem in DWrite. We actually spent 3 years in on/off discussions with multiple restarts to figure out final solution, so it was very interesting to me to go back and revisit whole thing, rather than just give you our results. So this text turned out to be quite long, but it shows what effort went into defining this simple bit. I hope you enjoy reading it as much I was when writing it :).

Logic described here is implemented in current version of DWrite, starting from Windows Redstone 5 (Autumn 2018) update and already used by Word and line layout in Edge proper. They don't use these bits to full extent yet, but it seems like logic is flexible enough to satisfy all our goals.

Here are our with high-level goals, as we identified them:

  • Minimize shaping for different types of clients: clients that shape whole paragraph (including Knuth-like optimal paragraph), doing layout line-by-line, or doing line-by-line with ability to pre-shaping whole paragraph. Ideally, single shaping call per paragraph will be sufficient.
  • Minimize reshaping on incremental update, e.g. typing at the end of paragraph. Nice to have: logic works when content changes in the middle of paragraph.
  • For line-by-line layout engines, better detection of insufficient text fetching when a bit longer input can dramatically effect shaping results (e.g. long Arabic ligatures).
  • Work efficiently for most scripts and fonts:
    • "common" Latin fonts, with kerning and ligatures enabled.
    • Japanese text with or without Kana tome marks.
    • Arabic text, including long ligatures.
    • Other complex scripts, including ones without spaces.
  • Work efficiently for more advanced scenarios, e.g. hyphenation.
  • Implementation should ideally have minimal impact on overall code flow of shaping library.

And this is principles and assumptions, we used:

  • We are not looking for ways to pre-analyze a font and calculate some properties. Although, there may be ways to make logic work faster with some kind of preprocessing.
  • Results are always in context of particular shaping call. We are not making any assumptions about changes to shaping that may happen when text is added or changed from initial call.
  • In most cases context is local, within one word or even couple of letters. Longer contexts can make layout slower, but core logic and layout algorithms based on it should work for such cases.
  • Line breaking in general case requires shaping between every two potential line breaks. However, this doesn’t happen except in artificial scenarios. Unless some glyph far away affects text width in some significant way (e.g. forms narrow ligature merging thousand glyphs), such glyph will be beyond line break anyway and will not affect our line after all. But we need some help from API to identify potential longer contexts, e.g. long Arabic ligatures.
  • Shaping few characters is much faster than reshaping whole line.
  • There are no strict requirements where bits should be set. Results may differ between implementations or versions, e.g. may be more or less granular. But this doesn't change client code, except performance may be affected depending on particular situation.
  • Responsibility to calculate bits can be split between shaping engine and lower-level OpenType layout. But client should be script-agnostic, it doesn't matter where bits actually come from.

And this is how bit is defined as part of per-character DWRITE_SHAPING_TEXT_PROPERTIES structure in dwrite.h, but accompanying comment does not list full list of guarantees provided by DWrite:

struct DWRITE_SHAPING_TEXT_PROPERTIES
{
    ...
    /// <summary>
    /// Glyph shaping can be cut after this point without affecting
    /// shaping before or after it. Otherwise, splitting a call to
    /// GetGlyphs would cause a reflow of glyph advances and shapes.
    /// </summary>
    UINT16 canBreakShapingAfter : 1;
    ...
};

So this is full description and it will use following definitions:

  • [A] means shaping results of input string A. Shaping results include glyph ids, widths, offsets, character properties (including new bit), glyph properties, cluster mapping.
  • [A|B] is results of shaping A and B merged, where bit set on the last character of A, i.e. between A and B. For [A|B|C], bit is set between A and B and between B and C.
  • Part of a string between two bits we will call cluster (may be confusing because of grapheme clusters, but this is what we used internally).

Shaping library guarantees that:

  1. if x=[A|B] => x = [A] + [B]. Basically, we can do scissor-cut between clusters, and shaping results for strings before and after it will not change.
  2. if x=[A|B|C] => x= [A|B] + [C] and x=[A]+[B|C]. Can be derived from (1), but important, so worth mentioning separately. This means scissor cut doesn’t affect bits in both parts and we continue cutting remaining part.
  3. From (1) and (2) it follows that x=[A|B|C] => x=[A]+[B]+[C]. Version of scissor-cut guarantee for multiple parts. If all line breaking opportunities fall on characters with bits set, you can simply scissor-cut whole paragraph without any more shaping.
  4. If x=[A|B|C] and y=[B|D] => [A|B|D] (which in turn is equal to [A]+[B]+[D]). This means than if tail of a string changed, we can shape only small part of a line before changed text and verify that this prefix didn't change.
  5. Same works in opposite direction: [A|B|C] and [D|B] => [D|B|C].

In essence, we say that simple bit gives you only some information to do scissor-cut, but not much beyond that. Our bits provide different approach. Cluster serves as a separator for independent pieces: when unchanged between calls, any text on either side doesn't affect shaping on another side.

Important thing to note is that when Layout engine uses this bit, it is flexible in terms of how much prefix before changed text it will use to reshape. Goal is to balance amount of reshaped text with probability of finding unchanged cluster. Piece to reshape could be from single cluster (two works better in real life) to whole preceding word, to whole line (not practical). This decision doesn't affect rest of the logic at all, only probability of successful reuse. If any cluster in prefix doesn't change, we can trust all clusters on the line before (or after) it didn't change. If there are no fixed clusters in the string we tried, we can extent context or simply back to shaping this line as a whole.

Examples of how this applies to various fonts and scripts:

  • Latin and "standard" font: bits are set between any characters that don't form a ligature and there is no kerning between them.
  • Assuming no kerning, bit can be set on 'f' if it is not followed by 'i'.
  • If some glyph kerns with a space, it clears bit inside this particular combination only. Space + any other glyph will get bit set between them. We assume such kerning pairs are rare.
  • Kana characters will have bits set between all of them, except when followed by tone mark. But there is no break opportunity between them, so scissor cut is still possible.
  • Really complex fonts that have contexts across words will simply not set many bits and we will effectively fall back to full reshaping.
  • For most scripts, shaping engines clear bits inside grapheme clusters. Line breaks are not allowed there anyway, so this is mostly for consistency.
  • Indic scripts should not have bits set inside syllables but can have bits on syllable boundaries. More bits can be further cleared by GPOS lookups that work across syllables. Since there are no breaking opportunities inside syllables, we can do scissor-cut.
  • For Arabic we can clear bits inside words and then font that has lookups crossing word boundaries may clear some of them, but this is rare case for regular fonts.
  • In case when long Arabic ligature was only partially fetched (for per-line layout engines), bits will not be set for all glyphs in this partial match. This indicates necessity for fetching additional characters (although, this doesn't guarantee that ligature will actually form).

Implementation by shaping library:

  • Bits are initially set on all characters.
  • Each shaping engine can clear some of them, logic is specific to script. Most of them will clear bits inside grapheme clusters, Indic inside syllalbles, Arabic inside whole word. OpenType layout engine may clear more bits after that, depending on input and font lookups.
  • Lookup processing engine does whatever it did before, iterating through lookups and glyphs. On each iteration (lookup + starting glyph position), we track leftmost and rightmost glyph that had been matched by lookup. Every time you consider some glyph from input or context and it matched, range of matched glyphs is expanded.
  • For the purpose of this algorithm, glyphs that are filtered out due to lookup flags are considered matched: every skipped glyph will expand matched range.
  • At some point either full context of lookup is matched, or some glyph doesn't match, or input ends at place where lookup expects one more glyph. It doesn't matter what happened, in all these cases we clear bits on all glyphs between leftmost and rightmost matched.
  • There is no need to clear bit before unmatched glyph. We can keep it because not matching is equivalent to having nothing there, i.e. when input is cut right before this position (or after, if we were going through backtracking context). Note that if only single glyph matched, no bits will be cleared.
  • That it doesn't matter in which order code checks glyphs for particular lookup. Engine can check backtrack context first or input glyphs, it only matters that order is consistent no matter what input glyphs are. This way we know that cutting input will not introduce different behavior. For this reason, we had to one simple optimization: if input is already shorter than ligature or context we are checking against, there is no need to match glyphs one by one and quickly bail. Such optimization breaks rule of consistent matching order and will not set bits correctly.

And finally, this is how line layout engine may use new bits:

  • Ideal client will shape full paragraph at once and consider breaking opportunities. If potential break point falls on a bit, we can simply break here and use results without reshaping.
  • If break doesn’t fall exactly on a bit, take few characters (clusters) before it and shape. In most cases, there will be unchanged cluster and we can reuse shaping results before it. Same happens at the beginning of next line: only few characters need to be reshaped and we can go on.
  • Decision to reshape is local to particular line break, shaping restores after fixed cluster. Breaking next line can be simple scissor-cut if bit is set in right place.
  • Pretty much same logic is used for hyphenation. Engine only need to shape few characters before plus hyphen and look for unchanged cluster.
  • Same for editing, you can reshape only few characters before changed/typed text.
  • Layout engine should be able to minimize number of lines that will be invalidated as a result of some change inside paragraph. Particular line break depends on text on lines before or after it and in general case we don’t know how far dependency goes. But using bits we can narrow dependency to cluster right before/after this line. If this cluster doesn’t change after text change, line doesn’t need to be invalidated.
  • There are opportunities for better caching and faster layout. Engine can pre-shape whole paragraph and remember bit positions plus widths of clusters. Since this process doesn't depend on actual column width, this can be done on parallel thread along with other text analysis.

Loading

@drott
Copy link
Collaborator

@drott drott commented Jun 26, 2019

Excellent, @SergeyMalkin - thank you so much for taking the time to explaing your findings and design.

Loading

@behdad
Copy link
Member

@behdad behdad commented Sep 7, 2020

Thanks again @SergeyMalkin for writing that down. I read it again now, and I am certain that what you are describing matches my proposed UNSAFE_TO_CONCAT bit. I'll go ahead and implement that.

Loading

behdad added a commit that referenced this issue Nov 16, 2021
@behdad behdad linked a pull request that will close this issue Nov 16, 2021
@behdad
Copy link
Member

@behdad behdad commented Nov 16, 2021

I've given this a first stab finally, in #3297. I believe that matches what @SergeyMalkin describes.

Loading

behdad added a commit that referenced this issue Nov 17, 2021
behdad added a commit that referenced this issue Dec 5, 2021
behdad added a commit that referenced this issue Dec 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

7 participants