[Disucssion] handling the "middle" truncation option #93

ctron · 2023-11-21T09:02:36Z

Continuing from patternfly-yew/patternfly-yew-quickstart#36 (comment):

Looks good to me, it might be worth adding a comment in the docs saying that the truncate doesn't necessarily truncate num grapheme clusters but rather rounds down to the next unicode codepoint after num bytes.

I'm not sure if adding any extra complexity to the code is worth it though. At best you could maybe trim off num unicode codepoints by using s.char_indices().rev().nth(num) which will still look weird in a lot of cases and anything better than that will require an extra dependency. So a comment would probably be good, just so people aren't surprised that there are only 5 characters visible when they wanted to keep 10.

helio-frota · 2023-11-23T11:40:38Z

and anything better than that will require an extra dependency. So a comment would probably be good,

+1 to add a comment.

Apparently the docs says nothing about this case:
https://www.patternfly.org/components/truncate/design-guidelines/

And they are not using external deps: https://github.com/patternfly/patternfly-react/blob/main/packages/react-core/src/components/Truncate/Truncate.tsx#L46

ctron · 2023-11-23T12:38:36Z

Right, I am not sure how JavaScript handles Unicode and UTF-8, so I am not sure this can be compared. In Rust, you have "characters", but they are more like "technical" characters. Not “user-perceived characters” (aka grapheme clusters). Also see: https://unicode.org/reports/tr29/ and maybe https://stackoverflow.com/questions/58770462/how-to-iterate-over-unicode-grapheme-clusters-in-rust

Now it gets even more complicated, strings are indexed by bytes (not characters). So split_at (https://doc.rust-lang.org/std/primitive.str.html#method.split_at) will panic in case one picks a byte which is actually part of a multi-byte unicode "code point".

The current implementation is pretty naive, and just looks for the next location which can he split, based on the byte index. That's a rather quick operation. Assuming Latin 1, most likely is a hit at the first attempt. Otherwise, it takes just a few tries more.

The downside is that it might be horribly wrong. Having mostly non Latin1 characters, the byte index doesn't work well. Still, it's quick.

Searching for the x-th characters from the end of a string, is a terribly imperforment operation. As one needs to count all "characters" from the beginning of the string. And even then, one would actually need to count all "grapheme clusters" from the start of a string.

Maybe it's just better to drop this use case. Then again the current thing is better than not having it.

ctron · 2023-11-23T12:48:46Z

Just documented the current impl: 1dccf8a

helio-frota · 2023-11-23T13:24:44Z

Thanks for clarification. I was basically comparing patternfly and patternfly-yew.
The extra info also make more sense to me the original comment in the issue now.

I am not sure how JavaScript handles Unicode and UTF-8, so I am not sure this can be compared.

For a quick info from https://exploringjs.com/impatient-js/ch_strings.html#atoms-of-text

* Code points are the atomic parts of Unicode text. Each code point is 21 bits in size.

* JavaScript strings implement Unicode via the encoding format UTF-16. It uses one or two 16-bit code units to encode a single code point.

    * Each JavaScript character (as indexed in strings) is a code unit. In the JavaScript standard library, code units are also called char codes.

* Grapheme clusters (user-perceived characters) represent written symbols, as displayed on screen or paper. One or more code points are needed to encode a single grapheme cluster.

TC39 is working on [Intl.Segmenter](https://github.com/tc39/proposal-intl-segmenter), a proposal for the ECMAScript Internationalization API to support Unicode segmentation (along grapheme cluster boundaries, word boundaries, sentence boundaries, etc.).

Until that proposal becomes a standard, we can use one of several libraries that are available (do a web search for “JavaScript grapheme”).

I found this table where apparently firefox is not supporting yet.

And the 15 million downloads from this library graphemer are also related to the dependents
( So I'm not sure if end users are using this directly or the std API Intl.Segmenter etc...)

Also see: https://unicode.org/reports/tr29/ and maybe https://stackoverflow.com/questions/58770462/how-to-iterate-over-unicode-grapheme-clusters-in-rust

Thanks for the links, in a comment a person shared that rust was supporting it on the stdlib, in the past.

so yeah now I got the naive and the might be horribly wrong parts, thanks for the clarification ...

in this case, the comment to be added should be "Warning: this might be horribly wrong !" 😄
or to use external dependency like @aDogCalledSpot mentioned in the beginning.

Maybe it's just better to drop this use case. Then again the current thing is better than not having it.

yeah I agree, a product/project Foo using patternfly-yew will certainly open an issue in the future, in case needed so 🤷‍♂️ 👍

aDogCalledSpot · 2023-11-23T13:42:34Z

I think the comment is sufficient. It would also be worth adding a link to this issue so that the discussion is easily accessible.

I also don't think the implementation needs fixing. If someone is using graphemes that need a lot of bytes, then you will always have a few less of those at the end compared to if everything was ASCII, i.e. visually "more" is truncated - truncating less might be problematic but that will never happen. The behavior is documented now so it is easy to investigate why this is happening if someone stumbles upon it. In the (unlikely) event that someone urgently needs the amount of grapheme clusters to be correct at the end of the string, you could maybe think about opting in to a feature.

ctron · 2023-11-23T13:47:47Z

In the (unlikely) event that someone urgently needs the amount of grapheme clusters to be correct at the end of the string, you could maybe think about opting in to a feature.

Aside from PRs welcome … one can still manually create the TruncateContent::Middle enum variant. And bring your own logic.

aDogCalledSpot · 2023-11-23T13:56:31Z

Also, amount of characters is a very inaccurate way of measuring how much text you want to keep. 10 "i"s is a lot shorter than 10 "m"s and graphemes could be much much wider again.

ctron · 2023-11-23T14:05:50Z

That's why I am just using the ExpandableSection component with the Truncate variant. It uses "x lines" and allows for using HTML content.

ctron added help wanted Extra attention is needed question Further information is requested labels Nov 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Disucssion] handling the "middle" truncation option #93

[Disucssion] handling the "middle" truncation option #93

ctron commented Nov 21, 2023

helio-frota commented Nov 23, 2023

ctron commented Nov 23, 2023

ctron commented Nov 23, 2023

helio-frota commented Nov 23, 2023 •

edited

Loading

aDogCalledSpot commented Nov 23, 2023

ctron commented Nov 23, 2023

aDogCalledSpot commented Nov 23, 2023

ctron commented Nov 23, 2023

[Disucssion] handling the "middle" truncation option #93

[Disucssion] handling the "middle" truncation option #93

Comments

ctron commented Nov 21, 2023

helio-frota commented Nov 23, 2023

ctron commented Nov 23, 2023

ctron commented Nov 23, 2023

helio-frota commented Nov 23, 2023 • edited Loading

aDogCalledSpot commented Nov 23, 2023

ctron commented Nov 23, 2023

aDogCalledSpot commented Nov 23, 2023

ctron commented Nov 23, 2023

helio-frota commented Nov 23, 2023 •

edited

Loading