Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Disucssion] handling the "middle" truncation option #93

Open
ctron opened this issue Nov 21, 2023 · 8 comments
Open

[Disucssion] handling the "middle" truncation option #93

ctron opened this issue Nov 21, 2023 · 8 comments
Labels
help wanted Extra attention is needed question Further information is requested

Comments

@ctron
Copy link
Member

ctron commented Nov 21, 2023

Continuing from patternfly-yew/patternfly-yew-quickstart#36 (comment):

Looks good to me, it might be worth adding a comment in the docs saying that the truncate doesn't necessarily truncate num grapheme clusters but rather rounds down to the next unicode codepoint after num bytes.

I'm not sure if adding any extra complexity to the code is worth it though. At best you could maybe trim off num unicode codepoints by using s.char_indices().rev().nth(num) which will still look weird in a lot of cases and anything better than that will require an extra dependency. So a comment would probably be good, just so people aren't surprised that there are only 5 characters visible when they wanted to keep 10.

@ctron ctron added help wanted Extra attention is needed question Further information is requested labels Nov 21, 2023
@helio-frota
Copy link

and anything better than that will require an extra dependency. So a comment would probably be good,

+1 to add a comment.

Apparently the docs says nothing about this case:
https://www.patternfly.org/components/truncate/design-guidelines/

And they are not using external deps: https://github.com/patternfly/patternfly-react/blob/main/packages/react-core/src/components/Truncate/Truncate.tsx#L46

@ctron
Copy link
Member Author

ctron commented Nov 23, 2023

Right, I am not sure how JavaScript handles Unicode and UTF-8, so I am not sure this can be compared. In Rust, you have "characters", but they are more like "technical" characters. Not “user-perceived characters” (aka grapheme clusters). Also see: https://unicode.org/reports/tr29/ and maybe https://stackoverflow.com/questions/58770462/how-to-iterate-over-unicode-grapheme-clusters-in-rust

Now it gets even more complicated, strings are indexed by bytes (not characters). So split_at (https://doc.rust-lang.org/std/primitive.str.html#method.split_at) will panic in case one picks a byte which is actually part of a multi-byte unicode "code point".

The current implementation is pretty naive, and just looks for the next location which can he split, based on the byte index. That's a rather quick operation. Assuming Latin 1, most likely is a hit at the first attempt. Otherwise, it takes just a few tries more.

The downside is that it might be horribly wrong. Having mostly non Latin1 characters, the byte index doesn't work well. Still, it's quick.

Searching for the x-th characters from the end of a string, is a terribly imperforment operation. As one needs to count all "characters" from the beginning of the string. And even then, one would actually need to count all "grapheme clusters" from the start of a string.

Maybe it's just better to drop this use case. Then again the current thing is better than not having it.

@ctron
Copy link
Member Author

ctron commented Nov 23, 2023

Just documented the current impl: 1dccf8a

@helio-frota
Copy link

helio-frota commented Nov 23, 2023

Thanks for clarification. I was basically comparing patternfly and patternfly-yew.
The extra info also make more sense to me the original comment in the issue now.

I am not sure how JavaScript handles Unicode and UTF-8, so I am not sure this can be compared.

For a quick info from https://exploringjs.com/impatient-js/ch_strings.html#atoms-of-text

* Code points are the atomic parts of Unicode text. Each code point is 21 bits in size.

* JavaScript strings implement Unicode via the encoding format UTF-16. It uses one or two 16-bit code units to encode a single code point.

    * Each JavaScript character (as indexed in strings) is a code unit. In the JavaScript standard library, code units are also called char codes.

* Grapheme clusters (user-perceived characters) represent written symbols, as displayed on screen or paper. One or more code points are needed to encode a single grapheme cluster.
TC39 is working on [Intl.Segmenter](https://github.com/tc39/proposal-intl-segmenter), a proposal for the ECMAScript Internationalization API to support Unicode segmentation (along grapheme cluster boundaries, word boundaries, sentence boundaries, etc.).

Until that proposal becomes a standard, we can use one of several libraries that are available (do a web search for “JavaScript grapheme”).

I found this table where apparently firefox is not supporting yet.

And the 15 million downloads from this library graphemer are also related to the dependents
( So I'm not sure if end users are using this directly or the std API Intl.Segmenter etc...)

Also see: https://unicode.org/reports/tr29/ and maybe https://stackoverflow.com/questions/58770462/how-to-iterate-over-unicode-grapheme-clusters-in-rust

Thanks for the links, in a comment a person shared that rust was supporting it on the stdlib, in the past.

so yeah now I got the naive and the might be horribly wrong parts, thanks for the clarification ...

in this case, the comment to be added should be "Warning: this might be horribly wrong !" 😄
or to use external dependency like @aDogCalledSpot mentioned in the beginning.

Maybe it's just better to drop this use case. Then again the current thing is better than not having it.

yeah I agree, a product/project Foo using patternfly-yew will certainly open an issue in the future, in case needed so 🤷‍♂️ 👍

@aDogCalledSpot
Copy link
Contributor

I think the comment is sufficient. It would also be worth adding a link to this issue so that the discussion is easily accessible.

I also don't think the implementation needs fixing. If someone is using graphemes that need a lot of bytes, then you will always have a few less of those at the end compared to if everything was ASCII, i.e. visually "more" is truncated - truncating less might be problematic but that will never happen. The behavior is documented now so it is easy to investigate why this is happening if someone stumbles upon it. In the (unlikely) event that someone urgently needs the amount of grapheme clusters to be correct at the end of the string, you could maybe think about opting in to a feature.

@ctron
Copy link
Member Author

ctron commented Nov 23, 2023

In the (unlikely) event that someone urgently needs the amount of grapheme clusters to be correct at the end of the string, you could maybe think about opting in to a feature.

Aside from PRs welcome … one can still manually create the TruncateContent::Middle enum variant. And bring your own logic.

@aDogCalledSpot
Copy link
Contributor

Also, amount of characters is a very inaccurate way of measuring how much text you want to keep. 10 "i"s is a lot shorter than 10 "m"s and graphemes could be much much wider again.

@ctron
Copy link
Member Author

ctron commented Nov 23, 2023

That's why I am just using the ExpandableSection component with the Truncate variant. It uses "x lines" and allows for using HTML content.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants