Add Unicode text support #2

Closed
nodefourtytwo opened this issue Feb 19, 2019 · 6 comments

@nodefourtytwo

Using Linux, I tried in different terminal emulators and all the accented characters I tested (éèàçôîû) show up wrong.

@octobanana
Owner

At the moment, it can only read ASCII text files. To work around the limitation, you could pipe your text through iconv with cat <file> | iconv -f utf-8 -t ascii//translit, although this will strip the accents from the characters.

To support those characters in the future, I'll need to find or write a Unicode string class, along with a regex library that supports Unicode.

@nodefourtytwo
Author

@octobanana
Owner

Thanks, I'll have a look at it.

@octobanana
Owner

Here's a quick update on my progress so far.

I wasn't too familiar with how Unicode worked, but after doing some reading and research the past few days, I think I have a good, general understanding of it now.

To properly support displaying, aligning, and correctly highlighting Unicode text, it appears the program needs a way to iterate over the user-perceived characters, known as grapheme clusters.

I played around with the above linked library, tinyutf8, which allowed me to iterate over a string and get its size in Unicode code points. It doesn't seem to be able to iterate over grapheme clusters at the moment. Since a single grapheme cluster can be made up of multiple Unicode code points, it doesn't seem like the appropriate solution for what's needed.
I also seem to have encountered a bug in the library. The lookup table appears to become corrupted in certain reproducible cases, causing the substr function to return an incorrect value for a multibyte sequence. I'll open an issue over there after I write a minimal test case that demonstrates the problem.
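
To make the gap between code points and grapheme clusters concrete, here is a minimal sketch using only the C++ standard library. The function name count_code_points and the two sample strings are illustrative and are not taken from tinyutf8 or Fltrdr. It assumes well-formed UTF-8 input and does no validation.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by counting every byte that is
// not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::size_t count_code_points(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}

// "é" as the single precomposed code point U+00E9 (2 bytes in UTF-8).
const std::string precomposed = "\xC3\xA9";

// "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (1 + 2 bytes in UTF-8).
const std::string decomposed = "e\xCC\x81";
```

Both strings are displayed as one user-perceived character, yet count_code_points reports 1 for precomposed and 2 for decomposed. Counting code points alone is therefore not enough for alignment or highlighting; grouping them into grapheme clusters needs the Unicode segmentation rules, which this sketch does not implement.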

Following that, I checked out Boost.Locale using the ICU backend. Its boost::locale::boundary::segment_index API, with boost::locale::boundary::character as the boundary type, appears to iterate over grapheme clusters correctly. I have a local branch using Boost.Locale where Fltrdr now correctly displays Unicode text.
Although it works, the segment index seems to create its own copy of the indexed text, resulting in higher memory usage because the text is stored twice: once in a std::string and once in the segment_index.

I'm trying out another option today using the icu::BreakIterator API. It looks similar to the segment_index class from Boost.Locale. It takes an icu::UnicodeString as its string parameter, which can alias an external array of characters, meaning it doesn't own the array but just points to it.
The plan is to store the text in a std::string, have an icu::UnicodeString point to that main string, and let icu::BreakIterator index the Unicode string. This would allow a single copy of the text to be kept.
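
The single-copy idea can be sketched with the standard library alone. This uses std::string_view as a stand-in for a non-owning alias; it is not the ICU API and makes no claim about how icu::UnicodeString aliasing is set up. The helper names make_view and aliases are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <string_view>

// Return a non-owning view of `length` bytes of `text`, starting at `offset`.
// No bytes are copied; the view is only valid while `text` is alive and
// unmodified. Throws std::out_of_range if `offset` is past the end of `text`.
std::string_view make_view(const std::string& text, std::size_t offset,
                           std::size_t length) {
  return std::string_view(text).substr(offset, length);
}

// Report whether `view` points into the storage owned by `owner`, that is,
// whether it shares the owner's bytes rather than referring to a separate copy.
bool aliases(const std::string& owner, std::string_view view) {
  const char* begin = owner.data();
  const char* end = begin + owner.size();
  const std::less_equal<const char*> le;  // total order over pointers
  return le(begin, view.data()) && le(view.data() + view.size(), end);
}
```

With std::string text = "one copy of the text", the call make_view(text, 4, 4) yields a view equal to "copy" that shares the bytes of text, whereas constructing a new std::string from that view allocates a second copy. The trade-off is lifetime: the owner must outlive every view into it.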

Lastly, I've been tinkering with the Boost.Regex library using the ICU backend to support Unicode regex searching with boost::utf8regex_iterator.

In conclusion, Fltrdr will have Unicode text support soon!

@octobanana changed the title from "Accented characters are not rendered correctly" to "Add Unicode text support" on Feb 26, 2019

@octobanana
Owner

As of Version 0.2.0, Fltrdr supports UTF-8 Unicode text!

UTF-8 text should now render properly, including full-width CJK Unified Ideographs. I ended up using the ICU library to provide Unicode support for a non-owning/view OB::Text::View class, an owning/string OB::Text::String class, and a non-owning/view regex iterator OB::Text::Regex.
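
As a rough illustration of why full-width characters matter for alignment, the sketch below estimates how many terminal columns a UTF-8 string occupies. It is not how Fltrdr or ICU compute widths. The names decode_utf8 and display_columns are hypothetical, and the width rule is a deliberate simplification: it treats only the CJK Unified Ideographs block (U+4E00 to U+9FFF) as two columns wide and only the Combining Diacritical Marks block (U+0300 to U+036F) as zero width, and it assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-8 string into code points.
// Assumption: the input is well-formed UTF-8; nothing is validated here.
std::vector<char32_t> decode_utf8(const std::string& s) {
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 1;
    char32_t cp = lead;
    if ((lead & 0xE0) == 0xC0) {
      len = 2;
      cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
      len = 3;
      cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
      len = 4;
      cp = lead & 0x07;
    }
    // Each continuation byte contributes its low six bits.
    for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Estimate the number of terminal columns needed to display `s`.
// Simplified rule (an assumption, not the full Unicode width tables):
// combining diacritical marks take 0 columns, CJK Unified Ideographs take 2,
// and everything else takes 1.
std::size_t display_columns(const std::string& s) {
  std::size_t columns = 0;
  for (char32_t cp : decode_utf8(s)) {
    if (cp >= 0x0300 && cp <= 0x036F) {
      continue;
    }
    columns += (cp >= 0x4E00 && cp <= 0x9FFF) ? 2 : 1;
  }
  return columns;
}
```

Under these assumptions, the two ideographs in "\xE4\xB8\xAD\xE6\x96\x87" (U+4E2D, U+6587) occupy four columns, while "e\xCC\x81" (e plus a combining acute accent) occupies one. A real implementation would rely on a Unicode library for both widths and grapheme boundaries.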

If you get the chance to try it out, please let me know if you have any suggestions or encounter any issues :)

@nodefourtytwo
Author

It works for me, that is fantastic. Thank you.
