Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle utf8 nicks #471

Closed
wants to merge 8 commits into from
Closed

Conversation

xavierog
Copy link
Contributor

First, a little bit of context regarding this pull request: a friend of mine uses irssi and Bitlbee to connect to various non-IRC networks. Doing so, he stumbled upon a few issues related to nicks containing UTF-8 characters, i.e. something irssi did not really expect. The commits that appear in this pull request are what we made to solve those issues. As described in the last commit, we did not solve everything (as wcwidth() and wcswidth() seem to fail at doing their job) but we made the situation a little saner. Feel free to review, comment and/or integrate these changes.

When using irssi along with some tools like e.g. BitlBee, one can encounter
nicknames or channels with UTF-8 multibyte characters. It becomes therefore
necessary for some irssi functions to better handle UTF-8, starting with
get_alignment. Indeed, get_alignment was liable to mess up terminals by
splitting strings in the middle of a multibyte character.
This was done assuming an "indent with tab, align with spaces" approach.
Since get_alignment handles UTF-8, it can be useful in other parts of irssi,
e.g. display_sorted_nicks().
This commit aims at improving the way irssi generates the output of /names with
UTF-8 nicks. However, it is only a partial fix: instead of counting bytes,
irssi now counts characters; but it should ideally count columns, since Unicode
characters are liable to spread accross several columns. Alas, as this message
is being written, the whole ecosystem is far from being able to deal with that:
some terminals handle multi-columns characters, some others don't; various
ncurses programs (vim, tmux, etc.) suffer from bugs on terminals who do handle
them.
@dequis
Copy link
Member

dequis commented Apr 19, 2016

Thanks!

I assume this is fixing #40, right?

There's no guarantee that the nicks received over the network will be valid utf-8, or that the terminal itself uses utf-8. That's where the thing gets ugly. Also recode is involved somewhere in there. I'm not sure how to cover all cases to test that this doesn't break.

@xavierog
Copy link
Contributor Author

xavierog commented Apr 19, 2016

I assume this is fixing #40, right?

I wish I could tell you it completely fixes #40 ... but that fixes only half the problem. We were formerly living in a nice, happy heaven where 1 byte == 1 character == 1 column. We now live in a harsh hell where one to four bytes make up one Unicode thingie (which may be a printable character, but which may also be a non-printable kind of indicator for various linguistic needs) which itself may take up more than one column in our terminals (for instance, a musical note is 1 column wide while the infamous pile of poo character is 2 columns wide): x bytes == y characters == z columns.
My patch ensures we cut/pad UTF-8 nicks based on UTF-8 characters, not bytes. That does solve #40 for terminals which do not handle multi-column characters, such as urxvt (urxvt enforces all characters to fit in a single column, even if it is not the way they were meant to be). But for those who handle it, such as konsole, the misalignment is still present (although usually attenuated). I wanted to improve my patch by leveraging wcwidth() and wcswidth() to compute the actual size a given string would take on the terminal but...

  • since not all terminals render multi-columns characters the same way, we would first need a way to detect the terminal capabilities; I did not dive in terminfo / termcap, perhaps there are some hints for us there...
  • the C functions wcwidth() and wcswidth(), as tested on my Debian Sid workstation, seem to be rather useless: wcwidth() frequently returns -1, wcswidth() frequently returns the number of characters instead of the number of columns... That does not look like a proper environment to develop a complete fix to /names list isn't lined up properly if nicks are in UTF-8 #40 :-/
    So, long story short, my humble opinion on the matter is: we can improve things by properly counting UTF-8 characters, but when it comes to counting columns, the situations is all fucked-up :')

There's no guarantee that the nicks received over the network will be valid utf-8

Indeed... I took care to call g_utf8_validate() in get_alignment to ensure we had proper UTF-8 (which includes proper ASCII), but that excludes e.g. ISO-8859-* encodings... I am not sure where I could find a real-life example of a server relaying such nicknames though...

or that the terminal itself uses utf-8. That's where the thing gets ugly. Also recode is involved somewhere in there.

Hey, I had completely forgotten about recode; we could indeed recode nicknames the same way text gets recoded, possibly with a recode_nicknames option...

I'm not sure how to cover all cases to test that this doesn't break.

Admittedly, what I provide here is meant as an improvement in our UTF-8-converging world, not a holy solution to handle every possible encodings and terminals.

@xavierog
Copy link
Contributor Author

xavierog commented Apr 19, 2016

I forgot to mention: it was a bit frustrating to see that wcswidth() was returning the wrong result, but at the same time, my konsole terminal was able to tell which characters it should display on two columns. Considering the dependencies of konsole, I assume this is due to its use of the ICU library (International Components for Unicode)... i.e. we could probably compute the number of columns taken by a string in a reliable way by adding libicu to irssi's dependencies... but that feels a little too much just to align nicknames...

@LemonBoy
Copy link
Member

What wcswidth are we talking about? The glibc one or the one we ship (that's called mk_wcswidth) ?

@xavierog
Copy link
Contributor Author

The glibc one... I was completely unaware of the existence of core/wcwidth.c... I am going to have a look at it... if it proves operational, I am totally willing to improve my patch :)

@xavierog
Copy link
Contributor Author

Actually, mk_wcwidth() seems too simplistic; it spots asian characters but not emojis... I do not know how much work it would represent to improve it to a point where no one would ever notice a problem (perhaps it is just a few byte ranges or conditions to add), but that would clearly duplicate efforts of other projects, starting with glibc.
Also, mk_wcswidth is commented out (well, #defined out); I assume this is due to the fact that getting the column count of a string is much more complicated than adding up the column counts of its characters.

@dequis
Copy link
Member

dequis commented Apr 19, 2016

urxvt enforces all characters to fit in a single column, even if it is not the way they were meant to be

Uh, works for me? urxvt 9.22. The white square in the middle is the cursor selection, two cells wide.

Actually, mk_wcwidth() seems too simplistic; it spots asian characters but not emojis

Emoji are defined to have a width of 1 according to EastAsianWidth.txt, what mk_wcwidth() does is correct.

Your glibc is older than 2.22, which means its wcwidth() only supports up to unicode 5.1, and returns -1 for unknown characters. Newer versions were updated to newer unicode versions: 2.22 supports unicode 7.0 and 2.23 supports unicode 8.0, so they will return the correct widths for characters introduced since 5.1.

Our mk_wcwidth(), in contrast, only supports up to 5.0, but returns 1 for unknown characters, which is a pretty good guess for most new characters, including emoji.

Personally the only thing i'd improve with mk_wcwidth() is to make it call the glibc wcwidth() if that one supports newer characters.

(I have a few notes on this topic here)

@xavierog
Copy link
Contributor Author

xavierog commented Apr 19, 2016

Uh, works for me?

You are correct and I am wrong. I have discussed the matter with my friend (he uses urxvt, I use konsole) and he confirms that indeed, to his surprise, urxvt is able to display multi-column characters (he made the test with "苺ましまろ").

Emoji are defined to have a width of 1 according to EastAsianWidth.txt, what mk_wcwidth() does is correct.

I may have misused the "emoji" term here; the two characters for which we have experienced a difference in behaviour are PILE OF POO and CHERRY BLOSSOM, which are not part of EastAsianWidth.txt but can be found in UnicodeData.txt. As far as I understand this file, they should be 1 column wide (it states neither nor ) and this is the way urxvt displays them.
However, konsole displays those 2 columns wide, presumably because libICU-provided data state that a pile of poo is two columns wide; http://www.fileformat.info/info/unicode/char/1F4A9/index.htm also shows "Character.charCount(): 2" in Java Data. I have no idea who in the world picked "2 columns" for those characters without a hint from those defining Unicode... Any idea about that?
Summary: different terminals are still liable to render some characters using a different number of columns :-/

Your glibc is older than 2.22, which means its wcwidth() only supports up to unicode 5.1, and returns -1 for unknown characters. Newer versions were updated to newer unicode versions: 2.22 supports unicode 7.0 and 2.23 supports unicode 8.0, so they will return the correct widths for characters introduced since 5.1.

Actually, I do have 2.22 (and those characters were introduced in Unicode 6.0,), ... I just assumed it was completely out of the league because wcwidth() was returning -1 for those characters (probably a coding mistake of mine there) and wcswidth() was telling me one of the nicknames used for my tests was 19 columns wide while my terminal was rendering it on 21 columns.

Conclusion: I am going to improve my "handle-utf8-nicks" branch to count columns instead of characters, and I will probably do so by following your suggestion of having mk_wcwidth calling glibc's wcwidth() at some point. By the way, I have checked glibc's wcswidth()'s code and it actually seems to behave terribly, so let's just forget it.
In the meantime, should we close this pull request?

@dequis
Copy link
Member

dequis commented Apr 20, 2016

However, konsole displays those 2 columns wide, presumably because libICU-provided data state that a pile of poo is two columns wide

I can't test this myself (my system can't render emoji in non-libxft apps, it's complicated), but judging by the source code, konsole doesn't use libicu, and includes its own konsole_wcwidth.cpp which is the same as our wcwidth.c in implementation (the combining/fullwidth ranges are the same), the only differences are types and indentation.

I can't see why it would use two columns.

also shows "Character.charCount(): 2" in Java Data

That's because java uses UTF-16 internally, so characters beyond the basic multilingual plane (U+FFFF) are stored as a surrogate pair (two "characters"). See their docs. Either way this doesn't mean column width.

In the meantime, should we close this pull request?

Having WIP stuff in here is fine, and as far as I can see this code is a step in the right direction.

@xavierog
Copy link
Contributor Author

I had not dared looking into the konsole sources (the "one problem at a time" approach...); my assumption on libicu was based on a simple "ldd".

the only differences are types and indentation.
I can't see why it would use two columns.

I will try to spend some time to discuss that with the authors of konsole; I suspect their implementation is a little hackish for characters beyond the BMP as, like Java, they seem to deal with 16-bit char values (their konsole_wcwidth takes a single "quint16" parameter), surrogation and a notion of extended characters; I think a picture will speak better than I do:
Konsole weird behaviour.
Either way, thanks for your feedback on the matter, it makes things clearer to me.

Having WIP stuff in here is fine, and as far as I can see this code is a step in the right direction.

Ok, fine -- I will update this pull request with additional commits in the coming days :-)

@dequis
Copy link
Member

dequis commented Apr 20, 2016

Ohhhh, that picture explains a lot. Holy crap it's surprising that works at all.

@xavierog xavierog force-pushed the handle-utf8-nicks branch 2 times, most recently from 687230b to 47dfc78 Compare April 23, 2016 01:11
This move makes sense since these files contain rather fundamental functions
(fundamental here means that we shouldn't have had to implement them)
which are required by other functions located in core/special-vars.c.
@xavierog
Copy link
Contributor Author

Some progress.
Considering that, according to you:

  • libc 2.19 supports Unicode 5.1
  • libc 2.22 supports Unicode 7.0
  • libc 2.23 supports Unicode 8.0

... and that, according to the GLib changelogs:

  • GLib 2.30 supports Unicode 6.0
  • GLib 2.41.2 supports Unicode 7.0
  • GLib 2.47 supports Unicode 8.0

When applied to the Debian environment, here is what we get:

  • Debian jessie: GLib 2.42.1 => Unicode 7.0; libc 2.19 => Unicode 5.1
  • Debian sid: GLib 2.48.0 => Unicode 8.0; libc 2.22 => Unicode 7.0

I double-checked all of this with a couple of test programs (one using wcwidth, one using g_unichar_iswide()) and concluded that relying on GLib's g_unichar_iswide() was a safer bet than relying on libc's wcwidth(). Additionally, the libc functions use wchar_t, which is implementation-dependent, whereas GLib's gchar and gunichar are already used a lot within irssi.

Following my latest commits, get_alignment now works with columns instead of characters. It relies on three helper functions, which deal with UTF-8 strings and determine the number of columns associated with each character by calling GLib-provided functions. I kept mk_wcwidth() in the code but I actually did not use it.
I took the liberty to move wcwidth.c, utf8.h and utf8.c in core/, as it really made a lot of sense to me. Not moving them would have implied that get_alignment() (which resides in core/special-vars.c) depend on functions located in fe-common/core, which sounds crazy to me -- I hope it is okay.

The way I determine the number of columns it takes to render a string is naive: I simply add up the number of columns as I iterate over characters. But that's an honest start for a function we shouldn't have to implement :-) As to the Unicode subtleties I mention in a comment, it seems that http://unicode.org/reports/tr51/#Diversity and http://unicode.org/reports/tr51/#Emoji_ZWJ_Sequences will bring their share of headaches in the future.

All of this being said, I think this can now pretend to fix #40 .

@ailin-nemui
Copy link
Contributor

hi, thanks for working on this. I would prefer if the code could stay uniform, mk_wcwidth is currently used for screen alignment in formats.c advance, several places in gui-entry.c and term-terminfo.c. If we find g_unichar_iswide to be superier to mk_wcwidth then we may want to consistently replace it. Or otherwise introduce a wrapper to make further replacements of such code easier.

Also I forgot if BIG5 and 8-byte terminal encodings are still working or already totally broken, and if they are still fine, on which layers they need to be taken care of. Presumably with term being in 8-byte mode the length manipulation functions should also work on bytes? Or did irssi already convert everything to unicode internally?

@ailin-nemui
Copy link
Contributor

(for example https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-unichar-iszerowidth would suggest that isprint && !iszerowidth would be a more correct implementation of wcwidth using glib's unicode primitives)

@LemonBoy
Copy link
Member

As a reminder, there's also #411 that's closely related to this (and/or a wrong use of strlen instead).

@ailin-nemui
Copy link
Contributor

ailin-nemui commented Apr 24, 2016

one further note on the number of columns when truncating: zero width characters such as combining characters should be included on the final character

@ailin-nemui
Copy link
Contributor

hi @xavierog are you interested in finishing this?

@xavierog
Copy link
Contributor Author

xavierog commented May 9, 2016

Hi,

I would prefer if the code could stay uniform

I can relate to your position on this.

mk_wcwidth is currently used...

mk_wcwidth() is indeed used in various places throughout the code; plus, it is frequently associated with conditions such as unichar_isprint() or g_unichar_isalnum(), which makes it trickier to swiftly get rid of mk_wcwidth(). I suggest this "cleaning" operation be treated as a separate task/issue as the slightest mistake will probably break noticeable things.

If we find g_unichar_iswide to be superior to mk_wcwidth

I think it is not unwise to say that relying on GLib, which seems to strive to follow Unicode closely, is indeed superior to relying on an old function, prefixed with its author's initials, copied and pasted from the Internet and into the irssi source code. Basically, my point is that mk_wcwidth() is bound to become obsolete at some future point in time (according to an earlier comment of dequis', mk_wcwidth() handles Unicode 5.0 along with a good guess for most new characters).
Additionally, GLib might surprise us some day with a clever counterpart to wcswidth(), who knows?

Or otherwise introduce a wrapper to make further replacements of such code easier.

Indeed, irssi ideally needs its own set of functions to centralize whatever magic needs be done to determine the width of a character or the width of a string. Again, I think this should be part of a separate task.

Also I forgot if BIG5 and 8-byte terminal encodings are still working

I assume you meant "8-bits"? If no, could you please elaborate on those 8-byte encodings?
Since we are focusing on nicknames here, the situation is pretty simple: nicknames are char * which come straight from the ircd and we have strictly no idea of their encoding.
Until I start working on this task, these char * were truncated and/or aligned assuming that 1 byte == 1 character == 1 column, which works perfectly for ASCII and most 8-bits encodings -- it probably never worked for BIG-5 and UTF-16 though.
After my commits, the situation is different: strings that are neither ASCII nor UTF-8 will not be treated (get_alignment() will return NULL, which will end up as an empty string through format functions), and this can indeed be improved :)
My proposal is to fall back on the former implementation of get_alignment() if g_utf8_validate() says a nickname is not UTF8-encoded. That way, we would solve issue #40 ("handle UTF-8 nicknames") and irssi would behave the same way it did before for other encodings. This is not a perfect fix, this is not a perfect approach but it seems quite practical to me considering that:

  • issue /names list isn't lined up properly if nicks are in UTF-8 #40 has been opened for... 8 years;
  • in the meantime, no bug was reported regarding the handling of BIG5 or UTF-16 nicknames (let me know if I am wrong here);
  • handling UTF-8 correctly is more and more important each day. Whether this statement could also be said of BIG5 is an interesting geopolitical question...

Note: it seems that trying to recode nicknames could actually lead to extra issues (e.g. breaking the ability to highlight someone as their nickname was recoded).

Please let me know if the approach I am proposing here is acceptable; if so, I am willing to implement it; otherwise, I might feel slightly discouraged; also, please note I am particularly unwilling to dive in the details of exotic encodings such as BIG5.

isprint && !iszerowidth would be a more correct implementation of wcwidth using glib's unicode primitives
when truncating: zero width characters such as combining characters should be included on the final character

I agree -- I can take care of this.

@ailin-nemui
Copy link
Contributor

ailin-nemui commented May 9, 2016

I would suggest you make this advance function available https://github.com/irssi/irssi/blob/master/src/fe-common/core/formats.c#L423 together with the utf8 logic as present in https://github.com/irssi/irssi/blob/master/src/fe-common/core/formats.c#L449

  • i.e. if the terminal is not set to utf8, always use strcpy semantics, otherwise fall back to strcpy if the string (nick) does not validate as utf8.
  • remove the utf8 from the function names and make it a parameter instead
  • And also use that function instead of get_utf8_char_width (name it string_width_advance or something),
  • and model the other function as a copy&paste of the format_get_length (name it string_get_width)
  • I think the implementation of get_utf8_chars_for_width or the implementation of get_alignment can probably be improved

for the latter I would suggest doing away with most changes to get_alignment and instead keep the basic skeleton of the original get_alignment function. Then, instead of calling g_string_truncate it should be a simple matter of calling g_string_truncate(str, string_get_strlen_width(str->str, align, utf8))

(correct me if I'm wrong)

if you feel like it you could then in a later pull request replace all calls to mk_wcwidth with unichar_width and then successively replace the implementation

what do you think?

@xavierog
Copy link
Contributor Author

Your suggestions look reasonable to me; I will work on it in the next days and then come back to you :)

@ailin-nemui
Copy link
Contributor

fixed by #480

@ailin-nemui ailin-nemui added this to the 1.0.0 milestone Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants