Handle utf8 nicks #471
When using irssi along with tools such as BitlBee, one can encounter nicknames or channels with UTF-8 multibyte characters. It therefore becomes necessary for some irssi functions to handle UTF-8 better, starting with get_alignment. Indeed, get_alignment was liable to mess up terminals by splitting strings in the middle of a multibyte character.
This was done assuming an "indent with tab, align with spaces" approach.
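To illustrate the splitting problem described above, here is a minimal sketch (not the code from this pull request) of cutting a string at a byte limit without landing inside a multibyte sequence. The function name utf8_safe_cut is hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the largest length <= max_bytes at which
 * `s` can be cut without splitting a UTF-8 sequence.
 * Assumes `s` is valid, NUL-terminated UTF-8. */
size_t utf8_safe_cut(const char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    size_t cut;

    if (len <= max_bytes)
        return len; /* nothing to cut */

    cut = max_bytes;
    /* Continuation bytes have the form 10xxxxxx. If the byte at the cut
     * position is one of them, the cut would fall inside a character,
     * so step back to the start of that character. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

For example, cutting "aéb" (where "é" takes two bytes) at 2 bytes yields a cut at 1 byte, so the "é" is dropped as a whole instead of being split.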
Since get_alignment handles UTF-8, it can be useful in other parts of irssi, e.g. display_sorted_nicks().
This commit aims at improving the way irssi generates the output of /names with UTF-8 nicks. However, it is only a partial fix: instead of counting bytes, irssi now counts characters; but it should ideally count columns, since Unicode characters are liable to spread across several columns. Alas, as this message is being written, the whole ecosystem is far from being able to deal with that: some terminals handle multi-column characters, some others don't; various ncurses programs (vim, tmux, etc.) suffer from bugs on terminals that do handle them.
Thanks! I assume this is fixing #40, right? There's no guarantee that the nicks received over the network will be valid utf-8, or that the terminal itself uses utf-8. That's where the thing gets ugly. Also recode is involved somewhere in there. I'm not sure how to cover all cases to test that this doesn't break.
I wish I could tell you it completely fixes #40 ... but that fixes only half the problem. We were formerly living in a nice, happy heaven where 1 byte == 1 character == 1 column. We now live in a harsh hell where one to four bytes make up one Unicode thingie (which may be a printable character, but which may also be a non-printable kind of indicator for various linguistic needs) which itself may take up more than one column in our terminals (for instance, a musical note is 1 column wide while the infamous pile of poo character is 2 columns wide): x bytes == y characters == z columns.
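The first two terms of "x bytes == y characters == z columns" can be sketched as follows. This is an illustration only, assuming valid UTF-8 input; the function names are hypothetical. The column count is not computed here, because it requires a per-character width table.

```c
#include <stddef.h>
#include <string.h>

/* Number of bytes, excluding the terminating NUL. */
size_t count_bytes(const char *s)
{
    return strlen(s);
}

/* Number of Unicode code points. In valid UTF-8, every byte that is
 * not a continuation byte (10xxxxxx) starts a new code point. */
size_t count_code_points(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With these definitions, PILE OF POO (U+1F4A9, encoded as F0 9F 92 A9) counts as 4 bytes but 1 code point, and a musical note (U+266A, encoded as E2 99 AA) as 3 bytes and 1 code point.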
Indeed... I took care to call g_utf8_validate() in get_alignment to ensure we had proper UTF-8 (which includes proper ASCII), but that excludes e.g. ISO-8859-* encodings... I am not sure where I could find a real-life example of a server relaying such nicknames though...
Hey, I had completely forgotten about recode; we could indeed recode nicknames the same way text gets recoded, possibly with a recode_nicknames option...
Admittedly, what I provide here is meant as an improvement in our UTF-8-converging world, not a holy solution to handle every possible encoding and terminal.
I forgot to mention: it was a bit frustrating to see that wcswidth() was returning the wrong result, but at the same time, my konsole terminal was able to tell which characters it should display on two columns. Considering the dependencies of konsole, I assume this is due to its use of the ICU library (International Components for Unicode)... i.e. we could probably compute the number of columns taken by a string in a reliable way by adding libicu to irssi's dependencies... but that feels a little too much just to align nicknames...
What |
The glibc one... I was completely unaware of the existence of core/wcwidth.c... I am going to have a look at it... if it proves operational, I am totally willing to improve my patch :)
Actually, mk_wcwidth() seems too simplistic; it spots Asian characters but not emoji... I do not know how much work it would take to improve it to a point where no one would ever notice a problem (perhaps it is just a few byte ranges or conditions to add), but that would clearly duplicate the efforts of other projects, starting with glibc.
Uh, works for me? urxvt 9.22. The white square in the middle is the cursor selection, two cells wide.
Emoji are defined to have a width of 1 according to EastAsianWidth.txt, what Your glibc is older than 2.22, which means its Our Personally the only thing i'd improve with (I have a few notes on this topic here) |
You are correct and I am wrong. I have discussed the matter with my friend (he uses urxvt, I use konsole) and he confirms that indeed, to his surprise, urxvt is able to display multi-column characters (he made the test with "苺ましまろ").
I may have misused the "emoji" term here; the two characters for which we have experienced a difference in behaviour are PILE OF POO and CHERRY BLOSSOM, which are not part of EastAsianWidth.txt but can be found in UnicodeData.txt. As far as I understand this file, they should be 1 column wide (it states neither nor ) and this is the way urxvt displays them.
Actually, I do have 2.22 (and those characters were introduced in Unicode 6.0); I just assumed it was completely out of its league because wcwidth() was returning -1 for those characters (probably a coding mistake of mine there) and wcswidth() was telling me one of the nicknames used for my tests was 19 columns wide while my terminal was rendering it on 21 columns. Conclusion: I am going to improve my "handle-utf8-nicks" branch to count columns instead of characters, and I will probably do so by following your suggestion of having mk_wcwidth call glibc's wcwidth() at some point. By the way, I have checked glibc's wcswidth() code and it actually seems to behave terribly, so let's just forget it.
I can't test this myself (my system can't render emoji in non-libxft apps, it's complicated), but judging by the source code, konsole doesn't use libicu, and includes its own I can't see why it would use two columns.
That's because java uses UTF-16 internally, so characters beyond the basic multilingual plane (U+FFFF) are stored as a surrogate pair (two "characters"). See their docs. Either way this doesn't mean column width.
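The surrogate-pair point can be made concrete with a small sketch of the standard UTF-16 encoding rule. The function names are hypothetical; the arithmetic is the one defined by UTF-16 for code points above U+FFFF.

```c
#include <stdint.h>

/* Number of 16-bit code units UTF-16 needs for one code point:
 * 1 inside the Basic Multilingual Plane, 2 beyond it. */
int utf16_units(uint32_t cp)
{
    return cp > 0xFFFF ? 2 : 1;
}

/* Splits a code point above U+FFFF into its high and low surrogates.
 * The caller must ensure 0x10000 <= cp <= 0x10FFFF. */
void utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                /* 20 significant bits  */

    *high = (uint16_t)(0xD800 + (v >> 10));   /* top 10 bits          */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF)); /* bottom 10 bits       */
}
```

PILE OF POO (U+1F4A9) becomes the pair D83D DCA9, which is why a UTF-16-based length reports 2 for it. That number is a count of code units and says nothing about how many terminal columns the character occupies.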
Having WIP stuff in here is fine, and as far as I can see this code is a step in the right direction.
Ohhhh, that picture explains a lot. Holy crap it's surprising that works at all.
Branch updated: 687230b to 47dfc78
This move makes sense since these files contain rather fundamental functions (fundamental here means that we shouldn't have had to implement them) which are required by other functions located in core/special-vars.c.
Branch updated: 47dfc78 to f503303
Some progress.
... and that, according to the GLib changelogs:
When applied to the Debian environment, here is what we get:
I double-checked all of this with a couple of test programs (one using wcwidth, one using g_unichar_iswide()) and concluded that relying on GLib's g_unichar_iswide() was a safer bet than relying on libc's wcwidth(). Additionally, the libc functions use wchar_t, which is implementation-dependent, whereas GLib's gchar and gunichar are already used a lot within irssi.
Following my latest commits, get_alignment now works with columns instead of characters. It relies on three helper functions, which deal with UTF-8 strings and determine the number of columns associated with each character by calling GLib-provided functions. I kept mk_wcwidth() in the code but I actually did not use it.
The way I determine the number of columns it takes to render a string is naive: I simply add up the number of columns as I iterate over characters. But that's an honest start for a function we shouldn't have to implement :-)
As to the Unicode subtleties I mention in a comment, it seems that http://unicode.org/reports/tr51/#Diversity and http://unicode.org/reports/tr51/#Emoji_ZWJ_Sequences will bring their share of headaches in the future.
All of this being said, I think this can now claim to fix #40.
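The naive column count described above (add up a per-character width while iterating) might look roughly like the sketch below. It is not the code from this pull request: utf8_next, toy_width and string_columns are hypothetical names, the input is assumed to be valid UTF-8, and toy_width is a deliberately tiny stand-in table where the actual patch queries GLib (e.g. g_unichar_iswide()).

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from the UTF-8 string at *p and advances *p
 * past it. Assumes valid UTF-8; it never reads beyond the NUL. */
uint32_t utf8_next(const char **p)
{
    const unsigned char *s = (const unsigned char *)*p;
    uint32_t cp;
    int extra, i;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }

    for (i = 1; i <= extra && (s[i] & 0xC0) == 0x80; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    *p += i;
    return cp;
}

/* Toy width table, for illustration only: a real implementation would
 * rely on a maintained Unicode database instead of three ranges. */
int toy_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;                        /* combining diacritical marks */
    if ((cp >= 0x3040 && cp <= 0x30FF) ||    /* Hiragana and Katakana   */
        (cp >= 0x4E00 && cp <= 0x9FFF))      /* CJK Unified Ideographs  */
        return 2;
    return 1;
}

/* Naive column count: sum the width of every code point. */
int string_columns(const char *s, int (*width_of)(uint32_t))
{
    int cols = 0;

    while (*s != '\0')
        cols += width_of(utf8_next(&s));
    return cols;
}
```

Passing the width function as a parameter keeps the iteration logic independent of whichever source of width data is finally chosen. With the toy table, "苺ましまろ" (5 characters, 15 bytes) counts as 10 columns.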
hi, thanks for working on this. I would prefer if the code could stay uniform; mk_wcwidth is currently used for screen alignment in formats.c. Also, I forgot whether BIG5 and 8-byte terminal encodings are still working or already totally broken, and if they are still fine, on which layers they need to be taken care of. Presumably with term being in 8-byte mode the length manipulation functions should also work on bytes? Or did irssi already convert everything to unicode internally?
(for example https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-unichar-iszerowidth would suggest that isprint && !iszerowidth would be a more correct implementation of wcwidth using glib's unicode primitives)
As a reminder, there's also #411 that's closely related to this (and/or a wrong use of strlen instead).
one further note on the number of columns when truncating: zero width characters such as combining characters should be included on the final character
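The truncation rule in the note above can be sketched as follows: walk the string while the column budget allows, and let zero-width characters ride along with the last character that fit. This is an illustration under assumptions, not irssi code: all three function names are hypothetical, the input is assumed to be valid UTF-8, and toy_width is a tiny stand-in for a real width database.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from valid UTF-8 at *p and advances *p. */
uint32_t utf8_next(const char **p)
{
    const unsigned char *s = (const unsigned char *)*p;
    uint32_t cp;
    int extra, i;

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }

    for (i = 1; i <= extra && (s[i] & 0xC0) == 0x80; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    *p += i;
    return cp;
}

/* Toy width table, for illustration only. */
int toy_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;                        /* combining diacritical marks */
    if ((cp >= 0x3040 && cp <= 0x30FF) ||    /* Hiragana and Katakana   */
        (cp >= 0x4E00 && cp <= 0x9FFF))      /* CJK Unified Ideographs  */
        return 2;
    return 1;
}

/* Returns how many bytes of `s` fit in `max_cols` columns. A character
 * of width 0 never exceeds the budget, so combining characters that
 * follow the last fitting character are kept with it. When a character
 * does not fit, the walk stops, so its combining marks are dropped too. */
size_t columns_to_bytes(const char *s, int max_cols,
                        int (*width_of)(uint32_t))
{
    const char *p = s;
    int cols = 0;

    while (*p != '\0') {
        const char *next = p;
        int w = width_of(utf8_next(&next));

        if (cols + w > max_cols)
            break;
        cols += w;
        p = next;
    }
    return (size_t)(p - s);
}
```

For "e" followed by U+0301 (combining acute accent) and then "a", a budget of 1 column keeps 3 bytes: the "e" and its accent. A wide character that would straddle the limit is left out entirely, so the result may be one column narrower than the budget.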
hi @xavierog are you interested in finishing this?
Hi,
I can relate to your position on this.
mk_wcwidth() is indeed used in various places throughout the code; plus, it is frequently associated with conditions such as unichar_isprint() or g_unichar_isalnum(), which makes it trickier to swiftly get rid of mk_wcwidth(). I suggest this "cleaning" operation be treated as a separate task/issue as the slightest mistake will probably break noticeable things.
I think it is not unwise to say that relying on GLib, which seems to strive to follow Unicode closely, is indeed superior to relying on an old function, prefixed with its author's initials, copied and pasted from the Internet and into the irssi source code. Basically, my point is that mk_wcwidth() is bound to become obsolete at some future point in time (according to an earlier comment of dequis', mk_wcwidth() handles Unicode 5.0 along with a good guess for most new characters).
Indeed, irssi ideally needs its own set of functions to centralize whatever magic needs be done to determine the width of a character or the width of a string. Again, I think this should be part of a separate task.
I assume you meant "8-bit"? If not, could you please elaborate on those 8-byte encodings?
Note: it seems that trying to recode nicknames could actually lead to extra issues (e.g. breaking the ability to highlight someone as their nickname was recoded). Please let me know if the approach I am proposing here is acceptable; if so, I am willing to implement it; otherwise, I might feel slightly discouraged; also, please note I am particularly unwilling to dive into the details of exotic encodings such as BIG5.
I agree -- I can take care of this.
I would suggest you make this advance function available https://github.com/irssi/irssi/blob/master/src/fe-common/core/formats.c#L423 together with the utf8 logic as present in https://github.com/irssi/irssi/blob/master/src/fe-common/core/formats.c#L449
For the latter, I would suggest doing away with most changes to get_alignment and instead keeping the basic skeleton of the original get_alignment function. Then, instead of calling g_string_truncate, it should be a simple matter of calling (correct me if I'm wrong)
If you feel like it, you could then, in a later pull request, replace all calls to mk_wcwidth with unichar_width and then successively replace the implementation.
What do you think?
Your suggestions look reasonable to me; I will work on it in the next days and then come back to you :)
fixed by #480
First, a little bit of context regarding this pull request: a friend of mine uses irssi and Bitlbee to connect to various non-IRC networks. Doing so, he stumbled upon a few issues related to nicks containing UTF-8 characters, i.e. something irssi did not really expect. The commits that appear in this pull request are what we made to solve those issues. As described in the last commit, we did not solve everything (as wcwidth() and wcswidth() seem to fail at doing their job) but we made the situation a little saner. Feel free to review, comment and/or integrate these changes.