New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a wrapper of wcwidth() that picks the best implementation #917

Merged
merged 8 commits into from Sep 4, 2018

Conversation

Projects
None yet
3 participants
@dequis
Copy link
Member

dequis commented Aug 23, 2018

Fixes #720


This adds a i_wcwidth() function that replaces mk_wcwidth(), and a
'wcwidth_implementation' setting to pick which one it wraps.

Values:

  • old: uses our local mk_wcwidth() which implements unicode 5.0
  • system: uses the libc-provided wcwidth(), which may be better or worse
    than ours depending on how up to date the system is.
  • auto: tests the system one against two characters that became
    fullwidth in unicode 5.2 and 9.0 respectively. If either of them pass,
    pick the system implementation, otherwise pick ours.

It defaults to auto.

mk_wcwidth() is still preferable in some cases, since the way it uses
ranges for fullwidth characters means most CJK blocks are covered even
if their characters didn't exist back then.

The "system" implementation is also wrapped to never return -1, but to
assume those unknown characters use one cell. Quoting the code:

/* Treat all unknown characters as taking one cell. This is
 * the reason mk_wcwidth and other outdated implementations
 * mostly worked with newer unicode, while glibc's wcwidth
 * needs updating to recognize new characters.
 *
 * Instead of relying on that, we keep the behavior of assuming
 * one cell even for glibc's implementation, which is still
 * highly accurate and less of a headache overall.
 */
Add a wrapper of wcwidth() that picks the best implementation
This adds a i_wcwidth() function that replaces mk_wcwidth(), and a
'wcwidth_implementation' setting to pick which one it wraps.

Values:

- old: uses our local mk_wcwidth() which implements unicode 5.0
- system: uses the libc-provided wcwidth(), which may be better or worse
  than ours depending on how up to date the system is.
- auto: tests the system one against two characters that became
  fullwidth in unicode 5.2 and 9.0 respectively. If either of them pass,
  pick the system implementation, otherwise pick ours.

It defaults to auto.

mk_wcwidth() is still preferable in some cases, since the way it uses
ranges for fullwidth characters means most CJK blocks are covered even
if their characters didn't exist back then.

The "system" implementation is also wrapped to never return -1, but to
assume those unknown characters use one cell. Quoting the code:

    /* Treat all unknown characters as taking one cell. This is
     * the reason mk_wcwidth and other outdated implementations
     * mostly worked with newer unicode, while glibc's wcwidth
     * needs updating to recognize new characters.
     *
     * Instead of relying on that, we keep the behavior of assuming
     * one cell even for glibc's implementation, which is still
     * highly accurate and less of a headache overall.
     */
@dequis

This comment has been minimized.

Copy link
Member

dequis commented Aug 23, 2018

One way in which i saw #720 manifest:

  1. Have a glibc 2.26+ system (or two systems with unicode versions newer than 9)
  2. Do mosh localhost (or another server)
  3. Start irssi
  4. Paste one emoji (try 🔥)
  5. Write some text after it
  6. Press ^L to refresh
  7. Watch the cursor (and possibly other chars) move to the left.

It's invisible outside of mosh, which is interesting, but it's clearly our fault. Using CJK characters that mk_wcwidth knows about, such as , does not show this issue.

The other CJK character I used in the system wcwidth tests of this PR is 🈀, added in unicode 5.2, and shows the same bad behavior.

@ailin-nemui

This comment has been minimized.

Copy link
Contributor

ailin-nemui commented Aug 23, 2018

nice!

I wonder how this might confuse the textbuffer view's line cache when you change the implementation back and forth?

@ailin-nemui ailin-nemui force-pushed the dequis:wcwidth-wrapper branch from 3c7be6f to b95ce3e Aug 23, 2018

@ailin-nemui

This comment has been minimized.

Copy link
Contributor

ailin-nemui commented Aug 23, 2018

I think I fixed it...

@dequis

This comment has been minimized.

Copy link
Member

dequis commented Aug 24, 2018

Neat, thanks for the changes.

I'm not really sure why we need utf8proc, but i guess it would make sense to force newer unicode in mac os whose libc is stuck in the past forever. The problem with doing that is that it might make mismatches with the terminal / tmux / mosh / etc more likely. I guess we could just document the risks and benefits.

@ailin-nemui

This comment has been minimized.

Copy link
Contributor

ailin-nemui commented Aug 24, 2018

sometimes you have a remote server and a local terminal emulator who disagree (different libc)

but utf8proc also cannot solve this properly

@dequis

This comment has been minimized.

Copy link
Member

dequis commented Aug 24, 2018

Oh right i'm the only weirdo who has an arch linux server (wasn't my choice). I guess it helps for those cases.

@ailin-nemui

This comment has been minimized.

Copy link
Contributor

ailin-nemui commented Aug 26, 2018

@jillest
Copy link
Member

jillest left a comment

I suppose this conceptually makes sense, given that many terminal emulators seem to find it more important to be "correct" and up to date rather than to match the text-based applications exactly.

@@ -31,6 +31,7 @@
#include "gui-printtext.h"

static int window_create_override;
int wcwidth_impl;

This comment has been minimized.

@jillest

jillest Aug 26, 2018

Member

Any reason for this not to be static?


#ifdef HAVE_LIBUTF8PROC
case WCWIDTH_IMPL_JULIA:
wcwidth_impl_func = (WCWIDTH_FUNC) &utf8proc_charwidth;

This comment has been minimized.

@jillest

jillest Aug 26, 2018

Member

Calling a function through an incompatibly cast function pointer is undefined behaviour (in practice, control flow graph enforcement and attributes like dllimport seem scariest). This can be avoided via a wrapper function.

This comment has been minimized.

@dequis

dequis Aug 27, 2018

Member
int utf8proc_charwidth(int32_t codepoint);
typedef guint32 unichar;
typedef int (*WCWIDTH_FUNC) (unichar ucs);

Hm. I guess the only difference is signedness.

51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/

#define _XOPEN_SOURCE

This comment has been minimized.

@jillest

jillest Aug 26, 2018

Member

This requests a rather old standard (UNIX95), which also ties it to C89. Please remove it if possible.

Fortunately, FreeBSD ignores this value (it only recognizes non-empty values greater than or equal to 500).

This comment has been minimized.

@dequis

dequis Aug 27, 2018

Member

I took that from the man page of wcwidth. Any idea of better feature flags to make wcwidth available?

This comment has been minimized.

@ailin-nemui

ailin-nemui Aug 27, 2018

Contributor

maybe _GNU_SOURCE

dequis and others added some commits Aug 27, 2018

@ailin-nemui ailin-nemui merged commit d93cd63 into irssi:master Sep 4, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment