Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

comp_ui.NiceDisplay should take into account unicode character width #269

Closed
andychu opened this issue Mar 24, 2019 · 31 comments
Closed

comp_ui.NiceDisplay should take into account unicode character width #269

andychu opened this issue Mar 24, 2019 · 31 comments

Comments

@andychu
Copy link
Contributor

andychu commented Mar 24, 2019

@jyn514 I'm pretty sure this is the issue you just mentioned in #257. There are two separate bugs:

  • If you keep typing past the end of the terminal, OSH has no idea (but zsh and fish do). This is Detect when the user types past the last column in the terminal #257.
  • If your prompt contains a unicode character that is say 4 bytes long in utf-8, the code will currently move right FOUR cursor positions instead of ONE. I think that explains the screenshot, since you have a unicode character there.

I think we can just call the POSIX function wcwidth(), although I guess that involves some conversion to wchar_t, which is annoying:

http://man7.org/linux/man-pages/man3/wcwidth.3.html

There is a pure Python version of it, but I'm not sure if we should use it.

https://pypi.org/project/wcwidth/#files

@andychu
Copy link
Contributor Author

andychu commented Mar 24, 2019

It looks like wchar_t might introduce some portability problems? Not sure how much of an issue this is in practice.

https://en.wikipedia.org/wiki/Wide_character#C/C++

Is there a version wcwidth() that works on utf-8 data?

Note: this function might be useful for short-circuiting?

http://man7.org/linux/man-pages/man3/iswprint.3.html

@andychu
Copy link
Contributor Author

andychu commented Mar 24, 2019

Fairly related, since OSH essentially does this!

https://www.openbsd.org/papers/eurobsdcon2016-utf8.pdf

The OpenBSD base system supportsexactly two LC_CTYPE locales:

  1. UTF-8
  2. C=POSIX = US-ASCII

We don’t even support ISO-LATIN-1anylonger in the base system.

Apparently there is mbtowc(3), but it's not multi-threaded! Gah... author recommends not using the reentrant versions?

@andychu
Copy link
Contributor Author

andychu commented Mar 24, 2019

Here's an interesting alternative:

https://unix.stackexchange.com/questions/245013/get-the-display-width-of-a-string-of-characters

We could just print the prompt, and query cursor positions before and after! Then we know exactly how many displayable characters it is. Downsides?

Querying the cursor position is probably also necessary for #257, for proper wrapping.

@jyn514
Copy link
Contributor

jyn514 commented Mar 24, 2019

Why do we need multithreading? Shouldn't we have at most one main thread at a time?

@andychu
Copy link
Contributor Author

andychu commented Mar 24, 2019

Yeah we don't really need multi-threading in the near term, but functions that use globals are a "smell" to me ... it's not what I'm used to.

Some people have asked for Oil to be a Lua-like library, in which case it should use no globals. This would be nice in theory, but it's way down on the priority list now.

It's possible that the prompt could use globals but it's separate from the OSH interpreter as a library, though.


Although querying the cursor position might seem like a hack, I think it might be the right thing for the case of the prompt, if not in general.

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

Since this is the last blocker from me using osh as my fulltime shell, I think I'll take a look at it.

Unfortunately I don't think this is going to play nicely with readline escape codes (https://superuser.com/questions/301353/escape-non-printing-characters-in-a-function-for-a-bash-prompt). The python wcwidth library returns -1 when it encounters non-printing characters and doesn't work for single characters, so we couldn't just modify the existing loop in core/comp_ui.py:

$ printf '(\x01\x1b[0;33m\x02osh\x01\x1b[0;0m\x02) ' | python -c 'import wcwidth; print(wcwidth.wcswidth(input()))'
-1
$ printf '▶️' | python -c 'import wcwidth; print(wcwidth.wcwidth(input()))'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/joshua/.local/lib/python3.6/site-packages/wcwidth/wcwidth.py", line 157, in wcwidth
    ucs = ord(wc)
TypeError: ord() expected a character, but string of length 2 found

I think your idea of querying readline for the cursor position is the best way to do this without becoming platform dependent, but I'm not sure how to do that ...

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

It looks like line_span is an offender as well: I see a lot of calls to arena.AddLineSpan that use len(line), but len counts the number of bytes, not their width:

$ python -c 'print(len("▶️"))'
2

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

@jyn514 I heard of that ability here:

http://ballingt.com/rich-terminal-applications-2/

Finding this is a bit tricky. We can get at it by querying the terminal for the cursor’s position with \x1b[6n, but it’s a bit of a hassle because the response comes back from the terminal on stdin and needs to be distinguished from user input.

(BTW there are several blog posts on that site about terminals worth reading)


I'm not sure if this is the best way to do it, since I didn't try it. Other shells seem to use wcwidth.

Can we just write our own wcwidth wrapper? I think wcwidth.py is pure Python.

But we could put a wrapper in native/libc.c.

Oh I remember the issue is that we don't use wchar_t in OSH. We use utf-8 encoded strings. But I think that is solveable? let me think...

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

Oh sorry I didn't read back on this thread... that was already discussed.

I'm fine with using mbtowc(3) if it works, despite the globals...

Also the stackexchange link above talks about the cursor position.

I think the "safer" thing to do is probably to try wcwidth... But either one seems doable with enough effort (?)

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

Oh I guess a gotcha for mbtowc is what happens when the locale isn't utf-8 or c-posix ?

The complication is that OSH is basically the following the OpenBSD utf-8 philosophy, whereas most shells use fixed width representations in memory.


(However, they do it very badly.

Take a look at how bash's length operator behaves... It basically gives you garbage when there are encoding errors. The number of characters it reports is not a monotonically increasing function of the number of bytes!!!

https://github.com/oilshell/oil/blob/master/spec/var-op-len.test.sh#L18
)

So OSH is somewhat of an outlier, but it also has more of a chance of being correct.

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

FWIW if you're not familiar with these Unicode issues (I wasn't until recently and still have more to learn), I just updated this section of the FAQ, and it has some relevant links:

http://www.oilshell.org/blog/2018/03/04.html#faq

http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/

We can chat about it on Zulip too

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

Ok I have this working using the wcwidth python library. How do you like to add libraries to the project? I know it gets compiled so we can't use pip, it's short enough to copy the whole thing into pylib if we need to.

jyn514 added a commit to jyn514/oil that referenced this issue Jun 22, 2019
Addresses oilshell#269.
Uses the  [wcwidth](https://pypi.org/project/wcwidth/) python library to
find the display width of a character.
jyn514 added a commit to jyn514/oil that referenced this issue Jun 22, 2019
Addresses oilshell#269.
Uses the  [wcwidth](https://pypi.org/project/wcwidth/) python library to
find the display width of a character.
@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

The problem is that we don't have Python's unicode types in the release build! It will work in the dev build because that's running with a plain Python interpreter, e.g. /usr/bin/python2.

I shaved the Python interpreter down to this:

http://www.oilshell.org/release/0.6.pre22/metrics.wwz/line-counts/nativedeps.txt

e.g. there's a stringobject.c but no unicodeobject.c

This blog post talks about that:

http://www.oilshell.org/blog/2018/11/15.html

and the conclusion is

Oil is now more like a C program, and less like a Python program.

And this is not just for vanity / minimalism. Speed is probably the top problem right now! Python is too slow, e.g. the benchmarks show that running a configure script is 6x slower :-(

http://www.oilshell.org/release/0.6.pre22/benchmarks.wwz/osh-runtime/

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

But it's good that this works for your use case! That means we can just write libc.wcwidth(), which does utf-8 decoding from a Python string to wchar_t, and then calls libc's wcwidth function.

The way I would start is to copy your prompt into native/libc_test.py, and then iterate on libc.wcwidth(myprompt) until it matches what the Python version does:

This command can quickly iterate on it. (Although you might want to re-enable stdout; it's piped to /dev/null)

build/dev.sh pylibc

I would love that as a contribution because I bet a lot of other people have unicode prompts too!

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

FWIW using the Python/C API should be pretty straightforward for this case (although I find it hard in general)

The libc.fnmatch() function already takes a string and returns a boolean. Taking a string and returning an integer is pretty much the same, so that could be used as a template for libc.wcwidth().

https://github.com/oilshell/oil/blob/master/native/libc.c#L396

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

The tricky thing here is wcwidth assumes you have a valid unicode character already. I'm not sure how to parse ' \xe2\x96\xb6\xef\xb8\x8f ' into the 4 unicode characters u' \u25b6\ufe0f '

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

If I understand correctly, that's what mbtowc does.

So the Python version of wcwidth is basically just a wrapper tha calls mbtowc and then wcwidth. It should do no work on its own. Basically just error checking and so forth.

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

I'm kind of stumped. I wrote a wrapper that seems right to me, but it keeps trying to decode the arguments as ASCII and I'm not sure why.

static PyObject *
func_wcswidth(PyObject *self, PyObject *args){
    const char *string;
    if (!PyArg_ParseTuple(args, "es", "utf-8", &string)) {
        return NULL;
    }
    printf("made it past argparsing, string is %s\n", string);
    int len = mbstowcs(NULL, string, 0);
    printf("len is %d", len);
    if (len == -1) {
        PyErr_SetString(PyExc_UnicodeError, "Invalid UTF-8 string");
        return NULL;
    }
    wchar_t unicode[len + 1];
    mbstowcs(unicode, string, len + 1);
    printf("unicode is %ls\n", unicode);
    int width = wcswidth(unicode, len + 1);
    PyMem_Free(string);
    return PyInt_FromLong(width);
}

When I call it with libc.wcswidth("▶️") I get the following traceback, which happens before it even finishes parsing the arguments:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

PyArg_ParseTuple shouldn't be doing any decoding (is that Python 3?) You should just get a raw char* and then pass it to mbstowc (if I understand correctly, I haven't actually done it before)

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

That is, you go from Python string -> char* -> wchar_t -> int essentially. Python should not do any decoding or encoding. Using libc is convenient because it's available everywhere "for free".

(I'll be offline without Github access for until later tonight.)

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

Oh I'm silly it's the same reason that unicode doesn't work, it's not compiled in

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

Ok yeah now I'm getting different errors :P

@andychu
Copy link
Contributor Author

andychu commented Jun 22, 2019

I think the invocation from fnmatch is pretty close to what we want:

https://github.com/oilshell/oil/blob/master/native/libc.c#L59

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

I'm not sure this is my end, mbstowcs is returning -1 for valid unicode strings :(

static PyObject *
func_wcswidth(PyObject *self, PyObject *args){
    char *string;
    if (!PyArg_ParseTuple(args, "s", &string)) {
        return NULL;
    }
    printf("made it past argparsing, string is %s\n", string);
    int len = mbstowcs(NULL, string, 0);
    printf("len is %d\n", len);
    if (len == -1) {
        PyErr_SetString(PyExc_UnicodeError, "Invalid UTF-8 string");
        return NULL;
    }
    wchar_t unicode[len + 1];
    mbstowcs(unicode, string, len + 1);
    printf("unicode is %ls\n", unicode);
    int width = wcswidth(unicode, len + 1);
    //PyMem_Free(string);
    return PyInt_FromLong(width);
}
>>> import libc; libc.wcswidth("▶️")
made it past argparsing, string is ▶️
len is -1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeError: Invalid UTF-8 string

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

@andychu I tried the call from fnmatch earlier and it gave me the traceback I just posted, I didn't realize it was coming from my code instead of the python library

@jyn514
Copy link
Contributor

jyn514 commented Jun 22, 2019

stackoverflow says this is platform dependent https://stackoverflow.com/questions/21120965/converting-a-utf-8-text-to-wchar-t

@andychu
Copy link
Contributor Author

andychu commented Jun 23, 2019

I copied the example here and it worked:

http://man7.org/linux/man-pages/man3/mbstowcs.3.html

https://github.com/oilshell/blog-code/tree/master/libc-unicode

But you do have to do setlocale() utf-8, even though that's my system's locale.

I am sort of uncomfortable with the globals... but this might be the way to do it.

I guess we could write a test with utf-8 and ucs2/utf-16 or whatever as the default encoding, and see what happens.

I only care about utf-8 for Oil, but I don't want it to do something fantastically buggy if that's not your locale...


The other thing we can do is decode utf-8 ourselves, which is not hard. The wcwidth() thing is hard, so we want to reuse that, but we don't have to do reuse decoding.

There is actually code to do that in osh/string_ops.py. It's not quite decoding but it's close -- it counts utf-8 chars, which I guess is not enough here.

(It would also be nice to move that to C at some point, but that's a different story than getting it working.)

@andychu
Copy link
Contributor Author

andychu commented Jun 23, 2019

FWIW if you want to geek out there are a lot of tiny utf-8 decoders here:

https://news.ycombinator.com/item?id=15423674

e.g.

https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

(I don't vouch for that code, it's not the most straightforward code, but it shows how small the problem is)

Writing our own or copying one into native/libc.c as a shortcut is not a horrible idea, I think... it is a small enough calculation that it shouldn't have any bugs (at least that was my experience with the Python implementation).

@jyn514
Copy link
Contributor

jyn514 commented Jun 23, 2019

Hmm, maybe the best way would be to save the original locale, setlocale to UTF8, convert the string, and then set it back?

@jyn514
Copy link
Contributor

jyn514 commented Jun 23, 2019

Got this working!

@andychu
Copy link
Contributor Author

andychu commented Jul 27, 2019

Released with 0.7.pre1 I think

@andychu andychu closed this as completed Jul 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants