Serial Console doesn't handle unicode characters properly #797

olivier-boesch · 2019-03-24T22:35:37Z

If you are reporting a bug, we would like to know:

What you were trying to do,
Display text sent by a CircuitPython script on the serial console in Adafruit mode.
What steps you took to make this happen,
Opened the serial console only.
What you expected to happen,
When I write '°' or '\u00b0' in a print statement in my script, I expect to see '°'.
That is what I see if I connect with PuTTY.
What actually happened,
I see 'Â°' in the serial console.
Why this difference is problematic (it may not be a bug!),
Unicode characters are not handled properly.
Technical details like the version of Mu you're using, your OS version and
other aspects of the context in which Mu was running.
Installed via the installer (64-bit version), Windows 10, Mu version 1.0.2.

That said, it's good software. Thanks.

carlosperate · 2019-04-16T00:02:44Z

Thanks for the report @olivier-boesch.
I can also replicate this on the micro:bit mode.

Quick notes:

° in UTF-8 is 0xC2 0xB0 and in UTF-16 is 0x00B0
- http://www.fileformat.info/info/unicode/char/b0/index.htm
Â in UTF-8 is 0xC3 0x82 and in UTF-16 is 0x00C2
- https://www.fileformat.info/info/unicode/char/00c2/index.htm
So it looks like each UTF-8 byte sent by MicroPython is being decoded on its own as a UTF-16 character: 0xC2 becomes Â and 0xB0 becomes °.
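
The byte values in these notes can be checked in a Python shell. This is only a sketch of the symptom, assuming the terminal maps every incoming byte to its own code point (which is what Latin-1 decoding does); it is not taken from Mu's source.

```python
# Encode the degree sign the way the board sends it over serial.
degree = "\u00b0"
raw = degree.encode("utf-8")
assert raw == b"\xc2\xb0"

# Assumption: the terminal turns each byte into one character,
# which is equivalent to decoding the bytes as Latin-1.
wrong = raw.decode("latin-1")

# Decoding the two bytes together as UTF-8 gives the intended character.
right = raw.decode("utf-8")

print(repr(wrong), repr(right))
```

With these values, `wrong` is the two-character string 'Â°' reported in the issue and `right` is the single character '°'.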

carlosperate · 2020-08-03T17:25:55Z

As this was one of the older issues for this problem, and there were already a few internal and external GitHub issues linking here, I've decided to unify all the duplicates we have into this one.

I've also updated the issue title to be more generic, as this affects all serial terminals (CircuitPython, micro:bit and MicroPython).

carlosperate · 2020-08-03T17:52:42Z

We currently have two PRs looking into this:

Both need to be expanded to deal with incomplete multi-byte unicode characters at the beginning and end of the data array. I've looked into this, but didn't quite finish it. I'll try to push my current status before the end of this week (can't today as it's my birthday 🥳 ).

I've also run a benchmark on a couple of different implementations to see what would be faster. Mu currently struggles to process serial data coming without interruption at 115200 baud, so I didn't want to make this worse. Good news is a version based on @k0d's implementation is actually faster than the original implementation before any fix, as it decodes the entire data array at the beginning (otherwise it does decoding on smaller chunks when trying to regex VT100 commands).

@dybber We will also need to agree on a merge order with #1026 as it is touching the same area and some aspects of keeping bytes from previous iterations will have to be combined.

dybber · 2020-08-03T18:52:01Z

As mentioned elsewhere, I believe it's better to start from what I've done in #1026, and add unicode support from there, than to try and merge the two branches later on. I think it will become a mess to figure out how to do such a merge.

In that branch I already added support for receiving partial input and leaving unprocessed input for the next call to process_bytes, as the same issue happens when we receive multi-byte VT100 codes split over multiple calls to process_bytes. (https://github.com/dybber/mu/blob/bugfix/replcursor_movement/mu/interface/panes.py#L362)
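
To illustrate the idea of carrying unprocessed input over to the next call, here is a rough sketch. The regex and the `split_partial_escape` helper are hypothetical names made up for this example and are not the code in the linked branch; it only considers CSI-style sequences (ESC, "[", digits and semicolons, then a final letter).

```python
import re

# Hypothetical pattern: an ESC at the very end of the text, optionally
# followed by "[" and parameter characters, but with no final letter yet.
PARTIAL_CSI_AT_END = re.compile(r"\x1b(\[[0-9;]*)?\Z")


def split_partial_escape(text):
    """Return (ready, leftover); leftover is an unfinished trailing escape code."""
    match = PARTIAL_CSI_AT_END.search(text)
    if match is None:
        return text, ""
    return text[:match.start()], text[match.start():]


# Simulate "ESC[2K" (erase line) arriving split across two calls.
leftover = ""
ready1, leftover = split_partial_escape(leftover + "abc\x1b[")
ready2, leftover = split_partial_escape(leftover + "2Kdef")
```

The first call holds back the unfinished code, and the second call sees it whole once the rest of the bytes arrive.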

dybber · 2020-08-03T19:02:24Z

On the other hand, it wasn't terribly difficult, actually. I just pushed a commit to my replcursor_movement branch that seems to fix it, check dybber@810aa71

carlosperate · 2020-08-04T10:27:40Z

Yeah, my main concern was performance, as this is already a very busy loop that hangs the entire UI if the incoming serial data is large enough. For example, on an i7 MacBook this will max out my CPU and hang the Mu UI until I unplug the micro:bit or I kill the process, so I suspect lower spec computers will struggle with less:

from microbit import *

while True:
    print(help())
    sleep(20)

codecs.getincrementaldecoder looks promising, I'll add it to the benchmark and see how well it does.
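
For reference, a minimal sketch of how `codecs.getincrementaldecoder` copes with a character split across two serial reads. The `decode_chunk` wrapper is a hypothetical name for this example, and `errors="replace"` is an assumption about how invalid bytes should be shown, not something decided in this thread.

```python
import codecs

# One decoder instance is kept across calls so it can remember partial bytes.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")


def decode_chunk(data):
    """Decode one chunk of serial bytes, holding back an incomplete tail."""
    return decoder.decode(data)


# '°' is 0xC2 0xB0; simulate it arriving split over two reads.
first = decode_chunk(b"21\xc2")
second = decode_chunk(b"\xb0C")
```

The trailing 0xC2 is buffered by the decoder rather than emitted, so `first` contains only "21" and the degree sign appears at the start of `second`.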
I should probably test #1026 as well, see how it affects the current timings.

carlosperate · 2020-08-11T23:25:36Z

Okay, I've added the benchmark source code in this gist (it might look like a lot of code, but each "option" file is the original process_bytes method with different UTF-8 decoding added): https://gist.github.com/carlosperate/1dfcdc9823646e5983b92419ea13bdc1

There are 5 implementations, but really it's only comparing 3 different methods:

Option 1: Decodes UTF-8 characters one by one at the end of the loop iteration
Option 2: Manually decodes full data byte array first
Option 3: Uses the standard library codecs.getincrementaldecoder

The results for those are (in seconds):

Original 10k runs: 2.9971564229927026
Option 1 10k runs: 4.138417423993815
Option 2 10k runs: 1.7315781200304627
Option 3 10k runs: 1.8996449070400558

Option 1 was clearly going to be the worst implementation, which is why I started looking into Option 2 (looking at the UTF-8 bits to figure out if we have incomplete characters at the beginning or end of the byte array), so unsurprisingly Option 2 does much better.

@dybber I'm very glad you found codecs.getincrementaldecoder, which is used in Option 3. Even though it is slower than Option 2, it is better to leverage the rock-solid standard library implementation instead of manually looking at the UTF-8 bits as in Option 2.
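
As a rough picture of what "manually looking at the UTF-8 bits" involves, here is a sketch that only handles an incomplete character at the end of the array. The `split_incomplete_utf8` helper is a hypothetical name and this is not the code from the gist; the remainder it returns would need to be prepended to the next chunk.

```python
def split_incomplete_utf8(data):
    """Return (complete, remainder); remainder is a trailing partial character."""
    # The lead byte of the last character is at most 4 bytes from the end.
    for back in range(1, min(4, len(data)) + 1):
        byte = data[-back]
        if byte & 0b1100_0000 == 0b1000_0000:
            continue  # continuation byte (10xxxxxx), keep looking back
        # Found an ASCII or lead byte: work out how long its sequence should be.
        if byte & 0b1000_0000 == 0:
            needed = 1  # 0xxxxxxx
        elif byte & 0b1110_0000 == 0b1100_0000:
            needed = 2  # 110xxxxx
        elif byte & 0b1111_0000 == 0b1110_0000:
            needed = 3  # 1110xxxx
        elif byte & 0b1111_1000 == 0b1111_0000:
            needed = 4  # 11110xxx
        else:
            needed = 1  # invalid lead byte, leave it for the decoder
        if needed > back:
            return data[:-back], data[-back:]
        break
    return data, b""


# The first byte of '°' (0xC2 0xB0) arrives without its second byte.
complete, remainder = split_incomplete_utf8(b"21\xc2")
```

Even this small version needs several bit masks and edge cases, which is the maintenance cost the standard library decoder avoids.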

The other two measurements on the benchmark were based on the process_bytes method in #1026.
Option 4 is the original code in the PR as a baseline, and Option 5 adds a couple of minor performance improvements:

Option 3 10k runs: 1.8996449070400558

Option 4 10k runs: 2.3355612590094097
Option 5 10k runs: 2.0476765239727683

@dybber I've created PR dybber#1 in your fork to include these changes in #1026.

dybber · 2020-08-17T18:18:02Z

Great, thanks, I hope we can soon get it merged into Mu.

dybber · 2020-09-29T06:36:03Z

My branch fixing this is now merged, so I will close this issue.

carlosperate mentioned this issue Apr 16, 2019

Decode incoming REPL data as UTF-8. #817

Closed

tannewt mentioned this issue Jul 17, 2020

_stage.Text does not support characters like ░, ▒, and ▓ adafruit/circuitpython#3159

Closed

jatmega mentioned this issue Aug 3, 2020

æøå ÆØÅ becoming Ã¦Ã¸Ã¥ ÃÃÃ in serial in adafruit mode #1085

Closed

carlosperate changed the title ~~Serial Console in adafruit mode doesn't handle unicode characters properly~~ Serial Console doesn't handle unicode characters properly Aug 3, 2020

carlosperate mentioned this issue Aug 3, 2020

wrong Circuitpython Umlaute äöüß #759

Closed

carlosperate mentioned this issue Aug 12, 2020

Support mouse movement, selection, and fix several bugs in micro:bit & ESP-mode REPL #1026

Merged

dybber added the modes label Sep 5, 2020

dybber mentioned this issue Sep 29, 2020

Fixes Mu-Terminal utf-8 encoding. Fixes #759 #1044

Closed

dybber closed this as completed Sep 29, 2020
