Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serial Console doesn't handle unicode characters properly #797

Closed
olivier-boesch opened this issue Mar 24, 2019 · 9 comments
Closed

Serial Console doesn't handle unicode characters properly #797

olivier-boesch opened this issue Mar 24, 2019 · 9 comments
Labels

Comments

@olivier-boesch
Copy link

If you are reporting a bug, we would like to know:

  • What you were trying to do,
    display text send by circuitpython script on serial console in adafruit mode.
  • What steps you took to make this happen,
    open serial console only
  • What you expected to happen,
    when i write '°' or '\u00b0' in a print statement in my script, i expect '°'
    I see this if I connect with putty.
  • What actually happened,
    I see '°' in the serial console
  • Why this difference is problematic (it may not be a bug!),
    unicode characters are not handled properly
  • Technical details like the version of Mu you're using, your OS version and
    other aspects of the context in which Mu was running.
    installed via installer(64bits version), windows 10, mu version 1.0.2

However that's a good software. Thanks.

@carlosperate
Copy link
Member

Thanks for the report @olivier-boesch.
I can also replicate this on the micro:bit mode.

Quick notes:

@carlosperate carlosperate changed the title Serial Console in adafruit mode doesn't handle unicode characters properly Serial Console doesn't handle unicode characters properly Aug 3, 2020
@carlosperate
Copy link
Member

As this was one of the older issues for this problem and there was already a few internal and external github issues linking here I've decided to unify all duplicates we have into this one.

I've also updated the issue title to be more generic, as this affects all serial terminals (CircuitPython, micro:bit and MicroPython).

@carlosperate
Copy link
Member

We currently have two PRs looking into this:

Both need to be expanded to deal with incomplete multi-byte unicode characters at the beginning and end of the data array. I've looked into this, but didn't quite finished it up, I'll try to push my current status before the end of this week (can't today as it's my birthday 🥳 ).

I've also run a benchmark on a couple of different implementations to see what would be faster. Mu currently struggles to process serial data coming without interruption at 115200 baud, so I didn't want to make this worse. Good news is a version based on @k0d's implementation is actually faster than the original implementation before any fix, as it decodes the entire data array at the beginning (otherwise it does decoding on smaller chunks when trying to regex VT100 commands).

@dybber We will also need to agree on a merge order with #1026 as it is touching the same area and some aspects of keeping bytes from previous iterations will have to be combined.

@dybber
Copy link
Collaborator

dybber commented Aug 3, 2020

As mentioned elsewhere, I believe it's better to start from what I've done in #1026, and add unicode support from there, than to try and merge the two branches later on. I think it will become a mess to figure out how to do such a merge.

In that branch I already added support for receiving partial input, and leaving unprocessed input for next call to process_bytes, as the same issue happens when we're receiving multibyte VT100-codes split over multiple calls to process_bytes. (https://github.com/dybber/mu/blob/bugfix/replcursor_movement/mu/interface/panes.py#L362)

@dybber
Copy link
Collaborator

dybber commented Aug 3, 2020

On the otherhand, it wasn't terrible difficult actually. I just made a commit that seems to fix it to my replcursor_movement branch, check dybber@810aa71

@carlosperate
Copy link
Member

Yeah, my main concern was performance, as this is already a very busy loop that hangs the entire UI if the incoming serial data is large enough. For example, on an i7 MacBook this will max out my CPU and hang the Mu UI until I unplug the micro:bit or I kill the process, so I suspect lower spec computers will struggle with less:

from microbit import *

while True:
    print(help())
    sleep(20)

codecs.getincrementaldecoder looks promising, I'll add it to the benchmark and see how well it does.
I should probably test #1026 as well, see how it affects the current timings.

@carlosperate
Copy link
Member

carlosperate commented Aug 11, 2020

Okay, I've added the benchmark source code in this gist (it might look like a lot of code, but each "option" file is the original process_bytes method with different UTF-8 decoding added): https://gist.github.com/carlosperate/1dfcdc9823646e5983b92419ea13bdc1

There are 5 implementations, but really it's only comparing 3 different methods:

  • Option 1: Decodes UTF-8 characters one by one at the end of the loop iteration
  • Option 2: Manually decodes full data byte array first
  • Option 3: Uses standard library codecs.incrementaldecoder

The results for those are (in seconds):

Original 10k runs: 2.9971564229927026
Option 1 10k runs: 4.138417423993815
Option 2 10k runs: 1.7315781200304627
Option 3 10k runs: 1.8996449070400558

Option 1 was clearly going to be the worst implementation, which why I started looking into Option 2 (looking at the UTF-8 bits to figure out if we have incomplete characters at the beginning or end of the byte array), so unsurprisingly Option 2 does much better.

@dybber I'm very glad you found codecs.incrementaldecoder, which is included in Option 3. Even though it is slower than Option 2 is better to leverage the rock-solid standard library implementation, instead of manually looking at the UFT-8 bits in Option 2.

The other two measurements on the benchmark were based on the process_bytes method in #1026.
Option 4 is the original code in the PR as a baseline, and Option 5 adds a couple of minor performance improvements:

Option 3 10k runs: 1.8996449070400558

Option 4 10k runs: 2.3355612590094097
Option 5 10k runs: 2.0476765239727683

@dybber I've created PR dybber#1 in your fork to include these changes in #1026.

@dybber
Copy link
Collaborator

dybber commented Aug 17, 2020

Great, thanks, I hope we can soon get it merged into Mu.

@dybber
Copy link
Collaborator

dybber commented Sep 29, 2020

My branch fixing this is now merged, and I will close this issue

@dybber dybber closed this as completed Sep 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants