Add Unicode Support #5

ndd7xv · 2022-08-09T04:32:46Z

Suggested in a Reddit comment, but putting it here if there are any ideas.

I think it'll be complicated to do this, and I don't think many hex editors support Unicode. Files inspected with a hex editor (such as binaries or specific file formats) often contain bytes that are neither printable ASCII nor valid Unicode text. Thus, there probably needs to be a way to differentiate between a byte stream that represents Unicode text and one that is just arbitrary bytes.
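
One rough way to tell the cases apart, sketched under assumptions: the `Content` enum and `classify` function below are hypothetical names, not anything in heh, and the text is assumed to be UTF-8. Keep in mind this is only a heuristic, since arbitrary bytes can happen to form valid UTF-8.

```rust
use std::str;

/// Hypothetical classification of a byte slice for display purposes.
#[derive(Debug, PartialEq)]
enum Content {
    /// Every byte is printable ASCII (0x20..=0x7E).
    PrintableAscii,
    /// Valid UTF-8 that contains more than printable ASCII.
    Utf8,
    /// Not valid UTF-8 as a whole.
    Binary,
}

fn classify(bytes: &[u8]) -> Content {
    if bytes.iter().all(|b| (0x20..=0x7E).contains(b)) {
        Content::PrintableAscii
    } else if str::from_utf8(bytes).is_ok() {
        Content::Utf8
    } else {
        Content::Binary
    }
}

fn main() {
    println!("{:?}", classify(b"heh"));
    println!("{:?}", classify("héh".as_bytes()));
    println!("{:?}", classify(&[0xFF, 0xFE, 0x00]));
}
```

Classifying a whole buffer like this is probably too coarse for a hex editor, where text and binary data are usually mixed, but it shows the basic validity check.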

There may also be other concerns that aren't coming to mind right now.

0b11001111 · 2022-10-31T08:43:12Z

I used the weekend to dive a bit into the topic and it turns out this is a bit more complicated than ASCII support -- surprise :D

A few things I've noticed:

Unicode is not the encoding. There are different encodings for Unicode, like UTF-8 or UTF-16.
These encodings are variable-length, which means one character may (read: will, for UTF-16) occupy more than one byte.
This makes the mapping of byte x to character y non-trivial, unless you decode character by character.
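
To make the variable-length point concrete, here is a small sketch using only the standard library. The helper name `encoded_sizes` is made up for illustration.

```rust
/// Returns (bytes in UTF-8, bytes in UTF-16) for a single character.
fn encoded_sizes(c: char) -> (usize, usize) {
    // len_utf16 counts 16-bit code units, so multiply by 2 to get bytes.
    (c.len_utf8(), c.len_utf16() * 2)
}

fn main() {
    for c in ['a', 'é', '€', '💩'] {
        let (utf8, utf16) = encoded_sizes(c);
        println!("{:?}: {} byte(s) in UTF-8, {} byte(s) in UTF-16", c, utf8, utf16);
    }
}
```

Even plain 'a' takes two bytes in UTF-16, and 💩 takes four bytes in both encodings, so a single byte offset rarely lines up with a single character.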

I came up with two basic approaches to tackle the issue.

Use something like String::from_utf8_lossy, put the whole line/paragraph into it, and accept the ragged margin you get as a result, as well as decoding errors at the endpoints of your byte slice. I haven't really pursued this idea, and I don't think it is promising.
Decode character by character. Here's how:
1. Given a slice of bytes, try parsing them from the start using str::from_utf8
  - on success: cool, you've just decoded the whole slice
  - on failure: the error contains information about how many bytes were parsed successfully until the error happened. We yield the successfully parsed substring along with its offset in the slice.
  - advance the offset past the error and repeat the procedure
2. Now we have a stream of successfully parsed substrings and their byte offsets. This can be turned into a stream of characters, each corresponding to a byte in the original slice.
  - a character in a substring simply becomes a character in the stream
    - if a character occupies more than one byte, it will be followed by size-1 dummy characters (need good ideas for which one to use here; currently it is '•')
  - bytes outside the successfully parsed substrings get represented by a dummy character (just like now)
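
The two steps above could be sketched roughly like this. It is a minimal illustration and not the code from #36: the function names, the '.' placeholder for undecodable bytes, and the choice to skip exactly one byte after an error are all assumptions.

```rust
use std::str;

/// Placeholder for the trailing bytes of a multi-byte character.
const CONTINUATION: char = '•';
/// Placeholder for bytes outside any valid UTF-8 sequence (assumed to be '.').
const INVALID: char = '.';

/// Step 1: split `bytes` into (offset, substring) pairs of valid UTF-8.
fn valid_chunks(bytes: &[u8]) -> Vec<(usize, &str)> {
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < bytes.len() {
        match str::from_utf8(&bytes[offset..]) {
            Ok(s) => {
                // The whole remainder decoded successfully.
                chunks.push((offset, s));
                break;
            }
            Err(e) => {
                // Number of bytes that decoded fine before the error.
                let valid = e.valid_up_to();
                if valid > 0 {
                    let s = str::from_utf8(&bytes[offset..offset + valid])
                        .expect("prefix was reported as valid UTF-8");
                    chunks.push((offset, s));
                }
                // Skip the valid prefix plus one offending byte, then retry.
                offset += valid + 1;
            }
        }
    }
    chunks
}

/// Step 2: produce exactly one display character per input byte.
fn byte_aligned_chars(bytes: &[u8]) -> Vec<char> {
    let mut out = vec![INVALID; bytes.len()];
    for (offset, s) in valid_chunks(bytes) {
        for (i, c) in s.char_indices() {
            out[offset + i] = c;
            // Pad multi-byte characters with size-1 dummy characters.
            for k in 1..c.len_utf8() {
                out[offset + i + k] = CONTINUATION;
            }
        }
    }
    out
}

fn main() {
    // 'a', 'é' (2 bytes), one invalid byte, '€' (3 bytes)
    let bytes = b"a\xC3\xA9\xFF\xE2\x82\xAC";
    let line: String = byte_aligned_chars(bytes).into_iter().collect();
    println!("{}", line); // prints "aé•.€••"
}
```

Because the output always has as many characters as the input has bytes, the character column stays aligned with the hex column, at least as long as every character is rendered one cell wide.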

The latter sort of works but before seriously considering it, a few problems have to be sorted out:

Monospaced fonts work great for ASCII, but 💩 looks wider in my terminal (and other characters may appear narrower)
There is a lot of weird stuff in Unicode, especially further control codes, that needs to be tested and handled
How to deal with valid Unicode that cannot be displayed by your font?
How to reliably detect whether the terminal in use supports Unicode?
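
For the control code problem, one possible starting point is the standard library's `char::is_control`. The helper `display_char` and the '.' placeholder are assumptions for illustration. Note that this only covers the Cc category; invisible format characters such as U+200B (zero width space) are not caught and would need separate handling, and the width and font problems are not addressed at all.

```rust
/// Hypothetical helper: choose what to draw for one decoded character.
fn display_char(c: char) -> char {
    if c.is_control() {
        // Covers C0 controls (U+0000..=U+001F), DEL (U+007F)
        // and C1 controls (U+0080..=U+009F).
        '.'
    } else {
        c
    }
}

fn main() {
    // A tab and an escape sequence that would otherwise mess up the terminal.
    let shown: String = "a\tb\u{1b}[0m".chars().map(display_char).collect();
    println!("{}", shown);
}
```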

Here's a screenshot of my progress:

(On the left: my modified version of heh, note the overflow in line one! On the right: original heh.)

~~If I find the time I'll polish the current state a bit and push it :)~~ See #36

ndd7xv added the enhancement New feature or request label Aug 9, 2022

0b11001111 mentioned this issue Oct 31, 2022

Unicode Support #36

Merged

ndd7xv closed this as completed Nov 14, 2022
