Add Unicode Support #5

Closed
ndd7xv opened this issue Aug 9, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@ndd7xv
Owner

ndd7xv commented Aug 9, 2022

Suggested in a Reddit comment, but putting it here if there are any ideas.

I think it'll be complicated to do this, and I don't think many hex editors support Unicode. Files inspected with a hex editor (such as binaries or specific file formats) often contain arbitrary bytes that are neither printable ASCII nor valid Unicode text. Thus, there probably needs to be a way to differentiate between a byte stream that represents Unicode text and one that is just arbitrary bytes.

There may also be other concerns that aren't coming to mind right now.

@ndd7xv ndd7xv added the enhancement New feature or request label Aug 9, 2022
@0b11001111
Contributor

0b11001111 commented Oct 31, 2022

I used the weekend to dive a bit into the topic and it turns out this is a bit more complicated than ASCII support -- surprise :D

A few things I've noticed:

  • Unicode is not itself an encoding. There are several encodings for Unicode, such as UTF-8 and UTF-16
  • Both of these are variable-length encodings, which means one character may occupy more than one byte (in UTF-16 it always does: every character takes at least two bytes)
  • This makes the mapping of byte x to character y non-trivial, unless you decode character by character
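
To make the variable-length point concrete, here is a small sketch using only the Rust standard library. The helper name `encoded_sizes` and the sample string are made up for illustration; `char_indices`, `len_utf8` and `len_utf16` are standard methods.

```rust
/// For each character of `text`, returns its byte offset in the UTF-8 data,
/// the character itself, its UTF-8 length in bytes and its UTF-16 length in bytes.
/// (Hypothetical helper, not part of heh.)
fn encoded_sizes(text: &str) -> Vec<(usize, char, usize, usize)> {
    text.char_indices()
        // `len_utf16` counts 16-bit code units, so multiply by 2 to get bytes.
        .map(|(offset, c)| (offset, c, c.len_utf8(), c.len_utf16() * 2))
        .collect()
}

fn main() {
    for (offset, c, utf8, utf16) in encoded_sizes("aé€💩") {
        println!("byte offset {offset}: {c:?} takes {utf8} byte(s) in UTF-8, {utf16} in UTF-16");
    }
}
```

The byte offsets do not advance by one per character, which is why a hex view cannot simply pair the n-th byte with the n-th character.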

I came up with two basic approaches to tackle the issue.

  1. Use something like String::from_utf8_lossy, put the whole line/paragraph in it, and accept the ragged margin you get as a result as well as decoding errors at the endpoints of your byte slice. I haven't really followed this idea, but I don't think it is promising.
  2. Decode character by character and here's how
    1. Given a slice of bytes, try parsing them from the start using str::from_utf8
      • on success: cool, you've just decoded the whole slice
      • on failure: the error contains information about how many bytes were parsed successfully until the error happened. We yield the successfully parsed substring along with its offset in the slice.
      • advance the offset past the parsed bytes and the offending byte, then repeat the procedure
    2. Now we have a stream of successfully parsed substrings and their byte offsets. This can be turned into a stream of characters, each corresponding to a byte in the original slice.
      • a character in a substring simply becomes a character in the stream
        • if a character occupies more than one byte, it will be followed by size-1 dummy characters (need good ideas which one to use here, currently it is '•')
      • bytes outside the successfully parsed substrings get represented by a dummy character (just like now)
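
The second approach can be sketched roughly as follows. This is a minimal illustration of the steps above, not the code from the linked change: the function names `valid_chunks` and `byte_aligned_chars` and the `'.'` placeholder for undecodable bytes are assumptions, while `str::from_utf8` and `Utf8Error::valid_up_to` are standard library APIs.

```rust
use std::str;

/// Placeholder after a multi-byte character (the '•' mentioned above).
const CONTINUATION: char = '•';
/// Placeholder for bytes that are not part of valid UTF-8 (assumed choice).
const INVALID: char = '.';

/// Step 1: split `bytes` into valid UTF-8 substrings paired with their byte offsets.
fn valid_chunks(bytes: &[u8]) -> Vec<(usize, &str)> {
    let mut chunks = Vec::new();
    let mut offset = 0;
    while offset < bytes.len() {
        match str::from_utf8(&bytes[offset..]) {
            Ok(s) => {
                // The rest of the slice decoded cleanly.
                chunks.push((offset, s));
                break;
            }
            Err(e) => {
                let valid = e.valid_up_to();
                if valid > 0 {
                    // The prefix was just reported as valid, so this cannot fail.
                    let s = str::from_utf8(&bytes[offset..offset + valid]).unwrap();
                    chunks.push((offset, s));
                }
                // Skip the valid prefix plus one offending byte, then retry.
                offset += valid + 1;
            }
        }
    }
    chunks
}

/// Step 2: produce exactly one display character per input byte.
fn byte_aligned_chars(bytes: &[u8]) -> Vec<char> {
    // Start with every byte marked as undecodable, then fill in what decoded.
    let mut out = vec![INVALID; bytes.len()];
    for (offset, s) in valid_chunks(bytes) {
        for (i, c) in s.char_indices() {
            out[offset + i] = c;
            // Pad multi-byte characters so the columns stay aligned with the bytes.
            for k in 1..c.len_utf8() {
                out[offset + i + k] = CONTINUATION;
            }
        }
    }
    out
}

fn main() {
    // "hi", a stray 0xFF, the three bytes of '€', and a truncated sequence.
    let bytes = [b'h', b'i', 0xFF, 0xE2, 0x82, 0xAC, 0xC3];
    let line: String = byte_aligned_chars(&bytes).into_iter().collect();
    println!("{line}"); // prints "hi.€••."
}
```

Skipping one byte at a time after an error keeps the logic simple at the cost of re-scanning. Note that this sketch passes valid control characters (such as a newline) through unchanged, so a real implementation would still need to filter those before drawing.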

The latter sort of works but before seriously considering it, a few problems have to be sorted out:

  • Monospaced fonts work great for ASCII but 💩 looks wider in my terminal (and others may appear narrower)
  • There is a lot of weird stuff in Unicode, especially additional control codes, that needs to be tested and handled
  • How to deal with valid Unicode that cannot be displayed by your font?
  • How to reliably detect whether the terminal in use supports Unicode?
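
For the control-code point, one possible starting place is the standard library's `char::is_control`. The sketch below assumes a `'.'` placeholder and a made-up helper name `printable`; it only covers the "control" general category (C0, DEL and C1 codes), so format characters such as zero-width spaces or bidi overrides still pass through and would need separate handling.

```rust
/// Replaces control characters with a placeholder so they cannot disturb the
/// terminal layout. (Hypothetical helper; '.' is an assumed placeholder.)
fn printable(c: char) -> char {
    if c.is_control() {
        '.'
    } else {
        c
    }
}

fn main() {
    let decoded = "ok\n\u{85}\u{200B}!";
    let shown: String = decoded.chars().map(printable).collect();
    // The newline and U+0085 are replaced; the zero-width space U+200B is not.
    println!("{:?}", shown);
}
```

This does nothing about display width or missing glyphs; those would likely need width tables or font information that the standard library does not provide.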

Here's a screenshot of my progress:
[Screenshot]
(On the left: my modified version of heh, note the overflow in line one! On the right: original heh.)

If I find the time I'll polish the current state a bit and push it :) See #36
