Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v3: Marshal escapes unicode (UTF-16) characters #737

Open
mikefarah opened this issue May 14, 2021 · 11 comments
Open

v3: Marshal escapes unicode (UTF-16) characters #737

mikefarah opened this issue May 14, 2021 · 11 comments

Comments

@mikefarah
Copy link

If a yaml file has unicode characters like:

a: 🛑

go-yaml manages to decode this fine (if we print the yaml.Nodes value) however when marshalling the document out it prints out the escaped unicode number instead of the representation.

See playground:

https://play.golang.org/p/NQlREuXFT7_q

@mikefarah
Copy link
Author

Digging around - seems like this has something to do with the logic in the yaml_emitter_write_double_quoted_scalar function. There's some funky rune logic in there that I can't quite follow - but if I change the function to just write(emitter, value, &i) then the output is as expected

@mikefarah
Copy link
Author

Ah it thinks the character is not printable is_printable in yamlprivateh.go returns false.

Again this logic is a little beyond me. @niemeyer

@elasticdotventures
Copy link

elasticdotventures commented May 25, 2021

NOTE: is_printable() almost certainly shouldn't be a hardcoded list, since the actual list of printable characters would vary based on a terminal, and afaik termcap doesn't even reliably emoji detection inappropriate for go-yaml tool to make judgements about my terminals display capabilities.

It's not relevant but (I'm running in an alacritty shell, into a WSLg native X11 emulation layer, emoji works great), very colourful for grep .. the bug was uncovered during the development of
b00t
which extensively uses mandarin pinyin and emoji for it's naming conventions.

IMHO the easiest possible case for this issue is to make is_printable something which has a far fewer characters in the is_printable() list, the code I'm referring to is here. I suppose the problem could also be a typo, but I'm hoping @niemeyer knows the source of the random codes here:

func is_printable(b []byte, i int) bool {

To resolve I would propose you either:

  1. continue on the current approach, create detailed lists of which characters are appropriate.
  2. install a binary mode (or "more" binary-friendly mode), punt the problem to some other future filter such as node's "supports-emoji".

The node/typescript "supports-emoji" package by offering with "levels" indicating the ranges. 🤔 Then jq could pass it through on the command line.

I thought the 4 levels proposed by support-emoji (wip):
sindresorhus/project-ideas#64

0 : Unicode not supported
1 : Unicode supported
2: Monochrome emoji supported
3: Colour emoji supported

My final thoughts ..

🤓 it's probably not reasonable to look at the OS, terminal type to autodetect since the apps position in the pipeline is NOT known.

🤓 this ruling on "is_printable" is totally arbitrary, and is a task better suited for some other utility/step that does the (in this circumstance) unwanted escape/encoding. It should at a minimum be referenced where it came from.

🤓 if don't you know of an easy way to reliably unescape ("restore") the output back to it's unencoded version then adding a binary mode which is always boolean true is probably easiest and fastest resolution and make the is_printable filter "feature" a separate tool/subsystem.

Thanks!

@laverya
Copy link
Contributor

laverya commented May 25, 2021

NOTE: is_printable() almost certainly shouldn't be a hardcoded list, since the actual list of printable characters would vary based on a terminal, and afaik termcap doesn't even have emoji support.

This brings up a good point, but overall I think it's one that goes in the opposite direction - YAML output from this library should be readable (and likely copy-pasteable) everywhere, not just the device on which it was originally printed. If that means that things need to be encoded to the lowest common denominator, that's not the worst thing.

Though perhaps the is_printable function should be an encoder setting in v3?

@elasticdotventures
Copy link

This brings up a good point, but overall I think it's one that goes in the opposite direction - YAML output from this library should be readable (and likely copy-pasteable) everywhere, not just the device on which it was originally printed.

Yes, but that 'proper display' is generally the role/responsibility of termcap (or binary filter such as "strings"), not YAML.

I make a broad appeal to the sanity behind treating Emoji as a first class character support is -- the adage "a picture is worth 1,000 words", I make an explanation how four pictures made from emoji and 500 chinese pinyin characters can represent 5.9bn "user stories" in only 4 visual characters using b00t style encoding. My screen size and most importantly my "area of visual focus" has not scaled linearly with my processing power. Using words is 😖 too much reading, symbols are much faster for our primate pattern matching brains (and the machines prefer them too).

If that means that things need to be encoded to the lowest common denominator, that's not the worst thing. Again, that is the job/role of a Unix utility called "strings" that has been in Unix since the 1970's. YAML is a file format for display, if the terminal supports emoji, YAML supports emoji.

Though perhaps the is_printable function should be an encoder setting in v3?

🤓 a more binary version is probably a great place to start. If i need to dumb it down for the terminal then I'll use a different tool for that.

I know I'm biased but it seems like the sensible setting to me is binary by default.

@mikefarah
Copy link
Author

mikefarah commented May 25, 2021 via email

@chaimleib
Copy link

chaimleib commented Oct 27, 2021

I feel like this should be a configurable option:

  1. Keep original encoding
  2. Prefer Unicode
  3. Escape all non-ASCII

@braydonk
Copy link

I hope that this issue can be resolved soon! Based on my understanding the fix in the PR looks solid, but even if that's not the way forward it would be great to get this fixed.

@braydonk
Copy link

braydonk commented Aug 30, 2022

So this might be super overkill, but in case someone is interested in a workaround for this I wrote a small dependency-free package that parses string literal codepoints like shown in the example and outputs the proper encoding.
https://pkg.go.dev/github.com/RageCage64/go-utf8-codepoint-converter/codepoint

I managed to use it to fix this problem in yamlfmt by parsing for the codepoint literal and converting it using that library. You can see that solution here:
https://github.com/google/yamlfmt/blob/main/internal/hotfix/unicode.go

Definitely a bit heavy-handed, but figured I'd bring it up here in case anyone is desperate enough to solve this manually like I did.

@Well2333
Copy link

Well2333 commented Nov 15, 2022

I feel like this should be a configurable option:

  1. Keep original encoding
  2. Prefer Unicode
  3. Escape all non-ASCII

Agreeing with this view, the solution now is not only unsightly, but can also lead to some errors.

For example, when I use this project to convert clash's profiles rules, my input is
DOMAIN,steampipe.akamaized.net,🎯 全球直连 # 自定义规则
But output is
"DOMAIN,steampipe.akamaized.net,\U0001F3AF 全球直连 # 自定义规则"
As you can see, the output has quotation marks around the string, whereas normal output is unquoted. This handling causes the comment content to be included in the string, resulting in a complete error in the clash.

After a simple test, this issue should occur in yaml. Marshal. Which will cause overprocesses of the string,

@braydonk
Copy link

As you can see, the output has quotation marks around the string, whereas normal output is unquoted.

The quoting around the string seems to be a result of the important unicode byte being considered not printable. When I implemented the fix in my fork, emojis are properly handled and nothing is quoted. (Note, I opted not to make it a configurable option for now)

dcreey added a commit to dcreey/ratchet that referenced this issue Aug 7, 2023
Fork is specifically maintained by @braydonk particularly for the interests of yamlfmt.
Context: go-yaml/yaml#737 (comment)
sethvargo pushed a commit to sethvargo/ratchet that referenced this issue Aug 8, 2023
Fork is specifically maintained by @braydonk particularly for the
interests of yamlfmt. Context:
go-yaml/yaml#737 (comment)

Before:
Any multi-line string with emojis will collapse into a single line
string and the emoji will be escaped.

After:
Multi-line strings with emojis are correctly left unedited.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants