Use correct encoding #66

Closed
flying-sheep opened this issue Apr 11, 2016 · 14 comments

@flying-sheep
Contributor

on windows, to correctly parse text, you need to do:

parse(text = code, encoding = Encoding(code))

if the code has set an encoding, it gets returned by Encoding. Else, Encoding returns 'unknown', the default value for encoding in parse.

ideally, parse_all should internally do the above instead of an encoding-less parse call.

an alternative would be to pass down an encoding parameter through evaluate → parse_all → parse
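
A minimal sketch of the first option, assuming a hypothetical wrapper called `parse_with_encoding` (it is not part of evaluate). It forwards the encoding mark of the input to `parse()` and falls back to `"unknown"` when the mark is missing, mixed, or not one that `parse()` acts on. Keep in mind that `encoding` only marks the resulting strings; it does not re-encode the input.

```r
# Hypothetical helper for illustration; not evaluate's actual implementation.
parse_with_encoding <- function(code) {
  marks <- unique(Encoding(code))
  # parse() only acts on "latin1" and "UTF-8"; ignore any other mark.
  marks <- marks[marks %in% c("latin1", "UTF-8")]
  enc <- if (length(marks) == 1L) marks else "unknown"
  parse(text = code, encoding = enc, keep.source = TRUE)
}

# Usage: parse two statements from one string.
exprs <- parse_with_encoding("x <- 'abc'\nnchar(x)")
length(exprs)
```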

@jankatins

This isn't the problem here: it seems that encoding makes no difference at all (in RStudio on Windows 7, 64-bit, R 3.2.x):

> s <- '"法 \\u8FDB"'
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")
> s <- '"法 \\u8FDB"'
> Encoding(s)
[1] "UTF-8"
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")

Just to show the original problem:

library(evaluate)

code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"


l = list()
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity, 
                         text = function(o) txt(o, "text"), 
                         graphics = identity,
                         message = identity, 
                         warning = identity, 
                         error = identity, 
                         value = identity)


x <- evaluate(code, output_handler = oh)
l
> l
[[1]]
[1] "[1] 8\n" # -> the unicode char is already wrong when it gets executed

[[2]]
[1] "[1] 1\n"

[[3]]
[1] "[1] \"<U+6CD5>\"\n"

[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> And even "good" unicode chars get mangled on the way out.

@flying-sheep
Contributor Author

OK. code is obviously good UTF-8 since it comes from user input, right?

doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?

if so, will our capture.output(print(obj)) introduce the error?

@flying-sheep
Contributor Author

important printing stuff (i think)

WinUTF8out and EncodeString

@jankatins

> doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?

No: it's already messed up because nchar(x) in the evaluate call returns 8, which happens when it sees a string like <U+6CD5>.
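
The count of 8 matches the length of the literal escape text rather than of the character itself. A small check in plain base R, independent of evaluate, illustrates the difference:

```r
escaped <- "<U+6CD5>"  # the ASCII replacement text R substitutes for the character
real    <- "\u6CD5"    # the actual character, written as a Unicode escape

nchar(escaped)  # counts '<', 'U', '+', four hex digits and '>'
nchar(real)     # counts a single character
```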

But the 4th case in the evaluate might be because of bad printing behaviour...

> if so, will our capture.output(print(obj)) introduce the error?

Seems like it at least explains the second error (the [4]): e.g. executing '\u8FDB';print('\u8FDB') in irkernel, which is equivalent to

code <- "
y = '\\u8FDB'
y
print(y)
"

and then doing capture.output(print(value)) in the value output handler

leads to this:

options(jupyter.rich_display = FALSE) # to only get print and not html
'\u8FDB'
print('\u8FDB')
[1] "<U+8FDB>"
[1] "<U+8FDB>"

-> Both are mangled, no matter whether the print happens within evaluate or outside. So the problem seems to be that connections used with sink() can't handle UTF-8 on non-UTF-8 systems? But that still doesn't explain the problem "coming in".

@jankatins

Knitr has the same problem:

# code cell marker broken to not confuse github...
` ``{r} 
x = '法'
y = '\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
x
y
` ``

Outputs:

## [1] 8 -> already 8 chars when executed -> broken coming in
## [1] 1 -> escaping works for coming in
## [1] "<U+6CD5>" # -> just the broken 8 char string printed again
## [1] "<U+8FDB>" # -> printed in evaluate -> broken
## [1] "<U+6CD5>" # -> again the broken 8 char string
## [1] "<U+8FDB>" # -> printed by knit_print and broken there...

Another mention of this problem: https://stat.ethz.ch/pipermail/r-help//2014-April/373558.html (without an answer... :-()

@jankatins

And this also doesn't make a difference (in RStudio, win7):

> f <- textConnection("rval2", "w", local=TRUE, encoding = "UTF-8")
> sink(f)
> print('法')
> print('\u8FDB')
> sink()
> print(rval2)
[1] "[1] \"<U+6CD5>\"" "[1] \"<U+8FDB>\""

@expectopatronum

I once had a problem with encoding when the locale of my RStudio was not set. You can check this with Sys.getlocale().
I'm not sure if this applies here though.

@jankatins

> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

-> Looks ok to me (this is a plain RStudio session on win7)

The problem is that the executed code is UTF-8 (e.g. read from a file) and the output should also be UTF-8 (e.g. written to a file), so the locale should not matter?
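
For the file side alone, one commonly used pattern is to write the UTF-8 bytes untouched and to mark the lines as UTF-8 when reading them back. This is a rough sketch using only base R; the temporary file stands in for a real script or output file, and it does not touch the sink() path that the examples above go through.

```r
path <- tempfile(fileext = ".R")

# Write the UTF-8 bytes as they are; useBytes = TRUE skips re-encoding
# to the native encoding.
code <- enc2utf8("x <- '\u6CD5'")
writeLines(code, path, useBytes = TRUE)

# Read the file back and mark the lines as UTF-8 rather than native.
lines <- readLines(path, encoding = "UTF-8")
Encoding(lines)
identical(lines, code)

unlink(path)
```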

@yihui
Collaborator

yihui commented Apr 11, 2016

I think there is something fundamentally broken (related to sink()) in base R on Windows. See #59. Basically if the characters are not supported by the system native encoding, they will be silently converted to <U+XXXX> sequences.
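
One way to see in advance which strings are at risk is to test whether they survive conversion to the native encoding. The helper name `is_native_representable` below is hypothetical, and the result depends on the session's locale and on the platform's iconv implementation, so treat this as a diagnostic sketch rather than a guarantee.

```r
# Hypothetical helper: TRUE where a string can be converted to the
# native encoding of the current session.
is_native_representable <- function(x) {
  # iconv() returns NA for input it cannot convert;
  # to = "" means the current locale's encoding.
  !is.na(iconv(enc2utf8(x), from = "UTF-8", to = ""))
}

# Usage: the second element is only representable in some locales.
is_native_representable(c("abc", "\u6CD5"))
```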

@jankatins

Any idea how to get out of this problem? This is basically a big problem in the current R kernel, because the "in" side of this problem prevents unicode code input on windows, resulting in wrong computations (e.g. probably wrong comparisons if one side comes from a file and one from a string defined in the notebook).

@yihui
Collaborator

yihui commented Apr 11, 2016

No, I think this requires one to dig deep into the C code in base R, and I don't really understand that level of detail.

This is a problem, but probably not as big as you imagined. This only becomes a problem when a Windows user has characters in the document that his/her Windows native character encoding does not support. I think this is relatively rare. In the above examples, as long as your Windows supports the Chinese locale, you should be fine (Sys.setlocale(, 'Chinese')). As you mentioned, knitr suffers from the same problem, but over the past four years, this issue has bitten users at most three times as far as I can remember.

@speechchemistry

Thank you for all your work on this. This comment is coming from someone still trying to understand the issue but I believe this is a problem for those of us who work with under-resourced languages, and linguists working with IPA data. If I've understood correctly, this issue is a problem for about half of the world's languages.

> This is a problem, but probably not as big as you imagined.

@yihui
Collaborator

yihui commented Jun 4, 2020

@speechchemistry We had to wait for Windows to support UTF-8; see #59. R core has been making an effort: https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/ Eventually we should be able to forget about character encodings on Windows. Trust me: my native language is Chinese, and I have felt the enormous pain for many years, but still have to wait.

@flying-sheep
Contributor Author

Hi, since R 4.2, R supports Unicode on Windows!
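
To check whether a given session benefits from this, one can ask R whether its native encoding is UTF-8. The helper name `native_is_utf8` is hypothetical; `l10n_info()` and `getRversion()` are base R. Note that on Windows this also depends on the Windows version being recent enough, so the R version alone is not conclusive.

```r
# Hypothetical helper: TRUE if the session's native encoding is UTF-8.
native_is_utf8 <- function() isTRUE(l10n_info()[["UTF-8"]])

native_is_utf8()
getRversion() >= "4.2.0"
```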
