Use correct encoding #66

flying-sheep · 2016-04-11T08:20:56Z

on windows, to correctly parse text, you need to do:

parse(text = code, encoding = Encoding(code))

if the code has set an encoding, it gets returned by Encoding. Else, Encoding returns 'unknown', the default value for encoding in parse.

ideally, parse_all should internally do the above instead of an encoding-less parse call.

an alternative would be to pass down an encoding parameter through evaluate → parse_all → parse
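A minimal sketch of the first option, assuming a hypothetical helper called parse_with_encoding (it is not part of evaluate). Since Encoding() returns one mark per element while parse() takes a single encoding value, the sketch collapses the marks and falls back to "unknown" when they are missing or mixed. It only shows the mechanics; it does not show that non-ASCII strings survive parsing in a non-UTF-8 locale.

```r
# Hypothetical helper, not part of evaluate: parse `code` using the encoding
# it is marked with, falling back to parse()'s default of "unknown".
parse_with_encoding <- function(code) {
  marks <- setdiff(unique(Encoding(code)), "unknown")
  enc <- if (length(marks) == 1L && marks %in% c("UTF-8", "latin1")) {
    marks
  } else {
    "unknown"  # no mark, mixed marks, or "bytes"
  }
  parse(text = code, encoding = enc)
}

# ASCII input: no encoding mark, so "unknown" is used.
exprs <- parse_with_encoding("x <- 1 + 1")
eval(exprs[[1]])

# UTF-8 marked input: "UTF-8" is passed on to parse().
exprs_utf8 <- parse_with_encoding("y <- '\u6cd5'")
```

The second option (an explicit encoding parameter passed down from evaluate) would replace the Encoding() lookup with a caller-supplied value.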


jankatins · 2016-04-11T09:11:41Z

This isn't the problem here: it seems that encoding makes no difference at all (in RStudio on Windows 7, 64-bit, R 3.2.x):

> s <- '"法 \\u8FDB"'
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")
> s <- '"法 \\u8FDB"'
> Encoding(s)
[1] "UTF-8"
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")

Just to show the original problem:

library(evaluate)

code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"


l = list()
txt <- function(o, type) {
  t <- paste(o, collapse = '\n')
  l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity, 
                         text = function(o) txt(o, "text"), 
                         graphics = identity,
                         message = identity, 
                         warning = identity, 
                         error = identity, 
                         value = identity)


x <- evaluate(code, output_handler = oh)
l

> l
[[1]]
[1] "[1] 8\n" # -> the unicode char is already wrong when it gets executed

[[2]]
[1] "[1] 1\n"

[[3]]
[1] "[1] \"<U+6CD5>\"\n"

[[4]]
[1] "[1] \"<U+8FDB>\"\n" # -> And even "good" unicode chars get mangled on the way out.

flying-sheep · 2016-04-11T11:10:56Z

OK. code is obviously good UTF-8 since it comes from user input, right?

doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?

if so, will our capture.output(print(obj)) introduce the error?

flying-sheep · 2016-04-11T11:15:51Z

important printing stuff (i think)

WinUTF8out and EncodeString

jankatins · 2016-04-11T11:25:32Z

> doesn’t that mean that using parse(text = code, encoding = Encoding(code)) creates the right object, and only the printing gets messed up?

No: it's already messed up because nchar(x) in the evaluate call returns 8, which happens when it sees a string like <U+6CD5>.

But the 4th case in the evaluate might be because of bad printing behaviour...
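To make the "8" concrete, here is a small sketch (the variable names are only for illustration) comparing the escaped placeholder with the intact character. It assumes nothing beyond base R.

```r
# The placeholder that appears when the character cannot be represented:
mangled <- "<U+6CD5>"
# The intact character, written as an escape so this file stays ASCII:
intact <- "\u6cd5"

nchar(mangled)            # 8: the characters < U + 6 C D 5 >
nchar(intact)             # 1: a single code point
utf8ToInt(intact)         # 27861, i.e. 0x6CD5
```

So a result of 8 from nchar(x) means the string already contains the literal placeholder text before print() is ever involved.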

> if so, will our capture.output(print(obj)) introduce the error?

Seems like it; at least it explains the second error (the [[4]] entry): e.g. executing '\u8FDB';print('\u8FDB') in IRkernel, which is equivalent to

code <- "
y = '\\u8FDB'
y
print(y)
"

and then doing capture.output(print(value)) in the value output handler

leads to this:

options(jupyter.rich_display = FALSE) # to only get print and not html
'\u8FDB'
print('\u8FDB')

[1] "<U+8FDB>"
[1] "<U+8FDB>"

-> Both are mangled, regardless of whether the print happens inside or outside evaluate. So the problem seems to be that connections used with sink() can't handle UTF-8 on non-UTF-8 systems? But that still doesn't explain the problem on the way in.

jankatins · 2016-04-11T11:54:30Z

Knitr has the same problem:

# code cell marker broken to not confuse github...
` ``{r} 
x = '法'
y = '\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
x
y
` ``

Outputs:

## [1] 8 -> already 8 chars when executed -> broken coming in
## [1] 1 -> escaping works for coming in
## [1] "<U+6CD5>" # -> just the broken 8 char string printed again
## [1] "<U+8FDB>" # -> printed in evaluate -> broken
## [1] "<U+6CD5>" # -> again the broken 8 char string
## [1] "<U+8FDB>" # -> printed by knit_print and broken there...

Another mention of this problem: https://stat.ethz.ch/pipermail/r-help//2014-April/373558.html (without an answer... :-()

jankatins · 2016-04-11T12:29:08Z

And this also doesn't make a difference (in RStudio, win7):

> f <- textConnection("rval2", "w", local=TRUE, encoding = "UTF-8")
> sink(f)
> print('法')
> print('\u8FDB')
> sink()
> print(rval2)
[1] "[1] \"<U+6CD5>\"" "[1] \"<U+8FDB>\""

expectopatronum · 2016-04-11T12:48:56Z

I once had an encoding problem when the locale of my RStudio was not set. You can check it with Sys.getlocale().
I'm not sure if this applies here though.

jankatins · 2016-04-11T13:13:17Z

> Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

-> Looks ok to me (this is a plain RStudio session on win7)

The problem is that the executed code is UTF-8 (e.g. read from a file) and the output should also be UTF-8 (e.g. written to a file), so the locale should not matter?
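For the file side of this, a minimal sketch of a locale-independent round trip, assuming the usual base R approach of writing the UTF-8 bytes as-is and marking the input as UTF-8 on reading. It covers only file I/O; it does not touch the parse() or sink() steps discussed in this thread.

```r
# Write and read UTF-8 explicitly so the file contents do not depend on the locale.
code <- "x <- '\u6cd5'"           # a UTF-8 marked string
path <- tempfile(fileext = ".R")

con <- file(path, open = "wb")    # binary mode: no newline translation
writeLines(enc2utf8(code), con, useBytes = TRUE)  # write the UTF-8 bytes as-is
close(con)

back <- readLines(path, encoding = "UTF-8", warn = FALSE)  # mark input as UTF-8
roundtrip_ok <- identical(enc2utf8(back), enc2utf8(code))
unlink(path)
```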

yihui · 2016-04-11T20:11:30Z

I think there is something fundamentally broken (related to sink()) in base R on Windows. See #59. Basically if the characters are not supported by the system native encoding, they will be silently converted to <U+XXXX> sequences.
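A rough way to check whether a given string is affected, assuming a hypothetical helper named survives_native (not from any package): convert the string to the native encoding and back, and see whether it is unchanged.

```r
# Hypothetical helper: TRUE if `x` can be converted to the native encoding
# and back to UTF-8 without changing.
survives_native <- function(x) {
  enc2utf8(enc2native(x)) == enc2utf8(x)
}

# TRUE when the session's native encoding is UTF-8.
native_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

survives_native("abc")      # ASCII is representable in every locale
survives_native("\u6cd5")   # expected to be FALSE where the native encoding lacks this character
```

In a UTF-8 session both calls should return TRUE, which is why the problem is not reproducible there.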

jankatins · 2016-04-11T21:08:12Z

Any idea how to get out of this problem? This is basically a big problem in the current R kernel, because the "in" side of this problem prevents Unicode code input on Windows, resulting in wrong computations (e.g. probably wrong comparisons if one side comes from a file and the other from a string defined in the notebook).

yihui · 2016-04-11T21:29:32Z

No, I think this requires one to dig deep into the C code in base R, and I don't really understand that level of detail.

This is a problem, but probably not as big as you imagined. This only becomes a problem when a Windows user has characters in the document that his/her Windows native character encoding does not support. I think this is relatively rare. In the above examples, as long as your Windows supports the Chinese locale, you should be fine (Sys.setlocale(, 'Chinese')). As you mentioned, knitr suffers from the same problem, but over the four years, this issue has bitten users at most three times as far as I can remember.

speechchemistry · 2020-06-03T13:36:04Z

Thank you for all your work on this. This comment is coming from someone still trying to understand the issue but I believe this is a problem for those of us who work with under-resourced languages, and linguists working with IPA data. If I've understood correctly, this issue is a problem for about half of the world's languages.

> This is a problem, but probably not as big as you imagined.

yihui · 2020-06-04T15:14:39Z

@speechchemistry We had to wait for Windows to support UTF-8; see #59. R core has been making effort: https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/ Eventually we should be able to forget about character encodings on Windows. Trust me: my native language is Chinese, and I have felt the enormous pain for many years, but still have to wait.

flying-sheep · 2022-06-27T08:05:44Z

Hi, R supports Unicode on Windows since version 4.2!

flying-sheep mentioned this issue Apr 11, 2016

broken unicode uses <U+884C> which needs to be escaped IRkernel/repr#21

Open

jankatins mentioned this issue Apr 11, 2016

abbreviated output for data.frame is wrong IRkernel/repr#28

Closed

jankatins mentioned this issue Aug 1, 2016

Add max string length option tidyverse/tibble#104

Closed

krlmlr mentioned this issue Aug 19, 2016

Re-encode character columns and column names to UTF-8 tidyverse/tibble#87

Closed

yihui mentioned this issue Dec 1, 2016

asis handling inconsistancies rstudio/rmarkdown#895

Closed

yihui mentioned this issue Jun 22, 2017

evaluate losing character encoding information of arguments #74

Closed

yihui mentioned this issue Oct 24, 2017

Greek characters garbled when self_contained = FALSE rstudio/rmarkdown#464

Closed

yihui mentioned this issue Feb 12, 2018

Suggestion: Replace non-standard Unicode characters with entities for HTML output yihui/knitr#1506

Closed

cderv mentioned this issue Jan 14, 2021

Encoding issue during knitting on Windows yihui/knitr#1944

Closed

cderv mentioned this issue Dec 8, 2021

Unicode symbols in ggplot fail to render, but only in markdown rstudio/rmarkdown#2256

Open

flying-sheep mentioned this issue Feb 3, 2023

Problem in unicode in R from jupyternotebook IRkernel/IRkernel#731

Closed

hadley closed this as completed Jun 14, 2024
