Use correct encoding #66
Comments
This isn't the problem here: it seems that encoding makes no difference at all (in RStudio on Windows 7, 64-bit, R 3.2.x):
> s <- '"法 \\u8FDB"'
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")
> s <- '"法 \\u8FDB"'
> Encoding(s)
[1] "UTF-8"
> parse(text=s, encoding = Encoding(s))
expression("<U+6CD5> \u8FDB")
Just to show the original problem:
library(evaluate)
code <- "
x = '法'
y = '\\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
"
l = list()
txt <- function(o, type) {
t <- paste(o, collapse = '\n')
l[length(l)+1] <<- t
}
oh <- new_output_handler(source = identity,
text = function(o) txt(o, "text"),
graphics = identity,
message = identity,
warning = identity,
error = identity,
value = identity)
x <- evaluate(code, output_handler = oh)
l
|
OK. code is obviously good UTF-8 since it comes from user input, right? doesn’t that mean that using if so, will our |
important printing stuff (i think) |
No: it's already messed up because But the 4th case in the evaluate might be because of bad printing behaviour...
Seems like it at least explains the second error (the [4]): e.g. executing
and then doing leads to this:
-> Both are mangled, no matter whether the print happens within evaluate or outside. So it seems the problem is that sink()ed connections can't handle UTF-8 on non-UTF-8 systems? But that still doesn't explain the problem "coming in". |
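To check the sink hypothesis in isolation, printed output can be captured through a text connection without evaluate or knitr involved. This is a minimal sketch in base R; `capture_print` is a hypothetical helper name, and what the captured line looks like depends on the platform and locale, so no expected output is shown.

```r
# Hypothetical helper: capture what print() writes while output is
# diverted to a text connection via sink().
capture_print <- function(value) {
  con <- textConnection("captured", open = "w", local = TRUE)
  sink(con)
  print(value)
  sink()                              # stop diverting output
  out <- textConnectionValue(con)     # lines written so far
  close(con)
  out
}

x <- "\u6cd5"                         # same character as in the example, as an escape
res <- capture_print(x)
res                                   # compare with what print(x) shows on the console
Encoding(res)                         # how the captured string is marked
```

If the captured line already differs from the console output, the mangling happens on the output side, independently of how the code was parsed.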
Knitr has the same problem: # code cell marker broken to not confuse github...
` ``{r}
x = '法'
y = '\u8FDB'
print(nchar(x))
print(nchar(y))
print(x)
print(y)
x
y
` ``
Outputs:
Another mention of this problem: https://stat.ethz.ch/pipermail/r-help//2014-April/373558.html (without an answer... :-() |
And this also doesn't make a difference (in RStudio, win7):
|
I once had a problem with encoding when the locale of my RStudio was not set. You find out by |
-> Looks ok to me (this is a plain RStudio session on win7) The problem is that the executed code is UTF-8 (e.g. read from a file) and the output should also be UTF-8 (eg written to a file), so the locale should not matter? |
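For anyone wanting to check their own session, the locale and native encoding can be inspected with base R. This is only a sketch of one way to do it; the values reported differ between systems.

```r
# Character-type locale of the current session
ctype <- Sys.getlocale("LC_CTYPE")
ctype

# Localization details; the "UTF-8" element says whether the native
# encoding of this session is UTF-8
info <- l10n_info()
info[["UTF-8"]]

# How a string built from a \u escape is marked (independent of the locale)
Encoding("\u6cd5")
```

On a session where `info[["UTF-8"]]` is `FALSE`, any conversion of a UTF-8 string to the native encoding can lose characters that encoding does not contain.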
I think there is something fundamentally broken (related to |
Any idea how to get out of this problem? This is basically a big problem in the current R kernel, because the "in" side of this problem prevents unicode code input on windows, resulting in wrong computations (e.g. probably wrong comparisons if one side comes from a file and one from a string defined in the notebook). |
No, I think this requires one to dig deep into the C code in base R, and I don't really understand that level of detail. This is a problem, but probably not as big as you imagined. It only becomes a problem when a Windows user has characters in the document that their Windows native character encoding does not support. I think this is relatively rare. In the above examples, as long as your Windows supports the Chinese locale, you should be fine ( |
Thank you for all your work on this. This comment is coming from someone still trying to understand the issue but I believe this is a problem for those of us who work with under-resourced languages, and linguists working with IPA data. If I've understood correctly, this issue is a problem for about half of the world's languages.
|
@speechchemistry We had to wait for Windows to support UTF-8; see #59. R core has been making an effort: https://developer.r-project.org/Blog/public/2020/05/02/utf-8-support-on-windows/ Eventually we should be able to forget about character encodings on Windows. Trust me: my native language is Chinese, and I have felt the enormous pain for many years, but still have to wait. |
Hi, since R 4.2, R supports Unicode on Windows! |
On Windows, to correctly parse text, you need to do:

If the `code` has set an encoding, it gets returned by `Encoding`. Else, `Encoding` returns `'unknown'`, the default value for `encoding` in `parse`.

Ideally, `parse_all` should internally do the above instead of an `encoding`-less `parse` call.

An alternative would be to pass down an `encoding` parameter through `evaluate` → `parse_all` → `parse`.
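The approach described above could be sketched roughly as follows, based on the `parse(text = s, encoding = Encoding(s))` call shown in the comments. `parse_with_encoding` is a hypothetical helper name and not part of evaluate. It also assumes that a multi-element input with mixed or unmarked encodings should fall back to `"unknown"`, which the issue does not specify.

```r
# Hypothetical helper: parse code, passing along the encoding the input
# string is marked with instead of relying on parse()'s default.
parse_with_encoding <- function(code) {
  # Encoding() returns one mark per element; keep only marks parse() understands
  enc <- intersect(unique(Encoding(code)), c("latin1", "UTF-8"))
  # Fall back to parse()'s default when there is no single known mark
  enc <- if (length(enc) == 1L) enc else "unknown"
  parse(text = code, encoding = enc)
}

code <- "x <- '\u6cd5'"   # marked as UTF-8 because of the \u escape
Encoding(code)
exprs <- parse_with_encoding(code)
exprs
```

Per the documentation of `parse`, the `encoding` argument only marks the resulting strings and does not re-encode the input, which may be why the first comment saw no difference on R 3.2.x.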