Encoding problems on Windows caused by character -> symbol -> character roundtrip #1950

Closed
krlmlr opened this issue Jun 21, 2016 · 11 comments
Labels
bug an unexpected problem or unintended behavior

@krlmlr
Member

krlmlr commented Jun 21, 2016

Some of the encoding problems (e.g., grouping) seem to be caused by converting character to symbol and back. On Linux (UTF-8 locale):

> "ä" %>% Encoding
[1] "UTF-8"
> "ä" %>% as.name %>% as.character %>% Encoding
[1] "UTF-8"
> "ä" %>% iconv(to = "latin1") %>% as.name %>% as.character %>% Encoding
[1] "unknown"
> "ä" %>% iconv(to = "latin1") %>% as.name %>% as.character %>% enc2utf8
[1] "<e4>"

On Windows (latin-1 locale):

> "ä" %>% Encoding
[1] "latin1"
> "ä" %>% as.name %>% as.character %>% Encoding
[1] "latin1"
> "ä" %>% iconv(to = "UTF-8") %>% as.name %>% as.character %>% Encoding
[1] "unknown"
> "ä" %>% iconv(to = "UTF-8") %>% as.name %>% as.character %>% enc2utf8
[1] "ä"

On Windows, one test fails because of that.

So, currently we should suggest using ASCII (or at least native-encoded) column names, and not fiddle with the encoding, in particular not set it to UTF-8 on Windows.

A lot of internal dplyr logic seems to be based on the symbol type, so I don't see a quick solution here.

@kevinushey
Contributor

kevinushey commented Jun 21, 2016

IIUC, the over-arching issue is that the as.name => as.character roundtrip loses the encoding, and so we no longer know how to properly translate from the original encoding to the desired encoding (hence why enc2utf8 doesn't produce the desired result)?

@krlmlr
Member Author

krlmlr commented Jun 21, 2016

Yes -- if it's not the native encoding, the roundtrip loses it, and there's no way to recover.
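
A minimal sketch of why the lost declaration matters. It does not go through as.name() at all (the result of that roundtrip depends on the locale, as the transcripts above show). Instead it strips the encoding mark by hand, which is an assumption made purely for illustration, and then shows that the bytes can only be converted correctly if the original encoding is known from somewhere else:

```r
# "ä" written as an escape so the example does not depend on the file encoding.
x_utf8   <- "\u00e4"
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")

Encoding(x_latin1)   # declared encoding of the converted string
charToRaw(x_latin1)  # the single latin1 byte for "ä"

# Simulate the loss: identical bytes, but no declared encoding any more.
unmarked <- x_latin1
Encoding(unmarked) <- "unknown"

# enc2utf8(unmarked) would now interpret the byte in the native encoding,
# which is only right if the native encoding happens to be latin1.

# With outside knowledge of the original encoding, the string is recoverable:
recovered <- unmarked
Encoding(recovered) <- "latin1"
recovered <- enc2utf8(recovered)
charToRaw(recovered)  # the two-byte UTF-8 sequence for "ä"
```

Without that outside knowledge, the unmarked bytes alone do not say which conversion is correct.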

@hadley
Member

hadley commented Jun 23, 2016

Maybe dplyr could have its own symbol coercion methods that always assume UTF-8?

@krlmlr
Member Author

krlmlr commented Jun 23, 2016

We could do that, but I think the safest option is not to use symbols to represent column names -- a simple S3 class should do the same job.
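
A rough sketch of what such a class could look like. The names `col_name()` and `as.character.col_name()` are hypothetical and not part of dplyr; the only assumption is that the name is normalized to UTF-8 once, on construction, and never passes through a symbol:

```r
# Hypothetical constructor: stores the column name as a UTF-8 string.
col_name <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  structure(list(name = enc2utf8(x)), class = "col_name")
}

# S3 method: hands back the stored UTF-8 string unchanged.
as.character.col_name <- function(x, ...) {
  unclass(x)$name
}

# A latin1-marked input comes back as UTF-8, with the declaration intact.
nm  <- col_name(iconv("\u00e4", from = "UTF-8", to = "latin1"))
out <- as.character(nm)
Encoding(out)
charToRaw(out)
```

Because the string is never coerced to a symbol, its declared encoding survives, at the cost of not being directly usable inside expressions.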

@hadley
Member

hadley commented Jun 23, 2016

I think symbols are the correct way to represent column names for a number of reasons.

@krlmlr
Member Author

krlmlr commented Jun 23, 2016

We could settle for a combination:

> structure(as.name("col_name"), class = "encoded_symbol")
col_name
attr(,"class")
[1] "encoded_symbol"

With a suitable as.character.encoded_symbol() function.

@krlmlr
Member Author

krlmlr commented Aug 9, 2016

Windows users are confined to their native encoding for column names anyway if they want to use them in expressions. These are always in the native encoding, so the following doesn't work on Windows:

~成交日期
## Error: unexpected input in "~\"

For characters that can be represented in the native encoding, . %>% as.name %>% as.character works as expected and reliably maintains the declared encoding. I think we should use (and expect) the native encoding whenever we interact with language objects, and UTF-8 otherwise.
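
A small sketch of that convention, assuming two hypothetical helpers (`as_native_symbol()` and `as_utf8_string()` are illustrative names, not dplyr functions): convert to the native encoding before creating a symbol, and convert back to UTF-8 when leaving the language-object world.

```r
# Hypothetical helper: character -> symbol, via the native encoding.
as_native_symbol <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)
  as.name(enc2native(x))
}

# Hypothetical helper: symbol -> character, normalized to UTF-8.
as_utf8_string <- function(sym) {
  stopifnot(is.symbol(sym))
  enc2utf8(as.character(sym))
}

sym <- as_native_symbol("col_name")
as_utf8_string(sym)
```

This only helps for names that the native encoding can represent; anything else is already altered by `enc2native()` before the symbol is created.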

I have submitted a bug report to R's bugzilla concerning the behavior of as.name() for strings in non-native encoding.

@krlmlr
Member Author

krlmlr commented Aug 10, 2016

A comment in the R source leaves little hope for an upstream fix, but we'll see.

@krlmlr
Member Author

krlmlr commented Nov 8, 2016

Related: #1885, joining strings with different encodings.

@krlmlr
Member Author

krlmlr commented Jan 25, 2017

But the following works on Windows:

~"成交日期"

CC @hadley

@krlmlr
Member Author

krlmlr commented Jan 25, 2017

> data_frame(a = 1) %>% setNames("成交日期")
# A tibble: 1 × 1
  `<U+6210><U+4EA4><U+65E5><U+671F>`
                               <dbl>
1                                  1

krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 20, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 20, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 20, 2017
@krlmlr krlmlr modified the milestone: data frame 1 Feb 21, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 21, 2017
krlmlr added a commit to krlmlr/dplyr that referenced this issue Feb 23, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018