Problem with UTF-8 file paths on windows #182

Closed
jennybc opened this issue Apr 27, 2019 · 7 comments · Fixed by #183

Comments

@jennybc
Contributor

jennybc commented Apr 27, 2019

This presented as tidyverse/googledrive#229 but I'm not sure there's anything I can do about it. I think the problem may ultimately lie in curl::curl_fetch_disk().

library(curl)
destdir <- tempfile()
dir.create(destdir)
setwd(destdir)
res <- curl_fetch_disk("http://httpbin.org/stream/10", "äggplanta")
res$content
#> [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\Rtmp6nRGxs\\file23bcb02d3b\\äggplanta"
file.exists(res$content)
#> [1] FALSE
list.files()
#> [1] "äggplanta"

Encoding("äggplanta")
#> [1] "latin1"
Encoding(normalizePath("äggplanta", mustWork = FALSE))
#> [1] "UTF-8"

Created on 2019-04-26 by the reprex package (v0.2.1)

I had to hand-edit the output of Encoding("äggplanta") to reflect what I see in the Console. There's some knitr issue that causes it to show as "unknown" in a reprex.

@jennybc
Contributor Author

jennybc commented Apr 27, 2019

I suppose it could be related to the same problem we grappled with in readxl and readr where normalizePath() now converts to UTF-8, because path.expand() does.

curl/R/fetch.R

Lines 74 to 75 in 5be4dcc

path <- normalizePath(path, mustWork = FALSE)
output <- .Call(R_curl_fetch_disk, enc2utf8(url), handle, path, "wb", nonblocking)

tidyverse/readxl#477
tidyverse/readxl@ad57de3

@jay

jay commented Apr 27, 2019

> I think the problem may ultimately lie in curl::curl_fetch_disk().

libcurl (on which the R curl package depends) can write to a FILE. On Windows there's no UTF-8 locale in the CRT, so the UTF-8 path has to be converted to UTF-16 and then _wfopen used to open the FILE.

(Correction: I originally suggested specifying the encoding in the call to fopen, for example "wb, ccs=UTF-8", but ccs is for encoding the file's content, not the encoding of its filename.)

@jeroen
Owner

jeroen commented Jul 26, 2019

Fixed in curl 4.0, which is on CRAN now.

@jennybc
Contributor Author

jennybc commented Apr 29, 2021

I'm stopping by here again to note that I've re-encountered this problem and, possibly, a version of it that my previous fix does not solve: what if the incoming UTF-8 file path doesn't survive enc2native()?

This is not a full reprex, because I'm not really re-opening this issue. I'm just going to say this is currently unsupported. But I want to capture a glimpse of the problem.

Here's a local file path I'm trying to upload in googledrive:

> filename_2
[1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpiMUoMa/multibyte-chars-2-マルチ-TEST-drive-upload-jenny"

and here's how that path looks in my assembled request (this is via googledrive --> gargle --> httr::upload_file())

Browse[2]> request$body$media$path
[1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpiMUoMa\\multibyte-chars-2-<U+30DE><U+30EB><U+30C1>-TEST-drive-upload-jenny"

which leads to upload failure with

 Error in curl::curl_fetch_memory(url, handle = handle) : 
  Failed to open/read local data from file/application 

This feels like a case where the "proper" solution suggested by @jay is the only thing that will actually work, in general.

For googledrive, this would be a "nice to have" vs. "must have". So I just want to record that there has been some real world need, in case anyone is ever motivated to revisit this.

@jennybc
Contributor Author

jennybc commented Apr 29, 2021

More notes on what it might look like to do this here in curl:

https://github.com/gaborcsardi/rencfaq#how-to-use-utf-8-file-names-on-windows

If you need to open a file from C code, you need to do the following:

  • Convert the file name to UTF-8 just before passing it to C. In generic code enc2utf8() will do.
  • In the C code, on Windows, convert the UTF-8 path to UTF-16 using the MultiByteToWideChar() Windows API function.
  • Use the _wfopen() Windows API function to open the file.

Here is an example from the brio package: https://github.com/r-lib/brio/blob/2cf72bb77ad55c758b4a140112916ddc23f00b59/src/brio.c#L9
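
For a rough idea of what those three steps might look like in C, here is a minimal sketch. It is not the brio or curl code. The helper name open_utf8_path() is hypothetical, and it assumes the path arrives as NUL-terminated UTF-8 bytes and that the mode string is plain ASCII such as "rb" or "wb".

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not part of curl or brio): open a file whose path
 * is given as NUL-terminated UTF-8 bytes. `mode` is assumed to be a short
 * ASCII string such as "rb" or "wb". Returns NULL on failure. */
FILE *open_utf8_path(const char *path, const char *mode) {
#ifdef _WIN32
  /* First call: ask how many wide chars are needed, including the NUL. */
  int wlen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                 path, -1, NULL, 0);
  if (wlen <= 0) {
    return NULL; /* invalid UTF-8 or other conversion failure */
  }

  wchar_t *wpath = malloc((size_t) wlen * sizeof(wchar_t));
  if (wpath == NULL) {
    return NULL;
  }

  /* Second call: do the actual UTF-8 -> UTF-16 conversion. */
  if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                          path, -1, wpath, wlen) <= 0) {
    free(wpath);
    return NULL;
  }

  /* _wfopen() wants the mode as a wide string too. */
  wchar_t wmode[16];
  if (MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) <= 0) {
    free(wpath);
    return NULL;
  }

  FILE *fp = _wfopen(wpath, wmode);
  free(wpath);
  return fp;
#else
  /* Elsewhere the bytes go to fopen() unchanged; the caller is
   * responsible for handing over a path in the encoding the system
   * expects. */
  return fopen(path, mode);
#endif
}
```

On the R side, the path would be run through enc2utf8() on Windows before being handed to a function like this. The sketch does not deal with long-path prefixes or with a non-ASCII mode string.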

@gaborcsardi
Contributor

gaborcsardi commented Apr 29, 2021

One more thing to add: you need enc2native() on Unix, before calling to C. (As we worked it out with @jeroen for another package not so long ago. :))

@jeroen
Owner

jeroen commented Apr 30, 2021

Does somebody want to open an issue or PR?
