New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correctly parse non-ASCII characters of R files in Windows #532

Merged
merged 2 commits into from Dec 2, 2016

Conversation

Projects
None yet
2 participants
@shrektan
Contributor

shrektan commented Oct 24, 2016

If there's non-ASCII characters in R files (like Chinese Chars), roxygen2 will result an error, because the base::parse() function internally uses readLines() without explicit setting the encoding to UTF-8. Instead, the default value of encoding in readLines() is unknow, which is not desired in windows, since we all agree to build R package with UTF-8 scripts.

So I replaced parse() with the codes from devtools().

Hope this PR gets approved. Or we have to use stringi::stri_escape_unicode() again and again, leaving
a lot of chars like "\u4e2d\u6587", which is not so human-friendly.

Thanks.

Reference to r-lib/devtools#1378

NOTES

source_package() uses base::sys.source() to parse functions. Unfortunately, base::sys.source() has the same issue that no setting encoding in readLines(). So, based on base::sys.source(), I defined a function sys_source() to parse UTF-8 encoded script correctly.

@hadley

This comment has been minimized.

Collaborator

hadley commented Oct 24, 2016

Wouldn't it be better to use the Encoding field from the DESCRIPTION?

@hadley

hadley approved these changes Oct 24, 2016

Can you please also add a unit test?

R/source.R Outdated
env
}
sys_source <- function(file, envir = baseenv()) {

This comment has been minimized.

@hadley

hadley Oct 24, 2016

Collaborator

If you copy code from base R, you must include an attribution of the source, check that the licenses are compatible, and add base R to the Authors@R. Instead, I think you can create a file() connection with the correct encoding and pass that to sys.source()

This comment has been minimized.

@shrektan

shrektan Oct 25, 2016

Contributor

Unfortunately, it seems like not possible to use sys.source(), because it doesn't allow file() connection as the input. Here's the first line source code of sys.source():

    if (!(is.character(file) && file.exists(file))) 
        stop(gettextf("'%s' is not an existing file", file))

This comment has been minimized.

@shrektan

shrektan Oct 25, 2016

Contributor

My proposal is to use source(), does it make sense for you?

 expr <-
    substitute(
      quote(source(file, encoding = "UTF-8", keep.source = FALSE, local = TRUE)),
      list(file = file)
    )
  eval(expr, envir = envir)

This comment has been minimized.

@hadley

hadley Oct 25, 2016

Collaborator

Why do you need substitute() + eval()?

This comment has been minimized.

@shrektan

shrektan Oct 26, 2016

Contributor

My bad, not notice local accepts environment values. So just use source(file, encoding = "UTF-8", keep.source = FALSE, local = envir) is ok then.

R/parse.R Outdated
@@ -25,7 +25,12 @@ parse_text <- function(text, registry = default_tags(), global_options = list())
}
parse_blocks <- function(file, env, registry, global_options = list()) {
parsed <- parse(file = file, keep.source = TRUE)
lines <- readLines(file, warn = FALSE, encoding = "UTF-8")

This comment has been minimized.

@hadley

hadley Oct 24, 2016

Collaborator

As below, it would be simpler to do

con <- file(file, encoding = "UTF-8")
on.exit(close(con))
parsed <- parse(con, keep.source = TRUE)

This comment has been minimized.

@shrektan

shrektan Oct 25, 2016

Contributor

Thanks. Find it needs to explicitly set the encoding of srcfile to preserve the srcref attribution:

  con <- file(file, encoding = "UTF-8")
  on.exit(close(con), add = TRUE)
  parsed <- parse(con, keep.source = TRUE, srcfile = srcfile(file, "UTF-8"))
@shrektan

This comment has been minimized.

Contributor

shrektan commented Oct 25, 2016

Thanks for you quick respond. Would you please give me a hint of how to get the Encoding field from the DESCRIPTION file? I would like to add this little feature.

BTW, actually will anybody use non-UTF8 script to write packages? Seems like not a good practice for me...

@hadley

This comment has been minimized.

Collaborator

hadley commented Oct 25, 2016

Yeah, maybe your right. It's unusual to use anything other than UTF-8, and if turns out to be a problem for someone we can fix it then.

@shrektan

This comment has been minimized.

Contributor

shrektan commented Nov 16, 2016

Note the unit testing hasn't been completed. Will finish it soon. Will also try to implement encoding other than UTF-8.

@shrektan shrektan force-pushed the shrektan:parseNonASCII branch from 14d6273 to 0bb12da Nov 24, 2016

@shrektan shrektan force-pushed the shrektan:parseNonASCII branch from 0bb12da to 8937f4b Nov 24, 2016

@shrektan

This comment has been minimized.

Contributor

shrektan commented Nov 24, 2016

@hadley I updated this PR with improvements on :

  1. Improve the codes based on your suggestions,
  2. Could read text based on Encoding field in DESCRIPTION,
  3. Add a unit test.
@hadley

This comment has been minimized.

Collaborator

hadley commented Nov 28, 2016

Looking good. Using devtools::build_win() can you please confirm that this doesn't cause any problems on windows?

@shrektan

This comment has been minimized.

Contributor

shrektan commented Dec 1, 2016

@hadley I'm not able to use devtools::build_win(), because I just can't receive the email (I definitely changed the author's email to my personal email)...

However, I have a windows machine as well. I think run devtools::check() in the windows machine is literally the same as using devtools::build_win(), right? Then with one additional patch, the check succeeds. Below is the log:

LOG of devtools::check()

Restarting R session...

> devtools::check()
Updating roxygen2 documentation
Loading roxygen2
Writing NAMESPACE
Writing update_collate.Rd
Writing markdown-internals.Rd
Writing double_escape_md.Rd
Writing markdown-test.Rd
Writing namespace_roclet.Rd
Writing object_format.Rd
Writing is_s3_generic.Rd
Writing object.Rd
Writing rd_roclet.Rd
Writing roclet.Rd
Writing roclet_find.Rd
Writing roc_proc_text.Rd
Writing roxygen2-package.Rd
Writing roxygenize.Rd
Writing load_options.Rd
Writing source_package.Rd
Writing roxy_tag.Rd
Writing vignette_roclet.Rd
Setting env vars -----------------------------------------------------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building roxygen2 ----------------------------------------------------------------------------------
"D:/app/R/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD build  \
  "D:/RWD/roxygen" --no-resave-data --no-manual 

* checking for file 'D:/RWD/roxygen/DESCRIPTION' ... OK
* preparing 'roxygen2':
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... OK
* cleaning src
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* building 'roxygen2_5.0.1.9000.tar.gz'

Setting env vars -----------------------------------------------------------------------------------
_R_CHECK_CRAN_INCOMING_ : FALSE
_R_CHECK_FORCE_SUGGESTS_: FALSE
Checking roxygen2 ----------------------------------------------------------------------------------
"D:/app/R/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD check  \
  "C:\Users\amc038\AppData\Local\Temp\RtmpiCDMhc/roxygen2_5.0.1.9000.tar.gz" --as-cran --timings  \
  --no-manual 

* using log directory 'C:/Users/amc038/AppData/Local/Temp/RtmpiCDMhc/roxygen2.Rcheck'
* using R version 3.3.2 (2016-10-31)
* using platform: i386-w64-mingw32 (32-bit)
* using session charset: CP936
* using options '--no-manual --as-cran'
* checking for file 'roxygen2/DESCRIPTION' ... OK
* this is package 'roxygen2' version '5.0.1.9000'
* checking package namespace information ... OK
* checking package dependencies ...Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/src/contrib:
  Line starting '<!--//////////////// ...' is malformed!
 NOTE
Package suggested but not available for checking: 'covr'
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for executable files ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking whether package 'roxygen2' can be installed ... OK
* checking installed package size ... OK
* checking package directory ... OK
* checking 'build' directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* loading checks for arch 'i386'
** checking whether the package can be loaded ... OK
** checking whether the package can be loaded with stated dependencies ... OK
** checking whether the package can be unloaded cleanly ... OK
** checking whether the namespace can be loaded with stated dependencies ... OK
** checking whether the namespace can be unloaded cleanly ... OK
** checking loading without being on the library search path ... OK
* loading checks for arch 'x64'
** checking whether the package can be loaded ... OK
** checking whether the package can be loaded with stated dependencies ... OK
** checking whether the package can be unloaded cleanly ... OK
** checking whether the namespace can be loaded with stated dependencies ... OK
** checking whether the namespace can be unloaded cleanly ... OK
** checking loading without being on the library search path ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... OK
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking compiled code ... OK
 WARNING
'qpdf' is needed for checks on size reduction of PDFs
* checking installed files from 'inst/doc' ... OK
* checking files in 'vignettes' ... OK
* checking examples ...
** running examples for arch 'i386' ... OK
** running examples for arch 'x64' ... OK
* checking for unstated dependencies in 'tests' ... OK
* checking tests ...
** running tests for arch 'i386' ...
  Running 'testthat.R'
 OK
** running tests for arch 'x64' ...
  Running 'testthat.R'
 OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... OK
* checking re-building of vignette outputs ... OK
* DONE

Status: 1 WARNING, 1 NOTE
See
  'C:/Users/amc038/AppData/Local/Temp/RtmpiCDMhc/roxygen2.Rcheck/00check.log'
for details.


R CMD check results
0 errors | 0 warnings | 1 note 
checking package dependencies ... NOTE
Package suggested but not available for checking: 'covr'

Session info

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C                                                   
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] roxygen2_5.0.1.9000

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.8          digest_0.6.10        crayon_1.3.2         withr_1.0.2         
 [5] commonmark_0.9       R6_2.2.0             magrittr_1.5         stringi_1.1.2       
 [9] testthat_1.0.2.9000  xml2_1.0.0           brew_1.0-6           devtools_1.12.0.9000
[13] desc_1.0.1           tools_3.3.2          stringr_1.1.0        memoise_1.0.0  
# Because the parse in testthat::test don't specify encoding to UTF-8 as well,
# we have to use `stringi::stri_escape_unicode` to avoid errors.
expect_true(
any(grepl("\u6211\u7231\u4e2d\u6587", cnChar)) &&

This comment has been minimized.

@hadley

hadley Dec 1, 2016

Collaborator

It would be better to use two expect_true() statements here

This comment has been minimized.

@hadley

hadley Dec 1, 2016

Collaborator

Why does one string get wrapped with enc2utf8() and not the other? A comment would be helpful.

This comment has been minimized.

@shrektan

shrektan Dec 2, 2016

Contributor

Sorry... the later one is a typo....

cnChar <- readLines(file.path(test_pkg, "man", "printChineseMsg.Rd"), encoding = "UTF-8")
# Because the parse in testthat::test don't specify encoding to UTF-8 as well,
# we have to use `stringi::stri_escape_unicode` to avoid errors.

This comment has been minimized.

@hadley

hadley Dec 1, 2016

Collaborator

Just say "so we have to use unicode escapes" (it doesn't matter how you generated them)

This comment has been minimized.

@shrektan

shrektan Dec 2, 2016

Contributor

changed.

# This script is intended to be saved in GB2312 to test if non UTF-8 encoding is
# supported.
#' 中文注释

This comment has been minimized.

@hadley

hadley Dec 1, 2016

Collaborator

The fact that this previews correctly makes me worried that it isn't actually saved as GB2312.

This comment has been minimized.

@shrektan

shrektan Dec 2, 2016

Contributor

I doubted as well in the first beginning, but it seems like GitHub added a new feature to guess the file encoding. In the past, I remember the file encoded other than UTF-8 doesn't display well. But now when I checked my repos, they display good as well.

Below is the what I saw in RStudio:

image

image

@hadley

This comment has been minimized.

Collaborator

hadley commented Dec 1, 2016

Thanks for your help with this - a view more minor comments.

@shrektan shrektan force-pushed the shrektan:parseNonASCII branch from 5f66da4 to 3be15de Dec 2, 2016

@shrektan shrektan force-pushed the shrektan:parseNonASCII branch from 3be15de to 000b831 Dec 2, 2016

@shrektan

This comment has been minimized.

Contributor

shrektan commented Dec 2, 2016

You're welcome. I've learned more by submitting this PR 👍

@hadley hadley merged commit 8d879e9 into klutometis:master Dec 2, 2016

3 checks passed

codecov/patch 100% of diff hit (target 89.13%)
Details
codecov/project 89.30% (+0.16%) compared to 9b34848
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@hadley

This comment has been minimized.

Collaborator

hadley commented Dec 2, 2016

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment