Conflict with clustermq, callr, and ggsave() #197

Closed
wlandau opened this issue Jun 6, 2020 · 14 comments

@wlandau
Contributor

wlandau commented Jun 6, 2020

@djbirke discovered this issue and originally posted it to ropensci/drake#1270. Something about the combination of clustermq, callr, and ggsave() leads to crashes. The reprex below crashes as written but runs cleanly without the call to ggsave().

fun <- function() {
  options(clustermq.scheduler = "multicore")
  f <- function(x) {
    ggplot2::ggsave(filename = "plot.png", plot = plot(42))
  }
  clustermq::Q(fun = f, x = 1, n_jobs = 1)
}
callr::r(fun, show = TRUE)
#> Submitting 1 worker jobs (ID: 6448) ...
#> Running 1 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> Saving 7 x 7 in image
#> objc[7660]: +[NSNumber initialize] may have been in progress in another thread when fork() was called.
#> objc[7660]: +[NSNumber initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Session info:

R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] clustermq_0.8.9

loaded via a namespace (and not attached):
 [1] ps_1.3.3          prettyunits_1.1.1 crayon_1.3.4      R6_2.4.1          rlang_0.4.6      
 [6] progress_1.2.2    rstudioapi_0.11   callr_3.4.3       vctrs_0.3.1       tools_4.0.0      
[11] hms_0.5.3         yaml_2.2.1        parallel_4.0.0    compiler_4.0.0    processx_3.4.2   
[16] pkgconfig_2.0.3   rzmq_0.9.7   

When I run this same example on a RHEL 7 machine with SGE, workers stay stuck in Eqw ("job waiting in error state"). Code:

fun <- function() {
  options(clustermq.scheduler = "sge", clustermq.template = "sge.tmpl")
  f <- function(x) {
    ggplot2::ggsave(filename = "plot.png", plot = plot(42))
  }
  clustermq::Q(fun = f, x = 1, n_jobs = 1)
}
callr::r(fun, show = TRUE)

Template file:

#$ -N {{ job_name }}
#$ -t 1-{{ n_jobs }}
#$ -j y
#$ -o /dev/null
#$ -cwd
#$ -V
module load R/3.6.3
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

@mschubert
Owner

Can you provide a worker log for your SGE example?

@wlandau
Contributor Author

wlandau commented Jun 6, 2020

The SGE issue was actually my mistake: I was running it in a directory not available to the worker. It works fine on SGE. I also tried reproducing the crash using the multicore backend on a RHEL 7 machine, and it completed without errors.

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: CENSORED/intel/intel-2020/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] processx_3.4.2 compiler_3.6.3 R6_2.4.1       crayon_1.3.4   callr_3.4.3   
[6] ps_1.3.2      

I wonder if this issue is somehow related to R 4.0.0 vs 3.6.3.

@mschubert
Owner

Ok, that's what I expected. I doubt R>=4.0.0 has anything to do with it since all the tests have been running on that for a while and they never failed.

In a way, clustermq cannot be responsible for R crashes because it (currently) does not have compiled code.

My best guess (if @djbirke is using macOS as well) would be a problem with forking (mcparallel), also discussed on r-devel:

Fork without exec is not supported by macOS, basically any calls to system libraries might crash.

This is, obviously, not good. One way to address it will be to provide a non-forking multicore backend (#142).
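
To make the distinction between a forked child and a freshly started process concrete, here is a minimal sketch using only base R. The variable name `parent_only_value` is hypothetical and exists only for illustration; the sketch assumes a Unix-like system, since `parallel::mcparallel()` does not fork on Windows. It does not reproduce the crash, it only shows that a forked child is a copy of the parent session while a fresh process is not.

```r
# Hypothetical variable name, used only to show what each child can see.
parent_only_value <- "set in parent"

# Fork without exec: the child is a copy of this R session, so it inherits
# the parent's objects along with whatever state its loaded libraries hold.
job <- parallel::mcparallel(
  list(pid = Sys.getpid(), sees_parent = exists("parent_only_value"))
)
forked <- parallel::mccollect(job)[[1]]

# Fresh process: a new R session started from scratch, inheriting nothing
# from the parent's workspace.
rscript <- file.path(R.home("bin"), "Rscript")
fresh <- system2(
  rscript,
  c("--vanilla", "-e",
    shQuote('cat(exists("parent_only_value"), fill = TRUE)')),
  stdout = TRUE
)

print(forked$sees_parent)  # whether the forked child saw the parent's object
print(fresh)               # what the fresh session reported
```

The forked child should report that it sees the parent's object, and the fresh session should not. A non-forking backend trades the cost of starting new sessions for that isolation.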

@mschubert
Owner

Also maybe related: forking problems with RStudio HenrikBengtsson/future#299

@djbirke

djbirke commented Jun 8, 2020

@wlandau Thank you so much for looking into my original issue, thinking about the probable causes, and then bringing it up here.
@mschubert Thank you for looking into this. Yes, I am using macOS. I did recently upgrade to R 4.0.0, and the issue had never occurred beforehand. Please let me know if I can be of help in tracking down the cause of this issue.

@mschubert
Owner

@djbirke Can you maybe come up with an example that uses callr::r like above but parallel::mcparallel instead of clustermq::Q and see if the problem persists?
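
A minimal sketch of what such a variant might look like, assuming it keeps the plotting call from the reprex at the top of the thread and swaps `clustermq::Q()` for base R's `parallel::mcparallel()` and `parallel::mccollect()`. Nobody in the thread posted this version, so treat it as an illustration rather than a tested reproduction. The `requireNamespace()` guard is an addition so the block does not fail where callr is not installed.

```r
fun <- function() {
  # Same plotting call as in the original reprex.
  f <- function(x) {
    ggplot2::ggsave(filename = "plot.png", plot = plot(42))
  }
  # Fork a child with base R's parallel package instead of clustermq::Q().
  job <- parallel::mcparallel(f(1))
  # Wait for the child and return its result (or its captured error).
  parallel::mccollect(job)
}

# Run the whole thing in a fresh R session, as in the original reprex.
if (requireNamespace("callr", quietly = TRUE)) {
  callr::r(fun, show = TRUE)
}
```

If this variant also crashes on macOS, that would point at forking itself rather than at clustermq.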

@mschubert
Owner

@djbirke I have now added a multiprocess interface in the develop branch that runs on callr and hence avoids forking.

Can you check whether this still crashes for you when you run the following?

remotes::install_github("mschubert/clustermq", ref="develop")
options(clustermq.scheduler = "multiprocess")
# run your example ...

Please reopen if your issue persists.

@djbirke

djbirke commented Jun 23, 2020

@mschubert Thank you for adding the new interface. I am having trouble compiling it. I installed zmq using brew install zmq, but I receive the following error message. Would you have any pointers?

* installing *source* package ‘clustermq’ ...
** using staged installation
** libs
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c RcppExports.cpp -o RcppExports.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c bindings.cpp -o bindings.o
In file included from bindings.cpp:4:
../inst/zmq.hpp:1205:16: warning: 'send' is deprecated: from 4.3.1, use send taking message_t and send_flags [-Wdeprecated-declarations]
        return send(msg, flags_);
               ^
../inst/zmq.hpp:1189:5: note: 'send' has been explicitly marked deprecated here
    ZMQ_DEPRECATED("from 4.3.1, use send taking message_t and send_flags")
    ^
../inst/zmq.hpp:45:44: note: expanded from macro 'ZMQ_DEPRECATED'
#define ZMQ_DEPRECATED(msg) __attribute__((deprecated(msg)))
                                           ^
1 warning generated.
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c zeromq.cpp -o zeromq.o
In file included from zeromq.cpp:5:
../inst/zmq.hpp:1205:16: warning: 'send' is deprecated: from 4.3.1, use send taking message_t and send_flags [-Wdeprecated-declarations]
        return send(msg, flags_);
               ^
../inst/zmq.hpp:1189:5: note: 'send' has been explicitly marked deprecated here
    ZMQ_DEPRECATED("from 4.3.1, use send taking message_t and send_flags")
    ^
../inst/zmq.hpp:45:44: note: expanded from macro 'ZMQ_DEPRECATED'
#define ZMQ_DEPRECATED(msg) __attribute__((deprecated(msg)))
                                           ^
zeromq.cpp:133:64: error: cannot pass object of non-trivial type 'std::string' (aka 'basic_string<char, char_traits<char>, allocator<char> >') through variadic function; call will abort at runtime [-Wnon-pod-varargs]
            Rf_error("Trying to access non-existing socket: ", socket_id);
                                                               ^
1 warning and 1 error generated.
make: *** [zeromq.o] Error 1
ERROR: compilation failed for package ‘clustermq’
* removing ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/clustermq’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/clustermq’
Error: Failed to install 'clustermq' from GitHub:
  (converted from warning) installation of package ‘/var/folders/ft/fmwx6nnj4zqct6dpq5z_8d6w0000gn/T//RtmphCiHI7/file167365e5edc8c/clustermq_0.8.93.tar.gz’ had non-zero exit status

mschubert added a commit that referenced this issue Jun 23, 2020
@mschubert
Owner

Thank you for trying it out! I fixed the error you got (which I think is specific to the compiler; you are using clang, right?). Can you try again?

@djbirke

djbirke commented Jun 23, 2020

Yes, I am using clang. With your latest fix I was able to compile, and it worked! Using options(clustermq.scheduler = "multicore") produces the bug, but using options(clustermq.scheduler = "multiprocess") works without issues. Thank you very much for your help!
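
One way to act on this result in a script that runs on several machines is to choose the scheduler by platform. The sketch below is an assumption, not something proposed in the thread: `pick_scheduler()` is a hypothetical helper, and it presumes a clustermq build that provides the "multiprocess" scheduler (at the time of this thread, the develop branch). It keeps "multicore" only on Linux, where the reprex completed without errors above, and falls back to "multiprocess" everywhere else.

```r
# Hypothetical helper: choose a clustermq scheduler name from the OS name.
# Assumption: the installed clustermq supports the "multiprocess" scheduler.
pick_scheduler <- function(sysname = Sys.info()[["sysname"]]) {
  if (identical(sysname, "Linux")) {
    "multicore"      # forking worked on the RHEL 7 machine in this thread
  } else {
    "multiprocess"   # avoids fork(), which crashed on macOS in this thread
  }
}

# Setting the option needs only base R; clustermq reads it later.
options(clustermq.scheduler = pick_scheduler())
```

Passing the OS name as an argument keeps the helper easy to check without switching machines, for example `pick_scheduler("Darwin")`.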

@mschubert
Owner

That's great to hear, thank you for testing!

Then the crash seems indeed related to multicore forking, and there's unfortunately nothing we can do about it.

@djbirke

djbirke commented Jun 23, 2020

Thank you. Would you have a recommendation as to whom I could raise this issue with?

@mschubert
Owner

Unfortunately, no. As far as I understand, this is a limitation of the operating system.

@djbirke

djbirke commented Jun 23, 2020

That is sad. Thank you for your assessment.
