clustermq hangs after job failed #22

Closed
klmr opened this issue Feb 27, 2017 · 3 comments

klmr commented Feb 27, 2017

I am running the following job. The job fails, but the Q call hangs, and the interactive progress bar still shows it as running:

rm_annotation_file = 'data/annotation/mm10-rmsk.txt'

gz_extract = function (file) {
    warning_to_error = function (expr)
        withCallingHandlers(expr, warning = function (w) stop(w))

    if (! grepl('\\.gz$', file))
        file = paste0(file, '.gz')

    warning_to_error(system(sprintf('gunzip %s', shQuote(file)), intern = TRUE))
}

cmq = modules::import_package('clustermq')
cmq$Q(gz_extract, file = rm_annotation_file, memory = 500, n_jobs = 1, log_worker = TRUE)

Here’s the relevant output in the log file:

rzmq7123-1.log
> clustermq:::worker("rzmq7123-1", "tcp://bc-32-1-04:7123", 500)
[1] "tcp://bc-32-1-04:7123"
[1] 500
WORKER_UP to: tcp://bc-32-1-04:7123
function (file) {
    warning_to_error = function (expr)
        withCallingHandlers(expr, warning = function (w) stop(w))

    if (! grepl('\\.gz$', file))
        file = paste0(file, '.gz')

    warning_to_error(system(sprintf('gunzip %s', shQuote(file)), intern = TRUE))
}
NULL
received: DO_CHUNK
gzip: data/annotation/mm10-rmsk.txt.gz: No such file or directory
Error: running command 'gunzip 'data/annotation/mm10-rmsk.txt.gz'' had status 1
Execution halted

------------------------------------------------------------
Sender: LSF System <lsfadmin@bc-31-2-08>
Subject: Job 3300290: <rzmq7123-1> in cluster <farm3> Exited

Job <rzmq7123-1> was submitted from host <bc-32-1-04> by user <kr15> in cluster <farm3>.
Job was executed on host(s) <bc-31-2-08>, in queue <normal>, as user <kr15> in cluster <farm3>.
</nfs/users/nfs_k/kr15> was used as the home directory.
</lustre/scratch115/realdata/mdt2/teams/miska/users/kr15/projects/time-series> was used as the working directory.
Started at Mon Feb 27 16:59:38 2017
Results reported on Mon Feb 27 17:00:02 2017

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#BSUB-J rzmq7123-1                  # name of the job / array jobs
#BSUB-g /rzmq/7123       # group the job belongs to
#BSUB-o rzmq7123-1.log      # stdout + stderr
#BSUB-M 500             # Memory requirements in Mbytes
#BSUB-R rusage[mem=500] # Memory requirements in Mbytes
#BSUB-R select[mem>500]
#BSUB-R span[hosts=1]
#BSUB-q normal                          # name of the queue

R --no-save --no-restore -e \
    'clustermq:::worker("rzmq7123-1", "tcp://bc-32-1-04:7123", 500)'


------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   0.62 sec.
    Total Requested Memory :                     500.00 MB
    Delta Memory :                               -

The output (if any) is above this job summary.

mschubert (Owner) commented

The problem is that withCallingHandlers() changes how R handles errors raised in functions wrapped in a try() call.

> x = try(gz_extract("invalid"))
gzip: invalid.gz: No such file or directory
Error: running command 'gunzip 'invalid.gz'' had status 1

As far as I understand, this should not happen. Even if a warning inside the function call is converted to a stop(), the enclosing try() should return a try-error rather than let the stop() propagate to the top level.
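
A possible explanation, offered as a sketch rather than a confirmed diagnosis of clustermq itself: stop(w) re-signals the original condition object, and that object still has class warning rather than error, so a handler registered for class error (which is what try() relies on) would not match it. The snippet below only inspects the classes involved; the variable names are illustrative and not part of the issue.

```r
# Capture the condition object that the handler would pass to stop().
w = tryCatch(warning('boom'), warning = function (w) w)
w_is_error = inherits(w, 'error')   # the object is a warning, not an error

# Re-signal it with stop(w), as in the issue, and use a catch-all
# handler to see what class of condition actually gets signalled.
signalled = tryCatch(
    withCallingHandlers(warning('boom'), warning = function (w) stop(w)),
    condition = function (cond) cond
)
caught_as_error = inherits(signalled, 'error')

print(class(signalled))
print(caught_as_error)
```

If the re-signalled condition is not of class error, an error-only handler skips it, and stop() then falls back to R's default error handling.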

klmr commented Feb 27, 2017

Huh, that’s bizarre, unexpected and painful. I’ll try to find out how to do this better then — I thought this was how withCallingHandlers was meant to be used.
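
One way this might be done, as a minimal sketch and not necessarily what was settled on: raise a fresh error from the warning's message instead of passing the warning object itself to stop(). The call as.integer('x') is only a stand-in for any expression that emits a warning.

```r
# Convert warnings to genuine errors by building a new error from
# the warning's message, so that the signalled condition has class 'error'.
warning_to_error = function (expr) {
    withCallingHandlers(expr, warning = function (w) {
        stop(conditionMessage(w), call. = FALSE)
    })
}

# as.integer('x') emits a coercion warning; wrapped like this it becomes
# an error that try() can catch and turn into a 'try-error' object.
result = try(warning_to_error(as.integer('x')), silent = TRUE)
is_try_error = inherits(result, 'try-error')
print(is_try_error)
```

The trade-off is that the original call and condition class are lost; only the message is carried over.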

klmr commented Feb 28, 2017

Question on Stack Overflow: http://stackoverflow.com/q/42506768/1968
