Conflicting information about usage of PBS/Torque Schedulers #184

Closed
strazto opened this issue Dec 10, 2019 · 9 comments · Fixed by #185

strazto commented Dec 10, 2019

According to the README, using a PBS scheduler requires setting options(clustermq.scheduler = "PBS") or options(clustermq.scheduler = "Torque"). However, according to this doc, we should specify the following:

options(
    clustermq.scheduler = "sge",
    clustermq.template.lsf = "/path/to/file/below"
)

I'm confused about which is correct, or whether it matters.

Cheers!

strazto commented Dec 10, 2019

also, setting a scheduler that derives from SGE just provides default values for the template file if a template isn't specified

strazto commented Dec 11, 2019

Just to provide a tl;dr on this:

  • The User Guide and the README give conflicting information on using PBS-based schedulers.

  • The options given in the User Guide are wrong.

    • They specify setting clustermq.template.lsf instead of just clustermq.template
      ```{r eval=FALSE}
      options(
          clustermq.scheduler = "sge",
          clustermq.template.lsf = "/path/to/file/below"
      )
      ```
    • This will cause clustermq to silently fail to find an option for clustermq.template, and fall back to the default SGE template.
  • The template given in the User Guide is wrong, assuming the user would like to submit their job to more than one worker:

    #PBS -N {{ job_name }}
    #PBS -l select=1:ncpus={{ cores | 1 }}
    #PBS -l walltime={{ walltime | 1:00:00 }}
    #PBS -q default
    #PBS -o {{ log_file | /dev/null }}
    #PBS -j oe
    ulimit -v $(( 1024 * {{ memory | 4096 }} ))
    CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

  • The default template you get by following the README and setting options(clustermq.scheduler = "PBS") is wrong, or at least doesn't work on my system:

    #PBS -N {{ job_name }}
    #PBS -l nodes={{ n_jobs }}:ppn={{ cores | 1 }}
    #PBS -o {{ log_file | /dev/null }}
    #PBS -j oe
    ulimit -v $(( 1024 * {{ memory | 4096 }} ))
    CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

    • Contrast with the SGE template:

      #$ -t 1-{{ n_jobs }}
      #$ -pe {{ cores | 1 }}

    • Notice that the SGE template submits a job array of n_jobs tasks, with each task requesting only the number of CPUs given by cores.

    • The same is true of some of the other templates, such as Slurm:

      #SBATCH --array=1-{{ n_jobs }}
      #SBATCH --cpus-per-task={{ cores | 1 }}

      and LSF:

      #BSUB-J {{ job_name }}[1-{{ n_jobs }}]
      #BSUB-n {{ cores | 1 }}

  • The correct PBS template should resemble the following:

    #PBS -N {{ job_name }}
    # Submit a job array of n jobs
    #PBS -J 1-{{ n_jobs }}
    # I'm not sure if "mem" is a standard PBS parameter, or is specific to our cluster
    # "mem" is the only part I'm less certain of
    # Request one chunk per job, with "cores" per chunk, and "mem" memory per chunk
    #PBS -l select=1:ncpus={{ cores | 1 }}:mpiprocs={{ cores | 1 }}:mem={{ mem | 2GB }}
    #PBS -l walltime={{ walltime | 1:00:00 }}
    
    #PBS -o {{ log_file | /dev/null }}
    #PBS -j oe
    
    ulimit -v {{ memory | 2097152 }}
    CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'

    Jobs submitted in this way will actually distribute work in parallel.
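To make the difference between the two PBS templates concrete, here is a minimal shell sketch that fills the `{{ key | default }}` placeholders and prints the directives the scheduler would actually see. `fill_template` is a hypothetical helper written for this illustration; it is not clustermq's own template code, and it assumes the substituted values contain no `#`, `&`, or backslash characters.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT clustermq's implementation: fill "{{ key }}" and
# "{{ key | default }}" placeholders in a template read from stdin.
# Usage: fill_template key=value [key=value ...] < template
fill_template() {
    local text pair key value
    text=$(cat)
    for pair in "$@"; do
        key=${pair%%=*}
        value=${pair#*=}
        # Replace this key's placeholder, whether or not it carries a default.
        # Assumes the value contains no '#', '&' or backslash characters.
        text=$(printf '%s\n' "$text" |
            sed -E "s#\{\{ *${key} *(\|[^}]*)?\}\}#${value}#g")
    done
    # Any placeholder that is still present falls back to its default.
    printf '%s\n' "$text" |
        sed -E 's#\{\{ *[A-Za-z_]+ *\| *([^}]*[^} ]) *\}\}#\1#g'
}

# Resource request from the default PBS template quoted above.
old_line='#PBS -l nodes={{ n_jobs }}:ppn={{ cores | 1 }}'

# Array directive and resource request from the proposed template.
new_lines='#PBS -J 1-{{ n_jobs }}
#PBS -l select=1:ncpus={{ cores | 1 }}'

printf '%s\n' "$old_line"  | fill_template n_jobs=3
printf '%s\n' "$new_lines" | fill_template n_jobs=3
```

With `n_jobs=3` and `cores` left at its default, the first template yields a single job asking for `nodes=3:ppn=1`, while the proposed one yields a three-task array (`-J 1-3`) in which each task asks for `select=1:ncpus=1`. The same substitution turns the original memory line into `ulimit -v $(( 1024 * 4096 ))`, which bash evaluates to 4194304 KiB (4 GiB); the proposed template's default of 2097152 corresponds to 2 GiB when read in the same KiB units.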

mschubert added the bug label Dec 11, 2019

strazto commented Dec 12, 2019

@mschubert since you've labeled this a bug, I'm going to go ahead and put in a PR to fix it

strazto commented Jan 9, 2020

@mschubert when you have time, could you please review #185? It should resolve this.

mschubert commented

Hi @mstr3336,

Thank you for spotting this and also providing the PR.

> also, setting a scheduler that derives from SGE just provides default values for the template file if a template isn't specified

This is expected behavior, but I agree it should be mentioned more explicitly in the documentation (README, User Guide)

> since you've labeled this a bug, I'm going to go ahead and put in a PR to fix it

> They specify setting clustermq.template.lsf instead of just clustermq.template

Thanks! The "bug" part here is the incorrect documentation (left over from a previous version), and potentially a wrong PBS template (see below).

> The template given in the usage is wrong, assuming the user would like to submit their job to more than one worker

> The default template given according to following the readme and setting options(clustermq.scheduler = "PBS") is wrong - Or at least doesn't work on my system

I remember putting together the template according to PBS documentation that I found online. This may have been severely out of date, and I've got no way to test this (I only have access to LSF and Slurm).

If I'm changing this, I want to make sure that there are not just multiple flavors of PBS with different options. Do you know if PBS Pro is the only PBS scheduler available? (You may be our first PBS user, I've never actually interacted with this scheduler before.)
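
For context on the flavor question: PBS Pro and Torque are both in use, and they differ in how job arrays are handled. PBS Pro requests an array with `qsub -J` and exposes the task number as `PBS_ARRAY_INDEX`, while Torque uses `qsub -t` and `PBS_ARRAYID`. The sketch below shows one way a job script could read the index under either flavor. `array_task_index` is a hypothetical helper for illustration only and is not part of clustermq.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the array task index under either PBS flavor.
array_task_index() {
    # PBS Pro (qsub -J) sets PBS_ARRAY_INDEX; Torque (qsub -t) sets PBS_ARRAYID.
    if [ -n "${PBS_ARRAY_INDEX:-}" ]; then
        echo "$PBS_ARRAY_INDEX"
    elif [ -n "${PBS_ARRAYID:-}" ]; then
        echo "$PBS_ARRAYID"
    else
        echo "no array index set" >&2
        return 1
    fi
}
```

Inside a job script, `idx=$(array_task_index) || exit 1` would then work on both flavors. The resource request lines still differ between the two, which is one reason to keep separate templates.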

mschubert pushed a commit that referenced this issue Feb 7, 2020
* Improved template for PBS submission

For #184 , now submits jobs as arrays

* Update userguide + README for PBS/Torque

* Update changelog

* Style and clarity improvements for docs/news, remove commented line from PBS template

* PBS template resource list spec changed to "old" syntax

According to https://www.pbsworks.com/pdfs/PBSUserGuide13.0.pdf, UG-104, this change should be equivalent

* Modified PBS and Torque sections in userguide

Restructured instructions for setting up custom templates so
that they follow instructions on using default templates.

For the PBS and Torque sections, now only use the relevant
`clustermq.scheduler` option for that scheduler throughout its
whole section (i.e., within the Torque section, the option
`clustermq.scheduler` is always set to "Torque", and likewise for PBS)