
process memory: some (many?) sge clusters use h_vmem, not virtual_free #332

Closed
odoublewen opened this issue May 4, 2017 · 18 comments

@odoublewen
Contributor

When using the sge executor, setting memory 16.GB in a process results in -l virtual_free=16G appearing in the header of the .command.run file.

Some sge clusters don't pay attention to this, and instead use h_vmem (I'm not sure how common/uncommon this is!)

Of course, one can use clusterOptions '-l h_vmem=16G' but then one can't take advantage of the retry mechanism afforded by dynamic computing resources.

Could the way that sge interprets the memory directive be made configurable?

PS... I can use this as a workaround, but it's ugly:

process my_process {
    memory 16.GB
    clusterOptions = "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}"
    ....
}
@odoublewen
Contributor Author

Reading more about gridengine, I realize that virtual_free and h_vmem are doing different things.

virtual_free is advice to the scheduler (which node to place the job on), while h_vmem sets a hard limit on memory usage.

So I can understand that nextflow doesn't need to support h_vmem... the workaround (above) seems sufficient.

Close this issue if you agree... thanks.

@odoublewen
Contributor Author

One more observation:

While this:

process my_process {
    memory 16.GB
    clusterOptions = "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}"
    ....
}

results in -l h_vmem=16G in my .command.run file.

...It breaks when I try to dynamically set memory, like this:

process my_process {
    memory {16.GB * task.attempt}                                                                                                                           
    clusterOptions = "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}"
    ...
}

this results in -l h_vmem=_nf_script_033fec75$_run_closure1$_closure8@7e3060d8 in my .command.run file.

Seems that by multiplying the memory by task.attempt, the closure has somehow changed the memory object.

@pditommaso
Member

you need to reference it as task.memory

@hartzell
Contributor

hartzell commented May 4, 2017

Keep in mind that consumables like h_vmem are per-slot, and that slots are used as a mechanism to allocate cores. A 4-core job that wants 40GB should set h_vmem to 10GB.

virtual_free, on the other hand, simply checks how much free VM exists on a machine at placement time. That same job would ask for virtual_free of 40GB.

You can't use the same value for both characteristics unless you're asking for a single slot.
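
For illustration, a minimal sketch of that per-slot division in Nextflow directive form, assuming MemoryUnit's toMega() helper and treating task.memory as the total request (adjust to your site's consumable semantics):

process my_process {
    cpus 4
    memory { 40.GB * task.attempt }
    // h_vmem is per slot on many SGE/UGE setups, so divide the total request by the slot count
    clusterOptions { "-l h_vmem=${task.memory.toMega().intdiv(task.cpus)}M" }
    ...
}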

@odoublewen
Contributor Author

@pditommaso -- Hmmm, I get ERROR ~ No such variable: task when I try
clusterOptions = "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}"

@hartzell -- yes, I was planning on dividing the number by the number of requested slots (accessed via the cpus nextflow directive)... but trying to get this working first... baby steps...

@pditommaso
Member

There's a glitch in the syntax: = must be used only in the config file, not in the Nextflow script. It should be like the following:

process my_process {
    memory {16.GB * task.attempt}                                                                                                                           
    clusterOptions "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}"
    ...
}
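
For reference, the equivalent in a config file (where the = assignment is required) also needs a closure so it is resolved per task; a sketch along the lines of what this thread settles on later:

process {
    memory = { 16.GB * task.attempt }
    clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
}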

@odoublewen
Contributor Author

Oops, thanks for this. I will add this info to the google group to provide reference for others.

At one point I noticed the = sign, but since it was working (without ${variables}) I just assumed it was optional.

All working now! Thanks!

@Phlya

Phlya commented Nov 28, 2018

Hi, I'd like to try this, but I don't really use nextflow apart from running one pipeline (distiller) that was made by other people. Can someone please show how to add the division by number of cores to this syntax?

@pditommaso
Member

Use the discussion forum please https://groups.google.com/forum/#!forum/nextflow

@logust79

logust79 commented Apr 6, 2020

Hi, when I use @pditommaso 's approach, removing the '=' sign,

process my_process {
    memory {16.GB * task.attempt}
    clusterOptions "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}"
    ...
}

I still get a similar error to this:

-l h_vmem=_nf_script_033fec75$_run_closure1$_closure8@7e3060d8

in .command.run. Any clue?

Thanks!

@pditommaso
Member

Please provide a replicable test case in a separate issue.

@logust79

logust79 commented Apr 6, 2020

As @pditommaso mentioned, it should be task.memory. So in the end

process my_process {
    memory {16.GB * task.attempt}
    clusterOptions "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}"
    ...
}

solved the issue

@RenzoTale88

Hello @pditommaso,
I'm working on a cluster with the same situation described above (h_vmem instead of virtual_free). The above fix worked with my old configuration files for a DSL1 workflow. However, since I moved the workflow to DSL2, the fix has stopped working.
For instance, this configuration file:

executor{
  name = "uge"
  queueSize = 500
  cpu = 1
  memory = 8.GB
  time = 23.h
}

process {

  beforeScript = """
  . /etc/profile.d/modules.sh
  module load anaconda/5.3.1
  sleep 2;
  """
  penv = "sharedmem"

  cpus = 1
  memory = 8.GB
  time = 6.h
  clusterOptions = "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}"

  errorStrategy = { task.exitStatus in [143,137,104,134,139,140] ? 'retry' : 'terminate' }
  maxRetries = 5
  maxErrors = '-1'

  withLabel: small{
    cpus = 1
    memory = { 4.GB * task.attempt }
    time = {6.h * task.attempt }
  }
  withLabel: medium{
    cpus = 1
    memory = { 16.GB * task.attempt }
    time = { 12.h * task.attempt }
  }
  withLabel: large{
    cpus = 1
    memory = { 32.GB * task.attempt }
    time = { 23.h * task.attempt }
  }
  withLabel: long{
    cpus = 1
    memory = { 128.GB * task.attempt }
    time = { 96.h * task.attempt }
  }
  withLabel: small_multi{
    cpus = { 2 * task.attempt }
    memory = { 8.GB * task.attempt }
    time = { 4.h * task.attempt }
  }
}

This should give 128GB of memory per core when using the long configuration. However, when I look at the .command.run file header, I see something like this:

#!/bin/bash
#$ -wd /PATH/TO/work/a1/ee5455196b900f37fe94721da68994
#$ -N nf-ROH_roh_(ROH)
#$ -o /PATH/TO/work/a1/ee5455196b900f37fe94721da68994/.command.log
#$ -j y
#$ -terse
#$ -notify
#$ -pe sharedmem 1
#$ -l h_rt=96:00:00
#$ -l h_rss=131072M,mem_free=131072M
#$ -l h_vmem=8G

The workflow uses the correct memory specification for the "normal" configuration (mem_free), but it falls back to the generic configuration when it comes to h_vmem, preventing my jobs from finishing correctly. I could hard-code resources into the individual processes, but that would limit the flexibility of the workflow itself. Is there a solution? Am I placing the configuration in the wrong place?
I've tried placing clusterOptions in each label configuration, but it didn't work, complaining that it couldn't find process.memory. Not sure how to fix it.

Thank you in advance for your help
Andrea

@pditommaso
Member

pditommaso commented Dec 24, 2020

With the plain string, the ${memory...} part is interpolated when the config file is read, against the process-scope memory = 8.GB, which is why every job gets h_vmem=8G. You need the cluster option to be evaluated dynamically against the actual task memory value, so use the { } closure syntax, e.g.

clusterOptions = { "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}" } 

@RenzoTale88

@pditommaso just to get it right, something like this:

withLabel: small{
    cpus = 1
    memory = { 4.GB * task.attempt }
    time = {6.h * task.attempt }
    clusterOptions = { "-l h_vmem=${memory.toString().replaceAll(/[\sB]/,'')}" }
  }

Am I correct?

@pditommaso
Member

Sorry task.memory not memory

withLabel: small{
    cpus = 1
    memory = { 4.GB * task.attempt }
    time = {6.h * task.attempt }
    clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }
  }


@RenzoTale88

@pditommaso it worked perfectly, thank you!
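
As a follow-up sketch (an assumption beyond what the thread itself shows): since the closure is resolved per task, it can also sit once in the top-level process scope rather than being repeated in every withLabel block:

process {
    // resolved per task, so label-specific memory values are picked up
    clusterOptions = { "-l h_vmem=${task.memory.toString().replaceAll(/[\sB]/,'')}" }

    withLabel: small {
        cpus = 1
        memory = { 4.GB * task.attempt }
        time = { 6.h * task.attempt }
    }
}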
