Option to limit process requirements instead of exiting with error #640

Closed · ewels opened this issue Mar 21, 2018 · 56 comments · Fixed by #2911

@ewels (Member) commented Mar 21, 2018

If Nextflow encounters a process that requests resources that are not available on the current system, it fails. For example, if 10 processors are requested and the system only has 8, we exit with an error. This is A Good Thing 👍

However, when writing config files for a pipeline that aim to provide generic, works-for-most-situations requests, it can be nice to be able to throttle them, for example for small genomes, test datasets, or small machines.

In most of our pipelines, we now have this little helper function in nextflow.config:

// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
  if(type == 'memory'){
    try {
      if(obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
        return params.max_memory as nextflow.util.MemoryUnit
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max memory '${params.max_memory}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'time'){
    try {
      if(obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
        return params.max_time as nextflow.util.Duration
      else
        return obj
    } catch (all) {
      println "   ### ERROR ###   Max time '${params.max_time}' is not valid! Using default value: $obj"
      return obj
    }
  } else if(type == 'cpus'){
    try {
      return Math.min( obj, params.max_cpus as int )
    } catch (all) {
      println "   ### ERROR ###   Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
      return obj
    }
  }
}
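
For reference, the params this function reads need to be declared somewhere in nextflow.config; a minimal sketch (the default values here are illustrative, not prescriptive):

params.max_memory = 128.GB
params.max_time   = 240.h
params.max_cpus   = 16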

Then, in each config we wrap the request in the above function, like so:

cpus = { check_max( 1 * task.attempt, 'cpus') }
memory = { check_max( 8.GB * task.attempt, 'memory') }
time = { check_max( 2.h * task.attempt, 'time') }

This way, the defaults work most of the time. But if you pass --max_memory "8.GB" then all processes will use a maximum of 8.GB and the pipeline will run on your tiny machine. This is really helpful for test datasets and allows us to have a default "base" config file that is throttled by other system-specific configs.

Is this something that could be built into Nextflow as a new feature? Keep the current default behaviour, but add new command-line arguments and config scopes to be able to throttle the maximum resources without failing?

Phil

@ewels (Member, Author) commented Mar 21, 2018

As well as / instead of these manually specified limits, there could perhaps be something similar to errorStrategy:

process {
    notEnoughResources: 'terminate' // default behaviour
    notEnoughResources: 'limit'     // throttle to available resources
    notEnoughResources: {           // limit to specified values
        cpus: 4
        memory: 8.GB
        time: 16.h
    }
}

@pditommaso (Member):

So, as a first thing I would suggest using this simplified version of the check_max function, which infers the argument type and takes the max value as the second argument.

def check_max(obj, max) {
    if( obj instanceof nextflow.util.MemoryUnit ) {
        try {
            def other = max as nextflow.util.MemoryUnit
            return obj.compareTo(other) == 1 ? other : obj
        }   
        catch( all ) {
            println "   ### ERROR ###   Max memory '${max}' is not valid! Using default value: $obj"
            return obj
        }   
    }
    if( obj instanceof nextflow.util.Duration ) {
        try {
            def other = max as nextflow.util.Duration
            return obj.compareTo(other) == 1 ? other : obj
        }   
        catch( all ) {
            println "   ### ERROR ###   Max time '${max}' is not valid! Using default value: $obj"
            return obj
        }         
    }
    if( obj instanceof Integer ) {
        try {
            return Math.min( obj, max as int )
        }   
        catch( all ) {
            println "   ### ERROR ###   Max cpus '${max}' is not valid! Using default value: $obj"
            return obj
        }         
    }    
}
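
Usage would then pass the max value explicitly, e.g. (adapting the earlier wrapping, with the same params.max_* values as before):

cpus = { check_max( 1 * task.attempt, params.max_cpus ) }
memory = { check_max( 8.GB * task.attempt, params.max_memory ) }
time = { check_max( 2.h * task.attempt, params.max_time ) }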

Still not sure if the best thing is to add a built-in function like the above or implement a more advanced feature as you are suggesting in the second comment.

@ewels (Member, Author) commented Mar 27, 2018

Ace! Thanks for the improved function, I will use that in the short term at least 👍

Still think that the core feature would be better though 😉

@pditommaso (Member) commented Mar 27, 2018

Also, the try/catch should be unnecessary, therefore it could be simplified as:

def check_max(obj, max) {
    if( obj instanceof nextflow.util.MemoryUnit ) {
          def other = max as nextflow.util.MemoryUnit
          return obj.compareTo(other) == 1 ? other : obj
    }
    if( obj instanceof nextflow.util.Duration ) {
          def other = max as nextflow.util.Duration
          return obj.compareTo(other) == 1 ? other : obj     
    }
    if( obj instanceof Integer ) {
          return Math.min( obj, max as int )  
    }    
    
    println "   ### ERROR ###  invalid check_max value=$obj"
    return obj
}

Still think that the core feature would be better though

I agree !

@ewels (Member, Author) commented Mar 27, 2018

Aha, because we're now testing the type - nice! But if supplied on the command line, will these not always be strings? I think that's why I was trying to coerce the type (and catching the exception if it failed)..

@pditommaso (Member):

I was a bit too optimistic; yes, the try/catch is needed to handle an invalid max value (I was thinking of obj instead).

@ewels (Member, Author) commented Mar 27, 2018

Could probably just have one try/catch statement wrapping the whole function, though, at least...

@pditommaso (Member):

Yep, but then we won't be able to have specific error messages anymore.

@ewels (Member, Author) commented Mar 27, 2018

You could check the type of obj to customise it.. ;) But it may end up not saving many lines of code..

@pditommaso (Member):

lol

@ewels (Member, Author) commented Mar 28, 2018

I can't resist a challenge 😆 How about this?

def check_max(obj, max) {
    def obj_types = [
        'memory': nextflow.util.MemoryUnit,
        'time': nextflow.util.Duration,
        'cpus': Integer
    ]
    // plain for loop: a 'return' inside .each would only exit the closure;
    // also, instanceof/as need a class literal, so use isInstance()/asType()
    for( entry in obj_types ) {
        if( !entry.value.isInstance(obj) )
            continue
        try {
            def other = max.asType(entry.value)
            return obj.compareTo(other) == 1 ? other : obj
        }
        catch( all ) {
            println "   ### ERROR ###   Max ${entry.key} '${max}' is not valid! Using default value: $obj"
            return obj
        }
    }
    println "   ### ERROR ###   Object type not recognised! Using default value: $obj"
    return obj
}

@pditommaso (Member):

if it works, it's good :)

@ewels (Member, Author) commented Mar 29, 2018

Haven't tested it yet.. ;)

Copying my main request back down here as we got a bit carried away by function refactoring 😁

process {
    notEnoughResources: 'terminate' // default behaviour
    notEnoughResources: 'limit'     // throttle to available resources
    notEnoughResources: {           // limit to specified values
        cpus: 4
        memory: 8.GB
        time: 16.h
    }
}

@pditommaso (Member):

I was thinking about this, however I'm not understanding how the resources are modified. Do you still need to provide a dynamic (i.e. closure) rule to increase the request when the task is re-submitted, or were you thinking of something different?

@ewels (Member, Author) commented Apr 9, 2018

Not sure I totally understand the question. But I was thinking that the syntax could remain the same as it is currently: static values would be limited (modified) at runtime by Nextflow if needed to make them smaller; dynamic rules could be used with task.attempt etc. and then limited if needed.

@pditommaso (Member):

I mean, basically your proposal is to provide a mechanism that allows setting max limits for some resources.

Whenever the resource request overcomes this limit (whether statically or dynamically defined), it falls back to this limit.

Is that right?

@ewels (Member, Author) commented Apr 9, 2018

Yup!

@ewels (Member, Author) commented Oct 2, 2018

Proposed syntax after discussion, with a default value and also a range (minimum and maximum values).

process {
    withName:fastqc {
        memory: { 1.GB * input.size(), range: 8.GB..24.GB },
        time: { 2.4.h * input.size(), range: 16.h..48.h },
        cpus: { 2  * input.size(), range: 4..16 }
    }
}

The input.size() dynamic computation here is just an example of when we could have a variable default. Here the maximum of the range would prevent the requested resources from going above a sensible default.

Another scenario is when people want to limit the requested resources:

process {
    memory: 4.GB
    withName:fastqc {
        memory: 16.GB, range: 8.GB..24.GB
    }
    withName:star {
        memory: 4.GB, range: 2.GB..6.GB
    }
}

Then have nextflow run --max_memory 8.GB which would work - using 8.GB memory for fastqc and 4.GB memory for star.

However, doing nextflow run --max_memory 2.GB would exit with an error because this is outside the range of fastqc.

Phil

@stale bot commented Apr 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Apr 27, 2020

@ewels (Member, Author) commented Apr 27, 2020

This is still probably my number-1 feature request issue 😄 The @nf-core check_max function is a constant thorn in the side, but it's too useful to remove. Would be great if we could move this functionality into Nextflow 👍

@pditommaso (Member):

I agree, we definitely need to do better here; I need to tackle this problem soon. Pinning this issue.

@ewels (Member, Author) commented May 27, 2022

Are the additional attributes (sorry, I forget the name in Groovy) inherited separately from the main value? For example, would my process 'foo' here also have a max of 10.GB?

process.memory = 3.GB, max: 10.GB
process.withName['foo'].memory = 20.GB

@bentsherman (Member):

I don't plan to implement that kind of behavior at the moment. Are there any other directives that can inherit individual attributes like that? For example I don't think publishDir does that.

@ewels (Member, Author) commented May 27, 2022

Not that I know of. Would it be acceptable to use a distinct namespace instead to achieve this? e.g. something along the lines of my original suggestion way up there 👆🏻

#640 (comment)

process {
    notEnoughResources: 'terminate' // default behaviour
    notEnoughResources: 'limit'     // throttle to available resources
    notEnoughResources: {           // limit to specified values
        cpus: 4
        memory: 8.GB
        time: 16.h
    }
}

Then it can be defined once to work across an entire workflow (desired behaviour 99% of the time I suspect) but also at process level if more granular control is needed.

@bentsherman (Member) commented May 27, 2022

I like this idea. No need to refactor the resource directives, Nextflow just needs to apply these limits before submitting a task. I would make it even simpler:

process {
    resourceLimits = null   // default behaviour
    resourceLimits = [      // limit to specified values
        cpus: 4,
        memory: 8.GB,
        time: 16.h
    ]
}

@ewels (Member, Author) commented May 27, 2022

Yeah, I think the 'limit' / throttle-to-available-resources thing was me being hopeful about Nextflow auto-detecting available cpus, memory etc. But that's probably super problematic, as many people will be running Nextflow on a different node from where tasks execute. So yeah, let's forget about that (could probably do it with closures anyway..?)

So I'm fully behind your suggested syntax 👍🏻 (was [] instead of {} intentional?)

@pditommaso (Member):

This is interesting, but I don't fully understand the semantics of this. Is it meant to throttle the resources on retry?

@bentsherman (Member):

Yes. If you specify the resource limits, Nextflow will throttle any resource requests that exceed the limits. If you don't specify a limit, then such a request would fail.

However it just occurred to me that on some systems, an unschedulable task will just sit in the queue forever, so we might want a way to specify whether Nextflow should throttle or fail, like the example Phil gave.

bentsherman self-assigned this Jul 12, 2022

@aydemiro commented Oct 26, 2023

Hi,

Would the code below in a config file currently work as an alternative to defining a check_max function?

params.max_memory = 32.GB
memory = { min( 6.GB * task.attempt, params.max_memory ) }

Best,
Ozkan

@bentsherman (Member):

That should work. I believe check_max() just adds a bunch of extra logic for type checking and error messages, but your example implements the core logic.
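
For the record, a variant using Groovy's Collection.min() would look like this (a sketch; the explicit cast is defensive, keeping the comparison unit-aware if max_memory arrives as a string from the command line; min() works because MemoryUnit is Comparable):

params.max_memory = 32.GB
memory = { [ 6.GB * task.attempt, params.max_memory as nextflow.util.MemoryUnit ].min() }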

@pditommaso (Member):

@bentsherman If I'm understanding correctly #2911 does not implement any throttling, and it's a mere config attribute that should be managed by the user via a dynamic rule (closure). Is that correct?

@bentsherman (Member):

It does throttle the task resources, see TaskConfig

@pditommaso (Member):

Maybe we are using throttling with different meanings. Quoting the docs:

Resource limits are a useful way to specify environment-specific limits alongside tasks with dynamic resources. Normally, if a task requests more resources than can be provisioned (e.g. a task requests 32 cores but the largest node in the cluster has 24), the task will either fail or cause the pipeline to hang forever as it will never be scheduled. If the resourceLimits directive is defined with these limits, the task resources will be automatically reduced to comply with these limits before the job is submitted.

I understand this as a constraint mechanism. Instead, I was expecting throttling as a mechanism to automatically increase the resources depending on the retry.

@bentsherman (Member):

It is a constraint mechanism. You could also see it as throttling, though a different form from what you are talking about.

@pditommaso (Member):

What is preventing adding a third parameter to the check_max function to specify the upper bound? This can be done now and does not require any support at the Nextflow level.

Nextflow should aim instead to provide a built-in replacement for the check_max thing.

@ewels (Member, Author) commented Mar 11, 2024

Nextflow should aim instead to provide a built-in replacement for the check_max thing.

That's what this issue and PR #2911 are about... I'd like to remove that config function entirely if possible.

@pditommaso (Member):

Not really; #2911 adds a directive to allow specifying the upper bound. check_max is meant to stay alongside it.

@bentsherman (Member):

Paolo, I think you are misunderstanding. #2911 adds a directive to replace check_max, so that nf-core pipelines are not defining functions in the config file. The upper bound is provided by params.max_cpus, params.max_memory, etc.

@pditommaso (Member):

Can you show me an example of how a process is expected to be written using #2911 to automatically increase the memory on a failure?

@bentsherman (Member) commented Mar 11, 2024

Just the normal dynamic resource pattern:

process {
    memory = { 8.GB * task.attempt }
    resourceLimits = [
        memory: 64.GB
    ]
}

@pditommaso (Member):

How is the increment determined? I don't see anything obvious in the PR.

@bentsherman (Member):

The increment is independent of the resource limit. The example I just gave means "increase the task memory by 8 GB on each attempt until the system-wide limit of 64 GB is reached"
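
Spelled out with the retry pieces included (a sketch; the errorStrategy/maxRetries settings are the usual dynamic-retry pattern, not something added by #2911 itself):

process {
    errorStrategy  = 'retry'                  // resubmit failed tasks
    maxRetries     = 3
    memory         = { 8.GB * task.attempt }  // 8 GB, 16 GB, 24 GB, 32 GB
    resourceLimits = [ memory: 64.GB ]        // requests above this are capped
}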

@ewels (Member, Author) commented Mar 12, 2024

Edit: I had an old browser tab open and didn't see the more recent replies, sorry. I wrote the below in response to this comment.


Gotcha, sorry it took @bentsherman explaining your comment to me.

I'd prefer a global-level setting rather than continued use of the check_max function because it's less verbose. Currently the function has to be called in every single resource specification (see here for an example 🤢). We can clean up that syntax with the upper bound, but we'd still need to call the function every time.

I don't think that there's ever a time when you don't want these limits applied, as they're there to prevent Nextflow from crashing due to unsatisfiable requests.

There's also no time when you want the limits to vary at the process level. They are there to meet computational environment limits, so they don't vary by process. If you want to change that, you do it with the regular process directives.

So because of this, to me it makes more sense to have a workflow-level setting vs. a function that can be used within multiple process-level directives.

@ewels (Member, Author) commented Mar 12, 2024

Basically instead of this:

process {
    withName:'MYPROCESS_1' {
        memory = { check_max( 72.GB * task.attempt, 'memory'  ) }
        time   = { check_max( 24.h  * task.attempt, 'time'    ) }
        cpus   = { check_max( 4 * task.attempt, 'cpus'  ) }
    }
    withName:'MYPROCESS_2' {
        memory = { check_max( 72.GB, 'memory'  ) }
        time   = { check_max( 24.h, 'time'    ) }
        cpus   = { check_max( 4, 'cpus'  ) }
    }
    withName:'MYPROCESS_3' {
        memory = { check_max( 72.GB * task.attempt, 'memory'  ) }
        time   = { check_max( 24.h  * task.attempt, 'time'    ) }
        cpus   = { check_max( 4 * task.attempt, 'cpus'  ) }
    }
}

I'd much rather have this:

process {
    resourceLimits = [ memory: 64.GB, time: 24.h, cpus: 12 ]
    withName:'MYPROCESS_1' {
        memory = { 72.GB * task.attempt }
        time   = { 24.h  * task.attempt }
        cpus   = { 4 * task.attempt }
    }
    withName:'MYPROCESS_2' {
        memory = 72.GB
        time   = 24.h
        cpus   = 4
    }
    withName:'MYPROCESS_3' {
        memory = { 72.GB * task.attempt }
        time   = { 24.h  * task.attempt }
        cpus   = { 4 * task.attempt }
    }
}

Especially as the resourceLimits statement will nearly always be in a separate config file.
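
For example (a sketch; the file name is hypothetical, and this could equally be an institutional profile supplied with -c):

// conf/my_cluster.config
process {
    resourceLimits = [ cpus: 12, memory: 64.GB, time: 24.h ]
}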
