How to use custom objects and have tasks cached #1934

nh13 · 2021-02-27T01:26:39Z

Sometimes it's useful to have a container class for metadata that I can pass around in channels along with the actual paths. For example,

import java.nio.file.Path

/** Container class for sample-specific metadata */
@groovy.transform.ImmutableBase
@groovy.transform.TupleConstructor
class SampleMetadata {
    /** The sample name */
    String name
    /** The path to the read one FASTQ */
    Path r1
    /** The path to the read two FASTQ */
    Path r2

    boolean equals(Object o) {
        if (this.is(o)) return true
        if (!o || getClass() != o.class) return false
        SampleMetadata that = (SampleMetadata) o
        if (name ? !name.equals(that.name) : that.name != null) return false
        if (r1 ? !r1.equals(that.r1) : that.r1 != null) return false
        if (r2 ? !r2.equals(that.r2) : that.r2 != null) return false
        return true
    }

    int hashCode() {
        int result = (this.name ? this.name.hashCode() : 0)
        result = 31 * result + (this.r1 ? this.r1.hashCode() : 0)
        result = 31 * result + (this.r2 ? this.r2.hashCode() : 0)
        return result
    }
}

If I use such a class, none of the tasks that use it are cacheable. What do I have to do to make them cacheable? I couldn't find information about this in the docs.

pditommaso · 2021-03-01T12:56:09Z

Unfortunately, serialisation of custom objects does not currently work properly. NF uses Kryo under the hood for better performance, and Kryo requires some custom code to register the object serialiser. This needs to be improved.

nh13 · 2021-03-01T14:50:05Z

Happy to help improve it if you can point me to the right place, as I think this type of pattern will be common for me going forward, much like named tuples and attrs/data classes in Python.

pditommaso · 2021-03-01T15:00:18Z

The relevant section is this

nextflow/modules/nextflow/src/main/groovy/nextflow/util/SerializationHelper.groovy

Lines 206 to 219 in d959bfd

    
class DefaultSerializers implements SerializerRegistrant {

    static private Class<Path> PATH_CLASS = (Class<Path>)Paths.get('.').class

    @Override
    void register(Map<Class, Object> serializers) {
        serializers.put( PATH_CLASS, PathSerializer )
        serializers.put( URL, URLSerializer )
        serializers.put( UUID, UUIDSerializer )
        serializers.put( File, FileSerializer )
        serializers.put( Pattern, PatternSerializer )
        serializers.put( ArrayTuple, ArrayTupleSerializer )
    }
}

I think we could introduce a CustomSerializer interface that extends the standard Java Serializer, plus a CustomSerializerImpl that serialises any class implementing CustomSerializer via the standard Java serialisation mechanism.

User classes would then just implement that interface.

drpatelh · 2021-03-01T15:28:59Z

Would something like this help @nh13 ?

We have also reverted to passing all of the sample metadata around in a Groovy Map, however, the file path elements change dynamically through the workflow and are staged through the standard Nextflow mechanism which keeps the -resume functionality intact.

The meta map itself is initiated from the information in the input samplesheet to the pipeline right at the very beginning of the workflow as you can see here.

nh13 · 2021-03-01T15:35:34Z

Thanks @drpatelh but I think I’d prefer custom objects where I can centralize methods that return paths or the like that are based off the metadata. I’m not sure about your implementation, but I want to avoid storing data in an unstructured way or anything similar to Python dictionaries. I want named members and the like in one place for clarity.

drpatelh · 2021-03-01T15:46:03Z

Fair enough. I am using named members too, which are instantiated at the module level like here, but your approach is probably a better one. In fact, it may be able to replace our current functionality, which is mostly a workaround, so I'd be interested to see how your final implementation looks :)

Edited: My bad, wrong link above but similar concept for passing around module options - we don't actually instantiate sample meta anywhere in the proper way but rely on it being evaluated to either false / null when used in modules.

pditommaso · 2021-03-01T22:45:13Z

OK, I have a patch for this. I'll upload it in the coming days.

pditommaso · 2021-03-02T19:51:10Z

Made some progress on this 👉 40a66ac

Essentially, it requires the use of an annotation, @ValueObject, to mark an object that can safely be serialised/deserialised, e.g.

@ValueObject
class MyData {
  String foo 
  String bar
}

process foo {
  input:
  val x from ( new MyData(foo:'one',bar:'two') )
  output: 
  file 'x.txt'
  script:
  """
  echo "$x.foo $x.bar" > x.txt
  """
}

It also automatically implements the equals and hashCode boilerplate code. I was also thinking of making it immutable, which would be useful for passing parameters around. Thoughts?
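For readers who want to see what generated equals and hashCode look like in practice, here is a minimal plain-Groovy sketch. It uses Groovy's own `@EqualsAndHashCode` as a stand-in, on the assumption that `@ValueObject` gives comparable value semantics; it does not use Nextflow and does nothing about Kryo serialisation or task caching. The class mirrors the `SampleMetadata` example from the first post, and the FASTQ file names are hypothetical.

```groovy
import groovy.transform.EqualsAndHashCode
import java.nio.file.Path
import java.nio.file.Paths

// Stand-in for @ValueObject: generates equals() and hashCode() from the properties,
// replacing the hand-written boilerplate shown in the first post.
@EqualsAndHashCode
class SampleMetadata {
    String name
    Path r1
    Path r2
}

// Hypothetical file names, used only for illustration.
def a = new SampleMetadata(name: 's1', r1: Paths.get('s1_R1.fastq.gz'), r2: Paths.get('s1_R2.fastq.gz'))
def b = new SampleMetadata(name: 's1', r1: Paths.get('s1_R1.fastq.gz'), r2: Paths.get('s1_R2.fastq.gz'))
def c = new SampleMetadata(name: 's2', r1: Paths.get('s2_R1.fastq.gz'), r2: Paths.get('s2_R2.fastq.gz'))

// Two instances with the same field values compare equal and hash alike.
assert a == b
assert a.hashCode() == b.hashCode()
assert a != c

// Equal instances collapse to a single element in a Set.
def unique = [a, b, c] as Set
assert unique.size() == 2
```

Value equality on its own is not enough for resume to work; as noted earlier in the thread, the object also has to be serialisable by Kryo.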

nh13 · 2021-03-02T20:00:51Z

@pditommaso immutability would be very nice. It gets us a lot closer to case classes like in Scala.

pditommaso · 2021-03-02T20:15:03Z

Let's do it, immutable and autoclonable 8dfaadf

pditommaso · 2021-03-05T15:52:32Z

This has been included in version 21.03.0-edge. In a nutshell, annotate your class with @ValueObject, e.g.

@ValueObject
class MyData {
  String foo
  String bar
}

The class is automatically made serializable, immutable and cloneable. Its attributes therefore cannot be modified. The object needs to be created using the named-parameter constructor, e.g.

def data = new MyData(foo:'this', bar:'that')

To modify one or more attributes, a copy needs to be created, as shown below:

def another = data.copyWith(foo:'Hello')
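The copy-on-modify behaviour described above can be tried outside Nextflow with a small plain-Groovy sketch. It assumes that `@ValueObject` behaves like Groovy's `@Immutable(copyWith = true)`, which is used here as a stand-in; that annotation appears alongside `@ValueObject` later in this thread, but the equivalence is an assumption and not something the Nextflow source is quoted as confirming.

```groovy
import groovy.transform.Immutable

// Stand-in for @ValueObject (assumption): immutable, with a generated copyWith method.
@Immutable(copyWith = true)
class MyData {
    String foo
    String bar
}

// Instances are created with the named-parameter constructor.
def data = new MyData(foo: 'this', bar: 'that')

// copyWith returns a new instance; the original is left untouched.
def another = data.copyWith(foo: 'Hello')
assert data.foo == 'this'
assert another.foo == 'Hello'
assert another.bar == 'that'

// Direct assignment to a property is rejected.
def rejected = false
try {
    data.foo = 'changed'
} catch (ReadOnlyPropertyException ignored) {
    rejected = true
}
assert rejected
```

Because every "modification" yields a new object, a value passed through a channel cannot be changed behind the back of another process that holds the same instance.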

sstadick · 2021-04-16T21:32:15Z

This appears to not work when defining the class in a .groovy file loaded via the -lib option. It does work when the class is defined in a .nf file though. Is that intended?

Running off of edge I get the following error when defining a class SampleData in lib/SampleData.groovy:

❯ nextflow run -lib lib main.nf --test_csv ./test.csv 
N E X T F L O W  ~  version 21.04.0-edge
Launching `main.nf` [agitated_raman] - revision: 45305a9c69
BUG! exception in phase 'semantic analysis' in source unit 'Script_0e9b4190' The lookup for SampleData caused a failed compilation. There should not have been any compilation from this call.

main.nf

nextflow.enable.dsl=2

workflow {
	Channel.fromPath(params.test_csv).splitCsv(header: true).map { row -> new SampleData(name: row.name) }.view()
}

lib/SampleData.groovy

import nextflow.io.ValueObject

@ValueObject
class SampleData {
	String name
}

test.csv

name
Bob

However, the following works.

nextflow.enable.dsl=2

@ValueObject
class SampleData {
	String name
}

workflow {
	Channel.fromPath(params.test_csv).splitCsv(header: true).map { row -> new SampleData(name: row.name) }.view()
}

pditommaso · 2021-04-17T07:26:09Z

Since it's a plain Groovy file, add import nextflow.io.ValueObject to the library file.

sstadick · 2021-04-17T15:19:31Z

~~Even with import nextflow.io.ValueObject I get the same error. I updated the example above with that as well.~~

sstadick · 2021-04-21T15:41:30Z

The above works with import nextflow.io.ValueObject. I'm not sure what I was doing wrong 4 days ago, but it works now with the code I pasted above.

kaizhang · 2022-02-02T17:32:14Z

The code below still doesn't work:

@ValueObject
class MyData {
  String foo 
  String bar
}

process foo {
  input:
  val x from ( new MyData(foo:'one',bar:'two') )
  output: 
  val(x) into foo_ch
  script:
  """
  echo "$x.foo $x.bar" > x.txt
  """
}

process bar {
  input:
  val(x) from foo_ch
  output: 
  file 'y.txt'
  script:
  """
  echo "$x.foo $x.bar" > y.txt
  """
}

bar is cached but foo is not.

nextflow run test.nf -resume

N E X T F L O W  ~  version 21.12.1-edge
Launching `test.nf` [high_coulomb] - revision: f5256072f4
executor >  local (1)
[b2/bd33b6] process > foo [100%] 1 of 1 ✔
[a7/0f33c1] process > bar [100%] 1 of 1, cached: 1 ✔
WARN: [foo] Unable to resume cached task -- See log file for details

The log says:

Feb-02 09:28:50.597 [Actor Thread 4] WARN  nextflow.processor.TaskProcessor - [foo] Unable to resume cached task -- See log file for details
com.esotericsoftware.kryo.KryoException: Unable to find class: MyData

nh13 · 2022-02-02T18:38:29Z

@kaizhang see this solution: #2085 (comment)

import groovy.transform.Immutable
import nextflow.io.ValueObject
import nextflow.util.KryoHelper

@ValueObject
@Immutable(copyWith=true, knownImmutables = ['foo', 'bar'])
class MyData {
    static { 
        // Register this class with the Kryo framework that serializes and deserializes objects
        // that pass through channels. This allows for caching when this object is used.
        KryoHelper.register(MyData)
    }
  String foo 
  String bar
}

kaizhang · 2022-02-03T03:12:00Z

@nh13 That works, thank you!

pditommaso added the WIP label Mar 1, 2021

pditommaso mentioned this issue Mar 2, 2021

Unable to resume cached task when process tag accesses object fields or methods. #1811

Closed

pditommaso removed the WIP label Mar 5, 2021

pditommaso closed this as completed Mar 5, 2021

nh13 mentioned this issue May 4, 2021

Using custom objects with paths #2085

Open

mahesh-panchal mentioned this issue Feb 22, 2022

Are closures not serialisable? #2659

Closed

edmundmiller mentioned this issue Feb 22, 2022

Use custom objects nf-core/modules#1338

Closed
