Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to use custom objects and have tasks cached #1934

Closed
nh13 opened this issue Feb 27, 2021 · 18 comments
Closed

How to use custom objects and have tasks cached #1934

nh13 opened this issue Feb 27, 2021 · 18 comments

Comments

@nh13
Copy link

nh13 commented Feb 27, 2021

Sometimes it's useful to have a container class for metadata that I can pass around in channels along with the actual paths. For example,

import java.nio.file.Path

/** Container class for sample-specific metadata */
@groovy.transform.ImmutableBase
@groovy.transform.TupleConstructor
class SampleMetadata {
    /** The sample name */
    String name
    /** The path to the read one FASTQ */
    Path r1
    /** The path to the read two FASTQ */
    Path r2

    boolean equals(Object o) {
        if (this.is(o)) return true
        if (!o || getClass() != o.class) return false
        SampleMetadata that = (SampleMetadata) o
        if (name? !name.equals(that.name) : that.name!= null) return false
        if (r1? !r1.equals(that.r1) : that.r1!= null) return false
        if (r2? !r2.equals(that.r2) : that.r2!= null) return false
        return true
    }

    int hashCode() {
        int result = (this.name ? this.name.hashCode() : 0)
        result = 31 * result + (this.r1 ? this.r1.hashCode() : 0)
        result = 31 * result + (this.r2 ? this.r2.hashCode() : 0)
        return result
    }
}

If I use such a class, none of tasks are cacheable that use it. What do I have to do to make it cacheable? I couldn't find information about this in the docs.

@pditommaso
Copy link
Member

Unfortunately, now the serialisation with custom objects does not work properly because NF uses Kryo under the hood for better performance that requires some custom code to register the object serialiser. Needs to be improved.

@nh13
Copy link
Author

nh13 commented Mar 1, 2021

Happy to help improve it if you point me to the place as I think this type of pattern will be common for me going forward. Just like named tuples and attrs/data classes in Python.

@pditommaso
Copy link
Member

pditommaso commented Mar 1, 2021

The relevant section is this

class DefaultSerializers implements SerializerRegistrant {
static private Class<Path> PATH_CLASS = (Class<Path>)Paths.get('.').class
@Override
void register(Map<Class, Object> serializers) {
serializers.put( PATH_CLASS, PathSerializer )
serializers.put( URL, URLSerializer )
serializers.put( UUID, UUIDSerializer )
serializers.put( File, FileSerializer )
serializers.put( Pattern, PatternSerializer )
serializers.put( ArrayTuple, ArrayTupleSerializer )
}
}

I think it could be introduced a CustomSerializer interface that extends standard Java Serializer and a CustomSerializerImpl that serialize a generic class implement CustomSerializer via standard java serialization mechanism.

User classes should just implement such class interface.

@drpatelh
Copy link
Contributor

drpatelh commented Mar 1, 2021

Would something like this help @nh13 ?

We have also reverted to passing all of the sample metadata around in a Groovy Map, however, the file path elements change dynamically through the workflow and are staged through the standard Nextflow mechanism which keeps the -resume functionality intact.

The meta map itself is initiated from the information in the input samplesheet to the pipeline right at the very beginning of the workflow as you can see here.

@nh13
Copy link
Author

nh13 commented Mar 1, 2021

Thanks @drpatelh but I think I’d prefer custom objects where I can centralize methods that return paths or the like that are based off the metadata. I’m not sure about your implementation, but I want to avoid storing data in an unstructured way or anything similar to Python dictionaries. I want named members and the like in one place for clarity.

@drpatelh
Copy link
Contributor

drpatelh commented Mar 1, 2021

Fair enough. I am using named members too which are instantiated at the module level like here but your approach is probably a better one. In fact it may be able to replace our current functionality which is mostly a workaround so be interested to see how your final implementation looks :)

Edited: My bad, wrong link above but similar concept for passing around module options - we don't actually instantiate sample meta anywhere in the proper way but rely on it being evaluated to either false / null when used in modules.

@pditommaso
Copy link
Member

Ok, i've a patch for this. I'll upload in the following days.

@pditommaso pditommaso added the WIP label Mar 1, 2021
@pditommaso
Copy link
Member

pditommaso commented Mar 2, 2021

Made some progress on this 👉 40a66ac

Essentially it requires the use of an annotation @ValueObject to mark an object that can safely be serialised/deserialised. eg

@ValueObject
class MyData {
  String foo 
  String bar
}

process foo {
  input:
  val x from ( new MyData(foo:'one',bar:'two') )
  output: 
  file 'x.txt'
  script:
  """
  echo "$x.foo $x.bar" > x.txt
  """
}

it also automatically implements equals and hashCode boilerplate code. I was also thinking to make it immutable that would be useful to pass parameters around. Thought?

@nh13
Copy link
Author

nh13 commented Mar 2, 2021

@pditommaso immutability would be very nice. It gets us a lot closer to case classes like in Scala.

@pditommaso
Copy link
Member

Let's do it, immutable and autoclonable 8dfaadf

@pditommaso
Copy link
Member

pditommaso commented Mar 5, 2021

This has been included in version 21.03.0-edge. In a nutshell annotate your class with @ValueObject eg

@ValueObject
class MyData {
  String foo
  String bar
}

The class is made automatically, serializable, immutable and cleanable. Therefore attributes cannot be modified. The object needs to be created using named parameter constructor eg

def data = new MyDaya(foo:'this', bar:'that')

to modify the one more attributes a copy needs to be created as shown below:

def another = data.copyWith(foo:'Hello')

@sstadick
Copy link

sstadick commented Apr 16, 2021

This appears to not work when defining the class in a .groovy file loaded via the -lib option. It does work when the class is defined in a .nf file though. Is that intended?

Running off of edge I get the following error when defining a class SampleData in lib/SampleData.groovy:

❯ nextflow run -lib lib main.nf --test_csv ./test.csv 
N E X T F L O W  ~  version 21.04.0-edge
Launching `main.nf` [agitated_raman] - revision: 45305a9c69
BUG! exception in phase 'semantic analysis' in source unit 'Script_0e9b4190' The lookup for SampleData caused a failed compilation. There should not have been any compilation from this call.

main.nf

nextflow.enable.dsl=2

workflow {
	Channel.fromPath(params.test_csv).splitCsv(header: true).map { row -> new SampleData(name: row.name) }.view()
}

lib/SampleData.groovy

import nextflow.io.ValueObject

@ValueObject
class SampleData {
	String name
}

test.csv

name
Bob

However, the following works.

nextflow.enable.dsl=2

@ValueObject
class SampleData {
	String name
}

workflow {
	Channel.fromPath(params.test_csv).splitCsv(header: true).map { row -> new SampleData(name: row.name) }.view()
}

@pditommaso
Copy link
Member

Since it's a plain groovy file adds import nextflow.io.ValueObject in the library file.

@sstadick
Copy link

sstadick commented Apr 17, 2021

Even with import nextflow.io.ValueObject I get the same error. I updated the example above with that as well.

@sstadick
Copy link

The above works with import nextflow.io.ValueObject I'm not sure what I was doing wrong 4 days ago, but it works now with the code I pasted above.

@kaizhang
Copy link

kaizhang commented Feb 2, 2022

The codes below still don't work:

@ValueObject
class MyData {
  String foo 
  String bar
}

process foo {
  input:
  val x from ( new MyData(foo:'one',bar:'two') )
  output: 
  val(x) into foo_ch
  script:
  """
  echo "$x.foo $x.bar" > x.txt
  """
}

process bar {
  input:
  val(x) from foo_ch
  output: 
  file 'y.txt'
  script:
  """
  echo "$x.foo $x.bar" > y.txt
  """
}

bar is cached but foo is not.

nextflow run test.nf -resume

N E X T F L O W  ~  version 21.12.1-edge
Launching `test.nf` [high_coulomb] - revision: f5256072f4
executor >  local (1)
[b2/bd33b6] process > foo [100%] 1 of 1 ✔
[a7/0f33c1] process > bar [100%] 1 of 1, cached: 1 ✔
WARN: [foo] Unable to resume cached task -- See log file for details

The log says:

Feb-02 09:28:50.597 [Actor Thread 4] WARN  nextflow.processor.TaskProcessor - [foo] Unable to resume cached task -- See log file for details
com.esotericsoftware.kryo.KryoException: Unable to find class: MyData

@nh13
Copy link
Author

nh13 commented Feb 2, 2022

@kaizhang see this solution: #2085 (comment)

import groovy.transform.Immutable
import nextflow.io.ValueObject
import nextflow.util.KryoHelper

@ValueObject
@Immutable(copyWith=true, knownImmutables = ['foo', 'bar'])
class MyData {
    static { 
        // Register this class with the Kryo framework that serializes and deserializes objects
        // that pass through channels. This allows for caching when this object is used.
        KryoHelper.register(MyData)
    }
  String foo 
  String bar
}

@kaizhang
Copy link

kaizhang commented Feb 3, 2022

@nh13 That works, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants