Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[compiler] Lazily initialize reference genomes in generated code #10523

Merged
merged 10 commits into from Jul 20, 2021

Conversation

chrisvittal
Copy link
Collaborator

@chrisvittal chrisvittal commented May 25, 2021

Fixes #10682

@chrisvittal
Copy link
Collaborator Author

This is broken in other ways because if you attempt to add a liftover on a worker, it tries to broadcast the filesystem, which tries to grab the HailContext, which then blows up.

@tpoterba
Copy link
Contributor

OK, I think we shouldn't be creating RGs on the workers, we should just be capturing references to broadcasted values. In practice, this will look a lot like mb.getObject, I think.

@chrisvittal
Copy link
Collaborator Author

I've basically just restarted this. We need to rethink how we handle liftover and sequence initialization on the workers.

I think it mostly has to do with FS broadcasting in Liftover and FASTAReaderConfig

@chrisvittal
Copy link
Collaborator Author

To summarize from our discussion today:

  1. We want to make the FASTAReaderConfig and Liftover map transient
  2. We'll keep enough information around (like map from name to chain file path, sequence path) such that with a filesystem, the FASTA reader and LiftOvers can be regenerated.
  3. Create a method that synchronizes FASTAReaderConfig and LiftOver state, working name heal, that takes a filesystem and sets everything up.
  4. We'll call this method in the resultWithIndex closure.

@tpoterba
Copy link
Contributor

yes, that's my understanding of our design!

tpoterba and others added 2 commits July 13, 2021 13:06
Reference Genomes now manage liftover and sequences with serializable
metadata fields. The actual fasta reader config and liftover map are
transient. They are set by a new method called `heal` that synchronizes
liftover and sequences with the file paths now present in the class.
This removes the need for codeSetup and similar methods.
@chrisvittal chrisvittal force-pushed the fix-referencegenome-classloading branch from 9abbce3 to f7dcba2 Compare July 13, 2021 19:17
@chrisvittal
Copy link
Collaborator Author

Ok. Seems to work. Going to try cluster tests tomorrow to make sure things are ok in a distributed setting.

The global reference genomes are not usually initialized on workers, so,
in keeping in similar form to the old addLiftoverFromFS and
addSequenceFromReader, heal doesn't check validity of the chain files or
sequence files. The responsibility there is with addSequence and
addLiftover.
@chrisvittal
Copy link
Collaborator Author

test-dataproc passes after latest change

@tpoterba
Copy link
Contributor

🎉

Copy link
Contributor

@tpoterba tpoterba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interfaces look great. One comment about heal.

@@ -434,6 +427,24 @@ case class ReferenceGenome(name: String, contigs: Array[String], lengths: Map[St
lo.queryInterval(interval, minMatch)
}

def heal(tmpdir: String, fs: FS): Unit = this.synchronized {
// add liftovers
liftoverMaps = Map.empty
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it blows away the existing map -- can we make the liftoverMaps mutable and just add chain files that aren't in it already? It looks like if we currently do:

addLiftover(ctx, file2, rg2)
addLiftover(ctx, file3, rg3)
addLiftover(ctx, file4, rg4)

We create the first liftover object 3 times.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a check to make sure we aren't adding the same liftover.

}

// add sequence
fastaReaderCfg = null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also check if the fasta config exists already. I think if we do:

addSequence(...)
addLiftover(ctx, file2, rg2)
addLiftover(ctx, file3, rg3)
addLiftover(ctx, file4, rg4)

The fasta is re-added for each heal in the liftovers!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the fastaReaderCfg is just a case class and everything else going on, it's probably fine to do this, but it's also easy enough to check that we would be creating an equivalent object. done


def hasLiftover(destRGName: String): Boolean = liftoverMaps.contains(destRGName)
def hasLiftover(destRGName: String): Boolean = liftoverMap.contains(destRGName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should look at chain files, not liftoverMap, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they should be equivalent, but done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there was a NPE in a test coming from here, because after deserialization when heal hasn't been run, this is incorrect. Maybe that's a usage we don't care to test though. Regardless, the chain files seems like the correct source of truth though.

@tpoterba
Copy link
Contributor

is this ready for another review? you didn't dismiss the old one

@danking danking merged commit 1c28630 into hail-is:main Jul 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Liftover: NoClassDefFoundError: Could not initialize class __C147RGContainer_GRCh38
3 participants