Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] allow setting a non-builtin reference genome as the default #13856

Closed
danking opened this issue Oct 19, 2023 · 2 comments · Fixed by #13987
Closed

[query] allow setting a non-builtin reference genome as the default #13856

danking opened this issue Oct 19, 2023 · 2 comments · Fixed by #13987
Assignees

Comments

@danking
Copy link
Contributor

danking commented Oct 19, 2023

What happened?

Suppose you're working with the Wheat genome. The following is seemingly correct code but it doesn't work:

import hail as hl

rgwheat = hl.ReferenceGenome('Wheat', ...)

hl.init(default_reference=rgwheat)

The first problem is that the @typecheck on hl.init, hl.init_spark, etc. only allows a built-in reference genome.

Even if we relax that requirement, we encounter a deeper problem: creating the reference genome initializes Hail. In particular, we call Env.backend() (which calls Env.hc(), which forces initialization) so that we can call add_reference.

What does initialization mean? Historically, it meant connection to or starting a JVM/Spark process. In QoB/ServiceBackend, initialization just loads configurations, it doesn't really do anything irreversible. Regardless of what it does, we only allow initialization once.

OK, so, there's two possible routes to fix this problem:

  1. Rewrite ReferenceGenome.__init__ such that it does not initialize Hail. You have to decide how reference genomes are ultimately communicated to the backend. Do you hang a list of all created reference genomes off of the ReferenceGenome class? Do you require explicit registering a la hl.register_reference? The latter seems a bit silly. The former seems OK, but you could also ...
  2. Allow modification of the default reference after initialization. The default reference genome is just a field on the HailContext: _default_ref which is accessed through hl.default_reference(). Just modify hl.default_reference to return the reference with no arguments and set the reference with one argument. Now this works:
import hail as hl
rgwheat = hl.ReferenceGenome('Wheat', ...)
hl.default_reference(rgwheat)
mt = hl.import_vcf('wheat.vcf')

Version

0.2.124

Relevant log output

No response

@danking
Copy link
Contributor Author

danking commented Oct 19, 2023

We should add docs that describe how to do this to:

  1. hl.default_reference, obviously
  2. Deprecate the reference_genome parameter to hl.init and instruct users to use hl.default_reference. Inform that this parameter has confusing interactions with ReferenceGenome, so we're removing it.
  3. hl.ReferenceGenome.__init__ should refer users to that.

I think we should also make a separate PR that improves the hl.import_vcf error message. If the backend throws an error like

HailException: Invalid locus '1:249367215' found. Position '249367215' is not within the range [1-249250621] for reference genome 'GRCh37'.

import_vcf should catch and wrap with another exception that suggests you use a reference_genome parameter or hl.default_reference.

@danking danking added the query label Oct 23, 2023
@chrisvittal chrisvittal self-assigned this Oct 23, 2023
@chrisvittal
Copy link
Collaborator

chrisvittal commented Oct 26, 2023

This has been partially resolved with #13888

What remains is:

danking added a commit that referenced this issue Nov 27, 2023
CHANGELOG: Deprecate default_reference parameter to hl.init, users
should use `default_reference` with an argument to set new default
references usually shortly after init.

Resolves #13856

---------

Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants