Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check the output in light of Reece's comments #1

Open
julesjacobsen opened this issue Jul 19, 2017 · 4 comments
Open

Check the output in light of Reece's comments #1

julesjacobsen opened this issue Jul 19, 2017 · 4 comments

Comments

@julesjacobsen
Copy link
Owner

ga4gh/vrs#11 (comment)

@reece
Copy link

reece commented Jul 31, 2017

Could I also invite you to do a review of my implementation - we need a JVM compatible library for this

Hi Jules-

The basic structure of Model.kt is understandable and looks right to me. So far, so good I think.

You have a comment "How's this supposed to work?". I think you're referring to the issue that in order to generate a Computed Identifier for a Location, one needs the CI for the referenced sequence. And, since VMC doesn't specify sequences, where do those come from? Right?

To start the answer, see this notebook. You'll see this:

ir = models.Identifier(namespace="NCBI", accession="NC_000019.10")
sequence_id = get_vmc_sequence_id(ir)
sequence_id
'VMC:GS_IIB53T8CNeJJdUqzn9V_JnRtQadwWCbl'

Okay, so what should get_vmc_sequence_id() do?

The naive implementation would be to fetch the sequence, run it through the truncated digest, and construct an Identifier. Obviously, that's going to stink for large sequences and/or a lot of variants. But, that's the base implementation.

An improvement on that would be to keep a 1:1 map of VMC ids and external identifiers. That is, cache the lookup. (Reversible is nice so that variants can emitted with familiar sequence ids.)

I use SeqRepo, a package I wrote to efficiently maintain and update a superset of essentially all sequences. (You might be interested in the design doc.) I have ambitions for a REST interface, but, alas, it's just a dream.

Does that answer your question?

And, since you're actually banging your head on this, please give feedback... especially if you see something that's ill-conceived.

@julesjacobsen
Copy link
Owner Author

Thanks for the feedback, especially on the sequence id part. There are a few issues I though of/encountered.

  1. coming from a VCF dominated world the VMC model initially seemed nice and simple, especially regarding samples and genotypes. What would be really hand would be for more examples to be made available to demonstrate the equivalent of a multi-sample VCF with different genotypes for an allele at a given position. For example:
##fileformat=VCFv4.2
##contig=<ID=6,assembly=b37,length=171115067>
##fileDate=20150218
##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA11111	NA22222	
  6	111858738	rs369794289	T	TTCAAAACTTCTTTAAGTAA	100	PASS	AC=2;AF=0.893371;AFR_AF=0.9236;AMR_AF=0.8285;AN=2;DP=19723;EAS_AF=0.999;EUR_AF=0.7684;NS=2504;SAS_AF=0.9182;VT=INDEL	GT	1|1	0/1
  1. Back to the sequences - will there be a definition on how to trim a variant? Practically everyone seems to have their own preferred way of representing an indel (1 vs 0 based), empty character, -, single-base, left or right trim first, genomic vs transcript coordinates. These all produce different outputs.

  2. Storing associations by reference makes streaming large files really hard as they need to store the whole file in memory to build up a graph of the associations before you can work with the objects. Keeping the information inline, as the VCF does makes parsing trivial - it could be done automatically by Jackson or directly using the Python json lib. Have you tried scaling this up to genome size?

@julesjacobsen
Copy link
Owner Author

Adding to the above, have you tried doing a round-trip from VCF->VMC->VCF on a genome?

@reece
Copy link

reece commented Aug 8, 2017

See/follow ga4gh/vrs#4 for vcf->vmc. I hadn't thought about vmc->vcf, but that's an interesting idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants