Add a varcode-powered worker for annotating VCF files #750

armish · 2015-06-17T22:54:02Z

This new worker annotates each variant in a VCF file with the top priority effect properties:

Gene name (Official gene symbol)
Transcript ID (Ensembl ID for the transcript)
Protein-level change (e.g. V600E)
Type of the variant (e.g. Substitution, Splice Site, Intronic, ...)

This worker runs when a user submits a new VCF file (i.e. when a new run is created).

Once the worker completes its annotation, the new information on each variant is shown under the Annotations column group with the Varcode prefix.
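As a rough illustration of the four fields listed above (not the worker's actual code), here is a minimal sketch that pulls them off a varcode-style effect object. The attribute names `gene_name`, `transcript_id` and `short_description`, the dictionary keys, and the `Substitution` stand-in class are assumptions made for this example; the transcript ID is a placeholder, not a real Ensembl ID.

```python
def annotate_with_top_effect(effect):
    """Collect the four annotation fields from a varcode-style effect object.

    Assumption: `effect` exposes `gene_name`, `transcript_id` and
    `short_description`; any attribute that is missing is reported as None.
    """
    return {
        'gene_name': getattr(effect, 'gene_name', None),
        'transcript': getattr(effect, 'transcript_id', None),
        'effect_notation': getattr(effect, 'short_description', None),
        # The effect's class name doubles as the variant type.
        'effect_type': type(effect).__name__,
    }


class Substitution(object):
    """Hypothetical stand-in for an effect object, used only in this example."""
    gene_name = 'BRAF'
    transcript_id = 'ENST_EXAMPLE'  # placeholder, not a real Ensembl ID
    short_description = 'p.V600E'


print(annotate_with_top_effect(Substitution()))
```

Using `getattr` with a default keeps the sketch tolerant of effects that have no gene or transcript attached, which would then simply leave those columns empty.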

ihodes · 2015-06-17T22:55:57Z

cycledash/genotypes.py

@@ -397,14 +420,6 @@ def _header_spec(vcf_header_text, extant_cols):
                'Names of genes that overlap with this variant\'s '
                'starting position, derived from Ensembl Release 75.'))

-    # Remove empty supercolumns


Did something change to obsolete this?

No, actually not -- this is just an artifact of the diff. These lines are still present a little bit above, but somehow this diff view thinks that they were deleted and added back.

ihodes · 2015-06-17T23:01:19Z

Awesome to have this coming in! Thanks for the work, @armish.

I do think we can drop the gene_annotator; there's no reason to have both now, AFAIK. With that, I'd drop the varcode_ prefix from the column names as well.

Could you also add the migration code required to get the database up to date with the new schema in the PR text? We do this to make deployment slightly easier. It's definitely not the right solution… but it's something :)

Ah, and this will also require updating the test database used for generating screenshots, and generating new screenshots for the examine page. Cf. the tests/pdifftests directory for more info.

ihodes · 2015-06-17T23:03:29Z

cycledash/genotypes.py

@@ -383,8 +383,31 @@ def _header_spec(vcf_header_text, extant_cols):
        path=['sample_name']) # This path is not the default super -> sub column

    # Add Cycledash-derived columns
-    column_name = 'annotations:gene_names'
-    if column_name in extant_cols:
+    _add_extant_column_to_spec(extant_cols, 'annotations:gene_names', res, 'String', 1,


I can't tell if it's just some funky GitHub formatting, but some of these lines around here may be longer than 80 chars; mind chopping them down to size?

Yeah, will definitely do.

armish · 2015-07-01T19:34:18Z

@ihodes: and here is the migration code:

ALTER TABLE genotypes
    DROP COLUMN "annotations:gene_names",
    ADD "annotations:gene_name" TEXT,
    ADD "annotations:transcript" TEXT,
    ADD "annotations:effect_notation" TEXT,
    ADD "annotations:effect_type" TEXT;

that is, if Travis gives us the green light :)

ihodes · 2015-07-01T19:36:34Z

schema.sql

@@ -69,7 +69,11 @@ CREATE TABLE genotypes (
       quality TEXT,

       -- Cycledash-derived data
-       "annotations:gene_names" TEXT,
+       -- These are from the varcode annotation worker
+       "annotations:gene_name" TEXT,


If there are multiple genes, do they still get displayed? (If so, this should probably still be "annotations:gene_names")

Nope -- this annotation only picks an effect in a single gene, so we won't have multiple gene annotations any more.

Hmm—is there a way to get that back? That seems like a useful feature /cc @iskandr @tavinathanson

@armish You only have a single gene if you're picking the top priority annotation for each variant.

You could do this to get the top effect for each distinct gene:

    for gene_name, gene_effects in variant.effects().groupby_gene_name().items():
        top_gene_effect = gene_effects.top_priority_effect()

but you'll have to then deal with all sorts of weird non-coding gene types.
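To make the suggestion above concrete, here is a minimal sketch that collects the top effect for each gene into a dictionary. `top_effect_per_gene` uses only the calls quoted in the comment above; the `FakeVariant` and `FakeEffectCollection` classes, the `(gene, label, priority)` tuples, and the gene names are hypothetical stand-ins so the example runs without varcode installed, and they do not reflect how varcode actually ranks effects.

```python
def top_effect_per_gene(variant):
    """Return {gene_name: top-priority effect} for a single variant."""
    top_effects = {}
    grouped = variant.effects().groupby_gene_name()
    for gene_name, gene_effects in grouped.items():
        top_effects[gene_name] = gene_effects.top_priority_effect()
    return top_effects


class FakeEffectCollection(object):
    """Hypothetical stand-in; each effect is a (gene, label, priority) tuple."""

    def __init__(self, effects):
        self.effects = list(effects)

    def groupby_gene_name(self):
        groups = {}
        for effect in self.effects:
            groups.setdefault(effect[0], []).append(effect)
        return {gene: FakeEffectCollection(group)
                for gene, group in groups.items()}

    def top_priority_effect(self):
        # Assumption for this sketch: a larger number means higher priority.
        return max(self.effects, key=lambda effect: effect[2])


class FakeVariant(object):
    """Hypothetical stand-in exposing only the effects() method."""

    def __init__(self, effects):
        self._effects = FakeEffectCollection(effects)

    def effects(self):
        return self._effects


variant = FakeVariant([
    ('GENE_A', 'Substitution', 3),
    ('GENE_A', 'Intronic', 1),
    ('GENE_B', 'Intronic', 1),
])
for gene, effect in sorted(top_effect_per_gene(variant).items()):
    print('%s: %s' % (gene, effect[1]))
```

A dictionary keyed by gene name would still need some rule for flattening into a single `TEXT` column, which is part of what the single-gene versus multi-gene discussion here is about.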

From a programmatic point of view, there is no reason not to include multiple gene names; but putting my biologist hat on, I think we should only show a single gene to the users. This might be a bold step forward, but I opt for the single-gene-per-variant route.

Happy to take this discussion off-line ;)

I don't have a very strong opinion about this, but could you file this in an issue to resolve out of band with this PR?

ihodes · 2015-07-01T20:01:58Z

This is looking good! We still need these to show up in our screenshot tests as well, cf.

this will also require updating the test database used for generating screenshots, and generating new screenshots for the examine page. Cf. the tests/pdifftests directory for more info.

armish · 2015-07-01T21:23:24Z

phew! It took me a few iterations to get it right, but here we are with the updated screenshots and a successful build \o/

ihodes · 2015-07-01T21:34:09Z

This looks great! Thanks for putting it together—will be a great addition to Cycledash.

ihodes reviewed Jun 17, 2015

armish mentioned this pull request Jun 17, 2015

Varcode annotator should show alternate effect for "Exonic Splice Site" changes #751

Open

ihodes reviewed Jun 17, 2015

armish added 11 commits July 1, 2015 12:12

add support for varcode-based annotation

c611a4b

tweak annotators for better chaining support

3a35d5f

start varcode annotator together with other workers

ea40834

show varcode annotations on the examine page

7afe7eb

fix function name problem within varcode annotator

59b81b3

fix the clean up and assert procedure for varcode worker

cec0c04

rename gene_names variable to annotations for clarity

6099ec6

drop the varcode prefix from the annotations

bf990e4

improve coding style and make it consistent w/ the rest

2b3702f

add missing res argument to the method call within genotypes

c37f717

ditch gene_annotator and remove gene annotation redundancy

b91e308

armish force-pushed the varcode-worker branch from 955ae4a to b91e308 Compare July 1, 2015 18:55

get rid of Travis specific imports within varcode annotator

e544a28

ihodes reviewed Jul 1, 2015

improve the worker stucture so that it can be externally called

3aff479

armish added 2 commits July 1, 2015 16:53

rename the gene_names column within test data for compatibility w/ new schema

5f37c82

update examine page screenshots (change: new annotations columns)

4ee41e8

ihodes added a commit that referenced this pull request Jul 1, 2015

Merge pull request #750 from hammerlab/varcode-worker

cc9d8c0

Add a varcode-powered worker for annotating VCF files

ihodes merged commit cc9d8c0 into master Jul 1, 2015

armish deleted the varcode-worker branch July 1, 2015 21:49

armish mentioned this pull request Jul 6, 2015

Implement a new worker for Cycledash that will annotate Protein-level changes for a given variant/VCF file #699

Closed

armish mentioned this pull request Jul 20, 2015

Add more gene annotations #447

Open
