Add a varcode-powered worker for annotating VCF files #750

Merged Jul 1, 2015 (15 commits)

Conversation

armish
Member

@armish armish commented Jun 17, 2015

This new worker annotates each variant in a VCF file with the top priority effect properties:

  • Gene name (Official gene symbol)
  • Transcript ID (Ensembl ID for the transcript)
  • Protein-level change (e.g. V600E)
  • Type of the variant (e.g. Substitution, Splice Site, Intronic, ...)
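
A minimal sketch of how these four properties might be collected from one effect object. The `Substitution` class here is a hypothetical stand-in, not varcode's real class, and the attribute names `gene_name`, `transcript_id`, and `short_description` are assumptions about what an effect object exposes. The column names are the ones used in the migration later in this thread.

```python
class Substitution(object):
    """Hypothetical stand-in for an effect object (not varcode's class)."""

    def __init__(self, gene_name, transcript_id, short_description):
        self.gene_name = gene_name
        self.transcript_id = transcript_id
        self.short_description = short_description


def annotation_columns(effect):
    """Map one effect-like object to the four annotation columns.

    Attributes that are missing on the effect are stored as None, so an
    effect without a gene or transcript still yields a complete row.
    """
    return {
        'annotations:gene_name': getattr(effect, 'gene_name', None),
        'annotations:transcript': getattr(effect, 'transcript_id', None),
        'annotations:effect_notation': getattr(
            effect, 'short_description', None),
        # Assumption: the effect's class name doubles as the variant type.
        'annotations:effect_type': type(effect).__name__,
    }


# Placeholder values only; GENE_A and TRANSCRIPT_1 are not real identifiers.
effect = Substitution('GENE_A', 'TRANSCRIPT_1', 'V600E')
row = annotation_columns(effect)
```

Using `getattr` with a default keeps the mapping tolerant of effect types that carry no gene or transcript; whether the real worker handles those cases this way is not stated in this PR.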

This worker runs when a user submits a new VCF file (i.e. when a new run is created):
[Screenshot]

Once the worker completes its annotation, the new variant information is shown under the Annotations column group with the Varcode prefix:
[Screenshot]


@@ -397,14 +420,6 @@ def _header_spec(vcf_header_text, extant_cols):
'Names of genes that overlap with this variant\'s '
'starting position, derived from Ensembl Release 75.'))

# Remove empty supercolumns
Member

Did something change to obsolete this?

Member Author

No, actually not -- this is just an artifact of the diff. These lines are still present a little bit above, but somehow this diff view thinks that they were deleted and added back.

@ihodes
Member

ihodes commented Jun 17, 2015

Awesome to have this coming in! Thanks for the work, @armish.

I do think we can drop the gene_annotator; there's no reason to have both now, AFAIK. With that, I'd drop the varcode_ prefix from the column names as well.

Could you also add the migration code required to get the database up to date with the new schema in the PR text? We do this to make deployment slightly easier. It's definitely not the right solution… but it's something :)

Ah, and this will also require updating the test database used for generating screenshots, and generating new screenshots for the examine page. Cf. the tests/pdifftests directory for more info.

@@ -383,8 +383,31 @@ def _header_spec(vcf_header_text, extant_cols):
path=['sample_name']) # This path is not the default super -> sub column

# Add Cycledash-derived columns
column_name = 'annotations:gene_names'
if column_name in extant_cols:
_add_extant_column_to_spec(extant_cols, 'annotations:gene_names', res, 'String', 1,
Member

I can't tell if it's just some funky GitHub formatting, but some of these lines around here may be longer than 80 chars; mind chopping them down to size?

Member Author

Yeah, will definitely do.

@armish
Member Author

armish commented Jul 1, 2015

@ihodes: and here is the migration code:

ALTER TABLE genotypes
    DROP COLUMN "annotations:gene_names",
    ADD "annotations:gene_name" TEXT,
    ADD "annotations:transcript" TEXT,
    ADD "annotations:effect_notation" TEXT,
    ADD "annotations:effect_type" TEXT;

that is, if Travis gives us the green light :)

@@ -69,7 +69,11 @@ CREATE TABLE genotypes (
quality TEXT,

-- Cycledash-derived data
"annotations:gene_names" TEXT,
-- These are from the varcode annotation worker
"annotations:gene_name" TEXT,
Member

If there are multiple genes, do they still get displayed? (If so, this should probably still be "annotations:gene_names")

Member Author

Nope -- this annotation only picks an effect in a single gene. So we won't have multiple gene annotations any more.

Member

Hmm—is there a way to get that back? That seems like a useful feature /cc @iskandr @tavinathanson

Member

@armish You only have a single gene if you're picking the top priority annotation for each variant.

You could do this to get the top effect for each distinct gene:

for gene_name, gene_effects in variant.effects().groupby_gene_name().items():
    top_gene_effect = gene_effects.top_priority_effect()

but you'll have to then deal with all sorts of weird non-coding gene types.
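
To make the grouping concrete without depending on varcode, here is a plain-Python sketch of the same idea: bucket effects by gene, then keep the highest-priority one per bucket. The `Effect` tuple, its numeric `priority` field, and the gene names are all hypothetical stand-ins; varcode's actual priority ordering is not reproduced here.

```python
from collections import defaultdict, namedtuple

# Hypothetical effect record; a higher priority means a more severe effect.
Effect = namedtuple('Effect', ['gene_name', 'kind', 'priority'])


def top_effect_per_gene(effects):
    """Return a dict mapping each gene name to its highest-priority effect."""
    by_gene = defaultdict(list)
    for effect in effects:
        by_gene[effect.gene_name].append(effect)
    return {
        gene: max(group, key=lambda e: e.priority)
        for gene, group in by_gene.items()
    }


# Placeholder data: one variant overlapping two genes.
effects = [
    Effect('GENE_A', 'Substitution', 3),
    Effect('GENE_A', 'Intronic', 1),
    Effect('GENE_B', 'Intronic', 1),
]
top = top_effect_per_gene(effects)
```

With this input, `top` holds one entry per gene, so a multi-gene column could be built by joining the keys. Any filtering of non-coding gene types would have to be added on top.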

Member Author

From a programmatic point of view, there is no reason not to include multiple gene names; but putting my biologist hat on, I think we should only show a single gene to the users. This might be a bold step forward, but I opt for the single-gene-per-variant route.

Happy to take this discussion off-line ;)

Member

I don't have a very strong opinion about this, but could you file this in an issue to resolve out of band with this PR?

@ihodes
Member

ihodes commented Jul 1, 2015

This is looking good! We still need these to show up in our screenshot tests as well, cf.

"this will also require updating the test database used for generating screenshots, and generating new screenshots for the examine page. Cf. the tests/pdifftests directory for more info."

@armish
Member Author

armish commented Jul 1, 2015

phew! It took me a few iterations to get it right, but here we are with the updated screenshots and a successful build \o/

@ihodes
Member

ihodes commented Jul 1, 2015

This looks great! Thanks for putting it together—will be a great addition to Cycledash.

ihodes added a commit that referenced this pull request Jul 1, 2015
Add a varcode-powered worker for annotating VCF files