-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a varcode-powered worker for annotating VCF files #750
Conversation
@@ -397,14 +420,6 @@ def _header_spec(vcf_header_text, extant_cols): | |||
'Names of genes that overlap with this variant\'s ' | |||
'starting position, derived from Ensembl Release 75.')) | |||
|
|||
# Remove empty supercolumns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did something change to obsolete this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, actually not -- this is just an artifact of the diff. These lines are still present a little bit above, but somehow this diff view thinks that they were deleted and added back.
Awesome to have this coming in! Thanks for the work, @armish. I do think we can drop the gene_annotator; there's no reason to have both now, AFAIK. With that, I'd drop the Could you also add the migration code required to get the database up to date with the new schema in the PR text? We do this to make deployment slightly easier. It's definitely not the right solution… but it's something :) Ah, and this will also require updating the test database used for generating screenshots, and generating new screenshots for the examine page. Cf. the |
@@ -383,8 +383,31 @@ def _header_spec(vcf_header_text, extant_cols): | |||
path=['sample_name']) # This path is not the default super -> sub column | |||
|
|||
# Add Cycledash-derived columns | |||
column_name = 'annotations:gene_names' | |||
if column_name in extant_cols: | |||
_add_extant_column_to_spec(extant_cols, 'annotations:gene_names', res, 'String', 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't tell if it's just some funky GitHub formatting, but some of these lines around here may be longer than 80 charts; mind chopping them down to size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, will definitely do.
@ihodes: and here is the migration code:
that is, if Travis gives us the green light :) |
@@ -69,7 +69,11 @@ CREATE TABLE genotypes ( | |||
quality TEXT, | |||
|
|||
-- Cycledash-derived data | |||
"annotations:gene_names" TEXT, | |||
-- These are from the varcode annotation worker | |||
"annotations:gene_name" TEXT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are multiple genes, do they still get displayed? (If so, this should probably still be "annotations:gene_names")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope -- this annotation only picks an effect in a single gene. So we won't be having multiple genes annotations any more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm—is there a way to get that back? That seems like a useful feature /cc @iskandr @tavinathanson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@armish You only have a single gene if you're picking the top priority annotation for each variant.
You could do this to get the top effect for each distinct gene:
for gene_name, gene_effects in variant.effects().groupby_gene_name().items():
top_gene_effect = gene_effects.top_priority_effect()
but you'll have to then deal with all sorts of weird non-coding gene types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a programmatic point of view, there is no reason not to include multiple gene names; but putting my biologist hat on, I think we should only show a single gene to the users. This might be a bold step forward, but I opt for single gene/variant route.
Happy to take this discussion off-line ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a very strong opinion about this, but could you file this in an issue to resolve out of band with this PR?
This is looking good! We still need these to show up in our screenshot tests as well, cf.
|
phew! It took me a few iterations to get it right, but here we are with the updated screenshots and a successful build \o/ |
This looks great! Thanks for putting it together—will be a great addition to Cycledash. |
Add a varcode-powered worker for annotating VCF files
This new worker annotates each variant in a VCF file with the top priority effect properties:
This worker is run when user submits a new VCF file (i.e. when a new run is created)
Once the worker completes its annotation, new information on variants are shown under
Annotations
Column Group with the Varcode prefix: