Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write compound (segmented) sequence locations #1438

Merged
merged 1 commit into from Mar 16, 2024

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Mar 15, 2024

Our current parsing for GenBank files would correctly read in complex CDSs (e.g. the V gene in measles) but we didn't correctly export this CDS in the output JSON. This is fixed here.

Note that because we use BioPython's SeqFeature.extract for both VCF and JSON/FASTA inputs the actual translations (via augur translate) don't need updating.

Screenshot of Measles V gene:
image

Screenshot of HBV, with the nexclade+GFF translations (top) and the new augur translate + GenBank translations (below) - data is identical except for the CDS names:

image
  • Checks pass
  • add a message in CHANGES.md summarizing the changes in this PR

Copy link

codecov bot commented Mar 15, 2024

Codecov Report

Attention: Patch coverage is 68.96552% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 68.65%. Comparing base (4dd23c5) to head (0c7a4d3).

❗ Current head 0c7a4d3 differs from pull request most recent head e099711. Consider uploading reports for the commit e099711 to get more accurate results

Files Patch % Lines
augur/utils.py 64.00% 5 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1438      +/-   ##
==========================================
- Coverage   68.73%   68.65%   -0.08%     
==========================================
  Files          69       69              
  Lines        7551     7580      +29     
  Branches     1851     1858       +7     
==========================================
+ Hits         5190     5204      +14     
- Misses       2083     2094      +11     
- Partials      278      282       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jameshadfield
Copy link
Member Author

A note on GFF:

Our parsing of GFF files (i.e. upstream of the changes here) doesn't parse segmented features encoded such as this example from the GFF3 spec:

chrX  . CDS                XXXX   YYYY .  +  0 ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ .  +  0 ID=cds01;Parent=tran01

When parsed by augur we get a single feature with SimpleLocation(YYYY-2,ZZZZ).

GFF works for genes which cross the origin (e.g. HBV) when encoded as a single CDS with coords a b where b>genomeLength. These are parsed by BioPython as SimpleLocation(a,b) and produce a non-segmented annotation JSON {start: a, end: b} which Auspice converts to segments internally. This is unchanged by this PR.

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jameshadfield! It's good to know the translations are fine and we just needed to correct the annotations consumed by Auspice! Changes look good to me by inspection.

I think it'd be helpful to validate the annotations output by genome_features_to_auspice_annotation. Thoughts on always validating the annotations with validate_json within the function itself?

Given a SeqFeqture with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs did not
need modifying as they already used BioPython's SeqFeature.extract
method.

Supersedes #1333
@jameshadfield
Copy link
Member Author

Force pushed to resolve conflicts in CHANGES.md

I think it'd be helpful to validate the annotations output by genome_features_to_auspice_annotation. Thoughts on always validating the annotations with validate_json within the function itself?

Validation of the node-data we are about to write out seems sensible rather than (only) validation when we read the node-data file. Validation within the actual function added in this PR has some difficulties because of how it's used in augur translate where we don't use a "nuc" key. I've implemented validation in a different way, but have split it into a separate PR so that we can get these changes into Monday's release.

@jameshadfield jameshadfield merged commit b4b69b9 into master Mar 16, 2024
20 checks passed
@jameshadfield jameshadfield deleted the genbank-complex-cds branch March 16, 2024 07:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants