Write compound (segmented) sequence locations #1438

jameshadfield · 2024-03-15T00:52:06Z

Our current parsing for GenBank files would correctly read in complex CDSs (e.g. the V gene in measles) but we didn't correctly export this CDS in the output JSON. This is fixed here.

Note that because we use BioPython's SeqFeature.extract for both VCF and JSON/FASTA inputs the actual translations (via augur translate) don't need updating.

Screenshot of Measles V gene:

Screenshot of HBV, with the nexclade+GFF translations (top) and the new augur translate + GenBank translations (below) - data is identical except for the CDS names:

Checks pass
add a message in CHANGES.md summarizing the changes in this PR

codecov · 2024-03-15T01:03:41Z

Codecov Report

Attention: Patch coverage is 68.96552% with 9 lines in your changes are missing coverage. Please review.

Project coverage is 68.65%. Comparing base (4dd23c5) to head (0c7a4d3).

❗ Current head 0c7a4d3 differs from pull request most recent head e099711. Consider uploading reports for the commit e099711 to get more accurate results

Files	Patch %	Lines
augur/utils.py	64.00%	5 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1438      +/-   ##
==========================================
- Coverage   68.73%   68.65%   -0.08%     
==========================================
  Files          69       69              
  Lines        7551     7580      +29     
  Branches     1851     1858       +7     
==========================================
+ Hits         5190     5204      +14     
- Misses       2083     2094      +11     
- Partials      278      282       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jameshadfield · 2024-03-15T03:27:46Z

A note on GFF:

Our parsing of GFF files (i.e. upstream of the changes here) doesn't parse segmented features encoded such as this example from the GFF3 spec:

chrX  . CDS                XXXX   YYYY .  +  0 ID=cds01;Parent=tran01
chrX  . CDS                YYYY-1 ZZZZ .  +  0 ID=cds01;Parent=tran01

When parsed by augur we get a single feature with SimpleLocation(YYYY-2,ZZZZ).

GFF works for genes which cross the origin (e.g. HBV) when encoded as a single CDS with coords a b where b>genomeLength. These are parsed by BioPython as SimpleLocation(a,b) and produce a non-segmented annotation JSON {start: a, end: b} which Auspice converts to segments internally. This is unchanged by this PR.

joverlee521

Thanks @jameshadfield! It's good to know the translations are fine and we just needed to correct the annotations consumed by Auspice! Changes look good to me by inspection.

I think it'd be helpful to validate the annotations output by genome_features_to_auspice_annotation. Thoughts on always validating the annotations with validate_json within the function itself?

Given a SeqFeqture with a CompoundLocation we now correctly write out the CDS/gene using segmented coordinates. Auspice can now handle such coordinates (see <nextstrain/auspice#1684> and <#1281> for the corresponding schema updates). Note that the translations (via augur translate) of complex CDSs did not need modifying as they already used BioPython's SeqFeature.extract method. Supersedes #1333

jameshadfield · 2024-03-16T07:37:24Z

Force pushed to resolve conflicts in CHANGES.md

I think it'd be helpful to validate the annotations output by genome_features_to_auspice_annotation. Thoughts on always validating the annotations with validate_json within the function itself?

Validation of the node-data we are about to write out seems sensible rather than (only) validation when we read the node-data file. Validation within the actual function added in this PR has some difficulties because of how it's used in augur translate where we don't use a "nuc" key. I've implemented validation in a different way, but have split it into a separate PR so that we can get these changes into Monday's release.

jameshadfield force-pushed the genbank-complex-cds branch from 41a1984 to 0c7a4d3 Compare March 15, 2024 02:46

jameshadfield requested review from corneliusroemer and joverlee521 March 15, 2024 02:46

joverlee521 approved these changes Mar 15, 2024

View reviewed changes

jameshadfield force-pushed the genbank-complex-cds branch from 0c7a4d3 to e099711 Compare March 16, 2024 07:33

jameshadfield merged commit b4b69b9 into master Mar 16, 2024
20 checks passed

jameshadfield deleted the genbank-complex-cds branch March 16, 2024 07:44

jameshadfield mentioned this pull request Mar 16, 2024

[ancestral, translate] node data validation improvements #1440

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Write compound (segmented) sequence locations #1438

Write compound (segmented) sequence locations #1438

jameshadfield commented Mar 15, 2024 •

edited

codecov bot commented Mar 15, 2024 •

edited

jameshadfield commented Mar 15, 2024

joverlee521 left a comment

jameshadfield commented Mar 16, 2024

Write compound (segmented) sequence locations #1438

Write compound (segmented) sequence locations #1438

Conversation

jameshadfield commented Mar 15, 2024 • edited

codecov bot commented Mar 15, 2024 • edited

Codecov Report

jameshadfield commented Mar 15, 2024

joverlee521 left a comment

Choose a reason for hiding this comment

jameshadfield commented Mar 16, 2024

jameshadfield commented Mar 15, 2024 •

edited

codecov bot commented Mar 15, 2024 •

edited