Annotations schema updates #1281

Merged
jameshadfield merged 3 commits into master from annotations-schema-updates
Aug 20, 2023
Conversation

jameshadfield
Member

This PR is ready for review, but note that I won't merge it until the companion Auspice PR has been merged and that there may be some very slight schema changes (e.g. property names changing) before then. Neither of those should hold up review.

See commit messages for details.

@jameshadfield jameshadfield requested a review from a team August 16, 2023 09:50
The annotations schema is updated to more accurately represent how this
data is used by Auspice version 2.46.1. Specifically, the strand of the
'nuc' property is not used by Auspice, and start/end coordinates are
required properties.

Tests are updated because the tests were incorrect.

Currently the annotations section of node-data JSONs is produced by
`augur ancestral` and `augur translate`. Each adds a 'nuc' property,
and it's required for Auspice, so I think it's appropriate to make it a
required property for the schema.

The changes to the schema here are only reflected when parsing node-data
files; the next commit will make this available for our typical Auspice
JSON validation.
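
The rules described in these commit messages can be sketched as a small, stdlib-only check. This is an illustration under stated assumptions, not Augur's schema or its validator: the function name `check_annotations` and the example coordinates are hypothetical, and only the rules stated above are encoded (a required 'nuc' entry, and required start/end on every annotation).

```python
def check_annotations(annotations):
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical helper: it mirrors the rules in the commit message,
    not the actual JSON schema shipped with Augur.
    """
    problems = []
    # 'nuc' is added by both `augur ancestral` and `augur translate`
    # and is needed by Auspice, so it is treated as required here.
    if "nuc" not in annotations:
        problems.append("missing required 'nuc' annotation")
    # Every annotation must carry start and end coordinates.
    for name, feature in annotations.items():
        for key in ("start", "end"):
            if key not in feature:
                problems.append(f"{name}: missing required '{key}'")
    return problems


# Made-up example data: 'geneA' lacks an 'end' coordinate.
example = {
    "nuc": {"start": 1, "end": 1000},
    "geneA": {"start": 100, "strand": "+"},
}
print(check_annotations(example))
```

A real setup would express these rules in the JSON schema itself; the function form is only meant to make the two requirements concrete.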
@codecov

codecov bot commented Aug 16, 2023

Codecov Report

Patch coverage: 91.66% and project coverage change: +0.12% 🎉

Comparison is base (17a2746) 69.74% compared to head (c530539) 69.86%.
Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1281      +/-   ##
==========================================
+ Coverage   69.74%   69.86%   +0.12%     
==========================================
  Files          67       67              
  Lines        7146     7155       +9     
  Branches     1742     1744       +2     
==========================================
+ Hits         4984     4999      +15     
+ Misses       1855     1849       -6     
  Partials      307      307              
Files Changed Coverage Δ
augur/validate.py 79.32% <91.66%> (+4.62%) ⬆️

... and 2 files with indirect coverage changes


"strand": {
"description": "Positive or negative strand",
"type": "string",
"enum": ["-","+"]
Member

Suggestion: Add enum back with additional values.

Even though Auspice only cares if the value is '-' or not, Augur can also export values '?' and None as suggested and implemented in 31f0b26. Defining the possible values here will improve the effectiveness of validation.

I started this in #1279 but that PR can be closed and the change included in here.

Member Author

@jameshadfield jameshadfield Aug 17, 2023

Thanks for the review -- really appreciated!

We're going to have to think through this. All annotations are interpreted by Auspice as CDSs, so a strand of "?" or "." (which we represent as None) doesn't make sense as a CDS - neither for Auspice to display nor for Augur to translate.

I can allow them in the schema and then have Auspice filter to ["+", "-"], which is probably the most technically correct, but I would think that happily translating "?" / "." features is questionable. For augur translate + GenBank annotations, only CDS features are read and (I believe) it's not possible to encode a CDS in GenBank that's not +/-. I'm not sure what we do for augur translate + GFF annotation.

Update: Auspice PR now ignores any non-nuc annotation which is not explicitly +/- strand
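
The filtering rule in that update can be sketched as follows. Auspice itself is JavaScript, so this Python snippet only illustrates the rule as described in the comment; the function name `displayable_cds` and the example annotations are hypothetical.

```python
def displayable_cds(annotations):
    """Keep only non-'nuc' annotations whose strand is explicitly '+' or '-'."""
    return {
        name: feature
        for name, feature in annotations.items()
        if name != "nuc" and feature.get("strand") in ("+", "-")
    }


# Made-up example: '?' and None strands are dropped, as is 'nuc' itself.
example_annotations = {
    "nuc": {"start": 1, "end": 1000},
    "geneA": {"start": 10, "end": 99, "strand": "+"},
    "geneB": {"start": 200, "end": 290, "strand": "-"},
    "geneC": {"start": 300, "end": 390, "strand": "?"},
    "geneD": {"start": 400, "end": 490, "strand": None},
}
print(sorted(displayable_cds(example_annotations)))
```

Filtering on the consumer side like this keeps the schema permissive while still ensuring that only features interpretable as CDSs are displayed.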

Member Author

Have shifted this conversation to #1279

tsibley added a commit to nextstrain/nextstrain.org that referenced this pull request Aug 18, 2023
Matches the new schema id (to be) in Augur, see
<nextstrain/augur#1281>.

Resolves <#704>.
Member

@tsibley tsibley left a comment

Just a couple additional small changes.

I've done the .org side of things in nextstrain/nextstrain.org#705.

Comment on lines 44 to 63
    if refs:
        # Make the validator aware of additional schemas
        schema_store = {k: json.loads(resource_string(__package__, os.path.join("data", v))) for k,v in refs.items()}
        resolver = jsonschema.RefResolver.from_schema(schema,store=schema_store)
        schema_validator = Validator(schema, resolver=resolver)
    else:
        schema_validator = Validator(schema)

    # By default $ref URLs which we don't define in a schema_store are fetched
    # by jsonschema. This often indicates a typo (the $ref doesn't match the key
    # of the schema_store) or we forgot to add a local mapping for a new $ref.
    # Either way, Augur should not be accessing the network.
    def resolve_remote(url):
        # The exception type is not important as jsonschema will catch & re-raise as a RefResolutionError
        raise Exception(f"The schema used for validation attempted to fetch the remote URL '{url}'. " +
                        "Augur should resolve schema references to local files, please check the schema used " +
                        "and update the appropriate schema_store as needed." )
    schema_validator.resolver.resolve_remote = resolve_remote

    return schema_validator
Member

non-blocking

If we upgraded our dependency on jsonschema to the 4.x series, we could use the newer and nicer APIs for reference handling here. Doesn't have to be now, though.
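
The idea behind the quoted validator setup (resolve every `$ref` from a local store and fail loudly rather than touch the network) can be sketched without any third-party library. This is a simplified illustration, not the jsonschema API in either its current or 4.x form: `resolve_ref`, `expand_refs`, and the example.org URLs are hypothetical, and the sketch only handles whole-document references (no JSON-pointer fragments, no cyclic references).

```python
def resolve_ref(ref, schema_store):
    """Look up a $ref URL in a local store; never fall back to the network."""
    try:
        return schema_store[ref]
    except KeyError:
        raise LookupError(
            f"Schema reference {ref!r} is not in the local store; "
            "add a local mapping instead of fetching it remotely."
        ) from None


def expand_refs(node, schema_store):
    """Recursively replace {"$ref": url} objects with the referenced schema."""
    if isinstance(node, dict):
        if "$ref" in node:
            target = resolve_ref(node["$ref"], schema_store)
            return expand_refs(target, schema_store)
        return {key: expand_refs(value, schema_store) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_refs(item, schema_store) for item in node]
    return node


# Hypothetical store and schema, keyed by made-up URLs.
store = {
    "https://example.org/annotations.json": {"type": "object", "required": ["nuc"]},
}
schema = {
    "properties": {"annotations": {"$ref": "https://example.org/annotations.json"}},
}
expanded = expand_refs(schema, store)
print(expanded["properties"]["annotations"]["required"])
```

A mistyped `$ref` raises `LookupError` here, which corresponds to the failure mode the overridden `resolve_remote` is meant to surface.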

See added comments for details
See <nextstrain/auspice#1684> for the context
for these additions.
@jameshadfield
Member Author

Schema updated to remove the name property on individual CDS segments as that functionality was removed from the Auspice PR.

@jameshadfield jameshadfield merged commit 37a0a07 into master Aug 20, 2023
27 checks passed
@jameshadfield jameshadfield deleted the annotations-schema-updates branch August 20, 2023 22:11
tsibley added a commit that referenced this pull request Aug 21, 2023
repr() will add appropriate quotes around the value.

Related-to: <#1281 (comment)>
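
As a brief illustration of the point in this commit message: formatting a value with `repr()` (or the `!r` conversion in an f-string) adds the quotes automatically and also makes stray whitespace visible. The exact line changed by the commit is not shown here, and the URL below is a made-up example.

```python
url = "https://example.org/schema.json"  # hypothetical URL

# Quotes written by hand versus quotes added by repr().
manual = f"attempted to fetch the remote URL '{url}'"
with_repr = f"attempted to fetch the remote URL {url!r}"
print(manual == with_repr)

# repr() also escapes characters that hand-written quotes would hide.
print(repr("schema.json\n"))
```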
"description": "Type of the feature. could be mRNA, CDS, or similar",
"$comment": "Unused by Auspice 2.0",
"type": "string"
"strand": {
Member

Late to the party, but shouldn't we in principle allow each CDS fragment to have its own strandedness? Here the strand is fixed for all fragments, which will be fine in most cases, but who knows, maybe sometimes fragments might come from opposite strands?

Member

ChatGPT tells me that strandedness never changes within a CDS so then what we have here should work in all cases.

Member

With trans-splicing it might happen, but that doesn't seem to occur in viruses, so we should be good for now: https://en.wikipedia.org/wiki/Trans-splicing
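
The design discussed in this thread, one strand shared by all segments of a CDS, can be sketched as a small data structure. This is an illustration only: the class name `SegmentedCDS` is hypothetical, the coordinates are assumed to be 1-based and inclusive, and the strand check reflects the '+'/'-' restriction mentioned earlier in the conversation rather than the schema text itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SegmentedCDS:
    # (start, end) pairs; assumed 1-based and inclusive for this sketch.
    segments: List[Tuple[int, int]]
    # A single strand applies to every segment of the CDS.
    strand: str

    def __post_init__(self):
        if self.strand not in ("+", "-"):
            raise ValueError(f"strand must be '+' or '-', got {self.strand!r}")

    def length(self) -> int:
        """Total number of nucleotides covered by all segments."""
        return sum(end - start + 1 for start, end in self.segments)


# Made-up two-segment CDS on the positive strand.
cds = SegmentedCDS(segments=[(1, 9), (12, 20)], strand="+")
print(cds.length())
```

Allowing a strand per segment would mean moving `strand` into each tuple; the thread concludes that this is not needed for viral genomes for now.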

jameshadfield added a commit that referenced this pull request Mar 15, 2024
Given a SeqFeature with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs are not
modified in this commit.

Supersedes #1333
jameshadfield added a commit that referenced this pull request Mar 15, 2024
Given a SeqFeature with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs did not
need modifying as they already used BioPython's SeqFeature.extract
method.

Supersedes #1333
jameshadfield added a commit that referenced this pull request Mar 16, 2024
Given a SeqFeature with a CompoundLocation we now correctly write out
the CDS/gene using segmented coordinates. Auspice can now handle such
coordinates (see <nextstrain/auspice#1684> and
<#1281> for the corresponding
schema updates).

Note that the translations (via augur translate) of complex CDSs did not
need modifying as they already used BioPython's SeqFeature.extract
method.

Supersedes #1333

4 participants