Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimal export (no node-data files needed) #1299

Merged
merged 9 commits into from Sep 20, 2023
Merged

Conversation

jameshadfield
Copy link
Member

Allows a minimal augur export using only a (newick) tree as input,
functionality that we've wanted for over 4 years! To facilitate this we
parse branch lengths¹ from the newick file if such data wasn't available
in the node-data inputs (e.g. because there are none!).

The code for deciding where to read divergence from has been refactored
and in the process improved: the (rare? never encountered?) case where
divergence was sometimes read from node-data keys 'mutation_length' and
sometimes from 'branch_length' can non longer happen.

If data is provided which doesn't define divergence or num_date
(irregardless of whether node-data files were provided as inputs), then
the resulting dataset will fail validation.

Closes #273

¹ I suppose these might represent time in certain cases, but I haven't
seen such data in Newick files.

@jameshadfield jameshadfield changed the title Export sans node data Minimal export (no node-data files needed) Aug 31, 2023
@jameshadfield jameshadfield force-pushed the export-sans-node-data branch 2 times, most recently from 18281c0 to 4991b9a Compare August 31, 2023 04:01
@codecov
Copy link

codecov bot commented Aug 31, 2023

Codecov Report

Patch coverage is 76.00% of modified lines.

❗ Current head 57bdf28 differs from pull request most recent head 9afa278. Consider uploading reports for the commit 9afa278 to get more accurate results

Files Changed Coverage
augur/validate.py ø
augur/export_v2.py 76.00%

📢 Thoughts on this report? Let us know!.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So cool, @jameshadfield! I left some comments for consideration, but none are blocking.

Comment on lines 18 to 21
The above minimal.json takes divergence from the newick file. This converts newick divergences of (e.g.) '1' to `1.0`
because BioPython uses floats (which is perfectly reasonable). Remove the decimal to diff the JSON.
(Note that Auspice won't behave any differently)
$ sed 's/\.0//' minimal.json > minimal.no-decimal.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use --significant-digits 0 in the call to diff_jsons.py below to achieve this same effect without additional copies of the JSONs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cool, I forgot this was an option!

Unfortunately in the case of 0 significant digits we are still left with the associated int vs float type difference (maybe? It doesn't look like it's removing the .0 at all, but I presume deepdiff is showing the unmodified input values here)

{'old_type': <class 'int'>, 'new_type': <class 'float'>, 'old_value': 1, 'new_value': 1.0},

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'm sorry that wasn't just a one-line fix, @jameshadfield! DeepDiff supports ignoring type changes for this scenario. If you're open to it, I can push a commit that adds a flag for the diff JSONs script to --ignore-numeric-type-changes. I bet this would be handy for other tests, too...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this functionality in 57bdf28.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this reminded me that there is a deep diff CLI that we could eventually switch to... 🤦🏻

return lambda node, metadata: metadata['branch_length']
if T.root.branch_length is not None:
return lambda node, metadata: node.branch_length
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw an exception when no branch length data are available instead of returning None? If the tree doesn't have branch lengths that seems like a big problem that the user would want to know about. The original div code below used to print an error, but we could raise an exception that gets nicely printed by Augur's generic exception handling logic instead.

I think the scenario where this function gets called and its output sent to convert_tree_to_json_structure is different from when we pass None to convert_tree_to_json_structure, since it suggests we expect to find divergence information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the tree doesn't have branch lengths that seems like a big problem that the user would want to know about. The original div code below used to print an error

I think you're referring to this line? That would actually only print an error if div¹ values were set on the root node but missing on some other node on the tree. In other words, trees without any div values were just fine. I wouldn't be surprised if the error was never printed! We now avoid this situation by requiring that such information is present for every node in the tree. Although I guess we don't technically enforce that for newick trees (root branch with div, but a later branch missing it), perhaps that's worth covering?

the scenario where this function [node_div()] gets called ... suggests we expect to find divergence information.

There'll be situations where we don't have divergence info, e.g. BEAST trees or any dataset with a newick tree without branch lengths. Not something we commonly run ourselves, but they do exist. If temporal information is not provided either then the dataset will fail validation. Open to improvements here tho!

¹ Here this would be mutation_length or branch_length in the node-data

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There'll be situations where we don't have divergence info

Whoa...I didn't realize this happened. I see why you'd want to be more lenient in the branch length handling, then. Thank you for this additional context!

@@ -130,7 +130,7 @@
"type" : "object",
"$comment": "The phylogeny in a nested JSON structure",
"additionalProperties": false,
"required": ["name"],
"required": ["name", "node_attrs"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change and the two below to required attributes require a major version release for Augur? I can't remember how we decided to treat schema updates, but since more external software now consumes Auspice JSONs, we might want to announce these changes more loudly/broadly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the context of Auspice a tree without node_attrs was never valid, in the sense that it wouldn't render correctly (because this implies no distance metric). In the context of Augur perhaps such datasets existed? I can't think of any situations where it would be used, but maybe?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add to change log, but probably no major version needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added changelog entry mentioning this

tests/functional/export_v2/cram/minimal.t Outdated Show resolved Hide resolved
Copy link
Member

@corneliusroemer corneliusroemer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect. I think it's just missing a change log entry, unless I'm blind

@@ -111,44 +111,63 @@ def order_nodes(node):
order_nodes(od['tree'])
return od

def convert_tree_to_json_structure(node, metadata, div=0):

def node_div(T, node_attrs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type annotations could be helpful here, but this is optional of course

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah! It's been on my to-do list to start using them in Python code, but not for this PR 😬

jameshadfield and others added 9 commits September 20, 2023 16:22
These properties were never actually used (neither exported from augur
export v2 nor consumed by auspice)

Closes #867 <#867>
A number of parts of the auspice-config have identical (or
almost-identical) shape to those in the resulting dataset JSON, although
the actual data may be modified as it passes through `augur export v2`.

Rather than referencing the entire auspice-config schema and pruning
down properties, which I don't actually think is possible in jsonschema,
I chose to use $refs at a more fine grained level which I find easier to
read.

The actual schema definitions should be unchanged by this commit,
although comments / descriptions have been improved.
Auspice can already set the tip label via URL state (`?tl=...`) and will
shortly be able to parse the display_default added here.

Closes #1115 <#1115>
These work fine in Auspice. While the 'colorings' property is optional,
`augur export v2` will always set a (possibly empty) array. I also chose
to allow the auspice config file to have an empty colorings definition,
which in practice behaves the same as leaving it out.

Addresses comment in #273 <#273 (comment)>
Removes previously valid string patterns which were never used within
augur and would result in unexpected behaviour in auspice.
Also updates the patternProperties of CDSs to match that used in the
genome_annotations (schema-annotations.json)
Trees could currently be produced with neither "div" nor "num_date"
information. Arguably Auspice could interpret these as cladograms
but as it stands these datasets aren't rendered by Auspice.

Datasets without this information are easy to create but rare in
practice: If the node-data files don't define "mutation_length" or
"branch_length" then there's no "div" and if they don't define
"num_date" then that's not there either.

Note that what we really want to require is that "div" is present on all
nodes and/or "num_date" is present on all nodes, but the schema doesn't
let us do this.
Allows a minimal `augur export` using only a (newick) tree as input,
functionality that we've wanted for over 4 years! To facilitate this we
parse branch lengths¹ from the newick file if such data wasn't available
in the node-data inputs (e.g. because there are none!).

The code for deciding where to read divergence from has been refactored
and in the process improved: the (rare? never encountered?) case where
divergence was sometimes read from node-data keys 'mutation_length' and
sometimes from 'branch_length' can non longer happen.

If data is provided which doesn't define divergence or num_date
(irregardless of whether node-data files were provided as inputs), then
the resulting dataset will fail validation.

Closes #273 <#273>

¹ I suppose these might represent time in certain cases, but I haven't
seen such data in Newick files.
Adds a flag to the diff JSONs script to ignore numeric type changes when
running DeepDiff [1]. Updates the export v2 minimal export test to use
this new flag instead of creating an intermediate file with sed.

[1] https://zepworks.com/deepdiff/current/ignore_types_or_values.html
@jameshadfield jameshadfield changed the base branch from schema-updates to master September 20, 2023 04:35
@jameshadfield
Copy link
Member Author

Rebased onto master, changelog entries added, and PR base changed to master.

@jameshadfield jameshadfield mentioned this pull request Sep 20, 2023
@jameshadfield jameshadfield merged commit e19b5e5 into master Sep 20, 2023
1 check passed
@jameshadfield jameshadfield deleted the export-sans-node-data branch September 20, 2023 04:38
@@ -181,7 +184,13 @@
"type": "string",
"pattern": "^(none|[a-zA-Z0-9]+)$"
},
"tip_label": {
Copy link
Member

@corneliusroemer corneliusroemer Sep 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inclusion of tip_label is a bug - as Auspice doesn't yet support this tip label. Unless nextstrain/auspice#1692 gets merged soon we should revert this from the schema.
I'm sorry I missed this in review, I'll remember to look more carefully at schema changes.
@jameshadfield

I've made an issue for better support of different auspice schema versions here: #1326

* A number of schema updates and improvements [#1299][] (@jameshadfield)
* We now require all nodes to have `node_attrs` on them with one of `div` or `num_date` present
* Some never-used properties are removed from the schemas, including a pattern for defining nucleotide INDELs which was never used by augur or auspice.
* Tip label defaults are now settable within the auspice-config JSON
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowed per validation in auspice-config but not yet supported by current auspice 🙃

I think my preference would be to first allow something in Auspice, then add it to schema, rather than other way round - but I guess as long as auspice still works and just lacks a minor default feature, it's no a big deal either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah totally, I forgot that auspice PR hadn't been merged + released. I'll do that today

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auspice 2.49.0 released which parses this display_default. It'll appear on nextstrain.org / auspice.us etc over the coming day or two

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Allow "minimal" augur export
3 participants