Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

export tip frequencies #83

Merged
merged 7 commits into from
Feb 21, 2018
Merged

export tip frequencies #83

merged 7 commits into from
Feb 21, 2018

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Feb 21, 2018

what

A tipfrequencies.JSON is exported at the same time as frequencies.JSON. This is a dictionary with keys pivots (same as frequencies) as well as each strain name -> frequencies at pivots. The frequencies are 6dp as opposed to 4dp to reduce error.

why

Auspice frequencies uses this information. This is a subset of the larger frequencies JSON to improve auspice fetch time.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an awesome idea. My only concern is with backwards compatibility of the frequency JSONs. See inline comments for more.

@@ -39,6 +39,23 @@ def process_freqs(freq):
else:
process.log.notify("Cannot export frequencies - pivots do not exist")

def export_tip_frequency_json(process, prefix, indent):
def process_freqs(freq): # 6dp here, 4 above
return [round(x,6) for x in freq]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for rounding to 6 instead of 4. Instead of defining a replacement function inside a function, we should just define a constant at the top of the module and use the same value and process_freqs in all exports for consistency of code and data. Maybe FREQUENCY_PRECISION = 6 or EXPORT_PRECISION, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if clade in clade_tip_map:
freq_json[clade_tip_map[clade]] = process_freqs(freq)

write_json(freq_json, prefix+'_tipfrequencies.json', indent=indent)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably just be "_tip_frequencies.json" to avoid run-on file names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we can no longer use _ to easily split filenames. Can be tip-frequencies if you want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


for clade, freq in process.tree_frequencies["global"].iteritems():
if clade in clade_tip_map:
freq_json[clade_tip_map[clade]] = process_freqs(freq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to switch the format of the JSON to indexing frequencies by clade name instead of clade id? Along those same lines, can maintain information about which region the frequencies are from in the data themselves?

It would be great if the tip frequencies JSON was backwards compatible with the complete JSON, so code that refers to one could refer to the other without any changes. Annotating the region will make the data more interpretable to anyone who is just looking at the JSON and it will also allow us to export tip frequencies for multiple regions in the future (which is a likely goal, I think).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


clade_tip_map = {n.clade: n.name for n in process.tree.tree.get_terminals()}

for clade, freq in process.tree_frequencies["global"].iteritems():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of looping through all the clades here, you can just loop through the tips and use tip.clade as the key to get the corresponding frequencies from process.tree_frequencies["global"][tip.clade].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trvrb
Copy link
Member

trvrb commented Feb 21, 2018

These are up on data.nextstrain.org for testing purposes.

@trvrb
Copy link
Member

trvrb commented Feb 21, 2018

Also, let's future proof this and add a weight attribute to each tip in the export. So it looks like:

strain_name: {
  weight: 1.0,
  frequencies: [
    0.001
    0.002
    ...
  ]
}

We'll make augur smart about weighting these later, but I think important to have in the data model.

@jameshadfield
Copy link
Member Author

jameshadfield commented Feb 21, 2018

@trvrb I've made the schema backwards compatible (i.e. the same as frequencies.json) at @huddlej's request.

It now looks like: "global_clade:325": [freq1, freq2, ...]

Happy to add a "weight" to each of these if desired, but perhaps better discussed in person.

@trvrb
Copy link
Member

trvrb commented Feb 21, 2018

I don't think separating regions is the correct way to go here. We should discuss.

@jameshadfield
Copy link
Member Author

Final tip-frequency spec looks like

"A/Dnipro/768/2016": {
    "frequencies": [0.004389,  0.004405, ....],
    "weight": 1.0
}

The proper weight value (currently 1.0 for all tips) can be added at some point in the future.

@jameshadfield jameshadfield merged commit ff6ff9f into master Feb 21, 2018
@jameshadfield jameshadfield deleted the export-tip-frequences branch February 21, 2018 18:40
@trvrb trvrb mentioned this pull request Feb 22, 2018
@tsibley tsibley mentioned this pull request Feb 15, 2022
tsibley added a commit to nextstrain/auspice that referenced this pull request Feb 15, 2022
This property was introduced with the original frequencies work¹ as an
anticipated need², but it was never used.  Omit it for now to avoid
carrying around unnecessary baggage; it can be added back in the future
easily if its time comes.

I uncovered this while authoring a JSON Schema for the tip-frequencies
format.³

¹ In PR #497 as a7bda1e.
² nextstrain/augur#83 (comment)
³ nextstrain/augur#852
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants