Prepare data #3
Merged
This PR takes a more nuanced approach to preparing sequence count data (borrowing from forecasts-ncov). It subsets sequence counts to those between min_date and max_date (relative or absolute) and includes only locations with at least location_min_seq sequences in this time period. It also collapses variants with fewer than clade_min_seq sequences into "other". This logic replaces the previous "threshold" and the previous plotting aids loc_lst and var_lst. I didn't like that a bunch of countries with sparse data were informing the hierarchical estimates even though we'd never see their frequencies or growth advantages. I'd prefer that everything used in the final analysis be on the table.
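For readers who want the three filtering steps spelled out, here is a minimal sketch of the logic described above. It is not the code from this PR: the function name prepare_counts, the (location, variant, date, sequences) row layout, the inclusive date bounds, and the choice to total variant counts across all retained locations are assumptions made for illustration. It also takes only absolute dates; resolving relative dates is left out.

```python
from collections import defaultdict
from datetime import date


def prepare_counts(rows, min_date, max_date, location_min_seq, clade_min_seq):
    """Filter and collapse sequence counts (hypothetical helper).

    rows is an iterable of (location, variant, date, sequences) tuples.
    This row layout is an assumption, not the format used in the PR.
    """
    # Step 1: keep rows inside the window (bounds assumed inclusive).
    windowed = [r for r in rows if min_date <= r[2] <= max_date]

    # Step 2: keep locations with at least location_min_seq sequences
    # in the window.
    per_location = defaultdict(int)
    for location, _variant, _day, n in windowed:
        per_location[location] += n
    kept = [r for r in windowed if per_location[r[0]] >= location_min_seq]

    # Step 3: collapse variants with fewer than clade_min_seq sequences
    # into "other". Totals are taken across all kept locations, which is
    # an assumption; the PR may count per location instead.
    per_variant = defaultdict(int)
    for _location, variant, _day, n in kept:
        per_variant[variant] += n

    collapsed = defaultdict(int)
    for location, variant, day, n in kept:
        name = variant if per_variant[variant] >= clade_min_seq else "other"
        collapsed[(location, name, day)] += n

    # Return sorted rows so the output order is deterministic.
    return [(loc, var, day, n) for (loc, var, day), n in sorted(collapsed.items())]
```

Because locations are dropped before variants are collapsed, sparse locations do not contribute to the variant totals, which matches the stated goal of having only data used in the final analysis on the table.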
huddlej (Contributor) reviewed on Feb 25, 2025 and left a comment:
Thanks, @trvrb! @plsteinberg and I had just chatted recently about whether to adopt this prepare_data.py script from forecasts-ncov or not. We opted not to in the other repo, just because the full script has more features than we need, but the simpler version here makes sense.
It is much nicer to know all data that went into the model appear in the figures!
I only had a minor comment about date filtering below. I can implement that change, if you agree, or you can if you're in the zone with this work... :D
This was referenced Mar 4, 2025
A small fix to update metadata and a larger change to prepare sequence counts.