Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add nextclade (dataset) version info to nextclade.tsv #393

Merged
merged 1 commit into from
Mar 8, 2023

Conversation

corneliusroemer
Copy link
Member

@corneliusroemer corneliusroemer commented Mar 6, 2023

The info gets added as extra columns nextclade_version, dataset_version
and run_timestamp on new nextclade lines only

This way we have metadata for each line in the cached data
and can automatically rerun when there is a mismatch

The extra info does not end up in metadata as we only add columns
we specify explicitly

In a second step (not contained in this PR) we can trigger reruns when there's a mismatch.

Testing

I tested this locally with debug config, worked well

  • Will trigger a full genbank test run (ingest only) just in case: 16181387-2802-45b5-bb40-d83a736d625a

Note: this requires a manual rerun, otherwise tsv-utils won't be happy with differing column counts

The info gets added as extra columns nextclade_version, dataset_version
and run_timestamp on new nextclade lines only

This way we have metadata for each line in the cached data
and can automatically rerun when there is a mismatch

The extra info does not end up in metadata as we only add columns
we specify explicitly
@corneliusroemer corneliusroemer requested review from tsibley, joverlee521 and a team March 6, 2023 18:23
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding these as columns to the nextclade.tsv, I wonder if we should just add version number and dataset timestamp to the filename? (Especially if we expect/want the data to all be generated from the same versions.)

Then, the pipeline can check for the file nextclade_<nextclade_version>_<dataset_version>.tsv. If the file does not exist, then the pipeline can just create an empty file and do a full-run without cache.

@corneliusroemer
Copy link
Member Author

Instead of adding these as columns to the nextclade.tsv, I wonder if we should just add version number and dataset timestamp to the filename? (Especially if we expect/want the data to all be generated from the same versions.)
Then, the pipeline can check for the file nextclade_<nextclade_version>_<dataset_version>.tsv. If the file does not exist, then the pipeline can just create an empty file and do a full-run without cache.

Possible but not ideal.

We don't really need to run for every nextclade version, nor do we need to usually run it for new 21L datasets (i.e. saving 50% of full runs).

We may end up running for all nextclade versions just to avoid breaking things, but this is not a strict requirement.

Checking versions are right is as easy as looking at the second line of the file, so not difficult.

Extra space is minimal.

One upside for encoding in file name would be that we have an easily accessible archive of previous versions, but that's already available through aws s3 cli - so not something major gained.

@joverlee521
Copy link
Contributor

We don't really need to run for every nextclade version, nor do we need to usually run it for new 21L datasets (i.e. saving 50% of full runs).

Hmm, maybe I misunderstood. I thought the purpose of adding version info is to be able to automatically do full runs when there's a mismatch in the versions so we can avoid issues like #392.

@corneliusroemer
Copy link
Member Author

There are three reasons:

  1. @chaoran-chen requested to have nextclade dataset version in nextclade.tsv for covSpectrum
  2. We want to improve debugging of cache rerun issues
  3. As much as possible automate the necessary rerunning when new datasets and (big) change to Nextclade come out

This PR addresses 1 and 2 and makes a flexible implementation of 3 in the future easy.

Rerun logic is currently as follows:

must:

  • Main dataset with every version
  • Changes to nextclade column output (until now it's been minor version, never patch)
  • 21L dataset when the immune escape score calculation algorithm changes (so far 0 times in ~3 months)

may

  • Run with every nextclade release, even patch, just to be sure - we also don't that often release nextclade so that's cheap

unnecessary

  • 21L datasets releases without algorithm changes

Having nextclade/dataset version and timestamp in the tsv for every record means we are flexible as to when we want to rerun.

Sure we could append nextclade and dataset versions to the file path and pull caches dynamically from s3 after we know the version of nextclade and dataset. For 21L we could not invalidate for new datasets by default. And only use minor version. I guess both works. But we loose information in those cases in case we do want to debug where something came from.

Disadvantage of putting it in file name is that there is no stable path for the latest version of the nextclade.tsv. Unless we add touch files, but that's duplicating things.

Does that make sense?

@corneliusroemer
Copy link
Member Author

I'm going to merge this later unless there are objections. We can always add or replace with file names if we prefer. I don't see a downside of this approach and it's tested.

@corneliusroemer corneliusroemer merged commit 3b711bf into master Mar 8, 2023
@corneliusroemer corneliusroemer deleted the add-version-info branch March 8, 2023 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants