feat: add nextclade (dataset) version info to nextclade.tsv #393

corneliusroemer · 2023-03-06T18:23:29Z

The info gets added as extra columns nextclade_version, dataset_version
and run_timestamp on new nextclade lines only

This way we have metadata for each line in the cached data
and can automatically rerun when there is a mismatch

The extra info does not end up in metadata as we only add columns
we specify explicitly

In a second step (not contained in this PR) we can trigger reruns when there's a mismatch.

Testing

I tested this locally with debug config, worked well

Will trigger a full genbank test run (ingest only) just in case: 16181387-2802-45b5-bb40-d83a736d625a

Note: this requires a manual rerun, otherwise tsv-utils won't be happy with differing column counts

The info gets added as extra columns nextclade_version, dataset_version and run_timestamp on new nextclade lines only This way we have metadata for each line in the cached data and can automatically rerun when there is a mismatch The extra info does not end up in metadata as we only add columns we specify explicitly

joverlee521

Instead of adding these as columns to the nextclade.tsv, I wonder if we should just add version number and dataset timestamp to the filename? (Especially if we expect/want the data to all be generated from the same versions.)

Then, the pipeline can check for the file nextclade_<nextclade_version>_<dataset_version>.tsv. If the file does not exist, then the pipeline can just create an empty file and do a full-run without cache.

corneliusroemer · 2023-03-06T22:17:18Z

Instead of adding these as columns to the nextclade.tsv, I wonder if we should just add version number and dataset timestamp to the filename? (Especially if we expect/want the data to all be generated from the same versions.)
Then, the pipeline can check for the file nextclade_<nextclade_version>_<dataset_version>.tsv. If the file does not exist, then the pipeline can just create an empty file and do a full-run without cache.

Possible but not ideal.

We don't really need to run for every nextclade version, nor do we need to usually run it for new 21L datasets (i.e. saving 50% of full runs).

We may end up running for all nextclade versions just to avoid breaking things, but this is not a strict requirement.

Checking versions are right is as easy as looking at the second line of the file, so not difficult.

Extra space is minimal.

One upside for encoding in file name would be that we have an easily accessible archive of previous versions, but that's already available through aws s3 cli - so not something major gained.

joverlee521 · 2023-03-06T23:09:47Z

We don't really need to run for every nextclade version, nor do we need to usually run it for new 21L datasets (i.e. saving 50% of full runs).

Hmm, maybe I misunderstood. I thought the purpose of adding version info is to be able to automatically do full runs when there's a mismatch in the versions so we can avoid issues like #392.

corneliusroemer · 2023-03-07T02:16:56Z

There are three reasons:

@chaoran-chen requested to have nextclade dataset version in nextclade.tsv for covSpectrum
We want to improve debugging of cache rerun issues
As much as possible automate the necessary rerunning when new datasets and (big) change to Nextclade come out

This PR addresses 1 and 2 and makes a flexible implementation of 3 in the future easy.

Rerun logic is currently as follows:

must:

Main dataset with every version
Changes to nextclade column output (until now it's been minor version, never patch)
21L dataset when the immune escape score calculation algorithm changes (so far 0 times in ~3 months)

may

Run with every nextclade release, even patch, just to be sure - we also don't that often release nextclade so that's cheap

unnecessary

21L datasets releases without algorithm changes

Having nextclade/dataset version and timestamp in the tsv for every record means we are flexible as to when we want to rerun.

Sure we could append nextclade and dataset versions to the file path and pull caches dynamically from s3 after we know the version of nextclade and dataset. For 21L we could not invalidate for new datasets by default. And only use minor version. I guess both works. But we loose information in those cases in case we do want to debug where something came from.

Disadvantage of putting it in file name is that there is no stable path for the latest version of the nextclade.tsv. Unless we add touch files, but that's duplicating things.

Does that make sense?

corneliusroemer · 2023-03-07T19:38:59Z

I'm going to merge this later unless there are objections. We can always add or replace with file names if we prefer. I don't see a downside of this approach and it's tested.

corneliusroemer requested review from tsibley, joverlee521 and a team March 6, 2023 18:23

joverlee521 reviewed Mar 6, 2023

View reviewed changes

corneliusroemer merged commit 3b711bf into master Mar 8, 2023

corneliusroemer deleted the add-version-info branch March 8, 2023 12:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add nextclade (dataset) version info to nextclade.tsv #393

feat: add nextclade (dataset) version info to nextclade.tsv #393

corneliusroemer commented Mar 6, 2023 •

edited

Loading

joverlee521 left a comment

corneliusroemer commented Mar 6, 2023

joverlee521 commented Mar 6, 2023

corneliusroemer commented Mar 7, 2023

corneliusroemer commented Mar 7, 2023

feat: add nextclade (dataset) version info to nextclade.tsv #393

feat: add nextclade (dataset) version info to nextclade.tsv #393

Conversation

corneliusroemer commented Mar 6, 2023 • edited Loading

Testing

joverlee521 left a comment

Choose a reason for hiding this comment

corneliusroemer commented Mar 6, 2023

joverlee521 commented Mar 6, 2023

corneliusroemer commented Mar 7, 2023

corneliusroemer commented Mar 7, 2023

corneliusroemer commented Mar 6, 2023 •

edited

Loading