feat: add nextclade (dataset) version info to nextclade.tsv #393
The info gets added as extra columns nextclade_version, dataset_version and run_timestamp on new nextclade lines only. This way we have metadata for each line in the cached data and can automatically rerun when there is a mismatch. The extra info does not end up in metadata as we only add columns we specify explicitly.
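As a rough illustration of the idea (not the code in this PR), new nextclade rows could be tagged like this. The helper name `annotate_rows` and the dict-per-row representation are assumptions made for the sketch; only the three column names come from the description above.

```python
from datetime import datetime, timezone

# Column names taken from the PR description.
VERSION_COLUMNS = ["nextclade_version", "dataset_version", "run_timestamp"]


def annotate_rows(rows, nextclade_version, dataset_version, run_timestamp=None):
    """Return copies of `rows` (dicts) with the three version columns added.

    Hypothetical helper: only rows produced by the current nextclade run
    should be passed in, so cached rows keep their original values.
    """
    if run_timestamp is None:
        # Assumption: an ISO 8601 UTC timestamp is an acceptable format.
        run_timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    extra = dict(
        zip(VERSION_COLUMNS, [nextclade_version, dataset_version, run_timestamp])
    )
    # Build new dicts so the caller's rows are not mutated.
    return [{**row, **extra} for row in rows]
```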
Instead of adding these as columns to the nextclade.tsv, I wonder if we should just add version number and dataset timestamp to the filename? (Especially if we expect/want the data to all be generated from the same versions.)
Then, the pipeline can check for the file nextclade_<nextclade_version>_<dataset_version>.tsv. If the file does not exist, then the pipeline can just create an empty file and do a full run without cache.
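A minimal sketch of what this filename-based lookup might look like, assuming a local cache directory rather than S3. The function names and the returned `is_fresh` flag are hypothetical and only illustrate the suggestion.

```python
from pathlib import Path


def versioned_cache_path(cache_dir, nextclade_version, dataset_version):
    """Build the cache path following the naming scheme suggested above."""
    return Path(cache_dir) / f"nextclade_{nextclade_version}_{dataset_version}.tsv"


def open_or_create_cache(cache_dir, nextclade_version, dataset_version):
    """Return (path, is_fresh).

    is_fresh is True when no cache existed for this version pair, in which
    case an empty file is created and the caller should do a full run.
    """
    path = versioned_cache_path(cache_dir, nextclade_version, dataset_version)
    if path.exists():
        return path, False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()  # empty cache: every sequence will be rerun
    return path, True
```

With this scheme a version bump always invalidates the whole cache, which is the trade-off discussed in the replies below.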
Possible but not ideal. We don't really need to run for every nextclade version, nor do we usually need to run it for new 21L datasets (i.e. saving 50% of full runs). We may end up running for all nextclade versions just to avoid breaking things, but this is not a strict requirement. Checking that the versions are right is as easy as looking at the second line of the file, so not difficult. Extra space is minimal. One upside of encoding the versions in the file name would be that we have an easily accessible archive of previous versions, but that's already available through the aws s3 cli, so nothing major is gained.
Hmm, maybe I misunderstood. I thought the purpose of adding version info is to be able to automatically do full runs when there's a mismatch in the versions so we can avoid issues like #392.
There are three reasons:
This PR addresses 1 and 2 and makes a flexible implementation of 3 in the future easy. Rerun logic is currently as follows:
must:
may
unnecessary
Having nextclade/dataset version and timestamp in the tsv for every record means we are flexible as to when we want to rerun. Sure, we could append nextclade and dataset versions to the file path and pull caches dynamically from s3 after we know the version of nextclade and dataset. For 21L we could choose not to invalidate for new datasets by default, and only use the minor version. I guess both work. But we lose information in those cases if we do want to debug where something came from. A disadvantage of putting it in the file name is that there is no stable path for the latest version of the nextclade.tsv, unless we add touch files, but that's duplicating things. Does that make sense?
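To illustrate the flexibility of per-record versions, here is a hedged sketch of how stale records could be found. It is not part of this PR. The strict-equality policy, the function name `stale_records` and the `seqName` ID column are assumptions; a real policy could be looser, for example ignoring dataset changes for 21L.

```python
import csv
import io


def stale_records(tsv_text, nextclade_version, dataset_version, id_column="seqName"):
    """Return IDs of cached rows whose recorded versions differ from the current ones.

    Rows that lack the version columns (e.g. written before the columns
    existed) compare as None and are therefore treated as stale.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    stale = []
    for row in reader:
        if (
            row.get("nextclade_version") != nextclade_version
            or row.get("dataset_version") != dataset_version
        ):
            stale.append(row[id_column])
    return stale
```

Only the returned IDs would need to go through nextclade again, while the file path of the cache stays stable.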
I'm going to merge this later unless there are objections. We can always add or replace with file names if we prefer. I don't see a downside of this approach and it's tested.
In a second step (not contained in this PR) we can trigger reruns when there's a mismatch.
Testing
I tested this locally with the debug config and it worked well.
Note: this requires a manual rerun; otherwise tsv-utils won't be happy with differing column counts.
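The column-count concern can be checked before the file is handed to other tools. The following is only a sketch with a hypothetical helper name; it compares each row's tab-separated field count against the header and does not reproduce any tsv-utils behaviour.

```python
def inconsistent_rows(lines):
    """Return 1-based line numbers whose field count differs from the header.

    `lines` is any iterable of strings, e.g. an open nextclade.tsv file.
    """
    lines = [line.rstrip("\n") for line in lines]
    if not lines:
        return []
    expected = lines[0].count("\t") + 1  # number of header columns
    return [
        number
        for number, line in enumerate(lines[1:], start=2)
        if line.count("\t") + 1 != expected
    ]
```

An empty result means old and new rows agree on the number of columns; a non-empty one suggests the cache still holds rows written before the extra columns were added.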