Add Nextclade dataset version tracking file #390

Open · wants to merge 1 commit into base: master
Conversation

@corneliusroemer (Member) commented Feb 6, 2023

We currently don't track publicly which Nextclade dataset versions are used to produce the nextclade.tsv. This is a feature request for open covSpectrum; see GenSpectrum/cov-spectrum-website#670 (comment).

It is also useful for ourselves to know when we last did a full run.

This PR adds a file that contains a line for each dataset download, with the timestamp of the download and the dataset version downloaded.

Before a full run, this file needs to be wiped, so instructions are added to the README.

The format of the produced file should be considered unstable for now; this is made explicit in the README.

The implementation isn't perfect; it could be improved by syncing better with the rerun touchfile, but it should be good enough given limited resources.

This PR is unlikely to corrupt data. It could cause a crash, but that can be mitigated by a fix or revert, so no extensive testing should be necessary before merging.

Local testing should be sufficient and was successful.

@corneliusroemer corneliusroemer marked this pull request as ready for review February 6, 2023 15:40
@emmahodcroft (Member):

This seems like a good idea and will be useful for others too. Might be worth asking Tom for ideas if the current proposed system seems unstable, he often has good insights into this kind of stuff :)

@corneliusroemer (Member, Author):

Thanks @emmahodcroft! I have already added @tsibley to the list of reviewers :)

I don't think the current implementation is unstable. What I meant is that I don't want us to commit to this version format being considered "stable". We may experiment with it and change it, e.g. make it a TSV with headers. People using it should just not expect it to never change.

A better implementation is always welcome, but given this is not a super crucial feature, I'd be hesitant to sink too much time into it. The medium/long-term solution would be to make Nextclade output the version into the nextclade.tsv; then we wouldn't have to hack it in. I remember @rneher seemed open to this, so hopefully this code won't have to stay in there too long.

The advantage of being "unstable" is that we could simply retire it at any time and say: the version is now tracked in nextclade.tsv and metadata.tsv, and we no longer produce that version file.

@emmahodcroft (Member):

Ah ok, sorry, I misinterpreted - I thought you meant unstable in a more traditional sense. Seems fair to mark this as a little 'beta' and see how it works/get some ideas about other ways to do it and/or whether to put into the nextclade.tsv as you say!

@corneliusroemer (Member, Author):

My usage of unstable was inspired by Rust where they have nightly builds with lots of features that are not yet stable and things are moved to stable after a while of testing. It's different from classic alpha/beta/release in that individual features do not move together, one feature could be added to nightly in 1.61 and become stable in 1.69 or 1.65 or never.

@joverlee521 (Contributor) left a comment:

Thinking through the use of the new versions file, I had a couple of questions:

  1. Do we always do a full re-run of Nextclade when new datasets are released? If not, should we start doing that so that the metadata does not have a mixture of different dataset versions? Then we would only need to record when we do a full re-run in the versions file. With the versions file, we can automate the pipeline to do a full re-run if the dataset versions don't match.
  2. If we don't ensure all metadata records are generated from the same dataset, then do we need to report dataset versions per row? (Just writing that out made it seem like overkill, we should probably just focus on option [1].)
  3. How is this file expected to be used? For cov-spectrum, would they just assume the last line of the file contains the version for the latest metadata? Or would we expect them to do some form of timestamp matching?

We don't need to tackle all of these in this PR, but just wanted to get the full picture.

Comment on lines +149 to +156
params:
dst_version_file=config["s3_dst"] + "/version_{dataset_name}.txt",
src_version_file=config["s3_src"] + "/version_{dataset_name}.txt",
shell:
"""
./bin/download-from-s3 {params.dst_version_file} {output.version} 0 || \
./bin/download-from-s3 {params.src_version_file} {output.version} 0 || \
touch {output.version}
@joverlee521 (Contributor):

It would be best to move this to a separate rule so that we can continue to keep the main pipeline untangled from the S3 interactions. I think this file can follow our current pattern of having a create_empty_* and a download_*_from_s3 rule so that this pipeline can run without S3 interactions.
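A hedged sketch of that pattern (rule names, output paths, and the `download-from-s3` invocation here are illustrative, not the actual ncov-ingest rules; the real rules would follow the repo's existing conventions):

```snakemake
# Sketch: keep S3 interactions out of the main pipeline by pairing a
# local fallback rule with an S3 download rule for the same output.
rule create_empty_version_file:
    output:
        version="data/{dataset_name}/version.txt",
    shell:
        "touch {output.version}"

# Prefer the S3 download when S3 is configured/available.
ruleorder: download_version_from_s3 > create_empty_version_file

rule download_version_from_s3:
    params:
        dst_version_file=config["s3_dst"] + "/version_{dataset_name}.txt",
        src_version_file=config["s3_src"] + "/version_{dataset_name}.txt",
    output:
        version="data/{dataset_name}/version.txt",
    shell:
        """
        ./bin/download-from-s3 {params.dst_version_file} {output.version} 0 || \
        ./bin/download-from-s3 {params.src_version_file} {output.version} 0
        """
```

With both rules producing the same output file, the main Nextclade rules can simply depend on `version.txt` without knowing whether it came from S3 or was created empty.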

@corneliusroemer (Member, Author):

Good point, not something I've ever done so wasn't quite aware this was a guarantee we make.

@tsibley (Member):

More of maintaining a separation of concerns than a guarantee, I think. (Although there are a few external users of ncov-ingest who might notice if that separation breaks down.)


## Unstable files produced by workflow

- `version_sars-cov-2.txt` and `version_sars-cov-2-21L.txt`: used to track the version of the nextclade dataset used to generate the `nextclade.tsv` and `nextclade_21L.tsv` files. Format: `timestamp dataset_version` (e.g. `2023-02-06T14:40:23Z 2023-02-01T12:00:00Z`) for each run since and including the last full run.
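A minimal sketch of how a consumer might read the documented `timestamp dataset_version` format (file contents below are made-up sample values, not real data):

```shell
# Hypothetical sample file in the documented "timestamp dataset_version"
# format; the first line corresponds to the last full run.
printf '%s\n' \
  '2023-02-06T14:40:23Z 2023-02-01T12:00:00Z' \
  '2023-02-07T06:12:01Z 2023-02-01T12:00:00Z' > version_sars-cov-2.txt

# Since every line since the last full run should carry the same dataset
# version, a consumer can take the second field of any line, e.g. the first:
full_run_version=$(head -n 1 version_sars-cov-2.txt | cut -d ' ' -f 2)
echo "$full_run_version"   # prints 2023-02-01T12:00:00Z
```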
@joverlee521 (Contributor):

Am I understanding correctly that the first line of the versions file is the last full Nextclade run?

@corneliusroemer (Member, Author):

Yes, that's what it should be in practice!

@corneliusroemer (Member, Author):

Thanks for your comments and review @joverlee521!

Do we always do a full re-run of Nextclade when new datasets are released? If not, should we start doing that so that the metadata does not have a mixture of different dataset versions? Then we would only need to record when we do a full re-run in the versions file. With the versions file, we can automate the pipeline to do a full re-run if the dataset versions don't match.

For the 21L dataset we don't need to rerun, except if the calculation logic of the Bloom scores changes. We can rerun, but it's a bit of a waste.

I first thought about just using the most recent version, then realized that it could be helpful to track all the versions that contribute to the current cache - this could help debug, especially for 21L when something unexpected happens.

I agree it would be nice to do an automatic full run if there's a new version. This would be a good further PR - but isn't necessary for this use case (getting at least some readable version information out).

The automatic rerun should only apply to the base dataset for now, not 21L. However, given the 21L dataset is of minor importance, we could chuck out the version-related code there to keep the code simple. I don't think that version is all that useful after all; it's just 2 not-super-important columns in the metadata.tsv.

If we don't ensure all metadata records are generated from the same dataset, then do we need to report dataset versions per row? (Just writing that out made it seem like overkill, we should probably just focus on option [1].)

I think we may enable such an option in the nextclade.tsv. We already have many columns; adding a bit of redundant version info could be useful depending on the use case. We could add it: it doesn't do much harm, it compresses well, and it would keep the code easy/clean.

How is this file expected to be used? For cov-spectrum, would they just assume the last line of the file contains the version for the latest metadata? Or would we expect them to do some form of timestamp matching?
We don't need to tackle all of these in this PR, but just wanted to get the full picture.

Good question: since all lines should be the same, they can pick any line. It may not be correct when there's a bug, but given that the version will only be used to display metadata on the website, it's not crucial that it is 100% correct. If something unexpected happens, we can investigate.

@tsibley (Member) left a comment:

My overall comments are the same as @joverlee521's. :-) I concur that it'd be best to ensure that the Nextclade outputs are only ever from a single Nextclade dataset version and never a mixture. That doesn't have to be this PR (but if it was me, I'd probably just do it now rather than put it off).

A few more detailed comments below on some minor issues.

Comment on lines +138 to +141
for file in [ "nextclade.tsv.zst.renew", "version_sars-cov-2.txt" ]; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done
@tsibley (Member):

This is syntactically valid bash, but not what you intended. It'd produce four loop iterations where:

file="["
file="nextclade.tsv.zst.renew,"
file="version_sars-cov-2.txt"
file="]"

🙃

You want:

Suggested change
for file in [ "nextclade.tsv.zst.renew", "version_sars-cov-2.txt" ]; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done
for file in nextclade.tsv.zst.renew version_sars-cov-2.txt; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done

Comment on lines +149 to +152
for file in [ "nextclade_21L.tsv.zst.renew", "version_sars-cov-2-21L.txt" ]; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done
@tsibley (Member):

Ditto.

Comment on lines 158 to +160
./nextclade dataset get --name="{wildcards.dataset_name}" --output-zip={output.dataset} --verbose
printf %s "$(date --utc +%FT%TZ) " >> {output.version}
nextclade dataset list --name="{wildcards.dataset_name}" --json | jq -r '.[0].attributes.tag.value' >>{output.version}
@tsibley (Member):

Mixing of ./nextclade vs. nextclade. Is that a mismatch?

Comment on lines +159 to +160
printf %s "$(date --utc +%FT%TZ) " >> {output.version}
nextclade dataset list --name="{wildcards.dataset_name}" --json | jq -r '.[0].attributes.tag.value' >>{output.version}
@tsibley (Member) commented Feb 13, 2023:

There should be only one writer to this file, I think, but it'd still be more robust (and IMO clearer) to append the whole line at once rather than in two separate operations.
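A minimal sketch of that single-append variant. The dataset tag below is a stubbed placeholder standing in for the real `nextclade dataset list … | jq -r '.[0].attributes.tag.value'` lookup, so the example is self-contained:

```shell
# Stub standing in for the output of:
#   nextclade dataset list --name="..." --json | jq -r '.[0].attributes.tag.value'
version="2023-02-01T12:00:00Z"

# Append timestamp and dataset version as one whole line in a single write,
# rather than two separate appends:
printf '%s %s\n' "$(date --utc +%FT%TZ)" "$version" >> version.txt
```

Building the full line before appending avoids ever leaving a half-written line in the file if the second command fails between the two writes.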
