Add Nextclade dataset version tracking file #390

Open · wants to merge 1 commit into base: master
Conversation

@corneliusroemer (Member) commented Feb 6, 2023

We currently don't track publicly which Nextclade dataset versions are used to produce the nextclade.tsv. This is a feature request for open covSpectrum; see GenSpectrum/cov-spectrum-website#670 (comment).

It is also useful for ourselves to know when we last did a full run.

This PR adds a file that contains a line for each dataset download, with the timestamp of the download and the dataset version downloaded.

Before a full run, this file needs to be wiped, so instructions are added to the README.

The format of the produced file should be considered unstable for now; this is made explicit in the README.

The implementation isn't perfect; it could be improved by syncing better with the rerun touchfile, but it should be good enough given limited resources.

This PR is unlikely to corrupt data. It could cause a crash, but that can be mitigated by a fix or revert, so no extensive testing should be necessary before merging.

Local testing should be sufficient and was successful.

@corneliusroemer corneliusroemer marked this pull request as ready for review February 6, 2023 15:40
@emmahodcroft (Member):

This seems like a good idea and will be useful for others too. Might be worth asking Tom for ideas if the current proposed system seems unstable, he often has good insights into this kind of stuff :)

@corneliusroemer (Member, Author):

Thanks @emmahodcroft! I have already added @tsibley to the list of reviewers :)

I don't think the current implementation is unstable. What I meant is that I don't want us to commit to this version format being considered "stable". We may experiment with it and change it, e.g. make it a TSV with headers. People using it should just not expect it to never change.

A better implementation is always welcome, but given this is not a super crucial feature, I'd be hesitant to sink too much time into it. The medium/long-term solution would be to make Nextclade output the version into the nextclade.tsv; then we wouldn't have to hack it in. I remember @rneher seemed open to this, so hopefully this code won't have to stay in there too long.

The advantage of being "unstable" is that we could simply retire it at any time and say: the version is now tracked in nextclade.tsv and metadata.tsv, and we no longer produce that version file.

@emmahodcroft (Member):

Ah ok, sorry, I misinterpreted - I thought you meant unstable in a more traditional sense. Seems fair to mark this as a little 'beta' and see how it works/get some ideas about other ways to do it and/or whether to put into the nextclade.tsv as you say!

@corneliusroemer (Member, Author):

My usage of unstable was inspired by Rust where they have nightly builds with lots of features that are not yet stable and things are moved to stable after a while of testing. It's different from classic alpha/beta/release in that individual features do not move together, one feature could be added to nightly in 1.61 and become stable in 1.69 or 1.65 or never.

@joverlee521 (Contributor) left a comment:

Thinking through the use of the new versions file, I had a couple of questions:

  1. Do we always do a full re-run of Nextclade when new datasets are released? If not, should we start doing that so that the metadata does not have a mixture of different dataset versions? Then we would only need to record when we do a full re-run in the versions file. With the versions file, we can automate the pipeline to do a full re-run if the dataset versions don't match.
  2. If we don't ensure all metadata records are generated from the same dataset, then do we need to report dataset versions per row? (Just writing that out made it seem like overkill, we should probably just focus on option [1].)
  3. How is this file expected to be used? For cov-spectrum, would they just assume the last line of the file contains the version for the latest metadata? Or would we expect them to do some form of timestamp matching?

We don't need to tackle all of these in this PR, but just wanted to get the full picture.

Comment on lines +149 to +156
params:
dst_version_file=config["s3_dst"] + "/version_{dataset_name}.txt",
src_version_file=config["s3_src"] + "/version_{dataset_name}.txt",
shell:
"""
./bin/download-from-s3 {params.dst_version_file} {output.version} 0 || \
./bin/download-from-s3 {params.src_version_file} {output.version} 0 || \
touch {output.version}
@joverlee521 (Contributor):

It would be best to move this to a separate rule so that we can continue to keep the main pipeline untangled from the S3 interactions. I think this file can follow our current pattern of having a create_empty_* and a download_*_from_s3 rule so that this pipeline can run without S3 interactions.
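A hedged sketch of that pattern (rule names, output paths, and the `download-from-s3` invocation here are illustrative, not the actual ncov-ingest rules; the real rules would follow the repo's existing conventions):

```snakemake
# Sketch: keep S3 interactions out of the main pipeline by pairing a
# local fallback rule with an S3 download rule for the same output.
rule create_empty_version_file:
    output:
        version="data/{dataset_name}/version.txt",
    shell:
        "touch {output.version}"

# Prefer the S3 download when S3 is configured/available.
ruleorder: download_version_from_s3 > create_empty_version_file

rule download_version_from_s3:
    params:
        dst_version_file=config["s3_dst"] + "/version_{dataset_name}.txt",
        src_version_file=config["s3_src"] + "/version_{dataset_name}.txt",
    output:
        version="data/{dataset_name}/version.txt",
    shell:
        """
        ./bin/download-from-s3 {params.dst_version_file} {output.version} 0 || \
        ./bin/download-from-s3 {params.src_version_file} {output.version} 0
        """
```

With both rules producing the same output file, the main Nextclade rules can simply depend on `version.txt` without knowing whether it came from S3 or was created empty.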

@corneliusroemer (Member, Author):

Good point, not something I've ever done so wasn't quite aware this was a guarantee we make.

@tsibley (Member):

More of maintaining a separation of concerns than a guarantee, I think. (Although there are a few external users of ncov-ingest who might notice if that separation breaks down.)


## Unstable files produced by workflow

- `version_sars-cov-2.txt` and `version_sars-cov-2-21L.txt`: used to track the version of the nextclade dataset used to generate the `nextclade.tsv` and `nextclade_21L.tsv` files. Format: `timestamp dataset_version` (e.g. `2023-02-06T14:40:23Z 2023-02-01T12:00:00Z`) for each run since and including the last full run.
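A minimal sketch of how a consumer might read the documented `timestamp dataset_version` format (file contents below are made-up sample values, not real data):

```shell
# Hypothetical sample file in the documented "timestamp dataset_version"
# format; the first line corresponds to the last full run.
printf '%s\n' \
  '2023-02-06T14:40:23Z 2023-02-01T12:00:00Z' \
  '2023-02-07T06:12:01Z 2023-02-01T12:00:00Z' > version_sars-cov-2.txt

# Since every line since the last full run should carry the same dataset
# version, a consumer can take the second field of any line, e.g. the first:
full_run_version=$(head -n 1 version_sars-cov-2.txt | cut -d ' ' -f 2)
echo "$full_run_version"   # prints 2023-02-01T12:00:00Z
```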
@joverlee521 (Contributor):

Am I understanding correctly that the first line of the versions file is the last full Nextclade run?

@corneliusroemer (Member, Author):

Yes, that's what it should be in practice!

@corneliusroemer (Member, Author):

Thanks for your comments and review @joverlee521!

Do we always do a full re-run of Nextclade when new datasets are released? If not, should we start doing that so that the metadata does not have a mixture of different dataset versions? Then we would only need to record when we do a full re-run in the versions file. With the versions file, we can automate the pipeline to do a full re-run if the dataset versions don't match.

For the 21L dataset we don't need to rerun, except if the calculation logic of the Bloom scores changes. We can rerun, but it's a bit of a waste.

I first thought about just using the most recent version, then realized that it could be helpful to track all the versions that contribute to the current cache - this could help debug, especially for 21L when something unexpected happens.

I agree it would be nice to do an automatic full run if there's a new version. This would be a good further PR - but isn't necessary for this use case (getting at least some readable version information out).

The automatic rerun should only apply to the base dataset for now, not 21L. However, given the 21L dataset is of minor importance, we could chuck out the version-related code there to keep the code simple. I don't think that version is all that useful after all; it's just 2 not-super-important columns in the metadata.tsv.

If we don't ensure all metadata records are generated from the same dataset, then do we need to report dataset versions per row? (Just writing that out made it seem like overkill, we should probably just focus on option [1].)

I think we may enable such an option in the nextclade.tsv. We already have many columns; adding a bit of redundant version info could be useful depending on the use case. We could add it: it doesn't do much harm, it compresses well, and it would keep the code easy/clean.

How is this file expected to be used? For cov-spectrum, would they just assume the last line of the file contains the version for the latest metadata? Or would we expect them to do some form of timestamp matching?
We don't need to tackle all of these in this PR, but just wanted to get the full picture.

Good question: since all lines should be the same, they can pick any line. It may not be correct when there's a bug, but given that the version will only be used to display metadata on the website, it's not crucial that it is 100% correct. If something unexpected happens, we can investigate.

@tsibley (Member) left a comment:

My overall comments are the same as @joverlee521's. :-) I concur that it'd be best to ensure that the Nextclade outputs are only ever from a single Nextclade dataset version and never a mixture. That doesn't have to be this PR (but if it was me, I'd probably just do it now rather than put it off).

A few more detailed comments below on some minor issues.

Comment on lines +138 to +141
for file in [ "nextclade.tsv.zst.renew", "version_sars-cov-2.txt" ]; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done
@tsibley (Member):

This is syntactically valid bash, but not what you intended. It'd produce four loop iterations where:

file="["
file="nextclade.tsv.zst.renew,"
file="version_sars-cov-2.txt"
file="]"

🙃

You want:

Suggested change
for file in [ "nextclade.tsv.zst.renew", "version_sars-cov-2.txt" ]; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done
for file in nextclade.tsv.zst.renew version_sars-cov-2.txt; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done

Comment on lines +149 to +152
for file in [ "nextclade_21L.tsv.zst.renew", "version_sars-cov-2-21L.txt" ]; do
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done
@tsibley (Member):

Ditto.

Comment on lines 158 to +160
./nextclade dataset get --name="{wildcards.dataset_name}" --output-zip={output.dataset} --verbose
printf %s "$(date --utc +%FT%TZ) " >> {output.version}
nextclade dataset list --name="{wildcards.dataset_name}" --json | jq -r '.[0].attributes.tag.value' >>{output.version}
@tsibley (Member):

Mixing of ./nextclade vs. nextclade. Is that a mismatch?

Comment on lines +159 to +160
printf %s "$(date --utc +%FT%TZ) " >> {output.version}
nextclade dataset list --name="{wildcards.dataset_name}" --json | jq -r '.[0].attributes.tag.value' >>{output.version}
@tsibley (Member) commented Feb 13, 2023:

There should be only one writer to this file, I think, but it'd still be more robust (and IMO clearer) to append the whole line at once rather than in two separate operations.
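A minimal sketch of that single-append variant. The dataset tag below is a stubbed placeholder standing in for the real `nextclade dataset list … | jq -r '.[0].attributes.tag.value'` lookup, so the example is self-contained:

```shell
# Stub standing in for the output of:
#   nextclade dataset list --name="..." --json | jq -r '.[0].attributes.tag.value'
version="2023-02-01T12:00:00Z"

# Append timestamp and dataset version as one whole line in a single write,
# rather than two separate appends:
printf '%s %s\n' "$(date --utc +%FT%TZ)" "$version" >> version.txt
```

Building the full line before appending avoids ever leaving a half-written line in the file if the second command fails between the two writes.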
