Surface Nextclade versions #467

Merged: 1 commit, Jul 31, 2024

Contributor

@joverlee521 joverlee521 commented Jul 27, 2024

Description of proposed changes

Creates one version JSON for each Nextclade TSV and one version JSON for the metadata TSV. Since the metadata uses the Nextclade TSV columns directly, the metadata version JSON is just the SARS-CoV-2 dataset version JSON with `metadata_tsv_sha256sum` added. If we ever want to track data provenance by column, we will update the schema to include the 21L dataset version.
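
As a rough illustration of the scheme described above (not the code in this PR), the sketch below hashes a TSV and builds the metadata version JSON by copying the SARS-CoV-2 dataset version JSON and adding `metadata_tsv_sha256sum`. The helper names `sha256sum` and `build_metadata_version` are hypothetical, and the contents of the dataset version JSON are left opaque because the schema is not spelled out here.

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata_version(nextclade_version: dict, metadata_tsv: Path) -> dict:
    """Copy the dataset version info and add the metadata TSV checksum.

    Hypothetical helper: only the `metadata_tsv_sha256sum` key comes from
    the PR description; all other keys are passed through unchanged.
    """
    return {**nextclade_version, "metadata_tsv_sha256sum": sha256sum(metadata_tsv)}
```

The input dict is copied rather than mutated, so the same Nextclade version JSON can still be used unchanged for the cache check.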

The two Nextclade version JSONs will be used to check whether the workflow should use the existing cache. The metadata version JSON will be used to surface the version info to downstream users of the data.
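
A minimal sketch of what such a cache check could look like, assuming the version JSON stored alongside the cache is compared with the versions the current run would use. The function name and the keys `nextclade_version` and `nextclade_dataset_version` are assumptions for illustration, not the schema or logic from this PR.

```python
from typing import Optional

# Hypothetical keys; the PR description does not list the version JSON fields.
COMPARED_KEYS = ("nextclade_version", "nextclade_dataset_version")


def should_use_cache(cached: Optional[dict], current: dict) -> bool:
    """Reuse cached Nextclade results only if they were produced by the
    same versions that the current run would use."""
    if cached is None:
        # No version JSON for the cache, so its provenance is unknown.
        return False
    return all(
        key in cached and cached[key] == current.get(key)
        for key in COMPARED_KEYS
    )
```

A missing key in the cached JSON is treated as a mismatch, which errs on the side of rerunning Nextclade rather than reusing results of unknown origin.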

Related issue(s)

Depends on #466
Resolves #458

Checklist

  • Checks pass

@joverlee521 joverlee521 mentioned this pull request Jul 27, 2024
1 task
Comment on lines +36 to +38
"nextclade_version.json": f"data/{database}/nextclade_version.json",
"nextclade_21L_version.json": f"data/{database}/nextclade_21L_version.json",
"metadata_version.json": f"data/{database}/metadata_version.json",
Contributor Author

Open to different S3 file names for the version JSONs.

Base automatically changed from ignore-cache to master July 29, 2024 17:40
@joverlee521 joverlee521 marked this pull request as ready for review July 30, 2024 21:15
@joverlee521 joverlee521 requested a review from huddlej July 30, 2024 21:16
Contributor

@huddlej huddlej left a comment

Thank you, @joverlee521! I like this approach of building separate version JSONs per file and merging the main Nextclade version JSON into the metadata version JSON.

I don't have strong feelings about the names, either. These names seem fine. Some day we'll have nextclade_25A_version.json or more than one additional file, but this approach is flexible enough to support that.
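
To illustrate that flexibility, here is a small sketch that generates the same name-to-path entries as the reviewed diff from a list of extra dataset labels. The function `version_json_paths` and its parameters are hypothetical, and `"genbank"` in the usage note is only an example value for `database`.

```python
def version_json_paths(database: str, extra_datasets: tuple = ("21L",)) -> dict:
    """Build the version JSON entries in the same shape as the diff:
    file name -> f"data/{database}/<file name>".

    Hypothetical helper; the PR lists these entries explicitly.
    """
    names = ["nextclade_version.json"]
    # One extra version JSON per additional Nextclade dataset.
    names += [f"nextclade_{dataset}_version.json" for dataset in extra_datasets]
    names.append("metadata_version.json")
    return {name: f"data/{database}/{name}" for name in names}
```

Calling `version_json_paths("genbank", ("21L", "25A"))` would add a `nextclade_25A_version.json` entry without any other change.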

Assuming you tested locally, we could merge and see how it works on the next run?

@joverlee521
Contributor Author

> Assuming you tested locally, we could merge and see how it works on the next run?

Yup! I'll plan to merge tomorrow morning and monitor the automated runs.

@joverlee521 joverlee521 merged commit f38bf5d into master Jul 31, 2024
1 check passed
@joverlee521 joverlee521 deleted the surface-nextclade-versions branch July 31, 2024 17:13
@joverlee521
Contributor Author

The public metadata version file is available at https://data.nextstrain.org/files/ncov/open/metadata_version.json

Successfully merging this pull request may close these issues.

Surface Nextclade version + Nextclade dataset version in final metadata output