Skip to content

Fix handling of missing terms-of-use metadata in sample set manifest (#766)#990

Merged
jonbrenas merged 8 commits intomalariagen:masterfrom
adilraza99:gh-766-handle-missing-terms-of-use
Mar 2, 2026
Merged

Fix handling of missing terms-of-use metadata in sample set manifest (#766)#990
jonbrenas merged 8 commits intomalariagen:masterfrom
adilraza99:gh-766-handle-missing-terms-of-use

Conversation

@adilraza99
Copy link
Copy Markdown
Contributor

@adilraza99 adilraza99 commented Feb 28, 2026

Summary

Fix handling of sample set manifests that do not include terms-of-use metadata.

Some pre-release and legacy manifests legitimately omit the following fields:

  • terms_of_use_expiry_date
  • terms_of_use_url
  • unrestricted_use

The current implementation assumes these columns are always present, which results in a KeyError and prevents sample_metadata() from loading.


Changes

  • guard against missing unrestricted_use in _sample_set_has_unrestricted_use()
  • make lookup_terms_of_use_info() tolerant of missing columns
  • populate missing fields with null values to keep output consistent
  • add tests covering manifests without terms-of-use metadata

Why this change

Pre-release datasets may not yet have terms-of-use information.
Handling this case explicitly allows metadata to load successfully while making it clear that usage status is unknown.


Additional context

This update keeps behaviour unchanged for public releases and ensures
metadata outputs remain consistent across different release stages.


fixes #766

@adilraza99
Copy link
Copy Markdown
Contributor Author

adilraza99 commented Feb 28, 2026

Thanks again for clarifying the expected behaviour.

This update ensures metadata loading remains robust when terms-of-use columns are absent in legacy or pre-release
manifests, while preserving the existing structure and behaviour for releases where the fields are present.

I’ve kept the change minimal and added tests to cover the missing-column scenarios.

Happy to refine further if needed.

@jonbrenas

@adilraza99
Copy link
Copy Markdown
Contributor Author

All checks have passed.

@jonbrenas please take a look when you have time.
Let me know if anything should be adjusted.

@jonbrenas
Copy link
Copy Markdown
Collaborator

Sorry @adilraza99, I thought about this one a bit more and we had already decided to go a different route: the error needs to stay because the data needs to exist for traceability. For the record, a sample set that has not been released yet should have:

  • '2099-12-31' as terms_of_use_expiry_date
  • NaN as terms_of_use_url
  • False as unrestricted_use

@adilraza99
Copy link
Copy Markdown
Contributor Author

Understood - keeping the error preserves traceability when the metadata is required.

I’ll update the implementation so unreleased sample sets use:

• terms_of_use_expiry_date = "2099-12-31"
• terms_of_use_url = NaN
• unrestricted_use = False

This keeps the metadata explicit and aligns with the intended behaviour.

@jonbrenas

@adilraza99 adilraza99 force-pushed the gh-766-handle-missing-terms-of-use branch from 8a40254 to ea15930 Compare March 1, 2026 15:21
@adilraza99
Copy link
Copy Markdown
Contributor Author

Thanks @jonbrenas - that clarification helped.

I tested this locally with the Ag3 simulator to confirm the behaviour.

When the terms-of-use columns are missing, the lookup returns the placeholder values (2099-12-31 / NaN / False), _sample_set_has_unrestricted_use() returns False, and sample_sets() loads normally.

This matches the expected handling for unreleased sample sets.

@adilraza99
Copy link
Copy Markdown
Contributor Author

@jonbrenas

@jonbrenas
Copy link
Copy Markdown
Collaborator

Sorry, @adilraza99, I was not quite clear with my explanation.

What I mean is that if the API is asked to access these 3 columns and one or more of them is missing, it is a problem with the data, not with the API. An error should be raised because someone (most likely me) screwed the pooch upstream and this data should never have been released this way. #766 is thus not an issue for the API, and the code should not be modified to address it.

@adilraza99 adilraza99 force-pushed the gh-766-handle-missing-terms-of-use branch from 5abe0f4 to 202476e Compare March 1, 2026 22:53
@adilraza99
Copy link
Copy Markdown
Contributor Author

@jonbrenas
That explanation cleared things up.

I’ve updated the implementation so that missing terms-of-use columns now raise a clear error instead of being silently filled. This ensures incomplete metadata is surfaced as a data integrity issue while preserving traceability and keeping the API behaviour explicit.

Tests have been updated to reflect the strict validation behaviour.

Please let me know if this aligns with the intended design, or if you’d like any adjustments.

@adilraza99
Copy link
Copy Markdown
Contributor Author

@jonbrenas could you take a look at the changes when you have a moment?

@jonbrenas jonbrenas merged commit 0e43cc0 into malariagen:master Mar 2, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sample_metadata and lookup_terms_of_use_info does not work when there is no terms-of-use info

2 participants