Feature/source extra fields #42

Merged
caufieldjh merged 21 commits into linkml:main from Reasat:feature/source-extra-fields on Mar 16, 2026

Reasat commented Mar 8, 2026

Summary

Allows users to capture additional fields from reference API responses via config. JSONPath expressions are evaluated per source; extracted values are appended to reference content (with ### field_name headings) and are included in validation. Captured field names are stored in metadata and in the cache frontmatter.

Changes

  • Config: ReferenceValidationConfig.source_extra_fields is a per-source map of field name → JSONPath. The docstring advises paths that resolve to a single value; objects and arrays are stringified.
  • Utils: extract_extra_fields(data, field_map) returns dict[str, str]; format_extra_fields_for_content(extra) formats for appending to content.
  • Sources: ClinicalTrials, PMID, Entrez (base + GEO), DOI (Crossref + DataCite) read source_extra_fields, extract from raw response, append to content, set metadata["extra_fields_captured"].
  • Cache: _save_to_disk writes extra_fields_captured to YAML frontmatter; _load_markdown_format restores it into ReferenceContent.metadata.
  • README: New source_extra_fields subsection under Configuration with example and commands.
  • Tests: extract_extra_fields / format_extra_fields_for_content, ClinicalTrials fetch with extra fields, save/load of extra_fields_captured.
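
As a rough illustration of the two helpers listed under Utils, the sketch below shows one way they could behave. It is not the code from this PR: the real helpers evaluate JSONPath, whereas `resolve_dotted_path` here is a hypothetical stand-in that only follows plain `$.a.b.c` paths, and using `json.dumps` to stringify objects and arrays is an assumption. The sample record is made up, loosely shaped like a ClinicalTrials.gov study payload.

```python
import json
from typing import Any


def resolve_dotted_path(data: Any, path: str) -> Any:
    """Hypothetical stand-in for a JSONPath engine.

    Follows plain '$.a.b.c' paths only (no wildcards, filters, or indexes)
    and returns None when any key along the way is missing.
    """
    current = data
    for key in path.removeprefix("$.").split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_extra_fields(data: dict, field_map: dict[str, str]) -> dict[str, str]:
    """Map each configured field name to a string value; skip unresolved paths."""
    extra: dict[str, str] = {}
    for name, path in field_map.items():
        value = resolve_dotted_path(data, path)
        if value is None:
            continue
        # Assumption: non-string values (objects, arrays, numbers) become JSON text.
        extra[name] = value if isinstance(value, str) else json.dumps(value)
    return extra


def format_extra_fields_for_content(extra: dict[str, str]) -> str:
    """Render each field as a '### field_name' section for appending to content."""
    return "\n\n".join(f"### {name}\n\n{text}" for name, text in extra.items())


# Made-up record, loosely shaped like a ClinicalTrials.gov study payload.
record = {
    "protocolSection": {
        "eligibilityModule": {
            "eligibilityCriteria": "Inclusion: age 18 and older",
            "stdAges": ["ADULT", "OLDER_ADULT"],
        }
    }
}
field_map = {
    "eligibility": "$.protocolSection.eligibilityModule.eligibilityCriteria",
    "ages": "$.protocolSection.eligibilityModule.stdAges",
    "missing": "$.protocolSection.noSuchModule.value",
}
extra = extract_extra_fields(record, field_map)
print(format_extra_fields_for_content(extra))
```

The `ages` entry shows why single-value paths are preferable: a list ends up in the content as raw JSON text, which is harder to match against quoted supporting text.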

How to test

  1. Create my-config.yaml with validation.source_extra_fields (e.g. clinicaltrials eligibility JSONPath).
  2. Run: uv run linkml-reference-validator cache reference clinicaltrials:NCT00001372 --config my-config.yaml
  3. Check cache file in references_cache/ for ### eligibility (or your field) and extra_fields_captured in frontmatter.
  4. Run: uv run linkml-reference-validator validate text "Inclusion: age" clinicaltrials:NCT00001372 --config my-config.yaml (should pass if the quoted text appears in the appended extra-field section).
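
The manual check in step 3 could be scripted roughly as follows. This is a sketch under assumptions, not part of the PR: the helper names are hypothetical, the frontmatter is assumed to store extra_fields_captured as a block-style YAML list, and the sample file contents (including the reference_id key and the eligibility text) are invented for illustration.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a markdown file into (frontmatter, body); frontmatter is '' if absent."""
    if not text.startswith("---\n"):
        return "", text
    end = text.find("\n---", 3)
    if end == -1:
        return "", text
    return text[4:end], text[end + 4:].lstrip("\n")


def captured_fields(frontmatter: str) -> list[str]:
    """Read the items listed under extra_fields_captured (assumed block-list layout)."""
    fields: list[str] = []
    in_list = False
    for line in frontmatter.splitlines():
        if line.startswith("extra_fields_captured:"):
            in_list = True
        elif in_list and line.lstrip().startswith("- "):
            fields.append(line.lstrip()[2:].strip())
        elif in_list:
            break  # the list ended at the next key
    return fields


def cache_file_has_field(text: str, field: str) -> bool:
    """True when the field has a '### field' heading and is recorded in frontmatter."""
    frontmatter, body = split_frontmatter(text)
    has_heading = any(line.strip() == f"### {field}" for line in body.splitlines())
    return has_heading and field in captured_fields(frontmatter)


# Invented example of what a cached file might look like.
SAMPLE = """\
---
reference_id: clinicaltrials:NCT00001372
extra_fields_captured:
- eligibility
---

Study summary text.

### eligibility

Inclusion: age 18 and older
"""

print(cache_file_has_field(SAMPLE, "eligibility"))
```

To run it against a real cache entry, read the file under references_cache/ into a string and pass it in place of `SAMPLE`.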

Fixes #39

Reasat added 17 commits March 4, 2026 21:30
…the formatting function that does dict -> str
…ig should point to a string/number and not an object; recursive parsing of objects is not done; something to consider for future

Copilot AI left a comment

Pull request overview

Adds a configurable mechanism to extract additional per-source fields from raw reference API responses (via JSONPath), append them to the cached reference content as markdown sections, and persist which fields were captured in cache frontmatter/metadata.

Changes:

  • Introduces ReferenceValidationConfig.source_extra_fields (per-source field name → JSONPath).
  • Adds shared helpers to extract/format extra fields and integrates them into multiple sources (ClinicalTrials, PMID, Entrez, DOI).
  • Persists extra_fields_captured to cache frontmatter and restores it on load; adds/updates tests and docs.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/linkml_reference_validator/models.py | Adds source_extra_fields config field and documentation. |
| src/linkml_reference_validator/etl/sources/utils.py | New helpers to extract extra fields with JSONPath and format them for appending to content. |
| src/linkml_reference_validator/etl/sources/clinicaltrials.py | Extracts configured extra fields from raw CT.gov response and appends to content/metadata. |
| src/linkml_reference_validator/etl/sources/pmid.py | Extracts configured extra fields from Entrez record and appends to content/metadata. |
| src/linkml_reference_validator/etl/sources/entrez.py | Extracts configured extra fields for Entrez summary sources (incl. GEO) and appends to content/metadata. |
| src/linkml_reference_validator/etl/sources/doi.py | Extracts configured extra fields from Crossref/DataCite payloads and appends to abstract/metadata. |
| src/linkml_reference_validator/etl/reference_fetcher.py | Saves extra_fields_captured into YAML frontmatter and restores it into ReferenceContent.metadata. |
| tests/test_source_utils.py | New unit tests for extraction/formatting helpers. |
| tests/test_sources.py | Adds ClinicalTrials integration test for source_extra_fields; minor formatting changes. |
| tests/test_reference_fetcher.py | Adds save/load test for extra_fields_captured; minor formatting changes. |
| README.md | Documents source_extra_fields configuration and usage. |
| .gitignore | Ignores .cursor directory. |

Comment thread README.md Outdated
Comment thread src/linkml_reference_validator/etl/sources/entrez.py
Comment thread src/linkml_reference_validator/etl/sources/entrez.py
Comment thread src/linkml_reference_validator/etl/sources/pmid.py
Comment thread src/linkml_reference_validator/etl/reference_fetcher.py Outdated
Comment thread src/linkml_reference_validator/models.py Outdated
Reasat added 4 commits March 8, 2026 12:20
…ls.py, specifying handling of lists in JSONPath expressions.
…quested which has text, content_type now is set to "summary" for this case instead of "unavailable"
…ent is empty but extra fields produce content in clinical trials source.
… fields are strings with no weird structure that would break the YAML formatting when writing to disk

caufieldjh left a comment

Looks good - thanks @Reasat!

@caufieldjh caufieldjh merged commit 38c1513 into linkml:main Mar 16, 2026
10 of 11 checks passed

Development

Successfully merging this pull request may close these issues.

Cache additional metadata from clinicaltrials.gov entries
