Documentation request: where does the input TSV for data_loader come from?

We're trying to figure out how to refresh the database that backs `DocumentMetadataAPI`, and we got stuck on the input format expected by `data_loader.process_file` / `data_loader.lambda_handler`.

The loader reads a tab-separated file with (at least) ten columns in this exact order:

```
document_id  pub_year  pub_month  pub_day  journal_name  journal_abbrev  volume  issue  article_title  abstract
```

…with `-` used as a sentinel for a missing `pub_day`, and `document_id` values prefixed with `PMID:`. We were hoping this was a standard PubMed bulk-export format we could regenerate from scratch, but we could not find any such format:

- PubMed's bulk distribution is **MEDLINE/PubMed XML** ([data elements](https://www.nlm.nih.gov/bsd/licensee/data_elements_doc.html), [data provider help](https://www.ncbi.nlm.nih.gov/books/NBK3828/)), not TSV. EFetch for PubMed likewise offers XML and plain-text formats such as MEDLINE, but nothing tabular.
- A GitHub-wide code search for the exact column list (`pub_year pub_month pub_day journal_name journal_abbrev volume issue article_title abstract`) returned zero matches outside this codebase.
- None of the related upstream repos appear to produce it:
  - [`UCDenver-ccp/Translator-TM-Provider-Pipelines`](https://github.com/UCDenver-ccp/Translator-TM-Provider-Pipelines) — `MedlineXmlToTextPipeline` emits plain text + BioNLP section annotations into Cloud Datastore, not bibliographic metadata TSV.
  - [`edgargaticaCU/biorxiv-aws`](https://github.com/edgargaticaCU/biorxiv-aws) — only pulls XML out of `.meca` archives; no parsing.
  - [`edgargaticaCU/semmed`](https://github.com/edgargaticaCU/semmed), [`kgx-export`](https://github.com/edgargaticaCU/kgx-export), [`Translator-TM-Provider-Evidence`](https://github.com/edgargaticaCU/Translator-TM-Provider-Evidence), [`JATS2LaTeX`](https://github.com/edgargaticaCU/JATS2LaTeX) — also not metadata-extraction tools.

Our tentative conclusion is that the TSV is produced by an internal/private job (probably whatever uploads to the GCS bucket the Lambda variant reads from), and isn't documented or open-sourced anywhere. We've written up what we figured out in a downstream README update here: https://github.com/ncats/DocumentMetadataAPI/pull/23
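To make our reading of the column contract concrete, here is a minimal sketch of how we currently understand a single row. It is not the loader's code: `COLUMNS` and `parse_row` are hypothetical names, and the handling of short rows and extra columns is our assumption, which is part of what we would like confirmed.

```python
from typing import Dict, Optional

# Column order as listed in this issue (our reading, not a confirmed spec).
COLUMNS = [
    "document_id", "pub_year", "pub_month", "pub_day", "journal_name",
    "journal_abbrev", "volume", "issue", "article_title", "abstract",
]


def parse_row(line: str) -> Dict[str, Optional[str]]:
    """Split one TSV line into a dict keyed by the ten expected columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(COLUMNS):
        # Assumption: rows with fewer than ten columns are invalid.
        raise ValueError(f"expected at least {len(COLUMNS)} columns, got {len(fields)}")
    # Assumption: columns beyond the tenth are ignored (zip stops at COLUMNS).
    record: Dict[str, Optional[str]] = dict(zip(COLUMNS, fields))
    # "-" marks a missing publication day.
    if record["pub_day"] == "-":
        record["pub_day"] = None
    return record


example = "PMID:123\t2020\t1\t-\tJournal of Tests\tJ Tests\t5\t2\tA title\tAn abstract.\n"
row = parse_row(example)
```

If the real loader treats short rows, extra columns, or empty strings differently, that difference is exactly the detail we are hoping to see documented.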

It would be really helpful to have, in this repo's README or a `CONTRIBUTING.md`:

1. A pointer to whatever script/pipeline produces the TSV (even if it's a private repo or an internal job — just a name).
2. Confirmation of the column contract: column order, the `-` sentinel for missing `pub_day`, whether `len > 1` empty-string handling is intentional, and what happens if more than 10 columns are present.
3. Guidance for someone who wants to regenerate the database from a fresh PubMed XML download (baseline + updates).
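For item 3, this is roughly the kind of converter we would write if no existing tool turns up. It is a sketch under assumptions: the element paths follow the public MEDLINE/PubMed XML layout, but the mapping onto the ten columns, the `PMID:` prefixing, and the `-` fallback are our guesses at the contract, and `text_of`, `article_to_row`, and `xml_to_tsv_lines` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import List


def text_of(elem: ET.Element, path: str) -> str:
    """Return the flattened text of the first match for path, or ''."""
    node = elem.find(path)
    return "".join(node.itertext()).strip() if node is not None else ""


def article_to_row(article: ET.Element) -> str:
    """Map one <PubmedArticle> onto the ten TSV columns (assumed mapping)."""
    cit = article.find("MedlineCitation")
    art = cit.find("Article")
    pub_date = "Journal/JournalIssue/PubDate/"
    # Structured abstracts have several <AbstractText> parts; join them.
    abstract = " ".join(
        "".join(n.itertext()).strip() for n in art.findall("Abstract/AbstractText")
    )
    fields = [
        "PMID:" + text_of(cit, "PMID"),
        # Note: some records carry <MedlineDate> instead of Year/Month/Day;
        # this sketch leaves those date columns empty.
        text_of(art, pub_date + "Year"),
        text_of(art, pub_date + "Month"),  # often "Jan"; expected form unknown
        text_of(art, pub_date + "Day") or "-",
        text_of(art, "Journal/Title"),
        text_of(art, "Journal/ISOAbbreviation"),
        text_of(art, "Journal/JournalIssue/Volume"),
        text_of(art, "Journal/JournalIssue/Issue"),
        text_of(art, "ArticleTitle"),
        abstract,
    ]
    # Keep each record on one line with exactly ten tab-separated fields.
    return "\t".join(f.replace("\t", " ").replace("\n", " ") for f in fields)


def xml_to_tsv_lines(xml_text: str) -> List[str]:
    """Convert a PubmedArticleSet document into TSV lines."""
    root = ET.fromstring(xml_text)
    return [article_to_row(a) for a in root.iter("PubmedArticle")]
```

Open questions this leaves for the maintainers: whether `pub_month` should be numeric or the abbreviated name found in the XML, and whether a missing `pub_year` or `pub_month` also uses a sentinel.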

It might be easier to load the PubMed data into a new database, but if an ingest tool already exists for the loader in its current form, a pointer to it would be really helpful. Thank you!

Documentation request: where does the input TSV for data_loader come from? #22
