We're trying to figure out how to refresh the database that backs DocumentMetadataAPI, and we got stuck on the input format expected by data_loader.process_file / data_loader.lambda_handler.
The loader reads a tab-separated file with (at least) ten columns in this exact order:
document_id pub_year pub_month pub_day journal_name journal_abbrev volume issue article_title abstract
…with - used as a sentinel for missing pub_day and PMID:-prefixed document_id values. We were hoping this was a standard PubMed bulk-export format we could regenerate from scratch, but searching turned up no such thing:
- PubMed's bulk distribution is MEDLINE/PubMed XML (data elements, data provider help), not TSV. EFetch returns XML / JSON / MEDLINE text only.
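- A GitHub-wide code search for the exact column list (pub_year pub_month pub_day journal_name journal_abbrev volume issue article_title abstract) returned zero matches outside this codebase.
- None of the related upstream repos appear to produce it:
  - UCDenver-ccp/Translator-TM-Provider-Pipelines: MedlineXmlToTextPipeline emits plain text + BioNLP section annotations into Cloud Datastore, not bibliographic metadata TSV.
  - edgargaticaCU/biorxiv-aws: only pulls XML out of .meca archives; no parsing.
  - edgargaticaCU/semmed, kgx-export, Translator-TM-Provider-Evidence, JATS2LaTeX: also not metadata-extraction tools.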
Our tentative conclusion is that the TSV is produced by an internal/private job (probably whatever uploads to the GCS bucket the Lambda variant reads from), and isn't documented or open-sourced anywhere. We've written up what we figured out in a downstream README update here: ncats#23
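For reference, here is our current reading of the column contract as a small Python sketch. The names parse_row, EXPECTED_COLUMNS, and MISSING_DAY are ours (hypothetical), not taken from data_loader, and ignoring extra columns is a guess on our part rather than confirmed loader behavior.

```python
# Column order as we understand it from data_loader; not confirmed upstream.
EXPECTED_COLUMNS = [
    "document_id", "pub_year", "pub_month", "pub_day", "journal_name",
    "journal_abbrev", "volume", "issue", "article_title", "abstract",
]
MISSING_DAY = "-"  # sentinel we observed for a missing pub_day


def parse_row(line: str) -> dict:
    """Split one TSV line and map the first ten fields to column names.

    Hypothetical helper for checking our reading of the format; it is not
    the loader's own code. Extra columns are ignored here, which is a guess.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(EXPECTED_COLUMNS):
        raise ValueError(f"expected at least 10 columns, got {len(fields)}")
    # zip stops at the shorter sequence, so only the first ten fields are used.
    row = dict(zip(EXPECTED_COLUMNS, fields))
    if not row["document_id"].startswith("PMID:"):
        raise ValueError(f"document_id lacks PMID: prefix: {row['document_id']!r}")
    # Translate the sentinel to None so callers can tell "missing" from a day.
    row["pub_day"] = None if row["pub_day"] == MISSING_DAY else row["pub_day"]
    return row
```

If any of these checks contradict what the loader actually accepts, that is exactly the kind of correction we are hoping for.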
It would be really helpful to have, in this repo's README or a CONTRIBUTING.md:
- A pointer to whatever script/pipeline produces the TSV (even if it's a private repo or an internal job — just a name).
- Confirmation of the column contract: column order, the - sentinel for missing pub_day, whether len > 1 empty-string handling is intentional, and what happens if more than 10 columns are present.
- Guidance for someone who wants to regenerate the database from a fresh PubMed XML download (baseline + updates).
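In case it helps frame that last request, this is roughly what we would write to regenerate the TSV from PubMed XML if no ingest tool exists. It is only a sketch: the mapping from XML elements to columns is our assumption, the function names are hypothetical, and we do not know how the real producer handles MedlineDate-only records, month formatting, or empty fields.

```python
import xml.etree.ElementTree as ET


def _text(elem, path):
    """Return whitespace-normalized text at `path` under `elem`, or '' if absent."""
    node = elem.find(path)
    if node is None:
        return ""
    # itertext() keeps text inside inline markup such as <i> or <sup>;
    # split/join also removes tabs and newlines that would break the TSV.
    return " ".join("".join(node.itertext()).split())


def article_to_row(article):
    """Map one <PubmedArticle> element to ten fields (our guess at the mapping)."""
    cit = "MedlineCitation/"
    issue = cit + "Article/Journal/JournalIssue/"
    # Records that carry only <MedlineDate> will yield empty year and month here.
    date = issue + "PubDate/"
    abstract_parts = [
        " ".join("".join(node.itertext()).split())
        for node in article.findall(cit + "Article/Abstract/AbstractText")
    ]
    return [
        "PMID:" + _text(article, cit + "PMID"),
        _text(article, date + "Year"),
        _text(article, date + "Month"),
        _text(article, date + "Day") or "-",  # sentinel for a missing day
        _text(article, cit + "Article/Journal/Title"),
        _text(article, cit + "Article/Journal/ISOAbbreviation"),
        _text(article, issue + "Volume"),
        _text(article, issue + "Issue"),
        _text(article, cit + "Article/ArticleTitle"),
        " ".join(part for part in abstract_parts if part),
    ]


def xml_to_tsv_lines(xml_source):
    """Stream a PubMed XML file (path or file object), yielding one TSV line per article."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "PubmedArticle":
            yield "\t".join(article_to_row(elem))
            elem.clear()  # release the subtree; baseline files are large
```

Streaming with iterparse is a deliberate choice so that a full baseline file does not have to fit in memory, but we have not run this against the whole baseline.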
It might be easier to load the PubMed data into a new database, but if a data ingest tool exists for this tool as it currently exists, that would be really helpful. Thank you!