Support PubMed IDs for NCBI Bookshelf records #66

Closed
dhimmel opened this issue Jun 9, 2022 · 7 comments


@dhimmel

dhimmel commented Jun 9, 2022

We noticed some PubMed records that exist online, but aren't in PMDB:

They appear to all be PubMed IDs for a corresponding NCBI Bookshelf record.

Have you ever encountered these, and does it make sense to include the ones with PubMed IDs as PMDB records?

I've run into them in the past (see manubot/manubot#298) and will post any additional information I come across, such as how to retrieve the complete list of PMIDs for Bookshelf records. Also noting the publication that describes them:

NCBI Bookshelf: books and documents in life sciences and health care
Marilu A. Hoeppner
Nucleic Acids Research (2012-11-29) https://doi.org/ghbhpc
DOI: 10.1093/nar/gks1279 · PMID: 23203889 · PMCID: PMC3531209

@JSchoenbachler
Contributor

We get everything from PubMed by downloading and parsing all of the XML files in the "baseline" and "updatefiles" subdirectories at ftp.ncbi.nlm.nih.gov/pubmed/. Looking back through the code, the only explanation I can see for those records being missing is that they aren't contained in the XML files we parse to generate PMDB. I also browsed the FTP site for any sign of NCBI Bookshelf directories, and the only thing I could find was a related directory that didn't contain any files we could parse.

Unfortunately, there is no good way to check whether they are in the files short of manually inspecting the contents of each XML file, and there are hundreds of files, each with well over 1000 lines. It would also be considerable work to incorporate external data into our DB creation.
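For what it's worth, the check described above could probably be scripted rather than done by hand. The following is only a minimal sketch, not PMDB's actual parser: `find_pmids` is a hypothetical helper, and it assumes each record's own PMID is the first `PMID` element inside a `PubmedArticle` or `PubmedBookArticle` element of a gzipped file.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Top-level record elements to inspect (assumption: a record's own PMID
# is the first PMID element nested inside one of these).
RECORD_TAGS = ("PubmedArticle", "PubmedBookArticle")


def find_pmids(path_or_stream, wanted):
    """Return the subset of `wanted` PMIDs found in one gzipped PubMed XML file."""
    wanted = {str(p) for p in wanted}
    found = set()
    with gzip.open(path_or_stream, "rb") as handle:
        # Stream the file so the whole document is never parsed in one go.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag in RECORD_TAGS:
                pmid = elem.findtext(".//PMID")  # first PMID in document order
                if pmid in wanted:
                    found.add(pmid)
                elem.clear()  # drop the record's children once it is checked
    return found
```

Calling `find_pmids("some_file.xml.gz", [23203889])` on each downloaded file (the file name here is a placeholder) and taking the union of the results would show which of the missing PMIDs, if any, are present.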

So, TL;DR: Doesn't seem like these records are in the files we use to create the database, so unless PubMed starts including them they won't make it into PMDB.

@dhimmel
Author

dhimmel commented Jun 11, 2022

I see, probably worth contacting the help desk to inquire how we can download bookshelf records in bulk. I'm away for a week, but can get in touch with them when I'm back unless you do first.

@JSchoenbachler
Contributor

@dhimmel So earlier this week I decided to do another look through the code and found we only parse out PubmedArticle tags and not PubmedBookArticle tags.

I then modified the code and checked the XML files for any PubmedBookArticle tags, but could not find any. So I emailed the NLM Help Desk, and this is the response I got:

"Thank you for writing to the help desk. Book citations are not included in the FTP files. They can be retrieved from the web interface or with the PubMed E-Utilities API."
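A tag check like this one can be approximated with a short script. This is a sketch under assumptions, not the code that was actually run: `count_record_tags` is a hypothetical name, and the input is assumed to be one gzipped XML file from the FTP site.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter

# The two record types discussed in this thread.
RECORD_TAGS = ("PubmedArticle", "PubmedBookArticle")


def count_record_tags(path_or_stream):
    """Count how many records of each type one gzipped PubMed XML file holds."""
    counts = Counter()
    with gzip.open(path_or_stream, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag in RECORD_TAGS:
                counts[elem.tag] += 1
                elem.clear()  # release the record's children after counting
    return counts
```

A file with no book records would simply report a count of zero for `PubmedBookArticle`, since `Counter` returns 0 for missing keys.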

I followed up by asking if they can be retrieved in bulk, but have yet to receive an answer.

@dhimmel
Author

dhimmel commented Jun 21, 2022

Thanks @JSchoenbachler for looking into this!

Even if there is no bulk download, I wonder whether there is a way to get a list of all pubmed ids for NCBI bookshelf records? Looking forward to what you hear from the helpdesk.

@jakejh
Collaborator

jakejh commented Jun 21, 2022

Hi @dhimmel, given that the data aren't in the XML, this isn't going to be a priority for us. You're welcome to investigate more and make a pull request.

@JSchoenbachler
Contributor

@dhimmel Apologies, I forgot to put the response from NLM here. Here it is:

"Dear Colleague,

E-utilities is generally useful with being able to search for unique identifiers of records matching a search term/topic, and then using those unique identifiers to fetch the records/articles that correspond to them. For example, users can, with e-utilities, search for all PMIDs corresponding to pubmed articles containing a search term of interest using esearch. Subsequently, efetch is used to obtain the articles corresponding to each PMID.

The API searches for results when provided with the database of interest to search, so it should work for bookshelf as well. There's an extensive resource of documentation explaining use of E-utilities.

This link specifically has a section regarding how to download large batches of data: https://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large
The overall documentation (introduction, examples, etc) can be found here: https://www.ncbi.nlm.nih.gov/books/NBK25501/"
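The esearch-then-efetch flow described in the reply might look roughly like the sketch below. The helper names are hypothetical, and the search term is left as a parameter because nothing in this thread establishes which query selects only Bookshelf-backed records. Only the standard library is used; the two URL builders and the parser are kept separate from the network call so they can be checked offline.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retstart=0, retmax=1000):
    """Build an esearch URL that asks for one page of matching IDs as JSON."""
    params = {"db": db, "term": term, "retstart": retstart,
              "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def efetch_url(pmids, db="pubmed"):
    """Build an efetch URL that requests full XML records for the given IDs."""
    params = {"db": db, "id": ",".join(str(p) for p in pmids), "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urllib.parse.urlencode(params)}"


def parse_esearch(payload):
    """Extract (total hit count, IDs on this page) from an esearch JSON body."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


def search_pmids(term, retstart=0, retmax=1000):
    """Run one esearch request (network call) and return (count, id list)."""
    url = esearch_url(term, retstart=retstart, retmax=retmax)
    with urllib.request.urlopen(url) as resp:
        return parse_esearch(resp.read())
```

Paging would mean calling `search_pmids` repeatedly with an increasing `retstart` until the returned count is covered, then passing batches of IDs to `efetch_url`. NCBI's usage guidelines on request rates apply to any loop like that.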

@jakejh jakejh closed this as completed Jul 20, 2022
@dhimmel
Author

dhimmel commented Jul 20, 2022

Thanks @JSchoenbachler for posting the helpdesk reply. It's still unclear to me whether there is a way to search for a list of all PubMed IDs corresponding to Bookshelf records, which seems to be the critical missing piece here.

@jakejh I think GitHub has a new option for "Close as not planned" which is probably more appropriate here than "Close as completed".

@jakejh jakejh closed this as not planned Jul 20, 2022