Pubmed Toolkit

A bundle of Python scripts for searching PubMed, downloading PDFs, and analyzing the results.

Installation

Before installing, make sure that Python 3 and pip are installed.

  1. Clone the repository
git clone https://github.com/kaaass/PubmedToolkit.git
cd PubmedToolkit
  2. Install the dependencies using pip
pip install -r requirements.txt

pubmed_central.py

Download PDFs from PubMed Central by PMID or from a PMID source file. See "PMID Source File Schema" for details on the source file format.

  • Supports resuming an interrupted run from where it stopped
  • Supports retrying failed tasks
  • Supports a proxy pool to get around anti-scraping measures

Usage

usage: pubmed_central.py [-h] [-o OUTPUT_DIR] [--resume] [--retry] [--use-proxy]
                         [PMIDs or PMID source file [PMIDs or PMID source file ...]]

Download PDFs from pubmed central by PMIDs

positional arguments:
  PMIDs or PMID source file
                        PMIDs to download, or filepath of PMID source file.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        output directory
  --resume              Allow resume from an exist lock file
  --retry               Retry the tasks in the failed file
  --use-proxy           Use proxy pool to access Pubmed Central

Examples

  1. Download PDFs for PMIDs 29138661 and 29123944
python pubmed_central.py 29138661 29123944
  2. Download PDFs for the PMIDs listed in a source file
python pubmed_central.py data.json
  3. Resume an interrupted task
python pubmed_central.py --resume data.json

PMID Source File Schema

A PMID source file is a JSON file that stores an array of objects. It can be generated by pubmed_search.py.

[
    {
        "pmid": 0, // PMID
        // Other attributes will be ignored
    },
    // ...
]

To support other formats, you may need to edit the function load_source_file.
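The schema above can be read with a few lines of Python. The following is a minimal sketch, not the toolkit's own load_source_file: the function name load_pmids is hypothetical, and it assumes the file is strict JSON (the comments and trailing commas in the listing above are for illustration only and would not parse).

```python
import json


def load_pmids(path):
    """Return the PMIDs listed in a PMID source file as strings.

    Hypothetical helper for illustration; assumes a strict-JSON array
    of objects that each carry a "pmid" key.
    """
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    # Only the "pmid" key is read; any other attributes are ignored.
    return [str(entry["pmid"]) for entry in entries]
```

Converting each value with str means the sketch accepts PMIDs stored either as numbers or as strings.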

Todo

  • Support schema: one PMID per line
  • Support schema: BibTeX library
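Until the one-PMID-per-line format is supported directly, such a list can be converted into the JSON schema described above. This is a minimal sketch under stated assumptions: the function name lines_to_source_file and the file names are hypothetical, and PMIDs are written as integers to match the schema listing.

```python
import json


def lines_to_source_file(txt_path, json_path):
    """Convert a text file with one PMID per line into a PMID source file.

    Hypothetical helper for illustration; returns the number of entries written.
    """
    with open(txt_path, encoding="utf-8") as f:
        # Strip surrounding whitespace and skip blank lines.
        pmids = [line.strip() for line in f if line.strip()]
    # Each entry only needs the "pmid" key; int() raises on non-numeric lines.
    entries = [{"pmid": int(pmid)} for pmid in pmids]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=4)
    return len(entries)
```

For example, lines_to_source_file("pmids.txt", "data.json") would produce a file that can be passed to pubmed_central.py as a source file.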

pubmed_search.py

WARNING: This script is incomplete; you may need to edit the source code to use it.

Search PubMed with a given query and save the resulting entries as a JSON file.

Usage

Set the variable query to your search string. You can build the query with https://www.ncbi.nlm.nih.gov/pubmed/advanced.

Set the parameter max_results to the maximum number of results to fetch.

Run python pubmed_search.py; the results are stored in data.json.
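The steps above can be sketched with pymed, the library credited under Thanks. This is an illustration under stated assumptions, not the contents of pubmed_search.py: the helper names search, to_records and save are hypothetical, the tool and email values are placeholders, and the output keeps only the pmid and title fields.

```python
import json


def search(query, max_results=100):
    """Run a PubMed query through pymed and return the article objects."""
    from pymed import PubMed  # third-party: pip install pymed

    # tool and email identify the client to NCBI; both values are placeholders.
    pubmed = PubMed(tool="PubmedToolkit", email="you@example.com")
    return list(pubmed.query(query, max_results=max_results))


def to_records(articles):
    """Reduce article objects to dicts that follow the PMID source file schema."""
    records = []
    for article in articles:
        # pymed may return several newline-separated IDs in pubmed_id;
        # the first one is assumed here to be the article's own PMID.
        pmid = str(article.pubmed_id).split()[0]
        records.append({"pmid": int(pmid), "title": getattr(article, "title", None)})
    return records


def save(records, path="data.json"):
    """Write the records as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4, ensure_ascii=False)
```

Calling save(to_records(search(query, max_results=50))) with your own query string would then write a data.json that pubmed_central.py can read.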

[WIP] pubmed_info.py

Download metadata and figures, and extract text from PDFs.

Thanks

  1. https://github.com/gijswobben/pymed/
  2. https://github.com/zotero/translators
