A bundle of Python scripts for searching PubMed, downloading PDFs, and analyzing them.
Before installation, ensure that you have installed Python 3 and pip.
- Clone the repository
git clone https://github.com/kaaass/PubmedToolkit.git
cd PubmedToolkit
- Install the dependencies using pip
pip install -r requirements.txt
Download PDFs from PubMed Central by PMIDs or from a PMID Source File. See "PMID Source File Schema" for details of the source file format.
- Supports resuming from a breakpoint
- Supports retrying failed tasks
- Supports a proxy pool to work around anti-crawler measures
usage: pubmed_central.py [-h] [-o OUTPUT_DIR] [--resume] [--retry] [--use-proxy]
[PMIDs or PMID source file [PMIDs or PMID source file ...]]
Download PDFs from pubmed central by PMIDs
positional arguments:
PMIDs or PMID source file
PMIDs to download, or filepath of PMID source file.
optional arguments:
-h, --help show this help message and exit
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
output directory
--resume Allow resume from an exist lock file
--retry Retry the tasks in the failed file
--use-proxy Use proxy pool to access Pubmed Central
- Download PDFs for PMIDs 29138661 and 29123944
python pubmed_central.py 29138661 29123944
- Download PDFs for the PMIDs listed in a PMID source file
python pubmed_central.py data.json
- Resume from an interrupted task
python pubmed_central.py --resume data.json
A PMID Source File is a JSON file that stores an array of objects. This file can be generated by pubmed_search.py.
[
{
"pmid": 0, // PMID
// Other attributes will be ignored
},
// ...
]
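As a minimal sketch of how a file in this schema could be read: the helper name `read_pmids` is hypothetical and is not the toolkit's own `load_source_file`, and it assumes the file is plain JSON without the explanatory comments shown in the schema above.

```python
import json


def read_pmids(path):
    """Return the PMIDs from a PMID Source File as a list of strings.

    Hypothetical helper for illustration; not part of PubmedToolkit.
    """
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError("PMID Source File must contain a JSON array")
    # Only the "pmid" attribute is read; other attributes are ignored.
    return [str(entry["pmid"]) for entry in entries]
```

Calling `read_pmids("data.json")` would then give the list of PMIDs that the file describes.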
For other formats, you might need to edit the function load_source_file.
- Support schema: one PMID per line
- Support schema: BibTeX library
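For the one-PMID-per-line case, one option that avoids editing `load_source_file` is to convert the list into the JSON schema beforehand. The following is a minimal sketch, assuming a plain text file with one numeric PMID per line; the function name `convert_pmid_list` and the file names are hypothetical.

```python
import json


def convert_pmid_list(txt_path, json_path):
    """Convert a text file with one PMID per line into a PMID Source File.

    Hypothetical helper for illustration; not part of PubmedToolkit.
    Returns the number of PMIDs written.
    """
    entries = []
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            pmid = line.strip()
            if not pmid:
                continue  # skip blank lines
            if not pmid.isdigit():
                raise ValueError(f"not a PMID: {pmid!r}")
            entries.append({"pmid": int(pmid)})
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)
    return len(entries)
```

After `convert_pmid_list("pmids.txt", "pmids.json")`, the resulting file could be passed to the downloader in the same way as data.json.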
WARNING: This script is incomplete; you might need to edit the source code to use it.
Search PubMed for entries matching a given query and save the results as a JSON file.
Change the variable query to suit your needs. The query can be built with https://www.ncbi.nlm.nih.gov/pubmed/advanced.
Change the parameter max_results to specify the maximum number of results.
Run python pubmed_search.py; the result will be stored in data.json.
Download metadata and figures, and extract text from PDFs.