This Python script helps you automatically download open access PDFs for a list of DOIs using the Unpaywall API. It's particularly useful for research projects like the Research Excellence Framework (REF) where you need full-text academic outputs.
- Fetches OA PDF URLs via the Unpaywall API
- Downloads PDFs and saves them locally
- Handles errors gracefully
- Avoids duplicate downloads
- Rate-limited to respect API policies
- Matches PDFs to DOIs
- Merges JSON files
- Cleans JSON files
- Counts the samples in a JSON file
- Extracts text from PDFs and records whether each extraction succeeded or failed
- Joins PDF text with its label and converts the result to JSONL
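The last step in the list, joining extracted text with labels and writing JSONL, could look roughly like the sketch below. The field names `doi`, `text` and `label`, and the idea of keying both inputs by DOI, are assumptions made for illustration; the actual script may structure its records differently.

```python
import json


def build_jsonl_lines(texts, labels):
    """Return one JSON string per DOI that has both extracted text and a label.

    `texts` and `labels` are dicts keyed by DOI (an assumed layout).
    """
    lines = []
    for doi in sorted(texts):
        if doi not in labels:
            continue  # no label for this PDF, so leave it out
        record = {"doi": doi, "text": texts[doi], "label": labels[doi]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


def write_jsonl(lines, out_path):
    """Write the records to disk, one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
```

Building the lines separately from writing them keeps the joining logic easy to check without touching the file system.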
- Python 3.x
- requests, pandas

Install dependencies:
pip install requests pandas
.
├── extracted_dois.csv # Your input CSV with a 'DOI' column
├── ref_pdfs/ # Downloaded PDFs will be saved here
└── download_ref_pdfs.py # Main script
- Insert your email in the script (Unpaywall requires it):
  EMAIL = "your_email@example.com"
- Prepare your CSV file named extracted_dois.csv with a column titled DOI.
- Run the script:
  python download_ref_pdfs.py
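As a rough illustration of what the script needs from the input file, the sketch below reads the `DOI` column, trims whitespace, and drops blanks and repeats. It uses the standard-library `csv` module so that it runs without extra packages; the real script may well use pandas instead, and `read_dois` is a hypothetical helper name.

```python
import csv


def read_dois(rows):
    """Collect unique, non-empty DOIs from CSV rows that have a 'DOI' column.

    `rows` is any iterable of text lines, such as an open file object.
    """
    seen = set()
    dois = []
    for row in csv.DictReader(rows):
        doi = (row.get("DOI") or "").strip()
        if doi and doi not in seen:
            seen.add(doi)
            dois.append(doi)
    return dois


# Typical use (assumes extracted_dois.csv sits next to the script):
# with open("extracted_dois.csv", newline="", encoding="utf-8") as handle:
#     dois = read_dois(handle)
```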
- The script uses doi.replace("/", "_") to ensure valid filenames.
- A 1-second delay between requests helps you stay compliant with API usage limits.
- Only works for open access papers.
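The filename rule and the delay described in these notes can be sketched as follows. Only `doi.replace("/", "_")`, the `ref_pdfs` folder and the 1-second delay come from this README; the `.pdf` suffix and the helper names `pdf_path_for`, `pending_dois` and `throttled` are assumptions for illustration.

```python
import os
import time


def pdf_path_for(doi, out_dir="ref_pdfs"):
    """Map a DOI to a local file path, replacing '/' so the name is valid."""
    # The ".pdf" suffix is an assumption; the README only documents the replace().
    return os.path.join(out_dir, doi.replace("/", "_") + ".pdf")


def pending_dois(dois, out_dir="ref_pdfs"):
    """Keep only DOIs whose PDF is not on disk yet, to avoid duplicate downloads."""
    return [doi for doi in dois if not os.path.exists(pdf_path_for(doi, out_dir))]


def throttled(items, delay=1.0):
    """Yield items one by one, sleeping `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            time.sleep(delay)
        yield item


# Typical use:
# for doi in throttled(pending_dois(dois)):
#     ...  # look up and download the PDF for this DOI
```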
This script uses the Unpaywall API, which provides free access to millions of open-access research papers.
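For orientation, a lookup against the Unpaywall v2 endpoint might look like the sketch below. It assumes the script reads `best_oa_location.url_for_pdf` from the JSON response, and it uses the standard-library `urllib` in place of `requests` so the example has no dependencies. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the lookup URL; Unpaywall expects the caller's email as a query parameter."""
    query = urllib.parse.urlencode({"email": email})
    return API_ROOT + urllib.parse.quote(doi) + "?" + query


def pick_pdf_url(record):
    """Return the PDF URL from an Unpaywall record, or None if there is none."""
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")


def lookup_pdf_url(doi, email, timeout=30):
    """Query Unpaywall for one DOI; return None on network or parsing errors."""
    try:
        with urllib.request.urlopen(unpaywall_url(doi, email), timeout=timeout) as resp:
            return pick_pdf_url(json.load(resp))
    except (OSError, ValueError):
        # OSError covers HTTP and connection failures; ValueError covers bad JSON.
        return None
```

A `None` result means either the paper has no open access PDF on record or the request failed, so the caller can simply skip that DOI.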
Created by Hazeeb – feel free to reach out for questions or improvements!