Migrate to CLI-based tool, multithread download, better error handling #5

Status: Open. Wants to merge 15 commits into base: `master`.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
tmp/
*.pdf
.idea/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
61 changes: 39 additions & 22 deletions README.md
@@ -1,38 +1,55 @@
# This project is currently **broken**. HathiTrust has changed its page structure and now uses blob URLs to store the images, and I don't have the time to rewrite this to accommodate that. It is likely still possible, e.g. by using Selenium to request the page and pull the image data from there, but I'm not familiar enough with Selenium or blob URLs to do it. If someone else wants to make a pull request, feel free.

# Hathi Trust Digital Library - Complete PDF Download
Download an entire book (or publication) as a PDF from the Hathi Trust Digital Library without the "partner login" requirement.

# Motivation
The Hathi Trust Digital Library is a good place to find old publications digitized from various university libraries. However, it restricts downloads of full PDF files to partner universities, which are mostly American. This code attempts to democratize knowledge by making it possible to download complete public-domain works as PDFs from the Hathi Trust website.

# Features
- Multi-threaded download of PDF pages, merged into a single file.
- Smart page download: already downloaded pages are skipped.
- Supports the two most common link formats:
  - https://babel.hathitrust.org/cgi/pt?id={bookID}
  - https://hdl.handle.net/XXXX/{bookID}
- Book slicing, allowing you to download only part of the book.
- Bulk download of multiple books.
- Attempts to avoid Error 429 (Too Many Requests) from Hathi Trust.
  - If the error occurs, the thread sleeps for 5 seconds and tries again.
  - This works in most cases, but not always.
- Each download is attempted 3 times before giving up.
  - Users are notified of failures and can re-download the missing pages before the final merge.
  - The retry attempt count is configurable via the --retries option.
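The retry behavior described above can be sketched roughly as follows; `fetch_page` and its parameters are hypothetical illustrations of the approach, not the project's actual code:

```python
import time

import requests


def fetch_page(url, retries=3, backoff=5.0):
    """Try to fetch one page, backing off on HTTP 429 and retrying up to `retries` times."""
    for attempt in range(retries):
        resp = requests.get(url)
        if resp.status_code == 429:  # Too Many Requests: sleep, then retry
            time.sleep(backoff)
            continue
        if resp.ok:
            return resp.content
    return None  # caller is notified and may re-download the page later
```

On failure the function returns `None`, so the caller can collect the missing pages and offer to re-download them before the final merge.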

# Requirements
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup) (bs4)
* [Requests](https://realpython.com/python-requests/)
* [PyPDF2](https://pythonhosted.org/PyPDF2/)
* [Progressbar](https://pypi.org/project/progressbar/)
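The dependencies above can be installed with pip; the package names below are the assumed PyPI names for the linked libraries:

```shell
pip install beautifulsoup4 requests PyPDF2 progressbar
```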

# Usage

```
usage: hathitrustPDF.py [-h] [-l LINK] [-i INPUT_FILE] [-t THREAD_COUNT] [-r RETRIES] [-b BEGIN] [-e END] [-k]
                        [-o OUTPUT_PATH] [-v] [-V]

PDF Downloader and Merger

options:
  -h, --help            show this help message and exit
  -l LINK, --link LINK  HathiTrust book link
  -i INPUT_FILE, --input-file INPUT_FILE
                        File with list of links formatted as link,output_path
  -t THREAD_COUNT, --thread-count THREAD_COUNT
                        Number of download threads
  -r RETRIES, --retries RETRIES
                        Number of retries for failed downloads
  -b BEGIN, --begin BEGIN
                        First page to download
  -e END, --end END     Last page to download
  -k, --keep            Keep downloaded pages
  -o OUTPUT_PATH, --output-path OUTPUT_PATH
                        Output file path
  -v, --verbose         Enable verbose mode
  -V, --version         show program's version number and exit
```

Pages are downloaded as individual PDF files and merged into a single output file (`-o`/`--output-path`); pass `-k`/`--keep` to retain the individual page files after merging.
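The `-i`/`--input-file` option takes a file with one `link,output_path` pair per line. A minimal sketch of how such a file could be parsed (the helper name is hypothetical, not the project's actual code):

```python
def parse_input_file(text):
    """Parse 'link,output_path' pairs, one per line, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first comma only, so commas in the output path survive
        link, _, out = line.partition(",")
        pairs.append((link.strip(), out.strip()))
    return pairs


example = "https://babel.hathitrust.org/cgi/pt?id=mdp.39015023320164,book.pdf\n"
print(parse_input_file(example))
```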

# Screenshot
![captura-hait](https://user-images.githubusercontent.com/56649205/72007547-abc73680-3230-11ea-9e74-4e6e495c90d2.PNG)