Lightweight bash script to convert scanned PDFs into searchable, copyable PDFs using Tesseract OCR with parallel processing.

maxgfr/copyable-pdf

copyable-pdf

A lightweight, dependency-minimal bash script to convert scanned PDFs into searchable PDFs using Tesseract OCR.

copyable-pdf takes a PDF as input, converts each page to an image, runs OCR (Optical Character Recognition) on each image with Tesseract, and merges the resulting pages back into a single searchable PDF.
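
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of script.sh: the function name ocr_pdf, its positional arguments, and the temporary-directory layout are assumptions made for the example. The tools it calls (pdftoppm, tesseract, pdfunite) are the ones listed under Dependencies.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the page -> image -> OCR -> merge flow.
# ocr_pdf and its arguments are illustrative names, not taken from script.sh.
ocr_pdf() (
  input="$1"; output="$2"; lang="${3:-eng}"; dpi="${4:-300}"

  # Work in a throwaway directory that is removed when this subshell exits.
  tmp="$(mktemp -d)" || exit 1
  trap 'rm -rf "$tmp"' EXIT

  # 1. Rasterize each page: pdftoppm writes page-1.png, page-2.png, ...
  #    (numbers are zero-padded to a common width, so a glob keeps page order).
  pdftoppm -r "$dpi" -png "$input" "$tmp/page" || exit 1

  # 2. OCR each image; the trailing "pdf" config makes Tesseract write
  #    <outputbase>.pdf containing the image plus an invisible text layer.
  for img in "$tmp"/page-*.png; do
    tesseract "$img" "${img%.png}" -l "$lang" pdf || exit 1
  done

  # 3. Merge the per-page PDFs into a single document.
  pdfunite "$tmp"/page-*.pdf "$output"
)
```

Under these assumptions, `ocr_pdf scan.pdf scan_ocr.pdf fra 600` would OCR scan.pdf in French at 600 DPI. The function body runs in a subshell so that the cleanup trap and the variables do not leak into the calling shell.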

Features

  • OCR: Make scanned documents searchable and copyable.
  • Parallel Processing: Uses multiple cores for faster OCR.
  • Dependency Check: Automatically checks for missing tools.
  • Customizable: Set language and DPI.
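
As an illustration of how the dependency check and parallel OCR could work, here is a minimal sketch. The helper names (check_deps, detect_jobs, ocr_pages_parallel) are hypothetical, and the use of getconf and xargs -P is an assumption for the example; the actual script may detect cores and schedule jobs differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and approach are assumptions, not script.sh.

# Print every required tool that is not on PATH; return non-zero if any is missing.
check_deps() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing dependency: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Guess a job count from the number of online CPUs, falling back to 1.
detect_jobs() {
  getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# OCR every page-*.png in a directory, running up to $3 tesseract processes at once.
# xargs appends one file name per invocation; it arrives in the inner shell as $2.
ocr_pages_parallel() {
  local dir="$1" lang="$2" jobs="$3"
  find "$dir" -name 'page-*.png' -print0 |
    xargs -0 -n 1 -P "$jobs" \
      sh -c 'tesseract "$2" "${2%.png}" -l "$1" pdf' _ "$lang"
}
```

A caller might run `check_deps tesseract pdftoppm pdfunite || exit 1` before starting, then `ocr_pages_parallel "$tmp" eng "$(detect_jobs)"`. The -0 and -P options are widely available in GNU and BSD xargs but are not part of POSIX.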

Installation

Via Homebrew

brew tap maxgfr/homebrew-tap
brew install copyable-pdf

Manual Installation

  1. Clone the repository:
    git clone https://github.com/maxgfr/copyable-pdf.git
    cd copyable-pdf
  2. Make the script executable:
    chmod +x script.sh
  3. (Optional) Move it to a directory on your PATH:
    mv script.sh /usr/local/bin/copyable-pdf

Dependencies

Ensure you have the following installed:

  • tesseract: For OCR.
  • poppler: For pdftoppm and pdfunite.

On macOS (Homebrew):

brew install tesseract poppler

On Ubuntu/Debian:

sudo apt-get install tesseract-ocr poppler-utils

Usage

copyable-pdf [options] input.pdf

Options

Option               Description                        Default
-l, --lang <code>    Language code (e.g., fra, eng)     eng
-o, --output <path>  Custom output file path            input_ocr.pdf
-d, --dpi <num>      DPI resolution for OCR             300
-j, --jobs <num>     Number of parallel jobs            Auto-detect
-t, --text           Generate an additional .txt file   false
-m, --markdown       Generate an additional .md file    false
-k, --keep           Keep temporary files (debug)       false
-v, --verbose        Verbose output                     false
-h, --help           Show help message                  -
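
To show how a few of these options and their defaults might be handled, here is a small parsing sketch. The parse_args function and its variable names are hypothetical, and the rule for the default output name (strip .pdf, append _ocr.pdf) is an assumption inferred from the table, not confirmed behavior of the script. Only -l, -d, and -o are covered.

```shell
#!/usr/bin/env bash
# Hypothetical option parser covering -l, -d and -o only; not the script's own code.
# Results are left in the variables lang, dpi, output and input.
parse_args() {
  lang="eng"; dpi=300; output=""; input=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -l|--lang|-d|--dpi|-o|--output)
        # Each of these options consumes the next argument as its value.
        [ "$#" -ge 2 ] || { echo "Option $1 needs a value" >&2; return 1; }
        case "$1" in
          -l|--lang)   lang="$2" ;;
          -d|--dpi)    dpi="$2" ;;
          -o|--output) output="$2" ;;
        esac
        shift 2 ;;
      -*)
        echo "Unknown option: $1" >&2; return 1 ;;
      *)
        input="$1"; shift ;;
    esac
  done
  [ -n "$input" ] || { echo "No input file given" >&2; return 1; }
  # Assumed default: document.pdf -> document_ocr.pdf
  [ -n "$output" ] || output="${input%.pdf}_ocr.pdf"
}
```

With this sketch, `parse_args -l fra -d 600 document.pdf` leaves lang=fra, dpi=600, and output=document_ocr.pdf.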

Examples

Basic usage:

copyable-pdf document.pdf

Specify language (French) and higher DPI:

copyable-pdf -l fra -d 600 document.pdf

Explicitly set output filename:

copyable-pdf -o searchable_doc.pdf scan.pdf

License

MIT
