So this is a little bash script to add searchable text layers to your PDFs using OCR. I'm using it at work for financial statements etc.
- Batch process multiple PDFs in a folder
- Danish OCR (but has multi-language support for the cosmopolitans out there)
- Re-OCRs PDFs even if they already have text (fixes corrupted text/encoding issues)
- Automatically skips digitally signed PDFs so as to not void signatures
- Removes original files after successful OCR (signed files are preserved)
- Progress tracking in the terminal + clear error reporting
- Arch Linux
- macOS
- Windows? < Omarchy
- Tesseract OCR - The OCR engine
- ocrmypdf - Python wrapper for Tesseract
- unpaper - Image preprocessing tool (for cleaning scans)
- Python 3.7+ - Required for ocrmypdf
# Install dependencies
sudo pacman -S tesseract tesseract-data-dan python-pipx unpaper
# Install ocrmypdf
pipx install ocrmypdf
# Make script executable
chmod +x ocr_pdf.sh# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install dependencies
brew install tesseract tesseract-lang unpaper
# Install ocrmypdf
pip3 install ocrmypdf
# Make script executable
chmod +x ocr_pdf.shProcess all PDFs in a folder:
./ocr_pdf.sh --batch "/path/to/your/pdfs/"Process a single PDF:
./ocr_pdf.sh input.pdf [output.pdf]- Scans all PDF files in the specified folder
- Creates OCR'd versions with "(OCR)" appended to filename
- Removes original files after successful OCR
- Skips digitally signed PDFs (preserves signatures)
- Reports which files were skipped due to signatures
- Successfully OCR'd:
document.pdf→document (OCR).pdf(original removed) - Digitally signed: Original file preserved, no OCR version created
- Failed processing: Original file preserved for investigation
The script defaults to Danish (dan). If you wanna use a different language, edit ocr_pdf.sh and change --language dan to your preferred language. The full list can be found in the link below...
Install the corresponding language data:
# Arch Linux
sudo pacman -S tesseract-data-<language>
# macOS (included in tesseract-lang package)Make sure ~/.local/bin is in your PATH:
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc- Arch Linux:
sudo pacman -S tesseract - macOS:
brew install tesseract
chmod +x ocr_pdf.sh./ocr_pdf.sh --batch ~/Documents/finances/./ocr_pdf.sh receipt.pdf- Uses Tesseract OCR engine for text recognition
- Processes at 300 DPI for good accuracy on Danish characters (Å, ø, æ)
- Force re-OCRs all pages - replaces existing text (even if already present) to fix corrupted text/encoding
- Uses unpaper for image cleaning and noise removal
- Processes locally on your machine (no internet required after installation)
- Supports deskewing for rotated scans
- All OCR processing happens 100% locally on your machine. Your documents remain completely private.
MIT
Issues and pull requests are welcome! <3