A script designed to convert PDF documents into text using Optical Character Recognition (OCR).
- Python 3.x
- Virtual environment (optional but recommended)
- Poppler utility for Windows
- Tesseract OCR
To check the installation, the user can open a terminal and run the following Python commands:
import pytesseract
print(pytesseract)
- Create a virtual environment:
python -m venv venv
- Activate the virtual environment:
- Windows PowerShell:
.\venv\Scripts\Activate
- Bash:
source venv/bin/activate
- Windows PowerShell:
Install the required Python packages:
pip install pdf2image pytesseract opencv-python numpy
- Download the Poppler for Windows binaries.
- Extract and add the
bin
directory to the system's PATH.
- Download and install from the GitHub releases page.
- Add Tesseract to the system's PATH or specify the path in the script:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
To use the script, the user should run it and, when prompted, paste the full path to the PDF document. The script will process the document and output the transcribed text.
Error: "Unable to get page count. Is poppler installed and in PATH?"
Solution: Ensure Poppler's bin
directory is added to PATH. If a virtual environment is in use, ensure it's activated.
Error: "tesseract is not installed or it's not in your PATH."
Solution: Ensure Tesseract is installed and its path is added to the system's PATH or specified in the script.
Error: "File cannot be loaded because running scripts is disabled on this system."
Solution: Allow script execution by running Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
in PowerShell as Administrator.
ocr_pdf_populator/
├── src/
│ ├── __init__.py
│ ├── ocr_processing.py # Functions related to OCR processing
│ ├── pattern_recognition.py # Functions for extracting data using regex or other methods
│ ├── pdf_mapping.py # Functions to map and fill fields in the destination PDF
│ └── utils/ # Utility functions, common constants, etc.
│ ├── __init__.py
│ └── helpers.py
│ ├── tests/ # Unit tests
│ ├── __init__.py
│ ├── test_ocr_processing.py
│ ├── test_pattern_recognition.py
│ └── test_pdf_mapping.py
│ ├── sample_pdfs/ # Sample PDFs for testing purposes
│ ├── input/
│ └── output/
│ ├── output/ # Directory where output PDFs are saved
│ └── main.py # Main script that ties everything together