OCR Transcriber

A script designed to convert PDF documents into text using Optical Character Recognition (OCR).

Setup & Installation

Prerequisites

Python 3.x
Virtual environment (optional but recommended)
Poppler utility for Windows
Tesseract OCR

To check the installation, the user can open a terminal and run the following Python commands:

import pytesseract
print(pytesseract)

Installation

Virtual Environment (optional)

Create a virtual environment: python -m venv venv
Activate the virtual environment:
- Windows PowerShell: .\venv\Scripts\Activate
- Bash: source venv/bin/activate

Dependencies

Install the required Python packages:

pip install pdf2image pytesseract opencv-python numpy

Poppler Utility

Download the Poppler for Windows binaries.
Extract and add the bin directory to the system's PATH.

Tesseract OCR

Download and install from the GitHub releases page.
Add Tesseract to the system's PATH or specify the path in the script:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Usage

To use the script, the user should run it and, when prompted, paste the full path to the PDF document. The script will process the document and output the transcribed text.

Troubleshooting & Common Issues

Poppler Not Found

Error: "Unable to get page count. Is poppler installed and in PATH?"

Solution: Ensure Poppler's bin directory is added to PATH. If a virtual environment is in use, ensure it's activated.

Tesseract Not Found

Error: "tesseract is not installed or it's not in your PATH."

Solution: Ensure Tesseract is installed and its path is added to the system's PATH or specified in the script.

PowerShell Script Execution Policy

Error: "File cannot be loaded because running scripts is disabled on this system."

Solution: Allow script execution by running Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser in PowerShell as Administrator.

Current File Structure

ocr_pdf_populator/
├── src/
│   ├── __init__.py
│   ├── ocr_processing.py  # Functions related to OCR processing
│   ├── pattern_recognition.py  # Functions for extracting data using regex or other methods
│   ├── pdf_mapping.py  # Functions to map and fill fields in the destination PDF
│   └── utils/  # Utility functions, common constants, etc.
│       ├── __init__.py
│       └── helpers.py
│   ├── tests/  # Unit tests
│       ├── __init__.py
│       ├── test_ocr_processing.py
│       ├── test_pattern_recognition.py
│       └── test_pdf_mapping.py
│   ├── sample_pdfs/  # Sample PDFs for testing purposes
│       ├── input/
│       └── output/
│   ├── output/  # Directory where output PDFs are saved
│   └── main.py  # Main script that ties everything together

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
__pycache__		__pycache__
build/gui_interface		build/gui_interface
dist		dist
src		src
tests		tests
venv		venv
.gitignore		.gitignore
PDF Transcriber.bat		PDF Transcriber.bat
PDF Transcriber.vbs		PDF Transcriber.vbs
README.md		README.md
gui_interface.py		gui_interface.py
gui_interface.spec		gui_interface.spec
main.py		main.py
main_back_up.py		main_back_up.py
your_script.spec		your_script.spec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCR Transcriber

Setup & Installation

Prerequisites

Installation

Virtual Environment (optional)

Dependencies

Poppler Utility

Tesseract OCR

Usage

Troubleshooting & Common Issues

Poppler Not Found

Tesseract Not Found

PowerShell Script Execution Policy

Current File Structure

About

Releases

Packages

Languages

jcub7012/QNA_OCR_READER

Folders and files

Latest commit

History

Repository files navigation

OCR Transcriber

Setup & Installation

Prerequisites

Installation

Virtual Environment (optional)

Dependencies

Poppler Utility

Tesseract OCR

Usage

Troubleshooting & Common Issues

Poppler Not Found

Tesseract Not Found

PowerShell Script Execution Policy

Current File Structure

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages