This is a demo project for "Optical Character Recognition Digitization" of full text pages. It is designed for use as a Hugging Face Gradio app.
The underlying processing pipeline includes:
- Image binarization
- Text line segmentation
- Text line extraction, filtering, and deskewing
- OCR on text lines
- Printing recognized text on generated image for visualization
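To make the first two pipeline stages more concrete, here is a minimal pure-Python sketch of binarization followed by text line segmentation. It is not the project's implementation: the app relies on eynollah for the image preprocessing, whereas this sketch swaps in Otsu thresholding and a simple horizontal projection profile for illustration. The function names (`otsu_threshold`, `binarize`, `segment_lines`) and the toy `page` image are hypothetical, and the sketch assumes dark text on a light background.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale image as rows of pixel values in 0..255


def otsu_threshold(image: Image) -> int:
    """Return the gray level that maximizes the between-class variance (Otsu)."""
    hist = [0] * 256
    for row in image:
        for px in row:
            hist[px] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]            # pixels with value <= t
        if w_dark == 0:
            continue
        w_light = total - w_dark     # pixels with value > t
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image: Image, threshold: Optional[int] = None) -> Image:
    """Map pixels to 1 (ink) or 0 (background); ink is any pixel <= threshold."""
    t = otsu_threshold(image) if threshold is None else threshold
    return [[1 if px <= t else 0 for px in row] for row in image]


def segment_lines(binary: Image, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, of runs of inked rows."""
    lines: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # the current text line ends
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines


# Toy "page": two dark text lines separated by blank rows.
page: Image = [
    [255, 255, 255, 255, 255, 255],
    [255, 0, 0, 255, 0, 255],
    [255, 0, 255, 255, 0, 255],
    [255, 255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255],
    [255, 20, 20, 20, 255, 255],
    [255, 255, 255, 255, 255, 255],
]

line_ranges = segment_lines(binarize(page))
line_crops = [page[top:bottom] for top, bottom in line_ranges]
print(line_ranges)
```

Each crop in `line_crops` corresponds to one candidate text line that a later stage would filter, deskew, and pass to the OCR model. A projection profile only works for roughly horizontal, well-separated lines, which is one reason a real pipeline uses a learned layout analysis tool instead.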
Please note:
- The app is optimized for English; other languages (e.g., German) may require OCR model fine-tuning.
- When running on CPUs, a pipeline run can take over 10 minutes depending on the input image.
- If the wait is too long or the online app is down, see the pre-computed examples: https://github.com/pluniak/ocrd/tree/main/data/demo_data
- The demo is only a first prototype! OCR performance and computation speed still need to be optimized.
Usage:
- Test the demo online at https://huggingface.co/spaces/pluniak/ocrd
- or follow the steps below to install and run the app on your local computer.
- Install Anaconda if you haven't done so yet: https://docs.anaconda.com/free/anaconda/install
- Clone the repository, then set up and activate the virtual environment:
git clone https://github.com/pluniak/ocrd.git
cd ocrd
./create_conda_env_linux.sh # Linux
create_conda_env_windows.bat # Windows (using Conda terminal)
conda activate ocrd
After activating the virtual environment, you can run the app locally as a Web Server or inside a Jupyter Notebook.
To run it as a web server, execute this script from the CLI:
python ./src/app.py
Then click on the generated local URL (usually: http://127.0.0.1:7860).
To run it inside a Jupyter Notebook, open and run this notebook:
./notebooks/app.ipynb
For more examples visit: https://github.com/pluniak/ocrd/tree/main/data/demo_data
This project makes use of significant components from the following open-source projects:
- eynollah: An automated layout analysis tool for historical documents, developed as part of the QURATOR project. The eynollah tool is instrumental in facilitating the preprocessing of document images in this project. For more details on eynollah, visit their GitHub repository: qurator-spk/eynollah. The tool is used under the Apache License 2.0.
- Microsoft trocr: I utilize Microsoft's trocr models for optical character recognition tasks. The trocr models are highly effective in recognizing text from a variety of document types. For more information on trocr and its usage, please see Microsoft's trocr repository under the MIT License.
I appreciate the efforts of the developers and the community in providing these high-quality open-source resources.