start-ocr

Applies pdfplumber, OpenCV, and pytesseract to extract content and metadata from formal PDF files.
pdfplumber's page.extract_text_lines() is experimental, so it may or may not work depending on the PDF file.
See documentation.
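
Because page.extract_text_lines() may return nothing useful for some files, one way to cope is to fall back to OCR for those pages. The following is a minimal sketch under that assumption, calling pdfplumber and pytesseract directly. It is not start-ocr's own API: the helper names page_lines, ocr_page, and extract_pdf_lines are hypothetical, and the 300 dpi resolution is an arbitrary choice.

```python
from typing import Callable, List


def page_lines(page, ocr_fallback: Callable[[object], List[str]]) -> List[str]:
    """Return the text lines of one page, preferring the PDF's text layer.

    `page` is expected to behave like a pdfplumber Page. `ocr_fallback`
    is called only when the text layer yields no non-blank lines.
    """
    try:
        # extract_text_lines() returns a list of dicts, each with a "text" key.
        lines = [line["text"] for line in page.extract_text_lines()]
    except Exception:
        # The method is experimental, so treat any failure as "no text found".
        lines = []
    lines = [text for text in lines if text.strip()]
    return lines if lines else ocr_fallback(page)


def ocr_page(page, resolution: int = 300) -> List[str]:
    """Hypothetical fallback: render the page to an image and run Tesseract."""
    import pytesseract  # imported lazily; needs the tesseract binary installed

    image = page.to_image(resolution=resolution).original  # a PIL image
    text = pytesseract.image_to_string(image)
    return [line for line in text.splitlines() if line.strip()]


def extract_pdf_lines(path: str) -> List[List[str]]:
    """Return one list of text lines per page of the PDF at `path`."""
    import pdfplumber  # imported lazily so page_lines() stays dependency-free

    with pdfplumber.open(path) as pdf:
        return [page_lines(page, ocr_page) for page in pdf.pages]
```

Calling extract_pdf_lines("sample.pdf") (a placeholder path) would give a list with one entry per page. Passing the fallback as a function keeps the decision logic separate from the OCR step, so the OCR step could be swapped for an OpenCV-preprocessed variant.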

Installation

just start

.github/workflows
docs
notebooks
src/start_ocr
tests
.dockerignore
.editorconfig
.gitignore
.pre-commit-config.yaml
Dockerfile
README.md
env.example
justfile
mkdocs.yml
pyproject.toml
