hansmi/paperminer

Amend Paperless documents with extracted information

Paperminer is a system for amending documents stored in Paperless-ngx with additional information ("facts") extracted from the documents themselves or other sources.

Paperminer uses the hansmi/dossier package to parse PDF documents. Support for other formats could be implemented.

Go's standard plugin package comes with a number of caveats that make it unsuitable for Paperminer. Compile-time plugins via the hansmi/staticplug package are used instead, so it is necessary to set up your own build. An example program with a plugin can be found in the example/myminer directory.
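To illustrate the general idea of compile-time plugins, here is a minimal sketch in plain Go. It does not use the hansmi/staticplug API; the `Facter` interface, the `Register` and `Names` functions, and `invoiceFacter` are hypothetical names made up for this example. Each plugin registers itself from an `init` function, so it becomes part of the program simply by being compiled into the binary.

```go
package main

import (
	"fmt"
	"sort"
)

// Facter is a hypothetical plugin interface used only in this sketch.
type Facter interface {
	Name() string
}

// registry holds every plugin compiled into the binary.
var registry = map[string]Facter{}

// Register adds a plugin to the registry and panics on duplicate names.
func Register(f Facter) {
	if _, dup := registry[f.Name()]; dup {
		panic("duplicate plugin: " + f.Name())
	}
	registry[f.Name()] = f
}

// Names returns the registered plugin names in sorted order.
func Names() []string {
	names := make([]string, 0, len(registry))
	for name := range registry {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// invoiceFacter is a hypothetical plugin. In a real build it would usually
// live in its own package and be pulled in with a blank import.
type invoiceFacter struct{}

func (invoiceFacter) Name() string { return "invoice" }

// init runs at program start, so compiling the plugin in is enough to
// make it available.
func init() {
	Register(invoiceFacter{})
}

func main() {
	fmt.Println(Names())
}
```

With this pattern the set of plugins is fixed when the binary is built, which is why a custom build is needed. See the example/myminer directory for how Paperminer itself wires up a plugin.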

Plugins may use dossier sketches to look for specific regular expressions at absolute or relative positions on pages. The sketchfacts package is often sufficient even though it ignores pages beyond the first. Custom logic can produce document facts from the findings.

Plugins may also extract arbitrary document pages and implement their own data extraction. External APIs may also be involved.

Normalizing extracted text before parsing it further is generally recommended, and not just for dates and times: remove extraneous whitespace, stray separators and similar noise. Regular expressions should also be written to be flexible where possible, because OCR-derived text often differs slightly from the original.

Useful packages for writing document facters: