MinePDF

MinePDF is a pure-Python tool for extracting reflowable text, images, fonts and other content from PDF documents. It is based on PDFMiner and adds a number of improvements that make it easier to analyze the visual content of pages.

This project is still under development. Documentation, a test suite and further debugging will follow.

Improved features

Configurable parameters for grouping text lines into paragraphs
Better image extraction using Pillow and NumPy
Support for various color spaces in color conversion
Improved font parsing for glyph mapping and font extraction
Improved render device that checks whether objects are visible
Customizable extraction: page range selection and resource type filtering
Customizable logging system
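
As an illustration of what color conversion between color spaces involves, the sketch below converts CMYK components to 8-bit RGB with the common naive formula. It ignores ICC profiles, and `cmyk_to_rgb` is a hypothetical helper written for this README, not a function exported by MinePDF.

```python
def cmyk_to_rgb(c, m, y, k):
    """Convert CMYK components in [0, 1] to an 8-bit (r, g, b) tuple.

    Hypothetical helper for illustration only; it uses the naive,
    profile-free formula channel = 255 * (1 - ink) * (1 - k).
    """
    for value in (c, m, y, k):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f'component out of range: {value!r}')
    r = round(255 * (1 - c) * (1 - k))
    g = round(255 * (1 - m) * (1 - k))
    b = round(255 * (1 - y) * (1 - k))
    return r, g, b


print(cmyk_to_rgb(0, 1, 1, 0))  # full magenta and yellow, no black
```

Real PDF files may attach ICC profiles or use other color spaces (indexed, Lab, separation and so on), so a production converter needs considerably more than this formula.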

Basic workflow

from minepdf import render_pages

# handle_text, handle_image and handle_resources stand for your own
# callbacks; they are not defined in this snippet.
with open('sample.pdf', 'rb') as fp:  # the file is closed automatically
    pages = render_pages(fp, page_range='1-5,20-100', filtered=['path'])
    for pageno, renderer, factory in pages:  # extract each page
        handle_text(renderer.text_layer)  # handle paragraphs
        handle_image(renderer.images)  # handle images
        handle_resources(factory)  # handle other resources (fonts, curves, etc.)
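
The `page_range` argument above takes a string such as `'1-5,20-100'`. The sketch below shows one way such a specification could be expanded into individual page numbers. `parse_page_range` is a hypothetical helper, and MinePDF's own parsing may differ (for example in how it treats whitespace or reversed ranges).

```python
def parse_page_range(spec):
    """Expand a spec such as '1-5,20-100' into a sorted list of page numbers.

    Hypothetical helper for illustration; not part of MinePDF.
    """
    pages = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if '-' in part:
            start, end = (int(x) for x in part.split('-', 1))
            if start > end:
                raise ValueError(f'invalid range: {part!r}')
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range('1-3,7,10-12'))  # [1, 2, 3, 7, 10, 11, 12]
```

Collecting the numbers in a set means overlapping ranges such as `'1-5,3-8'` yield each page only once.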

Text layout analysis

Modeled on the LAParams class that PDFMiner uses for layout analysis, this project provides several parameters for text layout analysis. They control how text fragments are grouped into text lines and how text lines are grouped into paragraphs.

Parameter	Definition
`fontsize_between_ratio`	minimum ratio between the average font sizes of two adjacent text lines
`fontsize_linebreak_ratio`	minimum font size ratio between the last character of the upper text line and the first character of the lower text line
`max_line_spacing`	maximum spacing between two adjacent text lines
`max_word_spacing`	maximum spacing between words in one line
`indent_length`	length of the indentation
`min_column_gap`	minimum gap between columns in multi-column text blocks
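
To make the role of these thresholds concrete, here is a minimal sketch of paragraph grouping driven by two of them. It is not MinePDF's actual algorithm. The `Line` class, the `group_paragraphs` function, the default values and the choice to measure spacing relative to font size are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float       # y coordinate of the top edge (y grows downward here)
    bottom: float    # y coordinate of the bottom edge
    fontsize: float  # average font size of the line


def group_paragraphs(lines, max_line_spacing=0.5, fontsize_between_ratio=0.9):
    """Group lines (sorted top to bottom) into paragraphs.

    Assumed rule: two consecutive lines stay together when their gap,
    relative to the larger font size, is at most max_line_spacing and the
    smaller-to-larger font size ratio is at least fontsize_between_ratio.
    """
    paragraphs = []
    for line in lines:
        if paragraphs:
            prev = paragraphs[-1][-1]
            larger = max(prev.fontsize, line.fontsize)
            gap = (line.top - prev.bottom) / larger
            ratio = min(prev.fontsize, line.fontsize) / larger
            if gap <= max_line_spacing and ratio >= fontsize_between_ratio:
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])  # start a new paragraph
    return paragraphs


lines = [
    Line('Title', 0, 20, 20),
    Line('First line', 30, 40, 10),
    Line('second line.', 42, 52, 10),
    Line('New paragraph', 70, 80, 10),
]
for paragraph in group_paragraphs(lines):
    print(' '.join(line.text for line in paragraph))
```

In this example the title is split off because its font size differs too much from the body text, and the last line starts a new paragraph because the vertical gap exceeds the spacing threshold.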

jonix6/minepdf
