v0.10.0 #936

Merged
merged 28 commits into from
Jul 16, 2023

Conversation

jsvine
Owner

@jsvine jsvine commented Jul 16, 2023

[0.10.0] - 2023-07-16

Changed

  • Normalize color representation to tuple[float|int, ...] (#917). (57d51bb)
  • Replace Wand with pypdfium2 for page.to_image(...). (b049373)

Added

  • Add pdfplumber.repair(...) and .open(repair=True) (#824). (db6ae97)
  • Add Page.find_table(...) (#873). (3772af6)
  • Add quantize=True, colors=256, bits=8 arguments/defaults to PageImage.save(...). (b049373)
  • Extract and handle patterns + (some) color spaces. (97ca4b0)

Removed

Fixed

  • Fix bug for re-crops that use relative=True (#914). (0de6da9)
  • Handle use_text_flow more consistently (#912). (b1db5b8)

RitchieP and others added 28 commits May 2, 2023 14:46
Fixed a small typo
As noted in #912, `use_text_flow` was not being handled consistently, as
characters and words were being re-sorted without checking first if this
parameter was set to `True`.
Main edge-case was with `use_text_flow` on text-lines that then
backtracked. But this rewrite also aims to make the logic more explicit
and easier to follow.
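
To illustrate the idea behind the two `use_text_flow` commits above, here is a minimal sketch. It is not pdfplumber's actual code: `order_chars` is a hypothetical helper, and the positional sort is a simplification that ignores any tolerance-based line clustering. The point is only that re-sorting is skipped when `use_text_flow=True`.

```python
def order_chars(chars, use_text_flow=False):
    """Return chars in reading order (hypothetical helper, not pdfplumber's API).

    With use_text_flow=True, the order in which the characters appear in the
    PDF's content stream is trusted and left untouched. Otherwise, characters
    are re-sorted by position: top-to-bottom, then left-to-right.
    """
    if use_text_flow:
        return list(chars)
    return sorted(chars, key=lambda c: (c["top"], c["x0"]))


# A text line that "backtracks": the stream emits "b" before "a",
# even though "a" sits further left on the page.
chars = [
    {"text": "b", "top": 10, "x0": 20},
    {"text": "a", "top": 10, "x0": 5},
]

flow_text = "".join(c["text"] for c in order_chars(chars, use_text_flow=True))
sorted_text = "".join(c["text"] for c in order_chars(chars))
```
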
When using relative=True for a re-crop, pdfplumber was passing the wrong
bounding box to the cropping function. This commit fixes that bug and
also refactors CroppedPage.__init__(...) for clarity and consistency's
sake.
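
As a rough sketch of what a relative re-crop has to compute, the snippet below translates a bounding box given relative to a parent crop into page coordinates. `to_absolute_bbox` is a hypothetical name used for illustration, and the `(x0, top, x1, bottom)` ordering is assumed to match pdfplumber's bounding-box convention; the library's own implementation may differ.

```python
def to_absolute_bbox(relative_bbox, parent_bbox):
    """Shift a bbox expressed relative to parent_bbox's top-left corner
    into the same coordinate system as parent_bbox (hypothetical helper)."""
    x0, top, x1, bottom = relative_bbox
    parent_x0, parent_top, _, _ = parent_bbox
    return (x0 + parent_x0, top + parent_top, x1 + parent_x0, bottom + parent_top)


# First crop, relative to a full page whose bbox starts at the origin.
page_bbox = (0, 0, 200, 200)
first_crop = to_absolute_bbox((10, 10, 110, 110), page_bbox)

# Re-crop: the offsets must be taken from the first crop, not from the page.
second_crop = to_absolute_bbox((5, 5, 50, 50), first_crop)
```

In pdfplumber itself, the corresponding call would be along the lines of `page.crop(bbox, relative=True).crop(other_bbox, relative=True)`.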
This commit normalizes the type representation of `stroking_color` and
`non_stroking_color` values. Thanks to @dhdaines for pointing out this
inconsistency.

Previously, `pdfplumber` passed along `pdfminer.six`'s colors without
normalization. Due to quirks in `pdfminer.six`'s color handling, this
meant that those values could be floats, ints, lists, or tuples. This
commit normalizes all color values (when non-None) into n-tuples, where
(val,) represents grayscale colors, (val, val, val) represents RGB, and
(val, val, val, val) represents CMYK colors.

This should solve the consistency issue, although it might cause breaking
changes to code that filters for non-tuple values, e.g., `[c for c in
page.chars if c["non_stroking_color"] == [1, 0, 0]]`. Although breaking
changes are unpleasant, I think the tradeoff for longer-term consistency
is worth it.
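
The normalization described above can be sketched roughly as follows. This is an illustration written from the description in this commit message, not a copy of pdfplumber's code; `normalize_color` here is a stand-in name, and the sample `chars` list is made up.

```python
from typing import Optional, Tuple, Union

Number = Union[int, float]


def normalize_color(color) -> Optional[Tuple[Number, ...]]:
    """Coerce a float, int, list, or tuple color into an n-tuple.

    None is passed through unchanged, since it means "no color set".
    """
    if color is None:
        return None
    if isinstance(color, (list, tuple)):
        return tuple(color)
    return (color,)  # bare number: treat as a one-component (grayscale) color


# Made-up char objects, shaped like dicts with a color attribute.
chars = [
    {"text": "A", "non_stroking_color": normalize_color([1, 0, 0])},
    {"text": "B", "non_stroking_color": normalize_color(0)},
    {"text": "C", "non_stroking_color": normalize_color(None)},
]

# Filters written against list values need to compare against tuples instead.
red_chars = [c for c in chars if c["non_stroking_color"] == (1, 0, 0)]
old_style_matches = [c for c in chars if c["non_stroking_color"] == [1, 0, 0]]
```

Because a Python tuple never compares equal to a list, the old-style filter silently matches nothing after the change, which is the kind of breakage the message warns about.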
Previously, `pdfplumber.Page` had these table-getting methods:

- `.find_tables(...)`
- `.extract_tables(...)`
- `.extract_table(...)`

For consistency/completeness's sake, this commit adds:

- `.find_table(...)`

... which, analogous to `.extract_table(...)`, returns the largest table
on the page.

Indeed, `.extract_table(...)` now uses `.find_table(...)` beneath the
hood.

Thanks to @pdille for the suggestion, here:
#864 (reply in thread)
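
A minimal sketch of the "largest table" selection follows. It assumes that "largest" means the table with the most cells, with ties broken by position (topmost, then leftmost); that ranking and the `FakeTable` stand-in class are assumptions made for illustration, not something this commit message states.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeTable:
    """Stand-in for a detected table: a bbox plus a list of cell bboxes."""
    bbox: Tuple[float, float, float, float]  # (x0, top, x1, bottom)
    cells: List[tuple]


def find_largest_table(tables: List[FakeTable]) -> Optional[FakeTable]:
    """Return the table with the most cells, or None if there are no tables.

    Ties go to the table nearest the top of the page, then the leftmost one.
    """
    if not tables:
        return None
    return min(tables, key=lambda t: (-len(t.cells), t.bbox[1], t.bbox[0]))


small = FakeTable(bbox=(0, 0, 50, 50), cells=[(0, 0, 50, 50)])
big = FakeTable(bbox=(0, 100, 200, 300), cells=[(0, 100, 100, 300), (100, 100, 200, 300)])
largest = find_largest_table([small, big])
```

With pdfplumber, the equivalent call would be `page.find_table(...)`, after which the returned table object can be inspected or extracted.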
Inspired by #828

The PDF reference allows for "colors" to be defined as a series of
numbers and/or (much less commonly) patterns.

(See p. 288 and section 4.6 here:
https://ghostscript.com/~robin/pdf_reference17.pdf)

This commit separates out the pattern component of colors into their own
attributes, `stroking_pattern` and `non_stroking_pattern` so that they
don't muddle the interpretation of standard colors' tuple-of-numbers
representation.

This commit also adds code that attempts to fetch the `ncs`/`scs` color
space of each object. Due to current limitations of pdfminer.six,
however, the only such color space immediately available is the `ncs`
(non-stroking color space) property of char objects.
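
One way to picture the color/pattern split is the sketch below. It assumes the pattern, when present, arrives as the last element of the raw color value; `PatternName` is a hypothetical stand-in for whatever name object the PDF parser produces, and `separate_pattern` is an illustrative helper rather than the library's actual function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternName:
    """Hypothetical stand-in for a parsed PDF name object such as /P1."""
    name: str


def separate_pattern(color):
    """Split a raw color tuple into (numeric_components, pattern_name).

    Assumes a pattern, if any, is the final element. Numeric components
    become None when the color consists of a pattern alone.
    """
    if color and isinstance(color[-1], PatternName):
        return (tuple(color[:-1]) or None, color[-1].name)
    return (tuple(color), None)


plain = separate_pattern((1, 0, 0))
tinted = separate_pattern((0.5, PatternName("P1")))
pattern_only = separate_pattern((PatternName("P1"),))
```
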
This commit swaps out Wand (and its non-Python dependencies ImageMagick
and Ghostscript) for pypdfium2 for PageImage rendering. This has some
advantages:

- Less finicky: Wand often caused users problems, due to "MagickWand
  shared library not found" and "PolicyError: not authorized `PDF'"
  issues. By contrast, pypdfium2 seems (at least at first) to be more
  self-contained and not require any system-tweaking.
- Faster: pypdfium2 appears to render images more quickly than Wand (see
  @cmdlineuser's tests in #899)
- More flexible: pypdfium2 appears to generate images with greater color
  depth. By default, pdfplumber quantizes those images so that they
  save/display compactly (in fact, with smaller file sizes than the
  previous code), but this commit also adds parameters to retain all/more
  of the original, more detailed colors.

Thanks to @cmdlineuser in #899 for the suggestion.
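
For reference, a short usage sketch of the new `PageImage.save(...)` arguments listed in the changelog (`quantize=True, colors=256, bits=8`). The function name, file paths, and the `resolution=150` value are placeholders chosen for this example; pdfplumber is imported lazily so the snippet can be loaded without it installed.

```python
def save_page_images(pdf_path, quantized_path, full_color_path, page_index=0):
    """Render one page twice: once quantized (the default), once without.

    All paths are caller-supplied placeholders; requires pdfplumber >= 0.10.0.
    """
    import pdfplumber  # lazy import: only needed when the function is called

    with pdfplumber.open(pdf_path) as pdf:
        im = pdf.pages[page_index].to_image(resolution=150)
        # Default behavior, spelled out: reduce to a 256-color palette.
        im.save(quantized_path, format="PNG", quantize=True, colors=256, bits=8)
        # Keep the renderer's full color depth, at the cost of a larger file.
        im.save(full_color_path, format="PNG", quantize=False)
```

A call such as `save_page_images("example.pdf", "page-small.png", "page-full.png")` would then write both variants for comparison.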
This commit adds convenience methods to repair PDFs on the fly and/or to
write repaired PDFs to disk.

Currently, this does so via Ghostscript using the method we've asked
many users to try by following the instructions at
https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file

Now, hopefully, this saves folks a few steps.
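
A brief usage sketch of the two entry points named in the changelog, `pdfplumber.repair(...)` and `.open(repair=True)`. The wrapper function names and paths are placeholders, the `outfile` keyword is written from memory of the library's README and should be checked against the installed version, and Ghostscript is assumed to be installed since the repair runs through it.

```python
def count_pages_with_repair(pdf_path):
    """Open a possibly damaged PDF, repairing it on the fly (in memory)."""
    import pdfplumber  # lazy import: only needed when the function is called

    with pdfplumber.open(pdf_path, repair=True) as pdf:
        return len(pdf.pages)


def write_repaired_copy(pdf_path, repaired_path):
    """Write a repaired copy of the PDF to disk for later use."""
    import pdfplumber

    # Assumption: `outfile` is the keyword for the destination path.
    pdfplumber.repair(pdf_path, outfile=repaired_path)
```
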

codecov bot commented Jul 16, 2023

Codecov Report

Merging #936 (28c0afc) into stable (ae676ae) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            stable      #936   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           17        18    +1     
  Lines         1532      1585   +53     
=========================================
+ Hits          1532      1585   +53     
Impacted Files              Coverage Δ
pdfplumber/__init__.py      100.00% <100.00%> (ø)
pdfplumber/_version.py      100.00% <100.00%> (ø)
pdfplumber/display.py       100.00% <100.00%> (ø)
pdfplumber/page.py          100.00% <100.00%> (ø)
pdfplumber/pdf.py           100.00% <100.00%> (ø)
pdfplumber/repair.py        100.00% <100.00%> (ø)
pdfplumber/utils/text.py    100.00% <100.00%> (ø)

@jsvine jsvine merged commit 00386ad into stable Jul 16, 2023
12 checks passed
3 participants