ENH: Add PageObject.images attribute #1330

MartinThoma · 2022-09-07T19:30:45Z

No description provided.

codecov · 2022-09-15T20:25:04Z

Codecov Report

Base: 94.71% // Head: 94.55% // Decreases project coverage by -0.15% ⚠️

Coverage data is based on head (50447af) compared to base (71de6c8).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1330      +/-   ##
==========================================
- Coverage   94.71%   94.55%   -0.16%     
==========================================
  Files          30       28       -2     
  Lines        5181     5016     -165     
  Branches     1060     1033      -27     
==========================================
- Hits         4907     4743     -164     
  Misses        164      164              
+ Partials      110      109       -1

Impacted Files	Coverage Δ
PyPDF2/_page.py	`95.12% <100.00%> (+0.12%)`	⬆️
PyPDF2/filters.py	`97.23% <100.00%> (+0.01%)`	⬆️
PyPDF2/generic/_base.py	`100.00% <0.00%> (ø)`
PyPDF2/_utils.py
PyPDF2/__init__.py
PyPDF2/_writer.py	`91.10% <0.00%> (+0.06%)`	⬆️

MartinThoma · 2022-09-17T13:08:09Z

@pubpub-zz @MasterOdin What do you think about this PR?

While I wrote it, I realized that PyPDF2 does something wrong with image extraction in some cases. I marked those tests with xfail. The point of this PR is not to fix those issues, but to provide a convenient interface for getting images from PDF pages. That means:

- Define the property / the method to get images
- Define the return value (List[File] as well as the new File class)

@pubpub-zz You mentioned that this method might not get all images of a page. For this PR, this would be acceptable to me. We can fix that later.

As a follow-up step we might use the File class for attachments as well.

I'm uncertain about the mime_type parts. Should we use extension everywhere instead?

I chose mime type because of spelling inconsistencies like these:

- PNG vs png
- jpg vs jpeg

Additionally, I'm uncertain if using extension vs mime_type makes a difference if we use the File class for attachments as well.
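To illustrate the spelling problem described above, here is a minimal sketch of how differently spelled extensions could be collapsed onto one canonical mime type. The table and the function name `normalize_to_mime` are assumptions made for this example; they are not part of PyPDF2.

```python
# Hypothetical lookup table for illustration only; not PyPDF2's actual mapping.
_EXTENSION_TO_MIME = {
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
}


def normalize_to_mime(extension: str) -> str:
    """Map an extension in any spelling (PNG, .png, jpg, jpeg) to a mime type."""
    # Lower-case and drop a leading dot so "PNG", "png" and ".png" all match.
    key = extension.lower().lstrip(".")
    try:
        return _EXTENSION_TO_MIME[key]
    except KeyError:
        raise ValueError(f"unknown image extension: {extension!r}") from None
```

With such a table, `PNG` and `png` resolve to the same value, as do `jpg` and `jpeg`, which is the consistency argument for mime types made in the comment.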

pubpub-zz

Sounds good.
Perhaps we should add a TODO about inline images (not extracted yet).
Also add a warning message if the type is unknown.
Mime type looks more appropriate to me too.
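The suggested warning for unknown types could look roughly like the sketch below. The function `extension_for_filter` and its table are hypothetical and only loosely modelled on how PDF stream filters relate to image formats; the actual logic in `_xobj_to_image` may differ.

```python
import warnings
from typing import Optional

# Illustrative mapping from PDF stream filters to file extensions.
# Assumption for this sketch: /FlateDecode data is re-encoded as PNG elsewhere.
_FILTER_TO_EXTENSION = {
    "/DCTDecode": "jpg",
    "/JPXDecode": "jp2",
    "/CCITTFaxDecode": "tiff",
    "/FlateDecode": "png",
}


def extension_for_filter(filter_name: str) -> Optional[str]:
    """Return the extension for a filter, warning when the filter is unknown."""
    extension = _FILTER_TO_EXTENSION.get(filter_name)
    if extension is None:
        # Warn instead of raising so the remaining images can still be processed.
        warnings.warn(f"Unknown image filter {filter_name!r}", stacklevel=2)
    return extension
```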

MasterOdin · 2022-09-18T02:45:56Z

PyPDF2/_page.py

+                    filename = f"{obj[1:]}.{File._mime2extension(mime_type)}"
+                    images_extracted.append(
+                        File(name=filename, data=byte_stream, mime_type=mime_type)
+                    )


This strikes me as an odd abstraction. We pass mime_type to the File constructor, yet we also have to construct the full filename ourselves, using a private static function to boot. On top of that, the file_extension method doesn't correspond to the extension of the passed-in name, but to mime_type.

If we go the route of passing in the mime_type for the File, I'd advocate for just passing in name sans extension altogether and we can have a special property function that does the concatenation of name + extension to give a "filename" on demand as needed by users.

The only caveat is attachments: there it may make sense to pass in the full filename, but I'm not versed well enough in that part of the spec to know how that API might look.
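The alternative proposed in this review comment (pass the name without extension and derive the filename on demand) might be sketched as follows. The class name `FileWithMime`, the `filename` property, and the lookup table are assumptions for illustration, not code from the PR.

```python
from dataclasses import dataclass

# Hypothetical table; only the two types discussed in this thread are listed.
_MIME_TO_EXTENSION = {"image/png": "png", "image/jpeg": "jpg"}


@dataclass
class FileWithMime:
    name: str  # stem only, e.g. "Im0", without any extension
    data: bytes
    mime_type: str

    @property
    def filename(self) -> str:
        """Concatenate name and extension on demand, derived from mime_type."""
        return f"{self.name}.{_MIME_TO_EXTENSION[self.mime_type]}"
```

In this shape the extension can never disagree with the mime type, because the filename is computed rather than stored.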

MartinThoma · 2022-09-24

I think I'll go with 'File only has name + data (no mime_type)' for the moment, because it seems to have only advantages:

- Less clutter / less code to maintain
- No potential to discover the wrong mime type
- We could make _xobj_to_image just pass the file extension as before
- As _xobj_to_image is a private function, we can easily change the behavior if we see a clear advantage
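A rough sketch of the "name + data only" design and how a caller might consume it is given below. The `File` dataclass here is a stand-in written for this example (the real class lives in PyPDF2 and may have more to it), and `save_all` is a hypothetical helper.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass
class File:
    """Stand-in for the File class discussed in this PR: a name and raw bytes."""
    name: str  # full filename including extension, e.g. "Im0.jpg"
    data: bytes


def save_all(files: Iterable[File], directory: str) -> List[Path]:
    """Write each file into `directory`, prefixing an index to avoid clashes."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, file in enumerate(files):
        target = out_dir / f"{index}_{file.name}"
        target.write_bytes(file.data)
        written.append(target)
    return written
```

A caller would pass the list returned by the new images attribute to something like `save_all`; the index prefix is a design choice for this sketch, since two images on different pages can share a name.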

New Features (ENH):
- Addition of optional visitor-functions in extract_text() (#1252)
- Add metadata.creation_date and modification_date (#1364)
- Add PageObject.images attribute (#1330)

Bug Fixes (BUG):
- Lookup index in _xobj_to_image can be ByteStringObject (#1366)
- 'IndexError: index out of range' when using extract_text (#1361)
- Errors in transfer_rotation_to_content() (#1356)

Robustness (ROB):
- Ensure update_page_form_field_values does not fail if no fields (#1346)

Testing (TST):
- read_string_from_stream performance (#1355)

Full Changelog: 2.10.9...2.11.0

MartinThoma added 14 commits September 7, 2022 21:30

ENH: Add PageObject.images attribute

5d1d71c

Add docs

55669e3

Add docs

8cb5c98

fix flake8

85a67e5

Add mime types

4aed7ee

Add more mime types

82e4796

Fix image extraction

ad19cc3

Update workflows

bb7185b

Flake8 fix

fe7a965

Update

c18d2a6

mime type

4d8ac66

Fix

53bd98c

mypy

b0cf635

fix imports

f1e7b84

Merge branch 'main' into image-extraction

80479ef

MartinThoma marked this pull request as ready for review September 17, 2022 11:59

MartinThoma requested review from pubpub-zz and MasterOdin September 17, 2022 12:57

pubpub-zz approved these changes Sep 17, 2022

MasterOdin reviewed Sep 18, 2022

MartinThoma and others added 3 commits September 18, 2022 11:24

Rename 'file_extension' to 'format'

30fbf41

This is consistent with Pillow: https://pillow.readthedocs.io/en/latest/reference/Image.html#PIL.Image.Image.format Co-authored-by: Matthew Peveler <matt.peveler@gmail.com>

Format rename

4b77a6a

Remove mime type

c491881

MartinThoma force-pushed the image-extraction branch from 21b54ae to c491881 Compare September 24, 2022 05:25

MartinThoma added 2 commits September 24, 2022 07:27

Add docstring mentioning inline images

44efe78

Fix test

50447af

MartinThoma merged commit 85b3e87 into main Sep 24, 2022

MartinThoma deleted the image-extraction branch September 24, 2022 05:39
