Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add PageObject.images attribute #1330

Merged
merged 20 commits into from Sep 24, 2022
Merged

ENH: Add PageObject.images attribute #1330

merged 20 commits into from Sep 24, 2022

Conversation

MartinThoma
Copy link
Member

No description provided.

@codecov
Copy link

codecov bot commented Sep 15, 2022

Codecov Report

Base: 94.71% // Head: 94.55% // Decreases project coverage by -0.15% ⚠️

Coverage data is based on head (50447af) compared to base (71de6c8).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1330      +/-   ##
==========================================
- Coverage   94.71%   94.55%   -0.16%     
==========================================
  Files          30       28       -2     
  Lines        5181     5016     -165     
  Branches     1060     1033      -27     
==========================================
- Hits         4907     4743     -164     
  Misses        164      164              
+ Partials      110      109       -1     
Impacted Files Coverage Δ
PyPDF2/_page.py 95.12% <100.00%> (+0.12%) ⬆️
PyPDF2/filters.py 97.23% <100.00%> (+0.01%) ⬆️
PyPDF2/generic/_base.py 100.00% <0.00%> (ø)
PyPDF2/_utils.py
PyPDF2/__init__.py
PyPDF2/_writer.py 91.10% <0.00%> (+0.06%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@MartinThoma MartinThoma marked this pull request as ready for review September 17, 2022 11:59
@MartinThoma
Copy link
Member Author

@pubpub-zz @MasterOdin What do you think about this PR?

While I wrote it, I realized that PyPDF2 does something wrong with image extraction in some cases. I marked those tests with xfail. The point of this PR is not to fix those issues, but to provide a convenient interface for getting images from PDF pages. That means:

  • Define the property / the method to get images
  • Define the return value (List[File] as well as the new File class)

@pubpub-zz You mentioned that this method might not get all images of a page. For this PR, this would be acceptable to me. We can fix that later.

As a follow-up step we might use the File class for attachments as well.

I'm uncertain about the mime_type parts. Should we use extension everywhere instead?

The reason why I chose mime-type were spelling inconsistencies like this:

  • PNG vs png
  • jpg vs jpeg

Additionally, I'm uncertain if using extension vs mime_type makes a difference if we use the File class for attachments as well.

Copy link
Collaborator

@pubpub-zz pubpub-zz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.
Perhaps We should add in TODO about inline images (not extracted yet)
Also add a warning message if unknown type
mime type looks more appropriate for me too

PyPDF2/_page.py Outdated
Comment on lines 382 to 385
filename = f"{obj[1:]}.{File._mime2extension(mime_type)}"
images_extracted.append(
File(name=filename, data=byte_stream, mime_type=mime_type)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strikes me as an odd abstraction, where we are passing in the mime_type as part of the File constructor, but we also need to construct the full filename, using a private static function to boot, but also that the file_extension method doesn't correspond to the extension of the passed in name, but rather mime_type.

If we go the route of passing in the mime_type for the File, I'd advocate for just passing in name sans extension altogether and we can have a special property function that does the concatenation of name + extension to give a "filename" on demand as needed by users.

The only caveat would be for attachments, it may make sense to pass in the full filename, but I'm not well versed on that part of the spec to even know how that API might look.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll go with 'File only has name + data (no mime_type)' for the moment, because it seems to have only advantages:

  • Less clutter / less code to maintain
  • No potential to discover the wrong mime type
  • We could make _xobj_to_image just pass the file extension as before
  • As _xobj_to_image is a private function, we can easily change the behavior if we see a clear advantage

PyPDF2/_utils.py Outdated Show resolved Hide resolved
@MartinThoma MartinThoma merged commit 85b3e87 into main Sep 24, 2022
@MartinThoma MartinThoma deleted the image-extraction branch September 24, 2022 05:39
MartinThoma added a commit that referenced this pull request Sep 25, 2022
New Features (ENH):
-  Addition of optional visitor-functions in extract_text() (#1252)
-  Add metadata.creation_date and modification_date (#1364)
-  Add PageObject.images attribute (#1330)

Bug Fixes (BUG):
-  Lookup index in _xobj_to_image can be ByteStringObject (#1366)
-  \'IndexError: index out of range\' when using extract_text (#1361)
-  Errors in transfer_rotation_to_content() (#1356)

Robustness (ROB):
-  Ensure update_page_form_field_values does not fail if no fields (#1346)

Testing (TST):
-  read_string_from_stream performance (#1355)

Full Changelog: 2.10.9...2.11.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants