Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT: Throw PdfReadError if Trailer can't be read #1298

Merged
merged 11 commits into from Aug 31, 2022

Conversation

ediamondscience
Copy link
Contributor

@ediamondscience ediamondscience commented Aug 29, 2022

… Added myself to contributor list.

In reference to #1279 I've modified _reader.py to throw a PdfReadError if the trailer doesn't exist, can't be read, or does not point to the object containing the metadata for the document.

Three pdf files were uploaded in relation to aforementioned issue, pca_var.pdf contained no trailer, and the other two had corrupted declarations that PyPDF2 could not read. I'll include them here for those interested.

From Rotary Potentiometer.pdf:
trailer^M <</Size 267/Prev 2815238/Root 176 0 R/Info 174 0 R/ID[<8CC42D96833D2A438E62CB60D00F610F><5C480BCE388CB6408C83C951FDD0DA22>]>>^M startxref^M 0^M %%EOF^M

From Status_v1_Reviewers-Guide.pdf:
trailer
/Size
447
/Root
4
0
R
/Info
5
0
R
startxref
6128518
(sorry for using multiple code blocks, but I think it's important to show that there are line breaks for each of these lines in my terminal)

We should probably figure out why these trailers are displayed like this. If I had to guess it might be because of a different encoding for line breaks. If we can update the way that we read the trailer to account for this we might be able to read useful information out of more files.

All three files from the issue should now throw an error when the test code is run. This should close #1279

@ediamondscience ediamondscience changed the title Added PdfReadError in cases where trailer is absent of can't be read.… Added PdfReadError in cases where trailer is absent or can't be read.… Aug 29, 2022
@codecov
Copy link

codecov bot commented Aug 29, 2022

Codecov Report

Merging #1298 (4f31b64) into main (4e7602a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1298   +/-   ##
=======================================
  Coverage   95.02%   95.02%           
=======================================
  Files          30       30           
  Lines        4986     4988    +2     
  Branches     1025     1026    +1     
=======================================
+ Hits         4738     4740    +2     
  Misses        141      141           
  Partials      107      107           
Impacted Files Coverage Δ
PyPDF2/_reader.py 91.23% <100.00%> (+0.02%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

PyPDF2/_reader.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

Can you please add a test that covers this? You mentioned that https://github.com/p-sampson-II/SEEDLab-TAs-2022/blob/378e6f583ed9de274a8bda9240dfa7aa6872a733/Alex%20Curtis/Datasheets/Rotary%20Potentiometer.pdf shows the issue. You can use the file via:

url = "https://github.com/p-sampson-II/SEEDLab-TAs-2022/blob/378e6f583ed9de274a8bda9240dfa7aa6872a733/Alex%20Curtis/Datasheets/Rotary%20Potentiometer.pdf"
name = "github-Rotary-Potentiometer.pdf"
reader = PdfReader(BytesIO(get_pdf_from_url(url, name=name)))

ediamondscience and others added 2 commits August 30, 2022 04:30
Removed extraneous is True statement on line 329 of _reader.py

Co-authored-by: Martin Thoma <info@martin-thoma.de>
PyPDF2/_reader.py Outdated Show resolved Hide resolved
@MartinThoma MartinThoma changed the title Added PdfReadError in cases where trailer is absent or can't be read.… MAINT: Throwd PdfReadError if /Trailer can't be read Aug 30, 2022
@MartinThoma MartinThoma changed the title MAINT: Throwd PdfReadError if /Trailer can't be read MAINT: Throwd PdfReadError if Trailer can't be read Aug 30, 2022
@MartinThoma MartinThoma changed the title MAINT: Throwd PdfReadError if Trailer can't be read MAINT: Throw PdfReadError if Trailer can't be read Aug 30, 2022
@MartinThoma
Copy link
Member

Thank you for your work on this one!

I've just noticed that you've added resources/unreadablemetadata.pdf which has 2MB. In order to keep this repository small (and thus easy to clone, also with bad connections), I would like this file to be in https://github.com/py-pdf/sample-files/ .

Could you please add the file there? (I can also do it + use the "Contributed-by" feature of Github to attribute it to you)

1 similar comment
@MartinThoma
Copy link
Member

Thank you for your work on this one!

I've just noticed that you've added resources/unreadablemetadata.pdf which has 2MB. In order to keep this repository small (and thus easy to clone, also with bad connections), I would like this file to be in https://github.com/py-pdf/sample-files/ .

Could you please add the file there? (I can also do it + use the "Contributed-by" feature of Github to attribute it to you)

@ediamondscience
Copy link
Contributor Author

@MartinThoma I've added a pull request to get that file into the correct repository and updated the file location in test_reader.py

The commit may fail some tests in this repository before the sample files are merged, but I've tested it on my own rig here with pytest and it should work perfectly after the merge.

@MartinThoma
Copy link
Member

I was just blocked at work and added the updated submodule to PyPDF2 :-)

@MartinThoma
Copy link
Member

Could you please do git rm resources/unreadablemetadata.pdf?

@ediamondscience
Copy link
Contributor Author

Looks like the the reader causes some issues if the metadata is unreadable. Is it the right move to throw a PdfReadError for this situation?

@MartinThoma
Copy link
Member

Sorry, my mistake. I'll fix the test tomorrow 🙈

@MartinThoma MartinThoma merged commit 5c76c8f into py-pdf:main Aug 31, 2022
MartinThoma added a commit that referenced this pull request Sep 4, 2022
Version 2.10.5, 2022-09-04
--------------------------

New Features (ENH):
-  Process XRefStm (#1297)
-  Auto-detect RTL for text extraction (#1309)

Bug Fixes (BUG):
-  Avoid scaling cropbox twice (#1314)

Robustness (ROB):
-  Fix offset correction in revised PDF (#1318)
-  Crop data of /U and /O in encryption dictionary to 48 bytes (#1317)
-  MultiLine bfrange in cmap (#1299)
-  Cope with 2 digit codes in bfchar (#1310)
-  Accept '/annn' charset as ASCII code (#1316)
-  Log errors during Float / NumberObject initialization (#1315)
-  Cope with corrupted entries in xref table (#1300)

Documentation (DOC):
-  Migration guide (PyPDF2 1.x \xe2\x9e\x94 2.x) (#1324)
-  Creating a coverage report (#1319)
-  Fix AnnotationBuilder.free_text example (#1311)
-  Fix usage of page.scale by replacing it with page.scale_by (#1313)

Developer Experience (DEV):
-  Only run coverage for PyPDF2

Maintenance (MAINT):
-  PdfReaderProtocol (#1303)
-  Throw PdfReadError if Trailer can't be read (#1298)
-  Remove catching OverflowException (#1302)

Full Changelog: 2.10.4...2.10.5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TypeError: 'NoneType' object is not iterable
2 participants