MAINT: Throw PdfReadError if Trailer can't be read #1298

ediamondscience · 2022-08-29T01:18:24Z

… Added myself to contributor list.

In reference to #1279 I've modified _reader.py to throw a PdfReadError if the trailer doesn't exist, can't be read, or does not point to the object containing the metadata for the document.

Three pdf files were uploaded in relation to aforementioned issue, pca_var.pdf contained no trailer, and the other two had corrupted declarations that PyPDF2 could not read. I'll include them here for those interested.

From Rotary Potentiometer.pdf:
trailer^M <</Size 267/Prev 2815238/Root 176 0 R/Info 174 0 R/ID[<8CC42D96833D2A438E62CB60D00F610F><5C480BCE388CB6408C83C951FDD0DA22>]>>^M startxref^M 0^M %%EOF^M

From Status_v1_Reviewers-Guide.pdf:
trailer
/Size
447
/Root
4
0
R
/Info
5
0
R
startxref
6128518
(sorry for using multiple code blocks, but I think it's important to show that there are line breaks for each of these lines in my terminal)

We should probably figure out why these trailers are displayed like this. If I had to guess it might be because of a different encoding for line breaks. If we can update the way that we read the trailer to account for this we might be able to read useful information out of more files.

All three files from the issue should now throw an error when the test code is run. This should close #1279

… Added myself to contributor list.

codecov · 2022-08-29T01:24:14Z

Codecov Report

Merging #1298 (4f31b64) into main (4e7602a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1298   +/-   ##
=======================================
  Coverage   95.02%   95.02%           
=======================================
  Files          30       30           
  Lines        4986     4988    +2     
  Branches     1025     1026    +1     
=======================================
+ Hits         4738     4740    +2     
  Misses        141      141           
  Partials      107      107

Impacted Files	Coverage Δ
PyPDF2/_reader.py	`91.23% <100.00%> (+0.02%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

PyPDF2/_reader.py

MartinThoma · 2022-08-29T14:56:40Z

Can you please add a test that covers this? You mentioned that https://github.com/p-sampson-II/SEEDLab-TAs-2022/blob/378e6f583ed9de274a8bda9240dfa7aa6872a733/Alex%20Curtis/Datasheets/Rotary%20Potentiometer.pdf shows the issue. You can use the file via:

url = "https://github.com/p-sampson-II/SEEDLab-TAs-2022/blob/378e6f583ed9de274a8bda9240dfa7aa6872a733/Alex%20Curtis/Datasheets/Rotary%20Potentiometer.pdf"
name = "github-Rotary-Potentiometer.pdf"
reader = PdfReader(BytesIO(get_pdf_from_url(url, name=name)))

Removed extraneous is True statement on line 329 of _reader.py Co-authored-by: Martin Thoma <info@martin-thoma.de>

…f to test new condition

PyPDF2/_reader.py

MartinThoma · 2022-08-30T11:12:20Z

Thank you for your work on this one!

I've just noticed that you've added resources/unreadablemetadata.pdf which has 2MB. In order to keep this repository small (and thus easy to clone, also with bad connections), I would like this file to be in https://github.com/py-pdf/sample-files/ .

Could you please add the file there? (I can also do it + use the "Contributed-by" feature of Github to attribute it to you)

MartinThoma · 2022-08-30T11:12:30Z

Thank you for your work on this one!

I've just noticed that you've added resources/unreadablemetadata.pdf which has 2MB. In order to keep this repository small (and thus easy to clone, also with bad connections), I would like this file to be in https://github.com/py-pdf/sample-files/ .

Could you please add the file there? (I can also do it + use the "Contributed-by" feature of Github to attribute it to you)

…ensure accurate test results in future

…file directory

ediamondscience · 2022-08-30T11:51:34Z

@MartinThoma I've added a pull request to get that file into the correct repository and updated the file location in test_reader.py

The commit may fail some tests in this repository before the sample files are merged, but I've tested it on my own rig here with pytest and it should work perfectly after the merge.

MartinThoma · 2022-08-30T13:19:03Z

I was just blocked at work and added the updated submodule to PyPDF2 :-)

MartinThoma · 2022-08-30T13:20:24Z

Could you please do git rm resources/unreadablemetadata.pdf?

ediamondscience · 2022-08-30T16:32:34Z

Looks like the the reader causes some issues if the metadata is unreadable. Is it the right move to throw a PdfReadError for this situation?

MartinThoma · 2022-08-30T20:43:52Z

Sorry, my mistake. I'll fix the test tomorrow 🙈

Version 2.10.5, 2022-09-04 -------------------------- New Features (ENH): - Process XRefStm (#1297) - Auto-detect RTL for text extraction (#1309) Bug Fixes (BUG): - Avoid scaling cropbox twice (#1314) Robustness (ROB): - Fix offset correction in revised PDF (#1318) - Crop data of /U and /O in encryption dictionary to 48 bytes (#1317) - MultiLine bfrange in cmap (#1299) - Cope with 2 digit codes in bfchar (#1310) - Accept '/annn' charset as ASCII code (#1316) - Log errors during Float / NumberObject initialization (#1315) - Cope with corrupted entries in xref table (#1300) Documentation (DOC): - Migration guide (PyPDF2 1.x \xe2\x9e\x94 2.x) (#1324) - Creating a coverage report (#1319) - Fix AnnotationBuilder.free_text example (#1311) - Fix usage of page.scale by replacing it with page.scale_by (#1313) Developer Experience (DEV): - Only run coverage for PyPDF2 Maintenance (MAINT): - PdfReaderProtocol (#1303) - Throw PdfReadError if Trailer can't be read (#1298) - Remove catching OverflowException (#1302) Full Changelog: 2.10.4...2.10.5

Added PdfReadError in cases where trailer is absent of can't be read.…

c3fbdcf

… Added myself to contributor list.

ediamondscience changed the title ~~Added PdfReadError in cases where trailer is absent of can't be read.…~~ Added PdfReadError in cases where trailer is absent or can't be read.… Aug 29, 2022

MartinThoma reviewed Aug 29, 2022

View reviewed changes

PyPDF2/_reader.py Outdated Show resolved Hide resolved

ediamondscience and others added 2 commits August 30, 2022 04:30

Update PyPDF2/_reader.py

50e1813

Removed extraneous is True statement on line 329 of _reader.py Co-authored-by: Martin Thoma <info@martin-thoma.de>

Added test for unreadable metadata and pdf file unreadablemetadata.pd…

b3d24c6

…f to test new condition

MartinThoma reviewed Aug 30, 2022

View reviewed changes

PyPDF2/_reader.py Outdated Show resolved Hide resolved

Update PyPDF2/_reader.py

e953636

MartinThoma changed the title ~~Added PdfReadError in cases where trailer is absent or can't be read.…~~ MAINT: Throwd PdfReadError if /Trailer can't be read Aug 30, 2022

MartinThoma changed the title ~~MAINT: Throwd PdfReadError if /Trailer can't be read~~ MAINT: Throwd PdfReadError if Trailer can't be read Aug 30, 2022

MartinThoma changed the title ~~MAINT: Throwd PdfReadError if Trailer can't be read~~ MAINT: Throw PdfReadError if Trailer can't be read Aug 30, 2022

Clarified which exception should be thrown for unreadble metadata to …

584bea6

…ensure accurate test results in future

ediamondscience mentioned this pull request Aug 30, 2022

Added sample file to test for unreadable meta data py-pdf/sample-files#15

Merged

ediamondscience added 2 commits August 30, 2022 11:33

Updated reader test code to move unreadable meta data file to sample …

648bbb1

…file directory

Merge branch 'main' of https://github.com/ediamondscience/PyPDF2

a1733b4

Merge branch 'main' into main

40a6ea7

ediamondscience added 2 commits August 30, 2022 16:08

Removed unreadable metadata from ./resources folder

35bb613

Merge branch 'main' of https://github.com/ediamondscience/PyPDF2

820afd8

Merge branch 'main' into main

4f31b64

MartinThoma merged commit 5c76c8f into py-pdf:main Aug 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT: Throw PdfReadError if Trailer can't be read #1298

MAINT: Throw PdfReadError if Trailer can't be read #1298

ediamondscience commented Aug 29, 2022 •

edited

codecov bot commented Aug 29, 2022 •

edited

MartinThoma commented Aug 29, 2022

MartinThoma commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

ediamondscience commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

ediamondscience commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

MAINT: Throw PdfReadError if Trailer can't be read #1298

MAINT: Throw PdfReadError if Trailer can't be read #1298

Conversation

ediamondscience commented Aug 29, 2022 • edited

codecov bot commented Aug 29, 2022 • edited

Codecov Report

MartinThoma commented Aug 29, 2022

MartinThoma commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

ediamondscience commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

ediamondscience commented Aug 30, 2022

MartinThoma commented Aug 30, 2022

ediamondscience commented Aug 29, 2022 •

edited

codecov bot commented Aug 29, 2022 •

edited