PDF comments not fully supported #288

Closed
petervwyatt opened this issue Feb 2, 2021 · 9 comments

@petervwyatt

Not all "token delimiter" characters are handled correctly unless whitespace is also present, which no PDF specification requires. A test file and a detailed explanation are available in this GitHub repo.

I'm no Go expert, but I think this is because strconv.ParseFloat in the Go standard library follows different lexical rules from PDF.

$ pdfcpu validate -v CompactedPDFSyntaxTest.pdf
validating(mode=relaxed) CompactedPDFSyntaxTest.pdf ...
 INFO: 2021/02/02 16:14:33 PDF Version 1.5 conforming reader
Fatal: strconv.ParseFloat: parsing "999%comment": invalid syntax
dereferenceObject: problem dereferencing object 4
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.dereferenceObject
        /mnt/c/Temp/share/pdfcpu/pkg/pdfcpu/read.go:2240
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.dereferenceObjects
        /mnt/c/Temp/share/pdfcpu/pkg/pdfcpu/read.go:2340
...
@hhrutter
Collaborator

hhrutter commented Feb 2, 2021

I would go many extra miles before I blame the Go standard lib.
Smells like a bug..

Excellent feedback, thanks 💚

@hhrutter self-assigned this on Feb 7, 2021
@hhrutter
Collaborator

hhrutter commented Feb 7, 2021

I am puzzled because I am having a different problem with this file:

startxref 3369 but xref is at offset 3311.

Is this intentional?
The weird thing is Mac Preview opens the file without complaint but Adobe Reader chokes on it.

@hhrutter
Collaborator

hhrutter commented Feb 7, 2021

OK, looks like I am running into the same issue as you are.
The % comments are not handled correctly in certain situations.
I am still wondering why the pointer to xref is corrupt.
Because of this, pdfcpu falls back to a workaround meant for situations like this.
Am I missing something here?

@petervwyatt
Author

I think we must be looking at different files! The "xref" keyword is at 3369 (dec):

$ hexdump -C -s 3369 CompactedPDFSyntaxTest.pdf
00000d29  78 72 65 66 0d 0a 30 20  37 0d 0a 30 30 30 30 30  |xref..0 7..00000|
00000d39  30 30 30 30 30 20 36 35  35 33 35 20 66 0d 0a 30  |00000 65535 f..0|
00000d49  30 30 30 30 30 31 32 32  38 20 30 30 30 30 30 20  |000001228 00000 |
00000d59  6e 0d 0a 30 30 30 30 30  30 31 35 35 35 20 30 30  |n..0000001555 00|
00000d69  30 30 30 20 6e 0d 0a 30  30 30 30 30 30 31 35 37  |000 n..000000157|
00000d79  32 20 30 30 30 30 30 20  6e 0d 0a 30 30 30 30 30  |2 00000 n..00000|
00000d89  30 32 30 33 31 20 30 30  30 30 30 20 6e 0d 0a 30  |02031 00000 n..0|
00000d99  30 30 30 30 30 32 34 32  38 20 30 30 30 30 30 20  |000002428 00000 |
00000da9  6e 0d 0a 30 30 30 30 30  30 33 32 39 34 20 30 30  |n..0000003294 00|
00000db9  30 30 30 20 6e 0d 0a 74  72 61 69 6c 65 72 0d 0a  |000 n..trailer..|
... snip ...

@hhrutter
Collaborator

hhrutter commented Feb 8, 2021

Not in the checked-in version in the safedocs repo:

Go-> hexdump -C -s 3369 CompactedPDFSyntaxTest.pdf
00000d29  30 30 30 30 30 20 6e 0a  30 30 30 30 30 30 31 35  |00000 n.00000015|
00000d39  37 32 20 30 30 30 30 30  20 6e 0a 30 30 30 30 30  |72 00000 n.00000|
00000d49  30 32 30 33 31 20 30 30  30 30 30 20 6e 0a 30 30  |02031 00000 n.00|
00000d59  30 30 30 30 32 34 32 38  20 30 30 30 30 30 20 6e  |00002428 00000 n|
00000d69  0a 30 30 30 30 30 30 33  32 39 34 20 30 30 30 30  |.0000003294 0000|
00000d79  30 20 6e 0a 74 72 61 69  6c 65 72 0a 3c 3c 2f 52  |0 n.trailer.<</R|
00000d89  6f 6f 74 20 31 20 30 20  52 2f 49 6e 66 6f 25 63  |oot 1 0 R/Info%c|
00000d99  6f 6d 6d 65 6e 74 20 61  66 74 65 72 20 6e 61 6d  |omment after nam|
00000da9  65 0a 3c 3c 2f 53 75 62  6a 65 63 74 28 43 6f 6d  |e.<</Subject(Com|
00000db9  70 61 63 74 65 64 20 53  79 6e 74 61 78 20 76 33  |pacted Syntax v3|
00000dc9  2e 30 29 25 63 6f 6d 6d  65 6e 74 20 61 66 74 65  |.0)%comment afte|
00000dd9  72 20 6c 69 74 65 72 61  6c 20 73 74 72 69 6e 67  |r literal string|
00000de9  20 65 6e 64 0a 2f 54 69  74 6c 65 3c 34 33 36 66  | end./Title<436f|
00000df9  36 64 37 30 36 31 36 33  37 34 36 35 36 34 32 30  |6d70616374656420|
00000e09  37 33 37 39 36 65 37 34  36 31 37 38 3e 25 63 6f  |73796e746178>%co|
00000e19  6d 6d 65 6e 74 20 61 66  74 65 72 20 68 65 78 20  |mment after hex |
00000e29  73 74 72 69 6e 67 20 65  6e 64 0a 2f 4b 65 79 77  |string end./Keyw|
00000e39  6f 72 64 73 28 50 44 46  2c 43 6f 6d 70 61 63 74  |ords(PDF,Compact|
00000e49  65 64 2c 53 79 6e 74 61  78 2c 49 53 4f 20 33 32  |ed,Syntax,ISO 32|
00000e59  30 30 30 2d 32 3a 32 30  32 30 29 2f 43 72 65 61  |000-2:2020)/Crea|
00000e69  74 69 6f 6e 44 61 74 65  28 44 3a 32 30 32 30 30  |tionDate(D:20200|
00000e79  33 31 37 29 2f 41 75 74  68 6f 72 28 50 65 74 65  |317)/Author(Pete|
00000e89  72 20 57 79 61 74 74 29  2f 43 72 65 61 74 6f 72  |r Wyatt)/Creator|
00000e99  3c 34 38 36 31 36 65 36  34 32 64 36 35 36 34 36  |<48616e642d65646|
00000ea9  39 37 34 3e 2f 50 72 6f  64 75 63 65 72 3c 34 38  |974>/Producer<48|
00000eb9  36 31 36 65 36 34 32 64  36 35 36 34 36 39 37 34  |616e642d65646974|
00000ec9  3e 3e 3e 0a 2f 49 44 5b  3c 31 38 44 36 42 36 34  |>>>./ID[<18D6B64|
00000ed9  31 32 34 35 43 30 33 46  41 42 45 36 37 44 39 33  |1245C03FABE67D93|
00000ee9  41 44 38 37 39 44 36 45  43 3e 3c 36 32 36 34 39  |AD879D6EC><62649|
00000ef9  39 32 43 39 32 30 37 34  35 33 33 41 34 36 41 30  |92C92074533A46A0|
00000f09  31 39 43 37 43 46 39 42  46 42 36 3e 5d 2f 53 69  |19C7CF9BFB6>]/Si|
00000f19  7a 65 20 37 3e 3e 0a 73  74 61 72 74 78 72 65 66  |ze 7>>.startxref|
00000f29  0a 33 33 36 39 20 20 0a  25 25 45 4f 46           |.3369  .%%EOF|

@hhrutter changed the title from "PDF token delimiters not fully supported" to "PDF comments not fully supported" on Feb 9, 2021
@hhrutter
Collaborator

hhrutter commented Feb 9, 2021

After playing around with this file (using the last-resort workaround mentioned above, which pdfcpu provides for parsing a file with a corrupt xref pointer, plus a patch for comment handling), I am able to parse it now, but it fails validation. 😞

It seems to be using the escape sequence \" in a JavaScript action dict, which is undefined in PDF 1.7:

1:   offset=    1204 generation=0 pdfcpu.Dict type=Catalog
<<
	<AA, <<
		<WP, <<
			<JS, (//JavaScript comment
app.alert\(\"Document Will Print Action!\"\))>
			<S, JavaScript>
		>>>
	>>>
	<MarkInfo, <<
		<Marked, true>
		<Suspects, true>
		<UserProperties, true>
	>>>
	<Pages, (3 0 R)>
	<Type, Catalog>
>>

Screenshot 2021-02-09 at 01 45 08

@petervwyatt
Author

Great find! 🥇 I will fix that - thanks. I shouldn't blindly copy'n'paste from the files of others...

Re the file in GitHub - I'm definitely looking at a file pulled down from GitHub via GitHub Desktop, so it may be a line-ending configuration issue with Git clients. I will try different clients and different platforms. Sigh.

@hhrutter
Collaborator

hhrutter commented Feb 9, 2021

I cloned the repo via the git command line, which should be the safest way,
and when I open the file on the GitHub page and click the Download button there, I get the same file.
Only when I hover over the link and use Save as.. is the downloaded file unreadable.

@petervwyatt
Author

OK - fixed after much faffing around. It looks like different git-aware tools do their own thing in semi-platform-dependent ways. Tested on Win10, WSL2-Ubuntu, native Ubuntu and multiple browsers with a total of 9 different git-aware tools via HTTPS and SSH (some tools seem to do different things on add/upload vs download!). The solution that now works for me across everything was to add a .gitattributes file that explicitly states *.pdf files are always binary.
So please pull down the repo again: the PDF now has both the xref issue and your JS issue fixed.
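
The contents of that .gitattributes file are not shown in this thread. A minimal sketch of what such a file might contain, using Git's built-in `binary` macro attribute (which turns off text conversion, diffing and merging for matching paths):

```gitattributes
# Never apply line-ending conversion to PDFs
*.pdf binary
```

With a rule like this in place, Git should no longer rewrite line endings inside PDFs on checkout or commit, so byte offsets such as the startxref value stay valid.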
