"token too long" after merging a huge number of files #451

Closed
sefasenturk95 opened this issue Apr 7, 2022 · 9 comments

@sefasenturk95

I am trying to count the pages of a large PDF file. It is 147 MB and has 6638 pages. This code snippet shows how I count them:

pageCount, err := pdf.PageCountFile(b.pathToPDF)
if err != nil {
	return nil, 0, errors.Wrap(err, "error while count pages for pdf")
}

And this is the output:
Read: xRefTable failed: bufio.Scanner: token too long

I tried disabling the WriteXRefStream but that did not help. @hhrutter any idea on how to proceed?

@hhrutter
Collaborator

hhrutter commented Apr 7, 2022

Hi there!

First of all please ensure your PDF is valid and passes pdfcpu validate in.pdf on the CLI.
My guess is it will probably fail...

Then execute pdfcpu validate -vv in.pdf and report back the stack trace you are getting.
Then we take it from there.

Thank you for using pdfcpu 💚

hhrutter changed the title from 'Error while counting pages: Read: xRefTable failed: bufio.Scanner: token too long' to 'API: PageCountFile returns: xRefTable failed: bufio.Scanner: token too long' on Apr 7, 2022
hhrutter changed the title from 'API: PageCountFile returns: xRefTable failed: bufio.Scanner: token too long' to 'API: PageCountFile returns: token too long' on Apr 7, 2022
hhrutter changed the title from 'API: PageCountFile returns: token too long' to 'api: PageCountFile returns: token too long' on Apr 7, 2022
@sefasenturk95
Author

Hi @hhrutter, thank you for responding. I get the same error when I try to validate the pdf:

➜  git:(develop) ✗ pdfcpu validate chat.pdf
validating(mode=relaxed) chat.pdf ...
Read: xRefTable failed: bufio.Scanner: token too long
➜  git:(develop) ✗ pdfcpu validate -vv chat.pdf
validating(mode=relaxed) chat.pdf ...
 READ: 2022/04/08 14:52:45 Read: begin
 INFO: 2022/04/08 14:52:45 PDF Version 1.5 conforming reader
 READ: 2022/04/08 14:52:45 readXRefTable: begin
 READ: 2022/04/08 14:52:45 scanning for offsetLastXRefSection starting at 146660451
 READ: 2022/04/08 14:52:45 Offset last xrefsection: 146175834
 READ: 2022/04/08 14:52:45 buildXRefTableStartingAt: begin
 READ: 2022/04/08 14:52:45 headerVersion begin
 READ: 2022/04/08 14:52:45 headerVersion: end, found header version: 1.7
 READ: 2022/04/08 14:52:45 newPositionedReader: positioned to offset: 146175834
 READ: 2022/04/08 14:52:45 xref line 1: <431793 0 obj>
Fatal: bufio.Scanner: token too long
Read: xRefTable failed
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.Read
        github.com/pdfcpu/pdfcpu/pkg/pdfcpu/read.go:75
github.com/pdfcpu/pdfcpu/pkg/api.ReadContext
        github.com/pdfcpu/pdfcpu/pkg/api/api.go:48
github.com/pdfcpu/pdfcpu/pkg/api.Validate
        github.com/pdfcpu/pdfcpu/pkg/api/validate.go:43
github.com/pdfcpu/pdfcpu/pkg/api.ValidateFile
        github.com/pdfcpu/pdfcpu/pkg/api/validate.go:92
github.com/pdfcpu/pdfcpu/pkg/api.ValidateFiles
        github.com/pdfcpu/pdfcpu/pkg/api/validate.go:115
github.com/pdfcpu/pdfcpu/pkg/cli.Validate
        github.com/pdfcpu/pdfcpu/pkg/cli/cli.go:32
github.com/pdfcpu/pdfcpu/pkg/cli.Process
        github.com/pdfcpu/pdfcpu/pkg/cli/process.go:35
main.process
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/process.go:102
main.processValidateCommand
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/process.go:157
main.commandMap.process
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/cmd.go:143
main.main
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/main.go:55
runtime.main
        runtime/proc.go:255
runtime.goexit
        runtime/asm_amd64.s:1581

@hhrutter
Collaborator

hhrutter commented Apr 8, 2022

This looks like either a malformed PDF or a parser bug.
If your PDF opens without any problem in Acrobat Reader and Mac Preview,
you can help improve pdfcpu by sharing this file.
An email with a link is also fine.

@sefasenturk95
Author

I can open it with the Preview app on my Mac just fine. To which email can I send the file?

@hhrutter
Collaborator

hhrutter commented Apr 8, 2022

This file is huge. Send me a link
hhrutter gmail

@hhrutter
Collaborator

Try this before merging:

  1. Locate your config file: pdfcpu config
  2. Reconfigure: writeXRefStream: false

@sefasenturk95
Author

sefasenturk95 commented Apr 11, 2022

Ah, cool, it seems to be working with writeXRefStream: false. Can you explain what it does and why it has to be false in this case? Thanks!

@hhrutter
Collaborator

There are two ways to save a cross reference table:
using xref table sections, or using the more compact xref stream, which is pdfcpu's default.
There is a bug in parsing non-trivial xref streams that needs to be addressed.
Switching to writing xref sections is just a suggested workaround until this is fixed.
This is related to #357
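
As a rough illustration of the first form: in a classic xref section, each entry is a fixed-width text line holding a 10-digit byte offset, a 5-digit generation number, and n (in use) or f (free). An xref stream stores the same information inside a compressed stream object instead. Below is a minimal sketch of parsing one classic entry. xrefEntry and parseXRefEntry are hypothetical names for this example and are not pdfcpu's API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xrefEntry is one row of a classic cross reference section.
// Hypothetical type for illustration only.
type xrefEntry struct {
	Offset     int64 // byte offset of the object (for in-use entries)
	Generation int   // generation number
	InUse      bool  // true for "n", false for "f"
}

// parseXRefEntry parses a line such as "0000000017 00000 n".
func parseXRefEntry(line string) (xrefEntry, error) {
	f := strings.Fields(line)
	if len(f) != 3 || len(f[0]) != 10 || len(f[1]) != 5 {
		return xrefEntry{}, fmt.Errorf("malformed xref entry %q", line)
	}
	off, err := strconv.ParseInt(f[0], 10, 64)
	if err != nil {
		return xrefEntry{}, fmt.Errorf("bad offset in %q: %w", line, err)
	}
	gen, err := strconv.Atoi(f[1])
	if err != nil {
		return xrefEntry{}, fmt.Errorf("bad generation in %q: %w", line, err)
	}
	switch f[2] {
	case "n":
		return xrefEntry{Offset: off, Generation: gen, InUse: true}, nil
	case "f":
		return xrefEntry{Offset: off, Generation: gen, InUse: false}, nil
	}
	return xrefEntry{}, fmt.Errorf("bad entry type in %q", line)
}

func main() {
	e, err := parseXRefEntry("0000000017 00000 n")
	fmt.Printf("%+v %v\n", e, err) // prints: {Offset:17 Generation:0 InUse:true} <nil>

	// An object header is not a classic xref entry and is rejected.
	_, err = parseXRefEntry("431793 0 obj")
	fmt.Println(err != nil) // prints: true
}
```

The second call uses the line from the -vv log above, where the reader found an object header rather than a classic xref entry at the xref offset.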

hhrutter added the bug label and removed the investigate label on Apr 11, 2022
hhrutter changed the title from 'api: PageCountFile returns: token too long' to 'token too long error after merging a huge number of files' on Apr 11, 2022
hhrutter changed the title from 'token too long error after merging a huge number of files' to '"token too long" after merging a huge number of files' on Apr 11, 2022
hhrutter added a commit that referenced this issue Apr 11, 2022
@hhrutter
Collaborator

This is fixed with the latest commit!

adamgreenhall added a commit to adamgreenhall/pdfcpu that referenced this issue Apr 28, 2022
* Fix pdfcpu#442, pdfcpu#443

* Fix pdfcpu#437

* Fix pdfcpu#434

* Fix pdfcpu#429

* Fix pdfcpu#438

* Fix pdfcpu#440

* Fix pdfcpu#380

* Fix pdfcpu#446

* Add Fedora instructions (pdfcpu#439)

* Fix pdfcpu#389

* Fix pdfcpu#357, pdfcpu#451

* Fix free list validation

* Cleanup

* Fix pdfcpu#453

* Fix pdfcpu#457

* Revert "Revert "Fix pdfcpu#385""

This reverts commit bbe8e25.

Co-authored-by: Horst Rutter <hhrutter@gmail.com>
Co-authored-by: Fabio Alessandro Locati <77888+Fale@users.noreply.github.com>