"token too long" after merging a huge number of files #451

sefasenturk95 · 2022-04-07T13:25:49Z

I am trying to count the pages of a big PDF file. It is 147 MB and has 6638 pages. This is the snippet I use to count them:

// pdf is presumably an import alias for github.com/pdfcpu/pdfcpu/pkg/api.
pageCount, err := pdf.PageCountFile(b.pathToPDF)
if err != nil {
	return nil, 0, errors.Wrap(err, "error while counting pages of pdf")
}

And this is the output:
Read: xRefTable failed: bufio.Scanner: token too long

I tried disabling WriteXRefStream, but that did not help. @hhrutter, any idea how to proceed?

hhrutter · 2022-04-07T21:54:14Z

Hi there!

First of all please ensure your PDF is valid and passes pdfcpu validate in.pdf on the CLI.
My guess is it will probably fail...

Then execute pdfcpu validate -vv in.pdf and report back the stack trace you are getting.
Then we take it from there.

Thank you for using pdfcpu 💚

sefasenturk95 · 2022-04-08T12:54:41Z

Hi @hhrutter, thank you for responding. I get the same error when I try to validate the pdf:

➜  git:(develop) ✗ pdfcpu validate chat.pdf
validating(mode=relaxed) chat.pdf ...
Read: xRefTable failed: bufio.Scanner: token too long

➜  git:(develop) ✗ pdfcpu validate -vv chat.pdf
validating(mode=relaxed) chat.pdf ...
 READ: 2022/04/08 14:52:45 Read: begin
 INFO: 2022/04/08 14:52:45 PDF Version 1.5 conforming reader
 READ: 2022/04/08 14:52:45 readXRefTable: begin
 READ: 2022/04/08 14:52:45 scanning for offsetLastXRefSection starting at 146660451
 READ: 2022/04/08 14:52:45 Offset last xrefsection: 146175834
 READ: 2022/04/08 14:52:45 buildXRefTableStartingAt: begin
 READ: 2022/04/08 14:52:45 headerVersion begin
 READ: 2022/04/08 14:52:45 headerVersion: end, found header version: 1.7
 READ: 2022/04/08 14:52:45 newPositionedReader: positioned to offset: 146175834
 READ: 2022/04/08 14:52:45 xref line 1: <431793 0 obj>
Fatal: bufio.Scanner: token too long
Read: xRefTable failed
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.Read
        github.com/pdfcpu/pdfcpu/pkg/pdfcpu/read.go:75
github.com/pdfcpu/pdfcpu/pkg/api.ReadContext
        github.com/pdfcpu/pdfcpu/pkg/api/api.go:48
github.com/pdfcpu/pdfcpu/pkg/api.Validate
        github.com/pdfcpu/pdfcpu/pkg/api/validate.go:43
github.com/pdfcpu/pdfcpu/pkg/api.ValidateFile
        github.com/pdfcpu/pdfcpu/pkg/api/validate.go:92
github.com/pdfcpu/pdfcpu/pkg/api.ValidateFiles
        github.com/pdfcpu/pdfcpu/pkg/api/validate.go:115
github.com/pdfcpu/pdfcpu/pkg/cli.Validate
        github.com/pdfcpu/pdfcpu/pkg/cli/cli.go:32
github.com/pdfcpu/pdfcpu/pkg/cli.Process
        github.com/pdfcpu/pdfcpu/pkg/cli/process.go:35
main.process
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/process.go:102
main.processValidateCommand
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/process.go:157
main.commandMap.process
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/cmd.go:143
main.main
        github.com/pdfcpu/pdfcpu/cmd/pdfcpu/main.go:55
runtime.main
        runtime/proc.go:255
runtime.goexit
        runtime/asm_amd64.s:1581

hhrutter · 2022-04-08T17:13:20Z

This looks like either a malformed PDF or a parser bug.
If your PDF opens w/o any problem in Acrobat Reader and Mac Preview,
you can help improve pdfcpu by sharing this file.
An email with a link is also fine.

sefasenturk95 · 2022-04-08T19:13:01Z

I can open it with the Preview app on my Mac just fine. To which email can I send the file?

hhrutter · 2022-04-08T20:06:01Z

This file is huge. Send me a link
hhrutter gmail

hhrutter · 2022-04-10T22:35:19Z

Try this before merging:

Locate your config file: pdfcpu config
Reconfigure: writeXRefStream: false

sefasenturk95 · 2022-04-11T07:53:26Z

Ah, cool, it seems to be working with writeXRefStream: false. Can you explain what it does and why it has to be false in this case? Thanks!

hhrutter · 2022-04-11T08:40:39Z

There are two ways to save a cross reference table:
using xref table sections, or using an xrefstream, a more compact form which is pdfcpu's default.
There is a bug parsing nontrivial xrefstreams that needs to be addressed.
Switching to writing xref sections is just a suggested workaround until this is fixed.
This is related to #357

hhrutter · 2022-04-11T22:15:15Z

This is fixed with the latest commit!

* Fix pdfcpu#442, pdfcpu#443
* Fix pdfcpu#437
* Fix pdfcpu#434
* Fix pdfcpu#429
* Fix pdfcpu#438
* Fix pdfcpu#440
* Fix pdfcpu#380
* Fix pdfcpu#446
* Add Fedora instructions (pdfcpu#439)
* Fix pdfcpu#389
* Fix pdfcpu#357, pdfcpu#451
* Fix free list validation
* Cleanup
* Fix pdfcpu#453
* Fix pdfcpu#457
* Revert "Revert "Fix pdfcpu#385""

This reverts commit bbe8e25.

Co-authored-by: Horst Rutter <hhrutter@gmail.com>
Co-authored-by: Fabio Alessandro Locati <77888+Fale@users.noreply.github.com>

sefasenturk95 added the investigate label Apr 7, 2022

sefasenturk95 assigned hhrutter Apr 7, 2022

hhrutter changed the title ~~Error while counting pages: Read: xRefTable failed: bufio.Scanner: token too long~~ API: PageCountFile returns: xRefTable failed: bufio.Scanner: token too long Apr 7, 2022

hhrutter changed the title ~~API: PageCountFile returns: xRefTable failed: bufio.Scanner: token too long~~ API: PageCountFile returns: token too long Apr 7, 2022

hhrutter changed the title ~~API: PageCountFile returns: token too long~~ api: PageCountFile returns: token too long Apr 7, 2022

hhrutter added bug and removed investigate labels Apr 11, 2022

hhrutter changed the title ~~api: PageCountFile returns: token too long~~ token too long error after merging a huge number of files Apr 11, 2022

hhrutter changed the title ~~token too long error after merging a huge number of files~~ "token too long" after merging a huge number of files Apr 11, 2022

hhrutter mentioned this issue Apr 11, 2022

write: line length overflow #357

Closed

hhrutter added a commit that referenced this issue Apr 11, 2022

Fix #357, #451

e456479

hhrutter closed this as completed Apr 11, 2022
