ExtractImagesRaw not working since v0.3.12 #353

renard · 2021-07-12T22:42:52Z

I use pdfcpu as a library for a pet project to convert PDF files to comic books. (See below for the relevant code).

If I use a previous version (go get github.com/pdfcpu/pdfcpu@v0.3.12-0.20210416123645-ac9adc6099fe) my snippet works correctly.

However, with the latest release (go get github.com/pdfcpu/pdfcpu), i.e. since v0.3.12, ExtractImagesRaw has stopped working, with two symptoms:

Either the program panics:

panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
image.(*YCbCr).YCbCrAt(0xc0001b4480, 0x0, 0x0, 0x1001)
	/usr/local/Cellar/go/1.16/libexec/src/image/ycbcr.go:81 +0x130
image.(*YCbCr).At(0xc0001b4480, 0x0, 0x0, 0xc0000ee000, 0x5a8)
	/usr/local/Cellar/go/1.16/libexec/src/image/ycbcr.go:71 +0x45
image/png.(*encoder).writeImage(0xc0001af400, 0x14e1780, 0xc0001f90c0, 0x14e7490, 0xc0001b4480, 0xe, 0xffffffffffffffff, 0x0, 0x0)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:473 +0x14a5
image/png.(*encoder).writeIDATs(0xc0001af400)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:531 +0xf0
image/png.(*Encoder).Encode(0xc00061eff0, 0x14e17e0, 0xc00143f140, 0x14e7490, 0xc0001b4480, 0x0, 0x0)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:632 +0x388
image/png.Encode(...)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:561
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.renderDCTEncodedImage(0xc000074000, 0xc0017fca10, 0x1015300, 0xc000272444, 0x4, 0x65, 0xc00143f050, 0xef00000000000000, 0x0, 0xc0001230e0, ...)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/writeImage.go:784 +0x270
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.RenderImage(0xc000074000, 0xc0017fca10, 0x0, 0xc000272444, 0x4, 0x65, 0x1, 0xc00312f630, 0x0, 0x1, ...)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/writeImage.go:804 +0x1c5
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.(*Context).ExtractImage(0xc0000723c0, 0xc0017fca10, 0x0, 0xc000272444, 0x4, 0x65, 0x0, 0x1, 0x0, 0x0)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/extract.go:314 +0x310
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.(*Context).ExtractPageImages(0xc0000723c0, 0x46, 0x0, 0xc000134090, 0x1, 0x1, 0x0, 0x1)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/extract.go:336 +0x154
github.com/pdfcpu/pdfcpu/pkg/api.ExtractImagesRaw(0x14e5680, 0xc00012e590, 0x0, 0x0, 0x0, 0xc00013a580, 0x1, 0xa5, 0x0, 0x0, ...)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/api/extract.go:65 +0x2a5
main.readPDF(0x7ffeefbff952, 0x42, 0x0, 0x0)
	/Users/renard/Src/cbconvert/test/main.go:18 +0x165
main.main()
	/Users/renard/Src/cbconvert/test/main.go:34 +0xf3
exit status 2

Or the program appears to enter an infinite loop (I can only assume so, since it gets stuck in ExtractImagesRaw).

This seems to be related to either #329 or #323 (not sure).
I can provide some files that show this issue by email.

Code I use:

package main

import (
	"fmt"
	"os"

	pdfapi "github.com/pdfcpu/pdfcpu/pkg/api"
)

func readPDF(file string) (err error) {
	f, err := os.Open(file)
	if err != nil {
		return err
	}
	defer f.Close()

	fmt.Printf("PDF: Opening %s\n", file)
	pages, err := pdfapi.ExtractImagesRaw(f, nil, nil)
	if err != nil {
		fmt.Printf("PDF: %s\n", err)
		return
	}
	for _, page := range pages {
		// Pre v0.3.12
		//fn := fmt.Sprintf("%.5d.%s", page.PageNr, page.Type)
		// Post v0.3.12
		fn := fmt.Sprintf("%s.%s", page.Name, page.FileType)
		// In the real world, this function copies the image content:
		//buf := new(bytes.Buffer)
		//_, err = io.Copy(buf, page.Reader)
		fmt.Printf("Adding %s\n", fn)
	}
	fmt.Printf("PDF: Read %d pages\n", len(pages))
	return
}

// arg1 is the PDF file to inspect
func main() {
	err := readPDF(os.Args[1])
	if err != nil {
		fmt.Printf("Error: %s\n", err)
		panic(err)
	}
}

hhrutter · 2021-07-12T22:56:21Z

Please provide one small file that reproduces your symptoms so I can provide a fix.

renard · 2021-07-12T22:57:53Z

It looks like ExtractImagesRaw is no longer used. I changed my code to:

package main

import (
	"fmt"
	"os"

	pdfapi "github.com/pdfcpu/pdfcpu/pkg/api"
	"github.com/pdfcpu/pdfcpu/pkg/pdfcpu"
)

func readPDF(file string) (err error) {
	f, err := os.Open(file)
	if err != nil {
		return err
	}
	defer f.Close()

	fmt.Printf("PDF: Opening %s\n", file)
	err = pdfapi.ExtractImages(f, nil, PrintImg, nil)
	if err != nil {
		fmt.Printf("PDF: %s\n", err)
		return
	}
	return
}

func PrintImg(img pdfcpu.Image, singleImgPerPage bool, maxPageDigits int) error {
	fmt.Printf("s:%s, d:%d, %#v\n", singleImgPerPage, maxPageDigits, img)
	return nil

}

func main() {
	fmt.Printf("ARGS: %#v\n", os.Args)
	err := readPDF(os.Args[1])
	if err != nil {
		fmt.Printf("Error: %s\n", err)
		panic(err)
	}
}

This works until it reaches the following image:

s:%!s(bool=true), d:2, pdfcpu.Image{Reader:(*bytes.Buffer)(0xc000f4c4b0), Name:"Im15", FileType:"png", pageNr:15, objNr:95, width:0, height:0, bpc:0, cs:"", comp:0, sMask:false, imgMask:false, thumb:false, interpol:false, size:0, filter:""}
panic: runtime error: index out of range [0] with length 0

goroutine 1 [running]:
image.(*YCbCr).YCbCrAt(0xc000224400, 0x0, 0x0, 0x1001)
	/usr/local/Cellar/go/1.16/libexec/src/image/ycbcr.go:81 +0x130
image.(*YCbCr).At(0xc000224400, 0x0, 0x0, 0xc0039f7980, 0x5a8)
	/usr/local/Cellar/go/1.16/libexec/src/image/ycbcr.go:71 +0x45
image/png.(*encoder).writeImage(0xc000139900, 0x14d7880, 0xc000223080, 0x14dd5d0, 0xc000224400, 0xe, 0xffffffffffffffff, 0x0, 0x0)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:473 +0x14a5
image/png.(*encoder).writeIDATs(0xc000139900)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:531 +0xf0
image/png.(*Encoder).Encode(0xc00000f350, 0x14d78e0, 0xc000f4cb10, 0x14dd5d0, 0xc000224400, 0x0, 0x0)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:632 +0x388
image/png.Encode(...)
	/usr/local/Cellar/go/1.16/libexec/src/image/png/writer.go:561
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.renderDCTEncodedImage(0xc000202000, 0xc0014c1f80, 0x1015300, 0xc0002bbc44, 0x4, 0x65, 0xc000f4c4e0, 0x5f00000000000000, 0x0, 0xc0000a8f18, ...)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/writeImage.go:784 +0x270
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.RenderImage(0xc000202000, 0xc0014c1f80, 0x0, 0xc0002bbc44, 0x4, 0x65, 0x1, 0xc003155710, 0x0, 0x1, ...)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/writeImage.go:804 +0x1c5
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.(*Context).ExtractImage(0xc0002003c0, 0xc0014c1f80, 0x0, 0xc0002bbc44, 0x4, 0x65, 0xc001854000, 0x0, 0xc000145aa0, 0x10c71d1)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/extract.go:314 +0x310
github.com/pdfcpu/pdfcpu/pkg/pdfcpu.(*Context).ExtractPageImages(0xc0002003c0, 0x10, 0xc0001fb800, 0x4, 0x1452e6c, 0x3, 0xf, 0x5f)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/pdfcpu/extract.go:336 +0x154
github.com/pdfcpu/pdfcpu/pkg/api.ExtractImages(0x14db740, 0xc0000b45a0, 0x0, 0x0, 0x0, 0x147c088, 0xc0000c2630, 0xa5, 0x0)
	/Users/renard/go/pkg/mod/github.com/pdfcpu/pdfcpu@v0.3.12/pkg/api/extract.go:115 +0x345
main.readPDF(0x7ffeefbff952, 0x42, 0x0, 0x0)
	/Users/renard/Src/cbconvert/test/main.go:19 +0x16d
main.main()
	/Users/renard/Src/cbconvert/test/main.go:45 +0xf3
exit status 2
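
Both traces end in image.(*YCbCr).YCbCrAt, called from png.Encode at pixel (0, 0). As a minimal sketch of how that exact runtime error can arise in the standard library alone, the program below encodes a YCbCr image whose bounds are non-empty but whose pixel planes were never filled. That this is what happens inside pdfcpu is only an assumption, not a diagnosis; emptyPlanes, validPlanes and tryEncode are hypothetical helper names, and io.Discard needs Go 1.16 or later.

```go
package main

import (
	"fmt"
	"image"
	"image/png"
	"io"
)

// emptyPlanes returns a YCbCr image whose bounds claim w x h pixels
// while its Y, Cb and Cr slices are nil (an assumed failure state).
func emptyPlanes(w, h int) image.Image {
	return &image.YCbCr{Rect: image.Rect(0, 0, w, h)}
}

// validPlanes returns a properly allocated YCbCr image for comparison.
func validPlanes(w, h int) image.Image {
	return image.NewYCbCr(image.Rect(0, 0, w, h), image.YCbCrSubsampleRatio444)
}

// tryEncode PNG-encodes img into a discarded sink and turns a panic
// raised inside the encoder into an ordinary error.
func tryEncode(img image.Image) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("png encode panicked: %v", r)
		}
	}()
	return png.Encode(io.Discard, img)
}

func main() {
	fmt.Println("empty planes:", tryEncode(emptyPlanes(8, 8)))
	fmt.Println("valid planes:", tryEncode(validPlanes(8, 8)))
}
```

Wrapping the encode step in a recover like this could also serve as a stopgap in calling code, so that one bad image is reported as an error instead of aborting the whole run.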

renard · 2021-07-12T23:01:23Z

> Please provide one small file that reproduces your symptoms so I can provide a fix.

Sent by mail

hhrutter · 2021-07-12T23:14:32Z

Yes, there was some refactoring going on in that area in order to make the reader containing the image data optional, for listing images where it is not needed.

Thanks for reporting this.
I'll keep you posted.
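
If the reader is optional, calling code that wants the image bytes probably has to check for its absence first. A minimal sketch of such a guard follows, assuming a record with a Name and a possibly nil Reader; the extracted type and the helpers content, withData and isNoData are hypothetical stand-ins, not pdfcpu's API.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// extracted is a hypothetical stand-in for an image record.
// Reader may be nil when only metadata was requested.
type extracted struct {
	Name   string
	Reader io.Reader
}

// errNoData marks records that carry no image data.
var errNoData = errors.New("image has no data reader")

// withData builds a record backed by an in-memory reader.
func withData(name string, data []byte) extracted {
	return extracted{Name: name, Reader: bytes.NewReader(data)}
}

// isNoData reports whether err wraps errNoData.
func isNoData(err error) bool {
	return errors.Is(err, errNoData)
}

// content copies the image bytes into memory, or returns an error
// wrapping errNoData when the record has no reader.
func content(e extracted) ([]byte, error) {
	if e.Reader == nil {
		return nil, fmt.Errorf("%s: %w", e.Name, errNoData)
	}
	var buf bytes.Buffer
	if _, err := io.Copy(&buf, e.Reader); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := content(withData("Im1", []byte("abc")))
	fmt.Println(len(b), err)

	_, err = content(extracted{Name: "Im2"})
	fmt.Println(isNoData(err))
}
```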

hhrutter · 2021-07-13T15:18:30Z

👍 This is fixed with the latest commit.
You can still use ExtractImagesRaw if you don't mind getting back ALL images in one memory chunk.

I encourage everybody to go get the latest commit.
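
To illustrate the trade-off between getting everything back at once and handling images one by one through a callback, here is a rough sketch using only the standard library. The img type and the source, collectAll and streamTotal functions are hypothetical and do not mirror pdfcpu's internals; they only contrast the two styles.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// img is a hypothetical stand-in for an extracted image.
type img struct {
	Name   string
	Reader io.Reader
}

// source produces n fake images of size bytes each and hands them
// to emit one at a time, stopping at the first error.
func source(n, size int, emit func(img) error) error {
	for i := 0; i < n; i++ {
		data := bytes.Repeat([]byte{byte(i)}, size)
		im := img{Name: fmt.Sprintf("Im%d", i), Reader: bytes.NewReader(data)}
		if err := emit(im); err != nil {
			return err
		}
	}
	return nil
}

// collectAll follows the "raw" style: every image stays in memory
// until the caller is done with the returned slice.
func collectAll(n, size int) ([]img, error) {
	var all []img
	err := source(n, size, func(im img) error {
		all = append(all, im)
		return nil
	})
	return all, err
}

// streamTotal follows the callback style: each image is consumed
// and can be released before the next one arrives.
func streamTotal(n, size int) (int64, error) {
	var total int64
	err := source(n, size, func(im img) error {
		written, err := io.Copy(io.Discard, im.Reader)
		total += written
		return err
	})
	return total, err
}

func main() {
	all, _ := collectAll(3, 4)
	total, _ := streamTotal(3, 4)
	fmt.Println(len(all), total)
}
```

With many large pages, the collecting style keeps memory use proportional to the whole document, while the callback style only needs room for the image currently being handled.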

hhrutter self-assigned this Jul 12, 2021

hhrutter added the bug label Jul 12, 2021

hhrutter closed this as completed in 281745f Jul 13, 2021

renard mentioned this issue Nov 4, 2021

Performance issue while extracting image #391

Closed
