Adding ability to extract text to an io.Reader #1

robarchibald · 2017-08-18T03:48:43Z

Thanks for the awesome library! I thought it would be useful to be able to extract text from a PDF file easily, so I added that to pdflib. The charmap capability was a beast, but I got it working for the files I tested on. Thanks for considering this pull request.

coveralls · 2017-08-18T03:57:36Z

Coverage decreased (-0.6%) to 50.672% when pulling 99d8987 on EndFirstCorp:master into c0da03d on hhrutter:master.

hhrutter · 2017-08-20T20:34:22Z

Thx for your PR!
Appreciated 👍
Sorry I did not see it until I committed my squashed master branch :(
I'll get back to you.

hhrutter · 2017-08-21T21:34:46Z

Please share your specific intention for returning text.
Do you need text for a specific page or for the whole file?
Please provide some tests and (go)doc so I get the idea of your use case.
The API interfaces with files and directories for the moment, so we want to stay consistent.
If there is a use case for providing extracted content via io.Reader, maybe that's a missing layer (in between). Like I said, I squashed the last couple of commits together, including the one you forked off. Could you be so kind as to base your changes on 8f3ac5c? Thank you so much.

robarchibald · 2017-09-05T17:45:50Z

I appreciate you taking the time to look at this pull request. It turns out that since submitting the pull request I created my own library to convert PDFs to text at https://github.com/EndFirstCorp/pdf2txt. My use case is that I’m filling out a pipeline from a web server file upload. On file upload, you get access to a multipart.File. Rather than saving that multipart.File to a standard OS file, my goal was to minimize the I/O: 1) read in the file and convert it to text, and then 2) output the text file to disk. I wanted it to work with an io.Reader, not an io.ReaderAt as typical PDF parser implementations require. So, that’s what I’ve done.

If you’re still interested in implementing this pull request, I can look at resubmitting.

hhrutter · 2017-09-10T14:31:47Z

Understood. If you can resubmit your text extraction code so that it is consistent with the existing extraction code for images, fonts and content, and supply test code as well, I am happy to merge it in. Basically, that means writing out to a file and not a reader. Thank you.

hhrutter · 2017-11-28T10:51:31Z

Text extraction will definitely be added at some point, but since this PR originated from an old version, I am closing it for now.


hhrutter and others added 11 commits August 7, 2017 22:00

clean up

2109018

clean up

d180aeb

clean up

4b274b1

clean up

c0da03d

Utter craziness

fbd4359

fix imports

cb0e7b6

Splitting Extract text into separate file

c4c1fae

Update features

4de9acd

backing out import changes

9d9aba3

Merge branch 'master' of https://github.com/EndFirstCorp/pdflib

57f7dcd

minor cleanup

99d8987

hhrutter force-pushed the master branch from 0e53e7b to 8f3ac5c Compare August 20, 2017 20:22

hhrutter self-assigned this Aug 20, 2017

ghost mentioned this pull request Oct 8, 2017

replacing text in a PDF #4

Open

hhrutter closed this Nov 28, 2017

charleswklau pushed a commit to charleswklau/pdfcpu that referenced this pull request Dec 28, 2018

Merge pull request pdfcpu#1 from charleswklau/master

1cd1d9a

using latest source, and update read_buf

yuthan mentioned this pull request Feb 16, 2020

Invalid page dict entry: Group #166

Closed

lhq0826 mentioned this pull request Sep 9, 2020

cli: image watermark panics #222

Closed

hhrutter mentioned this pull request Nov 25, 2020

Fatal: pdfcpu: validateNameEntry: dict=rootDict entry=Type invalid dict entry: Pages #250

Closed

This was referenced Nov 27, 2020

Corrupt name object when parsing #252

Closed

dereferenceObject: problem dereferencing stream 11: EOF #256

Closed

validation error: dict=type1FontDict required entry=FirstChar missing #258

Closed

hhrutter pushed a commit that referenced this pull request Feb 12, 2021

Booklet commands split from nup (#1)

1da6dfa

* move booklet out of nup api * booklet cli * cleanup * note * cleanup

hhrutter mentioned this pull request Dec 14, 2021

Text extraction #410

Closed

YootTanA mentioned this pull request Sep 26, 2023

dict=pagesDict entry=UserUnit: unsupported in version 1.3 This file could be PDF/A compliant but pdfcpu only supports versions <= PDF V1.7 #717

Closed

joel-rieke mentioned this pull request Jan 18, 2024

pdfcpu val -v crashes on pdf that displays fine in chrome but crashes... #780

Closed

sanbornm mentioned this pull request May 18, 2024

Panic on image extract #871

Closed
