@robarchibald

Thanks for the awesome library! I thought it would be useful to be able to extract text from a PDF file easily, so I added that to pdflib. The charmap capability was a beast, but I got it working for the files I tested on. Thanks for considering this pull request.

@coveralls

Coverage decreased (-0.6%) to 50.672% when pulling 99d8987 on EndFirstCorp:master into c0da03d on hhrutter:master.

@hhrutter
Collaborator

Thx for your PR!
Appreciated 👍
Sorry I did not see it until I committed my squashed master branch :(
I'll get back to you.

@hhrutter self-assigned this Aug 20, 2017
@hhrutter
Collaborator

Please share your specific intention for returning text.
Do you need text for a specific page or for the whole file?
Please provide some tests and (go)doc so I get the idea of your use case.
The API interfaces with files and directories for the moment, so we want to stay consistent.
If there is a use case for providing extracted content via io.Reader, maybe that's a missing layer (in between). Like I said, I squashed the last couple of commits together, including the one you forked off. Could you be so kind as to provide your changes based on 8f3ac5c? Thank you so much.

@robarchibald
Author

I appreciate you taking the time to look at this pull request. It turns out that since submitting the pull request, I created my own library to convert PDFs to text at https://github.com/EndFirstCorp/pdf2txt. My use case is that I’m filling out a pipeline that starts from a web server file upload. On file upload, you get access to a multipart.File. Rather than saving that multipart.File to a standard OS file, my goal was to minimize the I/O by 1) reading in the file and converting it to text and then 2) writing the text file to disk. I wanted it to work with an io.Reader, not an io.ReaderAt as typical PDF parser implementations require. So, that’s what I’ve done.

If you’re still interested in implementing this pull request, I can look at resubmitting.

@hhrutter
Collaborator

Understood. If you can resubmit your text extraction code so that it is consistent with the existing extraction code for images, fonts and content, and supply test code as well, I am happy to merge it in. Basically, that means writing out to a file and not a reader. Thank you.

@ghost mentioned this pull request Oct 8, 2017
@hhrutter
Collaborator

Text extraction will definitely be added at some point, but since this PR originated from an old version, I am closing it for now.

@hhrutter closed this Nov 28, 2017
charleswklau pushed a commit to charleswklau/pdfcpu that referenced this pull request Dec 28, 2018
using latest source, and update read_buf
hhrutter pushed a commit that referenced this pull request Feb 12, 2021
* move booklet out of nup api

* booklet cli

* cleanup

* note

* cleanup
@hhrutter mentioned this pull request Dec 14, 2021