disableCombineTextItems like pdf-parse #7

Open
JonSilver opened this issue Nov 13, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@JonSilver

Nice work on this library! Just wondering...

A lot of text comes out very difficult to parse, because multiple items of text on a line get combined, with no separator. The pdf-parse library had a disableCombineTextItems property on render options which could improve this situation. Perhaps something like that, or a "line items separator" string you could specify that gets inserted in between same-line items of text.
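To illustrate the request, here is a minimal sketch of what a "line items separator" could do. The `TextItem` shape, the `joinTextItems` function, and the `tolerance` parameter are all hypothetical names made up for this example; they are not part of this library's or pdf-parse's API.

```typescript
// Hypothetical shape of one extracted text item (not a real library type).
interface TextItem {
  text: string;
  y: number; // vertical position of the item's baseline
}

// Join items that share a line with `separator`, and join lines with "\n".
// Assumes items arrive in reading order; two items count as being on the
// same line when their y positions differ by at most `tolerance`.
function joinTextItems(items: TextItem[], separator = "", tolerance = 0.5): string {
  const lines: string[][] = [];
  let lastY: number | undefined;
  for (const item of items) {
    if (lastY === undefined || Math.abs(item.y - lastY) > tolerance) {
      lines.push([]); // start a new line
    }
    lines[lines.length - 1].push(item.text);
    lastY = item.y;
  }
  return lines.map((line) => line.join(separator)).join("\n");
}

const items: TextItem[] = [
  { text: "Invoice", y: 700 },
  { text: "2021-11-13", y: 700 },
  { text: "Total", y: 680 },
  { text: "42.00", y: 680 },
];

// Current behaviour as described above: items run together.
const combined = joinTextItems(items); // "Invoice2021-11-13\nTotal42.00"

// With a separator, item boundaries survive in the output text.
const separated = joinTextItems(items, "\t"); // "Invoice\t2021-11-13\nTotal\t42.00"
```

Defaulting the separator to an empty string would keep today's output unchanged for anyone who does not set it.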

@lublak lublak added the enhancement New feature or request label Nov 20, 2021
@lublak lublak self-assigned this Nov 20, 2021
@lublak
Owner

lublak commented Nov 20, 2021

@JonSilver hi and thank you :)
Sorry for the late reply. I usually try to answer within a few days, but I was on vacation for a week.
That's right, the text contents are combined per line.
I am currently working on another parsing function that gives you access to the complete content of the PDF.
All composites will be returned in an array.
Would this be a solution for you?

@JonSilver
Author

Hi @lublak. Hope you had a great vacation 😁

Yes, I suppose an array of items would be pretty good too, but an optional, settable delimiter inserted into the output text between items would be great for regex parsing. Different purposes, different solutions. 😊
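As a rough sketch of the regex use case: assuming same-line items were joined with a `|` delimiter, a line could be picked apart as below. The sample line, the delimiter choice, and the field names are invented for illustration only.

```typescript
// Hypothetical output line, assuming same-line items were joined with "|".
const line = "Widget A|3|19.99";

// Without a delimiter the same line would read "Widget A319.99",
// where the quantity and the price can no longer be told apart.
const pattern = /^([^|]+)\|(\d+)\|(\d+\.\d{2})$/;
const match = pattern.exec(line);

const parsed = match
  ? { name: match[1], quantity: Number(match[2]), price: Number(match[3]) }
  : null;
// parsed is { name: "Widget A", quantity: 3, price: 19.99 }
```

An array of items would serve the same purpose for structured access, while the delimiter keeps plain-text and regex workflows simple.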

@lublak
Owner

lublak commented Nov 20, 2021

@JonSilver yes I had :)
I will think about it, but I will offer a solution in any case.

@lublak
Owner

lublak commented Dec 2, 2021

For the time being, I have pushed the current development work publicly to a separate branch, which can be found here: https://github.com/lublak/pdfdataextract/tree/contentinfoextractor
