Expose PDF.js getTextContent method via a Content property. #20

Merged
merged 1 commit into from
Jul 20, 2014

Conversation

yveszoundi
Contributor

This is a suggested implementation for #19.

Right now I roll pdf2json with a minor modification in the pdf.js script.

Thank you for your great work.

@palin27

palin27 commented Mar 28, 2014

Great suggestion!

@modesty
Owner

modesty commented Mar 30, 2014

I can see the "content" property being useful in certain use cases, but it doesn't fit the current output format. The output format is designed to be a simplified structure that a client renderer can use to reconstruct the PDF content (not just text, but also color, styles, lines, positions, sizes, fields, types, formats, etc.), and text content is already part of the Texts property.

If you only need the text from a PDF and don't care about the other content, I'd suggest adding another top-level method that returns only the text content.
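
A text-only method of the kind suggested here might look roughly like the sketch below. It is not pdf2json's actual API: `extractText` is a hypothetical name, and the code assumes a PDF.js-style document object exposing `numPages`, `getPage(n)`, and a `page.getTextContent()` promise that resolves to `{ items: [{ str }] }` (the shape used by current PDF.js; older builds may differ). A mock document stands in for a real PDF so the example runs on its own.

```javascript
// Hypothetical helper: collect plain text from every page of a PDF.js-style document.
// Assumes doc.numPages, doc.getPage(n), and page.getTextContent() resolving to
// { items: [{ str }] }; this shape is an assumption, not taken from the pull request.
async function extractText(doc) {
  const pages = [];
  // PDF.js page numbers start at 1.
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    // Join the text items of one page into a single string.
    pages.push(content.items.map((item) => item.str).join(" "));
  }
  return pages;
}

// Mock document used only for illustration.
const mockDoc = {
  numPages: 2,
  getPage: async (n) => ({
    getTextContent: async () => ({
      items: [{ str: `page ${n}` }, { str: "text" }],
    }),
  }),
};

extractText(mockDoc).then((pages) => console.log(pages));
```

Returning one string per page keeps the helper independent of the layout-oriented output format, which is the separation the comment above argues for.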

@palin27

palin27 commented Mar 31, 2014

I know the goal of your work is to reconstruct the PDF content. I only need the text, but I really don't understand why the Texts property contains ASCII character codes, for example 2C instead of ','. If I instead use the promise object from page.getTextContent(), it returns clear text!

You did great work.
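
For readers hitting the same "2C instead of ','" behaviour: 2C is the hexadecimal code for a comma, which is what URI encoding produces (`%2C`). Assuming the text runs in the Texts property are URI-encoded strings stored under `R[].T` (an assumption about pdf2json's output shape, not something stated in this thread), they can be turned back into plain text with the standard `decodeURIComponent` function. `decodePageTexts` and `samplePage` are hypothetical names used only for this sketch.

```javascript
// Sketch: decode URI-encoded text runs into plain text.
// The input shape ({ Texts: [{ R: [{ T }] }] }) is an assumption about
// pdf2json's page output; adjust the property names if your output differs.
function decodePageTexts(page) {
  return page.Texts.map((text) =>
    // Each text block may hold several runs; decode each and join them.
    text.R.map((run) => decodeURIComponent(run.T)).join("")
  );
}

// Hand-written sample data, not real pdf2json output.
const samplePage = {
  Texts: [{ R: [{ T: "Hello%2C%20world" }] }, { R: [{ T: "a%3Ab" }] }],
};

console.log(decodePageTexts(samplePage)); // [ 'Hello, world', 'a:b' ]
```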

@modesty modesty reopened this Jul 20, 2014
modesty added a commit that referenced this pull request Jul 20, 2014
Expose PDF.js getTextContent method via a Content property. ---- I've got a couple of more inquiries on getting raw text out of PDF, reopen this pull request and merge it for testing. Thanks for the contribution and sorry for the delay.
@modesty modesty merged commit ec960ba into modesty:master Jul 20, 2014
@modesty
Owner

modesty commented Jul 20, 2014

I've received a few more inquiries lately about getting raw text content from PDFs, so I've reopened this pull request and merged it for more testing. Thanks for the contribution and sorry for the delay.

3 participants