Text search API #819

Closed
arturadib opened this issue Nov 18, 2011 · 3 comments

Comments

@arturadib
Contributor

@notmasteryet As I understand it, you'll be tackling text search soon. @hubgit and I have been talking about extracting text from PDFs, and it'd be nice if we had an API for that.

Does it make sense to build the search feature on top of a PDFDoc() API that extracts the text from each individual page?

@hubgit Feel free to chime in on your needs here.

@jviereck
Contributor

I'm wondering how this will look from an infrastructure point of view. We can either reuse the IR extracted by the PartialEvaluator and look for showSpacedText and showText commands, or add a new function, PartialEvaluator.getText(), that returns only a stream of the text found on one page.

Calling such a PartialEvaluator.getText() should be more performant, and since people have frequently asked how to extract only the text content from a PDF, it might be useful to have such a function.
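
To illustrate the first option, here is a minimal sketch of scanning an operator list for text-drawing commands. The `{ fn, args }` record shape, the `collectText` name, and the sample data are assumptions made for this example; the real IR produced by PartialEvaluator may be structured differently.

```javascript
// Hypothetical IR: a flat array of { fn, args } records.
// This shape is assumed for illustration, not taken from pdf.js.
function collectText(operatorList) {
  const chunks = [];
  for (const { fn, args } of operatorList) {
    if (fn === "showText") {
      // Assumed: args[0] is the string to draw.
      chunks.push(args[0]);
    } else if (fn === "showSpacedText") {
      // Assumed: args[0] mixes strings with numeric spacing adjustments,
      // so keep only the strings and join them.
      chunks.push(args[0].filter((item) => typeof item === "string").join(""));
    }
    // Every other command (fonts, paths, colors) is skipped.
  }
  return chunks;
}

const sampleIR = [
  { fn: "setFont", args: ["F1", 12] },
  { fn: "showText", args: ["Hello"] },
  { fn: "showSpacedText", args: [["Wo", -20, "rld"]] },
];

console.log(collectText(sampleIR));
```

A dedicated getText() would avoid building the non-text commands in the first place, which is where the expected performance gain comes from.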

@arturadib
Contributor Author

> Calling such a PartialEvaluator.getText() should be more performant

That sounds good. It'd be nice, though, if we could offer a friendlier wrapper in PDFDoc() or Page(), so that consumers of the API don't have to deal with an object from the guts of the code (PartialEvaluator).

How about page.getText(), since users already have to deal with the page object?
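
A rough sketch of what such a wrapper could look like follows. The `Page` constructor arguments, the stub evaluator, and the return type are all hypothetical and exist only to show the delegation; none of this is the actual pdf.js code.

```javascript
// Hypothetical stand-in for PartialEvaluator with the proposed getText().
const stubEvaluator = {
  getText(content) {
    // Assumed: content is already a list of text runs for one page.
    return content.join(" ");
  },
};

// Hypothetical Page wrapper: callers never touch the evaluator directly.
class Page {
  constructor(evaluator, content) {
    this.evaluator = evaluator;
    this.content = content;
  }

  getText() {
    return this.evaluator.getText(this.content);
  }
}

const page = new Page(stubEvaluator, ["Text", "search", "API"]);
console.log(page.getText());
```

The point of the design is that the evaluator can change or be replaced without affecting code that only calls page.getText().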

@yurydelendik
Contributor

We now have getTextContent (see https://github.com/mozilla/pdf.js/blob/master/src/api.js#L394). Closing as resolved.
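
For readers landing here, a minimal sketch of consuming getTextContent follows. It assumes the promise-based shape used by recent pdf.js releases, where the result has an `items` array whose entries carry a `str` field; the version current when this issue was closed may have differed. The `pageToString` helper and the fake page object are made up for the example so that it runs without the library.

```javascript
// Joins the text runs of one page into a single string.
// Assumes page.getTextContent() resolves to { items: [{ str, ... }] }.
async function pageToString(page) {
  const content = await page.getTextContent();
  return content.items
    .filter((item) => typeof item.str === "string")
    .map((item) => item.str)
    .join(" ");
}

// Fake page object standing in for a real pdf.js page (illustration only).
const fakePage = {
  getTextContent: async () => ({
    items: [{ str: "Text" }, { str: "search" }],
  }),
};

pageToString(fakePage).then((text) => console.log(text));
```

With a real document, the same helper would be passed the page object obtained from the library instead of `fakePage`.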
