--example option gets content,metadata,names,read, but cannot get 'text' #171

Open
mindgitrwx opened this issue Feb 6, 2023 · 8 comments

Comments

@mindgitrwx

Error: no example target name 'text'
[Screenshot: 2023-02-07 at 7:27:29 AM]

@Implocell

Since this was removed from the examples, does it mean text extraction is no longer possible? I'm looking for a PDF library to extract text from a PDF: not the whole content, just the text. Was that what the example did, and if so, is there still a way to do that? Btw thanks for sharing this library! 🚀

@YgorSouza

FWIW this is the commit that removed the text example: 520cd39.

Sadly, it does not explain why the example was removed.

I tried to just call the code that gets the operations (`page.contents.as_ref().unwrap().operations(&file)`) and debug print them, and the text is there. The example does not compile with the latest version of the crate because of some missing arguments or incorrect types in some functions, so I imagine that the API changed and the example was removed instead of being updated.

@s3bk
Contributor

s3bk commented Apr 30, 2023

@Implocell

Text extraction is possible, but has been moved out of this library.

https://github.com/pdf-rs/pdf_render/blob/master/render/examples/trace.rs

@StripedMonkey

While there's a level where the file structure of PDF makes text editing and modification difficult to do without "rendering" it, I think it's possibly a bit absurd that there's no attempt to collect the text on a page without implementing a disconcertingly large number of operations and transforms on text.

I have been struggling to figure out a (relatively) trivial method of obtaining something resembling the text within PDFs that can handle kerning/positioning of characters better than my naive "put a space between every TextDraw/TextDrawAdjusted" approach. Is this something that may be supported in this library eventually, or is that kind of thing being offloaded to pdf_render permanently?
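For illustration, here is a minimal sketch of one step beyond inserting a space between every text-drawing operation: looking at the numeric adjustments inside a single `TJ` array (what the thread calls TextDrawAdjusted) and emitting a space only where an adjustment opens a large gap. Everything here is an assumption made for the sketch: `AdjustItem` and `join_adjusted` are hypothetical stand-ins rather than the `pdf` crate's own types, the text is assumed to be already decoded to Unicode, the writing mode is assumed horizontal, and the threshold is an arbitrary value that in practice would depend on the font's space width. It does not handle positioning between separate operations, which is where much of the real difficulty lies.

```rust
/// Stand-in for one element of a PDF `TJ` array: either a run of
/// already-decoded text or a numeric position adjustment.
/// (Hypothetical type for this sketch; not the `pdf` crate's own type.)
#[derive(Debug, Clone, PartialEq)]
pub enum AdjustItem {
    Text(String),
    /// Adjustment in thousandths of a text-space unit. The PDF spec
    /// subtracts it from the current position, so in horizontal writing
    /// a negative value pushes the next glyph further away.
    Spacing(f32),
}

/// Concatenate the text of one `TJ` array, emitting a space only where an
/// adjustment opens a gap of at least `gap_threshold` (same units as above).
pub fn join_adjusted(items: &[AdjustItem], gap_threshold: f32) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            AdjustItem::Text(s) => out.push_str(s),
            AdjustItem::Spacing(adj) => {
                // Flip the sign so that a positive `gap` means "wider".
                let gap = -*adj;
                // Skip leading gaps and avoid doubling an existing space.
                if gap >= gap_threshold && !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

With a threshold of `200.0`, the items `"He"`, `20.0`, `"llo"`, `-300.0`, `"wor"`, `-15.0`, `"ld"` join to `Hello world`: the small adjustments are treated as kerning and only the `-300.0` gap becomes a space. Files that place every glyph individually, or that use fonts with unusual metrics, would defeat a fixed threshold like this one.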

@s3bk
Contributor

s3bk commented Jul 17, 2023

It is a hugely nontrivial task. pdf_tools is one attempt at it.
Some PDF files happen to use only ASCII and don't do silly things. There it works.

But once you are outside that, no chance.
Even pdf_render does not deal with trying to join words and lines, but the just-released pdf_text crate does.

@StripedMonkey

Ah! I missed pdf_text! Yeah, I figured that there are a lot of edge cases, but I am just hoping for something that can handle more than raw text objects with all the nonsense of laying that out.

@s3bk
Contributor

s3bk commented Jul 17, 2023

Everything I wrote is because I have hit a specific edge case in the wild.
They exist and will come to haunt you.

@s3bk
Contributor

s3bk commented Jul 17, 2023

If you only need to parse PDF files from one source, it is a different problem. Then you can make assumptions about the file.

For something general, you will find that for any assumption, someone wrote a piece of code that proves it wrong.
