--example option gets content,metadata,names,read, but cannot get 'text' #171

mindgitrwx · 2023-02-06T22:27:53Z

Error: no example target name 'text'

Implocell · 2023-03-11T07:28:07Z

Since this is removed from the examples, does it mean text extraction is no longer possible? I'm looking to find a pdf library to extract text from a pdf, not the whole content just the text. Was that what the example did, and if so is there a way to still do that? Btw thanks for sharing this library! 🚀

YgorSouza · 2023-04-28T04:04:46Z

FWIW this is the commit that removed the text example: 520cd39.

Sadly, it does not explain why the example was removed.

I tried to just call the code that gets the operations (page.contents.as_ref().unwrap().operations(&file);) and debug print them, and the text is there. The example does not compile with the latest version of the crate because of some missing arguments or incorrect types in some functions, so I imagine that the API changed and the example was removed instead of being updated.

s3bk · 2023-04-30T09:26:44Z

@Implocell

Text extraction is possible, but has been moved out of this library.

https://github.com/pdf-rs/pdf_render/blob/master/render/examples/trace.rs

StripedMonkey · 2023-07-17T04:11:35Z

While there's a level where the file structure of PDF makes text editing and modification difficult to do without "rendering" it, I think it's possibly a bit absurd that there's no attempt to collect the text on a page without implementing a disconcertingly large number of operations and transforms on text.

I have been struggling to figure out a (relatively) trivial method of obtaining something resembling the text within PDFs, one that can handle kerning/positioning of characters better than my naive "put a space between every TextDraw/TextDrawAdjusted" approach. Is this something that may be supported in this library eventually, or is that kind of thing being offloaded to pdf_render permanently?
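For illustration, here is a minimal sketch of a heuristic that is one step up from putting a space between every draw operation: look at the numeric adjustments inside a single `TJ` array and insert a space only where the gap is large. `TjItem`, `join_tj`, and the threshold are hypothetical names and values made up for this sketch, not part of the pdf crate's API. It also assumes the strings are already decoded to Unicode and that writing is horizontal, and it ignores font metrics, `Td`/`Tm` moves, and everything else that makes the general problem hard.

```rust
/// Hypothetical stand-in for one element of a PDF `TJ` array:
/// either a run of already-decoded text or a numeric position adjustment.
/// (Not the pdf crate's own type; defined here only for illustration.)
#[derive(Debug, Clone, PartialEq)]
pub enum TjItem {
    Text(String),
    /// Adjustment in thousandths of a text-space unit. The PDF spec
    /// subtracts it from the current position, so in horizontal writing
    /// a negative value widens the gap before the next glyph.
    Adjust(f32),
}

/// Join the items of one `TJ` array, inserting a space only where an
/// adjustment opens a gap wider than `gap_threshold` (same 1/1000 units).
/// The threshold is an assumption that has to be tuned per document.
pub fn join_tj(items: &[TjItem], gap_threshold: f32) -> String {
    let mut out = String::new();
    for item in items {
        match item {
            TjItem::Text(s) => out.push_str(s),
            TjItem::Adjust(n) => {
                // Negate: a negative adjustment means a wider gap.
                let is_word_gap = -*n > gap_threshold;
                // Avoid leading or doubled spaces.
                if is_word_gap && !out.is_empty() && !out.ends_with(' ') {
                    out.push(' ');
                }
            }
        }
    }
    out
}
```

With a threshold of `200.0`, small kerning values such as `-15` or `40` are treated as part of a word, while something like `-300` is treated as a word break. This only works for files that encode word gaps this way; files that draw real space glyphs, or position every word with a separate move operator, need different handling.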

s3bk · 2023-07-17T05:15:23Z

It is a hugely nontrivial task. pdf_tools is one attempt at it.
Some PDF files happen to use only ASCII and don't do silly things. There it works.

But once you are outside that, no chance.
Even pdf_render does not deal with trying to join words and lines, but the just-released pdf_text crate does.

StripedMonkey · 2023-07-17T06:28:22Z

Ah! I missed pdf_text! Yeah, I figured that there are a lot of edge cases, but I am just hoping for something that can handle more than raw text objects with all the nonsense of laying that out.

s3bk · 2023-07-17T06:31:17Z

Everything I wrote is because I have hit a specific edge case in the wild.
They exist and will come to haunt you.

s3bk · 2023-07-17T06:35:58Z

If you only need to parse PDF files from one source, it is a different problem. Then you can make assumptions about the file.

For something general, you will find that for any assumption, someone wrote a piece of code that proves it wrong.
