
Handling files processed with ocrmypdf #284

Closed
tcurdt opened this issue Jan 22, 2021 · 17 comments

@tcurdt

tcurdt commented Jan 22, 2021

I am scanning documents which also includes OCR information.

The problem: searching them still isn't great. I either need to rely on OS-level indexing, or use something like pdfgrep, which is very slow. Normal grep does not seem to work on the OCR data. That's why I also export the OCR information as a sidecar txt file, which means I can just use grep. Of course that's not really great, as I am now dealing with two files.

So I had the idea: Maybe I could add the OCR information to the PDF in a way that a normal grep could find the data anyway.

I looked at some file format information, and IIUC I could maybe add the OCR information at the beginning of the PDF as a comment.

Is this similar to what the add keyword feature does? Or is there a way to add comments?
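[Editor's note: a minimal sketch of the idea above, assuming the comment survives as plain bytes and is never moved into a compressed object stream by a later rewrite. The byte string is a stand-in, not a valid PDF.]

```python
# A % comment placed right after the PDF header stays as plain bytes,
# so an ordinary substring search (what `grep -a` effectively does)
# can see it. Minimal stand-in document:
pdf = b"%PDF-1.7\n% OCR: Invoice 2021 Acme GmbH\n...rest of the file...\n%%EOF\n"

print(b"Invoice 2021" in pdf)  # -> True
```

Note that grep treats PDFs as binary, so in practice you'd need `grep -a` (or `grep -l` to just list matching files).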

@hhrutter
Collaborator

hhrutter commented Jan 23, 2021

Hi!

I don't think relying on grep is the way to go here, because whatever your modification is, it may end up in an object stream, and those are usually compressed. You could configure pdfcpu to turn that off, but whether that is feasible for your scenario I can't tell.

What's the nature of the comment, i.e. the OCR info you would like to add?
Do you have one or several keywords, properties, or a paragraph of arbitrary text?

You could use pdfcpu keyword add and then grep the output of pdfcpu keyword list,
or do something similar using the pdfcpu prop command.
That way you don't need to store two files.
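[Editor's note: a sketch of that workflow. The exact pdfcpu subcommand spelling may differ by version (check `pdfcpu help`), so the pdfcpu lines are left as comments; only the grep stage is exercised, on a stand-in keyword list.]

```shell
# Hypothetical workflow -- verify subcommand names against your pdfcpu version:
#   pdfcpu keywords add scan.pdf invoice 2021
#   pdfcpu keywords list scan.pdf | grep -q invoice && echo scan.pdf
# The grep stage, simulated with a fixed keyword list:
printf 'invoice\n2021\n' | grep -q invoice && echo match
```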

Maybe using PDF attachments is an option if you have prose-like comments.
In that case I would need to take a look at whether there is a way to make that directly grepable,
because I guess extracting the attachment to a temp file might not be the best solution resource-wise, but who knows..

@tcurdt
Author

tcurdt commented Jan 23, 2021

Thanks for your input!

I don't think relying on grep is the way to go here, because whatever your modification is, it may end up in an object stream, and those are usually compressed.

Ah, OK. I guess that's one of the reasons pdfgrep is usually the intended tool. From my shallow investigation it sounded like I could have a comment right after the magic header. Since these are mostly scanned documents, I assume the most useful compression already happens at the image level, but I am not sure it's worth mucking with the compression.

What's the nature of the comment, i.e. the OCR info you would like to add?
Do you have one or several keywords, properties, or a paragraph of arbitrary text?

The idea was to add the OCR text a 2nd time as comment. It's already in the PDF but not in plaintext (probably compressed given your suggestions above).

You could use pdfcpu keyword add and then grep the output of pdfcpu keyword list

That's somewhat what pdfgrep does, I guess. But it's quite slow, so pdfgrep'ing over a couple of thousand PDFs isn't quite the same as straight grep'ing. So the only option for quick search is to rely on OS-level indexing, e.g. Spotlight on macOS. (Not sure what the Windows/Linux equivalent is called.)

Maybe it's actually easier to look at it from the searching angle. Maybe a Spotlight CLI search allows for more search freedom, and then there is no need for grep.

But as a related question: Can pdfcpu extract the OCR information somehow?

Maybe using PDF attachments is an option if you have prose-like comments.
In that case I would need to take a look at whether there is a way to make that directly grepable,
because I guess extracting the attachment to a temp file might not be the best solution resource-wise, but who knows..

Yes, indeed. There would probably be little benefit over just using pdfgrep if it had to be extracted like that.

But PDF attachments sound interesting nevertheless :) I wasn't aware there was such a thing.

@hhrutter
Collaborator

If you get me a sample I can take a look at extracting the OCR text info.
I don't see why this shouldn't be possible, although I would need to know more about your use case and your tool stack:
how you're doing your OCR and what tools you are using..

@hhrutter
Collaborator

It is my understanding that pdfgrep only searches the visible page content.
To do that it has to decode all content streams of a page and extract the text to be filtered on the fly.
That's a lot of work, so it makes sense that this takes time.

It does not search metadata like the info dict (keywords, properties) or metadata dicts.
It also will not catch low-level PDF % comments, since it does not deal with the raw data stream.

Of course you could hack in this comment, but then you'd need to write the corresponding search tool.
grep will choke on binary files like modern PDFs.

@tcurdt
Author

tcurdt commented Jan 23, 2021

If you get me a sample I can take a look at extracting the OCR text info.

Happy to provide one.

I don't see why this shouldn't be possible, although I would need to know more about your use case and your tool stack:
how you're doing your OCR and what tools you are using..

I am using ocrmypdf

It is my understanding that pdfgrep only searches the visible page content.

It does seem to take into account the OCR information though.

Of course you could hack in this comment, but then you'd need to write the corresponding search tool.
grep will choke on binary files like modern PDFs.

Well, especially if the data were at the beginning of the file, grep should work OK-ish, at least for finding the files. But it stays a hack. I was inspired by this amazing talk :) https://www.youtube.com/watch?v=hdCs6bPM4is

Searching the OS index from the command line seems to work surprisingly well, so maybe this hack'ish approach isn't needed. But extracting the OCR information with pdfcpu would still be great. Then I could safely delete the sidecar files.
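[Editor's note: for reference, the macOS Spotlight index can indeed be queried from a shell via `mdfind`. The `mdfind` line is macOS-only and left as a comment; the runnable part is a portable fallback that greps sidecar-style text files.]

```shell
# macOS only, not executed here -- queries the Spotlight index directly:
#   mdfind -onlyin "$HOME/Scans" invoice
# Portable fallback over sidecar .txt files:
tmp=$(mktemp -d)
printf 'invoice 2021\n' > "$tmp/a.txt"
printf 'receipt\n'      > "$tmp/b.txt"
grep -rl invoice "$tmp"   # prints only the path of a.txt
rm -r "$tmp"
```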

@hhrutter
Collaborator

👍 A one page sample with one line of OCR'ed text would be ideal for analysis.

@hhrutter hhrutter changed the title adding comments Handling files processed with ocrmypdf Jan 25, 2021
@tcurdt
Author

tcurdt commented Jan 28, 2021

I finally got around to creating two test cases:

  1. a single line ocr-single.zip
  2. a single line plus paragraph ocr-multi.zip
% ocrmypdf --version
11.6.0

% ocrmypdf -l deu+eng --force-ocr --sidecar out.txt Scan-2021012821.55.28.pdf out.pdf
Scanning contents: 100%|██████████████████████| 1/1 [00:00<00:00, 64.77page/s]
OCR: 100%|████████████████████████████████████| 1.0/1.0 [00:04<00:00,  4.82s/page]
Postprocessing...
PDF/A conversion: 100%|███████████████████████| 1/1 [00:00<00:00,  3.99page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

% ocrmypdf -l deu+eng --force-ocr --sidecar out.txt Scan-2021012821.59.41.pdf out.pdf 
Scanning contents: 100%|██████████████████████| 1/1 [00:00<00:00, 84.41page/s]
OCR: 100%|████████████████████████████████████| 1.0/1.0 [00:04<00:00,  4.39s/page]
Postprocessing...
PDF/A conversion: 100%|███████████████████████| 1/1 [00:00<00:00,  3.36page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

It would be great if there was a way to extract the text sidecar files from the pdfs.

@hhrutter
Collaborator

Looking at these, it turns out the OCR'ed text content is assembled into regular positioned text blocks
using a special glyphless font containing placeholders. That makes sense, since the text gets rendered on top
of the scanned source image.

A pdfcpu grep command would rely on text extraction, which pdfcpu does not offer at the moment.
Even with text extraction at our disposal, I think you would end up with performance similar to pdfgrep, since
text extraction involves quite a bit of string processing.
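[Editor's note: for what it's worth, poppler's pdftotext can recover ocrmypdf's invisible text layer, which would re-create a sidecar-style file. The pdftotext lines assume poppler-utils and are left as comments; only the search stage runs, on stand-in extracted text.]

```shell
# Assumes poppler-utils; not executed here:
#   pdftotext -layout out.pdf out.txt
#   grep -l invoice *.txt
# The search stage, on stand-in extracted text:
printf 'Rechnung Nr. 42\ninvoice total 99\n' | grep -c invoice
```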

@tcurdt
Author

tcurdt commented Jan 29, 2021

Since adding the text as comments seems problematic, and since I found the Spotlight CLI, I think extracting the sidecar files would be the main focus for me at this stage.

So if pdfcpu allowed extracting the text of those glyphless font blocks (even better, along with the position), that would be great. Given the text is not just glyphs, this should be possible, right?

Thanks for looking into the files.

@hhrutter
Collaborator

Yeah, low-level comments are not feasible in this use case: firstly, you'd need to store the extracted text per page I guess, and secondly, there is a limit of max 256 characters per comment line, if I recall correctly.

Yet there must be an easy way to access this info if you have a stack of files, each of them containing thousands of pages.
Like I said, text extraction isn't easy, and it would definitely amount to a major performance hit in your case.

Try attaching a text file containing the appended OCR'ed sidecars to the PDF.
Extracting an attachment is a straightforward operation, even more so because in your case the attachment is simple text.
Yes, I think that's what I would try out.
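[Editor's note: that suggestion could look roughly like this. The pdfcpu attachment subcommand names are from memory and may differ by version, so they are left as comments; the runnable part only shows building the combined sidecar that would be attached.]

```shell
# Hypothetical attach/extract round trip -- check `pdfcpu help` for the
# exact subcommand names on your version:
#   pdfcpu attachments add out.pdf ocr.txt
#   pdfcpu attachments extract out.pdf .
# Building the combined per-page sidecar:
tmp=$(mktemp -d)
printf 'first page text\n'  > "$tmp/page1.txt"
printf 'second page text\n' > "$tmp/page2.txt"
cat "$tmp"/page*.txt > "$tmp/ocr.txt"
grep -c 'page text' "$tmp/ocr.txt"   # both pages land in one file
rm -r "$tmp"
```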

PDF/A compatibility may be an issue if that's a requirement.
I am not familiar with PDF/A; it may or may not allow attachments.

@tcurdt
Author

tcurdt commented Jan 30, 2021

Text per page is of course another issue. But the sidecar files are also just a single txt file; they are very simplistic. For extracting the sidecar files, performance would not be an issue. It seems, in the end, the metadata indexing of the OS is good enough for the search, so search should be covered well enough. Plus there is pdfgrep in case it really isn't.

Now I just want to be able to get rid of the sidecars. If I could re-create that information from the PDF, that would be a win.

I am not familiar with how the attachment would work. And PDF/A might indeed be another hurdle.

Do you think it would be possible to add the text blocks extraction to pdfcpu?
Would the API allow for that? Or would that be a major undertaking?

@hhrutter
Collaborator

Text extraction is an open issue.

@tcurdt
Author

tcurdt commented Jan 30, 2021

Do you refer to #4 ?

I am wondering if the fact that it is using a glyphless font makes it easier.

@hhrutter
Collaborator

Yes, this relates to #4, and no, it's just regular text extraction.
The actual glyphs are irrelevant for this.
All you care about are character codes, their positioning within text blocks, and any white space within.
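[Editor's note: a toy illustration of that point — the glyphs don't matter, only the string operands of the text-show operators do. This is a naive scan over an already decompressed, ASCII-only content stream; real extraction must also handle TJ arrays, hex strings, escape sequences, and font encodings/CMaps, none of which this sketch attempts.]

```python
import re

# A decompressed content stream as it might look. With ocrmypdf the font
# is glyphless, but the (...) Tj operands still carry the character codes.
stream = b"""
BT
/f-0-0 10 Tf
72 720 Td
(Hello) Tj
0 -12 Td
(world) Tj
ET
"""

# Naive: literal-string Tj operands only.
texts = re.findall(rb"\(([^)]*)\)\s*Tj", stream)
print(b" ".join(texts).decode("latin-1"))  # -> Hello world
```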

@tcurdt
Author

tcurdt commented Jan 30, 2021

Does that mean it should be doable with the current API? So I could just try this myself?
Or would this need changes?

@hhrutter
Collaborator

I am repeating myself here:

pdfcpu does not support text extraction - it's not part of the API and it is not part of the CLI.

I hope this is clear now.

@tcurdt
Author

tcurdt commented Jan 30, 2021

That there is no support in the CLI was clear; that it's not possible with the current code base, not so much.
I was looking through https://github.com/pdfcpu/pdfcpu/tree/master/pkg/pdfcpu
But fair enough.

@tcurdt tcurdt closed this as completed Jan 30, 2021