
Handling files processed with ocrmypdf #284

Closed
tcurdt opened this issue Jan 22, 2021 · 17 comments

@tcurdt

tcurdt commented Jan 22, 2021

I am scanning documents which also includes OCR information.

The problem: searching them still isn't great. I either need to rely on OS-level indexing, or use something like pdfgrep, which is very slow. Normal grep does not seem to work on the OCR data. That's why I also export the OCR information as a sidecar txt file, which means I can just use grep. Of course that's not really great, as I am now dealing with two files.

So I had the idea: Maybe I could add the OCR information to the PDF in a way that a normal grep could find the data anyway.

I looked at some file format information, and IIUC I could maybe add the OCR information at the beginning of the PDF as a comment.

Is this similar to what the add keyword feature does? Or is there a way to add comments?
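[Editor's note: a minimal sketch of the idea above, assuming the comment survives as plain bytes and is never moved into a compressed object stream by a later rewrite. The byte string is a stand-in, not a valid PDF.]

```python
# A % comment placed right after the PDF header stays as plain bytes,
# so an ordinary substring search (what `grep -a` effectively does)
# can see it. Minimal stand-in document:
pdf = b"%PDF-1.7\n% OCR: Invoice 2021 Acme GmbH\n...rest of the file...\n%%EOF\n"

print(b"Invoice 2021" in pdf)  # -> True
```

Note that grep treats PDFs as binary, so in practice you'd need `grep -a` (or `grep -l` to just list matching files).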

@hhrutter
Collaborator

hhrutter commented Jan 23, 2021

Hi!

I don't think relying on grep is the way to go here, because whatever your modification is, it may end up in an object stream, and those are usually compressed. You could configure pdfcpu to turn that off, but whether that is feasible for your scenario I can't tell.

What's the nature of the comment, i.e. the OCR info you would like to add?
Do you have one or several keywords, properties, or a paragraph of arbitrary text?

You could use pdfcpu keyword add and then grep the output of pdfcpu keyword list,
or do something similar using the pdfcpu prop command.
That way you don't need to store two files.
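[Editor's note: a sketch of that workflow. The exact pdfcpu subcommand spelling may differ by version (check `pdfcpu help`), so the pdfcpu lines are left as comments; only the grep stage is exercised, on a stand-in keyword list.]

```shell
# Hypothetical workflow -- verify subcommand names against your pdfcpu version:
#   pdfcpu keywords add scan.pdf invoice 2021
#   pdfcpu keywords list scan.pdf | grep -q invoice && echo scan.pdf
# The grep stage, simulated with a fixed keyword list:
printf 'invoice\n2021\n' | grep -q invoice && echo match
```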

Maybe using PDF attachments is an option if you have prose-like comments.
In that case I would need to take a look at whether there is a way to make that directly grepable,
because I guess extracting the attachment to a temp file might not be the best solution resource-wise, but who knows..

@tcurdt
Author

tcurdt commented Jan 23, 2021

Thanks for your input!

I don't think relying on grep is the way to go here, because whatever your modification is, it may end up in an object stream, and those are usually compressed.

Ah, OK. I guess that's one of the reasons pdfgrep is usually the intended tool. From my shallow investigation it sounded like I could have a comment right after the magic header. Since these are mostly scanned documents, I assume the most useful compression already happens at the image level, but I am not sure it's worth mucking with the compression.

What's the nature of the comment, i.e. the OCR info you would like to add?
Do you have one or several keywords, properties, or a paragraph of arbitrary text?

The idea was to add the OCR text a 2nd time as comment. It's already in the PDF but not in plaintext (probably compressed given your suggestions above).

You could use pdfcpu keyword add and then grep the output of pdfcpu keyword list

That's somewhat what pdfgrep does, I guess. But it's quite slow, so pdfgrep'ing over a couple of thousand PDFs isn't quite the same as straight grep'ing. So the only option for quick search is to rely on OS-level indexing, e.g. Spotlight on macOS. (Not sure what the Windows/Linux equivalent is called.)

Maybe it's actually easier to look at it from the searching angle. Maybe a Spotlight CLI search allows for more search freedom, and then there is no need for grep.

But as a related question: Can pdfcpu extract the OCR information somehow?

Maybe using PDF attachments is an option if you have prose-like comments.
In that case I would need to take a look at whether there is a way to make that directly grepable,
because I guess extracting the attachment to a temp file might not be the best solution resource-wise, but who knows..

Yes, indeed. There would probably be little benefit over just using pdfgrep if it had to be extracted like that.

But PDF attachments sound interesting nevertheless :) I wasn't aware there was such a thing.

@hhrutter
Collaborator

If you get me a sample I can take a look at extracting the OCR text info.
I don't see why this shouldn't be possible, although I would need to know more about your use case and your tool stack:
how you're doing your OCR and what tools you are using..

@hhrutter
Collaborator

It is my understanding that pdfgrep only searches the visible page content.
To do that it has to decode all content streams of a page and extract the text to be filtered on the fly.
That's a lot of work, so it makes sense that this takes time.

It does not search metadata like the info dict (keywords, properties) or metadata dicts.
It also will not catch low-level PDF % comments, since it does not deal with the raw data stream.

Of course you could hack in this comment, but then you'd need to write the corresponding search tool.
grep will choke on binary files like modern PDFs.

@tcurdt
Author

tcurdt commented Jan 23, 2021

If you get me a sample I can take a look at extracting the OCR text info.

Happy to provide one.

I don't see why this shouldn't be possible, although I would need to know more about your use case and your tool stack:
how you're doing your OCR and what tools you are using..

I am using ocrmypdf

It is my understanding that pdfgrep only searches the visible page content.

It does seem to take into account the OCR information though.

Of course you could hack in this comment, but then you'd need to write the corresponding search tool.
grep will choke on binary files like modern PDFs.

Well, especially if the data were at the beginning of the file, grep should work OK-ish, at least for finding the files. But it stays a hack. I was inspired by this amazing talk :) https://www.youtube.com/watch?v=hdCs6bPM4is

Searching the OS index from the command line seems to work surprisingly well, so maybe this hack'ish approach isn't needed. But extracting the OCR information with pdfcpu would still be great. Then I could safely delete the sidecar files.
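[Editor's note: for reference, the macOS Spotlight index can indeed be queried from a shell via `mdfind`. The `mdfind` line is macOS-only and left as a comment; the runnable part is a portable fallback that greps sidecar-style text files.]

```shell
# macOS only, not executed here -- queries the Spotlight index directly:
#   mdfind -onlyin "$HOME/Scans" invoice
# Portable fallback over sidecar .txt files:
tmp=$(mktemp -d)
printf 'invoice 2021\n' > "$tmp/a.txt"
printf 'receipt\n'      > "$tmp/b.txt"
grep -rl invoice "$tmp"   # prints only the path of a.txt
rm -r "$tmp"
```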

@hhrutter
Collaborator

👍 A one page sample with one line of OCR'ed text would be ideal for analysis.

@hhrutter hhrutter changed the title adding comments Handling files processed with ocrmypdf Jan 25, 2021
@tcurdt
Author

tcurdt commented Jan 28, 2021

I finally got around to creating two test cases:

  1. a single line ocr-single.zip
  2. a single line plus paragraph ocr-multi.zip
% ocrmypdf --version
11.6.0

% ocrmypdf -l deu+eng --force-ocr --sidecar out.txt Scan-2021012821.55.28.pdf out.pdf
Scanning contents: 100%|██████████████████████| 1/1 [00:00<00:00, 64.77page/s]
OCR: 100%|████████████████████████████████████| 1.0/1.0 [00:04<00:00,  4.82s/page]
Postprocessing...
PDF/A conversion: 100%|███████████████████████| 1/1 [00:00<00:00,  3.99page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

% ocrmypdf -l deu+eng --force-ocr --sidecar out.txt Scan-2021012821.59.41.pdf out.pdf 
Scanning contents: 100%|██████████████████████| 1/1 [00:00<00:00, 84.41page/s]
OCR: 100%|████████████████████████████████████| 1.0/1.0 [00:04<00:00,  4.39s/page]
Postprocessing...
PDF/A conversion: 100%|███████████████████████| 1/1 [00:00<00:00,  3.36page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

It would be great if there was a way to extract the text sidecar files from the pdfs.

@hhrutter
Collaborator

Looking at these, it turns out the OCR'ed text content is assembled into regular positioned text blocks
using a special glyphless font containing placeholders. That makes sense, since the text gets rendered on top
of the scanned source image.

A pdfcpu grep command would rely on text extraction, which pdfcpu does not offer at the moment.
Even with text extraction at our disposal, I think you would end up with performance similar to pdfgrep, since
text extraction involves quite a bit of string processing.
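[Editor's note: for what it's worth, poppler's pdftotext can recover ocrmypdf's invisible text layer, which would re-create a sidecar-style file. The pdftotext lines assume poppler-utils and are left as comments; only the search stage runs, on stand-in extracted text.]

```shell
# Assumes poppler-utils; not executed here:
#   pdftotext -layout out.pdf out.txt
#   grep -l invoice *.txt
# The search stage, on stand-in extracted text:
printf 'Rechnung Nr. 42\ninvoice total 99\n' | grep -c invoice
```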

@tcurdt
Author

tcurdt commented Jan 29, 2021

Since adding the text as comments seems problematic, and since I found the Spotlight CLI, I think extracting the sidecar files would be the main focus for me at this stage.

So if pdfcpu allowed extracting the text of those glyphless font blocks (even better, along with the position), that would be great. Given the text is not just glyphs, this should be possible, right?

Thanks for looking into the files.

@hhrutter
Collaborator

Yeah, low-level comments are not feasible in this use case: firstly, you'd need to store the extracted text per page I guess, and secondly, there is a limit of max 256 characters per comment line, if I recall correctly.

Yet there must be an easy way to access this info if you have a stack of files, each of them containing thousands of pages.
Like I said, text extraction isn't easy, and it would definitely amount to a major performance hit in your case.

Try attaching a text file containing the appended OCR'ed sidecars to the PDF.
Extracting an attachment is a straightforward operation, even more so because in your case the attachment is simple text.
Yes, I think that's what I would try out.
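[Editor's note: that suggestion could look roughly like this. The pdfcpu attachment subcommand names are from memory and may differ by version, so they are left as comments; the runnable part only shows building the combined sidecar that would be attached.]

```shell
# Hypothetical attach/extract round trip -- check `pdfcpu help` for the
# exact subcommand names on your version:
#   pdfcpu attachments add out.pdf ocr.txt
#   pdfcpu attachments extract out.pdf .
# Building the combined per-page sidecar:
tmp=$(mktemp -d)
printf 'first page text\n'  > "$tmp/page1.txt"
printf 'second page text\n' > "$tmp/page2.txt"
cat "$tmp"/page*.txt > "$tmp/ocr.txt"
grep -c 'page text' "$tmp/ocr.txt"   # both pages land in one file
rm -r "$tmp"
```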

PDF/A compatibility may be an issue if that's a requirement.
I am not familiar with PDF/A; it may or may not allow attachments.

@tcurdt
Author

tcurdt commented Jan 30, 2021

Text per page is of course another issue. But the sidecar files are also just a single txt file; they are very simplistic. For extracting the sidecar files, performance would not be an issue. It seems, in the end, the metadata indexing of the OS is good enough for the search, so search should be covered well enough. Plus there is pdfgrep in case it really isn't.

Now I just want to be able to get rid of the sidecars. If I could re-create that information from the PDF, that would be a win.

I am not familiar with how the attachment would work. And PDF/A might indeed be another hurdle.

Do you think it would be possible to add the text blocks extraction to pdfcpu?
Would the API allow for that? Or would that be a major undertaking?

@hhrutter
Collaborator

Text extraction is an open issue.

@tcurdt
Author

tcurdt commented Jan 30, 2021

Do you refer to #4 ?

I am wondering if the fact that it is using a glyphless font makes it easier.

@hhrutter
Collaborator

Yes, this relates to #4, and no, it's just regular text extraction.
The actual glyphs are irrelevant for this.
All you care about are character codes, their positioning within text blocks, and any white space within.
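[Editor's note: a toy illustration of that point — the glyphs don't matter, only the string operands of the text-show operators do. This is a naive scan over an already decompressed, ASCII-only content stream; real extraction must also handle TJ arrays, hex strings, escape sequences, and font encodings/CMaps, none of which this sketch attempts.]

```python
import re

# A decompressed content stream as it might look. With ocrmypdf the font
# is glyphless, but the (...) Tj operands still carry the character codes.
stream = b"""
BT
/f-0-0 10 Tf
72 720 Td
(Hello) Tj
0 -12 Td
(world) Tj
ET
"""

# Naive: literal-string Tj operands only.
texts = re.findall(rb"\(([^)]*)\)\s*Tj", stream)
print(b" ".join(texts).decode("latin-1"))  # -> Hello world
```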

@tcurdt
Author

tcurdt commented Jan 30, 2021

Does that mean it should be doable with the current API? So I could just try this myself?
Or would this need changes?

@hhrutter
Collaborator

I am repeating myself here:

pdfcpu does not support text extraction - it's not part of the API and it is not part of the CLI.

I hope this is clear now.

@tcurdt
Author

tcurdt commented Jan 30, 2021

That there is no support in the CLI was clear; that it's not possible with the current code base, not so much.
I was looking through https://github.com/pdfcpu/pdfcpu/tree/master/pkg/pdfcpu
But fair enough.

@tcurdt tcurdt closed this as completed Jan 30, 2021