Handling files processed with ocrmypdf #284
Comments
Hi! I don't think relying on grep is the way to go here, because whatever your modification is, it may end up in an object stream, and those are usually compressed. You could configure pdfcpu to turn this off, but I can't tell whether that is feasible for your scenario. What's the nature of the comment, i.e. the OCR info you would like to add? You could Maybe using PDF attachments is an option if you have prose-like comments.
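To illustrate the point about compression: PDF streams are commonly stored with the Flate (zlib/deflate) filter, so text inside them no longer appears as literal bytes in the file. The following is a minimal Python sketch of that effect only. It uses made-up sample text rather than a real PDF, and `raw_contains` is a hypothetical helper standing in for a plain byte search such as grep.

```python
import zlib

# Made-up OCR text; it stands in for whatever would be written into a PDF stream.
ocr_text = b"Invoice 2020-11-03 total 42.00 EUR"

# Flate-compress it, as PDF writers commonly do for stream data.
compressed = zlib.compress(ocr_text)


def raw_contains(data: bytes, needle: bytes) -> bool:
    """Mimic what a plain byte-level search sees: a literal substring match."""
    return needle in data


found_plain = raw_contains(ocr_text, b"Invoice")
found_compressed = raw_contains(compressed, b"Invoice")
found_after_inflate = raw_contains(zlib.decompress(compressed), b"Invoice")

print(found_plain, found_compressed, found_after_inflate)
```

The search on the compressed bytes is expected to miss the word, while the same search succeeds once the data is inflated again. That is why a tool has to understand the PDF structure before it can search the text.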
Thanks for your input!
Ah, OK. I guess that's one of the reasons why
The idea was to add the OCR text a second time as a comment. It's already in the PDF, but not in plaintext (probably compressed, given your explanation above).
That's somewhat what Maybe it's actually easier to look at it from the searching angle. Maybe a Spotlight CLI search allows for more search freedom. And then there is no need for But as a related question: Can
Yes, indeed. There would probably be little benefit in just using But PDF attachments sound interesting nevertheless :) I wasn't aware there is such a thing.
If you get me a sample, I can take a look at extracting the OCR text info.
It is my understanding that pdfgrep only searches the visible page content. It does not search metadata such as the info dict (keys, properties) or metainfo dicts. Of course you could hack in this comment, but then you would need to write the corresponding search tool.
Happy to provide one.
I am using ocrmypdf
It does seem to take into account the OCR information though.
Well, especially if the data were at the beginning of the file, grep should work OK-ish, at least for finding the files. But it remains a hack. I was inspired by this amazing talk, though :) https://www.youtube.com/watch?v=hdCs6bPM4is Searching the OS index from the command line seems to work surprisingly well. So maybe this hackish approach isn't needed. But extracting OCR information with
👍 A one-page sample with one line of OCR'ed text would be ideal for analysis.
I finally got around to creating two test cases:
It would be great if there was a way to extract the text sidecar files from the PDFs.
Looking at these, it turns out the OCR'ed text content is assembled into regular positioned text blocks A
Since adding the text as comments seems problematic, and since I found the Spotlight CLI, I think extracting the sidecar files would be the main focus for me at this stage. So if Thanks for looking into the files.
Yeah, low-level comments are not feasible in this use case: first, you would need to store the extracted text per page, I guess, and second, there is a limit of at most 256 characters per comment line, if I recall correctly. Yet there must be an easy way to access this info if you have a stack of files, each containing thousands of pages. Try attaching a text file containing the appended OCR'ed sidecars to the PDF. PDF/A compatibility may be an issue if that's a requirement.
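As a rough sketch of the "appended sidecars" idea, the snippet below concatenates several sidecar text files into one string that could be saved and then attached to the PDF. The function name `combine_sidecars`, the `===` section headers, and the file names are all assumptions for illustration. The attachment step itself is deliberately left out.

```python
import tempfile
from pathlib import Path


def combine_sidecars(sidecars: list[Path]) -> str:
    """Concatenate sidecar text files, labelling each section with its file name."""
    parts = []
    for path in sidecars:
        text = path.read_text(encoding="utf-8", errors="replace").strip()
        parts.append(f"=== {path.name} ===\n{text}\n")
    return "\n".join(parts)


# Tiny demo with made-up sidecar files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_text("first scanned document\n", encoding="utf-8")
    (folder / "b.txt").write_text("second scanned document\n", encoding="utf-8")
    combined = combine_sidecars(sorted(folder.glob("*.txt")))

print(combined)
```

Keeping a header line per source file means the combined text can still be split back into its parts later.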
Text per page is of course another issue. But each sidecar is also just a single txt file. They are very simplistic. For the extraction of the sidecar files, performance would not be an issue. It seems that, in the end, the OS metadata indexing is good enough for search, so search should be covered. Plus there is Now I just want to be able to get rid of the sidecars. If I could re-create that information from the PDF, that would be a win. I am not familiar with how the attachment would work. And PDF/A might indeed be another hurdle. Do you think it would be possible to add the text block extraction to
Text extraction is an open issue.
Do you mean #4? I am wondering whether the fact that it uses a glyphless font makes it easier.
Yes, this relates to #4, and no, it's just regular text extraction.
Does that mean it should be doable with the current API? So I could just try this myself?
I am repeating myself here: pdfcpu does not support text extraction - it's not part of the API and it is not part of the CLI. I hope this is clear now.
That there is no support in the CLI was clear. That it is not possible with the current code base, not so much.
I am scanning documents, and the scans also include OCR information.
The problem: searching them still isn't great. I either need to rely on the OS-level indexing, or use something like pdfgrep, which is very slow. The normal grep does not seem to work on the OCR data. That's why I also export the OCR information as a sidecar txt file, which means I can just use grep. That is of course not really great, as I am now dealing with 2 files.
So I had the idea: maybe I could add the OCR information to the PDF in a way that a normal grep could find the data anyway.
I looked at some file format information and, IIUC, I could maybe add the OCR information at the beginning of the PDF as a comment.
Is this similar to what the add keyword feature does? Or is there a way to add comments?
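For context, the sidecar workflow described above can be sketched in a few lines of Python. This assumes each PDF has a sidecar with the same base name (for example `scan1.pdf` next to `scan1.txt`). That naming convention, the function name `find_pdfs_by_sidecar`, and the demo files are assumptions, not something the tools prescribe.

```python
import tempfile
from pathlib import Path


def find_pdfs_by_sidecar(folder: Path, term: str) -> list[Path]:
    """Return PDFs whose same-named .txt sidecar contains `term` (case-insensitive)."""
    hits = []
    for sidecar in sorted(folder.glob("*.txt")):
        text = sidecar.read_text(encoding="utf-8", errors="replace")
        if term.lower() in text.lower():
            pdf = sidecar.with_suffix(".pdf")
            if pdf.exists():  # skip text files that have no matching PDF
                hits.append(pdf)
    return hits


# Tiny demo with made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "scan1.pdf").write_bytes(b"%PDF-1.7\n")
    (folder / "scan1.txt").write_text("Electricity bill March", encoding="utf-8")
    (folder / "scan2.pdf").write_bytes(b"%PDF-1.7\n")
    (folder / "scan2.txt").write_text("Rental contract", encoding="utf-8")
    matches = [p.name for p in find_pdfs_by_sidecar(folder, "bill")]

print(matches)
```

This shows the cost mentioned above: the search only works as long as every PDF keeps its second file next to it.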